Opened 20 months ago

Last modified 7 weeks ago

#13624 new bug

loadObj() does not respect alignment

Reported by: tmcdonell Owned by:
Priority: high Milestone: 8.8.1
Component: Runtime System (Linker) Version: 8.0.1
Keywords: linker, newcomer Cc: angerman, George
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking: #8949
Related Tickets: Differential Rev(s):
Wiki Page:

Description

This is perhaps known, but I'll write it down here in case somebody else runs into this problem as well.

Since loadObj() just mmap()s the entire object file and decodes it in place, it does not respect the alignment requirements specified in the section headers. This is problematic for instructions which require alignment, e.g. SSE, AVX.

The attached map.ll program is map (+1) over an array of floating point numbers. In particular, the core loop is 8-way SIMD vectorised x 4-way unrolled, for 32-elements per loop iteration. A tail loop handles any remainder one-at-a-time.

You can compile it using llc -filetype=obj -mcpu=native map.ll. For a CPU with AVX instructions (sandy bridge or later) you should get the following:

$ objdump -d map.o
Disassembly of section .text:

0000000000000000 <map>:
   0:	49 89 f3             	mov    %rsi,%r11
   3:	49 29 fb             	sub    %rdi,%r11
   6:	0f 8e f9 00 00 00    	jle    105 <map+0x105>
   c:	49 83 fb 20          	cmp    $0x20,%r11
  10:	0f 82 bd 00 00 00    	jb     d3 <map+0xd3>
  16:	4d 89 da             	mov    %r11,%r10
  19:	49 83 e2 e0          	and    $0xffffffffffffffe0,%r10
  1d:	4d 89 d9             	mov    %r11,%r9
  20:	49 83 e1 e0          	and    $0xffffffffffffffe0,%r9
  24:	0f 84 a9 00 00 00    	je     d3 <map+0xd3>
  2a:	49 01 fa             	add    %rdi,%r10
  2d:	48 8d 44 ba 60       	lea    0x60(%rdx,%rdi,4),%rax
  32:	49 8d 7c b8 60       	lea    0x60(%r8,%rdi,4),%rdi
  37:	c5 fc 28 05 00 00 00 	vmovaps 0x0(%rip),%ymm0        # 3f <map+0x3f>
  3e:	00
  3f:	4c 89 c9             	mov    %r9,%rcx
  42:	66 66 66 66 66 2e 0f 	data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  49:	1f 84 00 00 00 00 00
  50:	c5 f8 10 4f a0       	vmovups -0x60(%rdi),%xmm1
  55:	c5 f8 10 57 c0       	vmovups -0x40(%rdi),%xmm2
  5a:	c5 f8 10 5f e0       	vmovups -0x20(%rdi),%xmm3
  5f:	c5 f8 10 27          	vmovups (%rdi),%xmm4
  63:	c4 e3 75 18 4f b0 01 	vinsertf128 $0x1,-0x50(%rdi),%ymm1,%ymm1
  6a:	c4 e3 6d 18 57 d0 01 	vinsertf128 $0x1,-0x30(%rdi),%ymm2,%ymm2
  71:	c4 e3 65 18 5f f0 01 	vinsertf128 $0x1,-0x10(%rdi),%ymm3,%ymm3
  78:	c4 e3 5d 18 67 10 01 	vinsertf128 $0x1,0x10(%rdi),%ymm4,%ymm4
  7f:	c5 f4 58 c8          	vaddps %ymm0,%ymm1,%ymm1
  83:	c5 ec 58 d0          	vaddps %ymm0,%ymm2,%ymm2
  87:	c5 e4 58 d8          	vaddps %ymm0,%ymm3,%ymm3
  8b:	c5 dc 58 e0          	vaddps %ymm0,%ymm4,%ymm4
  8f:	c4 e3 7d 19 48 b0 01 	vextractf128 $0x1,%ymm1,-0x50(%rax)
  96:	c5 f8 11 48 a0       	vmovups %xmm1,-0x60(%rax)
  9b:	c4 e3 7d 19 50 d0 01 	vextractf128 $0x1,%ymm2,-0x30(%rax)
  a2:	c5 f8 11 50 c0       	vmovups %xmm2,-0x40(%rax)
  a7:	c4 e3 7d 19 58 f0 01 	vextractf128 $0x1,%ymm3,-0x10(%rax)
  ae:	c5 f8 11 58 e0       	vmovups %xmm3,-0x20(%rax)
  b3:	c4 e3 7d 19 60 10 01 	vextractf128 $0x1,%ymm4,0x10(%rax)
  ba:	c5 f8 11 20          	vmovups %xmm4,(%rax)
  be:	48 83 e8 80          	sub    $0xffffffffffffff80,%rax
  c2:	48 83 ef 80          	sub    $0xffffffffffffff80,%rdi
  c6:	48 83 c1 e0          	add    $0xffffffffffffffe0,%rcx
  ca:	75 84                	jne    50 <map+0x50>
  cc:	4d 39 cb             	cmp    %r9,%r11
  cf:	75 05                	jne    d6 <map+0xd6>
  d1:	eb 32                	jmp    105 <map+0x105>
  d3:	49 89 fa             	mov    %rdi,%r10
  d6:	4c 29 d6             	sub    %r10,%rsi
  d9:	4a 8d 04 92          	lea    (%rdx,%r10,4),%rax
  dd:	4b 8d 0c 90          	lea    (%r8,%r10,4),%rcx
  e1:	c5 fa 10 05 00 00 00 	vmovss 0x0(%rip),%xmm0        # e9 <map+0xe9>
  e8:	00
  e9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  f0:	c5 fa 58 09          	vaddss (%rcx),%xmm0,%xmm1
  f4:	c5 fa 11 08          	vmovss %xmm1,(%rax)
  f8:	48 83 c0 04          	add    $0x4,%rax
  fc:	48 83 c1 04          	add    $0x4,%rcx
 100:	48 ff ce             	dec    %rsi
 103:	75 eb                	jne    f0 <map+0xf0>
 105:	c5 f8 77             	vzeroupper
 108:	c3                   	req

The attached test.c will load the object file and try to execute it. The #define N on line 7 will change the size of the array. For fewer than 32 elements this works as expected (where the input array is [0..N-1]):

$ ./build.sh
+ llc-4.0 -filetype=obj -mcpu=native map.ll
+ ghc --make -no-hs-main test.c

$ ./a.out
array size is 31
calling function...
ok
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0

For 32 elements or larger (i.e. entering the core loop) the program will (almost certainly) segfault.

$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) run
Process 7294 launched: '<snip>/a.out' (x86_64)
array size is 32
calling function...
Process 7294 stopped
* thread #1: tid = 0xc41676, 0x000000010019f207, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010019f207
->  0x10019f207: vmovaps 0xe1(%rip), %ymm0
    0x10019f20f: movq   %r9, %rcx
    0x10019f212: nopw   %cs:(%rax,%rax)
    0x10019f220: vmovups -0x60(%rdi), %xmm1

The VMOVAPS instruction requires the source address to be 32-byte aligned. It is attempting to load 8 floats from one of the const sections (the ones for the +1), but since the section was not loaded at the required alignment, fails.

I've tested this on x86_64 macOS (Mach-O) and ubuntu (ELF). I don't have any other systems to test on.

Attachments (3)

test.c (1.4 KB) - added by tmcdonell 20 months ago.
map.ll (4.7 KB) - added by tmcdonell 20 months ago.
build.sh (91 bytes) - added by tmcdonell 20 months ago.

Download all attachments as: .zip

Change History (18)

Changed 20 months ago by tmcdonell

Attachment: test.c added

Changed 20 months ago by tmcdonell

Attachment: map.ll added

Changed 20 months ago by tmcdonell

Attachment: build.sh added

comment:1 Changed 20 months ago by bgamari

Cc: angerman added
Keywords: linker added
Milestone: 8.4.1
Priority: normalhigh

Wow, I didn't know that. That is terrible!

My understanding is that the linker (at least in the case of ELF) actually maps sections incorrectly anyways: We ought to be using the object file's segments, not sections, for the purposes of mapping. This will likely reduce the number of mappings we need to setup, and allow us to set the correct RWX access control bits (e.g. currently we map code as writable, which is rather terrifying from a security perspective).

comment:2 Changed 20 months ago by angerman

For what it's worth the macho/aarch64 does this already, and elf/armv7, elf/aarch64 linker will do it. They essentially requests memory via mmap, and memcpy the data per section; this makes sections being page aligned, which I believe satisfies most alignment requirements. An additional check wouldn't hurt though.

However we can't use segments, as segments can contain multiple sections that are not bound to the same type. A segment can contain data and text section; which in turn need to live in different pages for them to be properly mprotected. This then means we can't mmap segments.

However what we could try to reduce memory footprint is to aggregate text and data sections across object files. Of course mmap + memcpy is slower than directly mmapping.

comment:3 Changed 20 months ago by bgamari

For the record, each segment has an associated set of permission flags. IIRC, two sections of mapping characteristics won't end up in the same segment.

However, we nevertheless can't use segments for an unrelated reason: only executables and shared objects are organized into segments. Our runtime linker is only concerned with object files, which have no useful segment organizations AFAIK. Oh well.

comment:4 Changed 17 months ago by bgamari

Fixing this will unblock #8949.

comment:5 Changed 17 months ago by bgamari

I've opened #14069 to fix the protection bits issue.

comment:6 Changed 12 months ago by George

Cc: George added

comment:7 Changed 11 months ago by bgamari

Keywords: newcomer added
Milestone: 8.4.18.6.1

This is something that we should try to fix (along with #14069) in 8.6.

comment:8 Changed 11 months ago by George

Type of failure: None/UnknownRuntime crash

comment:9 Changed 10 months ago by George

Blocking: 8949 added

comment:10 Changed 6 months ago by bgamari

Milestone: 8.6.18.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:11 Changed 4 months ago by carter

What are the platforms that need this fix?

comment:12 Changed 4 months ago by carter

And what are the challenges to doing this correctly?

comment:13 Changed 2 months ago by George

Trying to find out if MacOS on 8.6.1 is one of the platforms where this needs to be fixed but I can't reproduce:

./build.sh
+ llc -filetype=obj -mcpu=native map.ll
+ ghc --make -no-hs-main test.c
bash-3.2$ ./a.out
array size is 31
calling function...
ok
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0 
bash-3.2$ lldb a.out
(lldb) target create "a.out"
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
    import weakref
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
    from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Current executable set to 'a.out' (x86_64).
(lldb) quit
quit

I don't understand why lldb is needed to reproduce this in any case. In particular how it changes the array size to 32, could you explain? In the description it says "The #define N on line 7 will change the size of the array...For 32 elements or larger (i.e. entering the core loop) the program will (almost certainly) segfault. " I trying changing N to 32 but then running build.sh a.out just worked, no segfault.

My questions about why we need lldb and why we can't just use build.sh and a.out to reproduce still stand but my inability to reproduce may be due to not having a version of lldb that is compatible with the llc version I am using. My lldb is the Apple one, do I need one specific to the version of llvm I am using?

 lldb --version
lldb-1000.11.37.1
  Swift-4.2
bash-3.2$ llc --version
LLVM (http://llvm.org/):
  LLVM version 6.0.1
  Optimized build.
  Default target: x86_64-apple-darwin17.7.0
  Host CPU: nehalem
Last edited 2 months ago by George (previous) (diff)

comment:14 in reply to:  13 ; Changed 2 months ago by tmcdonell

Replying to George:

Trying to find out if MacOS on 8.6.1 is one of the platforms where this needs to be fixed but I can't reproduce:

The problem still exists. I verified this just now; macOS 10.13.6

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.1
$ ghc --print-project-git-commit-id
0d2cdec78471728a0f2c487581d36acda68bb941

I don't understand why lldb is needed to reproduce this in any case.

Obviously, lldb is not required. Just run the executable and watch it segfault. The point of using a debugger (any would suffice) is to demonstrate which instruction is causing the problem, which for this issue is important to understand.

In particular how it changes the array size to 32, could you explain? In the description it says "The #define N on line 7 will change the size of the array...For 32 elements or larger (i.e. entering the core loop) the program will (almost certainly) segfault. " I trying changing N to 32 but then running build.sh a.out just worked, no segfault.

If you read the description, you will see that the core loop is a 8-way vectorised and 4-way unrolled; i.e. 32-elements per trip. For any remainder it backs out to a single element per loop. Thus, to activate the code path which exhibits the problem, you require at least 32 elements.

The description also says "For a CPU with AVX instructions (sandy bridge or later) you should get the following". Since you are running an original corei7 (nephalem) and don't have AVX instructions, this exact example obviously won't trigger for you. If you checked the objdump output as suggested, you would also have seen this.

comment:15 in reply to:  14 Changed 2 months ago by George

Sorry, when I first read the bug I realized that I wouldn't be able to reproduce it on my machine but was interested in seeing it fixed as I plan on getting a new machine in the near future. When I saw comment 11 I forgot all that and tried to reproduce. Sorry for wasting your time but thanks for taking the time to answer. I'll be more careful in the future.

Last edited 7 weeks ago by George (previous) (diff)
Note: See TracTickets for help on using tickets.