loadObj() does not respect alignment
This is perhaps known, but I'll write it down here in case somebody else runs into this problem as well.
Since loadObj() just mmap()s the entire object file and decodes it in place, it does not respect the alignment requirements specified in the section headers. This is problematic for instructions which require alignment, e.g. SSE, AVX.
The attached map.ll program is map (+1) over an array of floating point numbers. In particular, the core loop is 8-way SIMD vectorised x 4-way unrolled, for 32 elements per loop iteration. A tail loop handles any remainder one at a time.
You can compile it using llc -filetype=obj -mcpu=native map.ll. For a CPU with AVX instructions (Sandy Bridge or later) you should get the following:
$ objdump -d map.o
Disassembly of section .text:
0000000000000000 <map>:
0: 49 89 f3 mov %rsi,%r11
3: 49 29 fb sub %rdi,%r11
6: 0f 8e f9 00 00 00 jle 105 <map+0x105>
c: 49 83 fb 20 cmp $0x20,%r11
10: 0f 82 bd 00 00 00 jb d3 <map+0xd3>
16: 4d 89 da mov %r11,%r10
19: 49 83 e2 e0 and $0xffffffffffffffe0,%r10
1d: 4d 89 d9 mov %r11,%r9
20: 49 83 e1 e0 and $0xffffffffffffffe0,%r9
24: 0f 84 a9 00 00 00 je d3 <map+0xd3>
2a: 49 01 fa add %rdi,%r10
2d: 48 8d 44 ba 60 lea 0x60(%rdx,%rdi,4),%rax
32: 49 8d 7c b8 60 lea 0x60(%r8,%rdi,4),%rdi
37: c5 fc 28 05 00 00 00 vmovaps 0x0(%rip),%ymm0 # 3f <map+0x3f>
3e: 00
3f: 4c 89 c9 mov %r9,%rcx
42: 66 66 66 66 66 2e 0f data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
49: 1f 84 00 00 00 00 00
50: c5 f8 10 4f a0 vmovups -0x60(%rdi),%xmm1
55: c5 f8 10 57 c0 vmovups -0x40(%rdi),%xmm2
5a: c5 f8 10 5f e0 vmovups -0x20(%rdi),%xmm3
5f: c5 f8 10 27 vmovups (%rdi),%xmm4
63: c4 e3 75 18 4f b0 01 vinsertf128 $0x1,-0x50(%rdi),%ymm1,%ymm1
6a: c4 e3 6d 18 57 d0 01 vinsertf128 $0x1,-0x30(%rdi),%ymm2,%ymm2
71: c4 e3 65 18 5f f0 01 vinsertf128 $0x1,-0x10(%rdi),%ymm3,%ymm3
78: c4 e3 5d 18 67 10 01 vinsertf128 $0x1,0x10(%rdi),%ymm4,%ymm4
7f: c5 f4 58 c8 vaddps %ymm0,%ymm1,%ymm1
83: c5 ec 58 d0 vaddps %ymm0,%ymm2,%ymm2
87: c5 e4 58 d8 vaddps %ymm0,%ymm3,%ymm3
8b: c5 dc 58 e0 vaddps %ymm0,%ymm4,%ymm4
8f: c4 e3 7d 19 48 b0 01 vextractf128 $0x1,%ymm1,-0x50(%rax)
96: c5 f8 11 48 a0 vmovups %xmm1,-0x60(%rax)
9b: c4 e3 7d 19 50 d0 01 vextractf128 $0x1,%ymm2,-0x30(%rax)
a2: c5 f8 11 50 c0 vmovups %xmm2,-0x40(%rax)
a7: c4 e3 7d 19 58 f0 01 vextractf128 $0x1,%ymm3,-0x10(%rax)
ae: c5 f8 11 58 e0 vmovups %xmm3,-0x20(%rax)
b3: c4 e3 7d 19 60 10 01 vextractf128 $0x1,%ymm4,0x10(%rax)
ba: c5 f8 11 20 vmovups %xmm4,(%rax)
be: 48 83 e8 80 sub $0xffffffffffffff80,%rax
c2: 48 83 ef 80 sub $0xffffffffffffff80,%rdi
c6: 48 83 c1 e0 add $0xffffffffffffffe0,%rcx
ca: 75 84 jne 50 <map+0x50>
cc: 4d 39 cb cmp %r9,%r11
cf: 75 05 jne d6 <map+0xd6>
d1: eb 32 jmp 105 <map+0x105>
d3: 49 89 fa mov %rdi,%r10
d6: 4c 29 d6 sub %r10,%rsi
d9: 4a 8d 04 92 lea (%rdx,%r10,4),%rax
dd: 4b 8d 0c 90 lea (%r8,%r10,4),%rcx
e1: c5 fa 10 05 00 00 00 vmovss 0x0(%rip),%xmm0 # e9 <map+0xe9>
e8: 00
e9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
f0: c5 fa 58 09 vaddss (%rcx),%xmm0,%xmm1
f4: c5 fa 11 08 vmovss %xmm1,(%rax)
f8: 48 83 c0 04 add $0x4,%rax
fc: 48 83 c1 04 add $0x4,%rcx
100: 48 ff ce dec %rsi
103: 75 eb jne f0 <map+0xf0>
105: c5 f8 77 vzeroupper
108: c3 retq
The attached test.c will load the object file and try to execute it. The #define N on line 7 will change the size of the array. For fewer than 32 elements this works as expected (where the input array is [0..N-1]):
$ ./build.sh
+ llc-4.0 -filetype=obj -mcpu=native map.ll
+ ghc --make -no-hs-main test.c
$ ./a.out
array size is 31
calling function...
ok
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0
For 32 elements or larger (i.e. entering the core loop) the program will (almost certainly) segfault.
$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) run
Process 7294 launched: '<snip>/a.out' (x86_64)
array size is 32
calling function...
Process 7294 stopped
* thread #1: tid = 0xc41676, 0x000000010019f207, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010019f207
-> 0x10019f207: vmovaps 0xe1(%rip), %ymm0
0x10019f20f: movq %r9, %rcx
0x10019f212: nopw %cs:(%rax,%rax)
0x10019f220: vmovups -0x60(%rdi), %xmm1
The VMOVAPS instruction requires the source address to be 32-byte aligned. It is attempting to load 8 floats from one of the const sections (the ones for the +1), but since the section was not loaded at the required alignment, the load faults.
I've tested this on x86_64 macOS (Mach-O) and Ubuntu (ELF). I don't have any other systems to test on.
Trac metadata
Trac field | Value |
---|---|
Version | 8.0.1 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Runtime System (Linker) |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture | |