loadObj() does not respect alignment

This is perhaps known, but I'll write it down here in case somebody else runs into this problem as well.

Since loadObj() just mmap()s the entire object file and decodes it in place, it does not respect the alignment requirements specified in the section headers. This is a problem for instructions whose memory operands must be aligned, e.g. the aligned SSE and AVX loads and stores.
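To make the failure mode concrete, here is a minimal sketch of the address arithmetic involved. The helper names are hypothetical and are not taken from the RTS linker; the sketch only assumes that a section used in place ends up at the mapping's base address plus its file offset, and that alignments are powers of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not the RTS linker's code. */

/* Non-zero if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uintptr_t addr, size_t align)
{
    return (addr & (uintptr_t)(align - 1)) == 0;
}

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    return (addr + (uintptr_t)(align - 1)) & ~(uintptr_t)(align - 1);
}

/* Address a section gets when the whole file is mapped at image_base
 * and the section is used in place at its file offset. */
static uintptr_t section_addr_in_place(uintptr_t image_base, size_t file_offset)
{
    return image_base + file_offset;
}
```

Even with a page-aligned mapping, a section whose file offset is not a multiple of its required alignment lands on a misaligned address. A loader that honours the section header would instead need to place the section at something like align_up(addr, alignment), which in general means copying it.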

The attached map.ll program is map (+1) over an array of floating point numbers. In particular, the core loop is 8-way SIMD vectorised x 4-way unrolled, for 32 elements per loop iteration. A tail loop handles any remainder one element at a time.
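For readers without the attachment, the loop structure described above corresponds roughly to the following scalar C. This is a sketch only: the function name, parameter list and index types are assumptions for illustration, not the actual signature in map.ll.

```c
#include <stddef.h>

/* Hypothetical scalar reference for map (+1) over in[start..end).
 * The name and signature are assumed, not taken from map.ll. */
void map_plus_one(size_t start, size_t end, float *out, const float *in)
{
    if (end <= start)
        return;

    size_t i = start;

    /* Core loop: 32 elements per iteration. In the compiled code this is
     * four 8-wide vector adds rather than an inner scalar loop. */
    for (; i + 32 <= end; i += 32) {
        for (size_t j = 0; j < 32; j++)
            out[i + j] = in[i + j] + 1.0f;
    }

    /* Tail loop: remaining elements, one at a time. */
    for (; i < end; i++)
        out[i] = in[i] + 1.0f;
}
```

With fewer than 32 elements only the tail loop runs, which is why small arrays never touch the aligned vector constant.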

You can compile it using llc -filetype=obj -mcpu=native map.ll. For a CPU with AVX instructions (Sandy Bridge or later) you should get the following:

$ objdump -d map.o
Disassembly of section .text:

0000000000000000 <map>:
   0:	49 89 f3             	mov    %rsi,%r11
   3:	49 29 fb             	sub    %rdi,%r11
   6:	0f 8e f9 00 00 00    	jle    105 <map+0x105>
   c:	49 83 fb 20          	cmp    $0x20,%r11
  10:	0f 82 bd 00 00 00    	jb     d3 <map+0xd3>
  16:	4d 89 da             	mov    %r11,%r10
  19:	49 83 e2 e0          	and    $0xffffffffffffffe0,%r10
  1d:	4d 89 d9             	mov    %r11,%r9
  20:	49 83 e1 e0          	and    $0xffffffffffffffe0,%r9
  24:	0f 84 a9 00 00 00    	je     d3 <map+0xd3>
  2a:	49 01 fa             	add    %rdi,%r10
  2d:	48 8d 44 ba 60       	lea    0x60(%rdx,%rdi,4),%rax
  32:	49 8d 7c b8 60       	lea    0x60(%r8,%rdi,4),%rdi
  37:	c5 fc 28 05 00 00 00 	vmovaps 0x0(%rip),%ymm0        # 3f <map+0x3f>
  3e:	00
  3f:	4c 89 c9             	mov    %r9,%rcx
  42:	66 66 66 66 66 2e 0f 	data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  49:	1f 84 00 00 00 00 00
  50:	c5 f8 10 4f a0       	vmovups -0x60(%rdi),%xmm1
  55:	c5 f8 10 57 c0       	vmovups -0x40(%rdi),%xmm2
  5a:	c5 f8 10 5f e0       	vmovups -0x20(%rdi),%xmm3
  5f:	c5 f8 10 27          	vmovups (%rdi),%xmm4
  63:	c4 e3 75 18 4f b0 01 	vinsertf128 $0x1,-0x50(%rdi),%ymm1,%ymm1
  6a:	c4 e3 6d 18 57 d0 01 	vinsertf128 $0x1,-0x30(%rdi),%ymm2,%ymm2
  71:	c4 e3 65 18 5f f0 01 	vinsertf128 $0x1,-0x10(%rdi),%ymm3,%ymm3
  78:	c4 e3 5d 18 67 10 01 	vinsertf128 $0x1,0x10(%rdi),%ymm4,%ymm4
  7f:	c5 f4 58 c8          	vaddps %ymm0,%ymm1,%ymm1
  83:	c5 ec 58 d0          	vaddps %ymm0,%ymm2,%ymm2
  87:	c5 e4 58 d8          	vaddps %ymm0,%ymm3,%ymm3
  8b:	c5 dc 58 e0          	vaddps %ymm0,%ymm4,%ymm4
  8f:	c4 e3 7d 19 48 b0 01 	vextractf128 $0x1,%ymm1,-0x50(%rax)
  96:	c5 f8 11 48 a0       	vmovups %xmm1,-0x60(%rax)
  9b:	c4 e3 7d 19 50 d0 01 	vextractf128 $0x1,%ymm2,-0x30(%rax)
  a2:	c5 f8 11 50 c0       	vmovups %xmm2,-0x40(%rax)
  a7:	c4 e3 7d 19 58 f0 01 	vextractf128 $0x1,%ymm3,-0x10(%rax)
  ae:	c5 f8 11 58 e0       	vmovups %xmm3,-0x20(%rax)
  b3:	c4 e3 7d 19 60 10 01 	vextractf128 $0x1,%ymm4,0x10(%rax)
  ba:	c5 f8 11 20          	vmovups %xmm4,(%rax)
  be:	48 83 e8 80          	sub    $0xffffffffffffff80,%rax
  c2:	48 83 ef 80          	sub    $0xffffffffffffff80,%rdi
  c6:	48 83 c1 e0          	add    $0xffffffffffffffe0,%rcx
  ca:	75 84                	jne    50 <map+0x50>
  cc:	4d 39 cb             	cmp    %r9,%r11
  cf:	75 05                	jne    d6 <map+0xd6>
  d1:	eb 32                	jmp    105 <map+0x105>
  d3:	49 89 fa             	mov    %rdi,%r10
  d6:	4c 29 d6             	sub    %r10,%rsi
  d9:	4a 8d 04 92          	lea    (%rdx,%r10,4),%rax
  dd:	4b 8d 0c 90          	lea    (%r8,%r10,4),%rcx
  e1:	c5 fa 10 05 00 00 00 	vmovss 0x0(%rip),%xmm0        # e9 <map+0xe9>
  e8:	00
  e9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  f0:	c5 fa 58 09          	vaddss (%rcx),%xmm0,%xmm1
  f4:	c5 fa 11 08          	vmovss %xmm1,(%rax)
  f8:	48 83 c0 04          	add    $0x4,%rax
  fc:	48 83 c1 04          	add    $0x4,%rcx
 100:	48 ff ce             	dec    %rsi
 103:	75 eb                	jne    f0 <map+0xf0>
 105:	c5 f8 77             	vzeroupper
 108:	c3                   	retq

The attached test.c will load the object file and try to execute it. The #define N on line 7 will change the size of the array. For fewer than 32 elements this works as expected (where the input array is [0..N-1]):

$ ./build.sh
+ llc-4.0 -filetype=obj -mcpu=native map.ll
+ ghc --make -no-hs-main test.c

$ ./a.out
array size is 31
calling function...
ok
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0

For 32 or more elements (i.e. when the core loop is entered) the program will (almost certainly) segfault.

$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) run
Process 7294 launched: '<snip>/a.out' (x86_64)
array size is 32
calling function...
Process 7294 stopped
* thread #1: tid = 0xc41676, 0x000000010019f207, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010019f207
->  0x10019f207: vmovaps 0xe1(%rip), %ymm0
    0x10019f20f: movq   %r9, %rcx
    0x10019f212: nopw   %cs:(%rax,%rax)
    0x10019f220: vmovups -0x60(%rdi), %xmm1

The VMOVAPS instruction requires its memory operand to be 32-byte aligned when used with a 256-bit ymm register. Here it is attempting to load 8 floats from one of the constant sections (the vector of ones for the +1), but since that section was not loaded at its required alignment, the load faults.
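The misalignment can be checked directly from the lldb output above. The operand is RIP-relative, so its address is the address of the next instruction (0x10019f20f) plus the displacement (0xe1). A small sketch of that arithmetic follows; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration. */

/* Effective address of a RIP-relative operand: the address of the
 * following instruction plus the signed 32-bit displacement. */
static uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp)
{
    return next_insn_addr + (uint64_t)(int64_t)disp;
}

/* Number of bytes by which addr exceeds the previous multiple of align. */
static unsigned misalignment(uint64_t addr, unsigned align)
{
    return (unsigned)(addr % align);
}
```

rip_relative_target(0x10019f20f, 0xe1) gives 0x10019f2f0, and misalignment(0x10019f2f0, 32) is 16: the constant sits on a 16-byte boundary but not a 32-byte one, so the 256-bit aligned load raises a general protection fault.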

I've tested this on x86_64 macOS (Mach-O) and Ubuntu (ELF). I don't have any other systems to test on.

Trac metadata

Trac field	Value
Version	8.0.1
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Runtime System (Linker)
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
