Opened 5 years ago

Last modified 5 months ago

#4211 new task

LLVM: Stack alignment on OSX

Reported by: dterei
Owned by: dterei
Priority: normal
Milestone: 7.12.1
Component: Compiler (LLVM)
Version: 6.13
Keywords:
Cc: pho@…, dmp@…
Operating System: MacOS X
Architecture: x86
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

On OSX the ABI requires that the stack be 16-byte aligned when making function calls. (Strictly, this only needs to be obeyed for calls that go through the dynamic linker, i.e. FFI calls.) Since the stack is 16-byte aligned at the call site, most compilers (both LLVM and GCC) expect the stack on function entry to be aligned to 16n - 4, because the call instruction pushes a 4-byte return address. GHC, however, uses jumps everywhere and so keeps the stack 16-byte aligned on function entry. This means that LLVM generates incorrect stack alignment code, always off by 4.

For the moment I have handled this by using the LLVM mangler (which is already needed on OS X for TNTC anyway) to fix up the stack alignment code.

E.g. assembly generated by LLVM:

_func:
    subl $12, %esp
    ...
    call _sin
    ...
    addl $12, %esp

The mangler will change this to:

_func:
    subl $16, %esp
    ...
    call _sin
    ...
    addl $16, %esp

The better solution would be to change GHC to keep the stack at 16n - 4 alignment on function entry. This will require changing the RTS (StgCRun.c) to set the stack properly before calling into STG land, and also fixing up the NCG to align code properly. There may also be a problem with the C backend, as currently all function prologue and epilogue code is stripped out, which means all the stack manipulation code generated by GCC is removed. This works fine now, since the stack is already 16-byte aligned on entry, but if it is instead 16n - 4 byte aligned then some stack manipulation will be required.

Change History (29)

comment:1 Changed 5 years ago by igloo

  • Milestone set to 6.14.1

comment:2 Changed 4 years ago by PHO

  • Cc pho@… added

comment:3 Changed 4 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:4 Changed 4 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:5 Changed 4 years ago by simonmar

I'm changing the code that handles stack alignment as a result of #5250, I think this change would be easy to make too. David, ping me if you want help with this.

comment:6 Changed 4 years ago by dterei

Yes, that would be great, Simon; there shouldn't be any issues now that we don't support the registerised C backend. The only concern I can think of with this change is that there may now be cases where the NCG didn't need to do any sp manipulation but now will. (E.g. for any C call the NCG will need to fix up alignment by subtracting 12 bytes, or 8 on x86-64, whereas in the past, if no system stack space was used, the sp was already correctly aligned.) I can't see this causing any performance regression though.

If you are happy to just take over this ticket, that's great; otherwise just point me to what needs to be changed and I'll finish it off. I think when I had a very brief go at this ticket ages back, the main problem was just finding all the locations involved in Sp tracking and manipulation.

comment:7 Changed 4 years ago by simonmar

Right, C calls with no arguments will now need an esp adjustment, but I don't think it's a big deal; it only affects calls with no arguments, since those with arguments already need an Sp adjustment.

Do you want to have a go? There are only two places that need to be changed. In StgCRun.c we currently add 12 to esp to align it; we can just drop that. The other place is X86.CodeGen.genCCall, where we pad the stack offset if necessary.

Urgh, I just realised that we probably need to do something in rts/Adjustor.c too. I looked at the code there, and the Darwin/x86 version is vastly more complicated than the non-Darwin x86 version; I'd be surprised if all that complexity is needed just to get the stack pointer aligned...

comment:8 Changed 4 years ago by dterei

OK. Yep I'll have a go at this sometime soon when I get some spare time. Probably won't have any for at least the next 10 days.

comment:9 Changed 4 years ago by simonmar

I just added a test for stack alignment for #5250, and sure enough it fails with LLVM:

=====> 5250(optllvm) 23 of 23 [0, 0, 0]
cd . && '/64playpen/simonmar/validate/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-conf -rtsopts  -o 5250 5250.hs -O -fllvm spalign.c  >5250.comp.stderr 2>&1
cd . && ./5250    </dev/null >5250.run.stdout 2>5250.run.stderr
Wrong exit code (expected 0 , actual 1 )
Stdout:
esp not aligned correctly: c39c5d30

comment:10 Changed 4 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:11 Changed 4 years ago by dmp

  • Cc dmp@… added

I have made the changes to keep the stack 16+8 byte aligned throughout
STG code for x86_64 targets. The changes are available on the 4211
branch of my GitHub repo:

https://github.com/dmpots/ghc/tree/4211

I can also attach the patches to this ticket if that would work better.

The results look promising. I get no new regressions on a validate run
and a testsuite run with WAY=optllvm goes from 36 failures to 6
failures, plus two unexpected passes (one of which is the 5250 test
simonmar mentioned above).

I would love to see these changes merged into mainline GHC. I
suppose the main sticking point is that I made the change for
64-bit targets but not 32-bit targets. Is there any chance it will
get merged for the 64-bit targets? My motivation (and therefore
available time) is much lower for 32-bit targets, and I'm worried
that dealing with the stdcall convention could be a bit more
complicated.

comment:12 Changed 4 years ago by dterei

Great! I'll have a look at the patch and hopefully merge it. If you can handle the 32-bit case too, that would be greatly appreciated. I assume you disabled the LLVM mangler stack fixup in the 64-bit case but not the 32-bit one?

comment:13 Changed 4 years ago by dmp

Yes, the stack mangling has been disabled for the 64-bit case, so it's now only used for the 32-bit Darwin target. I might get some time to make the changes for the 32-bit case too; perhaps my reservations about the difficulty of handling stdcall are unfounded.

comment:14 Changed 4 years ago by dmp@…

commit a9ce36118f0de3aeb427792f8f2c5ae097c94d3f

Author: David M Peixotto <[email protected]>
Date:   Wed Oct 19 15:49:06 2011 -0500

    Change stack alignment to 16+8 bytes in STG code
    
    This patch changes the STG code so that %rsp is aligned
    to a 16-byte boundary + 8. This is the alignment required by
    the x86_64 ABI on entry to a function. Previously we kept
    %rsp aligned to a 16-byte boundary, but this was causing
    problems for the LLVM backend (see #4211).
    
    We no longer need to invoke the LLVM stack mangler on
    x86_64 targets: since the stack is now 16+8 byte aligned in
    STG land on x86_64, the stack manipulations no longer need
    to be rewritten by the mangler.
    
    This patch only modifies the alignment for x86_64 backends.
    
    Signed-off-by: David Terei <[email protected]>

 compiler/llvmGen/LlvmMangler.hs   |    6 +++-
 compiler/nativeGen/X86/CodeGen.hs |   16 +++++++-----
 rts/StgCRun.c                     |   46 +++++++++++++++++++++----------------
 3 files changed, 39 insertions(+), 29 deletions(-)

comment:15 Changed 4 years ago by dterei

Thanks for the patch, applied!

Leaving the ticket open though to track fixing this on 32-bit.

comment:16 Changed 4 years ago by davidterei@…

commit 8a1c644af72caf122e73dac801496c055fc82dd9

Author: David Terei <[email protected]>
Date:   Tue Nov 15 19:21:34 2011 -0800

    Fix #4211: No need to fix up stack using mangler on OSX
    
    We now manage the stack correctly on both x86_64 and i386, keeping
    the stack aligned at (16n bytes - word size) on function entry
    and at (16n bytes) on function calls. This gives us compatibility
    with LLVM and GCC.

 compiler/llvmGen/LlvmMangler.hs   |   46 +++---------------------------------
 compiler/nativeGen/X86/CodeGen.hs |    6 ++--
 rts/StgCRun.c                     |    2 +-
 3 files changed, 8 insertions(+), 46 deletions(-)

comment:17 Changed 4 years ago by dterei

  • Resolution set to fixed
  • Status changed from new to closed

comment:18 Changed 3 years ago by simonmar

  • difficulty set to Unknown
  • Owner dterei deleted
  • Resolution fixed deleted
  • Status changed from closed to new

Re-opening this ticket as the test rts/5250(llvm) is failing on x86/linux, indicating that the stack pointer is not aligned correctly on that platform. (David: I thought I had mentioned this before, but I searched around and can't find another ticket).

comment:19 Changed 3 years ago by dterei

Argh! OK thanks Simon, will take a look when I get the time.

comment:20 Changed 3 years ago by dterei

  • Owner set to dterei

comment:21 Changed 3 years ago by simonmar

  • Milestone changed from 7.4.1 to 7.4.2

comment:22 Changed 3 years ago by igloo

  • Milestone changed from 7.4.2 to 7.4.3

comment:23 Changed 3 years ago by dterei

  • Cc dmp@… rice.edu added; dmp@… removed

Simon, I can't reproduce this. I just tested on Linux on both 32-bit and 64-bit machines, and 5250 passes for LLVM with current HEAD.

comment:24 Changed 3 years ago by dterei

  • Status changed from new to infoneeded

comment:25 Changed 3 years ago by dterei

  • Cc dmp@… added; dmp@… rice.edu removed

comment:26 Changed 3 years ago by igloo

  • Milestone changed from 7.4.3 to 7.6.2

comment:27 Changed 2 years ago by simonmar

  • Status changed from infoneeded to new

I'm slightly unhappy about the solution to this ticket. Since a9ce36118f0de3aeb427792f8f2c5ae097c94d3f we align the stack to a 16-byte boundary + 8 on x86_64, which means that virtually every foreign call now looks like this:

	subq $8,%rsp
	movl $0,%eax
	call foo
	addq $8,%rsp

Since most calls on x86_64 pass arguments in registers, the 8-byte adjustment exists purely to maintain the alignment constraint after the return address is pushed. (The assignment to %eax is to keep the ABI happy in case the function we're calling is varargs, which is a separate issue.)

Now I understand the reason this was done, but I hate having to emit those two extra instructions around every call. I just wanted to record my dissatisfaction in this ticket in case it inspires someone to find a better solution :-)

comment:28 Changed 11 months ago by thoughtpolice

  • Milestone changed from 7.6.2 to 7.10.1

Moving to 7.10.1.

comment:29 Changed 5 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
