Opened 16 months ago

Closed 15 months ago

Last modified 15 months ago

#8965 closed bug (fixed)

bootstrapping failure on Linux/ppc64el

Reported by: cjwatson
Owned by:
Priority: normal
Milestone: 7.8.3
Component: Compiler
Version: 7.8.1-rc2
Keywords:
Cc: slyfox@…
Operating System: Linux
Architecture: powerpc64
Type of failure: GHC doesn't work at all
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

A few distributions, including Ubuntu, which I work on, have a new little-endian Linux ppc64 port (known variously as ppc64el, ppc64le, powerpc64le, etc.). Besides the obvious endianness difference, it differs from traditional ppc64 in using a new version of the ELF ABI. https://bugs.openjdk.java.net/browse/JDK-8035647 has a useful set of links explaining the changes.

I've been trying to bootstrap GHC on this architecture, but I've been running into failures, which I explained here along with my procedure:

https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2014-April/014922.html

Things that I believe I have ruled out so far:

  • It's not libffi. In the procedure above I had not unpacked libffi6:ppc64el in the sysroot or configured with --with-system-libffi (our libffi is newer), but doing both makes no difference; furthermore, breakpoints set on all exposed ffi_* symbols and on createAdjustor never fire.
  • It's not signal handling. I replaced unlit with a shell script that sleeps 5 and then execs the real unlit, and ghc-stage2 segfaulted before it returned; strace shows that SIGSEGV is the first signal received.
  • It's not buggy native code generation. Although getting an NCG up on this platform might be tractable later, for now I added a powerpc64le*) $2="powerpc64le" ;; case before powerpc64* to GHC_CONVERT_CPU and added powerpc64le to the ArchUnknown list in FPTOOLS_SET_HASKELL_PLATFORM_VARS. Everything should be going via GCC.
  • I don't think I'm taking a fundamentally wrong bootstrapping approach; I'm using the exact same procedure and in fact I started from the exact same git tree that I just used to successfully bootstrap GHC on arm64.

Here's a gdb trace showing the location of the segfault:

*** Literate pre-processor:
../inplace/lib/unlit -h HelloWorld.lhs HelloWorld.lhs /tmp/ghc6597_0/ghc6597_1.lpp

Breakpoint 1, 0x000000001507daac in runInteractiveProcess ()
(gdb) bt
#0  0x000000001507daac in runInteractiveProcess ()
#1  0x000000001507d968 in c8z6_entry ()
#2  0x00003fffb7804a20 in ?? ()
Cannot access memory at address 0xf
(gdb) finish
Run till exit from #0  0x000000001507daac in runInteractiveProcess ()
0x000000001507d968 in c8z6_entry ()
(gdb) stepi
0x000000001507d96c in c8z6_entry ()
(gdb)
0x000000001507d970 in c8z6_entry ()
(gdb)
0x000000001507d974 in c8z6_entry ()
(gdb)
0x000000001507d978 in c8z6_entry ()
(gdb)
0x000000001507d97c in c8z6_entry ()
(gdb)
0x000000001507d980 in c8z6_entry ()
(gdb)
0x000000001507d984 in c8z6_entry ()
(gdb)
0x000000001507d988 in c8z6_entry ()
(gdb)
0x000000001507d98c in c8z6_entry ()
(gdb)
0x000000001507d990 in c8z6_entry ()
(gdb)
0x000000001507d994 in c8z6_entry ()
(gdb)
0x000000001507d998 in c8z6_entry ()
(gdb)
0x000000001507d99c in c8z6_entry ()
(gdb)
0x000000001507d9a0 in c8z6_entry ()
(gdb)
0x000000001507d9a4 in c8z6_entry ()
(gdb)
0x000000001507d9a8 in c8z6_entry ()
(gdb)
0x000000001507d9ac in c8z6_entry ()
(gdb)
0x000000001507d9b0 in c8z6_entry ()
(gdb)
0x000000001507d9e8 in c8z6_entry ()
(gdb)
Cannot access memory at address 0xf
(gdb) disas /rm
Dump of assembler code for function c8z6_entry:
   0x000000001507d8ec <+0>:     6f 16 40 3c     lis     r2,5743
   0x000000001507d8f0 <+4>:     20 9c 42 38     addi    r2,r2,-25568
   0x000000001507d8f4 <+8>:     a6 02 08 7c     mflr    r0
   0x000000001507d8f8 <+12>:    10 00 01 f8     std     r0,16(r1)
   0x000000001507d8fc <+16>:    c1 ff 21 f8     stdu    r1,-64(r1)
   0x000000001507d900 <+20>:    00 00 00 60     nop
   0x000000001507d904 <+24>:    60 92 22 39     addi    r9,r2,-28064
   0x000000001507d908 <+28>:    68 03 49 e9     ld      r10,872(r9)
   0x000000001507d90c <+32>:    10 00 4a 39     addi    r10,r10,16
   0x000000001507d910 <+36>:    68 03 49 f9     std     r10,872(r9)
   0x000000001507d914 <+40>:    58 03 69 e9     ld      r11,856(r9)
   0x000000001507d918 <+44>:    08 00 0b e8     ld      r0,8(r11)
   0x000000001507d91c <+48>:    70 03 29 e9     ld      r9,880(r9)
   0x000000001507d920 <+52>:    40 48 aa 7f     cmpld   cr7,r10,r9
   0x000000001507d924 <+56>:    90 00 9d 41     bgt     cr7,0x1507d9b4 <c8z6_entry+200>
   0x000000001507d928 <+60>:    40 00 6b e8     ld      r3,64(r11)
   0x000000001507d92c <+64>:    38 00 8b e8     ld      r4,56(r11)
   0x000000001507d930 <+68>:    48 00 ab e8     ld      r5,72(r11)
   0x000000001507d934 <+72>:    50 00 cb e8     ld      r6,80(r11)
   0x000000001507d938 <+76>:    58 00 eb e8     ld      r7,88(r11)
   0x000000001507d93c <+80>:    10 00 0b e9     ld      r8,16(r11)
   0x000000001507d940 <+84>:    30 00 2b e9     ld      r9,48(r11)
   0x000000001507d944 <+88>:    28 00 4b e9     ld      r10,40(r11)
   0x000000001507d948 <+92>:    20 00 8b e9     ld      r12,32(r11)
   0x000000001507d94c <+96>:    20 00 81 f9     std     r12,32(r1)
   0x000000001507d950 <+100>:   00 00 8b e9     ld      r12,0(r11)
   0x000000001507d954 <+104>:   28 00 81 f9     std     r12,40(r1)
   0x000000001507d958 <+108>:   30 00 01 f8     std     r0,48(r1)
   0x000000001507d95c <+112>:   18 00 6b e9     ld      r11,24(r11)
   0x000000001507d960 <+116>:   38 00 61 f9     std     r11,56(r1)
   0x000000001507d964 <+120>:   19 01 00 48     bl      0x1507da7c <runInteractiveProcess+8>
   0x000000001507d968 <+124>:   00 00 00 60     nop
   0x000000001507d96c <+128>:   00 00 00 60     nop
   0x000000001507d970 <+132>:   60 92 22 39     addi    r9,r2,-28064
   0x000000001507d974 <+136>:   68 03 49 e9     ld      r10,872(r9)
   0x000000001507d978 <+140>:   f6 ff e2 3c     addis   r7,r2,-10
   0x000000001507d97c <+144>:   30 17 07 39     addi    r8,r7,5936
   0x000000001507d980 <+148>:   f8 ff 0a f9     std     r8,-8(r10)
   0x000000001507d984 <+152>:   68 03 49 e9     ld      r10,872(r9)
   0x000000001507d988 <+156>:   b4 07 63 7c     extsw   r3,r3
   0x000000001507d98c <+160>:   00 00 6a f8     std     r3,0(r10)
   0x000000001507d990 <+164>:   68 03 49 e9     ld      r10,872(r9)
   0x000000001507d994 <+168>:   f9 ff 4a 39     addi    r10,r10,-7
   0x000000001507d998 <+172>:   18 00 49 f9     std     r10,24(r9)
   0x000000001507d99c <+176>:   58 03 49 e9     ld      r10,856(r9)
   0x000000001507d9a0 <+180>:   60 00 0a 39     addi    r8,r10,96
   0x000000001507d9a4 <+184>:   58 03 09 f9     std     r8,856(r9)
   0x000000001507d9a8 <+188>:   60 00 2a e9     ld      r9,96(r10)
   0x000000001507d9ac <+192>:   00 00 69 e8     ld      r3,0(r9)
   0x000000001507d9b0 <+196>:   38 00 00 48     b       0x1507d9e8 <c8z6_entry+252>
   0x000000001507d9b4 <+200>:   00 00 00 60     nop
   0x000000001507d9b8 <+204>:   60 92 22 39     addi    r9,r2,-28064
   0x000000001507d9bc <+208>:   10 00 40 39     li      r10,16
   0x000000001507d9c0 <+212>:   a0 03 49 f9     std     r10,928(r9)
   0x000000001507d9c4 <+216>:   da ff 42 3d     addis   r10,r2,-38
   0x000000001507d9c8 <+220>:   38 36 4a 39     addi    r10,r10,13880
   0x000000001507d9cc <+224>:   f8 ff 4b f9     std     r10,-8(r11)
   0x000000001507d9d0 <+228>:   18 00 09 f8     std     r0,24(r9)
   0x000000001507d9d4 <+232>:   58 03 49 e9     ld      r10,856(r9)
   0x000000001507d9d8 <+236>:   f8 ff 4a 39     addi    r10,r10,-8
   0x000000001507d9dc <+240>:   58 03 49 f9     std     r10,856(r9)
   0x000000001507d9e0 <+244>:   3b ff e2 3c     addis   r7,r2,-197
   0x000000001507d9e4 <+248>:   60 c0 67 38     addi    r3,r7,-16288
   0x000000001507d9e8 <+252>:   40 00 21 38     addi    r1,r1,64
=> 0x000000001507d9ec <+256>:   10 00 01 e8     ld      r0,16(r1)
   0x000000001507d9f0 <+260>:   a6 03 08 7c     mtlr    r0
   0x000000001507d9f4 <+264>:   20 00 80 4e     blr
   0x000000001507d9f8 <+268>:   00 00 00 00     .long 0x0
   0x000000001507d9fc <+272>:   00 00 00 01     .long 0x1000000
   0x000000001507da00 <+276>:   80 00 00 00     .long 0x80
End of assembler dump.
(gdb)

Let me know what else I could try.

Attachments (2)

0001-Be-less-untruthful-about-the-prototypes-of-external-.patch (2.2 KB) - added by cjwatson 16 months ago.
Be less untruthful about the prototypes of external functions
0002-Add-the-powerpc64le-architecture.patch (1.2 KB) - added by cjwatson 16 months ago.
Add the powerpc64le architecture


Change History (13)

Changed 16 months ago by cjwatson

Be less untruthful about the prototypes of external functions

Changed 16 months ago by cjwatson

Add the powerpc64le architecture

comment:1 Changed 16 months ago by cjwatson

I finally managed to figure this out, thanks in part to some debugging tips from slyfox on #ghc. The two necessary patches are attached, and I'd appreciate review. With this, I've been able to completely bootstrap GHC 7.8 on this architecture, albeit without GHCi for now.

comment:2 follow-up: ↓ 4 Changed 16 months ago by ezyang

Wow, nice catch! Are there any other places where we are improperly declaring null argument lists?

comment:3 Changed 16 months ago by thoughtpolice

  • Milestone set to 7.8.3
  • Status changed from new to patch

Excellent work Colin! I'll put this in for 7.8.3

comment:4 in reply to: ↑ 2 Changed 16 months ago by cjwatson

Replying to ezyang:

Wow, nice catch! Are there any other places where we are improperly declaring null argument lists?

I wouldn't like to categorically say no, because I don't know GHC anywhere near well enough for that. :-) The best I can do is to say that nothing else impeded the bootstrap on this architecture ...

I did find a couple of related problems:

  • While using an empty parameter list is better than a wrong parameter list, it's technically an obsolescent feature in C11, and the proper fix is to generate a correct prototype. The compiler hacking for this is beyond me.
  • There are a couple of uses of (at least) debugBelch in rts/*.cmm, which have a similar problem: if you try to run the compiler on ppc64el with -Da, for instance, it crashes because the call to debugBelch in stg_ap_0_fast corrupts the caller's stack, as it didn't realise it was calling a varargs function and so didn't allocate enough stack space. There's a debugBelch2 workaround in libraries/base for the same kind of problem; the RTS probably needs to do something similar. Generating correct prototypes in the compiler would fix this problem too.
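The fixed-arity wrapper idea mentioned in the last point can be sketched roughly as follows. This is only an illustration under stated assumptions: `belch` is a hypothetical stand-in for a variadic function such as debugBelch, and `belch2` mimics what a debugBelch2-style wrapper might look like; neither is the actual RTS or libraries/base code.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical variadic logger, standing in for something like debugBelch.
   It formats into a caller-supplied buffer so the result can be inspected. */
int belch(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(out, cap, fmt, ap);
    va_end(ap);
    return n;
}

/* Hypothetical fixed-arity wrapper. A caller that does not know belch is
   variadic can call this instead: the wrapper is an ordinary two-string
   function, and the variadic call below is made with the real prototype
   in scope, so the C compiler sets up the call frame correctly. */
int belch2(char *out, size_t cap, const char *fmt, const char *arg)
{
    return belch(out, cap, fmt, arg);
}
```

Calling `belch2(buf, sizeof buf, "hello %s", "ppc64el")` fills `buf` with the formatted string. The point of the design is that only the wrapper's C translation unit ever needs to know about the variadic signature.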

comment:5 Changed 16 months ago by slyfox

  • Cc slyfox@… added

Great catch indeed!

As for prototype mismatches, there is a very heavy hammer I tried a while ago:

./configure --enable-unregisterised CFLAGS=-flto LDFLAGS=-flto

It dies in Cmm on things like 'memset' being 'const'-incompatible with the stdlib.h declaration, but in general it is a very nice way to check for bugs repository-wide and to find real problems. I'll try that once again and post the most relevant bits.


comment:6 Changed 16 months ago by slyfox

http://code.haskell.org/~slyfox/ghc-7.9.20140412-lto.log.txt:

error: variable 'nocldstop' redeclared as function
warning: type of 'stg_MUT_ARR_PTRS_CLEAN_info' does not match original declaration [enabled by default]

comment:7 Changed 15 months ago by Austin Seipp <austin@…>

In 5a31f231eebfb8140f9b519b166094d9d4fc2d79/ghc:

Be less untruthful about the prototypes of external functions

GHC's generated C code uses dummy prototypes for foreign imports.  At the
moment these all claim to be (void), i.e. functions of zero arguments.  On
most platforms this doesn't matter very much: calls to these functions put
the parameters in the usual places anyway, and (with the exception of
varargs) things just work.

However, the ELFv2 ABI on ppc64 optimises stack allocation
(http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01149.html): a call to a
function that has a prototype, is not varargs, and receives all parameters
in registers rather than on the stack does not require the caller to
allocate an argument save area.  The incorrect prototypes cause GCC to
believe that all functions declared this way can be called without an
argument save area, but if the callee has sufficiently many arguments then
it will expect that area to be present, and will thus corrupt the caller's
stack.  This happens in particular with calls to runInteractiveProcess in
libraries/process/cbits/runProcess.c.

The simplest fix appears to be to declare these external functions with an
unspecified argument list rather than a void argument list.  This is no
worse for platforms that don't care either way, and allows a successful
bootstrap of GHC 7.8 on little-endian Linux ppc64 (which uses the ELFv2
ABI).

Fixes #8965

Signed-off-by: Colin Watson
Signed-off-by: Austin Seipp
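The difference between the two declaration styles described in the commit message can be illustrated with a small C sketch. Everything here is an assumption made for illustration: `sum_ten` is a hypothetical ten-argument function standing in for something like runInteractiveProcess, and this is not the code GHC actually generates. Note also that from C23 onwards an empty parameter list means `(void)`, so the sketch falls back to a full prototype there.

```c
/* Hypothetical ten-argument external function; it takes more arguments
   than fit in the eight general-purpose argument registers. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
/* C23 and later: "()" means "(void)", so a full prototype is needed. */
long sum_ten(long, long, long, long, long, long, long, long, long, long);
#else
/* Before C23: an empty list says "arguments unspecified", so the
   compiler has to make conservative assumptions at the call site. */
long sum_ten();
#endif

/* The misleading alternative would be:
       long sum_ten(void);
   which claims the function takes no arguments at all. It is kept in a
   comment because it would conflict with the definition below. */

/* Call site that only sees the declaration above. */
long call_sum_ten(void)
{
    return sum_ten(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
}

/* The real definition, normally in another translation unit. */
long sum_ten(long a, long b, long c, long d, long e,
             long f, long g, long h, long i, long j)
{
    return a + b + c + d + e + f + g + h + i + j;
}
```

Here `call_sum_ten()` returns the sum of 1 through 10. With the unspecified form the compiler cannot assume the callee is a fixed-arity function taking everything in registers, which is the property the fix relies on.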

comment:8 Changed 15 months ago by thoughtpolice

  • Status changed from patch to merge

Merged, thank you!

comment:9 Changed 15 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from merge to closed

Merged in 7.8, thanks!

comment:10 Changed 15 months ago by simonpj

See Colin's excellent blog post for more details.

comment:11 Changed 15 months ago by Simon Peyton Jones <simonpj@…>

In 31a7bb463b6a3e99ede6de994c1f449c43a9118c/ghc:

Add comments to explain the change to EF_ (Trac #8965)