Opened 3 years ago

Closed 2 years ago

#10375 closed bug (fixed)

arm: ghci hits an illegal instruction

Reported by: erikd Owned by:
Priority: high Milestone: 7.10.3
Component: Runtime System (Linker) Version: 7.10.1
Keywords: Cc: bgamari, hvr
Operating System: Unknown/Multiple Architecture: arm
Type of failure: GHCi crash Test Case:
Blocked By: Blocking:
Related Tickets: #10969 Differential Rev(s): Phab:D1323
Wiki Page:

Description (last modified by erikd)

Some GHCi tests on arm/linux ie:

(cd testsuite ; make TEST="print018 print020 print021 print022 print025")

fail, all with the same error:

Stderr:
Illegal instruction

Change History (69)

comment:1 Changed 3 years ago by erikd

Description: modified (diff)

Tests ghci053 ghcirun001 ghcirun002also fail in the same way.

comment:2 Changed 3 years ago by erikd

The tests failures for ghci053 ghcirun001 ghcirun002 all seem to be instances of the same failure, and can be boiled down to a mimimal example of:

data Planet = Mercury | Venus deriving Eq
Mercury == Mercury

Deriving Eq followed by a comparison is enough to cause GHCi to die with Illegal instruction. Turning this script into a program, compiling it and running the program works as it should.

This is purely a GHCi issue, quite possibly the run time linker.

comment:3 Changed 3 years ago by erikd

Summary: arm: ghci.debugger tests hit illegal instructionarm: ghci hits an illegal instruction

comment:4 Changed 3 years ago by simonpj

Is this ARM-specific? I think so.

comment:5 Changed 3 years ago by erikd

Yes, arm specific.

comment:6 Changed 3 years ago by erikd

Milestone: 7.12.17.10.2
Version: 7.117.10.1

Problem also exists on the ghc-7.10 branch.

comment:7 Changed 3 years ago by erikd

If I compile the RTS with -debug and run

inplace/bin/ghc-stage +RTS -DS -Dl -RTS --interactive

it fails as follows:

Prelude> data X = A | B deriving Eq
lookupSymbol: looking up ghczmprim_GHCziTypes_True_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziTypes_False_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziTypes_False_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziTypes_True_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziClasses_DZCEq_con_info
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziClasses_zeze_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziClasses_not_closure
lookupSymbol: symbol not found
Prelude> A == A
lookupSymbol: looking up ghczmprim_GHCziClasses_zeze_closure
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziTypes_ZC_con_info
lookupSymbol: symbol not found
lookupSymbol: looking up ghczmprim_GHCziTypes_ZMZN_closure
lookupSymbol: symbol not found
lookupSymbol: looking up base_GHCziBase_returnIO_closure
lookupSymbol: symbol not found
lookupSymbol: looking up base_GHCziShow_zdfShowBool_closure
lookupSymbol: symbol not found
lookupSymbol: looking up base_SystemziIO_print_closure
lookupSymbol: symbol not found
lookupSymbol: looking up base_GHCziBase_thenIO_closure
lookupSymbol: symbol not found
Segmentation fault

This (apart from the segfault) is identical to what happens on x86_64/linux compiled the same way.

comment:8 Changed 3 years ago by erikd

Turning the test case into a script:

data X = A | B deriving Eq
A == A

and running as:

inplace/bin/ghc-stage2 +RTS -Ds -Di -RTS --interactive < ghci-segfault.script

results in:

evaluating unknown closure -- yielding to sched
Object 0xb06446a4 = BLACKHOLE(0xb0619a6c)
    b14ff460: cap 0: thread 17 stopped (yielding)
    b14ff460: --<< thread 17 (ThreadRunGHC) stopped to switch evaluators
    b14ff460: cap 0: running thread 17 (ThreadRunGHC)
Segmentation fault

The x86_64/linux version also has this stuff but doesn't segfault:

evaluating unknown closure -- yielding to sched
Object 0x7fbea1548150 = BLACKHOLE(0x7fbea1858cb8)
7fbea22fd700: cap 0: thread 19 stopped (yielding)
7fbea22fd700: --<< thread 19 (ThreadRunGHC) stopped to switch evaluators
7fbea22fd700: cap 0: running thread 19 (ThreadRunGHC)
7fbea22fd700: cap 0: thread 19 stopped (yielding)
7fbea22fd700: --<< thread 19 (ThreadInterpret) stopped to switch evaluators
7fbea22fd700: cap 0: running thread 19 (ThreadInterpret)

comment:9 Changed 3 years ago by erikd

A bit of printf debugging shows that the segfault is happening in the schedule function of rts/Schedule.c, specifically on the call to StgRun around line number 460:

    case ThreadRunGHC:
    {
        StgRegTable *r;
        r = StgRun((StgFunPtr) stg_returnToStackTop, &cap->r);
        cap = regTableToCapability(r);
        ret = r->rRet;
        break;
    }

and on Arm, I have confirmed that StgRun is the assembler version.

Last edited 3 years ago by erikd (previous) (diff)

comment:10 Changed 3 years ago by erikd

What I'd like to do at this point is to run ghc-stage under GNU GDB. To load it in GDB, I've copied inplace/bin/ghc-stage2 and added gdb --args in front of the line:

"$executablename" -B"$topdir" ${1+"$@"}

I've also created a .ghci file in the current directory containing the test case from #comment:8 and tested that it does segfault when run without GDB. Unfortunately, with GDB, there is no segfault (nor do a get a GHCi prompt), I suspect because of some interaction between GDB and GHCi.

Last edited 3 years ago by erikd (previous) (diff)

comment:11 Changed 3 years ago by erikd

Managed to run ghc-stage2 --interactive in one terminal and the attach GDB to it in another using ghc -tui -p <pid>. I can then manually enter the two lines of test case code and have it segfault.

The GDB backtrace is a little odd (even by haskell program standards):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb14ff460 (LWP 19914)]                  
Cannot access memory at address 0xb88849d4
(gdb) bt
#0  0xb88849d4 in ?? ()
#1  0x70000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

but that address 0xb88849d4 being in both the "Cannot access memory at" message and the stack might point to stack corruption while the 0x70000000 address on the stack is *very* suspicious.

Using the Linux pmap <pid> command, I was able to confirm that both addresses, 0xb88849d4 and 0x70000000 are outside the process' address map. Address 0x70000000 is nowhere near anything, and the nearest thing to 0xb88849d4 is:

...
b6f2f000      4K r---- ld-2.19.so
b6f30000      4K rw--- ld-2.19.so
beb52000    136K rw---   [ stack ]
...

so that is also a no-man's-land address.

Last edited 3 years ago by erikd (previous) (diff)

comment:12 Changed 3 years ago by erikd

Tried a bunch of other -Dx RTS debugging options as found in https://ghc.haskell.org/trac/ghc/wiki/Debugging/RuntimeSystem but none of them provide any new information.

comment:13 Changed 3 years ago by bgamari

Hmm, excellent sleuthing work so far. Have you tried running in Valgrind? It typically struggles to do anything useful on Haskell code but you never know. You may also want to add a printf outputting all of the relevant local bindings around line 460 to see where exactly this mysterious address is coming from.

comment:14 Changed 3 years ago by erikd

Valgrind has zero complaints before the program segfaults. Now looking to dump the state of the stack around problem code.

comment:15 Changed 3 years ago by erikd

Using the gdb macros on the wiki I can retrieve the following info after the segfault:

Program received signal SIGSEGV, Segmentation fault.                                                                                                         
[Switching to Thread 0xb14ff460 (LWP 7214)]                                                                                                              
0xb88d09d4 in ?? ()                                                                                                                                          
(gdb) bt                                                                                                                                                 
#0  0xb88d09d4 in ?? ()
#1  0x70000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) pregs
$1 = {rR1 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     ,rR2 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR3 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR4 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR5 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR6 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR7 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR8 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR9 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rR10 = {w = 0, a = 0x0, c = 0, f = 0, i = 0, p = 0x0}
     , rF1 = 0, rF2 = 0, rF3 = 0, rF4 = 0, rF5 = 0, rF6 = 0
     , rD1 = 0, rD2 = 0, rD3 = 0, rD4 = 0, rD5 = 0, rD6 = 0
     , rXMM1 = {h = 0, l = 0}
     , rXMM2 = {h = 0, l = 0}
     , rXMM3 = {h = 0, l = 0}
     , rXMM4 = {h = 0, l = 0}
     , rXMM5 = {h = 0, l = 0}
     , rXMM6 = {h = 0, l = 0}
     , rYMM1 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rYMM2 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rYMM3 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rYMM4 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rYMM5 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rYMM6 = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}
     , rZMM1 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rZMM2 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rZMM3 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rZMM4 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rZMM5 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rZMM6 = {h = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}, l = {h = {h = 0, l = 0}, l = {h = 0, l = 0}}}
     , rL1 = 0, rSp = 0x0, rSpLim = 0x0, rHp = 0x0, rHpLim = 0xb08fbfff
     , rCCCS = 0x0, rCurrentTSO = 0xb09cc8a4, rNursery = 0x109078
     , rCurrentNursery = 0xb0801f60, rCurrentAlloc = 0xb0901980, rHpAlloc = 0, rRet = 3}
(gdb) ptso
$2 = {header = {info = 0xb2bf43c0 <stg_TSO_info$def>}, _link = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
     , global_link = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
     , stackobj = 0xb09cc4f4, what_next = 1, why_blocked = 0, flags = 0
     , block_info = {closure = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
                    , prev = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
                    , bh = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
                    , throwto = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
                    , wakeup = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
                    , fd = -1295956264
                    }
     , id = 22, saved_errno = 0, dirty = 1
     , bound = 0x0, cap = 0xb2c16180 <MainCapability>
     , trec = 0xb2c146d0 <stg_NO_TREC_closure>
     , blocked_exceptions = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
     , bq = 0xb2c146d8 <stg_END_TSO_QUEUE_closure>
     , alloc_limit = 2592, tot_stack_size = 232
     }
(gdb) info registers
r0             0xb6f73018       3069653016
r1             0xb2bf434c       2998879052
r2             0x1      1
r3             0xb09cc7c6       2963064774
r4             0xb2c16190       2999017872
r5             0xb09cc81c       2963064860
r6             0xb08fbbf4       2962209780
r7             0xb09cc938       2963065144
r8             0xb14fefa0       2974805920
r9             0x0      0
r10            0xb2bf4160       2998878560
r11            0xb09cc558       2963064152
r12            0xb2c15b8c       2999016332
sp             0xb14fcd68       0xb14fcd68
lr             0x70000000       1879048192
pc             0xb88d09d4       0xb88d09d4
cpsr           0x400f0010       1074724880

The pc register definitetly does contain a bad value, but there's no good explanation of how it got there.

comment:16 Changed 3 years ago by bgamari

Okay. For the record, you are using a statically linked GHC, yes (DYNAMIC_GHC_PROGRAMS=NO I believe)?

Perhaps you could set a breakpoint at rts/Schedule.c:460 and have a look at stg_returnToStackTop and cap?

comment:17 Changed 3 years ago by erikd

This is definitely not statically linked (the lld command shows dynamically linked Haskell libraries like haskeline, terminfo, transformers as well as the usual C library suspects.

Debugging this in GDB is *really* difficult. Firstly, I can only recreate the problem when running with --interactive. Secondly, GDB and ghc-stage2 --interactive both try to take control of the same terminals so I have to start the ghc-stage2 --interactive process and connect gdb to it with gdb -p $(pidof ghc-stage2). The .gchi file contains:

putStr "Continue ... "
_ <- getChar
putStrLn "ok"
data X = Y deriving Eq
Y == Y

The getChar is there to pause the ghc-stage2 process long enough to attach GDB. Unfortunately, this means that the number of passes through the code I'm trying to debug is not deterministic which makes setting a breakpoint rather difficult.

I am currently playing around with replacing getChar with threadDelay to see if I can make it determinisitc.

comment:18 Changed 3 years ago by erikd

Even switching to using threadDelay doesn't make it deterministic under GDB, but it is deterministic when just run at the command line. Giving up on GDB because it has a Heisenberg-like effect on the problem. Will need to debug this with printf.

If I add a statc int variable in the case ThreadRunGHC block and increment it every time the code passes through this block and I run it at the command line, the count gets to 152 or 153 before the crash.

comment:19 Changed 3 years ago by erikd

Added a static int count=0 variable that gets incremented each time the schedule function passes through the ThreadRunGHC case statement. I then added a printf:

printf ("%d : %u %u\n", count, cap->no, (unsigned)cap->r.rCurrentTSO->id);

and running it with the getChar in .ghci I note that the last set of values printed by this statement before it segfaults is:

183 : 0 22

So I run it again and attached GDB when the process is waiting for getChar and then set a breakpoint:

(gdb) break rts/Schedule.c:490 if (count == 183 && cap->no == 0 &&
cap->r.rCurrentTSO->id == 22)

only to have the process crash in a completely different way:

Program received signal SIGILL, Illegal instruction. 
0xb2c56c7a in schedule (initialCapability=0xb2c96180 <MainCapability>,
task=0x10faf8) at rts/Schedule.c:491                                         

(gdb) bt
#0  0xb2c56c7a in schedule (initialCapability=0xb2c96180 <MainCapability>,
task=0x10faf8) at rts/Schedule.c:491
#1  0xb2c59040 in scheduleWaitThread (tso=0xb2707000, ret=0x0, pcap=0xbeffe944)
at rts/Schedule.c:2408
#2  0xb2c47c98 in rts_evalLazyIO (cap=0xbeffe944, p=0xd2c30
<ZCMain_main_closure>, ret=0x0) at rts/RtsAPI.c:500
#3  0xb2c5a0fc in real_main () at rts/RtsMain.c:63
#4  0xb2c5a1d0 in hs_main (argc=3, argv=0xbeffeb04, main_closure=0xd2c30
<ZCMain_main_closure>, rts_config=...) at rts/RtsMain.c:114
#5  0x000cd2c6 in main ()

(gdb) dis 0xb2c56c7a
warning: bad breakpoint number at or near '0xb2c56c7a'

comment:20 Changed 3 years ago by erikd

This does not need deriving Eq. The following in .ghci is enough to trigger it:

data X = Y
let eqX = \ a b -> case a of { Y -> case b of { Y -> True }}
eqX Y Y

but none of the other obvious nullary contructors like:

True == True
Nothing == Nothing

trigger it.

comment:21 in reply to:  15 Changed 3 years ago by carter

@erikd, your register snapshot doesnt make sense, it names x86_64 registers, namely XMM*, YMM* , and ZMM* families, but your code is on ARM right? Or am I misreading the dump?

Last edited 3 years ago by erikd (previous) (diff)

comment:22 Changed 3 years ago by bgamari

@carter, I was perplexed by this at first as well but if you look at the info registers output you'll see that GDB is indeed targetting the correct architecture. YMM and friends are only mentioned by the output of pregs (source here, https://ghc.haskell.org/trac/ghc/attachment/wiki/Debugging/CompiledCode/.gdbinit) which just prints the current thread's StgRegTable and indeed has fields of these names regardless of the architecture..

comment:23 Changed 3 years ago by erikd

By hacking mk/config.mk.in I was able to compile with DYNAMIC_GHC_PROGRAMS set to NO and create a ghc-stage2 compiler that dynamically links only to the C libraries. It also crashes just in a different way. Regardless of whether a .ghci file exists, ghc-stage2 --interactive crashes with Illegal instruction as soon as it is run and before it even diplays a prompt:

$ inplace/bin/ghc-stage2 --interactive
GHCi, version 7.11.20150523: http://www.haskell.org/ghc/  :? for help
Illegal instruction

Checking the settings file, I find that the ld.gold linker *is* being used.

comment:24 Changed 3 years ago by erikd

With the DYNAMIC_GHC_PROGRAMS = NO version of ghc-stage2 I can actually run it under GDB directly, instead of running it in one terminal and attaching GDB to it from another terminal. However, if I try to set a breakpoint suddenly GDB and ghc-stage2's terminal manipulation start fighting each other again.

The above problem means I'm back to attaching GDB to ghc-stage2 from a separate terminal, but I can't do it with a getChar in the .gchi file because the statically linked ghc-stage2 is crashing before it loads the .ghci file. Instead, I have hacked ghc/Main.hs as follows:

 
 main :: IO ()
 main = do
+   args <- getArgs
+   if "--interactive" `elem` args
+       then do
+           putStr "Continue ... "
+           hFlush stdout
+           _ <- getChar
+           putStrLn "done"
+       else return ()
    initGCStatistics -- See Note [-Bsymbolic and hooks]
    hSetBuffering stdout LineBuffering
    hSetBuffering stderr LineBuffering

This does allow me to attach GDB, but then I'm stuck with the same problem as before, attaching GDB changes the state of the program to an extent where I can't catch the problem.

Valgrind is still completely silent apart from "possibly lost: 408 bytes in 3 blocks".

Last edited 3 years ago by erikd (previous) (diff)

comment:25 Changed 3 years ago by erikd

By hacking configure.ac I managed to build GHC with -fsanitize=address passed to GCC.

Absolutely, positively, 100% certain that something is wrong:

$ inplace/bin/ghc-stage2 --interactive
ASAN:SIGSEGV
=================================================================
==7051==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
(pc 0x00000000 sp 0xbeffe5f0 bp 0x37dffd15 T0)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 ??

ghc-stage2 compiled with AddressSanitizer on x86_64/linux does not do the above.

Unfortunately, while AddressSanitizer has found a problem, it doesn't provide any new information. I suspect it doesn't know enough about GHC's run time and calling conventions.

Last edited 3 years ago by erikd (previous) (diff)

comment:26 Changed 3 years ago by erikd

This is interesting. with GHC compiled with AddressSanitizer, it no longer needs --interactive to bring on a crash. In fact just running inplace/bin/ghc-stage2 on the command line with no arguments crashes with a SIGSEGV. In GDB:

(gdb) r
Starting program: /home/erikd/Git/ghc-upstream/inplace/lib/bin/ghc-stage2
    -B/home/erikd/Git/ghc-upstream/inplace/lib
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xb6b2db4c in pthread_cond_init () from /usr/lib/arm-linux-gnueabihf/
    libasan.so.1
#2  0xb1e1407c in newTask (worker=rtsFalse) at rts/Task.c:218
#3  0xb1e14702 in allocTask () at rts/Task.c:128
#4  newBoundTask () at rts/Task.c:304
#5  0xb1df6970 in rts_lock () at rts/RtsAPI.c:556
#6  0xb1e40398 in ioManagerStart () at rts/posix/Signals.c:218
#7  0xb1dec428 in hs_init_ghc (argc=argc@entry=0xbeffe87c, argv=argv@entry=
    0xbeffe878, rts_config=...) at rts/RtsStartup.c:263
#8  0xb1e14fb2 in hs_main (argc=-1310605083, argv=0xd88e1 <StackOverflowHook>,
    main_closure=0xdeb34 <ZCMain_main_closure>, rts_config=...) at
    rts/RtsMain.c:51
#9  0x000d8b3a in main ()

Hopefully this isn't a red herring.

comment:27 Changed 3 years ago by erikd

I smell a rat, or maybe its a red herring. Seem to be getting SIGSEGV inside the call to pthread_cond_init. A bit of googling tells me that AddressSanitizer intercepts calls to pthread_cond_init and can get it wrong eg : https://code.google.com/p/address-sanitizer/issues/detail?id=297 .

Sure enough, trivial C program:

#include <stdio.h>
#include <pthread.h>
int main (void)
{   pthread_cond_t cond;
    pthread_cond_init(&cond, NULL);
    return 0;
}

Works correctly when compiled and run (even under Valgrind) as:

gcc -Wall cond_init.c -o cond_init && valgrind ./cond_init

but crashes with a SIGSEGV when compiled and run as:

gcc -Wall -fsanitize=address cond_init.c -o cond_init && ./cond_init

Reported this as a bug in Debian's gcc : https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786850

comment:28 Changed 3 years ago by erikd

Debian's clang compiler is also busted when using -fsanitize=address, but its seems that Debian's gcc-5 compiles the above C program correctly.

Also note that #10383 (on AArch64/Arm64) is likely to be the same bug as this one. Wonder if a 64 bit address space might make it easier to debug.

comment:29 Changed 3 years ago by erikd

Got GHC compiling with gcc-5 -fsanitize=address and AddressSanitizer did not find anything to complain about, but ghc-stage2 still crashes with either a segfault or an illegal instruction depending on where it jumps to.

comment:30 Changed 3 years ago by erikd

Have checked that arm_HOST_ARCH_PRE_ARMv7 is undefined on this ARMv7 machine.

comment:31 Changed 3 years ago by erikd

Priority: normalhigh

comment:32 Changed 3 years ago by thoughtpolice

Milestone: 7.10.27.10.3

Moving to 7.10.3 milestone - if you think this is an error (or a showstopping bug), please remilestone it to 7.10.2 and let us know why.

comment:33 Changed 2 years ago by rwbarton

Just for context, what kind of configuration is this? e.g. DYNAMIC_GHC_PROGRAMS=YES/NO (seems both were mentioned in the comments, so maybe it is broken for both)? Registerised (with LLVM) or unregisterised? If registerised, then does the unregisterised version work? Could you add this information to the ticket description?

I think ghci at least used to work in some configuration on ARM, right?

Last edited 2 years ago by rwbarton (previous) (diff)

comment:34 Changed 2 years ago by erikd

The first shipped version of GHC with GHCI support for ARM was ghc-7.8. I just compiled ghc-7.8.4 on ARM and the resulting GHCi fails just like it does for ghc-7.10 and ghc-7.11.

For ghc-7.11 (git master) I have tried:

  • DYNAMIC_GHC_PROGRAMS set to both YES and NO and have the same crash either way.
  • Unregisterised set to YES and to NO. Both crash in the same way.

comment:35 Changed 2 years ago by erikd

Just tried compiling with SMP disabled (removing arm from ArchSupportsSMP in mk/config.mk.in) and GHCi fails just like before.

This is however proof that this issue is not SMP related.

comment:36 Changed 2 years ago by Ansible

Just wanted to point out that on raspberry pi 2 with debian jessie, ghc and ghci are both broken. I tried the ghc binary from the ghc site, and it fails with the "Planet" test (from the top of this thread) in ghci, and in compiled code it fails "hello world". So its pretty much a show stopper for ARM.

GHCi, version 7.10.2: http://www.haskell.org/ghc/  :? for help
Prelude> data Planet = Mercury | Venus deriving Eq
Prelude> Mercury == Mercury
Illegal instruction
bburdette@jessie-rpi:~$

And my hello world program:

main = do
  putStrLn "hello"

The results:

bburdette@jessie-rpi:~$ ghc hello.hs
[1 of 1] Compiling Main             ( hello.hs, hello.o )
Linking hello ...
bburdette@jessie-rpi:~$ ls
bin   ghc-7.10.2-arm-unknown-linux.tar.xz  hello.hi  hello.o
code  hello                                hello.hs
bburdette@jessie-rpi:~$ ./hello
Illegal instruction
bburdette@jessie-rpi:~$ ghc -O2 hello.hs
bburdette@jessie-rpi:~$ ./hello
Illegal instruction
bburdette@jessie-rpi:~$

comment:37 Changed 2 years ago by rwbarton

Interesting that you see this behavior also from a compiled, non-threaded, statically linked (I assume) program. That should be much easier to debug. You are using the gold linker right?

comment:38 Changed 2 years ago by Ansible

I did have the gold linker enabled when I did that test. since then I've gone back to arch and 7.8.2. ghc --info output is below (from debian jesse on raspberry pi 2):

 [("Project name","The Glorious Glasgow Haskell Compilation System")
 ,("GCC extra via C opts"," -fwrapv")
 ,("C compiler command","/usr/bin/gcc")
 ,("C compiler flags"," -fno-stack-protector")
 ,("C compiler link flags"," -fuse-ld=gold -Wl,-z,noexecstack")
 ,("Haskell CPP command","/usr/bin/gcc")
 ,("Haskell CPP flags","-E -undef -traditional ")
 ,("ld command","/usr/bin/ld.gold")
 ,("ld flags"," -z noexecstack")
 ,("ld supports compact unwind","YES")
 ,("ld supports build-id","YES")
 ,("ld supports filelist","NO")
 ,("ld is GNU ld","YES")
 ,("ar command","/usr/bin/ar")
 ,("ar flags","q")
 ,("ar supports at file","YES")
 ,("touch command","touch")
 ,("dllwrap command","/bin/false")
 ,("windres command","/bin/false")
 ,("libtool command","libtool")
 ,("perl command","/usr/bin/perl")
 ,("cross compiling","NO")
 ,("target os","OSLinux")
 ,("target arch","ArchARM {armISA = ARMv7, armISAExt = [VFPv3,NEON],
armABI = HARD}")
 ,("target word size","4")
 ,("target has GNU nonexec stack","False")
 ,("target has .ident directive","True")
 ,("target has subsections via symbols","False")
 ,("Unregisterised","NO")
 ,("LLVM llc command","/usr/bin/llc-3.5")
 ,("LLVM opt command","/usr/bin/opt-3.5")
 ,("Project version","7.10.2")
 ,("Project Git commit id","0da488c4438d88c9252e0b860426b8e74b5fc9e8")
 ,("Booter version","7.6.3")
 ,("Stage","2")
 ,("Build platform","arm-unknown-linux")
 ,("Host platform","arm-unknown-linux")
 ,("Target platform","arm-unknown-linux")
 ,("Have interpreter","YES")
 ,("Object splitting supported","NO")
 ,("Have native code generator","NO")
 ,("Support SMP","YES")
 ,("Tables next to code","YES")
 ,("RTS ways","l debug thr thr_debug thr_l thr_p dyn debug_dyn thr_dyn
thr_debug_dyn l_dyn thr_l_dyn")
 ,("Support dynamic-too","YES")
 ,("Support parallel --make","YES")
 ,("Support reexported-modules","YES")
 ,("Support thinning and renaming package flags","YES")
 ,("Uses package keys","YES")
 ,("Dynamic by default","NO")
 ,("GHC Dynamic","YES")
 ,("Leading underscore","NO")
 ,("Debug on","False")
 ,("LibDir","/usr/local/lib/ghc-7.10.2")
 ,("Global Package DB","/usr/local/lib/ghc-7.10.2/package.conf.d")
 ]


comment:39 Changed 2 years ago by Ansible

although maybe gold would be a good thing to try on 7.8.2. hmm.

comment:40 Changed 2 years ago by erikd

If I configure the build with --enable-unregisterised, built it and then re-ran my tests.

The program still crashes with SIGILL, but I now get a sensible backtrace:

GHCi, version 7.11.20150905: http://www.haskell.org/ghc/  :? for help

Program received signal SIGILL, Illegal instruction.
0x03ff9dbc in stg_ap_v_fast ()
(gdb) bt
#0  0x03ff9dbc in stg_ap_v_fast ()
#1  0x03fc6ce6 in StgRun (f=0x3ff9db4 <stg_ap_v_fast>, basereg=0x49d2090 <MainCapability+16>) at
    rts/StgCRun.c:81
#2  0x03fca52a in schedule (initialCapability=0x49d2080 <MainCapability>, task=0x49e62c0) at 
    rts/Schedule.c:524
#3  0x03fcc5e6 in scheduleWaitThread (tso=0xb6c07000, ret=0x0, pcap=0xbeffeb74) at 
    rts/Schedule.c:2429
#4  0x03fbc7e4 in rts_evalLazyIO (cap=0xbeffeb74, p=0x402d470 <ZCMain_main_closure>, ret=0x0) at 
    rts/RtsAPI.c:500
#5  0x03fce2be in hs_main (argc=3, argv=0xbeffed64, main_closure=0x402d470 <ZCMain_main_closure>, 
    rts_config=...) at rts/RtsMain.c:64
#6  0x00141508 in main ()

and if I disassemble at the SIGILL I get:

(gdb) disassemble
Dump of assembler code for function stg_ap_v_fast:
   0x03ff9db4 <+0>:     push    {r7, lr}
   0x03ff9db6 <+2>:     sub     sp, #32
   0x03ff9db8 <+4>:     add     r7, sp, #0
   0x03ff9dba <+6>:     ldr     r3, [pc, #508]  ; (0x3ff9fb8 <stg_ap_v_fast+516>)
=> 0x03ff9dbc <+8>:     ldr     r3, [r3, #0]
   0x03ff9dbe <+10>:    and.w   r3, r3, #3

Now to figure out why that load instruction is illegal.

Last edited 2 years ago by erikd (previous) (diff)

comment:41 Changed 2 years ago by erikd

From: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGJHED.html

The instruction matches:

op{type}{cond} Rt, [Rn, #offset]!          ; pre-indexed

but the under the section titled "Register restrictions" it says:

Rn must be different from Rt in the pre-index and post-index forms.

comment:42 Changed 2 years ago by erikd

Things to note:

  • CPU register r3 is the argument/result/scratch register.
  • stg_ap_v_fast is generated CMM code generated by utils/genapply.
  • Since this is on Arm, the CMM code is passed throught the LLVM backend.

Confirming that we are actually generating invalid Arm assembly:

utils/genapply/dist/build/tmp/genapply > suspect.cmm
inplace/bin/ghc-stage2 -Irts -c suspect.cmm -o suspect.o

and then objdump -D suspect.o and searching for stg-ap_v_fast results in:

0000853c <stg_ap_v_fast>:
    853c:    b480      push   {r7}
    853e:    b087      sub    sp, #28
    8540:    af00      add    r7, sp, #0
    8542:    4b76      ldr    r3, [pc, #472]  ; (871c <stg_ap_v_fast+0x1e0>)
    8544:    681b      ldr    r3, [r3, #0]

which is exactly what we found in GDB.

Last edited 2 years ago by erikd (previous) (diff)

comment:43 Changed 2 years ago by erikd

Compiling as above with -keep-s-file -keep-llvm-file:

stg_ap_v_fast$def:                      @ @"stg_ap_v_fast$def"
        .fnstart
.Leh_func_begin69:
@ BB#0:                                 @ %cmL
        movw    r0, :lower16:MainCapability
        movt    r0, :upper16:MainCapability
        ldr     r1, [r0, #16]
        and     r2, r1, #3
        cmp     r2, #1
        bne     .LBB69_2
@ BB#1:                                 @ %cmn
        ldr     r0, [r1, #-1]
        ldr     r0, [r0]

and the LLVM code:

@stg_ap_v_fast = alias i8* bitcast (void ()* @stg_ap_v_fast$def to i8*)
define ccc void @stg_ap_v_fast$def() align 4 nounwind
{
cmL:
  %lcmm = alloca i32, i32 1
  %lcmA = alloca i32, i32 1
  %lcmC = alloca i32, i32 1
  %ln3zE = ptrtoint i8* @MainCapability to i32
  %ln3zF = add i32 %ln3zE, 16
  %ln3zG = inttoptr i32 %ln3zF to i32*
  %ln3zH = load i32* %ln3zG, !tbaa !5
  %ln3zI = and i32 %ln3zH, 3
  %ln3zJ = icmp eq i32 %ln3zI, 1
  br i1 %ln3zJ, label %cmn, label %cmp
cmp:
  %ln3zK = ptrtoint i8* @MainCapability to i32
  %ln3zL = add i32 %ln3zK, 808
  %ln3zM = inttoptr i32 %ln3zL to i32*
  %ln3zN = load i32* %ln3zM, !tbaa !5
  %ln3zO = add i32 %ln3zN, -4

which suggests this is an LLVM Arm code gen bug.

comment:44 Changed 2 years ago by erikd

Got some help from the llvm-dev mailing list. People there pointed out that this:

(gdb) disassemble
Dump of assembler code for function stg_ap_v_fast:
   0x03ff9db4 <+0>:     push    {r7, lr}
   0x03ff9db6 <+2>:     sub     sp, #32
   0x03ff9db8 <+4>:     add     r7, sp, #0
   0x03ff9dba <+6>:     ldr     r3, [pc, #508]  ; (0x3ff9fb8 <stg_ap_v_fast+516>)
=> 0x03ff9dbc <+8>:     ldr     r3, [r3, #0]
   0x03ff9dbe <+10>:    and.w   r3, r3, #

show the CPU executing Thumb instructions.

Thumb instructions are encoded in two btyes (and hence instructions with addresses that are even) while Arm instructions are encoded in 4 bytes (so instructions should be aligned at 4 bytes offsets).

This could be that GDB getting the CPU mode incorrect. Not sure how to figure out if that is the case or even if and how that could happen.

comment:45 Changed 2 years ago by erikd

Component: GHCiRuntime System (Linker)

All the previous debugging had produced confusing, contradictory and un-repeatable results so I dropped back to simple wolf-fence debugging with putStrLn and over about 3 hours managed to narrow this down to the function GhciMonad.turnOffBuffering. Modifying that function as follows:

turnOffBuffering :: IO ()
turnOffBuffering
 = do putStrLn "turnOffBuffering start"
      hdls <- mapM getHandle [stdin_ptr,stdout_ptr,stderr_ptr]
      putStrLn "turnOffBuffering middle"
      mapM_ (\h -> hSetBuffering h NoBuffering) hdls
      putStrLn "turnOffBuffering end"

and running the previous test results in:

GHCi, version 7.11.20150910: http://www.haskell.org/ghc/  :? for help
turnOffBuffering start
turnOffBuffering middle
Illegal instruction

so that it seems that calling hSetBuffering causes the crash. However, hSetBuffering is not the problem. Rather, hSetBuffering is the first function to be run from the Base package which is loaded with the runtime linker.

Seeing this reminded me of #2972 which was a linker problem, so modifying the "Component" field of this ticket to reflect that.

comment:46 Changed 2 years ago by erikd

Reading more from #2972, I see that all the comments about PowerPC being limited to 24 bit offsets for relative branch instructions should also apply to Arm. However, enabling USE_CONTIGUOUS_MMAP in rts/linker.c for arm_HOST_ARCH as well as powerpc_HOST_ARCH doesn't fix this issue.

Last edited 2 years ago by erikd (previous) (diff)

comment:47 in reply to:  41 Changed 2 years ago by rwbarton

Replying to erikd:

From: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGJHED.html

The instruction matches:

op{type}{cond} Rt, [Rn, #offset]!          ; pre-indexed

but the under the section titled "Register restrictions" it says:

Rn must be different from Rt in the pre-index and post-index forms.

Pre-indexed instructions have a trailing !, though. This syntax means something like "increment Rn by #offset, then operate on the value pointed to by the updated Rn". That's why Rn cannot also appear as Rt, there would be two stores to the same register in the same instruction.

The offending instruction here is ldr r3, [r3, #0] without a !, which is just a r3 = *r3.

I suspect the issue is actually executing Thumb code in ARM mode, as appears to be the case in #10863, though I still have no idea as to the cause.

comment:48 Changed 2 years ago by erikd

Since I know that this is some problem with the function hSetBuffering when it is being loaded by the runtime linker I added a printf to rts/Linker.c to print the address of the hSetBuffering related functions when they are loaded. Running this now looks like:

$ inplace/bin/ghc-stage2 --interactive
GHCi, version 7.11.20150912: http://www.haskell.org/ghc/  :? for help
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f3d3cd
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f3d3cd
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_info is 0xb421f738
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_closure is 0xb421d5b0
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering_entry is 0xb3f3d471
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering_info is 0xb421f718
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f3d3cd
turnOffBuffering start
turnOffBuffering middle
Illegal instruction

If I now load this into gdb, run it until it crashes I get:

GHCi, version 7.11.20150912: http://www.haskell.org/ghc/  :? for help
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f503cd
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f503cd
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_info is 0xb4232738
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_closure is 0xb42305b0
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering_entry is 0xb3f50471
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering_info is 0xb4232718
lookupSymbol: value of base_GHCziIOziHandle_hSetBuffering1_entry is 0xb3f503cd
turnOffBuffering start
turnOffBuffering middle

Program received signal SIGILL, Illegal instruction.
0x04004e5c in stg_ap_v_fast ()
(gdb) disass base_GHCziIOziHandle_hSetBuffering1_entry
Dump of assembler code for function base_GHCziIOziHandle_hSetBuffering1_entry:
   0x03d423c4 <+0>:     movw    r3, #58688      ; 0xe540
   0x03d423c8 <+4>:     movt    r3, #1181       ; 0x49d
   0x03d423cc <+8>:     ldr.w   r2, [r3, #808]  ; 0x328
   0x03d423d0 <+12>:    subs    r2, #4

For some reason, the address of base_GHCziIOziHandle_hSetBuffering1_entry in GDB is not the address printed by the runtime linker.

Even more confusingly, disassembling the address the 0xb3f3d3cd fails, possibly because its outside the memory map.

comment:49 Changed 2 years ago by erikd

Added some debug to rts/Linker.c to print anything related to the function hSetBuffering and get:

GHCi, version 7.11.20150912: http://www.haskell.org/ghc/  :? for help

HSbase-4.8.2.0-FkKrfrxMoa06H6w6uNoJiJ.o, nentires 414047, section 1, symtab 14, strtab 15
base_GHCziIOziHandle_hSetBuffering1_entry   is 0xb3eb63cd, is_thumb 1, R_ARM_THM_MOVW_ABS_NC 
base_GHCziIOziHandle_hSetBuffering1_entry   is 0xb3eb63cd, is_thumb 1, R_ARM_THM_MOVT_ABS    

HSbase-4.8.2.0-FkKrfrxMoa06H6w6uNoJiJ.o, nentires 103794, section 2, symtab 14, strtab 15
base_GHCziIOziHandle_hSetBuffering1_info    is 0xb4198738, is_thumb 0, R_ARM_ABS32           
base_GHCziIOziHandle_hSetBuffering1_closure is 0xb41965b0, is_thumb 0, R_ARM_ABS32           
base_GHCziIOziHandle_hSetBuffering_entry    is 0xb3eb6471, is_thumb 1, R_ARM_ABS32           
base_GHCziIOziHandle_hSetBuffering_info     is 0xb4198718, is_thumb 0, R_ARM_ABS32           
base_GHCziIOziHandle_hSetBuffering1_entry   is 0xb3eb63cd, is_thumb 1, R_ARM_ABS32           

turnOffBuffering start
turnOffBuffering middle
Illegal instruction

Not sure why base_GHCziIOziHandle_hSetBuffering1_entry has 3 entries, all with the same address, but with different ELF_R_TYPE types, ie R_ARM_THM_MOVW_ABS_NC, R_ARM_THM_MOVT_ABS and R_ARM_THM_MOVT_ABS.

comment:50 Changed 2 years ago by erikd

According to readelf -S on the object file the sections are:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000038 4c7518 00  AX  0   0  8
  [ 2] .data             PROGBITS        00000000 4c7550 111ab8 00  WA  0   0  8
  [ 3] .bss              NOBITS          00000000 5d9008 000000 00  WA  0   0  4
  [ 4] .init_array       INIT_ARRAY      00000000 5d9008 000004 00  WA  0   0  4
  [ 5] .rodata.str1.4    PROGBITS        00000000 5d900c 00000e 01 AMS  0   0  4
  [ 6] .rodata           PROGBITS        00000000 5d901c 00dd40 00   A  0   0  4
  [ 7] .comment          PROGBITS        00000000 5e6d5c 00001d 01  MS  0   0  1
  [ 8] .rel.text         REL             00000000 5e6d7c 328af8 08   I 14   1  4
  [ 9] .rel.data         REL             00000000 90f874 0cab90 08   I 14   2  4
  [10] .rel.init_array   REL             00000000 9da404 000008 08   I 14   4  4
  [11] .rel.rodata       REL             00000000 9da40c 008820 08   I 14   6  4
  [12] .note.GNU-stack   PROGBITS        00000000 9e2c2c 000000 00      0   0  0
  [13] .ARM.attributes   ARM_ATTRIBUTES  00000000 9e2c2c 000035 00      0   0  1
  [14] .symtab           SYMTAB          00000000 9e2c64 29dc00 10     15 122623  4
  [15] .strtab           STRTAB          00000000 c80864 2dd2e5 00      0   0  1
  [16] .shstrtab         STRTAB          00000000 f5db49 000088 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

comment:51 Changed 2 years ago by erikd

Bingo!

I was building the compiler comfigured with --enable-unregisterised so that GDB be actually be useful. Noticed that loading inplace/bin/ghc-stage2 --interactve into GDB and running it caused a SIGILL here:

(gdb) disassemble
Dump of assembler code for function stg_ap_v_fast:
   0x03ff9db4 <+0>:     push    {r7, lr}
   0x03ff9db6 <+2>:     sub     sp, #32
   0x03ff9db8 <+4>:     add     r7, sp, #0
   0x03ff9dba <+6>:     ldr     r3, [pc, #508]  ; (0x3ff9fb8 <stg_ap_v_fast+516>)
=> 0x03ff9dbc <+8>:     ldr     r3, [r3, #0]
   0x03ff9dbe <+10>:    and.w   r3, r3, #

I also found that if I reloaded the debug target and set a break point in stg_ap_v_fast I could step though the code of that function without a crash. The SIGILL crash was only happening after executing a function that was loaded by the GHCi run time linker.

The other clue in the assembly code for the stg_ap_v_fast funtcion was that it shows Thumb instructions (all Thumb instructions are encoded in two bytes whereas all Arm instructions are encoded in 4 bytes), and elsewhere I learned that SIGILL can occur when executng Arm instructions in Thumb mode and vice versa. In a monolithic executable, jumps from Arm code to Thumb code are correctly handled by the linker so the compiler itself was correctly linked.

However, object code loaded at run time by GHCi is loaded by GHCi's run time linker which may not be getting the Arm/Thumb interop correct.

Then I figured out that all Haskell code compiled via the LLVM backend generates pure Arm code but the C code compiler with GCC was Thumb code.

Fortunately GCC can be forced to generate pure Arm code with the -marm option.

With the -marm GHCi works when compiled Unregisterised. In Registerised mode there are still some problems:

Prelude> data X = X | Y deriving Eq
Prelude> X == Y
False
Prelude> X == X

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb59ff460 (LWP 20022)]
0xb6ff0024 in ?? ()
(gdb) disass 0xb6ff0024
No function contains specified address.
(gdb) disass 0xb6ff0020,+10
Dump of assembler code from 0xb6ff0020 to 0xb6ff002a:
   0xb6ff0020:  ldrbvs  r6, [r4, #-3689]!       ; 0xe69
=> 0xb6ff0024:  strbtvc r6, [r3], #-370 ; 0x172
   0xb6ff0028:  bcc     0xb894d9d4
End of assembler dump.
(gdb) info regs
Undefined info command: "regs".  Try "help info".
(gdb) info registers
r0             0xb6ff0018       3070165016
r1             0x28     40
r2             0x1      1
r3             0x70000000       1879048192
r4             0x234db90        37018512
r5             0xb6cd451c       3066905884
r6             0xb6c1a33c       3066143548
r7             0xb6cddbec       3066944492
r8             0xb3a550fc       3013955836
r9             0x7fffffff       2147483647
r10            0x1e6d24c        31904332
r11            0xb6cd4254       3066905172
r12            0x234d37c        37016444
sp             0x1316abc0       0x1316abc0
lr             0x70000000       1879048192
pc             0xb6ff0024       0xb6ff0024
cpsr           0x400f0010       1074724880

The value of r3, 0x70000000 seems to be a strange address to be loading data from.

comment:52 Changed 2 years ago by erikd

Tried re-producing this segfault with --enable-unregisterised mainly because the Unregisterised mode can be debugged in GDB. Unfortunately, the segfault doesn't happen in Unregisterised mode.

Going back to registerised mode.

comment:53 Changed 2 years ago by erikd

When compiler in Registerised mode, inplace/bin/ghc-stage2 --interactive basically works. I can import modules, I can print things etc. Everything works fine until:

Prelude> data X = X | Y deriving Eq
Prelude> X == X

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb59ff460 (LWP 29599)]
0xb6ff0024 in ?? ()

(gdb) bt
#0  0xb6ff0024 in ?? ()
#1  0x70000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) disass 0xb6ff0020,+0x30
Dump of assembler code from 0xb6ff0020 to 0xb6ff0050:
   0xb6ff0020:  ldrbvs  r6, [r4, #-3689]!       ; 0xe69
=> 0xb6ff0024:  strbtvc r6, [r3], #-370 ; 0x172
   0xb6ff0028:  bcc     0xb894d9d4
   0xb6ff002c:  stmdbvs r3!, {r0, r1, r2, r6, r11, sp, lr}^
   0xb6ff0030:  subseq  r2, r8, r1, lsr lr
   0xb6ff0034:  andeq   r0, r0, r3, lsr r0
   0xb6ff0038:                  ; <UNDEFINED> instruction: 0xb6ff0038
   0xb6ff003c:  andeq   r0, r0, r8
   0xb6ff0040:  andeq   r0, r1, r0
   0xb6ff0044:  andeq   r0, r1, r1
   0xb6ff0048:  stmdaeq r7, {r0, r3, r6}^
   0xb6ff004c:  strdeq  r12, [r6, #208]!        ; 0xd0
End of assembler dump.

(gdb) info registers
r0             0xb6ff0018       3070165016
r1             0x28     40
r2             0x1      1
r3             0x70000000       1879048192
r4             0x2350b50        37030736
r5             0xb38c4a0c       3012315660
r6             0xb5187d24       3038280996
r7             0xb38c4b30       3012315952
r8             0xb2ad2c38       2997693496
r9             0x7fffffff       2147483647
r10            0x1e6cef4        31903476
r11            0xb38c4750       3012314960
r12            0x235036c        37028716
sp             0xb59fcd70       0xb59fcd70
lr             0x70000000       1879048192
pc             0xb6ff0024       0xb6ff0024
cpsr           0x400f0010       1074724880

comment:54 Changed 2 years ago by erikd

Turns out the memory region its trying to execute is not actually code. Hex dump is:

b6ff0000: 0000 0000 3300 0000 0800 ffb6 0800 0000  ....3...........
b6ff0010: 0000 0100 0100 0000 4900 4708 f0cd e601  ........I.G.....
b6ff0020: 696e 7465 7261 6374 6976 653a 4768 6369  interactive:Ghci
b6ff0030: 312e 5800 3300 0000 3800 ffb6 0800 0000  1.X.3...8.......
b6ff0040: 0000 0100 0100 0100 4900 4708 f0cd e601  ........I.G.....
b6ff0050: 696e 7465 7261 6374 6976 653a 4768 6369  interactive:Ghci
b6ff0060: 312e 5900 790f 0000 6800 ffb6 0f00 2de9  1.Y.y...h.....-.
b6ff0070: 0000 9fe5 00f0 9fe5 6c00 ffb6 cceb e801  ........l.......

and the instruction that segfaults is at 0xb6ff0024.

comment:55 Changed 2 years ago by erikd

Now back to same situation as comment:8 and comment:11. I have a statically built ghc-stage2 binary that segfaults. Going to try two things:

  • Port previous fix that disables Thumb instructions back to ghc-7.10 branch.
  • Try disabling SMP on master branch.

comment:56 Changed 2 years ago by erikd

Results of previous experiments:

  • On ghc-7.20 branch, GHCi with Thumb code disabled still crashes.
  • On master branch, with SMP disabled, GHCi still crashes.

Should probably continue working on master branch with SMP disabled.

comment:57 Changed 2 years ago by erikd

Notice that when it crashes the back trace looks like:

(gdb) bt
#0  0xb6ff0024 in ?? ()
#1  0x70000000 in ?? ()

The address of the previous function (the value 0x70000000) is a bit strange. If comes from the link register, lr:

(gdb) info registers
r0             0xb6ff0018       3070165016
r1             0x28     40
r2             0x1      1
r3             0x70000000       1879048192
r4             0x2350b50        37030736
r5             0xb38c4a0c       3012315660
r6             0xb5187d24       3038280996
r7             0xb38c4b30       3012315952
r8             0xb2ad2c38       2997693496
r9             0x7fffffff       2147483647
r10            0x1e6cef4        31903476
r11            0xb38c4750       3012314960
r12            0x235036c        37028716
sp             0xb59fcd70       0xb59fcd70
lr             0x70000000       1879048192
pc             0xb6ff0024       0xb6ff0024
cpsr           0x400f0010       1074724880

So lets run the program under GDB and break on $lr == 0x70000000 :

(gdb) break main
Breakpoint 1 at 0x93b08
(gdb) r
Starting program: /home/erikd/Git/ghc-upstream/inplace/lib/bin/ghc-stage2
 -B/home/erikd/Git/ghc-upstream/inplace/lib --interactive
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, 0x00093b08 in main ()
(gdb) watch $lr == 0x70000000 
Watchpoint 2: $lr == 0x70000000
(gdb) c
Continuing.
[New Thread 0xb6bff460 (LWP 5344)]
[New Thread 0xb61ff460 (LWP 5345)]
[New Thread 0xb59ff460 (LWP 5346)]

Program received signal SIGILL, Illegal instruction.
strlen () at ../ports/sysdeps/arm/armv6t2/strlen.S:82
82      ../ports/sys

so we're back to SIGILL, but the assembly looks mostly sane:

(gdb) disass 0xb6fe8800
Dump of assembler code for function strlen:
   0xb6fe8800 <+0>:     pld     [r0]
   0xb6fe8804 <+4>:     strd    r4, r5, [sp, #-8]!
   0xb6fe8808 <+8>:     bic.w   r1, r0, #7
   0xb6fe880c <+12>:    mvn.w   r12, #0
   0xb6fe8810 <+16>:    ands.w  r4, r0, #7
   0xb6fe8814 <+20>:    pld     [r1, #32]
   0xb6fe8818 <+24>:    bne.w   0xb6fe88ae <strlen+174>
   0xb6fe881c <+28>:    mov.w   r4, #0
=> 0xb6fe8820 <+32>:    mvn.w   r0, #7
   0xb6fe8824 <+36>:    ldrd    r2, r3, [r1]
   0xb6fe8828 <+40>:    pld     [r1, #64]       ; 0x40
   0xb6fe882c <+44>:    add.w   r0, r0, #8

Edited to remove data that came from a different debug session.

Last edited 2 years ago by erikd (previous) (diff)

comment:58 in reply to:  57 Changed 2 years ago by int-e

Replying to erikd:

I'm not sure why GDB thinks this is within the strlen function. If I disassemble strlen I get the same assembler instructions, but at a different address:

That looks like there are several copies of strlen in the code, including symbol information. So the function name is found when looking for the symbol closest to 0xb6fe88ae, but when resolving the name to an address, a different copy is used instead.

comment:59 Changed 2 years ago by erikd

On irc, @bgamari suggested that this problem may just be a failure to flush the instruction caches. On inspecting the code and adding debug printf statements, found that the instruction cache is indeed being flushed.

Futhermore cache flushing on arm/linux only works if it was allocated using mmap() with the PROT_EXEC flag set. I have also confirmed that this is case.

Since this issue of a SIGILL in seemingly correct and valid code only happens when I try to break on watch $lr == 0x70000000 I'm going to chalk this up to an GDB related Heisenburg-like interaction between GDB and the debug target.

comment:60 Changed 2 years ago by erikd

Debugging with GDB was too un-reliable, even with Virtual Address randomization disabled in the Linux kernel and master tick timer disabled (+RTS -V0 -RTS).

However, running with +RTS -Da -RTS I noticed that one of the last things the program did before crashing was to print out a BCO (Bycode compiled object). Unfortunately, the output with -Ds was simply way too verbose.

Adding some custom debug which was only activated after the BCO was loaded showed that the crash (sometimes SIGILL, sometimes SIGSEGV) was happening some time after the call to StgRun in rts/Schedule.c.

Since, the crash after calling StgRun happens immediately after loading the BCO, my suspicion is that the StgRun call is actually calling the recently loaded BCO.

My custom debug shows something along the lines of (eliding irrelevant stuff):

StgRun call 128
StgRun call 129
StgRun call 130

BCO
         0:  PUSH_ALTS  0xb39c2d08
         2:  PUSH_L   2
         4:  ENTER
INSTRS:
     6   0   2   2  46 
PTRS:
   0xb39c2d08

StgRun call 131
Segmentation fault
Last edited 2 years ago by erikd (previous) (diff)

comment:61 Changed 2 years ago by erikd

Adding more custom debug just before the call to StgRun I find that cap->r.rSp, rSpLim and rHp are all null pointers and rHpLim is 0x1, even on calls to StgRun that do not crash. Why?

comment:62 Changed 2 years ago by erikd

Tried once again to debug this using GDB.

$ gdb --args inplace/bin/ghc-stage2 +RTS -V0 -RTS --interactive
(gdb) break main

(gdb) break main
Breakpoint 1 at 0x93cd8
(gdb) r
Starting program: /home/erikd/Git/ghc-upstream/inplace/lib/bin/ghc-stage2
 -B/home/erikd/Git/ghc-upstream/inplace/lib +RTS -V0 -RTS --interactive
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, 0x00093cd8 in main ()
(gdb) while 1
 >x/i $pc
 >stepi
 >end

and it crashes in a at least 3 or 4 different ways on different runs, none of them even remotely like the way it crashes outside GDB.

comment:63 Changed 2 years ago by erikd

Differential Rev(s): Phab:D1323

comment:64 Changed 2 years ago by erikd

Haven't run the full test suite yet, but running:

make test TEST="print018 print020 print021 print022 print025 ghci053 ghcirun001 ghcirun002"

which used to result in 8 failing tests, now results in 8 passing tests.

comment:65 Changed 2 years ago by erikd

After running the full test suite:

OVERALL SUMMARY for test run started at Mon Oct 12 11:15:08 2015 AEDT
 3:52:23 spent to go through
    4694 total tests, which gave rise to
   14893 test cases, of which
   10265 were skipped

      72 had missing libraries
    4445 expected passes
     100 expected failures

       0 caused framework failures
       0 unexpected passes
       9 unexpected failures
       2 unexpected stat failures

Unexpected failures:
   codeGen/should_compile       T10518 [exit code non-0] (normal)
   codeGen/should_run           CmmSwitchTest [exit code non-0] (normal,g1)
   numeric/should_run           T8726 [bad exit code] (normal)
   primops/should_run           T9430 [bad exit code] (normal)
   programs/barton-mangler-bug  barton-mangler-bug [exit code non-0] (normal)
   programs/joao-circular       joao-circular [exit code non-0] (normal)
   rts                          T3424 [exit code non-0] (normal)
   rts                          outofmem [bad stdout] (normal)

Unexpected stat failures:
   perf/should_run  Conversions [stat not good enough] (normal)
   perf/should_run  T9203 [stat not good enough] (normal)

All of the ghci test failures are gone!

comment:66 Changed 2 years ago by Erik de Castro Lopo <erikd@…>

In 933adc0f/ghc:

Fix GHCi on Arm (#10375).

Arm has two instruction sets, Arm and Thumb, and an execution mode for each.
Executing Arm code in Thumb mode or vice-versa will likely result in an
Illegal instruction exception.

Furthermore, Haskell code compiled via LLVM was generating Arm instructions
while C code compiled via GCC was generating Thumb code by default. When
these two object code types were being linked by the system linker, all was
fine, because the system linker knows how to jump and call from one
instruction set to the other.

The first problem was with GHCi's object code loader which did not know
about Thumb vs Arm. When loading an object file `StgCRun` would jump
into the loaded object which could change the mode causing a crash after
it returned. This was fixed by forcing all C code to generate Arm
instructions by passing `-marm` to GCC.

The second problem was the `mkJumpToAddr` function which was generating
Thumb instructions. Changing that to generate Arm instructions instead
results in a working GHCi on Arm.

Test Plan: validate on x86_64 and arm

Reviewers: bgamari, austin, hvr

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1323

GHC Trac Issues: #10375

comment:67 Changed 2 years ago by erikd

Status: newmerge

comment:68 Changed 2 years ago by erikd

comment:69 Changed 2 years ago by bgamari

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.