I wrote a multithreaded GUI program for a research project using gtk2hs, and it works fine on fedora 17, which uses ghc 7.0.4. It crashes almost as soon as it starts when compiled and run on fedora 18 (ghc 7.4.1). The only message on the console says that it was killed because of a segmentation fault. I tracked down the code that causes the crash, and it seems to happen because I add an action to the gtk loop:
timeoutAddFull action ...
and the crash happens when the action runs the first time. I thought it was a bug in ghc 7.4.1, because I found a bug report that talks about a crash involving STM and multithreading, and supposedly was fixed in 7.4.2. So I set up a virtual machine and installed fedora 18 then upgraded it to rawhide to try my program under 7.4.2. But, the same crash happens on my rawhide machine.
However, it happened that I had to disable selinux on my rawhide machine using the boot command line because something started going wrong, still not sure what (hey, it's rawhide). Now my program does not crash. I just tested this on my fedora 18 laptop (still ghc 7.4.1) using both the version compiled on fedora 18 and the files from where I compiled it on rawhide: when I disable selinux, my program runs fine, but when it's enabled (even if set to permissive rather than enforcing) my program seg faults.
There's nothing useful in /var/log/messages, no indication of what selinux is unhappy about. I did find this: http://www.haskell.org/pipermail/haskell-cafe/2007-August/031120.html but at least in that problem, there was a definite error message about memory mapping, and I'm not getting one.
So as best I can tell, ghc 7.4.1 and 7.4.2 must both be doing something strange, maybe marking some piece of memory as data instead of code, maybe when performing calls to gtk, maybe in building thunks for use by timeoutAddFull, and eventually triggering a security problem.
My original program is huge. The problem must be some unexpected interaction between ghc's newer run time systems, gtk, and selinux. I'm attaching the smallest test case I could concoct and the build command. When you run the resulting program, it does nothing for about 2 seconds, then the action to print "tick" runs, and it crashes.
I'm filing the bug here because it might be a problem in the ghc runtime.
Trac metadata
Version: 7.4.2
Type: Bug
TypeOfFailure: OtherFailure
Priority: normal
Resolution: Unresolved
Component: Runtime System
Test case, Differential revisions, BlockedBy, Related, Blocking, CC, Operating system, Architecture: (empty)
By the way: I compiled Try1 (attached) on my Fedora 17 machine with ghc 7.0.4, and copied the executable over to my Fedora 18 machine. That executable works fine, no seg fault.
And while I was at it: I copied the executable compiled with ghc 7.4.1 from my Fedora 18 machine to my Fedora 17 machine, and it segfaults on Fedora 17 as well as 18.
I just tried to get ghc-7.4.2 from the generic linux build tar.bz2 file on haskell.org, but it won't work on f19/rawhide because it requires libgmp.so.3, but f19/rawhide comes with libgmp.so.10.... I'm trying to avoid rebuilding the entire haskell platform somewhere to track down the source of this bug. Suggestions?
Investigating the possibility that libffi.so.6 is where the bug lives: After trying and failing to get ghc to use libffi.5 on fedora 19/rawhide (either in /lib or in the ghc installation tree), I tried making /lib/libffi.so.6 a link to libffi.so.5. Then Try2 still segfaults in the same place.
I also tried building ghc 7.4.2 on fedora 17, and installing gtk via cabal: Try2 segfaults in the same place. So whatever's going on, it isn't just which version of libffi, and it isn't just the fedora release.
This is on Fedora 18 with ghc 7.4.1. (FYI: I'm also using the development version of gtk2hs; darcs gives latest patch date of Tue Feb 19 16:39:46 EST 2013). Now Try2 fails with a concrete error message:
Try2: internal error: ASSERTION FAILED: file rts/STM.c, line 1476
I've been working with gtk 2.32.4, ghc 7.4.2, and the development tree from gtk2hs. I added a few print statements and tracked down this much of the problem:
makeCallback: funPtr = 0xb7e8900c
makeCallback: destroyFunPtr = 0x0821fcd6
g_timeout_add_full: function = 0xb7e8900c
g_timeout_add_full: data = 0xb7e8900c
g_timeout_add_full: notify = 0x821fcd6
g_main_dispatch: dispatch = 0xb7ed11c0
g_main_dispatch: source = 0x82c0500
g_main_dispatch: callback = 0xb7e8900c
g_main_dispatch: user_data = 0xb7e8900c
g_timeout_dispatch: source = 0x82c0500
g_timeout_dispatch: callback = 0xb7e8900c
g_timeout_dispatch: user_data = 0xb7e8900c
(gdb) disass /r 0xb7e8900c,+5
Dump of assembler code from 0xb7e8900c to 0xb7e89011:
   0xb7e8900c: e8 c3 14 3b 50 call 0x823a4d4
End of assembler dump.
(gdb) disass /r 0x823a4d4,+20
Dump of assembler code from 0x823a4d4 to 0x823a4e8:
=> 0x0823a4d4: 00 00 add %al,(%eax)
   0x0823a4d6: 00 00 add %al,(%eax)
   0x0823a4d8: 20 00 and %al,(%eax)
   0x0823a4da: 00 00 add %al,(%eax)
   0x0823a4dc <stg_sel_ret_5_upd_info+0>: 89 f0 mov %esi,%eax
   0x0823a4de <stg_sel_ret_5_upd_info+2>: 83 e0 fc and $0xfffffffc,%eax
   0x0823a4e1 <stg_sel_ret_5_upd_info+5>: 8b 70 18 mov 0x18(%eax),%esi
   0x0823a4e4 <stg_sel_ret_5_upd_info+8>: 83 c5 04 add $0x4,%ebp
   0x0823a4e7 <stg_sel_ret_5_upd_info+11>: f7 c6 03 00 00 00 test $0x3,%esi
End of assembler dump.
In gtk2hs/Glib/System/Glib/MainLoop.chs, makeCallback function, the call to mkSourceFunc (which is a foreign import wrapper) seems to return a thunk stored at 0xb7e8900c, but the function call right at that address seems to be off by 8 bytes? Those first four instructions make no sense. The seg fault happens at that first add %al, (%eax) because %eax is a bad pointer.
On the Ubuntu 12.10 live image, after installing GHC 7.4.2, it runs with no seg fault. However, Ubuntu doesn't use SELinux. Maybe the thunk that goes back into Haskell is jumping to the wrong address, a few bytes before the actual function, and the instructions there are basically harmless, but SELinux catches them?
On Fedora 17, with GHC 7.4.2, I tried running valgrind on the Main program from ghc-bug-002.zip with this result:
valgrind --leak-check=full ./Main
==30226== Memcheck, a memory error detector
==30226== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==30226== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==30226== Command: ./Main
==30226==
Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0x401102c
set_callback: callback_data = 10
set_callback: p_finalizer = 0x401100c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0x401102c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0x401100c
invoke_callback: calling callback
==30226== Invalid read of size 1
==30226==    at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226==  Address 0xa is not stack'd, malloc'd or (recently) free'd
==30226==
==30226==
==30226== Process terminating with default action of signal 11 (SIGSEGV)
==30226==  Access not within mapped region at address 0xA
==30226==    at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226==  If you believe this happened as a result of a stack
==30226==  overflow in your program's main thread (unlikely but
==30226==  possible), you can try to increase the size of the
==30226==  main thread stack using the --main-stacksize= flag.
==30226==  The main thread stack size used in this run was 8388608.
==30226==
==30226== HEAP SUMMARY:
==30226==     in use at exit: 40,622 bytes in 32 blocks
==30226==   total heap usage: 52 allocs, 20 frees, 43,076 bytes allocated
==30226==
==30226== LEAK SUMMARY:
==30226==    definitely lost: 0 bytes in 0 blocks
==30226==    indirectly lost: 0 bytes in 0 blocks
==30226==      possibly lost: 0 bytes in 0 blocks
==30226==    still reachable: 40,622 bytes in 32 blocks
==30226==         suppressed: 0 bytes in 0 blocks
==30226== Reachable blocks (those to which a pointer was found) are not shown.
==30226== To see them, rerun with: --leak-check=full --show-reachable=yes
==30226==
==30226== For counts of detected and suppressed errors, rerun with: -v
==30226== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
Minimal example; works on ghc 7.0.4, seg faults on ghc 7.4.2; tried an extra command line option in ./build, put in separate small function for easier breakpoint in gdb
Abbreviated test case, with strace and .s files. (.se means on Fedora which enables SELinux, .ns means on Ubuntu which does not; 7.x.x is the version of GHC that generated the file.)
Maybe it's not an 8 byte problem. If the callback is eventually supposed to call adjustorCode, then the error is even weirder:
On Fedora 17 (SELinux, GHC 7.4.2), in just_invoke_callback (ghc-bug-003), tracing through...
Inside createAdjustor in ghc-7.4.2/rts/Adjustor.c, the AdjustorStub code that is generated at line 386 :-o is
(gdb) disas /r adjustorStub,+5
Dump of assembler code from 0xb7ffc02c to 0xb7ffc031:
   0xb7ffc02c: e8 87 a4 22 50 call 0x82264b8 <adjustorCode>
End of assembler dump.
e8 is the opcode for an ip-relative call: the four bytes after it are a displacement that is added to the address of the next instruction.
The same bytes during set_callback and just_invoke_callback are interpreted differently for some reason:
(gdb) print adjustorCode
$20 = {<text variable, no debug info>} 0x82264b8 <adjustorCode>
(gdb) disas /r *p_callback,+5
Dump of assembler code from 0xb7ffd02c to 0xb7ffd031:
   0xb7ffd02c: e8 87 a4 22 50 call 0x82274b8 <- off by 0x1000 from adjustorCode
End of assembler dump.
createAdjustor calls allocateExec (rts/sm/Storage.c) which calls ffi_closure_alloc. So in createAdjustor, line 381, we should have (if I'm reading the libffi documentation correctly):
adjustorStub is a pointer in data address space to the adjustor stub
code is a pointer in code address space to the very same spot in memory
and the relative call needs to be calculated in code address space
Before I forget: The pointer that gets returned after all of that is the data-space address rather than the code-space address, and I suppose that must be right so that the memory block can be deallocated later. But it sort of worries me that the call instruction to that data-space address works. Does the CPU or kernel recognize that the same memory is also mapped to a code-space address and make some correction?
The two addresses contain the same memory (double-mapped), but one is writable while the other is executable. This is how libffi works around the SELinux restrictions. On non-SELinux systems the code and data addresses are probably the same.
This function, createAdjustor, returns the code address, not the data address.