#7629 closed bug (fixed)
segmentation fault in compiled program, involves gtk, selinux
Reported by: | wgmitchener | Owned by: | simonmar |
---|---|---|---|
Priority: | high | Milestone: | 7.6.2 |
Component: | Runtime System | Version: | 7.4.2 |
Keywords: | segmentation fault, multithreading, selinux, gtk | Cc: | garrett.mitchener@…, juhp@…, simonmar |
Operating System: | Linux | Architecture: | x86 |
Type of failure: | Runtime crash | Test Case: | |
Blocked By: | | Blocking: | |
Related Tickets: | | Differential Rev(s): | |
Wiki Page: | |
Description
I wrote a multithreaded GUI program for a research project using gtk2hs, and it works fine on fedora 17, which uses ghc 7.0.4. It crashes almost as soon as it starts when compiled and run on fedora 18 (ghc 7.4.1). There's a message on the console that it was killed because of a segmentation fault and that's about all it tells me. I tracked down the code that causes the crash, and it seems to happen because I add an action to the gtk loop:
timeoutAddFull action ...
and the crash happens when the action runs the first time. I thought it was a bug in ghc 7.4.1, because I found a bug report that talks about a crash involving STM and multithreading, and supposedly was fixed in 7.4.2. So I set up a virtual machine and installed fedora 18 then upgraded it to rawhide to try my program under 7.4.2. But, the same crash happens on my rawhide machine.
However, it happened that I had to disable selinux on my rawhide machine using the boot command line because something started going wrong, still not sure what (hey, it's rawhide). Now my program does not crash. I just tested this on my fedora 18 laptop (still ghc 7.4.1) using both the version compiled on fedora 18 and the files from where I compiled it on rawhide: when I disable selinux, my program runs fine, but when it's enabled (even if set to permissive rather than enforcing) my program seg faults.
There's nothing useful in /var/log/messages, no indication of what selinux is unhappy about. I did find this: http://www.haskell.org/pipermail/haskell-cafe/2007-August/031120.html but at least in that problem, there was a definite error message about memory mapping, and I'm not getting one.
So as best I can tell, ghc 7.4.1&2 must both be doing something strange, maybe marking some piece of memory as data instead of code, maybe when performing calls to gtk, maybe in building thunks for use by timeoutAddFull, and eventually triggering a security problem.
My original program is huge. The problem must be some unexpected interaction between ghc's newer run time systems, gtk, and selinux. I'm attaching the smallest test case I could concoct and the build command. When you run the resulting program, it does nothing for about 2 seconds, then the action to print "tick" runs, and it crashes.
I'm filing the bug here because it might be a problem in the ghc runtime.
Attachments (7)
Change History (38)
Changed 5 years ago by
comment:1 Changed 5 years ago by
Cc: | garrett.mitchener@… added |
---|
comment:2 follow-up: 3 Changed 5 years ago by
By the way: I compiled Try1 (attached) on my Fedora 17 machine with ghc 7.0.4, and copied the executable over to my Fedora 18 machine. That executable works fine, no seg fault.
comment:3 Changed 5 years ago by
And while I was at it: I copied the executable compiled with ghc 7.4.1 from my Fedora 18 machine to my Fedora 17 machine, and it segfaults on Fedora 17 as well as 18.
comment:4 Changed 5 years ago by
Cc: | juhp@… added |
---|
I reproduced this on Fedora 18 i686 (doesn't happen on x86_64).
Changed 5 years ago by
Smaller testcase using only glib: "ghc-7.4.2 --make Try2.hs" && ./Try2 => segfaults on i686
comment:6 Changed 5 years ago by
difficulty: | → Unknown |
---|---|
Milestone: | → 7.6.2 |
Owner: | set to simonmar |
Priority: | normal → high |
Could someone compile the program with -debug, run it under gdb, and grab a backtrace with bt please?
comment:7 Changed 5 years ago by
I did what I could about getting a backtrace, see new attachment, but it's not much info. I compiled it with
ghc --make Try2 -debug
with ghc-7.4.2 on a virtual machine running fedora 19/rawhide.
comment:8 Changed 5 years ago by
Status: | new → infoneeded |
---|
Is your GHC using the libffi that comes with Fedora, or the one bundled with GHC?
comment:9 Changed 5 years ago by
The Try2 binaries that show the problem are compiled with ghc just as it is packaged on Fedora. According to ldd:
(on fedora 19, ghc 7.4.2) ldd Try2 yields
linux-gate.so.1 => (0xb772f000)
libgobject-2.0.so.0 => /lib/libgobject-2.0.so.0 (0x43995000)
libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0x43829000)
libgmp.so.10 => /lib/sse2/libgmp.so.10 (0x4eb12000)
libffi.so.6 => /lib/libffi.so.6 (0x439e8000)
libm.so.6 => /lib/libm.so.6 (0x4373a000)
librt.so.1 => /lib/librt.so.1 (0x4372f000)
libdl.so.2 => /lib/libdl.so.2 (0x43728000)
libc.so.6 => /lib/libc.so.6 (0x4354e000)
libpthread.so.0 => /lib/libpthread.so.0 (0x4370c000)
/lib/ld-linux.so.2 (0x4352b000)
so I think all of these are the fedora packaged libraries.
comment:10 Changed 5 years ago by
libffi on fedora 17 is libffi.5 but on fedora 18 and 19/rawhide, it's libffi.6
Is there some sort of version clash going on here, where ghc doesn't work with libffi.6? Or does ghc require specific patches to libffi?
(I have to copy libffi.6 over to f17 machines to run test cases where I compile something on rawhide and run it on f17.)
comment:11 Changed 5 years ago by
I just tried to get ghc-7.4.2 from the generic linux build tar.bz2 file on haskell.org, but it won't work on f19/rawhide because it requires libgmp.so.3, but f19/rawhide comes with libgmp.so.10.... I'm trying to avoid rebuilding the entire haskell platform somewhere to track down the source of this bug. Suggestions?
comment:12 Changed 5 years ago by
Investigating the possibility that libffi.so.6 is where the bug lives: After trying and failing to get ghc to use libffi.5 on fedora 19/rawhide (either in /lib or in the ghc installation tree), I tried making /lib/libffi.so.6 a link to libffi.so.5. Then Try2 still segfaults in the same place.
comment:13 Changed 5 years ago by
I decided to try with ghc 7.6.2. On fedora 19/rawhide, Try2 seg faults in the same place.
comment:14 Changed 5 years ago by
I also tried building ghc 7.4.2 on fedora 17, and installing gtk via cabal: Try2 segfaults in the same place. So whatever's going on, it isn't just which version of libffi, and it isn't just the fedora release.
comment:15 Changed 5 years ago by
A bit more information: I compiled Try2.hs with
ghc -debug --make Try2.hs
This is on Fedora 18 with ghc 7.4.1. (FYI: I'm also using the development version of gtk2hs; darcs gives latest patch date of Tue Feb 19 16:39:46 EST 2013). Now Try2 fails with a concrete error message:
Try2: internal error: ASSERTION FAILED: file rts/STM.c, line 1476
(GHC version 7.4.1 for i386_unknown_linux) Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
Aborted
which seems maybe to have something to do with not doing one STM transaction inside another? The function where that assertion happens is stmWait.
comment:16 Changed 5 years ago by
I've been working with gtk 2.32.4, ghc 7.4.2, and the development tree from gtk2hs. I added a few print statements and tracked down this much of the problem:
makeCallback: funPtr = 0xb7e8900c
makeCallback: destroyFunPtr = 0x0821fcd6
g_timeout_add_full: function = 0xb7e8900c
g_timeout_add_full: data = 0xb7e8900c
g_timeout_add_full: notify = 0x821fcd6
g_main_dispatch: dispatch = 0xb7ed11c0
g_main_dispatch: source = 0x82c0500
g_main_dispatch: callback = 0xb7e8900c
g_main_dispatch: user_data = 0xb7e8900c
g_timeout_dispatch: source = 0x82c0500
g_timeout_dispatch: callback = 0xb7e8900c
g_timeout_dispatch: user_data = 0xb7e8900c
(gdb) disass /r 0xb7e8900c,+5
Dump of assembler code from 0xb7e8900c to 0xb7e89011:
0xb7e8900c: e8 c3 14 3b 50 call 0x823a4d4
End of assembler dump.
(gdb) disass /r 0x823a4d4,+20
Dump of assembler code from 0x823a4d4 to 0x823a4e8:
=> 0x0823a4d4: 00 00 add %al,(%eax)
0x0823a4d6: 00 00 add %al,(%eax)
0x0823a4d8: 20 00 and %al,(%eax)
0x0823a4da: 00 00 add %al,(%eax)
0x0823a4dc <stg_sel_ret_5_upd_info+0>: 89 f0 mov %esi,%eax
0x0823a4de <stg_sel_ret_5_upd_info+2>: 83 e0 fc and $0xfffffffc,%eax
0x0823a4e1 <stg_sel_ret_5_upd_info+5>: 8b 70 18 mov 0x18(%eax),%esi
0x0823a4e4 <stg_sel_ret_5_upd_info+8>: 83 c5 04 add $0x4,%ebp
0x0823a4e7 <stg_sel_ret_5_upd_info+11>: f7 c6 03 00 00 00 test $0x3,%esi
End of assembler dump.
In gtk2hs/Glib/System/Glib/MainLoop.chs, makeCallback function, the call to mkSourceFunc (which is a foreign import wrapper) seems to return a thunk stored at 0xb7e8900c, but the function call right at that address seems to be off by 8 bytes? Those first four instructions make no sense. The seg fault happens at that first add %al, (%eax) because %eax is a bad pointer.
comment:17 Changed 5 years ago by
I just added a minimal example that doesn't need GTK -- see attachment ghc-bug-002.zip.
It's a simple case of Haskell calling into C calling back into Haskell. I'm using Fedora 17. The program works fine when compiled under GHC 7.0.4:
Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0xb77ee02c
set_callback: callback_data = 10
set_callback: p_finalizer = 0xb77ee00c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0xb77ee02c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0xb77ee00c
invoke_callback: calling callback
invoke_callback: return value is 11
invoke_callback: done
Clearing callback
clear_callback: at top
clear_callback: p_callback = 0xb77ee02c
clear_callback: callback_data = 10
clear_callback: p_finalizer = 0xb77ee00c
clear_callback: finalizing callback
clear_callback: p_callback = (nil)
clear_callback: callback_data = 0
clear_callback: p_finalizer = (nil)
clear_callback: done
But it seg faults under GHC 7.4.2.
Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0xb77d702c
set_callback: callback_data = 10
set_callback: p_finalizer = 0xb77d700c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0xb77d702c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0xb77d700c
invoke_callback: calling callback
Segmentation fault
On the Ubuntu 12.10 live image, after installing GHC 7.4.2, it runs with no seg fault. However, Ubuntu doesn't use SELinux. Maybe the thunk that goes back into Haskell is jumping to the wrong address, a few bytes before the actual function, and the instructions there are basically harmless, but SELinux catches them?
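For readers without the attachment: the C half of a test like this could look roughly like the sketch below. It is reconstructed only from the names in the log output (set_callback, invoke_callback, clear_callback, p_callback, callback_data, p_finalizer). The function-pointer signatures, the add_one and count_finalized stand-ins, and the -1 sentinel are assumptions, not the contents of ghc-bug-002.zip. In the real test the two function pointers would come from Haskell foreign import "wrapper" stubs rather than from C functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signatures: the callback maps an int to an int, and the
 * finalizer receives the callback pointer so it can release it. */
typedef int  (*callback_fn)(int);
typedef void (*finalizer_fn)(callback_fn);

static callback_fn  p_callback    = NULL;
static int          callback_data = 0;
static finalizer_fn p_finalizer   = NULL;

/* Store the callback, its argument, and its finalizer. */
void set_callback(callback_fn f, int d, finalizer_fn fin)
{
    assert(f != NULL);
    p_callback    = f;
    callback_data = d;
    p_finalizer   = fin;
}

/* Call back through the stored pointer; -1 means nothing is registered. */
int invoke_callback(void)
{
    return p_callback != NULL ? p_callback(callback_data) : -1;
}

/* Run the finalizer (if any) and reset all three slots. */
void clear_callback(void)
{
    if (p_callback != NULL && p_finalizer != NULL)
        p_finalizer(p_callback);
    p_callback    = NULL;
    callback_data = 0;
    p_finalizer   = NULL;
}

/* Plain C stand-ins for the Haskell-generated function pointers. */
int  finalized_count = 0;
int  add_one(int x)                 { return x + 1; }
void count_finalized(callback_fn f) { (void)f; finalized_count++; }
```

Calling set_callback(add_one, 10, count_finalized) followed by invoke_callback() gives 11, which corresponds to the "return value is 11" line in the working 7.0.4 log. In the failing runs it is the indirect call inside invoke_callback that ends up at a bad address.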
comment:18 Changed 5 years ago by
On Fedora 17, with GHC 7.4.2, I tried running valgrind on the Main program from ghc-bug-002.zip with this result:
valgrind --leak-check=full ./Main
==30226== Memcheck, a memory error detector
==30226== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==30226== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==30226== Command: ./Main
==30226==
Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0x401102c
set_callback: callback_data = 10
set_callback: p_finalizer = 0x401100c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0x401102c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0x401100c
invoke_callback: calling callback
==30226== Invalid read of size 1
==30226== at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226== Address 0xa is not stack'd, malloc'd or (recently) free'd
==30226==
==30226==
==30226== Process terminating with default action of signal 11 (SIGSEGV)
==30226== Access not within mapped region at address 0xA
==30226== at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226== If you believe this happened as a result of a stack
==30226== overflow in your program's main thread (unlikely but
==30226== possible), you can try to increase the size of the
==30226== main thread stack using the --main-stacksize= flag.
==30226== The main thread stack size used in this run was 8388608.
==30226==
==30226== HEAP SUMMARY:
==30226== in use at exit: 40,622 bytes in 32 blocks
==30226== total heap usage: 52 allocs, 20 frees, 43,076 bytes allocated
==30226==
==30226== LEAK SUMMARY:
==30226== definitely lost: 0 bytes in 0 blocks
==30226== indirectly lost: 0 bytes in 0 blocks
==30226== possibly lost: 0 bytes in 0 blocks
==30226== still reachable: 40,622 bytes in 32 blocks
==30226== suppressed: 0 bytes in 0 blocks
==30226== Reachable blocks (those to which a pointer was found) are not shown.
==30226== To see them, rerun with: --leak-check=full --show-reachable=yes
==30226==
==30226== For counts of detected and suppressed errors, rerun with: -v
==30226== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
Changed 5 years ago by
Attachment: | ghc-bug-002.zip added |
---|
Minimal example; works on ghc 7.0.4, seg faults on ghc 7.4.2; tried an extra command line option in ./build, put in separate small function for easier breakpoint in gdb
comment:19 Changed 5 years ago by
Doing some work with gdb with ghc-bug-003 (which I'll attach in a minute)
On Ubuntu 12.10 with GHC 7.4.2 (no SELinux, no seg fault), the call to set_callback comes out like this:
Breakpoint 1, set_callback (f=0xb7b3f02c, d=10, fin=0xb7b3f00c) at Callback.c:18
18 p_callback = f;
(gdb) x/i f
0xb7b3f02c: call 0x80a1358 <adjustorCode>
(gdb) print adjustorCode
$1 = {<text variable, no debug info>} 0x80a1358 <adjustorCode>
On Fedora 17 with GHC 7.4.2 (with SELinux, seg faults), the call to set_callback comes out like this:
Breakpoint 1, set_callback (f=0xb7ffd02c, d=10, fin=0xb7ffd00c) at Callback.c:18
18 p_callback = f;
(gdb) x/i f
0xb7ffd02c: call 0x82274b8
(gdb) print adjustorCode
$1 = {<text variable, no debug info>} 0x82264b8 <adjustorCode>
so is something going wrong with this adjustorCode function?
Changed 5 years ago by
Attachment: | ghc-bug-003.zip added |
---|
Abbreviated test case, with strace and .s files. (.se means on Fedora which enables SELinux, .ns means on Ubuntu which does not; 7.x.x is the version of GHC that generated the file.)
comment:20 Changed 5 years ago by
Maybe it's not an 8 byte problem. If the callback is eventually supposed to call adjustorCode, then the error is even weirder:
On Fedora 17, (SE, GHC 742), in just_invoke_callback (ghc-bug-003), tracing through...
Inside createAdjustor in ghc-7.4.2/rts/Adjustor.c, the AdjustorStub code that is generated at line 386 :-o is
(gdb) disas /r adjustorStub,+5
Dump of assembler code from 0xb7ffc02c to 0xb7ffc031:
0xb7ffc02c: e8 87 a4 22 50 call 0x82264b8 <adjustorCode>
End of assembler dump.
e8 is the opcode for an ip-relative call (the 32-bit operand is added to the address of the next instruction).
The same bytes during set_callback and just_invoke_callback are interpreted differently for some reason:
(gdb) print adjustorCode
$20 = {<text variable, no debug info>} 0x82264b8 <adjustorCode>
(gdb) disas /r *p_callback,+5
Dump of assembler code from 0xb7ffd02c to 0xb7ffd031:
0xb7ffd02c: e8 87 a4 22 50 call 0x82274b8 <- off by 0x1000 from adjustorCode
End of assembler dump.
which means something hideous has happened.
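To see why identical bytes disassemble to different targets: the operand of an e8 call is added to the address of the following instruction, so the destination depends on where the bytes are fetched from. A small sketch in C, using the bytes and addresses from the gdb output above. The helper name call_rel32_target is made up for illustration, and addresses are modelled as 32-bit integers so that it behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the destination of an x86 "call rel32" (opcode 0xE8).
 * insn_addr is the address the instruction is fetched from;
 * insn points at its 5 bytes (opcode + little-endian displacement). */
uint32_t call_rel32_target(uint32_t insn_addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xE8);
    uint32_t rel32 = (uint32_t)insn[1]
                   | ((uint32_t)insn[2] << 8)
                   | ((uint32_t)insn[3] << 16)
                   | ((uint32_t)insn[4] << 24);
    /* Relative to the next instruction; wraps modulo 2^32. */
    return insn_addr + 5u + rel32;
}
```

With the bytes e8 87 a4 22 50, decoding at 0xb7ffc02c gives 0x082264b8 (adjustorCode), while decoding at 0xb7ffd02c gives 0x082274b8, which is the 0x1000 discrepancy gdb reports.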
comment:21 Changed 5 years ago by
Got it:
ghc-7.4.2/rts/Adjustor.c:380
createAdjustor calls allocateExec (rts/sm/Storage.c) which calls ffi_closure_alloc. So in createAdjustor, line 381, we should have (if I'm reading the libffi documentation correctly)
adjustorStub is a pointer in data address space to the adjustor stub code is a pointer in code address space to the very same spot in memory
and sure enough they are off by 0x1000:
(gdb) print adjustorStub
$3 = (AdjustorStub *) 0xb7ffc00c
(gdb) print code
$4 = (void *) 0xb7ffd00c
which means the correct calculation of the relative call should be
*(long*)&adjustorStub->call[1] = ((char*)&adjustorCode) - ((char*)code + 5); // code instead of adjustorStub
Apparently code and data are done with different segment settings under SELinux. Chaos follows.
Going to rebuild GHC 7.4.2 with that change and see if this works...
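A sketch of the arithmetic behind that one-line change, not the actual rts/Adjustor.c code: the stub bytes are stored through the writable pointer, but the displacement has to be computed from the address the stub will be executed at. The helper names write_call_stub and stub_target are hypothetical, and addresses are modelled as 32-bit integers (taken from the gdb sessions in comment 20) so the example does not depend on the host.

```c
#include <assert.h>
#include <stdint.h>

/* Emit a 5-byte "call rel32" stub.
 *   writable  : where the bytes are stored (data-space view)
 *   exec_addr : address the stub will be fetched from (code-space view)
 *   target    : address of the function to call */
void write_call_stub(uint8_t *writable, uint32_t exec_addr, uint32_t target)
{
    uint32_t rel32 = target - (exec_addr + 5u);  /* relative to next insn */
    writable[0] = 0xE8;
    writable[1] = (uint8_t)(rel32);
    writable[2] = (uint8_t)(rel32 >> 8);
    writable[3] = (uint8_t)(rel32 >> 16);
    writable[4] = (uint8_t)(rel32 >> 24);
}

/* Where the stub would jump if fetched from fetch_addr. */
uint32_t stub_target(const uint8_t *stub, uint32_t fetch_addr)
{
    assert(stub[0] == 0xE8);
    uint32_t rel32 = (uint32_t)stub[1]
                   | ((uint32_t)stub[2] << 8)
                   | ((uint32_t)stub[3] << 16)
                   | ((uint32_t)stub[4] << 24);
    return fetch_addr + 5u + rel32;
}
```

Passing the executable address 0xb7ffd02c as exec_addr makes the stub land on 0x082264b8 when fetched from there. Passing the writable address 0xb7ffc02c instead reproduces the bytes e8 87 a4 22 50 seen in gdb, which miss the target by 0x1000 once executed from the code-space address.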
comment:22 Changed 5 years ago by
Sorry, formatting of the last message went wrong: createAdjustor calls allocateExec (rts/sm/Storage.c) which calls ffi_closure_alloc. So in createAdjustor, line 381, we should have (if I'm reading the libffi documentation correctly):
adjustorStub is a pointer in data address space to the adjustor stub
code is a pointer in code address space to the very same spot in memory
and the relative call needs to be calculated in code address space
comment:23 Changed 5 years ago by
Okay, it works!
I've attached a patch, going to do a few more tests. Now what?
comment:24 Changed 5 years ago by
More tests: My gtk-based simulation program works with the above patch on Fedora 17 with GHC 7.4.2.
By the way, the same mistake in Adjustor.c seems to be present in all later versions of GHC as well.
comment:25 follow-up: 28 Changed 5 years ago by
Status: | infoneeded → new |
---|
Well done for tracking this down!
Your fix looks good to me. Could someone validate and push please?
comment:26 Changed 5 years ago by
Before I forget: The pointer that gets returned after all of that is the data-space address rather than the code-space address, and I suppose that must be right so that the memory block can be deallocated later. But it sort of worries me that the call instruction to that data-space address works. Does the CPU or kernel recognize that the same memory is also mapped to a code-space address and make some correction?
comment:27 Changed 5 years ago by
The two addresses contain the same memory (double-mapped), but one is writable while the other is executable. This is how libffi works around the SELinux restrictions. On non-SELinux systems the code and data addresses are probably the same.
This function, createAdjustor, returns the code address, not the data address.
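To illustrate what "double-mapped" means here, the following sketch maps one backing file twice with POSIX mmap and shows that a write through one view is visible through the other at a different address. This is only a model of the idea and makes no claim about libffi's exact implementation: the function name double_map_demo is invented, the backing store is a tmpfile(), and the second view is mapped read-only rather than executable so that the example also runs where /tmp is mounted noexec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the same page twice and check that bytes written through the
 * writable view can be read through the second view.
 * Returns 0 on success, -1 on any failure. */
int double_map_demo(void)
{
    static const unsigned char stub[5] = {0xE8, 0x87, 0xA4, 0x22, 0x50};
    const size_t len = (size_t)sysconf(_SC_PAGESIZE);
    assert(len >= sizeof stub);

    FILE *tmp = tmpfile();                 /* unnamed backing file */
    if (tmp == NULL)
        return -1;
    int fd = fileno(tmp);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(tmp);
        return -1;
    }

    /* Two views of the same file page. A closure allocator would ask
     * for PROT_READ | PROT_EXEC on the second one. */
    unsigned char *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    unsigned char *code = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    int result = -1;
    if (data != MAP_FAILED && code != MAP_FAILED) {
        memcpy(data, stub, sizeof stub);   /* write via the data view */
        if (data != code && memcmp(code, stub, sizeof stub) == 0)
            result = 0;                    /* same bytes, other address */
    }

    if (data != MAP_FAILED) munmap(data, len);
    if (code != MAP_FAILED) munmap(code, len);
    fclose(tmp);
    return result;
}
```

Because the two views live at different addresses, anything position-dependent written through the data view, such as a relative call, has to be computed against the code view's address.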
comment:28 Changed 5 years ago by
Replying to simonmar:
Well done for tracking this down!
Your fix looks good to me. Could someone validate and push please?
I'm new at this process: Is "validate and push" something I'm supposed to do or does someone on the inside of the GHC group do this?
comment:29 Changed 5 years ago by
commit 27cf625ab871f34434d9fe86cecf85a31f73f0e5
Author: Ian Lynagh <ian@well-typed.com>
Date: Tue Apr 9 13:53:28 2013 +0100

Fix segfaults on SELinux machines; fixes #7629

Patch from wgmitchener. From the ticket:

The two addresses (adjustorStub and code) contain the same memory (double-mapped), but one is writable while the other is executable. This is how libffi works around the SELinux restrictions. On non-SELinux systems the code and data addresses are probably the same.

rts/Adjustor.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
comment:30 Changed 5 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
wgmitchener: It's something one of the GHC team does.
I've now validated and pushed; thanks for diagnosing it and sending the patch!
comment:31 Changed 4 years ago by
Cc: | simonmar added |
---|
I note for the record that this didn't make ghc-7.6.3 but will be in ghc-7.8.
I am finally backporting the patch to Fedora now.
wgmitchener: Thank you again for fixing this.