Opened 15 months ago

Closed 13 months ago

Last modified 3 months ago

#7629 closed bug (fixed)

segmentation fault in compiled program, involves gtk, selinux

Reported by: wgmitchener
Owned by: simonmar
Priority: high
Milestone: 7.6.2
Component: Runtime System
Version: 7.4.2
Keywords: segmentation fault, multithreading, selinux, gtk
Cc: garrett.mitchener@…, juhp@…, simonmar
Operating System: Linux
Architecture: x86
Type of failure: Runtime crash
Difficulty: Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

I wrote a multithreaded GUI program for a research project using gtk2hs, and it works fine on fedora 17, which uses ghc 7.0.4. It crashes almost as soon as it starts when compiled and run on fedora 18 (ghc 7.4.1). The only message on the console says that it was killed because of a segmentation fault. I tracked down the code that causes the crash, and it seems to happen because I add an action to the gtk loop:

timeoutAddFull action ...

and the crash happens when the action runs the first time. I thought it was a bug in ghc 7.4.1, because I found a bug report that talks about a crash involving STM and multithreading, and supposedly was fixed in 7.4.2. So I set up a virtual machine and installed fedora 18 then upgraded it to rawhide to try my program under 7.4.2. But, the same crash happens on my rawhide machine.

However, it happened that I had to disable selinux on my rawhide machine using the boot command line because something started going wrong, still not sure what (hey, it's rawhide). Now my program does not crash. I just tested this on my fedora 18 laptop (still ghc 7.4.1) using both the version compiled on fedora 18 and the files from where I compiled it on rawhide: when I disable selinux, my program runs fine, but when it's enabled (even if set to permissive rather than enforcing) my program seg faults.

There's nothing useful in /var/log/messages, no indication of what selinux is unhappy about. I did find this: http://www.haskell.org/pipermail/haskell-cafe/2007-August/031120.html but at least in that problem, there was a definite error message about memory mapping, and I'm not getting one.

So as best I can tell, ghc 7.4.1 and 7.4.2 must both be doing something strange, maybe marking some piece of memory as data instead of code, maybe when performing calls to gtk, maybe in building thunks for use by timeoutAddFull, and eventually triggering a security problem.

My original program is huge. The problem must be some unexpected interaction between ghc's newer run time systems, gtk, and selinux. I'm attaching the smallest test case I could concoct and the build command. When you run the resulting program, it does nothing for about 2 seconds, then the action to print "tick" runs, and it crashes.

I'm filing the bug here because it might be a problem in the ghc runtime.

Attachments (7)

Try1.hs (175 bytes) - added by wgmitchener 15 months ago.
Program that crashes
build (62 bytes) - added by wgmitchener 15 months ago.
build script for Try1.hs
Try2.hs (161 bytes) - added by juhpetersen 15 months ago.
Smaller testcase using only glib: "ghc-7.4.2 --make Try2.hs" && ./Try2 => segfaults on i686
Try2-bt.txt (1.0 KB) - added by wgmitchener 15 months ago.
gdb backtrace of Try2 from fedora 19/rawhide
ghc-bug-002.zip (1.5 KB) - added by wgmitchener 13 months ago.
Minimal example; works on ghc 7.0.4, seg faults on ghc 7.4.2; tried an extra command line option in ./build, put in separate small function for easier breakpoint in gdb
ghc-bug-003.zip (18.1 KB) - added by wgmitchener 13 months ago.
Abbreviated test case, with strace and .s files. (.se means on Fedora which enables SELinux, .ns means on Ubuntu which does not; 7.x.x is the version of GHC that generated the file.)
Fix-adjustor.patch (519 bytes) - added by wgmitchener 13 months ago.
Patch for rts/Adjustor.c


Change History (38)

Changed 15 months ago by wgmitchener

Program that crashes

Changed 15 months ago by wgmitchener

build script for Try1.hs

comment:1 Changed 15 months ago by wgmitchener

  • Cc garrett.mitchener@… added

comment:2 follow-up: Changed 15 months ago by wgmitchener

By the way: I compiled Try1 (attached) on my Fedora 17 machine with ghc 7.0.4, and copied the executable over to my Fedora 18 machine. That executable works fine, no seg fault.

comment:3 in reply to: ↑ 2 Changed 15 months ago by wgmitchener

And while I was at it: I copied the executable compiled with ghc 7.4.1 from my Fedora 18 machine to my Fedora 17 machine, and it segfaults on Fedora 17 as well as 18.

comment:4 Changed 15 months ago by juhpetersen

  • Cc juhp@… added

I reproduced this on Fedora 18 i686 (doesn't happen on x86_64).

Changed 15 months ago by juhpetersen

Smaller testcase using only glib: "ghc-7.4.2 --make Try2.hs" && ./Try2 => segfaults on i686

comment:5 Changed 15 months ago by juhpetersen

I would be curious if this still happens with ghc-7.6.

comment:6 Changed 15 months ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to 7.6.2
  • Owner set to simonmar
  • Priority changed from normal to high

Could someone compile the program with -debug, run it under gdb, and grab a backtrace with bt please?

Changed 15 months ago by wgmitchener

gdb backtrace of Try2 from fedora 19/rawhide

comment:7 Changed 15 months ago by wgmitchener

I did what I could about getting a backtrace, see new attachment, but it's not much info. I compiled it with

ghc --make Try2 -debug

with ghc-7.4.2 on a virtual machine running fedora 19/rawhide.

comment:8 Changed 15 months ago by simonmar

  • Status changed from new to infoneeded

Is your GHC using the libffi that comes with Fedora, or the one bundled with GHC?

comment:9 Changed 15 months ago by wgmitchener

The Try2 binaries that crash were compiled with ghc exactly as it is packaged on Fedora. According to ldd:

(on fedora 19, ghc 7.4.2) ldd Try2 yields

linux-gate.so.1 => (0xb772f000)
libgobject-2.0.so.0 => /lib/libgobject-2.0.so.0 (0x43995000)
libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0x43829000)
libgmp.so.10 => /lib/sse2/libgmp.so.10 (0x4eb12000)
libffi.so.6 => /lib/libffi.so.6 (0x439e8000)
libm.so.6 => /lib/libm.so.6 (0x4373a000)
librt.so.1 => /lib/librt.so.1 (0x4372f000)
libdl.so.2 => /lib/libdl.so.2 (0x43728000)
libc.so.6 => /lib/libc.so.6 (0x4354e000)
libpthread.so.0 => /lib/libpthread.so.0 (0x4370c000)
/lib/ld-linux.so.2 (0x4352b000)

so I think all of these are the fedora packaged libraries.

comment:10 Changed 15 months ago by wgmitchener

libffi on fedora 17 is libffi.5
but on fedora 18 and 19/rawhide, it's libffi.6

Is there some sort of version clash going on here, where ghc doesn't work with libffi.6? Or does ghc require specific patches to libffi?

(I have to copy libffi.6 over to f17 machines to run test cases where I compile something on rawhide and run it on f17.)

comment:11 Changed 15 months ago by wgmitchener

I just tried to get ghc-7.4.2 from the generic linux build tar.bz2 file on haskell.org, but it won't work on f19/rawhide because it requires libgmp.so.3, but f19/rawhide comes with libgmp.so.10.... I'm trying to avoid rebuilding the entire haskell platform somewhere to track down the source of this bug. Suggestions?

comment:12 Changed 15 months ago by wgmitchener

Investigating the possibility that libffi.so.6 is where the bug lives: After trying and failing to get ghc to use libffi.5 on fedora 19/rawhide (either in /lib or in the ghc installation tree), I tried making /lib/libffi.so.6 a link to libffi.so.5. Then Try2 still segfaults in the same place.

comment:13 Changed 15 months ago by wgmitchener

I decided to try with ghc 7.6.2. On fedora 19/rawhide, Try2 seg faults in the same place.

comment:14 Changed 15 months ago by wgmitchener

I also tried building ghc 7.4.2 on fedora 17, and installing gtk via cabal: Try2 segfaults in the same place. So whatever's going on, it isn't just which version of libffi, and it isn't just the fedora release.

comment:15 Changed 14 months ago by wgmitchener

A bit more information: I compiled Try2.hs with

ghc -debug --make Try2.hs

This is on Fedora 18 with ghc 7.4.1. (FYI: I'm also using the development version of gtk2hs; darcs gives latest patch date of Tue Feb 19 16:39:46 EST 2013). Now Try2 fails with a concrete error message:

Try2: internal error: ASSERTION FAILED: file rts/STM.c, line 1476

(GHC version 7.4.1 for i386_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug

Aborted

which seems maybe to have something to do with not doing one STM transaction inside another? The function where that assertion happens is stmWait.

comment:16 Changed 14 months ago by wgmitchener

I've been working with gtk 2.32.4, ghc 7.4.2, and the development tree from gtk2hs. I added a few print statements and tracked down this much of the problem:

makeCallback: funPtr = 0xb7e8900c
makeCallback: destroyFunPtr = 0x0821fcd6
g_timeout_add_full: function = 0xb7e8900c
g_timeout_add_full: data = 0xb7e8900c
g_timeout_add_full: notify = 0x821fcd6
g_main_dispatch: dispatch = 0xb7ed11c0
g_main_dispatch: source = 0x82c0500
g_main_dispatch: callback = 0xb7e8900c
g_main_dispatch: user_data = 0xb7e8900c
g_timeout_dispatch: source = 0x82c0500
g_timeout_dispatch: callback = 0xb7e8900c
g_timeout_dispatch: user_data = 0xb7e8900c

(gdb) disass /r 0xb7e8900c,+5
Dump of assembler code from 0xb7e8900c to 0xb7e89011:
   0xb7e8900c:	e8 c3 14 3b 50	call   0x823a4d4
End of assembler dump.

(gdb) disass /r 0x823a4d4,+20
Dump of assembler code from 0x823a4d4 to 0x823a4e8:
=> 0x0823a4d4:	00 00	add    %al,(%eax)
   0x0823a4d6:	00 00	add    %al,(%eax)
   0x0823a4d8:	20 00	and    %al,(%eax)
   0x0823a4da:	00 00	add    %al,(%eax)
   0x0823a4dc <stg_sel_ret_5_upd_info+0>:	89 f0	mov    %esi,%eax
   0x0823a4de <stg_sel_ret_5_upd_info+2>:	83 e0 fc	and    $0xfffffffc,%eax
   0x0823a4e1 <stg_sel_ret_5_upd_info+5>:	8b 70 18	mov    0x18(%eax),%esi
   0x0823a4e4 <stg_sel_ret_5_upd_info+8>:	83 c5 04	add    $0x4,%ebp
   0x0823a4e7 <stg_sel_ret_5_upd_info+11>:	f7 c6 03 00 00 00	test   $0x3,%esi
End of assembler dump.

In gtk2hs/Glib/System/Glib/MainLoop.chs, makeCallback function, the call to mkSourceFunc (which is a foreign import wrapper) seems to return a thunk stored at 0xb7e8900c, but the function call right at that address seems to be off by 8 bytes? Those first four instructions make no sense. The seg fault happens at that first add %al, (%eax) because %eax is a bad pointer.

comment:17 Changed 13 months ago by wgmitchener

I just added a minimal example that doesn't need GTK -- see attachment ghc-bug-002.zip.

It's a simple case of Haskell calling into C calling back into Haskell. I'm using Fedora 17. The program works fine when compiled under GHC 7.0.4:

Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0xb77ee02c
set_callback: callback_data = 10
set_callback: p_finalizer = 0xb77ee00c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0xb77ee02c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0xb77ee00c
invoke_callback: calling callback
invoke_callback: return value is 11
invoke_callback: done
Clearing callback
clear_callback: at top
clear_callback: p_callback = 0xb77ee02c
clear_callback: callback_data = 10
clear_callback: p_finalizer = 0xb77ee00c
clear_callback: finalizing callback
clear_callback: p_callback = (nil)
clear_callback: callback_data = 0
clear_callback: p_finalizer = (nil)
clear_callback: done

But it seg faults under GHC 7.4.2.

Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0xb77d702c
set_callback: callback_data = 10
set_callback: p_finalizer = 0xb77d700c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0xb77d702c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0xb77d700c
invoke_callback: calling callback
Segmentation fault

On the Ubuntu 12.10 live image, after installing GHC 7.4.2, it runs with no seg fault. However, Ubuntu doesn't use SELinux. Maybe the thunk that goes back into Haskell is jumping to the wrong address, a few bytes before the actual function, and the instructions there are basically harmless, but SELinux catches them?
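The attachment is not reproduced inline, so for orientation here is a rough sketch of what the C side of such a test could look like. The function names and the three globals are taken from the log above. Everything else is an assumption: the types are guessed, and `add_one` and `note_finalized` are plain C stand-ins for the two function pointers that, in the real test, presumably come from Haskell as adjustor stubs.

```c
#include <stddef.h>

/* Assumed types; the real Callback.c may declare these differently. */
typedef int  (*callback_fn)(int);
typedef void (*finalizer_fn)(callback_fn);

static callback_fn  p_callback    = NULL;
static int          callback_data = 0;
static finalizer_fn p_finalizer   = NULL;

/* Remember a callback, its argument, and a finalizer for the callback. */
void set_callback(callback_fn f, int d, finalizer_fn fin)
{
    p_callback    = f;
    callback_data = d;
    p_finalizer   = fin;
}

/* Call back through the stored pointer. With a Haskell-exported FunPtr
 * this is the call that jumps into the adjustor stub. */
int invoke_callback(void)
{
    return p_callback != NULL ? p_callback(callback_data) : -1;
}

/* Run the finalizer on the callback, then reset all three globals. */
void clear_callback(void)
{
    if (p_finalizer != NULL)
        p_finalizer(p_callback);
    p_callback    = NULL;
    callback_data = 0;
    p_finalizer   = NULL;
}

/* Hypothetical stand-ins for the Haskell side. */
static int finalized_count = 0;
static int  add_one(int x)                { return x + 1; }
static void note_finalized(callback_fn f) { (void)f; finalized_count++; }
```

With `set_callback(add_one, 10, note_finalized)`, `invoke_callback()` returns 11, which is the "return value is 11" line in the working log. In the failing runs the crash happens inside that call, before the callback body is ever reached.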

comment:18 Changed 13 months ago by wgmitchener

On Fedora 17, with GHC 7.4.2, I tried running valgrind on the Main program from ghc-bug-002.zip with this result:

valgrind --leak-check=full ./Main
==30226== Memcheck, a memory error detector
==30226== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==30226== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==30226== Command: ./Main
==30226== 
Setting callback
set_callback: at top
set_callback: p_callback = (nil)
set_callback: callback_data = 0
set_callback: p_finalizer = (nil)
set_callback: new pointer values:
set_callback: p_callback = 0x401102c
set_callback: callback_data = 10
set_callback: p_finalizer = 0x401100c
set_callback: done
Invoking callback
invoke_callback: at top
invoke_callback: p_callback = 0x401102c
invoke_callback: callback_data = 10
invoke_callback: p_finalizer = 0x401100c
invoke_callback: calling callback
==30226== Invalid read of size 1
==30226==    at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226==  Address 0xa is not stack'd, malloc'd or (recently) free'd
==30226== 
==30226== 
==30226== Process terminating with default action of signal 11 (SIGSEGV)
==30226==  Access not within mapped region at address 0xA
==30226==    at 0x822D5E8: freeSignalHandlers (Signals.c:90)
==30226==  If you believe this happened as a result of a stack
==30226==  overflow in your program's main thread (unlikely but
==30226==  possible), you can try to increase the size of the
==30226==  main thread stack using the --main-stacksize= flag.
==30226==  The main thread stack size used in this run was 8388608.
==30226== 
==30226== HEAP SUMMARY:
==30226==     in use at exit: 40,622 bytes in 32 blocks
==30226==   total heap usage: 52 allocs, 20 frees, 43,076 bytes allocated
==30226== 
==30226== LEAK SUMMARY:
==30226==    definitely lost: 0 bytes in 0 blocks
==30226==    indirectly lost: 0 bytes in 0 blocks
==30226==      possibly lost: 0 bytes in 0 blocks
==30226==    still reachable: 40,622 bytes in 32 blocks
==30226==         suppressed: 0 bytes in 0 blocks
==30226== Reachable blocks (those to which a pointer was found) are not shown.
==30226== To see them, rerun with: --leak-check=full --show-reachable=yes
==30226== 
==30226== For counts of detected and suppressed errors, rerun with: -v
==30226== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

Changed 13 months ago by wgmitchener

Minimal example; works on ghc 7.0.4, seg faults on ghc 7.4.2; tried an extra command line option in ./build, put in separate small function for easier breakpoint in gdb

comment:19 Changed 13 months ago by wgmitchener

Doing some work with gdb with ghc-bug-003 (which I'll attach in a minute)

On Ubuntu 12.10 with GHC 7.4.2 (no SELinux, no seg fault), the call to set_callback comes out like this:

Breakpoint 1, set_callback (f=0xb7b3f02c, d=10, fin=0xb7b3f00c) at Callback.c:18
18  p_callback = f;
(gdb) x/i f
   0xb7b3f02c:	call   0x80a1358 <adjustorCode>
(gdb) print adjustorCode
$1 = {<text variable, no debug info>} 0x80a1358 <adjustorCode>

On Fedora 17 with GHC 7.4.2 (with SELinux, seg faults), the call to set_callback comes out like this:

Breakpoint 1, set_callback (f=0xb7ffd02c, d=10, fin=0xb7ffd00c) at Callback.c:18
18	  p_callback = f;
(gdb) x/i f
   0xb7ffd02c:	call   0x82274b8
(gdb) print adjustorCode
$1 = {<text variable, no debug info>} 0x82264b8 <adjustorCode>

so is something going wrong with this adjustorCode function?

Changed 13 months ago by wgmitchener

Abbreviated test case, with strace and .s files. (.se means on Fedora which enables SELinux, .ns means on Ubuntu which does not; 7.x.x is the version of GHC that generated the file.)

comment:20 Changed 13 months ago by wgmitchener

Maybe it's not an 8 byte problem. If the callback is eventually supposed to call adjustorCode, then the error is even weirder:

On Fedora 17, (SE, GHC 742), in just_invoke_callback (ghc-bug-003), tracing through...

Inside createAdjustor in ghc-7.4.2/rts/Adjustor.c, the AdjustorStub code that is generated at line 386 :-o is

(gdb) disas /r adjustorStub,+5
Dump of assembler code from 0xb7ffc02c to 0xb7ffc031:
   0xb7ffc02c:	e8 87 a4 22 50	call   0x82264b8 <adjustorCode>
End of assembler dump.

e8 is the opcode for an ip-relative call: the four bytes after it are a displacement measured from the address of the next instruction.

The same bytes during set_callback and just_invoke_callback are interpreted differently for some reason:

(gdb) print adjustorCode
$20 = {<text variable, no debug info>} 0x82264b8 <adjustorCode>

(gdb) disas /r *p_callback,+5
Dump of assembler code from 0xb7ffd02c to 0xb7ffd031:
   0xb7ffd02c:	e8 87 a4 22 50	call   0x82274b8   <- off by 0x1000 from adjustorCode
End of assembler dump.

which means something hideous has happened.
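To see why identical bytes can decode to different targets: the displacement in an `e8` call is added to the address of the instruction that follows it, so the result depends on where the five bytes are executed from. A minimal sketch in C (the helper name `rel32_call_target` is made up for illustration; the bytes and addresses are the ones from the gdb output above):

```c
#include <assert.h>
#include <stdint.h>

/* Decode the target of a 5-byte x86 near call (opcode 0xe8 followed by a
 * little-endian 32-bit displacement) located at address insn_addr.
 * Illustrative helper only; the arithmetic wraps at 32 bits like the CPU. */
uint32_t rel32_call_target(uint32_t insn_addr, const unsigned char insn[5])
{
    assert(insn[0] == 0xe8);
    uint32_t rel32 = (uint32_t)insn[1]
                   | ((uint32_t)insn[2] << 8)
                   | ((uint32_t)insn[3] << 16)
                   | ((uint32_t)insn[4] << 24);
    /* The displacement is relative to the end of the instruction. */
    return (uint32_t)(insn_addr + 5u + rel32);
}
```

For the bytes `e8 87 a4 22 50` this gives 0x082264b8 (adjustorCode) at 0xb7ffc02c and 0x082274b8 at 0xb7ffd02c, matching both disassemblies. The bytes have not changed; only the address they are viewed from has.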

comment:21 Changed 13 months ago by wgmitchener

Got it:

ghc-7.4.2/rts/Adjustor.c:380

createAdjustor calls allocateExec (rts/sm/Storage.c) which calls ffi_closure_alloc. So in createAdjustor, line 381, we should have (if I'm reading the libffi documentation correctly)

adjustorStub is a pointer in data address space to the adjustor stub
code is a pointer in code address space to the very same spot in memory

and sure enough they are off by 0x1000:

(gdb) print adjustorStub
$3 = (AdjustorStub *) 0xb7ffc00c
(gdb) print code
$4 = (void *) 0xb7ffd00c

which means the correct calculation of the relative call should be

*(long*)&adjustorStub->call[1] = ((char*)&adjustorCode) - ((char*)code + 5); // code instead of adjustorStub

Apparently code and data are done with different segment settings under SELinux. Chaos follows.

Going to rebuild GHC 7.4.2 with that change and see if this works...
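The effect of that one-word change can be checked with plain integer arithmetic. The sketch below is not the RTS code: `encode_rel32` is a hypothetical helper, and the numbers plugged in are the stub addresses from comment 20 (0xb7ffc02c for the writable view, 0xb7ffd02c for the executable view) and adjustorCode at 0x082264b8.

```c
#include <stdint.h>

/* Displacement to store after an 0xe8 opcode so that a call executed at
 * run_addr lands on target. Illustrative helper, not part of the GHC RTS.
 * run_addr must be the address the instruction will be executed from. */
uint32_t encode_rel32(uint32_t target, uint32_t run_addr)
{
    /* The CPU adds the displacement to the address of the next
     * instruction, which is 5 bytes past the start of the call. */
    return (uint32_t)(target - (run_addr + 5u));
}
```

Encoding against the writable address gives 0x5022a487, which is exactly the `87 a4 22 50` found in the faulting stub. Encoding against the executable address gives 0x50229487, which is 0x1000 smaller and makes the call land on adjustorCode when run from 0xb7ffd02c.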

comment:22 Changed 13 months ago by wgmitchener

Sorry, formatting of the last message went wrong:
createAdjustor calls allocateExec (rts/sm/Storage.c) which calls ffi_closure_alloc. So in createAdjustor, line 381, we should have (if I'm reading the libffi documentation correctly):

adjustorStub is a pointer in data address space to the adjustor stub

code is a pointer in code address space to the very same spot in memory

and the relative call needs to be calculated in code address space

comment:23 Changed 13 months ago by wgmitchener

Okay, it works!

I've attached a patch, going to do a few more tests. Now what?

Changed 13 months ago by wgmitchener

Patch for rts/Adjustor.c

comment:24 Changed 13 months ago by wgmitchener

More tests: My gtk-based simulation program works with the above patch on Fedora 17 with GHC 7.4.2.

By the way, the same mistake in Adjustor.c seems to be present in all later versions of GHC as well.

comment:25 follow-up: Changed 13 months ago by simonmar

  • Status changed from infoneeded to new

Well done for tracking this down!

Your fix looks good to me. Could someone validate and push please?

comment:26 Changed 13 months ago by wgmitchener

Before I forget: The pointer that gets returned after all of that is the data-space address rather than the code-space address, and I suppose that must be right so that the memory block can be deallocated later. But it sort of worries me that the call instruction to that data-space address works. Does the CPU or kernel recognize that the same memory is also mapped to a code-space address and make some correction?

comment:27 Changed 13 months ago by simonmar

The two addresses contain the same memory (double-mapped), but one is writable while the other is executable. This is how libffi works around the SELinux restrictions. On non-SELinux systems the code and data addresses are probably the same.

This function, createAdjustor, returns the code address, not the data address.
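A small sketch of what "double-mapped" means, for anyone who has not seen the trick. This is not libffi's implementation, which is more involved; `double_map` is a made-up name, and the second view is mapped read-only here rather than executable so that the demo also runs where the temp directory is mounted noexec. The point is only that one piece of memory shows up at two different addresses.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map one unlinked temporary file at two different virtual addresses.
 * Sketch only: an allocator like libffi's would request PROT_EXEC for
 * the second view; PROT_READ is used here for portability.
 * Returns 0 on success, -1 on failure. */
int double_map(size_t len, void **writable, void **readonly)
{
    FILE *f = tmpfile();                 /* backing file, already unlinked */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(f);
        return -1;
    }
    void *w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *r = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    fclose(f);                           /* mappings outlive the descriptor */
    if (w == MAP_FAILED || r == MAP_FAILED) {
        if (w != MAP_FAILED) munmap(w, len);
        if (r != MAP_FAILED) munmap(r, len);
        return -1;
    }
    *writable = w;
    *readonly = r;
    return 0;
}
```

A byte stored through `writable` is immediately visible through `readonly`, yet the two pointers differ. Any address arithmetic baked into generated code therefore has to use the address the code will run from, not the one it was written through.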

comment:28 in reply to: ↑ 25 Changed 13 months ago by wgmitchener

Replying to simonmar:

Well done for tracking this down!

Your fix looks good to me. Could someone validate and push please?

I'm new at this process: Is "validate and push" something I'm supposed to do or does someone on the inside of the GHC group do this?

comment:29 Changed 13 months ago by ian@…

commit 27cf625ab871f34434d9fe86cecf85a31f73f0e5

Author: Ian Lynagh <ian@well-typed.com>
Date:   Tue Apr 9 13:53:28 2013 +0100

    Fix segfaults on SELinux machines; fixes #7629
    
    Patch from wgmitchener.
    
    From the ticket:
    The two addresses (adjustorStub and code) contain the same memory
    (double-mapped), but one is writable while the other is executable.
    This is how libffi works around the SELinux restrictions. On
    non-SELinux systems the code and data addresses are probably the same.

 rts/Adjustor.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

comment:30 Changed 13 months ago by igloo

  • Resolution set to fixed
  • Status changed from new to closed

wgmitchener: It's something one of the GHC team does.

I've now validated and pushed; thanks for diagnosing it and sending the patch!

comment:31 Changed 3 months ago by juhpetersen

  • Cc simonmar added

I note for the record that this didn't make ghc-7.6.3 but will be in ghc-7.8.

I am finally backporting the patch to Fedora now.

wgmitchener: Thank you again for fixing this.
