Null dereference in the threaded runtime scheduler
With GHC and its runtime built from HEAD as of Tuesday (the issue also occurs with older builds I tried), the RTS crashes with a null dereference.
The system is a 24-processor Intel Xeon 2.66 GHz shared memory system running Linux 2.6.16.60-0.34-smp in 64 bit mode (from SUSE 10 SP2).
The crash only occurs when the program is compiled with -threaded (threaded runtime); it does not occur with the threaded debug runtime.
It also only occurs when +RTS -N2 (or a higher -N value) is passed to the runtime, and it does not occur when -g1 is passed to disable parallel garbage collection.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1241594176 (LWP 20751)]
0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x994370) at rts/Schedule.c:672
672 return (waiting_for_gc ||
(gdb) list
667 // - another thread is initiating a GC
668 // - another Task is returning from a foreign call
669 // - the thread at the head of the run queue cannot be run
670 // by this Task (it is bound to another Task, or it is unbound
671 // and this task it bound).
672 return (waiting_for_gc ||
673 cap->returning_tasks_hd != NULL ||
674 (!emptyRunQueue(cap) && (task->tso == NULL
675 ? cap->run_queue_hd->bound != NULL
676 : cap->run_queue_hd->bound != task)));
(gdb) print cap
$1 = (Capability *) 0x0
(gdb) print *task
$2 = {id = 1241594176, cap = 0x0, stopped = 7160552, suspended_tso = 0x0, tso = 0x7fccc4, stat = NoStatus, ret = 0x0, cond = {__data = {__lock = 1, __futex = 0,
__total_seq = 1, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x84e340, __nwaiters = 10033872, __broadcast_seq = 0},
__size = "\001\000\000\000\000\000\000\000\001", '\0' <repeats 23 times>, "@�\204\000\000\000\000\000�\032\231\000\000\000\000", __align = 1}, lock = {__data = {
__lock = 0, __count = 1, __owner = 0, __nusers = 0, __kind = 22668, __spins = 0, __list = {__prev = 0x5793, __next = 0x85afd}},
__size = "\000\000\000\000\001", '\0' <repeats 11 times>, "\214X\000\000\000\000\000\000\223W\000\000\000\000\000\000�Z\b\000\000\000\000", __align = 4294967296},
wakeup = 547420, elapsedtimestart = 202, muttimestart = 0, mut_time = 0, mut_etime = 0, gc_time = 0, gc_etime = 10033296, prev = 0x994370, next = 0x0, return_link = 0x0,
all_link = 0x0, prev_stack = 0x9945c0}
(gdb) info threads
* 21 Thread 1241594176 (LWP 20751) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x994370) at rts/Schedule.c:672
20 Thread 1233201472 (LWP 20750) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
19 Thread 1224808768 (LWP 20749) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b2df0) at rts/Schedule.c:672
18 Thread 1216416064 (LWP 20748) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b12b0) at rts/Schedule.c:672
17 Thread 1208023360 (LWP 20747) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9af770) at rts/Schedule.c:672
16 Thread 1199630656 (LWP 20746) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9adc30) at rts/Schedule.c:672
15 Thread 1191237952 (LWP 20745) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9ac0f0) at rts/Schedule.c:672
14 Thread 1182845248 (LWP 20744) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9aa5b0) at rts/Schedule.c:672
13 Thread 1174452544 (LWP 20743) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a8a70) at rts/Schedule.c:672
12 Thread 1166059840 (LWP 20742) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a6f30) at rts/Schedule.c:672
11 Thread 1157667136 (LWP 20741) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a53f0) at rts/Schedule.c:672
10 Thread 1149274432 (LWP 20740) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a38b0) at rts/Schedule.c:672
9 Thread 1140881728 (LWP 20739) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a1d70) at rts/Schedule.c:672
8 Thread 1132489024 (LWP 20738) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a0230) at rts/Schedule.c:672
7 Thread 1124096320 (LWP 20737) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99e6f0) at rts/Schedule.c:672
6 Thread 1115703616 (LWP 20736) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99cbb0) at rts/Schedule.c:672
5 Thread 1107310912 (LWP 20735) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99b070) at rts/Schedule.c:672
4 Thread 1098918208 (LWP 20734) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x999530) at rts/Schedule.c:672
3 Thread 1090525504 (LWP 20733) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9979f0) at rts/Schedule.c:672
2 Thread 1082132800 (LWP 20732) 0x00002aae2fa22548 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
1 Thread 46927615298704 (LWP 20729) 0x00002aae2fa20553 in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
The contents of *task and the thread in which the crash occurs vary from run to run, but the stack trace is consistent. For example, here is another crash:
(gdb) bt
#0 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
#1 0x0000000000673205 in workerStart (task=0x9915e0) at rts/Schedule.c:2033
#2 0x00002aae0cd1a143 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aae0ceee8cd in clone () from /lib64/libc.so.6
#4 0x0000000000000000 in ?? ()
(gdb) print *task
$6 = {id = 1233201472, cap = 0x0, stopped = 11791697, suspended_tso = 0x0, tso = 0x25ed0f1, stat = NoStatus, ret = 0x0, cond = {__data = {__lock = 1, __futex = 0,
__total_seq = 1, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x84e800, __nwaiters = 10033872, __broadcast_seq = 0},
__size = "\001\000\000\000\000\000\000\000\001", '\0' <repeats 24 times>, "�\204\000\000\000\000\000�\032\231\000\000\000\000", __align = 1}, lock = {__data = {__lock = 0,
__count = 1, __owner = 0, __nusers = 0, __kind = 875, __spins = 0, __list = {__prev = 0x33a, __next = 0x2da2b}},
__size = "\000\000\000\000\001", '\0' <repeats 11 times>, "k\003\000\000\000\000\000\000:\003\000\000\000\000\000\000+�\002\000\000\000\000", __align = 4294967296},
wakeup = 186902, elapsedtimestart = 24, muttimestart = 0, mut_time = 0, mut_etime = 0, gc_time = 0, gc_etime = 10033296, prev = 0x9b4930, next = 0x0, return_link = 0x0,
all_link = 0x0, prev_stack = 0x9b4b80}
(gdb) info threads
22 Thread 1249986880 (LWP 20846) 0x00002aae0cd20548 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
21 Thread 1241594176 (LWP 20845) 0x00002aae0cd1e1c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 20 Thread 1233201472 (LWP 20844) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
19 Thread 1224808768 (LWP 20843) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b2df0) at rts/Schedule.c:672
18 Thread 1216416064 (LWP 20842) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b12b0) at rts/Schedule.c:672
17 Thread 1208023360 (LWP 20841) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9af770) at rts/Schedule.c:672
16 Thread 1199630656 (LWP 20840) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9adc30) at rts/Schedule.c:672
15 Thread 1191237952 (LWP 20839) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9ac0f0) at rts/Schedule.c:672
14 Thread 1182845248 (LWP 20838) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9aa5b0) at rts/Schedule.c:672
13 Thread 1174452544 (LWP 20837) gcWorkerThread (cap=<value optimized out>) at includes/SpinLock.h:45
12 Thread 1166059840 (LWP 20836) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a6f30) at rts/Schedule.c:672
11 Thread 1157667136 (LWP 20835) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a53f0) at rts/Schedule.c:672
10 Thread 1149274432 (LWP 20834) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a38b0) at rts/Schedule.c:672
9 Thread 1140881728 (LWP 20833) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a1d70) at rts/Schedule.c:672
8 Thread 1132489024 (LWP 20832) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a0230) at rts/Schedule.c:672
7 Thread 1124096320 (LWP 20831) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99e6f0) at rts/Schedule.c:672
6 Thread 1115703616 (LWP 20830) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99cbb0) at rts/Schedule.c:672
5 Thread 1107310912 (LWP 20829) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99b070) at rts/Schedule.c:672
4 Thread 1098918208 (LWP 20827) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x999530) at rts/Schedule.c:672
3 Thread 1090525504 (LWP 20826) 0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9979f0) at rts/Schedule.c:672
2 Thread 1082132800 (LWP 20825) 0x00002aae0cee86e2 in select () from /lib64/libc.so.6
1 Thread 46927031233680 (LWP 20824) 0x00002aae0cd1c2ef in pthread_mutex_lock () from /lib64/libpthread.so.0
valgrind's memcheck reports no errors prior to the NULL dereference:
==20983== Thread 2:
==20983== Invalid read of size 8
==20983== at 0x67262C: schedule (Schedule.c:672)
==20983== by 0x673204: workerStart (Schedule.c:2033)
==20983== by 0x58E4142: start_thread (in /lib64/libpthread-2.4.so)
==20983== by 0x5AB78CC: clone (in /lib64/libc-2.4.so)
==20983== Address 0x1d0 is not stack'd, malloc'd or (recently) free'd
==20983==
==20983== Process terminating with default action of signal 11 (SIGSEGV)
valgrind's helgrind reports numerous possible data races, including one between schedule and GarbageCollect just before the crash:
==21106== Possible data race during read of size 8 at 0x5ccc300 by thread #5
==21106== at 0x672612: schedule (Schedule.c:367)
==21106== by 0x673204: workerStart (Schedule.c:2033)
==21106== by 0x4A24A1E: mythread_wrapper (hg_intercepts.c:194)
==21106== by 0x58E7142: start_thread (in /lib64/libpthread-2.4.so)
==21106== by 0x5ABA8CC: clone (in /lib64/libc-2.4.so)
==21106== This conflicts with a previous write of size 8 by thread #1
==21106== at 0x679B8D: GarbageCollect (SpinLock.h:55)
==21106== by 0x6712D0: scheduleDoGC (Schedule.c:1522)
==21106== by 0x672C77: schedule (Schedule.c:621)
==21106== by 0x66FF84: real_main (RtsMain.c:68)
==21106== by 0x67009D: hs_main (RtsMain.c:117)
==21106== by 0x5A17183: (below main) (in /lib64/libc-2.4.so)
I am still trying to distill this into a small program that reproduces the crash.