Non-deterministic hs_try_putmvar003 failure on CI
Wrong exit code for hs_try_putmvar003(threaded1)(expected 0 , actual 139 )
*** unexpected failure for hs_try_putmvar003(threaded1)
A quick search suggests that 139 might mean a segfault (SIGSEGV).
This is a non-deterministic CI failure that happened to me at least two times (Feb 12 and Feb 24), and forced me to restart the build job to get a green build status. Both times it was the validate-x86_64-linux-fedora27
job, but on different runners (ghc-ci-gce-x86_64-1
and maurer.smart-cactus.org
).
Note that another failure that seems to happen only on Fedora is #16350 – could it be a problem with the Fedora image?
Trac metadata
Trac field | Value |
---|---|
Version | 8.9 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Test Suite |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | RyanGlScott, bgamari, mpickering |
Operating system | |
Architecture |
- Show closed items
Relates to
- #134348.2.15
Activity
-
Newest first Oldest first
-
Show all activity Show comments only Show history only
- Vladislav Zavialov changed weight to 5
changed weight to 5
- Vladislav Zavialov added Tbug Trac import testsuite labels
added Tbug Trac import testsuite labels
- Maintainer
!438 (merged) proposes to skip this test.
- Maintainer
Oops, I intended to make ticket:16361#comment:168969 in a different ticket. Sorry about the noise.
- Ryan Scott added CI breakage label
added CI breakage label
- Ben Gamari marked this issue as related to #13434 (closed)
marked this issue as related to #13434 (closed)
- Ben Gamari marked this issue as related to #13722 (closed)
marked this issue as related to #13722 (closed)
- Ben Gamari marked this issue as related to #12800 (closed)
marked this issue as related to #12800 (closed)
- Maintainer
This has been reported a few times: #13434 (closed), #13722 (closed), #12800 (closed), and this ticket. The crash is always a segfault.
- Ben Gamari changed title from Non-deterministic hs_try_putmvar00 failure on CI to Non-deterministic hs_try_putmvar003 failure on CI
changed title from Non-deterministic hs_try_putmvar00 failure on CI to Non-deterministic hs_try_putmvar003 failure on CI
- Ben Gamari added Pnormal label
added Pnormal label
- Maintainer
Looking back at the test tracking statistics, it seems that this has crashed 7 times in the last few months:
ghc_test_tracking=# select * from results_view where test_name LIKE 'hs_try_putmvar003%' and result != 'success' order by time desc; job_name | test_name | time | commit_sha | result | other ----------------------------------+------------------------------+----------------------------+------------------------------------------+--------------------+------------------------------------ validate-x86_64-linux-fedora27 | hs_try_putmvar003(threaded1) | 2019-09-20 15:42:35.855432 | 30537d8a73f42bff1a01f2b69d761d4c955be2fc | unexpected_failure | {"message": "bad exit code (139)"} validate-x86_64-linux-fedora27 | hs_try_putmvar003(threaded1) | 2019-09-16 13:35:38.576944 | adfece0fc2cf886207d832901a842602f87fef18 | unexpected_failure | {"message": "bad exit code (139)"} validate-x86_64-linux-fedora27 | hs_try_putmvar003(threaded1) | 2019-08-27 22:41:03.606925 | ee2fad9e503ffdf61a086f721553aa3c502d1cb8 | unexpected_failure | {"message": "bad exit code (139)"} validate-aarch64-linux-deb9 | hs_try_putmvar003(threaded1) | 2019-08-16 09:47:59.982361 | 47e162374051ed3e874ed7916cc811df288cbd95 | unexpected_failure | {"message": "exit code non-0"} validate-x86_64-linux-fedora27 | hs_try_putmvar003(threaded1) | 2019-07-05 06:57:50.310483 | f002250abac21ee4e9c9e4d7bc05db8aa885a65d | unexpected_failure | {"message": "bad exit code (134)"} validate-x86_64-linux-deb9-debug | hs_try_putmvar003(threaded1) | 2019-07-02 23:52:21.367096 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"} validate-x86_64-linux-deb9-debug | hs_try_putmvar003(threaded2) | 2019-07-02 23:52:21.367096 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"} release-x86_64-linux-deb9-dwarf | hs_try_putmvar003(threaded1) | 2019-07-02 23:04:56.738285 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"} (8 rows)
Typically the failure is in the
threaded1
way although once it also crashed inthreaded2
.Edited by Ben Gamari - Ben Gamari mentioned in commit 04c7bf51
mentioned in commit 04c7bf51
- Ben Gamari mentioned in merge request !1819 (closed)
mentioned in merge request !1819 (closed)
- Maintainer
Marked
hs_try_putmvar003
as fragile inthreaded1
in !1819 (closed). - Maintainer
https://gitlab.haskell.org/ghc/ghc/-/jobs/165303 failed in
threaded2
it seems. This test seems to have gotten significantly less reliable recently. I'll investigate. - Maintainer
Ahh, this looks bad:
$ _build/ghc-stage1 testsuite/tests/concurrent/should_run/hs_try_putmvar003.hs -threaded testsuite/tests/concurrent/should_run/hs_try_putmvar003_c.c -rtsopts -debug $ while true; do testsuite/tests/concurrent/should_run/hs_try_putmvar003 2 64 16 500 +RTS -N4 || break; echo done; done hs_try_putmvar003: internal error: MVAR object (0x42000091e0) entered! (GHC version 8.9.0.20190919 for x86_64_unknown_linux) Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug Aborted
Edited by Ben Gamari - Maintainer
The failure looks like:
>>> bt #0 0x00007f468747abe0 in raise () from /nix/store/g2p6fwjc995jrq3d8vph7k45l9zhdf8f-glibc-2.27/lib/libc.so.6 #1 0x00007f468747bdc1 in abort () from /nix/store/g2p6fwjc995jrq3d8vph7k45l9zhdf8f-glibc-2.27/lib/libc.so.6 #2 0x0000000000728b54 in rtsFatalInternalErrorFn (s=0x7b9bb0 "MVAR object (%p) entered!", ap=0x7f46575f9c60) at rts/RtsMessages.c:186 #3 0x000000000072870b in barf (s=0x7b9bb0 "MVAR object (%p) entered!") at rts/RtsMessages.c:48 #4 0x0000000000762ce3 in stg_MVAR_DIRTY_info () #5 0x0000000000000000 in ?? () >>> up 4 #4 0x0000000000762ce3 in stg_MVAR_DIRTY_info () >>> x/8a $rbp 0x4200325378: 0x406790 0x764860 <stg_upd_frame_info> 0x4200325388: 0x42001066d0 0x558218 <base_GHCziMVar_tryPutMVar1_info+56> 0x4200325398: 0x884421 0x4583a8 <base_GHCziBase_zdfFunctorIO1_info+64> 0x42003253a8: 0x884421 0x75e170 <stg_catch_frame_info>
This appears to happen whenever the first command-line argument to the executable is 2.
- Maintainer
Minimal reproducer:
$ _build/ghc-stage1 testsuite/tests/concurrent/should_run/hs_try_putmvar003.hs -threaded testsuite/tests/concurrent/should_run/hs_try_putmvar003_c.c -rtsopts -debug -g3 $ nix run -f ~/ghc-utils -c ghc-rr testsuite/tests/concurrent/should_run/hs_try_putmvar003 2 1 1 1 +RTS -N1
Edited by Ben Gamari - Ben Gamari mentioned in commit 81631d04
mentioned in commit 81631d04
- Ben Gamari mentioned in commit bb984ac6
mentioned in commit bb984ac6
- Maintainer
While looking at !1603 (closed) I found the following data race:
WARNING: ThreadSanitizer: data race (pid=6043) Write of size 1 at 0x7b1c0001dff0 by thread T32: #0 pthread_mutex_destroy <null> (libtsan.so.0+0x00000002c95b) #1 destroyCallbackQueue <null> (hs_try_putmvar003+0x000000416afc) #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e) #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875) #4 <null> <null> (libtsan.so.0+0x000000028d5b) Previous atomic read of size 1 at 0x7b1c0001dff0 by thread T26 (mutexes: write M52208230880894960): #0 pthread_mutex_unlock <null> (libtsan.so.0+0x00000003c60b) #1 callback <null> (hs_try_putmvar003+0x000000416a56) Location is heap block of size 104 at 0x7b1c0001dff0 allocated by thread T24: #0 malloc <null> (libtsan.so.0+0x00000002b251) #1 mkCallbackQueue <null> (hs_try_putmvar003+0x000000416a90) #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e) #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875) #4 <null> <null> (libtsan.so.0+0x000000028d5b) Mutex M52208230880894960 is already destroyed. Thread T32 (tid=6209, running) created by thread T24 at: #0 pthread_create <null> (libtsan.so.0+0x00000002c010) #1 createOSThread rts/posix/OSThreads.c:137 (hs_try_putmvar003+0x00000076ec2f) #2 startWorkerTask rts/Task.c:497 (hs_try_putmvar003+0x00000074628a) #3 releaseCapability_ rts/Capability.c:567 (hs_try_putmvar003+0x00000072c257) #4 suspendThread rts/Schedule.c:2418 (hs_try_putmvar003+0x00000073e384) #5 <null> <null> (hs_try_putmvar003+0x0000004136b2) #6 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e) #7 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875) #8 <null> <null> (libtsan.so.0+0x000000028d5b) Thread T26 (tid=6116, finished) created by thread T24 at: #0 pthread_create <null> (libtsan.so.0+0x00000002c010) #1 mkCallbackQueue <null> (hs_try_putmvar003+0x000000416ad2) #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e) #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875) #4 <null> <null> (libtsan.so.0+0x000000028d5b) Thread T24 (tid=6111, finished) created by thread T7 at: #0 pthread_create <null> (libtsan.so.0+0x00000002c010) #1 createOSThread rts/posix/OSThreads.c:137 (hs_try_putmvar003+0x00000076ec2f) #2 startWorkerTask rts/Task.c:497 (hs_try_putmvar003+0x00000074628a) #3 releaseCapability_ rts/Capability.c:567 (hs_try_putmvar003+0x00000072c257) #4 suspendThread rts/Schedule.c:2418 (hs_try_putmvar003+0x00000073e384) #5 <null> <null> (hs_try_putmvar003+0x0000004144e1) #6 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e) #7 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875) #8 <null> <null> (libtsan.so.0+0x000000028d5b) SUMMARY: ThreadSanitizer: data race (/nix/store/c7hj2bk4aqgpb3q0h5xhq7lag0lq3jm7-gcc-7.4.0-lib/lib/libtsan.so.0+0x2c95b) in pthread_mutex_destroy ``` This very well may be relevant.
Edited by Ben Gamari - Ben Gamari mentioned in merge request !1731 (closed)
mentioned in merge request !1731 (closed)
- Developer
Could this be related to #18919 (closed)? And perhaps fixed by !4458 (closed)?
- Ben Gamari mentioned in merge request !6882 (closed)
mentioned in merge request !6882 (closed)