Non-deterministic hs_try_putmvar003 failure on CI

changed weight to 5

!438 (merged) proposes to skip this test.

Oops, I intended to make ticket:16361#comment:168969 in a different ticket. Sorry about the noise.

marked this issue as related to #13434 (closed)

marked this issue as related to #13722 (closed)

marked this issue as related to #12800 (closed)

This has been reported a few times: #13434 (closed), #13722 (closed), #12800 (closed), and this ticket. The crash is always a segfault.

changed title from Non-deterministic hs_try_putmvar00 failure on CI to Non-deterministic hs_try_putmvar003 failure on CI

added Pnormal label

Looking back at the test tracking statistics, it seems that this has crashed 7 times in the last few months:

ghc_test_tracking=# select * from results_view where test_name LIKE 'hs_try_putmvar003%' and result != 'success' order by time desc;
             job_name             |          test_name           |            time            |                commit_sha                |       result       |               other                
----------------------------------+------------------------------+----------------------------+------------------------------------------+--------------------+------------------------------------
 validate-x86_64-linux-fedora27   | hs_try_putmvar003(threaded1) | 2019-09-20 15:42:35.855432 | 30537d8a73f42bff1a01f2b69d761d4c955be2fc | unexpected_failure | {"message": "bad exit code (139)"}
 validate-x86_64-linux-fedora27   | hs_try_putmvar003(threaded1) | 2019-09-16 13:35:38.576944 | adfece0fc2cf886207d832901a842602f87fef18 | unexpected_failure | {"message": "bad exit code (139)"}
 validate-x86_64-linux-fedora27   | hs_try_putmvar003(threaded1) | 2019-08-27 22:41:03.606925 | ee2fad9e503ffdf61a086f721553aa3c502d1cb8 | unexpected_failure | {"message": "bad exit code (139)"}
 validate-aarch64-linux-deb9      | hs_try_putmvar003(threaded1) | 2019-08-16 09:47:59.982361 | 47e162374051ed3e874ed7916cc811df288cbd95 | unexpected_failure | {"message": "exit code non-0"}
 validate-x86_64-linux-fedora27   | hs_try_putmvar003(threaded1) | 2019-07-05 06:57:50.310483 | f002250abac21ee4e9c9e4d7bc05db8aa885a65d | unexpected_failure | {"message": "bad exit code (134)"}
 validate-x86_64-linux-deb9-debug | hs_try_putmvar003(threaded1) | 2019-07-02 23:52:21.367096 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"}
 validate-x86_64-linux-deb9-debug | hs_try_putmvar003(threaded2) | 2019-07-02 23:52:21.367096 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"}
 release-x86_64-linux-deb9-dwarf  | hs_try_putmvar003(threaded1) | 2019-07-02 23:04:56.738285 | eba2ee31fac9309ce3d93d3d22ba9dd11e29c02b | unexpected_failure | {"message": "exit code non-0"}
(8 rows)

Typically the failure is in the threaded1 way although once it also crashed in threaded2.

mentioned in commit 04c7bf51

mentioned in merge request !1819 (closed)

Marked hs_try_putmvar003 as fragile in threaded1 in !1819 (closed).

https://gitlab.haskell.org/ghc/ghc/-/jobs/165303 failed in threaded2 it seems. This test seems to have gotten significantly less reliable recently. I'll investigate.

Ahh, this looks bad:

$ _build/ghc-stage1 testsuite/tests/concurrent/should_run/hs_try_putmvar003.hs -threaded testsuite/tests/concurrent/should_run/hs_try_putmvar003_c.c  -rtsopts -debug
$ while true; do testsuite/tests/concurrent/should_run/hs_try_putmvar003 2 64 16 500 +RTS -N4 || break; echo done; done
hs_try_putmvar003: internal error: MVAR object (0x42000091e0) entered!
    (GHC version 8.9.0.20190919 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Aborted

The failure looks like:

>>> bt
#0  0x00007f468747abe0 in raise () from /nix/store/g2p6fwjc995jrq3d8vph7k45l9zhdf8f-glibc-2.27/lib/libc.so.6
#1  0x00007f468747bdc1 in abort () from /nix/store/g2p6fwjc995jrq3d8vph7k45l9zhdf8f-glibc-2.27/lib/libc.so.6
#2  0x0000000000728b54 in rtsFatalInternalErrorFn (s=0x7b9bb0 "MVAR object (%p) entered!", ap=0x7f46575f9c60) at rts/RtsMessages.c:186
#3  0x000000000072870b in barf (s=0x7b9bb0 "MVAR object (%p) entered!") at rts/RtsMessages.c:48
#4  0x0000000000762ce3 in stg_MVAR_DIRTY_info ()
#5  0x0000000000000000 in ?? ()
>>> up 4
#4  0x0000000000762ce3 in stg_MVAR_DIRTY_info ()
>>> x/8a $rbp
0x4200325378:   0x406790        0x764860 <stg_upd_frame_info>
0x4200325388:   0x42001066d0    0x558218 <base_GHCziMVar_tryPutMVar1_info+56>
0x4200325398:   0x884421        0x4583a8 <base_GHCziBase_zdfFunctorIO1_info+64>
0x42003253a8:   0x884421        0x75e170 <stg_catch_frame_info>

This appears to happen whenever the first command-line argument to the executable is 2.

Minimal reproducer:

$ _build/ghc-stage1 testsuite/tests/concurrent/should_run/hs_try_putmvar003.hs -threaded testsuite/tests/concurrent/should_run/hs_try_putmvar003_c.c  -rtsopts -debug -g3
$ nix run -f ~/ghc-utils -c ghc-rr testsuite/tests/concurrent/should_run/hs_try_putmvar003 2 1 1 1 +RTS -N1

mentioned in commit 81631d04

mentioned in commit bb984ac6

While looking at !1603 (closed) I found the following data race:

WARNING: ThreadSanitizer: data race (pid=6043)
  Write of size 1 at 0x7b1c0001dff0 by thread T32:
    #0 pthread_mutex_destroy <null> (libtsan.so.0+0x00000002c95b)
    #1 destroyCallbackQueue <null> (hs_try_putmvar003+0x000000416afc)
    #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e)                                                                                                                                                                                                             
    #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875)
    #4 <null> <null> (libtsan.so.0+0x000000028d5b)

  Previous atomic read of size 1 at 0x7b1c0001dff0 by thread T26 (mutexes: write M52208230880894960):
    #0 pthread_mutex_unlock <null> (libtsan.so.0+0x00000003c60b)                                                                                                                                                                                                                         
    #1 callback <null> (hs_try_putmvar003+0x000000416a56)

  Location is heap block of size 104 at 0x7b1c0001dff0 allocated by thread T24:                                                                                                                                                                                                          
    #0 malloc <null> (libtsan.so.0+0x00000002b251)
    #1 mkCallbackQueue <null> (hs_try_putmvar003+0x000000416a90)
    #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e)                                                                                                                                                                                                             
    #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875)
    #4 <null> <null> (libtsan.so.0+0x000000028d5b)

  Mutex M52208230880894960 is already destroyed.                                                                                                                                                                                                                                         

  Thread T32 (tid=6209, running) created by thread T24 at:
    #0 pthread_create <null> (libtsan.so.0+0x00000002c010)
    #1 createOSThread rts/posix/OSThreads.c:137 (hs_try_putmvar003+0x00000076ec2f)                                                                                                                                                                                                       
    #2 startWorkerTask rts/Task.c:497 (hs_try_putmvar003+0x00000074628a)
    #3 releaseCapability_ rts/Capability.c:567 (hs_try_putmvar003+0x00000072c257)
    #4 suspendThread rts/Schedule.c:2418 (hs_try_putmvar003+0x00000073e384)
    #5 <null> <null> (hs_try_putmvar003+0x0000004136b2)                                                                                                                                                                                                                                  
    #6 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e)
    #7 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875)
    #8 <null> <null> (libtsan.so.0+0x000000028d5b)
                                                                                                                                                                                                                                                                                         
  Thread T26 (tid=6116, finished) created by thread T24 at:
    #0 pthread_create <null> (libtsan.so.0+0x00000002c010)
    #1 mkCallbackQueue <null> (hs_try_putmvar003+0x000000416ad2)
    #2 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e)
    #3 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875)                                                                                                                                                                                                                     
    #4 <null> <null> (libtsan.so.0+0x000000028d5b)

  Thread T24 (tid=6111, finished) created by thread T7 at:
    #0 pthread_create <null> (libtsan.so.0+0x00000002c010)
    #1 createOSThread rts/posix/OSThreads.c:137 (hs_try_putmvar003+0x00000076ec2f)                                                                                                                                                                                                       
    #2 startWorkerTask rts/Task.c:497 (hs_try_putmvar003+0x00000074628a)
    #3 releaseCapability_ rts/Capability.c:567 (hs_try_putmvar003+0x00000072c257)
    #4 suspendThread rts/Schedule.c:2418 (hs_try_putmvar003+0x00000073e384)                                                                                                                                                                                                              
    #5 <null> <null> (hs_try_putmvar003+0x0000004144e1)
    #6 scheduleWorker rts/Schedule.c:2570 (hs_try_putmvar003+0x00000073eb9e)
    #7 workerStart rts/Task.c:445 (hs_try_putmvar003+0x000000745875)                                                                                                                                                                                                                     
    #8 <null> <null> (libtsan.so.0+0x000000028d5b)

SUMMARY: ThreadSanitizer: data race (/nix/store/c7hj2bk4aqgpb3q0h5xhq7lag0lq3jm7-gcc-7.4.0-lib/lib/libtsan.so.0+0x2c95b) in pthread_mutex_destroy                       ```

This very well may be relevant.

mentioned in merge request !1731 (closed)

Could this be related to #18919 (closed)? And perhaps fixed by !4458 (closed)?

mentioned in merge request !6882 (closed)

Trac field	Value
Version	8.9
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Test Suite
Test case
Differential revisions
BlockedBy
Related
Blocking
CC	RyanGlScott, bgamari, mpickering
Operating system
Architecture

Non-deterministic hs_try_putmvar003 failure on CI

Child items 0

Activity

Non-deterministic hs_try_putmvar003 failure on CI

Relates to

Activity