Opened 3 years ago

Closed 3 years ago

#9722 closed bug (fixed)

ghcirun004 intermittently fails with ghc: ioManagerWakeup: write: Bad file descriptor

Reported by: ezyang Owned by: simonmar
Priority: high Milestone: 7.10.1
Component: Runtime System Version: 7.9
Keywords: Cc: hvr, AndreasVoellmy, edsko, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case: ghcirun004
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D714, Phab:D722
Wiki Page:

Description

I was recently repeatedly running the validate test suite and got this test to fail:

 Actual stderr output differs from expected:  
--- /dev/null   2014-02-15 17:35:19.578872448 -0800  
+++ ./ghci/should_run/ghcirun004.run.stderr     2014-10-24 14:46:15.289211100 -0700  
@@ -0,0 +1 @@  
+ghc: ioManagerWakeup: write: Bad file descriptor  
*** unexpected failure for ghcirun004(ghci)  

Might be related to #5443.

Attachments (2)

T9722.Ds.log (6.2 KB) - added by thomie 3 years ago.
./T9722 +RTS -I0.0001 -Ds
Control.hs (6.5 KB) - added by AndreasVoellmy 3 years ago.
Modification to explore problem.

Download all attachments as: .zip

Change History (32)

comment:1 Changed 3 years ago by simonpj

I'm seeing this too, on Linux. Also on test T2589.

Simon

comment:2 Changed 3 years ago by simonpj

Does anyone have a clue what is going on here? I regularly get a batch of failures

Unexpected failures:
   ffi/should_run           ffi018_ghci [bad stdout or stderr] (ghci)
   ghci/should_run          T2589 [bad stdout or stderr] (ghci)
   ghci/should_run          ghcirun001 [bad stdout or stderr] (ghci)
   ghci/should_run          ghcirun002 [bad exit code] (ghci)
   ghci/should_run          ghcirun003 [bad stdout or stderr] (ghci)
   ghci/should_run          ghcirun004 [bad stdout or stderr] (ghci)

all of which are this ioManagerWakeup:write: Bad file descriptor thing.

What is ioManagerWakeup? Who calls it? Why? What does it mean to have a bad file descriptor?

It's bad because I can't get a clean validate these days.

Incidentally

make TEST=ghcirun001

works fine!

Simon

comment:3 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1
Priority: normalhighest

comment:4 Changed 3 years ago by AndreasVoellmy

Cc: AndreasVoellmy added

comment:5 Changed 3 years ago by AndreasVoellmy

I haven't been able to recreate this on HEAD (a2e7bbfe7656cf7dbf1af4da5c077ac0b5d41127) on Ubuntu 14.04.1 LTS. Is there a test that fails most of the time? If so, how do you run it to get it to fail frequently? Or are these test failures occurring only rarely?

comment:6 Changed 3 years ago by simonpj

For me they never occur when running the test on its own. They only occur during sh validate --fast. The difference, perhaps, is that many tests are running at the same time.

comment:7 Changed 3 years ago by thomie

Harbormaster just failed on ghcirun004 as well.

comment:8 Changed 3 years ago by ezyang

I haven't seen this bug recently, but I don't see any commits which might have solved it.

comment:9 Changed 3 years ago by simonpj

I haven't seen it recently either.

Simon

comment:10 Changed 3 years ago by thoughtpolice

Priority: highesthigh

comment:11 Changed 3 years ago by thoughtpolice

Priority: highesthigh

comment:12 Changed 3 years ago by edsko

Cc: edsko added

comment:13 Changed 3 years ago by edsko

Just FYI, I just saw this while using 7.10 to compile my project.

comment:14 Changed 3 years ago by thomie

Cc: simonmar added
Component: GHCiRuntime System
Owner: set to simonmar

I can reproduce the notorious 'ioManagerWakeup' bug. Not for ghci004 itself, but for some other ghci tests.

On my puny laptop with only 2 cores, I do the following:

  1. Make sure the cpu is plenty busy, by running ghc -e [0..] > /dev/null in 3 different shells.
  1. Run make test EXTRA_HC_OPTS='+RTS -I0.1 -RTS' TEST='print002 print003 print006 print007 print008 print010'

For me, all tests always fail with:

"ioManagerWakup: write: Bad file descriptor"

Step 2 sets the 'GC idle time' for ghci script tests to 0.1 seconds (from the default for GHC/GHCi of 5 seconds, see #3408); the same value that is used for other tests with WAY=ghci (among which ghci004).

The necessity of step 1 might explain why we don't see the bug when running a single test in isolation, but only when running validate, which runs a lot of tests at the same time.

comment:15 Changed 3 years ago by thomie

Differential Rev(s): Phab:D714
Status: newpatch

I have a workaround patch which sets the 'GC idle time' for ghci tests to 0.3 seconds, which is the default for user programs. This prevents the bug from triggering.

We can keep this ticket open until the real bug is fixed.

comment:16 Changed 3 years ago by thomie

Smaller test case (without any cpu loading) for the real bug:

$ cat T9722.hs
main = return ()

$ ghc-7.10.0.20150307 -threaded -rtsopts T9722.hs

$ ./T9722 +RTS -I0.000000001
T9722: ioManagerWakeup: write: Bad file descriptor

This only happens ~50% of the time, and it doesn't happen with 7.8.4.

comment:17 Changed 3 years ago by AndreasVoellmy

Where is T9722.hs? Is it in master branch somewhere?

comment:18 in reply to:  17 Changed 3 years ago by AndreasVoellmy

Replying to AndreasVoellmy:

Where is T9722.hs? Is it in master branch somewhere?

Nevermind. I see the contents of T9722 from your message.

comment:19 Changed 3 years ago by AndreasVoellmy

I wasn't able to cause the problem with either T9722 with the arguments mentioned above or with the test with print002, etc. I tried with a fresh copy of ghc (version 7.11.20150307). I'll try to create it again with sh validate --fast.

comment:20 Changed 3 years ago by thomie

Andreas: if your computer has more than 2 cores you might need to start more copies of ghc -e [0..] > /dev/null. Another good one to keep the cpu is busy is to compile ghc with -j<ncores> at the same time. Do all of that, and then run the print002 test with +RTS -I<some small number>. It should work...

comment:21 Changed 3 years ago by AndreasVoellmy

I still haven't been able to recreate this. What OS are you running and can you post which commit of ghc you are working with?

Changed 3 years ago by thomie

Attachment: T9722.Ds.log added

./T9722 +RTS -I0.0001 -Ds

comment:22 Changed 3 years ago by thomie

I am on 64 bit Ubuntu 14.04. I can reproduce with ghc-7.10.0.20150307 and ghc-7.11.20150307 from https://launchpad.net/~hvr/+archive/ubuntu/ghc (ghc-head).

I have attached the output from running ./T9722 +RTS -I0.0001 -Ds.

comment:23 Changed 3 years ago by AndreasVoellmy

I'll keep trying to reproduce... in the meantime, can you try to recompile ghc with the following definition in rts/posix/Signals.c? Then recreate the bug and we will hopefully have a bit more info to help debug:

void
ioManagerWakeup (void)
{
    int r;
    int fd;
    // Wake up the IO Manager thread by sending a byte down its pipe
    if (io_manager_wakeup_fd >= 0) {
#if defined(HAVE_EVENTFD)
        StgWord64 n = (StgWord64)IO_MANAGER_WAKEUP;
        fd = io_manager_wakeup_fd;
        r = write(fd, (char *) &n, 8);
#else
        StgWord8 byte = (StgWord8)IO_MANAGER_WAKEUP;
        fd = io_manager_wakeup_fd;
        r = write(fd, &byte, 1);
#endif
        if (r == -1) {
            perror("ioManagerWakeup: write(A)");
            sysErrorBelch("ioManagerWakeup: write; fd: %d", fd);
        }
    }
}

comment:24 Changed 3 years ago by thomie

I build a fresh ghc head (commit 8b7534b39052c9cb44411bea0ca311a751564d6c) with the change to ioManagerWakeup from comment:23.

$ ~/ghc-master-build/inplace/bin/ghc-stage2 -threaded -rtsopts T9722.hs
$ ./T9722 +RTS -I0.000000001
ioManagerWakeup: write(A): Bad file descriptor
T9722: ioManagerWakeup: write; fd: 9: Bad file descriptor

I don't understand why the output is different when I redirect it to a file. But in case it gives you a clue, here it is:

$ ./T9722 +RTS -I0.000000001 > T9722.stderr
$ cat T9722.stderr
ioManagerWakeup: write(A): Bad file descriptor
T9722: ioManagerWakeup: write; fd: 9: Invalid argument

Changed 3 years ago by AndreasVoellmy

Attachment: Control.hs added

Modification to explore problem.

comment:25 Changed 3 years ago by AndreasVoellmy

Thanks thomie! I think I see what is happening now: the TimerManager registers a file descriptor with the RTS for use by the ioManagerWakeup() function in rts/posix/Signals.c. When the TimerManager shuts down, it closes this file descriptor. I think there is a race condition where rts/Schedule.c calls ioManagerWakeup() after the TimerManager has shutdown. The TimerManager should only shutdown when the program is exiting, so it's probably safe to simply silently ignore the failing write, although this probably needs some review.

In the meantime, I've modified GHC/Event/Control.hs to write -1 to io_manager_wakeup_fd just before it closes the file descriptor. This should address the race condition. Can you compile your ghc with the attached Control.hs and check whether it makes the problem go away for you?

Thanks!

comment:26 Changed 3 years ago by thomie

I ran all the tests I mentioned before multiple times, and they all succeeded. Hurray!

Before shipping: I did get a build warning about unused UNPACK in W.

comment:27 Changed 3 years ago by AndreasVoellmy

OK great!

To summarize: Whenever the RTS has been inactive for idleGCDelayTime, the idle timer fires and calls wakeUpRts(), which in turn calls ioManagerWakeup(), which in turn writes a byte (or a few) to a file descriptor (stored in the io_manager_wakeup_fd variable) registered by the TimerManager and on which the TimerManager will wait. (Note that the write will only occur if the file descriptor is non-negative.) When the RTS shuts down, it shuts down the TimerManager, and in this process the file descriptor stored in io_manager_wakeup_fd is closed. In the error case, the idle timer fires after the close of the file occurs, and then the write() call in ioManagerWakeup() fails and the aforementioned error message gets printed.

I think the solution in the Control.hs modification is solid; it addresses the problem without silently ignoring other error conditions due to the write(). Another solution might be to disable the idle timer once the program shutdown sequence has started, though that could have more far-reaching effects, so I think the minor Control.hs modification is good for now.

I'll clean it up, add a comment or two, and then make a patch.

comment:28 Changed 3 years ago by Andreas Voellmy <andreas.voellmy@…>

In 74625d6847e970e8bdc6991c327515b3e10b231b/ghc:

RTS/IOManager: fix trac issue #9722.

Summary:
Whenever the RTS has been inactive for idleGCDelayTime, the idle timer
fires and calls wakeUpRts(), which in turn calls ioManagerWakeup(),
which in turn writes a byte (or a few) to a file descriptor (stored in
the io_manager_wakeup_fd variable) registered by the TimerManager and
on which the TimerManager will wait. (Note that the write will only
occur if the file descriptor is non-negative.) When the RTS shuts
down, it shuts down the TimerManager, and in this process the file
descriptor stored in io_manager_wakeup_fd is closed. In the error
case, the idle timer fires after the close of the file occurs, and
then the write() call in ioManagerWakeup() fails and the
aforementioned error message gets printed.

This patch solves the problem by (1) having the TimerManager (via
Control) write -1 to io_manager_wakeup_fd just before closing the file
descriptor written in io_manager_wakeup_fd, and (2) having
ioManagerWakeup() ignore an error returned by write() in the case that
the write returned -1 and the io_manager_wakeup_fd is -1.

Reviewers: austin, simonmar, hvr, thomie

Reviewed By: thomie

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D722

GHC Trac Issues: #9722

comment:29 Changed 3 years ago by AndreasVoellmy

Differential Rev(s): Phab:D714Phab:D714, Phab:D722
Status: patchmerge

comment:30 Changed 3 years ago by thoughtpolice

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.