Opened 4 years ago

Closed 13 months ago

Last modified 13 months ago

#4934 closed bug (fixed)

threadWaitRead works incorrectly on nonthreaded RTS

Reported by: slyfox Owned by:
Priority: normal Milestone: 7.10.1
Component: Runtime System Version: 7.0.1
Keywords: Cc: tibbe, bos, simonmar
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Incorrect result at runtime Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

I found out it in xmobar on ghc-7.0.1.
When I ran it I got the following message:

xmobar: file descriptor 8954336 out of range for select (0--1024).
Recompile with -threaded to work around this.

-threaded option works, but the message is absolutely misleading.

I've decided to track where so large descriptor could come,
but there was no such place.

All xmobar does is:

    do x11connection_fd <- connectionNumber d
       threadWaitRead (Fd fd) -- here we die

x11connection_fd has value 3 at the time of call.
This value comes from libX11 (via FFI, so I suspect it's a problem).

Another example (might be unrelated):

-- a.hs 
import Control.Concurrent

-- 12 is picked at random as guaranteed to be invalid FD
main = threadWaitRead 12 >> print "yet input!"

Building:

$ ghc --make -fforce-recomp a.hs -o a
[1 of 1] Compiling Main             ( a.hs, a.o )
Linking a ...

$ ghc --make -fforce-recomp a.hs -threaded -o a.threaded
[1 of 1] Compiling Main             ( a.hs, a.o )
Linking a.threaded ...

Running:

$ ./a.threaded 
a.threaded: epollControl: invalid argument (Bad file descriptor)

Looks good.

And nonthreaded:

$ ./a
"yet input!"

According to strace there was an error but it was not reported.

sf tmp # strace -etrace='select,write' ./a >/dev/null
select(13, [12], [], NULL, {134, 217727}) = -1 EBADF (Bad file descriptor) # this one

select(2, [], [1], NULL, {0, 0})        = 1 (out [1], left {0, 0})
write(1, "\"yet input!\"\n", 13)        = 13

Side note: Haskeline seems to workaround this problem on it's own way.

libraries/haskeline/System/Console/Haskeline/Backend/Posix.hsc

-- Different versions of ghc work better using different functions.
blockUntilInput :: Handle -> IO ()
#if __GLASGOW_HASKELL__ >= 611
-- threadWaitRead doesn't work with the new ghc IO library,
-- because it keeps a buffer even when NoBuffering is set.
blockUntilInput h = hWaitForInput h (-1) >> return ()
#else
-- hWaitForInput doesn't work with -threaded on ghc < 6.10
-- (#2363 in ghc's trac)
blockUntilInput h = unsafeHandleToFd h >>= threadWaitRead
#endif

Attachments (7)

test_it.tar.gz (4.3 KB) - added by slyfox 4 years ago.
test_it.tar.gz - selfcontained example with weird fault
bad_fd.hs (141 bytes) - added by slyfox 4 years ago.
bad_fd.hs - should issue an error with nonthreaded RTS, but does not
0001-rts-posix-Select.c-remove-unused-select_succeeded.patch (1016 bytes) - added by slyfox 14 months ago.
0001-rts-posix-Select.c-remove-unused-select_succeeded.patch
0002-rts-posix-Select.c-search-for-EBADFs.patch (3.0 KB) - added by slyfox 14 months ago.
0002-rts-posix-Select.c-search-for-EBADFs.patch
0003-rts-posix-Select.c-throw-exception-to-reader-as-soon.patch (6.1 KB) - added by slyfox 13 months ago.
0003-rts-posix-Select.c-throw-exception-to-reader-as-soon.patch - adds exception throw
0004-rts-posix-Select.c-don-t-put-thread-with-pending-exc.patch (2.0 KB) - added by slyfox 13 months ago.
0004-rts-posix-Select.c-don-t-put-thread-with-pending-exc.patch - fixes last problem with run queue
0001-Raise-exceptions-when-blocked-in-bad-FDs-fixes-Trac-.patch (11.9 KB) - added by slyfox 13 months ago.
0001-Raise-exceptions-when-blocked-in-bad-FDs-fixes-Trac-.patch

Download all attachments as: .zip

Change History (36)

comment:1 Changed 4 years ago by slyfox

  • Architecture changed from Unknown/Multiple to x86_64 (amd64)
  • Operating System changed from Unknown/Multiple to Linux
  • Type of failure changed from None/Unknown to Incorrect result at runtime

Changed 4 years ago by slyfox

test_it.tar.gz - selfcontained example with weird fault

comment:2 Changed 4 years ago by slyfox

The output of sample:

$ ./test_me.sh 

[1 of 3] Compiling Commands         ( Commands.hs, obj/Commands.o )
[2 of 3] Compiling Xmobar           ( Xmobar.hs, obj/Xmobar.o )

Xmobar.hs:13:1:
    Warning: In the use of `block'
             (imported from Control.Exception, but defined in GHC.IO):
             Deprecated: "use Control.Exception.mask instead"

Xmobar.hs:13:1:
    Warning: In the use of `unblock'
             (imported from Control.Exception, but defined in GHC.IO):
             Deprecated: "use Control.Exception.mask instead"
[3 of 3] Compiling Main             ( Main.hs, obj/Main.o )
Linking test_it ...
Entering event loop
"eventLoop IN"
"MAIN: unblocking"
"forkIO: startCommand: runCom \"date\" [] \"\" 10""forkIO: startCommand: runStdinReader""forkIO: checker spawned"


"WAIT IO: 1"
"SIR: got line""CHECKER: IN"

"CHECKER: throw WakeUp -> me"
"CHECKER: thrown"
"CHECKER: IN"
"MAIN: woke up!"
"MAIN: unblocking"
"WAIT IO: 1"
test_it: file descriptor 7264944 out of range for select (0--1024).
Recompile with -threaded to work around this.

comment:3 Changed 4 years ago by simonmar

  • Milestone set to 7.0.3
  • Owner set to simonmar
  • Priority changed from normal to high

comment:4 follow-up: Changed 4 years ago by simonmar

I've tried this program on a couple of different machines (32 and 64-bit Linux) with several versions of GHC, and haven't managed to reproduce the error. Does it fail right away for you?

Can you tell me more about your setup? (exact versions of tools etc.) Does it fail on a different machine for you?

comment:5 in reply to: ↑ 4 Changed 4 years ago by slyfox

Replying to simonmar:

I've tried this program on a couple of different machines (32 and 64-bit Linux) with several versions of GHC, and haven't managed to reproduce the error. Does it fail right away for you?

Does the first test (main = threadWaitRead 12 >> print "yet input!") work for you and shows an error in nonthreaded case?

For me it works incorrectly (does not show any errors) in:

  • ghc-6.12.3 nonthreaded
  • ghc-6.12.3 threaded
  • ghc-7.0.1 nonthreaded

The second test (in tarball) fails only on ghc-7.0.1 nonthreaded.

The test is factored out from xmobar sources. The failure is 100% reproducible on two machines.

Can you tell me more about your setup? (exact versions of tools etc.) Does it fail on a different machine for you?

It's my amd64 gentoo desktop (notebook) box with recent software.
Linux kernel 2.6.37, glibc-2.12.2, ghc-7.0.1 (bootstrapped by 7.0.0.20101028)

comment:6 Changed 4 years ago by slyfox

The same behaviour is observable on stock ubuntu 10.10 (amd64) with ghc-7.0.1

(built as user: ./configure --prefix=$HOME/somewhere && make && make install).

comment:7 follow-up: Changed 4 years ago by simonmar

  • Status changed from new to infoneeded

My mistake, I don't think I actually tested with the stock 7.0.1, only the current 7.0 branch. It does indeed fail with 7.0.1, but not the current 7.0 branch.

I'm fairly certain this is #4813 in another guise. I'll close when you've had a chance to test against a recent snapshot.

Changed 4 years ago by slyfox

bad_fd.hs - should issue an error with nonthreaded RTS, but does not

comment:8 in reply to: ↑ 7 Changed 4 years ago by slyfox

Replying to simonmar:

My mistake, I don't think I actually tested with the stock 7.0.1, only the current 7.0 branch. It does indeed fail with 7.0.1, but not the current 7.0 branch.

I'm fairly certain this is #4813 in another guise. I'll close when you've had a chance to test against a recent snapshot.

Tried , ("Project version","7.0.1.20110211"). Results are the following:
}}}

  • test_it.tar.gz - works correctly now
  • bad_fd.hs - still eats an error with nonthreaded RTS:
ghc-7.0/ghc-install:bin/ghc --make -fforce-recomp bad_fd.hs -o bad_fd.nt
[1 of 1] Compiling Main             ( bad_fd.hs, bad_fd.o )
Linking bad_fd.nt ...
ghc-7.0/ghc-install:bin/ghc --make -fforce-recomp bad_fd.hs -threaded -o bad_fd.t
[1 of 1] Compiling Main             ( bad_fd.hs, bad_fd.o )
Linking bad_fd.t ...

ghc-7.0/ghc-install:./bad_fd.nt
"yet input!" # how come?
ghc-7.0/ghc-install:./bad_fd.t
bad_fd.t: epollControl: invalid argument (Bad file descriptor)

comment:9 follow-up: Changed 4 years ago by simonmar

  • Cc tibbe bos added

This is the way it has always worked. The problem is that when select() finds that one of the file descriptors has been closed, it returns an error, but doesn't tell you which one it was, so in that case we inform all the blocked threads that they should wake up, and retry whatever operation they were doing, which will result in the appropriate exception being thrown. I presume that epoll is able to determine which file descriptor is erroneous and generate the exception from threadWaitRead instead.

So typical uses of threadWaitRead won't notice the difference. We could make the epoll backend ignore the exception, but that doesn't seem right. Personally I'm happy to leave things as they are, but document that threadWaitRead may or may not raise an exception if the FD is closed (actually I'd like to remove threadWaitRead from Control.Concurrent anyway, it's really an internal API).

tibbe/bos, any thoughts?

comment:10 in reply to: ↑ 9 Changed 4 years ago by tibbe

Replying to simonmar:

So typical uses of threadWaitRead won't notice the difference. We could make the epoll backend ignore the exception, but that doesn't seem right. Personally I'm happy to leave things as they are, but document that threadWaitRead may or may not raise an exception if the FD is closed (actually I'd like to remove threadWaitRead from Control.Concurrent anyway, it's really an internal API).

tibbe/bos, any thoughts?

I agree. I think threadWaitRead should raise an exception for closed FDs. Otherwise it might mask programmer bugs.

comment:11 follow-up: Changed 4 years ago by simonmar

The problem is that we can't easily make threadWaitRead raise an exception for closed FDs in the non-threaded RTS where we use select, because we have no information about which FD has the error. Presumably the same applies in the new IO manager when using poll?

comment:12 in reply to: ↑ 11 ; follow-up: Changed 4 years ago by tibbe

Replying to simonmar:

The problem is that we can't easily make threadWaitRead raise an exception for closed FDs in the non-threaded RTS where we use select, because we have no information about which FD has the error. Presumably the same applies in the new IO manager when using poll?

poll sets POLLNVAL per FD so that should be possible.

comment:13 in reply to: ↑ 12 Changed 4 years ago by simonmar

  • Milestone changed from 7.0.3 to _|_
  • Priority changed from high to normal

Replying to tibbe:

poll sets POLLNVAL per FD so that should be possible.

Ok. In that case we could declare it to be a bug in the non-threaded RTS that could be fixed by using poll() instead of select() and generating correct error returns from waitRead#. Fixing it won't be a high priority, though.

comment:14 Changed 4 years ago by simonmar

  • Status changed from infoneeded to new

comment:15 Changed 4 years ago by simonmar

  • Owner simonmar deleted

Changed 14 months ago by slyfox

0001-rts-posix-Select.c-remove-unused-select_succeeded.patch

Changed 14 months ago by slyfox

0002-rts-posix-Select.c-search-for-EBADFs.patch

comment:16 Changed 14 months ago by slyfox

  • Cc simonmar added
  • difficulty set to Unknown

0001-rts-posix-Select.c-remove-unused-select_succeeded.patch is a cleanup patch removing unused variable
0002-rts-posix-Select.c-search-for-EBADFs.patch is a patch, that finds bad file descriptors passed to select()

What is the easiest way to convert such TSOs into IOException thunks?

Thanks!

Changed 13 months ago by slyfox

0003-rts-posix-Select.c-throw-exception-to-reader-as-soon.patch - adds exception throw

comment:17 Changed 13 months ago by slyfox

0003-rts-posix-Select.c-throw-exception-to-reader-as-soon.patch
adds another step to kill thread on threadWaitRead.

Now it works as:

[1 of 1] Compiling Main             ( a.hs, a.o )
Linking a ...
a: awaitEvent: invalid argument (Bad file descriptor)
a: user error (If you can read this, shutdownHaskellAndExit did not exit.)

The problem is the status of killed thread.
RTS thinks it did not finish, but blocked on FFI call:

$ ./a +RTS -D{s,i,r,S,z}
cap 0: thread 1 stopped (blocked on a read operation)
        thread    1 @ 0x7fd2fa405390 is blocked on read from fd 12 (TSO_DIRTY)
scheduler: checking for threads blocked on I/O (waiting)
found bad READ fd=12
cap 0: raising exception in thread 1.
Waking up blocked thread 1
cap 0: running thread 1 (ThreadRunGHC)
a: awaitEvent: invalid argument (Bad file descriptor)
cap 0: thread 1 stopped (suspended while making a foreign call)

comment:18 Changed 13 months ago by slyfox

Some notes about the code
which does threadWaitRead from haskell, but does not perform read().

For xmobar the bug was in polling on X11 socket from haskell
code, and handling of sockets from C code (which ignored bad FD).

Thus such code led to 100% CPU busy loop:

    x11_ctx <- get_x11_ctx
    forever $ do
        threadWaitRead (x11_connection x11_ctx)
        x11_poll_events event_callback -- ignores -EBADF

Changed 13 months ago by slyfox

0004-rts-posix-Select.c-don-t-put-thread-with-pending-exc.patch - fixes last problem with run queue

comment:19 Changed 13 months ago by slyfox

After throwToSingleThreaded tso should not be pushed to runnables queue.
0004-rts-posix-Select.c-don-t-put-thread-with-pending-exc.patch fixes this problem.

comment:20 Changed 13 months ago by slyfox

  • Status changed from new to patch

Now even with non-threaded RTS error is reported as expected:

$ inplace/bin/ghc-stage2 --make -fforce-recomp a.hs -o a -debug && ./a
a: awaitEvent: invalid argument (Bad file descriptor)

comment:21 Changed 13 months ago by slyfox

If idea in patches looks fine i can squash last 3 patches in one and cleanup comments around.

I can also slightly change polling cycle to resume only descriptors with data available,
and not all the waiters. It would be more robust against spurious blocking calls.

What do you think?

comment:22 follow-up: Changed 13 months ago by simonmar

Yes, please squash the patches and then I'll review.

Not sure what you mean by "slightly change polling cycle to resume only descriptors with data available", don't we already do that?

comment:23 in reply to: ↑ 22 Changed 13 months ago by slyfox

Yes, please squash the patches and then I'll review.

Will do.

Not sure what you mean by "slightly change polling cycle to resume only descriptors with data available", don't we already do that?

I was a bit unclear. I meant: only in a case of select() == EBADF failure we
always resumed all the blocked threads:
https://ghc.haskell.org/trac/ghc/browser/ghc/rts/posix/Select.c#L295

  if ( errno == EBADF )
    unblock_all = rtsTrue;
    ...
  ready = unblock_all || FD_ISSET(tso->block_info.fd, &rfd); // [1]
  ...
  if (ready)
    pushOnRunQueue(&MainCapability,tso);

Current patchset only propagates exceptions early,
but does not fix spurious resumes (as it was before).

I propose to fix resume problem:

---    ready = unblock_all || FD_ISSET(tso->block_info.fd, &rfd);
+++    if (unblock_all) ready = fd_is_really_ready (fd);
+++    else             ready = FD_ISSET(tso->block_info.fd, &rfd);

It' won't be hard to implement on top of current patchset.
Secondary select() can already provide all the info.

Should I do it as a) a separate patch/ticket, b) merge into final result,
c) don't touch it at all?

Thanks!

comment:24 follow-up: Changed 13 months ago by simonmar

(b) merge into your patch please.

Changed 13 months ago by slyfox

0001-Raise-exceptions-when-blocked-in-bad-FDs-fixes-Trac-.patch

comment:25 in reply to: ↑ 24 Changed 13 months ago by slyfox

Replying to simonmar:

(b) merge into your patch please.

Done as: 0001-Raise-exceptions-when-blocked-in-bad-FDs-fixes-Trac-.patch

comment:26 Changed 13 months ago by Simon Marlow <marlowsd@…>

In 9fd507e5758f4141ac2619f0db57136bcab035c6/ghc:

Raise exceptions when blocked in bad FDs (fixes Trac #4934)

Before the patch any call to 'select()' with 'bad_fd' led to:
- unblocking of all threads
- hiding exception for 'threadWaitRead bad_fd'

The patch fixes both cases in this way:
after 'select()' failure we iterate over each blocked descriptor
and poll individually to see it's actual status, which is:
- READY (move to run queue)
- BLOCKED (leave in blocked queue)
- INVALID (send an IOErrror exception)

Signed-off-by: Sergei Trofimovich <[email protected]>

comment:27 Changed 13 months ago by simonmar

  • Resolution set to fixed
  • Status changed from patch to closed

comment:28 Changed 13 months ago by hvr

  • Milestone changed from to 7.10.1

comment:29 Changed 13 months ago by Simon Peyton Jones <simonpj@…>

In 7f467d0fbb1424f638a0d39caf57b9c0198421a8/ghc:

Fix Windows build (wibble to fix for Trac #4934)
Note: See TracTickets for help on using tickets.