Opened 2 years ago

Closed 2 years ago

Last modified 6 months ago

#7653 closed bug (fixed)

incorrect handling of StackOverflow exception in the event manager

Reported by: nus Owned by: tibbe
Priority: normal Milestone: 7.8.1
Component: libraries/base Version: 7.7
Keywords: Cc: johan.tibell@…, kazu@…
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Under high pressure of registerTimeouts the event manager thread's stack overflows.

Testcases:
Shachaf initially reported this testcase on #ghc,
which is irreproducible locally:

import Control.Monad;
import Control.Concurrent
main = replicateM_ 1000000 (forkIO (threadDelay 1))

to be compiled and run as follows:

$ ghc -O2 -threaded Main.hs && time ./Main +RTS -N

Limiting the stack to the minimum helped to reproduce this locally, both on x86 and x64:

import Control.Monad
import Control.Concurrent
main = replicateM_ 502 (forkIO (threadDelay 1)) -- 504 on x64
$ ghc -O2 -threaded -with-rtsopts="-N8 -K4" rplfrk.hs && ./rplfrk # -K8 on x64

and this, though less deterministically:

import Control.Monad
import Control.Concurrent
main = replicateM_ 340 ( forkIO (threadDelay 1))
$ ghc -O2 -threaded -rtsopts repl2-x86.hs && ./repl2-x86 +RTS -N1 -K4

Error messages look like:

Stack space overflow: current size 4 bytes.
Use `+RTS -Ksize -RTS' to increase it.
repl2-x86: sendWakeup: invalid argument (Bad file descriptor)
[...repeated...]
repl2-x86: ioManagerDie: write: Bad file descriptor

Attachments (2)

ghc-bug7653-patch-iomanager-invalidate-fds.diff (630 bytes) - added by nus 2 years ago.
Invalidate the IO manager's control and wakeup fds on close
ghc-bug7653-patch-dont-wakeup-dead-manager.diff (527 bytes) - added by nus 2 years ago.
don't wake up an event manager in the finished state.

Download all attachments as: .zip

Change History (14)

comment:1 Changed 2 years ago by nus

libraries\base\GHC\Event\Manager.hs, the thunk accummulated in f':

registerTimeout :: EventManager -> Int -> TimeoutCallback -> IO TimeoutKey
registerTimeout mgr us cb = do
  !key <- newUnique (emUniqueSource mgr)
  if us <= 0 then cb
    else do
      now <- getMonotonicTime
      let expTime = fromIntegral us / 1000000.0 + now

      -- We intentionally do not evaluate the modified map to WHNF here.
      -- Instead, we leave a thunk inside the IORef and defer its
      -- evaluation until mkTimeout in the event loop.  This is a
      -- workaround for a nasty IORef contention problem that causes the
      -- thread-delay benchmark to take 20 seconds instead of 0.2.
      atomicModifyIORef (emTimeouts mgr) $ \f ->
          let f' = (Q.insert key expTime cb) . f in (f', ())
      wakeManager mgr
  return $ TK key

is evaluated in applyEdits:

step :: EventManager -> TimeoutQueue -> IO (Bool, TimeoutQueue)
step mgr@EventManager{..} tq = do
  (timeout, q') <- mkTimeout tq
  I.poll emBackend timeout (onFdEvent mgr)
  state <- readIORef emState
  state `seq` return (state == Running, q')
 where

  -- | Call all expired timer callbacks and return the time to the
  -- next timeout.
  mkTimeout :: TimeoutQueue -> IO (Timeout, TimeoutQueue)
  mkTimeout q = do
      now <- getMonotonicTime
      applyEdits <- atomicModifyIORef emTimeouts $ \f -> (id, f)
      let (expired, q'') = let q' = applyEdits q in q' `seq` Q.atMost now q'
      sequence_ $ map Q.value expired
      let timeout = case Q.minView q'' of
            Nothing             -> Forever
            Just (Q.E _ t _, _) ->
                -- This value will always be positive since the call
                -- to 'atMost' above removed any timeouts <= 'now'
                let t' = t - now in t' `seq` Timeout t'
      return (timeout, q'')

The main loop :: EventManager -> IO () in cleanup :: EventManager -> IO () uses:

closeControl :: Control -> IO ()
closeControl w = do
  _ <- c_close . fromIntegral . controlReadFd $ w
  _ <- c_close . fromIntegral . controlWriteFd $ w
#if defined(HAVE_EVENTFD)
  _ <- c_close . fromIntegral . controlEventFd $ w
#else
  _ <- c_close . fromIntegral . wakeupReadFd $ w
  _ <- c_close . fromIntegral . wakeupWriteFd $ w
#endif
  return ()

which doesn't clean up the C side of newControl :: IO Control.

Changed 2 years ago by nus

Invalidate the IO manager's control and wakeup fds on close

comment:2 Changed 2 years ago by nus

Another part of the problem is that we really shouldn't be trying to operate on an event manager if it's in the Finished or Dying states. A proper fix would require changing the API to reflect that in the return type of the operations using wakeManager. Is it worth the trouble?

Changed 2 years ago by nus

don't wake up an event manager in the finished state.

comment:3 Changed 2 years ago by nus

  • Status changed from new to patch

comment:4 Changed 2 years ago by igloo

  • Cc johan.tibell@… added
  • difficulty set to Unknown
  • Owner set to tibbe

Johan, could you take a look at these patches please? Thanks!

comment:5 Changed 2 years ago by kazu-yamamoto

  • Cc kazu@… added

comment:6 follow-up: Changed 2 years ago by kazu-yamamoto

First of all, I would like to discuss what is the best behavior when the stack of IO manager overflows.

Note that in the current parallel IO manager, timer manager and IO managers are separated. We should apply this discussion to the timer manager.

A comment to the first patch: even if we set manager's FD to -1, nobody refers to it. What is the purpose of this patch?

comment:7 in reply to: ↑ 6 Changed 2 years ago by nus

Replying to kazu-yamamoto:

First of all, I would like to discuss what is the best behavior when the stack of IO manager overflows.

Perhaps the best would be not to cause stack overflows in the manager at all. At least the shutdown should be as gracious as possible.

Note that in the current parallel IO manager, timer manager and IO managers are separated. We should apply this discussion to the timer manager.

Sure, the bug report and the patches were made before the new I/O manager merge. The code excerpts above that show how the thunk is accumulated are now only pertinent to 7.4 and 7.6 branches.
While I'm not sure how (and if) the situation like this could be reproduced on the current HEAD, the concerns might still be applicable:

  1. There're no counterparts for c_setIOManagerControlFd and c_setIOManagerWakeupFd of newControl in closeControl;
  2. A situation might emerge (again, I'm not sure how, but still) when wakeManager would be passed an EventManager in the Finished state.

A comment to the first patch: even if we set manager's FD to -1, nobody refers to it. What is the purpose of this patch?

The RTS does, please have a look at ioManagerWakeup in rts/posix/Signals.c (and wakeUpRts in rts/Schedule.c).

comment:8 Changed 2 years ago by kazu-yamamoto

Thank you for explanation.

Your patch can be applied to GHC/Event/TimerManager.hs. Unfortunately, the following error happens with/without your patches:

rplfrk: user error (Pattern match failure in do expression at libraries/base GHC/Event/Thread.hs:212:3-10)

The code is here:

getSystemTimerManager :: IO TM.TimerManager
getSystemTimerManager = do
  Just mgr <- readIORef timerManager
  return mgr

What should getSystemTimerManager do when readIORef returns Nothing?

comment:9 Changed 2 years ago by nus

I should clarify that the patches were meant for the old I/O manager, that is the one that's currently in the 7.4 and 7.6 branches, and only the 7.6 and pre-merge HEAD builds were validated. The new I/O manager situation needs more insight.

comment:10 Changed 2 years ago by igloo

  • Milestone set to 7.8.1

comment:11 Changed 2 years ago by igloo

  • Resolution set to fixed
  • Status changed from patch to closed

Fixed by

commit e843e73690f828498f6e33bb89f47a50c3ab2ac9
Author: Ian Lynagh <[email protected]>
Date:   Sat Jun 8 20:19:59 2013 +0100

    IO manager: Edit the timeout queue directly, rather than using an edit list

    Fixes #7653.

comment:12 Changed 6 months ago by Austin Seipp <austin@…>

In a520761d065a84838896e8dd09d8aaec77480d60/ghc:

Remove outdated TODO in TimeManager

Summary:
It describes a work around Trac #3838, but it is already fixed and the
workaround removed, Trac #7653

Test Plan: not needed

Reviewers: hvr, Mikolaj, austin

Reviewed By: austin

Subscribers: thomie, carter

Differential Revision: https://phabricator.haskell.org/D478
Note: See TracTickets for help on using tickets.