Opened 3 years ago

Closed 3 years ago

#5421 closed bug (fixed)

<<loop>> in withMVar (reproducible, but with large test case)

Reported by: JohnMillikin Owned by: simonmar
Priority: high Milestone: 7.4.1
Component: libraries/base Version: 7.2.1
Keywords: Cc: hvr@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by igloo)

A user of one of my libraries is reporting that it fails with the message <<loop>> when compiled in GHC 7.2.1. The error is coming from withMVar, and is reproducible.

Unfortunately, the problem is *very* tricky to reproduce. Attempting to debug it (such as adding print statements, re-ordering operations, or removing unrelated code) will cause it to vanish. Therefore, the test case I have attached is a full Cabal project with build instructions. Nearly every line of code is required, and changing or removing them hides the error again.

Changes which can hide the problem:

  • Altering or removing the startup messages (even the version number).
  • Removing unused attributes from types.
  • Removing unused error checking.
  • Using _ <- char 'a' instead of void (char 'a').
  • Moving a function from one module to another.
  • Simplifying loops which only run over one element.
  • Building it all at once with --make, and not a separate cabal-dev install step.
  • Building 'Main' with -O2.

Attachments (1)

withMVar_crash.tar.gz (4.5 KB) - added by JohnMillikin 3 years ago.
test case

Download all attachments as: .zip

Change History (10)

Changed 3 years ago by JohnMillikin

test case

comment:1 Changed 3 years ago by hvr

  • Cc hvr@… added

comment:2 Changed 3 years ago by JohnMillikin

After sleeping on it, I just realized the underlying cause. Here's a reduced test case:

import Control.Concurrent
import Control.Monad.Fix

data Client = Client
	{ clientLock :: MVar ()
	}

main = do
	mvar <- newMVar ()
	
	client <- mfix $ \client -> do
		_ <- forkIO (mainLoop client)
		threadDelay 500000
		return (Client mvar)
	return ()

mainLoop client = withMVar (clientLock client) (\_ -> return ())

The weirdness in the code itself is now much more obvious. However, I still can't get it to <<loop>> in GHC 7.0.4 or earlier.

comment:3 Changed 3 years ago by simonpj

That is indeed a strange thing to do. (Thinking aloud.) Internally what happens is that the mfix code generates something like this:

main :: State# -> (State#, ())
main s = case newMVar () of { (s', mvar) -> 
             letrec 
                client = snd t
                t = helper client s'    -- THUNK BUILD HERE
             in (fst t, ())
         }

helper :: MVar () -> State# -> (State#, MVar ())
helper client s' = case fork (mainLoop client) of { (s'', ()) ->
                   case delay 5000 s'' of { (s''', ()) ->
                   return (s''', ()) } }

The mfix builds a thunk t, which it immedaitely begins to poke on (by evaluating fst t). That spawns the other thread, and delays. At this point the thunk is still under evaluation. The scheduler switches threads, and I would have thought that it blackholed the thunk at that moment (which would have led to a visible <<loop>>, but apparently not. Simon M what do you think?

Does the same happen if you run on just one core?

The thunk t does IO and so we should be super-careful not to duplicate its work. We are super-cafeful in some circumstances (like unsafePerformIO) but I'm not sure we are super-careful here. Simon?

Thanks for boiling this down.

Simon

comment:4 Changed 3 years ago by JohnMillikin

Does the same happen if you run on just one core?

Changing the number of cores with +RTS -N1, -N2, etc seems to have no effect.

The scheduler switches threads, and I would have thought that it blackholed the thunk at that moment (which would have led to a visible <<loop>>, but apparently not

Does that mean it would <<loop>> when forkIO is called, instead of within the thread? That's definitely better behavior if possible, since it would make debugging much easier.

comment:5 Changed 3 years ago by simonmar

The cause is quite simple in fact. Here's the definition of fixIO, from GHC.IO:

fixIO :: (a -> IO a) -> IO a
fixIO k = do
    ref <- newIORef (throw NonTermination)
    ans <- unsafeInterleaveIO (readIORef ref)
    result <- k ans
    writeIORef ref result
    return result

So if you pull on ans before it is ready, you get <<loop>>. It loops for me with 7.0.3.

We could have a cleverer version of fixIO in which child threads that demanded the result would block until the main thread is ready - indeed, changing the IORef to an MVar ought to do the trick, although it would have an extra cost, and if you pulled on the result before it was ready in the parent thread you would get BlockedIndefinitelyOnMVar instead of NonTermination.

comment:6 Changed 3 years ago by simonmar

  • Milestone set to 7.4.1
  • Priority changed from normal to high

Decide what to do for 7.4.1.

comment:7 Changed 3 years ago by igloo

  • Description modified (diff)

comment:8 Changed 3 years ago by igloo

  • Owner set to simonmar

I'd suggest using an MVar, and perhaps having an unsafeFixIO which uses an IORef (but can do odd things in multithreaded programs).

comment:9 Changed 3 years ago by simonmar

  • Architecture changed from x86_64 (amd64) to Unknown/Multiple
  • Component changed from Compiler to libraries/base
  • Operating System changed from Linux to Unknown/Multiple
  • Resolution set to fixed
  • Status changed from new to closed

Done:

commit 2cc5a65eb6463bb92b84cc4416a410ab80cda950

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Fri Nov 4 15:27:39 2011 +0000

    use MVar to define fixIO, for thread-safety (see #5421)
Note: See TracTickets for help on using tickets.