Opened 8 months ago

Closed 8 months ago

#8209 closed bug (fixed)

Race condition in setNumCapabilities

Reported by: akio Owned by: simonmar
Priority: highest Milestone: 7.8.1
Component: Runtime System Version: 7.7
Keywords: Cc: andreas.voellmy@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime crash Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

In HEAD, the following program sometimes deadlocks (about 1/10 of the time).

import Control.Concurrent
import Control.Monad
import GHC.Conc

main = do
  mainTid <- myThreadId
  labelThread mainTid "main"
  forM_ [0..0] $ \i -> forkIO $ do
    subTid <- myThreadId
    labelThread subTid $ "sub " ++ show i
    forM_ [0..100000000] $ \j -> putStrLn $ "sub " ++ show i ++ ": " ++ show j
  yield
  setNumCapabilities 2

The problem seems to be that there is a race condition between setNumCapabilites and a Task returning from a foreign call. Specifically, a sequence of events like the following can happen:

  1. Task 0 makes a foreign call.
  2. Task 1 calls setNumCapabilities()
  3. The call on task 0 returns; it reads task->cap from the memory.
  4. Task 1 moves the Capabilities, invalidating all pointers to them.
  5. Task 0 takes over the invalidated Capability.

The attached patch adds an ASSERT to the RTS to demonstrate the problem (it does not fix the problem).

Attachments (1)

assert.patch (1.3 KB) - added by akio 8 months ago.
adds an ASSERT

Download all attachments as: .zip

Change History (14)

Changed 8 months ago by akio

adds an ASSERT

comment:1 Changed 8 months ago by parcs

Perhaps related, I am getting a rare deadlock with this test case:

Test.hs

import GHC.Conc

main :: IO ()
main = do
    setNumCapabilities 2
    setNumCapabilities 3

Command Line

$ ghc-stage2 -threaded Test.hs
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...
$ while true; do ./Test; echo -n .; done
.............................................<eventually stops>
Version 1, edited 8 months ago by parcs (previous) (next) (diff)

comment:2 follow-ups: Changed 8 months ago by ezyang

I haven't run the program, but are you sure (5) happens? It is fine for task 0 to hit your ASSERT even when setNumCapabilities is running, because it still needs to check whether or not it's acceptable to run, and that won't happen until setNumCapabilities. What it seems to me is happening is that waitForReturnCapability is improperly pre-committing to the capability it wants to return to, whereas it should always retrieve a fresh candidate capability each time around the wait loop.

comment:3 in reply to: ↑ 2 Changed 8 months ago by akio

Replying to ezyang:

I haven't run the program, but are you sure (5) happens?

On second thought I'm not so sure. I'll look at this again later today.

comment:4 Changed 8 months ago by simonmar

  • Owner set to simonmar
  • Priority changed from normal to highest

Thanks for the report. Treating as a blocker for 7.8.

comment:5 Changed 8 months ago by simonmar

  • Milestone set to 7.8.1

comment:6 Changed 8 months ago by akio

I found another error (reproducible with the same code).

moreCapabilities calls memcpy to copy Capabilitys. This means the lock field is also copied. If the lock field of the old Capability is held by another thread, the new copy of lock is also in the held state. However it will never be released, because the actual lock held is the old lock, not the new one. The task running setNumCapabilities() then deadlocks trying to acquire this lock. (By the way copying a mutex seems to be an undefined behavior according to the pthread spec)

comment:7 in reply to: ↑ 2 ; follow-up: Changed 8 months ago by akio

Replying to ezyang:

I haven't run the program, but are you sure (5) happens? It is fine for task 0 to hit your ASSERT even when setNumCapabilities is running, because it still needs to check whether or not it's acceptable to run, and that won't happen until setNumCapabilities. What it seems to me is happening is that waitForReturnCapability is improperly pre-committing to the capability it wants to return to, whereas it should always retrieve a fresh candidate capability each time around the wait loop.

Yes, it seems that that is what's happening in practice. However, isn't there still a possibility that e.g. task 0 tries to acquire cap->lock in waitForReturnCapability, after cap is stgFreed in setNumCapabilities?

comment:8 in reply to: ↑ 7 Changed 8 months ago by ezyang

Replying to akio:

However, isn't there still a possibility that e.g. task 0 tries to acquire cap->lock in waitForReturnCapability, after cap is stgFreed in setNumCapabilities?

Certainly that would be wrong, and a fix has to deal with it. (Note that while the memcpy problem I would expect to cause a deadlock, I would not expect this to cause a deadlock.)

comment:9 Changed 8 months ago by Simon Marlow <marlowsd@…>

In aa779e092c4f4d6a6691f3a4fc4074e6359337f8/ghc:

Don't move Capabilities in setNumCapabilities (#8209)

We have various problems with reallocating the array of Capabilities,
due to threads in waitForReturnCapability that are already holding a
pointer to a Capability.

Rather than add more locking to make this safer, I decided it would be
easier to ensure that we never move the Capabilities at all.  The
capabilities array is now an array of pointers to Capabaility.  There
are extra indirections, but it rarely matters - we don't often access
Capabilities via the array, normally we already have a pointer to
one.  I ran the parallel benchmarks and didn't see any difference.

comment:10 Changed 8 months ago by Simon Marlow <marlowsd@…>

comment:11 Changed 8 months ago by AndreasVoellmy

  • Cc andreas.voellmy@… added

comment:12 Changed 8 months ago by AndreasVoellmy

@Simon: did your patch resolve this issue? If so, should it be marked fixed?

comment:13 Changed 8 months ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed

oops, thanks for the reminder.

Note: See TracTickets for help on using tickets.