Opened 7 months ago

Closed 5 months ago

Last modified 4 months ago

#13615 closed bug (fixed)

Nondeterminism in ‘pure’ function w/ parallel evaluation & memo combinators

Reported by: pacak Owned by: bgamari
Priority: highest Milestone: 8.2.1
Component: Compiler Version: 7.10.3
Keywords: Cc: liyang, asr, dfeuer
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D3695
Wiki Page:

Description (last modified by pacak)

The example below uses in-place copies (mostly to fix dependencies and to make it more self-contained) of parallel (modified), unordered-containers (modified), and hashable. Its behavior is nondeterministic, despite using only supposedly pure operations.

Aforementioned aside, it only depends on base and deepseq, and there's no unsafePtrEq involved at all. The affected fromListWith uses foldl' to effect in-place HashMap updates using unsafe{Thaw,Freeze}.

I don't see how references to intermediate versions of HashMap could possibly leak out, so in-place updates seem valid to me.

Strictifying (or even rnfing) the arguments to unsafeInsertWith doesn't help. The issue is reproducible with -O0 on HEAD but not with -O2; On 8.0.2 and 7.10.1, it can also be reproduced with -O2.

sum . map snd = sum . map snd . toList . fromListWith (+)

The above identity fails when the input contains values constructed using both memo combinators and parallel evaluation strategies.

I have identified the in-place-update that needs to be replaced with copy-and-update to make things work―patch unsafeInsertWith.go in unordered-containers-0.2.8.0/Data/HashMap/Strict.hs, under the BitmapIndexed pattern as follows:

-            A.unsafeUpdateM ary i st'
-            return t
+            let ary' = A.update ary i st'
+            return $ BitmapIndexed b ary'

Steps to reproduce:

git clone https://github.com/pacak/cuddly-bassoon && cd cuddly-bassoon
cabal new-build
dist-newstyle/build/gamebooksolver-0.1.0.0/build/gamebooksolver-solvebook02/gamebooksolver-solvebook02

This will either finish successfully producing no output, or will die with an error from the regroup function in src/Solver.hs.

Your machine must have at least 2 CPU cores to reproduce the problem; also it will not show up with +RTS -N1.

This issue was first reported here: https://github.com/tibbe/unordered-containers/issues/147

Change History (57)

comment:1 Changed 7 months ago by bgamari

Milestone: 8.2.1
Priority: normalhighest

comment:2 Changed 7 months ago by mpickering

Can also build easily with "cabal new-build".

comment:3 Changed 7 months ago by pacak

Description: modified (diff)

comment:4 Changed 7 months ago by bgamari

Interesting, forcing st' before passing it to unsafeUpdateM in the the BitmapIndexed branch of Data.HashMap.Strict.unsafeInsertWith also seems to fix the issue. I haven't thought much about why.

Never mind; it makes no difference.

Last edited 7 months ago by bgamari (previous) (diff)

comment:5 Changed 7 months ago by liyang

Cc: liyang added
Description: modified (diff)
Summary: Wrong results from what supposed to be pure function with presense of parallel evaluationNondeterminism in ‘pure’ function w/ parallel evaluation & memo combinators

I've editorised the text of the original bug report. Pray I do not editorise it further.

comment:6 Changed 7 months ago by liyang

Description: modified (diff)

I have editorised the text further. Sorry.

comment:7 Changed 7 months ago by liyang

Description: modified (diff)

comment:8 Changed 7 months ago by pacak

I pushed a bunch of simplification commits to the repo, failure chance went down a bit but still within 10-20% on my machine. If you can't reproduce the problem - try first commit.

comment:9 Changed 7 months ago by pacak

Description: modified (diff)

comment:10 Changed 7 months ago by bgamari

While playing around with the repro, I noticed that the issue is reproducible if the unsafeUpdateM in unsafeInsertWith is performed, even if the resulting HashMap doesn't refer to the old array. e.g.,

    go h k x s t@(BitmapIndexed b ary)
        | b .&. m == 0 = do
            let ary' = A.insert ary i $! leaf h k x
            return $! bitmapIndexedOrFull (b .|. m) ary'
        | otherwise = do
            st <- A.indexM ary i
            st' <- rnf x `seq` go h k x (s+bitsPerSubkey) st
            let !ary' = A.update ary i $! st'
            A.unsafeUpdateM ary i st'
            return $ BitmapIndexed b ary'
      where m = mask h s
            i = sparseIndex b m

I'm not yet sure whether this is a meaninfgul observation, but it is an observation nonetheless.

Last edited 7 months ago by bgamari (previous) (diff)

comment:11 Changed 7 months ago by bgamari

I have confirmed that eliminating the use of parMap in solve eliminates the issue as does replacing rdeepseq with rseq. I don't know whether the use of the memo combinator contributes as runtime blows up when it is removed

Both uses of A.unsafeUpdateM in unsafeInsertWith (one in the BitmapIndexed branch and one in the Full branch) are problematic. Replacing either one with A.update decreases the failure rate, but the rate is still non-zero.

comment:12 Changed 7 months ago by bgamari

So it appears that the intermediate mutable arrays from unsafeInsertWith are somehow "leaking" out of the fold. Consider this variant of unsafeInsertWith's BitIndexed codepath,

    go h k x s (BitmapIndexed b ary)
        | b .&. m == 0 = do
            let ary' = A.insert ary i $! leaf h k x
            return $! bitmapIndexedOrFull (b .|. m) ary'
        | otherwise = do
            st <- A.indexM ary i
            st' <- ({-# SCC "hi3" #-}rnf x) `seq` go h k x (s+bitsPerSubkey) st
            let !ary' = A.update ary i $! st'
            A.unsafeUpdateM ary i (error "fiddlesticks")
            return $ BitmapIndexed b ary'
      where m = mask h s
            i = sparseIndex b m

Now, as far as I can tell, if no one has kept any references to the HashMap that unsafeInsertWith is inserting into there should be no way that we enter error "fiddlesticks". Afterall, the array ary is dead after we finish evaluating go.

However, in actuality this doesn't appear to be true: fiddlesticks is indeed entered! Namely, it is forced by the rnf at the beginning of unsafeInsertWith.

unsafeInsertWith f k0 v0 m0 = ({-# SCC "top" #-}rnf m0) `seq` runST (go h0 k0 v0 0 m0)

Now this is quite strange as it means that either,

  1. a reference to the old ary is somehow leaking out of go
  2. there are multiple references to ary scattered throughout the HashMap

It's not immediately obvious how either of these could be true, but the facts don't lie.

comment:13 Changed 7 months ago by bgamari

Hypothesis: thawArray# is to blame.

Falsified by falling back to __GLASGOW_HAKSELL__ < 702 codepath in Data.HashMap.Array.thaw. Problem still reproducible.

Last edited 7 months ago by bgamari (previous) (diff)

comment:14 Changed 7 months ago by bgamari

For the record, I've also tested with -fno-full-laziness; the issue persists.

comment:15 Changed 7 months ago by pacak

<bgamari> gamebooksolver-solvebook02: Those are expected to be equal(6030,6030)
<bgamari> GHC has an interesting sense of equality

Code that generates this error looks like this and both s' and s are Ints.

      if s' /= s
            then error $ "Those are expected to be equal" ++ show (s', s)
            else xs'

I don't see how that's possible unless Eq instance for Int gets corrupted somehow. If that's the case - it also might mean Num instance is bogus as well and that results in wrong sum.

comment:16 Changed 7 months ago by pacak

I've confirmed that ghc has an interesting sense of equality:

regroup :: (NFData a, Show a, Hashable a, Eq a, Ord a) => [(a, Int)] -> [(a, Int)]
regroup xs =
    let xs' = HM.toList $ HM.fromListWith (+) xs
        s' = sum (map snd xs')
        s  = sum (map snd xs)
     in if s' /= s
            then if show s' == show s
                    then error "WAT????"
                    else error $ "Those are expected to be equal" ++ show (s', s)
            else xs'
gamebooksolver-solvebook02: Those are expected to be equal(141,121)
CallStack (from HasCallStack):
  error, called at src/Main.hs:50:26 in main:Main
gamebooksolver-solvebook02: Those are expected to be equal(384,300)
CallStack (from HasCallStack):
  error, called at src/Main.hs:50:26 in main:Main
gamebooksolver-solvebook02: Those are expected to be equal(1045,1045)
CallStack (from HasCallStack):
  error, called at src/Main.hs:50:26 in main:Main
gamebooksolver-solvebook02: WAT????
CallStack (from HasCallStack):
  error, called at src/Main.hs:49:26 in main:Main
gamebooksolver-solvebook02: WAT????
CallStack (from HasCallStack):
  error, called at src/Main.hs:49:26 in main:Main
gamebooksolver-solvebook02: WAT????
CallStack (from HasCallStack):
  error, called at src/Main.hs:49:26 in main:Main

Actually it's even worse, note third error.

comment:17 Changed 7 months ago by bgamari

Indeed I have also observed similar insanity to comment:16.

comment:18 Changed 7 months ago by bgamari

Note that the probability of seeing the issue increases markedly if regroup is defined as,

regroup :: (NFData a, Show a, Hashable a, Eq a, Ord a) => [(a, Int)] -> [(a, Int)]
regroup xs0 =
    let xs = map (\(x,y) -> (x, y+1)) xs0
        xs' = HM.toList $ HM.fromListWith (+) xs
        s' = sum (map snd xs')
        s  = sum (map snd xs)
     in if s' /= s
            then if show s' == show s
                    then error "WAT????"
                    else error $ "Those are expected to be equal" ++ show (s', s)
            else xs'

It appears that elements are getting duplicated in the list inspected by fromListWith (which is why the (+1) makes the issue easier to see: you won't notice if you drop a zero from a sum)

Last edited 7 months ago by bgamari (previous) (diff)

comment:19 Changed 7 months ago by bgamari

-feager-blackholing also makes no difference.

I take it back; I just needed to clean more thoroughly. -feager-blackholing indeed does seem to fix the issue.

Last edited 7 months ago by bgamari (previous) (diff)

comment:20 Changed 7 months ago by bgamari

So perhaps we are entering a closure twice and consequently get a race condition where one thread in-place increments a hashtable entry by x, then the other does the same, but uses the now already incremented hashtable, resulting in an overall contribution of 2*x. This explains why the hashtable sum is always larger than the expected result.

While it's not clear precisely what closure we are entering multiple times, the fix is clear: ensure that unordered-containers is compiled with -feager-blackholing. However, we should also (at very least) document this caveat better. The users-guide current advertises -feager-blackholing as a optimization to avoid redundant computation in parallel programs. However, in this case that it's absolutely critical for correctness.

comment:21 Changed 7 months ago by bgamari

For future reference, the -feager-blackholing flag was introduced in GHC 6.12.

comment:22 Changed 7 months ago by akio

-feager-blackholing does not eliminate the race completely, right? I thought you needed a call to noDuplicate# in order to guarantee single entry.

comment:23 Changed 7 months ago by bgamari

Indeed you may be right; I'll need to try to work out precisely where the multiple entry is occurring tomorrow.

comment:24 Changed 7 months ago by simonpj

Why does it matter if we evaluate the same thunk twice? I'm lost.

Simon

comment:25 Changed 7 months ago by bgamari

Why does it matter if we evaluate the same thunk twice? I'm lost.

Well, I'm still not entirely certain what the problematic thunk looks like, but the program in question *does* rely on in-place modification of an array which would break if performed more than once.

I believe the problem revolves around Data.HashMap.Strict.unsafeInsertWith, which, to paraphrase, does the following,

unsafeInsertWith :: (v -> v -> v) -> k -> v -> HashMap k v -> HashMap k v
unsafeInsertWith f k v (HashMap arr) = runST $ do
    marr <- unsafeThawArray# arr
    let some_index = {- some function of k -}
    old_v <- readArray# marr some_index
    writeArray# marr some_index (f old_v v)
    _ <- unsafeFreezeArray# marr
    return (HashMap arr)

That is, despite being a "pure" function, unsafeInsertWith takes the array out of the HashMap which it was given, thaws it, modifies it, freezes it again, and then returns the same array in a "new" HashMap result. This means that if we have a thunk unsafeInsertWith some_key some_value hm which we enter twice, the multiple entry will be observable through the fact that the array element at some_index will be f v (f v old_v), instead of f v old_v as was intended.

Last edited 7 months ago by bgamari (previous) (diff)

comment:26 Changed 7 months ago by bgamari

I've put this aside for now. The current state of things is,

  • My current theory is that a closure is being entered multiple times, resulting in repeated observable mutation. It's still not entirely clear precisely which closure is being entered multiple times, but the evidence before us is consistent with the multiple mutation hypothesis.
  • compiling unordered-containers with -feager-blackholing appears to resolve the issue, however I'm pretty certain that this is merely shrinking the race window, not eliminating it altogether.
  • adding a noDuplicate# (e.g. see https://github.com/bgamari/unordered-containers/commit/bf799b2a00da8c78ac78d818de8f1963c7ff87ad) at the beginning of unsafeInsertWith also resolves the issue, even with lazy blackholing enabled. However, this seems like a very large hammer to place in an inner loop like this.

comment:27 Changed 7 months ago by Ben Gamari <ben@…>

In 228d4670/ghc:

Use memcpy in cloneArray

While looking at #13615 I noticed that there was this strange open-coded
memcpy in the definition of the cloneArray macro. I don't see why this
should be preferable to memcpy.

Test Plan: Validate, particularly focusing on array operations

Reviewers: simonmar, tibbe, austin, alexbiehl

Reviewed By: tibbe, alexbiehl

Subscribers: alexbiehl, rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D3504

comment:28 Changed 7 months ago by glguy

I believe I was experiencing this same bug in code that used Foreign.Marshal.Unsafe.unsafeLocalState and par together

https://github.com/glguy/kami-solver/blob/master/src/Kami.hs#L172-L181

unsafeLocalState is defined as unsafeDupablePerformIO. Switching to unsafePerformIO (which adds the noDuplicate) resolved the random segmentation faults.

I'm just adding this comment here in case it becomes useful at some point.

comment:29 Changed 7 months ago by bgamari

Milestone: 8.2.17.10.4

comment:30 Changed 7 months ago by bgamari

Milestone: 7.10.48.2.1
Version: 8.2.1-rc27.10.3

comment:31 Changed 6 months ago by bgamari

Owner: set to bgamari

comment:32 Changed 6 months ago by asr

Cc: asr added

comment:33 Changed 6 months ago by bgamari

So with -A8k the incorrect behavior seems to disappear entirely, but with the expected degradation in performance. However, it's hard to rule out that this is merely due a change in timings.

However, I've also confirmed that we are indeed entering unsafeInsertWith multiple times on the same HashMap. I've instrumented unsafeInsertWith to record when it is entered (within the context of a particular fromListWith call) and found that indeed we do see multiple threads entering concurrently.

Looking at the stacks of the threads involved in the multiple entry, I see some highly suspicious behavior. Namely, the top twenty or so frames are nearly identical. For instance, we have

  • <unnamed>

     
    1 thread 17 (entered first)
     1thread 16 (entered second)
    22-----------------------------------
    33
    440: RET_SMALL
     5  return = 0x434298 <rxSU_info+72> (symbol and line for ./Data/HashMap/Strict.hs, line 93)
     6  field 0: Ptr  0x434298 <rxSU_info+72>
     7  field 1: Word 283476377554
     81: RET_SMALL
    59  return = 0x439338 <sy5n_info+272> (symbol and line for ./Data/HashMap/Strict.hs, line 166)
    610  field 0: Ptr  0x439338 <sy5n_info+272>
    711  field 1: Ptr  0x420072cea0
     
    1115  field 5: Ptr  0x42008230b0
    1216  field 6: Ptr  0x420082312d
    1317  field 7: Word 283476377554
    14 1: RET_SMALL
     182: RET_SMALL
    1519  return = 0x439498 <unorderedzmcontainerszm0zi2zi8zi0zminplace_DataziHashMapziStrict_zdwfromListWith_info+216> (symbol and line for ./Data/HashMap/Strict.hs, line 160)
    16 2: UPDATE_FRAME(0x4200810610: BLACKHOLE(0x420046661a: constr))
    17 3: RET_SMALL
     203: UPDATE_FRAME(0x4200810610: BLACKHOLE(0x420046661a: constr))
     214: RET_SMALL
    1822  return = 0x44c790 <snqC_info+88> (symbol and line for ./Data/HashMap/Base.hs, line 122)
    1923  field 0: Ptr  0x44c790 <snqC_info+88>
    2024  field 1: Word 283476373666
    21 4: UPDATE_FRAME(0x4200810678: BLACKHOLE(0x42004694fa: constr))
    22 5: RET_SMALL
     255: UPDATE_FRAME(0x4200810678: BLACKHOLE(0x42004694fa: constr))
     266: RET_SMALL
    2327  return = 0x471b98 <base_GHCziBase_map_info+192> (symbol and line for libraries/base/GHC/Base.hs, line 946)
    2428  field 0: Word 4660120
    25 6: UPDATE_FRAME(0x42008106b0: BLACKHOLE(0x4200469552: constr))
    26 7: RET_SMALL
     297: UPDATE_FRAME(0x42008106b0: BLACKHOLE(0x4200469552: constr))
     308: RET_SMALL
    2731  return = 0x4af310 <s4Vv_info+56> (symbol and line for libraries/base/GHC/List.hs, line 187)
    2832  field 0: Word 4911888
    2933  field 1: Word 283476373528
    3034  field 2: Ptr  0x420082305a
    31 8: RET_SMALL
     359: RET_SMALL
    3236  return = 0x40abb8 <sr4u_info+232> (symbol and line for src/Solver.hs, line 41)
    3337  field 0: Word 4238264
    3438  field 1: Ptr  0x4200822f78
    3539  field 2: Word 283476373392
    3640  field 3: Ptr  0x4200822ee0
    3741  field 4: Ptr  0x4200822f00
    38 9: UPDATE_FRAME(0x42008106e0: BLACKHOLE(0x42004414b0: THUNK))
    39 10: RET_SMALL
     4210: UPDATE_FRAME(0x42008106e0: BLACKHOLE(0x42004414b0: THUNK))
     4311: RET_SMALL
    4044  return = 0x471d10 <s5BF_info+168> (symbol and line for libraries/base/GHC/Base.hs, line 860)
    4145  field 0: Word 4660496
    42 11: UPDATE_FRAME(0x4200724cf8: THUNK)
    43 12: RET_SMALL
     4612: UPDATE_FRAME(0x4200724cf8: THUNK)
     4713: RET_SMALL
    4448  return = 0x4392c8 <sy5n_info+160> (symbol and line for ./Data/HashMap/Strict.hs, line 164)
    4549  field 0: Word 4428488
    4650  field 1: Word 283469122208
     
    4852  field 3: Ptr  0x7ec538
    4953  field 4: Ptr  0x4200138a90
    5054  field 5: Ptr  0x4200138b0d
    51   field 6: Ptr  0x420007ae22
    52 13: RET_SMALL
     55  field 6: Ptr  0x4200724e9a
     5614: RET_SMALL
    5357  return = 0x439498 <unorderedzmcontainerszm0zi2zi8zi0zminplace_DataziHashMapziStrict_zdwfromListWith_info+216> (symbol and line for ./Data/HashMap/Strict.hs, line 160)
    54 14: UPDATE_FRAME(0x420010bb98: AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8))
    55 15: RET_SMALL
     5815: UPDATE_FRAME(0x420010bb98: AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8))
     5916: RET_SMALL
    5660  return = 0x44c790 <snqC_info+88> (symbol and line for ./Data/HashMap/Base.hs, line 122)
    5761  field 0: Ptr  0x44c790 <snqC_info+88>
    5862  field 1: Word 283469122178
    59 16: UPDATE_FRAME(0x420010bc00: AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20))
    60 17: RET_SMALL
     6317: UPDATE_FRAME(0x420010bc00: AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20))
     6418: RET_SMALL
    6165  return = 0x471b98 <base_GHCziBase_map_info+192> (symbol and line for libraries/base/GHC/Base.hs, line 946)
    6266  field 0: Word 4660120
    63 18: UPDATE_FRAME(0x420010bc38: AP_STACK(size=2, fun=AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20), payload=0x420010bc58))
    64 19: RET_SMALL
     6719: UPDATE_FRAME(0x420010bc38: AP_STACK(size=2, fun=AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20), payload=0x420010bc58))
     6820: RET_SMALL
    6569  return = 0x4af310 <s4Vv_info+56> (symbol and line for libraries/base/GHC/List.hs, line 187)
    6670  field 0: Word 4911888
    6771  field 1: Word 283469122040
    6872  field 2: Ptr  0x4200138a3a
    69 20: RET_SMALL
     7321: RET_SMALL
    7074  return = 0x40abb8 <sr4u_info+232> (symbol and line for src/Solver.hs, line 41)
    7175  field 0: Word 4238264
    7276  field 1: Ptr  0x4200138968
    7377  field 2: Word 283469121920
    7478  field 3: Ptr  0x42001388d0
    7579  field 4: Ptr  0x42001388f0
    76 21: UPDATE_FRAME(0x420010bc68: AP_STACK(size=10, fun=AP_STACK(size=2, fun=AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20), payload=0x420010bc58), payload=0x420010bc88))
    77 22: UPDATE_FRAME(0x4200056810: BLACKHOLE(0x420000bc50: TSO))
    78 23: RET_SMALL
     8022: UPDATE_FRAME(0x420010bc68: AP_STACK(size=10, fun=AP_STACK(size=2, fun=AP_STACK(size=3, fun=AP_STACK(size=9, fun=BLACKHOLE(0x4200724062: constr), payload=0x420010bbb8), payload=0x420010bc20), payload=0x420010bc58), payload=0x420010bc88))
     8123: UPDATE_FRAME(0x420086c660: BLACKHOLE(0x420086f000: TSO))
     8224: RET_SMALL
    7983  return = 0x471d38 <s5BF_info+208> (symbol and line for libraries/base/GHC/Base.hs, line 859)
    8084  field 0: Ptr  0x471d38 <s5BF_info+208>
    81   field 1: Word 283468195841
    82 24: UPDATE_FRAME(0x4200056560: BLACKHOLE(0x420000bc50: TSO))
    83 25: RET_SMALL
     85  field 1: Word 283476674129
     8625: UPDATE_FRAME(0x420086c3b0: BLACKHOLE(0x420086f000: TSO))
     8726: RET_SMALL
    8488  return = 0x4392c8 <sy5n_info+160> (symbol and line for ./Data/HashMap/Strict.hs, line 164)
    8589  field 0: Word 4428488
    86   field 1: Word 283468195696
    87   field 2: Word 283468195296
     90  field 1: Word 283476673984
     91  field 2: Word 283476673584
    8892  field 3: Ptr  0x7e60c8
    89   field 4: Ptr  0x4200056760
    90   field 5: Ptr  0x42000567dd
     93  field 4: Ptr  0x420086c5b0
     94  field 5: Ptr  0x420086c62d
    9195  field 6: Ptr  0x7ee011
    92 26: RET_SMALL
     9627: RET_SMALL
    9397  return = 0x439498 <unorderedzmcontainerszm0zi2zi8zi0zminplace_DataziHashMapziStrict_zdwfromListWith_info+216> (symbol and line for ./Data/HashMap/Strict.hs, line 160)
    94 27: UPDATE_FRAME(0x4200056718: BLACKHOLE(0x420000bc50: TSO))
    95 28: RET_SMALL
     9828: UPDATE_FRAME(0x420086c568: BLACKHOLE(0x420086f000: TSO))
     9929: RET_SMALL
    96100  return = 0x44c790 <snqC_info+88> (symbol and line for ./Data/HashMap/Base.hs, line 122)
    97101  field 0: Ptr  0x44c790 <snqC_info+88>
    98   field 1: Word 283468195666
    99 29: UPDATE_FRAME(0x4200056650: BLACKHOLE(0x420000bc50: TSO))
    100 30: RET_SMALL
     102  field 1: Word 283476673954
     10330: UPDATE_FRAME(0x420086c4a0: BLACKHOLE(0x420086f000: TSO))
     10431: RET_SMALL
    101105  return = 0x471b98 <base_GHCziBase_map_info+192> (symbol and line for libraries/base/GHC/Base.hs, line 946)
    102106  field 0: Word 4660120
    103 31: UPDATE_FRAME(0x4200056688: BLACKHOLE(0x420000bc50: TSO))
    104 32: RET_SMALL
     10732: UPDATE_FRAME(0x420086c4d8: BLACKHOLE(0x420086f000: TSO))
     10833: RET_SMALL
    105109  return = 0x4af310 <s4Vv_info+56> (symbol and line for libraries/base/GHC/List.hs, line 187)
    106110  field 0: Word 4911888
    107   field 1: Word 283468195528
    108   field 2: Ptr  0x420005670a
    109 33: RET_SMALL
     111  field 1: Word 283476673816
     112  field 2: Ptr  0x420086c55a
     11334: RET_SMALL
    110114  return = 0x40abb8 <sr4u_info+232> (symbol and line for src/Solver.hs, line 41)
    111115  field 0: Word 4238264
    112   field 1: Ptr  0x4200056638
    113   field 2: Word 283468195408
    114   field 3: Ptr  0x42000565a0
    115   field 4: Ptr  0x42000565c0
    116 34: UPDATE_FRAME(0x4200056528: BLACKHOLE(0x420000bc50: TSO))
    117 35: RET_SMALL
     116  field 1: Ptr  0x420086c488
     117  field 2: Word 283476673696
     118  field 3: Ptr  0x420086c3f0
     119  field 4: Ptr  0x420086c410
     12035: UPDATE_FRAME(0x420086c378: BLACKHOLE(0x420086f000: TSO))
     12136: RET_SMALL
    118122  return = 0x471d38 <s5BF_info+208> (symbol and line for libraries/base/GHC/Base.hs, line 859)
    119123  field 0: Ptr  0x471d38 <s5BF_info+208>
    120   field 1: Word 283468195081
    121 36: UPDATE_FRAME(0x4200056258: BLACKHOLE(0x420000bc50: TSO))
    122 37: RET_SMALL
     124  field 1: Word 283476673369
     12537: UPDATE_FRAME(0x420086c0a8: BLACKHOLE(0x420086f000: TSO))
     12638: RET_SMALL
    123127  return = 0x4392c8 <sy5n_info+160> (symbol and line for ./Data/HashMap/Strict.hs, line 164)
    124128  field 0: Word 4428488
    125   field 1: Word 283468194936
    126   field 2: Word 283468194536
     129  field 1: Word 283476673224
     130  field 2: Word 283476672824
    127131  field 3: Ptr  0x7ebb39
    128   field 4: Ptr  0x4200056468
    129   field 5: Ptr  0x42000564e5
     132  field 4: Ptr  0x420086c2b8
     133  field 5: Ptr  0x420086c335
    130134  field 6: Ptr  0x7ee011
    131 38: RET_SMALL
     13539: RET_SMALL
    132136  return = 0x439498 <unorderedzmcontainerszm0zi2zi8zi0zminplace_DataziHashMapziStrict_zdwfromListWith_info+216> (symbol and line for ./Data/HashMap/Strict.hs, line 160)
    133 39: UPDATE_FRAME(0x4200056420: BLACKHOLE(0x420000bc50: TSO))

Here we see that the threads' stacks differ by only a single word (save frame 0, which is merely an artifact of my instrumentation) in the first 21 frames. Intriguingly, they begin to differ after frame 21/22, which is an update frame updating an AP_STACK. Strangely, the only way that we can end up with such a closure is via the raiseAsync machinery or via the bytecode interpreter. There's clearly still more digging to do.

comment:34 Changed 6 months ago by bgamari

Ahh, of course; the AP_STACK is presumably being produced by raiseAsync while suspending a thread in preparation for GC.

comment:35 Changed 6 months ago by bgamari

One interesting observation is that the multiple-entry issue persists even if the list passed to fromListWith is fully forced; that is,

    let xs' = deepseq xs `seq` HM.toList (HM.fromListWith (+) xs)

instead of

    let xs' = HM.toList (HM.fromListWith (+) xs)
Last edited 6 months ago by bgamari (previous) (diff)

comment:36 Changed 6 months ago by bgamari

Alright, I think I have a theory for how we end up getting multiple entries despite having no thunk allocation inside fromListWith:

  1. Thread A enters a fromListWith closure and begins folding over the insertions
  2. At some point during this fold we need to garbage collect; the garbage collector constructs an AP_STACK closure capturing the state of Thread A, including the partially-finished fromListWith computation
  3. Garbage collection commences and finishes, evaluation resumes
  4. At some point Thread A is resumed, picking up the previously suspended fromListWith computation; we are blackholing lazily so no update to the closure is made
  5. At some later point Thread B tries to force the same fromListWith computation; finding that it's not blackholed, it enters
  6. We now have two mutator threads performing evaluation on the same, effectful computation with shared, mutable state.

Does this sound plausible?

The only bit that I'm a bit hazy on is point (5). That is, how is it that Thread B is able to enter the suspended computation started by Thread A given that (if I understand correctly) Thread A won't update the original fromListWith thunk until finishes its evaluation and pops its associated update frame.

Last edited 5 months ago by bgamari (previous) (diff)

comment:37 Changed 6 months ago by dfeuer

I've written a seriously legitimate modification of Data.HashMap.Strict.fromListWith that Ben indicates still demonstrates the problem. This version uses a separate type of mutable hashmap that holds MArrays. When it's done, it traverses the tree and (unsafely) freezes all the arrays. Unless I've goofed up, there should be no doubt that this version should work correctly.

comment:38 Changed 6 months ago by dfeuer

Cc: dfeuer added

comment:39 Changed 5 months ago by bgamari

simonmar, does comment:36 sound at all plausible?

comment:40 Changed 5 months ago by bgamari

I should also mention that the variant prepared by dfeuer in comment:37 also exhibits this bug.

comment:41 Changed 5 months ago by andrewthad

I have simplified the example failure here: https://github.com/andrewthad/cuddly-bassoon/tree/9e87acbc43b10d38758e6263b8d17231cb1f3ed7

In this commit, I entirely removed the use of hashmap and hashable. I'm just adding up the numbers in an STRef, and the problem still shows up.

comment:42 Changed 5 months ago by pacak

runST = unsafeDupablePerformIO -- more or less

unsafeDupablePerformIO: This version of unsafePerformIO is more efficient because it omits the check that the IO is only being performed by a single thread.

Basically "not to be used in multiple threads, we really mean it, look at the name"

par: Indicates that it may be beneficial to evaluate the first argument in parallel with the second.

Basically "do things in multiple threads".

The problem is unsafeDupablePerformIO is hidden in a bunch of different places. HashMaps, ByteStrings, etc... par is less frequent but not used directly either but it's also out there.

comment:43 Changed 5 months ago by andrewthad

pacak, it should in general be safe to duplicate an ST computation on multiple threads. For IO, it's a problem because mutable variables can come from outside of the IO computation. Something like unsafeDupablePerformIO (modifyIORef ...) is bad because for this reason. But, with ST, variables cannot escape the scope of the computation, so it should be safe. There is no way to write the ST equivalent of the expression I just gave. You would instead have to write: runST (do {r <- newSTRef 5; modifySTRef r ...;}), but this should be safe to duplicate, since each thread ends up with its own STRef (no sharing of the reference across threads). Also worth mentioning is that GHC cannot float out the call to newSTRef because of how the state token is threaded through the computation in STs Monad instance.

In #11760, it was discovered that lazy ST was unsound because it allows you to create pure expressions that capture mutable variables. When one of these expression is evaluated on two capabilities simultaneously, we get nondeterministic results. According to people on that thread, strict ST should not have the same problem. I don't totally understand why.

comment:44 Changed 5 months ago by bgamari

According to people on that thread, strict ST should not have the same problem. I don't totally understand why.

In the case of strict ST all effects should be executed before we produce the result. This means that there should be no chance that entering any thunk arising from the ST block will produce effects, meaning that multiple entry should pose no threat to correctness. However, if my hypothesis from comment:36 holds then it is indeed possible for the garbage collector to suspend a computation before all effects have taken place. This places us in a position where multiple-entry will indeed cause effects to be performed multiple times.

Last edited 5 months ago by bgamari (previous) (diff)

comment:45 Changed 5 months ago by bgamari

I have confirmed that we indeed hit the codepath in raiseAsync responsible for updating the thunk under evaluation. I believe this fills the final question in comment:36.

To recap, what is happening is the following,

  1. Thread A enters a fromListWith closure and begins folding over the insertions
  2. At some point during this fold we need to garbage collect (usually due to a stack overflow, judging from the eventlog); the garbage collector constructs an AP_STACK closure capturing the state of Thread A, including the partially-finished fromListWith computation. The fromListWith thunk is updated to point to this AP_STACK.
  3. Garbage collection commences and finishes, evaluation resumes
  4. At some point Thread A is resumed, entering the previously saved AP_STACK computation which we prepared in step (2); we are blackholing lazily so no update to the AP_STACK closure is made
  5. At some later point Thread B tries to force the same AP_STACK computation; finding that it's not blackholed, it enters

We now have two mutator threads performing evaluation on the same, effectful computation with shared, mutable state.

My first intuition says that the easiest way to avoid this would be to unconditionally eagerly blackhole AP_STACKs. I believe this can be done straightforwardly,

  • rts/Apply.cmm

    diff --git a/rts/Apply.cmm b/rts/Apply.cmm
    index f14bb8f331..a35b41e5b0 100644
    a b INFO_TABLE(stg_AP_STACK,/*special layout*/0,0,AP_STACK,"AP_STACK","AP_STACK") 
    464464
    465465  Words = StgAP_STACK_size(ap);
    466466
     467  W_ old_info;
     468  (old_info) = prim %cmpxchgW(ap, stg_AP_STACK_info, stg_WHITEHOLE_info);
     469  if (old_info != stg_AP_STACK_info) {
     470    /* someone else beat us to it */
     471    jump ENTRY_LBL(stg_WHITEHOLE) (ap);
     472  }
     473  StgInd_indirectee(ap) = CurrentTSO;
     474  W_[ap] = stg_EAGER_BLACKHOLE_info;
     475
    467476  /*
    468477   * Check for stack overflow.  IMPORTANT: use a _ENTER check here,
    469478   * because if the check fails, we might end up blackholing this very

However, in doing this I'm seeing non-deterministic <<loops>> from the testcase. I haven't yet worked out why.

Last edited 5 months ago by bgamari (previous) (diff)

comment:46 Changed 5 months ago by bgamari

Indeed disabling the duplicate-work-suspension logic in threadPaused prevents the issue from manifesting.

comment:47 Changed 5 months ago by Ben Gamari <ben@…>

In 1e47126/ghc:

rts: Clarify whitehole logic in threadPaused

Previously we would look at the indirectee field of a WHITEHOLE object.
However, WHITEHOLE isn't a sort of indirection and therefore has no
indirectee field.

I encountered this while investigating #13615, although it doesn't fix
that bug.

Test Plan: Validate

Reviewers: simonmar, austin, erikd

Subscribers: rwbarton, thomie

GHC Trac Issues: #13615

Differential Revision: https://phabricator.haskell.org/D3674

comment:48 Changed 5 months ago by bgamari

Me: 1, Bug: 0.

Details and patch coming.

comment:49 Changed 5 months ago by bgamari

Differential Rev(s): Phab:D3695
Status: newpatch

Phab:D3695 is very similar to what I proposed in comment:45, but shifted a few lines to avoid re-entering a BLACKHOLE if the stack check fails. Such a silly mistake yet it took quite some time to work out.

Last edited 5 months ago by bgamari (previous) (diff)

comment:50 Changed 5 months ago by Ben Gamari <ben@…>

In 0836bfb/ghc:

testsuite: Add testcase for #13615

Reviewers: austin

Subscribers: dfeuer, rwbarton, thomie

GHC Trac Issues: #13615

Differential Revision: https://phabricator.haskell.org/D3696

comment:51 Changed 5 months ago by Ben Gamari <ben@…>

In fd7a7a6/ghc:

Eagerly blackhole AP_STACKs

This fixes #13615. See the rather lengthy Note [AP_STACKs must be
eagerly blackholed] for details.

Reviewers: simonmar, austin, erikd, dfeuer

Subscribers: duog, dfeuer, hsyl20, rwbarton, thomie

GHC Trac Issues: #13615

Differential Revision: https://phabricator.haskell.org/D3695

comment:52 Changed 5 months ago by bgamari

Resolution: fixed
Status: patchclosed

comment:53 Changed 4 months ago by Ben Gamari <ben@…>

In c940e3b/ghc:

dmdAnal: Ensure that ExnStr flag isn't dropped inappropriately

This fixes #13977 and consequently #13615. Here an optimization in the
demand analyser was too liberal, causing us to drop the ExnStr flag and
consequently resulting in incorrect demand signatures. This manifested
as a segmentation fault in #13615 as we incorrectly concluded that an
application of catchRetry# would bottom.

Specifically, we had

    orElse' :: STM a -> STM a -> STM a
    orElse' x = catchRetry# x y
      where y = {- some action -}

Where the catchRetry# primop places a demand of <xC(S),1*C1(U)> on its
first argument. However, due to #13977 the demand analyser would assign
a demand of <C(S),1*C1(U)> on the first argument of orElse'. Note the
missing `x`.

    case orElse' bottomingAction anotherAction of { x -> Just x }

being transformed to,

    case orElse' bottomingAction anotherAction of {}

by the simplifier. This would naturally blow up when orElse' returned at
runtime, causing the segmentation fault described in #13615.

Test Plan: Validate, perhaps add a testcase

Reviewers: austin, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie

GHC Trac Issues: #13977, #13615

Differential Revision: https://phabricator.haskell.org/D3756

comment:54 Changed 4 months ago by bgamari

Comment:53 was actualling intended to refer to #13916. Too many bugs with too similar numbers.

comment:55 Changed 4 months ago by Ben Gamari <ben@…>

In bade356/ghc:

rts: Claim AP_STACK before adjusting Sp

In the fix to #13615 we introduced some logic to atomically blackhole
AP_STACKs closures upon entry. However, this logic was placed *after* a
stack pointer adjustment. This meant that if someone else beat us to
blackholing the AP_STACK we would suspend the thread with uninitialized
content on the stack.  This would then later blow up when threadPaused
attempted to walk the stack, hence #13970.

Silly bug but still cost lots of head-scratching to find.

Thanks to albertov for the great repro.

Fixes #13970. Bug originally introduced by the fix to #13615.

Reviewers: austin, erikd, simonmar

Reviewed By: erikd, simonmar

Subscribers: rwbarton, thomie

GHC Trac Issues: #13970, #13615

Differential Revision: https://phabricator.haskell.org/D3760

comment:56 Changed 4 months ago by dfeuer

bgamari, how hard do you think it would be to merge this to 8.0? 7.10? 7.8? Apparently there are still a few people using 7.6, but they should really give it up.

comment:57 Changed 4 months ago by bgamari

Merging likely wouldn't be terribly difficult. Building and releasing bindists, on the other hand, would require several days of effort.

Note: See TracTickets for help on using tickets.