Opened 5 months ago

Closed 4 months ago

Last modified 3 months ago

#13916 closed bug (fixed)

Optimizations create run time seg faults

Reported by: newthin Owned by:
Priority: highest Milestone: 8.2.1
Component: Compiler (CodeGen) Version: 8.0.2
Keywords: optimization Cc: jmct
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: #8091 Differential Rev(s): Phab:D3756
Wiki Page:

Description

In the attached program, when compiling ghc main.hs -O2, the program immediately segfaults on run. When compiled with ghc main.hs, however, the program runs fine. Indeed, in our full code base it seems like even no flag is not enough, and I must explicitly begin the Bracket module with {-# OPTIONS_GHC -O0 #-} -- however, that may be due to stack or some other build tool.

The main logic which fails is in Bracket.hs, which provides an api to convert bracket-like environment handlers (those with open/close capabilities) to caches which you can run concurrent actions against. The API allows two types of limit on the number of concurrent environments - a Lax limit, which will spin up additional connections if none are available, and a Hard limit, which caps the environment count. This seems like a detail, but if I remove the Lax option (as in BracketOneType), the bug goes away. With both options present, both options manifest the bug.

The main.hs module is just a toy program aimed at this api, where we write 1000 lines split to arbitrary files (in a __dir specified on the top line - set to tmp). None of this code is important - the actual use case (which fails) is handling multiple concurrent connections to a database.

The other programs (mainList, mainOneModule, and mainOneType) all appear to run correctly, and represent my attempts to create minimal cases. In particular, these respectively replace Vector with List, move all code into a single module, and get rid of the multiple options for cache types. The modules should all be the same modulo imports (or inlining) of different Bracket modules.

I have replicated the bug on Linux, Windows, and Windows Creator Update, all with their respective versions of GHC-8.0.2. The code fails for at least vector-0.11.0.0 and vector-0.12.0.1.

NB: There are times with the database access example where, worse than crashing, the program will simply return bad results. I think this might have to do with the laziness of withEnv, and our production version uses NFData to deepseq arguments before returning the environment to the cache - that was omitted in this test case for simplicity's sake.

Attachments (1)

O2failurereplication.tar (32.0 KB) - added by newthin 5 months ago.
4 programs, one of which (main.hs) fails when compiled with optimizations

Download all attachments as: .zip

Change History (28)

Changed 5 months ago by newthin

Attachment: O2failurereplication.tar added

4 programs, one of which (main.hs) fails when compiled with optimizations

comment:1 Changed 5 months ago by bgamari

Oh dear, this looks quite bad indeed. Thanks for providing such a nice reproducer!

gdb places the crash in,

Program received signal SIGSEGV, Segmentation fault.
stg_takeMVarzh () at rts/PrimOps.cmm:1483
1483	rts/PrimOps.cmm: No such file or directory.
(gdb) bt
#0  stg_takeMVarzh () at rts/PrimOps.cmm:1483
#1  0x000000000064dc88 in base_GHCziIOziHandleziInternals_zdwdozuoperation_info () at libraries/base/GHC/IO/Handle/Internals.hs:163
#2  0x0000000000646ce0 in sb1n_info () at libraries/base/GHC/IO/Handle.hs:470
#3  0x0000000000407378 in s6zq_info () at ./Bracket.hs:105
#4  0x0000000000613e58 in s3E1_info () at libraries/base/Control/Exception/Base.hs:222
#5  0x00000000006e75f0 in ?? ()
#6  0x000000000040a328 in sdbc_info () at main.hs:21
#7  0x0000000000454680 in sbFu_info ()
#8  0x0000000000000000 in ?? ()

The disassembly of the top frame is,

(gdb) disassemble 
Dump of assembler code for function stg_takeMVarzh:
   0x00000000006ebd70 <+0>:	mov    (%rbx),%rax
   0x00000000006ebd73 <+3>:	cmpq   $0x94f088,0x18(%rbx)
   0x00000000006ebd7b <+11>:	je     0x6ebe97 <stg_takeMVarzh+295>
   0x00000000006ebd81 <+17>:	mov    %rbx,%rcx
   0x00000000006ebd84 <+20>:	mov    0x18(%rbx),%rdx
   0x00000000006ebd88 <+24>:	mov    0x8(%rbx),%rbx
   0x00000000006ebd8c <+28>:	cmp    $0x94f088,%rbx
   0x00000000006ebd93 <+35>:	je     0x6ebe56 <stg_takeMVarzh+230>
=> 0x00000000006ebd99 <+41>:	cmpq   $0x6ecf30,(%rbx)
   0x00000000006ebda0 <+48>:	je     0x6ebe4d <stg_takeMVarzh+221>
...

Where 0x6ecf30 is apparently stg_IND_info. This likely means that we are approximately here in stg_takeMVarzh,

...
loop:
    if (q == stg_END_TSO_QUEUE_closure) {
        /* No further putMVars, MVar is now empty */
        StgMVar_value(mvar) = stg_END_TSO_QUEUE_closure;
        // If the MVar is not already dirty, then we don't need to make
        // it dirty, as it is empty with nothing blocking on it.
        unlockClosure(mvar, info);
        return (val);
    }
    if (StgHeader_info(q) == stg_IND_info ||     /* <- perhaps here */
        StgHeader_info(q) == stg_MSG_NULL_info) {
        q = StgInd_indirectee(q);
        goto loop;
    }
...

The value of $rbx looks quite crazy,

(gdb) print $rbx
$1 = 5283285312859013268

comment:2 Changed 5 months ago by bgamari

Well this may be odd. When stg_takeMVarzh is called $rax contains a pointer to a CATCH_RETRY_FRAME. I've hiked more miles than I can count on my fingers today, so I'll need to drop this here and get some sleep. However, I'll be back.

comment:3 Changed 5 months ago by bgamari

One last observation: it looks like this CATCH_RETRY_FRAME was (perhaps as expected) constructed by stg_catchRetryzh,

(rr) bt
#0  0x00000000006eb879 in stg_catchRetryzh () at rts/PrimOps.cmm:1245
#1  0x00000000006eb408 in ?? () at rts/PrimOps.cmm:990
#2  0x00000000004073a0 in s6zq_info () at ./Bracket.hs:105
#3  0x0000000000613e58 in s3E1_info () at libraries/base/Control/Exception/Base.hs:222
#4  0x00000000006e75f0 in ?? ()
#5  0x000000000040a328 in sdbc_info () at main.hs:21
#6  0x0000000000454680 in sbFu_info ()
#7  0x0000000000000000 in ?? ()

comment:4 Changed 5 months ago by dfeuer

The code you point to implements MVars. The example code doesn't seem to handle any of those directly. The only MVars I see in the vicinity are in async, and there don't seem to be any arrays around there to index improperly. Specifically, forConcurrently_ (used in main.hs) is ultimately defined in terms of concurrently, which uses an MVar. But that doesn't bring any potentially-dangerous vector indexes close enough to make those look like good candidates for the problem.

comment:5 Changed 5 months ago by bgamari

The code you point to implements MVars. The example code doesn't seem to handle any of those directly.

Look again at the stacktrace. The MVars involved are those in GHC's Handle implementation.

comment:6 Changed 5 months ago by bgamari

The crash appears to happen in hsIsClosed called from isDead; killer is never run. Specifically, my instrumented testcase produces the following fairly reliably (sometimes there are more isDead invocations before the segfault),

$ ./main
maker
maker
maker
maker
maker
isDead
Segmentation fault
Last edited 5 months ago by bgamari (previous) (diff)

comment:7 Changed 5 months ago by bgamari

Here is my analysis of stg_takeMVarzh for future reference,

stg_takeMVarzh:                              // $rbx = mvar
<+0>:	mov    (%rbx),%rax                   // $rax = info = GET_INFO(mvar)
<+3>:	cmpq   $0x94f128,0x18(%rbx)          // if (StgMVar_value(mvar) == stg_END_TSO_QUEUE_closure)
<+11>:	je     0x6ec277 <stg_takeMVarzh+295> //     { jump; } (not taken)
<+17>:	mov    %rbx,%rcx                        
<+20>:	mov    0x18(%rbx),%rdx               // $rdx = val = StgMVar_value(mvar)
<+24>:	mov    0x8(%rbx),%rbx                // $rbx = q   = StgMVar_head(mvar)
<+28>:	cmp    $0x94f128,%rbx                // if (q == stg_END_TSO_QUEUE_closure)
<+35>:	je     0x6ec236 <stg_takeMVarzh+230> //     { jump; } (not taken)
<+41>:	cmpq   $0x6ed310,(%rbx)              // if (StgHeader_info(q) == stg_IND_info)
                                             // ^ crashes here
<+48>:	je     0x6ec22d <stg_takeMVarzh+221> //     { jump; }
<+54>:	cmpq   $0x6ed938,(%rbx)              // if (StgHeader_info(q) == stg_MSG_NULL_info)

comment:8 Changed 5 months ago by bgamari

Something slightly suspicious that I hadn't noticed earlier: compiling Bracket with -dstg-lint seems to loop the compiler (or hits a very large stack explosion).

Last edited 5 months ago by bgamari (previous) (diff)

comment:9 Changed 5 months ago by bgamari

Unfortunately the STG lint issue is due to a bug in the linter. See #13941 and fix Phab:D3714.

comment:10 Changed 5 months ago by bgamari

So to recap, takeMVar is getting passed a pointer that does not appear to point to an MVar closure (it is in fact a blackhole'd cons cell, which was constructed by GHC.Base.++). There are two possible explanations for this,

  1. the closure was once an MVar but overwritten by someone since it was allocated (implying that whatever code overwrote the closure is at fault)
  2. the pointer is invalid (implying that some parent in the callgraph may be at fault, or they took the pointer from another closure which itself was trampled on by someone else).

I believe I've ruled out (1): watching the closure's memory suggests that there never was an MVar closure at this address.

Last edited 5 months ago by bgamari (previous) (diff)

comment:11 Changed 5 months ago by bgamari

Indeed a FUN closure is being passed to GHC.IO.Handle.hIsClosed, which expects a Handle. hIsClosed then later calls GHC.IO.Handle.Internals.do_operation, which in turn calls takeMVar.

comment:12 Changed 4 months ago by bgamari

Milestone: 8.2.2
Priority: normalhighest

Indeed the Handle argument passed to killer lambda is also a cons cell.

comment:13 Changed 4 months ago by jmct

Cc: jmct added

comment:14 Changed 4 months ago by bgamari

For the record, I now have a simplification of the testcase that produces an "MVar object entered!" assertion instead of a segmentation fault.

It seems that the reason is that readMVar is getting passed a MVAR_DIRTY closure instead of an MVar CONSTR. The pointer it is given is tagged with tag 0, consequently it tries to enter the closure, resulting in the assertion.

Last edited 4 months ago by bgamari (previous) (diff)

comment:15 Changed 4 months ago by bgamari

So it looks like when we are in limitMakeEnv.go we find that eenvpermission is the MVar () instead of an Either (MVar ()) (). Consequently when we extract the field while performing the case analysis on it, we extract the MVar# instead of an MVar() as the code expects. This is how we end up passing a MVAR_DIRTY to readMVar.

It looks like we end up in this situation due to miscompilation. The STG has this in the body of limitMakeEnv.go,

let sat_s6tv [Occ=Once]
      :: GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld,
               Data.Either.Either env_a3xy () #)
    [LclId] =
        \r [void_0E]
            src<Bracket.hs:99:40-68>
            src<Bracket.hs:92:34-39>
            src<Bracket.hs:62:47-52>
            Bracket.$wtakeEnv
                ww1_s6tk
                ww2_s6tl
                ww3_s6tm
                GHC.Prim.void#; } in ...

where,

Bracket.$wtakeEnv [InlPrag=[0]]
  :: forall env.
     GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Array# (Control.Concurrent.STM.TMVar.TMVar env)
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, env #)

Clearly this is bogus: $wtakeEnv returns an env whereas sat_s6tv is supposed to be returning an Either. Presumably stg-lint didn't catch this due to its poor sensitivity to type errors.

Last edited 4 months ago by bgamari (previous) (diff)

comment:16 Changed 4 months ago by bgamari

It turns out the simplified Core output for s6tv is even crazier,

let sat_s6tv [Occ=Once]
      :: GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld,
               Data.Either.Either env_a3xy () #)
    [LclId]
    sat_s6tv
      = \ (s1_s6tt [Occ=Once, OS=OneShot]
             :: GHC.Prim.State# GHC.Prim.RealWorld) ->
          src<Bracket.hs:99:40-68>
          src<Bracket.hs:92:34-39>
          src<Bracket.hs:62:47-52>
          case Bracket.$wtakeEnv
                 @ env_a3xy ww1_s6tk ww2_s6tl ww3_s6tm s1_s6tt
                                   of {} } in

It seems that somehow the strictness analyser has concluded that takeEnv diverges as its strictness signature is,

takeEnv_s5fz
  :: forall env.
     Vector (TMVar env)
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, env #)
[LclId,
 Arity=2,
 Str=<B,1*U(U,U,U)><B,U>b,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [20 0] 126 0}]

comment:17 Changed 4 months ago by bgamari

Here is a sketch of the structure of takeEnv along with strictness signatures,

takeEnv =
  \ (@ env_a3hT)
    (x_X3Lx [Dmd=<B,1*U(U,U,U)>] :: Vector (TMVar env_a3hT))
    (eta_B1 [Dmd=<B,U>] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
  case x_X3Lx of
      { Data.Vector.Vector ipv_s4Q7 [Dmd=<B,U>] ipv_s4Q8 [Dmd=<B,U>]
                           ipv_s4Q9 [Dmd=<B,U>] ->
        joinrec {
          foldlM_loop_a45q [Occ=LoopBreaker]
            :: GHC.Types.SPEC
               -> STM env_a3hT
               -> Int
               -> (# GHC.Prim.State# GHC.Prim.RealWorld, env_a3hT #)
          foldlM_loop_a45q (ds4_a45s [Dmd=<S,1*U>] :: GHC.Types.SPEC)
                           (z1_a45t [Dmd=<C(S),1*C1(U(U,U))>] :: STM env_a3hT)
                           (s1_a45u [Dmd=<S(S),1*U(U)>] :: Int) = ...
        }
        in jump foldlM_loop_a45q
             GHC.Types.SPEC
             (((src<Bracket.hs:119:32-36> GHC.Prim.retry#) @ env_a3hT)
              `cast` (Sym (GHC.Conc.Sync.N:STM[0] <env_a3hT>_R)
                      :: ((GHC.Prim.State# GHC.Prim.RealWorld
                           -> (# GHC.Prim.State# GHC.Prim.RealWorld, env_a3hT #)) :: *)
                         ~R#
                         (STM env_a3hT :: *)))
             (GHC.Types.I# 0#)
      }

Note how the analyzer somehow assigns hyperstrict demands for all of the binders bound by the case analysis of the Vector despite the fact that the whole thing consists of nothing more than a call to foldlM_loop_a45q which does not itself diverge.

Last edited 4 months ago by bgamari (previous) (diff)

comment:18 Changed 4 months ago by bgamari

I think I see: foldlM_loop_a45q places a C(S) demand on its second argument and called with this argument being retry#. retry# has a divergent demand result (namely Demand.botRes). This is likely how the analyzer concludes that takeEnv will diverge.

comment:19 Changed 4 months ago by bgamari

It looks like this regressed with the solution to #8091. See ticket:8091#comment:7 for an explanation of what happened here.

Last edited 4 months ago by bgamari (previous) (diff)

comment:20 Changed 4 months ago by bgamari

So after taking a bit a break I came back to this and realized that ticket:8091#comment:7 isn't quite right for the reason articulated in ticket:8091#comment:9. Instead I think the problem is that ExnStr demands aren't being pushed outwards when they should be. I've opened #13977 to explore this possibility.

comment:21 Changed 4 months ago by bgamari

For the record, as far as short term solutions for 8.2.1 are concerned, there are two possibile workarounds that seem to resolve the issue,

  • weakening retry#'s result demand to topRes
  • weakening retryCatch#'s first-argument demand to lazyApply1Dmd

I suspect that neither of these are the right solution; in the end we should try to fix #13977 (unless, of course, I've horribly misunderstood something).

comment:22 Changed 4 months ago by bgamari

Differential Rev(s): Phab:D3756
Status: newpatch

Phab:D3756 is I believe the principled solution to #13977 and consequently this bug.

Last edited 4 months ago by bgamari (previous) (diff)

comment:23 Changed 4 months ago by bgamari

Merged to master with c940e3b92f4527ca59fcae93f36c869de3e7ccb9,

dmdAnal: Ensure that ExnStr flag isn't dropped inappropriately

This fixes #13977 and consequently #13615. Here an optimization in the
demand analyser was too liberal, causing us to drop the ExnStr flag and
consequently resulting in incorrect demand signatures. This manifested
as a segmentation fault in #13615 as we incorrectly concluded that an
application of catchRetry# would bottom.

Specifically, we had

    orElse' :: STM a -> STM a -> STM a
    orElse' x = catchRetry# x y
      where y = {- some action -}

Where the catchRetry# primop places a demand of <xC(S),1*C1(U)> on its
first argument. However, due to #13977 the demand analyser would assign
a demand of <C(S),1*C1(U)> on the first argument of orElse'. Note the
missing `x`.

    case orElse' bottomingAction anotherAction of { x -> Just x }

being transformed to,

    case orElse' bottomingAction anotherAction of {}

by the simplifier. This would naturally blow up when orElse' returned at
runtime, causing the segmentation fault described in #13615.

Test Plan: Validate, perhaps add a testcase

Reviewers: austin, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie

GHC Trac Issues: #13977, #13615

Differential Revision: https://phabricator.haskell.org/D3756

comment:24 Changed 4 months ago by bgamari

Milestone: 8.2.28.2.1
Resolution: fixed
Status: patchclosed

comment:25 Changed 4 months ago by Ben Gamari <ben@…>

In b2d3ec37/ghc:

testsuite: Add test for #13916

comment:26 Changed 4 months ago by Ben Gamari <ben@…>

In ccac387b/ghc:

Revert "testsuite: Add test for #13916"

This reverts commit b2d3ec370b97fe5f448e8f1d4e0b7374c63c60a8. Didn't
mean to push this one.

comment:27 Changed 3 months ago by Ben Gamari <ben@…>

In 02862140/ghc:

testsuite: Add test for #13916

Reviewers: austin

Subscribers: rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D3764
Note: See TracTickets for help on using tickets.