#13615 closed bug (fixed)
Optimizations create run time seg faults
Reported by: | newthin | Owned by: | |
---|---|---|---|
Priority: | highest | Milestone: | 8.2.1 |
Component: | Compiler (CodeGen) | Version: | 8.0.2 |
Keywords: | optimization | Cc: | jmct |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Runtime crash | Test Case: | |
Blocked By: | | Blocking: | |
Related Tickets: | #8091 | Differential Rev(s): | Phab:D3756 |
Wiki Page: | | | |
Description
In the attached program, when compiled with `ghc main.hs -O2`, the program immediately segfaults on run. When compiled with plain `ghc main.hs`, however, the program runs fine. Indeed, in our full code base it seems that even omitting the flag is not enough, and I must explicitly begin the Bracket module with `{-# OPTIONS_GHC -O0 #-}`; however, that may be due to stack or some other build tool.
The main logic that fails is in Bracket.hs, which provides an API to convert bracket-like environment handlers (those with open/close capabilities) into caches that you can run concurrent actions against. The API allows two kinds of limit on the number of concurrent environments: a `Lax` limit, which spins up additional connections if none are available, and a `Hard` limit, which caps the environment count. This seems like a detail, but if I remove the `Lax` option (as in BracketOneType), the bug goes away. With both options present, both options manifest the bug.
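As a rough illustration only, the API shape described above might look like the following sketch. All names and fields here are hypothetical reconstructions from the description; the real definitions live in Bracket.hs inside the attached tarball and may differ in detail.

```haskell
import Control.Concurrent.STM (TMVar)
import qualified Data.Vector as V

-- Hypothetical names and fields, reconstructed from the description above;
-- the attachment's actual code may differ.
data Limit
  = Lax Int   -- spin up additional environments when none are available
  | Hard Int  -- never exceed this many environments

data Cache env = Cache
  { cacheEnvs  :: V.Vector (TMVar env)  -- pooled environments
  , cacheLimit :: Limit
  }

-- Run an action against an environment taken from (and returned to) the cache.
withEnv :: Cache env -> (env -> IO a) -> IO a
withEnv = undefined  -- sketch only; see Bracket.hs in the attachment
```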
The main.hs module is just a toy program aimed at this API, in which we write 1000 lines split across arbitrary files (in a `__dir` specified on the top line, set to tmp). None of this code is important in itself; the actual use case (which fails) is handling multiple concurrent connections to a database.
The other programs (mainList, mainOneModule, and mainOneType) all appear to run correctly and represent my attempts to create minimal cases. They respectively replace Vector with List, move all code into a single module, and get rid of the multiple options for cache types. The modules should all be the same modulo imports (or inlining) of the different `Bracket` modules.
I have replicated the bug on Linux, Windows, and Windows Creator Update, all with their respective versions of GHC-8.0.2. The code fails for at least vector-0.11.0.0 and vector-0.12.0.1.
NB: there are times with the database-access example where, worse than crashing, the program simply returns bad results. I think this might have to do with the laziness of `withEnv`; our production version uses NFData to deepseq arguments before returning the environment to the cache, but that was omitted from this test case for simplicity's sake.
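A minimal sketch of that production-side mitigation, assuming the `deepseq` package (the helper name here is made up and is not part of the attached test case):

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Force the action's result to normal form before the environment is handed
-- back to the cache, so that no unevaluated thunks still referencing it escape.
runForced :: NFData a => (env -> IO a) -> env -> IO a
runForced action env = action env >>= evaluate . force
```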
Attachments (1)
Change History (28)
Changed 20 months ago by
Attachment: | O2failurereplication.tar added |
---|
comment:1 Changed 20 months ago by
Oh dear, this looks quite bad indeed. Thanks for providing such a nice reproducer!
`gdb` places the crash in:

```
Program received signal SIGSEGV, Segmentation fault.
stg_takeMVarzh () at rts/PrimOps.cmm:1483
1483    rts/PrimOps.cmm: No such file or directory.
(gdb) bt
#0  stg_takeMVarzh () at rts/PrimOps.cmm:1483
#1  0x000000000064dc88 in base_GHCziIOziHandleziInternals_zdwdozuoperation_info () at libraries/base/GHC/IO/Handle/Internals.hs:163
#2  0x0000000000646ce0 in sb1n_info () at libraries/base/GHC/IO/Handle.hs:470
#3  0x0000000000407378 in s6zq_info () at ./Bracket.hs:105
#4  0x0000000000613e58 in s3E1_info () at libraries/base/Control/Exception/Base.hs:222
#5  0x00000000006e75f0 in ?? ()
#6  0x000000000040a328 in sdbc_info () at main.hs:21
#7  0x0000000000454680 in sbFu_info ()
#8  0x0000000000000000 in ?? ()
```
The disassembly of the top frame is,
```
(gdb) disassemble
Dump of assembler code for function stg_takeMVarzh:
   0x00000000006ebd70 <+0>:     mov    (%rbx),%rax
   0x00000000006ebd73 <+3>:     cmpq   $0x94f088,0x18(%rbx)
   0x00000000006ebd7b <+11>:    je     0x6ebe97 <stg_takeMVarzh+295>
   0x00000000006ebd81 <+17>:    mov    %rbx,%rcx
   0x00000000006ebd84 <+20>:    mov    0x18(%rbx),%rdx
   0x00000000006ebd88 <+24>:    mov    0x8(%rbx),%rbx
   0x00000000006ebd8c <+28>:    cmp    $0x94f088,%rbx
   0x00000000006ebd93 <+35>:    je     0x6ebe56 <stg_takeMVarzh+230>
=> 0x00000000006ebd99 <+41>:    cmpq   $0x6ecf30,(%rbx)
   0x00000000006ebda0 <+48>:    je     0x6ebe4d <stg_takeMVarzh+221>
   ...
```
Here 0x6ecf30 is apparently `stg_IND_info`. This likely means that we are approximately here in `stg_takeMVarzh`:
```
    ...
loop:
    if (q == stg_END_TSO_QUEUE_closure) {
        /* No further putMVars, MVar is now empty */
        StgMVar_value(mvar) = stg_END_TSO_QUEUE_closure;
        // If the MVar is not already dirty, then we don't need to make
        // it dirty, as it is empty with nothing blocking on it.
        unlockClosure(mvar, info);
        return (val);
    }
    if (StgHeader_info(q) == stg_IND_info ||   /* <- perhaps here */
        StgHeader_info(q) == stg_MSG_NULL_info)
    {
        q = StgInd_indirectee(q);
        goto loop;
    }
    ...
```
The value of `$rbx` looks quite crazy:

```
(gdb) print $rbx
$1 = 5283285312859013268
```
comment:2 Changed 20 months ago by
Well, this may be odd. When `stg_takeMVarzh` is called, `$rax` contains a pointer to a `CATCH_RETRY_FRAME`. I've hiked more miles than I can count on my fingers today, so I'll need to drop this here and get some sleep. However, I'll be back.
comment:3 Changed 20 months ago by
One last observation: it looks like this `CATCH_RETRY_FRAME` was (perhaps as expected) constructed by `stg_catchRetryzh`:
```
(rr) bt
#0  0x00000000006eb879 in stg_catchRetryzh () at rts/PrimOps.cmm:1245
#1  0x00000000006eb408 in ?? () at rts/PrimOps.cmm:990
#2  0x00000000004073a0 in s6zq_info () at ./Bracket.hs:105
#3  0x0000000000613e58 in s3E1_info () at libraries/base/Control/Exception/Base.hs:222
#4  0x00000000006e75f0 in ?? ()
#5  0x000000000040a328 in sdbc_info () at main.hs:21
#6  0x0000000000454680 in sbFu_info ()
#7  0x0000000000000000 in ?? ()
```
comment:4 Changed 20 months ago by
The code you point to implements `MVar`s. The example code doesn't seem to handle any of those directly. The only `MVar`s I see in the vicinity are in `async`, and there don't seem to be any arrays around there to index improperly. Specifically, `forConcurrently_` (used in main.hs) is ultimately defined in terms of `concurrently`, which uses an `MVar`. But that doesn't bring any potentially dangerous vector indexes close enough to make those look like good candidates for the problem.
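For reference, a heavily simplified sketch of that pattern. This is illustrative only and is not the actual `async` source, which also handles exceptions and cancellation:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Fork both actions and block on an MVar for each result.  This is why an
-- MVar (but no vector indexing) sits underneath forConcurrently_.
concurrentlySketch :: IO a -> IO b -> IO (a, b)
concurrentlySketch left right = do
  leftVar  <- newEmptyMVar
  rightVar <- newEmptyMVar
  _ <- forkIO (left  >>= putMVar leftVar)
  _ <- forkIO (right >>= putMVar rightVar)
  (,) <$> takeMVar leftVar <*> takeMVar rightVar
```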
comment:5 Changed 20 months ago by
> The code you point to implements MVars. The example code doesn't seem to handle any of those directly.
Look again at the stack trace. The `MVar`s involved are those in GHC's `Handle` implementation.
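For context, a `Handle` in base is essentially an `MVar` around the handle state, which is why `takeMVar` shows up under `GHC.IO.Handle.Internals` in the backtrace. The sketch below reproduces the shape approximately from memory; `Handle__` is a stand-in here, not the real record:

```haskell
import Control.Concurrent.MVar (MVar)

-- Stand-in for the real GHC.IO.Handle.Types.Handle__ record (device, buffers,
-- encoders, ...); the details do not matter here.
data Handle__ = Handle__

-- Approximate shape of GHC.IO.Handle.Types.Handle: every Handle operation
-- takes one of these MVars before touching the Handle__ state, which is why
-- takeMVar appears under GHC.IO.Handle.Internals in the backtrace above.
data Handle
  = FileHandle   FilePath !(MVar Handle__)
  | DuplexHandle FilePath !(MVar Handle__) !(MVar Handle__)
```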
comment:6 Changed 20 months ago by
The crash appears to happen in `hsIsClosed` called from `isDead`; `killer` is never run. Specifically, my instrumented testcase produces the following fairly reliably (sometimes there are more `isDead` invocations before the segfault):
```
$ ./main
maker
maker
maker
maker
maker
isDead
Segmentation fault
```
comment:7 Changed 20 months ago by
Here is my analysis of `stg_takeMVarzh` for future reference:
```
stg_takeMVarzh:                              // $rbx = mvar
<+0>:  mov    (%rbx),%rax                    // $rax = info = GET_INFO(mvar)
<+3>:  cmpq   $0x94f128,0x18(%rbx)           // if (StgMVar_value(mvar) == stg_END_TSO_QUEUE_closure)
<+11>: je     0x6ec277 <stg_takeMVarzh+295>  // { jump; } (not taken)
<+17>: mov    %rbx,%rcx
<+20>: mov    0x18(%rbx),%rdx                // $rdx = val = StgMVar_value(mvar)
<+24>: mov    0x8(%rbx),%rbx                 // $rbx = q = StgMVar_head(mvar)
<+28>: cmp    $0x94f128,%rbx                 // if (q == stg_END_TSO_QUEUE_closure)
<+35>: je     0x6ec236 <stg_takeMVarzh+230>  // { jump; } (not taken)
<+41>: cmpq   $0x6ed310,(%rbx)               // if (StgHeader_info(q) == stg_IND_info)
                                             //    ^ crashes here
<+48>: je     0x6ec22d <stg_takeMVarzh+221>  // { jump; }
<+54>: cmpq   $0x6ed938,(%rbx)               // if (StgHeader_info(q) == stg_MSG_NULL_info)
```
comment:8 Changed 20 months ago by
Something slightly suspicious that I hadn't noticed earlier: compiling `Bracket` with `-dstg-lint` seems to loop the compiler (or hits a very large stack explosion).
comment:9 Changed 20 months ago by
Unfortunately the STG lint issue is due to a bug in the linter itself. See #13941 and its fix, Phab:D3714.
comment:10 Changed 20 months ago by
So to recap, `takeMVar` is getting passed a pointer that does not appear to point to an `MVar` closure (it is in fact a blackhole'd cons cell, which was constructed by `GHC.Base.++`). There are two possible explanations for this:

1. the closure was once an `MVar` but was overwritten by someone since it was allocated (implying that whatever code overwrote the closure is at fault);
2. the pointer is invalid (implying that some parent in the call graph may be at fault, or they took the pointer from another closure which itself was trampled on by someone else).
I believe I've ruled out (1): watching the closure's memory suggests that there never was an `MVar` closure at this address.
comment:11 Changed 20 months ago by
Indeed, a `FUN` closure is being passed to `GHC.IO.Handle.hIsClosed`, which expects a `Handle`. `hIsClosed` then later calls `GHC.IO.Handle.Internals.do_operation`, which in turn calls `takeMVar`.
comment:12 Changed 20 months ago by
Milestone: | → 8.2.2 |
---|---|
Priority: | normal → highest |
Indeed, the `Handle` argument passed to the `killer` lambda is also a cons cell.
comment:13 Changed 20 months ago by
Cc: | jmct added |
---|
comment:14 Changed 19 months ago by
For the record, I now have a simplification of the testcase that produces an "MVar object entered!" assertion instead of a segmentation fault.
It seems that the reason is that `readMVar` is getting passed an `MVAR_DIRTY` closure instead of an `MVar` `CONSTR`. The pointer it is given is tagged with tag 0; consequently it tries to enter the closure, resulting in the assertion.
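For context, the distinction at play is the one between the lifted `MVar` constructor and the underlying primitive. The definitions below reproduce, approximately from memory, the familiar ones from `GHC.MVar` in base; a correctly tagged pointer to the `CONSTR` is never entered, whereas a tag-0 pointer to a non-constructor closure (here `MVAR_DIRTY`) gets entered, tripping the RTS assertion.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (MVar#, RealWorld, readMVar#)
import GHC.IO (IO (..))

-- Approximately the definitions from GHC.MVar in base: the lifted MVar is a
-- plain data constructor (a CONSTR closure) wrapping the unlifted MVar#.
data MVar a = MVar (MVar# RealWorld a)

-- readMVar pattern-matches on that constructor; if the pointer it receives is
-- untagged and does not point to a constructor, the generated code enters it.
readMVar :: MVar a -> IO a
readMVar (MVar m#) = IO (\s -> readMVar# m# s)
```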
comment:15 Changed 19 months ago by
So it looks like when we are in `limitMakeEnv.go` we find that `eenvpermission` is the `MVar ()` instead of an `Either (MVar ()) ()`. Consequently, when we extract the field while performing the case analysis on it, we extract the `MVar#` instead of an `MVar ()` as the code expects. This is how we end up passing an `MVAR_DIRTY` to `readMVar`.
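In other words, the consuming code looks something like the following hypothetical reconstruction (the real code is in Bracket.hs from the attachment). Under the miscompilation, `eenvpermission` is already the bare `MVar ()`, so the `Left` field that gets extracted is the underlying `MVar#`.

```haskell
import Control.Concurrent.MVar (MVar, readMVar)

-- Hypothetical reconstruction following the names used above; the real code
-- is in Bracket.hs from the attachment.
waitForPermission :: Either (MVar ()) () -> IO ()
waitForPermission eenvpermission =
  case eenvpermission of
    Left sem -> readMVar sem  -- the field extracted here must be a real MVar ()
    Right () -> pure ()
```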
It looks like we end up in this situation due to miscompilation. The STG has this in the body of `limitMakeEnv.go`:
```
let sat_s6tv [Occ=Once]
      :: GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld, Data.Either.Either env_a3xy () #)
    [LclId] =
        \r [void_0E]
            src<Bracket.hs:99:40-68>
            src<Bracket.hs:92:34-39>
            src<Bracket.hs:62:47-52>
            Bracket.$wtakeEnv ww1_s6tk ww2_s6tl ww3_s6tm GHC.Prim.void#;
} in ...
```
where,
```
Bracket.$wtakeEnv [InlPrag=[0]]
  :: forall env.
     GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Array# (Control.Concurrent.STM.TMVar.TMVar env)
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, env #)
```
Clearly this is bogus: `$wtakeEnv` returns an `env` whereas `sat_s6tv` is supposed to be returning an `Either`. Presumably stg-lint didn't catch this due to its poor sensitivity to type errors.
comment:16 Changed 19 months ago by
It turns out the simplified Core output for `s6tv` is even crazier:
```
let sat_s6tv [Occ=Once]
      :: GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld, Data.Either.Either env_a3xy () #)
    [LclId]
    sat_s6tv =
      \ (s1_s6tt [Occ=Once, OS=OneShot] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
        src<Bracket.hs:99:40-68>
        src<Bracket.hs:92:34-39>
        src<Bracket.hs:62:47-52>
        case Bracket.$wtakeEnv @ env_a3xy ww1_s6tk ww2_s6tl ww3_s6tm s1_s6tt of {}
} in
```
It seems that somehow the strictness analyser has concluded that `takeEnv` diverges, as its strictness signature is:
```
takeEnv_s5fz
  :: forall env.
     Vector (TMVar env)
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, env #)
[LclId,
 Arity=2,
 Str=<B,1*U(U,U,U)><B,U>b,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [20 0] 126 0}]
```
comment:17 Changed 19 months ago by
Here is a sketch of the structure of `takeEnv` along with strictness signatures:
```
takeEnv =
  \ (@ env_a3hT)
    (x_X3Lx [Dmd=<B,1*U(U,U,U)>] :: Vector (TMVar env_a3hT))
    (eta_B1 [Dmd=<B,U>] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case x_X3Lx of {
      Data.Vector.Vector ipv_s4Q7 [Dmd=<B,U>]
                         ipv_s4Q8 [Dmd=<B,U>]
                         ipv_s4Q9 [Dmd=<B,U>] ->
        joinrec {
          foldlM_loop_a45q [Occ=LoopBreaker]
            :: GHC.Types.SPEC
               -> STM env_a3hT
               -> Int
               -> (# GHC.Prim.State# GHC.Prim.RealWorld, env_a3hT #)
          foldlM_loop_a45q (ds4_a45s [Dmd=<S,1*U>] :: GHC.Types.SPEC)
                           (z1_a45t [Dmd=<C(S),1*C1(U(U,U))>] :: STM env_a3hT)
                           (s1_a45u [Dmd=<S(S),1*U(U)>] :: Int)
            = ...
        } in
          jump foldlM_loop_a45q
            GHC.Types.SPEC
            (((src<Bracket.hs:119:32-36> GHC.Prim.retry#) @ env_a3hT)
             `cast` (Sym (GHC.Conc.Sync.N:STM[0] <env_a3hT>_R)
                     :: ((GHC.Prim.State# GHC.Prim.RealWorld
                          -> (# GHC.Prim.State# GHC.Prim.RealWorld, env_a3hT #)) :: *)
                        ~R# (STM env_a3hT :: *)))
            (GHC.Types.I# 0#)
    }
```
Note how the analyzer somehow assigns hyperstrict demands to all of the binders bound by the case analysis of the `Vector`, despite the fact that the whole thing consists of nothing more than a call to `foldlM_loop_a45q`, which does not itself diverge.
comment:18 Changed 19 months ago by
I think I see: `foldlM_loop_a45q` places a `C(S)` demand on its second argument, and it is called with this argument being `retry#`. `retry#` has a divergent demand result (namely `Demand.botRes`). This is likely how the analyzer concludes that `takeEnv` will diverge.
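For intuition, the STG above is consistent with `takeEnv` having roughly the following shape. This is a hedged reconstruction, not necessarily the exact source: the fold's accumulator is an `STM env` action seeded with `retry`, so the analyser sees `retry#`'s bottoming result in a strict call position.

```haskell
import Control.Concurrent.STM (STM, TMVar, orElse, retry, takeTMVar)
import qualified Data.Vector as V

-- Plausible shape of Bracket.takeEnv: try each TMVar in turn, retrying the
-- whole transaction if none of them is currently full.  The accumulator of
-- the fold is an STM action seeded with retry.
takeEnvSketch :: V.Vector (TMVar env) -> STM env
takeEnvSketch = V.foldl (\acc tv -> acc `orElse` takeTMVar tv) retry
```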
comment:19 Changed 19 months ago by
Related Tickets: | → #8091 |
---|
It looks like this regressed with the solution to #8091. See ticket:8091#comment:7 for an explanation of what happened here.
comment:20 Changed 19 months ago by
So after taking a bit a break I came back to this and realized that ticket:8091#comment:7 isn't quite right for the reason articulated in ticket:8091#comment:9. Instead I think the problem is that ExnStr
demands aren't being pushed outwards when they should be. I've opened #13977 to explore this possibility.
comment:21 Changed 19 months ago by
For the record, as far as short-term solutions for 8.2.1 are concerned, there are two possible workarounds that seem to resolve the issue:

- weakening `retry#`'s result demand to `topRes`
- weakening `catchRetry#`'s first-argument demand to `lazyApply1Dmd`

I suspect that neither of these is the right solution; in the end we should try to fix #13977 (unless, of course, I've horribly misunderstood something).
comment:22 Changed 19 months ago by
Differential Rev(s): | → Phab:D3756 |
---|---|
Status: | new → patch |
Phab:D3756 is, I believe, the principled solution to #13977 and consequently to this bug.
comment:23 Changed 19 months ago by
Merged to master with c940e3b92f4527ca59fcae93f36c869de3e7ccb9:
```
dmdAnal: Ensure that ExnStr flag isn't dropped inappropriately

This fixes #13977 and consequently #13615. Here an optimization in the
demand analyser was too liberal, causing us to drop the ExnStr flag and
consequently resulting in incorrect demand signatures.

This manifested as a segmentation fault in #13615 as we incorrectly
concluded that an application of catchRetry# would bottom. Specifically,
we had

    orElse' :: STM a -> STM a -> STM a
    orElse' x = catchRetry# x y
      where y = {- some action -}

Where the catchRetry# primop places a demand of <xC(S),1*C1(U)> on its
first argument. However, due to #13977 the demand analyser would assign
a demand of <C(S),1*C1(U)> on the first argument of orElse'. Note the
missing `x`. This resulted in

    case orElse' bottomingAction anotherAction of { x -> Just x }

being transformed to

    case orElse' bottomingAction anotherAction of {}

by the simplifier. This would naturally blow up when orElse' returned at
runtime, causing the segmentation fault described in #13615.

Test Plan: Validate, perhaps add a testcase

Reviewers: austin, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie

GHC Trac Issues: #13977, #13615

Differential Revision: https://phabricator.haskell.org/D3756
```
comment:24 Changed 19 months ago by
Milestone: | 8.2.2 → 8.2.1 |
---|---|
Resolution: | → fixed |
Status: | patch → closed |
Merged to `ghc-8.2` with 26f839f457ce1bf1a940a12e45c8137621ce1378.
(Attachment description: 4 programs, one of which (main.hs) fails when compiled with optimizations.)