I've been looking into this, here's what I found out so far:
Because we start evaluating foo before hitting the breakpoint, by the time the we return to the GHCi prompt foo points to a blackhole.
Once we stop at the breakpoint we do :print foo, pprintClosureCommand calls bindSuspensions with the id foo.
bindSuspensions invents a new name _t1 and binds it to the thunk that is foo, via RtClosureInspect.cvObtainTerm.
cvObtainTerm looks at the heap object pointed to by foo, which is a blackhole, and follows the indirectee pointer. It turns out the indirectee is a TSO object. At this point _t1 becomes bound to a TSO object, and evaluating it (e.g. with print _t1) causes this crash because TSO objects can't be entered.
I tried modifying cvObtainTerm so that it doesn't follow the indirectee pointer when it sees a blackhole. That way we bind _t1 to the blackhole object instead of the TSO object pointed by the indirectee field, but that caused a deadlock in the scheduler. I don't understand why yet.
simonmar, could you advise? Does the story make sense so far?
Simon, I implemented changes in cvObtainTerm as discussed yesterday, but I'm still getting "TSO object entered" errors.
Previously cvObtainTerm follwed indirectee's of BLACKHOLEs no matter what.
With my changes I only follow the indirectees when they're not TSO or BLOCKING_QUEUE. Somehow with this I still get "TSO object entered".
If I don't follow BLACKHOLE indirectees at all (and bind a BLACKHOLE to _t1 in this reproducer) then I get a deadlock as expected. Do you have any ideas on why this may be happening?
I simplified example. whnf function is not required to trigger this bug. You can also notice that context of execution (i.e. [main.hs:2:7-11]) prints twice which looks like
a bug and probably deserves a ticket.
I created a differential with an implementation of the idea in ticket:8316#comment:150788. The patch seems to do the right thing but there's probably another bug in somewhere else so I'm still getting "TSO entered" errors.
Ah, looking at this ticket again, I can see what I missed last time (in D4535).
The problem is D4535 does not have any effect because TSO and BLOCKING_QUEUE are already not handled by cvObtainTerm.go and cvObtainTerm returns a Suspension when it finds one of those objects. So even if we follow a BLACKHOLE that points to a TSO we return a Suspension. In D4535 we returned Suspension slightly earlier (before following the indirectee), but the value we returned was identical to the value we returned without the patch.
What we should do is if we see a BLACKHOLE pointing to an TSO or BLOCKING_QUEUE we should return a Suspension with the BLACKHOLE itself as the hval (currently: hval is the indirectee).
However I suspect entering the BLACKHOLE will result in a deadlock because the thread that's supposed to evaluate the expression (i.e. the owner) is blocked on an MVar (the breakpoint MVar passed to GHCi.Run.withBreakAction) and when we enter the BLACKHOLE our thread gets parked, to be unparked by the owner of the BLACKHOLE, which never happens as we don't update the MVar before entering the BLACKHOLE.
I don't see a good way to solve this. The thread that is evaluiating foo is stopped at a breakpoint - that's what the user asked for, so it's not entirely surprising that if they evaluate something that requires foo then it deadlocks.
What would we like to happen? I can think of a couple of alternatives:
*1. Just make it work**
Should it automatically continue evaluation of foo? How would you know when to do that? Evaluating a BLACKHOLE doesn't necessarily mean that we're about to deadlock, we might be evaluating something that another thread is evaluating. As soon as we release the breakMVar the thread will continue evaluating foo, but I don't know of a way to tell whether/when we should do that.
Perhaps instead of the MVar, a breakpoint should be an asynchronous exception so that we end up with a thunk that we could poke to continue evaluation? That would make this work, but it would mean a big change to the way breakpoints work and I'm not sure whether it would run into other problems. One potential problem is that it's a lot more expensive than the current breakpoint mechanism, so :trace wouldn't work so well.
*2. Make it an error of some kind**
e.g.
[main.hs:2:7-11] *Main> _t1*** Exception: blocked on breakpoint 1
The question is how to achieve that. Perhaps we periodically monitor the thread we just created to do the evaluation and check whether it's blocked on a blackhole, and then compare the owner of the blackhole it is blocked on against all the threads we know are currently at breakpoints? That could possibly work, but it's tricky to implement.
Here's another example that deadlocks even with GHC 8.6:
foo = 0 : barbar = 1 : foo
in GHCi:
GHCi, version 8.6.1: http://www.haskell.org/ghc/ :? for help:Loaded GHCi configuration from /home/omer/rcbackup/.ghci[1 of 1] Compiling Main ( test.hs, interpreted )Ok, one module loaded.λ:1> :break fooBreakpoint 0 activated at test.hs:1:7-13λ:2> fooStopped in Main.foo, test.hs:1:7-13_result :: [Integer] = _[test.hs:1:7-13] λ:3> :print barbar = (_t1::[Integer])[test.hs:1:7-13] λ:4> _t1[1
The reason why we don't get "TSO entered" here is because _t1 stands for bar, and bar itself is not locked by the evaluator thread. Instead an object in bar's payload is owned.
I think this shows that even if we could somehow release the MVar in the original reproducer there will be deadlocks.
I think we should:
Merge the patch. We should never enter a TSO or BLOCKING_QUEUE.
Document this behavior in the user manual
Disallow evaluating BLACKHOLEs in GHCi
The last step would fix the original reproducer, but my example above will still deadlock and that's what you get for having lazy evaluation.
Ömer Sinan Ağacanchanged title from GHCi debugger segfaults when trying force a certain variable to GHCi debugger panics when trying force a certain variable
changed title from GHCi debugger segfaults when trying force a certain variable to GHCi debugger panics when trying force a certain variable
I left a comment drawing attention to this on the D5179 before merging but yes, you are right, we should wait to close this until all of these tasks are done.
The panic has been fixed, rest of ticket:8316#comment:160624 still need to be implemented, but they're not as urgent. Removing the milestone and marking the ticket as 'newcomer'.