Opened 3 years ago

Closed 3 years ago

#10218 closed bug (fixed)

GHC creates incorrect code which throws <<loop>>

Reported by: yongqli Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler Version: 7.10.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: strianal/should_run/T10218
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

My co-worker and I have spent almost two weeks tracking down this bug. We have finally produced a reasonably small test case... please take a look: https://github.com/yongqli/ghctestcase

Under certain circumstances GHC generates incorrect code which goes into <<loop>>.

When you compile and run the project without profiling and with eager black-holing, it will throw <<loop>> and exit.

If you compile either without eager black-holing or with profiling, it will run correctly.

See setup.hs for further details.

This bug can be delicate to trigger, inlining certain things will prevent it from occurring. However, we've encountered this in production, and it is much harder to work-around in an actual project. We've had to resort to inlining every function in the affected module, which slows down compilation considerably.

We've found that this bug triggers on 7.10.1 and 7.8.4. The test case will only compile on 7.10.1.

Please let me know if there are any questions.

Attachments (3)

Main.hs (1.4 KB) - added by AlexET 3 years ago.
Smaller Example
Main.2.hs (627 bytes) - added by shachaf 3 years ago.
Simplified example
Main.3.hs (815 bytes) - added by AlexET 3 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 3 years ago by jstolarek

I can confirm this happens on my 64bit Debian Wheezy with GHC 7.10.1.

Changed 3 years ago by AlexET

Attachment: Main.hs added

Smaller Example

comment:2 Changed 3 years ago by AlexET

I have managed to reproduce on windows 64-bit on GHC-7.10.1. I have shrunk the example down. It requires linear, vector and transformers. It needs to be complied with -O1 and -feager-blackholes.

The redundant constraint in the context of the type of guessStates is required for the bug to trigger. Inlining calc_zs also prevents the bug from triggering. Adding sharing between xs and ys in guessStates also prevents the bug. Changing the monad to the Identity monad prevents the bug. Removing the VectorSpace space class and just adding the constraints to LinAlg also prevents the bug.

It is also clear that this should example should terminate.

Last edited 3 years ago by AlexET (previous) (diff)

Changed 3 years ago by shachaf

Attachment: Main.2.hs added

Simplified example

comment:3 Changed 3 years ago by shachaf

I simplified AlexET's example some more. This bug is very touchy.

comment:4 Changed 3 years ago by nomeata

Did you systematically try enabling/disabling the various -fsomething options?

comment:5 Changed 3 years ago by AlexET

I have added a possibly slightly simpler version.

The difference between the working and not working cases is now really small.

I think the simplifier core is mostly equal (modulo alpha). However the demand analysis is different. The core-prep also has a case in one which is a let in the other. The stg differ similarly. The cmm differs more (but the difference between a case and a let would cause a larger difference there).

In terms of flags currently I know that specialise ,funbox-strict-fields and static-argument-transformation are uneeded. cse, strictness are needed. I will try more later.

Last edited 3 years ago by AlexET (previous) (diff)

Changed 3 years ago by AlexET

Attachment: Main.3.hs added

comment:6 Changed 3 years ago by AlexET

Looking at the difference in usage demands it seems that there is a variable (specifically a dictionary for A) with demand <L,1*U(A,1*U)>. But it is used twice, specifically its first component (a Foldable dictionary) is also used.

This demand was correct in the dump-stranal file but then later the first component is used (due to cse) in a later section. This means the usage demand is incorrect. In the working version it is used in both locations already during demand analysis and hence is given the correct usage information.

I don't know if this is the cause or whether the demand analysis information is irrelevant after cse. But if the demand analysis is used after cse then cse should make sure sure it is correct.

I have tried making a smaller example but I am finding it hard to get cse to fire (most examples are inlined first before cse can fire but cse wont fire if they are marked noinline).

comment:7 Changed 3 years ago by thoughtpolice

Milestone: 7.10.2
Priority: normalhighest

comment:8 Changed 3 years ago by simonpj

Milestone: 7.10.2
Priority: highestnormal

Could you try -flate-dmd-anal (7.10 only)? Thanks!

comment:9 Changed 3 years ago by AlexET

Running late demand analysis does seem to fix this. I am now pretty sure this is a problem with cse and demand. There seems to be two ways to fix it. Either drop demand information on variables involved in cse or somehow infer the correct demand information.

comment:10 Changed 3 years ago by simonpj

OK good. Can you also try -fno-cse. I guess that too will fix it.

I would still really like to know how the incorrect demand info leads to a loop. I still don't understand that. I wonder if adding some trace calls might help show what is going on, without making the bug disappear?

comment:11 Changed 3 years ago by simonpj

OK I think I have it.

  • There is a dictionary thunk thus (in STG syntax):
            let {
              $dA_s7fS [Dmd=<L,1*U(A,1*U)>] :: Main.A t_a2qQ
              [LclId, Str=DmdType] =
                  \s srt:SRT:[] [] Main.$p1C $dC_s7fP; } in
    
    Note that it claims to be single-entry. But in fact it is used twice, due to CSE.
  • Because it is single-entry, it is not updated when evaluation is complete.
  • But, with eager blackholing, it is blackholed on entry, I think in an effort to kill off any space leak.
  • So the second use finds the black hole, and reports <<loop>>.

I was very puzzled because <<loop>> usually only arises from a letrec that is too strict, like rec { x = 1+x }. But the single-entry thing explains it.

So, now we understand why Alex's guess is right: the CSE pass is failing to update (or zap) the demand info for the shared things, and that causes something to be single-entry when it should not be.

Next: fix it.

comment:12 Changed 3 years ago by simonpj

OK here is a nice small test case, which does not (unlike the test above) depend on lens!

module Main where

{-# NOINLINE foo #-}
foo :: Bool -> Int -> Int -> Int
foo True  _ x = 1
foo False _ x = x+1

{-# NOINLINE bar #-}
bar :: Int -> (Int,Int)
bar x = let y1 = x * 2
            y2 = x * 2
        in (foo False y1 y2,foo False y2 y1)

main = print (fst p + snd p)
  where
    p = bar 3

Compile with -O -feager-blackholing and you get <<loop>. Add -fno-cse or -flate-dmd-anal restores correct behaviour.

Points to note

  • foo uses its second argument zero times, and its third argument exactly once.
  • So the two calls to foo in bar use y1 exactly once and y2 exactly once.
  • But when y1 and y2 are CSE'd, the usage goes up to twice; and that is the problem.

I'm validating a simple fix in CSE, which zaps the demand-info on binders which are potentially shared. It's a bit brutal. But another run of the demand analyser (which has other advantages) restores everything again. I'm validating now; will commit next week.

Really sorry to have taken two weeks of your time to find this bug. But at least your efforts will be rewarded by a real fix.

Simon

comment:13 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In d261d4cbcc867405f71d7c9580628f52978e2267/ghc:

Zap usage info in CSE (Trac #10218)

Trac #10218 reports a subtle bug that turned out to be:

- CSE invalidated the usage information computed
  by earlier demand analysis, by increasing sharing

- that made a single-entry thunk into a multi-entry thunk

- and with -feager-blackholing, that led to <<loop>>

The patch fixes it by making the CSE pass zap usage information for
let-bound identifiers.   It can be restored by -flate-dmd-anal.

(But making -flate-dmd-anal the default needs some careful work;
see Trac #7782.)

comment:14 Changed 3 years ago by simonpj

Status: newmerge
Test Case: yesstrianal/should_run/T10218

Right I've fixed this. Thank you for identifying it. I'm really sorry it cost you so much to track down.

Simon

comment:15 Changed 3 years ago by simonpj

Milestone: 7.10.2

Merge to 7.10.2

comment:16 Changed 3 years ago by thoughtpolice

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.