Memory leak caused by nested monadic loops

changed weight to 5

Floating out return () won't help much. Even if you do that, the program looks like

worker = let l = poll ()
         in forever l

r = return ()
poll a = r >> poll a

so executing l will cause l to be evaluated as r >> r >> r >> r >> ... and we still have a live reference to l from within forever.

If you instead define

poll a = r >>= \_ -> poll a

and compile without optimization then l now looks like r >>= (\_ -> ...) and the space leak is gone. But turning on optimizations seems to cause GHC to rewrite the above to

poll a = r >>= (let s = poll a in \_ -> s)

which now has the original problem again.

(BTW I am always compiling with -fno-state-hack, since that just adds another layer of confusion to an already confusing situation.)

I also wanted to see what difference the state hack makes, but couldn't find -fno-stack-hack in ghc 8 user's guide. I thought that perhaps it was replaced by a less hacky optimization in ghc 8.

changed milestone to %8.2.1

Trac metadata

Trac field	Value
TypeOfFailure	OtherFailure → RuntimePerformance

added runtime perf label

Reid: you're probably right and the return () is a red herring. I guess I still don't completely understand how this works.

The original program leaks memory even without optimization. In this case, the trouble is that

poll () = return () >> return () >> ...

forever always holds on to its argument, which prevents any part of poll () from being collected as it's run. You can fix this (with or without optimization) by defining

poll _ = fix (return () >>)

which makes poll () finite in size.

While you and I both know that forever doesn't need to hold on to poll () in this case, I don't see how GHC can be expected to see that.

I missed something. It seems that {-# NOINLINE #-} here is effectively preventing poll from being specialized within its own RHS. While the constant argument ends up going away, the constant Monad dictionary does not. Writing

poll :: (Monad m) => x -> m a
poll = go where
  go action = do
    return ()
    go action

fixes the leak. So maybe something needs to be changed to allow internal specialization in the face of NOINLINE? Of course, that could have some other downside.

By the way, forever isn't needed, you can replace it with (\x -> x >> x) in Feuerbach's program. In the SO question the user noticed that the problem is solved by defining poll a b without explicit recursive use of poll a b.

changed title from Constant values are not floated out of the loop to Memory leak caused by nested monadic loops

I think I now understand what's going on here. (Basically, what Reid said.) I summarized my thoughts in this blog post: https://ro-che.info/articles/2017-01-10-nested-loop-space-leak

I don't know what ghc can do here, but hopefully there is something.

In the blog post you write

Let’s change (“eta-expand”) the poll code as if then had arity 4, without actually changing then or thenIO or their runtime arities:

and that this fixes the space leak. Isn’t this eta-expansion exactly what the state hack is about? So why does it not not work here? Ah, becuase poll is not at type IO a but rather for an arbitrary Monad… yeah, then it’s harder.

I thought that the state hack doesn't change the arity of functions, just marks them single-entry.

Also, I don't think ghc needs the state hack to figure out the correct arity (in the non-polymorphic case), since it can see through the newtypes.

I thought that the state hack doesn't change the arity of functions, just marks them single-entry.

That’s how it works, but the ultimate goal of the state hack (the way I see it) is to eta-expand them. Usually this works by treating a State#-typed lambda as one-shot:

foo = bar x >> baz y

will, after inlining >> (if that happens, but it usually does for IO) look like this:

foo =
  let a = bar x
      b = baz y
  in \s0 -> case a s0 of (_,s1) -> b s1

which is usually bad because of the allocations of a and b. (But I say “usually” because it is good if bar x is expensive and foo is used many times – these are the instances when people complain about the state hack removing this sharing).

Now because s0 is treated as one-shot, two optimizations are enabled, both of which have ultimately the same effect:

The inliner feels empowered to inline a and b into the lambda, because it is one shot. Then we obtain

foo = \s0 -> case bar x s0 of (_,s1) -> baz y s1

which we want.

The arity analysis uses the one-shot information on the lambda to determine that foo has arity 1 and eta-expands it:

foo = \s -> (let a = bar x b = baz y in \s0 -> case a s0 of (_,s1) -> b s1) s

which then gets simplified to

foo = \s -> case bar x s of (_,s1) -> baz y s1

again.

I once had an idea for a less intrusive state hack variant which would do this eta-expansion directly (but otherwise do not meddle with the one-shotness of State#-typed lambdas, which can be wrong), but did not follow through with it. See ticket:9388#comment:96531 for some background.

But this is all State#-specific, while your bug is about abstractly monad code, where the compiler may now use knowledge about IO’s bind, right?

Suppose it had been

worker :: (Monad m) => m ()
worker = forever $ poll 1

poll :: (Monad m) => Int -> m a
poll n
 | n>10000    = return ()
 | otherwise = do print (expensive n)
                  poll (n+1)

where expensive is expensive to compute. After printing out the results of 10000 calls to expensive, the forever will do it all again.

Question: would you expect all the calls to expensive to be recomputed? Presumably not. poll builds a big action

print (expensive 0) >> print (expensive 1) >> print (expensive 2) >> ...

and forever just repeatedly executes that action. But to remember all those results clearly takes O(n) space.

Now in this case there is no real work to be shared, but that's clearly harder for GHC to spot. Especially when (as in this case) the monad is unspecified, so perhaps the call return () does a tremendous amount of work.

I'm not saying things can't be improved here, but you are in delicate territory.

Simon: as a matter of fact, I would expect all the calls to expensive to be recomputed. I think most programmers would interpret the above code as

while (1) {
  for (n = 1; 1 <= 10000; n++) {
    print(expensive(n));
  }
}

If I wanted the calls to expensive to be shared, I would probably put them into a list. (Of course, a similar piece of code could be intended for m = []; but based on print, it looks like something intended for an IO-like monad.)

I do understand the tradeoffs involved, and I'm not saying that this should be obvious to the compiler. I'm just saying that this might not be the most convincing example where the current behavior is the one a programmer would expect.

What if a programmer could annotate variables (like worker or poll) as "function-like" vs "value-like"? "Function-like" means "I don't care about this value, feel free to recompute it" and "value-like" means "please cache it if possible".

If we write

foo = ...
  where
    bar = ...
      where
        baz = ...

do we intend for bar to be recomputed on each call to foo, or shared globally? Do we intend for baz to be recomputed on each call to bar, or on each call to foo, or shared globally? The compiler doesn't know. Worse, experienced Haskell developers have gotten used to the way GHC tends to float things around, so making things simpler and more predictable is likely to turn a lot of currently-good code bad. Maybe Joachim's oneShot can help in some places?

David:

If foo is marked as function-like by the programmer, bar and baz are recomputed on each invocation of foo. If foo is marked as value-like and is shared, bar and baz are shared, too. Furthermore, bar and baz could have their own annotations.

Examples:

{-# FUNCTION_LIKE fib #-}
fib = (fibs !!) where
  {- VALUE_LIKE fibs #-}
  fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

{-# VALUE_LIKE val #-}
val = sum $ map f [1..10^6] where
  {-# FUNCTION_LIKE f #-}
  f = (+5) . (^2)

Now, what would FUNCTION_LIKE mean, exactly? In the fib example above, there can be two interpretations:

Don't bother "remembering" the PAP (!!) fibs, but do remember something like fibs !! n.
Don't even remember fibs !! n.

The second interpretation is useful in cases like

mk_action :: Monad m => Int -> m ()
mk_action = flip replicateM_ (return ())

where we may want to say, "don't bother to remember either the expansion or the result of mk_action".

So, simple FUNCTION_LIKE can be ambiguous. What if we annotated functions with arities? Assigning an explicit arity to a function means that:

ghc attempts to perform eta reductions or expansions to match the declared arity;
ghc doesn't float things out of lambdas with declared arity >= 1.

E.g. following Reid's example (https://ghc.haskell.org/trac/ghc/timeline?from=2017-01-07T14%3A42%3A24Z&precision=second), he could write

poll a = r >>= f
  where
    {-# ARITY 1 f #-}
    f _ = poll a

and ghc would not transform f to (let s = poll a in \_ -> s).

And, of course, the original problem could be easily solved by annotating poll with the correct arity, as I point out in the blog post.

It is also possible that these two things — forcing eta-expansion and not floating out local bindings — should be two different and orthogonal pragmass. What do you all think?

So there are three issues here:

The programs under discussion leak space.
The space leak is hard to see and unintuitive. (I didn't understand what was going on until I saw Roman's answer on SO.)
You can't even eliminate the space leak in a reliable way.

Fixing 1 directly seems infeasible without breaking other programs, because of examples like Simon's in ticket:13080#comment:130393. Plus, the space leak does exist if you evaluate the program naively using lazy evaluation, despite 2. (And there are parallel examples that don't involve IO.) But we should be able to do something about 3, and it would be useful in many settings besides this one (such as benchmarking).

I like the idea that if the user wants FUNCTION_LIKE behavior, then they should write a function. It aligns well with the basic rule for sharing (an expression inside a lambda is shared for the lifetime of that call to the lambda) and we already know how to implement it (just compile with -O0). We just have to not stuff it up during optimization. This is basically the full laziness problem yet again.

The problem is of course that writing a function isn't sufficient because GHC will probably just float the body out. Besides that there's a second danger: we could inline and then beta-reduce. I don't want to write it out, but if you imagine inlining >>= and f in Roman's latest example, and then if you allow beta-reducing f, the lambda in f would disappear and then presumably nothing would stop GHC from floating poll a out of the lambda in >>=. It's hard to see how the inlining could be to blame here, so I blame the beta-reduction. (Then perhaps if we are never going to be allowed to beta-reduce, we can also not bother inlining. Not sure.)

So, it seems to me we need a new kind of Core lambda, or a flag on lambda, that means

don't float out of this,
don't beta reduce this.

How to give the user access to this is another question. I haven't thought about this ARITY suggestion much yet. Another possibility would be a magic pattern synonym _#, which matches anything but turns the lambda into a beta blocker. So then we'd write Roman's example as something like

import GHC.Exts (pattern _#)
poll a = r >>= f
  where
    f _# = poll a

See also #9520 (closed), #8457. #12656 is a bit different, but it is the ticket I was prompted to remember by Roman's SO post. #12620 has a proposed solution but I think tying the "no floating" annotation to a lambda rather than an expression within the lambda might be more robust and easier to understand.

So, it seems to me we need a new kind of Core lambda, or a flag on lambda, that means (a) don't float out of this, (b) don't beta reduce this.

We already have (a), namely one-shot lambdas. I don't understand why (b) could possibly be good.

Perhaps someone can give a from-scratch articulation of the problem and possible solutions, informed by this thread?

do we intend for bar to be recomputed on each call to foo, or shared globally?

The answer to this question depends heavily on what it means for “foo” to be called! Consider this:

foo :: Int -> IO ()
foo x = …
  where bar = expensive x

If we now run mapM (foo 1) [1.1000], then, in one sense, the function foo is called once (when passed x). This returns a value of type IO (), which is then executed 1000 times. This is the sense that the compiler understands, and without further hacks, bar would be evaluated only once here. Some users know and expect this.

But there is another sense where one thinks of a call to foo as the execution of the IO action produced by foo 1. This is probably how most users in most cases think about functions returning an IO something. . With the current implementation of IO (), this is when the state token is passed to the function wrapped in IO ().

The state hack is about eta-expanding foo so that these notions coincide. Unfortunately, and as far as I can tell, there is no way of writing foo to get this result directly (without breaking the IO abstraction barrier).

The same distinction works for other monads, of course: foo 1 might return a Parser (), and we have the distinction between calculating the parser, and applying it to some input. And, in extension, with an arbitrary Monad the distinction is even more evident.

So in this thread, we should be very precise which form of “calling” is the right one.

changed milestone to %8.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4

changed milestone to %8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

changed milestone to %8.8.1

This won't be fixed in 8.6.

I think I have been hit by this bug. Is the following going to leak memory due to this bug:

worker :: (AppMonad m) => TChan MyType -> m ()
worker chan = do
  mItem <- tryReadTChanIO chan
  case mItem of
    Just item -> do
       processItem item
       worker chan
    Nothing -> pure ()

Can it be fixed by using whileJust_ from http://hackage.haskell.org/package/monad-loops-0.4.3/docs/src/Control-Monad-Loops.html#whileJust_ ?

worker :: (AppMonad m) => TChan MyType -> m ()
worker chan = whileJust_ (tryReadTChanIO chan) processItem

changed milestone to %8.10.1

Bumping milestones of low-priority tickets.

added Pnormal label

removed milestone

I had the same space leak/memory leak problem in GHC 8.8.4. My workaround was a wrapper function that does the recursion. The inner function now returns a Bool indication whether the computation is done or not. ExpM a is Monad that is specified by the user of the library.

runResultData :: (ExperimentDef a) => Key Exp -> Maybe Int -> Int -> RepResultType -> ResultData a -> DB (ExpM a) (Updated, ResultData a)
runResultData !expId !maxSteps !len !repResType !resData = do
  (!done, !upd, !resData') <- runResultData' expId maxSteps len repResType resData
  if done
    then return (upd, resData')
    else runResultData expId maxSteps len repResType resData'
{-# NOINLINE runResultData #-}
{-# NOINLINE runResultData' #-} 

runResultData' :: (ExperimentDef a) => Key Exp -> Maybe Int -> Int -> RepResultType -> ResultData a -> DB (ExpM a) (Bool, Updated, ResultData a)
runResultData' !expId !maxSteps !len !repResType !resData = do
  ...

For me the code works with constant memory also without the NOINLINE pragmas. However, that might depend on the optimisation, so it's probably safer to use it. I guess this is more or less the same approach as untilJust, but might be handier for other users that have this issue.

Dear GHC developers,

please allow me to express my strong interest in a solution to this issue.

I know some workarounds, but is there a definite workaround? Because from what I understand this can neither be detected be the compiler nor prevented during development. But is this really the case?

If so, the problem described in this issue will prevent further acceptance of production use of Haskell by my colleagues.

I ask myself: How can I advocate for Haskell on the grounds of correctness and elegant code e.g. the possibility of equational reasoning, when on the other hand in light of this issue the runtime behavior is unpredictable in a frightening way, as soon as one uses some kind of nested monadic loop, as is done by almost every server program?

I have not dared defending the usage of Haskell in the company I work for, after this bug was really hitting us hard, after we rolled out a new version of our software that also contained one major component/service that was rewritten in Haskell (and that was admittedly not well enough tested).

Sadly this took place during the height of the pandemic, and this problem impacted users seeking medical support mainly for Covid. It was bad, and was even covered in the national news on TV, and also made my colleagues and me stay up many nights to manually restart servers that ran out of memory from before Christmas through new year.

Maybe this bug isn't actually a bug, but worse, the unintended consequence of the way GHC and/or the language itself is designed. In that case, that might justify a change in the compiler or the semantics of the language.

Anyway, sorry for warming this up with such drama, I mean no harm and I just wanted to share my experience in the hope it will be a perspective that somehow contributes to making Haskell better.

Hello Sven, you are right to be concerned. This issue is a critical one for adoption of Haskell in production. Fortunately the most egregious problems (including the one in this ticket) are solved by turning off the full laziness pessimisation (by using -fno-full-laziness). Doing so seems to have no downside that is noticed by benchmarks so it could be worth just using that flag on all builds you ever do with GHC.

If you have a problem like this that is not solved by -fno-full-laziness then I would be very interested to hear about it. Please contact me and send details.

If you turn off full laziness then you have to be a lot more careful to manually float out any bindings you want to be shared.

If you turn off full laziness then you have to be a lot more careful to manually float out any bindings you want to be shared.

Perhaps nofib is not representative, but the author of the ticket I shared found "nothing changing" when turning off full laziness. In any case, if you want sharing you should write your program with manifest sharing, and GHC should respect that unless it has good reason to believe it's not making things worse for you. If you don't want sharing, likewise.

nofib is ancient and weird and I very much doubt it's representative of anything.

The author of that ticket compiled stage 1 GHC with -fno-full-laziness and it showed very little difference in compilation time of the rest of the files in the build. SPJ thought it was worth running further benchmarks and so he suggested nofib, yielding the results I just described.

Perhaps there are benchmarks with better diagnostic power. Do you know of any? If so then they should probably be incorporated into GHC's regression suite! If nofib is representative of nothing then I wonder why the GHC team keep it around.

Trac field	Value
Version	8.0.1
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

Memory leak caused by nested monadic loops

Child items 0

Activity