Opened 9 years ago

Last modified 3 months ago

#367 new bug (None)

Infinite loops can hang Concurrent Haskell

Reported by: simonpj Owned by: ezyang
Priority: lowest Milestone:
Component: Compiler Version: 6.4.1
Keywords: scheduler allocation Cc: ganesh.sittampalam@…, SamB, leon.p.smith@…, pho@…, bgamari@…, fryguybob@…, mail@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Difficulty: Unknown
Test Case: concurrent/should_run/T367, concurrent/should_run/T367_letnoescape Blocked By:
Blocking: Related Tickets:

Description (last modified by simonmar)

An infinite loop that does not allocate can hang Concurrent Haskell, because no thread switching occurs. Demo code below (from Koen Claessen).

Bites occasionally, but not often.

Simon



module Main where

import Control.Concurrent
  ( forkIO
  , threadDelay
  , killThread
  , newEmptyMVar
  , takeMVar
  , putMVar
  )

import Data.IORef

import System.IO ( hFlush, stdout )

timeout :: Int -> a -> IO (Maybe a)
timeout n x =
  do put "Race starts ..."
     resV <- newEmptyMVar
     pidV <- newEmptyMVar

     let waitAndFail =
           do put "Waiting ..."
              threadDelay n
              put "Done waiting!"
              putMVar resV Nothing

         eval =
           do put "Evaluating ..."
              x `seq` put "Done!"
              putMVar resV (Just x)

     -- used "mfix" here before but got non-termination 
problems
     -- (not sure they had anything to do with mfix)
     pid1  <- forkIO $ do pid2 <- takeMVar pidV
                          eval
                          killThread pid2
     pid2  <- forkIO $ do waitAndFail
                          killThread pid1
     putMVar pidV pid2

     put "Blocking ..."
     takeMVar resV

put s =
  do putStrLn s
     hFlush stdout

main =
  do timeout 1 (sum (repeat 1))

The above program produces the following (expected result):

Race starts ...
Blocking ...
Evaluating ...
Waiting ...
Done waiting!

If you replace 'sum (repeat 1)' by 'last (repeat 1)' the
program hangs.


Attachments (1)

patch (7.8 KB) - added by ezyang 19 months ago.
Yield checks at the beginning of functions and let-no-escapes, but not case alts


Change History (39)

comment:1 Changed 8 years ago by simonmar

  • Architecture set to Unknown
  • Description modified (diff)
  • Difficulty set to Unknown
  • Operating System set to Unknown
  • Version changed from None to 6.4.1

comment:2 Changed 8 years ago by igloo

  • Milestone set to 6.8

One reason this is annoying is that it means a manager thread can't run code under a timeout when it has no reason to believe that the code won't go into an infinite loop or generate an infinite amount of output.

For example, lambdabot can't just use something like the timeout function above to ensure its modules don't go into an infinite loop. Another example is not being able to give up and return a bad result if you run out of time in something like the ICFP contest.
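
For concreteness, a minimal sketch of the same failure mode (not from the ticket; it uses System.Timeout from base, which implements essentially the same race as the timeout function in the description):

import Control.Exception (evaluate)
import System.Timeout (timeout)

main :: IO ()
main = do
  -- 'repeat 1' is a single cyclic cons cell, so walking it with 'last'
  -- never allocates; without the yield points discussed later in this
  -- ticket the Timeout exception is never delivered and this call
  -- never returns.
  r <- timeout 1000000 (evaluate (last (repeat (1 :: Int))))
  print r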

comment:3 Changed 7 years ago by guest

  • Cc ganesh@… added

comment:4 Changed 7 years ago by guest

  • Cc ganesh.sittampalam@… added; ganesh@… removed

comment:5 Changed 6 years ago by igloo

  • Milestone changed from 6.8 branch to _|_

comment:6 Changed 6 years ago by simonmar

  • Architecture changed from Unknown to Unknown/Multiple

comment:7 Changed 6 years ago by simonmar

  • Operating System changed from Unknown to Unknown/Multiple

comment:8 Changed 5 years ago by SamB

This is just a gratuitous comment to add the word "scheduler" to the page.

comment:9 Changed 5 years ago by SamB

  • Keywords scheduler allocation added

comment:10 Changed 5 years ago by SamB

According to CapabilitiesAndScheduling:

We do have a time-slice mechanism: the timer interrupt (see Timer.c) sets the context_switch flag, which causes the running thread to return to the scheduler the next time a heap check fails (at the end of the current nursery block). When a heap check fails, the thread doesn't necessarily always return to the scheduler: as long as the context_switch flag isn't set, and there is another block in the nursery, it resets Hp and HpLim to point to the new block, and continues.

To fix this bug, we need a way for the timer signal handler to *force* the Haskell code to stop in bounded time.

Two ways that come to mind for the handler to force a stop are:

  1. insert a breakpoint (or a special jump?) at some pre-arranged point in any arbitrarily-long allocation-free loop, such that the breakpoint signal handler can safely enter the scheduler
  2. use some sort of instruction-by-instruction Call Frame Information, something like that of DWARF 2 and up, to figure out the innermost frame's stack layout.

Approach 1 is kind of icky insofar as it might cause spurious stops in other threads, or worse!

For those unfamiliar with it: DWARF's CFI represents a function of type (IP value × register name) → Maybe (how to find the value of that register in the caller), where Nothing means "that register got clobbered". For approach 2 to work, we would obviously also need information about which of the interrupted code's registers and stack slots represent pointers that the garbage collector should follow.

Any other ideas about how the timer handler can guarantee re-entering the scheduler in bounded time?

There *is* the obvious "check every time around a non-allocating loop" approach, but it seems obvious that that would cost far too much where we can least afford it. (Is this actually true? So many things that seem obvious aren't...)

comment:11 Changed 5 years ago by SamB

  • Cc SamB added
  • Owner nobody deleted
  • Status changed from assigned to new

comment:12 Changed 4 years ago by igloo

  • Type of failure set to Incorrect result at runtime

comment:13 Changed 2 years ago by lpsmith

  • Cc leon.p.smith@… added

comment:14 Changed 2 years ago by PHO

  • Cc pho@… added

comment:15 Changed 21 months ago by ezyang

While solving this problem in general may be quite hard, we might be able to allow users to annotate potentially time-consuming, non-allocating regions of code in a way that would allow them to be interrupted, by co-opting the mechanism we have for interruptible FFI calls. Namely, pthread_kill()'ing recalcitrant threads >:-) (alas, this won't work for Windows, but you can't have everything in life...) E.g.

runInterruptibly (evaluate (last (repeat 1)))

The primary difficulty of this scheme is making sure the computation is being done on threads which we can afford to terminate with extreme prejudice. This is easy for safe FFI calls, because we give up the capability before entering the foreign code. But a thread executing Haskell will generally have the capability, since it might GC.

One answer might be: on entering such an annotated region, give up the capability, and make the thread do an InCall if it happens to hit something that needs GC'ing. Of course, this means we need to have a version of all the code being generated here to use different GC entry-points which do the InCall shindig (in the cases where this is useful, there shouldn't be very many of them, since the whole point is this is a non-allocating loop!) but this could cause a pretty massive expansion in overall code size. The benefit of such a scheme is that we only pay a cost entering such a region. We don't want to add any checks to the pre-existing GC functions, since we'll pay the cost for all programs, even ones not using this functionality.

Another possibility is to still give up the capability, but to demand that any code covered by such an annotation be statically known never to make allocations. But it is probably too difficult to explain to the programmer under what circumstances allocation occurs, and I imagine many regions of code will allocate once or twice, but have large chunks inside which perform no allocation. On the other hand, it would be pretty nice if Haskell programmers could carve out a chunk of code, and say, "This is doing a very clever numeric calculation which should compile straight to something efficient which does no allocation."

But I think the right answer here is to make it possible to run Haskell code *without* holding a capability, and then proceed from there.
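
For concreteness, the surface API being proposed might look something like this (a purely hypothetical sketch; as the next comment notes, this direction was abandoned):

-- Hypothetical: run a possibly long-running, non-allocating computation on a
-- thread that the RTS is allowed to kill in bounded time (via pthread_kill).
runInterruptibly :: IO a -> IO a
runInterruptibly act = act   -- placeholder body; the real mechanism would live in the RTS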

comment:16 Changed 21 months ago by ezyang

Alas, the above proposal is totally infeasible in the current world order, because it is completely impossible to run Haskell code without holding the capability. I thought you might get away with only instrumenting GC calls, but really any and all of the memory you're looking at could get moved around by a GC who thinks that they have all the capabilities. So we’re back to an implementation strategy where a thread holding a capability gets killed, and then we recover, but this seems generally Hard(TM).

comment:17 Changed 19 months ago by ezyang

  • Owner set to ezyang

It turns out the stupid implementation is actually pretty fast.

diff --git a/compiler/codeGen/StgCmmHeap.hs b/compiler/codeGen/StgCmmHeap.hs
index fb37391..a70d132 100644
--- a/compiler/codeGen/StgCmmHeap.hs
+++ b/compiler/codeGen/StgCmmHeap.hs
@@ -557,15 +557,22 @@ do_checks checkStack alloc do_gc = do
     hp_oflo = CmmMachOp (mo_wordUGt dflags)
                         [CmmReg hpReg, CmmReg (CmmGlobal HpLim)]
 
+    -- Yielding if HpLim == 0
+    yielding = CmmMachOp (mo_wordEq dflags)
+                        [CmmReg (CmmGlobal HpLim), CmmLit (zeroCLit dflags)]
+
     alloc_n = mkAssign (CmmGlobal HpAlloc) alloc_lit
   gc_id <- newLabelC
 
   when checkStack $ do
      emit =<< mkCmmIfGoto sp_oflo gc_id
 
-  when (alloc /= 0) $ do
-     emitAssign hpReg bump_hp
-     emit =<< mkCmmIfThen hp_oflo (alloc_n <*> mkBranch gc_id)
+  if (alloc /= 0)
+    then do
+      emitAssign hpReg bump_hp
+      emit =<< mkCmmIfThen hp_oflo (alloc_n <*> mkBranch gc_id)
+    else do
+      emit =<< mkCmmIfThen yielding (alloc_n <*> mkBranch gc_id)
 
   emitOutOfLine gc_id $
      do_gc -- this is expected to jump back somewhere

A totally unscientific benchmark on

module Main where

import qualified Data.Vector as U

main = U.sum (U.enumFromTo 1 (1000000000 :: Int)) `seq` return ()

yields this C--

 Main.main1_entry()
         { [(c1Zm,
             Main.main1_info:
                 const 65539;
                 const 0;
                 const 15;),
            (c1Zn,
             block_c1Zn_info:
                 const 0;
                 const 32;)]
         }
     c1Zm:
         if (Sp - 12 < SpLim) goto c1Zv;
         if (HpLim == 0) goto c1Zu;
         I32[Sp - 8] = 0;
         I32[Sp - 12] = 1;
         I32[Sp - 4] = c1Zn;
         Sp = Sp - 12;
         jump Main.main_$s$wfoldlM'_loop_info; // []
     c1Zu:
         HpAlloc = 0;
         goto c1Zv;
     c1Zv:
         R1 = Main.main1_closure;
         jump stg_gc_fun; // [R1]
     c1Zn:
         if (HpLim == 0) goto c1ZC;
         R1 = GHC.Tuple.()_closure+1;
         Sp = Sp + 4;
         jump I32[Sp]; // [R1]
     c1ZC:
         HpAlloc = 0;
         R1 = R1;
         jump stg_gc_unbx_r1; // [R1]

which is only about 10% slower.

Unfortunately, if you want to guarantee things don't hang, it's not enough to compile the untrusted code like this; all the other code needs to have this transformation applied too. So we should probably have a -yielding flag (like -threaded or -profiling) which allows us to compile "yielding" versions of all relevant code. Unfortunately, the number of combinations of such flags increases exponentially...

comment:18 Changed 19 months ago by ezyang

Code bloat is pretty big across the board, but the runtime performance hit is about what I expected.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna         +21.8%     +0.0%      0.14      0.14     +0.0%
           ansi         +21.8%     +0.0%      0.00      0.00     +0.0%
           atom         +21.9%     +0.0%     +0.8%     +0.3%     +0.0%
         awards         +21.9%     +0.0%      0.00      0.00     +0.0%
         banner         +21.7%     +0.0%      0.00      0.00     +0.0%
     bernouilli         +21.9%     +0.0%     -2.1%     -1.8%     +0.0%
          boyer         +21.8%     +0.0%      0.05      0.05     +0.0%
         boyer2         +21.8%     +0.0%      0.01      0.01     +0.0%
           bspt         +21.6%     +0.0%      0.02      0.02     +0.0%
      cacheprof         +21.3%     +0.1%     +4.2%     +4.0%     +0.0%
       calendar         +21.9%     +0.0%      0.00      0.00     +0.0%
       cichelli         +21.8%     +0.0%      0.11      0.11     +0.0%
        circsim         +21.9%     +0.0%     +2.1%     +2.0%     +0.0%
       clausify         +21.9%     +0.0%      0.05      0.05     +0.0%
  comp_lab_zift         +21.9%     +0.0%     +2.2%     +1.1%     +0.0%
       compress         +21.8%     +0.0%     +3.6%     +4.5%     +0.0%
      compress2         +21.8%     +0.0%     +0.0%     +0.0%     +0.0%
    constraints         +22.0%     +0.0%     +0.4%     +0.4%     +0.0%
   cryptarithm1         +21.9%     +0.0%     +1.4%     +1.3%     +0.0%
   cryptarithm2         +21.9%     +0.0%      0.02      0.02     +0.0%
            cse         +21.8%     +0.0%      0.00      0.00     +0.0%
          eliza         +21.5%     +0.0%      0.00      0.00     +0.0%
          event         +21.9%     +0.0%      0.17      0.17     +0.0%
         exp3_8         +21.9%     +0.0%     +0.0%     +0.0%     +0.0%
         expert         +21.9%     +0.0%      0.00      0.00     +0.0%
            fem         +22.1%     +0.0%      0.03      0.03     +0.0%
            fft         +21.6%     +0.0%      0.05      0.05     +0.0%
           fft2         +21.6%     +0.0%      0.08      0.08     +0.0%
       fibheaps         +21.9%     +0.0%      0.04      0.04     +0.0%
           fish         +21.8%     +0.0%      0.03      0.03     +0.0%
          fluid         +21.8%     +0.0%      0.01      0.01     +0.0%
         fulsom         +21.7%     +0.0%     +2.5%     +2.5%     -0.9%
         gamteb         +21.7%     +0.0%      0.06      0.06     +0.0%
            gcd         +21.9%     +0.0%      0.04      0.04     +0.0%
    gen_regexps         +21.9%     +0.0%      0.00      0.00     +0.0%
         genfft         +21.8%     +0.0%      0.05      0.05     +0.0%
             gg         +21.7%     +0.0%      0.02      0.02     +0.0%
           grep         +21.8%     +0.0%      0.00      0.00     +0.0%
         hidden         +22.0%     +0.0%    +15.0%    +14.8%     +0.0%
            hpg         +21.7%     +0.0%      0.16      0.16     +0.0%
            ida         +21.9%     +0.0%      0.13      0.13     +0.0%
          infer         +21.8%     +0.0%      0.08      0.08     +0.0%
        integer         +21.9%     +0.0%    +13.9%    +13.9%     +0.0%
      integrate         +21.9%     +0.0%     -9.1%     -8.1%     +0.0%
        knights         +21.9%     +0.0%      0.01      0.01     +0.0%
           lcss         +21.9%     +0.0%     +0.2%     +0.3%     +0.0%
           life         +21.9%     +0.0%     +5.7%     +5.2%     +0.0%
           lift         +21.8%     +0.0%      0.00      0.00     +0.0%
      listcompr         +21.8%     +0.0%      0.11      0.11     +0.0%
       listcopy         +21.8%     +0.0%      0.12      0.12     +0.0%
       maillist         +21.9%     +0.0%      0.10      0.10     -5.4%
         mandel         +21.5%     +0.0%      0.09      0.09     +0.0%
        mandel2         +21.9%     +0.0%      0.01      0.01     +0.0%
        minimax         +21.9%     +0.0%      0.01      0.01     +0.0%
        mkhprog         +21.9%     +0.0%      0.01      0.01     +0.0%
     multiplier         +21.9%     +0.0%      0.15      0.15     +0.0%
       nucleic2         +21.4%     +0.0%      0.09      0.09     +0.0%
           para         +22.0%     +0.0%     +3.2%     +3.2%     +0.0%
      paraffins         +21.9%     +0.0%      0.11      0.11     +0.0%
         parser         +21.8%     +0.0%      0.05      0.05     +0.0%
        parstof         +21.3%     +0.0%      0.01      0.01     +0.0%
            pic         +22.0%     +0.0%      0.02      0.02     +0.0%
          power         +21.8%     +0.0%     +0.0%     -0.1%     +0.0%
         pretty         +21.9%     +0.0%      0.00      0.00     +0.0%
         primes         +21.9%     +0.0%      0.08      0.08     +0.0%
      primetest         +21.9%     +0.0%      0.14      0.14     +0.0%
         prolog         +21.9%     +0.0%      0.01      0.01     +0.0%
         puzzle         +21.9%     +0.0%      0.19      0.19     +0.0%
         queens         +21.9%     +0.0%      0.03      0.03     +0.0%
        reptile         +21.5%     +0.0%      0.02      0.02     +0.0%
        rewrite         +21.9%     +0.0%      0.02      0.02     +0.0%
           rfib         +21.9%     +0.0%      0.02      0.02     +0.0%
            rsa         +21.9%     +0.0%      0.04      0.04     +0.0%
            scc         +21.9%     +0.0%      0.00      0.00     +0.0%
          sched         +21.9%     +0.0%      0.03      0.03     +0.0%
            scs         +21.9%     +0.0%     +2.4%     +2.8%     +0.0%
         simple         +22.2%     +0.0%     +4.8%     +5.0%     +0.0%
          solid         +21.8%     +0.0%      0.17      0.17     +0.0%
        sorting         +21.9%     +0.0%      0.00      0.00     +0.0%
         sphere         +21.9%     +0.0%      0.08      0.08     +0.0%
         symalg         +22.0%     +0.0%      0.02      0.02     +0.0%
            tak         +21.9%     +0.0%      0.02      0.02     +0.0%
      transform         +21.8%     +0.0%     -4.6%     -4.4%     +0.0%
       treejoin         +21.9%     +0.0%     +0.0%     +0.0%     +0.0%
      typecheck         +21.8%     +0.0%     +5.5%     +6.6%     +0.0%
        veritas         +21.1%     +0.0%      0.01      0.01     +0.0%
           wang         +21.9%     +0.0%      0.14      0.14     +0.0%
      wave4main         +21.9%     +0.0%     +3.0%     +3.0%     +0.0%
   wheel-sieve1         +21.9%     +0.0%    +11.9%    +11.9%     +0.0%
   wheel-sieve2         +21.9%     +0.0%     +4.8%     +4.8%     +0.0%
           x2n1         +21.6%     +0.0%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min         +21.1%     +0.0%     -9.1%     -8.1%     -5.4%
            Max         +22.2%     +0.1%    +15.0%    +14.8%     +0.0%
 Geometric Mean         +21.8%     +0.0%     +2.6%     +2.7%     -0.1%

comment:19 follow-up: Changed 19 months ago by simonmar

You could improve code size by omitting the HpAlloc = 0 assignment (perhaps making sure that it is initialized to zero in LOAD_THREAD_STATE or something).

Another alternative is to use SpLim instead of HpLim to trigger the interrupt, on the grounds that there are more stack checks than heap checks. We would have to put SpLim in a memory location instead of a register, but we could move HpLim into a register.

Something else we could do is add a flag on every top-level function to say whether it is non-allocating (rather like the NoCafRefs flag), and we could use that to optimise away many of the extra checks.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 19 months ago by ezyang

Replying to simonmar:

You could improve code size by omitting the HpAlloc = 0 assignment (perhaps making sure that it is initialized to zero in LOAD_THREAD_STATE or something).

Fascinatingly enough, this doesn't help all that much, since instruction alignment adds nops that fill in the space savings.

Another alternative is to use SpLim instead of HpLim to trigger the interrupt, on the grounds that there are more stack checks than heap checks. We would have to put SpLim in a memory location instead of a register, but we could move HpLim into a register.

To be clear, this is changing globally how preemption would work, since prior to this patch we were zeroing HpLim to trigger a yield. But it should otherwise work. I'll chase up some stats here too. (If SpLim is checked more often, won't we pay a performance cost for having it in a memory location?)

Something else we could do is add a flag on every top-level function to say whether it is non-allocating (rather like the NoCafRefs flag), and we could use that to optimise away many of the extra checks.

I don't quite understand what this means: isn't alloc = 0 in the heap check just this information?

comment:21 Changed 19 months ago by ezyang

Here are the nofib results for the HpAlloc variant:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna         +21.8%     +0.0%      0.14      0.14     +0.0%
           ansi         +21.8%     +0.0%      0.00      0.00     +0.0%
           atom         +21.9%     +0.0%     +0.7%     +0.5%     +0.0%
         awards         +21.9%     +0.0%      0.00      0.00     +0.0%
         banner         +21.7%     +0.0%      0.00      0.00     +0.0%
     bernouilli         +21.9%     +0.0%     -2.1%     -2.1%     +0.0%
          boyer         +21.8%     +0.0%      0.06      0.06     +0.0%
         boyer2         +21.8%     +0.0%      0.01      0.01     +0.0%
           bspt         +21.6%     +0.0%      0.02      0.02     +0.0%
      cacheprof         +21.3%     +0.0%     +3.9%     +3.7%     -0.2%
       calendar         +21.9%     +0.0%      0.00      0.00     +0.0%
       cichelli         +21.8%     +0.0%      0.11      0.11     +0.0%
        circsim         +21.9%     +0.0%     +2.0%     +2.2%     +0.0%
       clausify         +21.9%     +0.0%      0.05      0.05     +0.0%
  comp_lab_zift         +21.9%     +0.0%     +1.9%     +0.8%     +0.0%
       compress         +21.8%     +0.0%     +4.2%     +4.5%     +0.0%
      compress2         +21.8%     +0.0%     +0.0%     +0.0%     +0.0%
    constraints         +22.0%     +0.0%     +1.0%     +1.0%     +0.0%
   cryptarithm1         +21.9%     +0.0%     +1.3%     +1.2%     +0.0%
   cryptarithm2         +21.9%     +0.0%      0.02      0.02     +0.0%
            cse         +21.8%     +0.0%      0.00      0.00     +0.0%
          eliza         +21.5%     +0.0%      0.00      0.00     +0.0%
          event         +21.9%     +0.0%      0.17      0.17     +0.0%
         exp3_8         +21.9%     +0.0%     +0.0%     +0.3%     +0.0%
         expert         +21.9%     +0.0%      0.00      0.00     +0.0%
            fem         +22.1%     +0.0%      0.03      0.03     +0.0%
            fft         +21.6%     +0.0%      0.05      0.05     +0.0%
           fft2         +21.6%     +0.0%      0.08      0.08     +0.0%
       fibheaps         +21.9%     +0.0%      0.04      0.04     +0.0%
           fish         +21.8%     +0.0%      0.03      0.03     +0.0%
          fluid         +21.8%     +0.0%      0.01      0.01     +0.0%
         fulsom         +21.7%     +0.0%     +2.1%     +1.7%     +0.5%
         gamteb         +21.7%     +0.0%      0.06      0.06     +0.0%
            gcd         +21.9%     +0.0%      0.04      0.04     +0.0%
    gen_regexps         +21.9%     +0.0%      0.00      0.00     +0.0%
         genfft         +21.8%     +0.0%      0.05      0.05     +0.0%
             gg         +21.7%     +0.0%      0.02      0.02     +0.0%
           grep         +21.8%     +0.0%      0.00      0.00     +0.0%
         hidden         +22.0%     +0.0%    +14.6%    +14.4%     +0.0%
            hpg         +21.7%     +0.0%      0.16      0.16     +0.0%
            ida         +21.9%     +0.0%      0.13      0.13     +0.0%
          infer         +21.8%     +0.0%      0.08      0.08     +0.0%
        integer         +21.9%     +0.0%    +14.0%    +13.9%     +0.0%
      integrate         +21.9%     +0.0%     -8.0%     -7.8%     +0.0%
        knights         +21.9%     +0.0%      0.01      0.01     +0.0%
           lcss         +21.9%     +0.0%     +0.2%     +0.2%     +0.0%
           life         +21.9%     +0.0%     +5.9%     +5.7%     +0.0%
           lift         +21.8%     +0.0%      0.00      0.00     +0.0%
      listcompr         +21.8%     +0.0%      0.11      0.11     +0.0%
       listcopy         +21.8%     +0.0%      0.12      0.12     +0.0%

(Also included here because I wanted to see whether some of the improvements were reproducible.)

comment:22 in reply to: ↑ 20 ; follow-up: Changed 19 months ago by simonmar

Replying to ezyang:

Replying to simonmar:

You could improve code size by omitting the HpAlloc = 0 assignment (perhaps making sure that it is initialized to zero in LOAD_THREAD_STATE or something).

Fascinatingly enough, this doesn't help all that much, since instruction alignment adds nops that fill in the space savings.

But we don't actually align the heap-check failure branch, so I'm confused. Can you post the asm code you're seeing?

Another alternative is to use SpLim instead of HpLim to trigger the interrupt, on the grounds that there are more stack checks than heap checks. We would have to put SpLim in a memory location instead of a register, but we could move HpLim into a register.

To be clear, this is changing globally how preemption would work, since prior to this patch we were zeroing HpLim to trigger a yield. But it should otherwise work. I'll chase up some stats here too. (If SpLim is checked more often, won't we pay a performance cost for having it in a memory location?)

Maybe, but it's worth measuring I think.

Something else we could do is add a flag on every top-level function to say whether it is non-allocating (rather like the NoCafRefs flag), and we could use that to optimise away many of the extra checks.

I don't quite understand what this means: isn't alloc = 0 in the heap check just this information?

I had in mind making it a transitive property - a function would get the alloc flag if it is guaranteed to allocate within a bounded time, so then any callers don't need a yield check.

comment:23 in reply to: ↑ 22 Changed 19 months ago by ezyang

Replying to simonmar:

Replying to ezyang:

Fascinatingly enough, this doesn't help all that much, since instruction alignment adds nops that fill in the space savings.

But we don't actually align the heap-check failure branch, so I'm confused. Can you post the asm code you're seeing?

I misspoke; actually, we're page-aligning the data section, and the savings aren't enough to get us to the previous page. It's technically a benefit, but only if the increase in size means you can't fit the entire code block in the instruction cache...

- 12 .text         001e239c  0804a770  0804a770  00002770  2**4
+ 12 .text         001e23bc  0804a770  0804a770  00002770  2**4
                   CONTENTS, ALLOC, LOAD, READONLY, CODE

I had in mind making it a transitive property - a function would get the alloc flag if it is guaranteed to allocate within a bounded time, so then any callers don't need a yield check.

Hm, I guess this is good for the little blocks we generate which only have one exit point, and not so good if there are multiple exit points (since all of them would need the Alloc flag set to work.)

comment:24 Changed 19 months ago by ezyang

Here is a variant of the patch that only emits yield checks when not returning (since it's not possible to "infinitely return"). We do much better on runtime and binary size; maybe good enough to ship by default!

diff --git a/compiler/codeGen/StgCmmHeap.hs b/compiler/codeGen/StgCmmHeap.hs
index fb37391..4c53bc2 100644
--- a/compiler/codeGen/StgCmmHeap.hs
+++ b/compiler/codeGen/StgCmmHeap.hs
@@ -371,7 +371,7 @@ entryHeapCheck cl_info nodeSet arity args code
 
        loop_id <- newLabelC
        emitLabel loop_id
-       heapCheck True (gc_call updfr_sz <*> mkBranch loop_id) code
+       heapCheck True True (gc_call updfr_sz <*> mkBranch loop_id) code
 
 {-
     -- This code is slightly outdated now and we could easily keep the above
@@ -461,7 +461,7 @@ cannedGCReturnsTo :: Bool -> CmmExpr -> [LocalReg] -> Label -> ByteOff
 cannedGCReturnsTo cont_on_stack gc regs lret off code
   = do dflags <- getDynFlags
        updfr_sz <- getUpdFrameOff
-       heapCheck False (gc_call dflags gc updfr_sz) code
+       heapCheck False False (gc_call dflags gc updfr_sz) code
   where
     reg_exprs = map (CmmReg . CmmLocal) regs
       -- Note [stg_gc arguments]
@@ -476,7 +476,7 @@ genericGC code
        lretry <- newLabelC
        emitLabel lretry
        call <- mkCall generic_gc (GC, GC) [] [] updfr_sz (0,[])
-       heapCheck False (call <*> mkBranch lretry) code
+       heapCheck False True (call <*> mkBranch lretry) code
 
 cannedGCEntryPoint :: DynFlags -> [LocalReg] -> Maybe CmmExpr
 cannedGCEntryPoint dflags regs
@@ -524,22 +524,23 @@ mkGcLabel :: String -> CmmExpr
 mkGcLabel s = CmmLit (CmmLabel (mkCmmCodeLabel rtsPackageId (fsLit s)))
 
 -------------------------------
-heapCheck :: Bool -> CmmAGraph -> FCode a -> FCode a
-heapCheck checkStack do_gc code
+heapCheck :: Bool -> Bool -> CmmAGraph -> FCode a -> FCode a
+heapCheck checkStack checkYield do_gc code
   = getHeapUsage $ \ hpHw ->
     -- Emit heap checks, but be sure to do it lazily so
     -- that the conditionals on hpHw don't cause a black hole
-    do  { codeOnly $ do_checks checkStack hpHw do_gc
+    do  { codeOnly $ do_checks checkStack checkYield hpHw do_gc
         ; tickyAllocHeap hpHw
         ; doGranAllocate hpHw
         ; setRealHp hpHw
         ; code }
 
 do_checks :: Bool       -- Should we check the stack?
+          -> Bool       -- Should we check for preemption?
           -> WordOff    -- Heap headroom
           -> CmmAGraph  -- What to do on failure
           -> FCode ()
-do_checks checkStack alloc do_gc = do
+do_checks checkStack checkYield alloc do_gc = do
   dflags <- getDynFlags
   let
     alloc_lit = mkIntExpr dflags (alloc * wORD_SIZE dflags) -- Bytes
@@ -557,15 +558,22 @@ do_checks checkStack alloc do_gc = do
     hp_oflo = CmmMachOp (mo_wordUGt dflags)
                         [CmmReg hpReg, CmmReg (CmmGlobal HpLim)]
 
+    -- Yielding if HpLim == 0
+    yielding = CmmMachOp (mo_wordEq dflags)
+                        [CmmReg (CmmGlobal HpLim), CmmLit (zeroCLit dflags)]
+
     alloc_n = mkAssign (CmmGlobal HpAlloc) alloc_lit
   gc_id <- newLabelC
 
   when checkStack $ do
      emit =<< mkCmmIfGoto sp_oflo gc_id
 
-  when (alloc /= 0) $ do
-     emitAssign hpReg bump_hp
-     emit =<< mkCmmIfThen hp_oflo (alloc_n <*> mkBranch gc_id)
+  if (alloc /= 0)
+    then do
+      emitAssign hpReg bump_hp
+      emit =<< mkCmmIfThen hp_oflo (alloc_n <*> mkBranch gc_id)
+    else do
+      emit =<< mkCmmIfGoto yielding gc_id
 
   emitOutOfLine gc_id $
      do_gc -- this is expected to jump back somewhere

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna          +5.5%     +0.0%      0.13      0.13     +0.0%
           ansi          +6.0%     +0.0%      0.00      0.00     +0.0%
           atom          +6.0%     +0.0%     +0.0%     +0.3%     +0.0%
         awards          +6.0%     +0.0%      0.00      0.00     +0.0%
         banner          +5.9%     +0.0%      0.00      0.00     +0.0%
     bernouilli          +6.0%     +0.0%     +0.3%     +0.3%     +0.0%
          boyer          +5.9%     +0.0%      0.06      0.06     +0.0%
         boyer2          +5.9%     +0.0%      0.01      0.01     +0.0%
           bspt          +5.8%     +0.0%      0.02      0.02     +0.0%
      cacheprof          +5.7%     -0.1%     -0.3%     -0.3%     +0.0%
       calendar          +6.0%     +0.0%      0.00      0.00     +0.0%
       cichelli          +6.0%     +0.0%      0.10      0.10     +0.0%
        circsim          +6.0%     +0.0%     +1.0%     +1.0%     +0.0%
       clausify          +6.0%     +0.0%      0.05      0.05     +0.0%
  comp_lab_zift          +6.0%     +0.0%     +2.1%     +1.6%     +0.0%
       compress          +6.0%     +0.0%     +0.0%     +0.0%     +0.0%
      compress2          +6.0%     +0.0%     +0.0%     +0.0%     +0.0%
    constraints          +6.0%     +0.0%     +0.9%     +0.9%     +0.0%
   cryptarithm1          +6.0%     +0.0%     -2.7%     -2.8%     +0.0%
   cryptarithm2          +6.0%     +0.0%      0.02      0.02     +0.0%
            cse          +6.0%     +0.0%      0.00      0.00     +0.0%
          eliza          +5.9%     +0.0%      0.00      0.00     +0.0%
          event          +6.0%     +0.0%      0.17      0.17     +0.0%
         exp3_8          +6.0%     +0.0%     +0.0%     -0.3%     +0.0%
         expert          +5.9%     +0.0%      0.00      0.00     +0.0%
            fem          +5.9%     +0.0%      0.03      0.03     +0.0%
            fft          +6.0%     +0.0%      0.05      0.05     +0.0%
           fft2          +6.0%     +0.0%      0.08      0.08     +0.0%
       fibheaps          +6.0%     +0.0%      0.04      0.04     +0.0%
           fish          +6.0%     +0.0%      0.03      0.03     +0.0%
          fluid          +5.8%     +0.0%      0.01      0.01     +0.0%
         fulsom          +5.9%     +0.0%     +1.7%     +1.0%     +0.9%
         gamteb          +6.0%     +0.0%      0.06      0.06     +0.0%
            gcd          +6.0%     +0.0%      0.04      0.04     +0.0%
    gen_regexps          +6.0%     +0.0%      0.00      0.00     +0.0%
         genfft          +6.0%     +0.0%      0.04      0.04     +0.0%
             gg          +5.9%     +0.0%      0.02      0.02     +0.0%
           grep          +5.9%     +0.0%      0.00      0.00     +0.0%
         hidden          +5.9%     +0.0%     +8.9%     +9.3%     +0.0%
            hpg          +5.9%     +0.0%      0.16      0.16     +0.0%
            ida          +6.0%     +0.0%      0.13      0.13     +0.0%
          infer          +5.9%     +0.0%      0.08      0.08     +0.0%
        integer          +6.0%     +0.0%     +1.1%     +1.0%     +0.0%
      integrate          +6.0%     +0.0%     -4.9%     -4.7%     +0.0%
        knights          +6.0%     +0.0%      0.01      0.01     +0.0%
           lcss          +6.0%     +0.0%     +0.0%     +0.0%     +0.0%
           life          +6.0%     +0.0%     +5.3%     +4.6%     +0.0%
           lift          +6.0%     +0.0%      0.00      0.00     +0.0%
      listcompr          +6.0%     +0.0%      0.11      0.11     +0.0%
       listcopy          +6.0%     +0.0%      0.12      0.12     +0.0%
       maillist          +6.0%     +0.0%      0.10      0.10     +1.7%
         mandel          +6.0%     +0.0%      0.09      0.09     +0.0%
        mandel2          +6.0%     +0.0%      0.01      0.01     +0.0%
        minimax          +6.0%     +0.0%      0.01      0.01     +0.0%
        mkhprog          +6.0%     +0.0%      0.01      0.01     +0.0%
     multiplier          +6.0%     +0.0%      0.15      0.15     +0.0%
       nucleic2          +6.0%     +0.0%      0.08      0.08     +0.0%
           para          +6.0%     +0.0%     -0.2%     -0.0%     +0.0%
      paraffins          +6.0%     +0.0%      0.11      0.11     +0.0%
         parser          +5.8%     +0.0%      0.04      0.04     +0.0%
        parstof          +5.7%     +0.0%      0.01      0.01     +0.0%
            pic          +5.9%     +0.0%      0.02      0.02     +0.0%
          power          +6.0%     +0.0%     +3.1%     +2.8%     +0.0%
         pretty          +6.0%     +0.0%      0.00      0.00     +0.0%
         primes          +6.0%     +0.0%      0.08      0.08     +0.0%
      primetest          +6.0%     +0.0%      0.14      0.14     +0.0%
         prolog          +6.0%     +0.0%      0.01      0.01     +0.0%
         puzzle          +6.0%     +0.0%      0.18      0.18     +0.0%
         queens          +6.0%     +0.0%      0.03      0.03     +0.0%
        reptile          +5.8%     +0.0%      0.02      0.02     +0.0%
        rewrite          +6.0%     +0.0%      0.02      0.02     +0.0%
           rfib          +6.0%     +0.0%      0.02      0.02     +0.0%
            rsa          +6.0%     +0.0%      0.04      0.04     +0.0%
            scc          +6.0%     +0.0%      0.00      0.00     +0.0%
          sched          +6.0%     +0.0%      0.03      0.03     +0.0%
            scs          +5.9%     +0.0%     +1.5%     +1.6%     +0.0%
         simple          +5.7%     +0.0%     +1.0%     +1.8%     +0.0%
          solid          +5.9%     +0.0%      0.17      0.17     +0.0%
        sorting          +6.0%     +0.0%      0.00      0.00     +0.0%
         sphere          +6.0%     +0.0%      0.08      0.08     +0.0%
         symalg          +6.0%     +0.0%      0.02      0.02     +0.0%
            tak          +6.0%     +0.0%      0.02      0.02     +0.0%
      transform          +5.9%     +0.0%     -6.7%     -6.7%     +0.0%
       treejoin          +6.0%     +0.0%     -0.3%     -0.3%     +0.0%
      typecheck          +5.9%     +0.0%     +0.5%     +1.1%     +0.0%
        veritas          +5.4%     +0.0%      0.01      0.01     +0.0%
           wang          +6.0%     +0.0%      0.14      0.14     +0.0%
      wave4main          +6.0%     +0.0%     +1.1%     +0.9%     +0.0%
   wheel-sieve1          +6.0%     +0.0%     +4.3%     +4.6%     +0.0%
   wheel-sieve2          +6.0%     +0.0%     +1.0%     +1.0%     +0.0%
           x2n1          +6.0%     +0.0%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +5.4%     -0.1%     -6.7%     -6.7%     +0.0%
            Max          +6.0%     +0.0%     +8.9%     +9.3%     +1.7%
 Geometric Mean          +5.9%     -0.0%     +0.7%     +0.7%     +0.0%

comment:25 Changed 19 months ago by ezyang

Here are the aggregate stats for the stack check version (overall, it seems that it produces more bloated executables, but is a little bit faster):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
----------------------------------------------------------------------------
            Min          +6.1%     -0.1%     -7.2%     -6.7%    -50.0%
            Max          +6.7%     +5.8%     +5.9%     +6.1%     +0.8%
 Geometric Mean          +6.6%     +0.1%     +0.1%     +0.2%     -0.8%

Switching to stack checks has some interesting interactions with some of the optimizations; for example, we can no longer optimize away Sp - 0 < SpLim as true, since SpLim may have been twiddled with. If we run nofib without adding the extra yields but with those optimizations turned off, we see:

            Min          +5.9%     -0.1%     -7.3%     -7.0%    -50.0%
            Max          +6.7%     +5.8%     +3.9%     +4.1%     +0.0%
 Geometric Mean          +6.6%     +0.1%     -0.1%     -0.1%     -0.8%

So we already pay most of the cost from having to ditch the optimizations.

comment:26 Changed 19 months ago by simonmar

What you've done here is omit the yield checks on (some) case alternatives. Which is almost ok, because all loops will go through a function entry, except for let-no-escapes which use altHeapCheck. So with this patch you won't catch non-allocating recursive let-no-escapes.

To make this correct you need to ensure that let-no-escapes get a yield point too, but you could also omit the yield points from ordinary altHeapChecks which should reduce the code size hit further.

The confusion probably arose because of my naming scheme: altHeapCheckReturnsTo is the heap check for a case alternative where the case scrutinee made an external call, and hence had a "returns to" continuation that we can re-use for the heap check's continuation.

comment:27 Changed 19 months ago by ezyang

For reference, here is an example of a non-allocating recursive let-no-escape:

{-# LANGUAGE MagicHash #-}

import GHC.Conc
import GHC.Prim
import GHC.Exts

main = numSparks >>= \x -> f x `seq` return ()

{-# NOINLINE f #-}
f :: Int -> Bool
f i@(I# j) = let fail :: Int# -> Bool
                 fail i = fail (i +# 1#)
      in if (case i of
            0 -> True
            _ -> False) then fail j else False

Changed 19 months ago by ezyang

Yield checks at the beginning of functions and let-no-escapes, but not case alts

comment:28 Changed 19 months ago by ezyang

This patch, plus the SpLim checks:

            Min          +6.1%     -0.1%     -8.6%     -8.0%    -50.0%
            Max          +6.8%     +5.8%     +4.8%     +5.5%     +1.3%
 Geometric Mean          +6.7%     +0.1%     +0.0%     +0.0%     -0.7%

Good things for runtime, but not all that much difference for code size.

This patch only (without SpLim checks):

            Min          +5.7%     -0.0%     -6.5%     -6.4%    -50.0%
            Max          +6.3%     +5.8%     +5.0%     +5.5%     +0.8%
 Geometric Mean          +6.2%     +0.1%     +0.5%     +0.5%     -0.8%

comment:29 Changed 19 months ago by simonmar

I think we could incorporate the patch, but not turn on the flag by default. If you want to work on it further and see if you can get the code size penalty down that would be great - my suggestion would be to do as I mentioned earlier and omit the yield check for functions which are guaranteed to yield within a finite time, because they only call other functions which have that property. You can assume that calling an arbitrary closure or a primop is guaranteed to yield. As a first step you can do the analysis within a module, the next step would be to extend it across module boundaries.
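
A rough sketch of what such an analysis could look like as a fixpoint over per-function summaries (all names here are made up for illustration; this is not GHC code):

import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Fun = String

-- Hypothetical per-function summary: does the function allocate itself,
-- and which functions does it call?
data Info = Info { allocsItself :: Bool, callees :: [Fun] }

-- Least fixpoint: a function can have its entry yield check omitted if it
-- allocates itself, or if everything it calls already has the property;
-- cycles of non-allocating functions never get marked, so they keep their
-- checks.  Callees we know nothing about (arbitrary closures, primops)
-- are assumed to yield, as suggested above.
noYieldCheckNeeded :: M.Map Fun Info -> S.Set Fun
noYieldCheckNeeded funs = go S.empty
  where
    go acc
      | acc' == acc = acc
      | otherwise   = go acc'
      where
        acc' = S.fromList
          [ f
          | (f, Info alloc cs) <- M.toList funs
          , alloc || all (\c -> c `S.member` acc || M.notMember c funs) cs
          ]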

We should check the primops to make sure that none of them hog the CPU for a long time. I think newArray# has this problem, because it fills in the array in a loop, so for a large array it won't yield quickly.

comment:30 Changed 19 months ago by ganesh

Why can you assume that calling an arbitrary closure is guaranteed to yield?

comment:31 Changed 19 months ago by ezyang

OK, once I get my validate running (failing, due to a certain someone, wink), I will push the patch, plus this docu patch:

<varlistentry>
  <term>
    <option>-falways-yield</option>
    <indexterm><primary><option>-falways-yield</option></primary></indexterm>
  </term>
  <listitem>
      <para>Tells GHC to always emit a pre-emption check on entry
    points to functions. This means that threads that run in tight
    non-allocating loops will get preempted in a timely fashion;
    otherwise, GHC may never manage to interrupt such a loop.  This
    imposes a very slight performance impact but inflates binary sizes
    by about 5%, so it is not enabled by default.  Note that if you
    would like to guarantee that threads can always be interrupted,
    you will need to compile all libraries with this flag.</para>
  </listitem>
</varlistentry>

What are we going to do with information about CPU-hogging primitives? There are lots of unsafe primitives which can cause GHC to segfault or jump to arbitrary code, so mostly this information would have to be advisory for Safe Haskell implementors, who would know that if one of these primitives is called, it had better be doing bounds checks, etc. We can't forcibly terminate the primops, since they're fat machine instructions and don't give up the capability?

comment:32 Changed 19 months ago by simonpj

Great, thanks Edward. Is -falways-yield a good name? Perhaps -fguarantee-yield-points?

comment:33 Changed 19 months ago by simonmar

Yes, thanks Edward. I don't know why we didn't do this ages ago!

Regarding the CPU-hogging primitives, we should rewrite them so that they include a regular yield check. For example in newArray# we can do a yield check every 1000 iterations of the loop, which won't have a measurable impact on performance.
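
The same idea, sketched at the Haskell level rather than in the primop's C-- (a hypothetical helper, just to illustrate the "check every N iterations" pattern; in the RTS itself the check would hook into the context_switch/HpLim mechanism discussed above):

import Control.Concurrent (yield)
import Control.Monad (when)

-- Perform n steps of otherwise uninterruptible work, offering the
-- scheduler a yield point every 1000 iterations.
workWithYields :: (Int -> IO ()) -> Int -> IO ()
workWithYields step n = go 0
  where
    go i
      | i >= n    = return ()
      | otherwise = do
          step i
          when (i `mod` 1000 == 999) yield
          go (i + 1)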

Perhaps -fno-omit-yields for the flag? Where -fomit-yields is an optimisation that we enable by default.

comment:34 Changed 19 months ago by ezyang@…

commit d3128bfc286002862e916296629a22f1ce987e4e

Author: Edward Z. Yang <ezyang@mit.edu>
Date:   Mon Sep 17 18:28:49 2012 +0200

    Partially fix #367 by adding HpLim checks to entry with -fno-omit-yields.
    
    The current fix is relatively dumb as far as where to add HpLim
    checks: it will always perform a check unless we know that we're
    returning from a closure or we are doing a non let-no-escape case
    analysis.  The performance impact on the nofib suite looks like this:
    
                Min          +5.7%     -0.0%     -6.5%     -6.4%    -50.0%
                Max          +6.3%     +5.8%     +5.0%     +5.5%     +0.8%
     Geometric Mean          +6.2%     +0.1%     +0.5%     +0.5%     -0.8%
    
    Overall, the executable bloat is the biggest problem, so we keep the old
    omit-yields optimization on by default. Remember that if you need an
    interruptibility guarantee, you need to recompile all of your libraries
    with -fno-omit-yields.
    
    A better fix would involve only inserting the yields necessary to break
    loops; this is left as future work.
    
    Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

 compiler/codeGen/StgCmmExpr.hs |    4 +--
 compiler/codeGen/StgCmmHeap.hs |   57 ++++++++++++++++++++++++++--------------
 compiler/main/DynFlags.hs      |    4 +++
 docs/users_guide/using.xml     |   18 ++++++++++++
 4 files changed, 60 insertions(+), 23 deletions(-)
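
A sketch of what using the new flag looks like from the user's side (a hypothetical example, not the actual testsuite case; per the commit message, libraries would also need recompiling with -fno-omit-yields for a full guarantee):

{-# OPTIONS_GHC -fno-omit-yields #-}
import Control.Exception (evaluate)
import System.Timeout (timeout)

-- An allocation-free loop defined in this module; compiled with
-- -fno-omit-yields its entry gets a yield check, so the timeout's
-- exception should be delivered instead of the program hanging.
spin :: Int -> Int
spin n = spin n

main :: IO ()
main = timeout 1000000 (evaluate (spin 0)) >>= print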

comment:35 Changed 17 months ago by bgamari

  • Cc bgamari@… added

comment:36 Changed 15 months ago by fryguybob

  • Cc fryguybob@… added

comment:37 Changed 3 months ago by nh2

  • Cc mail@… added

comment:38 Changed 3 months ago by jstolarek

  • Test Case set to concurrent/should_run/T367, concurrent/should_run/T367_letnoescape