Opened 9 years ago

Last modified 20 months ago

#2289 new bug

Needless reboxing of values when returning from a tight loop

Reported by: dons Owned by:
Priority: lowest Milestone:
Component: Compiler Version: 6.8.2
Keywords: boxing, loops, performance Cc: dons@…, johan.tibell@…, michal.terepeta@…, tkn.akio@…, jwlato@…, alpmestan@…, hackage.haskell.org@…, mail@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #2387,#1600 Differential Rev(s):
Wiki Page:

Description

GHC wants to box up strict values when returning from tight inner loops, even when they're immediately taken apart. This leads to redundant instructions in the bodies of tight loops, and more code.

It affects, in particular, loops that result from fusion, which need to be tight, but often return multiple values via unlifted pairs.

Consider this program:

{-# OPTIONS -fexcess-precision #-}
{-# LANGUAGE TypeOperators #-}

import System.Environment
import Text.Printf
import Data.Array.Vector

mean :: UArr Double -> Double
mean arr = s / fromIntegral l
  where
    s :*: l = foldlU k (0 :*: 0) arr :: (Double :*: Int)
    k (s :*: n) x = s+x :*: n+1

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean (enumFromToFracU 1 d))

It generates this rather good Core (ghc 6.8.2):

$s$wfold_s1rB :: Double#
               -> Int#
               -> Double#
               -> (# Double, Int #)

$s$wfold_s1rB =
\ (sc_s1rr :: Double#)
  (sc1_s1rs :: Int#)
  (sc2_s1rt :: Double#) ->
  case >## sc_s1rr y_a1pr of wild4_X1no {
    False ->
      $s$wfold_s1rB
        (+## sc_s1rr 1.0)
        (+# sc1_s1rs 1)
        (+## sc2_s1rt sc_s1rr);
    True -> (# D# sc2_s1rt, I# sc1_s1rs #)
  };
} in 
case $s$wfold_s1rB 2.0 1 1.0 of ww_s1qg { (# ww1_s1qi, ww2_s1qj #) ->
case ww1_s1qi of wild4_a1mC { D# x_a1mE ->
case ww2_s1qj of wild5_aP6 { I# x1_aP8 ->
case /## x_a1mE (int2Double# x1_aP8)
of wild21_a1mK { __DEFAULT ->
D# wild21_a1mK

But note, what's this?

    True -> (# D# sc2_s1rt, I# sc1_s1rs #)
  };
} in 
case $s$wfold_s1rB 2.0 1 1.0 of ww_s1qg { (# ww1_s1qi, ww2_s1qj #) ->
case ww1_s1qi of wild4_a1mC { D# x_a1mE ->
case ww2_s1qj of wild5_aP6 { I# x1_aP8 ->
case /## x_a1mE (int2Double# x1_aP8)

The return values of what was a strict pair are boxed, placed in an unboxed tuple, and then immediately unboxed and the division takes place.

Ok, let's isolate this. Here, the boxed return, from the inner loop:

mean_s19V :: Double#
           -> Int#
           -> Double#
           -> (# Double, Int #)

mean_s19V =
\ (ds1_dD3 :: Double#)
  (ds2_dD4 :: Int#)
  (ds3_dD5 :: Double#) ->
  case >## ds1_dD3 d#_aoG of wild4_Xw {
    False ->
      mean_s19V
        (+## ds1_dD3 1.0)
        (+# ds2_dD4 1)
        (+## ds3_dD5 ds1_dD3);
    True -> (# D# ds3_dD5, I# ds2_dD4 #)
  };
} in 
case mean_s19V 2.0 1 1.0 of wild4_Xr { (# ds1_dCV, ds2_dCW #) ->
case ds1_dCV of wild5_Xv { D# x_aoR ->
case ds2_dCW of wild6_Xy { I# y_aoS ->
case /## x_aoR (int2Double# y_aoS) of wild7_XB { __DEFAULT ->
D# wild7_XB

And the inner loop and exit:

s1bd_info:

  -- what's this stuff?
  leaq        32(%r12), %rax
  cmpq        %r15, %rax
  movq        %rax, %r12
  ja  .L17

  -- ok, to business:
  ucomisd     5(%rbx), %xmm5
  ja  .L19
  movapd      %xmm6, %xmm0
  leaq        -32(%rax), %r12
  incq        %rsi
  addsd       %xmm5, %xmm0
  addsd       .LC1(%rip), %xmm5
  movapd      %xmm0, %xmm6
  jmp s1bd_info


.L19:
  movq        %rsi, -16(%rax)
  movq        $base_GHCziBase_Izh_con_info, -24(%rax)
  movq        $base_GHCziFloat_Dzh_con_info, -8(%rax)
  movsd       %xmm6, (%rax)
  leaq        -7(%rax), %rbx
  leaq        -23(%rax), %rsi
  jmp *(%rbp)

Now, I can avoid the reboxing manually:

mean_s19R :: Double#
           -> Int#
           -> Double#
           -> (# Double#, Int# #)

mean_s19R =
\ (ds1_dCZ :: Double#)
  (ds2_dD0 :: Int#)
  (ds3_dD1 :: Double#) ->
  case >## ds1_dCZ d#_aoG of wild4_Xw {
    False ->
      mean_s19R
        (+## ds1_dCZ 1.0)
        (+# ds2_dD0 1)
        (+## ds3_dD1 ds1_dCZ);
    True -> (# ds3_dD1, ds2_dD0 #)
  };
} in 
case mean_s19R 2.0 1 1.0 of wild4_Xr { (# x_aoR, y_aoS #) ->
case /## x_aoR (int2Double# y_aoS) of wild5_Xv { __DEFAULT ->
D# wild5_Xv

And we get:

s1b9_info:
  -- hey , our junk is gone!

  ucomisd     5(%rbx), %xmm5
  ja  .L17
  movapd      %xmm6, %xmm0
  incq        %rsi
  addsd       %xmm5, %xmm0
  addsd       .LC1(%rip), %xmm5
  movapd      %xmm0, %xmm6
  jmp s1b9_info

-- cool, that was it, let's go home:
.L17:
  movapd      %xmm6, %xmm5
  movq        %rsi, %rbx
  jmp *(%rbp)

Which is a much better result. The loop is tighter.

What can be done here?

Change History (47)

comment:1 Changed 9 years ago by dons

Here's a simple test case:

{-# LANGUAGE TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}

import System.Environment
import Text.Printf

data P a b = P !a !b

mean :: Double -> Double -> P Double Int
mean n m = go n 0 0
    where
        go :: Double -> Int -> Double -> P Double Int
        go x l s | x > m      = P s l
                 | otherwise  = go (x+1) (l+1) (s+x)

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (case mean 1 d of
                        (P x y) -> x / fromIntegral y    )

Yields:

$wgo_s1az :: Double# -> Int# -> Double# -> (# Double, Int #)
$wgo_s1az =
    \ (ww_X1au :: Double#)
    (ww1_X1az :: Int#)
    (ww2_X1aE :: Double#) ->
        case >## ww_X1au y_a178 of wild4_X1g {
            False ->
                $wgo_s1az
                (+## ww_X1au 1.0)
                (+# ww1_X1az 1)
                (+## ww2_X1aE ww_X1au);
            True -> (# D# ww2_X1aE, I# ww1_X1az #)
        };

    } in 
        case $wgo_s1az 2.0 1 1.0 of ww_s1a2 { (# ww1_s1a4, ww2_s1a5 #) ->
        case ww1_s1a4 of wild4_a17r { D# x_a17t ->
        case ww2_s1a5 of wild5_aEV { I# x1_aEX ->
        case /## x_a17t (int2Double# x1_aEX)
        of wild21_a17z { __DEFAULT ->
        D# wild21_a17z

Exhibiting the reboxing.

Running this:

$ time ./G 1e9                  
500000000.067109
./G 1e9  2.21s user 0.00s system 99% cpu 2.225 total

While if we prevent the reboxing, by moving the division inside the loop:

$wgo_s1a0 :: Double# -> Int# -> Double# -> Double#
$wgo_s1a0 =
 \ (ww_X19X :: Double#)
   (ww1_X1a2 :: Int#)
   (ww2_X1a7 :: Double#) ->
   case >## ww_X19X y_a16C of wild4_X1g {
     False ->
       $wgo_s1a0
         (+## ww_X19X 1.0)
         (+# ww1_X1a2 1)
         (+## ww2_X1a7 ww_X19X);
     True -> /## ww2_X1a7 (int2Double# ww1_X1a2

We get faster code:

$ time ./G 1e9                  
500000000.067109
./G 1e9  1.84s user 0.01s system 99% cpu 1.861 total

So I suspect the boxing causes a heap check to end up inside the loop (run on every iteration?), and thus a performance loss.

comment:2 Changed 9 years ago by dons

Note that if we specialise the constructor manually, we can get the correct unboxing: That is, just using strict pairs isn't enough:

data P a b = P {-# UNPACK #-} !a {-# UNPACK #-} !b

won't work.

where we use P only for P Double Int, I'm unable to get the return value unboxed.

If we specialise P, data P = P !Double !Int, then we do get the unpacking, as expected.

This makes strict pairs a little less useful for accumulating state -- we have to specialise them directly ourselves to avoid the reboxing penalty.

comment:3 Changed 9 years ago by simonpj

difficulty: Unknown

Nice example. I took a little look at it. Two things

First, yes GHC never does a heap-check at the start of an alternative of a primop case; that is, one whose scrutinee is just a primop application. In this example that's bad, because there is no allocation before the conditional, and the hot path does no allocation at all.

The fix is to put the heap check at the start of the alternatives, if no allocation precedes the case itself. This would require a significant (but not drastic) change to the code generator. Happily, it'll be a much easier change when John Dias's new back end comes on stream, we'll hold it till then.

Second, and orthogonally, this loop has a nested CPR property. There is no reason that the CPR analyser can't deal with nested stuff, but it doesn't. There's a nice little project there too.

Either of these would fix this particular example, but both are valuable in their own right.

Simon

comment:4 Changed 9 years ago by dons

Looking in the CPR analyser, I see the following comment:

The analysis here detects nested CPR information. For example, if a function returns a constructed pair, the first element of which is a constructed int, then the analysis will detect nested CPR information for the int as well. Unfortunately, the current transformations can't take advantage of the nested CPR information. They have (broken now, I think) code which will flatten out nested CPR components and rebuild them in the wrapper, but enabling this would lose laziness. It is possible to make use of the nested info: if we knew that a caller was strict in that position then we could create a specialized version of the function which flattened/reconstructed that position.

It is not known whether this optimisation would be worthwhile.

So, there's some skeleton code in their for the nested CPR stuff. And the CPR analyser seems pretty small too. Would working on this be worthwhile now?

comment:5 Changed 9 years ago by simonpj

Rats. I'd forgotten about the strictness question:

f :: Int -> (Int,Int)
f x = (g x, h x)

Suppose g and h have the CPR property -- that is, they explicitly return a boxed value. Then it's a mistake to transform to

f x = case (g x, h x) of { (I# r1, I# r2) ->
      (I# r1 ,I# r2) }

because that'd make f too strict. But in your example, g and h are themselves constructors.

My conclusion: for the nested part of CPR analysis we do not want to "look through" function calls, but rather look only for literal constructor applications. I have not thought about how much this'd affect the analysis.

Provided the analysis was modified in this way, it shouldn't be too hard to modify the worker/wrapper part to take account of it.

But NB that CprAnalyse is dead code; the current analysis is done as part of strictness analysis in DmdAnal. And the strictness analyser itself needs love and attention. So much to do, so little time.

Simon

comment:6 Changed 9 years ago by simonpj

(Retrying, having messed up typesetting.)

Rats. I'd forgotten about the strictness question:

f :: Int -> (Int,Int)
f x = (g x, h x)

Suppose g and h have the CPR property -- that is, they explicitly return a boxed value. Then it's a mistake to transform to

f x = case (g x, h x) of { (I# r1, I# r2) ->
      (I# r1 ,I# r2) }

because that'd make f too strict. But in your example, g and h are themselves constructors.

My conclusion: for the nested part of CPR analysis we do not want to "look through" function calls, but rather look only for literal constructor applications. I have not thought about how much this'd affect the analysis.

Provided the analysis was modified in this way, it shouldn't be too hard to modify the worker/wrapper part to take account of it.

But NB that CprAnalyse is dead code; the current analysis is done as part of strictness analysis in DmdAnal. And the strictness analyser itself needs love and attention. So much to do, so little time.

comment:7 Changed 9 years ago by simonpj

See also #2387, which shows another example of the same phenomenon. (I'm closing #2387 to leave just this one open.)

Simon

comment:8 Changed 9 years ago by igloo

Milestone: 6.10 branch

comment:9 Changed 9 years ago by dons

Another case we'd like genralised CPR to work is for sum types.

Consider the task of summing integers from a file, one per line. We need to parse each line, returning possibly success, in a tight loop:

import qualified Data.ByteString.Char8 as S

main = print . go 0 =<< S.getContents
  where
    go !n s = case S.readInt s of
                  S.Nothing    -> n
                  S.Just (k,t) -> go (n+k) (S.tail t)

Where

readInt :: ByteString -> Maybe (Int, ByteString)

We'd like the Just/Nothing tag returned in a register, in an ideal world. And the components of the pair as well. Currently we have to monomorphise the type, and flatten it , to get register returns here.

Note that Clean uses a triple of (Bool, Int, String) for this kind of thing.

comment:10 Changed 9 years ago by simonpj

Don't hold your breath for unboxed sum types. For example, if the Bool is True, the other two fields may be uninitialised, and should not be followed by the GC. I suppose if you were prepared to stubs initialised to (error "Bad"), then you could do this worker/wrapper split for go:

readInt :: String -> Maybe (Int, String)
readInt s = case readInt_w s of
               (# True,  n, s #) -> Nothing
               (# False, n, s #) -> Just (n,s)

readInt_w :: String -> (# Bool, Int, String #)
readInt_w s = case <readInt_body> of
                Just (n,s) -> (# False, n, s #)
                Nothing    -> (# True, error "BAD", error "BAD" #)

Things get harder if there are more constructors in the sum type, or the constructors have more arguments.

Simon

comment:12 Changed 9 years ago by simonmar

Architecture: UnknownUnknown/Multiple

comment:13 Changed 9 years ago by simonmar

Operating System: UnknownUnknown/Multiple

comment:14 Changed 9 years ago by igloo

Milestone: 6.10 branch6.12 branch

comment:15 Changed 9 years ago by simonpj

See also #2387, and #1600

comment:16 Changed 8 years ago by simonmar

I believe this example fits into the same category. We have a recursive tree traversal in the ST monad that returns an Int, and we want the Int unboxed. Here's the complete code, both the version that doesn't optimise as well as we'd like, and the hand-optimised version:

{-# LANGUAGE BangPatterns, UnboxedTuples, MagicHash #-}
module Test where

import Data.Array.ST
import Control.Monad.ST
import Data.Array.Base
import GHC.ST
import GHC.Exts

data Tree
      = Nil
      | Node {-#UNPACK#-} !Int
                          !Tree
                          !Tree
             {-#UNPACK#-} !Int

#if 0
-- The code we want to write
traverse :: Tree -> STUArray s Int Int -> ST s Int
traverse Nil                     !arr = return 0
traverse (Node item child alt w) !arr = do
  childw <- traverse child arr
  altw   <- traverse alt arr
  itemw <- unsafeRead arr item
  unsafeWrite arr item (itemw + childw + w)
  return $! childw + w + altw
#else
-- The code we have to write
traverse :: Tree -> STUArray s Int Int -> ST s Int
traverse tree arr = ST $ \s -> 
  case traverse' tree arr s of { (# s', i #) -> (# s', I# i #) }
  where
  traverse' Nil                             !arr s  = (# s, 0# #)
  traverse' (Node item child alt w@(I# w#)) !arr s0 =
     case traverse' child arr s0 of { (# s1, childw #) ->
     case traverse' alt arr   s1 of { (# s2, altw   #) ->
     case unsafeRead arr item of { ST f -> case f s2 of { (# s3, I# itemw #) ->
     case unsafeWrite arr item (I# itemw + I# childw + w) of { ST f -> case f s2 of { (# s4, _ #) ->
     (# s4, childw +# w# +# altw #)
     }}}}}}
#endif

comment:17 Changed 8 years ago by simonmar

Type of failure: Runtime performance bug

comment:18 Changed 7 years ago by igloo

Milestone: 6.12 branch6.12.3

comment:19 Changed 7 years ago by igloo

Milestone: 6.12.36.14.1
Priority: normallow

comment:20 Changed 7 years ago by igloo

Milestone: 7.0.17.0.2

comment:21 Changed 7 years ago by tibbe

Cc: johan.tibell@… added

comment:22 Changed 7 years ago by igloo

Milestone: 7.0.27.2.1

comment:23 Changed 7 years ago by michalt

Cc: michal.terepeta@… added

comment:24 Changed 6 years ago by igloo

Milestone: 7.2.17.4.1

comment:25 Changed 6 years ago by igloo

Milestone: 7.4.17.6.1
Priority: lowlowest

comment:26 Changed 5 years ago by simonpj@…

commit 4fa3f16ddb9fa8e5d59bde5354918a39e0430a74

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Mon May 28 17:33:42 2012 +0100

    Be less aggressive about the result discount
    
    This patch fixes Trac #6099 by reducing the result discount in CoreUnfold.conSize.
    See Note [Constructor size and result discount] in CoreUnfold.
    
    The existing version is definitely too aggressive. Simon M found it an
    "unambiguous win" but it is definitely what led to the bloat. In a function
    with a lot of case branches, all returning a constructor, the discount could
    grow arbitrarily large.
    
    I also had to increase the -funfolding-creation-threshold from 450 to 750,
    otherwise some functions that should inline simply never get an unfolding.
    (The massive result discount was allow the unfolding to appear before.)
    
    The nofib results are these, picking a handful of outliers to show.
    
            Program           Size    Allocs   Runtime   Elapsed  TotalMem
    --------------------------------------------------------------------------------
             fulsom          -0.5%     -1.6%     -2.8%     -2.6%    +31.1%
           maillist          -0.2%     -0.0%      0.09      0.09     -3.7%
             mandel          -0.4%     +6.6%      0.12      0.12     +0.0%
           nucleic2          -0.2%    +18.5%      0.11      0.11     +0.0%
            parstof          -0.4%     +4.0%      0.00      0.00     +0.0%
    --------------------------------------------------------------------------------
                Min          -0.9%     -1.6%    -19.7%    -19.7%     -3.7%
                Max          +0.3%    +18.5%     +2.7%     +2.7%    +31.1%
     Geometric Mean          -0.3%     +0.4%     -3.0%     -3.0%     +0.2%
    
    Turns out that nucleic2 has a function
      Main.$wabsolute_pos =
        \ (ww_s4oj :: Types.Tfo) (ww1_s4oo :: Types.FloatT)
          (ww2_s4op :: Types.FloatT) (ww3_s4oq :: Types.FloatT) ->
          case ww_s4oj
          of _
          { Types.Tfo a_a1sS b_a1sT c_a1sU d_a1sV e_a1sW f_a1sX g_a1sY h_a1sZ i_a1t0 tx_a1t1 ty_a1t2 tz_a1t3 ->
          (# case ww1_s4oo of _ { GHC.Types.F# x_a2sO ->
             case a_a1sS of _ { GHC.Types.F# y_a2sS ->
             case ww2_s4op of _ { GHC.Types.F# x1_X2y9 ->
             case d_a1sV of _ { GHC.Types.F# y1_X2yh ->
             case ww3_s4oq of _ { GHC.Types.F# x2_X2yj ->
             case g_a1sY of _ { GHC.Types.F# y2_X2yr ->
             case tx_a1t1 of _ { GHC.Types.F# y3_X2yn ->
             GHC.Types.F#
               (GHC.Prim.plusFloat#
                  (GHC.Prim.plusFloat#
                     (GHC.Prim.plusFloat#
                        (GHC.Prim.timesFloat# x_a2sO y_a2sS)
                        (GHC.Prim.timesFloat# x1_X2y9 y1_X2yh))
                     (GHC.Prim.timesFloat# x2_X2yj y2_X2yr))
                  y3_X2yn)
             } } }}}}},
    
            <similar>,
            <similar> )
    
    This is pretty big, but inlining it does get rid of that F# allocation.
    But we'll also get rid of it with deep CPR: Trac #2289. For now we just
    accept the change.

 compiler/coreSyn/CoreUnfold.lhs |   73 ++++++++++++++++++++++-----------------
 compiler/main/StaticFlags.hs    |    7 +++-
 2 files changed, 47 insertions(+), 33 deletions(-)

comment:27 Changed 5 years ago by akio

Cc: tkn.akio@… added

comment:28 Changed 5 years ago by igloo

Milestone: 7.6.17.6.2

comment:29 Changed 5 years ago by jwlato

Cc: jwlato@… added

comment:30 Changed 5 years ago by alpmestan

Cc: alpmestan@… added

comment:31 Changed 5 years ago by morabbin

comment:32 Changed 5 years ago by liyang

Cc: hackage.haskell.org@… added

comment:33 Changed 4 years ago by jstolarek

Cc: jan.stolarek@… added

comment:34 Changed 4 years ago by carter

whats the status of this codegen issue? Was this one of the problems that were supposed to be addressed with loopification?

comment:35 in reply to:  34 Changed 4 years ago by jstolarek

Replying to carter:

Was this one of the problems that were supposed to be addressed with loopification?

No. I was planning to look into this bug during my internship, but in the end I didn't have enough time.

comment:36 Changed 4 years ago by simonpj

In fact it's orthogonal to do with loopification, which does not affect allocation. Its to do with nested CPR analysis, which I'm longing to do but keep getting diverted.

Simon

comment:37 Changed 4 years ago by nomeata

Cc: mail@… added

comment:38 Changed 3 years ago by thoughtpolice

Milestone: 7.6.27.10.1

Moving to 7.10.1.

comment:39 Changed 3 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:40 Changed 3 years ago by thoughtpolice

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:41 Changed 2 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:42 Changed 2 years ago by jstolarek

Cc: jan.stolarek@… removed

comment:43 Changed 21 months ago by thomie

Milestone: 8.0.1

comment:44 Changed 20 months ago by twhitehead

I was just going to open a ticket to report that I had discovered that the compiler can't unbox multiple return values.

That is, anytime you create a non-inlined function (such as a loop) that returns multiple values, you must pay the boxing and unboxing penalty.

My reading of these tickets (#1600, #2289, and #2387) though is that this is already known. I'll just add a simple example here

main :: IO ()
main = case loop2 100 (10,10) of (au,ad) -> print (au - ad)

loop2 :: Int -> (Int,Int) -> (Int,Int)
loop2 n (au,ad) | n > 0     = loop2 (n-1) (au+1,ad-1)
                | otherwise = (au,ad)

yielding (with -ddump-simpl -dsuppress-all -dno-suppress-type-signatures)

Rec {
main_$s$wloop2 [Occ=LoopBreaker]
  :: Int# -> Int# -> Int# -> (# Int, Int #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType <L,U><L,U><L,U>]
main_$s$wloop2 =
  \ (sc_s3vn :: Int#) (sc1_s3vo :: Int#) (sc2_s3vp :: Int#) ->
    case tagToEnum# (># sc_s3vn 0) of _ [Occ=Dead] {
      False -> (# I# sc1_s3vo, I# sc2_s3vp #);
      True ->
        main_$s$wloop2 (-# sc_s3vn 1) (+# sc1_s3vo 1) (-# sc2_s3vp 1)
    }
end Rec }

main1 :: State# RealWorld -> (# State# RealWorld, () #)
[GblId,
 Arity=1,
 Str=DmdType <L,U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 171 0}]
main1 =
  \ (eta_B1 [OS=OneShot] :: State# RealWorld) ->
    case main_$s$wloop2 100 10 10
    of _ [Occ=Dead] { (# ww1_s3uF, ww2_s3uG #) ->
    hPutStr2
      stdout
      (case ww1_s3uF of _ [Occ=Dead] { I# x_a1d4 ->
       case ww2_s3uG of _ [Occ=Dead] { I# y_a1d8 ->
       case $wshowSignedInt 0 (-# x_a1d4 y_a1d8) ([])
       of _ [Occ=Dead] { (# ww5_a1xQ, ww6_a1xR #) ->
       : ww5_a1xQ ww6_a1xR
       }
       }
       })
      True
      eta_B1
    }

Cheers! -Tyson

PS: The perfect performance storm comes when you have a very fast inner loop returning multiple values to an outer loop that evaluates the inner loop many many times.

  • The inner loop can't be inlined because it is a loop.
  • The return values from the inner loop must be boxed and unboxed because GHC doesn't seem to be able to do otherwise when there are multiple return values.
  • The inner loop being very fast means the relative cost of boxing and unboxing is high.
  • The outer loop repeating the inner loop many many times turns the high relative cost into a high absolute cost.

In my specific case I was processing around 10G worth of text records. The inner loop found the next field separator. The outer loop built the data records. It was a nasty hit.

comment:45 Changed 20 months ago by rwbarton

Your example combines two issues. The big one is that you can in fact call loop2 like this:

main = case loop2 100 (undefined,undefined) of (_,_) -> return ()

and it will terminate without evaluating undefined. So GHC can't eagerly evaluate au+1 and ad-1, so there is thunk build-up in the argument pair, that eventually becomes the result pair.

To fix this you could either

  • return a strict pair (data P = P !Int !Int, | otherwise = P au ad), which might be inconvenient, or
  • manually make the return tuple strict (| otherwise = au `seq` ad `seq` (au,ad)).

Then, the second issue, which is the one I think you are interested in, is whether after one of these changes, the unboxed tuple returned by loop2 contains boxed Ints or unboxed Int#s. (It can't very well contain Int#s without either change, since au and ad might be undefined! Or GHC would have to generate multiple specializations of loop2 for different demand patterns, which it doesn't currently do.) Currently, loop2 produces Int#s if its result type is a strict pair and -funbox-small-strict-fields is on (which it is by default), otherwise Ints.

Unboxing the Ints inside the unboxed tuple is indeed nested CPR, which is the subject of this ticket.

comment:46 Changed 20 months ago by twhitehead

Thanks for feedback. I follow that loop2 in itself is not strict in the accumulators au and ad. It would seem GHC considers the context, however, as it did go ahead and create a version strict in au and ad (i.e., main_$s$wloop2 is clearly specialized to take au and ad as unboxed Int#s).

main_$s$wloop2 [Occ=LoopBreaker]
  :: Int# -> Int# -> Int# -> (# Int, Int #)

Possibly the only issue left here is indeed the nested CPR then? At least it is all I can see really different from this single-return-value variant in which GHC wonderfully eliminates all boxing and unboxing.

main :: IO ()
main = case loop1 100 10 of a -> print a

loop1 :: Int -> Int -> Int
loop1 n a | n > 0     = loop1 (n-1) (-a)
          | otherwise = a
Rec {
$wloop1 :: Int# -> Int# -> Int#
$wloop1 =
  \ (ww_s3th :: Int#) (ww1_s3tl :: Int#) ->
    case tagToEnum# (># ww_s3th 0) of _ {
      False -> ww1_s3tl;
      True -> $wloop1 (-# ww_s3th 1) (negateInt# ww1_s3tl)
    }
end Rec }

I'm also not sure I follow you on the only-one-specialization rule. From looking at core dumps it seems GHC does do multiple specializations. For example,

  1. adding an export of loop2 causes a purely non-strict version to be generated and
  2. adding a case loop2 100 (10,undefined) of (au,_) -> ... causes a strict-only-in-the-first-argument version to be generated

all in the same code and in addition to main_$s$wloop2 above. Perhaps things have been relaxed?

comment:47 Changed 20 months ago by simonpj

Yes, nested CPR would be great. See NestedCPR for Joachim's writeup of his work on it; lots of stuff there.

Re one-specialisation: SpecConstr is free to generate lots of specialisations. But the demand analyser does not make many copies of a function, one for each demand pattern.

Note: See TracTickets for help on using tickets.