Changes between Version 6 and Version 7 of LateLamLift


Ignore:
Timestamp:
Jul 18, 2013 6:53:23 PM (9 months ago)
Author:
nfrisby
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • LateLamLift

    v6 v7  
    1 === TODOs === 
     1[[PageOutline]] 
     2 
     3There's three issues with the late lambda lift. 
     4 
     5  * there's a troubling increase in nofib binary sizes due to lambda-lifting the libraries. With !SplitObjs=YES, it's ~7%. With !SplitObjs=NO it's ~3.5%. 
     6 
     7  * there's some significant slowdowns 
     8 
     9  * (blocked by the first two items) the implementation still needs a lot of refactoring/simplification/optimization/clean-up etc 
     10 
     11=== Increase in Binary Size === 
     12 
     13There's a troubling increase in nofib binary sizes due to lambda-lifting the libraries. With !SplitObjs=YES, it's ~7%. With !SplitObjs=NO it's ~3.5%. 
     14 
     15I have some hypotheses. Lambda lifting a function 'f` might swell the .o file if (N * M) is "too big", where N is number of free variables in 'f` and M is the number of applications of 'f`. The transform results in N more arguments to be loaded into registers on each of M calls; previously those arguments were only stored once into the closure when it was allocated. For LNEs, I think the info table may be larger than the proc-point it would otherwise be. 
     16 
     17My measurements don't reveal a very strong correlation for those on the libHSbase modules, so I think something else more significant is going on. I'm still trying to determine it. In particular, I need to see how much new inlining is caused by the lambda lift. From the opposite direction, I'm also trying to narrow my search using !SplitObjs and objtools to better pin down the individual functions that dominant the increase in executables. x2n1 in particular is a tiny program that gets a very large increase (with !SplitObjs=YES), so I'm chasing from there. 
     18 
     19=== Slow downs === 
     20 
     21Here's a couple snippets from my notes about some drastic slowdowns on my Sandy Bridge. 
     22 
     23====  shootout/n-body slows down 50% elapsed ==== 
     24 
     25Slows down 50% at O2! 
     26 
     27In one particular example, a loop involves a call to sqrt. It's out-of-line, so we must stash the live variables on the stack. Before the lambda lift, however, the variables were already on the stack to begin with. After the lift, they are passed in registers, so we have to add code to the loop that pushes and pops the variables around the sqrt call. Unfortunately there's several Double#s, so this puts a lot of pressure on my Sandy Bridge's load-store units. 
     28 
     29Quote from includes/stg/MachRegs.h 
     30 
     31{{{ 
     32    /* ---------------------------------------------------------------------------- 
     33       Caller saves and callee-saves regs. 
     34 
     35       Caller-saves regs have to be saved around C-calls made from STG 
     36       land, so this file defines CALLER_SAVES_<reg> for each <reg> that 
     37       is designated caller-saves in that machine's C calling convention. 
     38 
     39       As it stands, the only registers that are ever marked caller saves 
     40       are the RX, FX, DX and USER registers; as a result, if you 
     41       decide to caller save a system register (e.g. SP, HP, etc), note that 
     42       this code path is completely untested! -- EZY 
     43       -------------------------------------------------------------------------- */ 
     44}}} 
     45 
     46In n-body, the problematic lifts adds 3 RX registers and 4 DX registers to the loop, which all get saved across a C-call to sqrt.  Without lifting, those values are each only used once per iteration and directly from the closure environment, so they never make it to a register. 
     47 
     48This one motivates the "llf6" variant in which we don't lift recursive functions if there's more than 6 free variables. 
     49 
     50There's also slowdowns I'm struggling to explain. 
     51 
     52==== shootout/reverse-complement mode=slow slows down 7% elapsed, 27% runtime ==== 
     53 
     54At O2, adding LLF (the llf6 variant) gives 7% elapsed slowdown, 27% runtime slowdown. This test reads a big file) 
     55 
     56I used Intel's performance hardawre counters to determine that the IPC is detrimentally afffected by the LLF, even those the resulting assembly has fewer instructions.  The LLF'd version executes fewer instructions, but takes more time. 
     57 
     58I suspect it's a caching effect --- just because nothing looks like a big change! I don't have a better reason than that yet… 
     59 
     60I isolated a couple problematic floats. 
     61 
     62{{{ 
     63Run Time 
     64 
     65log-slow-llf6 is the baseline. 
     66log-slow-O2 is no lift. 
     67log-slow-Main-1 changes it to *not* lift one particular function. 
     68log-slow-Main-4 changes it to *not* lift a separate particular function. 
     69 
     70--------------------------------------------------------------- 
     71        Program        log-slow-llf6     log-slow-O2 log-slow-Main-1 log-slow-Main-4  
     72--------------------------------------------------------------- 
     73reverse-complem                 1.20          -26.7%          -15.4%          -27.7%  
     74 
     75Elapsed Time 
     76 
     77--------------------------------------------------------------- 
     78        Program        log-slow-llf6     log-slow-O2 log-slow-Main-1 log-slow-Main-4  
     79--------------------------------------------------------------- 
     80reverse-complem                 4.83           -7.1%           -4.9%           -6.6%  
     81 
     82The Main-1 float is in Main.ca 
     83 
     84      letrec { 
     85        a_s3dY [Occ=LoopBreaker] 
     86          :: [(GHC.Types.Int, GHC.Word.Word8)] 
     87             -> GHC.Prim.State# GHC.Prim.RealWorld 
     88             -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) 
     89        [LclId, Arity=2, Str=DmdType <S,1*U><L,U>, Unf=OtherCon []] 
     90        a_s3dY = 
     91          \ (ds3_s3dD [Occ=Once!] :: [(GHC.Types.Int, GHC.Word.Word8)]) 
     92            (eta_s3dF [Occ=Once*] :: GHC.Prim.State# GHC.Prim.RealWorld) -> 
     93            case ds3_s3dD of _ { 
     94              [] -> (# eta_s3dF, GHC.Tuple.() #); 
     95              : y_s3dI [Occ=Once!] ys_s3dW [Occ=Once] -> 
     96                case y_s3dI of _ { (x_s3dM [Occ=Once!], ds4_s3dP [Occ=Once!]) -> 
     97                case x_s3dM of _ { GHC.Types.I# d_s3dS [Occ=Once] -> 
     98                case ds4_s3dP of _ { GHC.Word.W8# x1_s3dU [Occ=Once] -> 
     99                case GHC.Prim.plusAddr# ds1_s3du d_s3dS of sat_s3sB { __DEFAULT -> 
     100                case GHC.Prim.writeWord8OffAddr# 
     101                       @ GHC.Prim.RealWorld sat_s3sB 0 x1_s3dU eta_s3dF 
     102                of s2_s3dX { __DEFAULT -> 
     103                a_s3dY ys_s3dW s2_s3dX 
     104                } 
     105                } 
     106                } 
     107                } 
     108                } 
     109            }; } in 
     110 
     111        llf_a1_r3ce 
     112          :: GHC.Prim.Addr# 
     113             -> [(GHC.Types.Int, GHC.Word.Word8)] 
     114             -> GHC.Prim.State# GHC.Prim.RealWorld 
     115             -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) 
     116 
     117  It's only entered 27 times, regardless of mode=slow. 
     118 
     119  The Main-4 float is in Main.$wa 
     120 
     121            let-no-escape { 
     122              $w$j_s3eI [Occ=Once*!] 
     123                :: GHC.Prim.State# GHC.Prim.RealWorld 
     124                   -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) 
     125              [LclId, Arity=1, Str=DmdType <L,U>, Unf=OtherCon []] 
     126              $w$j_s3eI = 
     127                \ (w1_s3eD [Occ=Once] :: GHC.Prim.State# GHC.Prim.RealWorld) -> 
     128                  case GHC.Prim.-# ww_s3e6 a_s3el of sat_s3po { __DEFAULT -> 
     129                  case GHC.Prim.-# sat_s3po 1 of sat_s3pq { __DEFAULT -> 
     130                  let { 
     131                    sat_s3pp [Occ=Once] :: GHC.Ptr.Ptr GHC.Word.Word8 
     132                    [LclId, Str=DmdType] 
     133                    sat_s3pp = GHC.Ptr.Ptr @ GHC.Word.Word8 ipv3_s3eu } in 
     134                  case GHC.IO.Handle.Text.$wa4 
     135                         @ GHC.Word.Word8 
     136                         GHC.IO.Handle.FD.stdout 
     137                         sat_s3pp 
     138                         sat_s3pq 
     139                         GHC.Types.True 
     140                         w1_s3eD 
     141                  of _ { (# ipv5_s3eH [Occ=Once], _ #) -> 
     142                  (# ipv5_s3eH, GHC.Tuple.() #) 
     143                  } 
     144                  } 
     145                  } } in 
     146 
     147        llf_$w$j_r3cj 
     148          :: GHC.Prim.Addr# 
     149             -> GHC.Prim.Int# 
     150             -> GHC.Prim.Int# 
     151             -> GHC.Prim.State# GHC.Prim.RealWorld 
     152             -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) 
     153 
     154  s3eI occurs 18 times, but it's only entered three times, regardless of mode=slow. 
     155 
     156  It's lifted counterpart is inlined 9 times, but it's also entered three times, regardless of mode=slow. 
     157}}} 
     158 
     159=== (old) TODOs === 
    2160 
    3161  * LNE catch 22: good to lift (enables simplifications) but also bad to lift (causes a slight slow down)