Opened 23 months ago

Closed 16 months ago

Last modified 16 months ago

#11565 closed feature request (fixed)

Restore code to handle '-fmax-worker-args' flag

Reported by: slyfox
Owned by:
Priority: normal
Milestone: 8.2.1
Component: Compiler
Version: 7.10.3
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by slyfox)

While going through DynFlags I noticed dead code:

https://phabricator.haskell.org/D1727

maxWorkerArgs handling was accidentally lost 3 years ago in a major update, changeset:0831a12ea2fc73c33652eeec1adc79fa19700578

The consensus is to try to put option handling back.

Change History (10)

comment:1 Changed 23 months ago by slyfox

Keywords: newcomer added

comment:2 Changed 23 months ago by slyfox

Description: modified (diff)

comment:3 Changed 22 months ago by thomie

Keywords: newcomer removed
Type: bug → feature request

comment:4 Changed 16 months ago by slyfox

The current motivating example for fixing this is DynFlags itself. I was profiling a perf build of GHC and noticed a function that pushes the whole DynFlags record from stack to heap. This small piece of code emits 10 pages of mov instructions.

https://git.haskell.org/ghc.git/blob/HEAD:/compiler/nativeGen/AsmCodeGen.hs#l1109

cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
cmmExprNative referenceKind expr = do
     dflags <- getDynFlags
     let platform = targetPlatform dflags
         arch = platformArch platform
     case expr of
...
        CmmLit (CmmLabel lbl)
           -> do
                cmmMakeDynamicReference dflags referenceKind lbl
...
       │      cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
       │      cmmExprNative referenceKind expr = do
  0,11 │        cmp    $0x3,%rax
       │      ↑ jb     3ceb930 <cFO7_info+0x8b0>
       │                 -- we must convert block Ids to CLabels here, because we
       │                 -- might have to do the PIC transformation.  Hence we must
       │                 -- not modify BlockIds beyond this point.
       │
       │              CmmLit (CmmLabel lbl)
       │                 -> do
  2,02 │        add    $0x890,%r12
       │        cmp    0x358(%r13),%r12
       │      ↑ ja     3cf456f <cFIc_info+0x7df>
  0,16 │        mov    0x7(%rbx),%rax
  0,59 │        lea    ghc_DynFlags_DynFlags_con_info,%rbx
  0,05 │        mov    %rbx,-0x888(%r12)
  3,41 │18e9:   mov    0x50(%rsp),%rbx
  0,05 │        mov    %rbx,-0x880(%r12)
  0,32 │        mov    0x58(%rsp),%r14
       │        mov    %r14,-0x878(%r12)
       │        mov    0x60(%rsp),%rbx
       │        mov    %rbx,-0x870(%r12)
  0,05 │        mov    0x68(%rsp),%r14
       │        mov    %r14,-0x868(%r12)
       │        mov    0x70(%rsp),%rbx
       │        mov    %rbx,-0x860(%r12)
       │        mov    0x78(%rsp),%r14
  0,11 │        mov    %r14,-0x858(%r12)
  0,05 │        mov    0x80(%rsp),%rbx
       │        mov    %rbx,-0x850(%r12)
  0,05 │        mov    0x88(%rsp),%r14
       │        mov    %r14,-0x848(%r12)
       │        mov    0x90(%rsp),%rbx
       │        mov    %rbx,-0x840(%r12)
  0,05 │        mov    0x98(%rsp),%r14
  0,05 │        mov    %r14,-0x838(%r12)
  0,11 │        mov    0xa0(%rsp),%rbx
       │        mov    %rbx,-0x830(%r12)
       │        mov    0xa8(%rsp),%r14
       │        mov    %r14,-0x828(%r12)
  0,05 │        mov    0xb0(%rsp),%rbx
       │        mov    %rbx,-0x820(%r12)
       │        mov    0xb8(%rsp),%r14
... <a few more pages of it>

On x86_64 the register mapping is: %r12 - heap, %rsp - machine SP.

The suspicion is that the worker/wrapper optimisation moves the huge 140-field DynFlags record from heap to stack even though it is not mutated.
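For readers unfamiliar with the transformation, here is a hand-written sketch of what a worker/wrapper split conceptually does to a function that uses every field of a record. The names Config, volume, workerVolume and wrapperVolume are made up for illustration and are not GHC code; the real pass works on Core and is driven by demand analysis.

```haskell
-- Hand-written illustration of a worker/wrapper split.
-- All names here are hypothetical; this is not code from GHC.

data Config = Config
  { width  :: Int
  , height :: Int
  , depth  :: Int
  }

-- Original function: takes the boxed record and uses every field.
volume :: Config -> Int
volume c = width c * height c * depth c

-- Worker: receives each used field as a separate argument.
workerVolume :: Int -> Int -> Int -> Int
workerVolume w h d = w * h * d

-- Wrapper: unpacks the record once and calls the worker.
-- It is small, so it can be inlined at every call site.
wrapperVolume :: Config -> Int
wrapperVolume (Config w h d) = workerVolume w h d
```

With three fields this is cheap. With a record the size of DynFlags, the same split would give the worker one argument per used field, and every call site has to copy all of them.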

Looking at AsmCodeGen.hs with -ddump-worker-wrapper:

"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O -H64m -g -Wall      -this-unit-id ghc-8.1 -hide-all-packages -i -icompiler/basicTypes -icompiler/cmm -icompiler/codeGen -icompiler/coreSyn -icompiler/deSugar -icompiler/ghci -icompiler/hsSyn -icompiler/iface -icompiler/llvmGen -icompiler/main -icompiler/nativeGen -icompiler/parser -icompiler/prelude -icompiler/profiling -icompiler/rename -icompiler/simplCore -icompiler/simplStg -icompiler/specialise -icompiler/stgSyn -icompiler/stranal -icompiler/typecheck -icompiler/types -icompiler/utils -icompiler/vectorise -icompiler/stage2/build -Icompiler/stage2/build -icompiler/stage2/build/./autogen -Icompiler/stage2/build/./autogen -Icompiler/. -Icompiler/parser -Icompiler/utils -Icompiler/../rts/dist/build -Icompiler/stage2   -optP-DGHCI -optP-include -optPcompiler/stage2/build/./autogen/cabal_macros.h -package-id array-0.5.1.1 -package-id base-4.9.0.0 -package-id binary-0.8.3.0 -package-id bytestring-0.10.8.1 -package-id containers-0.5.7.1 -package-id deepseq-1.4.2.0 -package-id directory-1.2.6.2 -package-id filepath-1.4.1.0 -package-id ghc-boot-8.1 -package-id ghci-8.1 -package-id hoopl-3.10.2.1 -package-id hpc-0.6.0.3 -package-id process-1.4.2.0 -package-id template-haskell-2.11.0.0 -package-id time-1.6.0.1 -package-id transformers-0.5.2.0 -package-id unix-2.7.2.0 -Wall -fno-warn-name-shadowing -this-unit-id ghc -XHaskell2010 -optc-DTHREADED_RTS -DGHCI_TABLES_NEXT_TO_CODE -DSTAGE=2 -Rghc-timing -O2  -no-user-package-db -rtsopts      -Wnoncanonical-monad-instances  -odir compiler/stage2/build -hidir compiler/stage2/build -stubdir compiler/stage2/build   -dynamic-too -c compiler/nativeGen/AsmCodeGen.hs -o compiler/stage2/build/AsmCodeGen.o -dyno compiler/stage2/build/AsmCodeGen.dyn_o -ddump-worker-wrapper

There are a few places with functions of huge arity (140). One of the first places, picked at random: dumpIfSet_dyn accepts a lot of separate arguments.

       case dflags_ab5I of
       { DynFlags ww1_al11 ww2_al12 ww3_al13 ww4_al14 ww5_al15
                  ww6_al16 [Dmd=<L,U(U)>] ww7_al17 ww8_al18 ww9_al19 ww10_al1a
                  ww11_al1b ww12_al1c ww13_al1d ww14_al1e ww15_al1f ww16_al1g
                  ww17_al1h ww18_al1i ww19_al1j ww20_al1k ww21_al1l ww22_al1m
                  ww23_al1n ww24_al1o ww25_al1p ww26_al1q ww27_al1r ww28_al1s
                  ww29_al1t ww30_al1u ww31_al1v ww32_al1w ww33_al1x ww34_al1y
                  ww35_al1z ww36_al1A ww37_al1B ww38_al1C ww39_al1D ww40_al1E
                  ww41_al1F ww42_al1G ww43_al1H ww44_al1I ww45_al1J ww46_al1K
                  ww47_al1L ww48_al1M ww49_al1N ww50_al1O ww51_al1P ww52_al1Q
                  ww53_al1R ww54_al1S ww55_al1T ww56_al1U ww57_al1V ww58_al1W
                  ww59_al1X ww60_al1Y ww61_al1Z ww62_al20 ww63_al21 ww64_al22
                  ww65_al23 ww66_al24 ww67_al25 ww68_al26 ww69_al27 ww70_al28
                  ww71_al29 ww72_al2a ww73_al2b ww74_al2c ww75_al2d ww76_al2e
                  ww77_al2f ww78_al2g ww79_al2h ww80_al2i ww81_al2j ww82_al2k
                  ww83_al2l ww84_al2m [Dmd=<L,U(U)>] ww85_al2n [Dmd=<S,U>] ww86_al2o
                  ww87_al2p ww88_al2q ww89_al2r ww90_al2s ww91_al2t ww92_al2u
                  ww93_al2v ww94_al2w ww95_al2x ww96_al2y ww97_al2z ww98_al2A
                  ww99_al2B ww100_al2C ww101_al2D ww102_al2E ww103_al2F ww104_al2G
                  ww105_al2H ww106_al2I ww107_al2J ww108_al2K ww109_al2L ww110_al2M
                  ww111_al2N ww112_al2O ww113_al2P ww114_al2Q ww115_al2R
                  ww116_al2S [Dmd=<L,U(U)>] ww117_al2T ww118_al2U ww119_al2V
                  ww120_al2W ww121_al2X ww122_al2Y ww123_al2Z ww124_al30 ww125_al31
                  ww126_al32 ww127_al33 ww128_al34 ww129_al35 ww130_al36 ww131_al37
                  ww132_al38 ww133_al39 ww134_al3a ww135_al3b ww136_al3c ->
       ErrUtils.$wdumpIfSet_dyn
         ww1_al11
         ww2_al12
         ww3_al13
         ww4_al14
         ww5_al15
         ww6_al16
         ww7_al17
         ww8_al18
         ww9_al19
         ww10_al1a
         ww11_al1b
         ww12_al1c
         ww13_al1d
         ww14_al1e
         ww15_al1f
         ww16_al1g
         ww17_al1h
         ww18_al1i
         ww19_al1j
         ww20_al1k
...

I'll try to craft a small example that demonstrates the blowup.

comment:5 Changed 16 months ago by slyfox

And dumpIfSet_dyn (used across GHC, including AsmCodeGen) is exported as a 140-ary worker function (along with the 5-ary wrapper):

$ inplace/bin/ghc-stage1 --show-iface compiler/stage2/build/ErrUtils.dyn_hi

...
31b85108354ff085ace45a61abe9a220
  $wdumpIfSet_dyn ::
    GhcMode
    -> GhcLink
    -> HscTarget
    -> Settings
    -> SigOf
    -> Int
    -> Int
    -> Int
    -> Int
...
    -> SDoc
    -> State# RealWorld
    -> (# State# RealWorld, () #)
  {- Arity: 140,
     Strictness: <L,U><L,U><L,U><L,U><L,U><L,U(U)><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L
,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U
><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><
L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U(U)><S,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U>
<L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U(U)><L,U><L,U><L,U><L,U><L,U><L,U
><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><L,U><S,U><L,1*U><L,U><S,U>,
     Inline: [0] -}
...
  dumpIfSet_dyn :: DynFlags -> DumpFlag -> String -> SDoc -> IO ()
  {- Arity: 5,
     Strictness: <S(LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLSLLLLLLLLLLLLLLLLLLLLLL
LLLLLLLLLLLLLLLLLLLLLLLLLLLLL),1*U(U,U,U,U,U,U(U),U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U
,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U(U),U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U
,U,U,U,U,U,U,U,U,U(U),U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U,U)><S,U><L,1*U><L,U><S,U>,
     Unfolding: InlineRule (0, True, True)
                dumpIfSet_dyn1
                  `cast`
                (<DynFlags>_R
                 ->_R <DumpFlag>_R
                 ->_R <String>_R
                 ->_R <SDoc>_R
                 ->_R Sym (N:IO[0] <()>_R)) -}

comment:6 Changed 16 months ago by slyfox

Here comes minimal example for one direction: heap to stack.

The trigger is a function that uses many record fields: show in this case. In the case of DynFlags, the full set (or a large subset) of fields is used across various GHC subsystems.

-- A.hs
module A(D) where

-- like DynFlags in GHC
data D = D { f_00, f_01, f_02, f_03, f_04
           , f_10, f_11, f_12, f_13, f_14
           , f_20, f_21, f_22, f_23, f_24
           , f_30, f_31, f_32, f_33, f_34
           , f_40, f_41, f_42, f_43, f_44
           , f_50, f_51, f_52, f_53, f_54

           , g_00, g_01, g_02, g_03, g_04
           , g_10, g_11, g_12, g_13, g_14
           , g_20, g_21, g_22, g_23, g_24
           , g_30, g_31, g_32, g_33, g_34
           , g_40, g_41, g_42, g_43, g_44
           , g_50, g_51, g_52, g_53, g_54

           , h_00, h_01, h_02, h_03, h_04
           , h_10, h_11, h_12, h_13, h_14
           , h_20, h_21, h_22, h_23, h_24
           , h_30, h_31, h_32, h_33, h_34
           , h_40, h_41, h_42, h_43, h_44
           , h_50, h_51, h_52, h_53, h_54

           , i_00, i_01, i_02, i_03, i_04
           , i_10, i_11, i_12, i_13, i_14
           , i_20, i_21, i_22, i_23, i_24
           , i_30, i_31, i_32, i_33, i_34
           , i_40, i_41, i_42, i_43, i_44
           , i_50, i_51, i_52, i_53, i_54 :: Int

           } deriving Show
-- B.hs
module B (tiny_foo) where

import qualified A

tiny_foo :: A.D -> Bool
tiny_foo d = null (show d)

Let's look at the size of module B at -O0 and -O1 while A is compiled with -O2:

-O0, no unboxing happens.

$ ghc -c -O2 A.hs && ghc -c -O0 B.hs -ddump-stg -fforce-recomp
compilation IS NOT required

==================== STG syntax: ====================
$trModule1_r3JD :: GHC.Types.TrName
[GblId, Caf=NoCafRefs, Str=DmdType, Unf=OtherCon []] =
    NO_CCS GHC.Types.TrNameS! ["main"#];

$trModule2_r3P0 :: GHC.Types.TrName
[GblId, Caf=NoCafRefs, Str=DmdType, Unf=OtherCon []] =
    NO_CCS GHC.Types.TrNameS! ["B"#];

B.$trModule :: GHC.Types.Module
[GblId, Caf=NoCafRefs, Str=DmdType, Unf=OtherCon []] =
    NO_CCS GHC.Types.Module! [$trModule1_r3JD $trModule2_r3P0];

B.tiny_foo :: A.D -> GHC.Types.Bool
[GblId, Arity=1, Str=DmdType, Unf=OtherCon []] =
    \r srt:SRT:[r30 :-> A.$fShowD,
                rAj :-> Data.Foldable.$fFoldable[]] [d_s3P4]
        let {
          sat_s3P5 [Occ=Once] :: [GHC.Types.Char]
          [LclId, Str=DmdType] =
              \u srt:SRT:[r30 :-> A.$fShowD] [] GHC.Show.show A.$fShowD d_s3P4;
        } in  Data.Foldable.null Data.Foldable.$fFoldable[] sat_s3P5;

-O1, unboxing happened:

$ ghc -c -O2 A.hs && ghc -c -O1 B.hs -ddump-stg -fforce-recomp
compilation IS NOT required

==================== STG syntax: ====================
B.$trModule2 :: GHC.Types.TrName
[GblId, Caf=NoCafRefs, Str=DmdType m1, Unf=OtherCon []] =
    NO_CCS GHC.Types.TrNameS! ["main"#];

B.$trModule1 :: GHC.Types.TrName
[GblId, Caf=NoCafRefs, Str=DmdType m1, Unf=OtherCon []] =
    NO_CCS GHC.Types.TrNameS! ["B"#];

B.$trModule :: GHC.Types.Module
[GblId, Caf=NoCafRefs, Str=DmdType m, Unf=OtherCon []] =
    NO_CCS GHC.Types.Module! [B.$trModule2 B.$trModule1];

B.$wtiny_foo [InlPrag=[0]]
  :: GHC.Types.Int
     -> GHC.Types.Int
     -> GHC.Types.Int
...
     -> GHC.Types.Int
     -> GHC.Types.Bool
[GblId,
 Arity=120,
 Str=DmdType <L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)><L,1*U(U)>,
 Unf=OtherCon []] =
    \r srt:SRT:[r3a :-> A.$w$cshowsPrec] [ww_s47S
                                          ww1_s47T
                                          ww2_s47U
                                          ww3_s47V
                                          ww4_s47W
                                          ww5_s47X
                                          ww6_s47Y
...
                                          ww115_s49J
                                          ww116_s49K
                                          ww117_s49L
                                          ww118_s49M
                                          ww119_s49N]
        case
            A.$w$cshowsPrec
                0#
                ww_s47S
                ww1_s47T
                ww2_s47U
                ww3_s47V
                ww4_s47W
                ww5_s47X
                ww6_s47Y
                ww7_s47Z
                ww8_s480
                ww9_s481
                ww10_s482
                ww11_s483
                ww12_s484
                ww13_s485
...
                ww117_s49L
                ww118_s49M
                ww119_s49N
                GHC.Types.[]
        of
        _ [Occ=Dead]
        { [] -> GHC.Types.True [];
          : _ [Occ=Dead] _ [Occ=Dead] -> GHC.Types.False [];
        };

B.tiny_foo [InlPrag=INLINE[0]] :: A.D -> GHC.Types.Bool
[GblId,
 Arity=1,
 Str=DmdType <S,1*U(1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U),1*U(U))>,
 Unf=OtherCon []] =
    \r srt:SRT:[r47O :-> B.$wtiny_foo] [w_s49R]
        case w_s49R of _ [Occ=Dead] {
          A.D ww1_s49T [Occ=Once]
              ww2_s49U [Occ=Once]
              ww3_s49V [Occ=Once]
              ww4_s49W [Occ=Once]
              ww5_s49X [Occ=Once]
...
              ww118_s4bM [Occ=Once]
              ww119_s4bN [Occ=Once]
              ww120_s4bO [Occ=Once] ->
              B.$wtiny_foo
                  ww1_s49T
                  ww2_s49U
                  ww3_s49V
                  ww4_s49W
                  ww5_s49X
                  ww6_s49Y
                  ww7_s49Z
                  ww8_s4a0
                  ww9_s4a1
                  ww10_s4a2
                  ww11_s4a3
                  ww12_s4a4
                  ww13_s4a5
                  ww14_s4a6
                  ww15_s4a7
                  ww16_s4a8

This causes a lot of 'mov' instructions from heap to stack to be generated at each callsite. In this case it's 9 pages:

$ ghc -c -O2 A.hs && ghc -c -O1 B.hs -ddump-asm -fforce-recomp
...
        movq %rbx,856(%rbp)
        movq 880(%rbp),%rbx
        movq %rbx,864(%rbp)
        movq 888(%rbp),%rbx
        movq %rbx,872(%rbp)
        movq 896(%rbp),%rbx
        movq %rbx,880(%rbp)
        movq 904(%rbp),%rbx
        movq %rbx,888(%rbp)
        movq %rax,896(%rbp)
        movq $GHC.Types.[]_closure+1,904(%rbp)
        addq $-24,%rbp
        jmp A.$w$cshowsPrec_info

comment:7 Changed 16 months ago by simonpj

Yes, it's bad for worker/wrapper to generate a worker function with a vast number of arguments. Some limit in the worker/wrapper generator would be a Good Thing.

Should not be too hard. Unlike the old days, we don't need to trim the strictness signature. In the old days, the strictness signature was used by importing modules to generate an appropriate wrapper; but now the wrapper is conveyed by an ordinary inlining. So there is just one place the choice is made, namely when generating the worker/wrapper split.

I can advise if someone wants to try this

Simon
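To make the idea of "some limit in the worker/wrapper generator" concrete, here is a toy model of one plausible shape for such a check. It is only a sketch: Passing, decide, arityOf and workerArity are hypothetical names, the per-argument field-count rule is an assumption, and none of this is taken from GHC's sources.

```haskell
-- Toy model of an argument-count limit in a worker/wrapper pass.
-- All names are hypothetical; this is not GHC's implementation.

-- How one record argument is passed to the worker.
data Passing
  = Boxed        -- keep the record as a single pointer
  | Unboxed Int  -- pass this many fields as separate arguments
  deriving (Eq, Show)

-- Unbox a record argument only if its field count is within the limit.
decide :: Int -> Int -> Passing
decide maxWorkerArgs fieldCount
  | fieldCount <= maxWorkerArgs = Unboxed fieldCount
  | otherwise                   = Boxed

-- Number of worker arguments contributed by one original argument.
arityOf :: Passing -> Int
arityOf Boxed       = 1
arityOf (Unboxed n) = n

-- Resulting worker arity, given the field count of each argument.
workerArity :: Int -> [Int] -> Int
workerArity limit = sum . map (arityOf . decide limit)
```

For a function shaped like the dumpIfSet_dyn worker (one 136-field record plus four small arguments), `workerArity 10 [136, 1, 1, 1]` keeps the big record boxed, while a very large limit reproduces the blowup seen in the dumps above.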

comment:8 Changed 16 months ago by Sergei Trofimovich <siarheit@…>

In a48de37/ghc:

restore -fmax-worker-args handling (Trac #11565)

maxWorkerArgs handling was accidentally lost 3 years ago
in a major update of demand analysis
    commit 0831a12ea2fc73c33652eeec1adc79fa19700578

Old regression is noticeable as:
- code bloat (requires stack reshuffling)
- compilation slowdown (more code to optimise/generate)
- and increased heap usage (DynFlags unboxing/reboxing?)

On a simple compile benchmark this change causes heap
allocation drop from 70G down to 67G (ghc perf build).

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Reviewers: simonpj, ezyang, goldfire, austin, bgamari

Reviewed By: simonpj, ezyang

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2503

GHC Trac Issues: #11565
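With the flag handled again, it can also be set per module through an OPTIONS_GHC pragma. A minimal sketch follows; the value 10 and the names Small and render are illustrative assumptions, not recommendations from this ticket.

```haskell
{-# OPTIONS_GHC -fmax-worker-args=10 #-}
-- Per-module use of -fmax-worker-args.
-- The limit value and all names below are illustrative only.

data Small = Small { fieldA, fieldB, fieldC :: Int } deriving Show

-- Same shape as tiny_foo from comment:6, on a much smaller record.
render :: Small -> Bool
render s = null (show s)
```

Compiling such a module with optimisation and -ddump-stg, as in comment:6, is one way to check whether a worker with many arguments is still generated.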

comment:9 Changed 16 months ago by slyfox

Milestone: 8.2.1
Resolution: fixed
Status: new → closed
Type of failure: None/Unknown → Runtime performance bug

comment:10 Changed 16 months ago by Sergei Trofimovich <siarheit@…>

In f93c363f/ghc:

extend '-fmax-worker-args' limit to specialiser (Trac #11565)

It's a complementary change to
    a48de37dcca98e7d477040b0ed298bcd1b3ab303
    restore -fmax-worker-args handling (Trac #11565)

I don't have a small example but I've noticed another
discrepancy when I was profiling GHC for performance

    cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr

was specialised by 'spec_one' down to a function with arity 159.
As a result 'perf record' pointed at it as the slowest
function in the whole ghc library.

I've extended -fmax-worker-args effect to 'spec_one'
as it does the same worker/wrapper split to push
arguments to the heap.

The change decreases heap usage on a synth.bash benchmark
(Trac #9221) from 67G down to 64G (-4%). Benchmark runtime
decreased from 14.5 s down to 14.s (-7%).

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Reviewers: ezyang, simonpj, austin, goldfire, bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2507

GHC Trac Issues: #11565
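As a rough illustration of the kind of specialisation described in this commit message, here is a hand-written example of specialising a function on a constructor argument so that the fields travel as separate arguments. Pair, sumTo and sumToSpec are hypothetical names, and the sketch does not claim to mirror what 'spec_one' does internally.

```haskell
-- Hand-written illustration of specialising on a constructor argument.
-- All names are hypothetical; this is not code from GHC.

data Pair = Pair Int Int

-- Original loop: builds a fresh Pair on every iteration.
sumTo :: Pair -> Int
sumTo (Pair acc n)
  | n <= 0    = acc
  | otherwise = sumTo (Pair (acc + n) (n - 1))

-- Specialised copy for calls of the form `sumTo (Pair x y)`:
-- the two fields become separate arguments, so no Pair is allocated.
sumToSpec :: Int -> Int -> Int
sumToSpec acc n
  | n <= 0    = acc
  | otherwise = sumToSpec (acc + n) (n - 1)
```

With two fields the specialised version is a clear win. When the constructor has well over a hundred fields, the specialised function ends up with an arity like the 159 reported above, which is why the limit is applied here as well.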