Opened 6 months ago

Last modified 4 days ago

#14880 new bug

GHC panic: updateRole

Reported by: RyanGlScott Owned by: goldfire
Priority: normal Milestone: 8.8.1
Component: Compiler (Type checker) Version: 8.2.2
Keywords: TypeInType Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time crash or panic Test Case:
Blocked By: Blocking:
Related Tickets: #15076 Differential Rev(s): Phab:D4769
Wiki Page:

Description (last modified by RyanGlScott)

The following program panics on GHC 8.0.2, 8.2.2, 8.4.1, and HEAD:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeInType #-}
{-# LANGUAGE TypeOperators #-}
module Bug where

import Data.Kind
import Data.Type.Equality ((:~:))

type SameKind (a :: k) (b :: k) = (() :: Constraint)

data TyFun :: Type -> Type -> Type
type a ~> b = TyFun a b -> Type
infixr 0 ~>
type family Apply (f :: k1 ~> k2) (x :: k1) :: k2

data WhyCongSym1 (x :: Type) :: forall (a :: x)
                                       (y :: Type)
                                       (z :: x).
                                Type ~> (x ~> y) ~> x ~> x ~> (a :~: z) ~> Type

data WhyCongSym0 :: forall (x :: Type)
                           (a :: x)
                           (y :: Type)
                           (z :: x).
                    Type ~> Type ~> (x ~> y) ~> x ~> x ~> (a :~: z) ~> Type
  where
    WhyCongSym0KindInference :: forall x arg.
                                SameKind (Apply WhyCongSym0 arg) (WhyCongSym1 arg) =>
                                WhyCongSym0 x
$ /opt/ghc/8.2.2/bin/ghci Bug.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/rgscott/.ghci
[1 of 1] Compiling Bug              ( Bug.hs, interpreted )
ghc: panic! (the 'impossible' happened)
  (GHC version 8.2.2 for x86_64-unknown-linux):
        updateRole
  WhyCongSym0
  arg_a1A6[sk:1]
  [a1A5 :-> 4, a2Cy :-> 0, a2Cz :-> 1, a2CA :-> 2, a2CB :-> 3]
  Call stack:
      CallStack (from HasCallStack):
        prettyCurrentCallStack, called at compiler/utils/Outputable.hs:1133:58 in ghc:Outputable
        callStackDoc, called at compiler/utils/Outputable.hs:1137:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcTyDecls.hs:656:23 in ghc:TcTyDecls

Attachments (3)

logs.tar.gz (22.5 KB) - added by tdammers 5 weeks ago.
Performance logs for comparison
intmapbench.hs (721 bytes) - added by tdammers 4 weeks ago.
IntMap mini-benchmark
logs2.tar.gz (2.4 KB) - added by tdammers 2 weeks ago.

Download all attachments as: .zip

Change History (87)

comment:1 Changed 6 months ago by RyanGlScott

Description: modified (diff)

comment:2 Changed 6 months ago by RyanGlScott

Here's as minimal of an example as I can conjure up:

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

data Bar :: Type -> Type where
    MkBar :: forall x arg.
             -- Commenting out the line below makes the issue go away
             Foo arg ~ Foo arg =>
             Bar x
$ /opt/ghc/8.2.2/bin/ghci Bug.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/rgscott/.ghci
[1 of 1] Compiling Bug              ( Bug.hs, interpreted )
ghc: panic! (the 'impossible' happened)
  (GHC version 8.2.2 for x86_64-unknown-linux):
        updateRole
  Bar
  arg_a1vT[sk:1]
  [a1vS :-> 0]
  Call stack:
      CallStack (from HasCallStack):
        prettyCurrentCallStack, called at compiler/utils/Outputable.hs:1133:58 in ghc:Outputable
        callStackDoc, called at compiler/utils/Outputable.hs:1137:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcTyDecls.hs:656:23 in ghc:TcTyDecls

comment:3 Changed 6 months ago by simonpj

Even simpler

 data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

 data Bar :: Type -> Type where
     MkBar :: forall x arg.
              -- Commenting out the line below makes the issue go away
              Proxy (Foo arg) ->
              Bar x

The panic is caused because the existentials for MkBar are messed up;

{-  MkBar
    univ_tvs:   (x :: *)
    ex_tvs:     (a :: arg_aZv)
                (arg_XZx :: *)
    arg ty:     Proxy @arg_XZx a
    result ty:  Foo @arg_XZx a

Note the confusion of two arg variables. Ran out of time at that point.

comment:4 Changed 6 months ago by simonpj

Keywords: TypeInType added; Roles removed
Owner: set to goldfire

The trouble is that the type of MkBar should really be

MkBar :: forall (x:Type) (arg:Type) {a:arg}.
         Proxy @(Proxy @arg a -> Type)
               (Foo arg @a)
      -> Bar x

where I have put in all the kind applications. The trouble is that MkBar has an implicit forall's variable a, whose kind mentions an explicit type variable arg. So the implicit argument must appear later in the telescope. But tcConDecl (the ConDeclGADT case) doesn't allow that:

             tkv_bndrs      = mkTyVarBinders Inferred  tkvs'
             user_tv_bndrs  = mkTyVarBinders Specified user_tvs'
             all_user_bndrs = tkv_bndrs ++ user_tv_bndrs

Notice that the inferred ones always come first. But here they can't!

Solution: do a topo-sort of the tyvars that is allowed to interleave the Inferred and Specified ones.

But is that the only place? If we try something like this with a function type signature thus

f :: forall (v :: *) (a :: Proxy (k :: v)). Proxy a
f = f

we get the error message

T14880.hs:24:6: error:
    * The kind of variable `k', namely `v',
      depends on variable `v' from an inner scope
      Perhaps bind `k' sometime after binding `v'
      NB: Implicitly-bound variables always come before other ones.
    * In the type signature:
        f :: forall (v :: *) (a :: Proxy (k :: v)). Proxy a

But is there anything really wrong with this signature? It we topo-sorted the type variables we'd be fine.

There are other places (exp in TcTyClsDecls) where we seem to put all the inferred variables around the outside. I don't know how to be sure in which, if any, of these case we have a bug. Maybe we need more topo-sorts?

Amnother oddd thing about this ticket is the data decl for Foo:

data Foo (v :: Type) :: forall (a :: v). Proxy a -> Type

That is a strange kind signature. I don't expect to see foralls to the right of the :: in such a decl. So, more questions

  • Is it valuable to permit TyCons whose kinds are not in prenex form (i.e. all foralls at the front)?

If so, we should document it.

Meanwhile I'm not going to fix this because it may all come out in the wash of Richards's upcoming kind-inference patch.

It's nothing to do with roles!

comment:5 Changed 6 months ago by goldfire

The updateRole panic triggers on various invalid-tycon problems. I'm not surprised that roles aren't involved.

Yes, a forall to the right of the :: in a data declaration makes fine sense. For example, here is one way to define heterogeneous equality:

data (:~~:) :: forall k1. k1 -> forall k2. k2 -> Type where
  HRefl :: a :~~: a

This definition is actually a touch more general than one where k1 and k2 are quantified prenex, as the "Practical Type Inference" paper explains.

I don't see a need to document this specially.

As for the toposorting suggestions: I've considered it a design principle that implicit quantification comes before all explicit quantification. Of course, we can't panic when something goes wrong here, but I think this design is a good one, just to have some rules that users can rely on.

My existing patch still panics on this case, but I agree that no one should spend time on this until that patch lands.

comment:6 Changed 5 months ago by goldfire

With the Foo in comment:3, this also panics:

quux :: forall x arg. Proxy (Foo arg) -> Maybe x
quux = undefined

comment:7 in reply to:  6 Changed 4 months ago by RyanGlScott

Replying to goldfire:

With the Foo in comment:3, this also panics:

It does? I must be doing something wrong, since this appears to compile for me:

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

quux :: forall x arg. Proxy (Foo arg) -> Maybe x
quux = undefined

comment:8 Changed 4 months ago by simonpj

My existing patch still panics on this case, but I agree that no one should spend time on this until that patch lands.

Richard, now that the patch has landed, and this stuff is in your head, might you look at this? I doubt it's hard.

comment:9 Changed 4 months ago by RyanGlScott

I think that this ticket and #15076 share a symptom in common. This claim is based on the fact that slightly tweaking the program in comment:6 / comment:7 :

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

quux :: forall arg. Proxy (Foo arg) -> ()
quux (_ :: _) = ()

Yields:

$ ~/Software/ghc/inplace/bin/ghc-stage2 Bug.hs
[1 of 1] Compiling Bug              ( Bug.hs, Bug.o )

Bug.hs:12:12: error:ghc-stage2: panic! (the 'impossible' happened)
  (GHC version 8.5.20180420 for x86_64-unknown-linux):
        No skolem info:
  [arg_aZr[sk:1]]
  Call stack:
      CallStack (from HasCallStack):
        callStackDoc, called at compiler/utils/Outputable.hs:1162:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcErrors.hs:3224:5 in ghc:TcErrors

Which is the same panic as in #15076. Plus, if you run this program with -ddump-tc-trace, you see:

reportUnsolved (after zonking):
  Free tyvars: arg_aZr[sk:1]
  Tidy env: ([ESf71 :-> 1], [aZr :-> arg_aZr[sk:1]])
  Wanted: WC {wc_impl =
                Implic {
                  TcLevel = 2
                  Skolems = (a_a1p8[sk:2] :: arg_aZr[sk:1]) arg_a1p9[sk:2]
                  No-eqs = True
                  Status = Unsolved
                  Given =
                  Wanted =
                    WC {wc_simple =
                          [D] _ {0}:: Proxy
                                        (Foo arg_a1p9[sk:2] a_a1p8[sk:2]) (CHoleCan: TypeHole(_))}
                  Binds = EvBindsVar<a1pg>
                  Needed inner = []
                  Needed outer = []
                  the type signature for:
                    quux :: forall (a :: arg_aZr[sk:1]) arg. Proxy (Foo arg a) -> () }}

Just like in #15076, arg_aZr is not bound in any implication. On the other hand, there is another type variable, arg_a1p9, that is suspiciously similar. Moreover, the type signature it gives for quux:

quux :: forall (a :: arg_aZr[sk:1]) arg. Proxy (Foo arg a) -> ()

Seems to have two different copies of arg! This is especially interesting in light of comment:3, where simonpj discovered that the existentially quantified tyvars in MkBar were screwed up, leading to two copies of arg.

comment:10 Changed 4 months ago by simonpj

The trouble is that MkBar has an implicit forall's variable a, whose kind mentions an explicit type variable arg.

Richard and I discussed this on Monday. Our conclusions

  • Inferred variables should always precede Specified ones. That is, do not top-sort
  • Idea: simply refrain from quantifying any inferred variables that mention specified ones. They'll get defaulted to Any which is probably fine. This refraining can readily be done in candidatesQTyVarsOfType.
  • Also came up: close over kinds once in tyCoVarsOfType instead of at every leaf. This is not just an efficiency issue: consider
    tyCoVarsOfType (forall a. b::a)
    

Richard is on the job

comment:11 Changed 4 months ago by simonpj

Trac #15076 is another example of the same phenomenon.

comment:12 Changed 4 months ago by simonpj

And probably #14040 too.

comment:13 Changed 4 months ago by goldfire

Yes, I have a fix here, but it causes a failure in, e.g., indexed-types/should_compile/T12369. Haven't had time to investigate.

comment:14 Changed 4 months ago by simonpj

Maybe #14887 too!

comment:15 Changed 3 months ago by simonpj

Blocking: 15170 added

comment:16 Changed 3 months ago by simonpj

Blocking: 15170 removed

I believe that #15170 is also blocked on this.

comment:17 Changed 3 months ago by simonpj

In a call today Richard and Simon decided:

  • Kill decideKindGeneralisationPlan; instead always generalise. (Explain why we don't need to worry about local generalisation as we do with terms.)
  • And hence kill tc_hs_sig_type, leaving tc_hs_sig_type_and_gen
  • tc_hs_sig_type_and_gen should not call solveEqualities (which fails if there are unsolved equalities)
  • Instead call solveLocalEqualities, gather unsolved equalities, and do not quantify over their free vars. NB: tcImplicitTKBndrs already calls solveLocalEqualities, and must do so. Needs a Note to explain why: it's so that we can top-sort the bound variables
  • Promote all the free vars of the unsolved constraints. Principle: all free vars should either be generalised or promoted.
  • Get rid of zonkPromote and friends altogether
Last edited 3 months ago by simonpj (previous) (diff)

comment:18 Changed 3 months ago by goldfire

While I agree with comment:17, that doesn't have anything to do with this ticket...

comment:19 Changed 2 months ago by goldfire

Differential Rev(s): Phab:D4769

Since my last update, the patch in Phab:D4769 works but has performance trouble. I've asked Ben & friends for help.

comment:20 Changed 6 weeks ago by tdammers

Could you maybe provide a quick summary of the exact performance trouble? I'm running some tests as we speak, but it would be easier for me if you could hint at some tests that produce particularly bad behavior, or maybe an example program on which GHC peforms badly after this patch.

comment:21 Changed 6 weeks ago by goldfire

To be honest, I don't recall the details any more. But the problems show up in the performance tests, so running a validation should show them up.

comment:22 Changed 6 weeks ago by tdammers

OK, so I tried to validate, but I'm getting core lint errors while compiling stage2:

"inplace/bin/ghc-stage1" -this-unit-id rts -shared -dynamic -dynload deploy -no-auto-link-packages -Lrts/dist/build -lffi -optl-Wl,-rpath -optl-Wl,'$ORIGIN' -optl-Wl,-zorigin `cat rts/dist/libs.depend` rts/dist/build/Arena.l_dyn_o rts/dist/build/Stable.l_dyn_o rts/dist/build/Linker.l_dyn_o rts/dist/build/RtsFlags.l_dyn_o rts/dist/build/Sparks.l_dyn_o rts/dist/build/Profiling.l_dyn_o rts/dist/build/RtsStartup.l_dyn_o rts/dist/build/FileLock.l_dyn_o rts/dist/build/StgPrimFloat.l_dyn_o rts/dist/build/Pool.l_dyn_o rts/dist/build/StaticPtrTable.l_dyn_o rts/dist/build/Disassembler.l_dyn_o rts/dist/build/TopHandler.l_dyn_o rts/dist/build/WSDeque.l_dyn_o rts/dist/build/RetainerSet.l_dyn_o rts/dist/build/ProfilerReport.l_dyn_o rts/dist/build/Schedule.l_dyn_o rts/dist/build/RtsAPI.l_dyn_o rts/dist/build/Adjustor.l_dyn_o rts/dist/build/Globals.l_dyn_o rts/dist/build/Capability.l_dyn_o rts/dist/build/xxhash.l_dyn_o rts/dist/build/Messages.l_dyn_o rts/dist/build/Interpreter.l_dyn_o rts/dist/build/Hpc.l_dyn_o rts/dist/build/Libdw.l_dyn_o rts/dist/build/RetainerProfile.l_dyn_o rts/dist/build/ClosureFlags.l_dyn_o rts/dist/build/LdvProfile.l_dyn_o rts/dist/build/StgCRun.l_dyn_o rts/dist/build/Task.l_dyn_o rts/dist/build/ProfilerReportJson.l_dyn_o rts/dist/build/RtsUtils.l_dyn_o rts/dist/build/Weak.l_dyn_o rts/dist/build/Stats.l_dyn_o rts/dist/build/Trace.l_dyn_o rts/dist/build/Hash.l_dyn_o rts/dist/build/RaiseAsync.l_dyn_o rts/dist/build/ProfHeap.l_dyn_o rts/dist/build/RtsMessages.l_dyn_o rts/dist/build/RtsSymbolInfo.l_dyn_o rts/dist/build/RtsDllMain.l_dyn_o rts/dist/build/Timer.l_dyn_o rts/dist/build/HsFFI.l_dyn_o rts/dist/build/OldARMAtomic.l_dyn_o rts/dist/build/STM.l_dyn_o rts/dist/build/ThreadLabels.l_dyn_o rts/dist/build/Proftimer.l_dyn_o rts/dist/build/RtsSymbols.l_dyn_o rts/dist/build/PathUtils.l_dyn_o rts/dist/build/Threads.l_dyn_o rts/dist/build/RtsMain.l_dyn_o rts/dist/build/CheckUnload.l_dyn_o rts/dist/build/Inlines.l_dyn_o rts/dist/build/Ticky.l_dyn_o rts/dist/build/LibdwPool.l_dyn_o rts/dist/build/ThreadPaused.l_dyn_o rts/dist/build/Printer.l_dyn_o rts/dist/build/fs.l_dyn_o rts/dist/build/hooks/OutOfHeap.l_dyn_o rts/dist/build/hooks/MallocFail.l_dyn_o rts/dist/build/hooks/FlagDefaults.l_dyn_o rts/dist/build/hooks/OnExit.l_dyn_o rts/dist/build/hooks/LongGCSync.l_dyn_o rts/dist/build/hooks/StackOverflow.l_dyn_o rts/dist/build/sm/MBlock.l_dyn_o rts/dist/build/sm/Scav.l_dyn_o rts/dist/build/sm/GCUtils.l_dyn_o rts/dist/build/sm/CNF.l_dyn_o rts/dist/build/sm/Compact.l_dyn_o rts/dist/build/sm/Sweep.l_dyn_o rts/dist/build/sm/GCAux.l_dyn_o rts/dist/build/sm/MarkWeak.l_dyn_o rts/dist/build/sm/BlockAlloc.l_dyn_o rts/dist/build/sm/Evac_thr.l_dyn_o rts/dist/build/sm/GC.l_dyn_o rts/dist/build/sm/Sanity.l_dyn_o rts/dist/build/sm/Evac.l_dyn_o rts/dist/build/sm/Storage.l_dyn_o rts/dist/build/sm/Scav_thr.l_dyn_o rts/dist/build/eventlog/EventLogWriter.l_dyn_o rts/dist/build/eventlog/EventLog.l_dyn_o rts/dist/build/linker/elf_reloc_aarch64.l_dyn_o rts/dist/build/linker/MachO.l_dyn_o rts/dist/build/linker/elf_plt.l_dyn_o rts/dist/build/linker/elf_plt_aarch64.l_dyn_o rts/dist/build/linker/M32Alloc.l_dyn_o rts/dist/build/linker/elf_got.l_dyn_o rts/dist/build/linker/Elf.l_dyn_o rts/dist/build/linker/CacheFlush.l_dyn_o rts/dist/build/linker/SymbolExtras.l_dyn_o rts/dist/build/linker/elf_plt_arm.l_dyn_o rts/dist/build/linker/LoadArchive.l_dyn_o rts/dist/build/linker/PEi386.l_dyn_o rts/dist/build/linker/elf_reloc.l_dyn_o rts/dist/build/linker/elf_util.l_dyn_o rts/dist/build/posix/OSMem.l_dyn_o rts/dist/build/posix/GetTime.l_dyn_o rts/dist/build/posix/OSThreads.l_dyn_o rts/dist/build/posix/Itimer.l_dyn_o rts/dist/build/posix/TTY.l_dyn_o rts/dist/build/posix/Signals.l_dyn_o rts/dist/build/posix/Select.l_dyn_o rts/dist/build/posix/GetEnv.l_dyn_o   rts/dist/build/StgStartup.l_dyn_o rts/dist/build/PrimOps.l_dyn_o rts/dist/build/HeapStackCheck.l_dyn_o rts/dist/build/Updates.l_dyn_o rts/dist/build/Exception.l_dyn_o rts/dist/build/StgMiscClosures.l_dyn_o rts/dist/build/Apply.l_dyn_o rts/dist/build/Compact.l_dyn_o rts/dist/build/StgStdThunks.l_dyn_o rts/dist/build/AutoApply.l_dyn_o  -fPIC -dynamic -eventlog  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror -Iincludes -Iincludes/dist -Iincludes/dist-derivedconstants/header -Iincludes/dist-ghcconstants/header -Irts -Irts/dist/build -DCOMPILING_RTS -DFS_NAMESPACE=rts -this-unit-id rts -dcmm-lint      -i -irts -irts/dist/build -Irts/dist/build -irts/dist/build/./autogen -Irts/dist/build/./autogen            -O2 -Wcpp-undef    -Wnoncanonical-monad-instances   -o rts/dist/build/libHSrts_l-ghc8.5.20180710.so
"rm" -f rts/dist/build/libHSrts_thr_l-ghc8.5.20180710.so
"inplace/bin/ghc-stage1" -this-unit-id rts -shared -dynamic -dynload deploy -no-auto-link-packages -Lrts/dist/build -lffi -optl-Wl,-rpath -optl-Wl,'$ORIGIN' -optl-Wl,-zorigin `cat rts/dist/libs.depend` rts/dist/build/Arena.thr_l_dyn_o rts/dist/build/Stable.thr_l_dyn_o rts/dist/build/Linker.thr_l_dyn_o rts/dist/build/RtsFlags.thr_l_dyn_o rts/dist/build/Sparks.thr_l_dyn_o rts/dist/build/Profiling.thr_l_dyn_o rts/dist/build/RtsStartup.thr_l_dyn_o rts/dist/build/FileLock.thr_l_dyn_o rts/dist/build/StgPrimFloat.thr_l_dyn_o rts/dist/build/Pool.thr_l_dyn_o rts/dist/build/StaticPtrTable.thr_l_dyn_o rts/dist/build/Disassembler.thr_l_dyn_o rts/dist/build/TopHandler.thr_l_dyn_o rts/dist/build/WSDeque.thr_l_dyn_o rts/dist/build/RetainerSet.thr_l_dyn_o rts/dist/build/ProfilerReport.thr_l_dyn_o rts/dist/build/Schedule.thr_l_dyn_o rts/dist/build/RtsAPI.thr_l_dyn_o rts/dist/build/Adjustor.thr_l_dyn_o rts/dist/build/Globals.thr_l_dyn_o rts/dist/build/Capability.thr_l_dyn_o rts/dist/build/xxhash.thr_l_dyn_o rts/dist/build/Messages.thr_l_dyn_o rts/dist/build/Interpreter.thr_l_dyn_o rts/dist/build/Hpc.thr_l_dyn_o rts/dist/build/Libdw.thr_l_dyn_o rts/dist/build/RetainerProfile.thr_l_dyn_o rts/dist/build/ClosureFlags.thr_l_dyn_o rts/dist/build/LdvProfile.thr_l_dyn_o rts/dist/build/StgCRun.thr_l_dyn_o rts/dist/build/Task.thr_l_dyn_o rts/dist/build/ProfilerReportJson.thr_l_dyn_o rts/dist/build/RtsUtils.thr_l_dyn_o rts/dist/build/Weak.thr_l_dyn_o rts/dist/build/Stats.thr_l_dyn_o rts/dist/build/Trace.thr_l_dyn_o rts/dist/build/Hash.thr_l_dyn_o rts/dist/build/RaiseAsync.thr_l_dyn_o rts/dist/build/ProfHeap.thr_l_dyn_o rts/dist/build/RtsMessages.thr_l_dyn_o rts/dist/build/RtsSymbolInfo.thr_l_dyn_o rts/dist/build/RtsDllMain.thr_l_dyn_o rts/dist/build/Timer.thr_l_dyn_o rts/dist/build/HsFFI.thr_l_dyn_o rts/dist/build/OldARMAtomic.thr_l_dyn_o rts/dist/build/STM.thr_l_dyn_o rts/dist/build/ThreadLabels.thr_l_dyn_o rts/dist/build/Proftimer.thr_l_dyn_o rts/dist/build/RtsSymbols.thr_l_dyn_o rts/dist/build/PathUtils.thr_l_dyn_o rts/dist/build/Threads.thr_l_dyn_o rts/dist/build/RtsMain.thr_l_dyn_o rts/dist/build/CheckUnload.thr_l_dyn_o rts/dist/build/Inlines.thr_l_dyn_o rts/dist/build/Ticky.thr_l_dyn_o rts/dist/build/LibdwPool.thr_l_dyn_o rts/dist/build/ThreadPaused.thr_l_dyn_o rts/dist/build/Printer.thr_l_dyn_o rts/dist/build/fs.thr_l_dyn_o rts/dist/build/hooks/OutOfHeap.thr_l_dyn_o rts/dist/build/hooks/MallocFail.thr_l_dyn_o rts/dist/build/hooks/FlagDefaults.thr_l_dyn_o rts/dist/build/hooks/OnExit.thr_l_dyn_o rts/dist/build/hooks/LongGCSync.thr_l_dyn_o rts/dist/build/hooks/StackOverflow.thr_l_dyn_o rts/dist/build/sm/MBlock.thr_l_dyn_o rts/dist/build/sm/Scav.thr_l_dyn_o rts/dist/build/sm/GCUtils.thr_l_dyn_o rts/dist/build/sm/CNF.thr_l_dyn_o rts/dist/build/sm/Compact.thr_l_dyn_o rts/dist/build/sm/Sweep.thr_l_dyn_o rts/dist/build/sm/GCAux.thr_l_dyn_o rts/dist/build/sm/MarkWeak.thr_l_dyn_o rts/dist/build/sm/BlockAlloc.thr_l_dyn_o rts/dist/build/sm/Evac_thr.thr_l_dyn_o rts/dist/build/sm/GC.thr_l_dyn_o rts/dist/build/sm/Sanity.thr_l_dyn_o rts/dist/build/sm/Evac.thr_l_dyn_o rts/dist/build/sm/Storage.thr_l_dyn_o rts/dist/build/sm/Scav_thr.thr_l_dyn_o rts/dist/build/eventlog/EventLogWriter.thr_l_dyn_o rts/dist/build/eventlog/EventLog.thr_l_dyn_o rts/dist/build/linker/elf_reloc_aarch64.thr_l_dyn_o rts/dist/build/linker/MachO.thr_l_dyn_o rts/dist/build/linker/elf_plt.thr_l_dyn_o rts/dist/build/linker/elf_plt_aarch64.thr_l_dyn_o rts/dist/build/linker/M32Alloc.thr_l_dyn_o rts/dist/build/linker/elf_got.thr_l_dyn_o rts/dist/build/linker/Elf.thr_l_dyn_o rts/dist/build/linker/CacheFlush.thr_l_dyn_o rts/dist/build/linker/SymbolExtras.thr_l_dyn_o rts/dist/build/linker/elf_plt_arm.thr_l_dyn_o rts/dist/build/linker/LoadArchive.thr_l_dyn_o rts/dist/build/linker/PEi386.thr_l_dyn_o rts/dist/build/linker/elf_reloc.thr_l_dyn_o rts/dist/build/linker/elf_util.thr_l_dyn_o rts/dist/build/posix/OSMem.thr_l_dyn_o rts/dist/build/posix/GetTime.thr_l_dyn_o rts/dist/build/posix/OSThreads.thr_l_dyn_o rts/dist/build/posix/Itimer.thr_l_dyn_o rts/dist/build/posix/TTY.thr_l_dyn_o rts/dist/build/posix/Signals.thr_l_dyn_o rts/dist/build/posix/Select.thr_l_dyn_o rts/dist/build/posix/GetEnv.thr_l_dyn_o   rts/dist/build/StgStartup.thr_l_dyn_o rts/dist/build/PrimOps.thr_l_dyn_o rts/dist/build/HeapStackCheck.thr_l_dyn_o rts/dist/build/Updates.thr_l_dyn_o rts/dist/build/Exception.thr_l_dyn_o rts/dist/build/StgMiscClosures.thr_l_dyn_o rts/dist/build/Apply.thr_l_dyn_o rts/dist/build/Compact.thr_l_dyn_o rts/dist/build/StgStdThunks.thr_l_dyn_o rts/dist/build/AutoApply.thr_l_dyn_o  -fPIC -dynamic -optc-DTHREADED_RTS -eventlog  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror -Iincludes -Iincludes/dist -Iincludes/dist-derivedconstants/header -Iincludes/dist-ghcconstants/header -Irts -Irts/dist/build -DCOMPILING_RTS -DFS_NAMESPACE=rts -this-unit-id rts -dcmm-lint      -i -irts -irts/dist/build -Irts/dist/build -irts/dist/build/./autogen -Irts/dist/build/./autogen            -O2 -Wcpp-undef    -Wnoncanonical-monad-instances   -o rts/dist/build/libHSrts_thr_l-ghc8.5.20180710.so
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id ghc-prim-0.5.2.0 -hide-all-packages -i -ilibraries/ghc-prim/. -ilibraries/ghc-prim/dist-install/build -Ilibraries/ghc-prim/dist-install/build -ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/.    -optP-include -optPlibraries/ghc-prim/dist-install/build/./autogen/cabal_macros.h -package-id rts -this-unit-id ghc-prim -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/ghc-prim/dist-install/build -hidir libraries/ghc-prim/dist-install/build -stubdir libraries/ghc-prim/dist-install/build   -dynamic-too -c libraries/ghc-prim/./GHC/CString.hs -o libraries/ghc-prim/dist-install/build/GHC/CString.o -dyno libraries/ghc-prim/dist-install/build/GHC/CString.dyn_o
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id ghc-prim-0.5.2.0 -hide-all-packages -i -ilibraries/ghc-prim/. -ilibraries/ghc-prim/dist-install/build -Ilibraries/ghc-prim/dist-install/build -ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/.    -optP-include -optPlibraries/ghc-prim/dist-install/build/./autogen/cabal_macros.h -package-id rts -this-unit-id ghc-prim -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/ghc-prim/dist-install/build -hidir libraries/ghc-prim/dist-install/build -stubdir libraries/ghc-prim/dist-install/build   -dynamic-too -c libraries/ghc-prim/./GHC/IntWord64.hs -o libraries/ghc-prim/dist-install/build/GHC/IntWord64.o -dyno libraries/ghc-prim/dist-install/build/GHC/IntWord64.dyn_o
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id base-4.12.0.0 -hide-all-packages -i -ilibraries/base/. -ilibraries/base/dist-install/build -Ilibraries/base/dist-install/build -ilibraries/base/dist-install/build/./autogen -Ilibraries/base/dist-install/build/./autogen -Ilibraries/base/include -Ilibraries/base/dist-install/build/include   -optP-DOPTIMISE_INTEGER_GCD_LCM -optP-include -optPlibraries/base/dist-install/build/./autogen/cabal_macros.h -package-id ghc-prim-0.5.2.0 -package-id integer-gmp-1.0.2.0 -package-id rts -this-unit-id base -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/base/dist-install/build -hidir libraries/base/dist-install/build -stubdir libraries/base/dist-install/build   -dynamic-too -c libraries/base/./GHC/Base.hs-boot -o libraries/base/dist-install/build/GHC/Base.o-boot -dyno libraries/base/dist-install/build/GHC/Base.dyn_o-boot
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id base-4.12.0.0 -hide-all-packages -i -ilibraries/base/. -ilibraries/base/dist-install/build -Ilibraries/base/dist-install/build -ilibraries/base/dist-install/build/./autogen -Ilibraries/base/dist-install/build/./autogen -Ilibraries/base/include -Ilibraries/base/dist-install/build/include   -optP-DOPTIMISE_INTEGER_GCD_LCM -optP-include -optPlibraries/base/dist-install/build/./autogen/cabal_macros.h -package-id ghc-prim-0.5.2.0 -package-id integer-gmp-1.0.2.0 -package-id rts -this-unit-id base -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/base/dist-install/build -hidir libraries/base/dist-install/build -stubdir libraries/base/dist-install/build   -dynamic-too -c libraries/base/./GHC/Real.hs-boot -o libraries/base/dist-install/build/GHC/Real.o-boot -dyno libraries/base/dist-install/build/GHC/Real.dyn_o-boot
*** Core Lint errors : in result of Simplifier ***
<no location info>: warning:
    [RHS of $j_sCd :: [Char]]
    Join point has invalid type: $j_sCd :: [Char]
*** Offending Program ***
unpackCString# [InlPrag=NOINLINE CONLIKE] :: Addr# -> [Char]
[LclIdX,
 Arity=1,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 102 0}]
unpackCString#
  = \ (addr_atX :: Addr#) ->
      letrec {
        unpack_sBT [Occ=LoopBreaker] :: Int# -> [Char]
        [LclId,
         Arity=1,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 62 40}]
        unpack_sBT
          = \ (nh_atZ :: Int#) ->
              case indexCharOffAddr# addr_atX nh_atZ of ch_axY {
                __DEFAULT -> : @ Char (C# ch_axY) (unpack_sBT (+# nh_atZ 1#));
                '\NUL'# -> [] @ Char
              }; } in
      unpack_sBT 0#

----- snip -----

$trModule_sAS :: TrName
[LclId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
$trModule_sAS = TrNameS $trModule_sAR

$trModule :: Module
[LclIdX,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]
$trModule = Module $trModule_sAQ $trModule_sAS

*** End of Offense ***


<no location info>: error: 
Compilation had errors


libraries/ghc-prim/ghc.mk:4: recipe for target 'libraries/ghc-prim/dist-install/build/GHC/CString.o' failed
make[1]: *** [libraries/ghc-prim/dist-install/build/GHC/CString.o] Error 1
make[1]: *** Waiting for unfinished jobs....
Makefile:122: recipe for target 'all' failed
make: *** [all] Error 2

This is from running validate on ec9638b222 - as far as I can tell, this is about where the patch originally branched off of master; in any case it applies cleanly, and ec9638b222 itself validates without errors.

Since this error occurs while compiling stage2, my assumption is that the patch changes GHC to produce incorrect output; the incorrect stage1 compiler then fails with a core lint error while compiling stage2.

comment:23 Changed 6 weeks ago by tdammers

I also tried checking out the actual phab patch, but that too fails to build during validate.

comment:24 Changed 6 weeks ago by goldfire

I had posted about the fix for that behavior on the Well-Typed GitLab page, but I've now also updated Phab. If you download again from there, you should hopefully be OK.

comment:25 Changed 6 weeks ago by tdammers

Thanks! Yes, I'm OK now - too OK in fact, because now it validates cleanly, with no apparent performance problems.

comment:26 Changed 6 weeks ago by goldfire

Hooray! :)

Would you mind rebasing and committing, then?

comment:27 Changed 5 weeks ago by simonpj

with no apparent performance problems.

I would love it if it was possible to do a nofib-style run on testsuite/perf, producing a table showing changes before and after.

The current testsuite stuff just tells if you get more than 5% worse (or whatever). There could easily be a big swing we never see. It's laborious running the tests one by one.

Would be possible for nofib-analyse to look at the logs somehow? It'd be extremely helpful.

Meanwhile, yes, let's commit.

comment:28 Changed 5 weeks ago by tdammers

Ah, never mind, it seems I had misinterpreted what those phabricator tags mean; re-running validate on the correct commit now.

comment:29 Changed 5 weeks ago by tdammers

OK, I managed to build it, and get some proper test failures:

Unexpected failures:
   /tmp/ghctest-tmsk6xdn/test   spaces/./simplCore/should_compile/T3234.run  T3234 [stderr mismatch] (optasm)

Unexpected stat failures:
   /tmp/ghctest-tmsk6xdn/test   spaces/./perf/compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   /tmp/ghctest-tmsk6xdn/test   spaces/./perf/compiler/T5631.run     T5631 [stat not good enough] (normal)

So I'll look into those.

On a side note, Haddock now fails to compile, which can be worked around by setting HADDOCK_DOCS=NO in mk/validate.mk, but should, I think, be resolved before we can merge.

comment:30 Changed 5 weeks ago by tdammers

For completeness' sake, here's the Haddock compilation error:

"inplace/bin/ghc-stage2" -hisuf dyn_hi -osuf  dyn_o -hcsuf dyn_hc -fPIC -dynamic  -O0 -H64m -Wall  -Werror    -hide-all-packages -i -iutils/haddock/driver -iutils/haddock/haddock-api/src -iutils/haddock/haddock-library/vendor/attoparsec-0.13.1.0 -iutils/haddock/haddock-library/src-iutils/haddock/dist/build -Iutils/haddock/dist/build -iutils/haddock/dist/build/haddock/autogen -Iutils/haddock/dist/build/haddock/autogen  -optP-DIN_GHC_TREE -optP-include -optPutils/haddock/dist/build/haddock/autogen/cabal_macros.h -package-id Cabal-2.3.0.0 -package-id array-0.5.2.0 -package-id base-4.12.0.0 -package-id bytestring-0.10.8.2 -package-id containers-0.5.11.0 -package-id deepseq-1.4.4.0 -package-id directory-1.3.2.3 -package-id filepath-1.4.2 -package-id ghc-8.5 -package-id ghc-boot-8.5 -package-id transformers-0.5.5.0 -package-id xhtml-3000.2.2 -funbox-strict-fields -Wall -fwarn-tabs -O2 -threaded -XHaskell2010  -no-user-package-db -rtsopts  -Wno-unused-imports -Wno-deprecations     -Wnoncanonical-monad-instances  -odir utils/haddock/dist/build -hidir utils/haddock/dist/build -stubdir utils/haddock/dist/build    -c utils/haddock/haddock-api/src/Haddock/InterfaceFile.hs -o utils/haddock/dist/build/Haddock/InterfaceFile.dyn_o

utils/haddock/haddock-api/src/Haddock/Convert.hs:486:39: error:
    • Couldn't match type ‘UniqSet.UniqSet TyVar’
                     with ‘InterestingVarFun
                           -> VarSet -> ([Var], VarSet) -> ([Var], VarSet)’
      Expected type: TyConBinder -> FV
        Actual type: TyConBinder -> TyVarSet
    • In the first argument of ‘mapUnionFV’, namely
        ‘injectiveVarsOfBinder’
      In the second argument of ‘($)’, namely
        ‘mapUnionFV injectiveVarsOfBinder dropped_binders’
      In the expression:
        fvVarSet $ mapUnionFV injectiveVarsOfBinder dropped_binders
    |
486 |                            mapUnionFV injectiveVarsOfBinder dropped_binders
    |                                       ^^^^^^^^^^^^^^^^^^^^^
utils/haddock/ghc.mk:20: recipe for target 'utils/haddock/dist/build/Haddock/Convert.dyn_o' failed
make[1]: *** [utils/haddock/dist/build/Haddock/Convert.dyn_o] Error 1
make[1]: *** Waiting for unfinished jobs....
Makefile:122: recipe for target 'all' failed
make: *** [all] Error 2

comment:31 Changed 5 weeks ago by RyanGlScott

You should be able to fix that Haddock compilation error with this patch:

  • haddock-api/src/Haddock/Convert.hs

    diff --git a/haddock-api/src/Haddock/Convert.hs b/haddock-api/src/Haddock/Convert.hs
    index bf6fbab..9f7b15d 100644
    a b synifyType _ (TyConApp tc tys) 
    512512                  = splitAtList  tys binders
    513513            result_kind  = mkTyConKind remaining_binders res_kind
    514514            result_vars  = tyCoVarsOfType result_kind
    515             dropped_vars = fvVarSet $
    516                            mapUnionFV injectiveVarsOfBinder dropped_binders
     515            dropped_vars = mapUnionVarSet injectiveVarsOfBinder dropped_binders
    517516
    518517        in not (subVarSet result_vars dropped_vars)
    519518

comment:32 Changed 5 weeks ago by goldfire

This all looks like it's going swimmingly. Let me know if I need to intervene. Thanks!

Last edited 5 weeks ago by goldfire (previous) (diff)

comment:33 Changed 5 weeks ago by goldfire

I just checked test T3234 locally, and it seems to work. Perhaps something went awry in rebasing... (or perhaps I'm not in the right state locally somehow)

comment:34 in reply to:  33 Changed 5 weeks ago by tdammers

Replying to goldfire:

I just checked test T3234 locally, and it seems to work. Perhaps something went awry in rebasing... (or perhaps I'm not in the right state locally somehow)

I haven't rebased anything yet, I checked out the code directly from Phabricator's staging area, so it should be identical to what we're looking at here. However, T3234 is a stdout deviation, not a stat failure, so I'm not too worried about that one, it's probably just a cosmetic difference in output formatting or some such.

The thing I'm looking into right now is the stat failures, T5321Fun and T5631.

comment:35 in reply to:  31 Changed 5 weeks ago by tdammers

Replying to RyanGlScott:

You should be able to fix that Haddock compilation error with this patch: [...]

Thanks, works like a charm (although line numbers differ in my version).

comment:36 Changed 5 weeks ago by tdammers

-ddump-simpl suggests the problem is not increased Core size; the output is practically identical.

So I did a run with +RTS -p, and got this:

        Mon Jul 16 16:37 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c T5631.hs -fforce-recomp

        total time  =        1.38 secs   (1383 ticks @ 1000 us, 1 processor)
        total alloc = 1,208,527,080 bytes  (excludes profiling overheads)

COST CENTRE                MODULE     SRC                                                 %time %alloc

tc_rn_src_decls            TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     38.7   55.9
withCleanupSession.cleanup GHC        compiler/main/GHC.hs:(468,4)-(475,37)                23.1    0.0
zonkTopDecls               TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.0    8.7
CorePrep                   HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          3.0    4.9
deSugar                    HscMain    compiler/main/HscMain.hs:512:7-44                     2.6    2.5
pprNativeCode              AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.4    1.8
CoreTidy                   HscMain    compiler/main/HscMain.hs:1257:27-67                   1.8    2.3
simplIdF                   Simplify   compiler/simplCore/Simplify.hs:859:61-79              1.6    1.3
simplifyTop                TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            1.4    2.2
Parser                     HscMain    compiler/main/HscMain.hs:(317,5)-(385,20)             1.4    2.6
StgCmm                     HscMain    compiler/main/HscMain.hs:(1432,13)-(1433,62)          1.3    1.0

~And then digging into the call graph analysis, this surprising bit:

         hscTypecheck                            HscMain                       compiler/main/HscMain.hs:(425,1)-(428,20)          1564          1    0.0    0.0    51.9   73.1
          extract_renamed_stuff                  HscMain                       compiler/main/HscMain.hs:(397,1)-(410,31)          1640          1    0.0    0.0     0.0    0.0
          hscTypecheck'                          HscMain                       compiler/main/HscMain.hs:(434,1)-(456,38)          1565          1    0.0    0.0    51.9   73.1
           [...]
           tcRnModule'                           HscMain                       compiler/main/HscMain.hs:(461,1)-(500,72)          1570          1    0.0    0.0    50.5   70.5
            Typecheck-Rename                     HscMain                       compiler/main/HscMain.hs:(464,16)-(465,73)         1571          1    0.0    0.0    50.5   70.5
             ioMsgMaybe                          HscMain                       compiler/main/HscMain.hs:(251,1)-(256,122)         1572          1    0.4    0.8    50.5   70.5

In other words, we spend 70% of our allocations on "dealing with errors and warnings returned by a compilation step".

Actually no, it just means that whatever ioMsgMaybe wraps here is spending 70% of allocations.

Last edited 5 weeks ago by tdammers (previous) (diff)

comment:37 Changed 5 weeks ago by simonpj

Can we compare the numbers with and without the patch? That should show differences.

Also I suggest compiling the compiler with -ticky and comparing output with and without the patch.

comment:38 in reply to:  37 Changed 5 weeks ago by tdammers

Replying to simonpj:

Can we compare the numbers with and without the patch? That should show differences.

And it does.

Before patch:

        Tue Jul 17 12:17 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c T5631.hs -fforce-recomp

        total time  =        1.28 secs   (1278 ticks @ 1000 us, 1 processor)
        total alloc = 1,146,867,328 bytes  (excludes profiling overheads)

COST CENTRE        MODULE     SRC                                                 %time %alloc

tc_rn_src_decls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     40.7   55.4
setSessionDynFlags GHC        compiler/main/GHC.hs:(578,1)-(584,16)                 8.7    0.7
zonkTopDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.0    9.2
tcRnSrcDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:255:25-65            4.5    0.4
writeIface         HscMain    compiler/main/HscMain.hs:1283:9-45                    3.4    0.2
withCleanupSession GHC        compiler/main/GHC.hs:(466,1)-(475,37)                 2.9    0.1
CorePrep           HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          2.8    4.1
deSugar            HscMain    compiler/main/HscMain.hs:512:7-44                     2.5    2.7
pprNativeCode      AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.3    1.9
simplifyTop        TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            1.6    2.3
CoreTidy           HscMain    compiler/main/HscMain.hs:1257:27-67                   1.6    1.8

[...]

               tc_rn_src_decls                   TcRnDriver                    compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)   1575          1   40.7   55.4    42.8   57.3
                Digraph.topSort                  Digraph                       compiler/utils/Digraph.hs:350:48-75                1586        341    0.0    0.0     0.0    0.0
                solveSimples                     TcInteract                    compiler/typecheck/TcInteract.hs:(241,5)-(242,21)  1587        266    0.0    0.0     1.9    1.6
                 solve_loop                      TcInteract                    compiler/typecheck/TcInteract.hs:(246,9)-(250,44)  1589          0    1.5    1.1     1.9    1.6
                  canNC                          TcCanonical                   compiler/typecheck/TcCanonical.hs:(84,5)-(90,45)   1590       1206    0.4    0.5     0.4    0.5
                  canClass                       TcCanonical                   compiler/typecheck/TcCanonical.hs:98:5-31          1591         15    0.0    0.0     0.0    0.0
                  canEqLeafTyVarEq               TcCanonical                   compiler/typecheck/TcCanonical.hs:105:5-39         1592          3    0.0    0.0     0.0    0.0
                Digraph.scc                      Digraph                       compiler/utils/Digraph.hs:285:44-67                1576         25    0.0    0.0     0.0    0.0
                bin_anns                         HscTypes                      compiler/main/HscTypes.hs:1093:47-56               1581         12    0.0    0.0     0.0    0.0
                bin_exports                      HscTypes                      compiler/main/HscTypes.hs:1088:50-55               1578         12    0.0    0.0     0.0    0.0
                bin_fam_insts                    HscTypes                      compiler/main/HscTypes.hs:1096:52-57               1584         12    0.0    0.0     0.0    0.0
                bin_fixities                     HscTypes                      compiler/main/HscTypes.hs:1091:51-56               1579         12    0.0    0.0     0.0    0.0
                bin_insts                        HscTypes                      compiler/main/HscTypes.hs:1095:48-53               1583         12    0.0    0.0     0.0    0.0
                bin_rules                        HscTypes                      compiler/main/HscTypes.hs:1097:48-57               1585         12    0.0    0.0     0.0    0.0
                bin_tycldecls                    HscTypes                      compiler/main/HscTypes.hs:1094:52-57               1582         12    0.2    0.2     0.2    0.2
                bin_usages                       HscTypes                      compiler/main/HscTypes.hs:1087:49-58               1577         12    0.0    0.0     0.0    0.0
                bin_warns                        HscTypes                      compiler/main/HscTypes.hs:1092:48-57               1580         12    0.0    0.0     0.0    0.0
                substTyWith                      TyCoRep                       compiler/types/TyCoRep.hs:(2191,23)-(2192,50)      1616          4    0.0    0.0     0.0    0.0

After:

tc_rn_src_decls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     49.3   55.9
zonkTopDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.6    8.7
CorePrep           HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          3.9    4.9
deSugar            HscMain    compiler/main/HscMain.hs:512:7-44                     3.3    2.5
setSessionDynFlags GHC        compiler/main/GHC.hs:(578,1)-(584,16)                 2.9    0.7
CoreTidy           HscMain    compiler/main/HscMain.hs:1257:27-67                   2.2    2.3
simplifyTop        TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            2.1    2.2
withCleanupSession GHC        compiler/main/GHC.hs:(466,1)-(475,37)                 2.1    0.1
pprNativeCode      AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.0    1.8
Parser             HscMain    compiler/main/HscMain.hs:(317,5)-(385,20)             1.7    2.6
solve_loop         TcInteract compiler/typecheck/TcInteract.hs:(246,9)-(250,44)     1.6    1.1

[...]


               tc_rn_src_decls                   TcRnDriver                    compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)   1584          1   49.3   55.9    52.2   57.7
                Digraph.topSort                  Digraph                       compiler/utils/Digraph.hs:350:48-75                1595        341    0.1    0.0     0.1    0.0
                solveSimples                     TcInteract                    compiler/typecheck/TcInteract.hs:(241,5)-(242,21)  1596        266    0.2    0.0     2.5    1.5
                 solve_loop                      TcInteract                    compiler/typecheck/TcInteract.hs:(246,9)-(250,44)  1598          0    1.6    1.0     2.3    1.5
                  canNC                          TcCanonical                   compiler/typecheck/TcCanonical.hs:(84,5)-(90,45)   1599       1206    0.7    0.5     0.7    0.5
                  canClass                       TcCanonical                   compiler/typecheck/TcCanonical.hs:98:5-31          1600         15    0.0    0.0     0.0    0.0
                  canEqLeafTyVarEq               TcCanonical                   compiler/typecheck/TcCanonical.hs:105:5-39         1601          3    0.0    0.0     0.0    0.0
                Digraph.scc                      Digraph                       compiler/utils/Digraph.hs:285:44-67                1585         25    0.0    0.0     0.0    0.0
                bin_anns                         HscTypes                      compiler/main/HscTypes.hs:1093:47-56               1590         12    0.0    0.0     0.0    0.0
                bin_exports                      HscTypes                      compiler/main/HscTypes.hs:1088:50-55               1587         12    0.0    0.0     0.0    0.0
                bin_fam_insts                    HscTypes                      compiler/main/HscTypes.hs:1096:52-57               1593         12    0.0    0.0     0.0    0.0
                bin_fixities                     HscTypes                      compiler/main/HscTypes.hs:1091:51-56               1588         12    0.0    0.0     0.0    0.0
                bin_insts                        HscTypes                      compiler/main/HscTypes.hs:1095:48-53               1592         12    0.0    0.0     0.0    0.0
                bin_rules                        HscTypes                      compiler/main/HscTypes.hs:1097:48-57               1594         12    0.0    0.0     0.0    0.0
                bin_tycldecls                    HscTypes                      compiler/main/HscTypes.hs:1094:52-57               1591         12    0.3    0.2     0.3    0.2
                bin_usages                       HscTypes                      compiler/main/HscTypes.hs:1087:49-58               1586         12    0.1    0.0     0.1    0.0
                bin_warns                        HscTypes                      compiler/main/HscTypes.hs:1092:48-57               1589         12    0.0    0.0     0.0    0.0
                substTyWith                      TyCoRep                       compiler/types/TyCoRep.hs:(2274,23)-(2275,50)      1625          4    0.0    0.0     0.0    0.0

So apparently tc_rn_src_decls is responsible for the entire performance hit here, however the calls it makes that we track seem to be identical, and none of its children contributes significantly.

comment:39 Changed 5 weeks ago by tdammers

Another interesting observation is that it eats up more CPU time, but doesn't allocate more.

comment:40 Changed 5 weeks ago by tdammers

Drilling down some more, I found that the biggest performance sink is tcPolyInfer, in both the pre- and post-patch versions; in both cases, it is responsible for just over 50% of allocations, but in terms of runtime, it's 36% before the patch, but 45% after - so that alone would explain a 5% performance deviation.

Further data points:

  • On the first run after making a fresh build, withCleanupSession.cleanup shows up prominently, but from the second run onwards, it scores 0 on all metrics.
  • Overall performance for profile builds is such that the patch does actually produce a performance improvement, from approx. 1.25 seconds down to 1.07.
  • The patch does not touch tcPolyInfer at all, the entire module is identical between both versions
  • The performance counters suggests that code paths taken inside tcPolyInfer are the same

Changed 5 weeks ago by tdammers

Attachment: logs.tar.gz added

Performance logs for comparison

comment:41 Changed 5 weeks ago by goldfire

Given what you've found so far, I feel sure that the problems stem from the changes in TyCoRep. But I'm curious to see what else you find!

comment:42 Changed 5 weeks ago by tdammers

Hah, that's a good hint! I did another run, this time compiling with -fprof-auto-top on TyCoRep only, and this reveals two interesting things:

  • tcvs_of_type, which doesn't exist in the old version yet, is responsible for 4.5% of CPU time in the new version
  • zonkTopDecls, while allocating about 8.8% in both versions, jumps from 5.9% CPU to 8.2%

I'll dig deeper on these two.

comment:43 Changed 5 weeks ago by simonpj

Ah! The tcvs_of_type thing sounds promising.

comment:44 Changed 4 weeks ago by tdammers

Managed to get meaningful ticky-ticky profiling data.

Before the patch:

The following table is explained by http://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
All allocation numbers are in bytes.

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2301299   85611504          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
    1612793   73036520          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     699656   37268264          0   1 L                    go4{v sNuG} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNuF} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSIr} (ghc:TcHsSyn) (fun) in rN
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)
     138823   13706576          0   1 S                    sat_sMU6{v} (ghc:TcMType) (fun) in r6P
     459700   11089080          0   1 S                    sat_sNvf{v} (ghc:TcMType) (fun) in sNuF

After:

The following table is explained by http://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
All allocation numbers are in bytes.

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2330347   86853624          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
    1271867   57141560          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     699656   37268264          0   1 L                    go4{v sNmh} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     693375   27595440          0   3 MIM                  poly_merge1{v rs9H} (containers-0.5.11.0:Data.IntMap.Internal) (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNmg} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSHY} (ghc:TcHsSyn) (fun) in rN
      95951   16510560          0   2 >L                   base:Data.OldList.sortBy{v rD} (fun)
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)

Note the poly_merge1 entry in the post-patch one.

comment:45 Changed 4 weeks ago by RyanGlScott

poly_merge1 is likely coming from the mergeWithKey' function from Data.IntMap (which unionVarSet is implemented in terms of), as it has a helper function named merge1. The question, I suppose, is which use(s) of unionVarSet are problematic. There are at least a couple new uses of unionVarSet in Phab:D4769:

  • In mapUnionVarSetSet, which is used in the pivotal closeOverKinds function
  • Directly in closeOverKinds itself
  • In nonDetCmpTypes

comment:46 Changed 4 weeks ago by RyanGlScott

Question: is it more efficient to use closeOverKinds (unionVarSet set1 set2) than unionVarSet (closeOverKinds set1) (closeOverKinds set2)? If so, I think the way nonDetCmpTypes is currently implemented is inefficient. We currently have:

nonDetCmpTypes :: [Type] -> [Type] -> Ordering
nonDetCmpTypes ts1 ts2 = nonDetCmpTypesX rn_env ts1 ts2
  where
    rn_env = mkRnEnv2 (mkInScopeSet (tyCoVarsOfTypes ts1 `unionVarSet` tyCoVarsOfTypes ts2))

In particular, note this part:

tyCoVarsOfTypes ts1 `unionVarSet` tyCoVarsOfTypes ts2

tyCoVarsOfTypes calls closeOverKinds, so we're calling closeOverKinds once more than is necessary. Perhaps this should be written like this?

tyCoVarsOfTypes (ts1 ++ ts2)

comment:47 Changed 4 weeks ago by simonpj

Re comment:46, yes maybe it'd be more efficient like that. But I doubt this is the issue. nonDetCmpTypes (I hope) only evaluates that free-var set if it compare types involving forall. Otherwise it's just a thunk.

comment:48 Changed 4 weeks ago by simonpj

I have a new hypothesis, provoked by this merge popping up in the statistics.

  • Previously, tyCoVarsOfTypes used tyCoFVsOfTypes, which in turn uses unionFV (from utils/FV.hs). This FV thing is relatively elaborate, maintaining a VarSet alongside a list. Plus an "interesting" predicate which, in the case of tyCoVarsOfType is const True.
  • In the patch, tyCoVarsOfType switches to using unionVarSet. This "obviously" shold be more efficient, because it just maintains a VarSet, with no list.
  • BUT tyCoVarsOfType does not use an accumulator; it just straightforwardly unions the free-vars in a tree-like way. In contrast, unionFV does use accumulator style, behind the scenes.
  • Bottom line: in the old version we make many calls to extendVarSet; in the new one we make many calls to unionVarSet.

It should jolly well be the case that inserting 100 things one by one into a set should be no faster than unioning a balanced tree of the same 100 things. But looking at Data.Map.Internal.insert, and comparing with Data.Map.Internal.mergeWithKey', I bet the former is faster.

This hypothesis is easy to test: just change the new tcvs_of_type to take an accumulator.

If this turns out to make a difference, I think we should file a bug on containers. Users should not have to worry about this!

comment:49 Changed 4 weeks ago by tdammers

So far, a naive attempt at rewriting tcvs_of_type in terms of VarSet -> VarSet didn't produce significant performance improvements; however, I have not eliminated unionVarSet completely yet - the ForAll cases in both tcvs_of_type and tcvs_of_co are both such that naively expressing them as compositions of applications of tcvs_of_... and delVarSet does not produce correct results - I assume this is because the order in which vars are added and deleted from the set matter. So I'm now testing a replacement for unionVarSet that is written in terms of set insertions instead of set unions; hopefully this will show a bigger difference. I should probably also benchmark the relevant set operations on their own, to verify that they do indeed expose the performance issue we're suspecting.

Last edited 4 weeks ago by tdammers (previous) (diff)

comment:50 Changed 4 weeks ago by tdammers

Update; the rewritten unionVarSet performs significantly worse than the original one. So either I did something wrong there, or our suspicion isn't correct.

comment:51 Changed 4 weeks ago by simonpj

So far, a naive attempt at rewriting tcvs_of_type in terms of VarSet -> VarSet didn't produce significant performance improvements

What does the naive attempt look like. I'm expecting:

tcvs_of_type :: Type -> VarSet -> VarSet
tcvs_of_type (App t1 t2) acc = tcvs_of_type t1 (tcvs_of_type t2 acc)
...etc...

Is that what you have? I recall that Bartosz got much worse perf with a higher-order version.

I would not worry about foralls -- they are relatively rare.

Last edited 4 weeks ago by simonpj (previous) (diff)

comment:52 Changed 4 weeks ago by tdammers

Yes, that's essentially what I have. Specifically:

tcvs_of_type' :: Type -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_type' (TyVarTy v)                 = flip extendVarSet v
tcvs_of_type' (TyConApp _ tys)            = tcvs_of_types' tys
tcvs_of_type' (LitTy {})                  = id
tcvs_of_type' (AppTy fun arg)             = tcvs_of_type' fun . tcvs_of_type' arg
tcvs_of_type' (FunTy arg res)             = tcvs_of_type' arg . tcvs_of_type' res
tcvs_of_type' (ForAllTy (TvBndr tv _) ty) = unionVarSet (delVarSet (tcvs_of_type ty) tv) . tcvs_of_type' (tyVarKind tv)
tcvs_of_type' (CastTy ty co)              = tcvs_of_type' ty . tcvs_of_co' co
tcvs_of_type' (CoercionTy co)             = tcvs_of_co' co

tcvs_of_types :: [Type] -> TyCoVarSetNotClosed
tcvs_of_types ts = tcvs_of_types' ts emptyVarSet

tcvs_of_types' :: [Type] -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_types' [] = id
tcvs_of_types' (t:ts) = tcvs_of_type' t . tcvs_of_types' ts

tcvs_of_co :: Coercion -> TyCoVarSetNotClosed
tcvs_of_co co = tcvs_of_co' co emptyVarSet

tcvs_of_co' :: Coercion -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_co' (Refl _ ty)               = tcvs_of_type' ty
tcvs_of_co' (TyConAppCo _ _ cos)      = tcvs_of_cos' cos
tcvs_of_co' (AppCo co arg)            = tcvs_of_co' co . tcvs_of_co' arg
tcvs_of_co' (ForAllCo tv kind_co co)  = unionVarSet (delVarSet (tcvs_of_co co) tv) . tcvs_of_co' kind_co
tcvs_of_co' (FunCo _ co1 co2)         = tcvs_of_co' co1 . tcvs_of_co' co2
tcvs_of_co' (CoVarCo v)               = flip extendVarSet v
tcvs_of_co' (HoleCo h)                = flip extendVarSet (coHoleCoVar h)
    -- See Note [CoercionHoles and coercion free variables]
tcvs_of_co' (AxiomInstCo _ _ cos)      = tcvs_of_cos' cos
tcvs_of_co' (UnivCo p _ t1 t2)        = tcvs_of_prov' p . tcvs_of_type' t1 . tcvs_of_type' t2
tcvs_of_co' (SymCo co)          = tcvs_of_co' co
tcvs_of_co' (TransCo co1 co2)   = tcvs_of_co' co1 . tcvs_of_co' co2
tcvs_of_co' (NthCo _ _ co)      = tcvs_of_co' co
tcvs_of_co' (LRCo _ co)         = tcvs_of_co' co
tcvs_of_co' (InstCo co arg)     = tcvs_of_co' co . tcvs_of_co' arg
tcvs_of_co' (CoherenceCo c1 c2) = tcvs_of_co' c1 . tcvs_of_co' c2
tcvs_of_co' (KindCo co)         = tcvs_of_co' co
tcvs_of_co' (SubCo co)          = tcvs_of_co' co
tcvs_of_co' (AxiomRuleCo _ cos)  = tcvs_of_cos' cos

tcvs_of_cos :: [Coercion] -> TyCoVarSetNotClosed
tcvs_of_cos co = tcvs_of_cos' co emptyVarSet

tcvs_of_cos' :: [Coercion] -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_cos' [] = id
tcvs_of_cos' (co:cos) = tcvs_of_co' co . tcvs_of_cos' cos

-- tcvs_of_prov :: UnivCoProvenance -> TyCoVarSetNotClosed
-- tcvs_of_prov p = tcvs_of_prov' p emptyVarSet

tcvs_of_prov' :: UnivCoProvenance -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_prov' UnsafeCoerceProv    = id
tcvs_of_prov' (PhantomProv co)    = tcvs_of_co' co
tcvs_of_prov' (ProofIrrelProv co) = tcvs_of_co' co
tcvs_of_prov' (PluginProv _)      = id

So the '-ed versions are the ones with an accumulator; I kept the original ones around but rewrote them in terms of the accumulator version so that other functions that depend on them keep working unchanged. Apart from the forall cases, everything goes through the accumulator version, and uses only single-var insertions and no unions. I also rewrote the plural ones (tcvs_of_cos / tcvs_of_types) to avoid unions there. The only real difference I can see here is that I used point-free style over parentheses, but that surely ought not to make a difference.

There is a slight performance improvement over the union-based version, but not enough to explain the test failures, and it's still nowhere near the original numbers.

To wit:

Pre-patch:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/old.ticky 

STACK USAGE:

ENTERS: 51211603  of which 51211603 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6192996 ( 12.1%) thunks
    3676345 (  7.2%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 41342262

TAIL CALLS: 33149658, of which 27886423 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 23002648
   11895484 ( 51.7%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            11895484:  36.3% 13.5% 24.9%  4.5% 19.1%  0.9%  0.3%  0.2%  0.4%
RET_OLD:             3676345:  11.6% 18.6% 18.8%  5.7%  6.3% 20.5%  0.3%  1.1% 17.1%
RET_UNBOXED_TUP:     7430819:   0.0% 68.9% 29.7%  1.2%  0.1%  0.0%  0.1%  0.1%  0.0%

Post-patch:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/new.ticky 

STACK USAGE:

ENTERS: 54717036  of which 54717036 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6198335 ( 11.3%) thunks
    4824139 (  8.8%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 43694562

TAIL CALLS: 35496964, of which 29883207 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 24705888
   13776908 ( 55.8%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            13776908:  36.0% 11.6% 25.9%  3.9% 21.1%  0.8%  0.2%  0.1%  0.3%
RET_OLD:             4824139:  23.8% 14.1% 20.4%  4.4%  7.6% 15.6%  0.3%  0.9% 13.0%
RET_UNBOXED_TUP:     6104841:   0.0% 83.5% 14.8%  1.4%  0.1%  0.0%  0.1%  0.1%  0.0%

Post-patch with rewritten tcvs... functions:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/new-accum.ticky 

STACK USAGE:

ENTERS: 54394961  of which 54394961 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6198336 ( 11.4%) thunks
    4906771 (  9.0%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 43289854

TAIL CALLS: 35075741, of which 29461984 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 24431674
   13420062 ( 54.9%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            13420062:  32.7% 11.9% 25.9%  4.0% 24.0%  0.8%  0.2%  0.1%  0.4%
RET_OLD:             4906771:  26.3% 13.9% 17.1%  4.3%  9.2% 15.3%  0.3%  0.9% 12.8%
RET_UNBOXED_TUP:     6104841:   0.0% 83.5% 14.8%  1.4%  0.1%  0.0%  0.1%  0.1%  0.0%

comment:53 Changed 4 weeks ago by simonpj

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

comment:54 in reply to:  53 Changed 4 weeks ago by tdammers

Replying to simonpj:

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

Tried that, but it seems to make absolutely no difference (that is, allocations go down by less than 0.004%).

Also worth noting that VarSet here is based on IntMap, not Map as you quoted. Not sure if that changes anything, but maybe IntMap doesn't have the same problem Map has.

comment:55 Changed 4 weeks ago by RyanGlScott

I can confirm that my hypothesis in comment:46 did not yield fruit: it saved a whopping 400 bytes of allocation in T5321Fun, and that's it.

comment:56 in reply to:  55 Changed 4 weeks ago by tdammers

I did some benchmarking of Data.Map and also Data.IntMap in isolation.

For both, I implemented combining a large number of positive integers into a set and then testing for the presence of -1 in the set, using two strategies:

  1. the "unions" approach: unions . map singleton
  2. the "insert" approach: foldr insert empty

For IntMap (which is what we're using here), the "unions" approach is slightly faster (~14.5s vs. ~15s for 100 million elements); for Map, the "unions" approach is a lot faster (~6.5s vs. ~8.2s for 10 million elements).

Note that this is the worst possible case for union, and it still outperforms individual insertions for some reason.

So if that is accurate, then the idea that speeding things up by using lots of inserts instead of lots of set unions doesn't hold up.

comment:57 in reply to:  53 Changed 4 weeks ago by tdammers

Replying to simonpj:

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

Can you remember whether that situation involved raw VarSets, or FVs? From what I see in the code, FV could, under the right circumstances, avoid hitting the underlying IntSet entirely, while after this patch, we're using VarSet directly, so maybe that's why those set operations are being hit harder now?

comment:58 Changed 4 weeks ago by simonpj

the "unions" approach: unions . map singleton

That is essentially foldr (union . singleton) empty, which is really very close to foldr insert empty.

What if you do a balanced tree of binary union operations? I suppose something like

bigSet :: Int -> Int -> IntSet
bigSet m n
  | m==n = singleton n
  | otherwise = bigSet m mid `union` bigSet (mid+1) n
  where
    mid = (n+m)/2

comment:59 Changed 4 weeks ago by simonpj

This is all very puzzling. But when we get to the bottom of it we may speed up a key piece of infrastructure! So I think it's worth persevering.

Some other ideas

  1. Is poly_merge still high on the list in ticky in the accumulator version? If so, why?
  1. Can you post the accumulator version for types anyway?
  1. Try switching back to FV. That is, take the currently-deleted code for tyCoFVsOfType, and re-instate it, but modify it not to take the kinds at the variable occurrences. Now use that instead of the new fvs_of_type. It "obviously" should be a lot slower -- is it?
  1. In unitFV I see
    unitFV :: Id -> FV
    unitFV var fv_cand in_scope acc@(have, haveSet)
      | var `elemVarSet` in_scope = acc
      | var `elemVarSet` haveSet = acc
      | fv_cand var = (var:have, extendVarSet haveSet var)
      | otherwise = acc
    

Notice that before inserting we test for membership. Could that possibly be a win? If so, we can use it in the VarSet version.

comment:60 Changed 4 weeks ago by simonpj

Can you remember whether that situation involved raw VarSets, or FVs?

The old code for tyCoFVsOfType looked like

fvs_of_type (AppTy fun arg)    a b c = (fvs_of_type fun `unionFV` fvs_of_type arg) a b c

I believe that if you eta reduce this to

fvs_of_type (AppTy fun arg) = (fvs_of_type fun `unionFV` fvs_of_type arg)

it runs a lot slower. That should not happen, but I recall that it did. I don't understand what you mean about "avoiding hitting the underlying IntSet entirely".

comment:61 in reply to:  60 Changed 4 weeks ago by tdammers

Replying to simonpj:

Can you remember whether that situation involved raw VarSets, or FVs?

The old code for tyCoFVsOfType looked like

fvs_of_type (AppTy fun arg)    a b c = (fvs_of_type fun `unionFV` fvs_of_type arg) a b c

I believe that if you eta reduce this to

fvs_of_type (AppTy fun arg) = (fvs_of_type fun `unionFV` fvs_of_type arg)

it runs a lot slower. That should not happen, but I recall that it did. I don't understand what you mean about "avoiding hitting the underlying IntSet entirely".

Well, if you look at the definition of FV:

type FV = InterestingVarFun
             -- Used for filtering sets as we build them
          -> VarSet
             -- Locally bound variables
          -> ([Var], VarSet)
             -- List to preserve ordering and set to check for membership,
             -- so that the list doesn't have duplicates
             -- For explanation of why using `VarSet` is not deterministic see
             -- Note [Deterministic UniqFM] in UniqDFM.
          -> ([Var], VarSet)

...then it's not immediately obvious that code that manipulates FVs will always touch the VarSets in there at all. Specifically, variables that are not "interesting" bypass the VarSet (i.e., IntSet) machinery entirely, if I'm not mistaken.

Not sure if that has anything to do with the problem at hand though.

comment:62 in reply to:  59 Changed 4 weeks ago by tdammers

Replying to simonpj:

This is all very puzzling. But when we get to the bottom of it we may speed up a key piece of infrastructure! So I think it's worth persevering.

  1. Is poly_merge still high on the list in ticky in the accumulator version? If so, why?

As a matter of fact, no it's not. So it looks like the allocations we're winning from not using poly_merge as much just end up being eaten up elsewhere. Top 15 entries for the accumulator version:

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
    2539473  107917008          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2330347   86853624          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
     699656   37268264          0   1 L                    go4{v sNmj} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNmi} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSI0} (ghc:TcHsSyn) (fun) in rN
      95951   16510560          0   2 >L                   base:Data.OldList.sortBy{v rD} (fun)
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)
     138823   13706576          0   1 S                    sat_sMLW{v} (ghc:TcMType) (fun) in r6P
     440463   11089080          0   1 S                    sat_sNmS{v} (ghc:TcMType) (fun) in sNmi

comment:63 in reply to:  59 Changed 4 weeks ago by tdammers

Replying to simonpj:

  1. In unitFV I see
    unitFV :: Id -> FV
    unitFV var fv_cand in_scope acc@(have, haveSet)
      | var `elemVarSet` in_scope = acc
      | var `elemVarSet` haveSet = acc
      | fv_cand var = (var:have, extendVarSet haveSet var)
      | otherwise = acc
    

Notice that before inserting we test for membership. Could that possibly be a win? If so, we can use it in the VarSet version.

I believe that the "test before insert" thing is needed because we're accumulating a set (haveSet) and a list (have) in parallel, instead of just the set, so in order to keep the two in sync, we cannot simply insert unconditionally, as that would produce duplicates in the list.

I don't think it can possibly produce a performance advantage beyond skipping the list cons, *unless* we are in a special situation where:

  • in_scope is much smaller than haveSet (such that testing membership of in_scope is significantly faster than testing haveSet membership); AND:
  • in_scope is likely to match new entries (because when it doesn't match, the extra test just eats up clock cycles without gaining anything)

One interesting thing might be to invert the fv_cand condition and change the ordering, so that the fv_cand test is done before the set membership tests, something like:

unitFV :: Id -> FV
unitFV var fv_cand in_scope acc@(have, haveSet)
  | not (fv_cand var) = acc
  | var `elemVarSet` in_scope = acc
  | var `elemVarSet` haveSet = acc
  | otherwise = (var:have, extendVarSet haveSet var)

But that won't change the performance of VarSet the slightest.

comment:64 in reply to:  58 Changed 4 weeks ago by tdammers

Replying to simonpj:

the "unions" approach: unions . map singleton

That is essentially foldr (union . singleton) empty, which is really very close to foldr insert empty.

What if you do a balanced tree of binary union operations? I suppose something like

bigSet :: Int -> Int -> IntSet
bigSet m n
  | m==n = singleton n
  | otherwise = bigSet m mid `union` bigSet (mid+1) n
  where
    mid = (n+m)/2

Did exactly that, benchmark code attached. Results:

./intmapbench insert    1.32s user 0.05s system 99% cpu 1.370 total
./intmapbench union     1.01s user 0.01s system 99% cpu 1.029 total
./intmapbench balanced  0.24s user 0.01s system 98% cpu 0.260 total

So the only surprising thing here is that even in the worst case of a completely unbalanced insertion order, union outperforms insert, but otherwise, it behaves as expected, and this definitely does not support our hypothesis that individual inserts might be faster than set unions.

Last edited 4 weeks ago by tdammers (previous) (diff)

Changed 4 weeks ago by tdammers

Attachment: intmapbench.hs added

IntMap mini-benchmark

comment:65 Changed 4 weeks ago by simonpj

Maybe try (3) in comment:59? That makes before-and-after more similar.

It might help to put it all on a branch... I'm not sure of the precise code you are measuring.

(It's cool that balanced unions are actually faster. FOUR TIMES faster!)

comment:66 in reply to:  65 Changed 4 weeks ago by tdammers

Replying to simonpj:

Maybe try (3) in comment:59? That makes before-and-after more similar.

Yes, will do that.

It might help to put it all on a branch... I'm not sure of the precise code you are measuring.

I took the patch from phabricator and rebased it onto what I figured must be the last commit before Richard started working on it. I'll push my branch to git.haskell.org.

(It's cool that balanced unions are actually faster. FOUR TIMES faster!)

It is, although it would have been easier for us if that had turned out to be the problem. Then again, one of the fundamental libraries out there not being broken is kind of a comforting thought...

comment:67 Changed 3 weeks ago by RyanGlScott

Milestone: 8.8.1

comment:68 in reply to:  59 Changed 3 weeks ago by tdammers

Replying to simonpj:

  1. Try switching back to FV. That is, take the currently-deleted code for tyCoFVsOfType, and re-instate it, but modify it not to take the kinds at the variable occurrences. Now use that instead of the new fvs_of_type. It "obviously" should be a lot slower -- is it?

I'm slightly confused here. fvs_of_type already uses FV, before and after the patch; the patch however introduces tcvs_of_type, which uses VarSets directly instead of going through FV, and this function is where the slowdown appears to be.

I changed the code such that it always uses fvs_of_type (and thus FV) instead of tcvs_of_type, and the results are mixed:

  • The two perf test cases that fail validation now perform slightly better
  • ...but not enough to make the tests pass
  • ...and now 3 more perf tests fail

Output from ./validate:

==== STAGE 2 TESTS ====

Unexpected results from:
TEST="T12227 T12545 T14052 T3234 T5321Fun T5631"

SUMMARY for test run started at Mon Jul 30 12:28:36 2018 CEST
 0:22:35 spent to go through
    6350 total tests, which gave rise to
   19866 test cases, of which
   13175 were skipped

      36 had missing libraries
    6495 expected passes
     154 expected failures

       0 caused framework failures
       0 caused framework warnings
       0 unexpected passes
       1 unexpected failures
       5 unexpected stat failures

Unexpected failures:
   /tmp/ghctest-51wr9fmd/test   spaces/./simplCore/should_compile/T3234.run  T3234 [stderr mismatch] (optasm)

Unexpected stat failures:
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T5631.run     T5631 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T12227.run    T12227 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T12545.run    T12545 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/should_run/T14052.run  T14052 [stat not good enough] (ghci)

comment:69 Changed 3 weeks ago by tdammers

Richard's original patch: http://git.haskell.org/ghc.git/log/refs/heads/wip/T14880.

I'll push experiments to branches wip/T14880/...

comment:70 Changed 3 weeks ago by simonpj

Just to record that we decided that the experiment in comment:68 is misleading.

There are two main changes in Richard's patch

  1. Do closeOverKinds at the end, here:
    -- NEW
    tyCoVarsOfType = closeOverKinds . tcvs_of_type
    

rather than independently at each leaf:

-- OLD
tyCoFVsOfType (TyVarTy v)        a b c = (unitFV v `unionFV` tyCoFVsOfType (tyVarKind v)) a b c

-- NEW
tcvs_of_type (TyVarTy v) = unitVarSet v
  1. Insetad of using FV and then taking its VarSet, use VarSets directly. This jolly well ought to be faster.

But in comment:68, Tobias used the new (1) but the old (2) and thus closed over the kind vars both at the leaves and at the end. That's not going to be fast.

We agreed that a it'd be iluminating to start from the existing GHC baseline and try (2) alone: is VarSet faster than FV?

Then add (1).

NB: it is (1) that fixes the actual bug in this ticket. Ultimately it is not optional. But (1) should be a perf boost too! If a variable occurs N times, we'll take the free vars of its kind once instead of N times.

comment:71 Changed 3 weeks ago by simonpj

I was suspicious about comment:64 so I took a look. Sure enough

  • The list [1..n] wasn't being fused away because it was shared between insert and union, while balanced worked differently and generated no intermediate list.
  • More significantly, the integers were in ascending order which is fantastically good for balanced union: the tree that implements the set is always balanced and never needs to be transformed.

So I tried with a random tree. Code is below. Results are much more plausible

  • balanced: 1.15G alloc, 1.6sec
  • union: 1.2G alloc, 1.6sec
  • insert: 1.17G alloc, 1.6sec

Conclusion: containers is behaving OK.

{-#LANGUAGE LambdaCase #-}
module Main( run, main ) where

import qualified Data.IntSet as Set
import System.Environment
import System.Random

data Tree = Leaf Int | Node Tree Tree

randomTree :: Int -> Int -> IO Tree
randomTree lo hi
  | lo == hi
  = do { n <- randomRIO (1,100000)
       ; return (Leaf n) }
  | otherwise
  = do { l <- randomTree lo mid
       ; r <- randomTree (mid+1) hi
       ; return (Node l r) }
  where
    mid = (lo+hi) `div` 2

main = do { args <- getArgs
          ; case args of
              (size:what:_) -> do { t <- randomTree 1 (read size)
                                  ; print (Set.size (run what t)) }
              _ -> error "Invalid args" }

run :: String -> Tree -> Set.IntSet
run "balanced" t = go t
  where
    go (Leaf n) = Set.singleton n
    go (Node t1 t2) = go t1 `Set.union` go t2

run "insert" t = go t Set.empty
  where
    go (Leaf n) acc = Set.insert n acc
    go (Node t1 t2) acc = go t1 (go t2 acc)

run "union" t = go t Set.empty
  where
    go (Leaf n) acc = Set.union (Set.singleton n) acc
    go (Node t1 t2) acc = go t1 (go t2 acc)

comment:72 in reply to:  71 Changed 3 weeks ago by tdammers

Replying to simonpj:

I was suspicious about comment:64 so I took a look. Sure enough

  • The list [1..n] wasn't being fused away because it was shared between insert and union, while balanced worked differently and generated no intermediate list.

Ah yes. Lesson learned once again: benchmarking is hard and subtle.

  • More significantly, the integers were in ascending order which is fantastically good for balanced union: the tree that implements the set is always balanced and never needs to be transformed.

Right. I probably had the wrong intuition about set performance characteristics here.

comment:73 Changed 3 weeks ago by Richard Eisenberg <rae@…>

In f8618a9b/ghc:

Remove the type-checking knot.

Bug #15380 hangs because a knot-tied TyCon ended up in a kind.
Looking at the code in tcInferApps, I'm amazed this hasn't happened
before! I couldn't think of a good way to fix it (with dependent
types, we can't really keep types out of kinds, after all), so
I just went ahead and removed the knot.

This was remarkably easy to do. In tcTyVar, when we find a TcTyCon,
just use it. (Previously, we looked up the knot-tied TyCon and used
that.) Then, during the final zonk, replace TcTyCons with the real,
full-blooded TyCons in the global environment. It's all very easy.

The new bit is explained in the existing
Note [Type checking recursive type and class declarations]
in TcTyClsDecls.

Naturally, I removed various references to the knot and the
zonkTcTypeInKnot (and related) functions. Now, we can print types
during type checking with abandon!

NB: There is a teensy error message regression with this patch,
around the ordering of quantified type variables. This ordering
problem is fixed (I believe) with the patch for #14880. The ordering
affects only internal variables that cannot be instantiated with
any kind of visible type application.

There is also a teensy regression around the printing of types
in TH splices. I think this is really a TH bug and will file
separately.

Test case: dependent/should_fail/T15380

comment:74 Changed 3 weeks ago by tdammers

New results: following the suggestion to go back to the (incorrect) baseline before the patch, and then *only* rewriting tyCoVarsOf{Type|Co}[s] in terms of VarSet, thus skipping the detour via FV, I compared three different versions:

  • the pre-patch baseline, available on the wip/T14880-baseline branch on git.haskell.org;
  • the patch, applied fully, available on the wip/T14880 branch
  • the baseline version with the tvCoVarsOf... functions rewritten in terms of VarSets, copying the logic from the FV flavors, available on the wip/T14880-just-tvs branch.

The expectation was that -just-tvs would perform better than the baseline, because it skips the additional machinery of FV; however, it turns out that it's actually slower and allocates more.

Detailed timing results:

** Baseline **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880-baseline.ticky -RTS"
	User time (seconds): 3.17
	System time (seconds): 0.08
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.27
	Maximum resident set size (kbytes): 292764
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 42788
	Voluntary context switches: 392
	Involuntary context switches: 18

** Just VarSet **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880-just-tvs.ticky -RTS"
	User time (seconds): 3.28
	System time (seconds): 0.06
	Percent of CPU this job got: 91%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.64
	Maximum resident set size (kbytes): 294916
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 6
	Minor (reclaiming a frame) page faults: 42784
	Voluntary context switches: 452
	Involuntary context switches: 20

** 14880 patch **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880.ticky -RTS"
	User time (seconds): 3.26
	System time (seconds): 0.07
	Percent of CPU this job got: 95%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.49
	Maximum resident set size (kbytes): 295684
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 11
	Minor (reclaiming a frame) page faults: 43047
	Voluntary context switches: 422
	Involuntary context switches: 17

comment:75 Changed 3 weeks ago by goldfire

Very interesting. What happens if you make a version of FV that uses its same setup (passing around partially applied functions, etc.) but without the list? Then rewrite the functions in terms of that?

Also, I think it would be helpful not to use ticky profiling when generating these results (unless the data somehow comes from the ticky output), but also to include the allocations.

Thanks!

comment:76 in reply to:  75 Changed 3 weeks ago by tdammers

Replying to goldfire:

Very interesting. What happens if you make a version of FV that uses its same setup (passing around partially applied functions, etc.) but without the list? Then rewrite the functions in terms of that?

Ah, good point, I'll try that. Shouldn't be overly difficult.

Also, I think it would be helpful not to use ticky profiling when generating these results (unless the data somehow comes from the ticky output), but also to include the allocations.

Right, of course. The data is just plain old time --verbose; the reason I have -ticky there is because the idea was to generate ticky profiles at the same time, and the thought behind it was that ticky overhead would be somewhat proportional, and thus not disruptive. But I'll do another run without ticky.

comment:77 Changed 3 weeks ago by tdammers

Hmm, so when I skip the -ticky, the difference between the baseline and the VarSet version melts away:

Baseline:

        User time (seconds): 3.26
        System time (seconds): 0.09
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.53

VarSet:

        User time (seconds): 3.25
        System time (seconds): 0.06
        Percent of CPU this job got: 96%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.44

In fact, VarSet now ends up faster than FV.

comment:78 Changed 3 weeks ago by goldfire

Before we continue drilling in that direction, are these results reproducible? Timing is notoriously fickle, and I think it would be wise to make sure we know what we're getting into here.

Also, I do think it's worthwhile to still measure allocations, just as another data point.

If the FV/VarSet change isn't the driving force here, then what is? I recommend removing the code that adds the fvs of a variable's kind (without replacing this logic with anything) and testing again. That should surely take less time. (Be careful, e.g., by comparing -ddump-tc-traces, that GHC's overall behavior doesn't change by doing this.) Then, add the closeOverKinds. Maybe the algorithm in closeOverKinds (simple though it is) is somehow grossly inefficient.

comment:79 in reply to:  78 Changed 3 weeks ago by tdammers

Replying to goldfire:

Before we continue drilling in that direction, are these results reproducible? Timing is notoriously fickle, and I think it would be wise to make sure we know what we're getting into here.

Oh yes, absolutely right, and I just did exactly that.

Turns out the switch from FV to VarSet increases allocations quite a bit, but not enough to explain the failure completely - only about halfway. Observe:

baseline:

("bytes allocated", "1145886664")

with VarSet instead of fv:

("bytes allocated", "1176456040")

patch applied fully:

("bytes allocated", "1207902048")

This uses the settings from the test suite directly, and a validate build. So it should be perfectly reproducible, something like this:

for NAME in T14880-baseline T14880-just-tvs T14880
do
    echo "$NAME"
    git checkout "wip/$NAME" && (make -j2 && (/usr/bin/time --verbose ~/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -V0 -tlogs/$NAME.stat --machine-readable -RTS) |& tee logs/$NAME.timing && echo "$NAME")
done

comment:80 Changed 2 weeks ago by goldfire

The next step would appear to be to duplicate the FV module (with identical function structure) but to drop the lists. Then use the new NonDetFV instead of FV. I would be shocked if the allocations go anywhere but down.

Then, we'll have a new conundrum: why is VarSet slower than this NonDetFV? But let's not figure that out until we've made the observation that it is indeed slower.

comment:81 Changed 2 weeks ago by tdammers

NonDetFV (NDFV) speed / allocations test still running.

While writing the code for NDFV, one thing caught my eye however: the delFV / delFVs functions, rather than actually remove anything from the have list and the haveSet, *add* to the in_scope set; naturally, rewriting this in terms of VarSet, where the delVarSet function actually removes the var from the set, will change performance characteristics, one way or the other, depending on the situation.

comment:82 Changed 2 weeks ago by tdammers

Speed test results: performance for the NDFV version is virtually identical to that of the baseline (which uses FV). So leaving out the list when it is not going to be demanded anyway makes absolutely no difference, as one would expect.

Full logs to be attached. The relevant branches can be found on git.haskell.org:

  • wip/T14880-baseline (before the patch)
  • wip/T14880 (patch applied as-is)
  • wip/T14880-just-tvs (just the move from FV to VarSet)
  • wip/T14880-nondet-fv (introducing the NDFV flavour of FV, but otherwise identical to the baseline)

To reproduce logs, use something like:

for NAME in T14880-nondet-fv T14880 T14880-baseline T14880-just-tvs
do
    echo "$NAME"
    git checkout "wip/$NAME" \
    && (
        make -j2 \
        && (/usr/bin/time --verbose \
            ~/path/to/ghc/inplace/bin/ghc-stage2 \
                -c testsuite/tests/perf/compiler/T5631.hs \
                -fforce-recomp \
                +RTS -V0 -tlogs/$NAME.stat --machine-readable -RTS) \
        |& tee logs/$NAME.timing \
        && echo "$NAME"
       )
done

Changed 2 weeks ago by tdammers

Attachment: logs2.tar.gz added

comment:83 Changed 2 weeks ago by goldfire

OK. So NDFV behaves roughly identically to FV -- faster than just a bare VarSet. But why?

Let's look at the difference between NDFV and VarSet.

  1. The former has an InterestingVarFun. But I don't believe that's used in your tests.
  1. The former maintains a set of elements to be removed, while the latter just algorithmically removes elements when necessary.
  1. The former uses an accumulator to build sets, while the latter does not.

Which of these factors contributes to the performance change? To learn this, we can test each in isolation, by removing the difference and then measuring. I suspect (2). If this is the case, perhaps we'll discover an optimization that can apply well beyond GHC!

comment:84 Changed 4 days ago by simonpj

Re comment:83

  1. I think that InterestingVarFun is const True, so it only adds overhead.
  1. I think you mean that NDFV keeps a set of forall-bound variables, and checks in that set before adding to the accumulator. But foralls are rare (I think), so the set will typically be empty.
  1. That leaves the accumulator-style as the main candidate.

As Richard says, though, the way to find out is to try it one thing at a time.

Note: See TracTickets for help on using tickets.