Opened 8 months ago

Last modified 6 hours ago

#14880 new bug

GHC panic: updateRole

Reported by: RyanGlScott Owned by: goldfire
Priority: normal Milestone: 8.6.2
Component: Compiler (Type checker) Version: 8.2.2
Keywords: TypeInType Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time crash or panic Test Case:
Blocked By: Blocking: #15592
Related Tickets: #15076 Differential Rev(s): Phab:D4769, Phab:D5141, Phab:D5147, Phab:D5150, Phab:D5208
Wiki Page:

Description (last modified by RyanGlScott)

The following program panics on GHC 8.0.2, 8.2.2, 8.4.1, and HEAD:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeInType #-}
{-# LANGUAGE TypeOperators #-}
module Bug where

import Data.Kind
import Data.Type.Equality ((:~:))

type SameKind (a :: k) (b :: k) = (() :: Constraint)

data TyFun :: Type -> Type -> Type
type a ~> b = TyFun a b -> Type
infixr 0 ~>
type family Apply (f :: k1 ~> k2) (x :: k1) :: k2

data WhyCongSym1 (x :: Type) :: forall (a :: x)
                                       (y :: Type)
                                       (z :: x).
                                Type ~> (x ~> y) ~> x ~> x ~> (a :~: z) ~> Type

data WhyCongSym0 :: forall (x :: Type)
                           (a :: x)
                           (y :: Type)
                           (z :: x).
                    Type ~> Type ~> (x ~> y) ~> x ~> x ~> (a :~: z) ~> Type
  where
    WhyCongSym0KindInference :: forall x arg.
                                SameKind (Apply WhyCongSym0 arg) (WhyCongSym1 arg) =>
                                WhyCongSym0 x
$ /opt/ghc/8.2.2/bin/ghci Bug.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/rgscott/.ghci
[1 of 1] Compiling Bug              ( Bug.hs, interpreted )
ghc: panic! (the 'impossible' happened)
  (GHC version 8.2.2 for x86_64-unknown-linux):
        updateRole
  WhyCongSym0
  arg_a1A6[sk:1]
  [a1A5 :-> 4, a2Cy :-> 0, a2Cz :-> 1, a2CA :-> 2, a2CB :-> 3]
  Call stack:
      CallStack (from HasCallStack):
        prettyCurrentCallStack, called at compiler/utils/Outputable.hs:1133:58 in ghc:Outputable
        callStackDoc, called at compiler/utils/Outputable.hs:1137:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcTyDecls.hs:656:23 in ghc:TcTyDecls

Attachments (4)

logs.tar.gz (22.5 KB) - added by tdammers 3 months ago.
Performance logs for comparison
intmapbench.hs (721 bytes) - added by tdammers 3 months ago.
IntMap mini-benchmark
logs2.tar.gz (2.4 KB) - added by tdammers 2 months ago.
perf-step3.log (32.2 KB) - added by tdammers 3 weeks ago.
Performance test results after step 3

Download all attachments as: .zip

Change History (156)

comment:1 Changed 8 months ago by RyanGlScott

Description: modified (diff)

comment:2 Changed 8 months ago by RyanGlScott

Here's as minimal of an example as I can conjure up:

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

data Bar :: Type -> Type where
    MkBar :: forall x arg.
             -- Commenting out the line below makes the issue go away
             Foo arg ~ Foo arg =>
             Bar x
$ /opt/ghc/8.2.2/bin/ghci Bug.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/rgscott/.ghci
[1 of 1] Compiling Bug              ( Bug.hs, interpreted )
ghc: panic! (the 'impossible' happened)
  (GHC version 8.2.2 for x86_64-unknown-linux):
        updateRole
  Bar
  arg_a1vT[sk:1]
  [a1vS :-> 0]
  Call stack:
      CallStack (from HasCallStack):
        prettyCurrentCallStack, called at compiler/utils/Outputable.hs:1133:58 in ghc:Outputable
        callStackDoc, called at compiler/utils/Outputable.hs:1137:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcTyDecls.hs:656:23 in ghc:TcTyDecls

comment:3 Changed 8 months ago by simonpj

Even simpler

 data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

 data Bar :: Type -> Type where
     MkBar :: forall x arg.
              -- Commenting out the line below makes the issue go away
              Proxy (Foo arg) ->
              Bar x

The panic is caused because the existentials for MkBar are messed up;

{-  MkBar
    univ_tvs:   (x :: *)
    ex_tvs:     (a :: arg_aZv)
                (arg_XZx :: *)
    arg ty:     Proxy @arg_XZx a
    result ty:  Foo @arg_XZx a

Note the confusion of two arg variables. Ran out of time at that point.

comment:4 Changed 7 months ago by simonpj

Keywords: TypeInType added; Roles removed
Owner: set to goldfire

The trouble is that the type of MkBar should really be

MkBar :: forall (x:Type) (arg:Type) {a:arg}.
         Proxy @(Proxy @arg a -> Type)
               (Foo arg @a)
      -> Bar x

where I have put in all the kind applications. The trouble is that MkBar has an implicit forall's variable a, whose kind mentions an explicit type variable arg. So the implicit argument must appear later in the telescope. But tcConDecl (the ConDeclGADT case) doesn't allow that:

             tkv_bndrs      = mkTyVarBinders Inferred  tkvs'
             user_tv_bndrs  = mkTyVarBinders Specified user_tvs'
             all_user_bndrs = tkv_bndrs ++ user_tv_bndrs

Notice that the inferred ones always come first. But here they can't!

Solution: do a topo-sort of the tyvars that is allowed to interleave the Inferred and Specified ones.

But is that the only place? If we try something like this with a function type signature thus

f :: forall (v :: *) (a :: Proxy (k :: v)). Proxy a
f = f

we get the error message

T14880.hs:24:6: error:
    * The kind of variable `k', namely `v',
      depends on variable `v' from an inner scope
      Perhaps bind `k' sometime after binding `v'
      NB: Implicitly-bound variables always come before other ones.
    * In the type signature:
        f :: forall (v :: *) (a :: Proxy (k :: v)). Proxy a

But is there anything really wrong with this signature? It we topo-sorted the type variables we'd be fine.

There are other places (exp in TcTyClsDecls) where we seem to put all the inferred variables around the outside. I don't know how to be sure in which, if any, of these case we have a bug. Maybe we need more topo-sorts?

Amnother oddd thing about this ticket is the data decl for Foo:

data Foo (v :: Type) :: forall (a :: v). Proxy a -> Type

That is a strange kind signature. I don't expect to see foralls to the right of the :: in such a decl. So, more questions

  • Is it valuable to permit TyCons whose kinds are not in prenex form (i.e. all foralls at the front)?

If so, we should document it.

Meanwhile I'm not going to fix this because it may all come out in the wash of Richards's upcoming kind-inference patch.

It's nothing to do with roles!

comment:5 Changed 7 months ago by goldfire

The updateRole panic triggers on various invalid-tycon problems. I'm not surprised that roles aren't involved.

Yes, a forall to the right of the :: in a data declaration makes fine sense. For example, here is one way to define heterogeneous equality:

data (:~~:) :: forall k1. k1 -> forall k2. k2 -> Type where
  HRefl :: a :~~: a

This definition is actually a touch more general than one where k1 and k2 are quantified prenex, as the "Practical Type Inference" paper explains.

I don't see a need to document this specially.

As for the toposorting suggestions: I've considered it a design principle that implicit quantification comes before all explicit quantification. Of course, we can't panic when something goes wrong here, but I think this design is a good one, just to have some rules that users can rely on.

My existing patch still panics on this case, but I agree that no one should spend time on this until that patch lands.

comment:6 Changed 7 months ago by goldfire

With the Foo in comment:3, this also panics:

quux :: forall x arg. Proxy (Foo arg) -> Maybe x
quux = undefined

comment:7 in reply to:  6 Changed 6 months ago by RyanGlScott

Replying to goldfire:

With the Foo in comment:3, this also panics:

It does? I must be doing something wrong, since this appears to compile for me:

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

quux :: forall x arg. Proxy (Foo arg) -> Maybe x
quux = undefined

comment:8 Changed 6 months ago by simonpj

My existing patch still panics on this case, but I agree that no one should spend time on this until that patch lands.

Richard, now that the patch has landed, and this stuff is in your head, might you look at this? I doubt it's hard.

comment:9 Changed 6 months ago by RyanGlScott

I think that this ticket and #15076 share a symptom in common. This claim is based on the fact that slightly tweaking the program in comment:6 / comment:7 :

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeInType #-}
module Bug where

import Data.Kind
import Data.Proxy

data Foo (x :: Type) :: forall (a :: x). Proxy a -> Type

quux :: forall arg. Proxy (Foo arg) -> ()
quux (_ :: _) = ()

Yields:

$ ~/Software/ghc/inplace/bin/ghc-stage2 Bug.hs
[1 of 1] Compiling Bug              ( Bug.hs, Bug.o )

Bug.hs:12:12: error:ghc-stage2: panic! (the 'impossible' happened)
  (GHC version 8.5.20180420 for x86_64-unknown-linux):
        No skolem info:
  [arg_aZr[sk:1]]
  Call stack:
      CallStack (from HasCallStack):
        callStackDoc, called at compiler/utils/Outputable.hs:1162:37 in ghc:Outputable
        pprPanic, called at compiler/typecheck/TcErrors.hs:3224:5 in ghc:TcErrors

Which is the same panic as in #15076. Plus, if you run this program with -ddump-tc-trace, you see:

reportUnsolved (after zonking):
  Free tyvars: arg_aZr[sk:1]
  Tidy env: ([ESf71 :-> 1], [aZr :-> arg_aZr[sk:1]])
  Wanted: WC {wc_impl =
                Implic {
                  TcLevel = 2
                  Skolems = (a_a1p8[sk:2] :: arg_aZr[sk:1]) arg_a1p9[sk:2]
                  No-eqs = True
                  Status = Unsolved
                  Given =
                  Wanted =
                    WC {wc_simple =
                          [D] _ {0}:: Proxy
                                        (Foo arg_a1p9[sk:2] a_a1p8[sk:2]) (CHoleCan: TypeHole(_))}
                  Binds = EvBindsVar<a1pg>
                  Needed inner = []
                  Needed outer = []
                  the type signature for:
                    quux :: forall (a :: arg_aZr[sk:1]) arg. Proxy (Foo arg a) -> () }}

Just like in #15076, arg_aZr is not bound in any implication. On the other hand, there is another type variable, arg_a1p9, that is suspiciously similar. Moreover, the type signature it gives for quux:

quux :: forall (a :: arg_aZr[sk:1]) arg. Proxy (Foo arg a) -> ()

Seems to have two different copies of arg! This is especially interesting in light of comment:3, where simonpj discovered that the existentially quantified tyvars in MkBar were screwed up, leading to two copies of arg.

comment:10 Changed 6 months ago by simonpj

The trouble is that MkBar has an implicit forall's variable a, whose kind mentions an explicit type variable arg.

Richard and I discussed this on Monday. Our conclusions

  • Inferred variables should always precede Specified ones. That is, do not top-sort
  • Idea: simply refrain from quantifying any inferred variables that mention specified ones. They'll get defaulted to Any which is probably fine. This refraining can readily be done in candidatesQTyVarsOfType.
  • Also came up: close over kinds once in tyCoVarsOfType instead of at every leaf. This is not just an efficiency issue: consider
    tyCoVarsOfType (forall a. b::a)
    

Richard is on the job

comment:11 Changed 6 months ago by simonpj

Trac #15076 is another example of the same phenomenon.

comment:12 Changed 6 months ago by simonpj

And probably #14040 too.

comment:13 Changed 6 months ago by goldfire

Yes, I have a fix here, but it causes a failure in, e.g., indexed-types/should_compile/T12369. Haven't had time to investigate.

comment:14 Changed 6 months ago by simonpj

Maybe #14887 too!

comment:15 Changed 5 months ago by simonpj

Blocking: 15170 added

comment:16 Changed 5 months ago by simonpj

Blocking: 15170 removed

I believe that #15170 is also blocked on this.

comment:17 Changed 5 months ago by simonpj

In a call today Richard and Simon decided:

  • Kill decideKindGeneralisationPlan; instead always generalise. (Explain why we don't need to worry about local generalisation as we do with terms.)
  • And hence kill tc_hs_sig_type, leaving tc_hs_sig_type_and_gen
  • tc_hs_sig_type_and_gen should not call solveEqualities (which fails if there are unsolved equalities)
  • Instead call solveLocalEqualities, gather unsolved equalities, and do not quantify over their free vars. NB: tcImplicitTKBndrs already calls solveLocalEqualities, and must do so. Needs a Note to explain why: it's so that we can top-sort the bound variables
  • Promote all the free vars of the unsolved constraints. Principle: all free vars should either be generalised or promoted.
  • Get rid of zonkPromote and friends altogether
Last edited 5 months ago by simonpj (previous) (diff)

comment:18 Changed 5 months ago by goldfire

While I agree with comment:17, that doesn't have anything to do with this ticket...

comment:19 Changed 4 months ago by goldfire

Differential Rev(s): Phab:D4769

Since my last update, the patch in Phab:D4769 works but has performance trouble. I've asked Ben & friends for help.

comment:20 Changed 3 months ago by tdammers

Could you maybe provide a quick summary of the exact performance trouble? I'm running some tests as we speak, but it would be easier for me if you could hint at some tests that produce particularly bad behavior, or maybe an example program on which GHC peforms badly after this patch.

comment:21 Changed 3 months ago by goldfire

To be honest, I don't recall the details any more. But the problems show up in the performance tests, so running a validation should show them up.

comment:22 Changed 3 months ago by tdammers

OK, so I tried to validate, but I'm getting core lint errors while compiling stage2:

"inplace/bin/ghc-stage1" -this-unit-id rts -shared -dynamic -dynload deploy -no-auto-link-packages -Lrts/dist/build -lffi -optl-Wl,-rpath -optl-Wl,'$ORIGIN' -optl-Wl,-zorigin `cat rts/dist/libs.depend` rts/dist/build/Arena.l_dyn_o rts/dist/build/Stable.l_dyn_o rts/dist/build/Linker.l_dyn_o rts/dist/build/RtsFlags.l_dyn_o rts/dist/build/Sparks.l_dyn_o rts/dist/build/Profiling.l_dyn_o rts/dist/build/RtsStartup.l_dyn_o rts/dist/build/FileLock.l_dyn_o rts/dist/build/StgPrimFloat.l_dyn_o rts/dist/build/Pool.l_dyn_o rts/dist/build/StaticPtrTable.l_dyn_o rts/dist/build/Disassembler.l_dyn_o rts/dist/build/TopHandler.l_dyn_o rts/dist/build/WSDeque.l_dyn_o rts/dist/build/RetainerSet.l_dyn_o rts/dist/build/ProfilerReport.l_dyn_o rts/dist/build/Schedule.l_dyn_o rts/dist/build/RtsAPI.l_dyn_o rts/dist/build/Adjustor.l_dyn_o rts/dist/build/Globals.l_dyn_o rts/dist/build/Capability.l_dyn_o rts/dist/build/xxhash.l_dyn_o rts/dist/build/Messages.l_dyn_o rts/dist/build/Interpreter.l_dyn_o rts/dist/build/Hpc.l_dyn_o rts/dist/build/Libdw.l_dyn_o rts/dist/build/RetainerProfile.l_dyn_o rts/dist/build/ClosureFlags.l_dyn_o rts/dist/build/LdvProfile.l_dyn_o rts/dist/build/StgCRun.l_dyn_o rts/dist/build/Task.l_dyn_o rts/dist/build/ProfilerReportJson.l_dyn_o rts/dist/build/RtsUtils.l_dyn_o rts/dist/build/Weak.l_dyn_o rts/dist/build/Stats.l_dyn_o rts/dist/build/Trace.l_dyn_o rts/dist/build/Hash.l_dyn_o rts/dist/build/RaiseAsync.l_dyn_o rts/dist/build/ProfHeap.l_dyn_o rts/dist/build/RtsMessages.l_dyn_o rts/dist/build/RtsSymbolInfo.l_dyn_o rts/dist/build/RtsDllMain.l_dyn_o rts/dist/build/Timer.l_dyn_o rts/dist/build/HsFFI.l_dyn_o rts/dist/build/OldARMAtomic.l_dyn_o rts/dist/build/STM.l_dyn_o rts/dist/build/ThreadLabels.l_dyn_o rts/dist/build/Proftimer.l_dyn_o rts/dist/build/RtsSymbols.l_dyn_o rts/dist/build/PathUtils.l_dyn_o rts/dist/build/Threads.l_dyn_o rts/dist/build/RtsMain.l_dyn_o rts/dist/build/CheckUnload.l_dyn_o rts/dist/build/Inlines.l_dyn_o rts/dist/build/Ticky.l_dyn_o rts/dist/build/LibdwPool.l_dyn_o rts/dist/build/ThreadPaused.l_dyn_o rts/dist/build/Printer.l_dyn_o rts/dist/build/fs.l_dyn_o rts/dist/build/hooks/OutOfHeap.l_dyn_o rts/dist/build/hooks/MallocFail.l_dyn_o rts/dist/build/hooks/FlagDefaults.l_dyn_o rts/dist/build/hooks/OnExit.l_dyn_o rts/dist/build/hooks/LongGCSync.l_dyn_o rts/dist/build/hooks/StackOverflow.l_dyn_o rts/dist/build/sm/MBlock.l_dyn_o rts/dist/build/sm/Scav.l_dyn_o rts/dist/build/sm/GCUtils.l_dyn_o rts/dist/build/sm/CNF.l_dyn_o rts/dist/build/sm/Compact.l_dyn_o rts/dist/build/sm/Sweep.l_dyn_o rts/dist/build/sm/GCAux.l_dyn_o rts/dist/build/sm/MarkWeak.l_dyn_o rts/dist/build/sm/BlockAlloc.l_dyn_o rts/dist/build/sm/Evac_thr.l_dyn_o rts/dist/build/sm/GC.l_dyn_o rts/dist/build/sm/Sanity.l_dyn_o rts/dist/build/sm/Evac.l_dyn_o rts/dist/build/sm/Storage.l_dyn_o rts/dist/build/sm/Scav_thr.l_dyn_o rts/dist/build/eventlog/EventLogWriter.l_dyn_o rts/dist/build/eventlog/EventLog.l_dyn_o rts/dist/build/linker/elf_reloc_aarch64.l_dyn_o rts/dist/build/linker/MachO.l_dyn_o rts/dist/build/linker/elf_plt.l_dyn_o rts/dist/build/linker/elf_plt_aarch64.l_dyn_o rts/dist/build/linker/M32Alloc.l_dyn_o rts/dist/build/linker/elf_got.l_dyn_o rts/dist/build/linker/Elf.l_dyn_o rts/dist/build/linker/CacheFlush.l_dyn_o rts/dist/build/linker/SymbolExtras.l_dyn_o rts/dist/build/linker/elf_plt_arm.l_dyn_o rts/dist/build/linker/LoadArchive.l_dyn_o rts/dist/build/linker/PEi386.l_dyn_o rts/dist/build/linker/elf_reloc.l_dyn_o rts/dist/build/linker/elf_util.l_dyn_o rts/dist/build/posix/OSMem.l_dyn_o rts/dist/build/posix/GetTime.l_dyn_o rts/dist/build/posix/OSThreads.l_dyn_o rts/dist/build/posix/Itimer.l_dyn_o rts/dist/build/posix/TTY.l_dyn_o rts/dist/build/posix/Signals.l_dyn_o rts/dist/build/posix/Select.l_dyn_o rts/dist/build/posix/GetEnv.l_dyn_o   rts/dist/build/StgStartup.l_dyn_o rts/dist/build/PrimOps.l_dyn_o rts/dist/build/HeapStackCheck.l_dyn_o rts/dist/build/Updates.l_dyn_o rts/dist/build/Exception.l_dyn_o rts/dist/build/StgMiscClosures.l_dyn_o rts/dist/build/Apply.l_dyn_o rts/dist/build/Compact.l_dyn_o rts/dist/build/StgStdThunks.l_dyn_o rts/dist/build/AutoApply.l_dyn_o  -fPIC -dynamic -eventlog  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror -Iincludes -Iincludes/dist -Iincludes/dist-derivedconstants/header -Iincludes/dist-ghcconstants/header -Irts -Irts/dist/build -DCOMPILING_RTS -DFS_NAMESPACE=rts -this-unit-id rts -dcmm-lint      -i -irts -irts/dist/build -Irts/dist/build -irts/dist/build/./autogen -Irts/dist/build/./autogen            -O2 -Wcpp-undef    -Wnoncanonical-monad-instances   -o rts/dist/build/libHSrts_l-ghc8.5.20180710.so
"rm" -f rts/dist/build/libHSrts_thr_l-ghc8.5.20180710.so
"inplace/bin/ghc-stage1" -this-unit-id rts -shared -dynamic -dynload deploy -no-auto-link-packages -Lrts/dist/build -lffi -optl-Wl,-rpath -optl-Wl,'$ORIGIN' -optl-Wl,-zorigin `cat rts/dist/libs.depend` rts/dist/build/Arena.thr_l_dyn_o rts/dist/build/Stable.thr_l_dyn_o rts/dist/build/Linker.thr_l_dyn_o rts/dist/build/RtsFlags.thr_l_dyn_o rts/dist/build/Sparks.thr_l_dyn_o rts/dist/build/Profiling.thr_l_dyn_o rts/dist/build/RtsStartup.thr_l_dyn_o rts/dist/build/FileLock.thr_l_dyn_o rts/dist/build/StgPrimFloat.thr_l_dyn_o rts/dist/build/Pool.thr_l_dyn_o rts/dist/build/StaticPtrTable.thr_l_dyn_o rts/dist/build/Disassembler.thr_l_dyn_o rts/dist/build/TopHandler.thr_l_dyn_o rts/dist/build/WSDeque.thr_l_dyn_o rts/dist/build/RetainerSet.thr_l_dyn_o rts/dist/build/ProfilerReport.thr_l_dyn_o rts/dist/build/Schedule.thr_l_dyn_o rts/dist/build/RtsAPI.thr_l_dyn_o rts/dist/build/Adjustor.thr_l_dyn_o rts/dist/build/Globals.thr_l_dyn_o rts/dist/build/Capability.thr_l_dyn_o rts/dist/build/xxhash.thr_l_dyn_o rts/dist/build/Messages.thr_l_dyn_o rts/dist/build/Interpreter.thr_l_dyn_o rts/dist/build/Hpc.thr_l_dyn_o rts/dist/build/Libdw.thr_l_dyn_o rts/dist/build/RetainerProfile.thr_l_dyn_o rts/dist/build/ClosureFlags.thr_l_dyn_o rts/dist/build/LdvProfile.thr_l_dyn_o rts/dist/build/StgCRun.thr_l_dyn_o rts/dist/build/Task.thr_l_dyn_o rts/dist/build/ProfilerReportJson.thr_l_dyn_o rts/dist/build/RtsUtils.thr_l_dyn_o rts/dist/build/Weak.thr_l_dyn_o rts/dist/build/Stats.thr_l_dyn_o rts/dist/build/Trace.thr_l_dyn_o rts/dist/build/Hash.thr_l_dyn_o rts/dist/build/RaiseAsync.thr_l_dyn_o rts/dist/build/ProfHeap.thr_l_dyn_o rts/dist/build/RtsMessages.thr_l_dyn_o rts/dist/build/RtsSymbolInfo.thr_l_dyn_o rts/dist/build/RtsDllMain.thr_l_dyn_o rts/dist/build/Timer.thr_l_dyn_o rts/dist/build/HsFFI.thr_l_dyn_o rts/dist/build/OldARMAtomic.thr_l_dyn_o rts/dist/build/STM.thr_l_dyn_o rts/dist/build/ThreadLabels.thr_l_dyn_o rts/dist/build/Proftimer.thr_l_dyn_o rts/dist/build/RtsSymbols.thr_l_dyn_o rts/dist/build/PathUtils.thr_l_dyn_o rts/dist/build/Threads.thr_l_dyn_o rts/dist/build/RtsMain.thr_l_dyn_o rts/dist/build/CheckUnload.thr_l_dyn_o rts/dist/build/Inlines.thr_l_dyn_o rts/dist/build/Ticky.thr_l_dyn_o rts/dist/build/LibdwPool.thr_l_dyn_o rts/dist/build/ThreadPaused.thr_l_dyn_o rts/dist/build/Printer.thr_l_dyn_o rts/dist/build/fs.thr_l_dyn_o rts/dist/build/hooks/OutOfHeap.thr_l_dyn_o rts/dist/build/hooks/MallocFail.thr_l_dyn_o rts/dist/build/hooks/FlagDefaults.thr_l_dyn_o rts/dist/build/hooks/OnExit.thr_l_dyn_o rts/dist/build/hooks/LongGCSync.thr_l_dyn_o rts/dist/build/hooks/StackOverflow.thr_l_dyn_o rts/dist/build/sm/MBlock.thr_l_dyn_o rts/dist/build/sm/Scav.thr_l_dyn_o rts/dist/build/sm/GCUtils.thr_l_dyn_o rts/dist/build/sm/CNF.thr_l_dyn_o rts/dist/build/sm/Compact.thr_l_dyn_o rts/dist/build/sm/Sweep.thr_l_dyn_o rts/dist/build/sm/GCAux.thr_l_dyn_o rts/dist/build/sm/MarkWeak.thr_l_dyn_o rts/dist/build/sm/BlockAlloc.thr_l_dyn_o rts/dist/build/sm/Evac_thr.thr_l_dyn_o rts/dist/build/sm/GC.thr_l_dyn_o rts/dist/build/sm/Sanity.thr_l_dyn_o rts/dist/build/sm/Evac.thr_l_dyn_o rts/dist/build/sm/Storage.thr_l_dyn_o rts/dist/build/sm/Scav_thr.thr_l_dyn_o rts/dist/build/eventlog/EventLogWriter.thr_l_dyn_o rts/dist/build/eventlog/EventLog.thr_l_dyn_o rts/dist/build/linker/elf_reloc_aarch64.thr_l_dyn_o rts/dist/build/linker/MachO.thr_l_dyn_o rts/dist/build/linker/elf_plt.thr_l_dyn_o rts/dist/build/linker/elf_plt_aarch64.thr_l_dyn_o rts/dist/build/linker/M32Alloc.thr_l_dyn_o rts/dist/build/linker/elf_got.thr_l_dyn_o rts/dist/build/linker/Elf.thr_l_dyn_o rts/dist/build/linker/CacheFlush.thr_l_dyn_o rts/dist/build/linker/SymbolExtras.thr_l_dyn_o rts/dist/build/linker/elf_plt_arm.thr_l_dyn_o rts/dist/build/linker/LoadArchive.thr_l_dyn_o rts/dist/build/linker/PEi386.thr_l_dyn_o rts/dist/build/linker/elf_reloc.thr_l_dyn_o rts/dist/build/linker/elf_util.thr_l_dyn_o rts/dist/build/posix/OSMem.thr_l_dyn_o rts/dist/build/posix/GetTime.thr_l_dyn_o rts/dist/build/posix/OSThreads.thr_l_dyn_o rts/dist/build/posix/Itimer.thr_l_dyn_o rts/dist/build/posix/TTY.thr_l_dyn_o rts/dist/build/posix/Signals.thr_l_dyn_o rts/dist/build/posix/Select.thr_l_dyn_o rts/dist/build/posix/GetEnv.thr_l_dyn_o   rts/dist/build/StgStartup.thr_l_dyn_o rts/dist/build/PrimOps.thr_l_dyn_o rts/dist/build/HeapStackCheck.thr_l_dyn_o rts/dist/build/Updates.thr_l_dyn_o rts/dist/build/Exception.thr_l_dyn_o rts/dist/build/StgMiscClosures.thr_l_dyn_o rts/dist/build/Apply.thr_l_dyn_o rts/dist/build/Compact.thr_l_dyn_o rts/dist/build/StgStdThunks.thr_l_dyn_o rts/dist/build/AutoApply.thr_l_dyn_o  -fPIC -dynamic -optc-DTHREADED_RTS -eventlog  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror -Iincludes -Iincludes/dist -Iincludes/dist-derivedconstants/header -Iincludes/dist-ghcconstants/header -Irts -Irts/dist/build -DCOMPILING_RTS -DFS_NAMESPACE=rts -this-unit-id rts -dcmm-lint      -i -irts -irts/dist/build -Irts/dist/build -irts/dist/build/./autogen -Irts/dist/build/./autogen            -O2 -Wcpp-undef    -Wnoncanonical-monad-instances   -o rts/dist/build/libHSrts_thr_l-ghc8.5.20180710.so
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id ghc-prim-0.5.2.0 -hide-all-packages -i -ilibraries/ghc-prim/. -ilibraries/ghc-prim/dist-install/build -Ilibraries/ghc-prim/dist-install/build -ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/.    -optP-include -optPlibraries/ghc-prim/dist-install/build/./autogen/cabal_macros.h -package-id rts -this-unit-id ghc-prim -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/ghc-prim/dist-install/build -hidir libraries/ghc-prim/dist-install/build -stubdir libraries/ghc-prim/dist-install/build   -dynamic-too -c libraries/ghc-prim/./GHC/CString.hs -o libraries/ghc-prim/dist-install/build/GHC/CString.o -dyno libraries/ghc-prim/dist-install/build/GHC/CString.dyn_o
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id ghc-prim-0.5.2.0 -hide-all-packages -i -ilibraries/ghc-prim/. -ilibraries/ghc-prim/dist-install/build -Ilibraries/ghc-prim/dist-install/build -ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/dist-install/build/./autogen -Ilibraries/ghc-prim/.    -optP-include -optPlibraries/ghc-prim/dist-install/build/./autogen/cabal_macros.h -package-id rts -this-unit-id ghc-prim -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/ghc-prim/dist-install/build -hidir libraries/ghc-prim/dist-install/build -stubdir libraries/ghc-prim/dist-install/build   -dynamic-too -c libraries/ghc-prim/./GHC/IntWord64.hs -o libraries/ghc-prim/dist-install/build/GHC/IntWord64.o -dyno libraries/ghc-prim/dist-install/build/GHC/IntWord64.dyn_o
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id base-4.12.0.0 -hide-all-packages -i -ilibraries/base/. -ilibraries/base/dist-install/build -Ilibraries/base/dist-install/build -ilibraries/base/dist-install/build/./autogen -Ilibraries/base/dist-install/build/./autogen -Ilibraries/base/include -Ilibraries/base/dist-install/build/include   -optP-DOPTIMISE_INTEGER_GCD_LCM -optP-include -optPlibraries/base/dist-install/build/./autogen/cabal_macros.h -package-id ghc-prim-0.5.2.0 -package-id integer-gmp-1.0.2.0 -package-id rts -this-unit-id base -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/base/dist-install/build -hidir libraries/base/dist-install/build -stubdir libraries/base/dist-install/build   -dynamic-too -c libraries/base/./GHC/Base.hs-boot -o libraries/base/dist-install/build/GHC/Base.o-boot -dyno libraries/base/dist-install/build/GHC/Base.dyn_o-boot
"inplace/bin/ghc-stage1" -hisuf hi -osuf  o -hcsuf hc -static  -O0 -H64m -Wall -fllvm-fill-undef-with-garbage    -Werror    -this-unit-id base-4.12.0.0 -hide-all-packages -i -ilibraries/base/. -ilibraries/base/dist-install/build -Ilibraries/base/dist-install/build -ilibraries/base/dist-install/build/./autogen -Ilibraries/base/dist-install/build/./autogen -Ilibraries/base/include -Ilibraries/base/dist-install/build/include   -optP-DOPTIMISE_INTEGER_GCD_LCM -optP-include -optPlibraries/base/dist-install/build/./autogen/cabal_macros.h -package-id ghc-prim-0.5.2.0 -package-id integer-gmp-1.0.2.0 -package-id rts -this-unit-id base -XHaskell2010 -O -dcore-lint -dno-debug-output  -no-user-package-db -rtsopts  -Wno-trustworthy-safe -Wno-deprecated-flags     -Wnoncanonical-monad-instances  -odir libraries/base/dist-install/build -hidir libraries/base/dist-install/build -stubdir libraries/base/dist-install/build   -dynamic-too -c libraries/base/./GHC/Real.hs-boot -o libraries/base/dist-install/build/GHC/Real.o-boot -dyno libraries/base/dist-install/build/GHC/Real.dyn_o-boot
*** Core Lint errors : in result of Simplifier ***
<no location info>: warning:
    [RHS of $j_sCd :: [Char]]
    Join point has invalid type: $j_sCd :: [Char]
*** Offending Program ***
unpackCString# [InlPrag=NOINLINE CONLIKE] :: Addr# -> [Char]
[LclIdX,
 Arity=1,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 102 0}]
unpackCString#
  = \ (addr_atX :: Addr#) ->
      letrec {
        unpack_sBT [Occ=LoopBreaker] :: Int# -> [Char]
        [LclId,
         Arity=1,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 62 40}]
        unpack_sBT
          = \ (nh_atZ :: Int#) ->
              case indexCharOffAddr# addr_atX nh_atZ of ch_axY {
                __DEFAULT -> : @ Char (C# ch_axY) (unpack_sBT (+# nh_atZ 1#));
                '\NUL'# -> [] @ Char
              }; } in
      unpack_sBT 0#

----- snip -----

$trModule_sAS :: TrName
[LclId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
$trModule_sAS = TrNameS $trModule_sAR

$trModule :: Module
[LclIdX,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]
$trModule = Module $trModule_sAQ $trModule_sAS

*** End of Offense ***


<no location info>: error: 
Compilation had errors


libraries/ghc-prim/ghc.mk:4: recipe for target 'libraries/ghc-prim/dist-install/build/GHC/CString.o' failed
make[1]: *** [libraries/ghc-prim/dist-install/build/GHC/CString.o] Error 1
make[1]: *** Waiting for unfinished jobs....
Makefile:122: recipe for target 'all' failed
make: *** [all] Error 2

This is from running validate on ec9638b222 - as far as I can tell, this is about where the patch originally branched off of master; in any case it applies cleanly, and ec9638b222 itself validates without errors.

Since this error occurs while compiling stage2, my assumption is that the patch changes GHC to produce incorrect output; the incorrect stage1 compiler then fails with a core lint error while compiling stage2.

comment:23 Changed 3 months ago by tdammers

I also tried checking out the actual phab patch, but that too fails to build during validate.

comment:24 Changed 3 months ago by goldfire

I had posted about the fix for that behavior on the Well-Typed GitLab page, but I've now also updated Phab. If you download again from there, you should hopefully be OK.

comment:25 Changed 3 months ago by tdammers

Thanks! Yes, I'm OK now - too OK in fact, because now it validates cleanly, with no apparent performance problems.

comment:26 Changed 3 months ago by goldfire

Hooray! :)

Would you mind rebasing and committing, then?

comment:27 Changed 3 months ago by simonpj

with no apparent performance problems.

I would love it if it was possible to do a nofib-style run on testsuite/perf, producing a table showing changes before and after.

The current testsuite stuff just tells if you get more than 5% worse (or whatever). There could easily be a big swing we never see. It's laborious running the tests one by one.

Would be possible for nofib-analyse to look at the logs somehow? It'd be extremely helpful.

Meanwhile, yes, let's commit.

comment:28 Changed 3 months ago by tdammers

Ah, never mind, it seems I had misinterpreted what those phabricator tags mean; re-running validate on the correct commit now.

comment:29 Changed 3 months ago by tdammers

OK, I managed to build it, and get some proper test failures:

Unexpected failures:
   /tmp/ghctest-tmsk6xdn/test   spaces/./simplCore/should_compile/T3234.run  T3234 [stderr mismatch] (optasm)

Unexpected stat failures:
   /tmp/ghctest-tmsk6xdn/test   spaces/./perf/compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   /tmp/ghctest-tmsk6xdn/test   spaces/./perf/compiler/T5631.run     T5631 [stat not good enough] (normal)

So I'll look into those.

On a side note, Haddock now fails to compile, which can be worked around by setting HADDOCK_DOCS=NO in mk/validate.mk, but should, I think, be resolved before we can merge.

comment:30 Changed 3 months ago by tdammers

For completeness' sake, here's the Haddock compilation error:

"inplace/bin/ghc-stage2" -hisuf dyn_hi -osuf  dyn_o -hcsuf dyn_hc -fPIC -dynamic  -O0 -H64m -Wall  -Werror    -hide-all-packages -i -iutils/haddock/driver -iutils/haddock/haddock-api/src -iutils/haddock/haddock-library/vendor/attoparsec-0.13.1.0 -iutils/haddock/haddock-library/src-iutils/haddock/dist/build -Iutils/haddock/dist/build -iutils/haddock/dist/build/haddock/autogen -Iutils/haddock/dist/build/haddock/autogen  -optP-DIN_GHC_TREE -optP-include -optPutils/haddock/dist/build/haddock/autogen/cabal_macros.h -package-id Cabal-2.3.0.0 -package-id array-0.5.2.0 -package-id base-4.12.0.0 -package-id bytestring-0.10.8.2 -package-id containers-0.5.11.0 -package-id deepseq-1.4.4.0 -package-id directory-1.3.2.3 -package-id filepath-1.4.2 -package-id ghc-8.5 -package-id ghc-boot-8.5 -package-id transformers-0.5.5.0 -package-id xhtml-3000.2.2 -funbox-strict-fields -Wall -fwarn-tabs -O2 -threaded -XHaskell2010  -no-user-package-db -rtsopts  -Wno-unused-imports -Wno-deprecations     -Wnoncanonical-monad-instances  -odir utils/haddock/dist/build -hidir utils/haddock/dist/build -stubdir utils/haddock/dist/build    -c utils/haddock/haddock-api/src/Haddock/InterfaceFile.hs -o utils/haddock/dist/build/Haddock/InterfaceFile.dyn_o

utils/haddock/haddock-api/src/Haddock/Convert.hs:486:39: error:
    • Couldn't match type ‘UniqSet.UniqSet TyVar’
                     with ‘InterestingVarFun
                           -> VarSet -> ([Var], VarSet) -> ([Var], VarSet)’
      Expected type: TyConBinder -> FV
        Actual type: TyConBinder -> TyVarSet
    • In the first argument of ‘mapUnionFV’, namely
        ‘injectiveVarsOfBinder’
      In the second argument of ‘($)’, namely
        ‘mapUnionFV injectiveVarsOfBinder dropped_binders’
      In the expression:
        fvVarSet $ mapUnionFV injectiveVarsOfBinder dropped_binders
    |
486 |                            mapUnionFV injectiveVarsOfBinder dropped_binders
    |                                       ^^^^^^^^^^^^^^^^^^^^^
utils/haddock/ghc.mk:20: recipe for target 'utils/haddock/dist/build/Haddock/Convert.dyn_o' failed
make[1]: *** [utils/haddock/dist/build/Haddock/Convert.dyn_o] Error 1
make[1]: *** Waiting for unfinished jobs....
Makefile:122: recipe for target 'all' failed
make: *** [all] Error 2

comment:31 Changed 3 months ago by RyanGlScott

You should be able to fix that Haddock compilation error with this patch:

  • haddock-api/src/Haddock/Convert.hs

    diff --git a/haddock-api/src/Haddock/Convert.hs b/haddock-api/src/Haddock/Convert.hs
    index bf6fbab..9f7b15d 100644
    a b synifyType _ (TyConApp tc tys) 
    512512                  = splitAtList  tys binders
    513513            result_kind  = mkTyConKind remaining_binders res_kind
    514514            result_vars  = tyCoVarsOfType result_kind
    515             dropped_vars = fvVarSet $
    516                            mapUnionFV injectiveVarsOfBinder dropped_binders
     515            dropped_vars = mapUnionVarSet injectiveVarsOfBinder dropped_binders
    517516
    518517        in not (subVarSet result_vars dropped_vars)
    519518

comment:32 Changed 3 months ago by goldfire

This all looks like it's going swimmingly. Let me know if I need to intervene. Thanks!

Last edited 3 months ago by goldfire (previous) (diff)

comment:33 Changed 3 months ago by goldfire

I just checked test T3234 locally, and it seems to work. Perhaps something went awry in rebasing... (or perhaps I'm not in the right state locally somehow)

comment:34 in reply to:  33 Changed 3 months ago by tdammers

Replying to goldfire:

I just checked test T3234 locally, and it seems to work. Perhaps something went awry in rebasing... (or perhaps I'm not in the right state locally somehow)

I haven't rebased anything yet, I checked out the code directly from Phabricator's staging area, so it should be identical to what we're looking at here. However, T3234 is a stdout deviation, not a stat failure, so I'm not too worried about that one, it's probably just a cosmetic difference in output formatting or some such.

The thing I'm looking into right now is the stat failures, T5321Fun and T5631.

comment:35 in reply to:  31 Changed 3 months ago by tdammers

Replying to RyanGlScott:

You should be able to fix that Haddock compilation error with this patch: [...]

Thanks, works like a charm (although line numbers differ in my version).

comment:36 Changed 3 months ago by tdammers

-ddump-simpl suggests the problem is not increased Core size; the output is practically identical.

So I did a run with +RTS -p, and got this:

        Mon Jul 16 16:37 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c T5631.hs -fforce-recomp

        total time  =        1.38 secs   (1383 ticks @ 1000 us, 1 processor)
        total alloc = 1,208,527,080 bytes  (excludes profiling overheads)

COST CENTRE                MODULE     SRC                                                 %time %alloc

tc_rn_src_decls            TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     38.7   55.9
withCleanupSession.cleanup GHC        compiler/main/GHC.hs:(468,4)-(475,37)                23.1    0.0
zonkTopDecls               TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.0    8.7
CorePrep                   HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          3.0    4.9
deSugar                    HscMain    compiler/main/HscMain.hs:512:7-44                     2.6    2.5
pprNativeCode              AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.4    1.8
CoreTidy                   HscMain    compiler/main/HscMain.hs:1257:27-67                   1.8    2.3
simplIdF                   Simplify   compiler/simplCore/Simplify.hs:859:61-79              1.6    1.3
simplifyTop                TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            1.4    2.2
Parser                     HscMain    compiler/main/HscMain.hs:(317,5)-(385,20)             1.4    2.6
StgCmm                     HscMain    compiler/main/HscMain.hs:(1432,13)-(1433,62)          1.3    1.0

~And then digging into the call graph analysis, this surprising bit:

         hscTypecheck                            HscMain                       compiler/main/HscMain.hs:(425,1)-(428,20)          1564          1    0.0    0.0    51.9   73.1
          extract_renamed_stuff                  HscMain                       compiler/main/HscMain.hs:(397,1)-(410,31)          1640          1    0.0    0.0     0.0    0.0
          hscTypecheck'                          HscMain                       compiler/main/HscMain.hs:(434,1)-(456,38)          1565          1    0.0    0.0    51.9   73.1
           [...]
           tcRnModule'                           HscMain                       compiler/main/HscMain.hs:(461,1)-(500,72)          1570          1    0.0    0.0    50.5   70.5
            Typecheck-Rename                     HscMain                       compiler/main/HscMain.hs:(464,16)-(465,73)         1571          1    0.0    0.0    50.5   70.5
             ioMsgMaybe                          HscMain                       compiler/main/HscMain.hs:(251,1)-(256,122)         1572          1    0.4    0.8    50.5   70.5

In other words, we spend 70% of our allocations on "dealing with errors and warnings returned by a compilation step".

Actually no, it just means that whatever ioMsgMaybe wraps here is spending 70% of allocations.

Last edited 3 months ago by tdammers (previous) (diff)

comment:37 Changed 3 months ago by simonpj

Can we compare the numbers with and without the patch? That should show differences.

Also I suggest compiling the compiler with -ticky and comparing output with and without the patch.

comment:38 in reply to:  37 Changed 3 months ago by tdammers

Replying to simonpj:

Can we compare the numbers with and without the patch? That should show differences.

And it does.

Before patch:

        Tue Jul 17 12:17 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c T5631.hs -fforce-recomp

        total time  =        1.28 secs   (1278 ticks @ 1000 us, 1 processor)
        total alloc = 1,146,867,328 bytes  (excludes profiling overheads)

COST CENTRE        MODULE     SRC                                                 %time %alloc

tc_rn_src_decls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     40.7   55.4
setSessionDynFlags GHC        compiler/main/GHC.hs:(578,1)-(584,16)                 8.7    0.7
zonkTopDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.0    9.2
tcRnSrcDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:255:25-65            4.5    0.4
writeIface         HscMain    compiler/main/HscMain.hs:1283:9-45                    3.4    0.2
withCleanupSession GHC        compiler/main/GHC.hs:(466,1)-(475,37)                 2.9    0.1
CorePrep           HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          2.8    4.1
deSugar            HscMain    compiler/main/HscMain.hs:512:7-44                     2.5    2.7
pprNativeCode      AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.3    1.9
simplifyTop        TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            1.6    2.3
CoreTidy           HscMain    compiler/main/HscMain.hs:1257:27-67                   1.6    1.8

[...]

               tc_rn_src_decls                   TcRnDriver                    compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)   1575          1   40.7   55.4    42.8   57.3
                Digraph.topSort                  Digraph                       compiler/utils/Digraph.hs:350:48-75                1586        341    0.0    0.0     0.0    0.0
                solveSimples                     TcInteract                    compiler/typecheck/TcInteract.hs:(241,5)-(242,21)  1587        266    0.0    0.0     1.9    1.6
                 solve_loop                      TcInteract                    compiler/typecheck/TcInteract.hs:(246,9)-(250,44)  1589          0    1.5    1.1     1.9    1.6
                  canNC                          TcCanonical                   compiler/typecheck/TcCanonical.hs:(84,5)-(90,45)   1590       1206    0.4    0.5     0.4    0.5
                  canClass                       TcCanonical                   compiler/typecheck/TcCanonical.hs:98:5-31          1591         15    0.0    0.0     0.0    0.0
                  canEqLeafTyVarEq               TcCanonical                   compiler/typecheck/TcCanonical.hs:105:5-39         1592          3    0.0    0.0     0.0    0.0
                Digraph.scc                      Digraph                       compiler/utils/Digraph.hs:285:44-67                1576         25    0.0    0.0     0.0    0.0
                bin_anns                         HscTypes                      compiler/main/HscTypes.hs:1093:47-56               1581         12    0.0    0.0     0.0    0.0
                bin_exports                      HscTypes                      compiler/main/HscTypes.hs:1088:50-55               1578         12    0.0    0.0     0.0    0.0
                bin_fam_insts                    HscTypes                      compiler/main/HscTypes.hs:1096:52-57               1584         12    0.0    0.0     0.0    0.0
                bin_fixities                     HscTypes                      compiler/main/HscTypes.hs:1091:51-56               1579         12    0.0    0.0     0.0    0.0
                bin_insts                        HscTypes                      compiler/main/HscTypes.hs:1095:48-53               1583         12    0.0    0.0     0.0    0.0
                bin_rules                        HscTypes                      compiler/main/HscTypes.hs:1097:48-57               1585         12    0.0    0.0     0.0    0.0
                bin_tycldecls                    HscTypes                      compiler/main/HscTypes.hs:1094:52-57               1582         12    0.2    0.2     0.2    0.2
                bin_usages                       HscTypes                      compiler/main/HscTypes.hs:1087:49-58               1577         12    0.0    0.0     0.0    0.0
                bin_warns                        HscTypes                      compiler/main/HscTypes.hs:1092:48-57               1580         12    0.0    0.0     0.0    0.0
                substTyWith                      TyCoRep                       compiler/types/TyCoRep.hs:(2191,23)-(2192,50)      1616          4    0.0    0.0     0.0    0.0

After:

tc_rn_src_decls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)     49.3   55.9
zonkTopDecls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(442,16)-(443,43)    7.6    8.7
CorePrep           HscMain    compiler/main/HscMain.hs:(1317,24)-(1318,57)          3.9    4.9
deSugar            HscMain    compiler/main/HscMain.hs:512:7-44                     3.3    2.5
setSessionDynFlags GHC        compiler/main/GHC.hs:(578,1)-(584,16)                 2.9    0.7
CoreTidy           HscMain    compiler/main/HscMain.hs:1257:27-67                   2.2    2.3
simplifyTop        TcRnDriver compiler/typecheck/TcRnDriver.hs:409:25-39            2.1    2.2
withCleanupSession GHC        compiler/main/GHC.hs:(466,1)-(475,37)                 2.1    0.1
pprNativeCode      AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.0    1.8
Parser             HscMain    compiler/main/HscMain.hs:(317,5)-(385,20)             1.7    2.6
solve_loop         TcInteract compiler/typecheck/TcInteract.hs:(246,9)-(250,44)     1.6    1.1

[...]


               tc_rn_src_decls                   TcRnDriver                    compiler/typecheck/TcRnDriver.hs:(491,4)-(553,7)   1584          1   49.3   55.9    52.2   57.7
                Digraph.topSort                  Digraph                       compiler/utils/Digraph.hs:350:48-75                1595        341    0.1    0.0     0.1    0.0
                solveSimples                     TcInteract                    compiler/typecheck/TcInteract.hs:(241,5)-(242,21)  1596        266    0.2    0.0     2.5    1.5
                 solve_loop                      TcInteract                    compiler/typecheck/TcInteract.hs:(246,9)-(250,44)  1598          0    1.6    1.0     2.3    1.5
                  canNC                          TcCanonical                   compiler/typecheck/TcCanonical.hs:(84,5)-(90,45)   1599       1206    0.7    0.5     0.7    0.5
                  canClass                       TcCanonical                   compiler/typecheck/TcCanonical.hs:98:5-31          1600         15    0.0    0.0     0.0    0.0
                  canEqLeafTyVarEq               TcCanonical                   compiler/typecheck/TcCanonical.hs:105:5-39         1601          3    0.0    0.0     0.0    0.0
                Digraph.scc                      Digraph                       compiler/utils/Digraph.hs:285:44-67                1585         25    0.0    0.0     0.0    0.0
                bin_anns                         HscTypes                      compiler/main/HscTypes.hs:1093:47-56               1590         12    0.0    0.0     0.0    0.0
                bin_exports                      HscTypes                      compiler/main/HscTypes.hs:1088:50-55               1587         12    0.0    0.0     0.0    0.0
                bin_fam_insts                    HscTypes                      compiler/main/HscTypes.hs:1096:52-57               1593         12    0.0    0.0     0.0    0.0
                bin_fixities                     HscTypes                      compiler/main/HscTypes.hs:1091:51-56               1588         12    0.0    0.0     0.0    0.0
                bin_insts                        HscTypes                      compiler/main/HscTypes.hs:1095:48-53               1592         12    0.0    0.0     0.0    0.0
                bin_rules                        HscTypes                      compiler/main/HscTypes.hs:1097:48-57               1594         12    0.0    0.0     0.0    0.0
                bin_tycldecls                    HscTypes                      compiler/main/HscTypes.hs:1094:52-57               1591         12    0.3    0.2     0.3    0.2
                bin_usages                       HscTypes                      compiler/main/HscTypes.hs:1087:49-58               1586         12    0.1    0.0     0.1    0.0
                bin_warns                        HscTypes                      compiler/main/HscTypes.hs:1092:48-57               1589         12    0.0    0.0     0.0    0.0
                substTyWith                      TyCoRep                       compiler/types/TyCoRep.hs:(2274,23)-(2275,50)      1625          4    0.0    0.0     0.0    0.0

So apparently tc_rn_src_decls is responsible for the entire performance hit here, however the calls it makes that we track seem to be identical, and none of its children contributes significantly.

comment:39 Changed 3 months ago by tdammers

Another interesting observation is that it eats up more CPU time, but doesn't allocate more.

comment:40 Changed 3 months ago by tdammers

Drilling down some more, I found that the biggest performance sink is tcPolyInfer, in both the pre- and post-patch versions; in both cases, it is responsible for just over 50% of allocations, but in terms of runtime, it's 36% before the patch, but 45% after - so that alone would explain a 5% performance deviation.

Further data points:

  • On the first run after making a fresh build, withCleanupSession.cleanup shows up prominently, but from the second run onwards, it scores 0 on all metrics.
  • Overall performance for profile builds is such that the patch does actually produce a performance improvement, from approx. 1.25 seconds down to 1.07.
  • The patch does not touch tcPolyInfer at all, the entire module is identical between both versions
  • The performance counters suggests that code paths taken inside tcPolyInfer are the same

Changed 3 months ago by tdammers

Attachment: logs.tar.gz added

Performance logs for comparison

comment:41 Changed 3 months ago by goldfire

Given what you've found so far, I feel sure that the problems stem from the changes in TyCoRep. But I'm curious to see what else you find!

comment:42 Changed 3 months ago by tdammers

Hah, that's a good hint! I did another run, this time compiling with -fprof-auto-top on TyCoRep only, and this reveals two interesting things:

  • tcvs_of_type, which doesn't exist in the old version yet, is responsible for 4.5% of CPU time in the new version
  • zonkTopDecls, while allocating about 8.8% in both versions, jumps from 5.9% CPU to 8.2%

I'll dig deeper on these two.

comment:43 Changed 3 months ago by simonpj

Ah! The tcvs_of_type thing sounds promising.

comment:44 Changed 3 months ago by tdammers

Managed to get meaningful ticky-ticky profiling data.

Before the patch:

The following table is explained by http://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
All allocation numbers are in bytes.

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2301299   85611504          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
    1612793   73036520          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     699656   37268264          0   1 L                    go4{v sNuG} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNuF} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSIr} (ghc:TcHsSyn) (fun) in rN
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)
     138823   13706576          0   1 S                    sat_sMU6{v} (ghc:TcMType) (fun) in r6P
     459700   11089080          0   1 S                    sat_sNvf{v} (ghc:TcMType) (fun) in sNuF

After:

The following table is explained by http://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
All allocation numbers are in bytes.

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2330347   86853624          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
    1271867   57141560          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     699656   37268264          0   1 L                    go4{v sNmh} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     693375   27595440          0   3 MIM                  poly_merge1{v rs9H} (containers-0.5.11.0:Data.IntMap.Internal) (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNmg} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSHY} (ghc:TcHsSyn) (fun) in rN
      95951   16510560          0   2 >L                   base:Data.OldList.sortBy{v rD} (fun)
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)

Note the poly_merge1 entry in the post-patch one.

comment:45 Changed 3 months ago by RyanGlScott

poly_merge1 is likely coming from the mergeWithKey' function from Data.IntMap (which unionVarSet is implemented in terms of), as it has a helper function named merge1. The question, I suppose, is which use(s) of unionVarSet are problematic. There are at least a couple new uses of unionVarSet in Phab:D4769:

  • In mapUnionVarSetSet, which is used in the pivotal closeOverKinds function
  • Directly in closeOverKinds itself
  • In nonDetCmpTypes

comment:46 Changed 3 months ago by RyanGlScott

Question: is it more efficient to use closeOverKinds (unionVarSet set1 set2) than unionVarSet (closeOverKinds set1) (closeOverKinds set2)? If so, I think the way nonDetCmpTypes is currently implemented is inefficient. We currently have:

nonDetCmpTypes :: [Type] -> [Type] -> Ordering
nonDetCmpTypes ts1 ts2 = nonDetCmpTypesX rn_env ts1 ts2
  where
    rn_env = mkRnEnv2 (mkInScopeSet (tyCoVarsOfTypes ts1 `unionVarSet` tyCoVarsOfTypes ts2))

In particular, note this part:

tyCoVarsOfTypes ts1 `unionVarSet` tyCoVarsOfTypes ts2

tyCoVarsOfTypes calls closeOverKinds, so we're calling closeOverKinds once more than is necessary. Perhaps this should be written like this?

tyCoVarsOfTypes (ts1 ++ ts2)

comment:47 Changed 3 months ago by simonpj

Re comment:46, yes maybe it'd be more efficient like that. But I doubt this is the issue. nonDetCmpTypes (I hope) only evaluates that free-var set if it compare types involving forall. Otherwise it's just a thunk.

comment:48 Changed 3 months ago by simonpj

I have a new hypothesis, provoked by this merge popping up in the statistics.

  • Previously, tyCoVarsOfTypes used tyCoFVsOfTypes, which in turn uses unionFV (from utils/FV.hs). This FV thing is relatively elaborate, maintaining a VarSet alongside a list. Plus an "interesting" predicate which, in the case of tyCoVarsOfType is const True.
  • In the patch, tyCoVarsOfType switches to using unionVarSet. This "obviously" shold be more efficient, because it just maintains a VarSet, with no list.
  • BUT tyCoVarsOfType does not use an accumulator; it just straightforwardly unions the free-vars in a tree-like way. In contrast, unionFV does use accumulator style, behind the scenes.
  • Bottom line: in the old version we make many calls to extendVarSet; in the new one we make many calls to unionVarSet.

It should jolly well be the case that inserting 100 things one by one into a set should be no faster than unioning a balanced tree of the same 100 things. But looking at Data.Map.Internal.insert, and comparing with Data.Map.Internal.mergeWithKey', I bet the former is faster.

This hypothesis is easy to test: just change the new tcvs_of_type to take an accumulator.

If this turns out to make a difference, I think we should file a bug on containers. Users should not have to worry about this!

comment:49 Changed 3 months ago by tdammers

So far, a naive attempt at rewriting tcvs_of_type in terms of VarSet -> VarSet didn't produce significant performance improvements; however, I have not eliminated unionVarSet completely yet - the ForAll cases in both tcvs_of_type and tcvs_of_co are both such that naively expressing them as compositions of applications of tcvs_of_... and delVarSet does not produce correct results - I assume this is because the order in which vars are added and deleted from the set matter. So I'm now testing a replacement for unionVarSet that is written in terms of set insertions instead of set unions; hopefully this will show a bigger difference. I should probably also benchmark the relevant set operations on their own, to verify that they do indeed expose the performance issue we're suspecting.

Last edited 3 months ago by tdammers (previous) (diff)

comment:50 Changed 3 months ago by tdammers

Update; the rewritten unionVarSet performs significantly worse than the original one. So either I did something wrong there, or our suspicion isn't correct.

comment:51 Changed 3 months ago by simonpj

So far, a naive attempt at rewriting tcvs_of_type in terms of VarSet -> VarSet didn't produce significant performance improvements

What does the naive attempt look like. I'm expecting:

tcvs_of_type :: Type -> VarSet -> VarSet
tcvs_of_type (App t1 t2) acc = tcvs_of_type t1 (tcvs_of_type t2 acc)
...etc...

Is that what you have? I recall that Bartosz got much worse perf with a higher-order version.

I would not worry about foralls -- they are relatively rare.

Last edited 3 months ago by simonpj (previous) (diff)

comment:52 Changed 3 months ago by tdammers

Yes, that's essentially what I have. Specifically:

tcvs_of_type' :: Type -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_type' (TyVarTy v)                 = flip extendVarSet v
tcvs_of_type' (TyConApp _ tys)            = tcvs_of_types' tys
tcvs_of_type' (LitTy {})                  = id
tcvs_of_type' (AppTy fun arg)             = tcvs_of_type' fun . tcvs_of_type' arg
tcvs_of_type' (FunTy arg res)             = tcvs_of_type' arg . tcvs_of_type' res
tcvs_of_type' (ForAllTy (TvBndr tv _) ty) = unionVarSet (delVarSet (tcvs_of_type ty) tv) . tcvs_of_type' (tyVarKind tv)
tcvs_of_type' (CastTy ty co)              = tcvs_of_type' ty . tcvs_of_co' co
tcvs_of_type' (CoercionTy co)             = tcvs_of_co' co

tcvs_of_types :: [Type] -> TyCoVarSetNotClosed
tcvs_of_types ts = tcvs_of_types' ts emptyVarSet

tcvs_of_types' :: [Type] -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_types' [] = id
tcvs_of_types' (t:ts) = tcvs_of_type' t . tcvs_of_types' ts

tcvs_of_co :: Coercion -> TyCoVarSetNotClosed
tcvs_of_co co = tcvs_of_co' co emptyVarSet

tcvs_of_co' :: Coercion -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_co' (Refl _ ty)               = tcvs_of_type' ty
tcvs_of_co' (TyConAppCo _ _ cos)      = tcvs_of_cos' cos
tcvs_of_co' (AppCo co arg)            = tcvs_of_co' co . tcvs_of_co' arg
tcvs_of_co' (ForAllCo tv kind_co co)  = unionVarSet (delVarSet (tcvs_of_co co) tv) . tcvs_of_co' kind_co
tcvs_of_co' (FunCo _ co1 co2)         = tcvs_of_co' co1 . tcvs_of_co' co2
tcvs_of_co' (CoVarCo v)               = flip extendVarSet v
tcvs_of_co' (HoleCo h)                = flip extendVarSet (coHoleCoVar h)
    -- See Note [CoercionHoles and coercion free variables]
tcvs_of_co' (AxiomInstCo _ _ cos)      = tcvs_of_cos' cos
tcvs_of_co' (UnivCo p _ t1 t2)        = tcvs_of_prov' p . tcvs_of_type' t1 . tcvs_of_type' t2
tcvs_of_co' (SymCo co)          = tcvs_of_co' co
tcvs_of_co' (TransCo co1 co2)   = tcvs_of_co' co1 . tcvs_of_co' co2
tcvs_of_co' (NthCo _ _ co)      = tcvs_of_co' co
tcvs_of_co' (LRCo _ co)         = tcvs_of_co' co
tcvs_of_co' (InstCo co arg)     = tcvs_of_co' co . tcvs_of_co' arg
tcvs_of_co' (CoherenceCo c1 c2) = tcvs_of_co' c1 . tcvs_of_co' c2
tcvs_of_co' (KindCo co)         = tcvs_of_co' co
tcvs_of_co' (SubCo co)          = tcvs_of_co' co
tcvs_of_co' (AxiomRuleCo _ cos)  = tcvs_of_cos' cos

tcvs_of_cos :: [Coercion] -> TyCoVarSetNotClosed
tcvs_of_cos co = tcvs_of_cos' co emptyVarSet

tcvs_of_cos' :: [Coercion] -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_cos' [] = id
tcvs_of_cos' (co:cos) = tcvs_of_co' co . tcvs_of_cos' cos

-- tcvs_of_prov :: UnivCoProvenance -> TyCoVarSetNotClosed
-- tcvs_of_prov p = tcvs_of_prov' p emptyVarSet

tcvs_of_prov' :: UnivCoProvenance -> TyCoVarSetNotClosed -> TyCoVarSetNotClosed
tcvs_of_prov' UnsafeCoerceProv    = id
tcvs_of_prov' (PhantomProv co)    = tcvs_of_co' co
tcvs_of_prov' (ProofIrrelProv co) = tcvs_of_co' co
tcvs_of_prov' (PluginProv _)      = id

So the '-ed versions are the ones with an accumulator; I kept the original ones around but rewrote them in terms of the accumulator version so that other functions that depend on them keep working unchanged. Apart from the forall cases, everything goes through the accumulator version, and uses only single-var insertions and no unions. I also rewrote the plural ones (tcvs_of_cos / tcvs_of_types) to avoid unions there. The only real difference I can see here is that I used point-free style over parentheses, but that surely ought not to make a difference.

There is a slight performance improvement over the union-based version, but not enough to explain the test failures, and it's still nowhere near the original numbers.

To wit:

Pre-patch:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/old.ticky 

STACK USAGE:

ENTERS: 51211603  of which 51211603 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6192996 ( 12.1%) thunks
    3676345 (  7.2%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 41342262

TAIL CALLS: 33149658, of which 27886423 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 23002648
   11895484 ( 51.7%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            11895484:  36.3% 13.5% 24.9%  4.5% 19.1%  0.9%  0.3%  0.2%  0.4%
RET_OLD:             3676345:  11.6% 18.6% 18.8%  5.7%  6.3% 20.5%  0.3%  1.1% 17.1%
RET_UNBOXED_TUP:     7430819:   0.0% 68.9% 29.7%  1.2%  0.1%  0.0%  0.1%  0.1%  0.0%

Post-patch:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/new.ticky 

STACK USAGE:

ENTERS: 54717036  of which 54717036 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6198335 ( 11.3%) thunks
    4824139 (  8.8%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 43694562

TAIL CALLS: 35496964, of which 29883207 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 24705888
   13776908 ( 55.8%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            13776908:  36.0% 11.6% 25.9%  3.9% 21.1%  0.8%  0.2%  0.1%  0.3%
RET_OLD:             4824139:  23.8% 14.1% 20.4%  4.4%  7.6% 15.6%  0.3%  0.9% 13.0%
RET_UNBOXED_TUP:     6104841:   0.0% 83.5% 14.8%  1.4%  0.1%  0.0%  0.1%  0.1%  0.0%

Post-patch with rewritten tcvs... functions:

/home/tobias/well-typed/devel/ghc-phab/inplace/lib/bin/ghc-stage2 -B/home/tobias/well-typed/devel/ghc-phab/inplace/lib -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/new-accum.ticky 

STACK USAGE:

ENTERS: 54394961  of which 54394961 (100.0%) direct to the entry code
		  [the rest indirected via Node's info ptr]
    6198336 ( 11.4%) thunks
    4906771 (  9.0%) data values
          0 (  0.0%) normal indirections
          0 (  0.0%) permanent indirections

FUNCTION ENTRIES: 43289854

TAIL CALLS: 35075741, of which 29461984 (84%) were to known functions

SLOW APPLICATIONS: 0 evaluated, 0 unevaluated

         Too few args   Correct args   Too many args
   FUN            0              0               0
   PAP            0              0               0


RETURNS: 24431674
   13420062 ( 54.9%) from entering a new constructor
		  [the rest from entering an existing constructor]

RET_NEW:            13420062:  32.7% 11.9% 25.9%  4.0% 24.0%  0.8%  0.2%  0.1%  0.4%
RET_OLD:             4906771:  26.3% 13.9% 17.1%  4.3%  9.2% 15.3%  0.3%  0.9% 12.8%
RET_UNBOXED_TUP:     6104841:   0.0% 83.5% 14.8%  1.4%  0.1%  0.0%  0.1%  0.1%  0.0%

comment:53 Changed 3 months ago by simonpj

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

comment:54 in reply to:  53 Changed 3 months ago by tdammers

Replying to simonpj:

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

Tried that, but it seems to make absolutely no difference (that is, allocations go down by less than 0.004%).

Also worth noting that VarSet here is based on IntMap, not Map as you quoted. Not sure if that changes anything, but maybe IntMap doesn't have the same problem Map has.

comment:55 Changed 3 months ago by RyanGlScott

I can confirm that my hypothesis in comment:46 did not yield fruit: it saved a whopping 400 bytes of allocation in T5321Fun, and that's it.

comment:56 in reply to:  55 Changed 3 months ago by tdammers

I did some benchmarking of Data.Map and also Data.IntMap in isolation.

For both, I implemented combining a large number of positive integers into a set and then testing for the presence of -1 in the set, using two strategies:

  1. the "unions" approach: unions . map singleton
  2. the "insert" approach: foldr insert empty

For IntMap (which is what we're using here), the "unions" approach is slightly faster (~14.5s vs. ~15s for 100 million elements); for Map, the "unions" approach is a lot faster (~6.5s vs. ~8.2s for 10 million elements).

Note that this is the worst possible case for union, and it still outperforms individual insertions for some reason.

So if that is accurate, then the idea that speeding things up by using lots of inserts instead of lots of set unions doesn't hold up.

comment:57 in reply to:  53 Changed 3 months ago by tdammers

Replying to simonpj:

Let me urge you to try it with an explicit accumulator, exactly as I wrote it.

I recall that this made a very big difference for Bartosz.

Can you remember whether that situation involved raw VarSets, or FVs? From what I see in the code, FV could, under the right circumstances, avoid hitting the underlying IntSet entirely, while after this patch, we're using VarSet directly, so maybe that's why those set operations are being hit harder now?

comment:58 Changed 3 months ago by simonpj

the "unions" approach: unions . map singleton

That is essentially foldr (union . singleton) empty, which is really very close to foldr insert empty.

What if you do a balanced tree of binary union operations? I suppose something like

bigSet :: Int -> Int -> IntSet
bigSet m n
  | m==n = singleton n
  | otherwise = bigSet m mid `union` bigSet (mid+1) n
  where
    mid = (n+m)/2

comment:59 Changed 3 months ago by simonpj

This is all very puzzling. But when we get to the bottom of it we may speed up a key piece of infrastructure! So I think it's worth persevering.

Some other ideas

  1. Is poly_merge still high on the list in ticky in the accumulator version? If so, why?
  1. Can you post the accumulator version for types anyway?
  1. Try switching back to FV. That is, take the currently-deleted code for tyCoFVsOfType, and re-instate it, but modify it not to take the kinds at the variable occurrences. Now use that instead of the new fvs_of_type. It "obviously" should be a lot slower -- is it?
  1. In unitFV I see
    unitFV :: Id -> FV
    unitFV var fv_cand in_scope acc@(have, haveSet)
      | var `elemVarSet` in_scope = acc
      | var `elemVarSet` haveSet = acc
      | fv_cand var = (var:have, extendVarSet haveSet var)
      | otherwise = acc
    

Notice that before inserting we test for membership. Could that possibly be a win? If so, we can use it in the VarSet version.

comment:60 Changed 3 months ago by simonpj

Can you remember whether that situation involved raw VarSets, or FVs?

The old code for tyCoFVsOfType looked like

fvs_of_type (AppTy fun arg)    a b c = (fvs_of_type fun `unionFV` fvs_of_type arg) a b c

I believe that if you eta reduce this to

fvs_of_type (AppTy fun arg) = (fvs_of_type fun `unionFV` fvs_of_type arg)

it runs a lot slower. That should not happen, but I recall that it did. I don't understand what you mean about "avoiding hitting the underlying IntSet entirely".

comment:61 in reply to:  60 Changed 3 months ago by tdammers

Replying to simonpj:

Can you remember whether that situation involved raw VarSets, or FVs?

The old code for tyCoFVsOfType looked like

fvs_of_type (AppTy fun arg)    a b c = (fvs_of_type fun `unionFV` fvs_of_type arg) a b c

I believe that if you eta reduce this to

fvs_of_type (AppTy fun arg) = (fvs_of_type fun `unionFV` fvs_of_type arg)

it runs a lot slower. That should not happen, but I recall that it did. I don't understand what you mean about "avoiding hitting the underlying IntSet entirely".

Well, if you look at the definition of FV:

type FV = InterestingVarFun
             -- Used for filtering sets as we build them
          -> VarSet
             -- Locally bound variables
          -> ([Var], VarSet)
             -- List to preserve ordering and set to check for membership,
             -- so that the list doesn't have duplicates
             -- For explanation of why using `VarSet` is not deterministic see
             -- Note [Deterministic UniqFM] in UniqDFM.
          -> ([Var], VarSet)

...then it's not immediately obvious that code that manipulates FVs will always touch the VarSets in there at all. Specifically, variables that are not "interesting" bypass the VarSet (i.e., IntSet) machinery entirely, if I'm not mistaken.

Not sure if that has anything to do with the problem at hand though.

comment:62 in reply to:  59 Changed 3 months ago by tdammers

Replying to simonpj:

This is all very puzzling. But when we get to the bottom of it we may speed up a key piece of infrastructure! So I think it's worth persevering.

  1. Is poly_merge still high on the list in ticky in the accumulator version? If so, why?

As a matter of fact, no it's not. So it looks like the allocations we're winning from not using poly_merge as much just end up being eaten up elsewhere. Top 15 entries for the accumulator version:

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
    2539473  107917008          0   3 i.M                  containers-0.5.11.0:Data.IntMap.Internal.$winsert{v r7O} (fun)
     172005   90819296          0   5 EMMMS                ghc:TcUnify.promoteTcType2{v r3m} (fun)
    2330347   86853624          0   2 >L                   base:GHC.Base.map{v 01X} (fun)
     699656   37268264          0   1 L                    go4{v sNmj} (ghc:TcMType) (fun) in r1Y
     448802   35904160          0   7 E>>>>.M              ghc:TcMType.$s$wmapType{v r1Y} (fun)
     367181   23499584          0   2 LS                   ghc:TcRnMonad.traceTc1{v rlu} (fun)
     887399   22625616          0   1 M                    go28{v sNmi} (ghc:TcMType) (fun) in r1Y
     283135   21931488          0   4 >SSM                 ghc:TcHsSyn.$wzonkTyVarOcc{v r1F} (fun)
     269137   18194264          0   1 L                    go13{v sSI0} (ghc:TcHsSyn) (fun) in rN
      95951   16510560          0   2 >L                   base:Data.OldList.sortBy{v rD} (fun)
     302653   16395904          0   5 +>LLL                ghc:MonadUtils.zipWith3M{v rp} (fun)
     194400   15552000          0   7 E>>>>.M              ghc:TcHsSyn.$s$wmapType{v rN} (fun)
     559499   15509096          0   1 M                    ghc:TcMType.zonkTcTyVar{v r1l} (fun)
     138823   13706576          0   1 S                    sat_sMLW{v} (ghc:TcMType) (fun) in r6P
     440463   11089080          0   1 S                    sat_sNmS{v} (ghc:TcMType) (fun) in sNmi

comment:63 in reply to:  59 Changed 3 months ago by tdammers

Replying to simonpj:

  1. In unitFV I see
    unitFV :: Id -> FV
    unitFV var fv_cand in_scope acc@(have, haveSet)
      | var `elemVarSet` in_scope = acc
      | var `elemVarSet` haveSet = acc
      | fv_cand var = (var:have, extendVarSet haveSet var)
      | otherwise = acc
    

Notice that before inserting we test for membership. Could that possibly be a win? If so, we can use it in the VarSet version.

I believe that the "test before insert" thing is needed because we're accumulating a set (haveSet) and a list (have) in parallel, instead of just the set, so in order to keep the two in sync, we cannot simply insert unconditionally, as that would produce duplicates in the list.

I don't think it can possibly produce a performance advantage beyond skipping the list cons, *unless* we are in a special situation where:

  • in_scope is much smaller than haveSet (such that testing membership of in_scope is significantly faster than testing haveSet membership); AND:
  • in_scope is likely to match new entries (because when it doesn't match, the extra test just eats up clock cycles without gaining anything)

One interesting thing might be to invert the fv_cand condition and change the ordering, so that the fv_cand test is done before the set membership tests, something like:

unitFV :: Id -> FV
unitFV var fv_cand in_scope acc@(have, haveSet)
  | not (fv_cand var) = acc
  | var `elemVarSet` in_scope = acc
  | var `elemVarSet` haveSet = acc
  | otherwise = (var:have, extendVarSet haveSet var)

But that won't change the performance of VarSet the slightest.

comment:64 in reply to:  58 Changed 3 months ago by tdammers

Replying to simonpj:

the "unions" approach: unions . map singleton

That is essentially foldr (union . singleton) empty, which is really very close to foldr insert empty.

What if you do a balanced tree of binary union operations? I suppose something like

bigSet :: Int -> Int -> IntSet
bigSet m n
  | m==n = singleton n
  | otherwise = bigSet m mid `union` bigSet (mid+1) n
  where
    mid = (n+m)/2

Did exactly that, benchmark code attached. Results:

./intmapbench insert    1.32s user 0.05s system 99% cpu 1.370 total
./intmapbench union     1.01s user 0.01s system 99% cpu 1.029 total
./intmapbench balanced  0.24s user 0.01s system 98% cpu 0.260 total

So the only surprising thing here is that even in the worst case of a completely unbalanced insertion order, union outperforms insert, but otherwise, it behaves as expected, and this definitely does not support our hypothesis that individual inserts might be faster than set unions.

Last edited 3 months ago by tdammers (previous) (diff)

Changed 3 months ago by tdammers

Attachment: intmapbench.hs added

IntMap mini-benchmark

comment:65 Changed 3 months ago by simonpj

Maybe try (3) in comment:59? That makes before-and-after more similar.

It might help to put it all on a branch... I'm not sure of the precise code you are measuring.

(It's cool that balanced unions are actually faster. FOUR TIMES faster!)

comment:66 in reply to:  65 Changed 3 months ago by tdammers

Replying to simonpj:

Maybe try (3) in comment:59? That makes before-and-after more similar.

Yes, will do that.

It might help to put it all on a branch... I'm not sure of the precise code you are measuring.

I took the patch from phabricator and rebased it onto what I figured must be the last commit before Richard started working on it. I'll push my branch to git.haskell.org.

(It's cool that balanced unions are actually faster. FOUR TIMES faster!)

It is, although it would have been easier for us if that had turned out to be the problem. Then again, one of the fundamental libraries out there not being broken is kind of a comforting thought...

comment:67 Changed 3 months ago by RyanGlScott

Milestone: 8.8.1

comment:68 in reply to:  59 Changed 3 months ago by tdammers

Replying to simonpj:

  1. Try switching back to FV. That is, take the currently-deleted code for tyCoFVsOfType, and re-instate it, but modify it not to take the kinds at the variable occurrences. Now use that instead of the new fvs_of_type. It "obviously" should be a lot slower -- is it?

I'm slightly confused here. fvs_of_type already uses FV, before and after the patch; the patch however introduces tcvs_of_type, which uses VarSets directly instead of going through FV, and this function is where the slowdown appears to be.

I changed the code such that it always uses fvs_of_type (and thus FV) instead of tcvs_of_type, and the results are mixed:

  • The two perf test cases that fail validation now perform slightly better
  • ...but not enough to make the tests pass
  • ...and now 3 more perf tests fail

Output from ./validate:

==== STAGE 2 TESTS ====

Unexpected results from:
TEST="T12227 T12545 T14052 T3234 T5321Fun T5631"

SUMMARY for test run started at Mon Jul 30 12:28:36 2018 CEST
 0:22:35 spent to go through
    6350 total tests, which gave rise to
   19866 test cases, of which
   13175 were skipped

      36 had missing libraries
    6495 expected passes
     154 expected failures

       0 caused framework failures
       0 caused framework warnings
       0 unexpected passes
       1 unexpected failures
       5 unexpected stat failures

Unexpected failures:
   /tmp/ghctest-51wr9fmd/test   spaces/./simplCore/should_compile/T3234.run  T3234 [stderr mismatch] (optasm)

Unexpected stat failures:
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T5631.run     T5631 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T12227.run    T12227 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/compiler/T12545.run    T12545 [stat not good enough] (normal)
   /tmp/ghctest-51wr9fmd/test   spaces/./perf/should_run/T14052.run  T14052 [stat not good enough] (ghci)

comment:69 Changed 3 months ago by tdammers

Richard's original patch: http://git.haskell.org/ghc.git/log/refs/heads/wip/T14880.

I'll push experiments to branches wip/T14880/...

comment:70 Changed 3 months ago by simonpj

Just to record that we decided that the experiment in comment:68 is misleading.

There are two main changes in Richard's patch

  1. Do closeOverKinds at the end, here:
    -- NEW
    tyCoVarsOfType = closeOverKinds . tcvs_of_type
    

rather than independently at each leaf:

-- OLD
tyCoFVsOfType (TyVarTy v)        a b c = (unitFV v `unionFV` tyCoFVsOfType (tyVarKind v)) a b c

-- NEW
tcvs_of_type (TyVarTy v) = unitVarSet v
  1. Insetad of using FV and then taking its VarSet, use VarSets directly. This jolly well ought to be faster.

But in comment:68, Tobias used the new (1) but the old (2) and thus closed over the kind vars both at the leaves and at the end. That's not going to be fast.

We agreed that a it'd be iluminating to start from the existing GHC baseline and try (2) alone: is VarSet faster than FV?

Then add (1).

NB: it is (1) that fixes the actual bug in this ticket. Ultimately it is not optional. But (1) should be a perf boost too! If a variable occurs N times, we'll take the free vars of its kind once instead of N times.

comment:71 Changed 3 months ago by simonpj

I was suspicious about comment:64 so I took a look. Sure enough

  • The list [1..n] wasn't being fused away because it was shared between insert and union, while balanced worked differently and generated no intermediate list.
  • More significantly, the integers were in ascending order which is fantastically good for balanced union: the tree that implements the set is always balanced and never needs to be transformed.

So I tried with a random tree. Code is below. Results are much more plausible

  • balanced: 1.15G alloc, 1.6sec
  • union: 1.2G alloc, 1.6sec
  • insert: 1.17G alloc, 1.6sec

Conclusion: containers is behaving OK.

{-#LANGUAGE LambdaCase #-}
module Main( run, main ) where

import qualified Data.IntSet as Set
import System.Environment
import System.Random

data Tree = Leaf Int | Node Tree Tree

randomTree :: Int -> Int -> IO Tree
randomTree lo hi
  | lo == hi
  = do { n <- randomRIO (1,100000)
       ; return (Leaf n) }
  | otherwise
  = do { l <- randomTree lo mid
       ; r <- randomTree (mid+1) hi
       ; return (Node l r) }
  where
    mid = (lo+hi) `div` 2

main = do { args <- getArgs
          ; case args of
              (size:what:_) -> do { t <- randomTree 1 (read size)
                                  ; print (Set.size (run what t)) }
              _ -> error "Invalid args" }

run :: String -> Tree -> Set.IntSet
run "balanced" t = go t
  where
    go (Leaf n) = Set.singleton n
    go (Node t1 t2) = go t1 `Set.union` go t2

run "insert" t = go t Set.empty
  where
    go (Leaf n) acc = Set.insert n acc
    go (Node t1 t2) acc = go t1 (go t2 acc)

run "union" t = go t Set.empty
  where
    go (Leaf n) acc = Set.union (Set.singleton n) acc
    go (Node t1 t2) acc = go t1 (go t2 acc)

comment:72 in reply to:  71 Changed 3 months ago by tdammers

Replying to simonpj:

I was suspicious about comment:64 so I took a look. Sure enough

  • The list [1..n] wasn't being fused away because it was shared between insert and union, while balanced worked differently and generated no intermediate list.

Ah yes. Lesson learned once again: benchmarking is hard and subtle.

  • More significantly, the integers were in ascending order which is fantastically good for balanced union: the tree that implements the set is always balanced and never needs to be transformed.

Right. I probably had the wrong intuition about set performance characteristics here.

comment:73 Changed 3 months ago by Richard Eisenberg <rae@…>

In f8618a9b/ghc:

Remove the type-checking knot.

Bug #15380 hangs because a knot-tied TyCon ended up in a kind.
Looking at the code in tcInferApps, I'm amazed this hasn't happened
before! I couldn't think of a good way to fix it (with dependent
types, we can't really keep types out of kinds, after all), so
I just went ahead and removed the knot.

This was remarkably easy to do. In tcTyVar, when we find a TcTyCon,
just use it. (Previously, we looked up the knot-tied TyCon and used
that.) Then, during the final zonk, replace TcTyCons with the real,
full-blooded TyCons in the global environment. It's all very easy.

The new bit is explained in the existing
Note [Type checking recursive type and class declarations]
in TcTyClsDecls.

Naturally, I removed various references to the knot and the
zonkTcTypeInKnot (and related) functions. Now, we can print types
during type checking with abandon!

NB: There is a teensy error message regression with this patch,
around the ordering of quantified type variables. This ordering
problem is fixed (I believe) with the patch for #14880. The ordering
affects only internal variables that cannot be instantiated with
any kind of visible type application.

There is also a teensy regression around the printing of types
in TH splices. I think this is really a TH bug and will file
separately.

Test case: dependent/should_fail/T15380

comment:74 Changed 3 months ago by tdammers

New results: following the suggestion to go back to the (incorrect) baseline before the patch, and then *only* rewriting tyCoVarsOf{Type|Co}[s] in terms of VarSet, thus skipping the detour via FV, I compared three different versions:

  • the pre-patch baseline, available on the wip/T14880-baseline branch on git.haskell.org;
  • the patch, applied fully, available on the wip/T14880 branch
  • the baseline version with the tvCoVarsOf... functions rewritten in terms of VarSets, copying the logic from the FV flavors, available on the wip/T14880-just-tvs branch.

The expectation was that -just-tvs would perform better than the baseline, because it skips the additional machinery of FV; however, it turns out that it's actually slower and allocates more.

Detailed timing results:

** Baseline **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880-baseline.ticky -RTS"
	User time (seconds): 3.17
	System time (seconds): 0.08
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.27
	Maximum resident set size (kbytes): 292764
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 42788
	Voluntary context switches: 392
	Involuntary context switches: 18

** Just VarSet **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880-just-tvs.ticky -RTS"
	User time (seconds): 3.28
	System time (seconds): 0.06
	Percent of CPU this job got: 91%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.64
	Maximum resident set size (kbytes): 294916
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 6
	Minor (reclaiming a frame) page faults: 42784
	Voluntary context switches: 452
	Involuntary context switches: 20

** 14880 patch **

	Command being timed: "/home/tobias/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -rlogs/T14880.ticky -RTS"
	User time (seconds): 3.26
	System time (seconds): 0.07
	Percent of CPU this job got: 95%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.49
	Maximum resident set size (kbytes): 295684
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 11
	Minor (reclaiming a frame) page faults: 43047
	Voluntary context switches: 422
	Involuntary context switches: 17

comment:75 Changed 3 months ago by goldfire

Very interesting. What happens if you make a version of FV that uses its same setup (passing around partially applied functions, etc.) but without the list? Then rewrite the functions in terms of that?

Also, I think it would be helpful not to use ticky profiling when generating these results (unless the data somehow comes from the ticky output), but also to include the allocations.

Thanks!

comment:76 in reply to:  75 Changed 3 months ago by tdammers

Replying to goldfire:

Very interesting. What happens if you make a version of FV that uses its same setup (passing around partially applied functions, etc.) but without the list? Then rewrite the functions in terms of that?

Ah, good point, I'll try that. Shouldn't be overly difficult.

Also, I think it would be helpful not to use ticky profiling when generating these results (unless the data somehow comes from the ticky output), but also to include the allocations.

Right, of course. The data is just plain old time --verbose; the reason I have -ticky there is because the idea was to generate ticky profiles at the same time, and the thought behind it was that ticky overhead would be somewhat proportional, and thus not disruptive. But I'll do another run without ticky.

comment:77 Changed 2 months ago by tdammers

Hmm, so when I skip the -ticky, the difference between the baseline and the VarSet version melts away:

Baseline:

        User time (seconds): 3.26
        System time (seconds): 0.09
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.53

VarSet:

        User time (seconds): 3.25
        System time (seconds): 0.06
        Percent of CPU this job got: 96%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.44

In fact, VarSet now ends up faster than FV.

comment:78 Changed 2 months ago by goldfire

Before we continue drilling in that direction, are these results reproducible? Timing is notoriously fickle, and I think it would be wise to make sure we know what we're getting into here.

Also, I do think it's worthwhile to still measure allocations, just as another data point.

If the FV/VarSet change isn't the driving force here, then what is? I recommend removing the code that adds the fvs of a variable's kind (without replacing this logic with anything) and testing again. That should surely take less time. (Be careful, e.g., by comparing -ddump-tc-traces, that GHC's overall behavior doesn't change by doing this.) Then, add the closeOverKinds. Maybe the algorithm in closeOverKinds (simple though it is) is somehow grossly inefficient.

comment:79 in reply to:  78 Changed 2 months ago by tdammers

Replying to goldfire:

Before we continue drilling in that direction, are these results reproducible? Timing is notoriously fickle, and I think it would be wise to make sure we know what we're getting into here.

Oh yes, absolutely right, and I just did exactly that.

Turns out the switch from FV to VarSet increases allocations quite a bit, but not enough to explain the failure completely - only about halfway. Observe:

baseline:

("bytes allocated", "1145886664")

with VarSet instead of fv:

("bytes allocated", "1176456040")

patch applied fully:

("bytes allocated", "1207902048")

This uses the settings from the test suite directly, and a validate build. So it should be perfectly reproducible, something like this:

for NAME in T14880-baseline T14880-just-tvs T14880
do
    echo "$NAME"
    git checkout "wip/$NAME" && (make -j2 && (/usr/bin/time --verbose ~/well-typed/devel/ghc-phab/inplace/bin/ghc-stage2 -c testsuite/tests/perf/compiler/T5631.hs -fforce-recomp +RTS -V0 -tlogs/$NAME.stat --machine-readable -RTS) |& tee logs/$NAME.timing && echo "$NAME")
done

comment:80 Changed 2 months ago by goldfire

The next step would appear to be to duplicate the FV module (with identical function structure) but to drop the lists. Then use the new NonDetFV instead of FV. I would be shocked if the allocations go anywhere but down.

Then, we'll have a new conundrum: why is VarSet slower than this NonDetFV? But let's not figure that out until we've made the observation that it is indeed slower.

comment:81 Changed 2 months ago by tdammers

NonDetFV (NDFV) speed / allocations test still running.

While writing the code for NDFV, one thing caught my eye however: the delFV / delFVs functions, rather than actually remove anything from the have list and the haveSet, *add* to the in_scope set; naturally, rewriting this in terms of VarSet, where the delVarSet function actually removes the var from the set, will change performance characteristics, one way or the other, depending on the situation.

comment:82 Changed 2 months ago by tdammers

Speed test results: performance for the NDFV version is virtually identical to that of the baseline (which uses FV). So leaving out the list when it is not going to be demanded anyway makes absolutely no difference, as one would expect.

Full logs to be attached. The relevant branches can be found on git.haskell.org:

  • wip/T14880-baseline (before the patch)
  • wip/T14880 (patch applied as-is)
  • wip/T14880-just-tvs (just the move from FV to VarSet)
  • wip/T14880-nondet-fv (introducing the NDFV flavour of FV, but otherwise identical to the baseline)

To reproduce logs, use something like:

for NAME in T14880-nondet-fv T14880 T14880-baseline T14880-just-tvs
do
    echo "$NAME"
    git checkout "wip/$NAME" \
    && (
        make -j2 \
        && (/usr/bin/time --verbose \
            ~/path/to/ghc/inplace/bin/ghc-stage2 \
                -c testsuite/tests/perf/compiler/T5631.hs \
                -fforce-recomp \
                +RTS -V0 -tlogs/$NAME.stat --machine-readable -RTS) \
        |& tee logs/$NAME.timing \
        && echo "$NAME"
       )
done

Changed 2 months ago by tdammers

Attachment: logs2.tar.gz added

comment:83 Changed 2 months ago by goldfire

OK. So NDFV behaves roughly identically to FV -- faster than just a bare VarSet. But why?

Let's look at the difference between NDFV and VarSet.

  1. The former has an InterestingVarFun. But I don't believe that's used in your tests.
  1. The former maintains a set of elements to be removed, while the latter just algorithmically removes elements when necessary.
  1. The former uses an accumulator to build sets, while the latter does not.

Which of these factors contributes to the performance change? To learn this, we can test each in isolation, by removing the difference and then measuring. I suspect (2). If this is the case, perhaps we'll discover an optimization that can apply well beyond GHC!

comment:84 Changed 2 months ago by simonpj

Re comment:83

  1. I think that InterestingVarFun is const True, so it only adds overhead.
  1. I think you mean that NDFV keeps a set of forall-bound variables, and checks in that set before adding to the accumulator. But foralls are rare (I think), so the set will typically be empty.
  1. That leaves the accumulator-style as the main candidate.

As Richard says, though, the way to find out is to try it one thing at a time.

comment:85 in reply to:  84 Changed 7 weeks ago by tdammers

Replying to simonpj:

  1. I think you mean that NDFV keeps a set of forall-bound variables, and checks in that set before adding to the accumulator. But foralls are rare (I think), so the set will typically be empty.

Well, considering the del... function for FV / NDFV:

delNDFV :: Var -> NDFV -> NDFV
delNDFV var fv fv_cand !in_scope acc =
  fv fv_cand (extendVarSet in_scope var) acc

...it turns out that *all* "deletions" from the "set" are implemented as adding to the in_scope set. Despite the name, the data structure doesn't really care whether they are really forall-bound variables. They might be though - I don't know if there are other reasons to delete from a free-var set.

  1. That leaves the accumulator-style as the main candidate.

As Richard says, though, the way to find out is to try it one thing at a time.

Yes, absolutely. Will do exactly that.

comment:86 Changed 7 weeks ago by tdammers

OK, so; replacing the delNDFV function with an implementation that doesn't touch the in_scope set at all, and instead removes items from the acc set, we get practically identical performance as with NDFV in its previous incarnation. Which means that 2. is not the culprit; the in_scope set seems to not make a difference, suggesting that we aren't really deleting a substantial number of elements from the set here.

comment:87 Changed 7 weeks ago by simonpj

OK, so does that take us back to (3), namely the accumulator style? Can you switch the style?

comment:88 in reply to:  87 Changed 7 weeks ago by tdammers

Replying to simonpj:

OK, so does that take us back to (3), namely the accumulator style? Can you switch the style?

I think that would just bring us back to using plain old VarSets, which we have already profiled.

I could of course make a VarSet flavor that has the additional interesting function and in_scope checks, though I have doubts on whether it would reveal anything new.

comment:89 Changed 7 weeks ago by simonpj

I think that would just bring us back to using plain old VarSets, which we have already profiled.

Maybe so, but then I believe you are saying:

  • (a) FV is faster than VarSet
  • (b) None of (1), (2), (3) from comment:83 are responsible

These can't both be true. We can successively transform FV into VarSet by changing (1), (2), (3), and at some point the perf must change, if (a) is true.

Or have I misunderstood something?

comment:90 Changed 7 weeks ago by tdammers

Oh no. What I was trying to say is that (3) seems to be responsible; I could write an FV flavor to verify it, but it would amount to pretty much just VarSet, just wrapped in a record together with two other fields that we won't use.

In theory it should be interesting to take our FV and *only* change (3), but keep (1) and (2), except that we already suspected that neither of them does anything significant here, and we have already ruled them out as culprits individually.

But I'll do it anyway, even if it's just to make sure I didn't miss anything in embarrassing ways.

comment:91 Changed 7 weeks ago by simonpj

What I was trying to say is that (3) seems to be responsible

So you think that using VarSet, but with an accumulator style, will be fastest of all?

And yet comment:71 suggests that switching to accumulator style should not have any effect on VarSet. Maybe I was benchmarking the wrong thing.

Anyway, yes please try!

comment:92 Changed 7 weeks ago by tdammers

Indeed, yes, that is confusing. I'll also re-run my other profiles to rule out external factors.

comment:93 Changed 7 weeks ago by tdammers

Test run results:

  • wip/T14880-baseline: pre-patch GHC HEAD.

("bytes allocated", "1144547520")

  • wip/T14880: patch fully applied.

("bytes allocated", "1207856632")

  • wip/T14880-just-tvs: only implement the "use VarSets, not FV" part of #14880

("bytes allocated", "1176403280")

  • wip/T14880-nondet-fv: baseline (same as wip/T14880-baseline), but using the NDFV replacement instead of FV (getting rid of the list that FV drags around to maintain insertion order).

("bytes allocated", "1144547040")

  • wip/T14880-nondet-no-in-scope: same as wip/T14880-nondet-fv, but instead of adding to the in_scope set, remove items from the accumulator.

("bytes allocated", "4064821552")

  • wip/T14880-nondet-fv-2: same as wip/T14880-nondet-fv, but with the NDFV type rewritten to update sets in place rather than building a computation accumulator-style.

("bytes allocated", "1222408136")

So it seems that somehow my results got mixed up somewhere [1], because there is a whopping 400% increase in memory use when removing the in-scope feature. There must be something else going on there though, because neither the fully applied patch nor the just-tvs version deviate this severely, despite also not using this in-scope pre-filtering.

Disregarding the nondet-no-in-scope figure, we thus get:

  • applying full patch: 1145M -> 1208M
  • only introducing VarSet: 1145M -> 1176M (worse than baseline, better than full patch)
  • making FV nondeterministic: 1145M -> 1145M (no effect)
  • rewriting (ND)FV to not use accumulator style: 1145M -> 1222M (worse than full patch)

[1] On further thought, I think this is actually expected, due to the way FV composes up set manipulation functions.

Last edited 7 weeks ago by tdammers (previous) (diff)

comment:94 Changed 7 weeks ago by goldfire

I have two takeaways from this:

  • An accumulator style is the Right Way to go when building up a set. This idea should probably be abstracted over right in the VarSet interface. But do we get these gains without eta-expanding everything? (See Note [FV eta expansion] in FV.)
  • The extra work to track order has no effect. I find this surprising, but not overly so, because we can avoid the extra work via laziness.

comment:95 in reply to:  94 Changed 7 weeks ago by tdammers

Replying to goldfire:

I have two takeaways from this:

  • An accumulator style is the Right Way to go when building up a set. This idea should probably be abstracted over right in the VarSet interface. But do we get these gains without eta-expanding everything? (See Note [FV eta expansion] in FV.)

IIRC, I eta-reduced the relevant functions (tyCoVarsOfXXXX), but found no significant difference. I only did this for the VarSet approach though; I believe the reasoning in that note only holds for FV.

Baking accumulator style into VarSet is logical, but if we do this, we also need to add the in_scope filtering thing found in FV, because, as the nondet-no-in-scope version shows, not doing this absolutely devastates performance. However, VarSet plus accumulator style is exactly NDFV, which is FV minus the list, which in turn makes no significant difference anyway. So effectively, this takeaway says we should be using FV.

  • The extra work to track order has no effect. I find this surprising, but not overly so, because we can avoid the extra work via laziness.

Well, when I wrote "no effect", I was actually lying a little bit - getting rid of the list reduces overall allocations by about 500 bytes. Which, I think, is about what you'd expect from removing something that never gets evaluated anyway.

comment:96 Changed 7 weeks ago by goldfire

I guess my "bake into VarSet" idea was essentially to take the FV module and combine it into the VarSet module, making clear that the FV-derived functions are the right way to incrementally build up a VarSet. It's kind of like we know to build up a String using higher-order functions, not ++ -- but we should make it very clear, right in the VarSet module.

Going even further, I imagine this would generalize to any Map and might be good to incorporate into containers.

comment:97 Changed 7 weeks ago by simonpj

OK, I think I have finally nailed what is going on.

I have pushed my work to wip/T1448-accum. NB Missing "0" in the branch name, sorry.

The key insight is in the commit message of the top patch on that branch, which is the only patch that isn't in master, I think:

commit a4e7a912ff426ef9cb968ea21d1a514f203a7ea5
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Aug 31 14:18:55 2018 +0100

    Use an accumulator version of tyCoVarsOfType

    In TyCoRep we now have tyCoVarsOfType implemented

    1) Using FV -- this is the baseline version in GHC today

    2) Using VarSets via unionVarSet

    3) Using VarSets in accumulator-style

    In this patch (3) is enabled.

    When compiling perf/compiler/T5631 we get

             Compiler allocs
       (1)   1,144M
       (2)   1,175M
       (3)   1,142M

    The key new insight in (3) is this:

      ty_co_vars_of_type (TyVarTy v) is acc
        | v `elemVarSet` is  = acc
        | v `elemVarSet` acc = acc   <---- NB!
        | otherwise          = ty_co_vars_of_type (tyVarKind v) is (extendVarSet acc v)

    Notice the second line! If the variable is already in the
    accumulator, don't re-add it.  This makes big difference.
    Without it, allocation is 1,169M or so.

    One cause is that we only take the free vars of its kind once;
    that problem will go away when we do the main part of #14088 and
    close over kinds /afterwards/.  But still, another cause is perhaps
    that every insert into a set overwrites the previous item, and so
    allocates a new path to the item; it's not a no-op even if the
    item is there already.

    Why use (3) rather than (1)?  Because it just /has/ to
    be better;

    * FV carries around an InterestingVarFun, which does nothing
      useful here, but is tested at every variable

    * FV carries around a [Var] for the deterministic version.

    For this very hot operation (finding free vars) I think it
    makes sense to have special purpose code.

    On the way I also simplified the (less used) coVarsOfType/Co family
    to use FV, by making serious use of the InterestingVarFun!

It was this check against the accumulator, which FV has always performed, which is responsible for how FV mysteriously out-performed all other contenders. Once that is done, a stripped-down version of FV (implemented as (3) in the code above) slightly out-performs FV.

All of this is relative to wip/T14880-baseline.

Therefore I suggest

  1. Apply my patch to HEAD (and check for no regressions)
  2. Apply the rest of the #14880 patch to that

By "the rest" I mean everything in the original patch http://git.haskell.org/ghc.git/log/refs/heads/wip/T14880 except the implementation of tcvs_of_type. Specifically, doing closeOverKinds at the end, rather than at each leaf. Doubtless other stuff too.

There are a number of other tickets that are waiting for this one to be finished: #14040, #15076, #14887.

Tobias can you do this?

comment:98 Changed 6 weeks ago by tdammers

Yes, will do. Rebasing the whole thing on HEAD is going to be a bit hairy, but no show stopper.

One thing I find remarkable is that Set doesn't do the existence test, but I guess that's just a matter of prioritizing CPU over RAM that turns out in our disadvantage in this specific case.

comment:99 Changed 6 weeks ago by tdammers

OK, I can confirm that the 1448-accum version outperforms the FV baseline. I'll attempt a rebase now...

comment:100 Changed 6 weeks ago by tdammers

The code as found on the wip/T14880-accum branch now validates, except for one stat failure: haddock.base allocates 6.5% more than expected. I don't know whether this patch introduces the regression or the head I rebased on though. Turns out this regression existed before the patch, so I won't deal with it here.

This is exactly the same version as wip/T1448-accum, but rebased onto master and with some fixes to make it build again (the Coercion type has changed in the meantime, so I had to add another case branch).

Last edited 6 weeks ago by tdammers (previous) (diff)

comment:101 Changed 6 weeks ago by simonpj

OK super. Let's commit! (Does it fix the orginal bug? Is there a test case?)

comment:102 Changed 6 weeks ago by tdammers

I'm not done merging yet, unfortunately. I have rebased your patch onto the pre-patch master branch, but I still have to combine that with the rest of the #14880 patch, which is a bit tricky because both perform similar refactorings (factor out a private worker function for tyCoVarsOf...), but for different reasons. I'm afraid the easiest way is going to be to take the patch as is, and then manually re-engineer the -accum change from there.

Richard did add some test cases though, and we have the existing tests that highlighted the regression in the first place (one of them is what we've been using throughout here).

comment:103 Changed 6 weeks ago by simonpj

I'm afraid the easiest way is going to be to take the patch as is, and then manually re-engineer the -accum change from there.

OK. That should be very easy to accomplish. Let me know if you need help (make a branch if so).

comment:104 in reply to:  103 Changed 6 weeks ago by tdammers

Replying to simonpj:

I'm afraid the easiest way is going to be to take the patch as is, and then manually re-engineer the -accum change from there.

OK. That should be very easy to accomplish. Let me know if you need help (make a branch if so).

I think I've got it, but it probably won't hurt if you take a look. I'll let you know when I'm ready.

comment:105 Changed 6 weeks ago by tdammers

So I manually retrofit the -accum change onto the wip/T14880 branch (pushed to wip/T14880-reengineered); however, I'm still getting stat failures. I'm currently running a full ./validate on a clean build tree to confirm, but maybe you could take a look in the meantime to see if I've missed anything. I have taken the liberty to change a few things around a bit; I believe they shouldn't make a fundamental difference, but maybe I'm wrong.

I also noticed that there is an error in the -accum patch; lines 1664-1667 say:

{- 1664 -} ty_co_vars_of_co_var v is acc
{- 1665 -}   | v `elemVarSet` is = acc
{- 1666 -}   | v `elemVarSet` is  = acc
{- 1667 -}   | otherwise         = ty_co_vars_of_type (varType v) is (extendVarSet acc v)

...but line 1666 is redundant and should probably rather be:

{- 1666 -}   | v `elemVarSet` acc = acc

So I changed that.

The above hasn't been rebased onto master yet btw.; I also have a branch around that is the result of actually merging the wip/T1448-accum branch onto a rebased wip/T14880 branch - that one also has stat failures, but it still contains the error described above. I'll also run a validate on this one after fixing that.

comment:106 Changed 6 weeks ago by goldfire

I might have time to take a look at this, but I'll need a bit more direction. You say that you might have missed something -- what two branches can I compare to assess that?

As for the one-line change: it's seems conceivable that the change could have a measurable performance impact. Have you tested the change, independent of anything else? If the fix makes performance worse, that would be an insight, because the acc check looks like it's only an optimization, not an essential check for correctness.

Thanks!

comment:107 in reply to:  106 Changed 6 weeks ago by tdammers

Replying to goldfire:

I might have time to take a look at this, but I'll need a bit more direction. You say that you might have missed something -- what two branches can I compare to assess that?

I think you may be missing some context here, so here's the executive summary:

  • Your original patch is on wip/T14880
  • The baseline HEAD from right before your patch is on wip/T14880-baseline
  • Simon's patch (as mentioned above) is on wip/T1448-accum
  • wip/T14880-reengineered contains your patch (as on wip/T14880), with the approach from Simon's patch applied practically verbatim.

Now here's the conundrum:

  • The baseline version performs within limits
  • Your patch applied as-is fails two performance tests: 5321Fun and 5631
  • Simon's patch improves performance on 5631
  • But Simon's approach applied to your patch doesn't

As for the one-line change: it's seems conceivable that the change could have a measurable performance impact. Have you tested the change, independent of anything else? If the fix makes performance worse, that would be an insight, because the acc check looks like it's only an optimization, not an essential check for correctness.

It is only an optimization, but that's the whole point of digging further here - the patch allocates more even though we would it expect it to allocate less. AFAIK there are no correctness issues anymore. The patch without the acc check fails the two performance tests, but so does the version with the check.

comment:108 Changed 6 weeks ago by simonpj

But Simon's approach applied to your patch doesn't

Gah! This zombie keeps rising from the dead.

I suggest

Step 1: improve tyCoVarsOfType.

  • Apply my patch to HEAD, with the fix in ty_co_vars_of_co_var.
  • Check that it's a win -- generally perf should improve slightly
  • Commit

Now move on to this ticket

Step 2: deal with Note [Closing over free variable kinds] (this note is in Richard's original patch).

This is a change that fixes an outright bug, albeit one that has not been reported. I think it is nothing to do with the original updateRole problem.

  • Invite Richard to implement Note [Closing over free variable kinds] in TyCoRep, based on his patch. This will take him (or me) 20 mins. I think the only changes are in TyCoRep.
  • Do perf tests.

Step 3. Back to this ticket and updateRole:

  • Apply the rest of Richard's patch, except the stuff in TyCoRep
  • Test

comment:109 Changed 6 weeks ago by simonpj

Richard, in the original patch you say

Also while working on this, I noticed that GHC wasn't doing its best job at keeping left-to-right ordering of type variables in cases where it doesn't really matter, but still would be nice. I've made some improvements in this area.

Where exactly is this change??

comment:110 in reply to:  108 Changed 6 weeks ago by tdammers

Replying to simonpj:

But Simon's approach applied to your patch doesn't

Gah! This zombie keeps rising from the dead.

I suggest

Step 1: improve tyCoVarsOfType.

  • Apply my patch to HEAD, with the fix in ty_co_vars_of_co_var.
  • Check that it's a win -- generally perf should improve slightly
  • Commit

Yes. This should be fairly straightforward - that part of the whole thing I completely understand, so I should be able to do that rather easily. It gets a bit tricky combining this part with other changes that also factor out a tcvs_of_type style worker, but I believe none of these changes are currently in HEAD.

Now move on to this ticket

Step 2: deal with Note [Closing over free variable kinds] (this note is in Richard's original patch).

This is a change that fixes an outright bug, albeit one that has not been reported. I think it is nothing to do with the original updateRole problem.

  • Invite Richard to implement Note [Closing over free variable kinds] in TyCoRep, based on his patch. This will take him (or me) 20 mins. I think the only changes are in TyCoRep.
  • Do perf tests.

Step 3. Back to this ticket and updateRole:

  • Apply the rest of Richard's patch, except the stuff in TyCoRep
  • Test

Is Step 2 the part where you close over kinds at the end rather than in between? Because that also seems fairly straightforward, it's just a simple closeOverKinds implementation, and then factoring out workers for the various tyCoVarsOf... functions and wrapping them with closeOverKinds appropriately. The rest of Richard's patch is what goes over my head.

Other than that, yes, this sounds like a solid plan.

comment:111 Changed 6 weeks ago by simonpj

Is Step 2 the part where you close over kinds at the end rather than in between?

That is step 2, yes. It sounds as if you can make a stab at that and get us to review.

Then we can help with step 3.

comment:112 Changed 6 weeks ago by goldfire

Where exactly is this change??

See changes to FV.hs and UniqDFM.hs here (which is the changeset for the original patch): https://github.com/ghc/ghc/compare/bfc1fc2566944a455572303cbb2cbbf0c539c871...wip/T14880

Otherwise, go ahead with comment::110.

comment:113 Changed 6 weeks ago by simonpj

See changes to FV.hs and UniqDFM.hs

OK: these two look like another unrelated change, so let's do them in Step 1a (after the rest of step 1), so that if they cause perf wibbles we know what the culprit is.

Then we can move on to the semantically important changes: steps 2, and 3.

comment:114 in reply to:  112 Changed 5 weeks ago by tdammers

Replying to goldfire:

Where exactly is this change??

See changes to FV.hs and UniqDFM.hs here (which is the changeset for the original patch): https://github.com/ghc/ghc/compare/bfc1fc2566944a455572303cbb2cbbf0c539c871...wip/T14880

Otherwise, go ahead with comment::110.

That's the entire changeset for T14880 though, isn't it? But we're really talking about fixing the insertion order for FV and UniqDFM for this step IIUC. So:

Step 0: Fix FV / UniqDFM insertion order

Step 1: Refactor tyCoVars... to use accumulator-style VarSet (Simon's patch)

Step 2: Close over kinds at the end

Step 3: "Everything else"

Last edited 5 weeks ago by tdammers (previous) (diff)

comment:115 Changed 5 weeks ago by tdammers

Looks like step 0 already introduces a performance regression:

Unexpected results from:
TEST="T9630"

SUMMARY for test run started at Mon Sep 10 12:56:45 2018 CEST
 0:06:01 spent to go through
     110 total tests, which gave rise to
     526 test cases, of which
     416 were skipped

       4 had missing libraries
     105 expected passes
       0 expected failures

       0 caused framework failures
       0 caused framework warnings
       0 unexpected passes
       0 unexpected failures
       1 unexpected stat failures

Unexpected stat failures:
   compiler/T9630.run  T9630 [stat not good enough] (normal)

That one test deviates by +15% allocations.

(The relevant patch: http://git.haskell.org/ghc.git/commitdiff/c8a1f9c42a37eb0c3514aef1a716fdbcb912da31?hp=510c5f4f22aca29a9c36fd993ac79e9077b28173)

comment:116 Changed 5 weeks ago by goldfire

Wow! It never occurred to me that the incidental change around ordering (which was entirely unforced -- done just so that users would more often see variables reported in an order similar to what was written) could have such an impact on performance. I suppose it's because mapUnionFV can get inlined but f can't? Happily, the solution is dead simple: just don't make that change! (We still can, I imagine, make the changes in UniqDFM. I'd be even more flabbergasted if those changes affected performance.)

Thanks!

comment:117 Changed 5 weeks ago by tdammers

Alright, if the FV/UniqDFM change doesn't affect correctness, I'll first get the rest of the patches implemented on top of master, and then we can take a closer look at step 0 if desired.

comment:118 Changed 5 weeks ago by simonpj

We still can, I imagine, make the changes in UniqDFM. I'd be even more flabbergasted if those changes affected performance.)

Richard is suggesting trying just the UniqDFM part of Step 0. But I'm totally ok with making a separate ticket for that bit of the patch

comment:119 Changed 5 weeks ago by tdammers

Well, the reason I want to skip the FV/UniqDFM part for now is because people are blocked on this ticket, and we don't strictly need this part to get them unblocked.

comment:120 Changed 5 weeks ago by tdammers

Differential Rev(s): Phab:D4769Phab:D4769, Phab:D5141

Phab:5141 addresses Step 1.

I found no performance regressions, and nofib indicates deviations on the order of 0.1%, most of them improvements.

comment:121 Changed 5 weeks ago by tdammers

Step 2 implemented here, on top of step 1:

http://git.haskell.org/ghc.git/commit/2728d63f0a42251d24d5fc4f044633f981891131

Unfortunately, this causes regressions:

Unexpected results from:
TEST="T12150 T12227 T12545 T5321Fun T5631"

SUMMARY for test run started at Tue Sep 11 12:37:11 2018 CEST
 0:05:56 spent to go through
     110 total tests, which gave rise to
     526 test cases, of which
     416 were skipped

       4 had missing libraries
     101 expected passes
       0 expected failures

       0 caused framework failures
       0 caused framework warnings
       0 unexpected passes
       0 unexpected failures
       5 unexpected stat failures

Unexpected stat failures:
   compiler/T5631.run     T5631 [stat not good enough] (normal)
   compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   compiler/T12227.run    T12227 [stat not good enough] (normal)
   compiler/T12545.run    T12545 [stat not good enough] (normal)
   compiler/T12150.run    T12150 [stat not good enough] (optasm)

It's possible that I made mistakes in manually porting the patch onto master, so by all means please review.

comment:122 Changed 5 weeks ago by simonpj

Sigh.

For a start, closeOverKinds stupidly goes via FV, which is silly in our new setup. Perhaps

closeOverKinds tv_set = tv_set `unionVarSet` mapUnionVarSet kind_fvs tv_set
  where
    kind_fvs tcv = ty_co_vars_of_type (varType tv) emptyVarSet emptyVarSet

You might want to define

tyCoVarsOfType_unclosed :: Type -> TyCoVarSet
tyCoVarsOfTYpe_unclosed ty = ty_co_var_of_type emptyVarSet emptyVarSet

Next, I'd worry about that mapUnionVarSet. It seems that checking for dups, and avoiding unions, is a help. So perhaps

closeOverKinds tv_set = close emptyVarSet tv_set
 where
   close :: TyCoVarSet -> TyCoVarSet -> TyCoVarSet
   -- (close acc tvs2) extends acc with tcvs closed over kinds
   -- Invariant: acc is already closed-over-kinds
   close acc tcvs = nonDetFoldUFM close1 acc tvs

   close1 :: TyCoVar -> TyCoVarSet -> TyCoVarSet
   close1 tcv acc
     = add1 (close kind_vars acc)
     where
       kind_vars = ty_co_vars_of_type (varType tcv) acc emptyVarSet
       add1 tvs2 | tcv `elemVarSet` tvs2 = tvs2
                 | otherwise             = extendVarSet tvs2 tcv

By using the the "in-scope set" (first argument) of ty_co_var_of_type, I'm saying "don't bother with variables whose kind we have already closed over". And I'm trying to use extendVarSet, not unionVarSet.

I'm not certain I have this right, but I think it's close.

comment:123 Changed 5 weeks ago by simonpj

Good grief. More revelations/insight concerning closeOverKinds.

  • The actual bug for #14880 is fixed by the small patch in comment:13, concerning candidateQTyVarsOfType in TcType. Nothing to do with TyCoRep.
  • But this patch did not quite work for reasons that are lost in the mists of time. Later, Ricahrd produced Phab:D4769 (see comment:19).
  • But Phab:D4769 was significantly more ambitious, which in turn led to the detective work on this ticket. In particular, concerning closeOverKinds, it added this in TyCoRep:
    Note [Closing over free variable kinds]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Note that it's necessary to close over kinds at the /end/ of collecting
    the variables. This is for two reasons:
    
    1. Efficiency. If we have Proxy (a::k) -> Proxy (a::k) -> Proxy (a::k), then
       we don't want to have to traverse k more than once.
    
    2. Correctness. Imagine we have forall k. b -> k, where b has
       kind k, for some k bound in an outer scope. If we look at b's kind inside
       the forall, we'll collect that k is free and then remove k from the set of
       free variables. This is plain wrong. We must instead compute that b is free
       and then conclude that b's kind is free.
    
  • It seems that moving closeOverKinds from the occurrences of a type variable to after finding the free vars (i.e. Step 2) is responsible for the perf regressions reported in comment:121.

But note that, contrary to my claim in comment:108, this change in closeOverKinds does not fix the original bug -- that was in TcType!

My new insight is this: Step 2 is can be done in a much simpler way.

Consider the two points in the Note above.

  1. Efficiency: we now have an accumulator, so the seond time we encounter 'a', we'll ignore it, certainly not looking at its kind.
  1. Correctness: we have an "in-scope set" (I think we should call it it a "bound-var set"), specifying variables that are bound by a forall in the type we are traversing; we simply ignore these variables, certainly not looking at their kind.

So consider forall k. b -> k, where b :: k->Type is free; but of course, it's a different k! When looking at b -> k we'll have k in the bound-var set. So we'll ignore the k. But suppose this is our first encounter with b; we want the free vars of its kind. But we want to behave as if we took the free vars of its kind at the end; that is, with no bound vars in scope.

So it's easy. Our current code is this

ty_co_vars_of_type (TyVarTy v) is acc
  | v `elemVarSet` is  = acc
  | v `elemVarSet` acc = acc
  | otherwise          = ty_co_vars_of_type (tyVarKind v) is (extendVarSet acc v)

All we need do is to take the free vars of tyVarKind v with an empty bound-var set, thus:

ty_co_vars_of_type (TyVarTy v) is acc
  | v `elemVarSet` is  = acc
  | v `elemVarSet` acc = acc
  | otherwise          = ty_co_vars_of_type (tyVarKind v) emptyVarSet (extendVarSet acc v)
                                                          ^^^^^^^^^^^

This change is subtle, but it's both more efficient and less code than my suggestion in comment:122.

The same change should be made in the FV version lower down in TyCoRep.

Would you like to try that.

Richard: do you agree?

Last edited 5 weeks ago by simonpj (previous) (diff)

comment:124 Changed 5 weeks ago by goldfire

  • First off, no mists of time here: the more ambitious patch was because comment:13 fixed the problem in TcType -- which caused the original bug -- but that same problem existed in TyCoRep. We just didn't have (and still don't, to my knowledge) a concrete program that exhibits misbehavior. I don't remember the exact cause of the bug I mentioned in comment:13, but I don't think that's about the changes in TyCoRep.
  • Your new approach relies on this invariant: If a variable in a tyvar kind is in the bound-var set, then the variable is in the bound-var set, too. Perhaps easier is to think about the contrapositive: If a variable is not in the bound-var set (in other words, if we look at it at all), then no variable in its kind is in the bound-var set (so we can zap the bound-var set). This invariant is certainly true of well-typed programs. (It's just barely conceivable that it might not be true in the case of a mis-ordered telescope, but let's not worry about that now.) So: yes, I think that works nicely.

Nice idea. It's very much worth a Note.

@tdammers: Do you think you can push this new idea through? I can write the Note if need be.

comment:125 Changed 5 weeks ago by simonpj

If a variable is not in the bound-var set (in other words, if we look at it at all), then no variable in its kind is in the bound-var set (so we can zap the bound-var set).

No, not at all! Suppose (b::k) is free in some type (forall k. b -> T k). Then, when we encounter b

  • b is not in the bound-var set
  • But k certainly is, and should be
  • But it's a different k, despite having the same name

So it's not just that we can zap the bound-var set; we must zap it!!

Consider: if we instead waited until the end we'd have b::k in our free var set. Then we call closeOverKinds (that was our plan, before comment:123). Well, at that point the bound-var set is certainly empty (we are at the top), and we find the free vars of k (namely k itself) using that empty bound-var set.

Regardless, this still looks solid to me, if Tobias can just test it.

(It could conceivably be worth doing TcType.candidateQTyVarsOfType the same way, for consistency; but that is an un-forced change.)

comment:126 Changed 5 weeks ago by goldfire

We're in complete agreement. I was phrasing things strangely, in that I was considering variables to have proper identities, not just uniques. That is, the variables in the kind of a free var might, incidentally, have a doppelganger in the bound-var set, but that's just a doppelgangers, not an appearance of the variable we're processing. In any case, I agree with comment:125.

comment:127 Changed 5 weeks ago by tdammers

So, let's see if I understand things correctly:

  • The original plan was to move the "closing over kinds" part out of ty_co_vars_of_type, and doing it in one go at the very end.
  • The new plan is to change ty_co_vars_of_type such that the "interesting" case (TyVar with a variable that hasn't been seen before) is handled in place, because the real problem is not about when we do the closing over kinds, but rather that we do not want to handle the same variable more than once.
  • Hence, we should not move the closing-over-kinds part at all, and instead just modify the TyVar branch of ty_co_vars_of_type so that it handles each variable it encounters exactly once (comment:123)

I will certainly give that a try and see how that plays out in terms of performance.

I had been implementing Simon's suggestion from comment:122, and at least while still using mapUnionVarSet, I'm still getting regressions on 4 tests, so the hypothesis in comment:123 seems plausible to me - after all, if we can handle this correctly while already traversing the code, that's going to be more efficient than first building up the whole varset and then traversing it again to remove the variables we don't want.

comment:128 Changed 5 weeks ago by simonpj

Hence, we should not move the closing-over-kinds part at all, and instead just modify the TyVar branch of ty_co_vars_of_type so that it handles each variable it encounters exactly once (comment:123)

Correct, that's what I'm suggesting.

I had been implementing Simon's suggestion from comment:122, and at least while still using mapUnionVarSet, I'm still getting regressions on 4 tests,

I'd be surprised if you still got regressions if you adopted the more efficient closeOverKinds from comment:122. But (a) I've been surprised many times by this zombie and (b) I think comment:123 is better anyway.

TL;DR: do comment:123.

comment:129 Changed 5 weeks ago by tdammers

Indeed. comment:123 implemented in Phab:5147, no more stat regressions, will do a nofib run as well nofib looks fine too.

Implementing comment:122 fully would involve sorting out a few type errors, because the posted snippet doesn't compile as-is, but I believe comment:123 is better anyway.

Last edited 5 weeks ago by tdammers (previous) (diff)

comment:130 Changed 5 weeks ago by tdammers

Differential Rev(s): Phab:D4769, Phab:D5141Phab:D4769, Phab:D5141, Phab:5147

comment:131 Changed 5 weeks ago by simonpj

Differential Rev(s): Phab:D4769, Phab:D5141, Phab:5147Phab:D4769, Phab:D5141, Phab:D5147

comment:132 Changed 4 weeks ago by tdammers

Differential Rev(s): Phab:D4769, Phab:D5141, Phab:D5147Phab:D4769, Phab:D5141, Phab:D5147, Phab:D5150

comment:133 Changed 4 weeks ago by simonpj

Plan (c.f. comment:114)

  • Step 0: I think Tobias wants to skip this for now (comment:119) (unless there is some reason not to). Action: make a new ticket for this.
  • Step 1: commit now, unless there is some reason not to. Phab:D5141
  • Step 2: Phab:D5147. Ben says there are Lint errors, which sounds mysterious. He will explain how to reproduce this.
  • Step 3: still to come.

Let's get this done!

But we decided not to hold up the 8.6 train for this alone.

Last edited 4 weeks ago by simonpj (previous) (diff)

comment:134 Changed 4 weeks ago by tdammers

Did an overnight validate run of the 3 changes, one by one. Here's what I got:

-- master (a3bce956d7) --

Unexpected stat failures:
   /tmp/ghctest-hajvz66w/test   spaces/perf/compiler/T9630.run  T9630 [stat not good enough] (normal)

-- + step 1 (wip/T14880-2-step1) --

Unexpected stat failures:
   /tmp/ghctest-shsa0qwp/test   spaces/perf/compiler/T9630.run  T9630 [stat not good enough] (normal)

-- + step 2 (wip/T14880-2-step2-c123) --

Unexpected failures:
   /tmp/ghctest-5yp1lu_z/test   spaces/dependent/should_fail/BadTelescope4.run  BadTelescope4 [exit code 0] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/dependent/should_fail/T14066g.run        T14066g [exit code 0] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_compile/T12844.run   T12844 [stderr mismatch] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_compile/T15039a.run  T15039a [stderr mismatch] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_compile/T15039b.run  T15039b [stderr mismatch] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_compile/T15039c.run  T15039c [stderr mismatch] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_compile/T15039d.run  T15039d [stderr mismatch] (normal)
   /tmp/ghctest-5yp1lu_z/test   spaces/partial-sigs/should_run/T15415.run       T15415 [bad stderr] (ghci)
   /tmp/ghctest-5yp1lu_z/test   spaces/polykinds/T14265.run                     T14265 [stderr mismatch] (normal)

-- + step 3 (wip/T14880-2-step3) --

Unexpected failures:
   /tmp/ghctest-ht1chu9r/test   spaces/dependent/should_compile/T14880-2.run       T14880-2 [exit code non-0] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/dependent/should_fail/BadTelescope4.run     BadTelescope4 [exit code 0] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/dependent/should_compile/dynamic-paper.run  dynamic-paper [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/dependent/should_fail/T14066g.run           T14066g [exit code 0] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T12844.run      T12844 [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039a.run     T15039a [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039b.run     T15039b [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039c.run     T15039c [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039d.run     T15039d [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_run/T15415.run          T15415 [bad stderr] (ghci)
   /tmp/ghctest-ht1chu9r/test   spaces/polykinds/T9222.run                         T9222 [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/polykinds/T14265.run                        T14265 [stderr mismatch] (normal)

So it seems that step1 doesn't add any regressions beyond what's on master already, but steps 2 and 3 do. Step 3 also fails to pass the additional test T14880-2 it introduces.

comment:135 Changed 4 weeks ago by simonpj

But these are error wibbles, probably fine. (Still need looking at, of course.

No perf changes? I can see no reported regression, but could you try some of the ones which did regress before, and see what happens to compiler allocations etc?

Changed 3 weeks ago by tdammers

Attachment: perf-step3.log added

Performance test results after step 3

comment:136 Changed 3 weeks ago by tdammers

The perf regressions we saw in previous attempts were these:

Unexpected stat failures:
   compiler/T5631.run     T5631 [stat not good enough] (normal)
   compiler/T5321Fun.run  T5321Fun [stat not good enough] (normal)
   compiler/T12227.run    T12227 [stat not good enough] (normal)
   compiler/T12545.run    T12545 [stat not good enough] (normal)
   compiler/T12150.run    T12150 [stat not good enough] (optasm)

Results for these:

=====> T5631(normal) 7 of 43 [0, 0, 0]
cd "T5631.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" -c T5631.hs -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output   +RTS -V0 -tT5631.comp.stats --machine-readable -RTS
    Expected    T5631(normal) bytes allocated: 1161885448 +/-5%
    Lower bound T5631(normal) bytes allocated: 1103791175 
    Upper bound T5631(normal) bytes allocated: 1219979721 
    Actual      T5631(normal) bytes allocated: 1165383416 
    Deviation   T5631(normal) bytes allocated:        0.3 %

=====> T5321Fun(normal) 10 of 43 [0, 0, 0]
cd "T5321Fun.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" -c T5321Fun.hs -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output   +RTS -V0 -tT5321Fun.comp.stats --machine-readable -RTS
    Expected    T5321Fun(normal) bytes allocated: 423774560 +/-5%
    Lower bound T5321Fun(normal) bytes allocated: 402585832 
    Upper bound T5321Fun(normal) bytes allocated: 444963288 
    Actual      T5321Fun(normal) bytes allocated: 438838640 
    Deviation   T5321Fun(normal) bytes allocated:       3.6 %

=====> T12227(normal) 25 of 43 [0, 0, 0]
cd "T12227.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" -c T12227.hs -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output  -O2 -ddump-hi -ddump-to-file +RTS -M1G +RTS -V0 -tT12227.comp.stats --machine-readable -RTS
    Expected    T12227(normal) bytes allocated: 752214784 +/-5%
    Lower bound T12227(normal) bytes allocated: 714604044 
    Upper bound T12227(normal) bytes allocated: 789825524 
    Actual      T12227(normal) bytes allocated: 744873984 
    Deviation   T12227(normal) bytes allocated:      -1.0 %

=====> T12545(normal) 28 of 43 [0, 0, 0]
cd "T12545.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" --make  T12545 -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output  -v0 +RTS -V0 -tT12545.comp.stats --machine-readable -RTS
    Expected    T12545(normal) bytes allocated: 3249613688 +/-5%
    Lower bound T12545(normal) bytes allocated: 3087133003 
    Upper bound T12545(normal) bytes allocated: 3412094373 
    Actual      T12545(normal) bytes allocated: 3212031504 
    Deviation   T12545(normal) bytes allocated:       -1.2 %

=====> T12150(optasm) 32 of 43 [0, 0, 0]
cd "T12150.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" -c T12150.hs -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output  -O -fasm  +RTS -V0 -tT12150.comp.stats --machine-readable -RTS
    Expected    T12150(optasm) bytes allocated: 77557800 +/-5%
    Lower bound T12150(optasm) bytes allocated: 73679910 
    Upper bound T12150(optasm) bytes allocated: 81435690 
    Actual      T12150(optasm) bytes allocated: 76394064 
    Deviation   T12150(optasm) bytes allocated:     -1.5 %

Executive summary: All of these tests perform roughly the same or better, except for the notorious T5631Fun test, which deviates by 3.6% - still significant, but not enough to exceed the 5% threshold.

comment:137 Changed 3 weeks ago by simonpj

But these figure are

  • deviations from the "expected" stored in all.T

That's quite different from

  • difference in allocation between just before the patch and just after the patch

Maybe this 3.6% was there before the patch!

Also, it'd be good to know what "the patch" is when showing the differences.

comment:138 in reply to:  137 Changed 3 weeks ago by tdammers

Replying to simonpj:

But these figure are

  • deviations from the "expected" stored in all.T

That's quite different from

  • difference in allocation between just before the patch and just after the patch

Ah yes, of course. I'll do another run on the base commit.

Also, it'd be good to know what "the patch" is when showing the differences.

Right, yes. "The Patch" is the latest implementation of steps 1, 2-c123, and 3, as found in ​Phab:D5141, ​Phab:D5147, and ​Phab:D5150 (all 3 applied in order).

comment:139 Changed 3 weeks ago by tdammers

Comparing the baseline commit 510c5f4f22 against wip/T14880-2-step3, we get (bytes allocated):

  • T5631 from 1169733648 down to 1165383416, or +0.7% down to +0.3% (vs. all.T)
  • T5321Fun from 435532680 up to 438838640, or +2.8% up to +3.6%
  • T12227 from 754850808 down to 744873984, or +0.4% down to -1.0%
  • T12545 from 3289709120 down to 3212031504, or +1.2% down to -1.2%
  • T12150 from 76390472 to 76394064, insignificant

comment:140 Changed 3 weeks ago by tdammers

The first test to fail is T14880-2, a new test based on the reproduction case in comment:2. GHC 8.4 fails with a panic; the GHC version in wip/T14880-2-step3 compiles it without errors when compiling core lint disabled, but enabling core lint (as in the test suite) triggers a lint failure:

=====> T14880-2(normal) 1 of 1 [0, 0, 0]
cd "dependent/should_compile/T14880-2.run" &&  "/home/tobias/well-typed/devel/ghc-phab/inplace/test   spaces/ghc-stage2" -c T14880-2.hs -dcore-lint -dcmm-lint -no-user-package-db -rtsopts -fno-warn-missed-specialisations -fshow-warning-groups -fdiagnostics-color=never -fno-diagnostics-show-caret -dno-debug-output  
Compile failed (exit code 1) errors were:
*** Core Lint errors : in result of Desugar (before optimization) ***
<no location info>: warning:
    In the type ‘Proxy (Foo arg_a1eO)’
    Kind application error in type ‘Proxy a_a1eN’
      Function kind = forall k. k -> *
      Arg kinds = [(arg_a1eO, *), (a_a1eN, arg_a1eM)]
    Fun: arg_a1eO
         (a_a1eN, arg_a1eM)
<no location info>: warning:
    In the type ‘Proxy (Foo arg_a1eO)’
    Kind application error in type ‘Foo arg_a1eO’
      Function kind = forall x -> forall (a :: x). Proxy a -> *
      Arg kinds = [(arg_a1eO, *), (a_a1eN, arg_a1eM)]
    Forall: a_aZA
            arg_a1eO
            (a_a1eN, arg_a1eM)
*** Offending Program ***
Rec {
$tcFoo :: TyCon
[LclIdX]
$tcFoo
  = TyCon
      1426396007728932770##
      937350176756910988##
      $trModule
      (TrNameS "Foo"#)
      2#
      $krep_a1qj

$krep_a1qm [InlPrag=NOUSERINLINE[~]] :: KindRep
[LclId]
$krep_a1qm = $WKindRepVar (I# 1#)

$krep_a1ql [InlPrag=NOUSERINLINE[~]] :: KindRep
[LclId]
$krep_a1ql = $WKindRepVar (I# 0#)

$krep_a1qj [InlPrag=NOUSERINLINE[~]] :: KindRep
[LclId]
$krep_a1qj = KindRepFun $krep_a1qk krep$*

$krep_a1qk [InlPrag=NOUSERINLINE[~]] :: KindRep
[LclId]
$krep_a1qk
  = KindRepTyConApp
      $tcProxy
      (: @ KindRep $krep_a1ql (: @ KindRep $krep_a1qm ([] @ KindRep)))

$trModule :: Module
[LclIdX]
$trModule = Module (TrNameS "main"#) (TrNameS "Bug"#)

quux :: forall arg (a :: arg) arg. Proxy (Foo arg) -> ()
[LclIdX]
quux
  = \ (@ arg_a1eM)
      (@ (a_a1eN :: arg_a1eM))
      (@ arg_a1eO)
      (ds_d1qn :: Proxy (Foo arg_a1eO)) ->
      ()
end Rec }

*** End of Offense ***

I can't tell whether this means that the Core is indeed incorrect, or that the core linter is wrong in rejecting this program.


Then we have BadTelescope4. GHC 8.4 fails with the following error:

testsuite/tests/dependent/should_fail/BadTelescope4.hs:9:39: error:
    Variable ‘a’ used as both a kind and a type
    Did you intend to use TypeInType?
  |
9 | data Bad a (c :: Proxy b) (d :: Proxy a) (x :: SameKind b d)
  |                                       ^

testsuite/tests/dependent/should_fail/BadTelescope4.hs:9:59: error:
    Variable ‘d’ used as both a kind and a type
    Did you intend to use TypeInType?
  |
9 | data Bad a (c :: Proxy b) (d :: Proxy a) (x :: SameKind b d)
  |                                                           ^

But the patched GHC compiles it just fine, causing a test failure. I'm not 100% sure, but I believe that the code should in fact typecheck now, so I guess we should mark it as "should compile"?


Next up: dynamic_paper. According to the test definition, this one should compile. However, on GHC 8.4, it fails with an error (Variable ‘k’ used as both a kind and a type), which I believe is correct for 8.4; the patched GHC however ends up exhausting simplifier ticks. So that's probably not good.


Then, T14066g: this one is very similar to BadTelescope4, and I believe the same arguments apply here - it fails on GHC 8.4, but I believe should compile cleanly with the step3 patch.


The remaining ones are all stdout deviations:

   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T12844.run      T12844 [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039a.run     T15039a [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039b.run     T15039b [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039c.run     T15039c [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_compile/T15039d.run     T15039d [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/partial-sigs/should_run/T15415.run          T15415 [bad stderr] (ghci)
   /tmp/ghctest-ht1chu9r/test   spaces/polykinds/T9222.run                         T9222 [stderr mismatch] (normal)
   /tmp/ghctest-ht1chu9r/test   spaces/polykinds/T14265.run                        T14265 [stderr mismatch] (normal)
  • T12844 is caused by names of inferred type/kind variable names mismatching; not a problem.
  • T15039? are all caused by changing the order in which type/kind variables are listed in an error message; not a problem.
  • T15415: kind variables swapped in error message; not a problem.
  • T9222: inferred variable names swapped in implicit forall; not a problem.
  • T14265: variables swapped in error message; not a problem.

So to summarize: T14880-2 needs looking into; for BadTelescope4, dynamic_paper, and T14066g, I would like for someone to confirm my reasoning; the remaining tests can be accepted without further ado.

Last edited 3 weeks ago by tdammers (previous) (diff)

comment:141 Changed 3 weeks ago by simonpj

Richard and I had a good discussion about this. Happily, we are now back to addressing the original problem, rather than chasing perf ghosts. We've agreed a path forwards and he's going to execute on it. So you can down-tools for now.

(Note to self: Richard and I wrote notes on the FC-call channel, dated today.)

comment:142 Changed 2 weeks ago by simonpj

Richard, any progress on this? We are keen to nail it.

comment:143 Changed 2 weeks ago by simonpj

Tobias will nail Step 1 and commit it. Then over to Richard.

comment:144 Changed 2 weeks ago by bgamari

Milestone: 8.8.18.6.2

It would be great if we could get this into a shape where it could be merged to 8.6.2.

comment:145 Changed 13 days ago by goldfire

Blocking: 15592 added

comment:146 Changed 11 days ago by goldfire

Interestingly, the current approach to this undoes the work to fix #14552, which has a very similar character. We now accept the program in #14552, but zapping some tyvars to Any (an approach we recognized would work back then, but decided issuing an error was better).

The curious can track progress here.

comment:147 Changed 11 days ago by goldfire

Differential Rev(s): Phab:D4769, Phab:D5141, Phab:D5147, Phab:D5150Phab:D4769, Phab:D5141, Phab:D5147, Phab:D5150, Phab:D5208

My patch is at Phab:D5208. Before the most recent rebase, it validated.

comment:148 Changed 9 days ago by tdammers

Looks like Phab:D5141 validates cleanly. Do we still need this part, or does Phab:D5208 supersede this effort?

Also, Phab:D5208 fails to build on CI, so that needs looking into, I think.

comment:149 Changed 8 days ago by simonpj

Looks like ​Phab:D5141 validates cleanly. Do we still need this part, or does ​Phab:D5208 supersede this effort?

No, it doesn't supersede. Let's just get D5141 committed. It does Step 1. Then we can at last forget about Step 1.

In fact, let's do Step 2 too. All we need to do there is to deal with the closeOverKinds change, using the clever trick in comment:123. But NB: you need the same trick for ty_co_vars_of_co_var.

I think Step 2 should work cleanly, and probably improves performance. Tobias can you do that?

That sets the stage for Phab:D5208, where some semantic trickiness is going on. But let's get the easy stuff done, nailed, committed, and forgotten!

comment:150 Changed 7 days ago by tdammers

Great! Phab:D5141 is ready to merge as far as I am concerned.

Step 2 is already lying around in patch form, but I need to rebase it onto step 1, and sort out the test wibbles. Doing that as we speak.

comment:151 Changed 35 hours ago by Krzysztof Gogolewski <krz.gogolewski@…>

In 08b3db7e/ghc:

Use an accumulator version of tyCoVarsOfType

Summary:
This is part 1 from #14880: factor out a worker for the tyCoVarsOf... family of
function, implementing them in terms of VarSet, but with accumulator-style
(like in `FV`) built in, and with the same kind of pre-insert lookup; this
has shown to perform better than either FV or plain VarSet in this particular
scenario.

Original notes from simonpj:

In TyCoRep we now have tyCoVarsOfType implemented

1) Using FV -- this is the baseline version in GHC today

2) Using VarSets via unionVarSet

3) Using VarSets in accumulator-style

In this patch (3) is enabled.

When compiling perf/compiler/T5631 we get

         Compiler allocs
   (1)   1,144M
   (2)   1,175M
   (3)   1,142M

The key new insight in (3) is this:

  ty_co_vars_of_type (TyVarTy v) is acc
    | v `elemVarSet` is  = acc
    | v `elemVarSet` acc = acc   <---- NB!
    | otherwise          = ty_co_vars_of_type (tyVarKind v) is (extendVarSet acc v)

Notice the second line! If the variable is already in the
accumulator, don't re-add it.  This makes big difference.
Without it, allocation is 1,169M or so.

One cause is that we only take the free vars of its kind once;
that problem will go away when we do the main part of #14088 and
close over kinds /afterwards/.  But still, another cause is perhaps
that every insert into a set overwrites the previous item, and so
allocates a new path to the item; it's not a no-op even if the
item is there already.

Why use (3) rather than (1)?  Becuase it just /has/ to
be better;

* FV carries around an InterestingVarFun, which does nothing
  useful here, but is tested at every variable

* FV carries around a [Var] for the deterministic version.

For this very hot operation (finding free vars) I think it
makes sense to have speical purpose code.

On the way I also simplified the (less used) coVarsOfType/Co family
to use FV, by making serious use of the InterestingVarFun!

Test Plan: validate, nofib

Reviewers: simonpj, bgamari

Reviewed By: simonpj

Subscribers: rwbarton, carter

GHC Trac Issues: #14880

Differential Revision: https://phabricator.haskell.org/D5141

comment:152 Changed 6 hours ago by simonpj

Step 2 is Phab:D5147 and wip/T14880​-2-step2-c​123. Two validation failures.

Note: See TracTickets for help on using tickets.