Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#9872 closed bug (fixed)

Running type functions is too slow

Reported by: simonpj Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.3
Keywords: Cc: goldfire, cactus, dimitris@…, jstolarek
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by simonpj)

Gergo reports, prompted by this question on Stack Overflow.

I wrote some code today using closed type families and datakinds. Also, as a baseline, I typechecked the code using open type families from the original question.

The two files are attached as slow1.hs and slow2.hs below.

On GHC 7.8.3, typechecking took about 45 seconds for each. However, on a 'perf' build of GHC 7.9 d8c437b3, with ghc-stage2, the first one took 1m3s and the second one 1m12s. A 40% and 60% increase in typechecking time, respectively!

Is this some known regression, something surprising, or is 'perf' simply not the right build flavour for this kind of comparison?

Attachments (3)

slow1.hs (4.9 KB) - added by simonpj 3 years ago.
First slow case
slow2.hs (3.8 KB) - added by simonpj 3 years ago.
Second slow case
exp-tyfams.hs (9.9 KB) - added by jstolarek 3 years ago.


Change History (35)

Changed 3 years ago by simonpj

Attachment: slow1.hs added

First slow case

Changed 3 years ago by simonpj

Attachment: slow2.hs added

Second slow case

comment:1 Changed 3 years ago by simonpj

Cc: goldfire cactus dimitris@… added
Description: modified (diff)

Very helpful examples. Working on them.

Simon

comment:2 Changed 3 years ago by simonpj

Some diagnosis:

  • 90%+ of compile time is being spent in TrieMap code, called from partitionFunEqs, called from kick_out in the constraint solver. Here are the top cost centres in a profiled compiler.
    COST CENTRE    MODULE    %time %alloc
    
    fdT            TrieMap    36.7   44.8
    fdList         TrieMap    13.4   16.8
    foldTyLit      TrieMap    10.9   17.2
    fdVar          TrieMap     8.2   11.1
    foldMaybe      TrieMap     3.7    1.3
    foldUFM        UniqFM      3.7    1.5
    lm_nil         TrieMap     2.3    0.0
    lm_cons        TrieMap     1.9    0.0
    mapUnionVarSet VarSet      1.5    1.1
    tlm_string     TrieMap     1.2    0.0
    foldTM         TrieMap     1.1    0.7
    
  • The inert set has up to 150 inert CFunEqCans. That's absurd. I think I know how to reduce it dramatically by improving the order in which goals are solved.
  • Each call to kick_out requires us to look at each of the 150 inert items, in case it mentions the newly-assigned variable. (This is already quadratic, hence wanting to minimise the size of the inert set.)

    But partitionFunEqs still should not be so expensive. Part of the reason turned out to be the use of a TrieMap. Suppose a TrieMap has exactly one item in it, with a big key, like A (K C D E F) (K E D C F) etc. Then the TrieMap will have a long string of singleton UFMs etc to represent it. That's not a very clever representation; simply to reconstruct a copy takes time proportional to the size of the key.

    I think the solution may be to give TrieMaps a special case for singleton maps, which does not invert the key.
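As a toy illustration of the proposed representation (hypothetical names; the real TrieMap module is considerably more general):

import qualified Data.Map.Strict as Map
import Data.Maybe (fromMaybe)

-- Toy stand-in for Type: constructors applied to arguments.
data Key = KCon String | KApp Key Key
  deriving (Eq)

-- A trie keyed by Key, extended with a SingletonKM case: a one-entry
-- map keeps its key intact instead of unwinding it into a long spine
-- of nested singleton sub-maps.
data KeyMap v
  = EmptyKM
  | SingletonKM Key v                  -- the proposed special case
  | KM (Map.Map String v)              -- entries whose key is a KCon
       (KeyMap (KeyMap v))             -- entries whose key is a KApp

insertKM :: Key -> v -> KeyMap v -> KeyMap v
insertKM k v EmptyKM = SingletonKM k v          -- O(1): no spine built
insertKM k v (SingletonKM k' v')
  | k == k'   = SingletonKM k v
  | otherwise = insertKM k' v' (insertKM k v (KM Map.empty EmptyKM))
insertKM (KCon s)   v (KM cons apps) = KM (Map.insert s v cons) apps
insertKM (KApp f a) v (KM cons apps) =
  let inner = fromMaybe EmptyKM (lookupKM f apps)
  in  KM cons (insertKM f (insertKM a v inner) apps)

lookupKM :: Key -> KeyMap v -> Maybe v
lookupKM _ EmptyKM = Nothing
lookupKM k (SingletonKM k' v) = if k == k' then Just v else Nothing
lookupKM (KCon s)   (KM cons _)    = Map.lookup s cons
lookupKM (KApp f a) (KM _    apps) = lookupKM f apps >>= lookupKM a

-- Folding (which is what partitionFunEqs does) is O(1) on a singleton,
-- rather than proportional to the size of the key.
foldKM :: (v -> r -> r) -> KeyMap v -> r -> r
foldKM _ EmptyKM           z = z
foldKM f (SingletonKM _ v) z = f v z
foldKM f (KM cons apps)    z = Map.foldr f (foldKM (foldKM f) apps z) cons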

comment:3 Changed 3 years ago by goldfire

This raises a question I've wondered about several times: Flattening in the solver is great when we have bits of a type that are unknown, like F (Int, a). But, it seems like an awful lot of work to do when a type is fully known, such as F Bool. In the latter case, is there a reason we don't use normaliseType? It would seem to me that this all would get more efficient if we just called normaliseType as an early step in the process, say in flattenFamApp.
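To illustrate the distinction, a small made-up example (the instances here are invented for the sake of the illustration):

type family F a
type instance F Bool     = Char
type instance F (Int, a) = a

-- F Bool is fully known: normaliseType can rewrite it straight to Char,
-- with no solver interaction.
-- F (Int, b), with b not yet known, is what flattening is for: the solver
-- replaces the application with a fresh variable fsk and records the
-- constraint  F (Int, b) ~ fsk  until more is learnt about b.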

comment:4 Changed 3 years ago by jstolarek

Cc: jstolarek added

comment:5 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 37c2ed4bc3d4ff0a4681e9d27c7f748886e413f6/ghc:

Optimise partitionFunEqs for the 'false' case

In the examples from Trac #9872 we were getting a large set of inert CFunEqCans,
and partitioning them was taking ages.  This patch improves it somewhat by optimising
the partition for the case where the predicate is false.

The ticket has more info.

comment:6 in reply to:  3 Changed 3 years ago by simonpj

Replying to goldfire:

This raises a question I've wondered about several times: Flattening in the solver is great when we have bits of a type that are unknown, like F (Int, a). But, it seems like an awful lot of work to do when a type is fully known, such as F Bool. In the latter case, is there a reason we don't use normaliseType? It would seem to me that this all would get more efficient if we just called normaliseType as an early step in the process, say in flattenFamApp.

Reasonable question. But if we had

   Eq (F (P Q R)
         (G (H (F (G (H a Int))))))

there's a risk of trying to normalise at the outer level, and then at every inner level, which would not be very clever. I'll think about this a bit more.

Simon

comment:7 Changed 3 years ago by goldfire

Recall that normaliseType does "deep" normalisation, not just at top level -- it recursively tries to normalise all arguments. And, I don't think it would take long for it to try. If normaliseType doesn't remove a top-level type function, then flatten as normal, but it's still possible some good progress was made below top level.

comment:8 Changed 3 years ago by cactus

With the latest changes I am seeing some further performance regression, to put it mildly.

Typechecking time of slow1.hs on a Linux perf build:

on 7535c83b600792fe03235d2da0a6affcbfddde4b: ~1 minute

on c2c1888110ef91774bf17e8358aa41537124d7f5: 40 minutes

Last edited 3 years ago by cactus

comment:9 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 13b0b460d365828c7d41b39d2ce7d5bbe4c69f98/ghc:

Reorganise the work list, so that flattening goals are treated in the right order

Trac #9872 showed the importance of processing goals in depth-first, so that
we do not build up a huge pool of suspended function calls, waiting for their
children to fire.  There is a detailed explanation in
     Note [The flattening work list]
in TcFlatten

The effect for Trac #9872 (slow1.hs) is dramatic.  We go from too long
to wait down to 28Gbyte allocation.  GHC 7.8.3 did 116Gbyte allocation!
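A toy sketch of the discipline (hypothetical field and function names; GHC's real work list has more structure, see Note [The flattening work list] in TcFlatten):

-- Flattening goals go on a stack and are always preferred, so a nested
-- application like F (G (H a)) finishes reducing H a before control
-- returns to G and then F; no pool of suspended parent calls builds up.
data WorkList ct = WorkList
  { wl_funeqs :: [ct]   -- flattening goals, treated as a stack (LIFO)
  , wl_rest   :: [ct]   -- everything else
  }

extendWorkListFunEq :: ct -> WorkList ct -> WorkList ct
extendWorkListFunEq ct wl = wl { wl_funeqs = ct : wl_funeqs wl }

selectNextWorkItem :: WorkList ct -> Maybe (ct, WorkList ct)
selectNextWorkItem (WorkList (f:fs) rest)   = Just (f, WorkList fs rest)
selectNextWorkItem (WorkList []     (r:rs)) = Just (r, WorkList [] rs)
selectNextWorkItem (WorkList []     [])     = Nothing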

comment:10 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

comment:11 Changed 3 years ago by simonpj

Resolution: fixed
Status: new → closed

Actually with the other changes I've made in the last day or two, allocation in HEAD for T9872a is down to 5.5Gbytes.

Thanks for these examples, Gergo. I claim victory.

Simon

comment:12 Changed 3 years ago by cactus

You might want to also add, as a test, the version that uses closed type families without datakinds: https://gist.github.com/gergoerdi/c5f3522d41e603559dfc

comment:13 Changed 3 years ago by simonpj

yes, both tests are in commit fca85c9b1617417a6170f3591a29d1fe36d35da5/ghc

comment:14 Changed 3 years ago by cactus

but the one I linked to in my latest comment is a third version: closed type families, but everything is on *.

comment:15 Changed 3 years ago by simonpj

Oh ok. Do you think it's significantly different? Or just different?

comment:16 Changed 3 years ago by cactus

I don't know enough about the internals of the type family normalizer to know if it makes a difference or not; but since the datakinds one is closed-world and this is open-world, it might.

Although now that I think about it, since the type families in question are closed, it's not really open-world anyway...

I don't know. Your call.

comment:17 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 7256213843b80d75a86f033be77516a62d56044a/ghc:

Add a third test variant to Trac #9872

comment:18 Changed 3 years ago by cactus

Just to put some numbers on the processing-time improvement, here's an unscientific test run of all three tests, using GHC 7.8.3 and GHC a225c70e00586e2f38a0e27fae698324ae81b006 built in perf mode. Times are rounded to tenths of a second because the whole thing is unscientific anyway: just a single run of ghc --make on the three source files.

            GHC 7.8.3   GHC a225c70
T9872a      45.5s       7s
T9872b      35.7s       6.5s
T9872c      45.3s       7.5s
Last edited 3 years ago by cactus

comment:19 Changed 3 years ago by jstolarek

I mentioned on ghc-devs that I witnessed exponential compile times for type families. I just built the latest HEAD and I see that the problem is gone. With my test case I get the following dependency between input size and compile time:

Input size   GHC 7.8.4 RC1   GHC HEAD
64           1.2s            0.6s
128          11.1s           1.3s
256          1m41s           3.7s

Compiling with GHC 7.8.4 required passing -ftype-function-depth=1000 (perhaps a smaller value would work, but the default one was too low).

comment:20 Changed 3 years ago by simonpj

What is "my test case"? Perhaps we can add it to the perf/compiler tests?

Simon

Changed 3 years ago by jstolarek

Attachment: exp-tyfams.hs added

comment:21 Changed 3 years ago by jstolarek

Attached. It's generated automatically using Template Haskell so it's a bit messy. I was planning to clean it up a bit, but that required more time than I had at my disposal :-(

comment:22 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In dd1b6d4fc96bc2779beaca836d1b9e4b158c1597/ghc:

Add Jan Stolarek's test for Trac #9872

comment:23 Changed 3 years ago by goldfire

I have some work-in-progress here that cuts allocation amounts for the T9872x cases by half. I'm still twiddling knobs to find the best combination of settings, but the high-order bit is this: try to reduce a type family (in flatten_exact_fam_app_fully) before flattening the arguments. Only if that fails do we flatten the arguments.

This works well if we have something like

type family F a where
  F a = G a
type family G a
type family H a where
  H Bool = Char

and we call F (H Bool). The current (HEAD) flattener will flatten H Bool, then reduce F Char to G Char, then flatten Char again, then try to reduce G Char. With my patch, it just reduces F (H Bool) to G (H Bool), then tries to reduce G, fails, and then reduces H Bool to Char. This saves one flattening. But, you can imagine it saving a lot more (and it does, in practice).
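Here is a toy model of that strategy over a made-up type representation (all names hypothetical; the real flatten_exact_fam_app_fully works on GHC's Type and must also build coercion evidence):

data Ty = TCon String [Ty]   -- ordinary constructors: Bool, Char, ...
        | TFam String [Ty]   -- type-family applications

-- One top-level reduction step, if some equation matches; the toy
-- rules correspond to the families F, G, H above.
tryTopReduce :: Ty -> Maybe Ty
tryTopReduce (TFam "F" [a])              = Just (TFam "G" [a])
tryTopReduce (TFam "H" [TCon "Bool" []]) = Just (TCon "Char" [])
tryTopReduce _                           = Nothing

-- Try to reduce at the root before touching the arguments; only when
-- the root is stuck do we recurse into the arguments (and then retry,
-- since reduced arguments may unblock the root).
reduceFully :: Ty -> Ty
reduceFully ty = case tryTopReduce ty of
  Just ty' -> reduceFully ty'
  Nothing  -> case ty of
    TCon c args -> TCon c (map reduceFully args)
    TFam f args ->
      let ty' = TFam f (map reduceFully args)
      in case tryTopReduce ty' of
           Just ty'' -> reduceFully ty''
           Nothing   -> ty'

On F (H Bool) this reduces the root to G (H Bool) first, gets stuck on G, and only then reduces the argument H Bool to Char, which is exactly the trace described above.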

On current tests, it works better not to use the flat-cache (which amounts to memoization of type-level functions) for this very-eager reduction. I'll look forward to trying Janek's tests, which satisfies a direct need: I've been worried that I'm overspecializing to the existing tests, and fresh tests will help.

comment:24 Changed 3 years ago by Richard Eisenberg <eir@…>

In 8e2d858bb837a322f26face78df1b6ef3898e762/ghc:

Optimize flattener by trying to reduce a TF before reducing its args.

This has a demonstrated 2x speed boost on the T9872{a,b,c} tests.
(#9872)

comment:25 Changed 3 years ago by goldfire

In thinking about this issue, the Right Way to get type families to be faster is to optimize them properly. By this, I mean taking the type family definitions and performing an optimization pass on them during type-checking/desugaring. We've essentially just implemented a tiny interpreter inside of TcFlatten, and the whole thing could be more principled in approach.

I'm not making a concrete proposal, just musing on the fact that we have here a classic problem -- how to write a fast interpreter for a given programming language. It just happens to be the language of type families. I suppose there is a body of research and experience on this very issue, and if we want to be serious about fast compilation times in the presence of heavy type-level computation, it would do well to use that body of knowledge.

In any case, I'm done dwelling on performance for a while, but I'm quite pleased with my end result.

comment:26 Changed 3 years ago by jstolarek

I'll look forward to trying Janek's tests, which satisfies a direct need: I've been worried that I'm overspecializing to the existing tests, and fresh tests will help.

My test is just a promoted scanr function generated using singletons. So if you need more tests generating them should be fairly straightforward.
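For reference, a hand-rolled approximation of such a test (the singletons-generated code defunctionalises the function argument; this simplification hard-wires it to (+), which exercises the same kind of nested reduction):

{-# LANGUAGE DataKinds, TypeFamilies, TypeOperators, UndecidableInstances #-}
import GHC.TypeLits (Nat, type (+))

-- scanr (+) z xs at the type level, specialised to (+).
type family ScanrPlus (z :: Nat) (xs :: [Nat]) :: [Nat] where
  ScanrPlus z '[]       = '[z]
  ScanrPlus z (x ': xs) = ConsPlus x (ScanrPlus z xs)

-- Prepend x + head ys; ScanrPlus never produces an empty list,
-- so no '[] equation is needed.
type family ConsPlus (x :: Nat) (ys :: [Nat]) :: [Nat] where
  ConsPlus x (y ': ys) = (x + y) ': y ': ys

-- e.g. ScanrPlus 0 '[1, 2, 3] reduces to '[6, 5, 3, 0]; longer input
-- lists behave like the attached generated test.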

comment:27 Changed 3 years ago by jstolarek

I've been chasing a memory leak on my injective type families branch and it seems that the tests for this ticket are the culprit. When run separately they are harmless and finish in a matter of seconds. However, each of them allocates a lot of memory:

T9872a:

2,680,443,384 bytes allocated in the heap
1,341,144,448 bytes copied during GC
  230,497,136 bytes maximum residency (10 sample(s))
    1,651,136 bytes maximum slop
          465 MB total memory in use (0 MB lost due to fragmentation)

Productivity  34.5% of total user, 34.5% of total elapsed

T9872b:

3,480,475,896 bytes allocated in the heap
2,049,027,512 bytes copied during GC
  466,737,240 bytes maximum residency (12 sample(s))
    2,234,096 bytes maximum slop
          912 MB total memory in use (0 MB lost due to fragmentation)

Productivity  33.2% of total user, 33.1% of total elapsed

T9872c:

2,963,257,224 bytes allocated in the heap
1,905,496,768 bytes copied during GC
  454,512,352 bytes maximum residency (11 sample(s))
    2,104,512 bytes maximum slop
          889 MB total memory in use (0 MB lost due to fragmentation)

Productivity  31.2% of total user, 31.2% of total elapsed

T9872d:

740,175,432 bytes allocated in the heap
564,712,136 bytes copied during GC
 68,077,728 bytes maximum residency (11 sample(s))
    904,080 bytes maximum slop
        174 MB total memory in use (0 MB lost due to fragmentation)

Productivity  26.1% of total user, 26.3% of total elapsed

So when these tests are run concurrently during validation they eat up ~2.5GB of RAM, and that, in conjunction with other things running in the background, is too much for my machine. Note also the productivity: between 2/3 and 3/4 of the running time is spent on garbage collection. These numbers don't look good.

EDIT: I rebased my branch on top of master today, so it contains Richard's fixes.

Last edited 3 years ago by jstolarek

comment:28 Changed 3 years ago by goldfire

An interesting observation here is that the performance tests should probably never be run in parallel. :) Perhaps open a new feature request for this?

I continue to think that comment:25 is the way to go forward here. But that's a lot of work!

comment:29 Changed 3 years ago by jstolarek

Let me only remark that these numbers were taken in separate runs, not parallel ones. Richard, are you seeing similar numbers for these tests? The GC time actually looks pretty bad, and I think it deserves a bug report, not just a feature request.

comment:30 Changed 3 years ago by simonpj

Comment:25 is a bit cryptic. What sort of "compilation" or "optimisation" did you have in mind?

comment:31 Changed 3 years ago by goldfire

I don't have anything in particular in mind -- just that we happen to have some expertise on taking a functional program and executing it quickly, and perhaps we can apply that knowledge to this problem. It seems to me that the algorithm used to reduce type families evolved out of a constraint-solving process, and taking a more direct tack might prove fruitful, while still retaining the ability to work with programs where bits (skolem type variables) aren't -- and can't be -- known.

comment:32 Changed 3 years ago by ezyang

See also #9960
