Opened 13 months ago

Closed 12 months ago

Last modified 12 months ago

#8900 closed bug (fixed)

Strictness analysis regression

Reported by: tibbe Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.1-rc2
Keywords: Cc: simonmar
Operating System: MacOS X Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description (last modified by tibbe)

Edit: There were two issues discussed here. One is solved; I left the ticket open for the strictness analysis regression. Analysis of that regression starts in comment 7 below.

I ran a simple benchmark that exercises Data.HashMap.Lazy.insert. It's 16% slower using HEAD compared to using 7.6.3. The generated Core is a bit different and the generated Cmm is quite a bit different.

Steps to reproduce

  1. Download the attached HashMapInsert.hs benchmark.
  2. Install unordered-containers with both 7.6.3 and HEAD:
$ cabal install -w ghc-7.6.3 unordered-containers-0.2.3.3
$ cabal install -w inplace/bin/ghc-stage2 unordered-containers-0.2.3.3
  3. Compile the benchmark with both compilers:
$ ghc-7.6.3 -O2 HashMapInsert.hs
$ mv HashMapInsert HashMapInsertOld
$ inplace/bin/ghc-stage2 -O2 HashMapInsert.hs
$ mv HashMapInsert HashMapInsertNew

Results (best of 3 runs)

7.6.3

$ ./HashMapInsertOld +RTS -s
   1,191,223,528 bytes allocated in the heap
     141,978,520 bytes copied during GC
      37,811,840 bytes maximum residency (8 sample(s))
      22,378,432 bytes maximum slop
              99 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2277 colls,     0 par    0.06s    0.06s     0.0000s    0.0002s
  Gen  1         8 colls,     0 par    0.07s    0.10s     0.0127s    0.0479s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.24s  (  0.24s elapsed)
  GC      time    0.13s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.01s elapsed)
  Total   time    0.37s  (  0.41s elapsed)

  %GC     time      34.8%  (40.3% elapsed)

  Alloc rate    4,923,204,681 bytes per MUT second

  Productivity  65.2% of total user, 59.0% of total elapsed

HEAD:

$ ./HashMapInsertNew +RTS -s
   1,191,223,128 bytes allocated in the heap
     231,158,688 bytes copied during GC
      55,533,064 bytes maximum residency (13 sample(s))
      22,378,488 bytes maximum slop
             144 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2268 colls,     0 par    0.06s    0.07s     0.0000s    0.0003s
  Gen  1        13 colls,     0 par    0.12s    0.16s     0.0127s    0.0468s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.25s  (  0.25s elapsed)
  GC      time    0.18s  (  0.23s elapsed)
  EXIT    time    0.00s  (  0.01s elapsed)
  Total   time    0.43s  (  0.49s elapsed)

  %GC     time      41.6%  (47.5% elapsed)

  Alloc rate    4,738,791,249 bytes per MUT second

  Productivity  58.3% of total user, 51.9% of total elapsed

(Note that this is without the patches in #8885, so they're not the cause.)

An interesting difference is that we spend more time in GC in HEAD. I don't know if that's related.

Attachments (7)

HashMapInsert.hs (737 bytes) - added by tibbe 13 months ago.
HashMapInsert-7.6.3.dump-simpl (27.7 KB) - added by tibbe 13 months ago.
HashMapInsert-7.6.3.dump-opt-cmm (64.0 KB) - added by tibbe 13 months ago.
HashMapInsert-HEAD.dump-simpl (28.4 KB) - added by tibbe 13 months ago.
HashMapInsert-HEAD.dump-opt-cmm (81.6 KB) - added by tibbe 13 months ago.
HashMapInsert-master-no-extra-case.dump-opt-cmm (80.3 KB) - added by tibbe 13 months ago.
HashMapInsert-7.6.3-no-extra-case.dump-opt-cmm (64.0 KB) - added by tibbe 13 months ago.

Change History (32)

Changed 13 months ago by tibbe

comment:1 Changed 13 months ago by tibbe

  • Description modified (diff)
  • Summary changed from unordered-containers 19% slower in HEAD vs 7.6.3 to unordered-containers 16% slower in HEAD vs 7.6.3

comment:2 Changed 13 months ago by tibbe

  • Version changed from 7.6.3 to 7.9

comment:3 Changed 13 months ago by tibbe

I also see some changes in the Core. I've attached both the Core and optimized Cmm output. The action is in $wpoly_go, which is the body of the Data.HashMap.Lazy.insert function.

Last edited 13 months ago by tibbe (previous) (diff)


comment:4 Changed 13 months ago by tibbe

  • Description modified (diff)

comment:5 Changed 13 months ago by tibbe

  • Description modified (diff)

comment:6 Changed 13 months ago by simonmar

So we have

  MUT     time    0.24s  (  0.24s elapsed)
  GC      time    0.13s  (  0.17s elapsed)

vs

  MUT     time    0.25s  (  0.25s elapsed)
  GC      time    0.18s  (  0.23s elapsed)

i.e. almost all the difference is in GC. And:

     141,978,520 bytes copied during GC
              99 MB total memory in use (0 MB lost due to fragmentation)

vs

     231,158,688 bytes copied during GC
             144 MB total memory in use (0 MB lost due to fragmentation)

And

  Gen  1         8 colls,     0 par    0.07s    0.10s     0.0127s    0.0479s

vs

  Gen  1        13 colls,     0 par    0.12s    0.16s     0.0127s    0.0468s

My guess, based on seeing things like this before, is that the benchmark is very sensitive to exactly when a major GC strikes - perhaps it has a spike in memory usage at some point. You ought to be able to test this hypothesis by tweaking the GC options. Try with a very large allocation area (-A2G).

There's a very small change in the MUT time, which is probably accounted for by extra cache misses caused by the extra GC activity. So I suspect this is nothing to worry about.

comment:7 Changed 13 months ago by tibbe

I've found a likely culprit in the generated Core.

Compare the case for Full in $wpoly_go for 7.6.3:

        case indexArray# rb_a1cI i#_s1oq of _ { (# ipv2_a1cS #) ->
        case $wpoly_go ww_s2HT ww1_s2HX w_s2HZ (+# ww2_s2I2 4) ipv2_a1cS
        of st'_a1cV { __DEFAULT ->

vs the case for Full in $wpoly_go for HEAD:

        case indexArray# dt_a2Ly i#_a2LH
        of _ [Occ=Dead] { (# ipv2_a2LM #) ->
        case ipv2_a2LM of st_a2LO { __DEFAULT ->
        case $wpoly_go ww_s4A3 ww1_s4A7 w_s4zY (+# ww2_s4Ab 4) st_a2LO
        of st'_a2LP { __DEFAULT ->

Both cases correspond to this snippet in Data.HashMap.Base:

    go h k x s t@(Full ary) =
        let !st  = A.index ary i
            !st' = go h k x (s+bitsPerSubkey) st

Here's the definition of A.index:

index :: Array a -> Int -> a
index ary (I# i#) =
        case indexArray# (unArray ary) i# of (# b #) -> b

There's an extra case in the HEAD version, which translates into an extra tag-bits check in the Cmm. This happens in a couple of places in the Core. The case is unnecessary, since $wpoly_go scrutinizes st_a2LO immediately. This looks like a regression in strictness analysis.

Last edited 13 months ago by tibbe (previous) (diff)

comment:8 Changed 13 months ago by tibbe

While the extra case is definitely a regression, it doesn't seem to be the cause of the time difference. Changing A.index to:

index :: Array a -> Int -> (# a #)
index ary (I# i#) =
        indexArray# (unArray ary) i#

and the snippet from Data.HashMap.Base to:

    go h k x s t@(Full ary) =
        let !(# st #) = A.index ary i
            !st' = go h k x (s+bitsPerSubkey) st

which forces the indexing, but not the returned element, makes the difference in the Core go away. The time difference remains, however.
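The unboxed-tuple trick above can be sketched in a self-contained form. In the snippet below, the `Array` wrapper and `mkArray` are hypothetical stand-ins for unordered-containers' internal `Data.HashMap.Array` module; only `index` mirrors the definition from this ticket. The point is that `indexArray#` returns its result in a unary unboxed tuple, so forcing the tuple forces the *indexing* without forcing the element it points to:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.ST (ST(..), runST)

-- Boxed wrapper around an immutable array (hypothetical stand-in
-- for unordered-containers' internal Array type).
data Array a = Array (Array# a)

-- Build an array of the given size with every slot set to x.
mkArray :: Int -> a -> Array a
mkArray (I# n#) x = runST (ST (\s0 ->
  case newArray# n# x s0 of
    (# s1, marr #) -> case unsafeFreezeArray# marr s1 of
      (# s2, arr #) -> (# s2, Array arr #)))

-- Returning a unary unboxed tuple means the caller can force the
-- read itself while leaving the element unevaluated.
index :: Array a -> Int -> (# a #)
index (Array ary#) (I# i#) = indexArray# ary# i#

main :: IO ()
main = case index (mkArray 3 (42 :: Int)) 1 of
  (# x #) -> print x
```

Pattern-matching the `(# a #)` result (or, since GHC 7.8, binding it with `let !(# st #) = ...`) is what replaces the extra `case` that the regression introduced.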

Last edited 13 months ago by tibbe (previous) (diff)

comment:9 Changed 13 months ago by tibbe

Here are the numbers with -A2G:

7.6.3:

$ ./HashMapInsert +RTS -s -A2G
   1,191,223,528 bytes allocated in the heap
           3,400 bytes copied during GC
          36,080 bytes maximum residency (1 sample(s))
          13,072 bytes maximum slop
            2081 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.01s    0.01s     0.0064s    0.0064s

  INIT    time    0.01s  (  0.02s elapsed)
  MUT     time    0.58s  (  0.93s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.01s  (  0.01s elapsed)
  Total   time    0.60s  (  0.96s elapsed)

  %GC     time       1.0%  (0.7% elapsed)

  Alloc rate    2,063,221,325 bytes per MUT second

  Productivity  97.6% of total user, 61.0% of total elapsed

HEAD:

$ ./HashMapInsert +RTS -s -A2G
   1,191,223,096 bytes allocated in the heap
           3,312 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          13,160 bytes maximum slop
            2081 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.01s    0.02s     0.0158s    0.0158s

  INIT    time    0.01s  (  0.02s elapsed)
  MUT     time    0.60s  (  0.93s elapsed)
  GC      time    0.01s  (  0.02s elapsed)
  EXIT    time    0.01s  (  0.02s elapsed)
  Total   time    0.62s  (  0.99s elapsed)

  %GC     time       1.0%  (1.6% elapsed)

  Alloc rate    1,998,679,702 bytes per MUT second

  Productivity  97.7% of total user, 61.3% of total elapsed

I'll accept Simon M's explanation of the difference. I'll leave the ticket open for the strictness analysis issue.

comment:10 Changed 13 months ago by tibbe

  • Description modified (diff)
  • Summary changed from unordered-containers 16% slower in HEAD vs 7.6.3 to Strictness analysis regression

comment:11 Changed 13 months ago by tibbe

  • Version changed from 7.9 to 7.8.1-rc2

The strictness analysis regression is also present in the latest 7.8 snapshot I had lying around, ghc-7.8.0.20140228, so it's likely in the 7.8 RC as well.

comment:12 Changed 13 months ago by tibbe

  • Description modified (diff)

comment:13 Changed 13 months ago by simonmar

If that small difference in the MUT time is reliable, there might be a regression in the code generator we need to look into. Were those numbers with or without the extra case expression?

comment:14 Changed 13 months ago by tibbe

Without the extra case expression, there's still a small difference. Using the best of 5 runs, here are the numbers:

HEAD (46d05ba03d1491cade4a3fe33f0b8c404ad3c760):

$ ./HashMapInsert +RTS -s -A2G
   1,191,223,096 bytes allocated in the heap
           3,312 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          13,160 bytes maximum slop
            2081 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.01s    0.01s     0.0060s    0.0060s

  INIT    time    0.01s  (  0.02s elapsed)
  MUT     time    0.60s  (  0.89s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.01s  (  0.01s elapsed)
  Total   time    0.62s  (  0.92s elapsed)

  %GC     time       0.9%  (0.6% elapsed)

  Alloc rate    1,991,359,179 bytes per MUT second

  Productivity  97.8% of total user, 66.0% of total elapsed

7.6.3:

$ ./HashMapInsert +RTS -s -A2G
   1,191,223,528 bytes allocated in the heap
           3,400 bytes copied during GC
          36,080 bytes maximum residency (1 sample(s))
          13,072 bytes maximum slop
            2081 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.01s    0.01s     0.0057s    0.0057s

  INIT    time    0.01s  (  0.02s elapsed)
  MUT     time    0.59s  (  0.86s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.01s  (  0.01s elapsed)
  Total   time    0.61s  (  0.90s elapsed)

  %GC     time       0.9%  (0.6% elapsed)

  Alloc rate    2,033,686,150 bytes per MUT second

  Productivity  97.8% of total user, 66.4% of total elapsed

I've attached the Cmm for both HEAD and 7.6.3. They're not trivial to compare, as in HEAD everything is in one function, $wpoly_go_entry, while in 7.6.3 it's split over two, $wpoly_go_info and s2ZS_ret.

comment:15 Changed 12 months ago by simonpj

Johan, the "extra case" comes from commit 28d9a03253e8fd613667526a170b684f2017d299, whose comment says

    Make CaseElim a bit less aggressive
    
    See Note [Case elimination: lifted case]:
    
    We used to do case elimination if
            (c) the scrutinee is a variable and 'x' is used strictly
    But that changes
        case x of { _ -> error "bad" }
        --> error "bad"
    which is very puzzling if 'x' is later bound to (error "good").
    Where the order of evaluation is specified (via seq or case)
    we should respect it.

And indeed that's a good point. Note [Case binder next] in Simplify.lhs is relevant here.

You point out that $wpoly_go is strict in its last arg, so it really is "next". I wonder how picky we want to be. Consider

case x of y -> g (error "uk") y

and suppose that g is strict in both arguments. Would it be ok to drop the case? Then we might get error "uk" (if g evaluated its first arg first) rather than any error arising from x? Probably yes, I guess.

Indeed maybe the change is just wrong, or at least over-conservative. According to our paper "A semantics of imprecise exceptions" it is ok. I don't think anyone actually reported a bug. So maybe we should revert to the ghc 7.6 behaviour?
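A minimal source-level sketch of the transformation under discussion (the function names here are made up, not from the ticket). At the source level ``x `seq` f x`` compiles to Core of the shape `case x of y -> f y`; when `f` is strict in its argument, eliminating that case is valid under the imprecise-exceptions semantics, because `f` forces `x` anyway:

```haskell
module Main where

-- f is strict in its argument: it always evaluates y.
-- NOINLINE keeps the example from being simplified away entirely.
{-# NOINLINE f #-}
f :: Int -> Int
f y = y + 1

-- Compiles to Core of the form: case x of y -> f y
withSeq :: Int -> Int
withSeq x = x `seq` f x

-- The case-eliminated form: since f scrutinizes its argument
-- immediately, the explicit seq is redundant.
withoutSeq :: Int -> Int
withoutSeq x = f x

main :: IO ()
main = print (withSeq 5, withoutSeq 5)
```

The only observable difference between the two forms is *which* exception surfaces when `x` and another argument both diverge, and "A semantics for imprecise exceptions" licenses either answer.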

comment:16 Changed 12 months ago by tibbe

I don't know what the right approach is. If we decide to keep the new behavior I'll have to be more careful about any extra evaluation, as I can't trust GHC to remove it.

comment:17 Changed 12 months ago by simonmar

I'm in favour of being gung-ho about evaluation order, especially when there's an optimisation to be had. If the user wants to control evaluation order, then pseq is the right way to do it, not seq.
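To illustrate the seq/pseq distinction (a sketch; `ordered` is a made-up example): `seq` only promises that both values are evaluated before the result is used, while `pseq` (exported from GHC.Conc in base) additionally fixes which one is evaluated first:

```haskell
module Main where

import GHC.Conc (pseq)

-- pseq guarantees that a is evaluated before b; seq makes no
-- ordering promise, only that both end up evaluated before the
-- overall result is demanded.
ordered :: Int -> Int -> Int
ordered a b = a `pseq` (b `seq` (a + b))

main :: IO ()
main = print (ordered 2 3)
```

So code that relies on evaluation *order* (rather than mere strictness) should use `pseq`, leaving the optimiser free to rearrange plain `seq`s as in the fix for this ticket.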

comment:18 Changed 12 months ago by Simon Peyton Jones <simonpj@…>

In 0b6fa3e95078797f87302780a85607decab806fb/ghc:

Eliminate redundant seq's (Trac #8900)

This patch makes the simplifier eliminate a redundant seq like
    case x of y -> ...y....
where y is used strictly.  GHC used to do this, but I made it less
aggressive in

   commit 28d9a03253e8fd613667526a170b684f2017d299 (Jan 2013)

However #8900 shows that doing so sometimes loses good
transformations; and the transformation is valid according to "A
semantics for imprecise exceptions".  So I'm restoring the old
behaviour.

See Note [Eliminating redundant seqs]

comment:19 Changed 12 months ago by simonpj

OK I have restored the old behaviour. (No need to merge.)

I did a nofib run and found

  • No visible change in binary size
  • No change in allocation
  • Wibbles up and down in runtime, but nothing consistent. (On my machine, k-nucleotide varies by up to 10% in runtime across successive runs, for example.)

So I don't think there are significant losses, and Johan definitely has a gain to get.

I have not actually tried Johan's example, but perhaps Johan can?

Simon

comment:20 Changed 12 months ago by tibbe

  • Resolution set to fixed
  • Status changed from new to closed

My code now looks good. Thanks!

comment:21 follow-up: Changed 12 months ago by simonpj

In comment 9 you said "While the extra case is definitely a regression, it doesn't seem to be the cause of the time difference", and in comment 14 "without the extra case there is still a small difference".

Is that still true? If so there might still be something to track down.

Simon

comment:22 in reply to: ↑ 21 Changed 12 months ago by tibbe

Replying to simonpj:

In comment 9 you said "While the extra case is definitely a regression, it doesn't seem to be the cause of the time difference", and in comment 14 "without the extra case there is still a small difference".

Is that still true? If so there might still be something to track down.

There's still a (quite small) difference in mutator time. I haven't had time to investigate it.

The Cmm output by 7.6.3 (attachment:HashMapInsert-7.6.3.dump-opt-cmm) and HEAD (attachment:HashMapInsert-HEAD.dump-opt-cmm) look quite a bit different, at least superficially. For example, 7.6.3 has the core loop split into two functions, $wpoly_go_info and s30b_ret, while HEAD has just one function, $wpoly_go_entry.

comment:23 Changed 12 months ago by tibbe

I've looked a bit more closely and one difference between 7.6.3 and HEAD is the extra stack spilling before the eval check I reported in #8905.

Another difference is that the recursive call appears as

jump $wpoly_go_info; // [R6, R5, R4, R3, R2]

in 7.6.3 but as

call $wpoly_go_info(R6, R5, R4, R3, R2) returns to c56A, args: 8, res: 8, upd: 8;

in HEAD. I don't know if this is just a syntactic difference.

comment:24 follow-up: Changed 12 months ago by simonmar

one difference between 7.6.3 and HEAD is the extra stack spilling before the eval check I reported in #8905

I was surprised by this, because I would expect HEAD and 7.6.3 to generate very similar code with respect to spilling before a call. #8905 is about an improvement we can make in the new code generator, that wasn't possible in the old codegen.

Looking at your dumps I see this for HEAD in $wpoly_go_info:

     c5k9:
	          I64[Sp - 40] = PicBaseReg + block_c53R_info;
	          R1 = R6;
	          I64[Sp - 32] = R2;
	          I64[Sp - 24] = R3;
	          P64[Sp - 16] = R4;
	          I64[Sp - 8] = R5;
	          Sp = Sp - 40;
	          if (R1 & 7 != 0) goto c53R; else goto c53S;

and this for 7.6.3:

	        I64[Sp - 32] = R5;
	        I64[Sp - 24] = R4;
	        I64[Sp - 16] = R3;
	        I64[Sp - 8] = R2;
                R1 = R6;
	        I64[Sp - 40] = PicBaseReg + s30b_info;
	        Sp = Sp - 40;
	        if (R1 & 7 != 0) goto c3zM;

they look identical to me modulo reordering and things falling into different stack slots. Maybe the problematic bit is somewhere else - could you point to it?

Some of the other differences you're seeing are due to the fact that the new codegen (with the NCG) doesn't break up functions at proc-points, so you see larger chunks of code where 7.6.3 broke things into smaller pieces. Most of the time this won't make any difference to the generated code, unless there's a join point (a let-no-escape) where HEAD should generate better code.

comment:25 in reply to: ↑ 24 Changed 12 months ago by tibbe

Replying to simonmar:

I was surprised by this, because I would expect HEAD and 7.6.3 to generate very similar code with respect to spilling before a call. #8905 is about an improvement we can make in the new code generator, that wasn't possible in the old codegen.

My mistake. I've been reading too much Cmm lately.

they look identical to me modulo reordering and things falling into different stack slots. Maybe the problematic bit is somewhere else - could you point to it?

I have no idea then. There's a lot of Cmm and diff tools don't work well because blocks have been reordered a lot between 7.6.3 and HEAD. Since the difference is so small I suggest we ignore it for now. I'm more interested in getting #8905 into a production-ready state.
