Opened 3 years ago

Closed 2 years ago

#5549 closed bug (fixed)

~100% performance regression in HEAD compared to ghc6.12, ~22% compared to 7.0.4

Reported by: tomaszw Owned by:
Priority: high Milestone: 7.4.1
Component: Compiler Version: 7.3
Keywords: performance regression Cc:
Operating System: Linux Architecture: x86
Type of failure: Runtime performance bug Difficulty:
Test Case: perf/should_run/T5549 Blocked By:
Blocking: Related Tickets:

Description

Testing following simple program with ghc 7.3.20110927 reveals quite big performance regression compared to ghc6, and smaller one compared to 7.0.4:

lcs3 :: Eq a => [a] -> [a] -> [a]
lcs3 a b = fst $ aux (a, length a) (b, length b)
  where
    aux (_,0) _ = ([],0)
    aux _ (_,0) = ([],0)
    aux (a@(ha:as),la) (b@(hb:bs), lb)
      | ha == hb  = let (s,n) = aux (as,la-1) (bs,lb-1) in (ha : s, n+1)
      | otherwise =
        let (sa,na) = aux (as,la-1) (b,lb)
            (sb,nb) = aux (a,la) (bs,lb-1) in
        if na > nb then (sa,na) else (sb,nb)

main = do putStrLn . show $ lcs3 [1..20] [10..20]

on ghc6.12:

tomaszw@frodo: ~ $ time ./LCS +RTS -s
./LCS +RTS -s 
[10,11,12,13,14,15,16,17,18,19,20]
     888,979,332 bytes allocated in the heap
         647,648 bytes copied during GC
          31,116 bytes maximum residency (1 sample(s))
          33,060 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  1695 collections,     0 parallel,  0.01s,  0.01s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.12s  (  1.13s elapsed)
  GC    time    0.01s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.13s  (  1.14s elapsed)

  %GC time       1.1%  (0.7% elapsed)

  Alloc rate    793,684,067 bytes per MUT second

  Productivity  98.9% of total user, 98.3% of total elapsed


real	0m1.141s
user	0m1.132s
sys	0m0.008s

on ghc 7.0.4:

tomaszw@frodo: ~ $ time ./LCS +RTS -s
./LCS +RTS -s 
[10,11,12,13,14,15,16,17,18,19,20]
   4,338,709,876 bytes allocated in the heap
       5,603,080 bytes copied during GC
          31,892 bytes maximum residency (1 sample(s))
         109,240 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  8275 collections,     0 parallel,  0.05s,  0.05s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.86s  (  1.87s elapsed)
  GC    time    0.05s  (  0.05s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.91s  (  1.91s elapsed)

  %GC time       2.5%  (2.4% elapsed)

  Alloc rate    2,331,159,557 bytes per MUT second

  Productivity  97.5% of total user, 97.3% of total elapsed


real	0m1.913s
user	0m1.896s
sys	0m0.012s

on ghc7.3.20110927:

tomaszw@frodo: ~ $ ghc --make -O2  -rtsopts LCS.hs

[1 of 1] Compiling Main             ( LCS.hs, LCS.o )
Linking LCS ...
tomaszw@frodo: ~ $ time ./LCS +RTS -s
[10,11,12,13,14,15,16,17,18,19,20]
   1,467,845,016 bytes allocated in the heap
       1,211,836 bytes copied during GC
          60,144 bytes maximum residency (1 sample(s))
          22,744 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2810 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.37s  (  2.38s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.39s  (  2.39s elapsed)

  %GC     time       0.6%  (0.6% elapsed)

  Alloc rate    619,142,210 bytes per MUT second

  Productivity  99.4% of total user, 99.2% of total elapsed


real	0m2.392s
user	0m2.384s
sys	0m0.008s

I also tried the llvm backed on newer ghc, but got equivalent numbers.

Change History (5)

comment:1 Changed 3 years ago by daniel.is.fischer

Cannot reproduce on x86_64 linux, 6.12.3, 7.0.4, 7.2.1 and 7.3.20111002 all produce code running in 1.23 ± 0.01s here. Seems to be platform specific. However, 7.0.4 and 7.3 allocate about 30% more than 6.12 and 7.2.

comment:2 Changed 3 years ago by tomaszw

Hey, I have retried the test on newer head (on same 32bit x86 linux box):

The Glorious Glasgow Haskell Compilation System, version 7.3.20111013

and my running time numbers are back to exact ghc6.12 level. Awesome!

comment:3 Changed 3 years ago by simonmar

  • Milestone set to 7.4.1
  • Priority changed from normal to high

let's add the program as a regression test, and look into the difference in allocations (if it's still there).

comment:4 Changed 2 years ago by simonpj@…

commit ca380cd19530fad4860e42844e42a473188e0013

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Oct 21 17:53:30 2011 +0100

    Recover proper sharing for Integer literals
    
    Trac #5549 showed a loss of performance for GHC 7.4.
    What was happening was that an integer literal was being
    allocated each time around a loop, rather than being
    floated to top level and shared.
    
    Two fixes
     * Make the float-out pass float literals that are non-trivial
     * Make the inliner *not* treat Integer literals as size-zero

 compiler/basicTypes/Literal.lhs  |   30 +++++++++++++++++-------------
 compiler/coreSyn/CoreUnfold.lhs  |   14 +++++++++++++-
 compiler/coreSyn/PprCore.lhs     |   13 ++++++-------
 compiler/simplCore/SetLevels.lhs |   14 +++++++++++++-
 4 files changed, 49 insertions(+), 22 deletions(-)

comment:5 Changed 2 years ago by simonpj

  • Resolution set to fixed
  • Status changed from new to closed
  • Test Case set to perf/should_run/T5549

There really was a bug here: literal Integer constants were getting inlined inside functions, resulting in greatly increased allocation. I don't know how you got back to the same figures as 6.12.

I get this for HEAD before my fix

  12,228,727,040 bytes allocated in the heap
       5,875,944 bytes copied during GC
          27,832 bytes maximum residency (1 sample(s))
          21,752 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23381 colls,     0 par    0.16s    0.16s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0006s    0.0006s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    5.97s  (  5.97s elapsed)
  GC      time    0.16s  (  0.16s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    6.13s  (  6.13s elapsed)

  %GC     time       2.6%  (2.6% elapsed)

  Alloc rate    2,048,966,008 bytes per MUT second

  Productivity  97.4% of total user, 97.4% of total elapsed

and this for 6.12

   6,751,283,512 bytes allocated in the heap
       7,051,256 bytes copied during GC
          27,832 bytes maximum residency (1 sample(s))
         242,160 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 12877 collections,     0 parallel,  0.10s,  0.09s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time    4.57s  (  4.60s elapsed)
  GC    time    0.10s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.68s  (  4.69s elapsed)

  %GC time       2.1%  (1.9% elapsed)

  Alloc rate    1,474,079,369 bytes per MUT second

  Productivity  97.6% of total user, 97.5% of total elapsed

But I get this after my fix:

   6,725,848,096 bytes allocated in the heap
       4,559,016 bytes copied during GC
          27,816 bytes maximum residency (1 sample(s))
          19,544 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     12876 colls,     0 par    0.11s    0.11s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0006s    0.0006s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    5.28s  (  5.29s elapsed)
  GC      time    0.11s  (  0.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    5.39s  (  5.39s elapsed)

  %GC     time       2.0%  (2.0% elapsed)

  Alloc rate    1,272,387,285 bytes per MUT second

  Productivity  98.0% of total user, 98.0% of total elapsed

So we are happy.

These figures are not for your program, but another one where I added a more extreme case: testsuite/tests/perf/should_run/T5549.

Thanks for reporting!

Simon

Note: See TracTickets for help on using tickets.