Opened 2 years ago

Closed 22 months ago

#6110 closed bug (fixed)

Data.Vector.Unboxed performance regression of 7.4.1 relative to 7.0.4

Reported by: mdgabriel Owned by: pcapriotti
Priority: high Milestone: 7.4.3
Component: Compiler Version: 7.4.1
Keywords: Vector Performance Regression Cc:
Operating System: Linux Architecture: x86
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets: #6111

Description

Problem

Severe Data.Vector.Unboxed performance regression in 7.4.1 relative to 7.0.4:
(Sum GHC 7.4.1)/(Sum GHC 7.0.4) ~ 2.4

System

GNU/Linux 3.2.0-24-generic 38-Ubuntu i386

Compilers

GHC 7.0.4
GHC 7.4.1
GCC 4.6.3 for a baseline

Main.hs

module Main where

import System.Environment (getArgs)
import qualified Data.Vector.Unboxed as U (generate, sum)

main :: IO ()
main = do args <- getArgs
          if length args == 1
            then putSum (read (head args) :: Int)
            else error "need a count operand"

putSum :: Int -> IO ()
putSum cnt = let v = U.generate cnt (\i -> fromIntegral i :: Double)
                 s = U.sum v
             in putStrLn ("Sum="++show s)

GHC compilation

ghc --version

7.4.1

ghc -O2 -Wall --make -o sum Main.hs


ghc --version

7.0.4

ghc -O2 -Wall --make -o sum Main.hs

Baseline csum.c

#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  unsigned long i, size;
  double tot=0;

  if (argc != 2)
    {
      (void)fprintf(stderr, "usage: %s size\n", basename(argv[0]));
      return(1);
    }

  size = atol(argv[1]);

  for(i = 0; i < size; i++) tot += (double)i;

  (void)printf("Sum=%.15e\n", tot);

  return(0);
}

GCC baseline compilation

gcc --version

4.6.3

gcc -O2 -Wall csum.c -o csum

Data: time sum-7.0.4 n

n seconds
100000000 0.74
200000000 1.46
300000000 2.24
400000000 2.94
500000000 3.70
600000000 4.40
700000000 5.14
800000000 5.89
900000000 6.62
1000000000 7.34

Data: time sum-7.4.1 n

n seconds
100000000 1.74
200000000 3.49
300000000 5.24
400000000 6.98
500000000 8.73
600000000 10.51
700000000 12.22
800000000 13.96
900000000 15.75
1000000000 17.51

Data: time csum-4.6.3 n

n seconds
100000000 1.04
200000000 2.10
300000000 3.12
400000000 4.19
500000000 5.23
600000000 6.26
700000000 7.32
800000000 8.37
900000000 9.41
1000000000 10.45

Linear in n

y is in seconds

GHC 7.0.4: y = (0.73/108) * n + 0.03
GCC 4.6.3: y = (1.04/10
8) * n + 0.03
GHC 7.4.1: y = (1.75/108) * n - 0.01

Severe performance regression:
GHC 7.4.1/GHC 7.0.4 ~ 1.75/0.73 ~ 2.4

Notes

1/ I discovered the problem in a slightly more complicated case when I recompiled a package that used some simple statisics. The sum of [0..(n-1)] was the simplest case that I imagined to demonstrate the problem.

2/ I tried a similar experiment with Data.List, Data.Array.Unboxed,
Data.Vector.Storable.MMap, and Foreign.Marshal.Alloc. In all cases,
the GHC 7.4.1 version was faster than the GHC 7.0.4 version.

3/ It is the same Data.Vector.Unboxed code in both cases compilied and installed separately for each version of the GHC compiler. Thus, the problem appears to be the interaction between Data.Vector.Unboxed and the 7.4.1 compiler that causes the performance regression.

4/ I am impressed that the GHC 7.0.4 sum is faster than the GCC 4.6.3 sum. I expected it to be close, but not faster. Given this
impressive result, I certainly would hope that the same result can be recovered once again.

Change History (14)

comment:1 Changed 2 years ago by mdgabriel

comment:2 Changed 23 months ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to 7.4.2
  • Priority changed from normal to high

Possibly another Integer-related regression, we should look into it.

comment:3 Changed 23 months ago by igloo

  • Milestone changed from 7.4.2 to 7.4.3

comment:4 Changed 23 months ago by simonpj

  • Owner set to pcapriotti

comment:5 Changed 22 months ago by pcapriotti

  • Status changed from new to infoneeded

I can't reproduce this regression. I tried every compiler since 7.0.3, and they
all give very similar results in this benchmark.

Here are my results with 7.4.1:

100000000 0m0.328s
200000000 0m0.644s
300000000 0m0.948s
400000000 0m1.256s
500000000 0m1.572s
600000000 0m1.872s
700000000 0m2.208s
800000000 0m2.488s
900000000 0m2.784s
1000000000 0m3.132s

and 7.0.4:

100000000 0m0.324s
200000000 0m0.624s
300000000 0m0.936s
400000000 0m1.252s
500000000 0m1.544s
600000000 0m1.856s
700000000 0m2.148s
800000000 0m2.472s
900000000 0m2.784s
1000000000 0m3.092s

As you can see, 7.0.4 is only slightly faster. I don't think the difference is
significant, and it certainly is not a factor of 2.4.

@mdgabriel: can you provide any other information that could help us reproduce
the problem?

comment:6 Changed 22 months ago by mdgabriel

@pcapriotti: I would by happy to supply more information, but I am not sure what to supply that would be helpful. Can you give me some guidance?

What I can tell you is that 7.0.4 was a built by me. Version 7.4.1 is available for Ubuntu 12.04 LTS. In both cases, I compiled the package vector using cabal with the options -O and -p.

I will verify my installation to make sure I do not make a mistake and have an unoptimized Data.Vector for 7.4.1.

  • Marcus

comment:7 Changed 22 months ago by mdgabriel

@pcapriotti: I built a 7.4.1 from source with Data.Vector and obtained the same results. I added some SCC manually but obtained no insights. +RTS -t -RTS etc simply told me that all my time is mutator time. So, either

  1. Your version of 7.4.1 is compiled differently than mine, or
  2. Your environment is different from mine such that the difference manifests itself only with 7.4.1 and not 7.0.4.
  3. Something else?

Is there some information you would wish that I have not given?

  • Marcus

comment:8 Changed 22 months ago by pcapriotti

@mdgabriel: How did you build GHC 7.4.1 and 7.0.4? What does your build.mk look like?

comment:9 Changed 22 months ago by mdgabriel

@pcapriotti: In both cases,

  1. ./configure --prefix=SOMEPATH
  2. make
  3. make install

So there was no build.mk, that is, I chose not to use one and to accept the default which has worked for me since ghc 6. I only use build.mk for some specific reason which I have not done in quite a while.

comment:10 Changed 22 months ago by pcapriotti

I do see a small difference when compiling with -prof, though much less than you reported (5.69s vs. 4.89s with 1000000000 as argument).

Can you please post the prof files for runs with 1000000000 as argument?

Can you also check if the difference goes away when you compile the sample program without -prof?

comment:11 Changed 22 months ago by igloo

For me, 7.4.1 is twice as fast as 7.0.2 with an argument of 10000000000.

$ ~/bindists/7.0.2/inst/bin/ghc -O2 -Wall --make -o sum Main.hs
$ time ./sum 10000000000
Sum=4.999999999006786e19
./sum 10000000000  34.77s user 0.00s system 100% cpu 34.771 total
$ ~/bindists/7.4.1/inst/bin/ghc -O2 -Wall --make -o sum Main.hs
$ time ./sum 10000000000
Sum=4.999999999006786e19
./sum 10000000000  17.39s user 0.00s system 100% cpu 17.389 total

Both have vector 0.9.1.

comment:12 Changed 22 months ago by mdgabriel

I too am using in all cases vector 0.9.1. I also carefully rebuilt both 7.0.4 and 7.4.1 with the appropriate Haskell platforms and kept logs to be absolutely sure that I have proper builds.

I have profiling data and GC data if you still want it, but given the previous entry and given #6111, I replaced

v = U.generate cnt (\i -> fromIntegral i :: Double)

with

v = U.generate cnt (\i -> fromIntegral i :: Int)

and obtained the following timings

GHC 7.0.4 time: 1000000000 4.08

GHC 7.4.1 time: 1000000000 5.01

Needless to say, the results are worthless due to integer overflow but
it clearly points to fromIntegral as possessing the problem.

I then replaced

v = U.generate cnt (\i -> fromIntegral i :: Double)

with

v = U.generate cnt (\i -> fromIntegral i :: Int64)

and obtained the following timings

GHC 7.0.4 time: 1000000000 12.82

GHC 7.4.1 time: 1000000000 12.12

The problem clearly appears to be the coercion from Int to Double via
fromIntegral and it clearly appears to be my environment.

Is this coercion code in GHC or is it a library call or what changed between 7.0.4 and 7.4.1?

comment:13 Changed 22 months ago by mdgabriel

Profiling

SCC was used for profiling ghc 7.0.4 and 7.4.1:

putSum :: Int -> IO ()
putSum cnt = let f = {-# SCC "fromIntegral" #-} (\i -> fromIntegral i :: Double)
                 v = {-# SCC "U.generate"   #-} U.generate cnt f
                 s = {-# SCC "U.sum"        #-} U.sum v
             in putStrLn ("Sum="++show s)

Profiling using '-prof -auto-all' with ghc 7.0.4:

	Fri Jun 22 19:00 2012 Time and Allocation Profiling Report  (Final)

	   sum704prof +RTS -p -RTS 100000000

	total time  =       45.54 secs   (2277 ticks @ 20 ms)
	total alloc = 14,800,016,380 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

U.generate                     Main                  46.9   35.1
U.sum                          Main                  31.2   40.5
fromIntegral                   Main                  21.9   24.3


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
 CAF                     Main                                                 300           2   0.0    0.0   100.0  100.0
  main                   Main                                                 306           1   0.0    0.0   100.0  100.0
   putSum                Main                                                 307           1   0.0    0.0   100.0  100.0
    U.sum                Main                                                 310           1  31.2   40.5    31.2   40.5
    U.generate           Main                                                 309           1  46.9   35.1    46.9   35.1
    fromIntegral         Main                                                 308   100000000  21.9   24.3    21.9   24.3
 CAF                     GHC.Float                                            277           1   0.0    0.0     0.0    0.0
 CAF                     Text.Read.Lex                                        266           4   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Handle.FD                                     236           2   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Encoding.Iconv                                194           2   0.0    0.0     0.0    0.0
 CAF                     GHC.Conc.Signal                                      191           1   0.0    0.0     0.0    0.0
 CAF                     Data.Vector.Unboxed                                  182           1   0.0    0.0     0.0    0.0

Profiling using '-prof -auto-all' with ghc 7.4.1:

	Fri Jun 22 19:09 2012 Time and Allocation Profiling Report  (Final)

	   sum741prof +RTS -p -RTS 100000000

	total time  =       44.55 secs   (44550 ticks @ 1000 us, 1 processor)
	total alloc = 15,600,082,056 bytes  (excludes profiling overheads)

COST CENTRE  MODULE    %time %alloc

U.sum        Main       34.3   38.5
U.generate   Main       31.0   38.5
putSum.f.\   Main       24.9   23.1
putSum.f     Main        5.8    0.0
fromIntegral Main        3.9    0.0


                                                              individual     inherited
COST CENTRE         MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                MAIN                     67           0    0.0    0.0   100.0  100.0
 CAF                Main                    133           0    0.0    0.0   100.0  100.0
  main              Main                    134           1    0.0    0.0   100.0  100.0
   putSum           Main                    135           1    0.0    0.0   100.0  100.0
    putSum.f        Main                    140           1    0.0    0.0     0.0    0.0
     fromIntegral   Main                    141           1    0.0    0.0     0.0    0.0
    putSum.v        Main                    138           1    0.0    0.0    65.7   61.5
     U.generate     Main                    139           1   31.0   38.5    65.7   61.5
      putSum.f      Main                    142           0    5.8    0.0    34.7   23.1
       fromIntegral Main                    143           0    3.9    0.0    28.9   23.1
        putSum.f.\  Main                    144   100000000   24.9   23.1    24.9   23.1
    putSum.s        Main                    136           1    0.0    0.0    34.3   38.5
     U.sum          Main                    137           1   34.3   38.5    34.3   38.5
 CAF                GHC.Conc.Signal         111           0    0.0    0.0     0.0    0.0
 CAF                Text.Read.Lex           104           0    0.0    0.0     0.0    0.0
 CAF                GHC.Float               101           0    0.0    0.0     0.0    0.0
 CAF                GHC.IO.Encoding         100           0    0.0    0.0     0.0    0.0
 CAF                GHC.IO.Encoding.Iconv    94           0    0.0    0.0     0.0    0.0
 CAF                GHC.IO.Handle.FD         92           0    0.0    0.0     0.0    0.0

Statistics

Statistics for sum compiled with ghc 7.0.4:

./sum704 1000000000 +RTS -ssum704-stats.txt 
          74,836 bytes allocated in the heap
           1,440 bytes copied during GC
          42,316 bytes maximum residency (1 sample(s))
          23,220 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.42s  (  7.46s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.43s  (  7.46s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    10,078 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

Statistics for sum compiled with ghc 7.4.1:

./sum741 1000000000 +RTS -ssum741-stats.txt 
          87,864 bytes allocated in the heap
           1,756 bytes copied during GC
          42,632 bytes maximum residency (1 sample(s))
          18,808 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   17.55s  ( 17.69s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   17.55s  ( 17.69s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    5,006 bytes per MUT second

  Productivity 100.0% of total user, 99.2% of total elapsed

comment:14 Changed 22 months ago by mdgabriel

  • Resolution set to fixed
  • Status changed from infoneeded to closed

@pcapriotti and @igloo:

I built the head of GHC with the changeset c7a8941b04541789e60950bb126902effae0ccab of #6110, and this recovered the vast majority of the performance difference that I observed between 7.0.4 and 7.4.1.

The version 7.4.1 is still slower than 7.0.4 for me, but I most likely would not have noticed this difference if the above changeset had already been in place. For me, this ticket is closed.

Note: See TracTickets for help on using tickets.