#7561 closed bug (duplicate)

Unnecessary Heap Allocations - Slow Performance

Reported by: wurmli Owned by:
Priority: normal Milestone: 7.8.1
Component: Compiler Version: 7.6.1
Keywords: Cc: benl, rleshchinskiy@…, pho@…, shachaf@…
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by simonpj)

Using the vector library, operations that should in principle run locally and fast are instead slow and build a large heap. While trying to analyse what was going on, a strange effect showed up: compiling the attached small program with heap profiling support produced an executable that runs fast and uses the heap as expected, whereas the same program built without profiling support is slow. The effect shows up on Linux on both amd64 and i386, using GHC 7.6.1 and 7.4.1 respectively.

1) With profiling support

ghc --make -rtsopts -threaded -O2 -prof -fprof-auto heapAllocVec2.hs
./heapAllocVec2 +RTS  -s -RTS 3628800

produces

fromList [3628800]
     667,829,536 bytes allocated in the heap
         125,768 bytes copied during GC
          65,560 bytes maximum residency (2 sample(s))
          20,096 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...

Total   time    0.34s  (  0.35s elapsed)

2) Without profiling support

ghc --make -rtsopts  -threaded -O2  heapAllocVec2.hs
./heapAllocVec2 +RTS  -s -RTS 3628800

fromList [3628800]
  26,098,406,816 bytes allocated in the heap
      22,674,848 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          22,448 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...

Total   time   10.99s  ( 11.06s elapsed)

Attachments (1)

heapAllocVec2.hs (890 bytes) - added by wurmli 16 months ago.


Change History (11)

Changed 16 months ago by wurmli

comment:1 Changed 16 months ago by simonpj

  • Description modified (diff)
  • Difficulty set to Unknown

I think your point is: profiling interferes with optimisation. And so it does. I don't know how to fix that I'm afraid.

Or are you saying that a non-profiled program is unexpectedly slow?

comment:2 Changed 16 months ago by simonmar

  • Cc benl rl@… added
  • Component changed from Build System to Compiler
  • Milestone set to 7.8.1

I just reproduced it with 7.4.1: the program does appear to run many times faster with profiling enabled, which is indeed very suspicious.

I suggest that someone familiar with vector has a look to see which optimisation or RULE is not happening. Roman or Ben perhaps?
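
One way to gather evidence would be to compare rule firings between the two builds, along these lines (a sketch; the p_o/p_hi suffixes just keep the profiled objects from clobbering the plain ones):

ghc --make -O2 -fforce-recomp -ddump-rule-firings heapAllocVec2.hs > rules-plain.txt
ghc --make -O2 -fforce-recomp -prof -fprof-auto -osuf p_o -hisuf p_hi -ddump-rule-firings heapAllocVec2.hs > rules-prof.txt

Diffing the two files should show which fusion RULES fire in one build but not the other.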

comment:3 Changed 16 months ago by simonpj

  • Cc rleshchinskiy@… added; rl@… removed

Crumbs, I got that totally back to front. The PROFILED program runs faster! (I thought it was the un-profiled one.) As Simon says, that is deeply mysterious. I've changed Roman's email on the cc list to his gmail address; vector is his package.

comment:4 Changed 15 months ago by PHO

  • Cc pho@… added

comment:5 Changed 15 months ago by wurmli

Let me add a similar case, which better illustrates why I think the culprit is not the vector library but the way the optimiser deals with reductions. (That's why I had put "unnecessary heap allocations" in the title; #1498 might be a related ticket.)

import System.Environment
import Control.Applicative
import Control.Monad

import Control.Monad.ST

import qualified Data.Vector.Storable.Mutable as VSM
import qualified Data.Vector.Storable as G


-------------------------------------------------------------------
-- Increment counter by 1

acc1 :: G.MVector s Int -> ST s ()
acc1 v = do
  k <- VSM.read v 0
  VSM.write v 0 (k + 1)

-- Increment counter 4 times by 1 until limit is reached
-- i.e. each loop adds 4

whileModify1 :: G.MVector s Int -> ST s ()
whileModify1 v = do
  go <- do{ k<-VSM.read v 0; n<-VSM.read v 1; return (compare k n)}
  case go of
    LT -> do {acc1 v; acc1 v; acc1 v; acc1 v; whileModify1 v}
    EQ -> return ()
    GT -> return ()

-------------------------------------------------------------------
-- Increment counter by 2

acc2 :: G.MVector s Int -> ST s ()
acc2 v = do
  k  <- VSM.read v 0
  VSM.write v 0 (k + 1)
  k' <- VSM.read v 0
  VSM.write v 0 (k' + 1)

-- Increment counter twice by 2 until limit is reached
-- i.e. each loop adds 4

whileModify2 :: G.MVector s Int -> ST s ()
whileModify2 v = do
  go <- do{ k<-VSM.read v 0; n<-VSM.read v 1; return (compare k n)}
  case go of
    LT -> do {acc2 v; acc2 v; whileModify2 v}
    EQ -> return ()
    GT -> return ()

--------------------------------------------------------------------
    
-- Case 1 is fast with low use of heap space
-- Case 2 is slow with high use of heap
--   (about 400 times slower, 300 times more heap)
    
testAcc :: Int -> Int -> G.Vector Int
testAcc a n = runST $ do
  v <- (G.thaw $ G.fromList [0,n]) :: ST s (G.MVector s Int)
  case a of
    1 -> whileModify1 v
    2 -> whileModify2 v
    _ -> error "testAcc: first argument must be 1 or 2"
  G.freeze v

main :: IO ()
main = do
  let k = 50000000
  n <- read.head <$> getArgs
  print $ testAcc n k
  

The LLVM backend improves the timing about fourfold:

ghc -threaded -rtsopts -O2 -fllvm heapAndSpeed.hs

Even though the two ways of increasing the counter are almost the same (see the desugaring sketch after the run statistics below), the resulting code behaves vastly differently, not only in speed but also in heap usage (and garbage-collector activity).

./heapAndSpeed 1 +RTS -s

fromList [50000000,50000000]
          74,088 bytes allocated in the heap
           6,296 bytes copied during GC
          47,184 bytes maximum residency (1 sample(s))
          18,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.02s  (  0.02s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.02s  (  0.02s elapsed)

./heapAndSpeed 2 +RTS -s

fromList [50000000,50000000]
  25,000,074,088 bytes allocated in the heap
       7,048,952 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          22,448 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    8.06s  (  8.07s elapsed)
GC      time    0.17s  (  0.17s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    8.23s  (  8.24s elapsed)
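
For reference, acc2 above desugars by the standard do-notation rules into nothing more than two read/write pairs chained with >>= (written here as acc2Desugared to avoid clashing with the definition above):

acc2Desugared :: G.MVector s Int -> ST s ()
acc2Desugared v =
  VSM.read v 0 >>= \k ->
  VSM.write v 0 (k + 1) >>
  (VSM.read v 0 >>= \k' ->
   VSM.write v 0 (k' + 1))

So per call it does exactly the work of two acc1 calls, which makes the 400-fold difference between the two loops all the more surprising.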

comment:6 Changed 15 months ago by simonpj

There seem to be two separate issues here

  1. A profiled program runs a lot faster than a non-profiled one
  2. Two variants of a program that depends on the (complex) vector library run with very different speeds.

Can anyone help boil either (1) or (2) down into a smaller program, or otherwise characterise what's going on? I'm out of bandwidth. Using -ticky profiling might help guide you to where the allocation is happening.
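
For example, something along these lines should do (an untested sketch; the .ticky file name is arbitrary):

ghc --make -rtsopts -threaded -O2 -ticky heapAllocVec2.hs
./heapAllocVec2 +RTS -s -rheapAllocVec2.ticky -RTS 3628800

The per-closure entry and allocation counts in heapAllocVec2.ticky should point at where the ~26 GB of allocation comes from.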

Thanks

Simon

comment:7 Changed 15 months ago by shachaf

  • Cc shachaf@… added

For (1), I narrowed the bug down a bit -- it has nothing to do with Vector in particular. Here's a somewhat simpler version:

import Control.Applicative
import Control.Monad
import Control.Monad.ST
import Data.STRef
import System.Environment

doLoop :: Int -> STRef s Int -> ST s ()
doLoop n v = go ((< n) <$> readSTRef v)
  where
    go mbool = do
      t <- mbool
      when t $ modifySTRef v (+1) >> go mbool

main :: IO ()
main = do
   n <- read.head <$> getArgs
   print $ runST $ do { r <- newSTRef 0; doLoop n r; readSTRef r }

If you have a look at the Core for this program under normal -O2, you'll see that instead of sharing n, the read call from main is being moved all the way into the inner loop! This explains the significant slowdown.
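
(To reproduce: save the program as, say, Loop.hs and run

ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all Loop.hs

where -dsuppress-all just cuts down the noise in the Core dump.)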

The code can probably be simplified further, but as it is, even small changes to the program are enough to make the problem go away -- e.g. adding an SCC annotation to almost any part of doLoop, evaluating n before passing it in, or shuffling the code around too much.
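
For concreteness, the "evaluating n before passing it in" variant is just (the seq is the only change; everything else as above):

main :: IO ()
main = do
   n <- read.head <$> getArgs
   n `seq` print (runST $ do { r <- newSTRef 0; doLoop n r; readSTRef r })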

I haven't looked into (2). Either it's the same problem or one of these should be split out into a different ticket, presumably.

comment:8 Changed 15 months ago by simonmar

So (1) is probably a duplicate of #1168 then. Try with -fno-state-hack and see if that fixes it.

comment:9 Changed 15 months ago by shachaf

Ah, yes, -fno-state-hack fixes it. So it looks like it's the same issue as #1168.
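
(For the record, that is just the same command line plus one flag, e.g.

ghc -O2 -fno-state-hack Loop.hs

where -fno-state-hack stops GHC from assuming that every lambda taking a State# token is entered at most once.)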

comment:10 Changed 15 months ago by simonmar

  • Resolution set to duplicate
  • Status changed from new to closed

So I expect that profiling isn't applying the state hack (perhaps the cost centres get in the way). Therefore I'll close this as a dup, but we should bear in mind that the state hack is biting us in subtle and unexpected ways (e.g. big performance differences between profiled and non-profiled builds).
