Performance of programs compiled with GHC

This page tracks ongoing efforts to improve the runtime performance of code produced by GHC. If you are interested in the performance of the compiler itself, see Performance/Compiler.

Relevant tickets

  • #10992: Data.List.sum is much slower than a naive recursive definition of it. The slowdown does not occur in GHC 7.8.
  • #6166: An alleged runtime performance regression in mwc-random.
  • #14980 (regressed in 8.4): Runtime performance regression with binary operations on vectors
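
To make the first ticket above concrete, here is a minimal sketch of the kind of comparison #10992 describes: the library `sum` set against hand-written definitions. The names `naiveSum` and `strictSum` are hypothetical and are not taken from the ticket, and whether any difference shows up depends on the GHC version and optimisation flags used.

```haskell
-- Hypothetical hand-written definition: plain, non-tail recursion.
naiveSum :: [Int] -> Int
naiveSum []       = 0
naiveSum (x : xs) = x + naiveSum xs

-- Hypothetical accumulator-based variant; `seq` forces the running
-- total at each step so no chain of thunks builds up.
strictSum :: [Int] -> Int
strictSum = go 0
  where
    go acc []       = acc
    go acc (x : xs) =
      let acc' = acc + x
      in  acc' `seq` go acc' xs

main :: IO ()
main = do
  let n = 100000 :: Int
  -- All three lines print the same total; only the cost may differ.
  print (sum [1 .. n])
  print (naiveSum [1 .. n])
  print (strictSum [1 .. n])
```

To compare the definitions, one option is to build a separate program per definition at the optimisation level of interest and run each with `+RTS -s`, which reports allocation and elapsed time.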

Identify tickets by selecting "Runtime performance bug" in the "Type of failure" field.

Open Tickets:

#16040
Unboxing-Related Performance Issue with Polymorphic Functions
#16004
Vector performance regression in GHC 8.6
#15969
Generic1 deriving should use more coercions
#15842
Exponentiation needs PrelRules
#15731
Add sortOn/coerce rule
#15727
bug: all generations are collected sequentially when compacting collection kicks in
#15717
Performance regression in for_ alternatives from GHC 8.2.2 to newer GHCs
#15652
SerializedCompact has a [(Ptr a, Word)] instead of a custom datatype
#15642
Improve the worst case performance of weak pointers
#15620
Speed up Data.Unique
#15574
C wrappers for Haskell foreign exports don't have finalizers (causes memory leak).
#15524
Performance regression when using the GHC API to evaluate code compared to 8.4
#15503
interpreter: sequence_ (replicate 100000000 (return ())) gobbles up memory
#15366
GHC.Conc.Windows has a surprising queue
#15227
Add PrelRules for par#
#15185
Enum instance for IntX / WordX are inefficient
#15176
Superclass `Monad m =>` makes program run 100 times slower
#15153
GHC uses O_NONBLOCK on regular files, which has no effect, and blocks the runtime
#15127
Unbox around runRW#
#14980
Runtime performance regression with binary operations on vectors
#14941
Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)
#14929
Program compiled with -O2 exhibits much worse performance
#14870
Runtime performance regression in 8.4
#14827
Recognize when inlining would create a join point
#14816
Missed Called Arity opportunity?
#14797
High-residency modules during GHC build
#14789
GHCi fails to garbage collect declaration `l = length [1..10^8]` entered at prompt
#14762
Foreign.Marshal.Pool functions use inefficient O(n) operations
#14727
Unboxed sum performance surprisingly poor
#14620
Polymorphic functions not easily recognized as join points
#14610
newtype wrapping of a monadic stack kills performance
#14565
Performance degrades from -O1 to -O2
#14564
CAF isn't floated
#14509
Consider adding new stg_ap_* functions
#14461
Reuse free variable lists through nested closures
#14407
rts: Threads/caps affinity
#14383
Allocation in VS up 500%
#14359
C-- pipeline/NCG fails to optimize simple repeated addition
#14337
typeRepKind can perform substantial amounts of allocation
#14295
tagToEnum# leads to some silly closures
#14256
GHCi is faster than compiled code
#14239
Let -fspecialise-aggressively respect NOINLINE (or NOSPECIALISABLE?)
#14211
Compiler is unable to INLINE as well as the programmer can manually
#14208
Performance with O0 is much better than the default or with -O2, runghc performs the best
#14072
Code generated by GHC 8.2.1 faster than 8.0.1 but still somewhat slower than 7.10.3
#14003
Allow more worker arguments in SpecConstr
#13904
LLVM does not need to trash caller-saved registers.
#13873
Adding a SPECIALIZE at a callsite in Main.hs is causing a regression
#13851
Change in specialisation(?) behaviour since 8.0.2 causes 6x slowdown
#13763
Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array
#13725
Remove false dependency on the destination of the popcnt instruction
#13692
Constructors and such should be able to move around seq# sometimes
#13629
sqrt should use machine instruction on x86_64
#13362
GHC first generation of GC to be as large as largest cache size by default
#13339
Arbitrarily large expressions built out of cheap primops are not floated out
#13334
Constant folding for repeated integer operation of unknown value
#13331
Worker/wrapper can lead to sharing failure
#13309
Use liftA2 in ApplicativeDo
#13296
stat() calls can block Haskell runtime
#13280
Consider deriving more Foldable methods
#13225
Fannkuch-redux time regression from join point patch
#13193
Integer (gmp) performance regression?
#13153
Several Traversable instances have an extra fmap
#13080
Memory leak caused by nested monadic loops
#13016
SPECIALIZE INLINE doesn't necessarily inline specializations of a recursive function
#13014
Seemingly unnecessary marking of a SpecConstr specialization as a loopbreaker
#13002
:set -O does not work in .ghci file
#12953
Use computed gotos in the interpreter when the compiler supports it
#12900
Common up identical info tables
#12893
Profiling defeats stream fusion when using vector library
#12817
Degraded performance with constraint synonyms
#12808
For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...
#12798
LLVM seeming to over optimize, producing inefficient assembly code...
#12737
T12227 is failing on ghc-8.0
#12665
Make Read instances for Integral types faster, and make them fail fast
#12640
Class member functions not substituted for MultiParamTypeClasses
#12566
Memory leak
#12232
Opportunity to do better in register allocations
#12231
Eliminate redundant heap allocations/deallocations
#12181
Multi-threaded code on ARM64 GHC runtime doesn't use all available cores
#11677
Dramatic de-optimization with "-O", "-O1", "-O2" options
#11668
SPEC has a runtime cost if constructor specialization isn't performed
#11587
Place shared objects in LIBDIR
#11561
Have static ghci link against its own copy of its libraries
#11441
RFC: Inline intermediate languages (Core, STG, Cmm, even StrictCore)
#11393
Ability to define INLINE pragma for all instances of a given typeclass
#11271
Costly let binding gets duplicated in IO action value
#11226
Performance regression (involving sum, map, enumFromThenTo)
#11222
Teach strictness analysis about `catch`-like operations
#11146
Manual eta expansion leads to orders of magnitude less allocations
#11143
Feature request: Add index/read/write primops with byte offset for ByteArray#
#11134
Limit frequency of idle GCs
#11029
Performance loss due to eta expansion
#10992
Performance regression due to lack of inlining of `foldl` and `foldl'`.
#10944
powModInteger slower than computing pow and mod separately
#10922
String inlining is inconsistent
#10906
`SPECIALIZE instance` could be better
#10809
Add prefetch{Small}{Mutable}Array[0..3]#
#10804
Rules conditional on strictess properties
#10730
Spectral norm allocations increased 17% between 7.6 and 7.8
#10652
Better cache performance in Array#
#10648
Some 64-vector SIMD primitives are absolutely useless
#10626
Missed opportunity for SpecConstr
#10606
avoid redundant stores to the stack when examining already-tagged data
#10482
Not enough unboxing happens on data-family function argument
#10470
Allocating StablePtrs leads to GC slowdown even after they're freed
#10434
SPECIALISE instance does not specialize as far as SPECIALISE for type signatures
#10421
exponential blowup in inlining (without INLINE pragmas)
#10417
Rule matching not "seeing through" floating and type lambda (and maybe cast)
#10401
state hack-related regression
#10371
GHC fails to inline and specialize a function
#10346
Cross-module SpecConstr
#10319
Eta expand PAPs
#10229
setThreadAffinity assumes a certain CPU virtual core layout
#10124
Simple case analyses generate too many branches
#10120
Unnecessary code duplication from case analysis
#10069
CPR related performance issue
#10062
Codegen on sequential FFI calls is not very good
#10049
Lower level memcpy primop
#10016
UNPACK support for existentials
#10012
Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure
#10005
Operations on string literals won't be inlined
#9992
Constructor specialization requires eta expansion
#9989
GHCI is slow for precompiled code
#9944
Performance issue re: simple loop
#9923
Offer copy-on-GC sliced arrays
#9809
Overwhelming the TimerManager
#9798
Frustrating behaviour of the INLINE pragma
#9792
map/coerce rule does not fire until the coercion is known
#9790
Produce coercion rules for derived Functor instances
#9786
Make quot/rem/div/mod with known divisors fast
#9701
GADTs not specialized properly
#9688
Improve the interaction between CSE and the join point transformation
#9661
Branchless ==# is compiled to branchy code
#9660
unnecessary indirect jump when returning a case scrutinee
#9659
Offer branchless conditional (CMOV) primop
#9655
Do not UNPACK strict fields that are very wide
#9646
Simplifer non-determinism leading to 8 fold difference in run time performance
#9645
Optimize range checks for primitive types
#9617
Implement `quot` and `rem` using `quotRem`; implement `div` and `mod` using `divMod`
#9601
Make the rewrite rule system more powerful
#9542
GHC-IO-Handle-Text.hPutStr' and writeBlocks look like they need refactoring
#9522
SPECIALISE pragmas for derived instances
#9447
Add support for resizing `MutableByteArray#`s
#9431
integer-gmp small Integer multiplication does two multiplications on x86
#9388
Narrow the scope of the notorious "state hack"
#9374
Investigate Static Argument Transformation
#9353
prefetch primops are not currently useful
#9350
Consider using xchg instead of mfence for CS stores
#9349
excessive inlining due to state hack
#9342
Branchless arithmetic operations
#9320
Inlining regression/strangeness in 7.8
#9289
add anyToAddr# :: (#a#)-> Addr# primop (inverse of addrToAny#)
#9279
Local wrapper function remains in final program; result = extra closure allocation
#9251
ghc does not expose branchless max/min operations as primops
#9246
GHC generates poor code for repeated uses of min/max
#9192
Add sameByteArray#
#9137
A way to match RULES only for literals
#9120
Cache intermediate powers
#9088
Per-thread Haskell thread list/numbering (remove global lock from thread allocation)
#9041
NCG generates slow loop code
#8971
Native Code Generator for 8.0.1 is not as optimized as 7.6.3...
#8955
Syscall intrinsic
#8949
switch -msse2 to be on by default
#8905
Function arguments are always spilled/reloaded if scrutinee is already in WHNF
#8903
Add dead store elimination
#8887
Double double assignment in optimized Cmm on SPARC
#8871
No-op assignment I64[BaseReg + 784] = I64[BaseReg + 784]; is generated into optimized Cmm
#8814
7.8 optimizes attoparsec improperly
#8733
I/O manager causes unnecessary syscalls in send/recv loops
#8732
Global big object heap allocator lock causes contention
#8668
SPECIALIZE silently fails to apply
#8662
GHC does not inline cheap inner loop when used in two places
#8655
Evaluate know-to-terminate-soon thunks
#8635
GHC optimisation flag ignored when importing a local module with derived type classes
#8623
Strange slowness when using async library with FFI callbacks
#8598
IO hack in demand analyzer gets in the way of CPR
#8589
Bad choice of loop breaker with INLINABLE/INLINE
#8578
Improvements to SpinLock implementation
#8457
-ffull-laziness does more harm than good
#8404
Default to turning on architecture specific optimizations in the codegen
#8354
Add INLINE (or at least INLINABLE) pragmas for methods of Ord in ghc-prim
#8336
Sinking pass could optimize some assignments better
#8327
Cmm sinking does not eliminate dead code in loops
#8326
Place heap checks common in case alternatives before the case
#8317
Optimize tagToEnum# at Core level
#8313
Poor performance of higher-order functions with unboxing
#8311
suboptimal code generated for even :: Int -> Bool by NCG (x86, x86_64)
#8279
bad alignment in code gen yields substantial perf issue
#8272
testing if SpLim=$rbp and Sp=$rsp changed performance at all
#8151
ghc-7.4.2 on OpenIndiana (Solaris) createSubprocess fails
#8048
Register spilling produces ineffecient/highly contending code
#8046
Make the timer management scale better across multicore
#8032
Worker-wrapper transform and NOINLINE trigger bad reboxing behavior
#8023
dph-examples binaries don't use all CPUs
#7977
Optimization: Shift dropped list heads by coeffecient to prevent thunk generation
#7741
Add SIMD support to x86/x86_64 NCG
#7679
Regression in -fregs-graph performance
#7647
UNPACK polymorphic fields
#7602
Threaded RTS performing badly on recent OS X (10.8?)
#7596
Opportunity to improve CSE
#7542
GHC doesn't optimize (strict) composition with id
#7511
Room for GHC runtime improvement >~5%, inlining related
#7398
RULES don't apply to a newtype constructor
#7378
Identical alts/bad divInt# code
#7374
rule not firing
#7367
float-out causes extra allocation
#7309
The Ix instance for (,) leaks space in range
#7307
Share top-level code for strings
#7300
Allow CAFs kept reachable by FFI to be forcibly made unreachable for GC
#7283
Specialise INLINE functions
#7273
Binary size increase in nofib/grep between 7.6.1 and HEAD
#7206
Implement cheap build
#7114
Cannot recover (good) inlining behaviour from 7.0.2 in 7.4.1
#7109
Inlining depends on datatype size, even with INLINE pragmas
#7080
Make RULES and SPECIALISE more consistent
#7063
Register allocators can't handle non-uniform register sets
#6092
Liberate case not happening
#6070
Fun with the demand analyser
#5928
INLINABLE fails to specialize in presence of simple wrapper
#5834
Allow both INLINE and INLINABLE for the same function
#5775
Inconsistency in demand analysis
#5645
Sharing across functions causing space leak
#5567
LLVM: Improve alias analysis / performance
#5463
SPECIALISE pragmas generated from Template Haskell are ignored
#5444
Slow 64-bit primops on 32 bit system
#5355
Link plugins against existing libHSghc
#5344
CSE should look through coercions
#5326
Polymorphic instances aren't automatically specialised
#5302
Unused arguments in join points
#5298
Inlined functions aren't fully specialised
#5262
Compiling with -O makes some expressions too lazy and causes space leaks
#5218
Add unpackCStringLen# to create Strings from string literals
#5171
Misfeature of Cmm optimiser: no way to extract a branch of expression into a separate statement
#5075
CPR optimisation for sum types if only one constructor is used
#5059
Pragma to SPECIALISE on value arguments
#4960
Better inlining test in CoreUnfold
#4945
Another SpecConstr infelicity
#4941
SpecConstr generates functions that do not use their arguments
#4937
Remove indirections caused by sum types, such as Maybe
#4833
Finding the right loop breaker
#4831
Too many specialisations in SpecConstr
#4823
Loop strength reduction for array indexing
#4470
Loop optimization: identical counters
#4301
Optimisations give bad core for foldl' (flip seq) ()
#4101
Primitive constant unfolding
#4096
New primops for indexing: index*OffAddrUsing# etc
#4081
Strict constructor fields inspected in loop
#4005
Bad behaviour in the generational GC with paraffins -O2
#3781
Improve inlining for local functions
#3767
SpecConstr for join points
#3765
Rules should "look through" case binders too
#3755
Improve join point inlining
#3744
Comparisons against minBound/maxBound not optimised for (Int|Word)(8|16|32)
#3606
The Ord instance for unboxed arrays is very inefficient
#3557
CPU Vector instructions in GHC.Prim
#3462
New codegen: allocate large objects using allocateLocal()
#3458
Allocation where none should happen
#3138
Returning a known constructor: GHC generates terrible code for cmonad
#3107
Over-eager GC when blocked on a signal in the non-threaded runtime
#3073
Avoid reconstructing dictionaries in recursive instance methods
#3061
GHC's GC default heap growth strategy is not as good as other runtimes
#3055
Int / Word / IntN / WordN are unequally optimized
#3034
divInt# floated into a position which leads to low arity
#2731
Avoid unnecessary evaluation when unpacking constructors
#2642
Improve SpecConstr for join points
#2625
Unexpected -ddump-simpl output for derived Ord instance and UNPACKed fields
#2607
Inlining defeats selector thunk optimisation
#2598
Avoid excessive specialisation in SpecConstr
#2465
Fusion of recursive functions
#2439
Missed optimisation with dictionaries and loops
#2387
Optimizer misses unboxing opportunity
#2374
MutableByteArray# is slower than Addr#
#2289
Needless reboxing of values when returning from a tight loop
#2273
inlining defeats seq
#2269
Word type to Double or Float conversions are slower than Int conversions
#2255
Improve SpecConstr for free variables
#2132
Optimise nested comparisons
#2028
STM slightly conservative on write-only transactions
#1687
A faster (^)-function.
#1600
Optimisation: CPR the results of IO
#1544
Derived Read instances for recursive datatypes with infix constructors are too inefficient
#1498
Optimisation: eliminate unnecessary heap check in recursive function
#1216
Missed opportunity for let-no-esape
#1168
Optimisation sometimes decreases sharing in IO code
#1147
Quadratic behaviour in the compacting GC
#932
Improve inlining
#917
-O introduces space leak
#855
Improvements to SpecConstr
#728
switch to compacting collection when swapping occurs
#605
Optimisation: strict enumerations
#149
missed CSE opportunity

Closed Tickets:

#15802
Inlining of constant fails when both cross-module and recursive
#15226
GHC doesn't know that seq# produces something in WHNF
#15143
Passing an IO value through several functions results in program hanging.
#15131
Speed up certain Foldable NonEmpty methods
#14978
GADTs don't seem to unpack properly
#14855
Implementation of liftA2 for Const has high arity
#14790
eqTypeRep does not inline
#14519
Exponential runtime performance regression in GHC 8.2 + Data.Text.Lazy + Text.RE.TDFA
#14336
ghci leaks memory
#14258
n-body runtime regressed badly due to CoreFVs patch
#14240
CSE’ing w/w’ed code regresses program runtime
#14224
zipWith does not inline
#14192
Change to 1TB VIRT allocation makes it impossible to core-dump Haskell programs
#14187
Transpose hangs on infinite by finite lists
#14140
Better treatment for dataToTag
#14052
Significant GHCi speed regression with :module and `let` in GHC 8.2.1
#13999
Simple function not inlined within declaration marked NOINLINE
#13982
HEAD GHC+Cabal uses too much memory
#13930
Cabal configure regresses in space/time
#13690
Running profiling tests in the GHCi way is extremely slow
#13654
Optimize casMutVar# for single-threaded runtime
#13623
join points produce bad code for stream fusion
#13604
ghci no longer loads dynamic .o files by default if they were built with -O
#13566
Bigger core size in ghc8 compared to ghc7
#13536
Program which terminates instantly in GHC 8.0.2 runs for minutes with 8.2.1
#13422
INLINE CONLIKE sometimes fails to inline
#13376
GHC fails to specialize a pair of polymorphic INLINABLE functions
#13328
Foldable, Functor, and Traversable deriving handle phantom types badly
#13288
Resident set size exceeds +RTS -M limit with large nurseries
#13246
hPutBuf issues unnecessary empty write syscalls for large writes
#13228
Surprising inlining failure
#13218
<$ is bad in derived functor instances
#13040
realToFrac into Complex Double has no specialization
#13025
Type family reduction irregularity (change from 7.10.3 to 8.0.1)
#13001
EnumFromThenTo is is not a good producer
#12996
Memory leak in recursion when switching from -O1 to -O2
#12990
Partially applied constructors with unpacked fields simplified badly
#12964
Runtime regression to RTS change
#12804
forever contains a space leak
#12781
Significantly higher allocation with INLINE vs NOINLINE
#12603
INLINE and manually inlining produce different code
#12525
Internal identifiers creeping into :show bindings
#12378
Not enough inlining happens with single-method type classes
#12354
Word foldl' isn't optimized as well as Int foldl'
#12241
Surprising constructor accumulation
#12217
PowerPC NCG: Remove TOC save for calls.
#12129
Optimize the implementation of minusInteger in the integer-gmp package
#12022
unsafeShiftL and unsafeShiftR are not marked as INLINE
#11989
Performance bug reading large-exponent float without explicit type
#11965
USE_PTHREAD_FOR_ITIMER causes unnecessary wake-ups
#11808
nofib's cryptarithm1 regresses due to deferred inlining of Int's Ord operations
#11795
Performance issues with replicateM_
#11725
Performance Regression from 7.8.3 to 7.10.3
#11710
Fusion of a simple listArray call is very fragile
#11707
Don't desugar large lists with build
#11701
ghc generates significant slower code
#11688
Bytestring break failing rewrite to breakByte and failing to eliminate boxing/unboxing
#11568
Regression in nofib/shootout/k-nucleotide
#11565
Restore code to handle '-fmax-worker-args' flag
#11533
Stack check not optimized out even if it could be
#11486
info tables are no longer aligned
#11383
CAFs lose sharing due to implicit call stacks
#11382
Optimize Data.Char
#11372
Loopification does not trigger for IO even if it could
#11365
Worse performance with -O
#11318
Data.Text.length allocates one closure per character
#11284
Lambda-lifting fails in simple Text example
#11273
PowerPC NCG: Assign all STG float and double regs to PowerPC registers
#11272
Overloaded state-monadic function is not specialised
#11116
GC reports memory in use way below the actual
#11054
GHC on Windows could not use more than 64 logical processors
#10830
maximumBy has a space leak
#10825
Poor performance of optimized code.
#10788
performance regression involving minimum
#10780
Weak reference is still alive if key is alive, but weak reference itself not reachable
#10750
silly assembly for comparing Doubles
#10744
Allow oneShot to work with unboxed types
#10720
New GHC fails to specialize imported function
#10717
fannkuch-redux allocations increase by factor of 10000 between 7.4.2 and 7.6.3
#10678
integer-gmp's runS seems unnecessarily expensive
#10677
slightly silly assembly for testing whether a Word# is 0##
#10676
silly assembly for comparing the result of comparisons that return Int# against 0#
#10649
Performance issue with unnecessary reboxing
#10457
Revise/remove custom mapM implementation for lists
#10415
ForeignPtr touched in FFI wrapper is never discarded
#10400
Run time increases by 40% in fractal plotter core loop
#10359
Tuple constraint synonym led to asymptotic performance lossage
#10291
compiling huge HashSet hogs memory
#10290
compiling huge HashSet hogs memory
#10260
last uses too much space with optimizations disabled
#10148
Optimization causes repeated computation
#10137
Rewrite switch code generation
#10129
emitCmmLitSwitch could be better
#10108
Dramatic slowdown with -O2 bytestream and list streams combined.
#10067
The Read Integer instance is too slow
#10064
Add support for "foo"## literals to MagicHash
#10060
The Traversable instance for Array looks unlikely to be good
#10034
Regression in mapM_ performance
#10014
Data.Array.Base.elems needlessly calls bounds.
#9885
ghc-pkg parser eats too much memory
#9848
List.all does not fuse
#9827
void does not use <$
#9801
Make listArray fuse
#9797
Investigate rewriting `>>=` to `*>` or `>>` for appropriate types
#9796
Implement amap/coerce rule for `Array`
#9781
Make list monad operations fuse
#9740
D380 caused fft2 regressions
#9715
The most minimal Gloss project causes the profiler to fail silently.
#9696
readRawBufferPtr and writeRawBufferPtr allocate memory
#9676
Data.List.isSuffixOf can be very inefficient
#9638
Speed up Data.Char.isDigit
#9577
String literals are wasting space
#9546
filterM is not a good consumer for list fusion
#9540
words is not a good producer; unwords is not a good consumer
#9537
concatMap is not a good producer for list fusion
#9520
Running an action twice uses much more memory than running it once
#9510
Prelude.!! is not a good consumer
#9509
No automatic specialization of inlinable imports in 7.8
#9502
mapAccumL does not participate in foldr/build fusion
#9476
Implement late lambda-lifting
#9441
CSE should deal with letrec
#9430
implement more arithmetic operations natively in the LLVM backend
#9398
Data.List.cycle is not a good producer
#9369
Data.List.unfoldr does not fuse and is not inlined.
#9356
scanl does not participate in list fusion
#9355
scanr does not participate in stream fusion
#9345
Data.List.inits is extremely slow
#9344
takeWhile does not participate in list fusion
#9343
foldl' is not a good consumer
#9339
last is not a good consumer
#9332
Memory blowing up for strict sum/strict foldl in ghci
#9326
Minor change to list comprehension structure leads to poor performance
#9291
Don't reconstruct sum types if the type subtly changes
#9234
Compiled code performance regression
#9214
UNPACK support for sum types
#9203
Perf regression in 7.8.2 relative to 7.6.3, possibly related to HashMap
#9188
quot with a power of two is not optimized to a shift
#9159
cmm case, binary search instead of jump table
#9157
cmm common block not eliminated
#9136
Constant folding in Core could be better
#9132
takeWhile&C. still not fusible
#9105
Profiling binary consumes CPU even when idle on Linux.
#9075
Per-thread weak pointer list (remove global lock on mkWeak#)
#9067
Optimize clearNursery by short-circuiting when we get to currentNursery
#9021
[CID43168] rts/linker.c has a memory leak in the dlopen/dlerror code
#8901
(very) bad inline heuristics
#8900
Strictness analysis regression
#8835
7.6.3 vs 7.8-RC performance regression
#8832
Constant-folding regression wrt `clearBit (bit 0) 0 `
#8793
Improve GHC.Event.IntTable performance
#8766
length [Integer] is twice as slow but length [Int] is 10 times faster
#8763
forM_ [1..N] does not get fused (allocates 50% more)
#8680
In STM: Variables only in left branch of orElse can invalidate the right branch transaction
#8647
Reduce allocations in `integer-gmp`
#8638
Optimize by demoting "denormalized" Integers (i.e. J# -> S#)
#8609
Clean up block allocator
#8585
Loopification should omit stack check
#8513
Parallel GC increases CPU load while slowing down program
#8508
Inlining Unsaturated Function Applications
#8472
Primitive string literals prevent optimization
#8456
Control flow optimisations duplicate blocks
#8435
Do not copy stack after stack overflow
#8345
A more efficient atomicModifyIORef'
#8321
improve basic block layout on LLVM backend by predicting stack/heap checks
#8255
GC Less Operation
#8224
Excessive system time -- new IO manager problem?
#8124
Possible leaks when using foreign export.
#8082
Ordering of assembly blocks affects performance
#8027
Adding one call to getNumCapabilities triggers performance nose dive (6X slowdown)
#7954
Strictness analysis regression
#7923
Optimization for takeMVar/putMVar when MVar left empty
#7865
SpecConstr duplicating computations
#7850
Strangely high memory usage on optimized Ackermann function
#7837
Rules involving equality constraints don't fire
#7785
Module-local function not specialized with ConstraintKinds
#7611
Rewrite rules application prevented by type variable application (map id vs. map (\x -> x))
#7561
Unnecessary Heap Allocations - Slow Performance
#7556
build/fold causes with ByteString unpack causes huge memory leak
#7460
Double literals generated bad core
#7436
Derived Foldable and Traversable instances become extremely inefficient due to eta-expansion
#7429
Unexplained performance boost with +RTS -h
#7418
Writing to stderr is 7x slower than writing to stdout
#7382
Evaluating GHCi expressions is slow following the dynamic-by-default change
#7363
runghc leaks space in IO
#7292
Optimization works for Word but not Word32 or Word64
#7284
plusAddr# x 0 isn't optimised away
#7257
Regression: pinned memory fragmentation
#7219
Reinstate constant propagation in some form
#7211
Huge space leak on a program that shouldn't leak
#7116
Missing optimisation: strength reduction of floating-point multiplication
#7091
DPH Matrix product memory usage
#7058
Add strict version of modifySTRef
#7052
Numeric types’ Read instances use exponential CPU/memory
#6166
Performance regression in mwc-random since 7.0.x
#6121
Very poor constant folding
#6111
Simple loop performance regression of 7.4.1 relative to 7.0.4
#6110
Data.Vector.Unboxed performance regression of 7.4.1 relative to 7.0.4
#6082
Program compiled with 7.4.1 runs many times slower than compiled with 7.2.2
#6056
INLINABLE pragma prevents worker-wrapper to happen.
#6000
Performance of Fibonnaci compare to Python
#5996
fix for CSE
#5991
regression: huge number of wakeups in xmonad
#5949
Demand analysis attributes manifestly wrong demand type
#5945
Lambda lifting
#5926
Add strict versions of modifyIORef and atomicModifyIORef
#5916
runST isn't free
#5888
Performance regression in 7.4.1 compared to 6.12.3
#5835
Make better use of known dictionaries
#5809
Arity analysis could be better
#5779
SPECIALISE pragma generates wrong activations
#5776
Rule matching regression
#5774
main = forever (putStrLn =<< getLine) continuously saturates a CPU when compiled
#5773
main = forever (putStrLn =<< getLine) continuously saturates a CPU when compiled
#5767
Integer inefficiencies
#5749
GHC 7.0.4 Performance Regression (Possibly Vector)
#5741
openFile should fail if null bytes are in the argument
#5731
Bad code for Double literals
#5715
Inliner fails to inline a function, causing 20x slowdown
#5623
GHC 7.2.1 Performance Regression: Vector
#5615
ghc produces poor code for `div` with constant powers of 2.
#5598
Function quotRem is inefficient
#5569
Ineffective seq/BangPatterns
#5549
~100% performance regression in HEAD compared to ghc6.12, ~22% compared to 7.0.4
#5505
Program runs faster with profiling than without
#5367
Program in (-N1) runs 10 times slower than it with two threads (-N2)
#5339
Data.Bits instances should use default shift instead of shiftL/shiftR
#5327
INLINABLE pragma and newtypes prevents inlining
#5237
Inefficient code generated for x^2
#5205
Control.Monad.forever leaks space
#5161
Poor performance of division; unnecessary branching
#5152
GHC generates poor code for large 64-bit literals
#5113
Huge performance regression of 7.0.2, 7.0.3 and HEAD over 7.0.1 and 6.12 (MonoLocalBinds)
#5034
Performance of Data.Graph.{preorderF, postorderF}
#5000
Eliminate absent arguments in non-strict positions
#4986
negative Double numbers print out all wrong
#4965
60% performance regression in continuation-heavy code between 6.12 and 7
#4962
Dead code fed to CorePrep because RULEs keep it alive spuriously
#4951
Performance regression 7.0.1 -> 7.0.1.20110201
#4943
Another odd missed SpecConstr opportunity
#4930
Case-of-case not eliminated when it could be
#4908
Easy SpecConstr opportunity that is nonetheless missed
#4495
GHC fails to inline methods of single-method classes
#4474
3 ways to write a function (unexpected performance difference and regression)
#4463
CORE notes break optimisation
#4448
Another case of SpecConstr not specialising
#4442
Add unaligned version of indexWordArray#
#4431
SpecConstr doesn't specialise
#4428
Local functions lose their unfoldings
#4397
RULES for Class ops don't fire in HEAD
#4365
Error handle in readProcess not closed
#4344
Better toRational for Float and Double
#4337
Better power for Rational
#4322
High CPU usage during idle time due to GC
#4306
UNPACK can lead to unnecessary copying and wasted stack space
#4285
STM bug on Windows?
#4280
Proposal: Performance improvements for Data.Set
#4279
Proposal: Performance improvements for Data.IntMap
#4278
Proposal: Add strict versions of foldlWithKey and insertLookupWithKey to Data.Map
#4277
Proposal: Significant performance improvements for Data.Map
#4276
-O0 runs in constant space, -O1 and -O2 don't
#4262
GHC's runtime never terminates unused worker threads
#4223
LLVM slower then NCG, C example
#4184
Squirrelly inliner behaviour leads to 80x slowdown
#4138
Performance regression in overloading
#4120
Iface type variable out of scope in cast
#4065
Inconsistent loop performance
#4064
SpecConstr broken for NOINLINE loops in 6.13
#4062
Bad choice of loop breaker?
#4021
Problem of Interaction Between the FreeBSD Kernel and the GHC RTS
#4018
Concurrency space leak
#4007
Look again at eta expansion during gentle simplification
#4004
Improve performance of a few functions in Foreign.Marshal.*
#3990
UNPACK doesn't unbox data families
#3969
Poor performance of generated code on x86.
#3938
Data growth issue in System.Timeout
#3838
Performance issues with blackholes
#3772
Methods not inlined
#3738
Typechecker floats stuff out of INLINE right hand sides
#3737
inlining happens on foldl1 and does not happen on direct application of combinator
#3736
GHC specialising instead of inlining
#3735
GHC specialising instead of inlining
#3717
Superfluous seq no eliminated
#3709
Data.Either.partitionEithers is not lazy enough
#3698
Bad code generated for zip/filter/filter loop
#3697
Method selectors aren't floated out of loops
#3655
Performance regression relative to 6.10
#3627
Profiling loses eta-expansion opportunities unnecessarily
#3586
Initialisation of unboxed arrays is too slow
#3526
Inliner behaviour with instances is confusing
#3518
GHC GC rises greatly on -N8 compared to -N7
#3501
Error thunks not being exposed with "B" strictness
#3437
Optimizer creates space leak on simple code
#3349
poor responsiveness of ghci
#3331
control-monad-queue performance regression
#3273
memory leak due to optimisation
#3264
Real World Haskell book example issue
#3245
Quadratic slowdown in Data.Typeable
#3181
Regression in unboxing
#3123
make INLINE work for recursive definitions (generalized loop peeling/loop unrolling)
#3116
missed opportunity for call-pattern specialisation
#3076
Make genericLength tail-recursive so it doesn't overflow stack
#3065
Reorder tests in quot to improve code
#2940
Do CSE after CorePrep
#2915
Arity is smaller than need be
#2902
Example where ghc 6.10.1 fails to optimize recursive instance function calls
#2884
Compiled code performance worsens when module names are long enough
#2840
Top level string literals
#2831
Floated error expressions get poor strictness, leaving bad arity
#2823
Another arity expansion bug
#2822
Arity expansion not working right
#2797
ghci stack overflows when ghc does not
#2785
Memory leakage with socket benchmark program
#2727
DiffArray performance unusable for advertized purpose
#2720
eyeball/inline1 still isn't optimised with -fno-method-sharing
#2712
Parallel GC scheduling problems
#2581
Record selectors not being inlined
#2463
unsafePerformIO in unused record field affects optimisations
#2462
Data.List.sum is slower than 6.8.3
#2450
Data.Complex.magnitude squares using ^(2 :: Int), which is slow
#2440
Bad code with type families
#2396
Default class method not inlined
#2329
Control.Parallel.Strategies: definitions of rnf for most collections are poor
#2325
Compile-time computations
#2280
randomR too slow
#2253
Native code generator could do better
#2236
Deep stacks make execution time go through the roof
#2185
Memory leak with parMap
#2163
GHC makes thunks for Integers we are strict in
#2105
garbage collection confusing in ghci for foreign objects
#2092
Quadratic amount of code generated
#2078
INLINE and strictness
#1890
Regression in mandelbrot benchmark due to inlining
#1889
Regression in concurrency performance from ghc 6.6 to 6.8
#1818
Code size increase vs. 6.6.1
#1752
CSE can create space leaks by increasing sharing
#1607
seq can make code slower
#1434
Missing RULEs for truncate
#1117
[2,4..10] is not a good list producer
#955
more object-code blow-up in ghc-6.8.3 vs. ghc-6.4.2 (both with optimization)
#876
Length is not a good consumer
#783
SRTs bigger than they should be?
#650
Improve interaction between mutable arrays and GC
#635
Replace use of select() in the I/O manager with epoll/kqueue/etc.
#594
Support use of SSE2 in the x86 native code genreator
#427
Random.StdGen slowness

Nofib results

Austin, 5 May 2015

Full results are here (updated May 5th, 2015)

NB: The baseline here is 7.6.3

Ben, 31 July 2015

http://home.smart-cactus.org/~ben/nofib.html

Baseline is 7.4.2.

Nofib outliers

Binary sizes

7.6 to 7.8
  • Solid average binary size increase of 5.3%.

Allocations

7.4 to 7.6
  • fannkuch-redux: increased by a factor of more than 10,000 (74944 bytes in 7.4.2 vs 870987952 bytes in 7.6.3, roughly 11,600x)!
    • 7.6.3: <<ghc: 870987952 bytes, 1668 GCs (1666 + 2), 0/0 avg/max bytes residency (0 samples), 84640 bytes GC work, 1M in use, 0.00 INIT (0.00 elapsed), 2.43 MUT (2.43 elapsed), 0.00 GC (0.00 elapsed), 0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>
    • 7.4.2: <<ghc: 74944 bytes, 1 GCs (0 + 1), 0/0 avg/max bytes residency (0 samples), 3512 bytes GC work, 1M in use, 0.00 INIT (0.00 elapsed), 2.25 MUT (2.25 elapsed), 0.00 GC (0.00 elapsed), 0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>
    • According to [FoldrBuildNotes] this test is very sensitive to fusion
    • Filed #10717 to track this.
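
The ratio between the two stats lines above can be checked mechanically. The following is a minimal sketch, assuming the `<<ghc: N bytes, ... :ghc>>` layout shown in those lines; the names `allocatedBytes` and `allocationRatio` are hypothetical helpers for illustration, not part of nofib, and the sample strings are abbreviated copies of the lines quoted above.

```haskell
import Data.Char (isDigit)
import Data.List (isPrefixOf, stripPrefix)

-- | Extract the "bytes allocated" figure from a line of the form
-- "<<ghc: N bytes, ... :ghc>>". Returns Nothing if the line does not match.
allocatedBytes :: String -> Maybe Integer
allocatedBytes line = do
  rest <- stripPrefix "<<ghc: " line
  let (digits, after) = span isDigit rest
  if not (null digits) && " bytes" `isPrefixOf` after
    then Just (read digits)
    else Nothing

-- | Allocation ratio new/old between two stats lines.
-- Returns Nothing if either line fails to parse or the old figure is zero.
allocationRatio :: String -> String -> Maybe Double
allocationRatio old new = do
  o <- allocatedBytes old
  n <- allocatedBytes new
  if o == 0 then Nothing else Just (fromIntegral n / fromIntegral o)

main :: IO ()
main = do
  -- Abbreviated copies of the fannkuch-redux lines (remaining fields elided).
  let ghc742 = "<<ghc: 74944 bytes, 1 GCs (0 + 1) :ghc>>"
      ghc763 = "<<ghc: 870987952 bytes, 1668 GCs (1666 + 2) :ghc>>"
  print (allocationRatio ghc742 ghc763)
```

Only the leading allocation figure is parsed; the GC counts and timings in the rest of the line are ignored.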
7.6 to 7.8
  • spectral-norm: increases by 17.0%.
    • A lot more calls to map, over 100 more! Maybe inliner failure?
    • Over twice as many calls to ghc-prim:GHC.Classes.$fEqChar_$c=={v r90O} (and similar functions). Also over twice as many calls to elem.
    • Similarly, many more calls to other specializations, like base:Text.ParserCombinators.ReadP.$fMonadPlusP_$cmplus{v r1sr}, which adds even more allocations (from 301 to 3928 for this one entry!)
    • Basically the same story up to HEAD!
7.8 to 7.10
  • gcd: increases by 20.7%.
    • Ticky tells us that this seems to be a combination of a few things. Most counters look fairly similar, but 7.10 shows a large amount of extra allocation whose origin I can't pin down, apart from the new integer-gmp: integer-gmp-1.0.0.0:GHC.Integer.Type.$WS#{v rwl} accounts for 106696208 extra bytes of allocation! There also seem to be actual extant calls to GHC.Base.map in 7.10, and none in 7.8. These are the main differences.
  • pidigits: increases by 7.4%.
    • Ticky tells us that this seems to be due, in large part, to integer-gmp (which is mostly what this test benchmarks anyway). I think part of this is actually a measurement artifact: the old integer-gmp did a lot of its work in C-- code or whatnot, while the new integer-gmp does everything in Haskell, so a lot more Haskell code shows up in the profile and the results aren't 1-to-1. It also seems that a lot more specializations are being called repeatedly: there are many occurrences of things like sat_sad2{v} (integer-gmp-1.0.0.0:GHC.Integer.Type) in rfK which don't exist in the 7.8 profiles, each with a lot of entries and allocations.
  • primetest: went down 27.5% in 7.6-to-7.8, but 8.8% slower than 7.6 now - in total it got something like 36.6% worse.
    • Much like pidigits, a lot more integer-gmp stuff shows up in these profiles. While it's still just like the last one, there are some other regressions; for example, GHC.Integer.Type.remInteger seems to have 245901/260800 calls/bytes allocated, vs 121001/200000 for 7.8
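
The primetest figures above mix two ways of expressing a change. As a hedged illustration, assuming both percentages (-27.5% for 7.8 and +8.8% for 7.10) are measured against the same 7.6 baseline, the sketch below contrasts the difference in percentage points with the change measured relative to 7.8. The function name `relativeChange` is hypothetical.

```haskell
-- | Change (in percent) from one release to another, given each release's
-- change relative to a shared baseline (also in percent).
relativeChange :: Double -> Double -> Double
relativeChange fromPct toPct =
  ((1 + toPct / 100) / (1 + fromPct / 100) - 1) * 100

main :: IO ()
main = do
  -- Assumption: 7.8 sits at -27.5% and 7.10 at +8.8%, both relative to 7.6.
  let pct78  = -27.5
      pct710 = 8.8
  putStrLn ("difference in percentage points: " ++ show (pct710 - pct78))
  putStrLn ("change relative to 7.8 (%):      " ++ show (relativeChange pct78 pct710))
```

Under that assumption the percentage-point difference is close to the "something like 36.6%" quoted above, while the change measured against 7.8 itself comes out at roughly 50%.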

TODO Lots of fusion changes have happened in the last few months too - but these should all be pretty diagnosable with some reverts, since they're usually very localized. Maybe worth looking through base changes.

Runtime

7.6 to 7.8
  • lcss: increases by 12.6%.
    • Ticky says it seems to be map calls yet again! These jump hugely here from 21014 to 81002.
    • Another inner loop around algb also looks like it gets called a huge number of times more: algb2 is called 2001056 times vs 7984760 times!
      • The same goes for algb and algb1, which seem to be called more often too.
    • Some other similar things; a few regressions in the # of calls to things like Text.ParserCombinators.ReadP specializations, I think.
    • Same story with HEAD!
7.8 to 7.10
  • lcss: decreased by ~5% in 7.10, but still 7% slower than 7.6.
    • See above for real regressions.
  • multiplier: increases by 7.6%.
    • map strikes again? 2601324 vs 3597333 calls, with an accompanying allocation delta.
    • But some other inner loops here work and go away correctly (mainly go), unlike e.g. lcss.

Comparing integer-gmp 0.5 and 1.0

One of the major factors that has changed recently is integer-gmp. Namely, GHC 7.10 includes integer-gmp-1.0, a major rework of integer-gmp-0.5. I've compiled GHC 7.10.1 with integer-gmp 0.5 and 1.0. Here is a nofib comparison. There are a few interesting points here:

  • Binary sizes dropped dramatically and consistently (typically around 60 to 70%) from 0.5 to 1.0.
  • Runtime is almost always within error. A few exceptions:
    • binary-trees: 6% slower with 1.0
    • pidigits: 5% slower
    • integer: 4% slower
    • cryptarithm1: 2.5% slower
    • circsim: 3% faster
    • lcss: 5% faster
    • power: 17% faster
  • Allocations are typically similar. The only test that improves significantly is prime, whose allocations decreased by 24%. Many more tests regress considerably:
    • bernoulli: +15%
    • gcd: +21%
    • kahan: +40%
    • mandel: +34%
    • primetest: +50%
    • rsa: +53%

The allocation issue is actually discussed in the commit message (c774b28f76ee4c220f7c1c9fd81585e0e3af0e8a):

Due to the different (over)allocation scheme and potentially different accounting (via the new {shrink,resize}MutableByteArray# primitives), some of the nofib benchmarks actually results in increased allocation numbers (but not necessarily an increase in runtime!). I believe the allocation numbers could improve if {resize,shrink}MutableByteArray# could be optimised to reallocate in-place more efficiently.

The message then goes on to list exactly the nofib tests mentioned above. Given that there isn't a strong negative trend in runtime corresponding with these increased allocations, I'm leaning towards ignoring these for now.
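
For readers unfamiliar with the primitives named in the commit message, here is a minimal sketch of the over-allocate-then-shrink pattern it alludes to. It uses `newByteArray#`, `shrinkMutableByteArray#` and `getSizeofMutableByteArray#` from `GHC.Exts` (available in GHC 7.10 and later); the wrapper `overAllocateThenShrink` is a hypothetical name for illustration and is not taken from integer-gmp.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts
  ( Int (I#)
  , getSizeofMutableByteArray#
  , newByteArray#
  , shrinkMutableByteArray#
  )
import GHC.IO (IO (IO))

-- | Allocate a buffer of `cap` bytes, shrink it in place to `used` bytes,
-- and return the size the runtime reports afterwards.
-- Precondition (not checked here): 0 <= used <= cap.
overAllocateThenShrink :: Int -> Int -> IO Int
overAllocateThenShrink (I# cap) (I# used) = IO $ \s0 ->
  case newByteArray# cap s0 of
    (# s1, mba #) ->
      -- Shrinking does not copy: the same array is kept with a smaller size.
      case shrinkMutableByteArray# mba used s1 of
        s2 ->
          case getSizeofMutableByteArray# mba s2 of
            (# s3, n #) -> (# s3, I# n #)

main :: IO ()
main = overAllocateThenShrink 64 16 >>= print
```

The full `cap` bytes are still what gets allocated up front, which is one way allocation counters can rise under such a scheme even when runtime does not.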

Last modified on Apr 4, 2018 9:42:28 AM