newtype wrapping of a monadic stack kills performance
I have a project where, in one module (A), I decided to build something like a minimal framework for the other, bigger module (B) to use. One of the main points of the framework is that the monadic stacks are hidden behind `newtype`s, and inside those monads you can only use the functions that module A provides.
Module A can be found here:
https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser/Internal.hs
There are two monadic stacks wrapped in newtypes: `BParser` and `IParser`.
Module B is this:
https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser.hs
but it's not really relevant to the issue.
Now if I change the `newtype`s to `type` synonyms like so:

```haskell
type BParser a = ParsecT MMarkErr Text (State BlockState) a
type IParser a = StateT InlineState (Parsec MMarkErr Text) a
```

and make the corresponding minor corrections, I get half the allocations and almost 2x faster code. Before:
```
Case                                           Allocated  GCs     Max
with file: data/bench-yaml-block.md              119,080    0  11,088
with file: data/bench-thematic-break.md           74,368    0  10,224
with file: data/bench-heading.md                 901,928    0   9,432
with file: data/bench-fenced-code-block.md       145,744    0   9,368
with file: data/bench-indented-code-block.md     124,312    0   9,368
with file: data/bench-unordered-list.md        2,010,496    1  10,784
with file: data/bench-ordered-list.md          2,025,016    1  10,728
with file: data/bench-blockquote.md            1,961,672    1  42,648
with file: data/bench-paragraph.md             2,084,104    1  42,648
with file: data/comprehensive.md              25,899,496   24  44,200
```

```
benchmarking with file: data/comprehensive.md
time                 3.590 ms   (3.578 ms .. 3.601 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.555 ms   (3.546 ms .. 3.565 ms)
std dev              31.07 μs   (24.63 μs .. 39.90 μs)
```
After:
```
Case                                           Allocated  GCs     Max
with file: data/bench-yaml-block.md              116,864    0  11,088
with file: data/bench-thematic-break.md           64,776    0  10,392
with file: data/bench-heading.md                 615,672    0   9,432
with file: data/bench-fenced-code-block.md       144,736    0   9,368
with file: data/bench-indented-code-block.md     123,352    0   9,368
with file: data/bench-unordered-list.md          795,072    0  41,568
with file: data/bench-ordered-list.md            808,808    0  41,512
with file: data/bench-blockquote.md              826,392    0  41,568
with file: data/bench-paragraph.md               881,432    0  41,568
with file: data/comprehensive.md              10,945,104   10  44,440
```

```
benchmarking with file: data/comprehensive.md
time                 2.451 ms   (2.448 ms .. 2.456 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.432 ms   (2.427 ms .. 2.437 ms)
std dev              15.86 μs   (13.16 μs .. 19.01 μs)
```
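To give a self-contained sketch of the general shape of the change (the `Counter` stack and all names below are invented for illustration; they are not from mmark, whose real stacks live in the linked module):

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.Trans.State (State, evalState, get, modify)

-- Type-synonym version: fully transparent, GHC (and every
-- importing module) sees the whole underlying stack.
type CounterS a = State Int a

-- Newtype version: opaque outside the defining module; callers
-- can only use the operations exported alongside it.
newtype CounterN a = CounterN (State Int a)
  deriving (Functor, Applicative, Monad)

tick :: CounterN ()
tick = CounterN (modify (+ 1))

runCounterN :: CounterN a -> Int -> a
runCounterN (CounterN m) = evalState m

main :: IO ()
main = print (runCounterN (tick >> tick >> CounterN get) 0) -- prints 2
```

Only the newtype version hides the stack; the synonym version lets users reach for any `ParsecT`/`State` operation directly, which is exactly the encapsulation the framework is trying to prevent.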
I'm not great at reading non-trivial Core, but I gave it a shot and dumped some. One thing I noticed is that the Core for module A is the same in both cases. So the problem is an inter-module one, probably similar to when you don't dump definitions into interface files with `INLINEABLE` and end up without specialization, but here it perhaps has to do with the fact that B has no information about the internals of the monadic stacks from A. I'm not sure.
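For comparison, the usual fix for the interface-file problem mentioned above is an `INLINEABLE` pragma in the defining module, which forces the unfolding into the `.hi` file so importing modules can specialize the function to concrete types (a generic sketch, not code from mmark):

```haskell
-- In the defining module: the pragma puts the unfolding of
-- sumPoly into the interface file, so a module that imports it
-- can specialize the Num dictionary away at, e.g., Int.
{-# INLINEABLE sumPoly #-}
sumPoly :: Num a => [a] -> a
sumPoly = foldr (+) 0

main :: IO ()
main = print (sumPoly [1 :: Int, 2, 3]) -- prints 6
```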
Here are the beginnings of the Core dumps (before):
```
==================== Tidy Core ====================
2017-12-23 04:55:17.967944505 UTC

Result size of Tidy Core
  = {terms: 210,169,
     types: 209,675,
     coercions: 82,492,
     joins: 973/3,753}
```
After:
```
==================== Tidy Core ====================
2017-12-23 04:58:46.386265108 UTC

Result size of Tidy Core
  = {terms: 301,256,
     types: 263,104,
     coercions: 28,560,
     joins: 1,726/5,393}
```
So it looks like `newtype` wrapping reduces GHC's ability to create the join points SPJ talked about recently.
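The much larger `coercions` count in the newtype build is consistent with that: a newtype is free at runtime, but every boundary GHC crosses leaves a coercion term in Core, and those terms are what the Tidy Core stats count. A minimal illustration (the `Wrapped` type is invented for the example):

```haskell
import Data.Coerce (coerce)

newtype Wrapped a = Wrapped [a]

-- coerce compiles to a zero-cost cast, but each use is recorded
-- as a coercion in the Core program.
unwrapAll :: [Wrapped a] -> [[a]]
unwrapAll = coerce

main :: IO ()
main = print (unwrapAll [Wrapped [1 :: Int], Wrapped [2, 3]]) -- prints [[1],[2,3]]
```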
I do understand that you expect a minimal reproducing example, but I could not produce one even though I spent several hours in futile attempts. I started by creating a small project, defining a similar stack with a `newtype` wrapper, and using it in a simple parser in another module. Then I benchmarked the parser. There was no difference between the `newtype`d code and the code that uses plain type synonyms. I tried different variations; still no difference.
The Core output is too big for me to analyze: it's around 25 MB and 33 MB respectively, and I have no idea what's going on in there.
You're welcome to experiment with the repo; there are benchmarks for memory usage and Criterion benchmarks for execution speed:
https://github.com/mrkkrp/mmark
I have created two branches that I won't touch, one with newtypes and another with type synonyms:
- https://github.com/mrkkrp/mmark/tree/ghc-bug-newtypes
- https://github.com/mrkkrp/mmark/tree/ghc-bug-type-synonyms
Just check out one of these and run `stack bench`.
This is the commit that changes newtypes to type synonyms:
https://github.com/mrkkrp/mmark/commit/759d8d4aa52dd57a393299c63e8c9b70d0d43290
I'm submitting this because my friend convinced me that it's better to let you know, even though I could not create a minimal reproducing example on my own, than to let it go completely unnoticed.