Eliminate redundant heap allocations/deallocations
The code generated by GHC sometimes places the heap allocation and deallocation in a common path that has multiple sub-branches inside it. Some of those branches may not need the heap at all, yet they still perform the heap check, allocation, and deallocation without ever using the space. This adds unnecessary overhead, and the overhead becomes significant when the loop is small and runs very frequently.
Instead of putting the heap allocation code in a higher-level branch, we can move it into the more specific branches where it is really needed, so that it stays out of the paths that do not need it.
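To make the shape of the problem concrete, here is a minimal CMM-style sketch of the current layout. This is only a schematic, not taken from the attached dumps: the block labels, the 48-byte size, and the field write are illustrative placeholders.

    c_common:                       // common block: heap check covers all successors
        Hp = Hp + 48;
        if (Hp > HpLim) goto c_gc;  // heap check runs on every iteration
        goto c_branch;
    c_branch:
        if (<rare condition>) goto c_alloc; else goto c_noalloc;
    c_alloc:                        // rarely taken; actually uses the 48 bytes
        I64[Hp - 40] = <closure header>;
        ...
    c_noalloc:                      // fast path, taken almost all the time
        Hp = Hp - 48;               // hand the unused space back (deallocate)
        ...
    c_gc:
        // garbage-collector call (details omitted)

The fast path pays for the bump, the check, and the hand-back even though it never touches the allocated space.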
I am attaching a specific example of such a case. The code used here is in a public GitHub repository [2]. It should be quite easy to play with: detailed instructions to reproduce and investigate the issue are included along with the code.
The files referenced in the following text are attached to this ticket; they are available in the GitHub repo as well. Look at the CMM traces in decompose-loop-execution-trace.cmm and decompose-loop-full.cmm, and start at label c4ic, where we allocate space on the heap (+48). There are multiple possible paths from this point onwards; some of them use the heap and some do not. Some interesting paths are:
1) c4ic (allocate) -> c4mw -> {c4pv} -> ...
2) c4ic (allocate) -> c4mw -> c4pw -> ((c4pr -> ({c4pe} -> ... | c4ph -> ...)) | c4ps -> ...)
I have put the labels which use the heap inside curly braces; the rest do not use it at all. The paths which actually need the heap are rarely executed in this case, but because of the placement the heap is always allocated and deallocated on the fast path, which is executed almost all the time.
If we can place this allocation at both c4pv and c4pe instead of in the common parent, we can spare the fast path this check. The same applies to the allocation at label c4jd as well.
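Schematically, the proposed placement would look like the following (same hypothetical labels and sizes as in the sketch above). The heap check is duplicated into each branch that actually allocates, and the fast path carries no heap adjustment at all:

    c_common:                       // no Hp arithmetic, no heap check here
        goto c_branch;
    c_branch:
        if (<rare condition>) goto c_alloc; else goto c_noalloc;
    c_alloc:                        // allocate only where the space is used
        Hp = Hp + 48;
        if (Hp > HpLim) goto c_gc;
        I64[Hp - 40] = <closure header>;
        ...
    c_noalloc:                      // fast path: nothing heap-related left
        ...
    c_gc:
        // garbage-collector call (details omitted)

The cost is some code duplication when several branches allocate (here c4pv and c4pe would each get their own check), which seems a good trade when the allocating branches are cold.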
The whole loop consists of 51 instructions (ghc-8.0.1), and 8 of them are heap-adjustment related, which comes to almost 16%. If I can get this back, the performance of this loop will get quite close to the gcc-compiled code, which is currently around 30% faster.
This is a CMM-level optimization which will benefit all architectures. I also noticed that LLVM removed the heap adjustments, though not the checks.
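For what it is worth, here is a rough sketch of what that observation describes; this is not taken from actual LLVM output, only an illustration of "adjustments gone, check kept". With the Hp bump and hand-back removed, the comparison presumably gets folded against the un-bumped Hp:

    c_noalloc_fastpath:
        if (Hp + 48 > HpLim) goto c_gc;  // the check survives
        ...                              // but no Hp = Hp + 48 / Hp = Hp - 48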
References
- [1] Mail thread: https://mail.haskell.org/pipermail/ghc-devs/2016-June/012275.html
- [2] GitHub repo with code, traces, and reproduction details (see the GHC_PERF_ISSUE directory): https://github.com/harendra-kumar/unicode-transforms/tree/ghc-trac-12231