Very large slowdown when using parallel garbage collector
While debugging some performance issues in an application I am writing, I concluded that the problem lies in the parallel GC implemented in the GHC RTS. I extracted the attached code into a self-contained test case. On my system it runs in 16s with a single thread, in 18s with 6 threads and parallel GC disabled, and in over a minute with 6 threads and parallel GC enabled!
The slowdown in the full code is actually worse, and it matters for the application (some steps take >1 hour instead of <1 minute!). Parts of the code do take full advantage of parallel processing; this is just one simple test case.
The problem seems worse on some machines than on others, and the input file (data.txt) needs to be quite large for it to really show up. The attached script generates an input file of size 16 million, which is still smaller than some of my real use cases, but I couldn't trigger the problem with only 1 million. Similarly, with 4 threads the slowdown is detectable, but not as large.
While the program is running, CPU usage is very high (I tested with 16 threads and it used 16 CPUs continuously; top reported 1600% CPU).
Using '+RTS -A64m' is another way around the issue (besides disabling parallel GC with '+RTS -qg'), but for the full application it is still not as effective as '+RTS -qg', so there still seems to be a performance issue here.
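For anyone trying to reproduce this, the configurations compared above can be run side by side with a small script along these lines. This is only a sketch: the binary name ./gc-test is a placeholder (not the name of the attached code), and it assumes the program was built with -threaded and -rtsopts so that the RTS flags are accepted. Adding -s makes the RTS print a summary that includes GC time and mutator time, which should make the difference between the configurations visible.

```shell
#!/bin/sh
# BIN is a hypothetical name for the compiled test program; build it with
# something like: ghc -O2 -threaded -rtsopts test.hs
BIN="${BIN:-./gc-test}"

# Print the command for one RTS configuration and run it if the binary exists.
# -s asks the RTS for a summary that includes GC and mutator time.
run_config() {
    label="$1"; shift
    echo "== $label: $BIN +RTS $* -s -RTS"
    if [ -x "$BIN" ]; then
        "$BIN" +RTS "$@" -s -RTS
    fi
}

run_config "1 thread"                   -N1
run_config "6 threads, no parallel GC"  -N6 -qg
run_config "6 threads, parallel GC"     -N6
run_config "6 threads, 64 MB nursery"   -N6 -A64m
```

Comparing the GC time in the -s summary of the third run against the second should show whether the extra wall-clock time is spent in the collector.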