Bad alignment in code gen yields substantial perf issues

Independently, several people have noticed that GHC currently has quite a few different memory-alignment-related performance problems, which can have a >= 10% perf impact!

Nicolas Frisby notes:

On my laptop, a program showed a consistent slowdown with -fdicts-strict

I didn't find any obvious causes in the Core differences, so I turned to Intel's 
Performance Counter Monitor for measurements. After trying a few counters, I eventually 
saw that there are about an order of magnitude more misaligned memory loads with 
-fdicts-strict than without, so I think that may be a significant part of the slowdown.
I'm not sure if these are code or data reads.

Can anyone suggest how to validate this hypothesis about misaligned reads?

A subsequent commit has changed the behavior I was seeing, so I'm not interested 
in alternative means to determine if -fdicts-strict is somehow at fault; I'm just 
asking specifically about data/code memory alignment in GHC and how to 
diagnose/experiment with it.

Reid Barton has independently noted:

so I did a nofib run with llvm libraries, ghc quickbuild

so there's this really simple benchmark tak,
https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
it doesn't use any libraries at all in the main loop because the Ints all get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?

[14:36:30] <carter> could you post the asm it generates for that function?
[14:36:49] <rwbarton> well it's identical between the two versions
<rwbarton> but they get linked at different offsets because some llvm sections are different sizes
<rwbarton> if I add a 128-byte symbol to the .text section to move it to the same address... then the llvm libs version is just as fast
<rwbarton> well, apparently 404000 is good and 403f70 is bad
<rwbarton> I guess I can test other alignments easily enough
<rwbarton> I imagine it wants to start on a cache line
<rwbarton> but I don't know if it's just a coincidence that it worked with the ncg libraries
<rwbarton> that it got a good location

<rwbarton> for this program every 32-byte aligned address is 10+% faster than any merely 16-byte aligned address

<rwbarton> and by alignment I mean alignment of the start of the Haskell code section
<carter> haswell, sandybridge, ivy bridge, other?
<rwbarton> dunno
<rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
<rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
<carter> ok
<rwbarton> trying a patch now that aligns all *_entry symbols to 32 bytes

The key point in there is that on the tak benchmark, better alignment of the code made a 10% perf difference on Core 2 and Opteron CPUs!
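As a quick sanity check on the two addresses quoted in the chat log, the sketch below computes the largest power of two dividing each one. The helper names `alignment_of` and `is_aligned` are made up for this illustration and are not part of GHC or any tool mentioned above; the addresses are read as hexadecimal, which is an assumption based on how they are written.

```python
def alignment_of(addr: int) -> int:
    """Return the largest power of two that divides addr (addr must be > 0)."""
    if addr <= 0:
        raise ValueError("address must be positive")
    # addr & -addr isolates the lowest set bit of addr.
    return addr & -addr


def is_aligned(addr: int, boundary: int) -> bool:
    """True if addr is a multiple of boundary."""
    return addr % boundary == 0


# Addresses taken from the chat log, assumed to be hexadecimal.
for addr in (0x404000, 0x403F70):
    print(f"{addr:#x}: aligned to {alignment_of(addr)} bytes, "
          f"32-byte aligned: {is_aligned(addr, 32)}")
```

Under that reading, 404000 is aligned far beyond 32 bytes while 403f70 is only 16-byte aligned, which fits the observation that 32-byte aligned starts were faster than merely 16-byte aligned ones.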

Benjamin Scarlet and Luite are speculating that this may be made worse by Tables Next to Code (TNC) accidentally creating bad alignment, so that there's cache line pollution / conflicts between the L1 instruction cache and data cache. So one experiment would be to have the TNC transform pad after the table so that the function entry point starts on the next cache line.
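The arithmetic behind that padding experiment can be sketched as follows. This only illustrates the address calculation, not GHC's code generator: the 64-byte cache line size, the 24-byte table size, and the helper names `padding_to_boundary` and `place_entry` are all assumptions made for the example.

```python
CACHE_LINE = 64  # assumed cache line size in bytes; varies by CPU


def padding_to_boundary(offset: int, boundary: int = CACHE_LINE) -> int:
    """Bytes needed after offset to reach the next multiple of boundary
    (0 if offset is already on a boundary)."""
    return -offset % boundary


def place_entry(table_start: int, table_size: int,
                boundary: int = CACHE_LINE) -> tuple[int, int]:
    """Given where a table starts and how big it is, return
    (padding bytes, entry address) so the entry code starts on a boundary."""
    table_end = table_start + table_size
    pad = padding_to_boundary(table_end, boundary)
    return pad, table_end + pad


# Hypothetical example: a 24-byte table starting at the "bad" address.
pad, entry = place_entry(0x403F70, 24)
print(f"pad {pad} bytes, entry at {entry:#x}")
```

Whether padding of this kind is compatible with how the table is located relative to the entry point is a separate question that this sketch does not address.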

Edited Mar 09, 2019 by Jan Stolarek
