Opened 4 years ago

Last modified 2 years ago

#10062 new bug

Codegen on sequential FFI calls is not very good

Reported by: chadaustin Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.8.3
Keywords: Cc: simonmar, shumovichy@…, bgamari, tjakway
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.

Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46

It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:

addq $8,%rsp
movq %rbx,%rdi
movq 72(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

addq $8,%rsp
movq %rbx,%rdi
movl $35,%esi
subq $8,%rsp
movl $0,%eax
call bw_append_byte

addq $8,%rsp
movq %rbx,%rdi
movq 64(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

I don't know why rsp is being changed so much. I also can't explain the assignment to eax before the call. (It should also be xorl eax,eax, I would think.)

To my reading, the above instruction sequence could be reduced to:

movq %rbx,%rdi
movq 64(%rsp),%rsi
call bw_append_bsz

movq %rbx,%rdi
movl $35,%esi
call bw_append_byte

movq %rbx,%rdi
movq 56(%rsp),%rsi
call bw_append_bsz

To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be

cabal configure --enable-benchmarks
cabal bench

And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm file.

This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.

Change History (8)

comment:1 Changed 4 years ago by Yuras

I don't know why rsp is being changed so much.

It is necessary because ccall calling convention requires alignment that is different from the one used in ghc, see here for details. Though all this addq/subq probably can be combined... IIRC we don't have optimization pass for asm (?)

I also can't explain the assignment to eax before the call

It is required by calling convention too.

So, the generated code is pretty reasonable, but there definitely is space for improvements.

comment:2 Changed 4 years ago by Yuras

Cc: shumovichy@… added

comment:3 Changed 4 years ago by chadaustin

Wow! Thanks for the fast response! The ABI constraints make sense. In light of them, then yes, the improvements would be:

  • Change "movl $0,%eax" to "xorq %rax,%rax" (or maybe "movb $0,%al", though I don't know whether a smaller instruction is more important than breaking the data dependency on rax)
  • Merge the rsp additions and subtractions between FFI calls
  • Eliminate the extra move in "movq 72(%rsp),%rax; movq %rax,%rsi"

comment:4 Changed 4 years ago by svenpanne

A quick remark regarding "xorq %rax,%rax" vs. "movb $0,%al": I would recommend the xorq variant, we had very bad experiences with data dependencies between partially written registers in the v8 JavaScript JIT. Intel CPUs have funny performance characteristics, and these vary vastly between architectures and even releases of the same architecture. :-/ No need to repeat the pain in GHC...

comment:5 Changed 4 years ago by rwbarton

Change "movl $0,%eax" to "xorq %rax,%rax" (or maybe "movb $0,%al", though I don't know whether a smaller instruction is more important than breaking the data dependency on rax)

This one is already implemented in GHC 7.10. (xor %eax,%eax is the Officially Correct instruction to use: it is the smallest instruction to clear a register and it has special processor support to break dependency chains, even though it looks like it doesn't).

comment:6 Changed 4 years ago by bgamari

Cc: bgamari added

comment:7 Changed 3 years ago by bgamari

Description: modified (diff)

comment:8 Changed 2 years ago by tjakway

Cc: tjakway added
Note: See TracTickets for help on using tickets.