Codegen on sequential FFI calls is not very good

I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.

Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46

It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:

addq $8,%rsp
movq %rbx,%rdi
movq 72(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

addq $8,%rsp
movq %rbx,%rdi
movl $35,%esi
subq $8,%rsp
movl $0,%eax
call bw_append_byte

addq $8,%rsp
movq %rbx,%rdi
movq 64(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

I don't know why rsp is being changed so much. I also can't explain the assignment to eax before the call. (It should also be xorl eax,eax, I would think.)

To my reading, the above instruction sequence could be reduced to:

movq %rbx,%rdi
movq 64(%rsp),%rsi
call bw_append_bsz

movq %rbx,%rdi
movl $35,%esi
call bw_append_byte

movq %rbx,%rdi
movq 56(%rsp),%rsi
call bw_append_bsz

To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be

cabal configure --enable-benchmarks
cabal bench

And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm file.

This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.

Edited Mar 10, 2019 by Ben Gamari

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information