Codegen on sequential FFI calls is not very good
I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.
Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46
It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:
addq $8,%rsp
movq %rbx,%rdi
movq 72(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz
addq $8,%rsp
movq %rbx,%rdi
movl $35,%esi
subq $8,%rsp
movl $0,%eax
call bw_append_byte
addq $8,%rsp
movq %rbx,%rdi
movq 64(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz
I don't know why rsp
is being changed so much. I also can't explain the assignment to eax
before the call. (It should also be xorl eax,eax
, I would think.)
To my reading, the above instruction sequence could be reduced to:
movq %rbx,%rdi
movq 64(%rsp),%rsi
call bw_append_bsz
movq %rbx,%rdi
movl $35,%esi
call bw_append_byte
movq %rbx,%rdi
movq 56(%rsp),%rsi
call bw_append_bsz
To reproduce, check out git@github.com:chadaustin/buffer-builder.git
at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be
cabal configure --enable-benchmarks
cabal bench
And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm
file.
This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.