missed opportunity for call-pattern specialisation
With strict ByteString
an a simple tail recursive iteration pattern we get perfect code.
First the definitions:
data ByteString
= BS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- length
null :: ByteString -> Bool
null (BS _ _ l) = l <= 0
tail :: ByteString -> ByteString
tail (BS p s l)
| l <= 0 = error "ByteString.tail: empty ByteString"
| otherwise = BS p (s+1) (l-1)
Now a trivial iteration pattern:
length :: ByteString -> Int
length = go 0
where
go !n bs | null bs = n
| otherwise = go (n+1) (tail bs)
Perfect core code (edited for clarity):
go :: Int# -> Addr# -> ForeignPtrContents
-> Int# -> Int# -> Int#
go = \n p fpc o l ->
case l <=# 0 of
False -> go (n +# 1) p fpc (o +# 1) (l -# 1)
True -> n
length :: ByteString -> GHC.Base.Int
length = \bs ->
case bs of
BS p fpc 0 l ->
case go 0 p fpc 0 l of
n -> I# n
This worked because strict ByteString
is a single constructor data type which allows it to be unpacked into separate parameters in the recursive call.
Now, lets try the same with lazy ByteStrings
:
data ByteString
= Empty
| Chunk {-# UNPACK #-} !StrictBS.ByteString ByteString
This of course has two constructors. It's built for call-pattern specialisation.
Now the ops:
null :: ByteString -> Bool
null Empty = True
null _ = False
tail :: ByteString -> ByteString
tail Empty = error "empty tail"
tail (Chunk (S.BS fp s 1) cs) = cs
tail (Chunk (S.BS fp s l) cs) = Chunk (S.BS fp (s+1) (l-1)) cs
We can use the exact same code for tail, but now for the lazy ByteString
type:
length :: ByteString -> Int
length = go 0
where
go !n bs | null bs = n
| otherwise = go (n+1) (tail bs)
But, oh noes!! The optimisation does not work:
go :: Int# -> ByteString -> Int#
go = \n bs ->
case bs of
Empty -> n
Chunk p fpc o l bs' ->
go (n +# 1)
(case l of
1 -> bs'
_ -> Chunk p fpc (o +# 1) (l -# 1) bs')
length :: ByteString -> Int
length = \bs ->
case go 0 bs of
n -> I# n
However this is not because call pattern specialisation didn't do what we wanted. We know this for several reasons. One, if we remove the case
tail (Chunk (S.BS fp s 1) cs) = cs
then call pattern specialisation works and the code ends up as a perfect loop. Of course we need that case for correctness.
Also, if we change the definition of lazy ByteString
to be a single constructor and represent the end by an empty chunk then we still get essentially the same code.
Also, if we use uncons
instead of null
and tail
then we effectively perform by hand the missing optimisation transformation and get perfect code (and call-pattern specialisation happens exactly as expected).
length :: ByteString -> Int
length = go 0
where
go !n bs = case uncons bs of
Nothing -> n
Just (_, bs') -> go (n+1) bs'
uncons :: ByteString -> Maybe (Word8, ByteString)
uncons Empty = Nothing
uncons (Chunk c cs)
| StrictBS.length c == 1
= Just (S.unsafeHead c, cs)
| otherwise
= Just (S.unsafeHead c, Chunk (S.unsafeTail c) cs)
This version with uncons
gives us perfect code:
go_chunk :: ByteString -> Int# -> Int#
-> ForeignPtrContents -> Addr# -> Int# -> Int#
go_chunk = \bs' l o fpc p n ->
case l of
1 -> go (n +# 1) bs'
_ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)
go :: Int# -> ByteString -> Int#
go = \n bs ->
case bs of
Empty -> n
Chunk p fpc o l bs' ->
case l of
1 -> go (n +# 1) bs'
_ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)
length :: ByteString -> Int
length = \bs ->
case go 0 bs of
n -> I# n
and we can see the specialisation rule that ghc invented for us:
forall bs l o fpc p n.
go n (Chunk p fpc o l bs)
= go_chunk bs l o fpc p n
Aside: looks like there's an extra/missing reverse in there somewhere.
The problem with the head
/ tail
version is that the following transformation is never performed and so the opportunity for call pattern specialisation (or even simple worker/wrapper unpacking) is never exposed:
go (n+1)
(case l of
1 -> bs'
_ -> Chunk p fpc (o+1) (l-1) bs')
=
case l of
1 -> go (n+1) bs'
_ -> go (n+1) (Chunk p fpc (o+1) (l-1) bs')
which is the fragment inside the inlined definition of the go worker function:
go !n bs = case bs of
Empty -> n
Chunk p fpc o l bs' ->
go (n+1)
(case l of
1 -> bs'
_ -> Chunk p fpc (o+1) (l-1) bs')
So go
is tail recursive and strict in both arguments. This situation will arise whenever we have something like
go (tail xs)
and tail
involves a case analysis. Since go
is just a function call there is no problem in duplicating it into the two branches of tail
. The fact that it enables the call-pattern specialisation makes this a huge win.
This will happen with lazy ByteString
s or any other chunked representation using trees. We really need the per-chunk inner loop to be good with just a single test in the fast path to see if we're at the end of the chunk. When we get to the end of the chunk we can move into the slow path and do more cunning things with chunks and trees and lazyness. But the overall speed of these kinds of operations is determined by the quality of the inner loop. If that inner loop has to allocate a fresh constructor each time round then we lose.
In principle, if we can optimise perfectly then there is no reason for strict ByteString
to have faster code than lazy ByteString
, because their inner loops should look exactly the same.
Trac metadata
Trac field | Value |
---|---|
Version | 6.10.1 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Compiler |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture |