Optimizer misses unboxing opportunity
In my studying of the fannkuch benchmark, I've discovered (I think) another missed optimization. A scaled down illustration goes as follows:
{-# LANGUAGE TypeOperators, BangPatterns #-}
module Main (main) where
import Control.Monad.ST
import System.Environment
data (:*:) a b = !a :*: !b
whileLoop :: Int -> ST s Int
whileLoop = go 0
where
go !n k
| k == 0 = return n
| otherwise = go (n+1) (k-1)
{-# INLINE whileLoop #-}
iter :: Int -> Int -> ST s (Bool :*: Int)
iter k n = do
k' <- whileLoop 40 >>= \k' -> return $! max k k'
b <- return (n == 0)
return $! b :*: k'
{-# INLINE iter #-}
mainLoop :: Int -> Int -> ST s Int
mainLoop k n = do
done :*: k' <- iter k n
if done
then return k'
else mainLoop k' (n - 1)
main = print =<< stToIO . mainLoop 0 . read . head =<< getArgs
If we look at the core for whileLoop's worker, we see:
$wpoly_go_r1aE :: forall s_aem.
Int# -> Int# -> STRep s_aem Int
$wpoly_go_r1aE =
\ (@ s_aem)
(ww_s18Y :: Int#)
(ww1_s192 :: Int#)
(eta_s19w :: State# s_aem) ->
case ww1_s192 of wild_XF {
__DEFAULT ->
$wpoly_go_r1aE @ s_aem (+# ww_s18Y 1) (-# wild_XF 1) eta_s19w;
0 -> (# eta_s19w, I# ww_s18Y #)
}
Note, the return type is a boxed Int. The function is only used once, like so:
...
case $wpoly_go_r1aE @ s_aem 0 40 w_s19f of wild_aFw { (# new_s_aFB, r_aFC #) ->
case r_aFC of wild1_aG9 { I# y_aGb ->
case <=# ww_s199 y_aGb of wild2_aGd {
False ->
...
In other words, go boxes its results at the end of the loop, and the function that uses it immediately looks inside the box for the value. In this particular micro-benchmark, the boxed value (wild1_aG9 above) is actually used in the case where mainLoop returns the boxed value. However, in the larger benchmark I pulled this from, that is not the case in several areas (only the unboxed value is used, but it still goes through a box). Either way, the boxed value is only used at the end of the loop (and not every time there), and on every other iteration, this results in superfluous allocation.
I'll attach a manually unboxed version I wrote (it also has iter and max manually inlined to mainLoop, since that makes it easier to write; the core for the above shows that they are getting inlined properly, so I assume that isn't the issue). Using +RTS -sstderr shows that (running 100 million iterations here), the manually unboxed version allocates 50 kilobytes on the heap, and runs in around 15 seconds, whereas the version above that doesn't get unboxed does 1.6 gigabytes of heap allocation, and takes 18 seconds (in the larger benchmark, such extra boxing would happen perhaps 40 million times over the course of the program).
Thanks for your help.
Trac metadata
Trac field | Value |
---|---|
Version | 6.8.2 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Compiler |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture |