Opened 6 years ago

Last modified 13 months ago

#4470 new feature request

Loop optimization: identical counters

Reported by: choenerzs Owned by:
Priority: normal Milestone:
Component: Compiler Version:
Keywords: loop optimization Cc: choener@…, pumpkingod@…, michal.terepeta@…, dterei, wren@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Consider the small program below, where 'f' has to counters 'i' and 'j'. Both are completely identical; the only difference is that 'i' is used to change 's', while 'j' changes 'm'. It would be beneficial to have GHC transform 'f' into something like 'ff' so that one register less is required.

Neither GHC nor LLVM perform this optimization.

Code of this kind occurs when one uses the "vector library". See this discussion: http://www.haskell.org/pipermail/glasgow-haskell-users/2010-November/019446.html

{-# LANGUAGE BangPatterns #-}

module Main where

import Criterion.Main

f :: Int -> Int -> Int -> Int -> Int
f !i !j !s !m
  | i == 0    = s+m
  | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)

g :: Int -> Int
g !k = f k k 0 0


ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s+m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

gg :: Int -> Int
gg !k = ff k 0 0



{-
main = do
  print $ g 20
  print $ gg 20
-}

main = defaultMain
  [ bench " g" $ whnf g  20 -- 67.9ns
  , bench "gg" $ whnf gg 20 -- 64.5ns
  ]

Function 'f' produces this core:

$wf =
  \ (ww_s1uU :: Int#)
    (ww1_s1uY :: Int#)
    (ww2_s1v2 :: Int#)
    (ww3_s1v6 :: Int#) ->
    case ww_s1uU of wild_B1 {
      __DEFAULT ->
        $wf
          (-# wild_B1 1)
          (-# ww1_s1uY 1)
          (+# (+# ww2_s1v2 wild_B1) 1)
          (+# ww3_s1v6 (*# ww1_s1uY 5));
      0 -> +# ww2_s1v2 ww3_s1v6
    }

'wild_B1' and 'ww1_s1uY' should be merged in this case.

The attached source is above program.

Attachments (1)

FunOpt.hs (523 bytes) - added by choenerzs 6 years ago.

Download all attachments as: .zip

Change History (18)

Changed 6 years ago by choenerzs

Attachment: FunOpt.hs added

comment:1 Changed 6 years ago by rl

Here is another example:

import Data.Vector.Unboxed as V

dotp :: Vector Double -> Vector Double -> Double
dotp xs ys = V.sum (V.zipWith (*) xs ys)

GHC generates this loop:

    letrec {
      $s$wfoldlM'_loop_s1TD [Occ=LoopBreaker]
        :: Int# -> Int# -> Double# -> Double#
      $s$wfoldlM'_loop_s1TD =
        \ (sc_s1Tk :: Int#) (sc1_s1Tl :: Int#) (sc2_s1Tm :: Double#) ->
          case >=# sc1_s1Tl ipv1_s1QU of _ {
            False ->
              case >=# sc_s1Tk ipv4_s1Rb of _ {
                False ->
                  $s$wfoldlM'_loop_s1TD
                    (+# sc_s1Tk 1)
                    (+# sc1_s1Tl 1)
                    (+##
                       sc2_s1Tm
                       (*##
                          (indexDoubleArray# ipv2_s1QV (+# ipv_s1QT sc1_s1Tl))
                          (indexDoubleArray# ipv5_s1Rc (+# ipv3_s1Ra sc_s1Tk))));
                True -> sc2_s1Tm
              };
            True -> sc2_s1Tm
          }; } in
    $s$wfoldlM'_loop_s1TD 0 0 0.0

It should be easy to spot that sc_s1Tk and sc1_s1Tl are always equal. In fact, SpecConstr probably already has most of the machinery for this.

comment:2 Changed 6 years ago by pumpkin

Cc: pumpkingod@… added

comment:3 Changed 6 years ago by batterseapower

Very interesting. The reason that LLVM doesn't get this is the same reason that it doesn't optimise this C program:

int wf(int sp[]) {
    if (sp[0] == 0) {
        return sp[2] + sp[3];
    } else {
        sp[3] = sp[3] + (sp[1] * 5);
        sp[2] = (sp[2] + sp[0]) + 1;
        sp[1] = sp[1] - 1;
        sp[0] = sp[0] - 1;
        return wf(sp);
    }
}

int g(int a) {
    int sp[] = { a, a, 0, 0 };
    return wf(sp);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

If we rewrite the program to eliminate the "sp" array, we get this:

int wf(int a, int b, int c, int d) {
    if (a == 0) {
        return c + d;
    } else {
        return wf(a - 1, b - 1, (c + a) + 1, d + (b * 5));
    }
}

int g(int a) {
    return wf(a, a, 0, 0);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

LLVM *can* optimise this definition of g(int). Indeed, it identifiers a as an induction variable and turns it into an incrementing counter, and eliminates the computation of b.

It looks like LLVM has a big blind spot for pointer-carried arguments, and needs to do some promotion of elements of an input array into registers before we can reuse its loop optimisation infrastructure. In fact, it looks like LLVM's loop optimisations are being *totally disabled* by our use of sp[] right now - so there are big wins to be had.

I have posted this example to the llvm list in the hope of some insight: http://article.gmane.org/gmane.comp.compilers.llvm.devel/36068

comment:4 Changed 6 years ago by rl

Yes, I also noticed that LLVM doesn't seem to do any real loop optimisations for Haskell programs. I assumed that this was because it didn't know that sp isn't aliased but David fixed that and it didn't help much. It pretty disappointing, really.

comment:5 Changed 6 years ago by igloo

Milestone: 7.2.1

comment:6 Changed 6 years ago by michalt

Cc: michal.terepeta@… added

comment:7 Changed 6 years ago by dterei

Cc: dterei added

comment:8 Changed 5 years ago by igloo

Milestone: 7.4.17.6.1

comment:9 Changed 4 years ago by igloo

Milestone: 7.6.17.6.2

comment:10 Changed 3 years ago by thoughtpolice

Milestone: 7.6.27.10.1

Moving to 7.10.1.

comment:11 Changed 2 years ago by thomie

difficulty: Unknown
Type of failure: None/UnknownRuntime performance bug

comment:12 Changed 2 years ago by carter

this is (slightly?) related to series fusion as found in http://www.cse.unsw.edu.au/~amosr/papers/lippmeier2013flowfusion.pdf I think

comment:13 Changed 2 years ago by choenerzs

It is (afaik) more than just slightly related. However, the (wanted) solutions are different. Flow fusion works within the limited context of repa by using a different intermediate system. Here, I'd like a more general solution -- since we can't express everything in terms of repa or even vector.

On the other hand, I think it will be more complicated to get this done on the GHC side -- though the solution might be very worthwhile given the comments above.

comment:14 Changed 2 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:15 Changed 18 months ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:16 Changed 17 months ago by WrenThornton

Cc: wren@… added

comment:17 Changed 13 months ago by thomie

Milestone: 8.0.1
Note: See TracTickets for help on using tickets.