Opened 4 years ago

Last modified 3 months ago

#4470 new feature request

Loop optimization: identical counters

Reported by: choenerzs Owned by:
Priority: normal Milestone: 7.12.1
Component: Compiler Version:
Keywords: loop optimization Cc: choener@…, pumpkingod@…, michal.terepeta@…, dterei
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Consider the small program below, where 'f' has to counters 'i' and 'j'. Both are completely identical; the only difference is that 'i' is used to change 's', while 'j' changes 'm'. It would be beneficial to have GHC transform 'f' into something like 'ff' so that one register less is required.

Neither GHC nor LLVM perform this optimization.

Code of this kind occurs when one uses the "vector library". See this discussion: http://www.haskell.org/pipermail/glasgow-haskell-users/2010-November/019446.html

{-# LANGUAGE BangPatterns #-}

module Main where

import Criterion.Main

f :: Int -> Int -> Int -> Int -> Int
f !i !j !s !m
  | i == 0    = s+m
  | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)

g :: Int -> Int
g !k = f k k 0 0


ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s+m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

gg :: Int -> Int
gg !k = ff k 0 0



{-
main = do
  print $ g 20
  print $ gg 20
-}

main = defaultMain
  [ bench " g" $ whnf g  20 -- 67.9ns
  , bench "gg" $ whnf gg 20 -- 64.5ns
  ]

Function 'f' produces this core:

$wf =
  \ (ww_s1uU :: Int#)
    (ww1_s1uY :: Int#)
    (ww2_s1v2 :: Int#)
    (ww3_s1v6 :: Int#) ->
    case ww_s1uU of wild_B1 {
      __DEFAULT ->
        $wf
          (-# wild_B1 1)
          (-# ww1_s1uY 1)
          (+# (+# ww2_s1v2 wild_B1) 1)
          (+# ww3_s1v6 (*# ww1_s1uY 5));
      0 -> +# ww2_s1v2 ww3_s1v6
    }

'wild_B1' and 'ww1_s1uY' should be merged in this case.

The attached source is above program.

Attachments (1)

FunOpt.hs (523 bytes) - added by choenerzs 4 years ago.

Download all attachments as: .zip

Change History (15)

Changed 4 years ago by choenerzs

comment:1 Changed 4 years ago by rl

Here is another example:

import Data.Vector.Unboxed as V

dotp :: Vector Double -> Vector Double -> Double
dotp xs ys = V.sum (V.zipWith (*) xs ys)

GHC generates this loop:

    letrec {
      $s$wfoldlM'_loop_s1TD [Occ=LoopBreaker]
        :: Int# -> Int# -> Double# -> Double#
      $s$wfoldlM'_loop_s1TD =
        \ (sc_s1Tk :: Int#) (sc1_s1Tl :: Int#) (sc2_s1Tm :: Double#) ->
          case >=# sc1_s1Tl ipv1_s1QU of _ {
            False ->
              case >=# sc_s1Tk ipv4_s1Rb of _ {
                False ->
                  $s$wfoldlM'_loop_s1TD
                    (+# sc_s1Tk 1)
                    (+# sc1_s1Tl 1)
                    (+##
                       sc2_s1Tm
                       (*##
                          (indexDoubleArray# ipv2_s1QV (+# ipv_s1QT sc1_s1Tl))
                          (indexDoubleArray# ipv5_s1Rc (+# ipv3_s1Ra sc_s1Tk))));
                True -> sc2_s1Tm
              };
            True -> sc2_s1Tm
          }; } in
    $s$wfoldlM'_loop_s1TD 0 0 0.0

It should be easy to spot that sc_s1Tk and sc1_s1Tl are always equal. In fact, SpecConstr probably already has most of the machinery for this.

comment:2 Changed 4 years ago by pumpkin

  • Cc pumpkingod@… added

comment:3 Changed 4 years ago by batterseapower

Very interesting. The reason that LLVM doesn't get this is the same reason that it doesn't optimise this C program:

int wf(int sp[]) {
    if (sp[0] == 0) {
        return sp[2] + sp[3];
    } else {
        sp[3] = sp[3] + (sp[1] * 5);
        sp[2] = (sp[2] + sp[0]) + 1;
        sp[1] = sp[1] - 1;
        sp[0] = sp[0] - 1;
        return wf(sp);
    }
}

int g(int a) {
    int sp[] = { a, a, 0, 0 };
    return wf(sp);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

If we rewrite the program to eliminate the "sp" array, we get this:

int wf(int a, int b, int c, int d) {
    if (a == 0) {
        return c + d;
    } else {
        return wf(a - 1, b - 1, (c + a) + 1, d + (b * 5));
    }
}

int g(int a) {
    return wf(a, a, 0, 0);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

LLVM *can* optimise this definition of g(int). Indeed, it identifiers a as an induction variable and turns it into an incrementing counter, and eliminates the computation of b.

It looks like LLVM has a big blind spot for pointer-carried arguments, and needs to do some promotion of elements of an input array into registers before we can reuse its loop optimisation infrastructure. In fact, it looks like LLVM's loop optimisations are being *totally disabled* by our use of sp[] right now - so there are big wins to be had.

I have posted this example to the llvm list in the hope of some insight: http://article.gmane.org/gmane.comp.compilers.llvm.devel/36068

comment:4 Changed 4 years ago by rl

Yes, I also noticed that LLVM doesn't seem to do any real loop optimisations for Haskell programs. I assumed that this was because it didn't know that sp isn't aliased but David fixed that and it didn't help much. It pretty disappointing, really.

comment:5 Changed 4 years ago by igloo

  • Milestone set to 7.2.1

comment:6 Changed 4 years ago by michalt

  • Cc michal.terepeta@… added

comment:7 Changed 4 years ago by dterei

  • Cc dterei added

comment:8 Changed 3 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1

comment:9 Changed 3 years ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2

comment:10 Changed 9 months ago by thoughtpolice

  • Milestone changed from 7.6.2 to 7.10.1

Moving to 7.10.1.

comment:11 Changed 4 months ago by thomie

  • difficulty set to Unknown
  • Type of failure changed from None/Unknown to Runtime performance bug

comment:12 Changed 4 months ago by carter

this is (slightly?) related to series fusion as found in http://www.cse.unsw.edu.au/~amosr/papers/lippmeier2013flowfusion.pdf I think

comment:13 Changed 4 months ago by choenerzs

It is (afaik) more than just slightly related. However, the (wanted) solutions are different. Flow fusion works within the limited context of repa by using a different intermediate system. Here, I'd like a more general solution -- since we can't express everything in terms of repa or even vector.

On the other hand, I think it will be more complicated to get this done on the GHC side -- though the solution might be very worthwhile given the comments above.

comment:14 Changed 3 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

Note: See TracTickets for help on using tickets.