Opened 18 months ago

Last modified 6 months ago

#7367 new bug

Optimiser / Linker Problem on amd64

Reported by: wurmli
Owned by:
Priority: normal
Milestone: 7.8.3
Component: Build System
Version: 7.6.1
Keywords:
Cc: wasserman.louis@…
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Difficulty: Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

The Haskell fannkuchredux contribution of Louis Wasserman to "The Computer Language Benchmarks Game" at shootout.alioth.debian.org times out on the amd64 machines, but not on the i386. I can reproduce it on my Debian amd64 machine.

It turns out that compiling without optimisation, or with a simple -O, produces a fast program, but one with an enormously large total heap allocation (10G compared with 67k on a virtual i386 machine) and also more garbage-collector activity.

The source is below (because I couldn't find a way to attach the file). At the end of the source I have copied my make command line, run command line, and the output produced with the option -sstderr.


{-  The Computer Language Benchmarks Game
    http://shootout.alioth.debian.org/
    contributed by Louis Wasserman
    
    This should be compiled with:
    	-threaded -O2 -fexcess-precision -fasm
    and run with:
    	+RTS -N<number of cores> -RTS <input>
-}

import Control.Concurrent
import Control.Monad
import System.Environment
import Foreign hiding (rotate)
import Data.Monoid

type Perm = Ptr Word8

data F = F {-# UNPACK #-} !Int {-# UNPACK #-} !Int

instance Monoid F where
	mempty = F 0 0
	F s1 m1 `mappend` F s2 m2 = F (s1 + s2) (max m1 m2)

incPtr = (`advancePtr` 1)
decPtr = (`advancePtr` (-1))

flop :: Int -> Perm -> IO ()
flop k xs = flopp xs (xs `advancePtr` k)
 where flopp i j = when (i < j) $ swap i j >> flopp (incPtr i) (decPtr j)
       swap i j = do
	a <- peek i
	b <- peek j
	poke j a
	poke i b

flopS :: Perm -> (Int -> IO a) -> IO a
flopS !xs f = do
	let go !acc = do
		k <- peekElemOff xs 0
		if k == 0 then f acc else flop (fromIntegral k) xs >> go (acc+1)
	go 0

increment :: Ptr Word8 -> Ptr Word8 -> IO ()
increment !p !ct = do
	first <- peekElemOff p 1
	pokeElemOff p 1 =<< peekElemOff p 0
	pokeElemOff p 0 first
	
	let go !i !first = do
		ci <- peekElemOff ct i
		if fromIntegral ci < i then pokeElemOff ct i (ci+1) else do
			pokeElemOff ct i 0
			let !i' = i + 1
			moveArray p (incPtr p) i'
			pokeElemOff p i' first
			go i' =<< peekElemOff p 0
	go 1 first  

genPermutations :: Int -> Int -> Int -> Ptr Word8 -> Ptr Word8 -> IO F
genPermutations !n !l !r !perm !count = allocaArray n $ \ destF -> do
	let upd j !f run = do
		p0 <- peekElemOff perm 0
		if p0 == 0 then increment perm count >> run f else do
			copyArray destF perm n
			increment perm count
			flopS destF $ \ flops -> 
				run (f `mappend` F (checksum j flops) flops)
	let go j !f = if j >= r then return f else upd j f (go (j+1))
	go l mempty
 where checksum i f = if i .&. 1 == 0 then f else -f

facts :: [Int]
facts = scanl (*) 1 [1..12]

unrank :: Int -> Int -> (Ptr Word8 -> Ptr Word8 -> IO a) -> IO a
unrank !idx !n f = allocaArray n $ \ p -> allocaArray n $ \ count ->
  allocaArray n $ \ pp -> do
	mapM_ (\ i -> pokeElemOff p i (fromIntegral i)) [0..n-1]
	let go i !idx = when (i >= 0) $ do
		let fi = facts !! i
		let (q, r) = idx `quotRem` fi
		pokeElemOff count i (fromIntegral q)
		copyArray pp p (i+1)
		let go' j = when (j <= i) $ do
			let jq = j + q
			pokeElemOff p j =<< peekElemOff pp (if jq <= i then jq else jq - i - 1)
			go' (j+1)
		go' 0
		go (i-1) r
	go (n-1) idx
	f p count

main = do
   n <- fmap (read.head) getArgs
   let fact = product [1..n]
   let bk = fact `quot` 4
   vars <- forM [0,bk..fact-1] $ \ ix -> do
   	var <- newEmptyMVar
   	forkIO (unrank ix n $ \ p -> genPermutations n ix (min fact (ix + bk)) p >=> putMVar var)
   	return var
   F chksm mflops <- liftM mconcat (mapM takeMVar vars)
   putStrLn $ (show chksm) ++ "\nPfannkuchen(" ++ (show n) ++ ") = " ++ (show $ mflops)


{-

wurmli@noah-nofen:~/hpw/haskell/work/fannkuch$ ghc  --make
 -XBangPatterns -O -threaded -fllvm -rtsopts fannkuchredux.ghc-3.hs 
[1 of 1] Compiling Main             ( fannkuchredux.ghc-3.hs, fannkuchredux.ghc-3.o )
Linking fannkuchredux.ghc-3 ...
wurmli@noah-nofen:~/hpw/haskell/work/fannkuch$ ./fannkuchredux.ghc-3 +RTS -N4 -sstderr   -RTS 12
3968050
Pfannkuchen(12) = 65
  10,538,122,952 bytes allocated in the heap
         359,512 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          51,120 bytes maximum slop
               4 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6053 colls,  6053 par    0.16s    0.04s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 40.82% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   44.73s  ( 11.51s elapsed)
  GC      time    0.16s  (  0.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   44.89s  ( 11.55s elapsed)

  Alloc rate    235,589,887 bytes per MUT second

  Productivity  99.6% of total user, 387.3% of total elapsed

gc_alloc_block_sync: 31024
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

-}

Attachments (1)

simpl.gz (16.2 KB) - added by rwbarton 8 months ago.
output of ghc -XBangPatterns -O -threaded -fllvm -rtsopts -ddump-simpl -ddump-stg f-orig.hs


Change History (18)

comment:1 Changed 18 months ago by simonpj

  • Difficulty set to Unknown

Can you explain the problem more precisely? You say that something "times out", but I'm not sure what. Then you say that something else takes a lot of heap; but the figures you give seem very small (e.g. 359k bytes copied during GC). And you may be saying that behaviour is different on different architectures, but I'm not sure.

If you could be more precise about how to reproduce the problem you are describing, and why you think it's a problem with GHC (rather than with the program it's compiling) that would be helpful.

Simon

comment:2 Changed 12 months ago by igloo

  • Milestone set to 7.8.1

It looks like the "bytes allocated in the heap" dramatically increased between 7.4 and 7.6:

$ ghc-7.4.1 --make -O q -XBangPatterns -threaded -rtsopts 
[1 of 1] Compiling Main             ( q.hs, q.o )
Linking q ...
$ time ./q 12 +RTS -N4 -s
3968050
Pfannkuchen(12) = 65
          88,592 bytes allocated in the heap
           6,040 bytes copied during GC
          46,928 bytes maximum residency (1 sample(s))
          43,184 bytes maximum slop
               4 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: -nan (0 / 0, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :   78.76s    ( 20.00s)       0.00s    (  0.00s)
  Task  1 (worker) :   78.76s    ( 20.00s)       0.00s    (  0.00s)
  Task  2 (bound)  :   78.75s    ( 20.00s)       0.00s    (  0.00s)
  Task  3 (worker) :   78.76s    ( 20.00s)       0.00s    (  0.00s)
  Task  4 (worker) :   78.75s    ( 20.00s)       0.00s    (  0.00s)
  Task  5 (worker) :   78.76s    ( 20.00s)       0.00s    (  0.00s)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   78.75s  ( 20.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   78.76s  ( 20.00s elapsed)

  Alloc rate    1,124 bytes per MUT second

  Productivity 100.0% of total user, 393.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
./q 12 +RTS -N4 -s  78.76s user 0.00s system 393% cpu 19.998 total
$ ghc-7.6.2 --make -O q -XBangPatterns -threaded -rtsopts
[1 of 1] Compiling Main             ( q.hs, q.o )
Linking q ...
$ time ./q 12 +RTS -N4 -s
3968050
Pfannkuchen(12) = 65
  10,538,123,264 bytes allocated in the heap
         390,688 bytes copied during GC
          46,856 bytes maximum residency (2 sample(s))
          51,448 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5843 colls,  5843 par   28.64s    0.38s     0.0001s    0.0131s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 61.48% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   54.57s  ( 20.89s elapsed)
  GC      time   28.64s  (  0.38s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   83.21s  ( 21.27s elapsed)

  Alloc rate    193,121,729 bytes per MUT second

  Productivity  65.6% of total user, 256.6% of total elapsed

gc_alloc_block_sync: 28902
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
./q 12 +RTS -N4 -s  83.21s user 0.18s system 392% cpu 21.272 total

comment:3 Changed 8 months ago by rwbarton

Based on an examination of the Core, GHC HEAD (and presumably 7.6, though I don't have it to test) is floating the expression i .&. 1 == 0 (the test inside checksum) out of the \ flops -> ... lambda, resulting in the allocation of a thunk per execution of the go/upd loop in genPermutations. GHC 7.4 doesn't perform this "optimization" on this particular program. Not sure why.

The allocations under GHC HEAD can be eliminated with a change to the benchmark program like this:

--- f-orig.hs	2013-08-26 19:42:27.000000000 -0400
+++ f.hs	2013-08-26 19:30:09.000000000 -0400
@@ -64,11 +64,11 @@
 		if p0 == 0 then increment perm count >> run f else do
 			copyArray destF perm n
 			increment perm count
-			flopS destF $ \ flops -> 
-				run (f `mappend` F (checksum j flops) flops)
+                        if j .&. 1 == 0
+                          then flopS destF $ \ flops -> run (f `mappend` F flops flops)
+                          else flopS destF $ \ flops -> run (f `mappend` F (-flops) flops)
 	let go j !f = if j >= r then return f else upd j f (go (j+1))
 	go l mempty
- where checksum i f = if i .&. 1 == 0 then f else -f

or by building with -fno-full-laziness provided that flopp and swap are moved out of flop to the top level (otherwise, those allocate instead).
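
For reference, the manual float is small; a sketch only (swap keeps its original body and is simply lifted out of flop together with flopp):

flopp :: Perm -> Perm -> IO ()
flopp i j = when (i < j) $ swap i j >> flopp (incPtr i) (decPtr j)

swap :: Perm -> Perm -> IO ()
swap i j = do
    a <- peek i
    b <- peek j
    poke j a
    poke i b

flop :: Int -> Perm -> IO ()
flop k xs = flopp xs (xs `advancePtr` k)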

comment:4 Changed 8 months ago by simonpj

rwbarton: thanks. It's really helpful to nail performance regressions.

However I can't see the thunk you are referring to. Could you perhaps compile with -ddump-simpl -ddump-stg and point to the line number of the offending thunk allocation? (Perhaps also give the command line you are using to compile, so I can see what flags you are using.)

Ta. Simon

Changed 8 months ago by rwbarton

output of ghc -XBangPatterns -O -threaded -fllvm -rtsopts -ddump-simpl -ddump-stg f-orig.hs

comment:5 Changed 8 months ago by rwbarton

Hi Simon,

It's at lines 731-742 and then at lines 2242-2270 in the attachment; or you can find it by searching for and# (just one occurrence in each part of the dump) and taking a dozen or so lines around it.

I hope I'm not misinterpreting the core, and let { lvl6_s1Vb :: GHC.Types.Bool ... } in ... is in fact a thunk allocation.

comment:6 Changed 8 months ago by simonpj

Ah, got it now, thanks. Indeed, there is a real issue here. genPermutations looks like this:

genPermutations !n !l !r !perm !count = allocaArray n $ \ destF -> do
	let upd j !f run = do
		p0 <- peekElemOff perm 0
		if p0 == 0 then increment perm count >> run f else do
			copyArray destF perm n
			increment perm count
			flopS destF $ \ flops -> 
				run (f `mappend` F (checksum j flops) flops)
	let go j !f = if j >= r then return f else upd j f (go (j+1))
	go l mempty
 where 
    checksum i f = if i .&. 1 == 0 then f else -f

Now, just inline checksum and that call to flopS looks like

   flopS destF $ \ flops -> 
	run (f `mappend` F (if j .&. 1 == 0 then flops else -flops) flops)

Now flopS really does call its argument many times, so work really is saved by lifting out the constant expression to get:

   let lvl::Bool = j .&. 1 == 0
   in flopS destF $ \ flops -> 
	run (f `mappend` F (if lvl then flops else -flops) flops)

As it happens, in this case allocating the thunk may be more expensive than repeating the computation, but that's a bit hard for GHC to work out.

Ideas welcome.

Simon

comment:7 Changed 8 months ago by rwbarton

Actually, flopS uses its "continuation" f linearly, like the CPS translation of an ordinary imperative function. Teaching GHC to recognize this sounds like quite an undertaking, though.

This seems like a typical instance of how it is hard to decide whether to float out a let.

The trouble with let-floating from the user's perspective is that when GHC wrongly decides to float out a let, it is virtually impossible to stop it from doing so. One way out is -fno-full-laziness, but GHC probably was correctly floating out a lot of other things, and it's difficult and tedious to locate and transform all those parts of the code.

One suggestion is to identify situations in which let-floating can never be harmful, and allow -fno-full-laziness (or perhaps a new flag, -fsome-full-laziness?) to perform those floatings. For example, I don't think it can ever cost to float a binding whose RHS is a lambda to the top level. That rule would float out flopp and swap from flop here, and compiling with -fno-full-laziness would then eliminate all allocations.

It would also be very useful to be able to control let-floating locally with some kind of pragma or magic built-in function. I'm sure this idea has been explored before and I also have some ideas of how to go about it, but I want to get a little more familiar with the structure of GHC before proposing anything.
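
One such magic built-in did eventually appear: GHC 7.10 added GHC.Magic.oneShot (re-exported from GHC.Exts), which asserts that a lambda is entered at most once, telling the optimiser that nothing is gained by floating work out of that lambda for sharing. A hedged sketch of how it could be applied to the call site above (with import GHC.Exts (oneShot); this is not part of the original program):

flopS destF $ oneShot $ \ flops ->
	run (f `mappend` F (checksum j flops) flops)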

comment:8 follow-up: Changed 8 months ago by simonpj

I don't think flopS uses its continuation linearly!

flopS :: Perm -> (Int -> IO a) -> IO a
flopS !xs f = do
	let go !acc = do
		k <- peekElemOff xs 0
		if k == 0 then f acc else flop (fromIntegral k) xs >> go (acc+1)
	go 0

One call to flopS will give many calls to its argument f. Actually the new cardinality analyser does spot (and exploit) this kind of linearity.

I'd be happy to hear more ideas as you get more familiar with GHC.

Simon

comment:9 in reply to: ↑ 8 Changed 8 months ago by rwbarton

Replying to simonpj:

I don't think flopS uses its continuation linearly!

flopS :: Perm -> (Int -> IO a) -> IO a
flopS !xs f = do
	let go !acc = do
		k <- peekElemOff xs 0
		if k == 0 then f acc else flop (fromIntegral k) xs >> go (acc+1)
	go 0

Doesn't it though?

flopS is just a single call to go, so it uses f linearly if go does.

go performs an action unrelated to f, and then either it finishes by calling f, or it performs another action unrelated to f and then calls itself. So any run of go calls f exactly once, as a tail call (or it loops forever).

Ah, perhaps you've misparsed if k == 0 then f acc else flop (fromIntegral k) xs >> go (acc+1) as (if k == 0 then f acc else flop (fromIntegral k) xs) >> go (acc+1)?

comment:10 Changed 8 months ago by simonpj

Oh I missed that. go is recursive, but it only calls f as it exits. This refactoring would make that much clearer (including to the programmer), and GHC might even spot the linearity:

flopS :: Perm -> (Int -> IO a) -> IO a
flopS !xs f = do
	let go !acc = do
		k <- peekElemOff xs 0
		if k == 0 then return acc else flop (fromIntegral k) xs >> go (acc+1)
	acc <- go 0
	f acc

However your points about controlling floating remain. It would not be hard to add flags for "only float if the thing will get to top level" or "only float existing bindings, don't create new thunks".

comment:11 Changed 8 months ago by wurmli

May I add to your consideration the following simple example with the same structure and behaviour:

-- ghc --make -threaded -O2 -fllvm -rtsopts heapAndSpeed.hs
-- ./heapAndSpeed +RTS -s

{-# LANGUAGE BangPatterns #-}

import System.Environment
import Control.Applicative
import Control.Monad

import Control.Monad.ST

import qualified Data.STRef as VSM

-------------------------------------------------------------------
-- Increment counter by 1

acc1 :: VSM.STRef s ( Int , Int ) -> ST s ()
acc1 !v = do{ 
  (!k,!n) <- VSM.readSTRef v; 
  VSM.writeSTRef v (k+1,n)
  }

-- Increment counter until limit

whileModify1 :: VSM.STRef s ( Int , Int ) -> ST s ()
whileModify1 !v = do
  !go <- do{ (k,n) <- VSM.readSTRef v; return (compare k n)}
  case go of
    LT -> do {acc1 v; whileModify1 v}
    EQ -> return ()
    GT -> return ()
    
testAcc :: Int -> (Int,Int)
testAcc n = runST $ do
  v <- VSM.newSTRef (0,n)
  whileModify1 v
  VSM.readSTRef v

main = do 
  let k = 20000000
  putStrLn $ show $ testAcc k 

comment:12 follow-up: Changed 8 months ago by rwbarton

wurmli, what's the matter with it?

"800,100,272 bytes allocated in the heap" means that the total size of all the allocations done over the course of the program is 800,100,272 bytes. That's the expected size of 20 million (Int, Int) pairs which share their second field (n), plus a small amount of other stuff. It doesn't have anything to do with the size of the heap at any given time. The maximum heap size is shown separately: "50,520 bytes maximum residency" which is quite reasonable.

Similarly your original program does not ever occupy 10 GB of heap at a time. If you look at the process in top you will see a memory usage close to "47,184 bytes maximum residency" (well probably more like a couple MB, to hold the program image, but not anything near 10 GB).

I have no idea why the original program timed out on the language benchmark machines, but it wasn't due to it allocating 10 GB sequentially. Allocation of short-lived objects is very cheap. But it is not free, and this discussion has been about why current GHC produces a program that allocates a lot when GHC 7.4 did not. Eliminating the large amount of allocation might reduce the runtime by a few percent or so.

comment:13 in reply to: ↑ 12 Changed 8 months ago by wurmli

Replying to rwbarton:

wurmli, what's the matter with it?

"800,100,272 bytes allocated in the heap" means that the total size of all the allocations done over the course of the program is 800,100,272 bytes. That's the expected size of 20 million (Int, Int) pairs which share their second field (n), plus a small amount of other stuff. It doesn't have anything to do with the size of the heap at any given time. The maximum heap size is shown separately: "50,520 bytes maximum residency" which is quite reasonable.

Similarly your original program does not ever occupy 10 GB of heap at a time. If you look at the process in top you will see a memory usage close to "47,184 bytes maximum residency" (well probably more like a couple MB, to hold the program image, but not anything near 10 GB).

I have no idea why the original program timed out on the language benchmark machines, but it wasn't due to it allocating 10 GB sequentially. Allocation of short-lived objects is very cheap. But it is not free, and this discussion has been about why current GHC produces a program that allocates a lot when GHC 7.4 did not. Eliminating the large amount of allocation might reduce the runtime by a few percent or so.

Would you agree that it is reasonable to expect the optimiser to optimise these allocations away? My simple assumption about the fannkuch program is that speed is enhanced if memory use stays local: the more the program works only in registers and cache, the faster it runs. With the repeated allocation of an intermediate variable, the cache might be exhausted and the processor might have to copy data in and out of it, which could slow the program down.

comment:14 follow-up: Changed 8 months ago by simonpj

wurmli, no I don't think it's reasonable. (Unless I'm missing something.) I have not looked very carefully, but I think this program does a lot of read/write of an imperatively-mutable (STRef (Int,Int)). If we could eliminate those read/write pairs altogether, that would indeed be a good thing, but that is pretty hard to do, because in principle any computation of type (ST s a) might mutate that reference. GHC simply doesn't have any serious optimisations for imperative code; it focuses on optimising functional code.

To put it another way, what program would you expect GHC to transform your code into?

Simon
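
For concreteness, the allocation-free program one might hope for keeps the counter in function arguments rather than behind an STRef. A sketch (an assumption about what is wanted, not something GHC currently produces from the ST code):

testAcc :: Int -> (Int, Int)
testAcc n = go 0
  where
    go k | k < n     = go (k + 1)  -- the counter stays in a register
         | otherwise = (k, n)      -- the only allocation is the final pair

With -O, strictness analysis keeps k unboxed, so this loop allocates nothing per iteration.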

comment:15 in reply to: ↑ 14 Changed 8 months ago by wurmli

Replying to simonpj:

wurmli, no I don't think it's reasonable. (Unless I'm missing something.) I have not looked very carefully, but I think this program does a lot of read/write of an imperatively-mutable (STRef (Int,Int)). If we could eliminate those read/write pairs altogether, that would indeed be a good thing, but that is pretty hard to do, because in principle any computation of type (ST s a) might mutate that reference. GHC simply doesn't have any serious optimisations for imperative code; it focuses on optimising functional code.

To put it another way, what program would you expect GHC to transform your code into?

Simon

I am sorry for not choosing my words carefully enough.

As a Haskell programmer I have only limited ways to influence how the compiled program utilises the resources of a given computer architecture. Most of the time this is a great blessing, as it lets me concentrate on the logic of my task. But for some tasks, e.g. numerics or simulation, raw speed often results from optimal use of the resources. So I should probably not say it is a reasonable expectation, but a hope, that the code generator and optimiser can do that job for me. In the same way that I might manage to write a recursive function as a tail-recursive one, I would have hoped that, by giving a hint with an imperative-style Haskell program, the code generator / optimiser could generate what you would get from, say, an analogous C program.

If not you and your top colleagues working on GHC, who should be able to find a solution, if one even exists? Let me risk looking stupid: in one of the optimisation phases an analysis of dead and live variables takes place (which looks like the action of a garbage collector to me). My naive expectation would be that an optimisation is possible if and only if an analysis of dead and live variables is doable (based on the syntactic structure at that stage).

Hans Peter

comment:16 Changed 8 months ago by carter

Let me preface this by saying that I'm very likely not fully following this thread, so please view my remarks as also being questions for clarification.

I'm trying to follow this thread:

  1. the issue initially was that there's overly aggressive let floating? I believe the way Manuel addresses this in his Repa code is by using the touch function to prevent let floating, right?
  2. currently: it's now a question of having a more systematic way of soundly handling the cost model of let floating and when to do it?

@Hans / Wurmli
As a Haskell programmer, you can absolutely write great, performant low-level code in Haskell (or cheaply FFI out if need be). It does require really understanding how GHC works, and how it compiles code. Really performant Haskell does not treat the compiler as a black box, but rather as a partner in a conversation. I have some fun examples of bit-fiddling Haskell code that turns into exactly the straight-line register manipulations I'd hope any language would generate. But to really write HPC-grade code, you have to really understand your tools!

The "standard" way to systematically write very very performant code in haskell, is to first design a library with the "right" abstractions, and in tandem, have a "dialogue" where you figure out how to give the library internals a representation that GHC can aggressively fuse / simplfify away to make things fast. The Vector Library does this with stream fusion, and the GPU lib Accelerate and the CPU libs Repa 3 / Repa 4 libs all have very nice views on some parts of this methodology (I have my own, different take in progress that adds some interest twists too). In some respects, its also an ongoing exploratory engineering effort and research effort to make it better and better.

point being: there is no magical compiler, merely compilers that can "collaborate" with the library author. GHC does a great (and getting even better over time) job of this. If you have specific optimizations you'd want, please illustrate what the "input" and "result" code from the optimization would be! Humans, given enough time, are often the best optimizers, so the best a compiler can do is support library authors in writing easy-to-optimize libraries!

Importantly: GHC currently doesn't pass much aliasing information to code generators, though for numerical / bit-fiddling code, LLVM can do some tremendously amazing optimizations. There will also be great support for some basic SIMD code writing in 7.8.

That said, after the 7.8 release, and before 7.10 lands, I think it's pretty fair to say that a lot of great work will be happening to better support GHC having a good numerical story. If nothing else, it's something that I (time permitting) want to improve / help with.

That said: in the meantime, it's OK to have "fat primops" written in C that you FFI out to, with all your application / numerical logic and memory management on the Haskell side. I'm actually doing that quite a bit in my own code, and while there's plenty of room for even better performance, even with that naive approach I'm able to get temptingly close to ye olde ancient-but-really-really-fast Fortran-grade performance with fairly little effort on my part.
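
A minimal sketch of that pattern (the C routine flop_all and its argument layout are hypothetical, purely to illustrate the shape of such a binding):

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign
import Foreign.C.Types

-- Hypothetical C "fat primop": runs the whole flop loop over a permutation
-- buffer of the given length and returns the flop count. The buffer itself
-- is still allocated and managed on the Haskell side.
foreign import ccall unsafe "flop_all"
    c_flopAll :: Ptr Word8 -> CInt -> IO CInt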

I hope I'm contributing to this thread with these questions and remarks :)

comment:17 Changed 6 months ago by carter

has anyone done any exploration on this yet?
