Opened 10 years ago

Closed 10 years ago

Last modified 7 years ago

#1138 closed bug (fixed)

The -fexcess-precision flag is ignored if supplied on the command line.

Reported by: dons
Owned by:
Priority: normal
Milestone:
Component: Driver
Version: 6.6
Keywords: numerics, excess-precision
Cc:
Operating System: Unknown/Multiple
Architecture: x86
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by dons)

The numerics/Double-based programs on the great language shootout were performing poorly. Investigations revealed that the -fexcess-precision flag was being silently ignored by GHC when supplied as a command line flag. If it is supplied as a {-# OPTIONS -fexcess-precision #-} pragma, it is respected.

Consider the following shootout entry for the 'mandelbrot' benchmark. It writes the Mandelbrot set to stdout in PBM (portable bitmap, P4) format.

import System
import System.IO
import Foreign
import Foreign.Marshal.Array

main = do
    w <- getArgs >>= readIO . head
    let n      = w `div` 8
        m  = 2 / fromIntegral w
    putStrLn ("P4\n"++show w++" "++show w)
    p <- mallocArray0 n

    unfold n (next_x w m n) p (T 1 0 0 (-1))

unfold :: Int -> (T -> Maybe (Word8,T)) -> Ptr Word8 -> T -> IO ()
unfold !i !f !ptr !x0 = loop x0
  where
    loop !x = go ptr 0 x

    go !p !n !x = case f x of
        Just (w,y) | n /= i -> poke p w >> go (p `plusPtr` 1) (n+1) y
        Nothing             -> hPutBuf stdout ptr i
        _                   -> hPutBuf stdout ptr i >> loop x
{-# NOINLINE unfold #-}

data T = T !Int !Int !Int !Double

next_x !w !iw !bw (T bx x y ci)
    | y  == w   = Nothing
    | bx == bw  = Just (loop_x w x 8 iw ci 0, T 1 0    (y+1)   (iw+ci))
    | otherwise = Just (loop_x w x 8 iw ci 0, T (bx+1) (x+8) y ci)

loop_x !w !x !n !iw !ci !b
    | x < w = if n == 0
                    then b
                    else loop_x w (x+1) (n-1) iw ci (b+b+v)
    | otherwise = b `shiftL` n
  where
    v = fractal 0 0 (fromIntegral x * iw - 1.5) ci 50

fractal :: Double -> Double -> Double -> Double -> Int -> Word8
fractal !r !i !cr !ci !k
    | r2 + i2 > 4 = 0
    | k == 0      = 1
    | otherwise   = fractal (r2-i2+cr) ((r+r)*i+ci) cr ci (k-1)
  where
    (!r2,!i2) = (r*r,i*i)

We can compile and run this as follows:

$ ghc -O -fglasgow-exts -optc-march=pentium4 -fbang-patterns -funbox-strict-fields -optc-O2 -optc-mfpmath=sse -optc-msse2 -fexcess-precision -o m1 mandel3.hs -no-recomp

$ time ./m1 3000 > /dev/null
./m1 3000 > /dev/null  8.12s user 0.00s system 99% cpu 8.143 total

8s is around 3x slower than C (or worse).

Now, if we add the following pragma to the top of the file:

{-# OPTIONS -fexcess-precision #-}

and recompile and rerun:

$ ghc -O -fglasgow-exts -optc-march=pentium4 -fbang-patterns -funbox-strict-fields -optc-O2 -optc-mfpmath=sse -optc-msse2 -fexcess-precision -o m1 mandel3.hs -no-recomp

$ time ./m1 3000 > /dev/null
./m1 3000 > /dev/null  2.94s user 0.00s system 99% cpu 2.945 total

Nearly 3x faster, and competitive with C.

Across the board the -fexcess-precision flag seems to be ignored by GHC, affecting all Double-based entries on the shootout.

A diff on the ghc -v3 output shows that -ffloat-store is still being passed to GCC when -fexcess-precision is supplied on the command line, but not when it is supplied as a pragma.
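
For context, GCC documents -ffloat-store as keeping floating-point variables out of registers, so that no excess precision is carried between operations. A rough way to picture the cost is sketched below in C, using the loop from the C program in comment:1. The function names are made up for illustration, and volatile is only an approximation of what the flag does; this is not the code GHC's C back end emits.

```c
/* Illustration only: `volatile` forces x and y through memory on every
   access, roughly what -ffloat-store asks GCC to do for
   floating-point variables. (Hypothetical helper, not from the ticket.) */
static double loop_via_memory(int n)
{
    volatile double x = 1.0 / 3.0;
    volatile double y = 3.0;
    for (int i = 1; i <= n; i++) {
        x = x * y / 3.0;   /* load x, load y, store x */
        y = x * 9.0;       /* load x, store y */
    }
    return x + y;
}

/* The same loop with ordinary locals, which the compiler is free to
   keep in registers for the whole loop. (Hypothetical helper.) */
static double loop_in_registers(int n)
{
    double x = 1.0 / 3.0;
    double y = 3.0;
    for (int i = 1; i <= n; i++) {
        x = x * y / 3.0;
        y = x * 9.0;
    }
    return x + y;
}
```

Both functions compute approximately 3.333333; calling each with a large n and timing it gives a feel for what the extra loads and stores cost on a given machine.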

Change History (5)

comment:1 Changed 10 years ago by dons

Description: modified (diff)

A smaller example:

import Text.Printf

main = go (1/3) 3 1

go :: Double -> Double -> Int -> IO ()
go !x !y !i
    | i == 100000000 = printf "%f\n" (x+y)
    | otherwise      = go (x*y/3) (x*9) (i+1)

This program, compiled with the following flags:

$ ghc -O -fexcess-precision -fbang-patterns -optc-O -optc-ffast-math -optc-mfpmath=sse -optc-msse2 A.hs -o a

runs in:

$ time ./a
./a  4.23s user 0.01s system 97% cpu 4.350 total

If we then move -fexcess-precision into the file as a pragma and recompile, it runs in:

$ time ./a
./a  0.91s user 0.00s system 99% cpu 0.908 total

Note that asking GCC to generate SSE instructions makes a 10% or better improvement too.

For reference, this C program:

#include <stdio.h>

int main()
{
    double x = 1.0/3.0;
    double y = 3.0;
    int i    = 1;
    for (; i<=100000000; i++) {
        x = x*y/3.0;
        y = x*9.0;
    }
    printf("%f\n", x+y);
    return 0;
}

$ gcc -O3 -ffast-math -mfpmath=sse -msse2 t.c -o a.out -std=c99
$ time ./a.out
./a.out  1.00s user 0.00s system 98% cpu 1.012 total

Which is pretty nice for GHC :-)

But now I wonder: how much of the bad numerics press has been solely due to -fexcess-precision being ignored?

comment:2 Changed 10 years ago by dons

By the way, here's the generated assembly for the core loop of the above 'go' function, which seems to consistently beat gcc:

    movl    16(%ebp), %eax
    cmpl    $1000000000, %eax
    jne .L7
    movl    $r1s7_closure, %esi
    addl    $20, %ebp
    movl    (%ebp), %eax
.L8:
    jmp *%eax
.L7:
    incl    %eax
    movsd   .LC2, %xmm0
    movsd   (%ebp), %xmm1
    mulsd   %xmm0, %xmm1
    movsd   8(%ebp), %xmm0
    mulsd   (%ebp), %xmm0
    mulsd   .LC1, %xmm0
    movl    %eax, 16(%ebp)
    movsd   %xmm1, 8(%ebp)
    movsd   %xmm0, (%ebp)
    movl    $Main_zdwgo_info, %eax
    jmp .L8

Even with the indirect jump!

comment:3 Changed 10 years ago by simonmar

Resolution: fixed
Status: new → closed

Already fixed in both the 6.6 branch and HEAD. This is the patch:

Fri Dec  1 16:41:57 GMT 2006  Simon Marlow <>
  * Ugly hack to fix -fexcess-precision

comment:4 Changed 8 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:5 Changed 7 years ago by simonmar

difficulty: Easy (1 hr) → Easy (less than 1 hour)