Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#10108 closed bug (invalid)

Dramatic slowdown with -O2 bytestream and list streams combined.

Reported by: Truman Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler Version: 7.8.4
Keywords: Cc:
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: break_in.hs
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

In a small testcase intended to filter log files from stdin, I found that using bytestreams and list streams in combination worked fine when compiled without optimization, but runtime performance slowed down by a factor of about 3600 when I compiled with -O2. Changing the import of Data.List.Stream to Data.List gets rid of the problem.

Attachments (3)

break_ins.hs (3.0 KB) - added by Truman 3 years ago.
Source file (described how to create input file too).
ghc_nopt.log (6.6 KB) - added by Truman 3 years ago.
Verbose non-optimized compilation log (-v3 so not too big)
ghc_opt.log (9.6 KB) - added by Truman 3 years ago.
Verbose optimized compilation log (-v3 so not too big)

Download all attachments as: .zip

Change History (9)

Changed 3 years ago by Truman

Attachment: break_ins.hs added

Source file (described how to create input file too).

comment:1 Changed 3 years ago by simonpj

Please can you say exactly which versions of which libraries are required? And the exact command line you used to compile break_ins.hs?

This an awkward one.

On the one hand, ghc -O2 should not produce code 3600 time slower than ghc -O. On the the other hand, the example depends on two big libraries, each of which has lots of rewrite rules that may or may not be right. So there may be nothing wrong with GHC.

It'd be really helpful if someone (e.g. the library maintainers) could investigate, and characterise more clearly what is going on. The more precisely the problem is identified, the more likely that someone will fix it.

Simon

comment:2 Changed 3 years ago by Truman

Sorry about not including the library information. I should have known better.

Here is a log showing compiling non-optimized, running, compiling optimized, and running again, that shows the runtime difference. I shortened the input file (described above) to 5000 lines to speed things up, but the problem is still clear.

The versions of the two seemingly relevant imports are: bytestring-0.10.4.0 stream-fusion 0.1.2.5

The log files detail lots of other information.

I'm running GHC 7.8.4 under Arch Linux 3.19.2-1-ARCH, and I have tested and confirmed this on both 32 and 64 bit machines.

I will attach the two verbose compilation log files.

CPC~/ghc_prob> touch break_ins.hs
CPC~/ghc_prob> ghc -v3 break_ins.hs >ghc_nopt.log 2>&1
CPC~/ghc_prob> time break_ins <break_in.in >bi_nopt.out

real	0m0.077s
user	0m0.077s
sys	0m0.003s

CPC~/ghc_prob> touch break_ins.hs
CPC~/ghc_prob> ghc -v3 -O2 break_ins.hs >ghc_opt.log 2>&1
CPC~/ghc_prob> time break_ins <break_in.in >bi_opt.out

real	0m3.110s
user	0m3.090s
sys	0m0.017s
CPC~/ghc_prob> 
Last edited 3 years ago by simonpj (previous) (diff)

Changed 3 years ago by Truman

Attachment: ghc_nopt.log added

Verbose non-optimized compilation log (-v3 so not too big)

Changed 3 years ago by Truman

Attachment: ghc_opt.log added

Verbose optimized compilation log (-v3 so not too big)

comment:3 Changed 3 years ago by simonpj

Interesting. The actual code in Main never gets very big, which is promising.

Could you add -ddump-rule-firings -ddump-simpl to your command line and give the log files for that?

Simon

comment:4 Changed 3 years ago by rwbarton

Resolution: invalid
Status: newclosed

It's a bug or limitation in the stream-fusion package. This simpler program has the same problem:

import qualified Data.List.Stream as SL

bad_sum :: [Int] -> Int
bad_sum xs =
  if null xs
  then 0
  else head xs + bad_sum (SL.tail xs)

main = print $ bad_sum [1..20000]

The reason is this rewrite rule for SL.tail:

{-# RULES
"tail -> fusible" [~1] forall xs.
    tail xs = unstream (Stream.tail (stream xs))
--"tail -> unfused" [1] forall xs.
--    unstream (Stream.tail (stream xs)) = tail xs
  #-}

Note that the second rule is commented out! So each recursive call adds a layer of something similar to foldr (:) [] to the remainder of the list, and the total time is quadratic.

You might submit a bug report on stream-fusion, though I don't know whether it is still maintained.

comment:5 Changed 3 years ago by rwbarton

(By the way, this was an issue of no optimizations versus -O2 (or -O), not -O versus -O2.)

comment:6 Changed 3 years ago by simonpj

Thanks for investigating, Reid. Very helpful.

Note: See TracTickets for help on using tickets.