Opened 8 years ago

Last modified 6 years ago

#4005 new bug

Bad behaviour in the generational GC with paraffins -O2

Reported by: hwloidl Owned by:
Priority: normal Milestone:
Component: Runtime System Version: 6.12.1 RC1
Keywords: optimisation Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: see attached (compile and run) log
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

paraffinsL.hs seems to give a slowdown of almost a factor of 2 when using -O2 during compilation (also with -O).. I noticed that when tinkering with parallel variants, but happens even just for seq compilation. pL.hs is a cleaned-up, stand-alone version that triggers this behaviour.

I checked this with ghc-6.13 (HEAD as of 1.4.2010) and with ghc-6.12.2rc2 (as of 20.4.2010), with the commands below.

This is on a i386-unknown-linux machine

Attachments (4)

BONZO (10.5 KB) - added by hwloidl 8 years ago.
log of compilation and execution
pL.hs (3.0 KB) - added by hwloidl 8 years ago.
cleaned-up sources of the program triggering the bug
3.hs (1.6 KB) - added by igloo 8 years ago.
slightly smaller source
5.hs (2.8 KB) - added by igloo 8 years ago.

Download all attachments as: .zip

Change History (18)

Changed 8 years ago by hwloidl

Attachment: BONZO added

log of compilation and execution

Changed 8 years ago by hwloidl

Attachment: pL.hs added

cleaned-up sources of the program triggering the bug

comment:1 Changed 8 years ago by igloo

Milestone: 6.12.3
Priority: normalhigh

Looks like GCer time is where the time is going:

No -O:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   44.11s  ( 44.33s elapsed) 
  GC    time    1.30s  (  1.32s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   45.41s  ( 45.65s elapsed)

  %GC time       2.9%  (2.9% elapsed)          

With -O:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   45.39s  ( 45.99s elapsed)
  GC    time   83.81s  ( 83.53s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  129.20s  (129.52s elapsed)

  %GC time      64.9%  (64.5% elapsed)

comment:2 Changed 8 years ago by igloo

By the way, thanks for a great testcase!

comment:3 Changed 8 years ago by simonpj

Owner: set to igloo

But WHY is GC suddenly taking such a long time with -O? Previously perhaps it didn't.

Ian: might you do a space profile and see if you can track down what is happening here? It's a nice small program.

Simon

comment:4 Changed 8 years ago by hwloidl

Note that this program pL.hs is not exactly the paraffins code from the nofib suite. To parallelise it, I tweaked it, mainly using lists rather than arrays. So I'm not sure this is a case of regression.

BTW, the parallelism on this program is quite nice: a parBuffer strategy rolling over a circular list that is generated by the program. Just FYI.

Cheers, --

HW

comment:5 Changed 8 years ago by simonmar

The heap is full of [Radical]. It looks like the optimiser has somehow increased sharing, but I couldn't pinpoint the problem easily because the Core is long and very deeply nested.

comment:6 Changed 8 years ago by simonmar

Oh, I should mention that I tried -fno-cse and -fno-full-laziness, but neither made any difference.

Changed 8 years ago by igloo

Attachment: 3.hs added

slightly smaller source

Changed 8 years ago by igloo

Attachment: 5.hs added

comment:7 Changed 8 years ago by igloo

Slow:

ghc --make pL -fforce-recomp -rtsopts -fenable-rewrite-rules -fno-ignore-interface-pragmas

Fast:

ghc --make pL -fforce-recomp -rtsopts -fno-ignore-interface-pragmas

comment:8 Changed 8 years ago by simonmar

Some more information on this:

  • It appears to be a generational GC effect, in the sense that running the program with +RTS -G1 eliminates the difference between the versions. Apparently when using generational GC, a thunk has been promoted into the old generation, subsequently causing a lot of data to be promoted, despite the fact that it would all be unreachable if we had done a major GC.
  • what is strange is that increasing the nursery size doesn't fix the problem. So apparently this is not just one thunk, but a large supply of long-lived thunks. Perhaps the program is allocating a lot of long-lived thunks at the beginning, and then evaluating them sequentially. So it doesn't matter when we do young-generation collection, we will always promote floating garbage.

Here's a snippet of +RTS -S showing the high promotion rate from the young generation, followed by an old-generation collection reducing the heap size back to around 300k:

   524208       272    322584  0.00  0.00    2.01    4.81    0    0  (Gen:  0)
   524256       288    322600  0.00  0.00    2.01    4.81    0    0  (Gen:  0)
   524256    313528    635832  0.00  0.00    2.01    4.81    0    0  (Gen:  0)
   524248    335488    970432  0.00  0.00    2.01    4.81    0    0  (Gen:  0)
   524208    334688   1304816  0.00  0.00    2.01    4.81    0    0  (Gen:  0)
   524248    293400    322152  0.00  0.00    2.01    4.81    0    0  (Gen:  1)
   524288       296    322152  0.00  0.00    2.01    4.81    0    0  (Gen:  0)

comment:9 Changed 8 years ago by igloo

Owner: igloo deleted
Priority: highnormal

comment:10 Changed 7 years ago by igloo

Milestone: 6.12.36.14.1
Priority: normallow

comment:11 Changed 7 years ago by igloo

Milestone: 7.0.17.0.2

comment:12 Changed 7 years ago by igloo

Milestone: 7.0.27.2.1

comment:13 Changed 6 years ago by igloo

Milestone: 7.2.17.4.1

comment:14 Changed 6 years ago by simonmar

Architecture: x86Unknown/Multiple
Component: CompilerRuntime System
difficulty: Unknown
Milestone: 7.4.1_|_
Operating System: LinuxUnknown/Multiple
Priority: lownormal
Summary: paraffins: slowdown of ca 2 with -O2Bad behaviour in the generational GC with paraffins -O2

We narrowed this down to a bad effect of generational GC. There's no obvious fix, but the test case might come in handy if we implement some automatic tuning in the generational GC sometime, so I'm moving it to _|_.

Note: See TracTickets for help on using tickets.