Opened 12 months ago

Last modified 12 months ago

#9315 new bug

Weird change in allocation numbers of T9203

Reported by: nomeata
Owned by: simonmar
Priority: normal
Component: Runtime System
Version: 7.9
Cc: simonmar
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown

Description (last modified by jrp)

While trying to improve our handle on allocation numbers, I’m stuck on the test case T9203. On some machines it allocates roughly 95747304 bytes (this includes Travis and my laptop); on others, 42946176 bytes (e.g. the machine where I monitor benchmark performance). All of them are 64-bit Linux machines.

The output of -ddump-simpl, -ddump-stg and -ddump-cmm is identical (up to variable names). Even -ddump-asm looks the same, apart from some jump-target reordering. The binary’s run is too short to get a useful heap profile.

I’m a bit stuck here: what could be causing these differences?

(BTW, if you have an up-to-date GHC tree, can you report the number you get? Run make -C testsuite VERBOSE=4 TEST=T9203 for that.)

Attachments (2)

T9203-high-alloc.ticky (12.0 KB) - added by nomeata 12 months ago.
T9203-low-alloc.ticky (12.0 KB) - added by nomeata 12 months ago.

Change History (26)

comment:1 Changed 12 months ago by nomeata

Phabricator and SPJ also get the higher number. If no one else can reproduce 42946176, I guess I’ll change it back. But I still don’t get it.

comment:2 Changed 12 months ago by nomeata

Ah, Austin already reverted it in changeset:4690466de2249fd73600567bd90d462ac26b2d1c/ghc. But sure enough, my benchmark builder fails again...

And it really is a 64-bit executable, in case you think that’s it:

$ file ./ghc-master/testsuite/tests/perf/should_run/T9203
./ghc-master/testsuite/tests/perf/should_run/T9203: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=0x7672054d3dd125700b11710744581a8ae8502766, not stripped

comment:3 Changed 12 months ago by simonpj

The next thing to try is ticky-ticky profiling: compile with -ticky, run with +RTS -rT9203.ticky -RTS, and compare the output.
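
In concrete terms, something like this (a sketch; the output file names are made up, and -rtsopts may be redundant once the ticky RTS is linked in):

$ ghc -O -fforce-recomp -ticky -rtsopts T9203.hs -o T9203
$ ./T9203 +RTS -rT9203.ticky -RTS
$ diff T9203-host1.ticky T9203-host2.ticky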

Simon

comment:4 Changed 12 months ago by nomeata

How did I forget about ticky-ticky? Thanks for the reminder!

The ticky output is identical... so the difference has to be in the libraries. I recompiled them with -ticky, and now I see a difference related to GHC.Fingerprint.fingerprintData1. I’m attaching both ticky reports; maybe someone can spot a pattern.
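
For the record, a sketch of how the libraries can be rebuilt with ticky in a GHC build tree (assuming the make-based build system; how this interacts with the chosen build flavour may differ):

# in mk/build.mk:
GhcLibHcOpts += -ticky

# then rebuild from the top of the tree
$ make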

comment:5 Changed 12 months ago by nomeata

Sigh. Both Foreign.Storable and GHC.Fingerprint have identical core (with -dsuppress-uniques). So does Data.Typeable.Internal, where the ticky diffs show the same number of calls to base:Data.Typeable.Internal.$fTypeableks_$ctypeRep# but a different allocation count... hah! But only when I copy the ghc invocation from one host to the other, not when I run make there, which for some reason passes -O2 when compiling Data.Typeable.Internal.

So the problem is as follows: running validate, as well as using the usual devel2 settings in mk/build.mk, sets GhcLibHcOpts += -O -dcore-lint, while the default in the absence of any customization is -O2 (GhcLibHcOpts=-O2 in mk/config.mk.in). And sometimes that makes a difference...
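
For reference, the two conflicting settings look roughly like this (paraphrased; the exact context in the build-system files may differ):

# mk/build.mk with the validate or devel2 settings: libraries built with -O
GhcLibHcOpts += -O -dcore-lint

# default from mk/config.mk.in, used when nothing overrides it: libraries built with -O2
GhcLibHcOpts = -O2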

Clearly, this needs to be consolidated. What are the settings used to actually build the releases? -O or -O2? The appropriate one needs to be used in validate, otherwise these tests are not very useful. And then people will have to use the same flags in their mk/build.mk or learn to ignore benchmark results from such a tree.

Last edited 12 months ago by nomeata (previous) (diff)

comment:6 Changed 12 months ago by simonpj

As well as your last para, you seem to have found that compiling some particular module (is it Data.Typeable.Internal?) with -O2 instead of -O1 makes a factor-of-2 difference in allocation, for at least this test. Is that right? If so, it'd be worth knowing why, and perhaps putting {-# OPTIONS_GHC -O2 #-} with a clear comment in that file.
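
That is, something along these lines (a sketch of the suggestion, not the module's actual header):

{-# OPTIONS_GHC -O2 #-}
-- Force -O2 for this module regardless of GhcLibHcOpts: building it with
-- plain -O roughly doubles allocation in callers of typeRep# (see #9315).
module Data.Typeable.Internal where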

Simon

comment:7 Changed 12 months ago by nomeata

Right, but that’s a separate issue from getting reproducible test cases, which is what I care about at the moment. The factor of two has so far only turned up in a pathetic test case, hasn’t it?

comment:8 Changed 12 months ago by jrp

  • Description modified (diff)

comment:9 Changed 12 months ago by jrp

I get

=====> T9203(normal) 2035 of 4040 [0, 0, 0] 
cd ./perf/should_run && '/Users/jrp/Projects/ghc/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -o T9203 T9203.hs  -O2  >T9203.comp.stderr 2>&1
cd ./perf/should_run && ./T9203  +RTS -V0 -tT9203.stats --machine-readable -RTS   </dev/null >T9203.run.stdout 2>T9203.run.stderr
bytes allocated value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    bytes allocated:  95747304 +/-5%
    Lower bound bytes allocated:  90959938 
    Upper bound bytes allocated: 100534670 
    Actual      bytes allocated:  42946176 
*** unexpected failure for T9203(normal)

with OS X, BuildFlavour = perf and InstallExtraPackages=YES. There are also a few other performance-related failures:

Unexpected results from:
TEST="T5611 T9203 T5435_dyn_asm haddock.Cabal T4801 T6048"

OVERALL SUMMARY for test run started at Sun Jul 13 17:28:27 2014 BST
 2:42:49 spent to go through
    4040 total tests, which gave rise to
   19417 test cases, of which
   15712 were skipped

      28 had missing libraries
    3629 expected passes
      42 expected failures

       0 caused framework failures
       0 unexpected passes
       6 unexpected failures

Unexpected failures:
   concurrent/should_run  T5611 [bad stderr] (normal)
   perf/compiler          T4801 [stat too good] (normal)
   perf/compiler          T6048 [stat not good enough] (optasm)
   perf/haddock           haddock.Cabal [stat not good enough] (normal)
   perf/should_run        T9203 [stat too good] (normal)
   rts                    T5435_dyn_asm [bad stdout] (normal)

comment:10 Changed 12 months ago by simonmar

The perf tests aren't guaranteed to pass unless you use the validate settings; I think that's unavoidable. Maybe it's possible to just skip them unless the correct settings are being used.

We don't use -O2 in validate because validate is supposed to be fast. But that doesn't render the perf tests useless; they're just testing -O rather than -O2 (and individual tests can use -O2 if we want).
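
For instance, this test already passes -O2 per-test (cf. the compile line in comment 9); from memory, its all.T entry looks something like the following, though the exact driver syntax may differ:

test('T9203',
     [stats_num_field('bytes allocated', [(wordsize(64), 95747304, 5)])],
     compile_and_run,
     ['-O2'])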

Should we force -O2 for Data.Typeable.Internal? A tricky one. I generally don't like forcing optimisation settings for individual modules like this, because it prevents us from seeing the real effect of global optimisation settings.

comment:11 follow-up: ↓ 12 Changed 12 months ago by nomeata

The perf tests aren't guaranteed to pass unless you use the validate settings; I think that's unavoidable.

True; but surely we would want to validate the settings used for releases? Otherwise there might be a regression affecting our releases, and we won’t notice...

comment:12 in reply to: ↑ 11 Changed 12 months ago by simonmar

Replying to nomeata:

The perf tests aren't guaranteed to pass unless you use the validate settings; I think that's unavoidable.

True; but surely we would want to validate the settings used for releases? Otherwise there might be a regression affecting our releases, and we won’t notice...

There are *lots* of things that validate doesn't test for, and lots of regressions that it won't catch. It's a compromise between coverage and speed, deliberately. For instance, it doesn't test profiling at all - yet that's an important feature of our releases. We need to make the full testsuite runs more visible to developers, and focus on fixing regressions that they find.

I suspect that adding -O2 to libraries and compiler (we would have to do both) would make validate slower, but I don't know how much; feel free to measure it. I do think validate is too slow already, though, and we should be finding ways to speed it up.

The full testsuite run should use -O2 (I think I used to do that when I did the nightly builds), and the perf tests should have a separate set of figures for when libraries are built with -O2 where necessary. Probably only this one test will need to have different results.

comment:13 Changed 12 months ago by jrp

Running with the validate settings (i.e., running ./validate), all the performance tests pass other than:

=====> T4801(normal) 1987 of 4045 [0, 0, 0] 
cd ./perf/compiler && '/Users/jrp/Projects/haskell/ghc/inplace/bin/ghc-stage2' -fforce-recomp -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T4801.hs   +RTS -V0 -tT4801.comp.stats --machine-readable -RTS -static >T4801.comp.stderr 2>&1
max_bytes_used value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    T4801(normal) max_bytes_used: 25145320 +/-5%
    Lower bound T4801(normal) max_bytes_used: 23888054 
    Upper bound T4801(normal) max_bytes_used: 26402586 
    Actual      T4801(normal) max_bytes_used: 23720120 
    Deviation   T4801(normal) max_bytes_used:     -5.7 %
peak_megabytes_allocated value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    T4801(normal) peak_megabytes_allocated: 70 +/-1%
    Lower bound T4801(normal) peak_megabytes_allocated: 69 
    Upper bound T4801(normal) peak_megabytes_allocated: 71 
    Actual      T4801(normal) peak_megabytes_allocated: 68 
    Deviation   T4801(normal) peak_megabytes_allocated: -2.9 %
bytes allocated value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    T4801(normal) bytes allocated: 464872776 +/-5%
    Lower bound T4801(normal) bytes allocated: 441629137 
    Upper bound T4801(normal) bytes allocated: 488116415 
    Actual      T4801(normal) bytes allocated: 415645496 
    Deviation   T4801(normal) bytes allocated:     -10.6 %
*** unexpected failure for T4801(normal)

This is on OS X at master HEAD.

comment:14 Changed 12 months ago by nomeata

I suspect that adding -O2 to libraries and compiler (we would have to do both) would make validate slower, but I don't know how much; feel free to measure it. I do think validate is too slow already, though, and we should be finding ways to speed it up.

I changed the builder for ghcspeed to set Validating=YES, and the time to run make was reduced by ¼. I didn’t measure the time for a whole validate run; I guess I’ll do that separately.

But setting that changes more than just GhcLibHcOpts, e.g. it disables split objects (increasing binary sizes by a factor of 3.5).

Allocation numbers of the nofib benchmarks wobble a bit, but significant changes are only on the increasing side (fish, +5%), so for nofib the release settings seem to be better than the validate settings. (Good!)

The allocation number of #4801’s test case actually decreases by 4.7%. In total, 7 test cases now get *better* allocation numbers, while only haddock.base degrades by more than 1%.

comment:15 Changed 12 months ago by nomeata

I just timed two plain ./validate runs. The first with the current settings (-O):

$ cat /tmp/validate-O.txt 
Command exited with non-zero status 1
3573.46user 204.47system 36:13.68elapsed 173%CPU (0avgtext+0avgdata 1534340maxresident)k
76944inputs+17408472outputs (255major+249188896minor)pagefaults 0swaps

The second one with GhcLibHcOpts += -O2 -dcore-lint in mk/validate-settings.mk:

Command exited with non-zero status 1
3675.64user 209.70system 37:10.83elapsed 174%CPU (0avgtext+0avgdata 1535452maxresident)k
1048inputs+18010496outputs (18major+249644503minor)pagefaults 0swaps

Yes, it does take longer. But hardly significantly.

Hence I’m in favour of setting GhcLibHcOpts += -O2 for validate runs, bringing our validation closer to what we release, and also making make -C testsuite in a non-validate, non-customized tree yield the same performance numbers (at least for the tests that don’t measure compiler performance).

comment:16 Changed 12 months ago by jrp

It may also help to run validate with a default of -j8 or some such. The tests and compilations are all more or less single-core, and most developers will have at least 8 cores.

comment:17 Changed 12 months ago by nomeata

The current default is -j2; I guess it could be updated. But this is digressing from what this ticket is about (making the validate settings and the default settings agree more, if possible).

comment:18 Changed 12 months ago by jrp

Understood, but I thought that there was some hesitation about using more expensive validate settings because of the running time.

comment:19 Changed 12 months ago by nomeata

Of course improving validate time is desirable; I filed your suggestion as #9341.

comment:20 Changed 12 months ago by simonmar

If we change validate to use GhcLibHcOpts=-O2 and update the tests to match, then

  • validate takes a minute longer
  • the tests now fail with a quick build (which we recommend for development, and which I generally use)
  • the compiler is still not built with -O2

So I'm not convinced this is a win.

BTW, on the question of -j setting, the usual way is to say CPUS=8 ./validate if you have an 8-core machine.

comment:21 follow-up: ↓ 23 Changed 12 months ago by nomeata

validate takes a minute longer

Out of more than half an hour. It’s not as if validate were fast enough that people could actually sit and watch it. And no matter whether it’s 36 or 37 minutes, I guess most people will stop running validate themselves and rather leave that to Phabricator or other CI infrastructure, precisely because validate takes so long and it’s convenient to let someone else do it for you.

the tests now fail with a quick build (which we recommend for development, and I generally use)

True. But development trees are inherently unusable for this, as people might have all kinds of settings there (-O, -DDEBUG etc.) that invalidate performance numbers. I never paid attention to their numbers for that reason, unless I was doing a clean validate run in a separate checkout.

Isn’t running validate in a development tree discouraged anyway?

the compiler is still not built with -O2

True, but one step at a time. (I’ll time this change later, probably over lunch.)

You say quick is recommended; I thought devel2 was recommended, and that’s what I use and recommend. And there will be much more variation in the settings of people’s development trees; it’s hard to aim for them. What we can aim for are our two well-defined configurations: the default (i.e. what we release) and validate. I’m arguing that we should unify these two as much as possible.

comment:22 Changed 12 months ago by nomeata

the compiler is still not built with -O2

True, but one step at a time. (I’ll time this change later, probably over lunch.)

4311.12user 223.56system 42:58.26elapsed 175%CPU (0avgtext+0avgdata 1557124maxresident)k
256824inputs+18215896outputs (472major+251942192minor)pagefaults 0swaps

Another +5 minutes, so I won’t argue for changing that at this point.

comment:23 in reply to: ↑ 21 Changed 12 months ago by simonmar

Replying to nomeata:

validate takes a minute longer

Out of more than half an hour. It’s not as if validate were fast enough that people could actually sit and watch it. And no matter whether it’s 36 or 37 minutes, I guess most people will stop running validate themselves and rather leave that to Phabricator or other CI infrastructure, precisely because validate takes so long and it’s convenient to let someone else do it for you.

But then where do you stop? By your arguments we should be making validate run the full testsuite, which takes 3 hours or so (IIRC). When we first made validate it took 15 minutes; now it takes more than double that, and we arrived here not by huge leaps but by adding one minute at a time. So every time I see this happening, I like to question (vigorously!) whether that extra minute is paying for itself. In this case I'm not at all convinced.

The point of validate is to catch bugs that would block other developers if they got committed: build breakage or serious bugs. Everything else we catch with the nightly builds. If there's a feeling that we should change this, then let's take that discussion to the mailing list. It's not at all clear-cut (which is why we're having this discussion!).

What I'd like to see instead is one of the following:

  • Skip the test unless the libraries were built with -O2
  • Skip the test *if* the libraries were built with -O2
  • Have different results according to the optimisation levels

All of which fix the perceived problem without introducing the regression in validate time.

comment:24 Changed 12 months ago by nomeata

I tried to bring it up on the mailing list, but nobody cared...

What I'd like to see instead is one of the following:

  • Skip the test unless the libraries were built with -O2
  • Skip the test *if* the libraries were built with -O2
  • Have different results according to the optimisation levels

All of which fix the perceived problem without introducing the regression in validate time.

While the third option is the most complete, I don’t think it is maintainable.

Given that the aim of validate is to prevent breakage, and performance regressions are not breakage, I think it is OK to skip these tests there (that even makes validate faster!) and run them only when we are doing a release build.

So how about this:

  • The expected numbers are adjusted to reflect the release settings.
  • validate in the fast and normal modes passes SKIP_PERF_TESTS to the test suite (see the sketch at the end of this comment).
  • validate in slow mode uses the release settings (-O2 for the libraries and stage2) and does not set SKIP_PERF_TESTS.
  • Optionally: the test suite, when invoked manually, ignores performance tests if the flags are not as expected¹

Note that the test suite already skips compiler performance tests if -DDEBUG is set, so this has precedent.

¹ Is that possible in a reliable and not-too-complicated way, and without duplicating the “expected flags”? Maybe a warning along the lines of “You have a custom mk/build.mk; failures in performance tests might be irrelevant.” would be enough.
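
For the second bullet, the invocation would amount to something like this (hypothetical, assuming the test-suite driver accepts a SKIP_PERF_TESTS switch as proposed):

$ make -C testsuite SKIP_PERF_TESTS=YES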
