Opened 2 years ago

Last modified 3 weeks ago

#12758 new task

Bring sanity to our performance testsuite

Reported by: bgamari Owned by:
Priority: high Milestone: 8.8.1
Component: Test Suite Version: 8.0.1
Keywords: Cc: nomeata, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #15936 Differential Rev(s): Phab:D3758, Phab:D5059
Wiki Page:

Description (last modified by bgamari)

The GHC testsuite's performance tests tends to be a constant source of busy-work as it requires that contributors manually bump performance numbers and propagate these changes across platforms. Moreover, the testsuite is poor at noticing performance regressions due to false positive failures (due to spurious environmental differences) and false negatives (due to the rather generous acceptance windows that many tests have).

Joachim and I briefly discussed this at Hac Phi and came up with the following proposal:

  • We rip expected performance numbers out of the .T files
  • We replace the existing compiler_stats_num_field test modifier with an implementation that dumps performance metrics to a CSV file
  • Introduce a tool to compare this CSV file to metrics associated with a ancestor commit via git notes
  • The tool would also be able to add notes

Change History (22)

comment:1 Changed 2 years ago by bgamari

Description: modified (diff)

comment:2 Changed 2 years ago by bgamari

Cc: nomeata added

comment:3 Changed 2 years ago by simonpj

I'm all for eliminating busy-work, but my experience is that we have too few, not too many, performance tests. Mostly, when a perf-test fails, there really is something to investigate, annoying though I often find it.

The exception is max_bytes_used which wobbles around unpredicatably.

I'm not sure how your new approach would make things better. Would you like to elaborate?

comment:4 Changed 2 years ago by michalt

Cc: michalt added

comment:5 Changed 2 years ago by Ben Gamari <ben@…>

In 2940a617/ghc:

testsuite: Specify expected allocations of T12877 for Windows

This deviated by 12% from the expected allocations on Windows.
Yet another case of #12758.

comment:6 Changed 2 years ago by Ben Gamari <ben@…>

In d398162/ghc:

testsuite: Separate out Windows results for T5205

This test seems to have much different allocation behavior on Windows
and Linux.  Previously we had widened the acceptance window to 7% to
accomodate this, but even this isn't enough any more. Instead of further
widening the window let's just give an expected number for each
platform. Really, this is precisely the issue with our performance
testing model which I've been complaining about in #12758.

Fixes test for #5205 on 64-bit Windows.

Test Plan: Validate on Windows

Reviewers: austin

Subscribers: thomie, Phyx

Differential Revision: https://phabricator.haskell.org/D2848

GHC Trac Issues: #5205

comment:7 Changed 23 months ago by bgamari

Milestone: 8.2.18.4.1

There is general agreement that we want to do something along these lines. However, it will probably only happen for 8.4.

comment:8 Changed 19 months ago by Ben Gamari <ben@…>

In 5e940bd/ghc:

Switched out optparse for argparse in runtests.py

Tangentially related to my prior work on trac ticket #12758.

Signed-off-by: Jared Weakly <jweakly@pdx.edu>

Reviewers: austin, bgamari

Subscribers: rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D3792

comment:9 Changed 13 months ago by bgamari

Milestone: 8.4.18.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:10 Changed 13 months ago by bgamari

A quick update: Last summer Jared Weakly implemented the proposal of this ticket. The result is Phab:D3758. At the time the plan was to switch to Jenkins in the fall (see #13716) and incorporate his work at that point since it requires a bit of integration with CI. However, shortly before ICFP we started a discussion which culminated in the decision to instead move CI to CircleCI. This process is still on-going (see wiki:ContinuousIntegration) but is nearly an end, at which point we will be able to move ahead with Phab:D3758.

comment:11 Changed 13 months ago by bgamari

Priority: normalhigh

comment:12 Changed 12 months ago by Ben Gamari <ben@…>

In 81a5e05/ghc:

circleci: Skip performance tests

Once we finally get the automation for #12758 we can re-enable these.

comment:13 Changed 12 months ago by bgamari

Differential Rev(s): Phab:D3758

comment:14 Changed 8 months ago by bgamari

Milestone: 8.6.18.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:15 Changed 6 months ago by domenkozar

Some insight how I've implemented a similar setting for Snabb (open source high performance networking drivers).

Using Nix, we build a matrix of Snabb (multiple git branches, different versions of software dependencies like Qemu, etc) and plot all of those graphs.

There is no smart automation to abort of fail for contributions, but if you run these tests before releases - and even as a cronjob every X days, you can spot regressions easily and then bisect the commit.

Nix tags what derivations are benchmarks and those run on specialized hardware 1 at the time.

Example report: https://hydra.snabb.co/build/3641949/download/2/report.html

comment:16 Changed 6 months ago by davide

In the case of expected changes in performance metrics, we still need the tests to pass. This requires some convenient mechanism by which the author can explicitly inform the CI of that change. The current plan is for the author to indicate such changes in the commit message in the following form:

Metric  (Increase | Decrease)  ['metric' | \['metrics',..\]]  [\((test_env|way)='abc',...\)]: TestName01, TestName02, ...

e.g.

# All metrics decreased for test T1234.
Metric Decrease: T1234

# max_bytes_used decreased for test T1234.
Metric Decrease 'max_bytes_used': T1234

# bytes allocated and peak_megabytes_allocated increased for tests T005 and T006.
Metric Increase ['bytes allocated', 'peak_megabytes_allocated']:
    T005,
    T00

# All metrics decreased for test T200 in the x86_linux environment
Metric Decrease (test_env = 'x86_linux'): T200

comment:17 Changed 6 months ago by osa1

Differential Rev(s): Phab:D3758Phab:D3758, Phab:D5059

comment:18 Changed 5 months ago by davide

I've added a wiki page for the proposed change: https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests.

comment:19 Changed 3 months ago by Ben Gamari <ben@…>

In 932cd41d/ghc:

testsuite: Save performance metrics in git notes.

This patch makes the following improvement:
  - Automatically records test metrics (per test environment) so that
    the programmer need not supply nor update expected values in *.T
    files.
    - On expected metric changes, the programmer need only indicate the
      direction of change in the git commit message.
  - Provides a simple python tool "perf_notes.py" to compare metrics
    over time.

Issues:
  - Using just the previous commit allows performance to drift with each
    commit.
    - Currently we allow drift as we have a preference for minimizing
      false positives.
    - Some possible alternatives include:
      - Use metrics from a fixed commit per test: the last commit that
        allowed a change in performance (else the oldest metric)
      - Or use some sort of aggregate since the last commit that allowed
        a change in performance (else all available metrics)
      - These alternatives may result in a performance issue (with the
        test driver) having to heavily search git commits/notes.
  - Run locally, performance tests will trivially pass unless the tests
    were run locally on the previous commit. This is often not the case
    e.g.  after pulling recent changes.

Previously, *.T files contain statements such as:
```
stats_num_field('peak_megabytes_allocated', (2, 1))
compiler_stats_num_field('bytes allocated',
                         [(wordsize(64), 165890392, 10)])
```
This required the programmer to give the expected values and a tolerance
deviation (percentage). With this patch, the above statements are
replaced with:
```
collect_stats('peak_megabytes_allocated', 5)
collect_compiler_stats('bytes allocated', 10)
```
So that programmer must only enter which metrics to test and a tolerance
deviation. No expected value is required. CircleCI will then run the
tests per test environment and record the metrics to a git note for that
commit and push them to the git.haskell.org ghc repo. Metrics will be
compared to the previous commit. If they are different by the tolerance
deviation from the *.T file, then the corresponding test will fail. By
adding to the git commit message e.g.
```
 # Metric (In|De)crease <metric(s)> <options>: <tests>
Metric Increase ['bytes allocated', 'peak_megabytes_allocated'] \
         (test_env='linux_x86', way='default'):
    Test012, Test345
Metric Decrease 'bytes allocated':
    Test678
Metric Increase:
    Test711
```
This will allow the noted changes (letting the test pass). Note that by
omitting metrics or options, the change will apply to all possible
metrics/options (i.e. in the above, an increase for all metrics in all
test environments is allowed for Test711)

phabricator will use the message in the description

Reviewers: bgamari, hvr

Reviewed By: bgamari

Subscribers: rwbarton, carter

GHC Trac Issues: #12758

Differential Revision: https://phabricator.haskell.org/D5059

comment:20 Changed 7 weeks ago by osa1

What needs to be done here? An updated summary would be helpful.

comment:21 Changed 4 weeks ago by bgamari

comment:22 Changed 3 weeks ago by davide

I've been working on some changes recently, which is currently under review here !22. To Summarize:

  • Gitlab CI pushes performance results to the GHC Performance Notes repo.
    • This is important for logging performance metrics.
  • When running performance tests, establish baseline (expected) performance values from older commits (not just the previous one) and even use performance numbers pushed from gitlab CI.
    • This should greatly improve usability: for many common workflows this will succeed in establishing a baseline where the previous method would fail.
Note: See TracTickets for help on using tickets.