I could not find the cause immediately, so I’ll report my partial findings here.
It is not the IO part. Inlining the input as one big string, and skipping the output, does not make the symptom disappear.
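Roughly what the no-IO variant looks like, as a sketch (`processContents` and `inputContents` are placeholders, not the actual benchmark code):

```haskell
module Main (main) where

-- Stand-in for the body of cacheprof's doFile, treated as a pure
-- String -> String function for this experiment (placeholder, not the
-- real code).
processContents :: String -> String
processContents = id

-- The benchmark input, inlined as one big string constant instead of
-- being read from a file.
inputContents :: String
inputContents = "..."

main :: IO ()
main =
  -- Walk the whole result with length instead of printing it, so
  -- neither the read nor the write side of the IO is involved.
  length (processContents inputContents) `seq` return ()
```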
Ticky-ticky profiling yields identical results for the cacheprof modules; the only difference is in the global counters (HEAP_CHK_ctr and STK_CHK_ctr, to be precise).
The cause does not seem to be tied to a particular part of the code. Luckily, doFile has a very plain pipeline. The symptoms disappear when I stop after computing with_ccs and appear if I go on until with_synth_2. But if I read with_ccs from a file instead of computing it and only do the synth_2 call, the result also becomes deterministic. Tricky.
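For orientation, a sketch of the pipeline shape I mean; only doFile, with_ccs, with_synth_2 and synth_2 are names from the benchmark, everything else below is a placeholder:

```haskell
-- Sketch of doFile's pipeline with the two cut points mentioned above;
-- the stage bodies are placeholders, not the real cacheprof passes.
doFileSketch :: String -> String
doFileSketch input =
  let with_ccs     = stage_ccs input   -- stopping here: deterministic
      with_synth_2 = synth_2 with_ccs  -- going on to here: non-deterministic
  in  render_output with_synth_2
  where
    stage_ccs, synth_2, render_output :: String -> String
    stage_ccs     = id  -- placeholder for the earlier passes
    synth_2       = id  -- placeholder for the real synth_2 pass
    render_output = id  -- placeholder for the later stages
```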
Serialising and deserialising the result of with_ccs gets rid of the non-determinism in total allocations AFAICT (Read/Show is pretty slow, so I only did ~20 iterations), but the average residency/number of samples still seems to vary by one (23 vs. 24). Not sure if that's important.
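The round trip is nothing fancy, roughly this (a sketch; dumpWithCcs/loadWithCcs are made-up names, and the concrete type of with_ccs is whatever cacheprof uses there):

```haskell
-- Dump the intermediate result once, then read it back on later runs, so
-- the computation of with_ccs itself is taken out of the picture.
dumpWithCcs :: Show a => FilePath -> a -> IO ()
dumpWithCcs path x = writeFile path (show x)

loadWithCcs :: Read a => FilePath -> IO a
loadWithCcs path = fmap read (readFile path)
```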
After adding a surgical seq on final, where final = final_cleanup with_synth_2, the non-determinism seems irreproducible. Still, the number of samples (and, consequently, the average residency) varies between 10 and 11.
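The seq in question looks roughly like this (placement assumed; forceFinal and the placeholder bodies are made up, only final_cleanup and with_synth_2 are names from the benchmark):

```haskell
-- Force final to WHNF as soon as it is defined, before the rest of
-- doFile gets to run ("surgical" in the sense that only this one point
-- is forced).
forceFinal :: String -> String
forceFinal with_synth_2 =
  let final = final_cleanup with_synth_2
  in  final `seq` rest_of_doFile final
  where
    final_cleanup, rest_of_doFile :: String -> String
    final_cleanup  = id  -- placeholder for the real final_cleanup
    rest_of_doFile = id  -- placeholder for whatever doFile does afterwards
```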
In the default build configuration with -O2, allocations are completely stable. Maximum residency is unstable, though.
The instability in max residency goes away if I add similar surgical seqs as above.
In -O0, I even get unstable allocations on Windows, with the samples spread wider than on Linux the other day.
In -O0 -prof -fprof-auto there's still an occasional, minimal instability of a 40 byte delta.
Judging from +RTS -S output, the culprit seems to be an erroneous live set. Comparing two reports from different runs, there was always a point at which the live bytes differed, and only much later would total bytes allocated differ. I had a sample where the difference was 1488 bytes in (minor) collection no. 11. These allocations seem to be kept live until the end of the program, and more live data is added as the program runs.
Just a few notes while it's still fresh in my head. I always compared two traces via rr.
Based on my observation that bytes copied was the first metric to diverge, I did some printf debugging in the RTS.
I noticed that the first divergence comes from copying additional black holes. After evacuating the black hole, trace B resumes 'normal' operation, i.e. does the same thing trace A would do instead of evacuating the black hole.
The black holes are evacuated when the stack is scavenged. They're pointed to by update frames, all of which originally pointed to stg_sel_0_upd selector thunks.
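(For context, the kind of source code that gives rise to such selector thunks is a lazy projection out of a constructor; the snippet below is an illustration only, not a claim about where cacheprof's thunks come from.)

```haskell
-- Illustration only: a lazily bound pattern projection is typically
-- compiled to a selector thunk (stg_sel_0_upd selects field 0), at
-- least without optimisation.
firstOfLazyPair :: (Int, Int) -> Int
firstOfLazyPair p =
  let (a, _b) = p  -- 'a' becomes a selector thunk over p until p is forced
  in  a
```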
I also noticed that the mut_list_size output of +RTS -Dg varies between the two traces, and it's never because of any mutable primitives (e.g., arrays, MVars, etc.).
I'm not sure what pushes (or fails to push) the update frames. Maybe it has to do with ThreadPaused.c:stackSqueeze()? Maybe this is related to #4450 (closed)?
Indeed, never making the call to ThreadPaused.c:stackSqueeze() gets rid of the non-determinism.
So does +RTS -V0 (which disables the master tick timer that triggers context switches, if I understand correctly), as proposed in #4450 (closed), and so does +RTS -Z (which simply deactivates stack squeezing completely).
The latter seems like an easy fix for this particular benchmark, but this will probably come up again in the future. Maybe we can teach the RTS not to do stack squeezing during context switches?
CCing simonmar for input, as I'm not particularly familiar with RTS internals.