I could not find the cause immediately, so I’ll report my partial findings here.
It is not the IO part. Inlining the input as one big string, and skipping the output, does not make the symptom disappear.
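Roughly what the no-IO variant looks like, as a sketch (`processContents` and `inputContents` are placeholders, not the actual benchmark code):

```haskell
module Main (main) where

-- Stand-in for the body of cacheprof's doFile, treated as a pure
-- String -> String function for this experiment (placeholder, not the
-- real code).
processContents :: String -> String
processContents = id

-- The benchmark input, inlined as one big string constant instead of
-- being read from a file.
inputContents :: String
inputContents = "..."

main :: IO ()
main =
  -- Walk the whole result with length instead of printing it, so
  -- neither the read nor the write side of the IO is involved.
  length (processContents inputContents) `seq` return ()
```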
Ticky-ticky profiling yields identical results for the cacheprof modules; the only difference is in the global counters (HEAP_CHK_ctr and STK_CHK_ctr, to be precise).
The cause does not seem to be tied to a particular part of the code. Luckily, doFile has a very plain pipeline. The symptoms disappear when I stop after computing with_ccs and appear if I go on until with_synth_2. But if I read with_ccs from a file instead of computing it and only do the synth_2 call, the result also becomes deterministic. Tricky.
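For orientation, a sketch of the pipeline shape I mean; only doFile, with_ccs, with_synth_2 and synth_2 are names from the benchmark, everything else below is a placeholder:

```haskell
-- Sketch of doFile's pipeline with the two cut points mentioned above;
-- the stage bodies are placeholders, not the real cacheprof passes.
doFileSketch :: String -> String
doFileSketch input =
  let with_ccs     = stage_ccs input   -- stopping here: deterministic
      with_synth_2 = synth_2 with_ccs  -- going on to here: non-deterministic
  in  render_output with_synth_2
  where
    stage_ccs, synth_2, render_output :: String -> String
    stage_ccs     = id  -- placeholder for the earlier passes
    synth_2       = id  -- placeholder for the real synth_2 pass
    render_output = id  -- placeholder for the later stages
```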
Serialising and deserialising the result of with_ccs gets rid of the non-determinism in total allocations AFAICT (Read/Show is pretty slow, so I only did ~20 iterations), but the average residency/number of samples still seems to vary by one (23 vs. 24). Not sure if that's important.
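The round trip is nothing fancy, roughly this (a sketch; dumpWithCcs/loadWithCcs are made-up names, and the concrete type of with_ccs is whatever cacheprof uses there):

```haskell
-- Dump the intermediate result once, then read it back on later runs, so
-- the computation of with_ccs itself is taken out of the picture.
dumpWithCcs :: Show a => FilePath -> a -> IO ()
dumpWithCcs path x = writeFile path (show x)

loadWithCcs :: Read a => FilePath -> IO a
loadWithCcs path = fmap read (readFile path)
```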
After adding a surgical seq on final, where final = final_cleanup with_synth_2, the non-determinism seems irreproducible. Still, the number of samples (and, consequently, the average residency) varies between 10 and 11.
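The seq in question looks roughly like this (placement assumed; forceFinal and the placeholder bodies are made up, only final_cleanup and with_synth_2 are names from the benchmark):

```haskell
-- Force final to WHNF as soon as it is defined, before the rest of
-- doFile gets to run ("surgical" in the sense that only this one point
-- is forced).
forceFinal :: String -> String
forceFinal with_synth_2 =
  let final = final_cleanup with_synth_2
  in  final `seq` rest_of_doFile final
  where
    final_cleanup, rest_of_doFile :: String -> String
    final_cleanup  = id  -- placeholder for the real final_cleanup
    rest_of_doFile = id  -- placeholder for whatever doFile does afterwards
```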
In the default build configuration with -O2, allocations are completely stable. Maximum residency is unstable, though.
The instability in max residency goes away if I add similar surgical seqs as above.
In -O0, I even get unstable allocations on Windows, with the samples spread wider than on Linux the other day.
In -O0 -prof -fprof-auto there's still an occasional, minimal instability of a 40 byte delta.
Judging from +RTS -S output, the culprit seems to be an erroneous live set. Comparing two reports from different runs, there was always a point at which the live bytes differed, and only much later would total bytes allocated differ. I had a sample where the difference was 1488 bytes in (minor) collection no. 11. These allocations seem to be kept live until the end of the program, and more live data is added as the program runs.
Just a few notes while it's still fresh in my head. I always compared two traces via rr.
Based on my observation that bytes copied was the first metric to diverge, I did some printf debugging in the RTS.
I noticed that the first divergence comes from copying additional black holes. After evacuating the black hole, trace B resumes 'normal' operation, i.e. does the same thing trace A would do instead of evacuating the black hole.
The black holes are evacuated when the stack is scavenged. They're pointed to by update frames, all of which originally pointed to stg_sel_0_upd selector thunks.
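(For context, the kind of source code that gives rise to such selector thunks is a lazy projection out of a constructor; the snippet below is an illustration only, not a claim about where cacheprof's thunks come from.)

```haskell
-- Illustration only: a lazily bound pattern projection is typically
-- compiled to a selector thunk (stg_sel_0_upd selects field 0), at
-- least without optimisation.
firstOfLazyPair :: (Int, Int) -> Int
firstOfLazyPair p =
  let (a, _b) = p  -- 'a' becomes a selector thunk over p until p is forced
  in  a
```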
I also noticed that the mut_list_size output of +RTS -Dg varies between the two traces, and it's never because of any mutable primitives (e.g., arrays, MVars, etc.).
I'm not sure what pushes (or fails to push) the update frames. Maybe it has to do with ThreadPaused.c:stackSqueeze()? Maybe this is related to #4450 (closed)?
Indeed, never making the call to ThreadPaused.c:stackSqueeze() gets rid of the non-determinism.
So does +RTS -V0 (which disables the master tick timer that triggers context switches, if I understand correctly), as proposed in #4450 (closed), and so does +RTS -Z (which simply deactivates stack squeezing completely).
The latter seems like an easy fix for this particular benchmark, but this will probably come up again in the future. Maybe we can teach the RTS not to do stack squeezing during context switches?
CCing simonmar for input, as I'm not particularly familiar with RTS internals.