Mechanism for checking that we only enter single-entry thunks once

changed weight to 5

added Tfeature request Trac import labels

Simonmar, could you confirm that there are no blatant inaccuracies in this description? If not, do you have an opinion on which approach is preferable?

For what it's worth, while investigating #10414 (closed), I ran the whole test suite with -feager-blackholing, which (at the time) caused all single-entry thunks to be overwritten with a black hole on entry. There were no test failures, which gives me a reasonably high degree of confidence that the "single-entry thunks are entered only once" invariant is maintained, outside of the situations involving multiple threads entering a closure simultaneously discussed in #10414 (closed).

mentioned in issue #10744 (closed)

A little motivation for anyone considering to work on this: Right now I would be very happy to have this feature.

This is me thinking out aloud about this feature. Ignore me if you want, but if you are curious, feel free to jump in.

Widening the scope of this ticket a bit (or shifting it?): Wouldn’t it be nice if we not only link the single-entry thunks, but also find out about missed opportunities for single-entry thunks? Such a counting would subsume the linting above.

I’m considering to build on [Debugging/TickyTicky ticky-ticky debugging], as it already provides a lot of the necessary infrastructure. In particular, it counts entries to thunks, when -ticky-dyn-thunk is enabled. The number is not quite what we want, though: It seems that if a certain static thunk is created n times, then it will report n entries; whereas in the scope of this ticket we want to know that each allocated thunk was entered once.

Also, currently a dynamic (i.e. an instance of a static) thunk that is not marked as single-entry will always be reported to be entered 0 or 1 times, as the result is shared. How do we get better numbers here? One way might be a special heap object, let’s call it a “counting indirection”. It behaves mostly like a regular indirection, i.e. it’s entry code would count a tick and then jump to the entry code of the indirectee, but the crucial difference would be that the garbage collector will 'not' erase it when it evacuates memory. It will, however, evacuate the indirectee as usual. This way, the sharing semantics are preserved, but every entry via the thunk will be counted, not just the first.

These counting indirections would be created by the code that creates a thunk. This way, this counting can be done independent of -ticky-dyn-thunk. These heap objects can also carry additional data used in counting, e.g. the name of the thunk (in CorePrep names) and whether GHC created this thunk as single-entry or not (which would not affect the behaviour of the counting indirection, but be useful for the reporting).

I’m not sure yet what’s the best data structure to do the counting, if we indeed want to count uses of each dynamic instance of the thunk (or rather, of the counting indirection) separately. Maybe statically allocate 'm' (maybe 1024) counters per thunk, and an index, and the first 'm' instances of the thunk use the these counter (each, upon first entry, incrementing the index to allocate its slot), and if there are more than 1024 entries, then simply stop counting.

I kind-of like this design, but I’m sure I’m missing some rather ugly details.

It seems that if a certain static thunk is created n times, then it will report n entries; whereas in the scope of this ticket we want to know that each allocated thunk was entered once.

Yes exactly. Let's call the code that allocates the thunk the thunk allocation site. For some of these sites, cardinality analysis may say "this is a single-entry thunk". Call those the single-entry thunk allocation sites' and the others the multi-entry thunk allocation sites.

For each single-entry thunk allocation site, we'd like to know if any of the thunks it allocates are entered more than once. (This would mean that the analysis was unsound.)
For each multi-entry allocation site, we'd like to know if all the thunks it allocates are entered at most once. This is a possible missed opportunity to make it a single-entry site. (Only "possible" because in a different run it might be entered more than once.)
For all allocation sites we'd like to know if every thunk allocated there was never entered; that's a possible missed opportunity for absence.

A counting indirection sounds like the right kind of thing.

I would also like to count lambdas! So for the STG binding

f = \xy. e

How many times do I call f (ie enter the closure)? That is, is it a one-shot lambda.
Do I ever partially apply it, or is it always saturated?

Similar technology might do the job.

A counting indirection sounds like the right kind of thing.

Looking a bit into this IND_COUNTING now; so I’ll go into “dumping notes into trac” mode, for the future benefit of me or whoever else continues.

There is already a closure type IND_PERM which is not removed by the GC; our counting indirections requires the same behavior. So this is a good start, I guess. The following comments are observations from looking at every mention of IND_PERM in the code.

Cmm.h has macros to ENTER a closure, which shortcuts indirections without calling their entry code. This is not what we want for an IND_COUNTING; so we should use the default code there, to actually enter the counting closure.
Should the interpreter also count entries? I don’t see any reference to ticky in Interpreter.c, so probably not.
Should rts/Stable.c preserve IND_COUNTING? It does not preserve IND_PERM, so probably not.
Fun fact: tricky and profiling do currently not mix. As I plan to do this counting as part of ticky, this means I can ignore problems with profiling in the RTS (the closures still have to work, of course).
scavenge_mark_stack does nothing for IND_PERM with a comment I don’t fully understand. Will revisit.

Weird. I only find lots of code that handles IND_PERM, but none that actually creates closures of that type. Is all that dead code, likely since 2010’s reimplementation of BLACHOLES by Simon Marlow? (changeset:5d52d9b6/ghc)

Created D1821 to let Harbormaster check if IND_PERM is really unused.

Why not emit some update code into a single entry thunk that updates the thunk with something that will explode if it is entered again? That seems a lot easier than allocating indirctions.

Why not emit some update code into a single entry thunk that updates the thunk with something that will explode if it is entered again? That seems a lot easier than allocating indirctions.

Right, that would be sufficient to detect thunks that are wrongly marked single-entry. But we also want to learn about thunks that are not marked single-entry, but should be.

Right, that would be sufficient to detect thunks that are wrongly marked single-entry. But we also want to learn about thunks that are not marked single-entry, but should be.

Do you expect to learn much by doing that? We know the analysis is inaccurate, and I expect you'll just find a lot of thunks that are used a long way from the definition site.

Yes we expect to learn a lot. From a scientific point of view (and writing the journal paper) I'm very interested to know how many of the single-entry thunks are found by the analysis. Let's say there are N thunk allocation sites in the code. We find that 15% are single-entry. If we found that only 20% were entered once (that 15% plus another 5%) that would mean there was at most another 5% that a more sophisticated analysis might find. But the number might be much larger.

If the number is large, we might start to characterise the missed opportunities further. Perhaps some are being missed because of a bug in the analysis. Others might be caught by a simple improvement in the analysis.

That's the thinking.

mentioned in commit f42db157

I have made initial, still hackish, progress. After a few iterations I can now run all of nofib with my counting indirections and it does not segfault any more (phew).

Here is a report from integrate, note the new columns #Alloc, Single and Multpile.

    Entries      Alloc    Alloc'd     #Alloc     Single   Multiple  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
      50000     800000          0          0          0          0   1 D                    main@main:Main.main6{v r7sX}
          0          0    8000000     500000          0          0   0                      sat_s7u6{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7u3{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tZ{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tU{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tP{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tK{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tF{v} (main@main:Main) (thk) in r7sQ
          0          0    8000000     500000          0          0   0                      sat_s7tA{v} (main@main:Main) (thk) in r7sQ
    4050000          0    7200000     450000          0     450000   1 D                    sat_s7uz{v} (main@main:Main) in s7uB
          0          0    8000000     500000          0          0   0                      sat_s7tu{v} (main@main:Main) (thk) in r7sQ
     500000   72000000          0          0          0          0   3 dd>                  main@main:Main.$wintegrate1D{v r7sQ}
     450000   32400000     800000      50000          0      50000   1 D                    sat_s7uB{v} (main@main:Main) in r7t3
      50000     800000          0          0          0          0   1 D                    main@main:Main.main10{v r7t2}
      50000    3600000          0          0          0          0   2 DD                   main@main:Main.main11{v r7t3}
      50000          0          0          0          0          0   4 LLid                 main@main:Main.$wgo{v r7t4}
          2          0         24          1          1          0   0                      sat_s7vH{v} (main@main:Main) (thk) in r7sR
          1         64          0          0          0          0   0                      main@main:Main.main1{v r7sR}
          1          0          0          0          0          0   0                      main@main:Main.main15{v r7t9}
          1          0          0          0          0          0   0                      main@main::Main.main{v 01D}

It is still buggy (or not fully understood), e.g.:

I thought I only enabled it for thunks, but it also seems to be present for two functions, and only there count multiple entries... Weird.
Also, these 500000-times allocated thunks are not really unused, are they?.
Two total entries for a thunk (sat_s7vH) that has been allocated once and called once? Maybe there was a heap check inbetween?

But still: You can tell that sat_s7uB has been allocated 50000 and each instance is called 9 times – this matches the 9-element-list in the code.

The implementation uses a new closure type that carries a pointer to a ticky counter struct, and its own little counter:

typedef struct {
    StgHeader   header;
    StgClosure *indirectee;
    const void *ent_counter; // A StgEntCounter
    StgWord     entries;
} StgIndCounting;

Upon entry, this increments the private counter, and if it does so the first time, increases Single, and on the second time decreases Single and increases Multiple. Otherwise, it passes R1 onto the indirectee, preserving the tag.

Good work!

Nooo, where is the longish text I wrote here yesterday? :-( I hate it when that happens. Darn, let’s see what I can remember. grnrl.

I thought I only enabled it for thunks, but it also seems to be present for two functions, and only there count multiple entries... Weird.

Still weird. If I add my counting indirection to everything, I get segfaults. So I guard it with idRepArity id == 0 for now, thinking I would only add it to thunks. But the above report shows some that have “Non-void arguments”, but still idRepArity id == 0; this is one example:

let {
  sat_s7uy [Occ=Once] :: GHC.Types.Double -> GHC.Types.Double
  [LclId, Str=DmdType] =
      \r srt:SRT:[] [x_s7ux] GHC.Float.timesDouble x_s7ux y_s7uu;
} in 
  case Main.$wintegrate1D 0.0## ww3_s7uw sat_s7uy of ww4_s7uz {
    __DEFAULT -> GHC.Types.D# [ww4_s7uz];
  };

Why does this have a zero idRepArity? Why does it not crash? Is it because it is used as an argument to a function, and hence no fast calls occur?

Also, these 500000-times allocated thunks are not really unused, are they?.

No, they are not. These are constructors, where entries are not counted. I extended the ticky report to be a bit more explicit about the kind of things these names are, i.e. distinguish cons from thunks, and for thunks, indicate which are one-shot and which are static.

Two total entries for a thunk (sat_s7vH) that has been allocated once and called once? Maybe there was a heap check inbetween?

No, this is simply a bug in the ticky code, where there are two calls to tickyEnterThunk (likely introduced by 99d4e5b4, possibly due to a merge mistake).

The latter two improvements are at D2014, mostly to get phab to test the commit (which does not seem to happen... why?).

I now need to wrap my head around how to do the counting for functions. My counting indirection does not work as it is, because functions are entered differently. So I either have to create a counting indirection that works for functions (but I don’t know yet if a single one would work for all types of functions), or add the counting code to the function itself. In this case, the function closure needs an additional field for the counter.

If you want to count how many times a function is entered, why not just put a ticky counter at the start of the function's code?

Simon

If you want to count how many times a function is entered, why not just put a ticky counter at the start of the function's code?

The problem is that I need to count each dynamic instance separately. So I need a separate counter for each allocated closure, so the only place it can go is inside the closer as another non-pointer argument.

(For static functions, things are easier of course.)

Oh ok, well, same deal. When you enter a (dynamically allocated) function there's a register pointing to the function closure. So the ticky code at the start of the function's code can bump the count in ClosurePtr[4] (or whatever the appropriate offset is. No?

Yes! That’s the idea, and conceptually straight-forward. I’m just a bit worried that the change will be too invasive to the compiler’s code (info tables, offset calculation etc.)

But I guess I should just do it, and see how bad it is. In the worst case we keep it on a separate branch just for the investigation for the paper.

Ok, I implemented it, and nofib still runs (yay). The code is yet unrepresentable, though :-)

Preliminary observations (gotta run now):

In all of nofib, there is one single entry thunk that is entered multiple times (something in gamteb, allocated 2048 times, each time with two entries). Did not investigate yet, though.

On first glance, plenty of thunks which appear to be single-entry, but are not marked as such.

mentioned in issue #11731 (closed)

I started to put my code on wip/T10613. Currently, the extra counters require -ticky -ticky-allocd -ticky-dyn-thunk to be shown.

mentioned in commit 6c2c853b

Ignore the commit in #10613; I forgot that this Diff was not intended to be merged. I'll be reverting it shortly.

mentioned in issue #11770 (closed)

mentioned in issue #12364

mentioned in issue #13536 (closed)

See #13536 (closed) for an example where it'd have been great to have this feature!

added Pnormal label

added demand analysis label

Trac field	Value
Version	7.10.1
Type	FeatureRequest
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC	simonmar, simonpj
Operating system
Architecture

Mechanism for checking that we only enter single-entry thunks once

Child items 0

Activity