Opened 11 months ago

Closed 8 months ago

#9003 closed bug (fixed)

GHC eventlog: thread stop status codes modified (breaking ghc-events, threadscope, edentv)

Reported by: jberthold Owned by: ezyang
Priority: high Milestone: 7.8.3
Component: Runtime System Version: 7.8.2
Keywords: Cc: simonmar, jberthold, ezyang
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Other Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

A patch in GHC-7.7 modified the "thread stop status" encoding, so the current ghc-events (version 0.4.2.0 from 2012) produces wrong stop reasons when reading an eventlog written by GHC-7.8.2.

The GHC patch is
http://git.haskell.org/ghc.git/commit/70e20631742e516c6a11c3c112fbd5b4a08c15ac
The MVarRead block reason was added in the middle, shifting all following block reason codes one up.

The error can be observed when a thread blocks on a black hole during a program run. Attached is a small test program which produces this situation, and two event logs for the same program run with the same arguments and RTS parameters. The program's two threads evaluate the same data, the parent thread typically blocks on a blackhole created by a forked child thread. Note the sequence around user event "child": The log written by GHC-7.6.3 correctly shows blocking on a black hole, while the log written by GHC-7.8.2 claims the main thread blocks "on I/O read".

Attachments (3)

wrongeventlog.hs (576 bytes) - added by jberthold 11 months ago.
Test program
wrong763.eventlog (6.5 KB) - added by jberthold 11 months ago.
trace written by GHC-7.6.3
wrong782.eventlog (7.3 KB) - added by jberthold 11 months ago.
trace written by GHC-7.8.2 (not correctly read by ghc-events show)

Download all attachments as: .zip

Change History (22)

Changed 11 months ago by jberthold

Test program

Changed 11 months ago by jberthold

trace written by GHC-7.6.3

Changed 11 months ago by jberthold

trace written by GHC-7.8.2 (not correctly read by ghc-events show)

comment:1 Changed 11 months ago by jberthold

  • Cc jberthold added

Like the GHC-6.x tid field length problem, any solution to the problem requires baking in knowledge of GHC version differences into ghc-events (might break other software that writes event log format).

A fix in ghc-events is possible (the fork ghc-events-parallel has been fixed like this as a temporary solution): EVENT_USER_MARKER was added after the 7.6 branch, so the ghc-events parser can use it as an indication to use different thread stop state decoders. When fixing it like this, however, older tools will always yield the error.

OTOH, when reverting to the original encoding in GHC (fix inside GHC), one needs to reliably detect GHC-7.8.2 and distinguish it from subsequent versions. Again by adding a new event type in master and GHC-7.8.3.

comment:2 follow-up: Changed 11 months ago by ezyang

  • Cc ezyang added

Oof, sorry about breaking backwards compatibility in such an annoying way.

I am sympathetic, but it is not clear to me what the correct path from here is. Here are some possibilities:

  1. Accept that includes/Constants.h really is GHC internal and subject to change however we like (f30d527344db528618f64a25250a3be557d9f287 is an example of another commit which changed numbering, so my commit is definitely in keeping with tradition), and setup ghc-events to match on the relevant GHC version and change its behavior appropriately. I guess we'd need to tag eventlog with the relevant GHC version (I hope we're doing that already...) To take things a step further, since ghc-events would be GHC specific, we might fold it into the GHC build tree, so it tracks changes with the rest of GHC (this might not be so appropriate as you might process eventlogs produced by one version of GHC with another version.)
  1. Establish the constants as a proper ABI which we guarantee the stability of, and add comments and test-cases to prevent people from renumbering them to maintain cleanliness. We can take this further and consider the state of affairs in 7.8.2 to be a bug, and renumber back to the original version. (I believe this is your proposal.) ghc-event only needs to work around this particular version. (I think version numbers should be added to the eventlog format anyway.) Note: some of the fields we use these constants for are fixed size, so if we never reuse numbers we may run out of space, so we may eventually need to reuse some of them.

comment:3 in reply to: ↑ 2 Changed 11 months ago by jberthold

Replying to ezyang:

Oof, sorry about breaking backwards compatibility in such an annoying way.

I think we should have some "warning signs" in the code (includes/rts/Constants.h), otherwise this will happen again sooner or later.
The shared knowledge of ghc-events and GHC runtime system is in includes/rts/EventLogFormat, which is obviously not up to date right now. includes/rts/Constants.h should point there.

About the options:

  1. Accept that includes/Constants.h really is GHC internal

Decidedly not the intention (for the parts mentioned in EventLogFormat.h).
The other commit you found is interesting... My best guess is tha "Relocated" was never actually written to an event log (I might be wrong, though...)

  1. Establish the constants as a proper ABI which we guarantee the stability of, and add comments and test-cases to prevent people from renumbering them to maintain cleanliness.We can take this further and consider the state of affairs in 7.8.2 to be a bug,

This is what Simon thought of, judging from what he replied to my mail earlier.
I completely agree with you, there have to be comments in the code to avoid this in the future.
Then either

2a) ghc-events stays as it is and will never parse ghc-7.8.2 eventlogs 100% correctly

2b) GHC and ghc-events are adapted to recognise ghc-7.8.2 traces (and not later versions) from their header, by adding an event or extending one with a new field, in GHC-7.8.3.

Or there might be

  1. Consider older ghc-events versions outdated (something we wanted to avoid), and only adapt ghc-events. This is what I did for the fork ghc-events-parallel (but it leaves the route to 2B open).

And: Yes, an event indicating the RTS version is written by (newer versions of) GHC. However, we cannot rely on it being present, and it is fiddly/fragile to read the beginning of an eventlog and then switch parsers when (and iff) it is found. I tried that, but then went for the other fix.
The information must be derived from the header (event information), not from events that may or may not be written.

comment:4 Changed 11 months ago by ezyang

Here's another perspective: really this is just a bug in this function implementation:

INLINE_HEADER void traceEventStopThread(Capability          *cap    STG_UNUSED,·
                                        StgTSO              *tso    STG_UNUSED,·
                                        StgThreadReturnCode  status STG_UNUSED,
                                        StgWord32           info    STG_UNUSED)
{
    traceSchedEvent2(cap, EVENT_STOP_THREAD, tso, status, info);
    dtraceStopThread((EventCapNo)cap->no, (EventThreadID)tso->id,
                     (EventThreadStatus)status, (EventThreadID)info);
}

We shouldn't cast from StgThreadReturnCode to EventThreadStatus, really we should have some translation function from one to the other (coded in terms of the constants, so updating Constants.h works automatically).

(I am not 100% certain I believe this line of argument, but I thought it was worth putting it out there.)

Last edited 11 months ago by ezyang (previous) (diff)

comment:5 Changed 11 months ago by simonmar

I agree with Edward. The fact that the thread statuses are written verbatim into the event log is the real bug here, because it ties the thread status values to the event log format, whereas they ought to be an internal implementation detail. We only did it this way in the first place because we are lazy (that's the royal "we").

So I propose we fix it in 7.8.3. That is, revert to the original numbering in the event log. But do it with a big switch statement in traceEventStopThread, rather than undoing the patch that reordered the numbers. Then we won't have this problem again.

Whether ghc-events is patched to work with 7.8.2 or not is a separate matter. It would be good to do this, but it doesn't change the fact that existing ghc-events (and ThreadScope) will give strange results with 7.8.2. If ghc-events is fixed, I'd much rather it was done by looking at the version number rather than inferring it from something else. Is that really too hard?

comment:6 Changed 9 months ago by thoughtpolice

  • Milestone changed from 7.8.3 to 7.8.4

Moving to 7.8.4.

comment:7 Changed 8 months ago by simonpj

  • Priority changed from normal to high

Edward says he'll look at this.

comment:8 Changed 8 months ago by ezyang

Since we dropped the ball on this and 7.8.3 is really soon, the current proposal is to just revert to the original numbering, but not add the big switch statement. This should be a small noncontroversial patch, and then we can fix it properly later.

comment:9 Changed 8 months ago by ezyang

  • Milestone changed from 7.8.4 to 7.8.3
  • Owner changed from simonmar to ezyang

comment:10 Changed 8 months ago by Edward Z. Yang <ezyang@…>

In 5f3c5384df59717ca8013c5df8d1f65692867825/ghc:

Partially fix #9003 by reverting bad numbering.

Signed-off-by: Edward Z. Yang <[email protected]>

comment:11 Changed 8 months ago by ezyang

  • Status changed from new to merge

comment:12 Changed 8 months ago by jberthold

I would prefer if the patch included a change to some event (eventlog header), so that the 7.8.2 version can be recognized by reading the header.
This is the best way to fix ghc-events.

The event EVENT_USER_MARKER was new in 7.8, and my fix to ghc-events-parallel was based on this. If the old numbering is restored now, I would like to have a way to distinguish version 7.8.3 from 7.8.2, which is easily done by changing events (could be adding an event which is not used, or extending an event with a new field).
If any such change is pending, I suggest to apply it now, or revert the numbering later.

comment:13 follow-up: Changed 8 months ago by simonmar

@jberthold: why not check the RTS version? It's in the RTS_IDENTIFIER event.

comment:14 in reply to: ↑ 13 Changed 8 months ago by jberthold

Replying to simonmar:

@jberthold: why not check the RTS version? It's in the RTS_IDENTIFIER event.

That would be possible, assuming that the event is indeed present in every trace (preferably near the beginning). The implementation has to "unwrap" all event blocks which group the event, though, which appears cumbersome.
The more serious problem I see is parsing traces without this event: in order to restart with the proper block reason decoder, the parser needs to keep a reference to the input event stream (making it a list, not a stream), leading to memory issues for larger traces (and they do get large...)

Identifying the version from header information is much cleaner, that is what the header is for. (OK, I admit that adding or extending an event for that purpose is a hack).

comment:15 Changed 8 months ago by simonmar

You're right, the RTS_IDENTIFIER event could be anywhere in the stream, possibly right at the end. We should put something in the header for this.

Still, 7.8.2 was buggy and it's not a disaster if we don't support it properly.

comment:16 Changed 8 months ago by carter

Agreed with Simon, especially since event log support was pretty busted with 7.8.2 anyways (eg: user events didn't seem to show up correctly!)

comment:17 Changed 8 months ago by juhpetersen

If we don't care too much about 7.8.2, is there anything more to do here for 7.8.3?

comment:18 Changed 8 months ago by Austin Seipp <austin@…>

In bd5f3ef6585640f762d96426bb041d79a5038e8e/ghc:

rts: Fix #9003 with an annoying hack

The TL;DR is that by adding this, we can distinguish GHC 7.8.3 from
7.8.2, which had a buggy implementation. See the ticket for details.

Signed-off-by: Austin Seipp <[email protected]>

comment:19 Changed 8 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from merge to closed

@jberthold - this is done. 7.8.3 and HEAD now include a EVENT_HACK_BUG_T9003 event that you should be able to use. Closing.

Note: See TracTickets for help on using tickets.