Opened 4 years ago

Last modified 4 weeks ago

#3693 new feature request

Show stack traces

Reported by: jpet Owned by: Tarrasch
Priority: normal Milestone: 7.10.1
Component: Runtime System Version: 6.10.4
Keywords: Cc: mihai.maruseac@…, pumpkingod@…, lonetiger@…, hackage.haskell.org@…, anton.nik@…, dterei, bgamari@…, scpmw@…, yi.codeplayer@…, nathanhowell@…, idhameed@…, miffoljud@…, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Debugging stack overflows can be very difficult, because GHC gives very little information as to exactly what is overflowing. Showing a basic stack dump (on crash, or in the ghci debugger) would be enormously helpful.

(Entered after spending two days trying to determine the cause of a stack overflow, before discovering it was a GHC bug [3677], which would have been apparent immediately if I could have only seen a callstack.)

Attachments (6)

StackTraces.dpatch (144.1 KB) - added by batterseapower 4 years ago.
StackTraces-Base.dpatch (34.3 KB) - added by batterseapower 4 years ago.
StackTraces.3.dpatch (149.7 KB) - added by batterseapower 4 years ago.
Compiler-StackTraces.dpatch (83.1 KB) - added by batterseapower 4 years ago.
Testsuite-StackTraces.dpatch (84.9 KB) - added by batterseapower 4 years ago.
Base-StackTraces.dpatch (48.2 KB) - added by batterseapower 4 years ago.

Download all attachments as: .zip

Change History (75)

comment:1 Changed 4 years ago by simonmar

  • Milestone set to _|_

Stack traces are difficult to implement, which is why there isn't one. See e.g. ExplicitCallStack.

I've been pondering the use of HPC to generate some tracing information though, possibly for feeding into ThreadScope?.

comment:2 Changed 4 years ago by guest

Might it be easy to get non-lexical call stacks? That way you could at least see the chain of "case" statements that forced the thunk producing an error. And some information is better than none at all.

comment:3 Changed 4 years ago by jpet

The linked-to proposals and discussion are mostly about generating what the stack "would be" if evaluation order had been strict.

I'm suggesting what the previous comment does, something much simpler (and IMHO much more useful)--I want to see a dump of the actual stack that actually exists in memory. Walking these stacks is easy, well-defined, and portable; the only obvious obstacle seems to be annotating code with source line and file info.

comment:4 Changed 4 years ago by jpet

Sorry, I should have said, "walking these stacks for stack traces seems feasible, since the garbage collector already has to do it." I should know better than to use the word "easy" when suggesting a task for someone else.

comment:5 Changed 4 years ago by simonmar

Yes you can walk the stack, but the problem is that due to the amount of code transformation that GHC does, it is almost impossible to relate stack frames at runtime to the original code. The code that runs bears little resemblance to the code you wrote in the original program. Case expressions in the compiled code may or may not correspond to case expressions or pattern matches in the original code, and GHC doesn't generally keep track of that information.

comment:6 Changed 4 years ago by simonpj

I agree with all you say, but I think there's still a case for saying that any stack trace is better than none. It may not be easily explicable to Joe User, but it would give an expert a lot more to go on. It's a bit like attaching a loudspeaker to the memory access pin of a CPU; you can't explain the noises that come out, but you get to recognise them and they can be very helpful. (Don't laugh -- I worked on an Eliot 803 which did this! Well, I'm not sure which pin it was, but it made noises anyway.)

In short, I think there's a case for a quick-and-dirty stack trace, if such a thing could be done without much effort.

Simon

comment:7 Changed 4 years ago by simonmar

Ok, let's explore what could be done. In gdb you can see the stack like this:

(gdb) p16 $rbp
0x2b265a880368: 0x2b265a87f2c0
0x2b265a880360: 0x748d08 <stg_marked_upd_frame_info>
0x2b265a880358: 0x459090 <smN_info>
0x2b265a880350: 0x6aafd0 <s1VW_info>
0x2b265a880348: 0x2b265a87f2d1
0x2b265a880340: 0x58afc0 <s1Ed_info>
0x2b265a880338: 0x2b265a87f2d1
0x2b265a880330: 0xb6757b <base_GHCziStorable_readInt8OffPtr1_closure+3>
0x2b265a880328: 0xaf04d8 <base_GHCziInt_zdfEqInt8_closure>
0x2b265a880320: 0xb22c39 <r1wc_closure+1>
0x2b265a880318: 0x2b265a87f328
0x2b265a880310: 0x59dc60 <s11L_info>
0x2b265a880308: 0x2b2655c4dc18
0x2b265a880300: 0x6abf38 <sOz_info>
0x2b265a8802f8: 0x2b265a87f328
0x2b265a8802f0: 0x748b88 <stg_upd_frame_info>

In the RTS there's also a way to print out stacks, though I don't use it very much. Note this is upside-down compared to the above dump:

(gdb) p printStackChunk( 0x2b265a8802f0,0x2b265a880368)
Object 0x2b265a8802f0 = UPDATE_FRAME(0x748b88,0x2b265a87f328)
RET_SMALL (0x6abf28)
   stk[12] (0x2b265a880308) = Word# 47443647716376
RET_SMALL (0x59dc50)
   stk[10] (0x2b265a880318) = 0x2b265a87f328
   stk[9] (0x2b265a880320) = 0xb22c39
   stk[8] (0x2b265a880328) = 0xaf04d8
   stk[7] (0x2b265a880330) = 0xb6757b
   stk[6] (0x2b265a880338) = 0x2b265a87f2d1
RET_SMALL (0x58afb0)
   stk[4] (0x2b265a880348) = 0x2b265a87f2d1
RET_SMALL (0x6aafc0)
RET_SMALL (0x459080)
Object 0x2b265a880360 = UPDATE_FRAME(0x748d08,0x2b265a87f2c0)

You get less information in that there are no symbol names: the RTS doesn't have access to the binary's symbol table. That information is available in the binary, and it might be possible to extract it using the RTS linker (speculation - I don't know how hard this would be in practice). On the other hand, the RTS knows where the boundaries between stack frames are, and it can recognise an UPDATE_FRAME etc.

But even with symbol names, you couldn't glean a lot from the trace here. What might help is if we knew which module each of those symbols comes from, but that info isn't available, even in the symbol table. We could add the module name to every symbol, but that would make local symbols much bigger. Perhaps that doesn't matter as these are only local symbols and so symbol lookup isn't actually performed at link time, it just makes the strings bigger in the object files. Someone could do that experiment.

Maybe the right thing would be to add some debugging info to the compiled code that doesn't take up any space at runtime, but which the RTS could slurp in if necessary.

comment:8 Changed 4 years ago by simonpj

Apart from update/exception frames, the stack mainly consists of frames pushed by 'case'. If we put a string into each return-info-table (much as we do already for constructors, I think), then we could record (a) module, (b) top-level function (c) case binder... and that'd be plenty. It'd bloat our binaries a bit but I'm betting it would not be much.

An alternative could be to do this only if you compile with -debug or -ticky. But that would lose part of the benefit, I'm guessing.

Simon

comment:9 Changed 4 years ago by simonmar

Right, yes, we could do that. I didn't consider that kind of thing because I have a feeling it would bloat binaries too much (there are a lot more case expressions than constructors), but someone could do the experiment and find out.

Putting the info in a debug section as I mentioned before would be a good way to avoid the binary bloat, but it makes it more complicated to extract again. If you're going to do that you might as well use the symbol table too, rather than duplicate the info in the debug section.

comment:10 Changed 4 years ago by asuffield

For memory and disk space concerns, this is why we use .debug sections (and sometimes place them in second files).

For the difficulty of accessing and dumping it, how about not duplicating the effort? Emit the debugging information in DWARF format as if it was any other language, and either borrow the code from gdb, or (perhaps easier) teach gdb the few extra bits it needs to decipher ghc call stacks.

comment:11 Changed 4 years ago by batterseapower

  • patch set to 0

I've written a very simple patch to add support for basic STG stack traces.

Here's how it looks:

$ cat Stacks2.hs
module Main where

main = thunk_use_site (myHead [])

{-# NOINLINE thunk_use_site #-}
thunk_use_site :: Bool -> IO ()
thunk_use_site foo = case foo of True -> putStrLn "Woot!"

{-# NOINLINE myHead #-}
myHead (x:xs) = x

$ ./Stacks2
Stacks2: Stacks2.hs:10:1-17: Non-exhaustive patterns in function myHead

Stacks2: Backtrace:
  main:Main.thunk_use_site1
  base:GHC.TopHandler.runMainIO1

It takes the approach of storing an optional string pointer in return closure info tables. Currently only frames arising from "case" scrutinisation are supported. The string generated from them contains information about the top-level function from which the "case" originated.

There is a new primop, raiseWithStack#, and the catch# primop now supplies a ByteArray?# to the exception handler, which contains an array of string pointers. This is reified into a [String] on the Haskell side, and shown to the user in the Haskell-side TopHandler?.

I changed the simplifier in one place to stop it generating new top-level names like "a", and instead basing those names on the binder the code originated from (if any). We could go through fixing such parts of the compiler on an ad-hoc basis, or we could implement something using HPC information to produce decent stack traces.

This patch is pretty rough and ready and probably buggy, but I thought I'd throw it out there. Areas for improvement:

  • Reify other sorts of stack frame. In particular, allow the RetInfoTables? generated by hand-written .cmm to contain strings, and show update frames. If we do this we should probably use [StackFrame?] on the Haskell side rather than [String], where StackFrame? is an algebraic type with a case for each kind of stack frame.
  • Better information about how the backtrace corresponds to the program the user wrote (e.g. using HPC as discussed above)
  • Common up stack frame strings. Currently, I'll emit the same string several times if there are multiple "case" expressions within one top level binding, which is going to contribute to bloat
  • Test it out. In particular, actually try running the test suite :-)

One note: I repurposed some code from the IA64 branch of Adjustor.c to create my ByteArray?#. However, it looks to me like the original code would not even compile (apparently allocate() has gained a Capability* argument since that code was written). Is the IA64 backend dead? If it's not, I've *certainly* broken it with this patch.

Here are some rough-and-ready code bloat numbers. The base library before (approximately, this is another recent build I had lying around):

$ du .
8.0K	./autogen
116K	./cbits
132K	./Control/Concurrent
112K	./Control/Exception
 36K	./Control/Monad/ST
 80K	./Control/Monad
780K	./Control
 24K	./Data/STRef
2.6M	./Data
 16K	./Debug
464K	./Foreign/C
184K	./Foreign/Marshal
876K	./Foreign
236K	./GHC/IO/Encoding
400K	./GHC/IO/Handle
1.3M	./GHC/IO
5.6M	./GHC
 68K	./System/Console
 44K	./System/IO
 28K	./System/Mem
316K	./System/Posix
660K	./System
124K	./Text/ParserCombinators
144K	./Text/Read
 16K	./Text/Show
404K	./Text
8.0K	./Unsafe
 26M	.

And after:

$ du .
8.0K	./autogen
116K	./cbits
148K	./Control/Concurrent
120K	./Control/Exception
 36K	./Control/Monad/ST
 80K	./Control/Monad
848K	./Control
 24K	./Data/STRef
2.8M	./Data
 16K	./Debug
472K	./Foreign/C
200K	./Foreign/Marshal
920K	./Foreign
244K	./GHC/IO/Encoding
460K	./GHC/IO/Handle
1.5M	./GHC/IO
6.3M	./GHC
 80K	./System/Console
 52K	./System/IO
 32K	./System/Mem
332K	./System/Posix
712K	./System
132K	./Text/ParserCombinators
156K	./Text/Read
 16K	./Text/Show
440K	./Text
8.0K	./Unsafe
 30M	.

So a 15% increase in code size. Not too bad, especially considering there is room for improvement by commoning up some strings, as I outlined above.

Changed 4 years ago by batterseapower

comment:12 Changed 4 years ago by batterseapower

By the way, those NOINLINE annotations are actually redundant. You get the same backtrace without them.

Changed 4 years ago by batterseapower

comment:13 Changed 4 years ago by igloo

  • Milestone changed from _|_ to 6.14.1
  • patch changed from 0 to 1

Changed 4 years ago by batterseapower

comment:14 Changed 4 years ago by batterseapower

The new version of the patch is much improved. Backtraces now look like this:

$ ./Stacks2
Stacks2: Stacks2.hs:8:1-17: Non-exhaustive patterns in function myHead

Stacks2: Backtrace:
  main:Main.thunk_use_site1 (Stacks2.hs:6:1-14)
  base:GHC.TopHandler.runMainIO1 (libraries/base/GHC/TopHandler.lhs:54:1-9)

Despite the fact that there is more information in the backtrace than before (the SrcLocs?), I actually managed to reduce binary bloat a little bit (it is now at 8%):

$ du .
8.0K	./autogen
116K	./cbits
148K	./Control/Concurrent
116K	./Control/Exception
 36K	./Control/Monad/ST
 80K	./Control/Monad
832K	./Control
 24K	./Data/STRef
2.7M	./Data
 16K	./Debug
468K	./Foreign/C
192K	./Foreign/Marshal
900K	./Foreign
236K	./GHC/IO/Encoding
424K	./GHC/IO/Handle
1.4M	./GHC/IO
6.0M	./GHC
 72K	./System/Console
 48K	./System/IO
 32K	./System/Mem
332K	./System/Posix
692K	./System
128K	./Text/ParserCombinators
152K	./Text/Read
 16K	./Text/Show
420K	./Text
8.0K	./Unsafe
 28M	.

This was achieved by commoning up the info, and not emitting it if it isn't actually needed.

I also tried to reduce the impact on physical memory of this debug info being in the object file by trying to emit it in a single contiguous section. The idea is that if its all together on one page then the OS's cache manager will simply not load it into physical memory until a stack trace is actually needed - thanks to Simon Marlow for this idea.

RetInfoTables? originating from the RTS now have some primitive backtrace information.

I've modified the tidier so that Internal names get names based on their referrer. This generalises the Simplifier hack from my last patch, so I've got rid of that change.

Remaining work:

  • Special handling for update frames (maybe), and perhaps some other of the non-standard RTS return frames
  • Look to see if I can improve the precision of backtrace info by using some of the context local to the return point
  • Run testsuite (I've been avoiding this, my machine is slow :-)

comment:15 Changed 4 years ago by simonmar

This is pretty exciting! I'm looking forward to having panic messages in GHC emit a stack trace by default. I'd be interested to see what a stack trace from the bowels of GHC looks like - have you tried it?

I've been thinking about stripping the debug info - we could do it at link time with a flag to GHC. strip has a flag to strip a particular section, and then we can tell the RTS that the debug info has been stripped in the same way that we tell it about the -rtsopts flag, i.e. with a special compiled-at-link-time C file.

comment:16 Changed 4 years ago by batterseapower

Thomas Schilling talked to me about the API I have at the moment. It's quite specialised in that you can only get a list of return frame descriptions, and makes the RTS do a comparatively large amount of work.

I've based this possible new API on his ideas. The idea is to move much more responsibility into the Haskell side of things, and hence allow for more interesting stack trace processors to be written. For example, you could write one which let the user interactively poke at the closures within the stack from using Vacuum or something.

catchWeird :: a -> (forall b. SomeException -> Addr# -> IO b -> (a -> IO b) -> IO b) -> a
retInfoTableDescription# :: Addr# -> Ptr CString
walkStack# :: Addr# -> Addr#

The idea is that the exception handler gets, along with the exception and a pointer to esp, two continuations:

  1. A continuation to use if you are able to recover from the exception and want to return a value. Entering this continuation strips the stack down to the current catch frame and returns that value to the thing above.
  2. A continuation to use if you are not able to recover. Entering this continuation passes control to the next exception handler on the stack.

The Addr# pointer to the stack top points to undefined memory after you enter either continuation, so you need to ensure that you have forced any thunks that close over it before you enter a continuation.

You can use walkStack# to move your stack pointer down the stack, and then unpackClosure# combined with retInfoTableDescription# to obtain a description of the frame.

I think the major obstacle to this API is deciding how to handle stack overflow exceptions. In that case, we cannot assume that there will be enough stack to let the exception handler reify the stack frame before continuing. Presumably, to deal with this we would have to either:

  1. Start a new stack somewhere to run the handler on if we get an asynchronous stack overflow exception
  2. Bump up the stack size temporarily while we run the handler on the original stack

Either way it sounds fiddly. I'm probably not going to implement this API yet, and just stick to the basic story for now.

comment:17 follow-up: Changed 4 years ago by batterseapower

Well, the "fast" testsuite basically works, with the exception of some of the GHC performance tests (and ffi005 which I have never seen pass on any of my OS X machines). This is encouraging, but I need to look carefully at the performance tests.

I'm not having much luck with providing a way to strip out the debug info. I played with the suggestion by Simon to have an optionally-strippable section, but:

  • OS X strip has no option to strip sections with a given name
  • If I install GNU strip using Macports, trying to strip anything produces by GHC causes the stripped thing to die immediately upon running with "Illegal instruction". gstrip itself complains about "unable to read unknown load command 0x22". I have no idea what to make of that.

Which is kind of discouraging. Even if I got it all working on every platform, we would never be able to strip the .ghc_debug section from shared libraries, which is sort of annoying.

I guess I could probably stuff the info into a normally-named .debug section, and then just run plain "strip" on the result if you have requested that stack traces be disabled. This might be annoying for those who link in their own .o files with debug info they want to preserve, but maybe that is an acceptable tradeoff. Thoughts?

comment:18 in reply to: ↑ 17 Changed 4 years ago by simonmar

Replying to batterseapower:

I'm not having much luck with providing a way to strip out the debug info. I played with the suggestion by Simon to have an optionally-strippable section, but:

  • OS X strip has no option to strip sections with a given name

No big deal - just make it a no-op on OS X, perhaps emitting a warning.

Which is kind of discouraging. Even if I got it all working on every platform, we would never be able to strip the .ghc_debug section from shared libraries, which is sort of annoying.

It isn't such a problem in shared libraries, I'm not too worried about this.

comment:19 Changed 4 years ago by igloo

  • Status changed from new to patch

Changed 4 years ago by batterseapower

Changed 4 years ago by batterseapower

Changed 4 years ago by batterseapower

comment:20 Changed 4 years ago by batterseapower

I've attached some fairly complete patches for stack trace support.

The outstanding issues are:

  1. These three tests fail for me on both Windows and Linux after applying the patch, but I'm not sure why some of them do. One is a GHC performance test, so it looks like there is a small perf regression caused by the patches:
   3429(threaded2)
   T1969(normal)
   T3294(normal)
  1. These two tests fail on Windows only:
   stacktrace1-ghci(ghci)
   stacktrace1-stripped(normal,optc,hpc,optasm,threaded1,threaded2)

The first failure is harmless. The error is "Failed to load interface for `HC.TopHandler?'", and is solved if you run dos2unix on the .genscript file. I'm not really sure why this is a problem for that particular ghci test and no other one, though!

The second failure is more serious: my guess is that Windows barfs when loading an executable where some symbols point into a stripped section. The exact error I get is "stacktrace1-stripped.exe is not a valid Win32 application", which is not very informative. It would however be straightforward to show a warning when the -strip-stack-info flag is given on Windows, just like we already do if it is used on OS X.

  1. Currently I show the stg_ stack frames originating from the RTS, but it's up for debate as to whether we should bother. I think we probably should show them, because it's quite interesting to see where the update frames, catch frames etc are. It's very easy to change this behaviour to turn them off (a 2-line edit in CmmParse?.y).

Other than that I think the patch is pretty much done. I've tested it on 32 bit Intel on all three of Windows, Linux and OS X. Solaris / 64 bit support may possibly have minor bugs in them, but I can't test that. e.g. i'm not sure if I got the solaris assembler syntax correct.

This patch will make throwing an exception slightly slower because it reifies the stack to a depth of 100 when you call stg_raisezh. Other flaws with the design that has been implemented mean that:

  1. It is impossible to re-raise a Haskell-side "StackTrace?" value once you have caught it. This is fixable at the cost of extra allocation (reified stacks should be of type Arr# String rather than a ByteArray?# array of Addr# to CStrings), but that allocation can be done statically by generating a closure in .ghc_debug section.
  2. There is no support for "inner exceptions". I.e. if an exception handler throws an exception, we should pair the original stack trace with the stack trace from the new place it is throwing from. This can be implemented with a bit more effort than 1, but if we did it then it would subsume 1 (we could keep stack traces as ByteArray?#) and would require less effort on the part of the programmer to get informative stack traces.

Make of it what you will :-). I don't want to spend too many more cycles on this problem right now, so if you want to change the design radically it may take me a while to get around to it.

comment:21 Changed 4 years ago by batterseapower

I had a look at how you might encode how to walk the STG stack using DWARF. I don't think it's possible with DWARF v2 because the "register rules" which define how the stack walking process updates registers as it walks back up the stack do not include what we need.

Basically we need to update:
$eip to (top_closure->info->entry_code)
$ebp to (top_closure - size(top_closure))

(where the top_closure pointer can be determined from $eip and $ebp in a standard way)

However, the DWARF2 register rules do not make provision for arbitrary expressions: you can only transfer from other registers or a value stored at some offset from the frame base address.

DWARFv3 adds expression and val_expression transfer rules which we could easily use to implement this stack walking scheme. (Which have apparently been implemented in GDB since 2006: http://cygwin.ru/ml/gdb-patches/2006-03/msg00076.html).

The work that would need to be done is:

  • Generate CFA and location information for code emitted from GHC
  • Add libdwarf as a GHC dependency
  • Augment the RTS with some code using libdwarf to reify the stack when an exception occurs, passing it to Haskell land as in my existing patch
  • The RTS code must be robust to missing DWARF info (i.e. if the executable was stripped)

The benefit to this approach compared to the one my patch implements is that dropping the debug info can be done with standard tools, and it will be more pleasant to use GDB on GHC-generated executables.

comment:22 follow-up: Changed 4 years ago by mihai.maruseac

  • Cc mihai.maruseac@… added

So, is this solved? Will we have a stack trace in a future version of GHC?

(I just got free time and was thinking into dwelling into this when I found that there are several patches allready :) )

comment:23 Changed 4 years ago by pumpkin

  • Cc pumpkingod@… added

comment:24 Changed 4 years ago by Phyx-

  • Cc lonetiger@… added

comment:25 in reply to: ↑ 22 Changed 4 years ago by simonmar

Replying to mihai.maruseac:

So, is this solved? Will we have a stack trace in a future version of GHC?

(I just got free time and was thinking into dwelling into this when I found that there are several patches allready :) )

Max's patch isn't going into 7.0. There are some unsolved problems with the design (see earlier comments), so we're still looking for a good solution.

comment:26 Changed 4 years ago by igloo

  • Status changed from patch to new

comment:27 Changed 4 years ago by simonpj

Quite a lot of work has been done here, esp by Max, and it'd be a pity not to make push it to completion. But no one "owns" this ticket, so I'm afraid it may languish unloved.

Does anyone want to take it on and push it over the hill? I'm sure Simon and Ian (esp) will act in support. Even enumerating the issues that need to be solved would be helpful.

Simon

comment:28 Changed 4 years ago by mihai.maruseac

Hi,

I am trying to do something as a disertation project (to be defended on July) but I don't guarantee that I'll stick to it instead of changing to something else or that I'll get something good done. However, I'll try, so hints, support, issues needing to be solved, anything would be useful.

Mihai Maruseac

PS: don't think I should take this as owner since I've only recently digged into GHC's sources

comment:29 Changed 3 years ago by liyang

  • Cc hackage.haskell.org@… added

comment:30 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:31 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:32 Changed 3 years ago by ezyang

Hey Max,

Since we recently moved to Git, there's not a terribly easy way to import these Darcs patches anymore. Do you think you could move your patch series into a Git branch? I'd even volunteer to keep it up to date with the moving master :-)

Cheers,
Edward

comment:33 Changed 3 years ago by lelf

  • Cc anton.nik@… added

comment:34 Changed 3 years ago by batterseapower

Edward: I think my patches are useless because the approach I took is a dead end :-(. The Right Thing to do is generate DWARF info, even though it is a considerably more involved solution.

comment:35 Changed 3 years ago by dterei

  • Cc dterei added

comment:36 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:37 Changed 3 years ago by bgamari

  • Cc bgamari@… added

comment:38 Changed 2 years ago by scpmw

  • Cc scpmw@… added

comment:39 Changed 2 years ago by huangyi

  • Cc yi.codeplayer@… added

comment:40 Changed 2 years ago by nathanhowell

  • Cc nathanhowell@… added

comment:41 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1

comment:42 Changed 20 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2

comment:43 Changed 15 months ago by morabbin

See also #960.

comment:44 Changed 9 months ago by scpmw

  • Difficulty set to Unknown

Some progress on this - our DWARF / profiling branch now generates DWARF using NCG, complete with .debug_frame unwind records that are at least valid at the start of blocks. That's enough for GDB to backtrace:

$ gdb stack-trace -q
Reading symbols from tack-trace...done.
(gdb) break stg_raisezh
Breakpoint 1 at 0x68f2f8: file rts/Exception.cmm, line 433.
(gdb) run
Starting program: stack-trace 
Breakpoint 1, stg_raisezh () at rts/Exception.cmm:433
433	{
(gdb) bt
#0  stg_raisezh () at rts/Exception.cmm:433
#1  0x0000000000694330 in ?? () at rts/Updates.cmm:57
#2  0x00000000004047a0 in Main_zdwfibzuerr_info () at stack-trace.hs:7
#3  0x0000000000404770 in Main_zdwfibzuerr_info () at stack-trace.hs:5

Note however that we end up with "??" sometimes due to GDB looking at IP-1 for information about the supposed "call" instruction. With tables next to code, that is guaranteed to be inside the info table, so right now I just declare the info table as belonging to a source file so at least the "at bla" bit is correct.

Furthermore, using Max' patch and some more or less involved lookups we can do the backtracking from the RTS as well, which then looks roughly as follows (code here):

stack-trace: Barf!
Loading debug data...
Stack trace:
   0: stg_bh_upd_frame_ret (at rts/Updates.cmm:86:1-91:2)
   1: fib_err (at stack-trace.hs:7:37-7:42)
   2: fib_err (at stack-trace.hs:7:37-7:42)
   3: fib_err (at stack-trace.hs:7:37-7:42)
   4: fib_err (at stack-trace.hs:7:37-7:42)

Here the RTS is reading in DWARF from our own executable as well as libraries using libdwarf, then looking up the code pointers to get labels. These are then looked up in the debug info table we normally use for profiling, which yields (amongst other things) references to the original source code.

Caveats:

  • The quality of the result obviously depends a lot on what we can actually find on the stack. Ironically, higher optimization levels sometimes produce better results, as there's a higher change for functions to have specialized return code.
  • Only works on Linux at this point - main sticking point would be undoing relocation and finding the right files to read DWARF from (right now I read /proc/self/maps for that).
  • Depends on a significant amount of code from my profiling-ncg branch, which has a few change that might be controversial - such as adding CmmTick? and CmmContext? to CmmNode? instead of tracking source code ticks per-block or per-procedure. Reasoning is basically that this solution seems to have the least overhead on Cmm optimizations.

Something I am wondering right now is what kind of API would be most appropriate for this kind of thing. This comes down pretty much to how much extra functionality we want - we would probably want to be able to request stack dumps manually, but would we want to be able to make the stack contents transparent to the Haskell program? If yes we could convert stack dumps into a Haskell lists, possibly offering whatever extra bits of information while we're at it (e.g. Core references). On the other hand that would be really, really impure - and quite some work to boot.

Thoughts? Any particular features people would like to see?

comment:45 Changed 8 months ago by simonmar

Peter - this is really exciting stuff. What I'm most interested in at this stage is a plan for getting it integrated. In practice what that means is breaking down the code into separate pieces that can reasonably be reviewed independently, and I'll probably need your help to point me at key places in the code where the important design decisions are.

comment:46 Changed 8 months ago by scpmw

Great, thanks! I will start with splitting things up then. Unless there's other preferences I will submit the patches into a GitHub branch. That way, we don't have to spam this ticket with individual patches - and can document all interesting design choices in the commit message along the way.

Would be good to get feedback on the initial source notes patch, as that's probably the piece of code with the highest chance of screwing things up in subtle ways. I am currently trying again to hunt down any further Core optimization behaviour changes due to source notes, but this version already works very well in my experience.

comment:47 Changed 8 months ago by Tarrasch

Hello, I'm planning to take a serious look at this, it'll be my master thesis. I have a few questions here that I believe is of general interest.

(1) Peter, your progress looks fantastic, and it's great to see that you can retrieve the stack trace both through a debugger and through the run-time system. But which one should we aim to use in the end? Opinions? I personally think having the run-time doing the job to be simpler for the users (If they are unfamiliar with debuggers) and more convenient (what if I forgot to run my program with gdb?)

(2) By now, GHC have had the LLVM backend for a while. Would stack traces working for this backend be more useful? Would it be more/less work? LLVM has support for DWARF, but at this time I have a bad idea what that even means. But I do know some complications with the llvm approach. I've spoken with Nathan Howell about this and he thought that one missing piece is that the LLVM architecture only understands DWARF 3, but DWARF 4 is necessary to implement what we want. So we would need to patch LLVM.

(3) Peter, would it be possible to show in somewhat more detail on how to reproduce the stack traces you pasted previously? It would be really useful if I could have some trail to follow when playing around.

comment:48 Changed 8 months ago by scpmw

Hi Tarrasch

(1) Thanks - the good thing with using DWARF is that it doesn't really matter how you use it. My primary use case was actually to feed it to ThreadScope? for profiling. Other people have talked about using Linux "perf" tools - this kind of openness is a good thing. From my point of view, going over the RTS will probably be the most fruitful approach most of the time, simply because it can account for more of the generated code's quirkiness (see the "IP-1" trouble with GDB).

(2) The LLVM backend was actually what we started with, and at this point we still support line-number information. Walking the stack is more problematic - we can declare values as being visible in DWARF using llvm.dbg.declare and friends, but (as far as I know) have no way to communicate unwind rules to LLVM.

Dwarf 4 isn't really the issue here though, Dwarf 3 is sufficient - even DW_CFA_def_cfa_expression will probably only be needed for rare variable-sized stack frames.

(3) Tthere should actually be little magic involved with building my branch, as everything's on by default. Just make sure you have the dependencies (libdwarf). Also get the stack-trace branch of the "base" library from my GitHub?, as otherwise you won't have a top-level exception handler that prints a stack trace. In case you run into any problems, just drop me a mail.

comment:49 Changed 8 months ago by ihameed

  • Cc idhameed@… added

comment:50 Changed 5 months ago by Tarrasch

  • Cc miffoljud@… simonmar added
  • Milestone changed from 7.6.2 to 7.10.1
  • Owner set to Tarrasch

comment:51 Changed 3 months ago by carter

has there been any progress on this branch? wheres the activity / dev work been? https://github.com/scpmw/ghc/commits/profiling-ncg ? or where?

comment:52 Changed 3 months ago by Tarrasch

Hi Carter,

Implementing stack traces can be thought of two parts, (1) the code-generator stuff that now has to be able to also emit debug data and (2) Library and rts components. Peter is working on the (1)-part, his most recent work is here[1] (afaict). I work on the (2)-side of things. My work is here[2].

But I think the most interesting parts is the research and conversations we're having. Peter and I are usually collaborating at least weekly on this. I'll be more than happy to share or explain any detail you might be interested in.

[1]: https://github.com/scpmw/ghc/tree/profiling-import
[2]: https://github.com/Tarrasch/ghc/tree/dev (I just pushed this, good thing you reminded me)

comment:53 Changed 3 months ago by simonpj

Tim Schroeder writes (on Haskell Cafe)

Hello,

I have developed various profiling and debugging tools in the past, but haven't done any work in that direction for GHC compiled Haskell programs. I thought a good start might be to write a document detailing how the call stack works in GHC, and also write a basic debugger stack trace tool. Here:

https://github.com/blitzcode/ghc-stack

I hope this is helpful to people trying to pin down crashes, and people interested in working on debugging / profiling tools. I'd also really appreciate if somebody with more GHC / RTS knowledge than me would have a look and see if I got things right.

Cheers,
Tim

comment:54 Changed 3 months ago by Tarrasch

Wow, this is amazing! There's also a discussion tree at reddit: http://www.reddit.com/r/haskell/comments/1vo1se/hacking_ghcs_stack_for_fun_and_profit_featuring/

Tim covers a lot of interesting stuff I have not really had time nor knowledge to dive into. In particular he confronts the difficulties of foreign function calls and receiving signals that don't get turned into Haskell exceptions, like unsafeWrite v 1000000000 (0 :: Int) in contrast to error "rts-aware exception".

For my thesis work, I'll be sticking with how to handle RTS-aware exceptions. But I'll be enthusiastically following how Tim's ghd is progressing. His README on the github page is a very, very well-written and good read.

comment:55 Changed 3 months ago by blitzcode

Thank you for your kind words!

I've been most interested in the lower-level faults simply because they have the least support from the existing tools.

I'm currently back on another project after spending a week or so on the debugging project, but I'm thinking about how to turn it into a a more usable and portable tool. Basically, I think I would rewrite it to be based fully on gdb's machine interface or lldb's C++ API. This buys me a few things:

  • Unified code to access the target process state (no more Win32 ReadProcessMemory?() vs Linux ptrace() vs OS X mach_vm_read_overwrite())
  • Unified code for symbol lookup (no more Win32 Sym* APIs vs Linux addr2line vs OS X / BSD atos)
  • Support for both stack traces from live processes and core dumps with no extra work
  • C/C++/etc. stack unwinding already done

The biggest issue still seems to be keeping the debugger's view of the stack and heap in sync with the runtime / compiler sources. GHC's RTS data structures / memory layout seem fairly stable from what I can tell, but I'd like to make everything as automatic as possible. After all, there are quite a few different OS, CPU architecture, compiler and RTS flavors to support. I would use the DerivedConstants?.h header and generate some additional offsets, register mappings and such using a similar mechanism.

I still think the approach of directly accessing the target process through OS-level functions has some merit. Mostly for profiling, where we'd like to inspect the program hundreds or even thousands of times per second with as little overhead as possible. The CCS is a fairly good tool for pinning down the location of a crash, but Peter's work on improving DWARF information is really what I'm looking forward to for profiling. Hopefully this can be a part of 7.10!

Peter, do your DWARF improvements handle FFI calls (traversing from C back into Haskell)? Also, do you think it would be feasible to support frame pointers as an option for GHC? I know debuggers have long moved on to doing stack unwinding with debug information, but there seem to be numerous profiling tools still using them for speed and simplicity.

There's been some discussion here regarding RTS APIs to access the stack, but are there any ideas on how to expose this API to other processes? There are advantages to having debugging, profiling and heap walking tools be as unobtrusive as possible, not changing the program code or runtime behavior. RTS support for this type of access could make writing such tools simpler.

comment:56 Changed 3 months ago by scpmw

Unified code for symbol lookup (no more Win32 Sym* APIs vs Linux addr2line vs OS X / BSD atos)

As far as the current design goes, we already read out the symbol and DWARF tables of all loaded executable sections. So by going through the RTS you'd have that part covered as well (well, at least for Linux). See Dwarf.c.

Mostly for profiling, where we'd like to inspect the program hundreds or even thousands of times per second with as little overhead as possible.

I suppose you are talking about an external sampling routine that walks the stack in certain intervals? My approach would have been to try to do that in the RTS, but this might work as well.

Peter, do your DWARF improvements handle FFI calls (traversing from C back into Haskell)?

I saw it work at least once - and when in doubt, the machinery for generating DWARF unwind from Cmm is flexible enough that there is probably a way to make it work if it doesn't already. Going from Haskell back to C is probably trickier.

Bottom line is that even with with the points you mention, I am still a fan of the "leave it to the RTS"-approach. My main reason is that we already have (and need) quite a bit of functionality in the RTS, so having an external program re-implement stack walking and symbol lookup seems awkward to me.

So maybe the solution here is to make the RTS more powerful. Could it be a possibility to use the mentioned libraries to displace libdwarf in the RTS? It's already a somewhat uncommon dependency, so if the switch would buy us unwinding of the C stack and better portability, it might work out quite nicely.

comment:57 Changed 3 months ago by blitzcode

This is all quite tricky since there seem to be so many different approaches to consider! ;-)

For profiling, I am mostly coming at this problem from the perspective of already having an external tool that works for C/C++ and wanting to add Haskell support to it (it's a sampling profiler, pausing threads and capturing their stack, as you already said). I think you make a very valid point about already having all the functionality in the RTS (eventually). I don't have any technical / ideological reason to prefer an external profiler, what I'm after is seeing profiling data in realtime, while the program is running. If the RTS is collecting profiling data it could either provide some kind of IPC API to have an external tool collect it, or maybe just have a Haskell API so one could use a library like EKG to show it to the user.

I couldn't really figure out how the Haskell to C stack transition works, but I think that's a less common case than the Haskell to C one anyway. Glad to hear that the latter seems to be working!

I don't agree that having an external program do a stack walk is necessarily complicated or awkward. With your DWARF3 work, one should be able to use something like libunwind-ptrace (http://www.nongnu.org/libunwind/man/libunwind-ptrace%283%29.html) and have that functionality available without any custom work.

For debugging, I think there's merit for an external tool. For RTS exceptions we're ok, but as soon as we have an actual segfault, some memory corruption has often already taken place. Any kind of stack traversal, symbol lookup and error reporting code will have a decent chance of failing itself due to a corrupted heap or such.

The only library that I can think of would be libunwind. It's supported on Linux/BSD, Apple ships some kind of version of it on OS X (there are some limitations / issues with their port, IIRC), does DWARF2/3. It's not on Windows, but there's build-in support for stack walking through the DbgHelp? library (StackWalk64, no idea if Windows PDB debug information supports the kind of wizardry your using to explain the STG stack to a DWARF3 debugger, though). It seems generating the right kind of debug symbols, and supporting and doing stack walking for Haskell & C/C++ on all three platforms is a major challenge...

comment:58 Changed 2 months ago by scpmw

Status update: I have now finished polishing up the initial source-note patches for tracking source code relations throughout the Core and Stg stages:

https://github.com/scpmw/ghc/commits/profiling-review2

This is in many ways the most tricky part of the whole endeavour, as it touches pretty much every single pass that does something interesting with Core. It also extends the whole tickish framework a bit, which means cost-centres, HPC ticks and breakpoints have had some of their code rebuilt.

On the other hand, it now passes the whole testsuite (minus T4891 and T5435_dyn_asm, which fail on master for me), and I tested that '-g' causes no extra allocations across nofib. I *think* it should be merge-ready for 7.9.

Notes for review (Simon?):

  • I tried to be chatty with the comments and spell out exactly how I understand the properties of tickishs. Some new tickish flags and lots of documentation in CoreSyn.lhs.
  • Given that we now support multiple ticks in the tree at the same time, I had to give mkTick a rewrite.
  • The change to cheapEqExpr is a bit awkward, but it was the only cheap way I could think of for making it look through ticks.
  • Small changes with their own commit are optional changes or refactoring.

Next up will the Cmm support, which so far still needs work on the CmmContext? replacement.

comment:59 Changed 2 months ago by simonmar

Thanks, I'll find some time to look through these in detail. One question straight off: in the comment on SourceNote? (which is good, BTW), you wrote

results of annotated code *must* be invariant against changes to all source code *except* the spans referenced in the source notes.

But surely the value of an expression depends on the values of its free variables? The statement only seems to be true for closed expressions.

comment:60 Changed 2 months ago by scpmw

Good point. I could claim that I was arguing from the point of view just before evaluating the expression, but this needs clarification.

comment:61 Changed 2 months ago by simonmar

@simonpj: you might want to take a look at these patches before they land, since they make a lot of changes in coreSyn and the simplifier. Start here, and in particular look at SourceNote for high-level comments: https://github.com/scpmw/ghc/commit/d233ea4da0a5725d4d29a3fb68b9b8f7c9bc6d32#diff-8b5b79c6cb1af2b387fa2d814cd0d3f6R1638

comment:62 Changed 2 months ago by simonpj

Thanks for the heads-up. Can anyone point me to a single write-up of

  • The basic idea
  • A specification, as precise as possible. (E.g. given a proposed Core-Core transformation, does the transformation meet the specfification or not? There are a lot of side conditions like "if tickishSoftScope then..." in the code.)
  • An overview of how the implementation works

supported by examples.

There are lots of related design ideas sketched here ExplicitCallStack. If there's a serious proposal to adopt this one, it would be helpful to have a write up in at least this level of detail, perhaps more. Maybe it exists already.

Simon

comment:63 Changed 2 months ago by simonmar

Simon - the best place to start is Peter and David's paper from last year's Haskell Symposium: http://dl.acm.org/citation.cfm?id=2503788

I think the next level of detail is the comments in the code. If you point to places where more explanation is needed, I'm sure Peter will be happy to add it.

This is orthogonal to other methods for stack tracing, including the existing cost-centre-stack profiling. What these patches do is establish a best-effort mapping from object code to source code, without affecting what code gets generated at all. It should have zero effect on optimisation - that's the main point. Then we can use the information later to try to make sense of the stack, and for profiling. There are follow-on patches to add support for these.

comment:64 Changed 2 months ago by Tarrasch

Yes, it would maybe be less confusing if there was a separate ticket for Peter's initial patches. The ticket could be named "Implement an object code to source code mapping" or maybe just "Emit DWARF in NCG".

comment:65 Changed 2 months ago by scpmw

given a proposed Core-Core transformation, does the transformation meet the specfification or not?

Simon covered the basic idea pretty well. From the transformation point of view, the general rule of thumb is that source notes want to stay on covered code (even if transformed), but don't care too much about new code getting added in ("soft scope"). So we can for example work with fusion by simply floating "upwards":

src<1.hs> foldr k z (src<2.hs> build g)
===> src<1.hs> src<2.hs> g k z

Cost-centres would disallow this kind of transformation, as the new stacks have little to do with the old one. Source notes on the other hand want the transformation to still happen, so we just annotate the new expression with everything that appeared on the deconstructed source expression. The paper goes into (way) more detail here, but the short version is that while not always optimal, this approach is always legal.

The only special case so far is let- and case-floating, where we run into the same problems as cost-centres (and use pretty much the same solution). Additionally we decide to not care too much about allocation costs at a number of points where it would have made things too messy.

comment:66 Changed 2 months ago by simonpj

I would really, really like a wiki page that explains all this. Reverse-engineering it from a 65-comment trail is very labour intensive. The wiki page can by all means point to the paper (preferably not behind a paywall), explain the principles, etc. That might not take many words -- if so, that's great!

Meanwhile I will go back to the paper

Simon

comment:67 Changed 6 weeks ago by simonpj

Simon M and I took a look today. We added a few notes to the github thing ​https://github.com/scpmw/ghc/commit/d233ea4da0a5725d4d29a3fb68b9b8f7c9bc6d32#diff-8b5b79c6cb1af2b387fa2d814cd0d3f6R1638

The major stumbling block is giving a really clear description of what each of the various predicates on ticks means: tickishScoped, tickishSoftScoped, tickishCounts, tickishIsCode etc.

We might want to rename some of them. Eg tickisScoped and tickishSoftScoped might be replaced by canFloatCodeInsideThisTick and canFloatCodeOutsideThisTick, perhaps. Each call site has asks a question about the equivalence of two pieces of code, and the predicates should probably have an example drawn from every call site. That might make the correct generalisation clear.

Simon

comment:68 Changed 5 weeks ago by scpmw

Okay, I have now submitted an (hopefully) improved version:

https://github.com/scpmw/ghc/pull/6

The main new idea is that instead of tickishScoped+tickishSoftScope and tickishIgnoreCheap respectively we now have two main tickish properties:

  • Scoping (Source): Determines what transformations are allowed with respect to covered code. The three levels are NoScope for don't-care (= HPC ticks), SoftScope for no-float-out and CostCentreScope for no-float-out plus no-float-in modulo lambdas.
  • Placement (Source): It occurred to me that mkTick has a very similar problem with characterising where to create ticks / what to automatically push them through. So I did basically the same thing - the three levels here are PlaceRuntime which basically floats through just type applications and casts, PlaceNonLam which additionally floats through lambdas, and PlaceCostCentre which again implements all the CC-specific rules (such as eliminating ticks on HNFs).

Note that the difference between PlaceRuntime and PlaceNonLam is essentially tickishCounts. However, it's actually easier this way, as it allows mkTick to do its work without having to care about tickishCounts at all (Source).

Furthermore I tried to update all documentation and add more comments all around. What is the last verdict on a Wiki page? Sorry if I'm a bit reluctant about this, but I can't think of anything offhand that I would want to put on a Wiki instead of into a code comment. Some more examples, possibly?

Btw: A non-paywall link to the paper is http://eprints.whiterose.ac.uk/76448/

Last edited 5 weeks ago by scpmw (previous) (diff)

comment:69 Changed 4 weeks ago by simonpj

Concerning the patch, I just looked at the CoreSyn changes in https://github.com/scpmw/ghc/commit/d829a088a98b1d716ad4d51593d22deae0e97d0a.

I think it's a real improvement.

I've left a couple of notes about Ord instances, which I hate.

I still find the language that specifies what "placement" or "scoping" is, hard to understand. I'd urge you to add a list of all the transformations you can think of, in a table, with their placement/scoping properties. Something like this:

     Transformation    
Before                      After                             Placement        Scoping
----------------------------------------------------------------
tick t (\x.e)              \x.tick t e                        PlaceNonLam     n/a
tick t (let x = e in b)    let x = e in tick t b                  --- never --
tick t (let x = e in b)    let x = tick t e in tick t b          n/a          Soft
(tick t fun) arg           tick t (fun arg)                         ...etc...
...etc...

That would bring focus to the general descriptions. There should pretty be a line of the table for each call that makes use of the properties to allow, or disallow, a transformation.

Oh, maybe you need a column for "counts" which is (I believe) a third property of a tick that, with scoped/placement, says which transformations are valid. Maybe you want a third data type rather than a boolean, to highlight this fact?

Some comments with tickishScoped would be useful. Why are HPC ticks NoScope? That sounds as though we can move code outside an HPC tick so it is no longer covered. Or something.

As the the wiki page, there is stuff that you take for granted but which forms the big picture

  • The fact that semantically ticks never have significance.
  • What it means for a tick to "cover" some code.
  • The fact that (I think that) ticks have lexical scope, from the point of view of what it means to be "covered".
  • A little summary of the various kinds of ticks, their significance and a rough sketch of how they work. For example, your new SourceNotes turn into something attached to DWARF records which is somehow recovered by a debugger. It's hard to get from the SourceNote data constructor to all that relevant other stuff in the compiler.

Of course you can put all this big-picture stuff in CoreSyn with the data type declarations, but it's getting big. Maybe you want a module for the Tick stuff that CoreSyn imports, where we can accumulate this overview material.

Broadly speaking, though, it looks fine to me.

Simon

Note: See TracTickets for help on using tickets.