Changes between Version 20 and Version 21 of DebuggingGhcCrashes


Ignore:
Timestamp:
Aug 27, 2009 10:44:25 AM (6 years ago)
Author:
simonmar
Comment:

moved to Debugging/CompiledCode

Legend:

Unmodified
Added
Removed
Modified
  • DebuggingGhcCrashes

    v20 v21  
    1 = Debugging GHC-compiled code with gdb =
    2 
    3 If you're working on a GHC port, tweaking the runtime, or modifying
    4 the code generator, at some point you will have to become familiar
    5 with how to use [http://www.gnu.org/software/gdb/ gdb] for debugging
    6 crashes in GHC-compiled programs.
    7 
    8 This is quite scary, for a few reasons:
    9 
    10  * We can't do source-level debugging of Haskell code.  gdb doesn't
    11    know anything about Haskell.  We're stuck with assembly-level
    12    debugging of Haskell code.
    13 
    14  * Most of the time you can't even get a backtrace inside GHC-compiled
    15    code.  The GHC calling conventions are entirely different from C's,
    16    and GHC-compiled code doesn't use the C stack.    There are a
    17    couple of exceptions:
    18    * when the crash happens inside the runtime, you can use the usual
    19      C debugging facilities of gdb with full backtrace and
    20      source-level debugging.  (make sure you use a {{{-debug}}}
    21      version of the runtime to get the debug info, though, and if
    22      the bug happens with the non-threaded runtime then that will also
    23      simplify things).
    24    * when the crash happens inside a foreign call invoked by Haskell
    25      code.  In this case you'll be able to see which Haskell fragment
    26      invoked the foreign call (it'll be something like
    27      {{{MyModule_zdwccall13_info}}}, more about symbol names later).
    28 
    29  * In fact, there ''are'' no useful backtraces in Haskell code,
    30    because of lazy evaluation and tail-calls.  Finding out "how we got
    31    here" is often a challenge.
    32 
    33 == The recipe ==
    34 
    35 Usually you want to do something along these lines:
    36 
    37  * Look at the advice on the [wiki:Building/DebuggingGHC Debugging GHC page].
    38 
    39  * Compile with {{{-dcore-lint}}} to make sure the crash isn't caused by the compiler
    40    generating incorrect code in a way that can be detected statically.
    41 
    42  * Compile with {{{-debug}}}!  This links the program with a debugging
    43    version of the runtime, that includes extra assertions, debugging
    44    code, and debugging info for gdb.
    45 
    46  * Don't compile with {{{-threaded}}} (assuming the bug still happens
    47    with the non-threaded runtime).
    48 
    49  * Grab our [attachment:".gdbinit" gdb macros].
    50 
    51  * Run the program in gdb until it crashes, type {{{where}}} to find
    52    out whether the crash is in the RTS or in Haskell code.  If you see
    53    an uninformative backtrace with a symbol like {{{s32y_info}}} at
    54    the top, then you're in Haskell.  If you're in the runtime, then
    55    we don't have anything more to tell you here, you can use gdb's
    56    full C debugging facilities to track down the problem.
    57 
    58  * Make the crash happen as early as possible.  {{{-debug}}} already
    59    turned on lots of assertions, one of which might trigger.  Also,
    60    try {{{+RTS -DS}}} which turns on a bunch of extra "sanity
    61    checking", i.e. expensive assertions about the state of the runtime
    62    at regular points.  One thing this enables is a full sweep of the
    63    heap after each garbage collection to make sure there are no
    64    dangling references.  Also, it fills all free memory with the value
    65    {{{0xaaaaaaaa}}} (the sound of GHC disappearing down a hole).
    66 
    67  * Look at the fragment of code that crashed: {{{disassemble}}}.  It
    68    should be pretty small (code blocks in GHC-generated code usually
    69    are).  Look at the instruction that caused the crash - the address
    70    is given in the {{{where}}} output earlier.  If you can tell what
    71    the problem is right there, then great.
    72 
    73  * Poke around in the heap and stack and see if you can figure out
    74    what has gone wrong.  See "poking around", below.
    75 
    76  * Make the crash happen repeatably, if possible.  That means after
    77    the exact same number of garbage collections and context switches
    78    each time. Use {{{+RTS -C0}}} to turn off timer-based context switches.  If
    79    you're debugging heap profiling, add {{{-i0}}} too.
    80 
    81  * If you still haven't figured it out, you probably need to go back in
    82    time a bit to find out what was happening just before the crash.
    83    This is the tricky bit, because going back in time isn't directly
    84    supported by gdb, we need to find a way to hit a breakpoint a short
    85    time before the crash happened.  See "going back in time", below.
    86 
    87 == Going back in time ==
    88 
    89  * Set a breakpoint on a code fragment that is regularly executed.
    90    Some good examples are {{{stg_upd_frame_info}}} (the standard
    91    update code), {{{stg_upd_frame_1_info}}} (the update code for a
    92    1-tagged object, eg. a cons cell), {{{stg_ap_p_info}}} (the apply
    93    code for a single pointer argument).
    94 
    95    {{{break stg_upd_frame_info}}}
    96 
    97  * Ignore that breakpoint for ever
    98 
    99    {{{ignore 1 9999999}}}
    100 
    101  * Run the program
    102 
    103    {{{run}}}
    104 
    105  * Find out how many times the breakpoint was hit
    106 
    107    {{{info break}}}
    108 
    109  * Suppose it was hit 4325 times, then next time we'll ignore it
    110    for 4324 times (i.e. subtract one), which will stop on the 4325th
    111    time, just before the crash.
    112 
    113    {{{ignore 1 4324}}}
    114 
    115  * Run the program
    116 
    117    {{{run}}}
    118 
    119  * Single step until the crash happens
    120 
    121    {{{si}}}...
    122 
    123  * If the crash doesn't happen for a long time, try picking another
    124    break point (e.g. something you stepped through this time). You
    125    might have to repeat this a few times to get to a useful point.
    126 
    127  * Sometimes the crash is a long way into the program, and setting a
    128    frequently accessed breakopint like {{{stg_upd_frame_info}}} means
    129    the program just runs too slowly; in this case you might have to
    130    use a 2-stage strategy instead.  Set a breakpoint on a less
    131    frequently accessed point (a good one I use for this is
    132    {{{GarbageCollect}}}, which stops at each GC), do the ignore thing
    133    on this, and then switch to a more frequently accessed breakpoint
    134    when you are closer to the crash.
    135 
    136 == Decoding symbol names ==
    137 
    138 Symbols in GHC are encoded using something called the Z-encoding (see
    139 [http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs Encoding.hs]).  Basically special symbols are replaced by sequences
    140 beginning with {{{z}}} or {{{Z}}}.  eg. {{{state#}}} becomes
    141 {{{statezh}}}.  The letter {{{z}}} itself is replaced by {{{zz}}}.
    142 
    143 External symbols follow the pattern
    144 ''module''_''identifier''_''kind''.  Where ''module'' is the module
    145 name, ''identifier'' is the identifier in the STG code that this
    146 symbol relates to, and ''kind'' is the kind of symbol:
    147 
    148 ||{{{closure}}}||a static closure||
    149 ||{{{info}}}||an info table, also entry code or return address||
    150 ||{{{con_info}}}||an info table for a constructor||
    151 ||{{{fast}}}||a primitive||
    152 
    153 (see [http://darcs.haskell.org/ghc/compiler/cmm/CLabel.hs CLabel.hs]
    154 for a table of these).
    155 
    156 For example:
    157 
    158   {{{GHCziBase_ZMZN_closure}}}
    159 
    160 Means the static closure for {{{[]}}} in module {{{GHC.Base}}}.
    161 
    162   {{{DataziList_foldlzq_info}}}
    163 
    164 Means the entry code (or info table) for {{{Data.List.foldl'}}}.
    165 
    166 Internal symbols are of the form ''identifier''_''kind'', i.e. the
    167 same as external symbols but without the module name.
    168 
    169 == Poking around ==
    170 
    171 There are two things you usually want to inspect inside a code
    172 fragment: the contents of heap objects, and the contents of the
    173 stack.  A full description of the layout of these things is here: [wiki:Commentary/Rts/Storage]; what follows is a basic introduction to
    174 looking at heap & stack objects in a running program.
    175 
    176 You can display memory in gdb with something like {{{x/4a}}} to
    177 display 4 words of memory, or using our [attachment:".gdbinit" gdb macros] you get slighty
    178 nicer output:
    179 
    180 {{{
    181 (gdb) p4 $rbx
    182 0x2b35cc6c99a0: 0x4482d8 <stg_IND_1_info>
    183 0x2b35cc6c9998: 0x2b35cc6c9978
    184 0x2b35cc6c9990: 0x5
    185 0x2b35cc6c9988: 0x404e30 <GHCziBase_Izh_con_info>
    186 }}}
    187 
    188 {{{p4}}} displays 4 words of memory backwards, i.e. lower addresses at
    189 the bottom.  In this case I'm displaying memory pointed to by the
    190 register {{{rbx}}}, which corresponds to the STG register {{{R1}}} on
    191 a recent x86_64 build.  Check
    192 [http://darcs.haskell.org/ghc/includes/stg/MachRegs.h MachRegs.h] to
    193 see which machine registers correspond to which STG registers on your
    194 platform.
    195 
    196 In the example above, I can see that {{{$rbx}}} points to a heap
    197 closure for the {{{Int}}} value 5.  Closures always consist of an info
    198 pointer ({{{GHCziBase_Izh_con_info}}} in this case, the {{{I#}}}
    199 constructor), followed by any number of payload words (just one word
    200 containing the value 5, here).  Full details on closure layouts are in
    201 [http://darcs.haskell.org/ghc/includes/rts/storage/Closures.h Closures.h].
    202 
    203 It looks like the next word contains garbage, probably because it is
    204 the next free location in the heap.
    205 
    206 We can see the info tables using the {{{pinfo}}} macro:
    207 
    208 {{{
    209 (gdb) pinfo &GHCziBase_Izh_con_info
    210 $2 = {layout = {payload = {ptrs = 0, nptrs = 1}, bitmap = 4294967296,
    211     large_bitmap_offset = 0, __pad_large_bitmap_offset = 0,
    212     selector_offset = 4294967296}, type = 3, srt_bitmap = 0,
    213   code = 0x404e30 "ÿe"}
    214 }}}
    215 
    216 Info tables tell the garbage collector and other parts of the system
    217 about the layout of closures, they are rarely used during actual
    218 execution.  The ''info pointer'' of a closure actually points to the
    219 ''entry code'' for the closure; the info table lives just before the
    220 entry code (this is a trick used by GHC so that the common operation
    221 of jumping to the entry code for a closure can be done with a single
    222 indirection).  The layout of info tables is defined in
    223 [http://darcs.haskell.org/ghc/includes/rts/storage/InfoTables.h InfoTables.h].
    224 
    225 To display the stack, you need to know what the {{{Sp}}} register is
    226 mapped to on this platform.  On x86_64 you'll find it in {{{$rbp}}}:
    227 
    228 {{{
    229 (gdb) p8 $rbp
    230 0x2ae9697ccc00: 0x1
    231 0x2ae9697ccbf8: 0x40b8c8 <GHCziBase_plusInt_info+32>
    232 0x2ae9697ccbf0: 0x2ae9696c9868
    233 0x2ae9697ccbe8: 0x44a2d8 <stg_upd_frame_info>
    234 0x2ae9697ccbe0: 0x1
    235 0x2ae9697ccbd8: 0x40b8c8 <GHCziBase_plusInt_info+32>
    236 0x2ae9697ccbd0: 0x2ae9696c98a8
    237 0x2ae9697ccbc8: 0x44a2d8 <stg_upd_frame_info>
    238 }}}
    239 
    240 It looks like the stack has an update frame on the top, followed by
    241 some stack frame from {{{GHC.Base}}}, followed by another update
    242 frame.
    243 
    244 Return addresses on the stack have info tables just like heap
    245 closures, and you can display them with {{{pinfo}}} in the same way:
    246 
    247 {{{
    248 (gdb) pinfo 0x40b8c8
    249 $4 = {layout = {payload = {ptrs = 65, nptrs = 0}, bitmap = 65,
    250     large_bitmap_offset = 65, __pad_large_bitmap_offset = 65,
    251     selector_offset = 65}, type = 36, srt_bitmap = 0,
    252   code = 0x40b8c8 "I\203Ä\020M9üw$H\213E\bI\003E\bIÇD$ø0N@"}
    253 }}}
    254 
    255 However, return addresses have special info tables with more
    256 information; to display the whole thing, use {{{prinfo}}}
    257 
    258 {{{
    259 (gdb) prinfo 0x40b8c8
    260 $5 = {srt_offset = 4241688, __pad_srt_offset = 6684481, i = {layout = {
    261       payload = {ptrs = 65, nptrs = 0}, bitmap = 65, large_bitmap_offset = 65,
    262       __pad_large_bitmap_offset = 65, selector_offset = 65}, type = 36,
    263     srt_bitmap = 0, code = 0x40b8c8 "I\203Ä\020M9üw$H\213E\bI\003E\bIÇD$ø0N@"}}
    264 }}}
    265 
    266 The {{{type}}} field tells us what kind of object this is, in this
    267 case {{{36}}}}, which means a {{{RET_SMALL}}} stack frame (see
    268 [http://darcs.haskell.org/ghc/includes/rts/storage/ClosureTypes.h ClosureTypes.h]
    269 for a list of closure types, but make sure you are
    270 looking at the right version of this file for the build you're using,
    271 because the types do change).
    272 
    273 == Useful flags ==
    274 
    275 See {{{-ddump-stg, -ddump-simpl, -ddump-cmm, -dppr-debug}}}.
    276 
    277 == Useful hints ==
    278 
    279  * Use the {{{dpc}}} macro if you want to avoid the pain of matching the {{{pc}}} address with instructions at every program step:
    280 {{{
    281 (gdb) dpc
    282 (gdb) si
    283 0x0000000000408a37 in base_GHCziTopHandler_lvl9_info ()
    284 1: x/i $pc  0x408a37 <base_GHCziTopHandler_lvl9_info+7>:        jb     0x408a55 <base_GHCziTopHandler_lvl9_info+37>
    285 (gdb)
    286 0x0000000000408a39 in base_GHCziTopHandler_lvl9_info ()
    287 1: x/i $pc  0x408a39 <base_GHCziTopHandler_lvl9_info+9>:        mov    $0x5cf248,%ebx
    288 (gdb)
    289 0x0000000000408a3e in base_GHCziTopHandler_lvl9_info ()
    290 1: x/i $pc  0x408a3e <base_GHCziTopHandler_lvl9_info+14>:       movq   $0x408a80,0xfffffffffffffff8(%rbp)
    291 (gdb)
    292 }}}
    293    Also note that an empty gdb request repeats the previous command, in this case {{{si}}}.
    294  * Sometimes you need to look and step through the compiled code of Haskell libraries. I find it useful to change the lines 43 and 48 of file {{{mk/suffix.mk}}} as follows:
    295 {{{
    296 43:  $(HC) $(HC_OPTS) -c $< -o $@  -ohi $(basename $@).$(way_)hi -ddump-simpl -ddump-cmm -ddump-stg > $@.output
    297 48:  $(HC) $(HC_OPTS) -c $< -o $@  -ohi $(basename $@).$(way_)hi -ddump-simpl -ddump-cmm -ddump-stg > $@.output
    298 }}}
    299    Now, I can use file {{{libraries/base/GHC/TopHandler.o.output}}} to make sense of what is going on in {{{TopHandler.o}}}.
    300 
    301 == Mapping back to the STG code ==
    302 
    303 To figure out which bit of Haskell code corresponds to the assembly
    304 fragment you're looking at, you need to look at the STG intermediate
    305 code generated by GHC.  Use the {{{-ddump-stg}}} flag.  The reason we
    306 have to look at the STG is becase this is the last phase before code
    307 generation, after all the transformations have happened, and the
    308 symbol names in STG correspond pretty directly to the symbols you see
    309 in the object code.
    310 
    311 For example, if you're stopped in a code fragment {{{s28a_info}}},
    312 then you should be able to search for {{{s28a}}} in the STG output and
    313 find it - but you first have to find which module it comes from, and
    314 the best way to do that is to grep for {{{s28a}}} in all the modules
    315 of your program.
    316 
    317 == When you've found & fixed it ==
    318 
    319  * Tell someone, they'll probably be impressed :-)
    320 
    321  * Try to think about whether adding an assertion somewhere might have
    322    helped you find the bug quicker, and if so, add one.  Test it.  Add
    323    some commentary next to the assertion to help someone else
    324    understand the problem later.
    325  
    326  * Add a test to the testsuite for your bug.
     1[[redirect(wiki:Debugging/CompiledCode)]]