Opened 3 years ago

Closed 3 years ago

Last modified 8 weeks ago

#11627 closed bug (fixed)

Segmentation fault for space_leak_001 with profiling (-hc)

Reported by: thomie
Owned by: jme
Priority: high
Milestone: 8.0.1
Component: Profiling
Version: 7.10.3
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime crash
Test Case: perf/space_leaks/space_leak_001
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D2005
Wiki Page:

Description (last modified by bgamari)

WAY=profasm is omitted by default for this test, but the code looks like this:

Test.hs:

import Data.List
main = print $ length $ show (foldl' (*) 1 [1..100000] :: Integer)

$ ghc Test.hs -prof -O
$ ./Test +RTS -hc
Segmentation fault (core dumped)

Reproducible with at least 7.10.3 and HEAD, also without -O.

Change History (11)

comment:1 Changed 3 years ago by jme

Owner: set to jme

The segfault occurs because shrinkMutableByteArray# and resizeMutableByteArray# introduce slop at the ends of the large_object MutableByteArray#s holding the Integers. Since the arrays are large, they are not copied during GC, so this slop is still present when the heap census is run (after GC). But when heapCensusChain encounters a shrunken MutableByteArray#, it thinks the slop following the array is another closure, and chaos quickly ensues.
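
For readers unfamiliar with the primops named above, the following is a minimal sketch of shrinking a byte array in place. The wrapper type `MBA` and the helpers `newShrunk` and `sizeOf` are hypothetical names introduced for illustration, and the 1 MiB allocation is assumed to be large enough for the RTS to treat the array as a large object. It only demonstrates the shrink itself; it is not claimed to reproduce the crash.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO (..))

-- Hypothetical boxed wrapper so the unlifted array can be passed around in IO.
data MBA = MBA (MutableByteArray# RealWorld)

-- Allocate n bytes, then shrink the same array in place to m bytes (m <= n).
-- The array is not copied; only its recorded size changes.
newShrunk :: Int -> Int -> IO MBA
newShrunk (I# n) (I# m) = IO $ \s0 ->
  case newByteArray# n s0 of
    (# s1, mba #) ->
      case shrinkMutableByteArray# mba m s1 of
        s2 -> (# s2, MBA mba #)

-- Read back the current size in bytes.
sizeOf :: MBA -> IO Int
sizeOf (MBA mba) = IO $ \s0 ->
  case getSizeofMutableByteArray# mba s0 of
    (# s1, n #) -> (# s1, I# n #)

main :: IO ()
main = do
  arr <- newShrunk (1024 * 1024) 16  -- assumed to be a large object
  n <- sizeOf arr
  print n  -- prints 16: the array reports the shrunk size
```

The bytes between the new size and the originally allocated extent are the slop referred to above.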

The fix should be straightforward.
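
The failure mode can be illustrated with a toy model, written here in Haskell rather than the C of the RTS. Everything in it is an assumption made for illustration: the heap block is a plain list of `Int` words, an object is one header word holding its payload length followed by the payload, and `largeCensus` is only one plausible way of special-casing large objects, not the code from the eventual patch.

```haskell
module Main (main) where

-- Toy layout: one header word (the payload length), then the payload words.
object :: [Int] -> [Int]
object payload = length payload : payload

-- Shrink the first object in place: only the header changes, so the old
-- payload words beyond the new length stay behind as slop.
shrinkFirst :: Int -> [Int] -> [Int]
shrinkFirst _ [] = []
shrinkFirst newLen (_ : rest) = newLen : rest

-- Naive census: assumes objects are packed back to back with no slop.
-- Returns the payload length of every "object" it believes it found.
naiveCensus :: [Int] -> [Int]
naiveCensus [] = []
naiveCensus (n : rest) = n : naiveCensus (drop n rest)

-- Census for a block known to hold a single large object: record that one
-- object and ignore whatever follows it in the block.
largeCensus :: [Int] -> [Int]
largeCensus [] = []
largeCensus (n : _) = [n]

main :: IO ()
main = do
  let block  = object [7, 7, 7, 7]   -- [4,7,7,7,7]
      shrunk = shrinkFirst 2 block   -- [2,7,7,7,7]
  print (naiveCensus block)   -- [4]
  print (naiveCensus shrunk)  -- [2,7]: slop misread as a second object
  print (largeCensus shrunk)  -- [2]
```

In the toy model the misread merely yields a phantom object of length 7; in the real RTS the slop would be interpreted as a closure header, which is where the crash comes from.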

comment:2 Changed 3 years ago by thomie

Nice find.

I do wonder why arrays need to ever be shrunk for this example. The Integer only increases in size.

comment:3 Changed 3 years ago by jme

Thanks.

Although I didn't verify it, I believe the culprit is the show, which repeatedly calls quotRemInteger to divide the result into smaller chunks.

comment:4 Changed 3 years ago by jme

Actually, in show, it is also possible for the p*p in jsplitf to trigger a segfault. If p is w words long, 2w words are initially allocated for the result of the multiply. But if the most significant word turns out to be 0, the result is shrunk by a word.
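
The word-count arithmetic above can be checked with a small sketch. It assumes 64-bit machine words, and `wordsOf` is a hypothetical helper that counts words by repeated division; it does not inspect the actual Integer representation.

```haskell
module Main (main) where

-- Number of 64-bit words needed to hold a non-negative Integer
-- (0 is counted as needing no words in this sketch).
wordsOf :: Integer -> Int
wordsOf n
  | n <= 0 = 0
  | otherwise = 1 + wordsOf (n `div` (2 ^ (64 :: Int)))

main :: IO ()
main = do
  let p1 = 2 ^ (64 :: Int) :: Integer       -- w = 2, small top word
      p2 = 2 ^ (128 :: Int) - 1 :: Integer  -- w = 2, all bits set
  print (wordsOf p1, wordsOf (p1 * p1))  -- (2,3): product needs 2w - 1 words
  print (wordsOf p2, wordsOf (p2 * p2))  -- (2,4): product needs the full 2w words
```

The first case is the one where a 2w-word result buffer would end up one word too long and be shrunk.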

comment:5 Changed 3 years ago by jme

Differential Rev(s): Phab:D2005
Status: new → patch

comment:6 Changed 3 years ago by Ben Gamari <ben@…>

In ba95f22e/ghc:

prof: Fix heap census for large ARR_WORDS (#11627)

The heap census now handles large ARR_WORDS objects which have
been shrunk by shrinkMutableByteArray# or resizeMutableByteArray#.

Test Plan: ./validate && make test WAY=profasm

Reviewers: hvr, bgamari, austin, thomie

Reviewed By: thomie

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2005

GHC Trac Issues: #11627

comment:7 Changed 3 years ago by bgamari

Status: patch → merge

comment:8 Changed 3 years ago by bgamari

Description: modified
Milestone: 8.0.1
Resolution: fixed
Status: merge → closed
Type of failure: None/Unknown → Runtime crash
Last edited 3 years ago by bgamari

comment:9 Changed 8 months ago by Ben Gamari <ben@…>

In f0179e3/ghc:

testsuite: Skip T11627a and T11627b on Darwin

Darwin tends to give us a very small stack which the retainer profiler tends to
overflow. Strangely, this manifested on CircleCI yet not Harbormaster.

See #15287 and #11627.

comment:10 Changed 8 weeks ago by Ben Gamari <ben@…>

In 6d9d6f9a/ghc:

testsuite: Enable T11627a on Darwin

The retainer profiler no longer uses the C stack for its mark stack (#14758).
Consequently even the small C stack provided on Darwin should be sufficient to
run this test. See #11627

comment:11 Changed 8 weeks ago by Ben Gamari <ben@…>

In 9937820/ghc:

testsuite: Fix a variety of issues when building with integer-simple

 * Mark arith011 as broken with integer-simple

   As noted in #16091, arith011 fails when run against integer-simple with a
   "divide by zero" exception. This suggests that integer-gmp and integer-simple
   are handling division by zero differently.

 * This also fixes broken_without_gmp; the lack of types made the previous
   failure silent, sadly. Improves situation of #16043.

 * Mark several tests implicitly depending upon integer-gmp as broken
   with integer-simple. These expect to see Core coming from integer-gmp,
   which breaks with integer-simple.

 * Increase runtime timeout multiplier of T11627a with integer-simple

   I previously saw that T11627a timed out in all profiling ways when run against
   integer-simple. I suspect this is due to integer-simple's rather verbose heap
   representation. Let's see whether increasing the runtime timeout helps.

   Fixes test for #11627.

This is all in service of fixing #16043.