Opened 12 months ago

Closed 6 weeks ago

Last modified 3 weeks ago

#14758 closed bug (fixed)

Retainer profiler can overflow the C stack

Reported by: bgamari Owned by: qnikst
Priority: normal Milestone: 8.6.3
Component: Profiling Version: 8.6.1
Keywords: newcomer Cc: osa1, maoe
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #15287 Differential Rev(s): Phab:D5351
Wiki Page:

Description

I'm not entirely sure what conditions trigger this, but I am observing a reliable segmentation fault with a program with large heap compiled with 8.4.1-alpha3 and run with retainer profiling enabled. Judging by the fact that the crashing instruction is a mov _, ($rsp), I'm reasonable certain that the issue is a C stack overflow. The top of the stack looks like,

#0  0x000000000249212c in retainClosure (c0=0x42af3459b8, cp0=cp0@entry=0x42af347000, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1488                                        
#1  0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af347000, bitmap=<optimized out>, size=<optimized out>, p=0x42af347260) at rts/RetainerProfile.c:1209   
#2  retainStack (c=c@entry=0x42af347000, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af347370) at rts/RetainerProfile.c:1350                    
#3  0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af345b28, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#4  0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3473e0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#5  0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af3473e0, bitmap=<optimized out>, size=<optimized out>, p=0x42af347690) at rts/RetainerProfile.c:1209   
#6  retainStack (c=c@entry=0x42af3473e0, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af347750) at rts/RetainerProfile.c:1350                    
#7  0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af345d88, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#8  0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3477c0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#9  0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af3477c0, bitmap=<optimized out>, size=<optimized out>, p=0x42af347a70) at rts/RetainerProfile.c:1209   
#10 retainStack (c=c@entry=0x42af3477c0, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af347b30) at rts/RetainerProfile.c:1350                    
#11 0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3481a8, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#12 0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af347ba0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#13 0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af347ba0, bitmap=<optimized out>, size=<optimized out>, p=0x42af347e50) at rts/RetainerProfile.c:1209   
#14 retainStack (c=c@entry=0x42af347ba0, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af347f10) at rts/RetainerProfile.c:1350                    
#15 0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af348408, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#16 0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af349000, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#17 0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af349000, bitmap=<optimized out>, size=<optimized out>, p=0x42af3492b0) at rts/RetainerProfile.c:1209   
#18 retainStack (c=c@entry=0x42af349000, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af349370) at rts/RetainerProfile.c:1350                    
#19 0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af348668, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#20 0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3493e0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#21 0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af3493e0, bitmap=<optimized out>, size=<optimized out>, p=0x42af349690) at rts/RetainerProfile.c:1209   
#22 retainStack (c=c@entry=0x42af3493e0, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af349750) at rts/RetainerProfile.c:1350                    
#23 0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3488c8, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#24 0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af3497c0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
#25 0x00000000024932b0 in retain_small_bitmap (c_child_r=0x42bc4fd1a0, c=0x42af3497c0, bitmap=<optimized out>, size=<optimized out>, p=0x42af349a70) at rts/RetainerProfile.c:1209   
#26 retainStack (c=c@entry=0x42af3497c0, c_child_r=c_child_r@entry=0x42bc4fd1a0, stackStart=<optimized out>, stackEnd=0x42af349b30) at rts/RetainerProfile.c:1350                    
#27 0x0000000002492870 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af348b28, r0=r0@entry=0x2a5ac20 <CCS_SYSTEM>) at rts/RetainerProfile.c:1686                           
#28 0x0000000002492887 in retainClosure (c0=<optimized out>, cp0=cp0@entry=0x42af349ba0, r0=r0@entry=0x42bc4fd1a0) at rts/RetainerProfile.c:1695                                     
...

and this goes on for at least 30000 frames. It looks very much like this is a bug in the retainer profiler.

Change History (11)

comment:1 Changed 7 months ago by bgamari

I suspect that #15287 is another manifestation of this.

comment:2 Changed 4 months ago by osa1

Keywords: newcomer added
Priority: highnormal
Version: 8.4.1-alpha18.6.1

Just a status update: we discussed this a few weeks ago in a meeting. This is easy to fix, just replace recursive calls to retainClosure (directly or indirectly via retainStack) with stack pushes. For this we need to add new stack element types handle those in retainClosure (which is where we pop the stack). Not too hard to do. In the end we weren't sure that retainer profiler is too useful in practice, so we did not prioritize this (I'm updating the ticket priority now to reflect this).

If you're using retainer profiler for anything useful and suffering from this bug, let us know.

Also, this seems like a great newcomer ticket to me. The changes are only in one file (RetainerProfile.c) and you only need to know GHC heap object layout and some C. If anyone's interested in working on this let me know and I can give more detailed instructions.

comment:3 Changed 4 months ago by osa1

Cc: osa1 added

comment:4 Changed 2 months ago by maoe

Cc: maoe added

comment:5 Changed 2 months ago by qnikst

Owner: set to qnikst

I'm going to work on this ticket during Munihac.

comment:6 Changed 2 months ago by qnikst

Differential Rev(s): Phab:D5351
Status: newpatch

comment:7 Changed 2 months ago by bgamari

Milestone: 8.6.3

Let's try to get this in to 8.6.3.

comment:8 Changed 6 weeks ago by Ömer Sinan Ağacan <omeragacan@…>

In 5f1d949/ghc:

Remove explicit recursion in retainer profiling (fixes #14758)

Retainer profiling contained a recursion that under
certain circumstances could lead to the stack overflow
in C code.

The idea of the improvement is to keep an explicit stack for the
object, more precise to reuse existing stack, but allow new type of
objects to be stored there.

There is no reliable reproducer that is not a big program
but in some cases foldr (+) 0 [1..10000000] can work.

Reviewers: bgamari, simonmar, erikd, osa1

Reviewed By: bgamari, osa1

Subscribers: osa1, rwbarton, carter

GHC Trac Issues: #14758

Differential Revision: https://phabricator.haskell.org/D5351

comment:9 Changed 6 weeks ago by osa1

Status: patchmerge

comment:10 Changed 6 weeks ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:11 Changed 3 weeks ago by Ben Gamari <ben@…>

In 6d9d6f9a/ghc:

testsuite: Enable T11627a on Darwin

The retainer profiler no longer uses the C stack for its mark stack (#14758).
Consequently even the small C stack provided on Darwin should be sufficient to
run this test. See #11627
Note: See TracTickets for help on using tickets.