Opened 2 years ago

Closed 12 months ago

#10840 closed bug (fixed)

Periodic alarm signals can cause a retry loop to get stuck

Reported by: Rufflewind Owned by: bgamari
Priority: high Milestone: 8.2.1
Component: Compiler Version: 7.10.2
Keywords: Cc: simonmar, hamish, hsyl20
Operating System: MacOS X Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2796
Wiki Page:

Description

The periodic alarm signals emitted by the GHC RTS can hang the program under certain circumstances. In particular, any retry-loop of the form

while (interruptible_syscall() == FAILED && errno == EINTR);

can cause a hang if the syscall takes a long time and cannot be resumed, as it will forced to restart from the beginning each time it gets interrupted. If the syscall takes longer than the interval between successive alarm signals, it will be stuck in this loop forever. This was found to occur with statfs64 (indirectly called by realpath) and open on SSHFS on Mac OS X.

Using a safe foreign import seems to mitigate the issue, but according to Simon Marlow this is probably an accident.

So far there seems to be no way to guarantee the suspension of signals except through low-level tools like pthread_setmask.

Attachments (1)

0001-rts-timer-use-timerfd_-on-Linux-instead-of-signals.patch (2.5 KB) - added by hsyl20 21 months ago.
Make the RTS use timerfd_* on Linux

Download all attachments as: .zip

Change History (23)

comment:1 Changed 2 years ago by simonmar

Cc: simonmar added

comment:2 Changed 2 years ago by hamish

Cc: hamish added

comment:3 Changed 2 years ago by hsyl20

Cc: hsyl20 added

comment:4 in reply to:  description Changed 21 months ago by hsyl20

On Mac OS, you can try to enable USE_PTHREAD_FOR_ITIMER in rts/posix/Itimer.c to make the RTS avoid using alarm signals. Maybe we should make it a flag (or the default) for the threaded RTS?

On Linux the same problem may occur. We can use the same solution and even improve it by using the timerfd_* syscalls instead of usleep (cf the following attachment).

Changed 21 months ago by hsyl20

Make the RTS use timerfd_* on Linux

comment:5 Changed 21 months ago by simonmar

I like this patch a lot, those signals have been causing problems for a long time. Want to put it on Phabricator so we can review and get it in?

comment:6 Changed 21 months ago by hsyl20

Done: https://phabricator.haskell.org/D1947

A potential issue is that these syscalls have been introduced in 2.6.25. Do we want to support older kernels?

The issue remains for Mac OS X.

comment:7 Changed 21 months ago by Ben Gamari <ben@…>

In 120b9cdb/ghc:

rts/timer: use timerfd_* on Linux instead of alarm signals

Reviewers: erikd, simonmar, austin, bgamari

Reviewed By: simonmar, bgamari

Subscribers: hvr, thomie

Differential Revision: https://phabricator.haskell.org/D1947

GHC Trac Issues: #10840

comment:8 Changed 21 months ago by bgamari

Milestone: 8.0.1
Status: newmerge

This seems like a bad enough issue that we may want to merge to 8.0. We'll need 120b9cdb as well as 8626d76a (to fix #11697).

Last edited 21 months ago by bgamari (previous) (diff)

comment:9 Changed 21 months ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:10 Changed 21 months ago by hsyl20

Resolution: fixed
Status: closednew

The ticket was for Mac OS X. We only solved the issue on Linux.

comment:11 Changed 20 months ago by thoughtpolice

Priority: normalhigh

I'm bumping the priority on this and keeping it in 8.0.1, as it seems rather bad to leave open on OS X, especially if the fix in comment:4 can work (use USE_PTHREAD_FOR_ITIMER, if I'm following correctly). Would anyone like to test this?

comment:12 Changed 20 months ago by bgamari

Owner: set to bgamari

I can test comment:4.

comment:13 Changed 20 months ago by bgamari

Unfortunately I have observed non-reproducible hangs while validating on OS X with USE_PTHREAD_FOR_ITIMER.

comment:14 Changed 20 months ago by bgamari

Milestone: 8.0.18.0.2

This won't happen for 8.0.1.

comment:15 Changed 19 months ago by bgamari

Unfortunately it seems that the patch as merged in comment:9 has a subtle race condition (see #11830) and introduces unnecessary wake-ups (see #11965, #1623). I'm going to revert this for 8.0.1 and perhaps we can give it another try in 8.0.2 if this issues have been sorted.

comment:16 Changed 19 months ago by hsyl20

In the threaded RTS, on platforms without "timerfd" we could create a "ghc ticker" thread too and block the alarm signal in all threads except this one. I think this would fix the reported issue on OS X.

comment:17 in reply to:  16 Changed 19 months ago by hsyl20

Replying to hsyl20:

In the threaded RTS, on platforms without "timerfd" we could create a "ghc ticker" thread too and block the alarm signal in all threads except this one. I think this would fix the reported issue on OS X.

Forget this idea, @simonmar told me that it won't work nicely with threads that don't belong to GHC.

comment:18 Changed 19 months ago by bgamari

The hangs should now be fixed. I'll try reenabling USE_PTHREAD_FOR_ITIMER once I have access to the OS X test again.

comment:19 Changed 18 months ago by thomie

Milestone: 8.0.28.2.1
Operating System: Unknown/MultipleMacOS X

If I understand correctly, this is just an OS X issue now.

And the commits for #11830, #11965 and this ticket have not been merged to the ghc-8.0 branch.

comment:20 Changed 12 months ago by bgamari

Differential Rev(s): Phab:D2796
Status: newpatch

Here is a patch enabling the pthread-based itimer implementation on Darwin. Let's see how Harbormaster fares.

comment:21 Changed 12 months ago by Ben Gamari <ben@…>

In d70d452a/ghc:

rts: Use pthread itimer implementation on Darwin

We want to avoid using SIGALRM whenever possible since we will interrupt
long-running system calls. See #10840.

Test Plan: Validate on Darwin

Reviewers: austin, erikd, simonmar

Reviewed By: simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2796

GHC Trac Issues: #10840

comment:22 Changed 12 months ago by bgamari

Resolution: fixed
Status: patchclosed
Note: See TracTickets for help on using tickets.