Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#4074 closed bug (fixed)

Forking not possible with large processes

Reported by: nomeata
Owned by: simonmar
Priority: normal
Milestone: 6.12.3
Component: Runtime System
Version: 6.12.1
Keywords:
Cc: 575534@…, marcot@…
Operating System: Linux
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Difficulty: Easy (less than 1 hour)
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

If a Haskell program requires a lot of memory, trying to fork() fails: because of the program's size, the clone() syscall takes a long time and is interrupted by the GHC runtime timer, which restarts the syscall, only for it to be interrupted again.

This happens repeatedly with ghc 6.12, which seems to require noticeably more memory than 6.10, when building large Haskell programs on slower architectures, and it causes some problems for Haskell in Debian.

The problem can also be observed by running a simple C program that malloc’s a lot of memory (in the range of 1G) and then tries to fork with profiling enabled.
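
A rough sketch of such a program is given here for illustration. It is not the demo code from any bug report; the helper name fork_under_timer, the use of ITIMER_VIRTUAL/SIGVTALRM (the signal seen in the strace output of this ticket), and the parameter choices are all assumptions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for sigaction() and setitimer() in strict modes */
#endif
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: the signal only has to be delivered, not acted on. */
static void on_tick(int signo) { (void)signo; }

/*
 * Hypothetical helper: allocate `bytes` (> 0) of memory, arm a repeating
 * virtual-time interval timer with period `tick_usec` (1..999999), then
 * fork(). Returns 0 if the child was created and exited cleanly, -1 on
 * error. On an affected system fork() may be restarted over and over and
 * this function may effectively never return.
 */
int fork_under_timer(size_t bytes, long tick_usec)
{
    char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;
    memset(buf, 1, bytes);          /* touch every page so it is really mapped */

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) != 0) {
        free(buf);
        return -1;
    }

    struct itimerval tick;
    tick.it_interval.tv_sec = 0;
    tick.it_interval.tv_usec = tick_usec;
    tick.it_value = tick.it_interval;
    if (setitimer(ITIMER_VIRTUAL, &tick, NULL) != 0) {
        free(buf);
        return -1;
    }

    pid_t pid = fork();             /* the call that can be interrupted */
    if (pid == 0)
        _exit(0);                   /* child: nothing to do */

    /* Parent (or failed fork): switch the timer off again. */
    struct itimerval off;
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_VIRTUAL, &off, NULL);

    int result = -1;
    if (pid > 0) {
        int status;
        pid_t w;
        do {
            w = waitpid(pid, &status, 0);
        } while (w == -1 && errno == EINTR);
        if (w == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = 0;
    }
    free(buf);
    return result;
}
```

Calling fork_under_timer with a size in the range of 1G and a tick of a few milliseconds, under strace, should show whether fork()/clone() is being restarted with ERESTARTNOINTR on a given machine.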

In the corresponding Debian bug report against libc (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575534), which also has the demo C code, it was suggested that it might be the program’s responsibility to disable such timers while clone() runs.

Do you agree with that? Is it something you can do? Might this be related to #1882 (which mentions timers and fork)?

Change History (8)

comment:1 Changed 4 years ago by simonmar

  • Component changed from Compiler to Runtime System
  • Difficulty set to Easy (less than 1 hour)
  • Milestone set to 6.12.3
  • Owner set to simonmar

I think we probably should disable timers in forkProcess; I'll look into it.

Incidentally, this timer/fork issue is the reason that #3998 is hard to fix. The workaround we use in the process library is to use vfork instead of fork. (#1882 is unrelated, I think.)

comment:2 Changed 4 years ago by simonmar

I'm having trouble constructing a test case in Haskell that displays the problem. Can anyone else do this?

comment:3 Changed 4 years ago by nomeata

It is probably hard to find a machine that is both slow enough (CPU-wise) and large enough (memory-wise). Here is my test code:

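import qualified Data.ByteString as BS
import Control.Concurrent
import System.Process

main :: IO ()
main = do
    -- Allocate 1000 MiB so that the process is large when it forks.
    let size = 1000 * 1024 * 1024
        bs = BS.replicate size 5
    -- Force the ByteString so the memory is really allocated.
    BS.minimum bs `seq` return ()
    _ <- forkIO $ putStrLn "Forked Child"
    -- runCommand spawns a shell; this is where vfork()/clone() is called.
    _ <- runCommand "echo hi"
    putStrLn "Parent"
    -- Use bs again so that it stays alive across the fork.
    BS.minimum bs `seq` return ()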

On my computer, I could trigger the behaviour with a size of 1500*1024*1024:

vfork()                                 = ? ERESTARTNOINTR (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 56
clone(child_stack=0xbea, flags=CLONE_FILES|CLONE_PTRACE|CLONE_VFORK|CLONE_DETACHED|126) = ? ERESTARTNOINTR (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 56
[..]

What is interesting is that the program uses vfork() at first. If this works (because no timer interrupt happens), this call is used:

vfork(Process 3243 attached
)                                 = 3243

So there seems to be fallback logic that tries clone() if vfork() does not work, and it kicks in even in the case of ERESTARTNOINTR, which is probably not what was intended. But I guess this is independent of our issue.

If you have problems reproducing the error, you can check whether strace at least shows one or two failed calls due to ERESTARTNOINTR. If so, the problem is present; the symptoms are just not as bad.

I could not observe this problem with the threaded RTS right now.

comment:4 Changed 4 years ago by simonmar

  • Status changed from new to infoneeded

I'm not able to reproduce it, but I've applied the following fix anyway:

Tue May 18 04:32:14 PDT 2010  Simon Marlow <marlowsd@gmail.com>
  * Fix #4074 (I hope).
  
  1. allow multiple threads to call startTimer()/stopTimer() pairs
  2. disable the timer around fork() in forkProcess()
  
  A corresponding change to the process package is required.

and the process package patch:

Tue May 18 09:36:17 BST 2010  Simon Marlow <marlowsd@gmail.com>
  * Fix #4074: disable the timer signal around fork()

If you can test the fix that would be very helpful.

comment:5 Changed 4 years ago by simonmar

  • Status changed from infoneeded to merge

comment:6 Changed 4 years ago by igloo

  • Resolution set to fixed
  • Status changed from merge to closed

Both merged.

comment:7 Changed 4 years ago by nomeata

So far I have avoided having to build ghc6, so I’m reluctant to do it now. Maybe when I find some spare time for it. But I’m optimistic that this fixes the problem.

comment:8 Changed 4 years ago by marcotmarcot

  • Cc marcot@… added

I could reproduce the bug with a GHC built just before the patch was applied, and I couldn't reproduce it with a GHC built just after the patch was applied, so I assume this patch fixes the bug.
