Opened 10 months ago

Last modified 7 months ago

#7993 new bug

ghc 7.6 (not 7.4) sometimes hangs at child process exit on s390x

Reported by: cjwatson
Owned by:
Priority: normal
Milestone:
Component: Runtime System
Version: 7.6.3
Keywords:
Cc: simonmar
Operating System: Linux
Architecture: Other
Type of failure: Other
Difficulty: Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

On Debian's s390x architecture (64-bit S/390, Linux kernel), builds of several packages hang with GHC 7.6 where they did not hang with GHC 7.4. In particular, ghc itself hangs during its own build when bootstrapping with 7.6. This is quite easy to reproduce on affected systems, although it doesn't hang in exactly the same place every time. It appears that the runtime sometimes deadlocks when a subprocess exits; the strace looks like this:

7523  exit_group(0)                     = ?
6680  <... futex resumed> )             = ? ERESTARTSYS (To be restarted)
6680  --- SIGCHLD (Child exited) @ 0 (0) ---
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
6680  --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
6680  sigreturn()                       = ? (mask now [])
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
6680  --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
6680  sigreturn()                       = ? (mask now [])
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
[repeats forever]

ghc spawns enough subprocesses (gcc etc.) that it's essentially bound to hit this sooner or later. I suspect a lack of signal-safety somewhere: at an extremely wild guess, the type of some important variable written in a signal handler might exceed the size of sig_atomic_t on s390x and not elsewhere. I haven't yet been able to track this down in the time available to me.

If you don't immediately recognise this as something obvious, then perhaps somebody more fluent in Haskell than I would be good enough to suggest test code that exercises this and is somewhat simpler than "build ghc"? If my analysis is at all close to the mark, then something that sits in a loop forking and reaping a trivial child process on each iteration should be enough to reproduce this. On the assumption that most non-Debian-developers don't have convenient access to S/390 machines (Debian developers can use zelenka.debian.org), I'd be happy to try things out.
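
For concreteness, a minimal sketch of such a loop might look like the following (untested; /bin/true is just a convenient trivial child, and the iteration count is arbitrary):

import Control.Monad (forM_)
import System.Process (createProcess, proc, waitForProcess)

main :: IO ()
main = forM_ [1 :: Int .. 10000] $ \_ -> do
  -- fork a trivial child process
  (_, _, _, ph) <- createProcess (proc "/bin/true" [])
  -- reap it: blocks until the child exits, exercising the
  -- runtime's SIGCHLD/waitpid machinery on every iteration
  _ <- waitForProcess ph
  return ()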

Change History (4)

comment:1 Changed 8 months ago by nomeata

  • Difficulty set to Unknown

I tried to find out if the RTS option -V0 (which disables the RTS tick timer, and with it the SIGVTALRM signals seen in the strace) helps, but unfortunately it does not.
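
For reference, the flag goes to the runtime system, so (assuming the binary was linked with -rtsopts) it is passed along these lines, where ./program stands for whatever process is being tested:

$ ./program +RTS -V0 -RTS <normal arguments>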

comment:2 Changed 8 months ago by nomeata

I tried to reproduce the problem by spawning lots of processes, but

import System.Process (readProcess)

-- spawn /bin/echo and read its output, 10001 times in sequence
main :: IO ()
main = mapM_ (\_ -> readProcess "/bin/echo" ["hello", "world"] "") [0 .. 10000 :: Int]

did not deadlock.
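
Perhaps a lower-level variant would get closer to the raw fork/SIGCHLD/waitpid interaction. Here is an untested sketch using the unix package (note that forkProcess is documented to have caveats under the threaded RTS):

import Control.Monad (forM_)
import System.Posix.Process (forkProcess, getProcessStatus)

main :: IO ()
main = forM_ [1 :: Int .. 10000] $ \_ -> do
  pid <- forkProcess (return ())          -- child runs no code and exits
  _ <- getProcessStatus True False pid    -- block until the child is reaped
  return ()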

comment:3 Changed 7 months ago by pmylund

  • Cc simonmar added

I am experiencing the same issue, but on x86_64, and with my own application, which uses GHC (7.6.2) threads. On occasion a thread will loop forever and produce the same kind of strace output. (Is it the same issue?)

Unfortunately I don't know where to begin troubleshooting. Whenever I try to reproduce it in a smaller test, the problem goes away unless I use the same combination of threads, STM and exception handling that I have in my larger application.

I will keep trying and report back, but any input is appreciated.

Update: Please ignore the above. It turns out I had become trapped in an infinite loop inside my own recursive function, and that the strace output is to be expected from applications that are actually doing something (like looping infinitely). Sorry about that.

Last edited 7 months ago by pmylund

comment:4 Changed 7 months ago by simonmar

Getting a stack trace would probably help. You want to make sure that GHC itself is built with -debug: set GhcDebugged=YES in your build.mk (this will slow down the build, but you can remove it later). When the process hangs, attach to it with gdb and get a backtrace of all the threads.
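
For example, something like this (standard gdb commands; <pid> stands for the PID of the hung process):

$ gdb --pid=<pid>
(gdb) thread apply all bt
(gdb) detach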
