Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#3758 closed bug (fixed)

Huge regression in concurrent app performance and reliability under threaded runtime

Reported by: bos Owned by: simonmar
Priority: low Milestone: 7.0.1
Component: Runtime System Version: 6.12.1
Keywords: Cc: 8mayday@…, johan.tibell@…, dons@…
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by igloo)

I have a trivial HTTP server that I wrote to do some performance measurements with, and it behaves quite poorly for me under 6.12.1 if I try to use multiple cores. The symptoms are very low throughput and frequent hangs.

I'm on a dual core 64-bit Linux system (Fedora 12).

I have attached the server code that you can use to reproduce the problem. It requires the network and network-bytestring packages to function, but is otherwise standalone.

Running it is easy:

ghc -fforce-recomp -O --make Netbench
./Netbench localhost 8080

I've been using the standard apachebench tool, ab, to stress the server:

ab -c 10 -n 30000 http://localhost:8080/

This hits the server with 10 concurrent requests for a total of 30,000 requests. Here are some throughput measurements:

  • 6.10.4, unthreaded RTS: 5539 req/sec
  • 6.10.4, threaded RTS, -N1: 7758 req/sec
  • 6.10.4, threaded RTS, -N2: 5856 req/sec
  • 6.12.1, unthreaded RTS: 5612 req/sec
  • 6.12.1, threaded RTS, -N1: 7437 req/sec
  • 6.12.1, threaded RTS, -N2: 1978 req/sec

With -N2 under 6.12.1, there is a high probability (> 50%) that the server will deadlock mid-run.

When a multi-CPU run completes successfully under 6.12.1, ab reports that quite often a single request will get stuck for several seconds before the server responds. This does not seem to happen in other scenarios.

If you would like any more details, please let me know. Thanks.

Attachments (2)

Netbench.hs (2.6 KB) - added by bos 8 years ago.
Netbench.hs
Screenshot-ThreadScope - Netbench.eventlog.png (62.8 KB) - added by simonmar 8 years ago.


Change History (23)

Changed 8 years ago by bos

Attachment: Netbench.hs added

Netbench.hs

comment:1 Changed 8 years ago by bos

Oops, sorry for the messy formatting of the allegedly bulleted lists. I tried! I can't figure out how to go back and tidy them up.

comment:2 Changed 8 years ago by bsdemon

Cc: 8mayday@… added

comment:3 Changed 8 years ago by igloo

Description: modified (diff)
Milestone: 6.12.2

Thanks for the report. I've fixed the formatting.

comment:4 Changed 8 years ago by tibbe

Cc: johan.tibell@… added

comment:5 Changed 8 years ago by simonmar

Owner: set to simonmar
Priority: normal → high

comment:6 Changed 8 years ago by simonmar

Summary: Huge regression in concurrent app performance and reliability under threaded runtime`

Part of the problem at least is this:

#3553: parallel gc suffers badly if one thread is descheduled

You're on a dual-core machine, running 3 threads (2 for the server and one for the client, plus whatever other processes decide to wake up from time to time). In 6.12.1 we enabled the parallel GC by default for young-generation collections. The parallel GC uses spinlocks for synchronisation, and these cope very badly when one of the GC threads gets descheduled, which happens quite a lot in this workload.

You can get back to the single-threaded performance by turning off the parallel GC, with +RTS -qg.
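For example, adding `-threaded` to the build line from the description so that the `-N` and `-qg` RTS flags apply (GHC 6.12 accepts RTS options by default; GHC 7.0 and later would also need `-rtsopts` at build time):

```shell
ghc -fforce-recomp -O -threaded --make Netbench
./Netbench localhost 8080 +RTS -N2 -qg
```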

The rule of thumb is that if you use -N2, you better be sure you have 2 CPUs available for the whole run of the program, otherwise performance will go down the tubes. The work I'm doing on the GC at the moment will eventually help here.

As for the absolute performance, clearly we could do a lot better. I've attached a ThreadScope profile showing one of the gaps you can see if you zoom in. This gap in particular I think is select() waking up, but after select has woken a couple of Haskell threads it takes a little while before we start running them. More investigation is needed.

I'm not closing the ticket yet because you also reported a deadlock. I haven't been able to reproduce that yet.

Changed 8 years ago by simonmar

Attachment: Screenshot-ThreadScope - Netbench.eventlog.png added

comment:7 Changed 8 years ago by simonmar

Summary: `Huge regression in concurrent app performance and reliability under threaded runtime

Oops, reverting the accidentally spammed ticket summary.

comment:8 Changed 8 years ago by bos

Simon, thanks for looking into this.

I tried the same benchmarks, only this time using a Mac running ab as the client over a gigabit ethernet connection.

With 6.12.1, -N2 -qg, I get good numbers: just below 15,000 requests per second.

With plain -N2 (hence using parallel GC), the numbers don't take as big a dive as when I was running client and server on a single box, but they still drop to about 7,500 requests per second, a 50% reduction in throughput.

Since the server has nothing else running on it in this case, I really do have 2 CPUs available. I think :-)

comment:9 Changed 8 years ago by simonmar

I'll investigate some more, but what I think is happening is that we're only really using one core at a time, and the other core is mostly asleep (the ThreadScope profile confirms this). Hence, when using the parallel GC we have to keep waking up the second core and waiting for it, when in fact it would be better to just do a single-threaded GC.

Now, I don't understand yet why the benchmark isn't keeping both cores busy, but we ought to be able to debug that using ThreadScope.

Are you still seeing the deadlocks, BTW? I couldn't reproduce that on my dual-core laptop running Ubuntu Karmic (32-bit).

comment:10 Changed 8 years ago by bos

I only see the deadlocks when running server and client on the same machine. Everything seems reliable when they're physically separate.

comment:11 Changed 8 years ago by bos

By the way, as far as only using one core is concerned, it might be related to the I/O manager being a single thread. Since the worker threads ought in principle to be able to do their job by running two system calls, I wouldn't be terribly surprised if the I/O manager was the choke point. Is that something that ThreadScope could show?

I'll probably need to add event logging to the I/O manager as I get deeper into reworking it anyway :-)

comment:12 Changed 8 years ago by simonmar

Thread 2 is the I/O manager, so you can see what it is up to in ThreadScope. You can generate events from Haskell code using GHC.Exts.traceEvent, although they only show up in the "Events" tab in ThreadScope currently, not in the timeline view.
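A minimal sketch of emitting custom events from Haskell code. Hedged: GHC 6.12 exported this as `GHC.Exts.traceEvent`; later GHCs renamed it `Debug.Trace.traceEventIO`, which this sketch uses, and `handleRequest` is a hypothetical stand-in for the server's real request handler:

```haskell
import Debug.Trace (traceEventIO)

-- Hypothetical request handler; the traceEventIO calls emit custom events
-- that land in the eventlog when the program is built with -eventlog and
-- run with +RTS -l.
handleRequest :: IO ()
handleRequest = do
  traceEventIO "request start"
  -- ... read the request and write the response here ...
  traceEventIO "request done"

main :: IO ()
main = do
  handleRequest
  putStrLn "handled"
```

Opening the resulting `.eventlog` file in ThreadScope then shows the strings in the Events tab, as described above.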

comment:13 Changed 8 years ago by simonmar

Cc: dons@… added

Some more on this:

I'm still unable to reproduce any deadlocks, with either 6.12.1 or HEAD. It's not unheard of for there to be bugs in Linux threading, but I haven't seen any for a few years, and we're both running the same kernel (2.6.31). So it could be a race that is tickled on your hardware but not mine, I'll try on different hardware when I'm back in the office.

I've cobbled together a patch that should make the parallel GC better behaved when only some of the cores are active: basically it doesn't bother waking up idle cores to do a parallel GC. This avoids most of the slowdown with -N2 here. I need to check that it doesn't hurt any of my other benchmarks before committing.

Something else that helps a bit: +RTS -qa uses the Linux affinity API to pin threads to CPUs.

I've noticed that if I go over 100 threads with ab, then a very few requests take >3s, but the rest are under 100ms. That is suspicious, and could indicate something strange in the scheduler.

You can also win by doing this:

  forM_ [0..numCapabilities-1] $ \n ->
    forkOnIO n (listenLoop sock)

and then run with +RTS -N2 -qm -qa, to disable automatic migration and enable affinity. This basically assigns one server thread and its children to each core, and doesn't let them move around.
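A self-contained sketch of that pinning pattern. Hedged: GHC 6.12 spells the fork call `forkOnIO`, while later GHCs call it `forkOn`, which this sketch uses; the real `listenLoop` would be the accept-and-serve loop from the attached Netbench.hs, replaced here by a stub so the sketch runs standalone:

```haskell
import Control.Concurrent (forkOn, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_)
import GHC.Conc (numCapabilities)

-- Stub standing in for Netbench's accept-and-serve loop.
listenLoop :: Int -> IO ()
listenLoop _ = return ()

main :: IO ()
main = do
  let caps = [0 .. numCapabilities - 1]
  dones <- mapM (const newEmptyMVar) caps
  -- Pin one server loop to each capability; with +RTS -qm -qa the RTS
  -- will not migrate these threads between cores.
  forM_ (zip caps dones) $ \(n, done) ->
    forkOn n (listenLoop n >> putMVar done ())
  mapM_ takeMVar dones
  putStrLn "spawned one loop per capability"
```

Run with `+RTS -N2 -qm -qa` as described above; with the stub loop it simply spawns one thread per capability and exits.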

comment:14 Changed 8 years ago by simonmar

After applying the patch from #3553, on a dual-core 1.2GHz Intel Core, Ubuntu Karmic:

  • HEAD, unthreaded: ~6300 req/sec
  • HEAD, -threaded -N1: ~6600 req/sec
  • HEAD, -threaded -N2: ~6600 req/sec
  • HEAD, -threaded -N2 -qg: ~6600 req/sec

The numbers wobble around quite a bit, but as far as I can tell the problem is now gone. I'll wait for confirmation from Bryan (maybe when we put out a 6.12.2 RC?) before closing the bug.

comment:15 Changed 8 years ago by simonmar

I just fixed a deadlock. I'm not sure if it's your deadlock, but fingers crossed.

Tue Jan 26 07:00:37 PST 2010  Simon Marlow <marlowsd@gmail.com>
  * Fix a deadlock, and possibly other problems
  After a bound thread had completed, its TSO remains in the heap until
  it has been GC'd, although the associated Task is returned to the
  caller where it is freed and possibly re-used.  
  
  The bug was that GC was following the pointer to the Task and updating
  the TSO field, meanwhile the Task had already been recycled (it was
  being used by exitScheduler()). Confusion ensued, leading to a very
  occasional deadlock at shutdown, but in principle it could result in
  other crashes too.
  
  The fix is to remove the link between the TSO and the Task when the
  TSO has completed and the call to schedule() has returned; see
  comments in Schedule.c.

comment:16 Changed 8 years ago by bos

Thanks for working on this, Simon. It'll be a few days before I have a chance to look at the results, at the very least.

comment:17 Changed 8 years ago by simonmar

Bryan: if you had a chance to test 6.12.2 RC1, that would be most helpful, thanks.

comment:18 Changed 8 years ago by igloo

The above patch is in the 6.12 branch, so we believe this is fixed, and thus not a blocker for the release.

comment:19 Changed 8 years ago by igloo

Milestone: 6.12.2 → 6.12.3
Priority: high → normal
Status: new → infoneeded

comment:20 Changed 7 years ago by igloo

Milestone: 6.12.3 → 6.14.1
Priority: normal → low

comment:21 Changed 7 years ago by igloo

Resolution: fixed
Status: infoneeded → closed

Assuming this is fixed by the above patch.
