Currently, GHC uses a round-robin scheduler for Haskell threads, with some heuristics for when threads should be bumped to the front of the queue. This patch set replaces this scheduler with 'stride scheduling', as described by Waldspurger and Weihl '95, which is an efficient, deterministic method for scheduling processes with differing priorities. Priorities are assigned by giving 'tickets' to threads; a thread with twice as many tickets as another will run twice as often. I’d like to replace the round-robin scheduler completely with this scheduler.
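For readers who don't have the paper to hand, here is a minimal sketch of the core idea (illustrative only; STRIDE1 matches the constant discussed below, but run_for_one_quantum and the field layout are assumptions, not the patch's actual code): every thread advances a 'pass' counter by a stride inversely proportional to its tickets, and the scheduler always runs the thread with the smallest pass.

#define STRIDE1 (1 << 20)   /* stride of a one-ticket thread */

typedef struct Thread_ {
    unsigned int tickets;   /* proportional share of the CPU */
    unsigned int stride;    /* STRIDE1 / tickets */
    unsigned int pass;      /* virtual time; the smallest runs next */
} Thread;

extern void run_for_one_quantum(Thread *t);   /* hypothetical */

/* One scheduling step over a runnable set (a real implementation
 * would use a priority queue instead of this linear scan). */
static void schedule_one(Thread **runnable, int n)
{
    int i, min = 0;
    for (i = 1; i < n; i++)
        if (runnable[i]->pass < runnable[min]->pass)
            min = i;
    run_for_one_quantum(runnable[min]);
    runnable[min]->pass += runnable[min]->stride;
}

A thread with twice the tickets has half the stride, so its pass advances half as fast and it is selected twice as often.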
Here are nofib benchmarks comparing the old scheduler to the new:
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -0.0%    -52.2%    -18.8%    -18.6%    -83.1%
            Max          +1.0%     +4.2%     +4.9%     +5.1%     +7.7%
 Geometric Mean          +0.1%     -2.8%     -0.9%     -0.9%     -2.9%
Here are some technical details:
Without any changes to ticket values, scheduling behavior should be functionally identical to round-robin. (By default, new threads, including the IO thread, get allocated the maximum number of tickets possible.) In practice it is not quite identical, since our heap does not have the FIFO property (see below).
The current patch set uses a very simple (undergrad-level) resizable-array-backed heap to implement the priority queue. We could play some tricks to reduce the memory footprint of the priority queue (e.g. using the container_of macro to eliminate the separate key store), and a fancier data structure would make it easier to surgically remove entries or reweight them, but I wanted to keep overhead low. If anyone has a pet implementation of priority queues in C, feel free to swap it in. Right now, this only affects the uses of promoteInRunQueue() in Messages.c; I still need to check whether #3838 (closed) has regressed.
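To illustrate the container_of trick mentioned above (a sketch under assumed names, not the patch's code): if the heap's slots are pointers to the pass field embedded in each TSO, the separate key store disappears, and container_of recovers the owning TSO from a slot.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct StgTSO_ {
    /* ... the rest of the TSO ... */
    unsigned int pass;        /* heap key, embedded in the TSO itself */
} StgTSO;

/* The heap holds pointers to the embedded keys; no parallel key
 * array.  Allocated and resized elsewhere. */
static unsigned int **heap;

static StgTSO *heapMinTSO(void)
{
    return container_of(heap[0], StgTSO, pass);
}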
We get three new primops: setTickets#, getTickets# and modifyTickets#. We don't support creating threads with specific numbers of tickets (mostly because that would have added an annoyingly large set of extra primops); instead, you're expected to spawn a thread which gets max-ticket allocation, and then weight it down.
_link is no longer used for linking TSOs in the run queue. I have tried my best to stamp out any code which operated on this assumption, but I may have missed some.
Modifying a TSO's tickets takes out the scheduler lock; the hope is that this operation is quick and rare enough that a global lock here is not too bad.
We are unfortunately stuck with some magic constants w.r.t. ticket values: 1 << 20 is the maximum number of tickets our implementation is hard-coded to support.
Sleeping and blocked tasks do not get any 'bonus' for being blocked.
In an ideal world, when a thread hits a black hole, it should temporarily give its tickets to the thread evaluating the black hole, so it will unblock more quickly. Unfortunately, implementing this is pretty complicated (the blackhole evaluating thread could die; or it could get stuck on a blackhole itself and need to gift its tickets; it shouldn't be able to give away the tickets it was gifted.) So this implementation leaves that out. Similar semantics for MVars are probably possible, but will require userland assistance too.
I haven't prepared a patch to base yet with a user-level API, but here is a proposed draft:
type Tickets = Int

-- | Maximum number of tickets we support a thread having. (Currently 2 ^ 20.)
-- Note that this doesn't bound the *global* maximum tickets.
maxTickets :: Tickets

-- | Changes the number of tickets allocated to a thread. The ticket count must
-- not be less than or equal to zero, or greater than maxTickets. (Corresponds
-- to the setTickets# primop.)
setTickets :: ThreadId -> Tickets -> IO ()

-- | Returns the number of tickets currently allocated to a thread. (Corresponds
-- to the getTickets# primop.)
getTickets :: ThreadId -> IO Tickets

-- | Atomically performs a linear transformation on the number of tickets of a
-- thread, e.g. scaling the number of tickets by a rational number, adding
-- another fixed set of tickets, and then returning the number of 'leftover'
-- tickets from the operation; e.g. if the net amount of tickets is reduced,
-- then the returned result is positive; if the net amount of tickets is
-- increased, the returned result is negative. In the absence of concurrent
-- threads, the following property holds for all t, m and x:
--
--      do r  <- getTickets t
--         d  <- modifyTickets t m x
--         r' <- getTickets t
--         return (r == r' + d)
--
-- If the scaling would reduce the number of tickets below zero or increase the
-- number of tickets beyond maxTickets, this function will instead fail with an
-- exception. This function is subject to some rounding error; powers of two
-- are, however, likely to be exact. (Corresponds to the modifyTickets# primop;
-- note that the sentinel value for failure is maxTickets + 1, since it is
-- impossible for a thread's ticket value to change by that much.)
modifyTickets :: ThreadId -> Ratio Int -> Tickets -> IO Tickets

-- | Forks a new thread, transferring some percentage of tickets from the
-- current thread to it (so the net number of tickets stays constant). Fails if
-- the rational is greater than 1 or less than or equal to zero, or if there
-- are not enough tickets in the current thread.
forkIOWith :: IO a -> Ratio Int -> IO ThreadId
Ok, to get started on the reviewing process, I'll jot down some concerns.
First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
Run all the tests in testsuite/concurrent. They aren't run by default in a validate.
The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 (closed) hasn't regressed it should be ok.
could we have some more docs on the scheduler functions, capPassUpdate etc. please?
Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
I don't see why sched_mutex is needed in setTickets and co. Could you explain?
Summary: SMP benchmarks show massive degradation of performance, so probably one of the performance bugs Simon Marlow pointed out is really killing us and needs to be fixed.
> First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
Oops, I forgot that normal nofib doesn't do smp. SMP doesn't look so good, and it seems likely that some of the performance problems you've noted below might be contributing. I'll investigate more thoroughly...
> Run all the tests in testsuite/concurrent. They aren't run by default in a validate.
Yup, I've been running a full testsuite run and will be checking those results today.
> The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
I can reduce this by one by paying the cost of a division every time we're scheduled (since stride = STRIDE1/tickets, always); we can reduce it by one more by observing that pass and remain are never both set at the same time and turning it into a union. I'm not sure what benchmarks in particular would start failing due to the increase in size, other than just general badness.
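Concretely, the two savings would look something like this (a sketch; the field names are assumptions, and StgWord32 is spelled out here for self-containedness):

#include <stdint.h>
typedef uint32_t StgWord32;   /* stands in for the RTS typedef */

#define STRIDE1 (1 << 20)

typedef struct {
    StgWord32 tickets;        /* >= 1 by the API's invariant */
    union {                   /* never both live at once:    */
        StgWord32 pass;       /*   while on the run queue    */
        StgWord32 remain;     /*   while blocked             */
    } u;
} SchedFields;                /* two words instead of four */

/* stride is no longer cached; pay one division per schedule instead */
static inline StgWord32 strideOf(const SchedFields *s)
{
    return STRIDE1 / s->tickets;
}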
> Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
Here's the secret: Linux's CFS is stride scheduling! The primary differences here are the backing data structure (they use a red-black tree while we're using a simple heap) and how timeslices are measured (Linux counts nanoseconds; we stupidly just consider a jump to the scheduler one quantum.)
It is possible to implement CFS-style sleeper fairness by modifying what happens to a process's pass when it sleeps. I'll post a version with those changes.
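Roughly what I have in mind (a sketch with assumed names, reusing the Thread fields from the sketch in the opening comment): on wakeup, the sleeper's stale pass is clamped against the queue's current minimum, so it gets a bounded boost rather than unbounded credit for time spent asleep.

typedef struct Thread_ {           /* as in the earlier sketch */
    unsigned int tickets, stride, pass;
} Thread;

/* Sleeper fairness on wakeup (sketch).  queue_min_pass is the
 * smallest pass currently in the run queue; the waking thread may
 * run at most one stride "early", no matter how long it slept. */
static void onWakeup(Thread *t, unsigned int queue_min_pass)
{
    unsigned int floor_pass = queue_min_pass - t->stride;
    if ((int)(t->pass - floor_pass) < 0)   /* wrap-safe "pass < floor" */
        t->pass = floor_pass;
}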
> Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
Could be a problem. If we switched to a tree data structure, we could probably recover this behavior (the GC will only have to traverse log n).
> promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 (closed) hasn't regressed it should be ok.
There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.
> could we have some more docs on the scheduler functions, capPassUpdate etc. please?
OK.
> Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
OK.
> I don't see why sched_mutex is needed in setTickets and co. Could you explain?
setTickets can be called concurrently with the scheduler, and since it both reads and writes scheduler state, it needs to be protected by a lock.
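For concreteness, the shape of the critical section (a sketch: ACQUIRE_LOCK, RELEASE_LOCK and sched_mutex are real RTS names, but stgSetTickets and requeueIfRunnable are hypothetical): the update has to be atomic with respect to the schedulers because changing the ticket count changes the TSO's stride, and hence its position in some capability's run queue.

/* Sketch only; in RTS context ACQUIRE_LOCK/RELEASE_LOCK come from
 * OSThreads.h, and requeueIfRunnable is a hypothetical helper. */
void stgSetTickets(StgTSO *tso, StgWord32 tickets)
{
    ACQUIRE_LOCK(&sched_mutex);
    tso->tickets = tickets;      /* the write ... */
    /* ... and the read/adjust: if the TSO is sitting in some
     * capability's run queue, its stride (and hence its position
     * in the priority queue) just changed, so reposition it. */
    requeueIfRunnable(tso);      /* hypothetical */
    RELEASE_LOCK(&sched_mutex);
}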
The goal of the change is to allow Haskell programmers to control how frequently the threads they create are run: for example, when some threads are important and should run as frequently as possible, or when threads are created by an untrusted user who should not be allowed to thread-bomb the system.
> Roughly, how does the new scheduling algorithm achieve the goal?
Stride scheduling operates by replacing our old doubly-linked-list run queue with a priority queue ordered by 'pass' values. Each thread's pass increases by its stride (computed from its ticket count) every time it runs; thus a thread with a low stride runs multiple times before its pass catches up.
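A worked example may help (a hypothetical trace; numbers chosen small for readability, whereas the real STRIDE1 is 1 << 20): let STRIDE1 = 6, give thread A two tickets (stride 3) and thread B one ticket (stride 6), start both passes at 0, and break ties in favour of A.

step   runs   pass(A)   pass(B)
  1      A       3         0
  2      B       3         6
  3      A       6         6
  4      A       9         6
  5      B       9        12
  6      A      12        12

Over these six quanta A runs four times and B twice, matching the 2:1 ticket ratio.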
> There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?
The priorities are set by the programmer using setTickets and related functions.
> What about priority inversion?
Without any change to priorities, scheduling behavior is intended to be identical to what it was previously. Priority inversion can occur when a thread blocks on another; this patch doesn't address that yet (though I've described above how it might be done for BLACKHOLEs).
> Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
Based on some investigation, this appears to be one of the primary causes of the slowdown. I checked by comparing how often markCapability was called against the number of actual evacuations performed on TSO objects. Although the number of markCapability calls is the same in both cases, TSO objects get evacuated a lot more frequently in the new world.
One of the things that I'm not really clear about is how to tell if I can skip evacuating an object. This GC part of the code is pretty opaque to me, and while I feel like there should be something like a log(n) solution I'm not seeing an implementation path.
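For what it's worth, my understanding of how the existing GC already skips old objects (a paraphrase of the early-out in rts/sm/Evac.c from memory; details may be off) is a generation check on the object's block descriptor:

/* Paraphrase, not exact GHC code; types sketched minimally. */
typedef struct { unsigned int gen_no; /* ... */ } bdescr;

extern bdescr *Bdescr(void *p);      /* real RTS: block descriptor of p */
extern unsigned int collecting_gen;  /* hypothetical: gen being collected */

void evacuate_sketch(void **p)
{
    bdescr *bd = Bdescr(*p);
    if (bd->gen_no > collecting_gen) {
        /* object already lives in an older generation: a minor GC
         * leaves it in place, so the "call" costs almost nothing */
        return;
    }
    /* ... otherwise copy the object to to-space and update *p ... */
}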
I would expect to see lots more calls to evacuate on TSOs, but in most cases the TSO should be in the old generation, so evacuate will just return (in the case of a minor GC, that is). Are you counting actual copies, or just calls to evacuate?
It would probably be a good idea to add some diagnostics to track the size of the run queue too. I'm not completely convinced that this is your problem yet.
You’re right, it’s not (or, at least, the situation is more complicated.) With the following gratuitous change:
diff --git a/rts/Capability.c b/rts/Capability.c
index 811df58..b4a1ec0 100644
--- a/rts/Capability.c
+++ b/rts/Capability.c
@@ -1007,6 +1007,10 @@ markCapability (evac_fn evac, void *user, Capability *cap,
     // or fewer Capabilities as GC threads, but just in case there
     // are more, we mark every Capability whose number is the GC
     // thread's index plus a multiple of the number of GC threads.
+    StgTSO *tso;
+    for (tso = cap->run_queue_hd; tso != END_TSO_QUEUE; tso = tso->_link) {
+        evac(user, (StgClosure **)(void *)&tso);
+    }
     evac(user, (StgClosure **)(void *)&cap->run_queue_hd);
     evac(user, (StgClosure **)(void *)&cap->run_queue_tl);
 #if defined(THREADED_RTS)
I've been focusing on sieve, since it had the most slowdown. What I do notice when I add some diagnostics is that the run queue starts with 100 threads in it, and that the program runs a lot more slowly when the run queue is large than when it is small. When I count actual copies, TSOs are copied a lot more frequently in the new code. If you have some ideas on what to look at, that would be great!
I think what I am going to try to do is implement another version which uses sorted doubly linked lists instead, and see if the same problems crop up. (I also realized that we're leaking memory for the run queues, since I never free them. Oops!)
Maybe a good idea is to replace the malloc'd array with a MUT_ARR_PTRS closure; then we only evacuate one thing (and all of the card marking happens automatically). Unfortunately, there don't seem to be any amenities for manipulating these from RTS-land.
Here is an alternate patch, which uses sorted linked lists. It is mostly interesting because it also exhibits the same slowdown as the original patch, even though it should hypothetically act the same way as the original scheduler (except for some constant overhead.)
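For reference, the core of the sorted-list variant (a sketch with assumed field names): insertion walks from the tail, which is cheap in the common case where a freshly descheduled thread has the largest pass, and popping the minimum off the head stays O(1).

typedef struct TsoNode_ {
    struct TsoNode_ *prev, *next;
    unsigned int pass;
} TsoNode;

typedef struct { TsoNode *head, *tail; } RunQueue;

/* Insert t, keeping the queue sorted by ascending pass. */
static void insertSorted(RunQueue *q, TsoNode *t)
{
    TsoNode *cur = q->tail;
    while (cur != NULL && cur->pass > t->pass)
        cur = cur->prev;
    /* link t after cur (or at the head if cur == NULL) */
    t->prev = cur;
    t->next = cur ? cur->next : q->head;
    if (t->next) t->next->prev = t; else q->tail = t;
    if (cur)     cur->next     = t; else q->head = t;
}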
Perhaps you should check that the scheduling behaviour is the same. e.g. in sieve, if we manage to run a thread to completion in one time slice, then it won't need to be copied by the GC, but if it gets preempted then it will get copied.
A glance at the ThreadScope profile, or even just the eventlog, might be illuminating. Also check the +RTS -s stats, a good sanity check is whether the amount of allocation is the same (if not, something strange is going on). And the GC stats will give you more clues.
OK, I have a much better sense for where the performance problems are coming from.
Adding extra words to the TSO has no impact on most smp benchmarks, but really kills threads006 (up to 100% slowdown); this is not too surprising since this benchmark involves creating 200,000 threads. A representative stat without a small TSO is
What I don't understand is why the memory in use blows up by 150MB; by my count, the extra words should only be adding something like 10MB of overhead; my best guess is that I am actually nudging TSO size over some invisible threshold (maybe the big object threshold).
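For the record, the back-of-the-envelope arithmetic behind that estimate (assuming 64-bit words and four extra words per TSO):

200,000 threads × 4 extra words/TSO × 8 bytes/word ≈ 6.4 MB

which is indeed nowhere near the observed 150 MB jump.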
From there, the next two bugs have to do with subtle scheduler changes. The first I have a clean fix for (it was a plain old bug): 'sieve' slows down a lot if threads which HeapOverflow don't get put back at the front of the run queue. Fixing that dramatically improves runtime for all of the benchmarks except 'threads003'; we can make that performance problem go away if we force threads to get appended to the end of the run queue (of course, stride scheduling won't work in that case!) I'm still investigating these.
Well, the magic threshold (on x86_64) is apparently one StgWord64 and one StgWord32; add one more StgWord32 and the memory usage blows up. (That is, increasing the struct's size to 16 64-bit words; 15 and 14 work fine.) I'm a bit surprised we were so close to the limit; I guess I'll have to figure out how to do this with only two variables, or see if I can condense other parts of the TSO struct. (Actually, we probably have even less to work with, since Windows adds another StgWord32 and PROFILING adds another pointer field... so we literally have zero extra fields! Ouch!)
/* The size argument we are given includes all the per-thread
 * overheads:
 *
 *    - The TSO structure
 *    - The STACK header
 *
 * This is so that we can use a nice round power of 2 for the
 * default stack size (e.g. 1k), and if we're allocating lots of
 * threads back-to-back they'll fit nicely in a block.  It's a bit
 * of a benchmark hack, but it doesn't do any harm.
 */
I don't think the TSO problem (in threads006) has to do with influencing stack size; when I run with GC stats, I don't see any thread overflows (the stack is only one word smaller), and the overall allocation remains the same (TSO ends up being 0x10 aligned, but I don't see why that would cause performance problems...)
If we used actual time rather than scheduler quanta, wouldn't that fix the heap-check case too? Calling gettimeofday is very cheap these days, cheap enough that we can call it every time we emit an event in the event log, so I'm sure it won't be a problem to call it in the scheduler.
My guess is that with threads003 we're just pushing the size up enough that the GC has to touch another cache line per TSO. If that's the case, then I'm not too worried, since this is a microbenchmark and won't affect real world performance much. But we should verify that that's what's going on. It can't be that we're allocating another stack chunk, because that would show up in the allocations.
Check whether the amount of memory copied by the GC goes up in proportion to the amount you added to the TSO struct. If it is a lot more, then something else is going on. Also try some measurements with perf?
Once this is squashed (or accounted for), we can move on to checking that priorities work right, and looking at the primops and the user API.
[ezyang@hs01 threads006]$ ./Main-big 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     853,296,496 bytes copied during GC
     197,453,072 bytes maximum residency (10 sample(s))
      25,701,000 bytes maximum slop
             426 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-medium 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     690,756,096 bytes copied during GC
     125,998,552 bytes maximum residency (9 sample(s))
      24,121,032 bytes maximum slop
             270 MB total memory in use (0 MB lost due to fragmentation)
That's an over 150MB jump.
I think I have an important clue, though; consider these three runs on the big TSO executable:
[ezyang@hs01 threads006]$ ./Main-big 182260 +RTS -s
     204,318,544 bytes allocated in the heap
     622,311,352 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,499,512 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182261 +RTS -s
     204,319,648 bytes allocated in the heap
     622,313,432 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,453,024 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182262 +RTS -s
     204,320,752 bytes allocated in the heap
     791,958,776 bytes copied during GC
     169,695,072 bytes maximum residency (10 sample(s))
      23,503,128 bytes maximum slop
             395 MB total memory in use (0 MB lost due to fragmentation)
Wow! In the case of the 182262nd thread, it's literally the straw that broke the camel's back. I went ahead and logged GC events. The breaking point occurs here:
----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large       Live      Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   53762         0       2         0       1184      7008
    1   53762         3   53766        94  196721432  23504104
----------------------------------------------------------
                                           196722616  23511112
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 53766 blocks ( 210.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         :  6332 blocks (  24.7 MB)
  total        : 60228 blocks ( 235.3 MB)

alloc new todo block 0x7f13c7aff000 for gen 0
alloc new todo block 0x7f13c7ca1000 for gen 1
[snip]

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large       Live      Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   82858         0       2         0        104      8088
    1   82858         1   45569         3  169695176  16955448
----------------------------------------------------------
                                           169695280  16963536
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 45569 blocks ( 178.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         : 53841 blocks ( 210.3 MB)
  total        : 99540 blocks ( 388.8 MB)
Inside the snip, approximately a hundred megablocks are allocated as TODO space for scavenging, à la
allocated 1 megablock(s) at 0x7f13b5a00000
alloc new todo block 0x7f13b5a04000 for gen 1
increasing limit for 0x7f13b5a04000 to 0x7f13b5a04800
It seems we run out of todo blocks while evacuating, and then essentially need to copy our entire heap. Ouch! Unfortunately, I don't know enough to say if this is a GC bug or not.
Correct me if I'm wrong, but in the "slow" case you're just doing one more major GC than in the fast case, right? The extra thread just pushed allocation over the threshold for the next GC. If that's the case, then this is not a problem at all.
The sorted list implementation performs comparably to the original when no priority changes are involved. However, it does not apply sleeper fairness (give threads that were sleeping extra timeslices), and when sleeper fairness is applied, we suffer from performance degradation (since operations on the sorted list are no longer constant time). So if this implementation wants to go anywhere, it needs to be upgraded into something like a skip list, which can efficiently search and remove things inside the structure.
I've stamped out some of the performance bugs in the array implementation (mostly by adding an extra queue which we use to handle "must run now!" threads). However, we suffer from a huge bottleneck in deleteMinPQueue(), which according to perf takes up 22% of the runtime. Apparently logarithmically many comparisons and memory writes cost a lot when many threads are involved. I am going to look for something more high-performance.
Some other interesting results about sleeper fairness: for most benchmarks, it doesn't have much effect, but threads003 displays some interesting behavior (this comparison is before and after sleeper fairness is applied):
So, changing how we schedule blocked threads (i.e. giving them a priority boost) results in a 10% speedup, but we end up using 20% more memory! A very interesting tradeoff, if you ask me...
The red-black tree implementation is now behaving very consistently, with only 1-3% overhead. (Right now, it uses a malloc'd free list of red-black tree nodes; this ended up being quicker than using heap-allocated objects.) Unfortunately, sleeper fairness (with the objective of giving sleepy threads a bit of a boost) makes all of the benchmarks perform worse, in part because sleepy threads increase the overhead of managing processes, since they tend to have different priorities. I don't at the moment have a good way of measuring latency change; see #7659 (closed) for a bug related to that. Additionally, the impact of scheduling is fairly hard to observe on some test programs I've constructed unless you insert explicit yields into your programs. One more unfortunate thing: the rb-trees are set up to get a performance boost when all threads have the same priority; when threads have different priorities you will immediately see a slowdown, because more rb-tree manipulations become necessary.
The array heap implementation has lower overhead (0-2%) but is far more inconsistent; some benchmarks do much worse. Right now it’s still pretty unclear where all of this overhead is coming from.
The end conclusion is as follows: if red-black trees end up being the dominant implementation, then there is still a chance that we can replace the current scheduler wholesale, since its runtime characteristics are very similar to the current scheduler's. (Sleeper fairness will need to remain a configurable option, assuming we can show that it improves latency.) However, if array heaps end up having much better properties on prioritized workloads (test cases for which I have not constructed yet), I find it highly implausible that we'll be able to finesse the implementation to work identically to the current scheduler, and we will need to support both. On the plus side, because I have made so many implementations, I have a good sense of how to refactor the API properly.
I’ll attach some patches, but the comments still need some cleaning up.
Looking good, I'm still excited about this work. I really hope you'll put some of your benchmarks into nofib/smp - the current set may well be giving us a skewed view of what's important. I definitely agree that we want latency benchmarks too, but I haven't thought about how to construct them.
Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.
Yep, I think any new benchmarks will probably be pretty important, maybe even more so than the actual stride scheduling :)
> Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.
I've tried it, jerry-rigging StgMutArrPtrs. I concluded we'll need to make a custom closure (ugh!) or make some improvements to how GHC garbage collects mutable objects (actually, I was planning on filing a bug; here we go: #7662). Mutable arrays have some funny characteristics: the write barriers are relatively cheap, but the arrays all get scanned anyway because they are permanently on the mutable list, so we don't end up getting any benefit. (There aren't any other existing mutable objects which can be jerry-rigged here.) Adding things to the mutable list may end up being expensive in the long run.
The current malloc'd red-black implementation is pretty efficient; we maintain a free list and allocate them in blocks. I haven't implemented reclamation yet but even if we leak all of the free nodes it will only consume space up to the high watermark of number of simultaneous threads.
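The allocator shape described above, for concreteness (a sketch; the block size and names are assumptions): nodes are carved out of malloc'd blocks and recycled through an intrusive free list, so allocation and freeing are pointer pops and pushes.

#include <stdlib.h>

typedef struct RBNode_ {
    struct RBNode_ *free_next;   /* doubles as the free-list link */
    /* ... key, children, colour ... */
} RBNode;

#define NODES_PER_BLOCK 256      /* assumed; tune to taste */

static RBNode *free_list = NULL;

static RBNode *allocNode(void)
{
    if (free_list == NULL) {
        RBNode *block = malloc(NODES_PER_BLOCK * sizeof(RBNode));
        if (block == NULL) return NULL;
        /* thread the whole new block onto the free list */
        for (int i = 0; i < NODES_PER_BLOCK; i++) {
            block[i].free_next = free_list;
            free_list = &block[i];
        }
    }
    RBNode *n = free_list;
    free_list = n->free_next;
    return n;
}

/* O(1); the blocks themselves are never returned to malloc, which is
 * the "leak up to the high watermark" behaviour described above. */
static void freeNode(RBNode *n)
{
    n->free_next = free_list;
    free_list = n;
}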
I'd like to add a word of caution to the inclusion of any form of priority/weighted scheduling into GHC at this time.

Scheduling (unless it discards work) does not, a priori, change the total amount of work that has to be done, just its order. Such scheduling strategies don't create computational power, they just distribute the "disappointment" of not being first... This would imply that all of the timing changes reported in this thread are the result of interactions with other resources - different patterns of memory access, garbage collection etc.

The community already recognises these sorts of performance couplings, and there are several mechanisms (at a higher level) that can be used to manage the inevitable performance trades.

I strongly believe that adding differential scheduling at this time will be, overall, counterproductive to GHC's RTS development. It will create a new class of performance artefacts that will require resources to investigate. Time will be spent pushing trades around, chasing the proverbial bubble under the wallpaper around and around.

Please don't get me wrong, I am highly appreciative of Edward's efforts. The more overhead we can reduce in the scheduling, the better.

If we want to start looking at more complex differential treatment of the allocation of supply (CPU cycles) to demand (task steps), then we need to consider how that trade interacts with the others, and how the overall trading space is going to manifest itself in terms of end-user goals. We also need to come up with a sensible approach to discarding computational demand (load shedding) when the "system" becomes "excessively" infeasible to schedule to meet those goals. Fortunately there are approaches to this (stuff I've been involved with in data networks for many years) - and Haskell's operational model has all the right hooks to do this. Interested PhD students, do get in touch...
OK, after a brief break, I've started working on latency benchmarks to help evaluate how much alternative scheduling algorithms help. Of course, the first order of business was to establish some baselines, so I ran some tests on GHC 7.4 first, as well as a bare-bones C version of the application. The tests are here https://github.com/ezyang/latency ; eventually I'll integrate them with nofib (although it's not obvious how to integrate them into the current nofib reporting framework). The basic method is to write a byte onto a Unix domain socket; I'm sure I've botched some instrumentation somewhere so some eyeballs here would be useful. I can also make pretty graphs if people are interested.
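For the curious, the skeleton of such a measurement looks something like the following (a hedged sketch, not the code in the repository; the Haskell variants time the same round trip from the Haskell side):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;

    if (fork() == 0) {                    /* echo peer */
        char c;
        while (read(fds[1], &c, 1) == 1)
            write(fds[1], &c, 1);
        _exit(0);
    }

    for (int i = 0; i < 10000; i++) {
        struct timespec t0, t1;
        char c = 'x';
        clock_gettime(CLOCK_MONOTONIC, &t0);
        write(fds[0], &c, 1);             /* one byte out ...   */
        read(fds[0], &c, 1);              /* ... and echoed back */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        printf("%ld\n", ns);              /* one round trip, in ns */
    }
    return 0;
}

Each printed number is one round trip in nanoseconds; percentiles such as the 99.99% slot mentioned below come from sorting these samples.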
System info:
[ezyang@hs01 latency]$ uname -a
Linux hs01.scs.stanford.edu 3.5.3-1-ARCH #1 SMP PREEMPT Sun Aug 26 09:14:51 CEST 2012 x86_64 GNU/Linux
Something is a bit suspicious about those numbers. Without -threaded the IO manager is not used at all, yet your second set of numbers is quite different from the first even without -threaded. Furthermore, the second set of numbers seems to have a lot more outliers (the 99.99% slot is much higher).
Hmm, it looks like the GHC 7.7.20130122 numbers are nonsense (I used what I thought was a perf build, but it looks like it actually was just a validate build, so libraries/rts may not have been as optimized as they could be. Hmm, I guess I should go re-run the nofib numbers too, since some of those might be suspicious.) The 7.4.2 numbers, being distro built, should still be valid.
Oh, and here are the C numbers, on the same machine. These should roughly correspond directly to the domain socket overhead.
When I attempt to replicate the conditions of #3838 (closed) (by removing the thread bump), I do not see any appreciable increase in runtime. This occurs even if I explicitly yield while inside the atomicModifyIORef (using unsafePerformIO), or I increase the number of threads being run. My conclusion is that a small pileup does occur, but since there's no longer a global list we keep traversing it's not expensive. But I can't rule out (1) the test-case having enough incidental other changes that have made it stop tickling the bug, or (2) some other RTS change masking over this particular problem. A conundrum!
Edward: if I recall correctly, the thread bump was only a minor optimisation, the real fix for #3838 (closed) was the redesign of black hole blocking so that we didn't have a global list. So that would agree with your findings.
I don't have a lot of evidence that bumping the blocker is a good idea. I was hoping you might find some :-)