Currently, GHC uses a round-robin scheduler for Haskell threads, with some heuristics for when threads should be bumped to the front of the queue. This patch set replaces this scheduler with 'stride scheduling', as described by Waldspurger and Weihl '95, which is an efficient, deterministic method for scheduling processes with differing priorities. Priorities are assigned by giving 'tickets' to threads; a thread with twice as many tickets as another will run twice as often. I’d like to replace the round-robin scheduler completely with this scheduler.
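For readers who don't have the paper to hand, here is a minimal sketch of the core idea (illustrative only; STRIDE1 matches the constant discussed below, but run_for_one_quantum and the field layout are assumptions, not the patch's actual code): every thread advances a 'pass' counter by a stride inversely proportional to its tickets, and the scheduler always runs the thread with the smallest pass.

#define STRIDE1 (1 << 20)   /* stride of a one-ticket thread */

typedef struct Thread_ {
    unsigned int tickets;   /* proportional share of the CPU */
    unsigned int stride;    /* STRIDE1 / tickets */
    unsigned int pass;      /* virtual time; the smallest runs next */
} Thread;

extern void run_for_one_quantum(Thread *t);   /* hypothetical */

/* One scheduling step over a runnable set (a real implementation
 * would use a priority queue instead of this linear scan). */
static void schedule_one(Thread **runnable, int n)
{
    int i, min = 0;
    for (i = 1; i < n; i++)
        if (runnable[i]->pass < runnable[min]->pass)
            min = i;
    run_for_one_quantum(runnable[min]);
    runnable[min]->pass += runnable[min]->stride;
}

A thread with twice the tickets has half the stride, so its pass advances half as fast and it is selected twice as often.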
Here are nofib benchmarks comparing the old scheduler to the new:
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -0.0%    -52.2%    -18.8%    -18.6%    -83.1%
            Max          +1.0%     +4.2%     +4.9%     +5.1%     +7.7%
 Geometric Mean          +0.1%     -2.8%     -0.9%     -0.9%     -2.9%
Here are some technical details:
Without any changes to ticket values, scheduling behavior should be functionally identical to round-robin. (By default, new threads, including the IO thread, get allocated the maximum number of tickets possible.) In practice it is not quite identical, since our heap does not have the FIFO property (see below).
The current patch set uses a very simple (undergrad-level) resizable-array-backed heap to implement the priority queue. We could play some tricks to reduce the memory footprint of the priority queue (e.g. using the container_of macro to eliminate the separate key store), and a fancier data structure would make it easier to surgically remove entries or reweight them, but I wanted to keep overhead low. If anyone has a pet implementation of priority queues in C, feel free to swap it in. Right now, this only affects the uses of promoteInRunQueue() in Messages.c; I still need to check whether #3838 (closed) has regressed.
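To illustrate the container_of trick mentioned above (a sketch under assumed names, not the patch's code): if the heap's slots are pointers to the pass field embedded in each TSO, the separate key store disappears, and container_of recovers the owning TSO from a slot.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct StgTSO_ {
    /* ... the rest of the TSO ... */
    unsigned int pass;        /* heap key, embedded in the TSO itself */
} StgTSO;

/* The heap holds pointers to the embedded keys; no parallel key
 * array.  Allocated and resized elsewhere. */
static unsigned int **heap;

static StgTSO *heapMinTSO(void)
{
    return container_of(heap[0], StgTSO, pass);
}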
We get three new primops: setTickets#, getTickets# and modifyTickets#. We don't support creating threads with specific numbers of tickets (mostly because that would have added an annoyingly large set of extra primops); instead, you're expected to spawn a thread which gets max-ticket allocation, and then weight it down.
_link is no longer used for linking TSOs in the run queue. I have tried my best to stamp out any code which operated on this assumption, but I may have missed some.
Modifying a TSO's tickets takes out the scheduler lock; the hope is that this operation is quick and rare enough that a global lock here is not too bad.
We are unfortunately stuck with some magic constants w.r.t. ticket values: 1 << 20 is the maximum number of tickets our implementation is hard-coded to support.
Sleeping and blocked tasks do not get any 'bonus' for being blocked.
In an ideal world, when a thread hits a black hole, it should temporarily give its tickets to the thread evaluating the black hole, so it will unblock more quickly. Unfortunately, implementing this is pretty complicated (the blackhole evaluating thread could die; or it could get stuck on a blackhole itself and need to gift its tickets; it shouldn't be able to give away the tickets it was gifted.) So this implementation leaves that out. Similar semantics for MVars are probably possible, but will require userland assistance too.
I haven't prepared a patch to base yet with a user-level API, but here is a proposed draft:
type Tickets = Int

-- | Maximum number of tickets we support a thread having. (Currently 2 ^ 20.)
-- Note that this doesn't bound the *global* maximum tickets.
maxTickets :: Tickets

-- | Changes the number of tickets allocated to a thread. The ticket count must
-- not be less than or equal to zero, or greater than maxTickets. (Corresponds
-- to the setTickets# primop.)
setTickets :: ThreadId -> Tickets -> IO ()

-- | Returns the number of tickets currently allocated to a thread. (Corresponds
-- to the getTickets# primop.)
getTickets :: ThreadId -> IO Tickets

-- | Atomically performs a linear transformation on the number of tickets of a
-- thread, e.g. scaling the number of tickets by a rational number, adding
-- another fixed set of tickets, and then returning the number of 'leftover'
-- tickets from the operation; e.g. if the net amount of tickets is reduced,
-- then the returned result is positive; if the net amount of tickets is
-- increased, the returned result is negative. In the absence of concurrent
-- threads, the following property holds for all t, m and x:
--
--      do r  <- getTickets t
--         d  <- modifyTickets t m x
--         r' <- getTickets t
--         return (r == r' + d)
--
-- If the scaling would reduce the number of tickets below zero or increase the
-- number of tickets beyond maxTickets, this function will instead fail with an
-- exception. This function is subject to some rounding error; powers of two
-- are, however, likely to be exact. (Corresponds to the modifyTickets# primop;
-- note that the sentinel value for failure is maxTickets + 1, since it is
-- impossible for a thread's ticket value to change by that much.)
modifyTickets :: ThreadId -> Ratio Int -> Tickets -> IO Tickets

-- | Forks a new thread, transferring some percentage of tickets from the
-- current thread to it (so the net number of tickets stays constant). Fails if
-- the rational is greater than 1 or less than or equal to zero, or if there
-- are not enough tickets in the current thread.
forkIOWith :: IO a -> Ratio Int -> IO ThreadId
Ok, to get started on the reviewing process, I'll jot down some concerns.
First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
Run all the tests in testsuite/concurrent. They aren't run by default in a validate.
The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 (closed) hasn't regressed it should be ok.
could we have some more docs on the scheduler functions, capPassUpdate etc. please?
Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
I don't see why sched_mutex is needed in setTickets and co. Could you explain?
Summary: SMP benchmarks show massive degradation of performance, so probably one of the performance bugs Simon Marlow pointed out is really killing us and needs to be fixed.
> First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
Oops, I forgot that normal nofib doesn't do smp. SMP doesn't look so good, and it seems likely that some of the performance problems you've noted below might be contributing. I'll investigate more thoroughly...
> Run all the tests in testsuite/concurrent. They aren't run by default in a validate.
Yup, I've been running a full testsuite run and will be checking those results today.
> The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
I can reduce this by one by paying the cost of a division every time we're scheduled (since stride = STRIDE1/tickets, always); we can reduce it by one more by observing that pass and remain are never both set at the same time and turning it into a union. I'm not sure what benchmarks in particular would start failing due to the increase in size, other than just general badness.
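Concretely, the two savings would look something like this (a sketch; the field names are assumptions, and StgWord32 is spelled out here for self-containedness):

#include <stdint.h>
typedef uint32_t StgWord32;   /* stands in for the RTS typedef */

#define STRIDE1 (1 << 20)

typedef struct {
    StgWord32 tickets;        /* >= 1 by the API's invariant */
    union {                   /* never both live at once:    */
        StgWord32 pass;       /*   while on the run queue    */
        StgWord32 remain;     /*   while blocked             */
    } u;
} SchedFields;                /* two words instead of four */

/* stride is no longer cached; pay one division per schedule instead */
static inline StgWord32 strideOf(const SchedFields *s)
{
    return STRIDE1 / s->tickets;
}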
> Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
Here's the secret: Linux's CFS is stride scheduling! The primary differences here are the backing data structure (they use a red-black tree while we're using a simple heap) and how timeslices are measured (Linux counts nanoseconds; we stupidly just consider a jump to the scheduler one quantum.)
It is possible to implement CFS-style sleeper fairness by modifying what happens to a process's pass when it sleeps. I'll post a version with those changes.
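Roughly what I have in mind (a sketch with assumed names, reusing the Thread fields from the sketch in the opening comment): on wakeup, the sleeper's stale pass is clamped against the queue's current minimum, so it gets a bounded boost rather than unbounded credit for time spent asleep.

typedef struct Thread_ {           /* as in the earlier sketch */
    unsigned int tickets, stride, pass;
} Thread;

/* Sleeper fairness on wakeup (sketch).  queue_min_pass is the
 * smallest pass currently in the run queue; the waking thread may
 * run at most one stride "early", no matter how long it slept. */
static void onWakeup(Thread *t, unsigned int queue_min_pass)
{
    unsigned int floor_pass = queue_min_pass - t->stride;
    if ((int)(t->pass - floor_pass) < 0)   /* wrap-safe "pass < floor" */
        t->pass = floor_pass;
}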
> Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
Could be a problem. If we switched to a tree data structure, we could probably recover this behavior (the GC will only have to traverse log n).
> promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 (closed) hasn't regressed it should be ok.
There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.
> could we have some more docs on the scheduler functions, capPassUpdate etc. please?
OK.
> Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
OK.
> I don't see why sched_mutex is needed in setTickets and co. Could you explain?
setTickets can be called concurrently with the scheduler, and since it both reads and writes scheduler state, it needs to be protected by a lock.
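For concreteness, the shape of the critical section (a sketch: ACQUIRE_LOCK, RELEASE_LOCK and sched_mutex are real RTS names, but stgSetTickets and requeueIfRunnable are hypothetical): the update has to be atomic with respect to the schedulers because changing the ticket count changes the TSO's stride, and hence its position in some capability's run queue.

/* Sketch only; in RTS context ACQUIRE_LOCK/RELEASE_LOCK come from
 * OSThreads.h, and requeueIfRunnable is a hypothetical helper. */
void stgSetTickets(StgTSO *tso, StgWord32 tickets)
{
    ACQUIRE_LOCK(&sched_mutex);
    tso->tickets = tickets;      /* the write ... */
    /* ... and the read/adjust: if the TSO is sitting in some
     * capability's run queue, its stride (and hence its position
     * in the priority queue) just changed, so reposition it. */
    requeueIfRunnable(tso);      /* hypothetical */
    RELEASE_LOCK(&sched_mutex);
}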
The goal of the change is to allow Haskell programmers to control how frequently the threads they create are run: for example, when some threads are important and should run as frequently as possible, or when threads are created by an untrusted user who should not be allowed to thread-bomb the system.
> Roughly, how does the new scheduling algorithm achieve the goal?
Stride scheduling operates by replacing our old doubly-linked-list run queue with a priority queue ordered by 'pass' values. Each thread's pass increases by its stride (computed from its ticket count) every time it runs; thus a thread with a low stride runs multiple times before its pass catches up.
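A worked example may help (a hypothetical trace; numbers chosen small for readability, whereas the real STRIDE1 is 1 << 20): let STRIDE1 = 6, give thread A two tickets (stride 3) and thread B one ticket (stride 6), start both passes at 0, and break ties in favour of A.

step   runs   pass(A)   pass(B)
  1      A       3         0
  2      B       3         6
  3      A       6         6
  4      A       9         6
  5      B       9        12
  6      A      12        12

Over these six quanta A runs four times and B twice, matching the 2:1 ticket ratio.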
> There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?
The priorities are set by the programmer using setTickets and related functions.
> What about priority inversion?
Without any change to priorities, scheduling behavior is intended to be identical to what it was previously. Priority inversion can occur when a thread blocks on another; this patch doesn't address that yet (though I've described above how it might be done for BLACKHOLEs).
> Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
Based on some investigation, this appears to be one of the primary causes of the slowdown. I checked by comparing how often markCapability was called against the number of actual evacuations performed on TSO objects. Although the number of markCapability calls is the same in both cases, TSO objects get evacuated a lot more frequently in the new world.
One of the things that I'm not really clear about is how to tell if I can skip evacuating an object. This GC part of the code is pretty opaque to me, and while I feel like there should be something like a log(n) solution I'm not seeing an implementation path.
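For what it's worth, my understanding of how the existing GC already skips old objects (a paraphrase of the early-out in rts/sm/Evac.c from memory; details may be off) is a generation check on the object's block descriptor:

/* Paraphrase, not exact GHC code; types sketched minimally. */
typedef struct { unsigned int gen_no; /* ... */ } bdescr;

extern bdescr *Bdescr(void *p);      /* real RTS: block descriptor of p */
extern unsigned int collecting_gen;  /* hypothetical: gen being collected */

void evacuate_sketch(void **p)
{
    bdescr *bd = Bdescr(*p);
    if (bd->gen_no > collecting_gen) {
        /* object already lives in an older generation: a minor GC
         * leaves it in place, so the "call" costs almost nothing */
        return;
    }
    /* ... otherwise copy the object to to-space and update *p ... */
}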
I would expect to see lots more calls to evacuate on TSOs, but in most cases the TSO should be in the old generation, so evacuate will just return (in the case of a minor GC, that is). Are you counting actual copies, or just calls to evacuate?
It would probably be a good idea to add some diagnostics to track the size of the run queue too. I'm not completely convinced that this is your problem yet.
You’re right, it’s not (or, at least, the situation is more complicated.) With the following gratuitous change:
diff --git a/rts/Capability.c b/rts/Capability.c
index 811df58..b4a1ec0 100644
--- a/rts/Capability.c
+++ b/rts/Capability.c
@@ -1007,6 +1007,10 @@ markCapability (evac_fn evac, void *user, Capability *cap,
     // or fewer Capabilities as GC threads, but just in case there
     // are more, we mark every Capability whose number is the GC
     // thread's index plus a multiple of the number of GC threads.
+    StgTSO *tso;
+    for (tso = cap->run_queue_hd; tso != END_TSO_QUEUE; tso = tso->_link) {
+        evac(user, (StgClosure **)(void *)&tso);
+    }
     evac(user, (StgClosure **)(void *)&cap->run_queue_hd);
     evac(user, (StgClosure **)(void *)&cap->run_queue_tl);
 #if defined(THREADED_RTS)
I've been focusing on sieve, since it had the most slowdown. What I do notice when I add some diagnostics is that the run queue starts with 100 threads in it, and that the program runs a lot more slowly when the run queue is large than when it is small. When I count actual copies, TSOs are copied a lot more frequently in the new code. If you have some ideas on what to look at, that would be great!
I think what I am going to try to do is implement another version which uses sorted doubly linked lists instead, and see if the same problems crop up. (I also realized that we're leaking memory for the run queues, since I never free them. Oops!)
Maybe a good idea is to replace the malloc'd array with a MUT_ARR_PTRS closure; then we only evacuate one thing (and all of the card marking happens automatically). Unfortunately, there don't seem to be any amenities for manipulating these from RTS-land.
Here is an alternate patch, which uses sorted linked lists. It is mostly interesting because it also exhibits the same slowdown as the original patch, even though it should hypothetically act the same way as the original scheduler (except for some constant overhead.)
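For reference, the core of the sorted-list variant (a sketch with assumed field names): insertion walks from the tail, which is cheap in the common case where a freshly descheduled thread has the largest pass, and popping the minimum off the head stays O(1).

typedef struct TsoNode_ {
    struct TsoNode_ *prev, *next;
    unsigned int pass;
} TsoNode;

typedef struct { TsoNode *head, *tail; } RunQueue;

/* Insert t, keeping the queue sorted by ascending pass. */
static void insertSorted(RunQueue *q, TsoNode *t)
{
    TsoNode *cur = q->tail;
    while (cur != NULL && cur->pass > t->pass)
        cur = cur->prev;
    /* link t after cur (or at the head if cur == NULL) */
    t->prev = cur;
    t->next = cur ? cur->next : q->head;
    if (t->next) t->next->prev = t; else q->tail = t;
    if (cur)     cur->next     = t; else q->head = t;
}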
Perhaps you should check that the scheduling behaviour is the same. e.g. in sieve, if we manage to run a thread to completion in one time slice, then it won't need to be copied by the GC, but if it gets preempted then it will get copied.
A glance at the ThreadScope profile, or even just the eventlog, might be illuminating. Also check the +RTS -s stats, a good sanity check is whether the amount of allocation is the same (if not, something strange is going on). And the GC stats will give you more clues.
OK, I have a much better sense for where the performance problems are coming from.
Adding extra words to the TSO has no impact on most smp benchmarks, but really kills threads006 (up to 100% slowdown); this is not too surprising since this benchmark involves creating 200,000 threads. A representative stat without a small TSO is
What I don't understand is why the memory in use blows up by 150MB; by my count, the extra words should only be adding something like 10MB of overhead; my best guess is that I am actually nudging TSO size over some invisible threshold (maybe the big object threshold).
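For the record, the back-of-the-envelope arithmetic behind that estimate (assuming 64-bit words and four extra words per TSO):

200,000 threads × 4 extra words/TSO × 8 bytes/word ≈ 6.4 MB

which is indeed nowhere near the observed 150 MB jump.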
From there, the next two bugs have to do with subtle scheduler changes. The first I have a clean fix for (it was a plain old bug): 'sieve' slows down a lot if threads which HeapOverflow don't get put back at the front of the run queue. Fixing that dramatically improves runtime for all of the benchmarks except 'threads003'; we can make that performance problem go away if we force threads to get appended to the end of the run queue (of course, stride scheduling won't work in that case!) I'm still investigating these.
Well, the magic threshold (on x86_64) is apparently one StgWord64 and one StgWord32; add one more StgWord32 and the memory usage blows up. (That is, increasing the struct's size to 16 64-bit words; 15 and 14 work fine.) I'm a bit surprised we were so close to the limit; I guess I'll have to figure out how to do this with only two variables, or see if I can condense other parts of the TSO struct. (Actually, we probably have even less to work with, since Windows adds another StgWord32 and PROFILING adds another pointer field... so we literally have zero extra fields! Ouch!)
/* The size argument we are given includes all the per-thread
 * overheads:
 *
 *    - The TSO structure
 *    - The STACK header
 *
 * This is so that we can use a nice round power of 2 for the
 * default stack size (e.g. 1k), and if we're allocating lots of
 * threads back-to-back they'll fit nicely in a block.  It's a bit
 * of a benchmark hack, but it doesn't do any harm.
 */
I don't think the TSO problem (in threads006) has to do with influencing stack size; when I run with GC stats, I don't see any thread overflows (the stack is only one word smaller), and the overall allocation remains the same (TSO ends up being 0x10 aligned, but I don't see why that would cause performance problems...)
If we used actual time rather than scheduler quanta, wouldn't that fix the heap-check case too? Calling gettimeofday is very cheap these days, cheap enough that we can call it every time we emit an event in the event log, so I'm sure it won't be a problem to call it in the scheduler.
My guess is that with threads003 we're just pushing the size up enough that the GC has to touch another cache line per TSO. If that's the case, then I'm not too worried, since this is a microbenchmark and won't affect real world performance much. But we should verify that that's what's going on. It can't be that we're allocating another stack chunk, because that would show up in the allocations.
Check whether the amount of memory copied by the GC goes up in proportion to the amount you added to the TSO struct. If it is a lot more, then something else is going on. Also try some measurements with perf?
Once this is squashed (or accounted for), we can move on to checking that priorities work right, and looking at the primops and the user API.
[ezyang@hs01 threads006]$ ./Main-big 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     853,296,496 bytes copied during GC
     197,453,072 bytes maximum residency (10 sample(s))
      25,701,000 bytes maximum slop
             426 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-medium 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     690,756,096 bytes copied during GC
     125,998,552 bytes maximum residency (9 sample(s))
      24,121,032 bytes maximum slop
             270 MB total memory in use (0 MB lost due to fragmentation)
That's an over 150MB jump.
I think I have an important clue, though; consider these three runs on the big TSO executable:
[ezyang@hs01 threads006]$ ./Main-big 182260 +RTS -s
     204,318,544 bytes allocated in the heap
     622,311,352 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,499,512 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182261 +RTS -s
     204,319,648 bytes allocated in the heap
     622,313,432 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,453,024 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182262 +RTS -s
     204,320,752 bytes allocated in the heap
     791,958,776 bytes copied during GC
     169,695,072 bytes maximum residency (10 sample(s))
      23,503,128 bytes maximum slop
             395 MB total memory in use (0 MB lost due to fragmentation)
Wow! In the case of the 182262nd thread, it's literally the straw that broke the camel's back. I went ahead and logged GC events. The breaking point occurs here:
----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large       Live      Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   53762         0       2         0       1184      7008
    1   53762         3   53766        94  196721432  23504104
----------------------------------------------------------
                                           196722616  23511112
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 53766 blocks ( 210.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         :  6332 blocks (  24.7 MB)
  total        : 60228 blocks ( 235.3 MB)

alloc new todo block 0x7f13c7aff000 for gen 0
alloc new todo block 0x7f13c7ca1000 for gen 1
[snip]

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large       Live      Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   82858         0       2         0        104      8088
    1   82858         1   45569         3  169695176  16955448
----------------------------------------------------------
                                           169695280  16963536
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 45569 blocks ( 178.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         : 53841 blocks ( 210.3 MB)
  total        : 99540 blocks ( 388.8 MB)
Inside the snip, approximately a hundred megablocks are allocated as TODO space for scavenging, à la
allocated 1 megablock(s) at 0x7f13b5a00000
alloc new todo block 0x7f13b5a04000 for gen 1
increasing limit for 0x7f13b5a04000 to 0x7f13b5a04800
It seems we run out of todo blocks while evacuating, and then essentially need to copy our entire heap. Ouch! Unfortunately, I don't know enough to say if this is a GC bug or not.
Correct me if I'm wrong, but in the "slow" case you're just doing one more major GC than in the fast case, right? The extra thread just pushed allocation over the threshold for the next GC. If that's the case, then this is not a problem at all.
The sorted list implementation performs comparably to the original when no priority changes are involved. However, it does not apply sleeper fairness (give threads that were sleeping extra timeslices), and when sleeper fairness is applied, we suffer from performance degradation (since operations on the sorted list are no longer constant time). So if this implementation wants to go anywhere, it needs to be upgraded into something like a skip list, which can efficiently search and remove things inside the structure.
I've stamped out some of the performance bugs in the array implementation (mostly by adding an extra queue which we use to handle "must run now!" threads). However, we suffer from a huge bottleneck in deleteMinPQueue(), which according to perf takes up 22% of the runtime. Apparently logarithmically many comparisons and memory writes cost a lot when many threads are involved. I am going to look for something more high-performance.
Some other interesting results about sleeper fairness: for most benchmarks, it doesn't have much effect, but threads003 displays some interesting behavior (this comparison is before and after sleeper fairness is applied):
So, changing how we schedule blocked threads (i.e. giving them a priority boost) results in a 10% speedup, but we end up using 20% more memory! A very interesting tradeoff, if you ask me...
The red-black tree implementation is now behaving very consistently, with only 1-3% overhead. (Right now, it uses a malloc'd free list of red-black tree nodes; this ended up being quicker than using heap-allocated objects.) Unfortunately, sleeper fairness (with the objective of giving sleepy threads a bit of a boost) makes all of the benchmarks perform worse, in part because sleepy threads increase the overhead of managing processes, since they tend to have different priorities. I don't at the moment have a good way of measuring latency change; see #7659 (closed) for a bug related to that. Additionally, the impact of scheduling is fairly hard to observe on some test programs I've constructed unless you insert explicit yields into your programs. One more unfortunate thing: the rb-trees are set up to get a performance boost when all threads have the same priority; when threads have different priorities you will immediately see a slowdown, because more rb-tree manipulations become necessary.
The array heap implementation has lower overhead (0-2%) but is far more inconsistent; some benchmarks do much worse. Right now it’s still pretty unclear where all of this overhead is coming from.
The end conclusion is as follows: if red-black trees end up being the dominant implementation, then there is still a chance that we can replace the current scheduler wholesale, since its runtime characteristics are very similar to the current scheduler's. (Sleeper fairness will need to remain a configurable option, assuming we can show that it improves latency.) However, if array heaps end up having much better properties on prioritized workloads (test cases for which I have not constructed yet), I find it highly implausible that we'll be able to finesse the implementation to work identically to the current scheduler, and we will need to support both. On the plus side, because I have made so many implementations, I have a good sense of how to refactor the API properly.
I’ll attach some patches, but the comments still need some cleaning up.
Looking good, I'm still excited about this work. I really hope you'll put some of your benchmarks into nofib/smp - the current set may well be giving us a skewed view of what's important. I definitely agree that we want latency benchmarks too, but I haven't thought about how to construct them.
Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.
Yep, I think any new benchmarks will probably be pretty important, maybe even more so than the actual stride scheduling :)
> Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.
I've tried it, jerry-rigging StgMutArrPtrs. I concluded we'll need to make a custom closure (ugh!) or make some improvements to how GHC garbage collects mutable objects (actually, I was planning on filing a bug; here we go: #7662). Mutable arrays have some funny characteristics: the write barriers are relatively cheap, but the arrays all get scanned anyway because they are permanently on the mutable list, so we don't end up getting any benefit. (There aren't any other existing mutable objects which can be jerry-rigged here.) Adding things to the mutable list may end up being expensive in the long run.
The current malloc'd red-black implementation is pretty efficient; we maintain a free list and allocate them in blocks. I haven't implemented reclamation yet but even if we leak all of the free nodes it will only consume space up to the high watermark of number of simultaneous threads.
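The allocator shape described above, for concreteness (a sketch; the block size and names are assumptions): nodes are carved out of malloc'd blocks and recycled through an intrusive free list, so allocation and freeing are pointer pops and pushes.

#include <stdlib.h>

typedef struct RBNode_ {
    struct RBNode_ *free_next;   /* doubles as the free-list link */
    /* ... key, children, colour ... */
} RBNode;

#define NODES_PER_BLOCK 256      /* assumed; tune to taste */

static RBNode *free_list = NULL;

static RBNode *allocNode(void)
{
    if (free_list == NULL) {
        RBNode *block = malloc(NODES_PER_BLOCK * sizeof(RBNode));
        if (block == NULL) return NULL;
        /* thread the whole new block onto the free list */
        for (int i = 0; i < NODES_PER_BLOCK; i++) {
            block[i].free_next = free_list;
            free_list = &block[i];
        }
    }
    RBNode *n = free_list;
    free_list = n->free_next;
    return n;
}

/* O(1); the blocks themselves are never returned to malloc, which is
 * the "leak up to the high watermark" behaviour described above. */
static void freeNode(RBNode *n)
{
    n->free_next = free_list;
    free_list = n;
}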
I'd like to add a word of caution to the inclusion of any form of priority/weighted scheduling into GHC at this time.

Scheduling (unless it discards work) does not, a priori, change the total amount of work that has to be done, just its order. Such scheduling strategies don't create computational power, they just distribute the "disappointment" of not being first... This would imply that all of the timing changes reported in this thread are the result of interactions with other resources - different patterns of memory access, garbage collection etc.

The community already recognises these sorts of performance couplings, and there are several mechanisms (at a higher level) that can be used to manage the inevitable performance trades.

I strongly believe that adding differential scheduling at this time will be, overall, counterproductive to GHC's RTS development. It will create a new class of performance artefacts that will require resources to investigate. Time will be spent pushing trades around, chasing the proverbial bubble under the wallpaper around and around.

Please don't get me wrong, I am highly appreciative of Edward's efforts. The more overhead we can reduce in the scheduling, the better.

If we want to start looking at more complex differential treatment of the allocation of supply (CPU cycles) to demand (task steps), then we need to consider how that trade interacts with the others, and how the overall trading space is going to manifest itself in terms of end-user goals. We also need to come up with a sensible approach to discarding computational demand (load shedding) when the "system" becomes "excessively" infeasible to schedule to meet those goals. Fortunately there are approaches to this (stuff I've been involved with in data networks for many years) - and Haskell's operational model has all the right hooks to do this. Interested PhD students, do get in touch...
OK, after a brief break, I've started working on latency benchmarks to help evaluate how much alternative scheduling algorithms help. Of course, the first order of business was to establish some baselines, so I ran some tests on GHC 7.4 first, as well as a bare-bones C version of the application. The tests are here https://github.com/ezyang/latency ; eventually I'll integrate them with nofib (although it's not obvious how to integrate them into the current nofib reporting framework). The basic method is to write a byte onto a Unix domain socket; I'm sure I've botched some instrumentation somewhere so some eyeballs here would be useful. I can also make pretty graphs if people are interested.
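For the curious, the skeleton of such a measurement looks something like the following (a hedged sketch, not the code in the repository; the Haskell variants time the same round trip from the Haskell side):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;

    if (fork() == 0) {                    /* echo peer */
        char c;
        while (read(fds[1], &c, 1) == 1)
            write(fds[1], &c, 1);
        _exit(0);
    }

    for (int i = 0; i < 10000; i++) {
        struct timespec t0, t1;
        char c = 'x';
        clock_gettime(CLOCK_MONOTONIC, &t0);
        write(fds[0], &c, 1);             /* one byte out ...   */
        read(fds[0], &c, 1);              /* ... and echoed back */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        printf("%ld\n", ns);              /* one round trip, in ns */
    }
    return 0;
}

Each printed number is one round trip in nanoseconds; percentiles such as the 99.99% slot mentioned below come from sorting these samples.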
System info:
[ezyang@hs01 latency]$ uname -a
Linux hs01.scs.stanford.edu 3.5.3-1-ARCH #1 SMP PREEMPT Sun Aug 26 09:14:51 CEST 2012 x86_64 GNU/Linux
Something is a bit suspicious about those numbers. Without -threaded the IO manager is not used at all, yet your second set of numbers is quite different from the first even without -threaded. Furthermore, the second set of numbers seems to have a lot more outliers (the 99.99% slot is much higher).
Hmm, it looks like the GHC 7.7.20130122 numbers are nonsense (I used what I thought was a perf build, but it looks like it actually was just a validate build, so libraries/rts may not have been as optimized as they could be. Hmm, I guess I should go re-run the nofib numbers too, since some of those might be suspicious.) The 7.4.2 numbers, being distro built, should still be valid.
Oh, and here are the C numbers, on the same machine. These should roughly correspond directly to the domain socket overhead.
When I attempt to replicate the conditions of #3838 (closed) (by removing the thread bump), I do not see any appreciable increase in runtime. This occurs even if I explicitly yield while inside the atomicModifyIORef (using unsafePerformIO), or I increase the number of threads being run. My conclusion is that a small pileup does occur, but since there's no longer a global list we keep traversing it's not expensive. But I can't rule out (1) the test-case having enough incidental other changes that have made it stop tickling the bug, or (2) some other RTS change masking over this particular problem. A conundrum!
Edward: if I recall correctly, the thread bump was only a minor optimisation, the real fix for #3838 (closed) was the redesign of black hole blocking so that we didn't have a global list. So that would agree with your findings.
I don't have a lot of evidence that bumping the blocker is a good idea. I was hoping you might find some :-)