Opened 15 months ago

Last modified 12 months ago

#7606 new feature request

Stride scheduling for Haskell threads with priorities

Reported by: ezyang Owned by: ezyang
Priority: normal Milestone: 7.8.3
Component: Runtime System Version: 7.7
Keywords: Cc: jwlato@…, wren@…, hackage.haskell.org@…, ikke+ghc@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Currently, GHC uses a round-robin scheduler for Haskell threads, with some heuristics for when threads should be bumped to the front of the queue. This patch set replaces this scheduler with 'stride scheduling', as described by Waldspurger and Weihl '95, which is an efficient, deterministic method for scheduling processes with differing priorities. Priorities are assigned by giving 'tickets' to threads; a thread with twice as many tickets as another will run twice as often. I’d like to replace the round-robin scheduler completely with this scheduler.

Here are nofib benchmarks comparing the old scheduler to the new:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -0.0%    -52.2%    -18.8%    -18.6%    -83.1%
            Max          +1.0%     +4.2%     +4.9%     +5.1%     +7.7%
 Geometric Mean          +0.1%     -2.8%     -0.9%     -0.9%     -2.9%

Here are some technical details:

  • Without any changes to ticket values, scheduling behavior should be functionally identical to round-robin. (By default, new threads, including the IO thread, are allocated the maximum number of tickets possible.) In practice it is not quite identical, since our heap does not have the FIFO property (see below).
  • The current patch-set uses a very simple (undergrad-level) resizable-array-backed heap to implement the priority queue. We could play some tricks to reduce its memory footprint (e.g. using the container_of macro to eliminate the separate key store), and a fancier data structure would make it easier to surgically remove or reweight entries, but I wanted to keep overhead low. If anyone has a pet implementation of priority queues in C, feel free to swap it in. Right now, this only affects the uses of promoteInRunQueue() in Messages.c; I still need to check whether #3838 has regressed.
  • We get three new primops: setTickets#, getTickets# and modifyTickets#. We don't support creating threads with specific numbers of tickets (mostly because that would have added an annoyingly large set of extra primops); instead, you're expected to spawn a thread which gets max-ticket allocation, and then weight it down.
  • _link is no longer used for linking TSOs in the run queue. I have tried my best to stamp out any code which operated on this assumption, but I may have missed some.
  • Modifying a TSO's tickets takes out the scheduler lock; the hope is that this operation is quick and rare enough that a global lock here is not too bad.
  • We are unfortunately stuck with some magic constants w.r.t. ticket values: 1 << 20 is the maximum number of tickets our implementation is hard-coded to support.
  • Sleeping and blocked tasks do not get any 'bonus' for being blocked.
  • In an ideal world, when a thread hits a black hole, it should temporarily give its tickets to the thread evaluating the black hole, so it will unblock more quickly. Unfortunately, implementing this is pretty complicated (the blackhole evaluating thread could die; or it could get stuck on a blackhole itself and need to gift its tickets; it shouldn't be able to give away the tickets it was gifted.) So this implementation leaves that out. Similar semantics for MVars are probably possible, but will require userland assistance too.

I haven't prepared a patch to base yet with a user-level API, but here is a proposed draft:

type Tickets = Int

-- | Maximum number of tickets we support a thread having. (Currently 2 ^ 20)
-- Note that this doesn't bound the *global* maximum tickets.
maxTickets :: Tickets

-- | Changes the number of tickets allocated to a thread.  The ticket count must
-- be greater than zero and no greater than maxTickets. (Corresponds
-- to the setTickets# primop)
setTickets :: ThreadId -> Tickets -> IO ()

-- | Returns the number of tickets currently allocated to a thread.  (Corresponds to
-- getTickets# primop)
getTickets :: ThreadId -> IO Tickets

-- | Atomically performs a linear transformation on the number of tickets a thread
-- has: scaling the ticket count by a rational number, adding a fixed number of
-- tickets, and returning the number of 'leftover' tickets from the operation.
-- If the net number of tickets is reduced, the returned result is positive;
-- if the net number of tickets is increased, the returned result is negative.
-- In the absence of concurrent threads, the following property holds forall
-- t, m and x:
--
--      do r <- getTickets t
--         d <- modifyTickets t m x
--         r' <- getTickets t
--         return (r == r' + d)
--
-- If the scaling would reduce the number of tickets below zero or increase the
-- number of tickets beyond maxTickets, this function will instead fail with
-- an exception.  This function will be subject to some rounding error; powers of two
-- are, however, likely to be exact.  (Corresponds to modifyTickets# primop; note
-- that the sentinel value for failure is maxTickets + 1, since it is impossible for
-- a thread's ticket value to change by that much.)
modifyTickets :: ThreadId -> Ratio Int -> Tickets -> IO Tickets

-- | Forks a new thread, transferring the given fraction of tickets from the current
-- thread to it (so the total number of tickets stays constant).  Fails if the fraction
-- is greater than 1 or less than or equal to zero, or if there are not enough tickets
-- in the current thread.
forkIOWith :: IO a -> Ratio Int -> IO ThreadId
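
To make the intended usage concrete, here is a hypothetical sketch of a program using this draft API. None of these functions exist in base yet, and the worker bodies (loggerLoop, mainWork) are made up purely for illustration:

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.Ratio ((%))

main :: IO ()
main = do
  -- Fork a low-priority logger, transferring a tenth of the current
  -- thread's tickets to it; the parent keeps the remaining nine tenths.
  logger <- forkIOWith loggerLoop (1 % 10)
  -- Do the latency-sensitive work on the (relatively) higher-weighted
  -- main thread.
  mainWork
  -- If the logger falls behind, double its allocation, clamping at
  -- maxTickets.
  n <- getTickets logger
  setTickets logger (min maxTickets (n * 2))

loggerLoop :: IO ()
loggerLoop = forever (putStrLn "log tick" >> threadDelay 100000)

mainWork :: IO ()
mainWork = mapM_ print [1 .. 10 :: Int]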

Attachments (7)

0001-Simple-heap-based-priority-queue-implementation.patch (6.2 KB) - added by ezyang 15 months ago.
Simple priority queue implementation
0002-Stride-scheduling-draft-11-thread-migrating-implemen.patch (29.2 KB) - added by ezyang 15 months ago.
Stride scheduling patch
ghc-stride-scheduling-sorted-list.patch (18.8 KB) - added by ezyang 15 months ago.
Stride scheduling using sorted linked lists
ghc-stride-scheduling-sorted-list-2.10.patch (27.3 KB) - added by ezyang 15 months ago.
Sorted list implementation draft 2
common.patch (14.4 KB) - added by ezyang 15 months ago.
Common functionality used by both schedulers
0001-Red-black-trees-27-no-sleeper-fairness-copy-of-22.patch (41.5 KB) - added by ezyang 15 months ago.
Red-black tree implementation version 27
0001-Heap-draft-3-rebased.patch (21.4 KB) - added by ezyang 15 months ago.
array heap implementation draft 3 (second epoch)

Change History (47)

Changed 15 months ago by ezyang

Simple priority queue implementation

Changed 15 months ago by ezyang

Stride scheduling patch

comment:1 follow-up: Changed 15 months ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to 7.8.1

Awesome!

Ok, to get started on the reviewing process, I'll jot down some concerns.

  • First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
  • Run all the tests in testsuite/concurrent. They aren't run by default in a validate.
  • The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
  • Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
  • Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
  • promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 hasn't regressed it should be ok.
  • could we have some more docs on the scheduler functions, capPassUpdate etc. please?
  • Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
  • I don't see why sched_mutex is needed in setTickets and co. Could you explain?

comment:2 follow-up: Changed 15 months ago by simonpj

Can you also say a bit more about

  • What the goal of this change is.
  • Roughly how the new scheduling algorithm achieves the goal.
  • There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?
  • What about priority inversion?

Thanks

Simon

comment:3 in reply to: ↑ 1 Changed 15 months ago by ezyang

Summary: SMP benchmarks show massive degradation of performance, so probably one of the performance bugs Simon Marlow pointed out is really killing us and needs to be fixed.

Replying to simonmar:

  • First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.

Oops, I forgot that normal nofib doesn't do smp. SMP doesn't look so good, and it seems likely that some of the performance problems you've noted below might be contributing. I'll investigate more thoroughly...

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +7.7%     +7.6%     +0.0%
    callback002          +0.1%     +0.0%    +10.7%    +10.7%     +0.0%
           chan          +0.1%     +0.0%    -29.0%    -29.1%    -71.0%
          sieve          +0.1%     +0.0%   +141.6%   +142.0%    +89.3%
     threads001          +0.1%     +0.0%    +12.7%    +12.6%     +0.0%
     threads003          +0.1%     +9.1%    +44.7%    +44.6%    +25.8%
     threads006          +0.1%     +0.0%    +30.3%    +30.7%    +46.6%
     threads007          +0.1%     -0.2%    +17.0%    +17.0%     -6.8%
--------------------------------------------------------------------------------
            Min          +0.1%     -0.2%    -29.0%    -29.1%    -71.0%
            Max          +0.1%     +9.1%   +141.6%   +142.0%    +89.3%
 Geometric Mean          +0.1%     +1.1%    +22.5%    +22.5%     -0.7%

  • Run all the tests in testsuite/concurrent. They aren't run by default in a validate.

Yup, I've been running a full testsuite run and will be checking those results today.

  • The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.

I can reduce this by one by paying the cost of a division every time we're scheduled (since stride = STRIDE1/tickets, always); we can reduce it by one more by observing that pass and remain are never both set at the same time and turning it into a union. I'm not sure what benchmarks in particular would start failing due to the increase in size, other than just general badness.

  • Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?

Here's the secret: Linux's CFS is stride scheduling! The primary differences here are the backing data structure (they use a red-black tree while we're using a simple heap) and how timeslices are measured (Linux counts nanoseconds; we stupidly just consider a jump to the scheduler one quantum.)

It is possible to implement CFS-style sleeper fairness by modifying what happens to a process's pass when it sleeps. I'll post a version with those changes.

  • Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.

Could be a problem. If we switched to a tree data structure, we could probably recover this behavior (the GC will only have to traverse log n).

  • promoteInRunQueue: it would be a shame to lose this, but as long as we're sure that #3838 hasn't regressed it should be ok.

There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.

  • could we have some more docs on the scheduler functions, capPassUpdate etc. please?

OK.

  • Looks like we should do annulTSO in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.

OK.

  • I don't see why sched_mutex is needed in setTickets and co. Could you explain?

setTickets can be called concurrently, and since it performs both a read and a write, it needs to be protected by a lock.

comment:4 in reply to: ↑ 2 Changed 15 months ago by ezyang

Replying to simonpj:

  • What the goal of this change is.

The goal of the change is to let Haskell programmers control how frequently the threads they create are run: for example, when some threads are important and should be run as frequently as possible, or when threads are created on behalf of an untrusted user who should not be allowed to thread-bomb the system.

  • Roughly how the new scheduling algorithm achieves the goal.

Stride scheduling operates by replacing our old doubly-linked-list run queue with a priority queue. The priorities are 'passes', which increment by the thread's stride (computed from its ticket count) every time it runs; thus, a thread with a low stride will run multiple times before its pass catches up with the others'.
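
To illustrate the bookkeeping, here is a minimal Haskell sketch of the pass/stride idea; the real implementation lives in the RTS and is written in C, and these names are made up for exposition:

import Data.List (minimumBy)
import Data.Ord (comparing)

data Thread = Thread { tid :: Int, tickets :: Int, pass :: Int }

stride1 :: Int
stride1 = 2 ^ 20            -- the hard-coded maximum ticket count

stride :: Thread -> Int
stride t = stride1 `div` tickets t

-- Pick the runnable thread with the lowest pass, run it for one
-- quantum, and advance its pass by its stride.  A thread with twice
-- as many tickets has half the stride, so it gets chosen twice as often.
-- Precondition: the run queue is non-empty.
schedule :: [Thread] -> (Thread, [Thread])
schedule ts = (next, bumped : rest)
  where
    next   = minimumBy (comparing pass) ts
    bumped = next { pass = pass next + stride next }
    rest   = filter ((/= tid next) . tid) ts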

  • There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?

The priorities are set by the programmer using setTickets and related functions.

  • What about priority inversion?

Without any change to priorities, scheduling behavior is intended to be identical to what it was previously. Priority inversion can occur when a thread blocks on another; this patch doesn't address that yet (though I've described above how it could be done for BLACKHOLEs).

comment:5 Changed 15 months ago by ezyang

Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.

Based on some investigation, this appears to be one of the primary causes of the slowdown. I checked whether this was an issue by comparing how often markCapability calls occurred versus actual evacuations performed on TSO objects. Although the number of markCapability calls is the same in both cases, TSO objects get evacuated a lot more frequently in the new world.

One of the things that I'm not really clear about is how to tell whether I can skip evacuating an object. This part of the GC code is pretty opaque to me, and while I feel like there should be something like a log(n) solution, I'm not seeing an implementation path.

comment:6 Changed 15 months ago by simonmar

I would expect to see lots more calls to evacuate on TSOs, but in most cases the TSO should be in the old generation, so evacuate will just return (in the case of a minor GC, that is). Are you counting actual copies, or just calls to evacuate?

It would probably be a good idea to add some diagnostics to track the size of the run queue too. I'm not completely convinced that this is your problem yet.

comment:7 Changed 15 months ago by ezyang

You’re right, it’s not (or, at least, the situation is more complicated.) With the following gratuitous change:

diff --git a/rts/Capability.c b/rts/Capability.c
index 811df58..b4a1ec0 100644
--- a/rts/Capability.c
+++ b/rts/Capability.c
@@ -1007,6 +1007,10 @@ markCapability (evac_fn evac, void *user, Capability *cap,
     // or fewer Capabilities as GC threads, but just in case there
     // are more, we mark every Capability whose number is the GC
     // thread's index plus a multiple of the number of GC threads.
+    StgTSO *tso;
+    for (tso = cap->run_queue_hd; tso != END_TSO_QUEUE; tso = tso->_link) {
+        evac(user, (StgClosure **)(void *)&tso);
+    }
     evac(user, (StgClosure **)(void *)&cap->run_queue_hd);
     evac(user, (StgClosure **)(void *)&cap->run_queue_tl);
 #if defined(THREADED_RTS)

We only see a very minor bump in runtime:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
          sieve          +0.0%     +0.0%     +0.7%     +0.7%     +0.0%

I've been focusing on sieve, since it had the most slowdown. What I do notice when I add some diagnostics is that the run queue starts with 100 threads in it, and that the program runs a lot more slowly when the run queue is large than when it is small. When I count actual copies, TSOs are copied a lot more frequently in the new code. If you have some ideas on what to look at, that would be great!

I think what I am going to try to do is implement another version which uses sorted doubly linked lists instead, and see if the same problems crop up. (I also realized that we're leaking memory for the run queues, since I never free them. Oops!)

comment:8 Changed 15 months ago by ezyang

Measuring the effect of adding five 64-bit words to the TSO (and no other changes)

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
          sieve          +0.0%     +0.0%     +0.7%     +0.8%     +0.0%

That is, it hurts, but it is not the root cause of the major problems.

comment:9 Changed 15 months ago by ezyang

Maybe a good idea is to replace the malloc'd array with a MUT_ARR_PTR; then we only evacuate one thing (and all of the card marking happens automatically). Unfortunately, there don't seem to be any amenities for manipulating these from RTS-land.

comment:10 Changed 15 months ago by jwlato

  • Cc jwlato@… added

comment:11 Changed 15 months ago by WrenThornton

  • Cc wren@… added

Changed 15 months ago by ezyang

Stride scheduling using sorted linked lists

comment:12 Changed 15 months ago by ezyang

Here is an alternate patch, which uses sorted linked lists. It is mostly interesting because it also exhibits the same slowdown as the original patch, even though it should hypothetically act the same way as the original scheduler (except for some constant overhead.)

comment:13 Changed 15 months ago by simonmar

Perhaps you should check that the scheduling behaviour is the same. e.g. in sieve, if we manage to run a thread to completion in one time slice, then it won't need to be copied by the GC, but if it gets preempted then it will get copied.

A glance at the ThreadScope profile, or even just the eventlog, might be illuminating. Also check the +RTS -s stats; a good sanity check is whether the amount of allocation is the same (if not, something strange is going on). And the GC stats will give you more clues.

comment:14 Changed 15 months ago by simonmar

BTW, I like the idea of using a MUT_ARR, but the card marking is probably too coarse to help much (128 entries per card).

comment:15 Changed 15 months ago by ezyang

OK, I have a much better sense for where the performance problems are coming from.


Adding extra words to the TSO has no impact on most smp benchmarks, but really kills threads006 (up to 100% slowdown); this is not too surprising, since this benchmark involves creating 200,000 threads. A representative stat with the small TSO is

224198416 bytes, 425 GCs (416 + 9), 0/0 avg/max bytes residency (0 samples), 699092656 bytes GC work, 279M in use, 0.00 INIT (0.00 elapsed), 0.02 MUT (0.02 elapsed), 0.33 GC (0.33 elapsed), 0.12 GC(0) (0.12 elapsed), 0.21 GC(1) (0.21 elapsed)

A representative stat with a big TSO is

224198416 bytes, 425 GCs (415 + 10), 0/0 avg/max bytes residency (0 samples), 853296496 bytes GC work, 426M in use, 0.00 INIT (0.00 elapsed), 0.00 MUT (0.00 elapsed), 0.47 GC (0.47 elapsed), 0.12 GC(0) (0.12 elapsed), 0.34 GC(1) (0.35 elapsed)

What I don't understand is why the memory in use blows up by 150MB; by my count, the extra words should only be adding something like 10MB of overhead; my best guess is that I am actually nudging TSO size over some invisible threshold (maybe the big object threshold).


From there, the next two bugs have to do with subtle scheduler changes. The first bug I have a clean fix for (it was a plain old bug): 'sieve' slows down a lot if threads which HeapOverflow don't get put back at the front of the run queue. Fixing that dramatically improves runtime for all of the benchmarks except 'threads003': we can make that performance problem go away if we force threads to get appended to the end of the run queue (of course, stride scheduling won't work in that case!) I'm still investigating these.

comment:16 Changed 15 months ago by ezyang

Well, the magic threshold (on x86_64) is apparently one StgWord64 and one StgWord32; add one more StgWord32 and the memory usage blows up. (That is, increasing the struct's size to 16 64-bit words; 15 and 14 work fine.) I'm a bit surprised we were so close to the limit; I guess I'll have to figure out how to do this with only two variables, or see if I can condense other parts of the TSO struct. (Actually, we probably have even less to work with, since Windows adds another StgWord32 and PROFILING adds another pointer field... so we literally have zero extra fields! Ouch!)

comment:17 Changed 15 months ago by ezyang

I wonder if we're hitting this:

    /* The size argument we are given includes all the per-thread
     * overheads:
     *
     *    - The TSO structure
     *    - The STACK header
     *
     * This is so that we can use a nice round power of 2 for the
     * default stack size (e.g. 1k), and if we're allocating lots of
     * threads back-to-back they'll fit nicely in a block.  It's a bit
     * of a benchmark hack, but it doesn't do any harm.
     */

Changed 15 months ago by ezyang

Sorted list implementation draft 2

comment:18 Changed 15 months ago by ezyang

OK, I've fixed all of the performance problems, except for the TSO one.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.0%     +0.0%     -0.6%     +0.0%     +0.0%
    callback002          +0.0%     +0.0%     +0.2%     +0.0%     +0.0%
           chan          +0.0%     +0.0%     +0.2%     +0.2%     +3.2%
          sieve          +0.0%     +0.0%     -0.4%     -0.1%     +0.0%
     threads001          +0.0%     +0.0%     +0.8%     +0.8%     +0.0%
     threads003          +0.0%     -0.0%     +0.1%     +0.3%     +0.0%
     threads006          +0.0%     +0.0%    +35.1%    +34.9%    +52.7%
     threads007          +0.0%     +0.0%     -0.1%     +0.0%     +0.0%

I don't think the TSO problem (in threads006) has to do with influencing stack size; when I run with GC stats, I don't see any thread overflows (the stack is only one word smaller), and the overall allocation remains the same (TSO ends up being 0x10 aligned, but I don't see why that would cause performance problems...)

big (slows down)
    size = 0x80
    sizeOf(StgTSO) = 0x80 (0x10 * 8)
    free = 0x7ffff6b0a800
    allocate(0x70) = 0x7ffff6b0a800 (stack)
    free = 0x7ffff6b0ab80 (+0x380 == 0x70 * 8)
    allocate(0x10) = 0x7ffff6b0ab80 (tso)
    free = 0x7ffff6b0ac00 (+0x80)
    # overall, free += 0x400

medium (is fine)
    size = 0x80
    sizeOf(StgTSO) = 0x78 (0xf * 8)
    free = 0x7ffff6b0a800
    allocate(0x71) = 0x7ffff6b0a800 (stack)
    free = 0x7ffff6b0ab88 (+0x388 == 0x71 * 8)
    allocate(0xf) = 0x7ffff6b0ab88 (tso)
    free = 0x7ffff6b0ac00 (+0x78)
    # overall, free += 0x400

comment:19 Changed 15 months ago by liyang

  • Cc hackage.haskell.org@… added

comment:20 Changed 15 months ago by simonmar

Great progress!

If we used actual time rather than scheduler quanta, wouldn't that fix the heap-check case too? Calling gettimeofday is very cheap these days, cheap enough that we can call it every time we emit an event in the event log, so I'm sure it won't be a problem to call it in the scheduler.

My guess is that with threads003 we're just pushing the size up enough that the GC has to touch another cache line per TSO. If that's the case, then I'm not too worried, since this is a microbenchmark and won't affect real world performance much. But we should verify that that's what's going on. It can't be that we're allocating another stack chunk, because that would show up in the allocations.

Check whether the amount of memory copied by the GC goes up in proportion to the amount you added to the TSO struct. If it is a lot more, then something else is going on. Also try some measurements with perf?

Once this is squashed (or accounted for), we can move on to checking that priorities work right, and looking at the primops and the user API.

comment:21 Changed 15 months ago by ezyang

Copied memory is affected disproportionately:

[ezyang@hs01 threads006]$ ./Main-big 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     853,296,496 bytes copied during GC
     197,453,072 bytes maximum residency (10 sample(s))
      25,701,000 bytes maximum slop
             426 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-medium 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     690,756,096 bytes copied during GC
     125,998,552 bytes maximum residency (9 sample(s))
      24,121,032 bytes maximum slop
             270 MB total memory in use (0 MB lost due to fragmentation)

That's an over 150MB jump.

I think I have an important clue, though; consider these three runs on the big TSO executable:

[ezyang@hs01 threads006]$ ./Main-big 182260 +RTS -s
     204,318,544 bytes allocated in the heap
     622,311,352 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,499,512 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182261 +RTS -s
     204,319,648 bytes allocated in the heap
     622,313,432 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,453,024 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182262 +RTS -s
     204,320,752 bytes allocated in the heap
     791,958,776 bytes copied during GC
     169,695,072 bytes maximum residency (10 sample(s))
      23,503,128 bytes maximum slop
             395 MB total memory in use (0 MB lost due to fragmentation)

Wow! In the case of the 182262nd thread, it's literally the straw that broke the camel's back. I went ahead and logged GC events. The breakpoint occurs here:

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes          Objects                  
----------------------------------------------------------
    0   53762         0       2        0     1184     7008
    1   53762         3   53766       94 196721432 23504104
----------------------------------------------------------
                                         196722616 23511112
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 53766 blocks ( 210.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         :  6332 blocks (  24.7 MB)
  total        : 60228 blocks ( 235.3 MB)
alloc new todo block 0x7f13c7aff000 for gen  0
alloc new todo block 0x7f13c7ca1000 for gen  1

[snip]

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes          Objects                  
----------------------------------------------------------
    0   82858         0       2        0      104     8088
    1   82858         1   45569        3 169695176 16955448
----------------------------------------------------------
                                         169695280 16963536
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 45569 blocks ( 178.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         : 53841 blocks ( 210.3 MB)
  total        : 99540 blocks ( 388.8 MB)


Inside the snip, approximately a hundred megablocks are allocated as TODO space for scavenge, à la

allocated 1 megablock(s) at 0x7f13b5a00000
alloc new todo block 0x7f13b5a04000 for gen  1
increasing limit for 0x7f13b5a04000 to 0x7f13b5a04800

It seems we run out of todo blocks while evacuating, and then essentially need to copy our entire heap. Ouch! Unfortunately, I don't know enough to say if this is a GC bug or not.

comment:22 Changed 15 months ago by simonmar

Correct me if I'm wrong, but in the "slow" case you're just doing one more major GC than in the fast case, right? The extra thread just pushed allocation over the threshold for the next GC. If that's the case, then this is not a problem at all.

comment:23 Changed 15 months ago by ezyang

Oh yes, you're right! (I was looking at the Gen 0 counts; but for Gen 1 it went from 9 to 10.)

comment:24 Changed 15 months ago by ezyang

OK, here's where we stand:

  • The sorted list implementation performs comparably to the original when no priority changes are involved. However, it does not apply sleeper fairness (give threads that were sleeping extra timeslices), and when sleeper fairness is applied, we suffer from performance degradation (since operations on the sorted list are no longer constant time). So if this implementation wants to go anywhere, it needs to be upgraded into something like a skip list, which can efficiently search and remove things inside the structure.
  • I've stamped out some of the performance bugs from the array implementation (mostly by adding an extra queue which we use to handle "must run now!" threads). However, we suffer from a huge bottleneck in deleteMinPQueue(), which according to perf takes up 22% of the runtime. Apparently logarithmically many comparisons/memory writes cost a lot when a lot of threads are involved. I am going to look for something more high performance.

comment:25 Changed 15 months ago by ezyang

Some other interesting results about sleeper fairness: for most benchmarks, it doesn't have much effect, but threads003 displays some interesting behavior (this comparison is before and after sleeper fairness is applied):

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
     threads003          -0.0%     +9.0%     -8.8%     -8.9%    +20.2%

So, changing how we schedule blocked threads (i.e. giving them a priority boost) results in a 10% speedup, but we end up using 20% more memory! A very interesting tradeoff, if you ask me...

comment:26 Changed 15 months ago by ezyang

Erm, sorry, that's not quite right. Here is a better set of numbers comparing without and with sleeper fairness.

    callback001          -0.0%     +0.0%     -8.8%     -8.6%     +0.0%
    callback002          -0.0%     +0.0%     -2.9%     -3.0%     +0.0%
           chan          -0.0%     +0.0%     -0.3%     -0.1%     -0.6%
          sieve          -0.0%     +0.0%     -0.9%     -0.7%     +0.0%
     threads001          -0.0%     +0.0%     -1.7%     -1.8%     +0.0%
     threads003          -0.0%     +0.0%     -0.2%     -0.2%     +3.4%
     threads006          -0.0%     +0.0%     +0.6%     +0.0%     +0.0%
     threads007          -0.0%     +7.1%     -3.1%     -2.8%    -47.0%

comment:27 Changed 15 months ago by ezyang

OK, here's a new progress report.

The red-black tree implementation is now behaving very consistently and with only 1-3% overhead. (Right now, it uses a malloc'd free list of red-black tree nodes; this ended up being quicker than using heap-allocated storage.) Unfortunately, sleeper fairness (with the objective of giving sleepy threads a bit of a boost) makes all of the benchmarks perform worse, in part because sleepy threads increase the overhead of managing processes, since they tend to have different priorities. I don't at the moment have a good way of measuring latency change; see #7659 for a bug related to that. Additionally, the impact of prioritized scheduling is fairly hard to observe on some test programs I've constructed unless you insert explicit yields into your programs. One further downside is that the rb-trees are synthetically constructed to get a performance boost when all threads have the same priority; when threads have different priorities you will immediately see a slowdown, because more rb-tree manipulations become necessary.

The array heap implementation has lower overhead (0-2%) but is far more inconsistent; some benchmarks do much worse. Right now it’s still pretty unclear where all of this overhead is coming from.

The end conclusion is as follows: if red-black trees end up being the dominant implementation, then there is still a chance that we can replace the current scheduler wholesale, since its runtime characteristics are very similar to the current scheduler's. (Sleeper fairness will need to remain a configurable option, assuming we can show that it improves latency.) However, if array heaps end up having much better properties on prioritized workloads (test-cases for which I have not constructed yet), I find it highly implausible that we'll be able to finesse the implementation to work identically to the current scheduler, and we will need to support both. On the plus side, because I have made so many implementations, I have a good sense of how to refactor the API properly.

I’ll attach some patches, but the comments still need some cleaning up.

rbtrees:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.2%     +0.0%     +2.2%     +1.9%     -2.0%
    callback002          +0.2%     +0.0%     +1.7%     +1.6%     +0.0%
           chan          +0.2%     +0.0%     +2.1%     +2.1%     +0.6%
          sieve          +0.2%     +0.0%     +2.5%     +2.6%     +0.0%
     threads001          +0.2%     +0.0%     +1.7%     +1.5%     +0.0%
     threads003          +0.2%     +0.0%     -8.9%     -8.8%    +12.7%
     threads006          +0.2%     +0.0%     +1.6%     +0.8%     -3.2%
     threads007          +0.2%     -0.0%     +2.9%     +3.0%     -0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -0.0%     -8.9%     -8.8%     -3.2%
            Max          +0.2%     +0.0%     +2.9%     +3.0%    +12.7%
 Geometric Mean          +0.2%     +0.0%     +0.7%     +0.5%     +0.9%

heap arrays (using emulated ordering in order to recreate thread conditions)

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +2.1%     +2.0%     -2.0%
    callback002          +0.1%     +0.0%     +1.3%     +1.3%     +0.0%
           chan          +0.1%     +0.0%     +0.5%     +0.4%     +0.0%
          sieve          +0.1%     +0.0%     -0.7%     -0.7%     +0.0%
     threads001          +0.1%     +0.0%     -0.6%     -0.7%     +0.0%
     threads003          +0.1%     +0.0%    +62.0%    +62.4%     -3.2%
     threads006          +0.1%     +0.0%     +6.0%     +5.8%     -3.2%
     threads007          +0.1%     +0.0%     +6.7%     +6.8%     -8.3%
--------------------------------------------------------------------------------
            Min          +0.1%     +0.0%     -0.7%     -0.7%     -8.3%
            Max          +0.1%     +0.0%    +62.0%    +62.4%     +0.0%
 Geometric Mean          +0.1%     +0.0%     +8.2%     +8.2%     -2.1%

Changed 15 months ago by ezyang

Common functionality used by both schedulers

Changed 15 months ago by ezyang

Red-black tree implementation version 27

Changed 15 months ago by ezyang

array heap implementation draft 3 (second epoch)

comment:28 Changed 15 months ago by simonmar

Looking good, I'm still excited about this work. I really hope you'll put some of your benchmarks into nofib/smp - the current set may well be giving us a skewed view of what's important. I definitely agree that we want latency benchmarks too, but I haven't thought about how to construct them.

Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.

comment:29 Changed 15 months ago by ezyang

Yep, I think any new benchmarks will probably be pretty important, maybe even more so than the actual stride scheduling :)

Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.

I've tried it, jerry-rigging StgMutArrPtrs. I concluded we'll need to make a custom closure (ugh!) or make some improvements to how GHC garbage-collects mutable objects (actually, I was planning on filing a bug; here we go: #7662). In the case of mutable arrays, there are some funny characteristics: the write barriers are relatively cheap, but the arrays all get scanned because they are permanently on the mutable list, so we don't end up getting any benefit. (There aren't any other existing mutable objects which can be jerry-rigged here.) Adding things to the mutable list may end up being expensive in the long run.

The current malloc'd red-black implementation is pretty efficient; we maintain a free list and allocate them in blocks. I haven't implemented reclamation yet but even if we leak all of the free nodes it will only consume space up to the high watermark of number of simultaneous threads.

comment:30 Changed 15 months ago by simonmar

Do you have to use mutable objects? Why not do an Okasaki-style immutable red-black tree? In the heap, allocation is very cheap, mutation is not.

comment:31 Changed 15 months ago by ezyang

No we don’t, but all of the C implementations that I can easily crib off of are mutable :)
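
For reference, the insertion half of an Okasaki-style tree is very compact in Haskell. The sketch below is the standard textbook version, not code from any of the attached patches; note that deletion, which a scheduler would also need, is considerably more involved:

data Color = R | B
data RB a = E | T Color (RB a) a (RB a)

insert :: Ord a => a -> RB a -> RB a
insert x t = blacken (ins t)
  where
    ins E = T R E x E
    ins s@(T c l y r)
      | x < y     = balance c (ins l) y r
      | x > y     = balance c l y (ins r)
      | otherwise = s
    blacken (T _ l y r) = T B l y r
    blacken E           = E

-- Okasaki's balance: rotate any red-red violation under a black parent.
balance :: Color -> RB a -> a -> RB a -> RB a
balance B (T R (T R a x b) y c) z d = T R (T B a x b) y (T B c z d)
balance B (T R a x (T R b y c)) z d = T R (T B a x b) y (T B c z d)
balance B a x (T R (T R b y c) z d) = T R (T B a x b) y (T B c z d)
balance B a x (T R b y (T R c z d)) = T R (T B a x b) y (T B c z d)
balance c l x r = T c l x r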

comment:32 Changed 14 months ago by njd

I'd like to add a word of caution to the inclusion of any form of
priority/weighted scheduling into GHC at this time.

Scheduling (unless it discards work) does not, a priori, change the
total amount of work that has to be done, just its order. Such
scheduling strategies don't create computational power, they just
distribute the "disappointment" of not being first...

This would imply that all of the timing changes reported in this
thread are the result of interactions with other resources - different
patterns of memory access, garbage collection etc.

The community already recognises these sorts of performance couplings
and there are several mechanisms (at a higher level) that can be used
to manage the inevitable performance trades.

I strongly believe that adding differential scheduling at this time
will be, overall, counterproductive to GHC's RTS development. It will
create a new class of performance artefacts that will require
resources to investigate. Time will be spent pushing trades around,
chasing the proverbial bubble under the wallpaper around and around.

Please don't get me wrong, I am highly appreciative of Edward's
efforts. The more overhead we can reduce in the scheduling, the
better.

If we want to start looking at more complex differential treatment of
the allocation of supply (cpu cycles) to demand (task steps) then we
need to consider the issues of how that trade interacts with the
others. And how the overall trading space is going to manifest itself
in terms of end-user goals. We also need to come up with a sensible
approach for how to discard computational demand (load shedding) when
the "system" becomes "excessively" infeasible to schedule to meet those
goals. Fortunately there are approaches to this (stuff I've been
involved with in data networks for many years) - and Haskell's
operational model has all the right hooks to do this. Interested PhD
students, do get in touch....

comment:33 Changed 14 months ago by ezyang

OK, after a brief break, I've started working on latency benchmarks to help evaluate how much alternative scheduling algorithms help. Of course, the first order of business was to establish some baselines, so I ran some tests on GHC 7.4 first, as well as a bare-bones C version of the application. The tests are here https://github.com/ezyang/latency ; eventually I'll integrate them with nofib (although it's not obvious how to integrate them into the current nofib reporting framework). The basic method is to write a byte onto a Unix domain socket; I'm sure I've botched some instrumentation somewhere so some eyeballs here would be useful. I can also make pretty graphs if people are interested.
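
To give a rough idea of the shape of the measurement, here is a sketch of the approach described above, bouncing a single byte over a Unix domain socket pair and reporting round-trip percentiles. This is not the code in the repository, just an illustration:

import qualified Data.ByteString as BS
import Network.Socket (socketPair, Family(AF_UNIX), SocketType(Stream), defaultProtocol)
import Network.Socket.ByteString (recv, sendAll)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import Control.Concurrent (forkIO)
import Control.Monad (forever, replicateM)
import Data.List (sort)

main :: IO ()
main = do
  (client, server) <- socketPair AF_UNIX Stream defaultProtocol
  -- Echo thread: bounce every byte straight back.
  _ <- forkIO . forever $ recv server 1 >>= sendAll server
  samples <- replicateM 10000 $ do
    t0 <- getCurrentTime
    sendAll client (BS.singleton 42)
    _  <- recv client 1
    t1 <- getCurrentTime
    return (realToFrac (diffUTCTime t1 t0) :: Double)
  let sorted = sort samples
      pct p  = sorted !! min (length sorted - 1)
                            (floor (p * fromIntegral (length sorted)))
  -- Print the same percentiles reported in the tables below.
  mapM_ (print . pct) [0, 0.5, 0.9, 0.99, 0.999, 0.9999]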

System info:

[ezyang@hs01 latency]$ uname -a
Linux hs01.scs.stanford.edu 3.5.3-1-ARCH #1 SMP PREEMPT Sun Aug 26 09:14:51 CEST 2012 x86_64 GNU/Linux

GHC 7.4.2 (as installed by Arch Linux)

      0%      50%      90%      99%    99.9%   99.99% 
0.000016 0.000021 0.000064 0.000096 0.000157 0.000175 

GHC 7.4.2 threaded -N1

      0%      50%      90%      99%    99.9%   99.99% 
0.000048 0.000096 0.000191 0.000221 0.000260 0.000301

GHC 7.4.2 threaded -N2

      0%      50%      90%      99%    99.9%   99.99% 
0.000054 0.000208 0.000264 0.000307 0.000338 0.000370

GHC 7.7.20130122 (prior to the newest latency IO manager)

      0%      50%      90%      99%    99.9%   99.99% 
0.000015 0.000045 0.000093 0.000117 0.000153 0.000169

GHC 7.7.20130122 -threaded

      0%      50%      90%      99%    99.9%   99.99% 
0.000044 0.000058 0.000177 0.000219 0.000252 0.000281 

GHC 7.7.20130122 -threaded -N2

      0%      50%      90%      99%    99.9%   99.99% 
0.000053 0.000210 0.000258 0.000298 0.000335 0.000376 

I'm building a version of GHC with the latest IO manager stuff and we'll see how that changes things...

comment:34 Changed 14 months ago by ezyang

Aaand the latest numbers! It looks like the new IO manager does a lot better!

GHC 7.7.20130221

      0%      50%      90%      99%    99.9%   99.99% 
0.000009 0.000019 0.000059 0.000103 0.000130 0.000926 

GHC 7.7.20130221 -threaded

      0%      50%      90%      99%    99.9%   99.99% 
0.000026 0.000046 0.000090 0.000167 0.000220 0.001286 

GHC 7.7.20130221 -threaded -N2

      0%      50%      90%      99%    99.9%   99.99% 
0.000026 0.000055 0.000101 0.000179 0.000253 0.000927

comment:35 Changed 14 months ago by simonmar

Something is a bit suspicious about those numbers. Without -threaded the IO manager is not used at all, yet your second set of numbers are quite different from the first even without -threaded. Furthermore, the second set of numbers seem to have a lot more outliers (the 99.99% slot is much higher).

(BTW I think this is great stuff nonetheless).

comment:36 Changed 14 months ago by ezyang

Hmm, it looks like the GHC 7.7.20130122 numbers are nonsense (I used what I thought was a perf build, but it looks like it actually was just a validate build, so libraries/rts may not have been as optimized as they could be. Hmm, I guess I should go re-run the nofib numbers too, since some of those might be suspicious.) The 7.4.2 numbers, being distro built, should still be valid.

Oh, and here are the C numbers, on the same machine. These should roughly correspond directly to the domain socket overhead.

      0%      50%      90%      99%    99.9%   99.99% 
0.000006 0.000014 0.000017 0.000028 0.000062 0.000259 

comment:37 Changed 14 months ago by ezyang

When I attempt to replicate the conditions of #3838 (by removing the thread bump), I do not see any appreciable increase in runtime. This occurs even if I explicitly yield while inside the atomicModifyIORef (using unsafePerformIO), or I increase the number of threads being run. My conclusion is that a small pileup does occur, but since there's no longer a global list we keep traversing it's not expensive. But I can't rule out (1) the test-case having enough incidental other changes that have made it stop tickling the bug, or (2) some other RTS change masking over this particular problem. A conundrum!

comment:38 Changed 14 months ago by simonmar

Edward: if I recall correctly, the thread bump was only a minor optimisation, the real fix for #3838 was the redesign of black hole blocking so that we didn't have a global list. So that would agree with your findings.

I don't have a lot of evidence that bumping the blocker is a good idea. I was hoping you might find some :-)

comment:39 Changed 14 months ago by ezyang

Well, bumping the blocker does seem to help a little bit. Here's what happens when I remove the bump:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          -0.0%     +0.0%     +5.6%     +5.0%     +0.0%
    callback002          -0.0%     +0.0%     +0.3%     +0.9%     +0.0%
           chan          -0.0%     +0.0%     +1.4%     +1.3%     +0.0%
          sieve          -0.0%     +0.0%     +4.1%     +5.1%     +4.2%
     threads001          -0.0%     +0.0%     -1.8%     -1.3%     +0.0%
     threads003          -0.0%     -0.0%     +0.4%     +0.7%     +0.0%
     threads006          -0.0%     +0.0%     +0.0%     +0.6%     +0.0%
     threads007          -0.0%     +0.0%     -1.3%     -0.7%     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.0%     -1.8%     -1.3%     +0.0%
            Max          -0.0%     +0.0%     +5.6%     +5.1%     +4.2%
 Geometric Mean          -0.0%     -0.0%     +1.0%     +1.4%     +0.5%

comment:40 Changed 12 months ago by nicolast

  • Cc ikke+ghc@… added