If anyone needs a different size, there is the "-A" RTS option. But machines are very different, and since 8.2 this size has defaulted to 1MB on all architectures and hardware, no matter what.
In most cases, machines with larger caches have more RAM as well, and vice versa, so this would benefit both small and large machines. Keeping short-lived objects in cache is the most efficient approach in most cases. Most modern workstation and server machines have an L3 cache as well, which is why I'm asking for the "largest cache size".
A second idea would be to have two short-lived generations on machines with second- and third-level caches, sized to match each.
For NUMA machines with non-uniform caches (like some unusual, uncommon ARM designs), a workable solution could be to size the first generation to the largest cache of the smallest core. That will not be optimal, but it will be close.
Auto-sizing the allocation area sounds like a reasonable idea to me. A portable solution is perhaps tricky, but I doubt a solution that works on the major operating systems is far from reach. Perhaps you want to have a look?
Do we mean the nursery or the Gen1 heap after the nursery?
I'd imagine we want the nursery to fit in the L1 or L2 caches (where applicable) and the Gen1 heap to fit in whatever is left of L3 after we account for the nurseries?
Perhaps
size of nursery = size of L2 cache per cpu core
size of gen1 >= max(#capabilities * size of nursery, size of L3 cache in socket - (#capabilities * size of nursery))
We definitely (at least on many-core systems) do *not* want nurseries on the same socket thrashing each other's caches (i.e. under heavy allocation workloads)?
Stephen M Blackburn, Perry Cheng, Kathryn S McKinley - Myths and Realities: The Performance Impact of Garbage Collection p. 10
4.5 Sizing the nursery
"Figure 4(a) shows a small improvement with larger nurseries in mutator performance due to fewer L2 (Figure 4(e)) and TLB misses (Figure 4(f)). However, the difference in GC time dominates: smaller nurseries demand more frequent collection and thus a substantially higher load. We measured the fixed overhead of each collection <...> The garbage collection cost tapers off between 4MB and 8MB as the fixed collection costs become insignificant. These results debunk the myth that the nursery size should be matched to the L2 cache size (512KB on all three architectures)."
For the nursery maybe there is no such effect, but see a benchmark of a small raytracer (https://bitbucket.org/varosi/cgraytrace/overview) that does a lot of allocation, run on two different machines:
There is a clear difference between allocation-area sizes. Once the size grows to several times the memory actually needed, the GC barely runs at all, and the benefit of a larger allocation area returns.
I have a branch that works for me on Windows 10 and Linux using an i7-4790K CPU.
It would be great if y'all could test this on:
macOS / OS X
FreeBSD
ARM and other non-x86 architectures
Intel Haswell and Broadwell CPUs with L4 cache
To ensure that the code works as intended, run a "Hello world" program with +RTS -s and check that the report shows (N+1) MB total memory in use, where N MB is the size of your largest cache.
PRs to support other operating systems are also very welcome! :)
Open design questions as of now:
If we only find an L1 cache, should we really go with an allocation area of typically just 32 or 64 kB?
IMHO it might be better to ignore any L1 caches and to simply default to the old 1 MB in these cases.
What if we find an L4 cache with 64 or 128 **MB**? This would be easier to decide if we had some benchmark results, for example in the style of [ticket:13362#comment:148722 varosi's].
One thing your code seems to miss: it doesn't account for different cache sizes on different CPUs or NUMA nodes in the system. It is technically possible to have multiple CPUs with different cache sizes; not accounting for that might lead to various kinds of shenanigans, as in Mono.
Great! Is it possible to share your Windows executable so I could experiment on a few machines from a few cores up to close to hundred?
You can download a binary distribution here. It's not an optimized build though, so at least building with it should be slower than with official releases.
Regarding running on Windows machines with close to a hundred cores, the current implementation will only detect caches within its current processor group of at most 64 logical processors (see "Remarks" here). As long as there aren't any larger caches outside of the processor group it will still set the allocation area to the correct size.
@klapaucius, "Multicore Garbage Collection with Local Heaps" by Simon Marlow and Simon Peyton Jones states in chapter 6.1.1:
"Nevertheless, we do find that on average there is a local minimum around 1MB on **this hardware**. ... **staying within the cache** becomes more beneficial as contention for main memory increases."
How can I experiment with a non-optimized version if I don't have a reference for comparison? I could try to build an optimized version.
Building your own binary should be pretty straightforward. In order to investigate the effects of my patch you don't really need a different build anyway: you can simply find out what size your L3 cache is and pass that size to the -A RTS flag.