If anyone needs a different size, there is the "-A" RTS option. But machines are very different, and since 8.2 this size has defaulted to 1MB on all architectures and hardware, no matter what.
In most cases, machines with larger caches have more RAM as well, and vice versa, so this would benefit both small and large machines. Keeping short-lived objects in cache is the most efficient approach in most cases. Most modern workstation and server machines have an L3 cache as well, which is why I'm asking for the "largest cache size".
A second idea would be to have two short-lived generations on machines with second- and third-level caches, sized to match each.
For NUMA machines with non-uniform caches (like some unusual, uncommon ARM designs), a workable solution could be to size the first generation to the largest cache of the smallest core. That will not be optimal, but it will be close.
Auto-sizing the allocation area sounds like a reasonable idea to me. A portable solution is perhaps tricky, but I doubt a solution that works on the major operating systems is far from reach. Perhaps you want to have a look?
Do we mean the nursery or the Gen1 heap after the nursery?
I'd imagine we want the nursery to fit in the L1 or L2 caches (where applicable) and the Gen1 heap to fit in whatever is left of L3 after we account for the nurseries?
Perhaps
size of nursery = size of L2 cache per cpu core
size of gen1 >= max(#capabilities * size of nursery, size of L3 cache in socket - (#capabilities * size of nursery))
We definitely (at least on many-core systems) do *not* want nurseries on the same socket thrashing each other's caches (i.e. under heavy allocation workloads)?
Stephen M Blackburn, Perry Cheng, Kathryn S McKinley - Myths and Realities: The Performance Impact of Garbage Collection p. 10
4.5 Sizing the nursery
"Figure 4(a) shows a small improvement with larger nurseries in mutator performance due to fewer L2 (Figure 4(e)) and TLB misses (Figure 4(f)). However, the difference in GC time dominates: smaller nurseries demand more frequent collection and thus a substantially higher load. We measured the fixed overhead of each collection <...> The garbage collection cost tapers off between 4MB and 8MB as the fixed collection costs become insignificant. These results debunk the myth that the nursery size should be matched to the L2 cache size (512KB on all three architectures)."
For the nursery maybe there is no such effect, but see a benchmark of a small raytracer (https://bitbucket.org/varosi/cgraytrace/overview) that does a lot of allocation, run on two different machines:
There is a clear difference between allocation-area sizes. Once the size grows to several times the memory actually needed, the GC barely runs at all, and the benefit of a larger allocation area returns.
I have a branch that works for me on Windows 10 and Linux using an i7-4790K CPU.
It would be great if y'all could test this on:
macOS / OS X
FreeBSD
ARM and other non-x86 architectures
Intel Haswell and Broadwell CPUs with L4 cache
To ensure that the code works as intended, run a "Hello world" program with +RTS -s and check that the report shows (N+1) MB total memory in use, where N MB is the size of your largest cache.
PRs to support other operating systems are also very welcome! :)
Open design questions as of now:
If we only find an L1 cache, should we really go with an allocation area of typically just 32 or 64 kB?
IMHO it might be better to ignore any L1 caches and to simply default to the old 1 MB in these cases.
What if we find an L4 cache with 64 or 128 **MB**? This would be easier to decide if we had some benchmark results, for example in the style of [ticket:13362#comment:148722 varosi's].
One thing your code seems to miss: it doesn't account for different cache sizes on different CPUs or NUMA nodes in the system. It is technically possible to have multiple CPUs with different cache sizes; not accounting for that might lead to various kinds of shenanigans, as in Mono.
Great! Is it possible to share your Windows executable so I could experiment on a few machines from a few cores up to close to hundred?
You can download a binary distribution here. It's not an optimized build though, so at least building with it should be slower than with official releases.
Regarding running on Windows machines with close to a hundred cores, the current implementation will only detect caches within its current processor group of at most 64 logical processors (see "Remarks" here). As long as there aren't any larger caches outside of the processor group it will still set the allocation area to the correct size.
@klapaucius, "Multicore Garbage Collection with Local Heaps" by Simon Marlow and Simon Peyton Jones states in chapter 6.1.1:
"Nevertheless, we do find that on average there is a local minimum around 1MB on **this hardware**. ... **staying within the cache** becomes more beneficial as contention for main memory increases."
How can I experiment with a non-optimized version if I don't have a reference for comparison? I could try to build an optimized version.
Building your own binary should be pretty straightforward. In order to investigate the effects of my patch you don't really need a different build anyway: you can simply find out what size your L3 cache is and pass that size to the -A RTS flag.