Opened 2 months ago

Last modified 2 months ago

#8732 new bug

Global big object heap allocator lock causes contention

Reported by: tibbe Owned by: simonmar
Priority: normal Milestone: 7.10.1
Component: Runtime System Version: 7.6.3
Keywords: Cc: hvr, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by tibbe)

The lock that allocate() takes when allocating big objects hurts the scalability of I/O-bound applications. Network.Socket.ByteString.recv is typically called with a buffer size of 4096 bytes, which causes a ByteString of that size to be allocated. That size is large enough for the ByteString to be allocated from the big object space, which causes contention on the global lock that guards that space.

See http://www.yesodweb.com/blog/2014/02/new-warp for a real world example.
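To make the description concrete, here is a minimal sketch of the threshold arithmetic involved. The constants mirror the common defaults in GHC's includes/rts/storage/Block.h; the exact values are configuration-dependent, so treat them as assumptions, and is_large_object is an invented helper, not an RTS function.

```c
#include <stdint.h>

/* Assumed defaults, mirroring GHC's includes/rts/storage/Block.h. */
#define BLOCK_SIZE 4096u
#define LARGE_OBJECT_THRESHOLD ((uint32_t)(BLOCK_SIZE * 8 / 10))

/* An object is "large" (allocated from the big object space, under the
 * global allocator lock) once it exceeds the large-object threshold. */
static inline int is_large_object(uint32_t bytes) {
    /* With these defaults the threshold is 3276 bytes, so a 4096-byte
     * recv buffer always takes the locked big-object path. */
    return bytes > LARGE_OBJECT_THRESHOLD;
}
```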

Change History (10)

comment:1 Changed 2 months ago by tibbe

  • Description modified (diff)

comment:2 Changed 2 months ago by hvr

  • Cc hvr added
  • Milestone set to 7.10.1
  • Type of failure changed from None/Unknown to Runtime performance bug

comment:3 Changed 2 months ago by ezyang

It is a good thing that these blocks are considered big blocks, since we don't really want to be copying the buffers around. So one thought might be to make the large block list in generation-0 per-thread, and perform allocations from a thread-local block list. But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.
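The per-thread segregated free lists ezyang suggests could look roughly like the following sketch. All names are invented for illustration (this is not the RTS's actual API), and the locked global allocator is simulated with malloc; the point is that single-block and small multi-block objects are recycled on a per-capability list without taking any lock, while larger contiguous requests fall through to the locked path.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_CLASS 4   /* keep local lists for 1..4-block objects */

typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

typedef struct Capability {
    FreeBlock *free_list[MAX_CLASS + 1]; /* index = size in blocks */
} Capability;

/* Stand-in for the real global block allocator behind the lock. */
static void *global_alloc_locked(size_t nblocks) {
    return malloc(nblocks * 4096);
}

void *cap_alloc_blocks(Capability *cap, size_t nblocks) {
    if (nblocks <= MAX_CLASS && cap->free_list[nblocks] != NULL) {
        /* Lock-free: this list belongs to one capability only. */
        FreeBlock *b = cap->free_list[nblocks];
        cap->free_list[nblocks] = b->next;
        return b;
    }
    return global_alloc_locked(nblocks);  /* locked fallback */
}

void cap_free_blocks(Capability *cap, void *p, size_t nblocks) {
    if (nblocks <= MAX_CLASS) {
        FreeBlock *b = p;
        b->next = cap->free_list[nblocks];   /* recycle locally */
        cap->free_list[nblocks] = b;
    } else {
        free(p);   /* stand-in for returning to the global pool */
    }
}
```

A freed single-block object is immediately reusable by the same capability without touching the global lock, which is exactly the recv-buffer churn pattern in the description.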

comment:4 Changed 2 months ago by tibbe

But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.

I think malloc already does that, so we could copy whatever they do perhaps.

comment:5 follow-up: Changed 2 months ago by ezyang

It's pretty standard, yes (we implement it for handling the global block pool), but it does mean all of that code would have to be made thread-local.

comment:6 in reply to: ↑ 5 Changed 2 months ago by tibbe

Replying to ezyang:

It's pretty standard, yes (we implement it for handling the global block pool), but it does mean all of that code would have to be made thread-local.

I guess that means even worse performance problems on OS X? Even if it does, it sounds like the right thing to do.

comment:7 Changed 2 months ago by carter

@tibbe, because TLS is slow on OS X currently? (Mind you, my understanding is that the other RTS issues go away when building GHC with a REAL GCC, right? I take it that's not the case for this discussion?)

comment:8 Changed 2 months ago by ezyang

In this case, slowness of TLS is not an issue, because we manually pass around pointers to structs which are known to be per-capability (and can be accessed in an unsynchronized way), so you don't actually need thread-local *state*.
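The pattern ezyang describes can be sketched as follows. Instead of a __thread variable (slow on some platforms), every entry point threads an explicit pointer to its per-capability state, which only the owning OS thread touches, so accesses need no synchronization. The names and fields are illustrative, not GHC's actual signatures.

```c
#include <stddef.h>

/* Per-capability state: owned by one OS thread at a time, so it can
 * be read and written without locks or atomics. */
typedef struct Capability {
    size_t bytes_allocated;
} Capability;

/* The capability is an ordinary argument, not thread-local storage,
 * so the speed of TLS on a given platform is irrelevant. */
void note_allocation(Capability *cap, size_t bytes) {
    cap->bytes_allocated += bytes;   /* unsynchronized: cap is owned */
}
```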

comment:9 Changed 2 months ago by simonmar

I don't really understand why, in mighty, he couldn't just re-use the same block.

I'm kind of surprised that this is a bottleneck, and I think it needs more investigation. We only take the lock for large objects, so typically there's going to be a lot of computation going on per allocation.

I suppose if it really is a problem then we could just have a per-thread block pool at the granularity of a megablock to avoid fragmentation issues. We just push the global lock back to the megablock free list. This has the danger that we might have a lot of free blocks owned by one thread that don't get used, though, so we might want to redistribute the free blocks at GC. Things start to get annoyingly complicated.

comment:10 Changed 2 months ago by simonmar

It's even harder than that, because a block can be allocated by one thread and freed by another thread, so we lose block coalescing, even if it can be made to work safely.

So I suggest if we want to do anything at all here, we just do the really simple thing: we allocate a chunk of contiguous memory, keep it in the capability, and use that to satisfy large block requests if it's large enough.
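The "really simple thing" proposed above might look roughly like this sketch: each capability holds one contiguous chunk and bump-allocates large block requests from it, taking the global lock only to refill the chunk. All names and sizes are illustrative assumptions (the locked global path is simulated with malloc), not the actual RTS implementation.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_BYTES (1024 * 1024)   /* e.g. one megablock */

typedef struct Capability {
    char  *chunk;       /* current chunk, or NULL before first refill */
    size_t remaining;   /* bytes left in the chunk */
} Capability;

/* Stand-in for the locked global allocator. */
static char *global_alloc_locked(size_t bytes) {
    return malloc(bytes);
}

void *large_alloc(Capability *cap, size_t bytes) {
    if (bytes > CHUNK_BYTES)
        return global_alloc_locked(bytes);   /* huge: locked path */
    if (bytes > cap->remaining) {
        /* Refill: this is the only place the global lock is taken. */
        cap->chunk = global_alloc_locked(CHUNK_BYTES);
        cap->remaining = CHUNK_BYTES;
    }
    void *p = cap->chunk;                    /* bump-allocate, lock-free */
    cap->chunk     += bytes;
    cap->remaining -= bytes;
    return p;
}
```

The cost is the fragmentation and coalescing issues noted above: memory left in a capability's chunk is invisible to other capabilities until it is somehow returned, e.g. at GC.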

Note: See TracTickets for help on using tickets.