Opened 5 months ago

Last modified 5 months ago

#8733 new bug

I/O manager causes unnecessary syscalls in send/recv loops

Reported by: tibbe Owned by:
Priority: normal Milestone: 7.10.1
Component: Runtime System Version: 7.6.3
Keywords: Cc: hvr, simonmar, andreas.voellmy@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by tibbe)

Network applications often call send followed by recv, to send a message and then read an answer. This causes syscall traces like this one:

recvfrom(13, )                -- Haskell thread A
sendto(13, )                  -- Haskell thread A
recvfrom(13, ) = -1 EAGAIN    -- Haskell thread A
epoll_ctl(3, )                -- Haskell thread A (a job for the IO manager)
recvfrom(14, )                -- Haskell thread B
sendto(14, )                  -- Haskell thread B
recvfrom(14, ) = -1 EAGAIN    -- Haskell thread B
epoll_ctl(3, )                -- Haskell thread B (a job for the IO manager)

The recvfrom call that follows the sendto always fails, as the response from the peer we're communicating with won't be available right after we send the request.

We ought to consider descheduling the thread as soon as sending is "done". The hard part is to figure out when that is.
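
For concreteness, the trace above comes from loops of roughly this shape (a minimal sketch using the network package's Network.Socket.ByteString API; the function name and buffer size are illustrative):

import qualified Data.ByteString as B
import Network.Socket (Socket)
import Network.Socket.ByteString (recv, sendAll)

-- A typical request/response server loop. After sendAll the loop goes
-- straight back to recv, but the peer's next request is almost never
-- there yet, so the recvfrom returns EAGAIN and the thread pays an extra
-- epoll_ctl round trip through the IO manager on every iteration.
serverLoop :: Socket -> (B.ByteString -> IO B.ByteString) -> IO ()
serverLoop sock handle = do
  request <- recv sock 4096        -- recvfrom(fd, ...)
  response <- handle request
  sendAll sock response            -- sendto(fd, ...)
  serverLoop sock handle           -- next recvfrom(fd, ...) = -1 EAGAIN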

See http://www.yesodweb.com/blog/2014/02/new-warp for a real world example.

Change History (19)

comment:1 Changed 5 months ago by tibbe

  • Description modified (diff)

comment:2 Changed 5 months ago by carter

Could it be done by having the end of the send call always yield? Well... I guess depending on what's being sent, there may be multiple sends on the application side, right?

Last edited 5 months ago by carter (previous) (diff)

comment:3 Changed 5 months ago by tibbe

@carter it could, but then we might end up yielding before all the content has been sent (in case it's spread over several sends). We could think about having something similar to TCP_CORK to have the client communicate when it's ready to yield. On the other hand, if the client has to deal with this issue she might as well yield manually.

comment:4 Changed 5 months ago by carter

Yeah, you're right. In that case, might it make sense to have clear breadcrumbs for this technique in network or the like?
E.g. (a strawman, mind you):

withYield :: IO a -> IO ()
withYield m = m >> yield

This is probably the most wrong way possible, and I don't mean it as a serious suggestion; merely that making yield more visible in the API surface might be a "social" approach to mitigating things. E.g. it's always safe to add yields to code, but as you say, it's not a good idea to just sprinkle them around.

comment:5 follow-up: Changed 5 months ago by etrepum

Is the IO manager the right layer to be approaching this from? Not all protocols are going to be strictly request/response with no pipelining or multiplexing like vanilla HTTP is.

Why not implement this at a higher level in a library that encapsulates these best practices for various sorts of protocols?

A good start may be to add this to the Performance Resource: http://www.haskell.org/haskellwiki/Performance (which looks old, but that's what came up when I searched). Erlang publishes an Efficiency Guide that is nice (but it doesn't cover any networking topics): http://erlang.org/doc/efficiency_guide/users_guide.html

comment:6 Changed 5 months ago by carter

yeah, that might be a good approach.

comment:7 Changed 5 months ago by hvr

  • Cc hvr added
  • Milestone set to 7.10.1
  • Type of failure changed from None/Unknown to Runtime performance bug

comment:8 Changed 5 months ago by schyler

If I understand correctly, this can be implemented as a state machine. Each thread can have a "last IO action" state which gets reset when it yields and set to some constant on send/recv. Then, if you call recv and it's still in "send mode", yield automatically (and yield would set it back to NULL).

This way if you do heaps of sends in a row you see no regression.

Edit: I realised this is fd-specific rather than thread-specific. Maybe there needs to be a "last fd" state also, so you can alternate fds without accidentally triggering the yield.
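
A user-level sketch of that bookkeeping might look like the following (the names here are hypothetical; the idea in this ticket is to do this inside the RTS/IO manager rather than in library code):

import Control.Concurrent (yield)
import Data.IORef (IORef, readIORef, writeIORef)
import System.Posix.Types (Fd)

-- Hypothetical "last IO action" state, tracked per fd as suggested above.
data LastOp = NoOp | Sent Fd | Received Fd

-- Wrap a send-like action: just record that we sent on this fd.
sendTracked :: IORef LastOp -> Fd -> IO a -> IO a
sendTracked lastOp fd doSend = do
  r <- doSend
  writeIORef lastOp (Sent fd)
  return r

-- Wrap a recv-like action: if the previous operation was a send on the
-- same fd, yield first so the peer's reply has a chance to arrive before
-- we issue the (likely EAGAIN) recvfrom. Repeated sends, or operations on
-- a different fd, never trigger the yield.
recvTracked :: IORef LastOp -> Fd -> IO a -> IO a
recvTracked lastOp fd doRecv = do
  prev <- readIORef lastOp
  case prev of
    Sent fd' | fd' == fd -> yield
    _                    -> return ()
  writeIORef lastOp (Received fd)
  doRecv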

Version 2, edited 5 months ago by schyler (previous) (next) (diff)

comment:9 in reply to: ↑ 5 Changed 5 months ago by tibbe

Replying to etrepum:

Is the IO manager the right layer to be approaching this from? Not all protocols are going to be strictly request/response with no pipelining or multiplexing like vanilla HTTP is.

Why not implement this at a higher level in a library that encapsulates these best practices for various sorts of protocols?

Good question. I considered this before filing the bug. One argument for why this should work without user interaction, in addition to the standard "it's nice if it just works" argument, is that if we didn't use the I/O manager but instead normal blocking syscalls, we'd end up with better behavior in this case.

A good start may be to add this to the Performance Resource: http://www.haskell.org/haskellwiki/Performance (which looks old, but that's what came up when I searched). Erlang publishes an Efficiency Guide that is nice (but it doesn't cover any networking topics): http://erlang.org/doc/efficiency_guide/users_guide.html

I'm planning to add a note to the network package's docs when I find the time.

comment:10 follow-up: Changed 5 months ago by simonmar

  • Owner simonmar deleted

Hmm. Isn't it application-specific knowledge that the recv() is going to block? I'm not sure what it is you're proposing the IO manager should do. Could you elaborate?

One argument for why this should work without user interaction, in addition to the standard "it's nice if it just works" argument, is that if we didn't use the I/O manager but instead normal blocking syscalls, we'd end up with better behavior in this case.

If you used blocking syscalls you'd end up with much worse scaling, that's why we have the IO manager, no?

comment:11 in reply to: ↑ 10 Changed 5 months ago by tibbe

Replying to simonmar:

If you used blocking syscalls you'd end up with much worse scaling, that's why we have the IO manager, no?

Sure. I guess I'm saying that the I/O manager makes some things better and some things worse. This bug is a bit nebulous. I'm not sure there's a good solution without involving the user. However, we should at least think about it as this will come up in more or less every server we write.

comment:12 follow-up: Changed 5 months ago by AndreasVoellmy

I would be interested to know what the source of the overhead is when making the recvfrom call that fails. I.e. is it due to having more syscalls, or to invoking the IO manager (Haskell-land) code to register the callback (which requires taking an MVar, etc.)? If it is the syscall, then we would have to do something like you describe, where you avoid making the recvfrom call right after the send. If, OTOH, it is IO manager overhead, then we may be able to fix it by streamlining the IO manager. E.g. we could try to avoid taking any MVars to register callbacks. We've also talked about integrating the IO manager more closely with the threaded RTS and writing it in C. I think we could easily reduce the overhead if we go down that route.

comment:13 Changed 5 months ago by AndreasVoellmy

  • Cc andreas.voellmy@… added

comment:14 in reply to: ↑ 12 ; follow-up: Changed 5 months ago by tibbe

Replying to AndreasVoellmy:

I would be interested to know what the source of the overhead is when making the recvfrom call that fails. I.e. is it due to having more syscalls, or to invoking the IO manager (Haskell-land) code to register the callback (which requires taking an MVar, etc.)?

Good question. I have a theory that the MVar overheads might be hurting us and integrating the I/O manager with the scheduler could allow us to get around that.

comment:15 in reply to: ↑ 14 Changed 5 months ago by hvr

Replying to tibbe:

Good question. I have a theory that the MVar overheads might be hurting us and integrating the I/O manager with the scheduler could allow us to get around that.

Just wondering, would integration with the scheduler simplify implementing (soft) timeouts, like e.g. providing

threadWaitReadWithTimeout, threadWaitWriteWithTimeout :: Fd -> Int -> IO Bool

directly in the I/O manager?

comment:16 follow-up: Changed 5 months ago by AndreasVoellmy

When you say simplify, what are you comparing to? Is there some implementation of those functions that you are thinking of?

Maybe you mean that those functions would be implemented in a way similar to the timeout :: Int -> IO a -> IO (Maybe a) function in System.Timeout. I haven't measured the performance of timeout, but I imagine it isn't very good, since it forks a new thread in each call. We do have a couple of new functions in 7.8 that might allow us to provide a more efficient implementation than something like timeout. In 7.8, we have these functions:

threadWaitReadSTM :: Fd -> IO (STM (), IO ())
threadWaitWriteSTM :: Fd -> IO (STM (), IO ())

They return an STM action that completes whenever the given file descriptor is ready to read (or write). Then we can use it in combination with registerDelay :: Int -> IO (TVar Bool) to wait on both the file descriptor and the timer without spawning a new thread. It would be something like this:

import Control.Concurrent (threadWaitReadSTM)
import Control.Concurrent.STM
  (atomically, check, orElse, readTVar, registerDelay)
import System.Posix.Types (Fd)

-- True if the fd became readable, False if the timeout expired first.
threadWaitReadWithTimeout :: Fd -> Int -> IO Bool
threadWaitReadWithTimeout fd n = do
  (ready, _) <- threadWaitReadSTM fd
  alarm      <- registerDelay n
  atomically $ (ready >> return True)
      `orElse` (readTVar alarm >>= check >> return False)
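
The write-side variant proposed in comment 15 would follow the same shape, just swapping in threadWaitWriteSTM (a sketch assuming the same imports as above, with threadWaitWriteSTM also coming from Control.Concurrent):

-- Same idea for writes: True if the fd became writable, False if the
-- timeout fired first.
threadWaitWriteWithTimeout :: Fd -> Int -> IO Bool
threadWaitWriteWithTimeout fd n = do
  (ready, _) <- threadWaitWriteSTM fd
  alarm      <- registerDelay n
  atomically $ (ready >> return True)
      `orElse` (readTVar alarm >>= check >> return False)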

comment:17 in reply to: ↑ 16 Changed 5 months ago by hvr

Replying to AndreasVoellmy:

When you say simplify, what are you comparing to? Is there some implementation of those functions that you are thinking of?

Your suggestion of using the new threadWait*STM functions already seems to be an improvement over the naive timeout + threadWait{Read,Write} approach.

I was originally thinking of a more low-level approach: since epoll/select and similar OS syscalls take a timeout argument, that argument would be set according to the earliest pending threadWait*WithTimeout call when the need arises to actually call epoll to wait for new I/O events. Or does this happen implicitly anyway when using registerDelay, which goes through the TimeManager?

comment:18 Changed 5 months ago by AndreasVoellmy

The approach you mention is actually what the pre-7.8 IO manager did, although it didn't have threadWait{Read,Write}WithTimeout methods, so you would have to code this via the lower-level interface in GHC.Event.

In the 7.8 IO manager, the approach you describe won't quite work. The 7.8 manager uses a separate poll instance to handle timeouts (much as you described) and uses epoll instances (one per HEC) to monitor files. The reason that we don't handle timeouts in the epoll instances anymore is that the 7.8 manager uses non-blocking calls to epoll, and it sometimes yields and moves to the back of the run queue of its HEC. This often reduces the number of blocking foreign calls. But it means that epoll may not be called in a timely fashion, because it may wait for a bunch of busy Haskell threads on its HEC to finish. For that reason, we keep a separate Haskell thread monitoring a poll instance just for timeouts.

In contrast, in the pre-7.8 IO manager, the IO manager thread always did a blocking epoll call, so it could use the earliest timeout as the timeout for the epoll call. It would then have to wait for at most one Haskell thread to finish after the foreign epoll call returns in order to grab the HEC and dispatch callbacks.

Tighter integration of the IO manager and the scheduler might help here. With that integration, we may not need the trick of sending the IO manager to the end of the run queue. Then we could probably match the latency characteristics of the pre-7.8 manager (waiting for at most one Haskell thread before firing callbacks) and the throughput of the 7.8 manager.

comment:19 Changed 5 months ago by tibbe

In contrast, in the pre-7.8 IO manager, the IO manager thread always did a blocking epoll call, so it could use the earliest timeout as the timeout for the epoll call. It would then have to wait for at most one Haskell thread to finish after the foreign epoll call returns in order to grab the HEC and dispatch callbacks.

We don't guarantee to wake the thread up exactly after N milliseconds, so unless the new I/O manager seriously delays wake-ups, I'd prefer it if we could skip having a completely separate timeout manager.
