threadDelay can wait forever, next time on January 22, 2009, around 20:00 UTC.

We(*) have found a serious bug in the non-threaded RTS on 32-bit *nix systems. Calling threadDelay at certain clock-time intervals can make the thread wait forever. This happens regularly, approximately every 49 days 17 hours, or, more accurately, every (2^32 / 1000) seconds. You can reproduce the bug with a simple program, if you set your machine's clock to a specially crafted time.

In this description I assume a 32-bit *nix system, where the type unsigned long has 32 bits.

If you want to see this happen, here is an instruction:

Test.hs:

import Control.Concurrent

main = threadDelay 5000000 -- 5 secs

Running on Linux, as root:

# date --set 'Thu Jan 22 20:20:10 UTC 2009' +%s; strace ./Delay
...
gettimeofday({1232655610, 12122}, NULL) = 0
gettimeofday({1232655610, 12171}, NULL) = 0
gettimeofday({1232655610, 12195}, NULL) = 0
select(0, [], [], NULL, {5, 20000})     = 0 (Timeout)
gettimeofday({1232655615, 32125}, NULL) = 0
select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
gettimeofday({1232655615, 35777}, NULL) = 0
select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
gettimeofday({1232655615, 40039}, NULL) = 0
select(0, [], [], NULL, {4294, 951296}

The runtime slept for more than 5 seconds, but it didn't wake up the thread, but instead started waiting for an evil looking amount of time.

I've managed to find the exact location of the bug, in function getourtimeofday() in rts/posix/Itimer.c. Other parts of non-threaded RTS, and especially the function awaitEvent(rtsBool) in rts/posix/Select.c, assume that the results of getourtimeofday() use the full range of values in lnat. But this isn't so. Let's take a look at the current version of this function:

lnat
getourtimeofday(void)
{
  struct timeval tv;
  nat interval;
  interval = RtsFlags.MiscFlags.tickInterval;
  if (interval == 0) { interval = 50; }
  gettimeofday(&tv, (struct timezone *) NULL);
  	// cast to lnat because nat may be 64 bit when int is only 32 bit
  return ((lnat)tv.tv_sec * 1000 / interval +
	  (lnat)tv.tv_usec / (interval * 1000));
}

Look at the first half of the expression in the return statement:

(lnat)tv.tv_sec * 1000 / interval

Here, tv.tv_sec is first converted to lnat, than multiplied by 1000, and finally divided by interval. lnat is defined as "unsigned long" - on a 32-bit system this is usually a 32-bit type. The result of multiplying by 1000 will have its higher bits lost and it will be in the range [0 .. 2^32 - 1]. Dividing it by interval will shrink the range to [0 .. (2^32 - 1) / interval]. With default value of tickInterval - which is 50 - this will be [0 .. 85899345]. Adding the second part of the expression can only increase it by (999999 / (interval * 1000)), which is equal to 19 with default settings.

So, for default parameters, the result of getourtimeofday() will be in [0 .. 85899364]. Why is it a problem? It's quite simple: When a thread executes threadDelay the runtime calculates the tick at which it should be woken up. In our example, threadDelay is executed at tick 85899266, and the time of wake-up is probably calculated as 85899266 + 5 * (1000 / 50) = 85899366. This is greater then the biggest value getourtimeofday() can return.

At this point I've probably already given too much detail, so I'll stop. I think the fix is quite easy: just perform the problematic part of computation on a wider integral type and cast it to lnat at the end.

There is also an easy workaround: compile your program with -threaded. The threaded runtime does not use this code, and AFAICT, the time is tracked using a 64-bit number of microseconds.

When I was trying to understand the problem, for some time the culprit for me was the "clever trick" mentionad at the top of rts/posix/Select.c. Now I understand it better and think it's innocent.

We have found this bug in GHC 6.6, and it's still there in the HEAD.

(*) "We" means Mariusz Gądarowski, Przemysław Kosiak, Jakub Bogusz and Tomasz Zielonka - employees of Gemius SA in Poland. Mariusz and Przemek encountered this bug in their haskell system for distributing computations on a cluster. Mariusz and Jakub provided data about the symptoms of the problem - that was invaluable because, as you can now see, the bug was very difficult to reproduce. Tomasz finally spent four hours in solitude to nail the bug, and another hour to write this report. The problem repeated itself for serveral months, and in desperation Mariusz started to think about rewriting the system in C++. Unfortunately, there is still such a posibility, because we have another similar problem to solve. We are more optimistic now, but this event raised some doubts about the development model of GHC. We are afraid that with the current emphasis on adding new features you may be adding bugs faster that you are removing them :-/

Edited Mar 09, 2019 by Ian Lynagh <igloo@earth.li>

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information