threadDelay can wait forever, next time on January 22, 2009, around 20:00 UTC.
We(*) have found a serious bug in the non-threaded RTS on 32-bit *nix systems.
Calling threadDelay at certain clock-time intervals can make the thread
wait forever. This happens regularly, approximately every 49 days 17 hours,
or, more accurately, every (2^32 / 1000)
seconds. You can reproduce the bug with
a simple program, if you set your machine's clock to a specially crafted
time.
In this description I assume a 32-bit *nix system, where the type unsigned long has 32 bits.
If you want to see this happen, here is an instruction:
Test.hs:
import Control.Concurrent
main = threadDelay 5000000 -- 5 secs
Running on Linux, as root:
# date --set 'Thu Jan 22 20:20:10 UTC 2009' +%s; strace ./Delay
...
gettimeofday({1232655610, 12122}, NULL) = 0
gettimeofday({1232655610, 12171}, NULL) = 0
gettimeofday({1232655610, 12195}, NULL) = 0
select(0, [], [], NULL, {5, 20000}) = 0 (Timeout)
gettimeofday({1232655615, 32125}, NULL) = 0
select(0, [], [], NULL, {0, 4000}) = 0 (Timeout)
gettimeofday({1232655615, 35777}, NULL) = 0
select(0, [], [], NULL, {0, 4000}) = 0 (Timeout)
gettimeofday({1232655615, 40039}, NULL) = 0
select(0, [], [], NULL, {4294, 951296}
The runtime slept for more than 5 seconds, but it didn't wake up the thread, but instead started waiting for an evil looking amount of time.
I've managed to find the exact location of the bug, in function getourtimeofday() in rts/posix/Itimer.c. Other parts of non-threaded RTS, and especially the function awaitEvent(rtsBool) in rts/posix/Select.c, assume that the results of getourtimeofday() use the full range of values in lnat. But this isn't so. Let's take a look at the current version of this function:
lnat
getourtimeofday(void)
{
struct timeval tv;
nat interval;
interval = RtsFlags.MiscFlags.tickInterval;
if (interval == 0) { interval = 50; }
gettimeofday(&tv, (struct timezone *) NULL);
// cast to lnat because nat may be 64 bit when int is only 32 bit
return ((lnat)tv.tv_sec * 1000 / interval +
(lnat)tv.tv_usec / (interval * 1000));
}
Look at the first half of the expression in the return statement:
(lnat)tv.tv_sec * 1000 / interval
Here, tv.tv_sec is first converted to lnat, than multiplied by 1000, and
finally divided by interval. lnat is defined as "unsigned long" - on a
32-bit system this is usually a 32-bit type. The result of multiplying
by 1000 will have its higher bits lost and it will be in the range [0 .. 2^32 - 1]
. Dividing it by interval will shrink the range to [0 .. (2^32 - 1) / interval]
. With default value of tickInterval - which is 50
- this will be [0 .. 85899345]
. Adding the second part of the expression
can only increase it by (999999 / (interval * 1000))
, which is equal to
19 with default settings.
So, for default parameters, the result of getourtimeofday() will be in
[0 .. 85899364]
. Why is it a problem? It's quite simple: When a thread
executes threadDelay the runtime calculates the tick at which it should
be woken up. In our example, threadDelay is executed at tick 85899266,
and the time of wake-up is probably calculated as
85899266 + 5 * (1000 / 50) = 85899366
. This is greater then the biggest value getourtimeofday()
can return.
At this point I've probably already given too much detail, so I'll stop. I think the fix is quite easy: just perform the problematic part of computation on a wider integral type and cast it to lnat at the end.
There is also an easy workaround: compile your program with -threaded. The threaded runtime does not use this code, and AFAICT, the time is tracked using a 64-bit number of microseconds.
When I was trying to understand the problem, for some time the culprit for me was the "clever trick" mentionad at the top of rts/posix/Select.c. Now I understand it better and think it's innocent.
We have found this bug in GHC 6.6, and it's still there in the HEAD.
(*) "We" means Mariusz Gądarowski, Przemysław Kosiak, Jakub Bogusz and Tomasz Zielonka - employees of Gemius SA in Poland. Mariusz and Przemek encountered this bug in their haskell system for distributing computations on a cluster. Mariusz and Jakub provided data about the symptoms of the problem - that was invaluable because, as you can now see, the bug was very difficult to reproduce. Tomasz finally spent four hours in solitude to nail the bug, and another hour to write this report. The problem repeated itself for serveral months, and in desperation Mariusz started to think about rewriting the system in C++. Unfortunately, there is still such a posibility, because we have another similar problem to solve. We are more optimistic now, but this event raised some doubts about the development model of GHC. We are afraid that with the current emphasis on adding new features you may be adding bugs faster that you are removing them :-/