Memory effects in doomed STM transactions

changed weight to 5

I'm looking at this again and a way to fix the problem is to validate a transaction before you do a large allocation in rts/sm/Storage.c allocate. I know how to do validation at that point with stmValidateNestOfTransactions, but if the transaction is invalid, we do not want to continue executing. How do we gracefully end the transaction at that point? Really all we need to do is gracefully get back to the scheduler loop.

Any ideas?

added STM label

I'm a bit confused about how the allocating branch of the conditional can be taken. I also tried to run this on my laptop and I was unable to get the results reported here. How do I reproduce the problem with GHC 8.6.5?

@osa1 as far as I know nothing has changed that would remove this issue. The reason the branch can be taken is that transactions do not ensure that they have seen a consistent view of memory until the end of the transaction. The order of the reads is opposite so if the mut transaction commits between the readTVars of the first transaction, then the invariant that tx < ty will not hold. You could do some long computation between those reads to make an observation of this more likely.

I still don't understand. mut will increment both ty and tx in a single transaction, no? How can the other thread observe only the update to tx and ty?

@osa1 the implementation allows transactions to continue to execute regardless of the consistency of their reads. For instance, if we call the transaction in the example code with the big allocation big and the other transactions mut we can have the following sequence of events with values instead of variables:

big                   mut
----------------------------------------------
..,
1001 <- readTVar tx
                     1002 <- readTVar ty
                     1001 < -readTVar tx
                     writeTVar tx 0
                     writeTVar ty 1
                     -- Transaction commits --
1 <- readTVar ty
if 1001 > 1
   ...

Note that big can never commit (doomed or zombie execution), it will see that tx does not hold the value 1001 anymore. In this case committing does not matter as the effect it encounters in the first branch of the if causes the program to be terminated. To fix this we have to validate that the execution is consistent before executing any dangerous effects. Normal allocations are not dangerous (they are harmful to performance) as it will simply trigger GC which will validate transactions.

Ah, that makes sense. Thanks @fryguybob .

If I understand the suggested fix you want to validate big before the big allocation and retry, correct? Would it make sense to validate transactions that use the updated TVars when mut successfully commits? (so validation will be triggered by mut instead of big, which is the opposite of your suggestion IIUC)

...validate big before the big allocation and retry

Yes. There is some overloaded terminology here (retry) so to be completely clear, we want to abort big and start again from the beginning. The only difficulty in doing this right now is that the moment we know the allocation is "big" (that is, big enough to fail in this way) we are in C code.

Would it make sense ... validation will be triggered by mut instead of big

Yes this can make sense, but you need a mechanism to signal that other threads need to validate. TL2 does this with a global clock that is checked every time a TVar is read. This is the more invasive change I mentioned in the original issue: https://dl.acm.org/citation.cfm?id=2976020

added Pnormal label

mentioned in issue #24427

NB: I have not yet read the paper linked above.

My impression is that performing operations with non-reversible effects inside a stm transaction is generally a bad fit for the STM programming model as implemented in GHC.

~~But this particular case is slightly special, since it only happens because the effects of atomic blocks can be partially observed from within another STM block. E.g the interleaving above.~~

The STM paper says:

... If the log is valid then the STM commits it atomically to the heap, thereby exposing its effects to other transactions. ...

To which I would say this is a bit of a stretch. The log is gradually (one tvar after another) exposed to other transactions. But only transactions to whom the update looked atomic will be able to commit. And hence we get speculative executions of the doomed branch.

I think this is rarely a problem, but for cases like the one in this ticket we would probably wants STM operations like consistentOrRetry :: STM () and consistentWithBarrierOrRetry :: STM () which check that:

All stm operations done so far don't conflict with the state of the current memory. (Both)
No atomic operation results are currently partially visible inside the transaction. (Only with barrier)

Both of which should be fairly easy to implement. The first check would just compare expected values in the transaction log with actual values and is trivial.

The barrier version takes a lock of all relevant tvars before performing the check. This ensures the check can see no partial updates. But it can prevent other transactions from comitting while the check is running.

Is the benefit of adding those worth the trouble? I'm not sure.

The log is gradually (one tvar after another) exposed to other transactions. But only transactions to whom the update looked atomic will be able to commit. And hence we get speculative executions of the doomed branch.

It isn't the one after another release of the locked TVars that can cause inconsistent reads, but that a running transaction may have already read one of the values being updated (and subsequent reads come from its transactional record) and then reads another for the first time after the commit. The commit itself is atomic, from the moment that all the locks for written TVar are acquired any other transaction will see the lock and spin reading until the updated value is written.

I don't think it would be helpful to expose an API to help with this situation because the user is likely unaware if they are using functionality that results in low level allocation. We can see when we are doing that allocation in the runtime that we are in a transaction.

We can see when we are doing that allocation in the runtime that we are in a transaction.

How do we gracefully end the transaction at that point? Really all we need to do is gracefully get back to the scheduler loop.

Tying this into some other OOM discussions I had recently here is an idea, but it might still have holes:

If we would replace allocate() with allocateMightFail() everywhere and throw an Exception on failure, then allocateMightFail could trigger validate on large allocations, and fail while setting a flag on the capability if validate failed. The STM exception handler could see the exception and could check for the flag and just silently restart the transaction if it's set.

It isn't the one after another release of the locked TVars that can cause inconsistent reads

True, my mistake. I still think saying it's committed to the heap atomically is misleading. But that's nitpicking at this point and I assume just common phrasing in the literatue.

I don't think it would be helpful to expose an API to help with this situation because the user is likely unaware if they are using functionality that results in low level allocation. We can see when we are doing that allocation in the runtime that we are in a transaction.

It's not a magic bullet but I think for library authors it would be a reasonable addition. But then again I haven't written any STM data structures so I might be mistaken.

Trac field	Value
Version	8.0.1
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC	simonmar
Operating system
Architecture

Memory effects in doomed STM transactions

Problem

Fixing

Child items ...

Activity