On Mon, 2009-08-17 at 11:19 +0300, Heikki Linnakangas wrote:
> I think there's a race condition in the way LogCurrentRunningXacts() is
> called at the end of checkpoint. This can happen in the master:
>
> 1. Checkpoint starts
> 2. Transaction 123 begins, and does some updates
> 3. Checkpoint ends. LogCurrentRunningXacts() is called.
> 4. LogCurrentRunningXacts() gets the list of currently running
> transactions by calling GetRunningTransactionData().
> 5. Transaction 123 ends, writing commit record to WAL
> 6. LogCurrentRunningXacts() writes the list of running XIDs to WAL. This
> includes XID 123, since that was still running at step 4.
>
> When that is replayed, ProcArrayUpdateTransactions() will zap the
> unobserved xids array with the list that includes XID 123, even though
> we already saw a commit record for it.
That's not a race condition as such, but it does make the code more
complex. The issue has long been understood.
I don't think it's acceptable to take and hold both ProcArray and
WALInsertLock. Those are now the two most heavily contended locks on the
system. We have evidence that there are burst delays associated with
various operations on just one of those locks, let alone two.
If you're still doubtful: the problem I've been working on recently is
that I overlooked the initial state of the lock table in my earlier
patch. GetRunningTransactionData() also needs to capture the initial
lock data.
There is no way in hell that I could personally condone holding
ProcArrayLock, WALInsertLock and all of the LockMgrLock partitions at
the same time. So we just have to eat the complexity. (No doubt someone
will disagree with my strong language here, but please take it as an
indication of exactly how bad an idea holding multiple locks would be.)
Slight timing issues are not too bad, really. We just have to assume
that the data may be mismatched, and have code to handle that.
Anyway, I've been working on this problem for some time and continue to
do so.
--
Simon Riggs    www.2ndQuadrant.com