On Sun, 2009-09-27 at 14:59 +0300, Heikki Linnakangas wrote:
> The locking in smgr_redo_commit and smgr_redo_abort doesn't look right.
> First of all, smgr_redo_abort is not holding XidGenLock and
> ProcArrayLock while modifying ShmemVariableCache->nextXid and
> ShmemVariableCache->latestCompletedXid, respectively, like
> smgr_redo_commit is. Attached patch fixes that.
>
> But I think there's a little race condition in there anyway, as they
> remove the finished xids from known-assigned-xids list by calling
> ExpireTreeKnownAssignedTransactionIds, and
> ShmemVariableCache->latestCompletedXid is updated only after that. We're
> not holding ProcArrayLock between those two steps, like we do during
> normal transaction commit. I'm not sure what kind of issues that can
> lead to, but it seems like it can lead to broken snapshots if someone
> calls GetSnapshotData() between those steps.
I confess I didn't know what you were talking about when you wrote this
and was expecting you to have a coffee and retract. I realise now you
meant xact_redo_commit() rather than smgr_ and it makes sense at last.
I've just committed about half of your patch exactly, but not all.
I've avoided making the changes to
ShmemVariableCache->latestCompletedXid directly and moved the update to
ExpireTreeKnownAssignedTransactionIds() which already had access to
max_xid while holding ProcArrayLock. Hopefully that resolves your
comment about race condition.
Also, I noticed that you removed the line TransactionIdAdvance(ShmemVariableCache->nextXid);
in xact_redo_abort(). That looks like an error to me, since this follows
neither the comment nor how it is coded in xact_redo_commit(). Let me
know if there was some other reason for that line removal and I'll make
the change again.
-- Simon Riggs www.2ndQuadrant.com