Re: [HACKERS] Race condition in GetOldestActiveTransactionId() - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [HACKERS] Race condition in GetOldestActiveTransactionId()
Date
Msg-id e4c4e884-0c00-2f44-2a20-1b29a031349d@iki.fi
Whole thread Raw
In response to Race condition in GetOldestActiveTransactionId()  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On 08/22/2016 01:46 PM, Heikki Linnakangas wrote:
> While hacking on the CSN patch, I spotted a race condition between
> GetOldestActiveTransactionId() and GetNewTransactionId().
> GetOldestActiveTransactionId() calculates the oldest XID that's still
> running, by doing:
>
> 1. Read nextXid, without a lock. This is the upper bound, if there are
> no active XIDs in the proc array.
> 2. Loop through the proc array, making note of the oldest XID.
>
> While GetNewTransactionId() does:
>
> 1. Read and Increment nextXid
> 2. Set MyPgXact->xid.
>
> It seems possible that if you call GetNewTransactionId() concurrently
> with GetOldestActiveTransactionId(), GetOldestActiveTransactionId() sees
> the new nextXid value that the concurrent GetNewTransactionId() set, but
> doesn't see the old XID in the proc array. It will return a value that
> doesn't cover the old XID, i.e. it won't consider the just-assigned XID
> as in-progress.
>
> Am I missing something? Commit c6d76d7c added a comment to
> GetOldestActiveTransactionId() explaining why it's not necessary to
> acquire XidGenLock there, but I think it missed the above race condition.
>
> GetOldestActiveTransactionId() is not performance-critical, it's only
> called when performing a checkpoint, so I think we should just bite the
> bullet and grab the lock. Per attached patch.

I did some testing, and was able to indeed construct a case where 
oldestActiveXid was off-by-one in an online checkpoint record. However, 
looking at how it's used, I think it happened to not have any 
user-visible effect. The oldestActiveXid value of an online checkpoint 
is only used to determine the point where pg_subtrans is initialized, 
and the XID missed in the computation could not be a subtransaction.

Nevertheless, it's clearly a bug and the fix is simple, so I committed a 
fix.

- Heikki




pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: [HACKERS] pl/perl extension fails on Windows
Next
From: Marina Polyakova
Date:
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors