Thread: Hot standby, race condition between recovery snapshot and commit

Hot standby, race condition between recovery snapshot and commit

From: Heikki Linnakangas
Date: Sat, 2009-11-14 14:59 +0200
There's a race condition between transaction commit and
GetRunningTransactionData(). If GetRunningTransactionData() runs between
the RecordTransactionCommit() and ProcArrayEndTransaction() calls in
CommitTransaction():

>     /*
>      * Here is where we really truly commit.
>      */
>     latestXid = RecordTransactionCommit(false);
> 
>     TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);
> 
>     /*
>      * Let others know about no transaction in progress by me. Note that this
>      * must be done _before_ releasing locks we hold and _after_
>      * RecordTransactionCommit.
>      */
>     ProcArrayEndTransaction(MyProc, latestXid);

The running-xacts snapshot will include the transaction that's just
committing, but the commit record will be before the running-xacts WAL
record. If the standby initializes transaction tracking from that
running-xacts record, it will consider the just-committed transaction
as still in progress until the next running-xacts record (at the next
checkpoint).

I can't see any obvious way around that. We could have transaction
commit acquire the new RecoveryInfoLock across those two calls, but I'd
like to avoid putting any extra overhead into such a critical path.

Hmm, actually ProcArrayApplyRecoveryInfo() could check every xid in the
running-xacts record against clog. If it's marked as finished in clog
already (because we already saw the commit/abort record before the
running-xacts record), we know it's not running after all.
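
Roughly something like this when the record is replayed (just a sketch;
the xids/xcnt representation is illustrative, only the
TransactionIdDidCommit()/TransactionIdDidAbort() clog lookups are meant
as the real helpers):

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdDidCommit/DidAbort */

/*
 * Sketch only: drop already-finished xids from a replayed running-xacts
 * record.  Any xid that clog already marks as committed or aborted had
 * its commit/abort record replayed before this record, so it is not
 * actually running anymore.
 */
static void
FilterFinishedXids(TransactionId *xids, int *xcnt)
{
    int         i;
    int         keep = 0;

    for (i = 0; i < *xcnt; i++)
    {
        TransactionId xid = xids[i];

        if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
            continue;           /* finished before this record; skip it */

        xids[keep++] = xid;     /* still apparently running */
    }
    *xcnt = keep;
}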

Because commit removes the entry from the procarray before releasing
its locks, it also seems possible for GetRunningTransactionData() to
acquire a snapshot that contains an AccessExclusiveLock for a
transaction whose XID is not listed as running in the XID list. That
sounds like trouble too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, race condition between recovery snapshot and commit

From: Simon Riggs
On Sat, 2009-11-14 at 14:59 +0200, Heikki Linnakangas wrote:
> There's a race condition ....

Yes, I believe this is a major showstopper for the current
approach/attempt....but...

> I can't see any obvious way around that. 

Huh? We're only doing this strict locking approach because you insisted
that the looser approach was not acceptable. Have you forgotten that
discussion so completely that you can't even remember the existence of
other options? 

It amazes me that you should then use locking overhead as the reason to
not pursue the current approach further, which was exactly my argument
for not pursuing it in the first place.

You're leading me a merry dance.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot standby, race condition between recovery snapshot and commit

From: Heikki Linnakangas
Simon Riggs wrote:
> On Sat, 2009-11-14 at 14:59 +0200, Heikki Linnakangas wrote:
>> I can't see any obvious way around that. 
> 
> Huh? We're only doing this strict locking approach because you insisted
> that the looser approach was not acceptable.

Take it easy, Simon. By obvious, I meant "trivial" or "easy".  I believe
you're referring to this
(http://archives.postgresql.org/message-id/4A8CE561.4000302@enterprisedb.com):
> If there's a way, I would prefer a solution where the RunningXacts
> snapshot represents the situation the moment it appears in WAL, not some
> moment before it. It makes the logic easier to understand.

or did we have further discussion on that since?

> Have you forgotten that
> discussion so completely that you can't even remember the existence of
> other options? 

I do remember that. I've been thinking about the looser approach a lot
since yesterday.

So, if we drop the notion that the running-xacts record represents the
situation at the exact moment it appears in WAL, what do we have to
change? Creating the running-xacts snapshot becomes easier, but when we
replay it, we must take the snapshot with a grain of salt.

1. the snapshot can contain xids that have already finished (= we've
already seen the commit/abort record)
2. the snapshot can lack xids belonging to transactions that started in
the window between when the running-xacts snapshot is taken in the
master and when it's written to WAL.

Problem 1 is quite easy to handle: just check every xid in clog. If it's
marked there as finished already, it can be ignored.

For problem 2, if a transaction hasn't written any WAL yet, we might as
well treat it as not-yet-started in the standby, so we're only concerned
about transactions that have written a WAL record between when the
running-xacts snapshot was taken and when it was written to WAL.
Assuming the snapshot was taken after the REDO pointer of the checkpoint
record, the standby has seen that WAL record and therefore has all the
information it needs. Currently, the standby doesn't add xids to the
known-assigned list until it sees the running-xacts record, but we could
change that.
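
The change could look roughly like this, i.e. a hook called for every
record replayed, right from the start of recovery (sketch only; the
KnownAssignedXid* helper names are placeholders, not an actual API):

/*
 * Sketch: called for every WAL record we replay, from the very start of
 * recovery rather than only after a running-xacts record has been seen.
 * Any xid found in a WAL record belongs to a transaction that was still
 * running at that point in the WAL, so remember it until its commit or
 * abort record is replayed.
 */
static void
NoteXidSeenInWal(TransactionId xid)
{
    if (!TransactionIdIsValid(xid))
        return;                 /* record carries no xid, nothing to do */

    if (!KnownAssignedXidExists(xid))   /* placeholder helper names */
        KnownAssignedXidAdd(xid);
}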

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, race condition between recovery snapshot and commit

From: Heikki Linnakangas
Date: Sun, 2009-11-15 14:43 +0200
Oh, forgot to mention another thing that I've been pondering:

Currently, the running-xacts record is written to the WAL after the
checkpoint record. There's a small chance of an xlog switch in between.
If that happens, it can take a long time after the checkpoint record
before the standby sees the running-xacts record, and hence a long time
before it can open up for connections.

In general, I'd like to remove as many as possible of the cases where
the standby starts up but can't open up for connections. It makes the
standby a lot less useful if you can't rely on it being open, so I'd
like to make it so that the standby can *always* open up. There are
currently three cases where the standby can't open up:

1. If the subxid cache has overflowed.

2. If there's no running-xacts record after the checkpoint record for
some reason. For example, because one was written but not archived yet,
or because the master crashed before it was written.

3. If too many AccessExclusiveLocks were being held.

Case 3 should be pretty easy to handle: just WAL-log all the
AccessExclusiveLocks, perhaps as separate WAL records (we already have a
new WAL record type for logging locks) if we're worried about the
running-xacts record growing too large. I think we could handle case 2
if we wrote the running-xacts record *before* the checkpoint record.
Then it would always lie between the REDO pointer of the checkpoint
record and the checkpoint record itself, so it would always be seen by
WAL recovery. To handle case 1, we could scan pg_subtrans. It would add
some amount of code and some more work to taking the running-xacts
snapshot, but it could be done.
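
For case 3, a per-lock WAL record needs to carry very little, something
like this (an illustration only, not necessarily the record layout the
patch ends up with):

/*
 * Sketch: minimal WAL record for one AccessExclusiveLock held by a
 * transaction, so the standby can re-acquire it during replay.
 */
typedef struct xl_standby_lock
{
    TransactionId xid;          /* xid holding the AccessExclusiveLock */
    Oid           dbOid;        /* database of the locked relation */
    Oid           relOid;       /* the locked relation itself */
} xl_standby_lock;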

This isn't absolutely necessary for the first version, but it's
something to keep in mind...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, race condition between recovery snapshot and commit

From: Simon Riggs
On Sun, 2009-11-15 at 14:43 +0200, Heikki Linnakangas wrote:

> This isn't absolutely necessary for the first version, but it's
> something to keep in mind...

Do I take that as agreement to the phased plan?

> In general, I'd like to remove as many as possible of the cases where
> the standby starts up but can't open up for connections. It makes the
> standby a lot less useful if you can't rely on it being open, so I'd
> like to make it so that the standby can *always* open up.

Yes, of course. The only reason for restrictions being acceptable is
that we have 99% of what we want, yet may lose everything if we play for
100% too quickly.

The standby will open quickly in many cases, as is. There are also a
range of other ways of doing this.

> There are currently three cases where the standby can't open up:
> 
> 1. If the subxid cache has overflowed.
> 
> 2. If there's no running-xacts record after the checkpoint record for
> some reason. For example, because one was written but not archived yet,
> or because the master crashed before it was written.
> 
> 3. If too many AccessExclusiveLocks were being held.
> 
> Case 3 should be pretty easy to handle: just WAL-log all the
> AccessExclusiveLocks, perhaps as separate WAL records (we already have
> a new WAL record type for logging locks) if we're worried about the
> running-xacts record growing too large. I think we could handle case 2
> if we wrote the running-xacts record *before* the checkpoint record.
> Then it would always lie between the REDO pointer of the checkpoint
> record and the checkpoint record itself, so it would always be seen by
> WAL recovery. To handle case 1, we could scan pg_subtrans. It would add
> some amount of code and some more work to taking the running-xacts
> snapshot, but it could be done.

"Some amount of code" requires some amount of thought, followed by some
amount of review which takes some amount of time.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot standby, race condition between recovery snapshot and commit

From: Heikki Linnakangas
Date: Sun, 2009-11-15 21:37 +0200
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Have you forgotten that
>> discussion so completely that you can't even remember the existence of
>> other options?
>
> I do remember that. I've been thinking about the looser approach a lot
> since yesterday.
>
> So, if we drop the notion that the running-xacts record represents the
> situation at the exact moment it appears in WAL, what do we have to
> change? Creating the running-xacts snapshot becomes easier, but when we
> replay it, we must take the snapshot with a grain of salt.
>
> 1. the snapshot can contain xids that have already finished (= we've
> already seen the commit/abort record)
> 2. the snapshot can lack xids belonging to transactions that started in
> the window between when the running-xacts snapshot is taken in the
> master and when it's written to WAL.
>
> Problem 1 is quite easy to handle: just check every xid in clog. If it's
> marked there as finished already, it can be ignored.
>
> For problem 2, if a transaction hasn't written any WAL yet, we might as
> well treat it as not-yet-started in the standby, so we're only concerned
> about transactions that have written a WAL record between when the
> running-xacts snapshot was taken and when it was written to WAL.
> Assuming the snapshot was taken after the REDO pointer of the checkpoint
> record, the standby has seen that WAL record and therefore has all the
> information it needs. Currently, the standby doesn't add xids to the
> known-assigned list until it sees the running-xacts record, but we could
> change that.

Ok, I tried out that approach. Attached is a complete patch against CVS
HEAD (see commit db15148b930 in the git branch for the diff against the
old approach):

- We start tracking transactions in the known-assigned hash table
immediately from the start of WAL replay. We have to do that because the
running-xacts record we will eventually see may lack XIDs belonging to
transactions that started between when the running-xacts snapshot was
taken and when it was written to WAL. If we only started tracking at the
running-xacts record, we would miss them. To keep the size of the
known-assigned table bounded, we ignore any XIDs smaller than the oldest
XID present in the running-xacts record (any such transaction must've
finished before the running-xacts record, so we're not interested in
them; see the sketch after this list). We wouldn't know the oldest
running XID until we see the running-xacts record, so we store it in the
checkpoint record too, which we have access to right from the start.

- StartupCLOG/SUBTRANS/MultiXact are now called at the beginning of WAL
replay. We used to delay that until we saw the running-xacts record, but
that always felt a bit weird to me. StartupSUBTRANS takes the
oldest-running-xid as argument, but now that we store that in the
checkpoint record, that's not a problem.

- Because the running-xacts record can contain XIDs belonging to
transactions that finished before the record was written to WAL, we
ignore any already-finished XIDs when it's replayed.

- The running-xacts record is written to WAL before the checkpoint
record. That guarantees that WAL replay will see it.

- RecoveryInfoLock is no longer needed.
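
The cutoff mentioned in the first point boils down to a check like this
(sketch; exactly how oldestRunningXid is carried in the checkpoint
record is glossed over, and the function name is illustrative):

/*
 * Sketch: should an xid seen during early replay be tracked at all?
 * oldestRunningXid comes from the checkpoint record; anything older
 * must have finished before the running-xacts record that is still
 * ahead of us, so it can be ignored and the known-assigned table
 * stays bounded.
 */
static bool
XidNeedsTracking(TransactionId xid, TransactionId oldestRunningXid)
{
    return TransactionIdIsValid(xid) &&
        TransactionIdFollowsOrEquals(xid, oldestRunningXid);
}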

This also lays the foundation to allow standby mode even with subxid or
lock overflows. We could now emit separate log records for overflowed
subxids or locks before the running-xacts record to fill that gap.

Am I missing anything?

I also experimented with including the running-xacts information in the
checkpoint record itself. That somehow feels more straightforward to me,
but it wasn't really any less code, and it wouldn't allow us to do the
running-xacts snapshot as multiple WAL records, so the current approach
with a separate running-xacts record is better.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, race condition between recovery snapshot and commit

From: Simon Riggs
On Sun, 2009-11-15 at 21:37 +0200, Heikki Linnakangas wrote:

> Am I missing anything?

Will review.

> I also experimented with including the running-xacts information in the
> checkpoint record itself. That somehow feels more straightforward to me,
> but it wasn't really any less code, and it wouldn't allow us to do the
> running-xacts snapshot as multiple WAL records, so the current approach
> with a separate running-xacts record is better.

Agreed, more modular.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot standby, race condition between recovery snapshot and commit

From: Heikki Linnakangas
Simon Riggs wrote:
> On Sun, 2009-11-15 at 21:37 +0200, Heikki Linnakangas wrote:
> 
>> Am I missing anything?
> 
> Will review.

Thanks! Please use the head of the git branch; I already found one
major oversight in what I posted that's fixed there... I should go to
bed already.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com