Thread: Incorrect snapshots while promoting hot standby node when 2PC is used

Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund
Hi,

Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.

Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

> The failure was triggered by one of the new TAP tests,
> 023_pitr_prepared_xact.pl, after recovering a 2PC transaction that
> used a transaction ID that matches with RecentXmin:
> (gdb) p RecentXmin
> $1 = 588
> (gdb) p TransactionXmin
> $2 = 589

I tried for a while to reproduce that, but couldn't. However, adding a bunch of
debugging output and increasing the log level shows the problem pretty clearly,
just without tripping any asserts:

2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG:  setting xmin to 588
...
2021-04-21 17:55:54.377 PDT [1829049] DEBUG:  removing all KnownAssignedXids
2021-04-21 17:55:54.377 PDT [1829049] DEBUG:  release all standby locks
...
2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG:  setting xmin to 589
...
2021-04-21 17:55:55.379 PDT [1829048] LOG:  database system is ready to accept connections
...
2021-04-21 17:55:55.380 PDT [1829120] LOG:  setting xmin to 588
...
2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG:  setting xmin to 588
2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG:  statement: COMMIT PREPARED 'fooinsert';
...
2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG:  setting xmin to 589

So there's clear proof of xmin going back and forth between 588 and 589.


After looking some more, the bug isn't even that subtle. And it is definitely not
new - it has likely existed since the introduction of hot standby.

The sequence in StartupXLOG() leading to the issue is the following:

1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;

Because 2) resets the KnownAssignedXids machinery, snapshots that happen
between 2) and 3) will not actually look at the procarray to compute
snapshots, as that only happens within

    snapshot->takenDuringRecovery = RecoveryInProgress();
    if (!snapshot->takenDuringRecovery)

and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
RECOVERY_STATE_DONE, which is set in 3).

So snapshots within that window will always be "empty", i.e. xmin is
latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
look at the procarray, which then leads to xmin going back to 588.
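
For reference, the relevant control flow in GetSnapshotData() looks roughly
like this (a heavily simplified sketch, not the actual code):

    snapshot->takenDuringRecovery = RecoveryInProgress();

    if (!snapshot->takenDuringRecovery)
    {
        /* normal running: compute xmin/xmax from the xids in the procarray */
    }
    else
    {
        /*
         * recovery: the procarray xids are ignored and the snapshot is
         * filled from the KnownAssignedXids machinery instead
         */
        KnownAssignedXidsGetAndSetXmin(...);
    }

With the KnownAssignedXids array already cleared but RecoveryInProgress()
still returning true, the else branch is taken and produces the bogus
"empty" snapshot.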


I think that this can lead to data corruption, because a too new xmin horizon
could lead to rows from a prepared transaction getting hint bitted as dead (or
perhaps even pruned, although that's probably harder to hit). Due to the too
new xmin horizon we won't treat rows by the prepared xact as in-progress, and
TransactionIdDidCommit() will return false, as the transaction didn't commit
yet. Which afaict can result in row versions created by the prepared
transaction being invisible even after the prepared transaction commits.
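
To spell out the visibility side (heavily paraphrased, this is not the
actual heapam visibility code): for a row whose inserting xid is the
still-prepared transaction, the check under the bogus snapshot effectively
becomes

    if (!XidInMVCCSnapshot(xmin, snapshot) &&   /* below the too-new xmin horizon */
        !TransactionIdDidCommit(xmin))          /* clog says: not committed (only prepared) */
    {
        /*
         * The tuple looks like it came from an aborted transaction, so it
         * can be hinted (or pruned) as dead, even though the prepared
         * transaction may still commit later.
         */
    }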

Without prepared transaction there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?


I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.

The acquisition of ProcArrayLock() in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().

I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.
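
Concretely, the end-of-recovery sequence would change roughly like this
(just a sketch of the ordering, not an actual patch):

    /* current order in StartupXLOG(), simplified */
    RecoverPreparedTransactions();
    ShutdownRecoveryTransactionEnvironment();           /* clears KnownAssignedXids */
    XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;

    /* proposed order */
    RecoverPreparedTransactions();
    XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE; /* RecoveryInProgress() is false from here */
    ShutdownRecoveryTransactionEnvironment();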


I think we ought to introduce assertions that have a higher chance of
catching cases like this. The window for hitting the new assertion that
Michael ran into is pretty darn small (xmin needs to move backwards
between two snapshot computations inside a single transaction).  I
*think* we can safely assert that xmin doesn't move backwards globally,
if we store it as a 64bit xid, and don't perform that check in
walsender?
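
Something along these lines, purely to illustrate the kind of check I mean
(the names are made up, and where the 64bit value would actually live -
shared memory rather than a per-backend static - is left open):

    /* hypothetical: remember the highest xmin ever computed, as a 64bit xid */
    static FullTransactionId last_computed_xmin;

    static void
    AssertXminDoesNotGoBackwards(FullTransactionId xmin)
    {
        /* per the caveat above, skip the check in walsender */
        if (!am_walsender)
        {
            Assert(FullTransactionIdFollowsOrEquals(xmin, last_computed_xmin));
            last_computed_xmin = xmin;
        }
    }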

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andrey Borodin
Hi Andres!

> On 23 Apr 2021, at 01:36, Andres Freund <andres@anarazel.de> wrote:
>
> So snapshots within that window will always be "empty", i.e. xmin is
> latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
> look at the procarray, which then leads to xmin going back to 588.
>
>
> I think that this can lead to data corruption, because a too new xmin horizon
> could lead to rows from a prepared transaction getting hint bitted as dead (or
> perhaps even pruned, although that's probably harder to hit). Due to the too
> new xmin horizon we won't treat rows by the prepared xact as in-progress, and
> TransactionIdDidCommit() will return false, as the transaction didn't commit
> yet. Which afaict can result in row versions created by the prepared
> transaction being invisible even after the prepared transaction commits.
>
> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

I'm investigating a somewhat similar case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.

In January, heapcheck of a backup reported some suspicious problems:
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1
...
and so on for ~100 pages - tuples with lp==1 were not frozen.

We froze the tuples with pg_dirty_hands and ran VACUUM (DISABLE_PAGE_SKIPPING) on the table.

At the end of March the same shard struck again with:
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
...
Around ~1040 blocks (starting from the same 1470240!) had a non-frozen tuple at lp==42.
I've run
update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;

After that, heapcheck showed a VM problem:
ERROR:  XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION:  collect_corrupt_items, heap_check.c:186

VACUUM fixed it with warnings:
WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240
and failed on the next page:
ERROR:  XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION:  heap_prepare_freeze_tuple, heapam.c:6172

I ran the update for all tuples in heapcheck's ctid list and a subsequent VACUUM (without page skipping). This satisfied corruption monitoring.


Can this case be related to the problem that you described?

Or, perhaps, it looks more like a hardware problem? Data_checksums are on, but a few years ago we observed ssd firmware that was losing updates, but passing checksums. I'm sure that we would benefit from having a separate relation fork for checksums or LSNs.


We observe similar cases 3-5 times a year. To date, no data has been lost due to this, but it's somewhat annoying.
BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to older versions.

Thanks!

Best regards, Andrey Borodin.




Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund
Hi,

On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
> I'm investigating a somewhat similar case.
> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
> Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.

If you still have the data it would be useful if you could check if the
LSNs of the corrupted pages are LSNs from shortly after standby
promotion/switchover?


> Can this case be related to the problem that you described?

It seems possible, but it's hard to know without a lot more information.


> Or, perhaps, it looks more like a hardware problem? Data_checksums are
> on, but a few years ago we observed ssd firmware that was losing
> updates, but passing checksums. I'm sure that we would benefit from
> having a separate relation fork for checksums or LSNs.

Right - checksums are "page local". They can only detect if a page is
corrupted, not if e.g. an older version of the page (with correct
checksum) has been restored. While there are schemes to have stronger
error detection properties, they do come with substantial overhead (at
least the ones I can think of right now).


Greetings,

Andres Freund



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andrey Borodin

> On 3 May 2021, at 23:10, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
>> I'm investigating a somewhat similar case.
>> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
>> Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.
>
> If you still have the data it would be useful if you could check if the
> LSNs of the corrupted pages are LSNs from shortly after standby
> promotion/switchover?
That's a neat idea, I'll check if I can restore a backup with corruptions.
I have a test cluster with corruptions, but it has undergone tens of switchovers...

>> Or, perhaps, it looks more like a hardware problem? Data_checksums are
>> on, but a few years ago we observed ssd firmware that was losing
>> updates, but passing checksums. I'm sure that we would benefit from
>> having a separate relation fork for checksums or LSNs.
>
> Right - checksums are "page local". They can only detect if a page is
> corrupted, not if e.g. an older version of the page (with correct
> checksum) has been restored. While there are schemes to have stronger
> error detection properties, they do come with substantial overhead (at
> least the ones I can think of right now).

We can have a PTRACK-like fork with page LSNs. It can be flushed on checkpoint and restored from WAL on crash. So we can always detect a stale page version. Like LSN-track :) We will have much faster rewind and delta-backups for free.

Though I don't think it's worth the effort until we at least checksum the CLOG.

Thanks!

Best regards, Andrey Borodin.


Andres Freund <andres@anarazel.de> writes:
> Michael Paquier (running locally I think), and subsequently Thomas Munro
> (noticing [1]), privately reported that they noticed an assertion failure in
> GetSnapshotData(). Both reasonably were wondering if that's related to the
> snapshot scalability patches.
> Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
>> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

mantid just showed a failure that looks like the same thing, at
least it's also in 023_pitr_prepared_xact.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06

The assertion line number is rather different though:

TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

and interestingly, this happened in a parallel worker:

postgres: node_pitr: parallel worker for PID 1162998 (ExceptionalCondition+0x7a)[0x946eca]
postgres: node_pitr: parallel worker for PID 1162998 (GetSnapshotData+0x897)[0x7ef327]
postgres: node_pitr: parallel worker for PID 1162998 (GetNonHistoricCatalogSnapshot+0x4a)[0x986a5a]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x189)[0x4fffe9]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x937743]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0x85)[0x93f155]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (index_open+0x6)[0x5007a6]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x177)[0x4fffd7]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x937306]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x92c4f1]
postgres: node_pitr: parallel worker for PID 1162998 (SearchCatCache1+0x176)[0x92e236]
postgres: node_pitr: parallel worker for PID 1162998 (TupleDescInitEntry+0xb3)[0x4a7eb3]
postgres: node_pitr: parallel worker for PID 1162998 [0x688f30]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResultTupleSlotTL+0x1b)[0x68a8fb]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResult+0x92)[0x6af902]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitNode+0x446)[0x684b06]
postgres: node_pitr: parallel worker for PID 1162998 (standard_ExecutorStart+0x269)[0x67d7b9]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelQueryMain+0x1a3)[0x681d83]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelWorkerMain+0x408)[0x53c218]
postgres: node_pitr: parallel worker for PID 1162998 (StartBackgroundWorker+0x23f)[0x774ebf]
postgres: node_pitr: parallel worker for PID 1162998 [0x780f3d]

            regards, tom lane



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund
Hi,

On 2021-05-04 12:32:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Michael Paquier (running locally I think), and subsequently Thomas Munro
> > (noticing [1]), privately reported that they noticed an assertion failure in
> > GetSnapshotData(). Both reasonably were wondering if that's related to the
> > snapshot scalability patches.
> > Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> >> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
> 
> mantid just showed a failure that looks like the same thing, at
> least it's also in 023_pitr_prepared_xact.pl:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
> 
> The assertion line number is rather different though:
> 
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

I managed to hit that one as well, and it's also what fairywren hit - the
assertions at lines 2094 and 2468 are basically copies of the same check, and
which one is hit is a question of timing.


> and interestingly, this happened in a parallel worker:

I think the issue can be hit (or rather detected) whenever a transaction
builds one snapshot while in recovery, and a second one during
end-of-recovery.  The parallel query here is just
2021-05-03 09:18:35.602 EDT [1162987:6] DETAIL:  Failed process was running: SELECT pg_is_in_recovery() = 'f';
(parallel due to force_parallel_mode), which of course is likely to run
during end-of-recovery.

So it does seem like the same bug of resetting the KnownAssignedXids
stuff too early.

Greetings,

Andres Freund



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Michael Paquier
On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
> The sequence in StartupXLOG() leading to the issue is the following:
>
> 1) RecoverPreparedTransactions();
> 2) ShutdownRecoveryTransactionEnvironment();
> 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
>
> Because 2) resets the KnownAssignedXids machinery, snapshots that happen
> between 2) and 3) will not actually look at the procarray to compute
> snapshots, as that only happens within
>
>     snapshot->takenDuringRecovery = RecoveryInProgress();
>     if (!snapshot->takenDuringRecovery)
>
> and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
> RECOVERY_STATE_DONE, which is set in 3).

Oh, indeed.  It is easy to see RecentXmin jumping back and forth while
running 023_pitr_prepared_xact.pl with a small sleep added just after
ShutdownRecoveryTransactionEnvironment().
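
For instance, something like this is enough to make the window visible
(illustrative placement only):

    ShutdownRecoveryTransactionEnvironment();

    pg_usleep(1000 * 1000L);    /* widen the window before SharedRecoveryState
                                 * is set to RECOVERY_STATE_DONE */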

> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

Yes, there should not be any as far as I recall.  2PC is kind of
special with its fake ProcArray entries.

> I think to fix the issue we'd have to move
> ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> = RECOVERY_STATE_DONE.
>
> The acquisition of ProcArrayLock() in
> ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> should prevent the data from being removed between the RecoveryInProgress()
> and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
>
> I haven't yet figured out whether there would be a problem with deferring the
> other tasks in ShutdownRecoveryTransactionEnvironment() until after
> RECOVERY_STATE_DONE.

Hmm.  This would mean releasing all the exclusive locks tracked by a
standby, as of StandbyReleaseAllLocks(), after opening the instance
for writes after a promotion.  I don't think that's unsafe, but it
would be intrusive.

Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
where we should try to not wipe out the 2PC entries to make sure that
all those snapshots still see the 2PC transactions as something to
count on?  I am attaching a crude patch to show the idea.

Thoughts?
--
Michael

Attachment

Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund
Hi,

On 2021-05-26 16:57:31 +0900, Michael Paquier wrote:
> Yes, there should not be any as far as I recall.  2PC is kind of
> special with its fake ProcArray entries.

It's really quite an awful design :(


> > I think to fix the issue we'd have to move
> > ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> > = RECOVERY_STATE_DONE.
> >
> > The acquisition of ProcArrayLock() in
> > ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> > should prevent the data from being removed between the RecoveryInProgress()
> > and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
> >
> > I haven't yet figured out whether there would be a problem with deferring the
> > other tasks in ShutdownRecoveryTransactionEnvironment() until after
> > RECOVERY_STATE_DONE.
>
> Hmm.  This would mean releasing all the exclusive locks tracked by a
> standby, as of StandbyReleaseAllLocks(), after opening the instance
> for writes after a promotion.  I don't think that's unsafe, but it
> would be intrusive.

Why would it be intrusive? We're talking a split second here, no? More
importantly, I don't think it's correct to release the locks at that
point.


> Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
> where we should try to not wipe out the 2PC entries to make sure that
> all those snapshots still see the 2PC transactions as something to
> count on?  I am attaching a crude patch to show the idea.

I don't think that's sufficient. We can't do most of the other stuff in
ShutdownRecoveryTransactionEnvironment() before changing
XLogCtl->SharedRecoveryState either. As long as the other backends think
we are in recovery, we shouldn't release e.g. the virtual transaction.

Greetings,

Andres Freund



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Michael Paquier
On Thu, May 27, 2021 at 10:01:49AM -0700, Andres Freund wrote:
> Why would it be intrusive? We're talking a split second here, no? More
> importantly, I don't think it's correct to release the locks at that
> point.

I have been looking at all that for the last couple of days, and
checked the code to make sure that relying on RecoveryInProgress() as
the tipping point is logically correct in terms of virtual XID,
snapshot build and KnownAssignedXids cleanup.  This stuff is tricky
enough that I may have missed something, but my impression (and
testing) is that we should be safe.

I am adding this patch to the next CF for now.  More eyes are needed.
--
Michael

Attachment

Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Michael Paquier
On Mon, May 31, 2021 at 09:37:17PM +0900, Michael Paquier wrote:
> I have been looking at all that for the last couple of days, and
> checked the code to make sure that relying on RecoveryInProgress() as
> the tipping point is logically correct in terms of virtual XID,
> snapshot build and KnownAssignedXids cleanup.  This stuff is tricky
> enough that I may have missed something, but my impression (and
> testing) is that we should be safe.

A couple of months later, I have looked back at this thread and this
report.  I have rechecked all the standby handling and snapshot builds
involving KnownAssignedXids and looked at all the phases that are
getting called until we call ShutdownRecoveryTransactionEnvironment()
to release these, and I don't think that there is a problem with the
solution proposed here.  So I propose to move on and apply this
patch.  Please let me know if there are any objections.
--
Michael

Attachment

Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Michael Paquier
On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
> A couple of months later, I have looked back at this thread and this
> report.  I have rechecked all the standby handling and snapshot builds
> involving KnownAssignedXids and looked at all the phases that are
> getting called until we call ShutdownRecoveryTransactionEnvironment()
> to release these, and I don't think that there is a problem with the
> solution proposed here.  So I propose to move on and apply this
> patch.  Please let me know if there are any objections.

Okay, I have worked more on that today, did more tests and applied the
fix as of 8a42379.
--
Michael

Attachment

Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund
On 2021-10-04 17:27:44 +0900, Michael Paquier wrote:
> On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
> > A couple of months later, I have looked back at this thread and this
> > report.  I have rechecked all the standby handling and snapshot builds
> > involving KnownAssignedXids and looked at all the phases that are
> > getting called until we call ShutdownRecoveryTransactionEnvironment()
> > to release these, and I don't think that there is a problem with the
> > solution proposed here.  So I propose to move on and apply this
> > patch.  Please let me know if there are any objections.
> 
> Okay, I have worked more on that today, did more tests and applied the
> fix as of 8a42379.

Thanks for remembering this!