Thread: Incorrect snapshots while promoting hot standby node when 2PC is used
Hi,

Michael Paquier (running locally I think), and subsequently Thomas Munro (noticing [1]), privately reported that they noticed an assertion failure in GetSnapshotData(). Both reasonably were wondering if that's related to the snapshot scalability patches.

Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:

> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
> The failure was triggered by one of the new TAP tests,
> 023_pitr_prepared_xact.pl, after recovering a 2PC transaction that
> used a transaction ID that matches with RecentXmin:
> (gdb) p RecentXmin
> $1 = 588
> (gdb) p TransactionXmin
> $2 = 589

I tried for a while to reproduce that, but couldn't. Adding a bunch of debugging output and increasing the log level shows the problem pretty clearly, however, just without tripping any asserts:

2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG: setting xmin to 588
...
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: removing all KnownAssignedXids
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: release all standby locks
...
2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG: setting xmin to 589
...
2021-04-21 17:55:55.379 PDT [1829048] LOG: database system is ready to accept connections
...
2021-04-21 17:55:55.380 PDT [1829120] LOG: setting xmin to 588
...
2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG: setting xmin to 588
2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG: statement: COMMIT PREPARED 'fooinsert';
...
2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG: setting xmin to 589

So there's clear proof of xmin bouncing back and forth between 588 and 589.

After looking some more, the bug isn't even that subtle. And it is definitely not new - it has likely existed since the introduction of hot standby.

The sequence in StartupXLOG() leading to the issue is the following:

1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;

Because 2) resets the KnownAssignedXids machinery, snapshots taken between 2) and 3) will not actually look at the procarray to compute snapshots, as that only happens within

    snapshot->takenDuringRecovery = RecoveryInProgress();
    if (!snapshot->takenDuringRecovery)

and RecoveryInProgress() checks XLogCtl->SharedRecoveryState != RECOVERY_STATE_DONE, which is only set in 3). So snapshots within that window will always be "empty", i.e. xmin is latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll look at the procarray again, which then leads to xmin going back to 588.

I think this can lead to data corruption, because a too-new xmin horizon could lead to rows from a prepared transaction getting hint-bitted as dead (or perhaps even pruned, although that's probably harder to hit). Due to the too-new xmin horizon we won't treat rows created by the prepared xact as in-progress, and TransactionIdDidCommit() will return false, as the transaction hasn't committed yet. Which, afaict, can result in row versions created by the prepared transaction being invisible even after the prepared transaction commits.

Without prepared transactions there probably isn't an issue, because there shouldn't be any other in-progress xids at that point?

I think to fix the issue we'd have to move ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE.
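To make the ordering concrete, here is a heavily simplified sketch of the relevant tail of StartupXLOG(); most steps are omitted and the comments paraphrase the analysis above, so treat it as an illustration of the proposed reordering rather than the actual code:

    /* Heavily simplified sketch of the end of StartupXLOG(); most steps omitted. */

    /* 1) Re-create dummy ProcArray entries for transactions that were prepared
     *    before the promotion, so they are seen as in-progress again. */
    RecoverPreparedTransactions();

    /* 2) Problematic today: this wipes KnownAssignedXids (and releases standby
     *    locks) while RecoveryInProgress() still returns true, so snapshots taken
     *    in this window see neither KnownAssignedXids nor the procarray. */
    ShutdownRecoveryTransactionEnvironment();

    /* 3) Only now do other backends stop treating the system as in recovery. */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * Proposed fix: swap 2) and 3), i.e. flip SharedRecoveryState first so that
     * GetSnapshotData() starts looking at the procarray, and only then tear down
     * the recovery-side tracking with ShutdownRecoveryTransactionEnvironment().
     */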
The acquisition of ProcArrayLock in ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds() should prevent the data from being removed between the RecoveryInProgress() and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().

I haven't yet figured out whether there would be a problem with deferring the other tasks in ShutdownRecoveryTransactionEnvironment() until after RECOVERY_STATE_DONE.

I think we ought to introduce assertions that have a higher chance of catching cases like this. The window for hitting the new assertion that Michael ran into is pretty darn small (xmin needs to move backwards between two snapshot computations inside a single transaction). I *think* we can safely assert that xmin doesn't move backwards globally, if we store it as a 64-bit xid and don't perform that check in walsenders?

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04
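A rough sketch of the kind of cross-snapshot assertion floated above might look like the following. None of these names (ShmemGlobalXmin64, AssertXminDoesNotGoBackwards) exist in PostgreSQL; the placement, the initialization of the shared value, and the walsender exemption are all assumptions - this only illustrates the shape of the check, not an actual patch:

    #include "postgres.h"
    #include "access/transam.h"
    #include "port/atomics.h"
    #include "replication/walsender.h"

    /* Hypothetical: a shared-memory, 64-bit copy of the newest xmin any
     * backend has computed since startup. */
    static pg_atomic_uint64 *ShmemGlobalXmin64;

    static void
    AssertXminDoesNotGoBackwards(FullTransactionId newXmin)
    {
    #ifdef USE_ASSERT_CHECKING
        uint64      new64 = U64FromFullTransactionId(newXmin);
        uint64      old64;

        /* Skip the check in walsenders, as suggested above. */
        if (am_walsender)
            return;

        old64 = pg_atomic_read_u64(ShmemGlobalXmin64);

        /* Once widened to 64 bits, xmin should never move backwards. */
        Assert(new64 >= old64);

        /* Advance the shared value; the CAS loop keeps it monotonic under
         * concurrent updaters. */
        while (old64 < new64 &&
               !pg_atomic_compare_exchange_u64(ShmemGlobalXmin64, &old64, new64))
            ;
    #endif
    }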
Hi Andres!

> On 23 Apr 2021, at 01:36, Andres Freund <andres@anarazel.de> wrote:
>
> So snapshots within that window will always be "empty", i.e. xmin is
> latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
> look at the procarray, which then leads xmin going back to 588.
>
> I think that this can lead to data corruption, because a too new xmin horizon
> could lead to rows from a prepared transaction getting hint bitted as dead (or
> perhaps even pruned, although that's probably harder to hit). Due to the too
> new xmin horizon we won't treat rows by the prepared xact as in-progress, and
> TransactionIdDidCommit() will return false, as the transaction didn't commit
> yet. Which afaict can result in row versions created by the prepared
> transaction being invisible even after the prepared transaction commits.
>
> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

I'm investigating a somewhat resemblant case.

We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC. Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.

In January, heapcheck of a backup reported some suspicious problems:

ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1

... and so on for ~100 pages - tuples with lp==1 were not frozen. We froze the tuples with pg_dirty_hands and ran VACUUM (DISABLE_PAGE_SKIPPING) on the table.

At the end of March the same shard struck again with:

ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42

.... around ~1040 blocks (starting from the same 1470240!) had a non-frozen tuple at lp==42.

I ran

update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;

After that, heapcheck showed a VM problem:

ERROR: XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION: collect_corrupt_items, heap_check.c:186

VACUUM fixed it with warnings:

WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240

and failed on the next page:

ERROR: XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION: heap_prepare_freeze_tuple, heapam.c:6172

I ran the update for all tuples in heapcheck's ctid list and a subsequent vacuum (without page skipping). This satisfied the corruption monitoring.

Can this case be related to the problem that you described? Or, perhaps, does it look more like a hardware problem? data_checksums are on, but a few years ago we observed SSD firmware that was losing updates while still passing checksums. I'm sure that we would benefit from having a separate relation fork for checksums or LSNs.

We observe similar cases 3-5 times a year. To date no data has been lost because of this, but it's somewhat annoying.

BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to old versions.

Thanks!

Best regards, Andrey Borodin.
Hi,

On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
> I'm investigating a somewhat resemblant case.
> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
> Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.

If you still have the data, it would be useful if you could check whether the LSNs of the corrupted pages are LSNs from shortly after the standby promotion/switchover?

> Can this case be related to the problem that you described?

It seems possible, but it's hard to know without a lot more information.

> Or, perhaps, does it look more like a hardware problem? data_checksums are
> on, but a few years ago we observed SSD firmware that was losing
> updates while still passing checksums. I'm sure that we would benefit from
> having a separate relation fork for checksums or LSNs.

Right - checksums are "page local". They can only detect that a page is corrupted, not that e.g. an older version of the page (with a correct checksum) has been restored. While there are schemes with stronger error-detection properties, they do come with substantial overhead (at least the ones I can think of right now).

Greetings,

Andres Freund
> On 3 May 2021, at 23:10, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
>> I'm investigating a somewhat resemblant case.
>> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
>> Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.
>
> If you still have the data, it would be useful if you could check whether the
> LSNs of the corrupted pages are LSNs from shortly after the standby
> promotion/switchover?

That's a neat idea, I'll check if I can restore a backup with the corruptions. I have a test cluster with corruptions, but it has undergone tens of switchovers...

>> Or, perhaps, does it look more like a hardware problem? data_checksums are
>> on, but a few years ago we observed SSD firmware that was losing
>> updates while still passing checksums. I'm sure that we would benefit from
>> having a separate relation fork for checksums or LSNs.
>
> Right - checksums are "page local". They can only detect that a page is
> corrupted, not that e.g. an older version of the page (with a correct
> checksum) has been restored. While there are schemes with stronger
> error-detection properties, they do come with substantial overhead (at
> least the ones I can think of right now).

We could have a PTRACK-like fork with page LSNs. It could be flushed on checkpoint and restored from WAL on crash, so we could always detect a stale page version. Like LSN-track :) We would get much faster rewind and delta-backups for free. Though I don't think it's worth the effort until we at least checksum the CLOG.

Thanks!

Best regards, Andrey Borodin.
Andres Freund <andres@anarazel.de> writes:
> Michael Paquier (running locally I think), and subsequently Thomas Munro
> (noticing [1]), privately reported that they noticed an assertion failure in
> GetSnapshotData(). Both reasonably were wondering if that's related to the
> snapshot scalability patches.

> Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
>> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

mantid just showed a failure that looks like the same thing; at least it's also in 023_pitr_prepared_xact.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06

The assertion line number is rather different though:

TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

and interestingly, this happened in a parallel worker:

postgres: node_pitr: parallel worker for PID 1162998 (ExceptionalCondition+0x7a)[0x946eca]
postgres: node_pitr: parallel worker for PID 1162998 (GetSnapshotData+0x897)[0x7ef327]
postgres: node_pitr: parallel worker for PID 1162998 (GetNonHistoricCatalogSnapshot+0x4a)[0x986a5a]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x189)[0x4fffe9]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x937743]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0x85)[0x93f155]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (index_open+0x6)[0x5007a6]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x177)[0x4fffd7]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x937306]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x92c4f1]
postgres: node_pitr: parallel worker for PID 1162998 (SearchCatCache1+0x176)[0x92e236]
postgres: node_pitr: parallel worker for PID 1162998 (TupleDescInitEntry+0xb3)[0x4a7eb3]
postgres: node_pitr: parallel worker for PID 1162998 [0x688f30]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResultTupleSlotTL+0x1b)[0x68a8fb]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResult+0x92)[0x6af902]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitNode+0x446)[0x684b06]
postgres: node_pitr: parallel worker for PID 1162998 (standard_ExecutorStart+0x269)[0x67d7b9]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelQueryMain+0x1a3)[0x681d83]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelWorkerMain+0x408)[0x53c218]
postgres: node_pitr: parallel worker for PID 1162998 (StartBackgroundWorker+0x23f)[0x774ebf]
postgres: node_pitr: parallel worker for PID 1162998 [0x780f3d]

regards, tom lane
Hi,

On 2021-05-04 12:32:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Michael Paquier (running locally I think), and subsequently Thomas Munro
> > (noticing [1]), privately reported that they noticed an assertion failure in
> > GetSnapshotData(). Both reasonably were wondering if that's related to the
> > snapshot scalability patches.
> > Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> >> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
>
> mantid just showed a failure that looks like the same thing, at
> least it's also in 023_pitr_prepared_xact.pl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
>
> The assertion line number is rather different though:
>
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

I managed to hit that one as well, and it's also what fairywren hit - the assertions at lines 2094 and 2468 are basically copies of the same check, and which one is hit is a question of timing.

> and interestingly, this happened in a parallel worker:

I think the issue can be hit (or rather detected) whenever a transaction builds one snapshot while in recovery, and a second one during end-of-recovery. The parallel query here is just

2021-05-03 09:18:35.602 EDT [1162987:6] DETAIL: Failed process was running: SELECT pg_is_in_recovery() = 'f';

(parallel due to force_parallel_mode) - which of course is likely to run during end-of-recovery.

So it does seem like the same bug of resetting the KnownAssignedXids stuff too early.

Greetings,

Andres Freund
On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
> The sequence in StartupXLOG() leading to the issue is the following:
>
> 1) RecoverPreparedTransactions();
> 2) ShutdownRecoveryTransactionEnvironment();
> 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
>
> Because 2) resets the KnownAssignedXids machinery, snapshots that happen
> between 2) and 3) will not actually look at the procarray to compute
> snapshots, as that only happens within
>
>     snapshot->takenDuringRecovery = RecoveryInProgress();
>     if (!snapshot->takenDuringRecovery)
>
> and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
> RECOVERY_STATE_DONE, which is set in 3).

Oh, indeed. It is easy to see RecentXmin jumping back and forth while running 023_pitr_prepared_xact.pl with a small sleep added just after ShutdownRecoveryTransactionEnvironment().

> Without prepared transactions there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

Yes, there should not be any, as far as I recall. 2PC is kind of special with its fake ProcArray entries.

> I think to fix the issue we'd have to move
> ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> = RECOVERY_STATE_DONE.
>
> The acquisition of ProcArrayLock in
> ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> should prevent the data from being removed between the RecoveryInProgress()
> and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
>
> I haven't yet figured out whether there would be a problem with deferring the
> other tasks in ShutdownRecoveryTransactionEnvironment() until after
> RECOVERY_STATE_DONE.

Hmm. This would mean releasing all the exclusive locks tracked by a standby, as of StandbyReleaseAllLocks(), after opening the instance for writes after a promotion. I don't think that's unsafe, but it would be intrusive.

Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself, where we should try to not wipe out the 2PC entries, to make sure that all those snapshots still see the 2PC transactions as something to count on? I am attaching a crude patch to show the idea.

Thoughts?
--
Michael
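(For readers following along: the "small sleep" reproduction mentioned above amounts to something like the following, with the exact placement inside StartupXLOG() assumed; it is a debugging aid only, not part of any fix.)

    ShutdownRecoveryTransactionEnvironment();

    /*
     * Debugging aid: widen the window in which RecoveryInProgress() still
     * returns true but KnownAssignedXids has already been cleared, so that
     * 023_pitr_prepared_xact.pl reliably takes a snapshot inside it.
     */
    pg_usleep(1000L * 1000L);   /* one second */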
Hi,

On 2021-05-26 16:57:31 +0900, Michael Paquier wrote:
> Yes, there should not be any, as far as I recall. 2PC is kind of
> special with its fake ProcArray entries.

It's really quite an awful design :(

> > I think to fix the issue we'd have to move
> > ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> > = RECOVERY_STATE_DONE.
> >
> > The acquisition of ProcArrayLock in
> > ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> > should prevent the data from being removed between the RecoveryInProgress()
> > and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
> >
> > I haven't yet figured out whether there would be a problem with deferring the
> > other tasks in ShutdownRecoveryTransactionEnvironment() until after
> > RECOVERY_STATE_DONE.
>
> Hmm. This would mean releasing all the exclusive locks tracked by a
> standby, as of StandbyReleaseAllLocks(), after opening the instance
> for writes after a promotion. I don't think that's unsafe, but it
> would be intrusive.

Why would it be intrusive? We're talking about a split second here, no? More importantly, I don't think it's correct to release the locks at that point.

> Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
> where we should try to not wipe out the 2PC entries, to make sure that
> all those snapshots still see the 2PC transactions as something to
> count on? I am attaching a crude patch to show the idea.

I don't think that's sufficient. We can't do most of the other stuff in ShutdownRecoveryTransactionEnvironment() before changing XLogCtl->SharedRecoveryState either. As long as the other backends think we are in recovery, we shouldn't release e.g. the virtual transaction.

Greetings,

Andres Freund
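For context, the teardown under discussion boils down to roughly these three steps; the sketch paraphrases the calls named in this thread and may differ from the real function in detail:

    void
    ShutdownRecoveryTransactionEnvironment(void)
    {
        /* Stop tracking in-progress xids, i.e. clear KnownAssignedXids. */
        ExpireAllKnownAssignedTransactionIds();

        /* Release all locks tracked on behalf of transactions replayed
         * from the primary. */
        StandbyReleaseAllLocks();

        /* Clean up the startup process's virtual transaction lock. */
        VirtualXactLockTableCleanup();
    }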
On Thu, May 27, 2021 at 10:01:49AM -0700, Andres Freund wrote:
> Why would it be intrusive? We're talking about a split second here, no? More
> importantly, I don't think it's correct to release the locks at that
> point.

I have been looking at all that for the last couple of days, and checked the code to make sure that relying on RecoveryInProgress() as the tipping point is logically correct in terms of virtual XIDs, snapshot builds and KnownAssignedXids cleanup. This stuff is tricky enough that I may have missed something, but my impression (and testing) is that we should be safe.

I am adding this patch to the next CF for now. More eyes are needed.
--
Michael
On Mon, May 31, 2021 at 09:37:17PM +0900, Michael Paquier wrote:
> I have been looking at all that for the last couple of days, and
> checked the code to make sure that relying on RecoveryInProgress() as
> the tipping point is logically correct in terms of virtual XIDs,
> snapshot builds and KnownAssignedXids cleanup. This stuff is tricky
> enough that I may have missed something, but my impression (and
> testing) is that we should be safe.

A couple of months later, I have looked back at this thread and this report. I have rechecked all the standby handling and snapshot builds involving KnownAssignedXids, and looked at all the phases that get called until we call ShutdownRecoveryTransactionEnvironment() to release these, and I don't think that there is a problem with the solution proposed here. So I propose to move on and apply this patch. Please let me know if there are any objections.
--
Michael
On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
> A couple of months later, I have looked back at this thread and this
> report. I have rechecked all the standby handling and snapshot builds
> involving KnownAssignedXids, and looked at all the phases that get
> called until we call ShutdownRecoveryTransactionEnvironment()
> to release these, and I don't think that there is a problem with the
> solution proposed here. So I propose to move on and apply this
> patch. Please let me know if there are any objections.

Okay, I have worked more on that today, did more tests, and applied the fix as of 8a42379.
--
Michael
On 2021-10-04 17:27:44 +0900, Michael Paquier wrote:
> On Fri, Oct 01, 2021 at 02:11:15PM +0900, Michael Paquier wrote:
> > A couple of months later, I have looked back at this thread and this
> > report. I have rechecked all the standby handling and snapshot builds
> > involving KnownAssignedXids, and looked at all the phases that get
> > called until we call ShutdownRecoveryTransactionEnvironment()
> > to release these, and I don't think that there is a problem with the
> > solution proposed here. So I propose to move on and apply this
> > patch. Please let me know if there are any objections.
>
> Okay, I have worked more on that today, did more tests, and applied the
> fix as of 8a42379.

Thanks for remembering this!