Thread: [BUG] Panic due to incorrect missingContrecPtr after promotion
On 13.5, a WAL flush PANIC is encountered after a standby is promoted.
With debugging, it was found that when a standby skips a missing continuation record on recovery, the missingContrecPtr is not invalidated after the record is skipped. Therefore, when the standby is promoted to primary, it writes an overwrite_contrecord with an LSN of the missingContrecPtr, which is now in the past. At flush time, this causes a PANIC. From what I can see, this failure scenario can only occur after a standby is promoted.
The overwrite_contrecord was introduced in 13.5 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.
Attached are a patch and a TAP test to handle this condition. The patch ensures that an overwrite_contrecord is only created if the missingContrecPtr is ahead of the last WAL record.
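In code form, the guard this describes looks roughly like the following (a sketch reconstructed from the description above, not the attached patch itself; EndOfLog stands for the end-of-recovery insert position as in StartupXLOG):

    /*
     * Sketch only: keep the missing-contrecord state, and later write an
     * OVERWRITE_CONTRECORD, only when the missing position is ahead of the
     * point where WAL reading ended.  On a promoted standby the stale
     * missingContrecPtr can lie behind EndOfLog, and writing there would
     * place new WAL in the past, making the next flush PANIC.
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr) &&
        missingContrecPtr > EndOfLog)
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;   /* first insert overwrites the remnant */
    }
    else
    {
        /* stale state from before promotion: write no overwrite record */
        abortedRecPtr = InvalidXLogRecPtr;
        missingContrecPtr = InvalidXLogRecPtr;
    }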
To reproduce:
Run the new TAP test recovery/t/029_overwrite_contrecord_promotion.pl without the attached patch:
2022-02-22 18:38:15.526 UTC [31138] LOG: started streaming WAL from primary at 0/2000000 on timeline 1
2022-02-22 18:38:15.535 UTC [31105] LOG: successfully skipped missing contrecord at 0/1FFC620, overwritten at 2022-02-22 18:38:15.136482+00
2022-02-22 18:38:15.535 UTC [31105] CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC620; time 2022-02-22 18:38:15.136482+00
...
2022-02-22 18:38:15.575 UTC [31103] PANIC: xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088
2022-02-22 18:38:15.575 UTC [31101] LOG: checkpointer process (PID 31103) was terminated by signal 6: Aborted
...
With the patch, running the same TAP test succeeds and no PANIC is observed.
Thanks
Sami Imseih
Amazon Web Services
Attachment
On 2022-Feb-22, Imseih (AWS), Sami wrote:

> On 13.5, a WAL flush PANIC is encountered after a standby is promoted.
>
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped. Therefore, when the standby
> is promoted to primary, it writes an overwrite_contrecord with an LSN
> of the missingContrecPtr, which is now in the past. At flush time,
> this causes a PANIC. From what I can see, this failure scenario can
> only occur after a standby is promoted.

Ooh, nice find and diagnosis. I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.

I attach the same patch, with the test file put in its final place
rather than as a patch. Due to recent xlog.c changes this needs a bit
of work to apply to back branches; I'll see about getting it in all
branches soon.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from
MS SQL and it would be hard for me to put into words how much of a better
job you all are doing on [PostgreSQL]."
        Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php
Attachment
> Ooh, nice find and diagnosis. I can confirm that the test fails as you
> described without the code fix, and doesn't fail with it.
>
> I attach the same patch, with the test file put in its final place
> rather than as a patch. Due to recent xlog.c changes this needs a bit
> of work to apply to back branches; I'll see about getting it in all
> branches soon.

Thank you for reviewing!

Sami
At Wed, 23 Feb 2022 02:58:07 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
> > Ooh, nice find and diagnosis. I can confirm that the test fails as you
> > described without the code fix, and doesn't fail with it.
> >
> > I attach the same patch, with the test file put in its final place
> > rather than as a patch. Due to recent xlog.c changes this needs a bit
> > of work to apply to back branches; I'll see about getting it in all
> > branches soon.
>
> Thank you for reviewing!

Nice catch! However, I'm not sure I like the patch.

      * made it through and start writing after the portion that persisted.
      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
      * we'll do as soon as we're open for writing new WAL.)
     + *
     + * If the last WAL record is ahead of the missing contrecord, this is a
     + * recently promoted primary and we should not write an overwrite
     + * contrecord.

Just before that part, the comment reads:

      * Actually, if WAL ended in an incomplete record, skip the parts that

So, actually the WAL did not end in an incomplete record. I think
FinishWalRecovery is the last place to do that. (But it could be
earlier.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> So, actually the WAL did not end in an incomplete record. I think
> FinishWalRecovery is the last place to do that. (But it could be
> earlier.)

After some investigation, I finally concluded that we should reset
abortedRecPtr and missingContrecPtr at processing
XLOG_OVERWRITE_CONTRECORD.

    --- a/src/backend/access/transam/xlogrecovery.c
    +++ b/src/backend/access/transam/xlogrecovery.c
    @@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
                             LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
                             timestamptz_to_str(xlrec.overwrite_time))));

    +        /* We have safely skipped the aborted record */
    +        abortedRecPtr = InvalidXLogRecPtr;
    +        missingContrecPtr = InvalidXLogRecPtr;
    +
             /* Verifying the record should only happen once */
             record->overwrittenRecPtr = InvalidXLogRecPtr;
         }

The last check in the test against "resetting aborted record" is no
longer useful since it is already checked by
026_overwrite_contrecord.pl.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
> Nice catch! However, I'm not sure I like the patch.
> ...
> So, actually the WAL did not end in an incomplete record. I think
> FinishWalRecovery is the last place to do that. (But it could be
> earlier.)

Thanks for the feedback.

## On the primary, an OVERWRITE_CONTRECORD is written. The aborted
## record (abortedRecPtr) and the insert position of the missing
## contrecord (missingContrecPtr) are set during FinishWalRecovery;
## after recovery, but before writes are accepted, the
## OVERWRITE_CONTRECORD is written.

## on the primary logs
2022-02-25 02:25:16.208 UTC [7397] LOG: redo done at 0/1FFD2B8 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
2022-02-25 02:25:16.208 UTC [7397] DEBUG: resetting unlogged relations: cleanup 0 init 1
2022-02-25 02:25:16.209 UTC [7397] DEBUG: creating and filling new WAL file
2022-02-25 02:25:16.217 UTC [7397] DEBUG: done creating and filling new WAL file
2022-02-25 02:25:16.217 UTC [7397] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-02-25 02:25:16.217 UTC [7397] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1
2022-02-25 02:25:16.217 UTC [7397] DEBUG: OVERWRITE_CONTRECORD inserted at 0/2000028 for aborted record 0/1FFD2E0 <<------- Attached v3 of the patch adds logging, which shows the overwrite record created on the primary
2022-02-25 02:25:16.217 UTC [7395] LOG: checkpoint starting: end-of-recovery immediate wait
2022-02-25 02:25:16.217 UTC [7395] DEBUG: performing replication slot checkpoint
2022-02-25 02:25:16.218 UTC [7395] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2022-02-25 02:25:16.218 UTC [7395] LOG: checkpoint complete: wrote 131 buffers (102.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=11381 kB, estimate=11381 kB
2022-02-25 02:25:16.219 UTC [7394] DEBUG: starting background worker process "logical replication launcher"
2022-02-25 02:25:16.219 UTC [7394] LOG: database system is ready to accept connections

## A downstream standby now skips over this OVERWRITE_CONTRECORD, but
## the value for missingContrecPtr is not invalidated afterwards.

## on the standby logs
2022-02-25 02:25:16.616 UTC [7413] DEBUG: sending hot standby feedback xmin 0 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
2022-02-25 02:25:16.616 UTC [7387] LOG: successfully skipped missing contrecord at 0/1FFD2E0, overwritten at 2022-02-25 02:25:16.2175+00
2022-02-25 02:25:16.616 UTC [7387] CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFD2E0; time 2022-02-25 02:25:16.2175+00

## After promotion, the standby attempts to write the
## overwrite_contrecord again using the missingContrecPtr LSN, which is
## now in the past.
## on the standby logs
2022-02-25 02:25:16.646 UTC [7387] LOG: invalid record length at 0/201EC70: wanted 24, got 0
2022-02-25 02:25:16.646 UTC [7387] LOG: redo done at 0/201EC48 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.55 s
2022-02-25 02:25:16.646 UTC [7387] LOG: last completed transaction was at log time 2022-02-25 02:25:16.301554+00
2022-02-25 02:25:16.646 UTC [7387] DEBUG: resetting unlogged relations: cleanup 0 init 1
2022-02-25 02:25:16.646 UTC [7387] LOG: selected new timeline ID: 2
2022-02-25 02:25:16.646 UTC [7387] DEBUG: updated min recovery point to 0/201EC70 on timeline 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: could not remove file "pg_wal/000000020000000000000002": No such file or directory
2022-02-25 02:25:16.656 UTC [7387] LOG: archive recovery complete
2022-02-25 02:25:16.656 UTC [7387] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: OVERWRITE_CONTRECORD inserted at 0/2000028 for aborted record 0/1FFD2E0 <<------- The same overwrite record is written on the recently promoted standby
2022-02-25 02:25:16.656 UTC [7387] DEBUG: attempting to remove WAL segments newer than log file 000000020000000000000001
2022-02-25 02:25:16.656 UTC [7387] DEBUG: recycled write-ahead log file "000000010000000000000002"
2022-02-25 02:25:16.656 UTC [7387] DEBUG: release all standby locks
2022-02-25 02:25:16.656 UTC [7385] LOG: checkpoint starting: force
2022-02-25 02:25:16.656 UTC [7385] DEBUG: performing replication slot checkpoint
2022-02-25 02:25:16.656 UTC [7385] LOG: request to flush past end of generated WAL; request 0/201EC70, current position 0/2000088
2022-02-25 02:25:16.656 UTC [7385] PANIC: xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088
2022-02-25 02:25:16.657 UTC [7384] LOG: checkpointer process (PID 7385) was terminated by signal 6: Aborted

The purpose of the patch is to check whether the current LSN is beyond
missingContrecPtr and, if so, to skip writing an overwrite contrecord
that was already written by the old writer.

--
Sami Imseih
Amazon Web Services
Attachment
> After some investigation, I finally concluded that we should reset
> abortedRecPtr and missingContrecPtr at processing
> XLOG_OVERWRITE_CONTRECORD.
> ...
> The last check in the test against "resetting aborted record" is no
> longer useful since it is already checked by
> 026_overwrite_contrecord.pl.

+1 for this. Resetting abortedRecPtr and missingContrecPtr after the
broken record is skipped is cleaner. I also think it would make sense
to move the "successfully skipped missing contrecord" message to after
we invalidate the aborted and missing record pointers, like below:

    +++ b/src/backend/access/transam/xlogrecovery.c
    @@ -1948,15 +1948,15 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
                     LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
                     LSN_FORMAT_ARGS(record->overwrittenRecPtr));

    +        /* We have safely skipped the aborted record */
    +        abortedRecPtr = InvalidXLogRecPtr;
    +        missingContrecPtr = InvalidXLogRecPtr;
    +
             ereport(LOG,
                     (errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
                             LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
                             timestamptz_to_str(xlrec.overwrite_time))));

Also, instead of a new test, the current 026_overwrite_contrecord.pl
test can include a promotion test at the end.

Attaching a new patch for review.

--
Sami Imseih
Amazon Web Services
Attachment
I have gone ahead and backpatched this all the way to 10 as well.

--
Sami Imseih
Amazon Web Services
Attachment
- v4-0001-Fix-missing-continuation-record-after-standby-promot-v10.patch
- v4-0001-Fix-missing-continuation-record-after-standby-promot-v11.patch
- v4-0001-Fix-missing-continuation-record-after-standby-promot-v12.patch
- v4-0001-Fix-missing-continuation-record-after-standby-promot-v13.patch
- v4-0001-Fix-missing-continuation-record-after-standby-promot-v14.patch
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
On 2022-Mar-07, Imseih (AWS), Sami wrote:

> I have gone ahead and backpatched this all the way to 10 as well.

Thanks! I pushed this now.

I edited the test though: I don't understand why you went to the
trouble of setting stuff in order to call 'pg_ctl promote' (in
different ways for older branches), when $node_standby->promote does
the same and is simpler to call. So I changed the tests to do that.
(I did verify that without the code fix, the PANIC indeed is thrown.)

Thanks again,

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
We see another occurrence of this bug, with the last patch applied, in 13.7.

After a promotion we observe the following in the logs:

2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commands ignored until end of transaction block

However, the logs do not show "LOG: successfully skipped missing
contrecord", so we know that VerifyOverwriteContrecord is not being
called to invalidate the missingContrecPtr:

    VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec, XLogReaderState *state)
    {
        if (xlrec->overwritten_lsn != state->overwrittenRecPtr)
            elog(FATAL, "mismatching overwritten LSN %X/%X -> %X/%X",
                 (uint32) (xlrec->overwritten_lsn >> 32),
                 (uint32) xlrec->overwritten_lsn,
                 (uint32) (state->overwrittenRecPtr >> 32),
                 (uint32) state->overwrittenRecPtr);

        /* We have safely skipped the aborted record */
        abortedRecPtr = InvalidXLogRecPtr;
        missingContrecPtr = InvalidXLogRecPtr;

        ereport(LOG,
                (errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
                        (uint32) (xlrec->overwritten_lsn >> 32),
                        (uint32) xlrec->overwritten_lsn,
                        timestamptz_to_str(xlrec->overwrite_time))));
    }

Since VerifyOverwriteContrecord is never called, missingContrecPtr
stays set, which explains the behavior we see.

Are there other places where missingContrecPtr should be invalidated,
such as after a successful promotion?

--
Sami Imseih
Amazon Web Services
On Tue, Feb 22, 2022 at 07:20:55PM +0000, Imseih (AWS), Sami wrote:
> The overwrite_contrecord was introduced in 13.5 with
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.
>
> Attached are a patch and a TAP test to handle this condition. The
> patch ensures that an overwrite_contrecord is only created if the
> missingContrecPtr is ahead of the last WAL record.

The test you are introducing to force a complete segment to be filled
is funky, and kind of nice actually while being cheap. This part
particularly makes the test predictable:

    ++unlink $node->basedir . "/pgdata/pg_wal/$endfile"
    ++  or die "could not unlink " . $node->basedir . "/pgdata/pg_wal/$endfile: $!";

I really like that.

> With the patch, running the same TAP test succeeds and no PANIC is
> observed.

This needs a very close lookup; I'll try to check all that unless
somebody beats me to it.

--
Michael
Attachment
On Fri, May 27, 2022 at 08:53:03AM +0900, Michael Paquier wrote:
> This needs a very close lookup; I'll try to check all that unless
> somebody beats me to it.

Please ignore that.. I need more coffee, and likely a break.

--
Michael
Attachment
At Thu, 26 May 2022 19:57:41 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
> We see another occurrence of this bug, with the last patch applied, in 13.7.
>
> After a promotion we observe the following in the logs:
...
> Since VerifyOverwriteContrecord is never called, missingContrecPtr
> stays set, which explains the behavior we see.
>
> Are there other places where missingContrecPtr should be invalidated,
> such as after a successful promotion?

The only cause known to me for EndOfLog being moved to such a location
is missingContrecPtr. But if the next record is not
XLOG_OVERWRITE_CONTRECORD, recovery should have stopped there. And if
XLOG_OVERWRITE_CONTRECORD is there, VerifyOverwriteContrecord should
have been called..

Could you inspect the WAL files of the environment and see whether the
first record of the 7/A8 segment is an OVERWRITE_CONTRECORD? I don't
say that makes some progress on this, but it could be the first step.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
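For readers following along, this kind of inspection can be done with
pg_waldump. The invocation below is illustrative: the pg_wal path and
the timeline number are assumptions, not details taken from the report.

    # Show the first record at or after 7/A8000000; an OVERWRITE_CONTRECORD
    # appears as rmgr XLOG with desc OVERWRITE_CONTRECORD.
    pg_waldump -p pg_wal -t 2 -s 7/A8000000 -n 1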
After further research, we found the following.

Testing on 13.6 with the attached patch, we see that the
missingContrecPtr is being incorrectly set on the standby, and the
promote in the TAP test fails.

Per the comments in xlog.c, the missingContrecPtr should not be set
when in standby mode:

        /*
         * When not in standby mode we find that WAL ends in an incomplete
         * record, keep track of that record. After recovery is done,
         * we'll write a record to indicate to downstream WAL readers that
         * that portion is to be ignored.
         */
        if (!StandbyMode &&
            !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
        {
            abortedRecPtr = xlogreader->abortedRecPtr;
            missingContrecPtr = xlogreader->missingContrecPtr;
            elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
        }

If StandbyModeRequested is checked instead, which checks for the
presence of a standby signal file, the missingContrecPtr is not set on
the standby and the test succeeds:

        if (!StandbyModeRequested &&
            !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
        {
            abortedRecPtr = xlogreader->abortedRecPtr;
            missingContrecPtr = xlogreader->missingContrecPtr;
            elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
        }

If this is a bug, as it appears to be, the original patch to resolve
this issue is not needed and the ideal fix is to ensure that a standby
does not set missingContrecPtr.

Would like to see what others think.

Thanks

--
Sami Imseih
Amazon Web Services
Attachment
At Fri, 27 May 2022 02:01:27 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
> After further research, we found the following.
>
> Testing on 13.6 with the attached patch, we see that the
> missingContrecPtr is being incorrectly set on the standby, and the
> promote in the TAP test fails.
> ...
> If this is a bug, as it appears to be, the original patch to resolve
> this issue is not needed and the ideal fix is to ensure that a standby
> does not set missingContrecPtr.
>
> Would like to see what others think.

Due to lack of information, I don't have a clear idea of what is
happening here. If that change "fixed" the "issue", that seems to mean
that crash recovery passed an immature contrecord (@7/A8000060) but
recovery did not stop there and continued until 10/B1FA3D88.

A possibility is that crash recovery ended at the immature contrecord,
then the server moved to archive recovery mode and was able to continue
from the same (aborted) LSN on the archived WAL. This means the 7/A8
segment in pg_wal had somehow been removed after being archived, or the
7/A7 (not A8) segments in pg_wal and the archive are from different
histories, which seems to me to be exactly what we wanted to prevent
with the XLOG_OVERWRITE_CONTRECORD. Didn't you use the same archive
content for repeated recovery testing?

Also, the fact that the server entered crash recovery means the cluster
had no idea of how far it should recover to reach consistency; that
information is contained in the backup label and control file. That
could happen when a crashed primary is reused as-is as the new standby
of the new primary.

So.. I'd like to hear exactly what you did in the testing.

When standby mode is requested, if crash recovery fails with an
immature contrecord, I think we shouldn't continue recovery. But I'm
not sure we need to explicitly reject that case. Further study is
needed..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
> So.. I'd like to hear exactly what you did in the testing.
> When standby mode is requested, if crash recovery fails with an
> immature contrecord, I think we shouldn't continue recovery. But I'm
> not sure we need to explicitly reject that case. Further study is
> needed..

Here are more details about my findings and testing.

Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff we still see
the issue, with the post-promotion checkpoint resulting in "request to
flush past end of generated WAL":

2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time
2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos 7/A8000060
2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commands ignored until end of transaction block

The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff was to
make sure the standby skips the overwrite contrecord. However, we still
see that "missingContrecPtr" is set on the standby before promotion;
after the instance is promoted, the missingContrecPtr is written to WAL
and the subsequent flush throws a "PANIC: xlog flush request".

To reproduce using TAP tests:

1) Apply the attached patch 0001-Patch_to_repro.patch to the head
   branch. This patch adds logging when an overwrite contrecord is
   created and comments out the invalidation code added in commit
   9d92582abf918215d27659d45a4c9e78bda50aff.

2) Run a TAP test under "test/recovery":
   make check PROVE_TESTS='t/026_overwrite_contrecord.pl'

3) In the logfiles of both the standby and primary instances in the
   TAP test, an overwrite contrecord is created on the primary, which
   is correct, but also on the standby after promotion, which is
   incorrect, as the contrecord was already overwritten on the primary.

simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000

simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request
2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
simseih@88665a22795f recovery %

What we found:

1. missingContrecPtr is set when StandbyMode is false, i.e. only a
   writer should set this value, and a record is then sent downstream.
   But a standby going through crash recovery will always have
   StandbyMode = false, causing missingContrecPtr to be incorrectly
   set.

2. If StandbyModeRequested is checked instead, we ensure that a standby
   will not set a missingContrecPtr.

3. After applying the patch below, the TAP test succeeded:

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5ee90b6..a727aaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	 * we'll write a record to indicate to downstream WAL readers that
 	 * that portion is to be ignored.
 	 */
-	if (!StandbyMode &&
+	if (!StandbyModeRequested &&
 		!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
 	{
 		abortedRecPtr = xlogreader->abortedRecPtr;

So it may be that the best fix is not the one committed in
9d92582abf918215d27659d45a4c9e78bda50aff, but to check
StandbyModeRequested = false before setting missingContrecPtr.

Thank you

--
Sami Imseih
Amazon Web Services
Attachment
Hi,

As an update, we have had a test suite running for the last few days
with constant workload and failovers/promotions. With the proposed
change from Sami, we no longer see PANICs due to this.

The comment indicates that abortedRecPtr and missingContrecPtr should
only be set when the database is a writer, since they are used to write
a new record for downstream readers. !StandbyMode is a poor proxy for
this, as Sami mentioned, since there are cases where a replica going
through crash recovery would then set this record.

Thanks,
John H

On 5/27/22 12:01 PM, Imseih (AWS), Sami wrote:
> Here are more details about my findings and testing.
> ...
> So it may be that the best fix is not the one committed in
> 9d92582abf918215d27659d45a4c9e78bda50aff, but to check
> StandbyModeRequested = false before setting missingContrecPtr.
On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
> What we found:
>
> 1. missingContrecPtr is set when StandbyMode is false, i.e. only a
>    writer should set this value, and a record is then sent
>    downstream. But a standby going through crash recovery will always
>    have StandbyMode = false, causing missingContrecPtr to be
>    incorrectly set.

That stands as true as far as I know. StandbyMode would be switched
only once we get out of crash recovery, and only if archive recovery
completes when there is a restore_command.

> 2. If StandbyModeRequested is checked instead, we ensure that a
>    standby will not set a missingContrecPtr.
>
> 3. After applying the patch below, the TAP test succeeded

Hmm. I have not looked at that in depth, but if the intention is to
check that the database is able to write WAL, looking at
XLogCtl->SharedRecoveryState would be the way to go, because that's the
flip switching between crash recovery, archive recovery, and the end of
recovery (when WAL can be safely written).

The check in xlogrecovery_redo() still looks like a good thing to have
anyway, because we know that we can safely skip the contrecord. Now,
for any patch produced, could the existing TAP test be extended so that
we are able to get a PANIC even if we keep around the sanity check in
xlogrecovery_redo()? That would likely involve an immediate shutdown of
a standby followed by a start sequence?

--
Michael
Attachment
At Mon, 20 Jun 2022 16:13:43 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
> > What we found:
> > ...
>
> That stands as true as far as I know. StandbyMode would be switched
> only once we get out of crash recovery, and only if archive recovery
> completes when there is a restore_command.

Anyway, the change

    -    abortedRecPtr = InvalidXLogRecPtr;
    -    missingContrecPtr = InvalidXLogRecPtr;
    +    //abortedRecPtr = InvalidXLogRecPtr;
    +    //missingContrecPtr = InvalidXLogRecPtr;

is injecting a bug: an invalid "aborted contrecord" record can be
injected under certain conditions. If a bug is intentionally injected,
it's quite natural that some behavior gets broken..

> > 2. If StandbyModeRequested is checked instead, we ensure that a
> >    standby will not set a missingContrecPtr.
> >
> > 3. After applying the patch below, the TAP test succeeded
>
> Hmm. I have not looked at that in depth, but if the intention is to
> check that the database is able to write WAL, looking at
> XLogCtl->SharedRecoveryState would be the way to go, because that's
> the flip switching between crash recovery, archive recovery, and the
> end of recovery (when WAL can be safely written).

What we are checking there is "we are going to write WAL", not "we are
writing". (!StandbyMode && StandbyModeRequested) means the server has
been running crash recovery before starting archive recovery. In that
case, the server is supposed to continue with archived WAL without
inserting a record. However, if no archived WAL is available and the
server is promoted, the server needs to insert an "aborted contrecord"
record again. (I'm not sure how that case happens in the field,
though.) So I don't think !StandbyModeRequested is the right thing
here. Actually, the attached test fails with the fix.

> The check in xlogrecovery_redo() still looks like a good thing to
> have anyway, because we know that we can safely skip the contrecord.
> Now, for any patch produced, could the existing TAP test be extended
> so that we are able to get a PANIC even if we keep around the sanity
> check in xlogrecovery_redo()? That would likely involve an immediate
> shutdown of a standby followed by a start sequence?

Thus, I still don't see what has happened on Imseih's side, but I can
cause a PANIC with a somewhat tricky sequence of steps, which I don't
think is valid. This is why I wanted to know the exact steps that cause
the PANIC.

The first attachment is the PoC of the TAP test (it uses system()..),
and the second is a tentative fix for that. (I don't like the fix
either, though...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
On Mon, Jun 20, 2022 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> What we are checking there is "we are going to write WAL", not "we are
> writing".
>
> (!StandbyMode && StandbyModeRequested) means the server has been
> running crash recovery before starting archive recovery. In that
> case, the server is supposed to continue with archived WAL without
> inserting a record. However, if no archived WAL is available and the
> server is promoted, the server needs to insert an "aborted
> contrecord" record again. (I'm not sure how that case happens in the
> field, though.)
>
> So I don't think !StandbyModeRequested is the right thing here.
> Actually, the attached test fails with the fix.

It seems to me that what we want to do is: if we're about to start
allowing WAL writes, then consider whether to insert an aborted
contrecord record. Now, if we are about to start allowing WAL writes,
we must determine the LSN at which the first such record will be
written. Then there are two possibilities: either that LSN is on an
existing record boundary, or it isn't. In the former case, no aborted
contrecord record is required. In the latter case, we're writing at an
LSN which would have been in the middle of a previous record, except
that the record in question was never finished, or at least we don't
have all of it. So the first WAL we insert at our chosen starting LSN
must be an aborted contrecord record.

I'm not quite sure I understand the nature of the remaining bug here,
but this logic seems quite sketchy to me:

        /*
         * When not in standby mode we find that WAL ends in an incomplete
         * record, keep track of that record. After recovery is done,
         * we'll write a record to indicate to downstream WAL readers that
         * that portion is to be ignored.
         */
        if (!StandbyMode &&
            !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
        {
            abortedRecPtr = xlogreader->abortedRecPtr;
            missingContrecPtr = xlogreader->missingContrecPtr;
        }

I don't see what StandbyMode has to do with anything here. The question
is whether the place where we're going to begin writing WAL is on an
existing record boundary or not. If it so happens that when StandbyMode
is turned on we can never decide to promote in the middle of a record,
then maybe there's no need to track this information when StandbyMode =
true, but I don't see a good reason why we should make that assumption.
What if in the future somebody added a command that says "don't keep
trying to read more WAL, just promote RIGHT HERE"? I think this logic
would surely be incorrect in that case. It feels to me like the right
thing to do is to always keep track of whether WAL ends in an
incomplete record, and then when we promote, write an aborted
contrecord record if WAL ended in an incomplete record, full stop.

Now, we do need to keep in mind that, while in StandbyMode, we might
reach the end of WAL more than once, because we have a polling loop
that looks for more WAL. So it does not work to just set the values
once and then assume that's the whole truth forever. But why not handle
that by storing the abortedRecPtr and missingContrecPtr unconditionally
after every call to XLogPrefetcherReadRecord()? They will go back and
forth between XLogRecPtrIsInvalid and other values many times, but the
last value should -- I think -- be however things ended up at the point
where we decided to stop reading WAL.

I haven't really dived into this issue much, so I might be totally off
base, but it just doesn't feel right to me that this should depend on
whether we're in standby mode. That seems at best incidentally related
to whether an aborted contrecord record is required.

P.S. "aborted contrecord record" is really quite an awkward turn of phrase!

--
Robert Haas
EDB: http://www.enterprisedb.com
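In code form, the boundary test Robert describes comes down to
something like the following (a hypothetical helper; the name, the
signature, and the assumption that the reader reports the
missing-continuation position are illustrative, not existing code):

    /*
     * Hypothetical helper: decide whether the first record written after
     * recovery must be an OVERWRITE_CONTRECORD.  This is the boundary test
     * described above, with no reference to StandbyMode.
     */
    static bool
    NeedOverwriteContrecord(XLogRecPtr firstInsertPtr,
                            XLogRecPtr missingContrecPtr)
    {
        /* Reading stopped on a record boundary: nothing was left unfinished. */
        if (XLogRecPtrIsInvalid(missingContrecPtr))
            return false;

        /*
         * Reading stopped inside a record that was never completed; if we
         * are about to write at that position, the first insert must
         * overwrite the remnant.
         */
        return firstInsertPtr == missingContrecPtr;
    }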
At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> It seems to me that what we want to do is: if we're about to start
> allowing WAL writes, then consider whether to insert an aborted
> contrecord record. ... So the first WAL we insert at our chosen
> starting LSN must be an aborted contrecord record.

Right.

> I'm not quite sure I understand the nature of the remaining bug here,
> but this logic seems quite sketchy to me:
> ...
> I don't see what StandbyMode has to do with anything here. The
> question is whether the place where we're going to begin writing WAL
> is on an existing record boundary or not. If it so happens that when
> StandbyMode is turned on we can never decide to promote in the middle
> of a record, then maybe there's no need to track this information
> when StandbyMode = true, but I don't see a good reason why we should
> make that assumption. What if in the future somebody added a command

Right. I came to the same conclusion before reading this section. It
is rarer than other cases, but surely possible.

> that says "don't keep trying to read more WAL, just promote RIGHT
> HERE"? I think this logic would surely be incorrect in that case. It
> feels to me like the right thing to do is to always keep track of
> whether WAL ends in an incomplete record, and then when we promote,
> write an aborted contrecord record if WAL ended in an incomplete
> record, full stop.

Agreed. Actually, with the second patch applied, removing !StandbyMode
from the condition makes no difference in behavior.

> Now, we do need to keep in mind that, while in StandbyMode, we might
> reach the end of WAL more than once, because we have a polling loop
> that looks for more WAL. So it does not work to just set the values
> once and then assume that's the whole truth forever. But why not
> handle that by storing the abortedRecPtr and missingContrecPtr
> unconditionally after every call to XLogPrefetcherReadRecord()? They
> will go back and forth between XLogRecPtrIsInvalid and other values
> many times, but the last value should -- I think -- be however things
> ended up at the point where we decided to stop reading WAL.

Unfortunately it doesn't work, because we read a record already known
to be complete again at the end of recovery. That is the reason for
"abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch. Without it,
abortedRecPtr is erased when it is actually needed. I don't like that
expedient-looking condition, but the strict condition for resetting
abortedRecPtr is iff "we have read a complete record at the same or
greater LSN than abortedRecPtr"...

Come to think of this, I noticed that we can get rid of the file-global
abortedRecPtr by letting FinishWalRecovery copy abortedRecPtr *before*
reading the last record. This also allows us to remove the "expedient"
condition.

> I haven't really dived into this issue much, so I might be totally
> off base, but it just doesn't feel right to me that this should
> depend on whether we're in standby mode. That seems at best
> incidentally related to whether an aborted contrecord record is
> required.
>
> P.S. "aborted contrecord record" is really quite an awkward turn of phrase!

That's true! Always open for better phrasings:(

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
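The ordering idea reads roughly like this inside FinishWalRecovery()
(an abridged sketch; the field and function names follow xlogrecovery.c,
but treat it as an illustration rather than a patch):

    /*
     * Sketch only: capture the aborted-record state out of the reader
     * *before* the final re-read of the last (complete) record, so the
     * re-read cannot clobber it.  With this, the file-global abortedRecPtr
     * and the "expedient" condition could both go away.
     */
    result->abortedRecPtr = xlogreader->abortedRecPtr;
    result->missingContrecPtr = xlogreader->missingContrecPtr;

    /* Re-fetch the last record; any reset of the reader's fields is now moot. */
    lastRec = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);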
Attachment
On Tue, Jun 21, 2022 at 10:35:33AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
>> says "don't keep trying to read more WAL, just promote RIGHT HERE"?
>> I think this logic would surely be incorrect in that case. It feels
>> to me like the right thing to do is to always keep track of whether
>> WAL ends in an incomplete record, and then when we promote, write an
>> aborted contrecord record if WAL ended in an incomplete record, full
>> stop.

Agreed. The state is maintained in the WAL reader, as far as I
understand the logic behind it.

>> Now, we do need to keep in mind that, while in StandbyMode, we might
>> reach the end of WAL more than once, because we have a polling loop
>> that looks for more WAL. So it does not work to just set the values
>> once and then assume that's the whole truth forever. But why not
>> handle that by storing the abortedRecPtr and missingContrecPtr
>> unconditionally after every call to XLogPrefetcherReadRecord()?
>
> Unfortunately it doesn't work, because we read a record already known
> to be complete again at the end of recovery. ...
>
> Come to think of this, I noticed that we can get rid of the
> file-global abortedRecPtr by letting FinishWalRecovery copy
> abortedRecPtr *before* reading the last record. This also allows us
> to remove the "expedient" condition.

Interesting point, though I am not sure I get why this is safe compared
to the existing solution of setting missingContrecPtr/abortedRecPtr
while reading a set of records when we look for the last LSN at the end
of recovery.

>> I haven't really dived into this issue much, so I might be totally
>> off base, but it just doesn't feel right to me that this should
>> depend on whether we're in standby mode. That seems at best
>> incidentally related to whether an aborted contrecord record is
>> required.
>>
>> P.S. "aborted contrecord record" is really quite an awkward turn of phrase!
>
> That's true! Always open for better phrasings:(

The term that would be appropriate here is continuation record?
contrecord is a bit confusing for French speakers, actually, as
"contre" means "against" ;)

By the way, something that itches me with the proposed patch is that we
don't actually stress the second problem reported, which is that the
use of StandbyMode is incorrect. Isn't that a sign that we'd better
extend the tests of 026_overwrite_contrecord.pl with more
end-of-recovery scenarios? Two things that immediately come to mind are
the use of recovery_target_lsn, which would force a promotion in the
middle of a continuation record, and cascading standbys, to make sure
that these get the extra OVERWRITE_CONTRECORD record generated at the
end of recovery.

--
Michael
Attachment
On Mon, Jun 20, 2022 at 9:35 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Unfortunately it doesn't work, because we read a record already known
> to be complete again at the end of recovery. That is the reason for
> "abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch. Without it,
> abortedRecPtr is erased when it is actually needed. I don't like that
> expedient-looking condition, but the strict condition for resetting
> abortedRecPtr is iff "we have read a complete record at the same or
> greater LSN than abortedRecPtr"...

Yeah, we need to work around that somehow. I noticed the
record-rereading behavior when I was working on some patches for this
release cycle, and I think what we ought to do is get rid of it. It
serves no purpose other than to make things complicated. However, we
shouldn't back-patch a change like that, I think, so we'll need to work
around the issue somehow.

Perhaps it'd be better to return these values to the caller somehow,
and then the caller can decide whether to save them based on context.
The last re-read can choose not to do so.

--
Robert Haas
EDB: http://www.enterprisedb.com
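One possible shape for "return these values to the caller" (entirely
hypothetical types and names; nothing like this exists in the tree)
would be to have the read routine hand back the aborted-record state
instead of stashing it in globals as a side effect:

    /*
     * Hypothetical result type: the caller, not the reader, decides
     * whether to keep the aborted-record state from this particular read.
     */
    typedef struct XLogReadOutcome
    {
        XLogRecord *record;            /* the record read, or NULL on failure */
        XLogRecPtr  abortedRecPtr;     /* invalid unless reading stopped mid-record */
        XLogRecPtr  missingContrecPtr; /* where the missing continuation would start */
    } XLogReadOutcome;

The final re-read of a known-complete record would then simply ignore
the aborted-record fields instead of letting them overwrite saved
state.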
> Thus, I still don't see what has happened on Imseih's side, but I can
> cause a PANIC with a somewhat tricky sequence of steps, which I don't
> think is valid. This is why I wanted to know the exact steps that
> cause the PANIC.
>
> The first attachment is the PoC of the TAP test (it uses system()..),
> and the second is a tentative fix for that. (I don't like the fix
> either, though...)

It has been difficult to get a generic repro, but we reproduce it
through our test suite. To give more details, we are running tests in
which we constantly fail over and promote standbys. The issue surfaces
after we have gone through a few promotions, which occur every few
hours or so (not really important, but to give context).

I am adding some additional debugging to see if I can draw a better
picture of what is happening. I will also give
aborted_contrec_reset_3.patch a go, although I suspect it will not
handle the specific case we are dealing with.

Regards,

Sami Imseih
Amazon Web Services
On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> It has been difficult to get a generic repro, but we reproduce it
> through our test suite. To give more details, we are running tests in
> which we constantly fail over and promote standbys. The issue
> surfaces after we have gone through a few promotions, which occur
> every few hours or so (not really important, but to give context).

Hmm. Could you describe exactly the failover scenario you are using?
Is the test using a set of cascading standbys linked to the promoted
one? Are the standbys recycled from the promoted nodes with pg_rewind,
or created from scratch with a new base backup taken from the
freshly-promoted primary? I have been looking more at this thread
through the day, but I don't see a remaining issue. It could be
perfectly possible that we are missing a piece related to the handling
of those new overwrite contrecords in some cases, like in a rewind.

> I am adding some additional debugging to see if I can draw a better
> picture of what is happening. I will also give
> aborted_contrec_reset_3.patch a go, although I suspect it will not
> handle the specific case we are dealing with.

Yeah, this is not going to change things much if you are still seeing
an issue. This patch does not change the logic; it just simplifies the
tracking of the continuation record data, resetting it when a complete
record has been read. That said, getting rid of the dependency on
StandbyMode because we cannot promote in the middle of a record is nice
(my memories around that were a bit blurry, but even
recovery_target_lsn would not recover in the middle of a continuation
record), and this is not a bug, so there is limited reason to backpatch
this part of the change.

--
Michael
Attachment
At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> Hmm. Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one? Are the standbys recycled from the promoted nodes with
> pg_rewind, or created from scratch with a new base backup taken from
> the freshly-promoted primary? I have been looking more at this thread
> through the day, but I don't see a remaining issue. It could be
> perfectly possible that we are missing a piece related to the
> handling of those new overwrite contrecords in some cases, like in a
> rewind.
>
> Yeah, this is not going to change things much if you are still seeing
> an issue. This patch does not change the logic; it just

True. That is a significant hint on what happened at the time.

- Are there only two hosts in the replication set? I am concerned
  about whether it is a cascading set or not.

- Exactly what are you performing at every failover? Especially, do
  the steps contain pg_rewind, and do you copy pg_wal and/or archive
  files between the failover hosts?

> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read. That said, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of
> a record is nice (my memories around that were a bit blurry, but even
> recovery_target_lsn would not recover in the middle of a continuation
> record), and this is not a bug, so there is limited reason to
> backpatch this part of the change.

Agreed. In the first place, my "repro" (or the test case) is a bit too
intricate to happen in the real field.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
I'd like to look into the WAL segments related to the failure.

Mmm... With the patch, xlogreader->abortedRecPtr is valid only and
always when the last failed record read was an aborted contrec. If
recovery ends here, the first inserted record is an "aborted contrec"
record. I still see that the only chance an aborted contrecord is
followed by a non-"aborted contrec" record is that recovery somehow
fetches two consecutive WAL segments that are inconsistent at the
boundary.

I found the reason the TAP test doesn't respond to the first proposed
patch (the below):

    -        if (!StandbyMode &&
    +        if (!StandbyModeRequested &&
                 !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))

The cause was that I disabled standby mode in the test. The change
affects only while standby mode is on, which was to make the test
reliable and simpler. The first attached detects the same error (in a
somewhat maybe-unstable way) and responds to the fix above, and also
responds to aborted_contrec_reset_3.patch. So, aborted_contrec_reset_3
looks closer to the issue than before.

Would you mind trying the second attached to obtain a detailed log on
your testing environment? With the patch, the modified TAP test yields
log lines like below:

2022-06-28 15:49:20.661 JST [165472] LOG: ### [A] @0/1FFD338: abort=(0/1FFD338)0/0, miss=(0/2000000)0/0, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.681 JST [165472] LOG: ### [F] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=1, SbyModeReq=1
...
2022-06-28 15:49:20.767 JST [165472] LOG: ### [S] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.777 JST [165470] PANIC: xlog flush request 0/2094610 is not satisfied --- flushed only to 0/2000088

In this example, abortedRecPtr is set at the first line, and recovery
continued to 2094610, but abortedRecPtr is not reset, so it PANICed.
([A] means an aborted-contrec failure; [F] and [S] are failed and
succeeded reads, respectively.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
> Would you mind trying the second attached to obtain a detailed log on
> your testing environment? With the patch, the modified TAP test
> yields log lines like below.

Thanks for this. I will apply this to the testing environment and will
share the output.

Regards,

Sami Imseih
Amazon Web Services (AWS)
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Hello,

On 2022-Jun-29, Imseih (AWS), Sami wrote:

> > Would you mind trying the second attached to obtain a detailed log
> > on your testing environment? With the patch, the modified TAP test
> > yields log lines like below.
>
> Thanks for this. I will apply this to the testing environment and
> will share the output.

Any luck with this?

--
Álvaro Herrera        48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
> Any luck with this?

Apologies for the delay, as I have been away. I will test this next
week and report back my findings.

Thanks

Sami Imseih
Amazon Web Services (AWS)
> Would you mind trying the second attached to obtain a detailed log on
> your testing environment? With the patch, the modified TAP test yields
> log lines like the following.

I applied the logging patch to 13.7 (attached is the backport) and reproduced the issue.

I stripped out the relevant parts of the file. Let me know if this is helpful.

postgresql.log.2022-08-05-17:2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [S] @0/48A3400: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
postgresql.log.2022-08-05-17:2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [F] @0/10000000: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
postgresql.log.2022-08-05-17:2022-08-05 17:22:21 UTC::@:[359]:LOG: ### [S] @0/10000060: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:00 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:05 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:06 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:11 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [S] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:50:20 UTC::@:[359]:LOG: ### [A] @6/F3FFFF20: abort=(6/F3FFFF20)0/0, miss=(6/F4000000)0/0, SbyMode=0, SbyModeReq=1
postgresql.log.2022-08-05-18:2022-08-05 18:50:20 UTC::@:[359]:LOG: ### [S] @6/F4000030: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:06 UTC::@:[359]:LOG: ### [F] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:11 UTC::@:[359]:LOG: ### [F] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:13 UTC::@:[359]:LOG: ### [F] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:18 UTC::@:[359]:LOG: ### [F] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:21 UTC::@:[359]:LOG: ### [F] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1
postgresql.log.2022-08-05-20:2022-08-05 20:38:21 UTC::@:[359]:LOG: ### [S] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=0, SbyModeReq=1
....
.......
.........
2022-08-05 20:38:21 UTC::@:[359]:LOG: received promote request
2022-08-05 20:38:21 UTC::@:[359]:LOG: redo done at E/17A78270
2022-08-05 20:38:21 UTC::@:[359]:LOG: last completed transaction was at log time 2022-08-05 20:38:06.398382+00
recovering 000000020000000E00000005
recovering 000000020000000E00000005
archive /rdsdbdata/log/restore/pg-wal-archive.901.* is not yet downloaded, exiting restore script for now
2022-08-05 20:38:21 UTC::@:[359]:LOG: ### [S] @E/17A78298: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=0, SbyModeReq=1
recovering 00000003.history
2022-08-05 20:38:21 UTC::@:[359]:LOG: selected new timeline ID: 3
2022-08-05 20:38:21 UTC::@:[359]:LOG: archive recovery complete
recovering 00000002.history
2022-08-05 20:38:22 UTC::@:[373]:LOG: request to flush past end of generated WAL; request D/C7A425C0, currpos 6/F4000088
2022-08-05 20:38:22 UTC::@:[373]:CONTEXT: writing block 0 of relation base/14301/26678
2022-08-05 20:38:22 UTC::@:[373]:ERROR: xlog flush request D/C7A425C0 is not satisfied --- flushed only to 6/F4000088
2022-08-05 20:38:22 UTC::@:[373]:CONTEXT: writing block 0 of relation base/14301/26678
2022-08-05 20:38:22 UTC::@:[357]:LOG: database system is ready to accept connections
2022-08-05 20:38:23 UTC::@:[373]:LOG: checkpoint starting: immediate force wait wal time
2022-08-05 20:38:23 UTC::@:[373]:LOG: request to flush past end of generated WAL; request D/F5873CF0, currpos 6/F8000060
2022-08-05 20:38:23 UTC::@:[373]:PANIC: xlog flush request D/F5873CF0 is not satisfied --- flushed only to 6/F8000060
2022-08-05 20:38:23 UTC::@:[357]:LOG: checkpointer process (PID 373) was terminated by signal 6: Aborted
2022-08-05 20:38:23 UTC::@:[357]:LOG: terminating any other active server processes
/etc/rds/dbbin/pgscripts/rds_wal_archive: line 19: 14546 Quit (core dumped) usleep 50000
2022-08-05 20:38:23 UTC::@:[375]:FATAL: archive command failed with exit code 131
2022-08-05 20:38:23 UTC::@:[375]:DETAIL: The failed archive command was: /etc/rds/dbbin/pgscripts/rds_wal_archive pg_wal/00000003000000060000003D

Thanks,

Sami Imseih
Amazon Web Services
Attachment
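The LSNs in those messages quantify how far things have diverged: an LSN printed as "hi/lo" is a 64-bit position split into two hexadecimal halves, and the flush request D/F5873CF0 is roughly 28 GiB past the 6/F8000060 that has actually been generated, which matches the insert position having been set back near the stale missingContrecPtr of 6/F4000000 seen above. A throwaway sketch of that arithmetic (not PostgreSQL code, just an illustration of the notation):

/*
 * lsn_gap.c -- throwaway sketch that parses the "hi/lo" LSN notation used
 * in the messages above and prints how far a flush request is beyond the
 * WAL that was actually generated.
 * Build with: cc -o lsn_gap lsn_gap.c
 * Example:    ./lsn_gap D/F5873CF0 6/F8000060
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static uint64_t
parse_lsn(const char *s)
{
    unsigned int hi, lo;

    if (sscanf(s, "%X/%X", &hi, &lo) != 2)
    {
        fprintf(stderr, "cannot parse LSN: %s\n", s);
        exit(1);
    }
    return ((uint64_t) hi << 32) | lo;
}

int
main(int argc, char **argv)
{
    /* defaults taken from the PANIC message quoted above */
    uint64_t    request = parse_lsn(argc > 1 ? argv[1] : "D/F5873CF0");
    uint64_t    flushed = parse_lsn(argc > 2 ? argv[2] : "6/F8000060");

    printf("flush request is %.2f GiB past the end of generated WAL\n",
           (double) (request - flushed) / (1024.0 * 1024.0 * 1024.0));
    return 0;
}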
At Fri, 5 Aug 2022 21:28:16 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in

> > Would you mind trying the second attached to obtain a detailed log on
> > your testing environment? With the patch, the modified TAP test yields
> > log lines like the following.
>
> I applied the logging patch to 13.7 (attached is the backport) and
> reproduced the issue.
>
> I stripped out the relevant parts of the file. Let me know if this is
> helpful.

Thank you very much!

> postgresql.log.2022-08-05-17:2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [F] @0/10000000: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
> postgresql.log.2022-08-05-17:2022-08-05 17:22:21 UTC::@:[359]:LOG: ### [S] @0/10000060: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1

The server seems to have started as a standby after a primary crash. Is that correct?

> postgresql.log.2022-08-05-18:2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
> postgresql.log.2022-08-05-18:2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [S] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1

Archive recovery ended here. The server should have been promoted at that time. Do you see any interesting log lines around this point?

> postgresql.log.2022-08-05-18:2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1

But then recovery continues in non-standby mode. I don't see how it comes to behave that way.

> postgresql.log.2022-08-05-18:2022-08-05 18:50:20 UTC::@:[359]:LOG: ### [A] @6/F3FFFF20: abort=(6/F3FFFF20)0/0, miss=(6/F4000000)0/0, SbyMode=0, SbyModeReq=1
> postgresql.log.2022-08-05-18:2022-08-05 18:50:20 UTC::@:[359]:LOG: ### [S] @6/F4000030: abort=(0/0)6/F3FFFF20, miss=(0/0)6/F4000000, SbyMode=1, SbyModeReq=1

Then archive recovery starts again.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
> The server seems to have started as a standby after a primary crash.
> Is that correct?

Yes, that is correct.

2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was interrupted; last known up at 2022-08-05 17:08:52 UTC
2022-08-05 17:18:51 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"
recovering 00000002.history
2022-08-05 17:18:51 UTC::@:[359]:LOG: entering standby mode
2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [S] @0/48A3400: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: checkpoint record is at 0/48A3388
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: redo record is at 0/48A3388; shutdown true
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: next transaction ID: 533; next OID: 16395
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: oldest unfrozen transaction ID: 479, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: oldest MultiXactId: 1, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: commit timestamp Xid oldest/newest: 0/0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: transaction ID wrap limit is 2147484126, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: starting up replication slots
2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: resetting unlogged relations: cleanup 1 init 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: initializing for hot standby
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: recovery snapshots are now enabled
2022-08-05 17:18:51 UTC::@:[359]:LOG: redo starts at 0/48A3400
2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [F] @0/10000000: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: reached end of WAL in pg_wal, entering archive recovery
2022-08-05 17:18:51 UTC::@:[359]:LOG: consistent recovery state reached at 0/10000000
2022-08-05 17:18:51 UTC::@:[357]:LOG: database system is ready to accept read only connections

> Archive recovery ended here. The server should have been promoted at
> that time. Do you see any interesting log lines around this point?

The server did promote around that time.

2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
2022-08-05 18:38:14 UTC::@:[359]:LOG: received promote request
2022-08-05 18:38:14 UTC::@:[359]:LOG: redo done at 6/B6CB27A8
2022-08-05 18:38:14 UTC::@:[359]:LOG: last completed transaction was at log time 2022-08-05 18:38:00.832047+00
recovering 00000001000000060000002D
....
........
2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [S] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
recovering 00000002.history
2022-08-05 18:38:14 UTC::@:[359]:LOG: selected new timeline ID: 2
2022-08-05 18:38:14 UTC::@:[359]:LOG: archive recovery complete
recovering 00000001.history
2022-08-05 18:38:14 UTC::@:[357]:LOG: database system is ready to accept connections
2022-08-05 18:38:15 UTC::@:[367]:LOG: restartpoint complete: wrote 21388 buffers (2.1%); 0 WAL file(s) added, 9 removed, 0 recycled; write=98.394 s, sync=0.041 s, total=98.586 s; sync files=46, longest=0.012 s, average=0.001 s; distance=1048565 kB, estimate=1048584 kB
2022-08-05 18:38:15 UTC::@:[367]:LOG: recovery restart point at 6/5C0003B0

> But then recovery continues in non-standby mode. I don't see how it
> comes to behave that way.

But the server crashed sometime after that, which is why recovery starts again.

2022-08-05 18:50:13 UTC::@:[357]:LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG: listening on IPv6 address "::", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was interrupted; last known up at 2022-08-05 18:38:15 UTC
2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"
recovering 00000003.history
2022-08-05 18:50:13 UTC::@:[359]:LOG: entering standby mode
recovering 00000002.history
2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
2022-08-05 18:50:13 UTC::@:[359]:LOG: redo starts at 6/B80000E8

And a few hours later is when we see the PANIC.

Thanks

--
Sami Imseih
Amazon Web Services (AWS)
Hello.

> Yes, that is correct.

Mmm. I had believed that the log came from a single server run, since the PID (I believe [359] and [357] are PIDs) did not change through the log lines.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"

This means that someone removed the contents of the pg_wal directory. Removing some WAL files from pg_wal may lead to this symptom.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: entering standby mode
> recovering 00000002.history
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: redo starts at 6/B80000E8
>
> And a few hours later is when we see the PANIC.

So it seems that the *standby* received the inconsistent WAL stream (an aborted contrecord not followed by an overwriting missing-contrecord record) from the primary. Thus the inconsistency happened on the primary, not on the standby.

So... I'm still failing to draw the big picture of what is happening here. Could you show us the server configuration (the database servers involved and their roles (primary/standby)), and the exact steps taken when the server is restarted after a crash?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> So it seems that the *standby* received the inconsistent WAL stream
> (an aborted contrecord not followed by an overwriting missing-contrecord
> record) from the primary. Thus the inconsistency happened on the
> primary, not on the standby.
>
> So... I'm still failing to draw the big picture of what is happening
> here.

For the benefit of anyone who may be looking at this thread in the archive later, I believe this commit will have fixed this issue:

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Aug 29, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > So it seems that the *standby* received the inconsistent WAL stream
> > (an aborted contrecord not followed by an overwriting missing-contrecord
> > record) from the primary. Thus the inconsistency happened on the
> > primary, not on the standby.
> >
> > So... I'm still failing to draw the big picture of what is happening
> > here.
>
> For the benefit of anyone who may be looking at this thread in the
> archive later, I believe this commit will have fixed this issue:

Err, this commit here, I mean:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6672d7913929b82ec723a54381773d9cdc20fe9d

--
Robert Haas
EDB: http://www.enterprisedb.com
> > For the benefit of anyone who may be looking at this thread in the
> > archive later, I believe this commit will have fixed this issue:

I can also confirm that the commit
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6672d79
does indeed fix this issue.

Thanks!

--
Sami Imseih
Amazon Web Services (AWS)