Re: [BUG] Panic due to incorrect missingContrecPtr after promotion - Mailing list pgsql-hackers
From | Hsu, John |
---|---|
Subject | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion |
Date | |
Msg-id | 0071e7fe-b1a9-e38c-5706-657913fde1fb@amazon.com Whole thread Raw |
In response to | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion ("Imseih (AWS), Sami" <simseih@amazon.com>) |
List | pgsql-hackers |
Hi, As an update we have a test suite running for the last few days with constant workload and failovers/promotions. With the proposed change from Sami we no longer see PANICs due to this. The comment indicates that the abortedRecPtr and missingContRecPtr should only be set when the database is a writer since it's used to write a new record to downstream readers. !StandbyMode is a poor proxy for this as Sami mentioned since there's cases when a replica is going through crash recovery that would then set this record. Thanks, John H On 5/27/22 12:01 PM, Imseih (AWS), Sami wrote: >> So.. I'd like to hear exactly what you did as the testing. >> When standby mode is requested, if crash recovery fails with immature >> contrecord, I think we shouldn't continue recovery. But I'm not sure >> we need to explictly reject that case. Further study is needed.. > > Here is more details about my findings and testing. > > Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff > we still see the issue with post promotion checkpoint > resulting in "request to flush past end of generated WAL;" > > i.e. > > 2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time > 2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos 7/A8000060 > 2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060 > 2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commandsignored until end of transaction block > > The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff > was to make sure the standby skips the overwrite contrecord. > > However, we still see "missingContrecPtr" is being set > on the standby before promotion and after the instance is > promoted, the missingContrecPtr is written to WAL > and the subsequent flush throws a "PANIC: xlog flush request" > > To Reproduce using TAP tests; > > 1) apply the attached patch 0001-Patch_to_repro.patch to the head branch. > This patch adds logging when an OverWriteContRecord > is created and comments out the invalidation code > added in commit 9d92582abf918215d27659d45a4c9e78bda50aff > > 2) Run a tap test under "test/recovery" > make check PROVE_TESTS='t/026_overwrite_contrecord.pl' > > 3) What is observed in the logfiles for both the standby > and primary instance in the tap test is > that an overwrite contrecord is created on the primary, > which is correct, but also on the standby after promotion, > which is incorrect. This is incorrect as the contrecord > > simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo' > 2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000 > > simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo' > 2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request > 2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000 > simseih@88665a22795f recovery % > > What we found: > > 1. missingContrecPtr is set when > StandbyMode is false, therefore > only a writer should set this value > and a record is then sent downstream. > > But a standby going through crash > recovery will always have StandbyMode = false, > causing the missingContrecPtr to be incorrectly > set. > > 2. If StandbyModeRequested is checked instead, > we ensure that a standby will not set a > missingContrecPtr. > > 3. After applying the patch below, the tap test succeeded > > diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c > index 5ee90b6..a727aaf 100644 > --- a/src/backend/access/transam/xlogrecovery.c > +++ b/src/backend/access/transam/xlogrecovery.c > @@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, > * we'll write a record to indicate to downstream WAL readers that > * that portion is to be ignored. > */ > - if (!StandbyMode && > + if (!StandbyModeRequested && > !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr)) > { > abortedRecPtr = xlogreader->abortedRecPtr; > > So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff > but to check StandbyModeRequested = false before setting > missingContrecPtr. > > Thank you > --- > Sami Imseih > Amazon Web Services >
pgsql-hackers by date: