[BUG] Panic due to incorrect missingContrecPtr after promotion - Mailing list pgsql-hackers

From	Imseih (AWS), Sami
Subject	[BUG] Panic due to incorrect missingContrecPtr after promotion
Date	February 22, 2022 22:20:55
Msg-id	44D259DE-7542-49C4-8A52-2AB01534DCA9@amazon.com
List	pgsql-hackers

On PostgreSQL 13.5, a WAL flush PANIC is encountered after a standby is promoted.

Debugging showed that when a standby skips a missing continuation record during recovery, missingContrecPtr is not invalidated after the record is skipped. As a result, when the standby is promoted to primary, it writes an overwrite_contrecord with an LSN of the missingContrecPtr, which is now in the past. At flush time, this causes a PANIC. From what I can see, this failure scenario can only occur after a standby is promoted.
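To make the sequence concrete, here is a toy model in C of the stale-pointer condition described above. It is only a sketch and not PostgreSQL source: the struct, the function names, and the `invalidate` flag are hypothetical, and the LSN type is a simplified stand-in defined locally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for an LSN type; not the real PostgreSQL headers. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical, reduced recovery state, for illustration only. */
typedef struct ToyRecoveryState
{
    XLogRecPtr missingContrecPtr; /* where a continuation record was expected */
    XLogRecPtr endOfLog;          /* end of the last WAL record replayed */
} ToyRecoveryState;

/*
 * Model the standby skipping the missing contrecord and replaying on to
 * new_end. With invalidate == false the pointer is left set, which is the
 * behaviour reported in this message.
 */
void
toy_skip_missing_contrecord(ToyRecoveryState *st, XLogRecPtr new_end,
                            bool invalidate)
{
    assert(new_end >= st->endOfLog);    /* replay only moves forward */
    st->endOfLog = new_end;
    if (invalidate)
        st->missingContrecPtr = InvalidXLogRecPtr;
}

/*
 * At promotion: true if an overwrite record would be written at an LSN
 * that already lies behind the end of replayed WAL.
 */
bool
toy_promotion_writes_stale_overwrite(const ToyRecoveryState *st)
{
    return st->missingContrecPtr != InvalidXLogRecPtr &&
           st->missingContrecPtr < st->endOfLog;
}
```

In this model, a state that is advanced with `invalidate` set to false still reports a stale overwrite at promotion, while the same state advanced with `invalidate` set to true does not.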

The overwrite_contrecord record was introduced in 13.5 by commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.

Attached are a patch and a TAP test to handle this condition. The patch ensures that an overwrite_contrecord is only created if the missingContrecPtr is ahead of the last WAL record.
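The condition the patch enforces can be sketched roughly as follows. This is an illustrative C fragment, not the attached patch: the helper name and its parameters are hypothetical, the LSN type is a local stand-in, and the strict `>` comparison is an assumption about what "ahead of the last WAL record" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for an LSN type; not the real PostgreSQL headers. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: decide at promotion whether an overwrite_contrecord
 * should be written. lastRecEnd is the end of the last WAL record replayed.
 */
bool
toy_should_write_overwrite(XLogRecPtr missingContrecPtr, XLogRecPtr lastRecEnd)
{
    assert(lastRecEnd != InvalidXLogRecPtr);    /* must be a real position */

    /* Nothing was recorded as missing, so there is nothing to overwrite. */
    if (missingContrecPtr == InvalidXLogRecPtr)
        return false;

    /* Assumed guard: only write when the pointer is not already in the past. */
    return missingContrecPtr > lastRecEnd;
}
```

With this guard, a pointer left over from an already-skipped contrecord sits behind the last replayed record and is ignored, while a pointer beyond the last valid record still triggers the overwrite.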

To reproduce:

Run the new TAP test recovery/t/029_overwrite_contrecord_promotion.pl without the attached patch. The standby log then shows:

2022-02-22 18:38:15.526 UTC [31138] LOG: started streaming WAL from primary at 0/2000000 on timeline 1

2022-02-22 18:38:15.535 UTC [31105] LOG: successfully skipped missing contrecord at 0/1FFC620, overwritten at 2022-02-22 18:38:15.136482+00

2022-02-22 18:38:15.535 UTC [31105] CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC620; time 2022-02-22 18:38:15.136482+00

…

2022-02-22 18:38:15.575 UTC [31103] PANIC: xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088

2022-02-22 18:38:15.575 UTC [31101] LOG: checkpointer process (PID 31103) was terminated by signal 6: Aborted

…

With the patch, the same TAP test succeeds and no PANIC is observed.

Thanks

Sami Imseih

Amazon Web Services

Attachment

0001-Fix-missing-continuation-record-after-standby-promot.patch
