Re: [BUG] Panic due to incorrect missingContrecPtr after promotion - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date
Msg-id 202202222016.3nar64wc7xs7@alvherre.pgsql
Whole thread Raw
In response to [BUG] Panic due to incorrect missingContrecPtr after promotion  ("Imseih (AWS), Sami" <simseih@amazon.com>)
Responses Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  ("Imseih (AWS), Sami" <simseih@amazon.com>)
List pgsql-hackers
On 2022-Feb-22, Imseih (AWS), Sami wrote:

> On 13.5 a wal flush PANIC is encountered after a standby is promoted.
> 
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped. Therefore, when the standby
> is promoted to a primary it writes an overwrite_contrecord with an LSN
> of the missingContrecPtr, which is now in the past. On flush time,
> this causes a PANIC. From what I can see, this failure scenario can
> only occur after a standby is promoted.

Ooh, nice find and diagnosys.  I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.

I attach the same patch, with the test file put in its final place
rather than as a patch.  Due to recent xlog.c changes this need a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Next
From: Andres Freund
Date:
Subject: Re: bailing out in tap tests nearly always a bad idea