Re: Standby got invalid primary checkpoint after crashed right after promoted. - Mailing list pgsql-hackers

From hao harry
Subject Re: Standby got invalid primary checkpoint after crashed right after promoted.
Date
Msg-id C8AD8B0B-7914-4D4A-96C3-B8CF724C51C2@outlook.com
In response to Standby got invalid primary checkpoint after crashed right after promoted.  (hao harry <harry-hao@outlook.com>)
Responses Re: Standby got invalid primary checkpoint after crashed right after promoted.  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
I found that this issue is a duplicate of [1]. After applying that patch, I can no longer reproduce it.


On March 16, 2022, at 3:16 PM, hao harry <harry-hao@outlook.com> wrote:

Hi, pgsql-hackers,

I think I found a case where the database is not recoverable. Would you please take a look?

Here is how it happens:

- Set up a primary and a standby.
- Run a lot of INSERTs on the primary.
- Create a checkpoint on the primary.
- Wait until the standby starts a restart point; syncing buffers takes about 3 minutes to complete.
- Before the restart point updates ControlFile, promote the standby. Promotion changes
 ControlFile->state to DB_IN_PRODUCTION, which makes the restart point skip its ControlFile
 update, leaving ControlFile->checkPoint pointing to a removed file.
- Before the promoted standby requests the post-recovery checkpoint (fast promotion),
 one backend crashes. The crash kills the other server processes, so the post-recovery checkpoint is skipped.
- The database restarts the startup process, which reports: "could not locate a valid checkpoint record"
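
The sequence above can be sketched as a small toy model. This is not PostgreSQL code: the class ToyCluster, its fields, and the segment numbers are hypothetical, and the only behavior modeled is what the steps describe (the restart point skips its ControlFile update once the state is DB_IN_PRODUCTION, while the old WAL file is still removed).

```python
from dataclasses import dataclass, field

# State names borrowed from the report; everything else here is a hypothetical toy model.
DB_IN_ARCHIVE_RECOVERY = "DB_IN_ARCHIVE_RECOVERY"
DB_IN_PRODUCTION = "DB_IN_PRODUCTION"


@dataclass
class ToyCluster:
    state: str = DB_IN_ARCHIVE_RECOVERY
    # WAL segment that holds the checkpoint record named in the control file.
    control_checkpoint: int = 1
    wal_segments: set = field(default_factory=lambda: {1, 2, 3})

    def promote(self) -> None:
        # Promotion flips the control file state.
        self.state = DB_IN_PRODUCTION

    def finish_restart_point(self, new_checkpoint: int) -> None:
        # As described in the report: the control file update is skipped
        # when the server is no longer in recovery.
        if self.state == DB_IN_ARCHIVE_RECOVERY:
            self.control_checkpoint = new_checkpoint
        # Assumption drawn from the report: segments older than the new
        # checkpoint are removed either way.
        self.wal_segments = {s for s in self.wal_segments if s >= new_checkpoint}

    def can_locate_checkpoint(self) -> bool:
        # Crash recovery needs the segment the control file points to.
        return self.control_checkpoint in self.wal_segments


def restart_after_crash(promote_during_restart_point: bool) -> bool:
    cluster = ToyCluster()
    if promote_during_restart_point:
        cluster.promote()
    cluster.finish_restart_point(new_checkpoint=3)
    # A backend crashes here, before any post-recovery checkpoint runs,
    # so nothing else repairs the control file.
    return cluster.can_locate_checkpoint()


if __name__ == "__main__":
    print("no promotion during restart point:", restart_after_crash(False))
    print("promotion during restart point:   ", restart_after_crash(True))
```

In this model the restart succeeds only when promotion does not happen while the restart point is in progress, which matches the failure window described in the steps.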

I attached a test to reproduce it. It does not fail every time; for me it fails about once in every 10 runs.
To increase the chance that CreateRestartPoint skips the ControlFile update, and to simulate a crash,
patch 0001 is needed.

Best regards,

Harry Hao

<0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch><reprod_crash_right_after_promoted.pl>
