Hi, pgsql-hackers,
I think I have found a case where the database is not recoverable. Could you please take a look?
Here is how it happens:
- set up a primary/standby pair
- do a lot of INSERTs on the primary
- create a checkpoint on the primary
- wait until the standby starts a restartpoint; it takes about 3 minutes to finish syncing buffers
- before the restartpoint updates ControlFile, promote the standby; this changes
ControlFile->state to DB_IN_PRODUCTION, which makes the restartpoint skip its update to
ControlFile, leaving ControlFile->checkPoint pointing to a removed file
- before the promoted standby requests the post-recovery checkpoint (fast promotion),
one backend crashes; this kills the other server processes, so the post-recovery checkpoint is skipped
- the database restarts the startup process, which reports: "could not locate a valid checkpoint record"
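
To make the sequence above easier to follow, here is a toy model of it in Python. It is only a sketch of the steps as I described them, not PostgreSQL code: the names Standby, create_restartpoint and can_locate_checkpoint are made up for illustration, WAL files are reduced to integer segment numbers, and the assumption that old WAL is removed even when the ControlFile update is skipped comes from the steps above.

```python
from dataclasses import dataclass

DB_IN_ARCHIVE_RECOVERY = "DB_IN_ARCHIVE_RECOVERY"
DB_IN_PRODUCTION = "DB_IN_PRODUCTION"


@dataclass
class ControlFile:
    state: str
    checkpoint_segment: int  # stands in for ControlFile->checkPoint


@dataclass
class Standby:
    control: ControlFile
    wal_segments: set  # WAL segments still present on disk (hypothetical model)


def create_restartpoint(node, new_segment, promoted_during_sync):
    """Toy restartpoint: sync buffers, maybe update ControlFile, remove old WAL."""
    # The buffer sync takes a long time; a promotion can happen in this window.
    if promoted_during_sync:
        node.control.state = DB_IN_PRODUCTION

    # The ControlFile update is skipped once the state has left archive recovery.
    if node.control.state == DB_IN_ARCHIVE_RECOVERY:
        node.control.checkpoint_segment = new_segment

    # Assumption from the report: old WAL is removed regardless of the skip.
    node.wal_segments = {s for s in node.wal_segments if s >= new_segment}


def can_locate_checkpoint(node):
    """After a crash, startup needs the segment ControlFile points to."""
    return node.control.checkpoint_segment in node.wal_segments


if __name__ == "__main__":
    normal = Standby(ControlFile(DB_IN_ARCHIVE_RECOVERY, 1), {1, 2, 3})
    create_restartpoint(normal, 3, promoted_during_sync=False)
    print(can_locate_checkpoint(normal))  # True

    raced = Standby(ControlFile(DB_IN_ARCHIVE_RECOVERY, 1), {1, 2, 3})
    create_restartpoint(raced, 3, promoted_during_sync=True)
    # The post-recovery checkpoint never runs because of the backend crash,
    # so ControlFile still points at segment 1, which is gone.
    print(can_locate_checkpoint(raced))  # False
```

In the second case nothing repairs ControlFile before the crash restart, which corresponds to the "could not locate a valid checkpoint record" failure.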
I have attached a test to reproduce it. It does not fail every time; for me it fails roughly once in 10 runs.
To increase the chance that CreateRestartPoint skips the ControlFile update, and to simulate a crash,
patch 0001 is needed.
Best regards,
Harry Hao