Detecting some cases of missing backup_label - Mailing list pgsql-hackers

From: Andres Freund
Subject: Detecting some cases of missing backup_label
Msg-id: 20231130205605.slaaw2ny5sjmukn3@awork3.anarazel.de
List: pgsql-hackers

Hi,

I recently mentioned to Robert (and earlier to Heikki) that I think I see a
way to detect an omitted backup_label in a relevant subset of cases (it'd
apply to pg_control as well, if we moved to that).  Robert encouraged me
to share the idea, even though it does not provide complete protection.


The subset I think we can address is the following:

a) An omitted backup_label would lead to corruption, i.e. without the
   backup_label we won't start recovery at the right position. Obviously it'd
   be better to also catch a wrong procedure when it'd not cause corruption -
   perhaps my idea can be extended to handle that, with a small bit of
   overhead.

b) The backup has been taken from a primary. Unfortunately backups taken from
   a standby probably can't be covered - but the vast majority of backups are
   taken from a primary, so I think it's still a worthwhile protection.


Here's my approach:

1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
   primary, emitted just *after* the checkpoint completed

2) When replaying a base backup start record, we create a state file that
   includes the corresponding LSN in the filename

3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
   that time. Instead the state file is created during pg_backup_stop().

4) When replaying a XLOG_BACKUP_END record, we verify that the state file
   created by XLOG_BACKUP_START is present, and error out if not.  Backups
   that started before the redo LSN from backup_label are ignored
   (necessitates remembering that LSN, but we've been discussing that anyway).
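
To make "a state file that includes the corresponding LSN in the filename"
concrete, here is a minimal sketch. The only part taken from PostgreSQL is the
usual way of printing an LSN as two hex halves; the function names and the
"backup_state." prefix are hypothetical, chosen purely for illustration.

```python
def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN as high/low 32-bit halves in hex, e.g. 16/B374D848."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def backup_state_filename(start_lsn: int) -> str:
    """Hypothetical state file name for the backup start record at start_lsn.

    Fixed-width hex means the names sort in LSN order, which would make it
    easy to find and remove files belonging to old backups.
    """
    return f"backup_state.{start_lsn >> 32:08X}{start_lsn & 0xFFFFFFFF:08X}"
```

For example, a start record at LSN 16/B374D848 would map to a file named
backup_state.00000016B374D848 under this assumed scheme.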


Because the backup state file on the primary is only created during
pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
and pg_backup_stop() does *not* contain the corresponding "backup state
file". Because of this, an omitted backup_label is detected if recovery does
not start early enough - recovery won't encounter the XLOG_BACKUP_START record
and thus would not create the state file, leading to an error in 4).

It is not a problem that the primary does not create the state file before
pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
sequence differs between normal processing and recovery, but I think that's
nothing a good comment couldn't explain.
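
The replay-side behaviour described above can be modelled in a few lines. This
is a toy sketch, not PostgreSQL code: the Record type, replay() and the integer
LSNs are made up for illustration, and it assumes that the XLOG_BACKUP_END
record identifies the LSN of its matching XLOG_BACKUP_START record. A copied
data directory starts with no state files, because the primary only creates
them in pg_backup_stop().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    lsn: int
    kind: str                        # "CHECKPOINT", "BACKUP_START", "BACKUP_END"
    start_lsn: Optional[int] = None  # BACKUP_END only: LSN of matching start


class MissingBackupLabel(Exception):
    """Raised when a backup end record has no matching state file."""


def replay(wal, begin_lsn, label_redo_lsn=None):
    """Replay records at or after begin_lsn; return the state files created."""
    state_files = set()  # the copied data directory contains none
    for rec in wal:
        if rec.lsn < begin_lsn:
            continue
        if rec.kind == "BACKUP_START":
            state_files.add(rec.lsn)  # step 2: create the state file
        elif rec.kind == "BACKUP_END":
            # Backups started before the backup_label redo LSN are ignored.
            if label_redo_lsn is not None and rec.start_lsn < label_redo_lsn:
                continue
            if rec.start_lsn not in state_files:  # step 4: verify
                raise MissingBackupLabel(
                    f"no state file for backup started at {rec.start_lsn}")
    return state_files


# The backup's own checkpoint is at 100; a later checkpoint at 200 is what
# the copied pg_control points to.
WAL = [
    Record(100, "CHECKPOINT"),
    Record(110, "BACKUP_START"),
    Record(200, "CHECKPOINT"),
    Record(250, "BACKUP_END", start_lsn=110),
]
```

With backup_label present, replay(WAL, 100, label_redo_lsn=100) sees the start
record and passes the check. Without it, recovery starts from the later
checkpoint, so replay(WAL, 200) never sees the start record and raises at the
end record. If the primary crashed before pg_backup_stop(), the end record is
absent (WAL[:-1]) and replay(WAL[:-1], 200) completes without error.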


I haven't worked out the details, but I think we might be able to extend this
to catch errors even if there is no checkpoint during the base backup, by
emitting the WAL record *before* the RequestCheckpoint(), and creating the
corresponding state file during backup_label processing at the start of
recovery.  That'd probably make the logic for when we can remove the backup
state files a bit more complicated, but I think we could deal with that.


Comments? Swear words?

Greetings,

Andres Freund


