On Mon, 2023-10-16 at 09:26 -0700, David G. Johnston wrote:
> This email is a first pass at a user-visible design for how our backup and restore
> process, as enabled by the Low Level API, can be modified to make it more mistake-proof.
> In short, it requires pg_start_backup to further expand upon what it means for the
> system to be in the midst of a backup, pg_stop_backup to reverse those things,
> and modifying the startup process to deal with the server having crashed while the
> system is in that backup state. Notes at the end extend the design to handle concurrent backups.
>
> The core functional changes are:
> 1) pg_backup_start modifies a newly added "in backup" state flag in pg_control to on.
> 2) pg_backup_stop modifies that flag back to off.
> 3) postmaster will refuse to start if that flag is on, unless one of:
>    a) crash.signal exists in the data directory
>    b) recovery.signal exists in the data directory
>    c) standby.signal exists in the data directory
> 4) Signal file processing causes the in-backup flag in pg_control to be set to off
>
> The newly added crash.signal file is required to handle the case where the server
> crashes after pg_backup_start and before pg_backup_stop. It initiates a crash recovery
> of the instance just as is done today but with the added change of flipping the flag
> to off when recovery is complete just before going live.
I see a couple of problems and/or things that need clarification with that idea:
- Two backups can run concurrently. How do you reconcile that with the "in backup"
  flag and crash.signal?

- I guess crash.signal is created during pg_start_backup(). So that file will be
  included in the backup. How do you handle that during recovery? Ignore it if
  another signal file is present? And if the user forgets to create a signal file
  for recovery, how do you prevent PostgreSQL from performing crash recovery?
crash.signal is created in the pg_backup_metadata directory, not the root directory. Should the server crash while any backup is in progress, pg_control will reflect that fact: in_backup=true will still be set, since in_backup=false is only restored after all backups have completed. The server will then not restart without user intervention, specifically placing a crash.signal file in the root directory, for example by moving it there from one of the pg_backup_metadata subdirectories. As there is nothing special about the crash.signal files in the pg_backup_metadata subdirectories, "touch crash.signal" in the root directory could be used instead.
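To make the intended startup behaviour concrete, here is a minimal sketch in Python. It is not PostgreSQL code; it only models the rule described above under my reading of it. The function name startup_decision and the in_backup parameter (standing in for the proposed pg_control flag) are made up for illustration. The point it shows is that only signal files in the data directory root count, so a crash.signal sitting in a pg_backup_metadata subdirectory does not let the server start.

```python
from pathlib import Path

# Signal files that, per the proposal, permit startup while in_backup is on.
SIGNAL_FILES = ("crash.signal", "recovery.signal", "standby.signal")


def startup_decision(data_dir, in_backup):
    """Model of the proposed postmaster check (hypothetical helper).

    data_dir  -- path to the data directory root
    in_backup -- stand-in for the proposed "in backup" flag in pg_control

    Returns (allowed, found): whether startup may proceed, and which
    signal files were found directly in the data directory root.
    """
    root = Path(data_dir)
    # Only the root is inspected; pg_backup_metadata subdirectories are ignored.
    found = [name for name in SIGNAL_FILES if (root / name).is_file()]
    if not in_backup:
        # Flag is off: nothing in this proposal blocks startup.
        return True, found
    # Flag is on: refuse unless the user supplied one of the signal files.
    return bool(found), found
```

With in_backup on and crash.signal present only under pg_backup_metadata, this returns (False, []); after "touch crash.signal" in the root it returns (True, ["crash.signal"]). Which signal file takes precedence when several are present is left open here, as it is in the proposal.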
The backed-up pg_control file will have in_backup=true. (I haven't pondered the torn-read dynamics of this; I'm supposing that placing a copy of pg_control into the pg_backup_metadata directory might be part of solving that problem.)