Improving Physical Backup/Restore within the Low Level API - Mailing list pgsql-hackers

From David G. Johnston
Subject Improving Physical Backup/Restore within the Low Level API
Msg-id CAKFQuwbpz4s8XP_+Khsif2eFaC78wpTbNbevUYBmjq-UCeNL7Q@mail.gmail.com
List pgsql-hackers
Hi!

This email is a first pass at a user-visible design for how our backup and restore process, as enabled by the Low Level API, can be modified to make it more mistake-proof.  In short, it requires pg_backup_start to further expand upon what it means for the system to be in the midst of a backup, pg_backup_stop to reverse those things, and the startup process to be modified to deal with the server having crashed while the system is in that backup state.  Notes at the end extend the design to handle concurrent backups.

The core functional changes are:
1) pg_backup_start sets a newly added "in backup" state flag in pg_control to on.
2) pg_backup_stop sets that flag back to off.
3) postmaster will refuse to start if that flag is on, unless one of:
  a) crash.signal exists in the data directory
  b) recovery.signal exists in the data directory
  c) standby.signal exists in the data directory
4) Signal file processing causes the in-backup flag in pg_control to be set to off
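
To make rules 1-4 concrete, here is a minimal sketch of the startup decision.  It is only a model of the proposal, not server code: the function names are made up for illustration, and the data directory is represented as a plain set of file names.

```python
# Hypothetical model of the proposed startup rule; not PostgreSQL code.
SIGNAL_FILES = ("crash.signal", "recovery.signal", "standby.signal")


def has_signal_file(data_dir_files):
    """True if any of the three signal files is present in the data directory."""
    return any(name in data_dir_files for name in SIGNAL_FILES)


def startup_allowed(in_backup, data_dir_files):
    """Rule 3: refuse to start while the flag is on unless a signal file exists."""
    if not in_backup:
        return True
    return has_signal_file(data_dir_files)


def flag_after_signal_processing(in_backup, data_dir_files):
    """Rule 4: processing any signal file leaves the in-backup flag off."""
    if has_signal_file(data_dir_files):
        return False
    return in_backup
```

For example, startup_allowed(True, {"postgresql.conf"}) is False, while startup_allowed(True, {"crash.signal"}) is True.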

The newly added crash.signal file is required to handle the case where the server crashes after pg_backup_start and before pg_backup_stop.  It initiates a crash recovery of the instance just as is done today but with the added change of flipping the flag to off when recovery is complete just before going live.

The error message for the failed startup while in backup will tell the DBA that one of the three signal files must exist.
When processing recovery.signal or standby.signal, the presence of both the backup_label and tablespace_map files is mandatory, and the system will also fail to start should either be missing.
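
A rough sketch of the two failure cases just described, again as a standalone model.  The function name and the error wording are placeholders I invented, and I am assuming the backup_label/tablespace_map check applies whenever recovery.signal or standby.signal is present, regardless of the flag.

```python
# Hypothetical validation sketch; message text and names are illustrative only.
REQUIRED_FOR_RESTORE = ("backup_label", "tablespace_map")


def check_startup(in_backup, data_dir_files):
    """Return None if startup may proceed, otherwise an error string."""
    files = set(data_dir_files)
    if "recovery.signal" in files or "standby.signal" in files:
        # Restoring from a backup: both metadata files must be present.
        missing = [name for name in REQUIRED_FOR_RESTORE if name not in files]
        if missing:
            return "missing required file(s): " + ", ".join(missing)
        return None
    if in_backup and "crash.signal" not in files:
        # Flag is on and no signal file was supplied.
        return ("cluster is marked as in backup; one of crash.signal, "
                "recovery.signal, or standby.signal must exist")
    return None
```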

For non-functional changes I would also suggest doing the following:
pg_backup_start will create a "pg_backup_metadata" directory if there is not already one, or will empty it if there is.
pg_backup_start will create a crash.signal file in that directory.
pg_backup_stop will create these files within pg_backup_metadata upon its completion:
  backup_label
  tablespace_map
  recovery.signal
  standby.signal
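
The directory lifecycle above could look roughly like this.  It is a sketch against a scratch directory: backup_start, backup_stop, and demo are hypothetical stand-ins for the SQL functions, and the file contents passed in are dummies.

```python
import shutil
import tempfile
from pathlib import Path

METADATA_DIR = "pg_backup_metadata"


def backup_start(data_dir):
    """Create pg_backup_metadata (emptying any previous one) and add crash.signal."""
    meta = Path(data_dir) / METADATA_DIR
    if meta.exists():
        shutil.rmtree(meta)  # drop leftovers from an earlier backup
    meta.mkdir()
    (meta / "crash.signal").touch()
    return meta


def backup_stop(data_dir, label, tablespace_map):
    """Write the four files the system would produce when the backup completes."""
    meta = Path(data_dir) / METADATA_DIR
    (meta / "backup_label").write_text(label)
    (meta / "tablespace_map").write_text(tablespace_map)
    (meta / "recovery.signal").touch()
    (meta / "standby.signal").touch()


def demo():
    """Run start then stop in a temporary directory and list the results."""
    with tempfile.TemporaryDirectory() as tmp:
        stale = Path(tmp) / METADATA_DIR
        stale.mkdir()
        (stale / "leftover").touch()  # simulates debris from a prior run
        meta = backup_start(tmp)
        after_start = sorted(p.name for p in meta.iterdir())
        backup_stop(tmp, "dummy label contents", "")
        after_stop = sorted(p.name for p in meta.iterdir())
        return after_start, after_stop
```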

All of the instructions regarding what to place in those files should be removed and instead the system should write them - no copy-paste.

The instructions would be modified to say "copy the backup_label and tablespace_map files to the root of the backup directory and the recovery and standby signal files to the pg_backup_metadata directory in the backup".  Additionally, we document crash recovery by saying "move crash.signal from pg_backup_metadata to the root of the data directory".  We should explicitly advise excluding or removing pg_backup_metadata/crash.signal from the backup as well.

Extending the above to handle concurrent backups: for pg_control we'd still use the on/off flag, but we'd need a shared in-memory session lock on something so that only the last surviving process actually changes it to off.  That mechanism must also deal with sessions that terminate without issuing pg_backup_stop and without the server itself crashing.  (I'm unfamiliar with how this is handled today, but I presume a mechanism already exists that just needs to be extended.)
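
One way to picture the "last one out turns off the flag" behavior is a lock-protected set of active sessions.  This is only an illustration in Python threads: the class and method names are invented, and the real server would use its own shared-memory machinery.

```python
import threading


class BackupState:
    """Hypothetical model: in_backup stands in for the pg_control flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = set()
        self.in_backup = False

    def start(self, session_id):
        """A session enters backup mode; the flag goes (or stays) on."""
        with self._lock:
            self._sessions.add(session_id)
            self.in_backup = True

    def stop(self, session_id):
        """A session leaves backup mode; only the last one clears the flag."""
        with self._lock:
            self._sessions.discard(session_id)
            if not self._sessions:
                self.in_backup = False

    def session_exit(self, session_id):
        """Session ended without calling stop; treat it the same as stop."""
        self.stop(session_id)
```

With two sessions started, stopping the first leaves in_backup on; the flag only goes off once the second one stops or exits.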

For the non-functional stuff, pg_backup_start returns a process id, and subdirectories under pg_backup_metadata are created named after it.  Add a pg_backup_cleanup() function that executes while not in backup mode to clean up those subdirectories.  Any subdirectory in the backup that isn't named for the process id returned by pg_backup_start should be excluded/removed.
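
A small sketch of the per-process subdirectories, under the assumption that each one is simply named with the decimal process id.  stale_subdirs and backup_cleanup are hypothetical helpers; the latter only approximates what pg_backup_cleanup() might do, and refusing to run while in backup is my reading of "executes while not in backup mode".

```python
import shutil
import tempfile
from pathlib import Path


def stale_subdirs(meta, backup_id):
    """Names of subdirectories that do not belong to this backup's process id."""
    return sorted(p.name for p in Path(meta).iterdir()
                  if p.is_dir() and p.name != str(backup_id))


def backup_cleanup(meta, in_backup):
    """Remove all per-process subdirectories; refuse while a backup is running."""
    if in_backup:
        raise RuntimeError("cannot clean up while a backup is in progress")
    removed = 0
    for p in Path(meta).iterdir():
        if p.is_dir():
            shutil.rmtree(p)
            removed += 1
    return removed


def demo():
    """Create three fake per-process directories, then inspect and clean them."""
    with tempfile.TemporaryDirectory() as tmp:
        meta = Path(tmp) / "pg_backup_metadata"
        for pid in (101, 202, 303):
            (meta / str(pid)).mkdir(parents=True)
        stale = stale_subdirs(meta, 202)
        removed = backup_cleanup(meta, in_backup=False)
        return stale, removed
```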

David J.
