Hackers,
While reading through [1] I saw two instances where backup_label was
removed to achieve a "successful" restore. This might work on trivial
test restores but is an invitation to (silent) disaster in a production
environment, where the checkpoint stored in backup_label is almost
certain to be earlier than the one stored in pg_control. Recovery then
starts from the later checkpoint and skips WAL that the backup needs to
become consistent.
A while back I had an idea on how to prevent this, so I decided to give
it a try. Basically, before writing pg_control to the backup I set the
checkpoint location in that copy to 0xFFFFFFFFFFFFFFFF.
Recovery worked perfectly as long as backup_label was present and failed
hard when it was not:
    LOG: invalid primary checkpoint record
    PANIC: could not locate a valid checkpoint record
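To make the behavior above concrete, here is a toy model of the decision, not the actual startup code. The names choose_start_checkpoint, RecoveryError, and INVALID_CHECKPOINT are made up for illustration, and checkpoint locations are treated as plain 64-bit integers.

```python
from typing import Optional

# Assumed sentinel: the all-ones value used in the experiment above.
INVALID_CHECKPOINT = 0xFFFFFFFFFFFFFFFF


class RecoveryError(Exception):
    """Raised when no usable checkpoint location is available."""


def choose_start_checkpoint(label_checkpoint: Optional[int],
                            control_checkpoint: int) -> int:
    """Return the checkpoint location recovery should start from.

    label_checkpoint is None when backup_label is missing.
    """
    if label_checkpoint is not None:
        # backup_label is present: its checkpoint is used and the
        # value in pg_control is never consulted.
        return label_checkpoint
    if control_checkpoint == INVALID_CHECKPOINT:
        # backup_label was removed and pg_control carries the sentinel,
        # so there is nothing valid to start from.
        raise RecoveryError("could not locate a valid checkpoint record")
    # No backup_label and a normal pg_control: ordinary crash recovery.
    return control_checkpoint
```

With the sentinel in place, the only path that still succeeds for a restored backup is the one that reads backup_label.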
It's not a very good message, but at least the footgun has been
removed. We could treat this as a special value and give a better
message, and maybe use something more distinctive like
0xFFFFFFFFFADEFADE (or whatever) as the value.
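A rough sketch of what the better message could look like, assuming the sentinel is the example value above. BACKUP_SENTINEL, format_lsn, checkpoint_error_message, and the message wording are all hypothetical; none of this exists in PostgreSQL today.

```python
# Example value from the paragraph above; the real choice is not settled.
BACKUP_SENTINEL = 0xFFFFFFFFFADEFADE


def format_lsn(lsn: int) -> str:
    """Render a 64-bit location as high/low 32-bit halves in hex."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def checkpoint_error_message(control_checkpoint: int) -> str:
    """Pick an error message for a checkpoint location that cannot be used."""
    if control_checkpoint == BACKUP_SENTINEL:
        # The sentinel tells us this pg_control came from a backup,
        # so the missing backup_label is the real problem.
        return ("pg_control was copied from a backup but backup_label "
                "is missing; restore backup_label and retry")
    # Any other bad value: fall back to the generic message.
    return ("could not locate a valid checkpoint record at "
            + format_lsn(control_checkpoint))
```

A distinctive value makes a collision with a real checkpoint location or with random corruption less likely than all-ones would.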
This is all easy enough for pg_basebackup to do, but will certainly be
non-trivial for most backup software to implement. In [2] we have
discussed perhaps returning pg_control from pg_backup_stop() for the
backup software to save, or it could become part of the backup_label
(encoded as hex or base64, presumably). I prefer the latter as this
means less work for the backup software (except for the need to exclude
pg_control from the backup).
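As a sketch of the second option, this is roughly how pg_control could be carried inside backup_label as base64. The PG_CONTROL field name and both helper functions are assumptions made for illustration; the real label format would need to be agreed on.

```python
import base64

# Hypothetical field name; backup_label has no such line today.
LABEL_KEY = "PG_CONTROL"


def append_control_to_label(label_text: str, control_bytes: bytes) -> str:
    """Return label_text with the control file appended as one base64 line."""
    encoded = base64.b64encode(control_bytes).decode("ascii")
    if not label_text.endswith("\n"):
        label_text += "\n"
    return f"{label_text}{LABEL_KEY}: {encoded}\n"


def extract_control_from_label(label_text: str) -> bytes:
    """Recover the embedded control file bytes from the label text."""
    prefix = f"{LABEL_KEY}: "
    for line in label_text.splitlines():
        if line.startswith(prefix):
            # validate=True rejects characters outside the base64 alphabet.
            return base64.b64decode(line[len(prefix):], validate=True)
    raise ValueError("backup_label has no embedded pg_control")
```

For an 8192-byte file, base64 needs 10924 characters where hex would need 16384, so base64 keeps the label smaller at the cost of being less readable.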
I don't have a patch for this yet because I did not test this idea using
pg_basebackup, but I'll be happy to work up a patch if there is interest.
I feel like we should do *something* here. If even advanced users are
making this mistake, then we should take it pretty seriously.
Regards,
-David
[1] https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288
[2] https://www.postgresql.org/message-id/CA%2BhUKGKiZJcfZSA5G5Rm8oC78SNOQ4c8az5Ku%3D4wMTjw1FZ40g%40mail.gmail.com