Re: The danger of deleting backup_label - Mailing list pgsql-hackers
From | David Steele |
---|---|
Subject | Re: The danger of deleting backup_label |
Date | |
Msg-id | 9dc67363-5200-7395-3ddb-b50171800bc6@pgmasters.net Whole thread Raw |
In response to | Re: The danger of deleting backup_label (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: The danger of deleting backup_label
|
List | pgsql-hackers |
On 10/16/23 10:55, Robert Haas wrote: > On Sat, Oct 14, 2023 at 11:33 AM David Steele <david@pgmasters.net> wrote: >> All of this is fixable in HEAD, but seems incredibly dangerous to back >> patch. Even so, I have attached the patch in case somebody sees an >> opportunity that I do not. > > I really do not think we should be even thinking about back-patching > something like this. It's clearly not a bug fix, although I'm sure > that someone can try to characterize it that way, if they want to make > the well-worn argument that any behavior they don't like is a bug. But > that's a pretty lame argument. Usage errors on the part of users are > not bugs, even if we've coded the software in such a way as to make > those errors more likely. Hmmm, the reason to back patch this is that it would fix [1], which sure looks like a problem to me even if it is not a "bug". We can certainly require backup software to retry pg_control until the checksum is valid but that seems like a pretty big ask, even considering how complicated backup is. I investigated this as a solution to [1] because it seemed like a better solution that what we have in [1]. But once I got into the weeds it was obvious that this wasn't going to be something we could back patch. > I think what we ought to be talking about is whether a change like > this is a good idea even in master. I don't think it's a terrible > idea, but I'm also not sure that it's a good idea. The problem is that > if you're doing the right thing with your backup_label, then this is > unnecessary, and if you're doing the wrong thing, then why should you > do the right thing about this? First and foremost, this solves the problem of pg_control being torn when read by the backup software. It can't be torn if it is not there. There are also other advantages -- we can massage pg_control before writing it back out. This already happens, but before that happens there is a copy of the (prior) running pg_control there that can mess up the process. > I mean, admittedly you can't just > ignore a fatal error, but I think people will just run pg_resetwal, > which is even worse than starting from the wrong checkpoint. If you start from the last checkpoint (which is what will generally be stored in pg_control) then the effect is pretty similar. > I feel > like in cases where a customer I'm working with has a bad backup, > their entire focus is on doing something to that backup to get a > running system back, whatever it takes. It's already too late at that > point to fix the backup procedure - they only have the backups they > have. You could hope people would do test restores before disaster > strikes, but people who are that prepared are probably running a real > backup tool and will never have this problem in the first place. Right now the user can remove backup_label and get a "successful" restore and not realize that they have just corrupted their cluster. This is independent of the backup/restore tool doing all the right things. My goal here is to narrow the options to try and make it so there is *one* valid procedure that will work. For this patch the idea is that they *must* start Postgres to get a valid pg_control from the backup_label. Any other action leads to a fatal error. Note that the current patch is very WIP and does not actually do everything I'm talking about here. I was just trying to see if it could be used to solve the problem in [1]. It can't. > Perhaps that's all too pessimistic. I don't know. Certainly, other > people can have experiences that are different than mine. But I feel > like I struggle to think of a case where this would have prevented a > bad outcome, and that makes me wonder whether it's really a good idea > to complicate the system. I'm specifically addressing cases like those that came up (twice!) in [2]. This is the main place I see people stumbling these days. If even hackers can make this mistake then we should do better. Regards, -David [1] https://www.postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de [2] [1] https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288
pgsql-hackers by date: