Re: The danger of deleting backup_label - Mailing list pgsql-hackers

From David Steele
Subject Re: The danger of deleting backup_label
Date
Msg-id 9dc67363-5200-7395-3ddb-b50171800bc6@pgmasters.net
Whole thread Raw
In response to Re: The danger of deleting backup_label  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: The danger of deleting backup_label
List pgsql-hackers
On 10/16/23 10:55, Robert Haas wrote:
> On Sat, Oct 14, 2023 at 11:33 AM David Steele <david@pgmasters.net> wrote:
>> All of this is fixable in HEAD, but seems incredibly dangerous to back
>> patch. Even so, I have attached the patch in case somebody sees an
>> opportunity that I do not.
> 
> I really do not think we should be even thinking about back-patching
> something like this. It's clearly not a bug fix, although I'm sure
> that someone can try to characterize it that way, if they want to make
> the well-worn argument that any behavior they don't like is a bug. But
> that's a pretty lame argument. Usage errors on the part of users are
> not bugs, even if we've coded the software in such a way as to make
> those errors more likely.

Hmmm, the reason to back patch this is that it would fix [1], which sure 
looks like a problem to me even if it is not a "bug". We can certainly 
require backup software to retry pg_control until the checksum is valid 
but that seems like a pretty big ask, even considering how complicated 
backup is.

I investigated this as a solution to [1] because it seemed like a better 
solution that what we have in [1]. But once I got into the weeds it was 
obvious that this wasn't going to be something we could back patch.

> I think what we ought to be talking about is whether a change like
> this is a good idea even in master. I don't think it's a terrible
> idea, but I'm also not sure that it's a good idea. The problem is that
> if you're doing the right thing with your backup_label, then this is
> unnecessary, and if you're doing the wrong thing, then why should you
> do the right thing about this? 

First and foremost, this solves the problem of pg_control being torn 
when read by the backup software. It can't be torn if it is not there.

There are also other advantages -- we can massage pg_control before 
writing it back out. This already happens, but before that happens there 
is a copy of the (prior) running pg_control there that can mess up the 
process.

> I mean, admittedly you can't just
> ignore a fatal error, but I think people will just run pg_resetwal,
> which is even worse than starting from the wrong checkpoint. 

If you start from the last checkpoint (which is what will generally be 
stored in pg_control) then the effect is pretty similar.

> I feel
> like in cases where a customer I'm working with has a bad backup,
> their entire focus is on doing something to that backup to get a
> running system back, whatever it takes. It's already too late at that
> point to fix the backup procedure - they only have the backups they
> have. You could hope people would do test restores before disaster
> strikes, but people who are that prepared are probably running a real
> backup tool and will never have this problem in the first place.

Right now the user can remove backup_label and get a "successful" 
restore and not realize that they have just corrupted their cluster. 
This is independent of the backup/restore tool doing all the right things.

My goal here is to narrow the options to try and make it so there is 
*one* valid procedure that will work. For this patch the idea is that 
they *must* start Postgres to get a valid pg_control from the 
backup_label. Any other action leads to a fatal error.

Note that the current patch is very WIP and does not actually do 
everything I'm talking about here. I was just trying to see if it could 
be used to solve the problem in [1]. It can't.

> Perhaps that's all too pessimistic. I don't know. Certainly, other
> people can have experiences that are different than mine. But I feel
> like I struggle to think of a case where this would have prevented a
> bad outcome, and that makes me wonder whether it's really a good idea
> to complicate the system.

I'm specifically addressing cases like those that came up (twice!) in 
[2]. This is the main place I see people stumbling these days. If even 
hackers can make this mistake then we should do better.

Regards,
-David

[1] 
https://www.postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[2] [1] 

https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: index prefetching
Next
From: Laurenz Albe
Date:
Subject: Re: Requiring recovery.signal or standby.signal when recovering with a backup_label