Re: Add recovery to pg_control and remove backup_label - Mailing list pgsql-hackers

From David Steele
Subject Re: Add recovery to pg_control and remove backup_label
Date
Msg-id 188e97f4-69d9-4542-b0c1-852fa6b8319b@pgmasters.net
In response to Re: Add recovery to pg_control and remove backup_label  (Andres Freund <andres@anarazel.de>)
Responses Re: Add recovery to pg_control and remove backup_label
List pgsql-hackers
On 11/21/23 12:41, Andres Freund wrote:
> 
> On 2023-11-21 07:42:42 -0400, David Steele wrote:
>> On 11/20/23 19:58, Andres Freund wrote:
>>> On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
>>>> On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
>>>>> Given that, I wonder if what we should do is to just add a new field to
>>>>> pg_control that says "error out if backup_label does not exist", that we set
>>>>> when creating a streaming base backup
>>>>
>>>> That would mean that one still needs to take an extra step to update a
>>>> control file with this byte set, which is something you had a concern
>>>> with in terms of compatibility when it comes to external backup
>>>> solutions because more steps are necessary to take a backup, no?
>>>
>>> I was thinking we'd just set it in the pg_basebackup style path, and we'd
>>> error out if it's set and backup_label is not present. But we'd still use
>>> backup_label without the pg_control flag set.
>>>
>>> So it'd just provide a cross-check that backup_label was not removed for
>>> pg_basebackup style backup, but wouldn't do anything for external backups. But
>>> imo the proposal to just use pg_control doesn't actually do anything for
>>> external backups either - which is why I think my proposal would achieve as
>>> much, for a much lower price.
>>
>> I'm not sure why you think the patch under discussion doesn't do anything
>> for external backups. It provides the same benefits to both pg_basebackup
>> and external backups, i.e. they both receive the updated version of
>> pg_control.
> 
> Sure. They also receive a backup_label today. If an external solution forgets
> to replace pg_control copied as part of the filesystem copy, they won't get an
> error after the removal of backup_label, just like they don't get one today if
> they don't put backup_label in the data directory.  Given that users don't do
> the right thing with backup_label today, why can we rely on them doing the
> right thing with pg_control?

I think reliable backup software does the right thing with backup_label, 
but if the user starts getting errors on recovery they then decide to 
remove backup_label. I know we can't do much about bad backup software, 
but we can at least make this a bit more resistant to user error after 
the fact.

It doesn't help that one of our hints suggests removing backup_label. In 
highly automated systems, the user might not even know they just 
restored from a backup. They are only in the loop because the restore 
failed and they are trying to figure out what is going wrong. When they 
remove backup_label the cluster comes up just fine. Victory!

This is the scenario I've seen most often -- not the backup/restore 
process getting it wrong but the user removing backup_label on their own 
initiative. And because it yields such a positive result, at least 
initially, they remember in the future that the thing to do is to remove 
backup_label whenever they see the error.

If they only have pg_control, then their only choice is to get it right 
or run pg_resetwal. Most users have no knowledge of pg_resetwal so it 
will take them longer to get there. Also, I think that tool makes it 
pretty clear that corruption will result and the only thing to do is a 
logical dump and restore after using it.

There are plenty of ways a user can mess things up. What I'd like to 
prevent is the appearance of everything being OK when in fact they have 
corrupted their cluster. That's the situation we have now with 
backup_label. Is this new solution perfect? No, but I do think it checks 
several boxes, and is a worthwhile improvement.
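
To make the failure mode concrete, here is a minimal sketch of the two schemes as a toy decision model. It is not PostgreSQL code: the class, field, and function names are all hypothetical, and real startup logic has many more cases. It only illustrates why removing backup_label can look like a fix, and why there is no equivalent file to remove when the recovery information lives in pg_control.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Outcome(Enum):
    BACKUP_RECOVERY = "replays WAL from the backup's start point"
    CRASH_RECOVERY = "replays WAL from the checkpoint in pg_control"
    ERROR = "refuses to start"


@dataclass(frozen=True)
class RestoredCluster:
    """Hypothetical, simplified view of a restored data directory."""
    has_backup_label: bool      # backup_label file is present
    control_marks_backup: bool  # pg_control records backup recovery (proposed)
    wal_available: bool         # WAL needed to reach consistency can be found


def startup_with_backup_label(cluster: RestoredCluster) -> Outcome:
    """Current scheme: only the backup_label file signals a restored backup."""
    if cluster.has_backup_label:
        return Outcome.BACKUP_RECOVERY if cluster.wal_available else Outcome.ERROR
    # Without the file, startup looks like ordinary crash recovery, even
    # though the data directory may be an inconsistent copy.
    return Outcome.CRASH_RECOVERY


def startup_with_pg_control(cluster: RestoredCluster) -> Outcome:
    """Proposed scheme: the recovery information is part of pg_control."""
    if cluster.control_marks_backup:
        return Outcome.BACKUP_RECOVERY if cluster.wal_available else Outcome.ERROR
    return Outcome.CRASH_RECOVERY


# A restore that fails because required WAL cannot be found.
failed = RestoredCluster(has_backup_label=True, control_marks_backup=True,
                         wal_available=False)

# The user removes backup_label in response to the error.
label_removed = replace(failed, has_backup_label=False)

print(startup_with_backup_label(failed).name)         # ERROR
print(startup_with_backup_label(label_removed).name)  # CRASH_RECOVERY
print(startup_with_pg_control(label_removed).name)    # ERROR
```

In this model, deleting backup_label turns an error into an apparently successful crash recovery, while under the pg_control scheme the same action changes nothing and the error remains until the restore is done correctly.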

Regards,
-David
