Re: Add recovery to pg_control and remove backup_label - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: Add recovery to pg_control and remove backup_label |
Date | |
Msg-id | ZWMFA0OVsS1ir3sC@tamriel.snowman.net |
In response to | Re: Add recovery to pg_control and remove backup_label (David Steele <david@pgmasters.net>) |
Responses | Re: Add recovery to pg_control and remove backup_label |
List | pgsql-hackers |
Greetings,

* David Steele (david@pgmasters.net) wrote:
> On 11/21/23 12:41, Andres Freund wrote:
> > Sure. They also receive a backup_label today. If an external solution forgets
> > to replace pg_control copied as part of the filesystem copy, they won't get an
> > error after the removal of backup_label, just like they don't get one today if
> > they don't put backup_label in the data directory. Given that users don't do
> > the right thing with backup_label today, why can we rely on them doing the
> > right thing with pg_control?
>
> I think reliable backup software does the right thing with backup_label, but
> if the user starts getting errors on recovery they then decide to remove
> backup_label. I know we can't do much about bad backup software, but we can
> at least make this a bit more resistant to user error after the fact.
>
> It doesn't help that one of our hints suggests removing backup_label. In
> highly automated systems, the user might not even know they just restored
> from a backup. They are only in the loop because the restore failed and they
> are trying to figure out what is going wrong. When they remove backup_label
> the cluster comes up just fine. Victory!

Yup, this is exactly the issue.

> This is the scenario I've seen most often -- not the backup/restore process
> getting it wrong but the user removing backup_label on their own initiative.
> And because it yields such a positive result, at least initially, they
> remember in the future that the thing to do is to remove backup_label
> whenever they see the error.
>
> If they only have pg_control, then their only choice is to get it right or
> run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take
> them longer to get there. Also, I think that tool makes it pretty clear that
> corruption will result and the only thing to do is a logical dump and
> restore after using it.

Agreed.

> There are plenty of ways a user can mess things up. What I'd like to prevent
> is the appearance of everything being OK when in fact they have corrupted
> their cluster. That's the situation we have now with backup_label. Is this
> new solution perfect? No, but I do think it checks several boxes, and is a
> worthwhile improvement.

+1.

As for the complaint about 'operators' having issues with the changes we've
been making in this area: where are those people complaining, exactly? Who
are they? I feel like we keep getting this kind of push-back in this area
from folks on this list but not from actual backup software authors; all the
complaints seem to be either speculative or unattributed pass-through from
someone else.

What would really be helpful would be hearing from these individuals directly
as to what the issues are with the changes, such that perhaps we can do
things better in the future to avoid whatever the issue is they're having
with the changes. Simply saying we shouldn't make changes in this area isn't
workable and the constant push-back is actively discouraging to folks trying
to make improvements. Obviously it's a biased view, but we've not had issues
making the necessary adjustments in pgbackrest with each release and I feel
like if the authors of wal-g or barman did have such issues, they would have
spoken up.

Making a change as suggested which only helps pg_basebackup (and tools like
pgbackrest, since it's in C and can also make this particular change) ends up
leaving tools like wal-g and barman potentially still with an easy way for
users of those tools to corrupt their databases, even though we've not heard
anything from the authors of those tools about issues with the proposed
change, nor have there been a lot of complaints from them about the prior
changes to indicate that they'd even have an issue with the more involved
change. Given the lack of complaint about past changes, I'd certainly rather
err on the side of improved safety for users than on the side of the authors
of these tools possibly complaining.

What these changes have done is finally break things like omnipitr
completely, which hasn't been maintained in a very long time. The changes in
v12 broke recovery with omnipitr but not backup, and folks were trying to use
omnipitr as recently as with v13[1]. Certainly having a backup tool that only
works for backup (fsvo works, anyway, as it still used exclusive backup mode,
meaning that a crash during a backup would cause the system to not come back
up after...) but doesn't work for recovery isn't exactly great and I'm glad
that, now, an attempt to use omnipitr to perform a backup will fail.

As with lots of other areas of PG, folks need to read the release notes and
potentially update their code for new major versions. If anything, the backup
area is less of an issue for this because the authors of the backup tools
(who are often the ones pushing for these changes) are able to make the
change and the end-user isn't impacted at all.

Much the same can be said for wal-e, with users still trying to use it even
long after it was stated to be obsolete (the Obsolescence Notice[2] was added
in February 2022, though it hadn't been maintained for a while before that,
and an issue was opened in December 2022 asking for it to be updated to
v15[3]...).

Thanks,

Stephen

[1]: https://github.com/omniti-labs/omnipitr/issues/43
[2]: https://github.com/wal-e/wal-e/commit/f5b3e790fe10daa098b8cbf01d836c4885dc13c7
[3]: https://github.com/wal-e/wal-e/issues/433