Thread: Remove Deprecated Exclusive Backup Mode
Hackers, I propose we remove the deprecated exclusive backup mode of pg_start_backup()/pg_stop_backup() for Postgres 12. The exclusive backup mode has a serious known issue. If Postgres terminates ungracefully during a backup (due to hardware, kernel, Postgres issues, etc.) then Postgres may refuse to restart. The reason is that the backup_label file will usually reference a checkpoint LSN that is older than the WAL available in pg_wal. Postgres does emit a helpful error message while PANIC'ing but that's cold comfort to an admin who must manually intervene to get their cluster operational again. The deprecated exclusive mode promises to make a difficult problem simple but doesn't live up to that promise. That's why it was replaced externally in 9.6 and why pg_basebackup has not used exclusive backups since it was introduced in 9.2. Non-exclusive backups have been available since 9.6 and several third-party solutions support this mode, in addition to pg_basebackup. The recently introduced recovery changes will break current automated solutions so this seems like a good opportunity to make improvements on the backup side as well. I'll submit a patch for the 2019-01 commitfest. Regards, -- -David david@pgmasters.net
On Mon, Nov 26, 2018 at 10:13:34PM -0500, David Steele wrote: > Non-exclusive backups have been available since 9.6 and several third-party > solutions support this mode, in addition to pg_basebackup. I think that two releases is not actually that much time to just nuke the exclusive backup interface. I would be fine if the docs show the deprecation more aggressively using a warning section, and we could add an explicit WARNING message about the matter directly when calling pg_start_backup and pg_stop_backup. My 2c. -- Michael
Attachment
Hi, On 2018-11-27 12:20:13 +0900, Michael Paquier wrote: > On Mon, Nov 26, 2018 at 10:13:34PM -0500, David Steele wrote: > > Non-exclusive backups have been available since 9.6 and several third-party > > solutions support this mode, in addition to pg_basebackup. > > I think that two releases is not actually that much time to just nuke > the exclusive backup interface. I would be fine if the docs show the > deprecation more aggressively using a warning section, and we could add > an explicit WARNING message about the matter directly when calling > pg_start_backup and pg_stop_backup. That was my gut reaction as well, but I think David's argument about existing scripts / workflows being broken due the the recovery.conf is a good reason to be more aggressive here. Greetings, Andres Freund
On Mon, Nov 26, 2018 at 10:13 PM David Steele <david@pgmasters.net> wrote: > I propose we remove the deprecated exclusive backup mode of > pg_start_backup()/pg_stop_backup() for Postgres 12. -1. I don't have a problem with deprecating exclusive backup mode eventually, but I don't see any good reason to do it this soon. It's not like the problems with exclusive backup are so serious that you can't work around them. If you know which machine is your master, you can arrange to remove backup_label on reboot (only) on the master (only). Sure, a lot of people probably do this wrong, but it's infeasible to disallow all the things that people might use incorrectly without making the system useless. There must be hundreds or thousands of backup scripts written by individual users that still use exclusive mode floating around out there. Forcing all of those to be updated or scrapped will annoy users to no benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 27, 2018 at 4:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-11-27 12:20:13 +0900, Michael Paquier wrote:
> On Mon, Nov 26, 2018 at 10:13:34PM -0500, David Steele wrote:
> > Non-exclusive backups have been available since 9.6 and several third-party
> > solutions support this mode, in addition to pg_basebackup.
>
> I think that two releases is not actually that much time to just nuke
> the exclusive backup interface. I would be fine if the docs show the
> deprecation more aggressively using a warning section, and we could add
> an explicit WARNING message about the matter directly when calling
> pg_start_backup and pg_stop_backup.
That was my gut reaction as well, but I think David's argument about
existing scripts / workflows being broken due the the recovery.conf
is a good reason to be more aggressive here.
Yeah, I'm in the same boat here -- it feels like it's a bit too short, but since we're breaking things for people *anyway*, it's probably better to break both at once than to have all those people have their things broken multiple times.
On 11/26/18 11:04 PM, Robert Haas wrote: > On Mon, Nov 26, 2018 at 10:13 PM David Steele <david@pgmasters.net> wrote: >> I propose we remove the deprecated exclusive backup mode of >> pg_start_backup()/pg_stop_backup() for Postgres 12. > > -1. I don't have a problem with deprecating exclusive backup mode > eventually, but I don't see any good reason to do it this soon. > > It's not like the problems with exclusive backup are so serious that > you can't work around them. If you know which machine is your master, > you can arrange to remove backup_label on reboot (only) on the master > (only). I do not recommend that users write automated scripts to delete files out of $PGDATA. Accidents happen that way. If this is our recommendation on how the mitigate the problem with exclusive backup then I think it proves the point that it is broken and should be removed. > Sure, a lot of people probably do this wrong, but it's > infeasible to disallow all the things that people might use > incorrectly without making the system useless. But we have already fixed this problem -- it only remains to remove the old method. > There must be hundreds or thousands of backup scripts written by > individual users that still use exclusive mode floating around out > there. Forcing all of those to be updated or scrapped will annoy > users to no benefit. But we are OK with forcing them to scrap their recovery scripts? And there is a benefit -- Postgres will be able to restart if it crashes during a backup. Regards, -- -David david@pgmasters.net
On 27/11/2018 04:46, Andres Freund wrote: > That was my gut reaction as well, but I think David's argument about > existing scripts / workflows being broken due the the recovery.conf > is a good reason to be more aggressive here. But backup scripts are not affected by the recovery.conf changes. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 27, 2018 at 02:21:49PM +0100, Peter Eisentraut wrote: > On 27/11/2018 04:46, Andres Freund wrote: >> That was my gut reaction as well, but I think David's argument about >> existing scripts / workflows being broken due the the recovery.conf >> is a good reason to be more aggressive here. > > But backup scripts are not affected by the recovery.conf changes. In any of my own backup scripts (yeah!), I don't have any dependency to that either. Or perhaps pgBackRest has a dependency in this area? -- Michael
Attachment
On Tue, 27 Nov 2018 at 03:13, David Steele <david@pgmasters.net> wrote:
--
The deprecated exclusive mode promises to make a difficult problem
simple but doesn't live up to that promise. That's why it was replaced
externally in 9.6 and why pg_basebackup has not used exclusive backups
since it was introduced in 9.2.
Non-exclusive backups have been available since 9.6 and several
third-party solutions support this mode, in addition to pg_basebackup.
The recently introduced recovery changes will break current automated
solutions so this seems like a good opportunity to make improvements on
the backup side as well.
I'll submit a patch for the 2019-01 commitfest.
-1 for removal, in this release
It's not there because anyone likes it, it's there because removing it is a risk
Recent changes are the reason to keep it, not a reason to remove.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Tue, Nov 27, 2018 at 02:21:49PM +0100, Peter Eisentraut wrote: > > On 27/11/2018 04:46, Andres Freund wrote: > >> That was my gut reaction as well, but I think David's argument about > >> existing scripts / workflows being broken due the the recovery.conf > >> is a good reason to be more aggressive here. > > > > But backup scripts are not affected by the recovery.conf changes. > > In any of my own backup scripts (yeah!), I don't have any dependency to > that either. Or perhaps pgBackRest has a dependency in this area? If you don't consider your recovery scripts and your backup scripts to be related then I've really got to wonder how you're regularly testing your backups to make sure that they're actually valid. If you aren't regularly testing your backups then I've got little sympathy. To be clear, pgbackrest doesn't have any dependency here- but it, like all of the other 3rd party backup solutions and any restore solution that a user has come up with, are going to have to be changed to deal with the changes in how recovery works, so this is a good time to make these changes. Thanks! Stephen
Attachment
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > There must be hundreds or thousands of backup scripts written by > individual users that still use exclusive mode floating around out > there. Forcing all of those to be updated or scrapped will annoy > users to no benefit. How about automated recovery scripts? We've decided it's fine to break all of those which exist out there. I'm concerned, seriously, that people don't have anywhere near the concern about the recovery side of things as they do about the backup side of things and that's really concerning. Thanks! Stephen
Attachment
On 27/11/2018 15:45, Stephen Frost wrote: >>> But backup scripts are not affected by the recovery.conf changes. >> In any of my own backup scripts (yeah!), I don't have any dependency to >> that either. Or perhaps pgBackRest has a dependency in this area? > If you don't consider your recovery scripts and your backup scripts to > be related then I've really got to wonder how you're regularly testing > your backups to make sure that they're actually valid. The sort of installations that continue to use the exclusive backup mode probably have the following tooling: a 20-line shell script to make the backup and either a 10-line shell script or a similarly sized README or wiki page to do the recovery. Changing the latter for the recovery.conf changes is probably a 3-line change. Changing the former for the removal of exclusive backups would require major changes. (Try writing a shell script that keeps a psql session open while it takes the backup from the file system. It's possible, but it requires significantly more logic.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > On 27/11/2018 15:45, Stephen Frost wrote: > >>> But backup scripts are not affected by the recovery.conf changes. > >> In any of my own backup scripts (yeah!), I don't have any dependency to > >> that either. Or perhaps pgBackRest has a dependency in this area? > > If you don't consider your recovery scripts and your backup scripts to > > be related then I've really got to wonder how you're regularly testing > > your backups to make sure that they're actually valid. > > The sort of installations that continue to use the exclusive backup mode > probably have the following tooling: a 20-line shell script to make the > backup and either a 10-line shell script or a similarly sized README or > wiki page to do the recovery. Changing the latter for the recovery.conf > changes is probably a 3-line change. Changing the former for the > removal of exclusive backups would require major changes. (Try writing > a shell script that keeps a psql session open while it takes the backup > from the file system. It's possible, but it requires significantly more > logic.) They're also the sort of installations which don't have reliable backups and don't have any clue about the danger they are in due to the current bug/issue/whatever we have with exclusive backups. No, I don't agree that it's sensible to continue to march on as if nothing is wrong. Thanks! Stephen
Attachment
On 11/27/18 3:46 PM, Stephen Frost wrote: > I'm concerned, seriously, that people don't have anywhere near the > concern about the recovery side of things as they do about the backup > side of things and that's really concerning. I agree with your larger point, but in this case the two breakages do not seem equal. As far as I gather the removal of recovery.conf will in practice result in a longer downtime when your recovery scripts breaks but not any data loss. While if we remove the exclusive backup mode then some people's backup scripts will break and if they do not properly monitor their backups they risk data loss. I think we should use more caution when data loss is at stake rather than "just" downtime. So personally I am in favor of updating the manual with warnings (right now it does not even say if exclusive or non-exclusive is the default) and adding a deprecation warning when people use the exclusive mode. Best regards, Andreas
Greetings, * Andreas Karlsson (andreas@proxel.se) wrote: > On 11/27/18 3:46 PM, Stephen Frost wrote: > >I'm concerned, seriously, that people don't have anywhere near the > >concern about the recovery side of things as they do about the backup > >side of things and that's really concerning. > > I agree with your larger point, but in this case the two breakages do not > seem equal. As far as I gather the removal of recovery.conf will in practice > result in a longer downtime when your recovery scripts breaks but not any > data loss. While if we remove the exclusive backup mode then some people's > backup scripts will break and if they do not properly monitor their backups > they risk data loss. Let's please try to remember that this is across a major version upgrade and is removing a broken capability that's already been deprecated for a couple years. If they aren't monitoring their backup scripts today, and aren't regularly performing restores of those backups, they're already risking data loss. > I think we should use more caution when data loss is at stake rather than > "just" downtime. So personally I am in favor of updating the manual with > warnings (right now it does not even say if exclusive or non-exclusive is > the default) and adding a deprecation warning when people use the exclusive > mode. Waiting isn't going to change any of these factors, nor will throwing warnings about the exclusive mode if people aren't monitoring their scripts. If we really want to keep the exclusive backup mode around, then, as Magnus said on a nearby thread, it needs to be fixed. If it's fixed and just works and everyone's scripts work, then great, then we can un-deprecate it and move on. If we aren't going to fix it then let's remove it. Thanks! Stephen
Attachment
On 11/27/18 4:11 PM, Stephen Frost wrote: >> I agree with your larger point, but in this case the two breakages do not >> seem equal. As far as I gather the removal of recovery.conf will in practice >> result in a longer downtime when your recovery scripts breaks but not any >> data loss. While if we remove the exclusive backup mode then some people's >> backup scripts will break and if they do not properly monitor their backups >> they risk data loss. > > Let's please try to remember that this is across a major version upgrade > and is removing a broken capability that's already been deprecated for a > couple years. I know that and you know that, but even our own manual use the exclusive mode in some of the examples, e.g: "pg_start_backup('label_goes_here')"[1]. Your point about that people who do not monitor their backups wont notice warnings is fair, but even so I think we should give people more time to migrate when even our own documentation isn't always clear about the exclusive mode being deprecated. 1. https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-ADMIN-BACKUP Andreas
On 27/11/2018 16:02, Stephen Frost wrote: > They're also the sort of installations which don't have reliable backups > and don't have any clue about the danger they are in due to the current > bug/issue/whatever we have with exclusive backups. > > No, I don't agree that it's sensible to continue to march on as if > nothing is wrong. It's fair to argue that this facility is broken and needs to be removed. But that needs to be its own argument. The argument in this subthread is that since we're already making changes in an adjacent area, removing this feature will have a low impact. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/27/18 9:54 AM, Peter Eisentraut wrote: > On 27/11/2018 15:45, Stephen Frost wrote: >>>> But backup scripts are not affected by the recovery.conf changes. >>> In any of my own backup scripts (yeah!), I don't have any dependency to >>> that either. Or perhaps pgBackRest has a dependency in this area? >> If you don't consider your recovery scripts and your backup scripts to >> be related then I've really got to wonder how you're regularly testing >> your backups to make sure that they're actually valid. > > The sort of installations that continue to use the exclusive backup mode > probably have the following tooling: a 20-line shell script to make the > backup and either a 10-line shell script or a similarly sized README or > wiki page to do the recovery. Changing the latter for the recovery.conf > changes is probably a 3-line change. Really? We have gone from a file that can be safely overwritten (recovery.conf) to a file that might need to be parsed to figure out if the settings already exists (postgresql.auto.conf). Of course, you can just continue to append to the file but that seems pretty grotty to me. This will also require changes to all HA solutions or backup solutions that deal with recovery. I think you underestimate how big a change this is. I've been thinking about how to write code for it and it is significantly more complex -- also more flexible so I'm happy about that. > Changing the former for the > removal of exclusive backups would require major changes. (Try writing > a shell script that keeps a psql session open while it takes the backup > from the file system. It's possible, but it requires significantly more > logic.) We provide pg_basebackup with core and it is a solid tool for doing backups. Do we really want to encourage the use of bash scripts to do what we already have a tool for? If users want to do something more complex than pg_basebackup then bash is certainly not the tool for that. Regards, -- -David david@pgmasters.net
On 11/27/18 10:29 AM, Peter Eisentraut wrote: > On 27/11/2018 16:02, Stephen Frost wrote: >> They're also the sort of installations which don't have reliable backups >> and don't have any clue about the danger they are in due to the current >> bug/issue/whatever we have with exclusive backups. >> >> No, I don't agree that it's sensible to continue to march on as if >> nothing is wrong. > > It's fair to argue that this facility is broken and needs to be removed. > But that needs to be its own argument. I believe I made this argument in the OP. > The argument in this subthread is that since we're already making > changes in an adjacent area, removing this feature will have a low impact. Fair enough, but I think the argument against exclusive backups stands on its own. Also, I don't see backup and restore as adjacent. I see them as thoroughly intertwined. -- -David david@pgmasters.net
On 11/27/18 8:56 AM, Simon Riggs wrote: > On Tue, 27 Nov 2018 at 03:13, David Steele <david@pgmasters.net > <mailto:david@pgmasters.net>> wrote: > > > The deprecated exclusive mode promises to make a difficult problem > simple but doesn't live up to that promise. That's why it was replaced > externally in 9.6 and why pg_basebackup has not used exclusive backups > since it was introduced in 9.2. > > Non-exclusive backups have been available since 9.6 and several > third-party solutions support this mode, in addition to pg_basebackup. > > The recently introduced recovery changes will break current automated > solutions so this seems like a good opportunity to make improvements on > the backup side as well. > > I'll submit a patch for the 2019-01 commitfest. > > > -1 for removal, in this release > > It's not there because anyone likes it, it's there because removing it > is a risk > > Recent changes are the reason to keep it, not a reason to remove. How so? -- -David david@pgmasters.net
Greetings, * David Steele (david@pgmasters.net) wrote: > On 11/27/18 10:29 AM, Peter Eisentraut wrote: > > On 27/11/2018 16:02, Stephen Frost wrote: > >> They're also the sort of installations which don't have reliable backups > >> and don't have any clue about the danger they are in due to the current > >> bug/issue/whatever we have with exclusive backups. > >> > >> No, I don't agree that it's sensible to continue to march on as if > >> nothing is wrong. > > > > It's fair to argue that this facility is broken and needs to be removed. > > But that needs to be its own argument. We made that argument, and had agreement on it, and that's why it's deprecated today. The only question here is how long we keep this deprecated and broken capability around and given that we're completely breaking everything around recovery, now is as good a time as any since users should be looking at how they do backup and restore when they migrate to v12 due to these changes. > > The argument in this subthread is that since we're already making > > changes in an adjacent area, removing this feature will have a low impact. > > Fair enough, but I think the argument against exclusive backups stands > on its own. Also, I don't see backup and restore as adjacent. I see > them as thoroughly intertwined. Agreed. Thanks! Stephen
Attachment
On Tue, 27 Nov 2018 at 14:45, Stephen Frost <sfrost@snowman.net> wrote:
--
Greetings,
* Michael Paquier (michael@paquier.xyz) wrote:
> On Tue, Nov 27, 2018 at 02:21:49PM +0100, Peter Eisentraut wrote:
> > On 27/11/2018 04:46, Andres Freund wrote:
> >> That was my gut reaction as well, but I think David's argument about
> >> existing scripts / workflows being broken due the the recovery.conf
> >> is a good reason to be more aggressive here.
> >
> > But backup scripts are not affected by the recovery.conf changes.
>
> In any of my own backup scripts (yeah!), I don't have any dependency to
> that either. Or perhaps pgBackRest has a dependency in this area?
If you don't consider your recovery scripts and your backup scripts to
be related then I've really got to wonder how you're regularly testing
your backups to make sure that they're actually valid.
If you aren't regularly testing your backups then I've got little
sympathy.
To be clear, pgbackrest doesn't have any dependency here- but it, like
all of the other 3rd party backup solutions and any restore solution
that a user has come up with, are going to have to be changed to deal
with the changes in how recovery works, so this is a good time to make
these changes.
If those tools are updated and the changes all work, that will be good.
That isn't the same thing as an argument to remove things in this release.
I propose waiting until next release.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Andreas Karlsson (andreas@proxel.se) wrote: > On 11/27/18 4:11 PM, Stephen Frost wrote: > >>I agree with your larger point, but in this case the two breakages do not > >>seem equal. As far as I gather the removal of recovery.conf will in practice > >>result in a longer downtime when your recovery scripts breaks but not any > >>data loss. While if we remove the exclusive backup mode then some people's > >>backup scripts will break and if they do not properly monitor their backups > >>they risk data loss. > > > >Let's please try to remember that this is across a major version upgrade > >and is removing a broken capability that's already been deprecated for a > >couple years. > > I know that and you know that, but even our own manual use the exclusive > mode in some of the examples, e.g: "pg_start_backup('label_goes_here')"[1]. Thankfully, that'll get nicely cleaned up by removing the exclusive backup mode. > Your point about that people who do not monitor their backups wont notice > warnings is fair, but even so I think we should give people more time to > migrate when even our own documentation isn't always clear about the > exclusive mode being deprecated. I don't buy off on this argument- we also have things like this: https://www.postgresql.org/docs/11/backup-file.html Where we show a simple tar command as a way to backup the database, but then we caveat it. Similairly, we make it clear that users shouldn't be using the exclusive mode backups: https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP If users are only reading the system functions description and thinking that's enough to figure out how to do backups, and didn't read the chapters of documentation we have about how to perform a backup, then chances are very good their existing backup systems are broken, and that's all the more reason to remove the exclusive mode because at least then it won't look like things are working when they aren't. Thanks! Stephen
Attachment
Hi, On 2018-11-26 23:04:35 -0500, Robert Haas wrote: > It's not like the problems with exclusive backup are so serious that > you can't work around them. If you know which machine is your master, > you can arrange to remove backup_label on reboot (only) on the master > (only). Sure, a lot of people probably do this wrong, but it's > infeasible to disallow all the things that people might use > incorrectly without making the system useless. That doesn't protect against postgres, not machine, [crash-]restarts though. Greetings, Andres Freund
On Tue, Nov 27, 2018 at 09:45:04AM -0500, Stephen Frost wrote: > If you don't consider your recovery scripts and your backup scripts to > be related then I've really got to wonder how you're regularly testing > your backups to make sure that they're actually valid. Base backups can be perfectly self-contained as long as they include all the WAL segments needed to recover up to the end-of-backup record. That's what pg_basebackup does with its default options (--wal-method=stream in particular). > If you aren't regularly testing your backups then I've got little > sympathy. Fortunately they do, hundreds of time on a daily basis ;) > To be clear, pgbackrest doesn't have any dependency here- but it, like > all of the other 3rd party backup solutions and any restore solution > that a user has come up with, are going to have to be changed to deal > with the changes in how recovery works, so this is a good time to make > these changes. My point is that base backups do not have a mandatory dependency with recovery.conf all the time as they can perfectly be restored if they are standalone backups. I can see a dependency with recovery.conf once you have a base backup which needs to be fed with WAL segments from an external archive, or when using a base backup to create a standby. -- Michael
Attachment
On 11/27/18 4:25 PM, Michael Paquier wrote: > On Tue, Nov 27, 2018 at 09:45:04AM -0500, Stephen Frost wrote: >> If you don't consider your recovery scripts and your backup scripts to >> be related then I've really got to wonder how you're regularly testing >> your backups to make sure that they're actually valid. > > Base backups can be perfectly self-contained as long as they include all > the WAL segments needed to recover up to the end-of-backup record. > That's what pg_basebackup does with its default options > (--wal-method=stream in particular). This is true, of course, but it misses one of the major benefits of file-level backups which is PITR. >> If you aren't regularly testing your backups then I've got little >> sympathy. > > Fortunately they do, hundreds of time on a daily basis ;) Nice! >> To be clear, pgbackrest doesn't have any dependency here- but it, like >> all of the other 3rd party backup solutions and any restore solution >> that a user has come up with, are going to have to be changed to deal >> with the changes in how recovery works, so this is a good time to make >> these changes. > > My point is that base backups do not have a mandatory dependency with > recovery.conf all the time as they can perfectly be restored if they are > standalone backups. I can see a dependency with recovery.conf once you > have a base backup which needs to be fed with WAL segments from an > external archive, or when using a base backup to create a standby. If you want to do PITR -- which is the default in most situations -- then some interaction with recovery.conf is needed. I think you would be hard-pressed to find a prominent HA or backup solution that doesn't do so. -- -David david@pgmasters.net
Attachment
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Tue, Nov 27, 2018 at 09:45:04AM -0500, Stephen Frost wrote: > > If you don't consider your recovery scripts and your backup scripts to > > be related then I've really got to wonder how you're regularly testing > > your backups to make sure that they're actually valid. > > Base backups can be perfectly self-contained as long as they include all > the WAL segments needed to recover up to the end-of-backup record. > That's what pg_basebackup does with its default options > (--wal-method=stream in particular). This really doesn't change my opinion at all. Sure, some custom scripts might have been written to operate in this way and now they'll need to be adjusted to work the exact same way pg_basebackup works today and use the non-exclusive mode, but they'll be better off for it and won't have the concern about what happens if the system is inadvertantly restarted during a backup. I'd also say it's likely that places which know enough to build such a solution aren't ones that we really need to worry about having an issue adjusting their scripts. Thanks! Stephen
Attachment
On 11/26/18 10:13 PM, David Steele wrote: > Hackers, > > I propose we remove the deprecated exclusive backup mode of > pg_start_backup()/pg_stop_backup() for Postgres 12. > > The exclusive backup mode has a serious known issue. If Postgres > terminates ungracefully during a backup (due to hardware, kernel, > Postgres issues, etc.) then Postgres may refuse to restart. > > The reason is that the backup_label file will usually reference a > checkpoint LSN that is older than the WAL available in pg_wal. Postgres > does emit a helpful error message while PANIC'ing but that's cold > comfort to an admin who must manually intervene to get their cluster > operational again. > > The deprecated exclusive mode promises to make a difficult problem > simple but doesn't live up to that promise. That's why it was replaced > externally in 9.6 and why pg_basebackup has not used exclusive backups > since it was introduced in 9.2. > > Non-exclusive backups have been available since 9.6 and several > third-party solutions support this mode, in addition to pg_basebackup. > > The recently introduced recovery changes will break current automated > solutions so this seems like a good opportunity to make improvements on > the backup side as well. > > I'll submit a patch for the 2019-01 commitfest. Attached is the patch. I was a bit surprised by how much code went away. There was a lot of code dedicated to making sure that backup_label got renamed on shutdown, that there was not an exclusive backup running, etc. There were no tests for exclusive backup so the test changes were minimal. I did have to replace one "hot backup" in 010_logical_decoding_timelines.pl. I'm not sure why the backup was being done this way, or why the standby needs a copy of pg_replslot (which is not copied by pg_basebackup). It might be worth looking at but it seemed out of scope for this patch. Regards, -- -David david@pgmasters.net
Attachment
On Wed, Dec 12, 2018 at 7:24 AM David Steele <david@pgmasters.net> wrote: > Attached is the patch. > > I was a bit surprised by how much code went away. There was a lot of > code dedicated to making sure that backup_label got renamed on shutdown, > that there was not an exclusive backup running, etc. > > There were no tests for exclusive backup so the test changes were minimal. > > I did have to replace one "hot backup" in > 010_logical_decoding_timelines.pl. I'm not sure why the backup was > being done this way, or why the standby needs a copy of pg_replslot > (which is not copied by pg_basebackup). It might be worth looking at > but it seemed out of scope for this patch. I wish to point out that there currently seem to be more votes against this proposal than for it, and that nobody should commit a patch unless there is a consensus that it should be committed, whether or not the committer personally agrees with the arguments against it. As for my vote, I do not buy the idea that because we're changing some stuff about recovery.conf we should go ahead and do this too. Yes, they are related, but just because you adjust your backup/restore script/tool to cope with one change doesn't mean that you don't have to adjust it some more to cope with the other change. I also think that the idea that supporting the exclusive backup interface is hurting anything is greatly exaggerated. Whether we keep it or not, we're not forcing anyone to use it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 12, 2018 at 11:05:36AM +0900, Robert Haas wrote: > I wish to point out that there currently seem to be more votes against > this proposal than for it, and that nobody should commit a patch > unless there is a consensus that it should be committed, whether or > not the committer personally agrees with the arguments against it. > > As for my vote, I do not buy the idea that because we're changing some > stuff about recovery.conf we should go ahead and do this too. Yes, > they are related, but just because you adjust your backup/restore > script/tool to cope with one change doesn't mean that you don't have > to adjust it some more to cope with the other change. > > I also think that the idea that supporting the exclusive backup > interface is hurting anything is greatly exaggerated. Whether we keep > it or not, we're not forcing anyone to use it. +1 on all that. Per the trend of this thread, I see a bunch of committers and contributors commenting about *not* removing this code, so sending a patch to actually remove it looks like a throw into an abysmal void. -- Michael
Attachment
Hi, On 2018-12-12 11:05:36 +0900, Robert Haas wrote: > As for my vote, I do not buy the idea that because we're changing some > stuff about recovery.conf we should go ahead and do this too. Yes, > they are related, but just because you adjust your backup/restore > script/tool to cope with one change doesn't mean that you don't have > to adjust it some more to cope with the other change. Sure, but you have to touch it, and it's not that big of an additional change. Call the other form of pg_stop_backup(), and write the contents of the two returned textblobs into backup_label/tablespace_map. As a reward you get a backup that break crash-safety for the database, and not having to do the migration later. > I also think that the idea that supporting the exclusive backup > interface is hurting anything is greatly exaggerated. Whether we keep > it or not, we're not forcing anyone to use it. It's hurting users. I've seen numerous outages due to the backup_label file being around in the wrong moment. Greetings, Andres Freund
On 2018-12-12 11:32:55 +0900, Michael Paquier wrote: > On Wed, Dec 12, 2018 at 11:05:36AM +0900, Robert Haas wrote: > > I wish to point out that there currently seem to be more votes against > > this proposal than for it, and that nobody should commit a patch > > unless there is a consensus that it should be committed, whether or > > not the committer personally agrees with the arguments against it. > > > > As for my vote, I do not buy the idea that because we're changing some > > stuff about recovery.conf we should go ahead and do this too. Yes, > > they are related, but just because you adjust your backup/restore > > script/tool to cope with one change doesn't mean that you don't have > > to adjust it some more to cope with the other change. > > > > I also think that the idea that supporting the exclusive backup > > interface is hurting anything is greatly exaggerated. Whether we keep > > it or not, we're not forcing anyone to use it. > > +1 on all that. Per the trend of this thread, I see a bunch of > committers and contributors commenting about *not* removing this code, > so sending a patch to actually remove it looks like a throw into an > abysmal void. I don't think it's as clear as you make it out to be: In favor: - David Steele - Andres Freund - Stephen Frost, I think Against: - Simon Riggs - Robert Haas - Michael Paquier - Andreas Karlsson - Peter Eisentraut, I think While that's clearly lopsided, I still think it's pretty darn absurd to call such a situation "a throw into an abysmal void". Greetings, Andres Freund
On 12/11/18 9:05 PM, Robert Haas wrote: > On Wed, Dec 12, 2018 at 7:24 AM David Steele <david@pgmasters.net> wrote: >> Attached is the patch. > > As for my vote, I do not buy the idea that because we're changing some > stuff about recovery.conf we should go ahead and do this too. Yes, > they are related, but just because you adjust your backup/restore > script/tool to cope with one change doesn't mean that you don't have > to adjust it some more to cope with the other change. I think the patch stands on its own. Exclusive backup mode is broken and is know to cause problems in the field. We deprecated it three years ago and we have a very capable core backup tool that people can use instead. > I also think that the idea that supporting the exclusive backup > interface is hurting anything is greatly exaggerated. Whether we keep > it or not, we're not forcing anyone to use it. There are maintenance advantages to getting rid of it. The code has become simpler and the documentation has become much simpler. Trying to explain the subtle differences between backup methods is difficult and only makes the documentation more confusing. Removing the extra complexity from shutdown also seems like a win. The gyrations we went through to try and ensure that backup_label was renamed before shutdown were truly agonizing. None of that is necessary with the non-exclusive mode. -- -David david@pgmasters.net
On Wed, Dec 12, 2018 at 11:44 AM Andres Freund <andres@anarazel.de> wrote:
On 2018-12-12 11:32:55 +0900, Michael Paquier wrote:
> On Wed, Dec 12, 2018 at 11:05:36AM +0900, Robert Haas wrote:
> > I wish to point out that there currently seem to be more votes against
> > this proposal than for it, and that nobody should commit a patch
> > unless there is a consensus that it should be committed, whether or
> > not the committer personally agrees with the arguments against it.
> >
> > As for my vote, I do not buy the idea that because we're changing some
> > stuff about recovery.conf we should go ahead and do this too. Yes,
> > they are related, but just because you adjust your backup/restore
> > script/tool to cope with one change doesn't mean that you don't have
> > to adjust it some more to cope with the other change.
> >
> > I also think that the idea that supporting the exclusive backup
> > interface is hurting anything is greatly exaggerated. Whether we keep
> > it or not, we're not forcing anyone to use it.
>
> +1 on all that. Per the trend of this thread, I see a bunch of
> committers and contributors commenting about *not* removing this code,
> so sending a patch to actually remove it looks like a throw into an
> abysmal void.
I don't think it's as clear as you make it out to be:
In favor:
- David Steele
- Andres Freund
- Stephen Frost, I think
You forgot me, also in this camp :P
Against:
- Simon Riggs
- Robert Haas
- Michael Paquier
- Andreas Karlsson
- Peter Eisentraut, I think
While that's clearly lopsided, I still think it's pretty darn absurd to
call such a situation "a throw into an abysmal void".
I'd say it's close enough to an even split to say it doesn't go either way. But we can clearly say it's not consensus for removing it (nor for keeping it, but that would be the default..)
On Wed, Dec 12, 2018 at 1:09 PM David Steele <david@pgmasters.net> wrote:
On 12/11/18 9:05 PM, Robert Haas wrote:
> On Wed, Dec 12, 2018 at 7:24 AM David Steele <david@pgmasters.net> wrote:
>> Attached is the patch.
>
> As for my vote, I do not buy the idea that because we're changing some
> stuff about recovery.conf we should go ahead and do this too. Yes,
> they are related, but just because you adjust your backup/restore
> script/tool to cope with one change doesn't mean that you don't have
> to adjust it some more to cope with the other change.
I think the patch stands on its own. Exclusive backup mode is broken
and is know to cause problems in the field.
Yes, this is a *far* too common thing to be seen in the field IMHO.
Is it super-common? No. But it's by far too common not to be a concern.
We deprecated it three years ago and we have a very capable core backup
tool that people can use instead.
+1.
If we can't get consensus on removing it now, then how about actually setting a *plan* for when to remove it?
For example, in 12 we could more aggressively deprecate it (such as actually emitting a WARNING message whenever used?), and then announce already now that it will be removed in version 13? That will then cover all those people who didn't read the documentation and find the existing deprecation notice?
> I also think that the idea that supporting the exclusive backup
> interface is hurting anything is greatly exaggerated. Whether we keep
> it or not, we're not forcing anyone to use it.
There are maintenance advantages to getting rid of it. The code has
become simpler and the documentation has become much simpler. Trying to
explain the subtle differences between backup methods is difficult and
only makes the documentation more confusing.
Removing the extra complexity from shutdown also seems like a win. The
gyrations we went through to try and ensure that backup_label was
renamed before shutdown were truly agonizing. None of that is necessary
with the non-exclusive mode.
That, too!
On 11/27/18 10:05 AM, Andreas Karlsson wrote: > On 11/27/18 3:46 PM, Stephen Frost wrote: >> I'm concerned, seriously, that people don't have anywhere near the >> concern about the recovery side of things as they do about the backup >> side of things and that's really concerning. > > I think we should use more caution when data loss is at stake rather > than "just" downtime. So personally I am in favor of updating the manual > with warnings (right now it does not even say if exclusive or > non-exclusive is the default) and adding a deprecation warning when > people use the exclusive mode. The documentation says (since 9.6): Low level base backups can be made in a non-exclusive or an exclusive way. The non-exclusive method is recommended and the exclusive one is deprecated and will eventually be removed. That seems clear enough to me. -- -David david@pgmasters.net
On 12/11/18 9:44 PM, Andres Freund wrote: > On 2018-12-12 11:32:55 +0900, Michael Paquier wrote: >> >> +1 on all that. Per the trend of this thread, I see a bunch of >> committers and contributors commenting about *not* removing this code, >> so sending a patch to actually remove it looks like a throw into an >> abysmal void. > > I don't think it's as clear as you make it out to be: > > In favor: > - David Steele > - Andres Freund > - Stephen Frost, I think > Against: > - Simon Riggs > - Robert Haas > - Michael Paquier > - Andreas Karlsson > - Peter Eisentraut, I think > > While that's clearly lopsided, I still think it's pretty darn absurd to > call such a situation "a throw into an abysmal void". I didn't get the impression that Peter was against, he just thought that it needed to stand on its own, rather than be justified by the recovery.conf changes, which I agree with. Simon rather clearly said that he thinks we should wait until the next release, which I don't see as being entirely against. -- -David david@pgmasters.net
On Wed, Dec 12, 2018 at 1:23 PM David Steele <david@pgmasters.net> wrote: > I didn't get the impression that Peter was against, he just thought that > it needed to stand on its own, rather than be justified by the > recovery.conf changes, which I agree with. > > Simon rather clearly said that he thinks we should wait until the next > release, which I don't see as being entirely against. Well, nobody is saying that we should NEVER remove this. The discussion is about what to do in v12. Most of the features I've been involved in removing have been deprecated for 5+ years. The first release where this one was deprecated was only 2 years ago. So it feels dramatically faster to me than what I think we have typically done. Actually, I hadn't realized until this discussion that the exclusive backup interface was actually deprecated -- I thought we were just recommending the new non-exclusive backup interface should be used. If we're in a rush to remove this (and apparently many of us are), I think we should make that warning a lot more prominent, maybe copy it into a few more places, and back-patch the changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 12, 2018 at 01:31:53PM +0900, Robert Haas wrote: > Most of the features I've been involved in removing have been > deprecated for 5+ years. The first release where this one was > deprecated was only 2 years ago. So it feels dramatically faster to > me than what I think we have typically done. > > Actually, I hadn't realized until this discussion that the exclusive > backup interface was actually deprecated -- I thought we were just > recommending the new non-exclusive backup interface should be used. > If we're in a rush to remove this (and apparently many of us are), I > think we should make that warning a lot more prominent, maybe copy it > into a few more places, and back-patch the changes. Indeed, I definitely agree that we should move away from the exclusive backup APIs at some point, still it feels very early to do this move as of now. Showing a bigger warning in the docs would be nice as a first step, and then we could wait a couple of extra years before doing the actual move. By the way, David and others, my apologies for my latest email, I hope you did not take that badly, my impression of this thread about the opinions gathered was wrong. -- Michael
Attachment
On December 11, 2018 10:10:00 PM PST, Michael Paquier <michael@paquier.xyz> wrote: >By the way, David and others, my apologies for my latest email, I hope >you did not take that badly, my impression of this thread about the >opinions gathered was wrong. Even if it had been 10:1 it doesn't seem like it'd have been a good approach of an answer. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, 2018-12-12 at 13:31 +0900, Robert Haas wrote: > On Wed, Dec 12, 2018 at 1:23 PM David Steele <david@pgmasters.net> wrote: > > I didn't get the impression that Peter was against, he just thought that > > it needed to stand on its own, rather than be justified by the > > recovery.conf changes, which I agree with. > > > > Simon rather clearly said that he thinks we should wait until the next > > release, which I don't see as being entirely against. > > Well, nobody is saying that we should NEVER remove this. The > discussion is about what to do in v12. > > Most of the features I've been involved in removing have been > deprecated for 5+ years. The first release where this one was > deprecated was only 2 years ago. So it feels dramatically faster to > me than what I think we have typically done. > > Actually, I hadn't realized until this discussion that the exclusive > backup interface was actually deprecated -- I thought we were just > recommending the new non-exclusive backup interface should be used. > If we're in a rush to remove this (and apparently many of us are), I > think we should make that warning a lot more prominent, maybe copy it > into a few more places, and back-patch the changes. +1 I too only learned about this recently, while the problem with exclusive backups has been known at least since 2008 (c979a1fe), and nobody felt this to be a terrible problem back then. I believe that the danger is greatly overrated. It is not like you end up with a corrupted database after a crash, and you get a pretty helpful error message. Many people are happy enough to live with that. I'm on board with deprecating and removing it eventually, but I see no problem in waiting for the customary 5 years. And yes, a prominent warning in the next major release notes would be a good thing. Yours, Laurenz Albe
On 12/11/18 11:31 PM, Robert Haas wrote: > On Wed, Dec 12, 2018 at 1:23 PM David Steele <david@pgmasters.net> wrote: >> I didn't get the impression that Peter was against, he just thought that >> it needed to stand on its own, rather than be justified by the >> recovery.conf changes, which I agree with. >> >> Simon rather clearly said that he thinks we should wait until the next >> release, which I don't see as being entirely against. > > Well, nobody is saying that we should NEVER remove this. The > discussion is about what to do in v12. Fair enough. > Most of the features I've been involved in removing have been > deprecated for 5+ years. The first release where this one was > deprecated was only 2 years ago. So it feels dramatically faster to > me than what I think we have typically done. I think this case is different because exclusive backups have known issues. I also think that having two similar but different methods stifles development in this area and makes the docs harder to improve. There's a lot of "if this then that else that" in the docs which make them hard to maintain and harder to read. Add the fact that we have *zero* tests for exclusive backups. I only had to refactor one exclusive backup in the tests and since it did not archive, exclude pg_wal, postmaster.pid, or do anything else our docs recommend I wouldn't say it qualifies as a real test. Also, it wasn't even trying to test exclusive backups -- it was a test for logical replication following timelines. So yeah, we'd be removing it in three years instead of five, but there has been a safe in-core alternative since 9.1 (eight years). Regards, -- -David david@pgmasters.net
On Wed, Dec 12, 2018 at 08:22:10AM -0500, David Steele wrote: > Add the fact that we have *zero* tests for exclusive backups. I only > had to refactor one exclusive backup in the tests and since it did not > archive, exclude pg_wal, postmaster.pid, or do anything else our docs > recommend I wouldn't say it qualifies as a real test. Also, it wasn't > even trying to test exclusive backups -- it was a test for logical > replication following timelines. This point is really right. The TAP tests rely heavily on pg_basebackup when taking base backups, still there is an interface to be able to take filesystem-level backups with the exclusive SQL interface. The test David is referring to here is backup_fs_hot in PostgresNode.pm. -- Michael
Attachment
On 12/12/2018 05:31, Robert Haas wrote: > Most of the features I've been involved in removing have been > deprecated for 5+ years. The first release where this one was > deprecated was only 2 years ago. So it feels dramatically faster to > me than what I think we have typically done. I was just looking this up as well, and I find it too fast. The nonexclusive backups were introduced in 9.6. So I'd say that we could remove the exclusive ones when 9.5 goes EOL. (That would mean this patch could be submitted for PostgreSQL 13, since 9.5 will go out of support around the time PG13 would be released.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/12/2018 23:24, David Steele wrote: > @@ -845,7 +838,7 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0 > rights to run pg_start_backup (superuser, or a user who has been granted > EXECUTE on the function) and issue the command: > <programlisting> > -SELECT pg_start_backup('label', false, false); > +SELECT pg_start_backup('label', false); > </programlisting> > where <literal>label</literal> is any string you want to use to uniquely > identify this backup operation. The connection Is it good to change the meaning of an existing interface like that? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/12/2018 05:09, David Steele wrote: > I think the patch stands on its own. Exclusive backup mode is broken > and is know to cause problems in the field. Is this breakage documented anywhere for users? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/13/18 2:02 PM, Peter Eisentraut wrote: > On 12/12/2018 05:09, David Steele wrote: >> I think the patch stands on its own. Exclusive backup mode is broken >> and is know to cause problems in the field. > > Is this breakage documented anywhere for users? Yes: https://www.postgresql.org/docs/11/continuous-archiving.html "Note that if the server crashes during the backup it may not be possible to restart until the backup_label file has been manually deleted from the PGDATA directory." Seems like the wording could be stronger. -- -David david@pgmasters.net
On 12/13/18 2:00 PM, Peter Eisentraut wrote: > On 11/12/2018 23:24, David Steele wrote: >> @@ -845,7 +838,7 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0 >> rights to run pg_start_backup (superuser, or a user who has been granted >> EXECUTE on the function) and issue the command: >> <programlisting> >> -SELECT pg_start_backup('label', false, false); >> +SELECT pg_start_backup('label', false); >> </programlisting> >> where <literal>label</literal> is any string you want to use to uniquely >> identify this backup operation. The connection > > Is it good to change the meaning of an existing interface like that? Good question. We could leave the third parameter (changing the default to false) and error if it has any value but false. It's a bit ugly but it does maintain compatibility with the current non-exclusive backup interface. And, the third parameter would never need to be specified unless we add a fourth. Or we could add a new function and deprecate this one -- but what to call it? I agree it's not very cool to break code for users who actually *did* migrate to non-exclusive backups. -- -David david@pgmasters.net
On Thu, Dec 13, 2018 at 2:29 PM David Steele <david@pgmasters.net> wrote: > Good question. > > We could leave the third parameter (changing the default to false) and > error if it has any value but false. It's a bit ugly but it does > maintain compatibility with the current non-exclusive backup interface. > And, the third parameter would never need to be specified unless we add > a fourth. > > Or we could add a new function and deprecate this one -- but what to > call it? > > I agree it's not very cool to break code for users who actually *did* > migrate to non-exclusive backups. Uh, yeah. If you do that, you're not just forcing people off the old interface -- you're making *everybody* update their stuff again. And having to have conditional logic to handle different releases presents its own set of dangers. IMHO, the biggest enemy in this area BY FAR is human error, not the details of how the interface works. No amount of providing a better-designed interface will compensate for the confusion factor involved in changing it. My vote is ... when we actually deprecate this, change nothing at the SQL level initially, but just throw an error if the "exclusive" argument isn't false: ERROR: exclusive backup mode is no longer supported That will require everyone to use the three-argument form of pg_start_backup(), but I think that's good, because if we accept the one and two argument forms, how do we actually know that the user updated their code? If you want to actually force people to switch to the non-exclusive mode, you have to make sure that anything that might be a residual request for exclusive backup mode errors out. If the same SQL just does something different, the user might fail to notice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13/12/2018 20:17, David Steele wrote: > On 12/13/18 2:02 PM, Peter Eisentraut wrote: >> On 12/12/2018 05:09, David Steele wrote: >>> I think the patch stands on its own. Exclusive backup mode is broken >>> and is know to cause problems in the field. >> >> Is this breakage documented anywhere for users? > > Yes: > > https://www.postgresql.org/docs/11/continuous-archiving.html > > "Note that if the server crashes during the backup it may not be > possible to restart until the backup_label file has been manually > deleted from the PGDATA directory." > > Seems like the wording could be stronger. I think that is a pretty liberal interpretation of the word "broken". All this says is "if you do A, and B happens, then you need to do C". Clearly, not having to do that at all is better, but if this is all there is to it, then I'm confused by the characterizations of how awful and terrible this feature is and how we must rush to remove it. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/14/18 3:01 AM, Peter Eisentraut wrote: > On 13/12/2018 20:17, David Steele wrote: >> On 12/13/18 2:02 PM, Peter Eisentraut wrote: >>> On 12/12/2018 05:09, David Steele wrote: >>>> I think the patch stands on its own. Exclusive backup mode is broken >>>> and is know to cause problems in the field. >>> >>> Is this breakage documented anywhere for users? >> >> Yes: >> >> https://www.postgresql.org/docs/11/continuous-archiving.html >> >> "Note that if the server crashes during the backup it may not be >> possible to restart until the backup_label file has been manually >> deleted from the PGDATA directory." >> >> Seems like the wording could be stronger. > > I think that is a pretty liberal interpretation of the word "broken". > All this says is "if you do A, and B happens, then you need to do C". > > Clearly, not having to do that at all is better, but if this is all > there is to it, then I'm confused by the characterizations of how awful > and terrible this feature is and how we must rush to remove it. If the server crashes during a backup, the server probably won't restart. We say "may" here but if your backup runs more than a few checkpoints it's more like won't. The remedy is to go manually delete a file the user may have never heard of. They are often hesitant to do so -- for good reason. The downtime stretches as they call their support company or consult Google to determine if they should really do this. The main thing is that *manual* intervention is required to get the cluster running again. Sure, you could automate it, but now we have users writing scripts to delete files out of PGDATA (may as well get tablespace_map as well) -- but hopefully not in the case of a restore where you actually need those files. And hopefully not anything else important. That seems pretty broken to me. -- -David david@pgmasters.net
On Fri, Dec 14, 2018 at 8:27 AM David Steele <david@pgmasters.net> wrote: > If the server crashes during a backup, the server probably won't > restart. We say "may" here but if your backup runs more than a few > checkpoints it's more like won't. The remedy is to go manually delete a > file the user may have never heard of. They are often hesitant to do so > -- for good reason. Sure. So it's broken in the sense that if you don't read and carefully follow the directions, you will have problems. But that's also true of an automobile, a power saw, the DELETE command, antibiotics, and the Internal Revenue Code. Many things in life are complicated and require due caution, and yet people regularly navigate things that are VASTLY more complicated than "if the server crashes during a hot backup, remove this file." Especially because if we subsequently fail for this reason, we give you a very precise hint that tells you exactly what to do: HINT: If you are not restoring from a backup, try removing the file "<path to $PGDATA goes here>/backup_label" Now, I grant that emitting a HINT telling you exactly what to do is not as good as not requiring manual user invention in the first place, but it's not like we're requiring you to carry out some byzantine process that nobody can possible understand. The "if" condition here is one that any DBA should be able to understand: am I, or am I not, in the process of restoring a backup? The system does not know, but the DBA should. And once you know that the answer is "no, I'm not restoring a backup", then you just remove the exact pathname indicated and try it again and everything works fine. Peter's complaint -- that this is a pretty liberal reading of the word "broken" -- is IMHO very well put. Saying that something is broken means it doesn't work. But the exclusive backup mode works fine if used according to the directions. The fact that people often get confused and don't follow the directions, and that the directions tend to end up requiring manual intervention after a system crash, is unfortunate, and it is why we have a new API. But "that thing isn't as well designed as it could've been" is not the same thing as "that thing does not work when used as designed," no matter how much you keep insisting the contrary. It does work. It has worked for a long time. Many people have used it successfully. It was the ONLY way of doing this for many years. There is no necessary reason why it has to be a huge problem, and I don't think it is. In my experience with EnterpriseDB customers, this IS an occasional source of confusion, but there are other things that are a problem a lot more frequently, like work_mem, which you often can't set high enough to get good query performance without causing OOM events when the system gets busy. I'd hesitate to call work_mem broken, but a GUC for which no single value both adequately protects against OOM and gives workable performance is closer to broken than an exclusive backup mode, which actually does work - albeit with occasional manual intervention - if you read and follow the directions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/12/18 4:44 AM, Andres Freund wrote: > On 2018-12-12 11:32:55 +0900, Michael Paquier wrote: >> >> +1 on all that. Per the trend of this thread, I see a bunch of >> committers and contributors commenting about *not* removing this code, >> so sending a patch to actually remove it looks like a throw into an >> abysmal void. > > I don't think it's as clear as you make it out to be: > > In favor: > - David Steele > - Andres Freund > - Stephen Frost, I think > Against: > - Simon Riggs > - Robert Haas > - Michael Paquier > - Andreas Karlsson > - Peter Eisentraut, I think > > While that's clearly lopsided, I still think it's pretty darn absurd to > call such a situation "a throw into an abysmal void". The way I read the discussion, there is no consensus for getting this into PG12, but there does seem to be one for PG13. I'll push this to the first post-PG12 CF when one appears. Can we just go ahead and create a 2019-07 CF now? I know we generally discuss this at PGCon, but we can always rename it, right? -- -David david@pgmasters.net
On Wed, Dec 19, 2018 at 5:38 PM David Steele <david@pgmasters.net> wrote:
On 12/12/18 4:44 AM, Andres Freund wrote:
> On 2018-12-12 11:32:55 +0900, Michael Paquier wrote:
>>
>> +1 on all that. Per the trend of this thread, I see a bunch of
>> committers and contributors commenting about *not* removing this code,
>> so sending a patch to actually remove it looks like a throw into an
>> abysmal void.
>
> I don't think it's as clear as you make it out to be:
>
> In favor:
> - David Steele
> - Andres Freund
> - Stephen Frost, I think
> Against:
> - Simon Riggs
> - Robert Haas
> - Michael Paquier
> - Andreas Karlsson
> - Peter Eisentraut, I think
>
> While that's clearly lopsided, I still think it's pretty darn absurd to
> call such a situation "a throw into an abysmal void".
The way I read the discussion, there is no consensus for getting this
into PG12, but there does seem to be one for PG13.
I'll push this to the first post-PG12 CF when one appears. Can we just
go ahead and create a 2019-07 CF now? I know we generally discuss this
at PGCon, but we can always rename it, right?
We can, I guess. Are you saying you're likely to forget? :P
For 12, it would probably be good to have a more clear deprecation notice into the documentations. Is that something on your list of things to work on, or should someone else (like, uh, me)?
On 12/19/18 9:02 PM, Magnus Hagander wrote: > On Wed, Dec 19, 2018 at 5:38 PM David Steele <david@pgmasters.net > > I'll push this to the first post-PG12 CF when one appears. Can we just > go ahead and create a 2019-07 CF now? I know we generally discuss this > at PGCon, but we can always rename it, right? > > > We can, I guess. Are you saying you're likely to forget? :P Not likely to forget, but I figured it's not good to have it in at the start of the next CF. People might think they can review it! > > For 12, it would probably be good to have a more clear deprecation > notice into the documentations. Is that something on your list of things > to work on, or should someone else (like, uh, me)? I'm planning to do that -- I meant to mention that. I may not have time until after the new year, but I doubt it's something that will cause a big fuss. -- -David david@pgmasters.net
On Wed, Dec 19, 2018 at 06:38:00PM +0200, David Steele wrote: > I'll push this to the first post-PG12 CF when one appears. Can we just go > ahead and create a 2019-07 CF now? I know we generally discuss this at > PGCon, but we can always rename it, right? The schedule gets decided at the developer meeting, still the commit fests created can have their name and start/end dates modified as needed So, I think that it is not an issue to add one ahead now. And here you go: https://commitfest.postgresql.org/23/ This request was going to pop out at some point anyway in the follow-up months. -- Michael
Attachment
On 12/20/18 1:35 AM, Michael Paquier wrote: > On Wed, Dec 19, 2018 at 06:38:00PM +0200, David Steele wrote: >> I'll push this to the first post-PG12 CF when one appears. Can we just go >> ahead and create a 2019-07 CF now? I know we generally discuss this at >> PGCon, but we can always rename it, right? > > The schedule gets decided at the developer meeting, still the commit > fests created can have their name and start/end dates modified as needed > So, I think that it is not an issue to add one ahead now. And here you > go: > https://commitfest.postgresql.org/23/ > > This request was going to pop out at some point anyway in the follow-up > months. Thanks, Michael. Getting this error, not sure what it means: Cannot move patch to the same commitfest, and multiple future commitfests exist! -- -David david@pgmasters.net
On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote: > Cannot move patch to the same commitfest, and multiple future commitfests > exist! I am not sure what it means either. What if you just mark the existing entry as returned with feedback, and create a new one ahead? -- Michael
Attachment
On 12/21/18 2:01 AM, Michael Paquier wrote: > On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote: >> Cannot move patch to the same commitfest, and multiple future commitfests >> exist! > > I am not sure what it means either. What if you just mark the existing > entry as returned with feedback, and create a new one ahead? Yeah, I can do that, but it would be nice to know why this is not working. It would also be nice to preserve the history. Magnus? -- -David david@pgmasters.net
On Fri, Dec 21, 2018 at 1:18 AM David Steele <david@pgmasters.net> wrote: > On 12/21/18 2:01 AM, Michael Paquier wrote: > > On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote: > >> Cannot move patch to the same commitfest, and multiple future commitfests > >> exist! > > > > I am not sure what it means either. What if you just mark the existing > > entry as returned with feedback, and create a new one ahead? > > Yeah, I can do that, but it would be nice to know why this is not > working. It would also be nice to preserve the history. > > Magnus? I think what it means is that it doesn't know which CommitFest is the "next" one. If the patch were currently in the in-progress CommitFest, the Open one (if any) would be the next one. Otherwise, if there were only one Future CommitFest, that would be the next one. But right now you're moving a patch that is already in the Open CommitFest and there are two 2 Future CommitFests, so it doesn't know which one to pick. I think the old CommitFest application let you set the CommitFest via the Edit screen also, so you could pick a specific CommitFest to target. But that doesn't seem to exist in Magnus's version. If you just wait until the Open CommitFest is marked In Progress and the next one is marked Open, it should work in any case. I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/18 6:43 PM, Robert Haas wrote: > > If you just wait until the Open CommitFest is marked In Progress and > the next one is marked Open, it should work in any case. I'll give that a try since it's only a few more days. > I think. Well, it's an uncertain world. -- -David david@pgmasters.net
On Fri, Dec 21, 2018 at 5:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 21, 2018 at 1:18 AM David Steele <david@pgmasters.net> wrote:
> On 12/21/18 2:01 AM, Michael Paquier wrote:
> > On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote:
> >> Cannot move patch to the same commitfest, and multiple future commitfests
> >> exist!
> >
> > I am not sure what it means either. What if you just mark the existing
> > entry as returned with feedback, and create a new one ahead?
>
> Yeah, I can do that, but it would be nice to know why this is not
> working. It would also be nice to preserve the history.
>
> Magnus?
I think what it means is that it doesn't know which CommitFest is the
"next" one. If the patch were currently in the in-progress
CommitFest, the Open one (if any) would be the next one. Otherwise,
if there were only one Future CommitFest, that would be the next one.
But right now you're moving a patch that is already in the Open
CommitFest and there are two 2 Future CommitFests, so it doesn't know
which one to pick.
I think the old CommitFest application let you set the CommitFest via
the Edit screen also, so you could pick a specific CommitFest to
target. But that doesn't seem to exist in Magnus's version.
That was a specifically requested change, and IIRC has been discussed before whenever people run into it :)
The basics is, there is not supposed to be more than one Future commitfest, in order to keep the workflow as simple as possible. When there is, all sorts of bad things happen. What we really should have at that point is a blocker so you can't *create* more than one of them, but right now it naively assumes nobody does...
On 12/27/18 6:53 PM, Magnus Hagander wrote: > On Fri, Dec 21, 2018 at 5:43 PM Robert Haas <robertmhaas@gmail.com > > I think the old CommitFest application let you set the CommitFest via > the Edit screen also, so you could pick a specific CommitFest to > target. But that doesn't seem to exist in Magnus's version. > > The basics is, there is not supposed to be more than one Future > commitfest, in order to keep the workflow as simple as possible. When > there is, all sorts of bad things happen. What we really should have at > that point is a blocker so you can't *create* more than one of them, but > right now it naively assumes nobody does... Which is ironic since I specifically requested that one be created. -- -David david@pgmasters.net
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > I too only learned about this recently, while the problem with exclusive > backups has been known at least since 2008 (c979a1fe), and nobody felt > this to be a terrible problem back then. My recollection was that back then it wasn't clear what to *do* about the problem. Once we had a solution (non-exclusive backups) we deprecated the exclusive backup mechanism. > I believe that the danger is greatly overrated. It is not like you end > up with a corrupted database after a crash, and you get a pretty helpful > error message. Many people are happy enough to live with that. Unfortunately, people get corrupted databases all the time because the backups are corrupted. The problem with databases not restarting correctly is another issue with the exclusive backup method but it's certainly not the only one. > I'm on board with deprecating and removing it eventually, but I see no > problem in waiting for the customary 5 years. We don't actually have a 'customary 5 years' rule of any sort. > And yes, a prominent warning in the next major release notes would be > a good thing. This is two years too late, at best. That is, maybe it would have made sense to highlight that the exclusive backup method was deprecated when it was made so, but that ship has sailed. Further, we don't put out announcements saying "we're going to remove X in the next release" and I don't see it making sense to start now. Thanks! Stephen
Attachment
Greetings, * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > On 12/12/2018 05:31, Robert Haas wrote: > > Most of the features I've been involved in removing have been > > deprecated for 5+ years. The first release where this one was > > deprecated was only 2 years ago. So it feels dramatically faster to > > me than what I think we have typically done. > > I was just looking this up as well, and I find it too fast. The > nonexclusive backups were introduced in 9.6. So I'd say that we could > remove the exclusive ones when 9.5 goes EOL. (That would mean this > patch could be submitted for PostgreSQL 13, since 9.5 will go out of > support around the time PG13 would be released.) I don't agree with either the notion that we have to wait 5 years in this case or that we've only had a good alternative to the exclusive backup mode since 9.5 as we've had pg_basebackup since 9.1. Thanks, Stephen
Attachment
Greetings, * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > Clearly, not having to do that at all is better, but if this is all > there is to it, then I'm confused by the characterizations of how awful > and terrible this feature is and how we must rush to remove it. It's not all there is to it though. This issue leads to extended downtime regularly and is definitely a huge 'gotcha' for users, even if you don't want to call it outright broken, but the other issue is that our documentation is ridiculously complicated around how to do a backup properly because of this and that also leads to the reality that it's difficult to make improvements to it because every sentence has to consider both methods, and that's really assuming that users actively read through the detailed documentation instead of just looking at those nice simple 'pg_start_backup' and 'pg_stop_backup' methods and use them thinking that's all that's required. We see this *all* the time, on the lists, in blog posts (even those from somewhat reputable companies...), and in other places. The exclusive backup method is a huge foot-gun for new users to PostgreSQL and leads to downtime, corrupt backups, and then corrupt databases when backups get restored. It really does need to go, and the sooner, the better. Thanks, Stephen
Attachment
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Dec 13, 2018 at 2:29 PM David Steele <david@pgmasters.net> wrote: > > We could leave the third parameter (changing the default to false) and > > error if it has any value but false. It's a bit ugly but it does > > maintain compatibility with the current non-exclusive backup interface. > > And, the third parameter would never need to be specified unless we add > > a fourth. > > > > Or we could add a new function and deprecate this one -- but what to > > call it? ... and would people really realize that they have to do more than just change the function name? I'm not sure. > > I agree it's not very cool to break code for users who actually *did* > > migrate to non-exclusive backups. This bothers me much less. We change our APIs between major releases from time-to-time and people need to update their code to handle that. > Uh, yeah. If you do that, you're not just forcing people off the old > interface -- you're making *everybody* update their stuff again. And > having to have conditional logic to handle different releases presents > its own set of dangers. IMHO, the biggest enemy in this area BY FAR is > human error, not the details of how the interface works. No amount of > providing a better-designed interface will compensate for the > confusion factor involved in changing it. I don't agree with the argument that conditional logic for different releases is an issue- we already require people to have that all over the place and, while it's a bit annoying, it's part and parcel for working with multiple major versions of PG. > My vote is ... when we actually deprecate this, change nothing at the > SQL level initially, but just throw an error if the "exclusive" > argument isn't false: > ERROR: exclusive backup mode is no longer supported > That will require everyone to use the three-argument form of > pg_start_backup(), but I think that's good, because if we accept the > one and two argument forms, how do we actually know that the user > updated their code? If you want to actually force people to switch to > the non-exclusive mode, you have to make sure that anything that might > be a residual request for exclusive backup mode errors out. If the > same SQL just does something different, the user might fail to notice. This is an argument that I can agree with though. We certainly don't want existing users of the exclusive mode to think "everything is fine!" after it's been ripped out. I hate having to leave old API cruft around like an argument that's required to be passed in and required to have a specific value but it does seem the best way to make it extremely clear to users of the exclusive mode that they really have to adjust their code. Perhaps in some future version we'll redefine the API more broadly and be able to rip out this API entirely, throwing away such old cruft as we do it, but until then I think we can live with it. Thanks! Stephen
Attachment
Stephen Frost wrote: > * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > > Clearly, not having to do that at all is better, but if this is all > > there is to it, then I'm confused by the characterizations of how awful > > and terrible this feature is and how we must rush to remove it. > > It's not all there is to it though. > > This issue leads to extended downtime regularly and is definitely a huge > 'gotcha' for users, even if you don't want to call it outright broken, Only if PostgreSQL crashes regularly, right? Yours, Laurenz Albe
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > Stephen Frost wrote: > > * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > > > Clearly, not having to do that at all is better, but if this is all > > > there is to it, then I'm confused by the characterizations of how awful > > > and terrible this feature is and how we must rush to remove it. > > > > It's not all there is to it though. > > > > This issue leads to extended downtime regularly and is definitely a huge > > 'gotcha' for users, even if you don't want to call it outright broken, > > Only if PostgreSQL crashes regularly, right? This happens anytime a backup is in progress and the system crashes in any way- PostgreSQL going down, the container Postgres is running in, the server Postgres is running on, or if the filesystem that Postgres is running on goes away, etc. There's certainly no shortage of cases where this can happen. Thanks! Stephen
Attachment
On 2019-01-05 13:19:20 -0500, Stephen Frost wrote: > Greetings, > > * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > > On 12/12/2018 05:31, Robert Haas wrote: > > > Most of the features I've been involved in removing have been > > > deprecated for 5+ years. The first release where this one was > > > deprecated was only 2 years ago. So it feels dramatically faster to > > > me than what I think we have typically done. > > > > I was just looking this up as well, and I find it too fast. The > > nonexclusive backups were introduced in 9.6. So I'd say that we could > > remove the exclusive ones when 9.5 goes EOL. (That would mean this > > patch could be submitted for PostgreSQL 13, since 9.5 will go out of > > support around the time PG13 would be released.) > > I don't agree with either the notion that we have to wait 5 years in > this case or that we've only had a good alternative to the exclusive > backup mode since 9.5 as we've had pg_basebackup since 9.1. I don't agree with a general 5 year deprecation window either. But it seems pretty clear that there's no majority for removing exclusive backups in v12. I think it'd be good to make the warning about its impending death more explicit, but otherwise mark this CF entry either as rejected or returned with feedback. Greetings, Andres Freund
On 2/16/19 5:57 AM, Andres Freund wrote: > On 2019-01-05 13:19:20 -0500, Stephen Frost wrote: >> Greetings, >> >> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: >>> On 12/12/2018 05:31, Robert Haas wrote: >>>> Most of the features I've been involved in removing have been >>>> deprecated for 5+ years. The first release where this one was >>>> deprecated was only 2 years ago. So it feels dramatically faster to >>>> me than what I think we have typically done. >>> >>> I was just looking this up as well, and I find it too fast. The >>> nonexclusive backups were introduced in 9.6. So I'd say that we could >>> remove the exclusive ones when 9.5 goes EOL. (That would mean this >>> patch could be submitted for PostgreSQL 13, since 9.5 will go out of >>> support around the time PG13 would be released.) >> >> I don't agree with either the notion that we have to wait 5 years in >> this case or that we've only had a good alternative to the exclusive >> backup mode since 9.5 as we've had pg_basebackup since 9.1. > > I don't agree with a general 5 year deprecation window either. But it > seems pretty clear that there's no majority for removing exclusive > backups in v12. I think it'd be good to make the warning about its > impending death more explicit, but otherwise mark this CF entry either > as rejected or returned with feedback. I think there is support for the patch in PG13 so I was planning to move it out of the March CF to the first PG13 CF as soon as the app will allow it, i.e., once there is only a single open CF. Thanks, -- -David david@pgmasters.net
On Mon, Feb 18, 2019 at 6:13 AM David Steele <david@pgmasters.net> wrote:
On 2/16/19 5:57 AM, Andres Freund wrote:
> On 2019-01-05 13:19:20 -0500, Stephen Frost wrote:
>> Greetings,
>>
>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>> On 12/12/2018 05:31, Robert Haas wrote:
>>>> Most of the features I've been involved in removing have been
>>>> deprecated for 5+ years. The first release where this one was
>>>> deprecated was only 2 years ago. So it feels dramatically faster to
>>>> me than what I think we have typically done.
>>>
>>> I was just looking this up as well, and I find it too fast. The
>>> nonexclusive backups were introduced in 9.6. So I'd say that we could
>>> remove the exclusive ones when 9.5 goes EOL. (That would mean this
>>> patch could be submitted for PostgreSQL 13, since 9.5 will go out of
>>> support around the time PG13 would be released.)
>>
>> I don't agree with either the notion that we have to wait 5 years in
>> this case or that we've only had a good alternative to the exclusive
>> backup mode since 9.5 as we've had pg_basebackup since 9.1.
>
> I don't agree with a general 5 year deprecation window either. But it
> seems pretty clear that there's no majority for removing exclusive
> backups in v12. I think it'd be good to make the warning about its
> impending death more explicit, but otherwise mark this CF entry either
> as rejected or returned with feedback.
I think there is support for the patch in PG13 so I was planning to move
it out of the March CF to the first PG13 CF as soon as the app will
allow it, i.e., once there is only a single open CF.
Agreed, and I think we should also update the documentation for 12 with the suggested more explicit mention of the deprecation.
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Mon, Feb 18, 2019 at 6:13 AM David Steele <david@pgmasters.net> wrote: > > On 2/16/19 5:57 AM, Andres Freund wrote: > > > On 2019-01-05 13:19:20 -0500, Stephen Frost wrote: > > >> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote: > > >>> On 12/12/2018 05:31, Robert Haas wrote: > > >>>> Most of the features I've been involved in removing have been > > >>>> deprecated for 5+ years. The first release where this one was > > >>>> deprecated was only 2 years ago. So it feels dramatically faster to > > >>>> me than what I think we have typically done. > > >>> > > >>> I was just looking this up as well, and I find it too fast. The > > >>> nonexclusive backups were introduced in 9.6. So I'd say that we could > > >>> remove the exclusive ones when 9.5 goes EOL. (That would mean this > > >>> patch could be submitted for PostgreSQL 13, since 9.5 will go out of > > >>> support around the time PG13 would be released.) > > >> > > >> I don't agree with either the notion that we have to wait 5 years in > > >> this case or that we've only had a good alternative to the exclusive > > >> backup mode since 9.5 as we've had pg_basebackup since 9.1. > > > > > > I don't agree with a general 5 year deprecation window either. But it > > > seems pretty clear that there's no majority for removing exclusive > > > backups in v12. I think it'd be good to make the warning about its > > > impending death more explicit, but otherwise mark this CF entry either > > > as rejected or returned with feedback. > > > > I think there is support for the patch in PG13 so I was planning to move > > it out of the March CF to the first PG13 CF as soon as the app will > > allow it, i.e., once there is only a single open CF. > > Agreed, and I think we should also update the documentation for 12 with the > suggested more explicit mention of the deprecation. +1. Thanks! Stephen
Attachment
On 2/22/19 6:32 PM, Stephen Frost wrote: > * Magnus Hagander (magnus@hagander.net) wrote: >> On Mon, Feb 18, 2019 at 6:13 AM David Steele <david@pgmasters.net> wrote: >>> On 2/16/19 5:57 AM, Andres Freund wrote: >>>> I don't agree with a general 5 year deprecation window either. But it >>>> seems pretty clear that there's no majority for removing exclusive >>>> backups in v12. I think it'd be good to make the warning about its >>>> impending death more explicit, but otherwise mark this CF entry either >>>> as rejected or returned with feedback. >>> >>> I think there is support for the patch in PG13 so I was planning to move >>> it out of the March CF to the first PG13 CF as soon as the app will >>> allow it, i.e., once there is only a single open CF. >> >> Agreed, and I think we should also update the documentation for 12 with the >> suggested more explicit mention of the deprecation. > > +1. I'm working on this for the March CF. Ideally we would back-patch it to 9.6. -- -David david@pgmasters.net
On Tue, Nov 27, 2018 at 12:13 PM David Steele <david@pgmasters.net> wrote: > > Hackers, > > I propose we remove the deprecated exclusive backup mode of > pg_start_backup()/pg_stop_backup() for Postgres 12. -1 for the removal. I think that there are still many users of an exclusive backup API, and it's not so easy to migrate their backup scripts so that only non-exclusive one is used because of the reason I wrote in another thread. https://postgr.es/m/CAHGQGwHUkEbkVexVfWNLjmq2rzOS_SHYMiECt+KBn-cBPq5Arg@mail.gmail.com Regards, -- Fujii Masao
On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao <masao.fujii@gmail.com> wrote: > -1 for the removal. I think that there are still many users of an exclusive > backup API, and it's not so easy to migrate their backup scripts so that > only non-exclusive one is used because of the reason I wrote in another thread. > https://postgr.es/m/CAHGQGwHUkEbkVexVfWNLjmq2rzOS_SHYMiECt+KBn-cBPq5Arg@mail.gmail.com Yeah, those are interesting points. Unfortunately, I think something as simple as your proposed... psql -c "SELECT pg_start_backup()" rsync, cp, tar, storage backup, or something psql -c "SELECT pg_stop_backup()" ...doesn't have much chance of being correct. If you aren't checking whether pg_start_backup() throws an error, for example, you have got a very failure-prone set-up. That being said, it's possible to fix that problem using some shell logic, whereas the problem of keeping a connection open for the duration of the backup from the shell is much harder. I recently ran across a case where someone attempted it but did not get the logic entirely right, with rather painful consequences. Now you might say that people should not write their own tools and just use professionally produced ones. But I don't really agree with that, and whatever we think people SHOULD do, there are in fact a lot of people using their own tools. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao <masao.fujii@gmail.com> wrote: > > -1 for the removal. I think that there are still many users of an exclusive > > backup API, and it's not so easy to migrate their backup scripts so that > > only non-exclusive one is used because of the reason I wrote in another thread. > > https://postgr.es/m/CAHGQGwHUkEbkVexVfWNLjmq2rzOS_SHYMiECt+KBn-cBPq5Arg@mail.gmail.com > > Yeah, those are interesting points. Unfortunately, I think something > as simple as your proposed... > > psql -c "SELECT pg_start_backup()" > rsync, cp, tar, storage backup, or something > psql -c "SELECT pg_stop_backup()" > > ...doesn't have much chance of being correct. If you aren't checking > whether pg_start_backup() throws an error, for example, you have got a > very failure-prone set-up. That being said, it's possible to fix that > problem using some shell logic, whereas the problem of keeping a > connection open for the duration of the backup from the shell is much > harder. I recently ran across a case where someone attempted it but > did not get the logic entirely right, with rather painful > consequences. > > Now you might say that people should not write their own tools and > just use professionally produced ones. But I don't really agree with > that, and whatever we think people SHOULD do, there are in fact a lot > of people using their own tools. Yes, they are- but I'd argue that's specifically because the old/deprecated/broken API makes it *look* safe, even though it isn't. Getting a solid and resiliant backup to work from a shell script is, imv anyway (though I might have a bit of experience, having tried numerous times myself and then realizing that it just isn't practical...), a downright fool's errand. Perhaps there's some way we can fix that so that people can write a lazy rsync-based backup script and use it, but I don't see anyone seriously working on that and I don't think that's because they just really like to make things difficult for people but rather because it really is *hard* to do, if not downright impossible. What we need is a backup tool included in core that users feel comfortable using instead of trying to write their own. I don't think they write their own because it's somehow better than the existing tools or because they really think it's a good idea- it's just what looks like the "thing to do" based on our documentation and the API we provide today. Sure, pg_basebackup is good for smaller environments and I always encourage people to use it when they want something that's in core that is safe to use and that does things correctly, but it's got some pretty serious limitations due to using the replication protocol. Thanks! Stephen
Attachment
> On Feb 22, 2019, at 15:18, Stephen Frost <sfrost@snowman.net> wrote: > Getting a solid and resiliant backup to work from a shell script is, imv > anyway (though I might have a bit of experience, having tried numerous > times myself and then realizing that it just isn't practical...), a > downright fool's errand. The reality, though, is that there are a lot of organizations who have invested time and effort into getting a backup strategyworking using the existing APIs, and there will be quite a bit of pushback against the version in which the existingexclusive API is removed. Some of those will be able to move to non-exclusive backups easily; others won't. For the ones that can't move easily, thereaction will not be, "PostgreSQL version x has a safer backup API"; it will be "PostgreSQL version x broke our backups,so we're not upgrading to it." Rather than deprecate the existing API, I'd rather see the documentation updated to discuss the danger cases. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 22, 2019, at 15:18, Stephen Frost <sfrost@snowman.net> wrote: > > Getting a solid and resiliant backup to work from a shell script is, imv > > anyway (though I might have a bit of experience, having tried numerous > > times myself and then realizing that it just isn't practical...), a > > downright fool's errand. > > The reality, though, is that there are a lot of organizations who have invested time and effort into getting a backup strategyworking using the existing APIs, and there will be quite a bit of pushback against the version in which the existingexclusive API is removed. Do they realize how that existing backup strategy is flawed? I doubt it, and when they discover how it's broken, I don't think they're going to be thanking us for letting them continue to use it. > Some of those will be able to move to non-exclusive backups easily; others won't. For the ones that can't move easily,the reaction will not be, "PostgreSQL version x has a safer backup API"; it will be "PostgreSQL version x broke ourbackups, so we're not upgrading to it." We don't cater to this line of argument when it comes to breaking changes in the backend, or when we break monitoring scripts, and I don't see a reason why we should do so here. They aren't required to upgrade immediately- we provide 5 *years* of support for major versions which we release, and we're going to give advance warning of this change that's even beyond that 5 years. I have essentially zero sympathy for organizations which refuse to allocate even the bit of time necessary to address this change for over 5+ years. > Rather than deprecate the existing API, I'd rather see the documentation updated to discuss the danger cases. Ok, then please do so, and please be prepared to continue to maintain the documentation of both methods moving forward, because others have tried and have (rightfully, in my opinion) decided that it's frankly not worth the effort and ultimately just terribly confusing for users that we have these two different backup methods and even just updating the documentation for one or the other is downright painful (to the point that people litterally give up on it). That really isn't a good place to be in. Thanks! Stephen
Attachment
Hi,, On 2019-02-24 11:52:54 -0800, Christophe Pettus wrote: > > On Feb 22, 2019, at 15:18, Stephen Frost <sfrost@snowman.net> wrote: > > Getting a solid and resiliant backup to work from a shell script is, imv > > anyway (though I might have a bit of experience, having tried numerous > > times myself and then realizing that it just isn't practical...), a > > downright fool's errand. > > The reality, though, is that there are a lot of organizations who have > invested time and effort into getting a backup strategy working using > the existing APIs, and there will be quite a bit of pushback against > the version in which the existing exclusive API is removed. > > Some of those will be able to move to non-exclusive backups easily; > others won't. For the ones that can't move easily, the reaction will > not be, "PostgreSQL version x has a safer backup API"; it will be > "PostgreSQL version x broke our backups, so we're not upgrading to > it." I think it also depends on how we remove exclusive backups. If we do it by adding a 'pg_run_hot_backup --conection-options -- shell_command_to_copy' it ought to be pretty simple to fix scripts. Whereas right now, with the requirement to manually have a postgres connection alive, which is annoying to do from shell, is not the case. I think if we'd introduced a command like that, we'd have seen a higher adoption of non-exclusive backups. > Rather than deprecate the existing API, I'd rather see the documentation updated to discuss the danger cases. You can't work around some of danger cases (crash in the wrong moment, your database doesn't come up properly), so I don't think docs really address the problem. Greetings, Andres Freund
> On Feb 24, 2019, at 12:00, Stephen Frost <sfrost@snowman.net> wrote: > > Do they realize how that existing backup strategy is flawed? Undoubtedly, some do, some don't. However, given that it has been the *only* backup API for a very long time, many organizationshave spent a lot of time closing all of the holes. It's not impossible to do safe backups with the existing API. Unquestionably, there are installations doing backups thatmight end up with a silently badly one, but it's entirely possible to do that unsafely (in many of the same ways) withthe non-exclusively API. The installations that need to fix the scripts are also exactly the ones that can't use pg_basebackup or another pre-packagedsolution, usually because they have a specific way of taking the file system copy (SAN snapshot, etc.) that thosedon't support. > We don't cater to this line of argument when it comes to breaking > changes in the backend, or when we break monitoring scripts, and I don't > see a reason why we should do so here. Those cases are not analogous. 1. Backend APIs are declared non-stable, and do not have a wide audience compared to backing up the database. 2. Monitoring scripts, while important, are not as critical as the backup system. (And, in fact, I didn't agree with breakingthose views either, but that's another discussion.) > Ok, then please do so, and please be prepared to continue to maintain > the documentation of both methods moving forward, because others have > tried and have (rightfully, in my opinion) decided that it's frankly not > worth the effort and ultimately just terribly confusing for users that > we have these two different backup methods and even just updating the > documentation for one or the other is downright painful (to the point > that people litterally give up on it). We're going to have to do that anyway. For as long as we are maintaining the documentation on a version that has both APIs,we're going to have to say, "Don't use this one, and *here's why*." Saying, "Don't use this one because we said so"when it is an API of long standing that works just as it always did isn't going to cut it. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 24, 2019, at 12:00, Stephen Frost <sfrost@snowman.net> wrote: > > > > Do they realize how that existing backup strategy is flawed? > > Undoubtedly, some do, some don't. However, given that it has been the *only* backup API for a very long time, many organizationshave spent a lot of time closing all of the holes. No, it *hasn't* been the only backup API for a long time- that was only true up until 9.6 was released, since then we've had both, and it's made everything a downright mess because of exactly these arguments. > It's not impossible to do safe backups with the existing API. Unquestionably, there are installations doing backups thatmight end up with a silently badly one, but it's entirely possible to do that unsafely (in many of the same ways) withthe non-exclusively API. Yes, it *is* impossible to do safe backups with the existing API. There is an unquestionable race condition where a system restart will cause your system to not come back up without you going in and removing the backup_label file- and the only way you make that race window small is to remove the backup_label file right after you run pg_start_backup and copy it, and then PUT IT BACK at the end before you call pg_stop_backup, which is insane, but otherwise the 'race window' is the ENTIRE length of the backup. This is the reason the non-exclusive backup mode *exists*. All of the effort that went into building the non-exclusive backup mode wasn't done just for the fun of it. > The installations that need to fix the scripts are also exactly the ones that can't use pg_basebackup or another pre-packagedsolution, usually because they have a specific way of taking the file system copy (SAN snapshot, etc.) that thosedon't support. I'm not going to bother going into the issues that SAN snapshots have here, because it's largely orthogonal to this discussion. What I'll say is that they'll have over 5 years to adjust those scripts and I don't see an issue with that. > > We don't cater to this line of argument when it comes to breaking > > changes in the backend, or when we break monitoring scripts, and I don't > > see a reason why we should do so here. > > Those cases are not analogous. I disagree *entirely*. > 1. Backend APIs are declared non-stable, and do not have a wide audience compared to backing up the database. *ALL* of our APIs are declared non-stable between major releases. That's why we have *major* and *minor* releases. > 2. Monitoring scripts, while important, are not as critical as the backup system. (And, in fact, I didn't agree with breakingthose views either, but that's another discussion.) Agreed, we should care even more about making sure that the tools we provide to perform a backup are reliable, which is a good reason to remove the broken and deprecated exclusive API. > > Ok, then please do so, and please be prepared to continue to maintain > > the documentation of both methods moving forward, because others have > > tried and have (rightfully, in my opinion) decided that it's frankly not > > worth the effort and ultimately just terribly confusing for users that > > we have these two different backup methods and even just updating the > > documentation for one or the other is downright painful (to the point > > that people litterally give up on it). > > We're going to have to do that anyway. For as long as we are maintaining the documentation on a version that has bothAPIs, we're going to have to say, "Don't use this one, and *here's why*." Saying, "Don't use this one because we saidso" when it is an API of long standing that works just as it always did isn't going to cut it. We have existing documentation for the versions that have both APIs. It sucks, it's horribly complicated, and no one wants to work on it because of that. Sure, it'd be nice if someone would fix it to list out the problems with the exclusive API, but that requires someone wanting to spend time on it. If that's not going to happen, I certainly don't think that means we should *keep* it. THanks! Stephen
Attachment
Andres Freund wrote: > On 2019-02-24 11:52:54 -0800, Christophe Pettus wrote: > > > On Feb 22, 2019, at 15:18, Stephen Frost <sfrost@snowman.net> wrote: > > > Getting a solid and resiliant backup to work from a shell script is, imv > > > anyway (though I might have a bit of experience, having tried numerous > > > times myself and then realizing that it just isn't practical...), a > > > downright fool's errand. Here is my run with the fool's cap on: https://github.com/cybertec-postgresql/safe-backup > > The reality, though, is that there are a lot of organizations who have > > invested time and effort into getting a backup strategy working using > > the existing APIs, and there will be quite a bit of pushback against > > the version in which the existing exclusive API is removed. > > > > Some of those will be able to move to non-exclusive backups easily; > > others won't. For the ones that can't move easily, the reaction will > > not be, "PostgreSQL version x has a safer backup API"; it will be > > "PostgreSQL version x broke our backups, so we're not upgrading to > > it." From the reactions I see in the field, I'd agree. > I think it also depends on how we remove exclusive backups. If we do it > by adding a 'pg_run_hot_backup --conection-options -- shell_command_to_copy' > it ought to be pretty simple to fix scripts. Whereas right now, with the > requirement to manually have a postgres connection alive, which is > annoying to do from shell, is not the case. > > I think if we'd introduced a command like that, we'd have seen a higher > adoption of non-exclusive backups. I think that is a good idea and will do the trick for many people. It will be harder for those whose backup solution is driven by a central backup software that backs up the file system and just offers "pre-backup" and "post-backup" hooks. Yours, Laurenz Albe
Stephen Frost wrote: > Yes, it *is* impossible to do safe backups with the existing API. There > is an unquestionable race condition where a system restart will cause > your system to not come back up without you going in and removing the > backup_label file- and the only way you make that race window small is > to remove the backup_label file right after you run pg_start_backup and > copy it, and then PUT IT BACK at the end before you call pg_stop_backup, > which is insane, but otherwise the 'race window' is the ENTIRE length of > the backup. I just have an idea: What about an option to keep WAL around for the duration of an exclusive backup? That way PostgreSQL can still restart after a crash. It will take longer than expected, but it will work. But then, perhaps the long recovery time is only marginally better than having to manually delete the backup_label file... Yours, Laurenz Albe
> On Feb 24, 2019, at 13:00, Stephen Frost <sfrost@snowman.net> wrote: > > No, it *hasn't* been the only backup API for a long time- that was only > true up until 9.6 was released, since then we've had both, and it's made > everything a downright mess because of exactly these arguments. 9.6 has been out for about 2.5 years. How long was the old version of pg_start_backup() the only interface? Considerablylonger. > Yes, it *is* impossible to do safe backups with the existing API. That's an overstatement. "The existing interface has failure conditions that are hard to work around" is true. Saying "it'simpossible to do safe backups" implies that no pre-9.6 system has ever been backed up. If we take that line, it's justgoing to create a "Oh, come *on*" reaction from users. > What I'll say > is that they'll have over 5 years to adjust those scripts and I don't > see an issue with that. That's not quite the situation. What they will have is a choice between upgrading, and or their existing scripts continuingto work. That choice will start ticking the instant a release without the old interface comes out. The back-pressureon upgrading is already very high; this is going to put a lot of installations into a bind, and a lot will makethe decision not to upgrade because of that. > *ALL* of our APIs are declared non-stable between major releases. > That's why we have *major* and *minor* releases. I note that a word game is being played here. We're calling pg_start_backup() "an API", as if it is the same as, say, theGiST C API, but it's a SQL level construct. SQL level constructs are not promises we need to keep forever, of course,but they're not the same class of promise as a C API. > Sure, it'd be nice if someone would fix it to list out the problems with > the exclusive API, but that requires someone wanting to spend time on > it. I'll take a turn at it. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > Stephen Frost wrote: > > Yes, it *is* impossible to do safe backups with the existing API. There > > is an unquestionable race condition where a system restart will cause > > your system to not come back up without you going in and removing the > > backup_label file- and the only way you make that race window small is > > to remove the backup_label file right after you run pg_start_backup and > > copy it, and then PUT IT BACK at the end before you call pg_stop_backup, > > which is insane, but otherwise the 'race window' is the ENTIRE length of > > the backup. > > I just have an idea: > > What about an option to keep WAL around for the duration of an exclusive backup? > > That way PostgreSQL can still restart after a crash. It will take longer than > expected, but it will work. But then, perhaps the long recovery time is only > marginally better than having to manually delete the backup_label file... I'm afraid that we'd end up with many, many complaints about people running out of disk space on WAL when they are trying to take a backup.. I do applaud your efforts to think of a better solution but I'm afraid that isn't really workable. While crashing with a backup_label in place definitely sucks and makes recovering from that not fun, it's probably better than having people run out of disk space and having the system PANIC from that during what would otherwise be perfectly normal operation. That would also seem like a bit of an odd difference between the exclusive and non-exclusive backup methods... and another things we'd have to write up documentation for if we kept both methods around to try and explain to users and that is just not a pleasant thought. Thanks! Stephen
Attachment
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 24, 2019, at 13:00, Stephen Frost <sfrost@snowman.net> wrote: > > > > No, it *hasn't* been the only backup API for a long time- that was only > > true up until 9.6 was released, since then we've had both, and it's made > > everything a downright mess because of exactly these arguments. > > 9.6 has been out for about 2.5 years. How long was the old version of pg_start_backup() the only interface? Considerablylonger. Right, and PG12 will be out for another *5* years beyond that, meaning people will have had 8.5 years to move from the exclusive API to the non-exclusive one. > > Yes, it *is* impossible to do safe backups with the existing API. > > That's an overstatement. "The existing interface has failure conditions that are hard to work around" is true. Saying"it's impossible to do safe backups" implies that no pre-9.6 system has ever been backed up. If we take that line,it's just going to create a "Oh, come *on*" reaction from users. Every one of the exclusive backups that was taken wasn't *safe*, and a crash during any of them (which we've certainly heard plenty about through various support channels over the years...) would have resulted in a failure to properly restart. No, I'm not saying that such backups are corrupted, *that* is an overstatement, but it isn't actually what I'm saying, just a strawman you've come up with to argue against. > > What I'll say > > is that they'll have over 5 years to adjust those scripts and I don't > > see an issue with that. > > That's not quite the situation. What they will have is a choice between upgrading, and or their existing scripts continuingto work. That choice will start ticking the instant a release without the old interface comes out. The back-pressureon upgrading is already very high; this is going to put a lot of installations into a bind, and a lot will makethe decision not to upgrade because of that. That's fine- they can push off upgrading, for as long as 5 years, and we'll still support them. That's the model we operate under for everything, this isn't any different in that regard and I don't have sympathy for people who insist on saying that *this*, just *this* one API of *all* the ones we have simply *can't* be changed *ever*. > > *ALL* of our APIs are declared non-stable between major releases. > > That's why we have *major* and *minor* releases. > > I note that a word game is being played here. We're calling pg_start_backup() "an API", as if it is the same as, say,the GiST C API, but it's a SQL level construct. SQL level constructs are not promises we need to keep forever, of course,but they're not the same class of promise as a C API. I'm not following here, at all... We're actually better, historically, about not changing SQL-level constructs than C-level APIs, but we can and do still change them, regularly, with every major release- including making new reserved keywords and other very clearly potentially breaking changes. > > Sure, it'd be nice if someone would fix it to list out the problems with > > the exclusive API, but that requires someone wanting to spend time on > > it. > > I'll take a turn at it. Just to be clear- I don't agree that simply documenting the issues with it is a sufficient solution to make us keep it. I'll certainly continue to advocate for its removal, but it would be nice for people who are currently using it to be aware of the issues, so I do think it'd be good to have the documentation improved in that regard, just make sure to not invalidate the documentation around non-exclusive backup while you're at it. Thanks! Stephen
Attachment
> On Feb 24, 2019, at 13:44, Stephen Frost <sfrost@snowman.net> wrote: > > Right, and PG12 will be out for another *5* years beyond that, meaning > people will have had 8.5 years to move from the exclusive API to the > non-exclusive one. The thing is that for 90% of installations, the clock will start ticking when the deprecation. Until then, most of themare not going to feel any pressure to make the change. Most installations have a lot of other things on their minds. They have the option of not upgrading, and that will be the option a lot of them take. > Every one of the exclusive backups that was taken wasn't *safe*, and a > crash during any of them (which we've certainly heard plenty about > through various support channels over the years...) would have resulted > in a failure to properly restart. Most installations don't routinely encounter this. I know it happens, and it is definitely a problem, but the field is notstrewn with corpses here. The new interface is unquestionably an improvement, but let's not get our messaging amped upto the point that it hurts our credibility. > No, I'm not saying that such backups are corrupted, *that* is an > overstatement, but it isn't actually what I'm saying, just a strawman > you've come up with to argue against. That may not be what you are saying, but statements like "it *is* impossible to do safe backups with the existing API" willbe heard that way. Let's keep the messaging we give users accurate. > this isn't any different in that regard and I don't have > sympathy for people who insist on saying that *this*, just *this* one > API of *all* the ones we have simply *can't* be changed *ever*. Sure, we can change anything. It's whether this particular deprecation is a good idea. > I'm not following here, at all... We're actually better, historically, > about not changing SQL-level constructs than C-level APIs That's my point. Calling this an "API" makes the change sound more routine than it is. It's an "API" that a *lot* of installationsdepend on. > I don't agree that simply documenting the issues with > it is a sufficient solution to make us keep it. Understood, but I think we need to document it no matter what. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 24, 2019, at 13:44, Stephen Frost <sfrost@snowman.net> wrote: > > Right, and PG12 will be out for another *5* years beyond that, meaning > > people will have had 8.5 years to move from the exclusive API to the > > non-exclusive one. > > The thing is that for 90% of installations, the clock will start ticking when the deprecation. Until then, most of themare not going to feel any pressure to make the change. Most installations have a lot of other things on their minds. They have the option of not upgrading, and that will be the option a lot of them take. Just to be clear- it was *already* deprecated, we are just talking here about making that more clear/obvious. And, again, they're welcome to not upgrade for as long as they'd like, frankly, but we'll only provide back-patches for 5 years. > > Every one of the exclusive backups that was taken wasn't *safe*, and a > > crash during any of them (which we've certainly heard plenty about > > through various support channels over the years...) would have resulted > > in a failure to properly restart. > > Most installations don't routinely encounter this. I know it happens, and it is definitely a problem, but the field isnot strewn with corpses here. The new interface is unquestionably an improvement, but let's not get our messaging ampedup to the point that it hurts our credibility. > > > No, I'm not saying that such backups are corrupted, *that* is an > > overstatement, but it isn't actually what I'm saying, just a strawman > > you've come up with to argue against. > > That may not be what you are saying, but statements like "it *is* impossible to do safe backups with the existing API"will be heard that way. Let's keep the messaging we give users accurate. > > > this isn't any different in that regard and I don't have > > sympathy for people who insist on saying that *this*, just *this* one > > API of *all* the ones we have simply *can't* be changed *ever*. > > Sure, we can change anything. It's whether this particular deprecation is a good idea. You say above that the new interface is unquestionably an improvement and here say that we shouldn't deprecate the old one in favor of it (even though we actually already have... but that's beside the point I'm trying to make here), so what you're advocating for is that we keep an old and known broken interface that we know causes real issues even after we've developed a new and unquestionably better one. For my 2c, I think keeping the old interface around at all was a mistake from the start, just the same as things like pg_user have been kept around forever because of fear of breaking things, even as we rip out the recovery.conf file or rename tons of things in the catalog across major versions. I really, honestly, believe what we need to *stop* doing is trying to drag along support for old/deprecated interfaces after we've introduced new ones, thus avoiding these arguments and the time spent on them, and the time spent dealing with implementing and documenting new APIs around old ones. The only thing it seems to unquestionably do is make us argue about when we are going to *finally* rip out the old/deprecated API (or interface, or whatever you want to call it). This is the new "should we call it PG 10.0 or PG 9.7 this time?" argument, all over again... > > I'm not following here, at all... We're actually better, historically, > > about not changing SQL-level constructs than C-level APIs > > That's my point. Calling this an "API" makes the change sound more routine than it is. It's an "API" that a *lot* ofinstallations depend on. A lot of them depended on pg_wal being named pg_xlog too, but we seem to have managed reasonably well through that, not to mention all the catalog changes that caused issues for monitoring, etc. > > I don't agree that simply documenting the issues with > > it is a sufficient solution to make us keep it. > > Understood, but I think we need to document it no matter what. Sure, go for it. Thanks! Stephen
Attachment
Hi, On 2019-02-24 17:19:22 -0500, Stephen Frost wrote: > You say above that the new interface is unquestionably an improvement FWIW, I think we didn't design it even remotely as well as we ought to have. It was both a mistake to not offer a version of non-exclusive backups that works with a client connection (for integration with the TSMs of this world), and it was a mistake not to offer a commandline tool that does the non-exclusive pg/start backup. > I really, honestly, believe what we need to *stop* doing is trying to > drag along support for old/deprecated interfaces after we've introduced > new ones, thus avoiding these arguments and the time spent on them, and > the time spent dealing with implementing and documenting new APIs around > old ones. The only thing it seems to unquestionably do is make us argue > about when we are going to *finally* rip out the old/deprecated API (or > interface, or whatever you want to call it). While I agree that we should remove non-exclusive backups, I VEHEMENTLY oppose the unconditionality of the above statement. Yes, it's annoying to keep interfaces around, but unnecessarily inflicting pain to everyone also is annoying. That's not to say we shouldn't be better at announcing and then following through on the deprecation of old interfaces. Greetings, Andres Freund
> On Feb 24, 2019, at 14:19, Stephen Frost <sfrost@snowman.net> wrote: > You say above that the new interface is unquestionably an improvement > and here say that we shouldn't deprecate the old one in favor of it > (even though we actually already have... but that's beside the point I'm > trying to make here), so what you're advocating for is that we keep an > old and known broken interface that we know causes real issues even > after we've developed a new and unquestionably better one. Yes, I am advocating exactly that. The reason that I think we need to keep the old one (or, at least, not remove it as soonas 12) is that creating an obstacle to upgrades is worse than retaining the old one, and it *will* be an obstacle toupgrades (or to using the community edition at all). I do think we need a simple, pg_basebackup-level-complexity hot backup tool that allows a forked copy operation, per Andres'suggestion. There's no single answer to every possible situation like this that pops up; it depends on how widely used the interfacewas, how reasonable the expectation of permanence is, and the difficulty the users will encounter in changing tothe new one, along with the complexity of maintaining the documentation and code for the old interface. > A lot of them depended on pg_wal being named pg_xlog too, but we seem to > have managed reasonably well through that, not to mention all the > catalog changes that caused issues for monitoring, etc. Some of the incompatible catalog changes (in particular, in pg_stat_activity) I thought were gratuitous, but we did them,and no point in relitigating that now. I'd say that the internal layout of PGDATA is fairly weak promise compared toan SQL-level construct, especially one as widely used as pg_start_backup(). -- -- Christophe Pettus xof@thebuild.com
Hi, On 2019-02-24 14:35:04 -0800, Christophe Pettus wrote: > > On Feb 24, 2019, at 14:19, Stephen Frost <sfrost@snowman.net> wrote: > > You say above that the new interface is unquestionably an improvement > > and here say that we shouldn't deprecate the old one in favor of it > > (even though we actually already have... but that's beside the point I'm > > trying to make here), so what you're advocating for is that we keep an > > old and known broken interface that we know causes real issues even > > after we've developed a new and unquestionably better one. > > Yes, I am advocating exactly that. The reason that I think we need to > keep the old one (or, at least, not remove it as soon as 12) is that > creating an obstacle to upgrades is worse than retaining the old one, > and it *will* be an obstacle to upgrades (or to using the community > edition at all). It sounds to me like you treat this as if having the old method around had no downsides. But I have seen *numereous* downtimes due to it, and also corruption triggered by it (people following the hint to remove backup_label). > > A lot of them depended on pg_wal being named pg_xlog too, but we seem to > > have managed reasonably well through that, not to mention all the > > catalog changes that caused issues for monitoring, etc. > > Some of the incompatible catalog changes (in particular, in > pg_stat_activity) I thought were gratuitous, but we did them, and no > point in relitigating that now. I'd say that the internal layout of > PGDATA is fairly weak promise compared to an SQL-level construct, > especially one as widely used as pg_start_backup(). I don't buy that, because you normally specifically should exclude pg_xlog/pg_wal from basebackups. Greetings, Andres Freund
On 2/24/19 11:36 PM, Stephen Frost wrote: > Greetings, > > * Laurenz Albe (laurenz.albe@cybertec.at) wrote: >> Stephen Frost wrote: >>> Yes, it *is* impossible to do safe backups with the existing API. There >>> is an unquestionable race condition where a system restart will cause >>> your system to not come back up without you going in and removing the >>> backup_label file- and the only way you make that race window small is >>> to remove the backup_label file right after you run pg_start_backup and >>> copy it, and then PUT IT BACK at the end before you call pg_stop_backup, >>> which is insane, but otherwise the 'race window' is the ENTIRE length of >>> the backup. >> >> I just have an idea: >> >> What about an option to keep WAL around for the duration of an exclusive backup? >> >> That way PostgreSQL can still restart after a crash. It will take longer than >> expected, but it will work. But then, perhaps the long recovery time is only >> marginally better than having to manually delete the backup_label file... > > I'm afraid that we'd end up with many, many complaints about people > running out of disk space on WAL when they are trying to take a backup.. This would also require replaying all that WAL during crash recovery which could mean a much longer startup time. -- -David david@pgmasters.net
On 2/25/19 12:35 AM, Christophe Pettus wrote: > > >> On Feb 24, 2019, at 14:19, Stephen Frost <sfrost@snowman.net> wrote: >> You say above that the new interface is unquestionably an improvement >> and here say that we shouldn't deprecate the old one in favor of it >> (even though we actually already have... but that's beside the point I'm >> trying to make here), so what you're advocating for is that we keep an >> old and known broken interface that we know causes real issues even >> after we've developed a new and unquestionably better one. > > Yes, I am advocating exactly that. The reason that I think we need to keep the old one (or, at least, not remove it assoon as 12)... Exclusive backup will not be removed for PG12. There wasn't support for it so I push it out to PG13. There does appear to be support for removing it in PG13, though, and I'd like to see that done sooner than later. > I do think we need a simple, pg_basebackup-level-complexity hot backup tool that allows a forked copy operation, per Andres'suggestion. This is a good idea -- but I'm still a bit mystified why the ability to run a backup in shell script is considered to be a hard requirement. Just about any language you can name performs non-exclusive backups with ease -- except shell. Perhaps the problem is with the language? Regards, -- -David david@pgmasters.net
On 2/22/19 7:10 PM, Robert Haas wrote: > On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao <masao.fujii@gmail.com> wrote: >> -1 for the removal. I think that there are still many users of an exclusive >> backup API, and it's not so easy to migrate their backup scripts so that >> only non-exclusive one is used because of the reason I wrote in another thread. >> https://postgr.es/m/CAHGQGwHUkEbkVexVfWNLjmq2rzOS_SHYMiECt+KBn-cBPq5Arg@mail.gmail.com > > Yeah, those are interesting points. Unfortunately, I think something > as simple as your proposed... > > psql -c "SELECT pg_start_backup()" > rsync, cp, tar, storage backup, or something > psql -c "SELECT pg_stop_backup()" > > ...doesn't have much chance of being correct. If you aren't checking > whether pg_start_backup() throws an error, for example, you have got a > very failure-prone set-up. That being said, it's possible to fix that > problem using some shell logic, whereas the problem of keeping a > connection open for the duration of the backup from the shell is much > harder. I recently ran across a case where someone attempted it but > did not get the logic entirely right, with rather painful > consequences. > > Now you might say that people should not write their own tools and > just use professionally produced ones. But I don't really agree with > that, and whatever we think people SHOULD do, there are in fact a lot > of people using their own tools. > FWIW +1 for not remove exclusive backup. Maintain a connection during the backup is a hard pre-requisite. In my previous jobs I already done custom scripts to perform backup by using pre/post backup hook to control backup software: With vmware to do PITR backup with VM snapshot or with another file backup tool which perform block deduplication. I do not see where is the problem if you check pg_start_backup()/pg_stop_backup() went well? It will be annoying if after this removal, companies must change their backup strategy by using specific postgres tools (pgbackrest, barman). Regards,
David Steele <david@pgmasters.net> writes: > On 2/25/19 12:35 AM, Christophe Pettus wrote: >> >> >>> On Feb 24, 2019, at 14:19, Stephen Frost <sfrost@snowman.net> wrote: >>> You say above that the new interface is unquestionably an improvement >>> and here say that we shouldn't deprecate the old one in favor of it >>> (even though we actually already have... but that's beside the point I'm >>> trying to make here), so what you're advocating for is that we keep an >>> old and known broken interface that we know causes real issues even >>> after we've developed a new and unquestionably better one. >> >> Yes, I am advocating exactly that. The reason that I think we need >> to keep the old one (or, at least, not remove it as soon as 12)... > > Exclusive backup will not be removed for PG12. There wasn't support for > it so I push it out to PG13. How about making the deprecation more obvious in PG12, by making pg_start_backup() in exclusive mode issue a WARNING that it will be removed in PG13? We had a similar problem in Perl, where we got pushback for removing deprecated features because they had been deprecated for so long (in some cases over 20 years) with no sign of actually being removed. We alleviated this by making all deprecation warnings include the version in which the feature would be removed (at least two major versions after the one where the versioned warning was introduced). Having this sort of clarity about what deprecation actually constitutes is also useful when deciding whether to deprecate or merely discourage a feature. My two minor currency units, - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen
Greetings, * Adrien NAYRAT (adrien.nayrat@anayrat.info) wrote: > On 2/22/19 7:10 PM, Robert Haas wrote: > >On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao <masao.fujii@gmail.com> wrote: > >>-1 for the removal. I think that there are still many users of an exclusive > >>backup API, and it's not so easy to migrate their backup scripts so that > >>only non-exclusive one is used because of the reason I wrote in another thread. > >>https://postgr.es/m/CAHGQGwHUkEbkVexVfWNLjmq2rzOS_SHYMiECt+KBn-cBPq5Arg@mail.gmail.com > > > >Yeah, those are interesting points. Unfortunately, I think something > >as simple as your proposed... > > > >psql -c "SELECT pg_start_backup()" > >rsync, cp, tar, storage backup, or something > >psql -c "SELECT pg_stop_backup()" > > > >...doesn't have much chance of being correct. If you aren't checking > >whether pg_start_backup() throws an error, for example, you have got a > >very failure-prone set-up. That being said, it's possible to fix that > >problem using some shell logic, whereas the problem of keeping a > >connection open for the duration of the backup from the shell is much > >harder. I recently ran across a case where someone attempted it but > >did not get the logic entirely right, with rather painful > >consequences. > > > >Now you might say that people should not write their own tools and > >just use professionally produced ones. But I don't really agree with > >that, and whatever we think people SHOULD do, there are in fact a lot > >of people using their own tools. > > FWIW +1 for not remove exclusive backup. > > Maintain a connection during the backup is a hard pre-requisite. In my > previous jobs I already done custom scripts to perform backup by using > pre/post backup hook to control backup software: With vmware to do PITR > backup with VM snapshot or with another file backup tool which perform block > deduplication. I do not see where is the problem if you check > pg_start_backup()/pg_stop_backup() went well? This, really, is a large part of the problem- people don't realize and just simply don't want to believe, it seems at least, that there's any problem with what they're doing because it works some, or even most, of the time. Making a running PG cluster *look* like it was restored from a backup is really bad and leads to a lot of confusion and cases where PG won't start up, but that's what the exclusive pg_start_backup/pg_stop_backup method does and is more-or-less *required* of any solution that wants to just be able to tar up the data directory as the backup method, which is why it ends up just being a bad interface overall. And then there's other issues- things like making sure that you actually got all of the WAL, without which the backup is inconsistent being the leading one but there's others. > It will be annoying if after this removal, companies must change their > backup strategy by using specific postgres tools (pgbackrest, barman). You don't write your own database system using CSV files and shell magic, do you? I have to say that it continues to boggle my mind that people insist that *this* part of the system has to be able to be implementable using shell scripts. Folks, these are your backups we're talking about, your last resort if everything else goes up in flames, why do you want to risk that by implementing your own one-off solution, particularly when there's known serious issues using that interface, and you want to just use shell scripts to do it...? Thanks! Stephen
Attachment
Greetings, * Dagfinn Ilmari Mannsåker (ilmari@ilmari.org) wrote: > David Steele <david@pgmasters.net> writes: > > On 2/25/19 12:35 AM, Christophe Pettus wrote: > >>> On Feb 24, 2019, at 14:19, Stephen Frost <sfrost@snowman.net> wrote: > >>> You say above that the new interface is unquestionably an improvement > >>> and here say that we shouldn't deprecate the old one in favor of it > >>> (even though we actually already have... but that's beside the point I'm > >>> trying to make here), so what you're advocating for is that we keep an > >>> old and known broken interface that we know causes real issues even > >>> after we've developed a new and unquestionably better one. > >> > >> Yes, I am advocating exactly that. The reason that I think we need > >> to keep the old one (or, at least, not remove it as soon as 12)... > > > > Exclusive backup will not be removed for PG12. There wasn't support for > > it so I push it out to PG13. > > How about making the deprecation more obvious in PG12, by making > pg_start_backup() in exclusive mode issue a WARNING that it will be > removed in PG13? So.. We've more-or-less come full circle back to where the thread had left off last time- the plan is to update the documentation to make it clearer that the exclusive mode is deprecated and that it's going to be removed, and then we'll actually remove it in PG13. We could also add such a warning but I'm not entirely convinced it's a good idea. There was some earlier discussion about it that I'd probably want to go back and reread first. Thanks! Stephen
Attachment
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-02-24 17:19:22 -0500, Stephen Frost wrote: > > You say above that the new interface is unquestionably an improvement > > FWIW, I think we didn't design it even remotely as well as we ought to > have. It was both a mistake to not offer a version of non-exclusive > backups that works with a client connection (for integration with the > TSMs of this world), and it was a mistake not to offer a commandline > tool that does the non-exclusive pg/start backup. I'm not really following- did you mean to say "without a client connection" above? We could still add some kind of commandline tool to help with the non-exclusive pg start/stop backup but I'm not really sure exactly what that would look like and I honestly don't have terribly much interest in spending resources on implementing it since it would lack a great many features that we'd then end up wanting to add on... and then we end up backing into building yet another backup tool. > > I really, honestly, believe what we need to *stop* doing is trying to > > drag along support for old/deprecated interfaces after we've introduced > > new ones, thus avoiding these arguments and the time spent on them, and > > the time spent dealing with implementing and documenting new APIs around > > old ones. The only thing it seems to unquestionably do is make us argue > > about when we are going to *finally* rip out the old/deprecated API (or > > interface, or whatever you want to call it). > > While I agree that we should remove non-exclusive backups, I VEHEMENTLY > oppose the unconditionality of the above statement. Yes, it's annoying > to keep interfaces around, but unnecessarily inflicting pain to everyone > also is annoying. Alright, then how about we provide a bit of help for everyone who's got a system built around recovery.conf today, instead of just completely ripping that out? If we're happy to do that then I really can't agree with these arguments that there's things we should try to maintain when it comes to interfaces, and that's far from the only example of a pretty massive change that just went into version X with zero notice to users about what they're using today being deprecated. > That's not to say we shouldn't be better at announcing and then > following through on the deprecation of old interfaces. We announce things have changed every year, and people have five years from that point, during which we'll continue to support them and fix bugs and deal with security issues, to make whatever changes they need to for the new interface. The argument seems to be that people want new features but they don't want to have to do any work to get those new features. Instead, they expect the community to take on the burden of maintaining those old interfaces for them, so that they can get those new features. That seems like a pretty raw deal for the community and not one that I really think we should be supporting, and it isn't like we're actually consistent with it at all- except that we don't break the back-branches and we do make breaking changes in major versions. When is something too big of a change to make in a given major version and we have to, instead, provide backwards compatibility for it? What is that policy? How many releases do we have to support it for? How do we communicate that to our users, so that they know what they can depend on being there across major releases and what they can't? Thanks! Stephen
Attachment
Stephen Frost wrote: > > It will be annoying if after this removal, companies must change their > > backup strategy by using specific postgres tools (pgbackrest, barman). > > You don't write your own database system using CSV files and shell > magic, do you? I have to say that it continues to boggle my mind that > people insist that *this* part of the system has to be able to be > implementable using shell scripts. > > Folks, these are your backups we're talking about, your last resort if > everything else goes up in flames, why do you want to risk that by > implementing your own one-off solution, particularly when there's known > serious issues using that interface, and you want to just use shell > scripts to do it...? If we didn't think that simplicity in backup has some value, why does PostgreSQL provide pg_basebackup? Not everybody wants to use tools like pgbackrest and barman because it takes some effort to set them up properly. It seems that you think that people who want something simpler are irresponsible. Incidentally, both barman and pgbackrest stream the backup to a backup server. I think the most important use case for the low level backup API is when your database is so large that you want to leverage storage techniques like snapshots for backup, because you can't or don't want to stream all those terabytes across the network. I'm not playing devil's advocate here to annoy you. I see the problems with the exclusive backup, and I see how it can hurt people. I just think that removing exclusive backup without some kind of help like Andres sketched above will make people unhappy. Yours, Laurenz Albe
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > Stephen Frost wrote: > > > It will be annoying if after this removal, companies must change their > > > backup strategy by using specific postgres tools (pgbackrest, barman). > > > > You don't write your own database system using CSV files and shell > > magic, do you? I have to say that it continues to boggle my mind that > > people insist that *this* part of the system has to be able to be > > implementable using shell scripts. > > > > Folks, these are your backups we're talking about, your last resort if > > everything else goes up in flames, why do you want to risk that by > > implementing your own one-off solution, particularly when there's known > > serious issues using that interface, and you want to just use shell > > scripts to do it...? > > If we didn't think that simplicity in backup has some value, why does > PostgreSQL provide pg_basebackup? I'm all for people using pg_basebackup and encourage them to do so, but, really, that tool *is* too simple by itself: you need to combine it with something else that handles data retention, backup validation, etc. This could be some independent backup solution though, since pg_basebackup gives you a simple set of files for the backup that can then be backed up using those other backup tools. (The same is true of pgbackrest, btw, and people certainly do use pgbackrest to back up to a repository and then back that repo up using other tools, but in this case pgbackrest does have some of those data retention and backup validation capabilities itself, so it isn't necessary to do that.) > Not everybody wants to use tools like pgbackrest and barman because it > takes some effort to set them up properly. It seems that you think that > people who want something simpler are irresponsible. I don't mind simpler (we actually try pretty hard to make pgbackrest simple to use, in fact... the simplest config only requires a 4-line config file), provided that it actually does everything that you need a backup tool to do. I do think it's an unfortunate reality that not enough people take the time to think about what they really need from a backup solution and spend the time to either find or build a proper solution. > Incidentally, both barman and pgbackrest stream the backup to a backup server. I'm pretty sure both can backup to a local repository too (I know pgbackrest can), and not just stream to a backup server, so I'm not really sure what you're getting at here. > I think the most important use case for the low level backup API is when > your database is so large that you want to leverage storage techniques > like snapshots for backup, because you can't or don't want to stream all > those terabytes across the network. Snapshots place an awful lot of faith in your storage layer- particularly when you aren't doing any kind of validation that the snapshot is in good shape and that there hasn't been any in-place latent corruption, ever. If you think about it, a snapshot is actually very similar to just always doing incremental backups with pgbackrest and never doing a new full backup or even a check of those old files that haven't changed to see if they're still valid. It's not an approach I'd recommend depending on exclusively. There's ways to address that, of course, such as building a manifest of checksums of the files in the data directory, such that you can then validate that the snapshot is correct, and detect if any corruption ends up happening (and checksum the manifest as well), though you need to also track the LSN of the pg_start_backup, so you can know that changes which happened after that LSN are in the WAL that's captured during the checksum, and you need a place to put all of this for when you want to go back and re-validate an old snapshot.. and likely other things that I'm forgetting at the moment, but it's definitely something we've thought about and have considered how we might add support to pgbackrest for working with snapshots (and it wouldn't surprise me if that got implemented before v13 even..). If you are happy to place complete faith in snapshots and trust that they're entirely atomic, you can always forgo dealing with any of this and just snapshot PGDATA and let PG go through crash recovery when you restore. > I'm not playing devil's advocate here to annoy you. I see the problems > with the exclusive backup, and I see how it can hurt people. > I just think that removing exclusive backup without some kind of help > like Andres sketched above will make people unhappy. Keeping it will definitely make me unhappy. :) I'm not against something simple being added to help with the new API, or with keeping the old API if someone can figure out how to solve for the known issues with it, just to be clear. I do worry that the 'simple' thing we add either won't help in a lot of cases or will end up being much more complex and difficult to use than we're thinking, but we can discuss that if someone actually implements it, I suppose. Thanks! Stephen
Attachment
> On Feb 25, 2019, at 05:14, Stephen Frost <sfrost@snowman.net> wrote: > > You don't write your own database system using CSV files and shell > magic, do you? I have to say that it continues to boggle my mind that > people insist that *this* part of the system has to be able to be > implementable using shell scripts. It's not complicated why: Backup has to interact with a lot more components of the overall environment than a CSV export,and those components vary *wildly* from one installation to another, and are often over-contrained by local policy. -- -- Christophe Pettus xof@thebuild.com
> On Feb 25, 2019, at 05:18, Stephen Frost <sfrost@snowman.net> wrote: > So.. We've more-or-less come full circle back to where the thread had > left off last time- the plan is to update the documentation to make it > clearer that the exclusive mode is deprecated and that it's going to be > removed, and then we'll actually remove it in PG13. That's not my position, certainly; I still object to its removal. -- -- Christophe Pettus xof@thebuild.com
> On Feb 24, 2019, at 22:49, David Steele <david@pgmasters.net> wrote: > This is a good idea -- but I'm still a bit mystified why the ability to run a backup in shell script is considered to bea hard requirement. Just about any language you can name performs non-exclusive backups with ease -- except shell. Perhapsthe problem is with the language? Many installations have a high degree of discomfort implementing system management functionality in something other thanbash. It can be for any number of reasons: previous bad experiences (*cough* RHEL/Python *cough*), the feeling thattoo many dependencies are dragged in, or just lack of familiarity with them. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 25, 2019, at 05:18, Stephen Frost <sfrost@snowman.net> wrote: > > So.. We've more-or-less come full circle back to where the thread had > > left off last time- the plan is to update the documentation to make it > > clearer that the exclusive mode is deprecated and that it's going to be > > removed, and then we'll actually remove it in PG13. > > That's not my position, certainly; I still object to its removal. It wasn't my intent to outline that as your position, my apologies if it came across that way. I do believe that's still the plan though, based on both this and the prior discussion, unless something else happens to address the known issues. Thanks! Stephen
Attachment
On Sun, Feb 24, 2019 at 3:00 PM Stephen Frost <sfrost@snowman.net> wrote: > Ok, then please do so, and please be prepared to continue to maintain > the documentation of both methods moving forward, because others have > tried and have (rightfully, in my opinion) decided that it's frankly not > worth the effort and ultimately just terribly confusing for users that > we have these two different backup methods and even just updating the > documentation for one or the other is downright painful (to the point > that people litterally give up on it). That really isn't a good place > to be in. This, to me, is downright rude. You don't get to tell other people what to do. Nor do you get to tell other people "hey, I'm going to make this change that you don't like unless you agree to do the work which I specify." If you want to make a change, you have to build consensus for that change. If you can't get people to agree with your proposed change, what happens is that you don't get to make that change. Whether other people choose to do any work that you and they might happen to agree is valuable is up to them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 25, 2019 at 1:49 AM David Steele <david@pgmasters.net> wrote: > Exclusive backup will not be removed for PG12. There wasn't support for > it so I push it out to PG13. > > There does appear to be support for removing it in PG13, though, and I'd > like to see that done sooner than later. Given the additional votes that have appeared since the previous discussion, I am no longer convinced that there is a consensus for ripping it out in PG13. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sun, Feb 24, 2019 at 3:00 PM Stephen Frost <sfrost@snowman.net> wrote: > > Ok, then please do so, and please be prepared to continue to maintain > > the documentation of both methods moving forward, because others have > > tried and have (rightfully, in my opinion) decided that it's frankly not > > worth the effort and ultimately just terribly confusing for users that > > we have these two different backup methods and even just updating the > > documentation for one or the other is downright painful (to the point > > that people litterally give up on it). That really isn't a good place > > to be in. > > This, to me, is downright rude. You don't get to tell other people > what to do. Nor do you get to tell other people "hey, I'm going to > make this change that you don't like unless you agree to do the work > which I specify." If you want to make a change, you have to build > consensus for that change. If you can't get people to agree with your > proposed change, what happens is that you don't get to make that > change. Whether other people choose to do any work that you and they > might happen to agree is valuable is up to them. It wasn't my intent to be rude, and my apologies to you and Christophe for it coming across that way. I do think we can ask that people who wish to have a capability included in PG (or continue to be included when there are serious known issues with it...) be prepared to either build and maintain it themselves or to convince someone else to do so (or both, and have a committer agree to it). That's how we've long operated and it wasn't my intent to imply otherwise, but I agree that I could have said it in a nicer way to avoid it coming across as telling Christophe what to do. I'm also on board with building a consensus for making a change, but a consensus does not mean that everyone must agree or be supportive of the change. There's also a practical side to things which is that if the consensus seems to be "we are going to get rid of X" and someone wants to work on X, we should probably make it pretty clear to them that there's a consensus to remove it to allow them the option to decide if they wish to still work on X, or not. Similairly, I think it's good to let people know who want to work on Y when Y is really hard, so that they have some idea what they're getting into. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 8:42 AM Stephen Frost <sfrost@snowman.net> wrote: > Alright, then how about we provide a bit of help for everyone who's got > a system built around recovery.conf today, instead of just completely > ripping that out? > > If we're happy to do that then I really can't agree with these arguments > that there's things we should try to maintain when it comes to > interfaces, and that's far from the only example of a pretty massive > change that just went into version X with zero notice to users about > what they're using today being deprecated. It's not the same thing. First of all, the recovery.conf changes have been under discussion for quite a few years now, whereas this change has been under discussion for only a few months. It is also worth noting that those recovery.conf changes would've been made years ago if people hadn't objected to breaking backward compatibility; people have just as much of a right to object to these changes on those grounds as they did to those changes. Second, the recovery.conf changes are intended to provide a tangible benefit - namely, getting the recovery.conf settings into the GUC system, so that they can be queried and changed using methods available for other GUCs. But the change proposed here breaks things for people who are using the mechanism in question and it doesn't give them anything else instead. Indeed, it seems quite obvious that you don't have the LEAST interest in putting any effort into actually making things in this area better for users; it appears that you just want to break what they're doing now and shove the work of coping with those changes onto them. Christophe is quite right to question whether that will cause users not to upgrade. I think it's overtly user-hostile. You have referred to the effort of maintaining this code, but you haven't pointed to any tangible improvement that you'd like to make in this area that is actually being blocked by the existence of exclusive backup mode. It's all just griping about how terrible the whole thing is, and it seems that there are actually quite a few people here who find it less terrible than you think it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 22, 2019 at 6:18 PM Stephen Frost <sfrost@snowman.net> wrote: > What we need is a backup tool included in core that users feel > comfortable using instead of trying to write their own. I agree. That's a great idea. Let's talk about how to make that happen. Providing a tool that gives people MORE AND BETTER options than what they have today is likely to make users more happy. Removing an option that people are currently using, and which they find better than other available options for reasons with which I understand that you disagree, will make users more sad. Happy is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Feb 25, 2019 at 8:42 AM Stephen Frost <sfrost@snowman.net> wrote: > > Alright, then how about we provide a bit of help for everyone who's got > > a system built around recovery.conf today, instead of just completely > > ripping that out? > > > > If we're happy to do that then I really can't agree with these arguments > > that there's things we should try to maintain when it comes to > > interfaces, and that's far from the only example of a pretty massive > > change that just went into version X with zero notice to users about > > what they're using today being deprecated. > > It's not the same thing. First of all, the recovery.conf changes have > been under discussion for quite a few years now, whereas this change > has been under discussion for only a few months. This argument doesn't work for me because it doesn't make *any* difference to our users- the vast majority of whom don't follow -hackers. Our users will simply see that in v12 the system they've built around recovery.conf won't work and they'll have to learn the new way and adjust everything to deal with it. If that argument did matter, we could go back and find the prior discussions about the issues around the exclusive backup mode and about removing it, or next year we could point to this thread about it, or the year after, and say "well, we talked about it a lot, so now let's just do it", but it all ends up looking the same to our users unless we actually write into some kind of user-facing documentation or WARNINGs or similar that something is going away. > It is also worth > noting that those recovery.conf changes would've been made years ago > if people hadn't objected to breaking backward compatibility; people > have just as much of a right to object to these changes on those > grounds as they did to those changes. Second, the recovery.conf > changes are intended to provide a tangible benefit - namely, getting > the recovery.conf settings into the GUC system, so that they can be > queried and changed using methods available for other GUCs. But the > change proposed here breaks things for people who are using the > mechanism in question and it doesn't give them anything else instead. This isn't a fair argument either because we're having this *after* the new API was implemented for backup- namely non-exclusive mode, which *did* give people something better to use instead. If we went back and added back into the v12 branch support for a recovery.conf file, then we'd be having exactly this same argument next year about if we should remove the recovery.conf file support or leave it in- and you could make the same argument then that removing the recovery.conf support doesn't give the user anything tangible to replace it with- because that would have already been done. > Indeed, it seems quite obvious that you don't have the LEAST interest > in putting any effort into actually making things in this area better > for users; it appears that you just want to break what they're doing > now and shove the work of coping with those changes onto them. What will make things better for users is actually moving to the new API that's been available for years now, so that they can avoid the pitfalls with the old API. > Christophe is quite right to question whether that will cause users > not to upgrade. I think it's overtly user-hostile. You have referred > to the effort of maintaining this code, but you haven't pointed to any > tangible improvement that you'd like to make in this area that is > actually being blocked by the existence of exclusive backup mode. I disagree quite a bit with this statement- the existing documentation is absolutely horrid and needs to be completely ripped out and rewritten and maintaining the exclusive backup mode in such a rewrite would absolutely be a *lot* of additional work. We actually took a shot at trying to improve the documentation while continuing to cover both the exclusive and the non-exclusive mode and it's hugely painful. > It's all just griping about how terrible the whole thing is, and it > seems that there are actually quite a few people here who find it less > terrible than you think it is. I used to be one of those people. I know that it looks fine and it certainly seems appealing but having gone through bad experiences with it, and seen others stumble through those same experiences time and time again, I've learned that it really is an issue, and one that I would very much like to avoid causing future users to stumble over. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 11:09 AM Stephen Frost <sfrost@snowman.net> wrote: > I do think we can ask that people who wish to have a capability included > in PG (or continue to be included when there are serious known issues > with it...) be prepared to either build and maintain it themselves or to > convince someone else to do so (or both, and have a committer agree to > it). That's how we've long operated and it wasn't my intent to imply > otherwise, but I agree that I could have said it in a nicer way to avoid > it coming across as telling Christophe what to do. I think it is certainly true that if you want a new capability added, you have to either write it yourself or get someone to do it for you. There is SOME precedent for the proposition that if you want an obsolete capability not to be removed, you should be prepared to help fix it. But frankly I think the latter proposition is one we've taken only occasionally, and only for things that were a whole lot cruftier and more problematic than this is. For example, if a library is no longer available and we have a contrib module that depends on that library, it's reasonable to say that we can't keep the contrib module unless somebody rewrites it not to depend on that library any more. But the current situation is completely different. The code maintenance burden resulting from keeping exclusive-mode backups is mostly hypothetical; the code works as well as it ever did. We rarely take the position that we're going to rip out older code that doesn't work as nicely as newer code does just because nobody's prepared to e.g. better-document the old code. For example, we still have FEBE protocol 2 support floating around. If you're going to talk about badly-designed footguns that ought to be excised with extreme prejudice, that would IMHO be a far better place to start than this. Unlike this code, which it's now obvious is used by quite a number of people even just among those who read this list regularly, that code is probably nearly unused and untested and, if we broke it, we probably wouldn't know. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Feb 22, 2019 at 6:18 PM Stephen Frost <sfrost@snowman.net> wrote: > > What we need is a backup tool included in core that users feel > > comfortable using instead of trying to write their own. > > I agree. That's a great idea. Let's talk about how to make that > happen. Providing a tool that gives people MORE AND BETTER options > than what they have today is likely to make users more happy. I've been working for years on making this happen, and am currently putting quite a large amount of resources into what I hope will be the final step in the process to get a solution that will have a great many more and better options than what's available today. Maybe it'll end up happening, which I think would be great, maybe it won't, which would be sad, but at least we'll have tried. > Removing an option that people are currently using, and which they > find better than other available options for reasons with which I > understand that you disagree, will make users more sad. Happy is > better. I don't want to see more users stumbling over the issues with the exclusive backup interface. A better interface exists, and has existed since 9.6. The exclusive backup method was deprecated in 2016. One of the really bad things about the exclusive backup method is that it *looks* like it works well and that there aren't any issues with it, but when users hit the issues, they get very sad. We might have 1000s of happy users who never run into those issues and therefore don't want to see us change anything, but what about the 10s or 100s of users who do hit those issues, what do we tell them? Seems like we're saying "well, sorry, we knew those issues existed and it's unfortunate you hit them, but there's this better thing that you *should* have been using and then you wouldn't have hit those issues." That doesn't seem likely to make people happy with us. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 11:23 AM Stephen Frost <sfrost@snowman.net> wrote: > If that argument did matter, we could go back and find the prior > discussions about the issues around the exclusive backup mode and about > removing it, or next year we could point to this thread about it, or the > year after, and say "well, we talked about it a lot, so now let's just > do it", but it all ends up looking the same to our users unless we > actually write into some kind of user-facing documentation or WARNINGs > or similar that something is going away. That's true. But nobody's objecting to strengthening the warnings in the documentation. People are objecting to removing the thing itself. > This isn't a fair argument either because we're having this *after* the > new API was implemented for backup- namely non-exclusive mode, which > *did* give people something better to use instead. There are several people who are saying that doesn't meet all their needs, giving reasons why it's problematic, and suggesting things that could be done to make it less problematic. It's not OK to say "we can remove exclusive backup mode because now we have non-exclusive backup mode" unless other people actually agree that all the use cases are covered, and it appears that they don't. > I disagree quite a bit with this statement- the existing documentation > is absolutely horrid and needs to be completely ripped out and rewritten > and maintaining the exclusive backup mode in such a rewrite would > absolutely be a *lot* of additional work. > > We actually took a shot at trying to improve the documentation while > continuing to cover both the exclusive and the non-exclusive mode and > it's hugely painful. Well, perhaps that's pain you need to incur. The alternative seems to be to rip out something that people don't want ripped out. > I used to be one of those people. I know that it looks fine and it > certainly seems appealing but having gone through bad experiences with > it, and seen others stumble through those same experiences time and time > again, I've learned that it really is an issue, and one that I would > very much like to avoid causing future users to stumble over. Sure, that sounds great. But the way to do that is to continue improving the system until exclusive-mode backups really are not a useful thing any more, not to remove it while there are still a lot of people relying on it who can offer tangible explanations for their choice to do so. It feels to me like you are portraying the increasing number of people objecting to this change as naive or foolish or at least not as enlightened as you are, and I really object to that. I happen to think that people like Christophe Pettus and Fujii Masao and Laurenz Albe are smart people whose opinions ought to be taken as just as valid as your own. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 25, 2019 at 11:33 AM Stephen Frost <sfrost@snowman.net> wrote: > > Removing an option that people are currently using, and which they > > find better than other available options for reasons with which I > > understand that you disagree, will make users more sad. Happy is > > better. > > I don't want to see more users stumbling over the issues with the > exclusive backup interface. A better interface exists, and has existed > since 9.6. The exclusive backup method was deprecated in 2016. One of > the really bad things about the exclusive backup method is that it > *looks* like it works well and that there aren't any issues with it, but > when users hit the issues, they get very sad. We might have 1000s of > happy users who never run into those issues and therefore don't want to > see us change anything, but what about the 10s or 100s of users who do > hit those issues, what do we tell them? Seems like we're saying "well, > sorry, we knew those issues existed and it's unfortunate you hit them, > but there's this better thing that you *should* have been using and then > you wouldn't have hit those issues." That doesn't seem likely to make > people happy with us. Sure, nobody wants users to keep hitting the same problems over and over again, but the above is still reductionist. If we take the view that anything that can cause data corruption in any scenario, no matter how remote, is more than everything else, then we should just stop shipping minor releases and halt all other development work until we deal with https://commitfest.postgresql.org/22/528/ that has been open for 14 CommitFests and until we've fully dealt with every aspect of fsync-gate. I don't see you treating either of those issues with the same sense of urgency with which you're treating this one, so evidently you feel entitled to decide which potentially-data-corrupting issues you want to attack first. And I agree that you should have exactly that right. Along similar lines, I'd like our users to have the right to decide when they want to upgrade from using exclusive backup mode to non-exclusive backup mode, instead of being forced into making a change at a time decided by our fiat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Feb 25, 2019 at 11:23 AM Stephen Frost <sfrost@snowman.net> wrote: > > If that argument did matter, we could go back and find the prior > > discussions about the issues around the exclusive backup mode and about > > removing it, or next year we could point to this thread about it, or the > > year after, and say "well, we talked about it a lot, so now let's just > > do it", but it all ends up looking the same to our users unless we > > actually write into some kind of user-facing documentation or WARNINGs > > or similar that something is going away. > > That's true. But nobody's objecting to strengthening the warnings in > the documentation. People are objecting to removing the thing itself. Then how long will we carry it forward? How much warning do we need to provide? > > This isn't a fair argument either because we're having this *after* the > > new API was implemented for backup- namely non-exclusive mode, which > > *did* give people something better to use instead. > > There are several people who are saying that doesn't meet all their > needs, giving reasons why it's problematic, and suggesting things that > could be done to make it less problematic. It's not OK to say "we can > remove exclusive backup mode because now we have non-exclusive backup > mode" unless other people actually agree that all the use cases are > covered, and it appears that they don't. I haven't heard anyone say it doesn't meet their needs, just that it's not as easy to use, which is a fair critcism, but not the same thing. I'm all for making the non-exclusive mode less problematic if we're able to. If that's something that would help us move forward with getting rid of the exclusive backup mode than that's an area that I'd be willing to put resources. > > I disagree quite a bit with this statement- the existing documentation > > is absolutely horrid and needs to be completely ripped out and rewritten > > and maintaining the exclusive backup mode in such a rewrite would > > absolutely be a *lot* of additional work. > > > > We actually took a shot at trying to improve the documentation while > > continuing to cover both the exclusive and the non-exclusive mode and > > it's hugely painful. > > Well, perhaps that's pain you need to incur. The alternative seems to > be to rip out something that people don't want ripped out. ... now I feel like I'm being told what to do. The point is that keeping it actually *is* work, it isn't free, not even in our tree. Making our users stumble over the issues with it also isn't free, that's another cost of keeping it. > > I used to be one of those people. I know that it looks fine and it > > certainly seems appealing but having gone through bad experiences with > > it, and seen others stumble through those same experiences time and time > > again, I've learned that it really is an issue, and one that I would > > very much like to avoid causing future users to stumble over. > > Sure, that sounds great. But the way to do that is to continue > improving the system until exclusive-mode backups really are not a > useful thing any more, not to remove it while there are still a lot of > people relying on it who can offer tangible explanations for their > choice to do so. > > It feels to me like you are portraying the increasing number of people > objecting to this change as naive or foolish or at least not as > enlightened as you are, and I really object to that. I happen to > think that people like Christophe Pettus and Fujii Masao and Laurenz > Albe are smart people whose opinions ought to be taken as just as > valid as your own. I honestly do doubt that they have had the same experiences that I have had, but that doesn't mean that they're not smart or that I don't value their opinion- I agree that they're all smart people and I certainly do value their opinions. That doesn't mean that I can't disagree with them or can't explain why I disagree. Thanks! Stephen
Attachment
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Feb 25, 2019 at 11:33 AM Stephen Frost <sfrost@snowman.net> wrote: > > > Removing an option that people are currently using, and which they > > > find better than other available options for reasons with which I > > > understand that you disagree, will make users more sad. Happy is > > > better. > > > > I don't want to see more users stumbling over the issues with the > > exclusive backup interface. A better interface exists, and has existed > > since 9.6. The exclusive backup method was deprecated in 2016. One of > > the really bad things about the exclusive backup method is that it > > *looks* like it works well and that there aren't any issues with it, but > > when users hit the issues, they get very sad. We might have 1000s of > > happy users who never run into those issues and therefore don't want to > > see us change anything, but what about the 10s or 100s of users who do > > hit those issues, what do we tell them? Seems like we're saying "well, > > sorry, we knew those issues existed and it's unfortunate you hit them, > > but there's this better thing that you *should* have been using and then > > you wouldn't have hit those issues." That doesn't seem likely to make > > people happy with us. > > Sure, nobody wants users to keep hitting the same problems over and > over again, but the above is still reductionist. If we take the view > that anything that can cause data corruption in any scenario, no > matter how remote, is more than everything else, then we should just > stop shipping minor releases and halt all other development work until > we deal with https://commitfest.postgresql.org/22/528/ that has been > open for 14 CommitFests and until we've fully dealt with every aspect > of fsync-gate. I don't see you treating either of those issues with > the same sense of urgency with which you're treating this one, so > evidently you feel entitled to decide which > potentially-data-corrupting issues you want to attack first. I'm not advocating for removing it today or tomorrow, or in some point release. We're already to the stage where we're talking about something that wouldn't hit until late 2020. Also, frankly, while there might be something I could do to help, the whole madness with fsync gate seems to be getting dealt with by some very smart people who know a whole lot more about the internals of Linux and FreeBSD and I'd rather not get in their way. On the other hand, dealing with backups and issues around backups has been something that I at least hope I've gotten decent at. > And I agree that you should have exactly that right. Along similar > lines, I'd like our users to have the right to decide when they want > to upgrade from using exclusive backup mode to non-exclusive backup > mode, instead of being forced into making a change at a time decided > by our fiat. Our users *have* that option, and have for over two years now, and it'll have been 4 years by the time v13 comes out. I agree that the non-exclusive backup mode isn't as simple as the exclusive backup mode- to use your example, what we *used* to be doing with fsync() was sure a lot simpler than what we're going to be doing in the future, but sometimes people have to make changes, even to more complicated APIs, to make sure that they're doing things the right way to ensure that their data doesn't get eaten. Thanks! Stephen
Attachment
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Feb 25, 2019 at 11:09 AM Stephen Frost <sfrost@snowman.net> wrote: > > I do think we can ask that people who wish to have a capability included > > in PG (or continue to be included when there are serious known issues > > with it...) be prepared to either build and maintain it themselves or to > > convince someone else to do so (or both, and have a committer agree to > > it). That's how we've long operated and it wasn't my intent to imply > > otherwise, but I agree that I could have said it in a nicer way to avoid > > it coming across as telling Christophe what to do. > > I think it is certainly true that if you want a new capability added, > you have to either write it yourself or get someone to do it for you. > There is SOME precedent for the proposition that if you want an > obsolete capability not to be removed, you should be prepared to help > fix it. But frankly I think the latter proposition is one we've taken > only occasionally, and only for things that were a whole lot cruftier > and more problematic than this is. For example, if a library is no > longer available and we have a contrib module that depends on that > library, it's reasonable to say that we can't keep the contrib module > unless somebody rewrites it not to depend on that library any more. > But the current situation is completely different. The code > maintenance burden resulting from keeping exclusive-mode backups is > mostly hypothetical; the code works as well as it ever did. We rarely > take the position that we're going to rip out older code that doesn't > work as nicely as newer code does just because nobody's prepared to > e.g. better-document the old code. I'd argue that if something which has been deprecated for years is seriously getting in the way of making progress that we should be willing to accept a proposal to rip it out after giving people sufficient time to adjust, and that if someone really wants to keep that deprecated API then they need to be willing to spend the time to address the reasons why it was deprecated. > For example, we still have FEBE protocol 2 support floating around. > If you're going to talk about badly-designed footguns that ought to be > excised with extreme prejudice, that would IMHO be a far better place > to start than this. Unlike this code, which it's now obvious is used > by quite a number of people even just among those who read this list > regularly, that code is probably nearly unused and untested and, if we > broke it, we probably wouldn't know. You can probably guess my feelings with regard to the FEBE v2 support. At least, in that case, people aren't really using it though. What I really dislike is that we've got a *lot* of people using something that we *know* has some pretty serious potential data-eating problems. Saying that we should keep it around because a lot of people are using it isn't the way to persuade me that we should keep it. Thanks! Stephen
Attachment
> On Feb 25, 2019, at 08:55, Stephen Frost <sfrost@snowman.net> wrote: > > I honestly do doubt that they have had the same experiences that I have > had Well, I guarantee you that no two people on this list have had identical experiences. :) I certainly have been bitten bythe problems with the current system. But the resistance to major version upgrades is *huge*, and I'm strongly biasedagainst anything that will make that harder. I'm not sure I'm communicating how big a problem telling many large installations,"If you move to v12/13/etc., you will have to change your backup system" is going to be. -- -- Christophe Pettus xof@thebuild.com
On Mon, Feb 25, 2019 at 10:49 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > Stephen Frost wrote: > > > It will be annoying if after this removal, companies must change their > > > backup strategy by using specific postgres tools (pgbackrest, barman). > > > > You don't write your own database system using CSV files and shell > > magic, do you? I have to say that it continues to boggle my mind that > > people insist that *this* part of the system has to be able to be > > implementable using shell scripts. > > > > Folks, these are your backups we're talking about, your last resort if > > everything else goes up in flames, why do you want to risk that by > > implementing your own one-off solution, particularly when there's known > > serious issues using that interface, and you want to just use shell > > scripts to do it...? > > If we didn't think that simplicity in backup has some value, why does > PostgreSQL provide pg_basebackup? > > Not everybody wants to use tools like pgbackrest and barman because it > takes some effort to set them up properly. It seems that you think that > people who want something simpler are irresponsible. > > Incidentally, both barman and pgbackrest stream the backup to a backup server. > > I think the most important use case for the low level backup API is when > your database is so large that you want to leverage storage techniques > like snapshots for backup, because you can't or don't want to stream all > those terabytes across the network. > > I'm not playing devil's advocate here to annoy you. I see the problems > with the exclusive backup, and I see how it can hurt people. > I just think that removing exclusive backup without some kind of help > like Andres sketched above will make people unhappy. +1 Another idea is to improve an exclusive backup method so that it will never cause such issue. What about changing an exclusive backup mode of pg_start_backup() so that it creates something like backup_label.pending file instead of backup_label? Then if the database cluster has backup_label.pending file but not recovery.signal (this is the case where the database is recovered just after the server crashes while an exclusive backup is in progress), in this idea, the recovery using that database cluster always ignores (or removes) backup_label.pending file and start replaying WAL from the REDO location that pg_control file indicates. So this idea enables us to work around the issue that an exclusive backup could cause. On the other hand, the downside of this idea is that the users need to change the recovery procedure. When they want to do PITR using the backup having backup_label.pending, they need to not only create recovery.signal but also rename backup_label.pending to backup_label. Rename of backup_label file is brand-new step for their recovery procedure, and changing the recovery procedure might be painful for some users. But IMO it's less painful than removing an exclusive backup API at all. Thought? BTW, if recovery.signal is created but backup_label.pending is not renamed (this is the case where the operator forgets to rename the file even though she or he create recovery signal file, i.e., mis-configuration), I think that the recovery should emit PANIC immediately with the HINT like "HINT: rename backup_label.pening to backup_label if you want to do PITR". Regards, -- Fujii Masao
On 2/25/19 7:50 PM, Fujii Masao wrote: > On Mon, Feb 25, 2019 at 10:49 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: >> >> I'm not playing devil's advocate here to annoy you. I see the problems >> with the exclusive backup, and I see how it can hurt people. >> I just think that removing exclusive backup without some kind of help >> like Andres sketched above will make people unhappy. > > +1 > > Another idea is to improve an exclusive backup method so that it will never > cause such issue. What about changing an exclusive backup mode of > pg_start_backup() so that it creates something like backup_label.pending file > instead of backup_label? Then if the database cluster has backup_label.pending > file but not recovery.signal (this is the case where the database is recovered > just after the server crashes while an exclusive backup is in progress), > in this idea, the recovery using that database cluster always ignores > (or removes) backup_label.pending file and start replaying WAL from > the REDO location that pg_control file indicates. So this idea enables us to > work around the issue that an exclusive backup could cause. It's an interesting idea. > On the other hand, the downside of this idea is that the users need to change > the recovery procedure. When they want to do PITR using the backup having > backup_label.pending, they need to not only create recovery.signal but also > rename backup_label.pending to backup_label. Rename of backup_label file > is brand-new step for their recovery procedure, and changing the recovery > procedure might be painful for some users. But IMO it's less painful than > removing an exclusive backup API at all. Well, given that we have invalidated all prior recovery procedures in PG12 I'm not sure how big a deal that is. Of course, it's too late make a change like this for PG12. > Thought? Here's the really obvious bad thing: if users do not update their procedures and we ignore backup_label.pending on startup then they will end up with a corrupt database because it will not replay from the correct checkpoint. If we error on the presence of backup_label.pending then we are right back to where we started. I know there are backup solutions that rely on copying all required WAL to pg_xlog/pg_wal before starting recovery. Those solutions would silently break in this case and end up in corruption. If we require recovery.signal then we still have the current problem of the cluster not starting after a crash. > BTW, if recovery.signal is created but backup_label.pending is not renamed > (this is the case where the operator forgets to rename the file even though > she or he create recovery signal file, i.e., mis-configuration), I think that > the recovery should emit PANIC immediately with the HINT like > "HINT: rename backup_label.pening to backup_label if you want to do PITR". This causes its own problems, as stated above. Regards, -- -David david@pgmasters.net
On Mon, Feb 25, 2019 at 11:33:33AM -0500, Stephen Frost wrote: > I don't want to see more users stumbling over the issues with the > exclusive backup interface. A better interface exists, and has existed > since 9.6. Do you really think we would be having this discussion if the non-exclusive backup method was unequivocally better? It is better for some use-cases, and impossible for others. Also, you can't say it will have no impact for five years on people who do not upgrade. The impact will be that they will have no new Postgres features for five years. I am not taking a stance on this issue, but making unclear statements isn't helping the discussion. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hi Christophe, On 2/25/19 7:24 PM, Christophe Pettus wrote: > > >> On Feb 25, 2019, at 08:55, Stephen Frost <sfrost@snowman.net> wrote: >> >> I honestly do doubt that they have had the same experiences that I have >> had > > Well, I guarantee you that no two people on this list have had identical experiences. :) I certainly have been bittenby the problems with the current system. But the resistance to major version upgrades is *huge*, and I'm stronglybiased against anything that will make that harder. I'm not sure I'm communicating how big a problem telling manylarge installations, "If you move to v12/13/etc., you will have to change your backup system" is going to be. I honestly think you are underestimating how bad this can be. The prevailing wisdom is that it's unfortunate that these backup_labels get left around but they can be removed with scripting, so no big deal. After that the cluster will start. But -- if you are too aggressive about removing the backup_label and accidentally do it before a real restore from backup, then you have a corrupt cluster. Totally silent, but definitely corrupt. You'll probably only see it when you start getting consistency errors from the indexes, if ever. Page checksums won't catch it either unless you are *lucky* enough to have a torn page. Erroneous scripting of this kind can also affect backups that were made with the non-exclusive method since the backups look the same. fsync() is the major corruption issue we are facing right now but that doesn't mean there aren't other sources of corruption we should be thinking about. I've thought about this one a lot and it scares me. I've worked on ways to make it better, but all of them break something and involve compromises that are nearly as severe as removing exclusive backups entirely. Regards, -- -David david@pgmasters.net
Greetings, * Christophe Pettus (xof@thebuild.com) wrote: > > On Feb 25, 2019, at 08:55, Stephen Frost <sfrost@snowman.net> wrote: > > > > I honestly do doubt that they have had the same experiences that I have > > had > > Well, I guarantee you that no two people on this list have had identical experiences. :) Indeed! :) > I certainly have been bitten by the problems with the current system. But the resistance to major version upgrades is*huge*, and I'm strongly biased against anything that will make that harder. I'm not sure I'm communicating how big aproblem telling many large installations, "If you move to v12/13/etc., you will have to change your backup system" is goingto be. Aren't they going to need to make a change for v12 now anyway? Hopefully they're regularly testing their backups by doing a restore of them, and dropping a recovery.conf into the directory of a v12 system after restore will do exactly nothing and they'll get errors complaining about how they need to provide a restore_command. That's actually what prompted bringing this painful topic up again- there's a large change being done to how backup/restore with PG is done (the recovery.conf file is going away), and people who have written any kind of automation (or even just documented procedures...) around that will have to update their systems. At least from my perspective, making them have to do such an update once, instead of once now and another time in the future when we remove exclusive backup (or figure out a way to do it that's safe and update the instructions for how to do it right...), is the better option. * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Feb 25, 2019 at 11:33:33AM -0500, Stephen Frost wrote: > > I don't want to see more users stumbling over the issues with the > > exclusive backup interface. A better interface exists, and has existed > > since 9.6. > > Do you really think we would be having this discussion if the > non-exclusive backup method was unequivocally better? It is better for > some use-cases, and impossible for others. Based on Christophe's comment above, anything which required users to make a change on upgrade to their backup system would be cause to have this discussion, which likely includes most possible solutions to the issues with exclusive backup too, unfortunately.. > Also, you can't say it will have no impact for five years on people who > do not upgrade. The impact will be that they will have no new Postgres > features for five years. I don't think I made the claim that there wouldn't be any impact for five years, I said they would continue to have support for five years. Also, this is exactly what we tell them for any other breaking change (such as removal of recovery.conf). > I am not taking a stance on this issue, but making unclear statements > isn't helping the discussion. It's not my intent to make unclear statements, so I certainly appreicate you, and anyone else, pointing out when I do. I'm happy to clarify. Thanks! Stephen
Attachment
Greetings, * David Steele (david@pgmasters.net) wrote: > On 2/25/19 7:50 PM, Fujii Masao wrote: > >Thought? > > Here's the really obvious bad thing: if users do not update their procedures > and we ignore backup_label.pending on startup then they will end up with a > corrupt database because it will not replay from the correct checkpoint. If > we error on the presence of backup_label.pending then we are right back to > where we started. Right, we definitely can't make a change which would cause non-updated systems using the same APIs to suddenly end up with corruption. If we require a change be made to the scripts that run the process then we must make it something the user is essentially opt'ing into- where they've told us explicitly that they're ready to use the new interface, which is how the non-exclusive backup mode was added. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 02:24:18PM -0500, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > On Mon, Feb 25, 2019 at 11:33:33AM -0500, Stephen Frost wrote: > > > I don't want to see more users stumbling over the issues with the > > > exclusive backup interface. A better interface exists, and has existed > > > since 9.6. > > > > Do you really think we would be having this discussion if the > > non-exclusive backup method was unequivocally better? It is better for > > some use-cases, and impossible for others. > > Based on Christophe's comment above, anything which required users to > make a change on upgrade to their backup system would be cause to have > this discussion, which likely includes most possible solutions to the > issues with exclusive backup too, unfortunately.. I am not addressing what Christophe said, but rather what you said. We clearly are fine in requiring people to update their software for major releases. I think if there was a way for old backup methods to work at all, this would be a lot simpler converation. > > Also, you can't say it will have no impact for five years on people who > > do not upgrade. The impact will be that they will have no new Postgres > > features for five years. > > I don't think I made the claim that there wouldn't be any impact for > five years, I said they would continue to have support for five years. > > Also, this is exactly what we tell them for any other breaking change > (such as removal of recovery.conf). > > > I am not taking a stance on this issue, but making unclear statements > > isn't helping the discussion. > > It's not my intent to make unclear statements, so I certainly appreicate > you, and anyone else, pointing out when I do. I'm happy to clarify. Good. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hi, On 2019-02-25 08:14:16 -0500, Stephen Frost wrote: > > It will be annoying if after this removal, companies must change their > > backup strategy by using specific postgres tools (pgbackrest, barman). > > You don't write your own database system using CSV files and shell > magic, do you? I have to say that it continues to boggle my mind that > people insist that *this* part of the system has to be able to be > implementable using shell scripts. > > Folks, these are your backups we're talking about, your last resort if > everything else goes up in flames, why do you want to risk that by > implementing your own one-off solution, particularly when there's known > serious issues using that interface, and you want to just use shell > scripts to do it...? FWIW, if you weren't selling backrest quite so hard everywhere backups are mentioned, I'd find this thread a lot more convicing. Greetings, Andres Freund
On 2019-02-25 08:42:02 -0500, Stephen Frost wrote: > Greetings, > > * Andres Freund (andres@anarazel.de) wrote: > > On 2019-02-24 17:19:22 -0500, Stephen Frost wrote: > > > You say above that the new interface is unquestionably an improvement > > > > FWIW, I think we didn't design it even remotely as well as we ought to > > have. It was both a mistake to not offer a version of non-exclusive > > backups that works with a client connection (for integration with the > > TSMs of this world), and it was a mistake not to offer a commandline > > tool that does the non-exclusive pg/start backup. > > I'm not really following- did you mean to say "without a client > connection" above? Yes. > We could still add some kind of commandline tool to help with the > non-exclusive pg start/stop backup That doesn't help, as long as the connection requirement is there. As explained elsewhere in this thread, a lot of larger external backup infrastructure just gives you a start/stop hook, and that makes using a persistent connection hard and fragile. > but I'm not really sure exactly what that would look like and I > honestly don't have terribly much interest in spending resources on > implementing it since it would lack a great many features that we'd > then end up wanting to add on... and then we end up backing into > building yet another backup tool. Meh. You are proposing to remove a feature (even if a dangerous one). Requiring some minimal compat work to make htat more palpaple isn't crazy. Nor does using backrest or whatnot do *ANYTHING* about the users of large company wide backup tools. > > > I really, honestly, believe what we need to *stop* doing is trying to > > > drag along support for old/deprecated interfaces after we've introduced > > > new ones, thus avoiding these arguments and the time spent on them, and > > > the time spent dealing with implementing and documenting new APIs around > > > old ones. The only thing it seems to unquestionably do is make us argue > > > about when we are going to *finally* rip out the old/deprecated API (or > > > interface, or whatever you want to call it). > > > > While I agree that we should remove non-exclusive backups, I VEHEMENTLY > > oppose the unconditionality of the above statement. Yes, it's annoying > > to keep interfaces around, but unnecessarily inflicting pain to everyone > > also is annoying. > > Alright, then how about we provide a bit of help for everyone who's got > a system built around recovery.conf today, instead of just completely > ripping that out? Oh, comeon. Obviously we're sometimes going to have to make breaking changes. But you're essentially arguing that we shouldn't provide compat where we can come up with a reasonable way to provide backward compatibility. And I think that's crazy and will hurt postgres. > > That's not to say we shouldn't be better at announcing and then > > following through on the deprecation of old interfaces. > > We announce things have changed every year, and people have five years > from that point, during which we'll continue to support them and fix > bugs and deal with security issues, to make whatever changes they need > to for the new interface. > > The argument seems to be that people want new features but they don't > want to have to do any work to get those new features. I mean, duh? We can't make that happen all the time, but I don't understand why it's surprising to have that as *one* goal that's in conflict with other goals? Your analysis also neglects that a lot of software will have to work across multiple supported versions of postgres (at the very least in prep for the migration, but also often longer) - by having a few versions were both interfaces work, that can be made *MUCH* less painful. > When is something too big of a change to make in a given major version > and we have to, instead, provide backwards compatibility for it? What > is that policy? How many releases do we have to support it for? How do > we communicate that to our users, so that they know what they can depend > on being there across major releases and what they can't? I literally said that we should have more policy around this. I'm going to stop here, discussing with you in this thread seems unproductive. Greetings, Andres Freund
Hi, On 2019-02-25 20:43:01 +0200, David Steele wrote: > fsync() is the major corruption issue we are facing right now but that > doesn't mean there aren't other sources of corruption we should be thinking > about. I've thought about this one a lot and it scares me. FWIW, I think this kind of issue practially bites a *LOT* more people than the fsync() issue. > I've worked on ways to make it better, but all of them break something and > involve compromises that are nearly as severe as removing exclusive backups > entirely. There've been plenty proposals to make non-exclusive backups less painful in this thread, so ...? Greetings, Andres Freund
> On Feb 25, 2019, at 11:24, Stephen Frost <sfrost@snowman.net> wrote: > Aren't they going to need to make a change for v12 now anyway? > > Hopefully they're regularly testing their backups by doing a restore of > them, and dropping a recovery.conf into the directory of a v12 system > after restore will do exactly nothing and they'll get errors complaining > about how they need to provide a restore_command. It varies with the installation, but backups are 100% - epsilon automated, while restores are very frequently done via runbook. Changing a runbook is institutionally less contentious, and the change is more or less "drop the file into mumble/conf.drather than mumble", which is less of a break. -- -- Christophe Pettus xof@thebuild.com
Hi, On 2019-02-25 09:24:32 -0800, Christophe Pettus wrote: > But the resistance to major version upgrades is *huge*, and I'm > strongly biased against anything that will make that harder. I'm not > sure I'm communicating how big a problem telling many large > installations, "If you move to v12/13/etc., you will have to change > your backup system" is going to be. I think you might be right about this specific issue. But to me it sounds like you also don't appreciate that development resources are really constrained too, and providing endless backward compatibility for everything is going to use both resources directly, and indirectly by making the whole system more complex. I've been on your side of this fight a couple times (and largely lost), but I think it's important to appreciate that it's all a balancing, and we all have valid reasons to keep the balance between development pace / ease of use / ease of upgrade the way we argue for them. Hard upgrading is going to reduce adoption, but so is lack of feature development and too many creaky & redundant & easy to misuse interfaces. Greetings, Andres Freund
Hi Andres, On 2/25/19 10:57 PM, Andres Freund wrote: > Hi, > > On 2019-02-25 08:14:16 -0500, Stephen Frost wrote: >>> It will be annoying if after this removal, companies must change their >>> backup strategy by using specific postgres tools (pgbackrest, barman). >> >> You don't write your own database system using CSV files and shell >> magic, do you? I have to say that it continues to boggle my mind that >> people insist that *this* part of the system has to be able to be >> implementable using shell scripts. >> >> Folks, these are your backups we're talking about, your last resort if >> everything else goes up in flames, why do you want to risk that by >> implementing your own one-off solution, particularly when there's known >> serious issues using that interface, and you want to just use shell >> scripts to do it...? > > FWIW, if you weren't selling backrest quite so hard everywhere backups > are mentioned, I'd find this thread a lot more convicing. pgBackRest has not used exclusive backups since the new API was introduced in 9.6 so this is not an issue for our users. Over time we have contributed back to Postgres in areas we thought could be improved based on our work on the pgBackRest project: 6ad8ac60, 9fe3c644, 017e4f25, 78874531, 449338cc, 98267ee8, 8694cc96, 920a5e50, c37b3d08, 5fc1670b, b981df4c. This does not include the various backup related patches that we have reviewed. If promoting pgBackRest were our primary concern then it would be in our interest to allow Postgres exclusive backups to stay broken and pg_basebackup to be as primitive as possible. Our objections to exclusive backups stem from a desire to do the right thing, as we see it, and because we have seen first hand all the ways that backups can go wrong. -- -David david@pgmasters.net
On Mon, Feb 25, 2019 at 10:14 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-02-25 08:42:02 -0500, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2019-02-24 17:19:22 -0500, Stephen Frost wrote:
> > > You say above that the new interface is unquestionably an improvement
> >
> > FWIW, I think we didn't design it even remotely as well as we ought to
> > have. It was both a mistake to not offer a version of non-exclusive
> > backups that works with a client connection (for integration with the
> > TSMs of this world), and it was a mistake not to offer a commandline
> > tool that does the non-exclusive pg/start backup.
>
> I'm not really following- did you mean to say "without a client
> connection" above?
Yes.
> We could still add some kind of commandline tool to help with the
> non-exclusive pg start/stop backup
That doesn't help, as long as the connection requirement is there. As
explained elsewhere in this thread, a lot of larger external backup
infrastructure just gives you a start/stop hook, and that makes using a
persistent connection hard and fragile.
In my experience they don't just give you that. They also give you a module that calls the postgres APIs to do backups.
Some of them are using the old API, some are using the new one. The only thing that would be required there would be to upgrade the module to one that supports the 9.6 APIs. (And you have to upgrade those modules whenever you upgrade oracle or mssql or whatever else you have, so that's not something those organisations are unused to)
I agree with the need for as simpler way to drive it in the simple case or in the say filesystem snapshot case, but I'm not sure the pre/post case is all that common in this context? Some certainly use it today even if they don't have to, but that can probably be easily changed to such a "simplified cmd tool" that you have suggested elsewhere as well.
On 2/25/19 11:20 PM, Christophe Pettus wrote: > > >> On Feb 25, 2019, at 11:24, Stephen Frost <sfrost@snowman.net> wrote: >> Aren't they going to need to make a change for v12 now anyway? >> >> Hopefully they're regularly testing their backups by doing a restore of >> them, and dropping a recovery.conf into the directory of a v12 system >> after restore will do exactly nothing and they'll get errors complaining >> about how they need to provide a restore_command. > > It varies with the installation, but backups are 100% - epsilon automated, while restores are very frequently done viarunbook. Changing a runbook is institutionally less contentious, and the change is more or less "drop the file into mumble/conf.drather than mumble", which is less of a break. This is true but is becoming far less true over time. Automation of recovery is now very common -- as it should be. Backups should be tested regularly and recovery should not be a scary thing that is done at 2am from an outdated runbook. The recovery.conf change will have serious impact when it arrives in PG12. -- -David david@pgmasters.net
On 2/25/19 11:38 PM, Andres Freund wrote: > Hi, > > On 2019-02-25 09:24:32 -0800, Christophe Pettus wrote: >> But the resistance to major version upgrades is *huge*, and I'm >> strongly biased against anything that will make that harder. I'm not >> sure I'm communicating how big a problem telling many large >> installations, "If you move to v12/13/etc., you will have to change >> your backup system" is going to be. > > I think you might be right about this specific issue. But to me it > sounds like you also don't appreciate that development resources are > really constrained too, and providing endless backward compatibility for > everything is going to use both resources directly, and indirectly by > making the whole system more complex. I've been on your side of this > fight a couple times (and largely lost), but I think it's important to > appreciate that it's all a balancing, and we all have valid reasons to > keep the balance between development pace / ease of use / ease of > upgrade the way we argue for them. Hard upgrading is going to reduce > adoption, but so is lack of feature development and too many creaky & > redundant & easy to misuse interfaces. +1. And those (like me) who are motivated to work in this area are put off by the byzantine code and how easy it is to break things -- in large part because of the *complete* absence of tests for exclusive mode. I did a round of doc updates for non/exclusive backups a while back and it was excruciating to measure and document the subtle differences in a way that a user might possibly understand. I'm honestly not motivated to come back to it until we remove the deprecated interfaces. -- -David david@pgmasters.net
> On Feb 25, 2019, at 13:38, Andres Freund <andres@anarazel.de> wrote: > > I think you might be right about this specific issue. But to me it > sounds like you also don't appreciate that development resources are > really constrained too, and providing endless backward compatibility for > everything is going to use both resources directly, and indirectly by > making the whole system more complex. One of the things I've tried not to do in this discussion is turn it into a policy debate about backwards compatibility ingeneral, rather than this very specific feature. Of course, there's a cost to keeping around features, and I fully appreciatethat. In this discussion, the code complexity didn't seem to be the fundamental driver behind removing the feature;the relative safety of it was, along with maintaining the documentation. I jumped in because there seemed to be an argument going on that all of the cool kids will have moved off the old interfaceand there was essentially no cost to removing it in v12 or v13, and that didn't correspond to my experience. -- -- Christophe Pettus xof@thebuild.com
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-02-25 08:14:16 -0500, Stephen Frost wrote: > > > It will be annoying if after this removal, companies must change their > > > backup strategy by using specific postgres tools (pgbackrest, barman). > > > > You don't write your own database system using CSV files and shell > > magic, do you? I have to say that it continues to boggle my mind that > > people insist that *this* part of the system has to be able to be > > implementable using shell scripts. > > > > Folks, these are your backups we're talking about, your last resort if > > everything else goes up in flames, why do you want to risk that by > > implementing your own one-off solution, particularly when there's known > > serious issues using that interface, and you want to just use shell > > scripts to do it...? > > FWIW, if you weren't selling backrest quite so hard everywhere backups > are mentioned, I'd find this thread a lot more convicing. I push pgbackrest, just like I push PG, because I view it as the best open source tool out there for the job, well architected, designed from the ground-up to do what a backup tool needs to, and it has a strong community around it with people contributing back, all openly and under a permissive license which allows anyone to use it, hack on it, and do what they want with it- basically, I see it as the PG of backup tools for PG. What PG is to relational databases, pgbackrest is to PG backup tools, imv. As David mentioned already, this really doesn't change things for pgbackrest one way or the other- we don't use the old API and haven't since the non-exclusive API was available. In addition, pgbackrest hasn't got a solution for the "we just want to take a snapshot" or "we just want to use start/stop hooks". If I was really trying to push people to pgbackrest because of the flaws in the exclusive backup mode, I'd surely develop answers to those issues first and make them part of pgbackrest in short order, and then start yelling from the tops of buildings about how broken the exclusive backup API is and about how I've got an easy solution for everyone to move to, maybe do a blog post on planet. I'm not doing that though, instead I'm discussing the issues here, with a tiny fraction of our community, and pushing for us to agree to get rid of a dangerous API in favor of an API that anyone can use w/o pgbackrest, if they want to. Thanks! Stephen
Attachment
On 2/25/19 11:59 PM, Christophe Pettus wrote: > > >> On Feb 25, 2019, at 13:38, Andres Freund <andres@anarazel.de> wrote: >> >> I think you might be right about this specific issue. But to me it >> sounds like you also don't appreciate that development resources are >> really constrained too, and providing endless backward compatibility for >> everything is going to use both resources directly, and indirectly by >> making the whole system more complex. > > One of the things I've tried not to do in this discussion is turn it into a policy debate about backwards compatibilityin general, rather than this very specific feature. Of course, there's a cost to keeping around features, andI fully appreciate that. In this discussion, the code complexity didn't seem to be the fundamental driver behind removingthe feature; the relative safety of it was, along with maintaining the documentation. Code complexity is very much an issue. Since the two methods are merged with a ton of conditionals, working on one without affecting the other is a real chore, especially since exclusive backups have *zero* tests. Personally I'm not interested in writing those tests so I try not to touch the code either. > I jumped in because there seemed to be an argument going on that all of the cool kids will have moved off the old interfaceand there was essentially no cost to removing it in v12 or v13, and that didn't correspond to my experience. I don't think anyone believes there will be zero cost, but the issues with the exclusive method seem to outweigh the benefits, at least in my experience. -- -David david@pgmasters.net
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-02-25 08:42:02 -0500, Stephen Frost wrote: > > * Andres Freund (andres@anarazel.de) wrote: > > > On 2019-02-24 17:19:22 -0500, Stephen Frost wrote: > > > > You say above that the new interface is unquestionably an improvement > > > > > > FWIW, I think we didn't design it even remotely as well as we ought to > > > have. It was both a mistake to not offer a version of non-exclusive > > > backups that works with a client connection (for integration with the > > > TSMs of this world), and it was a mistake not to offer a commandline > > > tool that does the non-exclusive pg/start backup. > > > > I'm not really following- did you mean to say "without a client > > connection" above? > > Yes. Ah, alright, that makes more sense then. You might bring that up with the individual who wrote it. > > We could still add some kind of commandline tool to help with the > > non-exclusive pg start/stop backup > > That doesn't help, as long as the connection requirement is there. As > explained elsewhere in this thread, a lot of larger external backup > infrastructure just gives you a start/stop hook, and that makes using a > persistent connection hard and fragile. I tend to agree w/ Magnus in that I'm not sure that this is actually as common these days as claimed. That said, I'm not against someone trying to implement something that doesn't require a connection, though I understand from discussion with David and his comments elsewhere on this thread that it isn't that simple a thing to do- and people would have to migrate to whatever it is anyway. > > but I'm not really sure exactly what that would look like and I > > honestly don't have terribly much interest in spending resources on > > implementing it since it would lack a great many features that we'd > > then end up wanting to add on... and then we end up backing into > > building yet another backup tool. > > Meh. You are proposing to remove a feature (even if a dangerous > one). Requiring some minimal compat work to make htat more palpaple > isn't crazy. Nor does using backrest or whatnot do *ANYTHING* about the > users of large company wide backup tools. Yes, I'm supporting the proposal to remove a dangerous feature. I really don't think much more needs to be said than that. If the feature wasn't dangerous then I wouldn't be argueing for its removal. I do feel bad that pgbackrest hasn't got a solution already to this, but we've had other priorities. I do think we'll get there, eventually, but not any time soon. Being told that I have to implement another feature so that we can remove the dangerous one is something I don't agree with. > > > > I really, honestly, believe what we need to *stop* doing is trying to > > > > drag along support for old/deprecated interfaces after we've introduced > > > > new ones, thus avoiding these arguments and the time spent on them, and > > > > the time spent dealing with implementing and documenting new APIs around > > > > old ones. The only thing it seems to unquestionably do is make us argue > > > > about when we are going to *finally* rip out the old/deprecated API (or > > > > interface, or whatever you want to call it). > > > > > > While I agree that we should remove non-exclusive backups, I VEHEMENTLY > > > oppose the unconditionality of the above statement. Yes, it's annoying > > > to keep interfaces around, but unnecessarily inflicting pain to everyone > > > also is annoying. > > > > Alright, then how about we provide a bit of help for everyone who's got > > a system built around recovery.conf today, instead of just completely > > ripping that out? > > Oh, comeon. Obviously we're sometimes going to have to make breaking > changes. But you're essentially arguing that we shouldn't provide compat > where we can come up with a reasonable way to provide backward > compatibility. And I think that's crazy and will hurt postgres. This was discussed, *extensively*, previously on this list with a whole lot of proposals trying to come up with easy ways to "fix" exclusive backups and I don't really think that this thread has proposed anything novel in that regard. Basically, I just don't think it's that easy. Perhaps I'm wrong, in which case, great, I'd be happy to review a simple patch that solves it. > > > That's not to say we shouldn't be better at announcing and then > > > following through on the deprecation of old interfaces. > > > > We announce things have changed every year, and people have five years > > from that point, during which we'll continue to support them and fix > > bugs and deal with security issues, to make whatever changes they need > > to for the new interface. > > > > The argument seems to be that people want new features but they don't > > want to have to do any work to get those new features. > > I mean, duh? We can't make that happen all the time, but I don't > understand why it's surprising to have that as *one* goal that's in > conflict with other goals? Your analysis also neglects that a lot of > software will have to work across multiple supported versions of > postgres (at the very least in prep for the migration, but also often > longer) - by having a few versions were both interfaces work, that can > be made *MUCH* less painful. Having to support multiple APIs like this really is painful. Also, I'm pretty well versed in software that has to support multiple PG major versions across time, using specifically these APIs in fact. Having both interfaces work at the same time didn't make that any less painful (pgbackrest supported non-exclusive backup in the same version as 9.6 support was added, and supports backing up PG back to 8.3). Perhaps it would help some, but I tend to doubt it based both on that experience and in the experience that there tends to be two times people make changes- opportunistically when they realize something has been deprecated, and when they're forced to because the old interface went away, and the level of effort and time spent tends to be about the same, all that's different is when it's done. > > When is something too big of a change to make in a given major version > > and we have to, instead, provide backwards compatibility for it? What > > is that policy? How many releases do we have to support it for? How do > > we communicate that to our users, so that they know what they can depend > > on being there across major releases and what they can't? > > I literally said that we should have more policy around this. Great. I was kinda hoping you might propose something by way of answering those questions so that maybe we could actually get to a point of having such a policy. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 4:38 PM David Steele <david@pgmasters.net> wrote: > > FWIW, if you weren't selling backrest quite so hard everywhere backups > > are mentioned, I'd find this thread a lot more convicing. > > pgBackRest has not used exclusive backups since the new API was > introduced in 9.6 so this is not an issue for our users. > > Over time we have contributed back to Postgres in areas we thought could > be improved based on our work on the pgBackRest project: 6ad8ac60, > 9fe3c644, 017e4f25, 78874531, 449338cc, 98267ee8, 8694cc96, 920a5e50, > c37b3d08, 5fc1670b, b981df4c. This does not include the various backup > related patches that we have reviewed. > > If promoting pgBackRest were our primary concern then it would be in our > interest to allow Postgres exclusive backups to stay broken and > pg_basebackup to be as primitive as possible. Hmm, so what you're saying is that you'd like to disable an API that some non-backrest users are relying upon but which no backrest users are relying upon. And you don't understand why some non-backrest users are opposed to that plan. Is that a correct summary of your position? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > Hmm, so what you're saying is that you'd like to disable an API that > some non-backrest users are relying upon but which no backrest users > are relying upon. And you don't understand why some non-backrest > users are opposed to that plan. Is that a correct summary of your > position? This topic has come up more than once and on multiple threads (entirely independent of pgbackrest and any of the other individuals on this thread) and I think it needs to be a discussion held at the PGCon Developer meeting. I'll propose it. If you'd like to propose something to fix the, as well discussed, dangerous exclusive backup method, or further litigate at what point a dangerous and misleading interface should be removed, I'm sure you'll find we'd be happy to talk about that. Thanks, Stephen
Attachment
On 26/02/19 4:53 PM, Robert Haas wrote: > On Mon, Feb 25, 2019 at 4:38 PM David Steele <david@pgmasters.net> wrote: >>> FWIW, if you weren't selling backrest quite so hard everywhere backups >>> are mentioned, I'd find this thread a lot more convicing. >> pgBackRest has not used exclusive backups since the new API was >> introduced in 9.6 so this is not an issue for our users. >> >> Over time we have contributed back to Postgres in areas we thought could >> be improved based on our work on the pgBackRest project: 6ad8ac60, >> 9fe3c644, 017e4f25, 78874531, 449338cc, 98267ee8, 8694cc96, 920a5e50, >> c37b3d08, 5fc1670b, b981df4c. This does not include the various backup >> related patches that we have reviewed. >> >> If promoting pgBackRest were our primary concern then it would be in our >> interest to allow Postgres exclusive backups to stay broken and >> pg_basebackup to be as primitive as possible. > Hmm, so what you're saying is that you'd like to disable an API that > some non-backrest users are relying upon but which no backrest users > are relying upon. And you don't understand why some non-backrest > users are opposed to that plan. Is that a correct summary of your > position? > +1 to Robert's Management Summary. ISTM that the onus should be on the patch submitter to provide additions to pg_basebackup that make it as painless as possible for those people *not* using pgBackRest to continue making backups. Breaking this is just not right. Submitting patches that mean that people *must* use pgBackRest is also not right IMHO. Finally, Open Source is about is working together to make the a common project (in this case Poistgres) better - not forcing us to use something else you have written (even if it is good). regards Mark
Greetings Mark, * Mark Kirkwood (mark.kirkwood@catalyst.net.nz) wrote: > ISTM that the onus should be on the patch submitter to provide additions to > pg_basebackup that make it as painless as possible for those people *not* > using pgBackRest to continue making backups. Breaking this is just not > right. Submitting patches that mean that people *must* use pgBackRest is > also not right IMHO. I'm sorry that there's some confusion here- to be clear, no one is required to use pgBackRest. pg_basebackup works quite well and wouldn't be impacted by the changes proposed no this thread. The arguments against removing the exclusive backup feature don't have anything to do with pg_basebackup. Thanks! Stephen
Attachment
On Mon, Feb 25, 2019 at 08:17:27PM +0200, David Steele wrote: > Here's the really obvious bad thing: if users do not update their procedures > and we ignore backup_label.pending on startup then they will end up with a > corrupt database because it will not replay from the correct checkpoint. If > we error on the presence of backup_label.pending then we are right back to > where we started. Not really. If we error on backup_label.pending, we can make the difference between a backend which has crashed in the middle of an exclusive backup without replaying anything and a backend which is started based on a base backup, so an operator can take some action to see what's wrong with the server. If you issue an error, users can also see that their custom backup script is wrong because they forgot to rename the flag after taking a backup of the data folder(s). The real pain point about the non-exclusive APIs is really that we need to keep the connection opened, which can be done easily using drivers like psycopg2 or such, still that's far from straight-forward for the average enterprise user, and it can be hard for some companies to complicate a software stack with more external dependencies. So yes, while I can see the long-term benefits of non-exclusive backups, they are much more complicated to use than exclusive backups. -- Michael
Attachment
On 26/02/19 5:41 PM, Stephen Frost wrote: > Greetings Mark, > > * Mark Kirkwood (mark.kirkwood@catalyst.net.nz) wrote: >> ISTM that the onus should be on the patch submitter to provide additions to >> pg_basebackup that make it as painless as possible for those people *not* >> using pgBackRest to continue making backups. Breaking this is just not >> right. Submitting patches that mean that people *must* use pgBackRest is >> also not right IMHO. > I'm sorry that there's some confusion here- to be clear, no one is > required to use pgBackRest. pg_basebackup works quite well and wouldn't > be impacted by the changes proposed no this thread. The arguments > against removing the exclusive backup feature don't have anything to do > with pg_basebackup. > Ah yes (checks pg_basbackup code), you are correct! Reading this thread I thought I saw a comment to the effect that pg_basebackup was being broken, hence the less than impressed post. Your relentless bashing of people doing their own backups and heavy marketing of pgBackRest - unfortunately - made it easy for me to believe that this was a possibility that you might see as ok. So - apologies for the misunderstanding, however less marketing of your own product would avoid me jumping to the wrong conclusion. regards Mark
On 2/26/19 6:51 AM, Michael Paquier wrote: > On Mon, Feb 25, 2019 at 08:17:27PM +0200, David Steele wrote: >> Here's the really obvious bad thing: if users do not update their procedures >> and we ignore backup_label.pending on startup then they will end up with a >> corrupt database because it will not replay from the correct checkpoint. If >> we error on the presence of backup_label.pending then we are right back to >> where we started. > > Not really. If we error on backup_label.pending, we can make the > difference between a backend which has crashed in the middle of an > exclusive backup without replaying anything and a backend which is > started based on a base backup, so an operator can take some action to > see what's wrong with the server. If you issue an error, users can > also see that their custom backup script is wrong because they forgot > to rename the flag after taking a backup of the data folder(s). The operator still has a decision to make, manually, just as they do now. The wrong decision may mean a corrupt database. Here's the scenario: 1) They do a restore, forget to rename backup_label.pending. 2) Postgres won't start, which is the same action we take now. 3) The user is not sure what to do, rename or delete? They delete, and the cluster is corrupted. Worse, they have scripted the deletion of backup_label so that the cluster will restart on crash. This is the recommendation from our documentation after all. If that script runs after a restore instead of a crash, then the cluster will be corrupt -- silently. -- -David david@pgmasters.net
Fujii Masao wrote: > Another idea is to improve an exclusive backup method so that it will never > cause such issue. What about changing an exclusive backup mode of > pg_start_backup() so that it creates something like backup_label.pending file > instead of backup_label? Then if the database cluster has backup_label.pending > file but not recovery.signal (this is the case where the database is recovered > just after the server crashes while an exclusive backup is in progress), > in this idea, the recovery using that database cluster always ignores > (or removes) backup_label.pending file and start replaying WAL from > the REDO location that pg_control file indicates. So this idea enables us to > work around the issue that an exclusive backup could cause. Then if you restore a backup, but forget to add the recovery.signal file, PostgreSQL will happily recover from a wrong checkpoint and you end up with a corrupted database. I think the fundamental problem with all these approaches is that there is no safe way to distinguish a server crashed in backup mode from a restored backup. This is what makes the problem so hard. The existing exclusive backup is in my opinion the safest variant: it refuses to create a corrupted cluster without manual intervention and gives you a dire warning to consider if you are doing the right thing. Yours, Laurenz Albe
Greetings, * Mark Kirkwood (mark.kirkwood@catalyst.net.nz) wrote: > On 26/02/19 5:41 PM, Stephen Frost wrote: > >* Mark Kirkwood (mark.kirkwood@catalyst.net.nz) wrote: > >>ISTM that the onus should be on the patch submitter to provide additions to > >>pg_basebackup that make it as painless as possible for those people *not* > >>using pgBackRest to continue making backups. Breaking this is just not > >>right. Submitting patches that mean that people *must* use pgBackRest is > >>also not right IMHO. > >I'm sorry that there's some confusion here- to be clear, no one is > >required to use pgBackRest. pg_basebackup works quite well and wouldn't > >be impacted by the changes proposed no this thread. The arguments > >against removing the exclusive backup feature don't have anything to do > >with pg_basebackup. > > Ah yes (checks pg_basbackup code), you are correct! Reading this thread I > thought I saw a comment to the effect that pg_basebackup was being broken, > hence the less than impressed post. No, I don't see breaking pg_basebackup as ok, but I do have concerns about how it doesn't provide for any validation of the backup unless you use tar+gzip. Unlike with exclusive mode, at least you don't run the risk with pg_basebackup that a crash will mean that PG won't come back up without intervention. I can understand how the comments which you were responding to might have lead to that confusion. > Your relentless bashing of people doing their own backups and heavy > marketing of pgBackRest - unfortunately - made it easy for me to believe > that this was a possibility that you might see as ok. So - apologies for the > misunderstanding, however less marketing of your own product would avoid me > jumping to the wrong conclusion. As for my personal recommendations, if you go back far enough, you can see where I used to discuss how to use the exclusive API for doing backups with people, because maybe they didn't like the tools available at the time (I didn't, after all), but after seeing how many issues come from doing that and realizing how bad it is that things like our documented archive_command doesn't ensure that WAL is actually written out to disk, I realized why dedicated tools had been written and started to recommend that people pick one of the dedicated tools instead of writing their own (and I used to recommend the other solution like barman right alongside pgBackRest, and I still do make the generic recommendation at times that people not try to build their own backup tool) but then after seeing things like CIFS mounts happily throwing kernel errors while returning success on (obviously, actually failed) fsync() calls, resulting in corrupted backups that would silently be corrupt if restored from, I realized that having an actual manifest of all files backed up and their checksums wasn't a "nice to have" kind of feature for a backup tool but something critical to a backup solution (thankfully, David knew that right from the start and so it was always part of pgBackRest) and so I have a pretty hard time recommending solutions that don't have that these days. As for back to the main topic of this thread, do I hope that removing the exclusive backup mode will cause people to look at and consider existing, purpose-build, backup solutions for PG, rather than trying to update their one-off backup shell scripts? Yes, of course I do. Am I trying to sneak something in that's going to force them to use pgBackRest? No, of course not. Note that pg_basebackup, barman and pgBackRest have had support for the non-exclusive API since it was introduced (or, for barman, even before then if one installed the pgespresso extension, and pg_basebackup since it was written because it uses the replication protocol), and so does WAL-G (though I'm not sure exactly when it got it, or maybe it always had it), and pg_probackup (which appears to have gotten it in May 2017, less than a year after 9.6 was released), and even the shell-based pitery (which got it at some point before pg_probackup, it looks like, based on the commits). Considering there's at least *5* other backup tools which already support the non-exclusive API, all of which have done so since the non-exclusive mode was introduced or shortly after, I really have to question the angle of attack being used here which is claiming that this thread is motivated by a goal of forcing people to pgBackRest. The main group who would be impacted by the exclusive mode going away are those who aren't using an existing solution and who either didn't get the memo about exclusive mode being deprecated, or decided just to ignore it, and those are the installations which really do concern me the most because they're the highest risk for people ending up losing their data. Thanks! Stephen
Attachment
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > I think the fundamental problem with all these approaches is that there is > no safe way to distinguish a server crashed in backup mode from a restored > backup. This is what makes the problem so hard. Right- if you want to just call start/stop and take a snapshot in the middle and then be able to restore that directly and start up the database, then there *can't* be any way to distinguish between the two, which is, I'm pretty sure, where this whole discussion ended up back during the 9.6 development cycle and why it's still an issue. If there was an easy way to fix this, I feel like we would have already. > The existing exclusive backup is in my opinion the safest variant: it refuses > to create a corrupted cluster without manual intervention and gives you a dire > warning to consider if you are doing the right thing. ... it's the least dangerous if you limit yourself to that method, but that doesn't make it safe. :( In the end, you basically *have* to have a way of extracting out the data needed for the backup (start/stop WAL and such) that doesn't make the running cluster look like it's a backup being restored, and you *have* to make that information available to the database cluster when it's restored somehow, and notify PG that it's doing backup recovery and *not* crash recovery, to eliminate this risk, and that's pretty hard to manage if all you want to do is snapshot the filesystem. Of course, you have to have a solution for WAL too and the thought has crossed my mind that maybe there's something we could do when it comes to stash all the info needed in the WAL archive, but I'm still not sure how we'd solve for knowing if we're doing backup recovery or crash recovery in that case without some kind of marker or something external telling us that's what we're doing. As you proposed previously, but with a bit of a twist, maybe we could just always do backup recovery if we find a .backup (or whatever) file in the WAL that, when compared to pg_control, shows that we were in the process of doing a backup... That would require that everyone always have a restore_command set, which wasn't possible before because that went into recovery.conf, but it's possible to just always have that set now, and that would eliminate the risk of us running the system out of disk space by keeping all the WAL that's generated during the backup local. Obviously, a lot of this is pretty hand-wavy, and you still have the unfortunate situation that if you're actually recoverying from a crash that just happened to happen while you were taking a backup then you could be replaying a heck of a lot more WAL than you needed to, and you have to have a working restore_command on the primary, and you'd have to figure out a way for PG to check for these files .backup or whatever files on startup that doesn't take forever or require stepping through every WAL segment or something, but maybe those concerns could be addressed. Thanks! Stephen
Attachment
On Tue, Feb 26, 2019 at 1:49 AM David Steele <david@pgmasters.net> wrote: > The operator still has a decision to make, manually, just as they do > now. The wrong decision may mean a corrupt database. > > Here's the scenario: > > 1) They do a restore, forget to rename backup_label.pending. > 2) Postgres won't start, which is the same action we take now. > 3) The user is not sure what to do, rename or delete? They delete, and > the cluster is corrupted. > > Worse, they have scripted the deletion of backup_label so that the > cluster will restart on crash. This is the recommendation from our > documentation after all. If that script runs after a restore instead of > a crash, then the cluster will be corrupt -- silently. Sure, that's true. On the other hand, it's not like someone can't manage to use the non-exclusive mode and fail to drop the backup_label and tablespace_map files into the right places. This procedure is tedious and error-prone no matter which way you do it, which is one reason I don't believe that the exclusive backup method is nearly as bad as you're making it out to be. Mind you, I'm not really defending the proposal to add a new backup mode along the likes Fujii Masao is proposing. I don't think the situation will be made less error-prone by having three ways to do it instead of two. But I think we probably would've been happier if we'd designed it the way he's now proposing in the first place, instead of the way we actually did. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 26, 2019 at 3:17 AM David Steele <david@pgmasters.net> wrote: > > On 2/25/19 7:50 PM, Fujii Masao wrote: > > On Mon, Feb 25, 2019 at 10:49 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > >> > >> I'm not playing devil's advocate here to annoy you. I see the problems > >> with the exclusive backup, and I see how it can hurt people. > >> I just think that removing exclusive backup without some kind of help > >> like Andres sketched above will make people unhappy. > > > > +1 > > > > Another idea is to improve an exclusive backup method so that it will never > > cause such issue. What about changing an exclusive backup mode of > > pg_start_backup() so that it creates something like backup_label.pending file > > instead of backup_label? Then if the database cluster has backup_label.pending > > file but not recovery.signal (this is the case where the database is recovered > > just after the server crashes while an exclusive backup is in progress), > > in this idea, the recovery using that database cluster always ignores > > (or removes) backup_label.pending file and start replaying WAL from > > the REDO location that pg_control file indicates. So this idea enables us to > > work around the issue that an exclusive backup could cause. > > It's an interesting idea. > > > On the other hand, the downside of this idea is that the users need to change > > the recovery procedure. When they want to do PITR using the backup having > > backup_label.pending, they need to not only create recovery.signal but also > > rename backup_label.pending to backup_label. Rename of backup_label file > > is brand-new step for their recovery procedure, and changing the recovery > > procedure might be painful for some users. But IMO it's less painful than > > removing an exclusive backup API at all. > > Well, given that we have invalidated all prior recovery procedures in > PG12 I'm not sure how big a deal that is. Of course, it's too late make > a change like this for PG12. > > > Thought? > > Here's the really obvious bad thing: if users do not update their > procedures and we ignore backup_label.pending on startup then they will > end up with a corrupt database because it will not replay from the > correct checkpoint. If we error on the presence of backup_label.pending > then we are right back to where we started. No. In this case, since backup_label.pending and recovery.signal exist, as I described in my previous post, the server stops the recovery with PANIC error before corrupting the database. Then the operator can rename backup_label.pending to backup_label and restart the recovery safely. So, let me clarify the situations; (1) If backup_label and recovery.signal exist, the recovery starts safely. This is the normal case of recovery from the base backup. (2)If backup_label.pending and recovery.signal exist, as described above, PANIC error happens at the start of recovery. This case can happen if the operator forgets to rename backup_label.pending, i.e., operation mistake. So, after PANIC, the operator needs to fix her or his mistake (i.e., rename backup_label.pending) and restart the recovery. (3) If backup_label.pending exists but recovery.signal doesn't, the server ignores (or removes) backup_label.pending and do the recovery starting the pg_control's REDO location. This case can happen if the server crashes while an exclusive backup is in progress. So crash-in-the-middle-of-backup doesn't prevent the recovery from starting in this idea. Regards, -- Fujii Masao
On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao <masao.fujii@gmail.com> wrote: > So, let me clarify the situations; > > (1) If backup_label and recovery.signal exist, the recovery starts safely. > This is the normal case of recovery from the base backup. > > (2)If backup_label.pending and recovery.signal exist, as described above, > PANIC error happens at the start of recovery. This case can happen > if the operator forgets to rename backup_label.pending, i.e., > operation mistake. So, after PANIC, the operator needs to fix her or > his mistake (i.e., rename backup_label.pending) and restart > the recovery. > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > ignores (or removes) backup_label.pending and do the recovery > starting the pg_control's REDO location. This case can happen if > the server crashes while an exclusive backup is in progress. > So crash-in-the-middle-of-backup doesn't prevent the recovery from > starting in this idea. The if-conditions for 1 and 2 appear to be the same, which is confusing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Feb 26, 2019, at 09:26, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao <masao.fujii@gmail.com> wrote: >> So, let me clarify the situations; >> >> (1) If backup_label and recovery.signal exist, the recovery starts safely. >> This is the normal case of recovery from the base backup. >> >> (2)If backup_label.pending and recovery.signal exist, as described above, >> PANIC error happens at the start of recovery. This case can happen >> if the operator forgets to rename backup_label.pending, i.e., >> operation mistake. So, after PANIC, the operator needs to fix her or >> his mistake (i.e., rename backup_label.pending) and restart >> the recovery. >> >> (3) If backup_label.pending exists but recovery.signal doesn't, the server >> ignores (or removes) backup_label.pending and do the recovery >> starting the pg_control's REDO location. This case can happen if >> the server crashes while an exclusive backup is in progress. >> So crash-in-the-middle-of-backup doesn't prevent the recovery from >> starting in this idea. > > The if-conditions for 1 and 2 appear to be the same, which is confusing. I believe #1 is when backup_label (no .pending) exists, #2 is when backup_label.pending (with .pending) exists. At the absolute minimum, this discussion has convinced me that we need to create a wiki page to accurately describe the failurescenarios for both exclusive and non-exclusive backups, and the recovery actions for them. If it exists already,my search attempts weren't successful. If it doesn't, I'm happy to start one. -- -- Christophe Pettus xof@thebuild.com
On Wed, Feb 27, 2019 at 2:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao <masao.fujii@gmail.com> wrote: > > So, let me clarify the situations; > > > > (1) If backup_label and recovery.signal exist, the recovery starts safely. > > This is the normal case of recovery from the base backup. > > > > (2)If backup_label.pending and recovery.signal exist, as described above, > > PANIC error happens at the start of recovery. This case can happen > > if the operator forgets to rename backup_label.pending, i.e., > > operation mistake. So, after PANIC, the operator needs to fix her or > > his mistake (i.e., rename backup_label.pending) and restart > > the recovery. > > > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > > ignores (or removes) backup_label.pending and do the recovery > > starting the pg_control's REDO location. This case can happen if > > the server crashes while an exclusive backup is in progress. > > So crash-in-the-middle-of-backup doesn't prevent the recovery from > > starting in this idea. > > The if-conditions for 1 and 2 appear to be the same, which is confusing. Using other name like backup_in_progress instead of backup_label.pending may help a bit for your concern? That is, (1) backup_label and recovery.signal (2) backup_in_progress and recovery.signal Regards, -- Fujii Masao
On Tue, Feb 26, 2019 at 12:31 PM Christophe Pettus <xof@thebuild.com> wrote: > I believe #1 is when backup_label (no .pending) exists, #2 is when backup_label.pending (with .pending) exists. Oh, whoops. I get it now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 26, 2019 at 3:49 PM David Steele <david@pgmasters.net> wrote: > > On 2/26/19 6:51 AM, Michael Paquier wrote: > > On Mon, Feb 25, 2019 at 08:17:27PM +0200, David Steele wrote: > >> Here's the really obvious bad thing: if users do not update their procedures > >> and we ignore backup_label.pending on startup then they will end up with a > >> corrupt database because it will not replay from the correct checkpoint. If > >> we error on the presence of backup_label.pending then we are right back to > >> where we started. > > > > Not really. If we error on backup_label.pending, we can make the > > difference between a backend which has crashed in the middle of an > > exclusive backup without replaying anything and a backend which is > > started based on a base backup, so an operator can take some action to > > see what's wrong with the server. If you issue an error, users can > > also see that their custom backup script is wrong because they forgot > > to rename the flag after taking a backup of the data folder(s). > > The operator still has a decision to make, manually, just as they do > now. The wrong decision may mean a corrupt database. Even in non-exclusive backup mode, the wrong decision may mean a corrupt database. For example, what if the user may forget to save the backup_label in the backup taken by using non-exclusive backup method? So I'm not sure if this is the matter only for an exclusive backup. Regards, -- Fujii Masao
On Tue, Feb 26, 2019 at 6:31 PM Christophe Pettus <xof@thebuild.com> wrote:
> On Feb 26, 2019, at 09:26, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>> So, let me clarify the situations;
>>
>> (1) If backup_label and recovery.signal exist, the recovery starts safely.
>> This is the normal case of recovery from the base backup.
>>
>> (2)If backup_label.pending and recovery.signal exist, as described above,
>> PANIC error happens at the start of recovery. This case can happen
>> if the operator forgets to rename backup_label.pending, i.e.,
>> operation mistake. So, after PANIC, the operator needs to fix her or
>> his mistake (i.e., rename backup_label.pending) and restart
>> the recovery.
>>
>> (3) If backup_label.pending exists but recovery.signal doesn't, the server
>> ignores (or removes) backup_label.pending and do the recovery
>> starting the pg_control's REDO location. This case can happen if
>> the server crashes while an exclusive backup is in progress.
>> So crash-in-the-middle-of-backup doesn't prevent the recovery from
>> starting in this idea.
>
> The if-conditions for 1 and 2 appear to be the same, which is confusing.
I believe #1 is when backup_label (no .pending) exists, #2 is when backup_label.pending (with .pending) exists.
At the absolute minimum, this discussion has convinced me that we need to create a wiki page to accurately describe the failure scenarios for both exclusive and non-exclusive backups, and the recovery actions for them. If it exists already, my search attempts weren't successful. If it doesn't, I'm happy to start one.
That should not be a wiki page, really, that should be part of the main documentation.
It can be drafted on the wiki of course, but we *really* should get something like that into the docs. Because that's what people are going to read. They'd only go look for the information on the wiki if they knew there was a problem and needed the details, and the big issue is that people don't know there are issues in the first place.
Hi, On 2019-02-26 20:38:17 +0100, Magnus Hagander wrote: > That should not be a wiki page, really, that should be part of the main > documentation. +1 > It can be drafted on the wiki of course, but we *really* should get > something like that into the docs. Because that's what people are going to > read. They'd only go look for the information on the wiki if they knew > there was a problem and needed the details, and the big issue is that > people don't know there are issues in the first place. Also, there's so much wrong stuff on the wiki that people that know to look there just don't believe what they read. Greetings, Andres Freund
On Mon, Feb 25, 2019 at 10:49 PM David Steele <david@pgmasters.net> wrote: > Worse, they have scripted the deletion of backup_label so that the > cluster will restart on crash. This is the recommendation from our > documentation after all. If that script runs after a restore instead of > a crash, then the cluster will be corrupt -- silently. I've always thought that the biggest problem is the documentation. While the "Continuous Archiving and Point-in-Time Recovery (PITR)" section of the user docs isn't quite as bad as it used to be, it's still fundamentally structured "bottom up". That's bad. I think that it would improve matters enormously if that section of the documentation, and possibly the entire chapter was restructured to be "top down". This should include prominent links to specific third party projects, including Barman, pgBackRest, and others. This whole area has become much more contentious than it needs to be, and I fear that we cannot see the forest for the trees. That said, I don't think that we should be in a hurry to fully remove the exclusive backup mode. Home-spun backup solutions are a bad idea, but removing exclusive mode strikes me as corrective over-reaction, and a technical solution to an organizational problem. There are definitely people that use the interface in a sensible way, despite its shortcomings. -- Peter Geoghegan
On 2/26/19 2:46 PM, Andres Freund wrote: > Also, there's so much wrong stuff on the wiki that people that know to > look there just don't believe what they read. Should there be a wiki errata page on the wiki? I'm fairly serious ... for those times when you're reading the Foo page on the wiki, and realize it still describes the way Foo worked before 0abcde, and you don't feel you have time to write a polished update ... go to the Errata page and add a link to Foo, the date, your name, and "noticed this still describes how Foo worked before 0abcde". Or is there a MediaWiki template already for that? You could add the template right at the top of the page, and it would automagically be added to the category "pages with errata". That category could be a starting point for anybody who has a bit of time to improve some wiki pages, and is looking for likely candidates. Regards, -Chap
Fujii Masao wrote: > So, let me clarify the situations; > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > ignores (or removes) backup_label.pending and do the recovery > starting the pg_control's REDO location. This case can happen if > the server crashes while an exclusive backup is in progress. > So crash-in-the-middle-of-backup doesn't prevent the recovery from > starting in this idea That's where I see the problem with your idea. It is fairly easy for someone to restore a backup and then just start the server without first creating "recovery.signal", and that would lead to data corruption. Indeed, if I didn't know much about databases and were faced with the task of restoring a PostgreSQL database from backup in a hurry, that is probably what I would do. Having thought some about the whole problem area, I believe that there is no safe way to disambiguate a backup from a server crashed in backup mode. The non-exclusive backup mode actually suffers from a similar problem --- if you don't add "backup_label" to the backup afterwards, it will lead to database corruption. The only fool-proof tool we have in core is pg_basebackup. Any really good backup solution that wishes to integrate with a third- party backup tool will have to get that tool to include "backup_label" with the backup itself. For that, it would need to know the contents of "backup_label" *before* pg_stop_backup(). Is there a good reason why it is pg_stop_backup() and not pg_start_backup() that provides that information? Yours, Laurenz Albe
> On Feb 26, 2019, at 11:38, Magnus Hagander <magnus@hagander.net> wrote: > That should not be a wiki page, really, that should be part of the main documentation. I was just suggesting using a wiki page to draft it before we drop it into the main documentation. I'm open to other suggestions,of course! -- -- Christophe Pettus xof@thebuild.com
On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > Fujii Masao wrote: > > So, let me clarify the situations; > > > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > > ignores (or removes) backup_label.pending and do the recovery > > starting the pg_control's REDO location. This case can happen if > > the server crashes while an exclusive backup is in progress. > > So crash-in-the-middle-of-backup doesn't prevent the recovery from > > starting in this idea > > That's where I see the problem with your idea. > > It is fairly easy for someone to restore a backup and then just start > the server without first creating "recovery.signal", and that would > lead to data corruption. Isn't this case problematic even when a backup was taken by pg_basebackup? Because of lack of recovery.signal, no archived WAL files are replayed and the database may not reach to the latest point. Regards, -- Fujii Masao
Greetings, * Fujii Masao (masao.fujii@gmail.com) wrote: > On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > > > Fujii Masao wrote: > > > So, let me clarify the situations; > > > > > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > > > ignores (or removes) backup_label.pending and do the recovery > > > starting the pg_control's REDO location. This case can happen if > > > the server crashes while an exclusive backup is in progress. > > > So crash-in-the-middle-of-backup doesn't prevent the recovery from > > > starting in this idea > > > > That's where I see the problem with your idea. > > > > It is fairly easy for someone to restore a backup and then just start > > the server without first creating "recovery.signal", and that would > > lead to data corruption. > > Isn't this case problematic even when a backup was taken by pg_basebackup? > Because of lack of recovery.signal, no archived WAL files are replayed and > the database may not reach to the latest point. There's a number of problems, imv, that I think have just been ignored/missed when it comes to the recovery.conf removal and tools like pg_basebackup, see this msg also: https://postgr.es/m/20181127153405.GX3415@tamriel.snowman.net At this point, I think we need to put these issues on the v12 open items page so that we don't forget about them and get these things fixed before v12 comes out. Thanks! Stephen
Attachment
On 2/28/19 4:44 PM, Fujii Masao wrote: > On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: >> >> Fujii Masao wrote: >>> So, let me clarify the situations; >>> >>> (3) If backup_label.pending exists but recovery.signal doesn't, the server >>> ignores (or removes) backup_label.pending and do the recovery >>> starting the pg_control's REDO location. This case can happen if >>> the server crashes while an exclusive backup is in progress. >>> So crash-in-the-middle-of-backup doesn't prevent the recovery from >>> starting in this idea >> >> That's where I see the problem with your idea. >> >> It is fairly easy for someone to restore a backup and then just start >> the server without first creating "recovery.signal", and that would >> lead to data corruption. > > Isn't this case problematic even when a backup was taken by pg_basebackup? > Because of lack of recovery.signal, no archived WAL files are replayed and > the database may not reach to the latest point. It is problematic because we have a message encouraging users to delete backup_label when in fact they should be creating recovery.signal. If the required WAL is present the cluster will not be corrupted. It just may not play forward as far as you would prefer. -- -David david@pgmasters.net
On 2/27/19 8:22 PM, Christophe Pettus wrote: > > >> On Feb 26, 2019, at 11:38, Magnus Hagander <magnus@hagander.net> wrote: >> That should not be a wiki page, really, that should be part of the main documentation. > > I was just suggesting using a wiki page to draft it before we drop it into the main documentation. I'm open to other suggestions,of course! It seems to me that the best way to discuss this is via a patch to the main documentation. I've written something to get us started: https://commitfest.postgresql.org/22/2042/ -- -David david@pgmasters.net
El 28/2/19 a las 15:13, David Steele escribió: > > It seems to me that the best way to discuss this is via a patch to the > main documentation. I've written something to get us started: > > https://commitfest.postgresql.org/22/2042/ + <para> + The exclusive backup method is deprecated and should be avoided in favor + of the non-exclusive backup method or + <application>pg_basebackup</application>. + </para> Isn't pg_basebackup a non-exclusive backup method? -- Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote: > El 28/2/19 a las 15:13, David Steele escribió: > + <para> > + The exclusive backup method is deprecated and should be avoided in > favor > + of the non-exclusive backup method or > + <application>pg_basebackup</application>. > + </para> > > Isn't pg_basebackup a non-exclusive backup method? It seems to me that it is exactly what this sentence means. Perhaps it should be splitted into two sentences for better clarity? -- Michael
Attachment
On Thu, Feb 28, 2019 at 7:51 PM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote: > > El 28/2/19 a las 15:13, David Steele escribió: > > + <para> > > + The exclusive backup method is deprecated and should be avoided in > > favor > > + of the non-exclusive backup method or > > + <application>pg_basebackup</application>. > > + </para> > > > > Isn't pg_basebackup a non-exclusive backup method? > > It seems to me that it is exactly what this sentence means. Perhaps > it should be splitted into two sentences for better clarity? Comments on the documentation patch should be placed on the thread linked to the documentation patch as opposed to here. David J.
El jue., 28 de feb. de 2019 a la(s) 23:50, Michael Paquier (michael@paquier.xyz) escribió: > > On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote: > > El 28/2/19 a las 15:13, David Steele escribió: > > + <para> > > + The exclusive backup method is deprecated and should be avoided in > > favor > > + of the non-exclusive backup method or > > + <application>pg_basebackup</application>. > > + </para> > > > > Isn't pg_basebackup a non-exclusive backup method? > > It seems to me that it is exactly what this sentence means. Perhaps > it should be splitted into two sentences for better clarity? When I read that phrase I see 2 options: non-exclusive backups, and pg_basebackup. It's as if pg_basebackup wasn't included in the former group. It might bring some confusion. -- Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 3/1/19 11:36 AM, Martín Marqués wrote: > El jue., 28 de feb. de 2019 a la(s) 23:50, Michael Paquier > (michael@paquier.xyz) escribió: >> >> On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote: >>> El 28/2/19 a las 15:13, David Steele escribió: >>> + <para> >>> + The exclusive backup method is deprecated and should be avoided in >>> favor >>> + of the non-exclusive backup method or >>> + <application>pg_basebackup</application>. >>> + </para> >>> >>> Isn't pg_basebackup a non-exclusive backup method? >> >> It seems to me that it is exactly what this sentence means. Perhaps >> it should be splitted into two sentences for better clarity? > > When I read that phrase I see 2 options: non-exclusive backups, and > pg_basebackup. It's as if pg_basebackup wasn't included in the former > group. It might bring some confusion. This has been updated on the other thread and I CC'd you. -- -David david@pgmasters.net
On Thu, Feb 28, 2019 at 05:08:24PM +0200, David Steele wrote: > On 2/28/19 4:44 PM, Fujii Masao wrote: > > On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > > > > > Fujii Masao wrote: > > > > So, let me clarify the situations; > > > > > > > > (3) If backup_label.pending exists but recovery.signal doesn't, the server > > > > ignores (or removes) backup_label.pending and do the recovery > > > > starting the pg_control's REDO location. This case can happen if > > > > the server crashes while an exclusive backup is in progress. > > > > So crash-in-the-middle-of-backup doesn't prevent the recovery from > > > > starting in this idea > > > > > > That's where I see the problem with your idea. > > > > > > It is fairly easy for someone to restore a backup and then just start > > > the server without first creating "recovery.signal", and that would > > > lead to data corruption. > > > > Isn't this case problematic even when a backup was taken by pg_basebackup? > > Because of lack of recovery.signal, no archived WAL files are replayed and > > the database may not reach to the latest point. > > It is problematic because we have a message encouraging users to delete > backup_label when in fact they should be creating recovery.signal. Should we issue a client NOTICE/WARNING message as part of pg_start_backup to indicate what to do if the server crashes before pg_stop_backup()? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 3/5/19 12:36 AM, Bruce Momjian wrote: > On Thu, Feb 28, 2019 at 05:08:24PM +0200, David Steele wrote: >> On 2/28/19 4:44 PM, Fujii Masao wrote: >>> On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: >>>> >>>> Fujii Masao wrote: >>>>> So, let me clarify the situations; >>>>> >>>>> (3) If backup_label.pending exists but recovery.signal doesn't, the server >>>>> ignores (or removes) backup_label.pending and do the recovery >>>>> starting the pg_control's REDO location. This case can happen if >>>>> the server crashes while an exclusive backup is in progress. >>>>> So crash-in-the-middle-of-backup doesn't prevent the recovery from >>>>> starting in this idea >>>> >>>> That's where I see the problem with your idea. >>>> >>>> It is fairly easy for someone to restore a backup and then just start >>>> the server without first creating "recovery.signal", and that would >>>> lead to data corruption. >>> >>> Isn't this case problematic even when a backup was taken by pg_basebackup? >>> Because of lack of recovery.signal, no archived WAL files are replayed and >>> the database may not reach to the latest point. >> >> It is problematic because we have a message encouraging users to delete >> backup_label when in fact they should be creating recovery.signal. > > Should we issue a client NOTICE/WARNING message as part of > pg_start_backup to indicate what to do if the server crashes before > pg_stop_backup()? I'm not in favor of this since it will just encourage users to change their log level and possibly miss important notices/warnings. Regards, -- -David david@pgmasters.net
On 12/11/18 5:24 PM, David Steele wrote: > > Attached is the patch. > > I was a bit surprised by how much code went away. There was a lot of > code dedicated to making sure that backup_label got renamed on shutdown, > that there was not an exclusive backup running, etc. > > There were no tests for exclusive backup so the test changes were minimal. > > I did have to replace one "hot backup" in > 010_logical_decoding_timelines.pl. I'm not sure why the backup was > being done this way, or why the standby needs a copy of pg_replslot > (which is not copied by pg_basebackup). It might be worth looking at > but it seemed out of scope for this patch. Rebased patch is attached. I also fixed the issue that Robert pointed out with regard to changing the function signatures. It's a bit weird that there are optional parameters when the last parameter is really not optional (i.e. the default will lead to an error). Any suggestions are appreciated. Reading back through the thread, the major objection to the original patch was that it was submitted too soon. Hopefully the elapsed time addresses that concern. Another common objection is that the new API is hard to use from bash scripts. This is arguably still true, but Laurenz has demonstrated that bash is capable of this feat: https://github.com/cybertec-postgresql/safe-backup Lastly, there is some concern about getting the backup label too late when doing snapshots or using traditional backup software. It would certainly be possible to return the backup_label and tablespace_map from pg_start_backup() and let the user make the choice about what to do with them. Any of these methods will require some scripting so I don't see why the files couldn't be written as, for example, backup_label.snap and tablespace_map.snap and then renamed by the restore script. The patch does not currently make this change, but it could be added pretty easily if that overcomes this objection. Regards, -- -David david@pgmasters.net
Attachment
Greetings, * David Steele (david@pgmasters.net) wrote: > Lastly, there is some concern about getting the backup label too late when > doing snapshots or using traditional backup software. It would certainly be > possible to return the backup_label and tablespace_map from > pg_start_backup() and let the user make the choice about what to do with > them. Any of these methods will require some scripting so I don't see why > the files couldn't be written as, for example, backup_label.snap and > tablespace_map.snap and then renamed by the restore script. The patch does > not currently make this change, but it could be added pretty easily if that > overcomes this objection. Of course, if someone just restored that snapshot without actually moving the files into place they'd get a corrupted database without any indication of anything being wrong... And if we checked for those files on startup and complained about them being there (called .snap) then we're back into the "crash during backup causes PG to fail to start" situation. All of which is, as I recall, is why we have the API we do today. As such, doing that doesn't strike me as an improvement over using the new API and making it abundently clear that it's not so simple as people might like to think it is... Thanks, Stephen
Attachment
On 6/30/20 6:13 PM, Stephen Frost wrote: > Greetings, > > * David Steele (david@pgmasters.net) wrote: >> Lastly, there is some concern about getting the backup label too late when >> doing snapshots or using traditional backup software. It would certainly be >> possible to return the backup_label and tablespace_map from >> pg_start_backup() and let the user make the choice about what to do with >> them. Any of these methods will require some scripting so I don't see why >> the files couldn't be written as, for example, backup_label.snap and >> tablespace_map.snap and then renamed by the restore script. The patch does >> not currently make this change, but it could be added pretty easily if that >> overcomes this objection. > > Of course, if someone just restored that snapshot without actually > moving the files into place they'd get a corrupted database without any > indication of anything being wrong... > > And if we checked for those files on startup and complained about them > being there (called .snap) then we're back into the "crash during backup > causes PG to fail to start" situation. > > All of which is, as I recall, is why we have the API we do today. I'm certainly not proposing that we ignore .snap files (or whatever). There are a many ways a restore can be done incorrectly and not be detected. The restore script would be responsible for knowing the rules and renaming the files or erroring out. > As such, doing that doesn't strike me as an improvement over using the > new API and making it abundently clear that it's not so simple as people > might like to think it is... Sure, backup is complicated, but I think there's an argument for providing the backup_label and tablespace_map files at the beginning of the backup since they are available at that point. And maybe put some scary language about storing them in PGDATA. Regards, -- -David david@pgmasters.net
Greetings, * David Steele (david@pgmasters.net) wrote: > On 6/30/20 6:13 PM, Stephen Frost wrote: > >* David Steele (david@pgmasters.net) wrote: > >>Lastly, there is some concern about getting the backup label too late when > >>doing snapshots or using traditional backup software. It would certainly be > >>possible to return the backup_label and tablespace_map from > >>pg_start_backup() and let the user make the choice about what to do with > >>them. Any of these methods will require some scripting so I don't see why > >>the files couldn't be written as, for example, backup_label.snap and > >>tablespace_map.snap and then renamed by the restore script. The patch does > >>not currently make this change, but it could be added pretty easily if that > >>overcomes this objection. > > > >Of course, if someone just restored that snapshot without actually > >moving the files into place they'd get a corrupted database without any > >indication of anything being wrong... > > > >And if we checked for those files on startup and complained about them > >being there (called .snap) then we're back into the "crash during backup > >causes PG to fail to start" situation. > > > >All of which is, as I recall, is why we have the API we do today. > > I'm certainly not proposing that we ignore .snap files (or whatever). There > are a many ways a restore can be done incorrectly and not be detected. The > restore script would be responsible for knowing the rules and renaming the > files or erroring out. If we don't ignore them then what are we going to do when they exist..? > >As such, doing that doesn't strike me as an improvement over using the > >new API and making it abundently clear that it's not so simple as people > >might like to think it is... > > Sure, backup is complicated, but I think there's an argument for providing > the backup_label and tablespace_map files at the beginning of the backup > since they are available at that point. And maybe put some scary language > about storing them in PGDATA. but still require that the connection be kept open until the pg_stop_backup() is called? I suppose we could do that but it seems unlikely to change much for anyone. De-deprecating the exclusive backup mode and returning that info at the start and telling users "put this anywhere EXCEPT this specific place, and make sure you DO put these files in that exact place when you restore" would perhaps be closer to what folks who really, really, want to do snapshot-based backups would want... if that ends up being a net positive for users or not is questionable imv. One could imagine a set of scripts that would save that info to a directory above PGDATA but on the appropriate volume to be included in the snapshot, and then on restore would put them into the right place to allow PG to restore from the backup properly, removing the complication around having to save that data elsewhere or snapshot another filesystem after the backup is run or such. Thanks, Stephen
Attachment
On Tue, Jun 30, 2020 at 07:10:46PM -0400, Stephen Frost wrote: > De-deprecating the exclusive backup mode and returning that info at the > start and telling users "put this anywhere EXCEPT this specific place, > and make sure you DO put these files in that exact place when you > restore" would perhaps be closer to what folks who really, really, want > to do snapshot-based backups would want... if that ends up being a net Agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Jun 30, 2020 at 07:10:46PM -0400, Stephen Frost wrote: > > De-deprecating the exclusive backup mode and returning that info at the > > start and telling users "put this anywhere EXCEPT this specific place, > > and make sure you DO put these files in that exact place when you > > restore" would perhaps be closer to what folks who really, really, want > > to do snapshot-based backups would want... if that ends up being a net > > Agreed. Alright, in that case I think we need a victim^Wvolunteer to step up to rewrite the documentation for backup for these two different methods, with very clear instructions on how to do this properly (which really should also include removing the particularly bad 'cp' command usage). Maybe I'm wrong, but as I recall David already tried to do that once and I doubt he's going to be interested in another round of that. Otherwise, it seems best to stop trying to paint this as something that's simple to do and rip the bandaid off and remove this long deprecated method, so we can have some sane documentation for the software writers who are writing backup solutions for PG. Thanks, Stephen
Attachment
On Tue, Jun 30, 2020 at 07:32:47PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (bruce@momjian.us) wrote: > > On Tue, Jun 30, 2020 at 07:10:46PM -0400, Stephen Frost wrote: > > > De-deprecating the exclusive backup mode and returning that info at the > > > start and telling users "put this anywhere EXCEPT this specific place, > > > and make sure you DO put these files in that exact place when you > > > restore" would perhaps be closer to what folks who really, really, want > > > to do snapshot-based backups would want... if that ends up being a net > > > > Agreed. > > Alright, in that case I think we need a victim^Wvolunteer to step up to > rewrite the documentation for backup for these two different methods, > with very clear instructions on how to do this properly (which really > should also include removing the particularly bad 'cp' command usage). > > Maybe I'm wrong, but as I recall David already tried to do that once and > I doubt he's going to be interested in another round of that. > > Otherwise, it seems best to stop trying to paint this as something > that's simple to do and rip the bandaid off and remove this long > deprecated method, so we can have some sane documentation for the > software writers who are writing backup solutions for PG. I can work on it but I don't think I know all the details. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Tue, 2020-06-30 at 17:38 -0400, David Steele wrote: > Rebased patch is attached. I remain -1 on removing the exclusive backup API. People who don't mind manual intervention when PostgreSQL doesn't start after a crash in backup mode could safely use it. We have been over this, and there is not much point in re-opening the discussion. I just had to speak up. > Lastly, there is some concern about getting the backup label too late > when doing snapshots or using traditional backup software. It would > certainly be possible to return the backup_label and tablespace_map from > pg_start_backup() and let the user make the choice about what to do with > them. Any of these methods will require some scripting so I don't see > why the files couldn't be written as, for example, backup_label.snap and > tablespace_map.snap and then renamed by the restore script. The patch > does not currently make this change, but it could be added pretty easily > if that overcomes this objection. That would be interesting improvements that would make the non-exclusive backup API easier to use. Yours, Laurenz Albe
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > On Tue, 2020-06-30 at 17:38 -0400, David Steele wrote: > > Rebased patch is attached. > > I remain -1 on removing the exclusive backup API. It's still deprecated and I'm certainly against removing that status until/unless someone actually fixes it (including documentation), and if we're not going to do that then we should remove it. > People who don't mind manual intervention when PostgreSQL doesn't > start after a crash in backup mode could safely use it. I disagree that that's really appropriate. > We have been over this, and there is not much point in re-opening > the discussion. I just had to speak up. It's not ok to just leave it as deprecated forever. > > Lastly, there is some concern about getting the backup label too late > > when doing snapshots or using traditional backup software. It would > > certainly be possible to return the backup_label and tablespace_map from > > pg_start_backup() and let the user make the choice about what to do with > > them. Any of these methods will require some scripting so I don't see > > why the files couldn't be written as, for example, backup_label.snap and > > tablespace_map.snap and then renamed by the restore script. The patch > > does not currently make this change, but it could be added pretty easily > > if that overcomes this objection. > > That would be interesting improvements that would make the non-exclusive > backup API easier to use. This would presumably be for the exclusive API as a way to make it not completely broken, maybe. If we wanted to try and make this work in a non-exclusive manner then we'd need to do something like have the user save some info out at pg_start_backup time that they then provide at pg_stop_backup time, so we can match up the specific backup and write the appropriate WAL message. Thanks, Stephen
Attachment
On Wed, 2020-07-01 at 08:47 -0400, Stephen Frost wrote: > > I remain -1 on removing the exclusive backup API. > > It's still deprecated and I'm certainly against removing that status > until/unless someone actually fixes it (including documentation), and if > we're not going to do that then we should remove it. Well, it doesn't need fixing, since it is working just fine (for certain values of "just fine"). I agree with the need to document the problems better. Yours, Laurenz Albe
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > On Wed, 2020-07-01 at 08:47 -0400, Stephen Frost wrote: > > > I remain -1 on removing the exclusive backup API. > > > > It's still deprecated and I'm certainly against removing that status > > until/unless someone actually fixes it (including documentation), and if > > we're not going to do that then we should remove it. > > Well, it doesn't need fixing, since it is working just fine (for certain > values of "just fine"). imv it's a disservice to our users to claim it's working when there's a myriad of problems with it that we only allude to in the documentation, some of which simply have no way of being addressed by a user. Perhaps if we remove the landmines we can say it works, but imv it doesn't today. I suspect we'll simply not get to a point where we agree on this, but if you'd like to propose a patch that fixes the issues and the documentation, that'd be a great direction to take this. If that's not going to happen, then I'm in favor of removing it. I really dislike these half-measures and keeping deprecated things that complicate our code and our documentation forever. Thanks, Stephen
Attachment
On Wed, Jul 1, 2020 at 2:47 PM Stephen Frost <sfrost@snowman.net> wrote:
This would presumably be for the exclusive API as a way to make it not
completely broken, maybe.
If we wanted to try and make this work in a non-exclusive manner then
we'd need to do something like have the user save some info out at
pg_start_backup time that they then provide at pg_stop_backup time, so
we can match up the specific backup and write the appropriate WAL
message.
We don't even need to make it an exclusive mode -- we can allow *nonexclusive* backups but remove the requirement to run start and stop backup in the same connection, which I believe is the problem that people have with the exclusive mode. So how about something like this:
1. Make pg_start_backup() return the backup label file as well. We can add an extra column to the output without breaking backwards compatibility. And since we have all the information for the file at pg_start_backup() time, the user can then write that into the backup. We clearly document that this should *not* be written as "backup_label" in the data directory. We can even define what it should be instead.
2. Make pg_start_backup() also return a "cookie" value with a unique identifier if asked to run in "disconnectable mode". Store this identifier in shared state somewhere in the backend.
3. Make a version of pg_stop_backup that takes this cookie as a parameter, and that way supports stopping a "disconnectable backup".
4. Scream loudly in the logs if we shut down postgres with any open "disconnectable backups" in progress. (to let people know they screwed up)
5. Perhaps provide a row in pg_stat_progress_basebackup, or in it's own view, to show these "disconnectable mode backups"?
5b. In fact, maybe provide something like that for the current non-exclusive ones as well, in case people have hung sessions.
The weak spot in this is still the "don't write it as backup label", but we can document that. And that would allow us to do non-exclusive base backups without requiring maintaining the connection. And we can completely get rid of the exclusive ones.
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Wed, Jul 1, 2020 at 2:47 PM Stephen Frost <sfrost@snowman.net> wrote: > > This would presumably be for the exclusive API as a way to make it not > > completely broken, maybe. > > > > If we wanted to try and make this work in a non-exclusive manner then > > we'd need to do something like have the user save some info out at > > pg_start_backup time that they then provide at pg_stop_backup time, so > > we can match up the specific backup and write the appropriate WAL > > message. > > We don't even need to make it an exclusive mode -- we can allow > *nonexclusive* backups but remove the requirement to run start and stop > backup in the same connection, which I believe is the problem that people > have with the exclusive mode. So how about something like this: > > 1. Make pg_start_backup() return the backup label file as well. We can add > an extra column to the output without breaking backwards compatibility. And > since we have all the information for the file at pg_start_backup() time, > the user can then write that into the backup. We clearly document that this > should *not* be written as "backup_label" in the data directory. We can > even define what it should be instead. > > 2. Make pg_start_backup() also return a "cookie" value with a > unique identifier if asked to run in "disconnectable mode". Store this > identifier in shared state somewhere in the backend. Seems like we could possibly just make this be the WAL position the backup started at, since we use that as the finishing location...? I get that users could screw up passing the value back in and get a corrupted backup, but it'd avoid us having to stick anything in shared memory, wouldn't it? > 3. Make a version of pg_stop_backup that takes this cookie as a parameter, > and that way supports stopping a "disconnectable backup". > > 4. Scream loudly in the logs if we shut down postgres with any open > "disconnectable backups" in progress. (to let people know they screwed up) I'm less sure how useful that'd be, but ok. > 5. Perhaps provide a row in pg_stat_progress_basebackup, or in it's own > view, to show these "disconnectable mode backups"? > 5b. In fact, maybe provide something like that for the current > non-exclusive ones as well, in case people have hung sessions. Ideally we'd provide a way for external backup tools to update said progress with information of their own like the % complete, if they wish to... > The weak spot in this is still the "don't write it as backup label", but we > can document that. And that would allow us to do non-exclusive base backups > without requiring maintaining the connection. And we can completely get rid > of the exclusive ones. If this helps us get rid of exclusive backup mode then that's certainly helpful, though inventing yet another method of doing backup makes me cringe to think of the documentation complexity since it doesn't seem like we'd really be reducing that. Now, if we also removed the existing non-exclusive method and then had a *single* backup method with clear documentation as to how to use it properly, that'd be a marked improvement overall. Thanks, Stephen
Attachment
On Wed, Jul 1, 2020 at 6:29 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Magnus Hagander (magnus@hagander.net) wrote:
> On Wed, Jul 1, 2020 at 2:47 PM Stephen Frost <sfrost@snowman.net> wrote:
> > This would presumably be for the exclusive API as a way to make it not
> > completely broken, maybe.
> >
> > If we wanted to try and make this work in a non-exclusive manner then
> > we'd need to do something like have the user save some info out at
> > pg_start_backup time that they then provide at pg_stop_backup time, so
> > we can match up the specific backup and write the appropriate WAL
> > message.
>
> We don't even need to make it an exclusive mode -- we can allow
> *nonexclusive* backups but remove the requirement to run start and stop
> backup in the same connection, which I believe is the problem that people
> have with the exclusive mode. So how about something like this:
>
> 1. Make pg_start_backup() return the backup label file as well. We can add
> an extra column to the output without breaking backwards compatibility. And
> since we have all the information for the file at pg_start_backup() time,
> the user can then write that into the backup. We clearly document that this
> should *not* be written as "backup_label" in the data directory. We can
> even define what it should be instead.
>
> 2. Make pg_start_backup() also return a "cookie" value with a
> unique identifier if asked to run in "disconnectable mode". Store this
> identifier in shared state somewhere in the backend.
Seems like we could possibly just make this be the WAL position the
backup started at, since we use that as the finishing location...? I
get that users could screw up passing the value back in and get a
corrupted backup, but it'd avoid us having to stick anything in shared
memory, wouldn't it?
Or, how about we require them to provide the backup label contents in its entirety? Which I believe does contain that WAL portion? The downside of that is that it would be multiline which might screw with shellscripts.
We would still need to stick *something* in there so we can keep track of when it's done. You should only be able to stop each backup once for example. Otherwise, you'll end up with XLogCtl->Insert.nonExclusiveBackups being wrong if someone calls stop twice with the same cookie/wal location.
> 3. Make a version of pg_stop_backup that takes this cookie as a parameter,
> and that way supports stopping a "disconnectable backup".
>
> 4. Scream loudly in the logs if we shut down postgres with any open
> "disconnectable backups" in progress. (to let people know they screwed up)
I'm less sure how useful that'd be, but ok.
> 5. Perhaps provide a row in pg_stat_progress_basebackup, or in it's own
> view, to show these "disconnectable mode backups"?
> 5b. In fact, maybe provide something like that for the current
> non-exclusive ones as well, in case people have hung sessions.
Ideally we'd provide a way for external backup tools to update said
progress with information of their own like the % complete, if they wish
to...
That's definitely moving the goalposts on this several miles :) Not saying that wouldn't be nice, but let's keep those separate :P
> The weak spot in this is still the "don't write it as backup label", but we
> can document that. And that would allow us to do non-exclusive base backups
> without requiring maintaining the connection. And we can completely get rid
> of the exclusive ones.
If this helps us get rid of exclusive backup mode then that's certainly
helpful, though inventing yet another method of doing backup makes me
cringe to think of the documentation complexity since it doesn't seem
like we'd really be reducing that. Now, if we also removed the existing
non-exclusive method and then had a *single* backup method with clear
documentation as to how to use it properly, that'd be a marked
improvement overall.
We could do that, but the passing of cookies or whatever is completely unnecessary with the current method so it would complicate that one...
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Wed, Jul 1, 2020 at 6:29 PM Stephen Frost <sfrost@snowman.net> wrote: > > * Magnus Hagander (magnus@hagander.net) wrote: > > > On Wed, Jul 1, 2020 at 2:47 PM Stephen Frost <sfrost@snowman.net> wrote: > > > > This would presumably be for the exclusive API as a way to make it not > > > > completely broken, maybe. > > > > > > > > If we wanted to try and make this work in a non-exclusive manner then > > > > we'd need to do something like have the user save some info out at > > > > pg_start_backup time that they then provide at pg_stop_backup time, so > > > > we can match up the specific backup and write the appropriate WAL > > > > message. > > > > > > We don't even need to make it an exclusive mode -- we can allow > > > *nonexclusive* backups but remove the requirement to run start and stop > > > backup in the same connection, which I believe is the problem that people > > > have with the exclusive mode. So how about something like this: > > > > > > 1. Make pg_start_backup() return the backup label file as well. We can > > add > > > an extra column to the output without breaking backwards compatibility. > > And > > > since we have all the information for the file at pg_start_backup() time, > > > the user can then write that into the backup. We clearly document that > > this > > > should *not* be written as "backup_label" in the data directory. We can > > > even define what it should be instead. > > > > > > 2. Make pg_start_backup() also return a "cookie" value with a > > > unique identifier if asked to run in "disconnectable mode". Store this > > > identifier in shared state somewhere in the backend. > > > > Seems like we could possibly just make this be the WAL position the > > backup started at, since we use that as the finishing location...? I > > get that users could screw up passing the value back in and get a > > corrupted backup, but it'd avoid us having to stick anything in shared > > memory, wouldn't it? > > Or, how about we require them to provide the backup label contents in its > entirety? Which I believe does contain that WAL portion? The downside of > that is that it would be multiline which might screw with shellscripts. This is getting a bit beyond the pale imv if we're talking about trying to cater to software that can't manage to handle a multi-line result or multi-line argument to a function. > We would still need to stick *something* in there so we can keep track of > when it's done. You should only be able to stop each backup once for > example. Otherwise, you'll end up with XLogCtl->Insert.nonExclusiveBackups > being wrong if someone calls stop twice with the same cookie/wal location. Hrmpf, yeah, I suppose we can't really avoid having to track it somehow, in which case, yeah, we might as well just have some unique ID for each that we then look up in shared memory (though that unique ID could be the starting WAL if we wanted, but I don't suppose it particularly matters if it's that or something else). We'd need to have some parameter like max_concurrent_backups or something tho, which I was hoping to avoid. > > > 5. Perhaps provide a row in pg_stat_progress_basebackup, or in it's own > > > view, to show these "disconnectable mode backups"? > > > 5b. In fact, maybe provide something like that for the current > > > non-exclusive ones as well, in case people have hung sessions. > > > > Ideally we'd provide a way for external backup tools to update said > > progress with information of their own like the % complete, if they wish > > to... > > That's definitely moving the goalposts on this several miles :) Not saying > that wouldn't be nice, but let's keep those separate :P Well, you brought up progress monitoring. :) > > The weak spot in this is still the "don't write it as backup label", but > > we > > > can document that. And that would allow us to do non-exclusive base > > backups > > > without requiring maintaining the connection. And we can completely get > > rid > > > of the exclusive ones. > > > > If this helps us get rid of exclusive backup mode then that's certainly > > helpful, though inventing yet another method of doing backup makes me > > cringe to think of the documentation complexity since it doesn't seem > > like we'd really be reducing that. Now, if we also removed the existing > > non-exclusive method and then had a *single* backup method with clear > > documentation as to how to use it properly, that'd be a marked > > improvement overall. > > We could do that, but the passing of cookies or whatever is completely > unnecessary with the current method so it would complicate that one... No, it wouldn't complicate "that one" since "that one" would be gone. Yes, it'd be a slightly more complicated process than what they handle currently but I have confidence that everything that's, today, arranging to keep a PG connection open would be able to handle instead (or, in addition to, if they wanted to continue having that connection anyway) passing the cookie through from start->stop backup, and I do think it would massively simplify the documentation which would make it well worth it. Thanks, Stephen
Attachment
On Wed, Jul 1, 2020 at 11:48 AM Magnus Hagander <magnus@hagander.net> wrote: > We don't even need to make it an exclusive mode -- we can allow *nonexclusive* backups but remove the requirement to runstart and stop backup in the same connection, which I believe is the problem that people have with the exclusive mode. Well, one of the big advantages of the non-exclusive mode is that the backup automatically is terminated if the connection is closed, which removes the possibility of leaving a backup in progress for a long time inadvertently. I think that back-tracking on that would probably be a step in the wrong direction. I agree with what I understand to be Bruce's position -- non-exclusive backup mode is better than exclusive mode backup mode, but exclusive backup mode is not broken and doesn't need to be removed. I think it's very sensible to continue to support exclusive backup mode but to encourage users to use non-exclusive backup mode instead. I find Stephen's position that exclusive backup mode cannot be used safely to be unsupported by any compelling evidence, and reviewing the thread, I find that there are a LOT of people who seem to agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 1, 2020 at 01:23:21PM -0400, Stephen Frost wrote: > > Or, how about we require them to provide the backup label contents in its > > entirety? Which I believe does contain that WAL portion? The downside of > > that is that it would be multiline which might screw with shellscripts. > > This is getting a bit beyond the pale imv if we're talking about trying > to cater to software that can't manage to handle a multi-line result or > multi-line argument to a function. Seems to work for me in bash: $ X='a > b > c' $ echo "$X" a b c $ echo $X a b c You need the double quotes around the variable name. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Wed, Jul 1, 2020 at 9:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 1, 2020 at 11:48 AM Magnus Hagander <magnus@hagander.net> wrote:
> We don't even need to make it an exclusive mode -- we can allow *nonexclusive* backups but remove the requirement to run start and stop backup in the same connection, which I believe is the problem that people have with the exclusive mode.
Well, one of the big advantages of the non-exclusive mode is that the
backup automatically is terminated if the connection is closed, which
removes the possibility of leaving a backup in progress for a long
time inadvertently. I think that back-tracking on that would probably
be a step in the wrong direction.
I am in no way suggesting we should do this to replace the nonexclusive backups the way they are now. They would still retain this property, and should be encouraged to be the default way.
I am only suggesting this as an alternative to the exclusive ones, which hopefully can get rid of most of the drawbacks of the exclusive mode ones. Yes, it would still lack the ability to automatically terminate (except it *would* do such a termination if the server restarted -- instead of like the current exclusive mode which just breaks the server then).
I agree with what I understand to be Bruce's position -- non-exclusive
backup mode is better than exclusive mode backup mode, but exclusive
backup mode is not broken and doesn't need to be removed. I think it's
very sensible to continue to support exclusive backup mode but to
encourage users to use non-exclusive backup mode instead. I find
Stephen's position that exclusive backup mode cannot be used safely to
be unsupported by any compelling evidence, and reviewing the thread, I
find that there are a LOT of people who seem to agree.
I think that is what I'm suggesting -- that is, I'm suggesting a way to solve the actual problems that people have with non-exclusive mode, to the point of making it possible ot run those without the drawbacks.
As far as I've seen, the one thing that people have problems with in the exclusive mode backups are precisely the fact that they have to keep a persistent conneciton open, and thus it cannot work together with backup software that is limited to only supporting running a pre- and a post script.
Something like I have suggested here is to solve *that* problem. I don't think anybody actually explicitly wants "exclusive backups" -- they want a backup solution that plugs into their world of pre/post scripts. And if we can make that one work in a safer way than the current exclusive backups, ohw is that not an improvement?
//Magnus
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Jul 1, 2020 at 11:48 AM Magnus Hagander <magnus@hagander.net> wrote: > > We don't even need to make it an exclusive mode -- we can allow *nonexclusive* backups but remove the requirement torun start and stop backup in the same connection, which I believe is the problem that people have with the exclusive mode. > > Well, one of the big advantages of the non-exclusive mode is that the > backup automatically is terminated if the connection is closed, which > removes the possibility of leaving a backup in progress for a long > time inadvertently. I think that back-tracking on that would probably > be a step in the wrong direction. What's the actual downside from having a forgotten backup being left around for an extended period of time..? > I agree with what I understand to be Bruce's position -- non-exclusive > backup mode is better than exclusive mode backup mode, but exclusive > backup mode is not broken and doesn't need to be removed. I think it's > very sensible to continue to support exclusive backup mode but to > encourage users to use non-exclusive backup mode instead. I find > Stephen's position that exclusive backup mode cannot be used safely to > be unsupported by any compelling evidence, and reviewing the thread, I > find that there are a LOT of people who seem to agree. I didn't realize there was a question regarding what happens if the DB server ends up crashing during an exclusive backup. That hardly seems to be safe if you care about having your database start up after a crash, which seems like an important thing that we generally consider of value considering the amount of effort we put into it. I agree that there seems to be a number of people who want to ignore that the system breaks down in that situation in order to keep exclusive backup around, but, unlike the discussion being had between Magnus and I, that doesn't actually move us forward at all in terms of providing a good solution or being able to write the documentation in a sane way for people who aren't PG hackers to be able to follow. Thanks, Stephen
Attachment
On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote: > As far as I've seen, the one thing that people have problems with in the exclusive mode backups are precisely the factthat they have to keep a persistent conneciton open, and thus it cannot work together with backup software that is limitedto only supporting running a pre- and a post script. > > Something like I have suggested here is to solve *that* problem. I don't think anybody actually explicitly wants "exclusivebackups" -- they want a backup solution that plugs into their world of pre/post scripts. And if we can make thatone work in a safer way than the current exclusive backups, ohw is that not an improvement? Yeah, I guess that's a pretty fair point. I have to confess to having somewhat limited enthusiasm for adding a third mode here, but it might be worth it. It seems pretty well inevitable to me that people are going to forget to end them. I am not sure exactly what the consequences of that will be, but if for example there's a limited number of shared memory slots to store information about these backups, then if you leak any, you'll eventually run out of slots and your backups will start failing. I feel like that's a going to happen to about 75% of the people who try to use this new backup mode at some point in time, but maybe I'm a pessimist.[1] If we could jigger things so that you don't need to stop the backup at all, you only start it, and whether you ever finish copying everything is something about which the system need not know or care, that would be a lot nicer. I'm not sure I see how to do that, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] OK, that's not really a "maybe".
On Wed, Jul 1, 2020 at 9:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote:
> As far as I've seen, the one thing that people have problems with in the exclusive mode backups are precisely the fact that they have to keep a persistent conneciton open, and thus it cannot work together with backup software that is limited to only supporting running a pre- and a post script.
>
> Something like I have suggested here is to solve *that* problem. I don't think anybody actually explicitly wants "exclusive backups" -- they want a backup solution that plugs into their world of pre/post scripts. And if we can make that one work in a safer way than the current exclusive backups, ohw is that not an improvement?
Yeah, I guess that's a pretty fair point. I have to confess to having
somewhat limited enthusiasm for adding a third mode here, but it might
be worth it.
The intention is definitely not to have 3 modes. If we build this mode the intention is that it is strictly better than exclusive mode, and thus exclusive mode can finally be removed. (Whereas the nonexclusive mode is better in many ways, but not all)
It seems pretty well inevitable to me that people are going to forget
to end them. I am not sure exactly what the consequences of that will
be, but if for example there's a limited number of shared memory slots
to store information about these backups, then if you leak any, you'll
eventually run out of slots and your backups will start failing. I
feel like that's a going to happen to about 75% of the people who try
to use this new backup mode at some point in time, but maybe I'm a
pessimist.[1]
Agreed. yet this is still strictly better or equal than the current exclusive backups, which fail after forgetting that *once*.
If we could jigger things so that you don't need to stop the backup at
all, you only start it, and whether you ever finish copying everything
is something about which the system need not know or care, that would
be a lot nicer. I'm not sure I see how to do that, though.
Yeah, I'm not sure about that one either.
//Magnus
On 7/1/20 3:58 PM, Robert Haas wrote: > On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote: >> As far as I've seen, the one thing that people have problems with in the exclusive mode backups are precisely the factthat they have to keep a persistent conneciton open, and thus it cannot work together with backup software that is limitedto only supporting running a pre- and a post script. >> >> Something like I have suggested here is to solve *that* problem. I don't think anybody actually explicitly wants "exclusivebackups" -- they want a backup solution that plugs into their world of pre/post scripts. And if we can make thatone work in a safer way than the current exclusive backups, ohw is that not an improvement? > > Yeah, I guess that's a pretty fair point. I have to confess to having > somewhat limited enthusiasm for adding a third mode here, but it might > be worth it. > > It seems pretty well inevitable to me that people are going to forget > to end them. I am not sure exactly what the consequences of that will > be, but if for example there's a limited number of shared memory slots > to store information about these backups, then if you leak any, you'll > eventually run out of slots and your backups will start failing. I > feel like that's a going to happen to about 75% of the people who try > to use this new backup mode at some point in time, but maybe I'm a > pessimist.[1] I did consider going down this road and I came to more or less the same conclusion. Backups would accumulate until we just ran out of space and/or started throwing errors at some limit. Hardly very attractive. Also, full pages writes will be left on indefinitely. > If we could jigger things so that you don't need to stop the backup at > all, you only start it, and whether you ever finish copying everything > is something about which the system need not know or care, that would > be a lot nicer. I'm not sure I see how to do that, though. Well, the only thing pg_stop_backup() *really* needs to know is the starting WAL position. pg_start_backup() gets that info so if it passes it back to pg_stop_backup() that could be enough. Or as was proposed above, it just passes the backup_label back to pg_stop_backup() for parsing. To write the .backup file to WAL (which I personally think is pretty useless) you'd need backup_label. The issue I have not been able to work around is that full page writes need to be on for backups and they won't get turned back off if you don't end all the backups that get started. Here's a thought. What if we just stored the oldest starting LSN and a count of how many backups have been requested. When the backup ends it checks that backup count is > 0 and starting LSN is <= its starting LSN. If not, it throws an error. When backups go to 0 FPWs are turned off if they were off before the first backup. That way, we could have a single function to cancel all backups in progress. Even if a new one started, one ending that was started prior to the cancel would know that it could not end successfully and error. For the vast majority of people who have full page writes on, it wouldn't really matter how many backups were running so we could probably just skip all of that. If we wanted to really make it more foolproof we could add a checksum to the backup label to make sure the user doesn't damage it before passing it back. Regards, -- -David david@pgmasters.net
On Wed, Jul 1, 2020 at 10:24:15PM +0200, Magnus Hagander wrote: > On Wed, Jul 1, 2020 at 9:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, I guess that's a pretty fair point. I have to confess to having > somewhat limited enthusiasm for adding a third mode here, but it might > be worth it. > The intention is definitely not to have 3 modes. If we build this mode the > intention is that it is strictly better than exclusive mode, and thus exclusive > mode can finally be removed. (Whereas the nonexclusive mode is better in many > ways, but not all) Agreed. I don't think three modes would help anyone. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Wed, Jul 1, 2020 at 04:27:58PM -0400, David Steele wrote: > On 7/1/20 3:58 PM, Robert Haas wrote: > > If we could jigger things so that you don't need to stop the backup at > > all, you only start it, and whether you ever finish copying everything > > is something about which the system need not know or care, that would > > be a lot nicer. I'm not sure I see how to do that, though. > > Well, the only thing pg_stop_backup() *really* needs to know is the starting > WAL position. pg_start_backup() gets that info so if it passes it back to > pg_stop_backup() that could be enough. Or as was proposed above, it just Doesn't pg_start_backup already return this? SELECT pg_start_backup('test'); pg_start_backup ----------------- --> 0/2000028 > Here's a thought. What if we just stored the oldest starting LSN and a count > of how many backups have been requested. When the backup ends it checks that > backup count is > 0 and starting LSN is <= its starting LSN. If not, it > throws an error. When backups go to 0 FPWs are turned off if they were off > before the first backup. Can't we just error out of an exclusive-style backup is requested to start and a previous exclusive-style backup has not been stopped? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Wed, Jul 1, 2020 at 4:27 PM David Steele <david@pgmasters.net> wrote: > Well, the only thing pg_stop_backup() *really* needs to know is the > starting WAL position. pg_start_backup() gets that info so if it passes > it back to pg_stop_backup() that could be enough. Or as was proposed > above, it just passes the backup_label back to pg_stop_backup() for > parsing. To write the .backup file to WAL (which I personally think is > pretty useless) you'd need backup_label. > > The issue I have not been able to work around is that full page writes > need to be on for backups and they won't get turned back off if you > don't end all the backups that get started. > > Here's a thought. What if we just stored the oldest starting LSN and a > count of how many backups have been requested. When the backup ends it > checks that backup count is > 0 and starting LSN is <= its starting LSN. > If not, it throws an error. When backups go to 0 FPWs are turned off if > they were off before the first backup. > > That way, we could have a single function to cancel all backups in > progress. Even if a new one started, one ending that was started prior > to the cancel would know that it could not end successfully and error. > > For the vast majority of people who have full page writes on, it > wouldn't really matter how many backups were running so we could > probably just skip all of that. Yeah, I like this line of thinking. Not sure about the details. > If we wanted to really make it more foolproof we could add a checksum to > the backup label to make sure the user doesn't damage it before passing > it back. I don't see a whole lot of point in that, honestly. It doesn't make it impossible to damage the backup label; it just makes it harder. And that doesn't seem worth much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/1/20 4:27 PM, David Steele wrote: > > Here's a thought. What if we just stored the oldest starting LSN and a > count of how many backups have been requested. When the backup ends it > checks that backup count is > 0 and starting LSN is <= its starting LSN. > If not, it throws an error. When backups go to 0 FPWs are turned off if > they were off before the first backup. To be clear, the min starting LSN only needs to be reset when incrementing total running backups from 0 to 1. What we are recording here is the LSN since FPWs were enabled. As long as it is <= the start LSN of the backup when it ends then all is well (I think). Regards, -- -David david@pgmasters.net
On Wed, Jul 1, 2020 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > Agreed. I don't think three modes would help anyone. Well, I think that if and when we remove the existing exclusive mode, we're going to break a bunch of people's backup scripts. I think it's appropriate to do that eventually, but I'm not in a big rush. Long deprecation periods are a feature, not a bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 1, 2020 at 10:28 PM David Steele <david@pgmasters.net> wrote:
On 7/1/20 3:58 PM, Robert Haas wrote:
> On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote:
>> As far as I've seen, the one thing that people have problems with in the exclusive mode backups are precisely the fact that they have to keep a persistent conneciton open, and thus it cannot work together with backup software that is limited to only supporting running a pre- and a post script.
>>
>> Something like I have suggested here is to solve *that* problem. I don't think anybody actually explicitly wants "exclusive backups" -- they want a backup solution that plugs into their world of pre/post scripts. And if we can make that one work in a safer way than the current exclusive backups, ohw is that not an improvement?
>
> Yeah, I guess that's a pretty fair point. I have to confess to having
> somewhat limited enthusiasm for adding a third mode here, but it might
> be worth it.
>
> It seems pretty well inevitable to me that people are going to forget
> to end them. I am not sure exactly what the consequences of that will
> be, but if for example there's a limited number of shared memory slots
> to store information about these backups, then if you leak any, you'll
> eventually run out of slots and your backups will start failing. I
> feel like that's a going to happen to about 75% of the people who try
> to use this new backup mode at some point in time, but maybe I'm a
> pessimist.[1]
I did consider going down this road and I came to more or less the same
conclusion. Backups would accumulate until we just ran out of space
and/or started throwing errors at some limit. Hardly very attractive.
Also, full pages writes will be left on indefinitely.
> If we could jigger things so that you don't need to stop the backup at
> all, you only start it, and whether you ever finish copying everything
> is something about which the system need not know or care, that would
> be a lot nicer. I'm not sure I see how to do that, though.
Well, the only thing pg_stop_backup() *really* needs to know is the
starting WAL position. pg_start_backup() gets that info so if it passes
it back to pg_stop_backup() that could be enough. Or as was proposed
above, it just passes the backup_label back to pg_stop_backup() for
parsing. To write the .backup file to WAL (which I personally think is
pretty useless) you'd need backup_label.
The issue I have not been able to work around is that full page writes
need to be on for backups and they won't get turned back off if you
don't end all the backups that get started.
Here's a thought. What if we just stored the oldest starting LSN and a
count of how many backups have been requested. When the backup ends it
checks that backup count is > 0 and starting LSN is <= its starting LSN.
If not, it throws an error. When backups go to 0 FPWs are turned off if
they were off before the first backup.
I guess the weak spot of that one is if some script does stop without doing start first, it will break somebody else's backup. (And yes, I've seen scripts make this mistake many times -- it equally breaks the exclusive backups in the current system...)
And don't we need the combination of the start/stop location for the history file?
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Jul 1, 2020 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > > Agreed. I don't think three modes would help anyone. > > Well, I think that if and when we remove the existing exclusive mode, > we're going to break a bunch of people's backup scripts. I think it's > appropriate to do that eventually, but I'm not in a big rush. Long > deprecation periods are a feature, not a bug. This is pretty tangential to the overall discussion, but I generally disagree that deprecation periods beyond what we already, always, provide thanks to supporting 5 major versions concurrently are actually a feature or are healthy for the project. I continue to find it curious that we stress a great deal over (very likely) poorly written backup scripts that haven't been updated in the 5+? years since exclusive mode was deprecated, but we happily add new keywords in new major versions and remove columns in our catalog tables or rename columns or entirely redo how recovery works (breaking every script out there that performed a restore..) or do any number of other things that potentially break applications that communicate through the PG protocol and other parts of the system that very likely have scripts that were written to work with them. I'm really of the opinion that we need to stop doing that. If someone could explain what is so special about *this* part of the system that we absolutely can't possibly accept any change that might break user's scripts, and why it's worth all of the angst, maintenance, ridiculously difficult documentation to understand and hacks (the interface to pg_start_backup is ridiculously warty because of this), I'd greatly appreciate it. Thanks, Stephen
Attachment
On Wed, Jul 1, 2020 at 04:36:03PM -0400, Robert Haas wrote: > On Wed, Jul 1, 2020 at 4:29 PM Bruce Momjian <bruce@momjian.us> wrote: > > Agreed. I don't think three modes would help anyone. > > Well, I think that if and when we remove the existing exclusive mode, > we're going to break a bunch of people's backup scripts. I think it's > appropriate to do that eventually, but I'm not in a big rush. Long > deprecation periods are a feature, not a bug. True, but we have had this marked as deprecated for a while. Our Postgres people who love backup technology want this improved, and our operational people are asking for it to stay, so keeping the ability but changing the API seems reasonable to me. I tend to think our "love backup technology" folks know more than our operational folks, even though I am in the operational camp. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On 7/1/20 4:39 PM, Magnus Hagander wrote: > On Wed, Jul 1, 2020 at 10:28 PM David Steele <david@pgmasters.net > Here's a thought. What if we just stored the oldest starting LSN and a > count of how many backups have been requested. When the backup ends it > checks that backup count is > 0 and starting LSN is <= its starting > LSN. > If not, it throws an error. When backups go to 0 FPWs are turned off if > they were off before the first backup. > > I guess the weak spot of that one is if some script does stop without > doing start first, it will break somebody else's backup. (And yes, I've > seen scripts make this mistake many times -- it equally breaks the > exclusive backups in the current system...) Well, they'd have to pass in a backup_label with a start LSN >= the min LSN or they would just get an error and not decrement the backup count. The real issue would be if they called pg_stop_backup twice. We might be able to stop that with a rolling max stop lsn to keep anyone from calling pg_stop_backup() twice. But yeah, it would be possible to kill somebody else's session with some finagling. Still, worse case would be an error'd backup rather than a corrupt one. But really, that's only if FPWs are turned off. We can also do some extra validation if the session is left open, which for most software is the norm now. > And don't we need the combination of the start/stop location for the > history file? You mean the .backup file for the WAL? All that needs is the backup_label and the stop LSN that's determined in pg_stop_backup(). Am I missing something? Regards, -- -David david@pgmasters.net
On 7/1/20 4:32 PM, Bruce Momjian wrote: > On Wed, Jul 1, 2020 at 04:27:58PM -0400, David Steele wrote: >> On 7/1/20 3:58 PM, Robert Haas wrote: >>> If we could jigger things so that you don't need to stop the backup at >>> all, you only start it, and whether you ever finish copying everything >>> is something about which the system need not know or care, that would >>> be a lot nicer. I'm not sure I see how to do that, though. >> >> Well, the only thing pg_stop_backup() *really* needs to know is the starting >> WAL position. pg_start_backup() gets that info so if it passes it back to >> pg_stop_backup() that could be enough. Or as was proposed above, it just > > Doesn't pg_start_backup already return this? > > SELECT pg_start_backup('test'); > pg_start_backup > ----------------- > --> 0/2000028 It does, though after looking at the code I think we need the entire backup_label. For any of this to be useful, pg_start_backup() *must* return backup_label anyway so it can be stored somewhere the backup software can find it. >> Here's a thought. What if we just stored the oldest starting LSN and a count >> of how many backups have been requested. When the backup ends it checks that >> backup count is > 0 and starting LSN is <= its starting LSN. If not, it >> throws an error. When backups go to 0 FPWs are turned off if they were off >> before the first backup. > > Can't we just error out of an exclusive-style backup is requested to > start and a previous exclusive-style backup has not been stopped? That means only allowing one exclusive backup but it's not like we don't already have that limitation. But -- I'd rather not even need to specify exclusive/non-exclusive when the backup is started. Basically, all backups would be non-exclusive but we'd allow pg_stop_backup() to be called in a new session (or the same session). Regards, -- -David david@pgmasters.net
On Wed, Jul 1, 2020 at 05:20:16PM -0400, David Steele wrote: > On 7/1/20 4:32 PM, Bruce Momjian wrote: > > On Wed, Jul 1, 2020 at 04:27:58PM -0400, David Steele wrote: > > > On 7/1/20 3:58 PM, Robert Haas wrote: > > > > If we could jigger things so that you don't need to stop the backup at > > > > all, you only start it, and whether you ever finish copying everything > > > > is something about which the system need not know or care, that would > > > > be a lot nicer. I'm not sure I see how to do that, though. > > > > > > Well, the only thing pg_stop_backup() *really* needs to know is the starting > > > WAL position. pg_start_backup() gets that info so if it passes it back to > > > pg_stop_backup() that could be enough. Or as was proposed above, it just > > > > Doesn't pg_start_backup already return this? > > > > SELECT pg_start_backup('test'); > > pg_start_backup > > ----------------- > > --> 0/2000028 > > It does, though after looking at the code I think we need the entire > backup_label. For any of this to be useful, pg_start_backup() *must* return > backup_label anyway so it can be stored somewhere the backup software can > find it. Sure, sounds good to me. > > > Here's a thought. What if we just stored the oldest starting LSN and a count > > > of how many backups have been requested. When the backup ends it checks that > > > backup count is > 0 and starting LSN is <= its starting LSN. If not, it > > > throws an error. When backups go to 0 FPWs are turned off if they were off > > > before the first backup. > > > > Can't we just error out of an exclusive-style backup is requested to > > start and a previous exclusive-style backup has not been stopped? > > That means only allowing one exclusive backup but it's not like we don't > already have that limitation. But -- I'd rather not even need to specify > exclusive/non-exclusive when the backup is started. Basically, all backups > would be non-exclusive but we'd allow pg_stop_backup() to be called in a new > session (or the same session). OK, however you want. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Wed, Jul 1, 2020 at 11:08 PM David Steele <david@pgmasters.net> wrote:
On 7/1/20 4:39 PM, Magnus Hagander wrote:
> On Wed, Jul 1, 2020 at 10:28 PM David Steele <david@pgmasters.net
> Here's a thought. What if we just stored the oldest starting LSN and a
> count of how many backups have been requested. When the backup ends it
> checks that backup count is > 0 and starting LSN is <= its starting
> LSN.
> If not, it throws an error. When backups go to 0 FPWs are turned off if
> they were off before the first backup.
>
> I guess the weak spot of that one is if some script does stop without
> doing start first, it will break somebody else's backup. (And yes, I've
> seen scripts make this mistake many times -- it equally breaks the
> exclusive backups in the current system...)
Well, they'd have to pass in a backup_label with a start LSN >= the min
LSN or they would just get an error and not decrement the backup count.
Oh as long as we're still requiring pg_stop_backup() to pass in something that it received from pg_start_backup() so that we can verify it's the correct one, then that problem wouldn't exist.
The real issue would be if they called pg_stop_backup twice. We might be
able to stop that with a rolling max stop lsn to keep anyone from
calling pg_stop_backup() twice.
But yeah, it would be possible to kill somebody else's session with some
finagling. Still, worse case would be an error'd backup rather than a
corrupt one.
What about the case of:
Session A - start backup
Session B - stop backup (but A is still running of course)
Session C - start backup
Session A - stop backup
At this point, session A can still stop the backup because there is one running -- but there has been time in between the two when no backup was running. That could lead to Session A getting a corrupt backup, I think -- unless we pass some unique identifier back in pg_stop_backup that matches it up. (And if we do pass that up, then session B running pg_stop_backup() would fail, thus leaving the backup started by A still running.
But really, that's only if FPWs are turned off. We can also do some
extra validation if the session is left open, which for most software is
the norm now.
Session left open should really still be the default, as it's the safest one :) And yes, most backup *software* does it. But the entire reason we want another mode from the current non-exclusive backup is people *not* using one of the ready-made backup solutions.
> And don't we need the combination of the start/stop location for the
> history file?
You mean the .backup file for the WAL? All that needs is the
backup_label and the stop LSN that's determined in pg_stop_backup(). Am
I missing something?
I mean the .backup file in the archive, yes. That one contains both the start and the stop location, timeline and time.
On Wed, 1 Jul 2020 15:58:57 -0400 Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote: > > As far as I've seen, the one thing that people have problems with in the > > exclusive mode backups are precisely the fact that they have to keep a > > persistent conneciton open, and thus it cannot work together with backup > > software that is limited to only supporting running a pre- and a post > > script. > > > > Something like I have suggested here is to solve *that* problem. I don't > > think anybody actually explicitly wants "exclusive backups" -- they want a > > backup solution that plugs into their world of pre/post scripts. And if we > > can make that one work in a safer way than the current exclusive backups, > > ohw is that not an improvement? > > Yeah, I guess that's a pretty fair point. I have to confess to having > somewhat limited enthusiasm for adding a third mode here, but it might > be worth it. > > It seems pretty well inevitable to me that people are going to forget > to end them. There's so many way an admin could break their backup process and not notice it. Even with current nonexclusive backup, a bad written script might creates bad unrecoverable backups. In regard with your concern, monitoring in progress backups might be *one* answer, from server side. This was "easy" with exclusive backup, we just monitor the backup_label age and warn if it is older than expected [1]. Using non-exclusive backup...this is not that easy anymore. And pg_is_in_backup() is quite misleading if the admin found it without reading doc. Maybe an admin function to list in progress non-exclusive backup and related backend pid might be a good start? With such admin function facilities, client side process could monitor backup and decide to cancel them based on some internal backup policies/process. Tools like pg_basebackup (and probably pgbackrest/barman to some extends) still need the backup to abort on disconnection. Maybe it could flag its session using the replication protocol or call a new admin function or load a hook on session-shutdown to keep previous API behavior? Regards, [1] https://github.com/OPMDG/check_pgactivity/#user-content-backup_label_age-8.1
On Wed, 2020-07-01 at 16:46 -0400, Stephen Frost wrote: > I continue to find it curious that we stress a great deal over (very > likely) poorly written backup scripts that haven't been updated in the > 5+? years since exclusive mode was deprecated, but we happily add new > keywords in new major versions and remove columns in our catalog tables > or rename columns or entirely redo how recovery works (breaking every > script out there that performed a restore..) or do any number of other > things that potentially break applications that communicate through the > PG protocol and other parts of the system that very likely have scripts > that were written to work with them. > > I'm really of the opinion that we need to stop doing that. > > If someone could explain what is so special about *this* part of the > system that we absolutely can't possibly accept any change that might > break user's scripts, and why it's worth all of the angst, maintenance, > ridiculously difficult documentation to understand and hacks (the > interface to pg_start_backup is ridiculously warty because of this), I'd > greatly appreciate it. Easy. Lots of people write backup scripts, and fewer people write complicated pg_catalog queries. We would make way more users unhappy. Besides, there is nothing "poorly written" about a backup script using exclusive backup as long as the user is aware that there could be some small degree of trouble in case of a crash. You have a point about reworking the way recovery works: after teaching several classes and running into oddities with recovery parameters in postgresql.conf, I am not sure if that was a good move. Particularly the necessity for "recovery.signal" makes the New Way no easier than the Old Way - perhaps something like "pg_ctl start_in_recovery" would have been smarter. But one instance where we made users unhappy is not justification for making them unhappy again. Yours, Laurenz Albe
On Thu, Jul 2, 2020 at 1:06 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
On Wed, 1 Jul 2020 15:58:57 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander <magnus@hagander.net> wrote:
> > As far as I've seen, the one thing that people have problems with in the
> > exclusive mode backups are precisely the fact that they have to keep a
> > persistent conneciton open, and thus it cannot work together with backup
> > software that is limited to only supporting running a pre- and a post
> > script.
> >
> > Something like I have suggested here is to solve *that* problem. I don't
> > think anybody actually explicitly wants "exclusive backups" -- they want a
> > backup solution that plugs into their world of pre/post scripts. And if we
> > can make that one work in a safer way than the current exclusive backups,
> > ohw is that not an improvement?
>
> Yeah, I guess that's a pretty fair point. I have to confess to having
> somewhat limited enthusiasm for adding a third mode here, but it might
> be worth it.
>
> It seems pretty well inevitable to me that people are going to forget
> to end them.
There's so many way an admin could break their backup process and not notice
it. Even with current nonexclusive backup, a bad written script might creates
bad unrecoverable backups.
Definitely. And the scariest part of them is that they all work fine in testing...
In regard with your concern, monitoring in progress backups might be *one*
answer, from server side. This was "easy" with exclusive backup, we just
monitor the backup_label age and warn if it is older than expected [1]. Using
Yeah, but having a monitoring point that will crash your database on restart isn't very safe :)
non-exclusive backup...this is not that easy anymore. And pg_is_in_backup() is
quite misleading if the admin found it without reading doc. Maybe an admin
Yeah, as it is now it should really be called pg_is_in_exclusive_backup().
function to list in progress non-exclusive backup and related backend pid might
be a good start?
I think it would make perfect sense to show manual (exclusive or non-exclusive) base backups in pg_stat_progress_basebackup. There's no fundamental need that one should only be for base backups taken with pg_basebackup. In fact, I'd argue that's a mistake in the view in v13 that it does not include this information.
It could have "phase" set to "manual non-exclusive" for example, and leave the other fields at NULL.
I guess in theory even for exclusive ones, with the pid column set to NULL. (As Stephen mentioned at some point in the future we might also want to extend it to allow these tools to report their progress as well into it, probably by just calling an admin function on the connection that they already have).
Tools like pg_basebackup (and probably pgbackrest/barman to some extends) still
need the backup to abort on disconnection. Maybe it could flag its session
using the replication protocol or call a new admin function or load a hook on
session-shutdown to keep previous API behavior?
Suddenly requiring a replication protocol connection for one thing, when all their other things are done over a normal one, seems really terrible. But having an admin function would work.
But anything requiring loading of a hook is going to raise the bar massively as suddenly somebody needs to install a compiled binary on the server in order to use it. But calling a separate function is pretty much what I suggested upthread (except I suggested a new version of pg_stop_backup, but implementation wise)
But I'm not sure what previous API behavior you're looking to preserve here?
On Thu, Jul 2, 2020 at 10:35 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2020-07-01 at 16:46 -0400, Stephen Frost wrote:
> I continue to find it curious that we stress a great deal over (very
> likely) poorly written backup scripts that haven't been updated in the
> 5+? years since exclusive mode was deprecated, but we happily add new
> keywords in new major versions and remove columns in our catalog tables
> or rename columns or entirely redo how recovery works (breaking every
> script out there that performed a restore..) or do any number of other
> things that potentially break applications that communicate through the
> PG protocol and other parts of the system that very likely have scripts
> that were written to work with them.
>
> I'm really of the opinion that we need to stop doing that.
>
> If someone could explain what is so special about *this* part of the
> system that we absolutely can't possibly accept any change that might
> break user's scripts, and why it's worth all of the angst, maintenance,
> ridiculously difficult documentation to understand and hacks (the
> interface to pg_start_backup is ridiculously warty because of this), I'd
> greatly appreciate it.
Easy. Lots of people write backup scripts, and fewer people write
complicated pg_catalog queries. We would make way more users unhappy.
It's not about complicated catalog queries.
For example, in PostgreSQL 13 we break what I would say is the vast majority of every query against pg_stat_statements, by renaming those fields. Yet the old ones were never deprecated, and certainly not for 5+ years. And I certainly *hope* we have more users who query pg_stat_statements than write their own backup solutions.
In PostgreSQL 12, we broke every single restore script by moving backup configuration into recovery.conf. Yet the old ones were never deprecated, and certainly not for 5+ years.
Or for PostgreSQL 10, we renamed pg_xlog to pg_wal, and pg_log to log, both which would break the vast majority of existing backup scripts. Yet the old ones were never deprecated, and certainly not for 5+ years.
In general, we have made backwards incompatible changes in almost every release of PostgreSQL. And the vast majority of them have *not* been deprecated for years ahead of making the change -- we just killed them.
So I do agree with Stephen on this -- it seems we are holding this old and if not broken then at least very *fragile* API to a *very* different set of standards than our other functionality.
All backup scripts had to be rewritten in 10. All restore scripts had to be rewritten in 12. What's so special about requiring backup scripts to be rewritten again in 14?
Now, I understand it if we were removing functionality that you could not rewrite to make things work. Heck, I can even at least partially understand that non-exclusive backup mode is too *difficult* for some people to use. But the currently suggested changes are there specifically to make it as easy to use as the deprecated one, yet a lot safer.
Besides, there is nothing "poorly written" about a backup script using
exclusive backup as long as the user is aware that there could be some
small degree of trouble in case of a crash.
It's not a poorly written script, but it *is* a poorly designed API. But if we decide we can never fix things that we have designed poorly, then what other parts of our development should we stop?
(And no, I don't agree it's a "small degree of trouble". At all. Because usually you end up with a total system outage sometime in the middle of the night (when most people run their backups), blocking end user access, waking up on-call personnel, and dealing with a pretty damn major incident)
But one instance where we made users unhappy is not justification
for making them unhappy again.
This is true.
So which users will actually be unhappy about this?
Yes, they have to make small updates to their scripts. Again, if they don't want that, they can *never* upgrade PostgreSQL, because we make such changes in *every* major release. Those are small, and those are trivial (with the suggested changes, not today). In return, they get a database that won't crash on them if something happens during backup.
On 7/1/20 5:44 PM, Magnus Hagander wrote: > On Wed, Jul 1, 2020 at 11:08 PM David Steele <david@pgmasters.net > <mailto:david@pgmasters.net>> wrote: > > But yeah, it would be possible to kill somebody else's session with > some > finagling. Still, worse case would be an error'd backup rather than a > corrupt one. > > What about the case of: > Session A - start backup > Session B - stop backup (but A is still running of course) > Session C - start backup > Session A - stop backup > > At this point, session A can still stop the backup because there is one > running -- but there has been time in between the two when no backup was > running. That could lead to Session A getting a corrupt backup, I think > -- unless we pass some unique identifier back in pg_stop_backup that > matches it up. (And if we do pass that up, then session B running > pg_stop_backup() would fail, thus leaving the backup started by A still > running. This is fine because the min start LSN would have been advanced after B stopped. When A tries to stop the min start LSN will be later than its start LSN so it will error. It might be easier/better to just keep the one exclusive slot in shared memory and store the backup label in it. We only allow one exclusive backup now so it wouldn't be a loss in functionality. None of this really solves the problem of what happens when the user dumps the backup_label into the data directory. With traditional backup software that's pretty much going to be the only choice. Is telling them not to do it and washing our hands of it really enough? In particular, I'm worried about the logic in postmaster.c that would be removed if we no longer save the backup_label explicitly during an exclusive backup. If backup_label is no longer removed on a clean shutdown it seems we'll just make the situation worse. Regards, -- -David david@pgmasters.net
On Thu, 2 Jul 2020 12:32:14 +0200 Magnus Hagander <magnus@hagander.net> wrote: [...] > > non-exclusive backup...this is not that easy anymore. And > > pg_is_in_backup() is quite misleading if the admin found it without reading > > doc. Maybe an admin > > Yeah, as it is now it should really be called pg_is_in_exclusive_backup(). +1 We end up to the same conclusion while discussing with Gilles Darold this morning. But considering the discussion bellow, this might not be needed anyway. > > function to list in progress non-exclusive backup and related backend pid > > might > > be a good start? > > > > I think it would make perfect sense to show manual (exclusive or > non-exclusive) base backups in pg_stat_progress_basebackup. There's no > fundamental need that one should only be for base backups taken with > pg_basebackup. In fact, I'd argue that's a mistake in the view in v13 that > it does not include this information. > > It could have "phase" set to "manual non-exclusive" for example, and leave > the other fields at NULL. Even in manual non-exclusive we could define some different phase: * waiting for checkpoint to finish * external backup * waiting for wal archiving to finish So maybe exposing the backup label and/or the method might be useful as well? > I guess in theory even for exclusive ones, with the pid column set to NULL. We are discussing removing the mandatory connection during the non-exclusive backup. If it become optional at some point, this PID might be NULL as well... > (As Stephen mentioned at some point in the future we might also want to > extend it to allow these tools to report their progress as well into it, > probably by just calling an admin function on the connection that they > already have). > > > > Tools like pg_basebackup (and probably pgbackrest/barman to some extends) > > still need the backup to abort on disconnection. Maybe it could flag its > > session using the replication protocol or call a new admin function or load > > a hook on session-shutdown to keep previous API behavior? > > Suddenly requiring a replication protocol connection for one thing, when > all their other things are done over a normal one, seems really terrible. Sorry, I misexplained. Adding such flag to the replication protocol was only for pg_basebackup. Other tools would use an admin function to achieve the same goal. > But having an admin function would work. > > But anything requiring loading of a hook is going to raise the bar > massively as suddenly somebody needs to install a compiled binary on the > server in order to use it. Contrib already provide many useful tools. How would it be different from eg. pg_archivecleanup? IIRC, end of session hook was discussed in the past, but I did not follow the thread by the time. But anyway, an admin function would probably make the job as well anyway. > But calling a separate function is pretty much > what I suggested upthread > (except I suggested a new version of pg_stop_backup, but implementation wise) > > But I'm not sure what previous API behavior you're looking to preserve here? Current non-exclusive backup are automatically aborted when the pg_start_backup() session disappear without pg_stop_backup(). I suppose this was the API behavior that Robert is concerned to loose when he is writing about people forgetting to end backups. And now you are talking about Stephen's idea to report the progress of a backup, this kind of API could be used as watchdog as well: if the frontend did not report before some optional kind of (user defined) timeout, abort the backup. We would have a "similar" behavior than current one. To be clear, I suppose these admin functions could be called from any backend session, taking some kind of backup id (eg. the token you were talking about upthread). Regards,
Greetings, * Laurenz Albe (laurenz.albe@cybertec.at) wrote: > On Wed, 2020-07-01 at 16:46 -0400, Stephen Frost wrote: > > If someone could explain what is so special about *this* part of the > > system that we absolutely can't possibly accept any change that might > > break user's scripts, and why it's worth all of the angst, maintenance, > > ridiculously difficult documentation to understand and hacks (the > > interface to pg_start_backup is ridiculously warty because of this), I'd > > greatly appreciate it. > > Easy. Lots of people write backup scripts, and fewer people write > complicated pg_catalog queries. We would make way more users unhappy. And no one writes scripts to do restores? No, I don't believe that (also don't believe that no one writes scripts or queries that use various catalog tables). Also, we certainly do get complaints from time to time from people who got hit by a catalog change, but it's usually "this stopped working" and you tell them "do this instead" and they're happy to update their code. That's why we have major versions, and why we have back-branches and why we support a given major version for 5 years. > You have a point about reworking the way recovery works: after teaching > several classes and running into oddities with recovery parameters in > postgresql.conf, I am not sure if that was a good move. If it was a good idea or bad idea is really pretty independent of the point I was trying to make: we *did* completely change the restore API without any deprecation period for the old one or period of time where we supported both or any of the foolishness that we've been doing for the backup API for the past however many years. Perhaps the mob of users who are going to complain about that change hasn't decided to upgrade to v12 quite yet, but I doubt it. I'll say it again- I really think we need to stop this special case where we've decided that we can't possibly make a breaking change to the backup API even across major versions. We don't do that, evidently, in hardly any other cases and it's an altogether good thing- we don't have ridiculously complicated restore documentation that has to explain two completely different ways of doing a restore, nor is the code complicated by having different ways, etc.. Thanks, Stephen
Attachment
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Thu, Jul 2, 2020 at 1:06 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> > wrote: > > function to list in progress non-exclusive backup and related backend pid > > might > > be a good start? > > I think it would make perfect sense to show manual (exclusive or > non-exclusive) base backups in pg_stat_progress_basebackup. There's no > fundamental need that one should only be for base backups taken with > pg_basebackup. In fact, I'd argue that's a mistake in the view in v13 that > it does not include this information. I agree entirely that it was a mistake in v13 to not include this- and to not include a way for other backup tools to report their progress. This is a good example of why we really need an in-core non-streamed backup capability, imv, because if we don't we end up with things like this that are just thinking about streaming basebackups. We also have no in-core code that is user-facing that exercises the low-level backup API. > It could have "phase" set to "manual non-exclusive" for example, and leave > the other fields at NULL. Yeah. > I guess in theory even for exclusive ones, with the pid column set to NULL. > (As Stephen mentioned at some point in the future we might also want to > extend it to allow these tools to report their progress as well into it, > probably by just calling an admin function on the connection that they > already have). Right, that wouldn't have been hard to include and would have been quite nice. Thanks, Stephen
Attachment
On 7/2/20 7:12 AM, David Steele wrote: > > None of this really solves the problem of what happens when the user > dumps the backup_label into the data directory. With traditional backup > software that's pretty much going to be the only choice. Is telling them > not to do it and washing our hands of it really enough? After some thought I believe there are several recommendations we could give in the docs about what to do with the backup_label file. This patch, though, looks to be a dead end. I think the best thing to do is withdraw the patch and end the thread. I'll start a new thread when I have a patch/design that implements the new ideas we have discussed. Regards, -- -David david@pgmasters.net