Thread: Add recovery to pg_control and remove backup_label
Hackers,

This was originally proposed in [1] but that thread went through a number of different proposals so it seems better to start anew.

The basic idea here is to simplify and harden recovery by getting rid of backup_label and storing recovery information directly in pg_control. Instead of backup software copying pg_control from PGDATA, it stores an updated version that is returned from pg_backup_stop(). I believe this is better for the following reasons:

* The user can no longer remove backup_label and get what looks like a successful restore (while almost certainly causing corruption). If pg_control is removed the cluster will not start. The user may try pg_resetwal, but I think that tool makes it pretty clear that corruption will result from its use. We could also modify pg_resetwal to complain if recovery info is present in pg_control.

* We don't need to worry about backup software seeing a torn copy of pg_control, since Postgres can safely read it out of memory and provide a valid copy via pg_backup_stop(). This solves [2] without needing to write pg_control via a temp file, which may affect performance on a standby. Unfortunately, this solution cannot be back-patched.

* For backup from standby, we no longer need to instruct the backup software to copy pg_control last. In fact the backup software should not copy pg_control from PGDATA at all.

Since backup_label is now gone, the fields that used to be in backup_label are now provided as columns returned from pg_backup_start() and pg_backup_stop(), and the backup history file is still written to the archive. For pg_basebackup we would have the option of writing the fields into the JSON manifest, storing them in a file (e.g. backup.info), or just ignoring them. None of the fields are required for recovery but backup software may be very interested in them.

I updated pg_rewind but I'm not very confident in the tests. When I removed backup_label processing, but before I updated pg_rewind to write recovery info into pg_control, all the rewind tests passed.

This patch highlights the fact that we still have no tests for the low-level backup method. I modified pgBackRest to work with this patch and the entire test suite ran without any issues, but in-core tests would be good to have. I'm planning to work on those myself as a separate patch.

This patch would also make the proposal in [3] obsolete since there is no need to rename backup_label if it is gone.

I know that outputting pg_control as bytea is going to be a bit controversial. Software that is using psql to run pg_backup_stop() could use encode() to get pg_control as text and then decode it later. Alternately, we could update ReadControlFile() to recognize a base64-encoded pg_control file. I'm not sure dealing with binary data is that much of a problem, though, and if the backup software gets it wrong then recovery will fail on an invalid pg_control file.

Lastly, I think there are improvements to be made in recovery that go beyond this patch. I originally set out to load the recovery info into *just* the existing fields in pg_control but it required so many changes to recovery that I decided it was too dangerous to do all in one patch. This patch very much takes the "backup_label in pg_control" approach, though I reused fields where possible. The added fields, e.g. backupRecoveryRequested, also allow us to keep the user experience pretty much the same in terms of messages and errors.

Thoughts?

Regards,
-David

[1] https://postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net
[2] https://postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[3] https://postgresql.org/message-id/eb3d1aae-1a75-bcd3-692a-38729423168f%40pgmasters.net
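[Editor's note: a minimal sketch of the client-side handling described above, for backup tools driving pg_backup_stop() through psql. The server-side call `encode(pg_control, 'base64')` is the approach the mail suggests; the helper name and the stand-in data below are invented for illustration. PostgreSQL's encode(..., 'base64') wraps its output at 76 characters, which the whitespace stripping here tolerates.]

```python
import base64

def decode_pg_control(encoded_text: str) -> bytes:
    """Turn base64 text captured from psql back into binary pg_control.

    psql output of encode(..., 'base64') contains newlines every 76
    characters, so strip all whitespace before decoding.
    """
    return base64.b64decode("".join(encoded_text.split()))

# Simulate the server side: stand-in binary content the size of a real
# pg_control file (8 KiB), encoded the way encode(..., 'base64') would be.
fake_pg_control = bytes(range(256)) * 32
wire_text = base64.encodebytes(fake_pg_control).decode("ascii")

# The backup tool decodes the captured text and writes the binary file
# into the backup instead of copying pg_control from PGDATA.
restored = decode_pg_control(wire_text)
assert restored == fake_pg_control
```

If the tool gets the decoding wrong, the resulting pg_control will fail its CRC check at recovery, which is the failure mode the mail considers acceptable.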
Attachment
On Thu, Oct 26, 2023 at 2:02 PM David Steele <david@pgmasters.net> wrote:
Hackers,
This was originally proposed in [1] but that thread went through a
number of different proposals so it seems better to start anew.
The basic idea here is to simplify and harden recovery by getting rid of
backup_label and storing recovery information directly in pg_control.
Instead of backup software copying pg_control from PGDATA, it stores an
updated version that is returned from pg_backup_stop(). I believe this
is better for the following reasons:
* The user can no longer remove backup_label and get what looks like a
successful restore (while almost certainly causing corruption). If
pg_control is removed the cluster will not start. The user may try
pg_resetwal, but I think that tool makes it pretty clear that corruption
will result from its use. We could also modify pg_resetwal to complain
if recovery info is present in pg_control.
* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide
a valid copy via pg_backup_stop(). This solves [2] without needing to
write pg_control via a temp file, which may affect performance on a
standby. Unfortunately, this solution cannot be back patched.
Are we planning on dealing with torn writes in the back branches in some way or are we just throwing in the towel and saying the old method is too error-prone to exist/retain and therefore the goal of the v17 changes is to not only provide a better way but also to ensure the old way no longer works? It seems sufficient to change the output signature of pg_backup_stop to accomplish that goal though I am pondering whether an explicit check and error for seeing the backup_label file would be warranted.
If we are going to solve the torn writes problem completely then while I agree the new way is superior, implementing it doesn't have to mean existing tools built to produce backup_label and rely upon the pg_control in the data directory need to be forcibly broken.
I know that outputting pg_control as bytea is going to be a bit
controversial. Software that is using psql to run pg_backup_stop()
could use encode() to get pg_control as text and then decode it later.
Alternately, we could update ReadControlFile() to recognize a
base64-encoded pg_control file. I'm not sure dealing with binary data is
that much of a problem, though, and if the backup software gets it wrong
then recovery will fail on an invalid pg_control file.
Can we not figure out some way to place the relevant files onto the server somewhere so that a simple "cp" command would work? Have pg_backup_stop return paths instead of contents, those paths being "$TEMP_DIR"/<random unique new directory>/pg_control.conf (and tablespace_map)
David J.
On 10/26/23 17:27, David G. Johnston wrote:
> On Thu, Oct 26, 2023 at 2:02 PM David Steele <david@pgmasters.net
> <mailto:david@pgmasters.net>> wrote:
>
> Are we planning on dealing with torn writes in the back branches in some
> way or are we just throwing in the towel and saying the old method is
> too error-prone to exist/retain

We are still planning to address this issue in the back branches.

> and therefore the goal of the v17
> changes is to not only provide a better way but also to ensure the old
> way no longer works? It seems sufficient to change the output signature
> of pg_backup_stop to accomplish that goal though I am pondering whether
> an explicit check and error for seeing the backup_label file would be
> warranted.

Well, if the backup tool is just copying the second column of output to the backup_label, then it won't break. Of course in that case, restores won't work correctly but you would not get an error. Testing would show that it is not working properly and backup tools should certainly be tested.

Even so, I'm OK with an explicit check for backup_label. Let's see what others think.

> If we are going to solve the torn writes problem completely then while I
> agree the new way is superior, implementing it doesn't have to mean
> existing tools built to produce backup_label and rely upon the
> pg_control in the data directory need to be forcibly broken.

It is a pretty easy update to any backup software that supports non-exclusive backup. I was able to make the changes to pgBackRest in less than an hour.

We've made major changes to backup and restore in almost every major version of PostgreSQL for a while: non-exclusive backup in 9.6, dir renames in 10, variable WAL size in 11, new recovery location in 12, hard recovery target errors in 13, and changes to non-exclusive backup and removal of exclusive backup in 15. In 17 we are already looking at new page and segment sizes.
> I know that outputting pg_control as bytea is going to be a bit
> controversial. Software that is using psql to run pg_backup_stop()
> could use encode() to get pg_control as text and then decode it later.
> Alternately, we could update ReadControlFile() to recognize a
> base64-encoded pg_control file. I'm not sure dealing with binary data is
> that much of a problem, though, and if the backup software gets it wrong
> then recovery will fail on an invalid pg_control file.
>
> Can we not figure out some way to place the relevant files onto the
> server somewhere so that a simple "cp" command would work? Have
> pg_backup_stop return paths instead of contents, those paths being
> "$TEMP_DIR"/<random unique new directory>/pg_control.conf (and
> tablespace_map)

Nobody has been able to figure this out, and some of us have been thinking about it for years. It just doesn't seem possible to reliably tell the difference between a cluster that was copied and one that simply crashed.

If cp is really the backup tool being employed, I would recommend using pg_basebackup. cp has flaws that could lead to corruption, and of course does not at all take into account the archive required to make a backup consistent, directories to be excluded, the order of copying pg_control on backup from standby, etc., etc. Backup/restore is not a simple endeavor and we don't do anyone favors pretending that it is.

Regards,
-David
On Fri, Oct 27, 2023 at 7:10 AM David Steele <david@pgmasters.net> wrote:
On 10/26/23 17:27, David G. Johnston wrote:
> Can we not figure out some way to place the relevant files onto the
> server somewhere so that a simple "cp" command would work? Have
> pg_backup_stop return paths instead of contents, those paths being
> "$TEMP_DIR"/<random unique new directory>/pg_control.conf (and
> tablespace_map)
Nobody has been able to figure this out, and some of us have been
thinking about it for years. It just doesn't seem possible to reliably
tell the difference between a cluster that was copied and one that
simply crashed.
If cp is really the backup tool being employed, I would recommend using
pg_basebackup. cp has flaws that could lead to corruption, and of course
does not at all take into account the archive required to make a backup
consistent, directories to be excluded, the order of copying pg_control
on backup from standby, etc., etc.
Let me modify that to make it a bit more clear, I actually wouldn't care if pg_backup_end outputs an entire binary pg_control file as part of the SQL resultset.
My proposal would be to, in addition, place in the temporary directory on the server, Postgres-written versions of pg_control and tablespace_map as part of the pg_backup_end processing. The client software would then have a choice. Write the contents of the SQL resultset to newly created binary mode files in the destination, or, copy the server-written files from the temporary directory to the destination.
That said, I'm starting to dislike that idea myself. It only really makes sense if the files could be placed in the data directory but that isn't doable given concurrent backups and not wanting to place the source server into an inconsistent state.
David J.
On 10/27/23 13:45, David G. Johnston wrote:
>
> Let me modify that to make it a bit more clear, I actually wouldn't care
> if pg_backup_end outputs an entire binary pg_control file as part of the
> SQL resultset.
>
> My proposal would be to, in addition, place in the temporary directory
> on the server, Postgres-written versions of pg_control and
> tablespace_map as part of the pg_backup_end processing. The client
> software would then have a choice. Write the contents of the SQL
> resultset to newly created binary mode files in the destination, or,
> copy the server-written files from the temporary directory to the
> destination.
>
> That said, I'm starting to dislike that idea myself. It only really
> makes sense if the files could be placed in the data directory but that
> isn't doable given concurrent backups and not wanting to place the
> source server into an inconsistent state.

Pretty much the conclusion I have come to myself over the years.

Regards,
-David
Rebased on 151ffcf6.
Attachment
On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:
> We are still planning to address this issue in the back branches.

FWIW, redesigning the backend code in charge of doing base backups in the back branches is out of scope. Based on a read of the proposed patch, it includes catalog changes which would require a catversion bump, so that's not going to work anyway.
--
Michael
Attachment
On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:
> Rebased on 151ffcf6.

I like this patch a lot. Even if the backup_label file is removed, we still have all the debug information from the backup history file, thanks to its LABEL, BACKUP METHOD and BACKUP FROM, so no information is lost. It does a 1:1 replacement of the contents parsed from the backup_label needed by recovery by fetching them from the control file. Sounds like a straight-forward change to me.

The patch is failing the recovery test 039_end_of_wal.pl. Could you look at the failure?

 /* Build and save the contents of the backup history file */
-history_file = build_backup_content(state, true);
+history_file = build_backup_content(state);

build_backup_content() sounds like an incorrect name if it is a routine only used to build the contents of backup history files.

Why is there nothing updated in src/bin/pg_controldata/?

+ /* Clear fields used to initialize recovery */
+ ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+ ControlFile->backupStartPointTLI = 0;
+ ControlFile->backupRecoveryRequired = false;
+ ControlFile->backupFromStandby = false;

These variables in the control file are cleaned up when the backup_label file was read previously, but backup_label is renamed to backup_label.old a bit later than that. Your logic looks correct seen from here, but shouldn't these variables be set much later, aka just *after* UpdateControlFile()? This gap between the initialization of the control file and the in-memory reset makes the code quite brittle, IMO.

- basebackup_progress_wait_wal_archive(&state);
- do_pg_backup_stop(backup_state, !opt->nowait);

Why is that moved?

- The backup label
- file includes the label string you gave to <function>pg_backup_start</function>,
- as well as the time at which <function>pg_backup_start</function> was run, and
- the name of the starting WAL file. In case of confusion it is therefore
- possible to look inside a backup file and determine exactly which
- backup session the dump file came from. The tablespace map file includes
+ The tablespace map file includes

It may be worth mentioning that the backup history file holds this information on the primary's pg_wal, as well.

The changes in sendFileWithContent() may be worth a patch of its own.

--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
@@ -160,14 +163,25 @@ typedef struct ControlFileData
 XLogRecPtr minRecoveryPoint;
 TimeLineID minRecoveryPointTLI;
+ XLogRecPtr backupCheckPoint;
 XLogRecPtr backupStartPoint;
+ TimeLineID backupStartPointTLI;
 XLogRecPtr backupEndPoint;
+ bool backupRecoveryRequired;
+ bool backupFromStandby;

This increases the size of the control file from 296B to 312B with an 8-byte alignment, as far as I can see. The size of the control file has always been a sensitive subject, especially with the hard limit of PG_CONTROL_MAX_SAFE_SIZE. Well, the point of this patch is that this is the price to pay to prevent users from doing something stupid with a removal of a backup_label when they should not. Do others have an opinion about this increase in size?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should reduce the size further with some alignment magic, no?

- /*
-  * BACKUP METHOD lets us know if this was a typical backup ("streamed",
-  * which could mean either pg_basebackup or the pg_backup_start/stop
-  * method was used) or if this label came from somewhere else (the only
-  * other option today being from pg_rewind). If this was a streamed
-  * backup then we know that we need to play through until we get to the
-  * end of the WAL which was generated during the backup (at which point we
-  * will have reached consistency and backupEndRequired will be reset to be
-  * false).
-  */
- if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
- {
-     if (strcmp(backuptype, "streamed") == 0)
-         *backupEndRequired = true;
- }

backupRecoveryRequired in the control file is switched to false for pg_rewind and true for streamed backups. My gut feeling is telling me that this should be OK, as out-of-core tools would need an upgrade if they relied on the backup_label file anyway. I can see that this change makes us lose some documentation, unfortunately. Shouldn't these removed lines be moved to pg_control.h instead for the description of backupEndRequired?

doc/src/sgml/ref/pg_rewind.sgml and src/backend/access/transam/xlogrecovery.c still include references to the backup_label file.
--
Michael
Attachment
On 11/6/23 01:05, Michael Paquier wrote:
> On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:
>> We are still planning to address this issue in the back branches.
>
> FWIW, redesigning the backend code in charge of doing base backups in
> the back branches is out of scope. Based on a read of the proposed
> patch, it includes catalog changes which would require a catversion
> bump, so that's not going to work anyway.

I did not mean this patch -- rather some variation of what Thomas has been working on, more than likely.

Regards,
-David
On 11/6/23 02:35, Michael Paquier wrote:
> On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:
>> Rebased on 151ffcf6.
>
> I like this patch a lot. Even if the backup_label file is removed, we
> still have all the debug information from the backup history file,
> thanks to its LABEL, BACKUP METHOD and BACKUP FROM, so no information
> is lost. It does a 1:1 replacement of the contents parsed from the
> backup_label needed by recovery by fetching them from the control
> file. Sounds like a straight-forward change to me.

That's the plan, at least!

> The patch is failing the recovery test 039_end_of_wal.pl. Could you
> look at the failure?

I'm not seeing this failure, and CI seems happy [1]. Can you give details of the error message?

> /* Build and save the contents of the backup history file */
> - history_file = build_backup_content(state, true);
> + history_file = build_backup_content(state);
>
> build_backup_content() sounds like an incorrect name if it is a
> routine only used to build the contents of backup history files.

Good point, I have renamed this to build_backup_history_content().

> Why is there nothing updated in src/bin/pg_controldata/?

Oops, added.

> + /* Clear fields used to initialize recovery */
> + ControlFile->backupCheckPoint = InvalidXLogRecPtr;
> + ControlFile->backupStartPointTLI = 0;
> + ControlFile->backupRecoveryRequired = false;
> + ControlFile->backupFromStandby = false;
>
> These variables in the control file are cleaned up when the
> backup_label file was read previously, but backup_label is renamed to
> backup_label.old a bit later than that. Your logic looks correct seen
> from here, but shouldn't these variables be set much later, aka just
> *after* UpdateControlFile(). This gap between the initialization of
> the control file and the in-memory reset makes the code quite brittle,
> IMO.

If we set these fields where backup_label was renamed, the logic would not be exactly the same since pg_control won't be updated until the next time through the loop. Since the fields should be updated before UpdateControlFile() I thought it made sense to keep all the updates together.

Overall I think it is simpler, and we don't need to acquire a lock on ControlFile.

> - basebackup_progress_wait_wal_archive(&state);
> - do_pg_backup_stop(backup_state, !opt->nowait);
>
> Why is that moved?

do_pg_backup_stop() generates the updated pg_control so it needs to run before we transmit pg_control.

> - The backup label
> - file includes the label string you gave to <function>pg_backup_start</function>,
> - as well as the time at which <function>pg_backup_start</function> was run, and
> - the name of the starting WAL file. In case of confusion it is therefore
> - possible to look inside a backup file and determine exactly which
> - backup session the dump file came from. The tablespace map file includes
> + The tablespace map file includes
>
> It may be worth mentioning that the backup history file holds this
> information on the primary's pg_wal, as well.

OK, reworded.

> The changes in sendFileWithContent() may be worth a patch of its own.

Thomas included this change in his pg_basebackup changes so I did the same. Maybe wait a bit before we split this out? Seems like a pretty small change...

> --- a/src/include/catalog/pg_control.h
> +++ b/src/include/catalog/pg_control.h
> @@ -146,6 +146,9 @@ typedef struct ControlFileData
> @@ -160,14 +163,25 @@ typedef struct ControlFileData
>  XLogRecPtr minRecoveryPoint;
>  TimeLineID minRecoveryPointTLI;
> + XLogRecPtr backupCheckPoint;
>  XLogRecPtr backupStartPoint;
> + TimeLineID backupStartPointTLI;
>  XLogRecPtr backupEndPoint;
> + bool backupRecoveryRequired;
> + bool backupFromStandby;
>
> This increases the size of the control file from 296B to 312B with an
> 8-byte alignment, as far as I can see. The size of the control file
> has always been a sensitive subject, especially with the hard limit of
> PG_CONTROL_MAX_SAFE_SIZE. Well, the point of this patch is that this
> is the price to pay to prevent users from doing something stupid with
> a removal of a backup_label when they should not. Do others have an
> opinion about this increase in size?
>
> Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
> reduce the size further with some alignment magic, no?

I thought about this, but it seemed to me that existing fields had been positioned to make the grouping logical rather than to optimize alignment, e.g. minRecoveryPointTLI. Ideally that would have been placed near backupEndRequired (or vice versa). But if the general opinion is to rearrange for alignment, I'm OK with that.

> backupRecoveryRequired in the control file is switched to false for
> pg_rewind and true for streamed backups. My gut feeling is telling me
> that this should be OK, as out-of-core tools would need an upgrade if
> they relied on the backup_label file anyway. I can see that this
> change makes us lose some documentation, unfortunately. Shouldn't
> these removed lines be moved to pg_control.h instead for the
> description of backupEndRequired?

Updated description in pg_control.h -- it's a bit vague but I'm not sure it is a good idea to get into the inner workings of pg_rewind here?

> doc/src/sgml/ref/pg_rewind.sgml and
> src/backend/access/transam/xlogrecovery.c still include references to
> the backup_label file.

Fixed.

Attached is a new patch based on 18b585155.

Regards,
-David

[1] https://cirrus-ci.com/build/4939808120766464
Attachment
On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
> On 11/6/23 02:35, Michael Paquier wrote:
>> The patch is failing the recovery test 039_end_of_wal.pl. Could you
>> look at the failure?
>
> I'm not seeing this failure, and CI seems happy [1]. Can you give details of
> the error message?

I've retested today, and miss the failure. I'll let you know if I see this again.

>> + /* Clear fields used to initialize recovery */
>> + ControlFile->backupCheckPoint = InvalidXLogRecPtr;
>> + ControlFile->backupStartPointTLI = 0;
>> + ControlFile->backupRecoveryRequired = false;
>> + ControlFile->backupFromStandby = false;
>>
>> These variables in the control file are cleaned up when the
>> backup_label file was read previously, but backup_label is renamed to
>> backup_label.old a bit later than that. Your logic looks correct seen
>> from here, but shouldn't these variables be set much later, aka just
>> *after* UpdateControlFile(). This gap between the initialization of
>> the control file and the in-memory reset makes the code quite brittle,
>> IMO.

Yeah, sorry, that's a thinko from me here. I meant to reset these variables just before the UpdateControlFile() done after InitWalRecovery(), much closer to it.

> If we set these fields where backup_label was renamed, the logic would not
> be exactly the same since pg_control won't be updated until the next time
> through the loop. Since the fields should be updated before
> UpdateControlFile() I thought it made sense to keep all the updates
> together.
>
> Overall I think it is simpler, and we don't need to acquire a lock on
> ControlFile.

What you are proposing is the same as what we already do for backupEndRequired or backupStartPoint in the control file when initializing recovery, so objection withdrawn.

> Thomas included this change in his pg_basebackup changes so I did the same.
> Maybe wait a bit before we split this out? Seems like a pretty small
> change...

Seems like a pretty good argument for refactoring that now, and letting any other patches rely on it. Would you like to send a separate patch?

>> Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
>> reduce the size further with some alignment magic, no?
>
> I thought about this, but it seemed to me that existing fields had been
> positioned to make the grouping logical rather than to optimize alignment,
> e.g. minRecoveryPointTLI. Ideally that would have been placed near
> backupEndRequired (or vice versa). But if the general opinion is to
> rearrange for alignment, I'm OK with that.

I've not tested, but it looks like moving backupStartPointTLI after backupEndPoint should shave 8 bytes, if you want to maintain a more coherent group for the LSNs.
--
Michael
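[Editor's note: the padding arithmetic behind Michael's "shave 8 bytes" estimate can be checked with a quick sketch. This uses Python's ctypes as a stand-in for the C ABI on a typical 64-bit platform where 8-byte fields are 8-byte aligned; the struct names and the field subset are invented for the demo, with XLogRecPtr modeled as uint64 and TimeLineID as uint32.]

```python
import ctypes

class AsPatched(ctypes.Structure):
    # backupStartPointTLI (4 bytes) sits between two 8-byte LSNs, so it is
    # padded out to a full 8-byte slot, as is minRecoveryPointTLI above it.
    _fields_ = [
        ("minRecoveryPoint", ctypes.c_uint64),
        ("minRecoveryPointTLI", ctypes.c_uint32),
        ("backupCheckPoint", ctypes.c_uint64),
        ("backupStartPoint", ctypes.c_uint64),
        ("backupStartPointTLI", ctypes.c_uint32),
        ("backupEndPoint", ctypes.c_uint64),
    ]

class Regrouped(ctypes.Structure):
    # Pairing the two 4-byte TimeLineIDs lets them share one 8-byte slot,
    # eliminating one padding hole.
    _fields_ = [
        ("minRecoveryPoint", ctypes.c_uint64),
        ("minRecoveryPointTLI", ctypes.c_uint32),
        ("backupStartPointTLI", ctypes.c_uint32),
        ("backupCheckPoint", ctypes.c_uint64),
        ("backupStartPoint", ctypes.c_uint64),
        ("backupEndPoint", ctypes.c_uint64),
    ]

print(ctypes.sizeof(AsPatched), ctypes.sizeof(Regrouped))
```

On an LP64 ABI the regrouped layout is 8 bytes smaller, matching the estimate; the same effect can be had by placing backupStartPointTLI after backupEndPoint, next to the bool fields.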
Attachment
On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:
> On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
> I've retested today, and miss the failure. I'll let you know if I see
> this again.

I've done a few more dozen runs, and still nothing. I am wondering what this disturbance was.

>> If we set these fields where backup_label was renamed, the logic would not
>> be exactly the same since pg_control won't be updated until the next time
>> through the loop. Since the fields should be updated before
>> UpdateControlFile() I thought it made sense to keep all the updates
>> together.
>>
>> Overall I think it is simpler, and we don't need to acquire a lock on
>> ControlFile.
>
> What you are proposing is the same as what we already do for
> backupEndRequired or backupStartPoint in the control file when
> initializing recovery, so objection withdrawn.
>
>> Thomas included this change in his pg_basebackup changes so I did the same.
>> Maybe wait a bit before we split this out? Seems like a pretty small
>> change...
>
> Seems like a pretty good argument for refactoring that now, and letting
> any other patches rely on it. Would you like to send a separate
> patch?

The split still looks worth doing seen from here, so I am switching the patch as WoA for now.

>>> Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
>>> reduce the size further with some alignment magic, no?
>>
>> I thought about this, but it seemed to me that existing fields had been
>> positioned to make the grouping logical rather than to optimize alignment,
>> e.g. minRecoveryPointTLI. Ideally that would have been placed near
>> backupEndRequired (or vice versa). But if the general opinion is to
>> rearrange for alignment, I'm OK with that.
>
> I've not tested, but it looks like moving backupStartPointTLI after
> backupEndPoint should shave 8 bytes, if you want to maintain a more
> coherent group for the LSNs.

+ * backupFromStandby indicates that the backup was taken on a standby. It is
+ * require to initialize recovery and set to false afterwards.

s/require/required/.

The term "backup recovery" is one we've never used in the tree until now, as far as I know. Perhaps this recovery method should just be referred to as "recovery from backup"?

By the way, there is another thing that this patch has forgotten: the SQL functions that display data from the control file. Shouldn't pg_control_recovery() be extended with the new fields? These fields may be less critical than the other ones related to recovery, but I suspect that showing them can become handy at least for debugging and monitoring purposes.

Something in this area is that backupRecoveryRequired is the switch controlling whether the other fields are used by the recovery initialization. Could it be actually useful to leave the other fields as they are and only reset backupRecoveryRequired before the first control file update? This would leave a trace of the backup history directly in the control file.

What about pg_resetwal and RewriteControlFile()? Shouldn't these recovery fields be reset as well?

git diff --check is complaining a bit.
--
Michael
Attachment
On 11/10/23 00:37, Michael Paquier wrote:
> On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:
>> On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
>> I've retested today, and miss the failure. I'll let you know if I see
>> this again.
>
> I've done a few more dozen runs, and still nothing. I am wondering
> what this disturbance was.

OK, hopefully it was just a blip.

>>> If we set these fields where backup_label was renamed, the logic would not
>>> be exactly the same since pg_control won't be updated until the next time
>>> through the loop. Since the fields should be updated before
>>> UpdateControlFile() I thought it made sense to keep all the updates
>>> together.
>>>
>>> Overall I think it is simpler, and we don't need to acquire a lock on
>>> ControlFile.
>>
>> What you are proposing is the same as what we already do for
>> backupEndRequired or backupStartPoint in the control file when
>> initializing recovery, so objection withdrawn.
>>
>>> Thomas included this change in his pg_basebackup changes so I did the same.
>>> Maybe wait a bit before we split this out? Seems like a pretty small
>>> change...
>>
>> Seems like a pretty good argument for refactoring that now, and letting
>> any other patches rely on it. Would you like to send a separate
>> patch?
>
> The split still looks worth doing seen from here, so I am switching
> the patch as WoA for now.

This has been split out.

>>>> Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
>>>> reduce the size further with some alignment magic, no?
>>>
>>> I thought about this, but it seemed to me that existing fields had been
>>> positioned to make the grouping logical rather than to optimize alignment,
>>> e.g. minRecoveryPointTLI. Ideally that would have been placed near
>>> backupEndRequired (or vice versa). But if the general opinion is to
>>> rearrange for alignment, I'm OK with that.
>>
>> I've not tested, but it looks like moving backupStartPointTLI after
>> backupEndPoint should shave 8 bytes, if you want to maintain a more
>> coherent group for the LSNs.

OK, I have moved backupStartPointTLI.

> + * backupFromStandby indicates that the backup was taken on a standby. It is
> + * require to initialize recovery and set to false afterwards.
> s/require/required/.

Fixed.

> The term "backup recovery" is one we've never used in the tree until
> now, as far as I know. Perhaps this recovery method should just be
> referred to as "recovery from backup"?

Well, "backup recovery" is less awkward, I think. For instance "backup recovery field" vs "recovery from backup field".

> By the way, there is another thing that this patch has forgotten: the
> SQL functions that display data from the control file. Shouldn't
> pg_control_recovery() be extended with the new fields? These fields
> may be less critical than the other ones related to recovery, but I
> suspect that showing them can become handy at least for debugging and
> monitoring purposes.

I guess that depends on whether we reset them or not (discussion below). Right now they would not be visible since by the time the user could log on they would be reset.

> Something in this area is that backupRecoveryRequired is the switch
> controlling whether the other fields are used by the recovery
> initialization. Could it be actually useful to leave the other fields
> as they are and only reset backupRecoveryRequired before the first
> control file update? This would leave a trace of the backup history
> directly in the control file.

Since the other recovery fields are cleared in ReachedEndOfBackup() this would be a change from what we do now.

None of these fields are ever visible (with the exception of minRecoveryPoint/TLI) since they are reset when the database becomes consistent and before logons are allowed. Viewing them with pg_controldata makes sense, but I'm not sure pg_control_recovery() does.

In fact, are backup_start_lsn, backup_end_lsn, and end_of_backup_record_required ever non-zero when logged onto Postgres? Maybe I'm missing something?

> What about pg_resetwal and RewriteControlFile()? Shouldn't these
> recovery fields be reset as well?

Done.

> git diff --check is complaining a bit.

Fixed.

New patches attached based on eb81e8e790.

Regards,
-David
On Fri, Nov 10, 2023 at 02:55:19PM -0400, David Steele wrote: > On 11/10/23 00:37, Michael Paquier wrote: >> I've done a few more dozen runs, and still nothing. I am wondering >> what this disturbance was. > > OK, hopefully it was just a blip. Still nothing on this side. So that seems really like a random blip in the matrix. > This has been split out. Thanks, applied 0001. >> The term "backup recovery", that we've never used in the tree until >> now as far as I know. Perhaps this recovery method should just be >> referred as "recovery from backup"? > > Well, "backup recovery" is less awkward, I think. For instance "backup > recovery field" vs "recovery from backup field". Not sure. I've never used this term when referring to recovery from a backup. Perhaps I'm just not used to it, still that sounds a bit confusing here. >> Something in this area is that backupRecoveryRequired is the switch >> controlling if the fields set by the recovery initialization. Could >> it be actually useful to leave the other fields as they are and only >> reset backupRecoveryRequired before the first control file update? >> This would leave a trace of the backup history directly in the control >> file. > > Since the other recovery fields are cleared in ReachedEndOfBackup() this > would be a change from what we do now. > > None of these fields are ever visible (with the exception of > minRecoveryPoint/TLI) since they are reset when the database becomes > consistent and before logons are allowed. Viewing them with pg_controldata > makes sense, but I'm not sure pg_control_recovery() does. > > In fact, are backup_start_lsn, backup_end_lsn, and > end_of_backup_record_required ever non-zero when logged onto Postgres? Maybe > I'm missing something? Yeah, but custom backup/restore tools may want to manipulate the contents of the control file for their own work, so at least for the sake of visibility it sounds important to me to show all the information at hand, and that there is no need to.
- The backup label - file includes the label string you gave to <function>pg_backup_start</function>, + The backup history file (which is archived like WAL) includes the label + string you gave to <function>pg_backup_start</function>, as well as the time at which <function>pg_backup_start</function> was run, and the name of the starting WAL file. In case of confusion it is therefore - possible to look inside a backup file and determine exactly which + possible to look inside a backup history file and determine exactly which As a side note, it is a bit disappointing that we lose the backup label from the backup itself, even if the patch changes correctly the documentation to reflect the new behavior. It is in the backup history file on the node from where the base backup has been taken or in the archives, hopefully. However there is nothing that remains in the base backup itself, and backups can be self-contained (easy with pg_basebackup --wal-method=stream). I think that we should retain a minimum amount of information as a replacement for the backup_label, at least. With this state, the current patch slightly reduces the debuggability of deployments. That could be annoying for some users. > New patches attached based on eb81e8e790. Diving into the code for references about the backup label file, I have spotted this log in pg_rewind that is now incorrect: if (showprogress) pg_log_info("creating backup label and updating control file"); + printf(_("Backup start location's timeline: %u\n"), + ControlFile->backupStartPointTLI); printf(_("Backup end location: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->backupEndPoint)); Perhaps these two should be reversed to match with the header file. + /* + * After pg_backup_stop() returns this field will contain a copy of + * pg_control that should be stored with the backup. Fields have been + * updated for recovery and the CRC has been recalculated. 
The buffer + * is padded to PG_CONTROL_MAX_SAFE_SIZE so that pg_control is always + * a consistent size but smaller (and hopefully easier to handle) than + * PG_CONTROL_FILE_SIZE. Bytes after sizeof(ControlFileData) are zeroed. + */ + uint8_t controlFile[PG_CONTROL_MAX_SAFE_SIZE]; I don't mind the addition of a control file with the max safe size, because it will never be higher than that. However: + /* End the backup before sending pg_control */ + basebackup_progress_wait_wal_archive(&state); + do_pg_backup_stop(backup_state, !opt->nowait); + + /* Send copy of pg_control containing recovery info */ + sendFileWithContent(sink, XLOG_CONTROL_FILE, + (char *)backup_state->controlFile, + PG_CONTROL_MAX_SAFE_SIZE, &manifest); It seems to me that the base backup protocol should always send an 8k file for the control file so that we maintain consistency with the on-disk format. Currently, a base backup taken with this patch results in a control file of size 512B. + /* Build the contents of pg_control */ + pg_control_bytea = (bytea *) palloc(PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ); + SET_VARSIZE(pg_control_bytea, PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ); + memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_MAX_SAFE_SIZE); Similar comment for the control file returned by pg_backup_stop(), which could just be made an 8k field? And perhaps emphasize that this file should be an 8kB file in the paragraph mentioning the data returned by pg_backup_stop()?
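To make the size handling concrete, here is a minimal standalone sketch of the padding being asked for here. The constants are local stand-ins for PG_CONTROL_MAX_SAFE_SIZE and PG_CONTROL_FILE_SIZE, and the function name is hypothetical, not from the patch:

```c
#include <stdint.h>
#include <string.h>

/* Local stand-ins for PostgreSQL's pg_control size constants. */
#define CONTROL_MAX_SAFE_SIZE 512   /* mirrors PG_CONTROL_MAX_SAFE_SIZE */
#define CONTROL_FILE_SIZE     8192  /* mirrors PG_CONTROL_FILE_SIZE */

/*
 * Copy a (possibly shorter) control-file image into a buffer of the full
 * on-disk size, zeroing the trailing bytes, so that what the backup
 * protocol streams keeps the same 8kB size as the file in PGDATA.
 */
static void
pad_control_file(const uint8_t *src, size_t src_len,
                 uint8_t dst[CONTROL_FILE_SIZE])
{
    memset(dst, 0, CONTROL_FILE_SIZE);  /* bytes past the data are zeroed */
    memcpy(dst, src, src_len);          /* real control data at the front */
}
```

Streaming the padded buffer would keep the file inside the backup byte-compatible in size with the 8kB pg_control that initdb writes.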
- Create a <filename>backup_label</filename> file to begin WAL replay at + Update <filename>pg_control</filename> file to begin WAL replay at the checkpoint created at failover and configure the <filename>pg_control</filename> file with a minimum consistency LSN pg_control is mentioned twice, so perhaps this could be worded better? PG_CONTROL_VERSION is important to not forget about.. Perhaps this should be noted somewhere, or just changed in the patch itself. Contrary to catalog changes, we do few of these in the control file so there is close to zero risk of conflicts with other patches in the CF app. -- Michael
(I am not exactly sure how, but we've lost pgsql-hackers on the way when you sent v5. Now added back in CC with the two latest patches you've proposed attached.) Here is a short summary of what has been missed by the lists: - I've commented that the patch should not create 512B control files, show them in fields returned by the SQL functions, or stream them in base backups; it should just stick to 8kB. If this is worth changing, it should be applied consistently across the board including initdb, discussed on its own thread. - The backup-related fields in the control file are reset at the end of recovery. I've suggested to not do that to keep a trace of what was happening during recovery. The latest version of the patch resets the fields. - With the backup_label file gone, we lose some information in the backups themselves, which is not good. Instead, you have suggested an approach where this data is added to the backup manifest, meaning that no information would be lost, particularly useful for self-contained backups. The fields planned to be added to the backup manifest are: -- The start and end time of the backup, the end timestamp being useful to know when stop time can be used for PITR. -- The backup label. I've agreed that it may be the best thing to do at this end to not lose any data related to the removal of the backup_label file. On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote: > On 11/15/23 20:03, Michael Paquier wrote: >> As the label is only an informational field, the parsing added to >> pg_verifybackup is not really needed because it is used nowhere in the >> validation process, so keeping the logic simpler would be the way to >> go IMO. This is contrary to the WAL range for example, where start >> and end LSNs are used for validation with a pg_waldump command. >> Robert, any comments about the addition of the label in the manifest?
> > I'm sure Robert will comment on this when he gets the time, but for now I > have backed off on passing the new info to pg_verifybackup and added > start/stop time. FWIW, I'm OK with the bits for the backup manifest as presented. So if there are no remarks and/or no objections, I'd like to apply it, but let's give some room to others to comment on that as there's been a gap in the emails exchanged on pgsql-hackers. I hope that the summary I've posted above covers everything. So let's see about doing something around the middle of next week. With Thanksgiving in the US, a lot of folks will not have the time to monitor what's happening on this thread. + The end time for the backup. This is when the backup was stopped in + <productname>PostgreSQL</productname> and represents the earliest time + that can be used for time-based Point-In-Time Recovery. This one is actually a very good point. We'd have lost this capability with the backup_label file gone, since the end timestamp is not stored in the control file. > New patches attached based on b218fbb7. I've noticed on the other thread the remark about being less aggressive with the fields related to recovery in the control file, so I assume that this patch should leave the fields be after the end of recovery from the start and only rely on backupRecoveryRequired to decide if the recovery should use the fields or not: https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net + ControlFile->backupCheckPoint = InvalidXLogRecPtr; ControlFile->backupStartPoint = InvalidXLogRecPtr; + ControlFile->backupStartPointTLI = 0; ControlFile->backupEndPoint = InvalidXLogRecPtr; + ControlFile->backupFromStandby = false; ControlFile->backupEndRequired = false; Still, I get the temptation of being consistent with the current style on HEAD to reset everything, as well.. -- Michael
On 11/19/23 21:15, Michael Paquier wrote: > (I am not exactly sure how, but we've lost pgsql-hackers on the way > when you sent v5. Now added back in CC with the two latest patches > you've proposed attached.) Ugh, I must have hit reply instead of reply all. It's a rookie error and you hate to see it. > Here is a short summary of what has been missed by the lists: > - I've commented that the patch should not create, not show up in > fields returned the SQL functions or stream control files with a size > of 512B, just stick to 8kB. If this is worth changing this should be > applied consistently across the board including initdb, discussed on > its own thread. > - The backup-related fields in the control file are reset at the end > of recovery. I've suggested to not do that to keep a trace of what > was happening during recovery. The latest version of the patch resets > the fields. > - With the backup_label file gone, we lose some information in the > backups themselves, which is not good. Instead, you have suggested an > approach where this data is added to the backup manifest, meaning that > no information would be lost, particularly useful for self-contained > backups. The fields planned to be added to the backup manifest are: > -- The start and end time of the backup, the end timestamp being > useful to know when stop time can be used for PITR. > -- The backup label. > I've agreed that it may be the best thing to do at this end to not > lose any data related to the removal of the backup_label file. This looks right to me. > On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote: >> On 11/15/23 20:03, Michael Paquier wrote: >>> As the label is only an informational field, the parsing added to >>> pg_verifybackup is not really needed because it is used nowhere in the >>> validation process, so keeping the logic simpler would be the way to >>> go IMO. 
This is contrary to the WAL range for example, where start >>> and end LSNs are used for validation with a pg_waldump command. >>> Robert, any comments about the addition of the label in the manifest? >> >> I'm sure Robert will comment on this when he gets the time, but for now I >> have backed off on passing the new info to pg_verifybackup and added >> start/stop time. > > FWIW, I'm OK with the bits for the backup manifest as presented. So > if there are no remarks and/or no objections, I'd like to apply it but > let give some room to others to comment on that as there's been a gap > in the emails exchanged on pgsql-hackers. I hope that the summary > I've posted above covers everything. So let's see about doing > something around the middle of next week. With Thanksgiving in the > US, a lot of folks will not have the time to monitor what's happening > on this thread. Timing sounds good to me. > > + The end time for the backup. This is when the backup was stopped in > + <productname>PostgreSQL</productname> and represents the earliest time > + that can be used for time-based Point-In-Time Recovery. > > This one is actually a very good point. We'd lost this capacity with > the backup_label file gone without the end timestamps in the control > file. Yeah, the end time is very important for recovery scenarios. We definitely need that recorded somewhere. 
> I've noticed on the other thread the remark about being less > aggressive with the fields related to recovery in the control file, so > I assume that this patch should leave the fields be after the end of > recovery from the start and only rely on backupRecoveryRequired to > decide if the recovery should use the fields or not: > https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@=pgmasters.net > > + ControlFile->backupCheckPoint = InvalidXLogRecPtr; > ControlFile->backupStartPoint = InvalidXLogRecPtr; > + ControlFile->backupStartPointTLI = 0; > ControlFile->backupEndPoint = InvalidXLogRecPtr; > + ControlFile->backupFromStandby = false; > ControlFile->backupEndRequired = false; > > Still, I get the temptation of being consistent with the current style > on HEAD to reset everything, as well.. I'd rather reset everything for now (as we do now) and think about keeping these values as a separate patch. It may be that we don't want to keep all of them, or we need a separate flag to say recovery was completed. We are accumulating a lot of booleans here, maybe we need a state var (recoveryRequired, recoveryInProgress, recoveryComplete) and then define which other vars are valid in each state. Regards, -David
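As a strawman, the state variable suggested above could look something like this. The names are hypothetical and not from any posted patch; it only sketches how one enum could replace several booleans and define which other fields are valid in each state:

```c
#include <stdbool.h>

/*
 * Hypothetical replacement for the accumulating booleans
 * (backupRecoveryRequired and friends) in pg_control.
 */
typedef enum BackupRecoveryState
{
    BACKUP_RECOVERY_NONE,           /* not restoring from a backup */
    BACKUP_RECOVERY_REQUIRED,       /* recovery fields set, redo not started */
    BACKUP_RECOVERY_IN_PROGRESS,    /* redo running, end of backup not reached */
    BACKUP_RECOVERY_COMPLETE        /* consistency reached */
} BackupRecoveryState;

/* Define in one place which states make the other backup fields valid. */
static bool
backup_fields_valid(BackupRecoveryState state)
{
    return state == BACKUP_RECOVERY_REQUIRED ||
           state == BACKUP_RECOVERY_IN_PROGRESS;
}
```

A COMPLETE state would also give a natural place to decide whether to keep the fields for forensics instead of zeroing them.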
On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier <michael@paquier.xyz> wrote: > (I am not exactly sure how, but we've lost pgsql-hackers on the way > when you sent v5. Now added back in CC with the two latest patches > you've proposed attached.) > > Here is a short summary of what has been missed by the lists: > - I've commented that the patch should not create, not show up in > fields returned the SQL functions or stream control files with a size > of 512B, just stick to 8kB. If this is worth changing this should be > applied consistently across the board including initdb, discussed on > its own thread. > - The backup-related fields in the control file are reset at the end > of recovery. I've suggested to not do that to keep a trace of what > was happening during recovery. The latest version of the patch resets > the fields. > - With the backup_label file gone, we lose some information in the > backups themselves, which is not good. Instead, you have suggested an > approach where this data is added to the backup manifest, meaning that > no information would be lost, particularly useful for self-contained > backups. The fields planned to be added to the backup manifest are: > -- The start and end time of the backup, the end timestamp being > useful to know when stop time can be used for PITR. > -- The backup label. > I've agreed that it may be the best thing to do at this end to not > lose any data related to the removal of the backup_label file. I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. The last time we changed the API, we changed pg_stop_backup to pg_backup_stop, but this doesn't do that, and I wonder if that's OK. Even if it is, do we really want to change this API around again after such a short time? 
That said, I don't have an intrinsic problem with moving this information from the backup_label to the backup_manifest file since it is purely informational. I do think there should perhaps be some additions to the test cases, though. I am concerned about the interaction of this proposal with incremental backup. When you take an incremental backup, you get something that looks a lot like a usable data directory but isn't. To prevent that from causing avoidable disasters, the present version of the patch adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the backup_label. pg_combinebackup knows to look for those fields, and the server knows that if they are present it should refuse to start. With this change, though, I suppose those fields would end up in pg_control. But that does not feel entirely great, because we have a goal of keeping the amount of real data in pg_control below 512 bytes, the traditional sector size, and this adds another 12 bytes of stuff to that file that currently doesn't need to be there. I feel like that's kind of a problem. But my main point here is ... if we have a few more senior hackers weigh in and vote in favor of this change, well then that's one thing. But IMHO a discussion that's mostly between 2 people is not nearly a strong enough consensus to justify this amount of disruption. -- Robert Haas EDB: http://www.enterprisedb.com
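For the record, the sector-size budget mentioned above can be enforced mechanically at compile time; PostgreSQL already carries a similar check against PG_CONTROL_MAX_SAFE_SIZE. This toy sketch uses a made-up struct (the real ControlFileData lives in src/include/catalog/pg_control.h and has many more fields) just to show how the hypothetical 12 bytes of incremental-backup state would count against the budget:

```c
#include <stdint.h>

/* Illustrative layout only, not the real ControlFileData. */
typedef struct DemoControlFileData
{
    uint64_t checkPoint;            /* existing-style fields ... */
    uint64_t backupStartPoint;
    uint64_t backupEndPoint;
    uint32_t backupStartPointTLI;
    uint32_t _pad;                  /* keep 8-byte alignment explicit */
    uint64_t incrementalFromLSN;    /* hypothetical addition: +8 bytes */
    uint32_t incrementalFromTLI;    /* hypothetical addition: +4 bytes */
} DemoControlFileData;

/* The same kind of compile-time guard applied to the real struct. */
_Static_assert(sizeof(DemoControlFileData) <= 512,
               "control data must fit in one 512-byte sector");
```

With a guard like this, any patch that pushes the meaningful data past the traditional sector size fails to compile rather than silently risking torn writes.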
On 11/20/23 12:11, Robert Haas wrote: > On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier <michael@paquier.xyz> wrote: >> (I am not exactly sure how, but we've lost pgsql-hackers on the way >> when you sent v5. Now added back in CC with the two latest patches >> you've proposed attached.) >> >> Here is a short summary of what has been missed by the lists: >> - I've commented that the patch should not create, not show up in >> fields returned the SQL functions or stream control files with a size >> of 512B, just stick to 8kB. If this is worth changing this should be >> applied consistently across the board including initdb, discussed on >> its own thread. >> - The backup-related fields in the control file are reset at the end >> of recovery. I've suggested to not do that to keep a trace of what >> was happening during recovery. The latest version of the patch resets >> the fields. >> - With the backup_label file gone, we lose some information in the >> backups themselves, which is not good. Instead, you have suggested an >> approach where this data is added to the backup manifest, meaning that >> no information would be lost, particularly useful for self-contained >> backups. The fields planned to be added to the backup manifest are: >> -- The start and end time of the backup, the end timestamp being >> useful to know when stop time can be used for PITR. >> -- The backup label. >> I've agreed that it may be the best thing to do at this end to not >> lose any data related to the removal of the backup_label file. > > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. From my perspective it's not that big a change for backup software but it does bring a lot of benefits, including fixing an outstanding bug in Postgres, i.e. reading pg_control without getting a torn copy. 
> The last time we changed the API, we changed > pg_stop_backup to pg_backup_stop, but this doesn't do that, and I > wonder if that's OK. Even if it is, do we really want to change this > API around again after such a short time? This is a good point. We could just rename again, but not sure what names to go for this time. OTOH if the backup software is selecting fields then they will get an error because the names have changed. If the software is grabbing fields by position then they'll get a valid-looking result (even if querying by position is a terrible idea). Another thing we could do is explicitly error if we see backup_label in PGDATA during recovery. That's just a few lines of code so would not be a big deal to maintain. This error would only be visible on restore, so it presumes that backup software is being tested. Maybe just a rename to something like pg_backup_begin/end would be the way to go. > That said, I don't have an intrinsic problem with moving this > information from the backup_label to the backup_manifest file since it > is purely informational. I do think there should perhaps be some > additions to the test cases, though. A little hard to add to the tests, I think, since they are purely informational, i.e. not pushed up by the parser. Maybe we could just grep for the fields? > I am concerned about the interaction of this proposal with incremental > backup. When you take an incremental backup, you get something that > looks a lot like a usable data directory but isn't. To prevent that > from causing avoidable disasters, the present version of the patch > adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the > backup_label. pg_combinebackup knows to look for those fields, and the > server knows that if they are present it should refuse to start. With > this change, though, I suppose those fields would end up in > pg_control. 
But that does not feel entirely great, because we have a > goal of keeping the amount of real data in pg_control below 512 bytes, > the traditional sector size, and this adds another 12 bytes of stuff > to that file that currently doesn't need to be there. I feel like > that's kind of a problem. I think these fields would be handled the same as the rest of the fields in backup_label: returned from pg_backup_stop() and also stored in backup_manifest. Third-party software can do as they like with them and pg_combinebackup can just read from backup_manifest. As for the pg_control file -- it might be best to give it a different name for backups that are not essentially copies of PGDATA. On the other hand, pgBackRest has included pg_control in incremental backups since day one and we've never had a user mistakenly do a manual restore of one and cause a problem (though manual restores are not the norm). Still, probably can't hurt to be a bit careful. > But my main point here is ... if we have a few more senior hackers > weigh in and vote in favor of this change, well then that's one thing. > But IMHO a discussion that's mostly between 2 people is not nearly a > strong enough consensus to justify this amount of disruption. We absolutely need more people to look at this and sign off. I'm glad they have not so far because it has allowed time to whack the patch around and get it into better shape. Regards, -David
On Mon, Nov 20, 2023 at 12:54 PM David Steele <david@pgmasters.net> wrote: > Another thing we could do is explicitly error if we see backup_label in > PGDATA during recovery. That's just a few lines of code so would not be > a big deal to maintain. This error would only be visible on restore, so > it presumes that backup software is being tested. I think that if we do decide to adopt this proposal, that would be a smart precaution. > A little hard to add to the tests, I think, since they are purely > informational, i.e. not pushed up by the parser. Maybe we could just > grep for the fields? Hmm. Or should they be pushed up by the parser? > I think these fields would be handled the same as the rest of the fields > in backup_label: returned from pg_backup_stop() and also stored in > backup_manifest. Third-party software can do as they like with them and > pg_combinebackup can just read from backup_manifest. I think that would be a bad plan, because this is critical information, and a backup manifest is not a thing that you're required to have. It's not a natural fit at all. We don't want to create a situation where if you nuke the backup_manifest then the server forgets that what it has is an incremental backup rather than a usable data directory. > We absolutely need more people to look at this and sign off. I'm glad > they have not so far because it has allowed time to whack the patch > around and get it into better shape. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
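The "few lines of code" precaution discussed above could be sketched roughly as follows. This is a standalone illustration: in the backend it would be an ereport(FATAL) during recovery startup, and the function name and message wording here are made up:

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Refuse to start recovery if a stray backup_label is present in the
 * data directory, since recovery information would now live only in
 * pg_control.  Returns false if startup should be aborted.
 */
static bool
reject_stray_backup_label(const char *datadir)
{
    char        path[4096];
    struct stat st;

    snprintf(path, sizeof(path), "%s/backup_label", datadir);
    if (stat(path, &st) == 0)
    {
        fprintf(stderr,
                "FATAL: backup_label is no longer used to start recovery\n"
                "HINT: restore the pg_control file returned by pg_backup_stop()\n");
        return false;           /* caller aborts startup */
    }
    return true;                /* no stray file, proceed */
}
```

The check only fires on restore, so as noted it presumes the backup software is actually tested against the new server version.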
On 11/20/23 14:44, Robert Haas wrote: > On Mon, Nov 20, 2023 at 12:54 PM David Steele <david@pgmasters.net> wrote: >> Another thing we could do is explicitly error if we see backup_label in >> PGDATA during recovery. That's just a few lines of code so would not be >> a big deal to maintain. This error would only be visible on restore, so >> it presumes that backup software is being tested. > > I think that if we do decide to adopt this proposal, that would be a > smart precaution. I'd be OK with it -- what do you think, Michael? Would this be enough that we would not need to rename the functions, or should we just go with the rename? >> A little hard to add to the tests, I think, since they are purely >> informational, i.e. not pushed up by the parser. Maybe we could just >> grep for the fields? > > Hmm. Or should they be pushed up by the parser? We could do that. I started on that road, but it's a lot of code for fields that aren't used. I think it would be better if the parser also loaded a data structure that represented the manifest. Seems to me there's a lot of duplicated code between pg_verifybackup and pg_combinebackup the way it is now. >> I think these fields would be handled the same as the rest of the fields >> in backup_label: returned from pg_backup_stop() and also stored in >> backup_manifest. Third-party software can do as they like with them and >> pg_combinebackup can just read from backup_manifest. > > I think that would be a bad plan, because this is critical > information, and a backup manifest is not a thing that you're required > to have. It's not a natural fit at all. We don't want to create a > situation where if you nuke the backup_manifest then the server > forgets that what it has is an incremental backup rather than a usable > data directory. I can't see why a backup would continue to be valid without a manifest -- that's not very standard for backup software. 
If you have the critical info in backup_label, you can't afford to lose that, so why should backup_manifest be any different? Regards, -David
On Mon, Nov 20, 2023 at 2:41 PM David Steele <david@pgmasters.net> wrote: > I can't see why a backup would continue to be valid without a manifest > -- that's not very standard for backup software. If you have the > critical info in backup_label, you can't afford to lose that, so why > should backup_manifest be any different? I mean, you can run pg_basebackup --no-manifest. -- Robert Haas EDB: http://www.enterprisedb.com
On 11/20/23 15:47, Robert Haas wrote: > On Mon, Nov 20, 2023 at 2:41 PM David Steele <david@pgmasters.net> wrote: >> I can't see why a backup would continue to be valid without a manifest >> -- that's not very standard for backup software. If you have the >> critical info in backup_label, you can't afford to lose that, so why >> should backup_manifest be any different? > > I mean, you can run pg_basebackup --no-manifest. Maybe this would be a good thing to disable for page incremental. With all the work being done by pg_combinebackup, it seems like it would be a good idea to be able to verify the final result? I understand this is an option -- but does it need to be? What is the benefit of excluding the manifest? Regards, -David
Hi, On 2023-11-20 11:11:13 -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. +1. The amount of whacking around in this area has been substantial, and it's hard for operators to keep up. And realistically, with data sizes today, the pressure to do basebackups with disk snapshots etc is not going to shrink. Leaving that concern aside, I am still on the fence about this proposal. I think it does decrease the chance of getting things wrong in the streaming-basebackup case. But for external backups, it seems almost universally worse (with the exception of the torn pg_control issue, that we also can address otherwise): It doesn't reduce the risk of getting things wrong, you can still omit placing a file into the data directory and get silent corruption as a consequence. In addition, it's harder to see when looking at a base backup whether the process was right or not, because now the good and bad state look the same if you just look on the filesystem level! Then there's the issue of making ad-hoc debugging harder by not having a human readable file with information anymore, including when looking at the history, via backup_label.old. Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup Greetings, Andres Freund
Hi, On 2023-11-20 15:56:19 -0400, David Steele wrote: > I understand this is an option -- but does it need to be? What is the > benefit of excluding the manifest? It's not free to create the manifest, particularly if checksums are enabled. Also, for external backups, there's no manifest... - Andres
On Mon, Nov 20, 2023 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:
> Given that, I wonder if what we should do is to just add a new field to
> pg_control that says "error out if backup_label does not exist", that we set
> when creating a streaming base backup
I thought this was DOA since we don't want to ever leave the cluster in a state where a crash requires intervention to restart. But I agree that it is not possible to fool-proof against a naive backup that copies over the pg_control file as-is if breaking the crashed cluster option is not in play.
I agree that this works if the pg_control generated by stop backup produces the line and we retain the label file as a separate and now mandatory component to using the backup.
Or is the idea to make v17 error if it sees a backup label unless pg_control has the feature flag field? Which doesn't exist normally, does in the basebackup version, and is removed once the backup is restored?
David J.
On Mon, Nov 20, 2023 at 11:11:13AM -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. The last time we changed the API, we changed > pg_stop_backup to pg_backup_stop, but this doesn't do that, and I > wonder if that's OK. Even if it is, do we really want to change this > API around again after such a short time? Agreed. > That said, I don't have an intrinsic problem with moving this > information from the backup_label to the backup_manifest file since it > is purely informational. I do think there should perhaps be some > additions to the test cases, though. Yep, cool. Even if we decide to not go with what's discussed in this patch, I think that's useful for some users at the end to get more redundancy, as well. And that's in a format easier to parse. > I am concerned about the interaction of this proposal with incremental > backup. When you take an incremental backup, you get something that > looks a lot like a usable data directory but isn't. To prevent that > from causing avoidable disasters, the present version of the patch > adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the > backup_label. pg_combinebackup knows to look for those fields, and the > server knows that if they are present it should refuse to start. With > this change, though, I suppose those fields would end up in > pg_control. But that does not feel entirely great, because we have a > goal of keeping the amount of real data in pg_control below 512 bytes, > the traditional sector size, and this adds another 12 bytes of stuff > to that file that currently doesn't need to be there. I feel like > that's kind of a problem. I don't recall one time where the addition of new fields to the control file was easy to discuss because of its 512B hard limit. 
Anyway, putting the addition aside for a second, and I've not looked at the incremental backup patch, does the removal of the backup_label make the combine logic more complicated, or is it just moving a chunk of code to do a control file lookup instead of parsing a backup_label file? Making the information less readable is definitely an issue for me. A different alternative that I've mentioned upthread is to keep an equivalent of the backup_label and rename it to something like backup.debug or similar, with a name good enough to tell people that we don't care about it being removed. -- Michael
Hi, On 2023-11-20 14:18:15 -0700, David G. Johnston wrote: > On Mon, Nov 20, 2023 at 1:37 PM Andres Freund <andres@anarazel.de> wrote: > > > > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we > > set > > when creating a streaming base backup > > > > > I thought this was DOA since we don't want to ever leave the cluster in a > state where a crash requires intervention to restart. I was trying to suggest that we'd set the field in-memory, when streaming out a pg_basebackup style backup (by just replacing pg_control with an otherwise identical file that has the flag set). So it'd not have any effect on the primary. Greetings, Andres Freund
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we set > when creating a streaming base backup That would mean that one still needs to take an extra step to update a control file with this byte set, which is something you had a concern with in terms of compatibility when it comes to external backup solutions because more steps are necessary to take a backup, no? I don't quite see why it is different than what's proposed on this thread, except that you don't need to write one file to the data folder to store the backup label fields, but two, meaning that there's a risk for more mistakes because a clean backup process would require more steps. With the current position of the fields in ControlFileData, there are three free bytes after backupEndRequired, so it is possible to add that for free. Now, would you actually need an extra field knowing that backupStartPoint is around? -- Michael
Hi, On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we set > > when creating a streaming base backup > > That would mean that one still needs to take an extra step to update a > control file with this byte set, which is something you had a concern > with in terms of compatibility when it comes to external backup > solutions because more steps are necessary to take a backup, no? I was thinking we'd just set it in the pg_basebackup style path, and we'd error out if it's set and backup_label is present. But we'd still use backup_label without the pg_control flag set. So it'd just provide a cross-check that backup_label was not removed for pg_basebackup style backup, but wouldn't do anything for external backups. But imo the proposal to just use pg_control doesn't actually do anything for external backups either - which is why I think my proposal would achieve as much, for a much lower price. Greetings, Andres Freund
On Mon, Nov 20, 2023 at 03:58:55PM -0800, Andres Freund wrote: > I was thinking we'd just set it in the pg_basebackup style path, and we'd > error out if it's set and backup_label is present. But we'd still use > backup_label without the pg_control flag set. > > So it'd just provide a cross-check that backup_label was not removed for > pg_basebackup style backup, but wouldn't do anything for external backups. But > imo the proposal to just us pg_control doesn't actually do anything for > external backups either - which is why I think my proposal would achieve as > much, for a much lower price. I don't see why not. It does not increase the number of steps when doing a backup, and backupStartPoint alone would not be able to offer this much protection. -- Michael
On 11/20/23 19:58, Andres Freund wrote: > > On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: >> On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: >>> Given that, I wonder if what we should do is to just add a new field to >>> pg_control that says "error out if backup_label does not exist", that we set >>> when creating a streaming base backup >> >> That would mean that one still needs to take an extra step to update a >> control file with this byte set, which is something you had a concern >> with in terms of compatibility when it comes to external backup >> solutions because more steps are necessary to take a backup, no? > > I was thinking we'd just set it in the pg_basebackup style path, and we'd > error out if it's set and backup_label is present. But we'd still use > backup_label without the pg_control flag set. > > So it'd just provide a cross-check that backup_label was not removed for > pg_basebackup style backup, but wouldn't do anything for external backups. But > imo the proposal to just us pg_control doesn't actually do anything for > external backups either - which is why I think my proposal would achieve as > much, for a much lower price. I'm not sure why you think the patch under discussion doesn't do anything for external backups. It provides the same benefits to both pg_basebackup and external backups, i.e. they both receive the updated version of pg_control. I really dislike the idea of pg_basebackup having a special mechanism for making recovery safer that is not generally available to external backup software. It might be easy enough for some (e.g. pgBackRest) to manipulate pg_control but would be out of reach for most. Regards, -David
Hi, On 2023-11-21 07:42:42 -0400, David Steele wrote: > On 11/20/23 19:58, Andres Freund wrote: > > On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: > > > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > > > > Given that, I wonder if what we should do is to just add a new field to > > > > pg_control that says "error out if backup_label does not exist", that we set > > > > when creating a streaming base backup > > > > > > That would mean that one still needs to take an extra step to update a > > > control file with this byte set, which is something you had a concern > > > with in terms of compatibility when it comes to external backup > > > solutions because more steps are necessary to take a backup, no? > > > > I was thinking we'd just set it in the pg_basebackup style path, and we'd > > error out if it's set and backup_label is present. But we'd still use > > backup_label without the pg_control flag set. > > > > So it'd just provide a cross-check that backup_label was not removed for > > pg_basebackup style backup, but wouldn't do anything for external backups. But > > imo the proposal to just us pg_control doesn't actually do anything for > > external backups either - which is why I think my proposal would achieve as > > much, for a much lower price. > > I'm not sure why you think the patch under discussion doesn't do anything > for external backups. It provides the same benefits to both pg_basebackup > and external backups, i.e. they both receive the updated version of > pg_control. Sure. They also receive a backup_label today. If an external solution forgets to replace pg_control copied as part of the filesystem copy, they won't get an error after the remove of backup_label, just like they don't get one today if they don't put backup_label in the data directory. Given that users don't do the right thing with backup_label today, why can we rely on them doing the right thing with pg_control? Greetings, Andres Freund
On 11/21/23 12:41, Andres Freund wrote: > > On 2023-11-21 07:42:42 -0400, David Steele wrote: >> On 11/20/23 19:58, Andres Freund wrote: >>> On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: >>>> On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: >>>>> Given that, I wonder if what we should do is to just add a new field to >>>>> pg_control that says "error out if backup_label does not exist", that we set >>>>> when creating a streaming base backup >>>> >>>> That would mean that one still needs to take an extra step to update a >>>> control file with this byte set, which is something you had a concern >>>> with in terms of compatibility when it comes to external backup >>>> solutions because more steps are necessary to take a backup, no? >>> >>> I was thinking we'd just set it in the pg_basebackup style path, and we'd >>> error out if it's set and backup_label is present. But we'd still use >>> backup_label without the pg_control flag set. >>> >>> So it'd just provide a cross-check that backup_label was not removed for >>> pg_basebackup style backup, but wouldn't do anything for external backups. But >>> imo the proposal to just us pg_control doesn't actually do anything for >>> external backups either - which is why I think my proposal would achieve as >>> much, for a much lower price. >> >> I'm not sure why you think the patch under discussion doesn't do anything >> for external backups. It provides the same benefits to both pg_basebackup >> and external backups, i.e. they both receive the updated version of >> pg_control. > > Sure. They also receive a backup_label today. If an external solution forgets > to replace pg_control copied as part of the filesystem copy, they won't get an > error after the remove of backup_label, just like they don't get one today if > they don't put backup_label in the data directory. Given that users don't do > the right thing with backup_label today, why can we rely on them doing the > right thing with pg_control? 
I think reliable backup software does the right thing with backup_label, but if the user starts getting errors on recovery they then decide to remove backup_label. I know we can't do much about bad backup software, but we can at least make this a bit more resistant to user error after the fact. It doesn't help that one of our hints suggests removing backup_label. In highly automated systems, the user might not even know they just restored from a backup. They are only in the loop because the restore failed and they are trying to figure out what is going wrong. When they remove backup_label the cluster comes up just fine. Victory! This is the scenario I've seen most often -- not the backup/restore process getting it wrong but the user removing backup_label on their own initiative. And because it yields such a positive result, at least initially, they remember in the future that the thing to do is to remove backup_label whenever they see the error. If they only have pg_control, then their only choice is to get it right or run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take them longer to get there. Also, I think that tool makes it pretty clear that corruption will result and the only thing to do is a logical dump and restore after using it. There are plenty of ways a user can mess things up. What I'd like to prevent is the appearance of everything being OK when in fact they have corrupted their cluster. That's the situation we have now with backup_label. Is this new solution perfect? No, but I do think it checks several boxes, and is a worthwhile improvement. Regards, -David
On 11/20/23 16:41, Andres Freund wrote: > > On 2023-11-20 15:56:19 -0400, David Steele wrote: >> I understand this is an option -- but does it need to be? What is the >> benefit of excluding the manifest? > > It's not free to create the manifest, particularly if checksums are enabled. It's virtually free, even with the basic CRCs. Anyway, would you really want a backup without a manifest? How would you know something is missing? In particular, for page incremental how do you know something is new (but not WAL logged) if there is no manifest? Is the plan to just recopy anything not WAL logged with each incremental? > Also, for external backups, there's no manifest... There certainly is a manifest for many external backup solutions. Not having a manifest is just running with scissors, backup-wise. Regards, -David
Hi, On 2023-11-21 13:41:15 -0400, David Steele wrote: > On 11/20/23 16:41, Andres Freund wrote: > > > > On 2023-11-20 15:56:19 -0400, David Steele wrote: > > > I understand this is an option -- but does it need to be? What is the > > > benefit of excluding the manifest? > > > > It's not free to create the manifest, particularly if checksums are enabled. > > It's virtually free, even with the basic CRCs. Huh?

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar > /dev/null

       4,423.81 msec task-clock        #    0.626 CPUs utilized
        433,475      context-switches  #   97.987 K/sec
              5      cpu-migrations    #    1.130 /sec
            599      page-faults       #  135.404 /sec
 12,208,261,153      cycles            #    2.760 GHz
  6,805,401,520      instructions      #    0.56  insn per cycle
  1,273,896,027      branches          #  287.964 M/sec
     14,233,126      branch-misses     #    1.12% of all branches

    7.068946385 seconds time elapsed
    1.106072000 seconds user
    3.403793000 seconds sys

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --manifest-checksums=CRC32C > /dev/null

       4,324.64 msec task-clock        #    0.640 CPUs utilized
        433,306      context-switches  #  100.195 K/sec
              3      cpu-migrations    #    0.694 /sec
            598      page-faults       #  138.277 /sec
 11,952,475,908      cycles            #    2.764 GHz
  6,816,888,845      instructions      #    0.57  insn per cycle
  1,275,949,455      branches          #  295.042 M/sec
     13,721,376      branch-misses     #    1.08% of all branches

    6.760321433 seconds time elapsed
    1.113256000 seconds user
    3.302907000 seconds sys

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --no-manifest > /dev/null

       3,925.38 msec task-clock        #    0.823 CPUs utilized
        257,467      context-switches  #   65.590 K/sec
              4      cpu-migrations    #    1.019 /sec
            552      page-faults       #  140.624 /sec
 11,577,054,842      cycles            #    2.949 GHz
  5,933,731,797      instructions      #    0.51  insn per cycle
  1,108,784,719      branches          #  282.466 M/sec
     11,867,511      branch-misses     #    1.07% of all branches

    4.770347012 seconds time elapsed
    1.002521000 seconds user
    2.991769000 seconds sys

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". And this is actually *under*selling the cost - we waste a lot of cycles due to bad buffering decisions. Once we fix that, the cost differential increases further. > Anyway, would you really want a backup without a manifest? How would you > know something is missing? In particular, for page incremental how do you > know something is new (but not WAL logged) if there is no manifest? Is the > plan to just recopy anything not WAL logged with each incremental? Shrug. If you just want to create a new standby by copying the primary, I don't think creating and then validating the manifest buys you much. Long term backups are a different story, particularly if data files are stored individually, rather than in a single checksummed file. > > Also, for external backups, there's no manifest... > > There certainly is a manifest for many external backup solutions. Not having > a manifest is just running with scissors, backup-wise. You mean that you have an external solution gin up a backup manifest? I fail to see how that's relevant here? Greetings, Andres Freund
On 11/20/23 16:37, Andres Freund wrote: > > On 2023-11-20 11:11:13 -0500, Robert Haas wrote: >> I think we need more votes to make a change this big. I have a >> concern, which I think I've expressed before, that we keep whacking >> around the backup APIs, and that has a cost which is potentially >> larger than the benefits. > > +1. The amount of whacking around in this area has been substantial, and it's > hard for operators to keep up. And realistically, with data sizes today, the > pressure to do basebackups with disk snapshots etc is not going to shrink. True enough, but disk snapshots aren't really backups in themselves, in most scenarios, because they reside on the same storage as the cluster. Of course, snapshots can be exported, but that's also expensive. I see snapshots as an adjunct to backups -- a safe backup offsite somewhere for DR and snapshots for day to day operations. Even so, managing snapshots as backups is harder than people think. It is easy to get wrong and end up with silent corruption. > Leaving that concern aside, I am still on the fence about this proposal. I > think it does decrease the chance of getting things wrong in the > streaming-basebackup case. But for external backups, it seems almost > universally worse (with the exception of the torn pg_control issue, that we > also can address otherwise): Why universally worse? The software stores pg_control instead of backup label. The changes to pg_basebackup were pretty trivial and the changes to external backup are pretty much the same, at least in my limited sample of one. And I don't believe we have a satisfactory solution to the torn pg_control issue yet. Certainly it has not been committed and Thomas has shown enthusiasm for this approach, to the point of hoping it could be back patched (it can't). > It doesn't reduce the risk of getting things wrong, you can still omit placing > a file into the data directory and get silent corruption as a consequence. 
In > addition, it's harder to see when looking at a base backup whether the process > was right or not, because now the good and bad state look the same if you just > look on the filesystem level! This is one of the reasons I thought writing just the first 512 bytes of pg_control would be valuable. It would give an easy indicator that pg_control came from a backup. Michael was not in favor of conflating that change with this patch -- but I still think it's a good idea. > Then there's the issue of making ad-hoc debugging harder by not having a > human readable file with information anymore, including when looking at the > history, via backup_label.old. Yeah, you'd need to use pg_controldata instead. But as Michael has suggested, we could also write backup_label as backup_info so there is human-readable information available. > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we set > when creating a streaming base backup I'm not in favor of a change only accessible to pg_basebackup or external software that can manipulate pg_control. Regards, -David
On 11/21/23 13:59, Andres Freund wrote: > > On 2023-11-21 13:41:15 -0400, David Steele wrote: >> On 11/20/23 16:41, Andres Freund wrote: >>> >>> On 2023-11-20 15:56:19 -0400, David Steele wrote: >>>> I understand this is an option -- but does it need to be? What is the >>>> benefit of excluding the manifest? >>> >>> It's not free to create the manifest, particularly if checksums are enabled. >> >> It's virtually free, even with the basic CRCs. > > Huh? <snip> > I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". OK, but how does that look with compression -- to a remote location? Uncompressed backup to local storage doesn't seem very realistic. With gzip compression we measure SHA1 checksums at about 5% of total CPU. Obviously that goes up with zstd or lz4, but parallelism helps offset that cost, at least in clock time. I can't overstate how valuable checksums are in finding corruption, especially in long-lived backups. >> Anyway, would you really want a backup without a manifest? How would you >> know something is missing? In particular, for page incremental how do you >> know something is new (but not WAL logged) if there is no manifest? Is the >> plan to just recopy anything not WAL logged with each incremental? > > Shrug. If you just want to create a new standby by copying the primary, I > don't think creating and then validating the manifest buys you much. Long term > backups are a different story, particularly if data files are stored > individually, rather than in a single checksummed file. Fine, but you are probably not using page incremental if just using pg_basebackup to create a standby. With page incremental, at least one of the backups will already exist, which argues for a manifest. >>> Also, for external backups, there's no manifest... >> >> There certainly is a manifest for many external backup solutions. Not having >> a manifest is just running with scissors, backup-wise. 
> > You mean that you have an external solution gin up a backup manifest? I fail > to see how that's relevant here? Just saying that for external backups there *is* often a manifest and it is a good thing to have. Regards, -David
Hi, On 2023-11-21 14:48:59 -0400, David Steele wrote: > > I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". > > OK, but how does that look with compression With compression it's obviously somewhat different - but that part is done in parallel, potentially on a different machine with client side compression, whereas I think right now the checksumming is single-threaded, on the server side. With parallel server side compression, it's still 20% slower with the default checksumming than none. With client side it's 15%. > -- to a remote location? I think this one unfortunately makes checksums a bigger issue, not a smaller one. The network interaction piece is single-threaded, and adding another significant use of CPU onto the same thread means that you are hit harder by using a substantial amount of CPU for checksumming in that thread. Once you go beyond the small instances, you have plenty of network bandwidth in cloud environments. We top out well before the network on bigger instances. > Uncompressed backup to local storage doesn't seem very realistic. With gzip > compression we measure SHA1 checksums at about 5% of total CPU. IMO using gzip is basically infeasible for non-toy sized databases today. I think we're doing our users a disservice by defaulting to it in a bunch of places. Even if another default exposes them to difficulty due to potentially using a different compiled binary with fewer supported compression methods - that's gonna be very rare in practice. > I can't understate how valuable checksums are in finding corruption, > especially in long-lived backups. I agree! But I think we need faster checksum algorithms or a faster implementation of the existing ones. And probably default to something faster once we have it. Greetings, Andres Freund
On 11/21/23 16:00, Andres Freund wrote: > Hi, > > On 2023-11-21 14:48:59 -0400, David Steele wrote: >>> I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". >> >> OK, but how does that look with compression > > With compression it's obviously somewhat different - but that part is done in > parallel, potentially on a different machine with client side compression, > whereas I think right now the checksumming is single-threaded, on the server > side. Ah, yes, that's certainly a bottleneck. > With parallel server side compression, it's still 20% slower with the default > checksumming than none. With client side it's 15%. Yeah, that still seems a lot. But to a large extent it sounds like a limitation of the current implementation. >> -- to a remote location? > > I think this one unfortunately makes checksums a bigger issue, not a smaller > one. The network interaction piece is single-threaded, adding another > significant use of CPU onto the same thread means that you are hit harder by > using substantial amount of CPU for checksumming in the same thread. > > Once you go beyond the small instances, you have plenty network bandwidth in > cloud environments. We top out well before the network on bigger instances. > >> Uncompressed backup to local storage doesn't seem very realistic. With gzip >> compression we measure SHA1 checksums at about 5% of total CPU. > > IMO using gzip is basically infeasible for non-toy sized databases today. I > think we're using our users a disservice by defaulting to it in a bunch of > places. Even if another default exposes them to difficulty due to potentially > using a different compiled binary with fewer supported compression methods - > that's gona be very rare in practice. Yeah, I don't use gzip anymore, but there are still some platforms that do not provide zstd (at least not easily) and lz4 compresses less. One thing people do seem to have is a lot of cores. 
>> I can't understate how valuable checksums are in finding corruption, >> especially in long-lived backups. > > I agree! But I think we need faster checksum algorithms or a faster > implementation of the existing ones. And probably default to something faster > once we have it. We've been using xxHash to generate checksums for our block-level incremental and it is seriously fast, written by the same guy who did zstd and lz4. Regards, -David
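As a rough, stdlib-only illustration of how much the choice of algorithm matters (xxHash itself isn't in the Python standard library, so CRC32 stands in for the cheap end and SHA-1 for the cryptographic end; the timings are machine-dependent):

```python
import hashlib
import time
import zlib

# 64 MiB of synthetic data standing in for backup file contents
data = bytes(range(256)) * (64 * 1024 * 1024 // 256)

t0 = time.perf_counter()
crc = zlib.crc32(data) & 0xFFFFFFFF      # cheap CRC, like a CRC32C manifest
t_crc = time.perf_counter() - t0

t0 = time.perf_counter()
sha = hashlib.sha1(data).hexdigest()     # cryptographic, noticeably more CPU
t_sha = time.perf_counter() - t0

print(f"crc32: {t_crc:.3f}s  sha1: {t_sha:.3f}s")
```

On typical hardware the CRC pass runs several times faster than the SHA-1 pass, which is the trade-off being discussed: stronger checksums buy better corruption detection at a real CPU cost on the backup path.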
Greetings, * David Steele (david@pgmasters.net) wrote: > On 11/21/23 12:41, Andres Freund wrote: > > Sure. They also receive a backup_label today. If an external solution forgets > > to replace pg_control copied as part of the filesystem copy, they won't get an > > error after the remove of backup_label, just like they don't get one today if > > they don't put backup_label in the data directory. Given that users don't do > > the right thing with backup_label today, why can we rely on them doing the > > right thing with pg_control? > > I think reliable backup software does the right thing with backup_label, but > if the user starts getting errors on recovery they the decide to remove > backup_label. I know we can't do much about bad backup software, but we can > at least make this a bit more resistant to user error after the fact. > > It doesn't help that one of our hints suggests removing backup_label. In > highly automated systems, the user might not even know they just restored > from a backup. They are only in the loop because the restore failed and they > are trying to figure out what is going wrong. When they remove backup_label > the cluster comes up just fine. Victory! Yup, this is exactly the issue. > This is the scenario I've seen most often -- not the backup/restore process > getting it wrong but the user removing backup_label on their own initiative. > And because it yields such a positive result, at least initially, they > remember in the future that the thing to do is to remove backup_label > whenever they see the error. > > If they only have pg_control, then their only choice is to get it right or > run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take > them longer to get there. Also, I think that tool make it pretty clear that > corruption will result and the only thing to do is a logical dump and > restore after using it. Agreed. > There are plenty of ways a user can mess things up. 
What I'd like to prevent > is the appearance of everything being OK when in fact they have corrupted > their cluster. That's the situation we have now with backup_label. Is this > new solution perfect? No, but I do think it checks several boxes, and is a > worthwhile improvement. +1. As for the complaint about 'operators' having issue with the changes we've been making in this area- where are those people complaining, exactly? Who are they? I feel like we keep getting this kind of push-back in this area from folks on this list but not from actual backup software authors; all the complaints seem to either be speculative or unattributed pass-through from someone else. What would really be helpful would be hearing from these individuals directly as to what the issues are with the changes, such that perhaps we can do things better in the future to avoid whatever the issue is they're having with the changes. Simply saying we shouldn't make changes in this area isn't workable and the constant push-back is actively discouraging to folks trying to make improvements. Obviously it's a biased view, but we've not had issues making the necessary adjustments in pgbackrest with each release and I feel like if the authors of wal-g or barman did that they would have spoken up. Making a change as suggested which only helps pg_basebackup (and tools like pgbackrest, since it's in C and can also make this particular change) ends up leaving tools like wal-g and barman potentially still with an easy way for users of those tools to corrupt their databases- even though we've not heard anything from the authors of those tools about issues with the proposed change, nor have there been a lot of complaints from them about the prior changes to indicate that they'd even have an issue with the more involved change. Given the lack of complaint about past changes, I'd certainly rather err on the side of improved safety for users than on the side of the authors of these tools possibly complaining. 
What these changes have done is finally break things like omnipitr completely, which hasn't been maintained in a very long time. The changes in v12 broke recovery with omnipitr but not backup, and folks were trying to use omnipitr as recently as with v13[1]. Certainly having a backup tool that only works for backup (fsvo works, anyway, as it still used exclusive backup mode meaning that a crash during a backup would cause the system to not come back up after...) but doesn't work for recovery isn't exactly great and I'm glad that, now, an attempt to use omnipitr to perform a backup will fail. As with lots of other areas of PG, folks need to read the release notes and potentially update their code for new major versions. If anything, the backup area is less of an issue for this because the authors of the backup tools are able to make the change (and who are often the ones pushing for these changes) and the end-user isn't impacted at all. Much the same can be said for wal-e, with users still trying to use it even long after it was stated to be obsolete (the Obsolescence Notice[2] was added in February 2022, though it hadn't been maintained for a while before that, and an issue was opened in December 2022 asking for it to be updated to v15[3]...). Thanks, Stephen [1]: https://github.com/omniti-labs/omnipitr/issues/43 [2]: https://github.com/wal-e/wal-e/commit/f5b3e790fe10daa098b8cbf01d836c4885dc13c7 [3]: https://github.com/wal-e/wal-e/issues/433
On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost <sfrost@snowman.net> wrote: > What would really be helpful would be hearing from these individuals > directly as to what the issues are with the changes, such that perhaps > we can do things better in the future to avoid whatever the issue is > they're having with the changes. Simply saying we shouldn't make > changes in this area isn't workable and the constant push-back is > actively discouraging to folks trying to make improvements. Obviously > it's a biased view, but we've not had issues making the necessary > adjustments in pgbackrest with each release and I feel like if the > authors of wal-g or barman did that they would have spoken up. I'm happy if people show up to comment on proposed changes, but I think you're being a little bit unrealistic here. I have had to help plenty of people who have screwed up their backups in one way or another, generally by using some home-grown script, sometimes by misusing some existing backup tool. Those people are EDB customers; they don't read and participate in discussions here. If they did, perhaps they wouldn't be paying EDB to have me and my colleagues sort things out for them when it all goes wrong. I'm not trying to say that EDB doesn't have customers who participate in mailing list discussions, because we do, but it's a small minority, and I don't think that should surprise anyone. Moreover, the people who don't wouldn't necessarily have the background, expertise, or *time* to assess specific proposals in detail. If your point is that my perspective on what's helpful or unhelpful is not valid because I've only helped 30 people who had problems in this area, but that the perspective of those 30 people who were helped would be more valid, well, I don't agree with that. 
I think your perspective and David's are valid precisely *because* you've worked a lot on pgbackrest and no doubt interacted with lots of users; I think Andres's perspective is valid precisely *because* of his experience working with the fleet at Microsoft and individual customers at EDB and 2Q before that; and I think my perspective is valid for the same kinds of reasons. I am more in agreement with the idea that it would be nice to hear from backup tool authors, but I think even that has limited value. Surely we can all agree that if the backup tool is correctly written, none of this matters, because you'll make the tool do the right things and then you'll be fine. The difficulty here, and the motivation behind this proposal and others like it, is that too many users fail to follow the procedure correctly. If we hear from the authors of well-written backup tools, I expect they will tell us they can adapt their tool to whatever we do. And if we hear from the authors of poorly-written tools, well, I don't think their opinions would form a great basis for making decisions. > [ lengthy discussion of tools that don't work any more ] What confuses me here is that you seem to be arguing that we should *once again* make a breaking change to the backup API, but at the same time you're acknowledging that there are plenty of tools out there on the Internet that have gotten broken by previous rounds of changes. It's only one step from there to conclude that whacking the API around does more harm than good, but you seem to reject that conclusion. Personally, I haven't yet seen any evidence that the removal of exclusive backup mode made any real difference one way or the other. I think I've heard about people needing to adjust code for it, but not about that being a problem. I have yet to run into anyone who was previously using it but, because it was deprecated, switched to doing something better and safer. Have you? -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost <sfrost@snowman.net> wrote: > > What would really be helpful would be hearing from these individuals > > directly as to what the issues are with the changes, such that perhaps > > we can do things better in the future to avoid whatever the issue is > > they're having with the changes. Simply saying we shouldn't make > > changes in this area isn't workable and the constant push-back is > > actively discouraging to folks trying to make improvements. Obviously > > it's a biased view, but we've not had issues making the necessary > > adjustments in pgbackrest with each release and I feel like if the > > authors of wal-g or barman did that they would have spoken up. > > I'm happy if people show up to comment on proposed changes, but I > think you're being a little bit unrealistic here. I have had to help > plenty of people who have screwed up their backups in one way or > another, generally by using some home-grown script, sometimes by > misusing some existing backup tool. Those people are EDB customers; > they don't read and participate in discussions here. If they did, > perhaps they wouldn't be paying EDB to have me and my colleagues sort > things out for them when it all goes wrong. I'm not trying to say that > EDB doesn't have customers who participate in mailing list > discussions, because we do, but it's a small minority, and I don't > think that should surprise anyone. Moreover, the people who don't > wouldn't necessarily have the background, expertise, or *time* to > assess specific proposals in detail. If your point is that my > perspective on what's helpful or unhelpful is not valid because I've > only helped 30 people who had problems in this area, but that the > perspective of those 30 people who were helped would be more valid, > well, I don't agree with that. 
I think your perspective and David's > are valid precisely *because* you've worked a lot on pgbackrest and no > doubt interacted with lots of users; I think Andres's perspective is > valid precisely *because* of his experience working with the fleet at > Microsoft and individual customers at EDB and 2Q before that; and I > think my perspective is valid for the same kinds of reasons. I didn't mean to imply that anyone's perspective wasn't valid. I was simply trying to get at the root question of: what *is* the issue with the changes that are being made? If the answer to that is: we made this change, which was hard for folks to deal with, and could have been avoided by doing X, then I really, really want to hear what X was! If the answer is, well, the changes weren't hard, but we didn't like having to make any changes at all ... then I just don't have any sympathy for that. People who write backup software for PG, be it pgbackrest authors, wal-g authors, or homegrown script authors, will need to adapt between major versions as we discover things that are broken (such as exclusive mode, and such as the clear risk that's been demonstrated of a torn copy of pg_control getting copied, resulting in a completely invalid backup) and fix them. > I am more in agreement with the idea that it would be nice to hear > from backup tool authors, but I think even that has limited value. > Surely we can all agree that if the backup tool is correctly written, > none of this matters, because you'll make the tool do the right things > and then you'll be fine. The difficulty here, and the motivation > behind this proposal and others like it, is that too many users fail > to follow the procedure correctly. If we hear from the authors of > well-written backup tools, I expect they will tell us they can adapt > their tool to whatever we do. And if we hear from the authors of > poorly-written tools, well, I don't think their opinions would form a > great basis for making decisions. Uhhh. 
No, I disagree with this- I'd argue that pgbackrest was broken until the most recently releases where we implemented a check to ensure that the pg_control we copy has a valid PG CRC. Did we know it was broken before this discussion? No, but that doesn't change the fact that we certainly could have ended up copying an invalid pg_control and thus have an invalid backup, which even our 'pgbackrest verify' wouldn't have caught because that just checks that the checksum that pgbackrest calculates for every file hasn't changed since we copied it- but that didn't do anything for the issue about pg_control having an invalid internal checksum due to a torn write when we copied it. So, yes, it does matter. We didn't make pgbackrest do the right thing in this case because we thought it was true that you couldn't get a torn read of pg_control; Thomas showed that wasn't true and that puts all of our users at risk. Thankfully somewhat minimal since we always copy pg_control from the primary ... but still, it's not right, and we've now taken steps to address it. Unfortunately, other tools are going to have a more difficult time because they're not written in C, but we still care about them, and that's why we're pushing for this change- to allow them to get a pretty much guaranteed valid pg_control from PG to store without having to figure out how to validate it themselves. > > [ lengthy discussion of tools that don't work any more ] > > What confuses me here is that you seem to be arguing that we should > *once again* make a breaking change to the backup API, but at the same > time you're acknowledging that there are plenty of tools out there on > the Internet that have gotten broken by previous rounds of changes. The broken ones aren't being maintained. Yes, I'm happy to have those explicitly and clearly broken. I don't want people using outdated, broken, and unmaintained tools to backup their PG databases. 
> It's only one step from there to conclude that whacking the API around > does more harm than good, but you seem to reject that conclusion. We change the API because it objectively, clearly, addresses real issues that users can run into that will cause them to have invalid backups if left the way it is. That backup software authors need to adjust to this isn't a bad thing- it's a good thing, because we're fixing things and they should be thrilled to have these issues addressed that they may not have even considered. > Personally, I haven't yet seen any evidence that the removal of > exclusive backup mode made any real difference one way or the other. I > think I've heard about people needing to adjust code for it, but not > about that being a problem. I have yet to run into anyone who was > previously using it but, because it was deprecated, switched to doing > something better and safer. Have you? I'm glad that people haven't had a problem adjusting their code to the removal of exclusive backup mode, that's good, and leaves me, again, a bit confused at what the issue here is about changing things- apparently people don't actually have a problem with it, yet it keeps getting raised as an issue every time we change things in this area. I don't understand that. I'm not following the question entirely, I don't think. Most backup tool authors actively changed to using non-exclusive backup when exclusive backup mode was deprecated, certainly pgbackrest did and we've been using non-exclusive backup mode since it was available. Are you saying that, because everyone moved off of it, we should have kept it? 
In that case the answer is clearly no- omnipitr, at the least, didn't update to non-exclusive and therefore continued to run with the risk that a crash during a backup would result in a cluster that wouldn't start without manual intervention (an issue I've definitely heard about a number of times, even recently) and that manual intervention (remove the backup_label file) actively results in a *corrupt* cluster if the user is actually restoring from a backup, which makes it really terrible direction to give someone. Here, use this hack- but only if you're 100% coming back from a crash and absolutely never, ever, ever if you're actually restoring from a backup. Thanks! Stephen
Attachment
On Mon, 20 Nov 2023 at 06:46, Michael Paquier <michael@paquier.xyz> wrote: > > (I am not exactly sure how, but we've lost pgsql-hackers on the way > when you sent v5. Now added back in CC with the two latest patches > you've proposed attached.) > > Here is a short summary of what has been missed by the lists: > - I've commented that the patch should not create, not show up in > fields returned the SQL functions or stream control files with a size > of 512B, just stick to 8kB. If this is worth changing this should be > applied consistently across the board including initdb, discussed on > its own thread. > - The backup-related fields in the control file are reset at the end > of recovery. I've suggested to not do that to keep a trace of what > was happening during recovery. The latest version of the patch resets > the fields. > - With the backup_label file gone, we lose some information in the > backups themselves, which is not good. Instead, you have suggested an > approach where this data is added to the backup manifest, meaning that > no information would be lost, particularly useful for self-contained > backups. The fields planned to be added to the backup manifest are: > -- The start and end time of the backup, the end timestamp being > useful to know when stop time can be used for PITR. > -- The backup label. > I've agreed that it may be the best thing to do at this end to not > lose any data related to the removal of the backup_label file. > > On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote: > > On 11/15/23 20:03, Michael Paquier wrote: > >> As the label is only an informational field, the parsing added to > >> pg_verifybackup is not really needed because it is used nowhere in the > >> validation process, so keeping the logic simpler would be the way to > >> go IMO. This is contrary to the WAL range for example, where start > >> and end LSNs are used for validation with a pg_waldump command. 
> >> Robert, any comments about the addition of the label in the manifest? > > > > I'm sure Robert will comment on this when he gets the time, but for now I > > have backed off on passing the new info to pg_verifybackup and added > > start/stop time. > > FWIW, I'm OK with the bits for the backup manifest as presented. So > if there are no remarks and/or no objections, I'd like to apply it but > let give some room to others to comment on that as there's been a gap > in the emails exchanged on pgsql-hackers. I hope that the summary > I've posted above covers everything. So let's see about doing > something around the middle of next week. With Thanksgiving in the > US, a lot of folks will not have the time to monitor what's happening > on this thread. > > + The end time for the backup. This is when the backup was stopped in > + <productname>PostgreSQL</productname> and represents the earliest time > + that can be used for time-based Point-In-Time Recovery. > > This one is actually a very good point. We'd lost this capacity with > the backup_label file gone without the end timestamps in the control > file. > > > New patches attached based on b218fbb7. > > I've noticed on the other thread the remark about being less > aggressive with the fields related to recovery in the control file, so > I assume that this patch should leave the fields be after the end of > recovery from the start and only rely on backupRecoveryRequired to > decide if the recovery should use the fields or not: > https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@=pgmasters.net > > + ControlFile->backupCheckPoint = InvalidXLogRecPtr; > ControlFile->backupStartPoint = InvalidXLogRecPtr; > + ControlFile->backupStartPointTLI = 0; > ControlFile->backupEndPoint = InvalidXLogRecPtr; > + ControlFile->backupFromStandby = false; > ControlFile->backupEndRequired = false; > > Still, I get the temptation of being consistent with the current style > on HEAD to reset everything, as well.. 
CFBot shows that the patch does not apply anymore as in [1]: === Applying patches on top of PostgreSQL commit ID 7014c9a4bba2d1b67d60687afb5b2091c1d07f73 === === applying patch ./recovery-in-pgcontrol-v7-0001-add-info-to-manifest.patch patching file doc/src/sgml/backup-manifest.sgml patching file src/backend/backup/backup_manifest.c patching file src/backend/backup/basebackup.c Hunk #1 succeeded at 238 (offset 13 lines). Hunk #2 succeeded at 258 (offset 13 lines). Hunk #3 succeeded at 399 (offset 17 lines). Hunk #4 succeeded at 652 (offset 17 lines). can't find file to patch at input line 219 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c |index bf0227c668..408af88e58 100644 |--- a/src/bin/pg_verifybackup/parse_manifest.c |+++ b/src/bin/pg_verifybackup/parse_manifest.c -------------------------- No file to patch. Skipping patch. 9 out of 9 hunks ignored patching file src/include/backup/backup_manifest.h Please post an updated version for the same. [1] - http://cfbot.cputube.org/patch_46_3511.log Regards, Vignesh
On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote: > Please post an updated version for the same. > > [1] - http://cfbot.cputube.org/patch_46_3511.log With the recent introduction of incremental backups that depend on backup_label and the rather negative feedback received, I think that it would be better to return this entry as RwF for now. What do you think? -- Michael
Attachment
On 1/28/24 19:11, Michael Paquier wrote: > On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote: >> Please post an updated version for the same. >> >> [1] - http://cfbot.cputube.org/patch_46_3511.log > > With the recent introduction of incremental backups that depend on > backup_label and the rather negative feedback received, I think that > it would be better to return this entry as RwF for now. What do you > think? I've been thinking it makes little sense to update the patch. It would be a lot of work with all the new changes for incremental backup and since Andres and Robert appear to be very against the idea, I doubt it would be worth the effort. I have withdrawn the patch. Regards, -David
On 1/29/24 12:28, David Steele wrote: > On 1/28/24 19:11, Michael Paquier wrote: >> On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote: >>> Please post an updated version for the same. >>> >>> [1] - http://cfbot.cputube.org/patch_46_3511.log >> >> With the recent introduction of incremental backups that depend on >> backup_label and the rather negative feedback received, I think that >> it would be better to return this entry as RwF for now. What do you >> think? > > I've been thinking it makes little sense to update the patch. It would > be a lot of work with all the new changes for incremental backup and > since Andres and Robert appear to be very against the idea, I doubt it > would be worth the effort. I've had a new idea which may revive this patch. The basic idea is to keep backup_label but also return a copy of pg_control from pg_stop_backup(). This copy of pg_control would be safe from tears and have a backupLabelRequired field set (as Andres suggested) so recovery cannot proceed without the backup label. So, everything will continue to work as it does now. But, backup software can be enhanced to write the improved pg_control that is guaranteed not to be torn and has protection against a missing backup label. Of course, pg_basebackup will write the new backupLabelRequired field into pg_control, but this way third party software can also gain advantages from the new field. Thoughts? Regards, -David
Hi, On Sunday, March 10th, 2024 at 4:47 AM, David Steele wrote: > I've had a new idea which may revive this patch. The basic idea is to > keep backup_label but also return a copy of pg_control from > pg_stop_backup(). This copy of pg_control would be safe from tears and > have a backupLabelRequired field set (as Andres suggested) so recovery > cannot proceed without the backup label. > > So, everything will continue to work as it does now. But, backup > software can be enhanced to write the improved pg_control that is > guaranteed not to be torn and has protection against a missing backup label. > > Of course, pg_basebackup will write the new backupLabelRequired field > into pg_control, but this way third party software can also gain > advantages from the new field. Bump on this idea. Given the discussion in [1], even if it obviously makes sense to improve the in core backup capabilities, the more we goin that direction, the more we'll rely on outside orchestration. So IMHO it also worth worrying about given more leverage to such orchestration tools. In that sense, I really like the ideato extend the backup functions. More thoughts? Thanks all, Kind Regards, -- Stefan FERCOT Data Egret (https://dataegret.com) [1] https://www.postgresql.org/message-id/lwXoqQdOT9Nw1tJIx_h7WuqMKrB1YMePQY99RFTZ87H7V52mgUJaSlw2WRbcOgKNUurF1yJqX3nqtZi4hJhtd3e_XlmLsLvnEtGXY-fZPoA%3D%40protonmail.com
On 4/16/24 18:55, Stefan Fercot wrote: > Hi, > > On Sunday, March 10th, 2024 at 4:47 AM, David Steele wrote: >> I've had a new idea which may revive this patch. The basic idea is to >> keep backup_label but also return a copy of pg_control from >> pg_stop_backup(). This copy of pg_control would be safe from tears and >> have a backupLabelRequired field set (as Andres suggested) so recovery >> cannot proceed without the backup label. >> >> So, everything will continue to work as it does now. But, backup >> software can be enhanced to write the improved pg_control that is >> guaranteed not to be torn and has protection against a missing backup label. >> >> Of course, pg_basebackup will write the new backupLabelRequired field >> into pg_control, but this way third party software can also gain >> advantages from the new field. > > Bump on this idea. > > Given the discussion in [1], even if it obviously makes sense to improve the in core backup capabilities, the more we goin that direction, the more we'll rely on outside orchestration. > So IMHO it also worth worrying about given more leverage to such orchestration tools. In that sense, I really like theidea to extend the backup functions. I have implemented this idea and created a new thread [1] for it. Hopefully it will address the concerns expressed in this thread. Regards, -David [1] https://www.postgresql.org/message-id/e2636c5d-c031-43c9-a5d6-5e5c7e4c5514%40pgmasters.net