Thread: Incremental backup from a streaming replication standby
I played around with incremental backup yesterday and tried $subject.

The WAL summarizer is running on the standby server, but when I try to take an incremental backup, I get an error that I understand to mean that WAL summarizing hasn't caught up yet.

I am not sure if that is working as designed, but if it is, I think it should be documented.

Yours,
Laurenz Albe
On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> I played around with incremental backup yesterday and tried $subject
>
> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.
>
> I am not sure if that is working as designed, but if it is, I think it
> should be documented.

I played with this some more. Here is the exact error message:

ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190

By trial and error I found that when I run a CHECKPOINT on the primary, taking an incremental backup on the standby works.

I couldn't fathom the cause of that, but I think that that should either be addressed or documented before v17 comes out.

Yours,
Laurenz Albe
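For readers comparing the two positions in that error message: a PostgreSQL LSN of the form `X/Y` is a 64-bit WAL position written as two hexadecimal halves, so the two values can be compared numerically. A minimal sketch (the parsing helper here is illustrative, not part of PostgreSQL):

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN string like '0/1967C260' into a 64-bit integer:
    high 32 bits before the slash, low 32 bits after it."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

# The values from the error message above:
manifest_wal_end = parse_lsn("0/1967C260")  # WAL the manifest says must be replayed
backup_start     = parse_lsn("0/1967C190")  # where the new incremental backup starts

# The required WAL extends past the start of the new backup, hence the ERROR.
print(manifest_wal_end > backup_start)  # True
```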
On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> > I played around with incremental backup yesterday and tried $subject
> >
> > The WAL summarizer is running on the standby server, but when I try
> > to take an incremental backup, I get an error that I understand to mean
> > that WAL summarizing hasn't caught up yet.
> >
> > I am not sure if that is working as designed, but if it is, I think it
> > should be documented.
>
> I played with this some more. Here is the exact error message:
>
> ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>
> By trial and error I found that when I run a CHECKPOINT on the primary,
> taking an incremental backup on the standby works.
>
> I couldn't fathom the cause of that, but I think that that should either
> be addressed or documented before v17 comes out.

I had a feeling this was going to be confusing. I'm not sure what to do about it, but I'm open to suggestions.

Suppose you take a full backup F; replay of that backup will begin with a checkpoint CF. Then you try to take an incremental backup I; replay will begin from a checkpoint CI. For the incremental backup to be valid, it must include all blocks modified after CF and before CI. But when the backup is taken on a standby, no new checkpoint is possible. Hence, CI will be the most recent restartpoint on the standby that has occurred before the backup starts. So, if F is taken on the primary and then I is immediately taken on the standby without the standby having done a new restartpoint, or if both F and I are taken on the standby and no restartpoint intervenes, then CF=CI. In that scenario, an incremental backup is pretty much pointless: every single incremental file would contain 0 blocks. You might as well just use the backup you already have, unless one of the non-relation files has changed.
So, except in that unusual corner case, the fact that the backup fails isn't really costing you anything. In fact, there's a decent chance that it's saving you from taking a completely useless backup. On the primary, this doesn't occur, because there, each new backup triggers a new checkpoint, so you always have CI>CF.

The error message is definitely confusing. The reason I'm not sure how to do better is that there is a large class of errors that a user could make that would trigger an error of this general type. I'm guessing that attempting a standby backup with CF=CI will turn out to be the most common one, but I don't think it'll be the only one that ever comes up. The code in PrepareForIncrementalBackup() focuses on what has gone wrong on a technical level rather than on what you probably did to create that situation. Indeed, the server doesn't really know what you did to create that situation.

You could trigger the same error by taking a full backup on the primary and then trying to take an incremental based on that full backup on a time-delayed standby (or a lagging standby) whose replay position was behind the primary, i.e. CI<CF. More perversely, you could trigger the error by spinning up a standby, promoting it, taking a full backup, destroying the standby, removing the timeline history file from the archive, spinning up a new standby, promoting onto the same timeline ID as the previous one, and then trying to take an incremental backup relative to the full backup. This might actually succeed, if you take the incremental backup at a later LSN than the previous full backup, but, as you may guess, terrible things will happen to you if you try to use such a backup. (I hope you will agree that this would be a self-inflicted injury; I can't see any way of detecting such cases.) If the incremental backup LSN is earlier than the previous full backup LSN, this error will trigger.

So, given all the above, what can we do here?
One option might be to add an errhint() to the message. I had trouble thinking of something that was compact enough to be reasonable to include and yet reasonably accurate and useful, but maybe we can brainstorm and figure something out.

Another option might be to add more to the documentation, but it's all so complicated that I'm not sure what to write. It feels hard to make something that is brief enough to be worth including, accurate enough to help more than it hurts, and understandable enough that people who run into this will be able to make use of it.

I think I'm a little too close to this to really know what the best thing to do is, so I'm happy to hear suggestions from you and others.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
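The CF/CI condition described above can be reduced to a few lines. The following toy model (illustrative names only, not the server's actual check in PrepareForIncrementalBackup()) captures the three scenarios from the discussion:

```python
def incremental_backup_allowed(cf_lsn: int, ci_lsn: int) -> bool:
    """Toy model: an incremental backup is only valid if CI, the checkpoint
    from which its replay would begin, is strictly later than CF, the
    replay-start checkpoint of the backup it depends on."""
    return ci_lsn > cf_lsn

# On the primary, starting a backup forces a new checkpoint, so CI > CF always.
assert incremental_backup_allowed(cf_lsn=100, ci_lsn=150)

# On a standby with no intervening restartpoint, CI == CF: the backup is
# rejected (and every incremental file would have contained 0 blocks anyway).
assert not incremental_backup_allowed(cf_lsn=100, ci_lsn=100)

# On a lagging or time-delayed standby, replay may not even have reached CF.
assert not incremental_backup_allowed(cf_lsn=100, ci_lsn=80)
```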
On 7/19/24 21:52, Robert Haas wrote:
> On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
>>> I played around with incremental backup yesterday and tried $subject
>>>
>>> The WAL summarizer is running on the standby server, but when I try
>>> to take an incremental backup, I get an error that I understand to mean
>>> that WAL summarizing hasn't caught up yet.
>>>
>>> I am not sure if that is working as designed, but if it is, I think it
>>> should be documented.
>>
>> I played with this some more. Here is the exact error message:
>>
>> ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>>
>> By trial and error I found that when I run a CHECKPOINT on the primary,
>> taking an incremental backup on the standby works.
>>
>> I couldn't fathom the cause of that, but I think that that should either
>> be addressed or documented before v17 comes out.
>
> I had a feeling this was going to be confusing. I'm not sure what to
> do about it, but I'm open to suggestions.
>
> Suppose you take a full backup F; replay of that backup will begin
> with a checkpoint CF. Then you try to take an incremental backup I;
> replay will begin from a checkpoint CI. For the incremental backup to
> be valid, it must include all blocks modified after CF and before CI.
> But when the backup is taken on a standby, no new checkpoint is
> possible. Hence, CI will be the most recent restartpoint on the
> standby that has occurred before the backup starts. So, if F is taken
> on the primary and then I is immediately taken on the standby without
> the standby having done a new restartpoint, or if both F and I are
> taken on the standby and no restartpoint intervenes, then CF=CI. In
> that scenario, an incremental backup is pretty much pointless: every
> single incremental file would contain 0 blocks.
> You might as well just
> use the backup you already have, unless one of the non-relation files
> has changed. So, except in that unusual corner case, the fact that the
> backup fails isn't really costing you anything. In fact, there's a
> decent chance that it's saving you from taking a completely useless
> backup.

<snip>

> I think I'm a little too close to this to really know what the best
> thing to do is, so I'm happy to hear suggestions from you and others.

I think it would be enough just to add a hint such as:

HINT: this is possible when making a standby backup with little or no activity.

My guess is in production environments this will be uncommon. For example, over the years we (pgBackRest) have gotten numerous bug reports that time-targeted PITR does not work. In every case we found that the user was just testing procedures and the database had no activity between backups -- therefore recovery had no commit timestamps to use to end recovery. Test environments sometimes produce weird results.

Having said that, I think it would be better if it worked even if it does produce an empty backup. An empty backup wastes some disk space but if it produces less friction and saves an admin having to intervene then it is probably worth it. I don't immediately see how to do that in a reliable way, though, and in any case it seems like something to consider for PG18.

Regards,
-David
On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:
> I think it would be enough just to add a hint such as:
>
> HINT: this is possible when making a standby backup with little or no
> activity.

That could work (with "this" capitalized).

> My guess is in production environments this will be uncommon.

I think so too, but when it does happen, confusion may be common.

> Having said that, I think it would be better if it worked even if it
> does produce an empty backup. An empty backup wastes some disk space but
> if it produces less friction and saves an admin having to intervene then
> it is probably worth it. I don't immediately see how to do that in a
> reliable way, though, and in any case it seems like something to
> consider for PG18.

Yeah, I'm pretty reluctant to weaken the sanity checks here, at least in the short term. Note that what the check is actually complaining about is that the previous backup thinks that the WAL it needs to replay to reach consistency ends after the start of the current backup. Even in this scenario, I'm not positive that everything would be OK if we let the backup proceed, and it's easy to think of scenarios where it definitely isn't. Plus, it's not quite clear how to distinguish the cases where it's OK from the cases where it isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2024-07-19 at 12:59 -0400, Robert Haas wrote:

Thanks for looking at this.

> On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:
> > I think it would be enough just to add a hint such as:
> >
> > HINT: this is possible when making a standby backup with little or no
> > activity.
>
> That could work (with "this" capitalized).
>
> > My guess is in production environments this will be uncommon.
>
> I think so too, but when it does happen, confusion may be common.

I guess this will most likely happen during tests like the one I made.

I'd be alright with the hint, but I'd say "during making an *incremental* standby backup", because that's the only case where it can happen.

I think it would also be sufficient if we document that possibility. When I got the error, I looked at the documentation of incremental backup for any limitations with standby servers, but didn't find any. A remark in the documentation would have satisfied me.

Yours,
Laurenz
On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I'd be alright with the hint, but I'd say "during making an *incremental*
> standby backup", because that's the only case where it can happen.
>
> I think it would also be sufficient if we document that possibility.
> When I got the error, I looked at the documentation of incremental
> backup for any limitations with standby servers, but didn't find any.
> A remark in the documentation would have satisfied me.

Would you like to propose a patch adding a hint and/or adjusting the documentation? Or are you wanting me to do that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2024-07-19 at 16:03 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I'd be alright with the hint, but I'd say "during making an *incremental*
> > standby backup", because that's the only case where it can happen.
> >
> > I think it would also be sufficient if we document that possibility.
> > When I got the error, I looked at the documentation of incremental
> > backup for any limitations with standby servers, but didn't find any.
> > A remark in the documentation would have satisfied me.
>
> Would you like to propose a patch adding a hint and/or adjusting the
> documentation? Or are you wanting me to do that?

Here is a patch. I went for both the errhint and some documentation.

Yours,
Laurenz Albe
On Sat, Jun 29, 2024 at 07:01:04AM +0200, Laurenz Albe wrote:
> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.

Added an open item for this one.

--
Michael
On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Here is a patch.
> I went for both the errhint and some documentation.

Hmm, the hint doesn't end up using the word "standby" anywhere. That seems like it might not be optimal?

+    Like a base backup, you can take an incremental backup from a streaming
+    replication standby server. But since a backup of a standby server cannot
+    initiate a checkpoint, it is possible that an incremental backup taken
+    right after a base backup will fail with an error, since it would have
+    to start with the same checkpoint as the base backup and would therefore
+    be empty.

Hmm. I feel like I'm about to be super-nitpicky, but this seems imprecise to me in multiple ways. First, an incremental backup is a kind of base backup, or at least, it's something you take with pg_basebackup. Note that later in the paragraph, you use the term "base backup" to refer to what I have been calling the "prior" or "previous" backup or "the backup upon which it depends," but that earlier backup could be either a full or an incremental backup. Second, the standby need not be using streaming replication, even though it probably will be in practice. Third, the failing incremental backup doesn't necessarily have to be attempted immediately after the previous one - the intervening time could be quite long on an idle system. Fourth, it makes it sound like the backup being empty is a reason for it to fail, which is debatable; I think we should try to cast this more as an implementation restriction.

How about something like this:

An incremental backup is only possible if replay would begin from a later checkpoint than for the previous backup upon which it depends. On the primary, this condition is always satisfied, because each backup triggers a new checkpoint. On a standby, replay begins from the most recent restartpoint.
As a result, an incremental backup may fail on a standby if there has been very little activity since the previous backup. Attempting to take an incremental backup on a standby that is lagging behind the primary (or some other standby), using a prior backup taken at a later WAL position, may fail for the same reason.

I'm not saying that's perfect, but let me know your thoughts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:
> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup that is lagging
> behind the primary (or some other standby) using a prior backup taken
> at a later WAL position may fail for the same reason.

Before I write a v2, a small question for clarification: I believe I remember that during my experiments, I ran CHECKPOINT on the standby server between the first backup and the incremental backup, and that was not enough to make it work. I had to run a CHECKPOINT on the primary server.

Does CHECKPOINT on the standby not trigger a restartpoint, or do I simply misremember?

Yours,
Laurenz Albe
On Mon, Jul 22, 2024 at 1:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Before I write a v2, a small question for clarification:
> I believe I remember that during my experiments, I ran CHECKPOINT
> on the standby server between the first backup and the incremental
> backup, and that was not enough to make it work. I had to run
> a CHECKPOINT on the primary server.
>
> Does CHECKPOINT on the standby not trigger a restartpoint, or do
> I simply misremember?

It's only possible for the standby to create a restartpoint at a write-ahead log position where the master created a checkpoint. With typical configuration, every or nearly every checkpoint on the primary will trigger a restartpoint on the standby, but for example if you set max_wal_size bigger and checkpoint_timeout longer on the standby than on the primary, then you might end up with only some of those checkpoints becoming restartpoints and others not.

Looking at the code in CreateRestartPoint(), it looks like what happens if you run CHECKPOINT is that it tries to turn the most-recently replayed checkpoint into a restartpoint if that wasn't done already; otherwise it just returns without doing anything. See the comment that begins with "If the last checkpoint record we've replayed is already our last".

-- 
Robert Haas
EDB: http://www.enterprisedb.com
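The behavior described for CreateRestartPoint() can be sketched as a toy model (illustrative function and variable names, not the actual C code), which also shows why a standby-side CHECKPOINT may be a no-op:

```python
def run_checkpoint_on_standby(last_replayed_checkpoint_lsn: int,
                              last_restartpoint_lsn: int) -> int:
    """Toy model: a standby can only establish a restartpoint at the position
    of a checkpoint record already replayed from the primary. If the most
    recently replayed checkpoint is already a restartpoint, CHECKPOINT
    returns without doing anything."""
    if last_replayed_checkpoint_lsn > last_restartpoint_lsn:
        return last_replayed_checkpoint_lsn  # new restartpoint created here
    return last_restartpoint_lsn             # nothing to do

# A checkpoint record replayed at LSN 200 hasn't yet become a restartpoint:
# CHECKPOINT on the standby turns it into one.
assert run_checkpoint_on_standby(200, 100) == 200

# No new checkpoint record replayed since the last restartpoint: CHECKPOINT
# is a no-op, which is why a standby-side CHECKPOINT alone didn't help and
# a CHECKPOINT on the primary (producing a new record to replay) did.
assert run_checkpoint_on_standby(200, 200) == 200
```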
On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > Here is a patch.
> > I went for both the errhint and some documentation.
>
> Hmm, the hint doesn't end up using the word "standby" anywhere. That
> seems like it might not be optimal?

I guessed that the user was aware that she is taking the backup on a standby server... Anyway, I reworded the hint to

This can happen for incremental backups on a standby if there was little activity since the previous backup.

> Hmm. I feel like I'm about to be super-nitpicky, but this seems
> imprecise to me in multiple ways.

On the contrary, your comments and explanations are valuable.

> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup that is lagging
> behind the primary (or some other standby) using a prior backup taken
> at a later WAL position may fail for the same reason.
>
> I'm not saying that's perfect, but let me know your thoughts.

I tinkered with this some more, and the attached patch has

An incremental backup is only possible if replay would begin from a later checkpoint than the checkpoint that started the previous backup upon which it depends. If you take the incremental backup on the primary, this condition is always satisfied, because each backup triggers a new checkpoint. On a standby, replay begins from the most recent restartpoint.
Therefore, an incremental backup of a standby server can fail if there has been very little activity since the previous backup, since no new restartpoint might have been created.

Yours,
Laurenz Albe
On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> An incremental backup is only possible if replay would begin from a later
> checkpoint than the checkpoint that started the previous backup upon which
> it depends.

My concern here is that the previous backup might have been taken on a standby, and therefore it did not start with a checkpoint. For a standby backup, replay will begin from a checkpoint record, but that record may be quite a bit earlier in the WAL. For instance, imagine checkpoint_timeout is set to 30 minutes on the standby. When the backup is taken, the most recent restartpoint could be up to 30 minutes ago -- and it is the checkpoint record for that restartpoint from which replay will begin. I think that in my phrasing, it's always about the checkpoint from which replay would begin (which is always well-defined), not the checkpoint that started the backup (which is only logical on the primary).

> If you take the incremental backup on the primary, this
> condition is always satisfied, because each backup triggers a new
> checkpoint. On a standby, replay begins from the most recent restartpoint.
> Therefore, an incremental backup of a standby server can fail if there has
> been very little activity since the previous backup, since no new
> restartpoint might have been created.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
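The distinction drawn here, that a standby backup's replay starts at the latest restartpoint before the backup, not at the backup itself, can be sketched as a toy model (illustrative names, not PostgreSQL source):

```python
def replay_start_for_standby_backup(backup_start_lsn: int,
                                    restartpoint_lsns: list[int]) -> int:
    """Toy model: replay of a standby backup begins at the most recent
    restartpoint established at or before the backup's start, which may
    be well before the backup-start LSN."""
    candidates = [lsn for lsn in restartpoint_lsns if lsn <= backup_start_lsn]
    return max(candidates)

# With checkpoint_timeout = 30min on the standby, the latest restartpoint
# may be far behind: a backup starting at LSN 1000 replays from LSN 400.
assert replay_start_for_standby_backup(1000, [100, 400]) == 400
```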
On Wed, 2024-07-24 at 15:27 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > An incremental backup is only possible if replay would begin from a later
> > checkpoint than the checkpoint that started the previous backup upon which
> > it depends.
>
> My concern here is that the previous backup might have been taken on a
> standby, and therefore it did not start with a checkpoint. For a
> standby backup, replay will begin from a checkpoint record, but that
> record may be quite a bit earlier in the WAL. For instance, imagine
> checkpoint_timeout is set to 30 minutes on the standby. When the
> backup is taken, the most recent restartpoint could be up to 30
> minutes ago -- and it is the checkpoint record for that restartpoint
> from which replay will begin. I think that in my phrasing, it's always
> about the checkpoint from which replay would begin (which is always
> well-defined) not the checkpoint that started the backup (which is
> only logical on the primary).

I see. The attached patch uses your wording for the first sentence.

I left out the last sentence from your suggestion, because it sounded like it is likely to confuse the reader. I think you just wanted to say that there are other possible causes for an incremental backup to fail. I want to keep the text as simple as possible and focus on the case that I hit, because I expect that a lot of people who experiment with incremental backup or run tests could run into the same problem.

I don't think it will be a frequent occurrence during normal operation.

Yours,
Laurenz Albe
On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The attached patch uses your wording for the first sentence.
>
> I left out the last sentence from your suggestion, because it sounded
> like it is likely to confuse the reader. I think you just wanted to
> say that there are other possible causes for an incremental backup to
> fail. I want to keep the text as simple as possible and focus on the case
> that I hit, because I expect that a lot of people who experiment with
> incremental backup or run tests could run into the same problem.
>
> I don't think it will be a frequent occurrence during normal operation.

Committed this version to master and v17.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 2024-07-25 at 16:12 -0400, Robert Haas wrote:
> On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > The attached patch uses your wording for the first sentence.
> >
> > I left out the last sentence from your suggestion, because it sounded
> > like it is likely to confuse the reader. I think you just wanted to
> > say that there are other possible causes for an incremental backup to
> > fail. I want to keep the text as simple as possible and focus on the case
> > that I hit, because I expect that a lot of people who experiment with
> > incremental backup or run tests could run into the same problem.
> >
> > I don't think it will be a frequent occurrence during normal operation.
>
> Committed this version to master and v17.

Thanks for taking care of this.

Yours,
Laurenz Albe
On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > Committed this version to master and v17.
>
> Thanks for taking care of this.

Sure thing!

I knew it was going to confuse someone ... I just wasn't sure what to do about it. Now we've at least done something, which is hopefully superior to nothing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jul 26, 2024 at 4:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > > Committed this version to master and v17.
> >
> > Thanks for taking care of this.
>
> Sure thing!
>
> I knew it was going to confuse someone ... I just wasn't sure what to
> do about it. Now we've at least done something, which is hopefully
> superior to nothing.

Great! Should we mark the corresponding v17 open item as closed?

------
Regards,
Alexander Korotkov
Supabase
On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Great! Should we mark the corresponding v17 open item as closed?

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com