Thread: Incremental backup from a streaming replication standby

Incremental backup from a streaming replication standby

From: Laurenz Albe
I played around with incremental backup yesterday and tried $subject

The WAL summarizer is running on the standby server, but when I try
to take an incremental backup, I get an error that I understand to mean
that WAL summarizing hasn't caught up yet.
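
For reference, this is roughly the shape of my test (the ports and
paths here are made up for the sketch):

    # on the primary (port 5432), with summarize_wal = on everywhere:
    pg_basebackup -p 5432 -D /backups/full -c fast

    # then on the standby (port 5433), based on the full backup's manifest:
    pg_basebackup -p 5433 -D /backups/incr \
        --incremental=/backups/full/backup_manifest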

I am not sure if that is working as designed, but if it is, I think it
should be documented.

Yours,
Laurenz Albe



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> I played around with incremental backup yesterday and tried $subject
>
> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.
>
> I am not sure if that is working as designed, but if it is, I think it
> should be documented.

I played with this some more.  Here is the exact error message:

ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190

By trial and error I found that when I run a CHECKPOINT on the primary,
taking an incremental backup on the standby works.
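
In other words, this sequence succeeds (same made-up ports and paths
as in my sketch above):

    psql -p 5432 -c "CHECKPOINT"   # on the primary
    # ... give the standby a moment to replay the new checkpoint ...
    pg_basebackup -p 5433 -D /backups/incr \
        --incremental=/backups/full/backup_manifest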

I couldn't fathom the cause of that, but I think that that should either
be addressed or documented before v17 comes out.

Yours,
Laurenz Albe



On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> > I played around with incremental backup yesterday and tried $subject
> >
> > The WAL summarizer is running on the standby server, but when I try
> > to take an incremental backup, I get an error that I understand to mean
> > that WAL summarizing hasn't caught up yet.
> >
> > I am not sure if that is working as designed, but if it is, I think it
> > should be documented.
>
> I played with this some more.  Here is the exact error message:
>
> ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>
> By trial and error I found that when I run a CHECKPOINT on the primary,
> taking an incremental backup on the standby works.
>
> I couldn't fathom the cause of that, but I think that that should either
> be addressed or documented before v17 comes out.

I had a feeling this was going to be confusing. I'm not sure what to
do about it, but I'm open to suggestions.

Suppose you take a full backup F; replay of that backup will begin
with a checkpoint CF. Then you try to take an incremental backup I;
replay will begin from a checkpoint CI. For the incremental backup to
be valid, it must include all blocks modified after CF and before CI.
But when the backup is taken on a standby, no new checkpoint is
possible. Hence, CI will be the most recent restartpoint on the
standby that has occurred before the backup starts. So, if F is taken
on the primary and then I is immediately taken on the standby without
the standby having done a new restartpoint, or if both F and I are
taken on the standby and no restartpoint intervenes, then CF=CI. In
that scenario, an incremental backup is pretty much pointless: every
single incremental file would contain 0 blocks. You might as well just
use the backup you already have, unless one of the non-relation files
has changed. So, except in that unusual corner case, the fact that the
backup fails isn't really costing you anything. In fact, there's a
decent chance that it's saving you from taking a completely useless
backup.

On the primary, this doesn't occur, because there, each new backup
triggers a new checkpoint, so you always have CI>CF.
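
(If you want to observe this, one rough check -- the data directory
path is made up -- is the redo location in the standby's control file,
which only advances when a restartpoint completes:

    pg_controldata /path/to/standby/data | grep "REDO location"

If that value hasn't moved past the checkpoint from which replay of
the previous backup would begin, you're in the CF=CI situation
described above.)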

The error message is definitely confusing. The reason I'm not sure how
to do better is that there is a large class of errors that a user
could make that would trigger an error of this general type. I'm
guessing that attempting a standby backup with CF=CI will turn out to
be the most common one, but I don't think it'll be the only one that
ever comes up. The code in PrepareForIncrementalBackup() focuses on
what has gone wrong on a technical level rather than on what you
probably did to create that situation. Indeed, the server doesn't
really know what you did to create that situation. You could trigger
the same error by taking a full backup on the primary and then trying
to take an incremental backup based on that full backup on a time-delayed
standby (or a lagging standby) whose replay position was behind the
primary, i.e. CI<CF.

More perversely, you could trigger the error by spinning up a standby,
promoting it, taking a full backup, destroying the standby, removing
the timeline history file from the archive, spinning up a new standby,
promoting it onto the same timeline ID as the previous one, and then
trying to take an incremental backup relative to the full backup. This
might actually succeed, if you take the incremental backup at a later
LSN than the previous full backup, but, as you may guess, terrible
things will happen to you if you try to use such a backup. (I hope you
will agree that this would be a self-inflicted injury; I can't see any
way of detecting such cases.) If the incremental backup LSN is earlier
than the previous full backup LSN, this error will trigger.

So, given all the above, what can we do here?

One option might be to add an errhint() to the message. I had trouble
thinking of something that was compact enough to be reasonable to
include and yet reasonably accurate and useful, but maybe we can
brainstorm and figure something out. Another option might be to add
more to the documentation, but it's all so complicated that I'm not
sure what to write. It feels hard to make something that is brief
enough to be worth including, accurate enough to help more than it
hurts, and understandable enough that people who run into this will be
able to make use of it.

I think I'm a little too close to this to really know what the best
thing to do is, so I'm happy to hear suggestions from you and others.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: David Steele
On 7/19/24 21:52, Robert Haas wrote:
> On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
>>> I played around with incremental backup yesterday and tried $subject
>>>
>>> The WAL summarizer is running on the standby server, but when I try
>>> to take an incremental backup, I get an error that I understand to mean
>>> that WAL summarizing hasn't caught up yet.
>>>
>>> I am not sure if that is working as designed, but if it is, I think it
>>> should be documented.
>>
>> I played with this some more.  Here is the exact error message:
>>
>> ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>>
>> By trial and error I found that when I run a CHECKPOINT on the primary,
>> taking an incremental backup on the standby works.
>>
>> I couldn't fathom the cause of that, but I think that that should either
>> be addressed or documented before v17 comes out.
> 
> I had a feeling this was going to be confusing. I'm not sure what to
> do about it, but I'm open to suggestions.
> 
> Suppose you take a full backup F; replay of that backup will begin
> with a checkpoint CF. Then you try to take an incremental backup I;
> replay will begin from a checkpoint CI. For the incremental backup to
> be valid, it must include all blocks modified after CF and before CI.
> But when the backup is taken on a standby, no new checkpoint is
> possible. Hence, CI will be the most recent restartpoint on the
> standby that has occurred before the backup starts. So, if F is taken
> on the primary and then I is immediately taken on the standby without
> the standby having done a new restartpoint, or if both F and I are
> taken on the standby and no restartpoint intervenes, then CF=CI. In
> that scenario, an incremental backup is pretty much pointless: every
> single incremental file would contain 0 blocks. You might as well just
> use the backup you already have, unless one of the non-relation files
> has changed. So, except in that unusual corner case, the fact that the
> backup fails isn't really costing you anything. In fact, there's a
> decent chance that it's saving you from taking a completely useless
> backup.

<snip>

> I think I'm a little too close to this to really know what the best
> thing to do is, so I'm happy to hear suggestions from you and others.

I think it would be enough just to add a hint such as:

HINT: this is possible when making a standby backup with little or no 
activity.

My guess is in production environments this will be uncommon.

For example, over the years we (pgBackRest) have gotten numerous bug 
reports that time-targeted PITR does not work. In every case we found 
that the user was just testing procedures and the database had no 
activity between backups -- therefore recovery had no commit timestamps 
to use to end recovery. Test environments sometimes produce weird results.
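
(The usual workaround in such test environments is to force at least
one timestamped commit between the backup and the recovery target,
e.g. something like this -- the table name is invented:

    psql -c "CREATE TABLE IF NOT EXISTS canary (ts timestamptz)"
    psql -c "INSERT INTO canary VALUES (now())"

so that recovery has a commit timestamp to compare against the target
time.)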

Having said that, I think it would be better if it worked even if it 
does produce an empty backup. An empty backup wastes some disk space but 
if it produces less friction and saves an admin having to intervene then 
it is probably worth it. I don't immediately see how to do that in a 
reliable way, though, and in any case it seems like something to 
consider for PG18.

Regards,
-David



On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:
> I think it would be enough just to add a hint such as:
>
> HINT: this is possible when making a standby backup with little or no
> activity.

That could work (with "this" capitalized).

> My guess is in production environments this will be uncommon.

I think so too, but when it does happen, confusion may be common.

> Having said that, I think it would be better if it worked even if it
> does produce an empty backup. An empty backup wastes some disk space but
> if it produces less friction and saves an admin having to intervene then
> it is probably worth it. I don't immediately see how to do that in a
> reliable way, though, and in any case it seems like something to
> consider for PG18.

Yeah, I'm pretty reluctant to weaken the sanity checks here, at least
in the short term. Note that what the check is actually complaining
about is that the previous backup thinks that the WAL it needs to
replay to reach consistency ends after the start of the current
backup. Even in this scenario, I'm not positive that everything would
be OK if we let the backup proceed, and it's easy to think of
scenarios where it definitely isn't. Plus, it's not quite clear how to
distinguish the cases where it's OK from the cases where it isn't.
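
To see the two sides of that comparison, you can look at the previous
backup's manifest, which records the WAL it needs under "WAL-Ranges"
(the path is made up; the LSN is the one from Laurenz's report):

    jq '."WAL-Ranges"' /backups/full/backup_manifest
    # [ { "Timeline": 1, "Start-LSN": "...", "End-LSN": "0/1967C260" } ]

The error triggers when the new backup's start LSN -- 0/1967C190 in
the report upthread -- is earlier than that final End-LSN.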

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Fri, 2024-07-19 at 12:59 -0400, Robert Haas wrote:
Thanks for looking at this.

> On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:
> > I think it would be enough just to add a hint such as:
> >
> > HINT: this is possible when making a standby backup with little or no
> > activity.
>
> That could work (with "this" capitalized).
>
> > My guess is in production environments this will be uncommon.
>
> I think so too, but when it does happen, confusion may be common.

I guess this will most likely happen during tests like the one I ran.

I'd be alright with the hint, but I'd say "when making an *incremental*
standby backup", because that's the only case where it can happen.

I think it would also be sufficient if we document that possibility.
When I got the error, I looked at the documentation of incremental
backup for any limitations with standby servers, but didn't find any.
A remark in the documentation would have satisfied me.

Yours,
Laurenz



On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I'd be alright with the hint, but I'd say "when making an *incremental*
> standby backup", because that's the only case where it can happen.
>
> I think it would also be sufficient if we document that possibility.
> When I got the error, I looked at the documentation of incremental
> backup for any limitations with standby servers, but didn't find any.
> A remark in the documentation would have satisfied me.

Would you like to propose a patch adding a hint and/or adjusting the
documentation? Or are you wanting me to do that?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Fri, 2024-07-19 at 16:03 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I'd be alright with the hint, but I'd say "when making an *incremental*
> > standby backup", because that's the only case where it can happen.
> >
> > I think it would also be sufficient if we document that possibility.
> > When I got the error, I looked at the documentation of incremental
> > backup for any limitations with standby servers, but didn't find any.
> > A remark in the documentation would have satisfied me.
>
> Would you like to propose a patch adding a hint and/or adjusting the
> documentation? Or are you wanting me to do that?

Here is a patch.
I went for both the errhint and some documentation.

Yours,
Laurenz Albe


Re: Incremental backup from a streaming replication standby

From: Michael Paquier
On Sat, Jun 29, 2024 at 07:01:04AM +0200, Laurenz Albe wrote:
> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.

Added an open item for this one.
--
Michael

On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Here is a patch.
> I went for both the errhint and some documentation.

Hmm, the hint doesn't end up using the word "standby" anywhere. That
seems like it might not be optimal?

+    Like a base backup, you can take an incremental backup from a streaming
+    replication standby server.  But since a backup of a standby server cannot
+    initiate a checkpoint, it is possible that an incremental backup taken
+    right after a base backup will fail with an error, since it would have
+    to start with the same checkpoint as the base backup and would therefore
+    be empty.

Hmm. I feel like I'm about to be super-nitpicky, but this seems
imprecise to me in multiple ways. First, an incremental backup is a
kind of base backup, or at least, it's something you take with
pg_basebackup. Note that later in the paragraph, you use the term
"base backup" to refer to what I have been calling the "prior" or
"previous" backup or "the backup upon which it depends," but that
earlier backup could be either a full or an incremental backup.
Second, the standby need not be using streaming replication, even
though it probably will be in practice. Third, the failing incremental
backup doesn't necessarily have to be attempted immediately after the
previous one - the intervening time could be quite long on an idle
system. Fourth, it makes it sound like the backup being empty is a
reason for it to fail, which is debatable; I think we should try to
cast this more as an implementation restriction.

How about something like this:

An incremental backup is only possible if replay would begin from a
later checkpoint than for the previous backup upon which it depends.
On the primary, this condition is always satisfied, because each
backup triggers a new checkpoint. On a standby, replay begins from the
most recent restartpoint. As a result, an incremental backup may fail
on a standby if there has been very little activity since the previous
backup. Attempting to take an incremental backup on a standby that
is lagging behind the primary (or some other standby), using a prior
backup taken at a later WAL position, may fail for the same reason.

I'm not saying that's perfect, but let me know your thoughts.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:
> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup on a standby that
> is lagging behind the primary (or some other standby), using a prior
> backup taken at a later WAL position, may fail for the same reason.

Before I write a v2, a small question for clarification:
I believe I remember that during my experiments, I ran CHECKPOINT
on the standby server between the first backup and the incremental
backup, and that was not enough to make it work.  I had to run
a CHECKPOINT on the primary server.

Does CHECKPOINT on the standby not trigger a restartpoint, or do
I simply misremember?

Yours,
Laurenz Albe



On Mon, Jul 22, 2024 at 1:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Before I write a v2, a small question for clarification:
> I believe I remember that during my experiments, I ran CHECKPOINT
> on the standby server between the first backup and the incremental
> backup, and that was not enough to make it work.  I had to run
> a CHECKPOINT on the primary server.
>
> Does CHECKPOINT on the standby not trigger a restartpoint, or do
> I simply misremember?

It's only possible for the standby to create a restartpoint at a
write-ahead log position where the primary created a checkpoint. With
a typical configuration, every or nearly every checkpoint on the
primary will trigger a restartpoint on the standby, but if, for
example, you set max_wal_size bigger and checkpoint_timeout longer on
the standby than on the primary, then only some of those checkpoints
will become restartpoints and others won't.

Looking at the code in CreateRestartPoint(), it looks like what
happens if you run CHECKPOINT is that it tries to turn the
most-recently replayed checkpoint into a restartpoint if that wasn't
done already; otherwise it just returns without doing anything. See
the comment that begins with "If the last checkpoint record we've
replayed is already our last".
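
So, on an otherwise idle system, the reliable sequence is roughly this
(a sketch; ports and paths are made up):

    psql -p 5432 -c "CHECKPOINT"   # primary: write a new checkpoint record
    # ... wait for the standby to replay it ...
    psql -p 5433 -c "CHECKPOINT"   # standby: materialize the restartpoint
    pg_basebackup -p 5433 -D /backups/incr \
        --incremental=/backups/full/backup_manifest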

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > Here is a patch.
> > I went for both the errhint and some documentation.
>
> Hmm, the hint doesn't end up using the word "standby" anywhere. That
> seems like it might not be optimal?

I guessed that the user was aware that she was taking the backup on
a standby server...

Anyway, I reworded the hint to

  This can happen for incremental backups on a standby if there was
  little activity since the previous backup.

> Hmm. I feel like I'm about to be super-nitpicky, but this seems
> imprecise to me in multiple ways.

On the contrary, your comments and explanations are valuable.

> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup on a standby that
> is lagging behind the primary (or some other standby), using a prior
> backup taken at a later WAL position, may fail for the same reason.
>
> I'm not saying that's perfect, but let me know your thoughts.

I tinkered with this some more, and the attached patch has

  An incremental backup is only possible if replay would begin from a later
  checkpoint than the checkpoint that started the previous backup upon which
  it depends.  If you take the incremental backup on the primary, this
  condition is always satisfied, because each backup triggers a new
  checkpoint.  On a standby, replay begins from the most recent restartpoint.
  Therefore, an incremental backup of a standby server can fail if there has
  been very little activity since the previous backup, since no new
  restartpoint might have been created.

Yours,
Laurenz Albe

On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>   An incremental backup is only possible if replay would begin from a later
>   checkpoint than the checkpoint that started the previous backup upon which
>   it depends.

My concern here is that the previous backup might have been taken on a
standby, and therefore it did not start with a checkpoint. For a
standby backup, replay will begin from a checkpoint record, but that
record may be quite a bit earlier in the WAL. For instance, imagine
checkpoint_timeout is set to 30 minutes on the standby. When the
backup is taken, the most recent restartpoint could be up to 30
minutes ago -- and it is the checkpoint record for that restartpoint
from which replay will begin. I think that in my phrasing, it's always
about the checkpoint from which replay would begin (which is always
well-defined), not the checkpoint that started the backup (which is
only meaningful on the primary).

>  If you take the incremental backup on the primary, this
>   condition is always satisfied, because each backup triggers a new
>   checkpoint.  On a standby, replay begins from the most recent restartpoint.
>   Therefore, an incremental backup of a standby server can fail if there has
>   been very little activity since the previous backup, since no new
>   restartpoint might have been created.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Wed, 2024-07-24 at 15:27 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >    An incremental backup is only possible if replay would begin from a later
> >    checkpoint than the checkpoint that started the previous backup upon which
> >    it depends.
>
> My concern here is that the previous backup might have been taken on a
> standby, and therefore it did not start with a checkpoint. For a
> standby backup, replay will begin from a checkpoint record, but that
> record may be quite a bit earlier in the WAL. For instance, imagine
> checkpoint_timeout is set to 30 minutes on the standby. When the
> backup is taken, the most recent restartpoint could be up to 30
> minutes ago -- and it is the checkpoint record for that restartpoint
> from which replay will begin. I think that in my phrasing, it's always
> about the checkpoint from which replay would begin (which is always
> well-defined) not the checkpoint that started the backup (which is
> only logical on the primary).

I see.

The attached patch uses your wording for the first sentence.

I left out the last sentence from your suggestion, because it sounded
like it is likely to confuse the reader.  I think you just wanted to
say that there are other possible causes for an incremental backup to
fail.  I want to keep the text as simple as possible and focus on the case
that I hit, because I expect that a lot of people who experiment with
incremental backup or run tests could run into the same problem.

I don't think it will be a frequent occurrence during normal operation.

Yours,
Laurenz Albe

On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The attached patch uses your wording for the first sentence.
>
> I left out the last sentence from your suggestion, because it sounded
> like it is likely to confuse the reader.  I think you just wanted to
> say that there are other possible causes for an incremental backup to
> fail.  I want to keep the text as simple as possible and focus on the case
> that I hit, because I expect that a lot of people who experiment with
> incremental backup or run tests could run into the same problem.
>
> I don't think it will be a frequent occurrence during normal operation.

Committed this version to master and v17.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Laurenz Albe
On Thu, 2024-07-25 at 16:12 -0400, Robert Haas wrote:
> On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > The attached patch uses your wording for the first sentence.
> >
> > I left out the last sentence from your suggestion, because it sounded
> > like it is likely to confuse the reader.  I think you just wanted to
> > say that there are other possible causes for an incremental backup to
> > fail.  I want to keep the text as simple as possible and focus on the case
> > that I hit, because I expect that a lot of people who experiment with
> > incremental backup or run tests could run into the same problem.
> >
> > I don't think it will be a frequent occurrence during normal operation.
>
> Committed this version to master and v17.

Thanks for taking care of this.

Yours,
Laurenz Albe



On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > Committed this version to master and v17.
>
> Thanks for taking care of this.

Sure thing!

I knew it was going to confuse someone ... I just wasn't sure what to
do about it. Now we've at least done something, which is hopefully
superior to nothing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Incremental backup from a streaming replication standby fails

From: Alexander Korotkov
On Fri, Jul 26, 2024 at 4:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > > Committed this version to master and v17.
> >
> > Thanks for taking care of this.
>
> Sure thing!
>
> I knew it was going to confuse someone ... I just wasn't sure what to
> do about it. Now we've at least done something, which is hopefully
> superior to nothing.

Great!  Should we mark the corresponding v17 open item as closed?

------
Regards,
Alexander Korotkov
Supabase



On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Great!  Should we mark the corresponding v17 open item as closed?

Done.

--
Robert Haas
EDB: http://www.enterprisedb.com