Thread: Detecting some cases of missing backup_label

Detecting some cases of missing backup_label

From

Andres Freund

Date:

30 November 2023, 20:56:05

Hi,

I recently mentioned to Robert (and also Heikki earlier), that I think I see a
way to detect an omitted backup_label in a relevant subset of the cases (it'd
apply to the pg_control as well, if we moved to that).  Robert encouraged me
to share the idea, even though it does not provide complete protection.


The subset I think we can address is the following:

a) An omitted backup_label would lead to corruption, i.e. without the
   backup_label we won't start recovery at the right position. Obviously it'd
   be better to also catch a wrong procedure when it'd not cause corruption -
   perhaps my idea can be extended to handle that, with a small bit of
   overhead.

b) The backup has been taken from a primary. Unfortunately that probably can't
   be addressed - but the vast majority of backups are taken from a primary,
   so I think it's still a worthwhile protection.


Here's my approach

1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
   primary, emitted just *after* the checkpoint completed

2) When replaying a base backup start record, we create a state file that
   includes the corresponding LSN in the filename

3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
   that time. Instead the state file is created during pg_backup_stop().

4) When replaying a XLOG_BACKUP_END record, we verif that the state file
   created by XLOG_BACKUP_START is present, and error out if not.  Backups
   that started before the redo LSN from backup_label are ignored
   (necessitates remembering that LSN, but we've been discussing that anyway).


Because the backup state file on the primary is only created during
pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
and pg_backup_stop() does *not* contain the corresponding "backup state
file". Because of this, an omitted backup_label is detected if recovery does
not start early enough - recovery won't encounter the XLOG_BACKUP_START record
and thus would not create the state file, leading to an error in 4).

It is not a problem that the primary does not create the state file before the
pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
sequence differs between normal processing and recovery, but I think that's
nothing a good comment couldn't explain.


I haven't worked out the details, but I think we might be able extend this to
catch errors even if there is no checkpoint during the base backup, by
emitting the WAL record *before* the RequestCheckpoint(), and creating the
corresponding state file during backup_label processing at the start of
recovery.  That'd probably make the logic for when we can remove the backup
state files a bit more complicated, but I think we could deal with that.


Comments? Swear words?

Greetings,

Andres Freund

Re: Detecting some cases of missing backup_label

From

Stephen Frost

Date:

05 December 2023, 15:54:58

Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> I recently mentioned to Robert (and also Heikki earlier), that I think I see a
> way to detect an omitted backup_label in a relevant subset of the cases (it'd
> apply to the pg_control as well, if we moved to that).  Robert encouraged me
> to share the idea, even though it does not provide complete protection.

That would certainly be nice.

> The subset I think we can address is the following:
>
> a) An omitted backup_label would lead to corruption, i.e. without the
>    backup_label we won't start recovery at the right position. Obviously it'd
>    be better to also catch a wrong procedure when it'd not cause corruption -
>    perhaps my idea can be extended to handle that, with a small bit of
>    overhead.
>
> b) The backup has been taken from a primary. Unfortunately that probably can't
>    be addressed - but the vast majority of backups are taken from a primary,
>    so I think it's still a worthwhile protection.

Agreed that this is a worthwhile set to try and address, even if we
can't address other cases.

> Here's my approach
>
> 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
>    primary, emitted just *after* the checkpoint completed
>
> 2) When replaying a base backup start record, we create a state file that
>    includes the corresponding LSN in the filename
>
> 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
>    that time. Instead the state file is created during pg_backup_stop().
>
> 4) When replaying a XLOG_BACKUP_END record, we verif that the state file
>    created by XLOG_BACKUP_START is present, and error out if not.  Backups
>    that started before the redo LSN from backup_label are ignored
>    (necessitates remembering that LSN, but we've been discussing that anyway).
>
>
> Because the backup state file on the primary is only created during
> pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
> and pg_backup_stop() does *not* contain the corresponding "backup state
> file". Because of this, an omitted backup_label is detected if recovery does
> not start early enough - recovery won't encounter the XLOG_BACKUP_START record
> and thus would not create the state file, leading to an error in 4).

While I see the idea here, I think, doesn't it end up being an issue if
things happen like this:

pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash

In that scenario, we'd go back to the new checkpoint (the one *after*
the checkpoint that happened before we wrote XLOG_BACKUP_START), start
replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
Even though we're actually doing crash recovery and everything should be
fine as long as we replay all of the WAL.

Perhaps we can make the pg_backup_stop and(/or?) the writing out of
XLOG_BACKUP_STOP wait until just before the next checkpoint and
hopefully minimize that window ... but I'm not sure if we could make
that window zero and what happens if someone does end up hitting it?
Doesn't seem like there's any way around it, which seems like it might
be a problem.  I suppose it wouldn't be hard to add some option to tell
PG to ignore the XLOG_BACKUP_STOP ... but then that's akin to removing
backup_label which lands us possibly back into the issue of people
mis-using that.

> It is not a problem that the primary does not create the state file before the
> pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
> XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
> sequence differs between normal processing and recovery, but I think that's
> nothing a good comment couldn't explain.

Right, crashing before pg_backup_stop() is fine, but crashing *after*
would be an issue, I think, as outlined above, until the next checkpoint
completes, so we've moved the window but not eliminated it.

> I haven't worked out the details, but I think we might be able extend this to
> catch errors even if there is no checkpoint during the base backup, by
> emitting the WAL record *before* the RequestCheckpoint(), and creating the
> corresponding state file during backup_label processing at the start of
> recovery.  That'd probably make the logic for when we can remove the backup
> state files a bit more complicated, but I think we could deal with that.

Not entirely following this- are you meaning that we might be able to
make something here work in the case where we don't have
pg_backup_start() wait for a checkpoint to happen (which I have some
serious doubts about?), or are you saying that the above doesn't work
unless there's at least one post-pg_backup_start() checkpoint?  I don't
immediately see why that would be the case though.  Also, if we wrote
out the XLOG_BACKUP_START before the checkpoint that we start replay
from and instead move that logic to backup_label processing ... doesn't
that end up not working in the same case as we have today- where someone
decides to remove backup_label?

Going to stop guessing here as I'm clearly not understanding something
about this part.  Maybe this is the part that's addressing the concern
raised above though and if so, sorry, but would appreciate some
additional explanation.

Thanks!

Stephen

Attachment

signature.asc

Re: Detecting some cases of missing backup_label

From

Stephen Frost

Date:

18 December 2023, 14:39:49

Greetings,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > I recently mentioned to Robert (and also Heikki earlier), that I think I see a
> > way to detect an omitted backup_label in a relevant subset of the cases (it'd
> > apply to the pg_control as well, if we moved to that).  Robert encouraged me
> > to share the idea, even though it does not provide complete protection.
>
> That would certainly be nice.
>
> > The subset I think we can address is the following:
> >
> > a) An omitted backup_label would lead to corruption, i.e. without the
> >    backup_label we won't start recovery at the right position. Obviously it'd
> >    be better to also catch a wrong procedure when it'd not cause corruption -
> >    perhaps my idea can be extended to handle that, with a small bit of
> >    overhead.
> >
> > b) The backup has been taken from a primary. Unfortunately that probably can't
> >    be addressed - but the vast majority of backups are taken from a primary,
> >    so I think it's still a worthwhile protection.
>
> Agreed that this is a worthwhile set to try and address, even if we
> can't address other cases.
>
> > Here's my approach
> >
> > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
> >    primary, emitted just *after* the checkpoint completed
> >
> > 2) When replaying a base backup start record, we create a state file that
> >    includes the corresponding LSN in the filename
> >
> > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
> >    that time. Instead the state file is created during pg_backup_stop().
> >
> > 4) When replaying a XLOG_BACKUP_END record, we verif that the state file
> >    created by XLOG_BACKUP_START is present, and error out if not.  Backups
> >    that started before the redo LSN from backup_label are ignored
> >    (necessitates remembering that LSN, but we've been discussing that anyway).
> >
> >
> > Because the backup state file on the primary is only created during
> > pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
> > and pg_backup_stop() does *not* contain the corresponding "backup state
> > file". Because of this, an omitted backup_label is detected if recovery does
> > not start early enough - recovery won't encounter the XLOG_BACKUP_START record
> > and thus would not create the state file, leading to an error in 4).
>
> While I see the idea here, I think, doesn't it end up being an issue if
> things happen like this:
>
> pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
> happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash
>
> In that scenario, we'd go back to the new checkpoint (the one *after*
> the checkpoint that happened before we wrote XLOG_BACKUP_START), start
> replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
> Even though we're actually doing crash recovery and everything should be
> fine as long as we replay all of the WAL.

Andres and I discussed this in person at PGConf.eu and the idea is that
if we find a XLOG_BACKUP_STOP record then we can check if the state file
was written out and if so then we can conclude that we are doing crash
recovery and not restoring from a backup and therefore we don't error
out.  This also implies that we don't consider PG to be recovered at the
XLOG_BACKUP_STOP point, if the state file exists, but instead we have to
be sure to replay all WAL that's been written.  Perhaps we even
explicitly refuse to use restore_command in this case?

We do error out if we hit a XLOG_BACKUP_STOP and the state file
doesn't exist, as that implies that we started replaying from a point
after a XLOG_BACKUP_START record was written but are working from a copy
of the data directory which didn't include the state file.

Of course, we need to actually implement and test these different cases
to make sure it all works but I'm at least feeling better about the idea
and wanted to share that here.

Thanks,

Stephen

Attachment

signature.asc

Re: Detecting some cases of missing backup_label

From

David Steele

Date:

20 December 2023, 17:11:37

On 12/18/23 10:39, Stephen Frost wrote:
> Greetings,
> 
> * Stephen Frost (sfrost@snowman.net) wrote:
>> * Andres Freund (andres@anarazel.de) wrote:
>>> I recently mentioned to Robert (and also Heikki earlier), that I think I see a
>>> way to detect an omitted backup_label in a relevant subset of the cases (it'd
>>> apply to the pg_control as well, if we moved to that).  Robert encouraged me
>>> to share the idea, even though it does not provide complete protection.
>>
>> That would certainly be nice.
>>
>>> The subset I think we can address is the following:
>>>
>>> a) An omitted backup_label would lead to corruption, i.e. without the
>>>     backup_label we won't start recovery at the right position. Obviously it'd
>>>     be better to also catch a wrong procedure when it'd not cause corruption -
>>>     perhaps my idea can be extended to handle that, with a small bit of
>>>     overhead.
>>>
>>> b) The backup has been taken from a primary. Unfortunately that probably can't
>>>     be addressed - but the vast majority of backups are taken from a primary,
>>>     so I think it's still a worthwhile protection.
>>
>> Agreed that this is a worthwhile set to try and address, even if we
>> can't address other cases.
>>
>>> Here's my approach
>>>
>>> 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
>>>     primary, emitted just *after* the checkpoint completed
>>>
>>> 2) When replaying a base backup start record, we create a state file that
>>>     includes the corresponding LSN in the filename
>>>
>>> 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
>>>     that time. Instead the state file is created during pg_backup_stop().
>>>
>>> 4) When replaying a XLOG_BACKUP_END record, we verif that the state file
>>>     created by XLOG_BACKUP_START is present, and error out if not.  Backups
>>>     that started before the redo LSN from backup_label are ignored
>>>     (necessitates remembering that LSN, but we've been discussing that anyway).
>>>
>>>
>>> Because the backup state file on the primary is only created during
>>> pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
>>> and pg_backup_stop() does *not* contain the corresponding "backup state
>>> file". Because of this, an omitted backup_label is detected if recovery does
>>> not start early enough - recovery won't encounter the XLOG_BACKUP_START record
>>> and thus would not create the state file, leading to an error in 4).
>>
>> While I see the idea here, I think, doesn't it end up being an issue if
>> things happen like this:
>>
>> pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
>> happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash
>>
>> In that scenario, we'd go back to the new checkpoint (the one *after*
>> the checkpoint that happened before we wrote XLOG_BACKUP_START), start
>> replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
>> Even though we're actually doing crash recovery and everything should be
>> fine as long as we replay all of the WAL.
> 
> Andres and I discussed this in person at PGConf.eu and the idea is that
> if we find a XLOG_BACKUP_STOP record then we can check if the state file
> was written out and if so then we can conclude that we are doing crash
> recovery and not restoring from a backup and therefore we don't error
> out.  This also implies that we don't consider PG to be recovered at the
> XLOG_BACKUP_STOP point, if the state file exists, but instead we have to
> be sure to replay all WAL that's been written.  Perhaps we even
> explicitly refuse to use restore_command in this case?
> 
> We do error out if we hit a XLOG_BACKUP_STOP and the state file
> doesn't exist, as that implies that we started replaying from a point
> after a XLOG_BACKUP_START record was written but are working from a copy
> of the data directory which didn't include the state file.
> 
> Of course, we need to actually implement and test these different cases
> to make sure it all works but I'm at least feeling better about the idea
> and wanted to share that here.

I've run this through a bunch of scenarios (in my head) with parallel 
backups and it does seem to hold up.

I think we'd need to write the state file before XLOG_BACKUP_START just 
in case. Seems better to have an extra state file rather than have one 
be missing.

As you say, we'll need to store redo for the last recovered backup in 
pg_control. I guess it would be OK to remove that when the cluster is 
promoted. As long as recovery is going on seems like it would always be 
possible to hit an XLOG_BACKUP_STOP for an even longer running backup.

I'm a little worried about what happens if a state file goes missing, 
but I guess that could be true of any file in PGDATA.

Probably we'd want to exclude *all* state files from backups, though. 
Seems like in various PITR scenarios it could be hard to determine when 
to remove them.

Regards,
-David

Re: Detecting some cases of missing backup_label

From

Andres Freund

Date:

21 December 2023, 11:37:46

Hi,

On 2023-12-20 13:11:37 -0400, David Steele wrote:
> I've run this through a bunch of scenarios (in my head) with parallel
> backups and it does seem to hold up.
>
> I think we'd need to write the state file before XLOG_BACKUP_START just in
> case. Seems better to have an extra state file rather than have one be
> missing.

That'd very significantly weaken the approach, afaict, because "external" base
base backup could end up copying those files. The whole point is to detect
broken procedures, so relying on such files being excluded from the base
backup seems like a bad idea.

I also see no need to do so - because we'd only verify that a backup start has
been replayed when replaying XLOG_BACKUP_STOP there's no danger in not
creating the files during XLOG_BACKUP_START, but doing so just before logging
the XLOG_BACKUP_STOP.

> I'm a little worried about what happens if a state file goes missing, but I
> guess that could be true of any file in PGDATA.

Yea, that seems like a non-issue to me.

> Probably we'd want to exclude *all* state files from backups, though.

I don't think so - I think we want the opposite? As noted above, I think in a
safety net like this we shouldn't assume that backup procedures were followed
correctly.

> Seems like in various PITR scenarios it could be hard to determine when to
> remove them.

Why? I think we can basically remove the files when:

a) after the checkpoint during which XLOG_BACKUP_STOP was replayed - I think
   we already have the infrastructure to queue file deletions that we can hook
   into
b) when replaying a shutdown checkpoint / after creation of a shutdown
   checkpoint

Greetings,

Andres Freund

Re: Detecting some cases of missing backup_label

From

David Steele

Date:

21 December 2023, 12:26:29

On 12/21/23 07:37, Andres Freund wrote:
> 
> On 2023-12-20 13:11:37 -0400, David Steele wrote:
>> I've run this through a bunch of scenarios (in my head) with parallel
>> backups and it does seem to hold up.
>>
>> I think we'd need to write the state file before XLOG_BACKUP_START just in
>> case. Seems better to have an extra state file rather than have one be
>> missing.
> 
> That'd very significantly weaken the approach, afaict, because "external" base
> base backup could end up copying those files. The whole point is to detect
> broken procedures, so relying on such files being excluded from the base
> backup seems like a bad idea.
> 
> I also see no need to do so - because we'd only verify that a backup start has
> been replayed when replaying XLOG_BACKUP_STOP there's no danger in not
> creating the files during XLOG_BACKUP_START, but doing so just before logging
> the XLOG_BACKUP_STOP.

Ugh, I meant XLOG_BACKUP_STOP. So sounds like we are on the same page.

>> Probably we'd want to exclude *all* state files from backups, though.
> 
> I don't think so - I think we want the opposite? As noted above, I think in a
> safety net like this we shouldn't assume that backup procedures were followed
> correctly.

Fair enough.

>> Seems like in various PITR scenarios it could be hard to determine when to
>> remove them.
> 
> Why? I think we can basically remove the files when:
> 
> a) after the checkpoint during which XLOG_BACKUP_STOP was replayed - I think
>     we already have the infrastructure to queue file deletions that we can hook
>     into
> b) when replaying a shutdown checkpoint / after creation of a shutdown
>     checkpoint

I thought about this some more. I *think* any state files a backup can 
see would have to be for XLOG_BACKUP_STOP records generated during the 
backup and they would get removed before the cluster had recovered to 
consistency.

I'd still prefer to exclude state files from the backup, but I agree 
there is no actual need to do so.

Regards,
-David