Re: Requiring recovery.signal or standby.signal when recovering with a backup_label - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Date
Msg-id ZRPY9VwlNYygle4o@paquier.xyz
Whole thread Raw
In response to Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Bowen Shi <zxwsbg12138@gmail.com>)
List pgsql-hackers
On Thu, Sep 21, 2023 at 11:45:06AM +0800, Bowen Shi wrote:
> First I encountered the problem " FATAL:  could not find
> recovery.signal or standby.signal when recovering with backup_label ",
> then I deleted the backup_label file and started the instance
> successfully.

Doing that is equal to corrupting your instance as recovery would
begin from the latest redo LSN stored in the control file, but the
physical relation files included in the backup may include blocks that
require records that are needed before this redo LSN and the LSN
stored in the backup_label.

>> Delete a backup_label from a fresh base backup can easily lead to data
>> corruption, as the startup process would pick up as LSN to start
>> recovery from the control file rather than the backup_label file.
>> This would happen if a checkpoint updates the redo LSN in the control
>> file while a backup happens and the control file is copied after the
>> checkpoint, for instance. If one wishes to deploy a new primary from
>> a base backup, recovery.signal is the way to go, making sure that the
>> new primary is bumped into a new timeline once recovery finishes, on
>> top of making sure that the startup process starts recovery from a
>> position where the cluster would be able to achieve a consistent
>> state.

And that's what I mean here.  In more details.  So, really, don't do
that.

> ereport(FATAL,
> (errmsg("could not find redo location referenced by checkpoint record"),
> errhint("If you are restoring from a backup, touch
> \"%s/recovery.signal\" and add required recovery options.\n"
> "If you are not restoring from a backup, try removing the file
> \"%s/backup_label\".\n"
> "Be careful: removing \"%s/backup_label\" will result in a corrupt
> cluster if restoring from a backup.",
> DataDir, DataDir, DataDir)));
>
> There are two similar error messages in xlogrecovery.c. Maybe we can
> modify the error messages to be similar.

The patch adds the following message, which is written this way to be
consistent with the two others, already:

+    ereport(FATAL,
+            (errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
+             errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and
addrequired recovery options.", 
+                     DataDir, DataDir)));

But you have an interesting point here, why isn't standby.signal also
mentioned in the two existing comments?  Depending on what's wanted by
the user this can be equally useful to report back.

Attached is a slightly updated patch, where I have also removed the
check on ArchiveRecoveryRequested because the FATAL generated for
!ArchiveRecoveryRequested makes sure that it is useless after reading
the backup_label file.

This patch has been around for a few months now.  Do others have
opinions about the direction taken here to make the presence of
recovery.signal or standby.signal a hard requirement when a
backup_label file is found (HEAD only)?
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Use virtual tuple slot for Unique node
Next
From: Yuya Watari
Date:
Subject: Re: [PoC] Reducing planning time when tables have many partitions