Re: Requiring recovery.signal or standby.signal when recovering with a backup_label - Mailing list pgsql-hackers
From | Bowen Shi |
---|---|
Subject | Re: Requiring recovery.signal or standby.signal when recovering with a backup_label |
Date | |
Msg-id | CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com |
In response to | Re: Requiring recovery.signal or standby.signal when recovering with a backup_label (Michael Paquier <michael@paquier.xyz>) |
Responses | Re: Requiring recovery.signal or standby.signal when recovering with a backup_label |
List | pgsql-hackers |
Thanks for the patch.

I reran the test in
https://www.postgresql.org/message-id/flat/ZQtzcH2lvo8leXEr%40paquier.xyz#cc5ed83e0edc0b9a1c1305f08ff7a335 .
We can discuss all the problems in this thread.

First I encountered the problem
"FATAL: could not find recovery.signal or standby.signal when recovering with backup_label",
then I deleted the backup_label file and started the instance successfully.

> Delete a backup_label from a fresh base backup can easily lead to data
> corruption, as the startup process would pick up as LSN to start
> recovery from the control file rather than the backup_label file.
> This would happen if a checkpoint updates the redo LSN in the control
> file while a backup happens and the control file is copied after the
> checkpoint, for instance. If one wishes to deploy a new primary from
> a base backup, recovery.signal is the way to go, making sure that the
> new primary is bumped into a new timeline once recovery finishes, on
> top of making sure that the startup process starts recovery from a
> position where the cluster would be able to achieve a consistent
> state.

    ereport(FATAL,
            (errmsg("could not find redo location referenced by checkpoint record"),
             errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" and add required recovery options.\n"
                     "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
                     "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
                     DataDir, DataDir, DataDir)));

There are two similar error messages in xlogrecovery.c. Maybe we can modify the error messages to be similar.
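As a rough illustration of that suggestion, the hint text could be built in one place so that every call site emits the same wording. The sketch below is standalone C, not PostgreSQL code: `format_backup_hint` is a hypothetical helper, and the combined wording (the existing hint plus the "or standby.signal" part from the patch's hint quoted in this thread) is only one possible phrasing, assumed here for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of xlogrecovery.c: write one shared hint
 * into buf so that similar FATAL reports can use identical wording.
 * Returns what snprintf() returns, i.e. the length the full text needs.
 */
int
format_backup_hint(char *buf, size_t buflen, const char *datadir)
{
    return snprintf(buf, buflen,
                    "If you are restoring from a backup, touch \"%s/recovery.signal\" "
                    "or \"%s/standby.signal\" and add required recovery options.\n"
                    "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
                    "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster "
                    "if restoring from a backup.",
                    datadir, datadir, datadir, datadir);
}
```

In the server itself the format string would more likely be passed to errhint() directly, as in the ereport() call above; the helper only shows the shared text.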
--
Bowen Shi

On Thu, 21 Sept 2023 at 11:01, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jul 19, 2023 at 11:21:17AM -0700, David Zhang wrote:
> > 1) simply start server from a base backup
> >
> > FATAL: could not find recovery.signal or standby.signal when recovering
> > with backup_label
> >
> > HINT: If you are restoring from a backup, touch
> > "/media/david/disk1/pg_backup1/recovery.signal" or
> > "/media/david/disk1/pg_backup1/standby.signal" and add required recovery
> > options.
>
> Note the difference when --write-recovery-conf is specified, where a
> standby.conf is created with a primary_conninfo in
> postgresql.auto.conf. So, yes, that's expected by default with the
> patch.
>
> > 2) touch a recovery.signal file and then try to start the server, the
> > following error was encountered:
> >
> > FATAL: must specify restore_command when standby mode is not enabled
>
> Yes, that's also something expected in the scope of the v1 posted.
> The idea behind this restriction is that specifying recovery.signal is
> equivalent to asking for archive recovery, but not specifying
> restore_command is equivalent to not provide any options to be able to
> recover. See validateRecoveryParameters() and note that this
> restriction exists since the beginning of times, introduced in commit
> 66ec2db. I tend to agree that there is something to be said about
> self-contained backups taken from pg_basebackup, though, as these
> would fail if no restore_command is specified, and this restriction is
> in place before Postgres has introduced replication and easier ways to
> have base backups. As a whole, I think that there is a good argument
> in favor of removing this restriction for the case where archive
> recovery is requested if users have all their WAL in pg_wal/ to be
> able to recover up to a consistent point, keeping these GUC
> restrictions if requesting a standby (not recovery.signal, only
> standby.signal).
>
> > 3) touch a standby.signal file, then the server successfully started,
> > however, it operates in standby mode, whereas the intended behavior was for
> > it to function as a primary server.
>
> standby.signal implies that the server will start in standby mode. If
> one wants to deploy a new primary, that would imply a timeline jump at
> the end of recovery, you would need to specify recovery.signal
> instead.
>
> We need more discussions and more opinions, but the discussion has
> stalled for a few months now. In case, I am adding Thomas Munro in CC
> who has mentioned to me at PGcon that he was interested in this patch
> (this thread's problem is not directly related to the fact that the
> checkpointer now runs in crash recovery, though).
>
> For now, I am attaching a rebased v2.
> --
> Michael
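
The three cases in the quoted exchange can be summarised as a small decision function. This is a minimal sketch of the behaviour described for the patch, assuming the v1 rules as stated in the thread. It is not the code in xlogrecovery.c, and `StartupOutcome` and `classify_startup` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels; these are not PostgreSQL identifiers. */
typedef enum
{
    START_CRASH_RECOVERY,        /* no backup_label and no signal file */
    START_ARCHIVE_RECOVERY,      /* recovery.signal: new timeline once recovery ends */
    START_STANDBY,               /* standby.signal: server runs in standby mode */
    FAIL_MISSING_SIGNAL,         /* case 1: backup_label without a signal file */
    FAIL_MISSING_RESTORE_COMMAND /* case 2: recovery.signal without restore_command */
} StartupOutcome;

/*
 * Model of the startup decision discussed in the thread.  Inputs say which
 * files exist in the data directory; restore_command may be NULL or empty.
 */
StartupOutcome
classify_startup(bool has_backup_label,
                 bool has_recovery_signal,
                 bool has_standby_signal,
                 const char *restore_command)
{
    /* standby.signal takes precedence when both signal files exist. */
    if (has_standby_signal)
        return START_STANDBY;

    if (has_recovery_signal)
    {
        /* "must specify restore_command when standby mode is not enabled" */
        if (restore_command == NULL || restore_command[0] == '\0')
            return FAIL_MISSING_RESTORE_COMMAND;
        return START_ARCHIVE_RECOVERY;
    }

    /* With the patch, a backup_label alone is rejected with a FATAL. */
    if (has_backup_label)
        return FAIL_MISSING_SIGNAL;

    return START_CRASH_RECOVERY;
}
```

For example, `classify_startup(true, true, false, "")` yields `FAIL_MISSING_RESTORE_COMMAND`, which is the restriction the reply suggests relaxing when all required WAL is already in pg_wal/.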