Re: recovery starting when backup_label exists, but not recovery.signal - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: recovery starting when backup_label exists, but not recovery.signal
Date
Msg-id CAHGQGwHXH+CZzBjy1-GGK3vnqeBCcNsbdeiuPp13STMWknKD8w@mail.gmail.com
Whole thread Raw
In response to Re: recovery starting when backup_label exists, but notrecovery.signal  (David Steele <david@pgmasters.net>)
Responses Re: recovery starting when backup_label exists, but notrecovery.signal
Re: recovery starting when backup_label exists, but notrecovery.signal
List pgsql-hackers
On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>
> On 9/24/19 1:25 AM, Fujii Masao wrote:
> >
> > When backup_label exists, the startup process enters archive recovery mode
> > even if recovery.signal file doesn't exist. In this case, the startup process
> > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > of the archive recovery, the contents of backup_label are copied to pg_control
> > and backup_label file is removed. This would be an intentional behavior.
>
> > But I think the problem is that, if the server shuts down during that
> > archive recovery, the restart of the server may cause the recovery to fail
> > because neither backup_label nor recovery.signal exist and the server
> > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> >
> > So the problematic scenario is;
> >
> > 1. the server starts with backup_label, but not recovery.signal.
> > 2. the startup process enters an archive recovery mode because
> >     backup_label exists.
> > 3. the contents of backup_label are copied to pg_control and
> >     backup_label is deleted.
>
> Do you mean deleted or renamed to backup_label.old?

Sorry for the confusing wording..
I meant the following code that renames backup_label to .old, in StartupXLOG().

    /*
    * If there was a backup label file, it's done its job and the info
    * has now been propagated into pg_control.  We must get rid of the
    * label file so that if we crash during recovery, we'll pick up at
    * the latest recovery restartpoint instead of going all the way back
    * to the backup start point.  It seems prudent though to just rename
    * the file out of the way rather than delete it completely.
    */
    if (haveBackupLabel)
    {
        unlink(BACKUP_LABEL_OLD);
        durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
   }

> > 4. the server shuts down..
>
> This happens after the cluster has reached consistency?

You need to shutdown the server until WAL replay finishes,
no matter whether it reaches the consistent point or not.

> > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > 6. the startup process starts just crash recovery because neither backup_label
> >     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
> >     area, it may fail.
>
> I tried a few ways to reproduce this but was not successful without
> manually removing WAL.  Probably I just needed a much larger set of WAL.
>
> I assume you have a repro?  Can you give more details?

What I did is:

1. Start PostgreSQL server with WAL archiving enabled.
2. Take an online backup by using pg_basebackup, for example,
     $ pg_basebackup -D backup
3. Execute many write SQL to generate lots of WAL files. During that execution,
    perform CHECKPOINT to remove some WAL files from pg_wal directory.
    You need to repeat these until you confirm that there are many WAL files
    that have already been removed from pg_wal but exist only in archive area.
 4. Shutdown the server.
 5. Remove PGDATA and restore it from backup.
 6. Set up restore_command.
 7. (Forget to put recovery.signal)
     That is, in this scenario, you want to recover the database up to
     the latest WAL records in the archive area. So you need to start archive
     recovery by setting restore_command and putting recovery.signal.
     But the problem happens when you forget to put recovery.signal.
 8. Start PostgreSQL server.
 9. Shutdown the server while it's restoring archived WAL files and replaying
     them. At this point, you will notice that the archive recovery starts
     even though recovery.signal doesn't exist. So even archived WAL files
     are successfully restored at this step.
 10. Restart PostgreSQL server. Since neither backup_label or recovery.signal
        exist, crash recovery starts and fail to restore the archived WAL files.
       So you fail to recover the database up to the latest WAL record
in archive
       directory. The recovery will finish at early point.

> > One idea to fix this issue is to make the above step #3 remember that
> > backup_label existed, in pg_control. Then we should make the subsequent
> > recovery enter an archive recovery mode if pg_control indicates that
> > even if neither backup_label nor recovery.signal exist. Thought?
>
> That seems pretty invasive to me at this stage.  I'd like to reproduce
> it and see if there are alternatives.
>
> Also, are you sure this is a new behavior?

In v11 or before, if backup_label exists but not recovery.conf,
the startup process doesn't enter an archive recovery mode. It starts
crash recovery in that case. So the bahavior is somewhat different
between versions.

Regards,

-- 
Fujii Masao



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [PATCH] psql: add tab completion for \df slash command suffixes
Next
From: Fujii Masao
Date:
Subject: Re: recovery starting when backup_label exists, but not recovery.signal