Re: recovery starting when backup_label exists, but not recovery.signal - Mailing list pgsql-hackers

From David Steele
Subject Re: recovery starting when backup_label exists, but not recovery.signal
Date
Msg-id c4909bdd-4a4d-31b7-c705-aabf3f1273e0@pgmasters.net
In response to Re: recovery starting when backup_label exists, but not recovery.signal  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On 9/27/19 4:34 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>>
>> On 9/24/19 1:25 AM, Fujii Masao wrote:
>>>
>>> When backup_label exists, the startup process enters archive recovery mode
>>> even if the recovery.signal file doesn't exist. In this case, the startup
>>> process tries to retrieve WAL files by using restore_command. Then, at the
>>> beginning of archive recovery, the contents of backup_label are copied to
>>> pg_control and the backup_label file is removed. This seems to be
>>> intentional behavior.
>>
>>> But I think the problem is that, if the server shuts down during that
>>> archive recovery, the restart of the server may cause the recovery to fail
>>> because neither backup_label nor recovery.signal exists and the server
>>> doesn't enter archive recovery mode. Is this intentional, too? It seems not.
>>>
>>> So the problematic scenario is:
>>>
>>> 1. the server starts with backup_label, but not recovery.signal.
>>> 2. the startup process enters an archive recovery mode because
>>>     backup_label exists.
>>> 3. the contents of backup_label are copied to pg_control and
>>>     backup_label is deleted.
>>
>> Do you mean deleted or renamed to backup_label.old?
> 
> Sorry for the confusing wording.
> I meant the following code in StartupXLOG() that renames backup_label to .old.

Right, that makes sense.

>>
>> I assume you have a repro?  Can you give more details?
> 
> What I did is:
> 
> 1. Start PostgreSQL server with WAL archiving enabled.
> 2. Take an online backup by using pg_basebackup, for example,
>      $ pg_basebackup -D backup
> 3. Execute many write SQL statements to generate lots of WAL files. During
>     that execution, perform CHECKPOINT to remove some WAL files from the
>     pg_wal directory. You need to repeat these until you confirm that there
>     are many WAL files that have already been removed from pg_wal and exist
>     only in the archive area.
> 4. Shut down the server.
> 5. Remove PGDATA and restore it from the backup.
> 6. Set up restore_command.
> 7. (Forget to put recovery.signal)
>     That is, in this scenario, you want to recover the database up to
>     the latest WAL records in the archive area. So you need to start archive
>     recovery by setting restore_command and putting recovery.signal.
>     But the problem happens when you forget to put recovery.signal.
> 8. Start the PostgreSQL server.
> 9. Shut down the server while it's restoring archived WAL files and replaying
>     them. At this point, you will notice that archive recovery starts
>     even though recovery.signal doesn't exist. So even archived WAL files
>     are successfully restored at this step.
> 10. Restart the PostgreSQL server. Since neither backup_label nor
>     recovery.signal exists, crash recovery starts and fails to restore the
>     archived WAL files. So you fail to recover the database up to the latest
>     WAL record in the archive directory. The recovery will finish at an
>     early point.

Yes, I see it now.  I did not have enough WAL to make it work before, as
I suspected.
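
To make the sequence in steps 8 to 10 concrete, here is a toy Python model of it. This is only a sketch: DataDir and start_server are made-up names, nothing here is PostgreSQL code, and the rule it encodes is just the v12 behavior as described in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class DataDir:
    """Toy stand-in for PGDATA: only the marker files that matter here."""
    files: set = field(default_factory=set)


def start_server(pgdata: DataDir) -> str:
    """Return the recovery mode chosen at startup (hypothetical model)."""
    if "backup_label" in pgdata.files:
        # The label's contents go to pg_control and the file is renamed,
        # so the next startup no longer sees backup_label.
        pgdata.files.discard("backup_label")
        pgdata.files.add("backup_label.old")
        return "archive"
    if "recovery.signal" in pgdata.files:
        return "archive"
    return "crash"


# Restored from backup, recovery.signal forgotten (step 7).
pgdata = DataDir({"backup_label"})
first = start_server(pgdata)   # step 8: archive recovery starts anyway
second = start_server(pgdata)  # step 10: restart after the shutdown in step 9
print(first, second)
```

The second start falls through to crash recovery because the only thing that triggered archive recovery the first time has been renamed away.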

>>> One idea to fix this issue is to make the above step #3 remember in
>>> pg_control that backup_label existed. Then the subsequent recovery should
>>> enter archive recovery mode if pg_control indicates that, even if
>>> neither backup_label nor recovery.signal exists. Thoughts?
>>
>> That seems pretty invasive to me at this stage.  I'd like to reproduce
>> it and see if there are alternatives.
>>
>> Also, are you sure this is a new behavior?
> 
> In v11 and earlier, if backup_label exists but recovery.conf does not,
> the startup process doesn't enter archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Agreed.  Since recovery options can be used in the presence of
backup_label *or* recovery.signal (or standby.signal for that matter)
this does represent a change in behavior.  And it doesn't appear to be a
beneficial change.
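
A rough way to picture the difference is the small Python sketch below. The function name recovery_mode and the set-of-filenames representation are assumptions made for illustration. The rules are the ones stated in this thread, not something lifted from the PostgreSQL source.

```python
def recovery_mode(major_version: int, files: set) -> str:
    """Recovery mode at startup, as characterized in this thread."""
    if major_version >= 12:
        # v12: backup_label alone is enough, as is either signal file.
        triggers = {"backup_label", "recovery.signal", "standby.signal"}
    else:
        # v11 and earlier: only recovery.conf requests archive recovery.
        triggers = {"recovery.conf"}
    return "archive" if files & triggers else "crash"


# Same data directory contents, different major versions.
for version in (11, 12):
    print(version, recovery_mode(version, {"backup_label"}))
```

With only backup_label present, the model gives crash recovery for 11 and archive recovery for 12, which is the change being discussed.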

Regards,
-- 
-David
david@pgmasters.net


