Re: recovery starting when backup_label exists, but notrecovery.signal - Mailing list pgsql-hackers

From David Steele
Subject Re: recovery starting when backup_label exists, but notrecovery.signal
Date
Msg-id 49b8f446-06ed-8e91-5dd6-fa4dfee1ee83@pgmasters.net
Whole thread Raw
In response to recovery starting when backup_label exists, but not recovery.signal  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: recovery starting when backup_label exists, but not recovery.signal
Re: recovery starting when backup_label exists, but not recovery.signal
List pgsql-hackers
On 9/24/19 1:25 AM, Fujii Masao wrote:
> 
> When backup_label exists, the startup process enters archive recovery mode
> even if recovery.signal file doesn't exist. In this case, the startup process
> tries to retrieve WAL files by using restore_command. Then, at the beginning
> of the archive recovery, the contents of backup_label are copied to pg_control
> and backup_label file is removed. This would be an intentional behavior.

> But I think the problem is that, if the server shuts down during that
> archive recovery, the restart of the server may cause the recovery to fail
> because neither backup_label nor recovery.signal exist and the server
> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> 
> So the problematic scenario is;
> 
> 1. the server starts with backup_label, but not recovery.signal.
> 2. the startup process enters an archive recovery mode because
>     backup_label exists.
> 3. the contents of backup_label are copied to pg_control and
>     backup_label is deleted.

Do you mean deleted or renamed to backup_label.old?

> 4. the server shuts down..

This happens after the cluster has reached consistency?

> 5. the server is restarted. neither backup_label nor recovery.signal exist.
> 6. the startup process starts just crash recovery because neither backup_label
>     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
>     area, it may fail.

I tried a few ways to reproduce this but was not successful without
manually removing WAL.  Probably I just needed a much larger set of WAL.

I assume you have a repro?  Can you give more details?

> One idea to fix this issue is to make the above step #3 remember that
> backup_label existed, in pg_control. Then we should make the subsequent
> recovery enter an archive recovery mode if pg_control indicates that
> even if neither backup_label nor recovery.signal exist. Thought?

That seems pretty invasive to me at this stage.  I'd like to reproduce
it and see if there are alternatives.

Also, are you sure this is a new behavior?  I've been finding that some
behaviors that have existed for a long time are suddenly more apparent
or easier to hit with the new mechanism.  Examples of that are in [1].

-- 
-David
david@pgmasters.net

[1]
https://www.postgresql.org/message-id/5e6537c7-d10e-6a67-4813-bbd7455cfaf5%40pgmasters.net



pgsql-hackers by date:

Previous
From: Juan José Santamaría Flecha
Date:
Subject: Re: Allow to_date() and to_timestamp() to accept localized names
Next
From: Tomas Vondra
Date:
Subject: Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions