Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Date
Msg-id CAB7nPqQ_Fm8a-OSEeE6n1nT-LUp7ty85Yc+vWW1v06rwQxC1Hg@mail.gmail.com
Whole thread Raw
In response to Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Michael Paquier <michael.paquier@gmail.com>)
Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
>>> -                     * Wait for more WAL to arrive. Time out after 5 seconds,
>>> -                     * like when polling the archive, to react to a trigger
>>> -                     * file promptly.
>>> +                     * Wait for more WAL to arrive. Time out after
>>> the amount of
>>> +                     * time specified by wal_retrieve_retry_interval, like
>>> +                     * when polling the archive, to react to a
>>> trigger file promptly.
>>>                       */
>>>                      WaitLatch(&XLogCtl->recoveryWakeupLatch,
>>>                                WL_LATCH_SET | WL_TIMEOUT,
>>> -                              5000L);
>>> +                              wal_retrieve_retry_interval * 1000L);
>>>
>>> This change can prevent the startup process from reacting to
>>> a trigger file. Imagine the case where the large interval is set
>>> and the user want to promote the standby by using the trigger file
>>> instead of pg_ctl promote. I think that the sleep time should be 5s
>>> if the interval is set to more than 5s. Thought?
>>
>> I disagree here. It is interesting to accelerate the check of WAL
>> availability from a source in some cases for replication, but the
>> opposite is true as well as mentioned by Alexey at the beginning of
>> the thread to reduce the number of requests when requesting WAL
>> archives from an external storage type AWS. Hence a correct solution
>> would be to check periodically for the trigger file with a maximum
>> one-time wait of 5s to ensure backward-compatible behavior. We could
>> reduce it to 1s or something like that as well.
>
> You seem to have misunderstood the code in question. Or I'm missing something.
> The timeout of the WaitLatch is just the interval to check for the trigger file
> while waiting for more WAL to arrive from streaming replication. Not related to
> the retry time to restore WAL from the archive.

[Re-reading the code...]
Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
maximum of 5s then. I also noticed in previous patch that the wait was
maximized to 5s. To begin with, a loop should have been used if it was
a sleep, but as now patch uses a latch this limit does not make much
sense... Patch updated is attached.
Regards,
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Marc Balmer
Date:
Subject: Re: For cursors, there is FETCH and MOVE, why no TELL?
Next
From: Andres Freund
Date:
Subject: Re: Assertion failure when streaming logical changes