Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Date
Msg-id CAB7nPqR8zmtL-vN+8ksVW0XOUR3vXAAyJmV9R2eyr-2nXC-E_g@mail.gmail.com
Whole thread Raw
In response to Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
List pgsql-hackers
On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
>>>>> -                     * Wait for more WAL to arrive. Time out after 5 seconds,
>>>>> -                     * like when polling the archive, to react to a trigger
>>>>> -                     * file promptly.
>>>>> +                     * Wait for more WAL to arrive. Time out after
>>>>> the amount of
>>>>> +                     * time specified by wal_retrieve_retry_interval, like
>>>>> +                     * when polling the archive, to react to a
>>>>> trigger file promptly.
>>>>>                       */
>>>>>                      WaitLatch(&XLogCtl->recoveryWakeupLatch,
>>>>>                                WL_LATCH_SET | WL_TIMEOUT,
>>>>> -                              5000L);
>>>>> +                              wal_retrieve_retry_interval * 1000L);
>>>>>
>>>>> This change can prevent the startup process from reacting to
>>>>> a trigger file. Imagine the case where the large interval is set
>>>>> and the user want to promote the standby by using the trigger file
>>>>> instead of pg_ctl promote. I think that the sleep time should be 5s
>>>>> if the interval is set to more than 5s. Thought?
>>>>
>>>> I disagree here. It is interesting to accelerate the check of WAL
>>>> availability from a source in some cases for replication, but the
>>>> opposite is true as well as mentioned by Alexey at the beginning of
>>>> the thread to reduce the number of requests when requesting WAL
>>>> archives from an external storage type AWS. Hence a correct solution
>>>> would be to check periodically for the trigger file with a maximum
>>>> one-time wait of 5s to ensure backward-compatible behavior. We could
>>>> reduce it to 1s or something like that as well.
>>>
>>> You seem to have misunderstood the code in question. Or I'm missing something.
>>> The timeout of the WaitLatch is just the interval to check for the trigger file
>>> while waiting for more WAL to arrive from streaming replication. Not related to
>>> the retry time to restore WAL from the archive.
>>
>> [Re-reading the code...]
>> Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
>> maximum of 5s then. I also noticed in previous patch that the wait was
>> maximized to 5s. To begin with, a loop should have been used if it was
>> a sleep, but as now patch uses a latch this limit does not make much
>> sense... Patch updated is attached.
>
> On second thought, the interval to check the trigger file is very different
> from the wait time to retry to retrieve WAL. So it seems strange and even
> confusing to control them by one parameter. If we really want to make
> the interval for the trigger file configurable, we should invent new GUC for it.
> But I don't think that it's worth doing that. If someone wants to react the
> trigger file more promptly for "fast" promotion, he or she basically can use
> pg_ctl promote command, instead. Thought?

Hm, OK.

> Attached is the updated version of the patch. I changed the parameter so that
> it doesn't affect the interval of checking the trigger file.
>
> -    static pg_time_t last_fail_time = 0;
> -    pg_time_t    now;
> +    TimestampTz now = GetCurrentTimestamp();
> +    TimestampTz    last_fail_time = now;
>
> I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
> to be called for each WaitForWALToBecomeAvailable().
>
> +                        WaitLatch(&XLogCtl->recoveryWakeupLatch,
> +                                  WL_LATCH_SET | WL_TIMEOUT,
> +                                  wait_time / 1000);
>
> Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

Yeah, I am wondering though why this has not been added after 89fd72cb though.

> +        {"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,
>
> WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

Sure.
-- 
Michael



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Next
From: Kevin Grittner
Date:
Subject: Re: Allow "snapshot too old" error, to prevent bloat