Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code |
Date | |
Msg-id | CAB7nPqR8zmtL-vN+8ksVW0XOUR3vXAAyJmV9R2eyr-2nXC-E_g@mail.gmail.com Whole thread Raw |
In response to | Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code (Fujii Masao <masao.fujii@gmail.com>) |
Responses |
Re: Patch: add recovery_timeout option to control timeout
of restore_command nonzero status code
|
List | pgsql-hackers |
On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote: >>>>> - * Wait for more WAL to arrive. Time out after 5 seconds, >>>>> - * like when polling the archive, to react to a trigger >>>>> - * file promptly. >>>>> + * Wait for more WAL to arrive. Time out after >>>>> the amount of >>>>> + * time specified by wal_retrieve_retry_interval, like >>>>> + * when polling the archive, to react to a >>>>> trigger file promptly. >>>>> */ >>>>> WaitLatch(&XLogCtl->recoveryWakeupLatch, >>>>> WL_LATCH_SET | WL_TIMEOUT, >>>>> - 5000L); >>>>> + wal_retrieve_retry_interval * 1000L); >>>>> >>>>> This change can prevent the startup process from reacting to >>>>> a trigger file. Imagine the case where the large interval is set >>>>> and the user want to promote the standby by using the trigger file >>>>> instead of pg_ctl promote. I think that the sleep time should be 5s >>>>> if the interval is set to more than 5s. Thought? >>>> >>>> I disagree here. It is interesting to accelerate the check of WAL >>>> availability from a source in some cases for replication, but the >>>> opposite is true as well as mentioned by Alexey at the beginning of >>>> the thread to reduce the number of requests when requesting WAL >>>> archives from an external storage type AWS. Hence a correct solution >>>> would be to check periodically for the trigger file with a maximum >>>> one-time wait of 5s to ensure backward-compatible behavior. We could >>>> reduce it to 1s or something like that as well. >>> >>> You seem to have misunderstood the code in question. Or I'm missing something. >>> The timeout of the WaitLatch is just the interval to check for the trigger file >>> while waiting for more WAL to arrive from streaming replication. Not related to >>> the retry time to restore WAL from the archive. >> >> [Re-reading the code...] >> Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a >> maximum of 5s then. I also noticed in previous patch that the wait was >> maximized to 5s. To begin with, a loop should have been used if it was >> a sleep, but as now patch uses a latch this limit does not make much >> sense... Patch updated is attached. > > On second thought, the interval to check the trigger file is very different > from the wait time to retry to retrieve WAL. So it seems strange and even > confusing to control them by one parameter. If we really want to make > the interval for the trigger file configurable, we should invent new GUC for it. > But I don't think that it's worth doing that. If someone wants to react the > trigger file more promptly for "fast" promotion, he or she basically can use > pg_ctl promote command, instead. Thought? Hm, OK. > Attached is the updated version of the patch. I changed the parameter so that > it doesn't affect the interval of checking the trigger file. > > - static pg_time_t last_fail_time = 0; > - pg_time_t now; > + TimestampTz now = GetCurrentTimestamp(); > + TimestampTz last_fail_time = now; > > I reverted the code here as it was. I don't think GetCurrentTimestamp() needs > to be called for each WaitForWALToBecomeAvailable(). > > + WaitLatch(&XLogCtl->recoveryWakeupLatch, > + WL_LATCH_SET | WL_TIMEOUT, > + wait_time / 1000); > > Don't we need to specify WL_POSTMASTER_DEATH flag here? Added. Yeah, I am wondering though why this has not been added after 89fd72cb though. > + {"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS, > > WAL_SETTINGS should be REPLICATION_STANDBY? Changed. Sure. -- Michael
pgsql-hackers by date: