Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Date
Msg-id CAB7nPqQ4iZzjDrJc5VBvHY6uG7wGe9A8pjPSEfj3H7_NfNiu9Q@mail.gmail.com
Whole thread Raw
In response to Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
List pgsql-hackers
On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
>> On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> Not this patch's fault, but I'm getting a bit tired seeing the above open
>> >> coded. How about adding a function that does the sleeping based on a
>> >> timestamptz and a ms interval?
>> > You mean in plugins, right? I don't recall seeing similar patterns in
>> > other code paths of backend. But I think that we can use something
>> > like that in timestamp.c then because we need to leverage that between
>> > two timestamps, the last failure and now():
>> > TimestampSleepDifference(start_time, stop_time, internal_ms);
>> > Perhaps you have something else in mind?
>> >
>> > Attached is an updated patch.
>
>> Actually I came with better than last patch by using a boolean flag as
>> return value of TimestampSleepDifference and use
>> TimestampDifferenceExceeds directly inside it.
>
>> Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
>>  on failure
>
> I think that name isn't a very good. And its isn't very accurate
> either.
> How about wal_retrieve_retry_interval? Not very nice, but I think it's
> still better than the above.
Fine for me.

>> +     <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
>> +      <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
>> +      <indexterm>
>> +        <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
>> +      </indexterm>
>> +      </term>
>> +      <listitem>
>> +       <para>
>> +        This parameter specifies the amount of time to wait when
>> +        WAL is not available for a node in recovery. Default value is
>> +        <literal>5s</>.
>> +       </para>
>> +       <para>
>> +        A node in recovery will wait for this amount of time if
>> +        <varname>restore_command</> returns nonzero exit status code when
>> +        fetching new WAL segment files from archive or when a WAL receiver
>> +        is not able to fetch a WAL record when using streaming replication.
>> +       </para>
>> +      </listitem>
>> +     </varlistentry>
>> +
>>      </variablelist>
>
> Walreceiver doesn't wait that amount, but rather how long the connection
> is intact. And restore_command may or may not retry.
I changed this paragraph as follows:
+        This parameter specifies the amount of time to wait when a failure
+        occurred when reading WAL from a source (be it via streaming
+        replication, local <filename>pg_xlog</> or WAL archive) for a node
+        in standby mode, or when WAL is expected from a source. Default
+        value is <literal>5s</>.
Pondering more about this patch, I am getting the feeling that it is
not that much a win to explain in details in the doc each failure
depending on the source from which WAL is taken. But it is essential
to mention that the wait is done only for a node using standby mode.

Btw, I noticed two extra things that I think should be done for consistency:
- We should use as well wal_retrieve_retry_interval when waiting for
WAL to arrive after checking for the failures:
                                        /*
-                                        * Wait for more WAL to
arrive. Time out after 5 seconds,
-                                        * like when polling the
archive, to react to a trigger
-                                        * file promptly.
+                                        * Wait for more WAL to
arrive. Time out after
+                                        * wal_retrieve_retry_interval
ms, like when polling the archive
+                                        * to react to a trigger file promptly.
                                         */
                                        WaitLatch(&XLogCtl->recoveryWakeupLatch,
                                                          WL_LATCH_SET
| WL_TIMEOUT,
-                                                         5000L);
+
wal_retrieve_retry_interval * 1L);
-

Updated patch attached.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: tim_wilson
Date:
Subject: Re: Inaccuracy in VACUUM's tuple count estimates
Next
From: Peter Geoghegan
Date:
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)