Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Date
Msg-id 6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs
In response to Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
On Wed, Oct 17, 2012 at 8:46 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
>> On 13.10.2012 19:35, Fujii Masao wrote:
>> > On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
>> > <hlinnakangas@vmware.com>  wrote:
>> >> Ok, thanks. Committed.
>> >
>> > I found one typo. The attached patch fixes that typo.
>>
>> Thanks, fixed.
>>
>> > ISTM you need to update the protocol.sgml because you added
>> > the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.
>
>
>>
>> > Is it worth adding the same mechanism (send back the reply immediately
>> > if walsender request a reply) into pg_basebackup and pg_receivexlog?
>>
>> Good catch. Yes, they should be taught about this too. I'll look into
>> doing that too.
>
> If you have not started and you don't have objection, I can pickup this to
> complete it.
>
> For both (pg_basebackup and pg_receivexlog), we need to get a timeout
> parameter from user in command line, as
> there is no conf file here. New Option can be -t (parameter name can be
> recvtimeout).
>
> The main changes will be in function ReceiveXlogStream(), it is a common
> function for both
> Pg_basebackup and pg_receivexlog. Handling will be done in same way as we
> have done in walreceiver.
>
> Suggestions/Comments?

>Before implementing the timeout parameter, I think that it's better to change
>both pg_basebackup background process and pg_receivexlog so that they
>send back the reply message immediately when they receive the keepalive
>message requesting the reply. Currently, they always ignore such keepalive
>message, so status interval parameter (-s) in them always must be set to
>the value less than replication timeout. We can avoid this troublesome
>parameter setting by introducing the same logic of walreceiver into both
>pg_basebackup background process and pg_receivexlog.

Please find attached a patch that makes the change you suggested (send an immediate reply when a keepalive requests one).
Both pg_basebackup and pg_receivexlog use the same function, ReceiveXlogStream(), so a single change addresses the
issue for both.
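
To make the intended behaviour concrete, here is a minimal sketch in C of the decision the receive loop has to make. It is not the attached patch: KeepaliveInfo and should_send_reply() are hypothetical names, the struct is a simplified stand-in rather than the real message layout, and the timestamps are assumed to be milliseconds from some monotonic clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a keepalive; NOT the real wire layout. */
typedef struct KeepaliveInfo
{
    bool reply_requested;   /* server asked for an immediate reply */
} KeepaliveInfo;

/*
 * Decide whether the client should send a status reply now.
 *
 * keepalive          - keepalive just received, or NULL if none arrived
 * now_ms             - current time in milliseconds (assumed monotonic)
 * last_status_ms     - time the last status reply was sent
 * status_interval_ms - periodic reporting interval (the -s option, in ms);
 *                      0 disables periodic reports
 */
bool
should_send_reply(const KeepaliveInfo *keepalive,
                  int64_t now_ms,
                  int64_t last_status_ms,
                  int64_t status_interval_ms)
{
    assert(status_interval_ms >= 0);

    /* A keepalive that requests a reply is answered immediately. */
    if (keepalive != NULL && keepalive->reply_requested)
        return true;

    /* Otherwise fall back to the periodic status interval, if enabled. */
    if (status_interval_ms > 0 &&
        now_ms - last_status_ms >= status_interval_ms)
        return true;

    return false;
}
```

With a rule like this, a requested reply goes out immediately whatever the -s interval is, so -s no longer has to be kept below the replication timeout for that purpose.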


Now, regarding the introduction of a timeout in pg_basebackup and pg_receivexlog:
We can use a mechanism similar to the walreceiver timeout while streaming WAL data from the server, but the same
logic cannot be used if the network goes down while the other database files are being fetched from the server.
The reason is that, to receive the data files, PQgetCopyData() is called in synchronous mode, so it keeps waiting
indefinitely until it gets some data.
To solve this issue, I can think of the following options:
1. Make this call asynchronous as well (but I am not sure about the impact of this).
2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite wait), pass some finite time. This
    time can be received as a command line argument from the respective utility and set in the PGconn structure.
    To have the timeout value in PGconn, we can either:
        a. Add a new parameter to PGconn to indicate the receive timeout.
        b. Use the existing parameter connect_timeout for the receive timeout as well, but this may lead to confusion.
3. Any other better option?
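
As an illustration of option 2, the difference between an infinite and a finite wait can be sketched with plain poll() in C. This does not touch libpq: wait_readable() and demo_pipe_wait() are hypothetical helpers, and a pipe stands in for the connection socket. In libpq the same idea would presumably apply to the descriptor returned by PQsocket(), but that is an assumption, not something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd has something to report or timeout_ms elapses.
 * Returns 1 if fd is readable (or has an error/hangup to report),
 * 0 on timeout, -1 on failure.
 *
 * This sketch deliberately refuses the infinite wait (-1), which is the
 * behaviour the discussion wants to avoid.
 */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(timeout_ms >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Simplification: a retry after EINTR restarts the full timeout. */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/*
 * Demo helper: a pipe stands in for the network connection.
 * If write_first is nonzero, one byte is made available before waiting.
 */
int
demo_pipe_wait(int write_first, int timeout_ms)
{
    int fds[2];
    int result;

    if (pipe(fds) != 0)
        return -1;

    if (write_first && write(fds[1], "x", 1) != 1)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    result = wait_readable(fds[0], timeout_ms);

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A caller that gets 0 back can treat the connection as dead and report an error instead of blocking forever, which is what a receive timeout taken from the command line would make possible.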

Apart from the above issue, if the network goes down while connecting, the utility might hang, because
connect_timeout is NULL by default and connectDBComplete will wait indefinitely for the connection to
succeed.
So shall we have a separate command line argument for this as well, or handle it some other way, as you suggest?

Suggestions/Comments?

With Regards,
Amit Kapila.
