Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown - Mailing list pgsql-hackers

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
>> On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
>> wrote:
>> > On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
>> >> On 19.10.2012 14:42, Amit kapila wrote:
>> >> > On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

>>> Are you planning to introduce the timeout mechanism in pg_basebackup
>>> main process? Or background process? It's useful to implement both.
>
>> By background process, you mean ReceiveXlogStream?
>> For both.
>
>> I think for background process, it can be done in a way similar to what we
>> have done for walreceiver.

> Yes.

>> But I have some doubts for how to do for main process:
>
>> Logic similar to walreceiver cannot be used in case the network goes down while
>> getting other database files from the server.
>> The reason for the same is to receive the data files PQgetCopyData() is
>> called in synchronous mode, so it keeps waiting for infinite time till it
>> gets some data.
>> In order to solve this issue, I can think of following options:
>> 1. Making this call also asynchronous (but not sure about the impact of this).

> +1

> Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
> solve the issue in the similar way to walreceiver's.
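
For illustration only (this is not taken from the patch or from walreceiver): the
asynchronous approach comes down to waiting on the connection's socket with a finite
timeout instead of blocking forever. Below is a minimal sketch, assuming POSIX poll();
wait_readable is a hypothetical helper name. With libpq, the descriptor would come from
PQsocket(conn), and once it is readable the caller would call PQconsumeInput(conn) and
retry PQgetCopyData(conn, &buf, 1).

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd is readable (or has hung up), or until
 * timeout_ms milliseconds have passed.
 *
 * Returns 1 if a read would not block, 0 on timeout, -1 on error.
 * A negative timeout_ms waits forever, which is poll()'s own convention.
 */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int           rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /*
     * Retry if a signal interrupts the wait. Note that this simple loop
     * restarts the full timeout after each interruption.
     */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;              /* poll() failed */
    if (rc == 0)
        return 0;               /* nothing arrived within the timeout */
    return 1;                   /* data (or EOF/hangup) is available */
}
```

A caller that gets 0 back can treat the server as unreachable and give up, rather than
waiting indefinitely.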

>> 2. In function pqWait, instead of passing hard-code value -1 (i.e. infinite
>> wait), we can send some finite time. This time can be received as command
>> line argument
>>     from respective utility and set the same in PGconn structure.

> Yes, I think that we should add something like --conninfo option to
> pg_basebackup
> and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
> application_name, ... by using such option accepting conninfo string.

I have prepared the attached patch to make pg_basebackup and pg_receivexlog non-blocking.
To do so, I had to add new command-line parameters to both utilities. For now I have
added two:
        a. "-r" for pg_basebackup and pg_receivexlog to take the receive timeout value. The default value is 60 sec.
        b. "-t" for pg_basebackup and pg_receivexlog to take the initial connection timeout value. The default is to wait indefinitely.
We can change this to accept --conninfo as well.

Apart from the above, I feel the remaining problem is the PQgetResult() call:
1. Wherever a query is sent from BaseBackup, it calls PQgetResult() to receive the result of the query.
    As PQgetResult() is a blocking function (it calls pqWait, which can hang), if the network is down before
    the query itself is sent, there will never be a result, so it will keep hanging in PQgetResult().
IMO, it can be solved in the following ways:
a. Create a corresponding non-blocking function. But PQgetResult() is also called from inside some
     other libpq functions (PQexec->PQexecFinish->PQgetResult), so it can be a little tricky to solve it this way.
b. Add a receive_timeout variable to the PGconn structure and use it in pqWait as the timeout whenever it is set.
c. Any other better way?


>> BTW, IIRC the walsender has no timeout mechanism during sending
>> backup data to pg_basebackup. So it's also useful to implement the
>> timeout mechanism for the walsender during backup.
>

>What about using pq_putmessage_noblock()?

I think some more functions may also need to be made non-blocking. I am still evaluating this.

I will add the attached patch to the commitfest if you don't have any objections.

More Suggestions/Comments?

With Regards,
Amit Kapila.