Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API inlibpqwalreceiver - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API inlibpqwalreceiver
Date
Msg-id 5eff6c57-eaca-c721-339b-3b621d87ce75@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API inlibpqwalreceiver  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 15/03/17 17:55, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 03/03/2017 11:11 PM, Tom Lane wrote:
>>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>>> would've escaped detection.
> 
>> (After a long period of fruitless empirical testing I turned to the code)
>> Maybe I'm missing something, but connectDBComplete() handles a return of
>> PGRESS_POLLING_OK as a success while connectDBStart() seems not to. I
>> don't find anywhere in our code other than libpqwalreceiver that
>> actually uses that interface, so it's not surprising if it's now
>> failing. So my bet is it is indeed a long-standing bug.
> 
> Meh ... that argument doesn't hold water, because the old code here called
> PQconnectdbParams which is just PQconnectStartParams then
> connectDBComplete.  So the problem cannot be in connectDBStart; that's
> common to both paths.  It has to be some discrepancy between what
> connectDBComplete does and what the new loop in libpqwalreceiver is doing.
> 
> The original loop coding in 1e8a85009 was not very close to the documented
> spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
> still not really the same: connectDBComplete doesn't call PQconnectPoll
> until the socket is known read-ready or write-ready.  The walreceiver loop
> does not guarantee that, but would make an additional call after any
> random other wakeup.  It's not very clear why bowerbird, and only
> bowerbird, would be seeing such wakeups --- but I'm having a really hard
> time seeing any other explanation for the change in behavior.  (I wonder
> whether bowerbird is telling us that WaitLatchOrSocket can sometimes
> return prematurely on Windows.)
> 
> I'm also pretty sure that the ResetLatch call is in the wrong place which
> could lead to missed wakeups, though that's the opposite of the immediate
> problem.
> 
> I'll try correcting these things and we'll see if it gets any better.
> 

Looks like that didn't help either.

I setup my own Windows machine and can reproduce the issue. I played
around a bit and could not really find a fix other than adding
WL_TIMEOUT and short timeout to WaitLatchOrSocket (it does wait a very
long time on the WaitLatchOrSocket otherwise before failing).

So I wonder if this is the same issue that caused us using different
coding for WaitLatchOrSocket in pgstat.c (lines ~3918-3940).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Peter van Hardenberg
Date:
Subject: [HACKERS] Error handling in transactions
Next
From: Michael Paquier
Date:
Subject: Re: [HACKERS] Speedup twophase transactions