Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver
Date
Msg-id 7295.1489596949@sss.pgh.pa.us
In response to Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver  (Andrew Dunstan <andrew.dunstan@2ndquadrant.com>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 03/03/2017 11:11 PM, Tom Lane wrote:
>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>> would've escaped detection.

> (After a long period of fruitless empirical testing I turned to the code)
> Maybe I'm missing something, but connectDBComplete() handles a return of
> PGRES_POLLING_OK as a success while connectDBStart() seems not to. I
> don't find anywhere in our code other than libpqwalreceiver that
> actually uses that interface, so it's not surprising if it's now
> failing. So my bet is it is indeed a long-standing bug.

Meh ... that argument doesn't hold water, because the old code here called
PQconnectdbParams which is just PQconnectStartParams then
connectDBComplete.  So the problem cannot be in connectDBStart; that's
common to both paths.  It has to be some discrepancy between what
connectDBComplete does and what the new loop in libpqwalreceiver is doing.

The original loop coding in 1e8a85009 was not very close to the documented
spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
still not really the same: connectDBComplete doesn't call PQconnectPoll
until the socket is known read-ready or write-ready.  The walreceiver loop
does not guarantee that, but would make an additional call after any
random other wakeup.  It's not very clear why bowerbird, and only
bowerbird, would be seeing such wakeups --- but I'm having a really hard
time seeing any other explanation for the change in behavior.  (I wonder
whether bowerbird is telling us that WaitLatchOrSocket can sometimes
return prematurely on Windows.)
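
To make the difference concrete, here is a minimal sketch of the discipline
connectDBComplete follows: wait until the socket is reported ready in the
direction the last poll result asked for, and only then poll again. It is
not the libpq or walreceiver code. All names (PollStatus, connect_complete,
FakeConn and so on) are hypothetical stand-ins so that the sketch compiles
without libpq; in real code the poll callback would wrap PQconnectPoll and
the wait callback would wrap select() or WaitLatchOrSocket. The fake
connection delivers a few unrelated wakeups before each real readiness
event and asserts that it is never polled on one of those.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins mirroring libpq's PGRES_POLLING_* results. */
typedef enum { POLL_FAILED, POLL_READING, POLL_WRITING, POLL_OK } PollStatus;
typedef enum { WANT_READ, WANT_WRITE } WaitKind;

/* poll_fn would wrap PQconnectPoll in real code. */
typedef PollStatus (*poll_fn) (void *conn);
/* wait_fn returns true only if the socket is ready; false = other wakeup. */
typedef bool (*wait_fn) (void *conn, WaitKind want);

/* Drive a connection to completion, polling only after socket readiness. */
static bool
connect_complete(void *conn, poll_fn poll, wait_fn wait_ready)
{
    /* Start as if the last poll result had been "writing". */
    PollStatus  status = POLL_WRITING;

    for (;;)
    {
        WaitKind    want;

        if (status == POLL_OK)
            return true;
        if (status == POLL_FAILED)
            return false;

        want = (status == POLL_READING) ? WANT_READ : WANT_WRITE;

        /* A wakeup for any other reason sends us back to waiting. */
        while (!wait_ready(conn, want))
            ;

        status = poll(conn);
    }
}

/* Fake connection used only to exercise the loop above. */
typedef struct
{
    int     spurious_per_wait;  /* unrelated wakeups before each readiness */
    int     pending_spurious;   /* unrelated wakeups still to deliver */
    int     steps_left;         /* polls needed before the connection is up */
    int     polls;              /* number of poll calls observed */
    bool    ready;              /* did the last wait report readiness? */
} FakeConn;

static bool
fake_wait(void *conn, WaitKind want)
{
    FakeConn   *c = conn;

    (void) want;                /* the fake ignores the direction */
    if (c->pending_spurious > 0)
    {
        c->pending_spurious--;
        return false;           /* woke up, but the socket is not ready */
    }
    c->ready = true;
    return true;
}

static PollStatus
fake_poll(void *conn)
{
    FakeConn   *c = conn;

    assert(c->ready);           /* never polled after a spurious wakeup */
    c->polls++;
    c->ready = false;
    c->pending_spurious = c->spurious_per_wait;
    if (--c->steps_left <= 0)
        return POLL_OK;
    return (c->steps_left % 2) ? POLL_READING : POLL_WRITING;
}
```

A loop that instead called the poll callback after every wakeup would trip
the assertion in fake_poll as soon as one unrelated wakeup arrived, which
is the kind of extra call suspected above.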

I'm also pretty sure that the ResetLatch call is in the wrong place, which
could lead to missed wakeups, though that's the opposite of the immediate
problem.

I'll try correcting these things and we'll see if it gets any better.
        regards, tom lane


