Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver
Date
Msg-id 20170317040406.l6cwm2yejsn2k6rs@alap3.anarazel.de
In response to Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver
List pgsql-hackers
On 2017-03-16 13:00:54 +0100, Petr Jelinek wrote:
> On 15/03/17 17:55, Tom Lane wrote:
> > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> >> On 03/03/2017 11:11 PM, Tom Lane wrote:
> >>> Yeah, I was wondering if this is just exposing a pre-existing bug.
> >>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
> >>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
> >>> would've escaped detection.
> > 
> >> (After a long period of fruitless empirical testing I turned to the code)
> >> Maybe I'm missing something, but connectDBComplete() handles a return of
> >> PGRES_POLLING_OK as a success while connectDBStart() seems not to. I
> >> don't find anywhere in our code other than libpqwalreceiver that
> >> actually uses that interface, so it's not surprising if it's now
> >> failing. So my bet is it is indeed a long-standing bug.
> > 
> > Meh ... that argument doesn't hold water, because the old code here called
> > PQconnectdbParams which is just PQconnectStartParams then
> > connectDBComplete.  So the problem cannot be in connectDBStart; that's
> > common to both paths.  It has to be some discrepancy between what
> > connectDBComplete does and what the new loop in libpqwalreceiver is doing.
> > 
> > The original loop coding in 1e8a85009 was not very close to the documented
> > spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
> > still not really the same: connectDBComplete doesn't call PQconnectPoll
> > until the socket is known read-ready or write-ready.  The walreceiver loop
> > does not guarantee that, but would make an additional call after any
> > random other wakeup.  It's not very clear why bowerbird, and only
> > bowerbird, would be seeing such wakeups --- but I'm having a really hard
> > time seeing any other explanation for the change in behavior.  (I wonder
> > whether bowerbird is telling us that WaitLatchOrSocket can sometimes
> > return prematurely on Windows.)
> > 
> > I'm also pretty sure that the ResetLatch call is in the wrong place which
> > could lead to missed wakeups, though that's the opposite of the immediate
> > problem.
> > 
> > I'll try correcting these things and we'll see if it gets any better.
> > 
> 
> Looks like that didn't help either.
> 
> I set up my own Windows machine and can reproduce the issue. I played
> around a bit and could not really find a fix other than adding
> WL_TIMEOUT and a short timeout to WaitLatchOrSocket (otherwise it waits
> a very long time in WaitLatchOrSocket before failing).

Hm. Could you use Process Explorer or a similar tool to see the exact
events happening?  Seeing that might help us nail this down.
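
For reference, here is a minimal sketch of the discipline Tom describes
above: only poll the connection again once the socket is ready in the
direction the previous poll asked for, and go back to waiting on any
other wakeup. It is a self-contained mock, not libpq or walreceiver
code. PollStatus, the EV_* flags, MockConn, mock_poll and connect_loop
are all hypothetical stand-ins (for the PGRES_POLLING_* values, the
wakeup flags returned by WaitLatchOrSocket, and PQconnectPoll), and the
handshake script is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for libpq's PGRES_POLLING_* values. */
typedef enum { POLL_FAILED, POLL_READING, POLL_WRITING, POLL_OK } PollStatus;

/* Hypothetical stand-ins for the wakeup flags WaitLatchOrSocket returns. */
enum { EV_LATCH = 1, EV_READABLE = 2, EV_WRITEABLE = 4 };

/* Mock connection: replays a fixed handshake, one step per poll. */
typedef struct
{
    size_t step;   /* index of the next handshake entry to return */
    int    polls;  /* how many times mock_poll() has been called */
} MockConn;

/* Invented handshake script; a real connection decides this at run time. */
static const PollStatus handshake[] = {
    POLL_READING, POLL_WRITING, POLL_READING, POLL_OK
};

/* Stand-in for PQconnectPoll: advance the handshake by one step. */
PollStatus mock_poll(MockConn *conn)
{
    size_t nsteps = sizeof handshake / sizeof handshake[0];

    assert(conn != NULL);
    conn->polls++;
    if (conn->step >= nsteps)
        return POLL_FAILED;
    return handshake[conn->step++];
}

/*
 * Drive the connection from a scripted list of wakeup masks (standing in
 * for successive WaitLatchOrSocket results).  Returns the last status seen.
 */
PollStatus connect_loop(MockConn *conn, const int *events, size_t nevents)
{
    /* Per the libpq docs, start as if the last poll returned "writing". */
    PollStatus status = POLL_WRITING;

    for (size_t i = 0; i < nevents; i++)
    {
        int wanted;

        if (status == POLL_OK || status == POLL_FAILED)
            break;

        wanted = (status == POLL_READING) ? EV_READABLE : EV_WRITEABLE;

        /* Latch-only or wrong-direction wakeup: do not poll, wait again. */
        if ((events[i] & wanted) == 0)
            continue;

        status = mock_poll(conn);
    }
    return status;
}
```

Fed a wakeup sequence that mixes latch-only events with socket events,
the loop polls once per matching socket event and skips the rest. An
ungated loop would poll on every wakeup instead, which is the difference
from connectDBComplete pointed out above.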

Greetings,

Andres Freund


