On 15/03/17 17:55, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 03/03/2017 11:11 PM, Tom Lane wrote:
>>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>>> would've escaped detection.
>
>> (After a long period of fruitless empirical testing I turned to the code)
>> Maybe I'm missing something, but connectDBComplete() handles a return of
>> PGRESS_POLLING_OK as a success while connectDBStart() seems not to. I
>> don't find anywhere in our code other than libpqwalreceiver that
>> actually uses that interface, so it's not surprising if it's now
>> failing. So my bet is it is indeed a long-standing bug.
>
> Meh ... that argument doesn't hold water, because the old code here called
> PQconnectdbParams which is just PQconnectStartParams then
> connectDBComplete. So the problem cannot be in connectDBStart; that's
> common to both paths. It has to be some discrepancy between what
> connectDBComplete does and what the new loop in libpqwalreceiver is doing.
>
> The original loop coding in 1e8a85009 was not very close to the documented
> spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
> still not really the same: connectDBComplete doesn't call PQconnectPoll
> until the socket is known read-ready or write-ready. The walreceiver loop
> does not guarantee that, but would make an additional call after any
> random other wakeup. It's not very clear why bowerbird, and only
> bowerbird, would be seeing such wakeups --- but I'm having a really hard
> time seeing any other explanation for the change in behavior. (I wonder
> whether bowerbird is telling us that WaitLatchOrSocket can sometimes
> return prematurely on Windows.)
>
> I'm also pretty sure that the ResetLatch call is in the wrong place which
> could lead to missed wakeups, though that's the opposite of the immediate
> problem.
>
> I'll try correcting these things and we'll see if it gets any better.
>
Looks like that didn't help either.
I setup my own Windows machine and can reproduce the issue. I played
around a bit and could not really find a fix other than adding
WL_TIMEOUT and short timeout to WaitLatchOrSocket (it does wait a very
long time on the WaitLatchOrSocket otherwise before failing).
So I wonder if this is the same issue that caused us using different
coding for WaitLatchOrSocket in pgstat.c (lines ~3918-3940).
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services