Re: Why is src/test/modules/committs/t/002_standby.pl flaky? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date
Msg-id CA+hUKG+o_QzN1LrEPGJLa6JiZU4DpLRybkwZjdCErs2qsB6axg@mail.gmail.com
Whole thread Raw
In response to Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
List pgsql-hackers
On Tue, Jan 11, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 10.01.2022 12:40, Thomas Munro wrote:
> > This is super quick-and-dirty code (and doesn't handle some errors or
> > socket changes correctly), but does it detect the closed socket?

> Yes, it fixes the behaviour and makes the 002_standby test pass (100 of
> 100 iterations).

Thanks for testing.  That result does seem to confirm the hypothesis
that FD_CLOSE is reported only once for the socket on graceful
shutdown (that is, it's edge-triggered and incidentally you won't get
FD_READ), so you need to keep track of it carefully.  Incidentally,
another observation is that your WSAPoll() test appears to be
returning POLLHUP where at least Linux, FreeBSD and Solaris would not:
a socket that is only half shut down (the primary shut down its end
gracefully, but walreceiver did not), so I suspect Windows' POLLHUP
might have POLLRDHUP semantics.

> I'm yet to find out whether the other
> WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the
> disconnected socket state, but this approach definitely works for
> walreceiver.

I see where you're going: there might be safe call sequences and
unsafe call sequences, and maybe walreceiver is asking for trouble by
double-polling.  I'm not sure about that; I got the impression
recently that it's possible to get FD_CLOSE while you still have
buffered data to read, so then the next recv() will return > 0 and
then we don't have any state left anywhere to remember that we saw
FD_CLOSE, even if you're careful to poll and read in the ideal
sequence.  I could be wrong, and it would be nice if there is an easy
fix along those lines...  The documentation around FD_CLOSE is
unclear.

I do plan to make a higher quality patch like the one I showed
(material from earlier unfinished work[1] that needs a bit more
infrastructure), but to me that's new feature/efficiency work, not
something we'd want to back-patch.

Hmm, one thing I'm still unclear on: did this problem really start
with 6051857fc/ed52c3707?  My initial email in this thread lists
similar failures going back further, doesn't it?  (And what's tern
doing mixed up in this mess?)

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJPaygh-6WHEd0FnH89GrkTpVyN_ew9ckv3%2BnwjmLcSeg%40mail.gmail.com#aa33ec3e7ad85499f35dd1434a139c3f



pgsql-hackers by date:

Previous
From: Allie Crawford
Date:
Subject: Re: [Ext:] Re: Stream Replication not working
Next
From: Andrew Dunstan
Date:
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?