On Tue, Jan 11, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 10.01.2022 12:40, Thomas Munro wrote:
> > This is super quick-and-dirty code (and doesn't handle some errors or
> > socket changes correctly), but does it detect the closed socket?
> Yes, it fixes the behaviour and makes the 002_standby test pass (100 of
> 100 iterations).
Thanks for testing. That result does seem to confirm the hypothesis
that FD_CLOSE is reported only once for the socket on graceful
shutdown (that is, it's edge-triggered and incidentally you won't get
FD_READ), so you need to keep track of it carefully. Incidentally,
another observation is that your WSAPoll() test appears to be
returning POLLHUP where at least Linux, FreeBSD and Solaris would not:
a socket that is only half shut down (the primary shut down its end
gracefully, but walreceiver did not), so I suspect Windows' POLLHUP
might have POLLRDHUP semantics.
> I'm yet to find out whether the other
> WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the
> disconnected socket state, but this approach definitely works for
> walreceiver.
I see where you're going: there might be safe call sequences and
unsafe call sequences, and maybe walreceiver is asking for trouble by
double-polling. I'm not sure about that; I got the impression
recently that it's possible to get FD_CLOSE while you still have
buffered data to read, so then the next recv() will return > 0 and
then we don't have any state left anywhere to remember that we saw
FD_CLOSE, even if you're careful to poll and read in the ideal
sequence. I could be wrong, and it would be nice if there is an easy
fix along those lines... The documentation around FD_CLOSE is
unclear.
I do plan to make a higher quality patch like the one I showed
(material from earlier unfinished work[1] that needs a bit more
infrastructure), but to me that's new feature/efficiency work, not
something we'd want to back-patch.
Hmm, one thing I'm still unclear on: did this problem really start
with 6051857fc/ed52c3707? My initial email in this thread lists
similar failures going back further, doesn't it? (And what's tern
doing mixed up in this mess?)
[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJPaygh-6WHEd0FnH89GrkTpVyN_ew9ckv3%2BnwjmLcSeg%40mail.gmail.com#aa33ec3e7ad85499f35dd1434a139c3f