Thread: walreceiver that is behind doesn't quit, send replies

From
Andres Freund
Date:
Hi,

There are no interrupt checks in the WalReceiverMain() sub-loop for
receiving WAL. There's one above

                /* See if we can read data immediately */
                len = walrcv_receive(wrconn, &buf, &wait_fd);

but none in the loop below:

                    /*
                     * Process the received data, and any subsequent data we
                     * can read without blocking.
                     */
                    for (;;)

Similarly, that inner loop doesn't send status updates or fsync while
there's network data - but that matters a bit less, because we'll
send status updates upon request, and flush WAL at segment boundaries.

This may explain why a low-ish wal_sender_timeout /
wal_receiver_status_interval combo still sees plenty of timeouts.

I suspect this is a lot easier to hit when the IO system on the standby
is the bottleneck (with the kernel slowing us down inside the
pg_pwrite()), because that makes it easier to always have incoming
network data.

It's probably not a good idea to just remove that two-level loop - we
don't want to fsync at a much higher rate. But just putting a
ProcessWalRcvInterrupts() in the inner loop also seems unsatisfying: we
should respect wal_receiver_status_interval as well...
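
To make the shape of such a change concrete, here is a minimal standalone
sketch. It is a toy model, not walreceiver code and not a proposed patch:
every name in it (ToyReceiver, toy_maybe_send_reply, toy_inner_loop) is
made up for illustration, time is a simulated millisecond counter, and the
"interrupt" is just a flag. The idea it shows is that the inner loop keeps
draining available data without fsyncing more often, but checks for a
pending shutdown on every iteration and sends a reply only once the status
interval has elapsed.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in PostgreSQL. */
typedef struct ToyReceiver
{
    long now_ms;              /* simulated clock */
    long last_reply_ms;       /* simulated time of the last status reply */
    long status_interval_ms;  /* stand-in for wal_receiver_status_interval */
    long write_cost_ms;       /* simulated time spent in each WAL write */
    int  chunks_pending;      /* data that can be read without blocking */
    int  chunks_written;
    int  replies_sent;
    bool shutdown_requested;  /* stand-in for a pending interrupt */
} ToyReceiver;

/* Send a status reply only if the status interval has elapsed. */
static void
toy_maybe_send_reply(ToyReceiver *r)
{
    if (r->now_ms - r->last_reply_ms >= r->status_interval_ms)
    {
        r->last_reply_ms = r->now_ms;
        r->replies_sent++;
    }
}

/*
 * Inner receive loop: consume data for as long as some is available, but
 * look at the interrupt flag and the reply deadline on every iteration.
 *
 * shutdown_after_chunks simulates a shutdown request arriving once that
 * many chunks have been written; pass -1 for "never".
 *
 * Returns false if the loop was left because of a shutdown request.
 */
static bool
toy_inner_loop(ToyReceiver *r, int shutdown_after_chunks)
{
    while (r->chunks_pending > 0)
    {
        if (shutdown_after_chunks >= 0 &&
            r->chunks_written >= shutdown_after_chunks)
            r->shutdown_requested = true;

        /* The missing interrupt check: react even while data keeps coming. */
        if (r->shutdown_requested)
            return false;

        /* "Write" one chunk; a slow IO system makes this step expensive. */
        r->chunks_pending--;
        r->chunks_written++;
        r->now_ms += r->write_cost_ms;

        /* Rate-limited reply, so the sender does not time out. */
        toy_maybe_send_reply(r);
    }
    return true;
}
```

With a 1000ms interval and 10ms per write, 500 pending chunks produce one
reply per simulated second instead of none until the loop ends, and a
shutdown request is noticed before the next write rather than after the
backlog has drained.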


I've a couple times gotten into a situation where I was shutting down
the primary while the standby was behind, and the system appeared to
just lock up, with neither primary nor standby reacting to normal
shutdown attempts. This seems to happen more often with larger wal
segment size...

Greetings,

Andres Freund



Re: walreceiver that is behind doesn't quit, send replies

From
Andres Freund
Date:
Hi,

On 2021-05-10 19:27:55 -0700, Andres Freund wrote:
> I've a couple times gotten into a situation where I was shutting down
> the primary while the standby was behind, and the system appeared to
> just lock up, with neither primary nor standby reacting to normal
> shutdown attempts. This seems to happen more often with larger wal
> segment size...

Ah - to reproduce it, you can put a pg_usleep(10000) or so above the
pg_pwrite() in XLogWalRcvWrite(). That triggers it fairly reliably for
me.

Greetings,

Andres Freund