Thread: Should wal receiver reply to wal sender more aggressively?

Should wal receiver reply to wal sender more aggressively?

From

Paul Guo

Date:

15 June 2021, 15:39:59

While working on some related issues I found that the wal receiver
tries to call walrcv_receive() loop
before replying the write/flush/apply LSN to wal senders in
XLogWalRcvSendReply(). It is possible
that walrcv_receive() loop receives and writes a lot of xlogs, so it
does not reply those LSN
information in time, thus finally slows down those transactions due to
syncrep wait (assuming default synchronous_commit)

In my TPCB testing, I found the worst case is that 10,466,469 bytes
were consumed in the walrcv_receive() loop.

More seriously, we call XLogWalRcvSendReply(false, false) after
handling those bytes; The first
argument false means no force , i.e. it notifies unless max time of
guc wal_receiver_status_interval
value (10s by default) is reached, so we may have to wait other calls
of XLogWalRcvSendReply()
to notify the wal sender.

I thought and tried enhancing this by force-replying to the wal sender
each when receiving
a maximum bytes (e.g. 128K) but several things confused me:

- What's the purpose of guc wal_receiver_status_interval? The OS
kernel is usually not
  efficient when handling small packets but we are not replying that
aggressively so why
  is this guc there?

- I run simple TPCB (1000 scaling, 200 connections, shared_buffers,
max_connections tuned)
  but found no obvious performance difference with and without the
code change. I did not
  see obvious system (IO/CPU/network) bottleneck - probably the
bottleneck is in PG itself.
  I did not investigate further at this moment, but the change should
in theory help, no?

Another thing came to my mind is the wal receiver logic:
Currently the wal receiver process does network io, wal write, wal
flush in one process.
Network io is async, blocking at epoll/poll, wal write is mostly
non-blocking, but for wal flush,
probably we could decouple it to a dedicated process. And maybe use
sync_file_range instead
of wal file fsync in issue_xlog_fsync()? We should sync those wal
contents with lower LSN at
first and reply to the wal sender in time, right?.

Below is the related code:

                                /* See if we can read data immediately */
                                len = walrcv_receive(wrconn, &buf, &wait_fd);
                                if (len != 0)
                                {
                                        /*
                                         * Process the received data,
and any subsequent data we
                                         * can read without blocking.
                                         */
                                        for (;;)
                                        {
                                                if (len > 0)
                                                {
                                                        /*
                                                         * Something
was received from primary, so reset
                                                         * timeout
                                                         */

last_recv_timestamp = GetCurrentTimestamp();
                                                        ping_sent = false;

XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
                                                }
                                                else if (len == 0)
                                                        break;
                                                else if (len < 0)
                                                {
                                                        ereport(LOG,

 (errmsg("replication terminated by primary server"),

  errdetail("End of WAL reached on timeline %u at %X/%X.",

                    startpointTLI,

                    LSN_FORMAT_ARGS(LogstreamResult.Write))));
                                                        endofwal = true;
                                                        break;
                                                }
                                                len =
walrcv_receive(wrconn, &buf, &wait_fd);
                                        }

                                        /* Let the primary know that
we received some data. */
                                        XLogWalRcvSendReply(false, false);

                                        /*
                                         * If we've written some
records, flush them to disk and
                                         * let the startup process and
primary server know about
                                         * them.
                                         */
                                        XLogWalRcvFlush(false);

-- 
Paul Guo (Vmware)

Re: Should wal receiver reply to wal sender more aggressively?

From

Paul Guo

Date:

16 June 2021, 14:23:53

[ Resending the mail since I found my previous email has a very
  bad format that is hard to read].

While working on some related issues I found that the wal receiver
tries to call walrcv_receive() loop before replying the write/flush/apply
LSN to wal senders in XLogWalRcvSendReply(). It is possible
that walrcv_receive() loop receives and writes a lot of xlogs, so it
does not reply those LSN information in time, thus finally slows down
the transactions due to syncrep wait (assuming default
synchronous_commit)

During TPCB testing, I found the worst case is that 10,466,469 bytes
were consumed in the walrcv_receive() loop.

More seriously, we call XLogWalRcvSendReply(false, false) after
handling those bytes; The first argument false means no force ,
i.e. it notifies unless max time of guc wal_receiver_status_interval
value (10s by default) is reached, so we may have to wait for other
calls of XLogWalRcvSendReply() to notify the wal sender.

I thought and tried enhancing this by force-flushing-replying each
time when receiving a maximum bytes (e.g. 128K) but several things
confused me:

- What's the purpose of guc wal_receiver_status_interval? The OS
  kernel is usually not efficient when handling small packets but we
  are not replying that aggressively so why is this guc there?

- I run simple TPCB (1000 scaling, 200 connections, shared_buffers,
  max_connections tuned) but found no obvious performance difference
  with and without the code change. I did not see an obvious system
  IO/CPU/network) bottleneck - probably the bottleneck is in PG itself?
  I did not investigate further at this moment, but the change should in
  theory help, right? I may continue investigating but probably won't
  do this unless I have some clear answers to the confusions.

Another thing came to my mind is the wal receiver logic:
Currently the wal receiver process does network io, wal write, wal
flush in one process. Network io is async, blocking at epoll/poll, etc,
wal write is mostly non-blocking, but for wal flush, probably we could
decouple it to a dedicated process? And maybe use sync_file_range
instead of wal file fsync in issue_xlog_fsync()? We should sync those
wal contents with lower LSN at first and reply to the wal sender in
time, right?.

Below is the related code:

  /* See if we can read data immediately */
len = walrcv_receive(wrconn, &buf, &wait_fd);
if (len != 0)
{
    /*
     * Process the received data, and any subsequent data we
     * can read without blocking.
     */
    for (;;)
    {
        if (len > 0)
        {
            /*
             * Something was received from primary, so reset
             * timeout
             */
            last_recv_timestamp = GetCurrentTimestamp();
            ping_sent = false;
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
        }
        else if (len == 0)
            break;
        else if (len < 0)
        {
            ereport(LOG,
                    (errmsg("replication terminated by primary server"),
                     errdetail("End of WAL reached on timeline %u at %X/%X.",
                               startpointTLI,
                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
            endofwal = true;
            break;
        }
        len = walrcv_receive(wrconn, &buf, &wait_fd);
    }

    /* Let the primary know that we received some data. */
    XLogWalRcvSendReply(false, false);

    /*
     * If we've written some records, flush them to disk and
     * let the startup process and primary server know about
     * them.
     */
    XLogWalRcvFlush(false);