[ Resending the mail since I found my previous email has a very
bad format that is hard to read].
While working on some related issues I found that the wal receiver
tries to call walrcv_receive() loop before replying the write/flush/apply
LSN to wal senders in XLogWalRcvSendReply(). It is possible
that walrcv_receive() loop receives and writes a lot of xlogs, so it
does not reply those LSN information in time, thus finally slows down
the transactions due to syncrep wait (assuming default
synchronous_commit)
During TPCB testing, I found the worst case is that 10,466,469 bytes
were consumed in the walrcv_receive() loop.
More seriously, we call XLogWalRcvSendReply(false, false) after
handling those bytes; The first argument false means no force ,
i.e. it notifies unless max time of guc wal_receiver_status_interval
value (10s by default) is reached, so we may have to wait for other
calls of XLogWalRcvSendReply() to notify the wal sender.
I thought and tried enhancing this by force-flushing-replying each
time when receiving a maximum bytes (e.g. 128K) but several things
confused me:
- What's the purpose of guc wal_receiver_status_interval? The OS
kernel is usually not efficient when handling small packets but we
are not replying that aggressively so why is this guc there?
- I run simple TPCB (1000 scaling, 200 connections, shared_buffers,
max_connections tuned) but found no obvious performance difference
with and without the code change. I did not see an obvious system
IO/CPU/network) bottleneck - probably the bottleneck is in PG itself?
I did not investigate further at this moment, but the change should in
theory help, right? I may continue investigating but probably won't
do this unless I have some clear answers to the confusions.
Another thing came to my mind is the wal receiver logic:
Currently the wal receiver process does network io, wal write, wal
flush in one process. Network io is async, blocking at epoll/poll, etc,
wal write is mostly non-blocking, but for wal flush, probably we could
decouple it to a dedicated process? And maybe use sync_file_range
instead of wal file fsync in issue_xlog_fsync()? We should sync those
wal contents with lower LSN at first and reply to the wal sender in
time, right?.
Below is the related code:
/* See if we can read data immediately */
len = walrcv_receive(wrconn, &buf, &wait_fd);
if (len != 0)
{
/*
* Process the received data, and any subsequent data we
* can read without blocking.
*/
for (;;)
{
if (len > 0)
{
/*
* Something was received from primary, so reset
* timeout
*/
last_recv_timestamp = GetCurrentTimestamp();
ping_sent = false;
XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
}
else if (len == 0)
break;
else if (len < 0)
{
ereport(LOG,
(errmsg("replication terminated by primary server"),
errdetail("End of WAL reached on timeline %u at %X/%X.",
startpointTLI,
LSN_FORMAT_ARGS(LogstreamResult.Write))));
endofwal = true;
break;
}
len = walrcv_receive(wrconn, &buf, &wait_fd);
}
/* Let the primary know that we received some data. */
XLogWalRcvSendReply(false, false);
/*
* If we've written some records, flush them to disk and
* let the startup process and primary server know about
* them.
*/
XLogWalRcvFlush(false);