Hi,
I have observed the following behavior with PostgreSQL 13.3.
The WAL sender process sends approximately 500 keepalive messages per second to pg_recvlogical.
These keepalive messages are totally un-necessary.
Keepalives should be sent only if there is no network traffic and a certain time (half of wal_sender_timeout) passes.
These keepalive messages not only choke the network but also impact the performance of the receiver,
because the receiver has to process the received message and then decide whether to reply to it or not.
The receiver remains busy doing this activity 500 times a second.
On investigation it is revealed that the following code fragment in function WalSndWaitForWal in file walsender.c is responsible for sending these frequent keepalives:
if (MyWalSnd->flush < sentPtr &&
MyWalSnd->write < sentPtr &&
!waiting_for_ping_response)
WalSndKeepalive(false);
waiting_for_ping_response is normally false, and flush and write will always be less than sentPtr (Receiver's LSNs cannot advance server's LSNs)
Here are the steps to reproduce:
1. Start the database server.
2. Setup pgbench tables.
./pgbench -i -s 50 -h 192.168.5.140 -p 7654 -U abbas postgres
3. Create a logical replication slot.
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');
4. Start pg_recvlogical.
./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas --start -f -
5. Run pgbench
./pgbench -U abbas -h 192.168.5.140 -p 7654 -c 2 -j 2 -T 1200 -n postgres
6. Observer network traffic to find the keepalive flood.
Alternately modify the above code fragment to see approx 500 keepalive log messages a second
if (MyWalSnd->flush < sentPtr && MyWalSnd->write < sentPtr && !waiting_for_ping_response){ elog(LOG, "[Keepalive] wrt ptr %X/%X snt ptr %X/%X ", (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write, (uint32) (sentPtr >> 32), (uint32) sentPtr); WalSndKeepalive
(false);}Opinions?
--
--
Abbas
Senior Architect
Skype ID: gabbasb
Follow us on Twitter
@EnterpriseDB