Sending unflushed WAL in physical replication - Mailing list pgsql-hackers

From Rahila Syed
Subject Sending unflushed WAL in physical replication
Date
Msg-id CAH2L28tHzvZgtL7MHDK86Rzz56f+74mgZo-uKQNJHob7_JDb-w@mail.gmail.com
List pgsql-hackers
Hi,

Please find attached a POC patch that introduces changes to the WAL sender and
receiver, allowing WAL records to be sent to standbys before they are flushed
to disk on the primary during physical replication. This is intended to lower
replication latency by reducing the amount of WAL the WAL sender reads from disk.
For large transactions, this approach ensures that the bulk of the transaction’s
WAL records are already sent to the standby before the flush occurs on the primary.
As a result, the flushes on the primary and the standby happen closer together,
reducing replication lag.

Observations from the benchmark:
1. The patch improves TPS by up to ~13% in the sync replication setup. In repeated
runs, the TPS increase ranges from 5% to 13%.
2. The WAL sender reads significantly less WAL from disk, indicating more efficient
use of WAL buffers and reduced disk I/O.

Following are some of the details of the implementation:

1. Primary does not wait for flush before starting to send data, so it is likely to
send smaller chunks of data. To prevent network overload, changes are made to
avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication protocol
messages, so the standby knows up to which point WAL has been safely flushed
on the primary.
3. The logic ensures that standbys do not apply transactions that have not
been flushed on the primary, by updating the flushedUpto position on the standby
only up to the flushPtr received from the primary.
4. WAL records received from the primary are written and can be flushed to disk on the
standby, but are only marked as flushed up to the flushPtr reported by the primary.
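
The standby-side rule in items 2 to 4 can be modelled as a simple clamp. The following is a minimal Python sketch of that rule, not the patch's C code: the names WalMessage, StandbyState, primary_flush_ptr and on_wal_message are hypothetical, and LSNs are treated as plain integers.

```python
from dataclasses import dataclass


@dataclass
class WalMessage:
    """Hypothetical WAL data message; field names are illustrative only."""
    start_lsn: int          # LSN of the first byte in `data`
    data: bytes             # WAL bytes, possibly not yet flushed on the primary
    primary_flush_ptr: int  # primary's flush pointer when the message was sent


@dataclass
class StandbyState:
    """Hypothetical receiver-side bookkeeping."""
    written_upto: int = 0   # WAL written (and possibly fsynced) locally
    flushed_upto: int = 0   # position treated as flushed, hence safe to replay


def on_wal_message(state: StandbyState, msg: WalMessage) -> None:
    # Accept the bytes locally even if the primary has not flushed them yet.
    state.written_upto = max(state.written_upto, msg.start_lsn + len(msg.data))
    # Advance the flushed position only as far as the primary has flushed,
    # never past what is present locally, and never backwards.
    safe = min(state.written_upto, msg.primary_flush_ptr)
    state.flushed_upto = max(state.flushed_upto, safe)


state = StandbyState()
# 100 bytes arrive, but the primary has flushed only up to LSN 40.
on_wal_message(state, WalMessage(start_lsn=0, data=b"x" * 100, primary_flush_ptr=40))
first = (state.written_upto, state.flushed_upto)   # (100, 40)
# 50 more bytes arrive and the primary reports everything flushed.
on_wal_message(state, WalMessage(start_lsn=100, data=b"y" * 50, primary_flush_ptr=150))
second = (state.written_upto, state.flushed_upto)  # (150, 150)
```

In this model the standby may hold WAL beyond flushed_upto, but replay would stop at flushed_upto, which mirrors the intent described in item 3.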

Benchmark details are as follows:
Synchronous replication with remote write enabled.
Two Azure VMs: Central India (primary), Central US (standby).
OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).

With the patch:
TPS: 115
WAL read from disk by WAL sender: ~40MB (read bytes from pg_stat_io)
WAL generated during the test: 772705760 bytes

Without the patch:
TPS: 102
WAL read from disk by WAL sender: ~79MB (read bytes from pg_stat_io)
WAL generated during the test: 760060792 bytes
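
As a quick sanity check on these figures, the relative numbers can be derived with a few lines of Python. This is only arithmetic on the values reported above; treating "MB" as 10**6 bytes is an assumption, and the ~40MB and ~79MB inputs are themselves approximate.

```python
# Figures copied from the benchmark results above.
tps_patched, tps_unpatched = 115, 102
wal_bytes_patched, wal_bytes_unpatched = 772_705_760, 760_060_792

# "~40MB" and "~79MB" are approximate; 1 MB = 10**6 bytes is assumed here.
read_bytes_patched, read_bytes_unpatched = 40 * 10**6, 79 * 10**6

# Relative TPS gain of the patched run over the unpatched run, in percent.
tps_gain_pct = (tps_patched / tps_unpatched - 1) * 100

# Share of the generated WAL that the WAL sender had to read back from disk.
read_share_patched = read_bytes_patched / wal_bytes_patched * 100
read_share_unpatched = read_bytes_unpatched / wal_bytes_unpatched * 100

print(f"TPS gain: {tps_gain_pct:.1f}%")
print(f"WAL read from disk, patched: {read_share_patched:.1f}% of generated WAL")
print(f"WAL read from disk, unpatched: {read_share_unpatched:.1f}% of generated WAL")
```

For this run the TPS gain works out to about 12.7%, and the share of generated WAL read back from disk drops from roughly 10% to roughly 5%.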

Commit hash: b1187266e0

pgbench -c 32 -j 4 postgres -T 300 -f wal_test.sql

wal_test.sql (each transaction generates ~36KB of WAL):
\set delta random(1, 500)
BEGIN;
INSERT INTO wal_bloat_:delta (data)
SELECT repeat('x', 8000)
FROM generate_series(1, 80);
COMMIT;

TODO:
1. Ensure there is a robust mechanism on the receiver to prevent WAL records
that are not flushed on primary from being applied on standby, under any
circumstances.
2. When smaller chunks of WAL are received on the standby, it can lead to more
frequent disk write operations. To mitigate this issue, employing WAL buffers
on the standby could be a more effective approach. Evaluate the performance
impact of using WAL buffers on the standby.
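
To illustrate the concern in TODO item 2, here is a minimal Python sketch of coalescing small incoming chunks into fewer, larger writes. It is a toy model under stated assumptions, not a design for standby WAL buffers: the class CoalescingWalWriter, its threshold value and the drain() method are all hypothetical.

```python
import io


class CoalescingWalWriter:
    """Hypothetical standby-side buffer that batches small WAL chunks."""

    def __init__(self, sink, threshold: int = 128 * 1024):
        self.sink = sink            # any object with a write(bytes) method
        self.threshold = threshold  # illustrative value, not taken from the patch
        self.pending = bytearray()  # bytes received but not yet written out
        self.writes = 0             # number of write() calls issued to the sink

    def receive(self, chunk: bytes) -> None:
        # Accumulate the chunk and write only once enough data has piled up.
        self.pending += chunk
        if len(self.pending) >= self.threshold:
            self.drain()

    def drain(self) -> None:
        # Write out whatever is pending, e.g. before a flush is required.
        if self.pending:
            self.sink.write(bytes(self.pending))
            self.writes += 1
            self.pending.clear()


sink = io.BytesIO()
writer = CoalescingWalWriter(sink, threshold=8192)
for _ in range(10):
    writer.receive(b"x" * 1024)  # ten 1 KiB chunks from the primary
writer.drain()                   # force out the remainder
```

With these inputs the ten small chunks reach the sink in two write calls instead of ten. Whether such buffering helps in practice is exactly what the TODO item proposes to measure.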

A similar idea was proposed here:
Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
This idea was also discussed recently here:
https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com

Kindly let me know your thoughts.

Thank you,
Rahila Syed
