Hi,
Please find attached a POC patch that introduces changes to the WAL sender and
receiver, allowing WAL records to be sent to standbys before they are flushed
to disk on the primary during physical replication. This is intended to improve
replication latency by reducing the amount of WAL read from disk.
For large transactions, this approach ensures that the bulk of the transaction’s
WAL records are already sent to the standby before the flush occurs on the primary.
As a result, the flush on the primary and standby happen closer together,
reducing replication lag.
At a high level, the idea LGTM.
Observations from the benchmark:
1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs,
the TPS increase ranges from 5% to 13%.
2. WAL sender reads significantly less WAL from disk, indicating more efficient use
of WAL buffers and reduced disk I/O.
Can you please measure the transaction commit latency improvement as well?
Commit latency = Primary_Disk_Flush_time + Standby_disk_flush_time + network_roundtrip_time
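To make the requested measurement concrete, here is a minimal sketch of the
breakdown I have in mind. The struct and function names are hypothetical and
are not taken from the patch; it only restates the formula above so that each
component can be reported separately for patched and unpatched runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-commit latency breakdown, in milliseconds. */
typedef struct CommitLatencyMs
{
    double primary_disk_flush;  /* time for the primary to flush the commit WAL */
    double standby_disk_flush;  /* time for the standby to flush the same WAL */
    double network_roundtrip;   /* send to standby plus acknowledgement back */
} CommitLatencyMs;

/* Sum of the three components, exactly as in the formula above. */
static double
commit_latency_ms(const CommitLatencyMs *c)
{
    assert(c != NULL);
    return c->primary_disk_flush + c->standby_disk_flush + c->network_roundtrip;
}
```

Reporting the three terms individually would show which one the patch actually
shrinks.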
Following are some of the details of the implementation:
1. Primary does not wait for flush before starting to send data, so it is likely to
send smaller chunks of data. To prevent network overload, changes are made to
avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication protocol
messages, so the standby knows up to which point WAL has been safely flushed
on the primary.
3. The logic ensures that standbys do not apply transactions that have not
been flushed on the primary, by updating the flushedUpto position on the standby
only up to the flushPtr received from the primary.
4. WAL records received from the primary are written and can be flushed to disk on the
standby, but are only marked as flushed up to the flushPtr reported by the primary.
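To check my understanding of points 2 to 4, here is a minimal sketch of the
standby-side rule as I read it. All names below (StandbyWalState,
note_primary_flush, advertised_flushed_upto) are made up for illustration and
are not the patch's identifiers; the only assumption is that the standby tracks
its own durable position and the latest flushPtr received from the primary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* same underlying type PostgreSQL uses for LSNs */

/* Hypothetical standby-side bookkeeping. */
typedef struct StandbyWalState
{
    XLogRecPtr written_upto;        /* WAL received and written locally */
    XLogRecPtr local_flushed_upto;  /* WAL durably on the standby's disk */
    XLogRecPtr primary_flush_ptr;   /* latest flushPtr sent by the primary */
} StandbyWalState;

/* Record a flushPtr from a protocol message; never move it backwards. */
static void
note_primary_flush(StandbyWalState *st, XLogRecPtr flush_ptr)
{
    assert(st != NULL);
    if (flush_ptr > st->primary_flush_ptr)
        st->primary_flush_ptr = flush_ptr;
}

/*
 * Position the standby may treat as flushed (and therefore replay up to):
 * the lesser of what is durable locally and what is durable on the primary.
 */
static XLogRecPtr
advertised_flushed_upto(const StandbyWalState *st)
{
    assert(st != NULL);
    return st->local_flushed_upto < st->primary_flush_ptr
        ? st->local_flushed_upto
        : st->primary_flush_ptr;
}
```

With this rule the on-disk WAL on the standby can be ahead of the advertised
position, which is what my questions below are about.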
What happens in crash recovery scenarios? For example, when a standby restarts after
a crash, it replays until the end of WAL. In this case, it may end up replaying WAL that
was never flushed on the primary (if the primary also goes through crash recovery).
Shouldn't archiving on the standby also avoid uploading WAL that has not yet been
flushed on the primary? The same applies to pg_receivewal.
Thanks,
Satya