Unnecessary delay in streaming replication due to replay lag - Mailing list pgsql-hackers

From Asim R P
Subject Unnecessary delay in streaming replication due to replay lag
Msg-id CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ@mail.gmail.com
Responses Re: Unnecessary delay in streaming replication due to replay lag  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi

The standby does not start the walreceiver process until the startup
process finishes WAL replay.  The more WAL there is to replay, the
longer the delay in starting streaming replication.  If the replication
connection is temporarily disconnected, this delay becomes a major
problem, and we are proposing a solution to avoid it.

WAL replay is likely to fall behind when the master is processing a
write-heavy workload, because WAL is generated by concurrently running
backends on the master while a single startup process on the standby
replays WAL records in sequence as new WAL is received from the master.
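
A rough back-of-the-envelope model of the two paragraphs above, as a
sketch only.  Nothing here comes from the patchset: the function names
are hypothetical, the rates are made-up numbers, and WAL generation and
replay are assumed to proceed at constant rates.

```python
def replay_backlog(generate_rate, replay_rate, seconds):
    """WAL received but not yet replayed after `seconds` of steady load.

    Rates are in MB/s.  Assumes the standby starts fully caught up and
    that both rates stay constant, which real workloads do not.
    """
    return max(0, (generate_rate - replay_rate) * seconds)


def walreceiver_start_delay(backlog, replay_rate):
    """Seconds the startup process needs to replay `backlog` MB.

    With the current startup sequence, walreceiver is not started
    until this much time has passed.
    """
    return backlog / replay_rate


# Hypothetical numbers: many backends generate 50 MB/s of WAL, the
# single startup process replays 30 MB/s, and this lasts one minute.
backlog = replay_backlog(50, 30, 60)          # 1200 MB not yet replayed
delay = walreceiver_start_delay(backlog, 30)  # 40.0 seconds before streaming resumes
```

If the standby restarts at that point, streaming replication stays down
for the whole replay of the backlog, even though the network and the
master are ready.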

The replication connection between walsender and walreceiver may break
for reasons such as a transient network issue or a standby restart.
The delay in resuming the replication connection leads to a lack of
high availability - only one copy of WAL is available during this
period.

The problem worsens when replication is configured to be synchronous.
Commits on the master must wait until WAL replay finishes on the
standby, walreceiver is then started, and it confirms flush of WAL up
to the commit LSN.  If the synchronous_commit GUC is set to
remote_write, this behavior is equivalent to tacitly changing it to
remote_apply until the replication connection is re-established!
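
To make the remote_write vs. remote_apply point concrete, here is a toy
model, not PostgreSQL code.  The class and function names are
hypothetical and LSNs are plain integers.  The one assumption that
matters: while walreceiver is not running, the master receives no
confirmation from the standby at all.

```python
from dataclasses import dataclass


@dataclass
class StandbyReport:
    """What the standby could report to the master (toy model)."""
    walreceiver_running: bool
    write_lsn: int
    flush_lsn: int
    replay_lsn: int


# Which standby position each synchronous_commit level waits for.
REQUIRED_POSITION = {
    "remote_write": "write_lsn",
    "on": "flush_lsn",
    "remote_apply": "replay_lsn",
}


def commit_can_return(level, commit_lsn, standby):
    """True if a commit at `commit_lsn` may stop waiting for the standby."""
    if not standby.walreceiver_running:
        # Assumption: nobody confirms anything while walreceiver is down.
        return False
    return getattr(standby, REQUIRED_POSITION[level]) >= commit_lsn


# Standby restarted with WAL up to LSN 900 on disk, replayed up to 400.
# Current behavior: walreceiver stays down until replay reaches 900.
replaying = StandbyReport(False, 900, 900, 400)
blocked = commit_can_return("remote_write", 1000, replaying)   # False

# Only after the backlog is replayed does walreceiver start and
# receive the WAL for the waiting commit.
streaming = StandbyReport(True, 1000, 1000, 900)
released = commit_can_return("remote_write", 1000, streaming)  # True
```

In this model a remote_write commit ends up waiting on replay of the
backlog, which is the work remote_apply would normally be the one to
wait for.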

Has anyone encountered such a problem with streaming replication?

We propose to address this by starting walreceiver without waiting for
the startup process to finish replaying WAL.  Please see the attached
patchset.  It can be summarized as follows:

    0001 - TAP test to demonstrate the problem.

    0002 - The standby startup sequence is changed such that
           walreceiver is started by startup process before it begins
           to replay WAL.

    0003 - Postmaster starts walreceiver if it finds that a
           walreceiver process is no longer running and the state
           indicates that it is operating as a standby.

This is a POC.  We are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

Hao and Asim