Re: Unnecessary delay in streaming replication due to replay lag - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Unnecessary delay in streaming replication due to replay lag
Msg-id 20200117053756.GI2127@paquier.xyz
In response to Unnecessary delay in streaming replication due to replay lag  (Asim R P <apraveen@pivotal.io>)
Responses Re: Unnecessary delay in streaming replication due to replay lag  (Asim R P <apraveen@pivotal.io>)
List pgsql-hackers
On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
> Standby does not start walreceiver process until startup process
> finishes WAL replay.  The more WAL there is to replay, longer is the
> delay in starting streaming replication.  If replication connection is
> temporarily disconnected, this delay becomes a major problem and we
> are proposing a solution to avoid the delay.

Yeah, that's documented:
https://www.postgresql.org/message-id/20190910062325.GD11737@paquier.xyz

> We propose to address this by starting walreceiver without waiting for
> startup process to finish replay of WAL.  Please see attached
> patchset.  It can be summarized as follows:
>
>     0001 - TAP test to demonstrate the problem.

There is no real need for debug_replay_delay because we already have
recovery_min_apply_delay, no?  That would count only after consistency
has been reached, and only for COMMIT records, but that would be
enough for your test.
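To make the two conditions above concrete, here is a rough sketch in
Python of when such a delay would apply.  This is only a model of the
behavior described here, not the server code; WalRecord and
apply_not_before are made-up names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class WalRecord:
    # Hypothetical, simplified view of a WAL record for this sketch.
    is_commit: bool
    commit_time: Optional[datetime] = None


def apply_not_before(record: WalRecord,
                     reached_consistency: bool,
                     min_apply_delay: timedelta) -> Optional[datetime]:
    """Return the earliest time the record may be replayed, or None
    if the record is not subject to any delay."""
    if min_apply_delay <= timedelta(0):
        return None          # delay disabled
    if not reached_consistency:
        return None          # delay counts only after consistency
    if not record.is_commit or record.commit_time is None:
        return None          # only COMMIT records are delayed
    return record.commit_time + min_apply_delay
```

With a delay of a few seconds, a test workload that commits regularly
would slow replay enough to show the problem, assuming the standby has
already reached consistency.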

>     0002 - The standby startup sequence is changed such that
>            walreceiver is started by startup process before it begins
>            to replay WAL.

See below.

>     0003 - Postmaster starts walreceiver if it finds that a
>            walreceiver process is no longer running and the state
>            indicates that it is operating as a standby.

I have not checked in detail, but I smell some race conditions
between the postmaster and the startup process here.

> This is a POC, we are looking for early feedback on whether the
> problem is worth solving and if it makes sense to solve it along this
> route.

You are not the first person interested in this problem.  We have a
patch registered in this CF to control the timing when a WAL receiver
is started at recovery:
https://commitfest.postgresql.org/26/1995/
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

I am pretty sure that we should not change the default behavior, which
is to start the WAL receiver after replaying everything from the
archives so as to avoid copying some WAL segments for nothing.  Being
able to use a GUC switch should be the way to go, and Konstantin's
latest patch was using this approach.  Your patch 0002 visibly adds a
third mode, start immediately, on top of the two already proposed:
- Start after replaying all WAL available locally and in the
archives.
- Start after reaching a consistent point.
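
As a rough illustration of how the three modes differ, here is a small
Python model of the decision.  The enum, its values and the function
are hypothetical names chosen for this sketch; they are not taken from
any of the patches.

```python
from enum import Enum


class WalReceiverStart(Enum):
    # Hypothetical mode names, for illustration only.
    AFTER_LOCAL_AND_ARCHIVE = "replay"   # current default behavior
    AT_CONSISTENCY = "consistency"       # once a consistent point is reached
    IMMEDIATE = "startup"                # before any WAL is replayed


def should_start_walreceiver(mode: WalReceiverStart,
                             reached_consistency: bool,
                             local_and_archive_wal_exhausted: bool) -> bool:
    """Decide whether the WAL receiver may be started in a given state."""
    if mode is WalReceiverStart.IMMEDIATE:
        return True
    if mode is WalReceiverStart.AT_CONSISTENCY:
        return reached_consistency
    # Default: wait until nothing is left locally or in the archives.
    return local_and_archive_wal_exhausted
```

The default branch is the one that avoids streaming segments that the
archives could still provide.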
--
Michael
