Re: Unnecessary delay in streaming replication due to replay lag - Mailing list pgsql-hackers

From Asim R P
Subject Re: Unnecessary delay in streaming replication due to replay lag
Msg-id CANXE4TewY1WNgu5J5ek38RD+2m9F2K=fgbWubjv9yG0BeyFxRQ@mail.gmail.com
In response to Re: Unnecessary delay in streaming replication due to replay lag  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Unnecessary delay in streaming replication due to replay lag  (Asim Praveen <pasim@vmware.com>)
List pgsql-hackers
On Fri, Jan 17, 2020 at 11:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
> >
> >     0001 - TAP test to demonstrate the problem.
>
> There is no real need for debug_replay_delay because we have already
> recovery_min_apply_delay, no?  That would count only after consistency
> has been reached, and only for COMMIT records, but your test would be
> enough with that.
>

Indeed, we didn't know about recovery_min_apply_delay.  Thank you for
the suggestion, the updated test is attached.
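
As a rough illustration of the behavior described above, here is a toy
Python model, not PostgreSQL code.  The function name and parameters are
hypothetical; the only rules it encodes are the two stated in the quoted
text: the delay counts only after consistency has been reached, and only
for COMMIT records.

```python
def apply_wait_seconds(record_type, commit_time, now,
                       reached_consistency, min_apply_delay):
    """Seconds a modeled startup process waits before replaying a record.

    Hypothetical model: times are plain floats in seconds, and
    record_type is a string such as "COMMIT" or "INSERT".
    """
    if not reached_consistency:
        return 0.0          # no delay before the consistent point
    if record_type != "COMMIT":
        return 0.0          # only commit records are delayed
    # Wait until the commit is at least min_apply_delay old.
    return max(0.0, commit_time + min_apply_delay - now)
```

With a delay of a few seconds, a test can build up replay lag on the
standby without a dedicated debug setting.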

>
> > This is a POC, we are looking for early feedback on whether the
> > problem is worth solving and if it makes sense to solve it along this
> > route.
>
> You are not the first person interested in this problem, we have a
> patch registered in this CF to control the timing when a WAL receiver
> is started at recovery:
> https://commitfest.postgresql.org/26/1995/
> https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru
>

Great to know about this patch and the discussion.  The test case and
the part of our patch that saves the next start point in the control
file can be combined with Konstantin's patch to solve this problem.
Let me work on that.

> I am pretty sure that we should not change the default behavior to
> start the WAL receiver after replaying everything from the archives to
> avoid copying some WAL segments for nothing, so being able to use a
> GUC switch should be the way to go, and Konstantin's latest patch was
> using this approach.  Your patch 0002 adds visibly a third mode: start
> immediately on top of the two ones already proposed:
> - Start after replaying all WAL available locally and in the
> archives.
> - Start after reaching a consistent point.

A consistent point should be reached fairly quickly, even with a large
replay lag.  The minimum recovery point is updated during XLOG flush,
which happens when a commit record is replayed, and commits should
occur frequently in the WAL stream.  So I do not see much value in
starting the WAL receiver immediately as compared to starting it after
reaching a consistent point.  Does that make sense?
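
To make the three candidate modes concrete, here is a minimal Python
sketch of the start decision.  It is a model only: the enum, its values,
and the function are hypothetical names, not the GUC or code from either
patch.  It also assumes that the "after consistency" mode falls back to
starting the receiver once local and archived WAL runs out.

```python
from enum import Enum


class StartMode(Enum):
    # Hypothetical labels for the three modes discussed in this thread.
    AFTER_ALL_LOCAL_AND_ARCHIVE = 1   # current default behavior
    AFTER_CONSISTENCY = 2
    IMMEDIATE = 3


def should_start_walreceiver(mode, reached_consistency, local_wal_exhausted):
    """Return True if the modeled startup process should launch the receiver."""
    if mode is StartMode.IMMEDIATE:
        return True
    if mode is StartMode.AFTER_CONSISTENCY:
        # Assumption: still start once nothing is left to replay locally.
        return reached_consistency or local_wal_exhausted
    return local_wal_exhausted
```

If consistency is reached soon after recovery begins, the second and
third modes start the receiver at nearly the same time, which is the
point made above.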

That said, is there anything obviously wrong with starting the WAL
receiver immediately, even before reaching a consistent state?  One
consequence is that the WAL receiver may overwrite a WAL segment while
the startup process is reading and replaying WAL from it.  But that
doesn't appear to be a problem, because the overwrite should write
content identical to what was there before.
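
The overwrite argument can be checked in miniature with the Python
sketch below.  It is only an illustration under simplifying assumptions:
a temporary file stands in for a WAL segment, the reader and writer run
one after the other rather than concurrently, and the function name is
hypothetical.  It does not model partial writes or differing content.

```python
import os
import tempfile


def read_while_overwriting(segment: bytes, split: int) -> bytes:
    """Read a file in two parts, rewriting identical bytes in between."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(segment)
        # Unbuffered reader, so the second read really hits the file again.
        with open(path, "rb", buffering=0) as reader:
            first = reader.read(split)
            # Overwrite in place with the same content, without truncating.
            with open(path, "r+b") as writer:
                writer.write(segment)
            rest = reader.read()
        return first + rest
    finally:
        os.remove(path)


segment = bytes(range(256)) * 64
assert read_while_overwriting(segment, 4096) == segment
```

The reader sees the same bytes whether a given block is read before or
after the rewrite, which is why an identical overwrite looks harmless
here.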

Asim