Re: Replication failure, slave requesting old segments - Mailing list pgsql-general

From Stephen Frost
Subject Re: Replication failure, slave requesting old segments
Msg-id 20180813141154.GM3326@tamriel.snowman.net
In response to Re: Replication failure, slave requesting old segments  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
Greetings,

* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 08/13/2018 05:08 AM, Phil Endecott wrote:
> >Adrian Klaver wrote:
> >Really?  I thought the intention was that the system should be
> >able to recover reliably when the slave reconnects after a
> >period of downtime, subject only to there being sufficient
> >network/CPU/disk bandwidth etc. for it to eventually catch up.

That's correct.

> See also my reply to Stephen earlier. Basically you are trying to coordinate
> two different operations. They start from the same source pg_xlog(pg_wal
> 10+) but arrive on a different time scale and from different locations.
> Without sufficient sanity checks it is possible they diverge enough on one
> or both paths to render the process unstable.

This isn't what's happening.  We're not talking about a timeline change
here, or about a replica being promoted to primary.  Nothing is
diverging: it's the same consistent WAL stream, just coming
from two different sources, which PG is specifically designed to handle
and should be handling seamlessly.
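For reference, here is a minimal sketch of the standby side of such a setup, where the standby can read the same WAL stream either from the primary via streaming or from the archive via restore_command. It assumes PostgreSQL 10/11, where these settings live in recovery.conf (from 12 on they moved into postgresql.conf plus a standby.signal file). The host name, user, and archive path are hypothetical placeholders.

```conf
# recovery.conf on the standby (PostgreSQL 10/11)
standby_mode = 'on'

# Source 1: stream WAL from the primary (host/user are placeholders)
primary_conninfo = 'host=primary.example.com user=replicator'

# Source 2: fetch archived WAL segments (archive path is a placeholder)
restore_command = 'cp /mnt/wal_archive/%f "%p"'
```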

> I would say that:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the archive to
> catch up provided it retains enough segments."
>
> should be more like:
>
> "If you set up a WAL archive that's accessible from the standby, these
> solutions are not required, since the standby can always use the archive to
> catch up provided it retains enough segments. *This is dependent on
> verification that the archiving is working properly. A belt and suspenders
> approach would be to set wal_keep_segments to a value > 0 in the event
> archiving is not properly functioning*"
> "

I don't think I can disagree more with this additional wording, and I
*really* don't think we should be encouraging people to set a high
wal_keep_segments.  The specific case here looks like it just needs to be
set to exactly '1', to ensure that the primary hasn't removed the last
WAL file that it archived.
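A minimal sketch of what that looks like on the primary, assuming PostgreSQL 10-12 (wal_keep_segments was replaced by wal_keep_size in 13). The archive path is a hypothetical placeholder, and the archive_command follows the pattern shown in the PostgreSQL documentation.

```conf
# postgresql.conf on the primary (PostgreSQL 10-12)
archive_mode = on
# Refuse to overwrite an existing archived file (path is a placeholder)
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

# Keep one past segment in pg_wal, per the suggestion above, rather than
# a large value
wal_keep_segments = 1
```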

Thanks!

Stephen
