Re: [GENERAL] Interesting streaming replication issue - Mailing list pgsql-general

From Andres Freund
Subject Re: [GENERAL] Interesting streaming replication issue
Msg-id 20170809220811.ekhxxlhyse5mvf5c@alap3.anarazel.de
In response to [GENERAL] Interesting streaming replication issue  (James Sewell <james.sewell@jirotech.com>)
List pgsql-general
Hi,

On 2017-07-27 13:00:17 +1000, James Sewell wrote:
> Hi all,
>
> I've got two servers (A,B) which are part of a streaming replication pair.
> A is the master, B is a hot standby. I'm sending archived WAL to a
> directory on A, B is reading it via SCP.
>
> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
> currently working to reproduce with standard 9.6)
>
> We have recently seen a situation where B does not catch up when taken
> offline for maintenance.
>
> When B is started we see the following in the logs:
>
> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=
> (0:00000)LOG:  restored log file "0000000C0000005A000000B5" from
> archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=
> (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=
> (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> This will loop indefinitely. At this stage the master reports no connected
> standbys in pg_stat_replication, and the standby has no running WAL
> receiver process.
>
> This can be 'fixed' by running pg_switch_xlog() on the master, at which
> time a connection is seen from the standby and the logs show the following:
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B5" from archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client=  (0:00000)LOG:
>  started streaming WAL from primary at 5A/B5000000 on timeline 12
> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client=  (0:XX000)FATAL:
>  could not receive data from WAL stream: ERROR:  requested WAL segment
> 0000000C0000005A000000B5 has already been removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B5" from archive
> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B6" from archive

FWIW, I don't see a bug here. Archiving on its own doesn't guarantee
that replication progresses in increments smaller than one 16MB WAL
segment, unless you use archive_timeout (or, as you do, switch segments
manually). Streaming replication doesn't guarantee that WAL is retained
unless you use replication slots - which you don't appear to be using.
You can make SR retain more with approximate methods like
wal_keep_segments too, but that's not a guarantee. From what I can see
you're just seeing the combination of these two limitations, because
you don't use the methods that address them (archive_timeout,
replication slots and/or wal_keep_segments).
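
To make the 16MB granularity concrete, here is a rough sketch (not
PostgreSQL code) of how an LSN and timeline map to a WAL segment file
name. It assumes the default 16MB segment size; the function names are
made up for illustration. With the values from the quoted log it shows
that the streaming start point 5A/B5000000 on timeline 12 sits at the
start of the very segment the primary reports as removed.

```python
# Assumption: default 16MB WAL segments (not changed at build time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
# Number of segments per 4GB "log id" (256 with 16MB segments).
SEGMENTS_PER_LOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text):
    """Convert an LSN such as '5A/B5000000' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn):
    """Name of the segment file that contains the given LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_LOGID,
        segno % SEGMENTS_PER_LOGID,
    )


def segment_start(lsn):
    """LSN of the first byte of the segment that contains lsn."""
    return lsn - (lsn % WAL_SEGMENT_SIZE)


# Values taken from the quoted log: timeline 12, start point 5A/B5000000.
print(wal_file_name(12, parse_lsn("5A/B5000000")))
# prints 0000000C0000005A000000B5
```

A segment only reaches the archive once it is complete (or a switch is
forced), so without retention on the primary a standby that needs
anything inside the current segment has nowhere to get it from.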

Greetings,

Andres Freund

