Home > mailing lists

[GENERAL] Interesting streaming replication issue - Mailing list pgsql-general

From	James Sewell
Subject	[GENERAL] Interesting streaming replication issue
Date	July 27, 2017 09:00:17
Msg-id	CAANVwEuEx=1ub9wg8-WhrZbbQY8zXk81YNti1fSWstNikz1DsA@mail.gmail.com Whole thread Raw
Responses	Re: [GENERAL] Interesting streaming replication issue ("Gunnar \"Nick\" Bluth" <gunnar.bluth@pro-open.de>) Re: [GENERAL] Interesting streaming replication issue (Andres Freund <andres@anarazel.de>)
List	pgsql-general

Tree view

Hi all,

I've got two servers (A,B) which are part of a streaming replication pair. A is the master, B is a hot standby. I'm sending archived WAL to a directory on A, B is reading it via SCP.

This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm currently working to reproduce with standard 9.6)

We have recently seen a situation where B does not catch up when taken offline for maintenance.

When B is started we see the following in the logs:

2017-07-27 11:55:57 AEST [21432]: [979-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AA" from archive
2017-07-27 11:55:58 AEST [21432]: [980-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AB" from archive
2017-07-27 11:55:58 AEST [21432]: [981-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AC" from archive
2017-07-27 11:55:59 AEST [21432]: [982-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AD" from archive
2017-07-27 11:55:59 AEST [21432]: [983-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AE" from archive
2017-07-27 11:56:00 AEST [21432]: [984-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000AF" from archive
2017-07-27 11:56:00 AEST [21432]: [985-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B0" from archive
2017-07-27 11:56:01 AEST [21432]: [986-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B1" from archive
2017-07-27 11:56:01 AEST [21432]: [987-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B2" from archive
2017-07-27 11:56:02 AEST [21432]: [988-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B3" from archive
2017-07-27 11:56:02 AEST [21432]: [989-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B4" from archive
2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000C0000005A000000B5" from archive
scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=  (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on timeline 12
2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=  (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000C0000005A000000B5 has already been removed

scp: /archive/xlog//0000000D.history: No such file or directory
scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=  (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on timeline 12
2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=  (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000C0000005A000000B5 has already been removed

This will loop indefinitely. At this stage the master reports no connected standbys in pg_stat_replication, and the standby has no running WAL receiver process.

This can be 'fixed' by running pg_switch_xlog() on the master, at which time a connection is seen from the standby and the logs show the following:

scp: /archive/xlog//0000000D.history: No such file or directory

2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B5" from archive

scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory

2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client= (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on timeline 12

2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client= (0:XX000)FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000C0000005A000000B5 has already been removed

scp: /archive/xlog//0000000D.history: No such file or directory

2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B5" from archive

2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B6" from archive

scp: /archive/xlog//0000000C0000005A000000B7: No such file or directory

2017-07-27 12:03:25 AEST [21432]: [1032-1] user=,db=,client= (0:00000)LOG: unexpected pageaddr 56/E7000000 in log segment 0000000C0000005A000000B7, offset 0

2017-07-27 12:03:25 AEST [63196]: [1-1] user=,db=,client= (0:00000)LOG: started streaming WAL from primary

At this time replication is running as normal and all changes are streamed.

Am I missing something here, this seems very odd. One thing I have noticed is it only seems to be caused after a lot of WAL is produced and the pg_xlog directory is sitting at max_wal_size

James Sewell,

PostgreSQL Team Lead / Solutions Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009

P (+61) 2 8099 9000 W www.jirotech.com F (+61) 2 8099 9099

The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.

pgsql-general by date:

From: JingYuan Chen
Date: 27 July 2017, 08:57:31
Subject: Re: [GENERAL] Developer GUI tools for PostgreSQL

From: John R Pierce
Date: 27 July 2017, 09:08:09
Subject: Re: [GENERAL] ODBC driver issue

[GENERAL] Interesting streaming replication issue - Mailing list pgsql-general

Previous

Next