RE: Hot Standby Replica Recovery Problem - Mailing list pgsql-general

From Michael Harris
Subject RE: Hot Standby Replica Recovery Problem
Msg-id AM0PR07MB6179D93313EDEAEFCAA3A7FEF3BB0@AM0PR07MB6179.eurprd07.prod.outlook.com
In response to Hot Standby Replica Recovery Problem  (Michael Harris <michael.harris@ericsson.com>)
List pgsql-general
Hello All

We got to the bottom of this in the end - it was due to a misconfiguration that resulted in random corruption of WAL
archives.

Part of the problem was the use of 'pigz' (a multi-threaded gzip tool) for compression of the WAL files, which in our
case led to a file that decompresses without error into a file of the correct size, but where 2/3 of the file was all
'0's. Needless to say I have stopped using pigz!
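
For anyone with a similar setup, the obvious replacement is the same archive_command with plain single-threaded gzip
in place of pigz (same paths as in our configuration quoted below) - something along these lines:

   archive_command = 'test ! -f %p || (test ! -f /var/backup2/dbs1a/pg_xlog/%f.gz && gzip -c %p > /var/backup2/dbs1a/pg_xlog/%f.gz)'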

There were a couple of learnings from the whole exercise. First, during recovery, corrupted WAL files do not result in
error messages describing the corruption in the log. There is code in xlog.c (XLogPageRead) which validates the WAL
file header, and if it finds any errors, discards the file and tries again. All you can see in the log file is
repeated attempts to fetch the WAL file from alternating sources.
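
So if you suspect a bad archive, you pretty much have to check the segments yourself. In our case 'gzip -t' was no
help, since the bad file decompressed cleanly; something along these lines (pg_xlogdump is the 9.6 name of the tool,
and the segment name is just the one from the log below) would have caught the zero-filled segment, because
pg_xlogdump stops with an error as soon as it hits an invalid record:

   # decompress a suspect segment, keeping the WAL segment file name so pg_xlogdump can work out the start LSN
   gzip -dc /var/backup2/dbs1a/pg_xlog/000000030002CE55000000F2.gz > /tmp/000000030002CE55000000F2
   # try to parse the whole segment; corruption shows up as an error message and a non-zero exit status
   pg_xlogdump /tmp/000000030002CE55000000F2 > /dev/null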

The other wrinkle was the missing timeline history file. Even after fixing the WAL corruption, and starting over with a
fresh base backup, it still hiccupped about 00000003.history being mysteriously absent from the master. The following
sequence was repeatedly present in the log:

[2018-12-24 21:41:26 AEDT]  16246@  LOG:  restored log file "000000030002CE55000000ED" from archive
[2018-12-24 21:41:31 AEDT]  16246@  LOG:  restored log file "000000030002CE55000000EE" from archive
[2018-12-24 21:41:31 AEDT]  16246@  LOG:  restored log file "000000030002CE55000000EF" from archive
[2018-12-24 21:41:32 AEDT]  16246@  LOG:  restored log file "000000030002CE55000000F0" from archive
[2018-12-24 21:41:33 AEDT]  16246@  LOG:  restored log file "000000030002CE55000000F1" from archive

gzip: /var/backup2/dbs1a/pg_xlog/000000030002CE55000000F2.gz: unexpected end of file
[2018-12-24 21:41:33 AEDT]  4287@  LOG:  fetching timeline history file for timeline 3 from primary server
[2018-12-24 21:41:33 AEDT]  4287@  FATAL:  could not receive timeline history file from the primary server: ERROR:
could not open file "pg_xlog/00000003.history": No such file or directory

It seems like it reached the end of the WAL archive, then tried to switch to streaming, but then that failed because
the master did not have a copy of this history file. Then it switched back to the WAL archive, and the cycle continued.

In the end I created a fake history file and put it in the master's pg_xlog dir. After that the switch to WAL streaming
was successful and the standby was at last up to speed.
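
For the record, the history file is just a small text file with one tab-separated line per timeline switch: parent
timeline ID, the LSN at which the switch happened, and a free-text reason. By a 'fake' file I mean something of this
shape - the switch points below are only placeholders, so treat it as a sketch, and if you can dig up a genuine copy
of 00000003.history (e.g. from the old data directory or another archive) that is obviously the better option:

   # Illustrative only: the switch-point LSNs below are placeholders, not our real values
   # ($PGDATA here stands for the master's data directory)
   printf '1\t0/1000000\tplaceholder\n2\t0/2000000\tplaceholder\n' > $PGDATA/pg_xlog/00000003.history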

Anyway, all is well that ends well!

Cheers
Mike

-----Original Message-----
From: Michael Harris
Sent: Friday, December 14, 2018 6:40 PM
To: pgsql-general@lists.postgresql.org
Subject: Hot Standby Replica Recovery Problem

Hello Experts,

We've been having a strange problem with one of our databases.

A summary of the setup follows:

 - We are running PostgreSQL 9.6.9 on CentOS 7.
 - We are using postgresql native streaming replication
 - There is one master and one hot standby
 - The master is archiving its WAL files with this archive command:

   archive_command = 'test ! -f %p || (test ! -f /var/backup2/dbs1a/pg_xlog/%f.gz && pigz -c %p >
/var/backup2/dbs1a/pg_xlog/%f.gz)'            # command to use to archive a logfile segment 

   /var/backup2 is a shared gfs2 filesystem.

 - Recently we had to do some maintenance that involved rebuilding the database servers.
   At that point dbs1b was the master, so we promoted dbs1a, did the work on dbs1b,
   and then began establishing a new replica on dbs1b using pg_basebackup

 - After pg_basebackup completed we set up recovery.conf:

    restore_command='gzip -dc /var/backup2/dbs1a/pg_xlog/%f.gz > %p'
    standby_mode='on'
    primary_conninfo='host=dbs1a user=xxxxxx password=xxxxx'

   and then started postgresql.

At first it seemed to be starting up as per normal. It began REDO-ing WAL files from the archive. It reached a certain
point and then failed with this:

[2018-12-14 11:07:39 AEDT]  2050@  LOG:  restored log file "000000030002C4D700000028" from archive
[2018-12-14 11:07:39 AEDT]  29240@  LOG:  fetching timeline history file for timeline 3 from primary server
[2018-12-14 11:07:39 AEDT]  29240@  FATAL:  could not receive timeline history file from the primary server: ERROR:
could not open file "pg_xlog/00000003.history": No such file or directory

These messages repeated every few seconds until we stopped the postmaster.

I was surprised because we have done this procedure many times on this & other databases, but anyway Dr Google
suggested this:
https://superuser.com/questions/1127360/cant-find-the-timeline-history-file-to-get-the-replication-working.
After reading that and other linked articles I decided to create a 'dummy' history file.

This time the startup sequence failed like this:

[2018-12-14 16:26:46 AEDT]  16575@  LOG:  restored log file "000000030002C4D700000028" from archive
[2018-12-14 16:26:47 AEDT]  17669@  LOG:  fetching timeline history file for timeline 3 from primary server
[2018-12-14 16:26:47 AEDT]  17669@  LOG:  started streaming WAL from primary at 2C4D7/28000000 on timeline 3
[2018-12-14 16:26:47 AEDT]  17669@  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment
000000030002C4D700000028 has already been removed

Again I was surprised - why wasn't it getting this file from the archives? It is present and does not seem to be
truncated or corrupted (as far as I can tell).

Anyway in an attempt to force it to get the file from the archives, I removed the primary_conninfo.
This time it just repeatedly fetched the file 000000030002C4D700000028 again and again, never progressing to the next
one.

Finally I turned up the debugging output, to see this:

[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  executing restore command "gzip -dc
/var/backup2/dbs1a/pg_xlog/000000030002C4D700000028.gz > pg_xlog/RECOVERYXLOG"
[2018-12-14 17:58:25 AEDT]  12365@  LOG:  restored log file "000000030002C4D700000028" from archive
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  got WAL segment from archive
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  switched WAL source from archive to stream after failure
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  switched WAL source from stream to archive after failure

It seems like something is going wrong with the WAL file replay here, although I cannot see what??

I'm writing this email after the second attempt at replication: the first attempt ended much the same.
I assumed I had done something wrong so I re-started from the pg_basebackup. Now I am concerned that something is wrong
with the master, or maybe with our WAL archiving process.

Any tips would be gratefully received!

Regards
Mike


