Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery - Mailing list pgsql-general

From Dylan Luong
Subject Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery
Date
Msg-id 42fbdfcee16049f184aebbdd5441d876@ITUPW-EXMBOX2B.UniNet.unisa.edu.au
In response to Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-general
Thanks Michael.

> First things first. What is the version of PostgreSQL involved here?

The PostgreSQL version is 9.6.

>4 is the number of the last timeline the promoted standby has been using, right?

The history file in pg_xlog is dated at the time of promotion on the standby (the current master):
-rw-------. 1 postgres postgres      131 Aug 21 13:26 00000004.history

$ more 00000004.history
1       20/5C000098     no recovery target specified

2       76/F8000098     no recovery target specified

3       9C/7CC50680     no recovery target specified
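
For readers following along, here is a minimal sketch of how the history file above can be read programmatically. It assumes the usual layout of one whitespace-separated entry per line (parent timeline ID, switch LSN, reason); the function name parse_timeline_history is made up for this illustration and is not part of PostgreSQL.

```python
def parse_timeline_history(text):
    """Parse the text of a timeline history file.

    Returns a list of (parent_timeline, switch_lsn, reason) tuples, where
    switch_lsn is the 64-bit position at which that timeline ended.
    Hypothetical helper for illustration only.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Fields: parent timeline ID, switch LSN (hi/lo in hex), free-text reason.
        parts = line.split(None, 2)
        timeline = int(parts[0])
        hi, lo = parts[1].split("/")
        lsn = (int(hi, 16) << 32) | int(lo, 16)
        reason = parts[2] if len(parts) > 2 else ""
        entries.append((timeline, lsn, reason))
    return entries


# Contents of 00000004.history as posted above.
HISTORY = """\
1       20/5C000098     no recovery target specified

2       76/F8000098     no recovery target specified

3       9C/7CC50680     no recovery target specified
"""

entries = parse_timeline_history(HISTORY)
for timeline, lsn, reason in entries:
    print("timeline %d ended at %X/%X (%s)"
          % (timeline, lsn >> 32, lsn & 0xFFFFFFFF, reason))
```

The last entry says that timeline 3 ended at 9C/7CC50680, which is the point where timeline 4 begins.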

> What are the contents of /pg_backup/backup/archive_sync/?

The archive folder is /pg_backup/backup/archive. I FTP'ed all the contents of the /pg_backup/backup/archive folder
on the new master to /pg_backup/backup/archive_sync on the new slave.
 

-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@gmail.com] 
Sent: Tuesday, 22 August 2017 10:37 AM
To: Dylan Luong <Dylan.Luong@unisa.edu.au>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery

On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan.Luong@unisa.edu.au> wrote:
> I have a WAL streaming replication setup with 1 master and 1 slave, and the
> application connects via a load balancer (LTM) where all
> connections are redirected to the master member (master db).
>
> We have archive_mode enabled.

First things first. What is the version of PostgreSQL involved here?

> I am trying to test to use pg_rewind to restore the new slave (old 
> master) after a failover while the system is under load.

Don't worry. pg_rewind works :)

> Here are the steps I take to test:
>
> 1.       Disable the master ltm member (all connections redirected to slave
> member)
> 2.       Promote slave (touch promote.me)
> 3.       Stop the master db (old master)
> 4.       Do pg_rewind on the new slave (old master)
> 5.       Start the new slave.

That flow looks correct to me. Now I think that you should manually trigger a checkpoint after step 2 on the promoted
standby so that its control file gets forcibly updated with its new timeline number. This is a small but
critical point people usually miss. The documentation of pg_rewind does not mention this point when using a live source
server, and many people have fallen into this trap up to now... We should really mention that in the docs. What do
others think?
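
The adjusted flow can be sketched as a list of commands. This is only an illustration under assumptions, not a tested failover script: the data directory path, the superuser name and the helper name failover_commands are hypothetical, and the host is the one from the primary_conninfo quoted in this thread. The psql, pg_ctl and pg_rewind options used are the standard ones.

```python
def failover_commands(new_master_host, old_master_pgdata, user="postgres"):
    """Build the commands to run after the standby has been promoted.

    Hypothetical helper: it only returns argument lists and executes nothing.
    """
    conninfo = "host=%s user=%s dbname=postgres" % (new_master_host, user)
    return [
        # Extra step after promotion: force a checkpoint on the promoted
        # standby so that its control file carries the new timeline number.
        ["psql", "-h", new_master_host, "-U", user, "-d", "postgres",
         "-c", "CHECKPOINT;"],
        # Step 3: stop the old master cleanly.
        ["pg_ctl", "-D", old_master_pgdata, "stop", "-m", "fast"],
        # Step 4: rewind the old master against the live new master.
        ["pg_rewind", "--target-pgdata=" + old_master_pgdata,
         "--source-server=" + conninfo],
        # Step 5: start the rewound node (recovery.conf must be in place first).
        ["pg_ctl", "-D", old_master_pgdata, "start"],
    ]


# The data directory below is an assumed path, not taken from this thread.
commands = failover_commands("10.69.19.18", "/var/lib/pgsql/9.6/data")
for argv in commands:
    print(" ".join(argv))
```

Each entry could be handed to subprocess.run(argv, check=True) on the appropriate host. The order matters: the checkpoint has to complete before pg_rewind is started.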
 

> Checking on the new master, I see that the checkpoint that it's
> trying to restore is in the file 000000040000009C0000006F, but the file
> does not exist anywhere on the new master: not in pg_xlog and not in the
> archive folder (as specified in postgresql.conf).

4 is the number of the last timeline the promoted standby has been using, right?
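
To make the file name easier to read: a WAL segment name is 24 hexadecimal digits, 8 each for the timeline ID, the log number and the segment number within that log. A minimal sketch of decoding it follows, assuming the default 16 MB segment size (a non-default build would change the arithmetic); the helper names are made up for this illustration.

```python
def parse_wal_segment_name(name, wal_segment_size=16 * 1024 * 1024):
    """Split a 24-hex-digit WAL segment file name into its parts.

    Returns (timeline, log, seg, start_lsn). Hypothetical helper that
    assumes the default 16 MB segment size unless told otherwise.
    """
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    # Number of segments that fit in one 4 GB "log" unit.
    segs_per_log = 0x100000000 // wal_segment_size
    start_lsn = (log * segs_per_log + seg) * wal_segment_size
    return timeline, log, seg, start_lsn


def format_lsn(lsn):
    """Render a 64-bit WAL position in the usual hi/lo hex notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


timeline, log, seg, start_lsn = parse_wal_segment_name("000000040000009C0000006F")
print(timeline, format_lsn(start_lsn))  # prints: 4 9C/6F000000
```

So the requested file is a timeline 4 segment starting at 9C/6F000000.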

> Please see attached  psql.jpg.
>
> Here is my recovery.conf :
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18  user=replicant'
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> does anyone know why?

What are the contents of /pg_backup/backup/archive_sync/? Are you sure that the promoted standby has correctly archived
the first segment of its new timeline, for example?
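
One way to check this is sketched below. It assumes 16 MB segments and assumes that the first segment of the new timeline is the one containing the switch point recorded in the history file (9C/7CC50680 for timeline 4 in this thread). The function names are hypothetical, and the listing would normally come from os.listdir() on the archive directory.

```python
def expected_first_files(new_timeline, switch_lsn_text,
                         wal_segment_size=16 * 1024 * 1024):
    """Name the history file and first WAL segment of a new timeline.

    Hypothetical helper: assumes the first segment of the new timeline is
    the one containing the switch LSN, and a 16 MB segment size by default.
    """
    hi, lo = switch_lsn_text.split("/")
    lsn = (int(hi, 16) << 32) | int(lo, 16)
    segno = lsn // wal_segment_size
    segs_per_log = 0x100000000 // wal_segment_size
    segment = "%08X%08X%08X" % (new_timeline,
                                segno // segs_per_log,
                                segno % segs_per_log)
    history = "%08X.history" % new_timeline
    return history, segment


def missing_from_archive(archive_listing, new_timeline, switch_lsn_text):
    """Return the expected file names that are absent from the listing."""
    present = set(archive_listing)
    return [name
            for name in expected_first_files(new_timeline, switch_lsn_text)
            if name not in present]


# Example with a made-up listing that only contains the history file.
listing = ["00000004.history", "000000030000009C0000007B"]
missing = missing_from_archive(listing, 4, "9C/7CC50680")
print(missing)  # prints: ['000000040000009C0000007C']
```

An empty result would mean both files are in the archive; anything else names what is still missing.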
 

> Under what conditions will pg_rewind not work?

A single missing WAL segment would prevent any base backup or rewound node from reaching a consistent point. You need to be
careful about the contents of your archives. Now, a failover done correctly is a tricky thing, and it is likely to fail if
you don't issue a checkpoint immediately on the promoted standby and pg_rewind kicks in before an
automatic checkpoint happens (because of timeout or volume, whichever comes first).
 
--
Michael
