Thread: Cannot rebuild a standby server
In the true definition of insanity, I've tried to rebuild a standby streaming replication server using the following steps several times:

1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
3) manually copy the WALs from the primary server's pg_xlog directory to the directory specified in the standby's recovery.conf restore_command.
4) rm any artifacts from the standby's new data directory, like the backup_label file.
5) copy the saved recovery.conf into the standby's data directory and check that it is accurate.
6) start the database using "service postgresql-9.3 start"

Every time, however, the following appears in pg_log/postgresql-Fri.log:

<timestamp> LOG: entering standby mode
<timestamp> LOG: restored log file "00000003.history"
<timestamp> LOG: invalid secondary checkpoint record
<timestamp> PANIC: could not locate a valid checkpoint record

All this was originally caused by testing the failover mechanism in pgpool. That didn't succeed, and I'm trying to get the servers back to their original states. I've done this kind of thing before, but I don't know what's wrong with this effort. What have I missed?
--
Jay
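A recovery.conf for a 9.3 streaming standby of this sort typically looks roughly like the sketch below; the host name, user, and archive path here are illustrative assumptions, not values from the thread:

    # recovery.conf on the standby (9.3-era syntax) -- illustrative only
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
    restore_command = 'cp /var/lib/pgsql/archive/%f %p'
    trigger_file = '/tmp/postgresql.trigger'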
John Scalia <jayknowsunix@gmail.com> wrote:

> In the true definition of insanity, I've tried to rebuild a standby
> streaming replication server using the following steps several times:
>
> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
> 3) manually copy the WALs from the primary server's pg_xlog directory
> to the directory specified in the standby's recovery.conf restore_command.

Step 3 is enough to cause database corruption on the replica.

> 4) rm any artifacts from the standby's new data directory like the
> backup_label file.

So is that.

> 5) copy the saved recovery.conf into the standby's data directory and check
> it is accurate.
> 6) Start the database using "service postgresql-9.3 start"
>
> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
> <timestamp> LOG: entering standby mode
> <timestamp> LOG: restored log file "00000003.history"
> <timestamp> LOG: invalid secondary checkpoint record
> <timestamp> PANIC: could not locate a valid checkpoint record

Yep, that's about the best result you can expect with the above procedure; it is also occasionally possible to get it to start, but if it did there would almost certainly be data loss or corruption.

> All this was originally caused by testing the failover mechanism in pgpool.
> That didn't succeed and I'm trying to get the servers back to their original
> states. I've done this kind of thing before, but don't know what's wrong
> with this effort. What have I missed?

You should enable WAL archiving, and the restore_command in recovery.conf should copy WAL files from the archive. The pg_xlog directory should be empty when starting recovery, unless the primary is stopped and you only copy pg_xlog files from the stopped server into the pg_xlog directory of the recovery cluster. Don't delete the backup_label file, because it has the information recovery needs about the point from which it should start WAL replay -- without it, recovery has to guess, and is very likely to get that wrong.

The documentation is your friend. It gives pretty specific instructions for what to do.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
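A minimal sketch of the archiving setup Kevin describes, assuming an archive directory of /var/lib/pgsql/archive that both servers can reach (the path and the exact archive_command are assumptions, not from the thread):

    # postgresql.conf on the primary -- illustrative settings only
    wal_level = hot_standby
    max_wal_senders = 3
    archive_mode = on
    archive_command = 'test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f'

The standby's recovery.conf restore_command then reads from that same archive (e.g. 'cp /var/lib/pgsql/archive/%f %p'), so nothing ever has to be copied by hand out of a running server's pg_xlog.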
Well, I did finally get it working by adding -X s -c fast to the pg_basebackup command. Kevin, if I didn't copy WALs over, the database still refused to start, as it claimed it was looking for one of the first specific files. Also, I've not seen any references to removing certain files, like a backup_label file in the standby's data directory, causing problems. The other files I removed were the old postgresql.pid file from the primary and a file called archiving_active, which I use for controlling whether postgresql writes WAL files or not. Seems a little funny to me that I've done this same procedure for over 4 months with no problems, and today was the first time it bit me.

On 6/20/2014 2:09 PM, Kevin Grittner wrote:
> John Scalia <jayknowsunix@gmail.com> wrote:
>
>> In the true definition of insanity, I've tried to rebuild a standby
>> streaming replication server using the following steps several times:
>>
>> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
>> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
>> 3) manually copy the WALs from the primary server's pg_xlog directory
>> to the directory specified in the standby's recovery.conf restore_command.
> Step 3 is enough to cause database corruption on the replica.
>
>> 4) rm any artifacts from the standby's new data directory like the
>> backup_label file.
> So is that.
>
>> 5) copy the saved recovery.conf into the standby's data directory and check
>> it is accurate.
>> 6) Start the database using "service postgresql-9.3 start"
>>
>> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
>> <timestamp> LOG: entering standby mode
>> <timestamp> LOG: restored log file "00000003.history"
>> <timestamp> LOG: invalid secondary checkpoint record
>> <timestamp> PANIC: could not locate a valid checkpoint record
> Yep, that's about the best result you can expect with the above
> procedure; it is also occasionally possible to get it to start, but
> if it did there would almost certainly be data loss or corruption.
>
>> All this was originally caused by testing the failover mechanism in pgpool.
>> That didn't succeed and I'm trying to get the servers back to their original
>> states. I've done this kind of thing before, but don't know what's wrong
>> with this effort. What have I missed?
> You should enable WAL archiving, and the restore_command in
> recovery.conf should copy WAL files from the archive. The pg_xlog
> directory should be empty when starting recovery, unless the primary
> is stopped and you only copy pg_xlog files from the stopped server
> into the pg_xlog directory of the recovery cluster. Don't delete
> the backup_label file, because it has the information recovery
> needs about the point from which it should start WAL replay --
> without it, recovery has to guess, and is very likely to get that
> wrong.
>
> The documentation is your friend. It gives pretty specific
> instructions for what to do.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
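Putting John's fix together with the command from the original steps, the working invocation would presumably have been along these lines (the host placeholder is the one from the thread; the flag spellings are the standard pg_basebackup options):

    # -X s (--xlog-method=stream) streams the WAL needed by the backup
    # while it runs; -c fast requests an immediate checkpoint on the primary.
    pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data -X s -c fast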
John Scalia <jayknowsunix@gmail.com> wrote:
> On 6/20/2014 2:09 PM, Kevin Grittner wrote:

>> The documentation is your friend. It gives pretty specific
>> instructions for what to do.

> Well, I did finally get it working by adding -X s -c fast to the
> pg_basebackup command.

Hopefully you did not follow that with a copy of pg_xlog files from a running server.

> Kevin, if I didn't copy WALs over, the database still refused to
> start, as it claimed it was looking for one of the first specific
> files.

You absolutely need WAL files for recovery. You absolutely should not copy the files ad hoc from a running cluster, as they may be in flux at the time of the copy. The archive_command is called at the right point in time to get a good copy, and using the pg_basebackup command will also acquire them in a safe way.

> Also, I've not seen any references to removing certain files, like
> a backup_label file in the standby's data directory, causing
> problems.

On this page:

http://www.postgresql.org/docs/current/static/continuous-archiving.html

... it says:

| pg_start_backup creates a backup label file, called backup_label,
| in the cluster directory with information about your backup,
| including the start time and label string. The file is critical
| to the integrity of the backup, should you need to restore from
| it.

The same page later says:

| It's also worth noting that the pg_start_backup function makes a
| file named backup_label in the database cluster directory, which
| is removed by pg_stop_backup. This file will of course be
| archived as a part of your backup dump file. The backup label
| file includes the label string you gave to pg_start_backup, as
| well as the time at which pg_start_backup was run, and the name
| of the starting WAL file. In case of confusion it is therefore
| possible to look inside a backup dump file and determine exactly
| which backup session the dump file came from. However, this file
| is not merely for your information; its presence and contents are
| critical to the proper operation of the system's recovery
| process.

These are related to the pg_start_backup function and apply whether you run that function directly or use pg_basebackup, which runs it.

> The other files I removed were the old postgresql.pid file from
> the primary

You should remove that, or exclude it from your backup in the first place.

> and a file called archiving_active, which I use for controlling
> whether postgresql writes WAL files or not.

That's based on your particular archiving strategy; without more info on that I can't comment.

> Seems a little funny to me that I've done this same procedure for
> over 4 months with no problems, and today was the first time it
> bit me.

It's sorta like crossing a highway without looking either way. It might work many times. Then again, a Mack truck might be coming just that one time. If you would like to have it work consistently, rather than gamble each time, don't copy WAL files from the pg_xlog directory of a live server except through the documented methods, and don't delete the backup_label file. If you do either of those things, you risk errors like the ones you just saw. Worse, you risk not getting errors but having a silently corrupted database. Unless you enjoy the thrill of it.... ;-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
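Pulling the thread's advice together, a sketch of the documented rebuild procedure might look like the following; the location of the saved recovery.conf is hypothetical, and the data directory and host placeholder are the ones from the original post:

    # On the standby, with PostgreSQL stopped -- illustrative sketch only
    rm -rf /var/lib/pgsql/9.3/data/*
    pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data -X stream -c fast
    # Leave backup_label alone: recovery reads it to find the correct start point.
    cp /path/to/saved/recovery.conf /var/lib/pgsql/9.3/data/   # source path is hypothetical
    service postgresql-9.3 start

No WAL files are copied by hand at any point; the streamed WAL from pg_basebackup plus the restore_command pointed at the archive cover everything recovery needs.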