Thread: Cannot rebuild a standby server
In the true definition of insanity, I've tried to rebuild a standby streaming replication server using the following steps several times:

1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
3) manually copy the WALs from the primary server's pg_xlog directory to the directory specified in the standby's recovery.conf restore_command.
4) rm any artifacts from the standby's new data directory, like the backup_label file.
5) copy the saved recovery.conf into the standby's data directory and check that it is accurate.
6) start the database using "service postgresql-9.3 start"

Every time, however, the following appears in pg_log/postgresql-Fri.log:

<timestamp> LOG: entering standby mode
<timestamp> LOG: restored log file "00000003.history"
<timestamp> LOG: invalid secondary checkpoint record
<timestamp> PANIC: could not locate a valid checkpoint record

All this was originally caused by testing the failover mechanism in pgpool. That didn't succeed, and I'm trying to get the servers back to their original states. I've done this kind of thing before, but I don't know what's wrong with this effort. What have I missed?
--
Jay
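A recovery.conf for a 9.3 streaming standby of this sort typically looks roughly like the sketch below; the host name, user, and archive path here are illustrative assumptions, not values from the thread:

    # recovery.conf on the standby (9.3-era syntax) -- illustrative only
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
    restore_command = 'cp /var/lib/pgsql/archive/%f %p'
    trigger_file = '/tmp/postgresql.trigger'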
John Scalia <jayknowsunix@gmail.com> wrote:

> In the true definition of insanity, I've tried to rebuild a standby
> streaming replication server using the following steps several times:
>
> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
> 3) manually copy the WALs from the primary server's pg_xlog directory
> to the directory specified in the standby's recovery.conf restore_command.

Step 3 is enough to cause database corruption on the replica.

> 4) rm any artifacts from the standby's new data directory like the
> backup_label file.

So is that.

> 5) copy the saved recovery.conf into the standby's data directory and check
> it is accurate.
> 6) Start the database using "service postgresql-9.3 start"
>
> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
> <timestamp> LOG: entering standby mode
> <timestamp> LOG: restored log file "00000003.history"
> <timestamp> LOG: invalid secondary checkpoint record
> <timestamp> PANIC: could not locate a valid checkpoint record

Yep, that's about the best result you can expect with the above procedure; it is also occasionally possible to get it to start, but if it did there would almost certainly be data loss or corruption.

> All this was originally caused by testing the failover mechanism in pgpool.
> That didn't succeed and I'm trying to get the servers back to their original
> states. I've done this kind of thing before, but don't know what's wrong
> with this effort. What have I missed?

You should enable WAL archiving, and the restore_command in recovery.conf should copy WAL files from the archive. The pg_xlog directory should be empty when starting recovery, unless the primary is stopped and you only copy pg_xlog files from the stopped server into the pg_xlog directory of the recovery cluster. Don't delete the backup_label file, because it has the information recovery needs about the point from which it should start WAL replay -- without it, recovery has to guess, and is very likely to get that wrong.

The documentation is your friend. It gives pretty specific instructions for what to do.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
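A minimal sketch of the archiving setup Kevin describes, assuming an archive directory of /var/lib/pgsql/archive that both servers can reach (the path and the exact archive_command are assumptions, not from the thread):

    # postgresql.conf on the primary -- illustrative settings only
    wal_level = hot_standby
    max_wal_senders = 3
    archive_mode = on
    archive_command = 'test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f'

The standby's recovery.conf restore_command then reads from that same archive (e.g. 'cp /var/lib/pgsql/archive/%f %p'), so nothing ever has to be copied by hand out of a running server's pg_xlog.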
Well, I did finally get it working by adding -X s -c fast to the pg_basebackup command. Kevin, if I didn't copy WALs over, the database still refused to start, as it claimed it was looking for one of the first specific files. Also, I've not seen any references to removing certain files, like a backup_label file in the standby's data directory, causing problems. The other files I removed were the old postgresql.pid file from the primary and a file called archiving_active, which I use for controlling whether postgresql writes WAL files or not. Seems a little funny to me that I've done this same procedure for over 4 months with no problems, and today was the first time it bit me.

On 6/20/2014 2:09 PM, Kevin Grittner wrote:
> John Scalia <jayknowsunix@gmail.com> wrote:
>
>> In the true definition of insanity, I've tried to rebuild a standby
>> streaming replication server using the following steps several times:
>>
>> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
>> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
>> 3) manually copy the WALs from the primary server's pg_xlog directory
>> to the directory specified in the standby's recovery.conf restore_command.
> Step 3 is enough to cause database corruption on the replica.
>
>> 4) rm any artifacts from the standby's new data directory like the
>> backup_label file.
> So is that.
>
>> 5) copy the saved recovery.conf into the standby's data directory and check
>> it is accurate.
>> 6) Start the database using "service postgresql-9.3 start"
>>
>> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
>> <timestamp> LOG: entering standby mode
>> <timestamp> LOG: restored log file "00000003.history"
>> <timestamp> LOG: invalid secondary checkpoint record
>> <timestamp> PANIC: could not locate a valid checkpoint record
> Yep, that's about the best result you can expect with the above
> procedure; it is also occasionally possible to get it to start, but
> if it did there would almost certainly be data loss or corruption.
>
>> All this was originally caused by testing the failover mechanism in pgpool.
>> That didn't succeed and I'm trying to get the servers back to their original
>> states. I've done this kind of thing before, but don't know what's wrong
>> with this effort. What have I missed?
> You should enable WAL archiving, and the restore_command in
> recovery.conf should copy WAL files from the archive. The pg_xlog
> directory should be empty when starting recovery, unless the primary
> is stopped and you only copy pg_xlog files from the stopped server
> into the pg_xlog directory of the recovery cluster. Don't delete
> the backup_label file, because it has the information recovery
> needs about the point from which it should start WAL replay --
> without it, recovery has to guess, and is very likely to get that
> wrong.
>
> The documentation is your friend. It gives pretty specific
> instructions for what to do.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
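Putting John's fix together with the command from the original steps, the working invocation would presumably have been along these lines (the host placeholder is the one from the thread; the flag spellings are the standard pg_basebackup options):

    # -X s (--xlog-method=stream) streams the WAL needed by the backup
    # while it runs; -c fast requests an immediate checkpoint on the primary.
    pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data -X s -c fast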
John Scalia <jayknowsunix@gmail.com> wrote:
> On 6/20/2014 2:09 PM, Kevin Grittner wrote:

>> The documentation is your friend. It gives pretty specific
>> instructions for what to do.

> Well, I did finally get it working by adding -X s -c fast to the
> pg_basebackup command.

Hopefully you did not follow that with a copy of pg_xlog files from a running server.

> Kevin, if I didn't copy WALs over, the database still refused to
> start, as it claimed it was looking for one of the first specific
> files.

You absolutely need WAL files for recovery. You absolutely should not copy the files ad hoc from a running cluster, as they may be in flux at the time of the copy. The archive_command is called at the right point in time to get a good copy, and using the pg_basebackup command will also acquire them in a safe way.

> Also, I've not seen any references to removing certain files, like
> a backup_label file in the standby's data directory, causing
> problems.

On this page:

http://www.postgresql.org/docs/current/static/continuous-archiving.html

... it says:

| pg_start_backup creates a backup label file, called backup_label,
| in the cluster directory with information about your backup,
| including the start time and label string. The file is critical
| to the integrity of the backup, should you need to restore from
| it.

The same page later says:

| It's also worth noting that the pg_start_backup function makes a
| file named backup_label in the database cluster directory, which
| is removed by pg_stop_backup. This file will of course be
| archived as a part of your backup dump file. The backup label
| file includes the label string you gave to pg_start_backup, as
| well as the time at which pg_start_backup was run, and the name
| of the starting WAL file. In case of confusion it is therefore
| possible to look inside a backup dump file and determine exactly
| which backup session the dump file came from. However, this file
| is not merely for your information; its presence and contents are
| critical to the proper operation of the system's recovery
| process.

These are related to the pg_start_backup function and apply whether you run that function directly or use pg_basebackup, which runs it.

> The other files I removed were the old postgresql.pid file from
> the primary

You should remove that, or exclude it from your backup in the first place.

> and a file called archiving_active, which I use for controlling
> whether postgresql writes WAL files or not.

That's based on your particular archiving strategy; without more info on that I can't comment.

> Seems a little funny to me that I've done this same procedure for
> over 4 months with no problems, and today was the first time it
> bit me.

It's sorta like crossing a highway without looking either way. It might work many times. Then again, a Mack truck might be coming just that one time. If you would like to have it work consistently, rather than gamble each time, don't copy WAL files from the pg_xlog directory of a live server except through the documented methods, and don't delete the backup_label file. If you do either of those things, you risk errors like the ones you just saw. Worse, you risk not getting errors but having a silently corrupted database. Unless you enjoy the thrill of it.... ;-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
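Pulling the thread's advice together, a sketch of the documented rebuild procedure might look like the following; the location of the saved recovery.conf is hypothetical, and the data directory and host placeholder are the ones from the original post:

    # On the standby, with PostgreSQL stopped -- illustrative sketch only
    rm -rf /var/lib/pgsql/9.3/data/*
    pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data -X stream -c fast
    # Leave backup_label alone: recovery reads it to find the correct start point.
    cp /path/to/saved/recovery.conf /var/lib/pgsql/9.3/data/   # source path is hypothetical
    service postgresql-9.3 start

No WAL files are copied by hand at any point; the streamed WAL from pg_basebackup plus the restore_command pointed at the archive cover everything recovery needs.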