Thread: Some problem with warm standby server
Hi, I have some doubts regarding the settings and the access procedure of warm standby servers: - can autovacuum be safely enabled on the replicator? - I'm using pg_standby (from cvs) that is generally working well as expected (logs are copied with scp); today I wanted to temporarily stop the replication to verify some data to restart it later on, so I touched the trigger file, waited for the log to report "database ready", verified that the databases were actually up-to-date. All was fine, then I ran rm -f pg_xlog/* pg_xlog/archive_status/* mv recovery.done recovery.conf (the permissions were right) /etc/init.d/postgresql stop ; /etc/init.d/postgresql start the replication seemed to start: ---- --------------------------------------------------- LOG: database system was shut down at 2007-04-27 12:16:13 CEST LOG: starting archive recovery LOG: restore_command = "/usr/local/bin/pg_standby -s 5 -w 0 -t /usr/local/postgres_replica/trigger /usr/local/postgres_replica/log/ %f %p" cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No such file or directory cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No such file or directory cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No such file or directory then I updated the master with a batch of inserts, but after a while the slave stopped with these messages: LOG: restored log file "000000010000000000000021" from archive LOG: record with zero length at 0/21000048 LOG: invalid primary checkpoint record LOG: restored log file "000000010000000000000020" from archive LOG: restored log file "000000010000000000000021" from archive LOG: invalid resource manager ID in secondary checkpoint record PANIC: could not locate a valid checkpoint record LOG: startup process (PID 19619) was terminated by signal 6 LOG: aborting startup due to startup process failure What did I do wrong? Is there any other procedure to follow to restart a stopped replication? Thanks, Nico
On Fri, 2007-04-27 at 12:31 +0200, Nico Sabbi wrote: > I have some doubts regarding the settings and the access procedure of > warm standby servers: > - can autovacuum be safely enabled on the replicator? Not a relevant question. The standby isn't "up", so no SQL can be executed on the standby, so it never needs vacuuming. The autovacuum process doesn't start until recovery finishes. That is why it is Warm rather than Hot standby. Florian Pflug is working on allowing read-only queries to execute on the standby, so we're hopeful of further enhancements in the next release. > - I'm using pg_standby (from cvs) that is generally working well as > expected (logs are copied with > scp); today I wanted to temporarily stop the replication to verify > some data to restart it later on, so > I touched the trigger file, waited for the log to report "database > ready", verified that the > databases were actually up-to-date. All was fine, then I ran > > rm -f pg_xlog/* pg_xlog/archive_status/* > mv recovery.done recovery.conf (the permissions were right) > /etc/init.d/postgresql stop ; /etc/init.d/postgresql start > > the replication seemed to start: > ---- > --------------------------------------------------- > LOG: database system was shut down at 2007-04-27 12:16:13 CEST > LOG: starting archive recovery > LOG: restore_command = "/usr/local/bin/pg_standby -s 5 -w 0 -t > /usr/local/postgres_replica/trigger /usr/local/postgres_replica/log/ %f %p" > cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No > such file or directory > cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No > such file or directory > cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No > such file or directory Looks like there's an issue with double slashes on the archive filename, which I will fix. This probably isnt the problem though. > then I updated the master with a batch of inserts, but after a while the > slave stopped with > these messages: > > LOG: restored log file "000000010000000000000021" from archive > LOG: record with zero length at 0/21000048 > LOG: invalid primary checkpoint record > LOG: restored log file "000000010000000000000020" from archive > LOG: restored log file "000000010000000000000021" from archive > LOG: invalid resource manager ID in secondary checkpoint record > PANIC: could not locate a valid checkpoint record > LOG: startup process (PID 19619) was terminated by signal 6 > LOG: aborting startup due to startup process failure Please run pg_controldata to print out the control file. Backup all the files in case we need to inspect them. What was the ending log sequence number (e.g. x/xxxx) from the previous recovery? I'll see if I can re-create this. > What did I do wrong? Is there any other procedure to follow to restart a > stopped replication? You're right, using the trigger is not the right way to stop/start the standby. Just stop/start the standby server normally. The trigger means that you'd like to perform a failover. There is a patch not yet applied which will make a new version of pg_standby. pg_standby's official status right now is beta, so please expect, look for and report any issues you find. Thanks. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: >>then I updated the master with a batch of inserts, but after a while the >>slave stopped with >>these messages: >> >>LOG: restored log file "000000010000000000000021" from archive >>LOG: record with zero length at 0/21000048 >>LOG: invalid primary checkpoint record >>LOG: restored log file "000000010000000000000020" from archive >>LOG: restored log file "000000010000000000000021" from archive >>LOG: invalid resource manager ID in secondary checkpoint record >>PANIC: could not locate a valid checkpoint record >>LOG: startup process (PID 19619) was terminated by signal 6 >>LOG: aborting startup due to startup process failure >> >> > >Please run pg_controldata to print out the control file. > > Hi, sorry for the long delay. First of all I had to stop postgres with pg_ctl stop -s immediate, or it wouldn't die because of the ongoing replication. This is the output of pg_controldata: postgres@www3:/usr/local/postgres_replica/data$ pg_controldata /usr/local/postgres_replica/data/ pg_control version number: 812 Catalog version number: 200510211 Database system identifier: 5001030714849737714 Database cluster state: in recovery pg_control last modified: Fri 27 Apr 2007 13:20:46 CEST Current log file ID: 0 Next log file segment: 26 Latest checkpoint location: 0/190C7E04 Prior checkpoint location: 0/190C7DC0 Latest checkpoint's REDO location: 0/190C7E04 Latest checkpoint's UNDO location: 0/0 Latest checkpoint's TimeLineID: 1 Latest checkpoint's NextXID: 3698809 Latest checkpoint's NextOID: 68745 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Time of latest checkpoint: Fri 27 Apr 2007 11:53:47 CEST Maximum data alignment: 4 Database block size: 8192 Blocks per segment of large relation: 131072 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Date/time type storage: floating-point numbers Maximum length of locale name: 128 LC_COLLATE: C LC_CTYPE: C >Backup all the files in case we need to inspect them. > > ok >What was the ending log sequence number (e.g. x/xxxx) from the previous >recovery? I'll see if I can re-create this. > > judging from the logs I gues it is 0/190C7E04: LOG: restored log file "000000010000000000000019.000C7E04.backup" from archive LOG: restored log file "000000010000000000000019" from archive LOG: checkpoint record is at 0/190C7E04 LOG: redo record is at 0/190C7E04; undo record is at 0/0; shutdown FALSE LOG: next transaction ID: 3698809; next OID: 68745 LOG: next MultiXactId: 1; next MultiXactOffset: 0 LOG: automatic recovery in progress LOG: redo starts at 0/190C7E48 > > >>What did I do wrong? Is there any other procedure to follow to restart a >>stopped replication? >> >> > >You're right, using the trigger is not the right way to stop/start the >standby. Just stop/start the standby server normally. > > as above: a plain stop hangs >The trigger means that you'd like to perform a failover. > >There is a patch not yet applied which will make a new version of >pg_standby. pg_standby's official status right now is beta, so please >expect, look for and report any issues you find. Thanks. > > > thank you