Thread: Some problem with warm standby server

Some problem with warm standby server

From

Nico Sabbi

Date:

27 April 2007, 10:35:20

Hi,
I have some doubts regarding the settings and the access procedure of
warm standby servers:
- can autovacuum be safely enabled on the replicator?
- I'm using pg_standby (from cvs) that is generally working well as
expected (logs are copied with
  scp); today I wanted to  temporarily stop the replication to verify
some data to restart it later on, so
  I touched the trigger file, waited for the log to report "database
ready", verified that the
  databases were actually up-to-date. All was fine, then I ran

  rm -f pg_xlog/* pg_xlog/archive_status/*
  mv recovery.done recovery.conf (the permissions were right)
  /etc/init.d/postgresql stop ; /etc/init.d/postgresql start

  the replication seemed to start:
 ----
---------------------------------------------------
LOG:  database system was shut down at 2007-04-27 12:16:13 CEST
LOG:  starting archive recovery
LOG:  restore_command = "/usr/local/bin/pg_standby -s 5 -w 0 -t
/usr/local/postgres_replica/trigger  /usr/local/postgres_replica/log/ %f %p"
cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
such file or directory
cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
such file or directory
cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
such file or directory


then I updated the master with a batch of inserts, but after a while the
slave stopped with
these messages:

LOG:  restored log file "000000010000000000000021" from archive
LOG:  record with zero length at 0/21000048
LOG:  invalid primary checkpoint record
LOG:  restored log file "000000010000000000000020" from archive
LOG:  restored log file "000000010000000000000021" from archive
LOG:  invalid resource manager ID in secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 19619) was terminated by signal 6
LOG:  aborting startup due to startup process failure


What did I do wrong? Is there any other procedure to follow to restart a
stopped replication?
Thanks,
    Nico

Re: Some problem with warm standby server

From

"Simon Riggs"

Date:

28 April 2007, 10:25:09

On Fri, 2007-04-27 at 12:31 +0200, Nico Sabbi wrote:

> I have some doubts regarding the settings and the access procedure of
> warm standby servers:
> - can autovacuum be safely enabled on the replicator?

Not a relevant question. The standby isn't "up", so no SQL can be
executed on the standby, so it never needs vacuuming. The autovacuum
process doesn't start until recovery finishes. That is why it is Warm
rather than Hot standby. Florian Pflug is working on allowing read-only
queries to execute on the standby, so we're hopeful of further
enhancements in the next release.

> - I'm using pg_standby (from cvs) that is generally working well as
> expected (logs are copied with
>   scp); today I wanted to  temporarily stop the replication to verify
> some data to restart it later on, so
>   I touched the trigger file, waited for the log to report "database
> ready", verified that the
>   databases were actually up-to-date. All was fine, then I ran
>
>   rm -f pg_xlog/* pg_xlog/archive_status/*
>   mv recovery.done recovery.conf (the permissions were right)
>   /etc/init.d/postgresql stop ; /etc/init.d/postgresql start
>
>   the replication seemed to start:
>  ----
> ---------------------------------------------------
> LOG:  database system was shut down at 2007-04-27 12:16:13 CEST
> LOG:  starting archive recovery
> LOG:  restore_command = "/usr/local/bin/pg_standby -s 5 -w 0 -t
> /usr/local/postgres_replica/trigger  /usr/local/postgres_replica/log/ %f %p"
> cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
> such file or directory
> cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
> such file or directory
> cp: cannot stat `/usr/local/postgres_replica/log//00000001.history': No
> such file or directory

Looks like there's an issue with double slashes on the archive filename,
which I will fix. This probably isnt the problem though.

> then I updated the master with a batch of inserts, but after a while the
> slave stopped with
> these messages:
>
> LOG:  restored log file "000000010000000000000021" from archive
> LOG:  record with zero length at 0/21000048
> LOG:  invalid primary checkpoint record
> LOG:  restored log file "000000010000000000000020" from archive
> LOG:  restored log file "000000010000000000000021" from archive
> LOG:  invalid resource manager ID in secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 19619) was terminated by signal 6
> LOG:  aborting startup due to startup process failure

Please run pg_controldata to print out the control file.
Backup all the files in case we need to inspect them.
What was the ending log sequence number (e.g. x/xxxx) from the previous
recovery? I'll see if I can re-create this.

> What did I do wrong? Is there any other procedure to follow to restart a
> stopped replication?

You're right, using the trigger is not the right way to stop/start the
standby. Just stop/start the standby server normally.

The trigger means that you'd like to perform a failover.

There is a patch not yet applied which will make a new version of
pg_standby. pg_standby's official status right now is beta, so please
expect, look for and report any issues you find. Thanks.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Some problem with warm standby server

From

Nico Sabbi

Date:

08 May 2007, 16:10:22

Simon Riggs wrote:

>>then I updated the master with a batch of inserts, but after a while the
>>slave stopped with
>>these messages:
>>
>>LOG:  restored log file "000000010000000000000021" from archive
>>LOG:  record with zero length at 0/21000048
>>LOG:  invalid primary checkpoint record
>>LOG:  restored log file "000000010000000000000020" from archive
>>LOG:  restored log file "000000010000000000000021" from archive
>>LOG:  invalid resource manager ID in secondary checkpoint record
>>PANIC:  could not locate a valid checkpoint record
>>LOG:  startup process (PID 19619) was terminated by signal 6
>>LOG:  aborting startup due to startup process failure
>>
>>
>
>Please run pg_controldata to print out the control file.
>
>

Hi, sorry for the long delay.
First of all I had to stop postgres with pg_ctl stop -s immediate, or it
wouldn't die because of the ongoing replication.

This is the output of pg_controldata:

postgres@www3:/usr/local/postgres_replica/data$ pg_controldata
/usr/local/postgres_replica/data/
pg_control version number:            812
Catalog version number:               200510211
Database system identifier:           5001030714849737714
Database cluster state:               in recovery
pg_control last modified:             Fri 27 Apr 2007 13:20:46 CEST
Current log file ID:                  0
Next log file segment:                26
Latest checkpoint location:           0/190C7E04
Prior checkpoint location:            0/190C7DC0
Latest checkpoint's REDO location:    0/190C7E04
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          3698809
Latest checkpoint's NextOID:          68745
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Fri 27 Apr 2007 11:53:47 CEST
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C


>Backup all the files in case we need to inspect them.
>
>

ok

>What was the ending log sequence number (e.g. x/xxxx) from the previous
>recovery? I'll see if I can re-create this.
>
>

judging from the logs I gues it is 0/190C7E04:
LOG:  restored log file "000000010000000000000019.000C7E04.backup" from
archive
LOG:  restored log file "000000010000000000000019" from archive
LOG:  checkpoint record is at 0/190C7E04
LOG:  redo record is at 0/190C7E04; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 3698809; next OID: 68745
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  automatic recovery in progress
LOG:  redo starts at 0/190C7E48


>
>
>>What did I do wrong? Is there any other procedure to follow to restart a
>>stopped replication?
>>
>>
>
>You're right, using the trigger is not the right way to stop/start the
>standby. Just stop/start the standby server normally.
>
>

as above: a plain stop hangs

>The trigger means that you'd like to perform a failover.
>
>There is a patch not yet applied which will make a new version of
>pg_standby. pg_standby's official status right now is beta, so please
>expect, look for and report any issues you find. Thanks.
>
>
>
thank you