Spontaneous PITR standby activiation - Mailing list pgsql-hackers

From David Christensen
Subject Spontaneous PITR standby activiation
Date
Msg-id CBC1F2EC-CDE6-45CD-8101-2695C427F55E@endpoint.com
Whole thread Raw
Responses Re: Spontaneous PITR standby activiation
List pgsql-hackers
Last night we had the standby server in a PITR setup seemingly  
spontaneously believe it had recovered completely and startup a new  
timeline; I'm running this information by you guys to see if what had  
happened is considered a bug, or if there is a simple explanation.

So some questions here:

1) is there a hard limit of the number of times the archive_command  
will attempt?  I didn't see anything documented about this in the PITR  
or config docs, so I'm guessing the 10 failures I saw in the log were  
just coincidental.

2) are the archive_command failures in the master's log responsible  
for the redo records?

3) would a Pg forced shutdown cause issues with the generated WAL when  
replaying?

4) at first I thought it had to do with a bug/failure in pg_standby,  
but I'm wondering if it has to do with the "record with zero length"  
referenced in the standby's logs.  Thoughts?

Details:

The postgres partition had completely filled up on the master, causing  
the postmaster to shut down.  We removed some cruft from the partition  
and were able to get the server started back up at this time.  When  
checking on the standby server, I noticed that it had switched out of  
recovery mode and was archiving its own WAL, etc, and was perfectly  
connectable from the command line.

The archive_command is:
 cp -i "%p" /mnt/pg_wal_archive/"%f" && cat %p | ssh backupserver  
"cat - > /mnt/pg_wal_archive/.%f; if test -e /mnt/pg_wal_archive/%f ;  
then rm /mnt/pg_wal_archive/.%f ; exit 1; else mv -f /mnt/ 
pg_wal_archive/.%f /mnt/pg_wal_archive/%f ; fi"

The recovery.conf file is a fairly vanilla pg_standby invocation:
  restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -c -d -s  
2 -t /tmp/pgsql.trigger.6666 /db_data/pg_wal_archive %f %p %r 2>> / 
db_data/pg_log/warmstandby.log'

Some possibly relevant settings from the master:
             name             |  setting
------------------------------+----------- archive_mode                 | on archive_timeout              | 60
checkpoint_completion_target| 0.7 checkpoint_segments          | 100 checkpoint_timeout           | 600
checkpoint_warning          | 30 commit_delay                 | 0 commit_siblings              | 5 fsync
       | on full_page_writes             | on synchronous_commit           | on wal_buffers                  | 128
wal_sync_method             | fdatasync wal_writer_delay             | 200
 

----
Relevant lines from the standby:

2009-11-25 04:03:52 UTC [6315]: [7478-1] LOG:  restored log file  
"000000020000064200000001" from archive
2009-11-25 04:03:58 UTC [6315]: [7479-1] LOG:  restored log file  
"000000020000064200000002" from archive
2009-11-25 04:44:09 UTC [6315]: [7480-1] LOG:  restored log file  
"000000020000064200000003" from archive
2009-11-25 04:45:22 UTC [6315]: [7481-1] LOG:  record with zero length  
at 642/3FFEB38
2009-11-25 04:45:22 UTC [6315]: [7482-1] LOG:  redo done at 642/3FFEAF0
2009-11-25 04:45:22 UTC [6315]: [7483-1] LOG:  last completed  
transaction was at log time 2009-11-25 04:04:03.067187+00
2009-11-25 04:45:22 UTC [6315]: [7484-1] LOG:  restored log file  
"000000020000064200000003" from archive
2009-11-25 04:45:28 UTC [6315]: [7485-1] LOG:  selected new timeline  
ID: 3
2009-11-25 04:45:28 UTC [6315]: [7486-1] LOG:  restored log file  
"00000002.history" from archive
2009-11-25 04:45:28 UTC [6315]: [7487-1] LOG:  archive recovery complete
2009-11-25 04:45:28 UTC [6315]: [7488-1] LOG:  checkpoint starting:  
shutdown immediate
2009-11-25 04:45:29 UTC [6315]: [7489-1] LOG:  checkpoint complete:  
wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 14  
recycled; write=0.009 s, sync=0.010 s, total=0.404 s
2009-11-25 04:45:30 UTC [22614]: [1-1] LOG:  autovacuum launcher started
2009-11-25 04:45:30 UTC [6313]: [1-1] LOG:  database system is ready  
to accept connections
2009-11-25 04:55:30 UTC [22612]: [1-1] LOG:  checkpoint starting: time
----From the master:
aws-oi-db-master-11192009:/var/log/postgresql# date
Wed Nov 25 14:40:44 EST 2009
aws-oi-db-master-11192009:/var/log/postgresql# grep WARNING  
postgresql-2009-11-25.log
2009-11-25 00:00:34 EST [30401]: [126-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:01:36 EST [30401]: [133-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:02:34 EST [30401]: [140-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:03:34 EST [30401]: [147-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:04:34 EST [30401]: [154-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:05:34 EST [30401]: [161-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:06:34 EST [30401]: [168-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:07:34 EST [30401]: [175-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:08:34 EST [30401]: [182-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:09:34 EST [30401]: [189-1] WARNING:  transaction log  
file "000000020000064200000003" could not be archived: too many failures
----
Regards,

David
--
David Christensen
End Point Corporation
david@endpoint.com






pgsql-hackers by date:

Previous
From: Roger Leigh
Date:
Subject: Re: garbage in psql -l
Next
From: Magnus Hagander
Date:
Subject: Re: cvs chapters in our docs