Probable problem with pg_standby - Mailing list pgsql-hackers
From | Detlef Ulherr |
---|---|
Subject | Probable problem with pg_standby |
Date | |
Msg-id | 49102D78.9070701@sun.com Whole thread Raw |
Responses |
Re: Probable problem with pg_standby
|
List | pgsql-hackers |
Hi, First to introduce myself, I am working in Sun Cluster engineering and I am responsible for the integration (the agent) between PostgreSQL and Sun Cluster. The PostgreSQL agent provides a feature which uses WAL file shipping and pg_standby as a replacement for shared storage. Let's talk about the problem now. Whenever the primary database server selects a new timeline, the standby server which is running pg_standby stops applying logs to its database. It comes even worse, after a while pg_standby terminates the recovery mode and now we have primary and standby accepting requests. there was no trigger file created, nor a signal sent manually to pg_standby. Here is some debugging output of pg_standby. running restore : OK removing "/pgs/83_walarchives/0000001A00000000000000B5" LOG: restored log file "0000001F00000000000000D4" from archive LOG: record with zero length at 0/D4000060 LOG: redo done at 0/D4000020 Trigger file : /pgs/data/failover Waiting for WAL file : 0000001F00000000000000D4 WAL file path : /pgs/83_walarchives/0000001F00000000000000D4 Restoring to... : pg_xlog/RECOVERYXLOG Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : cp "/pgs/83_walarchives/0000001F00000000000000D4" "pg_xlog/RECOVERYXLOG" Keep archive history : 0000001F00000000000000B6 and later running restore : OK LOG: restored log file "0000001F00000000000000D4" from archive Trigger file : /pgs/data/failover Waiting for WAL file : 00000020.history WAL file path : /pgs/83_walarchives/00000020.history Restoring to... : pg_xlog/RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : cp "/pgs/83_walarchives/00000020.history" "pg_xlog/RECOVERYHISTORY" Keep archive history : No cleanup required running restore : OKLOG: restored log file "00000020.history" from archive Trigger file : /pgs/data/failover Waiting for WAL file : 00000021.history WAL file path : /pgs/83_walarchives/00000021.history Restoring to... : pg_xlog/RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : cp "/pgs/83_walarchives/00000021.history" "pg_xlog/RECOVERYHISTORY" Keep archive history : No cleanup required running restore :cp: cannot access /pgs/83_walarchives/00000021.history cp: cannot access /pgs/83_walarchives/00000021.history cp: cannot access /pgs/83_walarchives/00000021.history not restored : history file not found LOG: selected new timeline ID: 33 Trigger file : /pgs/data/failover Waiting for WAL file : 0000001F.history WAL file path : /pgs/83_walarchives/0000001F.history Restoring to... : pg_xlog/RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : cp "/pgs/83_walarchives/0000001F.history" "pg_xlog/RECOVERYHISTORY" Keep archive history : No cleanup required running restore : OKLOG: restored log file "0000001F.history" from archive LOG: archive recovery complete LOG: autovacuum launcher started LOG: database system is ready to accept connections And here are the corresponding logs from the primary database server. LOG: autovacuum launcher started LOG: database system is ready to accept connections building file list ... done 0000001D00000000000000D1 sent 16779397 bytes received 42 bytes 6711775.60 bytes/sec total size is 16777216 speedup is 1.00 building file list ... done 0000001F.history sent 2248 bytes received 42 bytes 4580.00 bytes/sec total size is 2119 speedup is 0.93 building file list ... done 0000001F00000000000000D2 sent 16779397 bytes received 42 bytes 11186292.67 bytes/sec total size is 16777216 speedup is 1.00 LOG: received fast shutdown request LOG: aborting any active transactions LOG: autovacuum launcher shutting down LOG: shutting down LOG: database system is shut down LOG: database system was shut down at 2008-10-29 14:07:40 CET LOG: database system is ready to accept connections LOG: autovacuum launcher started building file list ... done 0000001F00000000000000D3 sent 16779397 bytes received 42 bytes 11186292.67 bytes/sec total size is 16777216 speedup is 1.00 LOG: received fast shutdown request LOG: aborting any active transactions LOG: autovacuum launcher shutting down LOG: shutting down LOG: database system is shut down building file list ... done 0000001D00000000000000D2 sent 16779397 bytes received 42 bytes 11186292.67 bytes/sec total size is 16777216 speedup is 1.00 building file list ... done 0000001E00000000000000D2 sent 16779397 bytes received 42 bytes 6711775.60 bytes/sec total size is 16777216 speedup is 1.00 LOG: database system was shut down at 2008-10-29 14:10:59 CET LOG: starting archive recovery LOG: restore_command = 'cp /pgs/83_walarchives/%f %p' cp: cannot access /pgs/83_walarchives/0000001F.history cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4 LOG: automatic recovery in progress LOG: record with zero length at 0/D4000060 LOG: redo is not required cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4 cp: cannot access /pgs/83_walarchives/00000020.history LOG: selected new timeline ID: 32 cp: cannot access /pgs/83_walarchives/0000001F.history LOG: archive recovery complete LOG: database system is ready to accept connections LOG: autovacuum launcher started building file list ... done 0000001D00000000000000D3 sent 16779397 bytes received 42 bytes 6711775.60 bytes/sec total size is 16777216 speedup is 1.00 building file list ... done 00000020.history All I did was forcing the primary in a recovery to generate a new timeline. The installed version was 8.3.4, but the problem is the same with earlier versions as well. It occurred in 8.2 also. this problem is reproducible all the times. For my agent code I implemented a workaround which guarantees that during a resilvering process the primary and the standby start at t the same timeline. But my feeling is that the standby should go to the same timeline as the primary when he receives the history file without disruption, and by all means it should never stop the recovery unmotivated. This will make a full synchronization necessary and in times of larger databases, this may cause major downtimes. Kind regards Detlef -- *****************************************************************************Detlef UlherrStaff Engineer Tel: (++49 6103) 752-248Availability Engineering Fax: (++49 6103) 752-167Sun Microsystems GmbH Amperestr. 6 mailto:detlef.ulherr@sun.com63225 Langen http://www.sun.de/ ***************************************************************************** Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten Amtsgericht Muenchen: HRB 161028 Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer Vorsitzender des Aufsichtsrates: Martin Haering *****************************************************************************
pgsql-hackers by date: