Home > mailing lists
Probable problem with pg_standby - Mailing list pgsql-hackers

From	Detlef Ulherr
Subject	Probable problem with pg_standby
Date	November 4, 2008 07:36:06
Msg-id	49102D78.9070701@sun.com Whole thread Raw
Responses	Re: Probable problem with pg_standby
List	pgsql-hackers
Tree view
Hi,

First to introduce myself, I am working in Sun Cluster engineering and I 
am responsible for the integration (the agent) between PostgreSQL and 
Sun Cluster. The PostgreSQL agent provides a feature which uses WAL file 
shipping and pg_standby as a replacement for shared storage.

Let's talk about the problem now. Whenever the primary database server 
selects a new timeline, the standby server which is running pg_standby 
stops applying logs to  its database.  It comes even worse, after a 
while pg_standby terminates the recovery mode and now we have primary 
and standby accepting requests. there was no trigger file created, nor a 
signal sent manually to pg_standby.

Here is some debugging output of pg_standby.

running restore         : OK
removing "/pgs/83_walarchives/0000001A00000000000000B5"
LOG:  restored log file "0000001F00000000000000D4" from archive
LOG:  record with zero length at 0/D4000060
LOG:  redo done at 0/D4000020

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 0000001F00000000000000D4
WAL file path           : /pgs/83_walarchives/0000001F00000000000000D4
Restoring to...         : pg_xlog/RECOVERYXLOG
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp 
"/pgs/83_walarchives/0000001F00000000000000D4" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 0000001F00000000000000B6 and later
running restore         : OK
LOG:  restored log file "0000001F00000000000000D4" from archive

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 00000020.history
WAL file path           : /pgs/83_walarchives/00000020.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/pgs/83_walarchives/00000020.history" 
"pg_xlog/RECOVERYHISTORY"
Keep archive history    : No cleanup required
running restore         : OKLOG:  restored log file "00000020.history" 
from archive

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 00000021.history
WAL file path           : /pgs/83_walarchives/00000021.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/pgs/83_walarchives/00000021.history" 
"pg_xlog/RECOVERYHISTORY"
Keep archive history    : No cleanup required
running restore         :cp: cannot access 
/pgs/83_walarchives/00000021.history
cp: cannot access /pgs/83_walarchives/00000021.history
cp: cannot access /pgs/83_walarchives/00000021.history
not restored            : history file not found
LOG:  selected new timeline ID: 33

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 0000001F.history
WAL file path           : /pgs/83_walarchives/0000001F.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/pgs/83_walarchives/0000001F.history" 
"pg_xlog/RECOVERYHISTORY"
Keep archive history    : No cleanup required
running restore         : OKLOG:  restored log file "0000001F.history" 
from archive
LOG:  archive recovery complete
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

And here are the corresponding logs from the primary database server.

LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
building file list ... done
0000001D00000000000000D1

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
0000001F.history

sent 2248 bytes  received 42 bytes  4580.00 bytes/sec
total size is 2119  speedup is 0.93
building file list ... done
0000001F00000000000000D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
LOG:  database system was shut down at 2008-10-29 14:07:40 CET
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
building file list ... done
0000001F00000000000000D3

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
building file list ... done
0000001D00000000000000D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
0000001E00000000000000D2

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  database system was shut down at 2008-10-29 14:10:59 CET
LOG:  starting archive recovery
LOG:  restore_command = 'cp /pgs/83_walarchives/%f %p'
cp: cannot access /pgs/83_walarchives/0000001F.history
cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4
LOG:  automatic recovery in progress
LOG:  record with zero length at 0/D4000060
LOG:  redo is not required
cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4
cp: cannot access /pgs/83_walarchives/00000020.history
LOG:  selected new timeline ID: 32
cp: cannot access /pgs/83_walarchives/0000001F.history
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
building file list ... done
0000001D00000000000000D3

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
00000020.history


All I did was forcing the primary in a recovery to generate a new 
timeline. The installed version was 8.3.4, but the problem is the same 
with earlier versions as well. It occurred in 8.2 also. this problem is 
reproducible all the times. For my agent code I implemented a workaround 
which guarantees that during a resilvering process the primary and the 
standby start at t the same timeline. But my feeling is that the standby 
should go to the same timeline as the primary when he receives the 
history file without disruption, and by all means it should never stop 
the recovery unmotivated. This will make a full synchronization 
necessary and in times of larger databases, this may cause major downtimes.

Kind regards

Detlef



-- 

*****************************************************************************Detlef UlherrStaff Engineer
            Tel: (++49 6103) 752-248Availability Engineering            Fax: (++49 6103) 752-167Sun Microsystems GmbH
         Amperestr. 6                            mailto:detlef.ulherr@sun.com63225 Langen
http://www.sun.de/
 
*****************************************************************************

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551
Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

*****************************************************************************
pgsql-hackers by date:
From: Gregory Stark
Date: 04 November 2008, 07:24:09
Subject: Re: Bitmap Indexes patch
From: "Fujii Masao"
Date: 04 November 2008, 08:04:38
Subject: Signal handling patch (v2) for Synch Rep
Probable problem with pg_standby - Mailing list pgsql-hackers

Previous

Next