Re: PITR problem - Mailing list pgsql-general
From | Erik Jones |
---|---|
Subject | Re: PITR problem |
Date | |
Msg-id | 5309E09C-113D-4CEB-9776-D6F01D40C4DE@myemma.com |
In response to | PITR problem (wstrzalka <wstrzalka@gmail.com>) |
List | pgsql-general |
On Apr 26, 2008, at 5:11 PM, wstrzalka wrote:

> I have some problem with setting up PITR recovery on the database.
>
> I have archive_command set properly and logs are shipping OK. Archive
> timeout is also set (5 min).
>
> When performing pg_start_backup the WAL is, let's say, at position
> 0000000100000001000000D9, then I start copying the database to the
> second machine, which takes me 30 minutes. In that time archive
> timeout is called a few times and those files are shipped properly to
> the second host. After the DB is successfully copied I'm calling
> pg_stop_backup. The WAL is at that moment at position
> 0000000100000001000000DE.
>
> At that moment I see on the second machine WAL files from
> 0000000100000001000000D9 to 0000000100000001000000DE as well as
> 0000000100000001000000D9.00000020.backup
>
> The problem occurs now when I'm trying to start my standby server in
> recovery mode (with pg_standby).
>
> The output from pg_standby:
> ------------------------------------
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 00000001.history
> WAL file path : /var/lib/pgsql/incoming_wal/00000001.history
> Restoring to... : pg_xlog/RECOVERYHISTORY
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/00000001.history" "pg_xlog/RECOVERYHISTORY"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
>
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 0000000100000001000000D9.00000020.backup
> WAL file path : /var/lib/pgsql/incoming_wal/0000000100000001000000D9.00000020.backup
> Restoring to... : pg_xlog/RECOVERYHISTORY
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
>
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 0000000100000001000000D9
> WAL file path : /var/lib/pgsql/incoming_wal/0000000100000001000000D9
> Restoring to... : pg_xlog/RECOVERYXLOG
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"
> ------------------------------------
>
> The first time I start the standby, the Postgres log says the
> following and the postgres process goes down:
> ------------------------------------
> restored log file "0000000100000001000000D9.00000020.backup" from archive
> could not open file "pg_xlog/0000000100000001000000D9" (log file 1, segment 217): No such file or directory
> invalid checkpoint record
> could not locate required checkpoint record
> If you are not restoring from a backup, try removing the file "/var/lib/pgsql/data/backup_label".
> startup process (PID 19201) was terminated by signal 6: Aborted
> aborting startup due to startup process failure
> ------------------------------------
>
> When I try to start PG for the second time it just gets stuck waiting
> for ...000D9
>
> In my opinion the problem is that when starting, the standby
> PostgreSQL wants to recover WAL 0000000100000001000000D9, but first
> deletes it, as the keep archive history (%r) param is set to
> 0000000100000001000000DB
>
> Is it a bug or am I missing something? I can repeat the scenario with
> this big DB. However, it's not happening on exactly the same
> environment when playing with a smaller cluster (copying the cluster
> takes less time than archive_timeout).

What is the full pg_standby command string (restore_command=....) in your recovery.conf? It sounds like you have pg_standby set to delete archived WALs, and possibly have that set a little too aggressively. Do you have the -k flag set in the pg_standby call in your restore_command?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
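To make the suspected interaction concrete, here is a minimal Python sketch of the file-name comparison implied by the "Keep archive history : 0000000100000001000000DB and later" line in the quoted output. It is not pg_standby's actual code. The helper names `parse_wal_name` and `removable` are hypothetical, and the rule that any segment whose name sorts before the cutoff is eligible for removal is an assumption, chosen because it matches the two "removing" lines in the log.

```python
import re

# A WAL segment file name is 24 hex digits: 8 for the timeline,
# 8 for the log file number, 8 for the segment number.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_name(name):
    """Split a WAL segment name into (timeline, log, segment) integers."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def removable(archived, keep_from):
    """Return segment names that sort strictly before the cutoff.

    Assumption: cleanup is a plain name comparison against the cutoff.
    Equal-length uppercase hex names sort in the same order as their values.
    Non-segment files such as *.backup are ignored here.
    """
    return sorted(n for n in archived if WAL_NAME.match(n) and n < keep_from)


# Files the original poster reports seeing on the standby host.
archived = [f"0000000100000001000000{seg:02X}" for seg in range(0xD9, 0xDF)]
archived.append("0000000100000001000000D9.00000020.backup")

print(parse_wal_name("0000000100000001000000D9"))
print(removable(archived, "0000000100000001000000DB"))
```

Decoding ...D9 gives timeline 1, log file 1, segment 217, which agrees with the "(log file 1, segment 217)" detail in the startup error. Under the stated assumption, the cutoff ...DB selects ...D9 and ...DA for removal, the same two files the pg_standby output reports removing, even though ...D9 is the segment the backup needs first.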