BUG #4137: Postgres drives pg_standby %r param to delete WAL file before processing - Mailing list pgsql-bugs
From | Wojciech Strzalka |
---|---|
Subject | BUG #4137: Postgres drives pg_standby %r param to delete WAL file before processing |
Date | |
Msg-id | 200805021811.m42IBdRV041102@wwwmaster.postgresql.org Whole thread Raw |
Responses |
Re: BUG #4137: Postgres drives pg_standby %r param to
delete WAL file before processing
|
List | pgsql-bugs |
The following bug has been logged online: Bug reference: 4137 Logged by: Wojciech Strzalka Email address: wstrzalka@gmail.com PostgreSQL version: 8.3.1 Operating system: Linux version 2.6.18-53.1.14.el5 (Red Hat 4.1.2-14) Description: Postgres drives pg_standby %r param to delete WAL file before processing Details: I've discovered problem with WAL recovery on standby server. I have archive_command set properly and logs are shipping OK. Archive timeout is also set (5 min). When performing pg_start_backup the WAL is lets say on position 0000000100000001000000D9, then I start copy database to the second machine which takes me 30 minutes. In that time archive timeout is called a few times and those file are shipped properly to the second host. After DB is succesfully copied i'm calling pg_stop_backup. The WAL is at the moment on position 0000000100000001000000DE. In that moment I see on the second machine WAL files from 0000000100000001000000D9 to 0000000100000001000000DE as well as 0000000100000001000000D9.00000020.backup The problem occurs now when I'm trying to start my standby server in recovery mode (with pg_standby). The output from pg_standby: ------------------------------------ Trigger file : /tmp/pgsql.promote_trigger.5432 Waiting for WAL file : 00000001.history WAL file path : /var/lib/pgsql/incoming_wal/ 00000001.history Restoring to... : pg_xlog/RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/ 00000001.history" "pg_xlog/RECOVERYHISTORY" Keep archive history : 0000000100000001000000DB and later running restore : OK Trigger file : /tmp/pgsql.promote_trigger.5432 Waiting for WAL file : 0000000100000001000000D9.00000020.backup WAL file path : /var/lib/pgsql/incoming_wal/ 0000000100000001000000D9.00000020.backup Restoring to... : pg_xlog/RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/ 0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY" Keep archive history : 0000000100000001000000DB and later running restore : OK Trigger file : /tmp/pgsql.promote_trigger.5432 Waiting for WAL file : 0000000100000001000000D9 WAL file path : /var/lib/pgsql/incoming_wal/ 0000000100000001000000D9 Restoring to... : pg_xlog/RECOVERYXLOG Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/ 0000000100000001000000D9" "pg_xlog/RECOVERYXLOG" Keep archive history : 0000000100000001000000DB and later running restore : OK removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9" removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA" ---------------------------------------------------------------------------- ---------------------------- For the first time I start standby Postgres log says and the postgres process goes down: ---------------------------------------------------------------------------- ---------------------------- restored log file "0000000100000001000000D9.00000020.backup" from archive could not open file "pg_xlog/0000000100000001000000D9" (log file 1, segment 217): No such file or directory invalid checkpoint record could not locate required checkpoint record If you are not restoring from a backup, try removing the file "/var/lib/pgsql/data/backup_label". startup process (PID 19201) was terminated by signal 6: Aborted aborting startup due to startup process failure ---------------------------------------------------------------------------- ---------------------------- When I try to start PG for the second time it just stucks waiting for 0000000100000001000000D9 In my opinion the problem is that when starting standby PostgresSQL wants to recovery WAL 0000000100000001000000D9, but first deletes it, as keep archive history (%r) param is set to 0000000100000001000000DB from the very first call to the pg_standby (when reading 00000001.history). This occurs only when the copy process is longer then archive_timeout. My restore command is: ------------------------------------------ restore_command = 'pg_standby -l -d -s 5 -w 0 -t /tmp/pgsql.promote_trigger.5432 ~postgres/incoming_wal %f %p %r 2>&1 | logger -p local1.info -t pitr-standby' ------------------------------------------ I don't know if it matters but there are no transactions on active at the time I make a copy - so all WAL archives shipped at the time are probably empty. I don't delete any files from pg_xlog prior to start standby. I hope this description is enough to reproduce the bug.
pgsql-bugs by date: