Unarchived WALs deleted after crash - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Unarchived WALs deleted after crash
Date
Msg-id 511D06A6.8050106@dalibo.com
Whole thread Raw
Responses Re: Unarchived WALs deleted after crash
Re: Unarchived WALs deleted after crash
List pgsql-hackers
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can
reproduce on current HEAD.

On a cluster with archive enabled but failing, after a crash of
postmaster, the checkpoint occurring before leaving the recovery mode
deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, previous PITR backup
can not be used to restore the instance to a time where archiving was
failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be
archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:
 wal_level = archive archive_mode = on archive_command = '/bin/false' log_checkpoints = on

Scenario: createdb test psql -c 'create table test as select i, md5(i::text) from
generate_series(1,3000000) as i;' test kill -9 $(head -1 $PGDATA/postmaster.pid) pg_ctl start

Using this scenario, log files shows:
 LOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was: /bin/false WARNING:
transactionlog file "000000010000000000000001" could not
 
be archived: too many failures LOG:  database system was interrupted; last known up at 2013-02-14
16:12:58 CET LOG:  database system was not properly shut down; automatic recovery
in progress LOG:  crash recovery starts in timeline 1 and has target timeline 1 LOG:  redo starts at 0/11400078 LOG:
recordwith zero length at 0/13397190 LOG:  redo done at 0/13397160 LOG:  last completed transaction was at log time
2013-02-14
16:12:58.49303+01 LOG:  checkpoint starting: end-of-recovery immediate LOG:  checkpoint complete: wrote 2869 buffers
(17.5%);0 transaction
 
log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s,
total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s LOG:  autovacuum launcher started LOG:  database system
isready to accept connections LOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was:
/bin/falseLOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was: /bin/false LOG:
archivecommand failed with exit code 1 DETAIL:  The failed archive command was: /bin/false WARNING:  transaction log
file"000000010000000000000011" could not
 
be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive.
After the kill, the checkpoint deleted 9 files before exiting recovery
mode and "000000010000000000000011" become the first WAL to archive.
"000000010000000000000001" through "000000010000000000000010" were
removed or recycled.

Is it expected ?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: proposal or just idea for psql - show first N rows from relation backslash statement
Next
From: Merlin Moncure
Date:
Subject: Re: proposal or just idea for psql - show first N rows from relation backslash statement