WAL replay failure after file truncation(?) - Mailing list pgsql-hackers

From Tom Lane
Subject WAL replay failure after file truncation(?)
Date
Msg-id 12133.1117033331@sss.pgh.pa.us
Responses Re: WAL replay failure after file truncation(?)  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: WAL replay failure after file truncation(?)  (Manfred Koizar <mkoi-pg@aon.at>)
Re: WAL replay failure after file truncation(?)  (Christopher Kings-Lynne <chriskl@familyhealth.com.au>)
List pgsql-hackers
We've seen two recent reports:
http://archives.postgresql.org/pgsql-admin/2005-04/msg00008.php
http://archives.postgresql.org/pgsql-general/2005-05/msg01143.php
of postmaster restart failing because the WAL contains a reference
to a page that no longer exists.

I can think of a couple of possible explanations:
1. filesystem corruption, i.e. the page should exist in the file but the kernel has forgotten about it;
2. we truncated the file subsequent to the WAL record that causes the panic.
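
To make theory #2 concrete, here is a toy model of the failure, not PostgreSQL code. It assumes a relation file is just a list of pages and the WAL is a list of (block number, payload) records; `replay` and `ReplayPanic` are hypothetical names invented for the illustration.

```python
# Toy model only: a "relation file" is a list of pages and the "WAL" is a
# list of (block_number, payload) records. None of these names are
# PostgreSQL APIs.

class ReplayPanic(Exception):
    """Stands in for the PANIC raised when replay hits a missing page."""


def replay(file_pages, wal_records):
    """Apply each record strictly; fail on any block beyond end of file."""
    pages = list(file_pages)
    for block_no, payload in wal_records:
        if block_no >= len(pages):
            raise ReplayPanic(
                f"WAL references block {block_no}, "
                f"file has only {len(pages)} blocks"
            )
        pages[block_no] = payload
    return pages


# A record was logged against block 3 while the file still had 4 pages.
wal = [(3, "update-on-block-3")]

# The file was then truncated to 2 pages, and the server went down before
# the next checkpoint, so the record above must still be replayed.
on_disk = ["p0", "p1"]

try:
    replay(on_disk, wal)
    outcome = "ok"
except ReplayPanic as exc:
    outcome = f"panic: {exc}"

print(outcome)
```

In this model the file on disk reflects the truncation, but the log still holds an older record for a block past the new end of file, so strict replay cannot proceed.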

However, neither of these theories is entirely satisfying, because
the WAL replay logic has always acted like this; why haven't we
seen similar reports ever since 7.1?  And why are both of these
reports connected to btrees, when file truncation probably happens
far more often on regular tables?

But, setting those nagging doubts aside, theory #2 seems like a definite
bug that we ought to do something about.

The only really clean answer I can see is for file truncation to force a
checkpoint just before issuing the ftruncate call.  That way, no WAL
records referencing the to-be-deleted pages would need to be replayed in
a subsequent crash.  However, checkpoints are expensive enough to make
this solution very unattractive from a performance point of view.  And
I fear it's not a 100% solution anyway: what about the PITR scenario,
where you need to replay WAL that was written concurrently with a
filesystem backup being taken?  The backup might well include the
truncated version of the file, but you can't avoid replaying the
beginning portion of the WAL.
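
A rough sketch of both halves of that argument, again as a toy model rather than PostgreSQL code. It assumes a checkpoint can be represented as a position in the log from which crash recovery starts (`redo_start`), and that recovery from a base backup must start from the position recorded when the backup began (`backup_start`); all names here are hypothetical.

```python
# Toy model only: the WAL is a list of (block_number, payload) records and a
# checkpoint is just an index into that list. Not PostgreSQL APIs.

class ReplayPanic(Exception):
    """Stands in for the PANIC raised when replay hits a missing page."""


def replay(file_pages, wal, start):
    """Strictly replay wal[start:] against a copy of the file."""
    pages = list(file_pages)
    for block_no, payload in wal[start:]:
        if block_no >= len(pages):
            raise ReplayPanic(f"block {block_no} is past end of file")
        pages[block_no] = payload
    return pages


wal = []
pages = ["p0", "p1", "p2", "p3"]

# A filesystem backup begins; restoring it later means replaying from here.
backup_start = len(wal)

# Normal activity: a logged change to block 3.
wal.append((3, "update-on-block-3"))
pages[3] = "update-on-block-3"

# Proposed fix: force a checkpoint, then truncate. After the checkpoint,
# crash recovery no longer needs any record before redo_start.
redo_start = len(wal)
pages = pages[:2]

# Crash recovery starts at the forced checkpoint and sees nothing to replay.
recovered = replay(pages, wal, redo_start)

# The backup happened to copy the file after the truncation, but its replay
# must still begin at backup_start, before the record for block 3.
backup_pages = list(pages)
try:
    replay(backup_pages, wal, backup_start)
    pitr_outcome = "ok"
except ReplayPanic:
    pitr_outcome = "panic"

print(recovered, pitr_outcome)
```

Under these assumptions the forced checkpoint protects ordinary crash recovery, while the restored backup still meets the record for the truncated block.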

Plan B is for WAL replay to always be willing to extend the file to
whatever page number is mentioned in the log, even though this
may require inventing the contents of empty pages; we trust that their
contents won't matter because they'll be truncated again later in the
replay sequence.  This seems pretty messy though, especially for
indexes.  The major objection to it is that it gives up error detection
in real filesystem-corruption cases: we'll just silently build an
invalid index and then try to run with it.  (Still, that might be better
than refusing to start; at least you can REINDEX afterwards.)
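
Plan B and its main drawback can be sketched in the same toy style. This is an illustration under stated assumptions, not a proposed patch: it assumes the log contains an explicit record for the later truncation (modelled here as a `("truncate", new_length)` tuple), and `replay_lenient` and `EMPTY` are hypothetical names.

```python
# Toy model only: WAL records are ("update", block_no, payload) or
# ("truncate", new_length). Not PostgreSQL APIs.

EMPTY = None  # stands in for an invented, empty page


def replay_lenient(file_pages, wal):
    """Replay, extending the file with invented pages when needed.

    Returns the resulting pages plus the set of block numbers that were
    invented during replay and still exist at the end.
    """
    pages = list(file_pages)
    invented = set()
    for rec in wal:
        if rec[0] == "update":
            _, block_no, payload = rec
            # Extend the file up to the referenced block instead of failing.
            while len(pages) <= block_no:
                invented.add(len(pages))
                pages.append(EMPTY)
            pages[block_no] = payload
        elif rec[0] == "truncate":
            _, new_len = rec
            del pages[new_len:]
            invented = {b for b in invented if b < new_len}
    return pages, invented


# Truncation case: the invented pages disappear again later in the replay.
pages_ok, left_ok = replay_lenient(
    ["p0", "p1"], [("update", 3, "x"), ("truncate", 2)]
)

# Corruption case: no truncation follows, so invented pages silently remain.
pages_bad, left_bad = replay_lenient(["p0", "p1"], [("update", 3, "x")])

print(pages_ok, left_ok)
print(pages_bad, left_bad)
```

The two calls differ only in whether a truncation follows. In the second, replay completes without complaint even though the file now contains pages whose contents were made up, which is the loss of error detection described above.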

Any thoughts?
        regards, tom lane

