Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Recovery inconsistencies, standby much larger than primary
Msg-id: 20140215114530.GD20973@alap3.anarazel.de
In response to: Re: Recovery inconsistencies, standby much larger than primary (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Recovery inconsistencies, standby much larger than primary (Greg Stark <stark@mit.edu>)
List: pgsql-hackers

On 2014-02-14 22:30:45 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
> >> Going over this I think this is still a potential issue:
> >> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
> >>> I am not sure that explains the issue, but I think the redo action for
> >>> truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
> >>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
> >>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> >>> not delete individual segment files that are not needed anymore, right?
> >>> If we crash in the midst of that a new mdtruncate() will be issued, but
> >>> it will get a shorter value back from mdnblocks().

> We could probably fix things so it deleted backwards; it'd be a tad
> tedious because the list structure isn't organized that way, but we
> could do it.

We could just make the list a doubly linked one; that'd make it simple.
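
Roughly what I have in mind, as a toy sketch rather than actual md.c
code. ToySeg, toy_link_chain(), toy_truncate_backwards(), SEG_BLOCKS and
the crash_after knob are all made up for illustration; the real segment
chain looks different. The point is only that a prev pointer lets us
walk from the last segment towards the first, so an interrupted
truncation leaves the earlier segments intact:

```c
#include <stddef.h>

#define SEG_BLOCKS 4            /* toy stand-in for the real segment size */

/* Hypothetical in-memory segment entry, linked in both directions. */
typedef struct ToySeg
{
    int            nblocks;     /* blocks currently in this segment */
    struct ToySeg *prev;        /* previous (lower-numbered) segment */
    struct ToySeg *next;        /* next (higher-numbered) segment */
} ToySeg;

/* Link an array of n segments into a chain and return its tail. */
ToySeg *
toy_link_chain(ToySeg *segs, int n)
{
    for (int i = 0; i < n; i++)
    {
        segs[i].prev = (i > 0) ? &segs[i - 1] : NULL;
        segs[i].next = (i + 1 < n) ? &segs[i + 1] : NULL;
    }
    return (n > 0) ? &segs[n - 1] : NULL;
}

/*
 * Truncate the relation to nblocks, starting at the tail segment
 * (numbered tail_segno) and walking backwards.  crash_after simulates a
 * crash after that many segments were processed; -1 means no crash.
 * Returns the number of segments processed.
 */
int
toy_truncate_backwards(ToySeg *tail, int tail_segno, int nblocks,
                       int crash_after)
{
    int done = 0;
    int segno = tail_segno;

    for (ToySeg *seg = tail; seg != NULL; seg = seg->prev, segno--)
    {
        int keep = nblocks - segno * SEG_BLOCKS;

        if (crash_after >= 0 && done == crash_after)
            break;              /* simulated crash */
        if (keep >= SEG_BLOCKS)
            break;              /* this and all earlier segments stay whole */
        if (keep < 0)
            keep = 0;
        if (seg->nblocks > keep)
            seg->nblocks = keep;
        done++;
    }
    return done;
}
```

If the first attempt dies after the last segment, the earlier segments
are still full, so a size-based count on replay still reaches them.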

> Not sure if that's good enough though.  If you don't
> want to assume the filesystem metadata is coherent after a crash,
> we might have nonzero-size segments after zero-size ones, even if
> the truncate calls had been issued in the right order.

I don't think that can actually happen on any realistic/interesting
FS. Metadata updates had better be journaled, so while they might not
persist because the journal wasn't flushed, they should be applied in a
sane order after a crash.
Nonetheless, I am not sure we want to rely on that.

> Another possibility is to keep opening and truncating files until
> we don't find the next segment in sequence, looking directly at the
> filesystem not at the mdfd chain.  I don't think this would be
> appropriate in normal operation, but we could do it if InRecovery
> (and maybe even only if we don't think the database is consistent?)

Yes, I was thinking of simply having an mdnblocks() variant that looks
for the last existing file, disregarding the size. But looking around,
it seems mdunlinkfork() has a similar issue, and I don't see how such a
trick could be applied there :(
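
To make the truncation half of that concrete, here is a small in-memory
model, not real md.c code. ToyRel, toy_nblocks_by_size(),
toy_nsegments_by_existence(), toy_truncate() and the crash_after knob
are invented for the sketch, and the size-based count is only my
approximation of how the current lookup stops at the first segment that
isn't full:

```c
#include <stdbool.h>

#define SEG_BLOCKS 4            /* toy stand-in for the real segment size */
#define MAX_SEGS   8

/* Toy relation: one slot per segment file. */
typedef struct ToyRel
{
    bool exists[MAX_SEGS];      /* is the segment file present? */
    int  nblocks[MAX_SEGS];     /* blocks in that segment */
} ToyRel;

/* Size-based count: stops at the first segment that is not full. */
int
toy_nblocks_by_size(const ToyRel *rel)
{
    int total = 0;

    for (int seg = 0; seg < MAX_SEGS && rel->exists[seg]; seg++)
    {
        total += rel->nblocks[seg];
        if (rel->nblocks[seg] < SEG_BLOCKS)
            break;
    }
    return total;
}

/* Existence-based variant: count consecutive files, ignoring sizes. */
int
toy_nsegments_by_existence(const ToyRel *rel)
{
    int seg = 0;

    while (seg < MAX_SEGS && rel->exists[seg])
        seg++;
    return seg;
}

/*
 * Truncate to nblocks, visiting segments 0 .. nsegs-1 in order.
 * crash_after simulates a crash after that many segments were
 * truncated; -1 means no crash.
 */
void
toy_truncate(ToyRel *rel, int nblocks, int nsegs, int crash_after)
{
    int done = 0;

    for (int seg = 0; seg < nsegs; seg++)
    {
        int keep = nblocks - seg * SEG_BLOCKS;

        if (crash_after >= 0 && done == crash_after)
            return;             /* simulated crash */
        if (keep < 0)
            keep = 0;
        if (keep > SEG_BLOCKS)
            keep = SEG_BLOCKS;
        if (rel->nblocks[seg] > keep)
            rel->nblocks[seg] = keep;
        done++;
    }
}
```

With three full segments, a truncate to zero that dies after the first
segment leaves the size-based count at 0, so a replay driven by it never
touches segments 1 and 2. Driving the replay from the existence-based
count visits all three.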

I guess the theoretically correct thing would be to make all WAL records
about truncation and unlinking contain the current size of the relation,
but especially with deletions and forks that will probably turn out to
be annoying to do.
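
For the truncation case that would look something like the sketch
below. This is not the existing WAL record layout; ToyTruncateRecord,
its old_nblocks field, toy_segments_covered() and toy_redo_truncate()
are hypothetical names, and forks and unlinking are ignored entirely.
Redo takes the number of segments to visit from the record instead of
asking the filesystem:

```c
#include <stdint.h>

#define SEG_BLOCKS 4u           /* toy stand-in for the real segment size */

/* Hypothetical truncation record that also carries the old size. */
typedef struct ToyTruncateRecord
{
    uint32_t old_nblocks;       /* relation size when the record was written */
    uint32_t new_nblocks;       /* size to truncate to */
} ToyTruncateRecord;

/* Number of segments needed to hold nblocks blocks. */
uint32_t
toy_segments_covered(uint32_t nblocks)
{
    return (nblocks + SEG_BLOCKS - 1) / SEG_BLOCKS;
}

/*
 * Redo the truncation.  seg_nblocks[] holds the current per-segment
 * sizes and must have at least toy_segments_covered(old_nblocks)
 * entries.  Returns the number of segments visited.
 */
uint32_t
toy_redo_truncate(const ToyTruncateRecord *rec, uint32_t *seg_nblocks)
{
    uint32_t nsegs = toy_segments_covered(rec->old_nblocks);

    for (uint32_t seg = 0; seg < nsegs; seg++)
    {
        uint32_t start = seg * SEG_BLOCKS;
        uint32_t keep = 0;

        if (rec->new_nblocks > start)
        {
            keep = rec->new_nblocks - start;
            if (keep > SEG_BLOCKS)
                keep = SEG_BLOCKS;
        }
        if (seg_nblocks[seg] > keep)
            seg_nblocks[seg] = keep;
    }
    return nsegs;
}
```

Replaying the same record twice gives the same result, whatever state an
earlier interrupted attempt left the segments in.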

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


