Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Recovery inconsistencies, standby much larger than primary
Date
Msg-id 31058.1392435045@sss.pgh.pa.us
Whole thread Raw
In response to Re: Recovery inconsistencies, standby much larger than primary  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Recovery inconsistencies, standby much larger than primary  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
>> Going over this I think this is still a potential issue:
>> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> I am not sure that explains the issue, but I think the redo action for
>>> truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
>>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
>>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
>>> not delete individual segment files that are not needed anymore, right?
>>> If we crash in the midst of that a new mdtruncate() will be issued, but
>>> it will get a shorter value back from mdnblocks().

>> I'm not too familiar with md.c but my reading of the code is that we
>> truncate the files in reverse order?

> That's what I had assumed as well, but it doesn't look that way:

No, it's deleting forward.

We could probably fix things so it deleted backwards; it'd be a tad
tedious because the list structure isn't organized that way, but we
could do it.  Not sure if that's good enough though.  If you don't
want to assume the filesystem metadata is coherent after a crash,
we might have nonzero-size segments after zero-size ones, even if
the truncate calls had been issued in the right order.

Another possibility is to keep opening and truncating files until
we don't find the next segment in sequence, looking directly at the
filesystem not at the mdfd chain.  I don't think this would be
appropriate in normal operation, but we could do it if InRecovery
(and maybe even only if we don't think the database is consistent?)
        regards, tom lane



pgsql-hackers by date:

Previous
From: Florian Pflug
Date:
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Next
From: Bruce Momjian
Date:
Subject: Small psql memory fix