Thread: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)

In the discussion of bug #5599 we pretty much agreed to do this:
> Seems like we need to think harder about recovering from a truncate
> failure.  A few random ideas:
> 1. Write the dirty buffers before dropping them.  Kind of ugly from a
> performance viewpoint, but simple and safe.

I looked at making this happen, and noted that DropRelFileNodeBuffers
is used both for the truncation case and for dropping relation buffers
during smgrdounlink.  In the latter case, it's still appropriate to
drop dirty buffers without writing them, both for performance reasons
and because we don't really care about any errors: we have already
committed the relation DROP, and are not going to look at the file
contents again in any case.  So this means that two different behaviors
are now required for truncation and dropping.

The cleanest fix is an API change to add a boolean write-or-not
parameter to DropRelFileNodeBuffers.  That's what I want to do in HEAD
and 9.0, but I'm unsure whether it's a safe fix in the back branches.
Does anyone have an opinion whether it's likely that any third-party
code is calling DropRelFileNodeBuffers directly?  If there is, then
changing its API in a minor release would be an unfriendly thing to do.
We could avoid that by some ugly expedient like inserting a second copy
of the function in back branches.
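
For concreteness, here's a rough sketch of the shape I have in mind.
The write_dirty flag is the proposed addition; the rest of the signature
follows the existing function, and the buffer-header locking and pinning
are elided, so treat this as a sketch rather than the actual patch:

void
DropRelFileNodeBuffers(RelFileNode rnode, bool istemp,
                       BlockNumber firstDelBlock, bool write_dirty)
{
    int         i;

    for (i = 0; i < NBuffers; i++)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[i];

        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode) ||
            bufHdr->tag.blockNum < firstDelBlock)
            continue;

        if (write_dirty && (bufHdr->flags & BM_DIRTY))
            FlushBuffer(bufHdr, NULL);  /* truncate case: write it out first */

        InvalidateBuffer(bufHdr);       /* then forget the buffer entirely */
    }
}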

Comments?
        regards, tom lane


On 15/08/10 21:58, Tom Lane wrote:
> Does anyone have an opinion whether it's likely that any third-party
> code is calling DropRelFileNodeBuffers directly?

I doubt it. External modules shouldn't be modifying relations at such a 
low level.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Sun, Aug 15, 2010 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the discussion of bug #5599 we pretty much agreed to do this:
>> Seems like we need to think harder about recovering from a truncate
>> failure.  A few random ideas:
>> 1. Write the dirty buffers before dropping them.  Kind of ugly from a
>> performance viewpoint, but simple and safe.
>
> I looked at making this happen, and noted that DropRelFileNodeBuffers
> is used both for the truncation case and for dropping relation buffers
> during smgrdounlink.  In the latter case, it's still appropriate to
> drop dirty buffers without writing them, both for performance reasons
> and because we don't really care about any errors: we have already
> committed the relation DROP, and are not going to look at the file
> contents again in any case.  So this means that two different behaviors
> are now required for truncation and dropping.
>
> The cleanest fix is an API change to add a boolean write-or-not
> parameter to DropRelFileNodeBuffers.  That's what I want to do in HEAD
> and 9.0, but I'm unsure whether it's a safe fix in the back branches.
> Does anyone have an opinion whether it's likely that any third-party
> code is calling DropRelFileNodeBuffers directly?  If there is, then
> changing its API in a minor release would be an unfriendly thing to do.
> We could avoid that by some ugly expedient like inserting a second copy
> of the function in back branches.
>
> Comments?

I really hate this solution, because writing out data that we're about
to throw away just in case we can't actually throw it away seems like
a real waste from a performance standpoint.  Could we avoid this
altogether by allocating a new relfilenode on truncate?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


On Sun, Aug 15, 2010 at 9:48 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 15/08/10 21:58, Tom Lane wrote:
>>
>> Does anyone have an opinion whether it's likely that any third-party
>> code is calling DropRelFileNodeBuffers directly?
>
> I doubt it. External modules shouldn't be modifying relations at such a low
> level.

Really? What about an index access method?


-- 
greg


Robert Haas <robertmhaas@gmail.com> writes:
> I really hate this solution, because writing out data that we're about
> to throw away just in case we can't actually throw it away seems like
> a real waste from a performance standpoint.

We went around on this already in the pgsql-bugs thread.  I don't much
like it either, but there isn't any better solution that seems safe
enough to back-patch.

> Could we avoid this
> altogether by allocating a new relfilenode on truncate?

Then we'd have to copy all the data we *didn't* truncate, which is
hardly likely to be a win.
        regards, tom lane


Greg Stark <gsstark@mit.edu> writes:
> On Sun, Aug 15, 2010 at 9:48 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 15/08/10 21:58, Tom Lane wrote:
>>> Does anyone have an opinion whether it's likely that any third-party
>>> code is calling DropRelFileNodeBuffers directly?
>> 
>> I doubt it. External modules shouldn't be modifying relations at such a low
>> level.

> Really? What about an index access method?

An index AM might have a reason to call smgrtruncate, but I can't see an
argument why it should need to call DropRelFileNodeBuffers directly.
The built-in AMs certainly never did.
        regards, tom lane


I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Could we avoid this
>> altogether by allocating a new relfilenode on truncate?

> Then we'd have to copy all the data we *didn't* truncate, which is
> hardly likely to be a win.

There is one thing we could do pretty easily.  AFAICS there are
presently just two users of smgrtruncate: vacuumlazy.c and
heap_truncate.  The latter is used primarily for ON COMMIT DELETE ROWS.
AFAICS we could still use drop-dirty-buffers semantics for that.
In event of a truncate failure, we'd have garbage data on disk,
but it doesn't matter because that data would represent rows inserted
by the transaction that was aborted by the truncation failure, so
later transactions would ignore it anyway.  Now the indexes on such
a temp table are a bigger problem, but we have pretty fatal
heap-to-index consistency issues anyhow if one of the interrelated
truncates fails.  We could probably minimize the danger if we did
things in the right order: truncate indexes to zero length, truncate
table to zero length, rebuild indexes as empty.
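
In other words, something like this (the helper names are hypothetical
and argument lists abbreviated; it's only meant to show the ordering):

/* 1: indexes first, so a failure can't leave an index pointing at
 *    heap data that no longer exists */
truncate_each_index_to_zero(rel);       /* hypothetical helper */
/* 2: now the heap */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, 0);
/* 3: finally make the indexes structurally valid again */
rebuild_each_index_as_empty(rel);       /* hypothetical helper */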

However, I'm not sure that we want to do that in the back branches,
because it would require adding write-or-drop booleans to some
higher-level functions like smgrtruncate, thus greatly increasing
the risk of breaking 3rd-party code.
        regards, tom lane


On Sun, Aug 15, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could we avoid this
>> altogether by allocating a new relfilenode on truncate?
>
> Then we'd have to copy all the data we *didn't* truncate, which is
> hardly likely to be a win.

Oh, sorry.  I was thinking we were talking about complete truncation
rather than partial truncation.  I'm still pretty unhappy with the
proposed fix, though, because it gives up performance in a broad range
of cases to cater to an extremely narrow failure case.  Considering
the rarity of the proposed problem, are we sure that it isn't better
to adopt a solution like what Heikki proposed?  If truncation fails,
try to zero the pages; if that also fails, PANIC.  I'm really
reluctant to back-patch a performance regression.  Perhaps, as Greg
Stark says, there are a variety of ways that this can happen - but
they're all pretty rare, and seem to require a fairly substantial
amount of broken-ness.  If we're in a situation where we can't
reliably update our disk files, it seems optimistic to assume that
keeping on running is going to be a whole lot better than PANICing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Robert Haas <robertmhaas@gmail.com> writes:
> Oh, sorry.  I was thinking we were talking about complete truncation
> rather than partial truncation.  I'm still pretty unhappy with the
> proposed fix, though, because it gives up performance in a broad range
> of cases to cater to an extremely narrow failure case.

It doesn't matter: broken is broken, and failure to recover from a
truncate() error is broken.  You mustn't think that this is a
Windows-only problem.

> Considering
> the rarity of the proposed problem, are we sure that it isn't better
> to adopt a solution like what Heikki proposed?  If truncation fails,
> try to zero the pages; if that also fails, PANIC.

... and PANIC is absolutely, entirely, 100% unacceptable here.  I don't
think you understand the context.  We've already written the truncate
action to WAL (as we must: WAL before data change).  If we PANIC, that
means we'll PANIC again during WAL replay.  So that solution means a
down, and perhaps unrecoverably broken, database.  There's also the
little problem that "zero the page" is only a safe recovery action for
heap pages, not index pages; smgrtruncate is certainly not entitled to
assume it's working on a heap.

I think the performance argument is based on fear, not facts, anyway.
In practice, in modern installations, you're only going to be talking
about marginally more work in autovacuum.  ISTM we should fix the bug,
in a simple/reliable/backpatchable way, and then anyone who's worried
about performance can measure the size of the problem, and try to
think of a workable solution if he doesn't like the results.
        regards, tom lane


On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh, sorry.  I was thinking we were talking about complete truncation
>> rather than partial truncation.  I'm still pretty unhappy with the
>> proposed fix, though, because it gives up performance in a broad range
>> of cases to cater to an extremely narrow failure case.
>
> It doesn't matter: broken is broken, and failure to recover from a
> truncate() error is broken.  You mustn't think that this is a
> Windows-only problem.
>
>> Considering
>> the rarity of the proposed problem, are we sure that it isn't better
>> to adopt a solution like what Heikki proposed?  If truncation fails,
>> try to zero the pages; if that also fails, PANIC.
>
> ... and PANIC is absolutely, entirely, 100% unacceptable here.  I don't
> think you understand the context.  We've already written the truncate
> action to WAL (as we must: WAL before data change).  If we PANIC, that
> means we'll PANIC again during WAL replay.  So that solution means a
> down, and perhaps unrecoverably broken, database.

All right, that would be bad.

> There's also the
> little problem that "zero the page" is only a safe recovery action for
> heap pages, not index pages; smgrtruncate is certainly not entitled to
> assume it's working on a heap.

This doesn't seem right to me; we don't WAL log relation extension,
even on an index.

> I think the performance argument is based on fear not facts anyway.
> In practice, in modern installations, you're only going to be talking
> about marginally more work in autovacuum.  ISTM we should fix the bug,
> in a simple/reliable/backpatchable way, and then anyone who's worried
> about performance can measure the size of the problem, and try to
> think of a workable solution if he doesn't like the results.

So, what's the worst case for this bug?  "DELETE FROM table; VACUUM table;"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... and PANIC is absolutely, entirely, 100% unacceptable here.  I don't
>> think you understand the context.  We've already written the truncate
>> action to WAL (as we must: WAL before data change).  If we PANIC, that
>> means we'll PANIC again during WAL replay.  So that solution means a
>> down, and perhaps unrecoverably broken, database.

> All right, that would be bad.

Actually ... after some study of the code, I find that I'm holding this
proposal to a higher standard than the current code maintains.
According to our normal rules for applying WAL-loggable data changes,
there should be a critical section wrapping the application of the
data changes with making the WAL entry.  RelationTruncate fails to
do any such thing: it just pushes out the WAL entry and then calls
smgrtruncate, and so any error inside the latter just results in
elog(ERROR).  Whereupon the system state is entirely different from what
WAL says it should be.  So my previous gut feeling that we need to
rethink this whole thing seems justified.
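
For reference, the pattern our other WAL-logged updates follow looks
roughly like this (argument lists abbreviated; this is the rule
RelationTruncate is skipping, not its actual code):

xl_smgr_truncate xlrec;
XLogRecData rdata;

xlrec.blkno = nblocks;
xlrec.rnode = rel->rd_node;

rdata.data = (char *) &xlrec;
rdata.len = sizeof(xlrec);
rdata.buffer = InvalidBuffer;
rdata.next = NULL;

/* WAL first, then the data change, with no soft-error window between:
 * any ERROR between these points must be promoted to PANIC */
START_CRIT_SECTION();
(void) XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata);
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);  /* args abbreviated */
END_CRIT_SECTION();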

I traced through everything that leads to an ftruncate() call in the
backend as of HEAD, and found that we have these cases to worry about:

mdunlink calls ftruncate directly, and does nothing worse than
elog(WARNING) on failure.  This is fine because all it wants to do
is save some disk space until the file gets deleted "for real" at
the next checkpoint.  Failure only means we waste disk space
temporarily, and is surely not cause for panic.

All other calls proceed (sometimes indirectly) from RelationTruncate
or replay of the WAL record it emits.  We have not thought hard
about the order of the various truncations this does and whether or
not we have a self-consistent state if it fails partway through.
If we don't want to make the whole thing a critical section, we need
to figure that out.

RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
from heap_truncate_one_rel (which itself does things in an unsafe order),
and from lazy_truncate_heap in VACUUM.

heap_truncate_one_rel has (indirectly) two call sources:

from ExecuteTruncate for a locally created rel, where we don't care,
and would definitely rather NOT have a panic on error: just rolling back
the transaction is fine thank you very much.

from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
Here again, so far as heaps are concerned, rollback on error would be
plenty since any inserted rows would then just be dead.  The tricky
part is the indexes for such a table.  If we truncate them before
truncating the heap, then the worst possible case is an internally
inconsistent index on a temp table, which will be automatically cleaned
up during the next successful commit in its session.  So it's pretty
hard to justify a PANIC response here either.

So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case.  And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.
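
A sketch of that fallback, assuming overwriting with zeroes can succeed
where ftruncate() did not (argument lists abbreviated, and the final
smgrwrite flag varies by branch):

char        zerobuf[BLCKSZ];
BlockNumber blkno;

MemSet(zerobuf, 0, BLCKSZ);

/* couldn't shrink the file: zero the tail pages in place instead */
for (blkno = nblocks; blkno < curnblocks; blkno++)
    smgrwrite(reln, MAIN_FORKNUM, blkno, zerobuf, false);
smgrimmedsync(reln, MAIN_FORKNUM);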

I'm thinking that we need some sort of what-to-do-on-error flag passed
into RelationTruncate, plus at least order-of-operations fixes in
several other places, if not a wholesale refactoring of this whole call
stack.  But I'm running out of steam and don't have a concrete proposal
to make right now.  In any case, we've got more problems here than just
the original one of forgetting dirty buffers too soon.
        regards, tom lane


Excerpts from Tom Lane's message of Sun Aug 15 22:54:31 -0400 2010:

> I'm thinking that we need some sort of what-to-do-on-error flag passed
> into RelationTruncate, plus at least order-of-operations fixes in
> several other places, if not a wholesale refactoring of this whole call
> stack.  But I'm running out of steam and don't have a concrete proposal
> to make right now.  In any case, we've got more problems here than just
> the original one of forgetting dirty buffers too soon.

I think this fell through the cracks.  What are we going to do here?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Did we make any progress on this?  Is it a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> ... and PANIC is absolutely, entirely, 100% unacceptable here.  I don't
> >> think you understand the context.  We've already written the truncate
> >> action to WAL (as we must: WAL before data change).  If we PANIC, that
> >> means we'll PANIC again during WAL replay.  So that solution means a
> >> down, and perhaps unrecoverably broken, database.
> 
> > All right, that would be bad.
> 
> Actually ... after some study of the code, I find that I'm holding this
> proposal to a higher standard than the current code maintains.
> According to our normal rules for applying WAL-loggable data changes,
> there should be a critical section wrapping the application of the
> data changes with making the WAL entry.  RelationTruncate fails to
> do any such thing: it just pushes out the WAL entry and then calls
> smgrtruncate, and so any error inside the latter just results in
> elog(ERROR).  Whereupon the system state is entirely different from what
> WAL says it should be.  So my previous gut feeling that we need to
> rethink this whole thing seems justified.
> 
> I traced through everything that leads to an ftruncate() call in the
> backend as of HEAD, and found that we have these cases to worry about:
> 
> mdunlink calls ftruncate directly, and does nothing worse than
> elog(WARNING) on failure.  This is fine because all it wants to do
> is save some disk space until the file gets deleted "for real" at
> the next checkpoint.  Failure only means we waste disk space
> temporarily, and is surely not cause for panic.
> 
> All other calls proceed (sometimes indirectly) from RelationTruncate
> or replay of the WAL record it emits.  We have not thought hard
> about the order of the various truncations this does and whether or
> not we have a self-consistent state if it fails partway through.
> If we don't want to make the whole thing a critical section, we need
> to figure that out.
> 
> RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
> from heap_truncate_one_rel (which itself does things in an unsafe order),
> and from lazy_truncate_heap in VACUUM.
> 
> heap_truncate_one_rel has (indirectly) two call sources:
> 
> from ExecuteTruncate for a locally created rel, where we don't care,
> and would definitely rather NOT have a panic on error: just rolling back
> the transaction is fine thank you very much.
> 
> from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
> Here again, so far as heaps are concerned, rollback on error would be
> plenty since any inserted rows would then just be dead.  The tricky
> part is the indexes for such a table.  If we truncate them before
> truncating the heap, then the worst possible case is an internally
> inconsistent index on a temp table, which will be automatically cleaned
> up during the next successful commit in its session.  So it's pretty
> hard to justify a PANIC response here either.
> 
> So it seems like the only case where there is really grounds for PANIC
> on failure is the VACUUM case.  And there we might apply Heikki's idea
> of trying to zero the untruncatable pages first.
> 
> I'm thinking that we need some sort of what-to-do-on-error flag passed
> into RelationTruncate, plus at least order-of-operations fixes in
> several other places, if not a wholesale refactoring of this whole call
> stack.  But I'm running out of steam and don't have a concrete proposal
> to make right now.  In any case, we've got more problems here than just
> the original one of forgetting dirty buffers too soon.
> 
>             regards, tom lane
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Bruce Momjian <bruce@momjian.us> writes:
> Did we make any progress on this?  Is it a TODO?

AFAIR nothing's been done about it, so it's a TODO.
        regards, tom lane


On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Did we make any progress on this?  Is it a TODO?
>
> AFAIR nothing's been done about it, so it's a TODO.

I was thinking of adding it to the 9.1 open items list, but the wiki's
been down every time I've tried to go there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:
> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> Did we make any progress on this?  Is it a TODO?
> >
> > AFAIR nothing's been done about it, so it's a TODO.
> 
> I was thinking of adding it to the 9.1 open items list, but the wiki's
> been down every time I've tried to go there.

It is up now.  :-)

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAIR nothing's been done about it, so it's a TODO.

> I was thinking of adding it to the 9.1 open items list, but the wiki's
> been down every time I've tried to go there.

Since the problem's been there since forever, I don't see that it's an
open item for 9.1.  That list normally is for "must fix before ship"
items, not development projects.
        regards, tom lane


On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> AFAIR nothing's been done about it, so it's a TODO.
>
>> I was thinking of adding it to the 9.1 open items list, but the wiki's
>> been down every time I've tried to go there.
>
> Since the problem's been there since forever, I don't see that it's an
> open item for 9.1.  That list normally is for "must fix before ship"
> items, not development projects.

OK.  If you don't feel it warrants being on that list, then the TODO
is OK with me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:
> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> AFAIR nothing's been done about it, so it's a TODO.
> >
> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
> >> been down every time I've tried to go there.
> >
> > Since the problem's been there since forever, I don't see that it's an
> > open item for 9.1.  That list normally is for "must fix before ship"
> > items, not development projects.
> 
> OK.  If you don't feel it warrants being on that list, then the TODO
> is OK with me.

Agreed.  Do you want me to do it, or will you?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


On Fri, Feb 18, 2011 at 5:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>> AFAIR nothing's been done about it, so it's a TODO.
>> >
>> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
>> >> been down every time I've tried to go there.
>> >
>> > Since the problem's been there since forever, I don't see that it's an
>> > open item for 9.1.  That list normally is for "must fix before ship"
>> > items, not development projects.
>>
>> OK.  If you don't feel it warrants being on that list, then the TODO
>> is OK with me.
>
> Agreed.  Do you want me to do it, or will you?

You do it.  :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:
> On Fri, Feb 18, 2011 at 5:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Robert Haas wrote:
> >> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> > Robert Haas <robertmhaas@gmail.com> writes:
> >> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >>> AFAIR nothing's been done about it, so it's a TODO.
> >> >
> >> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
> >> >> been down every time I've tried to go there.
> >> >
> >> > Since the problem's been there since forever, I don't see that it's an
> >> > open item for 9.1.  That list normally is for "must fix before ship"
> >> > items, not development projects.
> >>
> >> OK.  If you don't feel it warrants being on that list, then the TODO
> >> is OK with me.
> >
> > Agreed.  Do you want me to do it, or will you?
> 
> You do it.  :-)

Done:
	Restructure truncation logic to be more resistant to failure

	    This also involves not writing dirty buffers for a truncated or
	    dropped relation

	    * http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php
 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +