Thread: Truncation failure in autovacuum results in data corruption (duplicate keys)

Hello,

It seems that one of our customers has hit an unresolved data
corruption issue that is already known on this mailing list, but I
can't figure out why it happened.  I'd appreciate it if you could give
me your thoughts.  Depending on the severity of the issue and the
customer's request, I plan to submit a patch to solve it.

The customer is using PostgreSQL 9.2 on Windows.  Autovacuum failed to
truncate the pages at the end of a table, which is probably due to
anti-virus software.  FYI, an automatic checkpoint was in progress
when this error occurred.

ERROR:  could not truncate file "base/aaa/bbb" to 58 blocks:
Permission denied

After a while, an application got a unique key violation error.  The
customer says that there shouldn't be any duplicate keys.

ERROR:  duplicate key value violates unique constraint "pk_xxx"

The output of pg_dump on that table indeed includes multiple pairs of
rows with the same primary key values... data corruption.



Another Japanese user, who is not our customer, hit the same problem
with 9.4; it was never resolved (note: the mail is in Japanese).  He
said he repeatedly encountered the same error even after REINDEXing on
9.4, but that it does not happen on 9.1.  I wonder if something
introduced in 9.2, such as the index-only scan machinery, causes the
issue:

https://ml.postgresql.jp/pipermail/pgsql-jp/2016-October/016865.html


The problem with truncation failure was first identified in 2010, when
a user reported it as a different symptom on 9.0.  It was not resolved
even with input from leading hackers here -- Tom, Robert, Alvaro,
Heikki, Greg Stark, and others.

TODO
https://wiki.postgresql.org/wiki/Todo
----------------------------------------
Restructure truncation logic to be more resistant to failure
This also involves not writing dirty buffers for a truncated or
dropped relation
http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php
----------------------------------------


Tom's comments in above thread
----------------------------------------
Imagine that we have some rows at the end of a table, we delete them,
we vacuum before the next checkpoint.  Vacuum decides it can now
truncate away the last pages, but fails to do so.  The original page
state is still on disk, which means we have lost the fact of the
deletion --- the rows are now effectively live again, though their
index entries are probably gone.

...
Still, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was
thinking.
Maybe write-the-buffers-first is a sufficient longterm solution.

...
So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case.  And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.

...
I'm thinking that we need some sort of what-to-do-on-error flag passed
into RelationTruncate, plus at least order-of-operations fixes in
several other places, if not a wholesale refactoring of this whole call
stack.  But I'm running out of steam and don't have a concrete proposal
to make right now.  In any case, we've got more problems here than just
the original one of forgetting dirty buffers too soon.
----------------------------------------


However, I have a question.  How does the truncation failure in
autovacuum lead to duplicate keys?  The failed-to-be-truncated pages
should only contain dead tuples, so pg_dump's table scan should ignore
dead tuples in those pages.

Regards
MauMau





"MauMau" <maumau307@gmail.com> writes:
> However, I have a question.  How does the truncation failure in
> autovacuum lead to duplicate keys?  The failed-to-be-truncated pages
> should only contain dead tuples, so pg_dump's table scan should ignore
> dead tuples in those pages.

[ re-reads thread... ]  The extra assumption you need in order to have
trouble is that the blocks in question are dirty in shared buffers and
have never been written to disk since their rows were deleted.  Then
the situation is that the page image on disk shows the rows as live,
while the up-to-date page image in memory correctly shows them as dead.
Relation truncation throws away the page image in memory without ever
writing it to disk.  Then, if the subsequent file truncate step fails,
we have a problem, because anyone who goes looking for that page will
fetch it afresh from disk and see the tuples as live.

There are WAL entries recording the row deletions, but that doesn't
help unless we crash and replay the WAL.
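
To make that failure window concrete, here is a simplified C-style
sketch of the ordering described above, standing in for the
buffer-discard and smgr-level truncate steps.  The function and helper
names are illustrative only, not the actual storage.c code:

/*
 * Simplified sketch of the hazardous ordering (illustrative names,
 * not the real RelationTruncate implementation).
 */
void
sketch_relation_truncate(Relation rel, BlockNumber nblocks)
{
    /*
     * Step 1: throw away the buffered page images beyond 'nblocks'.
     * Any dirty versions of those pages -- e.g. the ones showing the
     * rows as deleted -- are discarded here without being written out.
     */
    discard_buffers_beyond(rel, nblocks);    /* hypothetical helper */

    /*
     * Step 2: shrink the file.  If this ftruncate()-based step fails
     * (say, blocked by antivirus software on Windows), the stale page
     * images stay on disk and still show the deleted rows as live,
     * while the corrected in-memory copies are already gone.
     */
    truncate_relation_file(rel, nblocks);    /* hypothetical; may ERROR */
}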

It's hard to see a way around this that isn't fairly catastrophic for
performance :-(.  But in any case it's wrapped up in order-of-operations
issues.  I've long since forgotten the details, but I seem to have thought
that there were additional order-of-operations hazards besides this one.

            regards, tom lane


From: Tom Lane
[ re-reads thread... ]  The extra assumption you need in order to have
trouble is that the blocks in question are dirty in shared buffers and
have never been written to disk since their rows were deleted.  Then
the situation is that the page image on disk shows the rows as live,
while the up-to-date page image in memory correctly shows them as dead.
Relation truncation throws away the page image in memory without ever
writing it to disk.  Then, if the subsequent file truncate step fails,
we have a problem, because anyone who goes looking for that page will
fetch it afresh from disk and see the tuples as live.


Thank you so much, I got it!  I'm always impressed by how quick and
concise you are, even while you are busy addressing multiple issues
and answering user questions.  I wouldn't be surprised to hear that
there are multiple clones of Tom Lane.

I'd like to continue to think of a solution and create a patch, based
on the severity and how the customer will respond to our answer.  I
have a feeling that we have to say it's a bit serious, since it
requires recovery from a base backup, not just rebuilding indexes.
The patch development may be after PGCon.

Regards
MauMau





"MauMau" <maumau307@gmail.com> writes:
> I'd like to continue to think of a solution and create a patch, based
> on the severity and how the customer will respond to our answer.  I
> have a feeling that we have to say it's a bit serious, since it
> requires recovery from a base backup, not just rebuilding indexes.
> The patch development may be after PGCon.

It's definitely something we need to work on.  It looks to me like
the original thread died because we weren't in a part of the cycle
where we wanted to work on major patches ... and here we are again :-(.
So yeah, targeting a fix for v12 might be the right way to think
about it, seeing how seldom the problem crops up.

            regards, tom lane


I wrote:
> Relation truncation throws away the page image in memory without ever
> writing it to disk.  Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.

> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.

> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.

Just to throw out a possibly-crazy idea: maybe we could fix this by
PANIC'ing if truncation fails, so that we replay the row deletions from
WAL.  Obviously this would be intolerable if the case were frequent,
but we've had only two such complaints in the last nine years, so maybe
it's tolerable.  It seems more attractive than taking a large performance
hit on truncation speed in normal cases, anyway.

A gotcha to be concerned about is what happens if we replay from WAL,
come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
failure again, which is surely not unlikely.  PANIC'ing again will not
do.  I think we could probably handle that by having the replay code
path zero out all the pages it was unable to delete; as long as that
succeeds, we can call it good and move on.

Or maybe just do that in the mainline case too?  That is, if ftruncate
fails, handle it by zeroing the undeletable pages and pressing on?
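
As a rough illustration of that fallback -- a sketch only, with
hypothetical helper names and the error handling condensed:

/*
 * Sketch of the "zero the undeletable pages" fallback.  Helper names
 * are hypothetical; this is not a concrete patch.
 */
static void
sketch_truncate_or_zero(SMgrRelation reln, BlockNumber new_nblocks,
                        BlockNumber old_nblocks)
{
    if (!try_truncate_file(reln, new_nblocks))   /* hypothetical wrapper */
    {
        BlockNumber blk;

        /*
         * Truncation failed (e.g. EACCES).  Overwrite every block past
         * the new end with zeroes instead; an all-zeros page is treated
         * as empty, so no deleted tuples can reappear from those blocks.
         */
        for (blk = new_nblocks; blk < old_nblocks; blk++)
            write_zero_page(reln, blk);          /* hypothetical */
    }
}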

> But in any case it's wrapped up in order-of-operations
> issues.  I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

It'd be a good idea to redo that investigation before concluding this
issue is fixed, too.  I was not thinking at the time that it'd be years
before anybody did anything, or I'd have made more notes.

            regards, tom lane


From: Michael Paquier
On Wed, Apr 18, 2018 at 04:49:17PM -0400, Tom Lane wrote:
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL.  Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable.  It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

It can take some time to go through the whole thread...

And that was my first intuition when looking at those things.

One case is where the truncation of the main relation happens first
and succeeds.  After that potentially comes the truncation of index
pages, which can still refer to tuples on the heap pages truncated
previously, and that part then fails.  This leaves broken index
references, which is what the 2010 report was originally about.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely.  PANIC'ing again will not
> do.  I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.

The complaint we are discussing here is Windows antivirus software
meddling with PostgreSQL by randomly preventing access to the file to
be truncated.  Would a PANIC in this single code path be sufficient,
though?  It seems to me that any error here could cause an
inconsistency, which could justify using a critical section instead,
so as to force WAL replay to clean things up.
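
For context, inside a critical section any ERROR is promoted to PANIC,
which forces crash recovery and WAL replay.  A minimal sketch of what
that would look like here -- the two calls inside are simplified
placeholders, not the actual truncation code path:

/*
 * Wrap the unsafe window in a critical section so that any failure
 * escalates to PANIC and is repaired by WAL replay.  The two calls
 * are hypothetical placeholders.
 */
START_CRIT_SECTION();

discard_buffers_beyond(rel, nblocks);    /* placeholder */
truncate_relation_file(rel, nblocks);    /* placeholder; ERROR => PANIC */

END_CRIT_SECTION();
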
--
Michael


From: Alexander Korotkov
On Wed, Apr 18, 2018 at 11:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Relation truncation throws away the page image in memory without ever
> > writing it to disk.  Then, if the subsequent file truncate step fails,
> > we have a problem, because anyone who goes looking for that page will
> > fetch it afresh from disk and see the tuples as live.
>
> > There are WAL entries recording the row deletions, but that doesn't
> > help unless we crash and replay the WAL.
>
> > It's hard to see a way around this that isn't fairly catastrophic for
> > performance :-(.
>
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL.  Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable.  It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

We have only two complaints of data corruption in nine years.  But I
suspect that in the vast majority of cases the truncation error either
didn't cause corruption or the corruption wasn't noticed.  So, once we
introduce a PANIC here, we would get many more complaints.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely.  PANIC'ing again will not
> do.  I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.
>
> Or maybe just do that in the mainline case too?  That is, if ftruncate
> fails, handle it by zeroing the undeletable pages and pressing on?

I've just started really digging into this set of problems, but so far
this idea looks good to me...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Alexander Korotkov
On Wed, Apr 18, 2018 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ re-reads thread... ]  The extra assumption you need in order to have
> trouble is that the blocks in question are dirty in shared buffers and
> have never been written to disk since their rows were deleted.  Then
> the situation is that the page image on disk shows the rows as live,
> while the up-to-date page image in memory correctly shows them as dead.
> Relation truncation throws away the page image in memory without ever
> writing it to disk.  Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.
>
> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.
>
> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.  But in any case it's wrapped up in order-of-operations
> issues.  I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

Just for clarification.  Do you mean zeroing of to-be-truncated blocks
to be catastrophic for performance?  Or something else?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Wed, Apr 18, 2018 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's hard to see a way around this that isn't fairly catastrophic for
>> performance :-(.  But in any case it's wrapped up in order-of-operations
>> issues.  I've long since forgotten the details, but I seem to have thought
>> that there were additional order-of-operations hazards besides this one.

> Just for clarification.  Do you mean zeroing of to-be-truncated blocks
> to be catastrophic for performance?  Or something else?

It would be pretty terrible to have to do that in the normal code path.
The other idea that was in my mind was to force out dirty buffers, then
discard them, then truncate ... but that's awful too if there's a lot
of dirty buffers that we'd have to write only to throw the data away.

I think it's all right to be slow if the truncation fails, though; that
does not seem like a path that has to be fast, only correct.

One thing to be concerned about is that as soon as we've discarded any
page images from buffers, we have to be in a critical section all the way
through till we've either truncated or zeroed those pages on-disk.  Any
failure in that has to result in a PANIC and recover-from-WAL, because
we don't know what state we lost by dropping the buffers.  Ugh.  It's
especially bad if the truncation fails because the file got marked
read-only, because then the zeroing is also going to fail, making that
a guaranteed PANIC case (with no clear path to recovery after the
panic, either ...)

I wonder if it could help to do something like zeroing the buffers in
memory, then truncating, then discarding buffers.  This is just a
half-baked idea and I don't have time to think more right now, but maybe
making two passes over the shared buffers could lead to a better solution.
It's the useless I/O that we need to avoid, IMO.
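
A rough sketch of that two-pass ordering, as I understand the
half-baked idea -- all helper names are hypothetical:

/*
 * Two-pass idea: neutralize the in-memory pages first, so that
 * whatever survives a failed truncation is harmless.
 */
static void
sketch_two_pass_truncate(Relation rel, BlockNumber new_nblocks,
                         BlockNumber old_nblocks)
{
    BlockNumber blk;

    /* Pass 1: zero the to-be-truncated pages in shared buffers only
     * (no disk I/O yet); the buffers stay dirty. */
    for (blk = new_nblocks; blk < old_nblocks; blk++)
        zero_buffer_in_memory(rel, blk);      /* hypothetical */

    /* Shrink the file.  If this fails with ERROR, the zeroed buffers
     * remain and will eventually be flushed, so the on-disk images can
     * no longer show the deleted tuples as live. */
    truncate_relation_file(rel, new_nblocks); /* hypothetical */

    /* Pass 2: truncation succeeded, so now discard the buffers for the
     * removed blocks. */
    discard_buffers_beyond(rel, new_nblocks); /* hypothetical */
}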

            regards, tom lane