Thread: Truncation failure in autovacuum results in data corruption (duplicate keys)
Hello,

It seems to me that our customer might have hit an unresolved data corruption issue that is already known on this mailing list, but I can't figure out why it happened. I'd appreciate your thoughts. Depending on the severity of this issue and the customer's request, I think I'll submit a patch to solve the unresolved issue.

The customer is using PostgreSQL 9.2 on Windows. Autovacuum failed to truncate the pages at the end of a table, probably because of anti-virus software. FYI, an automatic checkpoint was in progress when this error occurred.

ERROR: could not truncate file "base/aaa/bbb" to 58 blocks: Permission denied

After a while, an application got a unique key violation error. The customer says that there shouldn't be any duplicate keys.

ERROR: duplicate key value violates unique constraint "pk_xxx"

The output of pg_dump on that table certainly includes multiple sets of rows with the same primary key values... data corruption.

Another Japanese user, who is not our customer, hit the same problem on 9.4, and it was not solved (note: the mail is in Japanese). He said he repeatedly encountered the same error even after REINDEXing on 9.4, but it doesn't happen on 9.1. I wonder if something introduced in 9.2 causes the issue, such as the index-only scan work:

https://ml.postgresql.jp/pipermail/pgsql-jp/2016-October/016865.html

The problem with truncation failure was found in 2010. The user reported it as a different phenomenon on 9.0. The problem could not be solved even by leading hackers here -- Tom, Robert, Alvaro, Heikki, Greg Stark, etc.

TODO
https://wiki.postgresql.org/wiki/Todo
----------------------------------------
Restructure truncation logic to be more resistant to failure
This also involves not writing dirty buffers for a truncated or dropped relation
http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php
----------------------------------------

Tom's comments in the above thread
----------------------------------------
Imagine that we have some rows at the end of a table, we delete them, we vacuum before the next checkpoint. Vacuum decides it can now truncate away the last pages, but fails to do so. The original page state is still on disk, which means we have lost the fact of the deletion --- the rows are now effectively live again, though their index entries are probably gone.
...
Still, we have a live issue with heap truncation during plain VACUUM. However, the scope of the problem seems a lot less than I was thinking. Maybe write-the-buffers-first is a sufficient longterm solution.
...
So it seems like the only case where there is really grounds for PANIC on failure is the VACUUM case. And there we might apply Heikki's idea of trying to zero the untruncatable pages first.
...
I'm thinking that we need some sort of what-to-do-on-error flag passed into RelationTruncate, plus at least order-of-operations fixes in several other places, if not a wholesale refactoring of this whole call stack. But I'm running out of steam and don't have a concrete proposal to make right now. In any case, we've got more problems here than just the original one of forgetting dirty buffers too soon.
----------------------------------------

However, I have a question. How does the truncation failure in autovacuum lead to duplicate keys? The failed-to-be-truncated pages should only contain dead tuples, so pg_dump's table scan should ignore the dead tuples in those pages.

Regards
MauMau
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Tom Lane
"MauMau" <maumau307@gmail.com> writes: > However, I have a question. How does the truncation failure in > autovacuum lead to duplicate keys? The failed-to-be-truncated pages > should only contain dead tuples, so pg_dump's table scan should ignore > dead tuples in those pages. [ re-reads thread... ] The extra assumption you need in order to have trouble is that the blocks in question are dirty in shared buffers and have never been written to disk since their rows were deleted. Then the situation is that the page image on disk shows the rows as live, while the up-to-date page image in memory correctly shows them as dead. Relation truncation throws away the page image in memory without ever writing it to disk. Then, if the subsequent file truncate step fails, we have a problem, because anyone who goes looking for that page will fetch it afresh from disk and see the tuples as live. There are WAL entries recording the row deletions, but that doesn't help unless we crash and replay the WAL. It's hard to see a way around this that isn't fairly catastrophic for performance :-(. But in any case it's wrapped up in order-of-operations issues. I've long since forgotten the details, but I seem to have thought that there were additional order-of-operations hazards besides this one. regards, tom lane
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: "MauMau"
From: Tom Lane
> [ re-reads thread... ] The extra assumption you need in order to have
> trouble is that the blocks in question are dirty in shared buffers and
> have never been written to disk since their rows were deleted. Then
> the situation is that the page image on disk shows the rows as live,
> while the up-to-date page image in memory correctly shows them as dead.
>
> Relation truncation throws away the page image in memory without ever
> writing it to disk. Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.

Thank you so much, I got it! I'm always impressed at how quick and concise you are, while you are busy addressing multiple issues and answering user questions. I wouldn't be surprised to hear that there are multiple clones of Tom Lane.

I'd like to continue to think of a solution and create a patch, based on the severity and how the customer responds to our answer. I have a feeling that we have to say it's a bit serious, since it requires recovery from a base backup, not just rebuilding indexes. The patch development may be after PGCon.

Regards
MauMau
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Tom Lane
"MauMau" <maumau307@gmail.com> writes: > I'd like to continue to think of a solution and create a patch, based > on the severity and how the customer will respond to our answer. I > have a feeling that we have to say it's a bit serious, since it > requires recovery from a base backup, not just rebuilding indexes. > The patch development may be after PGCon. It's definitely something we need to work on. It looks to me like the original thread died because we weren't in a part of the cycle where we wanted to work on major patches ... and here we are again :-(. So yeah, targeting a fix for v12 might be the right way to think about it, seeing how seldom the problem crops up. regards, tom lane
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Tom Lane
I wrote:
> Relation truncation throws away the page image in memory without ever
> writing it to disk. Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.
>
> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.
>
> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.

Just to throw out a possibly-crazy idea: maybe we could fix this by PANIC'ing if truncation fails, so that we replay the row deletions from WAL. Obviously this would be intolerable if the case were frequent, but we've had only two such complaints in the last nine years, so maybe it's tolerable. It seems more attractive than taking a large performance hit on truncation speed in normal cases, anyway.

A gotcha to be concerned about is what happens if we replay from WAL, come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation failure again, which is surely not unlikely. PANIC'ing again will not do. I think we could probably handle that by having the replay code path zero out all the pages it was unable to delete; as long as that succeeds, we can call it good and move on.

Or maybe just do that in the mainline case too? That is, if ftruncate fails, handle it by zeroing the undeletable pages and pressing on?

> But in any case it's wrapped up in order-of-operations
> issues. I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

It'd be a good idea to redo that investigation before concluding this issue is fixed, too. I was not thinking at the time that it'd be years before anybody did anything, or I'd have made more notes.

			regards, tom lane
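At the plain-file level, the "zero the undeletable pages and press on" fallback could look roughly like the sketch below. This is not PostgreSQL code -- the function name, block-size constant, and return conventions are invented for illustration -- it only shows the shape of the idea: try ftruncate() first, and if that fails, overwrite the doomed tail with zeros so no stale tuples can ever be read back (a page of all zeros reads as empty).

    /*
     * Sketch only (not PostgreSQL code): truncate if we can, otherwise
     * neutralize the untruncatable tail by overwriting it with zeros.
     */
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define DEMO_BLCKSZ 8192

    static int
    truncate_or_zero_tail(int fd, off_t new_nblocks, off_t old_nblocks)
    {
        if (ftruncate(fd, new_nblocks * DEMO_BLCKSZ) == 0)
            return 0;               /* normal path: file shortened */

        /*
         * Fallback: zero every block we failed to chop off.  A zeroed
         * page cannot resurrect deleted rows.  If even this fails, the
         * caller has nothing left short of PANIC-and-replay-from-WAL.
         */
        char zeros[DEMO_BLCKSZ];

        memset(zeros, 0, sizeof(zeros));
        for (off_t blk = new_nblocks; blk < old_nblocks; blk++)
        {
            if (pwrite(fd, zeros, DEMO_BLCKSZ, blk * DEMO_BLCKSZ) != DEMO_BLCKSZ)
                return -1;          /* caller must escalate */
        }
        return 0;
    }

Of course, as noted later in the thread, if the zeroing writes fail as well (for example because the file has become read-only), PANIC-and-replay-from-WAL is the only option left.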
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Michael Paquier
On Wed, Apr 18, 2018 at 04:49:17PM -0400, Tom Lane wrote:
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL. Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable. It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

It can take some time to go through the whole thread... And that was my first intuition when looking at those things. So one case is where the truncation of the main relation happens first and succeeds. After that, potentially, comes the truncation of index pages which can refer to tuples on the pages that were truncated previously, and then that part fails. This causes index references to be broken, which is what the report of 2010 is originally about.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely. PANIC'ing again will not
> do. I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.

The complaint we are discussing here is Windows antivirus software meddling with PostgreSQL by randomly preventing access to the file to be truncated. Would a PANIC in this one code path be sufficient, though? It seems to me that any error could cause an inconsistency, which could justify the use of a critical section instead to force WAL replay to clean things up.
--
Michael
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Alexander Korotkov
On Wed, Apr 18, 2018 at 11:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Relation truncation throws away the page image in memory without ever
> > writing it to disk. Then, if the subsequent file truncate step fails,
> > we have a problem, because anyone who goes looking for that page will
> > fetch it afresh from disk and see the tuples as live.
> >
> > There are WAL entries recording the row deletions, but that doesn't
> > help unless we crash and replay the WAL.
> >
> > It's hard to see a way around this that isn't fairly catastrophic for
> > performance :-(.
>
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL. Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable. It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

We have only two complaints of data corruption in nine years. But I suspect that in the vast majority of cases the truncation error either didn't cause corruption OR the corruption wasn't noticed. So, once we introduce a PANIC here, we would get way more complaints.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely. PANIC'ing again will not
> do. I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.
>
> Or maybe just do that in the mainline case too? That is, if ftruncate
> fails, handle it by zeroing the undeletable pages and pressing on?

I've just started really digging into this set of problems, but so far this idea looks good to me...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Alexander Korotkov
On Wed, Apr 18, 2018 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ re-reads thread... ] The extra assumption you need in order to have
> trouble is that the blocks in question are dirty in shared buffers and
> have never been written to disk since their rows were deleted. Then
> the situation is that the page image on disk shows the rows as live,
> while the up-to-date page image in memory correctly shows them as dead.
>
> Relation truncation throws away the page image in memory without ever
> writing it to disk. Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.
>
> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.
>
> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(. But in any case it's wrapped up in order-of-operations
> issues. I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

Just for clarification: do you mean that zeroing the to-be-truncated blocks would be catastrophic for performance? Or something else?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
From: Tom Lane
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Wed, Apr 18, 2018 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's hard to see a way around this that isn't fairly catastrophic for
>> performance :-(. But in any case it's wrapped up in order-of-operations
>> issues. I've long since forgotten the details, but I seem to have thought
>> that there were additional order-of-operations hazards besides this one.

> Just for clarification. Do you mean zeroing of to-be-truncated blocks
> to be catastrophic for performance? Or something else?

It would be pretty terrible to have to do that in the normal code path. The other idea that was in my mind was to force out dirty buffers, then discard them, then truncate ... but that's awful too if there are a lot of dirty buffers that we'd have to write only to throw the data away.

I think it's all right to be slow if the truncation fails, though; that does not seem like a path that has to be fast, only correct.

One thing to be concerned about is that as soon as we've discarded any page images from buffers, we have to be in a critical section all the way through till we've either truncated or zeroed those pages on disk. Any failure in that has to result in a PANIC and recover-from-WAL, because we don't know what state we lost by dropping the buffers. Ugh. It's especially bad if the truncation fails because the file got marked read-only, because then the zeroing is also going to fail, making that a guaranteed PANIC case (with no clear path to recovery after the panic, either ...)

I wonder if it could help to do something like zeroing the buffers in memory, then truncating, then discarding buffers. This is just a half-baked idea and I don't have time to think more right now, but maybe making two passes over the shared buffers could lead to a better solution. It's the useless I/O that we need to avoid, IMO.

			regards, tom lane
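A very rough, self-contained toy of that two-pass ordering is sketched below, with "shared buffers" reduced to a small array. The structure and function names are invented for illustration and are not bufmgr APIs. Pass one zeroes (and dirties) the doomed in-memory pages instead of dropping them, then the file is truncated, and only after that succeeds are the buffers discarded; a failed truncate therefore leaves harmless zeroed buffers behind rather than lost deletions.

    /*
     * Toy simulation only (not PostgreSQL code): "shared buffers" is a
     * small array of page images, and the two passes implement
     * zero-then-truncate-then-discard.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define NBUFFERS 16
    #define PAGESZ   8192

    typedef struct
    {
        int     blkno;          /* which block this buffer caches */
        int     valid;          /* is the buffer in use? */
        int     dirty;          /* does it need to be written out? */
        char    page[PAGESZ];
    } ToyBuffer;

    static ToyBuffer buffers[NBUFFERS];

    static int
    toy_truncate(int fd, int new_nblocks)
    {
        int     i;

        /*
         * Pass 1: zero, but do not drop, every cached page past the new
         * end, and mark it dirty so the zeros eventually reach disk if
         * the truncate below fails.  Either way, no buffer can expose or
         * write back stale "live" rows.
         */
        for (i = 0; i < NBUFFERS; i++)
            if (buffers[i].valid && buffers[i].blkno >= new_nblocks)
            {
                memset(buffers[i].page, 0, PAGESZ);
                buffers[i].dirty = 1;
            }

        /*
         * Shorten the file.  If this fails, the zeroed buffers stay put
         * and will be flushed over the stale tail later.
         */
        if (ftruncate(fd, (off_t) new_nblocks * PAGESZ) != 0)
        {
            perror("ftruncate");
            return -1;
        }

        /* Pass 2: only after a successful truncate discard the buffers. */
        for (i = 0; i < NBUFFERS; i++)
            if (buffers[i].valid && buffers[i].blkno >= new_nblocks)
                buffers[i].valid = 0;

        return 0;
    }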