Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows - Mailing list pgsql-bugs
From | Thomas Munro |
---|---|
Subject | Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows |
Date | |
Msg-id | CA+hUKG+V39zO6E9kzg=SP0S8jGYy6AYY9xWCkGLqKxfUJuuFeg@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
(Thomas Munro <thomas.munro@gmail.com>)
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows (Michael Paquier <michael@paquier.xyz>) Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael@paquier.xyz> writes: > > On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote: > >> Data corruption like this is not necessarily caused by a PostgreSQL bug. > > > Err, well... A failure on the end-of-vacuum truncation should not > > lead to corruption afterwards as well, and this ought to be safe even > > if this step failed. This is a very tricky problem that nobody has > > really looked into yet. > > ISTM we did identify the problem: while all the tuples in the > pages-to-be-truncated should be dead and thus invisible, it may > be that some of those pages are dirty and haven't been written > out of shared buffers yet, and the page versions on disk contain > tuples that look live. If VACUUM discards those dirty buffers > and then fails to truncate, voila you have tuples rising from > the dead. > > I'm too lazy to check the commit log right now, but I think > we did implement a fix for that (ie, flush dirty pages even > if we anticipate them going away due to truncation). But as > Laurenz says, v10 is out of support and possibly didn't get > that fix. Even if it did, you'd need to be running one of > the last minor releases, because this wasn't very long ago. This thread seems to be saying otherwise: https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us > In the end though, the *real* problem here is running on a > platform that randomly disallows writes to disk. There's only > so much that Postgres can possibly do about unreliability of the > underlying platform. I would never run a production database on > Windows, because it's just too prone to that sort of BS. It's surprising that ftruncate() AKA chsize() is able to fail like this (I am not a Windows user but AFAIR that sharing stuff obstructs stuff like open, unlink, rename, so it surprises me to see it come up with ftruncate, since we must already have made it past the open stage). Hmm, the documentation is scant, but I know from my attempts to use large files that chsize() is probably some kind of wrapper around SetEndOfFile() or similar, and that is documented as failing if someone has the file mapped. I don't know why someone would have the file mapped, though. But as for what we should do about it, PANIC (as suggested by several people) seems better than corruption, if we're not going to write some kind of resilience? How else are we supposed to deal with "this shouldn't happen, and if it does we're hosed?"
pgsql-bugs by date: