Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date
Msg-id CA+hUKG+V39zO6E9kzg=SP0S8jGYy6AYY9xWCkGLqKxfUJuuFeg@mail.gmail.com
Whole thread Raw
In response to Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Thomas Munro <thomas.munro@gmail.com>)
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Michael Paquier <michael@paquier.xyz>)
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote:
> >> Data corruption like this is not necessarily caused by a PostgreSQL bug.
>
> > Err, well...  A failure on the end-of-vacuum truncation should not
> > lead to corruption afterwards as well, and this ought to be safe even
> > if this step failed.  This is a very tricky problem that nobody has
> > really looked into yet.
>
> ISTM we did identify the problem: while all the tuples in the
> pages-to-be-truncated should be dead and thus invisible, it may
> be that some of those pages are dirty and haven't been written
> out of shared buffers yet, and the page versions on disk contain
> tuples that look live.  If VACUUM discards those dirty buffers
> and then fails to truncate, voila you have tuples rising from
> the dead.
>
> I'm too lazy to check the commit log right now, but I think
> we did implement a fix for that (ie, flush dirty pages even
> if we anticipate them going away due to truncation).  But as
> Laurenz says, v10 is out of support and possibly didn't get
> that fix.  Even if it did, you'd need to be running one of
> the last minor releases, because this wasn't very long ago.

This thread seems to be saying otherwise:

https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

> In the end though, the *real* problem here is running on a
> platform that randomly disallows writes to disk.  There's only
> so much that Postgres can possibly do about unreliability of the
> underlying platform.  I would never run a production database on
> Windows, because it's just too prone to that sort of BS.

It's surprising that ftruncate() AKA chsize() is able to fail like
this (I am not a Windows user but AFAIR that sharing stuff obstructs
stuff like open, unlink, rename, so it surprises me to see it come up
with ftruncate, since we must already have made it past the open
stage).  Hmm, the documentation is scant, but I know from my attempts
to use large files that chsize() is probably some kind of wrapper
around SetEndOfFile() or similar, and that is documented as failing if
someone has the file mapped.  I don't know why someone would have the
file mapped, though.

But as for what we should do about it, PANIC (as suggested by several
people) seems better than corruption, if we're not going to write some
kind of resilience?  How else are we supposed to deal with "this
shouldn't happen, and if it does we're hosed?"



pgsql-bugs by date:

Previous
From: "Given, Robert A"
Date:
Subject: REFRESH MATERIALIZED VIEW error
Next
From: Thomas Munro
Date:
Subject: Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows