Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Msg-id ZR3qvrYULJWaUnBK@paquier.xyz
In response to Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote:
> On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm too lazy to check the commit log right now, but I think
>> we did implement a fix for that (ie, flush dirty pages even
>> if we anticipate them going away due to truncation).  But as
>> Laurenz says, v10 is out of support and possibly didn't get
>> that fix.  Even if it did, you'd need to be running one of
>> the last minor releases, because this wasn't very long ago.
>
> This thread seems to be saying otherwise:
>
> https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

Yeah, that has not been solved.  I spent a few hours on this problem
last May, just before PGCon.  The argument against marking all the
pages as dirty was that it wastes resources: it would generate WAL for
data that is about to be removed by the truncate anyway, causing an
extra burst of I/O, especially for large truncations.  FWIW, I'd
personally be OK with using this method.  At least it is safe, simple,
and backpatchable, and it does not require any new magic.  I know that
some people argued against this method, but here we are N years
later, so perhaps we should just do that, on HEAD at least.
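
To illustrate the hazard, here is a toy model and not PostgreSQL code.
The class ToyRelation and the function run() are made up for this
sketch.  It assumes a relation whose last page had its rows deleted in
memory only, then simulates a truncation that fails after the buffers
were discarded, with and without writing the dirty pages out first.

```python
class ToyRelation:
    """Hypothetical model: 'disk' is durable, 'buffers' holds dirty pages."""

    def __init__(self, pages):
        self.disk = list(pages)   # durable page contents
        self.buffers = {}         # page number -> dirty in-memory contents

    def delete_rows(self, page_no):
        # The deletion only reaches memory; disk still has the old rows.
        self.buffers[page_no] = "empty"

    def read(self, page_no):
        # Prefer the in-memory copy, fall back to what is on disk.
        return self.buffers.get(page_no, self.disk[page_no])

    def flush(self):
        # Write every dirty page to disk.
        for page_no, contents in self.buffers.items():
            self.disk[page_no] = contents
        self.buffers.clear()

    def truncate(self, new_len, flush_first, truncate_fails):
        if flush_first:
            self.flush()
        # Discard buffers for the pages that are about to go away.
        for page_no in [p for p in self.buffers if p >= new_len]:
            del self.buffers[page_no]
        if truncate_fails:
            raise OSError("simulated ftruncate failure")
        del self.disk[new_len:]


def run(flush_first):
    rel = ToyRelation(["live rows", "live rows"])
    rel.delete_rows(1)
    try:
        rel.truncate(1, flush_first=flush_first, truncate_fails=True)
    except OSError:
        pass  # the failure is reported, but the backend carries on
    return rel.read(1)
```

In this model run(False) reads the old rows back from page 1, while
run(True) reads an empty page, at the cost of an extra write for data
that a successful truncate would have thrown away.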

>> In the end though, the *real* problem here is running on a
>> platform that randomly disallows writes to disk.  There's only
>> so much that Postgres can possibly do about unreliability of the
>> underlying platform.  I would never run a production database on
>> Windows, because it's just too prone to that sort of BS.
>
> It's surprising that ftruncate() AKA chsize() is able to fail like
> this (I am not a Windows user but AFAIR that sharing stuff obstructs
> stuff like open, unlink, rename, so it surprises me to see it come up
> with ftruncate, since we must already have made it past the open
> stage).  Hmm, the documentation is scant, but I know from my attempts
> to use large files that chsize() is probably some kind of wrapper
> around SetEndOfFile() or similar, and that is documented as failing if
> someone has the file mapped.  I don't know why someone would have the
> file mapped, though.

(shrug)

> But as for what we should do about it, PANIC (as suggested by several
> people) seems better than corruption, if we're not going to write some
> kind of resilience?  How else are we supposed to deal with "this
> shouldn't happen, and if it does we're hosed?"

A PANIC may be OK for this specific syscall and would be better, but
the problematic area is larger than that: we'd still end up with
corruption whenever an ERROR or a FATAL happens between the moment the
buffers are discarded (they may be dirty, holding tuples that are dead
in memory but still live on disk) and the moment the file is actually
truncated.  Another method that was discussed is the use of a critical
section (I recall that there were some pallocs in this area, actually,
but I have nothing in my notes about that...).
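
As a rough sketch of the critical-section idea, again a toy model and
not PostgreSQL code: any error raised between the buffer discard and
the truncation gets promoted to a PANIC, so the window cannot be left
half-done.  The names Panic, critical_section and
truncate_with_critical_section are hypothetical, and the two callbacks
stand in for the real discard and truncate steps.

```python
import contextlib


class Panic(Exception):
    """Stands in for a PANIC: stop everything and recover from WAL."""


@contextlib.contextmanager
def critical_section():
    # Any ordinary error inside the block is escalated instead of
    # being handled as a recoverable ERROR.
    try:
        yield
    except Exception as exc:
        raise Panic(f"error inside critical section: {exc}") from exc


def truncate_with_critical_section(discard_buffers, truncate_file):
    # Both steps succeed together, or the failure becomes a Panic.
    with critical_section():
        discard_buffers()
        truncate_file()
```

This covers a failure at any point in the window, not only in the
truncate call itself, which is the difference from a PANIC attached to
that one syscall.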
--
Michael
