Thread: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18146
Logged by:          Root Cause
Email address:      rootcause000@gmail.com
PostgreSQL version: Unsupported/Unknown
Operating system:   Windows
Description:

Version: PostgreSQL 10.21, compiled by Visual C++ build 1800, 64-bit

Platform: Windows

In our application code, we've implemented logic to clean up entries in
several PostgreSQL tables using a loop. Although some of these tables have
foreign key references, we've included them to ensure a thorough deletion
process. Here's a simplified code snippet:

String tables[] = {"TableA", "TableB", "TableC", "TableD", "TableE",
"TableF", "TableG", "TableH"};

for (String tableName : tables) {
    try {
        stmt = conn.prepareStatement("DELETE FROM " + tableName);
        stmt.execute();
    } catch (Exception ex) {
        // Log the exception
    }
}
Notably, it leaves entries in different tables each time, and we've found
auto-vacuum failure traces in the pg_log for these tables at the same time.
Here's an example of the error messages from pg_log:

2023-09-28 12:03:35.955 IST,,,11592,,65151e2d.2d48,1,,2023-09-28 12:03:17
IST,6/19,122786262,ERROR,42501,"could not truncate file
""base/16509/6935026"" to 0 blocks: Permission denied",,,,,"automatic vacuum
of table ""DB.public.TableB""",,,,"" 2023-09-28 12:03:37.205
IST,,,11592,,65151e2d.2d48,2,,2023-09-28 12:03:17
IST,6/45,122786326,ERROR,42501,"could not truncate file
""base/16509/6935443"" to 0 blocks: Permission denied",,,,,"automatic vacuum
of table ""DB.public.TableB""",,,,""

We've already attempted to exclude the entire pg folder from antivirus
scans, but the problem persists. Any insights or solutions to this issue
would be greatly appreciated. Thank you!


On Tue, 2023-10-03 at 17:25 +0000, PG Bug reporting form wrote:
> Version: PostgreSQL 10.21, compiled by Visual C++ build 1800, 64-bit
>
> Platform: Windows
>
> In our application code, we've implemented logic to clean up entries in
> several PostgreSQL tables using a loop. Although some of these tables have
> foreign key references, we've included them to ensure a thorough deletion
> process. Here's a simplified code snippet:
>
> String tables[] = {"TableA", "TableB", "TableC", "TableD", "TableE",
> "TableF", "TableG", "TableH"};
>
> for (String tableName : tables) {
>     try {
>         stmt = conn.prepareStatement("DELETE FROM " + tableName);
>         stmt.execute();
>     } catch (Exception ex) {
>         // Log the exception
>     }
> }
> Notably, it leaves entries in different tables each time, and we've found
> auto-vacuum failure traces in the pg_log for these tables at the same time.
> Here's an example of the error messages from pg_log:
>
> 2023-09-28 12:03:35.955 IST,,,11592,,65151e2d.2d48,1,,2023-09-28 12:03:17
> IST,6/19,122786262,ERROR,42501,"could not truncate file
> ""base/16509/6935026"" to 0 blocks: Permission denied",,,,,"automatic vacuum
> of table ""DB.public.TableB""",,,,"" 2023-09-28 12:03:37.205
> IST,,,11592,,65151e2d.2d48,2,,2023-09-28 12:03:17
> IST,6/45,122786326,ERROR,42501,"could not truncate file
> ""base/16509/6935443"" to 0 blocks: Permission denied",,,,,"automatic vacuum
> of table ""DB.public.TableB""",,,,""
>
> We've already attempted to exclude the entire pg folder from antivirus
> scans, but the problem persists. Any insights or solutions to this issue
> would be greatly appreciated. Thank you!

PostgreSQL v10 is out of support.

Data corruption like this is not necessarily caused by a PostgreSQL bug.

Yours,
Laurenz Albe



On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote:
> Data corruption like this is not necessarily caused by a PostgreSQL bug.

Err, well...  A failure on the end-of-vacuum truncation should not
lead to corruption afterwards as well, and this ought to be safe even
if this step failed.  This is a very tricky problem that nobody has
really looked into yet.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote:
>> Data corruption like this is not necessarily caused by a PostgreSQL bug.

> Err, well...  A failure on the end-of-vacuum truncation should not
> lead to corruption afterwards as well, and this ought to be safe even
> if this step failed.  This is a very tricky problem that nobody has
> really looked into yet.

ISTM we did identify the problem: while all the tuples in the
pages-to-be-truncated should be dead and thus invisible, it may
be that some of those pages are dirty and haven't been written
out of shared buffers yet, and the page versions on disk contain
tuples that look live.  If VACUUM discards those dirty buffers
and then fails to truncate, voila you have tuples rising from
the dead.

I'm too lazy to check the commit log right now, but I think
we did implement a fix for that (ie, flush dirty pages even
if we anticipate them going away due to truncation).  But as
Laurenz says, v10 is out of support and possibly didn't get
that fix.  Even if it did, you'd need to be running one of
the last minor releases, because this wasn't very long ago.

In the end though, the *real* problem here is running on a
platform that randomly disallows writes to disk.  There's only
so much that Postgres can possibly do about unreliability of the
underlying platform.  I would never run a production database on
Windows, because it's just too prone to that sort of BS.

            regards, tom lane



On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote:
> >> Data corruption like this is not necessarily caused by a PostgreSQL bug.
>
> > Err, well...  A failure on the end-of-vacuum truncation should not
> > lead to corruption afterwards as well, and this ought to be safe even
> > if this step failed.  This is a very tricky problem that nobody has
> > really looked into yet.
>
> ISTM we did identify the problem: while all the tuples in the
> pages-to-be-truncated should be dead and thus invisible, it may
> be that some of those pages are dirty and haven't been written
> out of shared buffers yet, and the page versions on disk contain
> tuples that look live.  If VACUUM discards those dirty buffers
> and then fails to truncate, voila you have tuples rising from
> the dead.
>
> I'm too lazy to check the commit log right now, but I think
> we did implement a fix for that (ie, flush dirty pages even
> if we anticipate them going away due to truncation).  But as
> Laurenz says, v10 is out of support and possibly didn't get
> that fix.  Even if it did, you'd need to be running one of
> the last minor releases, because this wasn't very long ago.

This thread seems to be saying otherwise:

https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

> In the end though, the *real* problem here is running on a
> platform that randomly disallows writes to disk.  There's only
> so much that Postgres can possibly do about unreliability of the
> underlying platform.  I would never run a production database on
> Windows, because it's just too prone to that sort of BS.

It's surprising that ftruncate() AKA chsize() is able to fail like
this (I am not a Windows user but AFAIR that sharing stuff obstructs
stuff like open, unlink, rename, so it surprises me to see it come up
with ftruncate, since we must already have made it past the open
stage).  Hmm, the documentation is scant, but I know from my attempts
to use large files that chsize() is probably some kind of wrapper
around SetEndOfFile() or similar, and that is documented as failing if
someone has the file mapped.  I don't know why someone would have the
file mapped, though.

But as for what we should do about it, PANIC (as suggested by several
people) seems better than corruption, if we're not going to write some
kind of resilience?  How else are we supposed to deal with "this
shouldn't happen, and if it does we're hosed?"



On Thu, Oct 5, 2023 at 10:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> It's surprising that ftruncate() AKA chsize() is able to fail like
> this (I am not a Windows user but AFAIR that sharing stuff obstructs
> stuff like open, unlink, rename, so it surprises me to see it come up
> with ftruncate, since we must already have made it past the open
> stage).  Hmm, the documentation is scant, but I know from my attempts
> to use large files that chsize() is probably some kind of wrapper
> around SetEndOfFile() or similar, and that is documented as failing if
> someone has the file mapped.  I don't know why someone would have the
> file mapped, though.

Some more thoughts: I guess it would probably also fail like that if
someone explicitly locked a range with LockFile(), but I think we can
rule that out as read and/or write calls would also fail.  As for the
mapping theory, apparently the underlying NT error for that is
ERROR_USER_MAPPED_FILE, and searching for that brings up various
unexplained errors vaguely blamed on anti-virus tools etc.  But if all
of that sort of thing really is turned off for the data directory, I
wonder if there could be a backup tool in use that thinks it can go
faster by mapping files while copying?



On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote:
> On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm too lazy to check the commit log right now, but I think
>> we did implement a fix for that (ie, flush dirty pages even
>> if we anticipate them going away due to truncation).  But as
>> Laurenz says, v10 is out of support and possibly didn't get
>> that fix.  Even if it did, you'd need to be running one of
>> the last minor releases, because this wasn't very long ago.
>
> This thread seems to be saying otherwise:
>
> https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

Yeah, that's not been solved.  I've put my mind on this problem for a
few hours last May, just before PGCon, and there was an argument that
marking all the pages as dirty is kind of a waste of resources as it
would create WAL for data that's going to be gone a little bit later
as an effect of the truncate, leading to an extra burst of I/O
especially for large truncatoins.  FWIW, I think that I'd be
personally OK with using this method.  At least that's safe, simple,
backpatchable and it does not require any new magic.  I know that
there are voices that argued against this method, but here we are N
years later, so perhaps we should just do that on HEAD at least.

>> In the end though, the *real* problem here is running on a
>> platform that randomly disallows writes to disk.  There's only
>> so much that Postgres can possibly do about unreliability of the
>> underlying platform.  I would never run a production database on
>> Windows, because it's just too prone to that sort of BS.
>
> It's surprising that ftruncate() AKA chsize() is able to fail like
> this (I am not a Windows user but AFAIR that sharing stuff obstructs
> stuff like open, unlink, rename, so it surprises me to see it come up
> with ftruncate, since we must already have made it past the open
> stage).  Hmm, the documentation is scant, but I know from my attempts
> to use large files that chsize() is probably some kind of wrapper
> around SetEndOfFile() or similar, and that is documented as failing if
> someone has the file mapped.  I don't know why someone would have the
> file mapped, though.

(shrug)

> But as for what we should do about it, PANIC (as suggested by several
> people) seems better than corruption, if we're not going to write some
> kind of resilience?  How else are we supposed to deal with "this
> shouldn't happen, and if it does we're hosed?"

A PANIC may be OK for this specific syscall and would be better, but
the problematic area is larger than that as we'd still finish with a
corruption as long as there's an ERROR or a FATAL between the moment
the buffers (potentially dirty, with live-still-dead-in-memory tuples
on disk) are discarded and the moment the truncation fails.  Another
method discussed is the use of a critical section (I recall that there
were some pallocs in this area, actually, but got nothing on my notes
about that...).
--
Michael

Attachment
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm too lazy to check the commit log right now, but I think
>> we did implement a fix for that (ie, flush dirty pages even
>> if we anticipate them going away due to truncation).

> This thread seems to be saying otherwise:
> https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

Hmph.  OK, I was remembering the discussion not the (lack of)
end result.

> But as for what we should do about it, PANIC (as suggested by several
> people) seems better than corruption, if we're not going to write some
> kind of resilience?

Maybe that's an acceptable answer now ... it's not great, but nobody
is in love with any of the other options either.  And it would definitely
get DBAs' attention about this misbehavior of their file systems.

            regards, tom lane



On Thu, Oct 5, 2023 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote:
> > But as for what we should do about it, PANIC (as suggested by several
> > people) seems better than corruption, if we're not going to write some
> > kind of resilience?  How else are we supposed to deal with "this
> > shouldn't happen, and if it does we're hosed?"
>
> A PANIC may be OK for this specific syscall and would be better, but
> the problematic area is larger than that as we'd still finish with a
> corruption as long as there's an ERROR or a FATAL between the moment
> the buffers (potentially dirty, with live-still-dead-in-memory tuples
> on disk) are discarded and the moment the truncation fails.  Another
> method discussed is the use of a critical section (I recall that there
> were some pallocs in this area, actually, but got nothing on my notes
> about that...).

Yeah.  I guess the obvious place for a critical section to start would
be in RelationTruncate() near DELAY_CHKPT_COMPLETE where a similar
concern about recovery is discussed.  There is a comment explaining
that it's a bad idea to use a critical section, but evidently it
overestimated the harmlessness of that choice.  It has a point that if
DO can't truncate, maybe REDO will fail too and you might be stuck in
an eternal samsara of failed recovery, but... that's because your
system isn't doing things we fundamentally need it to do to make
progress, and an administrator needs to find out why.  And although
eternity sounds bad, as far as I can tell from the rare reports we've
had of this failure, it seems to be transient, right?

Perhaps we could consider adding ftruncate() to the set of horrible
Windows wrapper functions that hide a few sleep-retry loops before
they give up so there's a good chance of avoiding a crash, but it'd be
better if someone more knowledgeable/hands-on with Windows could get
to the bottom of what is causing this and document how to avoid it...
I think we understand why we need that in the other cases (this OS's
peculiar dirent management), and this case seems a bit different.

As for Unix, if I guessed right about mmap being involved, the kernel
always lets the truncation proceed but kills the mapping process on
access to non-backed region of memory.  And for EINTR, handling was
recently added for this (0d369ac6500), so I guess there may be ways in
older branches on some systems (eg the old 'interruptible' NFS, but
probably not on normal file systems...).  The only other things I can
think of are EIO, and then various left field errors caused by the
file system going read only or the file being 'sealed', etc, and
promotion to PANIC seems appropriate for those cases.



On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... DELAY_CHKPT_COMPLETE ...

About that...  If the lights go out after the truncation and the
delayed logging of the checkpoint, how do we know the truncation has
actually reached the disk?  mdtruncate() enqueues fsync() calls, but
if we were already in phase 2 (see proc.h) of a checkpoint at that
moment, they might be processed by the *next* checkpoint, not the one
whose phase 3 we've carefully delayed there, no?



On Fri, Oct 6, 2023 at 9:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... DELAY_CHKPT_COMPLETE ...
>
> About that...  If the lights go out after the truncation and the
> delayed logging of the checkpoint, how do we know the truncation has
> actually reached the disk?  mdtruncate() enqueues fsync() calls, but
> if we were already in phase 2 (see proc.h) of a checkpoint at that
> moment, they might be processed by the *next* checkpoint, not the one
> whose phase 3 we've carefully delayed there, no?

I didn't look into this for very long so I might be missing something
here, but I think there could be at least one bad sequence.  If you
insert a couple of sleeps to jinx the scheduling, and hack
RelationTruncate() to request a checkpoint at a carefully chosen wrong
moment (see attached), then:

postgres=# create table t (i int);
CREATE TABLE
postgres=# insert into t select 1 from generate_series(1, 100);
INSERT 0 100
postgres=# checkpoint; -- checkpoint #1 just puts some data on disk
CHECKPOINT
postgres=# delete from t;
DELETE 100
postgres=# vacuum freeze t; -- truncates, starts unlucky checkpoint #2
VACUUM

If you trace the checkpointer's system calls you will see that
base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
by checkpoint #2.  The following checkpoint #3 might eventually do it,
but if the kernel loses power after checkpoint #2 completes and there
is no checkpoint #3, the kernel might forget the truncation, and yet
replay starts too late to redo it.  I think that bad sequence looks
like this:

P1: log truncate
P2:                        choose redo LSN
P1: DropRelationBuffers()
P2:                        CheckPointBuffers()
P2:                        ProcessSyncRequests()
P1: ftruncate()
P1: RegisterSyncRequest()
P2:                        log checkpoint
             *** system loses power ***

I realise it is a different problem than the one reported, but it's
close.  My initial thought is that perhaps we shouldn't allow a redo
LSN to be chosen until the sync request is registered, which is also
fairly close to the critical section boundaries being discussed for
ftruncate() error case.  But that's not a phase the checkpoint delay
machinery currently knows how to delay.  And there may well be better
ways...

Attachment
On Wed, Oct 4, 2023 at 7:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > But as for what we should do about it, PANIC (as suggested by several
> > people) seems better than corruption, if we're not going to write some
> > kind of resilience?
>
> Maybe that's an acceptable answer now ... it's not great, but nobody
> is in love with any of the other options either.  And it would definitely
> get DBAs' attention about this misbehavior of their file systems.

I and others, including Andres, have been thinking that a PANIC is the
right option for some time.

Quoth I in
https://www.postgresql.org/message-id/CA%2BTgmobwc_Rdaw%2B6TupT4_g9z55JjL%3DvhwpphsQe%3DYmBN0OPDg%40mail.gmail.com
some 2 years ago...
> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

I think the only reasons that I didn't do it at the time where (a)
shortage of round tuits and (b) fear of being yelled at. But the
comment is wrong, and a critical section is right.

I do think that it's nice to be tolerant of bad filesystem behavior
when we can. For instance if we try to write() some data to the OS and
it fails for some transient reason, it's nice if we can try to write()
it again. But there are always going to be cases where that sort of
tolerance is not practical. Having PostgreSQL continue to operate when
the filesystem isn't operating is a luxury, and we can't afford it in
every situation. shared_buffers provides a layer of insulation between
the logical act of modifying a buffer and the need to have a system
call succeed -- dirtying the buffer is in effect making a note that
the write() needs to be done later, instead of actually doing it in
the moment. And since the code that actually writes it is
checkpoint-aware and write-outs can be retried, we can avoid
panicking. But for operations such as creating, removing, or
truncating relations, there is no similar, general layer of insulation
-- we have no mechanism that allows us to logically do those things
now and have them actually happen at the FS level later. Which, to me,
seems to mean that we have little choice but to panic if they fail.
Otherwise, the primary diverges from any standbys that it has.

I also think that's OK. Unreliable filesystems lead to unreliable
databases, and it's better to find that out before something really
bad happens. Maybe in the future we'll develop more general mechanisms
for some of this stuff and maybe that will allow us to avoid panics in
more cases, and then we can debate the merits of such changes. But
right now, the cost of avoiding a panic here is a corrupted database,
and I have to believe that the overwhelming majority of users would
think that a corrupted database is worse.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Fri, Oct 6, 2023 at 12:50 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> If you trace the checkpointer's system calls you will see that
> base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
> by checkpoint #2.  The following checkpoint #3 might eventually do it,
> but if the kernel loses power after checkpoint #2 completes and there
> is no checkpoint #3, the kernel might forget the truncation, and yet
> replay starts too late to redo it.  I think that bad sequence looks
> like this:
>
> P1: log truncate
> P2:                        choose redo LSN
> P1: DropRelationBuffers()
> P2:                        CheckPointBuffers()
> P2:                        ProcessSyncRequests()
> P1: ftruncate()
> P1: RegisterSyncRequest()
> P2:                        log checkpoint
>              *** system loses power ***
>
> I realise it is a different problem than the one reported, but it's
> close.  My initial thought is that perhaps we shouldn't allow a redo
> LSN to be chosen until the sync request is registered, which is also
> fairly close to the critical section boundaries being discussed for
> ftruncate() error case.  But that's not a phase the checkpoint delay
> machinery currently knows how to delay.  And there may well be better
> ways...

Hmm. So the problem in this case is that we not only need the WAL to
be inserted and the filesystem operation to be performed in the same
checkpoint cycle, but we also need the sync request to be processed in
that checkpoint cycle. In this example, the WAL insertion happens
before the redo LSN, so the truncate on disk must happen before the
checkpoint completes, which is guaranteed, and the sync request must
also be processed before the checkpoint completes, which is not
guaranteed. We only guarantee that the checkpoint goes into the
checkpointer's queue before the checkpoint completes, not that it gets
processed before the checkpoint completes.

Suppose that RelationTruncate set both DELAY_CHKPT_START and
DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2
could still choose the redo LSN after P1 logged the truncate, but it
wouldn't then be able to reach CheckPointBuffers() until after P1 had
reached RegisterSyncRequest. Note that setting *only*
DELAY_CHKPT_START isn't good enough, because then we can get this
history:

P1: log truncate
P2:                        choose redo LSN
P2:                        CheckPointBuffers()
P1: DropRelationBuffers()
P2:                        ProcessSyncRequests()
P2:                        log checkpoint
              *** system loses power ***

I think, in general, that DELAY_CHKPT_START is useful for cases where
CheckPointGuts() is going to flush something that might not be locked
at the time we WAL-log a modification to it. For instance,
MarkBufferDirtyHint() wants to log that the buffer will be dirty while
holding only a shared lock on the buffer. RecordTransactonCommit() and
RecordTransactionCommitPrepared() write WAL that necesitate CLOG
updates, but they haven't yet locked the relevant CLOG buffers.
DELAY_CHKPT_END, on the other hand, is useful for cases where some
operation needs to get completed before the redo pointer gets moved.
What I think we missed in 412ad7a55639516f284cd0ef9757d6ae5c7abd43 is
that RelationTruncate actually falls into both categories -- the
ftruncate() needs to be protected by DELAY_CHKPT_END so that it
happens before the redo pointer moves, but the RegisterSyncRequest()
needs to be protected by DELAY_CHKPT_START so that it gets "flushed"
by ProcessSyncRequests.

I'm not quite sure that this idea closes all the holes, though. The
delay-checkpoint mechanism is quite hard to reason about. I spent a
while this morning trying to think up something better but realized at
the end I'd just reinvented the wheel.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Oct 12, 2023 at 5:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Suppose that RelationTruncate set both DELAY_CHKPT_START and
> DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2
> could still choose the redo LSN after P1 logged the truncate, but it
> wouldn't then be able to reach CheckPointBuffers() until after P1 had
> reached RegisterSyncRequest. [...]

Thanks for thinking about this.  Yeah, the existing _START flag does
indeed seem to be enough.  I'd been focusing on trying to control the
redo point selection, but we just need to delay ProcessSyncRequests(),
and we had a hammer for that already.  Who wants to write the patch?
It should be trivial, except for the comments.

It's interesting that we'd stun the checkpointer just before we try to
send it a request in a fixed size queue, but that's OK because we'll
perform the fsync ourselves if the queue is full.



On Wed, Oct 11, 2023 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks for thinking about this.  Yeah, the existing _START flag does
> indeed seem to be enough.  I'd been focusing on trying to control the
> redo point selection, but we just need to delay ProcessSyncRequests(),
> and we had a hammer for that already.  Who wants to write the patch?
> It should be trivial, except for the comments.
>
> It's interesting that we'd stun the checkpointer just before we try to
> send it a request in a fixed size queue, but that's OK because we'll
> perform the fsync ourselves if the queue is full.

I don't understand what you mean about stunning the checkpointer, but
here are some patches.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment
On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't understand what you mean about stunning the checkpointer, but

I just meant that, with your change, we ask the checkpointer to wait
for our signal if it reaches that code path and then while it's
snoozing we call register_dirty_segment().  I wanted to report that
I'd checked that it copes with the request queue being full, by
falling back to calling fsync() itself, which is good news because if
it instead waited for the checkpointer to drain the request queue
instead (an available option), we'd have a potential deadlock.

> here are some patches.

0001: LGTM, except in the commit message "By setting
XLOG_CHKPT_START," s/XLOG/DELAY/.
0002: LGTM

Now I wonder if we will get occasional PANIC reports from Windows
users.  I might prepare one of those retry wrappers...



On Fri, Oct 20, 2023 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > here are some patches.
>
> 0001: LGTM, except in the commit message "By setting

I take that back: there is a palloc() under RelationTruncate().  DNLGTM.



On Fri, Oct 20, 2023 at 4:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 20, 2023 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > here are some patches.
> >
> > 0001: LGTM, except in the commit message "By setting
>
> I take that back: there is a palloc() under RelationTruncate().  DNLGTM.

Ooops, I should have quoted to the 0002 patch there.

Hmm.  We could teach md.c to do all its path manipulation in
MAXPGPATH-sized output buffers but that still wouldn't be enough
because it might decide to allocate more space for the array of
segments, and I'm not even sure where in PostgreSQL we guarantee that
everything that could appear here would fit in that.  But that gives
me an idea.  It feels like a bit of a dirty hack, which could perhaps
be made less dirty with some kind of function that includes the word
'ensure' in its name, but I think we can make md.c promise not to
allocate anything before the next CFI with something like this:

         * truncate files on disk before they can get the database up
and running
         * again. But forcing them to fix permissions (or whatever the
problem is)
         * beats corrupting the database.
+        *
+        * First, make sure that smgr doesn't need to allocate any memory to
+        * truncate, since that wouldn't be allowed in a critical
section. We don't
+        * need to call XLogEnsureRecord() for such a small insertion.
         */
+
+       smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+       if (fsm)
+               smgrnblocks(RelationGetSmgr(rel), FSM_FORKNUM);
+
        START_CRIT_SECTION();



On Fri, Oct 20, 2023 at 1:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I take that back: there is a palloc() under RelationTruncate().  DNLGTM.
>
> Ooops, I should have quoted to the 0002 patch there.
>
> Hmm.  We could teach md.c to do all its path manipulation in
> MAXPGPATH-sized output buffers but that still wouldn't be enough
> because it might decide to allocate more space for the array of
> segments, and I'm not even sure where in PostgreSQL we guarantee that
> everything that could appear here would fit in that.  But that gives
> me an idea.  It feels like a bit of a dirty hack, which could perhaps
> be made less dirty with some kind of function that includes the word
> 'ensure' in its name, but I think we can make md.c promise not to
> allocate anything before the next CFI with something like this:

Ugh. AFAICS the problems are confined to smgrtruncate() and within
that to the callback to smgr_truncate. As far as the rest of
smgrtruncate() is concerned, it seems like DropRelationBuffers() and
CacheInvalidateSmgr() are safe enough in a critical section, and the
other code directly inside RelationTruncate() looks OK, too.

But within mdtruncate(), we've got more than one problem, I think.
mdnblocks() is a problem because of the reason that you mention, but
register_dirty_segment() doesn't look totally safe either, because it
can call RegisterSyncRequest() which, in a standalone backend, can
call RememberSyncRequest().

In general, it seems like it would be a lot nicer if we were doing a
lot less stuff inside the critical section here. So I think you're
right that we need some refactoring. Maybe smgr_prepare_truncate() and
smgr_execute_truncate() or something like that. I wonder if we could
actually register the dirty segment in the "prepare" phase - is it bad
if we register a dirty segment before actually dirtying it? And maybe
even CacheInvalidateSmgr() could be done at that stage? It seems
pretty clear that dropping the dirty buffers and actually truncating
the relation on disk need to happen after we've entered the critical
section, because if we fail after doing the former, we've thrown away
dirty data in anticipation of performing an operation that didn't
happen, and if we fail when attempting the latter, primaries and
standbys diverge and the originally-reported bug on this thread
happens. But we'd like to move as much other stuff as we can out of
that critical section.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Oct 24, 2023 at 4:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But within mdtruncate(), we've got more than one problem, I think.
> mdnblocks() is a problem because of the reason that you mention, but
> register_dirty_segment() doesn't look totally safe either, because it
> can call RegisterSyncRequest() which, in a standalone backend, can
> call RememberSyncRequest().

But that is already enkludgified thusly:

        /*
         * XXX: The checkpointer needs to add entries to the pending ops table
         * when absorbing fsync requests.  That is done within a critical
         * section, which isn't usually allowed, but we make an exception. It
         * means that there's a theoretical possibility that you run out of
         * memory while absorbing fsync requests, which leads to a PANIC.
         * Fortunately the hash table is small so that's unlikely to happen in
         * practice.
         */
        pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
                                              "Pending ops context",
                                              ALLOCSET_DEFAULT_SIZES);
        MemoryContextAllowInCriticalSection(pendingOpsCxt, true);


> In general, it seems like it would be a lot nicer if we were doing a
> lot less stuff inside the critical section here. So I think you're
> right that we need some refactoring. Maybe smgr_prepare_truncate() and
> smgr_execute_truncate() or something like that. I wonder if we could
> actually register the dirty segment in the "prepare" phase - is it bad
> if we register a dirty segment before actually dirtying it? And maybe
> even CacheInvalidateSmgr() could be done at that stage? It seems
> pretty clear that dropping the dirty buffers and actually truncating
> the relation on disk need to happen after we've entered the critical
> section, because if we fail after doing the former, we've thrown away
> dirty data in anticipation of performing an operation that didn't
> happen, and if we fail when attempting the latter, primaries and
> standbys diverge and the originally-reported bug on this thread
> happens. But we'd like to move as much other stuff as we can out of
> that critical section.

Hmm, yeah it seems like that direction would be a nice improvement, as
long as we are sure that the fsync request can't be processed too
soon.



Related bug #18426 sent me back here.

Here is a new attempt to see what it might take to put
RelationTruncate() into a critical section.  Problems encountered:
even if you've called mdnblocks() beforehand, dressed up as
smgrpreparetruncate(), if the highest segment is exactly full then the
later mdnblocks() again probes whether the next segment number exists
on disk, which involves GetRelationPath(), which allocates.  So I
finished up having to write a GetRelationPathInPlace() function, and
to decide whether it's OK to use a MAXPGPATH-sized array on the stack
for this.  I also had to teach _fdvec_resize() not to reallocate when
downsizing, to avoid the critical section assertion.  It seems like
quite a lot to back-patch... but also awful to leave this trickle of
data corruption reports unaddressed.  The big re-engineering ideas[1]
would be absolutely unbackpatchable, but I hope we can work on
something like that for 18...

[1] https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

Attachment
On Tue, Apr 23, 2024 at 07:48:12PM +1200, Thomas Munro wrote:
> Here is a new attempt to see what it might take to put
> RelationTruncate() into a critical section.  Problems encountered:
> even if you've called mdnblocks() beforehand, dressed up as
> smgrpreparetruncate(), if the highest segment is exactly full then the
> later mdnblocks() again probes whether the next segment number exists
> on disk, which involves GetRelationPath(), which allocates.  So I
> finished up having to write a GetRelationPathInPlace() function, and
> to decide whether it's OK to use a MAXPGPATH-sized array on the stack
> for this.  I also had to teach _fdvec_resize() not to reallocate when
> downsizing, to avoid the critical section assertion.  It seems like
> quite a lot to back-patch... but also awful to leave this trickle of
> data corruption reports unaddressed.  The big re-engineering ideas[1]
> would be absolutely unbackpatchable, but I hope we can work on
> something like that for 18...

For now I'd agree with something like that, even for a backpatch.  No
idea what a good "cheap" solution would look like, though.  Anything
considered these past years for this issue was utterly expensive.

> However, it didn't guarantee that the sync request for the
> truncation got processed before the XLOG_CHECKPOINT_ONLINE record
> was written. By setting XLOG_CHKPT_START, we guarantee that if
> an XLOG_SMGR_TRUNCATE record is written to WAL before the redo
> pointer of some concurrent checkpoint, the sync request queued by
> that operation must be processed by that checkpoint (rather than
> being left for the following one).

s/XLOG_CHKPT_START/DELAY_CHKPT_START/ in the commit message of 0001.

The point you are making about d8725104 seems like a good argument to
do a backpatch down to v14, as the possible interrupt make this setof
issues much easier to reach compared to the Windows-only-weird-syscall
failures happening in the middle of the truncation that people have
been complaining about for N years now.  I am seeing less reports of
these on WIN32 these days with files switched suddenly to read-only,
or is that just my imagination?

The pieces around GetRelationPathInPlace() and MAX_EXTENSION_SIZE
could be a refactoring piece worth their own, split into its own patch
before introducing the core part of the fix with the critical section.
That would be much cleaner, IMO.

It seems to me that an assertion at the end of
GetRelationPathInPlace() about the computed size would be adapted
based on MAXPGPATH.  The interface of _mdfd_segpath() is quite
confusing, actually.  Why return the same pointer as the input
argument rather than void?

I was re-reading your comment about last October, and the fact that we
may stuck replay, and finishing with semi-corrupted undetectible
corruptions is worse than that, still I'm going to agree that failing
harder will serve much better at this stage:
https://www.postgresql.org/message-id/CA%2BhUKGJy9iCBfkjUyV8ZuRwd5CAGxZV1STywe%2B0S%2B9YKH1zF8w%40mail.gmail.com

That may impact availability.  Still that's cheaper than any other
solution discussed.  The one involving the WAL-logging of dirty pages
that would get truncated shortly after comes first into mind..

PS: On HEAD, I'd like to see one or more tests for these failure
scenarios in the long-term and how replay is able to react after
panicking in the critical section.  Now, injection points won't work
inside a critical section as the initial function loading needs to do
a palloc() for the library path.  We've discussed about having a
PRELOAD macro that could be called outside the critical path, before
running a INJECTION_POINT.  Just mentioning.
--
Michael

Attachment
Hello Thomas,

23.04.2024 10:48, Thomas Munro wrote:
> Related bug #18426 sent me back here.
>
> Here is a new attempt to see what it might take to put
> RelationTruncate() into a critical section.

When running 027_stream_regress on a slow machine with the aggressive
autovacuum settings, having those patches applied, I've stumbled upon:
TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID: 24468
...
2024-05-14 12:30:03.542 UTC [22964:4] LOG:  server process (PID 24468) was terminated by signal 6: Aborted
2024-05-14 12:30:03.542 UTC [22964:5] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE
pg_catalog.pg_class

with the following stack trace:
Core was generated by `postgres: primary: autovacuum worker regression                               '.
Program terminated with signal SIGABRT, Aborted.
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3  0xb63bb1f2 in __GI_abort () at abort.c:79
#4  0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>, 
lineNumber=lineNumber@entry=1353) at assert.c:66
#5  0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353
#6  0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173
#7  ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113
#8  0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST, 
retryOnError=retryOnError@entry=false) at sync.c:605
#9  0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM,
seg=<optimized
 
out>, seg=<optimized out>) at md.c:1369
#10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229
#11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2, 
nblocks=nblocks@entry=0xbe99f878) at smgr.c:743
#12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at ../../../src/include/utils/rel.h:574
#13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642
...

Best regards,
Alexander



On 14/05/2024 16:00, Alexander Lakhin wrote:
> 23.04.2024 10:48, Thomas Munro wrote:
>> Here is a new attempt to see what it might take to put
>> RelationTruncate() into a critical section.
> 
> When running 027_stream_regress on a slow machine with the aggressive
> autovacuum settings, having those patches applied, I've stumbled upon:
> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID:
24468
> ...
> 2024-05-14 12:30:03.542 UTC [22964:4] LOG:  server process (PID 24468) was terminated by signal 6: Aborted
> 2024-05-14 12:30:03.542 UTC [22964:5] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE
pg_catalog.pg_class
> 
> with the following stack trace:
> Core was generated by `postgres: primary: autovacuum worker regression                               '.
> Program terminated with signal SIGABRT, Aborted.
> #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
> (gdb) bt
> #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
> #1  0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
> #2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
> #3  0xb63bb1f2 in __GI_abort () at abort.c:79
> #4  0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>,
> lineNumber=lineNumber@entry=1353) at assert.c:66
> #5  0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353
> #6  0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173
> #7  ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113
> #8  0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST,
> retryOnError=retryOnError@entry=false) at sync.c:605
> #9  0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM,
seg=<optimized
> out>, seg=<optimized out>) at md.c:1369
> #10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229
> #11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2,
> nblocks=nblocks@entry=0xbe99f878) at smgr.c:743
> #12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at
../../../src/include/utils/rel.h:574
> #13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642
> ...

This should've also been fixed by commit b1ffe3ff0b.

Thanks Alexander for pointing me to this thread!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On 26/06/2024 23:58, Heikki Linnakangas wrote:
> On 14/05/2024 16:00, Alexander Lakhin wrote:
>> 23.04.2024 10:48, Thomas Munro wrote:
>>> Here is a new attempt to see what it might take to put
>>> RelationTruncate() into a critical section.
>>
>> When running 027_stream_regress on a slow machine with the aggressive
>> autovacuum settings, having those patches applied, I've stumbled upon:
>> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID:
24468
>> ...
>> 2024-05-14 12:30:03.542 UTC [22964:4] LOG:  server process (PID 24468) was terminated by signal 6: Aborted
>> 2024-05-14 12:30:03.542 UTC [22964:5] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE
pg_catalog.pg_class
>>
>> with the following stack trace:
>> Core was generated by `postgres: primary: autovacuum worker regression                               '.
>> Program terminated with signal SIGABRT, Aborted.
>> #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
>> (gdb) bt
>> #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
>> #1  0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
>> #2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
>> #3  0xb63bb1f2 in __GI_abort () at abort.c:79
>> #4  0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>,
>> lineNumber=lineNumber@entry=1353) at assert.c:66
>> #5  0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353
>> #6  0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173
>> #7  ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113
>> #8  0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST,
>> retryOnError=retryOnError@entry=false) at sync.c:605
>> #9  0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM,
seg=<optimized
>> out>, seg=<optimized out>) at md.c:1369
>> #10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229
>> #11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2,
>> nblocks=nblocks@entry=0xbe99f878) at smgr.c:743
>> #12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at
../../../src/include/utils/rel.h:574
>> #13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642
>> ...
> 
> This should've also been fixed by commit b1ffe3ff0b.

To clarify commit b1ffe3ff0b only fixed this assertion failure, if you 
call RelationTruncate in a critical section, like in with this patch. 
Not the original issue.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Mon, Sep 9, 2024 at 7:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> The trick to use _mdfd_segpath() through smgrpreparetruncate() smells
> too much of magic to me.  I don't have an idea that does not involve
> tweaking the interface of smgr.c or some of its structures on top of
> my mind, though, which is to save the path before the critical section
> and pass it through (like a memory area that can be used by
> md.c)..  Or do a MemoryContextAllowInCriticalSection, even if it is
> not its original purpose.

It's funny that it's so hard to get or store the pathnames when every
single File knows its pathname, and segment 0 could give it to you:
FilePathName().  Thinking about that made me notice that the vfd layer
does raw strdup() and ERRORs on failure, so the thing I proposed might
be adhering to the letter of the law ("no palloc in critical
sections"), but not the spirit (the fd.c file opening stuff will throw
ERRORs that are promoted to PANICs, exactly what that law intended to
prevent).  Mleugh.

Another option would be to introduce mdtruncate2(old_nblocks,
new_nblocks), so that it doesn't have to call smgrnblocks() itself,
and you can obtain old_nblocks outside the CS...



On Tue, Sep 10, 2024 at 12:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> It is possible that an extension that messes with smgrsw[] would not
> like this in a minor release:
>
> -               smgrsw[reln->smgr_which].smgr_truncate(reln,
> forknum[i], nblocks[i]);
> +               smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
> +
>                     old_nblocks[i], nblocks[i]);

Scratch that concern, smgrsw is static in smgr.c for now (h/t to
Andres for pointing that out in an off-list chat when I described
this...).  I must have confused myself with proposals I've read to
make it so an extension *could* mess with this stuff, but it's not
done yet.

Also if this goes somewhere I should note in the commit message that
the analysis of the WaitIO issue was from Alexander L in bug #18426.