Thread: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18146 Logged by: Root Cause Email address: rootcause000@gmail.com PostgreSQL version: Unsupported/Unknown Operating system: Windows Description: Version: PostgreSQL 10.21, compiled by Visual C++ build 1800, 64-bit Platform: Windows In our application code, we've implemented logic to clean up entries in several PostgreSQL tables using a loop. Although some of these tables have foreign key references, we've included them to ensure a thorough deletion process. Here's a simplified code snippet: String tables[] = {"TableA", "TableB", "TableC", "TableD", "TableE", "TableF", "TableG", "TableH"}; for (String tableName : tables) { try { stmt = conn.prepareStatement("DELETE FROM " + tableName); stmt.execute(); } catch (Exception ex) { // Log the exception } } Notably, it leaves entries in different tables each time, and we've found auto-vacuum failure traces in the pg_log for these tables at the same time. Here's an example of the error messages from pg_log: 2023-09-28 12:03:35.955 IST,,,11592,,65151e2d.2d48,1,,2023-09-28 12:03:17 IST,6/19,122786262,ERROR,42501,"could not truncate file ""base/16509/6935026"" to 0 blocks: Permission denied",,,,,"automatic vacuum of table ""DB.public.TableB""",,,,"" 2023-09-28 12:03:37.205 IST,,,11592,,65151e2d.2d48,2,,2023-09-28 12:03:17 IST,6/45,122786326,ERROR,42501,"could not truncate file ""base/16509/6935443"" to 0 blocks: Permission denied",,,,,"automatic vacuum of table ""DB.public.TableB""",,,,"" We've already attempted to exclude the entire pg folder from antivirus scans, but the problem persists. Any insights or solutions to this issue would be greatly appreciated. Thank you!
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Laurenz Albe
Date:
On Tue, 2023-10-03 at 17:25 +0000, PG Bug reporting form wrote: > Version: PostgreSQL 10.21, compiled by Visual C++ build 1800, 64-bit > > Platform: Windows > > In our application code, we've implemented logic to clean up entries in > several PostgreSQL tables using a loop. Although some of these tables have > foreign key references, we've included them to ensure a thorough deletion > process. Here's a simplified code snippet: > > String tables[] = {"TableA", "TableB", "TableC", "TableD", "TableE", > "TableF", "TableG", "TableH"}; > > for (String tableName : tables) { > try { > stmt = conn.prepareStatement("DELETE FROM " + tableName); > stmt.execute(); > } catch (Exception ex) { > // Log the exception > } > } > Notably, it leaves entries in different tables each time, and we've found > auto-vacuum failure traces in the pg_log for these tables at the same time. > Here's an example of the error messages from pg_log: > > 2023-09-28 12:03:35.955 IST,,,11592,,65151e2d.2d48,1,,2023-09-28 12:03:17 > IST,6/19,122786262,ERROR,42501,"could not truncate file > ""base/16509/6935026"" to 0 blocks: Permission denied",,,,,"automatic vacuum > of table ""DB.public.TableB""",,,,"" 2023-09-28 12:03:37.205 > IST,,,11592,,65151e2d.2d48,2,,2023-09-28 12:03:17 > IST,6/45,122786326,ERROR,42501,"could not truncate file > ""base/16509/6935443"" to 0 blocks: Permission denied",,,,,"automatic vacuum > of table ""DB.public.TableB""",,,,"" > > We've already attempted to exclude the entire pg folder from antivirus > scans, but the problem persists. Any insights or solutions to this issue > would be greatly appreciated. Thank you! PostgreSQL v10 is out of support. Data corruption like this is not necessarily caused by a PostgreSQL bug. Yours, Laurenz Albe
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Michael Paquier
Date:
On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote: > Data corruption like this is not necessarily caused by a PostgreSQL bug. Err, well... A failure on the end-of-vacuum truncation should not lead to corruption afterwards as well, and this ought to be safe even if this step failed. This is a very tricky problem that nobody has really looked into yet. -- Michael
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes: > On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote: >> Data corruption like this is not necessarily caused by a PostgreSQL bug. > Err, well... A failure on the end-of-vacuum truncation should not > lead to corruption afterwards as well, and this ought to be safe even > if this step failed. This is a very tricky problem that nobody has > really looked into yet. ISTM we did identify the problem: while all the tuples in the pages-to-be-truncated should be dead and thus invisible, it may be that some of those pages are dirty and haven't been written out of shared buffers yet, and the page versions on disk contain tuples that look live. If VACUUM discards those dirty buffers and then fails to truncate, voila you have tuples rising from the dead. I'm too lazy to check the commit log right now, but I think we did implement a fix for that (ie, flush dirty pages even if we anticipate them going away due to truncation). But as Laurenz says, v10 is out of support and possibly didn't get that fix. Even if it did, you'd need to be running one of the last minor releases, because this wasn't very long ago. In the end though, the *real* problem here is running on a platform that randomly disallows writes to disk. There's only so much that Postgres can possibly do about unreliability of the underlying platform. I would never run a production database on Windows, because it's just too prone to that sort of BS. regards, tom lane
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael@paquier.xyz> writes: > > On Wed, Oct 04, 2023 at 09:17:11AM +0200, Laurenz Albe wrote: > >> Data corruption like this is not necessarily caused by a PostgreSQL bug. > > > Err, well... A failure on the end-of-vacuum truncation should not > > lead to corruption afterwards as well, and this ought to be safe even > > if this step failed. This is a very tricky problem that nobody has > > really looked into yet. > > ISTM we did identify the problem: while all the tuples in the > pages-to-be-truncated should be dead and thus invisible, it may > be that some of those pages are dirty and haven't been written > out of shared buffers yet, and the page versions on disk contain > tuples that look live. If VACUUM discards those dirty buffers > and then fails to truncate, voila you have tuples rising from > the dead. > > I'm too lazy to check the commit log right now, but I think > we did implement a fix for that (ie, flush dirty pages even > if we anticipate them going away due to truncation). But as > Laurenz says, v10 is out of support and possibly didn't get > that fix. Even if it did, you'd need to be running one of > the last minor releases, because this wasn't very long ago. This thread seems to be saying otherwise: https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us > In the end though, the *real* problem here is running on a > platform that randomly disallows writes to disk. There's only > so much that Postgres can possibly do about unreliability of the > underlying platform. I would never run a production database on > Windows, because it's just too prone to that sort of BS. It's surprising that ftruncate() AKA chsize() is able to fail like this (I am not a Windows user but AFAIR that sharing stuff obstructs stuff like open, unlink, rename, so it surprises me to see it come up with ftruncate, since we must already have made it past the open stage). Hmm, the documentation is scant, but I know from my attempts to use large files that chsize() is probably some kind of wrapper around SetEndOfFile() or similar, and that is documented as failing if someone has the file mapped. I don't know why someone would have the file mapped, though. But as for what we should do about it, PANIC (as suggested by several people) seems better than corruption, if we're not going to write some kind of resilience? How else are we supposed to deal with "this shouldn't happen, and if it does we're hosed?"
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Thu, Oct 5, 2023 at 10:12 AM Thomas Munro <thomas.munro@gmail.com> wrote: > It's surprising that ftruncate() AKA chsize() is able to fail like > this (I am not a Windows user but AFAIR that sharing stuff obstructs > stuff like open, unlink, rename, so it surprises me to see it come up > with ftruncate, since we must already have made it past the open > stage). Hmm, the documentation is scant, but I know from my attempts > to use large files that chsize() is probably some kind of wrapper > around SetEndOfFile() or similar, and that is documented as failing if > someone has the file mapped. I don't know why someone would have the > file mapped, though. Some more thoughts: I guess it would probably also fail like that if someone explicitly locked a range with LockFile(), but I think we can rule that out as read and/or write calls would also fail. As for the mapping theory, apparently the underlying NT error for that is ERROR_USER_MAPPED_FILE, and searching for that brings up various unexplained errors vaguely blamed on anti-virus tools etc. But if all of that sort of thing really is turned off for the data directory, I wonder if there could be a backup tool in use that thinks it can go faster by mapping files while copying?
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Michael Paquier
Date:
On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote: > On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm too lazy to check the commit log right now, but I think >> we did implement a fix for that (ie, flush dirty pages even >> if we anticipate them going away due to truncation). But as >> Laurenz says, v10 is out of support and possibly didn't get >> that fix. Even if it did, you'd need to be running one of >> the last minor releases, because this wasn't very long ago. > > This thread seems to be saying otherwise: > > https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us Yeah, that's not been solved. I've put my mind on this problem for a few hours last May, just before PGCon, and there was an argument that marking all the pages as dirty is kind of a waste of resources as it would create WAL for data that's going to be gone a little bit later as an effect of the truncate, leading to an extra burst of I/O especially for large truncatoins. FWIW, I think that I'd be personally OK with using this method. At least that's safe, simple, backpatchable and it does not require any new magic. I know that there are voices that argued against this method, but here we are N years later, so perhaps we should just do that on HEAD at least. >> In the end though, the *real* problem here is running on a >> platform that randomly disallows writes to disk. There's only >> so much that Postgres can possibly do about unreliability of the >> underlying platform. I would never run a production database on >> Windows, because it's just too prone to that sort of BS. > > It's surprising that ftruncate() AKA chsize() is able to fail like > this (I am not a Windows user but AFAIR that sharing stuff obstructs > stuff like open, unlink, rename, so it surprises me to see it come up > with ftruncate, since we must already have made it past the open > stage). Hmm, the documentation is scant, but I know from my attempts > to use large files that chsize() is probably some kind of wrapper > around SetEndOfFile() or similar, and that is documented as failing if > someone has the file mapped. I don't know why someone would have the > file mapped, though. (shrug) > But as for what we should do about it, PANIC (as suggested by several > people) seems better than corruption, if we're not going to write some > kind of resilience? How else are we supposed to deal with "this > shouldn't happen, and if it does we're hosed?" A PANIC may be OK for this specific syscall and would be better, but the problematic area is larger than that as we'd still finish with a corruption as long as there's an ERROR or a FATAL between the moment the buffers (potentially dirty, with live-still-dead-in-memory tuples on disk) are discarded and the moment the truncation fails. Another method discussed is the use of a critical section (I recall that there were some pallocs in this area, actually, but got nothing on my notes about that...). -- Michael
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes: > On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm too lazy to check the commit log right now, but I think >> we did implement a fix for that (ie, flush dirty pages even >> if we anticipate them going away due to truncation). > This thread seems to be saying otherwise: > https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us Hmph. OK, I was remembering the discussion not the (lack of) end result. > But as for what we should do about it, PANIC (as suggested by several > people) seems better than corruption, if we're not going to write some > kind of resilience? Maybe that's an acceptable answer now ... it's not great, but nobody is in love with any of the other options either. And it would definitely get DBAs' attention about this misbehavior of their file systems. regards, tom lane
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Thu, Oct 5, 2023 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote: > > But as for what we should do about it, PANIC (as suggested by several > > people) seems better than corruption, if we're not going to write some > > kind of resilience? How else are we supposed to deal with "this > > shouldn't happen, and if it does we're hosed?" > > A PANIC may be OK for this specific syscall and would be better, but > the problematic area is larger than that as we'd still finish with a > corruption as long as there's an ERROR or a FATAL between the moment > the buffers (potentially dirty, with live-still-dead-in-memory tuples > on disk) are discarded and the moment the truncation fails. Another > method discussed is the use of a critical section (I recall that there > were some pallocs in this area, actually, but got nothing on my notes > about that...). Yeah. I guess the obvious place for a critical section to start would be in RelationTruncate() near DELAY_CHKPT_COMPLETE where a similar concern about recovery is discussed. There is a comment explaining that it's a bad idea to use a critical section, but evidently it overestimated the harmlessness of that choice. It has a point that if DO can't truncate, maybe REDO will fail too and you might be stuck in an eternal samsara of failed recovery, but... that's because your system isn't doing things we fundamentally need it to do to make progress, and an administrator needs to find out why. And although eternity sounds bad, as far as I can tell from the rare reports we've had of this failure, it seems to be transient, right? Perhaps we could consider adding ftruncate() to the set of horrible Windows wrapper functions that hide a few sleep-retry loops before they give up so there's a good chance of avoiding a crash, but it'd be better if someone more knowledgeable/hands-on with Windows could get to the bottom of what is causing this and document how to avoid it... I think we understand why we need that in the other cases (this OS's peculiar dirent management), and this case seems a bit different. As for Unix, if I guessed right about mmap being involved, the kernel always lets the truncation proceed but kills the mapping process on access to non-backed region of memory. And for EINTR, handling was recently added for this (0d369ac6500), so I guess there may be ways in older branches on some systems (eg the old 'interruptible' NFS, but probably not on normal file systems...). The only other things I can think of are EIO, and then various left field errors caused by the file system going read only or the file being 'sealed', etc, and promotion to PANIC seems appropriate for those cases.
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote: > ... DELAY_CHKPT_COMPLETE ... About that... If the lights go out after the truncation and the delayed logging of the checkpoint, how do we know the truncation has actually reached the disk? mdtruncate() enqueues fsync() calls, but if we were already in phase 2 (see proc.h) of a checkpoint at that moment, they might be processed by the *next* checkpoint, not the one whose phase 3 we've carefully delayed there, no?
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Fri, Oct 6, 2023 at 9:18 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > ... DELAY_CHKPT_COMPLETE ... > > About that... If the lights go out after the truncation and the > delayed logging of the checkpoint, how do we know the truncation has > actually reached the disk? mdtruncate() enqueues fsync() calls, but > if we were already in phase 2 (see proc.h) of a checkpoint at that > moment, they might be processed by the *next* checkpoint, not the one > whose phase 3 we've carefully delayed there, no? I didn't look into this for very long so I might be missing something here, but I think there could be at least one bad sequence. If you insert a couple of sleeps to jinx the scheduling, and hack RelationTruncate() to request a checkpoint at a carefully chosen wrong moment (see attached), then: postgres=# create table t (i int); CREATE TABLE postgres=# insert into t select 1 from generate_series(1, 100); INSERT 0 100 postgres=# checkpoint; -- checkpoint #1 just puts some data on disk CHECKPOINT postgres=# delete from t; DELETE 100 postgres=# vacuum freeze t; -- truncates, starts unlucky checkpoint #2 VACUUM If you trace the checkpointer's system calls you will see that base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd by checkpoint #2. The following checkpoint #3 might eventually do it, but if the kernel loses power after checkpoint #2 completes and there is no checkpoint #3, the kernel might forget the truncation, and yet replay starts too late to redo it. I think that bad sequence looks like this: P1: log truncate P2: choose redo LSN P1: DropRelationBuffers() P2: CheckPointBuffers() P2: ProcessSyncRequests() P1: ftruncate() P1: RegisterSyncRequest() P2: log checkpoint *** system loses power *** I realise it is a different problem than the one reported, but it's close. My initial thought is that perhaps we shouldn't allow a redo LSN to be chosen until the sync request is registered, which is also fairly close to the critical section boundaries being discussed for ftruncate() error case. But that's not a phase the checkpoint delay machinery currently knows how to delay. And there may well be better ways...
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Robert Haas
Date:
On Wed, Oct 4, 2023 at 7:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > But as for what we should do about it, PANIC (as suggested by several > > people) seems better than corruption, if we're not going to write some > > kind of resilience? > > Maybe that's an acceptable answer now ... it's not great, but nobody > is in love with any of the other options either. And it would definitely > get DBAs' attention about this misbehavior of their file systems. I and others, including Andres, have been thinking that a PANIC is the right option for some time. Quoth I in https://www.postgresql.org/message-id/CA%2BTgmobwc_Rdaw%2B6TupT4_g9z55JjL%3DvhwpphsQe%3DYmBN0OPDg%40mail.gmail.com some 2 years ago... > As you say, this doesn't fix the problem that truncation might fail. > But as Andres and Sawada-san said, the solution to that is to get rid > of the comments saying that it's OK for truncation to fail and make it > a PANIC. However, I don't think that change needs to be part of this > patch. Even if we do that, we still need to do this. And even if we do > this, we still need to do that. I think the only reasons that I didn't do it at the time where (a) shortage of round tuits and (b) fear of being yelled at. But the comment is wrong, and a critical section is right. I do think that it's nice to be tolerant of bad filesystem behavior when we can. For instance if we try to write() some data to the OS and it fails for some transient reason, it's nice if we can try to write() it again. But there are always going to be cases where that sort of tolerance is not practical. Having PostgreSQL continue to operate when the filesystem isn't operating is a luxury, and we can't afford it in every situation. shared_buffers provides a layer of insulation between the logical act of modifying a buffer and the need to have a system call succeed -- dirtying the buffer is in effect making a note that the write() needs to be done later, instead of actually doing it in the moment. And since the code that actually writes it is checkpoint-aware and write-outs can be retried, we can avoid panicking. But for operations such as creating, removing, or truncating relations, there is no similar, general layer of insulation -- we have no mechanism that allows us to logically do those things now and have them actually happen at the FS level later. Which, to me, seems to mean that we have little choice but to panic if they fail. Otherwise, the primary diverges from any standbys that it has. I also think that's OK. Unreliable filesystems lead to unreliable databases, and it's better to find that out before something really bad happens. Maybe in the future we'll develop more general mechanisms for some of this stuff and maybe that will allow us to avoid panics in more cases, and then we can debate the merits of such changes. But right now, the cost of avoiding a panic here is a corrupted database, and I have to believe that the overwhelming majority of users would think that a corrupted database is worse. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Robert Haas
Date:
On Fri, Oct 6, 2023 at 12:50 AM Thomas Munro <thomas.munro@gmail.com> wrote: > If you trace the checkpointer's system calls you will see that > base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd > by checkpoint #2. The following checkpoint #3 might eventually do it, > but if the kernel loses power after checkpoint #2 completes and there > is no checkpoint #3, the kernel might forget the truncation, and yet > replay starts too late to redo it. I think that bad sequence looks > like this: > > P1: log truncate > P2: choose redo LSN > P1: DropRelationBuffers() > P2: CheckPointBuffers() > P2: ProcessSyncRequests() > P1: ftruncate() > P1: RegisterSyncRequest() > P2: log checkpoint > *** system loses power *** > > I realise it is a different problem than the one reported, but it's > close. My initial thought is that perhaps we shouldn't allow a redo > LSN to be chosen until the sync request is registered, which is also > fairly close to the critical section boundaries being discussed for > ftruncate() error case. But that's not a phase the checkpoint delay > machinery currently knows how to delay. And there may well be better > ways... Hmm. So the problem in this case is that we not only need the WAL to be inserted and the filesystem operation to be performed in the same checkpoint cycle, but we also need the sync request to be processed in that checkpoint cycle. In this example, the WAL insertion happens before the redo LSN, so the truncate on disk must happen before the checkpoint completes, which is guaranteed, and the sync request must also be processed before the checkpoint completes, which is not guaranteed. We only guarantee that the checkpoint goes into the checkpointer's queue before the checkpoint completes, not that it gets processed before the checkpoint completes. Suppose that RelationTruncate set both DELAY_CHKPT_START and DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2 could still choose the redo LSN after P1 logged the truncate, but it wouldn't then be able to reach CheckPointBuffers() until after P1 had reached RegisterSyncRequest. Note that setting *only* DELAY_CHKPT_START isn't good enough, because then we can get this history: P1: log truncate P2: choose redo LSN P2: CheckPointBuffers() P1: DropRelationBuffers() P2: ProcessSyncRequests() P2: log checkpoint *** system loses power *** I think, in general, that DELAY_CHKPT_START is useful for cases where CheckPointGuts() is going to flush something that might not be locked at the time we WAL-log a modification to it. For instance, MarkBufferDirtyHint() wants to log that the buffer will be dirty while holding only a shared lock on the buffer. RecordTransactonCommit() and RecordTransactionCommitPrepared() write WAL that necesitate CLOG updates, but they haven't yet locked the relevant CLOG buffers. DELAY_CHKPT_END, on the other hand, is useful for cases where some operation needs to get completed before the redo pointer gets moved. What I think we missed in 412ad7a55639516f284cd0ef9757d6ae5c7abd43 is that RelationTruncate actually falls into both categories -- the ftruncate() needs to be protected by DELAY_CHKPT_END so that it happens before the redo pointer moves, but the RegisterSyncRequest() needs to be protected by DELAY_CHKPT_START so that it gets "flushed" by ProcessSyncRequests. I'm not quite sure that this idea closes all the holes, though. The delay-checkpoint mechanism is quite hard to reason about. I spent a while this morning trying to think up something better but realized at the end I'd just reinvented the wheel. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Thu, Oct 12, 2023 at 5:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > Suppose that RelationTruncate set both DELAY_CHKPT_START and > DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2 > could still choose the redo LSN after P1 logged the truncate, but it > wouldn't then be able to reach CheckPointBuffers() until after P1 had > reached RegisterSyncRequest. [...] Thanks for thinking about this. Yeah, the existing _START flag does indeed seem to be enough. I'd been focusing on trying to control the redo point selection, but we just need to delay ProcessSyncRequests(), and we had a hammer for that already. Who wants to write the patch? It should be trivial, except for the comments. It's interesting that we'd stun the checkpointer just before we try to send it a request in a fixed size queue, but that's OK because we'll perform the fsync ourselves if the queue is full.
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Robert Haas
Date:
On Wed, Oct 11, 2023 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks for thinking about this. Yeah, the existing _START flag does > indeed seem to be enough. I'd been focusing on trying to control the > redo point selection, but we just need to delay ProcessSyncRequests(), > and we had a hammer for that already. Who wants to write the patch? > It should be trivial, except for the comments. > > It's interesting that we'd stun the checkpointer just before we try to > send it a request in a fixed size queue, but that's OK because we'll > perform the fsync ourselves if the queue is full. I don't understand what you mean about stunning the checkpointer, but here are some patches. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't understand what you mean about stunning the checkpointer, but I just meant that, with your change, we ask the checkpointer to wait for our signal if it reaches that code path and then while it's snoozing we call register_dirty_segment(). I wanted to report that I'd checked that it copes with the request queue being full, by falling back to calling fsync() itself, which is good news because if it instead waited for the checkpointer to drain the request queue instead (an available option), we'd have a potential deadlock. > here are some patches. 0001: LGTM, except in the commit message "By setting XLOG_CHKPT_START," s/XLOG/DELAY/. 0002: LGTM Now I wonder if we will get occasional PANIC reports from Windows users. I might prepare one of those retry wrappers...
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Fri, Oct 20, 2023 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > > here are some patches. > > 0001: LGTM, except in the commit message "By setting I take that back: there is a palloc() under RelationTruncate(). DNLGTM.
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Fri, Oct 20, 2023 at 4:38 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Oct 20, 2023 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Oct 20, 2023 at 8:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > here are some patches. > > > > 0001: LGTM, except in the commit message "By setting > > I take that back: there is a palloc() under RelationTruncate(). DNLGTM. Ooops, I should have quoted to the 0002 patch there. Hmm. We could teach md.c to do all its path manipulation in MAXPGPATH-sized output buffers but that still wouldn't be enough because it might decide to allocate more space for the array of segments, and I'm not even sure where in PostgreSQL we guarantee that everything that could appear here would fit in that. But that gives me an idea. It feels like a bit of a dirty hack, which could perhaps be made less dirty with some kind of function that includes the word 'ensure' in its name, but I think we can make md.c promise not to allocate anything before the next CFI with something like this: * truncate files on disk before they can get the database up and running * again. But forcing them to fix permissions (or whatever the problem is) * beats corrupting the database. + * + * First, make sure that smgr doesn't need to allocate any memory to + * truncate, since that wouldn't be allowed in a critical section. We don't + * need to call XLogEnsureRecord() for such a small insertion. */ + + smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM); + if (fsm) + smgrnblocks(RelationGetSmgr(rel), FSM_FORKNUM); + START_CRIT_SECTION();
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Robert Haas
Date:
On Fri, Oct 20, 2023 at 1:17 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I take that back: there is a palloc() under RelationTruncate(). DNLGTM. > > Ooops, I should have quoted to the 0002 patch there. > > Hmm. We could teach md.c to do all its path manipulation in > MAXPGPATH-sized output buffers but that still wouldn't be enough > because it might decide to allocate more space for the array of > segments, and I'm not even sure where in PostgreSQL we guarantee that > everything that could appear here would fit in that. But that gives > me an idea. It feels like a bit of a dirty hack, which could perhaps > be made less dirty with some kind of function that includes the word > 'ensure' in its name, but I think we can make md.c promise not to > allocate anything before the next CFI with something like this: Ugh. AFAICS the problems are confined to smgrtruncate() and within that to the callback to smgr_truncate. As far as the rest of smgrtruncate() is concerned, it seems like DropRelationBuffers() and CacheInvalidateSmgr() are safe enough in a critical section, and the other code directly inside RelationTruncate() looks OK, too. But within mdtruncate(), we've got more than one problem, I think. mdnblocks() is a problem because of the reason that you mention, but register_dirty_segment() doesn't look totally safe either, because it can call RegisterSyncRequest() which, in a standalone backend, can call RememberSyncRequest(). In general, it seems like it would be a lot nicer if we were doing a lot less stuff inside the critical section here. So I think you're right that we need some refactoring. Maybe smgr_prepare_truncate() and smgr_execute_truncate() or something like that. I wonder if we could actually register the dirty segment in the "prepare" phase - is it bad if we register a dirty segment before actually dirtying it? And maybe even CacheInvalidateSmgr() could be done at that stage? It seems pretty clear that dropping the dirty buffers and actually truncating the relation on disk need to happen after we've entered the critical section, because if we fail after doing the former, we've thrown away dirty data in anticipation of performing an operation that didn't happen, and if we fail when attempting the latter, primaries and standbys diverge and the originally-reported bug on this thread happens. But we'd like to move as much other stuff as we can out of that critical section. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Tue, Oct 24, 2023 at 4:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > But within mdtruncate(), we've got more than one problem, I think. > mdnblocks() is a problem because of the reason that you mention, but > register_dirty_segment() doesn't look totally safe either, because it > can call RegisterSyncRequest() which, in a standalone backend, can > call RememberSyncRequest(). But that is already enkludgified thusly: /* * XXX: The checkpointer needs to add entries to the pending ops table * when absorbing fsync requests. That is done within a critical * section, which isn't usually allowed, but we make an exception. It * means that there's a theoretical possibility that you run out of * memory while absorbing fsync requests, which leads to a PANIC. * Fortunately the hash table is small so that's unlikely to happen in * practice. */ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext, "Pending ops context", ALLOCSET_DEFAULT_SIZES); MemoryContextAllowInCriticalSection(pendingOpsCxt, true); > In general, it seems like it would be a lot nicer if we were doing a > lot less stuff inside the critical section here. So I think you're > right that we need some refactoring. Maybe smgr_prepare_truncate() and > smgr_execute_truncate() or something like that. I wonder if we could > actually register the dirty segment in the "prepare" phase - is it bad > if we register a dirty segment before actually dirtying it? And maybe > even CacheInvalidateSmgr() could be done at that stage? It seems > pretty clear that dropping the dirty buffers and actually truncating > the relation on disk need to happen after we've entered the critical > section, because if we fail after doing the former, we've thrown away > dirty data in anticipation of performing an operation that didn't > happen, and if we fail when attempting the latter, primaries and > standbys diverge and the originally-reported bug on this thread > happens. But we'd like to move as much other stuff as we can out of > that critical section. Hmm, yeah it seems like that direction would be a nice improvement, as long as we are sure that the fsync request can't be processed too soon.
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
Related bug #18426 sent me back here. Here is a new attempt to see what it might take to put RelationTruncate() into a critical section. Problems encountered: even if you've called mdnblocks() beforehand, dressed up as smgrpreparetruncate(), if the highest segment is exactly full then the later mdnblocks() again probes whether the next segment number exists on disk, which involves GetRelationPath(), which allocates. So I finished up having to write a GetRelationPathInPlace() function, and to decide whether it's OK to use a MAXPGPATH-sized array on the stack for this. I also had to teach _fdvec_resize() not to reallocate when downsizing, to avoid the critical section assertion. It seems like quite a lot to back-patch... but also awful to leave this trickle of data corruption reports unaddressed. The big re-engineering ideas[1] would be absolutely unbackpatchable, but I hope we can work on something like that for 18... [1] https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Michael Paquier
Date:
On Tue, Apr 23, 2024 at 07:48:12PM +1200, Thomas Munro wrote: > Here is a new attempt to see what it might take to put > RelationTruncate() into a critical section. Problems encountered: > even if you've called mdnblocks() beforehand, dressed up as > smgrpreparetruncate(), if the highest segment is exactly full then the > later mdnblocks() again probes whether the next segment number exists > on disk, which involves GetRelationPath(), which allocates. So I > finished up having to write a GetRelationPathInPlace() function, and > to decide whether it's OK to use a MAXPGPATH-sized array on the stack > for this. I also had to teach _fdvec_resize() not to reallocate when > downsizing, to avoid the critical section assertion. It seems like > quite a lot to back-patch... but also awful to leave this trickle of > data corruption reports unaddressed. The big re-engineering ideas[1] > would be absolutely unbackpatchable, but I hope we can work on > something like that for 18... For now I'd agree with something like that, even for a backpatch. No idea what a good "cheap" solution would look like, though. Anything considered these past years for this issue was utterly expensive. > However, it didn't guarantee that the sync request for the > truncation got processed before the XLOG_CHECKPOINT_ONLINE record > was written. By setting XLOG_CHKPT_START, we guarantee that if > an XLOG_SMGR_TRUNCATE record is written to WAL before the redo > pointer of some concurrent checkpoint, the sync request queued by > that operation must be processed by that checkpoint (rather than > being left for the following one). s/XLOG_CHKPT_START/DELAY_CHKPT_START/ in the commit message of 0001. The point you are making about d8725104 seems like a good argument to do a backpatch down to v14, as the possible interrupt make this setof issues much easier to reach compared to the Windows-only-weird-syscall failures happening in the middle of the truncation that people have been complaining about for N years now. I am seeing less reports of these on WIN32 these days with files switched suddenly to read-only, or is that just my imagination? The pieces around GetRelationPathInPlace() and MAX_EXTENSION_SIZE could be a refactoring piece worth their own, split into its own patch before introducing the core part of the fix with the critical section. That would be much cleaner, IMO. It seems to me that an assertion at the end of GetRelationPathInPlace() about the computed size would be adapted based on MAXPGPATH. The interface of _mdfd_segpath() is quite confusing, actually. Why return the same pointer as the input argument rather than void? I was re-reading your comment about last October, and the fact that we may stuck replay, and finishing with semi-corrupted undetectible corruptions is worse than that, still I'm going to agree that failing harder will serve much better at this stage: https://www.postgresql.org/message-id/CA%2BhUKGJy9iCBfkjUyV8ZuRwd5CAGxZV1STywe%2B0S%2B9YKH1zF8w%40mail.gmail.com That may impact availability. Still that's cheaper than any other solution discussed. The one involving the WAL-logging of dirty pages that would get truncated shortly after comes first into mind.. PS: On HEAD, I'd like to see one or more tests for these failure scenarios in the long-term and how replay is able to react after panicking in the critical section. Now, injection points won't work inside a critical section as the initial function loading needs to do a palloc() for the library path. We've discussed about having a PRELOAD macro that could be called outside the critical path, before running a INJECTION_POINT. Just mentioning. -- Michael
Attachment
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Alexander Lakhin
Date:
Hello Thomas, 23.04.2024 10:48, Thomas Munro wrote: > Related bug #18426 sent me back here. > > Here is a new attempt to see what it might take to put > RelationTruncate() into a critical section. When running 027_stream_regress on a slow machine with the aggressive autovacuum settings, having those patches applied, I've stumbled upon: TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID: 24468 ... 2024-05-14 12:30:03.542 UTC [22964:4] LOG: server process (PID 24468) was terminated by signal 6: Aborted 2024-05-14 12:30:03.542 UTC [22964:5] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_class with the following stack trace: Core was generated by `postgres: primary: autovacuum worker regression '. Program terminated with signal SIGABRT, Aborted. #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 (gdb) bt #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 #1 0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84 #2 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48 #3 0xb63bb1f2 in __GI_abort () at abort.c:79 #4 0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>, lineNumber=lineNumber@entry=1353) at assert.c:66 #5 0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353 #6 0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173 #7 ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113 #8 0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST, retryOnError=retryOnError@entry=false) at sync.c:605 #9 0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM, seg=<optimized out>, seg=<optimized out>) at md.c:1369 #10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229 #11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2, nblocks=nblocks@entry=0xbe99f878) at smgr.c:743 #12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at ../../../src/include/utils/rel.h:574 #13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642 ... Best regards, Alexander
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Heikki Linnakangas
Date:
On 14/05/2024 16:00, Alexander Lakhin wrote: > 23.04.2024 10:48, Thomas Munro wrote: >> Here is a new attempt to see what it might take to put >> RelationTruncate() into a critical section. > > When running 027_stream_regress on a slow machine with the aggressive > autovacuum settings, having those patches applied, I've stumbled upon: > TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID: 24468 > ... > 2024-05-14 12:30:03.542 UTC [22964:4] LOG: server process (PID 24468) was terminated by signal 6: Aborted > 2024-05-14 12:30:03.542 UTC [22964:5] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_class > > with the following stack trace: > Core was generated by `postgres: primary: autovacuum worker regression '. > Program terminated with signal SIGABRT, Aborted. > #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 > (gdb) bt > #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 > #1 0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84 > #2 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48 > #3 0xb63bb1f2 in __GI_abort () at abort.c:79 > #4 0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>, > lineNumber=lineNumber@entry=1353) at assert.c:66 > #5 0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353 > #6 0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173 > #7 ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113 > #8 0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST, > retryOnError=retryOnError@entry=false) at sync.c:605 > #9 0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM, seg=<optimized > out>, seg=<optimized out>) at md.c:1369 > #10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229 > #11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2, > nblocks=nblocks@entry=0xbe99f878) at smgr.c:743 > #12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at ../../../src/include/utils/rel.h:574 > #13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642 > ... This should've also been fixed by commit b1ffe3ff0b. Thanks Alexander for pointing me to this thread! -- Heikki Linnakangas Neon (https://neon.tech)
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Heikki Linnakangas
Date:
On 26/06/2024 23:58, Heikki Linnakangas wrote: > On 14/05/2024 16:00, Alexander Lakhin wrote: >> 23.04.2024 10:48, Thomas Munro wrote: >>> Here is a new attempt to see what it might take to put >>> RelationTruncate() into a critical section. >> >> When running 027_stream_regress on a slow machine with the aggressive >> autovacuum settings, having those patches applied, I've stumbled upon: >> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "mcxt.c", Line: 1353, PID: 24468 >> ... >> 2024-05-14 12:30:03.542 UTC [22964:4] LOG: server process (PID 24468) was terminated by signal 6: Aborted >> 2024-05-14 12:30:03.542 UTC [22964:5] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_class >> >> with the following stack trace: >> Core was generated by `postgres: primary: autovacuum worker regression '. >> Program terminated with signal SIGABRT, Aborted. >> #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 >> (gdb) bt >> #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47 >> #1 0xb63c90ae in __libc_signal_restore_set (set=0xbe99f34c) at ../sysdeps/unix/sysv/linux/internal-signals.h:84 >> #2 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48 >> #3 0xb63bb1f2 in __GI_abort () at abort.c:79 >> #4 0xb6d3e2d0 in ExceptionalCondition (conditionName=<optimized out>, fileName=<optimized out>, >> lineNumber=lineNumber@entry=1353) at assert.c:66 >> #5 0xb6d61834 in palloc0 (size=3062661664) at mcxt.c:1353 >> #6 0xb6bcd304 in CompactCheckpointerRequestQueue () at checkpointer.c:1173 >> #7 ForwardSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST) at checkpointer.c:1113 >> #8 0xb6c4b3c4 in RegisterSyncRequest (ftag=ftag@entry=0xbe99f7d0, type=type@entry=SYNC_REQUEST, >> retryOnError=retryOnError@entry=false) at sync.c:605 >> #9 0xb6c48e62 in register_dirty_segment (reln=reln@entry=0xb8ec75a8, forknum=forknum@entry=MAIN_FORKNUM, seg=<optimized >> out>, seg=<optimized out>) at md.c:1369 >> #10 0xb6c4a0c6 in mdtruncate (reln=0xb8ec75a8, forknum=MAIN_FORKNUM, nblocks=45) at md.c:1229 >> #11 0xb6c4abe8 in smgrtruncate (reln=0xb8ec75a8, forknum=forknum@entry=0xbe99f86c, nforks=nforks@entry=2, >> nblocks=nblocks@entry=0xbe99f878) at smgr.c:743 >> #12 0xb6a3e8c6 in RelationTruncate (rel=0xb2e969a0, nblocks=nblocks@entry=45) at ../../../src/include/utils/rel.h:574 >> #13 0xb69b8042 in lazy_truncate_heap (vacrel=0xb8ee8f38) at vacuumlazy.c:2642 >> ... > > This should've also been fixed by commit b1ffe3ff0b. To clarify commit b1ffe3ff0b only fixed this assertion failure, if you call RelationTruncate in a critical section, like in with this patch. Not the original issue. -- Heikki Linnakangas Neon (https://neon.tech)
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Mon, Sep 9, 2024 at 7:21 PM Michael Paquier <michael@paquier.xyz> wrote: > The trick to use _mdfd_segpath() through smgrpreparetruncate() smells > too much of magic to me. I don't have an idea that does not involve > tweaking the interface of smgr.c or some of its structures on top of > my mind, though, which is to save the path before the critical section > and pass it through (like a memory area that can be used by > md.c).. Or do a MemoryContextAllowInCriticalSection, even if it is > not its original purpose. It's funny that it's so hard to get or store the pathnames when every single File knows its pathname, and segment 0 could give it to you: FilePathName(). Thinking about that made me notice that the vfd layer does raw strdup() and ERRORs on failure, so the thing I proposed might be adhering to the letter of the law ("no palloc in critical sections"), but not the spirit (the fd.c file opening stuff will throw ERRORs that are promoted to PANICs, exactly what that law intended to prevent). Mleugh. Another option would be to introduce mdtruncate2(old_nblocks, new_nblocks), so that it doesn't have to call smgrnblocks() itself, and you can obtain old_nblocks outside the CS...
Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
From
Thomas Munro
Date:
On Tue, Sep 10, 2024 at 12:43 PM Thomas Munro <thomas.munro@gmail.com> wrote: > It is possible that an extension that messes with smgrsw[] would not > like this in a minor release: > > - smgrsw[reln->smgr_which].smgr_truncate(reln, > forknum[i], nblocks[i]); > + smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], > + > old_nblocks[i], nblocks[i]); Scratch that concern, smgrsw is static in smgr.c for now (h/t to Andres for pointing that out in an off-list chat when I described this...). I must have confused myself with proposals I've read to make it so an extension *could* mess with this stuff, but it's not done yet. Also if this goes somewhere I should note in the commit message that the analysis of the WaitIO issue was from Alexander L in bug #18426.