Thread: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?
Hello, In tablespace.c, a comment explains that DROP TABLESPACE can fail bogusly because of Windows file semantics: * XXX On Windows, an unlinked file persists in the directory listing * until no process retains an open handle for the file. The DDL * commands that schedule files for unlink send invalidation messages * directing other PostgreSQL processes to close the files. DROP * TABLESPACE should not give up on the tablespace becoming empty * until all relevant invalidation processing is complete. While trying to get the AIO patchset working on more operating systems, this turned out to be a problem. Andres mentioned the new ProcSignalBarrier stuff as a good way to tackle this, so I tried it and it seems to work well so far. The idea in this initial version is to tell every process in the cluster to close all fds, and then try again. That's a pretty large hammer, but it isn't reached on Unix, and with slightly more work it could be made to happen only after 2 failures on Windows. It was tempting to try to figure out how to use the sinval mechanism to close precisely the right files instead, but it doesn't look safe to run sinval at arbitrary CFI() points. It's easier to see that the pre-existing closeAllVfds() function has an effect that is local to fd.c and doesn't affect the VFDs or SMgrRelations, so any CFI() should be an OK time to run that. While reading the ProcSignalBarrier code, I couldn't resist replacing its poll/sleep loop with condition variables. Thoughts?
Attachment
> While reading the ProcSignalBarrier code, I couldn't resist replacing > its poll/sleep loop with condition variables. Oops, that version accidentally added and then removed an unnecessary change due to incorrect commit squashing. Here's a better pair of patches.
Attachment
Hi, Thanks for developing this. On 2021-01-31 02:11:06 +1300, Thomas Munro wrote: > --- a/src/backend/commands/tablespace.c > +++ b/src/backend/commands/tablespace.c > @@ -520,15 +520,23 @@ DropTableSpace(DropTableSpaceStmt *stmt) > * but we can't tell them apart from important data files that we > * mustn't delete. So instead, we force a checkpoint which will clean > * out any lingering files, and try again. > - * > - * XXX On Windows, an unlinked file persists in the directory listing > - * until no process retains an open handle for the file. The DDL > - * commands that schedule files for unlink send invalidation messages > - * directing other PostgreSQL processes to close the files. DROP > - * TABLESPACE should not give up on the tablespace becoming empty > - * until all relevant invalidation processing is complete. > */ > RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); > + /* > + * On Windows, an unlinked file persists in the directory listing until > + * no process retains an open handle for the file. The DDL > + * commands that schedule files for unlink send invalidation messages > + * directing other PostgreSQL processes to close the files, but nothing > + * guarantees they'll be processed in time. So, we'll also use a > + * global barrier to ask all backends to close all files, and wait > + * until they're finished. > + */ > +#if defined(WIN32) || defined(USE_ASSERT_CHECKING) > + LWLockRelease(TablespaceCreateLock); > + WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); > + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); > +#endif > + /* And now try again. */ > if (!destroy_tablespace_directories(tablespaceoid, false)) > { > /* Still not empty, the files must be important then */ It's probably rare enough to care, but this still made me thing whether we could avoid the checkpoint at all somehow. Requiring an immediate checkpoint for dropping relations is quite a heavy hammer that practically cannot be used in production without causing performance problems. But it seems hard to process the fsync deletion queue in another way. > diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c > index 4dc24649df..0f8548747c 100644 > --- a/src/backend/storage/smgr/smgr.c > +++ b/src/backend/storage/smgr/smgr.c > @@ -298,6 +298,12 @@ smgrcloseall(void) > smgrclose(reln); > } > > +void > +smgrrelease(void) > +{ > + mdrelease(); > +} Probably should be something like for (i = 0; i < NSmgr; i++) { if (smgrsw[i].smgr_release) smgrsw[i].smgr_release(); } Greetings, Andres Freund
On Tue, Feb 2, 2021 at 8:02 AM Andres Freund <andres@anarazel.de> wrote: > It's probably rare enough to care, but this still made me thing whether > we could avoid the checkpoint at all somehow. Requiring an immediate > checkpoint for dropping relations is quite a heavy hammer that > practically cannot be used in production without causing performance > problems. But it seems hard to process the fsync deletion queue in > another way. Right, the checkpoint itself is probably worse than this "close-all-your-files!" thing in some cases (though it seems likely that once we start using ProcSignalBarrier we're going to find out about places that take a long time to get around to processing them and that's going to be a thing to work on). As a separate project, perhaps we should find some other way to stop GetNewRelFileNode() from recycling the relfilenode until the next checkpoint, so that we can unlink the file eagerly at commit time, while still avoiding the hazard described in the comment for mdunlink(). A straw-man idea would be to touch a file under PGDATA/pg_dropped and fsync it so it survives a power outage, have checkpoints clean that out, and have GetNewRelFileNode() to try access() it. Then we wouldn't need the checkpoint here, I think; we'd just need this ProcSignalBarrier for Windows. > > +void > > +smgrrelease(void) > > +{ > > + mdrelease(); > > +} > > Probably should be something like > for (i = 0; i < NSmgr; i++) > { > if (smgrsw[i].smgr_release) > smgrsw[i].smgr_release(); > } Fixed. Thanks!
Attachment
On Tue, Feb 2, 2021 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote: > ... A straw-man idea > would be to touch a file under PGDATA/pg_dropped and fsync it so it > survives a power outage, have checkpoints clean that out, and have > GetNewRelFileNode() to try access() it. ... I should add, the reason I mentioned fsyncing it is that in another thread we've also discussed making the end-of-crash-recovery checkpoint optional, and then I think you'd need to be sure you can avoid reusing the relfilenode even after crash recovery, because if you recycle the relfilenode and then crash again you'd be exposed to that hazard during the 2nd run thought recovery. But perhaps it's enough to recreate the hypothetical pg_dropped file while replaying the drop-relation record. Not sure, would need more thought.
On Sat, Mar 6, 2021 at 12:10 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > On 3 Mar 2021, at 23:19, Thomas Munro <thomas.munro@gmail.com> wrote: > > That's why I release and reacquire that LWLock. But does that break some > > logic? > > One clear change to current behavior is naturally that a concurrent > TablespaceCreateDbspace can happen while barrier absorption is performed. > Given where we are that might not be a problem, but I don't have enough > caffeine at the moment to conclude anything there. Testing nu inducing > concurent calls while absorption was stalled didn't trigger anything, but I'm > sure I didn't test every scenario. Do you see anything off the cuff? Now I may have the opposite problem (too much coffee) but it looks like it should work about as well as it does today. At this new point where we released the LWLock, all we've really done is possibly unlink some empty database directories in destroy_tablespace_directories(), and that's harmless, they'll be recreated on demand if we abandon ship. If TablespaceCreateDbspace() happened while we were absorbing the barrier and not holding the lock in this new code, then a concurrent mdcreate() is running and so we have a race where we'll again try to drop all empty directories, and it'll try to create its relfile in the new empty directory, and one of us will fail (possibly with an ugly ENOENT error message). But that's already the case in the master branch: mdcreate() could have run TablespaceCreateDbspace() before we acquire the lock in the master branch, and (with pathological enough scheduling) it could reach its attempt to create its relfile after DropTableSpace() has unlinked the empty directory. The interlocking here is hard to follow. I wonder why we don't use heavyweight locks to do per-tablespace interlocking between DefineRelation() and DropTableSpace(). I'm sure this question is hopelessly naive and I should probably go and read some history.
Just as an FYI: this entry currently fails with "Timed out!" on cfbot because of an oversight in the master branch[1], AFAICS. It should pass again once that's fixed. [1] https://www.postgresql.org/message-id/CA%2BhUKGLah2w1pWKHonZP_%2BEQw69%3Dq56AHYwCgEN8GDzsRG_Hgw%40mail.gmail.com
On Sun, Jun 13, 2021 at 8:25 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Just as an FYI: this entry currently fails with "Timed out!" on cfbot > because of an oversight in the master branch[1], AFAICS. It should > pass again once that's fixed. > > [1] https://www.postgresql.org/message-id/CA%2BhUKGLah2w1pWKHonZP_%2BEQw69%3Dq56AHYwCgEN8GDzsRG_Hgw%40mail.gmail.com That's fixed now. So what should we do about this patch? This is a bug, so it would be nice to do *something*. I don't really like the fact that this makes the behavior contingent on USE_ASSERT_CHECKING, and I suggest that you make a new symbol like USE_BARRIER_SMGR_RELEASE which by default gets defined on WIN32, but can be defined elsewhere if you want (see the treatment of EXEC_BACKEND in pg_config_manual.h). Furthermore, I can't see back-patching this, given that it would be the very first use of the barrier machinery. But I think it would be good to get something into master, because then we'd actually be using this procsignalbarrier stuff for something. On a good day we've fixed a bug. On a bad day we'll learn something new about how procsignalbarrier needs to work. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 10:23 AM Robert Haas <robertmhaas@gmail.com> wrote: > That's fixed now. So what should we do about this patch? This is a > bug, so it would be nice to do *something*. I don't really like the > fact that this makes the behavior contingent on USE_ASSERT_CHECKING, > and I suggest that you make a new symbol like USE_BARRIER_SMGR_RELEASE > which by default gets defined on WIN32, but can be defined elsewhere > if you want (see the treatment of EXEC_BACKEND in pg_config_manual.h). Ok, done like that. > Furthermore, I can't see back-patching this, given that it would be > the very first use of the barrier machinery. But I think it would be > good to get something into master, because then we'd actually be using > this procsignalbarrier stuff for something. On a good day we've fixed > a bug. On a bad day we'll learn something new about how > procsignalbarrier needs to work. Agreed. Pushed. The basic Windows/tablespace bug seen occasionally in CI[1] should now be fixed. For the sake of the archives, here's a link to the ongoing discussion about further potential uses of this mechanism: https://www.postgresql.org/message-id/flat/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de [1] https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com