Thread: Fixing order of resowner cleanup in 12, for Windows
Hi all,

A while back I posted a patch[1] to change the order of resowner cleanup so that DSM handles are released last. That's useful for the error cleanup path on Windows, when a SharedFileSet is cleaned up (a mechanism that's used by parallel CREATE INDEX and parallel hash join, for spilling files to disk under a temporary directory, with automatic cleanup).

Previously we believed that it was OK to unlink things that other processes might have currently open as long as you use the FILE_SHARE_DELETE flag, but that turned out not to be the full story: you can unlink files that someone has open, but you can't unlink the directory that contains them! Hence the desire to reverse the cleanup order.

It didn't seem worth the risk of back-patching the change, because the only consequence is a confusing message that appears somewhere near the real error:

LOG:  could not rmdir directory "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty

I suppose we probably should make the change in 12, though: then owners of extensions that use DSM detach hooks (if there are any such extensions) will have a bit of time to get used to the new order during the beta period. I'll need to find someone to test this with a fault-injection scenario on Windows before committing it, but I wanted to sound out the list for any objections to this late change.

[1] https://www.postgresql.org/message-id/CAEepm%3D2ikUtjmiJ18bTnwaeUBoiYN%3DwMDSdhU1jy%3D8WzNhET-Q%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
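A minimal standalone Win32 sketch of the directory behavior described above (an illustration only, not PostgreSQL source; the file and directory names are arbitrary):

    /* Demonstrates: FILE_SHARE_DELETE lets you unlink an open file, but the
     * parent directory cannot be removed until the last handle is closed. */
    #include <windows.h>
    #include <stdio.h>

    int
    main(void)
    {
        HANDLE h;

        CreateDirectoryA("tmpdir", NULL);
        h = CreateFileA("tmpdir\\spill.tmp", GENERIC_READ | GENERIC_WRITE,
                        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                        NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

        /* The unlink succeeds, thanks to FILE_SHARE_DELETE ... */
        if (!DeleteFileA("tmpdir\\spill.tmp"))
            printf("DeleteFile failed: %lu\n", GetLastError());

        /* ... but the file lingers in delete-pending state, so this fails
         * with ERROR_DIR_NOT_EMPTY while the handle is still open. */
        if (!RemoveDirectoryA("tmpdir"))
            printf("RemoveDirectory failed: %lu\n", GetLastError());

        CloseHandle(h);             /* now the file really disappears */
        RemoveDirectoryA("tmpdir"); /* and this succeeds */
        return 0;
    }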
Thomas Munro <thomas.munro@gmail.com> writes:
> A while back I posted a patch[1] to change the order of resowner
> cleanup so that DSM handles are released last. That's useful for the
> error cleanup path on Windows, when a SharedFileSet is cleaned up (a
> mechanism that's used by parallel CREATE INDEX and parallel hash join,
> for spilling files to disk under a temporary directory, with automatic
> cleanup).

I guess what I'm wondering is if there are any potential negative consequences, i.e. code that won't work if we change the order like this. I'm finding it hard to visualize what that would be, but then again this failure mode wasn't obvious either.

> I suppose we probably should make the change in 12, though: then owners
> of extensions that use DSM detach hooks (if there are any such extensions)
> will have a bit of time to get used to the new order during the beta
> period. I'll need to find someone to test this with a fault-injection
> scenario on Windows before committing it, but I wanted to sound out the
> list for any objections to this late change.

Since we haven't started beta yet, I don't see a reason not to change it. Worst case is that it causes problems and we revert it.

I concur with not back-patching, in any case.

			regards, tom lane
On Fri, May 3, 2019 at 2:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > A while back I posted a patch[1] to change the order of resowner
> > cleanup so that DSM handles are released last. That's useful for the
> > error cleanup path on Windows, when a SharedFileSet is cleaned up (a
> > mechanism that's used by parallel CREATE INDEX and parallel hash join,
> > for spilling files to disk under a temporary directory, with automatic
> > cleanup).
>
> I guess what I'm wondering is if there are any potential negative
> consequences, i.e. code that won't work if we change the order like this.
> I'm finding it hard to visualize what that would be, but then again
> this failure mode wasn't obvious either.

I can't think of anything in core. The trouble here is that we're talking about hypothetical out-of-tree code that could want to plug in detach hooks to do anything at all, so it's hard to say. One idea that occurred to me is that if someone comes up with a genuine need to run arbitrary callbacks before locks are released (for example), we could provide a way to be called in all three phases and receive the phase. Admittedly, in this case FileClose() is in the same phase as the one I'm proposing to put dsm_detach() in, so there is an ordering requirement that might require more fine-grained phases. I don't know.

> > I suppose we probably should make the change in 12, though: then owners
> > of extensions that use DSM detach hooks (if there are any such extensions)
> > will have a bit of time to get used to the new order during the beta
> > period. I'll need to find someone to test this with a fault-injection
> > scenario on Windows before committing it, but I wanted to sound out the
> > list for any objections to this late change.
>
> Since we haven't started beta yet, I don't see a reason not to change
> it. Worst case is that it causes problems and we revert it.
>
> I concur with not back-patching, in any case.

Here's a way to produce an error which might trigger the log message on Windows. Does anyone want to try it?

postgres=# create table foo as select generate_series(1, 10000000)::int i;
SELECT 10000000
postgres=# set synchronize_seqscans = off;
SET
postgres=# create index on foo ((1 / (5000000 - i)));
psql: ERROR:  division by zero
postgres=# create index on foo ((1 / (5000000 - i)));
psql: ERROR:  division by zero
postgres=# create index on foo ((1 / (5000000 - i)));
psql: ERROR:  division by zero
CONTEXT:  parallel worker

(If you don't turn sync scan off, it starts scanning from where it left off last time and then fails immediately, which may interfere with the experiment if you run it more than once; I'm not sure.)

If it does produce the log message, then the attached patch should make it go away.

--
Thomas Munro
https://enterprisedb.com
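For reference, the extension hook mechanism quoted later in the thread already delivers the phase to its callbacks. A minimal sketch of registering one (the callback name and the cleanup action are hypothetical; the registration API and the phase enum are the real ones from utils/resowner.h):

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/resowner.h"

    PG_MODULE_MAGIC;

    /* Invoked once per resowner release phase; the phase argument lets the
     * callback decide when to act. */
    static void
    my_release_callback(ResourceReleasePhase phase, bool isCommit,
                        bool isTopLevel, void *arg)
    {
        if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
        {
            /* clean up extension state that must wait until locks are gone */
        }
    }

    void
    _PG_init(void)
    {
        RegisterResourceReleaseCallback(my_release_callback, NULL);
    }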
Thomas Munro <thomas.munro@gmail.com> writes:
> If it does produce the log message, then the attached patch should
> make it go away.

One thing I don't care for about this patch is that the original code looked like it didn't matter what order we did the resource releases in, and the patched code still looks like that. You're not doing future hackers any service by failing to include a comment that explains that DSM detach MUST BE LAST, and explaining why. Even with that, I'd only rate it about a 75% chance that somebody won't try to add their new resource type at the end --- but with no comment, the odds they'll get it right are indistinguishable from zero.

			regards, tom lane
On Mon, May 6, 2019 at 11:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing I don't care for about this patch is that the original code
> looked like it didn't matter what order we did the resource releases in,
> and the patched code still looks like that. You're not doing future
> hackers any service by failing to include a comment that explains that
> DSM detach MUST BE LAST, and explaining why. Even with that, I'd only
> rate it about a 75% chance that somebody won't try to add their new
> resource type at the end --- but with no comment, the odds they'll
> get it right are indistinguishable from zero.

OK, here's a version that provides a specific reason (the Windows file handle thing) and also a more general rationale: we don't really want extension (or core) authors writing callbacks that depend on, e.g., pins or locks or whatever else still being held when they run, because that's fragile, so calling them last is the best and most conservative choice. I think if someone does come up with legitimate reasons to want that, we should discuss it then, and perhaps consider something a bit like the ResourceRelease_callbacks list: its callbacks are invoked for each phase.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, May 6, 2019 at 11:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... You're not doing future
>> hackers any service by failing to include a comment that explains that
>> DSM detach MUST BE LAST, and explaining why.

> OK, here's a version that provides a specific reason (the Windows file
> handle thing) and also a more general rationale: we don't really want
> extension (or core) authors writing callbacks that depend on, e.g., pins
> or locks or whatever else still being held when they run, because
> that's fragile, so calling them last is the best and most conservative
> choice.

LGTM.

> ... I think if someone does come up with legitimate reasons to want
> that, we should discuss it then, and perhaps consider something a bit
> like the ResourceRelease_callbacks list: its callbacks are invoked for
> each phase.

Hmm, now that you mention it: this bit at the very end

    /* Let add-on modules get a chance too */
    for (item = ResourceRelease_callbacks; item; item = item->next)
        item->callback(phase, isCommit, isTopLevel, item->arg);

seems kind of misplaced given this discussion. Should we not run that *first*, before we release core resources for the same phase? It's a lot more plausible that extension resources depend on core resources than vice versa.

			regards, tom lane
On Mon, May 6, 2019 at 3:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Mon, May 6, 2019 at 11:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> ... You're not doing future
> >> hackers any service by failing to include a comment that explains that
> >> DSM detach MUST BE LAST, and explaining why.
>
> > OK, here's a version that provides a specific reason (the Windows file
> > handle thing) and also a more general rationale: we don't really want
> > extension (or core) authors writing callbacks that depend on, e.g., pins
> > or locks or whatever else still being held when they run, because
> > that's fragile, so calling them last is the best and most conservative
> > choice.
>
> LGTM.

Cool. I'll wait a bit to see if we can get confirmation from a Windows hacker that it does what I claim. Or maybe I should try to come up with a regression test that exercises it without having to create a big table.

> > ... I think if someone does come up with legitimate reasons to want
> > that, we should discuss it then, and perhaps consider something a bit
> > like the ResourceRelease_callbacks list: its callbacks are invoked for
> > each phase.
>
> Hmm, now that you mention it: this bit at the very end
>
>     /* Let add-on modules get a chance too */
>     for (item = ResourceRelease_callbacks; item; item = item->next)
>         item->callback(phase, isCommit, isTopLevel, item->arg);
>
> seems kind of misplaced given this discussion. Should we not run that
> *first*, before we release core resources for the same phase? It's
> a lot more plausible that extension resources depend on core resources
> than vice versa.

Not sure. Changing the meaning of the existing callbacks from last to first in each phase seems a bit unfriendly. If it's useful to be able to run a callback before RESOURCE_RELEASE_BEFORE_LOCKS, perhaps we need a new phase that comes before that?

--
Thomas Munro
https://enterprisedb.com
On Mon, May 6, 2019 at 3:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, May 3, 2019 at 2:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > A while back I posted a patch[1] to change the order of resowner
> > > cleanup so that DSM handles are released last. That's useful for the
> > > error cleanup path on Windows, when a SharedFileSet is cleaned up (a
> > > mechanism that's used by parallel CREATE INDEX and parallel hash join,
> > > for spilling files to disk under a temporary directory, with automatic
> > > cleanup).
> >
> > I guess what I'm wondering is if there are any potential negative
> > consequences, i.e. code that won't work if we change the order like this.
> > I'm finding it hard to visualize what that would be, but then again
> > this failure mode wasn't obvious either.
>
> I can't think of anything in core. The trouble here is that we're
> talking about hypothetical out-of-tree code that could want to plug in
> detach hooks to do anything at all, so it's hard to say. One idea
> that occurred to me is that if someone comes up with a genuine need to
> run arbitrary callbacks before locks are released (for example), we
> could provide a way to be called in all three phases and receive the
> phase. Admittedly, in this case FileClose() is in the same phase as
> the one I'm proposing to put dsm_detach() in, so there is an ordering
> requirement that might require more fine-grained phases. I don't know.
>
> > > I suppose we probably should make the change in 12, though: then owners
> > > of extensions that use DSM detach hooks (if there are any such extensions)
> > > will have a bit of time to get used to the new order during the beta
> > > period. I'll need to find someone to test this with a fault-injection
> > > scenario on Windows before committing it, but I wanted to sound out the
> > > list for any objections to this late change.
> >
> > Since we haven't started beta yet, I don't see a reason not to change
> > it. Worst case is that it causes problems and we revert it.
> >
> > I concur with not back-patching, in any case.
>
> Here's a way to produce an error which might trigger the log message
> on Windows. Does anyone want to try it?

I can give it a try.

> postgres=# create table foo as select generate_series(1, 10000000)::int i;
> SELECT 10000000
> postgres=# set synchronize_seqscans = off;
> SET
> postgres=# create index on foo ((1 / (5000000 - i)));
> psql: ERROR:  division by zero
> postgres=# create index on foo ((1 / (5000000 - i)));
> psql: ERROR:  division by zero
> postgres=# create index on foo ((1 / (5000000 - i)));
> psql: ERROR:  division by zero
> CONTEXT:  parallel worker
>
> (If you don't turn sync scan off, it starts scanning from where it
> left off last time and then fails immediately, which may interfere
> with the experiment if you run it more than once; I'm not sure.)
>
> If it does produce the log message, then the attached patch should
> make it go away.

Are you referring to the log message "LOG:  could not rmdir directory "base/pgsql_tmp/pgsql_tmp3692.0.sharedfileset": Directory not empty"? If so, I am getting it both before and after your patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 6, 2019 at 9:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, May 6, 2019 at 3:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a way to produce an error which might trigger the log message
> > on Windows. Does anyone want to try it?
>
> I can give it a try.

Thanks!

> > If it does produce the log message, then the attached patch should
> > make it go away.
>
> Are you referring to the log message "LOG:  could not rmdir directory
> "base/pgsql_tmp/pgsql_tmp3692.0.sharedfileset": Directory not empty"?

Yes.

> If so, I am getting it both before and after your patch.

Huh. I thought the only problem here was the phenomenon demonstrated by [1]. I'm a bit stumped... if we've closed all the handles in every backend before detaching, and then the last to detach unlinks all the files first and then the directory, how can we get that error?

[1] https://www.postgresql.org/message-id/CAEepm%3D2rH_V5by1kH1Q1HZWPFj%3D4ykjU4JcyoKMNVT6Jh8Q_Rw%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
On Mon, May 6, 2019 at 4:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, May 6, 2019 at 9:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > If so, I am getting it both before and after your patch.
>
> Huh. I thought the only problem here was the phenomenon demonstrated
> by [1]. I'm a bit stumped... if we've closed all the handles in every
> backend before detaching, and then the last to detach unlinks all the
> files first and then the directory, how can we get that error?

Yeah, I am also not sure what caused that, and I have verified it twice to ensure that I have not made any mistake. I can try once again tomorrow after adding some debug messages, but it would be good if someone else could also confirm the behavior.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 2, 2019 at 10:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > A while back I posted a patch[1] to change the order of resowner
> > cleanup so that DSM handles are released last. That's useful for the
> > error cleanup path on Windows, when a SharedFileSet is cleaned up (a
> > mechanism that's used by parallel CREATE INDEX and parallel hash join,
> > for spilling files to disk under a temporary directory, with automatic
> > cleanup).
>
> I guess what I'm wondering is if there are any potential negative
> consequences, i.e. code that won't work if we change the order like this.
> I'm finding it hard to visualize what that would be, but then again
> this failure mode wasn't obvious either.

I have a thought about this. It seems to me that when it comes to backend-private memory, we release it even later: aborting the transaction does nothing, and we do it only later when we clean up the transaction. So I wonder whether we're going to find that we actually want to postpone reclaiming dynamic shared memory for even longer than this change would do. But in general, I think we've already established the principle that releasing memory needs to happen last, because every other resource that you might be using is tracked using data structures that are, uh, stored in memory. Therefore I suspect that this change is going in the right direction.

To put that another way, the issue here is that the removal of the files can't happen after the cleanup of the memory that tells us which files to remove. If we had the corresponding problem for the non-parallel case, it would mean that we were deleting the transaction's memory context before we finished releasing all the resources managed by the transaction's resowner, which would be insane. I believe I put the call to release DSM segments where it is on the theory that "we should release dynamic shared memory as early as possible because freeing memory is good," completely failing to take into account that this was not at all like what we do for backend-private memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
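A toy illustration of that principle (ordinary C, not PostgreSQL code): the bookkeeping that says which resources to release lives in the very memory being reclaimed, so freeing the memory first would leave nothing to drive the rest of the cleanup.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct ResourceList
    {
        int   nfiles;
        FILE *files[8];     /* tracked resources */
    } ResourceList;

    static void
    cleanup(ResourceList *r)
    {
        int i;

        /* Close the tracked files first ... */
        for (i = 0; i < r->nfiles; i++)
            fclose(r->files[i]);
        /* ... and free the tracker itself last; the reverse order
         * would read freed memory. */
        free(r);
    }

    int
    main(void)
    {
        ResourceList *r = calloc(1, sizeof(ResourceList));

        r->files[r->nfiles++] = tmpfile();
        cleanup(r);
        return 0;
    }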
Robert Haas <robertmhaas@gmail.com> writes:
> I have a thought about this. It seems to me that when it comes to
> backend-private memory, we release it even later: aborting the
> transaction does nothing, and we do it only later when we clean up the
> transaction. So I wonder whether we're going to find that we actually
> want to postpone reclaiming dynamic shared memory for even longer than
> this change would do. But in general, I think we've already
> established the principle that releasing memory needs to happen last,
> because every other resource that you might be using is tracked using
> data structures that are, uh, stored in memory. Therefore I suspect
> that this change is going in the right direction.

Hmm. That argument suggests that DSM cleanup shouldn't be part of resowner cleanup at all, but should be handled as a bespoke, late step in transaction cleanup, as memory-context release is.

Not sure if that's going too far or not. It would definitely be a big change in environment for DSM-cleanup hooks.

			regards, tom lane
On Mon, May 6, 2019 at 1:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I have a thought about this. It seems to me that when it comes to
> > backend-private memory, we release it even later: aborting the
> > transaction does nothing, and we do it only later when we clean up the
> > transaction. So I wonder whether we're going to find that we actually
> > want to postpone reclaiming dynamic shared memory for even longer than
> > this change would do. But in general, I think we've already
> > established the principle that releasing memory needs to happen last,
> > because every other resource that you might be using is tracked using
> > data structures that are, uh, stored in memory. Therefore I suspect
> > that this change is going in the right direction.
>
> Hmm. That argument suggests that DSM cleanup shouldn't be part of
> resowner cleanup at all, but should be handled as a bespoke, late
> step in transaction cleanup, as memory-context release is.
>
> Not sure if that's going too far or not. It would definitely be a big
> change in environment for DSM-cleanup hooks.

Right. That's why I favor applying the change to move DSM cleanup to the end for now, and seeing how that goes. It could be that we'll eventually discover that doing it before all of the AtEOXact_BLAH functions have had a shot at doing their thing is still too early, but the only concrete problem that we know about right now can be solved by this much-less-invasive change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Right. That's why I favor applying the change to move DSM cleanup to
> the end for now, and seeing how that goes. It could be that we'll
> eventually discover that doing it before all of the AtEOXact_BLAH
> functions have had a shot at doing their thing is still too early,
> but the only concrete problem that we know about right now can be
> solved by this much-less-invasive change.

But Amit's results say that this *doesn't* fix the problem that we know about. I suspect the reason is exactly that we need to run AtEOXact_Files or the like before closing DSM. But we should get some Windows developer to trace through this and identify the cause for sure before we go designing an invasive fix.

			regards, tom lane
On Mon, May 6, 2019 at 1:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Right. That's why I favor applying the change to move DSM cleanup to
> > the end for now, and seeing how that goes. It could be that we'll
> > eventually discover that doing it before all of the AtEOXact_BLAH
> > functions have had a shot at doing their thing is still too early,
> > but the only concrete problem that we know about right now can be
> > solved by this much-less-invasive change.
>
> But Amit's results say that this *doesn't* fix the problem that we know
> about. I suspect the reason is exactly that we need to run AtEOXact_Files
> or the like before closing DSM. But we should get some Windows developer
> to trace through this and identify the cause for sure before we go
> designing an invasive fix.

Huh, OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 7, 2019 at 6:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 6, 2019 at 1:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > But Amit's results say that this *doesn't* fix the problem that we know
> > about. I suspect the reason is exactly that we need to run AtEOXact_Files
> > or the like before closing DSM. But we should get some Windows developer
> > to trace through this and identify the cause for sure before we go
> > designing an invasive fix.
>
> Huh, OK.

The reason the patch didn't solve the problem is that AtEOXact_Parallel() calls DestroyParallelContext(), so DSM segments that happen to belong to ParallelContext objects are already gone by the time resowner.c gets involved.

--
Thomas Munro
https://enterprisedb.com
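A toy model of that abort-path ordering (plain C, not the actual xact.c source; the function names are stand-ins and the printed strings just narrate the sequence the thread describes):

    #include <stdio.h>

    static void
    AtEOXact_Parallel_model(void)
    {
        /* DestroyParallelContext() detaches the ParallelContext's DSM
         * segment here, firing SharedFileSet's detach callback while
         * backends may still hold files inside the directory open. */
        puts("1: AtEOXact_Parallel -> DestroyParallelContext -> dsm_detach");
    }

    static void
    ResourceOwnerRelease_model(void)
    {
        /* Only now does resowner.c run: FileClose() for remaining files,
         * and dsm_detach() for segments not owned by a ParallelContext,
         * so reordering inside resowner.c comes too late to help. */
        puts("2: ResourceOwnerRelease -> FileClose, then dsm_detach");
    }

    int
    main(void)
    {
        AtEOXact_Parallel_model();
        ResourceOwnerRelease_model();
        return 0;
    }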
On Thu, May 9, 2019 at 10:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The reason the patch didn't solve the problem is that
> AtEOXact_Parallel() calls DestroyParallelContext(), so DSM segments
> that happen to belong to ParallelContext objects are already gone by
> the time resowner.c gets involved.

This was listed as an open item for PostgreSQL 12, but I'm going to move it to "older bugs". I want to fix it, but now that I understand what's wrong, it's a slightly bigger design issue than I'm game to try to fix right now.

This means that 12, like 11, will be capable of leaking empty temporary directories on Windows whenever an error is raised in the middle of parallel CREATE INDEX or multi-batch Parallel Hash Join. The directories are eventually unlinked at restart, and at least the (potentially large) files inside the directory are unlinked on abort. I think we can live with that for a bit longer.

--
Thomas Munro
https://enterprisedb.com