Thread: could not create directory "...": File exists
Greetings,

We've come across this rather annoying error happening during our builds:

    ERROR: could not create directory "pg_tblspc/25120/PG_9.3_201212081/231253": File exists

It turns out that this is coming from copydir() when called by createdb() during a CREATE DATABASE .. FROM TEMPLATE where the template has tables in tablespaces.

It turns out that createdb() currently only takes an AccessShareLock on pg_tablespace when scanning it with SnapshotNow, making it possible for a concurrent process to make some uninteresting modification to a tablespace (such as an ACL change) which will cause the heap scan in createdb() to see a given tablespace multiple times. copydir() will then, rightfully, complain that it's being asked to create a directory which already exists.

Given that this is during createdb(), I'm guessing we don't have any option but to switch the scan to using ShareLock, since there isn't a snapshot available to do an MVCC scan with (I'm guessing that there could be other issues trying to do that anyway).

Attached is a patch which does this and corrects the problem for us (of course, we're now going to go rework our build system to not modify tablespace ACLs, since with this patch concurrent builds end up blocking on each other, which is annoying).

Thanks,

Stephen
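[Editor's note: the attached patch is not inlined in the archive. For context, the scan in question (in src/backend/commands/dbcommands.c) looks roughly like the following; the described fix simply raises AccessShareLock to ShareLock in the heap_open() call. This is a simplified, non-compilable sketch from memory against PostgreSQL's internal APIs of that era, not the literal source.]

```c
/* Simplified sketch of the pg_tablespace scan inside createdb() */
rel = heap_open(TableSpaceRelationId, AccessShareLock);   /* patch: ShareLock */
scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
    Oid dsttablespace = HeapTupleGetOid(tuple);

    /* ... build srcpath/dstpath for this tablespace ... */
    copydir(srcpath, dstpath, false);   /* errors out if dstpath exists */
}
heap_endscan(scan);
heap_close(rel, AccessShareLock);
```

A concurrent ACL change on a tablespace creates a new tuple version; if the SnapshotNow seqscan visits both the old and new versions as committed, copydir() is called twice for the same tablespace OID.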
Stephen Frost <sfrost@snowman.net> writes:
> It turns out that createdb() currently only takes an AccessShareLock
> on pg_tablespace when scanning it with SnapshotNow, making it possible
> for a concurrent process to make some uninteresting modification to a
> tablespace (such as an ACL change) which will cause the heap scan in
> createdb() to see a given tablespace multiple times. copydir() will
> then, rightfully, complain that it's being asked to create a directory
> which already exists.

Ugh. Still another problem with non-MVCC catalog scans.

> Given that this is during createdb(), I'm guessing we don't have any
> option but to switch the scan to using ShareLock, since there isn't a
> snapshot available to do an MVCC scan with (I'm guessing that there
> could be other issues trying to do that anyway).

It seems that the only thing we actually use from each tuple is the OID. So there are other ways to fix it, of which probably the minimum-change one is to keep a list of already-processed tablespace OIDs and skip a tuple if we get a match in the list. This would be O(N^2) in the number of tablespaces, but I rather doubt that's a problem.

[ thinks for a bit... ] Ugh, no, because the *other* risk you've got here is not seeing a row at all, which would be really bad.

I'm not sure that transiently increasing the lock here is enough, either. The concurrent transactions you're worried about probably aren't holding their locks till commit, so you could get this lock and still see tuples with unstable committed-good states.

            regards, tom lane
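[Editor's note: the list-and-skip idea is easy to sketch outside the backend. The function names below (oid_already_seen, dedup_oids) are hypothetical, and the linear membership test is exactly the O(N^2)-over-the-scan behavior mentioned above.]

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;   /* assumption: matches PostgreSQL's 32-bit Oid */

/* Return true if oid is already in seen[0..nseen-1].  O(N) per lookup,
 * O(N^2) over a whole scan -- harmless at realistic tablespace counts. */
static bool
oid_already_seen(const Oid *seen, size_t nseen, Oid oid)
{
    for (size_t i = 0; i < nseen; i++)
        if (seen[i] == oid)
            return true;
    return false;
}

/* Copy the OIDs from in[] to out[] in first-seen order, skipping
 * duplicates; returns the number of unique OIDs written. */
static size_t
dedup_oids(const Oid *in, size_t n, Oid *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++)
        if (!oid_already_seen(out, nout, in[i]))
            out[nout++] = in[i];
    return nout;
}
```

As Tom notes next, this only guards against seeing a tuple twice; it does nothing for the opposite failure of not seeing a row at all.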
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Ugh. Still another problem with non-MVCC catalog scans.

Indeed.

> It seems that the only thing we actually use from each tuple is the OID.

Yes, that's true.

> So there are other ways to fix it, of which probably the minimum-change
> one is to keep a list of already-processed tablespace OIDs and skip a
> tuple if we get a match in the list. This would be O(N^2) in the number
> of tablespaces, but I rather doubt that's a problem.

I did consider this option also, but it felt like a lot of additional code to write and I wasn't entirely convinced it'd be worth it. It's very frustrating that we can't simply get a list of *currently valid* tablespaces with a guarantee that no one else is going to mess with that list while we process it. That's what MVCC would give us, but we can't rely on that yet.

If we agree on keeping a list and skipping anything already seen, I can go ahead and code that up.

> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
> here is not seeing a row at all, which would be really bad.

I'm not sure that I see how that could happen..? I agree that it'd be really bad if it did, though. Or are you thinking of if we were to take a snapshot and then walk the table?

> I'm not sure that transiently increasing the lock here is enough,
> either. The concurrent transactions you're worried about probably
> aren't holding their locks till commit, so you could get this lock
> and still see tuples with unstable committed-good states.

If there are other transactions which have open locks against the table, wouldn't that prevent my being able to acquire ShareLock on it? Or, put another way, how would they not hold their locks till commit or rollback?

We do only look at tablespaces which exist for the database we're copying, as I recall, so any newly created tablespaces shouldn't impact this.

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
>> here is not seeing a row at all, which would be really bad.

> I'm not sure that I see how that could happen..? I agree that it'd be
> really bad if it did though. Or are you thinking if we were to take a
> snapshot and then walk the table?

The case where you see a tuple twice is if an update drops a new version of a row beyond your seqscan, and then commits before you get to the new version. But if it drops the new version of the row *behind* your seqscan, and then commits before you get to the original tuple, you'll not see either row version as committed. Both of these cases are inherent risks in SnapshotNow scans.

>> I'm not sure that transiently increasing the lock here is enough,
>> either. The concurrent transactions you're worried about probably
>> aren't holding their locks till commit, so you could get this lock
>> and still see tuples with unstable committed-good states.

> If there are other transactions which have open locks against the table,
> wouldn't that prevent my being able to acquire ShareLock on it?

What I'm worried about is that we very commonly release locks on catalogs as soon as we're done with the immediate operation --- not only read locks, but update locks as well. So there are lots of cases where locks are released before commit. It's possible that that never happens in operations applied to pg_tablespace, but I'd not want to bet on it.

I wonder whether it's practical to look at the on-disk directories instead of relying on pg_tablespace for this particular purpose.

Or maybe we should just write this off as a case we can't realistically fix before we have MVCC catalog scans. It seems that any other fix is going to be hopelessly ugly.

            regards, tom lane
On 2013-01-17 13:46:44 -0500, Tom Lane wrote:
> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

ISTM we could just use an MVCC snapshot in this specific case?

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> The case where you see a tuple twice is if an update drops a new version
> of a row beyond your seqscan, and then commits before you get to the new
> version. But if it drops the new version of the row *behind* your
> seqscan, and then commits before you get to the original tuple, you'll
> not see either row version as committed. Both of these cases are
> inherent risks in SnapshotNow scans.

Ah, I see.

> > If there are other transactions which have open locks against the table,
> > wouldn't that prevent my being able to acquire ShareLock on it?
>
> What I'm worried about is that we very commonly release locks on
> catalogs as soon as we're done with the immediate operation --- not only
> read locks, but update locks as well. So there are lots of cases where
> locks are released before commit. It's possible that that never happens
> in operations applied to pg_tablespace, but I'd not want to bet on it.

Fair enough.

> I wonder whether it's practical to look at the on-disk directories
> instead of relying on pg_tablespace for this particular purpose.

So we'd traverse all of the tablespace directories and copy any directories we come across which have the OID of the template database..? Sounds pretty reasonable, though I'm not sure that we'd want to back-patch a large change like that.

> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

I feel like we should be able to do better than what we have now, at least. Using ShareLock prevented the specific case that we were experiencing and is therefore MUCH better (for us, anyway) than released versions where we ran into the error on a regular basis.

Thanks,

Stephen
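[Editor's note: the directory-traversal idea can be sketched outside the backend along these lines. find_db_in_tablespaces() is a hypothetical helper; the real implementation would use the backend's file-access routines and error reporting, and the version-directory name below is just the one from the error at the top of the thread.]

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Walk a pg_tblspc-style directory and count the tablespaces that contain
 * a per-database subdirectory for db_oid.  Entries in the real pg_tblspc
 * are symlinks to the tablespace locations; stat() follows them, so the
 * same check works either way.  Assumed layout (per the error message
 * upthread): pg_tblspc/<tblspc_oid>/<version_dir>/<db_oid>.
 */
static int
find_db_in_tablespaces(const char *pg_tblspc, const char *version_dir,
                       unsigned int db_oid)
{
    DIR        *dir = opendir(pg_tblspc);
    struct dirent *de;
    int         found = 0;

    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        char        path[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(path, sizeof(path), "%s/%s/%s/%u",
                 pg_tblspc, de->d_name, version_dir, db_oid);
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
            found++;            /* the real code would copydir() here */
    }
    closedir(dir);
    return found;
}
```

Because the filesystem state is the thing createdb() ultimately copies, enumerating it directly sidesteps the unstable catalog scan entirely, at the cost of a larger behavioral change.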
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Or maybe we should just write this off as a case we can't realistically
>> fix before we have MVCC catalog scans. It seems that any other fix is
>> going to be hopelessly ugly.

> I feel like we should be able to do better than what we have now, at
> least. Using ShareLock prevented the specific case that we were
> experiencing and is therefore MUCH better (for us, anyway) than
> released versions where we ran into the error on a regular basis.

If it actually *reliably* fixed the problem, rather than just reducing the probabilities, that would mean that the updates your other sessions were doing didn't release RowExclusiveLock early. Have you dug into the code to see if that's true? (Or even more to the point, if it's still true in HEAD? I have no idea if all the recent refactoring has changed this behavior for GRANT.)

My thought about a localized fix is similar to Andres' --- maybe we could take a snapshot and use a real MVCC scan.

            regards, tom lane
I wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> I feel like we should be able to do better than what we have now, at
>> least. Using ShareLock prevented the specific case that we were
>> experiencing and is therefore MUCH better (for us, anyway) than
>> released versions where we ran into the error on a regular basis.

> My thought about a localized fix is similar to Andres' --- maybe we
> could take a snapshot and use a real MVCC scan.

BTW, it strikes me that the reason this particular SnapshotNow scan is a big problem for you is that, because we stop and copy a subdirectory after each tuple, the elapsed time for the seqscan is several orders of magnitude more than it is for most (perhaps all) other cases where SnapshotNow is used. So the window for trouble with concurrent updates is that much bigger. This seems to provide a reasonably principled argument why we might want to fix this case with a localized use of an MVCC scan before we have such a fix globally.

Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY kind of annotation. We know we need the SnapshotNow scan fix.

            regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This seems to provide a reasonably principled
> argument why we might want to fix this case with a localized use of an
> MVCC scan before we have such a fix globally.

I had discussed that idea a bit with Andres on IRC and my only concern was if there's some reason that acquiring a snapshot during createdb() would be problematic. It doesn't appear to currently and I wasn't sure if there'd be any issues. I'll start working on a patch for that.

> Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
> kind of annotation. We know we need the SnapshotNow scan fix.

Agreed.

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> This seems to provide a reasonably principled
>> argument why we might want to fix this case with a localized use of an
>> MVCC scan before we have such a fix globally.

> I had discussed that idea a bit with Andres on IRC and my only concern
> was if there's some reason that acquiring a snapshot during createdb()
> would be problematic. It doesn't appear to currently and I wasn't sure
> if there'd be any issues.

Don't see what. The main reason we've not yet attempted a global fix is that the most straightforward way (take a new snapshot each time we start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE is so expensive that the cost of an extra snapshot there ain't gonna matter.

            regards, tom lane
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Don't see what. The main reason we've not yet attempted a global fix is
> that the most straightforward way (take a new snapshot each time we
> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
> is so expensive that the cost of an extra snapshot there ain't gonna
> matter.

Patch attached. Passes the regression tests and our internal testing shows that it eliminates the error we were seeing (and doesn't cause blocking, which is even better).

We have a workaround in place for our build system (more-or-less "don't do that" approach), but it'd really be great if this was back-patched and in the next round of point releases. Glancing through the branches, looks like it should apply pretty cleanly.

Thanks,

Stephen
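[Editor's note: the attachment isn't inlined in the archive. Per the discussion, the change amounts to roughly the following non-compilable sketch against PostgreSQL's internal APIs (GetTransactionSnapshot() is assumed here from context; it is not the literal diff):]

```c
/* Before: scan pg_tablespace with SnapshotNow */
rel = heap_open(TableSpaceRelationId, AccessShareLock);
scan = heap_beginscan(rel, SnapshotNow, 0, NULL);

/* After: register a real MVCC snapshot and scan with it instead */
Snapshot    snapshot = RegisterSnapshot(GetTransactionSnapshot());

rel = heap_open(TableSpaceRelationId, AccessShareLock);
scan = heap_beginscan(rel, snapshot, 0, NULL);
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
    /* ... per-tablespace copydir() work, unchanged ... */
}
heap_endscan(scan);
heap_close(rel, AccessShareLock);
UnregisterSnapshot(snapshot);
```

With a registered MVCC snapshot, each pg_tablespace row is seen exactly once regardless of concurrent updates, so no lock-level change is needed and concurrent ACL changes no longer block or duplicate the scan.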
Stephen Frost <sfrost@snowman.net> writes:
> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

> We have a workaround in place for our build system (more-or-less
> "don't do that" approach), but it'd really be great if this was
> back-patched and in the next round of point releases. Glancing
> through the branches, looks like it should apply pretty cleanly.

It looks like it will work back to 8.4; before that we didn't have RegisterSnapshot. The patch could be adjusted for 8.3 if anyone is sufficiently excited about it, but personally I'm not.

            regards, tom lane
Stephen Frost <sfrost@snowman.net> writes:
> Tom,
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Don't see what. The main reason we've not yet attempted a global fix is
>> that the most straightforward way (take a new snapshot each time we
>> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
>> is so expensive that the cost of an extra snapshot there ain't gonna
>> matter.

> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

I committed this with a couple of changes:

* I used GetLatestSnapshot rather than GetTransactionSnapshot. Since we don't allow these operations inside transaction blocks, there shouldn't be much difference, but in principle GetTransactionSnapshot is the wrong thing; in a serializable transaction it could be quite old.

* After reviewing the other uses of SnapshotNow in dbcommands.c, I decided we'd better make the same type of change in remove_dbtablespaces and check_db_file_conflict, because those are likewise doing filesystem operations on the strength of what they find in pg_tablespace.

I also ended up deciding to back-patch to 8.3 as well.

            regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I committed this with a couple of changes:

Great, thanks!

> * I used GetLatestSnapshot rather than GetTransactionSnapshot. Since
> we don't allow these operations inside transaction blocks, there
> shouldn't be much difference, but in principle GetTransactionSnapshot
> is the wrong thing; in a serializable transaction it could be quite old.

Makes sense.

> * After reviewing the other uses of SnapshotNow in dbcommands.c, I
> decided we'd better make the same type of change in remove_dbtablespaces
> and check_db_file_conflict, because those are likewise doing filesystem
> operations on the strength of what they find in pg_tablespace.

Thanks for that. I had noticed the other places where we were using SnapshotNow, but I hadn't run down if they needed to be changed or not.

> I also ended up deciding to back-patch to 8.3 as well.

Excellent, thanks again.

Stephen