Thread: Missing dependency tracking for TableFunc nodes
I happened to notice that find_expr_references_walker has not
been taught anything about TableFunc nodes, which means it will
miss the type and collation OIDs embedded in such a node.

This can be demonstrated to be a problem by the attached script,
which will end up with a "cache lookup failed for type NNNNN"
error because we allow dropping a type the XMLTABLE construct
references.

This isn't hard to fix, as per the attached patch, but it makes
me nervous.  I wonder what other dependencies we might be missing.

Would it be a good idea to move find_expr_references_walker to
nodeFuncs.c, in hopes of making it more visible to people adding
new node types?  We could decouple it from the specific use-case
of recordDependencyOnExpr by having it call a callback function
for each identified OID; although maybe there's no point in that,
since I'm not sure there are any other use-cases.

Another thought is that maybe the code could be automatically
generated, as Andres has been threatening to do with respect to
the other stuff in backend/nodes/.

In practice, this bug is probably not a huge problem, because a
view that involves a column of type X will likely have some other
dependencies on X.  I had to tweak the example view a bit to get
it to not have any other dependencies on "seg".  So I'm not feeling
that this is a stop-ship problem for today's releases --- I'll
plan on installing the fix after the releases are tagged.

            regards, tom lane

CREATE EXTENSION seg;

DROP TABLE IF EXISTS xmldata CASCADE;

CREATE TABLE xmldata AS SELECT
xml $$
<ROWS>
  <ROW id="1">
    <LOC>0 .. 1</LOC>
    <COUNTRY_NAME>Australia</COUNTRY_NAME>
  </ROW>
  <ROW id="5">
    <LOC>0 .. 4</LOC>
    <COUNTRY_NAME>Japan</COUNTRY_NAME>
  </ROW>
</ROWS>
$$ AS data;

CREATE VIEW xmlview AS
  SELECT xmltable.id, "LOC"::text
  FROM xmldata,
       XMLTABLE('//ROWS/ROW'
                PASSING data
                COLUMNS id int PATH '@id',
                        "COUNTRY_NAME" text,
                        "LOC" seg);

SELECT * FROM xmlview;

DROP EXTENSION seg;

SELECT * FROM xmlview;

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 0358278..d07bb44 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -2256,6 +2256,28 @@ find_expr_references_walker(Node *node,
 							   context->addrs);
 		}
 	}
+	else if (IsA(node, TableFunc))
+	{
+		TableFunc  *tf = (TableFunc *) node;
+		ListCell   *ct;
+
+		/*
+		 * Add refs for the datatypes and collations used in the TableFunc.
+		 */
+		foreach(ct, tf->coltypes)
+		{
+			add_object_address(OCLASS_TYPE, lfirst_oid(ct), 0,
+							   context->addrs);
+		}
+		foreach(ct, tf->colcollations)
+		{
+			Oid			collid = lfirst_oid(ct);
+
+			if (OidIsValid(collid) && collid != DEFAULT_COLLATION_OID)
+				add_object_address(OCLASS_COLLATION, collid, 0,
+								   context->addrs);
+		}
+	}
 	else if (IsA(node, TableSampleClause))
 	{
 		TableSampleClause *tsc = (TableSampleClause *) node;
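For context, both OID lists live directly in the TableFunc node rather than in any sub-expression, which is why a walker that only recurses into expression trees never sees them.  An abridged view of the struct (from the primnodes.h of that era; several fields elided):

	typedef struct TableFunc
	{
		NodeTag		type;
		Node	   *docexpr;		/* input document expression */
		Node	   *rowexpr;		/* row filter expression */
		List	   *colnames;		/* column names (list of String) */
		List	   *coltypes;		/* OID list of column type OIDs */
		List	   *coltypmods;		/* integer list of column typmods */
		List	   *colcollations;	/* OID list of column collation OIDs */
		List	   *colexprs;		/* list of column filter expressions */
		int			location;		/* token location, or -1 if unknown */
	} TableFunc;

coltypes and colcollations are bare OID lists, not node trees, so the generic expression_tree_walker recursion cannot pick them up; they have to be handled explicitly, which is exactly what the patch above adds.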
On 11/11/19 1:41 PM, Tom Lane wrote:
> Would it be a good idea to move find_expr_references_walker to
> nodeFuncs.c, in hopes of making it more visible to people adding
> new node types?

I'm not sure that would be enough.  The logic of that function is not
immediately obvious, and where to add a node to it might not occur to
people.  If the repeated use of

    else if (IsA(node, XXX))

were replaced with

    switch (nodeTag(node)) {
        case XXX:

then the compiler, a la -Wswitch, would alert folks when they forget
to handle a new node type.

--
Mark Dilger
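To illustrate the compiler behavior being invoked here (a standalone sketch, not PostgreSQL code): with -Wswitch, gcc and clang warn about any enumerator left unhandled in a switch that has no default arm, so adding a new tag to the enum flags every switch that hasn't been updated.

	typedef enum DemoNodeTag
	{
		T_DemoVar,
		T_DemoConst,
		T_DemoTableFunc			/* newly added tag */
	} DemoNodeTag;

	static const char *
	demo_tag_name(DemoNodeTag tag)
	{
		switch (tag)
		{
			case T_DemoVar:
				return "Var";
			case T_DemoConst:
				return "Const";
				/* warning: enumeration value 'T_DemoTableFunc'
				 * not handled in switch */
		}
		return "unrecognized";
	}

An else-if chain of IsA() tests, by contrast, gives the compiler nothing to check.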
On 11/11/19 2:33 PM, Mark Dilger wrote:
>
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> Would it be a good idea to move find_expr_references_walker to
>> nodeFuncs.c, in hopes of making it more visible to people adding
>> new node types?
>
> I'm not sure that would be enough.  The logic of that function is not
> immediately obvious, and where to add a node to it might not occur to
> people.  If the repeated use of
>
>     else if (IsA(node, XXX))
>
> were replaced with
>
>     switch (nodeTag(node)) {
>         case XXX:
>
> then the compiler, a la -Wswitch, would alert folks when they forget
> to handle a new node type.

I played with this a bit, making the change I proposed, and got lots of
warnings from the compiler.  I don't know how many of these would need
to be suppressed by adding a no-op for them at the end of the switch vs.
how many need to be handled, but the attached patch implements the idea.
I admit adding all these extra cases to the end is verbose....

The change as written is much too verbose to be acceptable, but given
how many places in the code could really use this sort of treatment, I
wonder if there is a way to reorganize the NodeTag enum into multiple
enums, one for each logical subtype (such as executor nodes, plan nodes,
etc) and then have switches over enums of the given subtype, with the
compiler helping detect tags of the same subtype that are unhandled in
the switch.

I have added enough nodes over the years, and spent enough time tracking
down all the parts of the code that need updating for a new node, to say
that this would be very helpful if we could make it work.  I have not
done the research yet on how many places would be made less elegant by
such a change, though.  I think I'll go look into that a bit....

--
Mark Dilger
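A minimal sketch of the reorganization floated here (hypothetical names; no such split exists in the tree): carve the single NodeTag enum into per-category enums so that -Wswitch can police each category's switches independently.

	/* Hypothetical per-category tag enums replacing one big NodeTag. */
	typedef enum ExprNodeTag
	{
		T_ExprVar,
		T_ExprConst,
		T_ExprFuncExpr
	} ExprNodeTag;

	typedef enum PlanNodeTag
	{
		T_PlanSeqScan,
		T_PlanIndexScan
	} PlanNodeTag;

	/* An expression walker then switches over the narrower enum only: */
	static void
	walk_expr_tag(ExprNodeTag tag)
	{
		switch (tag)
		{
			case T_ExprVar:
			case T_ExprConst:
			case T_ExprFuncExpr:
				break;			/* a forgotten new T_Expr* tag would warn */
		}
	}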
Mark Dilger <hornschnorter@gmail.com> writes:
> I played with this a bit, making the change I proposed, and got lots of
> warnings from the compiler.  I don't know how many of these would need
> to be suppressed by adding a no-op for them at the end of the switch vs.
> how many need to be handled, but the attached patch implements the idea.
> I admit adding all these extra cases to the end is verbose....

Yeah, that's why it's not done that way ...

> The change as written is much too verbose to be acceptable, but given
> how many places in the code could really use this sort of treatment, I
> wonder if there is a way to reorganize the NodeTag enum into multiple
> enums, one for each logical subtype (such as executor nodes, plan nodes,
> etc) and then have switches over enums of the given subtype, with the
> compiler helping detect tags of the same subtype that are unhandled in
> the switch.

The problem here is that the set of nodes of interest can vary depending
on what you're doing.  As a case in point, find_expr_references has to
cover both expression nodes and some things that aren't expression nodes
but can represent dependencies of a plan tree.

I think that the long-term answer, if Andres gets somewhere with his
project to autogenerate code like this, is that we'd rely on annotating
the struct declarations to tell us what to do.  In the case at hand,
I could imagine annotations that say "this field contains a function OID"
or "this list contains collation OIDs" and then the find_expr_references
logic could be derived from that.  Now, that's not perfect either,
because it's always possible to forget to annotate something.  But it'd
be a lot easier, because there'd be tons of nearby examples of doing it
right.

            regards, tom lane
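A sketch of what such annotations might look like (entirely hypothetical syntax; nothing like this existed at the time): a macro that expands to nothing for the C compiler but that a header-scanning code generator could use to emit the walker's body.

	/* Assumes the usual PostgreSQL header context for Expr, Oid, List. */
	#define pg_dep(...)			/* no-op for C; consumed by a generator */

	typedef struct DemoFuncExpr
	{
		Expr		xpr;
		Oid			funcid pg_dep(catalog = pg_proc);	/* record a dep */
		Oid			funcresulttype;		/* unannotated: no dep recorded */
		List	   *args;				/* node tree, walked recursively */
	} DemoFuncExpr;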
Hi,

On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
> I think that the long-term answer, if Andres gets somewhere with his
> project to autogenerate code like this, is that we'd rely on annotating
> the struct declarations to tell us what to do.  In the case at hand,
> I could imagine annotations that say "this field contains a function OID"
> or "this list contains collation OIDs" and then the find_expr_references
> logic could be derived from that.  Now, that's not perfect either,
> because it's always possible to forget to annotate something.  But it'd
> be a lot easier, because there'd be tons of nearby examples of doing it
> right.

Yea, I think that'd be going in the right direction.  I've a few
annotations for other purposes in my local version of the patch (e.g. to
ignore fields for comparison), and adding further ones for purposes like
this ought to be easy.

I want to attach some annotations to types, rather than fields.  I e.g.
introduced a Location typedef, annotated as being ignored for equality
purposes, instead of annotating each 'int location'.  Wonder if we
should also do something like that for your hypothetical "function OID"
etc. above - seems like it also might give the human reader of the code
a hint.

On 2019-11-11 16:41:41 -0500, Tom Lane wrote:
> I happened to notice that find_expr_references_walker has not
> been taught anything about TableFunc nodes, which means it will
> miss the type and collation OIDs embedded in such a node.

> Would it be a good idea to move find_expr_references_walker to
> nodeFuncs.c, in hopes of making it more visible to people adding
> new node types?

Can't hurt, at least.  Reducing the number of files that need to be
touched, fairly mechanically, when adding a node type / node type field
strikes me as a good idea.

Wonder if there's any way to write an assertion check that verifies we
have the necessary dependencies.  But the only idea I have - basically
record all the syscache lookups while parse analysing an expression, and
then check that all the necessary dependencies exist - seems too
complicated to be worthwhile.

> We could decouple it from the specific use-case
> of recordDependencyOnExpr by having it call a callback function
> for each identified OID; although maybe there's no point in that,
> since I'm not sure there are any other use-cases.

I could see it being useful for a few other purposes, e.g. it seems
*marginally* possible we could share *some* code with
extract_query_dependencies().  But I think I'd only go there if we
actually convert something else to it.

Greetings,

Andres Freund
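The type-level variant described here could look roughly like this (again invented names, shown only to make the idea concrete):

	/* Hypothetical: semantics attached once at the typedef, not per field. */
	typedef int Location;		/* annotated: ignored by equal() */
	typedef Oid FunctionOid;	/* annotated: record a pg_proc dependency */

	typedef struct DemoOpExpr
	{
		Expr		xpr;
		FunctionOid	opfuncid;	/* the intent is visible to readers, too */
		List	   *args;
		Location	location;	/* no per-field annotation needed */
	} DemoOpExpr;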
Andres Freund <andres@anarazel.de> writes:
> On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
>> I could imagine annotations that say "this field contains a function OID"
>> or "this list contains collation OIDs" and then the find_expr_references
>> logic could be derived from that.

> I want to attach some annotations to types, rather than fields.  I
> e.g. introduced a Location typedef, annotated as being ignored for
> equality purposes, instead of annotating each 'int location'.  Wonder if
> we should also do something like that for your hypothetical "function
> OID" etc. above - seems like it also might give the human reader of the
> code a hint.

Hm.  We could certainly do "typedef Oid FunctionOid;",
"typedef List CollationOidList;" etc, but I think it'd get pretty
tedious pretty quickly --- just for this one purpose, you'd need
a couple of typedefs for every system catalog that contains
query-referenceable OIDs.  Maybe that's better than comment-style
annotations, but I'm not convinced.

> Wonder if there's any way to write an assertion check that verifies we
> have the necessary dependencies.  But the only idea I have - basically
> record all the syscache lookups while parse analysing an expression, and
> then check that all the necessary dependencies exist - seems too
> complicated to be worthwhile.

Yeah, it's problematic.

One issue there that I'm not sure how to resolve with autogenerated
code, much less automated checking, is that quite a few cases in
find_expr_references know that we don't need to record a dependency on
an OID stored in the node because there's an indirect dependency on
something else.  For example, in FuncExpr we needn't log funcresulttype
because the funcid is enough dependency, and we needn't log either
funccollid or inputcollid because those are derived from the input
expressions or the function result type.  (And giving up those
optimizations would be pretty costly, 4x more dependency checks in this
example alone.)

For sure I don't want both "CollationOid" and "RedundantCollationOid"
typedefs, so it seems like annotation is the solution for this, but
I see no reasonable way to automatically verify such annotations.
Still, just writing down the annotations would be a way to expose such
assumptions for manual checking, which we don't really have now.

            regards, tom lane
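Concretely, the recorded-versus-skipped split described above looks like this for FuncExpr (struct abridged; the comments restate what find_expr_references currently does, they are not an existing annotation syntax):

	typedef struct FuncExpr
	{
		Expr		xpr;
		Oid			funcid;			/* recorded: dependency on pg_proc */
		Oid			funcresulttype; /* skipped: implied by funcid */
		Oid			funccollid;		/* skipped: derived from result type */
		Oid			inputcollid;	/* skipped: derived from input exprs */
		List	   *args;			/* sub-expressions, walked recursively */
	} FuncExpr;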
On 2019-11-12 15:32:14 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
> >> I could imagine annotations that say "this field contains a function OID"
> >> or "this list contains collation OIDs" and then the find_expr_references
> >> logic could be derived from that.
>
> > I want to attach some annotations to types, rather than fields.  I
> > e.g. introduced a Location typedef, annotated as being ignored for
> > equality purposes, instead of annotating each 'int location'.  Wonder if
> > we should also do something like that for your hypothetical "function
> > OID" etc. above - seems like it also might give the human reader of the
> > code a hint.
>
> Hm.  We could certainly do "typedef Oid FunctionOid;",
> "typedef List CollationOidList;" etc, but I think it'd get pretty
> tedious pretty quickly --- just for this one purpose, you'd need
> a couple of typedefs for every system catalog that contains
> query-referenceable OIDs.  Maybe that's better than comment-style
> annotations, but I'm not convinced.

I'm not saying that we should exclusively do so, just that it's
worthwhile for a few frequent cases.

> One issue there that I'm not sure how to resolve with autogenerated
> code, much less automated checking, is that quite a few cases in
> find_expr_references know that we don't need to record a dependency on
> an OID stored in the node because there's an indirect dependency on
> something else.  For example, in FuncExpr we needn't log
> funcresulttype because the funcid is enough dependency, and we needn't
> log either funccollid or inputcollid because those are derived from
> the input expressions or the function result type.  (And giving up
> those optimizations would be pretty costly, 4x more dependency checks
> in this example alone.)

Yea, that one is hard.  I suspect the best way to address that is to
have explicit code for a few cases that are worth optimizing (like
e.g. FuncExpr), and for the rest use the generic logic.  I'd so far
just written the specialized cases into the "generated metadata"-using
code - but we also could have an annotation that instructs to instead
call some function, but I doubt that's worthwhile.

> For sure I don't want both "CollationOid" and "RedundantCollationOid"
> typedefs

Indeed.

> so it seems like annotation is the solution for this

I'm not even sure annotations are going to be the easiest way to
implement some of the more complicated edge cases.  Might be easier to
just open-code those, and fall back to generic logic for the rest.
We'll have to see, I think.

Greetings,

Andres Freund
On 11/11/19 1:41 PM, Tom Lane wrote:
> I happened to notice that find_expr_references_walker has not
> been taught anything about TableFunc nodes, which means it will
> miss the type and collation OIDs embedded in such a node.
>
> This can be demonstrated to be a problem by the attached script,
> which will end up with a "cache lookup failed for type NNNNN"
> error because we allow dropping a type the XMLTABLE construct
> references.
>
> This isn't hard to fix, as per the attached patch, but it makes
> me nervous.  I wonder what other dependencies we might be missing.

I can consistently generate errors like the following in master:

  ERROR:  cache lookup failed for statistics object 31041

This happens in a stress test in which multiple processes are making
changes to the schema.  So far, all the sessions that report this cache
lookup error do so when performing one of ANALYZE, VACUUM ANALYZE,
UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV statistics.
All processes running simultaneously are running the same set of
functions, which create and delete tables, indexes, and statistics
objects, insert, update, and delete rows in those tables, etc.

The fact that the statistics are of the MCV type might not be relevant;
I'm creating those on tables as part of testing Tomas Vondra's MCV
statistics patch, so all the tables have statistics of that kind on
them.

I can try to distill my test case a bit, but first I'd like to know if
you are interested.  Currently, the patch is over 2.2MB, gzip'd.  I'll
only bother distilling it if you don't already know about these cache
lookup failures.

--
Mark Dilger
On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote:
>
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> I happened to notice that find_expr_references_walker has not
>> been taught anything about TableFunc nodes, which means it will
>> miss the type and collation OIDs embedded in such a node.
>>
>> This can be demonstrated to be a problem by the attached script,
>> which will end up with a "cache lookup failed for type NNNNN"
>> error because we allow dropping a type the XMLTABLE construct
>> references.
>>
>> This isn't hard to fix, as per the attached patch, but it makes
>> me nervous.  I wonder what other dependencies we might be missing.
>
> I can consistently generate errors like the following in master:
>
>   ERROR:  cache lookup failed for statistics object 31041
>
> This happens in a stress test in which multiple processes are making
> changes to the schema.  So far, all the sessions that report this
> cache lookup error do so when performing one of ANALYZE, VACUUM
> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV
> statistics.  All processes running simultaneously are running the same
> set of functions, which create and delete tables, indexes, and
> statistics objects, insert, update, and delete rows in those tables,
> etc.
>
> The fact that the statistics are of the MCV type might not be
> relevant; I'm creating those on tables as part of testing Tomas
> Vondra's MCV statistics patch, so all the tables have statistics of
> that kind on them.

Hmmm, I don't know the details of the test, but this seems a bit like
we're trying to use the stats during estimation but they got dropped
meanwhile.  If that's the case, it probably affects all stats types,
not just MCV lists - there should be no significant difference between
different statistics types, I think.

I've managed to reproduce this with a stress-test, and I do get these
failures with both dependencies and mcv stats, although in slightly
different places.

And I think I see the issue - when dropping the statistics, we do
RemoveObjects which however does not acquire any lock on the table.  So
we get the list of stats (without the serialized data), but before we
get to load the contents, someone drops it.  If that's the root cause,
it's there since pg 10.

I'm not sure what's the right solution.  A straightforward option would
be to lock the relation, but will that work after adding support for
stats on joins?  An alternative would be to just ignore those failures,
but that kinda breaks the estimation (we should have picked a different
stats in this case).

> I can try to distill my test case a bit, but first I'd like to know if
> you are interested.  Currently, the patch is over 2.2MB, gzip'd.  I'll
> only bother distilling it if you don't already know about these cache
> lookup failures.

Not sure.  But I do wonder if we see the same issue.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Mark Dilger <hornschnorter@gmail.com> writes:
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> I happened to notice that find_expr_references_walker has not
>> been taught anything about TableFunc nodes, which means it will
>> miss the type and collation OIDs embedded in such a node.

> I can consistently generate errors like the following in master:
>   ERROR:  cache lookup failed for statistics object 31041

This is surely a completely different issue --- there are not, one
hopes, any extended-stats OIDs embedded in views or other query trees.

I concur with Tomas' suspicion that this must be a race condition
during DROP STATISTICS.  If we're going to allow people to do that
separately from dropping the table(s), there has to be some kind of
locking around it, and it sounds like there's not :-(

            regards, tom lane
On 11/13/19 4:46 PM, Tomas Vondra wrote:
> On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote:
>>
>> On 11/11/19 1:41 PM, Tom Lane wrote:
>>> I happened to notice that find_expr_references_walker has not
>>> been taught anything about TableFunc nodes, which means it will
>>> miss the type and collation OIDs embedded in such a node.
>>>
>>> This can be demonstrated to be a problem by the attached script,
>>> which will end up with a "cache lookup failed for type NNNNN"
>>> error because we allow dropping a type the XMLTABLE construct
>>> references.
>>>
>>> This isn't hard to fix, as per the attached patch, but it makes
>>> me nervous.  I wonder what other dependencies we might be missing.
>>
>> I can consistently generate errors like the following in master:
>>
>>   ERROR:  cache lookup failed for statistics object 31041
>>
>> This happens in a stress test in which multiple processes are making
>> changes to the schema.  So far, all the sessions that report this
>> cache lookup error do so when performing one of ANALYZE, VACUUM
>> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV
>> statistics.  All processes running simultaneously are running the
>> same set of functions, which create and delete tables, indexes, and
>> statistics objects, insert, update, and delete rows in those tables,
>> etc.
>>
>> The fact that the statistics are of the MCV type might not be
>> relevant; I'm creating those on tables as part of testing Tomas
>> Vondra's MCV statistics patch, so all the tables have statistics of
>> that kind on them.
>
> Hmmm, I don't know the details of the test, but this seems a bit like
> we're trying to use the stats during estimation but they got dropped
> meanwhile.  If that's the case, it probably affects all stats types,
> not just MCV lists - there should be no significant difference between
> different statistics types, I think.
>
> I've managed to reproduce this with a stress-test, and I do get these
> failures with both dependencies and mcv stats, although in slightly
> different places.
>
> And I think I see the issue - when dropping the statistics, we do
> RemoveObjects which however does not acquire any lock on the table.
> So we get the list of stats (without the serialized data), but before
> we get to load the contents, someone drops it.  If that's the root
> cause, it's there since pg 10.
>
> I'm not sure what's the right solution.  A straightforward option
> would be to lock the relation, but will that work after adding support
> for stats on joins?  An alternative would be to just ignore those
> failures, but that kinda breaks the estimation (we should have picked
> a different stats in this case).
>
>> I can try to distill my test case a bit, but first I'd like to know
>> if you are interested.  Currently, the patch is over 2.2MB, gzip'd.
>> I'll only bother distilling it if you don't already know about these
>> cache lookup failures.
>
> Not sure.  But I do wonder if we see the same issue.

I don't know.  If you want to reproduce what I'm seeing....

I added a parallel_schedule target:

diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fc0f14122b..5ace7c7a8a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -85,6 +85,8 @@ test: create_table_like alter_generic alter_operator misc async dbsize misc_func
 # collate.*.utf8 tests cannot be run in parallel with each other
 test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8
 
+test: mcv_huge_stress_a mcv_huge_stress_b mcv_huge_stress_c mcv_huge_stress_d mcv_huge_stress_e mcv_huge_stress_f mcv_huge_stress_g
+
 # run by itself so it can run parallel workers
 test: select_parallel
 test: write_parallel

And used the attached script to generate the contents of the seven
parallel tests.  If you want to duplicate this, you'll have to manually
run gen.pl and direct its output to those src/test/regress/sql/ files.
The src/test/regress/expected/ files are just empty, as I don't care
about whether the test results match.  I'm just checking what kinds of
errors I get and whether any of them are concerning.

After my most recent run of the stress tests, I grep'd for cache
failures and got 23 of them, all coming from get_relation_statistics(),
statext_store() and statext_mcv_load().  Two different adjacent spots
in get_relation_statistics() were involved:

    htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
    if (!HeapTupleIsValid(htup))
        elog(ERROR, "cache lookup failed for statistics object %u",
             statOid);
    staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);

    dtup = SearchSysCache1(STATEXTDATASTXOID, ObjectIdGetDatum(statOid));
    if (!HeapTupleIsValid(dtup))
        elog(ERROR, "cache lookup failed for statistics object %u",
             statOid);

Most were from the first SearchSysCache1 call, but one of them was from
the second.

--
Mark Dilger
On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
> Mark Dilger <hornschnorter@gmail.com> writes:
>> On 11/11/19 1:41 PM, Tom Lane wrote:
>>> I happened to notice that find_expr_references_walker has not
>>> been taught anything about TableFunc nodes, which means it will
>>> miss the type and collation OIDs embedded in such a node.
>
>> I can consistently generate errors like the following in master:
>>   ERROR:  cache lookup failed for statistics object 31041
>
> This is surely a completely different issue --- there are not,
> one hopes, any extended-stats OIDs embedded in views or other
> query trees.
>
> I concur with Tomas' suspicion that this must be a race condition
> during DROP STATISTICS.  If we're going to allow people to do that
> separately from dropping the table(s), there has to be some kind of
> locking around it, and it sounds like there's not :-(

I think the right thing to do is simply acquire AE lock on the relation
in RemoveStatisticsById, per the attached patch.  It's possible we'll
need to do something more complicated once join stats are added, but
for now this should be enough (and backpatchable).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
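The attachment is not preserved here, but the change described plausibly amounts to a few lines in RemoveStatisticsById() along these lines (a sketch under that assumption, not the actual patch):

	/* ... after fetching the pg_statistic_ext tuple for statsOid ... */
	statext = (Form_pg_statistic_ext) GETSTRUCT(tup);
	relid = statext->stxrelid;

	/*
	 * New: take AccessExclusiveLock on the table owning the statistics
	 * object, so that dropping the object serializes against ANALYZE and
	 * against planning that may be about to load the stats contents.
	 */
	LockRelationOid(relid, AccessExclusiveLock);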
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
>> I concur with Tomas' suspicion that this must be a race condition
>> during DROP STATISTICS.  If we're going to allow people to do that
>> separately from dropping the table(s), there has to be some kind of
>> locking around it, and it sounds like there's not :-(

> I think the right thing to do is simply acquire AE lock on the relation
> in RemoveStatisticsById, per the attached patch.  It's possible we'll
> need to do something more complicated once join stats are added, but
> for now this should be enough (and backpatchable).

Hm.  No, it's not enough, unless you add more logic to deal with the
possibility that the stats object is gone by the time you have the
table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
or a statistics drop that's cascading from something like a drop of a
relevant function and so has no earlier table lock.

            regards, tom lane
On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
>>> I concur with Tomas' suspicion that this must be a race condition
>>> during DROP STATISTICS.  If we're going to allow people to do that
>>> separately from dropping the table(s), there has to be some kind of
>>> locking around it, and it sounds like there's not :-(
>
>> I think the right thing to do is simply acquire AE lock on the relation
>> in RemoveStatisticsById, per the attached patch.  It's possible we'll
>> need to do something more complicated once join stats are added, but
>> for now this should be enough (and backpatchable).
>
> Hm.  No, it's not enough, unless you add more logic to deal with the
> possibility that the stats object is gone by the time you have the
> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
> or a statistics drop that's cascading from something like a drop
> of a relevant function and so has no earlier table lock.

Isn't that solved by RemoveObjects() doing this?

    /* Get an ObjectAddress for the object. */
    address = get_object_address(stmt->removeType,
                                 object,
                                 &relation,
                                 AccessExclusiveLock,
                                 stmt->missing_ok);

I've actually done some debugging before sending the patch, and I think
this prevents the issue you describe.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Nov-14, Tomas Vondra wrote:

> Isn't that solved by RemoveObjects() doing this?
>
>     /* Get an ObjectAddress for the object. */
>     address = get_object_address(stmt->removeType,
>                                  object,
>                                  &relation,
>                                  AccessExclusiveLock,
>                                  stmt->missing_ok);
>
> I've actually done some debugging before sending the patch, and I think
> this prevents the issue you describe.

Hmm .. shouldn't get_statistics_object_oid get a lock on the table that
owns the stats object too?  I think it should be setting *relp to it.
That way, the lock you're proposing to add would be obtained there.
That means it'd be similar to what we do for OBJECT_TRIGGER etc,
get_object_address_relobject().

I admit this'd crash and burn if we had stats on multiple relations,
because there'd be no way to return the multiple relations that would
end up locked.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
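A sketch of that suggestion (helper names here are invented; the real get_statistics_object_oid and get_object_address_relobject differ in detail): resolve the stats object, then lock and hand back the owning table through *relp so the generic RemoveObjects() path ends up holding the table lock.

	static ObjectAddress
	get_statistics_object_address(List *object, Relation *relp,
								  bool missing_ok)
	{
		ObjectAddress address;
		Oid			statOid;
		Oid			tableOid;

		/* hypothetical helpers standing in for real name resolution */
		statOid = lookup_statistics_object(object, missing_ok);
		tableOid = get_statistics_object_relid(statOid);

		/* lock and return the owning table, as the trigger path does */
		*relp = table_open(tableOid, AccessExclusiveLock);

		ObjectAddressSet(address, StatisticExtRelationId, statOid);
		return address;
	}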
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
>> Hm.  No, it's not enough, unless you add more logic to deal with the
>> possibility that the stats object is gone by the time you have the
>> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
>> or a statistics drop that's cascading from something like a drop
>> of a relevant function and so has no earlier table lock.

> Isn't that solved by RemoveObjects() doing this?

>     /* Get an ObjectAddress for the object. */
>     address = get_object_address(stmt->removeType,
>                                  object,
>                                  &relation,
>                                  AccessExclusiveLock,
>                                  stmt->missing_ok);

Ah, I see, we already have AEL on the stats object itself.  So that
eliminates my concern about a race between two RemoveStatisticsById
calls, but what we have instead is fear of deadlock.  A DROP STATISTICS
command will acquire AEL on the stats object but then AEL on the table,
the opposite of what will happen during DROP TABLE, so concurrent
executions of those will deadlock.  That might be better than the
failures Mark is seeing now, but not by much.

A correct fix I think is that the planner ought to acquire
AccessShareLock on a stats object it's trying to use (and then recheck
whether the object is still there).  That seems rather expensive, but
there may be no other way.

            regards, tom lane
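To spell out both halves of this (illustrative only, not committed code): the deadlock comes from opposite lock ordering, and the suggested fix is a lock-then-recheck inside the planner's loop over a table's statistics objects.

	/*
	 * Ordering hazard:
	 *   DROP STATISTICS:  AEL on stats object, then AEL on table
	 *   DROP TABLE:       AEL on table, then AEL on stats object
	 * Run concurrently, each session can block on the other's first lock.
	 *
	 * Sketch of the planner-side fix, for each candidate statOid:
	 */
	LockDatabaseObject(StatisticExtRelationId, statOid, 0, AccessShareLock);

	htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
	if (!HeapTupleIsValid(htup))
	{
		/* dropped while we waited for the lock: skip it, don't error */
		UnlockDatabaseObject(StatisticExtRelationId, statOid, 0,
							 AccessShareLock);
		continue;
	}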
On Thu, Nov 14, 2019 at 07:27:29PM -0300, Alvaro Herrera wrote:
> On 2019-Nov-14, Tomas Vondra wrote:
>
>> Isn't that solved by RemoveObjects() doing this?
>>
>>     /* Get an ObjectAddress for the object. */
>>     address = get_object_address(stmt->removeType,
>>                                  object,
>>                                  &relation,
>>                                  AccessExclusiveLock,
>>                                  stmt->missing_ok);
>>
>> I've actually done some debugging before sending the patch, and I think
>> this prevents the issue you describe.
>
> Hmm .. shouldn't get_statistics_object_oid get a lock on the table that
> owns the stats object too?  I think it should be setting *relp to it.
> That way, the lock you're proposing to add would be obtained there.
> That means it'd be similar to what we do for OBJECT_TRIGGER etc,
> get_object_address_relobject().

Hmmm, maybe.  We'd have to fake the list of names, because that function
expects the relation name to be included in the list of names, and we
don't have that for extended stats.  But it might work, I guess.

> I admit this'd crash and burn if we had stats on multiple relations,
> because there'd be no way to return the multiple relations that would
> end up locked.

I think that's less important now.  If we ever get that feature, we'll
need to make that work somehow.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 14, 2019 at 05:35:06PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
>>> Hm.  No, it's not enough, unless you add more logic to deal with the
>>> possibility that the stats object is gone by the time you have the
>>> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
>>> or a statistics drop that's cascading from something like a drop
>>> of a relevant function and so has no earlier table lock.
>
>> Isn't that solved by RemoveObjects() doing this?
>
>>     /* Get an ObjectAddress for the object. */
>>     address = get_object_address(stmt->removeType,
>>                                  object,
>>                                  &relation,
>>                                  AccessExclusiveLock,
>>                                  stmt->missing_ok);
>
> Ah, I see, we already have AEL on the stats object itself.  So that
> eliminates my concern about a race between two RemoveStatisticsById
> calls, but what we have instead is fear of deadlock.  A DROP STATISTICS
> command will acquire AEL on the stats object but then AEL on the table,
> the opposite of what will happen during DROP TABLE, so concurrent
> executions of those will deadlock.  That might be better than the
> failures Mark is seeing now, but not by much.

Hmmm, yeah :-(

> A correct fix I think is that the planner ought to acquire
> AccessShareLock on a stats object it's trying to use (and then recheck
> whether the object is still there).  That seems rather expensive, but
> there may be no other way.

Yes, something like what we do for indexes, although we don't need the
recheck in that case.  I think the attached patch does that (but it's
1AM here).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services