Thread: Missing dependency tracking for TableFunc nodes
I happened to notice that find_expr_references_walker has not
been taught anything about TableFunc nodes, which means it will
miss the type and collation OIDs embedded in such a node.

This can be demonstrated to be a problem by the attached script,
which will end up with a "cache lookup failed for type NNNNN"
error because we allow dropping a type the XMLTABLE construct
references.

This isn't hard to fix, as per the attached patch, but it makes
me nervous.  I wonder what other dependencies we might be missing.

Would it be a good idea to move find_expr_references_walker to
nodeFuncs.c, in hopes of making it more visible to people adding
new node types?  We could decouple it from the specific use-case
of recordDependencyOnExpr by having it call a callback function
for each identified OID; although maybe there's no point in that,
since I'm not sure there are any other use-cases.

Another thought is that maybe the code could be automatically
generated, as Andres has been threatening to do with respect to
the other stuff in backend/nodes/.

In practice, this bug is probably not a huge problem, because a
view that involves a column of type X will likely have some other
dependencies on X.  I had to tweak the example view a bit to get
it to not have any other dependencies on "seg".  So I'm not feeling
that this is a stop-ship problem for today's releases --- I'll
plan on installing the fix after the releases are tagged.

            regards, tom lane

CREATE EXTENSION seg;

DROP TABLE IF EXISTS xmldata CASCADE;

CREATE TABLE xmldata AS SELECT
xml $$
<ROWS>
  <ROW id="1">
    <LOC>0 .. 1</LOC>
    <COUNTRY_NAME>Australia</COUNTRY_NAME>
  </ROW>
  <ROW id="5">
    <LOC>0 .. 4</LOC>
    <COUNTRY_NAME>Japan</COUNTRY_NAME>
  </ROW>
</ROWS>
$$ AS data;

CREATE VIEW xmlview AS
  SELECT xmltable.id, "LOC"::text
  FROM xmldata,
       XMLTABLE('//ROWS/ROW'
                PASSING data
                COLUMNS id int PATH '@id',
                        "COUNTRY_NAME" text,
                        "LOC" seg);

SELECT * FROM xmlview;

DROP EXTENSION seg;

SELECT * FROM xmlview;

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 0358278..d07bb44 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -2256,6 +2256,28 @@ find_expr_references_walker(Node *node,
 							   context->addrs);
 		}
 	}
+	else if (IsA(node, TableFunc))
+	{
+		TableFunc  *tf = (TableFunc *) node;
+		ListCell   *ct;
+
+		/*
+		 * Add refs for the datatypes and collations used in the TableFunc.
+		 */
+		foreach(ct, tf->coltypes)
+		{
+			add_object_address(OCLASS_TYPE, lfirst_oid(ct), 0,
+							   context->addrs);
+		}
+		foreach(ct, tf->colcollations)
+		{
+			Oid			collid = lfirst_oid(ct);
+
+			if (OidIsValid(collid) && collid != DEFAULT_COLLATION_OID)
+				add_object_address(OCLASS_COLLATION, collid, 0,
+								   context->addrs);
+		}
+	}
 	else if (IsA(node, TableSampleClause))
 	{
 		TableSampleClause *tsc = (TableSampleClause *) node;
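For context, both OID lists live directly in the TableFunc node rather than in any sub-expression, which is why a walker that only recurses into expression trees never sees them.  An abridged view of the struct (from the primnodes.h of that era; several fields elided):

	typedef struct TableFunc
	{
		NodeTag		type;
		Node	   *docexpr;		/* input document expression */
		Node	   *rowexpr;		/* row filter expression */
		List	   *colnames;		/* column names (list of String) */
		List	   *coltypes;		/* OID list of column type OIDs */
		List	   *coltypmods;		/* integer list of column typmods */
		List	   *colcollations;	/* OID list of column collation OIDs */
		List	   *colexprs;		/* list of column filter expressions */
		int			location;		/* token location, or -1 if unknown */
	} TableFunc;

coltypes and colcollations are bare OID lists, not node trees, so the generic expression_tree_walker recursion cannot pick them up; they have to be handled explicitly, which is exactly what the patch above adds.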
On 11/11/19 1:41 PM, Tom Lane wrote:
> Would it be a good idea to move find_expr_references_walker to
> nodeFuncs.c, in hopes of making it more visible to people adding
> new node types?

I'm not sure that would be enough.  The logic of that function is not
immediately obvious, and where to add a node to it might not occur to
people.  If the repeated use of

    else if (IsA(node, XXX))

were replaced with

    switch (nodeTag(node)) {
        case XXX:

then the compiler, a la -Wswitch, would alert folks when they forget
to handle a new node type.

--
Mark Dilger
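To illustrate the compiler behavior being invoked here (a standalone sketch, not PostgreSQL code): with -Wswitch, gcc and clang warn about any enumerator left unhandled in a switch that has no default arm, so adding a new tag to the enum flags every switch that hasn't been updated.

	typedef enum DemoNodeTag
	{
		T_DemoVar,
		T_DemoConst,
		T_DemoTableFunc			/* newly added tag */
	} DemoNodeTag;

	static const char *
	demo_tag_name(DemoNodeTag tag)
	{
		switch (tag)
		{
			case T_DemoVar:
				return "Var";
			case T_DemoConst:
				return "Const";
				/* warning: enumeration value 'T_DemoTableFunc'
				 * not handled in switch */
		}
		return "unrecognized";
	}

An else-if chain of IsA() tests, by contrast, gives the compiler nothing to check.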
On 11/11/19 2:33 PM, Mark Dilger wrote:
>
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> Would it be a good idea to move find_expr_references_walker to
>> nodeFuncs.c, in hopes of making it more visible to people adding
>> new node types?
>
> I'm not sure that would be enough.  The logic of that function is not
> immediately obvious, and where to add a node to it might not occur to
> people.  If the repeated use of
>
>     else if (IsA(node, XXX))
>
> were replaced with
>
>     switch (nodeTag(node)) {
>         case XXX:
>
> then the compiler, a la -Wswitch, would alert folks when they forget
> to handle a new node type.

I played with this a bit, making the change I proposed, and got lots of
warnings from the compiler.  I don't know how many of these would need
to be suppressed by adding a no-op for them at the end of the switch vs.
how many need to be handled, but the attached patch implements the idea.
I admit adding all these extra cases to the end is verbose....

The change as written is much too verbose to be acceptable, but given
how many places in the code could really use this sort of treatment, I
wonder if there is a way to reorganize the NodeTag enum into multiple
enums, one for each logical subtype (such as executor nodes, plan nodes,
etc) and then have switches over enums of the given subtype, with the
compiler helping detect tags of the same subtype that are unhandled in
the switch.

I have added enough nodes over the years, and spent enough time tracking
down all the parts of the code that need updating for a new node, to say
that this would be very helpful if we could make it work.  I have not
done the research yet on how many places would be made less elegant by
such a change, though.  I think I'll go look into that a bit....

--
Mark Dilger
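A minimal sketch of the reorganization floated here (hypothetical names; no such split exists in the tree): carve the single NodeTag enum into per-category enums so that -Wswitch can police each category's switches independently.

	/* Hypothetical per-category tag enums replacing one big NodeTag. */
	typedef enum ExprNodeTag
	{
		T_ExprVar,
		T_ExprConst,
		T_ExprFuncExpr
	} ExprNodeTag;

	typedef enum PlanNodeTag
	{
		T_PlanSeqScan,
		T_PlanIndexScan
	} PlanNodeTag;

	/* An expression walker then switches over the narrower enum only: */
	static void
	walk_expr_tag(ExprNodeTag tag)
	{
		switch (tag)
		{
			case T_ExprVar:
			case T_ExprConst:
			case T_ExprFuncExpr:
				break;			/* a forgotten new T_Expr* tag would warn */
		}
	}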
Mark Dilger <hornschnorter@gmail.com> writes:
> I played with this a bit, making the change I proposed, and got lots of
> warnings from the compiler.  I don't know how many of these would need
> to be suppressed by adding a no-op for them at the end of the switch vs.
> how many need to be handled, but the attached patch implements the idea.
> I admit adding all these extra cases to the end is verbose....

Yeah, that's why it's not done that way ...

> The change as written is much too verbose to be acceptable, but given
> how many places in the code could really use this sort of treatment, I
> wonder if there is a way to reorganize the NodeTag enum into multiple
> enums, one for each logical subtype (such as executor nodes, plan nodes,
> etc) and then have switches over enums of the given subtype, with the
> compiler helping detect tags of the same subtype that are unhandled in
> the switch.

The problem here is that the set of nodes of interest can vary depending
on what you're doing.  As a case in point, find_expr_references has to
cover both expression nodes and some things that aren't expression nodes
but can represent dependencies of a plan tree.

I think that the long-term answer, if Andres gets somewhere with his
project to autogenerate code like this, is that we'd rely on annotating
the struct declarations to tell us what to do.  In the case at hand,
I could imagine annotations that say "this field contains a function OID"
or "this list contains collation OIDs" and then the find_expr_references
logic could be derived from that.  Now, that's not perfect either,
because it's always possible to forget to annotate something.  But it'd
be a lot easier, because there'd be tons of nearby examples of doing it
right.

            regards, tom lane
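A sketch of what such annotations might look like (entirely hypothetical syntax; nothing like this existed at the time): a macro that expands to nothing for the C compiler but that a header-scanning code generator could use to emit the walker's body.

	/* Assumes the usual PostgreSQL header context for Expr, Oid, List. */
	#define pg_dep(...)			/* no-op for C; consumed by a generator */

	typedef struct DemoFuncExpr
	{
		Expr		xpr;
		Oid			funcid pg_dep(catalog = pg_proc);	/* record a dep */
		Oid			funcresulttype;		/* unannotated: no dep recorded */
		List	   *args;				/* node tree, walked recursively */
	} DemoFuncExpr;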
Hi,

On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
> I think that the long-term answer, if Andres gets somewhere with his
> project to autogenerate code like this, is that we'd rely on annotating
> the struct declarations to tell us what to do.  In the case at hand,
> I could imagine annotations that say "this field contains a function OID"
> or "this list contains collation OIDs" and then the find_expr_references
> logic could be derived from that.  Now, that's not perfect either,
> because it's always possible to forget to annotate something.  But it'd
> be a lot easier, because there'd be tons of nearby examples of doing it
> right.

Yea, I think that'd be going in the right direction.  I've a few
annotations for other purposes in my local version of the patch (e.g. to
ignore fields for comparison), and adding further ones for purposes like
this ought to be easy.

I want to attach some annotations to types, rather than fields.  I e.g.
introduced a Location typedef, annotated as being ignored for equality
purposes, instead of annotating each 'int location'.  Wonder if we
should also do something like that for your hypothetical "function OID"
etc. above - seems like it also might give the human reader of the code
a hint.

On 2019-11-11 16:41:41 -0500, Tom Lane wrote:
> I happened to notice that find_expr_references_walker has not
> been taught anything about TableFunc nodes, which means it will
> miss the type and collation OIDs embedded in such a node.

> Would it be a good idea to move find_expr_references_walker to
> nodeFuncs.c, in hopes of making it more visible to people adding
> new node types?

Can't hurt, at least.  Reducing the number of files that need to be
touched, fairly mechanically, when adding a node type / node type field
strikes me as a good idea.

Wonder if there's any way to write an assertion check that verifies we
have the necessary dependencies.  But the only idea I have - basically
record all the syscache lookups while parse analysing an expression, and
then check that all the necessary dependencies exist - seems too
complicated to be worthwhile.

> We could decouple it from the specific use-case
> of recordDependencyOnExpr by having it call a callback function
> for each identified OID; although maybe there's no point in that,
> since I'm not sure there are any other use-cases.

I could see it being useful for a few other purposes, e.g. it seems
*marginally* possible we could share *some* code with
extract_query_dependencies().  But I think I'd only go there if we
actually convert something else to it.

Greetings,

Andres Freund
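The type-level variant described here could look roughly like this (again invented names, shown only to make the idea concrete):

	/* Hypothetical: semantics attached once at the typedef, not per field. */
	typedef int Location;		/* annotated: ignored by equal() */
	typedef Oid FunctionOid;	/* annotated: record a pg_proc dependency */

	typedef struct DemoOpExpr
	{
		Expr		xpr;
		FunctionOid	opfuncid;	/* the intent is visible to readers, too */
		List	   *args;
		Location	location;	/* no per-field annotation needed */
	} DemoOpExpr;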
Andres Freund <andres@anarazel.de> writes:
> On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
>> I could imagine annotations that say "this field contains a function OID"
>> or "this list contains collation OIDs" and then the find_expr_references
>> logic could be derived from that.

> I want to attach some annotations to types, rather than fields.  I
> e.g. introduced a Location typedef, annotated as being ignored for
> equality purposes, instead of annotating each 'int location'.  Wonder if
> we should also do something like that for your hypothetical "function
> OID" etc. above - seems like it also might give the human reader of the
> code a hint.

Hm.  We could certainly do "typedef Oid FunctionOid;",
"typedef List CollationOidList;" etc, but I think it'd get pretty
tedious pretty quickly --- just for this one purpose, you'd need
a couple of typedefs for every system catalog that contains
query-referenceable OIDs.  Maybe that's better than comment-style
annotations, but I'm not convinced.

> Wonder if there's any way to write an assertion check that verifies we
> have the necessary dependencies.  But the only idea I have - basically
> record all the syscache lookups while parse analysing an expression, and
> then check that all the necessary dependencies exist - seems too
> complicated to be worthwhile.

Yeah, it's problematic.

One issue there that I'm not sure how to resolve with autogenerated
code, much less automated checking, is that quite a few cases in
find_expr_references know that we don't need to record a dependency on
an OID stored in the node because there's an indirect dependency on
something else.  For example, in FuncExpr we needn't log funcresulttype
because the funcid is enough dependency, and we needn't log either
funccollid or inputcollid because those are derived from the input
expressions or the function result type.  (And giving up those
optimizations would be pretty costly, 4x more dependency checks in this
example alone.)

For sure I don't want both "CollationOid" and "RedundantCollationOid"
typedefs, so it seems like annotation is the solution for this, but
I see no reasonable way to automatically verify such annotations.
Still, just writing down the annotations would be a way to expose such
assumptions for manual checking, which we don't really have now.

            regards, tom lane
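Concretely, the recorded-versus-skipped split described above looks like this for FuncExpr (struct abridged; the comments restate what find_expr_references currently does, they are not an existing annotation syntax):

	typedef struct FuncExpr
	{
		Expr		xpr;
		Oid			funcid;			/* recorded: dependency on pg_proc */
		Oid			funcresulttype; /* skipped: implied by funcid */
		Oid			funccollid;		/* skipped: derived from result type */
		Oid			inputcollid;	/* skipped: derived from input exprs */
		List	   *args;			/* sub-expressions, walked recursively */
	} FuncExpr;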
On 2019-11-12 15:32:14 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-11-12 10:19:30 -0500, Tom Lane wrote:
> >> I could imagine annotations that say "this field contains a function OID"
> >> or "this list contains collation OIDs" and then the find_expr_references
> >> logic could be derived from that.
>
> > I want to attach some annotations to types, rather than fields.  I
> > e.g. introduced a Location typedef, annotated as being ignored for
> > equality purposes, instead of annotating each 'int location'.  Wonder if
> > we should also do something like that for your hypothetical "function
> > OID" etc. above - seems like it also might give the human reader of the
> > code a hint.
>
> Hm.  We could certainly do "typedef Oid FunctionOid;",
> "typedef List CollationOidList;" etc, but I think it'd get pretty
> tedious pretty quickly --- just for this one purpose, you'd need
> a couple of typedefs for every system catalog that contains
> query-referenceable OIDs.  Maybe that's better than comment-style
> annotations, but I'm not convinced.

I'm not saying that we should exclusively do so, just that it's
worthwhile for a few frequent cases.

> One issue there that I'm not sure how to resolve with autogenerated
> code, much less automated checking, is that quite a few cases in
> find_expr_references know that we don't need to record a dependency on
> an OID stored in the node because there's an indirect dependency on
> something else.  For example, in FuncExpr we needn't log
> funcresulttype because the funcid is enough dependency, and we needn't
> log either funccollid or inputcollid because those are derived from
> the input expressions or the function result type.  (And giving up
> those optimizations would be pretty costly, 4x more dependency checks
> in this example alone.)

Yea, that one is hard.  I suspect the best way to address that is to
have explicit code for a few cases that are worth optimizing (like
e.g. FuncExpr), and for the rest use the generic logic.  I'd so far
just written the specialized cases into the "generated metadata"-using
code - but we also could have an annotation that instructs to instead
call some function, but I doubt that's worthwhile.

> For sure I don't want both "CollationOid" and "RedundantCollationOid"
> typedefs

Indeed.

> so it seems like annotation is the solution for this

I'm not even sure annotations are going to be the easiest way to
implement some of the more complicated edge cases.  Might be easier to
just open-code those, and fall back to generic logic for the rest.
We'll have to see, I think.

Greetings,

Andres Freund
On 11/11/19 1:41 PM, Tom Lane wrote:
> I happened to notice that find_expr_references_walker has not
> been taught anything about TableFunc nodes, which means it will
> miss the type and collation OIDs embedded in such a node.
>
> This can be demonstrated to be a problem by the attached script,
> which will end up with a "cache lookup failed for type NNNNN"
> error because we allow dropping a type the XMLTABLE construct
> references.
>
> This isn't hard to fix, as per the attached patch, but it makes
> me nervous.  I wonder what other dependencies we might be missing.

I can consistently generate errors like the following in master:

  ERROR:  cache lookup failed for statistics object 31041

This happens in a stress test in which multiple processes are making
changes to the schema.  So far, all the sessions that report this cache
lookup error do so when performing one of ANALYZE, VACUUM ANALYZE,
UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV statistics.
All processes running simultaneously are running the same set of
functions, which create and delete tables, indexes, and statistics
objects, insert, update, and delete rows in those tables, etc.

The fact that the statistics are of the MCV type might not be relevant;
I'm creating those on tables as part of testing Tomas Vondra's MCV
statistics patch, so all the tables have statistics of that kind on
them.

I can try to distill my test case a bit, but first I'd like to know if
you are interested.  Currently, the patch is over 2.2MB, gzip'd.  I'll
only bother distilling it if you don't already know about these cache
lookup failures.

--
Mark Dilger
On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote:
>
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> I happened to notice that find_expr_references_walker has not
>> been taught anything about TableFunc nodes, which means it will
>> miss the type and collation OIDs embedded in such a node.
>>
>> This can be demonstrated to be a problem by the attached script,
>> which will end up with a "cache lookup failed for type NNNNN"
>> error because we allow dropping a type the XMLTABLE construct
>> references.
>>
>> This isn't hard to fix, as per the attached patch, but it makes
>> me nervous.  I wonder what other dependencies we might be missing.
>
> I can consistently generate errors like the following in master:
>
>   ERROR:  cache lookup failed for statistics object 31041
>
> This happens in a stress test in which multiple processes are making
> changes to the schema.  So far, all the sessions that report this
> cache lookup error do so when performing one of ANALYZE, VACUUM
> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV
> statistics.  All processes running simultaneously are running the same
> set of functions, which create and delete tables, indexes, and
> statistics objects, insert, update, and delete rows in those tables,
> etc.
>
> The fact that the statistics are of the MCV type might not be
> relevant; I'm creating those on tables as part of testing Tomas
> Vondra's MCV statistics patch, so all the tables have statistics of
> that kind on them.

Hmmm, I don't know the details of the test, but this seems a bit like
we're trying to use the stats during estimation but they got dropped
meanwhile.  If that's the case, it probably affects all stats types,
not just MCV lists - there should be no significant difference between
different statistics types, I think.

I've managed to reproduce this with a stress-test, and I do get these
failures with both dependencies and mcv stats, although in slightly
different places.

And I think I see the issue - when dropping the statistics, we do
RemoveObjects which however does not acquire any lock on the table.  So
we get the list of stats (without the serialized data), but before we
get to load the contents, someone drops it.  If that's the root cause,
it's there since pg 10.

I'm not sure what's the right solution.  A straightforward option would
be to lock the relation, but will that work after adding support for
stats on joins?  An alternative would be to just ignore those failures,
but that kinda breaks the estimation (we should have picked a different
stats in this case).

> I can try to distill my test case a bit, but first I'd like to know if
> you are interested.  Currently, the patch is over 2.2MB, gzip'd.  I'll
> only bother distilling it if you don't already know about these cache
> lookup failures.

Not sure.  But I do wonder if we see the same issue.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Mark Dilger <hornschnorter@gmail.com> writes:
> On 11/11/19 1:41 PM, Tom Lane wrote:
>> I happened to notice that find_expr_references_walker has not
>> been taught anything about TableFunc nodes, which means it will
>> miss the type and collation OIDs embedded in such a node.

> I can consistently generate errors like the following in master:
>   ERROR:  cache lookup failed for statistics object 31041

This is surely a completely different issue --- there are not, one
hopes, any extended-stats OIDs embedded in views or other query trees.

I concur with Tomas' suspicion that this must be a race condition
during DROP STATISTICS.  If we're going to allow people to do that
separately from dropping the table(s), there has to be some kind of
locking around it, and it sounds like there's not :-(

            regards, tom lane
On 11/13/19 4:46 PM, Tomas Vondra wrote:
> On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote:
>>
>> On 11/11/19 1:41 PM, Tom Lane wrote:
>>> I happened to notice that find_expr_references_walker has not
>>> been taught anything about TableFunc nodes, which means it will
>>> miss the type and collation OIDs embedded in such a node.
>>>
>>> This can be demonstrated to be a problem by the attached script,
>>> which will end up with a "cache lookup failed for type NNNNN"
>>> error because we allow dropping a type the XMLTABLE construct
>>> references.
>>>
>>> This isn't hard to fix, as per the attached patch, but it makes
>>> me nervous.  I wonder what other dependencies we might be missing.
>>
>> I can consistently generate errors like the following in master:
>>
>>   ERROR:  cache lookup failed for statistics object 31041
>>
>> This happens in a stress test in which multiple processes are making
>> changes to the schema.  So far, all the sessions that report this
>> cache lookup error do so when performing one of ANALYZE, VACUUM
>> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV
>> statistics.  All processes running simultaneously are running the
>> same set of functions, which create and delete tables, indexes, and
>> statistics objects, insert, update, and delete rows in those tables,
>> etc.
>>
>> The fact that the statistics are of the MCV type might not be
>> relevant; I'm creating those on tables as part of testing Tomas
>> Vondra's MCV statistics patch, so all the tables have statistics of
>> that kind on them.
>
> Hmmm, I don't know the details of the test, but this seems a bit like
> we're trying to use the stats during estimation but they got dropped
> meanwhile.  If that's the case, it probably affects all stats types,
> not just MCV lists - there should be no significant difference between
> different statistics types, I think.
>
> I've managed to reproduce this with a stress-test, and I do get these
> failures with both dependencies and mcv stats, although in slightly
> different places.
>
> And I think I see the issue - when dropping the statistics, we do
> RemoveObjects which however does not acquire any lock on the table.
> So we get the list of stats (without the serialized data), but before
> we get to load the contents, someone drops it.  If that's the root
> cause, it's there since pg 10.
>
> I'm not sure what's the right solution.  A straightforward option
> would be to lock the relation, but will that work after adding support
> for stats on joins?  An alternative would be to just ignore those
> failures, but that kinda breaks the estimation (we should have picked
> a different stats in this case).
>
>> I can try to distill my test case a bit, but first I'd like to know
>> if you are interested.  Currently, the patch is over 2.2MB, gzip'd.
>> I'll only bother distilling it if you don't already know about these
>> cache lookup failures.
>
> Not sure.  But I do wonder if we see the same issue.

I don't know.  If you want to reproduce what I'm seeing....

I added a parallel_schedule target:

diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fc0f14122b..5ace7c7a8a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -85,6 +85,8 @@ test: create_table_like alter_generic alter_operator misc async dbsize misc_func
 # collate.*.utf8 tests cannot be run in parallel with each other
 test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8
 
+test: mcv_huge_stress_a mcv_huge_stress_b mcv_huge_stress_c mcv_huge_stress_d mcv_huge_stress_e mcv_huge_stress_f mcv_huge_stress_g
+
 # run by itself so it can run parallel workers
 test: select_parallel
 test: write_parallel

And used the attached script to generate the contents of the seven
parallel tests.  If you want to duplicate this, you'll have to manually
run gen.pl and direct its output to those src/test/regress/sql/ files.
The src/test/regress/expected/ files are just empty, as I don't care
about whether the test results match.  I'm just checking what kinds of
errors I get and whether any of them are concerning.

After my most recent run of the stress tests, I grep'd for cache
failures and got 23 of them, all coming from get_relation_statistics(),
statext_store() and statext_mcv_load().  Two different adjacent spots
in get_relation_statistics() were involved:

    htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
    if (!HeapTupleIsValid(htup))
        elog(ERROR, "cache lookup failed for statistics object %u",
             statOid);
    staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);

    dtup = SearchSysCache1(STATEXTDATASTXOID, ObjectIdGetDatum(statOid));
    if (!HeapTupleIsValid(dtup))
        elog(ERROR, "cache lookup failed for statistics object %u",
             statOid);

Most were from the first SearchSysCache1 call, but one of them was from
the second.

--
Mark Dilger
On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
> Mark Dilger <hornschnorter@gmail.com> writes:
>> On 11/11/19 1:41 PM, Tom Lane wrote:
>>> I happened to notice that find_expr_references_walker has not
>>> been taught anything about TableFunc nodes, which means it will
>>> miss the type and collation OIDs embedded in such a node.
>
>> I can consistently generate errors like the following in master:
>>   ERROR:  cache lookup failed for statistics object 31041
>
> This is surely a completely different issue --- there are not,
> one hopes, any extended-stats OIDs embedded in views or other
> query trees.
>
> I concur with Tomas' suspicion that this must be a race condition
> during DROP STATISTICS.  If we're going to allow people to do that
> separately from dropping the table(s), there has to be some kind of
> locking around it, and it sounds like there's not :-(

I think the right thing to do is simply acquire AE lock on the relation
in RemoveStatisticsById, per the attached patch.  It's possible we'll
need to do something more complicated once join stats are added, but
for now this should be enough (and backpatchable).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
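The attachment is not preserved here, but the change described plausibly amounts to a few lines in RemoveStatisticsById() along these lines (a sketch under that assumption, not the actual patch):

	/* ... after fetching the pg_statistic_ext tuple for statsOid ... */
	statext = (Form_pg_statistic_ext) GETSTRUCT(tup);
	relid = statext->stxrelid;

	/*
	 * New: take AccessExclusiveLock on the table owning the statistics
	 * object, so that dropping the object serializes against ANALYZE and
	 * against planning that may be about to load the stats contents.
	 */
	LockRelationOid(relid, AccessExclusiveLock);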
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
>> I concur with Tomas' suspicion that this must be a race condition
>> during DROP STATISTICS.  If we're going to allow people to do that
>> separately from dropping the table(s), there has to be some kind of
>> locking around it, and it sounds like there's not :-(

> I think the right thing to do is simply acquire AE lock on the relation
> in RemoveStatisticsById, per the attached patch.  It's possible we'll
> need to do something more complicated once join stats are added, but
> for now this should be enough (and backpatchable).

Hm.  No, it's not enough, unless you add more logic to deal with the
possibility that the stats object is gone by the time you have the
table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
or a statistics drop that's cascading from something like a drop of a
relevant function and so has no earlier table lock.

            regards, tom lane
On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Wed, Nov 13, 2019 at 08:37:59PM -0500, Tom Lane wrote:
>>> I concur with Tomas' suspicion that this must be a race condition
>>> during DROP STATISTICS.  If we're going to allow people to do that
>>> separately from dropping the table(s), there has to be some kind of
>>> locking around it, and it sounds like there's not :-(
>
>> I think the right thing to do is simply acquire AE lock on the relation
>> in RemoveStatisticsById, per the attached patch.  It's possible we'll
>> need to do something more complicated once join stats are added, but
>> for now this should be enough (and backpatchable).
>
> Hm.  No, it's not enough, unless you add more logic to deal with the
> possibility that the stats object is gone by the time you have the
> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
> or a statistics drop that's cascading from something like a drop
> of a relevant function and so has no earlier table lock.

Isn't that solved by RemoveObjects() doing this?

    /* Get an ObjectAddress for the object. */
    address = get_object_address(stmt->removeType,
                                 object,
                                 &relation,
                                 AccessExclusiveLock,
                                 stmt->missing_ok);

I've actually done some debugging before sending the patch, and I think
this prevents the issue you describe.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Nov-14, Tomas Vondra wrote:

> Isn't that solved by RemoveObjects() doing this?
>
>     /* Get an ObjectAddress for the object. */
>     address = get_object_address(stmt->removeType,
>                                  object,
>                                  &relation,
>                                  AccessExclusiveLock,
>                                  stmt->missing_ok);
>
> I've actually done some debugging before sending the patch, and I think
> this prevents the issue you describe.

Hmm .. shouldn't get_statistics_object_oid get a lock on the table that
owns the stats object too?  I think it should be setting *relp to it.
That way, the lock you're proposing to add would be obtained there.
That means it'd be similar to what we do for OBJECT_TRIGGER etc,
get_object_address_relobject().

I admit this'd crash and burn if we had stats on multiple relations,
because there'd be no way to return the multiple relations that would
end up locked.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
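A sketch of that suggestion (helper names here are invented; the real get_statistics_object_oid and get_object_address_relobject differ in detail): resolve the stats object, then lock and hand back the owning table through *relp so the generic RemoveObjects() path ends up holding the table lock.

	static ObjectAddress
	get_statistics_object_address(List *object, Relation *relp,
								  bool missing_ok)
	{
		ObjectAddress address;
		Oid			statOid;
		Oid			tableOid;

		/* hypothetical helpers standing in for real name resolution */
		statOid = lookup_statistics_object(object, missing_ok);
		tableOid = get_statistics_object_relid(statOid);

		/* lock and return the owning table, as the trigger path does */
		*relp = table_open(tableOid, AccessExclusiveLock);

		ObjectAddressSet(address, StatisticExtRelationId, statOid);
		return address;
	}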
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
>> Hm.  No, it's not enough, unless you add more logic to deal with the
>> possibility that the stats object is gone by the time you have the
>> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
>> or a statistics drop that's cascading from something like a drop
>> of a relevant function and so has no earlier table lock.

> Isn't that solved by RemoveObjects() doing this?

>     /* Get an ObjectAddress for the object. */
>     address = get_object_address(stmt->removeType,
>                                  object,
>                                  &relation,
>                                  AccessExclusiveLock,
>                                  stmt->missing_ok);

Ah, I see, we already have AEL on the stats object itself.  So that
eliminates my concern about a race between two RemoveStatisticsById
calls, but what we have instead is fear of deadlock.  A DROP STATISTICS
command will acquire AEL on the stats object but then AEL on the table,
the opposite of what will happen during DROP TABLE, so concurrent
executions of those will deadlock.  That might be better than the
failures Mark is seeing now, but not by much.

A correct fix I think is that the planner ought to acquire
AccessShareLock on a stats object it's trying to use (and then recheck
whether the object is still there).  That seems rather expensive, but
there may be no other way.

            regards, tom lane
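To spell out both halves of this (illustrative only, not committed code): the deadlock comes from opposite lock ordering, and the suggested fix is a lock-then-recheck inside the planner's loop over a table's statistics objects.

	/*
	 * Ordering hazard:
	 *   DROP STATISTICS:  AEL on stats object, then AEL on table
	 *   DROP TABLE:       AEL on table, then AEL on stats object
	 * Run concurrently, each session can block on the other's first lock.
	 *
	 * Sketch of the planner-side fix, for each candidate statOid:
	 */
	LockDatabaseObject(StatisticExtRelationId, statOid, 0, AccessShareLock);

	htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
	if (!HeapTupleIsValid(htup))
	{
		/* dropped while we waited for the lock: skip it, don't error */
		UnlockDatabaseObject(StatisticExtRelationId, statOid, 0,
							 AccessShareLock);
		continue;
	}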
On Thu, Nov 14, 2019 at 07:27:29PM -0300, Alvaro Herrera wrote:
> On 2019-Nov-14, Tomas Vondra wrote:
>
>> Isn't that solved by RemoveObjects() doing this?
>>
>>     /* Get an ObjectAddress for the object. */
>>     address = get_object_address(stmt->removeType,
>>                                  object,
>>                                  &relation,
>>                                  AccessExclusiveLock,
>>                                  stmt->missing_ok);
>>
>> I've actually done some debugging before sending the patch, and I think
>> this prevents the issue you describe.
>
> Hmm .. shouldn't get_statistics_object_oid get a lock on the table that
> owns the stats object too?  I think it should be setting *relp to it.
> That way, the lock you're proposing to add would be obtained there.
> That means it'd be similar to what we do for OBJECT_TRIGGER etc,
> get_object_address_relobject().

Hmmm, maybe.  We'd have to fake the list of names, because that function
expects the relation name to be included in the list of names, and we
don't have that for extended stats.  But it might work, I guess.

> I admit this'd crash and burn if we had stats on multiple relations,
> because there'd be no way to return the multiple relations that would
> end up locked.

I think that's less important now.  If we ever get that feature, we'll
need to make that work somehow.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 14, 2019 at 05:35:06PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Thu, Nov 14, 2019 at 04:36:54PM -0500, Tom Lane wrote:
>>> Hm.  No, it's not enough, unless you add more logic to deal with the
>>> possibility that the stats object is gone by the time you have the
>>> table lock.  Consider e.g. two concurrent DROP STATISTICS commands,
>>> or a statistics drop that's cascading from something like a drop
>>> of a relevant function and so has no earlier table lock.
>
>> Isn't that solved by RemoveObjects() doing this?
>
>>     /* Get an ObjectAddress for the object. */
>>     address = get_object_address(stmt->removeType,
>>                                  object,
>>                                  &relation,
>>                                  AccessExclusiveLock,
>>                                  stmt->missing_ok);
>
> Ah, I see, we already have AEL on the stats object itself.  So that
> eliminates my concern about a race between two RemoveStatisticsById
> calls, but what we have instead is fear of deadlock.  A DROP STATISTICS
> command will acquire AEL on the stats object but then AEL on the table,
> the opposite of what will happen during DROP TABLE, so concurrent
> executions of those will deadlock.  That might be better than the
> failures Mark is seeing now, but not by much.

Hmmm, yeah :-(

> A correct fix I think is that the planner ought to acquire
> AccessShareLock on a stats object it's trying to use (and then recheck
> whether the object is still there).  That seems rather expensive, but
> there may be no other way.

Yes, something like what we do for indexes, although we don't need the
recheck in that case.  I think the attached patch does that (but it's
1AM here).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services