Re: Missing dependency tracking for TableFunc nodes - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Missing dependency tracking for TableFunc nodes
Date
Msg-id bfc09232-599b-482b-8a5e-f532ba7cfbd8@gmail.com
Whole thread Raw
In response to Re: Missing dependency tracking for TableFunc nodes  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers

On 11/13/19 4:46 PM, Tomas Vondra wrote:
> On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote:
>>
>>
>> On 11/11/19 1:41 PM, Tom Lane wrote:
>>> I happened to notice that find_expr_references_walker has not
>>> been taught anything about TableFunc nodes, which means it will
>>> miss the type and collation OIDs embedded in such a node.
>>>
>>> This can be demonstrated to be a problem by the attached script,
>>> which will end up with a "cache lookup failed for type NNNNN"
>>> error because we allow dropping a type the XMLTABLE construct
>>> references.
>>>
>>> This isn't hard to fix, as per the attached patch, but it makes
>>> me nervous.  I wonder what other dependencies we might be missing.
>>
>> I can consistently generate errors like the following in master:
>>
>>  ERROR:  cache lookup failed for statistics object 31041
>>
>> This happens in a stress test in which multiple processes are making 
>> changes to the schema.  So far, all the sessions that report this 
>> cache lookup error do so when performing one of ANALYZE, VACUUM 
>> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV 
>> statistics. All processes running simultaneously are running the same 
>> set of functions, which create and delete tables, indexes, and 
>> statistics objects, insert, update, and delete rows in those tables, etc.
>>
>> The fact that the statistics are of the MCV type might not be 
>> relevant; I'm creating those on tables as part of testing Tomas 
>> Vondra's MCV statistics patch, so all the tables have statistics of 
>> that kind on them.
>>
> 
> Hmmm, I don't know the details of the test, but this seems a bit like
> we're trying to use the stats during estimation but it got dropped
> meanwhile. If that's the case, it probably affects all stats types, not
> just MCV lists - there should no significant difference between
> different statistics types, I think.
> 
> I've managed to reproduce this with a stress-test, and I do get these
> failures with both dependencies and mcv stats, although in slightly
> different places.
> 
> And I think I see the issue - when dropping the statistics, we do
> RemoveObjects which however does not acquire any lock on the table. So
> we get the list of stats (without the serialized data), but before we
> get to load the contents, someone drops it. If that's the root cause,
> it's there since pg 10.
> 
> I'm not sure what's the right solution. An straightforward option would
> be to lock the relation, but will that work after adding support for
> stats on joins? An alternative would be to just ignore those failures,
> but that kinda breaks the estimation (we should have picked a different
> stats in this case).
> 
>> I can try to distill my test case a bit, but first I'd like to know if 
>> you are interested.  Currently, the patch is over 2.2MB, gzip'd.  I'll 
>> only bother distilling it if you don't already know about these cache 
>> lookup failures.
>>
> 
> Not sure. But I do wonder if we see the same issue.

I don't know.  If you want to reproduce what I'm seeing....

I added a parallel_schedule target:

diff --git a/src/test/regress/parallel_schedule 
b/src/test/regress/parallel_schedule
index fc0f14122b..5ace7c7a8a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -85,6 +85,8 @@ test: create_table_like alter_generic alter_operator 
misc async dbsize misc_func
  # collate.*.utf8 tests cannot be run in parallel with each other
  test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8

+test: mcv_huge_stress_a mcv_huge_stress_b mcv_huge_stress_c 
mcv_huge_stress_d mcv_huge_stress_e mcv_huge_stress_f mcv_huge_stress_g
+
  # run by itself so it can run parallel workers
  test: select_parallel
  test: write_parallel


And used the attached script to generate the contents of the seven 
parallel tests.  If you want to duplicate this, you'll have to manually 
run gen.pl and direct its output to those src/test/regress/sql/ files. 
The src/test/regress/expected/ files are just empty, as I don't care 
about whether the test results match.  I'm just checking what kinds of 
errors I get and whether any of them are concerning.

After my most recent run of the stress tests, I grep'd for cache 
failures and got 23 of them, all coming from get_relation_statistics(), 
statext_store() and statext_mcv_load().  Two different adjacent spots in 
get_relation_statistics() were involved:

         htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
         if (!HeapTupleIsValid(htup))
             elog(ERROR, "cache lookup failed for statistics object %u", 
statOid);
         staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);

         dtup = SearchSysCache1(STATEXTDATASTXOID, 
ObjectIdGetDatum(statOid));
         if (!HeapTupleIsValid(dtup))
             elog(ERROR, "cache lookup failed for statistics object %u", 
statOid);

Most were from the first SearchSysCache1 call, but one of them was from 
the second.

-- 
Mark Dilger

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Missing dependency tracking for TableFunc nodes
Next
From: Andres Freund
Date:
Subject: Re: Ought to use heap_multi_insert() for pg_attribute/dependinsertions?