Re: Missing dependency tracking for TableFunc nodes - Mailing list pgsql-hackers
From | Mark Dilger |
---|---|
Subject | Re: Missing dependency tracking for TableFunc nodes |
Date | |
Msg-id | bfc09232-599b-482b-8a5e-f532ba7cfbd8@gmail.com Whole thread Raw |
In response to | Re: Missing dependency tracking for TableFunc nodes (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
List | pgsql-hackers |
On 11/13/19 4:46 PM, Tomas Vondra wrote: > On Wed, Nov 13, 2019 at 03:00:03PM -0800, Mark Dilger wrote: >> >> >> On 11/11/19 1:41 PM, Tom Lane wrote: >>> I happened to notice that find_expr_references_walker has not >>> been taught anything about TableFunc nodes, which means it will >>> miss the type and collation OIDs embedded in such a node. >>> >>> This can be demonstrated to be a problem by the attached script, >>> which will end up with a "cache lookup failed for type NNNNN" >>> error because we allow dropping a type the XMLTABLE construct >>> references. >>> >>> This isn't hard to fix, as per the attached patch, but it makes >>> me nervous. I wonder what other dependencies we might be missing. >> >> I can consistently generate errors like the following in master: >> >> ERROR: cache lookup failed for statistics object 31041 >> >> This happens in a stress test in which multiple processes are making >> changes to the schema. So far, all the sessions that report this >> cache lookup error do so when performing one of ANALYZE, VACUUM >> ANALYZE, UPDATE, DELETE or EXPLAIN ANALYZE on a table that has MCV >> statistics. All processes running simultaneously are running the same >> set of functions, which create and delete tables, indexes, and >> statistics objects, insert, update, and delete rows in those tables, etc. >> >> The fact that the statistics are of the MCV type might not be >> relevant; I'm creating those on tables as part of testing Tomas >> Vondra's MCV statistics patch, so all the tables have statistics of >> that kind on them. >> > > Hmmm, I don't know the details of the test, but this seems a bit like > we're trying to use the stats during estimation but it got dropped > meanwhile. If that's the case, it probably affects all stats types, not > just MCV lists - there should no significant difference between > different statistics types, I think. > > I've managed to reproduce this with a stress-test, and I do get these > failures with both dependencies and mcv stats, although in slightly > different places. > > And I think I see the issue - when dropping the statistics, we do > RemoveObjects which however does not acquire any lock on the table. So > we get the list of stats (without the serialized data), but before we > get to load the contents, someone drops it. If that's the root cause, > it's there since pg 10. > > I'm not sure what's the right solution. An straightforward option would > be to lock the relation, but will that work after adding support for > stats on joins? An alternative would be to just ignore those failures, > but that kinda breaks the estimation (we should have picked a different > stats in this case). > >> I can try to distill my test case a bit, but first I'd like to know if >> you are interested. Currently, the patch is over 2.2MB, gzip'd. I'll >> only bother distilling it if you don't already know about these cache >> lookup failures. >> > > Not sure. But I do wonder if we see the same issue. I don't know. If you want to reproduce what I'm seeing.... I added a parallel_schedule target: diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index fc0f14122b..5ace7c7a8a 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -85,6 +85,8 @@ test: create_table_like alter_generic alter_operator misc async dbsize misc_func # collate.*.utf8 tests cannot be run in parallel with each other test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8 +test: mcv_huge_stress_a mcv_huge_stress_b mcv_huge_stress_c mcv_huge_stress_d mcv_huge_stress_e mcv_huge_stress_f mcv_huge_stress_g + # run by itself so it can run parallel workers test: select_parallel test: write_parallel And used the attached script to generate the contents of the seven parallel tests. If you want to duplicate this, you'll have to manually run gen.pl and direct its output to those src/test/regress/sql/ files. The src/test/regress/expected/ files are just empty, as I don't care about whether the test results match. I'm just checking what kinds of errors I get and whether any of them are concerning. After my most recent run of the stress tests, I grep'd for cache failures and got 23 of them, all coming from get_relation_statistics(), statext_store() and statext_mcv_load(). Two different adjacent spots in get_relation_statistics() were involved: htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid)); if (!HeapTupleIsValid(htup)) elog(ERROR, "cache lookup failed for statistics object %u", statOid); staForm = (Form_pg_statistic_ext) GETSTRUCT(htup); dtup = SearchSysCache1(STATEXTDATASTXOID, ObjectIdGetDatum(statOid)); if (!HeapTupleIsValid(dtup)) elog(ERROR, "cache lookup failed for statistics object %u", statOid); Most were from the first SearchSysCache1 call, but one of them was from the second. -- Mark Dilger
Attachment
pgsql-hackers by date: