Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used - Mailing list pgsql-bugs

From Alexander Lakhin
Subject Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date
Msg-id defada9d-8bb1-860f-2682-eee03fdc0ab4@gmail.com
Whole thread Raw
In response to Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
13.10.2023 18:00, Alexander Lakhin wrote:

>
>> I spent some time looking through existing SearchSysCacheExists calls,
>> and I could only find two sets of routines where we seem to be
>> depending on SearchSysCacheExists to protect a subsequent lookup
>> somewhere else, and there isn't any lock on the object in question.
>> Those are the has_foo_privilege functions discussed here, and the
>> foo_is_visible functions near the bottom of namespace.c.  I'm not
>> sure why we've not heard complaints traceable to the foo_is_visible
>> family.  Maybe nobody has tried hard to break them, or maybe they
>> are just less likely to be used in ways that are at risk.
>
> I'll try to research/break xxx_is_visible and share my findings tomorrow.
>

I tried the script based on the initial reproducer [1]:
for ((n=1;n<=30;n++)); do
echo "ITERATION $n"

numclients=30
for ((c=1;c<=$numclients;c++)); do
cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;

SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec

SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';

SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND
  relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);

SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec

DROP SCHEMA testxmlschema_$c;
EOF
done
wait
grep 'ERROR:' server.log && break;
done

And couldn't get the error, for multiple runs. (Here SELECT oid ... is
based on the query executed by schema_to_xmlschema().)
But I could reliably get the error with
s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
So there is a difference between these two functions. And the difference is
in their costs.
If I do "ALTER FUNCTION pg_table_is_visible COST 1" before the script,
I get the error as expected.
With cost 10 I see the following plan:
  Index Scan using pg_class_relname_nsp_index on pg_class (cost=0.42..2922.38 rows=1 width=4)
    Index Cond: (relnamespace = '1'::oid)
    Filter: ((relkind = ANY ('{r,m,v}'::"char"[])) AND pg_table_is_visible(oid))

But with cost 1:
  Gather  (cost=1.00..257.10 rows=1 width=4)
    Workers Planned: 2
    Workers Launched: 2
    ->  Parallel Seq Scan on pg_class  (cost=0.00..256.00 rows=1 width=4)
          Filter: (pg_table_is_visible(oid) AND (relnamespace = '1'::oid) AND (relkind = ANY ('{r,m,v}'::"char"[])))
          Rows Removed by Filter: 405

The cost of the pg_foo_is_visible functions was increased in a80889a73.
But all the has_xxx_privilige functions have cost 1, except for
has_any_column_privilege, which cost was also increased in 7449427a1.

So to see the issue we need several ingredients:
1) The mode CATCACHE_FORCE_RELEASE enabled (may be some other way is
  possible, I don't know of);
     - Thanks to prion for that.
2) A function with the coding pattern
  "SearchSysCacheExistsX();  SearchSysCacheX();" called in a parallel worker;
     - Thanks to "debug_parallel_query = regress" and low cost of
       has_table_privilege() called by schema_to_xmlschema().
3) The catalog cache invalidated by some concurrent activity.
     - Thanks to running the test xmlmap in parallel with 16 other tests.

[1] https://www.postgresql.org/message-id/18014-28c81cb79d44295d%40postgresql.org

Best regards,
Alexander



pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG #18130: \copy fails with "could not read block" or "page should be empty but not" errors due to triggers
Next
From: Erki Eessaar
Date:
Subject: System administration functions about relation size ignore changes in the table structure