Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
From: Alexander Lakhin
Date:
Msg-id: b2dcd258-82f1-dfe3-498e-9f652a0c0c4b@gmail.com
In response to: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
List: pgsql-bugs
04.07.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18014
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> Yesterday's test failure on prion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
> made me wonder, what's going on there and whether it's yet another issue
> with invalidating relcache (bug #17994).
> (
> SELECT schema_to_xmlschema('testxmlschema', false, true, '');
> ERROR:  relation with OID 29598 does not exist
> CONTEXT:  SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
> relnamespace = 29597 AND relkind IN ('r','m','v') AND
> pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"

I investigated this case and would like to share my findings.

I added in has_table_privilege_id(), just below

if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
    PG_RETURN_NULL();

the following loop:

for (int i = 0; i < 100; i++)
{
    bool sce = SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid));

    if (!sce)
    {
        elog(LOG, "has_table_privilege_id(): no syscache entry on iteration %d", i);
        break;
    }
}

and discovered that when the reproducing script uses parallel worker(s), the
syscache entry disappears during this loop's execution. But that doesn't
happen when the query "SELECT oid FROM pg_catalog.pg_class WHERE ..." is
executed in a regular backend.

AFAICS, the difference is in LockRelationOid():

res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

/*
 * Now that we have the lock, check for invalidation messages, so that we
 * will update or flush any stale relcache entry before we try to use it.
 * RangeVarGetRelid() specifically relies on us for this. We can skip
 * this in the not-uncommon case that we already had the same type of lock
 * being requested, since then no one else could have modified the
 * relcache entry in an undesirable way. (In the case where our own xact
 * modifies the rel, the relcache update happens via
 * CommandCounterIncrement, not here.)
 *
 * However, in corner cases where code acts on tables (usually catalogs)
 * recursively, we might get here while still processing invalidation
 * messages in some outer execution of this function or a sibling. The
 * "cleared" status of the lock tells us whether we really are done
 * absorbing relevant inval messages.
 */
if (res != LOCKACQUIRE_ALREADY_CLEAR)
{
    AcceptInvalidationMessages();
    MarkLockClear(locallock);
}

when LockRelationOid() is called for pg_class_oid_index inside
SearchCatCacheMiss() -> systable_beginscan() -> index_open() ->
relation_open().

The parallel worker doesn't have a lock on pg_class_oid_index before
executing the query, so it gets the lock with res == LOCKACQUIRE_OK (not
LOCKACQUIRE_ALREADY_CLEAR as in the regular backend case), after that it
processes invalidation messages (which can make the backend use a newer
catalog snapshot), and at the end it does systable_endscan() ->
index_close() -> UnlockRelationId() -> LockRelease()... Thus, on the next
iteration it gets the lock anew, again with res == LOCKACQUIRE_OK, and the
whole ceremony is repeated.

It's not clear to me whether this parallel worker behavior is expected and,
if so, what to fix to avoid the test failure.

Best regards,
Alexander
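P.S. For readers who want to follow the lock/invalidation sequence without
digging through catcache.c, here is a minimal, self-contained sketch
(assuming PostgreSQL 16 headers) of roughly what SearchCatCacheMiss() does
for a RELOID lookup. The wrapper function catcache_miss_probe_sketch() and
its argument are made up for illustration only; the backend functions it
calls are the real ones named above, but this is not the actual catcache
code.

/*
 * Illustrative sketch only: a condensed stand-in for the RELOID catcache
 * miss path, showing where the pg_class_oid_index lock is taken and
 * released.  The function name and argument are hypothetical.
 */
#include "postgres.h"

#include "access/genam.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/table.h"
#include "catalog/pg_class.h"
#include "storage/lockdefs.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"

static void
catcache_miss_probe_sketch(Oid reloid)
{
    Relation    pg_class_rel;
    SysScanDesc scan;
    ScanKeyData key;

    /* Open pg_class itself, as the RELOID cache does on a miss. */
    pg_class_rel = table_open(RelationRelationId, AccessShareLock);

    ScanKeyInit(&key,
                Anum_pg_class_oid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(reloid));

    /*
     * systable_beginscan() -> index_open(pg_class_oid_index) ->
     * relation_open() -> LockRelationOid().  In a parallel worker this lock
     * is newly acquired (res == LOCKACQUIRE_OK), so LockRelationOid() calls
     * AcceptInvalidationMessages() before the tuple is fetched, and the
     * catalog snapshot may be refreshed.  A NULL snapshot means a catalog
     * snapshot is used, as in the catcache.
     */
    scan = systable_beginscan(pg_class_rel, ClassOidIndexId, true,
                              NULL, 1, &key);

    (void) systable_getnext(scan);

    /*
     * systable_endscan() -> index_close() -> UnlockRelationId() ->
     * LockRelease() drops the index lock again, so the next lookup repeats
     * the whole sequence.
     */
    systable_endscan(scan);
    table_close(pg_class_rel, AccessShareLock);
}

The point of interest is that both the LockRelationOid() call and the
matching UnlockRelationId() happen inside a single lookup, so a worker that
held no prior lock on pg_class_oid_index absorbs invalidation messages on
every iteration of the loop above.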