Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used - Mailing list pgsql-bugs

From Alexander Lakhin
Subject Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date
Msg-id b2dcd258-82f1-dfe3-498e-9f652a0c0c4b@gmail.com
In response to BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
List pgsql-bugs
04.07.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18014
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> Yesterday's test failure on prion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
> made me wonder, what's going on there and whether it's yet another issue
> with invalidating relcache (bug #17994).
> (
> SELECT schema_to_xmlschema('testxmlschema', false, true, '');
> ERROR:  relation with OID 29598 does not exist
> CONTEXT:  SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
> relnamespace = 29597 AND relkind IN ('r','m','v') AND
> pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"

I investigated this case and would like to share my findings.
I added in has_table_privilege_id(), just below
      if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
          PG_RETURN_NULL();

the following loop:
for (int i = 0; i < 100; i++)
{
    bool sce = SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid));

    if (!sce)
    {
        elog(LOG, "has_table_privilege_id(): no syscache entry on iteration %d", i);
        break;
    }
}

and discovered that when the reproducing script uses parallel worker(s), the
syscache entry disappears during execution of this loop. That does not happen
when the query "SELECT oid FROM pg_catalog.pg_class WHERE ..." is executed in
a regular backend.
AFAICS, the difference lies in LockRelationOid():
    res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

    /*
     * Now that we have the lock, check for invalidation messages, so that we
     * will update or flush any stale relcache entry before we try to use it.
     * RangeVarGetRelid() specifically relies on us for this.  We can skip
     * this in the not-uncommon case that we already had the same type of lock
     * being requested, since then no one else could have modified the
     * relcache entry in an undesirable way.  (In the case where our own xact
     * modifies the rel, the relcache update happens via
     * CommandCounterIncrement, not here.)
     *
     * However, in corner cases where code acts on tables (usually catalogs)
     * recursively, we might get here while still processing invalidation
     * messages in some outer execution of this function or a sibling.  The
     * "cleared" status of the lock tells us whether we really are done
     * absorbing relevant inval messages.
     */
    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }
when LockRelationOid() is called for pg_class_oid_index inside
SearchCatCacheMiss() -> systable_beginscan() -> index_open() -> relation_open().

The parallel worker doesn't hold a lock on pg_class_oid_index before
executing the query, so it acquires the lock with res == LOCKACQUIRE_OK (not
LOCKACQUIRE_ALREADY_CLEAR, as in the regular backend case). After that it
processes invalidation messages (which can make the backend use a newer
catalog snapshot), and at the end it does systable_endscan() ->
index_close() -> UnlockRelationId() -> LockRelease()...
Thus, on the next iteration it acquires the lock anew, again with
res == LOCKACQUIRE_OK, and the whole ceremony is repeated.

It's not clear to me whether this parallel-worker behavior is expected, and
if so, what should be fixed to avoid the test failure.

Best regards,
Alexander


