Thread: Re: BUG #18545: \dt breaks transaction, calling error when executed in SET SESSION AUTHORIZATION

This commit seems to trigger an elog() error that is not
reproducible at the parent commit.

6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.

postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR:  pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
postgres=# \errverbose
ERROR:  XX000: pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
LOCATION:  RelationBuildTupleDesc, relcache.c:658

This is not completely deterministic:

postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR:  pg_attribute catalog is missing 26 attribute(s) for relation OID 70391

But I think this will be reproducible in any database whose
pg_attribute holds a nontrivial number of rows.



Justin Pryzby <pryzby@telsasoft.com> writes:
> This commit seems to trigger an elog() error that is not
> reproducible at the parent commit.

Yeah, I can reproduce that.  Will take a look tomorrow.

            regards, tom lane



Justin Pryzby <pryzby@telsasoft.com> writes:
> This commit seems to trigger an elog() error that is not
> reproducible at the parent commit.

> 6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.

> postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
> ERROR:  pg_attribute catalog is missing 26 attribute(s) for relation OID 70321

I've been poking at this all day, and I still have little idea what's
going on.  I've added a bunch of throwaway instrumentation, and have
managed to convince myself that the problem is that parallel heap
scan is broken.  The scans done to rebuild pg_attribute's indexes
seem to sometimes miss heap pages or visit pages twice (in different
workers).  I have no idea why this is, and even less idea how
6e086fa2e is provoking it.  As you say, the behavior isn't entirely
reproducible, but I couldn't make it happen at all after reverting
6e086fa2e's changes in transam/parallel.c, so apparently there is
some connection.

Another possibly useful data point is that for me it reproduces
fairly well (more than one time in two) on x86_64 Linux, but
I could not make it happen on macOS ARM64.  If it's a race
condition, which smells plausible, that's perhaps not hugely
surprising.

            regards, tom lane



I wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
>> This commit seems to trigger an elog() error that is not
>> reproducible at the parent commit.
>> 6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.

>> postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
>> ERROR:  pg_attribute catalog is missing 26 attribute(s) for relation OID 70321

> I've been poking at this all day, and I still have little idea what's
> going on.

Got it, after a good deal more head-scratching.  Here's the relevant
parts of ParallelWorkerMain:

    /*
     * We've changed which tuples we can see, and must therefore invalidate
     * system caches.
     */
    InvalidateSystemCaches();

    /*
     * Restore GUC values from launching backend.  We can't do this earlier,
     * because GUC check hooks that do catalog lookups need to see the same
     * database state as the leader.
     */
    gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
    RestoreGUCState(gucspace);

    ...

    /* Restore relmapper state. */
    relmapperspace = shm_toc_lookup(toc, PARALLEL_KEY_RELMAPPER_STATE, false);
    RestoreRelationMap(relmapperspace);

InvalidateSystemCaches blows away the worker's relcache.  Then
RestoreGUCState causes some catalog lookups (tracing shows that
restoring default_text_search_config is what triggers this on my
setup), and in particular pg_attribute's relcache entry will get
constructed to support that.  Then we wheel in a new set of
relation map entries *without doing anything about what that
might invalidate*.

In the given test case, the globally-visible relmap says that
pg_attribute's relfilenode is, say, XXXX.  But we are busy rewriting
it, so the parent process has an "active" relmap entry that says
pg_attribute's relfilenode is YYYY.  Given the above, the worker
process will have built a pg_attribute relcache entry that contains
XXXX, and even though it now knows YYYY is the value it should be
using, that information never makes it to the worker's relcache.
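
To spell out the mechanism: for a mapped catalog, the relcache entry's
physical address is taken from the relation map at the moment the entry
is built, roughly like this (paraphrasing the mapped-relation branch of
RelationInitPhysicalAddr in relcache.c, not quoting it verbatim):

    /*
     * For a mapped catalog such as pg_attribute, the relfilenumber is
     * fetched from the relation map when the relcache entry is (re)built
     * -- here, during RestoreGUCState, i.e. before the worker has
     * absorbed the leader's "active" map entries, so it latches onto the
     * globally-visible value XXXX.
     */
    relation->rd_locator.relNumber =
        RelationMapOidToFilenumber(relation->rd_id,
                                   relation->rd_rel->relisshared);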

The upshot of this is that when the parallel heap scan machinery
doles out some block numbers for the parent process to read, and
some other block numbers for the worker to read, the worker is
reading those block numbers from the pre-clustering copy of
pg_attribute, which most likely doesn't match the post-clustering
image.  This accounts for the missing and duplicate tuples I was
seeing in the scan output.

Of course, the reason 6e086fa2e made this visible is that before
that, any catalog reads triggered by RestoreGUCState were done
in an earlier transaction, and then we would blow away the ensuing
relcache entries in InvalidateSystemCaches.  So there was no bug
as long as you assume that the "..." code doesn't cause any
catalog reads.  I'm not too sure of that though --- it's certainly
not very comfortable to assume that functions like SetCurrentRoleId
and SetTempNamespaceState will never attempt a catalog lookup.

The code has another hazard too, which is that this all implies
that the GUC-related catalog lookups will be done against the
globally-visible relmap state not whatever is active in the parent
process.  I have not tried to construct a POC showing that that
can give incorrect answers (that is, different from what the
parent thinks), but it seems plausible that it could.

So the fix seems clear to me: RestoreRelationMap needs to happen
before anything that could result in catalog lookups.  I'm kind
of inclined to move up the adjacent restores of non-transactional
low-level stuff too, particularly RestoreReindexState which has
direct impact on how catalog lookups are done.
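
Concretely, the reordering I have in mind in ParallelWorkerMain looks
about like this (sketch only, untested; the other non-transactional
restores such as RestoreReindexState would move up the same way):

    InvalidateSystemCaches();

    /*
     * Restore relation map state before anything that can do catalog
     * lookups, so that lookups made while restoring GUCs resolve mapped
     * relations the same way the leader does, rather than from the
     * globally-visible map.
     */
    relmapperspace = shm_toc_lookup(toc, PARALLEL_KEY_RELMAPPER_STATE, false);
    RestoreRelationMap(relmapperspace);

    /* Restore GUC values from launching backend. */
    gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
    RestoreGUCState(gucspace);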

Independently of that, it's annoying that the parallel heap scan
machinery failed to notice that it was handing out block numbers
for two different relfilenodes.  I'm inclined to see if we can
put some Asserts in there that would detect that.  This particular
bug would have been far easier to diagnose that way, and it hardly
seems unlikely that "worker is reading the wrong relation" could
happen again through some future mistake.
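
For instance, a worker-side check in table_beginscan_parallel might look
about like this -- a sketch only, assuming the shared scan descriptor
were extended to carry the leader's physical locator (the phs_locator
field is hypothetical; today the descriptor records only the table OID):

    /*
     * Hypothetical: verify that this backend's relcache agrees with the
     * leader about which physical relation the parallel scan is reading.
     * This would have caught the present bug, where the worker's relcache
     * still pointed at pg_attribute's pre-CLUSTER relfilenode.
     */
    Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));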

            regards, tom lane



Noah Misch <noah@leadboat.com> writes:
> Thanks for debugging that.  RestorePendingSyncs() also changes what
> RelationInitPhysicalAddr() puts in the relcache entry, so it needs to stay
> with RestoreRelationMap().  I'm attaching the fix I have in mind.

Ah.  No objections here.

            regards, tom lane