Re: BUG #18559: Crash after detaching a partition concurrently from another session - Mailing list pgsql-bugs

From Kuntal Ghosh
Subject Re: BUG #18559: Crash after detaching a partition concurrently from another session
Date
Msg-id CAGz5QC+yTwKXAYgvvmiPAwez4CVkwO==ECDK+3fMAJ8j9Lcbkw@mail.gmail.com
Whole thread Raw
Responses Re: BUG #18559: Crash after detaching a partition concurrently from another session
Re: BUG #18559: Crash after detaching a partition concurrently from another session
List pgsql-bugs
On Tue, Jul 30, 2024 at 7:18 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
Adding Alvaro.
>
> The following code assumes that an pg_class entry for the detached partition
> will always be available which is wrong.
>
Referring to the following code.
        /*
         * Two problems are possible here.  First, a concurrent ATTACH
         * PARTITION might be in the process of adding a new partition, but
         * the syscache doesn't have it, or its copy of it does not yet have
         * its relpartbound set.  We cannot just AcceptInvalidationMessages(),
         * because the other process might have already removed itself from
         * the ProcArray but not yet added its invalidation messages to the
         * shared queue.  We solve this problem by reading pg_class directly
         * for the desired tuple.
         *
         * The other problem is that DETACH CONCURRENTLY is in the process of
         * removing a partition, which happens in two steps: first it marks it
         * as "detach pending", commits, then unsets relpartbound.  If
         * find_inheritance_children_extended included that partition but we
         * below we see that DETACH CONCURRENTLY has reset relpartbound for
         * it, we'd see an inconsistent view.  (The inconsistency is seen
         * because table_open below reads invalidation messages.)  We protect
         * against this by retrying find_inheritance_children_extended().
         */
        if (boundspec == NULL)
        {
            Relation    pg_class;
            SysScanDesc scan;
            ScanKeyData key[1];
            Datum       datum;
            bool        isnull;

            pg_class = table_open(RelationRelationId, AccessShareLock);
            ScanKeyInit(&key[0],
                        Anum_pg_class_oid,
                        BTEqualStrategyNumber, F_OIDEQ,
                        ObjectIdGetDatum(inhrelid));
            scan = systable_beginscan(pg_class, ClassOidIndexId, true,
                                      NULL, 1, key);
            tuple = systable_getnext(scan);
            datum = heap_getattr(tuple, Anum_pg_class_relpartbound,
                                 RelationGetDescr(pg_class), &isnull);
            if (!isnull)
                boundspec = stringToNode(TextDatumGetCString(datum));
            systable_endscan(scan);
            table_close(pg_class, AccessShareLock);

IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.

--
Thanks & Regards,
Kuntal Ghosh

Attachment

pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18559: Crash after detaching a partition concurrently from another session
Next
From: semab tariq
Date:
Subject: Re: Installer initialization failed