Thread: Re: BUG #18559: Crash after detaching a partition concurrently from another session

On Tue, Jul 30, 2024 at 7:18 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
Adding Alvaro.
>
> The following code assumes that a pg_class entry for the detached partition
> will always be available which is wrong.
>
I'm referring to the following code:
        /*
         * Two problems are possible here.  First, a concurrent ATTACH
         * PARTITION might be in the process of adding a new partition, but
         * the syscache doesn't have it, or its copy of it does not yet have
         * its relpartbound set.  We cannot just AcceptInvalidationMessages(),
         * because the other process might have already removed itself from
         * the ProcArray but not yet added its invalidation messages to the
         * shared queue.  We solve this problem by reading pg_class directly
         * for the desired tuple.
         *
         * The other problem is that DETACH CONCURRENTLY is in the process of
         * removing a partition, which happens in two steps: first it marks it
         * as "detach pending", commits, then unsets relpartbound.  If
         * find_inheritance_children_extended included that partition but
         * below we see that DETACH CONCURRENTLY has reset relpartbound for
         * it, we'd see an inconsistent view.  (The inconsistency is seen
         * because table_open below reads invalidation messages.)  We protect
         * against this by retrying find_inheritance_children_extended().
         */
        if (boundspec == NULL)
        {
            Relation    pg_class;
            SysScanDesc scan;
            ScanKeyData key[1];
            Datum       datum;
            bool        isnull;

            pg_class = table_open(RelationRelationId, AccessShareLock);
            ScanKeyInit(&key[0],
                        Anum_pg_class_oid,
                        BTEqualStrategyNumber, F_OIDEQ,
                        ObjectIdGetDatum(inhrelid));
            scan = systable_beginscan(pg_class, ClassOidIndexId, true,
                                      NULL, 1, key);
            tuple = systable_getnext(scan);
            datum = heap_getattr(tuple, Anum_pg_class_relpartbound,
                                 RelationGetDescr(pg_class), &isnull);
            if (!isnull)
                boundspec = stringToNode(TextDatumGetCString(datum));
            systable_endscan(scan);
            table_close(pg_class, AccessShareLock);

IIUC, a simple fix would be to retry if an entry is not found. Attached is a patch.
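To illustrate the shape of that retry outside the backend, here is a minimal standalone C sketch; `catalog_lookup` and `build_boundspec` are hypothetical stand-ins for the pg_class index scan and the partition-descriptor build, not real backend functions:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pg_class index scan: the first attempt
 * finds no tuple (the partition was concurrently detached and dropped),
 * the second attempt sees a consistent catalog again. */
static int lookup_attempts = 0;

static bool catalog_lookup(unsigned oid, int *boundspec_out)
{
    (void) oid;
    lookup_attempts++;
    if (lookup_attempts < 2)
        return false;           /* no pg_class entry: we lost the race */
    *boundspec_out = 42;        /* placeholder for a parsed relpartbound */
    return true;
}

/* Shape of the fix: when the entry is missing, restart the whole
 * partition-list build instead of dereferencing a NULL tuple. */
static int build_boundspec(unsigned oid)
{
    int boundspec = 0;

    for (;;)
    {
        /* the real code re-runs find_inheritance_children_extended() here */
        if (catalog_lookup(oid, &boundspec))
            return boundspec;
        /* entry vanished underneath us; loop and retry */
    }
}
```

The point is only the control flow: a missing entry sends us back to rebuild the partition list rather than crashing on the absent tuple.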

--
Thanks & Regards,
Kuntal Ghosh

Attachment


Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote on Tue, Jul 30, 2024 at 21:52:
> IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.

I took a quick look at the attached patch. It looks good to me.

--
Tender Wang
On Tue, Jul 30, 2024 at 9:52 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.

I can reproduce the issue, and the patch LGTM.



--
Regards
Junwang Zhao



From: Alvaro Herrera from 2ndQuadrant
On 2024-Aug-10, Junwang Zhao wrote:

> > IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.
> 
> I can reproduce the issue, and the patch LGTM.

Interesting issue, thanks for reporting and putting together a
reproducer.  I have added some comments to the proposed patch, so here's
a v2 for it.  I'm going to write a commit message for it and push to all
branches since 14.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Experience tells us that man has peeled potatoes millions of times,
but it was necessary to admit the possibility that in one case out of
millions, the potatoes would peel the man" (Ijon Tichy)

Attachment

From: Alvaro Herrera from 2ndQuadrant
On 2024-Aug-12, Alvaro Herrera from 2ndQuadrant wrote:

> On 2024-Aug-10, Junwang Zhao wrote:
> 
> > > IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.
> > 
> > I can reproduce the issue, and the patch LGTM.
> 
> Interesting issue, thanks for reporting and putting together a
> reproducer.  I have added some comments to the proposed patch, so here's
> a v2 for it.  I'm going to write a commit message for it and push to all
> branches since 14.

Dept. of second thoughts.  I couldn't find any reason why it's okay to
dereference the return value from systable_getnext() without verifying
that it's not NULL, so I added the check to the older branches too.  As
far as I know we've never had a crash report that could be traced to
lack of that check, but that code still looks like it's assuming a
little too much.


One more thing here.  Applying the test scripts that I used for the
previous bug with addition of DROP after the DETACH CONCURRENTLY, I get
a different failure during planning, which reports this error:

ERROR:  could not open relation with OID 457639
STATEMENT:  select * from p where a = $1;

and changing that error (in relation_open) from ERROR to PANIC would
result in this backtrace:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f149c3cae8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f149c37bfb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f149c366472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000557db57b8611 in errfinish (filename=<optimized out>, lineno=61, funcname=0x557db5816128 <__func__.1> "relation_open") at ../../../../../../pgsql/source/master/src/backend/utils/error/elog.c:599
#5  0x0000557db52441ad in relation_open (relationId=17466, lockmode=lockmode@entry=1) at ../../../../../../pgsql/source/master/src/backend/access/common/relation.c:61
#6  0x0000557db537d1d9 in table_open (relationId=<optimized out>, lockmode=lockmode@entry=1) at ../../../../../../pgsql/source/master/src/backend/access/table/table.c:44
#7  0x0000557db55b0cf8 in expand_partitioned_rtentry (root=root@entry=0x557db9c8b530, relinfo=relinfo@entry=0x557db9c8c2d8, parentrte=parentrte@entry=0x557db9c8c178, parentRTindex=parentRTindex@entry=1, parentrel=parentrel@entry=0x7f149bc46060, parent_updatedCols=0x0, top_parentrc=0x0, lockmode=1) at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:390
#8  0x0000557db55b121f in expand_inherited_rtentry (root=root@entry=0x557db9c8b530, rel=0x557db9c8c2d8, rte=0x557db9c8c178, rti=rti@entry=1) at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:154
#9  0x0000557db558e11a in add_other_rels_to_query (root=root@entry=0x557db9c8b530) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/initsplan.c:214
#10 0x0000557db5591983 in query_planner (root=root@entry=0x557db9c8b530, qp_callback=qp_callback@entry=0x557db5593470 <standard_qp_callback>, qp_extra=qp_extra@entry=0x7ffd987c39e0) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planmain.c:268
#11 0x0000557db5597968 in grouping_planner (root=root@entry=0x557db9c8b530, tuple_fraction=<optimized out>, tuple_fraction@entry=0, setops=setops@entry=0x0) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:1520
#12 0x0000557db559a8f3 in subquery_planner (glob=glob@entry=0x557db9c8c758, parse=parse@entry=0x557db9c8bf68, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0, setops=setops@entry=0x0) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:1089
#13 0x0000557db559acea in standard_planner (parse=0x557db9c8bf68, query_string=<optimized out>, cursorOptions=2048, boundParams=0x0) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:415
#14 0x0000557db5677117 in pg_plan_query (querytree=querytree@entry=0x557db9c8bf68, query_string=query_string@entry=0x557db9cb34c8 "select * from p where a = $1;", cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:912
#15 0x0000557db5677273 in pg_plan_queries (querytrees=querytrees@entry=0x557db9d69268, query_string=0x557db9cb34c8 "select * from p where a = $1;", cursorOptions=2048, boundParams=boundParams@entry=0x0) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:1006
#16 0x0000557db57a39b3 in BuildCachedPlan (plansource=plansource@entry=0x557db9c5ff10, qlist=qlist@entry=0x557db9d69268, boundParams=boundParams@entry=0x0, queryEnv=queryEnv@entry=0x0) at ../../../../../../pgsql/source/master/src/backend/utils/cache/plancache.c:962
#17 0x0000557db57a4204 in GetCachedPlan (plansource=plansource@entry=0x557db9c5ff10, boundParams=boundParams@entry=0x557db9cc0778, owner=owner@entry=0x0, queryEnv=queryEnv@entry=0x0) at ../../../../../../pgsql/source/master/src/backend/utils/cache/plancache.c:1199
#18 0x0000557db56786c5 in exec_bind_message (input_message=0x7ffd987c3e70) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:2017
#19 PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:4814

Clearly this is undesirable, but I'm not sure how strongly we need to
pursue a fix for it.  It seems hard to forbid dropping the table after
the detach.  Is it enough to advise users to not drop partitions
immediately after detaching them?  Is this a sign of a more fundamental
problem in the mechanism for DETACH CONCURRENTLY when used together with
prepared statements?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)



On Tue, Aug 13, 2024 at 4:03 AM Alvaro Herrera from 2ndQuadrant
<alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Aug-12, Alvaro Herrera from 2ndQuadrant wrote:
>
> > On 2024-Aug-10, Junwang Zhao wrote:
> >
> > > > IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.
> > >
> > > I can reproduce the issue, and the patch LGTM.
> >
> > Interesting issue, thanks for reporting and putting together a
> > reproducer.  I have added some comments to the proposed patch, so here's
> > a v2 for it.  I'm going to write a commit message for it and push to all
> > branches since 14.
>
> Dept. of second thoughts.  I couldn't find any reason why it's okay to
> dereference the return value from systable_getnext() without verifying
> that it's not NULL, so I added the check to the older branches too.  As
> far as I know we've never had a crash report that could be traced to
> lack of that check, but that code still looks like it's assuming a
> little too much.
+1.

>
> One more thing here.  Applying the test scripts that I used for the
> previous bug with addition of DROP after the DETACH CONCURRENTLY, I get
> a different failure during planning, which reports this error:
>
> ERROR:  could not open relation with OID 457639
> STATEMENT:  select * from p where a = $1;
>
> [...]
> #5  0x0000557db52441ad in relation_open (relationId=17466, lockmode=lockmode@entry=1) at ../../../../../../pgsql/source/master/src/backend/access/common/relation.c:61
> #6  0x0000557db537d1d9 in table_open (relationId=<optimized out>, lockmode=lockmode@entry=1) at ../../../../../../pgsql/source/master/src/backend/access/table/table.c:44
> #7  0x0000557db55b0cf8 in expand_partitioned_rtentry (root=root@entry=0x557db9c8b530, relinfo=relinfo@entry=0x557db9c8c2d8, parentrte=parentrte@entry=0x557db9c8c178, parentRTindex=parentRTindex@entry=1, parentrel=parentrel@entry=0x7f149bc46060, parent_updatedCols=0x0, top_parentrc=0x0, lockmode=1) at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:390
That means that after prune_append_rel_partitions() returns the live
partitions, by the time the code tries to lock a child, it has already
been dropped.

I'm able to reproduce the issue with the same steps, with a small tweak:

Session 1:
1. Continue till DetachPartitionFinalize.

Session 2:
1. Continue till expand_partitioned_rtentry(). It'll find two live
partitions after calling prune_append_rel_partitions().

Session 1:
1. Run to completion.
2. SQL: drop table p2;

Session 2:
1. Continue

The table_open() call will then throw the "could not open relation with OID" error.

The function find_inheritance_children_extended() handles a missing
partition as follows:
            /* Get the lock to synchronize against concurrent drop */
            LockRelationOid(inhrelid, lockmode);

            /*
             * Now that we have the lock, double-check to see if the relation
             * really exists or not.  If not, assume it was dropped while we
             * waited to acquire lock, and ignore it.
             */
            if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(inhrelid)))
            {
                /* Release useless lock */
                UnlockRelationOid(inhrelid, lockmode);
                /* And ignore this relation */
                continue;
            }
However, a similar check is not present in expand_partitioned_rtentry().
Introducing the same check would fix the issue. But I don't know how it
affects the pruning part, as this partition couldn't be pruned earlier,
which is why we're opening the child partition.

The key points I see are:
1. The find_inheritance_children_extended() function includes a
partition that's being detached based on the current snapshot.
2. Later in the code path, the expand_partitioned_rtentry function
takes a heavy-weight lock on the partitions.

Between these two steps, some code paths expect the child partition to
exist in the syscache, which leads to errors or crashes.
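As a standalone illustration of the lock-then-recheck pattern quoted above (the helper names here are hypothetical stand-ins; the real code locks with LockRelationOid() and rechecks via SearchSysCacheExists1()):

```c
#include <stdbool.h>

#define NCHILDREN 3

/* Simulated catalog, indexed by OID: true means the relation still exists;
 * OID 2 plays the child that was concurrently dropped. */
static bool rel_exists[NCHILDREN + 1] = { false, true, false, true };

/* Hypothetical stand-ins for LockRelationOid()/UnlockRelationOid(); in the
 * backend, acquiring the lock is what waits out a concurrent DROP. */
static void lock_relation(unsigned oid)   { (void) oid; }
static void unlock_relation(unsigned oid) { (void) oid; }

/* Lock-then-recheck, as find_inheritance_children_extended() does: a child
 * that no longer exists after the lock is acquired was dropped while we
 * waited, so release the useless lock and ignore it. */
static int collect_live_children(const unsigned *oids, int n, unsigned *out)
{
    int kept = 0;

    for (int i = 0; i < n; i++)
    {
        lock_relation(oids[i]);
        if (!rel_exists[oids[i]])
        {
            unlock_relation(oids[i]);
            continue;           /* concurrently dropped: ignore */
        }
        out[kept++] = oids[i];
    }
    return kept;
}
```

The open question above is precisely that expand_partitioned_rtentry() takes the lock but has no equivalent of the existence recheck before opening the child.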

--
Thanks & Regards,
Kuntal Ghosh



From: Alvaro Herrera from 2ndQuadrant
On 2024-Aug-13, Kuntal Ghosh wrote:

> That means - after getting the live partitions from
> prune_append_rel_partitions(), by the time the code tries to lock a
> child, it's already dropped.

Right.

> However, similar check is not there in expand_partitioned_rtentry().
> Introducing the same check will fix the issue. But, I don't know how
> it affects the pruning part as this partition couldn't be pruned
> earlier and that's why we're opening the child partition.

Hmm, we could just remove the partition from the set of live partitions
-- then it should behave the same as if the partition had been pruned.
Something like the attached, perhaps.
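The idea can be sketched standalone with a plain bitmask standing in for the live_parts Bitmapset (the names here are hypothetical, not the attached patch):

```c
#include <stdbool.h>

/* Hypothetical sketch: the set of surviving partition indexes as a plain
 * bitmask; the backend tracks this in a Bitmapset. */
typedef unsigned live_set;

/* Stand-in for trying to open the child relation; false means it was
 * concurrently dropped after pruning had already chosen it. */
static bool try_open_child(unsigned idx, bool still_exists)
{
    (void) idx;
    return still_exists;
}

/* If the child vanished between pruning and locking, delete it from the
 * live set so the planner treats it exactly as if it had been pruned. */
static live_set open_or_prune(live_set live, unsigned idx, bool still_exists)
{
    if (!try_open_child(idx, still_exists))
        live &= ~(1u << idx);   /* behave as if pruned */
    return live;
}
```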

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)

Attachment