Re: BUG #18559: Crash after detaching a partition concurrently from another session - Mailing list pgsql-bugs
From | Kuntal Ghosh |
---|---|
Subject | Re: BUG #18559: Crash after detaching a partition concurrently from another session |
Date | |
Msg-id | CAGz5QCKGAhdN_Aq109zRAv44nJfZw2Wc6Fx22EZZ-UmX-WGFVg@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #18559: Crash after detaching a partition concurrently from another session (Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org>) |
Responses |
Re: BUG #18559: Crash after detaching a partition concurrently from another session
|
List | pgsql-bugs |
On Tue, Aug 13, 2024 at 4:03 AM Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> wrote: > > On 2024-Aug-12, Alvaro Herrera from 2ndQuadrant wrote: > > > On 2024-Aug-10, Junwang Zhao wrote: > > > > > > IIUC, a simple fix would be to retry if an entry is not found. Attached a patch. > > > > > > I can reproduce the issue, and the patch LGTM. > > > > Interesting issue, thanks for reporting and putting together a > > reproducer. I have added some comments to the proposed patch, so here's > > a v2 for it. I'm going to write a commit message for it and push to all > > branches since 14. > > Dept. of second thoughts. I couldn't find any reason why it's okay to > dereference the return value from systable_getnext() without verifying > that it's not NULL, so I added the check to the older branches too. As > far as I know we've never had a crash report that could be traced to > lack of that check, but that code still looks like it's assuming a > little too much. +1. > > One more thing here. Applying the test scripts that I used for the > previous bug with addition of DROP after the DETACH CONCURRENTLY, I get > a different failure during planning, which reports this error: > > ERROR: could not open relation with OID 457639 > STATEMENT: select * from p where a = $1; > > #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 > #1 0x00007f149c3cae8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 > #2 0x00007f149c37bfb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 > #3 0x00007f149c366472 in __GI_abort () at ./stdlib/abort.c:79 > #4 0x0000557db57b8611 in errfinish (filename=<optimized out>, lineno=61, funcname=0x557db5816128 <__func__.1> "relation_open") > at ../../../../../../pgsql/source/master/src/backend/utils/error/elog.c:599 > #5 0x0000557db52441ad in relation_open (relationId=17466, lockmode=lockmode@entry=1) > at ../../../../../../pgsql/source/master/src/backend/access/common/relation.c:61 > #6 0x0000557db537d1d9 in table_open (relationId=<optimized out>, lockmode=lockmode@entry=1) > at ../../../../../../pgsql/source/master/src/backend/access/table/table.c:44 > #7 0x0000557db55b0cf8 in expand_partitioned_rtentry (root=root@entry=0x557db9c8b530, relinfo=relinfo@entry=0x557db9c8c2d8, > parentrte=parentrte@entry=0x557db9c8c178, parentRTindex=parentRTindex@entry=1, parentrel=parentrel@entry=0x7f149bc46060,parent_updatedCols=0x0, > top_parentrc=0x0, lockmode=1) at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:390 That means - after getting the live partitions from prune_append_rel_partitions(), by the time the code tries to lock a child, it's already dropped. Able to reproduce the issue with same steps with a small tweak, Session 1: 1. Continue till DetachPartitionFinalize. Session 2: 1. Continue till expand_partitioned_rtentry(). It'll find two live partition after calling prune_append_rel_partitions(). Session 1: 1. Run to completion. 2. SQL: drop table p2; Session 2: 1. Continue The table_open will thrown the error - "could not open relation with OID". The function find_inheritance_children_extended deals with the missing partition as following: /* Get the lock to synchronize against concurrent drop */ LockRelationOid(inhrelid, lockmode); /* * Now that we have the lock, double-check to see if the relation * really exists or not. If not, assume it was dropped while we * waited to acquire lock, and ignore it. */ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(inhrelid))) { /* Release useless lock */ UnlockRelationOid(inhrelid, lockmode); /* And ignore this relation */ continue; } However, similar check is not there in expand_partitioned_rtentry(). Introducing the same check will fix the issue. But, I don't know how it affects the pruning part as this partition couldn't be pruned earlier and that's why we're opening the child partition. The key points I see are: 1. The find_inheritance_children_extended() function includes a partition that's being detached based on the current snapshot. 2. Later in the code path, the expand_partitioned_rtentry function takes a heavy-weight lock on the partitions. Between these two steps, some code paths expect the child partition to exist in the syscache, which leads to errors or crashes. -- Thanks & Regards, Kuntal Ghosh
pgsql-bugs by date: