Re: BUG #18815: Logical replication worker Segmentation fault - Mailing list pgsql-bugs
From | Sergey Belyashov |
---|---|
Subject | Re: BUG #18815: Logical replication worker Segmentation fault |
Date | |
Msg-id | CAOe0RDwUeZduRUcD1N=BcAk5z3ANPpdyZtr4qNjiY6fPQu=sDw@mail.gmail.com |
In response to | Re: BUG #18815: Logical replication worker Segmentation fault (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: BUG #18815: Logical replication worker Segmentation fault |
List | pgsql-bugs |
Hi,

Do I need to apply this patch for debugging purposes?

I want to remove the BRIN indexes from the active partitions and start replication. Once the issue is fixed, I will add the BRIN indexes back.

Best regards,
Sergey Belyashov

On Tue, 18 Feb 2025 at 02:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Further to this ... I'd still really like to have a reproducer.
> > While brininsertcleanup is clearly being less robust than it should
> > be, I now suspect that there is another bug somewhere further down
> > the call stack. We're getting to this point via ExecCloseIndices,
> > and that should be paired with ExecOpenIndices, and that would have
> > created a fresh IndexInfo. So it looks a lot like some path in a
> > logrep worker is able to call ExecCloseIndices twice on the same
> > working data. That would probably lead to a "releasing a lock you
> > don't own" error if we weren't hitting this crash first.
>
> Hmm ... I tried modifying ExecCloseIndices to blow up if called
> twice, as in the attached. This gets through core regression
> just fine, but it blows up in three different subscription TAP
> tests, all with a stack trace matching Sergey's:
>
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007f064bfe3e65 in __GI_abort () at abort.c:79
> #2 0x00000000009e9253 in ExceptionalCondition (
>     conditionName=conditionName@entry=0xb8717b "indexDescs[i] != NULL",
>     fileName=fileName@entry=0xb87139 "execIndexing.c",
>     lineNumber=lineNumber@entry=249) at assert.c:66
> #3 0x00000000006f0b13 in ExecCloseIndices (
>     resultRelInfo=resultRelInfo@entry=0x2f11c18) at execIndexing.c:249
> #4 0x00000000006f86d8 in ExecCleanupTupleRouting (mtstate=0x2ef92d8,
>     proute=0x2ef94e8) at execPartition.c:1273
> #5 0x0000000000848cb6 in finish_edata (edata=0x2ef8f50) at worker.c:717
> #6 0x000000000084d0a0 in apply_handle_insert (s=<optimized out>)
>     at worker.c:2460
> #7 apply_dispatch (s=<optimized out>) at worker.c:3389
> #8 0x000000000084e494 in LogicalRepApplyLoop (last_received=25066600)
>     at worker.c:3680
> #9 start_apply (origin_startpos=0) at worker.c:4507
> #10 0x000000000084e711 in run_apply_worker () at worker.c:4629
> #11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:4798
> #12 0x00000000008138f9 in BackgroundWorkerMain (startup_data=<optimized out>,
>     startup_data_len=<optimized out>) at bgworker.c:842
>
> The problem seems to be that apply_handle_insert_internal does
> ExecOpenIndices and then ExecCloseIndices, and then
> ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
> explains why brininsertcleanup blows up if you happen to have a BRIN
> index involved. What it doesn't explain is how come we don't see
> other symptoms from the duplicate index_close calls, regardless of
> index type. I'd have expected an assertion failure from
> RelationDecrementReferenceCount, and/or an assertion failure for
> nonzero rd_refcnt at transaction end, and/or a "you don't own a lock
> of type X" gripe from LockRelease. We aren't getting any of those,
> but why not, if this code is as broken as I think it is?
>
> (On closer inspection, we seem to have about 99% broken relcache.c's
> ability to notice rd_refcnt being nonzero at transaction end, but
> the other two things should still be happening.)
>
> regards, tom lane
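
The open/close/close sequence described in the quoted message can be illustrated with a small standalone C sketch. It does not use any PostgreSQL headers: ToyIndex, ToyResultRel, toy_open_indices and toy_close_indices are hypothetical names invented for illustration, and clearing each slot to NULL after closing it is only an assumption about the general shape of such a check, not the attached patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INDEXES 4

/* Hypothetical stand-in for an open index relation (not PostgreSQL's Relation). */
typedef struct ToyIndex
{
    int refcount;               /* loosely models rd_refcnt */
} ToyIndex;

/* Hypothetical stand-in for the per-relation index bookkeeping. */
typedef struct ToyResultRel
{
    int       nindexes;
    ToyIndex *descs[MAX_INDEXES];
} ToyResultRel;

/* "Open": take one reference on each index and remember its descriptor. */
void
toy_open_indices(ToyResultRel *rel, ToyIndex *catalog, int n)
{
    if (n > MAX_INDEXES)
        n = MAX_INDEXES;
    rel->nindexes = n;
    for (int i = 0; i < n; i++)
    {
        catalog[i].refcount++;
        rel->descs[i] = &catalog[i];
    }
}

/*
 * "Close": drop the reference and clear the slot.  Returns how many slots
 * were already NULL, i.e. how many indexes a second call tried to close
 * again.  A debug build could assert instead of counting.
 */
int
toy_close_indices(ToyResultRel *rel)
{
    int already_closed = 0;

    for (int i = 0; i < rel->nindexes; i++)
    {
        if (rel->descs[i] == NULL)
        {
            already_closed++;   /* second close of the same working data */
            continue;
        }
        rel->descs[i]->refcount--;
        rel->descs[i] = NULL;   /* mark closed so a repeat call is detectable */
    }
    return already_closed;
}
```

In this toy model, the first toy_close_indices call after toy_open_indices reports zero already-closed slots, while a second call on the same ToyResultRel reports every slot. Without the NULL marker, the second call would decrement each refcount a second time, which is the kind of symptom the message expected to see from the duplicate index_close calls.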