Thread: BUG #18815: Logical replication worker Segmentation fault
The following bug has been logged on the website:

Bug reference:      18815
Logged by:          Sergey Belyashov
Email address:      sergey.belyashov@gmail.com
PostgreSQL version: 17.3
Operating system:   Debian bookworm x86_64
Description:

Today I tried to upgrade my cluster from postgresql-16 to postgresql-17.
Everything was successful until I restored some of the logical replication
subscriptions.  When such a subscription is activated and the first data
arrives, the server logs:

2025-02-17 13:34:08.975 [98417] LOG:  logical replication apply worker for subscription "node4_closed_sessions_sub" has started
2025-02-17 13:34:11.213 [62583] LOG:  background worker "logical replication apply worker" (PID 98417) was terminated by signal 11: Segmentation fault
2025-02-17 13:34:11.213 [62583] LOG:  terminating any other active server processes
2025-02-17 13:34:11.240 [62583] LOG:  all server processes terminated; reinitializing
2025-02-17 13:34:11.310 [98418] LOG:  database system was interrupted; last known up at 2025-02-17 13:22:08

and then the server restarts.  The kernel logged the following:

[94740743.468001] postgres[98417]: segfault at 10 ip 0000562b2b74d69c sp 00007fff284a7320 error 4 in postgres[562b2b6bb000+595000]
[94740743.468173] Code: 1f 80 00 00 00 00 44 89 e0 48 8b 15 56 0b 82 00 f7 d0 48 98 4c 8b 3c c2 eb 99 0f 1f 40 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 <8b> 7f 10 e8 4c b1 32 00 8b 7b 14 85 ff 75 15 48 89 df 48 8b 5d f8

After some investigation I found that the segfault is caused by one type of
subscription: a subscription for huge partitioned tables on both publisher
and subscriber (replicated via the root table), created with
copy_data = false (the source table is updated only by inserts and partition
detaches, it is huge, and the data transfer is not compressed, so an initial
copy could take days).  The segfault does not occur immediately after the
subscription is created; it happens when data arrives from the publisher.
The subscriber then restarts, recovers, starts the subscription again, hits
the segfault again, and repeats this cycle until the subscription is
disabled.  Subscriptions for small, non-partitioned tables work fine.
The publisher version makes no difference: both version 16 and version 17
publishers cause the segfault on the 17.3 subscriber.  PostgreSQL versions
12 through 16 have worked for years without any segfault with the same
partitioned tables and publications/subscriptions.

postgresql-17=17.3-3.pgdg120+1 was installed from the repository:
http://apt.postgresql.org/pub/repos/apt/ bookworm-pgdg main
PG Bug reporting form <noreply@postgresql.org> writes:
> After some investigation I found that the segfault is caused by one type
> of subscription: a subscription for huge partitioned tables on both
> publisher and subscriber (replicated via the root table), created with
> copy_data = false (the source table is updated only by inserts and
> partition detaches, it is huge, and the data transfer is not compressed,
> so an initial copy could take days).  The segfault does not occur
> immediately after the subscription is created; it happens when data
> arrives from the publisher.  The subscriber then restarts, recovers,
> starts the subscription again, hits the segfault again, and repeats this
> cycle until the subscription is disabled.

This is not enough information for anyone else to reproduce the
problem; it very likely depends on details that you haven't mentioned.
Can you create a reproducer case?  I'm hoping for a script that sets
up the necessary tables and subscriptions and populates the tables
with enough dummy data to cause the failure.

Something that might be less work for you is to get a stack trace
from the crash:

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

However, I make no promises that we can isolate the cause from
just a stack trace.  A reproducer would be much better.

			regards, tom lane
Hi,

I think a backtrace will help.

Core was generated by `postgres: 17/main: logical replication apply worker for subscription 602051860'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
    at ./build/../src/backend/access/brin/brin_revmap.c:102
(gdb) backtrace
#0  0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
    at ./build/../src/backend/access/brin/brin_revmap.c:102
#1  0x00005635402bfddd in brininsertcleanup (index=<optimized out>, indexInfo=<optimized out>)
    at ./build/../src/backend/access/brin/brin.c:515
#2  0x0000563540479309 in ExecCloseIndices (resultRelInfo=resultRelInfo@entry=0x563541cab8d0)
    at ./build/../src/backend/executor/execIndexing.c:248
#3  0x000056354048067f in ExecCleanupTupleRouting (mtstate=0x563541c6db58, proute=0x563541cab638)
    at ./build/../src/backend/executor/execPartition.c:1270
#4  0x00005635405c89f7 in finish_edata (edata=0x563541ca0fa8)
    at ./build/../src/backend/replication/logical/worker.c:718
#5  0x00005635405cc6c4 in apply_handle_insert (s=0x7f61d2a3a1d8)
    at ./build/../src/backend/replication/logical/worker.c:2438
#6  apply_dispatch (s=s@entry=0x7ffd30d95a70)
    at ./build/../src/backend/replication/logical/worker.c:3296
#7  0x00005635405cdb7f in LogicalRepApplyLoop (last_received=106949425100872)
    at ./build/../src/backend/replication/logical/worker.c:3587
#8  start_apply (origin_startpos=origin_startpos@entry=0)
    at ./build/../src/backend/replication/logical/worker.c:4429
#9  0x00005635405ce11f in run_apply_worker ()
    at ./build/../src/backend/replication/logical/worker.c:4550
#10 ApplyWorkerMain (main_arg=<optimized out>)
    at ./build/../src/backend/replication/logical/worker.c:4719
#11 0x0000563540594bf8 in BackgroundWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>)
    at ./build/../src/backend/postmaster/bgworker.c:848
#12 0x0000563540596daa in postmaster_child_launch (child_type=child_type@entry=B_BG_WORKER, startup_data=startup_data@entry=0x563541bc3618 "", startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0)
    at ./build/../src/backend/postmaster/launch_backend.c:277
#13 0x0000563540598f88 in do_start_bgworker (rw=0x563541bc3618)
    at ./build/../src/backend/postmaster/postmaster.c:4272
#14 maybe_start_bgworkers ()
    at ./build/../src/backend/postmaster/postmaster.c:4503
#15 0x0000563540599fea in process_pm_pmsignal ()
    at ./build/../src/backend/postmaster/postmaster.c:3776
#16 ServerLoop ()
    at ./build/../src/backend/postmaster/postmaster.c:1669
#17 0x000056354059c7c8 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x563541b229c0)
    at ./build/../src/backend/postmaster/postmaster.c:1374
#18 0x00005635402bf5b1 in main (argc=5, argv=0x563541b229c0)
    at ./build/../src/backend/main/main.c:199

The destination (subscriber) table has two timestamp columns, "started" and
"closed", with a BRIN index on each of them.  The table is partitioned by
range on the "closed" column, and each partition is split into 6
subpartitions by list (on the remainder of the id).

Best regards,
Sergey Belyashov

On Mon, Feb 17, 2025 at 19:39 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > After some investigation I found that the segfault is caused by one type
> > of subscription: a subscription for huge partitioned tables on both
> > publisher and subscriber (replicated via the root table), created with
> > copy_data = false (the source table is updated only by inserts and
> > partition detaches, it is huge, and the data transfer is not compressed,
> > so an initial copy could take days).  The segfault does not occur
> > immediately after the subscription is created; it happens when data
> > arrives from the publisher.  The subscriber then restarts, recovers,
> > starts the subscription again, hits the segfault again, and repeats this
> > cycle until the subscription is disabled.
>
> This is not enough information for anyone else to reproduce the
> problem; it very likely depends on details that you haven't mentioned.
> Can you create a reproducer case?  I'm hoping for a script that sets
> up the necessary tables and subscriptions and populates the tables
> with enough dummy data to cause the failure.
>
> Something that might be less work for you is to get a stack trace
> from the crash:
>
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
> However, I make no promises that we can isolate the cause from
> just a stack trace.  A reproducer would be much better.
>
>                         regards, tom lane
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
> I think a backtrace will help.

> Core was generated by `postgres: 17/main: logical replication apply
> worker for subscription 602051860'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
>     at ./build/../src/backend/access/brin/brin_revmap.c:102
> (gdb) backtrace
> #0  0x00005635402c869c in brinRevmapTerminate (revmap=0x0)
>     at ./build/../src/backend/access/brin/brin_revmap.c:102
> #1  0x00005635402bfddd in brininsertcleanup (index=<optimized out>,
>     indexInfo=<optimized out>)
>     at ./build/../src/backend/access/brin/brin.c:515
> #2  0x0000563540479309 in ExecCloseIndices
>     (resultRelInfo=resultRelInfo@entry=0x563541cab8d0)
>     at ./build/../src/backend/executor/execIndexing.c:248

Thanks!  It seems clear from that that the fault is basically in
brininsertcleanup(), which is trashing the BrinInsertState but leaving
indexInfo->ii_AmCache still pointing at it, so that the next
brininsert() will think it has a valid cache entry.

I suspect that the attached will fix it.  What I don't understand
is why it's apparently so hard to trigger the crash, because it
looks to me like any two successive insert commands on the same
BRIN index should hit this.

BTW, I'm also a bit suspicious of the comment's claim that the
brinDesc doesn't need cleanup.  That looks like a potential
memory leak.

			regards, tom lane

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4265687afa..60320440fc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -511,16 +511,18 @@ brininsertcleanup(Relation index, IndexInfo *indexInfo)
 	BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;
 
 	/* bail out if cache not initialized */
-	if (indexInfo->ii_AmCache == NULL)
+	if (bistate == NULL)
 		return;
 
+	/* do this first to avoid dangling pointer if we fail partway through */
+	indexInfo->ii_AmCache = NULL;
+
 	/*
 	 * Clean up the revmap. Note that the brinDesc has already been cleaned up
 	 * as part of its own memory context.
 	 */
 	brinRevmapTerminate(bistate->bis_rmAccess);
-	bistate->bis_rmAccess = NULL;
-	bistate->bis_desc = NULL;
+	pfree(bistate);
 }
 
 /*
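As an illustration of the hazard described above, here is a self-contained
model (a sketch for this writeup, not PostgreSQL source; the struct and
function names only loosely mirror brin.c) of a cleanup routine that guts a
cached struct while leaving the cache pointer set.  Any later code that
trusts the cache, whether another insert or another cleanup pass, then
dereferences a NULL field, which is the same shape as the
brinRevmapTerminate(revmap=0x0) crash in the backtrace:

#include <stdio.h>
#include <stdlib.h>

/* Reduced stand-ins for the real structs (illustration only). */
typedef struct Revmap { int nblocks; } Revmap;

typedef struct InsertState
{
    Revmap *rmAccess;           /* revmap access state, like bis_rmAccess */
} InsertState;

typedef struct IndexInfo
{
    void   *amCache;            /* like ii_AmCache: trusted while non-NULL */
} IndexInfo;

/* Models brinRevmapTerminate(), which dereferences its argument at once. */
static void
revmap_terminate(Revmap *revmap)
{
    if (revmap == NULL)
    {
        printf("would segfault here: revmap is NULL\n");
        return;
    }
    free(revmap);
}

/* Pre-fix cleanup: guts the cached state but leaves amCache pointing at it. */
static void
cleanup_buggy(IndexInfo *indexInfo)
{
    InsertState *state = indexInfo->amCache;

    if (state == NULL)
        return;
    revmap_terminate(state->rmAccess);
    state->rmAccess = NULL;     /* amCache still set -> stale on next use */
}

/* Post-fix cleanup: clear the cache pointer first, free the whole state. */
static void
cleanup_fixed(IndexInfo *indexInfo)
{
    InsertState *state = indexInfo->amCache;

    if (state == NULL)
        return;
    indexInfo->amCache = NULL;  /* nothing can trust the stale state now */
    revmap_terminate(state->rmAccess);
    free(state);
}

static void
fake_first_insert(IndexInfo *indexInfo)
{
    /* As if a first insert had built and cached its working state. */
    InsertState *state = malloc(sizeof(InsertState));

    state->rmAccess = malloc(sizeof(Revmap));
    state->rmAccess->nblocks = 1;
    indexInfo->amCache = state;
}

int
main(void)
{
    IndexInfo   ii = {NULL};

    fake_first_insert(&ii);
    cleanup_buggy(&ii);         /* first cleanup: fine */
    cleanup_buggy(&ii);         /* second cleanup: hits the NULL revmap */
    free(ii.amCache);           /* tidy up what the buggy version left over */

    fake_first_insert(&ii);
    cleanup_fixed(&ii);         /* with the fix ... */
    cleanup_fixed(&ii);         /* ... a repeated cleanup is a harmless no-op */
    return 0;
}

The patch above takes the second approach: clear ii_AmCache before doing
anything that could fail, and pfree the state rather than leaving a gutted
shell behind it.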
I wrote:
> I suspect that the attached will fix it.  What I don't understand
> is why it's apparently so hard to trigger the crash, because it
> looks to me like any two successive insert commands on the same
> BRIN index should hit this.

Oh, wait: I was confusing ii_AmCache with rd_amcache in the index's
relcache entry.  This coding would absolutely not work with rd_amcache
since that's persistent.  It mostly works with the IndexInfo field
though, since an IndexInfo typically only survives per-query.
Evidently there's some path in logical replication that will re-use
an IndexInfo across multiple distinct insertion operations, and
that's what breaks it.

> BTW, I'm also a bit suspicious of the comment's claim that the
> brinDesc doesn't need cleanup.  That looks like a potential
> memory leak.

This concern still stands.

			regards, tom lane
Further to this ... I'd still really like to have a reproducer.
While brininsertcleanup is clearly being less robust than it should
be, I now suspect that there is another bug somewhere further down
the call stack.  We're getting to this point via ExecCloseIndices,
and that should be paired with ExecOpenIndices, and that would have
created a fresh IndexInfo.  So it looks a lot like some path in a
logrep worker is able to call ExecCloseIndices twice on the same
working data.  That would probably lead to a "releasing a lock you
don't own" error if we weren't hitting this crash first.

			regards, tom lane
I wrote:
> Further to this ... I'd still really like to have a reproducer.
> While brininsertcleanup is clearly being less robust than it should
> be, I now suspect that there is another bug somewhere further down
> the call stack.  We're getting to this point via ExecCloseIndices,
> and that should be paired with ExecOpenIndices, and that would have
> created a fresh IndexInfo.  So it looks a lot like some path in a
> logrep worker is able to call ExecCloseIndices twice on the same
> working data.  That would probably lead to a "releasing a lock you
> don't own" error if we weren't hitting this crash first.

Hmm ... I tried modifying ExecCloseIndices to blow up if called
twice, as in the attached.  This gets through core regression
just fine, but it blows up in three different subscription TAP
tests, all with a stack trace matching Sergey's:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f064bfe3e65 in __GI_abort () at abort.c:79
#2  0x00000000009e9253 in ExceptionalCondition (
    conditionName=conditionName@entry=0xb8717b "indexDescs[i] != NULL",
    fileName=fileName@entry=0xb87139 "execIndexing.c",
    lineNumber=lineNumber@entry=249) at assert.c:66
#3  0x00000000006f0b13 in ExecCloseIndices (
    resultRelInfo=resultRelInfo@entry=0x2f11c18) at execIndexing.c:249
#4  0x00000000006f86d8 in ExecCleanupTupleRouting (mtstate=0x2ef92d8,
    proute=0x2ef94e8) at execPartition.c:1273
#5  0x0000000000848cb6 in finish_edata (edata=0x2ef8f50) at worker.c:717
#6  0x000000000084d0a0 in apply_handle_insert (s=<optimized out>)
    at worker.c:2460
#7  apply_dispatch (s=<optimized out>) at worker.c:3389
#8  0x000000000084e494 in LogicalRepApplyLoop (last_received=25066600)
    at worker.c:3680
#9  start_apply (origin_startpos=0) at worker.c:4507
#10 0x000000000084e711 in run_apply_worker () at worker.c:4629
#11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:4798
#12 0x00000000008138f9 in BackgroundWorkerMain (startup_data=<optimized out>,
    startup_data_len=<optimized out>) at bgworker.c:842

The problem seems to be that apply_handle_insert_internal does
ExecOpenIndices and then ExecCloseIndices, and then
ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
explains why brininsertcleanup blows up if you happen to have a BRIN
index involved.  What it doesn't explain is how come we don't see
other symptoms from the duplicate index_close calls, regardless of
index type.  I'd have expected an assertion failure from
RelationDecrementReferenceCount, and/or an assertion failure for
nonzero rd_refcnt at transaction end, and/or a "you don't own a lock
of type X" gripe from LockRelease.  We aren't getting any of those,
but why not, if this code is as broken as I think it is?

(On closer inspection, we seem to have about 99% broken relcache.c's
ability to notice rd_refcnt being nonzero at transaction end, but
the other two things should still be happening.)

			regards, tom lane

diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c3..a264a2edbc 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -246,14 +246,15 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
 
 	for (i = 0; i < numIndices; i++)
 	{
-		if (indexDescs[i] == NULL)
-			continue;			/* shouldn't happen? */
+		Assert(indexDescs[i] != NULL);
 
 		/* Give the index a chance to do some post-insert cleanup */
 		index_insert_cleanup(indexDescs[i], indexInfos[i]);
 
 		/* Drop lock acquired by ExecOpenIndices */
 		index_close(indexDescs[i], RowExclusiveLock);
+
+		indexDescs[i] = NULL;
 	}
 
 	/*
I wrote:
> The problem seems to be that apply_handle_insert_internal does
> ExecOpenIndices and then ExecCloseIndices, and then
> ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
> explains why brininsertcleanup blows up if you happen to have a BRIN
> index involved.  What it doesn't explain is how come we don't see
> other symptoms from the duplicate index_close calls, regardless of
> index type.

Hah, I see it: this bug is specific to the apply_handle_tuple_routing
code path.  The ResultRelInfo we're dealing with is made by
ExecFindPartition, which does ExecOpenIndices on it.  Then
apply_handle_tuple_routing calls apply_handle_insert_internal, which
does ExecOpenIndices *again* on the same ResultRelInfo.  That gets a
second refcount and second lock on the index(es), and leaks all the
IndexInfo data structures made by the first call.  When done,
apply_handle_insert_internal does ExecCloseIndices, releasing one
refcount and lock.  Then, back in apply_handle_tuple_routing, we do
ExecCloseIndices again.  So the refcounts and locks balance out, and
it very accidentally fails to crash, so long as nobody is expecting
the contents of the IndexInfos to match up.

The "Move the tuple into the new partition" path in
apply_handle_tuple_routing's UPDATE case is even squirrelier.  That
does another ExecFindPartition and then apply_handle_insert_internal,
but there's no ExecCloseIndices call balancing the ExecFindPartition,
meaning it looks like it's leaking refcount and lock.  Experimentation
shows that it's not, because the matching ExecCloseIndices is
eventually done by the ExecCleanupTupleRouting call in finish_edata
after returning to the outer apply_handle_update.

So if you ask me, this is impossibly convoluted and in need of a
significant rewrite.  It's not okay to be opening/closing the
ResultRelInfo's indexes twice (not least because "speculative" is
different in the two open calls).  And it's not okay to have such
radically different paths for cleaning up the index-opening done by
ExecFindPartition.  It's even more not okay that there's not one word
of commentary about this complexity, strongly suggesting that the
original author didn't quite understand how it worked either.

			regards, tom lane
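To make the accounting in that description concrete, here is a toy model
(again a sketch, not the executor code: a counter stands in for the index
lock/refcount bookkeeping, and a flag on the IndexInfo stands in for
ii_AmCache) of the open/open/close/close sequence.  The locks balance out
to zero, which is why nothing else complains, but the per-IndexInfo cleanup
hook runs twice on the same object while the first IndexInfo is leaked:

#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct IndexInfo
{
    int         am_cache_valid; /* stand-in for ii_AmCache != NULL */
} IndexInfo;

typedef struct ResultRelInfo
{
    IndexInfo  *indexInfo;      /* whatever the most recent "open" installed */
    int         locks_held;     /* stand-in for index refcount + lock count */
    int         cleanups_run;   /* how many times index cleanup has run */
} ResultRelInfo;

static IndexInfo infoA, infoB;  /* two independently built IndexInfos */

static void
open_indices(ResultRelInfo *rri, IndexInfo *fresh)
{
    fresh->am_cache_valid = 1;  /* as if an insert had populated the cache */
    rri->indexInfo = fresh;     /* overwrites (and so leaks) any previous one */
    rri->locks_held++;
}

static void
close_indices(ResultRelInfo *rri)
{
    /* like index_insert_cleanup(): a second pass sees gutted cached state */
    if (rri->indexInfo->am_cache_valid)
        rri->indexInfo->am_cache_valid = 0;
    else
        printf("cleanup ran twice on the same IndexInfo -> the BRIN crash\n");
    rri->cleanups_run++;
    rri->locks_held--;          /* like index_close() releasing the lock */
}

int
main(void)
{
    ResultRelInfo rri;

    memset(&rri, 0, sizeof(rri));

    open_indices(&rri, &infoA); /* ExecFindPartition */
    open_indices(&rri, &infoB); /* apply_handle_insert_internal: infoA leaked */
    close_indices(&rri);        /* apply_handle_insert_internal */
    close_indices(&rri);        /* ExecCleanupTupleRouting via finish_edata */

    /* Locks and refcounts balance, so those sanity checks stay quiet ... */
    assert(rri.locks_held == 0);
    /* ... but the surviving IndexInfo's cleanup has now run twice, and
     * infoA's cleanup never ran at all. */
    printf("locks_held=%d, cleanups_run=%d\n",
           rri.locks_held, rri.cleanups_run);
    return 0;
}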
Hi,

Do I need to apply this patch for debugging purposes?  I want to remove
the BRIN indexes from the active partitions and start replication; once
the issue is fixed, I will add the BRIN indexes back.

Best regards,
Sergey Belyashov

On Tue, Feb 18, 2025 at 02:37 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Further to this ... I'd still really like to have a reproducer.
> > While brininsertcleanup is clearly being less robust than it should
> > be, I now suspect that there is another bug somewhere further down
> > the call stack.  We're getting to this point via ExecCloseIndices,
> > and that should be paired with ExecOpenIndices, and that would have
> > created a fresh IndexInfo.  So it looks a lot like some path in a
> > logrep worker is able to call ExecCloseIndices twice on the same
> > working data.  That would probably lead to a "releasing a lock you
> > don't own" error if we weren't hitting this crash first.
>
> Hmm ... I tried modifying ExecCloseIndices to blow up if called
> twice, as in the attached.  This gets through core regression
> just fine, but it blows up in three different subscription TAP
> tests, all with a stack trace matching Sergey's:
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f064bfe3e65 in __GI_abort () at abort.c:79
> #2  0x00000000009e9253 in ExceptionalCondition (
>     conditionName=conditionName@entry=0xb8717b "indexDescs[i] != NULL",
>     fileName=fileName@entry=0xb87139 "execIndexing.c",
>     lineNumber=lineNumber@entry=249) at assert.c:66
> #3  0x00000000006f0b13 in ExecCloseIndices (
>     resultRelInfo=resultRelInfo@entry=0x2f11c18) at execIndexing.c:249
> #4  0x00000000006f86d8 in ExecCleanupTupleRouting (mtstate=0x2ef92d8,
>     proute=0x2ef94e8) at execPartition.c:1273
> #5  0x0000000000848cb6 in finish_edata (edata=0x2ef8f50) at worker.c:717
> #6  0x000000000084d0a0 in apply_handle_insert (s=<optimized out>)
>     at worker.c:2460
> #7  apply_dispatch (s=<optimized out>) at worker.c:3389
> #8  0x000000000084e494 in LogicalRepApplyLoop (last_received=25066600)
>     at worker.c:3680
> #9  start_apply (origin_startpos=0) at worker.c:4507
> #10 0x000000000084e711 in run_apply_worker () at worker.c:4629
> #11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:4798
> #12 0x00000000008138f9 in BackgroundWorkerMain (startup_data=<optimized out>,
>     startup_data_len=<optimized out>) at bgworker.c:842
>
> The problem seems to be that apply_handle_insert_internal does
> ExecOpenIndices and then ExecCloseIndices, and then
> ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
> explains why brininsertcleanup blows up if you happen to have a BRIN
> index involved.  What it doesn't explain is how come we don't see
> other symptoms from the duplicate index_close calls, regardless of
> index type.  I'd have expected an assertion failure from
> RelationDecrementReferenceCount, and/or an assertion failure for
> nonzero rd_refcnt at transaction end, and/or a "you don't own a lock
> of type X" gripe from LockRelease.  We aren't getting any of those,
> but why not, if this code is as broken as I think it is?
>
> (On closer inspection, we seem to have about 99% broken relcache.c's
> ability to notice rd_refcnt being nonzero at transaction end, but
> the other two things should still be happening.)
>
>                         regards, tom lane
>
Sergey Belyashov <sergey.belyashov@gmail.com> writes:
> Do I need to apply this patch for debugging purposes?

I think we've debugged enough to understand what's happening.
Thanks for the report!

If you just want to get some work done, applying that patch
should be enough to prevent the crash, although I want to
change some other things too.

			regards, tom lane
Thank you very much
On Tue, Feb 18, 2025 at 19:54 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sergey Belyashov <sergey.belyashov@gmail.com> writes:
> > Do I need to apply this patch for debugging purposes?
>
> I think we've debugged enough to understand what's happening.
> Thanks for the report!
>
> If you just want to get some work done, applying that patch
> should be enough to prevent the crash, although I want to
> change some other things too.
>
>                         regards, tom lane