Thread: BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation
BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16226
Logged by:          Vadim Yatsenko
Email address:      vadim@postgrespro.co.il
PostgreSQL version: 12.1
Operating system:   CentOS Linux 7 (Core), kernel Linux 3.10.0-1062.9
Description:

Hi! We have 2 PostgreSQL servers with logical replication between Postgres 11.6 (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2 big tables from integer to text:

    oid       | nspname | table          | rows      | pages   | size  | index_size | total_size
  ------------+---------+----------------+-----------+---------+-------+------------+------------
    573803236 | public  | acc_rep        | 283359648 | 9276153 | 71 GB | 50 GB      | 121 GB
    586884963 | public  | acc_rep_actual | 283280448 | 8855209 | 68 GB | 27 GB      | 95 GB

This, of course, led to a full rewrite of both tables. We repeated this operation on both servers. After that we started to get errors like "background worker "logical replication worker" (PID <pid>) was terminated by signal 11: Segmentation fault", and the server goes into recovery mode. This error repeated several times every 10-15 minutes. When we connected to the logical replication worker process using gdb, we received the following message:

Program received signal SIGSEGV, Segmentation fault.
0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x000000000061e1cb in memcpy (__len=224588909, __src=<optimized out>, __dest=0x101b870) at /usr/include/bits/string3.h:51
#2  tts_virtual_materialize (slot=0x101b0b0) at execTuples.c:235
#3  0x000000000061f594 in ExecFetchSlotHeapTuple (slot=0x101b0b0, materialize=<optimized out>, shouldFree=0x7ffc331ae187) at execTuples.c:1624
#4  0x00000000004c868c in heapam_tuple_update (relation=0x7f04e77d8470, otid=0x101b240, slot=0x101b0b0, cid=19778, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffc331ae210, lockmode=0x7ffc331ae20c, update_indexes=0x7ffc331ae277) at heapam_handler.c:325
#5  0x00000000004f5524 in table_tuple_update (update_indexes=0x7ffc331ae277, lockmode=0x7ffc331ae20c, tmfd=0x7ffc331ae210, wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>, slot=0x101b0b0, otid=0x101b240, rel=0x7f04e77d8470) at ../../../../src/include/access/tableam.h:1261
#6  simple_table_tuple_update (rel=0x7f04e77d8470, otid=otid@entry=0x101b240, slot=slot@entry=0x101b0b0, snapshot=0x0, update_indexes=update_indexes@entry=0x7ffc331ae277) at tableam.c:325
#7  0x000000000061bbc0 in ExecSimpleRelationUpdate (estate=estate@entry=0x1031660, epqstate=epqstate@entry=0x7ffc331ae330, searchslot=searchslot@entry=0x101b210, slot=slot@entry=0x101b0b0) at execReplication.c:493
#8  0x000000000071100d in apply_handle_update (s=s@entry=0x7ffc331b5910) at worker.c:750
#9  0x0000000000711ca7 in apply_dispatch (s=0x7ffc331b5910) at worker.c:968
#10 LogicalRepApplyLoop (last_received=31760851165536) at worker.c:1176
#11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
#12 0x00000000006e2d4a in StartBackgroundWorker () at bgworker.c:834
#13 0x00000000006ef1d3 in do_start_bgworker (rw=0xf71970) at postmaster.c:5770
#14 maybe_start_bgworkers () at postmaster.c:5996
#15 0x00000000006efde5 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007f04e4fff933 in __select_nocancel () from /lib64/libc.so.6
#18 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#19 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45300) at postmaster.c:1377
#20 0x0000000000484cd3 in main (argc=3, argv=0xf45300) at main.c:228

Error from /var/log/messages:

Jan 21 02:01:59 dbmr01 kernel: postmaster[50093]: segfault at e76f13d ip 00007fa8f701ffa0 sp 00007ffc8ae23498 error 4 in libc-2.17.so[7fa8f6ed4000+1c2000]
Jan 21 02:01:59 dbmr01 abrt-hook-ccpp: Process 50093 (postgres) of user 26 killed by SIGSEGV - dumping core
Jan 21 02:02:15 dbmr01 abrt-server: Package 'postgresql12-server' isn't signed with proper key
Jan 21 02:02:15 dbmr01 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093' exited with 1
Jan 21 02:02:15 dbmr01 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093'

The issue was solved by removing these tables from the publication on the Primary. Can you advise something? Is this a bug, or did I fail to take something into account when working with logical replication? Thank you!
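For reference, the type change described above, and the workaround mentioned at the end of the report, would have been of roughly this shape. This is a sketch: only the table names come from the report; the column name and the publication name are assumptions.

```sql
-- Changing a column's type from integer to text rewrites the whole table.
-- Column name "acc_id" and the USING clause are assumptions; only the
-- table names public.acc_rep / public.acc_rep_actual come from the report.
ALTER TABLE public.acc_rep
    ALTER COLUMN acc_id TYPE text USING acc_id::text;
ALTER TABLE public.acc_rep_actual
    ALTER COLUMN acc_id TYPE text USING acc_id::text;

-- Workaround from the report, run on the Primary
-- (publication name "pub" is an assumption):
ALTER PUBLICATION pub DROP TABLE public.acc_rep, public.acc_rep_actual;
```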
Re: BUG #16226: background worker "logical replication worker" (PID) was terminated by signal 11: Segmentation
From
Peter Eisentraut
Date:
On 2020-01-22 08:43, PG Bug reporting form wrote:
> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
> big tables from integer to text:
> This, of course, led to a full rewrite of both tables. We repeated this
> operation on both servers. After that we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault", and the server goes into recovery mode.
> This error repeated several times every 10-15 minutes. When we connected to
> the logical replication worker process using gdb, we received the following
> message:
>
> Program received signal SIGSEGV, Segmentation fault.

Could you narrow it down to a reproducible test case?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16226: background worker "logical replication worker" (PID) was terminated by signal 11: Segmentation
From
Michael Paquier
Date:
On Wed, Jan 22, 2020 at 10:22:55AM +0100, Peter Eisentraut wrote:
> On 2020-01-22 08:43, PG Bug reporting form wrote:
>> This, of course, led to a full rewrite of both tables. We repeated this
>> operation on both servers. After that we started to get errors like
>> "background worker "logical replication worker" (PID <pid>) was terminated
>> by signal 11: Segmentation fault", and the server goes into recovery mode.
>> This error repeated several times every 10-15 minutes. When we connected to
>> the logical replication worker process using gdb, we received the following
>> message:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>
> Could you narrow it down to a reproducible test case?

I suspect that this is the same problem as this report:
https://www.postgresql.org/message-id/16129-a0c0f48e71741e5f@postgresql.org

And it has been fixed here, after 12.1, the fix being available with 12.2:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a2aa224e
--
Michael
Re: BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation
From
Tom Lane
Date:
> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
> big tables from integer to text:
> ...
> This, of course, led to a full rewrite of both tables. We repeated this
> operation on both servers. After that we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault", and the server goes into recovery mode.

Not sure, but this seems like it might be explained by this recent bug fix:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master [4d9ceb001] 2019-11-22 11:31:19 -0500
Branch: REL_12_STABLE [a2aa224e0] 2019-11-22 11:31:19 -0500
Branch: REL_11_STABLE [b72a44c51] 2019-11-22 11:31:19 -0500
Branch: REL_10_STABLE [5d3fcb53a] 2019-11-22 11:31:19 -0500

Fix bogus tuple-slot management in logical replication UPDATE handling.

slot_modify_cstrings seriously abused the TupleTableSlot API by relying
on a slot's underlying data to stay valid across ExecClearTuple. Since
this abuse was also quite undocumented, it's little surprise that the
case got broken during the v12 slot rewrites. As reported in bug #16129
from Ondřej Jirman, this could lead to crashes or data corruption when
a logical replication subscriber processes a row update. Problems would
only arise if the subscriber's table contained columns of pass-by-ref
types that were not being copied from the publisher.

Fix by explicitly copying the datum/isnull arrays from the source slot
that the old row was in already. This ends up being about the same
thing that happened pre-v12, but hopefully in a less opaque and
fragile way.

We might've caught the problem sooner if there were any test cases
dealing with updates involving non-replicated or dropped columns.
Now there are.

Back-patch to v10 where this code came in. Even though the failure
does not manifest before v12, IMO this code is too fragile to leave
as-is. In any case we certainly want the additional test coverage.

Patch by me; thanks to Tomas Vondra for initial investigation.

Discussion: https://postgr.es/m/16129-a0c0f48e71741e5f@postgresql.org

			regards, tom lane
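The failure condition the commit message describes (a subscriber-side column of a pass-by-ref type that is not copied from the publisher) has roughly the following shape. This is a sketch, not the committed regression test; all object names are hypothetical and the subscription connection string is elided.

```sql
-- On the publisher (names hypothetical):
CREATE TABLE t (id int PRIMARY KEY, val int);
CREATE PUBLICATION pub FOR TABLE t;

-- On the subscriber: the same table plus an extra pass-by-ref (text)
-- column that is never replicated from the publisher.
CREATE TABLE t (id int PRIMARY KEY, val int, note text DEFAULT 'local');
-- CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub;

-- On the publisher: row updates. When the subscriber applies them,
-- the broken slot handling in unpatched v12 could read already-freed
-- memory for the non-replicated "note" column.
UPDATE t SET val = val + 1;
```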
Re: BUG #16226: background worker "logical replication worker" (PID) was terminated by signal 11: Segmentation
From
Vadim Yatsenko
Date:
Tom,
Thank you! We'll wait for the patch to update our servers.
Best Regards,
Vadim Yatsenko
Wed, Jan 22, 2020, 18:28 Tom Lane <tgl@sss.pgh.pa.us>:
> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
> big tables from integer to text:
> ...
> This, of course, led to a full rewrite of both tables. We repeated this
> operation on both servers. After that we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault", and the server goes into recovery mode.
Not sure, but this seems like it might be explained by this recent
bug fix:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master [4d9ceb001] 2019-11-22 11:31:19 -0500
Branch: REL_12_STABLE [a2aa224e0] 2019-11-22 11:31:19 -0500
Branch: REL_11_STABLE [b72a44c51] 2019-11-22 11:31:19 -0500
Branch: REL_10_STABLE [5d3fcb53a] 2019-11-22 11:31:19 -0500
Fix bogus tuple-slot management in logical replication UPDATE handling.
slot_modify_cstrings seriously abused the TupleTableSlot API by relying
on a slot's underlying data to stay valid across ExecClearTuple. Since
this abuse was also quite undocumented, it's little surprise that the
case got broken during the v12 slot rewrites. As reported in bug #16129
from Ondřej Jirman, this could lead to crashes or data corruption when
a logical replication subscriber processes a row update. Problems would
only arise if the subscriber's table contained columns of pass-by-ref
types that were not being copied from the publisher.
Fix by explicitly copying the datum/isnull arrays from the source slot
that the old row was in already. This ends up being about the same
thing that happened pre-v12, but hopefully in a less opaque and
fragile way.
We might've caught the problem sooner if there were any test cases
dealing with updates involving non-replicated or dropped columns.
Now there are.
Back-patch to v10 where this code came in. Even though the failure
does not manifest before v12, IMO this code is too fragile to leave
as-is. In any case we certainly want the additional test coverage.
Patch by me; thanks to Tomas Vondra for initial investigation.
Discussion: https://postgr.es/m/16129-a0c0f48e71741e5f@postgresql.org
regards, tom lane