
From: PG Bug reporting form
Subject: BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation
Msg-id: 16226-690712ac4467269a@postgresql.org
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16226
Logged by:          Vadim Yatsenko
Email address:      vadim@postgrespro.co.il
PostgreSQL version: 12.1
Operating system:   CentOS Linux 7 (Core) Kernel: Linux 3.10.0-1062.9.
Description:

Hi!
We have two PostgreSQL servers with logical replication between Postgres 11.6
(Primary) and 12.1 (Logical). Some time ago we changed the column type in two
big tables from integer to text (the statements are sketched after the table):
    oid     | nspname |     table      |   rows    |  pages  | size  | index_size | total_size
------------+---------+----------------+-----------+---------+-------+------------+------------
  573803236 | public  | acc_rep        | 283359648 | 9276153 | 71 GB | 50 GB      | 121 GB
  586884963 | public  | acc_rep_actual | 283280448 | 8855209 | 68 GB | 27 GB      | 95 GB
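
The type change was done with statements of roughly the following form
(acc_value is a placeholder column name, not the real one). The USING
clause is required because there is no assignment cast from integer to
text:

    ALTER TABLE public.acc_rep
        ALTER COLUMN acc_value TYPE text USING acc_value::text;
    ALTER TABLE public.acc_rep_actual
        ALTER COLUMN acc_value TYPE text USING acc_value::text;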

This, of course, led to a full rewrite of both tables. We repeated the
operation on both servers. After that we started getting errors like
"background worker "logical replication worker" (PID <pid>) was terminated
by signal 11: Segmentation fault", and the server went into recovery mode.
The error repeated every 10-15 minutes. When we attached gdb to the logical
replication worker process, we got the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x000000000061e1cb in memcpy (__len=224588909, __src=<optimized out>, __dest=0x101b870) at /usr/include/bits/string3.h:51
#2  tts_virtual_materialize (slot=0x101b0b0) at execTuples.c:235
#3  0x000000000061f594 in ExecFetchSlotHeapTuple (slot=0x101b0b0, materialize=<optimized out>, shouldFree=0x7ffc331ae187) at execTuples.c:1624
#4  0x00000000004c868c in heapam_tuple_update (relation=0x7f04e77d8470, otid=0x101b240, slot=0x101b0b0, cid=19778, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffc331ae210, lockmode=0x7ffc331ae20c, update_indexes=0x7ffc331ae277) at heapam_handler.c:325
#5  0x00000000004f5524 in table_tuple_update (update_indexes=0x7ffc331ae277, lockmode=0x7ffc331ae20c, tmfd=0x7ffc331ae210, wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>, slot=0x101b0b0, otid=0x101b240, rel=0x7f04e77d8470) at ../../../../src/include/access/tableam.h:1261
#6  simple_table_tuple_update (rel=0x7f04e77d8470, otid=otid@entry=0x101b240, slot=slot@entry=0x101b0b0, snapshot=0x0, update_indexes=update_indexes@entry=0x7ffc331ae277) at tableam.c:325
#7  0x000000000061bbc0 in ExecSimpleRelationUpdate (estate=estate@entry=0x1031660, epqstate=epqstate@entry=0x7ffc331ae330, searchslot=searchslot@entry=0x101b210, slot=slot@entry=0x101b0b0) at execReplication.c:493
#8  0x000000000071100d in apply_handle_update (s=s@entry=0x7ffc331b5910) at worker.c:750
#9  0x0000000000711ca7 in apply_dispatch (s=0x7ffc331b5910) at worker.c:968
#10 LogicalRepApplyLoop (last_received=31760851165536) at worker.c:1176
#11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
#12 0x00000000006e2d4a in StartBackgroundWorker () at bgworker.c:834
#13 0x00000000006ef1d3 in do_start_bgworker (rw=0xf71970) at postmaster.c:5770
#14 maybe_start_bgworkers () at postmaster.c:5996
#15 0x00000000006efde5 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007f04e4fff933 in __select_nocancel () from /lib64/libc.so.6
#18 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#19 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45300) at postmaster.c:1377
#20 0x0000000000484cd3 in main (argc=3, argv=0xf45300) at main.c:228
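
Frame #1 is notable: memcpy is asked to copy 224588909 bytes (roughly
224 MB) while materializing a single tuple, which looks more like a
garbage datum length than real data. Since the crashes began right after
the type change, one sanity check is whether both servers still agree on
the column definitions. A sketch of such a check, to be run against the
affected table on both Primary and Logical:

    SELECT attnum, attname, atttypid::regtype AS data_type
    FROM pg_attribute
    WHERE attrelid = 'public.acc_rep'::regclass
      AND attnum > 0
      AND NOT attisdropped
    ORDER BY attnum;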

Error from /var/log/messages:

Jan 21 02:01:59 dbmr01 kernel: postmaster[50093]: segfault at e76f13d ip 00007fa8f701ffa0 sp 00007ffc8ae23498 error 4 in libc-2.17.so[7fa8f6ed4000+1c2000]
Jan 21 02:01:59 dbmr01 abrt-hook-ccpp: Process 50093 (postgres) of user 26 killed by SIGSEGV - dumping core
Jan 21 02:02:15 dbmr01 abrt-server: Package 'postgresql12-server' isn't signed with proper key
Jan 21 02:02:15 dbmr01 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093' exited with 1
Jan 21 02:02:15 dbmr01 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093'

The issue was solved by removing these tables from the publication on the
Primary; the statements were of the form sketched below.
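
The workaround consisted of statements like the following (my_pub and
my_sub are placeholder names for our publication and subscription):

    -- On the Primary: stop publishing the two big tables.
    ALTER PUBLICATION my_pub
        DROP TABLE public.acc_rep, public.acc_rep_actual;

    -- On the Logical side: refresh the subscription so it stops
    -- applying changes for those tables.
    ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;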
Can you advise something? Is this a bug, or did I fail to take something
into account when working with logical replication?
Thank you!

