Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
Date
Msg-id 20191121160906.wpq5u6yh6pxv6hyo@development
Whole thread Raw
In response to Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker  (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>)
Responses Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
List pgsql-bugs
On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
>On Thu, Nov 21, 2019 at 04:27:32PM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 04:07:16PM +0100, Ondřej Jirman wrote:
>> > I've also managed to get valgrind output on the segfaulting process:
>> >
>> > https://megous.com/dl/tmp/valgrind-105524-109020.log
>> >
>> > Though it probably doesn't show much new.
>> >
>>
>> Is postgres actually compiled with valgrind support? What does pg_config
>> say?
>
>I don't see any valgrind related options in configure --help, I repackaged it
>just with CFLAGS set to -ggdb -Og.
>

You need to use -DUSE_VALGRIND or edit pg_config_manual.h. I usually
just use

   CFLAGS="-DRANDOMIZE_ALLOCATED_MEMORY -DUSE_VALGRIND -O0 -ggdb"

which does the trick. Without the USE_VALGRIND defined, we don't include
the valgrind instrumentation, so I suppose the previous results are bogus.

>CONFIGURE = '--prefix=/usr' '--mandir=/usr/share/man' '--datadir=/usr/share/postgresql' '--sysconfdir=/etc'
'--with-gssapi''--with-libxml' '--with-openssl' '--with-perl' '--with-python' '--with-tcl' '--with-pam'
'--with-system-tzdata=/usr/share/zoneinfo''--with-uuid=e2fs' '--with-icu' '--with-systemd' '--with-ldap' '--enable-nls'
'--enable-thread-safety''--disable-rpath' 'PYTHON=/usr/bin/python2' 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe
-fstack-protector-strong-fno-plt -Og -ggdb
-fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug'
'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now''CPPFLAGS=-D_FORTIFY_SOURCE=2' 'CXXFLAGS=-march=x86-64
-mtune=generic-O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb
-fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug'
>CC = gcc
>CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2
>CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation-Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt
-Og-ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug
 
>CFLAGS_SL = -fPIC
>LDFLAGS = -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Wl,--as-needed
>LDFLAGS_EX =
>LDFLAGS_SL =
>LIBS = -lpgcommon -lpgport -lpthread -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
>VERSION = PostgreSQL 12.1
>
>> If you're building the server locally (I don't think there are packages
>> with valgrind enabled), maybe try adding --enable-cassert option and
>>
>>    CFLAGS"-DRANDOMIZE_ALLOCATED_MEMORY"
>>
>> which will do more extensive checks at runtime.
>
>Done.
>
>CONFIGURE = '--prefix=/usr' '--mandir=/usr/share/man' '--datadir=/usr/share/postgresql' '--sysconfdir=/etc'
'--with-gssapi''--with-libxml' '--with-openssl' '--with-perl' '--with-python' '--with-tcl' '--with-pam'
'--with-system-tzdata=/usr/share/zoneinfo''--with-uuid=e2fs' '--with-icu' '--with-systemd' '--with-ldap' '--enable-nls'
'--enable-thread-safety''--disable-rpath' '--enable-cassert' 'PYTHON=/usr/bin/python2' 'CFLAGS=-march=x86-64
-mtune=generic-O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb
-fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug-DRANDOMIZE_ALLOCATED_MEMORY'
'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now''CPPFLAGS=-D_FORTIFY_SOURCE=2' 'CXXFLAGS=-march=x86-64
-mtune=generic-O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb
-fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug'
>CC = gcc
>CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2
>CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation-Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt
-Og-ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug -DRANDOMIZE_ALLOCATED_MEMORY
 
>CFLAGS_SL = -fPIC
>LDFLAGS = -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Wl,--as-needed
>LDFLAGS_EX =
>LDFLAGS_SL =
>LIBS = -lpgcommon -lpgport -lpthread -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
>VERSION = PostgreSQL 12.1
>
>The difference now is that _len argument value is not stable accross crashes.
>
>one crash...
>
>#0  0x00007fb90c336917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
>#1  0x000055ff2a7d0e64 in memcpy (__len=972662513, __src=<optimized out>, __dest=0x55ff2cc7dce4) at
/usr/include/bits/string_fortified.h:34
>#2  tts_virtual_materialize (slot=0x55ff2cc7d868) at execTuples.c:235
>#3  0x000055ff2a7d2a63 in ExecFetchSlotHeapTuple (slot=slot@entry=0x55ff2cc7d868, materialize=materialize@entry=true,
shouldFree=shouldFree@entry=0x7ffc80cfc607)at execTuples.c:1624
 
>#4  0x000055ff2a634809 in heapam_tuple_update (relation=0x7fb8f8338100, otid=0x55ff2cc7e138, slot=0x55ff2cc7d868,
cid=0,snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffc80cfc680,
 
>    lockmode=0x7ffc80cfc67c, update_indexes=0x7ffc80cfc6d7) at heapam_handler.c:325
>#5  0x000055ff2a675d55 in table_tuple_update (update_indexes=0x7ffc80cfc6d7, lockmode=0x7ffc80cfc67c,
tmfd=0x7ffc80cfc680,wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>,
 
>    slot=0x55ff2cc7d868, otid=0x55ff2cc7e138, rel=0x7fb8f8338100) at ../../../../src/include/access/tableam.h:1261
>#6  simple_table_tuple_update (rel=rel@entry=0x7fb8f8338100, otid=otid@entry=0x55ff2cc7e138,
slot=slot@entry=0x55ff2cc7d868,snapshot=0x0, update_indexes=update_indexes@entry=0x7ffc80cfc6d7)
 
>    at tableam.c:325
>#7  0x000055ff2a7cdf08 in ExecSimpleRelationUpdate (estate=estate@entry=0x55ff2ccd02b8,
epqstate=epqstate@entry=0x7ffc80cfc730,searchslot=searchslot@entry=0x55ff2cc7e108,
 
>    slot=slot@entry=0x55ff2cc7d868) at execReplication.c:493
>#8  0x000055ff2a8eb8b2 in apply_handle_update (s=s@entry=0x7ffc80d03d00) at worker.c:750
>#9  0x000055ff2a8ebe62 in apply_dispatch (s=s@entry=0x7ffc80d03d00) at worker.c:968
>#10 0x000055ff2a8ec5ba in LogicalRepApplyLoop (last_received=11741462160) at worker.c:1176
>#11 0x000055ff2a8ecd1a in ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
>#12 0x000055ff2a8b3927 in StartBackgroundWorker () at bgworker.c:834
>#13 0x000055ff2a8c2b6b in do_start_bgworker (rw=rw@entry=0x55ff2cbbf430) at postmaster.c:5770
>#14 0x000055ff2a8c2cf7 in maybe_start_bgworkers () at postmaster.c:5996
>#15 0x000055ff2a8c3678 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
>#16 <signal handler called>
>#17 0x00007fb90c384e2a in select () from /usr/lib/libc.so.6
>#18 0x000055ff2a8c40f8 in ServerLoop () at postmaster.c:1668
>#19 0x000055ff2a8c54fc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1377
>#20 0x000055ff2a81f88a in main (argc=3, argv=0x55ff2cb94260) at main.c:228
>
>and another ...
>
>#0  0x00007f7a1befc917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
>#1  0x0000559a5439ae64 in memcpy (__len=766338801, __src=<optimized out>, __dest=0x559a55626934) at
/usr/include/bits/string_fortified.h:34
>#2  tts_virtual_materialize (slot=0x559a556264b8) at execTuples.c:235
>#3  0x0000559a5439ca63 in ExecFetchSlotHeapTuple (slot=slot@entry=0x559a556264b8, materialize=materialize@entry=true,
shouldFree=shouldFree@entry=0x7fff7f2d5787)at execTuples.c:1624
 
>#4  0x0000559a541fe809 in heapam_tuple_update (relation=0x7f7a07efe100, otid=0x559a55626d88, slot=0x559a556264b8,
cid=0,snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7fff7f2d5800,
 
>    lockmode=0x7fff7f2d57fc, update_indexes=0x7fff7f2d5857) at heapam_handler.c:325
>#5  0x0000559a5423fd55 in table_tuple_update (update_indexes=0x7fff7f2d5857, lockmode=0x7fff7f2d57fc,
tmfd=0x7fff7f2d5800,wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>,
 
>    slot=0x559a556264b8, otid=0x559a55626d88, rel=0x7f7a07efe100) at ../../../../src/include/access/tableam.h:1261
>#6  simple_table_tuple_update (rel=rel@entry=0x7f7a07efe100, otid=otid@entry=0x559a55626d88,
slot=slot@entry=0x559a556264b8,snapshot=0x0, update_indexes=update_indexes@entry=0x7fff7f2d5857)
 
>    at tableam.c:325
>#7  0x0000559a54397f08 in ExecSimpleRelationUpdate (estate=estate@entry=0x559a55678dc8,
epqstate=epqstate@entry=0x7fff7f2d58b0,searchslot=searchslot@entry=0x559a55626d58,
 
>    slot=slot@entry=0x559a556264b8) at execReplication.c:493
>#8  0x0000559a544b58b2 in apply_handle_update (s=s@entry=0x7fff7f2dce80) at worker.c:750
>#9  0x0000559a544b5e62 in apply_dispatch (s=s@entry=0x7fff7f2dce80) at worker.c:968
>#10 0x0000559a544b65ba in LogicalRepApplyLoop (last_received=11741462160) at worker.c:1176
>#11 0x0000559a544b6d1a in ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
>#12 0x0000559a5447d927 in StartBackgroundWorker () at bgworker.c:834
>#13 0x0000559a5448cb6b in do_start_bgworker (rw=rw@entry=0x559a5558c010) at postmaster.c:5770
>#14 0x0000559a5448ccf7 in maybe_start_bgworkers () at postmaster.c:5996
>#15 0x0000559a5448d678 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
>#16 <signal handler called>
>#17 0x00007f7a1bf4ae2a in select () from /usr/lib/libc.so.6
>#18 0x0000559a5448e0f8 in ServerLoop () at postmaster.c:1668
>#19 0x0000559a5448f4fc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1377
>#20 0x0000559a543e988a in main (argc=3, argv=0x559a55534260) at main.c:228
>
>So it perhaps looks like some memory management issue in the subscribing
>postmaster, rather than wal corruption on the publisher.
>

Right, maybe using freed memory, or something like that.

>> > > Can you show us \d+ on the videos table? What's the replica identity?
>> >
>> >                                                             Table "public.videos"
>> >   Column    |            Type             | Collation | Nullable |              Default               | Storage  |
Statstarget | Description
 
>> >
-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
>> > id          | integer                     |           | not null | nextval('videos_id_seq'::regclass) | plain    |
            |
 
>> > title       | text                        |           | not null |                                    | extended |
            |
 
>> > cover_image | bytea                       |           | not null |                                    | extended |
            |
 
>> > metadata    | jsonb                       |           | not null |                                    | extended |
            |
 
>> > category    | integer                     |           | not null |                                    | plain    |
            |
 
>> > published   | date                        |           | not null |                                    | plain    |
            |
 
>> > added       | timestamp without time zone |           | not null | now()                              | plain    |
            |
 
>> > played      | boolean                     |           | not null | false                              | plain    |
            |
 
>> > Indexes:
>> >    "videos_pkey" PRIMARY KEY, btree (id)
>> > Foreign-key constraints:
>> >    "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>> > Publications:
>> >    "pub"
>> >
>> >                                                             Table "public.videos"
>> >   Column    |            Type             | Collation | Nullable |              Default               | Storage  |
Statstarget | Description
 
>> >
-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
>> > id          | integer                     |           | not null | nextval('videos_id_seq'::regclass) | plain    |
            |
 
>> > title       | text                        |           | not null |                                    | extended |
            |
 
>> > cover_image | bytea                       |           | not null |                                    | extended |
            |
 
>> > metadata    | jsonb                       |           | not null |                                    | extended |
            |
 
>> > category    | integer                     |           | not null |                                    | plain    |
            |
 
>> > published   | date                        |           | not null |                                    | plain    |
            |
 
>> > added       | timestamp without time zone |           | not null | now()                              | plain    |
            |
 
>> > played      | boolean                     |           | not null | false                              | plain    |
            |
 
>> > Indexes:
>> >    "videos_pkey" PRIMARY KEY, btree (id)
>> > Foreign-key constraints:
>> >    "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>> > Access method: heap
>> >
>> >
>> > Googling for relica identity, it looks like it should be shown in the \d+
>> > output, but it isn't, for me. Maybe because it's has default value?
>> >
>>
>> Yes, I think it should be listed either for the index, or there should
>> be an extra section with replica identity. But that depends on psql
>> version. Aren't you accidentally using an older psql version? That
>> should be mentioned right after connecting to the server, I think.
>
>Maybe it has something to do with my upgrade method. I dumped/restored
>the replica with pg_dumpall, and then just proceded to enable subscription
>and refresh publication with (copy_data=false) for all my subscriptions.
>

Interesting.

>Now, I've tried to start fresh and replicate from the primary "hometv" database
>into a new l5_hometv2 on my crashing 12.1 server and it replicates fine, no
>crashes. Anyway even in this fresh replicated database, there's no REPLICA
>IDENTITY in the \d+ output.
>
>I don't have multiple postgresql installations on the host. There's only
>a difference between primary version 11.5 (ARMv7) and replica 12.1 (x86_64).
>

Hmmm, not sure. Have to check the code.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-bugs by date:

Previous
From: Ondřej Jirman
Date:
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
Next
From: Ondřej Jirman
Date:
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker