Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
Date
Msg-id 20191121152732.hu6y4wzw7nkxpsvk@development
Whole thread Raw
In response to Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker  (Ondřej Jirman <megi@xff.cz>)
Responses Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
List pgsql-bugs
On Thu, Nov 21, 2019 at 04:07:16PM +0100, Ondřej Jirman wrote:
>On Thu, Nov 21, 2019 at 03:35:03PM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 02:55:23PM +0100, Ondřej Jirman wrote:
>> > On Thu, Nov 21, 2019 at 02:32:37PM +0100, Tomas Vondra wrote:
>> > > On Thu, Nov 21, 2019 at 01:59:13PM +0100, Ondřej Jirman wrote:
>> > > > On Thu, Nov 21, 2019 at 12:53:26PM +0100, postgresql wrote:
>> > > > > Hello,
>> > > > >
>> > > > > On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
>> > > > > > On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
>> > > > > > > Replication of one of my databases (running on ARMv7 machine) started
>> > > > > > > segfaulting on the subscriber side (x86_64) like this:
>> > > > > > >
>> > > > > > > #0  0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
>> > > > > > > /usr/lib/libc.so.6
>> > > > > > > #1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
>> > > > > > > __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>> > > > > > > #2  tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
>> > > > > > > #3  0x000055d033e94d32 in ExecFetchSlotHeapTuple
>> > > > > > > (slot=slot@entry=0x55d0356da3b8, materialize=materialize@entry=true,
>> > > > > > > shouldFree=shouldFree@entry=0x7fff0e7cf387) at execTuples.c:1624
>> > > > >
>> > > > > I forgot to add that publisher is still PostgreSQL 11.5.
>> > > > >
>> > > >
>> > > > I can also add that I have data checksumming enabled on both ends, and
>> > > > it did not detect any corruption:
>> > > >
>> > > > # pg_verify_checksums -D /var/lib/postgres/data
>> > > > Checksum scan completed
>> > > > Data checksum version: 1
>> > > > Files scanned:  1751
>> > > > Blocks scanned: 86592
>> > > > Bad checksums:  0
>> > > >
>> > > > # pg_checksums /var/lib/postgres/data
>> > > > Checksum operation completed
>> > > > Files scanned:  22777
>> > > > Blocks scanned: 3601527
>> > > > Bad checksums:  0
>> > > > Data checksum version: 1
>> > > >
>> > > > WAL log on the publisher is also dumpable to a state hours after the issues
>> > > > started:
>> > > >
>> > > > I've put the dump here, if it's of any use: https://megous.com/dl/tmp/wal_dump.txt
>> > > >
>> > > > Dump ends with:
>> > > >
>> > > > pg_waldump: FATAL:  error in WAL record at 2/BBE0E538: invalid record length at 2/BBE0E5A8: wanted 24, got 0
>> > > >
>> > > > But that seems normal. I get that error on my other database clusters, too.
>> > > >
>> > > > I managed to extract the failing logical decoding data from the publisher, if
>> > > > that helps:
>> > > >
>> > > >
>> > > > SELECT * FROM pg_logical_slot_peek_binary_changes('l5_hometv', NULL, NULL, 'proto_version', '1',
'publication_names','pub');
 
>> > > >
>> > > > 2/BBD86EA0 | 56395 | \x4200000002bbd880b800023acd790ce5510000dc4b
>> > > > 2/BBD87E90 | 56395 |
\x5200004a687075626c696300766964656f73006400080169640000000017ffffffff007469746c650000000019ffffffff00636f7665725f696d6167650000000011ffffffff006d657461646174610000000edafffffff
>> > > >
f0063617465676f72790000000017ffffffff007075626c6973686564000000043affffffff006164646564000000045affffffff00706c617965640000000010ffffffff
>> > > > 2/BBD87E90 | 56395 |
\x5500004a684e0008740000000438333933740000005650617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c20627
>> > > >
96c6f206279206ec49b636f20c5a17061746ec49b7574000001397b226964223a20226430313064343430303965323131656162323539616331663662323230656538222c202264617465223a2022323031392d31312d3138222c20226e616d65223a20
>> > > >
2250617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c2062796c6f206279206ec49b636f20c5a17061746ec49b222c2022696d616765223a202
>> > > >
268747470733a2f2f63646e2e7873642e637a2f726573697a652f63353535656239633131353333313632386164666539396237343534353731655f657874726163743d302c302c313931392c313038305f726573697a653d3732302c3430355f2e6a70
>> > > >
673f686173683d6362316362623836336230353361613561333761346666616439303865303431227d7400000003323432740000000a323031392d31312d3138740000001a323031392d31312d31382031323a35303a30312e383136333736740000000
>> > > > 174
>> > > > 2/BBD880E8 | 56395 | \x430000000002bbd880b800000002bbd880e800023acd790ce551
>> > > >
>> > >
>> > > Can you show the whole transaction? From the WAL dump it seems it just
>> > > did a single UPDATE:
>> >
>> > I'm not sure what you mean. The above is the whole output I got from that
>> > SELECT.
>> >
>> > > rmgr: Heap        len (rec/tot):     59/  4075, tx:      56395, lsn: 2/BBD86EA0, prev 2/BBD86E68, desc: LOCK off
2:xid 56395: flags 0 LOCK_ONLY EXCL_LOCK
 
>> > >     blkref #0: rel 1663/19030/19048 fork main blk 415 (FPW); hole: offset: 56, length: 4176
>> > > rmgr: Heap        len (rec/tot):    523/   523, tx:      56395, lsn: 2/BBD87E90, prev 2/BBD86EA0, desc:
HOT_UPDATEoff 2 xmax 56395 ; new off 9 xmax 0
 
>> > >     blkref #0: rel 1663/19030/19048 fork main blk 415
>> > > rmgr: Transaction len (rec/tot):     46/    46, tx:      56395, lsn: 2/BBD880B8, prev 2/BBD87E90, desc: COMMIT
2019-11-2023:28:47.016273 CET
 
>> > >
>> > > Which fields does it update?
>> >
>> > The only update that my system does in this database is:
>> >
>> > UPDATE videos SET played = TRUE WHERE id = ?
>> >
>> > The rest is just inserts:
>> >
>> > INSERT INTO categories (name, metadata, cover_image, provider) VALUES (?, ?, ?, (SELECT id FROM providers WHERE
system_name= ?))
 
>> > INSERT INTO videos (title, metadata, cover_image, category, published) VALUES (?, ?, ?, ?, ?)
>> >
>> > There are no other data modification statements executed.
>> >
>>
>> Well, this is definitely the updated - both judging by the WAL dump and
>> also because the backtrace contains apply_handle_update.
>
>I've also managed to get valgrind output on the segfaulting process:
>
>https://megous.com/dl/tmp/valgrind-105524-109020.log
>
>Though it probably doesn't show much new.
>

Is postgres actually compiled with valgrind support? What does pg_config
say?

If you're building the server locally (I don't think there are packages
with valgrind enabled), maybe try adding --enable-cassert option and

    CFLAGS"-DRANDOMIZE_ALLOCATED_MEMORY"

which will do more extensive checks at runtime.

>> Can you show us \d+ on the videos table? What's the replica identity?
>
>                                                             Table "public.videos"
>   Column    |            Type             | Collation | Nullable |              Default               | Storage  |
Statstarget | Description
 

>-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> id          | integer                     |           | not null | nextval('videos_id_seq'::regclass) | plain    |
         |
 
> title       | text                        |           | not null |                                    | extended |
         |
 
> cover_image | bytea                       |           | not null |                                    | extended |
         |
 
> metadata    | jsonb                       |           | not null |                                    | extended |
         |
 
> category    | integer                     |           | not null |                                    | plain    |
         |
 
> published   | date                        |           | not null |                                    | plain    |
         |
 
> added       | timestamp without time zone |           | not null | now()                              | plain    |
         |
 
> played      | boolean                     |           | not null | false                              | plain    |
         |
 
>Indexes:
>    "videos_pkey" PRIMARY KEY, btree (id)
>Foreign-key constraints:
>    "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>Publications:
>    "pub"
>
>                                                             Table "public.videos"
>   Column    |            Type             | Collation | Nullable |              Default               | Storage  |
Statstarget | Description
 

>-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> id          | integer                     |           | not null | nextval('videos_id_seq'::regclass) | plain    |
         |
 
> title       | text                        |           | not null |                                    | extended |
         |
 
> cover_image | bytea                       |           | not null |                                    | extended |
         |
 
> metadata    | jsonb                       |           | not null |                                    | extended |
         |
 
> category    | integer                     |           | not null |                                    | plain    |
         |
 
> published   | date                        |           | not null |                                    | plain    |
         |
 
> added       | timestamp without time zone |           | not null | now()                              | plain    |
         |
 
> played      | boolean                     |           | not null | false                              | plain    |
         |
 
>Indexes:
>    "videos_pkey" PRIMARY KEY, btree (id)
>Foreign-key constraints:
>    "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>Access method: heap
>
>
>Googling for relica identity, it looks like it should be shown in the \d+
>output, but it isn't, for me. Maybe because it's has default value?
>

Yes, I think it should be listed either for the index, or there should
be an extra section with replica identity. But that depends on psql
version. Aren't you accidentally using an older psql version? That
should be mentioned right after connecting to the server, I think.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-bugs by date:

Previous
From: Ondřej Jirman
Date:
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
Next
From: Ondřej Jirman
Date:
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker