Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date
Msg-id 20191121125930.or5fhpi2yjgkmjye@development
In response to Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker  (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>)
Responses Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
List pgsql-bugs
On Thu, Nov 21, 2019 at 12:53:26PM +0100, Ondřej Jirman wrote:
>Hello,
>
>On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      16129
>> > Logged by:          Ondrej Jirman
>> > Email address:      ienieghapheoghaiwida@xff.cz
>> > PostgreSQL version: 12.1
>> > Operating system:   Arch Linux
>> > Description:
>> >
>> > Hello,
>> >
>> > I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via pg_dumpall
>> > method and after a while I started getting segfault in logical replication
>> > worker.
>> >
>> > My setup is fairly vanilla, non-default options:
>> >
>> > shared_buffers = 256MB
>> > work_mem = 512MB
>> > temp_buffers = 64MB
>> > maintenance_work_mem = 4GB
>> > effective_cache_size = 16GB
>> > max_logical_replication_workers = 30
>> > max_replication_slots = 30
>> > max_worker_processes = 30
>> > wal_level = logical
>> >
>> > I have several databases that I subscribe to from this database cluster
>> > using logical replication.
>> >
>> > Replication of one of my databases (running on ARMv7 machine) started
>> > segfaulting on the subscriber side (x86_64) like this:
>> >
>> > #0  0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
>> > /usr/lib/libc.so.6
>> > #1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
>> > __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>> > #2  tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
>> > #3  0x000055d033e94d32 in ExecFetchSlotHeapTuple
>> > (slot=slot@entry=0x55d0356da3b8, materialize=materialize@entry=true,
>> > shouldFree=shouldFree@entry=0x7fff0e7cf387) at execTuples.c:1624
>
>I forgot to add that publisher is still PostgreSQL 11.5.
>
>> Hmmm, so it's failing on this memcpy() in tts_virtual_materialize:
>>
>>     else
>>     {
>>         Size        data_length = 0;
>>
>>         data = (char *) att_align_nominal(data, att->attalign);
>>         data_length = att_addlength_datum(data_length, att->attlen, val);
>>
>>         memcpy(data, DatumGetPointer(val), data_length);
>
>This is a bytea column in one of the tables on the publisher.
>
>So this looks like it's segfaulting while trying to copy too many bytes
>(data_length determined as 620,701,425 bytes):
>
>#1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>
>But the maximum length of any bytea value in the publisher database is <200kB.
>

OK. So the public.videos.cover_image column, I suppose?

Anyway, this likely means the pointer is bogus in some way. Maybe it's
pointing to already freed memory, or something like that.
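
To spell out why a bogus pointer shows up as an absurd length: for a
varlena column like bytea (attlen = -1), att_addlength_datum() reads the
length from the varlena header stored at the address the Datum points to.
Roughly (just a sketch, not the exact macros):

    char   *ptr = DatumGetPointer(val);    /* whatever tts_values[natt] holds */
    Size    len = VARSIZE_ANY(ptr);        /* length decoded from the varlena header */

    memcpy(data, ptr, len);                /* garbage header => huge bogus length */

So if the Datum points at garbage, the "length" is garbage too, and
memcpy() runs off the end of whatever is actually there.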

>>
>>         slot->tts_values[natt] = PointerGetDatum(data);
>>         data += data_length;
>>     }
>>
>> The question is, which of the pointers is bogus. You seem to already
>> have a core file, so can you inspect the variables in frame #2? I think
>> especially
>>
>>  p *slot
>>  p natt
>>  p val
>>  p *att
>>
>> would be interesting to see.
>
>(gdb) p *slot
>$1 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 8, tts_ops = 0x558149ff4da0 <TTSOpsVirtual>,
>  tts_tupleDescriptor = 0x7fcca2ea7548, tts_values = 0x55814adfbc10,
>  tts_isnull = 0x55814adfbc50, tts_mcxt = 0x55814adfb6e0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535},
>  ip_posid = 0}, tts_tableOid = 0}
>(gdb) p natt
>$2 = 2
>(gdb) p val
>$3 = <optimized out>
>(gdb) p slot->tts_values[nat]
>No symbol "nat" in current context.
>(gdb) p slot->tts_values[natt]
>$4 = 94013795319824
>(gdb) p *slot->tts_values[natt]
>$5 = -1812161596
>(gdb) p *att
>$6 = {attrelid = 55240, attname = {data = "cover_image", '\000' <repeats 52 times>}, atttypid = 17,
>  attstattarget = -1, attlen = -1, attnum = 3, attndims = 0, attcacheoff = -1, atttypmod = -1,
>  attbyval = false, attstorage = 120 'x', attalign = 105 'i', attnotnull = true, atthasdef = false,
>  atthasmissing = false, attidentity = 0 '\000', attgenerated = 0 '\000', attisdropped = false,
>  attislocal = true, attinhcount = 0, attcollation = 0}
>

OK, nothing obviously bogus here, I think. It's not entirely clear to me
why this works

(gdb) p *slot->tts_values[natt]
$5 = -1812161596

because tts_values[natt] is not a pointer, I think. So I'm not sure how
to interpret the dereference. It would be interesting to see the first
couple of bytes. I think something like

(gdb) p ((char*)slot->tts_values[natt])[$n]

for $n between 0 and 3 should do the trick.
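
If that address is present in the core at all, those four bytes should be
the (possibly corrupted) varlena header itself. As a further sanity check,
assuming little-endian x86_64 and a 4-byte varlena header (not the 1-byte
short form or a toast pointer), the stored length is the header word
shifted right by two bits, so something like

(gdb) p/x *(unsigned int *) slot->tts_values[natt]
(gdb) p (*(unsigned int *) slot->tts_values[natt]) >> 2

should print 620701425 (the length memcpy was called with), if the length
really came from there and that memory hasn't changed since.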

>> Also, what does the replicated schema look like? Can we see the table
>> definitions?
>
> ...
>CREATE TABLE public.videos (
>    id integer NOT NULL,
>    title text NOT NULL,
>    cover_image bytea NOT NULL,
>    metadata jsonb NOT NULL,
>    category integer NOT NULL,
>    published date NOT NULL,
>    added timestamp without time zone DEFAULT now() NOT NULL,
>    played boolean DEFAULT false NOT NULL
>);

OK, so this is the problematic table. Any idea what operation triggers
the issue? Having a small reproducer (say, a single-row insert or so)
would be very helpful.
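
Just to illustrate what I mean (purely hypothetical values, using the
column names from your schema above), something on the publisher along the
lines of

    INSERT INTO public.videos (id, title, cover_image, metadata, category, published)
    VALUES (1, 'test', decode(repeat('00', 100000), 'hex'),  -- ~100kB bytea, hypothetical
            '{}', 1, CURRENT_DATE);

with a cover_image roughly the size of your real data would be ideal,
assuming a single insert is enough to make the apply worker crash on the
subscriber.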

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


