Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker - Mailing list pgsql-bugs
From: Ondřej Jirman
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Msg-id: 20191121131202.xhi5hxvdjjfvbmop@core.my.home
In response to: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
List: pgsql-bugs
On Thu, Nov 21, 2019 at 01:59:30PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 12:53:26PM +0100, Ondřej Jirman wrote:
> > Hello,
> >
> > On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
> > > On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
> > > > The following bug has been logged on the website:
> > > >
> > > > Bug reference:      16129
> > > > Logged by:          Ondrej Jirman
> > > > Email address:      ienieghapheoghaiwida@xff.cz
> > > > PostgreSQL version: 12.1
> > > > Operating system:   Arch Linux
> > > > Description:
> > > >
> > > > Hello,
> > > >
> > > > I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via the
> > > > pg_dumpall method, and after a while I started getting segfaults in a
> > > > logical replication worker.
> > > >
> > > > My setup is fairly vanilla; the non-default options are:
> > > >
> > > > shared_buffers = 256MB
> > > > work_mem = 512MB
> > > > temp_buffers = 64MB
> > > > maintenance_work_mem = 4GB
> > > > effective_cache_size = 16GB
> > > > max_logical_replication_workers = 30
> > > > max_replication_slots = 30
> > > > max_worker_processes = 30
> > > > wal_level = logical
> > > >
> > > > I have several databases that I subscribe to from this database cluster
> > > > using logical replication.
> > > >
> > > > Replication of one of my databases (running on an ARMv7 machine) started
> > > > segfaulting on the subscriber side (x86_64) like this:
> > > >
> > > > #0  0x00007fc259739917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
> > > > #1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
> > > >     __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
> > > > #2  tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
> > > > #3  0x000055d033e94d32 in ExecFetchSlotHeapTuple (slot=slot@entry=0x55d0356da3b8,
> > > >     materialize=materialize@entry=true, shouldFree=shouldFree@entry=0x7fff0e7cf387)
> > > >     at execTuples.c:1624
> >
> > I forgot to add that the publisher is still PostgreSQL 11.5.
>
> > > Hmm, so it's failing on this memcpy() in tts_virtual_materialize:
> > >
> > > 	else
> > > 	{
> > > 		Size		data_length = 0;
> > >
> > > 		data = (char *) att_align_nominal(data, att->attalign);
> > > 		data_length = att_addlength_datum(data_length, att->attlen, val);
> > >
> > > 		memcpy(data, DatumGetPointer(val), data_length);
> >
> > This is a bytea column in one of the tables on the publisher.
> >
> > So it looks like it's segfaulting while trying to copy too many bytes
> > (data_length determined as 620,701,425 bytes):
> >
> > #1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
> >
> > But the maximum length of any bytea value in the publisher database is <200kB.
>
> OK. So the public.videos.cover_image column, I suppose?
>
> Anyway, this likely means the pointer is bogus in some way. Maybe it's
> pointing to already freed memory, or something like that.
>
> > > 		slot->tts_values[natt] = PointerGetDatum(data);
> > > 		data += data_length;
> > > 	}
> > >
> > > The question is, which of the pointers is bogus. You seem to already
> > > have a core file, so can you inspect the variables in frame #2?
> > > I think
> > > especially
> > >
> > > 	p *slot
> > > 	p natt
> > > 	p val
> > > 	p *att
> > >
> > > would be interesting to see.
> >
> > (gdb) p *slot
> > $1 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 8, tts_ops = 0x558149ff4da0 <TTSOpsVirtual>,
> >   tts_tupleDescriptor = 0x7fcca2ea7548, tts_values = 0x55814adfbc10, tts_isnull = 0x55814adfbc50,
> >   tts_mcxt = 0x55814adfb6e0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0},
> >   tts_tableOid = 0}
> > (gdb) p natt
> > $2 = 2
> > (gdb) p val
> > $3 = <optimized out>
> > (gdb) p slot->tts_values[nat]
> > No symbol "nat" in current context.
> > (gdb) p slot->tts_values[natt]
> > $4 = 94013795319824
> > (gdb) p *slot->tts_values[natt]
> > $5 = -1812161596
> > (gdb) p *att
> > $6 = {attrelid = 55240, attname = {data = "cover_image", '\000' <repeats 52 times>}, atttypid = 17,
> >   attstattarget = -1, attlen = -1, attnum = 3, attndims = 0, attcacheoff = -1, atttypmod = -1,
> >   attbyval = false, attstorage = 120 'x', attalign = 105 'i', attnotnull = true, atthasdef = false,
> >   atthasmissing = false, attidentity = 0 '\000', attgenerated = 0 '\000', attisdropped = false,
> >   attislocal = true, attinhcount = 0, attcollation = 0}
>
> OK, nothing obviously bogus here, I think. It's not entirely clear to me
> why this works
>
> 	(gdb) p *slot->tts_values[natt]
> 	$5 = -1812161596
>
> because tts_values[natt] is not a pointer, I think. So I'm not sure how
> to interpret the dereference. It would be interesting to see the first
> couple of bytes.

It's probably nothing; the value there just looked big enough to be a pointer,
so I dereferenced it in gdb.

> I think something like
>
> 	(gdb) p ((char*)slot->tts_values[natt])[$n]
>
> for $n between 0 and 3 should do the trick.
(gdb) p ((char*)slot->tts_values[natt])[0]
$7 = -60 '\304'
(gdb) p ((char*)slot->tts_values[natt])[1]
$8 = -101 '\233'
(gdb) p ((char*)slot->tts_values[natt])[2]
$9 = -4 '\374'
(gdb) p ((char*)slot->tts_values[natt])[3]
$10 = -109 '\223'
(gdb) p ((char*)slot->tts_values[natt])[4]
$11 = 0 '\000'
(gdb) p ((char*)slot->tts_values[natt])[5]
$12 = 0 '\000'
(gdb) p ((char*)slot->tts_values[natt])[6]
$13 = -8 '\370'

> > > Also, what does the replicated schema look like? Can we see the table
> > > definitions?
> >
> > ...
> > CREATE TABLE public.videos (
> >     id integer NOT NULL,
> >     title text NOT NULL,
> >     cover_image bytea NOT NULL,
> >     metadata jsonb NOT NULL,
> >     category integer NOT NULL,
> >     published date NOT NULL,
> >     added timestamp without time zone DEFAULT now() NOT NULL,
> >     played boolean DEFAULT false NOT NULL
> > );
>
> OK, so this is the problematic table. Any idea what operation triggers
> the issue? Having a small reproducer (say, a single-row insert or so)
> would be very helpful.

Can something be extracted from the logical decoding data I sent in the other
e-mail? That seems to contain the failing operation, but I don't know how to
decode it off the top of my head from that binary representation.

regards,
	o.

> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services