Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker - Mailing list pgsql-bugs
| From | Tomas Vondra |
|---|---|
| Subject | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker |
| Date | |
| Msg-id | 20191121125930.or5fhpi2yjgkmjye@development |
| In response to | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>) |
| Responses | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker |
| List | pgsql-bugs |
On Thu, Nov 21, 2019 at 12:53:26PM +0100, Ondřej Jirman wrote:
>Hello,
>
>On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      16129
>> > Logged by:          Ondrej Jirman
>> > Email address:      ienieghapheoghaiwida@xff.cz
>> > PostgreSQL version: 12.1
>> > Operating system:   Arch Linux
>> > Description:
>> >
>> > Hello,
>> >
>> > I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via the
>> > pg_dumpall method and after a while I started getting segfaults in the
>> > logical replication worker.
>> >
>> > My setup is fairly vanilla, non-default options:
>> >
>> > shared_buffers = 256MB
>> > work_mem = 512MB
>> > temp_buffers = 64MB
>> > maintenance_work_mem = 4GB
>> > effective_cache_size = 16GB
>> > max_logical_replication_workers = 30
>> > max_replication_slots = 30
>> > max_worker_processes = 30
>> > wal_level = logical
>> >
>> > I have several databases that I subscribe to from this database cluster
>> > using logical replication.
>> >
>> > Replication of one of my databases (running on an ARMv7 machine) started
>> > segfaulting on the subscriber side (x86_64) like this:
>> >
>> > #0  0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
>> >     /usr/lib/libc.so.6
>> > #1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
>> >     __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>> > #2  tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
>> > #3  0x000055d033e94d32 in ExecFetchSlotHeapTuple
>> >     (slot=slot@entry=0x55d0356da3b8, materialize=materialize@entry=true,
>> >     shouldFree=shouldFree@entry=0x7fff0e7cf387) at execTuples.c:1624
>
>I forgot to add that the publisher is still PostgreSQL 11.5.
>
>> Hmmm, so it's failing on this memcpy() in tts_virtual_materialize:
>>
>>     else
>>     {
>>         Size        data_length = 0;
>>
>>         data = (char *) att_align_nominal(data, att->attalign);
>>         data_length = att_addlength_datum(data_length, att->attlen, val);
>>
>>         memcpy(data, DatumGetPointer(val), data_length);
>
>This is a bytea column in one of the tables on the publisher.
>
>So this looks like it's segfaulting while trying to copy too many bytes
>(data_length determined as 620,701,425 bytes):
>
>#1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>
>But the maximum length of any bytea value in the publisher database is <200kB.
>

OK. So the public.videos.cover_image column, I suppose?

Anyway, this likely means the pointer is bogus in some way. Maybe it's
pointing to already freed memory, or something like that.
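For context on where that length comes from: cover_image is a bytea, i.e. a
varlena column with attlen = -1, so att_addlength_datum() takes the length
from the varlena header stored at the start of the datum itself (it boils
down to VARSIZE_ANY() on the pointer). If tts_values[natt] holds a dangling
or corrupted pointer, whatever bytes happen to sit at that address are read
as the header, which is one plausible way to end up with a ~620 MB copy
length. A rough sketch of that path, using the real macros but a made-up
helper name (varlena_copy_length), as an illustration rather than the actual
execTuples.c code:

    /*
     * Sketch: how tts_virtual_materialize() arrives at data_length for a
     * varlena attribute (att->attlen == -1).  att_addlength_datum() ends up
     * calling VARSIZE_ANY() on the datum pointer, so the length is whatever
     * the header bytes at that address claim it to be.
     */
    #include "postgres.h"
    #include "access/tupmacs.h"

    static Size
    varlena_copy_length(Datum val)
    {
        Size        data_length = 0;

        /* for attlen == -1 this reads VARSIZE_ANY(DatumGetPointer(val)) */
        data_length = att_addlength_datum(data_length, -1, val);

        return data_length;
    }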
>>
>>                 slot->tts_values[natt] = PointerGetDatum(data);
>>                 data += data_length;
>>             }
>>
>> The question is, which of the pointers is bogus. You seem to already
>> have a core file, so can you inspect the variables in frame #2? I think
>> especially
>>
>>   p *slot
>>   p natt
>>   p val
>>   p *att
>>
>> would be interesting to see.
>
>(gdb) p *slot
>$1 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 8, tts_ops = 0x558149ff4da0 <TTSOpsVirtual>, tts_tupleDescriptor = 0x7fcca2ea7548, tts_values = 0x55814adfbc10,
>  tts_isnull = 0x55814adfbc50, tts_mcxt = 0x55814adfb6e0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
>(gdb) p natt
>$2 = 2
>(gdb) p val
>$3 = <optimized out>
>(gdb) p slot->tts_values[nat]
>No symbol "nat" in current context.
>(gdb) p slot->tts_values[natt]
>$4 = 94013795319824
>(gdb) p *slot->tts_values[natt]
>$5 = -1812161596
>(gdb) p *att
>$6 = {attrelid = 55240, attname = {data = "cover_image", '\000' <repeats 52 times>}, atttypid = 17, attstattarget = -1, attlen = -1, attnum = 3, attndims = 0, attcacheoff = -1, atttypmod = -1,
>  attbyval = false, attstorage = 120 'x', attalign = 105 'i', attnotnull = true, atthasdef = false, atthasmissing = false, attidentity = 0 '\000', attgenerated = 0 '\000', attisdropped = false,
>  attislocal = true, attinhcount = 0, attcollation = 0}
>

OK, nothing obviously bogus here, I think.

It's not entirely clear to me why this works

  (gdb) p *slot->tts_values[natt]
  $5 = -1812161596

because tts_values[natt] is not a pointer, I think. So I'm not sure how to
interpret the dereference. It would be interesting to see the first couple
of bytes. I think something like

  (gdb) p ((char*)slot->tts_values[natt])[$n]

for $n between 0 and 3 should do the trick.

>> Also, what does the replicated schema look like? Can we see the table
>> definitions?
>
> ...
>CREATE TABLE public.videos (
>    id integer NOT NULL,
>    title text NOT NULL,
>    cover_image bytea NOT NULL,
>    metadata jsonb NOT NULL,
>    category integer NOT NULL,
>    published date NOT NULL,
>    added timestamp without time zone DEFAULT now() NOT NULL,
>    played boolean DEFAULT false NOT NULL
>);

OK, so this is the problematic table. Any idea what operation triggers the
issue? Having a small reproducer (say, a single-row insert or so) would be
very helpful.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
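A follow-up observation that can be checked offline: assuming the
"$5 = -1812161596" above is simply the first four bytes at the address held
in tts_values[natt], read as a 32-bit int, then decoding that word as a
little-endian 4-byte varlena header (the way VARSIZE_4B() does it on x86_64)
yields exactly the 620,701,425 bytes reported as __len in the memcpy() frame.
That would suggest the bogus copy length comes straight from whatever memory
the datum pointer ends up pointing at. A minimal standalone check of the
arithmetic (the constants are taken from the thread; the decoding mirrors the
little-endian branch in postgres.h):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* gdb: p *slot->tts_values[natt]  ->  $5 = -1812161596 */
        uint32_t    header = (uint32_t) -1812161596;    /* 0x93FC9BC4 */

        /* low two bits are 00: an uncompressed inline 4-byte-header varlena */
        uint32_t    len = (header >> 2) & 0x3FFFFFFF;   /* VARSIZE_4B() logic */

        printf("decoded varlena length: %u\n", len);    /* prints 620701425 */
        return 0;
    }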