logical decoding bug: segfault in ReorderBufferToastReplace() - Mailing list pgsql-bugs

From Jeremy Schneider
Subject logical decoding bug: segfault in ReorderBufferToastReplace()
Date
Msg-id b1da9abc-bad7-efd6-2f1c-503d508c3877@amazon.com
Whole thread Raw
Responses Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Re: logical decoding bug: segfault in ReorderBufferToastReplace()
List pgsql-bugs
On 9/8/19 14:01, Tom Lane wrote:
> Fix RelationIdGetRelation calls that weren't bothering with error checks.
> 
> ...
> 
> Details
> -------
> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56

We had two different databases this week (with the same schema) both
independently hit the condition of this recent commit from Tom. It's on
11.5 so we're actually segfaulting and restarting rather than just
causing the walsender process to ERROR, but regardless there's still
some underlying bug here.

We have core files and we're still working to see if we can figure out
what's going on, but I thought I'd report now in case anyone has extra
ideas or suggestions.  The segfault is on line 3034 of reorderbuffer.c.

https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034

3033     toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
3034     toast_desc = RelationGetDescr(toast_rel);

We'll keep looking; let me know any feedback! Would love to track down
whatever bug is in the logical decoding code, if that's what it is.

==========

backtrace showing the call stack...

Core was generated by `postgres: walsender <NAME-REDACTED>
<DNS-REDACTED>(31712)'.
Program terminated with signal 11, Segmentation fault.
#0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78,
relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938)
    at reorderbuffer.c:3034
3034    reorderbuffer.c: No such file or directory.
...
(gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78,
relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938)
    at reorderbuffer.c:3034
#1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809,
commit_lsn=9430473346032, end_lsn=<optimized out>,
    commit_time=commit_time@entry=628712466364268,
origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at
reorderbuffer.c:1584
#2  0x0000000000716248 in DecodeCommit (xid=1358809,
parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
#3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at
decode.c:245
#4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70,
record=0x3069208) at decode.c:117
#5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
#6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0
<XLogSendLogical>) at walsender.c:2249
#7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at
walsender.c:1111
#8  exec_replication_command (
    cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT
\"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1',
publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
#9  0x000000000076e939 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>",
    username=<optimized out>) at postgres.c:4182
#10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
#11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
#12 ServerLoop () at postmaster.c:1759
#13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7,
argv=argv@entry=0x2f92540) at postmaster.c:1432
#14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228

==========

Some additional context...

# select * from pg_publication_rel;
 prpubid | prrelid
---------+---------
   71417 |   16453
   71417 |   54949
(2 rows)

(gdb) print toast_rel
$4 = (struct RelationData *) 0x0

(gdb) print *relation->rd_rel
$11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>},
relnamespace = 16402, reltype = 16430, reloftype = 0,
relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0,
relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,
relhasindex = true, relisshared = false, relpersistence = 112 'p',
relkind = 114 'r', relnatts = 4, relchecks = 0, relhasoids = false,
relhasrules = false, relhastriggers = false, relhassubclass = false,
relrowsecurity = false, relforcerowsecurity = false,
relispopulated = true, relreplident = 100 'd', relispartition = false,
relrewrite = 0, relfrozenxid = 1808, relminmxid = 1}

(gdb) print *relation
$12 = {rd_node = {spcNode = 1663, dbNode = 16401, relNode = 16428},
rd_smgr = 0x0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false,
rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000',
rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0,
rd_rel = 0x2b7917724bd8, rd_att = 0x2b7917724ce8, rd_id = 16428,
rd_lockInfo = {lockRelId = {relId = 16428, dbId = 16401}},
rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0,
rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0,
rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0,
rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0,
rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0,
rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0,
rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0,
rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0,
rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support =
0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0,
rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats =
0x0, rd_amcache = 0x0, rd_indcollation = 0x0,
rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0,
rd_partcheckvalid = false, rd_partcheckcxt = 0x0}

(gdb) print *desc
$13 = {natts = 4, tdtypeid = 16430, tdtypmod = -1, tdhasoid = false,
tdrefcount = 1, constr = 0x2b7917724ef8, attrs = 0x2b7917724d08}

(gdb) print *txn
$2 = {xid = 1358809, has_catalog_changes = true, is_known_as_subxact =
false, toplevel_xid = 0, first_lsn = 9430473113640,
  final_lsn = 9430473346032, end_lsn = 9430473350592,
restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0,
  commit_time = 628712466364268, base_snapshot = 0x308cdc0,
base_snapshot_lsn = 9430473113776, base_snapshot_node = {prev = 0x3086b08,
    next = 0x3086b08}, nentries = 357, nentries_mem = 357, serialized =
false, changes = {head = {prev = 0x30aca08, next = 0x309aac8}},
  tuplecids = {head = {prev = 0x30ac878, next = 0x309ab18}}, ntuplecids
= 151, tuplecid_hash = 0x30b0bf0, toast_hash = 0x30bb460,
  subtxns = {head = {prev = 0x3094b30, next = 0x3094b30}}, nsubtxns = 0,
ninvalidations = 278, invalidations = 0x30acb08, node = {
    prev = 0x3086af8, next = 0x3086af8}}

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT,
origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401,
        relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0,
newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359,
      cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix =
0x40110000067f <Address 0x40110000067f out of bounds>,
      message_size = 4294983724, message = 0x0}, snapshot =
0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663,
        dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1,
bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392,
      combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}




-- 
Jeremy Schneider
Database Engineer
Amazon Web Services



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16140: View with INSERT, DO INSTEAD, and ON CONFLICT causes an error
Next
From: Petr Fedorov
Date:
Subject: A row-level trigger on a partitioned table is not created on asub-partition created later