logical decoding bug: segfault in ReorderBufferToastReplace() - Mailing list pgsql-bugs
From | Jeremy Schneider |
---|---|
Subject | logical decoding bug: segfault in ReorderBufferToastReplace() |
Date | |
Msg-id | b1da9abc-bad7-efd6-2f1c-503d508c3877@amazon.com Whole thread Raw |
Responses |
Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Re: logical decoding bug: segfault in ReorderBufferToastReplace() |
List | pgsql-bugs |
On 9/8/19 14:01, Tom Lane wrote: > Fix RelationIdGetRelation calls that weren't bothering with error checks. > > ... > > Details > ------- > https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56 We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here. We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c. https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034 3033 toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid); 3034 toast_desc = RelationGetDescr(toast_rel); We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is. ========== backtrace showing the call stack... Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'. Program terminated with signal 11, Segmentation fault. #0 ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034 3034 reorderbuffer.c: No such file or directory. ... (gdb) #0 ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034 #1 ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584 #2 0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637 #3 DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245 #4 0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117 #5 0x0000000000727150 in XLogSendLogical () at walsender.c:2886 #6 0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249 #7 0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111 #8 exec_replication_command ( cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628 #9 0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182 #10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410 #11 BackendStartup (port=0x2fdec50) at postmaster.c:4082 #12 ServerLoop () at postmaster.c:1759 #13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432 #14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228 ========== Some additional context... # select * from pg_publication_rel; prpubid | prrelid ---------+--------- 71417 | 16453 71417 | 54949 (2 rows) (gdb) print toast_rel $4 = (struct RelationData *) 0x0 (gdb) print *relation->rd_rel $11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0, relhasindex = true, relisshared = false, relpersistence = 112 'p', relkind = 114 'r', relnatts = 4, relchecks = 0, relhasoids = false, relhasrules = false, relhastriggers = false, relhassubclass = false, relrowsecurity = false, relforcerowsecurity = false, relispopulated = true, relreplident = 100 'd', relispartition = false, relrewrite = 0, relfrozenxid = 1808, relminmxid = 1} (gdb) print *relation $12 = {rd_node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, rd_smgr = 0x0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x2b7917724bd8, rd_att = 0x2b7917724ce8, rd_id = 16428, rd_lockInfo = {lockRelId = {relId = 16428, dbId = 16401}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0, rd_partcheckvalid = false, rd_partcheckcxt = 0x0} (gdb) print *desc $13 = {natts = 4, tdtypeid = 16430, tdtypmod = -1, tdhasoid = false, tdrefcount = 1, constr = 0x2b7917724ef8, attrs = 0x2b7917724d08} (gdb) print *txn $2 = {xid = 1358809, has_catalog_changes = true, is_known_as_subxact = false, toplevel_xid = 0, first_lsn = 9430473113640, final_lsn = 9430473346032, end_lsn = 9430473350592, restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0, commit_time = 628712466364268, base_snapshot = 0x308cdc0, base_snapshot_lsn = 9430473113776, base_snapshot_node = {prev = 0x3086b08, next = 0x3086b08}, nentries = 357, nentries_mem = 357, serialized = false, changes = {head = {prev = 0x30aca08, next = 0x309aac8}}, tuplecids = {head = {prev = 0x30ac878, next = 0x309ab18}}, ntuplecids = 151, tuplecid_hash = 0x30b0bf0, toast_hash = 0x30bb460, subtxns = {head = {prev = 0x3094b30, next = 0x3094b30}}, nsubtxns = 0, ninvalidations = 278, invalidations = 0x30acb08, node = { prev = 0x3086af8, next = 0x3086af8}} (gdb) print *change $1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT, origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401, relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0, newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359, cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix = 0x40110000067f <Address 0x40110000067f out of bounds>, message_size = 4294983724, message = 0x0}, snapshot = 0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1, bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392, combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}} -- Jeremy Schneider Database Engineer Amazon Web Services
pgsql-bugs by date: