Thread: logical decoding bug: segfault in ReorderBufferToastReplace()
On 9/8/19 14:01, Tom Lane wrote:
> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>
> ...
>
> Details
> -------
> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56

We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.

We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.

https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034

3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
3034    toast_desc = RelationGetDescr(toast_rel);

We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.

==========

backtrace showing the call stack...

Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
Program terminated with signal 11, Segmentation fault.
#0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
3034    reorderbuffer.c: No such file or directory.
...
(gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
#1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
#2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
#3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
#4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
#5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
#6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
#7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
#8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
#9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
#10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
#11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
#12 ServerLoop () at postmaster.c:1759
#13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
#14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228

==========

Some additional context...

# select * from pg_publication_rel;
 prpubid | prrelid
---------+---------
   71417 |   16453
   71417 |   54949
(2 rows)

(gdb) print toast_rel
$4 = (struct RelationData *) 0x0

(gdb) print *relation->rd_rel
$11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0, relhasindex = true, relisshared = false, relpersistence = 112 'p', relkind = 114 'r', relnatts = 4, relchecks = 0, relhasoids = false, relhasrules = false, relhastriggers = false, relhassubclass = false, relrowsecurity = false, relforcerowsecurity = false, relispopulated = true, relreplident = 100 'd', relispartition = false, relrewrite = 0, relfrozenxid = 1808, relminmxid = 1}

(gdb) print *relation
$12 = {rd_node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, rd_smgr = 0x0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x2b7917724bd8, rd_att = 0x2b7917724ce8, rd_id = 16428, rd_lockInfo = {lockRelId = {relId = 16428, dbId = 16401}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0, rd_partcheckvalid = false, rd_partcheckcxt = 0x0}

(gdb) print *desc
$13 = {natts = 4, tdtypeid = 16430, tdtypmod = -1, tdhasoid = false, tdrefcount = 1, constr = 0x2b7917724ef8, attrs = 0x2b7917724d08}

(gdb) print *txn
$2 = {xid = 1358809, has_catalog_changes = true, is_known_as_subxact = false, toplevel_xid = 0, first_lsn = 9430473113640, final_lsn = 9430473346032, end_lsn = 9430473350592, restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0, commit_time = 628712466364268, base_snapshot = 0x308cdc0, base_snapshot_lsn = 9430473113776, base_snapshot_node = {prev = 0x3086b08, next = 0x3086b08}, nentries = 357, nentries_mem = 357, serialized = false, changes = {head = {prev = 0x30aca08, next = 0x309aac8}}, tuplecids = {head = {prev = 0x30ac878, next = 0x309ab18}}, ntuplecids = 151, tuplecid_hash = 0x30b0bf0, toast_hash = 0x30bb460, subtxns = {head = {prev = 0x3094b30, next = 0x3094b30}}, nsubtxns = 0, ninvalidations = 278, invalidations = 0x30acb08, node = {prev = 0x3086af8, next = 0x3086af8}}

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT, origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401, relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0, newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359, cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix = 0x40110000067f <Address 0x40110000067f out of bounds>, message_size = 4294983724, message = 0x0}, snapshot = 0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1, bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392, combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}

--
Jeremy Schneider
Database Engineer
Amazon Web Services
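[Note: the commit referenced at the top of this message guards exactly this kind of RelationIdGetRelation() call. A paraphrased sketch of that style of check, reconstructed from memory rather than quoted from the commit, looks like this:

    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    if (!RelationIsValid(toast_rel))
        elog(ERROR, "could not open relation with OID %u",
             relation->rd_rel->reltoastrelid);
    toast_desc = RelationGetDescr(toast_rel);

With a guard like this the walsender raises an ERROR instead of dereferencing a NULL Relation, which is the behavior difference the report describes between 11.5 and builds that include the commit.]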
On Wed, Dec 04, 2019 at 05:36:16PM -0800, Jeremy Schneider wrote:
> On 9/8/19 14:01, Tom Lane wrote:
>> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>>
>> ...
>>
>> Details
>> -------
>> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56
>
> We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.
>
> We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.
>
> https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034
>
> 3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
> 3034    toast_desc = RelationGetDescr(toast_rel);
>
> We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.
>
> ==========
>
> backtrace showing the call stack...
>
> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> Program terminated with signal 11, Segmentation fault.
> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> 3034    reorderbuffer.c: No such file or directory.
> ...
> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
> #2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
> #3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
> #4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
> #5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
> #6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
> #7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
> #8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
> #9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
> #10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
> #11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
> #12 ServerLoop () at postmaster.c:1759
> #13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
> #14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228
>
> ==========
>
> Some additional context...
>
> # select * from pg_publication_rel;
>  prpubid | prrelid
> ---------+---------
>    71417 |   16453
>    71417 |   54949
> (2 rows)
>
> (gdb) print toast_rel
> $4 = (struct RelationData *) 0x0
>
> (gdb) print *relation->rd_rel
> $11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,

Hmmm, so reltoastrelid = 0, i.e. the relation does not have a TOAST relation. Yet we're calling ReorderBufferToastReplace on the decoded record ... interesting.

Can you share structure of the relation causing the issue?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/9/19, 10:10 AM, "Tomas Vondra" <tomas.vondra@2ndquadrant.com> wrote:
> On Wed, Dec 04, 2019 at 05:36:16PM -0800, Jeremy Schneider wrote:
>> On 9/8/19 14:01, Tom Lane wrote:
>>> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>>>
>>> ...
>>>
>>> Details
>>> -------
>>> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56
>>
>> We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.
>>
>> We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.
>>
>> https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034
>>
>> 3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
>> 3034    toast_desc = RelationGetDescr(toast_rel);
>>
>> We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.
>>
>> ==========
>>
>> backtrace showing the call stack...
>>
>> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
>> 3034    reorderbuffer.c: No such file or directory.
>> ...
>> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
>> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
>> #2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
>> #3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
>> #4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
>> #5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
>> #6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
>> #7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
>> #8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
>> #9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
>> #10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
>> #11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
>> #12 ServerLoop () at postmaster.c:1759
>> #13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
>> #14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228
>>
>> ==========
>>
>> Some additional context...
>>
>> # select * from pg_publication_rel;
>>  prpubid | prrelid
>> ---------+---------
>>    71417 |   16453
>>    71417 |   54949
>> (2 rows)
>>
>> (gdb) print toast_rel
>> $4 = (struct RelationData *) 0x0
>>
>> (gdb) print *relation->rd_rel
>> $11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,
>
> Hmmm, so reltoastrelid = 0, i.e. the relation does not have a TOAST relation. Yet we're calling ReorderBufferToastReplace on the decoded record ... interesting.
>
> Can you share structure of the relation causing the issue?

Here it is:

\d+ rel_having_issue
                                          Table "public.rel_having_issue"
 Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
--------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
 id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
 field1 | character varying(255)   |           |          |                                              | extended |              |
 field2 | integer                  |           |          |                                              | plain    |              |
 field3 | timestamp with time zone |           |          |                                              | plain    |              |
Indexes:
    "rel_having_issue_pkey" PRIMARY KEY, btree (id)

select relname,relfilenode,reltoastrelid from pg_class where relname='rel_having_issue';
     relname      | relfilenode | reltoastrelid
------------------+-------------+---------------
 rel_having_issue |       16428 |             0

Bertrand
On Wed, Dec 11, 2019 at 3:17 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
> Here it is:
>
> \d+ rel_having_issue

You did a heck of a job choosing the name of that table. I bet nobody was surprised when it had an issue! :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-12-11 08:17:01 +0000, Drouvot, Bertrand wrote:
> >> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> >> 3034    reorderbuffer.c: No such file or directory.
> >> ...
> >> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> >> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584

This indicates that a toast record was present for that relation, despite:

> \d+ rel_having_issue
>                                           Table "public.rel_having_issue"
>  Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
> --------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
>  id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
>  field1 | character varying(255)   |           |          |                                              | extended |              |
>  field2 | integer                  |           |          |                                              | plain    |              |
>  field3 | timestamp with time zone |           |          |                                              | plain    |              |
> Indexes:
>     "rel_having_issue_pkey" PRIMARY KEY, btree (id)
>
> select relname,relfilenode,reltoastrelid from pg_class where relname='rel_having_issue';
>      relname      | relfilenode | reltoastrelid
> ------------------+-------------+---------------
>  rel_having_issue |       16428 |             0

I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.

Has there been DDL to this table?

Could you print out *change?

Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> This indicates that a toast record was present for that relation, despite:
> [ \d that looks like the table isn't wide enough for that ]

> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

It strikes me that there could easily be cases where a publisher table has a toast relation and a subscriber's doesn't ... maybe this code isn't expecting that?

			regards, tom lane
On Wed, Dec 11, 2019 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

I believe that output was already shown earlier in the thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-12-11 12:11:03 -0500, Tom Lane wrote:
> I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

That was checked in the email I was responding to.

> It strikes me that there could easily be cases where a publisher table has a toast relation and a subscriber's doesn't ... maybe this code isn't expecting that?

This code is all running on the publisher side, so I don't think it could matter.

Greetings,

Andres Freund
On 12/11/19 08:35, Andres Freund wrote:
> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

Unfortunately the WAL logs are no longer available at this time. :(

I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.

> Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.

done - in the commit that I replied to when I started this thread :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=69f883fef14a3fc5849126799278abcc43f40f56

> Has there been DDL to this table?

I'm not sure that we will be able to find out at this point.

> Could you print out *change?

This was also in the original email - here it is:

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT, origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401, relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0, newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359, cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix = 0x40110000067f <Address 0x40110000067f out of bounds>, message_size = 4294983724, message = 0x0}, snapshot = 0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1, bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392, combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}

> Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?

It's not changed from community code in any relevant regions. (Also, FYI, this is not Aurora.)

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
Hi,

On 2019-12-13 16:13:35 -0800, Jeremy Schneider wrote:
> On 12/11/19 08:35, Andres Freund wrote:
> > I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.
>
> Unfortunately the WAL logs are no longer available at this time. :(
>
> I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.

Well, not the records directly, but the changes could be, depending on the size of the changes. That'd already help. It depends a bit on whether there are subtransactions or not (txn->nsubtxns will tell you). Within one transaction, the currently loaded (i.e. not changes that are spilled to disk, and haven't currently been restored - see txn->serialized) changes are in ReorderBufferTXN->changes.

> > Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.
>
> done - in the commit that I replied to when I started this thread :)
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=69f883fef14a3fc5849126799278abcc43f40f56

Ah, I was actually thinking this is the thread of a similar sounding bug, where ReorderBufferToastReplace would crash because there isn't actually a new tuple - there somehow toast changes exist for a delete.

> > Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?
>
> It's not changed from community code in any relevant regions. (Also, FYI, this is not Aurora.)

Well, I've heard mutterings that plain RDS postgres had some efficiency improvements around snapshots (in the GetSnapshotData() sense) - and that's an area where slightly wrong changes could quite plausibly cause a bug like this.

Greetings,

Andres Freund
On 12/13/19 16:25, Andres Freund wrote:
> On 2019-12-13 16:13:35 -0800, Jeremy Schneider wrote:
>> On 12/11/19 08:35, Andres Freund wrote:
>>> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.
>>
>> Unfortunately the WAL logs are no longer available at this time. :(
>>
>> I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.
>
> Well, not the records directly, but the changes could be, depending on the size of the changes. That'd already help. It depends a bit on whether there are subtransactions or not (txn->nsubtxns will tell you). Within one transaction, the currently loaded (i.e. not changes that are spilled to disk, and haven't currently been restored - see txn->serialized) changes are in ReorderBufferTXN->changes.

I did include the txn in the original post to this thread; there are 357 changes in the transaction and they are all in memory (none spilled to disk a.k.a. serialized). No subtransactions. However I do see that "txn.has_catalog_changes = true" which makes me wonder if that's related to the bug.

So... now I know... walking a dlist in gdb and dumping all the changes is not exactly a walk in the park! Need some python magic like Tomas Vondra's script that decodes Nodes. I was not yet successful today in figuring out how to do this... so the changes are there in the core dump but I can't get them yet. :) I also dug around the ReorderBufferIterTXNState a little bit but there's nothing that isn't already in the original post. If anyone has a trick for walking a dlist in gdb that would be awesome...

I'm off for holidays and won't be working on this for a couple weeks; not sure whether it'll be possible to get to the bottom of it. But I hope there's enough info in this thread to at least get a head start if someone hits it again in the future.

> Well, I've heard mutterings that plain RDS postgres had some efficiency improvements around snapshots (in the GetSnapshotData() sense) - and that's an area where slightly wrong changes could quite plausibly cause a bug like this.

Definitely no changes around snapshots. I've never even heard anyone talk about making changes like that in RDS PostgreSQL - feels to me like people at AWS want it to be as close as possible to postgresql.org code. Aurora is different; it feels to me like the engineering org has more license to make changes. For example they re-wrote the subtransaction subsystem. No changes to GetSnapshotData though.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
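[Note: the "walking a dlist" mentioned above is the standard intrusive-list pattern used throughout the backend: each ReorderBufferChange is linked into txn->changes through its "node" field, and iterating means following the next pointers and converting each node back to its containing struct. A minimal C sketch, assuming the PostgreSQL 11 headers (the walk_changes name is made up for illustration):

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "replication/reorderbuffer.h"

    /* Iterate the in-memory changes of a transaction, oldest first. */
    static void
    walk_changes(ReorderBufferTXN *txn)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &txn->changes)
        {
            ReorderBufferChange *change =
                dlist_container(ReorderBufferChange, node, iter.cur);

            /* e.g. inspect change->action and change->data.tp.relnode */
            elog(LOG, "change at lsn %X/%X: action %d",
                 (uint32) (change->lsn >> 32), (uint32) change->lsn,
                 change->action);
        }
    }

The gdb equivalent is to start from txn->changes.head.next and repeatedly apply the dlist_container() offset by hand (or via a python helper) until the cursor points back at &txn->changes.head.]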
[Resent -- apologies to those who are getting this email twice. Please be mindful to reply to this one if you do. I think the no-crosspost policy is very obnoxious and should be relaxed.]

On 2019-Dec-11, Andres Freund wrote:
> On 2019-12-11 08:17:01 +0000, Drouvot, Bertrand wrote:
> > >> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> > >> Program terminated with signal 11, Segmentation fault.
> > >> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> > >> 3034    reorderbuffer.c: No such file or directory.
> > >> ...
> > >> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> > >> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
>
> This indicates that a toast record was present for that relation, despite:

Can you explain what it is you saw that indicates that a toast record was present for the relation? I may be misreading the code, but there's nothing obvious that says that if we reach there, then a toast datum exists anywhere for this relation. We only know that txn->toast_hash is set, but that could be because the transaction touched a toast record in some other table. Right?

> > \d+ rel_having_issue
> >                                           Table "public.rel_having_issue"
> >  Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
> > --------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
> >  id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
> >  field1 | character varying(255)   |           |          |                                              | extended |              |
> >  field2 | integer                  |           |          |                                              | plain    |              |
> >  field3 | timestamp with time zone |           |          |                                              | plain    |              |
> > Indexes:
> >     "rel_having_issue_pkey" PRIMARY KEY, btree (id)

--
Álvaro Herrera                            39°49'30"S 73°17'W
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right?
<crab> i.e., "deadly poison"
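[Note: this reading matches the shape of the function as best it can be reconstructed from memory: the only early exit before the crashing dereference is on the transaction-wide hash, so reaching line 3033 only proves that some toast chunk was queued somewhere in the transaction, not that it belongs to this relation. A condensed, paraphrased sketch (not the verbatim source):

    /* inside ReorderBufferToastReplace() */
    if (txn->toast_hash == NULL)    /* no toast chunks queued in this txn */
        return;

    /*
     * The hash is kept per transaction, not per relation, so nothing here
     * ties the queued chunks to "relation" before its toast table is opened.
     */
    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    toast_desc = RelationGetDescr(toast_rel);   /* NULL dereference if reltoastrelid is 0 */

The follow-up below points at one way such a stale hash can survive: a speculative insertion that is aborted without its toast chunks being cleaned up.]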
On Sat, Jun 5, 2021 at 5:41 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > This indicates that a toast record was present for that relation, despite:
>
> Can you explain what it is you saw that indicates that a toast record was present for the relation? I may be misreading the code, but there's nothing obvious that says that if we reach there, then a toast datum exists anywhere for this relation. We only know that txn->toast_hash is set, but that could be because the transaction touched a toast record in some other table. Right?

Is this problem related to the thread [1], where due to a speculative-insert abort the toast hash was not deleted, and after that, if the next record is for some other relation that does not have a toast table, you will see this error? There are a few other problems if the toast hash is not cleaned up after a spec abort. I have submitted patches with two approaches in that thread.

[1] https://www.postgresql.org/message-id/CAFiTN-szfpMXF2H%2Bmk3m_9AB610v103NTv7Z1E8uDBr9iQg1gg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com