Thread: logical decoding bug: segfault in ReorderBufferToastReplace()
On 9/8/19 14:01, Tom Lane wrote:
> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>
> ...
>
> Details
> -------
> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56

We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.

We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.

https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034

3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
3034    toast_desc = RelationGetDescr(toast_rel);

We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.

==========

backtrace showing the call stack...

Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
Program terminated with signal 11, Segmentation fault.
#0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
3034    reorderbuffer.c: No such file or directory.
...
(gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
#1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
#2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
#3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
#4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
#5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
#6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
#7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
#8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
#9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
#10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
#11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
#12 ServerLoop () at postmaster.c:1759
#13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
#14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228

==========

Some additional context...

# select * from pg_publication_rel;
 prpubid | prrelid
---------+---------
   71417 |   16453
   71417 |   54949
(2 rows)

(gdb) print toast_rel
$4 = (struct RelationData *) 0x0

(gdb) print *relation->rd_rel
$11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0, relhasindex = true, relisshared = false, relpersistence = 112 'p', relkind = 114 'r', relnatts = 4, relchecks = 0, relhasoids = false, relhasrules = false, relhastriggers = false, relhassubclass = false, relrowsecurity = false, relforcerowsecurity = false, relispopulated = true, relreplident = 100 'd', relispartition = false, relrewrite = 0, relfrozenxid = 1808, relminmxid = 1}

(gdb) print *relation
$12 = {rd_node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, rd_smgr = 0x0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x2b7917724bd8, rd_att = 0x2b7917724ce8, rd_id = 16428, rd_lockInfo = {lockRelId = {relId = 16428, dbId = 16401}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0, rd_partcheckvalid = false, rd_partcheckcxt = 0x0}

(gdb) print *desc
$13 = {natts = 4, tdtypeid = 16430, tdtypmod = -1, tdhasoid = false, tdrefcount = 1, constr = 0x2b7917724ef8, attrs = 0x2b7917724d08}

(gdb) print *txn
$2 = {xid = 1358809, has_catalog_changes = true, is_known_as_subxact = false, toplevel_xid = 0, first_lsn = 9430473113640, final_lsn = 9430473346032, end_lsn = 9430473350592, restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0, commit_time = 628712466364268, base_snapshot = 0x308cdc0, base_snapshot_lsn = 9430473113776, base_snapshot_node = {prev = 0x3086b08, next = 0x3086b08}, nentries = 357, nentries_mem = 357, serialized = false, changes = {head = {prev = 0x30aca08, next = 0x309aac8}}, tuplecids = {head = {prev = 0x30ac878, next = 0x309ab18}}, ntuplecids = 151, tuplecid_hash = 0x30b0bf0, toast_hash = 0x30bb460, subtxns = {head = {prev = 0x3094b30, next = 0x3094b30}}, nsubtxns = 0, ninvalidations = 278, invalidations = 0x30acb08, node = {prev = 0x3086af8, next = 0x3086af8}}

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT, origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401, relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0, newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359, cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix = 0x40110000067f <Address 0x40110000067f out of bounds>, message_size = 4294983724, message = 0x0}, snapshot = 0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1, bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392, combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}

--
Jeremy Schneider
Database Engineer
Amazon Web Services
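[Note: the commit referenced at the top of this message guards exactly this kind of RelationIdGetRelation() call. A paraphrased sketch of that style of check, reconstructed from memory rather than quoted from the commit, looks like this:

    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    if (!RelationIsValid(toast_rel))
        elog(ERROR, "could not open relation with OID %u",
             relation->rd_rel->reltoastrelid);
    toast_desc = RelationGetDescr(toast_rel);

With a guard like this the walsender raises an ERROR instead of dereferencing a NULL Relation, which is the behavior difference the report describes between 11.5 and builds that include the commit.]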
On Wed, Dec 04, 2019 at 05:36:16PM -0800, Jeremy Schneider wrote:
> On 9/8/19 14:01, Tom Lane wrote:
>> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>>
>> ...
>>
>> Details
>> -------
>> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56
>
> We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.
>
> We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.
>
> https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034
>
> 3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
> 3034    toast_desc = RelationGetDescr(toast_rel);
>
> We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.
>
> ==========
>
> backtrace showing the call stack...
>
> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> Program terminated with signal 11, Segmentation fault.
> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> 3034    reorderbuffer.c: No such file or directory.
> ...
> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
> #2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
> #3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
> #4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
> #5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
> #6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
> #7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
> #8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
> #9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
> #10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
> #11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
> #12 ServerLoop () at postmaster.c:1759
> #13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
> #14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228
>
> ==========
>
> Some additional context...
>
> # select * from pg_publication_rel;
>  prpubid | prrelid
> ---------+---------
>    71417 |   16453
>    71417 |   54949
> (2 rows)
>
> (gdb) print toast_rel
> $4 = (struct RelationData *) 0x0
>
> (gdb) print *relation->rd_rel
> $11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,

Hmmm, so reltoastrelid = 0, i.e. the relation does not have a TOAST relation. Yet we're calling ReorderBufferToastReplace on the decoded record ... interesting.

Can you share structure of the relation causing the issue?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/9/19, 10:10 AM, "Tomas Vondra" <tomas.vondra@2ndquadrant.com> wrote:
> On Wed, Dec 04, 2019 at 05:36:16PM -0800, Jeremy Schneider wrote:
>> On 9/8/19 14:01, Tom Lane wrote:
>>> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>>>
>>> ...
>>>
>>> Details
>>> -------
>>> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56
>>
>> We had two different databases this week (with the same schema) both independently hit the condition of this recent commit from Tom. It's on 11.5 so we're actually segfaulting and restarting rather than just causing the walsender process to ERROR, but regardless there's still some underlying bug here.
>>
>> We have core files and we're still working to see if we can figure out what's going on, but I thought I'd report now in case anyone has extra ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.
>>
>> https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034
>>
>> 3033    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
>> 3034    toast_desc = RelationGetDescr(toast_rel);
>>
>> We'll keep looking; let me know any feedback! Would love to track down whatever bug is in the logical decoding code, if that's what it is.
>>
>> ==========
>>
>> backtrace showing the call stack...
>>
>> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
>> 3034    reorderbuffer.c: No such file or directory.
>> ...
>> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
>> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
>> #2  0x0000000000716248 in DecodeCommit (xid=1358809, parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
>> #3  DecodeXactOp (ctx=0x3068f70, buf=buf@entry=0x7ffc4ce125b0) at decode.c:245
>> #4  0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70, record=0x3069208) at decode.c:117
>> #5  0x0000000000727150 in XLogSendLogical () at walsender.c:2886
>> #6  0x0000000000729192 in WalSndLoop (send_data=send_data@entry=0x7270f0 <XLogSendLogical>) at walsender.c:2249
>> #7  0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at walsender.c:1111
>> #8  exec_replication_command (cmd_string=cmd_string@entry=0x2f968b0 "START_REPLICATION SLOT \"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1', publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
>> #9  0x000000000076e939 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>", username=<optimized out>) at postgres.c:4182
>> #10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
>> #11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
>> #12 ServerLoop () at postmaster.c:1759
>> #13 0x00000000007062f9 in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x2f92540) at postmaster.c:1432
>> #14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228
>>
>> ==========
>>
>> Some additional context...
>>
>> # select * from pg_publication_rel;
>>  prpubid | prrelid
>> ---------+---------
>>    71417 |   16453
>>    71417 |   54949
>> (2 rows)
>>
>> (gdb) print toast_rel
>> $4 = (struct RelationData *) 0x0
>>
>> (gdb) print *relation->rd_rel
>> $11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>}, relnamespace = 16402, reltype = 16430, reloftype = 0, relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,
>
> Hmmm, so reltoastrelid = 0, i.e. the relation does not have a TOAST relation. Yet we're calling ReorderBufferToastReplace on the decoded record ... interesting.
>
> Can you share structure of the relation causing the issue?

Here it is:

\d+ rel_having_issue
                                          Table "public.rel_having_issue"
 Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
--------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
 id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
 field1 | character varying(255)   |           |          |                                              | extended |              |
 field2 | integer                  |           |          |                                              | plain    |              |
 field3 | timestamp with time zone |           |          |                                              | plain    |              |
Indexes:
    "rel_having_issue_pkey" PRIMARY KEY, btree (id)

select relname,relfilenode,reltoastrelid from pg_class where relname='rel_having_issue';
     relname      | relfilenode | reltoastrelid
------------------+-------------+---------------
 rel_having_issue |       16428 |             0

Bertrand
On Wed, Dec 11, 2019 at 3:17 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
> Here it is:
>
> \d+ rel_having_issue

You did a heck of a job choosing the name of that table. I bet nobody was surprised when it had an issue! :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-12-11 08:17:01 +0000, Drouvot, Bertrand wrote:
> >> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> >> 3034    reorderbuffer.c: No such file or directory.
> >> ...
> >> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> >> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584

This indicates that a toast record was present for that relation, despite:

> \d+ rel_having_issue
>                                           Table "public.rel_having_issue"
>  Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
> --------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
>  id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
>  field1 | character varying(255)   |           |          |                                              | extended |              |
>  field2 | integer                  |           |          |                                              | plain    |              |
>  field3 | timestamp with time zone |           |          |                                              | plain    |              |
> Indexes:
>     "rel_having_issue_pkey" PRIMARY KEY, btree (id)
>
> select relname,relfilenode,reltoastrelid from pg_class where relname='rel_having_issue';
>      relname      | relfilenode | reltoastrelid
> ------------------+-------------+---------------
>  rel_having_issue |       16428 |             0

I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.

Has there been DDL to this table?

Could you print out *change?

Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> This indicates that a toast record was present for that relation, despite:
> [ \d that looks like the table isn't wide enough for that ]

> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

It strikes me that there could easily be cases where a publisher table has a toast relation and a subscriber's doesn't ... maybe this code isn't expecting that?

			regards, tom lane
On Wed, Dec 11, 2019 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

I believe that output was already shown earlier in the thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-12-11 12:11:03 -0500, Tom Lane wrote:
> I don't think you can make that conclusion. Perhaps the table once needed a toast table because of some wide column that got dropped; if so, it'd still have one. It'd be safer to look at pg_class.reltoastrelid to verify existence (or not) of the toast relation.

That was checked in the email I was responding to.

> It strikes me that there could easily be cases where a publisher table has a toast relation and a subscriber's doesn't ... maybe this code isn't expecting that?

This code is all running on the publisher side, so I don't think it could matter.

Greetings,

Andres Freund
On 12/11/19 08:35, Andres Freund wrote:
> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.

Unfortunately the WAL logs are no longer available at this time. :(

I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.

> Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.

done - in the commit that I replied to when I started this thread :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=69f883fef14a3fc5849126799278abcc43f40f56

> Has there been DDL to this table?

I'm not sure that we will be able to find out at this point.

> Could you print out *change?

This was also in the original email - here it is:

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT, origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401, relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0, newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359, cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix = 0x40110000067f <Address 0x40110000067f out of bounds>, message_size = 4294983724, message = 0x0}, snapshot = 0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1, bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392, combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}

> Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?

It's not changed from community code in any relevant regions. (Also, FYI, this is not Aurora.)

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
Hi,

On 2019-12-13 16:13:35 -0800, Jeremy Schneider wrote:
> On 12/11/19 08:35, Andres Freund wrote:
> > I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.
>
> Unfortunately the WAL logs are no longer available at this time. :(
>
> I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.

Well, not the records directly, but the changes could be, depending on the size of the changes. That'd already help. It depends a bit on whether there are subtransactions or not (txn->nsubtxns will tell you). Within one transaction, the currently loaded (i.e. not changes that are spilled to disk, and haven't currently been restored - see txn->serialized) changes are in ReorderBufferTXN->changes.

> > Seems like we clearly should add an elog(ERROR) here, so we error out, rather than crash.
>
> done - in the commit that I replied to when I started this thread :)
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=69f883fef14a3fc5849126799278abcc43f40f56

Ah, I was actually thinking this is the thread of a similar sounding bug, where ReorderBufferToastReplace would crash because there isn't actually a new tuple - there somehow toast changes exist for a delete.

> > Is this version of postgres effectively unmodified in any potentially relevant region (snapshot computations, generation of WAL records, ...)?
>
> It's not changed from community code in any relevant regions. (Also, FYI, this is not Aurora.)

Well, I've heard mutterings that plain RDS postgres had some efficiency improvements around snapshots (in the GetSnapshotData() sense) - and that's an area where slightly wrong changes could quite plausibly cause a bug like this.

Greetings,

Andres Freund
On 12/13/19 16:25, Andres Freund wrote:
> On 2019-12-13 16:13:35 -0800, Jeremy Schneider wrote:
>> On 12/11/19 08:35, Andres Freund wrote:
>>> I think we need to see pg_waldump output for the preceding records. That might allow us to see why there's a toast record that's being associated with this table, despite there not being a toast table.
>>
>> Unfortunately the WAL logs are no longer available at this time. :(
>>
>> I did a little poking around in the core file and searching source code but didn't find anything yet. Is there any memory structure that would have the preceding/following records cached in memory? If so then I might be able to extract this from the core dumps.
>
> Well, not the records directly, but the changes could be, depending on the size of the changes. That'd already help. It depends a bit on whether there are subtransactions or not (txn->nsubtxns will tell you). Within one transaction, the currently loaded (i.e. not changes that are spilled to disk, and haven't currently been restored - see txn->serialized) changes are in ReorderBufferTXN->changes.

I did include the txn in the original post to this thread; there are 357 changes in the transaction and they are all in memory (none spilled to disk a.k.a. serialized). No subtransactions. However I do see that "txn.has_catalog_changes = true" which makes me wonder if that's related to the bug.

So... now I know... walking a dlist in gdb and dumping all the changes is not exactly a walk in the park! Need some python magic like Tomas Vondra's script that decodes Nodes. I was not yet successful today in figuring out how to do this... so the changes are there in the core dump but I can't get them yet. :) I also dug around the ReorderBufferIterTXNState a little bit but there's nothing that isn't already in the original post. If anyone has a trick for walking a dlist in gdb that would be awesome...

I'm off for holidays and won't be working on this for a couple weeks; not sure whether it'll be possible to get to the bottom of it. But I hope there's enough info in this thread to at least get a head start if someone hits it again in the future.

> Well, I've heard mutterings that plain RDS postgres had some efficiency improvements around snapshots (in the GetSnapshotData() sense) - and that's an area where slightly wrong changes could quite plausibly cause a bug like this.

Definitely no changes around snapshots. I've never even heard anyone talk about making changes like that in RDS PostgreSQL - feels to me like people at AWS want it to be as close as possible to postgresql.org code. Aurora is different; it feels to me like the engineering org has more license to make changes. For example they re-wrote the subtransaction subsystem. No changes to GetSnapshotData though.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
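[Note: the "walking a dlist" mentioned above is the standard intrusive-list pattern used throughout the backend: each ReorderBufferChange is linked into txn->changes through its "node" field, and iterating means following the next pointers and converting each node back to its containing struct. A minimal C sketch, assuming the PostgreSQL 11 headers (the walk_changes name is made up for illustration):

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "replication/reorderbuffer.h"

    /* Iterate the in-memory changes of a transaction, oldest first. */
    static void
    walk_changes(ReorderBufferTXN *txn)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &txn->changes)
        {
            ReorderBufferChange *change =
                dlist_container(ReorderBufferChange, node, iter.cur);

            /* e.g. inspect change->action and change->data.tp.relnode */
            elog(LOG, "change at lsn %X/%X: action %d",
                 (uint32) (change->lsn >> 32), (uint32) change->lsn,
                 change->action);
        }
    }

The gdb equivalent is to start from txn->changes.head.next and repeatedly apply the dlist_container() offset by hand (or via a python helper) until the cursor points back at &txn->changes.head.]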
[Resent -- apologies to those who are getting this email twice. Please be mindful to reply to this one if you do. I think the no-crosspost policy is very obnoxious and should be relaxed.]

On 2019-Dec-11, Andres Freund wrote:
> On 2019-12-11 08:17:01 +0000, Drouvot, Bertrand wrote:
> > >> Core was generated by `postgres: walsender <NAME-REDACTED> <DNS-REDACTED>(31712)'.
> > >> Program terminated with signal 11, Segmentation fault.
> > >> #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> > >> 3034    reorderbuffer.c: No such file or directory.
> > >> ...
> > >> (gdb) #0  ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78, relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938) at reorderbuffer.c:3034
> > >> #1  ReorderBufferCommit (rb=0x3086af0, xid=xid@entry=1358809, commit_lsn=9430473346032, end_lsn=<optimized out>, commit_time=commit_time@entry=628712466364268, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1584
>
> This indicates that a toast record was present for that relation, despite:

Can you explain what it is you saw that indicates that a toast record was present for the relation? I may be misreading the code, but there's nothing obvious that says that if we reach there, then a toast datum exists anywhere for this relation. We only know that txn->toast_hash is set, but that could be because the transaction touched a toast record in some other table. Right?

> > \d+ rel_having_issue
> >                                           Table "public.rel_having_issue"
> >  Column |           Type           | Collation | Nullable |                   Default                    | Storage  | Stats target | Description
> > --------+--------------------------+-----------+----------+----------------------------------------------+----------+--------------+-------------
> >  id     | integer                  |           | not null | nextval('rel_having_issue_id_seq'::regclass) | plain    |              |
> >  field1 | character varying(255)   |           |          |                                              | extended |              |
> >  field2 | integer                  |           |          |                                              | plain    |              |
> >  field3 | timestamp with time zone |           |          |                                              | plain    |              |
> > Indexes:
> >     "rel_having_issue_pkey" PRIMARY KEY, btree (id)

--
Álvaro Herrera                            39°49'30"S 73°17'W
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right?
<crab> i.e., "deadly poison"
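[Note: this reading matches the shape of the function as best it can be reconstructed from memory: the only early exit before the crashing dereference is on the transaction-wide hash, so reaching line 3033 only proves that some toast chunk was queued somewhere in the transaction, not that it belongs to this relation. A condensed, paraphrased sketch (not the verbatim source):

    /* inside ReorderBufferToastReplace() */
    if (txn->toast_hash == NULL)    /* no toast chunks queued in this txn */
        return;

    /*
     * The hash is kept per transaction, not per relation, so nothing here
     * ties the queued chunks to "relation" before its toast table is opened.
     */
    toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    toast_desc = RelationGetDescr(toast_rel);   /* NULL dereference if reltoastrelid is 0 */

The follow-up below points at one way such a stale hash can survive: a speculative insertion that is aborted without its toast chunks being cleaned up.]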
On Sat, Jun 5, 2021 at 5:41 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > This indicates that a toast record was present for that relation, despite:
>
> Can you explain what it is you saw that indicates that a toast record was present for the relation? I may be misreading the code, but there's nothing obvious that says that if we reach there, then a toast datum exists anywhere for this relation. We only know that txn->toast_hash is set, but that could be because the transaction touched a toast record in some other table. Right?

Is this problem related to the thread [1], where due to a speculative-insert abort the toast hash was not deleted, and after that, if the next record is for some other relation that does not have a toast table, you will see this error? There are a few other problems if the toast hash is not cleaned up after a spec abort. I have submitted patches with two approaches in that thread.

[1] https://www.postgresql.org/message-id/CAFiTN-szfpMXF2H%2Bmk3m_9AB610v103NTv7Z1E8uDBr9iQg1gg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com