Re: BUG #16125: Crash of PostgreSQL's wal sender during logicalreplication - Mailing list pgsql-bugs
From | Tomas Vondra |
---|---|
Subject | Re: BUG #16125: Crash of PostgreSQL's wal sender during logicalreplication |
Date | |
Msg-id | 20191119221732.pcqtxljac334zgb4@development Whole thread Raw |
In response to | BUG #16125: Crash of PostgreSQL's wal sender during logical replication (PG Bug reporting form <noreply@postgresql.org>) |
Responses |
Re: BUG #16125: Crash of PostgreSQL's wal sender during logicalreplication
|
List | pgsql-bugs |
On Wed, Nov 20, 2019 at 12:46:44AM +0300, Andrey Salnikov wrote: >Hi, > >> Can you create a core dump (see [1]), and print 'change' and 'txn' in >> frame #2? I wonder if some the other fields are bogus too (but it can't >> be entirely true ...), and if the transaction got serialized. >> >> You may need to install debug symbols first, BTW. > >I hope that I've understood how to extract correctly those data > >(gdb) frame 2 >#2 ReorderBufferCommit (rb=0x555713b7b788, xid=xid@entry=1667601527, commit_lsn=commit_lsn@entry=41707737345720, end_lsn=end_lsn@entry=41707737345792,commit_time=commit_time@entry=627188499326201, > origin_id=origin_id@entry=0, origin_lsn=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1585 >1585 ./build/../src/backend/replication/logical/reorderbuffer.c: No such file or directory. >(gdb) print change >$1 = (ReorderBufferChange *) 0x555713b9a108 >(gdb) print *change >$2 = {lsn = 41707737330608, action = REORDER_BUFFER_CHANGE_DELETE, origin_id = 0, data = {tp = {relnode = {spcNode = 1663,dbNode = 13018, relNode = 88964815}, clear_toast_afterwards = 1 '\001', oldtuple = 0x555713cd0d88, > newtuple = 0x0}, msg = {prefix = 0x32da0000067f <error: Cannot access memory at address 0x32da0000067f>, message_size= 4383932111, message = 0x555713cd0d88 "\330-\315\023WU"}, snapshot = 0x32da0000067f, > command_id = 1663, tuplecid = {node = {spcNode = 1663, dbNode = 13018, relNode = 88964815}, tid = {ip_blkid = {bi_hi= 1, bi_lo = 0}, ip_posid = 3464}, cmin = 21847, cmax = 0, combocid = 0}}, node = { > prev = 0x555713b98178, next = 0x555713b98178}} >(gdb) print txn >$3 = (ReorderBufferTXN *) 0x555713b982e8 >(gdb) print *txn >$6 = {xid = 1667601527, has_catalog_changes = 0 '\000', is_known_as_subxact = 0 '\000', toplevel_xid = 0, first_lsn = 41707730043960,final_lsn = 41707737345720, end_lsn = 41707737345792, restart_decoding_lsn = 0, > origin_id = 0, origin_lsn = 0, commit_time = 627188499326201, base_snapshot = 0x555713b90408, base_snapshot_lsn = 41707730047464,base_snapshot_node = {prev = 0x555713b7b7a0, next = 0x555713b7b7a0}, nentries = 0, > nentries_mem = 0, serialized = 0 '\000', changes = {head = {prev = 0x555713b98368, next = 0x555713b98368}}, tuplecids= {head = {prev = 0x555713b98378, next = 0x555713b98378}}, ntuplecids = 0, tuplecid_hash = 0x0, > toast_hash = 0x555714ad0fb8, subtxns = {head = {prev = 0x555713b985b0, next = 0x555713b984b8}}, nsubtxns = 4, ninvalidations= 0, invalidations = 0x0, node = {prev = 0x555713b7b790, next = 0x555713b7b790}} > Yep. thanks for this data. This seems to confirm Andres' suspicion that the change is actually a delete. It's not clear to me why we're doing this toast thing for DELETEs unconditionally. For the record, the change LSN is 25EE/D6DE33B0 and the transactions is between 25EE/D66F0438 and 25EE/D6DE6F00. > >> Do you see it on different PostgreSQL instances? Can you share the >> schema definition? > >Unfortunately I can't do it completely, but there are > List of installed extensions > Name | Version | Schema | Description >-----------+---------+------------+-------------------------------- > pgcrypto | 1.3 | public | cryptographic functions > pglogical | 2.2.1 | pglogical | PostgreSQL Logical Replication > plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language > >And each table in publication has next structure: >1 jsonb field, and few smallint, integer, bigint fields. No one table has any triggers or special constraints, only primarykey and one or two btree indexes. > I think it'd be interesting to actually see the table structure - column types, primary key / replica identity. > >> Well, if you can create a reproducer, that'd be the best option, because >> then we can investigate locally instead of the ping-ping here. > >I will try to do it. > OK. > >> I'd also like to see the contents of the WAL, particularly for the XID >> triggering this issue. Please run pg_waldump and see how much data is >> there for XID 1667601527. It does commit at 25EE/D6DE6EB8, not sure >> where it starts. It may have subtransactions, so don't do just grep. > >Hm, it looks like I do not have wal with start of transaction, but I've found next records using txid and subxacts. I hopeit will be useful. > >2514:rmgr: Transaction len (rec/tot): 38/ 38, tx: 0, lsn: 25EE/D66F0438, prev 25EE/D66F0410, desc: ASSIGNMENTxtop 1667601527: subxacts: 1667601528 >2518:rmgr: Heap len (rec/tot): 2063/ 2063, tx: 1667601530, lsn: 25EE/D66F11E8, prev 25EE/D66F0CB0, desc: INSERToff 4, blkref #0: rel 1663/13018/88964795 blk 3125 >2520:rmgr: Btree len (rec/tot): 64/ 64, tx: 1667601530, lsn: 25EE/D66F1A38, prev 25EE/D66F19F8, desc: INSERT_LEAFoff 256, blkref #0: rel 1663/13018/88964797 blk 15840 >2522:rmgr: Heap len (rec/tot): 1341/ 9385, tx: 1667601530, lsn: 25EE/D66F2118, prev 25EE/D66F1A78, desc: INSERToff 7, blkref #0: rel 1663/13018/88964795 blk 3139 FPW >2524:rmgr: Btree len (rec/tot): 64/ 64, tx: 1667601530, lsn: 25EE/D66F54C8, prev 25EE/D66F45E0, desc: INSERT_LEAFoff 257, blkref #0: rel 1663/13018/88964797 blk 15840 >2525:rmgr: Heap len (rec/tot): 1671/ 1671, tx: 1667601530, lsn: 25EE/D66F5508, prev 25EE/D66F54C8, desc: INSERToff 2, blkref #0: rel 1663/13018/88964792 blk 18475 >2527:rmgr: Btree len (rec/tot): 64/ 64, tx: 1667601530, lsn: 25EE/D66F6738, prev 25EE/D66F5B90, desc: INSERT_LEAFoff 140, blkref #0: rel 1663/13018/88964798 blk 2058 >2530:rmgr: Btree len (rec/tot): 96/ 96, tx: 1667601530, lsn: 25EE/D66F7E40, prev 25EE/D66F7E10, desc: INSERT_LEAFoff 28, blkref #0: rel 1663/13018/88964799 blk 5076 >2531:rmgr: Btree len (rec/tot): 80/ 80, tx: 1667601530, lsn: 25EE/D66F7EA0, prev 25EE/D66F7E40, desc: INSERT_LEAFoff 88, blkref #0: rel 1663/13018/88964800 blk 3412 >2532:rmgr: Heap len (rec/tot): 54/ 54, tx: 1667601530, lsn: 25EE/D66F7EF0, prev 25EE/D66F7EA0, desc: DELETEoff 2 , blkref #0: rel 1663/13018/88964792 blk 18475 >2533:rmgr: Heap len (rec/tot): 54/ 54, tx: 1667601530, lsn: 25EE/D66F7F28, prev 25EE/D66F7EF0, desc: DELETEoff 4 , blkref #0: rel 1663/13018/88964795 blk 3125 >2534:rmgr: Heap len (rec/tot): 54/ 54, tx: 1667601530, lsn: 25EE/D66F7F60, prev 25EE/D66F7F28, desc: DELETEoff 7 , blkref #0: rel 1663/13018/88964795 blk 3139 >56346:rmgr: Heap len (rec/tot): 378/ 8198, tx: 1667601537, lsn: 25EE/D6DE33B0, prev 25EE/D6DE1628, desc: DELETEoff 6 KEYS_UPDATED , blkref #0: rel 1663/13018/88964815 blk 1938695 FPW So, this DELETE change triggers the issue. If I understand it correctly, that means the transaction locked the tuple and then deleted it. But I don't see any record about the lock. But this output seems a bit incomplete - there's an assignment record for subxact 1667601528, without any records. And there's no assignment for 1667601530. Can you simply post the whole WAL output for the transaction LSN range, i.e. something like pg_waldump -s 25EE/D66F0438 -e 25EE/D6DE6F00 ... Any idea what this transaction actually does, i.e. what commands it executes? Presumably it does some plpgsql with subtransactions? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-bugs by date: