Re: Reorderbuffer crash during recovery - Mailing list pgsql-bugs
From | vignesh C |
---|---|
Subject | Re: Reorderbuffer crash during recovery |
Date | |
Msg-id | CALDaNm2eKs1t4k1GBnr2tjhijgSLcABksHvLV=O93ihTOO-z+w@mail.gmail.com Whole thread Raw |
In response to | Re: Reorderbuffer crash during recovery (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-bugs |
On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote: > > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra > > > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > > > > > I'm a bit confused - does this happen only with the logical_work_mem > > > > > patches, or with clean master too? > > > > > > > > > > > > > This occurs with the clean master. This is a base code problem > > > > revealed while doing stress testing of logical_work_mem patches. > > > > > > As far as I can tell there are no repro steps included? Any chance to > > > get those? > > > > > I think it will be bit tricky to write a test case, but the callstack > indicates the problem. For ex. for issue-1, see the below part of > callstack, > > #2 0x0000000000ec514d in ExceptionalCondition > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > "FailedAssertion", > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > lineNumber=458) at assert.c:54 > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > off=48) at ../../../../src/include/lib/ilist.h:458 > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > oldestRunningXid=895) at reorderbuffer.c:1910 > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > buf=0x7fff7e7b1e40) at decode.c:332 > > The Assertion failure indicates that the changes list is empty when we > are trying to get 'lsn' of the last change. I think this is possible > when immediately after serializing the transaction changes the server > got restarted (say it crashed or somebody switched off and restarted > it). The reason is that after serializing the changes, the changes > list will be empty and serialized flag for txn will be set. I am not > if there is any reason to set the final_lsn if the changes list is > empty. > > Similarly, if we see the call stack of issue-2, the problem is clear. > > #2 0x0000000000ec4e1d in ExceptionalCondition > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > errorType=0x10ea284 "FailedAssertion", > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:3052 > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:1318 > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2b9d778) at reorderbuffer.c:1257 > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > oldestRunningXid=3835) at reorderbuffer.c:1973 > > I think this also has a symptom similar to the prior issue but for > sub-transactions. The ReorderBufferAbortOld() tries to set the > final_lsn of toplevel transaction, but not for subtransactions, later > in that path ReorderBufferRestoreCleanup expects it to be set even for > subtransaction. Is that Assert in ReorderBufferRestoreCleanup() > required, because immediately after assert, we are anyway setting the > value of final_lsn. > Thanks Amit for more analysis of the issues. > > > > This problem does not occur consistently. When I was reviewing and testing "logical streaming for large in-progress transactions"link [1] I found the crashes. > > > > This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres process(notfrom reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this reorderbuffercrash appears. > > > > To simplify the reorderbuffer crash, I used the following steps: > > 1) Make replication setup with publisher/subscriber for some table > > 2) Prepare a sql file with the below: > > begin; > > 4096 insert statements; > > select pg_sleep(120) > > 3) Execute the above script. > > 4) Attach the postgres process when pg_sleep is in progress. > > 5) call abort() from attached gdb. > > > > Isn't it important to call this abort immediately after the changes > are serialized as explained above? > As we are performing exactly 4096 insert statements on a fresh replication setup it will get serialized immediately after 4096 inserts. The variable max_changes_in_memory is initialized with 4096 (reorderbuffer.c file), txn->nentries_mem gets incremented for every insert statement in ReorderBufferQueueChange function. As we have executed 4096 insert statements, txn->nentries_mem will become 4096 which results in the serializing from ReorderBufferCheckSerializeTXN function. Then attaching gdb and call abort() helped in reproducing the issue consistently. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
pgsql-bugs by date: