Re: Reorderbuffer crash during recovery - Mailing list pgsql-bugs

From vignesh C
Subject Re: Reorderbuffer crash during recovery
Date
Msg-id CALDaNm2eKs1t4k1GBnr2tjhijgSLcABksHvLV=O93ihTOO-z+w@mail.gmail.com
Whole thread Raw
In response to Re: Reorderbuffer crash during recovery  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > > <tomas.vondra@2ndquadrant.com> wrote:
> > > > >
> > > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > > patches, or with clean master too?
> > > > >
> > > >
> > > > This occurs with the clean master.  This is a base code problem
> > > > revealed while doing stress testing of logical_work_mem patches.
> > >
> > > As far as I can tell there are no repro steps included? Any chance to
> > > get those?
> > >
>
> I think it will be bit tricky to write a test case, but the callstack
> indicates the problem.  For ex. for issue-1, see the below part of
> callstack,
>
> #2  0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
>     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
>
> The Assertion failure indicates that the changes list is empty when we
> are trying to get 'lsn' of the last change.  I think this is possible
> when immediately after serializing the transaction changes the server
> got restarted (say it crashed or somebody switched off and restarted
> it).  The reason is that after serializing the changes, the changes
> list will be empty and serialized flag for txn will be set.  I am not
> if there is any reason to set the final_lsn if the changes list is
> empty.
>
> Similarly, if we see the call stack of issue-2, the problem is clear.
>
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>
> I think this also has a symptom similar to the prior issue but for
> sub-transactions.  The ReorderBufferAbortOld() tries to set the
> final_lsn of toplevel transaction, but not for subtransactions, later
> in that path ReorderBufferRestoreCleanup expects it to be set even for
> subtransaction.  Is that Assert in  ReorderBufferRestoreCleanup()
> required, because immediately after assert, we are anyway setting the
> value of final_lsn.
>

Thanks Amit for more analysis of the issues.

> >
> > This problem does not occur consistently. When I was reviewing and testing "logical streaming for large in-progress
transactions"link [1] I found the crashes.
 
> >
> > This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres
process(notfrom reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this
reorderbuffercrash appears.
 
> >
> > To simplify the reorderbuffer crash, I used the following steps:
> > 1) Make replication setup with publisher/subscriber for some table
> > 2) Prepare a sql file with the below:
> > begin;
> > 4096 insert statements;
> > select pg_sleep(120)
> > 3) Execute the above script.
> > 4) Attach the postgres process when pg_sleep is in progress.
> > 5) call abort() from attached gdb.
> >
>
> Isn't it important to call this abort immediately after the changes
> are serialized as explained above?
>

As we are performing exactly 4096 insert statements on a fresh
replication setup it will get serialized immediately after 4096
inserts. The variable max_changes_in_memory is initialized with 4096
(reorderbuffer.c file), txn->nentries_mem gets incremented for every
insert statement in ReorderBufferQueueChange function. As we have
executed 4096 insert statements, txn->nentries_mem will become 4096
which results in the serializing from ReorderBufferCheckSerializeTXN
function. Then attaching gdb and call abort() helped in reproducing
the issue consistently.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



pgsql-bugs by date:

Previous
From: Thomas Munro
Date:
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Next
From: James Coleman
Date:
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash