Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date:
Msg-id: CAA4eK1+Kn2yxHo5ZZ+A8jkkmGu8MP+Ab0ab7jtiz3gjGxWxGOw@mail.gmail.com
In response to: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still in progress.  See if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #          got: '2|0'
> > > > #     expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple times but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently, but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  At first, I thought the extra send_feedback we
> > have in apply_handle_stream_commit might have caused this, but I guess
> > that can't happen because we need the commit-time location for that,
> > and we store it at the end of apply_handle_stream_commit after
> > applying all the messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this, or some variant of this test,
> > consistently to find the root cause.
>
> And I think it appeared for the first time for me, so some changes in
> the last few versions might have exposed it.  I have noticed that
> almost 50% of the time I am able to reproduce it after a clean build,
> so I can trace back the version in which it started appearing; that
> way it will be easy to narrow down.
>
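
To make the ordering concern quoted above concrete, below is a small
self-contained C sketch.  It is purely illustrative: the Subscriber
struct, its field names, and the row counts are invented and are not
PostgreSQL code.  The point is only that a check which waits on the
*reported* replay position and then reads the data will see stale
contents if that position is reported before the changes are actually
applied.

/*
 * Illustrative only (hypothetical names, not PostgreSQL code): why
 * reporting the replay position before applying the changes lets a
 * "wait for catchup, then read" check observe stale data.
 */
#include <stdio.h>

typedef struct
{
    long reported_lsn;   /* position the subscriber claims to have replayed */
    long applied_lsn;    /* position it has actually applied */
    int  visible_rows;   /* what a query against the subscriber would see */
} Subscriber;

/* The test's pattern: once the reported position reaches the commit LSN,
 * read the table and compare against the expected contents. */
static int rows_seen_after_catchup(const Subscriber *sub, long commit_lsn)
{
    return (sub->reported_lsn >= commit_lsn) ? sub->visible_rows : -1;
}

int main(void)
{
    const long commit_lsn = 100;

    /* Correct order: apply the streamed changes, then send feedback. */
    Subscriber ok = { 0, 0, 2 };
    ok.applied_lsn = commit_lsn;
    ok.visible_rows = 1000;              /* changes are visible now */
    ok.reported_lsn = commit_lsn;        /* feedback sent afterwards */
    printf("apply, then report: %d rows\n",
           rows_seen_after_catchup(&ok, commit_lsn));    /* 1000 */

    /* Suspected race: feedback is sent before the changes are applied,
     * so the check passes while the table still has the old contents. */
    Subscriber bad = { 0, 0, 2 };
    bad.reported_lsn = commit_lsn;       /* feedback sent too early */
    printf("report, then apply: %d rows\n",
           rows_seen_after_catchup(&bad, commit_lsn));   /* 2 */

    return 0;
}

If something like this is happening, it would match the observed
symptom where the test read '2|0' instead of '1000|500'.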

One more comment
ReorderBufferLargestTopTXN
{
..
dlist_foreach(iter, &rb->toplevel_by_lsn)
  {
  ReorderBufferTXN *txn;
+ Size size = 0;
+ Size largest_size = 0;

  txn = dlist_container(ReorderBufferTXN, node, iter.cur);

- /* if the current transaction is larger, remember it */
- if ((!largest) || (txn->size > largest->size))
+ /*
+ * If this transaction have some incomplete changes then only consider
+ * the size upto last complete lsn.
+ */
+ if (rbtxn_has_incomplete_tuple(txn))
+ size = txn->complete_size;
+ else
+ size = txn->total_size;
+
+ /* If the current transaction is larger then remember it. */
+ if ((largest != NULL || size > largest_size) && size > 0)

Here, largest_size is a local variable declared inside the loop, so it
is re-initialized to 0 on each iteration, which will lead to picking
each next txn as the largest.  This seems wrong to me.
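
To illustrate, here is a minimal standalone C sketch (the Txn struct,
function names, and sizes are invented, and it is not the actual
ReorderBuffer code; the null check is written as largest == NULL
here).  Because largest_size is re-initialized to 0 on every
iteration, the comparison degenerates to size > 0 and the last
non-empty txn wins instead of the largest one; hoisting the variable
out of the loop restores the intended behavior.

/* Standalone illustration only (hypothetical names, not ReorderBuffer code). */
#include <stdio.h>
#include <stddef.h>

typedef struct
{
    const char *name;
    size_t      size;
} Txn;

/* Buggy: largest_size is reset to 0 on every iteration, so any txn with a
 * non-zero size overwrites the previous pick. */
static const Txn *pick_largest_buggy(const Txn *txns, int n)
{
    const Txn *largest = NULL;

    for (int i = 0; i < n; i++)
    {
        size_t largest_size = 0;    /* re-initialized each iteration */

        if ((largest == NULL || txns[i].size > largest_size) && txns[i].size > 0)
        {
            largest = &txns[i];
            largest_size = txns[i].size;   /* lost when the iteration ends */
        }
    }
    return largest;
}

/* Fixed: track the running maximum across iterations. */
static const Txn *pick_largest_fixed(const Txn *txns, int n)
{
    const Txn *largest = NULL;
    size_t      largest_size = 0;

    for (int i = 0; i < n; i++)
    {
        if ((largest == NULL || txns[i].size > largest_size) && txns[i].size > 0)
        {
            largest = &txns[i];
            largest_size = txns[i].size;
        }
    }
    return largest;
}

int main(void)
{
    Txn txns[] = {{"txn1", 900}, {"txn2", 100}, {"txn3", 500}};

    printf("buggy picks: %s\n", pick_largest_buggy(txns, 3)->name);   /* txn3 */
    printf("fixed picks: %s\n", pick_largest_fixed(txns, 3)->name);   /* txn1 */
    return 0;
}

With the sizes above, the buggy variant returns txn3 (500) while the
fixed variant returns txn1 (900).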

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


