Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id CAFiTN-uGxgo5258hZy2QJoz=s7_Cs7v9=b8Z2GgFV7qmQUOwxw@mail.gmail.com
Whole thread Raw
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Yeah, this is right, and here is some initial analysis. It seems to be
> > > failing in below code:
> > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> > >
> > > This list can have elements only in 'streaming' mode (need to enable
> > > 'streaming' with Create Subscription command) whereas none of the
> > > tests in 010_truncate.pl is using 'streaming', so this list should be
> > > empty (NULL). The two different assertion failures shown in BF reports
> > > in list_free code are as below:
> > > Assert(list->length > 0);
> > > Assert(list->length <= list->max_length);
> > >
> > > It seems to me that this list is not initialized properly when it is
> > > not used or maybe that is true in some special circumstances because
> > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > > is impacting this in some way.
> >
> >
> > Even I have analyzed this but did not find any reason why the
> > streamed_txns list should be anything other than NULL.  The only thing
> > is we are initializing the entry->streamed_txns to NULL and the list
> > free is checking  "if (list == NIL)" then return. However IMHO, that
> > should not be an issue becase NIL is defined as (List*) NULL.
> >
>
> Yeah, that is not the issue but it is better to initialize it with NIL
> for the sake of consistency. The basic issue here was we were trying
> to open/lock the relation(s) before initializing this list. Now, when
> we process the invalidations during open relation, we try to access
> this list in rel_sync_cache_relation_cb and that leads to assertion
> failure. I have reproduced the exact scenario of 010_truncate.pl via
> debugger. Basically, the backend on publisher has sent the
> invalidation after truncating the relation 'tab1' and while processing
> the truncate message if WALSender receives that message exactly after
> creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
> reproduced.

Yeah, this is an issue and I am also able to reproduce this manually
using gdb.  Basically, I have inserted some data in publication table
and after that, I stopped in get_rel_sync_entry after creating the
reentry and before calling GetRelationPublications.  Meanwhile, I have
truncated this table and then it hit the same issue you pointed here.

> The attached patch will fix the issue. What do you think?

The patch looks good to me and fixing the reported issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Division in dynahash.c due to HASH_FFACTOR
Next
From: Anastasia Lubennikova
Date:
Subject: Re: [PATCH] Automatic HASH and LIST partition creation