Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop - Mailing list pgsql-bugs

From Dilip Kumar
Subject Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date
Msg-id CAFiTN-viZixPtZx7X+PLuvZ0rf9djm18OhR74+ZQVx69oJWHew@mail.gmail.com
In response to Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
List pgsql-bugs
On Fri, Nov 20, 2020 at 10:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 18, 2020 at 2:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 18, 2020 at 11:19 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Wed, Nov 18, 2020 at 3:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > To cut a long story short, a tablesync worker CAN in fact end up
> > > > > processing (e.g. apply_dispatch) streaming messages.
> > > > > So the tablesync worker CAN get into the apply_handle_stream_commit.
> > > > > And this scenario, albeit rare, will crash.
> > > > >
> > > >
> > > > Thank you for reproducing this issue. Dilip, Peter, is anyone of you
> > > > interested in writing a fix for this?
> > >
> > > Hi Amit.
> > >
> > > FYI - Sorry, I am away/offline for the next 5 days.
> > >
> > > However, if this bug still remains unfixed after next Tuesday then I
> > > can look at it then.
> > >
> >
> > Fair enough. Let's see if Dilip or I can get a chance to look into
> > this before that.
> >
> > > ---
> > >
> > > IIUC there are 2 options:
> > > a) Disallow streaming for the tablesync worker.
> > > b) Make streaming work for the tablesync worker.
> > >
> > > I prefer option (a) not only because of the KISS principle, but also
> > > because this is how the tablesync worker was previously thought to
> > > behave anyway. I expect this fix may be like the code that Dilip
> > > already posted [1]
> > > [1] https://www.postgresql.org/message-id/CAFiTN-uUgKpfdbwSGnn3db3mMQAeviOhQvGWE_pC9icZF7VDKg%40mail.gmail.com
> > >
> > > OTOH, option (b) fix may or may not be possible (I don't know), but I
> > > have doubts that it is worthwhile to consider making a special fix for
> > > a scenario which so far has never been reproduced outside of the
> > > debugger.
> > >
> >
> > I would prefer option (b) unless the fix is not possible due to design
> > constraints. I don't think it is a good idea to make tablesync workers
> > behave differently unless we have a reason for doing so.
> >
>
> Okay, I will analyze this and try to submit my finding today.

I have done my analysis. Basically, the table sync worker applies all
the changes (for multiple upstream transactions) under a single
transaction on the downstream.  For normal changes, we can just avoid
committing in apply_handle_commit and everything is fine.  For
streaming changes, however, we also have a transaction at the stream
level, which we need in order to manage the BufFiles that store the
streamed changes.  So if we want to support streaming transactions in
the table sync worker, we need to avoid committing the transaction in
apply_handle_stream_commit as well as in apply_handle_stream_stop.
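
To make the idea concrete, here is a rough toy model of the commit
behaviour described above. It is only a sketch, not the actual worker
code: the Worker struct, the is_tablesync flag, and all function names
are hypothetical stand-ins, and it ignores that streamed changes are
first written to files rather than applied directly. It only shows
where the table sync worker would have to skip the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a replication worker; NOT PostgreSQL's real data structures. */
typedef struct Worker
{
    bool is_tablesync;  /* hypothetical flag: true for a table sync worker */
    bool in_xact;       /* is a downstream transaction currently open? */
    int  commits;       /* downstream commits performed so far */
} Worker;

/* Open a downstream transaction unless one is already open. */
void ensure_xact(Worker *w)
{
    w->in_xact = true;
}

/* Commit the open downstream transaction. */
void commit_xact(Worker *w)
{
    assert(w->in_xact);
    w->in_xact = false;
    w->commits++;
}

/* Apply one change; every change needs an open transaction. */
void handle_change(Worker *w)
{
    ensure_xact(w);
}

/* End of a normal upstream transaction. */
void handle_commit(Worker *w)
{
    if (!w->is_tablesync && w->in_xact)
        commit_xact(w);
}

/* Start of a streamed chunk: a transaction is needed to manage the files. */
void handle_stream_start(Worker *w)
{
    ensure_xact(w);
}

/* End of a streamed chunk: the table sync worker must not commit here. */
void handle_stream_stop(Worker *w)
{
    if (!w->is_tablesync)
        commit_xact(w);
}

/* Commit of a streamed upstream transaction: replay it, then maybe commit. */
void handle_stream_commit(Worker *w)
{
    ensure_xact(w);
    if (!w->is_tablesync)
        commit_xact(w);
}

/* The table sync worker commits exactly once, when the sync is done. */
void finish_sync(Worker *w)
{
    commit_xact(w);
}
```

In this model, a table sync worker that goes through a normal commit, a
stream start/stop pair, and a stream commit still has its single
transaction open and has committed nothing until finish_sync is called,
whereas a regular apply worker commits at each of those three points.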

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


