Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAA4eK1K4J1jZYLR_k_Cq-dMS7o7ke_yy3VP4-UK05yfOX3o-+w@mail.gmail.com
Whole thread Raw
In response to Re: Single transaction in the tablesync worker?  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Single transaction in the tablesync worker?
List pgsql-hackers
On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > Few other comments:
> > > > ==================
> > >
> > > Thanks for your feedback.
> > >
> > > > 1.
> > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > because I don't think
> > > > + * such workers are even iterated by this loop, and nobody else is
> > > > removing them.
> > > > + */
> > > > + if (slotname)
> > > > + {
> > > >
> > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > should restart, finish their work, and drop the slots. So not sure
> > > > what exactly this FIXME refers to?
> > >
> > > Yes, normally if the tablesync can complete it should behave like that.
> > > But I think there are other scenarios where it may be unable to
> > > clean-up after itself. For example:
> > >
> > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > handled by tablesync can give a PK violation which also will crash
> > > again and again for each re-launched/replacement tablesync worker.
> > > This can be reproduced in the debugger. If the DropSubscription
> > > doesn't clean-up the tablesync's slot then nobody will.
> > >
> > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > > ensure that the launcher doesn't restart new worker during dropping
> > > the subscription".
> > >
> >
> > Yeah, I have also read that comment but do you know how it is
> > preventing relaunch? How does the subscription lock help?
>
> Hmmm. I did see there is a matching lock in get_subscription_list of
> launcher.c, which may be what that code comment was referring to. But
> I also am currently unsure how this lock prevents anybody (e.g.
> process_syncing_tables_for_apply) from executing another
> logicalrep_worker_launch.
>

process_syncing_tables_for_apply will be called by the apply worker
and we are stopping the apply worker. So, after that launcher won't
start a new apply worker because of get_subscription_list() and if the
apply worker is not started then it won't be able to start tablesync
worker. So, we need the handling of crashed tablesync workers here
such that we need to drop any new sync slots.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Single transaction in the tablesync worker?
Next
From: Masahiko Sawada
Date:
Subject: Re: Deadlock between backend and recovery may not be detected