Home > mailing lists

Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Single transaction in the tablesync worker?
Date	February 1, 2021 08:40:13
Msg-id	CAA4eK1JdWv84nMyCpTboBURjj70y3BfO1xdy8SYPRqNxtH7TEA@mail.gmail.com Whole thread Raw
In response to	Re: Single transaction in the tablesync worker? (Peter Smith <smithpb2250@gmail.com>)
List	pgsql-hackers

Tree view

On Mon, Feb 1, 2021 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>  > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > > slots. The DropSubscription might be running at exactly the same time
> > > > > the apply worker has just dropped the very same tablesync slot.
> > > > >
> > > >
> > > > We stopped the workers before getting a list of NotReady relations and
> > > > then we try to drop the corresponding slots. So, how such a race
> > > > condition can happen? Note, because we have a lock on pg_subscrition,
> > > > there is no chance that the workers can restart till the transaction
> > > > end.
> > >
> > > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > > into a loop waiting for the worker process to die. So even if the
> > > tablesync worker does simultaneously drop it's own slot, I think it
> > > will certainly at least be in SYNCDONE state before DropSubscription
> > > does anything else with that worker.
> > >
> >
> > How is that ensured? We don't have anything like HOLD_INTERRUPTS
> > between the time dropped the slot and updated rel state as SYNCDONE.
> > So, isn't it possible that after we dropped the slot and before we
> > update the state, the SIGTERM signal arrives and led to worker exit?
> >
>
> The worker has the SIGTERM handler of "die". IIUC the "die" function
> doesn't normally do anything except set some flags to say please die
> at the next convenient opportunity. My understanding is that the
> worker process will not actually exit until it next executes
> CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
> and *really* die. So even if the SIGTERM signal arrives immediately
> after the slot is dropped, the tablesync will still become SYNCDONE.
> Is this wrong understanding?
>
> But your scenario could still be possible if "die" exited immediately
> (e.g. only in single user mode?).
>

I think it is possible without that as well. There are many calls
in-between those two operations which can internally call
CHECK_FOR_INTERRUPTS. One of the flows where such a possibility exists
is
UpdateSubscriptionRelState->SearchSysCacheCopy2->SearchSysCacheCopy->SearchSysCache->SearchCatCache->SearchCatCacheInternal->SearchCatCacheMiss->systable_getnext.
This can internally call heapgetpage where we have
CHECK_FOR_INTERRUPTS. I think even if today there was no CFI call we
can't take a guarantee for the future as the calls used are quite
common. So, probably we need missing_ok flag in DropSubscription.

One more point in the tablesync code you are calling
ReplicationSlotDropAtPubNode with missing_ok as false. What if we get
an error after that and before we have marked the state as SYNCDONE? I
guess it will always error from ReplicationSlotDropAtPubNode after
that because we had already dropped the slot.

-- 
With Regards,
Amit Kapila.

pgsql-hackers by date:

From: Dmitry Dolgov
Date: 01 February 2021, 08:39:31
Subject: Re: [HACKERS] [PATCH] Generic type subscripting

From: Kyotaro Horiguchi
Date: 01 February 2021, 08:45:42
Subject: Re: Printing backtrace of postgres processes

Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

Previous

Next