Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Single transaction in the tablesync worker? |
Date | |
Msg-id | CAA4eK1JdWv84nMyCpTboBURjj70y3BfO1xdy8SYPRqNxtH7TEA@mail.gmail.com Whole thread Raw |
In response to | Re: Single transaction in the tablesync worker? (Peter Smith <smithpb2250@gmail.com>) |
List | pgsql-hackers |
On Mon, Feb 1, 2021 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > AFAIK there is always a potential race with DropSubscription dropping > > > > > slots. The DropSubscription might be running at exactly the same time > > > > > the apply worker has just dropped the very same tablesync slot. > > > > > > > > > > > > > We stopped the workers before getting a list of NotReady relations and > > > > then we try to drop the corresponding slots. So, how such a race > > > > condition can happen? Note, because we have a lock on pg_subscrition, > > > > there is no chance that the workers can restart till the transaction > > > > end. > > > > > > OK. I think I was forgetting the logicalrep_worker_stop would also go > > > into a loop waiting for the worker process to die. So even if the > > > tablesync worker does simultaneously drop it's own slot, I think it > > > will certainly at least be in SYNCDONE state before DropSubscription > > > does anything else with that worker. > > > > > > > How is that ensured? We don't have anything like HOLD_INTERRUPTS > > between the time dropped the slot and updated rel state as SYNCDONE. > > So, isn't it possible that after we dropped the slot and before we > > update the state, the SIGTERM signal arrives and led to worker exit? > > > > The worker has the SIGTERM handler of "die". IIUC the "die" function > doesn't normally do anything except set some flags to say please die > at the next convenient opportunity. My understanding is that the > worker process will not actually exit until it next executes > CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag > and *really* die. So even if the SIGTERM signal arrives immediately > after the slot is dropped, the tablesync will still become SYNCDONE. > Is this wrong understanding? > > But your scenario could still be possible if "die" exited immediately > (e.g. only in single user mode?). > I think it is possible without that as well. There are many calls in-between those two operations which can internally call CHECK_FOR_INTERRUPTS. One of the flows where such a possibility exists is UpdateSubscriptionRelState->SearchSysCacheCopy2->SearchSysCacheCopy->SearchSysCache->SearchCatCache->SearchCatCacheInternal->SearchCatCacheMiss->systable_getnext. This can internally call heapgetpage where we have CHECK_FOR_INTERRUPTS. I think even if today there was no CFI call we can't take a guarantee for the future as the calls used are quite common. So, probably we need missing_ok flag in DropSubscription. One more point in the tablesync code you are calling ReplicationSlotDropAtPubNode with missing_ok as false. What if we get an error after that and before we have marked the state as SYNCDONE? I guess it will always error from ReplicationSlotDropAtPubNode after that because we had already dropped the slot. -- With Regards, Amit Kapila.
pgsql-hackers by date: