Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAA4eK1JuwZF7FHM+EPjWdVh=Xaz-7Eo-G0TByMjWeUU32Xue3w@mail.gmail.com
Whole thread Raw
In response to Re: Single transaction in the tablesync worker?  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Jan 12, 2021 at 6:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> > > >
> > >
> > > Thanks for reproducing as I was worried about exactly this case. I
> > > have one question related to logs:
> > >
> > > ##
> > > ## ALTER SUBSCRIPTION to REFRESH the publication
> > >
> > > ## This blocks on some latch until the tablesync worker dies, then it continues
> > > ##
> > >
> > > Did you check which exact latch or lock blocks this?
> > >
> >
> > I have checked this myself and the command is waiting on the drop of
> > origin till the tablesync worker is finished because replorigin_drop()
> > requires state->acquired_by to be 0 which will only be true once the
> > tablesync worker exits. I think this is the reason you might have
> > noticed that the command can't be finished until the tablesync worker
> > died. So this can't be an interlock between ALTER SUBSCRIPTION ..
> > REFRESH command and tablesync worker and to that end it seems you have
> > below Fixme's in the patch:
>
> I have also seen this same blocking reason before in the replorigin_drop().
> However, back when I first tested/reproduced the refresh issue
> [test-refresh] that
> AlterSubscription_refresh was still *original* unchanged code, so at
> that time it did not
> have any replorigin_drop() in at all. In any case in the latest code
> [v14] the AlterSubscription is
> immediately stopping the workers so this question may be moot.
>
> >
> > + * FIXME - Usually this cleanup would be OK, but will not
> > + * always be OK because the logicalrep_worker_stop_at_commit
> > + * only "flags" the worker to be stopped in the near future
> > + * but meanwhile it may still be running. In this case there
> > + * could be a race between the tablesync worker and this code
> > + * to see who will succeed with the tablesync drop (and the
> > + * loser will ERROR).
> > + *
> > + * FIXME - Also, checking the state is also not guaranteed
> > + * correct because state might be TCOPYDONE when we checked
> > + * but has since progressed to SYNDONE
> > + */
> > +
> > + if (state == SUBREL_STATE_TCOPYDONE)
> > + {
> >
> > I feel this was okay for an earlier code but now we need to stop the
> > tablesync workers before trying to drop the slot as we do in
> > DropSubscription. Now, if we do that then that would fix the race
> > conditions mentioned in Fixme but still, there are few more things I
> > am worried about: (a) What if the launcher again starts the tablesync
> > worker? One idea could be to acquire AccessExclusiveLock on
> > SubscriptionRelationId as we do in DropSubscription which is not a
> > very good idea but I can't think of any other good way. (b) the patch
> > is just checking SUBREL_STATE_TCOPYDONE before dropping the
> > replication slot but the slot could be created even before that (in
> > SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the
> > slot and if we are not able to drop then we can simply continue
> > assuming it didn't exist.
>
> The code was modified in the latest patch [v14] something like as suggested.
>
> The workers for removed tables are now immediately stopped (like
> DropSubscription does). Although I did include the AccessExclusiveLock
> as (a) suggested, AFAIK this was actually ineffective at preventing
> the workers relaunching.
>

The reason why it was ineffective is that you are locking
SubscriptionRelationId which is to protect relaunch of apply workers
not tablesync workers. But in current form even acquiring
SubscriptionRelRelationId lock won't serve the purpose because
process_syncing_tables_for_apply() doesn't always acquire it before
relaunching the tablesync workers. However, if we acquire
SubscriptionRelRelationId in process_syncing_tables_for_apply() then
it would prevent relaunch of workers but not sure if that is a good
idea. Can you think of some other way?

> Instead, I am using
> logicalrep_worker_stop_at_commit to do this - testing shows it as
> working ok. Please see the code and latest test logs [v14] for
> details.
>

There is still a window where it can relaunch. Basically, after you
stop the worker in AlterSubscription_refresh and till the commit
happens apply worker can relaunch the tablesync workers. I don't see
code-wise how we can protect that. And if the tablesync workers are
restarted after we stopped them, the purpose won't be achieved because
it can recreate or try to reuse the slot which we have dropped.

The other issue with the current code could be that after we drop the
slot and origin what if the transaction (in which we are doing Alter
Subscription) is rolledback? Basically, the workers will be relaunched
and it would assume that slot should be there but the slot won't be
present. I have thought of dropping the slot at commit time after we
stop the workers but again not sure if that is a good idea because at
that point we don't want to establish the connection with the
publisher.

I think this needs some more thought.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Dmitry Dolgov
Date:
Subject: Re: [PATCH] Identify LWLocks in tracepoints
Next
From: Dilip Kumar
Date:
Subject: Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION