Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAA4eK1KGUt86A7CfuQW6OeDvAhEbVk8VOBJmcoZjrYBn965kOA@mail.gmail.com
In response to Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Single transaction in the tablesync worker?
List pgsql-hackers
On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> >
>
> Thanks for reproducing as I was worried about exactly this case. I
> have one question related to logs:
>
> ##
> ## ALTER SUBSCRIPTION to REFRESH the publication
>
> ## This blocks on some latch until the tablesync worker dies, then it continues
> ##
>
> Did you check which exact latch or lock blocks this?
>

I have checked this myself, and the command waits on the drop of the
origin until the tablesync worker has finished, because
replorigin_drop() requires state->acquired_by to be 0, which will only
be true once the tablesync worker exits. I think that is why you
noticed that the command can't finish until the tablesync worker has
died. So this is not a proper interlock between the ALTER SUBSCRIPTION
.. REFRESH command and the tablesync worker; the blocking is just a
side effect of the origin drop.
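
For reference, the wait I am talking about comes from the below part
of replorigin_drop() (a paraphrased sketch of origin.c with the origin
lookup elided, so not the exact source):

restart:
    LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);

    /* ... locate the ReplicationState for the origin being dropped ... */

    if (state->acquired_by != 0)
    {
        if (nowait)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_IN_USE),
                     errmsg("could not drop replication origin with OID %d, in use by PID %d",
                            state->roident, state->acquired_by)));

        /*
         * Wait for the acquiring process (here, the tablesync worker)
         * to release the origin, then retry the drop from scratch.
         */
        LWLockRelease(ReplicationOriginLock);
        ConditionVariableSleep(&state->origin_cv,
                               WAIT_EVENT_REPLICATION_ORIGIN_DROP);
        goto restart;
    }
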
Presumably that is also why you have the below FIXMEs in the patch:

+ * FIXME - Usually this cleanup would be OK, but will not
+ * always be OK because the logicalrep_worker_stop_at_commit
+ * only "flags" the worker to be stopped in the near future
+ * but meanwhile it may still be running. In this case there
+ * could be a race between the tablesync worker and this code
+ * to see who will succeed with the tablesync drop (and the
+ * loser will ERROR).
+ *
+ * FIXME - Also, checking the state is also not guaranteed
+ * correct because state might be TCOPYDONE when we checked
+ * but has since progressed to SYNCDONE
+ */
+
+ if (state == SUBREL_STATE_TCOPYDONE)
+ {

I feel this was okay for the earlier code, but now we need to stop the
tablesync workers before trying to drop the slot, as we do in
DropSubscription. If we do that, it would fix the race conditions
mentioned in the FIXMEs, but there are still a few more things I am
worried about: (a) What if the launcher starts the tablesync worker
again? One idea could be to acquire AccessExclusiveLock on
SubscriptionRelationId, as we do in DropSubscription; that is not a
very good idea, but I can't think of any better way. (b) The patch
only checks for SUBREL_STATE_TCOPYDONE before dropping the replication
slot, but the slot could have been created even before that (in the
SUBREL_STATE_DATASYNC state). One idea could be to try to drop the
slot and, if we can't, simply continue on the assumption that it
didn't exist. A rough sketch of both ideas is below.
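
To make (a) and (b) concrete, the shape I have in mind is roughly the
below, borrowing from DropSubscription (just a sketch; subid, wrconn,
syncslotname, cmd, res, etc. are assumed to be set up by the caller,
and the details are not from the patch):

    /* (a) Prevent the launcher from restarting workers for this subscription. */
    LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

    /* Stop the subscription's workers, including tablesync workers. */
    subworkers = logicalrep_workers_find(subid, false);
    foreach(lc, subworkers)
    {
        LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

        logicalrep_worker_stop(w->subid, w->relid);
    }

    /* (b) Try to drop the tablesync slot; if we can't, assume it never existed. */
    initStringInfo(&cmd);
    appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", syncslotname);
    res = walrcv_exec(wrconn, cmd.data, 0, NULL);
    if (res->status != WALRCV_OK_COMMAND)
        ereport(WARNING,
                (errmsg("could not drop tablesync slot \"%s\", continuing anyway",
                        syncslotname)));
    walrcv_clear_result(res);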

One minor comment:
1.
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);
  MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
  MyLogicalRepWorker->relstate_lsn = current_lsn;
-

Spurious line removal.

-- 
With Regards,
Amit Kapila.


