Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Single transaction in the tablesync worker?
Msg-id CAHut+PtbUaZLfLWqic0dUoujBxqxrQXTDE6F8LLPNWEauSfsWQ@mail.gmail.com
In response to Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, Jan 28, 2021 at 9:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > >
> > > > > PSA the v18 patch for the Tablesync Solution1.
> > > >
> > > > 7. Have you tested with the new patch the scenario where we crash
> > > > after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
> > > > replication using the new temporary slot? Here, we need to test the
> > > > case where, during the catchup phase, we have received a few commits
> > > > and then the tablesync worker crashes/errors out. Basically, check
> > > > whether the replication continues from the same point.
> > > >
> > >
> > > I have tested this and it didn't work; see the example below.
> > >
> > > Publisher-side
> > > ================
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > > BEGIN;
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> > > COMMIT;
> > >
> > > CREATE PUBLICATION mypublication FOR TABLE mytbl1;
> > >
> > > Subscriber-side
> > > ================
> > > - Have a while(1) loop in LogicalRepSyncTableStart so that the
> > > tablesync worker stops.
> > >
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > >
> > > CREATE SUBSCRIPTION mysub
> > >          CONNECTION 'host=localhost port=5432 dbname=postgres'
> > >         PUBLICATION mypublication;
> > >
> > > During debugging, stop after we mark the FINISHEDCOPY state.
> > >
> > > Publisher-side
> > > ================
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
> > >
> > >
> > > Subscriber-side
> > > ================
> > > - Have a breakpoint in apply_dispatch
> > > - continue in debugger;
> > > - After we replay the first commit (which will be for values (1,3)),
> > > note down the origin position in apply_handle_commit_internal and
> > > somehow error out. I have forced the debugger to jump to the last
> > > line in apply_dispatch where the error is raised.
> > > - After the error, the tablesync worker is restarted again and it
> > > starts from the position noted in the previous step.
> > > - It exits without replaying the WAL for (1,4).
> > >
> > > So, on the subscriber side, you will see 3 records; the fourth is
> > > missing. Now, if you insert more records on the publisher, it will
> > > still replay those, but the fourth one remains missing.
> > >
> ...
> > >
> > > At this point, I can't think of any way to fix this problem except
> > > for going back to the previous approach of permanent slots, but let
> > > me know if you have any ideas to salvage this approach.
> > >
> >
> > OK. The latest patch [v21] now restores the permanent slot (and slot
> > cleanup) approach as it was implemented in an earlier version [v17].
> > Please note that this change also re-introduces some potential slot
> > cleanup problems for some race scenarios.
> >
>
> I am able to reproduce the race condition where the slot/origin will
> remain on the publisher node even when the corresponding subscription
> is dropped. Basically, if we error out in the 'catchup' phase in the
> tablesync worker, then either it will restart and clean up the
> slot/origin, or, if in the meantime we have dropped the subscription
> and stopped the apply worker, the slot and origin will probably be
> left dangling on the publisher.
>
> I have used exactly the same test procedure as was used to expose the
> problem with the temporary slots, with some minor changes as mentioned
> below:
> Subscriber-side
> ================
> - Have a while(1) loop in LogicalRepSyncTableStart so that the
> tablesync worker stops.
> - Have a while(1) loop in wait_for_relation_state_change so that we
> can control apply worker via debugger at the right time.
>
> Subscriber-side
> ================
> - Have a breakpoint in apply_dispatch
> - continue in debugger;
> - After we replay the first commit, somehow error out. I have forced
> the debugger to jump to the last line in apply_dispatch where the
> error is raised.
> - Now, the tablesync worker won't restart because the apply worker is
> looping in wait_for_relation_state_change.
> - Execute DropSubscription;
> - We can allow the apply worker to continue by skipping the while(1),
> and it will exit because DropSubscription would have sent a terminate
> signal.
>
> After the above steps, check the publisher (select * from
> pg_replication_slots) and you will find the dangling tablesync slot.
>
> I think to solve the above problem we should drop the tablesync
> slot/origin at Drop/Alter Subscription time, and additionally we need
> to ensure that the apply worker doesn't let tablesync workers restart
> (or that they do no work that accesses the slot, because the slots are
> dropped) once we have stopped them. To ensure that, I think we need to
> make the following changes:
>
> 1. Take AccessExclusiveLock on pg_subscription_rel during Alter
> (before calling RemoveSubscriptionRel) and don't release it till
> transaction end (do table_close with NoLock), similar to
> DropSubscription.
> 2. Take a share lock (AccessShareLock) in GetSubscriptionRelState (it
> gets called from LogicalRepSyncTableStart); we can release this lock
> at the end of that function. This will ensure that even if the
> tablesync worker is restarted, it will be blocked till the transaction
> performing the Alter commits.
> 3. Make the Alter command not runnable inside a transaction block, so
> that we don't keep the locks for a long time, and second because of
> the slot-related stuff, similar to DropSubscription.
>

OK. The latest patch [v22] changes the code as suggested above.
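
To make the suggested locking concrete, here is a rough sketch of points
1 and 2 above. This is my illustration, not the exact v22 code; the
function wrapper and include list are mine. The idea is that the tablesync
worker takes AccessShareLock on pg_subscription_rel, so if an ALTER/DROP
SUBSCRIPTION holds AccessExclusiveLock on that catalog until its
transaction commits, a restarting worker blocks instead of touching slots
that are being dropped.

    #include "postgres.h"
    #include "access/table.h"
    #include "catalog/pg_subscription_rel.h"

    /*
     * Sketch only: the share lock taken while the tablesync worker reads
     * its relation state (e.g. from GetSubscriptionRelState).  The lock
     * is released as soon as the state has been read.
     */
    static void
    tablesync_state_lock_sketch(void)
    {
        Relation    rel;

        /* Blocks behind a concurrent ALTER/DROP holding AccessExclusiveLock. */
        rel = table_open(SubscriptionRelRelationId, AccessShareLock);

        /* ... look up the relation state for (subid, relid) here ... */

        table_close(rel, AccessShareLock);
    }

On the Alter/Drop side the catalog is opened with AccessExclusiveLock and
closed with NoLock, so that lock is held until the end of the transaction.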

> Few comments on v21:
> ===================
> 1.
> DropSubscription()
> {
> ..
> - /* Clean up dependencies */
> + /* Clean up dependencies. */
>   deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
> ..
> }
>
> The above change seems unnecessary w.r.t. the current patch.
>

OK. Modified in patch [v22].

> 2.
> DropSubscription()
> {
> ..
>   /*
> - * If there is no slot associated with the subscription, we can finish
> - * here.
> + * If there is a slot associated with the subscription, then drop the
> + * replication slot at the publisher node using the replication
> + * connection.
>   */
> - if (!slotname)
> + if (slotname)
>   {
> - table_close(rel, NoLock);
> - return;
> ..
> }
>
> What is the reason for this change? Can't we keep the check in its
> existing form?
>

I think the above comment is no longer applicable in the latest patch [v22].
An early exit for a null slotname is not desirable anymore; we still need
to process all the tablesync slots/origins regardless.
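
For illustration, here is a rough sketch of the reworked shape. It is my
own sketch, not the patch's code; drop_tablesync_slot_and_origin() is a
hypothetical helper, and whether the loop walks the not-ready relations or
all relations is a detail I am glossing over.

    #include "postgres.h"
    #include "catalog/pg_subscription_rel.h"
    #include "nodes/pg_list.h"

    /* Hypothetical helper standing in for the patch's per-table cleanup. */
    static void
    drop_tablesync_slot_and_origin(Oid subid, Oid relid)
    {
        /* In the real patch this drops the tablesync slot and origin. */
    }

    /*
     * Sketch only: every relation's tablesync slot/origin is cleaned up
     * first; only the subscription's main slot depends on slotname.
     */
    static void
    drop_subscription_slots_sketch(Oid subid, char *slotname)
    {
        List       *rstates = GetSubscriptionNotReadyRelations(subid);
        ListCell   *lc;

        foreach(lc, rstates)
        {
            SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);

            /* Drop this relation's tablesync slot/origin on the publisher. */
            drop_tablesync_slot_and_origin(subid, rstate->relid);
        }

        if (slotname)
        {
            /* ... drop the subscription's own slot over the replication connection ... */
        }
    }

So the slotname check now guards only the main slot, not the tablesync
cleanup, which is why the early return had to go.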

----
[v22] https://www.postgresql.org/message-id/CAHut%2BPtrAVrtjc8srASTeUhbJtviw0Up-bzFSc14Ss%3DmAMxz9g%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


