Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAA4eK1LMYXZY1SpzgW-WyFdy+FTMZ4BMz1dj0rT2rxGv-zLwFA@mail.gmail.com
Whole thread Raw
In response to Re: Single transaction in the tablesync worker?  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Single transaction in the tablesync worker?  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
On Thu, Jan 28, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > PSA the v18 patch for the Tablesync Solution1.
> > >
> > > 7. Have you tested with the new patch the scenario where we crash
> > > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> > > replication using the new temporary slot? Here, we need to test the
> > > case where during the catchup phase we have received few commits and
> > > then the tablesync worker is crashed/errored out? Basically, check if
> > > the replication is continued from the same point?
> > >
> >
> > I have tested this and it didn't work, see the below example.
> >
> > Publisher-side
> > ================
> > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> >
> > BEGIN;
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> > COMMIT;
> >
> > CREATE PUBLICATION mypublication FOR TABLE mytbl1;
> >
> > Subscriber-side
> > ================
> > - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> > worker stops.
> >
> > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> >
> >
> > CREATE SUBSCRIPTION mysub
> >          CONNECTION 'host=localhost port=5432 dbname=postgres'
> >         PUBLICATION mypublication;
> >
> > During debug, stop after we mark FINISHEDCOPY state.
> >
> > Publisher-side
> > ================
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
> >
> >
> > Subscriber-side
> > ================
> > - Have a breakpoint in apply_dispatch
> > - continue in debugger;
> > - After we replay first commit (which will be for values(1,3), note
> > down the origin position in apply_handle_commit_internal and somehow
> > error out. I have forced the debugger to set to the last line in
> > apply_dispatch where the error is raised.
> > - After the error, again the tablesync worker is restarted and it
> > starts from the position noted in the previous step
> > - It exits without replaying the WAL for (1,4)
> >
> > So, on the subscriber-side, you will see 3 records. Fourth is missing.
> > Now, if you insert more records on the publisher, it will anyway
> > replay those but the fourth one got missing.
> >
...
> >
> > At this point, I can't think of any way to fix this problem except for
> > going back to the previous approach of permanent slots but let me know
> > if you have any ideas to salvage this approach?
> >
>
> OK. The latest patch [v21] now restores the permanent slot (and slot
> cleanup) approach as it was implemented in an earlier version [v17].
> Please note that this change also re-introduces some potential slot
> cleanup problems for some race scenarios.
>

I am able to reproduce the race condition where slot/origin will
remain on the publisher node even when the corresponding subscription
is dropped. Basically, if we error out in the 'catchup' phase in
tablesync worker then either it will restart and cleanup slot/origin
or if in the meantime we have dropped the subscription and stopped
apply worker then probably the slot and origin will be dangling on the
publisher.

I have used exactly the same test procedure as was used to expose the
problem in the temporary slots with some minor changes as mentioned
below:
Subscriber-side
================
- Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
worker stops.
- Have a while(1) loop in wait_for_relation_state_change so that we
can control apply worker via debugger at the right time.

Subscriber-side
================
- Have a breakpoint in apply_dispatch
- continue in debugger;
- After we replay first commit somehow error out. I have forced the
debugger to set to the last line in apply_dispatch where the error is
raised.
- Now, the table sync worker won't restart because the apply worker is
looping in wait_for_relation_state_change.
- Execute DropSubscription;
- We can allow apply worker to continue by skipping the while(1) and
it will exit because DropSubscription would have sent a terminate
signal.

After the above steps, check the publisher (select * from
pg_replication_slots) and you will find the dangling tablesync slot.

I think to solve the above problem we should drop tablesync
slot/origin at the Drop/Alter Subscription time and additionally we
need to ensure that apply worker doesn't let tablesync workers restart
(or it must not do any work to access the slot because the slots are
dropped) once we stopped them. To ensure that, I think we need to make
the following changes:

1. Take AccessExclusivelock on subscription_rel during Alter (before
calling RemoveSubscriptionRel) and don't release it till transaction
end (do table_close with NoLock) similar to DropSubscription.
2. Take share lock (AccessShareLock) in GetSubscriptionRelState (it
gets called from logicalrepsyncstartworker), we can release this lock
at the end of that function. This will ensure that even if the
tablesync worker is restarted, it will be blocked till the transaction
performing Alter will commit.
3. Make Alter command to not run in xact block so that we don't keep
locks for a longer time and second for the slots related stuff similar
to dropsubscription.

Few comments on v21:
===================
1.
DropSubscription()
{
..
- /* Clean up dependencies */
+ /* Clean up dependencies. */
  deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
..
}

The above change seems unnecessary w.r.t current patch.

2.
DropSubscription()
{
..
  /*
- * If there is no slot associated with the subscription, we can finish
- * here.
+ * If there is a slot associated with the subscription, then drop the
+ * replication slot at the publisher node using the replication
+ * connection.
  */
- if (!slotname)
+ if (slotname)
  {
- table_close(rel, NoLock);
- return;
..
}

What is the reason for this change? Can't we keep the check in its
existing form?

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "k.jamison@fujitsu.com"
Date:
Subject: RE: libpq debug log
Next
From: Andrey Borodin
Date:
Subject: Re: pglz compression performance, take two