AtEOXact_ApplyLauncher() and subtransactions - Mailing list pgsql-hackers

From Amit Khandekar
Subject AtEOXact_ApplyLauncher() and subtransactions
Date
Msg-id CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
Whole thread Raw
Responses Re: AtEOXact_ApplyLauncher() and subtransactions
Re: AtEOXact_ApplyLauncher() and subtransactions
List pgsql-hackers
Hi,

When a SUBSCRIPTION is altered, then the currently running
table-synchronization workers that are no longer needed for the
altered subscription, are terminated. This is done by the function
AtEOXact_ApplyLauncher() inside CommitTransaction(). So during each
ALTER-SUBSCRIPTION command, the on_commit_stop_workers list is
appended with new set of workers to be stopped. And then at commit,
AtEOXact_ApplyLauncher() stops the workers in the list.

But there is no handling for sub-transaction abort. Consider this :

-- On secondary, create a subscription that will initiate the table sync
CREATE SUBSCRIPTION mysub CONNECTION
'dbname=postgres host=localhost user=rep password=Password port=5432'
PUBLICATION mypub;

-- While the sync is going on, change the publication of the
-- subscription in a subtransaction, so that it adds up the synchronization
-- workers for the earlier publication into the 'on_commit_stop_workers' list
select pg_sleep(1);
begin;
savepoint a;
ALTER SUBSCRIPTION mysub SET PUBLICATION mypub_2;
rollback to a;
-- Commit will stop the above sync.
commit;

So the ALTER-SUBSCRIPTION has been rolled back. But the
on_commit_stop_workers is not reverted back to its earlier state. And
then at commit, the workers will be killed unnecessarily. I have
observed that when the workers are killed, they are restarted. But
consider the case where the original subscription was synchronizing
hundreds of tables. And just when it is about to finish, someone
changes the subscription inside a subtransaction and rolls it back,
and then commits the transaction. The whole synchronization gets
re-started from scratch.

Below log snippets show this behaviour :

< CREATE-SUBSCRIPTION command spawns workers >
2018-06-05 15:04:07.841 IST [39951] LOG:  logical replication apply
worker for subscription "mysub" has started
2018-06-05 15:04:07.848 IST [39953] LOG:  logical replication table
synchronization worker for subscription "mysub", table "article" has
started

< After some time, ALTER-SUBSCRIPTION rollbacks inside subtransaction,
and commits transaction. Workers are killed >
2018-06-05 15:04:32.903 IST [39953] FATAL:  terminating logical
replication worker due to administrator command
2018-06-05 15:04:32.903 IST [39953] CONTEXT:  COPY article, line 37116293
2018-06-05 15:04:32.904 IST [39666] LOG:  background worker "logical
replication worker" (PID 39953) exited with exit code 1

< Workers are restarted >
2018-06-05 15:04:32.909 IST [40003] LOG:  logical replication table
synchronization worker for subscription "mysub", table "article" has
started
< Synchronization done after some time >
2018-06-05 15:05:10.042 IST [40003] LOG:  logical replication table
synchronization worker for subscription "mysub", table "article" has
finished

-----------

To reproduce the issue :
1. On master server, create the tables and publications using attached
setup_master.sql.
2. On secondary server, run the attached run_on_secondary.sql. This
will reproduce the issue as can be seen from the log output.

-----------

Fix :

I haven't written a patch for it, but I think we should have a
separate on_commit_stop_workers for each subtransaction. At
subtransaction commit, we replace the on_commit_stop_workers list of
the parent subtransaction with the one from the committed
subtransaction; and on abort, discard the list of the current
subtransaction. So have a stack of the lists.

-----------

Furthermore, before fixing this, we may have to fix another issue
which, I suspect, would surface even without subtransactions, as
explained below :

Suppose publication pubx is set for tables tx1 and and tx2.
And publication puby is for tables ty1 and ty2.

Subscription mysub is set to synchronise tables tx1 and tx2 :
CREATE SUBSCRIPTION mysub ... PUBLICATION pubx;

Now suppose the subscription is altered to synchronise ty1 and ty2,
and then again altered back to synchronise tx1 and tx2 in the same
transaction.
begin;
ALTER SUBSCRIPTION mysub set publication puby;
ALTER SUBSCRIPTION mysub set publication pubx;
commit;

Here, workers for tx1 and tx2 are added to on_commit_stop_workers
after the publication is set to puby. And then workers for ty1 and ty2
are further added to that list after the 2nd ALTER command where
publication is set to pubx, because the earlier ALTER command has
already changed the catalogs to denote that ty1 and ty2 are being
synchronised. Effectively, at commit, all the workers are targetted to
be stopped, when actually at commit time there is no final change in
the tables to be synchronised. What actually we should do is : for
each ALTER command we should compare the very first set of tables
being synchronised since the last committed ALTER command, rather than
checking the catalogs.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: pg_replication_slot_advance to return NULL instead of 0/0 if slotnot advanced
Next
From: Konstantin Knizhnik
Date:
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lockmanager