AtEOXact_ApplyLauncher() and subtransactions - Mailing list pgsql-hackers
From | Amit Khandekar |
---|---|
Subject | AtEOXact_ApplyLauncher() and subtransactions |
Date | |
Msg-id | CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com Whole thread Raw |
Responses |
Re: AtEOXact_ApplyLauncher() and subtransactions
Re: AtEOXact_ApplyLauncher() and subtransactions |
List | pgsql-hackers |
Hi, When a SUBSCRIPTION is altered, then the currently running table-synchronization workers that are no longer needed for the altered subscription, are terminated. This is done by the function AtEOXact_ApplyLauncher() inside CommitTransaction(). So during each ALTER-SUBSCRIPTION command, the on_commit_stop_workers list is appended with new set of workers to be stopped. And then at commit, AtEOXact_ApplyLauncher() stops the workers in the list. But there is no handling for sub-transaction abort. Consider this : -- On secondary, create a subscription that will initiate the table sync CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres host=localhost user=rep password=Password port=5432' PUBLICATION mypub; -- While the sync is going on, change the publication of the -- subscription in a subtransaction, so that it adds up the synchronization -- workers for the earlier publication into the 'on_commit_stop_workers' list select pg_sleep(1); begin; savepoint a; ALTER SUBSCRIPTION mysub SET PUBLICATION mypub_2; rollback to a; -- Commit will stop the above sync. commit; So the ALTER-SUBSCRIPTION has been rolled back. But the on_commit_stop_workers is not reverted back to its earlier state. And then at commit, the workers will be killed unnecessarily. I have observed that when the workers are killed, they are restarted. But consider the case where the original subscription was synchronizing hundreds of tables. And just when it is about to finish, someone changes the subscription inside a subtransaction and rolls it back, and then commits the transaction. The whole synchronization gets re-started from scratch. Below log snippets show this behaviour : < CREATE-SUBSCRIPTION command spawns workers > 2018-06-05 15:04:07.841 IST [39951] LOG: logical replication apply worker for subscription "mysub" has started 2018-06-05 15:04:07.848 IST [39953] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < After some time, ALTER-SUBSCRIPTION rollbacks inside subtransaction, and commits transaction. Workers are killed > 2018-06-05 15:04:32.903 IST [39953] FATAL: terminating logical replication worker due to administrator command 2018-06-05 15:04:32.903 IST [39953] CONTEXT: COPY article, line 37116293 2018-06-05 15:04:32.904 IST [39666] LOG: background worker "logical replication worker" (PID 39953) exited with exit code 1 < Workers are restarted > 2018-06-05 15:04:32.909 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < Synchronization done after some time > 2018-06-05 15:05:10.042 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has finished ----------- To reproduce the issue : 1. On master server, create the tables and publications using attached setup_master.sql. 2. On secondary server, run the attached run_on_secondary.sql. This will reproduce the issue as can be seen from the log output. ----------- Fix : I haven't written a patch for it, but I think we should have a separate on_commit_stop_workers for each subtransaction. At subtransaction commit, we replace the on_commit_stop_workers list of the parent subtransaction with the one from the committed subtransaction; and on abort, discard the list of the current subtransaction. So have a stack of the lists. ----------- Furthermore, before fixing this, we may have to fix another issue which, I suspect, would surface even without subtransactions, as explained below : Suppose publication pubx is set for tables tx1 and and tx2. And publication puby is for tables ty1 and ty2. Subscription mysub is set to synchronise tables tx1 and tx2 : CREATE SUBSCRIPTION mysub ... PUBLICATION pubx; Now suppose the subscription is altered to synchronise ty1 and ty2, and then again altered back to synchronise tx1 and tx2 in the same transaction. begin; ALTER SUBSCRIPTION mysub set publication puby; ALTER SUBSCRIPTION mysub set publication pubx; commit; Here, workers for tx1 and tx2 are added to on_commit_stop_workers after the publication is set to puby. And then workers for ty1 and ty2 are further added to that list after the 2nd ALTER command where publication is set to pubx, because the earlier ALTER command has already changed the catalogs to denote that ty1 and ty2 are being synchronised. Effectively, at commit, all the workers are targetted to be stopped, when actually at commit time there is no final change in the tables to be synchronised. What actually we should do is : for each ALTER command we should compare the very first set of tables being synchronised since the last committed ALTER command, rather than checking the catalogs. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
pgsql-hackers by date: