Thread: wake up logical workers after ALTER SUBSCRIPTION
Hi hackers,

While working on avoiding unnecessary wakeups in logical/worker.c (as was done for walreceiver.c in 05a7be9), I noticed that the tests began taking much longer. This seems to be caused by the reduced frequency of calls to maybe_reread_subscription() in LogicalRepApplyLoop(). Presently, LogicalRepApplyLoop() only waits for up to a second, so the subscription info is re-read by workers relatively frequently. If LogicalRepApplyLoop() sleeps for longer, the subscription info may not be read for much longer.

I think the fix for this problem can be considered independently, as relying on frequent wakeups seems less than ideal, and the patch seems to provide a small improvement even before applying the avoid-unnecessary-wakeups patch. On my machine, the attached patch improved 'check-world -j8' run time by ~12 seconds (from 3min 8sec to 2min 56sec) and src/test/subscription test time by ~17 seconds (from 139 seconds to 122 seconds).

I put the new logic in launcher.c, but it might make more sense to put it in logical/worker.c. I think that might require some new #includes in a couple of files, but otherwise, the patch would likely look about the same.

Thoughts?

-- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 22, 2022 at 1:41 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On my machine, the attached patch > improved 'check-world -j8' run time by ~12 seconds (from 3min 8sec to 2min > 56 sec) and src/test/subscription test time by ~17 seconds (from 139 > seconds to 122 seconds). Nice! Maybe a comment to explain why a single variable is enough? And an assertion that it wasn't already set? And a note to future self: this would be a candidate user of the nearby SetLatches() patch (which is about moving SetLatch() syscalls out from under LWLocks, though this one may not be very hot).
On Tue, Nov 22, 2022 at 03:16:05PM +1300, Thomas Munro wrote: > Maybe a comment to explain why a single variable is enough? This crossed my mind shortly after sending my previous message. Looking closer, I see that several types of ALTER SUBSCRIPTION do not call PreventInTransactionBlock(), so a single variable might not be enough. Perhaps we can put a list in TopTransactionContext. I'll do some more investigation and report back. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi Nathan, I have done almost the same thing locally for [1], but I thought your code seemed better. Just one comment: IIUC the statement "ALTER SUBSCRIPTION" can be executed inside a transaction block. So if two subscriptions are altered in the same transaction, only one of them will be woken up. Is that the expected behavior? I think we can hold a list of subscription OIDs, recording an OID whenever a subscription is altered, and then the backend process can consume all of the list cells at the end of the transaction. What do you think? [1]: https://commitfest.postgresql.org/40/3581/ Best Regards, Hayato Kuroda FUJITSU LIMITED
On Tue, Nov 22, 2022 at 03:03:52AM +0000, Hayato Kuroda (Fujitsu) wrote: > Just One comment: IIUC the statement "ALTER SUBSCRIPTION" can be executed > inside the transaction. So if two subscriptions are altered in the same > transaction, only one of them will awake. Is it expected behavior? > > I think we can hold a suboid list and record oids when the subscription are > altered, and then the backend process can consume all of list cells at the end of > the transaction. I think you are correct. I did it this way in v2. I've also moved the bulk of the logic to logical/worker.c. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
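A minimal sketch of the recording side, assuming the function name and list variable that appear in the review hunks quoted in the next message; the surrounding details here are guesses rather than the actual v2 patch:

```c
/*
 * Sketch only: names follow the hunks quoted in the review below; the
 * actual patch may differ.  Called from ALTER SUBSCRIPTION and friends.
 */
#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/memutils.h"

static List *on_commit_wakeup_workers_subids = NIL;

void
LogicalRepWorkersWakeupAtCommit(Oid subid)
{
	MemoryContext oldcxt;

	/*
	 * Remember the subscription OID until end of transaction, so the
	 * wakeup only happens if the ALTER actually commits.
	 */
	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
	on_commit_wakeup_workers_subids =
		lappend_oid(on_commit_wakeup_workers_subids, subid);
	MemoryContextSwitchTo(oldcxt);
}
```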
Dear Nathan,

> I think you are correct. I did it this way in v2. I've also moved the
> bulk of the logic to logical/worker.c.

Thanks for updating! It's getting better. Further comments:

01. AlterSubscription()

```
+ LogicalRepWorkersWakeupAtCommit(subid);
+
```

Currently subids will be recorded even if the subscription is not modified. I think LogicalRepWorkersWakeupAtCommit() should be called inside the if (update_tuple).

02. LogicalRepWorkersWakeupAtCommit()

```
+ oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+ on_commit_wakeup_workers_subids = lappend_oid(on_commit_wakeup_workers_subids,
+                                               subid);
```

If the subscription is altered twice in the same transaction, the same subid will be recorded twice. I'm not sure whether it may cause some issues, but list_member_oid() can be used to avoid that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tuesday, November 22, 2022 1:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Nov 22, 2022 at 03:03:52AM +0000, Hayato Kuroda (Fujitsu) wrote: > > Just One comment: IIUC the statement "ALTER SUBSCRIPTION" can be > > executed inside the transaction. So if two subscriptions are altered > > in the same transaction, only one of them will awake. Is it expected > behavior? > > > > I think we can hold a suboid list and record oids when the > > subscription are altered, and then the backend process can consume all > > of list cells at the end of the transaction. > > I think you are correct. I did it this way in v2. I've also moved the bulk of > the logic to logical/worker.c. Hi, thanks for updating. I just had a quick look at your patch and have one minor question. With this patch, when we execute ALTER SUBSCRIPTION in a subtransaction and then roll back to it, is there any possibility that we'll wake up workers that don't need to be woken up? I'm not sure if this causes any substantial issue, but I'm just wondering whether there is any need for improvement here. Best Regards, Takamichi Osumi
On Tuesday, November 22, 2022 2:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> > > Dear Nathan, > > > I think you are correct. I did it this way in v2. I've also moved > > the bulk of the logic to logical/worker.c. > > Thanks for updating! It's getting better. Further comments: > > 01. AlterSubscription() > > ``` > + LogicalRepWorkersWakeupAtCommit(subid); > + > ``` > > Currently subids will be recorded even if the subscription is not modified. > I think LogicalRepWorkersWakeupAtCommit() should be called inside the if > (update_tuple). I think an exception would be REFRESH PUBLICATION, in which case update_tuple is false, but it seems better to wake up the apply worker in this case as well, because the apply worker is also responsible for starting table sync workers for newly subscribed tables (in process_syncing_tables()). Besides, it doesn't seem to be a must to wake up the apply worker for ALTER SKIP TRANSACTION, although there might be no harm in waking up in this case. > > 02. LogicalRepWorkersWakeupAtCommit() > > ``` > + oldcxt = MemoryContextSwitchTo(TopTransactionContext); > + on_commit_wakeup_workers_subids = > lappend_oid(on_commit_wakeup_workers_subids, > + > subid); > ``` > > If the subscription is altered twice in the same transaction, the same subid will > be recorded twice. > I'm not sure whether it may cause some issues, but list_member_oid() can > be used to avoid that. +1, list_append_unique_oid might be better. Best regards, Hou zj
On Tue, Nov 22, 2022 at 6:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > While working on avoiding unnecessary wakeups in logical/worker.c (as was > done for walreceiver.c in 05a7be9), I noticed that the tests began taking > much longer. This seems to be caused by the reduced frequency of calls to > maybe_reread_subscription() in LogicalRepApplyLoop(). > I think it would be interesting to know why tests started taking more time after a reduced frequency of calls to maybe_reread_subscription(). IIRC, we anyway call maybe_reread_subscription for each xact. -- With Regards, Amit Kapila.
On Tue, Nov 22, 2022 at 07:25:36AM +0000, houzj.fnst@fujitsu.com wrote: > On Tuesday, November 22, 2022 2:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> >> Thanks for updating! It's getting better. Further comments: >> >> 01. AlterSubscription() >> >> ``` >> + LogicalRepWorkersWakeupAtCommit(subid); >> + >> ``` >> >> Currently subids will be recorded even if the subscription is not modified. >> I think LogicalRepWorkersWakeupAtCommit() should be called inside the if >> (update_tuple). > > I think an exception would be REFRESH PUBLICATION, in which case update_tuple is > false, but it seems better to wake up the apply worker in this case as well, > because the apply worker is also responsible for starting table sync workers for > newly subscribed tables (in process_syncing_tables()). > > Besides, it doesn't seem to be a must to wake up the apply worker for ALTER SKIP TRANSACTION, > although there might be no harm in waking up in this case. In v3, I moved the call to LogicalRepWorkersWakeupAtCommit() to the end of the function. This should avoid waking up workers in some cases where it's unnecessary (e.g., if ALTER SUBSCRIPTION ERRORs in a subtransaction), but there are still cases where we'll wake up the workers unnecessarily. I think this is unlikely to cause any real problems in practice. >> 02. LogicalRepWorkersWakeupAtCommit() >> >> ``` >> + oldcxt = MemoryContextSwitchTo(TopTransactionContext); >> + on_commit_wakeup_workers_subids = >> lappend_oid(on_commit_wakeup_workers_subids, >> + >> subid); >> ``` >> >> If the subscription is altered twice in the same transaction, the same subid will >> be recorded twice. >> I'm not sure whether it may cause some issues, but list_member_oid() can >> be used to avoid that. > > +1, list_append_unique_oid might be better. Done in v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
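A rough sketch of the end-of-transaction side implied by the v3 description, with list_append_unique_oid() used for deduplication as suggested above; the hook name AtEOXact_LogicalRepWorkers() and the choice of wakeup target are assumptions, not necessarily what the attached patch does:

```c
/*
 * Sketch only: hook name and wakeup target are assumptions.  The recording
 * side (sketched earlier) would now use list_append_unique_oid() instead of
 * lappend_oid(), so a subscription altered twice in one transaction is
 * only recorded once.
 */
void
AtEOXact_LogicalRepWorkers(bool isCommit)
{
	if (isCommit)
	{
		ListCell   *lc;

		foreach(lc, on_commit_wakeup_workers_subids)
		{
			/* Wake the apply worker (and possibly sync workers) for subid. */
			logicalrep_worker_wakeup(lfirst_oid(lc), InvalidOid);
		}
	}

	/* The list itself lives in TopTransactionContext and goes away with it. */
	on_commit_wakeup_workers_subids = NIL;
}
```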
On Tue, Nov 22, 2022 at 04:59:28PM +0530, Amit Kapila wrote: > On Tue, Nov 22, 2022 at 6:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> While working on avoiding unnecessary wakeups in logical/worker.c (as was >> done for walreceiver.c in 05a7be9), I noticed that the tests began taking >> much longer. This seems to be caused by the reduced frequency of calls to >> maybe_reread_subscription() in LogicalRepApplyLoop(). > > I think it would be interesting to know why tests started taking more > time after a reduced frequency of calls to > maybe_reread_subscription(). IIRC, we anyway call > maybe_reread_subscription for each xact. At the moment, commands like ALTER SUBSCRIPTION don't wake up the logical workers for the target subscription, so the next call to maybe_reread_subscription() may not happen for a while. Presently, we'll only sleep up to a second in the apply loop, but with my new prevent-unnecessary-wakeups patch, we may sleep for much longer. This causes wait_for_subscription_sync to take more time after some ALTER SUBSCRIPTION commands. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
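For context, the wait in LogicalRepApplyLoop() has roughly the following shape (heavily simplified; the real code also waits on the walreceiver socket via WaitLatchOrSocket() and tracks ping/feedback deadlines). With the timeout capped at one second, a missed wakeup costs at most that much; if the loop is taught to sleep longer, an explicit wakeup from ALTER SUBSCRIPTION becomes necessary for maybe_reread_subscription() to stay responsive:

```c
/* Simplified fragment, not the verbatim logic in logical/worker.c. */
long		wait_time = NAPTIME_PER_CYCLE;	/* currently capped at 1000 ms */
int			rc;

rc = WaitLatch(MyLatch,
			   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
			   wait_time,
			   WAIT_EVENT_LOGICAL_APPLY_MAIN);

if (rc & WL_LATCH_SET)
{
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}

/* When idle, pick up any changes to our pg_subscription row. */
AcceptInvalidationMessages();
maybe_reread_subscription();
```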
Dear Nathan, Thank you for updating the patch! > In v3, I moved the call to LogicalRepWorkersWakeupAtCommit() to the end of > the function. This should avoid waking up workers in some cases where it's > unnecessary (e.g., if ALTER SUBSCRIPTION ERRORs in a subtransaction), but > there are still cases where we'll wake up the workers unnecessarily. I > think this is unlikely to cause any real problems in practice. I understand that you accept false-positive wakeups in order to avoid missing necessary ones, e.g., for ALTER SUBSCRIPTION REFRESH. +1. > >> 02. LogicalRepWorkersWakeupAtCommit() > >> > >> ``` > >> + oldcxt = MemoryContextSwitchTo(TopTransactionContext); > >> + on_commit_wakeup_workers_subids = > >> lappend_oid(on_commit_wakeup_workers_subids, > >> + > >> subid); > >> ``` > >> > >> If the subscription is altered twice in the same transaction, the same subid will > >> be recorded twice. > >> I'm not sure whether it may cause some issues, but list_member_oid() can > >> be used to avoid that. > > > > +1, list_append_unique_oid might be better. > > Done in v3. I have no further comments on the v3 patch. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Thu, Nov 24, 2022 at 05:26:27AM +0000, Hayato Kuroda (Fujitsu) wrote: > I have no comments for the v3 patch. Thanks for reviewing! Does anyone else have thoughts on the patch? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
I spent some more time on the prevent-unnecessary-wakeups patch for logical/worker.c that I've been alluding to in this thread, and I found a few more places where we depend on the worker periodically waking up. This seems to be a common technique, so I'm beginning to wonder whether these changes are worthwhile. I think there's a good chance it would become a game of whac-a-mole. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 5:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I spent some more time on the prevent-unnecessary-wakeups patch for > logical/worker.c that I've been alluding to in this thread, and I found a > few more places where we depend on the worker periodically waking up. This > seems to be a common technique, so I'm beginning to wonder whether these > changes are worthwhile. I think there's a good chance it would become a > game of whac-a-mole. Aren't they all bugs, though, making our tests and maybe even real systems slower than they need to be?
On Wed, Nov 30, 2022 at 5:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Nov 30, 2022 at 5:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > I spent some more time on the prevent-unnecessary-wakeups patch for > > logical/worker.c that I've been alluding to in this thread, and I found a > > few more places where we depend on the worker periodically waking up. This > > seems to be a common technique, so I'm beginning to wonder whether these > > changes are worthwhile. I think there's a good chance it would become a > > game of whac-a-mole. > > Aren't they all bugs, though, making our tests and maybe even real > systems slower than they need to be? (Which isn't to suggest that it's your job to fix them, but please do share what you have if you run out of whack-a-mole steam, since we seem to have several people keen to finish those moles off.)
Dear Nathan, > I spent some more time on the prevent-unnecessary-wakeups patch for > logical/worker.c that I've been alluding to in this thread, and I found a > few more places where we depend on the worker periodically waking up. This > seems to be a common technique, so I'm beginning to wonder whether these > changes are worthwhile. I think there's a good chance it would become a > game of whac-a-mole. I think this feature is needed at least for waking up workers that are sleeping due to min_apply_delay. The author of that patch assumes something like this and has referenced our thread [1]. [1]: https://www.postgresql.org/message-id/TYCPR01MB8373775ECC6972289AF8CB30ED0F9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Nov 30, 2022 at 05:27:40PM +1300, Thomas Munro wrote: > On Wed, Nov 30, 2022 at 5:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Wed, Nov 30, 2022 at 5:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> > I spent some more time on the prevent-unnecessary-wakeups patch for >> > logical/worker.c that I've been alluding to in this thread, and I found a >> > few more places where we depend on the worker periodically waking up. This >> > seems to be a common technique, so I'm beginning to wonder whether these >> > changes are worthwhile. I think there's a good chance it would become a >> > game of whac-a-mole. >> >> Aren't they all bugs, though, making our tests and maybe even real >> systems slower than they need to be? Yeah, you're right, it's probably worth proceeding with this particular thread even if we don't end up porting the suppress-unnecessary-wakeups patch to logical/worker.c. > (Which isn't to suggest that it's your job to fix them, but please do > share what you have if you run out of whack-a-mole steam, since we > seem to have several people keen to finish those moles off.) I don't mind fixing it! There are a couple more I'd like to track down before posting another revision. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 09:04:41PM -0800, Nathan Bossart wrote: > I don't mind fixing it! There are a couple more I'd like to track down > before posting another revision. Okay, here is a new version of the patch. This seems to clear up everything that I could find via the tests. Thanks to this effort, I discovered that we need to include wal_retrieve_retry_interval in our wait time calculations after failed tablesyncs (for the suppress-unnecessary-wakeups patch). I'll make that change and post that patch in a new thread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Dec 01, 2022 at 04:21:30PM -0800, Nathan Bossart wrote: > Okay, here is a new version of the patch. This seems to clear up > everything that I could find via the tests. I cleaned up the patch a bit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi Nathan,
@@ -410,6 +411,12 @@ ExecRenameStmt(RenameStmt *stmt)
stmt->newname);
table_close(catalog, RowExclusiveLock);
+ /*
+ * Wake up the logical replication workers to handle this
+ * change quickly.
+ */
+ LogicalRepWorkersWakeupAtCommit(address.objectId);
Is it really necessary to wake logical workers up when renaming something other than a subscription or publication? address.objectId will be a valid subid only when renaming a subscription.
@@ -322,6 +323,9 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
/* Cleanup. */
table_close(rel, NoLock);
+
+ /* Wake up the logical replication workers to handle this change quickly. */
+ LogicalRepWorkersWakeupAtCommit(subid);
I wonder why a wakeup call is needed every time a subscription relation is updated.
It seems to me that there are two places where UpdateSubscriptionRelState is called and we need another worker to wake up:
- When a relation is in the SYNCWAIT state, it waits for the apply worker to wake up and change the relation state to CATCHUP. Then the tablesync worker needs to wake up to continue from the CATCHUP state.
- When the state is SYNCDONE and the apply worker has to wake up to change the state to READY.
I think we already call logicalrep_worker_wakeup_ptr wherever it's needed for the above cases? What am I missing here?
Best,
--
Melih Mutlu
Microsoft
Thanks for reviewing! On Tue, Dec 06, 2022 at 07:44:46PM +0300, Melih Mutlu wrote: > Is it really necessary to wake logical workers up when renaming other than > subscription or publication? address.objectId will be a valid subid only > when renaming a subscription. Oops, that is a mistake. I only meant to wake up the workers for ALTER SUBSCRIPTION RENAME. I think I've fixed this in v6. > - When the state is SYNCDONE and the apply worker has to wake up to change > the state to READY. > > I think we already call logicalrep_worker_wakeup_ptr wherever it's needed > for the above cases? What am I missing here? IIUC we must restart all the apply workers for a subscription to enable two_phase mode. It looks like finish_sync_worker() only wakes up its own apply worker. I moved this logic to where the sync worker marks the state as SYNCDONE and added a check that two_phase mode is pending. Even so, there can still be unnecessary wakeups, but this adjustment should limit them. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Dec 06, 2022 at 11:25:51AM -0800, Nathan Bossart wrote: > On Tue, Dec 06, 2022 at 07:44:46PM +0300, Melih Mutlu wrote: >> - When the state is SYNCDONE and the apply worker has to wake up to change >> the state to READY. >> >> I think we already call logicalrep_worker_wakeup_ptr wherever it's needed >> for the above cases? What am I missing here? > > IIUC we must restart all the apply workers for a subscription to enable > two_phase mode. It looks like finish_sync_worker() only wakes up its own > apply worker. I moved this logic to where the sync worker marks the state > as SYNCDONE and added a check that two_phase mode is pending. Even so, > there can still be unnecessary wakeups, but this adjustment should limit > them. Actually, that's not quite right. The sync worker will wake up the apply worker to change the state from SYNCDONE to READY. AllTablesyncsReady() checks that all tables are READY, so we need to wake up all the workers when an apply worker changes the state to READY. Each worker will then evaluate whether to restart for two_phase mode. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi,
> Actually, that's not quite right. The sync worker will wake up the apply
> worker to change the state from SYNCDONE to READY. AllTablesyncsReady()
> checks that all tables are READY, so we need to wake up all the workers
> when an apply worker changes the state to READY. Each worker will then
> evaluate whether to restart for two_phase mode.
Right. I didn't think about the two phase case thoroughly. Waking up all apply workers can help.
Do we also need to wake up all sync workers too? Even if not, I'm not actually sure whether doing that would harm anything though.
Just asking since currently the patch wakes up all workers including sync workers if any still exists.
Best,
--
Melih Mutlu
Microsoft
On Wed, Dec 07, 2022 at 02:07:11PM +0300, Melih Mutlu wrote: > Do we also need to wake up all sync workers too? Even if not, I'm not > actually sure whether doing that would harm anything though. > Just asking since currently the patch wakes up all workers including sync > workers if any still exists. After sleeping on this, I think we can do better. IIUC we can simply check for AllTablesyncsReady() at the end of process_syncing_tables_for_apply() and wake up the logical replication workers (which should just consist of setting the current process's latch) if we are ready for two_phase mode. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
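In code form, the idea is roughly the following at the end of process_syncing_tables_for_apply(); the field and constant names below are from HEAD, but how the attached patch actually arranges this is an assumption:

```c
/*
 * Sketch: if every table just reached READY and two_phase is still
 * pending, set our own latch so the next pass through the apply loop
 * calls process_syncing_tables_for_apply() again and takes the existing
 * two_phase restart path without waiting out the full nap time.
 */
if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
	AllTablesyncsReady())
	SetLatch(MyLatch);
```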
Nathan Bossart <nathandbossart@gmail.com> writes:
> After sleeping on this, I think we can do better. IIUC we can simply check
> for AllTablesyncsReady() at the end of process_syncing_tables_for_apply()
> and wake up the logical replication workers (which should just consist of
> setting the current process's latch) if we are ready for two_phase mode.

I independently rediscovered the need for something like this after wondering why the subscription/t/031_column_list.pl test seemed to take so much longer than its siblings. I found that a considerable amount of the elapsed time was wasted because we were waiting up to a full second (NAPTIME_PER_CYCLE) for the logrep worker to notice that something had changed in the local subscription state. At least on my machine, it seems that the worst-case timing is reliably hit multiple times during this test. Now admittedly, this is probably not a significant problem in real-world usage; but it's sure annoying that it eats time during check-world.

However, this patch seems to still be leaving quite a bit on the table. Here's the timings I see for the subscription suite in HEAD (test is just "time make check PROVE_FLAGS=--timer" with an assert-enabled build):

+++ tap check in src/test/subscription +++
[18:07:38] t/001_rep_changes.pl ............... ok 6659 ms ( 0.00 usr 0.00 sys + 0.89 cusr 0.52 csys = 1.41 CPU)
[18:07:45] t/002_types.pl ..................... ok 1572 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.27 csys = 0.97 CPU)
[18:07:47] t/003_constraints.pl ............... ok 1436 ms ( 0.01 usr 0.00 sys + 0.74 cusr 0.25 csys = 1.00 CPU)
[18:07:48] t/004_sync.pl ...................... ok 3007 ms ( 0.00 usr 0.00 sys + 0.75 cusr 0.31 csys = 1.06 CPU)
[18:07:51] t/005_encoding.pl .................. ok 1468 ms ( 0.00 usr 0.00 sys + 0.74 cusr 0.21 csys = 0.95 CPU)
[18:07:53] t/006_rewrite.pl ................... ok 1494 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.24 csys = 0.96 CPU)
[18:07:54] t/007_ddl.pl ....................... ok 2005 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.24 csys = 0.97 CPU)
[18:07:56] t/008_diff_schema.pl ............... ok 1746 ms ( 0.01 usr 0.00 sys + 0.70 cusr 0.28 csys = 0.99 CPU)
[18:07:58] t/009_matviews.pl .................. ok 1878 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.24 csys = 0.95 CPU)
[18:08:00] t/010_truncate.pl .................. ok 2999 ms ( 0.00 usr 0.00 sys + 0.77 cusr 0.38 csys = 1.15 CPU)
[18:08:03] t/011_generated.pl ................. ok 1467 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.24 csys = 0.95 CPU)
[18:08:04] t/012_collation.pl ................. skipped: ICU not supported by this build
[18:08:04] t/013_partition.pl ................. ok 4787 ms ( 0.01 usr 0.00 sys + 1.29 cusr 0.71 csys = 2.01 CPU)
[18:08:09] t/014_binary.pl .................... ok 2564 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.28 csys = 1.00 CPU)
[18:08:12] t/015_stream.pl .................... ok 2531 ms ( 0.01 usr 0.00 sys + 0.73 cusr 0.27 csys = 1.01 CPU)
[18:08:14] t/016_stream_subxact.pl ............ ok 1590 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.24 csys = 0.94 CPU)
[18:08:16] t/017_stream_ddl.pl ................ ok 1610 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.25 csys = 0.97 CPU)
[18:08:17] t/018_stream_subxact_abort.pl ...... ok 1827 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.24 csys = 0.97 CPU)
[18:08:19] t/019_stream_subxact_ddl_abort.pl .. ok 1474 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.24 csys = 0.95 CPU)
[18:08:21] t/020_messages.pl .................. ok 2423 ms ( 0.01 usr 0.00 sys + 0.74 cusr 0.25 csys = 1.00 CPU)
[18:08:23] t/021_twophase.pl .................. ok 4799 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.39 csys = 1.21 CPU)
[18:08:28] t/022_twophase_cascade.pl .......... ok 4346 ms ( 0.00 usr 0.00 sys + 1.12 cusr 0.54 csys = 1.66 CPU)
[18:08:32] t/023_twophase_stream.pl ........... ok 3656 ms ( 0.01 usr 0.00 sys + 0.78 cusr 0.32 csys = 1.11 CPU)
[18:08:36] t/024_add_drop_pub.pl .............. ok 3585 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.29 csys = 1.02 CPU)
[18:08:39] t/025_rep_changes_for_schema.pl .... ok 3631 ms ( 0.00 usr 0.00 sys + 0.77 cusr 0.34 csys = 1.11 CPU)
[18:08:43] t/026_stats.pl ..................... ok 4096 ms ( 0.00 usr 0.00 sys + 0.77 cusr 0.32 csys = 1.09 CPU)
[18:08:47] t/027_nosuperuser.pl ............... ok 4824 ms ( 0.01 usr 0.00 sys + 0.77 cusr 0.39 csys = 1.17 CPU)
[18:08:52] t/028_row_filter.pl ................ ok 5321 ms ( 0.00 usr 0.00 sys + 0.90 cusr 0.50 csys = 1.40 CPU)
[18:08:57] t/029_on_error.pl .................. ok 3748 ms ( 0.00 usr 0.00 sys + 0.75 cusr 0.32 csys = 1.07 CPU)
[18:09:01] t/030_origin.pl .................... ok 4496 ms ( 0.00 usr 0.00 sys + 1.09 cusr 0.45 csys = 1.54 CPU)
[18:09:06] t/031_column_list.pl ............... ok 13802 ms ( 0.01 usr 0.00 sys + 1.00 cusr 0.69 csys = 1.70 CPU)
[18:09:19] t/100_bugs.pl ...................... ok 5195 ms ( 0.00 usr 0.00 sys + 2.05 cusr 0.76 csys = 2.81 CPU)
[18:09:25] All tests successful.
Files=32, Tests=379, 107 wallclock secs ( 0.09 usr 0.02 sys + 26.10 cusr 10.98 csys = 37.19 CPU)
Result: PASS

real 1m47.503s
user 0m27.068s
sys 0m11.452s

With the v8 patch, I get:

+++ tap check in src/test/subscription +++
[18:11:15] t/001_rep_changes.pl ............... ok 5505 ms ( 0.01 usr 0.00 sys + 0.90 cusr 0.49 csys = 1.40 CPU)
[18:11:21] t/002_types.pl ..................... ok 1574 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.26 csys = 0.97 CPU)
[18:11:23] t/003_constraints.pl ............... ok 1442 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.28 csys = 0.99 CPU)
[18:11:24] t/004_sync.pl ...................... ok 2087 ms ( 0.01 usr 0.00 sys + 0.74 cusr 0.30 csys = 1.05 CPU)
[18:11:26] t/005_encoding.pl .................. ok 1465 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.23 csys = 0.94 CPU)
[18:11:28] t/006_rewrite.pl ................... ok 1489 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.24 csys = 0.97 CPU)
[18:11:29] t/007_ddl.pl ....................... ok 2007 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.23 csys = 0.96 CPU)
[18:11:31] t/008_diff_schema.pl ............... ok 1644 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.27 csys = 0.99 CPU)
[18:11:33] t/009_matviews.pl .................. ok 1878 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
[18:11:35] t/010_truncate.pl .................. ok 3006 ms ( 0.00 usr 0.00 sys + 0.79 cusr 0.37 csys = 1.16 CPU)
[18:11:38] t/011_generated.pl ................. ok 1470 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.23 csys = 0.95 CPU)
[18:11:39] t/012_collation.pl ................. skipped: ICU not supported by this build
[18:11:39] t/013_partition.pl ................. ok 4656 ms ( 0.01 usr 0.00 sys + 1.30 cusr 0.69 csys = 2.00 CPU)
[18:11:44] t/014_binary.pl .................... ok 2570 ms ( 0.00 usr 0.00 sys + 0.74 cusr 0.27 csys = 1.01 CPU)
[18:11:46] t/015_stream.pl .................... ok 2535 ms ( 0.00 usr 0.00 sys + 0.74 cusr 0.26 csys = 1.00 CPU)
[18:11:49] t/016_stream_subxact.pl ............ ok 1601 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.26 csys = 0.97 CPU)
[18:11:51] t/017_stream_ddl.pl ................ ok 1608 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.26 csys = 0.96 CPU)
[18:11:52] t/018_stream_subxact_abort.pl ...... ok 1834 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.26 csys = 0.98 CPU)
[18:11:54] t/019_stream_subxact_ddl_abort.pl .. ok 1476 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.24 csys = 0.95 CPU)
[18:11:55] t/020_messages.pl .................. ok 1489 ms ( 0.00 usr 0.00 sys + 0.73 cusr 0.24 csys = 0.97 CPU)
[18:11:57] t/021_twophase.pl .................. ok 4289 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.38 csys = 1.20 CPU)
[18:12:01] t/022_twophase_cascade.pl .......... ok 3835 ms ( 0.01 usr 0.00 sys + 1.17 cusr 0.49 csys = 1.67 CPU)
[18:12:05] t/023_twophase_stream.pl ........... ok 3158 ms ( 0.00 usr 0.00 sys + 0.79 cusr 0.32 csys = 1.11 CPU)
[18:12:08] t/024_add_drop_pub.pl .............. ok 2553 ms ( 0.00 usr 0.00 sys + 0.72 cusr 0.28 csys = 1.00 CPU)
[18:12:11] t/025_rep_changes_for_schema.pl .... ok 2703 ms ( 0.01 usr 0.00 sys + 0.77 cusr 0.32 csys = 1.10 CPU)
[18:12:13] t/026_stats.pl ..................... ok 4101 ms ( 0.00 usr 0.00 sys + 0.77 cusr 0.31 csys = 1.08 CPU)
[18:12:18] t/027_nosuperuser.pl ............... ok 4822 ms ( 0.00 usr 0.00 sys + 0.80 cusr 0.36 csys = 1.16 CPU)
[18:12:22] t/028_row_filter.pl ................ ok 4396 ms ( 0.00 usr 0.00 sys + 0.90 cusr 0.50 csys = 1.40 CPU)
[18:12:27] t/029_on_error.pl .................. ok 4382 ms ( 0.00 usr 0.00 sys + 0.75 cusr 0.33 csys = 1.08 CPU)
[18:12:31] t/030_origin.pl .................... ok 2735 ms ( 0.00 usr 0.00 sys + 1.10 cusr 0.40 csys = 1.50 CPU)
[18:12:34] t/031_column_list.pl ............... ok 10281 ms ( 0.01 usr 0.00 sys + 1.01 cusr 0.60 csys = 1.62 CPU)
[18:12:44] t/100_bugs.pl ...................... ok 5214 ms ( 0.00 usr 0.00 sys + 2.05 cusr 0.79 csys = 2.84 CPU)
[18:12:49] All tests successful.
Files=32, Tests=379, 94 wallclock secs ( 0.10 usr 0.02 sys + 26.21 cusr 10.72 csys = 37.05 CPU)
Result: PASS

real 1m35.275s
user 0m27.177s
sys 0m11.182s

That's better, but not by an impressive amount: there's still an annoyingly large amount of daylight between the CPU time expended and the elapsed time (and I'm not even considering the possibility that some of that CPU time could be parallelized).
I poked into it some more, and what I'm seeing now is traces like

2022-12-13 18:12:35.936 EST [2547426] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3
2022-12-13 18:12:35.941 EST [2547327] LOG: logical replication apply worker for subscription "sub1" will restart because of a parameter change
2022-12-13 18:12:35.944 EST [2547429] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-12-13 18:12:36.048 EST [2547431] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-12-13 18:12:36.151 EST [2547433] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-12-13 18:12:36.255 EST [2547435] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-12-13 18:12:36.359 EST [2547437] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-12-13 18:12:36.443 EST [2547441] LOG: logical replication apply worker for subscription "sub1" has started
2022-12-13 18:12:36.446 EST [2547443] LOG: logical replication table synchronization worker for subscription "sub1", table "tab5" has started
2022-12-13 18:12:36.451 EST [2547443] LOG: logical replication table synchronization worker for subscription "sub1", table "tab5" has finished
2022-12-13 18:12:36.463 EST [2547446] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');

Before, there was up to 1 second (with multiple "SELECT count(1) = 0" probes from the test script) between the ALTER SUBSCRIPTION command and the "apply worker will restart" log entry. That wait is pretty well zapped, but instead now we're waiting hundreds of ms for the "apply worker has started" message.

I've not chased it further than that, but I venture that the apply launcher also needs a kick in the pants, and/or there needs to be an interlock to ensure that it doesn't wake until after the old apply worker quits.

regards, tom lane
On Tue, Dec 13, 2022 at 06:32:08PM -0500, Tom Lane wrote: > Before, there was up to 1 second (with multiple "SELECT count(1) = 0" > probes from the test script) between the ALTER SUBSCRIPTION command > and the "apply worker will restart" log entry. That wait is pretty > well zapped, but instead now we're waiting hundreds of ms for the > "apply worker has started" message. > > I've not chased it further than that, but I venture that the apply > launcher also needs a kick in the pants, and/or there needs to be > an interlock to ensure that it doesn't wake until after the old > apply worker quits. This is probably because the tests set wal_retrieve_retry_interval to 500ms. Lowering that to 1ms in Cluster.pm seems to wipe out this particular wait, and the total src/test/subscription test time drops from 119 seconds to 95 seconds on my machine. This probably lowers the amount of test coverage we get on the wal_retrieve_retry_interval code paths, but if that's a concern, perhaps we should write a test specifically for wal_retrieve_retry_interval. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > On Tue, Dec 13, 2022 at 06:32:08PM -0500, Tom Lane wrote: >> I've not chased it further than that, but I venture that the apply >> launcher also needs a kick in the pants, and/or there needs to be >> an interlock to ensure that it doesn't wake until after the old >> apply worker quits. > This is probably because the tests set wal_retrieve_retry_interval to > 500ms. Lowering that to 1ms in Cluster.pm seems to wipe out this > particular wait, and the total src/test/subscription test time drops from > 119 seconds to 95 seconds on my machine. That's not really the direction we should be going in, though. Ideally there should be *no* situation where we are waiting for a timeout to elapse for a process to wake up and notice it ought to do something. If we have timeouts at all, they should be backstops for the possibility of a lost interrupt, and it should be possible to set them quite high without any visible impact on normal operation. (This gets back to the business about minimizing idle power consumption, which Simon was bugging us about recently but that's been on the radar screen for years.) I certainly don't think that "wake the apply launcher every 1ms" is a sane configuration. Unless I'm missing something basic about its responsibilities, it should seldom need to wake at all in normal operation. regards, tom lane
On Tue, Dec 13, 2022 at 07:20:14PM -0500, Tom Lane wrote: > I certainly don't think that "wake the apply launcher every 1ms" > is a sane configuration. Unless I'm missing something basic about > its responsibilities, it should seldom need to wake at all in > normal operation. This parameter appears to control how often the apply launcher starts new workers. If it starts new workers in a loop iteration, it updates its last_start_time variable, and it won't start any more workers until another wal_retrieve_retry_interval has elapsed. If no new workers need to be started, it only wakes up every 3 minutes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
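A rough paraphrase of that gating in the launcher's main loop (simplified; the helper name below is a stand-in, not a real function):

```c
/* Simplified fragment of the launcher's main loop; last_start_time is
 * carried across iterations in the real code. */
TimestampTz now = GetCurrentTimestamp();
long		wait_time = DEFAULT_NAPTIME_PER_CYCLE;	/* 3 minutes */

if (TimestampDifferenceExceeds(last_start_time, now,
							   wal_retrieve_retry_interval))
{
	/*
	 * Enough time has passed: scan pg_subscription and start any missing
	 * apply workers, then remember when we last did so.
	 */
	start_missing_apply_workers();	/* stand-in for the real per-sub loop */
	last_start_time = now;
	wait_time = wal_retrieve_retry_interval;
}

(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 wait_time,
				 WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
ResetLatch(MyLatch);
```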
On Tue, Dec 13, 2022 at 04:41:05PM -0800, Nathan Bossart wrote: > On Tue, Dec 13, 2022 at 07:20:14PM -0500, Tom Lane wrote: >> I certainly don't think that "wake the apply launcher every 1ms" >> is a sane configuration. Unless I'm missing something basic about >> its responsibilities, it should seldom need to wake at all in >> normal operation. > > This parameter appears to control how often the apply launcher starts new > workers. If it starts new workers in a loop iteration, it updates its > last_start_time variable, and it won't start any more workers until another > wal_retrieve_retry_interval has elapsed. If no new workers need to be > started, it only wakes up every 3 minutes. Looking closer, I see that wal_retrieve_retry_interval is used for three purposes. Its main purpose seems to be preventing busy-waiting in WaitForWALToBecomeAvailable(), as that's what's documented. But it's also used for logical replication. The apply launcher uses it as I've described above, and the apply workers use it when launching sync workers. Unlike the apply launcher, the apply workers store the last start time for each table's sync worker and use that to determine whether to start a new one. My first thought is that the latter two uses should be moved to a new parameter, and the apply launcher should store the last start time for each apply worker like the apply workers do for the table-sync workers. In any case, it probably makes sense to lower this parameter's value for testing so that tests that restart these workers frequently aren't waiting for so long. I can put a patch together if this seems like a reasonable direction to go. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > My first thought is that the latter two uses should be moved to a new > parameter, and the apply launcher should store the last start time for each > apply worker like the apply workers do for the table-sync workers. In any > case, it probably makes sense to lower this parameter's value for testing > so that tests that restart these workers frequently aren't waiting for so > long. > I can put a patch together if this seems like a reasonable direction to go. No, I'm still of the opinion that waiting for the launcher to timeout before doing something is fundamentally wrong design. We should signal it when we want it to do something. That's not different from what you're fixing about the workers; why don't you see that it's appropriate for the launcher too? regards, tom lane
On Wed, Dec 14, 2022 at 12:42:32PM -0500, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> My first thought is that the latter two uses should be moved to a new >> parameter, and the apply launcher should store the last start time for each >> apply worker like the apply workers do for the table-sync workers. In any >> case, it probably makes sense to lower this parameter's value for testing >> so that tests that restart these workers frequently aren't waiting for so >> long. > >> I can put a patch together if this seems like a reasonable direction to go. > > No, I'm still of the opinion that waiting for the launcher to timeout > before doing something is fundamentally wrong design. We should signal > it when we want it to do something. That's not different from what > you're fixing about the workers; why don't you see that it's appropriate > for the launcher too? I'm reasonably certain the launcher is already signaled like you describe. It'll just wait to start new workers if it's been less than wal_retrieve_retry_interval milliseconds since the last time it started workers. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > I'm reasonably certain the launcher is already signaled like you describe. > It'll just wait to start new workers if it's been less than > wal_retrieve_retry_interval milliseconds since the last time it started > workers. Oh. What in the world is the rationale for that? regards, tom lane
On Wed, Dec 14, 2022 at 01:23:18PM -0500, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> I'm reasonably certain the launcher is already signaled like you describe. >> It'll just wait to start new workers if it's been less than >> wal_retrieve_retry_interval milliseconds since the last time it started >> workers. > > Oh. What in the world is the rationale for that? My assumption is that this is meant to avoid starting workers as fast as possible if they repeatedly crash. I didn't see much discussion in the original logical replication thread [0], but I do see follow-up discussion about creating a separate GUC for this [1] [2]. [0] https://postgr.es/m/b8132323-b577-428c-b2aa-bf41a66b18e7%402ndquadrant.com [1] https://postgr.es/m/CAD21AoAjTTGm%2BOx70b2OGWvb77vPcRdYeRv3gkAWx76nXDo%2BEA%40mail.gmail.com [2] https://postgr.es/m/CAD21AoDCnyRJDUY%3DESVVe68AukvOP2dFomTeBFpAd1TiFbjsGg%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > On Wed, Dec 14, 2022 at 01:23:18PM -0500, Tom Lane wrote: >> Oh. What in the world is the rationale for that? > My assumption is that this is meant to avoid starting workers as fast as > possible if they repeatedly crash. I can see the point of rate-limiting if the workers are failing to connect or crashing while trying to process data. But it's not very sane to apply the same policy to an intentional worker exit-for-reconfiguration. Maybe we could have workers that are exiting for that reason set a flag saying "please restart me without delay"? A *real* fix would be to not exit at all, at least for reconfigurations that don't change the connection parameters, but instead cope with recomputing whatever needs recomputed in the workers' state. I can believe that that'd be a lot of work though. regards, tom lane
On Wed, Dec 14, 2022 at 02:02:58PM -0500, Tom Lane wrote: > Maybe we could have workers that are exiting for that reason set a > flag saying "please restart me without delay"? That helps a bit, but there are still delays when starting workers for new subscriptions. I think we'd need to create a new array in shared memory for subscription OIDs that need their workers started immediately. I'm not totally sure this is worth the effort. These delays surface in the tests because the workers are started so frequently. In normal operation, this is probably unusual, so the launcher would typically start new workers immediately. But if you and/or others feel this is worthwhile, I don't mind working on the patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
I tried setting wal_retrieve_retry_interval to 1ms for all TAP tests (similar to what was done in 2710ccd), and I noticed that the recovery tests consistently took much longer. Upon further inspection, it looks like the same (or a very similar) race condition described in e5d494d's commit message [0]. With some added debug logs, I see that all of the callers of MaybeStartWalReceiver() complete before SIGCHLD is processed, so ServerLoop() waits for a minute before starting the WAL receiver. A simple fix is to have DetermineSleepTime() take the WalReceiverRequested flag into consideration. The attached 0002 patch shortens the sleep time to 100ms if it looks like we are waiting on a SIGCHLD. I'm not certain this is the best approach, but it seems to fix the tests. On my machine, I see the following improvements in the tests (all units in seconds): HEAD patched (v9) check-world -j8 165 138 subscription 120 75 recovery 111 108 [0] https://postgr.es/m/21344.1498494720%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
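The 0002 idea, roughly, as a sketch against postmaster.c's DetermineSleepTime(); the 100ms figure comes from the description above, and the exact placement in the attached patch may differ:

```c
/* Sketch: near the top of DetermineSleepTime() in postmaster.c. */
static void
DetermineSleepTime(struct timeval *timeout)
{
	/*
	 * If a WAL receiver start has been requested but not yet acted on
	 * (typically because the SIGCHLD for the old one hasn't been processed
	 * yet), don't sleep for the usual 60 seconds; check back soon.
	 */
	if (WalReceiverRequested)
	{
		timeout->tv_sec = 0;
		timeout->tv_usec = 100 * 1000;	/* 100ms */
		return;
	}

	/* ... existing logic for pending background worker starts, etc. ... */
}
```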
On Thu, Dec 15, 2022 at 02:47:21PM -0800, Nathan Bossart wrote: > I tried setting wal_retrieve_retry_interval to 1ms for all TAP tests > (similar to what was done in 2710ccd), and I noticed that the recovery > tests consistently took much longer. Upon further inspection, it looks > like the same (or a very similar) race condition described in e5d494d's > commit message [0]. With some added debug logs, I see that all of the > callers of MaybeStartWalReceiver() complete before SIGCHLD is processed, so > ServerLoop() waits for a minute before starting the WAL receiver. > > A simple fix is to have DetermineSleepTime() take the WalReceiverRequested > flag into consideration. The attached 0002 patch shortens the sleep time > to 100ms if it looks like we are waiting on a SIGCHLD. I'm not certain > this is the best approach, but it seems to fix the tests. This seems to have somehow broken the archiving tests on Windows, so obviously I owe some better analysis here. I didn't see anything obvious in the logs, but I will continue to dig. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sun, Dec 18, 2022 at 03:36:07PM -0800, Nathan Bossart wrote: > This seems to have somehow broken the archiving tests on Windows, so > obviously I owe some better analysis here. I didn't see anything obvious > in the logs, but I will continue to dig. On Windows, WaitForWALToBecomeAvailable() seems to depend on the call to WaitLatch() for wal_retrieve_retry_interval to ensure that signals are dispatched (i.e., pgwin32_dispatch_queued_signals()). My first instinct is to just always call WaitLatch() in this code path, even if wal_retrieve_retry_interval milliseconds have already elapsed. The attached 0003 does this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
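A sketch of what 0003 appears to do in WaitForWALToBecomeAvailable(); the variable names and timeout arithmetic below are approximations, not the attached patch:

```c
/* Sketch: always wait on the latch, even if the interval already elapsed. */
long		wait_time;
TimestampTz now = GetCurrentTimestamp();

wait_time = wal_retrieve_retry_interval -
	TimestampDifferenceMilliseconds(last_fail_time, now);
if (wait_time < 0)
	wait_time = 0;

/*
 * Even with a zero timeout, going through the latch machinery ensures
 * that (on Windows) any queued signals get dispatched before we return.
 */
(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 wait_time,
				 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
```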
On Thu, Dec 15, 2022 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Dec 14, 2022 at 02:02:58PM -0500, Tom Lane wrote: > > Maybe we could have workers that are exiting for that reason set a > > flag saying "please restart me without delay"? > > That helps a bit, but there are still delays when starting workers for new > subscriptions. I think we'd need to create a new array in shared memory > for subscription OIDs that need their workers started immediately. > That would be tricky because the list of subscription OIDs can be longer than the array of workers. Can't we set a boolean variable (check_immediate or something like that) in LogicalRepCtxStruct and use that to traverse the subscriptions? So, when any worker restarts because of a parameter change, we can set the variable and send a signal to the launcher. The launcher can then check this variable to decide whether to start the missing workers for enabled subscriptions. -- With Regards, Amit Kapila.
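For illustration, the suggestion might look something like the sketch below; the check_immediate field comes from the description above, while the helper and its callers are assumptions:

```c
/* Sketch of the proposed shared flag (launcher.c); check_immediate is new. */
typedef struct LogicalRepCtxStruct
{
	pid_t		launcher_pid;		/* pid of the launcher, or 0 if not running */
	bool		check_immediate;	/* proposed: skip the retry-interval gate */
	LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
} LogicalRepCtxStruct;

/*
 * Hypothetical helper: called by a worker exiting for a parameter change
 * (or a backend altering a subscription) to set the flag and poke the
 * launcher, which would then start any missing workers right away.
 */
static void
RequestImmediateWorkerStart(void)
{
	LogicalRepCtx->check_immediate = true;
	if (LogicalRepCtx->launcher_pid != 0)
		kill(LogicalRepCtx->launcher_pid, SIGUSR1);
}
```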
On Wed, Dec 7, 2022 at 11:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Dec 07, 2022 at 02:07:11PM +0300, Melih Mutlu wrote: > > Do we also need to wake up all sync workers too? Even if not, I'm not > > actually sure whether doing that would harm anything though. > > Just asking since currently the patch wakes up all workers including sync > > workers if any still exists. > > After sleeping on this, I think we can do better. IIUC we can simply check > for AllTablesyncsReady() at the end of process_syncing_tables_for_apply() > and wake up the logical replication workers (which should just consist of > setting the current process's latch) if we are ready for two_phase mode. > How will just waking up help with two_phase mode? For that, we need to restart the apply worker as we are doing at the beginning of process_syncing_tables_for_apply(). -- With Regards, Amit Kapila.
On Tue, Jan 03, 2023 at 11:03:32AM +0530, Amit Kapila wrote: > On Thu, Dec 15, 2022 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Wed, Dec 14, 2022 at 02:02:58PM -0500, Tom Lane wrote: >> > Maybe we could have workers that are exiting for that reason set a >> > flag saying "please restart me without delay"? >> >> That helps a bit, but there are still delays when starting workers for new >> subscriptions. I think we'd need to create a new array in shared memory >> for subscription OIDs that need their workers started immediately. > > That would be tricky because the list of subscription OIDs can be > longer than the workers. Can't we set a boolean variable > (check_immediate or something like that) in LogicalRepCtxStruct and > use that to traverse the subscriptions? So, when any worker will > restart because of a parameter change, we can set the variable and > send a signal to the launcher. The launcher can then check this > variable to decide whether to start the missing workers for enabled > subscriptions. My approach was to add a variable to LogicalRepWorker that indicated whether a worker needed to be restarted immediately. While this is a little weird because the workers array is treated as slots, it worked nicely for ALTER SUBSCRIPTION. However, this doesn't help at all for CREATE SUBSCRIPTION. IIUC you are suggesting just one variable that would bypass wal_retrieve_retry_interval for all subscriptions, not just those newly altered or created. This definitely seems like it would prevent delays, but it would also cause wal_retrieve_retry_interval to be incorrectly bypassed for the other workers in some cases. Is this acceptable? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jan 03, 2023 at 11:43:59AM +0530, Amit Kapila wrote: > On Wed, Dec 7, 2022 at 11:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> After sleeping on this, I think we can do better. IIUC we can simply check >> for AllTablesyncsReady() at the end of process_syncing_tables_for_apply() >> and wake up the logical replication workers (which should just consist of >> setting the current process's latch) if we are ready for two_phase mode. > > How will just waking up help with two_phase mode? For that, we need to > restart the apply worker as we are doing at the beginning of > process_syncing_tables_for_apply(). Right. IIRC waking up causes the apply worker to immediately call process_syncing_tables_for_apply() again, which will then proc_exit(0) as appropriate. It might be possible to move the restart logic to the end of process_syncing_tables_for_apply() to avoid this extra wakeup. WDYT? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jan 3, 2023 at 11:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Jan 03, 2023 at 11:43:59AM +0530, Amit Kapila wrote: > > On Wed, Dec 7, 2022 at 11:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> After sleeping on this, I think we can do better. IIUC we can simply check > >> for AllTablesyncsReady() at the end of process_syncing_tables_for_apply() > >> and wake up the logical replication workers (which should just consist of > >> setting the current process's latch) if we are ready for two_phase mode. > > > > How will just waking up help with two_phase mode? For that, we need to > > restart the apply worker as we are doing at the beginning of > > process_syncing_tables_for_apply(). > > Right. IIRC waking up causes the apply worker to immediately call > process_syncing_tables_for_apply() again, which will then proc_exit(0) as > appropriate. > But we are already in the apply worker and performing process_syncing_tables_for_apply(). This means the apply worker is not waiting/sleeping, so what exactly are we trying to wake up? > It might be possible to move the restart logic to the end of > process_syncing_tables_for_apply() to avoid this extra wakeup. WDYT? > I am not sure if I understand the problem you are trying to solve with this part of the patch. Are you worried that after we mark some of the relation's state as READY, all the table syncs are in the READY state but we will not immediately try to check the two_phase stuff and probably the apply worker may sleep before the next time it invokes process_syncing_tables_for_apply()? If so, we probably also need to ensure that table_states_valid is marked false probably via invalidations so that we can get the latest state and then perform this check. I guess if we can do that then we can directly move the restart logic to the end. -- With Regards, Amit Kapila.
On Tue, Jan 3, 2023 at 11:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Jan 03, 2023 at 11:03:32AM +0530, Amit Kapila wrote: > > On Thu, Dec 15, 2022 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> On Wed, Dec 14, 2022 at 02:02:58PM -0500, Tom Lane wrote: > >> > Maybe we could have workers that are exiting for that reason set a > >> > flag saying "please restart me without delay"? > >> > >> That helps a bit, but there are still delays when starting workers for new > >> subscriptions. I think we'd need to create a new array in shared memory > >> for subscription OIDs that need their workers started immediately. > > > > That would be tricky because the list of subscription OIDs can be > > longer than the workers. Can't we set a boolean variable > > (check_immediate or something like that) in LogicalRepCtxStruct and > > use that to traverse the subscriptions? So, when any worker will > > restart because of a parameter change, we can set the variable and > > send a signal to the launcher. The launcher can then check this > > variable to decide whether to start the missing workers for enabled > > subscriptions. > > My approach was to add a variable to LogicalRepWorker that indicated > whether a worker needed to be restarted immediately. While this is a > little weird because the workers array is treated as slots, it worked > nicely for ALTER SUBSCRIPTION. > So, are you planning to keep its in_use and subid flag as it is in logicalrep_worker_cleanup()? Otherwise, without that it could be reused for some other subscription. > However, this doesn't help at all for > CREATE SUBSCRIPTION. > What if we maintain a hash table similar to 'last_start_times' maintained in tablesync.c? It won't have entries for new subscriptions, so for those we may not need to wait till wal_retrieve_retry_interval. > IIUC you are suggesting just one variable that would bypass > wal_retrieve_retry_interval for all subscriptions, not just those newly > altered or created. This definitely seems like it would prevent delays, > but it would also cause wal_retrieve_retry_interval to be incorrectly > bypassed for the other workers in some cases. > Right, but I guess it would be rare in practical cases that someone Altered/Created a subscription, and also some workers are restarted due to errors/crashes as only in those cases launcher can restart the worker when it shouldn't. However, in that case, also, it won't restart the apply worker again and again unless there are concurrent Create/Alter Subscription operations going on. IIUC, currently also it can always first time restart the worker immediately after ERROR/CRASH because we don't maintain last_start_time for each worker. I think this is probably okay as we want to avoid repeated restarts after the ERROR. BTW, now users also have a subscription option 'disable_on_error' which could also be used to avoid repeated restarts due to ERRORS. > Is this acceptable? > To me, this sounds acceptable but if you and others don't think so then we can try to develop some solution like per-worker-flag and a hash table as discussed in the earlier part of the email. -- With Regards, Amit Kapila.
On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote: > I am not sure if I understand the problem you are trying to solve with > this part of the patch. Are you worried that after we mark some of the > relation's state as READY, all the table syncs are in the READY state > but we will not immediately try to check the two_phase stuff and > probably the apply worker may sleep before the next time it invokes > process_syncing_tables_for_apply()? Yes. > If so, we probably also need to > ensure that table_states_valid is marked false probably via > invalidations so that we can get the latest state and then perform > this check. I guess if we can do that then we can directly move the > restart logic to the end. IMO this shows the advantage of just waking up the worker. It doesn't change the apply worker's behavior besides making it more responsive. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Jan 04, 2023 at 10:57:43AM +0530, Amit Kapila wrote: > On Tue, Jan 3, 2023 at 11:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> My approach was to add a variable to LogicalRepWorker that indicated >> whether a worker needed to be restarted immediately. While this is a >> little weird because the workers array is treated as slots, it worked >> nicely for ALTER SUBSCRIPTION. > > So, are you planning to keep its in_use and subid flag as it is in > logicalrep_worker_cleanup()? Otherwise, without that it could be > reused for some other subscription. I believe I did something like this in my proof-of-concept. I might have used the new flag as another indicator that the slot was still "in use". In any case, you are right that we need to prevent the slot from being reused. > What if we maintain a hash table similar to 'last_start_times' > maintained in tablesync.c? It won't have entries for new > subscriptions, so for those we may not need to wait till > wal_retrieve_retry_interval. I proposed this upthread [0]. I still think it is a worthwhile change. Right now, if a worker needs to be restarted but another unrelated worker was restarted less than wal_retrieve_retry_interval milliseconds ago, the launcher waits to restart it. I think it makes more sense for each worker to have its own restart interval tracked. >> IIUC you are suggesting just one variable that would bypass >> wal_retrieve_retry_interval for all subscriptions, not just those newly >> altered or created. This definitely seems like it would prevent delays, >> but it would also cause wal_retrieve_retry_interval to be incorrectly >> bypassed for the other workers in some cases. > > Right, but I guess it would be rare in practical cases that someone > Altered/Created a subscription, and also some workers are restarted > due to errors/crashes as only in those cases launcher can restart the > worker when it shouldn't. However, in that case, also, it won't > restart the apply worker again and again unless there are concurrent > Create/Alter Subscription operations going on. IIUC, currently also it > can always first time restart the worker immediately after ERROR/CRASH > because we don't maintain last_start_time for each worker. I think > this is probably okay as we want to avoid repeated restarts after the > ERROR. This line of thinking is why I felt that lowering wal_retrieve_retry_interval for the tests might be sufficient. Besides the fact that it revealed multiple bugs, I don't see the point in adding much more complexity here. In practice, workers will usually start right away, unless of course there are other worker starts happening around the same time. This consistently causes testing delays because the tests stress these code paths, but I don't think what the tests are doing is a typical use-case. From the discussion thus far, it sounds like the alternatives are to 1) add a global flag that causes wal_retrieve_retry_interval to be bypassed for all workers or to 2) add a hash map in the launcher and a restart_immediately flag in each worker slot. I'll go ahead and create a patch for 2 since it seems like the most complete solution, and we can evaluate whether the complexity seems appropriate. [0] https://postgr.es/m/20221214171023.GA689106%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Jan 04, 2023 at 10:12:19AM -0800, Nathan Bossart wrote: > From the discussion thus far, it sounds like the alternatives are to 1) add > a global flag that causes wal_retrieve_retry_interval to be bypassed for > all workers or to 2) add a hash map in the launcher and a > restart_immediately flag in each worker slot. I'll go ahead and create a > patch for 2 since it seems like the most complete solution, and we can > evaluate whether the complexity seems appropriate. Here is a first attempt at adding a hash table to the launcher and a restart_immediately flag in each worker slot. This provides a similar speedup to lowering wal_retrieve_retry_interval to 1ms. I've noted a couple of possible race conditions in comments, but none of them seemed particularly egregious. Ideally, we'd put the hash table in shared memory so that other backends could adjust it directly, but IIUC that requires it to be a fixed size, and the number of subscriptions is virtually unbounded. There might still be problems with the patch, but I'm hoping it at least helps further the discussion about which approach to take. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Jan 4, 2023 at 11:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote: > > I am not sure if I understand the problem you are trying to solve with > > this part of the patch. Are you worried that after we mark some of the > > relation's state as READY, all the table syncs are in the READY state > > but we will not immediately try to check the two_phase stuff and > > probably the apply worker may sleep before the next time it invokes > > process_syncing_tables_for_apply()? > > Yes. > > > If so, we probably also need to > > ensure that table_states_valid is marked false probably via > > invalidations so that we can get the latest state and then perform > > this check. I guess if we can do that then we can directly move the > > restart logic to the end. > > IMO this shows the advantage of just waking up the worker. It doesn't > change the apply worker's behavior besides making it more responsive. > But there doesn't appear to be any guarantee that the result for AllTablesyncsReady() will change between the time it is invoked earlier in the function and at the place you have it in the patch. This is because the value of 'table_states_valid' may not have changed. So, how is this supposed to work? -- With Regards, Amit Kapila.
On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote: > On Wed, Jan 4, 2023 at 11:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote: >> > If so, we probably also need to >> > ensure that table_states_valid is marked false probably via >> > invalidations so that we can get the latest state and then perform >> > this check. I guess if we can do that then we can directly move the >> > restart logic to the end. >> >> IMO this shows the advantage of just waking up the worker. It doesn't >> change the apply worker's behavior besides making it more responsive. > > But there doesn't appear to be any guarantee that the result for > AllTablesyncsReady() will change between the time it is invoked > earlier in the function and at the place you have it in the patch. > This is because the value of 'table_states_valid' may not have > changed. So, how is this supposed to work? The call to CommandCounterIncrement() should set table_states_valid to false if needed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Jan 04, 2023 at 08:12:37PM -0800, Nathan Bossart wrote: > On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote: >> But there doesn't appear to be any guarantee that the result for >> AllTablesyncsReady() will change between the time it is invoked >> earlier in the function and at the place you have it in the patch. >> This is because the value of 'table_states_valid' may not have >> changed. So, how is this supposed to work? > > The call to CommandCounterIncrement() should set table_states_valid to > false if needed. In v12, I moved the restart for two_phase mode to the end of process_syncing_tables_for_apply() so that we don't need to rely on another iteration of the loop. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Jan 5, 2023 at 6:19 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Jan 04, 2023 at 10:12:19AM -0800, Nathan Bossart wrote: > > From the discussion thus far, it sounds like the alternatives are to 1) add > > a global flag that causes wal_retrieve_retry_interval to be bypassed for > > all workers or to 2) add a hash map in the launcher and a > > restart_immediately flag in each worker slot. I'll go ahead and create a > > patch for 2 since it seems like the most complete solution, and we can > > evaluate whether the complexity seems appropriate. > > Here is a first attempt at adding a hash table to the launcher and a > restart_immediately flag in each worker slot. This provides a similar > speedup to lowering wal_retrieve_retry_interval to 1ms. I've noted a > couple of possible race conditions in comments, but none of them seemed > particularly egregious. Ideally, we'd put the hash table in shared memory > so that other backends could adjust it directly, but IIUC that requires it > to be a fixed size, and the number of subscriptions is virtually unbounded. > True, if we want we can use dshash for this. The garbage collection mechanism used in the patch seems odd to me as that will remove/add entries to the hash table even when the corresponding subscription is never dropped. Also, adding this garbage collection each time seems like an overhead, especially for small values of wal_retrieve_retry_interval and a large number of subscriptions. Another point is immediately after cleaning the worker info, trying to find it again seems of no use. In logicalrep_worker_launch(), using both in_use and restart_immediately to find an unused slot doesn't look neat to me, we could probably keep the in_use flag intact if we want to reuse the worker. But again after freeing the worker, keeping its associated slot allocated sounds odd to me. -- With Regards, Amit Kapila.
On Thu, Jan 5, 2023 at 10:16 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Jan 04, 2023 at 08:12:37PM -0800, Nathan Bossart wrote: > > On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote: > >> But there doesn't appear to be any guarantee that the result for > >> AllTablesyncsReady() will change between the time it is invoked > >> earlier in the function and at the place you have it in the patch. > >> This is because the value of 'table_states_valid' may not have > >> changed. So, how is this supposed to work? > > > > The call to CommandCounterIncrement() should set table_states_valid to > > false if needed. > > In v12, I moved the restart for two_phase mode to the end of > process_syncing_tables_for_apply() so that we don't need to rely on another > iteration of the loop. > This should work but it is better to add a comment before calling CommandCounterIncrement() to indicate that this is for making changes to the relation state visible. Thinking along similar lines, won't apply worker need to be notified of SUBREL_STATE_SYNCWAIT state change by the tablesync worker? -- With Regards, Amit Kapila.
On Thu, Jan 05, 2023 at 11:34:37AM +0530, Amit Kapila wrote: > On Thu, Jan 5, 2023 at 10:16 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> In v12, I moved the restart for two_phase mode to the end of >> process_syncing_tables_for_apply() so that we don't need to rely on another >> iteration of the loop. > > This should work but it is better to add a comment before calling > CommandCounterIncrement() to indicate that this is for making changes > to the relation state visible. Will do. > Thinking along similar lines, won't apply worker need to be notified > of SUBREL_STATE_SYNCWAIT state change by the tablesync worker? wait_for_worker_state_change() should notify the apply worker in this case. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Jan 05, 2023 at 10:57:58AM +0530, Amit Kapila wrote: > True, if we want we can use dshash for this. I'll look into this. > The garbage collection > mechanism used in the patch seems odd to me as that will remove/add > entries to the hash table even when the corresponding subscription is > never dropped. Yeah, I think this deserves a comment. We can remove anything beyond wal_retrieve_retry_interval because the lack of a hash table entry is taken to mean that we can start the worker immediately. There might be a corner case when wal_retrieve_retry_interval is concurrently updated, in which case we'll effectively use the previous value for the worker. That doesn't seem too terrible to me. It might be possible to remove this garbage collection completely if we use dshash, but I haven't thought through that approach completely yet. > Also, adding this garbage collection each time seems > like an overhead, especially for small values of > wal_retrieve_retry_interval and a large number of subscriptions. Right. > Another point is immediately after cleaning the worker info, trying to > find it again seems of no use. In logicalrep_worker_launch(), using > both in_use and restart_immediately to find an unused slot doesn't > look neat to me, we could probably keep the in_use flag intact if we > want to reuse the worker. But again after freeing the worker, keeping > its associated slot allocated sounds odd to me. Yeah, this flag certainly feels hacky. With a shared hash table, we could just have backends remove the last-start-time entry directly, and we wouldn't need the flag. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
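(Continuing the hypothetical sub_last_start sketch from a few messages up, the pruning rule being described is roughly the following: an entry older than wal_retrieve_retry_interval can no longer delay a restart, so removing it is equivalent to having no entry at all.)

/* Sketch only: drop entries that can no longer delay a worker restart. */
static void
sub_last_start_gc(TimestampTz now)
{
    HASH_SEQ_STATUS status;
    SubWorkerLastStart *entry;

    hash_seq_init(&status, sub_last_start);
    while ((entry = (SubWorkerLastStart *) hash_seq_search(&status)) != NULL)
    {
        /* A missing entry already means "may start immediately". */
        if (TimestampDifferenceExceeds(entry->last_start_time, now,
                                       wal_retrieve_retry_interval))
            (void) hash_search(sub_last_start, &entry->subid,
                               HASH_REMOVE, NULL);
    }
}

(Deleting the entry just returned by hash_seq_search() is safe in dynahash. The corner case noted above, a concurrent change to wal_retrieve_retry_interval, only makes an entry linger a bit longer or vanish a bit earlier than ideal.)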
On Thu, Jan 05, 2023 at 09:29:24AM -0800, Nathan Bossart wrote: > On Thu, Jan 05, 2023 at 10:57:58AM +0530, Amit Kapila wrote: >> True, if we want we can use dshash for this. > > I'll look into this. Here is an attempt at using dshash. This is quite a bit cleaner since we don't need garbage collection or the flag in the worker slots. There is some extra work required to set up the table, but it doesn't seem too bad. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Jan 5, 2023 at 10:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Jan 05, 2023 at 11:34:37AM +0530, Amit Kapila wrote: > > On Thu, Jan 5, 2023 at 10:16 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> In v12, I moved the restart for two_phase mode to the end of > >> process_syncing_tables_for_apply() so that we don't need to rely on another > >> iteration of the loop. > > > > This should work but it is better to add a comment before calling > > CommandCounterIncrement() to indicate that this is for making changes > > to the relation state visible. > > Will do. > Isn't it better to move this part into a separate patch as this is useful even without the main patch to improve wakeups? > > Thinking along similar lines, won't apply worker need to be notified > > of SUBREL_STATE_SYNCWAIT state change by the tablesync worker? > > wait_for_worker_state_change() should notify the apply worker in this case. > I think this is yet to be included in the patch, right? -- With Regards, Amit Kapila.
I found some additional places that should remove the last-start time from the hash table. I've added those in v14. On Fri, Jan 06, 2023 at 10:30:18AM +0530, Amit Kapila wrote: > On Thu, Jan 5, 2023 at 10:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Thu, Jan 05, 2023 at 11:34:37AM +0530, Amit Kapila wrote: >> > On Thu, Jan 5, 2023 at 10:16 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> >> In v12, I moved the restart for two_phase mode to the end of >> >> process_syncing_tables_for_apply() so that we don't need to rely on another >> >> iteration of the loop. >> > >> > This should work but it is better to add a comment before calling >> > CommandCounterIncrement() to indicate that this is for making changes >> > to the relation state visible. >> >> Will do. > > Isn't it better to move this part into a separate patch as this is > useful even without the main patch to improve wakeups? I moved it to a separate patch in v14. >> > Thinking along similar lines, won't apply worker need to be notified >> > of SUBREL_STATE_SYNCWAIT state change by the tablesync worker? >> >> wait_for_worker_state_change() should notify the apply worker in this case. > > I think this is yet to be included in the patch, right? This is already present on HEAD. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Nathan Bossart <nathandbossart@gmail.com> writes: > I found some additional places that should remove the last-start time from > the hash table. I've added those in v14. I've pushed 0001 and 0002, which seem pretty uncontroversial. Attached is a rebased 0003, just to keep the cfbot happy. I'm kind of wondering whether 0003 is worth the complexity TBH, but in any case I ran out of time to look at it closely today. regards, tom lane

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5bcba0fdec..8f06e234c8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1991,6 +1991,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
       <entry>Waiting to read or update information about
       <quote>heavyweight</quote> locks.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalRepLauncherDSA</literal></entry>
+      <entry>Waiting for logical replication launcher dynamic shared memory
+      allocator access</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalRepLauncherHash</literal></entry>
+      <entry>Waiting for logical replication launcher shared memory hash table
+      access</entry>
+     </row>
      <row>
       <entry><literal>LogicalRepWorker</literal></entry>
       <entry>Waiting to read or update the state of logical replication
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f15a332bae..88f180c2a7 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1504,6 +1504,14 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	}
 	list_free(subworkers);
 
+	/*
+	 * Clear the last-start time for the apply worker to free up space.  If
+	 * this transaction rolls back, the launcher might restart the apply worker
+	 * before wal_retrieve_retry_interval milliseconds have elapsed, but that's
+	 * probably okay.
+	 */
+	logicalrep_launcher_delete_last_start_time(subid);
+
 	/*
 	 * Cleanup of tablesync replication origins.
 	 *
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index a69e371c05..dfe49db64f 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -25,6 +25,7 @@
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
 #include "funcapi.h"
+#include "lib/dshash.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -56,6 +57,25 @@
 int			max_logical_replication_workers = 4;
 int			max_sync_workers_per_subscription = 2;
 
+/* an entry in the last-start times hash table */
+typedef struct LauncherLastStartTimesEntry
+{
+	Oid			subid;
+	TimestampTz last_start_time;
+} LauncherLastStartTimesEntry;
+
+/* parameters for the last-start times hash table */
+static const dshash_parameters dsh_params = {
+	sizeof(Oid),
+	sizeof(LauncherLastStartTimesEntry),
+	dshash_memcmp,
+	dshash_memhash,
+	LWTRANCHE_LAUNCHER_HASH
+};
+
+static dsa_area *last_start_times_dsa = NULL;
+static dshash_table *last_start_times = NULL;
+
 LogicalRepWorker *MyLogicalRepWorker = NULL;
 
 typedef struct LogicalRepCtxStruct
@@ -63,6 +83,10 @@ typedef struct LogicalRepCtxStruct
 	/* Supervisor process. */
 	pid_t		launcher_pid;
 
+	/* hash table for last-start times */
+	dsa_handle	last_start_dsa;
+	dshash_table_handle last_start_dsh;
+
 	/* Background workers. */
 	LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
 } LogicalRepCtxStruct;
@@ -74,6 +98,9 @@
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
 static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
+static void logicalrep_launcher_attach_dshmem(void);
+static void logicalrep_launcher_set_last_start_time(Oid subid, TimestampTz start_time);
+static TimestampTz logicalrep_launcher_get_last_start_time(Oid subid);
 
 static bool on_commit_launcher_wakeup = false;
@@ -756,6 +783,9 @@ ApplyLauncherShmemInit(void)
 			memset(worker, 0, sizeof(LogicalRepWorker));
 			SpinLockInit(&worker->relmutex);
 		}
+
+		LogicalRepCtx->last_start_dsa = DSM_HANDLE_INVALID;
+		LogicalRepCtx->last_start_dsh = DSM_HANDLE_INVALID;
 	}
 }
 
@@ -801,8 +831,6 @@ ApplyLauncherWakeup(void)
 void
 ApplyLauncherMain(Datum main_arg)
 {
-	TimestampTz last_start_time = 0;
-
 	ereport(DEBUG1,
 			(errmsg_internal("logical replication launcher started")));
 
@@ -837,58 +865,55 @@ ApplyLauncherMain(Datum main_arg)
 
 		now = GetCurrentTimestamp();
 
-		/* Limit the start retry to once a wal_retrieve_retry_interval */
-		if (TimestampDifferenceExceeds(last_start_time, now,
-									   wal_retrieve_retry_interval))
-		{
-			/* Use temporary context for the database list and worker info. */
-			subctx = AllocSetContextCreate(TopMemoryContext,
-										   "Logical Replication Launcher sublist",
-										   ALLOCSET_DEFAULT_SIZES);
-			oldctx = MemoryContextSwitchTo(subctx);
-
-			/* search for subscriptions to start or stop. */
-			sublist = get_subscription_list();
-
-			/* Start the missing workers for enabled subscriptions. */
-			foreach(lc, sublist)
-			{
-				Subscription *sub = (Subscription *) lfirst(lc);
-				LogicalRepWorker *w;
+		/* Use temporary context for the database list and worker info. */
+		subctx = AllocSetContextCreate(TopMemoryContext,
+									   "Logical Replication Launcher sublist",
+									   ALLOCSET_DEFAULT_SIZES);
+		oldctx = MemoryContextSwitchTo(subctx);
 
-				if (!sub->enabled)
-					continue;
+		sublist = get_subscription_list();
+		foreach(lc, sublist)
+		{
+			Subscription *sub = (Subscription *) lfirst(lc);
+			LogicalRepWorker *w;
+			TimestampTz last_start;
 
-				LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
-				w = logicalrep_worker_find(sub->oid, InvalidOid, false);
-				LWLockRelease(LogicalRepWorkerLock);
+			if (!sub->enabled)
+				continue;
 
-				if (w == NULL)
-				{
-					last_start_time = now;
-					wait_time = wal_retrieve_retry_interval;
+			LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
+			w = logicalrep_worker_find(sub->oid, InvalidOid, false);
+			LWLockRelease(LogicalRepWorkerLock);
 
-					logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
-											 sub->owner, InvalidOid);
-				}
-			}
+			if (w != NULL)
+				continue;
 
-			/* Switch back to original memory context. */
-			MemoryContextSwitchTo(oldctx);
-			/* Clean the temporary memory. */
-			MemoryContextDelete(subctx);
-		}
-		else
-		{
 			/*
-			 * The wait in previous cycle was interrupted in less than
-			 * wal_retrieve_retry_interval since last worker was started, this
-			 * usually means crash of the worker, so we should retry in
-			 * wal_retrieve_retry_interval again.
+			 * If the worker is eligible to start now, launch it.  Otherwise,
+			 * adjust wait_time so that we wake up when it can be started.
 			 */
-			wait_time = wal_retrieve_retry_interval;
+			last_start = logicalrep_launcher_get_last_start_time(sub->oid);
+			if (TimestampDifferenceExceeds(last_start, now,
+										   wal_retrieve_retry_interval))
+			{
+				logicalrep_launcher_set_last_start_time(sub->oid, now);
+				logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
+										 sub->owner, InvalidOid);
+			}
+			else
+			{
+				long		elapsed;
+
+				elapsed = TimestampDifferenceMilliseconds(last_start, now);
+				wait_time = Min(wait_time, wal_retrieve_retry_interval - elapsed);
+			}
 		}
 
+		/* Switch back to original memory context. */
+		MemoryContextSwitchTo(oldctx);
+		/* Clean the temporary memory. */
+		MemoryContextDelete(subctx);
+
 		/* Wait for more work. */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
@@ -996,3 +1021,89 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
 
 	return (Datum) 0;
 }
+
+/*
+ * Initialize or attach to the dynamic shared hash table that stores the
+ * last-start times, if not already done.  This must be called before using the
+ * table.
+ */
+static void
+logicalrep_launcher_attach_dshmem(void)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
+
+	if (LogicalRepCtx->last_start_dsh == DSM_HANDLE_INVALID)
+	{
+		/* Initialize dynamic shared hash table for last-start times. */
+		last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA);
+		dsa_pin(last_start_times_dsa);
+		dsa_pin_mapping(last_start_times_dsa);
+		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0);
+
+		/* Store handles in shared memory for other backends to use. */
+		LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa);
+		LogicalRepCtx->last_start_dsh = dshash_get_hash_table_handle(last_start_times);
+	}
+	else if (!last_start_times)
+	{
+		/* Attach to existing dynamic shared hash table. */
+		last_start_times_dsa = dsa_attach(LogicalRepCtx->last_start_dsa);
+		dsa_pin_mapping(last_start_times_dsa);
+		last_start_times = dshash_attach(last_start_times_dsa, &dsh_params,
+										 LogicalRepCtx->last_start_dsh, 0);
+	}
+
+	LWLockRelease(LogicalRepWorkerLock);
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Set the last-start time for the subscription.
+ */
+static void
+logicalrep_launcher_set_last_start_time(Oid subid, TimestampTz start_time)
+{
+	LauncherLastStartTimesEntry *entry;
+	bool		found;
+
+	logicalrep_launcher_attach_dshmem();
+
+	entry = dshash_find_or_insert(last_start_times, &subid, &found);
+	entry->last_start_time = start_time;
+	dshash_release_lock(last_start_times, entry);
+}
+
+/*
+ * Return the last-start time for the subscription, or 0 if there isn't one.
+ */
+static TimestampTz
+logicalrep_launcher_get_last_start_time(Oid subid)
+{
+	LauncherLastStartTimesEntry *entry;
+	TimestampTz ret;
+
+	logicalrep_launcher_attach_dshmem();
+
+	entry = dshash_find(last_start_times, &subid, false);
+	if (entry == NULL)
+		return 0;
+
+	ret = entry->last_start_time;
+	dshash_release_lock(last_start_times, entry);
+
+	return ret;
+}
+
+/*
+ * Remove the last-start time for the subscription, if one exists.
+ */
+void
+logicalrep_launcher_delete_last_start_time(Oid subid)
+{
+	logicalrep_launcher_attach_dshmem();
+
+	(void) dshash_delete_key(last_start_times, &subid);
+}
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 09b3e8b32a..61be761e03 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -618,6 +618,13 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 			ereport(LOG,
 					(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
 							MySubscription->name)));
+
+			/*
+			 * Clear the last-start time for this worker so that the
+			 * launcher will restart it immediately.
+			 */
+			logicalrep_launcher_delete_last_start_time(MySubscription->oid);
+
 			should_exit = true;
 		}
 	}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f8649e142c..1e4d94d359 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -164,6 +164,7 @@
 #include "postmaster/walwriter.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
+#include "replication/logicallauncher.h"
 #include "replication/logicalproto.h"
 #include "replication/logicalrelation.h"
 #include "replication/logicalworker.h"
@@ -3089,6 +3090,8 @@ maybe_reread_subscription(void)
 						"stop because the subscription was removed",
 						MySubscription->name)));
 
+		if (!am_tablesync_worker())
+			logicalrep_launcher_delete_last_start_time(MyLogicalRepWorker->subid);
 		proc_exit(0);
 	}
 
@@ -3100,6 +3103,8 @@ maybe_reread_subscription(void)
 						"stop because the subscription was disabled",
 						MySubscription->name)));
 
+		if (!am_tablesync_worker())
+			logicalrep_launcher_delete_last_start_time(MyLogicalRepWorker->subid);
 		proc_exit(0);
 	}
 
@@ -3126,6 +3131,13 @@ maybe_reread_subscription(void)
 				(errmsg("logical replication apply worker for subscription \"%s\" will restart because of a parameter change",
 						MySubscription->name)));
 
+		/*
+		 * Clear the last-start time for the apply worker so that the launcher
+		 * will restart it immediately, bypassing wal_retrieve_retry_interval.
+		 */
+		if (!am_tablesync_worker())
+			logicalrep_launcher_delete_last_start_time(MySubscription->oid);
+
 		proc_exit(0);
 	}
 
@@ -3671,6 +3683,8 @@ ApplyWorkerMain(Datum main_arg)
 				(errmsg("logical replication apply worker for subscription %u will not "
 						"start because the subscription was removed during startup",
 						MyLogicalRepWorker->subid)));
+		if (!am_tablesync_worker())
+			logicalrep_launcher_delete_last_start_time(MyLogicalRepWorker->subid);
 		proc_exit(0);
 	}
 
@@ -3684,6 +3698,8 @@ ApplyWorkerMain(Datum main_arg)
 						"start because the subscription was disabled during startup",
 						MySubscription->name)));
 
+		if (!am_tablesync_worker())
+			logicalrep_launcher_delete_last_start_time(MyLogicalRepWorker->subid);
 		proc_exit(0);
 	}
 
@@ -3884,6 +3900,8 @@ DisableSubscriptionAndExit(void)
 			errmsg("subscription \"%s\" has been disabled because of an error",
 				   MySubscription->name));
 
+	if (!am_tablesync_worker())
+		logicalrep_launcher_delete_last_start_time(MyLogicalRepWorker->subid);
 	proc_exit(0);
 }
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..d2ec396045 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,10 @@ static const char *const BuiltinTrancheNames[] = {
 	"PgStatsHash",
 	/* LWTRANCHE_PGSTATS_DATA: */
 	"PgStatsData",
+	/* LWTRANCHE_LAUNCHER_DSA: */
+	"LogicalRepLauncherDSA",
+	/* LWTRANCHE_LAUNCHER_HASH: */
+	"LogicalRepLauncherHash",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/replication/logicallauncher.h b/src/include/replication/logicallauncher.h
index e1661b6c91..00f6f89d72 100644
--- a/src/include/replication/logicallauncher.h
+++ b/src/include/replication/logicallauncher.h
@@ -26,4 +26,6 @@ extern void AtEOXact_ApplyLauncher(bool isCommit);
 
 extern bool IsLogicalLauncher(void);
 
+extern void logicalrep_launcher_delete_last_start_time(Oid subid);
+
 #endif							/* LOGICALLAUNCHER_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..d2c7afb8f4 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_PGSTATS_DSA,
 	LWTRANCHE_PGSTATS_HASH,
 	LWTRANCHE_PGSTATS_DATA,
+	LWTRANCHE_LAUNCHER_DSA,
+	LWTRANCHE_LAUNCHER_HASH,
 	LWTRANCHE_FIRST_USER_DEFINED
 } BuiltinTrancheIds;
On Fri, Jan 06, 2023 at 05:31:26PM -0500, Tom Lane wrote: > I've pushed 0001 and 0002, which seem pretty uncontroversial. Thanks! > Attached is a rebased 0003, just to keep the cfbot happy. > I'm kind of wondering whether 0003 is worth the complexity TBH, > but in any case I ran out of time to look at it closely today. Yeah. It's not as bad as I was expecting, but it does add a bit more complexity than is probably warranted. I'm not wedded to this approach. BTW I intend to start a new thread for the bugs I mentioned upthread that were revealed by setting wal_retrieve_retry_interval to 1ms in the tests. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
rebased for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Jan 7, 2023 at 6:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Jan 06, 2023 at 05:31:26PM -0500, Tom Lane wrote: > > > Attached is a rebased 0003, just to keep the cfbot happy. > > I'm kind of wondering whether 0003 is worth the complexity TBH, > > but in any case I ran out of time to look at it closely today. > > Yeah. It's not as bad as I was expecting, but it does add a bit more > complexity than is probably warranted. > Personally, I think it is not as complex as we were initially thinking and does the job accurately unless we are missing something. So, +1 to proceed with this approach. I haven't looked in detail but isn't it better to explain somewhere in the comments that it achieves to rate limit the restart of workers in case of error and allows them to restart immediately in case of subscription parameter change? Another minor point: Don't we need to set the launcher's latch after removing the entry from the hash table to avoid the launcher waiting on the latch for a bit longer? -- With Regards, Amit Kapila.
On Tue, Jan 10, 2023 at 10:59:14AM +0530, Amit Kapila wrote: > I haven't looked in detail but isn't it better to explain somewhere in > the comments that it achieves to rate limit the restart of workers in > case of error and allows them to restart immediately in case of > subscription parameter change? I expanded one of the existing comments to make this clear. > Another minor point: Don't we need to set the launcher's latch after > removing the entry from the hash table to avoid the launcher waiting > on the latch for a bit longer? The launcher's latch should be set when the apply worker exits. The apply worker's notify_pid is set to the launcher, which means the launcher will be sent SIGUSR1 on exit. The launcher's SIGUSR1 handler sets its latch. Of course, if the launcher restarts, then the notify_pid will no longer be accurate. However, I see that workers also register a before_shmem_exit callback that will send SIGUSR1 to the launcher_pid currently stored in shared memory. (I wonder if there is a memory ordering bug here.) -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
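(A condensed, hypothetical rendering of the wakeup chain just described; the function name is invented, but LogicalRepCtx->launcher_pid, SIGUSR1 delivery, and the latch calls are the pieces the message refers to. Treat it as a sketch rather than an excerpt from launcher.c.)

/* Sketch: a departing worker pokes the launcher so it rescans subscriptions promptly. */
static void
wake_launcher_on_worker_exit(void)
{
    pid_t       launcher_pid = LogicalRepCtx->launcher_pid;

    /*
     * The launcher's SIGUSR1 handler finishes with SetLatch(MyLatch), which
     * ends the WaitLatch() sleep in ApplyLauncherMain().
     */
    if (launcher_pid != 0)
        kill(launcher_pid, SIGUSR1);
}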
Attachment
Nathan Bossart <nathandbossart@gmail.com> writes: > On Tue, Jan 10, 2023 at 10:59:14AM +0530, Amit Kapila wrote: >> I haven't looked in detail but isn't it better to explain somewhere in >> the comments that it achieves to rate limit the restart of workers in >> case of error and allows them to restart immediately in case of >> subscription parameter change? > I expanded one of the existing comments to make this clear. I pushed v17 with some mostly-cosmetic changes, including more comments. > Of course, if the launcher restarts, then the notify_pid will no longer be > accurate. However, I see that workers also register a before_shmem_exit > callback that will send SIGUSR1 to the launcher_pid currently stored in > shared memory. (I wonder if there is a memory ordering bug here.) I think it's all close enough in reality. There are other issues in this code, and I'm about to start a new thread about one I identified while testing this patch, but I think we're in good shape on this particular point. I've marked the CF entry as committed. regards, tom lane
On Sun, Jan 22, 2023 at 02:12:54PM -0500, Tom Lane wrote: > I pushed v17 with some mostly-cosmetic changes, including more comments. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Monday, January 23, 2023 3:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: Hi, > > Nathan Bossart <nathandbossart@gmail.com> writes: > > On Tue, Jan 10, 2023 at 10:59:14AM +0530, Amit Kapila wrote: > >> I haven't looked in detail but isn't it better to explain somewhere > >> in the comments that it achieves to rate limit the restart of workers > >> in case of error and allows them to restart immediately in case of > >> subscription parameter change? > > > I expanded one of the existing comments to make this clear. > > I pushed v17 with some mostly-cosmetic changes, including more comments. I noticed one minor thing in this commit.

-
LogicalRepCtx->last_start_dsh = DSM_HANDLE_INVALID;
-

The code takes last_start_dsh as a dsm_handle, but it seems it is actually a dsa_pointer ("typedef dsa_pointer dshash_table_handle;"). This won’t cause any problem, but I feel it would be easier to understand if we take it as a dsa_pointer and use InvalidDsaPointer here, like what the attached patch does. What do you think? Best regards, Hou zj
Attachment
On Tue, Jan 24, 2023 at 02:55:07AM +0000, houzj.fnst@fujitsu.com wrote: > I noticed one minor thing in this commit. > > - > LogicalRepCtx->last_start_dsh = DSM_HANDLE_INVALID; > - > > The code takes the last_start_dsh as dsm_handle, but it seems it is a dsa_pointer. > " typedef dsa_pointer dshash_table_handle;" This won’t cause any problem, but I feel > It would be easier to understand if we take it as dsa_pointer and use InvalidDsaPointer here, > like what he attached patch does. What do you think ? IMO ideally there should be a DSA_HANDLE_INVALID and DSHASH_HANDLE_INVALID for use with dsa_handle and dshash_table_handle, respectively. But your patch does seem like an improvement. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
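(One possible shape for such macros, sketched here only to make the suggestion concrete; the actual definitions would belong in dsa.h and dshash.h and might differ.)

/* hypothetical sketch, mirroring the existing handle typedefs */
#define DSA_HANDLE_INVALID      ((dsa_handle) DSM_HANDLE_INVALID)
#define DSHASH_HANDLE_INVALID   ((dshash_table_handle) InvalidDsaPointer)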
Nathan Bossart <nathandbossart@gmail.com> writes: > IMO ideally there should be a DSA_HANDLE_INVALID and DSHASH_HANDLE_INVALID > for use with dsa_handle and dshash_table_handle, respectively. But your > patch does seem like an improvement. Yeah, particularly given that dsa.h says

/*
 * The handle for a dsa_area is currently implemented as the dsm_handle
 * for the first DSM segment backing this dynamic storage area, but client
 * code shouldn't assume that is true.
 */
typedef dsm_handle dsa_handle;

but then provides no way for client code to not be aware that a dsa_handle is a dsm_handle, if it needs to deal with "invalid" values. Either that comment needs to be rewritten or we need to invent some more macros. I agree that the patch as given is an improvement on what was committed, but I wonder whether we shouldn't work a little harder on cleaning this up more widely. regards, tom lane
On Tue, Jan 24, 2023 at 01:13:55PM -0500, Tom Lane wrote: > Either that comment needs to be rewritten or we need to invent some > more macros. Here is a first attempt at a patch. I scanned through all the existing uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything else that needed adjusting. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
At Tue, 24 Jan 2023 10:42:17 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in > On Tue, Jan 24, 2023 at 01:13:55PM -0500, Tom Lane wrote: > > Either that comment needs to be rewritten or we need to invent some > > more macros. > > Here is a first attempt at a patch. I scanned through all the existing > uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything > else that needed adjusting. There seem to be two cases for DSA_HANDLE_INVALID in dsa_get_handle and dsa_attach_in_place, one of which is in an Assert(), though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > At Tue, 24 Jan 2023 10:42:17 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in >> Here is a first attempt at a patch. I scanned through all the existing >> uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything >> else that needed adjusting. > There seems to be two cases for DSA_HANDLE_INVALID in dsa_get_handle > and dsa_attach_in_place, one of which is Assert(), though. Right. I fixed some other infelicities and pushed it. regards, tom lane
On Wed, Jan 25, 2023 at 04:12:00PM +0900, Kyotaro Horiguchi wrote: > At Tue, 24 Jan 2023 10:42:17 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in >> Here is a first attempt at a patch. I scanned through all the existing >> uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything >> else that needed adjusting. > > There seems to be two cases for DSA_HANDLE_INVALID in dsa_get_handle > and dsa_attach_in_place, one of which is Assert(), though. Ah, sorry, I'm not sure how I missed this. Thanks for looking. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Jan 25, 2023 at 11:49:27AM -0500, Tom Lane wrote: > Right. I fixed some other infelicities and pushed it. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com