Thread: logical replication restrictions
One thing that is needed and not solved yet is delayed replication on logical replication. It would be interesting to document it on the Restrictions page, right?
regards,
Marcos
On Mon, Sep 20, 2021 at 9:47 PM Marcos Pegoraro <marcos@f10.com.br> wrote:
>
> One thing is needed and is not solved yet is delayed replication on logical
> replication. Would be interesting to document it on Restrictions page, right?
>

What do you mean by delayed replication? Is it that by default we send the
transactions at commit?

--
With Regards,
Amit Kapila.
No, I'm talking about that configuration you can have on standby servers
recovery_min_apply_delay = '8h'
Best regards,
On Mon, Sep 20, 2021 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 20, 2021 at 9:47 PM Marcos Pegoraro <marcos@f10.com.br> wrote:
>
> One thing is needed and is not solved yet is delayed replication on logical replication. Would be interesting to document it on Restrictions page, right ?
>
What do you mean by delayed replication? Is it that by default we send
the transactions at commit?
--
With Regards,
Amit Kapila.
On Tue, Sep 21, 2021 at 4:21 PM Marcos Pegoraro <marcos@f10.com.br> wrote:
No, I'm talking about that configuration you can have on standby servers
recovery_min_apply_delay = '8h'
oh okay, I think this can be useful in some cases where we want to avoid data loss, similar to its use for physical standby. For example, if the user has by mistake truncated the table (or deleted some required data) on the publisher, we can always recover it from the subscriber if we have such a feature.
Having said that, I am not sure if we can call it a restriction. It is more of a TODO kind of thing. It doesn't sound advisable to me to keep growing the current Restrictions page [1].
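To sketch that recovery idea (purely illustrative; the table name is made up): while the delayed subscriber has not yet applied the mistaken TRUNCATE, the old rows still exist there and can be copied aside with plain SQL:

-- Run on the subscriber while the delayed TRUNCATE has not been applied yet
-- ("orders" is a hypothetical table name):
CREATE TABLE orders_backup AS SELECT * FROM orders;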
With Regards,
Amit Kapila.
OK, so, could you guide me on where to start on this feature?
regards,
Marcos
On Wed, Sep 22, 2021, at 1:18 AM, Amit Kapila wrote:
On Tue, Sep 21, 2021 at 4:21 PM Marcos Pegoraro <marcos@f10.com.br> wrote:
> No, I'm talking about that configuration you can have on standby servers
> recovery_min_apply_delay = '8h'

oh okay, I think this can be useful in some cases where we want to avoid data
loss similar to its use for physical standby. For example, if the user has by
mistake truncated the table (or deleted some required data) on the publisher,
we can always recover it from the subscriber if we have such a feature.

Having said that, I am not sure if we can call it a restriction. It is more of
a TODO kind of thing. It doesn't sound advisable to me to keep growing the
current Restrictions page [1].
It is a new feature. pglogical supports it and it is useful for delayed
secondary server and if, for some business reason, you have to delay when data
is available. There might be other use cases but these are the ones I regularly
heard from customers.
BTW, I have a WIP patch for this feature. I didn't have enough time to post it
because it lacks documentation and tests. I'm planning to do it as soon as this
CF ends.
Fine, let me know if you need any help, testing, for example.
On Wed, Sep 22, 2021 at 10:27 PM Euler Taveira <euler@eulerto.com> wrote:
>
> It is a new feature. pglogical supports it and it is useful for delayed
> secondary server and if, for some business reason, you have to delay when data
> is available.
>

What kind of reasons do you see where users prefer to delay except to
avoid data loss in the case where users unintentionally removed some
data from the primary?

--
With Regards,
Amit Kapila.
What kind of reasons do you see where users prefer to delay except to
avoid data loss in the case where users unintentionally removed some
data from the primary?
Debugging. Suppose I have a problem, but that problem occurs once a week or once a month. When this problem occurs again, a monitoring system sends me a message: "Hey, that problem occurred again." Then, as I configured my replica with a delay of '30 min', I have time to connect to it and watch, record by record, what is coming in, and see exactly what caused that mistake.
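For reference, a sketch of how that delay is configured today on a physical standby (recovery_min_apply_delay is the existing physical replication setting; the value is just the one from my example):

-- On the standby:
ALTER SYSTEM SET recovery_min_apply_delay = '30min';
SELECT pg_reload_conf();

The proposal here is to have the equivalent for a logical replication subscription.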
On Wed, Sep 22, 2021 at 6:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> oh okay, I think this can be useful in some cases where we want to avoid data
> loss similar to its use for physical standby. For example, if the user has by
> mistake truncated the table (or deleted some required data) on the publisher,
> we can always recover it from the subscriber if we have such a feature.
>
> Having said that, I am not sure if we can call it a restriction. It is more of
> a TODO kind of thing. It doesn't sound advisable to me to keep growing the
> current Restrictions page [1].

One could argue that not having delayed apply *is* a restriction compared to
both physical replication and "the original upstream" pg_logical. I think
therefore it should be mentioned in "Restrictions" so people considering moving
from physical streaming to pg_logical, or just trying to decide whether to use
pg_logical, are warned.

Also, the Restrictions page starts with "These might be addressed in future
releases." so there is no exclusivity of being either a restriction or a TODO.

> [1] - https://wiki.postgresql.org/wiki/Todo
> [2] - https://www.postgresql.org/docs/devel/logical-replication-restrictions.html

-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
On Wed, Sep 22, 2021, at 1:57 PM, Euler Taveira wrote:
It is a new feature. pglogical supports it and it is useful for delayed
secondary server and if, for some business reason, you have to delay when data
is available. There might be other use cases but these are the ones I regularly
heard from customers.

BTW, I have a WIP patch for this feature. I didn't have enough time to post it
because it lacks documentation and tests. I'm planning to do it as soon as this
CF ends.
Long time, no patch. Here it is. I will provide documentation in the next
version. I would appreciate some feedback.
Attachment
On Tuesday, March 1, 2022 9:19 AM Euler Taveira <euler@eulerto.com> wrote:
> Long time, no patch. Here it is. I will provide documentation in the next
> version. I would appreciate some feedback.

Hi, thank you for posting the patch!

$ git am v1-0001-Time-delayed-logical-replication-subscriber.patch
Applying: Time-delayed logical replication subscriber
error: patch failed: src/backend/catalog/system_views.sql:1261
error: src/backend/catalog/system_views.sql: patch does not apply

FYI, one recent commit (7a85073) on HEAD redesigned
pg_stat_subscription_workers. Thus, the below change can't be applied.
Could you please rebase v1?

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..1cc0d86f2e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1261,7 +1261,8 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit,
+              subapplydelay, subpublications)
 ON pg_subscription TO public;

 CREATE VIEW pg_stat_subscription_workers AS

Best Regards,
	Takamichi Osumi
On Tue, Mar 1, 2022, at 3:27 AM, osumi.takamichi@fujitsu.com wrote:
$ git am v1-0001-Time-delayed-logical-replication-subscriber.patch
I generally use -3 to fall back on 3-way merge. Doesn't it work for you?
On Wednesday, March 2, 2022 8:54 AM Euler Taveira <euler@eulerto.com> wrote:
> On Tue, Mar 1, 2022, at 3:27 AM, osumi.takamichi@fujitsu.com wrote:
> > $ git am v1-0001-Time-delayed-logical-replication-subscriber.patch
>
> I generally use -3 to fall back on 3-way merge. Doesn't it work for you?

It did. Excuse me for making noise.

Best Regards,
	Takamichi Osumi
On Mon, Feb 28, 2022, at 9:18 PM, Euler Taveira wrote:
Long time, no patch. Here it is. I will provide documentation in the next
version. I would appreciate some feedback.
This patch is broken since commit 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33. I
rebased it.
I added documentation that explains how this parameter works. I decided to
rename the parameter from apply_delay to min_apply_delay to use the same
terminology as physical replication. IMO the new name makes it clear that
there isn't a guarantee that we are always x ms behind the publisher. Indeed,
due to processing/transfer time, the delay might be higher than the specified
interval.
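To make the semantics concrete, here is a minimal usage sketch (the
subscription, publication, and connection string names are made up;
min_apply_delay is the parameter proposed by this patch):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher.example.com dbname=postgres'
    PUBLICATION mypub
    WITH (min_apply_delay = '1h');

Each transaction is then applied on the subscriber no earlier than roughly one
hour after it was committed on the publisher; it may be applied later than
that, but never earlier.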
I refactored the way the delay is applied. The previous patch covers only
regular transactions; this new one also covers prepared transactions. The
current design intercepts the transaction during the first change (at the time
it will start the transaction to apply the changes) and applies the delay
before effectively starting the transaction. The previous patch used
begin_replication_step() as this point. However, to support prepared
transactions I changed the apply_delay signature to accept a timestamp
parameter (because we use another variable to calculate the delay for prepared
transactions -- prepare_time). Hence, apply_delay() was moved to other places
-- apply_handle_begin() and apply_handle_begin_prepare().
The new code does not apply the delay in 2 situations:
* STREAM START: streamed transactions might not have commit_time or
prepare_time set. I'm afraid it is not possible to use the referred variables
because at STREAM START time we don't have a transaction commit time. The
protocol could provide a timestamp that indicates when it starts streaming
the transaction then we could use it to apply the delay. Unfortunately, we
don't have it. Having said that, this new patch does not apply a delay for
streamed transactions.
* non-transaction messages: the delay could be applied to non-transaction
messages too. They are sent independently of the transaction that contains
them. Since logical replication does not send messages to the subscriber, this
is not an issue. However, consumers that use pgoutput and want to implement a
delay will require it.
I'm still looking for a way to support streamed transactions without much
surgery into the logical replication protocol.
Attachment
On 2022-03-20 21:40:40 -0300, Euler Taveira wrote:
> On Mon, Feb 28, 2022, at 9:18 PM, Euler Taveira wrote:
> > Long time, no patch. Here it is. I will provide documentation in the next
> > version. I would appreciate some feedback.
> This patch is broken since commit 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33. I
> rebased it.

This fails tests, specifically it seems psql crashes:
https://cirrus-ci.com/task/6592281292570624?logs=cores#L46

Marked as waiting-on-author.

Greetings,

Andres Freund
On Mon, Mar 21, 2022, at 10:04 PM, Andres Freund wrote:
This fails tests, specifically it seems psql crashes:
https://cirrus-ci.com/task/6592281292570624?logs=cores#L46
Yeah. I forgot to test this patch with cassert before sending it. :( I didn't
send a new patch because there is another issue (with int128) that I'm
currently reworking. I'll send another patch soon.
On Mon, Mar 21, 2022, at 10:09 PM, Euler Taveira wrote:
I'll send another patch soon.
Here is another version after rebasing it. In this version I fixed the psql
issue and rewrote interval_to_ms function.
Attachment
On Wed, Mar 23, 2022, at 6:19 PM, Euler Taveira wrote:
Here is another version after rebasing it. In this version I fixed the psql
issue and rewrote the interval_to_ms function.
From the previous version, I added support for streamed transactions. For
streamed transactions, the delay is applied during the STREAM COMMIT message.
That's ok because we add the delay before applying the spooled messages; hence,
we guarantee that the delay is applied *before* each transaction. The same
logic applies to prepared transactions: the delay is introduced before applying
the spooled messages in the STREAM PREPARE message.
Tests were refactored a bit. A test for streamed transaction was included too.
Version 4 is attached.
Attachment
Here are some review comments for your v4-0001 patch. I hope they are useful
for you.

======

1. General

This thread name "logical replication restrictions" seems quite unrelated to
the patch here. Maybe it's better to start a new thread otherwise nobody is
going to recognise what this thread is really about.

======

2. Commit message

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (specially to fix
errors that might cause data loss).

"specially" -> "particularly" ?

~~~

3. Commit message

Maybe take some examples from the regression tests to show usage of the new
parameter.

======

4. doc/src/sgml/catalogs.sgml

+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subapplydelay</structfield> <type>int8</type>
+       </para>
+       <para>
+       Delay the application of changes by a specified amount of time.
+       </para></entry>
+      </row>

I think this should say that the units are ms.

======

5. doc/src/sgml/ref/create_subscription.sgml

+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>

Is the "integer" type here correct? It might eventually be stored as an
integer, but IIUC (going by the tests) from the user point-of-view this
parameter is really "text" type for representing ms or interval, right?

~~~

6. doc/src/sgml/ref/create_subscription.sgml

          Similar
+         to the physical replication feature
+         (<xref linkend="guc-recovery-min-apply-delay"/>), it may be useful to
+         have a time-delayed copy of data for logical replication.

SUGGESTION
As with the physical replication feature (recovery_min_apply_delay), it can be
useful for logical replication to delay the data replication.

~~~

7. doc/src/sgml/ref/create_subscription.sgml

          Delays in logical
+         decoding and in transfer the transaction may reduce the actual wait
+         time.

SUGGESTION
Time spent in logical decoding and in transferring the transaction may reduce
the actual wait time.

~~~

8. doc/src/sgml/ref/create_subscription.sgml

          If the system clocks on publisher and subscriber are not
+         synchronized, this may lead to apply changes earlier than expected.

Why just say "earlier than expected"? If the publisher's time is ahead of the
subscriber then the changes might also be *later* than expected, right? So,
perhaps it is better to just say "other than expected".

~~~

9. doc/src/sgml/ref/create_subscription.sgml

Should there also be a big warning box about the impact if using
synchronous_commit (like the other streaming replication page has this
warning)?

~~~

10. doc/src/sgml/ref/create_subscription.sgml

I think there should be some examples somewhere showing how to specify this
parameter. Maybe they are better added somewhere in "31.2 Subscription" and
xrefed from here.

======

11. src/backend/commands/subscriptioncmds.c - parse_subscription_options

I think there should be a default assignment to 0 (done where all the other
supported option defaults are set).

~~~

12. src/backend/commands/subscriptioncmds.c - parse_subscription_options

+ if (opts->min_apply_delay < 0)
+     ereport(ERROR,
+             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+             errmsg("option \"%s\" must not be negative", "min_apply_delay"));
+

I thought this check only needs to be done within the scope of the preceding
if - (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
strcmp(defel->defname, "min_apply_delay") == 0)

======

13. src/backend/commands/subscriptioncmds.c - AlterSubscription

@@ -1093,6 +1126,17 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 			if (opts.enabled)
 				ApplyLauncherWakeupAtCommit();

+			/*
+			 * If this subscription has been disabled and it has an apply
+			 * delay set, wake up the logical replication worker to finish
+			 * it as soon as possible.
+			 */
+			if (!opts.enabled && sub->applydelay > 0)

I did not really understand the logic why should the min_apply_delay override
the enabled=false? It is called a *minimum* delay so if it ends up being way
over the parameter value (because the subscription is disabled) then why does
that matter?

======

14. src/backend/replication/logical/worker.c

@@ -252,6 +252,7 @@ WalReceiverConn *LogRepWorkerWalRcvConn = NULL;
 Subscription *MySubscription = NULL;
 static bool MySubscriptionValid = false;
+TimestampTz MySubscriptionMinApplyDelayUntil = 0;

Looking at the only usage of this variable (in apply_delay) and how it is
used, I did not see why this cannot just be a local member of the apply_delay
function?

~~~

15. src/backend/replication/logical/worker.c - apply_delay

+/*
+ * Apply the informed delay for the transaction.
+ *
+ * A regular transaction uses the commit time to calculate the delay. A
+ * prepared transaction uses the prepare time to calculate the delay.
+ */
+static void
+apply_delay(TimestampTz ts)

I didn't think it needs to mention here about the different kinds of
transactions because where it comes from has nothing really to do with this
function's logic.

~~~

16. src/backend/replication/logical/worker.c - apply_delay

Refer to comment #14 about MySubscriptionMinApplyDelayUntil.

~~~

17. src/backend/replication/logical/worker.c - apply_handle_stream_prepare

@@ -1090,6 +1146,19 @@ apply_handle_stream_prepare(StringInfo s)

 	elog(DEBUG1, "received prepare for streamed transaction %u",
 		 prepare_data.xid);

+	/*
+	 * Should we delay the current prepared transaction?
+	 *
+	 * Although the delay is applied in BEGIN PREPARE messages, streamed
+	 * prepared transactions apply the delay in a STREAM PREPARE message.
+	 * That's ok because no changes have been applied yet
+	 * (apply_spooled_messages() will do it).
+	 * The STREAM START message does not contain a prepare time (it will be
+	 * available when the in-progress prepared transaction finishes), hence, it
+	 * was not possible to apply a delay at that time.
+	 */
+	apply_delay(prepare_data.prepare_time);
+

It seems to rely on the spooling happening at the end. But won't this cause a
problem later when/if the "parallel apply" patch [1] is pushed and the stream
bgworkers are doing stuff on the fly instead of spooling at the end?

Or are you expecting that the "parallel apply" feature should be disabled if
there is any min_apply_delay parameter specified?

~~~

18. src/backend/replication/logical/worker.c - apply_handle_stream_commit

Ditto comment #17.

======

19. src/bin/psql/tab-complete.c

Let's keep the alphabetical order of the parameters in COMPLETE_WITH, as per
[2].

======

20. src/include/catalog/pg_subscription.h

@@ -58,6 +58,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	XLogRecPtr	subskiplsn;		/* All changes finished at this LSN are
 								 * skipped */

+	int64		subapplydelay;	/* Replication apply delay */
+

IMO the comment should mention the units "(ms)".

======

21. src/test/regress/sql/subscription.sql

There are some test cases for CREATE SUBSCRIPTION but there are no test cases
for ALTER SUBSCRIPTION changing this new parameter.

======

22. src/test/subscription/t/032_apply_delay.pl

I received the following error when trying to run these 'subscription' tests:

t/032_apply_delay.pl ............... No such class log_location at
t/032_apply_delay.pl line 49, near "my log_location"
syntax error at t/032_apply_delay.pl line 49, near "my log_location ="
Global symbol "$log_location" requires explicit package name at
t/032_apply_delay.pl line 103.
Global symbol "$log_location" requires explicit package name at
t/032_apply_delay.pl line 105.
Global symbol "$log_location" requires explicit package name at
t/032_apply_delay.pl line 105.
Global symbol "$log_location" requires explicit package name at
t/032_apply_delay.pl line 107.
Global symbol "$sect" requires explicit package name at
t/032_apply_delay.pl line 108.
Execution of t/032_apply_delay.pl aborted due to compilation errors.
t/032_apply_delay.pl ............... Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run
t/100_bugs.pl ...................... ok

Test Summary Report
-------------------
t/032_apply_delay.pl (Wstat: 65280 Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output

------
[1] https://www.postgresql.org/message-id/flat/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAHut%2BPucvKZgg_eJzUW--iL6DXHg1Jwj6F09tQziE3kUF67uLg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
On Tue, Jul 5, 2022 at 2:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 1. General
>
> This thread name "logical replication restrictions" seems quite
> unrelated to the patch here. Maybe it's better to start a new thread
> otherwise nobody is going to recognise what this thread is really
> about.
>

+1.

> 17. src/backend/replication/logical/worker.c - apply_handle_stream_prepare
>
> It seems to rely on the spooling happening at the end. But won't this
> cause a problem later when/if the "parallel apply" patch [1] is pushed
> and the stream bgworkers are doing stuff on the fly instead of
> spooling at the end?
>

I wonder why we don't apply the delay on commit/commit_prepared records only,
similar to physical replication. See recoveryApplyDelay. One more advantage
would be that then we don't need to worry about transactions that we are going
to skip due to the SKIP feature for subscribers.

One more thing that might be worth discussing is whether introducing a new
subscription parameter for this feature is a better idea or whether we can use
a GUC (either an existing or a new one). Users may want to set this only for a
particular subscription or set of subscriptions, in which case it is better to
have this as a subscription-level parameter. OTOH, I was slightly worried that
if this will be used for all subscriptions on a subscriber then it will be
burdensome for users.

--
With Regards,
Amit Kapila.
Hi Euler,
I've some comments/questions about the latest version (v4) of your patch.
Firstly, I think the patch needs a rebase. CI currently cannot apply it [1].
22. src/test/subscription/t/032_apply_delay.pl
I received the following error when trying to run these 'subscription' tests:
t/032_apply_delay.pl ............... No such class log_location at
t/032_apply_delay.pl line 49, near "my log_location"
syntax error at t/032_apply_delay.pl line 49, near "my log_location ="
I'm having these errors too. Seems like some declarations are missing.
+         specified amount of time. If this value is specified without units,
+         it is taken as milliseconds. The default is zero, adding no delay.
+        </para>
I'm also having an issue when I give min_apply_delay parameter without units.
I expect that if I set min_apply_delay to 5000 (without any unit), it will be interpreted as 5000 ms.
I tried:
postgres=# CREATE SUBSCRIPTION testsub CONNECTION 'dbname=postgres port=5432' PUBLICATION testpub WITH (min_apply_delay=5000);
And logs showed:
2022-07-13 20:26:52.231 +03 [5422] LOG: logical replication apply delay: 4999999 ms
2022-07-13 20:26:52.231 +03 [5422] CONTEXT: processing remote data for replication origin "pg_18126" during "BEGIN" in transaction 3152 finished at 0/465D7A0
Looks like it starts from 5000000 ms instead of 5000 ms for me. If I state the unit as ms, then it works correctly.
Lastly, I have a question about this delay during tablesync.
It's stated here that apply delays are not for initial tablesync:

<para>
+         The delay occurs only on WAL records for transaction begins and after
+         the initial table synchronization. It is possible that the
+         replication delay between publisher and subscriber exceeds the value
+         of this parameter, in which case no delay is added. Note that the
+         delay is calculated between the WAL time stamp as written on
+         publisher and the current time on the subscriber. Delays in logical
+         decoding and in transfer the transaction may reduce the actual wait
+         time. If the system clocks on publisher and subscriber are not
+         synchronized, this may lead to apply changes earlier than expected.
+         This is not a major issue because a typical setting of this parameter
+         are much larger than typical time deviations between servers.
+         </para>
There might be a case where tablesync workers are in SYNCWAIT state and waiting for apply worker to tell them to CATCHUP.
And if apply worker is waiting in apply_delay function, tablesync workers will be stuck at SYNCWAIT state and this might delay tablesync at least "min_apply_delay" amount of time or more.
Is it something we would want? What do you think?
Best,
Melih
On Tue, Jul 5, 2022, at 5:41 AM, Peter Smith wrote:
Here are some review comments for your v4-0001 patch. I hope they are useful for you.
Thanks for your review.
This thread name "logical replication restrictions" seems quite unrelated to the patch here. Maybe it's better to start a new thread otherwise nobody is going to recognise what this thread is really about.
I agree that the $SUBJECT does not describe the proposal. I decided that it is
not worth creating a new thread because (i) there has already been some
interaction here and people could be monitoring this thread, and (ii) the CF
entry has the correct description.
Similar to physical replication, a time-delayed copy of the data for logical replication is useful for some scenarios (specially to fix errors that might cause data loss).
I changed the commit message a bit.
Maybe take some examples from the regression tests to show usage of the new parameter.
I don't think an example is really useful in a commit message. If you are
checking this commit, it is a matter of reading the regression tests or
documentation to obtain an example of how to use it.
I think this should say that the units are ms.
Unit included.
Is the "integer" type here correct? It might eventually be stored as an integer, but IIUC (going by the tests) from the user point-of-view this parameter is really "text" type for representing ms or interval, right?
The internal representation is an integer. The unit is correct. If you use
units, the format is text, which is what section [1] calls "Numeric with
Unit". Even if the user is unsure about its usage, an example might help here.
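For instance, the "Numeric with Unit" text format accepts values like these (the subscription name is hypothetical; '4h' is the same value used in the patch's documentation example):

ALTER SUBSCRIPTION mysub SET (min_apply_delay = '4h');
ALTER SUBSCRIPTION mysub SET (min_apply_delay = '30 min');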
SUGGESTION
As with the physical replication feature (recovery_min_apply_delay), it can be useful for logical replication to delay the data replication.
It is not "data replication", it is applying changes. I reworded that sentence.
SUGGESTION
Time spent in logical decoding and in transferring the transaction may reduce the actual wait time.
Changed.
If the system clocks on publisher and subscriber are not
+         synchronized, this may lead to apply changes earlier than expected.

Why just say "earlier than expected"? If the publisher's time is ahead of the subscriber then the changes might also be *later* than expected, right? So, perhaps it is better to just say "other than expected".
This sentence is similar to one in the recovery_min_apply_delay documentation.
I want to emphasize the fact that even if you use a 30-minute delay, it might
apply a change that happened 29 minutes 55 seconds ago. The main reason for
this feature is to avoid applying changes *earlier* than the delay. If it
applies the change after 30 minutes 5 seconds, that is fine.
Should there also be a big warning box about the impact if using synchronous_commit (like the other streaming replication page has this warning)?
Impact? Could you elaborate?
I think there should be some examples somewhere showing how to specify this parameter. Maybe they are better added somewhere in "31.2 Subscription" and xrefed from here.
I added one example in the CREATE SUBSCRIPTION. We can add an example in the
section 31.2, however, since it is a new chapter I think it lacks examples for
the other options too (streaming, two_phase, copy_data, ...). It could be
submitted as a separate patch IMO.
I think there should be a default assignment to 0 (done where all the other supported option defaults are set).
It could be, for completeness; the memset() takes care of it. Anyway, I added
it to the beginning of parse_subscription_options().
+ if (opts->min_apply_delay < 0)
+     ereport(ERROR,
+             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+             errmsg("option \"%s\" must not be negative", "min_apply_delay"));
+

I thought this check only needs to be done within the scope of the preceding
if - (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
strcmp(defel->defname, "min_apply_delay") == 0)
Fixed.
+ /*
+  * If this subscription has been disabled and it has an apply
+  * delay set, wake up the logical replication worker to finish
+  * it as soon as possible.
+  */
+ if (!opts.enabled && sub->applydelay > 0)

I did not really understand the logic why should the min_apply_delay override
the enabled=false? It is called a *minimum* delay so if it ends up being way
over the parameter value (because the subscription is disabled) then why does
that matter?
It doesn't. The main point of this code (as I tried to explain in the comment)
is to kill the worker as soon as possible if you disable the subscription.
Isn't the comment clear?
 Subscription *MySubscription = NULL;
 static bool MySubscriptionValid = false;
+TimestampTz MySubscriptionMinApplyDelayUntil = 0;

Looking at the only usage of this variable (in apply_delay) and how it is
used, I did not see why this cannot just be a local member of the apply_delay
function?
Good catch. A previous patch used that variable outside that function scope.
+/*
+ * Apply the informed delay for the transaction.
+ *
+ * A regular transaction uses the commit time to calculate the delay. A
+ * prepared transaction uses the prepare time to calculate the delay.
+ */
+static void
+apply_delay(TimestampTz ts)

I didn't think it needs to mention here about the different kinds of transactions because where it comes from has nothing really to do with this function's logic.
Fixed.
Refer to comment #14 about MySubscriptionMinApplyDelayUntil.
Fixed.
It seems to rely on the spooling happening at the end. But won't this cause a problem later when/if the "parallel apply" patch [1] is pushed and the stream bgworkers are doing stuff on the fly instead of spooling at the end?

Or are you expecting that the "parallel apply" feature should be disabled if there is any min_apply_delay parameter specified?
I didn't read the "parallel apply" patch yet.
Let's keep the alphabetical order of the parameters in COMPLETE_WITH, as per [2]
Fixed.
+	int64		subapplydelay;	/* Replication apply delay */
+

IMO the comment should mention the units "(ms)".
I'm not sure. It should be documented in the catalogs. It is important
information for a user-visible interface. There are a few places in the
documentation where the unit is mentioned.
There are some test cases for CREATE SUBSCRIPTION but there are no test cases for ALTER SUBSCRIPTION changing this new parameter.
I added a test to cover ALTER SUBSCRIPTION and also one for disabling a
subscription that has a delay set.
I received the following error when trying to run these 'subscription' tests:
Fixed.
Attachment
On Tue, Jul 5, 2022, at 9:29 AM, Amit Kapila wrote:
I wonder why we don't apply the delay on commit/commit_prepared records only, similar to physical replication. See recoveryApplyDelay. One more advantage would be that then we don't need to worry about transactions that we are going to skip due to the SKIP feature for subscribers.
I added an explanation at the top of apply_delay(). I didn't read the "parallel
apply" patch yet. I'll do it soon to understand how the current design for
streamed transactions conflicts with the parallel apply patch.
+ * It applies the delay for the next transaction but before starting the
+ * transaction. The main reason for this design is to avoid a long-running
+ * transaction (which can cause some operational challenges) if the user sets a
+ * high value for the delay. This design is different from the physical
+ * replication (that applies the delay at commit time) mainly because write
+ * operations may allow some issues (such as bloat and locks) that can be
+ * minimized if it does not keep the transaction open for such a long time.
+ */
+static void
+apply_delay(TimestampTz ts)
Regarding the skip transaction feature, we could certainly skip the
transactions combined with the apply delay. However, it introduces complexity
for a rare use case IMO. Besides that, the skip transaction code path is fast,
hence, it is very unlikely that the current patch will cause issues for
the skip transaction feature. (Remember that the main goal for this feature is
to provide an old state of the database.)
One more thing that might be worth discussing is whether introducing a new subscription parameter for this feature is a better idea or whether we can use a GUC (either an existing or a new one). Users may want to set this only for a particular subscription or set of subscriptions, in which case it is better to have this as a subscription-level parameter. OTOH, I was slightly worried that if this will be used for all subscriptions on a subscriber then it will be burdensome for users.
That's a good point. Logical replication is per database and it is slightly
different from physical replication that is per cluster. In physical
replication, you have no choice but to have a GUC. It is very unlikely that
someone wants to delay all logical replicas. Therefore, the benefit of having a
GUC is quite small.
On Mon, Aug 1, 2022 at 6:46 PM Euler Taveira <euler@eulerto.com> wrote: > > On Tue, Jul 5, 2022, at 9:29 AM, Amit Kapila wrote: > > I wonder why we don't apply the delay on commit/commit_prepared > records only similar to physical replication. See recoveryApplyDelay. > One more advantage would be then we don't need to worry about > transactions that we are going to skip due SKIP feature for > subscribers. > > I added an explanation at the top of apply_delay(). I didn't read the "parallel > apply" patch yet. I'll do soon to understand how the current design for > streamed transactions conflicts with the parallel apply patch. > > + * It applies the delay for the next transaction but before starting the > + * transaction. The main reason for this design is to avoid a long-running > + * transaction (which can cause some operational challenges) if the user sets a > + * high value for the delay. This design is different from the physical > + * replication (that applies the delay at commit time) mainly because write > + * operations may allow some issues (such as bloat and locks) that can be > + * minimized if it does not keep the transaction open for such a long time. > + */ Your explanation makes sense to me. The other point to consider is that there can be cases where we may not apply operation for the transaction because of empty transactions (we don't yet skip empty xacts for prepared transactions). So, won't it be better to apply the delay just before we apply the first change for a transaction? Do we want to apply the delay during table sync as we sometimes do need to enter apply phase while doing table sync? > > One more thing that might be worth discussing is whether introducing a > new subscription parameter for this feature is a better idea or can we > use guc (either an existing or a new one). Users may want to set this > only for a particular subscription or set of subscriptions in which > case it is better to have this as a subscription level parameter. > OTOH, I was slightly worried that if this will be used for all > subscriptions on a subscriber then it will be burdensome for users. > > That's a good point. Logical replication is per database and it is slightly > different from physical replication that is per cluster. In physical > replication, you have no choice but to have a GUC. It is very unlikely that > someone wants to delay all logical replicas. Therefore, the benefit of having a > GUC is quite small. > Fair enough. -- With Regards, Amit Kapila.
On Wed, Jul 13, 2022, at 2:34 PM, Melih Mutlu wrote:
[Sorry for the delay...]
22. src/test/subscription/t/032_apply_delay.pl

I received the following error when trying to run these 'subscription' tests:

t/032_apply_delay.pl ............... No such class log_location at
t/032_apply_delay.pl line 49, near "my log_location"
syntax error at t/032_apply_delay.pl line 49, near "my log_location ="

I'm having these errors too. Seems like some declarations are missing.
Fixed in v5.
+         specified amount of time. If this value is specified without units,
+         it is taken as milliseconds. The default is zero, adding no delay.
+        </para>

I'm also having an issue when I give the min_apply_delay parameter without units.
I expect that if I set min_apply_delay to 5000 (without any unit), it will be interpreted as 5000 ms.
Good catch. I fixed it in v5.
Lastly, I have a question about this delay during tablesync.
It's stated here that apply delays are not for initial tablesync:

<para>
+         The delay occurs only on WAL records for transaction begins and after
+         the initial table synchronization. It is possible that the
+         replication delay between publisher and subscriber exceeds the value
+         of this parameter, in which case no delay is added. Note that the
+         delay is calculated between the WAL time stamp as written on
+         publisher and the current time on the subscriber. Delays in logical
+         decoding and in transfer the transaction may reduce the actual wait
+         time. If the system clocks on publisher and subscriber are not
+         synchronized, this may lead to apply changes earlier than expected.
+         This is not a major issue because a typical setting of this parameter
+         are much larger than typical time deviations between servers.
+         </para>

There might be a case where tablesync workers are in SYNCWAIT state and waiting for the apply worker to tell them to CATCHUP.
And if the apply worker is waiting in the apply_delay function, tablesync workers will be stuck in SYNCWAIT state and this might delay tablesync at least "min_apply_delay" amount of time or more.
Is it something we would want? What do you think?
Good catch. That's an oversight. It should wait for the initial table
synchronization before starting to apply the delay. The main reason is the
current logical replication worker design. It only closes the tablesync workers
after the catchup phase. As you noticed, we cannot impose the delay as soon as
the COPY finishes because it would take a long time to finish due to a possible
lack of workers. Instead, let's wait for the READY state for all tables and
then apply the delay. I added an explanation for it.
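As an aside, whether all tables of a subscription have reached the READY state can be checked from the subscriber's catalog with ordinary SQL (this is not something added by the patch):

SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
WHERE srsubstate <> 'r';

An empty result means every table is in the READY ('r') state, which is the point where this version of the patch starts applying the delay.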
I also modified the test a bit to use the new function
wait_for_subscription_sync introduced in the commit
0c20dd33db1607d6a85ffce24238c1e55e384b49.
I attached a v6.
Attachment
On Wed, Aug 3, 2022, at 10:27 AM, Amit Kapila wrote:
Your explanation makes sense to me. The other point to consider is that there can be cases where we may not apply an operation for the transaction because of empty transactions (we don't yet skip empty xacts for prepared transactions). So, won't it be better to apply the delay just before we apply the first change for a transaction? Do we want to apply the delay during table sync, as we sometimes do need to enter the apply phase while doing table sync?
I thought about the empty transactions but decided to not complicate the code
mainly because skipping transactions is not a code path that will slow down
this feature. As explained in the documentation, there is no harm in delaying a
transaction for more than min_apply_delay; it just cannot be applied earlier.
Having said that, I decided to do nothing. I'm also not sure if it deserves a comment or if
this email is a possible explanation for this decision.
Regarding the table sync that was mentioned by Melih, I sent a new version (v6)
that fixes this oversight. The current logical replication worker design makes
it difficult to apply the delay in the catchup phase; tablesync workers are not
closed as soon as the COPY finishes (which means possibly running out of
workers sooner). After all tablesync workers have reached READY state, the
apply delay is activated. The documentation was correct; the code wasn't.
On Tuesday, August 9, 2022 6:47 AM Euler Taveira <euler@eulerto.com> wrote:
> I attached a v6.

Hi, thank you for posting the updated patch.

Minor review comments for v6.

(1) commit message

"If the subscriber sets min_apply_delay parameter, ..."

I suggest we use subscription rather than subscriber, because this parameter
refers to and is used for one subscription. My suggestion is
"If one subscription sets min_apply_delay parameter, ..."
In case you agree, there are other places to apply this change.

(2) commit message

It might be better to write a note for the committer like "Bump catalog
version" at the bottom of the commit message.

(3) unit alignment between recovery_min_apply_delay and min_apply_delay

The former interprets an input number as milliseconds in case of no units,
while the latter takes it as seconds without units.
I feel it would be better to make them aligned.

(4) catalogs.sgml

+       Delay the application of changes by a specified amount of time. The
+       unit is in milliseconds.

As a column explanation, it'd be better to use a noun in the first sentence to
make this description aligned with other places.
My suggestion is "Application delay of changes by ...".

(5) pg_subscription.c

There is one missing blank line before the if statement. It's written that way
in AlterSubscription for other cases.

@@ -1100,6 +1130,12 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				replaces[Anum_pg_subscription_subdisableonerr - 1] = true;
 			}
+			if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))

(6) tab-complete.c

The order of tab-complete parameters listed in COMPLETE_WITH should follow
alphabetical order. "min_apply_delay" can come before "origin".
We can refer to the d547f7c commit.

(7) 032_apply_delay.pl

There are missing whitespaces after commas in the mod functions.

UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test_tab WHERE mod(a,3) = 0;

Best Regards,
	Takamichi Osumi
On Wed, Aug 10, 2022, at 9:39 AM, osumi.takamichi@fujitsu.com wrote:
Minor review comments for v6.
Thanks for your review. I'm attaching v7.
"If the subscriber sets min_apply_delay parameter, ..."

I suggest we use subscription rather than subscriber, because this parameter refers to and is used for one subscription. My suggestion is "If one subscription sets min_apply_delay parameter, ..." In case you agree, there are other places to apply this change.
I changed the terminology to subscription. I also checked other "subscriber"
occurrences but I don't think it should be changed. Some of them are used as
publisher/subscriber pair. If you think there is another sentence to consider,
point it out.
It might be better to write a note for the committer like "Bump catalog version" at the bottom of the commit message.
It is a committer task to bump the catalog number. IMO it is easy to notice
(using a git hook?) that it must be bumped when we are modifying the catalog.
AFAICS there is no recommendation to add such a warning.
The former interprets an input number as milliseconds in case of no units, while the latter takes it as seconds without units. I feel it would be better to make them aligned.
In a previous version I decided not to add code to attach a unit when there
isn't one. Instead, I changed the documentation to reflect what interval_in
uses (seconds as the unit). On reflection, let's use ms as the default unit if
the user doesn't specify one.
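For example, with ms as the default unit, both of the following would request the same 5-second delay (the subscription name is hypothetical; this reflects the intended behavior, not something already committed):

ALTER SUBSCRIPTION mysub SET (min_apply_delay = 5000);
ALTER SUBSCRIPTION mysub SET (min_apply_delay = '5s');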
I fixed all the other suggestions too.
Attachment
On Tue, Aug 9, 2022 at 3:52 AM Euler Taveira <euler@eulerto.com> wrote:
>
> I thought about the empty transactions but decided to not complicate the code
> mainly because skipping transactions is not a code path that will slow down
> this feature. As explained in the documentation, there is no harm in delaying
> a transaction for more than min_apply_delay; it cannot apply earlier. Having
> said that I decided to do nothing. I'm also not sure if it deserves a comment
> or if this email is a possible explanation for this decision.
>

I don't know what makes you think it will complicate the code. But anyway,
thinking further about the way apply_delay is used at various places in the
patch, as pointed out by Peter Smith it seems it won't work for the parallel
apply feature where we start applying the transaction immediately after stream
start. I was wondering why we don't apply the delay after each commit of the
transaction rather than at the begin command. We can remember whether the
transaction has made any change and, if so, apply the delay after commit. If we
can do that then it will alleviate the concern about empty and skipped xacts as
well.

Another thing I was wondering is how to determine a good delay time for tests.
I found that the current tests in replay_delay.pl use 3s, so should we use the
same for the apply delay tests in this patch as well?

--
With Regards,
Amit Kapila.
Dear Euler,

Thank you for making the patch! I'm also interested in the patch so I want to
join the thread. While testing your patch, I noticed that 032_apply_delay.pl
failed. PSA the logs generated on my machine. This failure is the same as the
one reported by cfbot [1]. It seemed that the apply worker could not exit and
starts WaitLatch() again even if the subscription had been disabled. The
following lines are cited from the attached log.

```
...
2022-09-14 09:44:30.489 UTC [14880] 032_apply_delay.pl LOG:  statement: ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)
2022-09-14 09:44:30.525 UTC [14777] DEBUG:  sending feedback (force 0) to recv 0/1690220, write 0/1690220, flush 0/1690220
2022-09-14 09:44:30.526 UTC [14759] DEBUG:  server process (PID 14878) exited with exit code 0
2022-09-14 09:44:30.535 UTC [14777] DEBUG:  logical replication apply delay: 86460000 ms
2022-09-14 09:44:30.535 UTC [14777] CONTEXT:  processing remote data for replication origin "pg_16393" during "BEGIN" in transaction 734 finished at 0/16902A8
2022-09-14 09:44:30.576 UTC [14759] DEBUG:  forked new backend, pid=14884 socket=6
2022-09-14 09:44:30.578 UTC [14759] DEBUG:  server process (PID 14880) exited with exit code 0
2022-09-14 09:44:30.583 UTC [14884] 032_apply_delay.pl LOG:  statement: ALTER SUBSCRIPTION tap_sub DISABLE
2022-09-14 09:44:30.589 UTC [14777] DEBUG:  logical replication apply delay: 86459945 ms
2022-09-14 09:44:30.589 UTC [14777] CONTEXT:  processing remote data for replication origin "pg_16393" during "BEGIN" in transaction 734 finished at 0/16902A8
2022-09-14 09:44:30.608 UTC [14759] DEBUG:  forked new backend, pid=14886 socket=6
2022-09-14 09:44:30.632 UTC [14886] 032_apply_delay.pl LOG:  statement: SELECT count(1) = 0 FROM pg_stat_subscription WHERE subname = 'tap_sub' AND pid IS NOT NULL;
2022-09-14 09:44:30.665 UTC [14759] DEBUG:  server process (PID 14884) exited with exit code 0
...
```

I think this may be caused because the delayed worker does not read the
modified catalog even if ALTER SUBSCRIPTION ... DISABLE is called. I also
attached a fix patch that can be applied on top of yours. It seems OK on my
environment.

[1]: https://cirrus-ci.com/task/4888001967816704

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Hi,

Sorry for the noise, but I found another bug. When 032_apply_delay.pl is
modified as follows, the test always fails even if my patch is applied.

```
 # Disable subscription. worker should die immediately.
-$node_subscriber->safe_psql('postgres',
-	"ALTER SUBSCRIPTION tap_sub DISABLE"
+$node_subscriber->safe_psql('postgres', q{
+BEGIN;
+ALTER SUBSCRIPTION tap_sub DISABLE;
+SELECT pg_sleep(1);
+COMMIT;
+}
 );
```

The point of failure is the same as I reported previously.

```
...
2022-09-14 12:00:48.891 UTC [11330] 032_apply_delay.pl LOG:  statement: ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)
2022-09-14 12:00:48.910 UTC [11226] DEBUG:  sending feedback (force 0) to recv 0/1690220, write 0/1690220, flush 0/1690220
2022-09-14 12:00:48.937 UTC [11208] DEBUG:  server process (PID 11328) exited with exit code 0
2022-09-14 12:00:48.950 UTC [11226] DEBUG:  logical replication apply delay: 86459996 ms
2022-09-14 12:00:48.950 UTC [11226] CONTEXT:  processing remote data for replication origin "pg_16393" during "BEGIN" in transaction 734 finished at 0/16902A8
2022-09-14 12:00:48.979 UTC [11208] DEBUG:  forked new backend, pid=11334 socket=6
2022-09-14 12:00:49.007 UTC [11334] 032_apply_delay.pl LOG:  statement: BEGIN;
2022-09-14 12:00:49.008 UTC [11334] 032_apply_delay.pl LOG:  statement: ALTER SUBSCRIPTION tap_sub DISABLE;
2022-09-14 12:00:49.009 UTC [11334] 032_apply_delay.pl LOG:  statement: SELECT pg_sleep(1);
2022-09-14 12:00:49.009 UTC [11226] DEBUG:  check status of MySubscription
2022-09-14 12:00:49.009 UTC [11226] CONTEXT:  processing remote data for replication origin "pg_16393" during "BEGIN" in transaction 734 finished at 0/16902A8
2022-09-14 12:00:49.009 UTC [11226] DEBUG:  logical replication apply delay: 86459937 ms
2022-09-14 12:00:49.009 UTC [11226] CONTEXT:  processing remote data for replication origin "pg_16393" during "BEGIN" in transaction 734 finished at 0/16902A8
...
```

I think it may be caused by the woken worker reading catalogs that have not
been modified yet. In AlterSubscription(), the backend kicks the apply worker
ASAP, but it should be done at the end of the transaction, like
ApplyLauncherWakeupAtCommit() and AtEOXact_ApplyLauncher().

```
+			/*
+			 * If this subscription has been disabled and it has an apply
+			 * delay set, wake up the logical replication worker to finish
+			 * it as soon as possible.
+			 */
+			if (!opts.enabled && sub->applydelay > 0)
+				logicalrep_worker_wakeup(sub->oid, InvalidOid);
+
```

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Euler, a long time ago you ask me a few questions about my previous review [1]. Here are my replies, plus a few other review comments for patch v7-0001. ====== 1. doc/src/sgml/catalogs.sgml + <para> + Application delay of changes by a specified amount of time. The + unit is in milliseconds. + </para></entry> The wording sounds a bit strange still. How about below SUGGESTION The length of time (ms) to delay the application of changes. ======= 2. Other documentation? Maybe should say something on the Logical Replication Subscription page about this? (31.2 Subscription) ======= 3. doc/src/sgml/ref/create_subscription.sgml + synchronized, this may lead to apply changes earlier than expected. + This is not a major issue because a typical setting of this parameter + are much larger than typical time deviations between servers. Wording? SUGGESTION ... than expected, but this is not a major issue because this parameter is typically much larger than the time deviations between servers. ~~~ 4. Q/A From [2] you asked: > Should there also be a big warning box about the impact if using > synchronous_commit (like the other streaming replication page has this > warning)? Impact? Could you elaborate? ~ I noticed the streaming replication docs for recovery_min_apply_delay has a big red warning box saying that setting this GUC may block the synchronous commits. So I was saying won’t a similar big red warning be needed also for this min_apply_delay parameter if the delay is used in conjunction with a publisher wanting synchronous commit because it might block everything? ~~~ 4. Example +<programlisting> +CREATE SUBSCRIPTION foo + CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb' + PUBLICATION baz + WITH (copy_data = false, min_apply_delay = '4h'); +</programlisting></para> If the example named the subscription/publication as ‘mysub’ and ‘mypub’ I think it would be more consistent with the existing examples. ====== 5. src/backend/commands/subscriptioncmds.c - SubOpts @@ -89,6 +91,7 @@ typedef struct SubOpts bool disableonerr; char *origin; XLogRecPtr lsn; + int64 min_apply_delay; } SubOpts; I feel it would be better to be explicit about the storage units. So call this member ‘min_apply_delay_ms’. E.g. then other code in parse_subscription_options will be more natural when you are converting using and assigning them to this member. ~~~ 6. - parse_subscription_options + /* + * If there is no unit, interval_in takes second as unit. This + * parameter expects millisecond as unit so add a unit (ms) if + * there isn't one. + */ The comment feels awkward. How about below SUGGESTION If no unit was specified, then explicitly add 'ms' otherwise the interval_in function would assume 'seconds' ~~~ 7. - parse_subscription_options (This is a repeat of [1] review comment #12) + if (opts->min_apply_delay < 0 && IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY)) + ereport(ERROR, + errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("option \"%s\" must not be negative", "min_apply_delay")); Why is this code here instead of inside the previous code block where the min_apply_delay was assigned in the first place? ====== 8. src/backend/replication/logical/worker.c - apply_delay + * When min_apply_delay parameter is set on subscriber, we wait long enough to + * make sure a transaction is applied at least that interval behind the + * publisher. "on subscriber" -> "on the subscription" ~~~ 9. + * Apply delay only after all tablesync workers have reached READY state. 
A + * tablesync worker are kept until it reaches READY state. If we allow the Wording ?? "A tablesync worker are kept until it reaches READY state." ?? ~~~ 10. 10a. + /* nothing to do if no delay set */ Uppercase comment /* Nothing to do if no delay set */ ~ 10b. + /* set apply delay */ Uppercase comment /* Set apply delay */ ~~~ 11. - apply_handle_stream_prepare / apply_handle_stream_commit The previous concern about incompatibility with the "Parallel Apply" work (see [1] review comments #17, #18) is still a pending issue, isn't it? ====== 12. src/backend/utils/adt/timestamp.c interval_to_ms +/* + * Given an Interval returns the number of milliseconds. + */ +int64 +interval_to_ms(const Interval *interval) SUGGESTION Returns the number of milliseconds in the specified Interval. ~~~ 13. + /* adds portion time (in ms) to the previous result. */ Uppercase comment /* Adds portion time (in ms) to the previous result. * ====== 14. src/bin/pg_dump/pg_dump.c - getSubscriptions + { + appendPQExpBufferStr(query, " s.suborigin,\n"); + appendPQExpBufferStr(query, " s.subapplydelay\n"); + } This could be done using just a single appendPQExpBufferStr if you want to have 1 call instead of 2. ====== 15. src/bin/psql/describe.c - describeSubscriptions + /* origin and min_apply_delay are only supported in v16 and higher */ Uppercase comment /* Origin and min_apply_delay are only supported in v16 and higher */ ====== 16. src/include/catalog/pg_subscription.h + int64 subapplydelay; /* Replication apply delay */ + Consider renaming this as 'subapplydelayms' to make the units perfectly clear. ====== 17. src/test/regress/sql/subscription.sql Is [1] review comment 21 (There are some test cases for CREATE SUBSCRIPTION but there are no test cases for ALTER SUBSCRIPTION changing this new parameter.) still a pending item? ------ [1] My v4 review - https://www.postgresql.org/message-id/CAHut%2BPvugkna7avUQLydg602hymc8qMp%3DCRT2ZCTGbi8Bkfv%2BA%40mail.gmail.com [2] Euler's reply to my v4 review - https://www.postgresql.org/message-id/acfc1946-a73e-4e9d-86b3-b19cba225a41%40www.fastmail.com Kind Regards, Peter Smith. Fujitsu Australia
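For review comment 14 above, collapsing the two calls relies on C string-literal concatenation; a sketch of the suggested shape (not the committed code) would be:
```
/* One appendPQExpBufferStr() call: adjacent string literals are concatenated
 * by the compiler, so the generated SQL stays the same. */
appendPQExpBufferStr(query,
					 " s.suborigin,\n"
					 " s.subapplydelay\n");
```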
Dear Euler, Do you have enough time to handle this issue? Our discussion has been suspended for two months... If you cannot allocate time to this problem because of other important tasks or events, we would like to take over the thread and modify your patch. We plan to start addressing the comments and reported bugs if you have not responded by the end of this week. I look forward to hearing from you. Best Regards, Hayato Kuroda FUJITSU LIMITED
At Wed, 10 Aug 2022 17:33:00 -0300, "Euler Taveira" <euler@eulerto.com> wrote in > On Wed, Aug 10, 2022, at 9:39 AM, osumi.takamichi@fujitsu.com wrote: > > Minor review comments for v6. > Thanks for your review. I'm attaching v7. Using an interval is not standard for this kind of parameter, but it seems convenient. On the other hand, it's not great that the month unit introduces some subtle ambiguity. This patch translates a month to 30 days, but I'm not sure that is the right thing to do. Perhaps we shouldn't allow units larger than days. apply_delay() blocks the message-receiving path, so even a not-so-long delay can cause a replication timeout to fire. I think we should process walsender pings even while delaying. Having to make the replication timeout longer than the apply delay is not great, I think. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
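To see where the month ambiguity comes from: a conversion along the lines of the patch's interval_to_ms() has to pick a fixed month length. A rough sketch (an assumption, not the patch's exact code) follows; DAYS_PER_MONTH and SECS_PER_DAY are the existing constants from datatype/timestamp.h.
```
#include "postgres.h"
#include "datatype/timestamp.h"

/* Sketch: convert an Interval to milliseconds.  The month unit is where the
 * ambiguity lies, because DAYS_PER_MONTH is a fixed 30 regardless of the
 * actual calendar month. */
static int64
interval_to_ms_sketch(const Interval *interval)
{
	int64		days = (int64) interval->month * DAYS_PER_MONTH + interval->day;

	return days * SECS_PER_DAY * INT64CONST(1000) + /* day/month portion */
		interval->time / 1000;	/* time portion is stored in microseconds */
}
```
Disallowing units above days would simply sidestep the month contribution in the first term.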
On Thu, 11 Aug 2022 at 02:03, Euler Taveira <euler@eulerto.com> wrote: > > On Wed, Aug 10, 2022, at 9:39 AM, osumi.takamichi@fujitsu.com wrote: > > Minor review comments for v6. > > Thanks for your review. I'm attaching v7. > > "If the subscriber sets min_apply_delay parameter, ..." > > I suggest we use subscription rather than subscriber, because > this parameter refers to and is used for one subscription. > My suggestion is > "If one subscription sets min_apply_delay parameter, ..." > In case if you agree, there are other places to apply this change. > > I changed the terminology to subscription. I also checked other "subscriber" > occurrences but I don't think it should be changed. Some of them are used as > publisher/subscriber pair. If you think there is another sentence to consider, > point it out. > > It might be better to write a note for committer > like "Bump catalog version" at the bottom of the commit message. > > It is a committer task to bump the catalog number. IMO it is easy to notice > (using a git hook?) that it must bump it when we are modifying the catalog. > AFAICS there is no recommendation to add such a warning. > > The former interprets input number as milliseconds in case of no units, > while the latter takes it as seconds without units. > I feel it would be better to make them aligned. > > In a previous version I decided not to add a code to attach a unit when there > isn't one. Instead, I changed the documentation to reflect what interval_in > uses (seconds as unit). Under reflection, let's use ms as default unit if the > user doesn't specify one. > > I fixed all the other suggestions too. Few comments: 1) I feel if the user has specified a long delay there is a chance that replication may not continue if the replication slot falls behind the current LSN by more than max_slot_wal_keep_size. I feel we should add this reference in the documentation of min_apply_delay as the replication will not continue in this case. 2) I also noticed that if we have to shut down the publisher server with a long min_apply_delay configuration, the publisher server cannot be stopped as the walsender waits for the data to be replicated. Is this behavior ok for the server to wait in this case? If this behavior is ok, we could add a log message as it is not very evident from the log files why the server could not be shut down. Regards, Vignesh
On Tuesday, November 8, 2022 2:27 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > If you could not allocate a time to discuss this problem because of other > important tasks or events, we would like to take over the thread and modify > your patch. > > We've planned that we will start to address comments and reported bugs if > you would not respond by the end of this week. Hi, I've simply rebased the patch to make it applicable on top of HEAD and make the tests pass. Note there are still open pending comments and I'm going to start to address those. I've written Euler as the original author in the commit message to note his credit. Best Regards, Takamichi Osumi
Hi, The thread title doesn't really convey the topic under discussion, so changed it. IIRC, this has been mentioned by others as well in the thread. On Sat, Nov 12, 2022 at 7:21 PM vignesh C <vignesh21@gmail.com> wrote: > > Few comments: > 1) I feel if the user has specified a long delay there is a chance > that replication may not continue if the replication slot falls behind > the current LSN by more than max_slot_wal_keep_size. I feel we should > add this reference in the documentation of min_apply_delay as the > replication will not continue in this case. > This makes sense to me. > 2) I also noticed that if we have to shut down the publisher server > with a long min_apply_delay configuration, the publisher server cannot > be stopped as the walsender waits for the data to be replicated. Is > this behavior ok for the server to wait in this case? If this behavior > is ok, we could add a log message as it is not very evident from the > log files why the server could not be shut down. > I think for this case, the behavior should be the same as for physical replication. Can you please check what is behavior for the case you are worried about in physical replication? Note, we already have a similar parameter for recovery_min_apply_delay for physical replication. -- With Regards, Amit Kapila.
On Mon, Nov 14, 2022 at 12:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Nov 12, 2022 at 7:21 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Few comments: > > 1) I feel if the user has specified a long delay there is a chance > > that replication may not continue if the replication slot falls behind > > the current LSN by more than max_slot_wal_keep_size. I feel we should > > add this reference in the documentation of min_apply_delay as the > > replication will not continue in this case. > > > > This makes sense to me. > > > 2) I also noticed that if we have to shut down the publisher server > > with a long min_apply_delay configuration, the publisher server cannot > > be stopped as the walsender waits for the data to be replicated. Is > > this behavior ok for the server to wait in this case? If this behavior > > is ok, we could add a log message as it is not very evident from the > > log files why the server could not be shut down. > > > > I think for this case, the behavior should be the same as for physical > replication. Can you please check what is behavior for the case you > are worried about in physical replication? Note, we already have a > similar parameter for recovery_min_apply_delay for physical > replication. > I don't understand the reason for the below change in the patch: + /* + * If this subscription has been disabled and it has an apply + * delay set, wake up the logical replication worker to finish + * it as soon as possible. + */ + if (!opts.enabled && sub->applydelay > 0) + logicalrep_worker_wakeup(sub->oid, InvalidOid); + It seems to me Kuroda-San has proposed this change [1] to fix the test but it is not clear to me why such a change is required. Why can't CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below code [2] in LogicalRepApplyLoop() sufficient to handle parameter updates? [2] if (!in_remote_transaction && !in_streamed_transaction) { /* * If we didn't get any transactions for a while there might be * unconsumed invalidation messages in the queue, consume them * now. */ AcceptInvalidationMessages(); maybe_reread_subscription(); ... [1] - https://www.postgresql.org/message-id/TYAPR01MB5866F9716A18DA0C68A2CDB3F5469%40TYAPR01MB5866.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Dear Amit, > I don't understand the reason for the below change in the patch: > > + /* > + * If this subscription has been disabled and it has an apply > + * delay set, wake up the logical replication worker to finish > + * it as soon as possible. > + */ > + if (!opts.enabled && sub->applydelay > 0) > + logicalrep_worker_wakeup(sub->oid, InvalidOid); > + > > It seems to me Kuroda-San has proposed this change [1] to fix the test > but it is not clear to me why such a change is required. Why can't > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below > code [2] in LogicalRepApplyLoop() sufficient to handle parameter > updates? > > [2] > if (!in_remote_transaction && !in_streamed_transaction) > { > /* > * If we didn't get any transactions for a while there might be > * unconsumed invalidation messages in the queue, consume them > * now. > */ > AcceptInvalidationMessages(); > maybe_reread_subscription(); > ... I was referring to the case of a long min_apply_delay configuration. The worker will exit normally once apply_delay() has finished and it can reach LogicalRepApplyLoop(). That works well if the delay is short and the worker can wake up immediately. But if the worker has a long min_apply_delay, it cannot leave the while loop, so the worker process remains for a long time. The test code expects the worker to die immediately, and we have a test case that tries to kill the worker with min_apply_delay = 1 day. Also note that the launcher process will not set a latch or send a SIGTERM even if the subscription is altered to enabled=f. In the launcher's main loop, it reads pg_subscription periodically but does not take parameter changes into account; it just skips disabled subscriptions. If this situation can be ignored, we may be able to remove those lines. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Nov 9, 2022 at 12:11 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 10 Aug 2022 17:33:00 -0300, "Euler Taveira" <euler@eulerto.com> wrote in > > On Wed, Aug 10, 2022, at 9:39 AM, osumi.takamichi@fujitsu.com wrote: > > > Minor review comments for v6. > > Thanks for your review. I'm attaching v7. > > Using interval is not standard as this kind of parameters but it seems > convenient. On the other hand, it's not great that the unit month > introduces some subtle ambiguity. This patch translates a month to 30 > days but I'm not sure it's the right thing to do. Perhaps we shouldn't > allow the units upper than days. > Agreed. Doesn't the same thing already apply to recovery_min_apply_delay, for which the maximum unit seems to be days? If so, is there any reason to do something different here? > apply_delay() chokes the message-receiving path so that a not-so-long > delay can cause a replication timeout to fire. I think we should > process walsender pings even while delaying. Needing to make > replication timeout longer than apply delay is not great, I think. > Again, I think for this case also the behavior should be similar to how we handle recovery_min_apply_delay. -- With Regards, Amit Kapila.
On Mon, Nov 14, 2022 at 2:28 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > I don't understand the reason for the below change in the patch: > > > > + /* > > + * If this subscription has been disabled and it has an apply > > + * delay set, wake up the logical replication worker to finish > > + * it as soon as possible. > > + */ > > + if (!opts.enabled && sub->applydelay > 0) > > + logicalrep_worker_wakeup(sub->oid, InvalidOid); > > + > > > > It seems to me Kuroda-San has proposed this change [1] to fix the test > > but it is not clear to me why such a change is required. Why can't > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter > > updates? > > > > [2] > > if (!in_remote_transaction && !in_streamed_transaction) > > { > > /* > > * If we didn't get any transactions for a while there might be > > * unconsumed invalidation messages in the queue, consume them > > * now. > > */ > > AcceptInvalidationMessages(); > > maybe_reread_subscription(); > > ... > > I mentioned the case with a long min_apply_delay configuration. > > The worker will exit normally if apply_delay() has been ended and then it can reach > LogicalRepApplyLoop(). It works well if the delay is short and workers can wake up > immediately. But if workers have long min_apply_delay, they cannot go out the > while-loop, so worker processes remain for a long time. According to test code, > it is determined that worker should die immediately and we have a > test-case that we try to kill the worker with min_apply_delay = 1 day. > So, why only honor the 'disable' option of the subscription? For example, one can change 'min_apply_delay' and it seems recoveryApplyDelay() honors a similar change in the recovery parameter. Is there a way to set the latch of the worker process, so that it can recheck if anything is changed? -- With Regards, Amit Kapila.
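For reference, a mechanism to set the worker's latch already exists: logicalrep_worker_wakeup() does exactly that. Paraphrased from memory (launcher.c; not verbatim), it is roughly:
```
/* Rough paraphrase of launcher.c: wake the apply worker of a subscription by
 * setting its process latch. */
void
logicalrep_worker_wakeup(Oid subid, Oid relid)
{
	LogicalRepWorker *worker;

	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
	worker = logicalrep_worker_find(subid, relid, true);
	if (worker)
		logicalrep_worker_wakeup_ptr(worker);	/* SetLatch on the worker's procLatch */
	LWLockRelease(LogicalRepWorkerLock);
}
```
The open question is therefore on the worker side: the delay wait has to wake on the latch and reread pg_subscription (along the lines of maybe_reread_subscription()) so that a changed min_apply_delay or a disabled subscription is noticed.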
Dear Amit, > > > It seems to me Kuroda-San has proposed this change [1] to fix the test > > > but it is not clear to me why such a change is required. Why can't > > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below > > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter > > > updates? (I forgot to say, this change was not proposed by me. I said that there should be modified. I thought workers should wake up after the transaction was committed.) > So, why only honor the 'disable' option of the subscription? For > example, one can change 'min_apply_delay' and it seems > recoveryApplyDelay() honors a similar change in the recovery > parameter. Is there a way to set the latch of the worker process, so > that it can recheck if anything is changed? I have not considered about it, but seems reasonable. We may be able to do maybe_reread_subscription() if subscription parameters are changed and latch is set. Currently, IIUC we try to disable subscription regardless of the state, but should we avoid to reread catalog if workers are handling the transactions, like LogicalRepApplyLoop()? Best Regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Nov 14, 2022 at 6:52 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > > > > It seems to me Kuroda-San has proposed this change [1] to fix the test > > > > but it is not clear to me why such a change is required. Why can't > > > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below > > > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter > > > > updates? > > (I forgot to say, this change was not proposed by me. I said that there should be > modified. I thought workers should wake up after the transaction was committed.) > > > So, why only honor the 'disable' option of the subscription? For > > example, one can change 'min_apply_delay' and it seems > > recoveryApplyDelay() honors a similar change in the recovery > > parameter. Is there a way to set the latch of the worker process, so > > that it can recheck if anything is changed? > > I have not considered about it, but seems reasonable. We may be able to > do maybe_reread_subscription() if subscription parameters are changed > and latch is set. > One more thing I would like you to consider is the point raised by me related to this patch's interaction with the parallel apply feature as mentioned in the email [1]. I am not sure the idea proposed in that email [1] is a good one because delaying after applying commit may not be good as we want to delay the apply of the transaction(s) on subscribers by this feature. I feel this needs more thought. > Currently, IIUC we try to disable subscription regardless of the state, but > should we avoid to reread catalog if workers are handling the transactions, > like LogicalRepApplyLoop()? > IIUC, here you are referring to reading catalogs again via the function maybe_reread_subscription(), right? If so, I think the idea is to not invoke it frequently to avoid increasing transaction apply time. However, when you are anyway going to wait for a delay, it may not matter. I feel it would be better to add some comments saying that we don't want workers to wait for a long time if users have disabled the subscription or reduced the apply_delay time. [1] - https://www.postgresql.org/message-id/CAA4eK1JRs0v9Z65HWKEZg3quWx4LiQ%3DpddTJZ_P1koXsbR3TMA%40mail.gmail.com -- With Regards, Amit Kapila.
2022年11月14日(月) 10:09 Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com>: > > On Tuesday, November 8, 2022 2:27 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > If you could not allocate a time to discuss this problem because of other > > important tasks or events, we would like to take over the thread and modify > > your patch. > > > > We've planned that we will start to address comments and reported bugs if > > you would not respond by the end of this week. > Hi, > > > I've simply rebased the patch to make it applicable on top of HEAD > and make the tests pass. Note there are still open pending comments > and I'm going to start to address those. > > I've written Euler as the original author in the commit message > to note his credit. Hi Thanks for the updated patch. While reviewing the patch backlog, we have determined that this patch adds one or more TAP tests but has not added the test to the "meson.build" file. To do this, locate the relevant "meson.build" file for each test and add it in the 'tests' dictionary, which will look something like this: 'tap': { 'tests': [ 't/001_basic.pl', ], }, For some additional details please see this Wiki article: https://wiki.postgresql.org/wiki/Meson_for_patch_authors For more information on the meson build system for PostgreSQL see: https://wiki.postgresql.org/wiki/Meson Regards Ian Barwick
On Mon, 14 Nov 2022 at 12:14, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Hi, > > The thread title doesn't really convey the topic under discussion, so > changed it. IIRC, this has been mentioned by others as well in the > thread. > > On Sat, Nov 12, 2022 at 7:21 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Few comments: > > 1) I feel if the user has specified a long delay there is a chance > > that replication may not continue if the replication slot falls behind > > the current LSN by more than max_slot_wal_keep_size. I feel we should > > add this reference in the documentation of min_apply_delay as the > > replication will not continue in this case. > > > > This makes sense to me. > > > 2) I also noticed that if we have to shut down the publisher server > > with a long min_apply_delay configuration, the publisher server cannot > > be stopped as the walsender waits for the data to be replicated. Is > > this behavior ok for the server to wait in this case? If this behavior > > is ok, we could add a log message as it is not very evident from the > > log files why the server could not be shut down. > > > > I think for this case, the behavior should be the same as for physical > replication. Can you please check what is behavior for the case you > are worried about in physical replication? Note, we already have a > similar parameter for recovery_min_apply_delay for physical > replication. In the case of physical replication by setting recovery_min_apply_delay, I noticed that both primary and standby nodes were getting stopped successfully immediately after the stop server command. In case of logical replication, stop server fails: pg_ctl -D publisher -l publisher.log stop -c waiting for server to shut down............................................................... failed pg_ctl: server does not shut down In case of logical replication, the server does not get stopped because the walsender process is not able to exit: ps ux | grep walsender vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08 postgres: walsender vignesh [local] START_REPLICATION Regards, Vignesh
On Wednesday, October 5, 2022 6:42 PM Peter Smith <smithpb2250@gmail.com> wrote: > Hi Euler, a long time ago you ask me a few questions about my previous review > [1]. > > Here are my replies, plus a few other review comments for patch v7-0001. Hi, thank you for your comments. > ====== > > 1. doc/src/sgml/catalogs.sgml > > + <para> > + Application delay of changes by a specified amount of time. The > + unit is in milliseconds. > + </para></entry> > > The wording sounds a bit strange still. How about below > > SUGGESTION > The length of time (ms) to delay the application of changes. Fixed. > ======= > > 2. Other documentation? > > Maybe should say something on the Logical Replication Subscription page > about this? (31.2 Subscription) Added. > ======= > > 3. doc/src/sgml/ref/create_subscription.sgml > > + synchronized, this may lead to apply changes earlier than expected. > + This is not a major issue because a typical setting of this parameter > + are much larger than typical time deviations between servers. > > Wording? > > SUGGESTION > ... than expected, but this is not a major issue because this parameter is > typically much larger than the time deviations between servers. Fixed. > ~~~ > > 4. Q/A > > From [2] you asked: > > > Should there also be a big warning box about the impact if using > > synchronous_commit (like the other streaming replication page has this > > warning)? > Impact? Could you elaborate? > > ~ > > I noticed the streaming replication docs for recovery_min_apply_delay has a big > red warning box saying that setting this GUC may block the synchronous > commits. So I was saying won’t a similar big red warning be needed also for > this min_apply_delay parameter if the delay is used in conjunction with a > publisher wanting synchronous commit because it might block everything? I agree with you. Fixed. > ~~~ > > 4. Example > > +<programlisting> > +CREATE SUBSCRIPTION foo > + CONNECTION 'host=192.168.1.50 port=5432 user=foo > dbname=foodb' > + PUBLICATION baz > + WITH (copy_data = false, min_apply_delay = '4h'); > +</programlisting></para> > > If the example named the subscription/publication as ‘mysub’ and ‘mypub’ I > think it would be more consistent with the existing examples. Fixed. > ====== > > 5. src/backend/commands/subscriptioncmds.c - SubOpts > > @@ -89,6 +91,7 @@ typedef struct SubOpts > bool disableonerr; > char *origin; > XLogRecPtr lsn; > + int64 min_apply_delay; > } SubOpts; > > I feel it would be better to be explicit about the storage units. So call this > member ‘min_apply_delay_ms’. E.g. then other code in > parse_subscription_options will be more natural when you are converting using > and assigning them to this member. I don't think we use such names including units explicitly. Could you please tell me a similar example for this ? > ~~~ > > 6. - parse_subscription_options > > + /* > + * If there is no unit, interval_in takes second as unit. This > + * parameter expects millisecond as unit so add a unit (ms) if > + * there isn't one. > + */ > > The comment feels awkward. How about below > > SUGGESTION > If no unit was specified, then explicitly add 'ms' otherwise the interval_in > function would assume 'seconds' Fixed. > ~~~ > > 7. 
- parse_subscription_options > > (This is a repeat of [1] review comment #12) > > + if (opts->min_apply_delay < 0 && IsSet(supported_opts, > SUBOPT_MIN_APPLY_DELAY)) > + ereport(ERROR, > + errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > + errmsg("option \"%s\" must not be negative", "min_apply_delay")); > > Why is this code here instead of inside the previous code block where the > min_apply_delay was assigned in the first place? Changed. > ====== > > 8. src/backend/replication/logical/worker.c - apply_delay > > + * When min_apply_delay parameter is set on subscriber, we wait long > + enough to > + * make sure a transaction is applied at least that interval behind the > + * publisher. > > "on subscriber" -> "on the subscription" Fixed. > ~~~ > > 9. > > + * Apply delay only after all tablesync workers have reached READY > + state. A > + * tablesync worker are kept until it reaches READY state. If we allow > + the > > > Wording ?? > > "A tablesync worker are kept until it reaches READY state." ?? I removed the sentence. > ~~~ > > 10. > > 10a. > + /* nothing to do if no delay set */ > > Uppercase comment > /* Nothing to do if no delay set */ > > ~ > > 10b. > + /* set apply delay */ > > Uppercase comment > /* Set apply delay */ Both are fixed. > ~~~ > > 11. - apply_handle_stream_prepare / apply_handle_stream_commit > > The previous concern about incompatibility with the "Parallel Apply" > work (see [1] review comments #17, #18) is still a pending issue, isn't it? Yes, I think so. Kindly have a look at [1]. > ====== > > 12. src/backend/utils/adt/timestamp.c interval_to_ms > > +/* > + * Given an Interval returns the number of milliseconds. > + */ > +int64 > +interval_to_ms(const Interval *interval) > > SUGGESTION > Returns the number of milliseconds in the specified Interval. Fixed. > ~~~ > > 13. > > > + /* adds portion time (in ms) to the previous result. */ > > Uppercase comment > /* Adds portion time (in ms) to the previous result. * Fixed. > ====== > > 14. src/bin/pg_dump/pg_dump.c - getSubscriptions > > + { > + appendPQExpBufferStr(query, " s.suborigin,\n"); > + appendPQExpBufferStr(query, " s.subapplydelay\n"); } > > This could be done using just a single appendPQExpBufferStr if you want to > have 1 call instead of 2. Made them together. > ====== > > 15. src/bin/psql/describe.c - describeSubscriptions > > + /* origin and min_apply_delay are only supported in v16 and higher */ > > Uppercase comment > /* Origin and min_apply_delay are only supported in v16 and higher */ Fixed. > ====== > > 16. src/include/catalog/pg_subscription.h > > + int64 subapplydelay; /* Replication apply delay */ > + > > Consider renaming this as 'subapplydelayms' to make the units perfectly clear. Similar to the 5th comments, I can't find any examples for this. I'd like to keep it general, which makes me feel it is more aligned with existing codes. > ====== > > 17. src/test/regress/sql/subscription.sql > > Is [1] review comment 21 (There are some test cases for CREATE > SUBSCRIPTION but there are no test cases for ALTER SUBSCRIPTION > changing this new parameter.) still a pending item? Added one test case for alter subscription. Also, I removed the function of logicalrep_worker_wakeup() that was trigged by AlterSubscription only when disabling the subscription. This is achieved and replaced by another patch proposed in [2] in a general manner. There are still some pending comments for this patch, but I'll share the current patch once. 
Lastly, thank you so much, Kuroda-san, for giving me so much advice and so many suggestions for modifying this patch. [1] - https://www.postgresql.org/message-id/CAA4eK1JJFpgqE0ehAb7C9YFkJ-Xe-W1ZUPZptEfYjNJM4G-sLA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/20221122004119.GA132961%40nathanxps13 Best Regards, Takamichi Osumi
On Wednesday, November 16, 2022 12:58 PM Ian Lawrence Barwick <barwick@gmail.com> wrote: > 2022年11月14日(月) 10:09 Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com>: > > I've simply rebased the patch to make it applicable on top of HEAD and > > make the tests pass. Note there are still open pending comments and > > I'm going to start to address those. > Thanks for the updated patch. > > While reviewing the patch backlog, we have determined that this patch adds > one or more TAP tests but has not added the test to the "meson.build" file. > > To do this, locate the relevant "meson.build" file for each test and add it in the > 'tests' dictionary, which will look something like this: > > 'tap': { > 'tests': [ > 't/001_basic.pl', > ], > }, > > For some additional details please see this Wiki article: > > https://wiki.postgresql.org/wiki/Meson_for_patch_authors > > For more information on the meson build system for PostgreSQL see: > > https://wiki.postgresql.org/wiki/Meson Hi, thanks for your notification. You are right. Modified. The updated patch can be seen in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373775ECC6972289AF8CB30ED0F9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, On Tuesday, November 22, 2022 6:15 PM vignesh C <vignesh21@gmail.com> wrote: > On Mon, 14 Nov 2022 at 12:14, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Hi, > > > > The thread title doesn't really convey the topic under discussion, so > > changed it. IIRC, this has been mentioned by others as well in the > > thread. > > > > On Sat, Nov 12, 2022 at 7:21 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Few comments: > > > 1) I feel if the user has specified a long delay there is a chance > > > that replication may not continue if the replication slot falls > > > behind the current LSN by more than max_slot_wal_keep_size. I feel > > > we should add this reference in the documentation of min_apply_delay > > > as the replication will not continue in this case. > > > > > > > This makes sense to me. Modified accordingly. The updated patch is in [1]. > > > > > 2) I also noticed that if we have to shut down the publisher server > > > with a long min_apply_delay configuration, the publisher server > > > cannot be stopped as the walsender waits for the data to be > > > replicated. Is this behavior ok for the server to wait in this case? > > > If this behavior is ok, we could add a log message as it is not very > > > evident from the log files why the server could not be shut down. > > > > > > > I think for this case, the behavior should be the same as for physical > > replication. Can you please check what is behavior for the case you > > are worried about in physical replication? Note, we already have a > > similar parameter for recovery_min_apply_delay for physical > > replication. > > In the case of physical replication by setting recovery_min_apply_delay, I > noticed that both primary and standby nodes were getting stopped successfully > immediately after the stop server command. In case of logical replication, stop > server fails: > pg_ctl -D publisher -l publisher.log stop -c waiting for server to shut > down............................................................... > failed > pg_ctl: server does not shut down > > In case of logical replication, the server does not get stopped because the > walsender process is not able to exit: > ps ux | grep walsender > vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08 > postgres: walsender vignesh [local] START_REPLICATION Thanks, I could reproduce this and I'll update this point in a subsequent version. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373775ECC6972289AF8CB30ED0F9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, On Monday, November 14, 2022 7:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 9, 2022 at 12:11 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Wed, 10 Aug 2022 17:33:00 -0300, "Euler Taveira" > > <euler@eulerto.com> wrote in > > > On Wed, Aug 10, 2022, at 9:39 AM, osumi.takamichi@fujitsu.com wrote: > > > > Minor review comments for v6. > > > Thanks for your review. I'm attaching v7. > > > > Using interval is not standard as this kind of parameters but it seems > > convenient. On the other hand, it's not great that the unit month > > introduces some subtle ambiguity. This patch translates a month to 30 > > days but I'm not sure it's the right thing to do. Perhaps we shouldn't > > allow the units upper than days. > > > > Agreed. Isn't the same thing already apply to recovery_min_apply_delay for > which the maximum unit seems to be in days? If so, there is no reason to do > something different here? The corresponding parameter for physical replication has an upper limit of INT_MAX milliseconds (meaning 24 days is OK, but 25 days isn't). I added this test in the patch posted in [1]. > > > apply_delay() chokes the message-receiving path so that a not-so-long > > delay can cause a replication timeout to fire. I think we should > > process walsender pings even while delaying. Needing to make > > replication timeout longer than apply delay is not great, I think. > > > > Again, I think for this case also the behavior should be similar to how we handle > recovery_min_apply_delay. Yes, I agree with you. This feature makes it easier to trigger the publisher's timeout, which is not observed with physical replication. I'll investigate and address this point in a subsequent version. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373775ECC6972289AF8CB30ED0F9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, On Thursday, August 11, 2022 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Aug 9, 2022 at 3:52 AM Euler Taveira <euler@eulerto.com> wrote: > > > > On Wed, Aug 3, 2022, at 10:27 AM, Amit Kapila wrote: > > > > Your explanation makes sense to me. The other point to consider is > > that there can be cases where we may not apply operation for the > > transaction because of empty transactions (we don't yet skip empty > > xacts for prepared transactions). So, won't it be better to apply the > > delay just before we apply the first change for a transaction? Do we > > want to apply the delay during table sync as we sometimes do need to > > enter apply phase while doing table sync? > > > > I thought about the empty transactions but decided to not complicate > > the code mainly because skipping transactions is not a code path that > > will slow down this feature. As explained in the documentation, there > > is no harm in delaying a transaction for more than min_apply_delay; it > > cannot apply earlier. Having said that I decided to do nothing. I'm > > also not sure if it deserves a comment or if this email is a possible explanation > for this decision. > > > > I don't know what makes you think it will complicate the code. But anyway > thinking further about the way apply_delay is used at various places in the patch, > as pointed out by Peter Smith it seems it won't work for the parallel apply > feature where we start applying the transaction immediately after start stream. > I was wondering why don't we apply delay after each commit of the transaction > rather than at the begin command. We can remember if the transaction has > made any change and if so then after commit, apply the delay. If we can do that > then it will alleviate the concern of empty and skipped xacts as well. I agree with this direction. I'll update this point in a subsequent patch. > Another thing I was wondering how to determine what is a good delay time for > tests and found that current tests in replay_delay.pl uses 3s, so should we use > the same for apply delay tests in this patch as well? Fixed in the patch posted in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373775ECC6972289AF8CB30ED0F9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Fri, Nov 25, 2022 at 2:15 AM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Wednesday, October 5, 2022 6:42 PM Peter Smith <smithpb2250@gmail.com> wrote: ... > > > ====== > > > > 5. src/backend/commands/subscriptioncmds.c - SubOpts > > > > @@ -89,6 +91,7 @@ typedef struct SubOpts > > bool disableonerr; > > char *origin; > > XLogRecPtr lsn; > > + int64 min_apply_delay; > > } SubOpts; > > > > I feel it would be better to be explicit about the storage units. So call this > > member ‘min_apply_delay_ms’. E.g. then other code in > > parse_subscription_options will be more natural when you are converting using > > and assigning them to this member. > I don't think we use such names including units explicitly. > Could you please tell me a similar example for this ? > Regex search "\..*_ms[e\s]" finds some members where the unit is in the member name. e.g. delay_ms (see EnableTimeoutParams in timeout.h) e.g. interval_in_ms (see timeout_params in timeout.c). Regex search ".*_ms[e\s]" finds many local variables where the unit is in the variable name. > > ====== > > > > 16. src/include/catalog/pg_subscription.h > > > > + int64 subapplydelay; /* Replication apply delay */ > > + > > > > Consider renaming this as 'subapplydelayms' to make the units perfectly clear. > Similar to the 5th comments, I can't find any examples for this. > I'd like to keep it general, which makes me feel it is more aligned with > existing codes. > As above. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Nov 15, 2022 at 12:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 14, 2022 at 6:52 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Amit, > > > > > > > It seems to me Kuroda-San has proposed this change [1] to fix the test > > > > > but it is not clear to me why such a change is required. Why can't > > > > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below > > > > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter > > > > > updates? > > > > (I forgot to say, this change was not proposed by me. I said that there should be > > modified. I thought workers should wake up after the transaction was committed.) > > > > > So, why only honor the 'disable' option of the subscription? For > > > example, one can change 'min_apply_delay' and it seems > > > recoveryApplyDelay() honors a similar change in the recovery > > > parameter. Is there a way to set the latch of the worker process, so > > > that it can recheck if anything is changed? > > > > I have not considered about it, but seems reasonable. We may be able to > > do maybe_reread_subscription() if subscription parameters are changed > > and latch is set. > > > > One more thing I would like you to consider is the point raised by me > related to this patch's interaction with the parallel apply feature as > mentioned in the email [1]. I am not sure the idea proposed in that > email [1] is a good one because delaying after applying commit may not > be good as we want to delay the apply of the transaction(s) on > subscribers by this feature. I feel this needs more thought. > I have thought a bit more about this and we have the following options to choose the delay point from. (a) apply delay just before committing a transaction. As mentioned in comments in the patch this can lead to bloat and locks held for a long time. (b) apply delay before starting to apply changes for a transaction but here the problem is which time to consider. In some cases, like for streaming transactions, we don't receive the commit/prepare xact time in the start message. (c) use (b) but use the previous transaction's commit time. (d) apply delay after committing a transaction by using the xact's commit time. At this stage, among above, I feel any one of (c) or (d) is worth considering. Now, the difference between (c) and (d) is that if after commit the next xact's data is already delayed by more than min_apply_delay time then we don't need to kick the new logic of apply delay. The other thing to consider whether we need to process any keepalive messages during the delay because otherwise, walsender may think that the subscriber is not available and time out. This may not be a problem for synchronous replication but otherwise, it could be a problem. Thoughts? -- With Regards, Amit Kapila.
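One way to picture the keepalive concern and a possible shape of a fix: wait in bounded latch slices and keep replying to the walsender in between. The following is only a hedged sketch, not the patch; send_feedback(), maybe_reread_subscription(), and AcceptInvalidationMessages() exist in worker.c, but the applydelay field on MySubscription, the reuse of WAIT_EVENT_RECOVERY_APPLY_DELAY, the last_received parameter, and the overall integration are assumptions.
```
/* Sketch: delay applying a transaction whose upstream commit time is
 * finish_ts, while staying responsive to latch wakeups and the walsender. */
static void
maybe_apply_delay(TimestampTz finish_ts, XLogRecPtr last_received)
{
	for (;;)
	{
		TimestampTz wakeup = TimestampTzPlusMilliseconds(finish_ts,
														 MySubscription->applydelay);
		long		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
															 wakeup);

		if (diffms <= 0)
			break;

		/* Sleep at most one second per iteration so we can reply regularly. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 Min(diffms, 1000L),
						 WAIT_EVENT_RECOVERY_APPLY_DELAY);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();

		/* Keep the walsender informed so wal_sender_timeout does not fire
		 * while we are intentionally idle. */
		send_feedback(last_received, true, false);

		/* Pick up ALTER SUBSCRIPTION ... SET (min_apply_delay) / DISABLE. */
		AcceptInvalidationMessages();
		maybe_reread_subscription();
	}
}
```
Because the wakeup time is recomputed on every iteration, lowering min_apply_delay or disabling the subscription takes effect at the next latch wakeup rather than after the full delay; and delaying before applying the first change keeps the transaction from being held open on the subscriber, which is the bloat/locks concern with option (a).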
Here are some review comments for patch v9-0001: ====== GENERAL 1. min_ prefix? What's the significance of the "min_" prefix for this parameter? I'm guessing the background is that at one time it was considered to be a GUC so took a name similar to GUC recovery_min_apply_delay (??) But in practice, I think it is meaningless and/or misleading. For example, suppose the user wants to defer replication by 1hr. IMO it is more natural to just say "defer replication by 1 hr" (aka apply_delay='1hr') Clearly it means replication will take place about 1 hr into the future. OTHO saying "defer replication by a MINIMUM of 1 hr" (aka min_apply_delay='1hr') is quite vague because then it is equally valid if the replication gets delayed by 1 hr or 2 hrs or 5 days or 3 weeks since all of those satisfy the minimum delay. The implementation could hardwire a delay of INT_MAX ms but clearly, that's not really what the user would expect. ~ So, I think this parameter should be renamed just as 'apply_delay'. But, if you still decide to keep it as 'min_apply_delay' then there is a lot of other code that ought to be changed to be consistent with that name. e.g. - subapplydelay in catalogs.sgml --> subminapplydelay - subapplydelay in system_views.sql --> subminapplydelay - subapplydelay in pg_subscription.h --> subminapplydelay - subapplydelay in dump.h --> subminapplydelay - i_subapplydelay in pg_dump.c --> i_subminapplydelay - applydelay member name of Form_pg_subscription --> minapplydelay - "Apply Delay" for the column name displayed by describe.c --> "Min apply delay" - more... (IMO the fact that so much code does not currently say 'min' at all is just evidence that the 'min' prefix really didn't really mean much in the first place) ====== doc/src/sgml/catalogs.sgml 2. Section 31.2 Subscription + <para> + Time delayed replica of subscription is available by indicating + <literal>min_apply_delay</literal>. See + <xref linkend="sql-createsubscription"/> for details. + </para> How about saying like: SUGGESTION The subscriber replication can be instructed to lag behind the publisher side changes by specifying the <literal>min_apply_delay</literal> subscription parameter. See XXX for details. ====== doc/src/sgml/ref/create_subscription.sgml 3. min_apply_delay + <para> + By default, subscriber applies changes as soon as possible. As with + the physical replication feature + (<xref linkend="guc-recovery-min-apply-delay"/>), it can be useful to + have a time-delayed logical replica. This parameter allows you to + delay the application of changes by a specified amount of time. If + this value is specified without units, it is taken as milliseconds. + The default is zero, adding no delay. + </para> "subscriber applies" -> "the subscriber applies" "allows you" -> "lets the user" "The default is zero, adding no delay." -> "The default is zero (no delay)." ~ 4. + larger than the time deviations between servers. Note that + in the case when this parameter is set to a long value, the + replication may not continue if the replication slot falls behind the + current LSN by more than <literal>max_slot_wal_keep_size</literal>. + See more details in <xref linkend="guc-max-slot-wal-keep-size"/>. + </para> 4a. SUGGESTION Note that if this parameter is set to a long delay, the replication will stop if the replication slot falls behind the current LSN by more than <literal>max_slot_wal_keep_size</literal>. ~ 4b. When it is rendered (like below) it looks a bit repetitive: ... 
if the replication slot falls behind the current LSN by more than max_slot_wal_keep_size. See more details in max_slot_wal_keep_size. ~ IMO the previous sentence should include the link. SUGGESTION if the replication slot falls behind the current LSN by more than <link linkend = "guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>. ~ 5. + <para> + Synchronous replication is affected by this setting when + <varname>synchronous_commit</varname> is set to + <literal>remote_write</literal>; every <literal>COMMIT</literal> + will need to wait to be applied. + </para> Yes, this deserves a big warning -- but I am just not quite sure of the details. I think this impacts more than just "remote_rewrite" -- e.g. the same problem would happen if "synchronous_standby_names" is non-empty. I think this warning needs to be more generic to cover everything. Maybe something like below SUGGESTION: Delaying the replication can mean there is a much longer time between making a change on the publisher, and that change being committed on the subscriber. This can have a big impact on synchronous replication. See https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT ====== src/backend/commands/subscriptioncmds.c 6. parse_subscription_options + ms = interval_to_ms(interval); + if (ms < 0 || ms > INT_MAX) + ereport(ERROR, + errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("%lld ms is outside the valid range for option \"%s\"", + (long long) ms, "min_apply_delay")); "for option" -> "for parameter" ====== src/backend/replication/logical/worker.c 7. apply_delay +static void +apply_delay(TimestampTz ts) IMO having a delay is not the usual case. So, would a better name for this function be 'maybe_delay'? ~ 8. + * high value for the delay. This design is different from the physical + * replication (that applies the delay at commit time) mainly because write + * operations may allow some issues (such as bloat and locks) that can be + * minimized if it does not keep the transaction open for such a long time. Something seems not quite right with this wording -- is there a better way of describing this? ~ 9. + /* + * Delay apply until all tablesync workers have reached READY state. If we + * allow the delay during the catchup phase, once we reach the limit of + * tablesync workers, it will impose a delay for each subsequent worker. + * It means it will take a long time to finish the initial table + * synchronization. + */ + if (!AllTablesyncsReady()) + return; "Delay apply until..." -> "The min_apply_delay parameter is ignored until..." ~ 10. + /* + * The worker may be waken because of the ALTER SUBSCRIPTION ... + * DISABLE, so the catalog pg_subscription should be read again. + */ + if (!in_remote_transaction && !in_streamed_transaction) + { + AcceptInvalidationMessages(); + maybe_reread_subscription(); + } + } "waken" -> "woken" ====== src/bin/psql/describe.c 11. 
describeSubscriptions + /* Origin and min_apply_delay are only supported in v16 and higher */ if (pset.sversion >= 160000) appendPQExpBuffer(&buf, - ", suborigin AS \"%s\"\n", - gettext_noop("Origin")); + ", suborigin AS \"%s\"\n" + ", subapplydelay AS \"%s\"\n", + gettext_noop("Origin"), + gettext_noop("Apply delay")); IIUC the psql command is supposed to display useful information to the user, so I wondered if it is worthwhile to put the units in this column header -- "Apply delay (ms)" instead of just "Apply delay" because that would make it far easier to understand the meaning without having to check the documentation to discover the units. ====== src/include/utils/timestamp.h 12. +extern int64 interval_to_ms(const Interval *interval); + For consistency with the other interval conversion functions exposed here maybe this one should have been called 'interval2ms' ====== src/test/subscription/t/032_apply_delay.pl 13. IIUC this test is checking if a delay has occurred by inspecting the debug logs to see if a certain code path including "logical replication apply delay" is logged. I guess that is OK, but another way might be to compare the actual timing values of the published and replicated rows. The publisher table can have a column with default now() and the subscriber side table can have an *additional* column also with default now(). After replication, those two timestamp values can be compared to check if the difference exceeds the min_time_delay parameter specified. ------ Kind Regards, Peter Smith. Fujitsu Australia
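For review comment 11 above, the units-in-header suggestion only changes the string passed to gettext_noop(); a sketch of what describe.c might then contain (the "(ms)" suffix is the reviewer's suggestion, not committed behavior):
```
/* Sketch: make the units visible in the \dRs+ column header. */
appendPQExpBuffer(&buf,
				  ", suborigin AS \"%s\"\n"
				  ", subapplydelay AS \"%s\"\n",
				  gettext_noop("Origin"),
				  gettext_noop("Apply delay (ms)"));
```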
On Tue, Dec 6, 2022 at 1:30 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v9-0001: > > ====== > > GENERAL > > 1. min_ prefix? > > What's the significance of the "min_" prefix for this parameter? I'm > guessing the background is that at one time it was considered to be a > GUC so took a name similar to GUC recovery_min_apply_delay (??) > > But in practice, I think it is meaningless and/or misleading. For > example, suppose the user wants to defer replication by 1hr. IMO it is > more natural to just say "defer replication by 1 hr" (aka > apply_delay='1hr') Clearly it means replication will take place about > 1 hr into the future. OTHO saying "defer replication by a MINIMUM of 1 > hr" (aka min_apply_delay='1hr') is quite vague because then it is > equally valid if the replication gets delayed by 1 hr or 2 hrs or 5 > days or 3 weeks since all of those satisfy the minimum delay. The > implementation could hardwire a delay of INT_MAX ms but clearly, > that's not really what the user would expect. > There is another way to look at this naming. It is quite possible user has set its value as '1 second' and the transaction is delayed by more than that say because the publisher delayed sending it. There could be various reasons why the publisher could delay like it was busy processing another workload, the replication connection between publisher and subscriber was not working, etc. Moreover, it will be similar to the same parameter for physical replication. So, I think keeping min in the name is a good idea. -- With Regards, Amit Kapila.
On Friday, December 2, 2022 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Nov 15, 2022 at 12:33 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > One more thing I would like you to consider is the point raised by me > > related to this patch's interaction with the parallel apply feature as > > mentioned in the email [1]. I am not sure the idea proposed in that > > email [1] is a good one because delaying after applying commit may not > > be good as we want to delay the apply of the transaction(s) on > > subscribers by this feature. I feel this needs more thought. > > > > I have thought a bit more about this and we have the following options to > choose the delay point from. (a) apply delay just before committing a > transaction. As mentioned in comments in the patch this can lead to bloat and > locks held for a long time. (b) apply delay before starting to apply changes for a > transaction but here the problem is which time to consider. In some cases, like > for streaming transactions, we don't receive the commit/prepare xact time in > the start message. (c) use (b) but use the previous transaction's commit time. > (d) apply delay after committing a transaction by using the xact's commit time. > > At this stage, among above, I feel any one of (c) or (d) is worth considering. Now, > the difference between (c) and (d) is that if after commit the next xact's data is > already delayed by more than min_apply_delay time then we don't need to kick > the new logic of apply delay. > > The other thing to consider whether we need to process any keepalive > messages during the delay because otherwise, walsender may think that the > subscriber is not available and time out. This may not be a problem for > synchronous replication but otherwise, it could be a problem. > > Thoughts? Hi, Thank you for your comments ! Below are some analysis for the major points above. (1) About the timing to apply the delay One approach of (b) would be best. The idea is to delay all types of transaction's application based on the time when one transaction arrives at the subscriber node. One advantage of this approach over (c) and (d) is that this can avoid the case where we might apply a transaction immediately without waiting, if there are two transactions sequentially and the time in between exceeds the min_apply_delay time. When we receive stream-in-progress transactions, we'll check whether the time for delay has passed or not at first in this approach. (2) About the timeout issue When having a look at the physical replication internals, it conducts sending feedback and application of delay separately on different processes. OTOH, the logical replication needs to achieve those within one process. When we want to apply delay and avoid the timeout, we should not store all the transactions data into memory. So, one approach for this is to serialize the transaction data and after the delay, we apply the transactions data. However, this means if users adopt this feature, then all transaction data that should be delayed would be serialized. We are not sure if this sounds a valid approach or not. One another approach was to divide the time of delay in apply_delay() and utilize the divided time for WaitLatch and sends the keepalive messages from there. But, this approach requires some change on the level of libpq layer (like implementing a new function for wal receiver in order to monitor if the data from the publisher is readable or not there). 
Probably, the first idea, serializing the delayed transactions, might be better on this point. Any feedback is welcome. Best Regards, Takamichi Osumi
Hi, The tests fail on cfbot: https://cirrus-ci.com/task/4533866329800704 They only seem to fail on 32bit linux. https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/build-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_delay [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to subscriber timed out waiting for match: (?^:logical replication apply delay) at /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124. Greetings, Andres Freund
At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund <andres@anarazel.de> wrote in > Hi, > > The tests fail on cfbot: > https://cirrus-ci.com/task/4533866329800704 > > They only seem to fail on 32bit linux. > > https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/build-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_delay > [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to subscriber > timed out waiting for match: (?^:logical replication apply delay) at /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.plline 124. It fails for me on 64bit Linux.. (Rocky 8.7) > t/032_apply_delay.pl ............... Dubious, test returned 29 (wstat 7424, 0x1d00) > No subtests run .. > t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0) > Non-zero exit status: 29 > Parse errors: No plan found in TAP output regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 6, 2022 at 5:44 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Friday, December 2, 2022 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 15, 2022 at 12:33 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > One more thing I would like you to consider is the point raised by me > > > related to this patch's interaction with the parallel apply feature as > > > mentioned in the email [1]. I am not sure the idea proposed in that > > > email [1] is a good one because delaying after applying commit may not > > > be good as we want to delay the apply of the transaction(s) on > > > subscribers by this feature. I feel this needs more thought. > > > > > > > I have thought a bit more about this and we have the following options to > > choose the delay point from. (a) apply delay just before committing a > > transaction. As mentioned in comments in the patch this can lead to bloat and > > locks held for a long time. (b) apply delay before starting to apply changes for a > > transaction but here the problem is which time to consider. In some cases, like > > for streaming transactions, we don't receive the commit/prepare xact time in > > the start message. (c) use (b) but use the previous transaction's commit time. > > (d) apply delay after committing a transaction by using the xact's commit time. > > > > At this stage, among above, I feel any one of (c) or (d) is worth considering. Now, > > the difference between (c) and (d) is that if after commit the next xact's data is > > already delayed by more than min_apply_delay time then we don't need to kick > > the new logic of apply delay. > > > > The other thing to consider whether we need to process any keepalive > > messages during the delay because otherwise, walsender may think that the > > subscriber is not available and time out. This may not be a problem for > > synchronous replication but otherwise, it could be a problem. > > > > Thoughts? > Hi, > > > Thank you for your comments ! > Below are some analysis for the major points above. > > (1) About the timing to apply the delay > > One approach of (b) would be best. The idea is to delay all types of transaction's application > based on the time when one transaction arrives at the subscriber node. > But I think it will unnecessarily add the delay when there is a delay in sending the transaction by the publisher (say due to the reason that publisher was busy handling other workloads or there was a temporary network communication break between publisher and subscriber). This could probably be the reason why physical replication (via recovery_min_apply_delay) uses the commit time of the sending side. > One advantage of this approach over (c) and (d) is that this can avoid the case > where we might apply a transaction immediately without waiting, > if there are two transactions sequentially and the time in between exceeds the min_apply_delay time. > I am not sure if I understand your point. However, I think even if the transactions are sequential but if the time between them exceeds (say because the publisher was down) min_apply_delay, there is no need to apply additional delay. > When we receive stream-in-progress transactions, we'll check whether the time for delay > has passed or not at first in this approach. > > > (2) About the timeout issue > > When having a look at the physical replication internals, > it conducts sending feedback and application of delay separately on different processes. 
> OTOH, the logical replication needs to achieve those within one process. > > When we want to apply delay and avoid the timeout, > we should not store all the transactions data into memory. > So, one approach for this is to serialize the transaction data and after the delay, > we apply the transactions data. > It is not clear to me how this will avoid a timeout. > However, this means if users adopt this feature, > then all transaction data that should be delayed would be serialized. > We are not sure if this sounds a valid approach or not. > > One another approach was to divide the time of delay in apply_delay() and > utilize the divided time for WaitLatch and sends the keepalive messages from there. > Do we anytime send keepalive messages from the apply side? I think we only send feedback reply messages as a response to the publisher's keep_alive message. So, we need to do something similar for this if you want to follow this approach. -- With Regards, Amit Kapila.
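To illustrate the second alternative (dividing the delay and exchanging status in between), here is a rough sketch of what such a chunked wait could look like, assuming the apply-worker environment of src/backend/replication/logical/worker.c. The function name and the applydelay subscription field follow this discussion, the worker-internal send_feedback()/last_received are reused, and the wait event is borrowed from physical replication purely for illustration; the exact integration is hypothetical, not the actual patch:

```
/*
 * Sketch only: wait out min_apply_delay in short slices so the apply worker
 * can keep exchanging status messages with the walsender instead of
 * sleeping through the whole delay.
 */
static void
maybe_delay_apply(TimestampTz commit_time)
{
	TimestampTz delay_until;

	if (MySubscription->applydelay <= 0)
		return;

	delay_until = TimestampTzPlusMilliseconds(commit_time,
											  MySubscription->applydelay);

	for (;;)
	{
		long	diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
														 delay_until);
		long	napms;

		if (diffms <= 0)
			break;

		/* Nap at most one status interval, then report that we are alive. */
		napms = (wal_receiver_status_interval > 0) ?
			Min(diffms, wal_receiver_status_interval * 1000L) : diffms;

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 napms,
						 WAIT_EVENT_RECOVERY_APPLY_DELAY);
		ResetLatch(MyLatch);

		send_feedback(last_received, true, false);
	}
}
```

The point of the chunked wait is that the worker never sleeps longer than one status interval, so it can still report its position to the walsender during a long min_apply_delay.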
On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund <andres@anarazel.de> wrote > in > > Hi, > > > > The tests fail on cfbot: > > https://cirrus-ci.com/task/4533866329800704 > > > > They only seem to fail on 32bit linux. > > > > https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/bu > > ild-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_ > > delay > > [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to > > subscriber timed out waiting for match: (?^:logical replication apply delay) at > /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124. > > It fails for me on 64bit Linux.. (Rocky 8.7) > > > t/032_apply_delay.pl ............... Dubious, test returned 29 (wstat > > 7424, 0x1d00) No subtests run > .. > > t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0) > > Non-zero exit status: 29 > > Parse errors: No plan found in TAP output > > regards. Hi, thank you so much for your notifications ! I'll look into the failures. Best Regards, Takamichi Osumi
Hi Vignesh, > In the case of physical replication by setting > recovery_min_apply_delay, I noticed that both primary and standby > nodes were getting stopped successfully immediately after the stop > server command. In case of logical replication, stop server fails: > pg_ctl -D publisher -l publisher.log stop -c > waiting for server to shut > down............................................................... > failed > pg_ctl: server does not shut down > > In case of logical replication, the server does not get stopped > because the walsender process is not able to exit: > ps ux | grep walsender > vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08 > postgres: walsender vignesh [local] START_REPLICATION Thanks for reporting the issue. I have analyzed it. This issue occurs because the apply worker cannot reply during the delay. I think we may have to modify the mechanism that delays applying transactions. When a walsender process is requested to shut down, it can shut down only after all the WAL it has sent has been replicated on the subscriber. This check is done in WalSndDone(), and the replicated position is updated when the process handles reply messages from the subscriber, in ProcessStandbyReplyMessage(). In the case of physical replication, the walreceiver can receive WAL and reply even while applying is delayed. This means the replicated position is reported to the publisher side immediately, so the walsender can exit. In the case of logical replication, however, the worker cannot reply to the walsender while delaying the transaction with this patch at present. As a result, the replicated position is never reported upstream and the walsender cannot exit. Based on the above analysis, we can conclude that the worker must update the flushpos and reply to the walsender while delaying the transaction if we want to solve the issue. This cannot be done in the current approach, and a newer proposed one [1] may be able to solve this, although it's currently under discussion. Note that a similar issue can be reproduced with physical replication. When wal_sender_timeout is set to 0 and the network between primary and secondary breaks after the primary has sent WAL to the secondary, we cannot stop the primary node. [1]: https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
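To make the shutdown condition concrete, the check in WalSndDone() that this analysis refers to works roughly as follows. This is a heavily simplified paraphrase, not verbatim PostgreSQL source; MyWalSnd, sentPtr, WalSndCaughtUp and waiting_for_ping_response are walsender.c internals, and details such as the final CommandComplete message are omitted:

```
	/* Roughly what WalSndDone() checks (simplified paraphrase) */
	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
		MyWalSnd->write : MyWalSnd->flush;

	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
		!pq_is_send_pending())
	{
		/* everything sent has been confirmed by the receiver: exit */
		proc_exit(0);
	}
	else if (!waiting_for_ping_response)
	{
		/* otherwise keep asking the receiver for a status report */
		WalSndKeepalive(true, InvalidXLogRecPtr);
	}
```

With the delay implemented purely on the apply side, the subscriber never confirms the delayed transactions, so replicatedPtr never catches up to sentPtr and the walsender never reaches the exit branch.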
Dear Andres, Thanks for reporting! I have analyzed the problem and found the root cause. This feature seemed not to work on 32-bit OSes. This was because the calculation of delay_time was wrong. The first argument of this should be TimestampTz datatype, not Datum: ``` + /* Set apply delay */ + delay_until = TimestampTzPlusMilliseconds(TimestampTzGetDatum(ts), + MySubscription->applydelay); ``` In more detail, the datum representation of int64 contains the value itself on 64-bit OSes, but it contains the pointer to the value on 32-bit. After modifying the issue, this will work on 32-bit environments. Best Regards, Hayato Kuroda FUJITSU LIMITED
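In other words, the fix described above amounts to dropping the Datum conversion and passing the TimestampTz value straight in, roughly:

```
	/* Set apply delay (sketch of the corrected form) */
	delay_until = TimestampTzPlusMilliseconds(ts, MySubscription->applydelay);
```

TimestampTzPlusMilliseconds() is plain int64 arithmetic, so handing it the Datum, which on 32-bit builds is a pointer to the value rather than the value itself, silently produced a garbage wakeup time.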
Hi, On Friday, December 9, 2022 3:38 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > Thanks for reporting! I have analyzed the problem and found the root cause. > > This feature seemed not to work on 32-bit OSes. This was because the > calculation of delay_time was wrong. The first argument of this should be > TimestampTz datatype, not Datum: > > ``` > + /* Set apply delay */ > + delay_until = > TimestampTzPlusMilliseconds(TimestampTzGetDatum(ts), > + > + MySubscription->applydelay); > ``` > > In more detail, the datum representation of int64 contains the value itself on > 64-bit OSes, but it contains the pointer to the value on 32-bit. > > After modifying the issue, this will work on 32-bit environments. Thank you for your analysis. Yeah, it seems that on 32-bit machines we mistakenly add the delay to the pointer value returned by TimestampTzGetDatum(). I'll remove the call in my next version. Best Regards, Takamichi Osumi
Hello. I asked about unexpected walsender termination caused by this feature, but I don't think I received an answer for it, and the behavior still exists. Specifically, if the servers have the following settings, the walsender terminates due to replication timeout. After that, the connection is restored once the LR delay elapses. Although it can be said to be working in that sense, the error happens repeatedly at roughly min_apply_delay intervals and is hard to distinguish from network trouble. I'm not sure whether you're deliberately okay with it, but I don't think behavior that causes replication timeouts is acceptable. > wal_sender_timeout = 10s; > wal_receiver_timeout = 10s; > > create subscription ... with (min_apply_delay='60s'); This is somewhat artificial, but timeout=60s and delay=5m is not an uncommon setup and also causes this "issue". subscriber: > 2022-12-12 14:17:18.139 JST LOG: terminating walsender process due to replication timeout > 2022-12-12 14:18:11.076 JST LOG: starting logical decoding for slot "s1" ... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wednesday, December 7, 2022 2:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Dec 6, 2022 at 5:44 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Friday, December 2, 2022 4:05 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > On Tue, Nov 15, 2022 at 12:33 PM Amit Kapila > > > <amit.kapila16@gmail.com> > > > wrote: > > > > One more thing I would like you to consider is the point raised by > > > > me related to this patch's interaction with the parallel apply > > > > feature as mentioned in the email [1]. I am not sure the idea > > > > proposed in that email [1] is a good one because delaying after > > > > applying commit may not be good as we want to delay the apply of > > > > the transaction(s) on subscribers by this feature. I feel this needs more > thought. > > > > > > > > > > I have thought a bit more about this and we have the following > > > options to choose the delay point from. (a) apply delay just before > > > committing a transaction. As mentioned in comments in the patch this > > > can lead to bloat and locks held for a long time. (b) apply delay > > > before starting to apply changes for a transaction but here the > > > problem is which time to consider. In some cases, like for streaming > > > transactions, we don't receive the commit/prepare xact time in the start > message. (c) use (b) but use the previous transaction's commit time. > > > (d) apply delay after committing a transaction by using the xact's commit > time. > > > > > > At this stage, among above, I feel any one of (c) or (d) is worth > > > considering. Now, the difference between (c) and (d) is that if > > > after commit the next xact's data is already delayed by more than > > > min_apply_delay time then we don't need to kick the new logic of apply > delay. > > > > > > The other thing to consider whether we need to process any keepalive > > > messages during the delay because otherwise, walsender may think > > > that the subscriber is not available and time out. This may not be a > > > problem for synchronous replication but otherwise, it could be a problem. > > > > > > Thoughts? > > (1) About the timing to apply the delay > > > > One approach of (b) would be best. The idea is to delay all types of > > transaction's application based on the time when one transaction arrives at > the subscriber node. > > > > But I think it will unnecessarily add the delay when there is a delay in sending > the transaction by the publisher (say due to the reason that publisher was busy > handling other workloads or there was a temporary network communication > break between publisher and subscriber). This could probably be the reason > why physical replication (via recovery_min_apply_delay) uses the commit time of > the sending side. You are right. The approach (b) adds additional (or unnecessary) delay due to network communication or machine troubles in streaming-in-progress cases. We agreed this approach (b) has the disadvantage. > > One advantage of this approach over (c) and (d) is that this can avoid > > the case where we might apply a transaction immediately without > > waiting, if there are two transactions sequentially and the time in between > exceeds the min_apply_delay time. > > > > I am not sure if I understand your point. However, I think even if the > transactions are sequential but if the time between them exceeds (say because > the publisher was down) min_apply_delay, there is no need to apply additional > delay. I'm sorry, my description was not accurate. 
As for approach (c), kindly imagine two transactions (txn1, txn2) executed on the publisher side, with the publisher trying to send both of them to the subscriber. Here, there is no network trouble and the publisher isn't busy with other workloads. However, the time difference between txn1 and txn2 exceeds "min_apply_delay" (which is set on the subscriber). In this case, when txn2 is a stream-in-progress transaction, we don't apply any delay to txn2 when it arrives at the subscriber, because by the time txn2 reaches the subscriber, "min_apply_delay" has already elapsed on the publisher side. This means there is a case where we apply no delay at all if we choose approach (c). Approach (d) has a similar disadvantage. IIUC, in this approach the subscriber applies the delay after committing a transaction, based on the commit/prepare time on the publisher side. Imagine two transactions executed on the publisher, where the 2nd transaction completes after the subscriber's delay for the 1st transaction. Again, there are no network troubles and no heavy workloads on the publisher. In that case, the delay for txn1 has already finished when the 2nd transaction arrives at the subscriber, so the 2nd transaction will be applied immediately without delay. Another new discussion point is to utilize (b) with the stream commit/stream prepare time and apply the delay immediately before applying the spooled files of the transaction in the stream-in-progress cases. Does anyone have an opinion on these approaches? Lastly, thanks to Amit-san and Kuroda-san for giving me so much off-list feedback on these significant points. Best Regards, Takamichi Osumi
Dear Amit, This is a reply to the later part of your e-mail. > > (2) About the timeout issue > > > > When having a look at the physical replication internals, > > it conducts sending feedback and application of delay separately on different > processes. > > OTOH, the logical replication needs to achieve those within one process. > > > > When we want to apply delay and avoid the timeout, > > we should not store all the transactions data into memory. > > So, one approach for this is to serialize the transaction data and after the delay, > > we apply the transactions data. > > > > It is not clear to me how this will avoid a timeout. First, the reason the timeout occurs is that, while delaying, the apply worker neither reads messages from the walsender nor replies to it. The worker's last_recv_timeout will not be updated because it does not receive messages, which leads to a wal_receiver_timeout. Similarly, the walsender's last_processing will not be updated, and it will exit due to the timeout because the worker does not reply upstream. Based on the above, we thought that workers must receive and handle messages even if they are delaying applying transactions. In more detail, workers must iterate the outer loop in LogicalRepApplyLoop(). If workers receive transactions but need to delay applying them, they must keep the messages somewhere. So we came up with the idea that workers serialize changes once and apply them later. Our basic design is as follows: * All transactions are serialized to files if min_apply_delay is set to a non-zero value. * After receiving the commit message and waiting out the delay, workers read and apply the spooled messages. > > However, this means if users adopt this feature, > > then all transaction data that should be delayed would be serialized. > > We are not sure if this sounds a valid approach or not. > > > > One another approach was to divide the time of delay in apply_delay() and > > utilize the divided time for WaitLatch and sends the keepalive messages from > there. > > > > Do we anytime send keepalive messages from the apply side? I think we > only send feedback reply messages as a response to the publisher's > keep_alive message. So, we need to do something similar for this if > you want to follow this approach. Right, and the above mechanism is needed for workers to understand messages and send feedback replies as a response to the publisher's keepalive message. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Monday, December 12, 2022 2:54 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > I asked about unexpected walsender termination caused by this feature but I > think I didn't received an answer for it and the behavior is still exists. > > Specifically, if servers have the following settings, walsender terminates for > replication timeout. After that, connection is restored after the LR delay elapses. > Although it can be said to be working in that sense, the error happens > repeatedly every about min_apply_delay internvals but is hard to distinguish > from network troubles. I'm not sure you're deliberately okay with it but, I don't > think the behavior causing replication timeouts is acceptable. > > > wal_sender_timeout = 10s; > > wal_receiver_temeout = 10s; > > > > create subscription ... with (min_apply_delay='60s'); > > This is a kind of artificial but timeout=60s and delay=5m is not an uncommon > setup and that also causes this "issue". > > subscriber: > > 2022-12-12 14:17:18.139 JST LOG: terminating walsender process due to > > replication timeout > > 2022-12-12 14:18:11.076 JST LOG: starting logical decoding for slot "s1" > ... Hi, Horiguchi-san Thank you so much for your report! Yes. Currently, how to deal with the timeout issue is under discussion. Some analysis about the root cause are also there. Kindly have a look at [1]. [1] - https://www.postgresql.org/message-id/TYAPR01MB58669394A67F2340B82E42D1F5E29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Tuesday, December 6, 2022 5:00 PM Peter Smith <smithpb2250@gmail.com> wrote: > Here are some review comments for patch v9-0001: Hi, thank you for your reviews ! > > ====== > > GENERAL > > 1. min_ prefix? > > What's the significance of the "min_" prefix for this parameter? I'm guessing the > background is that at one time it was considered to be a GUC so took a name > similar to GUC recovery_min_apply_delay (??) > > But in practice, I think it is meaningless and/or misleading. For example, > suppose the user wants to defer replication by 1hr. IMO it is more natural to > just say "defer replication by 1 hr" (aka > apply_delay='1hr') Clearly it means replication will take place about > 1 hr into the future. OTHO saying "defer replication by a MINIMUM of 1 hr" (aka > min_apply_delay='1hr') is quite vague because then it is equally valid if the > replication gets delayed by 1 hr or 2 hrs or 5 days or 3 weeks since all of those > satisfy the minimum delay. The implementation could hardwire a delay of > INT_MAX ms but clearly, that's not really what the user would expect. > > ~ > > So, I think this parameter should be renamed just as 'apply_delay'. > > But, if you still decide to keep it as 'min_apply_delay' then there is a lot of other > code that ought to be changed to be consistent with that name. > e.g. > - subapplydelay in catalogs.sgml --> subminapplydelay > - subapplydelay in system_views.sql --> subminapplydelay > - subapplydelay in pg_subscription.h --> subminapplydelay > - subapplydelay in dump.h --> subminapplydelay > - i_subapplydelay in pg_dump.c --> i_subminapplydelay > - applydelay member name of Form_pg_subscription --> minapplydelay > - "Apply Delay" for the column name displayed by describe.c --> "Min apply > delay" I followed the suggestion to keep the "min_" prefix in [1]. Fixed. > - more... > > (IMO the fact that so much code does not currently say 'min' at all is just > evidence that the 'min' prefix really didn't really mean much in the first place) > > > ====== > > doc/src/sgml/catalogs.sgml > > 2. Section 31.2 Subscription > > + <para> > + Time delayed replica of subscription is available by indicating > + <literal>min_apply_delay</literal>. See > + <xref linkend="sql-createsubscription"/> for details. > + </para> > > How about saying like: > > SUGGESTION > The subscriber replication can be instructed to lag behind the publisher side > changes by specifying the <literal>min_apply_delay</literal> subscription > parameter. See XXX for details. Fixed. > ====== > > doc/src/sgml/ref/create_subscription.sgml > > 3. min_apply_delay > > + <para> > + By default, subscriber applies changes as soon as possible. As with > + the physical replication feature > + (<xref linkend="guc-recovery-min-apply-delay"/>), it can be useful > to > + have a time-delayed logical replica. This parameter allows you to > + delay the application of changes by a specified amount of time. If > + this value is specified without units, it is taken as milliseconds. > + The default is zero, adding no delay. > + </para> > > "subscriber applies" -> "the subscriber applies" > > "allows you" -> "lets the user" > > "The default is zero, adding no delay." -> "The default is zero (no delay)." Fixed. > ~ > > 4. > > + larger than the time deviations between servers. Note that > + in the case when this parameter is set to a long value, the > + replication may not continue if the replication slot falls behind the > + current LSN by more than > <literal>max_slot_wal_keep_size</literal>. 
> + See more details in <xref linkend="guc-max-slot-wal-keep-size"/>. > + </para> > > 4a. > SUGGESTION > Note that if this parameter is set to a long delay, the replication will stop if the > replication slot falls behind the current LSN by more than > <literal>max_slot_wal_keep_size</literal>. Fixed. > ~ > > 4b. > When it is rendered (like below) it looks a bit repetitive: > ... if the replication slot falls behind the current LSN by more than > max_slot_wal_keep_size. See more details in max_slot_wal_keep_size. Thanks! Fixed the redundancy. > ~ > > IMO the previous sentence should include the link. > > SUGGESTION > if the replication slot falls behind the current LSN by more than <link linkend = > "guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></lin > k>. Fixed. > ~ > > 5. > > + <para> > + Synchronous replication is affected by this setting when > + <varname>synchronous_commit</varname> is set to > + <literal>remote_write</literal>; every <literal>COMMIT</literal> > + will need to wait to be applied. > + </para> > > Yes, this deserves a big warning -- but I am just not quite sure of the details. I > think this impacts more than just "remote_rewrite" -- e.g. the same problem > would happen if "synchronous_standby_names" is non-empty. > > I think this warning needs to be more generic to cover everything. > Maybe something like below > > SUGGESTION: > Delaying the replication can mean there is a much longer time between making > a change on the publisher, and that change being committed on the subscriber. > This can have a big impact on synchronous replication. > See > https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-SYN > CHRONOUS-COMMIT Fixed. > > ====== > > src/backend/commands/subscriptioncmds.c > > 6. parse_subscription_options > > + ms = interval_to_ms(interval); > + if (ms < 0 || ms > INT_MAX) > + ereport(ERROR, > + errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > + errmsg("%lld ms is outside the valid range for option \"%s\"", > + (long long) ms, "min_apply_delay")); > > "for option" -> "for parameter" Fixed. > ====== > > src/backend/replication/logical/worker.c > > 7. apply_delay > > +static void > +apply_delay(TimestampTz ts) > > IMO having a delay is not the usual case. So, would a better name for this > function be 'maybe_delay'? Makes sense. I follow some other functions such as maybe_reread_subscription and maybe_start_skipping_changes. > ~ > > 8. > > + * high value for the delay. This design is different from the physical > + * replication (that applies the delay at commit time) mainly because > + write > + * operations may allow some issues (such as bloat and locks) that can > + be > + * minimized if it does not keep the transaction open for such a long time. > > Something seems not quite right with this wording -- is there a better way of > describing this? I reworded the entire paragraph. Could you please check ? > ~ > > 9. > > + /* > + * Delay apply until all tablesync workers have reached READY state. If > + we > + * allow the delay during the catchup phase, once we reach the limit of > + * tablesync workers, it will impose a delay for each subsequent worker. > + * It means it will take a long time to finish the initial table > + * synchronization. > + */ > + if (!AllTablesyncsReady()) > + return; > > "Delay apply until..." -> "The min_apply_delay parameter is ignored until..." Fixed. > ~ > > 10. > > + /* > + * The worker may be waken because of the ALTER SUBSCRIPTION ... 
> + * DISABLE, so the catalog pg_subscription should be read again. > + */ > + if (!in_remote_transaction && !in_streamed_transaction) { > + AcceptInvalidationMessages(); maybe_reread_subscription(); } } > > "waken" -> "woken" I have removed this sentence because of a new change that recalculates the diffms for any updates of the "min_apply_delay" parameter. Please have a look at maybe_delay_apply(). > ====== > > src/bin/psql/describe.c > > 11. describeSubscriptions > > + /* Origin and min_apply_delay are only supported in v16 and higher */ > if (pset.sversion >= 160000) > appendPQExpBuffer(&buf, > - ", suborigin AS \"%s\"\n", > - gettext_noop("Origin")); > + ", suborigin AS \"%s\"\n" > + ", subapplydelay AS \"%s\"\n", > + gettext_noop("Origin"), > + gettext_noop("Apply delay")); > > IIUC the psql command is supposed to display useful information to the user, so > I wondered if it is worthwhile to put the units in this column header -- "Apply > delay (ms)" instead of just "Apply delay" > because that would make it far easier to understand the meaning without > having to check the documentation to discover the units. Fixed. > ====== > > src/include/utils/timestamp.h > > 12. > > +extern int64 interval_to_ms(const Interval *interval); > + > > For consistency with the other interval conversion functions exposed here > maybe this one should have been called 'interval2ms' Fixed. > ====== > > src/test/subscription/t/032_apply_delay.pl > > 13. > > IIUC this test is checking if a delay has occurred by inspecting the debug logs to > see if a certain code path including "logical replication apply delay" is logged. I > guess that is OK, but another way might be to compare the actual timing values > of the published and replicated rows. > > The publisher table can have a column with default now() and the subscriber > side table can have an *additional* column also with default now(). After > replication, those two timestamp values can be compared to check if the > difference exceeds the min_time_delay parameter specified. Added this check. This patch now depends on a patch posted in another thread [2] for the TAP test of the "min_apply_delay" feature. Without that patch, if one backend process executes ALTER SUBSCRIPTION SET min_apply_delay while the apply worker is handling another message in apply_dispatch, the apply worker doesn't notice the new value and uses the old one for that incoming transaction. To fix this, I posted that patch together with mine. (While creating the patch, I didn't change any of the wakeup patch's code, but I adjusted the line feeds for my environment.) Kindly have a look at the updated patch. [1] - https://www.postgresql.org/message-id/CAA4eK1J9HEL-U32FwkHXLOGXPV_Fu%2Bnb%2B1KpV7hTbnqbBNnDUQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/20221122004119.GA132961@nathanxps13 Best Regards, Takamichi Osumi
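For reference, a helper along the lines of the interval_to_ms()/interval2ms() function touched on in comment 12 could look roughly like this. This is a minimal sketch assuming the Interval layout from datatype/timestamp.h and ignoring overflow handling; the actual patch may well differ:

```
int64
interval2ms(const Interval *interval)
{
	int64		days;

	days = interval->month * DAYS_PER_MONTH;	/* assumes 30-day months */
	days += interval->day;

	/* Interval->time is in microseconds; 86,400,000 ms per day. */
	return days * INT64CONST(86400000) + interval->time / INT64CONST(1000);
}
```

A real implementation would presumably also need to guard against int64 overflow for very large intervals; the INT_MAX range check quoted earlier in this review then rejects values that do not fit in the catalog column.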
On Friday, November 25, 2022 5:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > On Fri, Nov 25, 2022 at 2:15 AM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Wednesday, October 5, 2022 6:42 PM Peter Smith > <smithpb2250@gmail.com> wrote: > ... > > > > > ====== > > > > > > 5. src/backend/commands/subscriptioncmds.c - SubOpts > > > > > > @@ -89,6 +91,7 @@ typedef struct SubOpts > > > bool disableonerr; > > > char *origin; > > > XLogRecPtr lsn; > > > + int64 min_apply_delay; > > > } SubOpts; > > > > > > I feel it would be better to be explicit about the storage units. So > > > call this member ‘min_apply_delay_ms’. E.g. then other code in > > > parse_subscription_options will be more natural when you are > > > converting using and assigning them to this member. > > I don't think we use such names including units explicitly. > > Could you please tell me a similar example for this ? > > > > Regex search "\..*_ms[e\s]" finds some members where the unit is in the > member name. > > e.g. delay_ms (see EnableTimeoutParams in timeout.h) e.g. interval_in_ms (see > timeout_params in timeout.c) > > Regex search ".*_ms[e\s]" finds many local variables where the unit is in the > variable name > > > > ====== > > > > > > 16. src/include/catalog/pg_subscription.h > > > > > > + int64 subapplydelay; /* Replication apply delay */ > > > + > > > > > > Consider renaming this as 'subapplydelayms' to make the units perfectly > clear. > > Similar to the 5th comments, I can't find any examples for this. > > I'd like to keep it general, which makes me feel it is more aligned > > with existing codes. Hi, thank you for sharing this info. I searched the code for places where adding "ms" to the end of variable names would have merit. Adding the units would help when calculating or converting time-related values, but in this patch there are only a couple of such functions, like maybe_delay_apply() or, for time conversion, parse_subscription_options. I feel changing just a couple of structures might be awkward, while changing all internal structures is too much. So I have kept the names as they were, after some modifications shared in [1]. If you have a better idea, please let me know. [1] - https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, On Saturday, December 10, 2022 12:08 AM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > On Friday, December 9, 2022 3:38 PM Kuroda, Hayato/黒田 隼人 > <kuroda.hayato@fujitsu.com> wrote: > > Thanks for reporting! I have analyzed the problem and found the root cause. > > > > This feature seemed not to work on 32-bit OSes. This was because the > > calculation of delay_time was wrong. The first argument of this should > > be TimestampTz datatype, not Datum: > > > > ``` > > + /* Set apply delay */ > > + delay_until = > > TimestampTzPlusMilliseconds(TimestampTzGetDatum(ts), > > + > > + MySubscription->applydelay); > > ``` > > > > In more detail, the datum representation of int64 contains the value > > itself on 64-bit OSes, but it contains the pointer to the value on 32-bit. > > > > After modifying the issue, this will work on 32-bit environments. > Thank you for your analysis. > > Yeah, it seems we conduct addition of values to the pointer value, which is > returned from the call of TimestampTzGetDatum(), on 32-bit machine by > mistake. > > I'll remove the call in my next version. Applied this fix in the last version, shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > This is a reply for later part of your e-mail. > > > > (2) About the timeout issue > > > > > > When having a look at the physical replication internals, > > > it conducts sending feedback and application of delay separately on different > > processes. > > > OTOH, the logical replication needs to achieve those within one process. > > > > > > When we want to apply delay and avoid the timeout, > > > we should not store all the transactions data into memory. > > > So, one approach for this is to serialize the transaction data and after the delay, > > > we apply the transactions data. > > > > > > > It is not clear to me how this will avoid a timeout. > > At first, the reason why the timeout occurs is that while delaying the apply > worker neither reads messages from the walsender nor replies to it. > The worker's last_recv_timeout will be not updated because it does not receive > messages. This leads to wal_receiver_timeout. Similarly, the walsender's > last_processing will be not updated and exit due to the timeout because the > worker does not reply to upstream. > > Based on the above, we thought that workers must receive and handle messages > evenif they are delaying applying transactions. In more detail, workers must > iterate the outer loop in LogicalRepApplyLoop(). > > If workers receive transactions but they need to delay applying, they must keep > messages somewhere. So we came up with the idea that workers serialize changes > once and apply later. Our basic design is as follows: > > * All transactions areserialized to files if min_apply_delay is set to non-zero. > * After receiving the commit message and spending time, workers reads and > applies spooled messages > I think this may be more work than required because in some cases doing I/O just to delay xacts will later lead to more work. Can't we send some ping to walsender to communicate that walreceiver is alive? We already seem to be sending a ping in LogicalRepApplyLoop if we haven't heard anything from the server for more than wal_receiver_timeout / 2. Now, it is possible that the walsender is terminated due to some other reason and we need to see if we can detect that or if it will only be detected once the walreceiver's delay time is over. -- With Regards, Amit Kapila.
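For context, the ping logic referred to here lives in the WL_TIMEOUT branch of LogicalRepApplyLoop() and works roughly as follows (a trimmed paraphrase rather than verbatim source; last_recv_timestamp, ping_sent, last_received and send_feedback() are worker.c internals):

```
		if (rc & WL_TIMEOUT)
		{
			/* ... */
			if (wal_receiver_timeout > 0)
			{
				TimestampTz now = GetCurrentTimestamp();
				TimestampTz timeout;

				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
													  wal_receiver_timeout);
				if (now >= timeout)
					ereport(ERROR,
							(errcode(ERRCODE_CONNECTION_FAILURE),
							 errmsg("terminating logical replication worker due to timeout")));

				/* Check to see if it's time for a ping. */
				if (!ping_sent)
				{
					timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
														  (wal_receiver_timeout / 2));
					if (now >= timeout)
					{
						requestReply = true;
						ping_sent = true;
					}
				}
			}

			send_feedback(last_received, requestReply, requestReply);
		}
```

While the worker is parked inside the proposed apply delay it does not come back around to this branch, which is why the PoC discussed later moves a similar wal_receiver_timeout/2 wake-up into the delay itself.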
Hello. At Mon, 12 Dec 2022 07:42:30 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > On Monday, December 12, 2022 2:54 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > I asked about unexpected walsender termination caused by this feature but I > > think I didn't received an answer for it and the behavior is still exists. .. > Thank you so much for your report! > Yes. Currently, how to deal with the timeout issue is under discussion. > Some analysis about the root cause are also there. > > Kindly have a look at [1]. > > > [1] - https://www.postgresql.org/message-id/TYAPR01MB58669394A67F2340B82E42D1F5E29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Oops. Thank you for the pointer. Will visit there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > once and apply later. Our basic design is as follows: > > > > * All transactions areserialized to files if min_apply_delay is set to non-zero. > > * After receiving the commit message and spending time, workers reads and > > applies spooled messages > > > > I think this may be more work than required because in some cases > doing I/O just to delay xacts will later lead to more work. Can't we > send some ping to walsender to communicate that walreceiver is alive? > We already seem to be sending a ping in LogicalRepApplyLoop if we > haven't heard anything from the server for more than > wal_receiver_timeout / 2. Now, it is possible that the walsender is > terminated due to some other reason and we need to see if we can > detect that or if it will only be detected once the walreceiver's > delay time is over. FWIW, I thought the same thing as Amit. What we should do here is have logrep workers notify the walsender that they are alive and that the communication in between is fine, and maybe report the worker's status. Spontaneous send_feedback() calls while delaying would be sufficient for this purpose. We might need to suppress extra forced feedback messages instead. In contrast, the worker doesn't need to bother knowing whether the peer is alive until it receives the next data. But we might need to adjust the wait_time in LogicalRepApplyLoop(). But I'm not sure what will happen when the walsender is blocked by a full buffer for a long time. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund <andres@anarazel.de> wrote > in > > Hi, > > > > The tests fail on cfbot: > > https://cirrus-ci.com/task/4533866329800704 > > > > They only seem to fail on 32bit linux. > > > > https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/bu > > ild-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_ > > delay > > [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to > > subscriber timed out waiting for match: (?^:logical replication apply delay) at > /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124. > > It fails for me on 64bit Linux.. (Rocky 8.7) > > > t/032_apply_delay.pl ............... Dubious, test returned 29 (wstat > > 7424, 0x1d00) No subtests run > .. > > t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0) > > Non-zero exit status: 29 > > Parse errors: No plan found in TAP output Hi, Horiguchi-san Sorry for being late. We couldn't reproduce this failure and find the same type of failure on the cfbot from the past failures. It seems no subtests run in your environment. Could you please share the log files, if you have or when you can reproduce this ? FYI, the latest patch is attached in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
At Tue, 13 Dec 2022 02:28:49 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > We couldn't reproduce this failure and find the same type of failure on the cfbot from the past failures. > It seems no subtests run in your environment. Very sorry for that. The test script turned out to be a left-over file in a git-reset'ed working tree. Please forget about it. FWIW, the latest patch passed make-world for me on Rocky8/x86_64. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tuesday, December 13, 2022 1:27 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Tue, 13 Dec 2022 02:28:49 +0000, "Takamichi Osumi (Fujitsu)" > <osumi.takamichi@fujitsu.com> wrote in > > On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > We couldn't reproduce this failure and find the same type of failure > > on the cfbot from the past failures. > > It seems no subtests run in your environment. > > Very sorry for that. The test script is found to be a left-over file in a git-reset'ed > working tree. Please forget about it. > > FWIW, the latest patch passed make-world for me on Rocky8/x86_64. Hi, No problem at all. Also, thank you for your testing and confirming the latest one! Best Regards, Takamichi Osumi
On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > once and apply later. Our basic design is as follows: > > > > > > * All transactions areserialized to files if min_apply_delay is set to non-zero. > > > * After receiving the commit message and spending time, workers reads and > > > applies spooled messages > > > > > > > I think this may be more work than required because in some cases > > doing I/O just to delay xacts will later lead to more work. Can't we > > send some ping to walsender to communicate that walreceiver is alive? > > We already seem to be sending a ping in LogicalRepApplyLoop if we > > haven't heard anything from the server for more than > > wal_receiver_timeout / 2. Now, it is possible that the walsender is > > terminated due to some other reason and we need to see if we can > > detect that or if it will only be detected once the walreceiver's > > delay time is over. > > FWIW, I thought the same thing with Amit. > > What we should do here is logrep workers notifying to walsender that > it's living and the communication in-between is fine, and maybe the > worker's status. Spontaneous send_feedback() calls while delaying will > be sufficient for this purpose. We might need to supress extra forced > feedbacks instead. In contrast the worker doesn't need to bother to > know whether the peer is living until it receives the next data. But > we might need to adjust the wait_time in LogicalRepApplyLoop(). > > But, I'm not sure what will happen when walsender is blocked by > buffer-full for a long time. > Yeah, I think ideally it will timeout but if we have a solution like during delay, we keep sending ping messages time-to-time, it should work fine. However, that needs to be verified. Do you see any reasons why that won't work? -- With Regards, Amit Kapila.
At Tue, 13 Dec 2022 17:05:35 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > Yeah, I think ideally it will timeout but if we have a solution like > during delay, we keep sending ping messages time-to-time, it should > work fine. However, that needs to be verified. Do you see any reasons > why that won't work? Ah, by "I'm not sure" I meant "I have no clear idea whether it will work". I looked into it a bit further. In the end, ProcessPendingWrites() waits for the streaming socket to become writable, so no critical problem is found here. That being said, it might be better for ProcessPendingWrites() to refrain from sending consecutive keepalives while waiting; a 30s ping timeout and a 1h delay may result in 120 successive pings. It might not be a big deal, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Dear Horiguchi-san, Amit, > > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila > <amit.kapila16@gmail.com> wrote in > > Yeah, I think ideally it will timeout but if we have a solution like > > during delay, we keep sending ping messages time-to-time, it should > > work fine. However, that needs to be verified. Do you see any reasons > > why that won't work? I have implemented and tested a PoC in which workers wake up every wal_receiver_timeout/2 and send a keepalive. Basically it works well, but I found two problems. Do you have any good suggestions about them? 1) With this PoC at present, workers calculate the sending interval based on their wal_receiver_timeout, and sending is suppressed when the parameter is set to zero. This means there is a possibility that the walsender times out when wal_sender_timeout on the publisher and wal_receiver_timeout on the subscriber differ. Suppose wal_sender_timeout is 2min, wal_receiver_timeout is 5min, and min_apply_delay is 10min. The worker on the subscriber will wake up every 2.5min and send keepalives, but the walsender exits before the message arrives at the publisher. One idea to avoid that is to send the min_apply_delay subscriber option to the publisher and compare them, but that may not be sufficient, because the XXX_timeout GUC parameters could be modified later. 2) The issue reported by Vignesh-san [1] still remains. I have already analyzed it [2]; the root cause is that the flushed WAL position is not updated and sent to the publisher. Even if workers send keepalive messages to the publisher during the delay, the flushed position cannot advance. [1]: https://www.postgresql.org/message-id/CALDaNm1vT8qNBqHivtAgYur-5-YkwF026VHtw9srd4fsdeaufA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/TYAPR01MB5866F6BE7399E6343A96E016F51C9%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Dec 9, 2022 at 10:49 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Hi Vignesh, > > > In the case of physical replication by setting > > recovery_min_apply_delay, I noticed that both primary and standby > > nodes were getting stopped successfully immediately after the stop > > server command. In case of logical replication, stop server fails: > > pg_ctl -D publisher -l publisher.log stop -c > > waiting for server to shut > > down............................................................... > > failed > > pg_ctl: server does not shut down > > > > In case of logical replication, the server does not get stopped > > because the walsender process is not able to exit: > > ps ux | grep walsender > > vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08 > > postgres: walsender vignesh [local] START_REPLICATION > > Thanks for reporting the issue. I analyzed about it. > > > This issue has occurred because the apply worker cannot reply during the delay. > I think we may have to modify the mechanism that delays applying transactions. > > When walsender processes are requested to shut down, it can shut down only after > that all the sent WALs are replicated on the subscriber. This check is done in > WalSndDone(), and the replicated position will be updated when processes handle > the reply messages from a subscriber, in ProcessStandbyReplyMessage(). > > In the case of physical replication, the walreciever can receive WALs and reply > even if the application is delayed. It means that the replicated position will > be transported to the publisher side immediately. So the walsender can exit. > I think it is not only the replicated positions but it also checks if there is any pending send in WalSndDone(). Why is it a must to send all pending WAL and confirm that it is flushed on standby before the shutdown for physical standby? Is it because otherwise, we may lose the required WAL? I am asking because it is better to see if those conditions apply to logical replication as well. -- With Regards, Amit Kapila.
On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Horiguchi-san, Amit, > > > > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila > > <amit.kapila16@gmail.com> wrote in > > > Yeah, I think ideally it will timeout but if we have a solution like > > > during delay, we keep sending ping messages time-to-time, it should > > > work fine. However, that needs to be verified. Do you see any reasons > > > why that won't work? > > I have implemented and tested that workers wake up per wal_receiver_timeout/2 > and send keepalive. Basically it works well, but I found two problems. > Do you have any good suggestions about them? > > 1) > > With this PoC at present, workers calculate sending intervals based on its > wal_receiver_timeout, and it is suppressed when the parameter is set to zero. > > This means that there is a possibility that walsender is timeout when wal_sender_timeout > in publisher and wal_receiver_timeout in subscriber is different. > Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is 5min, > and min_apply_delay is 10min. The worker on subscriber will wake up per 2.5min and > send keepalives, but walsender exits before the message arrives to publisher. > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > could be modified later. > How about restarting the apply worker if min_apply_delay changes? Will that be sufficient? -- With Regards, Amit Kapila.
At Wed, 14 Dec 2022 10:46:17 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > I have implemented and tested that workers wake up per wal_receiver_timeout/2 > and send keepalive. Basically it works well, but I found two problems. > Do you have any good suggestions about them? > > 1) > > With this PoC at present, workers calculate sending intervals based on its > wal_receiver_timeout, and it is suppressed when the parameter is set to zero. > > This means that there is a possibility that walsender is timeout when wal_sender_timeout > in publisher and wal_receiver_timeout in subscriber is different. > Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is 5min, It seems to me wal_receiver_status_interval is better suited for this use. It's enough for us to document that "wal_receiver_status_interval should be shorter than wal_sender_timeout/2, especially when a logical replication connection is using min_apply_delay. Otherwise you will suffer repeated termination of the walsender". > and min_apply_delay is 10min. The worker on subscriber will wake up per 2.5min and > send keepalives, but walsender exits before the message arrives to publisher. > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > could be modified later. # Anyway, I don't think such an asymmetric setup is preferable. > 2) > > The issue reported by Vignesh-san[1] has still remained. I have already analyzed that [2], > the root cause is that flushed WAL is not updated and sent to the publisher. Even > if workers send keepalive messages to pub during the delay, the flushed position > cannot be modified. I didn't look closely, but I guess the cause is that the walsender doesn't die until all WAL has been sent, while the logical delay chokes the replication stream. Allowing the walsender to finish while ignoring replication status wouldn't be great. One idea is to let logical workers send a "delaying" status. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > > could be modified later. > > > > How about restarting the apply worker if min_apply_delay changes? Will > that be sufficient? Mmm. If the publisher knows that value, isn't it able to delay *sending* the data in the first place? That would resolve many of the known issues, including the walsender's un-terminatability, possible buffer-full situations, and status packet exchange. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > > > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > > > could be modified later. > > > > > > > How about restarting the apply worker if min_apply_delay changes? Will > > that be sufficient? > > Mmm. If publisher knows that value, isn't it albe to delay *sending* > data in the first place? This will resolve many known issues including > walsender's un-terminatability, possible buffer-full and status packet > exchanging. > Yeah, but won't it change the meaning of this parameter? Say the subscriber was busy enough that it doesn't need to add an additional delay before applying a particular transaction(s) but adding a delay to such a transaction on the publisher will actually make it take much longer to reflect than expected. We probably need to name this parameter as min_send_delay if we want to do what you are saying and then I don't know if it serves the actual need and also it will be different from what we do in physical standby. -- With Regards, Amit Kapila.
On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 14 Dec 2022 10:46:17 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > I have implemented and tested that workers wake up per wal_receiver_timeout/2 > > and send keepalive. Basically it works well, but I found two problems. > > Do you have any good suggestions about them? > > > > 1) > > > > With this PoC at present, workers calculate sending intervals based on its > > wal_receiver_timeout, and it is suppressed when the parameter is set to zero. > > > > This means that there is a possibility that walsender is timeout when wal_sender_timeout > > in publisher and wal_receiver_timeout in subscriber is different. > > Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is 5min, > > It seems to me wal_receiver_status_interval is better for this use. > It's enough for us to docuemnt that "wal_r_s_interval should be > shorter than wal_sener_timeout/2 especially when logical replication > connection is using min_apply_delay. Otherwise you will suffer > repeated termination of walsender". > This sounds reasonable to me. > > and min_apply_delay is 10min. The worker on subscriber will wake up per 2.5min and > > send keepalives, but walsender exits before the message arrives to publisher. > > > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > > could be modified later. > > # Anyway, I don't think such asymmetric setup is preferable. > > > 2) > > > > The issue reported by Vignesh-san[1] has still remained. I have already analyzed that [2], > > the root cause is that flushed WAL is not updated and sent to the publisher. Even > > if workers send keepalive messages to pub during the delay, the flushed position > > cannot be modified. > > I didn't look closer but the cause I guess is walsender doesn't die > until all WAL has been sent, while logical delay chokes replication > stream. > Right, I also think so. > Allowing walsender to finish ignoring replication status > wouldn't be great. > Yes, that would be ideal. But do you know why that is a must? > One idea is to let logical workers send delaying > status. > How can that help? -- With Regards, Amit Kapila.
At Thu, 15 Dec 2022 09:23:12 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > Allowing walsender to finish ignoring replication status > > wouldn't be great. > > > > Yes, that would be ideal. But do you know why that is a must? I believe a graceful shutdown (fast and smart) of a replication set is expected to leave it in sync. Of course we can change the policy to allow the walsender to stop before confirming that all WAL has been applied. However, the walsender has no idea whether the peer is intentionally delaying or not. > > One idea is to let logical workers send delaying > > status. > > > > How can that help? If the logical worker notifies "I'm intentionally pausing replication for now, so if you want to shut down, please go ahead and ignore me", the publisher can legitimately run a (kind of) dirty shutdown. # It looks a bit too much, though.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > > > <kuroda.hayato@fujitsu.com> wrote: > > > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > > > > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > > > > could be modified later. > > > > > > > > > > How about restarting the apply worker if min_apply_delay changes? Will > > > that be sufficient? > > > > Mmm. If publisher knows that value, isn't it albe to delay *sending* > > data in the first place? This will resolve many known issues including > > walsender's un-terminatability, possible buffer-full and status packet > > exchanging. > > > > Yeah, but won't it change the meaning of this parameter? Say the It changes internally, but not on its face. The difference is only in where the choking point exists. If ".._apply_delay" should work literally, we should go the way Kuroda-san proposed, namely "the apply worker has received the data, but will delay applying it". If we named it technically correctly for the current behavior, it would be "min_receive_delay" or "min_choking_interval". > subscriber was busy enough that it doesn't need to add an additional > delay before applying a particular transaction(s) but adding a delay > to such a transaction on the publisher will actually make it take much > longer to reflect than expected. We probably need to name this Doesn't the name min_apply_delay imply the same behavior, even though the delay time will be a bit prolonged? > parameter as min_send_delay if we want to do what you are saying and > then I don't know if it serves the actual need and also it will be > different from what we do in physical standby. In the first place, physical and logical replication work differently, and the mechanism for delaying "apply" already differs even in the current state, in that the logical replication delay chokes the stream. I guess they cannot be different in terms of normal operation. But I'm not sure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > > > > <kuroda.hayato@fujitsu.com> wrote: > > > > > One idea to avoid that is to send the min_apply_delay subscriber option to publisher > > > > > and compare them, but it may be not sufficient. Because XXX_timout GUC parameters > > > > > could be modified later. > > > > > > > > > > > > > How about restarting the apply worker if min_apply_delay changes? Will > > > > that be sufficient? > > > > > > Mmm. If publisher knows that value, isn't it albe to delay *sending* > > > data in the first place? This will resolve many known issues including > > > walsender's un-terminatability, possible buffer-full and status packet > > > exchanging. > > > > > > > Yeah, but won't it change the meaning of this parameter? Say the > > Internally changes, but does not change on its face. The difference is > only in where the choking point exists. If ".._apply_delay" should > work literally, we should go the way Kuroda-san proposed. Namely, > "apply worker has received the data, but will deilay applying it". If > we technically name it correctly for the current behavior, it would be > "min_receive_delay" or "min_choking_interval". > > > subscriber was busy enough that it doesn't need to add an additional > > delay before applying a particular transaction(s) but adding a delay > > to such a transaction on the publisher will actually make it take much > > longer to reflect than expected. We probably need to name this > > Isn't the name min_apply_delay implying the same behavior? Even though > the delay time will be a bit prolonged. > Sorry, I don't understand what you intend to say in this point. In above, I mean that the currently proposed patch won't have such a problem but if we apply delay on publisher the problem can happen. > > parameter as min_send_delay if we want to do what you are saying and > > then I don't know if it serves the actual need and also it will be > > different from what we do in physical standby. > > In the first place phisical and logical replication works differently > and the mechanism to delaying "apply" differs even in the current > state in terms of logrep delay choking stream. > I think the first preference is to make it work in a similar way (as much as possible) to how this parameter works in physical standby and if that is not at all possible then we may consider other approaches. -- With Regards, Amit Kapila.
At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > subscriber was busy enough that it doesn't need to add an additional > > > delay before applying a particular transaction(s) but adding a delay > > > to such a transaction on the publisher will actually make it take much > > > longer to reflect than expected. We probably need to name this > > > > Isn't the name min_apply_delay implying the same behavior? Even though > > the delay time will be a bit prolonged. > > > > Sorry, I don't understand what you intend to say in this point. In > above, I mean that the currently proposed patch won't have such a > problem but if we apply delay on publisher the problem can happen. Are you saying that the sender-side delay lets the whole transaction (if it has not been streamed out) stay on the sender side? If so... yeah, I agree that it is undesirable. > > > parameter as min_send_delay if we want to do what you are saying and > > > then I don't know if it serves the actual need and also it will be > > > different from what we do in physical standby. > > > > In the first place, physical and logical replication work differently > > and the mechanism for delaying "apply" differs even in the current > > state in terms of the logrep delay choking the stream. > > > > I think the first preference is to make it work in a similar way (as > much as possible) to how this parameter works in physical standby and > if that is not at all possible then we may consider other approaches. I understood that. However, I still think choking the stream on the receiver side alone is kind of ugly since it breaks the protocol assumption, that is, that in-band maintenance packets are processed in an on-time manner on the peer under normal operation (even though some delays may be involved for natural reasons). In this regard, I'm inclined to be in favor of Kuroda-san's proposal. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Dear Horiguchi-san, Amit, > > Yes, that would be ideal. But do you know why that is a must? > > I believe a graceful shutdown (fast and smart) of a replication set is expected to > be in sync. Of course we can change the policy to allow walsender to stop before > confirming all WAL has been applied. However, walsender doesn't have an idea > of whether the peer is intentionally delaying or not. This mechanism was introduced by 985bd7 [1], which was needed to support a "clean" switchover. I think it is needed for physical replication, but it is not clear for the logical case. When the postmaster is stopped in fast or smart mode, we expect that all modifications have been received by the secondary. This requirement seems not to have changed since the initial commit. Before 985bd7, the walsender exited just after sending the final WAL, which meant that sometimes the last packet could not reach the secondary. So there was a possibility of failing to reboot the primary as a new secondary because the new primary did not have the last WAL record. To avoid the above, walsender started waiting for the flush before exiting. But in the case of logical replication, I'm not sure whether this limitation is really needed or not. I think it may be OK for walsender to exit without waiting, in the case of delayed applies. Because we don't have to consider the above issue for logical replication. [1]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459 Best Regards, Hayato Kuroda FUJITSU LIMITED
On Thu, Dec 15, 2022 at 1:42 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Horiguchi-san, Amit, > > > > Yes, that would be ideal. But do you know why that is a must? > > > > I believe a graceful shutdown (fast and smart) of a replication set is expected to > > be in sync. Of course we can change the policy to allow walsnder to stop before > > confirming all WAL have been applied. However walsender doesn't have an idea > > of wheter the peer is intentionally delaying or not. > > This mechanism was introduced by 985bd7[1], which was needed to support a > "clean" switchover. I think it is needed for physical replication, but it is not > clear for the logical case. > > When the postmaster is stopped in fast or smart mode, we expected that all > modifications were received by secondary. This requirement seems to be not changed > from the initial commit. > > Before 985bd7, the walsender exited just after sending the final WAL, which meant > that sometimes the last packet could not reach to secondary. So there was a possibility > of failing to reboot the primary as a new secondary because the new primary does > not have the last WAL record. To avoid the above walsender started waiting for > flush before exiting. > > But in the case of logical replication, I'm not sure whether this limitation is > really needed or not. I think it may be OK that walsender exits without waiting, > in case of delaying applies. Because we don't have to consider the above issue > for logical replication. > I also don't see the need for this mechanism for logical replication, and in fact, why do we need to even wait for sending the existing WAL? I think the reason why we don't need to wait for logical replication is that after the restart, we always start sending WAL from the location requested by the subscriber, or till the point where the publisher knows the confirmed flush location of the subscriber. Consider another case where after restart publisher (node-1) wants to act as a subscriber for the previous subscriber (node-2). Now, the new subscriber (node-1) won't have a way to tell the new publisher (node-2) that starts from the location where the node-1 went down as WAL locations between publisher and subscriber need not be same. This brings us to the question of whether users can use logical replication for the scenario where they want the old master to follow the new master after the restart which we typically do in physical replication, if so how? Another related point to consider is what is the behavior of synchronous replication when shutdown has been performed both in the case of physical and logical replication especially when the time-delayed replication feature is enabled? -- With Regards, Amit Kapila.
On Thu, Dec 15, 2022 at 11:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > > > > <horikyota.ntt@gmail.com> wrote: > > > > subscriber was busy enough that it doesn't need to add an additional > > > > delay before applying a particular transaction(s) but adding a delay > > > > to such a transaction on the publisher will actually make it take much > > > > longer to reflect than expected. We probably need to name this > > > > > > Isn't the name min_apply_delay implying the same behavior? Even though > > > the delay time will be a bit prolonged. > > > > > > > Sorry, I don't understand what you intend to say in this point. In > > above, I mean that the currently proposed patch won't have such a > > problem but if we apply delay on publisher the problem can happen. > > Are you saing about the sender-side delay lets the whole transaction > (if it have not streamed out) stay on the sender side? > It will not stay on the sender side forever but rather will be sent after the min_apply_delay. The point I wanted to raise is that maybe the delay won't need to be applied where we will end up delaying it. Because when we apply the delay on apply side, it will take into account the other load of apply side. I don't know how much it matters but it appears logical to add the delay on applying side. -- With Regards, Amit Kapila.
Dear Amit, > I also don't see the need for this mechanism for logical replication, > and in fact, why do we need to even wait for sending the existing WAL? Do you mean that logicalrep walsenders do not have to track WalSndCaughtUp and any pending data in the output buffer? > I think the reason why we don't need to wait for logical replication > is that after the restart, we always start sending WAL from the > location requested by the subscriber, or till the point where the > publisher knows the confirmed flush location of the subscriber. > Consider another case where after restart publisher (node-1) wants to > act as a subscriber for the previous subscriber (node-2). Now, the new > subscriber (node-1) won't have a way to tell the new publisher > (node-2) that starts from the location where the node-1 went down as > WAL locations between publisher and subscriber need not be same. You mean to say that such a mechanism was made to support switchover, but logical replication cannot do the same because the new subscriber cannot request changes that are definitively unknown to it, right? It seems reasonable to me. > This brings us to the question of whether users can use logical > replication for the scenario where they want the old master to follow > the new master after the restart which we typically do in physical > replication, if so how? Maybe to support such a use-case, 2-way replication is needed (but this is out of scope for this thread). > Another related point to consider is what is the behavior of > synchronous replication when shutdown has been performed both in the > case of physical and logical replication especially when the > time-delayed replication feature is enabled? In physical replication without any failures, it seems that users can stop the primary server even if the application of WAL is delayed on the secondary. This is because sent WAL is immediately flushed on the secondary and walreceiver replies with its position. The transaction has already been committed at that time, and the transported changes will be applied on the secondary after the delay. IIUC we can achieve that when logical walsenders do not consider the remote status while shutting down, but I want to hear another opinion and we must confirm by testing... Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Dec 16, 2022 at 12:11 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > > I also don't see the need for this mechanism for logical replication, > > and in fact, why do we need to even wait for sending the existing WAL? > > Is it meant that logicalrep walsenders do not have to track WalSndCaughtUp and > any pending data in the output buffer? > I haven't checked the details but I think what you are saying is correct. > > > Another related point to consider is what is the behavior of > > synchronous replication when shutdown has been performed both in the > > case of physical and logical replication especially when the > > time-delayed replication feature is enabled? > > In physical replication without any failures, it seems that users can stop primary > server even if the applications are delaying on secondary. This is because sent WALs > are immediately flushed on secondary and walreceiver replies its position. > What happens when synchronous_commit's value is remote_apply and the user has also set synchronous_standby_names to corresponding standby? -- With Regards, Amit Kapila.
Dear Amit, > > > Another related point to consider is what is the behavior of > > > synchronous replication when shutdown has been performed both in the > > > case of physical and logical replication especially when the > > > time-delayed replication feature is enabled? > > > > In physical replication without any failures, it seems that users can stop primary > > server even if the applications are delaying on secondary. This is because sent > WALs > > are immediately flushed on secondary and walreceiver replies its position. > > > > What happens when synchronous_commit's value is remote_apply and the > user has also set synchronous_standby_names to corresponding standby? Even if synchronous_commit is set to remote_apply, the primary server can be shut down. The reason why walsender can exit is that it does not care about the status whether WALs are "applied" or not. It just checks the "flushed" WAL position, not applied one. I think we should start another thread about changing the shut-down condition, so forked[1]. [1]: https://www.postgresql.org/message-id/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
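For anyone who wants to reproduce the scenario being asked about, a rough sketch of the synchronous setup might look like the following (standard GUCs; the name 'mysub' is a placeholder for the standby's or subscription's application_name). Per the explanation above, a shutdown of the primary still only waits for the flushed position, not the applied one.

```
-- On the publisher/primary:
ALTER SYSTEM SET synchronous_standby_names = 'mysub';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();

-- Commits now wait until the synchronous peer reports the WAL as applied,
-- which is exactly the case the question above probes for time-delayed apply.
```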
Hi, On Thursday, December 15, 2022 12:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> > wrote: > > > > At Wed, 14 Dec 2022 10:46:17 +0000, "Hayato Kuroda (Fujitsu)" > > <kuroda.hayato@fujitsu.com> wrote in > > > I have implemented and tested that workers wake up per > > > wal_receiver_timeout/2 and send keepalive. Basically it works well, but I > found two problems. > > > Do you have any good suggestions about them? > > > > > > 1) > > > > > > With this PoC at present, workers calculate sending intervals based > > > on its wal_receiver_timeout, and it is suppressed when the parameter is set > to zero. > > > > > > This means that there is a possibility that walsender is timeout > > > when wal_sender_timeout in publisher and wal_receiver_timeout in > subscriber is different. > > > Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is > > > 5min, > > > > It seems to me wal_receiver_status_interval is better for this use. > > It's enough for us to docuemnt that "wal_r_s_interval should be > > shorter than wal_sener_timeout/2 especially when logical replication > > connection is using min_apply_delay. Otherwise you will suffer > > repeated termination of walsender". > > > > This sounds reasonable to me. Okay, I changed the time interval to wal_receiver_status_interval and added this description about timeout. FYI, wal_receiver_status_interval by definition specifies the minimum frequency for the WAL receiver process to send information to the upstream. So I utilized the value for WaitLatch directly. My descriptions of the documentation change follow it. > > > and min_apply_delay is 10min. The worker on subscriber will wake up > > > per 2.5min and send keepalives, but walsender exits before the message > arrives to publisher. > > > > > > One idea to avoid that is to send the min_apply_delay subscriber > > > option to publisher and compare them, but it may be not sufficient. > > > Because XXX_timout GUC parameters could be modified later. > > > > # Anyway, I don't think such asymmetric setup is preferable. > > > > > 2) > > > > > > The issue reported by Vignesh-san[1] has still remained. I have > > > already analyzed that [2], the root cause is that flushed WAL is not > > > updated and sent to the publisher. Even if workers send keepalive > > > messages to pub during the delay, the flushed position cannot be modified. > > > > I didn't look closer but the cause I guess is walsender doesn't die > > until all WAL has been sent, while logical delay chokes replication > > stream. For the (2) issue, a new thread has been created independently from this thread in [1]. I'll leave any new changes to the thread on this point. Attached the updated patch. Again, I used one basic patch in another thread to wake up logical replication worker shared in [2] for TAP tests. [1] - https://www.postgresql.org/message-id/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com [2] - https://www.postgresql.org/message-id/flat/20221122004119.GA132961%40nathanxps13 Best Regards, Takamichi Osumi
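As a way to observe this from the outside (the query below is not part of the patch, just how one could check it), the keepalives sent every wal_receiver_status_interval during the delay should be visible on the publisher while the reported positions stay put:

```
-- Run on the publisher while the subscriber is waiting out min_apply_delay.
SELECT application_name, state, sent_lsn, flush_lsn, replay_lsn, reply_time
FROM pg_stat_replication;
-- Expectation: reply_time keeps advancing (keepalives are arriving), but
-- flush_lsn/replay_lsn do not move until the delayed transaction is applied.
```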
On Thursday, December 22, 2022 3:02 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > Attached the updated patch. > Again, I used one basic patch in another thread to wake up logical replication > worker shared in [2] for TAP tests. The v11 caused a cfbot failure in [1], but the failed tests looked unrelated to this feature to me at present. While waiting for another test execution of cfbot, I'd like to check the detailed reason and update the patch if necessary. [1] - https://cirrus-ci.com/task/4580705867399168 Best Regards, Takamichi Osumi
On Fri, Dec 23, 2022 at 9:16 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Thursday, December 22, 2022 3:02 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > Attached the updated patch. > > Again, I used one basic patch in another thread to wake up logical replication > > worker shared in [2] for TAP tests. > The v11 caused a cfbot failure in [1]. But, failed tests looked irrelevant > to the feature to me at present. > I have done some review for the patch and I have a few comments. 1. A. + <literal>wal_sender_timeout</literal> on the publisher. Otherwise, the + walsender repeatedly terminates due to timeout during the delay of + the subscriber. B. +/* + * In order to avoid walsender's timeout during time delayed replication, + * it's necessaary to keep sending feedbacks during the delay from the worker + * process. Meanwhile, the feature delays the apply before starting the + * transaction and thus we don't write WALs for the suspended changes during + * the wait. Hence, in the case the worker process sends a feedback during the + * delay, avoid having positions of the flushed and apply LSN overwritten by + * the latest LSN. + */ - Seems like these two statements are conflicting, I mean if we are sending feedback then why the walsender will timeout? - Typo /necessaary/necessary 2. + * + * During the time delayed replication, avoid reporting the suspeended + * latest LSN are already flushed and written, to the publisher. */ Typo /suspeended/suspended 3. + if (wal_receiver_status_interval > 0 + && diffms > wal_receiver_status_interval) + { + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + (long) wal_receiver_status_interval, + WAIT_EVENT_RECOVERY_APPLY_DELAY); + send_feedback(last_received, true, false); + } + else + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + diffms, + WAIT_EVENT_RECOVERY_APPLY_DELAY); I think here we should add some comments to explain about sending feedback, something like what we have explained at the time of defining the "in_delaying_apply" variable. 4. + * Although the delay is applied in BEGIN messages, streamed transactions + * apply the delay in a STREAM COMMIT message. That's ok because no + * changes have been applied yet (apply_spooled_messages() will do it). + * The STREAM START message would be a natural choice for this delay but + * there is no commit time yet (it will be available when the in-progress + * transaction finishes), hence, it was not possible to apply a delay at + * that time. + */ + maybe_delay_apply(commit_data.committime); I am wondering how this will interact with the parallel apply worker where we do not spool the data in file? How are we going to get the commit time of the transaction without applying the changes? 5. + /* + * The following operations use these special functions to detect + * overflow. Number of ms per informed days. + */ This comment doesn't make much sense, I think this needs to be rephrased. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 26, 2022 at 2:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Dec 23, 2022 at 9:16 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > 4. > > + * Although the delay is applied in BEGIN messages, streamed transactions > + * apply the delay in a STREAM COMMIT message. That's ok because no > + * changes have been applied yet (apply_spooled_messages() will do it). > + * The STREAM START message would be a natural choice for this delay but > + * there is no commit time yet (it will be available when the in-progress > + * transaction finishes), hence, it was not possible to apply a delay at > + * that time. > + */ > + maybe_delay_apply(commit_data.committime); > > I am wondering how this will interact with the parallel apply worker > where we do not spool the data in file? How are we going to get the > commit time of the transaction without applying the changes? > There is no sane way to do this. So, I think these features won't work together, we can disable parallelism when this is active. Considering that parallel apply is to speed up the transactions apply and this feature is to slow down the apply, so even if they don't work together that should be okay. Does that make sense? -- With Regards, Amit Kapila.
On Mon, Dec 26, 2022 at 2:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 26, 2022 at 2:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Dec 23, 2022 at 9:16 PM Takamichi Osumi (Fujitsu) > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > 4. > > > > + * Although the delay is applied in BEGIN messages, streamed transactions > > + * apply the delay in a STREAM COMMIT message. That's ok because no > > + * changes have been applied yet (apply_spooled_messages() will do it). > > + * The STREAM START message would be a natural choice for this delay but > > + * there is no commit time yet (it will be available when the in-progress > > + * transaction finishes), hence, it was not possible to apply a delay at > > + * that time. > > + */ > > + maybe_delay_apply(commit_data.committime); > > > > I am wondering how this will interact with the parallel apply worker > > where we do not spool the data in file? How are we going to get the > > commit time of the transaction without applying the changes? > > > > There is no sane way to do this. Yeah, there is no sane way to do it. So, I think these features won't work > together, we can disable parallelism when this is active. Considering > that parallel apply is to speed up the transactions apply and this > feature is to slow down the apply, so even if they don't work together > that should be okay. Does that make sense? Yes, this makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 26, 2022 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Dec 26, 2022 at 2:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 26, 2022 at 2:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Dec 23, 2022 at 9:16 PM Takamichi Osumi (Fujitsu) > > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > > > > 4. > > > > > > + * Although the delay is applied in BEGIN messages, streamed transactions > > > + * apply the delay in a STREAM COMMIT message. That's ok because no > > > + * changes have been applied yet (apply_spooled_messages() will do it). > > > + * The STREAM START message would be a natural choice for this delay but > > > + * there is no commit time yet (it will be available when the in-progress > > > + * transaction finishes), hence, it was not possible to apply a delay at > > > + * that time. > > > + */ > > > + maybe_delay_apply(commit_data.committime); > > > > > > I am wondering how this will interact with the parallel apply worker > > > where we do not spool the data in file? How are we going to get the > > > commit time of the transaction without applying the changes? > > > > > > > There is no sane way to do this. > > Yeah, there is no sane way to do it. > > So, I think these features won't work > > together, we can disable parallelism when this is active. Considering > > that parallel apply is to speed up the transactions apply and this > > feature is to slow down the apply, so even if they don't work together > > that should be okay. Does that make sense? > > Yes, this makes sense. > BTW, the blocking problem with this patch is to deal with shutdown as discussed in the thread [1]. In short, the problem is that at shutdown, we wait for walsender to send all pending data and ensure all data is flushed in the remote node. But, if the other node is waiting due to a time-delayed apply then shutdown won't be successful. It would be really great if you can let us know your thoughts in the thread [1] as that can help to move this work forward. [1] - https://www.postgresql.org/message-id/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Tue, Dec 27, 2022 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > BTW, the blocking problem with this patch is to deal with shutdown as > discussed in the thread [1]. I will have a look. In short, the problem is that at > shutdown, we wait for walsender to send all pending data and ensure > all data is flushed in the remote node. But, if the other node is > waiting due to a time-delayed apply then shutdown won't be successful. > It would be really great if you can let us know your thoughts in the > thread [1] as that can help to move this work forward. Okay, so you mean to say that with logical the shutdown will be delayed until all the changes are applied on the subscriber but the same is not true for physical standby? Is it because on physical standby we flush the WAL before applying? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 27, 2022 at 11:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Dec 27, 2022 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > BTW, the blocking problem with this patch is to deal with shutdown as > > discussed in the thread [1]. > > I will have a look. > Thanks! > In short, the problem is that at > > shutdown, we wait for walsender to send all pending data and ensure > > all data is flushed in the remote node. But, if the other node is > > waiting due to a time-delayed apply then shutdown won't be successful. > > It would be really great if you can let us know your thoughts in the > > thread [1] as that can help to move this work forward. > > Okay, so you mean to say that with logical the shutdown will be > delayed until all the changes are applied on the subscriber but the > same is not true for physical standby? Right. > Is it because on physical > standby we flush the WAL before applying? > Yes, the walreceiver first flushes the WAL before applying. -- With Regards, Amit Kapila.
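The physical-standby behavior referred to here is easy to see with the existing GUC and monitoring view (the delay value is only illustrative):

```
-- On the physical standby:
ALTER SYSTEM SET recovery_min_apply_delay = '8h';
SELECT pg_reload_conf();

-- On the primary: flush_lsn keeps up with sent_lsn almost immediately, while
-- replay_lsn (and replay_lag) trail by the configured delay. That is why a
-- shutdown that waits only for the flushed position can still complete.
SELECT application_name, sent_lsn, flush_lsn, replay_lsn, replay_lag
FROM pg_stat_replication;
```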
Hi hackers, > On Thursday, December 22, 2022 3:02 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > Attached the updated patch. > > Again, I used one basic patch in another thread to wake up logical replication > > worker shared in [2] for TAP tests. > The v11 caused a cfbot failure in [1]. But, failed tests looked irrelevant > to the feature to me at present. > > While waiting for another test execution of cfbot, I'd like to check the detailed > reason > and update the patch if necessary. I have investigated the failure and it seems that it was caused by VACUUM FREEZE. The following lines were copied from the server log. ``` 2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: server process (PID 37171) was terminated by signal 6: Abort trap 2022-12-23 08:50:20.175 UTC [34653][postmaster] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze; 2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: terminating any other active server processes ``` The same error has been raised in other threads [1], so we have concluded that this is not related to the patch. The report was raised in another thread [2]. [1]: https://cirrus-ci.com/task/5630405437554688 [2]: https://www.postgresql.org/message-id/TYAPR01MB5866B24104FD80B5D7E65C3EF5ED9%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Dilip, Thanks for reviewing our patch! PSA new version patch set. Again, 0001 is not made by us, brought from [1]. > I have done some review for the patch and I have a few comments. > > 1. > A. > + <literal>wal_sender_timeout</literal> on the publisher. Otherwise, the > + walsender repeatedly terminates due to timeout during the delay of > + the subscriber. > > > B. > +/* > + * In order to avoid walsender's timeout during time delayed replication, > + * it's necessaary to keep sending feedbacks during the delay from the worker > + * process. Meanwhile, the feature delays the apply before starting the > + * transaction and thus we don't write WALs for the suspended changes during > + * the wait. Hence, in the case the worker process sends a feedback during the > + * delay, avoid having positions of the flushed and apply LSN overwritten by > + * the latest LSN. > + */ > > - Seems like these two statements are conflicting, I mean if we are > sending feedback then why the walsender will timeout? It is a possibility that timeout is occurred because the interval between feedback messages may become longer than wal_sender_timeout. Reworded and added descriptions. > - Typo /necessaary/necessary Fixed. > 2. > + * > + * During the time delayed replication, avoid reporting the suspeended > + * latest LSN are already flushed and written, to the publisher. > */ > Typo /suspeended/suspended Fixed. > 3. > + if (wal_receiver_status_interval > 0 > + && diffms > wal_receiver_status_interval) > + { > + WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | > WL_EXIT_ON_PM_DEATH, > + (long) wal_receiver_status_interval, > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > + send_feedback(last_received, true, false); > + } > + else > + WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | > WL_EXIT_ON_PM_DEATH, > + diffms, > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > > I think here we should add some comments to explain about sending > feedback, something like what we have explained at the time of > defining the "in_delaying_apply" variable. Added. > 4. > > + * Although the delay is applied in BEGIN messages, streamed transactions > + * apply the delay in a STREAM COMMIT message. That's ok because no > + * changes have been applied yet (apply_spooled_messages() will do it). > + * The STREAM START message would be a natural choice for this delay > but > + * there is no commit time yet (it will be available when the in-progress > + * transaction finishes), hence, it was not possible to apply a delay at > + * that time. > + */ > + maybe_delay_apply(commit_data.committime); > > I am wondering how this will interact with the parallel apply worker > where we do not spool the data in file? How are we going to get the > commit time of the transaction without applying the changes? We think that parallel apply workers should not delay applications because if they delay transactions before committing they may hold locks very long time. > 5. > + /* > + * The following operations use these special functions to detect > + * overflow. Number of ms per informed days. > + */ > > This comment doesn't make much sense, I think this needs to be rephrased. Changed to simpler expression. We have also fixed wrong usage of wal_receiver_status_interval. We must convert the unit from [s] to [ms] when it is passed to WaitLatch(). Note that more than half of the modifications are done by Osumi-san. [1]: https://www.postgresql.org/message-id/20221215224721.GA694065%40nathanxps13 Best Regards, Hayato Kuroda FUJITSU LIMITED
On Tue, 27 Dec 2022 at 14:59, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > Note that more than half of the modifications are done by Osumi-san. 1) This global variable can be removed as it is used only in send_feedback which is called from maybe_delay_apply so we could pass it as a function argument: + * delay, avoid having positions of the flushed and apply LSN overwritten by + * the latest LSN. + */ +static bool in_delaying_apply = false; +static XLogRecPtr last_received = InvalidXLogRecPtr; + 2) -1 gets converted to -1000 +int64 +interval2ms(const Interval *interval) +{ + int64 days; + int64 ms; + int64 result; + + days = interval->month * INT64CONST(30); + days += interval->day; + + /* Detect whether the value of interval can cause an overflow. */ + if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result)) + ereport(ERROR, + (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("bigint out of range"))); + + /* Adds portion time (in ms) to the previous result. */ + ms = interval->time / INT64CONST(1000); + if (pg_add_s64_overflow(result, ms, &result)) + ereport(ERROR, + (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("bigint out of range"))); create subscription sub7 connection 'dbname=regression host=localhost port=5432' publication pub1 with (min_apply_delay = '-1'); ERROR: -1000 ms is outside the valid range for parameter "min_apply_delay" 3) This can be slightly reworded: + <para> + The length of time (ms) to delay the application of changes. + </para></entry> to: Delay applying the changes by a specified amount of time(ms). 4) maybe_delay_apply can be moved from apply_handle_stream_prepare to apply_spooled_messages so that it is consistent with maybe_start_skipping_changes: @@ -1120,6 +1240,19 @@ apply_handle_stream_prepare(StringInfo s) elog(DEBUG1, "received prepare for streamed transaction %u", prepare_data.xid); + /* + * Should we delay the current prepared transaction? + * + * Although the delay is applied in BEGIN PREPARE messages, streamed + * prepared transactions apply the delay in a STREAM PREPARE message. + * That's ok because no changes have been applied yet + * (apply_spooled_messages() will do it). The STREAM START message does + * not contain a prepare time (it will be available when the in-progress + * prepared transaction finishes), hence, it was not possible to apply a + * delay at that time. + */ + maybe_delay_apply(prepare_data.prepare_time); That way the call from apply_handle_stream_commit can also be removed. 5) typo transfering should be transferring + publisher and the current time on the subscriber. Time spent in logical + decoding and in transfering the transaction may reduce the actual wait + time. If the system clocks on publisher and subscriber are not 6) feedbacks can be changed to feedback messages + * it's necessary to keep sending feedbacks during the delay from the worker + * process. Meanwhile, the feature delays the apply before starting the 7) + /* + * Suppress overwrites of flushed and writtten positions by the lastest + * LSN in send_feedback(). + */ 7a) typo writtten should be written 7b) lastest should latest Regards, Vignesh
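The -1000 in that error message falls out of interval parsing: a bare number in an interval literal is taken as seconds, so '-1' means minus one second, which interval2ms() then converts to -1000 ms. For example:

```
SELECT interval '-1';                                   -- -00:00:01
SELECT EXTRACT(epoch FROM interval '-1') * 1000 AS ms;  -- -1000
```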
> > On Tue, 27 Dec 2022 at 14:59, Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > Note that more than half of the modifications are done by Osumi-san. > Please find a few minor comments. 1. + diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), + TimestampTzPlusMilliseconds(ts, MySubscription->minapplydelay)); on unix, above code looks unaligned (copied from unix) 2. same with: + interval = DatumGetIntervalP(DirectFunctionCall3(interval_in, + CStringGetDatum(val), + ObjectIdGetDatum(InvalidOid), + Int32GetDatum(-1))); perhaps due to tabs? 2. comment not clear: + * During the time delayed replication, avoid reporting the suspended + * latest LSN are already flushed and written, to the publisher. 3. + * Call send_feedback() to prevent the publisher from exiting by + * timeout during the delay, when wal_receiver_status_interval is + * available. The WALs for this delayed transaction is neither + * written nor flushed yet, Thus, we don't make the latest LSN + * overwrite those positions of the update message for this delay. yet, Thus, we --> yet, thus, we/ yet. Thus, we 4. + /* Adds portion time (in ms) to the previous result. */ + ms = interval->time / INT64CONST(1000); Is interval->time always in micro-seconds here? Thanks Shveta
On Tuesday, December 27, 2022 6:29 PM, Hayato Kuroda (Fujitsu) wrote: > Thanks for reviewing our patch! PSA new version patch set. Now the patches fail to apply to HEAD because of recent commits (c6e1f62e2c and 216a784829c), as reported in [1]. I'll rebase the patch with other changes when I post a new version. [1] - http://cfbot.cputube.org/patch_41_3581.log Best Regards, Takamichi Osumi
On Tuesday, January 3, 2023 4:01 PM vignesh C <vignesh21@gmail.com> wrote: Hi, thanks for your review ! > 1) This global variable can be removed as it is used only in send_feedback which > is called from maybe_delay_apply so we could pass it as a function argument: > + * delay, avoid having positions of the flushed and apply LSN > +overwritten by > + * the latest LSN. > + */ > +static bool in_delaying_apply = false; > +static XLogRecPtr last_received = InvalidXLogRecPtr; > + I have removed the first variable and make it one of the arguments for send_feedback(). > 2) -1 gets converted to -1000 > > +int64 > +interval2ms(const Interval *interval) > +{ > + int64 days; > + int64 ms; > + int64 result; > + > + days = interval->month * INT64CONST(30); > + days += interval->day; > + > + /* Detect whether the value of interval can cause an overflow. */ > + if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result)) > + ereport(ERROR, > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > + errmsg("bigint out of range"))); > + > + /* Adds portion time (in ms) to the previous result. */ > + ms = interval->time / INT64CONST(1000); > + if (pg_add_s64_overflow(result, ms, &result)) > + ereport(ERROR, > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > + errmsg("bigint out of range"))); > > create subscription sub7 connection 'dbname=regression host=localhost > port=5432' publication pub1 with (min_apply_delay = '-1'); > ERROR: -1000 ms is outside the valid range for parameter "min_apply_delay" Good catch! Fixed in order to make input '-1' interpretted as -1 ms. > 3) This can be slightly reworded: > + <para> > + The length of time (ms) to delay the application of changes. > + </para></entry> > to: > Delay applying the changes by a specified amount of time(ms). This has been suggested in [1] by Peter Smith. So, I'd like to keep the current patch's description. Then, I didn't change this. > 4) maybe_delay_apply can be moved from apply_handle_stream_prepare to > apply_spooled_messages so that it is consistent with > maybe_start_skipping_changes: > @@ -1120,6 +1240,19 @@ apply_handle_stream_prepare(StringInfo s) > > elog(DEBUG1, "received prepare for streamed transaction %u", > prepare_data.xid); > > + /* > + * Should we delay the current prepared transaction? > + * > + * Although the delay is applied in BEGIN PREPARE messages, > streamed > + * prepared transactions apply the delay in a STREAM PREPARE > message. > + * That's ok because no changes have been applied yet > + * (apply_spooled_messages() will do it). The STREAM START message > does > + * not contain a prepare time (it will be available when the in-progress > + * prepared transaction finishes), hence, it was not possible to apply a > + * delay at that time. > + */ > + maybe_delay_apply(prepare_data.prepare_time); > > > That way the call from apply_handle_stream_commit can also be removed. Sounds good. I moved the call of maybe_delay_apply() to the apply_spooled_messages(). Now it's aligned with maybe_start_skipping_changes(). > 5) typo transfering should be transferring > + publisher and the current time on the subscriber. Time > spent in logical > + decoding and in transfering the transaction may reduce the > actual wait > + time. If the system clocks on publisher and subscriber are > + not Fixed. > 6) feedbacks can be changed to feedback messages > + * it's necessary to keep sending feedbacks during the delay from the > + worker > + * process. Meanwhile, the feature delays the apply before starting the Fixed. 
> 7) > + /* > + * Suppress overwrites of flushed and writtten positions by the lastest > + * LSN in send_feedback(). > + */ > > 7a) typo writtten should be written > > 7b) lastest should latest I have removed this sentence. So, those typos are removed. Please have a look at the updated patch. [1] - https://www.postgresql.org/message-id/CAHut%2BPttQdFMNM2c6WqKt2c9G6r3ZKYRGHm04RR-4p4fyA4WRg%40mail.gmail.com Best Regards, Takamichi Osumi
On Tuesday, January 3, 2023 8:22 PM shveta malik <shveta.malik@gmail.com> wrote: > Please find a few minor comments. Thanks for your review! > 1. > + diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), > + > > TimestampTzPlusMilliseconds(ts, MySubscription->minapplydelay)); on > unix, above code looks unaligned (copied from unix) > > 2. same with: > + interval = DatumGetIntervalP(DirectFunctionCall3(interval_in, > + > > CStringGetDatum(val), > + > > ObjectIdGetDatum(InvalidOid), > + > > Int32GetDatum(-1))); > perhaps due to tabs? The indentation of those patches looks OK. I checked them with pgindent and the less command described in [1]. So, I didn't change those. > 2. comment not clear: > + * During the time delayed replication, avoid reporting the suspended > + * latest LSN are already flushed and written, to the publisher. You are right. I fixed this part to make it clearer. Could you please check? > 3. > + * Call send_feedback() to prevent the publisher from exiting by > + * timeout during the delay, when wal_receiver_status_interval is > + * available. The WALs for this delayed transaction is neither > + * written nor flushed yet, Thus, we don't make the latest LSN > + * overwrite those positions of the update message for this delay. > > yet, Thus, we --> yet, thus, we/ yet. Thus, we Yeah, you are right. But I have removed the last sentence, because it explains some internals of send_feedback(). I judged that it would be awkward to describe it in maybe_delay_apply(). Now this part has become concise. > 4. > + /* Adds portion time (in ms) to the previous result. */ > + ms = interval->time / INT64CONST(1000); > Is interval->time always in micro-seconds here? Yeah, it seems so. Some internal code indicates it. Kindly have a look at functions such as make_interval() and interval2itm(). Please have a look at the latest patch v12 in [2]. [1] - https://www.postgresql.org/docs/current/source-format.html [2] - https://www.postgresql.org/message-id/TYCPR01MB837340F78F4A16F542589195EDFF9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Tuesday, January 10, 2023 11:28 AM I wrote: > On Tuesday, December 27, 2022 6:29 PM, Hayato Kuroda (Fujitsu) wrote: > > Thanks for reviewing our patch! PSA new version patch set. > Now, the patches fails to apply to the HEAD, because of recent commits > (c6e1f62e2c and 216a784829c) as reported in [1]. > > I'll rebase the patch with other changes when I post a new version. This is done in the patch in [1]. Please note that because of the commit c6e1f62e2c, we don't need the 1st patch we borrowed from another thread in [2] any more. [1] - https://www.postgresql.org/message-id/TYCPR01MB837340F78F4A16F542589195EDFF9%40TYCPR01MB8373.jpnprd01.prod.outlook.com [2] - https://www.postgresql.org/message-id/flat/20221122004119.GA132961%40nathanxps13 Best Regards, Takamichi Osumi
On Tue, Jan 10, 2023 at 7:42 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Tuesday, January 3, 2023 4:01 PM vignesh C <vignesh21@gmail.com> wrote: > Hi, thanks for your review ! > > > > 1) This global variable can be removed as it is used only in send_feedback which > > is called from maybe_delay_apply so we could pass it as a function argument: > > + * delay, avoid having positions of the flushed and apply LSN > > +overwritten by > > + * the latest LSN. > > + */ > > +static bool in_delaying_apply = false; > > +static XLogRecPtr last_received = InvalidXLogRecPtr; > > + > I have removed the first variable and make it one of the arguments for send_feedback(). > > > 2) -1 gets converted to -1000 > > > > +int64 > > +interval2ms(const Interval *interval) > > +{ > > + int64 days; > > + int64 ms; > > + int64 result; > > + > > + days = interval->month * INT64CONST(30); > > + days += interval->day; > > + > > + /* Detect whether the value of interval can cause an overflow. */ > > + if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result)) > > + ereport(ERROR, > > + > > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > + errmsg("bigint out of range"))); > > + > > + /* Adds portion time (in ms) to the previous result. */ > > + ms = interval->time / INT64CONST(1000); > > + if (pg_add_s64_overflow(result, ms, &result)) > > + ereport(ERROR, > > + > > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > + errmsg("bigint out of range"))); > > > > create subscription sub7 connection 'dbname=regression host=localhost > > port=5432' publication pub1 with (min_apply_delay = '-1'); > > ERROR: -1000 ms is outside the valid range for parameter "min_apply_delay" > Good catch! Fixed in order to make input '-1' interpretted as -1 ms. > > > 3) This can be slightly reworded: > > + <para> > > + The length of time (ms) to delay the application of changes. > > + </para></entry> > > to: > > Delay applying the changes by a specified amount of time(ms). > This has been suggested in [1] by Peter Smith. So, I'd like to keep the current patch's description. > Then, I didn't change this. > > > 4) maybe_delay_apply can be moved from apply_handle_stream_prepare to > > apply_spooled_messages so that it is consistent with > > maybe_start_skipping_changes: > > @@ -1120,6 +1240,19 @@ apply_handle_stream_prepare(StringInfo s) > > > > elog(DEBUG1, "received prepare for streamed transaction %u", > > prepare_data.xid); > > > > + /* > > + * Should we delay the current prepared transaction? > > + * > > + * Although the delay is applied in BEGIN PREPARE messages, > > streamed > > + * prepared transactions apply the delay in a STREAM PREPARE > > message. > > + * That's ok because no changes have been applied yet > > + * (apply_spooled_messages() will do it). The STREAM START message > > does > > + * not contain a prepare time (it will be available when the in-progress > > + * prepared transaction finishes), hence, it was not possible to apply a > > + * delay at that time. > > + */ > > + maybe_delay_apply(prepare_data.prepare_time); > > > > > > That way the call from apply_handle_stream_commit can also be removed. > Sounds good. I moved the call of maybe_delay_apply() to the apply_spooled_messages(). > Now it's aligned with maybe_start_skipping_changes(). > > > 5) typo transfering should be transferring > > + publisher and the current time on the subscriber. Time > > spent in logical > > + decoding and in transfering the transaction may reduce the > > actual wait > > + time. 
If the system clocks on publisher and subscriber are > > + not > Fixed. > > > 6) feedbacks can be changed to feedback messages > > + * it's necessary to keep sending feedbacks during the delay from the > > + worker > > + * process. Meanwhile, the feature delays the apply before starting the > Fixed. > > > 7) > > + /* > > + * Suppress overwrites of flushed and writtten positions by the lastest > > + * LSN in send_feedback(). > > + */ > > > > 7a) typo writtten should be written > > > > 7b) lastest should latest > I have removed this sentence. So, those typos are removed. > > Please have a look at the updated patch. > > [1] - https://www.postgresql.org/message-id/CAHut%2BPttQdFMNM2c6WqKt2c9G6r3ZKYRGHm04RR-4p4fyA4WRg%40mail.gmail.com > > Hi, 1. + errmsg("min_apply_delay must not be set when streaming = parallel"))); we give the same error msg for both the cases: a. when subscription is created with streaming=parallel but we are trying to alter subscription to set min_apply_delay >0 b. when subscription is created with some min_apply_delay and we are trying to alter subscription to make it streaming=parallel. For case a, error msg looks fine but for case b, I think error msg should be changed slightly. ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); ERROR: min_apply_delay must not be set when streaming = parallel This gives the feeling that we are trying to modify min_apply_delay but we are not. Maybe we can change it to: "subscription with min_apply_delay must not be allowed to stream parallel" (or something better) thanks Shveta
On Wed, Jan 11, 2023 at 3:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Jan 10, 2023 at 7:42 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Tuesday, January 3, 2023 4:01 PM vignesh C <vignesh21@gmail.com> wrote: > > Hi, thanks for your review ! > > > > > > > 1) This global variable can be removed as it is used only in send_feedback which > > > is called from maybe_delay_apply so we could pass it as a function argument: > > > + * delay, avoid having positions of the flushed and apply LSN > > > +overwritten by > > > + * the latest LSN. > > > + */ > > > +static bool in_delaying_apply = false; > > > +static XLogRecPtr last_received = InvalidXLogRecPtr; > > > + > > I have removed the first variable and make it one of the arguments for send_feedback(). > > > > > 2) -1 gets converted to -1000 > > > > > > +int64 > > > +interval2ms(const Interval *interval) > > > +{ > > > + int64 days; > > > + int64 ms; > > > + int64 result; > > > + > > > + days = interval->month * INT64CONST(30); > > > + days += interval->day; > > > + > > > + /* Detect whether the value of interval can cause an overflow. */ > > > + if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result)) > > > + ereport(ERROR, > > > + > > > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > > + errmsg("bigint out of range"))); > > > + > > > + /* Adds portion time (in ms) to the previous result. */ > > > + ms = interval->time / INT64CONST(1000); > > > + if (pg_add_s64_overflow(result, ms, &result)) > > > + ereport(ERROR, > > > + > > > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > > + errmsg("bigint out of range"))); > > > > > > create subscription sub7 connection 'dbname=regression host=localhost > > > port=5432' publication pub1 with (min_apply_delay = '-1'); > > > ERROR: -1000 ms is outside the valid range for parameter "min_apply_delay" > > Good catch! Fixed in order to make input '-1' interpretted as -1 ms. > > > > > 3) This can be slightly reworded: > > > + <para> > > > + The length of time (ms) to delay the application of changes. > > > + </para></entry> > > > to: > > > Delay applying the changes by a specified amount of time(ms). > > This has been suggested in [1] by Peter Smith. So, I'd like to keep the current patch's description. > > Then, I didn't change this. > > > > > 4) maybe_delay_apply can be moved from apply_handle_stream_prepare to > > > apply_spooled_messages so that it is consistent with > > > maybe_start_skipping_changes: > > > @@ -1120,6 +1240,19 @@ apply_handle_stream_prepare(StringInfo s) > > > > > > elog(DEBUG1, "received prepare for streamed transaction %u", > > > prepare_data.xid); > > > > > > + /* > > > + * Should we delay the current prepared transaction? > > > + * > > > + * Although the delay is applied in BEGIN PREPARE messages, > > > streamed > > > + * prepared transactions apply the delay in a STREAM PREPARE > > > message. > > > + * That's ok because no changes have been applied yet > > > + * (apply_spooled_messages() will do it). The STREAM START message > > > does > > > + * not contain a prepare time (it will be available when the in-progress > > > + * prepared transaction finishes), hence, it was not possible to apply a > > > + * delay at that time. > > > + */ > > > + maybe_delay_apply(prepare_data.prepare_time); > > > > > > > > > That way the call from apply_handle_stream_commit can also be removed. > > Sounds good. I moved the call of maybe_delay_apply() to the apply_spooled_messages(). > > Now it's aligned with maybe_start_skipping_changes(). 
> > > > > 5) typo transfering should be transferring > > > + publisher and the current time on the subscriber. Time > > > spent in logical > > > + decoding and in transfering the transaction may reduce the > > > actual wait > > > + time. If the system clocks on publisher and subscriber are > > > + not > > Fixed. > > > > > 6) feedbacks can be changed to feedback messages > > > + * it's necessary to keep sending feedbacks during the delay from the > > > + worker > > > + * process. Meanwhile, the feature delays the apply before starting the > > Fixed. > > > > > 7) > > > + /* > > > + * Suppress overwrites of flushed and writtten positions by the lastest > > > + * LSN in send_feedback(). > > > + */ > > > > > > 7a) typo writtten should be written > > > > > > 7b) lastest should latest > > I have removed this sentence. So, those typos are removed. > > > > Please have a look at the updated patch. > > > > [1] - https://www.postgresql.org/message-id/CAHut%2BPttQdFMNM2c6WqKt2c9G6r3ZKYRGHm04RR-4p4fyA4WRg%40mail.gmail.com > > > > > > Hi, > > 1. > + errmsg("min_apply_delay must not be set when streaming = parallel"))); > we give the same error msg for both the cases: > a. when subscription is created with streaming=parallel but we are > trying to alter subscription to set min_apply_delay >0 > b. when subscription is created with some min_apply_delay and we are > trying to alter subscription to make it streaming=parallel. > For case a, error msg looks fine but for case b, I think error msg > should be changed slightly. > ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); > ERROR: min_apply_delay must not be set when streaming = parallel > This gives the feeling that we are trying to modify min_apply_delay > but we are not. Maybe we can change it to: > "subscription with min_apply_delay must not be allowed to stream > parallel" (or something better) > > thanks > Shveta Sorry for multiple emails. One suggestion: 2. I think users can set ' wal_receiver_status_interval ' to 0 or more than 'wal_sender_timeout'. But is this a frequent use-case scenario or do we see DBAs setting these in such a way by mistake? If so, then I think, it is better to give Warning message in such a case when a user tries to create or alter a subscription with a large 'min_apply_delay' (>= 'wal_sender_timeout') , rather than leaving it to the user's understanding that WalSender may repeatedly timeout in such a case. Parse_subscription_options and AlterSubscription can be modified to log a warning. Any thoughts? thanks Shveta
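Concretely, the combination being questioned would be something like the following (values are illustrative; min_apply_delay is the option added by this patch): the subscriber either disables the periodic feedback entirely or cannot send it often enough for the publisher's timeout.

```
-- Publisher: a fairly short walsender timeout.
ALTER SYSTEM SET wal_sender_timeout = '1min';
SELECT pg_reload_conf();

-- Subscriber: 0 disables the periodic feedback that keeps walsender alive
-- during the delay ...
ALTER SYSTEM SET wal_receiver_status_interval = 0;
SELECT pg_reload_conf();

-- ... while a long min_apply_delay guarantees there is a delay to sit out.
CREATE SUBSCRIPTION risky_sub
    CONNECTION 'host=pub dbname=postgres'
    PUBLICATION pub
    WITH (min_apply_delay = '10min');
```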
On Tue, 10 Jan 2023 at 19:41, Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Tuesday, January 3, 2023 4:01 PM vignesh C <vignesh21@gmail.com> wrote: > Hi, thanks for your review ! > > Please have a look at the updated patch. Thanks for the updated patch, few comments: 1) Comment inconsistency across create and alter subscription, better to keep it same: + /* + * Do additional checking for disallowed combination when min_apply_delay + * was not zero. + */ + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0) + { + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR)), + errmsg("min_apply_delay must not be set when streaming = parallel")); + } + /* + * Test the combination of streaming mode and + * min_apply_delay + */ + if (opts.streaming == LOGICALREP_STREAM_PARALLEL && + sub->minapplydelay > 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("min_apply_delay must not be set when streaming = parallel"))); 2) ereport inconsistency, braces around errcode is present in few places and not present in few places, it is better to keep it consistent by removing it: 2.a) + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR)), + errmsg("min_apply_delay must not be set when streaming = parallel")); 2.b) + if (opts.streaming == LOGICALREP_STREAM_PARALLEL && + sub->minapplydelay > 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("min_apply_delay must not be set when streaming = parallel"))); 2.c) + if (opts.min_apply_delay > 0 && + sub->stream == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("min_apply_delay must not be set when streaming = parallel"))); 2.d) + if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result)) + ereport(ERROR, + (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("bigint out of range"))); 2.e) + if (pg_add_s64_overflow(result, ms, &result)) + ereport(ERROR, + (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("bigint out of range"))); 3) this include is not required, I could compile without it --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -48,6 +48,7 @@ #include "utils/memutils.h" #include "utils/pg_lsn.h" #include "utils/syscache.h" +#include "utils/timestamp.h" 4) 4.a) Should this be changed: /* Adds portion time (in ms) to the previous result. */ to /* Adds portion time (in ms) to the previous result */ 4.b) Should this be changed: /* Detect whether the value of interval can cause an overflow. 
*/ to /* Detect whether the value of interval can cause an overflow */ 5) Can this "ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1d')" be combined along with "-- success -- 123 ms", that way few statements could be reduced +-- success -- 86400000 ms +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123); +ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1d'); + +\dRs+ + +ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE); +DROP SUBSCRIPTION regress_testsub; 6) Can we do the interval testing along with alter subscription and combined with "-- success -- 123 ms" test, that way few statements could be reduced +-- success -- interval is converted into ms and stored as integer +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = '4h 27min 35s'); + +\dRs+ + +ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE); +DROP SUBSCRIPTION regress_testsub; Regards, Vignesh
Dear Shveta, Thanks for reviewing! PSA new version. > 1. > + errmsg("min_apply_delay must not be set when streaming = parallel"))); > we give the same error msg for both the cases: > a. when subscription is created with streaming=parallel but we are > trying to alter subscription to set min_apply_delay >0 > b. when subscription is created with some min_apply_delay and we are > trying to alter subscription to make it streaming=parallel. > For case a, error msg looks fine but for case b, I think error msg > should be changed slightly. > ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); > ERROR: min_apply_delay must not be set when streaming = parallel > This gives the feeling that we are trying to modify min_apply_delay > but we are not. Maybe we can change it to: > "subscription with min_apply_delay must not be allowed to stream > parallel" (or something better) Your point that error messages are strange is right. And while checking other ones, I found they have very similar styles. Therefore I reworded ERROR messages in AlterSubscription() and parse_subscription_options() to follow them. Which version is better? Best Regards, Hayato Kuroda FUJITSU LIMITED
> 2. > I think users can set ' wal_receiver_status_interval ' to 0 or more > than 'wal_sender_timeout'. But is this a frequent use-case scenario or > do we see DBAs setting these in such a way by mistake? If so, then I > think, it is better to give Warning message in such a case when a user > tries to create or alter a subscription with a large 'min_apply_delay' > (>= 'wal_sender_timeout') , rather than leaving it to the user's > understanding that WalSender may repeatedly timeout in such a case. > Parse_subscription_options and AlterSubscription can be modified to > log a warning. Any thoughts? Yes, DBAs may set wal_receiver_status_interval to more than wal_sender_timeout by mistake. But to handle the scenario we must compare between min_apply_delay *on subscriber* and wal_sender_timeout *on publisher*. Both values are not transferred to opposite sides, so the WARNING cannot be raised. I considered that such a mechanism seemed to be complex. The discussion around [1] may be useful. [1]: https://www.postgresql.org/message-id/CAA4eK1Lq%2Bh8qo%2BrqGU-E%2BhwJKAHYocV54y4pvou4rLysCgYD-g%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED
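To see why such a cross-check is awkward, note that each setting is only visible on its own node (a sketch; subminapplydelay is the catalog column name used by the patch):

-- on the publisher
SHOW wal_sender_timeout;

-- on the subscriber
SHOW wal_receiver_status_interval;
SELECT subname, subminapplydelay FROM pg_subscription;

The subscription commands run on the subscriber and the publisher's wal_sender_timeout is not transferred to it, so the comparison is left to the user (or to external tooling).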
Dear Vignesh, Thanks for reviewing! > 1) Comment inconsistency across create and alter subscription, better > to keep it same: A comment for CREATE SUBSCRIPTION became same as ALTER's one. > 2) ereport inconsistency, braces around errcode is present in few > places and not present in few places, it is better to keep it > consistent by removing it: Removed. > 3) this include is not required, I could compile without it Removed. Timestamp datatype is not used in subscriptioncmds.c. > 4) > 4.a) > Should this be changed: > /* Adds portion time (in ms) to the previous result. */ > to > /* Adds portion time (in ms) to the previous result */ Changed. > 4.b) > Should this be changed: > /* Detect whether the value of interval can cause an overflow. */ > to > /* Detect whether the value of interval can cause an overflow */ Changed. > 5) Can this "ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = > '1d')" be combined along with "-- success -- 123 ms", that way few > statements could be reduced > 6) Can we do the interval testing along with alter subscription and > combined with "-- success -- 123 ms" test, that way few statements > could be reduced To keep the code coverage, either of them must remain. 5) was cleanly removed and 6) was combined to you suggested. In addition, comments were updated to clarify the testcase. Please have a look at the latest patch v14 in [1]. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866D0527B1B8D589F1C2551F5FC9%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
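Roughly, the combined test would look like this (a sketch assembled from the statements quoted above; the actual v14 test may differ slightly):

-- success -- 123 ms
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
\dRs+

-- success -- interval is converted into ms and stored as an integer
ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '4h 27min 35s');
\dRs+

ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
DROP SUBSCRIPTION regress_testsub;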
At Wed, 11 Jan 2023 12:46:24 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > them. Which version is better? Some comments by a quick loock, different from the above. + CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb' I understand that we (not PG people, but IT people) are supposed to use in documents a certain set of special addresses that is guaranteed not to be routed in the field. > TEST-NET-1 : 192.0.2.0/24 > TEST-NET-2 : 198.51.100.0/24 > TEST-NET-3 : 203.0.113.0/24 (I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..) + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) Do we need to bother spending another memory block for apparent non-digits here? + errmsg(INT64_FORMAT " ms is outside the valid range for parameter \"%s\"", We don't use INT64_FORMAT in translatable message strings. Cast then use %lld instead. This message looks unfriendly as it doesn't suggest the valid range, and it shows the input value in a different unit from what was in the input. A I think we can spell it as "\"%s\" is outside the valid range for subsciription parameter \"%s\" (0 .. <INT_MAX> in millisecond)" + int64 min_apply_delay; .. + if (ms < 0 || ms > INT_MAX) Why is the variable wider than required? + errmsg("%s and %s are mutually exclusive options", + "min_apply_delay > 0", "streaming = parallel")); Mmm. Couldn't we refuse 0 as min_apply_delay? + sub->minapplydelay > 0) ... + if (opts.min_apply_delay > 0 && Is there any reason for the differenciation? + errmsg("cannot set %s for subscription with %s", + "streaming = parallel", "min_apply_delay > 0")); I think that this shoud be more like human-speking. Say, "cannot enable min_apply_delay for subscription in parallel streaming mode" or something.. The same is applicable to the nearby message. +static void maybe_delay_apply(TimestampTz ts); apply_spooled_messages(FileSet *stream_fileset, TransactionId xid, - XLogRecPtr lsn) + XLogRecPtr lsn, TimestampTz ts) "ts" looks too generic. Couldn't it be more specific? We need a explanation for the parameter in the function comment. + if (!am_parallel_apply_worker()) + { + Assert(ts > 0); + maybe_delay_apply(ts); It seems to me better that the if condition and assertion are checked inside maybe_delay_apply(). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Jan 11, 2023 at 6:16 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > Thanks for reviewing! PSA new version. > > > 1. > > + errmsg("min_apply_delay must not be set when streaming = parallel"))); > > we give the same error msg for both the cases: > > a. when subscription is created with streaming=parallel but we are > > trying to alter subscription to set min_apply_delay >0 > > b. when subscription is created with some min_apply_delay and we are > > trying to alter subscription to make it streaming=parallel. > > For case a, error msg looks fine but for case b, I think error msg > > should be changed slightly. > > ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); > > ERROR: min_apply_delay must not be set when streaming = parallel > > This gives the feeling that we are trying to modify min_apply_delay > > but we are not. Maybe we can change it to: > > "subscription with min_apply_delay must not be allowed to stream > > parallel" (or something better) > > Your point that error messages are strange is right. And while > checking other ones, I found they have very similar styles. Therefore I reworded > ERROR messages in AlterSubscription() and parse_subscription_options() to follow > them. Which version is better? > v14 one looks much better. Thanks! thanks Shveta
On Wed, Jan 11, 2023 at 6:16 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > 2. > > I think users can set ' wal_receiver_status_interval ' to 0 or more > > than 'wal_sender_timeout'. But is this a frequent use-case scenario or > > do we see DBAs setting these in such a way by mistake? If so, then I > > think, it is better to give Warning message in such a case when a user > > tries to create or alter a subscription with a large 'min_apply_delay' > > (>= 'wal_sender_timeout') , rather than leaving it to the user's > > understanding that WalSender may repeatedly timeout in such a case. > > Parse_subscription_options and AlterSubscription can be modified to > > log a warning. Any thoughts? > > Yes, DBAs may set wal_receiver_status_interval to more than wal_sender_timeout by > mistake. > > But to handle the scenario we must compare between min_apply_delay *on subscriber* > and wal_sender_timeout *on publisher*. Both values are not transferred to opposite > sides, so the WARNING cannot be raised. I considered that such a mechanism seemed > to be complex. The discussion around [1] may be useful. > > [1]: https://www.postgresql.org/message-id/CAA4eK1Lq%2Bh8qo%2BrqGU-E%2BhwJKAHYocV54y4pvou4rLysCgYD-g%40mail.gmail.com > okay, I see. So even when 'wal_receiver_status_interval' is set to 0, no log/warning is needed when the user tries to set min_apply_delay>0? Are we good with doc alone? One trivial correction in config.sgml: + terminates due to the timeout errors. Hence, make sure this parameter + shorter than the <literal>wal_sender_timeout</literal> of the publisher. Hence, make sure this parameter is shorter... <is missing> thanks Shveta
Hi,
I've a question about 032_apply_delay.pl.
+# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)"
+);
+
+# New row to trigger apply delay.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO test_tab VALUES (0, 'foobar')");
+
I couldn't quite see how these lines test whether ALTER SUBSCRIPTION successfully worked.
Don't we need to check that min_apply_delay really changed as a result?
But also I see that subscription.sql already tests this ALTER SUBSCRIPTION behaviour.
Best,
-- Melih Mutlu
Microsoft
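One direct way to check that the ALTER took effect would be to read the value back from the catalog on the subscriber (a sketch; subminapplydelay is the column added by the patch, stored in milliseconds):

SELECT subminapplydelay FROM pg_subscription WHERE subname = 'tap_sub';

After the ALTER above this should return 86460000.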
On Thursday, January 12, 2023 12:04 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Wed, 11 Jan 2023 12:46:24 +0000, "Hayato Kuroda (Fujitsu)" > <kuroda.hayato@fujitsu.com> wrote in > > them. Which version is better? > > > Some comments by a quick loock, different from the above. Horiguchi-san, thanks for your review ! > + CONNECTION 'host=192.168.1.50 port=5432 user=foo > dbname=foodb' > > I understand that we (not PG people, but IT people) are supposed to use in > documents a certain set of special addresses that is guaranteed not to be > routed in the field. > > > TEST-NET-1 : 192.0.2.0/24 > > TEST-NET-2 : 198.51.100.0/24 > > TEST-NET-3 : 203.0.113.0/24 > > (I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..) Fixed. If necessary we can create another thread for this. > + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) > > Do we need to bother spending another memory block for apparent non-digits > here? Yes. The characters are necessary to handle an issue reported in [1]. The issue happened if the user inputs a negative value, then the length comparison became different between strspn and strlen and the input value was recognized as seconds, when the unit wasn't described. This led to a wrong error message for the user. Those addition of such characters solve the issue. > + errmsg(INT64_FORMAT " ms > is outside the valid range for parameter > +\"%s\"", > > We don't use INT64_FORMAT in translatable message strings. Cast then > use %lld instead. Thanks for teaching us. Fixed. > This message looks unfriendly as it doesn't suggest the valid range, and it > shows the input value in a different unit from what was in the input. A I think we > can spell it as "\"%s\" is outside the valid range for subsciription parameter > \"%s\" (0 .. <INT_MAX> in millisecond)" Makes sense. I incorporated the valid range with the aligned format of recovery_min_apply_delay. FYI, the physical replication's GUC doesn't write the unites for the range like below. I followed and applied this style. --- LOG: -1 ms is outside the valid range for parameter "recovery_min_apply_delay" (0 .. 2147483647) FATAL: configuration file "/home/k5user/new/pg/l/make_v15/slave/postgresql.conf" contains errors --- > + int64 min_apply_delay; > .. > + if (ms < 0 || ms > INT_MAX) > > Why is the variable wider than required? You are right. Fixed. > + errmsg("%s and %s are mutually > exclusive options", > + "min_apply_delay > 0", > "streaming = parallel")); > > Mmm. Couldn't we refuse 0 as min_apply_delay? Sorry, the previous patch's behavior wasn't consistent with this error message. In the previous patch, if we conducted alter subscription with stream = parallel and min_apply_delay = 0 (from a positive value) at the same time, the alter command failed, although this should succeed by this time-delayed feature specification. We fixed this part accordingly by some more tests in AlterSubscription(). By the way, we should allow users to change min_apply_dealy to 0 whenever they want from different value. Then, we didn't restrict this kind of operation. > + sub->minapplydelay > 0) > ... > + if (opts.min_apply_delay > 0 && > > Is there any reason for the differenciation? Yes. The former is the object for an existing subscription configuration. For example, if we alter subscription with setting streaming = 'parallel' for a subscription created with min_apply_delay = '1 day', we need to reject the alter command. The latter is new settings. 
> + > errmsg("cannot set %s for subscription with %s", > + > "streaming = parallel", "min_apply_delay > 0")); > > I think that this shoud be more like human-speking. Say, "cannot enable > min_apply_delay for subscription in parallel streaming mode" or something.. > The same is applicable to the nearby message. Reworded the error messages. Please check. > +static void maybe_delay_apply(TimestampTz ts); > > apply_spooled_messages(FileSet *stream_fileset, TransactionId xid, > - XLogRecPtr lsn) > + XLogRecPtr lsn, TimestampTz ts) > > "ts" looks too generic. Couldn't it be more specific? > We need a explanation for the parameter in the function comment. Changed it to finish_ts, since it indicates commit/prepare time. This terminology should be aligned with finish lsn. > + if (!am_parallel_apply_worker()) > + { > + Assert(ts > 0); > + maybe_delay_apply(ts); > > It seems to me better that the if condition and assertion are checked inside > maybe_delay_apply(). Fixed. [1] - https://www.postgresql.org/message-id/CALDaNm3Bpzhh60nU-keuGxMPb-OhcqsfpCN3ysfCfCJ-2ShYPA%40mail.gmail.com Best Regards, Takamichi Osumi
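From the user's point of view, the value handling discussed above comes out roughly as follows (a sketch; subscription and publication names are illustrative, and the exact error wording is still being adjusted in this thread):

-- a plain number is taken as milliseconds
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION mypub WITH (connect = false, min_apply_delay = 123);

-- an interval is converted to milliseconds
ALTER SUBSCRIPTION mysub SET (min_apply_delay = '1 day');

-- a negative value is rejected, reporting the valid range in milliseconds
ALTER SUBSCRIPTION mysub SET (min_apply_delay = '-1');
ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)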
Hi, Shveta Thanks for your comments! On Thursday, January 12, 2023 6:51 PM shveta malik <shveta.malik@gmail.com> wrote: > > Yes, DBAs may set wal_receiver_status_interval to more than > > wal_sender_timeout by mistake. > > > > But to handle the scenario we must compare between min_apply_delay *on > > subscriber* and wal_sender_timeout *on publisher*. Both values are not > > transferred to opposite sides, so the WARNING cannot be raised. I > > considered that such a mechanism seemed to be complex. The discussion > around [1] may be useful. > > > > [1]: > > > https://www.postgresql.org/message-id/CAA4eK1Lq%2Bh8qo%2BrqGU-E%2B > hwJK > > AHYocV54y4pvou4rLysCgYD-g%40mail.gmail.com > > > > okay, I see. So even when 'wal_receiver_status_interval' is set to 0, no > log/warning is needed when the user tries to set min_apply_delay>0? > Are we good with doc alone? Yes. As far as I can remember, we don't emit log or warning for some kind of combination of those parameters (in the context of timeout too). So, it should be fine. > One trivial correction in config.sgml: > + terminates due to the timeout errors. Hence, make sure this parameter > + shorter than the <literal>wal_sender_timeout</literal> of the > publisher. > Hence, make sure this parameter is shorter... <is missing> Fixed. Kindly have a look at the latest patch shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83739C6133B50DDA8BAD1601EDFD9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, Melih On Thursday, January 12, 2023 10:12 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > I've a question about 032_apply_delay.pl. > ... > I couldn't quite see how these lines test whether ALTER SUBSCRIPTION successfully worked. > Don't we need to check that min_apply_delay really changed as a result? Yeah, we should check it from the POV of apply worker's debug logs. The latest patch posted in [1] addressed your concern, by checking the logged delay time in the server log. I'd say what we could do is to check the logged time is long enough after the ALTER SUBSCRIPTION command. Please have a look at the patch. [1] - https://www.postgresql.org/message-id/TYCPR01MB83739C6133B50DDA8BAD1601EDFD9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Thu, 12 Jan 2023 at 21:09, Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Thursday, January 12, 2023 12:04 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 11 Jan 2023 12:46:24 +0000, "Hayato Kuroda (Fujitsu)" > > <kuroda.hayato@fujitsu.com> wrote in > > > them. Which version is better? > > > > > > Some comments by a quick loock, different from the above. > Horiguchi-san, thanks for your review ! > > > > + CONNECTION 'host=192.168.1.50 port=5432 user=foo > > dbname=foodb' > > > > I understand that we (not PG people, but IT people) are supposed to use in > > documents a certain set of special addresses that is guaranteed not to be > > routed in the field. > > > > > TEST-NET-1 : 192.0.2.0/24 > > > TEST-NET-2 : 198.51.100.0/24 > > > TEST-NET-3 : 203.0.113.0/24 > > > > (I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..) > Fixed. If necessary we can create another thread for this. > > > + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) > > > > Do we need to bother spending another memory block for apparent non-digits > > here? > Yes. The characters are necessary to handle an issue reported in [1]. > The issue happened if the user inputs a negative value, > then the length comparison became different between strspn and strlen > and the input value was recognized as seconds, when > the unit wasn't described. This led to a wrong error message for the user. > > Those addition of such characters solve the issue. > > > + errmsg(INT64_FORMAT " ms > > is outside the valid range for parameter > > +\"%s\"", > > > > We don't use INT64_FORMAT in translatable message strings. Cast then > > use %lld instead. > Thanks for teaching us. Fixed. > > > This message looks unfriendly as it doesn't suggest the valid range, and it > > shows the input value in a different unit from what was in the input. A I think we > > can spell it as "\"%s\" is outside the valid range for subsciription parameter > > \"%s\" (0 .. <INT_MAX> in millisecond)" > Makes sense. I incorporated the valid range with the aligned format of recovery_min_apply_delay. > FYI, the physical replication's GUC doesn't write the unites for the range like below. > I followed and applied this style. > > --- > LOG: -1 ms is outside the valid range for parameter "recovery_min_apply_delay" (0 .. 2147483647) > FATAL: configuration file "/home/k5user/new/pg/l/make_v15/slave/postgresql.conf" contains errors > --- > > > + int64 min_apply_delay; > > .. > > + if (ms < 0 || ms > INT_MAX) > > > > Why is the variable wider than required? > You are right. Fixed. > > > + errmsg("%s and %s are mutually > > exclusive options", > > + "min_apply_delay > 0", > > "streaming = parallel")); > > > > Mmm. Couldn't we refuse 0 as min_apply_delay? > Sorry, the previous patch's behavior wasn't consistent with this error message. > > In the previous patch, if we conducted alter subscription > with stream = parallel and min_apply_delay = 0 (from a positive value) at the same time, > the alter command failed, although this should succeed by this time-delayed feature specification. > We fixed this part accordingly by some more tests in AlterSubscription(). > > By the way, we should allow users to change min_apply_dealy to 0 > whenever they want from different value. Then, we didn't restrict > this kind of operation. > > > + sub->minapplydelay > 0) > > ... > > + if (opts.min_apply_delay > 0 && > > > > Is there any reason for the differenciation? > Yes. The former is the object for an existing subscription configuration. 
> For example, if we alter subscription with setting streaming = 'parallel' > for a subscription created with min_apply_delay = '1 day', we > need to reject the alter command. The latter is new settings. > > > > + > > errmsg("cannot set %s for subscription with %s", > > + > > "streaming = parallel", "min_apply_delay > 0")); > > > > I think that this shoud be more like human-speking. Say, "cannot enable > > min_apply_delay for subscription in parallel streaming mode" or something.. > > The same is applicable to the nearby message. > Reworded the error messages. Please check. > > > +static void maybe_delay_apply(TimestampTz ts); > > > > apply_spooled_messages(FileSet *stream_fileset, TransactionId xid, > > - XLogRecPtr lsn) > > + XLogRecPtr lsn, TimestampTz ts) > > > > "ts" looks too generic. Couldn't it be more specific? > > We need a explanation for the parameter in the function comment. > Changed it to finish_ts, since it indicates commit/prepare time. > This terminology should be aligned with finish lsn. > > > + if (!am_parallel_apply_worker()) > > + { > > + Assert(ts > 0); > > + maybe_delay_apply(ts); > > > > It seems to me better that the if condition and assertion are checked inside > > maybe_delay_apply(). > Fixed. > Thanks for the updated patch, Few comments: 1) Since the min_apply_delay = 3, but you have specified 2s, there might be a possibility that it can log delay as 1000ms due to pub/sub/network delay and the test can fail randomly, If we cannot ensure this log file value, check_apply_delay_time verification alone should be sufficient. +is($result, qq(5|1|5), 'check if the new rows were applied to subscriber'); + +check_apply_delay_log("logical replication apply delay", "2000"); 2) I'm not sure if this will add any extra coverage as the altering value of min_apply_delay is already tested in the regression, if so this test can be removed: +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute). +$node_subscriber->safe_psql('postgres', + "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)" +); + +# New row to trigger apply delay. +$node_publisher->safe_psql('postgres', + "INSERT INTO test_tab VALUES (0, 'foobar')"); + +check_apply_delay_log("logical replication apply delay", "80000000"); 3) We generally keep the subroutines before the tests, it can be kept accordingly: 3.a) +sub check_apply_delay_log +{ + my ($message, $expected) = @_; + $expected = 0 unless defined $expected; + + my $old_log_location = $log_location; 3.b) +sub check_apply_delay_time +{ + my ($primary_key, $expected_diffs) = @_; + + my $inserted_time_on_pub = $node_publisher->safe_psql('postgres', qq[ + SELECT extract(epoch from c) FROM test_tab WHERE a = $primary_key; + ]); + 4) typo "more then once" should be "more than once" + regress_testsub | regress_subscription_user | f | {testpub,testpub1,testpub2} | f | off | d | f | any | 0 | off | dbname=regress_doesnotexist | 0/0 (1 row) -- fail - publication used more then once @@ -316,10 +316,10 @@ ERROR: publication "testpub3" is not in subscription "regress_testsub" -- ok - delete publications ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false); \dRs+ 5) This can be changed to "Is it larger than expected?" + # Is it larger than expected ? 
+ cmp_ok($logged_delay, '>', $expected, + "The wait time of the apply worker is long enough expectedly" + ); 6) 2022 should be changed to 2023 +++ b/src/test/subscription/t/032_apply_delay.pl @@ -0,0 +1,170 @@ + +# Copyright (c) 2022, PostgreSQL Global Development Group + +# Test replication apply delay 7) Termination full stop is not required for single line comments: 7.a) +use Test::More; + +# Create publisher node. +my $node_publisher = PostgreSQL::Test::Cluster->new('publisher'); 7.b) + +# Create subscriber node. +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); 7.c) + +# Create some preexisting content on publisher. +$node_publisher->safe_psql('postgres', 7.d) similarly in rest of the files 8) Is it possible to add one test for spooling also? Regards, Vignesh
Hi, On Saturday, January 14, 2023 3:27 PM vignesh C <vignesh21@gmail.com> wrote: > 1) Since the min_apply_delay = 3, but you have specified 2s, there might be a > possibility that it can log delay as 1000ms due to pub/sub/network delay and > the test can fail randomly, If we cannot ensure this log file value, > check_apply_delay_time verification alone should be sufficient. > +is($result, qq(5|1|5), 'check if the new rows were applied to > +subscriber'); > + > +check_apply_delay_log("logical replication apply delay", "2000"); You are right. Removed the left-time check of the 1st call of check_apply_delay_log(). > 2) I'm not sure if this will add any extra coverage as the altering value of > min_apply_delay is already tested in the regression, if so this test can be > removed: > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute). > +$node_subscriber->safe_psql('postgres', > + "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = > 86460000)" > +); > + > +# New row to trigger apply delay. > +$node_publisher->safe_psql('postgres', > + "INSERT INTO test_tab VALUES (0, 'foobar')"); > + > +check_apply_delay_log("logical replication apply delay", "80000000"); While addressing this point, I've noticed that there is a behavior difference between physical replication's recovery_min_apply_delay and this feature when stopping the replication during delays. At present, in the latter case, the apply worker exits without applying the suspended transaction after ALTER SUBSCRIPTION DISABLE command for the subscription. Meanwhile, there is no "disabling" command for physical replication, but I checked the behavior about what happens for promoting a secondary during the delay of recovery_min_apply_delay for physical replication as one example. The transaction has become visible even in the promoting in the middle of delay. I'm not sure if I should make the time-delayed LR aligned with this behavior. Does someone has an opinion for this ? By the way, the above test code can be used for the test case when the apply worker is in a delay but the transaction has been canceled by ALTER SUBSCRIPTION DISABLE command. So, I didn't remove it at this stage. > 3) We generally keep the subroutines before the tests, it can be kept > accordingly: > 3.a) > +sub check_apply_delay_log > +{ > + my ($message, $expected) = @_; > + $expected = 0 unless defined $expected; > + > + my $old_log_location = $log_location; > > 3.b) > +sub check_apply_delay_time > +{ > + my ($primary_key, $expected_diffs) = @_; > + > + my $inserted_time_on_pub = $node_publisher->safe_psql('postgres', > qq[ > + SELECT extract(epoch from c) FROM test_tab WHERE a = > $primary_key; > + ]); > + Fixed. > 4) typo "more then once" should be "more than once" > + regress_testsub | regress_subscription_user | f | > {testpub,testpub1,testpub2} | f | off | d | > f | any | 0 | off > | dbname=regress_doesnotexist | 0/0 > (1 row) > > -- fail - publication used more then once @@ -316,10 +316,10 @@ ERROR: > publication "testpub3" is not in subscription "regress_testsub" > -- ok - delete publications > ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, > testpub2 WITH (refresh = false); > \dRs+ This was an existing typo on HEAD. Addressed in other thread in [1]. > 5) This can be changed to "Is it larger than expected?" > + # Is it larger than expected ? > + cmp_ok($logged_delay, '>', $expected, > + "The wait time of the apply worker is long > enough expectedly" > + ); Fixed. 
> 6) 2022 should be changed to 2023 > +++ b/src/test/subscription/t/032_apply_delay.pl > @@ -0,0 +1,170 @@ > + > +# Copyright (c) 2022, PostgreSQL Global Development Group > + > +# Test replication apply delay Fixed. > 7) Termination full stop is not required for single line comments: > 7.a) > +use Test::More; > + > +# Create publisher node. > +my $node_publisher = PostgreSQL::Test::Cluster->new('publisher'); > > 7.b) + > +# Create subscriber node. > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); > > 7.c) + > +# Create some preexisting content on publisher. > +$node_publisher->safe_psql('postgres', > > 7.d) similarly in rest of the files Removed the periods for single line comments. > 8) Is it possible to add one test for spooling also? There is a streaming transaction case in the TAP test already. I conducted some minor comment modifications along with above changes. Kindly have a look at the v16. [1] - https://www.postgresql.org/message-id/flat/TYCPR01MB83737EA140C79B7D099F65E8EDC69%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Dear hackers, > At present, in the latter case, > the apply worker exits without applying the suspended transaction > after ALTER SUBSCRIPTION DISABLE command for the subscription. > Meanwhile, there is no "disabling" command for physical replication, > but I checked the behavior about what happens for promoting a secondary > during the delay of recovery_min_apply_delay for physical replication as one > example. > The transaction has become visible even in the promoting in the middle of delay. > > I'm not sure if I should make the time-delayed LR aligned with this behavior. > Does someone has an opinion for this ? Here is my opinion. The current specification is correct; we should not follow the physical replication behavior here. One motivation for this feature is to offer an opportunity to correct data-loss mistakes. When an accidental delete occurs on the publisher, the DBA can stop it from propagating to subscribers by disabling the subscription, which the patch currently allows. IIUC, when the subscription is disabled before a transaction has started, the workers exit and stop applying changes. Since this feature delays the start of the transaction, such an alteration should be regarded as having been executed before the transaction. Best Regards, Hayato Kuroda FUJITSU LIMITED
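Concretely, the workflow this enables is something like the following (a sketch with illustrative object names and addresses):

-- on the subscriber: changes from the publisher are held back for three hours
CREATE SUBSCRIPTION delayed_sub CONNECTION 'host=192.0.2.10 dbname=postgres'
    PUBLICATION pub WITH (min_apply_delay = '3h');

-- on the publisher: data is deleted by mistake
DELETE FROM orders;

-- on the subscriber, within the three-hour window: keep the mistake from being applied
ALTER SUBSCRIPTION delayed_sub DISABLE;

Because the delay is taken before the transaction is applied, disabling the subscription during the wait means the deleting transaction is never applied on the subscriber.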
On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Saturday, January 14, 2023 3:27 PM vignesh C <vignesh21@gmail.com> wrote: > > > 2) I'm not sure if this will add any extra coverage as the altering value of > > min_apply_delay is already tested in the regression, if so this test can be > > removed: > > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute). > > +$node_subscriber->safe_psql('postgres', > > + "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = > > 86460000)" > > +); > > + > > +# New row to trigger apply delay. > > +$node_publisher->safe_psql('postgres', > > + "INSERT INTO test_tab VALUES (0, 'foobar')"); > > + > > +check_apply_delay_log("logical replication apply delay", "80000000"); > While addressing this point, I've noticed that there is a > behavior difference between physical replication's recovery_min_apply_delay > and this feature when stopping the replication during delays. > > At present, in the latter case, > the apply worker exits without applying the suspended transaction > after ALTER SUBSCRIPTION DISABLE command for the subscription. > In the previous paragraph, you said the behavior difference while stopping the replication but it is not clear from where this DISABLE command comes in that scenario. > Meanwhile, there is no "disabling" command for physical replication, > but I checked the behavior about what happens for promoting a secondary > during the delay of recovery_min_apply_delay for physical replication as one example. > The transaction has become visible even in the promoting in the middle of delay. > What causes such a transaction to be visible after promotion? Ideally, if the commit doesn't succeed, the transaction shouldn't be visible. Do, we allow the transaction waiting due to delay to get committed on promotion? > I'm not sure if I should make the time-delayed LR aligned with this behavior. > Does someone has an opinion for this ? > Can you please explain a bit more as asked above to understand the difference? -- With Regards, Amit Kapila.
Hi, On Tuesday, January 17, 2023 9:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Saturday, January 14, 2023 3:27 PM vignesh C <vignesh21@gmail.com> > wrote: > > > > > 2) I'm not sure if this will add any extra coverage as the altering > > > value of min_apply_delay is already tested in the regression, if so > > > this test can be > > > removed: > > > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute). > > > +$node_subscriber->safe_psql('postgres', > > > + "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = > > > 86460000)" > > > +); > > > + > > > +# New row to trigger apply delay. > > > +$node_publisher->safe_psql('postgres', > > > + "INSERT INTO test_tab VALUES (0, 'foobar')"); > > > + > > > +check_apply_delay_log("logical replication apply delay", > > > +"80000000"); > > While addressing this point, I've noticed that there is a behavior > > difference between physical replication's recovery_min_apply_delay and > > this feature when stopping the replication during delays. > > > > At present, in the latter case, > > the apply worker exits without applying the suspended transaction > > after ALTER SUBSCRIPTION DISABLE command for the subscription. > > > > In the previous paragraph, you said the behavior difference while stopping the > replication but it is not clear from where this DISABLE command comes in that > scenario. Sorry for my unclear description. I mean "stopping the replication" is to disable the subscription during the "min_apply_delay" wait time on logical replication setup. I proposed and mentioned this discussion point to define how the time-delayed apply worker should behave when there is a transaction delayed by "min_apply_delay" parameter and additionally the user issues ALTER SUBSCRIPTION ... DISABLE during the delay. When it comes to physical replication, it's hard to find a perfect correspondent for LR's ALTER SUBSCRIPTION DISABLE command, but I chose a scenario to promote a secondary during "recovery_min_apply_delay" for comparison this time. After the promotion of the secondary in the physical replication, the transaction committed on the publisher but delayed on the secondary can be seen. This would be because CheckForStandbyTrigger in recoveryApplyDelay returns true and we apply the record by breaking the wait. I checked and got the LOG message "received promote request" in the secondary log when I tested this case. > > Meanwhile, there is no "disabling" command for physical replication, > > but I checked the behavior about what happens for promoting a > > secondary during the delay of recovery_min_apply_delay for physical > replication as one example. > > The transaction has become visible even in the promoting in the middle of > delay. > > > > What causes such a transaction to be visible after promotion? Ideally, if the > commit doesn't succeed, the transaction shouldn't be visible. > Do, we allow the transaction waiting due to delay to get committed on > promotion? The commit succeeded on the primary and then I promoted the secondary during the "recovery_min_apply_delay" wait of the transaction. Then, the result is the transaction turned out to be available on the promoted secondary. > > I'm not sure if I should make the time-delayed LR aligned with this behavior. > > Does someone has an opinion for this ? > > > > Can you please explain a bit more as asked above to understand the > difference? 
So, the current difference is that the time-delayed apply worker of logical replication doesn't apply the delayed transaction on the subscriber when the subscription has been disabled during the delay, while (in the promotion example) physical replication does apply the delayed transaction. Best Regards, Takamichi Osumi
On Wed, Jan 18, 2023 at 6:37 AM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > > > > Can you please explain a bit more as asked above to understand the > > difference? > So, the current difference is that the time-delayed apply > worker of logical replication doesn't apply the delayed transaction on the subscriber > when the subscription has been disabled during the delay, while (in one example > of a promotion) the physical replication does the apply of the delayed transaction. > I don't see any particular reason here to allow the transaction apply to complete if the subscription is disabled. Note, that here we are waiting at the beginning of the transaction and for large transactions, it might cause a significant delay if we allow applying the xact. OTOH, if someone comes up with a valid use case to allow the transaction apply to get completed after the subscription is disabled then we can anyway do it later as well. -- With Regards, Amit Kapila.
Hi, On Wednesday, January 18, 2023 2:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 18, 2023 at 6:37 AM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > > Can you please explain a bit more as asked above to understand the > > > difference? > > So, the current difference is that the time-delayed apply worker of > > logical replication doesn't apply the delayed transaction on the > > subscriber when the subscription has been disabled during the delay, > > while (in one example of a promotion) the physical replication does the apply > of the delayed transaction. > > > > I don't see any particular reason here to allow the transaction apply to complete > if the subscription is disabled. Note, that here we are waiting at the beginning > of the transaction and for large transactions, it might cause a significant delay if > we allow applying the xact. OTOH, if someone comes up with a valid use case > to allow the transaction apply to get completed after the subscription is > disabled then we can anyway do it later as well. This makes sense. I agree with you. So, I'll keep the current behavior of the patch. Best Regards, Takamichi Osumi
Here are my review comments for the latest patch v16-0001. (excluding the test code) ====== General 1. Since the value of min_apply_delay cannot be < 0, I was thinking probably it should have been declared everywhere in this patch as a uint64 instead of an int64, right? ====== Commit message 2. If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction commit for min_apply_delay milliseconds. ~ IMO there should be another sentence before this just to say that a new parameter is being added: e.g. This patch implements a new subscription parameter called 'min_apply_delay'. ====== doc/src/sgml/config.sgml 3. + <para> + For time-delayed logical replication, the apply worker sends a Standby + Status Update message to the corresponding publisher per the indicated + time of this parameter. Therefore, if this parameter is longer than + <literal>wal_sender_timeout</literal> on the publisher, then the + walsender doesn't get any update message during the delay and repeatedly + terminates due to the timeout errors. Hence, make sure this parameter is + shorter than the <literal>wal_sender_timeout</literal> of the publisher. + If this parameter is set to zero with time-delayed replication, the + apply worker doesn't send any feedback messages during the + <literal>min_apply_delay</literal>. + </para> This paragraph seemed confusing. I think it needs to be reworded to change all of the "this parameter" references because there are at least 3 different parameters mentioned in this paragraph. e.g. maybe just change them to explicitly name the parameter you are talking about. I also think it needs to mention the ‘min_apply_delay’ subscription parameter up-front and then refer to it appropriately. The end result might be something like I wrote below (this is just my guess – probably you can word it better). SUGGESTION For time-delayed logical replication (i.e. when the subscription is created with parameter min_apply_delay > 0), the apply worker sends a Standby Status Update message to the publisher with a period of wal_receiver_status_interval . Make sure to set wal_receiver_status_interval less than the wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly terminate due to the timeout errors. If wal_receiver_status_interval is set to zero, the apply worker doesn't send any feedback messages during the subscriber’s min_apply_delay period. ====== doc/src/sgml/ref/create_subscription.sgml 4. + <para> + By default, the subscriber applies changes as soon as possible. As + with the physical replication feature + (<xref linkend="guc-recovery-min-apply-delay"/>), it can be useful to + have a time-delayed logical replica. This parameter lets the user to + delay the application of changes by a specified amount of time. If this + value is specified without units, it is taken as milliseconds. The + default is zero(no delay). + </para> 4a. As with the physical replication feature (recovery_min_apply_delay), it can be useful to have a time-delayed logical replica. IMO not sure that the above sentence is necessary. It seems only to be saying that this parameter can be useful. Why do we need to say that? ~ 4b. "This parameter lets the user to delay" -> "This parameter lets the user delay" OR "This parameter lets the user to delay" -> "This parameter allows the user to delay" ~ 4c. "If this value is specified without units" -> "If the value is specified without units" ~ 4d. "zero(no delay)." -> "zero (no delay)." ---- 5. 
+ <para> + The delay occurs only on WAL records for transaction begins and after + the initial table synchronization. It is possible that the + replication delay between publisher and subscriber exceeds the value + of this parameter, in which case no delay is added. Note that the + delay is calculated between the WAL time stamp as written on + publisher and the current time on the subscriber. Time spent in logical + decoding and in transferring the transaction may reduce the actual wait + time. If the system clocks on publisher and subscriber are not + synchronized, this may lead to apply changes earlier than expected, + but this is not a major issue because this parameter is typically much + larger than the time deviations between servers. Note that if this + parameter is set to a long delay, the replication will stop if the + replication slot falls behind the current LSN by more than + <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>. + </para> I think the first part can be reworded slightly. See what you think about the suggestion below. SUGGESTION Any delay occurs only on WAL records for transaction begins after all initial table synchronization has finished. The delay is calculated between the WAL timestamp as written on the publisher and the current time on the subscriber. Any overhead of time spent in logical decoding and in transferring the transaction may reduce the actual wait time. It is also possible that the overhead already exceeds the requested 'min_apply_delay' value, in which case no additional wait is necessary. If the system clocks... ---- 6. + <para> + Setting streaming to <literal>parallel</literal> mode and <literal>min_apply_delay</literal> + simultaneously is not supported. + </para> SUGGESTION A non-zero min_apply_delay parameter is not allowed when streaming in parallel mode. ====== src/backend/commands/subscriptioncmds.c 7. parse_subscription_options @@ -404,6 +445,17 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, "slot_name = NONE", "create_slot = false"))); } } + + /* Test the combination of streaming mode and min_apply_delay */ + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0) + { + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s and %s are mutually exclusive options", + "min_apply_delay > 0", "streaming = parallel")); + } SUGGESTION (comment) The combination of parallel streaming mode and min_apply_delay is not allowed. ~~~ 8. AlterSubscription (general) I observed during testing there are 3 different errors…. At subscription CREATE time you can get this error: ERROR: min_apply_delay > 0 and streaming = parallel are mutually exclusive options If you try to ALTER the min_apply_delay when already streaming = parallel you can get this error: ERROR: cannot enable min_apply_delay for subscription in streaming = parallel mode If you try to ALTER the streaming to be parallel if there is already a min_apply_delay > 0 then you can get this error: ERROR: cannot enable streaming = parallel mode for subscription with min_apply_delay ~ IMO there is no need to have 3 different error message texts. I think all these cases are explained by just the first text (ERROR: min_apply_delay > 0 and streaming = parallel are mutually exclusive options) ~~~ 9. 
AlterSubscription @@ -1098,6 +1152,18 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt, if (IsSet(opts.specified_opts, SUBOPT_STREAMING)) { + /* + * Test the combination of streaming mode and + * min_apply_delay + */ + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) || + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable %s mode for subscription with %s", + "streaming = parallel", "min_apply_delay")); + 9a. SUGGESTION (comment) The combination of parallel streaming mode and min_apply_delay is not allowed. ~ 9b. (see AlterSubscription general review comment #8 above) Here you can use the same comment error message that says min_apply_delay > 0 and streaming = parallel are mutually exclusive options. ~~~ 10. AlterSubscription @@ -1111,6 +1177,25 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt, = true; } + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) + { + /* + * Test the combination of streaming mode and + * min_apply_delay + */ + if (opts.min_apply_delay > 0) + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == LOGICALREP_STREAM_PARALLEL) || + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable %s for subscription in %s mode", + "min_apply_delay", "streaming = parallel")); + + values[Anum_pg_subscription_subminapplydelay - 1] = + Int64GetDatum(opts.min_apply_delay); + replaces[Anum_pg_subscription_subminapplydelay - 1] = true; + } 10a. SUGGESTION (comment) The combination of parallel streaming mode and min_apply_delay is not allowed. ~ 10b. (see AlterSubscription general review comment #8 above) Here you can use the same comment error message that says min_apply_delay > 0 and streaming = parallel are mutually exclusive options. ====== .../replication/logical/applyparallelworker.c 11. @@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void) { apply_spooled_messages(&MyParallelShared->fileset, MyParallelShared->xid, - InvalidXLogRecPtr); + InvalidXLogRecPtr, + 0); IMO this passing of 0 is a bit strange because it is currently acting like a dummy value since the apply_spooled_messages will never make use of the 'finish_ts' anyway (since this call is from a parallel apply worker). I think a better way to code this might be to pass the 0 (same as you are doing here) but inside the apply_spooled_messages change the code: FROM if (!am_parallel_apply_worker()) maybe_delay_apply(finish_ts); TO if (finish_ts) maybe_delay_apply(finish_ts); That does 2 things. - It makes the passed-in 0 have some meaning - It simplifies the apply_spooled_messages code ====== src/backend/replication/logical/worker.c 12. @@ -318,6 +318,17 @@ static List *on_commit_wakeup_workers_subids = NIL; bool in_remote_transaction = false; static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr; +/* + * In order to avoid walsender's timeout during time-delayed replication, + * it's necessary to keep sending feedback messages during the delay from the + * worker process. Meanwhile, the feature delays the apply before starting the + * transaction and thus we don't write WALs for the suspended changes during + * the wait. 
Hence, in the case the worker process sends a feedback message + * during the delay, we should not make positions of the flushed and apply LSN + * overwritten by the last received latest LSN. See send_feedback() for details. + */ +static XLogRecPtr last_received = InvalidXLogRecPtr; 12a. Suggest a small change to the first sentence of the comment. BEFORE In order to avoid walsender's timeout during time-delayed replication, it's necessary to keep sending feedback messages during the delay from the worker process. AFTER In order to avoid walsender timeout for time-delayed replication the worker process keeps sending feedback messages during the delay period. ~ 12b. "Hence, in the case" -> "When" ~~~ 13. forward declare -static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply); +static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, + bool in_delaying_apply); Change the param name: "in_delaying_apply" -> "in_delayed_apply” (??) ~~~ 14. maybe_delay_apply + /* Nothing to do if no delay set */ + if (MySubscription->minapplydelay <= 0) + return; IIUC min_apply_delay cannot be < 0 so this condition could simply be: if (!MySubscription->minapplydelay) return; ~~~ 15. maybe_delay_apply + /* + * The min_apply_delay parameter is ignored until all tablesync workers + * have reached READY state. If we allow the delay during the catchup + * phase, once we reach the limit of tablesync workers, it will impose a + * delay for each subsequent worker. It means it will take a long time to + * finish the initial table synchronization. + */ + if (!AllTablesyncsReady()) + return; SUGGESTION (slight rewording) The min_apply_delay parameter is ignored until all tablesync workers have reached READY state. This is because if we allowed the delay during the catchup phase, then once we reached the limit of tablesync workers it would impose a delay for each subsequent worker. That would cause initial table synchronization completion to take a long time. ~~~ 16. maybe_delay_apply + while (true) + { + long diffms; + + ResetLatch(MyLatch); + + CHECK_FOR_INTERRUPTS(); IMO there should be some small explanatory comment here at the top of the while loop. ~~~ 17. apply_spooled_messages @@ -2024,6 +2141,21 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid, int fileno; off_t offset; + /* + * Should we delay the current transaction? + * + * Unlike the regular (non-streamed) cases, the delay is applied in a + * STREAM COMMIT/STREAM PREPARE message for streamed transactions. The + * STREAM START message does not contain a commit/prepare time (it will be + * available when the in-progress transaction finishes). Hence, it's not + * appropriate to apply a delay at that time. + * + * It's not allowed to execute time-delayed replication with parallel + * apply feature. + */ + if (!am_parallel_apply_worker()) + maybe_delay_apply(finish_ts); That whole comment part "Unlike the regular (non-streamed) cases" seems misplaced here. Perhaps this part of the comment is better put into the function header where the meaning of 'finish_ts' is explained? ~~~ 18. apply_spooled_messages + * It's not allowed to execute time-delayed replication with parallel + * apply feature. + */ + if (!am_parallel_apply_worker()) + maybe_delay_apply(finish_ts); As was mentioned in comment #11 above this code could be changed like if (finish_ts) maybe_delay_apply(finish_ts); then you don't even need to make mention of "parallel apply" at all here. 
OTOH if you want to still have the parallel apply comment then maybe reword it like this: "It is not allowed to combine time-delayed replication with the parallel apply feature." ~~~ 19. apply_spooled_messages If you chose not to do my suggestion from comment #11, then there are 2 identical conditions (!am_parallel_apply_worker()); In this case, I was wondering if it would be better to refactor to use a single condition instead. ~~~ 20. send_feedback (same as comment #13) Maybe change the new param name to “in_delayed_apply”? ~~~ 21. @@ -3737,8 +3869,15 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply) /* * No outstanding transactions to flush, we can report the latest received * position. This is important for synchronous replication. + * + * During the delay of time-delayed replication, do not tell the publisher + * that the received latest LSN is already applied and flushed at this + * stage, since we don't apply the transaction yet. If we do so, it leads + * to a wrong assumption of logical replication progress on the publisher + * side. Here, we just send a feedback message to avoid publisher's + * timeout during the delay. */ Minor rewording of the comment SUGGESTION If the subscriber side apply is delayed (because of time-delayed replication) then do not tell the publisher that the received latest LSN is already applied and flushed, otherwise, it leads to the publisher side making a wrong assumption of logical replication progress. Instead, we just send a feedback message to avoid a publisher timeout during the delay. ====== src/bin/pg_dump/pg_dump.c 22. @@ -4546,9 +4547,14 @@ getSubscriptions(Archive *fout) LOGICALREP_TWOPHASE_STATE_DISABLED); if (fout->remoteVersion >= 160000) - appendPQExpBufferStr(query, " s.suborigin\n"); + appendPQExpBufferStr(query, + " s.suborigin,\n" + " s.subminapplydelay\n"); else - appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY); + { + appendPQExpBuffer(query, " '%s' AS suborigin,\n", LOGICALREP_ORIGIN_ANY); + appendPQExpBufferStr(query, " 0 AS subminapplydelay\n"); + } Can’t those appends in the else part can be combined to a single appendPQExpBuffer appendPQExpBuffer(query, " '%s' AS suborigin,\n" " 0 AS subminapplydelay\n" LOGICALREP_ORIGIN_ANY); ====== src/include/catalog/pg_subscription.h 23. @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW XLogRecPtr subskiplsn; /* All changes finished at this LSN are * skipped */ + int64 subminapplydelay; /* Replication apply delay */ + NameData subname; /* Name of the subscription */ Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ SUGGESTION (for comment) Replication apply delay (ms) ~~ 24. @@ -120,6 +122,7 @@ typedef struct Subscription * in */ XLogRecPtr skiplsn; /* All changes finished at this LSN are * skipped */ + int64 minapplydelay; /* Replication apply delay */ SUGGESTION (for comment) Replication apply delay (ms) ------ Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for the latest patch v16-0001. (excluding > the test code) > ... > > 8. AlterSubscription (general) > > I observed during testing there are 3 different errors…. > > At subscription CREATE time you can get this error: > ERROR: min_apply_delay > 0 and streaming = parallel are mutually > exclusive options > > If you try to ALTER the min_apply_delay when already streaming = > parallel you can get this error: > ERROR: cannot enable min_apply_delay for subscription in streaming = > parallel mode > > If you try to ALTER the streaming to be parallel if there is already a > min_apply_delay > 0 then you can get this error: > ERROR: cannot enable streaming = parallel mode for subscription with > min_apply_delay > > ~ > > IMO there is no need to have 3 different error message texts. I think > all these cases are explained by just the first text (ERROR: > min_apply_delay > 0 and streaming = parallel are mutually exclusive > options) > > After checking the regression test output I can see the merit of your separate error messages like this, even if they are maybe not strictly necessary. So feel free to ignore my previous review comment. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jan 18, 2023 at 6:06 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for the latest patch v16-0001. (excluding > the test code) > And here are some review comments for the v16-0001 test code. ====== src/test/regress/sql/subscription.sql 1. General For all comments "time delayed replication" -> "time-delayed replication" maybe is better? ~~~ 2. -- fail - utilizing streaming = parallel with time delayed replication is not supported. For readability please put a blank line before this test. ~~~ 3. -- success -- value without unit is taken as milliseconds "value" -> "min_apply_delay value" ~~~ 4. -- success -- interval is converted into ms and stored as integer "interval" -> "min_apply_delay interval" "integer" -> "an integer" ~~~ 5. You could also add another test where min_apply_delay is 0 Then the following combination can be confirmed OK -- success create subscription with (streaming=parallel, min_apply_delay=0) ~~ 6. -- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set. CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel); There is another way to do this test without creating a brand-new subscription. You could just alter the existing subscription like: ALTER ... SET (min_apply_delay = 0) then ALTER ... SET (parallel = streaming) then ALTER ... SET (min_apply_delay = 123) ====== src/test/subscription/t/032_apply_delay.pl 7. sub check_apply_delay_log my ($node_subscriber, $message, $expected) = @_; Why pass in the message text? I is always the same so can be hardwired in this function, right? ~~~ 8. # Get the delay time in the server log "int the server log" -> "from the server log" (?) ~~~ 9. qr/$message: (\d+) ms/ or die "could not get delayed time"; my $logged_delay = $1; # Is it larger than expected? cmp_ok($logged_delay, '>', $expected, "The wait time of the apply worker is long enough expectedly" ); 9a. "could not get delayed time" -> "could not get the apply worker wait time" 9b. "The wait time of the apply worker is long enough expectedly" -> "The apply worker wait time has expected duration" ~~~ 10. sub check_apply_delay_time Maybe a brief explanatory comment for this function is needed to explain the unreplicated column c. ~~~ 11. $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, min_apply_delay = '3s')" I think there should be a comment here highlighting that you are setting up a subscriber time delay of 3 seconds, and then later you can better describe the parameters for the checking functions... e.g. (add this comment) # verifies that the subscriber lags the publisher by at least 3 seconds check_apply_delay_time($node_publisher, $node_subscriber, '5', '3'); e.g. # verifies that the subscriber lags the publisher by at least 3 seconds check_apply_delay_time($node_publisher, $node_subscriber, '8', '3'); ~~~ 12. # Test whether ALTER SUBSCRIPTION changes the delayed time of the apply worker # (1 day 1 minute). $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)" ); Update the comment with another note. # Note - The extra 1 min is to account for any decoding/network overhead. ~~~ 13. 
# Make sure we have long enough min_apply_delay after the ALTER command check_apply_delay_log($node_subscriber, "logical replication apply delay", "80000000"); IMO the expectation of 1 day (86460000 ms) wait time might be a better number for your "expected" value. So update the comment/call like this: # Make sure the apply worker knows to wait for more than 1 day (86400000 ms) check_apply_delay_log($node_subscriber, "logical replication apply delay", "86400000"); ------ Kind Regards, Peter Smith. Fujitsu Australia
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Thursday, January 19, 2023 10:49 AM Peter Smith <smithpb2250@gmail.com> wrote: > On Wed, Jan 18, 2023 at 6:06 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > Here are my review comments for the latest patch v16-0001. (excluding > > the test code) > > > > And here are some review comments for the v16-0001 test code. Hi, thanks for your review ! > ====== > > src/test/regress/sql/subscription.sql > > 1. General > For all comments > > "time delayed replication" -> "time-delayed replication" maybe is better? Fixed. > ~~~ > > 2. > -- fail - utilizing streaming = parallel with time delayed replication is not > supported. > > For readability please put a blank line before this test. Fixed. > ~~~ > > 3. > -- success -- value without unit is taken as milliseconds > > "value" -> "min_apply_delay value" Fixed. > ~~~ > > 4. > -- success -- interval is converted into ms and stored as integer > > "interval" -> "min_apply_delay interval" > > "integer" -> "an integer" Both are fixed. > ~~~ > > 5. > You could also add another test where min_apply_delay is 0 > > Then the following combination can be confirmed OK -- success create > subscription with (streaming=parallel, min_apply_delay=0) This combination is added with the modification for #6. > ~~ > > 6. > -- fail - alter subscription with min_apply_delay should fail when streaming = > parallel is set. > CREATE SUBSCRIPTION regress_testsub CONNECTION > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, > streaming = parallel); > > There is another way to do this test without creating a brand-new subscription. > You could just alter the existing subscription like: > ALTER ... SET (min_apply_delay = 0) > then ALTER ... SET (parallel = streaming) then ALTER ... SET (min_apply_delay > = 123) Fixed. > ====== > > src/test/subscription/t/032_apply_delay.pl > > 7. sub check_apply_delay_log > > my ($node_subscriber, $message, $expected) = @_; > > Why pass in the message text? I is always the same so can be hardwired in this > function, right? Fixed. > ~~~ > > 8. > # Get the delay time in the server log > > "int the server log" -> "from the server log" (?) Fixed. > ~~~ > > 9. > qr/$message: (\d+) ms/ > or die "could not get delayed time"; > my $logged_delay = $1; > > # Is it larger than expected? > cmp_ok($logged_delay, '>', $expected, > "The wait time of the apply worker is long enough expectedly" > ); > > 9a. > "could not get delayed time" -> "could not get the apply worker wait time" > > 9b. > "The wait time of the apply worker is long enough expectedly" -> "The apply > worker wait time has expected duration" Both are fixed. > ~~~ > > 10. > sub check_apply_delay_time > > > Maybe a brief explanatory comment for this function is needed to explain the > unreplicated column c. Added. > ~~~ > > 11. > $node_subscriber->safe_psql('postgres', > "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, > min_apply_delay = '3s')" > > > I think there should be a comment here highlighting that you are setting up a > subscriber time delay of 3 seconds, and then later you can better describe the > parameters for the checking functions... Added a comment for CREATE SUBSCRIPTION command. > e.g. (add this comment) > # verifies that the subscriber lags the publisher by at least 3 seconds > check_apply_delay_time($node_publisher, $node_subscriber, '5', '3'); > > e.g. 
> # verifies that the subscriber lags the publisher by at least 3 seconds > check_apply_delay_time($node_publisher, $node_subscriber, '8', '3'); Added. > ~~~ > > 12. > # Test whether ALTER SUBSCRIPTION changes the delayed time of the apply > worker # (1 day 1 minute). > $node_subscriber->safe_psql('postgres', > "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 86460000)" > ); > > Update the comment with another note. > # Note - The extra 1 min is to account for any decoding/network overhead. Okay, added the comment. In general, TAP tests fail if we wait for more than 3 minutes. Then, we should think setting the maximum consumed time more than 3 minutes is safe. For example, if (which should not happen usually, but) we consumed more than 1 minutes between this ALTER SUBSCRIPTION SET and below check_apply_delay_log() then, the test will fail. So made the extra time bigger. > ~~~ > > 13. > # Make sure we have long enough min_apply_delay after the ALTER command > check_apply_delay_log($node_subscriber, "logical replication apply delay", > "80000000"); > > IMO the expectation of 1 day (86460000 ms) wait time might be a better number > for your "expected" value. > > So update the comment/call like this: > > # Make sure the apply worker knows to wait for more than 1 day (86400000 ms) > check_apply_delay_log($node_subscriber, "logical replication apply delay", > "86400000"); Updated the comment and the function call. Kindly have a look at the updated patch v17. Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Wednesday, January 18, 2023 4:06 PM Peter Smith <smithpb2250@gmail.com> wrote: > Here are my review comments for the latest patch v16-0001. (excluding the > test code) Hi, thank you for your review ! > ====== > > General > > 1. > > Since the value of min_apply_delay cannot be < 0, I was thinking probably it > should have been declared everywhere in this patch as a > uint64 instead of an int64, right? No, we won't be able to adopt this idea. It seems that we are not able to use uint for catalog type. So, can't applying it to the pg_subscription.h definitions and then similarly Int64GetDatum to store catalog variables and the argument variable of Int64GetDatum. Plus, there is a possibility that type Interval becomes negative value, then we are not able to change the int64 variable to get the return value of interval2ms(). > ====== > > Commit message > > 2. > > If the subscription sets min_apply_delay parameter, the logical replication > worker will delay the transaction commit for min_apply_delay milliseconds. > > ~ > > IMO there should be another sentence before this just to say that a new > parameter is being added: > > e.g. > This patch implements a new subscription parameter called > 'min_apply_delay'. Added. > ====== > > doc/src/sgml/config.sgml > > 3. > > + <para> > + For time-delayed logical replication, the apply worker sends a Standby > + Status Update message to the corresponding publisher per the > indicated > + time of this parameter. Therefore, if this parameter is longer than > + <literal>wal_sender_timeout</literal> on the publisher, then the > + walsender doesn't get any update message during the delay and > repeatedly > + terminates due to the timeout errors. Hence, make sure this parameter > is > + shorter than the <literal>wal_sender_timeout</literal> of the > publisher. > + If this parameter is set to zero with time-delayed replication, the > + apply worker doesn't send any feedback messages during the > + <literal>min_apply_delay</literal>. > + </para> > > > This paragraph seemed confusing. I think it needs to be reworded to change all > of the "this parameter" references because there are at least 3 different > parameters mentioned in this paragraph. e.g. maybe just change them to > explicitly name the parameter you are talking about. > > I also think it needs to mention the ‘min_apply_delay’ subscription parameter > up-front and then refer to it appropriately. > > The end result might be something like I wrote below (this is just my guess ? > probably you can word it better). > > SUGGESTION > For time-delayed logical replication (i.e. when the subscription is created with > parameter min_apply_delay > 0), the apply worker sends a Standby Status > Update message to the publisher with a period of wal_receiver_status_interval . > Make sure to set wal_receiver_status_interval less than the > wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly > terminate due to the timeout errors. If wal_receiver_status_interval is set to zero, > the apply worker doesn't send any feedback messages during the subscriber’s > min_apply_delay period. Applied. Also, I added one reference for min_apply_delay parameter at the end of this description. > ====== > > doc/src/sgml/ref/create_subscription.sgml > > 4. > > + <para> > + By default, the subscriber applies changes as soon as possible. As > + with the physical replication feature > + (<xref linkend="guc-recovery-min-apply-delay"/>), it can be > useful to > + have a time-delayed logical replica. 
This parameter lets the user to > + delay the application of changes by a specified amount of > time. If this > + value is specified without units, it is taken as milliseconds. The > + default is zero(no delay). > + </para> > > 4a. > As with the physical replication feature (recovery_min_apply_delay), it can be > useful to have a time-delayed logical replica. > > IMO not sure that the above sentence is necessary. It seems only to be saying > that this parameter can be useful. Why do we need to say that? Removed the sentence. > ~ > > 4b. > "This parameter lets the user to delay" -> "This parameter lets the user delay" > OR > "This parameter lets the user to delay" -> "This parameter allows the user to > delay" Fixed. > ~ > > 4c. > "If this value is specified without units" -> "If the value is specified without > units" Fixed. > ~ > > 4d. > "zero(no delay)." -> "zero (no delay)." Fixed. > ---- > > 5. > > + <para> > + The delay occurs only on WAL records for transaction begins and > after > + the initial table synchronization. It is possible that the > + replication delay between publisher and subscriber exceeds the > value > + of this parameter, in which case no delay is added. Note that the > + delay is calculated between the WAL time stamp as written on > + publisher and the current time on the subscriber. Time > spent in logical > + decoding and in transferring the transaction may reduce the > actual wait > + time. If the system clocks on publisher and subscriber are not > + synchronized, this may lead to apply changes earlier than > expected, > + but this is not a major issue because this parameter is > typically much > + larger than the time deviations between servers. Note that if this > + parameter is set to a long delay, the replication will stop if the > + replication slot falls behind the current LSN by more than > + <link > linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</ > literal></link>. > + </para> > > I think the first part can be reworded slightly. See what you think about the > suggestion below. > > SUGGESTION > Any delay occurs only on WAL records for transaction begins after all initial > table synchronization has finished. The delay is calculated between the WAL > timestamp as written on the publisher and the current time on the subscriber. > Any overhead of time spent in logical decoding and in transferring the > transaction may reduce the actual wait time. > It is also possible that the overhead already exceeds the requested > 'min_apply_delay' value, in which case no additional wait is necessary. If the > system clocks... Addressed. > ---- > > 6. > > + <para> > + Setting streaming to <literal>parallel</literal> mode and > <literal>min_apply_delay</literal> > + simultaneously is not supported. > + </para> > > SUGGESTION > A non-zero min_apply_delay parameter is not allowed when streaming in > parallel mode. Applied. > ====== > > src/backend/commands/subscriptioncmds.c > > 7. 
parse_subscription_options > > @@ -404,6 +445,17 @@ parse_subscription_options(ParseState *pstate, List > *stmt_options, > "slot_name = NONE", "create_slot = false"))); > } > } > + > + /* Test the combination of streaming mode and min_apply_delay */ if > + (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0) > + { > + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) > ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), errmsg("%s and %s are mutually > + exclusive options", > + "min_apply_delay > 0", "streaming = parallel")); } > > SUGGESTION (comment) > The combination of parallel streaming mode and min_apply_delay is not > allowed. Fixed. > ~~~ > > 8. AlterSubscription (general) > > I observed during testing there are 3 different errors…. > > At subscription CREATE time you can get this error: > ERROR: min_apply_delay > 0 and streaming = parallel are mutually exclusive > options > > If you try to ALTER the min_apply_delay when already streaming = parallel you > can get this error: > ERROR: cannot enable min_apply_delay for subscription in streaming = > parallel mode > > If you try to ALTER the streaming to be parallel if there is already a > min_apply_delay > 0 then you can get this error: > ERROR: cannot enable streaming = parallel mode for subscription with > min_apply_delay Yes. This is because the existing error message styles in AlterSubscription and parse_subscription_options. The former uses "mutually exclusive" messages consistently, while the latter does "cannot enable ..." ones. > ~ > > IMO there is no need to have 3 different error message texts. I think all these > cases are explained by just the first text (ERROR: > min_apply_delay > 0 and streaming = parallel are mutually exclusive > options) Then, we followed this kind of formats. > ~~~ > > 9. AlterSubscription > > @@ -1098,6 +1152,18 @@ AlterSubscription(ParseState *pstate, > AlterSubscriptionStmt *stmt, > > if (IsSet(opts.specified_opts, SUBOPT_STREAMING)) > { > + /* > + * Test the combination of streaming mode and > + * min_apply_delay > + */ > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > opts.min_apply_delay > 0) || > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > sub->minapplydelay > 0)) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot enable %s mode for subscription with %s", > + "streaming = parallel", "min_apply_delay")); > + > > 9a. > SUGGESTION (comment) > The combination of parallel streaming mode and min_apply_delay is not > allowed. Fixed. > ~ > > 9b. > (see AlterSubscription general review comment #8 above) Here you can use the > same comment error message that says min_apply_delay > 0 and streaming = > parallel are mutually exclusive options. As described above, we followed the current style in the existing functions. > ~~~ > > 10. 
AlterSubscription > > @@ -1111,6 +1177,25 @@ AlterSubscription(ParseState *pstate, > AlterSubscriptionStmt *stmt, > = true; > } > > + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { > + /* > + * Test the combination of streaming mode and > + * min_apply_delay > + */ > + if (opts.min_apply_delay > 0) > + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming > == LOGICALREP_STREAM_PARALLEL) || > + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > LOGICALREP_STREAM_PARALLEL)) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot enable %s for subscription in %s mode", > + "min_apply_delay", "streaming = parallel")); > + > + values[Anum_pg_subscription_subminapplydelay - 1] = > + Int64GetDatum(opts.min_apply_delay); > + replaces[Anum_pg_subscription_subminapplydelay - 1] = true; } > > 10a. > SUGGESTION (comment) > The combination of parallel streaming mode and min_apply_delay is not > allowed. Fixed. > ~ > > 10b. > (see AlterSubscription general review comment #8 above) Here you can use the > same comment error message that says min_apply_delay > 0 and streaming = > parallel are mutually exclusive options. Same as 9b. > ====== > > .../replication/logical/applyparallelworker.c > > 11. > > @@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void) > { > apply_spooled_messages(&MyParallelShared->fileset, > MyParallelShared->xid, > - InvalidXLogRecPtr); > + InvalidXLogRecPtr, > + 0); > > IMO this passing of 0 is a bit strange because it is currently acting like a dummy > value since the apply_spooled_messages will never make use of the 'finish_ts' > anyway (since this call is from a parallel apply worker). > > I think a better way to code this might be to pass the 0 (same as you are doing > here) but inside the apply_spooled_messages change the code: > > FROM > if (!am_parallel_apply_worker()) > maybe_delay_apply(finish_ts); > > TO > if (finish_ts) > maybe_delay_apply(finish_ts); > > That does 2 things. > - It makes the passed-in 0 have some meaning > - It simplifies the apply_spooled_messages code Adopted. > ====== > > src/backend/replication/logical/worker.c > > 12. > > @@ -318,6 +318,17 @@ static List *on_commit_wakeup_workers_subids = > NIL; bool in_remote_transaction = false; static XLogRecPtr > remote_final_lsn = InvalidXLogRecPtr; > > +/* > + * In order to avoid walsender's timeout during time-delayed > +replication, > + * it's necessary to keep sending feedback messages during the delay > +from the > + * worker process. Meanwhile, the feature delays the apply before > +starting the > + * transaction and thus we don't write WALs for the suspended changes > +during > + * the wait. Hence, in the case the worker process sends a feedback > +message > + * during the delay, we should not make positions of the flushed and > +apply LSN > + * overwritten by the last received latest LSN. See send_feedback() > for details. > + */ > +static XLogRecPtr last_received = InvalidXLogRecPtr; > > 12a. > Suggest a small change to the first sentence of the comment. > > BEFORE > In order to avoid walsender's timeout during time-delayed replication, it's > necessary to keep sending feedback messages during the delay from the > worker process. > > AFTER > In order to avoid walsender timeout for time-delayed replication the worker > process keeps sending feedback messages during the delay period. Fixed. > ~ > > 12b. > "Hence, in the case" -> "When" Fixed. > ~~~ > > 13. 
forward declare > > -static void send_feedback(XLogRecPtr recvpos, bool force, bool > requestReply); > +static void send_feedback(XLogRecPtr recvpos, bool force, bool > requestReply, > + bool in_delaying_apply); > > Change the param name: > > "in_delaying_apply" -> "in_delayed_apply” (??) Changed. The initial intention to append the "in_" prefix is to make the variable name aligned with some other variables such as "in_remote_transaction" and "in_streamed_transaction" that mean the current status for the transaction. So, until there is a better name proposed, we can keep it. > ~~~ > > 14. maybe_delay_apply > > + /* Nothing to do if no delay set */ > + if (MySubscription->minapplydelay <= 0) return; > > IIUC min_apply_delay cannot be < 0 so this condition could simply be: > > if (!MySubscription->minapplydelay) > return; Fixed. > ~~~ > > 15. maybe_delay_apply > > + /* > + * The min_apply_delay parameter is ignored until all tablesync workers > + * have reached READY state. If we allow the delay during the catchup > + * phase, once we reach the limit of tablesync workers, it will impose > + a > + * delay for each subsequent worker. It means it will take a long time > + to > + * finish the initial table synchronization. > + */ > + if (!AllTablesyncsReady()) > + return; > > SUGGESTION (slight rewording) > The min_apply_delay parameter is ignored until all tablesync workers have > reached READY state. This is because if we allowed the delay during the > catchup phase, then once we reached the limit of tablesync workers it would > impose a delay for each subsequent worker. That would cause initial table > synchronization completion to take a long time. Fixed. > ~~~ > > 16. maybe_delay_apply > > + while (true) > + { > + long diffms; > + > + ResetLatch(MyLatch); > + > + CHECK_FOR_INTERRUPTS(); > > IMO there should be some small explanatory comment here at the top of the > while loop. Added. > ~~~ > > 17. apply_spooled_messages > > @@ -2024,6 +2141,21 @@ apply_spooled_messages(FileSet *stream_fileset, > TransactionId xid, > int fileno; > off_t offset; > > + /* > + * Should we delay the current transaction? > + * > + * Unlike the regular (non-streamed) cases, the delay is applied in a > + * STREAM COMMIT/STREAM PREPARE message for streamed transactions. > The > + * STREAM START message does not contain a commit/prepare time (it will > + be > + * available when the in-progress transaction finishes). Hence, it's > + not > + * appropriate to apply a delay at that time. > + * > + * It's not allowed to execute time-delayed replication with parallel > + * apply feature. > + */ > + if (!am_parallel_apply_worker()) > + maybe_delay_apply(finish_ts); > > That whole comment part "Unlike the regular (non-streamed) cases" > seems misplaced here. Perhaps this part of the comment is better put into > the function header where the meaning of 'finish_ts' is explained? Moved it to the header comment for maybe_delay_apply. > ~~~ > > 18. apply_spooled_messages > > + * It's not allowed to execute time-delayed replication with parallel > + * apply feature. > + */ > + if (!am_parallel_apply_worker()) > + maybe_delay_apply(finish_ts); > > As was mentioned in comment #11 above this code could be changed like > > if (finish_ts) > maybe_delay_apply(finish_ts); > then you don't even need to make mention of "parallel apply" at all here. > > OTOH if you want to still have the parallel apply comment then maybe reword it > like this: > "It is not allowed to combine time-delayed replication with the parallel apply > feature." 
Changed and now I don't mention the parallel apply feature. > ~~~ > > 19. apply_spooled_messages > > If you chose not to do my suggestion from comment #11, then there are > 2 identical conditions (!am_parallel_apply_worker()); In this case, I was > wondering if it would be better to refactor to use a single condition instead. I applied #11 comment. Now, the conditions are not identical. > ~~~ > > 20. send_feedback > (same as comment #13) > > Maybe change the new param name to “in_delayed_apply”? Changed. > ~~~ > > 21. > > @@ -3737,8 +3869,15 @@ send_feedback(XLogRecPtr recvpos, bool force, > bool requestReply) > /* > * No outstanding transactions to flush, we can report the latest received > * position. This is important for synchronous replication. > + * > + * During the delay of time-delayed replication, do not tell the > + publisher > + * that the received latest LSN is already applied and flushed at this > + * stage, since we don't apply the transaction yet. If we do so, it > + leads > + * to a wrong assumption of logical replication progress on the > + publisher > + * side. Here, we just send a feedback message to avoid publisher's > + * timeout during the delay. > */ > > Minor rewording of the comment > > SUGGESTION > If the subscriber side apply is delayed (because of time-delayed > replication) then do not tell the publisher that the received latest LSN is already > applied and flushed, otherwise, it leads to the publisher side making a wrong > assumption of logical replication progress. Instead, we just send a feedback > message to avoid a publisher timeout during the delay. Adopted. > ====== > > > src/bin/pg_dump/pg_dump.c > > 22. > > @@ -4546,9 +4547,14 @@ getSubscriptions(Archive *fout) > LOGICALREP_TWOPHASE_STATE_DISABLED); > > if (fout->remoteVersion >= 160000) > - appendPQExpBufferStr(query, " s.suborigin\n"); > + appendPQExpBufferStr(query, > + " s.suborigin,\n" > + " s.subminapplydelay\n"); > else > - appendPQExpBuffer(query, " '%s' AS suborigin\n", > LOGICALREP_ORIGIN_ANY); > + { > + appendPQExpBuffer(query, " '%s' AS suborigin,\n", > + LOGICALREP_ORIGIN_ANY); appendPQExpBufferStr(query, " 0 AS > + subminapplydelay\n"); } > > Can’t those appends in the else part can be combined to a single > appendPQExpBuffer > > appendPQExpBuffer(query, > " '%s' AS suborigin,\n" > " 0 AS subminapplydelay\n" > LOGICALREP_ORIGIN_ANY); Adopted. > ====== > > src/include/catalog/pg_subscription.h > > 23. > > @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) > BKI_SHARED_RELATION BKI_ROW > XLogRecPtr subskiplsn; /* All changes finished at this LSN are > * skipped */ > > + int64 subminapplydelay; /* Replication apply delay */ > + > NameData subname; /* Name of the subscription */ > > Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ > > SUGGESTION (for comment) > Replication apply delay (ms) Fixed. > ~~ > > 24. > > @@ -120,6 +122,7 @@ typedef struct Subscription > * in */ > XLogRecPtr skiplsn; /* All changes finished at this LSN are > * skipped */ > + int64 minapplydelay; /* Replication apply delay */ > > SUGGESTION (for comment) > Replication apply delay (ms) Fixed. Kindly have a look at the latest v17 patch in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373F5162C7A0E6224670CF0EDC49%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
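For readers following comment 21 and its resolution above, a minimal sketch of the intended send_feedback() behaviour is shown below. The variable names follow the existing worker.c code, and in_delayed_apply is the patch's new parameter; this is an illustrative fragment, not the actual patch hunk.

    /* Inside send_feedback(recvpos, force, requestReply, in_delayed_apply) --
     * hypothetical fragment for illustration. */
    get_flush_position(&writepos, &flushpos, &have_pending_txes);

    /*
     * No outstanding transactions to flush: normally report the latest
     * received position as written/flushed too.  While the apply is being
     * delayed, skip that, so the publisher does not assume the delayed
     * transaction is already applied.
     */
    if (!have_pending_txes && !in_delayed_apply)
        flushpos = writepos = recvpos;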
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Thursday, January 19, 2023 10:42 AM Peter Smith <smithpb2250@gmail.com> wrote: > On Wed, Jan 18, 2023 at 6:06 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > Here are my review comments for the latest patch v16-0001. (excluding > > the test code) > > > ... > > > > 8. AlterSubscription (general) > > > > I observed during testing there are 3 different errors…. > > > > At subscription CREATE time you can get this error: > > ERROR: min_apply_delay > 0 and streaming = parallel are mutually > > exclusive options > > > > If you try to ALTER the min_apply_delay when already streaming = > > parallel you can get this error: > > ERROR: cannot enable min_apply_delay for subscription in streaming = > > parallel mode > > > > If you try to ALTER the streaming to be parallel if there is already a > > min_apply_delay > 0 then you can get this error: > > ERROR: cannot enable streaming = parallel mode for subscription with > > min_apply_delay > > > > ~ > > > > IMO there is no need to have 3 different error message texts. I think > > all these cases are explained by just the first text (ERROR: > > min_apply_delay > 0 and streaming = parallel are mutually exclusive > > options) > > > > > > After checking the regression test output I can see the merit of your separate > error messages like this, even if they are maybe not strictly necessary. So feel > free to ignore my previous review comment. Thank you for your notification. I wrote another reason why we wrote those messages in [1]. So, please have a look at it. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373447440202B248BB63805EDC49%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Thu, 19 Jan 2023 at 12:06, Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > Updated the comment and the function call. > > Kindly have a look at the updated patch v17. Thanks for the updated patch, few comments: 1) min_apply_delay was accepting values like '600 m s h', I was not sure if we should allow this: alter subscription sub1 set (min_apply_delay = ' 600 m s h'); + /* + * If no unit was specified, then explicitly add 'ms' otherwise + * the interval_in function would assume 'seconds'. + */ + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) + val = psprintf("%sms", tmp); + else + val = tmp; + + interval = DatumGetIntervalP(DirectFunctionCall3(interval_in, + CStringGetDatum(val), + ObjectIdGetDatum(InvalidOid), + Int32GetDatum(-1))); 2) How about adding current_txn_wait_time in pg_stat_subscription_stats, we can update the current_txn_wait_time periodically, this will help the user to check approximately how much time is left(min_apply_delay - stat value) before this transaction will be applied in the subscription. If you agree this can be 0002 patch. 3) There is one check at parse_subscription_options and another check in AlterSubscription, this looks like a redundant check in case of alter subscription, can we try to merge and keep in one place: /* * The combination of parallel streaming mode and min_apply_delay is not * allowed. */ if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && opts->min_apply_delay > 0) { if (opts->streaming == LOGICALREP_STREAM_PARALLEL) ereport(ERROR, errcode(ERRCODE_SYNTAX_ERROR), errmsg("%s and %s are mutually exclusive options", "min_apply_delay > 0", "streaming = parallel")); } if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { /* * The combination of parallel streaming mode and * min_apply_delay is not allowed. */ if (opts.min_apply_delay > 0) if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == LOGICALREP_STREAM_PARALLEL) || (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)) ereport(ERROR, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot enable %s for subscription in %s mode", "min_apply_delay", "streaming = parallel")); values[Anum_pg_subscription_subminapplydelay - 1] = Int64GetDatum(opts.min_apply_delay); replaces[Anum_pg_subscription_subminapplydelay - 1] = true; } 4) typo "execeeds" should be "exceeds" + time on the subscriber. Any overhead of time spent in logical decoding + and in transferring the transaction may reduce the actual wait time. + It is also possible that the overhead already execeeds the requested + <literal>min_apply_delay</literal> value, in which case no additional + wait is necessary. If the system clocks on publisher and subscriber + are not synchronized, this may lead to apply changes earlier than Regards, Vignesh
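One possible shape of the consolidation suggested in point 3 above, assuming a small helper shared by the CREATE and ALTER paths (the helper name is invented here, not part of the posted patch), is:

    /* Hypothetical helper -- not part of the posted patch. */
    static void
    check_streaming_and_min_apply_delay(char streaming, int64 min_apply_delay)
    {
        if (streaming == LOGICALREP_STREAM_PARALLEL && min_apply_delay > 0)
            ereport(ERROR,
                    errcode(ERRCODE_SYNTAX_ERROR),
                    errmsg("%s and %s are mutually exclusive options",
                           "min_apply_delay > 0", "streaming = parallel"));
    }

CREATE SUBSCRIPTION would pass the parsed options directly, while ALTER SUBSCRIPTION would pass the effective values (the newly specified option if present, otherwise the existing catalog value), so both error paths could collapse into one place.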
On Thu, Jan 19, 2023 at 4:25 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 19 Jan 2023 at 12:06, Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > Updated the comment and the function call. > > > > Kindly have a look at the updated patch v17. > > Thanks for the updated patch, few comments: > 1) min_apply_delay was accepting values like '600 m s h', I was not > sure if we should allow this: > alter subscription sub1 set (min_apply_delay = ' 600 m s h'); > I think here we should have specs similar to recovery_min_apply_delay. > > 2) How about adding current_txn_wait_time in > pg_stat_subscription_stats, we can update the current_txn_wait_time > periodically, this will help the user to check approximately how much > time is left(min_apply_delay - stat value) before this transaction > will be applied in the subscription. If you agree this can be 0002 > patch. > Do we have any similar stats for recovery_min_apply_delay? If not, I suggest let's postpone this to see if users really need such a parameter. -- With Regards, Amit Kapila.
On Thu, Jan 19, 2023 at 12:06 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > Kindly have a look at the updated patch v17. > Can we try to optimize the test time for this test? On my machine, it is the second highest time-consuming test in src/test/subscription. It seems you are waiting twice for apply_delay and both are for streaming cases by varying the number of changes. I think it should be just once and that too for the non-streaming case. I think it would be good to test streaming code path interaction but not sure if it is important enough to have two test cases for apply_delay. One minor comment that I observed while going through the patch. + /* + * The combination of parallel streaming mode and min_apply_delay is not + * allowed. + */ + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0) I think it would be good if you can specify the reason for not allowing this combination in the comments. -- With Regards, Amit Kapila.
On Thu, 19 Jan 2023 at 18:29, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 4:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, 19 Jan 2023 at 12:06, Takamichi Osumi (Fujitsu) > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > Updated the comment and the function call. > > > > > > Kindly have a look at the updated patch v17. > > > > Thanks for the updated patch, few comments: > > 1) min_apply_delay was accepting values like '600 m s h', I was not > > sure if we should allow this: > > alter subscription sub1 set (min_apply_delay = ' 600 m s h'); > > > > I think here we should have specs similar to recovery_min_apply_delay. > > > > > 2) How about adding current_txn_wait_time in > > pg_stat_subscription_stats, we can update the current_txn_wait_time > > periodically, this will help the user to check approximately how much > > time is left(min_apply_delay - stat value) before this transaction > > will be applied in the subscription. If you agree this can be 0002 > > patch. > > > > Do we have any similar stats for recovery_min_apply_delay? If not, I > suggest let's postpone this to see if users really need such a > parameter. I did not find any statistics for recovery_min_apply_delay, ok it can be delayed to a later time. Regards, Vignesh
On Thu, Jan 19, 2023 at 12:42 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Wednesday, January 18, 2023 4:06 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for the latest patch v16-0001. (excluding the > > test code) > Hi, thank you for your review ! > > > ====== > > > > General > > > > 1. > > > > Since the value of min_apply_delay cannot be < 0, I was thinking probably it > > should have been declared everywhere in this patch as a > > uint64 instead of an int64, right? > No, we won't be able to adopt this idea. > > It seems that we are not able to use uint for catalog type. > So, can't applying it to the pg_subscription.h definitions > and then similarly Int64GetDatum to store catalog variables > and the argument variable of Int64GetDatum. > > Plus, there is a possibility that type Interval becomes negative value, > then we are not able to change the int64 variable to get > the return value of interval2ms(). > > > ====== > > > > Commit message > > > > 2. > > > > If the subscription sets min_apply_delay parameter, the logical replication > > worker will delay the transaction commit for min_apply_delay milliseconds. > > > > ~ > > > > IMO there should be another sentence before this just to say that a new > > parameter is being added: > > > > e.g. > > This patch implements a new subscription parameter called > > 'min_apply_delay'. > Added. > > > > ====== > > > > doc/src/sgml/config.sgml > > > > 3. > > > > + <para> > > + For time-delayed logical replication, the apply worker sends a Standby > > + Status Update message to the corresponding publisher per the > > indicated > > + time of this parameter. Therefore, if this parameter is longer than > > + <literal>wal_sender_timeout</literal> on the publisher, then the > > + walsender doesn't get any update message during the delay and > > repeatedly > > + terminates due to the timeout errors. Hence, make sure this parameter > > is > > + shorter than the <literal>wal_sender_timeout</literal> of the > > publisher. > > + If this parameter is set to zero with time-delayed replication, the > > + apply worker doesn't send any feedback messages during the > > + <literal>min_apply_delay</literal>. > > + </para> > > > > > > This paragraph seemed confusing. I think it needs to be reworded to change all > > of the "this parameter" references because there are at least 3 different > > parameters mentioned in this paragraph. e.g. maybe just change them to > > explicitly name the parameter you are talking about. > > > > I also think it needs to mention the ‘min_apply_delay’ subscription parameter > > up-front and then refer to it appropriately. > > > > The end result might be something like I wrote below (this is just my guess ? > > probably you can word it better). > > > > SUGGESTION > > For time-delayed logical replication (i.e. when the subscription is created with > > parameter min_apply_delay > 0), the apply worker sends a Standby Status > > Update message to the publisher with a period of wal_receiver_status_interval . > > Make sure to set wal_receiver_status_interval less than the > > wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly > > terminate due to the timeout errors. If wal_receiver_status_interval is set to zero, > > the apply worker doesn't send any feedback messages during the subscriber’s > > min_apply_delay period. > Applied. Also, I added one reference for min_apply_delay parameter > at the end of this description. 
> > > > ====== > > > > doc/src/sgml/ref/create_subscription.sgml > > > > 4. > > > > + <para> > > + By default, the subscriber applies changes as soon as possible. As > > + with the physical replication feature > > + (<xref linkend="guc-recovery-min-apply-delay"/>), it can be > > useful to > > + have a time-delayed logical replica. This parameter lets the user to > > + delay the application of changes by a specified amount of > > time. If this > > + value is specified without units, it is taken as milliseconds. The > > + default is zero(no delay). > > + </para> > > > > 4a. > > As with the physical replication feature (recovery_min_apply_delay), it can be > > useful to have a time-delayed logical replica. > > > > IMO not sure that the above sentence is necessary. It seems only to be saying > > that this parameter can be useful. Why do we need to say that? > Removed the sentence. > > > > ~ > > > > 4b. > > "This parameter lets the user to delay" -> "This parameter lets the user delay" > > OR > > "This parameter lets the user to delay" -> "This parameter allows the user to > > delay" > Fixed. > > > > ~ > > > > 4c. > > "If this value is specified without units" -> "If the value is specified without > > units" > Fixed. > > > ~ > > > > 4d. > > "zero(no delay)." -> "zero (no delay)." > Fixed. > > > ---- > > > > 5. > > > > + <para> > > + The delay occurs only on WAL records for transaction begins and > > after > > + the initial table synchronization. It is possible that the > > + replication delay between publisher and subscriber exceeds the > > value > > + of this parameter, in which case no delay is added. Note that the > > + delay is calculated between the WAL time stamp as written on > > + publisher and the current time on the subscriber. Time > > spent in logical > > + decoding and in transferring the transaction may reduce the > > actual wait > > + time. If the system clocks on publisher and subscriber are not > > + synchronized, this may lead to apply changes earlier than > > expected, > > + but this is not a major issue because this parameter is > > typically much > > + larger than the time deviations between servers. Note that if this > > + parameter is set to a long delay, the replication will stop if the > > + replication slot falls behind the current LSN by more than > > + <link > > linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</ > > literal></link>. > > + </para> > > > > I think the first part can be reworded slightly. See what you think about the > > suggestion below. > > > > SUGGESTION > > Any delay occurs only on WAL records for transaction begins after all initial > > table synchronization has finished. The delay is calculated between the WAL > > timestamp as written on the publisher and the current time on the subscriber. > > Any overhead of time spent in logical decoding and in transferring the > > transaction may reduce the actual wait time. > > It is also possible that the overhead already exceeds the requested > > 'min_apply_delay' value, in which case no additional wait is necessary. If the > > system clocks... > Addressed. > > > > ---- > > > > 6. > > > > + <para> > > + Setting streaming to <literal>parallel</literal> mode and > > <literal>min_apply_delay</literal> > > + simultaneously is not supported. > > + </para> > > > > SUGGESTION > > A non-zero min_apply_delay parameter is not allowed when streaming in > > parallel mode. > Applied. > > > > ====== > > > > src/backend/commands/subscriptioncmds.c > > > > 7. 
parse_subscription_options > > > > @@ -404,6 +445,17 @@ parse_subscription_options(ParseState *pstate, List > > *stmt_options, > > "slot_name = NONE", "create_slot = false"))); > > } > > } > > + > > + /* Test the combination of streaming mode and min_apply_delay */ if > > + (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > > + opts->min_apply_delay > 0) > > + { > > + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) > > ereport(ERROR, > > + errcode(ERRCODE_SYNTAX_ERROR), errmsg("%s and %s are mutually > > + exclusive options", > > + "min_apply_delay > 0", "streaming = parallel")); } > > > > SUGGESTION (comment) > > The combination of parallel streaming mode and min_apply_delay is not > > allowed. > Fixed. > > > > ~~~ > > > > 8. AlterSubscription (general) > > > > I observed during testing there are 3 different errors…. > > > > At subscription CREATE time you can get this error: > > ERROR: min_apply_delay > 0 and streaming = parallel are mutually exclusive > > options > > > > If you try to ALTER the min_apply_delay when already streaming = parallel you > > can get this error: > > ERROR: cannot enable min_apply_delay for subscription in streaming = > > parallel mode > > > > If you try to ALTER the streaming to be parallel if there is already a > > min_apply_delay > 0 then you can get this error: > > ERROR: cannot enable streaming = parallel mode for subscription with > > min_apply_delay > Yes. This is because the existing error message styles > in AlterSubscription and parse_subscription_options. > > The former uses "mutually exclusive" messages consistently, > while the latter does "cannot enable ..." ones. > > ~ > > > > IMO there is no need to have 3 different error message texts. I think all these > > cases are explained by just the first text (ERROR: > > min_apply_delay > 0 and streaming = parallel are mutually exclusive > > options) > Then, we followed this kind of formats. > > > > ~~~ > > > > 9. AlterSubscription > > > > @@ -1098,6 +1152,18 @@ AlterSubscription(ParseState *pstate, > > AlterSubscriptionStmt *stmt, > > > > if (IsSet(opts.specified_opts, SUBOPT_STREAMING)) > > { > > + /* > > + * Test the combination of streaming mode and > > + * min_apply_delay > > + */ > > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if > > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > opts.min_apply_delay > 0) || > > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > sub->minapplydelay > 0)) > > + ereport(ERROR, > > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot enable %s mode for subscription with %s", > > + "streaming = parallel", "min_apply_delay")); > > + > > > > 9a. > > SUGGESTION (comment) > > The combination of parallel streaming mode and min_apply_delay is not > > allowed. > Fixed. > > > > ~ > > > > 9b. > > (see AlterSubscription general review comment #8 above) Here you can use the > > same comment error message that says min_apply_delay > 0 and streaming = > > parallel are mutually exclusive options. > As described above, we followed the current style in the existing functions. > > > > ~~~ > > > > 10. 
AlterSubscription > > > > @@ -1111,6 +1177,25 @@ AlterSubscription(ParseState *pstate, > > AlterSubscriptionStmt *stmt, > > = true; > > } > > > > + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { > > + /* > > + * Test the combination of streaming mode and > > + * min_apply_delay > > + */ > > + if (opts.min_apply_delay > 0) > > + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming > > == LOGICALREP_STREAM_PARALLEL) || > > + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > > LOGICALREP_STREAM_PARALLEL)) > > + ereport(ERROR, > > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot enable %s for subscription in %s mode", > > + "min_apply_delay", "streaming = parallel")); > > + > > + values[Anum_pg_subscription_subminapplydelay - 1] = > > + Int64GetDatum(opts.min_apply_delay); > > + replaces[Anum_pg_subscription_subminapplydelay - 1] = true; } > > > > 10a. > > SUGGESTION (comment) > > The combination of parallel streaming mode and min_apply_delay is not > > allowed. > Fixed. > > > > ~ > > > > 10b. > > (see AlterSubscription general review comment #8 above) Here you can use the > > same comment error message that says min_apply_delay > 0 and streaming = > > parallel are mutually exclusive options. > Same as 9b. > > > ====== > > > > .../replication/logical/applyparallelworker.c > > > > 11. > > > > @@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void) > > { > > apply_spooled_messages(&MyParallelShared->fileset, > > MyParallelShared->xid, > > - InvalidXLogRecPtr); > > + InvalidXLogRecPtr, > > + 0); > > > > IMO this passing of 0 is a bit strange because it is currently acting like a dummy > > value since the apply_spooled_messages will never make use of the 'finish_ts' > > anyway (since this call is from a parallel apply worker). > > > > I think a better way to code this might be to pass the 0 (same as you are doing > > here) but inside the apply_spooled_messages change the code: > > > > FROM > > if (!am_parallel_apply_worker()) > > maybe_delay_apply(finish_ts); > > > > TO > > if (finish_ts) > > maybe_delay_apply(finish_ts); > > > > That does 2 things. > > - It makes the passed-in 0 have some meaning > > - It simplifies the apply_spooled_messages code > Adopted. > > > > ====== > > > > src/backend/replication/logical/worker.c > > > > 12. > > > > @@ -318,6 +318,17 @@ static List *on_commit_wakeup_workers_subids = > > NIL; bool in_remote_transaction = false; static XLogRecPtr > > remote_final_lsn = InvalidXLogRecPtr; > > > > +/* > > + * In order to avoid walsender's timeout during time-delayed > > +replication, > > + * it's necessary to keep sending feedback messages during the delay > > +from the > > + * worker process. Meanwhile, the feature delays the apply before > > +starting the > > + * transaction and thus we don't write WALs for the suspended changes > > +during > > + * the wait. Hence, in the case the worker process sends a feedback > > +message > > + * during the delay, we should not make positions of the flushed and > > +apply LSN > > + * overwritten by the last received latest LSN. See send_feedback() > > for details. > > + */ > > +static XLogRecPtr last_received = InvalidXLogRecPtr; > > > > 12a. > > Suggest a small change to the first sentence of the comment. > > > > BEFORE > > In order to avoid walsender's timeout during time-delayed replication, it's > > necessary to keep sending feedback messages during the delay from the > > worker process. 
> > > > AFTER > > In order to avoid walsender timeout for time-delayed replication the worker > > process keeps sending feedback messages during the delay period. > Fixed. > > > > ~ > > > > 12b. > > "Hence, in the case" -> "When" > Fixed. > > > > ~~~ > > > > 13. forward declare > > > > -static void send_feedback(XLogRecPtr recvpos, bool force, bool > > requestReply); > > +static void send_feedback(XLogRecPtr recvpos, bool force, bool > > requestReply, > > + bool in_delaying_apply); > > > > Change the param name: > > > > "in_delaying_apply" -> "in_delayed_apply” (??) > Changed. The initial intention to append the "in_" > prefix is to make the variable name aligned with > some other variables such as "in_remote_transaction" and > "in_streamed_transaction" that mean the current status > for the transaction. So, until there is a better name proposed, > we can keep it. > > > > ~~~ > > > > 14. maybe_delay_apply > > > > + /* Nothing to do if no delay set */ > > + if (MySubscription->minapplydelay <= 0) return; > > > > IIUC min_apply_delay cannot be < 0 so this condition could simply be: > > > > if (!MySubscription->minapplydelay) > > return; > Fixed. > > > > ~~~ > > > > 15. maybe_delay_apply > > > > + /* > > + * The min_apply_delay parameter is ignored until all tablesync workers > > + * have reached READY state. If we allow the delay during the catchup > > + * phase, once we reach the limit of tablesync workers, it will impose > > + a > > + * delay for each subsequent worker. It means it will take a long time > > + to > > + * finish the initial table synchronization. > > + */ > > + if (!AllTablesyncsReady()) > > + return; > > > > SUGGESTION (slight rewording) > > The min_apply_delay parameter is ignored until all tablesync workers have > > reached READY state. This is because if we allowed the delay during the > > catchup phase, then once we reached the limit of tablesync workers it would > > impose a delay for each subsequent worker. That would cause initial table > > synchronization completion to take a long time. > Fixed. > > > > ~~~ > > > > 16. maybe_delay_apply > > > > + while (true) > > + { > > + long diffms; > > + > > + ResetLatch(MyLatch); > > + > > + CHECK_FOR_INTERRUPTS(); > > > > IMO there should be some small explanatory comment here at the top of the > > while loop. > Added. > > > > ~~~ > > > > 17. apply_spooled_messages > > > > @@ -2024,6 +2141,21 @@ apply_spooled_messages(FileSet *stream_fileset, > > TransactionId xid, > > int fileno; > > off_t offset; > > > > + /* > > + * Should we delay the current transaction? > > + * > > + * Unlike the regular (non-streamed) cases, the delay is applied in a > > + * STREAM COMMIT/STREAM PREPARE message for streamed transactions. > > The > > + * STREAM START message does not contain a commit/prepare time (it will > > + be > > + * available when the in-progress transaction finishes). Hence, it's > > + not > > + * appropriate to apply a delay at that time. > > + * > > + * It's not allowed to execute time-delayed replication with parallel > > + * apply feature. > > + */ > > + if (!am_parallel_apply_worker()) > > + maybe_delay_apply(finish_ts); > > > > That whole comment part "Unlike the regular (non-streamed) cases" > > seems misplaced here. Perhaps this part of the comment is better put into > > the function header where the meaning of 'finish_ts' is explained? > Moved it to the header comment for maybe_delay_apply. > > > > ~~~ > > > > 18. 
apply_spooled_messages > > > > + * It's not allowed to execute time-delayed replication with parallel > > + * apply feature. > > + */ > > + if (!am_parallel_apply_worker()) > > + maybe_delay_apply(finish_ts); > > > > As was mentioned in comment #11 above this code could be changed like > > > > if (finish_ts) > > maybe_delay_apply(finish_ts); > > then you don't even need to make mention of "parallel apply" at all here. > > > > OTOH if you want to still have the parallel apply comment then maybe reword it > > like this: > > "It is not allowed to combine time-delayed replication with the parallel apply > > feature." > Changed and now I don't mention the parallel apply feature. > > > ~~~ > > > > 19. apply_spooled_messages > > > > If you chose not to do my suggestion from comment #11, then there are > > 2 identical conditions (!am_parallel_apply_worker()); In this case, I was > > wondering if it would be better to refactor to use a single condition instead. > I applied #11 comment. Now, the conditions are not identical. > > > ~~~ > > > > 20. send_feedback > > (same as comment #13) > > > > Maybe change the new param name to “in_delayed_apply”? > Changed. > > > > ~~~ > > > > 21. > > > > @@ -3737,8 +3869,15 @@ send_feedback(XLogRecPtr recvpos, bool force, > > bool requestReply) > > /* > > * No outstanding transactions to flush, we can report the latest received > > * position. This is important for synchronous replication. > > + * > > + * During the delay of time-delayed replication, do not tell the > > + publisher > > + * that the received latest LSN is already applied and flushed at this > > + * stage, since we don't apply the transaction yet. If we do so, it > > + leads > > + * to a wrong assumption of logical replication progress on the > > + publisher > > + * side. Here, we just send a feedback message to avoid publisher's > > + * timeout during the delay. > > */ > > > > Minor rewording of the comment > > > > SUGGESTION > > If the subscriber side apply is delayed (because of time-delayed > > replication) then do not tell the publisher that the received latest LSN is already > > applied and flushed, otherwise, it leads to the publisher side making a wrong > > assumption of logical replication progress. Instead, we just send a feedback > > message to avoid a publisher timeout during the delay. > Adopted. > > > > ====== > > > > > > src/bin/pg_dump/pg_dump.c > > > > 22. > > > > @@ -4546,9 +4547,14 @@ getSubscriptions(Archive *fout) > > LOGICALREP_TWOPHASE_STATE_DISABLED); > > > > if (fout->remoteVersion >= 160000) > > - appendPQExpBufferStr(query, " s.suborigin\n"); > > + appendPQExpBufferStr(query, > > + " s.suborigin,\n" > > + " s.subminapplydelay\n"); > > else > > - appendPQExpBuffer(query, " '%s' AS suborigin\n", > > LOGICALREP_ORIGIN_ANY); > > + { > > + appendPQExpBuffer(query, " '%s' AS suborigin,\n", > > + LOGICALREP_ORIGIN_ANY); appendPQExpBufferStr(query, " 0 AS > > + subminapplydelay\n"); } > > > > Can’t those appends in the else part can be combined to a single > > appendPQExpBuffer > > > > appendPQExpBuffer(query, > > " '%s' AS suborigin,\n" > > " 0 AS subminapplydelay\n" > > LOGICALREP_ORIGIN_ANY); > Adopted. > > > > ====== > > > > src/include/catalog/pg_subscription.h > > > > 23. 
> > > > @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) > > BKI_SHARED_RELATION BKI_ROW > > XLogRecPtr subskiplsn; /* All changes finished at this LSN are > > * skipped */ > > > > + int64 subminapplydelay; /* Replication apply delay */ > > + > > NameData subname; /* Name of the subscription */ > > > > Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ > > > > SUGGESTION (for comment) > > Replication apply delay (ms) > Fixed. > > > ~~ > > > > 24. > > > > @@ -120,6 +122,7 @@ typedef struct Subscription > > * in */ > > XLogRecPtr skiplsn; /* All changes finished at this LSN are > > * skipped */ > > + int64 minapplydelay; /* Replication apply delay */ > > > > SUGGESTION (for comment) > > Replication apply delay (ms) > Fixed. > > > Kindly have a look at the latest v17 patch in [1]. > > > [1] - https://www.postgresql.org/message-id/TYCPR01MB8373F5162C7A0E6224670CF0EDC49%40TYCPR01MB8373.jpnprd01.prod.outlook.com > > Best Regards, > Takamichi Osumi > 1) Tried different variations of altering 'min_apply_delay'. All passed except one below: postgres=# alter subscription mysubnew set (min_apply_delay = '10.9min 1ms'); ALTER SUBSCRIPTION postgres=# alter subscription mysubnew set (min_apply_delay = '10.9min 2s 1ms'); ALTER SUBSCRIPTION --very similar to above but fails, postgres=# alter subscription mysubnew set (min_apply_delay = '10.9s 1ms'); ERROR: invalid input syntax for type interval: "10.9s 1ms" 2) Logging: 2023-01-19 17:33:16.202 IST [404797] DEBUG: logical replication apply delay: 19979 ms 2023-01-19 17:33:26.212 IST [404797] DEBUG: logical replication apply delay: 9969 ms 2023-01-19 17:34:25.730 IST [404962] DEBUG: logical replication apply delay: 179988 ms-->previous wait over, started for next txn 2023-01-19 17:34:35.737 IST [404962] DEBUG: logical replication apply delay: 169981 ms 2023-01-19 17:34:45.746 IST [404962] DEBUG: logical replication apply delay: 159972 ms Is there a way to distinguish between these logs? Maybe dumping xids along-with? thanks Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, Horiguchi-san and Amit-san On Wednesday, November 9, 2022 3:41 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Using interval is not standard as this kind of parameters but it seems > convenient. On the other hand, it's not great that the unit month introduces > some subtle ambiguity. This patch translates a month to 30 days but I'm not > sure it's the right thing to do. Perhaps we shouldn't allow the units upper than > days. In the past discussion, we talked about the merits of utilizing the interval type. On the other hand, now we are facing some incompatibility issues of parsing between this time-delayed feature and physical replication's recovery_min_apply_delay. For instance, the interval type can accept '600 m s h', '1d 10min' and '1m', but recovery_min_apply_delay makes the server fail to start with all of those. Therefore, this would confuse users and I'm going to make the feature's input compatible with recovery_min_apply_delay in the next version. Best Regards, Takamichi Osumi
Hi Osumi-san, here are my review comments for the latest patch v17-0001. ====== Commit Message 1. Prohibit the combination of this feature and parallel streaming mode. SUGGESTION (using the same wording as in the code comments) The combination of parallel streaming mode and min_apply_delay is not allowed. ====== doc/src/sgml/ref/create_subscription.sgml 2. + <para> + By default, the subscriber applies changes as soon as possible. This + parameter allows the user to delay the application of changes by a + specified amount of time. If the value is specified without units, it + is taken as milliseconds. The default is zero (no delay). + </para> Looking at this again, it seemed a bit strange to repeat "specified" twice in 2 sentences. Maybe change one of them. I’ve also suggested using the word "interval" because I don’t think docs yet mentioned anywhere (except in the example) that using intervals is possible. SUGGESTION (for the 2nd sentence) This parameter allows the user to delay the application of changes by a given time interval. ~~~ 3. + <para> + Any delay occurs only on WAL records for transaction begins after all + initial table synchronization has finished. The delay is calculated + between the WAL timestamp as written on the publisher and the current + time on the subscriber. Any overhead of time spent in logical decoding + and in transferring the transaction may reduce the actual wait time. + It is also possible that the overhead already execeeds the requested + <literal>min_apply_delay</literal> value, in which case no additional + wait is necessary. If the system clocks on publisher and subscriber + are not synchronized, this may lead to apply changes earlier than + expected, but this is not a major issue because this parameter is + typically much larger than the time deviations between servers. Note + that if this parameter is set to a long delay, the replication will + stop if the replication slot falls behind the current LSN by more than + <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>. + </para> 3a. Typo "execeeds" (I think Vignesh reported this already) ~ 3b. SUGGESTION (for the 2nd sentence) BEFORE The delay is calculated between the WAL timestamp... AFTER The delay is calculated as the difference between the WAL timestamp... ~~~ 4. + <warning> + <para> + Delaying the replication can mean there is a much longer time between making + a change on the publisher, and that change being committed on the subscriber. + v + See <xref linkend="guc-synchronous-commit"/>. + </para> + </warning> IMO maybe there is a better way to express the 2nd sentence: BEFORE This can have a big impact on synchronous replication. AFTER This can impact the performance of synchronous replication. ====== src/backend/commands/subscriptioncmds.c 5. parse_subscription_options @@ -324,6 +328,43 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->specified_opts |= SUBOPT_LSN; opts->lsn = lsn; } + else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + strcmp(defel->defname, "min_apply_delay") == 0) + { + char *val, + *tmp; + Interval *interval; + int64 ms; IMO 'delay_ms' (or similar) would be a friendlier variable name than just 'ms' ~~~ 6. @@ -404,6 +445,20 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, "slot_name = NONE", "create_slot = false"))); } } + + /* + * The combination of parallel streaming mode and min_apply_delay is not + * allowed. 
+ */ + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0) + { + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s and %s are mutually exclusive options", + "min_apply_delay > 0", "streaming = parallel")); + } This could be expressed as a single condition using &&, maybe also with the brackets eliminated. (Unless you feel the current code is more readable) ~~~ 7. + if (opts.min_apply_delay > 0) + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == LOGICALREP_STREAM_PARALLEL) || + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable %s for subscription in %s mode", + "min_apply_delay", "streaming = parallel")); These nested ifs could instead be a single "if" with && condition. (Unless you feel the current code is more readable) ====== src/backend/replication/logical/worker.c 8. maybe_delay_apply + * Hence, it's not appropriate to apply a delay at the time. + */ +static void +maybe_delay_apply(TimestampTz finish_ts) That last sentence "Hence,... delay at the time" does not sound correct. Is there a typo or missing words here? Maybe it meant to say "... at the STREAM START time."? ~~~ 9. + /* This might change wal_receiver_status_interval */ + if (ConfigReloadPending) + { + ConfigReloadPending = false; + ProcessConfigFile(PGC_SIGHUP); + } I was unsure why did you make a special mention of 'wal_receiver_status_interval' here. I mean, Aren't there also other GUCs that might change and affect something here so was there some special reason only this one was mentioned? ====== src/test/subscription/t/032_apply_delay.pl 10. + +# Compare inserted time on the publisher with applied time on the subscriber to +# confirm the latter is applied after expected time. +sub check_apply_delay_time Maybe the comment could also mention that the time is automatically stored in the table column 'c'. ~~~ 11. +# Confirm the suspended record doesn't get applied expectedly by the ALTER +# DISABLE command. +$result = $node_subscriber->safe_psql('postgres', + "SELECT count(a) FROM test_tab WHERE a = 0;"); +is($result, qq(0), "check if the delayed transaction doesn't get applied expectedly"); The use of "doesn't get applied expectedly" (in 2 places here) seemed strange. Maybe it's better to say like SUGGESTION # Confirm disabling the subscription by ALTER DISABLE did not cause the delayed transaction to be applied. $result = $node_subscriber->safe_psql('postgres', "SELECT count(a) FROM test_tab WHERE a = 0;"); is($result, qq(0), "check the delayed transaction was not applied"); ------ Kind Regards, Peter Smith. Fujitsu Australia
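For reference, here is a rough sketch of the shape of the delayed-apply wait loop that the review comments above (#8, #9, and the earlier #14-#16) are discussing. This is not the patch itself: names such as maybe_delay_apply, finish_ts, xid, MySubscription->minapplydelay and the extra send_feedback() argument are taken from fragments quoted in this thread, and the exact structure is an assumption.

/*
 * Rough sketch (not the actual patch) of the delayed-apply wait loop.
 * The worker sleeps on its latch for the remaining delay, but never longer
 * than wal_receiver_status_interval, so that it keeps sending feedback and
 * can react to configuration reloads.
 */
while (true)
{
    long        diffms;
    TimestampTz delay_until;

    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();

    /* A reload may change wal_receiver_status_interval or min_apply_delay. */
    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);
    }

    /* Remaining wait: publisher commit time plus the configured delay. */
    delay_until = TimestampTzPlusMilliseconds(finish_ts,
                                              MySubscription->minapplydelay);
    diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delay_until);
    if (diffms <= 0)
        break;

    elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %lld ms, remaining wait time: %ld ms",
         xid, (long long) MySubscription->minapplydelay, diffms);

    /* Sleep, but wake up often enough to avoid a walsender timeout. */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     Min(diffms, wal_receiver_status_interval * 1000L),
                     WAIT_EVENT_RECOVERY_APPLY_DELAY);

    /* Keep the publisher informed so it does not time out during the delay. */
    send_feedback(last_received, true, false, true);
}

With this shape, a change to min_apply_delay followed by a reload takes effect on the next wakeup, which matches the mid-wait adjustment behaviour discussed later in this thread.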
On Fri, Jan 20, 2023 at 2:47 PM shveta malik <shveta.malik@gmail.com> wrote: > ... > 2) > Logging: > 2023-01-19 17:33:16.202 IST [404797] DEBUG: logical replication apply > delay: 19979 ms > 2023-01-19 17:33:26.212 IST [404797] DEBUG: logical replication apply > delay: 9969 ms > 2023-01-19 17:34:25.730 IST [404962] DEBUG: logical replication apply > delay: 179988 ms-->previous wait over, started for next txn > 2023-01-19 17:34:35.737 IST [404962] DEBUG: logical replication apply > delay: 169981 ms > 2023-01-19 17:34:45.746 IST [404962] DEBUG: logical replication apply > delay: 159972 ms > > Is there a way to distinguish between these logs? Maybe dumping xids along-with? > +1 Also, I was thinking of some other logging enhancements a) the message should say that this is the *remaining* time to left to wait. b) it might be convenient to know from the log what was the original min_apply_delay value in the 1st place. For example, the logs might look something like this: DEBUG: time-delayed replication for txid 1234, min_apply_delay = 160000 ms. Remaining wait time: 159972 ms DEBUG: time-delayed replication for txid 1234, min_apply_delay = 160000 ms. Remaining wait time: 142828 ms DEBUG: time-delayed replication for txid 1234, min_apply_delay = 160000 ms. Remaining wait time: 129994 ms DEBUG: time-delayed replication for txid 1234, min_apply_delay = 160000 ms. Remaining wait time: 110001 ms ... ------ Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Jan 20, 2023 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > a) the message should say that this is the *remaining* time to left to wait. > > b) it might be convenient to know from the log what was the original > min_apply_delay value in the 1st place. > > For example, the logs might look something like this: > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > 160000 ms. Remaining wait time: 159972 ms > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > 160000 ms. Remaining wait time: 142828 ms > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > 160000 ms. Remaining wait time: 129994 ms > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > 160000 ms. Remaining wait time: 110001 ms > ... > +1 This will also help when min_apply_delay is set to a new value in between the current wait. Lets say, I started with min_apply_delay=5 min, when the worker was half way through this, I changed min_apply_delay to 3 min or say 10min, I see the impact of that change i.e. new wait-time is adjusted, but log becomes confusing. So, please keep this scenario as well in mind while improving logging. thanks Shveta
On Fri, Jan 20, 2023 at 2:23 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Jan 20, 2023 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > a) the message should say that this is the *remaining* time to left to wait. > > > > b) it might be convenient to know from the log what was the original > > min_apply_delay value in the 1st place. > > > > For example, the logs might look something like this: > > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 159972 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 142828 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 129994 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 110001 ms > > ... > > > > +1 > This will also help when min_apply_delay is set to a new value in > between the current wait. Lets say, I started with min_apply_delay=5 > min, when the worker was half way through this, I changed > min_apply_delay to 3 min or say 10min, I see the impact of that change > i.e. new wait-time is adjusted, but log becomes confusing. So, please > keep this scenario as well in mind while improving logging. > when we send-feedback during apply-delay after every wal_receiver_status_interval , the log comes as: 023-01-19 17:12:56.000 IST [404795] DEBUG: sending feedback (force 1) to recv 0/1570840, write 0/1570840, flush 0/1570840 Shall we have some info here to indicate that it is sent while waiting for apply_delay to distinguish it from other such send-feedback logs? It will make apply_delay flow clear in logs. thanks Shveta
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Friday, January 20, 2023 3:56 PM Peter Smith <smithpb2250@gmail.com> wrote: > Hi Osumi-san, here are my review comments for the latest patch v17-0001. Thanks for your review ! > ====== > Commit Message > > 1. > Prohibit the combination of this feature and parallel streaming mode. > > SUGGESTION (using the same wording as in the code comments) The > combination of parallel streaming mode and min_apply_delay is not allowed. Okay. Fixed. > ====== > doc/src/sgml/ref/create_subscription.sgml > > 2. > + <para> > + By default, the subscriber applies changes as soon as possible. > This > + parameter allows the user to delay the application of changes by a > + specified amount of time. If the value is specified without units, it > + is taken as milliseconds. The default is zero (no delay). > + </para> > > Looking at this again, it seemed a bit strange to repeat "specified" > twice in 2 sentences. Maybe change one of them. > > I’ve also suggested using the word "interval" because I don’t think docs yet > mentioned anywhere (except in the example) that using intervals is possible. > > SUGGESTION (for the 2nd sentence) > This parameter allows the user to delay the application of changes by a given > time interval. Adopted. > ~~~ > > 3. > + <para> > + Any delay occurs only on WAL records for transaction begins after > all > + initial table synchronization has finished. The delay is calculated > + between the WAL timestamp as written on the publisher and the > current > + time on the subscriber. Any overhead of time spent in > logical decoding > + and in transferring the transaction may reduce the actual wait time. > + It is also possible that the overhead already execeeds the > requested > + <literal>min_apply_delay</literal> value, in which case no > additional > + wait is necessary. If the system clocks on publisher and subscriber > + are not synchronized, this may lead to apply changes earlier than > + expected, but this is not a major issue because this parameter is > + typically much larger than the time deviations between servers. > Note > + that if this parameter is set to a long delay, the replication will > + stop if the replication slot falls behind the current LSN > by more than > + <link > linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</ > literal></link>. > + </para> > > 3a. > Typo "execeeds" (I think Vignesh reported this already) Fixed. > ~ > > 3b. > SUGGESTION (for the 2nd sentence) > BEFORE > The delay is calculated between the WAL timestamp... > AFTER > The delay is calculated as the difference between the WAL timestamp... Fixed. > ~~~ > > 4. > + <warning> > + <para> > + Delaying the replication can mean there is a much longer > time between making > + a change on the publisher, and that change being > committed on the subscriber. > + v > + See <xref linkend="guc-synchronous-commit"/>. > + </para> > + </warning> > > IMO maybe there is a better way to express the 2nd sentence: > > BEFORE > This can have a big impact on synchronous replication. > AFTER > This can impact the performance of synchronous replication. Fixed. > ====== > src/backend/commands/subscriptioncmds.c > > 5. 
parse_subscription_options > > @@ -324,6 +328,43 @@ parse_subscription_options(ParseState *pstate, List > *stmt_options, > opts->specified_opts |= SUBOPT_LSN; > opts->lsn = lsn; > } > + else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + strcmp(defel->defname, "min_apply_delay") == 0) { > + char *val, > + *tmp; > + Interval *interval; > + int64 ms; > > IMO 'delay_ms' (or similar) would be a friendlier variable name than just 'ms' The variable name has been changed which is more clear to the feature. > ~~~ > > 6. > @@ -404,6 +445,20 @@ parse_subscription_options(ParseState *pstate, List > *stmt_options, > "slot_name = NONE", "create_slot = false"))); > } > } > + > + /* > + * The combination of parallel streaming mode and min_apply_delay is > + not > + * allowed. > + */ > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0) > + { > + if (opts->streaming == LOGICALREP_STREAM_PARALLEL) > ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), errmsg("%s and %s are mutually > + exclusive options", > + "min_apply_delay > 0", "streaming = parallel")); } > > This could be expressed as a single condition using &&, maybe also with the > brackets eliminated. (Unless you feel the current code is more readable) The current style is intentional. We feel the code is more readable. > ~~~ > > 7. > > + if (opts.min_apply_delay > 0) > + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming > == LOGICALREP_STREAM_PARALLEL) || > + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > LOGICALREP_STREAM_PARALLEL)) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot enable %s for subscription in %s mode", > + "min_apply_delay", "streaming = parallel")); > > These nested ifs could instead be a single "if" with && condition. > (Unless you feel the current code is more readable) Same as #6. > ====== > src/backend/replication/logical/worker.c > > 8. maybe_delay_apply > > + * Hence, it's not appropriate to apply a delay at the time. > + */ > +static void > +maybe_delay_apply(TimestampTz finish_ts) > > That last sentence "Hence,... delay at the time" does not sound correct. Is there > a typo or missing words here? > > Maybe it meant to say "... at the STREAM START time."? Yes. Fixed. > ~~~ > > 9. > + /* This might change wal_receiver_status_interval */ if > + (ConfigReloadPending) { ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); } > > I was unsure why did you make a special mention of > 'wal_receiver_status_interval' here. I mean, Aren't there also other GUCs that > might change and affect something here so was there some special reason only > this one was mentioned? This should be similar to the recoveryApplyDelay for physical replication. It mentions the GUC used in the same function. > ====== > src/test/subscription/t/032_apply_delay.pl > > 10. > + > +# Compare inserted time on the publisher with applied time on the > +subscriber to # confirm the latter is applied after expected time. > +sub check_apply_delay_time > > Maybe the comment could also mention that the time is automatically stored in > the table column 'c'. Added. > ~~~ > > 11. > +# Confirm the suspended record doesn't get applied expectedly by the > +ALTER # DISABLE command. 
> +$result = $node_subscriber->safe_psql('postgres', > + "SELECT count(a) FROM test_tab WHERE a = 0;"); is($result, qq(0), > +"check if the delayed transaction doesn't get > applied expectedly"); > > The use of "doesn't get applied expectedly" (in 2 places here) seemed strange. > Maybe it's better to say like > > SUGGESTION > # Confirm disabling the subscription by ALTER DISABLE did not cause the > delayed transaction to be applied. > $result = $node_subscriber->safe_psql('postgres', > "SELECT count(a) FROM test_tab WHERE a = 0;"); is($result, qq(0), "check > the delayed transaction was not applied"); Fixed. Kindly have a look at the patch v18. Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Friday, January 20, 2023 6:13 PM shveta malik <shveta.malik@gmail.com> wrote: > On Fri, Jan 20, 2023 at 2:23 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Jan 20, 2023 at 1:08 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > > a) the message should say that this is the *remaining* time to left to wait. > > > > > > b) it might be convenient to know from the log what was the original > > > min_apply_delay value in the 1st place. > > > > > > For example, the logs might look something like this: > > > > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > > 160000 ms. Remaining wait time: 159972 ms > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > > 160000 ms. Remaining wait time: 142828 ms > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > > 160000 ms. Remaining wait time: 129994 ms > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > > 160000 ms. Remaining wait time: 110001 ms ... > > > > > > > +1 > > This will also help when min_apply_delay is set to a new value in > > between the current wait. Lets say, I started with min_apply_delay=5 > > min, when the worker was half way through this, I changed > > min_apply_delay to 3 min or say 10min, I see the impact of that change > > i.e. new wait-time is adjusted, but log becomes confusing. So, please > > keep this scenario as well in mind while improving logging. > > > > > when we send-feedback during apply-delay after every > wal_receiver_status_interval , the log comes as: > 023-01-19 17:12:56.000 IST [404795] DEBUG: sending feedback (force 1) to > recv 0/1570840, write 0/1570840, flush 0/1570840 > > Shall we have some info here to indicate that it is sent while waiting for > apply_delay to distinguish it from other such send-feedback logs? > It will > make apply_delay flow clear in logs. This additional tip of log information has been added in the latest v18. Kindly have a look at it in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373BED9E390C4839AF56685EDC59%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Friday, January 20, 2023 5:54 PM shveta malik <shveta.malik@gmail.com> wrote: > On Fri, Jan 20, 2023 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > a) the message should say that this is the *remaining* time to left to wait. > > > > b) it might be convenient to know from the log what was the original > > min_apply_delay value in the 1st place. > > > > For example, the logs might look something like this: > > > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 159972 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 142828 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 129994 ms > > DEBUG: time-delayed replication for txid 1234, min_apply_delay = > > 160000 ms. Remaining wait time: 110001 ms ... > > > > +1 > This will also help when min_apply_delay is set to a new value in between the > current wait. Lets say, I started with min_apply_delay=5 min, when the worker > was half way through this, I changed min_apply_delay to 3 min or say 10min, I > see the impact of that change i.e. new wait-time is adjusted, but log becomes > confusing. So, please keep this scenario as well in mind while improving > logging. Yes, now the change of min_apply_delay value can be detected since I followed the format provided above. So, this scenario is also covered. Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Friday, January 20, 2023 12:47 PM shveta malik <shveta.malik@gmail.com> wrote: > 1) > Tried different variations of altering 'min_apply_delay'. All passed except one > below: > > postgres=# alter subscription mysubnew set (min_apply_delay = '10.9min > 1ms'); ALTER SUBSCRIPTION postgres=# alter subscription mysubnew set > (min_apply_delay = '10.9min 2s 1ms'); ALTER SUBSCRIPTION --very similar to > above but fails, postgres=# alter subscription mysubnew set > (min_apply_delay = '10.9s 1ms'); > ERROR: invalid input syntax for type interval: "10.9s 1ms" FYI, this was because the interval type couldn't accept this format. But now we changed the input format from interval to integer aligned with recovery_min_apply_delay. Thus, we don't face this issue now. Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Thursday, January 19, 2023 10:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 19, 2023 at 12:06 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > Kindly have a look at the updated patch v17. > > > > Can we try to optimize the test time for this test? On my machine, it is the > second highest time-consuming test in src/test/subscription. It seems you are > waiting twice for apply_delay and both are for streaming cases by varying the > number of changes. I think it should be just once and that too for the > non-streaming case. I think it would be good to test streaming code path > interaction but not sure if it is important enough to have two test cases for > apply_delay. The first insert test is for non-streaming case and we need both cases for coverage. Regarding the time of test, conducted some optimization such as turning off the initial table sync, shortening the time of wait, and so on. > > One minor comment that I observed while going through the patch. > + /* > + * The combination of parallel streaming mode and min_apply_delay is > + not > + * allowed. > + */ > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0) > > I think it would be good if you can specify the reason for not allowing this > combination in the comments. Added. Please have a look at the latest v18 patch in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373BED9E390C4839AF56685EDC59%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Thursday, January 19, 2023 7:55 PM vignesh C <vignesh21@gmail.com> wrote: > On Thu, 19 Jan 2023 at 12:06, Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > Updated the comment and the function call. > > > > Kindly have a look at the updated patch v17. > > Thanks for the updated patch, few comments: > 1) min_apply_delay was accepting values like '600 m s h', I was not sure if we > should allow this: > alter subscription sub1 set (min_apply_delay = ' 600 m s h'); > > + /* > + * If no unit was specified, then explicitly > add 'ms' otherwise > + * the interval_in function would assume 'seconds'. > + */ > + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) > + val = psprintf("%sms", tmp); > + else > + val = tmp; > + > + interval = > DatumGetIntervalP(DirectFunctionCall3(interval_in, > + > > CStringGetDatum(val), > + > > ObjectIdGetDatum(InvalidOid), > + > Int32GetDatum(-1))); > FYI, the input can be accepted by the interval type. Now we changed the direction of the type from interval to integer but plus some unit can be added like recovery_min_apply_delay. Please check. > 3) There is one check at parse_subscription_options and another check in > AlterSubscription, this looks like a redundant check in case of alter > subscription, can we try to merge and keep in one place: > /* > * The combination of parallel streaming mode and min_apply_delay is not > * allowed. > */ > if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > opts->min_apply_delay > 0) > { > if (opts->streaming == LOGICALREP_STREAM_PARALLEL) ereport(ERROR, > errcode(ERRCODE_SYNTAX_ERROR), errmsg("%s and %s are mutually > exclusive options", > "min_apply_delay > 0", "streaming = parallel")); } > > if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { > /* > * The combination of parallel streaming mode and > * min_apply_delay is not allowed. > */ > if (opts.min_apply_delay > 0) > if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == > LOGICALREP_STREAM_PARALLEL) || > (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > LOGICALREP_STREAM_PARALLEL)) > ereport(ERROR, > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("cannot enable %s for subscription in %s mode", > "min_apply_delay", "streaming = parallel")); > > values[Anum_pg_subscription_subminapplydelay - 1] = > Int64GetDatum(opts.min_apply_delay); > replaces[Anum_pg_subscription_subminapplydelay - 1] = true; } We can't. For create subscription, we need to check the patch from parse_subscription_options, while for alter subscription, we need to refer the current MySubscription value for those tests in AlterSubscription. > 4) typo "execeeds" should be "exceeds" > > + time on the subscriber. Any overhead of time spent in > logical decoding > + and in transferring the transaction may reduce the actual wait time. > + It is also possible that the overhead already execeeds the > requested > + <literal>min_apply_delay</literal> value, in which case no > additional > + wait is necessary. If the system clocks on publisher and subscriber > + are not synchronized, this may lead to apply changes earlier > + than Fixed. Kindly have a look at the v18 patch in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373BED9E390C4839AF56685EDC59%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Saturday, January 21, 2023 3:36 AM I wrote: > Kindly have a look at the patch v18. I've conducted some refactoring for v18. Now the latest patch should be tidier and the comments would be clearer and more aligned as a whole. Attached the updated patch v19. Best Regards, Takamichi Osumi
Here are my review comments for v19-0001. ====== Commit message 1. The combination of parallel streaming mode and min_apply_delay is not allowed. The subscriber in the parallel streaming mode applies each stream on arrival without the time of commit/prepare. So, the subscriber needs to depend on the arrival time of the stream in this case, if we apply the time-delayed feature for such transactions. Then there is a possibility where some unnecessary delay will be added on the subscriber by network communication break between nodes or other heavy work load on the publisher. On the other hand, applying the delay at the end of transaction with parallel apply also can cause issues of used resource bloat and locks kept in open for a long time. Thus, those features can't work together. ~ I think the above is just cut/paste from a code comment within subscriptioncmds.c. See review comments #5 below -- so if the code is changed then this commit message should also change to match it. ====== doc/src/sgml/ref/create_subscription.sgml 2. + <varlistentry> + <term><literal>min_apply_delay</literal> (<type>integer</type>)</term> + <listitem> + <para> + By default, the subscriber applies changes as soon as possible. This + parameter allows the user to delay the application of changes by a + given time interval. If the value is specified without units, it is + taken as milliseconds. The default is zero (no delay). + </para> 2a. The pgdocs says this is an integer default to “ms” unit. Also, the example on this same page shows it is set to '4h'. But I did not see any mention of what other units are available to the user. Maybe other time units should be mentioned here, or maybe a link should be given to the section “20.1.1. Parameter Names and Values". ~ 2b. Previously the word "interval" was deliberately used because this parameter had interval support. But maybe now it should be changed so it is not misleading. "a given time interval" --> "a given time period" ?? ====== src/backend/commands/subscriptioncmds.c 3. Forward declare +static int defGetMinApplyDelay(DefElem *def); If the new function is implemented as static near the top of this source file then this forward declare would not even be necessary, right? ~~~ 4. parse_subscription_options @@ -324,6 +328,12 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->specified_opts |= SUBOPT_LSN; opts->lsn = lsn; } + else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + strcmp(defel->defname, "min_apply_delay") == 0) + { + opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY; + opts->min_apply_delay = defGetMinApplyDelay(defel); + } Should this code fragment be calling errorConflictingDefElem so it will report an error if the same min_apply_delay parameter is redundantly repeated? (IIUC, this appears to be the code pattern for other parameters nearby). ~~~ 5. parse_subscription_options + /* + * The combination of parallel streaming mode and min_apply_delay is not + * allowed. The subscriber in the parallel streaming mode applies each + * stream on arrival without the time of commit/prepare. So, the + * subscriber needs to depend on the arrival time of the stream in this + * case, if we apply the time-delayed feature for such transactions. Then + * there is a possibility where some unnecessary delay will be added on + * the subscriber by network communication break between nodes or other + * heavy work load on the publisher. 
On the other hand, applying the delay + * at the end of transaction with parallel apply also can cause issues of + * used resource bloat and locks kept in open for a long time. Thus, those + * features can't work together. + */ IMO some re-wording might be warranted here. I am not sure quite how to do it. Perhaps like below? SUGGESTION The combination of parallel streaming mode and min_apply_delay is not allowed. Here are some reasons why these features are incompatible: a. In the parallel streaming mode the subscriber applies each stream on arrival without knowledge of the commit/prepare time. This means we cannot calculate the underlying network/decoding lag between publisher and subscriber, and so always waiting for the full 'min_apply_delay' period might include unnecessary delay. b. If we apply the delay at the end of the transaction of the parallel apply then that would cause issues related to resource bloat and locks being held for a long time. ~~~ 6. defGetMinApplyDelay + + +/* + * Extract the min_apply_delay mode value from a DefElem. This is very similar + * to PGC_INT case of parse_and_validate_value(), because min_apply_delay + * accepts the same string as recovery_min_apply_delay. + */ +int +defGetMinApplyDelay(DefElem *def) 6a. "same string" -> "same parameter format" ?? ~ 6b. I thought this function should be implemented as static and located at the top of the subscriptioncmds.c source file. ====== src/backend/replication/logical/worker.c 7. maybe_delay_apply +static void maybe_delay_apply(TransactionId xid, TimestampTz finish_ts); Is there a reason why this is here? AFAIK the static implementation precedes any usage so I doubt this forward declaration is required. ~~~ 8. send_feedback @@ -3775,11 +3912,12 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply) pq_sendint64(reply_message, now); /* sendTime */ pq_sendbyte(reply_message, requestReply); /* replyRequested */ - elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X", + elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X in-delayed: %d", force, LSN_FORMAT_ARGS(recvpos), LSN_FORMAT_ARGS(writepos), - LSN_FORMAT_ARGS(flushpos)); + LSN_FORMAT_ARGS(flushpos), + in_delayed_apply); Wondering if it is better to write this as: "sending feedback (force %d, in_delayed_apply %d) to recv %X/%X, write %X/%X, flush %X/%X" ====== src/test/regress/sql/subscription.sql 9. Add new test? Should there be an additional test to check redundant parameter setting -- eg. "... WITH (min_apply_delay=123, min_apply_delay=456)" (this is related to the review comment #4) ~ 10. Add new tests? Should there be other tests just to verify different units (like 'd', 'h', 'min') are working OK? ====== src/test/subscription/t/032_apply_delay.pl 11. +# Confirm the time-delayed replication has been effective from the server log +# message where the apply worker emits for applying delay. Moreover, verifies +# that the current worker's delayed time is sufficiently bigger than the +# expected value, in order to check any update of the min_apply_delay. +sub check_apply_delay_log "the current worker's delayed time..." --> "the current worker's remaining wait time..." ?? ~~~ 12. + # Get the delay time from the server log + my $contents = slurp_file($node_subscriber->logfile, $offset); "Get the delay time...." --> "Get the remaining wait time..." ~~~ 13. 
+# Create a subscription that applies the trasaction after 50 milliseconds delay +$node_subscriber->safe_psql('postgres', + "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (copy_data = off, min_apply_delay = '50ms', streaming = 'on')" +); 13a. typo: "trasaction" ~ 13b 50ms seems an extremely short time – How do you even know if this is testing anything related to the time delay? You may just be detecting the normal lag between publisher and subscriber without time delay having much to do with anything. ~ 14. +# Note that we cannot call check_apply_delay_log() here because there is a +# possibility that the delay is skipped. The event happens when the WAL +# replication between publisher and subscriber is delayed due to a mechanical +# problem. The log output will be checked later - substantial delay-time case. + +# Verify that the subscriber lags the publisher by at least 50 milliseconds +check_apply_delay_time($node_publisher, $node_subscriber, '2', '0.05'); 14a. "The event happens..." ?? Did you mean "This might happen if the WAL..." ~ 14b. The log output will be checked later - substantial delay-time case. I think that needs re-wording to clarify. e.g1. you have nothing called a "substantial delay-time" case. e.g2. the word "later" confused me. Originally, I thought you meant it is not tested yet but that you will check it "later", but now IIUC you are just referring to the "1 day 5 minutes" test that comes below in this location TAP file (??) ------ Kind Regards, Peter Smith. Fujitsu Australia
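Regarding comment #6 above: the thread describes defGetMinApplyDelay() as accepting the same value format as recovery_min_apply_delay, i.e. the PGC_INT path of parse_and_validate_value(). As a hedged illustration only (the actual patch code may differ), that behaviour could be obtained with the existing parse_int() GUC helper:

/*
 * Minimal sketch, not the actual patch: extract a min_apply_delay value from
 * a DefElem the same way a PGC_INT GUC with GUC_UNIT_MS is parsed, so that
 * "5000", "5s" and "1h" are all accepted and converted to milliseconds.
 */
static int
defGetMinApplyDelay(DefElem *def)
{
    char       *input = defGetString(def);
    int         result;
    const char *hintmsg;

    if (!parse_int(input, &result, GUC_UNIT_MS, &hintmsg))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for parameter \"%s\": \"%s\"",
                        "min_apply_delay", input),
                 hintmsg ? errhint("%s", _(hintmsg)) : 0));

    if (result < 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("%d ms is outside the valid range for parameter \"%s\"",
                        result, "min_apply_delay")));

    return result;
}

With such parsing, a value like min_apply_delay = '3h' would be stored as 10800000 ms, consistent with the catalog column being defined in milliseconds.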
On Mon, Jan 23, 2023 at 1:36 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for v19-0001. > ... > > 5. parse_subscription_options > > + /* > + * The combination of parallel streaming mode and min_apply_delay is not > + * allowed. The subscriber in the parallel streaming mode applies each > + * stream on arrival without the time of commit/prepare. So, the > + * subscriber needs to depend on the arrival time of the stream in this > + * case, if we apply the time-delayed feature for such transactions. Then > + * there is a possibility where some unnecessary delay will be added on > + * the subscriber by network communication break between nodes or other > + * heavy work load on the publisher. On the other hand, applying the delay > + * at the end of transaction with parallel apply also can cause issues of > + * used resource bloat and locks kept in open for a long time. Thus, those > + * features can't work together. > + */ > > IMO some re-wording might be warranted here. I am not sure quite how > to do it. Perhaps like below? > > SUGGESTION > > The combination of parallel streaming mode and min_apply_delay is not allowed. > > Here are some reasons why these features are incompatible: > a. In the parallel streaming mode the subscriber applies each stream > on arrival without knowledge of the commit/prepare time. This means we > cannot calculate the underlying network/decoding lag between publisher > and subscriber, and so always waiting for the full 'min_apply_delay' > period might include unnecessary delay. > b. If we apply the delay at the end of the transaction of the parallel > apply then that would cause issues related to resource bloat and locks > being held for a long time. > > ~~~ > How about something like: The combination of parallel streaming mode and min_apply_delay is not allowed. This is because we start applying the transaction stream as soon as the first change arrives without knowing the transaction's prepare/commit time. This means we cannot calculate the underlying network/decoding lag between publisher and subscriber, and so always waiting for the full 'min_apply_delay' period might include unnecessary delay. The other possibility is to apply the delay at the end of the parallel apply transaction but that would cause issues related to resource bloat and locks being held for a long time. > 6. defGetMinApplyDelay > > + > + > +/* > + * Extract the min_apply_delay mode value from a DefElem. This is very similar > + * to PGC_INT case of parse_and_validate_value(), because min_apply_delay > + * accepts the same string as recovery_min_apply_delay. > + */ > +int > +defGetMinApplyDelay(DefElem *def) > > 6a. > "same string" -> "same parameter format" ?? > > ~ > > 6b. > I thought this function should be implemented as static and located at > the top of the subscriptioncmds.c source file. > I agree that this should be a static function but I think its current location is a better place as other similar function is just above it. > > ====== > src/test/regress/sql/subscription.sql > > 9. Add new test? > > Should there be an additional test to check redundant parameter > setting -- eg. "... WITH (min_apply_delay=123, min_apply_delay=456)" > I don't think that will be of much help. We don't seem to have other tests for subscription parameters. -- With Regards, Amit Kapila.
On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > > Attached the updated patch v19. > Few comments: ============= 1. } + + +/* Only one empty line is sufficient between different functions. 2. + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s and %s are mutually exclusive options", + "min_apply_delay > 0", "streaming = parallel")); } I think here we should add a comment for the translator as we are doing in some other nearby cases. 3. + /* + * The combination of parallel streaming mode and + * min_apply_delay is not allowed. + */ + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) || + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable %s mode for subscription with %s", + "streaming = parallel", "min_apply_delay")); + A. When can second condition ((!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) in above check be true? B. In comments, you can say "See parse_subscription_options." 4. +/* + * When min_apply_delay parameter is set on the subscriber, we wait long enough + * to make sure a transaction is applied at least that interval behind the + * publisher. Shouldn't this part of the comment needs to be updated after the patch has stopped using interval? 5. How does this feature interacts with the SKIP feature? Currently, it doesn't care whether the changes of a particular xact are skipped or not. I think that might be okay because anyway the purpose of this feature is to make subscriber lag from publishers. What do you think? I feel we can add some comments to indicate the same. -- With Regards, Amit Kapila.
On Mon, Jan 23, 2023 at 9:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 23, 2023 at 1:36 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are my review comments for v19-0001. > > > ... > > > > 5. parse_subscription_options > > > > + /* > > + * The combination of parallel streaming mode and min_apply_delay is not > > + * allowed. The subscriber in the parallel streaming mode applies each > > + * stream on arrival without the time of commit/prepare. So, the > > + * subscriber needs to depend on the arrival time of the stream in this > > + * case, if we apply the time-delayed feature for such transactions. Then > > + * there is a possibility where some unnecessary delay will be added on > > + * the subscriber by network communication break between nodes or other > > + * heavy work load on the publisher. On the other hand, applying the delay > > + * at the end of transaction with parallel apply also can cause issues of > > + * used resource bloat and locks kept in open for a long time. Thus, those > > + * features can't work together. > > + */ > > > > IMO some re-wording might be warranted here. I am not sure quite how > > to do it. Perhaps like below? > > > > SUGGESTION > > > > The combination of parallel streaming mode and min_apply_delay is not allowed. > > > > Here are some reasons why these features are incompatible: > > a. In the parallel streaming mode the subscriber applies each stream > > on arrival without knowledge of the commit/prepare time. This means we > > cannot calculate the underlying network/decoding lag between publisher > > and subscriber, and so always waiting for the full 'min_apply_delay' > > period might include unnecessary delay. > > b. If we apply the delay at the end of the transaction of the parallel > > apply then that would cause issues related to resource bloat and locks > > being held for a long time. > > > > ~~~ > > > > How about something like: > The combination of parallel streaming mode and min_apply_delay is not > allowed. This is because we start applying the transaction stream as > soon as the first change arrives without knowing the transaction's > prepare/commit time. This means we cannot calculate the underlying > network/decoding lag between publisher and subscriber, and so always > waiting for the full 'min_apply_delay' period might include > unnecessary delay. > > The other possibility is to apply the delay at the end of the parallel > apply transaction but that would cause issues related to resource > bloat and locks being held for a long time. > +1. That's better. > > > 6. defGetMinApplyDelay > > ... > > > > 6b. > > I thought this function should be implemented as static and located at > > the top of the subscriptioncmds.c source file. > > > > I agree that this should be a static function but I think its current > location is a better place as other similar function is just above it. > But, why not do everything, instead of settling on a half-fix? e.g. 1. Change the new function (defGetMinApplyDelay) to be static as it should be 2. And move defGetMinApplyDelay to the top of the file where IMO it really belongs 3. And then remove the (now) redundant forward declaration of defGetMinApplyDelay 4. And also move the existing function (defGetStreamingMode) to the top of the file so that those similar functions (defGetMinApplyDelay and defGetStreamingMode) can remain together ------ Kind Regards, Peter Smith. Fujitsu Australia
On Sun, Jan 22, 2023, at 9:42 AM, Takamichi Osumi (Fujitsu) wrote:
On Saturday, January 21, 2023 3:36 AM I wrote:
> Kindly have a look at the patch v18.
I've conducted some refactoring for v18. Now the latest patch should be tidier and the comments would be clearer and more aligned as a whole. Attached the updated patch v19.
[I haven't been following this thread for a long time...]
Good to know that you keep improving this patch. I have a few suggestions that
were easier to provide as a patch on top of your latest patch than as inline
suggestions.
There are a few documentation polishing changes. Let me comment on some of them below.
- The length of time (ms) to delay the application of changes.
+ Total time spent delaying the application of changes, in milliseconds
I don't remember if I suggested this description for the catalog, but IMO the
suggestion reads better to me.
- For time-delayed logical replication (i.e. when the subscription is
- created with parameter min_apply_delay > 0), the apply worker sends a
- Standby Status Update message to the publisher with a period of
- <literal>wal_receiver_status_interval</literal>. Make sure to set
- <literal>wal_receiver_status_interval</literal> less than the
- <literal>wal_sender_timeout</literal> on the publisher, otherwise, the
- walsender will repeatedly terminate due to the timeout errors. If
- <literal>wal_receiver_status_interval</literal> is set to zero, the apply
- worker doesn't send any feedback messages during the subscriber's
- <literal>min_apply_delay</literal> period. See
- <xref linkend="sql-createsubscription"/> for details.
+ For time-delayed logical replication, the apply worker sends a feedback
+ message to the publisher every
+ <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+ to set <varname>wal_receiver_status_interval</varname> less than the
+ <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+ <literal>walsender</literal> will repeatedly terminate due to timeout
+ error. If <varname>wal_receiver_status_interval</varname> is set to
+ zero, the apply worker doesn't send any feedback messages during the
+ <literal>min_apply_delay</literal> interval.
I removed the parenthetical explanation about time-delayed logical replication.
If you are reading the documentation and do not know what it means, you should
(a) read the logical replication chapter or (b) check the glossary (maybe a new
entry should be added). I also removed the mention of the Standby Status Update
message, but it is a low-level detail; let's refer to it as a feedback message
as the other sentences do. I changed "literal" to "varname", which is the
correct tag for parameters. I replaced "period" with "interval", which was the
previous terminology. IMO we should be uniform and use one or the other.
- The subscriber replication can be instructed to lag behind the publisher
- side changes by specifying the <literal>min_apply_delay</literal>
- subscription parameter. See <xref linkend="sql-createsubscription"/> for
- details.
+ A logical replication subscription can delay the application of changes by
+ specifying the <literal>min_apply_delay</literal> subscription parameter.
+ See <xref linkend="sql-createsubscription"/> for details.
This feature refers to a specific subscription, hence, "logical replication
subscription" instead of "subscriber replication".
+ if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+ errorConflictingDefElem(defel, pstate);
+
Peter S referred to this missing piece of code too.
-int
+static int
defGetMinApplyDelay(DefElem *def)
{
It seems you forgot the static keyword.
- elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %lld ms, Remaining wait time: %ld ms",
- xid, (long long) MySubscription->minapplydelay, diffms);
+ elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = " INT64_FORMAT " ms, remaining wait time: %ld ms",
+ xid, MySubscription->minapplydelay, diffms);
int64 should use format modifier INT64_FORMAT.
- (long) wal_receiver_status_interval * 1000,
+ wal_receiver_status_interval * 1000L,
Cast is not required. I added a suffix to the constant.
- elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X in-delayed: %d",
+ elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X, apply delay: %s",
force,
LSN_FORMAT_ARGS(recvpos),
LSN_FORMAT_ARGS(writepos),
LSN_FORMAT_ARGS(flushpos),
- in_delayed_apply);
+ in_delayed_apply? "yes" : "no");
It is better to use a string to represent the yes/no option.
- gettext_noop("Min apply delay (ms)"));
+ gettext_noop("Min apply delay"));
I don't know if it was discussed but we don't add units to headers. When I
think about this parameter representation (internal and external), I decided to
use the previous code because it provides a unit for external representation. I
understand that using the same representation as recovery_min_apply_delay is
good but the current code does not handle the external representation
accordingly. (recovery_min_apply_delay uses the GUC machinery to add the unit
but for min_apply_delay, it doesn't).
# Setup for streaming case
-$node_publisher->append_conf('postgres.conf',
+$node_publisher->append_conf('postgresql.conf',
'logical_decoding_mode = immediate');
$node_publisher->reload;
Fix configuration file name.
Maybe tests should do a better job. I think check_apply_delay_time is fragile
because it does not guarantee that time is not shifted. Time-delayed
replication is a subscriber feature and to check its correctness it should
check the logs.
# Note that we cannot call check_apply_delay_log() here because there is a
# possibility that the delay is skipped. The event happens when the WAL
# replication between publisher and subscriber is delayed due to a mechanical
# problem. The log output will be checked later - substantial delay-time case.
If you are not going to use the logs for it, shouldn't the test adjust the min_apply_delay instead?
It does not exercise the min_apply_delay vs parallel streaming mode.
+ /*
+ * The combination of parallel streaming mode and
+ * min_apply_delay is not allowed.
+ */
+ if (opts.streaming == LOGICALREP_STREAM_PARALLEL)
+ if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) ||
+ (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot enable %s mode for subscription with %s",
+ "streaming = parallel", "min_apply_delay"));
+
Is this code correct? I also didn't like this message. "cannot enable streaming
= parallel mode for subscription with min_apply_delay" is far from a good error
message. How about referring to parallelism as "parallel streaming mode"?
At Mon, 23 Jan 2023 17:36:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > > > Attached the updated patch v19. > Few comments: > 2. > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL) > + ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("%s and %s are mutually exclusive options", > + "min_apply_delay > 0", "streaming = parallel")); > } > > I think here we should add a comment for the translator as we are > doing in some other nearby cases. IMHO "foo > bar" is not an "option". I think we say "foo and bar are mutually exclusive options" but I think don't say "foo = x and bar = y are.. options". I wrote a comment as "this should be more like human-speaking" and Euler seems having the same feeling for another error message. Concretely I would spell this as "min_apply_delay cannot be enabled when parallel streaming mode is enabled" or something. And the opposite-direction message nearby would be "parallel streaming mode cannot be enabled when min_apply_delay is enabled." regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> Attached the updated patch v19. + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) I look this spelling strange. How about maybe_apply_delay()? send_feedback(): + * If the subscriber side apply is delayed (because of time-delayed + * replication) then do not tell the publisher that the received latest + * LSN is already applied and flushed, otherwise, it leads to the + * publisher side making a wrong assumption of logical replication + * progress. Instead, we just send a feedback message to avoid a publisher + * timeout during the delay. */ - if (!have_pending_txes) + if (!have_pending_txes && !in_delayed_apply) flushpos = writepos = recvpos; Honestly I don't like this wart. The reason for this is the function assumes recvpos = applypos but we actually call it while holding unapplied changes, that is, applypos < recvpos. Couldn't we maintain an additional static variable "last_applied" along with last_received? In this case the condition cited above would be as follows and in_delayed_apply will become unnecessary. + if (!have_pending_txes && last_received == last_applied) The function is a static function and always called with a variable last_received that has the same scope with the function, as the first parameter. Thus we can remove the first parameter then let the function directly look at the both two varaibles instead. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
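As a small illustration of the suggestion above (assumed names only, not the actual patch): if a static last_applied is maintained next to last_received, the decision in send_feedback() can be derived from those two positions and the in_delayed_apply flag becomes unnecessary.

/*
 * Sketch of the proposed condition: only report the latest received LSN as
 * written/flushed when everything received has actually been applied.
 * During a min_apply_delay wait, last_applied lags last_received, so the
 * previously reported positions are kept and the feedback message merely
 * keeps the walsender alive.
 */
static XLogRecPtr last_received = InvalidXLogRecPtr;
static XLogRecPtr last_applied = InvalidXLogRecPtr;

static void
advance_feedback_positions(bool have_pending_txes,
                           XLogRecPtr *writepos, XLogRecPtr *flushpos)
{
    if (!have_pending_txes && last_received == last_applied)
        *writepos = *flushpos = last_received;
}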
Sorry, I forgot to write one comment. At Tue, 24 Jan 2023 11:45:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in + /* Should we delay the current transaction? */ + if (finish_ts) + maybe_delay_apply(xid, finish_ts); + if (!am_parallel_apply_worker()) maybe_start_skipping_changes(lsn); It may not give actual advantages, but isn't it better that delay happens after skipping? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 24, 2023 at 3:46 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Jan 23, 2023 at 9:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > 6. defGetMinApplyDelay > > > > ... > > > > > > 6b. > > > I thought this function should be implemented as static and located at > > > the top of the subscriptioncmds.c source file. > > > > > > > I agree that this should be a static function but I think its current > > location is a better place as other similar function is just above it. > > > > But, why not do everything, instead of settling on a half-fix? > > e.g. > 1. Change the new function (defGetMinApplyDelay) to be static as it should be > 2. And move defGetMinApplyDelay to the top of the file where IMO it > really belongs > 3. And then remove the (now) redundant forward declaration of > defGetMinApplyDelay > 4. And also move the existing function (defGetStreamingMode) to the > top of the file so that those similar functions (defGetMinApplyDelay > and defGetStreamingMode) can remain together > There are various other static functions (merge_publications, check_duplicates_in_publist, etc.) which then also needs similar change. BTW, I don't think we have a policy to always define static functions before their usage. So, I don't see the need to do anything in this matter. -- With Regards, Amit Kapila.
On Tue, Jan 24, 2023 at 5:02 AM Euler Taveira <euler@eulerto.com> wrote: > > On Sun, Jan 22, 2023, at 9:42 AM, Takamichi Osumi (Fujitsu) wrote: > > > Attached the updated patch v19. > > [I haven't been following this thread for a long time...] > > Good to know that you keep improving this patch. I have a few suggestions that > were easier to provide a patch on top of your latest patch than to provide an > inline suggestions. > Euler, thanks for your comments. We have an existing problem related to shutdown which impacts this patch. The problem is that during shutdown on the publisher, we wait for all the WAL to be sent and flushed on the subscriber. Now, if we user has configured a long value for min_apply_delay on the subscriber then the shutdown won't be successful. This can happen even today if the subscriber waits for some lock during the apply. This is not so much a problem with physical replication because there we have a separate process to first flush the WAL. This problem has been discussed in a separate thread as well. See [1]. It is important to reach conclusion even if we just want to document it. So, your thoughts on that other thread can help us to make it move forward. [1] - https://www.postgresql.org/message-id/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Tue, Jan 24, 2023 at 6:17 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 23 Jan 2023 17:36:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu) > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > > > Attached the updated patch v19. > > Few comments: > > 2. > > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > > + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL) > > + ereport(ERROR, > > + errcode(ERRCODE_SYNTAX_ERROR), > > + errmsg("%s and %s are mutually exclusive options", > > + "min_apply_delay > 0", "streaming = parallel")); > > } > > > > I think here we should add a comment for the translator as we are > > doing in some other nearby cases. > > IMHO "foo > bar" is not an "option". I think we say "foo and bar are > mutually exclusive options" but I think don't say "foo = x and bar = y > are.. options". I wrote a comment as "this should be more like > human-speaking" and Euler seems having the same feeling for another > error message. > > Concretely I would spell this as "min_apply_delay cannot be enabled > when parallel streaming mode is enabled" or something. > We can change it but the current message seems to be in line with some nearby messages like "slot_name = NONE and enabled = true are mutually exclusive options". So, isn't it better to keep this as one in sync with existing messages? -- With Regards, Amit Kapila.
On Tue, Jan 24, 2023 at 8:35 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Sorry, I forgot to write one comment. > > At Tue, 24 Jan 2023 11:45:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > + /* Should we delay the current transaction? */ > + if (finish_ts) > + maybe_delay_apply(xid, finish_ts); > + > if (!am_parallel_apply_worker()) > maybe_start_skipping_changes(lsn); > > It may not give actual advantages, but isn't it better that delay > happens after skipping? > If we go with the order you are suggesting then the LOGs will appear as follows when we are skipping the transaction: "logical replication starts skipping transaction at LSN ..." "time-delayed replication for txid %u, min_apply_delay = %lld ms, Remaining wait time: ..." Personally, I would prefer the above LOGs to be in reverse order as it doesn't make much sense to me to first say that we are skipping changes and then say the transaction is delayed. What do you think? -- With Regards, Amit Kapila.
On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > Attached the updated patch v19. > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > I look this spelling strange. How about maybe_apply_delay()? > +1. > > send_feedback(): > + * If the subscriber side apply is delayed (because of time-delayed > + * replication) then do not tell the publisher that the received latest > + * LSN is already applied and flushed, otherwise, it leads to the > + * publisher side making a wrong assumption of logical replication > + * progress. Instead, we just send a feedback message to avoid a publisher > + * timeout during the delay. > */ > - if (!have_pending_txes) > + if (!have_pending_txes && !in_delayed_apply) > flushpos = writepos = recvpos; > > Honestly I don't like this wart. The reason for this is the function > assumes recvpos = applypos but we actually call it while holding > unapplied changes, that is, applypos < recvpos. > > Couldn't we maintain an additional static variable "last_applied" > along with last_received? > It won't be easy to maintain the meaning of last_applied because there are cases where we don't apply the change directly. For example, in case of streaming xacts, we will just keep writing it to the file, now, say, due to some reason, we have to send the feedback, then it will not allow you to update the latest write locations. This would then become different then what we are doing without the patch. Another point to think about is that we also need to keep the variable updated for keep-alive ('k') messages even though we don't apply anything in that case. Still, other cases to consider are where we have mix of streaming and non-streaming transactions. > In this case the condition cited above > would be as follows and in_delayed_apply will become unnecessary. > > + if (!have_pending_txes && last_received == last_applied) > > The function is a static function and always called with a variable > last_received that has the same scope with the function, as the first > parameter. Thus we can remove the first parameter then let the > function directly look at the both two varaibles instead. > I think this is true without this patch, so why that has not been followed in the first place? One comment, I see in this regard is as below: /* It's legal to not pass a recvpos */ if (recvpos < last_recvpos) recvpos = last_recvpos; -- With Regards, Amit Kapila.
On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > > Attached the updated patch v19. > > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > > > I look this spelling strange. How about maybe_apply_delay()? > > > > +1. It depends on how you read it. I read it like this: maybe_delay_apply === means "maybe delay [the] apply" (which is exactly what the function does) versus maybe_apply_delay === means "maybe [the] apply [needs a] delay" (which is also correct, but it seemed a more awkward way to say it IMO) ~ Perhaps it's better to rename it more fully like *maybe_delay_the_apply* to remove any ambiguous interpretations. ------ Kind Regards, Peter Smith. Fujitsu Australia
At Tue, 24 Jan 2023 11:28:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Jan 24, 2023 at 6:17 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > IMHO "foo > bar" is not an "option". I think we say "foo and bar are > > mutually exclusive options" but I think don't say "foo = x and bar = y > > are.. options". I wrote a comment as "this should be more like > > human-speaking" and Euler seems having the same feeling for another > > error message. > > > > Concretely I would spell this as "min_apply_delay cannot be enabled > > when parallel streaming mode is enabled" or something. > > > > We can change it but the current message seems to be in line with some > nearby messages like "slot_name = NONE and enabled = true are mutually > exclusive options". So, isn't it better to keep this as one in sync > with existing messages? Ooo. subscriptioncmds.c is full of such messages. Okay I agree that it is better to leave it as is.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 24, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > > Attached the updated patch v19. > > > > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > > > > > I look this spelling strange. How about maybe_apply_delay()? > > > > > > > +1. > > It depends on how you read it. I read it like this: > > maybe_delay_apply === means "maybe delay [the] apply" > (which is exactly what the function does) > > versus > > maybe_apply_delay === means "maybe [the] apply [needs a] delay" > (which is also correct, but it seemed a more awkward way to say it IMO) > This matches more with GUC and all other usages of variables in the patch. So, I still prefer the second one. -- With Regards, Amit Kapila.
Dear Amit, Horiguchi-san, > > > > send_feedback(): > > + * If the subscriber side apply is delayed (because of time-delayed > > + * replication) then do not tell the publisher that the received latest > > + * LSN is already applied and flushed, otherwise, it leads to the > > + * publisher side making a wrong assumption of logical replication > > + * progress. Instead, we just send a feedback message to avoid a > publisher > > + * timeout during the delay. > > */ > > - if (!have_pending_txes) > > + if (!have_pending_txes && !in_delayed_apply) > > flushpos = writepos = recvpos; > > > > Honestly I don't like this wart. The reason for this is the function > > assumes recvpos = applypos but we actually call it while holding > > unapplied changes, that is, applypos < recvpos. > > > > Couldn't we maintain an additional static variable "last_applied" > > along with last_received? > > > > It won't be easy to maintain the meaning of last_applied because there > are cases where we don't apply the change directly. For example, in > case of streaming xacts, we will just keep writing it to the file, > now, say, due to some reason, we have to send the feedback, then it > will not allow you to update the latest write locations. This would > then become different then what we are doing without the patch. > Another point to think about is that we also need to keep the variable > updated for keep-alive ('k') messages even though we don't apply > anything in that case. Still, other cases to consider are where we > have mix of streaming and non-streaming transactions. I have tried to implement that, but it might be difficult because of a corner case related to the initial data sync. First of all, I made last_applied update when * transactions are committed, prepared, or aborted * the apply worker receives a keepalive message. I thought that during the initial data sync we must not update last_applied in response to keepalive messages, so the following lines were added just after updating last_received. ``` + if (last_applied < end_lsn && AllTablesyncsReady()) + last_applied = end_lsn; ``` However, if data is still being synchronized and workers receive non-committable WAL, this condition cannot be satisfied. 009_matviews.pl tests such a case, and I got a failure there. In this test a MATERIALIZED VIEW is created on the publisher and the WAL is then replicated to the subscriber, but the transaction is not committed because logical replication does not support the statement. If we change the condition, the system may become inconsistent, because the worker would reply that all remote WAL is applied even while tablesync workers are still synchronizing data. Best Regards, Hayato Kuroda FUJITSU LIMITED
Hi, On Tuesday, January 24, 2023 5:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 24, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > > Attached the updated patch v19. > > > > > > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > > > > > > > I look this spelling strange. How about maybe_apply_delay()? > > > > > > > > > > +1. > > > > It depends on how you read it. I read it like this: > > > > maybe_delay_apply === means "maybe delay [the] apply" > > (which is exactly what the function does) > > > > versus > > > > maybe_apply_delay === means "maybe [the] apply [needs a] delay" > > (which is also correct, but it seemed a more awkward way to say it > > IMO) > > > > This matches more with GUC and all other usages of variables in the patch. So, > I still prefer the second one. Okay. Fixed. Attached the patch v20 that has incorporated all comments so far. Kindly have a look at the attached patch. Best Regards, Takamichi Osumi
On Tuesday, January 24, 2023 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > send_feedback(): > > + * If the subscriber side apply is delayed (because of time-delayed > > + * replication) then do not tell the publisher that the received latest > > + * LSN is already applied and flushed, otherwise, it leads to the > > + * publisher side making a wrong assumption of logical replication > > + * progress. Instead, we just send a feedback message to avoid a > publisher > > + * timeout during the delay. > > */ > > - if (!have_pending_txes) > > + if (!have_pending_txes && !in_delayed_apply) > > flushpos = writepos = recvpos; > > > > Honestly I don't like this wart. The reason for this is the function > > assumes recvpos = applypos but we actually call it while holding > > unapplied changes, that is, applypos < recvpos. > > > > Couldn't we maintain an additional static variable "last_applied" > > along with last_received? > > > > It won't be easy to maintain the meaning of last_applied because there are > cases where we don't apply the change directly. For example, in case of > streaming xacts, we will just keep writing it to the file, now, say, due to some > reason, we have to send the feedback, then it will not allow you to update the > latest write locations. This would then become different then what we are > doing without the patch. > Another point to think about is that we also need to keep the variable updated > for keep-alive ('k') messages even though we don't apply anything in that case. > Still, other cases to consider are where we have mix of streaming and > non-streaming transactions. Agreed. This will change some existing behaviors. So, didn't conduct this change in the latest patch [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373DC1881F382B4703F26E0EDC99%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Monday, January 23, 2023 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > > > Attached the updated patch v19. > > > > Few comments: > ============= > 1. > } > + > + > +/* > > Only one empty line is sufficient between different functions. Fixed. > 2. > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0 && opts->streaming == > + opts->LOGICALREP_STREAM_PARALLEL) > + ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("%s and %s are mutually exclusive options", > + "min_apply_delay > 0", "streaming = parallel")); > } > > I think here we should add a comment for the translator as we are doing in > some other nearby cases. Fixed. > 3. > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. > + */ > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > opts.min_apply_delay > 0) || > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > sub->minapplydelay > 0)) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot enable %s mode for subscription with %s", > + "streaming = parallel", "min_apply_delay")); > + > > A. When can second condition ((!IsSet(opts.specified_opts, > SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) in above check > be true? > B. In comments, you can say "See parse_subscription_options." (1) In the alter statement, streaming = parallel is set. Also, (2) in the alter statement, min_apply_delay isn't set. and (3) an existing subscription has non-zero min_apply_delay. Added the comment. > 4. > +/* > + * When min_apply_delay parameter is set on the subscriber, we wait > +long enough > + * to make sure a transaction is applied at least that interval behind > +the > + * publisher. > > Shouldn't this part of the comment needs to be updated after the patch has > stopped using interval? Yes. I removed "interval" in descriptions so that we don't get confused with types. > 5. How does this feature interacts with the SKIP feature? Currently, it doesn't > care whether the changes of a particular xact are skipped or not. I think that > might be okay because anyway the purpose of this feature is to make > subscriber lag from publishers. What do you think? > I feel we can add some comments to indicate the same. Added the comment in the commit message. I didn't add this kind of comment as code comments, since both features are independent. If there is a need to write it anywhere, then please let me know. The latest patch is posted in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373DC1881F382B4703F26E0EDC99%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Monday, January 23, 2023 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Jan 23, 2023 at 1:36 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > Here are my review comments for v19-0001. > > > ... > > > > 5. parse_subscription_options > > > > + /* > > + * The combination of parallel streaming mode and min_apply_delay is > > + not > > + * allowed. The subscriber in the parallel streaming mode applies > > + each > > + * stream on arrival without the time of commit/prepare. So, the > > + * subscriber needs to depend on the arrival time of the stream in > > + this > > + * case, if we apply the time-delayed feature for such transactions. > > + Then > > + * there is a possibility where some unnecessary delay will be added > > + on > > + * the subscriber by network communication break between nodes or > > + other > > + * heavy work load on the publisher. On the other hand, applying the > > + delay > > + * at the end of transaction with parallel apply also can cause > > + issues of > > + * used resource bloat and locks kept in open for a long time. Thus, > > + those > > + * features can't work together. > > + */ > > > > IMO some re-wording might be warranted here. I am not sure quite how > > to do it. Perhaps like below? > > > > SUGGESTION > > > > The combination of parallel streaming mode and min_apply_delay is not > allowed. > > > > Here are some reasons why these features are incompatible: > > a. In the parallel streaming mode the subscriber applies each stream > > on arrival without knowledge of the commit/prepare time. This means we > > cannot calculate the underlying network/decoding lag between publisher > > and subscriber, and so always waiting for the full 'min_apply_delay' > > period might include unnecessary delay. > > b. If we apply the delay at the end of the transaction of the parallel > > apply then that would cause issues related to resource bloat and locks > > being held for a long time. > > > > ~~~ > > > > How about something like: > The combination of parallel streaming mode and min_apply_delay is not > allowed. This is because we start applying the transaction stream as soon as > the first change arrives without knowing the transaction's prepare/commit time. > This means we cannot calculate the underlying network/decoding lag between > publisher and subscriber, and so always waiting for the full 'min_apply_delay' > period might include unnecessary delay. > > The other possibility is to apply the delay at the end of the parallel apply > transaction but that would cause issues related to resource bloat and locks > being held for a long time. Thank you for providing a good description ! Adopted. The latest patch can be seen in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373DC1881F382B4703F26E0EDC99%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Monday, January 23, 2023 5:07 PM Peter Smith <smithpb2250@gmail.com> wrote: > Here are my review comments for v19-0001. Thanks for your review ! > > ====== > Commit message > > 1. > The combination of parallel streaming mode and min_apply_delay is not > allowed. The subscriber in the parallel streaming mode applies each stream on > arrival without the time of commit/prepare. So, the subscriber needs to depend > on the arrival time of the stream in this case, if we apply the time-delayed > feature for such transactions. Then there is a possibility where some > unnecessary delay will be added on the subscriber by network communication > break between nodes or other heavy work load on the publisher. On the other > hand, applying the delay at the end of transaction with parallel apply also can > cause issues of used resource bloat and locks kept in open for a long time. > Thus, those features can't work together. > ~ > > I think the above is just cut/paste from a code comment within > subscriptioncmds.c. See review comments #5 below -- so if the code is > changed then this commit message should also change to match it. Now, updated this. Kindly have a look at the latest patch in [1]. > > ====== > doc/src/sgml/ref/create_subscription.sgml > > 2. > + <varlistentry> > + <term><literal>min_apply_delay</literal> > (<type>integer</type>)</term> > + <listitem> > + <para> > + By default, the subscriber applies changes as soon as possible. > This > + parameter allows the user to delay the application of changes by a > + given time interval. If the value is specified without units, it is > + taken as milliseconds. The default is zero (no delay). > + </para> > > 2a. > The pgdocs says this is an integer default to “ms” unit. Also, the example on > this same page shows it is set to '4h'. But I did not see any mention of what > other units are available to the user. Maybe other time units should be > mentioned here, or maybe a link should be given to the section “20.1.1. > Parameter Names and Values". Added. > ~ > > 2b. > Previously the word "interval" was deliberately used because this parameter > had interval support. But maybe now it should be changed so it is not > misleading. > > "a given time interval" --> "a given time period" ?? Fixed. > ====== > src/backend/commands/subscriptioncmds.c > > 3. Forward declare > > +static int defGetMinApplyDelay(DefElem *def); > > If the new function is implemented as static near the top of this source file then > this forward declare would not even be necessary, right? This declaration has been kept as discussed. > ~~~ > > 4. parse_subscription_options > > @@ -324,6 +328,12 @@ parse_subscription_options(ParseState *pstate, List > *stmt_options, > opts->specified_opts |= SUBOPT_LSN; > opts->lsn = lsn; > } > + else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + strcmp(defel->defname, "min_apply_delay") == 0) { > + opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY; min_apply_delay = > + opts->defGetMinApplyDelay(defel); > + } > > Should this code fragment be calling errorConflictingDefElem so it will report > an error if the same min_apply_delay parameter is redundantly repeated? > (IIUC, this appears to be the code pattern for other parameters nearby). Added. > ~~~ > > 5. parse_subscription_options > > + /* > + * The combination of parallel streaming mode and min_apply_delay is > + not > + * allowed. The subscriber in the parallel streaming mode applies each > + * stream on arrival without the time of commit/prepare. 
So, the > + * subscriber needs to depend on the arrival time of the stream in this > + * case, if we apply the time-delayed feature for such transactions. > + Then > + * there is a possibility where some unnecessary delay will be added on > + * the subscriber by network communication break between nodes or other > + * heavy work load on the publisher. On the other hand, applying the > + delay > + * at the end of transaction with parallel apply also can cause issues > + of > + * used resource bloat and locks kept in open for a long time. Thus, > + those > + * features can't work together. > + */ > > IMO some re-wording might be warranted here. I am not sure quite how to do it. > Perhaps like below? > > SUGGESTION > > The combination of parallel streaming mode and min_apply_delay is not > allowed. > > Here are some reasons why these features are incompatible: > a. In the parallel streaming mode the subscriber applies each stream on arrival > without knowledge of the commit/prepare time. This means we cannot > calculate the underlying network/decoding lag between publisher and > subscriber, and so always waiting for the full 'min_apply_delay' > period might include unnecessary delay. > b. If we apply the delay at the end of the transaction of the parallel apply then > that would cause issues related to resource bloat and locks being held for a > long time. Now, this has been changed to the one suggested by Amit-san. Thanks for your help. > ~~~ > > 6. defGetMinApplyDelay > > + > + > +/* > + * Extract the min_apply_delay mode value from a DefElem. This is very > +similar > + * to PGC_INT case of parse_and_validate_value(), because > +min_apply_delay > + * accepts the same string as recovery_min_apply_delay. > + */ > +int > +defGetMinApplyDelay(DefElem *def) > > 6a. > "same string" -> "same parameter format" ?? Fixed. > ~ > > 6b. > I thought this function should be implemented as static and located at the top > of the subscriptioncmds.c source file. Made it static but didn't change the place, as Amit-san mentioned. > ====== > src/backend/replication/logical/worker.c > > 7. maybe_delay_apply > > +static void maybe_delay_apply(TransactionId xid, TimestampTz > +finish_ts); > > Is there a reason why this is here? AFAIK the static implementation precedes > any usage so I doubt this forward declaration is required. Removed. > ~~~ > > 8. send_feedback > > @@ -3775,11 +3912,12 @@ send_feedback(XLogRecPtr recvpos, bool force, > bool requestReply) > pq_sendint64(reply_message, now); /* sendTime */ > pq_sendbyte(reply_message, requestReply); /* replyRequested */ > > - elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, > flush %X/%X", > + elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write > %X/%X, flush %X/%X in-delayed: %d", > force, > LSN_FORMAT_ARGS(recvpos), > LSN_FORMAT_ARGS(writepos), > - LSN_FORMAT_ARGS(flushpos)); > + LSN_FORMAT_ARGS(flushpos), > + in_delayed_apply); > > Wondering if it is better to write this as: > "sending feedback (force %d, in_delayed_apply %d) to recv %X/%X, > write %X/%X, flush %X/%X" Adopted and merged with the modification Euler-san provided. > ~ > > 10. Add new tests? > > Should there be other tests just to verify different units (like 'd', 'h', 'min') are > working OK? No need. The current subscription.sql does the check of "invalid value for parameter..." error message, which ensures we call the defGetMinApplyDelay(). Additionally, we have the test of one unit 'd' for unit iteration loopin convert_to_base_unit(). 
So, the current test sets should suffice. > ====== > src/test/subscription/t/032_apply_delay.pl > > 11. > +# Confirm the time-delayed replication has been effective from the > +server log # message where the apply worker emits for applying delay. > +Moreover, verifies # that the current worker's delayed time is > +sufficiently bigger than the # expected value, in order to check any update of > the min_apply_delay. > +sub check_apply_delay_log > > "the current worker's delayed time..." --> "the current worker's remaining wait > time..." ?? Fixed. > ~~~ > > 12. > + # Get the delay time from the server log my $contents = > + slurp_file($node_subscriber->logfile, $offset); > > "Get the delay time...." --> "Get the remaining wait time..." Fixed. > ~~~ > > 13. > +# Create a subscription that applies the trasaction after 50 > +milliseconds delay $node_subscriber->safe_psql('postgres', > + "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > application_name=$appname' PUBLICATION tap_pub WITH (copy_data = off, > min_apply_delay = '50ms', streaming = 'on')" > +); > > 13a. > typo: "trasaction" Fixed. > ~ > > 13b > 50ms seems an extremely short time – How do you even know if this is testing > anything related to the time delay? You may just be detecting the normal lag > between publisher and subscriber without time delay having much to do with > anything. The wait time has been updated to 1 second now. Also, the TAP tests now search for the emitted logs by the apply worker. The path to emit the log is in the maybe_apply_delay and it does writes the log only if the "diffms" is bigger than zero, which invokes the wait. So, this will ensure we use the feature by this flow. > ~ > > 14. > > +# Note that we cannot call check_apply_delay_log() here because there > +is a # possibility that the delay is skipped. The event happens when > +the WAL # replication between publisher and subscriber is delayed due > +to a mechanical # problem. The log output will be checked later - substantial > delay-time case. > + > +# Verify that the subscriber lags the publisher by at least 50 > +milliseconds check_apply_delay_time($node_publisher, $node_subscriber, > +'2', '0.05'); > > 14a. > "The event happens..." ?? > > Did you mean "This might happen if the WAL..." This part has been removed. > ~ > > 14b. > The log output will be checked later - substantial delay-time case. > > I think that needs re-wording to clarify. > e.g1. you have nothing called a "substantial delay-time" case. > e.g2. the word "later" confused me. Originally, I thought you meant it is not > tested yet but that you will check it "later", but now IIUC you are just referring > to the "1 day 5 minutes" test that comes below in this location TAP file (??) Also, removed. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373DC1881F382B4703F26E0EDC99%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
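As a rough sketch of the unit handling discussed above (illustrative only; per the patch, min_apply_delay accepts the same value syntax as recovery_min_apply_delay, and a value specified without units is taken as milliseconds; "tap_sub" is the subscription name used in the TAP test):

ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '500');   -- 500 milliseconds
ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '1s');    -- one second
ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '5min');  -- five minutes
ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '1d');    -- one day
ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '0');     -- disable the delay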
On Tuesday, January 24, 2023 8:32 AM Euler Taveira <euler@eulerto.com> wrote: > Good to know that you keep improving this patch. I have a few suggestions that > were easier to provide a patch on top of your latest patch than to provide an > inline suggestions. Thanks for your review ! We basically adopted your suggestions. > There are a few documentation polishing. Let me comment some of them above. > > - The length of time (ms) to delay the application of changes. > + Total time spent delaying the application of changes, in milliseconds > > I don't remember if I suggested this description for catalog but IMO the > suggestion reads better for me. Adopted the above change. > - For time-delayed logical replication (i.e. when the subscription is > - created with parameter min_apply_delay > 0), the apply worker sends a > - Standby Status Update message to the publisher with a period of > - <literal>wal_receiver_status_interval</literal>. Make sure to set > - <literal>wal_receiver_status_interval</literal> less than the > - <literal>wal_sender_timeout</literal> on the publisher, otherwise, the > - walsender will repeatedly terminate due to the timeout errors. If > - <literal>wal_receiver_status_interval</literal> is set to zero, the apply > - worker doesn't send any feedback messages during the subscriber's > - <literal>min_apply_delay</literal> period. See > - <xref linkend="sql-createsubscription"/> for details. > + For time-delayed logical replication, the apply worker sends a feedback > + message to the publisher every > + <varname>wal_receiver_status_interval</varname> milliseconds. Make sure > + to set <varname>wal_receiver_status_interval</varname> less than the > + <varname>wal_sender_timeout</varname> on the publisher, otherwise, the > + <literal>walsender</literal> will repeatedly terminate due to timeout > + error. If <varname>wal_receiver_status_interval</varname> is set to > + zero, the apply worker doesn't send any feedback messages during the > + <literal>min_apply_delay</literal> interval. > > I removed the parenthesis explanation about time-delayed logical replication. > If you are reading the documentation and does not know what it means you should > (a) read the logical replication chapter or (b) check the glossary (maybe a new > entry should be added). I also removed the Standby status Update message but it > is a low level detail; let's refer to it as feedback message as the other > sentences do. I changed "literal" to "varname" that's the correct tag for > parameters. I replace "period" with "interval" that was the previous > terminology. IMO we should be uniform, use one or the other. Adopted. Also, I added the glossary for time-delayed replication (one for applicable to both physical replication and logical replication). Plus, I united the term "interval" to period, because it would clarify the type for this feature. I think this is better. > - The subscriber replication can be instructed to lag behind the publisher > - side changes by specifying the <literal>min_apply_delay</literal> > - subscription parameter. See <xref linkend="sql-createsubscription"/> for > - details. > + A logical replication subscription can delay the application of changes by > + specifying the <literal>min_apply_delay</literal> subscription parameter. > + See <xref linkend="sql-createsubscription"/> for details. > > This feature refers to a specific subscription, hence, "logical replication > subscription" instead of "subscriber replication". Adopted. 
> + if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY)) > + errorConflictingDefElem(defel, pstate); > + > > Peter S referred to this missing piece of code too. Added. > -int > +static int > defGetMinApplyDelay(DefElem *def) > { > > It seems you forgot static keyword. Fixed. > - elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %lld ms, Remaining wait time: %ld ms", > - xid, (long long) MySubscription->minapplydelay, diffms); > + elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = " INT64_FORMAT " ms, remaining wait time:%ld ms", > + xid, MySubscription->minapplydelay, diffms); > int64 should use format modifier INT64_FORMAT. Fixed. > - (long) wal_receiver_status_interval * 1000, > + wal_receiver_status_interval * 1000L, > > Cast is not required. I added a suffix to the constant. Fixed. > - elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X in-delayed: %d", > + elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X, apply delay: %s", > force, > LSN_FORMAT_ARGS(recvpos), > LSN_FORMAT_ARGS(writepos), > LSN_FORMAT_ARGS(flushpos), > - in_delayed_apply); > + in_delayed_apply? "yes" : "no"); > > It is better to use a string to represent the yes/no option. Fixed. > - gettext_noop("Min apply delay (ms)")); > + gettext_noop("Min apply delay")); > > I don't know if it was discussed but we don't add units to headers. When I > think about this parameter representation (internal and external), I decided to > use the previous code because it provides a unit for external representation. I > understand that using the same representation as recovery_min_apply_delay is > good but the current code does not handle the external representation > accordingly. (recovery_min_apply_delay uses the GUC machinery to adds the unit > but for min_apply_delay, it doesn't). Adopted. > # Setup for streaming case > -$node_publisher->append_conf('postgres.conf', > +$node_publisher->append_conf('postgresql.conf', > 'logical_decoding_mode = immediate'); > $node_publisher->reload; > > Fix configuration file name. Fixed. > Maybe tests should do a better job. I think check_apply_delay_time is fragile > because it does not guarantee that time is not shifted. Time-delayed > replication is a subscriber feature and to check its correctness it should > check the logs. > > # Note that we cannot call check_apply_delay_log() here because there is a > # possibility that the delay is skipped. The event happens when the WAL > # replication between publisher and subscriber is delayed due to a mechanical > # problem. The log output will be checked later - substantial delay-time case. > > If you might not use the logs for it, it should adjust the min_apply_delay, no? Yes. Adjusted. > It does not exercise the min_apply_delay vs parallel streaming mode. > > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. > + */ > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) > + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) || > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot enable %s mode for subscription with %s", > + "streaming = parallel", "min_apply_delay")); > + > > Is this code correct? I also didn't like this message. "cannot enable streaming > = parallel mode for subscription with min_apply_delay" is far from a good error > message. 
How about refer parallelism to "parallel streaming mode". Yes. opts is the input for alter command and sub object is the existing definition. We need to check those combinations like when streaming is set to parallel and min_apply_delay also gets set, then, min_apply_delay should not be bigger than 0, for example. Besides, adopted your suggestion to improve the comments. Attach the patch in [1]. Kindly have a look at it. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373DC1881F382B4703F26E0EDC99%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
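To make the ALTER-time combination described above concrete, here is an illustrative sketch (again assuming the option names from the patch under review; the subscription and publication names are made up):

-- An existing subscription already carries a non-zero min_apply_delay ...
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=192.0.2.4 dbname=postgres'
    PUBLICATION mypub
    WITH (min_apply_delay = '5min');

-- ... so later enabling parallel streaming, without resetting
-- min_apply_delay in the same command, is expected to be rejected
ALTER SUBSCRIPTION mysub SET (streaming = parallel);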
At Tue, 24 Jan 2023 11:45:36 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > Personally, I would prefer the above LOGs to be in reverse order as it > doesn't make much sense to me to first say that we are skipping > changes and then say the transaction is delayed. What do you think? In the first place, I misunderstood maybe_start_skipping_changes(), which doesn't actually skip changes. So... sorry for the noise. For the record, I agree that the current order is right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
In short, I'd like to propose renaming the parameter in_delayed_apply of send_feedback to "has_unprocessed_change". At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > send_feedback(): > > + * If the subscriber side apply is delayed (because of time-delayed > > + * replication) then do not tell the publisher that the received latest > > + * LSN is already applied and flushed, otherwise, it leads to the > > + * publisher side making a wrong assumption of logical replication > > + * progress. Instead, we just send a feedback message to avoid a publisher > > + * timeout during the delay. > > */ > > - if (!have_pending_txes) > > + if (!have_pending_txes && !in_delayed_apply) > > flushpos = writepos = recvpos; > > > > Honestly I don't like this wart. The reason for this is the function > > assumes recvpos = applypos but we actually call it while holding > > unapplied changes, that is, applypos < recvpos. > > > > Couldn't we maintain an additional static variable "last_applied" > > along with last_received? > > > > It won't be easy to maintain the meaning of last_applied because there > are cases where we don't apply the change directly. For example, in > case of streaming xacts, we will just keep writing it to the file, > now, say, due to some reason, we have to send the feedback, then it > will not allow you to update the latest write locations. This would > then become different then what we are doing without the patch. > Another point to think about is that we also need to keep the variable > updated for keep-alive ('k') messages even though we don't apply > anything in that case. Still, other cases to consider are where we > have mix of streaming and non-streaming transactions. Yeah. Even though I named it as "last_applied", its objective is to have get_flush_position returning the correct have_pending_txes without a hint from callers, that is, "let g_f_position know if store_flush_position has been called with the last received data". Anyway I tried that but didn't find a clean and simple way. However, while on it, I realized what the code made me confused. +static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, + bool in_delayed_apply); The name "in_delayed_apply" doesn't donsn't give me an idea of what the function should do for it. If it is named "has_unprocessed_change", I think it makes sense that send_feedback should think there may be an outstanding transaction that is not known to the function. So, my conclusion here is I'd like to propose changing the parameter name to "has_unapplied_change". > > In this case the condition cited above > > would be as follows and in_delayed_apply will become unnecessary. > > > > + if (!have_pending_txes && last_received == last_applied) > > > > The function is a static function and always called with a variable > > last_received that has the same scope with the function, as the first Sorry for the noise, I misread it. Maybe I took the "function-scoped" variable as file-scoped.. Thus the discussion is false. > > parameter. Thus we can remove the first parameter then let the > > function directly look at the both two varaibles instead. > > > > I think this is true without this patch, so why that has not been > followed in the first place? One comment, I see in this regard is as > below: > > /* It's legal to not pass a recvpos */ > if (recvpos < last_recvpos) > recvpos = last_recvpos; Sorry. I don't understand this. It is just a part of the ratchet mechanism for the last received lsn to report. 
regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 24 Jan 2023 14:22:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Jan 24, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > > Attached the updated patch v19. > > > > > > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > > > > > > > I look this spelling strange. How about maybe_apply_delay()? > > > > > > > > > > +1. > > > > It depends on how you read it. I read it like this: > > > > maybe_delay_apply === means "maybe delay [the] apply" > > (which is exactly what the function does) > > > > versus > > > > maybe_apply_delay === means "maybe [the] apply [needs a] delay" > > (which is also correct, but it seemed a more awkward way to say it IMO) > > > > This matches more with GUC and all other usages of variables in the > patch. So, I still prefer the second one. I read it as "maybe apply [the] delay [to something suggested by the context]". If we go the first way, I will name it as "maybe_delay_apply_change" or something that has an extra word. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry for bothering you with this. At Tue, 24 Jan 2023 10:12:40 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > Couldn't we maintain an additional static variable "last_applied" > > > along with last_received? > > > > > > > It won't be easy to maintain the meaning of last_applied because there > > are cases where we don't apply the change directly. For example, in > > case of streaming xacts, we will just keep writing it to the file, > > now, say, due to some reason, we have to send the feedback, then it > > will not allow you to update the latest write locations. This would > > then become different then what we are doing without the patch. > > Another point to think about is that we also need to keep the variable > > updated for keep-alive ('k') messages even though we don't apply > > anything in that case. Still, other cases to consider are where we > > have mix of streaming and non-streaming transactions. > > I have tried to implement that, but it might be difficult because of a corner > case related with the initial data sync. > > First of all, I have made last_applied to update when > > * transactions are committed, prepared, or aborted > * apply worker receives keepalive message. Yeah, I vaguely thought it would be enough for the update to happen just before the existing send_feedback() calls. But it turned out to introduce another bit of unprincipledness. > I thought during the initial data sync, we must not update the last applied > triggered by keepalive messages, so following lines were added just after > updating last_received. > > ``` > + if (last_applied < end_lsn && AllTablesyncsReady()) > + last_applied = end_lsn; > ``` Maybe the name "last_applied" confused you. As I mentioned in another message, the variable points to the remote LSN of the last "processed" 'w'/'k' messages. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 24, 2023 at 5:49 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > > Attached the patch v20 that has incorporated all comments so far. > Kindly have a look at the attached patch. > > > Best Regards, > Takamichi Osumi > Thank You for patch. My previous comments are addressed. Tested it and it looks good. Logging is also fine now. Just one comment, in summary, we see : If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction commit for min_apply_delay milliseconds. Is it better to write "delay the transaction apply" instead of "delay the transaction commit" just to be consistent as we do not actually delay the commit for regular transactions. thanks Shveta
Hi, On Wednesday, January 25, 2023 2:02 PM shveta malik <shveta.malik@gmail.com> wrote: > On Tue, Jan 24, 2023 at 5:49 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > > > Attached the patch v20 that has incorporated all comments so far. > > Kindly have a look at the attached patch. > Thank You for patch. My previous comments are addressed. Tested it and it > looks good. Logging is also fine now. > > Just one comment, in summary, we see : > If the subscription sets min_apply_delay parameter, the logical replication > worker will delay the transaction commit for min_apply_delay milliseconds. > > Is it better to write "delay the transaction apply" instead of "delay the > transaction commit" just to be consistent as we do not actually delay the > commit for regular transactions. Thank you for your review ! Agreed. Your description looks better. Attached the updated patch v21. Best Regards, Takamichi Osumi
Hi, Horiguchi-san Thank you for checking the patch ! On Wednesday, January 25, 2023 10:17 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > In short, I'd like to propose renaming the parameter in_delayed_apply of > send_feedback to "has_unprocessed_change". > > At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila <amit.kapila16@gmail.com> > wrote in > > > send_feedback(): > > > + * If the subscriber side apply is delayed (because of > time-delayed > > > + * replication) then do not tell the publisher that the received > latest > > > + * LSN is already applied and flushed, otherwise, it leads to the > > > + * publisher side making a wrong assumption of logical > replication > > > + * progress. Instead, we just send a feedback message to avoid a > publisher > > > + * timeout during the delay. > > > */ > > > - if (!have_pending_txes) > > > + if (!have_pending_txes && !in_delayed_apply) > > > flushpos = writepos = recvpos; > > > > > > Honestly I don't like this wart. The reason for this is the function > > > assumes recvpos = applypos but we actually call it while holding > > > unapplied changes, that is, applypos < recvpos. > > > > > > Couldn't we maintain an additional static variable "last_applied" > > > along with last_received? > > > > > > > It won't be easy to maintain the meaning of last_applied because there > > are cases where we don't apply the change directly. For example, in > > case of streaming xacts, we will just keep writing it to the file, > > now, say, due to some reason, we have to send the feedback, then it > > will not allow you to update the latest write locations. This would > > then become different then what we are doing without the patch. > > Another point to think about is that we also need to keep the variable > > updated for keep-alive ('k') messages even though we don't apply > > anything in that case. Still, other cases to consider are where we > > have mix of streaming and non-streaming transactions. > > Yeah. Even though I named it as "last_applied", its objective is to have > get_flush_position returning the correct have_pending_txes without a hint > from callers, that is, "let g_f_position know if store_flush_position has been > called with the last received data". > > Anyway I tried that but didn't find a clean and simple way. However, while on it, > I realized what the code made me confused. > > +static void send_feedback(XLogRecPtr recvpos, bool force, bool > requestReply, > + bool in_delayed_apply); > > The name "in_delayed_apply" doesn't donsn't give me an idea of what the > function should do for it. If it is named "has_unprocessed_change", I think it > makes sense that send_feedback should think there may be an outstanding > transaction that is not known to the function. > > > So, my conclusion here is I'd like to propose changing the parameter name to > "has_unapplied_change". Renamed the variable name to "has_unprocessed_change". Also, removed the first argument of the send_feedback() which isn't necessary now. Kindly have a look at the patch shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373193B4331B7EB6276F682EDCE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
At Tue, 24 Jan 2023 12:19:04 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > Attached the patch v20 that has incorporated all comments so far. Thanks! I looked thourgh the documentation part. + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>subminapplydelay</structfield> <type>int8</type> + </para> + <para> + Total time spent delaying the application of changes, in milliseconds. + </para></entry> I was confused becase it reads as this column shows the summarized actual waiting time caused by min_apply_delay. IIUC actually it shows the min_apply_delay setting for the subscription. Thus shouldn't it be something like this? "The minimum amount of time to delay applying changes, in milliseconds" And it might be better to mention the corresponding subscription paramter. + error. If <varname>wal_receiver_status_interval</varname> is set to + zero, the apply worker doesn't send any feedback messages during the + <literal>min_apply_delay</literal> period. I took a bit longer time to understand what this sentence means. I'd like to suggest something like the follwoing. "Since no status-update messages are sent while delaying, note that wal_receiver_status_interval is the only source of keepalive messages during that period." + <para> + A logical replication subscription can delay the application of changes by + specifying the <literal>min_apply_delay</literal> subscription parameter. + See <xref linkend="sql-createsubscription"/> for details. + </para> I'm not sure "logical replication subscription" is a common term. Doesn't just "subscription" mean the same, especially in that context? (Note that 31.2 starts with "A subscription is the downstream.."). + Any delay occurs only on WAL records for transaction begins after all + initial table synchronization has finished. The delay is calculated There is no "transaction begin" WAL records. Maybe it is "logical replication transaction begin message". The timestamp is of "commit time". (I took "transaction begins" as a noun, but that might be wrong..) + may reduce the actual wait time. It is also possible that the overhead + already exceeds the requested <literal>min_apply_delay</literal> value, + in which case no additional wait is necessary. If the system clocks I'm not sure it is right to say "necessary" here. IMHO it might be better be "in which case no delay is applied". + in which case no additional wait is necessary. If the system clocks + on publisher and subscriber are not synchronized, this may lead to + apply changes earlier than expected, but this is not a major issue + because this parameter is typically much larger than the time + deviations between servers. Note that if this parameter is set to a This doesn't seem to fit our documentation. It is not our business whether a certain amount deviation is critical or not. How about somethig like the following? "Note that the delay is measured between the timestamp assigned by publisher and the system clock on subscriber. You need to manage the system clocks to be in sync so that the delay works properly." + Delaying the replication can mean there is a much longer time + between making a change on the publisher, and that change being + committed on the subscriber. This can impact the performance of + synchronous replication. See <xref linkend="guc-synchronous-commit"/> + parameter. Do we need the "can" in "Delaying the replication can mean"? If we want to say, it might be "Delaying the replication means there can be a much longer..."? 
+ <para> + Create a subscription to a remote server that replicates tables in + the <literal>mypub</literal> publication and starts replicating immediately + on commit. Pre-existing data is not copied. The application of changes is + delayed by 4 hours. +<programlisting> +CREATE SUBSCRIPTION mysub + CONNECTION 'host=192.0.2.4 port=5432 user=foo dbname=foodb' + PUBLICATION mypub + WITH (copy_data = false, min_apply_delay = '4h'); +</programlisting></para> I'm not sure we need this additional example. We already have two exmaples one of which differs from the above only by actual values for PUBLICATION and WITH clauses. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
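For reference, the catalog column being discussed could be inspected on the subscriber roughly like this (a sketch assuming the column name added by the patch, pg_subscription.subminapplydelay):

SELECT subname, subminapplydelay FROM pg_subscription;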
On Wed, Jan 25, 2023 at 11:23 AM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > > Thank you for checking the patch ! > On Wednesday, January 25, 2023 10:17 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > In short, I'd like to propose renaming the parameter in_delayed_apply of > > send_feedback to "has_unprocessed_change". > > > > At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila <amit.kapila16@gmail.com> > > wrote in > > > > send_feedback(): > > > > + * If the subscriber side apply is delayed (because of > > time-delayed > > > > + * replication) then do not tell the publisher that the received > > latest > > > > + * LSN is already applied and flushed, otherwise, it leads to the > > > > + * publisher side making a wrong assumption of logical > > replication > > > > + * progress. Instead, we just send a feedback message to avoid a > > publisher > > > > + * timeout during the delay. > > > > */ > > > > - if (!have_pending_txes) > > > > + if (!have_pending_txes && !in_delayed_apply) > > > > flushpos = writepos = recvpos; > > > > > > > > Honestly I don't like this wart. The reason for this is the function > > > > assumes recvpos = applypos but we actually call it while holding > > > > unapplied changes, that is, applypos < recvpos. > > > > > > > > Couldn't we maintain an additional static variable "last_applied" > > > > along with last_received? > > > > > > > > > > It won't be easy to maintain the meaning of last_applied because there > > > are cases where we don't apply the change directly. For example, in > > > case of streaming xacts, we will just keep writing it to the file, > > > now, say, due to some reason, we have to send the feedback, then it > > > will not allow you to update the latest write locations. This would > > > then become different then what we are doing without the patch. > > > Another point to think about is that we also need to keep the variable > > > updated for keep-alive ('k') messages even though we don't apply > > > anything in that case. Still, other cases to consider are where we > > > have mix of streaming and non-streaming transactions. > > > > Yeah. Even though I named it as "last_applied", its objective is to have > > get_flush_position returning the correct have_pending_txes without a hint > > from callers, that is, "let g_f_position know if store_flush_position has been > > called with the last received data". > > > > Anyway I tried that but didn't find a clean and simple way. However, while on it, > > I realized what the code made me confused. > > > > +static void send_feedback(XLogRecPtr recvpos, bool force, bool > > requestReply, > > + bool in_delayed_apply); > > > > The name "in_delayed_apply" doesn't donsn't give me an idea of what the > > function should do for it. If it is named "has_unprocessed_change", I think it > > makes sense that send_feedback should think there may be an outstanding > > transaction that is not known to the function. > > > > > > So, my conclusion here is I'd like to propose changing the parameter name to > > "has_unapplied_change". > Renamed the variable name to "has_unprocessed_change". > Also, removed the first argument of the send_feedback() which isn't necessary now. > Why did you remove the first argument of the send_feedback() when that is not added by this patch? If you really think that is an improvement, feel free to propose that as a separate patch. Personally, I don't see a value in it. -- With Regards, Amit Kapila.
On Wed, Jan 25, 2023 at 11:57 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 24 Jan 2023 12:19:04 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > Attached the patch v20 that has incorporated all comments so far. > ... > > > + in which case no additional wait is necessary. If the system clocks > + on publisher and subscriber are not synchronized, this may lead to > + apply changes earlier than expected, but this is not a major issue > + because this parameter is typically much larger than the time > + deviations between servers. Note that if this parameter is set to a > > This doesn't seem to fit our documentation. It is not our business > whether a certain amount deviation is critical or not. How about > somethig like the following? > But we have a similar description for 'recovery_min_apply_delay' [1]. See "...If the system clocks on primary and standby are not synchronized, this may lead to recovery applying records earlier than expected; but that is not a major issue because useful settings of this parameter are much larger than typical time deviations between servers." [1] - https://www.postgresql.org/docs/devel/runtime-config-replication.html -- With Regards, Amit Kapila.
At Wed, 25 Jan 2023 12:30:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Jan 25, 2023 at 11:57 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 24 Jan 2023 12:19:04 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > > Attached the patch v20 that has incorporated all comments so far. > > > ... > > > > > > + in which case no additional wait is necessary. If the system clocks > > + on publisher and subscriber are not synchronized, this may lead to > > + apply changes earlier than expected, but this is not a major issue > > + because this parameter is typically much larger than the time > > + deviations between servers. Note that if this parameter is set to a > > > > This doesn't seem to fit our documentation. It is not our business > > whether a certain amount deviation is critical or not. How about > > somethig like the following? > > > > But we have a similar description for 'recovery_min_apply_delay' [1]. > See "...If the system clocks on primary and standby are not > synchronized, this may lead to recovery applying records earlier than > expected; but that is not a major issue because useful settings of > this parameter are much larger than typical time deviations between > servers." Mmmm. I thought that we might be able to gather the description (including other common descriptions, if any), but I didn't find an appropreate place.. Okay. I agree to the current description. Thanks for the kind explanation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wednesday, January 25, 2023 3:27 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Tue, 24 Jan 2023 12:19:04 +0000, "Takamichi Osumi (Fujitsu)" > <osumi.takamichi@fujitsu.com> wrote in > > Attached the patch v20 that has incorporated all comments so far. > > Thanks! I looked thourgh the documentation part. Thank you for your review ! > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>subminapplydelay</structfield> <type>int8</type> > + </para> > + <para> > + Total time spent delaying the application of changes, in milliseconds. > + </para></entry> > > I was confused becase it reads as this column shows the summarized actual > waiting time caused by min_apply_delay. IIUC actually it shows the > min_apply_delay setting for the subscription. Thus shouldn't it be something > like this? > > "The minimum amount of time to delay applying changes, in milliseconds" > And it might be better to mention the corresponding subscription paramter. This description looks much better to me than the past description. Fixed. OTOH, other parameters don't mention about its subscription parameters. So, I didn't add the mention. > + error. If <varname>wal_receiver_status_interval</varname> is set to > + zero, the apply worker doesn't send any feedback messages during > the > + <literal>min_apply_delay</literal> period. > > I took a bit longer time to understand what this sentence means. I'd like to > suggest something like the follwoing. > > "Since no status-update messages are sent while delaying, note that > wal_receiver_status_interval is the only source of keepalive messages during > that period." The current patch's description is precise and I prefer that. I would say "the only source" would be confusing to readers. However, I slightly adjusted the description a bit. Could you please check ? > + <para> > + A logical replication subscription can delay the application of changes by > + specifying the <literal>min_apply_delay</literal> subscription > parameter. > + See <xref linkend="sql-createsubscription"/> for details. > + </para> > > I'm not sure "logical replication subscription" is a common term. > Doesn't just "subscription" mean the same, especially in that context? > (Note that 31.2 starts with "A subscription is the downstream.."). I think you are right. Fixed. > + Any delay occurs only on WAL records for transaction begins after > all > + initial table synchronization has finished. The delay is > + calculated > > There is no "transaction begin" WAL records. Maybe it is "logical replication > transaction begin message". The timestamp is of "commit time". (I took > "transaction begins" as a noun, but that might be > wrong..) Yeah, we can improve here. But, we need to include not only "commit" but also "prepare" as nuance in this part. In short, I think we should change here to mention (1) the delay happens after all initial table synchronization (2) how delay is applied for non-streaming and streaming transactions in general. By the way, WAL timestamp is a word used in the recovery_min_apply_delay. So, I'd like to keep it to make the description more aligned with it, until there is a better description. Updated the doc. I adjusted the commit message according to this fix. > > + may reduce the actual wait time. It is also possible that the overhead > + already exceeds the requested <literal>min_apply_delay</literal> > value, > + in which case no additional wait is necessary. If the system > + clocks > > I'm not sure it is right to say "necessary" here. 
IMHO it might be better be "in > which case no delay is applied". Agreed. Fixed. > + in which case no additional wait is necessary. If the system clocks > + on publisher and subscriber are not synchronized, this may lead to > + apply changes earlier than expected, but this is not a major issue > + because this parameter is typically much larger than the time > + deviations between servers. Note that if this parameter is > + set to a > > This doesn't seem to fit our documentation. It is not our business whether a > certain amount deviation is critical or not. How about somethig like the > following? > > "Note that the delay is measured between the timestamp assigned by > publisher and the system clock on subscriber. You need to manage the > system clocks to be in sync so that the delay works properly." As discussed, this is aligned with recovery_min_apply_delay. So, I keep it. > + Delaying the replication can mean there is a much longer time > + between making a change on the publisher, and that change > being > + committed on the subscriber. This can impact the performance > of > + synchronous replication. See <xref > linkend="guc-synchronous-commit"/> > + parameter. > > Do we need the "can" in "Delaying the replication can mean"? If we want to > say, it might be "Delaying the replication means there can be a much longer..."? The "can" indicates the possibility as the nuance, while adopting "means" in this case indicates "time delayed LR causes the long time wait always". I'm okay with either expression, but I think you are right in practice and from the perspective of the purpose of this feature. So, fixed. > + <para> > + Create a subscription to a remote server that replicates tables in > + the <literal>mypub</literal> publication and starts replicating > immediately > + on commit. Pre-existing data is not copied. The application of changes is > + delayed by 4 hours. > +<programlisting> > +CREATE SUBSCRIPTION mysub > + CONNECTION 'host=192.0.2.4 port=5432 user=foo dbname=foodb' > + PUBLICATION mypub > + WITH (copy_data = false, min_apply_delay = '4h'); > +</programlisting></para> > > I'm not sure we need this additional example. We already have two exmaples > one of which differs from the above only by actual values for PUBLICATION and > WITH clauses. I thought there was no harm in having this example, but what you say makes sense. Removed. Attached the updated v22. Best Regards, Takamichi Osumi
On Wednesday, January 25, 2023 3:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 25, 2023 at 11:23 AM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > > > Thank you for checking the patch ! > > On Wednesday, January 25, 2023 10:17 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > In short, I'd like to propose renaming the parameter > > > in_delayed_apply of send_feedback to "has_unprocessed_change". > > > > > > At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila > > > <amit.kapila16@gmail.com> wrote in > > > > > send_feedback(): > > > > > + * If the subscriber side apply is delayed (because of > > > time-delayed > > > > > + * replication) then do not tell the publisher that the > > > > > + received > > > latest > > > > > + * LSN is already applied and flushed, otherwise, it leads to > the > > > > > + * publisher side making a wrong assumption of logical > > > replication > > > > > + * progress. Instead, we just send a feedback message to > > > > > + avoid a > > > publisher > > > > > + * timeout during the delay. > > > > > */ > > > > > - if (!have_pending_txes) > > > > > + if (!have_pending_txes && !in_delayed_apply) > > > > > flushpos = writepos = recvpos; > > > > > > > > > > Honestly I don't like this wart. The reason for this is the > > > > > function assumes recvpos = applypos but we actually call it > > > > > while holding unapplied changes, that is, applypos < recvpos. > > > > > > > > > > Couldn't we maintain an additional static variable "last_applied" > > > > > along with last_received? > > > > > > > > > > > > > It won't be easy to maintain the meaning of last_applied because > > > > there are cases where we don't apply the change directly. For > > > > example, in case of streaming xacts, we will just keep writing it > > > > to the file, now, say, due to some reason, we have to send the > > > > feedback, then it will not allow you to update the latest write > > > > locations. This would then become different then what we are doing > without the patch. > > > > Another point to think about is that we also need to keep the > > > > variable updated for keep-alive ('k') messages even though we > > > > don't apply anything in that case. Still, other cases to consider > > > > are where we have mix of streaming and non-streaming transactions. > > > > > > Yeah. Even though I named it as "last_applied", its objective is to > > > have get_flush_position returning the correct have_pending_txes > > > without a hint from callers, that is, "let g_f_position know if > > > store_flush_position has been called with the last received data". > > > > > > Anyway I tried that but didn't find a clean and simple way. However, > > > while on it, I realized what the code made me confused. > > > > > > +static void send_feedback(XLogRecPtr recvpos, bool force, bool > > > requestReply, > > > + bool > > > + in_delayed_apply); > > > > > > The name "in_delayed_apply" doesn't donsn't give me an idea of what > > > the function should do for it. If it is named > > > "has_unprocessed_change", I think it makes sense that send_feedback > > > should think there may be an outstanding transaction that is not known to > the function. > > > > > > > > > So, my conclusion here is I'd like to propose changing the parameter > > > name to "has_unapplied_change". > > Renamed the variable name to "has_unprocessed_change". > > Also, removed the first argument of the send_feedback() which isn't > necessary now. 
> > > > Why did you remove the first argument of the send_feedback() when that is not > added by this patch? If you really think that is an improvement, feel free to > propose that as a separate patch. > Personally, I don't see a value in it. Oh, sorry for that. I have made the change back. Kindly have a look at the v22 shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB837305BD31FA317256BC7B1FEDCE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Wednesday, January 25, 2023 11:24 PM I wrote:
> Attached the updated v22.

Hi,

During self-review, I noticed that some variable types related to the 'min_apply_delay' value needed adjusting, so I have made those changes. Additionally, I improved some comments for translators and made the TAP test better. Note that I ran pgindent and pgperltidy on the patch. The updated patch should now be more refined.

Best Regards,
Takamichi Osumi
On Fri, Jan 27, 2023 at 1:39 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, January 25, 2023 11:24 PM I wrote:
> > Attached the updated v22.
> Hi,
>
> During self-review, I noticed some changes are
> required for some variable types related to 'min_apply_delay' value,
> so have conducted the adjustment changes for the same.
>

So, you have changed min_apply_delay from int64 to int32, but you haven't mentioned the reason for the same? We use 'int' for the similar parameter recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your reason explicitly.

Few comments
=============
1.
@@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	XLogRecPtr	subskiplsn;		/* All changes finished at this LSN are
 								 * skipped */

+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	NameData	subname;		/* Name of the subscription */

 	Oid			subowner BKI_LOOKUP(pg_authid);	/* Owner of the subscription */

Why are you placing this after subskiplsn? Earlier it was okay because we want the 64 bit value to be aligned but now, isn't it better to keep it after subowner?

2.
+
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+												 TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay));

The above code appears a bit unreadable. Can we store the result of TimestampTzPlusMilliseconds() in a separate variable say "TimestampTz delayUntil;"?

--
With Regards,
Amit Kapila.
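For readers following the review, a minimal sketch of the restructuring Amit suggests in comment 2, assuming the same variables as the quoted fragment (finish_ts, MySubscription->minapplydelay) and the proposed name delayUntil; this is an illustration, not the actual patch code:

```c
/*
 * Sketch of the refactoring suggested above (not the actual patch code):
 * compute the target wake-up time first, then the remaining delay.
 * finish_ts and MySubscription->minapplydelay come from the quoted patch
 * fragment; "delayUntil" is the variable name proposed in the review.
 */
TimestampTz delayUntil;
long		diffms;

delayUntil = TimestampTzPlusMilliseconds(finish_ts,
										 MySubscription->minapplydelay);
diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);

/* If diffms <= 0, the requested delay has already elapsed. */
```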
Hi, On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 27, 2023 at 1:39 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Wednesday, January 25, 2023 11:24 PM I wrote: > > > Attached the updated v22. > > Hi, > > > > During self-review, I noticed some changes are required for some > > variable types related to 'min_apply_delay' value, so have conducted > > the adjustment changes for the same. > > > > So, you have changed min_apply_delay from int64 to int32, but you haven't > mentioned the reason for the same? We use 'int' for the similar parameter > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > reason explicitly. Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now so should be adjusted to match it. > Few comments > ============= > 1. > @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) > BKI_SHARED_RELATION BKI_ROW > XLogRecPtr subskiplsn; /* All changes finished at this LSN are > * skipped */ > > + int32 subminapplydelay; /* Replication apply delay (ms) */ > + > NameData subname; /* Name of the subscription */ > > Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ > > Why are you placing this after subskiplsn? Earlier it was okay because we want > the 64 bit value to be aligned but now, isn't it better to keep it after subowner? Moved it after subowner. > 2. > + > + diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), > + TimestampTzPlusMilliseconds(finish_ts, > + MySubscription->minapplydelay)); > > The above code appears a bit unreadable. Can we store the result of > TimestampTzPlusMilliseconds() in a separate variable say "TimestampTz > delayUntil;"? Agreed. Fixed. Attached the updated patch v24. Best Regards, Takamichi Osumi
At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > mentioned the reason for the same? We use 'int' for the similar parameter > > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > > reason explicitly. > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. > This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now > so should be adjusted to match it. INT_MAX can stick out of int32 on some platforms. (I'm not sure where that actually happens, though.) We can use PG_INT32_MAX instead. IMHO, I think we don't use int as a catalog column and I agree that int32 is sufficient since I don't think more than 49 days delay is practical. On the other hand, maybe I wouldn't want to use int32 for intermediate calculations. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > > mentioned the reason for the same? We use 'int' for the similar parameter > > > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > > > reason explicitly. > > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. > > This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now > > so should be adjusted to match it. > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > that actually happens, though.) We can use PG_INT32_MAX instead. > But in other integer GUCs including recovery_min_apply_delay, we use INT_MAX, so not sure if it is a good idea to do something different here. -- With Regards, Amit Kapila.
At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > > > mentioned the reason for the same? We use 'int' for the similar parameter > > > > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > > > > reason explicitly. > > > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. > > > This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now > > > so should be adjusted to match it. > > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > > that actually happens, though.) We can use PG_INT32_MAX instead. > > > > But in other integer GUCs including recovery_min_apply_delay, we use > INT_MAX, so not sure if it is a good idea to do something different > here. The GUC is not stored in a catalog, but.. oh... it is multiplied by 1000. So if it is larger than (INT_MAX / 1000), it overflows... If we officially accept that (I don't think great) behavior (even only for impractical values), I don't object further. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 30, 2023 at 9:43 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > > > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > > > > mentioned the reason for the same? We use 'int' for the similar parameter > > > > > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > > > > > reason explicitly. > > > > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. > > > > This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now > > > > so should be adjusted to match it. > > > > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > > > that actually happens, though.) We can use PG_INT32_MAX instead. > > > > > > > But in other integer GUCs including recovery_min_apply_delay, we use > > INT_MAX, so not sure if it is a good idea to do something different > > here. > > The GUC is not stored in a catalog, but.. oh... it is multiplied by > 1000. Which part of the patch you are referring to here? Isn't the check in the function defGetMinApplyDelay() sufficient to ensure that the 'delay' value stored in the catalog will always be lesser than INT_MAX? -- With Regards, Amit Kapila.
At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Mon, Jan 30, 2023 at 9:43 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > > > > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > > > > > mentioned the reason for the same? We use 'int' for the similar parameter > > > > > > recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your > > > > > > reason explicitly. > > > > > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay. > > > > > This feature handles the range same as the recovery_min_apply delay from 0 to INT_MAX now > > > > > so should be adjusted to match it. > > > > > > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > > > > that actually happens, though.) We can use PG_INT32_MAX instead. > > > > > > > > > > But in other integer GUCs including recovery_min_apply_delay, we use > > > INT_MAX, so not sure if it is a good idea to do something different > > > here. > > > > The GUC is not stored in a catalog, but.. oh... it is multiplied by > > 1000. > > Which part of the patch you are referring to here? Isn't the check in Where recovery_min_apply_delay is used. It is allowed to be set up to INT_MAX but it is used as: > delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay); Where the macro is defined as: > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000)) Which can lead to overflow, which is practically harmless. > the function defGetMinApplyDelay() sufficient to ensure that the > 'delay' value stored in the catalog will always be lesser than > INT_MAX? I'm concerned about cases where INT_MAX is wider than int32. If we don't assume such cases, I'm fine with INT_MAX there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 30, 2023 at 12:38 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > > > > The GUC is not stored in a catalog, but.. oh... it is multiplied by > > > 1000. > > > > Which part of the patch you are referring to here? Isn't the check in > > Where recovery_min_apply_delay is used. It is allowed to be set up to > INT_MAX but it is used as: > > > delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay); > > Where the macro is defined as: > > > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000)) > > Which can lead to overflow, which is practically harmless. > But here tz is always TimestampTz (which is int64), so do, we need to worry? > > the function defGetMinApplyDelay() sufficient to ensure that the > > 'delay' value stored in the catalog will always be lesser than > > INT_MAX? > > I'm concerned about cases where INT_MAX is wider than int32. If we > don't assume such cases, I'm fine with INT_MAX there. > I am not aware of such cases. Anyway, if any such case is discovered then we need to change the checks in defGetMinApplyDelay(), right? If so, then I think it is better to keep it as it is unless we know that this could be an issue on some platform. -- With Regards, Amit Kapila.
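A small standalone illustration of why the 32-bit-int case is safe (hypothetical example code, not PostgreSQL source): the multiplication inside TimestampTzPlusMilliseconds() is performed in int64, so even INT_MAX milliseconds stays far below the int64 limit; only if int itself were 64 bits wide (the ILP64 case discussed later in the thread) could the product overflow.

```c
/*
 * Standalone illustration (not PostgreSQL code) of the overflow question
 * discussed above.  With a 32-bit int, INT_MAX milliseconds scaled to
 * microseconds is about 2.1e12, far below the int64 maximum, so the
 * macro's (ms) * (int64) 1000 cannot overflow.  Only if int were 64 bits
 * wide (ILP64) could ms * 1000 exceed the int64 range.
 */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t		worst_case = (int64_t) INT_MAX * 1000;	/* 2147483647000 */

	printf("INT_MAX * 1000 = %lld (int64 max = %lld)\n",
		   (long long) worst_case, (long long) INT64_MAX);
	return 0;
}
```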
On Saturday, January 28, 2023 1:28 PM I wrote:
> Attached the updated patch v24.

Hi,

I've rebased the patch on top of commit 1e8b61735c, renaming the GUC reference to logical_replication_mode accordingly, because that GUC is used in the TAP test of this time-delayed LR feature. There is no other change in this version.

Kindly have a look at the attached v25.

Best Regards,
Takamichi Osumi
On Monday, January 30, 2023 12:02 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Sat, 28 Jan 2023 04:28:29 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in
> > On Friday, January 27, 2023 8:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > So, you have changed min_apply_delay from int64 to int32, but you
> > > haven't mentioned the reason for the same? We use 'int' for the
> > > similar parameter recovery_min_apply_delay, so, ideally, it makes
> > > sense but still better to tell your reason explicitly.
> > Yes. It's because I thought I need to make this feature consistent with the recovery_min_apply_delay.
> > This feature handles the range same as the recovery_min_apply delay
> > from 0 to INT_MAX now so should be adjusted to match it.
>
> INT_MAX can stick out of int32 on some platforms. (I'm not sure where that
> actually happens, though.) We can use PG_INT32_MAX instead.
>
> IMHO, I think we don't use int as a catalog column and I agree that
> int32 is sufficient since I don't think more than 49 days delay is practical. On
> the other hand, maybe I wouldn't want to use int32 for intermediate
> calculations.

Hi, Horiguchi-san. Thanks for your comments!

IIUC, in the last sentence, you proposed that the type of the SubOpts min_apply_delay member should be changed to "int". But I couldn't find any actual harm in the current code, because we insert the SubOpts value into the catalog anyway after holding it in SubOpts. Also, there seems to be no explicit rule that we should use "int" local variables for "int32" system catalog values internally. I had a look at other variables for int32 system catalog members, and they looked fine either way. So, I'd like to keep the current code as it is until an actual harm is found.

The latest patch can be seen in [1].

[1] - https://www.postgresql.org/message-id/TYCPR01MB8373E26884C385EFFFB8965FEDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi
On Monday, January 30, 2023 7:05 PM I wrote:
> On Saturday, January 28, 2023 1:28 PM I wrote:
> > Attached the updated patch v24.
> I've rebased the patch on top of commit 1e8b61735c, renaming the GUC
> reference to logical_replication_mode accordingly, because that GUC is
> used in the TAP test of this time-delayed LR feature.
> There is no other change in this version.
>
> Kindly have a look at the attached v25.

Hi,

The v25 patch caused a failure on the Windows run of cfbot in [1]. However, the failure happened in the pg_upgrade tests, and the failure message looks the same as the one reported in the ongoing discussion in [2]. So it's an issue independent of v25.

[1] - https://cirrus-ci.com/task/5484559622471680
[2] - https://www.postgresql.org/message-id/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de

Best Regards,
Takamichi Osumi
At Mon, 30 Jan 2023 14:24:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Mon, Jan 30, 2023 at 12:38 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000))
> >
> > Which can lead to overflow, which is practically harmless.
> >
> But here tz is always TimestampTz (which is int64), so do, we need to worry?

Sorry, I was putting in the assumption that int were int64 here.

> > > the function defGetMinApplyDelay() sufficient to ensure that the
> > > 'delay' value stored in the catalog will always be lesser than
> > > INT_MAX?
> >
> > I'm concerned about cases where INT_MAX is wider than int32. If we
> > don't assume such cases, I'm fine with INT_MAX there.
> >
> I am not aware of such cases. Anyway, if any such case is discovered
> then we need to change the checks in defGetMinApplyDelay(), right? If
> so, then I think it is better to keep it as it is unless we know that
> this could be an issue on some platform.

I'm not sure. I think int is generally understood to be an integer type of unspecified size. min_apply_delay is tightly bound to a catalog column of int32, thus I thought that (PG_)INT32_MAX is the right limit. So, as I expressed before, if we assume sizeof(int) <= sizeof(int32), I'm fine with using INT_MAX there.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Dear Horiguchi-san,

> I'm not sure. I think int is generally understood to be an integer
> type of unspecified size. min_apply_delay is tightly bound to a
> catalog column of int32, thus I thought that (PG_)INT32_MAX is the
> right limit. So, as I expressed before, if we assume sizeof(int)
> <= sizeof(int32), I'm fine with using INT_MAX there.

I have checked some articles, and I think the platforms supported by Postgres regard int as a 32-bit integer. According to the C99 definition, the actual values of INT_MAX/INT_MIN depend on the implementation; INT_MAX must be greater than or equal to 2^15 - 1 [1]. So theoretically there is a possibility that int is bigger than int32, as you worried.

Next, I checked some data models and found ILP64, which regards int as a 64-bit integer. In this case INT_MAX may be 2^63 - 1, which exceeds PG_INT32_MAX. I cannot find a proper document about the type, but I can cite a table from the doc [2].

```
Datatype    LP64    ILP64   LLP64   ILP32   LP32
char          8       8       8       8      8
short        16      16      16      16     16
_int32               32
int          32      64      32      32     16
long         64      64      32      32     32
long long                    64
pointer      64      64      64      32     32
```

I'm not sure whether such systems survive or not. According to [2], a few systems were released, but I have never heard of them. Modern systems use LP64 or LLP64.

> There have been a few examples of ILP64 systems that have shipped
> (Cray and ETA come to mind).

In another paper [3], the Sun UltraSPARC, which had a 32-bit-only OS and a SPARC64-class processor, seems to use the ILP64 model, but it may be an ancient OS.

> 1995 Sun UltraSPARC: 64/32-bit hardware, 32-bit-only operating system. HAL Computer's SPARC64: uses ILP64 model for C.

Also, I checked the buildfarm animals that have the Sparc64 architecture, but their alignment of int seems to be 4 bytes [4].

> checking alignment of int... 4

Therefore, I think we can say that the modern platforms supported by PostgreSQL define int as 32-bit. That satisfies the condition sizeof(int) <= sizeof(int32), so we can keep using INT_MAX.

[1] https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf
[2] https://unix.org/version2/whatsnew/lp64_wp.html
[3] https://queue.acm.org/detail.cfm?id=1165766
[4] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=castoroides&dt=2023-01-30%2012%3A00%3A07&stg=configure#:~:text=checking%20alignment%20of%20int...%204

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hi, Kuroda-san. Thanks for the detailed study.

At Tue, 31 Jan 2023 07:06:40 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> Therefore, I think we can say that the modern platforms supported by PostgreSQL define int as 32-bit.
> That satisfies the condition sizeof(int) <= sizeof(int32), so we can keep using INT_MAX.

Yeah, I know that that's practically correct. What I wanted to make clear is whether we (always) assume int == int32. I don't want to do that just because it happens to work. Even though we cannot be perfect, in this particular case the destination space is explicitly made int32.

It's a similar discussion to the recent commit 3b4ac33254. We chose to use the "correct" symbols, refusing to employ an implicit assumption about the actual values. (In that sense, it is a compromise to take "int32 being narrower than int" as a premise, but the code would get uselessly complex without that assumption :p)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jan 31, 2023 at 1:40 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hi, Kuroda-san, Thanks for the detailed study. > > At Tue, 31 Jan 2023 07:06:40 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > Therefore, I think we can say that modern platforms that are supported by PostgreSQL define int as 32-bit. > > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep to use INT_MAX. > > Yeah, I know that that's practically correct. Just I wanted to make > clear is whether we (always) assume int == int32. I don't want to do > that just because that works. Even though we cannot be perfect, in > this particular case the destination space is explicitly made as > int32. > So, shall we check if the result of parse_int is in the range 0 and PG_INT32_MAX to ameliorate this concern? If this works then we need to probably change the return value of defGetMinApplyDelay() to int32. -- With Regards, Amit Kapila.
At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Tue, Jan 31, 2023 at 1:40 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > Hi, Kuroda-san, Thanks for the detailed study.
> >
> > At Tue, 31 Jan 2023 07:06:40 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> > > Therefore, I think we can say that modern platforms that are supported by PostgreSQL define int as 32-bit.
> > > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep to use INT_MAX.
> >
> > Yeah, I know that that's practically correct. Just I wanted to make
> > clear is whether we (always) assume int == int32. I don't want to do
> > that just because that works. Even though we cannot be perfect, in
> > this particular case the destination space is explicitly made as
> > int32.
> >
>
> So, shall we check if the result of parse_int is in the range 0 and
> PG_INT32_MAX to ameliorate this concern?

Yeah, it is exactly what I wanted to suggest.

> If this works then we need to
> probably change the return value of defGetMinApplyDelay() to int32.

I hadn't thought of doing that; int can store all values in the valid range (I'm assuming we implicitly assume int >= int32 in bit width) and it is the natural integer in C. Either will do for me, but I slightly prefer to use int there.

As a result I'd like to propose the following change.

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 489eae85ee..9de2745623 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -2293,16 +2293,16 @@ defGetMinApplyDelay(DefElem *def)
 				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
 
 	/*
-	 * Check lower bound. parse_int() has already been confirmed that result
-	 * is less than or equal to INT_MAX.
+	 * Check the both boundary. Although parse_int() checked the result against
+	 * INT_MAX, this value is to be stored in a catalog column of int32.
 	 */
-	if (result < 0)
+	if (result < 0 || result > PG_INT32_MAX)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
 						result, "min_apply_delay",
-						0, INT_MAX)));
+						0, PG_INT32_MAX)));
 
 	return result;
 }

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Tue, Jan 31, 2023 at 1:40 PM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > Hi, Kuroda-san, Thanks for the detailed study. > > > > > > At Tue, 31 Jan 2023 07:06:40 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > > Therefore, I think we can say that modern platforms that are supported by PostgreSQL define int as 32-bit. > > > > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep to use INT_MAX. > > > > > > Yeah, I know that that's practically correct. Just I wanted to make > > > clear is whether we (always) assume int == int32. I don't want to do > > > that just because that works. Even though we cannot be perfect, in > > > this particular case the destination space is explicitly made as > > > int32. > > > > > > > So, shall we check if the result of parse_int is in the range 0 and > > PG_INT32_MAX to ameliorate this concern? > > Yeah, it is exactly what I wanted to suggest. > > > If this works then we need to > > probably change the return value of defGetMinApplyDelay() to int32. > > I didn't thought doing that, int can store all values in the valid > range (I'm assuming we implicitly assume int >= int32 in bit width) > and it is the natural integer in C. Either will do for me but I > slightly prefer to use int there. > I think it would be clear to use int32 because the parameter where we store the return value is also int32. -- With Regards, Amit Kapila.
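Putting the two agreed points together (the PG_INT32_MAX bound from the proposed diff and the int32 return type), the validation helper would look roughly like the sketch below. The error wording follows the quoted diff; the exact surrounding code is an assumption, not the actual patch:

```c
/*
 * Rough sketch combining the two points agreed above: keep parse_int() for
 * unit handling, reject values the int32 catalog column cannot hold, and
 * return int32.  This follows the quoted diff; it is not the actual patch.
 */
static int32
defGetMinApplyDelay(DefElem *def)
{
	char	   *input_string = defGetString(def);
	int			result;
	const char *hintmsg;

	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid value for parameter \"%s\": \"%s\"",
						"min_apply_delay", input_string),
				 hintmsg ? errhint("%s", _(hintmsg)) : 0));

	/* The value is stored in an int32 catalog column; clamp the range. */
	if (result < 0 || result > PG_INT32_MAX)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
						result, "min_apply_delay", 0, PG_INT32_MAX)));

	return (int32) result;
}
```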
Here are my review comments for the patch v25-0001. ====== Commit Message 1. The other possibility is to apply the delay at the end of the parallel apply transaction but that would cause issues related to resource bloat and locks being held for a long time. ~ SUGGESTION We chose not to apply the delay at the end of the parallel apply transaction because that would cause issues related to resource bloat and locks being held for a long time. ====== doc/src/sgml/config.sgml 2. + <para> + For time-delayed logical replication, the apply worker sends a feedback + message to the publisher every + <varname>wal_receiver_status_interval</varname> milliseconds. Make sure + to set <varname>wal_receiver_status_interval</varname> less than the + <varname>wal_sender_timeout</varname> on the publisher, otherwise, the + <literal>walsender</literal> will repeatedly terminate due to timeout + error. Note that if <varname>wal_receiver_status_interval</varname> is + set to zero, the apply worker sends no feedback messages during the + <literal>min_apply_delay</literal> period. + </para> 2a. "due to timeout error." --> "due to timeout errors." ~ 2b. Shouldn't this also cross-ref to CREATE SUBSCRIPTION docs? Because the above mentions 'min_apply_delay' but that is not defined on this page. ====== doc/src/sgml/ref/create_subscription.sgml 3. + <para> + By default, the subscriber applies changes as soon as possible. This + parameter allows the user to delay the application of changes by a + given time period. If the value is specified without units, it is + taken as milliseconds. The default is zero (no delay). See + <xref linkend="config-setting-names-values"/> for details on the + available valid time unites. + </para> Typo: "unites" ~~~ 4. + <para> + Any delay becomes effective after all initial table synchronization + has finished and occurs before each transaction starts to get applied + on the subscriber. The delay is calculated as the difference between + the WAL timestamp as written on the publisher and the current time on + the subscriber. Any overhead of time spent in logical decoding and in + transferring the transaction may reduce the actual wait time. It is + also possible that the overhead already exceeds the requested + <literal>min_apply_delay</literal> value, in which case no delay is + applied. If the system clocks on publisher and subscriber are not + synchronized, this may lead to apply changes earlier than expected, + but this is not a major issue because this parameter is typically + much larger than the time deviations between servers. Note that if + this parameter is set to a long delay, the replication will stop if + the replication slot falls behind the current LSN by more than + <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>. + </para> "Any delay becomes effective after all initial table synchronization..." --> "Any delay becomes effective only after all initial table synchronization..." ~~~ 5. + <warning> + <para> + Delaying the replication means there is a much longer time between + making a change on the publisher, and that change being committed + on the subscriber. This can impact the performance of synchronous + replication. See <xref linkend="guc-synchronous-commit"/> + parameter. + </para> + </warning> I'm not sure why this was text changed to say "means there is a much longer time" instead of "can mean there is a much longer time". 
IMO the previous wording was better because this current text makes an assumption about what the user has configured -- e.g. if they configured only 1ms delay then the warning text is not really relevant. ~~~ 6. Why was the example (it existed when I last looked at patch v19) removed? Personally, I found that example to be a useful reminder that the min_apply_delay can specify units other than just 'ms'. ====== src/backend/commands/subscriptioncmds.c 7. parse_subscription_options + /* + * The combination of parallel streaming mode and min_apply_delay is not + * allowed. This is because we start applying the transaction stream as + * soon as the first change arrives without knowing the transaction's + * prepare/commit time. This means we cannot calculate the underlying + * network/decoding lag between publisher and subscriber, and so always + * waiting for the full 'min_apply_delay' period might include unnecessary + * delay. + * + * The other possibility is to apply the delay at the end of the parallel + * apply transaction but that would cause issues related to resource bloat + * and locks being held for a long time. + */ I think the 2nd paragraph should be changed slightly as follows (like review comment #1) SUGGESTION Note - we chose not to apply the delay at the end of the parallel apply transaction because that would cause issues related to resource bloat and locks being held for a long time. ~~~ 8. + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL) + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. ~~~ 9. AlterSubscription + /* + * The combination of parallel streaming mode and + * min_apply_delay is not allowed. See + * parse_subscription_options for details of the reason. + */ + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) || + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. ~~~ 10. + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) + { + /* + * The combination of parallel streaming mode and + * min_apply_delay is not allowed. + */ + if (opts.min_apply_delay > 0) Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. ~~~ 11. defGetMinApplyDelay + /* + * Check lower bound. parse_int() has already been confirmed that result + * is less than or equal to INT_MAX. + */ The parse_int already checks < INT_MAX. But on return from that function, don’t you need to check again that it is < PG_INT32_MAX (in case those are different) (I think Kuroda-san already suggested same as this) ====== src/backend/replication/logical/worker.c 12. +/* + * In order to avoid walsender timeout for time-delayed logical replication the + * apply worker keeps sending feedback messages during the delay period. + * Meanwhile, the feature delays the apply before the start of the + * transaction and thus we don't write WAL records for the suspended changes + * during the wait. When the apply worker sends a feedback message during the + * delay, we should not make positions of the flushed and apply LSN overwritten + * by the last received latest LSN. See send_feedback() for details. 
+ */ "we should not make positions of the flushed and apply LSN overwritten" --> "we should overwrite positions of the flushed and apply LSN" ~~~ 14. send_feedback @@ -3738,8 +3867,15 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply) /* * No outstanding transactions to flush, we can report the latest received * position. This is important for synchronous replication. + * + * If the logical replication subscription has unprocessed changes then do + * not inform the publisher that the received latest LSN is already + * applied and flushed, otherwise, the publisher will make a wrong + * assumption about the logical replication progress. Instead, it just + * sends a feedback message to avoid a replication timeout during the + * delay. */ "Instead, it just sends" --> "Instead, just send" ====== src/bin/pg_dump/pg_dump.h 15. SubscriptionInfo @@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo char *subdisableonerr; char *suborigin; char *subsynccommit; + int subminapplydelay; char *subpublications; } SubscriptionInfo; Should this also be "int32" to match the other member type changes? ====== src/test/subscription/t/032_apply_delay.pl 16. +# Make sure the apply worker knows to wait for more than 500ms +check_apply_delay_log($node_subscriber, $offset, "0.5"); "knows to wait for more than" --> "waits for more than" (this occurs in a couple of places) ------ Kind Regards, Peter Smith. Fujitsu Australia
At Wed, 1 Feb 2023 08:38:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > So, shall we check if the result of parse_int is in the range 0 and > > > PG_INT32_MAX to ameliorate this concern? > > > > Yeah, it is exactly what I wanted to suggest. > > > > > If this works then we need to > > > probably change the return value of defGetMinApplyDelay() to int32. > > > > I didn't thought doing that, int can store all values in the valid > > range (I'm assuming we implicitly assume int >= int32 in bit width) > > and it is the natural integer in C. Either will do for me but I > > slightly prefer to use int there. > > > > I think it would be clear to use int32 because the parameter where we > store the return value is also int32. I'm fine with that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 30, 2023 6:05 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Saturday, January 28, 2023 1:28 PM I wrote: > > Attached the updated patch v24. > Hi, > > > I've conducted the rebase affected by the commit(1e8b61735c) > by renaming the GUC to logical_replication_mode accordingly, > because it's utilized in the TAP test of this time-delayed LR feature. > There is no other change for this version. > > Kindly have a look at the attached v25. > Thanks for your patch. Here are some comments. 1. + /* + * The min_apply_delay parameter is ignored until all tablesync workers + * have reached READY state. This is because if we allowed the delay + * during the catchup phase, then once we reached the limit of tablesync + * workers it would impose a delay for each subsequent worker. That would + * cause initial table synchronization completion to take a long time. + */ + if (!AllTablesyncsReady()) + return; I saw that the new parameter becomes effective after all tables are in ready state, because the apply worker can't set the state to catchup during the delay. But can we call process_syncing_tables() in the while-loop of maybe_apply_delay()? Then the tablesync can finish without delay. If we can't do so, it might be better to add some comments for it. 2. +# Make sure the apply worker knows to wait for more than 500ms +check_apply_delay_log($node_subscriber, $offset, "0.5"); I think the last parameter should be 500. Besides, I am not sure it's a stable test to check the log. Is it possible that there's no such log on a slow machine? I modified the code to sleep 1s at the beginning of apply_dispatch(), then the new added test failed because the server log cannot match. Regards, Shi yu
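A minimal sketch of the idea in comment 1 above, i.e. letting the delay loop also drive tablesync hand-off. The loop shape, the variable names (delayUntil, diffms), and the one-second nap granularity are assumptions for illustration, not the patch's actual code:

```c
/*
 * Minimal sketch (not the actual patch) of the idea in comment 1 above:
 * if the wait loop in maybe_apply_delay() also drove tablesync hand-off,
 * a tablesync worker could finish catch-up without sitting out the whole
 * min_apply_delay.  Variable names and the nap granularity are assumptions.
 */
while (true)
{
	long		diffms;

	diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
											 delayUntil);
	if (diffms <= 0)
		break;

	/* Hypothetical addition: let tablesync workers advance their state. */
	process_syncing_tables(last_received);

	/* Nap in short slices so feedback/keepalive handling stays responsive. */
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 Min(diffms, 1000L),
					 WAIT_EVENT_RECOVERY_APPLY_DELAY);
	ResetLatch(MyLatch);
}
```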
On Wed, Feb 1, 2023 at 3:10 PM shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote: > > On Mon, Jan 30, 2023 6:05 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > > > Kindly have a look at the attached v25. > > > > Thanks for your patch. Here are some comments. > > 1. > + /* > + * The min_apply_delay parameter is ignored until all tablesync workers > + * have reached READY state. This is because if we allowed the delay > + * during the catchup phase, then once we reached the limit of tablesync > + * workers it would impose a delay for each subsequent worker. That would > + * cause initial table synchronization completion to take a long time. > + */ > + if (!AllTablesyncsReady()) > + return; > > I saw that the new parameter becomes effective after all tables are in ready > state, because the apply worker can't set the state to catchup during the delay. > But can we call process_syncing_tables() in the while-loop of > maybe_apply_delay()? Then the tablesync can finish without delay. If we can't do > so, it might be better to add some comments for it. > I think the point here is that if the apply worker is ahead of tablesync worker then to complete the catch-up, tablesync worker needs to apply additional transactions, and delaying during that time will cause initial table synchronization completion to take a long time. I am not sure how much more details can be added to the existing comments. -- With Regards, Amit Kapila.
Hi, On Wednesday, February 1, 2023 5:40 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Wed, 1 Feb 2023 08:38:11 +0530, Amit Kapila <amit.kapila16@gmail.com> > wrote in > > On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila > > > <amit.kapila16@gmail.com> wrote in > > > > So, shall we check if the result of parse_int is in the range 0 > > > > and PG_INT32_MAX to ameliorate this concern? > > > > > > Yeah, it is exactly what I wanted to suggest. > > > > > > > If this works then we need to > > > > probably change the return value of defGetMinApplyDelay() to int32. > > > > > > I didn't thought doing that, int can store all values in the valid > > > range (I'm assuming we implicitly assume int >= int32 in bit width) > > > and it is the natural integer in C. Either will do for me but I > > > slightly prefer to use int there. > > > > > > > I think it would be clear to use int32 because the parameter where we > > store the return value is also int32. > > I'm fine with that. Thank you for confirming. Attached the updated patch v26 accordingly. I slightly adjusted the comments in defGetMinApplyDelay on this point as well. Best Regards, Takamichi Osumi
Hi, On Wednesday, February 1, 2023 1:37 PM Peter Smith <smithpb2250@gmail.com> wrote: > Here are my review comments for the patch v25-0001. Thank you for your review ! > ====== > Commit Message > > 1. > The other possibility is to apply the delay at the end of the parallel apply > transaction but that would cause issues related to resource bloat and locks being > held for a long time. > > ~ > > SUGGESTION > We chose not to apply the delay at the end of the parallel apply transaction > because that would cause issues related to resource bloat and locks being held > for a long time. I prefer the current description. So, I just changed one word from "The other possibility is..." to "The other possibility was" to indicate both two paragraphs (this paragraph and the previous paragraph) are related. > ====== > doc/src/sgml/config.sgml > > 2. > + <para> > + For time-delayed logical replication, the apply worker sends a feedback > + message to the publisher every > + <varname>wal_receiver_status_interval</varname> milliseconds. > Make sure > + to set <varname>wal_receiver_status_interval</varname> less than > the > + <varname>wal_sender_timeout</varname> on the publisher, > otherwise, the > + <literal>walsender</literal> will repeatedly terminate due to timeout > + error. Note that if <varname>wal_receiver_status_interval</varname> > is > + set to zero, the apply worker sends no feedback messages during the > + <literal>min_apply_delay</literal> period. > + </para> > > 2a. > "due to timeout error." --> "due to timeout errors." Fixed. > ~ > > 2b. > Shouldn't this also cross-ref to CREATE SUBSCRIPTION docs? Because the > above mentions 'min_apply_delay' but that is not defined on this page. Makes sense. Added. > ====== > doc/src/sgml/ref/create_subscription.sgml > > 3. > + <para> > + By default, the subscriber applies changes as soon as possible. This > + parameter allows the user to delay the application of changes by a > + given time period. If the value is specified without units, it is > + taken as milliseconds. The default is zero (no delay). See > + <xref linkend="config-setting-names-values"/> for details on the > + available valid time unites. > + </para> > > Typo: "unites" Fixed it to "units". > ~~~ > > 4. > + <para> > + Any delay becomes effective after all initial table synchronization > + has finished and occurs before each transaction starts to get applied > + on the subscriber. The delay is calculated as the difference between > + the WAL timestamp as written on the publisher and the current time > on > + the subscriber. Any overhead of time spent in logical decoding and in > + transferring the transaction may reduce the actual wait time. It is > + also possible that the overhead already exceeds the requested > + <literal>min_apply_delay</literal> value, in which case no delay is > + applied. If the system clocks on publisher and subscriber are not > + synchronized, this may lead to apply changes earlier than expected, > + but this is not a major issue because this parameter is typically > + much larger than the time deviations between servers. Note that if > + this parameter is set to a long delay, the replication will stop if > + the replication slot falls behind the current LSN by more than > + <link > linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</liter > al></link>. > + </para> > > "Any delay becomes effective after all initial table synchronization..." --> "Any > delay becomes effective only after all initial table synchronization..." Agreed. Fixed. 
> ~~~ > > 5. > + <warning> > + <para> > + Delaying the replication means there is a much longer time > between > + making a change on the publisher, and that change being > committed > + on the subscriber. This can impact the performance of > synchronous > + replication. See <xref linkend="guc-synchronous-commit"/> > + parameter. > + </para> > + </warning> > > > I'm not sure why this was text changed to say "means there is a much longer > time" instead of "can mean there is a much longer time". > > IMO the previous wording was better because this current text makes an > assumption about what the user has configured -- e.g. if they configured only > 1ms delay then the warning text is not really relevant. Yes, I changed here. The reason is that the purpose of this feature is to address unintentional wrong operations on the pub and for that purpose, I didn't feel quite very short time like you mentioned might not be set for this parameter after some community's comments from hackers. Either was fine, but I chose the current description, depending on the purpose. > ~~~ > > 6. > Why was the example (it existed when I last looked at patch v19) removed? > Personally, I found that example to be a useful reminder that the > min_apply_delay can specify units other than just 'ms'. Removed because the example was one variation that used one difference value of WITH clause, after some comments from the hackers. The reference for available units is documented, so the current description should be sufficient. > ====== > src/backend/commands/subscriptioncmds.c > > 7. parse_subscription_options > > + /* > + * The combination of parallel streaming mode and min_apply_delay is > + not > + * allowed. This is because we start applying the transaction stream as > + * soon as the first change arrives without knowing the transaction's > + * prepare/commit time. This means we cannot calculate the underlying > + * network/decoding lag between publisher and subscriber, and so always > + * waiting for the full 'min_apply_delay' period might include > + unnecessary > + * delay. > + * > + * The other possibility is to apply the delay at the end of the > + parallel > + * apply transaction but that would cause issues related to resource > + bloat > + * and locks being held for a long time. > + */ > > I think the 2nd paragraph should be changed slightly as follows (like review > comment #1) > > SUGGESTION > Note - we chose not to apply the delay at the end of the parallel apply > transaction because that would cause issues related to resource bloat and locks > being held for a long time. Same as the first comment, changed only "is" to "was", to indicate the last paragraph is related to past discussion(option) for the parallel streaming mode that was not adopted. > ~~~ > > 8. > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0 && opts->streaming == > + opts->LOGICALREP_STREAM_PARALLEL) > + ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. This check is necessary. For example, imagine a case when we CREATE a subscription with streaming = on and then try to ALTER the subscription with streaming = parallel without any settings for min_apply_delay. The ALTER command throws an error of "min_apply_delay > 0 and streaming = parallel are mutually exclusive options." then. 
This is because min_apply_delay is supported by ALTER command (so the first condition becomes true) and we set streaming = parallel (which makes the 2nd condition true). So, we need to check the opts's actual min_apply_delay value to make the irrelavent case pass. > ~~~ > > 9. AlterSubscription > > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. See > + * parse_subscription_options for details of the reason. > + */ > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > opts.min_apply_delay > 0) || > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > sub->minapplydelay > 0)) > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. This is also necessary. For example, imagine a case that there is a subscription whose min_apply_delay is 1 day. Then, you want to try to execute ALTER SUBSCRIPTION with (min_apply_delay = 0, streaming = parallel). If we remove the condition of otps.min_apply_delay > 0, then we error out in this case too. First we pass the first condition of the opts.streaming == LOGICALREP_STREAM_PARALLEL, since we use streaming option. Then, we also set min_apply_delay in this example, then without checking the value of min_apply_delay, the second condition becomes true (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)). So, we need to make this case(min_apply_delay = 0) pass. Meanwhile, checking the "sub" value is necessary for checking existing subscription value. > ~~~ > > 10. > + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. > + */ > + if (opts.min_apply_delay > 0) > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. This is also required to check the value equals to 0 or not. Kindly imagine a case when we want to execute ALTER min_apply_delay from 1day with a pair of (min_apply_delay = 0 and streaming = parallel). If we remove this check, then this ALTER command fails with error. Without the check, when we set min_apply_delay and parallel streaming mode, even when making the min_apply_delay 0, the error is invoked. The check for sub.stream is necessary for existing definition of target subscription. > ~~~ > > 11. defGetMinApplyDelay > > + /* > + * Check lower bound. parse_int() has already been confirmed that > + result > + * is less than or equal to INT_MAX. > + */ > > The parse_int already checks < INT_MAX. But on return from that function, > don’t you need to check again that it is < PG_INT32_MAX (in case those are > different) > > (I think Kuroda-san already suggested same as this) Changed according to the discussion. > ====== > src/backend/replication/logical/worker.c > > 12. > +/* > + * In order to avoid walsender timeout for time-delayed logical > +replication the > + * apply worker keeps sending feedback messages during the delay period. > + * Meanwhile, the feature delays the apply before the start of the > + * transaction and thus we don't write WAL records for the suspended > +changes > + * during the wait. When the apply worker sends a feedback message > +during the > + * delay, we should not make positions of the flushed and apply LSN > +overwritten > + * by the last received latest LSN. See send_feedback() for details. 
> + */ > > "we should not make positions of the flushed and apply LSN overwritten" --> > "we should overwrite positions of the flushed and apply LSN" Fixed. I added "not" in your suggestion, too. > ~~~ > > 14. send_feedback > > @@ -3738,8 +3867,15 @@ send_feedback(XLogRecPtr recvpos, bool force, bool > requestReply) > /* > * No outstanding transactions to flush, we can report the latest received > * position. This is important for synchronous replication. > + * > + * If the logical replication subscription has unprocessed changes then > + do > + * not inform the publisher that the received latest LSN is already > + * applied and flushed, otherwise, the publisher will make a wrong > + * assumption about the logical replication progress. Instead, it just > + * sends a feedback message to avoid a replication timeout during the > + * delay. > */ > > "Instead, it just sends" --> "Instead, just send" Fixed. > ====== > src/bin/pg_dump/pg_dump.h > > 15. SubscriptionInfo > > @@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo > char *subdisableonerr; > char *suborigin; > char *subsynccommit; > + int subminapplydelay; > char *subpublications; > } SubscriptionInfo; > > Should this also be "int32" to match the other member type changes? This is intentional. In the context of pg_dump, we are treating this same as other int32 catalog members. So, I'd like to keep the current code. > ====== > src/test/subscription/t/032_apply_delay.pl > > 16. > +# Make sure the apply worker knows to wait for more than 500ms > +check_apply_delay_log($node_subscriber, $offset, "0.5"); > > "knows to wait for more than" --> "waits for more than" > > (this occurs in a couple of places) Fixed. Kindly have a look at v26 shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83730A45925B9680C40D92AFEDD69%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
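As a rough sketch of the user-facing syntax under review here (the subscription, publication, and connection values are placeholders, and the parameter only exists with the proposed patch applied), min_apply_delay is given in the subscription's WITH clause and accepts the usual time units, with a unit-less value taken as milliseconds:

    -- hypothetical example of the proposed option; all names are placeholders
    CREATE SUBSCRIPTION delayed_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (min_apply_delay = '3h');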
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Wednesday, February 1, 2023 6:41 PM Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com> wrote: > On Mon, Jan 30, 2023 6:05 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > On Saturday, January 28, 2023 1:28 PM I wrote: > > > Attached the updated patch v24. > > Hi, > > > > > > I've conducted the rebase affected by the commit(1e8b61735c) by > > renaming the GUC to logical_replication_mode accordingly, because it's > > utilized in the TAP test of this time-delayed LR feature. > > There is no other change for this version. > > > > Kindly have a look at the attached v25. > > > > Thanks for your patch. Here are some comments. Thank you for your review ! > 2. > +# Make sure the apply worker knows to wait for more than 500ms > +check_apply_delay_log($node_subscriber, $offset, "0.5"); > > I think the last parameter should be 500. Good catch ! Fixed. > Besides, I am not sure it's a stable test to check the log. Is it possible that there's > no such log on a slow machine? I modified the code to sleep 1s at the beginning > of apply_dispatch(), then the new added test failed because the server log > cannot match. To get the log by itself is necessary to ensure that the delay is conducted by the apply worker, because we emit the diffms only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, we are not sure the delay is caused by other reasons or the time-delayed feature. As you mentioned, it's possible that no log is emitted on slow machine. Then, the idea to make the test safer for such machines should be to make the delayed time longer. But we shortened the delay time to 1 second to mitigate the long test execution time of this TAP test. So, I'm not sure if it's a good idea to make it longer again. Please have a look at the latest v26 in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83730A45925B9680C40D92AFEDD69%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > ... > > > > Besides, I am not sure it's a stable test to check the log. Is it possible that there's > > no such log on a slow machine? I modified the code to sleep 1s at the beginning > > of apply_dispatch(), then the new added test failed because the server log > > cannot match. > To get the log by itself is necessary to ensure > that the delay is conducted by the apply worker, because we emit the diffms > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, > we are not sure the delay is caused by other reasons or the time-delayed feature. > > As you mentioned, it's possible that no log is emitted on slow machine. Then, > the idea to make the test safer for such machines should be to make the delayed time longer. > But we shortened the delay time to 1 second to mitigate the long test execution time of this TAP test. > So, I'm not sure if it's a good idea to make it longer again. I think there are a couple of things that can be done about this problem: 1. If you need the code/test to remain as-is then at least the test message could include some comforting text like "(this can fail on slow machines when the delay time is already exceeded)" so then a test failure will not cause undue alarm. 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that it will *always* log the remaining wait time even if that wait time becomes negative. Then I think the test cases can be made deterministic instead of relying on good luck. This seems like the better option. ------ Kind Regards, Peter Smith. Fujitsu Australia
Here are my review comments for patch v26-0001. On Thu, Feb 2, 2023 at 7:18 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > Hi, > > On Wednesday, February 1, 2023 1:37 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for the patch v25-0001. > Thank you for your review ! > > > 8. > > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > > + opts->min_apply_delay > 0 && opts->streaming == > > + opts->LOGICALREP_STREAM_PARALLEL) > > + ereport(ERROR, > > + errcode(ERRCODE_SYNTAX_ERROR), > > > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. > This check is necessary. > > For example, imagine a case when we CREATE a subscription with streaming = on > and then try to ALTER the subscription with streaming = parallel > without any settings for min_apply_delay. The ALTER command > throws an error of "min_apply_delay > 0 and streaming = parallel are > mutually exclusive options." then. > > This is because min_apply_delay is supported by ALTER command > (so the first condition becomes true) and we set > streaming = parallel (which makes the 2nd condition true). > > So, we need to check the opts's actual min_apply_delay value > to make the irrelavent case pass. I think there is some misunderstanding. I was not suggesting removing the condition -- only that I thought it could be written without the > 0 as: if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && opts->min_apply_delay && opts->streaming == LOGICALREP_STREAM_PARALLEL) ereport(ERROR, > > ~~~ > > > > 9. AlterSubscription > > > > + /* > > + * The combination of parallel streaming mode and > > + * min_apply_delay is not allowed. See > > + * parse_subscription_options for details of the reason. > > + */ > > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if > > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > opts.min_apply_delay > 0) || > > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > sub->minapplydelay > 0)) > > > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. > This is also necessary. > > For example, imagine a case that > there is a subscription whose min_apply_delay is 1 day. > Then, you want to try to execute ALTER SUBSCRIPTION > with (min_apply_delay = 0, streaming = parallel). > If we remove the condition of otps.min_apply_delay > 0, > then we error out in this case too. > > First we pass the first condition > of the opts.streaming == LOGICALREP_STREAM_PARALLEL, > since we use streaming option. > Then, we also set min_apply_delay in this example, > then without checking the value of min_apply_delay, > the second condition becomes true > (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)). > > So, we need to make this case(min_apply_delay = 0) pass. > Meanwhile, checking the "sub" value is necessary for checking existing subscription value. I think there is some misunderstanding. I was not suggesting removing the condition -- only that I thought it could be written without the > 0 as:: if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay) || (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay)) ereport(ERROR, > > ~~~ > > > > 10. > > + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) { > > + /* > > + * The combination of parallel streaming mode and > > + * min_apply_delay is not allowed. 
> > + */ > > + if (opts.min_apply_delay > 0) > > > > Saying "> 0" (in the condition) is not strictly necessary here, since it is never < 0. > This is also required to check the value equals to 0 or not. > Kindly imagine a case when we want to execute ALTER min_apply_delay from 1day > with a pair of (min_apply_delay = 0 and > streaming = parallel). If we remove this check, then this ALTER command fails > with error. Without the check, when we set min_apply_delay > and parallel streaming mode, even when making the min_apply_delay 0, > the error is invoked. > > The check for sub.stream is necessary for existing definition of target subscription. I think there is some misunderstanding. I was not suggesting removing the condition -- only that I thought it could be written without the > 0 as:: if (opts.min_apply_delay) if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == LOGICALREP_STREAM_PARALLEL) || (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)) ereport(ERROR, ------ Kind Regards, Peter Smith. Fujitsu Australia
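To make the cases in this exchange concrete, the following hypothetical sequence shows the mutual exclusion being discussed (names and connection string are placeholders; the expected outcomes follow the descriptions above, not a released server):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (min_apply_delay = '1d');

    -- expected to fail: the subscription still has a non-zero min_apply_delay
    ALTER SUBSCRIPTION mysub SET (streaming = parallel);

    -- expected to succeed: the delay is cleared in the same command
    ALTER SUBSCRIPTION mysub SET (min_apply_delay = 0, streaming = parallel);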
On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > ... > > > > > > > Besides, I am not sure it's a stable test to check the log. Is it possible that there's > > > no such log on a slow machine? I modified the code to sleep 1s at the beginning > > > of apply_dispatch(), then the new added test failed because the server log > > > cannot match. > > To get the log by itself is necessary to ensure > > that the delay is conducted by the apply worker, because we emit the diffms > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, > > we are not sure the delay is caused by other reasons or the time-delayed feature. > > > > As you mentioned, it's possible that no log is emitted on slow machine. Then, > > the idea to make the test safer for such machines should be to make the delayed time longer. > > But we shortened the delay time to 1 second to mitigate the long test execution time of this TAP test. > > So, I'm not sure if it's a good idea to make it longer again. > > I think there are a couple of things that can be done about this problem: > > 1. If you need the code/test to remain as-is then at least the test > message could include some comforting text like "(this can fail on > slow machines when the delay time is already exceeded)" so then a test > failure will not cause undue alarm. > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that > it will *always* log the remaining wait time even if that wait time > becomes negative. Then I think the test cases can be made > deterministic instead of relying on good luck. This seems like the > better option. > I don't understand why we have to do any of this instead of using 3s as min_apply_delay similar to what we are doing in src/test/recovery/t/005_replay_delay. Also, I think we should use exactly the same way to verify the test even though we want to keep the log level as DEBUG2 to check logs in case of any failures. Also, I don't see the need to add more tests like the ones below: +# Test whether ALTER SUBSCRIPTION changes the delayed time of the apply worker +# (1 day 5 minutes). Note that the extra 5 minute is to account for any +# decoding/network overhead. Let's try to add tests similar to what we have for recovery_min_apply_delay unless there is some functionality in this patch that is not there in the recovery_min_apply_delay feature. -- With Regards, Amit Kapila.
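For comparison, the physical-replication setting referred to above is an existing GUC set on a standby, so the scenario exercised by src/test/recovery/t/005_replay_delay amounts to something like the following (the exact value and the way the test applies it are illustrative only):

    -- run on a physical standby; existing parameter, shown only for comparison
    ALTER SYSTEM SET recovery_min_apply_delay = '3s';
    SELECT pg_reload_conf();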
On Fri, Feb 3, 2023 at 8:02 AM Peter Smith <smithpb2250@gmail.com> wrote: > > I think there is some misunderstanding. I was not suggesting removing > the condition -- only that I thought it could be written without the > > 0 as: > > if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > opts->min_apply_delay && opts->streaming == LOGICALREP_STREAM_PARALLEL) > ereport(ERROR, > Yeah, we can probably write that way but in the error message we are already using > 0, so the current style used by patch seems good to me. Also, I think using the way you are suggesting is more apt for booleans. -- With Regards, Amit Kapila.
On Fri, Feb 3, 2023 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > > <osumi.takamichi@fujitsu.com> wrote: > > > > > ... > > > > > > > > > > Besides, I am not sure it's a stable test to check the log. Is it possible that there's > > > > no such log on a slow machine? I modified the code to sleep 1s at the beginning > > > > of apply_dispatch(), then the new added test failed because the server log > > > > cannot match. > > > To get the log by itself is necessary to ensure > > > that the delay is conducted by the apply worker, because we emit the diffms > > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, > > > we are not sure the delay is caused by other reasons or the time-delayed feature. > > > > > > As you mentioned, it's possible that no log is emitted on slow machine. Then, > > > the idea to make the test safer for such machines should be to make the delayed time longer. > > > But we shortened the delay time to 1 second to mitigate the long test execution time of this TAP test. > > > So, I'm not sure if it's a good idea to make it longer again. > > > > I think there are a couple of things that can be done about this problem: > > > > 1. If you need the code/test to remain as-is then at least the test > > message could include some comforting text like "(this can fail on > > slow machines when the delay time is already exceeded)" so then a test > > failure will not cause undue alarm. > > > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that > > it will *always* log the remaining wait time even if that wait time > > becomes negative. Then I think the test cases can be made > > deterministic instead of relying on good luck. This seems like the > > better option. > > > > I don't understand why we have to do any of this instead of using 3s > as min_apply_delay similar to what we are doing in > src/test/recovery/t/005_replay_delay. Also, I think we should use > exactly the same way to verify the test even though we want to keep > the log level as DEBUG2 to check logs in case of any failures. > IIUC the reasons are due to conflicting requirements. e.g. - A longer delay like 3s might work better for testing this feature, but OTOH - A longer delay will also cause the whole BF execution to take longer ------ Kind Regards, Peter Smith. Fujitsu Australia.
On Fri, Feb 3, 2023 at 11:12 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Fri, Feb 3, 2023 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > ... > > > > > > > > > > > > > Besides, I am not sure it's a stable test to check the log. Is it possible that there's > > > > > no such log on a slow machine? I modified the code to sleep 1s at the beginning > > > > > of apply_dispatch(), then the new added test failed because the server log > > > > > cannot match. > > > > To get the log by itself is necessary to ensure > > > > that the delay is conducted by the apply worker, because we emit the diffms > > > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, > > > > we are not sure the delay is caused by other reasons or the time-delayed feature. > > > > > > > > As you mentioned, it's possible that no log is emitted on slow machine. Then, > > > > the idea to make the test safer for such machines should be to make the delayed time longer. > > > > But we shortened the delay time to 1 second to mitigate the long test execution time of this TAP test. > > > > So, I'm not sure if it's a good idea to make it longer again. > > > > > > I think there are a couple of things that can be done about this problem: > > > > > > 1. If you need the code/test to remain as-is then at least the test > > > message could include some comforting text like "(this can fail on > > > slow machines when the delay time is already exceeded)" so then a test > > > failure will not cause undue alarm. > > > > > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that > > > it will *always* log the remaining wait time even if that wait time > > > becomes negative. Then I think the test cases can be made > > > deterministic instead of relying on good luck. This seems like the > > > better option. > > > > > > > I don't understand why we have to do any of this instead of using 3s > > as min_apply_delay similar to what we are doing in > > src/test/recovery/t/005_replay_delay. Also, I think we should use > > exactly the same way to verify the test even though we want to keep > > the log level as DEBUG2 to check logs in case of any failures. > > > > IIUC the reasons are due to conflicting requirements. e.g. > - A longer delay like 3s might work better for testing this feature, but OTOH > - A longer delay will also cause the whole BF execution to take longer > Sure, but we already have the same test for a similar feature and it seems to be a proven reliable way to test the feature. We do seem to have seen buildfarm failures for tests related to recovery_min_apply_delay and the current way is quite stable, so I would prefer to go with that. -- With Regards, Amit Kapila.
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Friday, February 3, 2023 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > > <osumi.takamichi@fujitsu.com> wrote: > > > > > ... > > > > Besides, I am not sure it's a stable test to check the log. Is it > > > > possible that there's no such log on a slow machine? I modified > > > > the code to sleep 1s at the beginning of apply_dispatch(), then > > > > the new added test failed because the server log cannot match. > > > To get the log by itself is necessary to ensure that the delay is > > > conducted by the apply worker, because we emit the diffms only if > > > it's bigger than 0 in maybe_apply_delay(). If we omit the step, we > > > are not sure the delay is caused by other reasons or the time-delayed > feature. > > > > > > As you mentioned, it's possible that no log is emitted on slow > > > machine. Then, the idea to make the test safer for such machines should > be to make the delayed time longer. > > > But we shortened the delay time to 1 second to mitigate the long test > execution time of this TAP test. > > > So, I'm not sure if it's a good idea to make it longer again. > > > > I think there are a couple of things that can be done about this problem: > > > > 1. If you need the code/test to remain as-is then at least the test > > message could include some comforting text like "(this can fail on > > slow machines when the delay time is already exceeded)" so then a test > > failure will not cause undue alarm. > > > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that > > it will *always* log the remaining wait time even if that wait time > > becomes negative. Then I think the test cases can be made > > deterministic instead of relying on good luck. This seems like the > > better option. > > > > I don't understand why we have to do any of this instead of using 3s as > min_apply_delay similar to what we are doing in > src/test/recovery/t/005_replay_delay. Also, I think we should use exactly the > same way to verify the test even though we want to keep the log level as > DEBUG2 to check logs in case of any failures. OK, will try to make our tests similar to the tests in 005_replay_delay as much as possible. > Also, I don't see the need to add more tests like the ones below: > +# Test whether ALTER SUBSCRIPTION changes the delayed time of the apply > +worker # (1 day 5 minutes). Note that the extra 5 minute is to account > +for any # decoding/network overhead. > > Let's try to add tests similar to what we have for recovery_min_apply_delay > unless there is some functionality in this patch that is not there in the > recovery_min_apply_delay feature. The above command is a preparation part to check a behavior unique to time-delayed logical replication, which is to DISABLE a subscription causes the apply worker not to apply the suspended (delayed) transaction. So, it will be OK to have this test. Best Regards, Takamichi Osumi
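A hypothetical illustration of that extra coverage, following the description above (the subscription name and delay value are placeholders; tab_int and the a = 0 marker row are taken from the TAP test discussed elsewhere in the thread):

    ALTER SUBSCRIPTION mysub SET (min_apply_delay = '1d');
    -- a transaction committed on the publisher is now being delayed on the subscriber
    ALTER SUBSCRIPTION mysub DISABLE;
    -- disabling stops the apply worker, so the delayed transaction is not applied
    SELECT count(a) FROM tab_int WHERE a = 0;   -- expected result: 0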
On Thurs, Feb 2, 2023 16:04 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > Attached the updated patch v26 accordingly. Thanks for your patch. Here is a comment: 1. The checks in function AlterSubscription + /* + * The combination of parallel streaming mode and + * min_apply_delay is not allowed. See + * parse_subscription_options for details of the reason. + */ + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay > 0) || + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)) and + /* + * The combination of parallel streaming mode and + * min_apply_delay is not allowed. + */ + if (opts.min_apply_delay > 0) + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == LOGICALREP_STREAM_PARALLEL)|| + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)) I think the case where the options "min_apply_delay>0" and "streaming=parallel" are set at the same time seems to have been checked in the function parse_subscription_options, how about simplifying these two if-statements here to the following: ``` if (opts.streaming == LOGICALREP_STREAM_PARALLEL && !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0) and if (opts.min_apply_delay > 0 && !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL) ``` Regards, Wang Wei
On Fri, Feb 3, 2023 at 3:12 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > Here is a comment: > > 1. The checks in function AlterSubscription > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. See > + * parse_subscription_options for details of the reason. > + */ > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) > + if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && opts.min_apply_delay> 0) || > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay> 0)) > and > + /* > + * The combination of parallel streaming mode and > + * min_apply_delay is not allowed. > + */ > + if (opts.min_apply_delay > 0) > + if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming ==LOGICALREP_STREAM_PARALLEL) || > + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream ==LOGICALREP_STREAM_PARALLEL)) > > I think the case where the options "min_apply_delay>0" and "streaming=parallel" > are set at the same time seems to have been checked in the function > parse_subscription_options, how about simplifying these two if-statements here > to the following: > ``` > if (opts.streaming == LOGICALREP_STREAM_PARALLEL && > !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > sub->minapplydelay > 0) > > and > > if (opts.min_apply_delay > 0 && > !IsSet(opts.specified_opts, SUBOPT_STREAMING) && > sub->stream == LOGICALREP_STREAM_PARALLEL) > ``` > Won't just checking if ((opts.streaming == LOGICALREP_STREAM_PARALLEL && sub->minapplydelay > 0) || (opts.min_apply_delay > 0 && sub->stream == LOGICALREP_STREAM_PARALLEL)) be sufficient in that case? -- With Regards, Amit Kapila.
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Friday, February 3, 2023 3:35 PM I wrote: > On Friday, February 3, 2023 2:21 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > > > <osumi.takamichi@fujitsu.com> wrote: > > > ... > > > > > Besides, I am not sure it's a stable test to check the log. Is > > > > > it possible that there's no such log on a slow machine? I > > > > > modified the code to sleep 1s at the beginning of > > > > > apply_dispatch(), then the new added test failed because the server > log cannot match. > > > > To get the log by itself is necessary to ensure that the delay is > > > > conducted by the apply worker, because we emit the diffms only if > > > > it's bigger than 0 in maybe_apply_delay(). If we omit the step, we > > > > are not sure the delay is caused by other reasons or the > > > > time-delayed > > feature. > > > > > > > > As you mentioned, it's possible that no log is emitted on slow > > > > machine. Then, the idea to make the test safer for such machines > > > > should > > be to make the delayed time longer. > > > > But we shortened the delay time to 1 second to mitigate the long > > > > test > > execution time of this TAP test. > > > > So, I'm not sure if it's a good idea to make it longer again. > > > > > > I think there are a couple of things that can be done about this problem: > > > > > > 1. If you need the code/test to remain as-is then at least the test > > > message could include some comforting text like "(this can fail on > > > slow machines when the delay time is already exceeded)" so then a > > > test failure will not cause undue alarm. > > > > > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so > > > that it will *always* log the remaining wait time even if that wait > > > time becomes negative. Then I think the test cases can be made > > > deterministic instead of relying on good luck. This seems like the > > > better option. > > > > > > > I don't understand why we have to do any of this instead of using 3s > > as min_apply_delay similar to what we are doing in > > src/test/recovery/t/005_replay_delay. Also, I think we should use > > exactly the same way to verify the test even though we want to keep > > the log level as > > DEBUG2 to check logs in case of any failures. > OK, will try to make our tests similar to the tests in 005_replay_delay as much > as possible. I've updated the TAP test and made it aligned with 005_reply_delay.pl. For coverage, I have the stream of in-progress transaction test case and ALTER SUBSCRIPTION DISABLE behavior, which is unique to logical replication. Also, conducted pgindent and pgperltidy. Note that the latter half of the 005_reply_delay.pl doesn't seem to match with the test for time-delayed logical replication (e.g. promotion). So, I don't have those points. Kindly have a look at the attached v27. Best Regards, Takamichi Osumi
Attachment
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On wangw.fnst@fujitsu.com Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Feb 3, 2023 at 3:12 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > Here is a comment: > > > > 1. The checks in function AlterSubscription > > + /* > > + * The combination of parallel > streaming mode and > > + * min_apply_delay is not > allowed. See > > + * parse_subscription_options > for details of the reason. > > + */ > > + if (opts.streaming == > LOGICALREP_STREAM_PARALLEL) > > + if > ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > opts.min_apply_delay > 0) || > > + > > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > + sub->minapplydelay > 0)) > > and > > + /* > > + * The combination of parallel > streaming mode and > > + * min_apply_delay is not > allowed. > > + */ > > + if (opts.min_apply_delay > 0) > > + if > ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming == > LOGICALREP_STREAM_PARALLEL) || > > + > > + (!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > > + LOGICALREP_STREAM_PARALLEL)) > > > > I think the case where the options "min_apply_delay>0" and > "streaming=parallel" > > are set at the same time seems to have been checked in the function > > parse_subscription_options, how about simplifying these two > > if-statements here to the following: > > ``` > > if (opts.streaming == LOGICALREP_STREAM_PARALLEL && > > !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > sub->minapplydelay > 0) > > > > and > > > > if (opts.min_apply_delay > 0 && > > !IsSet(opts.specified_opts, SUBOPT_STREAMING) && > > sub->stream == LOGICALREP_STREAM_PARALLEL) ``` > > > > Won't just checking if ((opts.streaming == > LOGICALREP_STREAM_PARALLEL && sub->minapplydelay > 0) || > (opts.min_apply_delay > 0 && sub->stream == > LOGICALREP_STREAM_PARALLEL)) be sufficient in that case? We need checks for !IsSet(). If we don't have those, we error out when executing the alter subscription with min_apply_delay = 0 and streaming = parallel, at the same time for a subscription whose min_apply_delay setting is bigger than 0, for instance. In this case, we pass (don't error out) parse_subscription_options()'s test for the combination of mutual exclusive options and then, error out the condition by matching the first condition opts.streaming == parallel and sub->minapplydelay > 0 above. Also, the Wang-san's refactoring proposal makes sense. Adopted. Regarding the style how to write min_apply_delay > 0 (or just putting min_apply_delay in 'if' conditions) for checking parameters, I agreed with Amit-san so I keep them as it is in the latest patch v27. Kindly have a look at v27 posted in [1] [1] - https://www.postgresql.org/message-id/TYCPR01MB83738F2BEF83DE525410E3ACEDD49%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
On Sat, Feb 4, 2023 at 5:04 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > ... > > Kindly have a look at the attached v27. > Here are some review comments for patch v27-0001. ====== src/test/subscription/t/032_apply_delay.pl 1. +# Confirm the time-delayed replication has been effective from the server log +# message where the apply worker emits for applying delay. Moreover, verify +# that the current worker's remaining wait time is sufficiently bigger than the +# expected value, in order to check any update of the min_apply_delay. +sub check_apply_delay_log ~ "has been effective from the server log" --> "worked, by inspecting the server log" ~~~ 2. +my $delay = 3; Might be better to name this variable as 'min_apply_delay'. ~~~ 3. +# Now wait for replay to complete on publisher. We're done waiting when the +# subscriber has applyed up to the publisher LSN. +$node_publisher->wait_for_catchup($appname); 3a. Something seemed wrong with the comment. Was it meant to say more like? "The publisher waits for the replication to complete". Typo: "applyed" ~ 3b. Instead of doing this wait_for_catchup stuff why don't you just use a synchronous pub/sub and then the publication will just block internally like you require but without you having to block using test code? ~~~ 4. +# Run a query to make sure that the reload has taken effect. +$node_publisher->safe_psql('postgres', q{SELECT 1}); SUGGESTION (for the comment) # Running a dummy query causes the config to be reloaded. ~~~ 5. +# Confirm the record is not applied expectedly +my $result = $node_subscriber->safe_psql('postgres', + "SELECT count(a) FROM tab_int WHERE a = 0;"); +is($result, qq(0), "check the delayed transaction was not applied"); "expectedly" ?? SUGGESTION (for comment) # Confirm the record was not applied ------ Kind Regards, Peter Smith. Fujitsu Australia
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Monday, February 6, 2023 12:03 PM Peter Smith <smithpb2250@gmail.com> wrote: > On Sat, Feb 4, 2023 at 5:04 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > ... > > > > Kindly have a look at the attached v27. > > > > Here are some review comments for patch v27-0001. Thanks for checking ! > ====== > src/test/subscription/t/032_apply_delay.pl > > 1. > +# Confirm the time-delayed replication has been effective from the > +server log # message where the apply worker emits for applying delay. > +Moreover, verify # that the current worker's remaining wait time is > +sufficiently bigger than the # expected value, in order to check any update of > the min_apply_delay. > +sub check_apply_delay_log > > ~ > > "has been effective from the server log" --> "worked, by inspecting the server > log" Sounds good to me. Also, this is an unique part for time-delayed logical replication. So, we can update those as we want. Fixed. > ~~~ > > 2. > +my $delay = 3; > > Might be better to name this variable as 'min_apply_delay'. I named this variable by following the test of recovery_min_apply_delay (src/test/recovery/005_replay_delay.pl). So, this is aligned with the test and I'd like to keep it as it is. > ~~~ > > 3. > +# Now wait for replay to complete on publisher. We're done waiting when > +the # subscriber has applyed up to the publisher LSN. > +$node_publisher->wait_for_catchup($appname); > > 3a. > Something seemed wrong with the comment. > > Was it meant to say more like? "The publisher waits for the replication to > complete". > > Typo: "applyed" Your wording looks better than mine. Fixed. > ~ > > 3b. > Instead of doing this wait_for_catchup stuff why don't you just use a > synchronous pub/sub and then the publication will just block internally like > you require but without you having to block using test code? This is the style of 005_reply_delay.pl. Then, this is also aligned with it. So, I'd like to keep the current way of times comparison as it is. Even if we could omit wait_for_catchup(), there will be new codes for synchronous replication and that would make the min_apply_delay tests more different from the corresponding one. Note that if we use the synchronous mode, we need to turn it off for the last ALTER SUBSCRIPTION DISABLE test case whose min_apply_delay to 1 day 5 min and execute one record insert after that. This will make the tests confusing. > ~~~ > > 4. > +# Run a query to make sure that the reload has taken effect. > +$node_publisher->safe_psql('postgres', q{SELECT 1}); > > SUGGESTION (for the comment) > # Running a dummy query causes the config to be reloaded. Fixed. > ~~~ > > 5. > +# Confirm the record is not applied expectedly my $result = > +$node_subscriber->safe_psql('postgres', > + "SELECT count(a) FROM tab_int WHERE a = 0;"); is($result, qq(0), > +"check the delayed transaction was not applied"); > > "expectedly" ?? > > SUGGESTION (for comment) > # Confirm the record was not applied Fixed. Best Regards, Takamichi Osumi
Attachment
On Mon, Feb 6, 2023 at 12:36 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > I have made a couple of changes in the attached: (a) changed a few error and LOG messages; (b) added/changed comments. See if these look good to you, then please include them in the next version. -- With Regards, Amit Kapila.
Attachment
On Tue, Jan 24, 2023 at 5:02 AM Euler Taveira <euler@eulerto.com> wrote: > > > - elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X in-delayed: %d", > + elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X, apply delay: %s", > force, > LSN_FORMAT_ARGS(recvpos), > LSN_FORMAT_ARGS(writepos), > LSN_FORMAT_ARGS(flushpos), > - in_delayed_apply); > + in_delayed_apply? "yes" : "no"); > > It is better to use a string to represent the yes/no option. > I think it is better to be consistent with the existing force parameter which is also boolean, otherwise, it will look odd. -- With Regards, Amit Kapila.
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
On Monday, February 6, 2023 8:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 6, 2023 at 12:36 PM Takamichi Osumi (Fujitsu) > <osumi.takamichi@fujitsu.com> wrote: > > > > I have made a couple of changes in the attached: (a) changed a few error and > LOG messages; (a) added/changed comments. See, if these look good to you > then please include them in the next version. Hi, thanks for sharing the patch ! The proposed changes make comments easier to understand and more aligned with other existing comments. So, LGTM. The attached patch v29 has included your changes. Best Regards, Takamichi Osumi
Attachment
RE: Time delayed LR (WAS Re: logical replication restrictions)
From: "Takamichi Osumi (Fujitsu)"
Hi, On Monday, February 6, 2023 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 24, 2023 at 5:02 AM Euler Taveira <euler@eulerto.com> wrote: > > > > > > - elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, > write %X/%X, flush %X/%X in-delayed: %d", > > + elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write > > + %X/%X, flush %X/%X, apply delay: %s", > > force, > > LSN_FORMAT_ARGS(recvpos), > > LSN_FORMAT_ARGS(writepos), > > LSN_FORMAT_ARGS(flushpos), > > - in_delayed_apply); > > + in_delayed_apply? "yes" : "no"); > > > > It is better to use a string to represent the yes/no option. > > > > I think it is better to be consistent with the existing force parameter which is > also boolean, otherwise, it will look odd. Agreed. The latest patch v29 posted in [1] followed this suggestion. Kindly have a look at it. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373A59E7B74AA4F96B62BEAEDDA9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Here are my review comments for v29-0001. ====== Commit Message 1. Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com tmp ~ What's that "tmp" doing there? A typo? ====== doc/src/sgml/catalogs.sgml 2. + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>subminapplydelay</structfield> <type>int4</type> + </para> + <para> + The minimum delay (ms) for applying changes. + </para></entry> + </row> For consistency remove the period (.) because the other single-sentence descriptions on this page do not have one. ====== src/backend/commands/subscriptioncmds.c 3. AlterSubscription + errmsg("cannot set parallel streaming mode for subscription with %s", + "min_apply_delay")); Since there are no translator considerations here why not write it like this: errmsg("cannot set parallel streaming mode for subscription with min_apply_delay") ~~~ 4. AlterSubscription + errmsg("cannot set %s for subscription in parallel streaming mode", + "min_apply_delay")); Since there are no translator considerations here why not write it like this: errmsg("cannot set min_apply_delay for subscription in parallel streaming mode") ~~~ 5. +defGetMinApplyDelay(DefElem *def) +{ + char *input_string; + int result; + const char *hintmsg; + + input_string = defGetString(def); + + /* + * Parse given string as parameter which has millisecond unit + */ + if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid value for parameter \"%s\": \"%s\"", + "min_apply_delay", input_string), + hintmsg ? errhint("%s", _(hintmsg)) : 0)); + + /* + * Check both the lower boundary for the valid min_apply_delay range and + * the upper boundary as the safeguard for some platforms where INT_MAX is + * wider than int32 respectively. Although parse_int() has confirmed that + * the result is less than or equal to INT_MAX, the value will be stored + * in a catalog column of int32. + */ + if (result < 0 || result > PG_INT32_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)", + result, + "min_apply_delay", + 0, PG_INT32_MAX))); + + return result; +} 5a. Since there are no translator considerations here why not write the first error like: errmsg("invalid value for parameter \"min_apply_delay\": \"%s\"", input_string) ~ 5b. Since there are no translator considerations here why not write the second error like: errmsg("%d ms is outside the valid range for parameter \"min_apply_delay\" (%d .. %d)", result, 0, PG_INT32_MAX)) ------ Kind Regards, Peter Smith. Fujitsu Australia
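For what it is worth, the catalog change quoted above means the configured delay would be visible with a plain catalog query once the patch is applied (subminapplydelay is a column added by the patch, stored in milliseconds):

    -- hypothetical check; the column only exists with the patch applied
    SELECT subname, subminapplydelay FROM pg_subscription;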
Dear Peter, Thank you for reviewing! PSA new version. > ====== > Commit Message > > 1. > Discussion: > https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4r > zr5pQ@mail.gmail.com > > tmp > > ~ > > What's that "tmp" doing there? A typo? Removed. It was a typo. I used `git rebase` command to combining the local commits, but the commit message seemed to be remained. > ====== > doc/src/sgml/catalogs.sgml > > 2. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>subminapplydelay</structfield> <type>int4</type> > + </para> > + <para> > + The minimum delay (ms) for applying changes. > + </para></entry> > + </row> > > For consistency remove the period (.) because the other > single-sentence descriptions on this page do not have one. I have also confirmed and agreed. Fixed. > ====== > src/backend/commands/subscriptioncmds.c > > 3. AlterSubscription > + errmsg("cannot set parallel streaming mode for subscription with %s", > + "min_apply_delay")); > > Since there are no translator considerations here why not write it like this: > > errmsg("cannot set parallel streaming mode for subscription with > min_apply_delay") Fixed. > ~~~ > > 4. AlterSubscription > + errmsg("cannot set %s for subscription in parallel streaming mode", > + "min_apply_delay")); > > Since there are no translator considerations here why not write it like this: > > errmsg("cannot set min_apply_delay for subscription in parallel streaming mode") Fixed. > ~~~ > > 5. > +defGetMinApplyDelay(DefElem *def) > +{ > + char *input_string; > + int result; > + const char *hintmsg; > + > + input_string = defGetString(def); > + > + /* > + * Parse given string as parameter which has millisecond unit > + */ > + if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid value for parameter \"%s\": \"%s\"", > + "min_apply_delay", input_string), > + hintmsg ? errhint("%s", _(hintmsg)) : 0)); > + > + /* > + * Check both the lower boundary for the valid min_apply_delay range and > + * the upper boundary as the safeguard for some platforms where INT_MAX is > + * wider than int32 respectively. Although parse_int() has confirmed that > + * the result is less than or equal to INT_MAX, the value will be stored > + * in a catalog column of int32. > + */ > + if (result < 0 || result > PG_INT32_MAX) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)", > + result, > + "min_apply_delay", > + 0, PG_INT32_MAX))); > + > + return result; > +} > > 5a. > Since there are no translator considerations here why not write the > first error like: > > errmsg("invalid value for parameter \"min_apply_delay\": \"%s\"", > input_string) > > ~ > > 5b. > Since there are no translator considerations here why not write the > second error like: > > errmsg("%d ms is outside the valid range for parameter > \"min_apply_delay\" (%d .. %d)", > result, 0, PG_INT32_MAX)) Both of you said were fixed. 
Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Tue, Feb 7, 2023 at 6:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > 5. > +defGetMinApplyDelay(DefElem *def) > +{ > + char *input_string; > + int result; > + const char *hintmsg; > + > + input_string = defGetString(def); > + > + /* > + * Parse given string as parameter which has millisecond unit > + */ > + if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid value for parameter \"%s\": \"%s\"", > + "min_apply_delay", input_string), > + hintmsg ? errhint("%s", _(hintmsg)) : 0)); > + > + /* > + * Check both the lower boundary for the valid min_apply_delay range and > + * the upper boundary as the safeguard for some platforms where INT_MAX is > + * wider than int32 respectively. Although parse_int() has confirmed that > + * the result is less than or equal to INT_MAX, the value will be stored > + * in a catalog column of int32. > + */ > + if (result < 0 || result > PG_INT32_MAX) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)", > + result, > + "min_apply_delay", > + 0, PG_INT32_MAX))); > + > + return result; > +} > > 5a. > Since there are no translator considerations here why not write the > first error like: > > errmsg("invalid value for parameter \"min_apply_delay\": \"%s\"", > input_string) > > ~ > > 5b. > Since there are no translator considerations here why not write the > second error like: > > errmsg("%d ms is outside the valid range for parameter > \"min_apply_delay\" (%d .. %d)", > result, 0, PG_INT32_MAX)) > I see that existing usage in the code matches what the patch had before this comment. See below and similar usages in the code. if (start <= 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid value for parameter \"%s\": %d", "start", start))); -- With Regards, Amit Kapila.
At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > 5b. > > Since there are no translator considerations here why not write the > > second error like: > > > > errmsg("%d ms is outside the valid range for parameter > > \"min_apply_delay\" (%d .. %d)", > > result, 0, PG_INT32_MAX)) > > > > I see that existing usage in the code matches what the patch had > before this comment. See below and similar usages in the code. > if (start <= 0) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("invalid value for parameter \"%s\": %d", > "start", start))); The same errmsg text occurs many times in the tree. On the other hand, the message pointed out here appears only once. I suppose Peter considered this aspect. # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)" # also appears just once As for me, it seems a good practice to do that regardless of the number of duplicates, to (semi)mechanically avoid duplicates. (But I believe I would do as Peter suggests myself for the first cut, though :p) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Thanks! At Mon, 6 Feb 2023 13:10:01 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > The attached patch v29 has included your changes. catalogs.sgml + <para> + The minimum delay (ms) for applying changes. + </para></entry> I think we don't use unit symbols that way. Namely, I think we would write it as "The minimum delay for applying changes in milliseconds". alter_subscription.sgml are <literal>slot_name</literal>, <literal>synchronous_commit</literal>, <literal>binary</literal>, <literal>streaming</literal>, - <literal>disable_on_error</literal>, and - <literal>origin</literal>. + <literal>disable_on_error</literal>, + <literal>origin</literal>, and + <literal>min_apply_delay</literal>. </para> By the way, is there any rule for the order among the words? They don't seem to be in alphabetical order, nor in the same order as on the create_subscription page. (It seems like the order of the SUBOPT_* symbols, but I'm not sure that's a good idea..) subscriptioncmds.c + if (opts.streaming == LOGICALREP_STREAM_PARALLEL && + !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0) .. + if (opts.min_apply_delay > 0 && + !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL) Don't we wrap the lines? worker.c + if (wal_receiver_status_interval > 0 && + diffms > wal_receiver_status_interval * 1000L) + { + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + wal_receiver_status_interval * 1000L, + WAIT_EVENT_RECOVERY_APPLY_DELAY); + send_feedback(last_received, true, false, true); + } + else + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + diffms, + WAIT_EVENT_RECOVERY_APPLY_DELAY); send_feedback always handles the case where wal_receiver_status_interval == 0. Thus we can simply wait for min(wal_receiver_status_interval, diffms) and then call send_feedback() unconditionally. -start_apply(XLogRecPtr origin_startpos) +start_apply(void) -LogicalRepApplyLoop(XLogRecPtr last_received) +LogicalRepApplyLoop(void) Does this patch require this change? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
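For illustration, the simplification suggested in the last worker.c comment above might look roughly like the following sketch. This is not the posted patch: diffms, last_received, and the four-argument send_feedback() call are taken from the hunk quoted above, and whether send_feedback() throttles itself here depends on its force flag, which is exactly what the follow-up replies discuss.

/*
 * Sketch of the suggested shape: wait for the smaller of the status-report
 * interval and the remaining apply delay, then let send_feedback() decide
 * whether a status message is actually due.
 */
long    interval_ms = wal_receiver_status_interval * 1000L;
long    wait_ms = (interval_ms > 0 && interval_ms < diffms) ? interval_ms : diffms;

WaitLatch(MyLatch,
          WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
          wait_ms,
          WAIT_EVENT_RECOVERY_APPLY_DELAY);

/* Same call as in the quoted hunk; see the follow-up about its force flag. */
send_feedback(last_received, true, false, true);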
On Tue, Feb 7, 2023 at 10:07 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > 5b. > > > Since there are no translator considerations here why not write the > > > second error like: > > > > > > errmsg("%d ms is outside the valid range for parameter > > > \"min_apply_delay\" (%d .. %d)", > > > result, 0, PG_INT32_MAX)) > > > > > > > I see that existing usage in the code matches what the patch had > > before this comment. See below and similar usages in the code. > > if (start <= 0) > > ereport(ERROR, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("invalid value for parameter \"%s\": %d", > > "start", start))); > > The same errmsg text occurs mamy times in the tree. On the other hand > the pointed message is the only one. I suppose Peter considered this > aspect. > > # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)" > # also appears just once > > As for me, it seems to me a good practice to do that regadless of the > number of duplicates to (semi)mechanically avoid duplicates. > > (But I believe I would do as Peter suggests by myself for the first > cut, though:p) > Personally, I would prefer consistency. I think we can later start a new thread to change the existing message and if there is a consensus and value in the same then we could use the same style here as well. -- With Regards, Amit Kapila.
On Tue, Feb 7, 2023 at 4:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 7, 2023 at 10:07 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > 5b. > > > > Since there are no translator considerations here why not write the > > > > second error like: > > > > > > > > errmsg("%d ms is outside the valid range for parameter > > > > \"min_apply_delay\" (%d .. %d)", > > > > result, 0, PG_INT32_MAX)) > > > > > > > > > > I see that existing usage in the code matches what the patch had > > > before this comment. See below and similar usages in the code. > > > if (start <= 0) > > > ereport(ERROR, > > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > errmsg("invalid value for parameter \"%s\": %d", > > > "start", start))); > > > > The same errmsg text occurs mamy times in the tree. On the other hand > > the pointed message is the only one. I suppose Peter considered this > > aspect. > > > > # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)" > > # also appears just once > > > > As for me, it seems to me a good practice to do that regadless of the > > number of duplicates to (semi)mechanically avoid duplicates. > > > > (But I believe I would do as Peter suggests by myself for the first > > cut, though:p) > > > > Personally, I would prefer consistency. I think we can later start a > new thread to change the existing message and if there is a consensus > and value in the same then we could use the same style here as well. > Of course, if there is a convention then we should stick to it. My understanding was that (string literal) message parameters are specified separately from the message format string primarily as an aid to translators. That makes good sense for parameters with names that are also English words (like "start" etc), but for non-word parameters like "min_apply_delay" there is no such ambiguity in the first place. Anyway, I am fine with it being written either way. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Feb 7, 2023 at 10:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 6 Feb 2023 13:10:01 +0000, "Takamichi Osumi (Fujitsu)" <osumi.takamichi@fujitsu.com> wrote in > > The attached patch v29 has included your changes. > > catalogs.sgml > > + <para> > + The minimum delay (ms) for applying changes. > + </para></entry> > > I think we don't use unit symbols that way. Namely I think we would > write it as "The minimum delay for applying changes in milliseconds" > Okay, if we prefer to use milliseconds, then how about: "The minimum delay, in milliseconds, for applying changes"? > > alter_subscription.sgml > > are <literal>slot_name</literal>, > <literal>synchronous_commit</literal>, > <literal>binary</literal>, <literal>streaming</literal>, > - <literal>disable_on_error</literal>, and > - <literal>origin</literal>. > + <literal>disable_on_error</literal>, > + <literal>origin</literal>, and > + <literal>min_apply_delay</literal>. > </para> > > By the way, is there any rule for the order among the words? > Currently, it is in the order in which the corresponding features are added. > They > don't seem in alphabetical order nor in the same order to the > create_sbuscription page. > In create_subscription page also, it appears to be in the order in which those are added with a difference that they are divided into two categories (parameters that control what happens during subscription creation and parameters that control the subscription's replication behavior after it has been created) > (I seems like in the order of SUBOPT_* > symbols, but I'm not sure it's a good idea..) > > > subscriptioncmds.c > > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL && > + !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay> 0) > .. > + if (opts.min_apply_delay > 0 && > + !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL) > > Don't we wrap the lines? > > > worker.c > > + if (wal_receiver_status_interval > 0 && > + diffms > wal_receiver_status_interval * 1000L) > + { > + WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + wal_receiver_status_interval * 1000L, > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > + send_feedback(last_received, true, false, true); > + } > + else > + WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + diffms, > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > > send_feedback always handles the case where > wal_receiver_status_interval == 0. > It only handles when force is false but here we are using that as true. So, not sure, if what you said would be an improvement. > thus we can simply wait for > min(wal_receiver_status_interval, diffms) then call send_feedback() > unconditionally. > > > -start_apply(XLogRecPtr origin_startpos) > +start_apply(void) > > -LogicalRepApplyLoop(XLogRecPtr last_received) > +LogicalRepApplyLoop(void) > > Does this patch requires this change? > I think this is because the scope of last_received has been changed so that it can be used to pass in send_feedback() during the delay. -- With Regards, Amit Kapila.
On Tue, Feb 7, 2023 at 10:42 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Feb 7, 2023 at 4:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Feb 7, 2023 at 10:07 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > 5b. > > > > > Since there are no translator considerations here why not write the > > > > > second error like: > > > > > > > > > > errmsg("%d ms is outside the valid range for parameter > > > > > \"min_apply_delay\" (%d .. %d)", > > > > > result, 0, PG_INT32_MAX)) > > > > > > > > > > > > > I see that existing usage in the code matches what the patch had > > > > before this comment. See below and similar usages in the code. > > > > if (start <= 0) > > > > ereport(ERROR, > > > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > > errmsg("invalid value for parameter \"%s\": %d", > > > > "start", start))); > > > > > > The same errmsg text occurs mamy times in the tree. On the other hand > > > the pointed message is the only one. I suppose Peter considered this > > > aspect. > > > > > > # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)" > > > # also appears just once > > > > > > As for me, it seems to me a good practice to do that regadless of the > > > number of duplicates to (semi)mechanically avoid duplicates. > > > > > > (But I believe I would do as Peter suggests by myself for the first > > > cut, though:p) > > > > > > > Personally, I would prefer consistency. I think we can later start a > > new thread to change the existing message and if there is a consensus > > and value in the same then we could use the same style here as well. > > > > Of course, if there is a convention then we should stick to it. > > My understanding was that (string literal) message parameters are > specified separately from the message format string primarily as an > aid to translators. That makes good sense for parameters with names > that are also English words (like "start" etc), but for non-word > parameters like "min_apply_delay" there is no such ambiguity in the > first place. > TBH, I am not an expert in this matter. So, to avoid, making any mistakes I thought of keeping it close to the existing style. -- With Regards, Amit Kapila.
On Tue, Feb 7, 2023 at 8:22 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Thank you for reviewing! PSA new version. > Few comments: ============= 1. @@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ + int32 subminapplydelay; /* Replication apply delay (ms) */ + bool subenabled; /* True if the subscription is enabled (the * worker should be running) */ @@ -120,6 +122,7 @@ typedef struct Subscription * in */ XLogRecPtr skiplsn; /* All changes finished at this LSN are * skipped */ + int32 minapplydelay; /* Replication apply delay (ms) */ char *name; /* Name of the subscription */ Oid owner; /* Oid of the subscription owner */ Why is the new parameter placed at different locations in the above two structures? I think it should be after owner in both cases, and accordingly its order should be changed in GetSubscription() or any other place it is used. 2. A minor comment change suggestion: /* * Common spoolfile processing. * - * The commit/prepare time (finish_ts) for streamed transactions is required - * for time-delayed logical replication. + * The commit/prepare time (finish_ts) is required for time-delayed logical + * replication. */ 3. I find the newly added tests take about 8s on my machine, which is close to the highest in the subscription folder. I understand that it can't be less than 3s because of the delay, but checking multiple cases makes it take that long. I think we can avoid the tests for streaming and for disabling the subscription. Also, after removing those, I think it would be better to add the remaining test to 001_rep_changes to save set-up and tear-down costs as well. 4. +$node_publisher->append_conf('postgresql.conf', + 'logical_decoding_work_mem = 64kB'); I think this setting is also not required. -- With Regards, Amit Kapila.
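To make the placement comment concrete, a sketch of the suggested ordering for the in-memory struct, using only the field names quoted in the review (not asserting the final layout), would be:

/* Sketch of the suggested placement in the in-memory Subscription struct */
typedef struct Subscription
{
    /* ... existing fields ... */
    XLogRecPtr  skiplsn;        /* All changes finished at this LSN are skipped */
    char       *name;           /* Name of the subscription */
    Oid         owner;          /* Oid of the subscription owner */
    int32       minapplydelay;  /* Replication apply delay (ms), placed after owner */
    /* ... */
} Subscription;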
Hi, On Tuesday, February 7, 2023 6:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 7, 2023 at 8:22 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Thank you for reviewing! PSA new version. > > > > Few comments: > ============= Thanks for your comments! > 1. > @@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) > BKI_SHARED_RELATION BKI_ROW > > Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */ > > + int32 subminapplydelay; /* Replication apply delay (ms) */ > + > bool subenabled; /* True if the subscription is enabled (the > * worker should be running) */ > > @@ -120,6 +122,7 @@ typedef struct Subscription > * in */ > XLogRecPtr skiplsn; /* All changes finished at this LSN are > * skipped */ > + int32 minapplydelay; /* Replication apply delay (ms) */ > char *name; /* Name of the subscription */ > Oid owner; /* Oid of the subscription owner */ > > Why the new parameter is placed at different locations in above two > strcutures? I think it should be after owner in both cases and accordingly its > order should be changed in GetSubscription() or any other place it is used. Fixed. > > 2. A minor comment change suggestion: > /* > * Common spoolfile processing. > * > - * The commit/prepare time (finish_ts) for streamed transactions is required > - * for time-delayed logical replication. > + * The commit/prepare time (finish_ts) is required for time-delayed > + logical > + * replication. > */ Fixed. > 3. I find the newly added tests take about 8s on my machine which is close > highest in the subscription folder. I understand that it can't be less than 3s > because of the delay but checking multiple cases makes it take that long. I > think we can avoid the tests for streaming and disable the subscription. Also, > after removing those, I think it would be better to add the remaining test in > 001_rep_changes to save set-up and tear-down costs as well. Sounds good to me. Moved the test to 001_rep_changes.pl. > 4. > +$node_publisher->append_conf('postgresql.conf', > + 'logical_decoding_work_mem = 64kB'); > > I think this setting is also not required. Yes. And it was removed in the process of moving the test. Attached is the v31 patch. Note that regarding the translator style, I chose to keep the parameter names outside the errmsg format string at this stage; if there is a need to change that, I'll follow it. Other changes are minor alignment fixes that fold 'if' conditions exceeding 80 characters so they look nicer. I also ran pgindent and pgperltidy. Best Regards, Takamichi Osumi
Hi, On Tuesday, February 7, 2023 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 7, 2023 at 10:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> > wrote: > > > > At Mon, 6 Feb 2023 13:10:01 +0000, "Takamichi Osumi (Fujitsu)" > > <osumi.takamichi@fujitsu.com> wrote in > > > The attached patch v29 has included your changes. > > > > catalogs.sgml > > > > + <para> > > + The minimum delay (ms) for applying changes. > > + </para></entry> > > > > I think we don't use unit symbols that way. Namely I think we would > > write it as "The minimum delay for applying changes in milliseconds" > > > > Okay, if we prefer to use milliseconds, then how about: "The minimum delay, in > milliseconds, for applying changes"? This looks good to me. Adopted. > > > > > alter_subscription.sgml > > > > are <literal>slot_name</literal>, > > <literal>synchronous_commit</literal>, > > <literal>binary</literal>, <literal>streaming</literal>, > > - <literal>disable_on_error</literal>, and > > - <literal>origin</literal>. > > + <literal>disable_on_error</literal>, > > + <literal>origin</literal>, and > > + <literal>min_apply_delay</literal>. > > </para> > > > > By the way, is there any rule for the order among the words? > > > > Currently, it is in the order in which the corresponding features are added. Yes. So, I keep it as it is. > > > They > > don't seem in alphabetical order nor in the same order to the > > create_sbuscription page. > > > > In create_subscription page also, it appears to be in the order in which those > are added with a difference that they are divided into two categories > (parameters that control what happens during subscription creation and > parameters that control the subscription's replication behavior after it has been > created) Same as here. The current order should be fine. > > > (I seems like in the order of SUBOPT_* symbols, but I'm not sure it's > > a good idea..) > > > > > > subscriptioncmds.c > > > > + if (opts.streaming == > LOGICALREP_STREAM_PARALLEL && > > + > > + !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && > > + sub->minapplydelay > 0) > > .. > > + if (opts.min_apply_delay > 0 && > > + > > + !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == > > + LOGICALREP_STREAM_PARALLEL) > > > > Don't we wrap the lines? > > > > > > worker.c > > > > + if (wal_receiver_status_interval > 0 && > > + diffms > wal_receiver_status_interval * 1000L) > > + { > > + WaitLatch(MyLatch, > > + WL_LATCH_SET | > WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > + wal_receiver_status_interval * > 1000L, > > + > WAIT_EVENT_RECOVERY_APPLY_DELAY); > > + send_feedback(last_received, true, false, true); > > + } > > + else > > + WaitLatch(MyLatch, > > + WL_LATCH_SET | > WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > + diffms, > > + > > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > > > > send_feedback always handles the case where > > wal_receiver_status_interval == 0. > > > > It only handles when force is false but here we are using that as true. So, not > sure, if what you said would be an improvement. Agreed. So, I keep it as it is. > > > thus we can simply wait for > > min(wal_receiver_status_interval, diffms) then call send_feedback() > > unconditionally. > > > > > > -start_apply(XLogRecPtr origin_startpos) > > +start_apply(void) > > > > -LogicalRepApplyLoop(XLogRecPtr last_received) > > +LogicalRepApplyLoop(void) > > > > Does this patch requires this change? 
> > > > I think this is because the scope of last_received has been changed so that it > can be used to pass in send_feedback() during the delay. Yes, that's our intention. Kindly have a look at the latest patch v31 shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Hi, Horiguchi-san Thanks for your review! On Tuesday, February 7, 2023 1:43 PM, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Mon, 6 Feb 2023 13:10:01 +0000, "Takamichi Osumi (Fujitsu)" > <osumi.takamichi@fujitsu.com> wrote in > subscriptioncmds.c > > + if (opts.streaming == > LOGICALREP_STREAM_PARALLEL && > + !IsSet(opts.specified_opts, > SUBOPT_MIN_APPLY_DELAY) && > +sub->minapplydelay > 0) > .. > + if (opts.min_apply_delay > 0 && > + !IsSet(opts.specified_opts, > SUBOPT_STREAMING) && sub->stream == > +LOGICALREP_STREAM_PARALLEL) > > Don't we wrap the lines? Yes, those lines should have looked nicer. Updated. Kindly have a look at the latest patch v31 in [1]. There are also some other changes in the patch. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Here are my review comments for v31-0001 ====== doc/src/sgml/glossary.sgml 1. + <para> + Replication setup that applies time-delayed copy of the data. + </para> That sentence seemed a bit strange to me. SUGGESTION Replication setup that delays the application of changes by a specified minimum time-delay period. ====== src/backend/replication/logical/worker.c 2. maybe_apply_delay + if (wal_receiver_status_interval > 0 && + diffms > wal_receiver_status_interval * 1000L) + { + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + wal_receiver_status_interval * 1000L, + WAIT_EVENT_RECOVERY_APPLY_DELAY); + send_feedback(last_received, true, false, true); + } I felt that introducing another variable like: long statusinterval_ms = wal_receiver_status_interval * 1000L; would help here by doing 2 things: 1) The condition would be easier to read because the ms units would be the same 2) Won't need * 1000L repeated in two places. Only, do take care to assign this variable in the right place in this loop in case the configuration is changed. ====== src/test/subscription/t/001_rep_changes.pl 3. +# Test time-delayed logical replication +# +# If the subscription sets min_apply_delay parameter, the logical replication +# worker will delay the transaction apply for min_apply_delay milliseconds. We +# look the time duration between tuples are inserted on publisher and then +# changes are replicated on subscriber. This comment and the other one appearing later in this test are both explaining the same test strategy. I think both comments should be combined into one big one up-front, like this: SUGGESTION If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction apply for min_apply_delay milliseconds. We verify this by looking at the time difference between a) when tuples are inserted on the publisher, and b) when those changes are replicated on the subscriber. Even on slow machines, this strategy will give predictable behavior. ~~ 4. +my $delay = 3; + +# Set min_apply_delay parameter to 3 seconds +$node_subscriber->safe_psql('postgres', + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')"); IMO that "my $delay = 3;" assignment should be *after* the comment: e.g. + +# Set min_apply_delay parameter to 3 seconds +my $delay = 3; +$node_subscriber->safe_psql('postgres', + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')"); ~~~ 5. +# Make new content on publisher and check its presence in subscriber depending +# on the delay applied above. Before doing the insertion, get the +# current timestamp that will be used as a comparison base. Even on slow +# machines, this allows to have a predictable behavior when comparing the +# delay between data insertion moment on publisher and replay time on subscriber. Most of this comment is now redundant because this was already explained in the big comment up-front (see #3). Only one useful sentence is left. SUGGESTION Before doing the insertion, get the current timestamp that will be used as a comparison base. ------ Kind Regards, Peter Smith. Fujitsu Australia.
Dear Peter, Thank you for reviewing! PSA new version. > ====== > doc/src/sgml/glossary.sgml > > 1. > + <para> > + Replication setup that applies time-delayed copy of the data. > + </para> > > That sentence seemed a bit strange to me. > > SUGGESTION > Replication setup that delays the application of changes by a > specified minimum time-delay period. Fixed. > ====== > > src/backend/replication/logical/worker.c > > 2. maybe_apply_delay > > + if (wal_receiver_status_interval > 0 && > + diffms > wal_receiver_status_interval * 1000L) > + { > + WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + wal_receiver_status_interval * 1000L, > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > + send_feedback(last_received, true, false, true); > + } > > I felt that introducing another variable like: > > long statusinterval_ms = wal_receiver_status_interval * 1000L; > > would help here by doing 2 things: > 1) The condition would be easier to read because the ms units would be the same > 2) Won't need * 1000L repeated in two places. > > Only, do take care to assign this variable in the right place in this > loop in case the configuration is changed. Fixed. Calculations are done on two lines - first one is the entrance of the loop, and second one is the after SIGHUP is detected. > ====== > src/test/subscription/t/001_rep_changes.pl > > 3. > +# Test time-delayed logical replication > +# > +# If the subscription sets min_apply_delay parameter, the logical replication > +# worker will delay the transaction apply for min_apply_delay milliseconds. We > +# look the time duration between tuples are inserted on publisher and then > +# changes are replicated on subscriber. > > This comment and the other one appearing later in this test are both > explaining the same test strategy. I think both comments should be > combined into one big one up-front, like this: > > SUGGESTION > If the subscription sets min_apply_delay parameter, the logical > replication worker will delay the transaction apply for > min_apply_delay milliseconds. We verify this by looking at the time > difference between a) when tuples are inserted on the publisher, and > b) when those changes are replicated on the subscriber. Even on slow > machines, this strategy will give predictable behavior. Changed. > 4. > +my $delay = 3; > + > +# Set min_apply_delay parameter to 3 seconds > +$node_subscriber->safe_psql('postgres', > + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = > '${delay}s')"); > > IMO that "my $delay = 3;" assignment should be *after* the comment: > > e.g. > + > +# Set min_apply_delay parameter to 3 seconds > +my $delay = 3; > +$node_subscriber->safe_psql('postgres', > + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = > '${delay}s')"); Right, changed. > 5. > +# Make new content on publisher and check its presence in subscriber > depending > +# on the delay applied above. Before doing the insertion, get the > +# current timestamp that will be used as a comparison base. Even on slow > +# machines, this allows to have a predictable behavior when comparing the > +# delay between data insertion moment on publisher and replay time on > subscriber. > > Most of this comment is now redundant because this was already > explained in the big comment up-front (see #3). Only one useful > sentence is left. > > SUGGESTION > Before doing the insertion, get the current timestamp that will be > used as a comparison base. Removed. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > ... > > ====== > > > > src/backend/replication/logical/worker.c > > > > 2. maybe_apply_delay > > > > + if (wal_receiver_status_interval > 0 && > > + diffms > wal_receiver_status_interval * 1000L) > > + { > > + WaitLatch(MyLatch, > > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > + wal_receiver_status_interval * 1000L, > > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > > + send_feedback(last_received, true, false, true); > > + } > > > > I felt that introducing another variable like: > > > > long statusinterval_ms = wal_receiver_status_interval * 1000L; > > > > would help here by doing 2 things: > > 1) The condition would be easier to read because the ms units would be the same > > 2) Won't need * 1000L repeated in two places. > > > > Only, do take care to assign this variable in the right place in this > > loop in case the configuration is changed. > > Fixed. Calculations are done on two lines - first one is the entrance of the loop, > and second one is the after SIGHUP is detected. > TBH, I expected you would write this as just a *single* variable assignment before the condition like below: SUGGESTION (tweaked comment and put single assignment before condition) /* * Call send_feedback() to prevent the publisher from exiting by * timeout during the delay, when the status interval is greater than * zero. */ status_interval_ms = wal_receiver_status_interval * 1000L; if (status_interval_ms > 0 && diffms > status_interval_ms) { ... ~ I understand in theory, your code is more efficient, but in practice, I think the overhead of a single variable assignment every loop iteration (which is doing WaitLatch anyway) is of insignificant concern, whereas having one assignment is simpler than having two IMO. But, if you want to keep it the way you have then that is OK. Otherwise, this patch v32 LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia.
At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > Thank you for reviewing! PSA new version. + if (statusinterval_ms > 0 && diffms > statusinterval_ms) The next expected feedback time is measured from the last status report. Thus, it seems to me this may suppress feedback from being sent for an unexpectedly long time, especially when min_apply_delay is shorter than wal_r_s_interval. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Feb 9, 2023 at 12:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > ... > > > ====== > > > > > > src/backend/replication/logical/worker.c > > > > > > 2. maybe_apply_delay > > > > > > + if (wal_receiver_status_interval > 0 && > > > + diffms > wal_receiver_status_interval * 1000L) > > > + { > > > + WaitLatch(MyLatch, > > > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > > + wal_receiver_status_interval * 1000L, > > > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > > > + send_feedback(last_received, true, false, true); > > > + } > > > > > > I felt that introducing another variable like: > > > > > > long statusinterval_ms = wal_receiver_status_interval * 1000L; > > > > > > would help here by doing 2 things: > > > 1) The condition would be easier to read because the ms units would be the same > > > 2) Won't need * 1000L repeated in two places. > > > > > > Only, do take care to assign this variable in the right place in this > > > loop in case the configuration is changed. > > > > Fixed. Calculations are done on two lines - first one is the entrance of the loop, > > and second one is the after SIGHUP is detected. > > > > TBH, I expected you would write this as just a *single* variable > assignment before the condition like below: > > SUGGESTION (tweaked comment and put single assignment before condition) > /* > * Call send_feedback() to prevent the publisher from exiting by > * timeout during the delay, when the status interval is greater than > * zero. > */ > status_interval_ms = wal_receiver_status_interval * 1000L; > if (status_interval_ms > 0 && diffms > status_interval_ms) > { > ... > > ~ > > I understand in theory, your code is more efficient, but in practice, > I think the overhead of a single variable assignment every loop > iteration (which is doing WaitLatch anyway) is of insignificant > concern, whereas having one assignment is simpler than having two IMO. > Yeah, that sounds better to me as well. -- With Regards, Amit Kapila.
On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > Thank you for reviewing! PSA new version. > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > The next expected feedback time is measured from the last status > report. Thus, it seems to me this may suppress feedbacks from being > sent for an unexpectedly long time especially when min_apply_delay is > shorter than wal_r_s_interval. > I think the minimum time before we send any feedback during the wait is wal_r_s_interval. Now, I think if there is no transaction for a long time before we get a new transaction, there should be keep-alive messages in between which would allow us to send feedback at regular intervals (wal_receiver_status_interval). So, I think we should be able to send feedback in less than 2 * wal_receiver_status_interval unless wal_sender/receiver timeout is very large and there is a very low volume of transactions. Now, we can try to send the feedback before we start waiting or maybe after every wal_receiver_status_interval / 2 but I think that will lead to more spurious feedback messages than we get the benefit from them. -- With Regards, Amit Kapila.
Hi, On Thursday, February 9, 2023 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Feb 9, 2023 at 12:17 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > ... > > > > ====== > > > > > > > > src/backend/replication/logical/worker.c > > > > > > > > 2. maybe_apply_delay > > > > > > > > + if (wal_receiver_status_interval > 0 && diffms > > > > > + wal_receiver_status_interval * 1000L) { WaitLatch(MyLatch, > > > > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > > > + wal_receiver_status_interval * 1000L, > > > > + WAIT_EVENT_RECOVERY_APPLY_DELAY); > send_feedback(last_received, > > > > + true, false, true); } > > > > > > > > I felt that introducing another variable like: > > > > > > > > long statusinterval_ms = wal_receiver_status_interval * 1000L; > > > > > > > > would help here by doing 2 things: > > > > 1) The condition would be easier to read because the ms units > > > > would be the same > > > > 2) Won't need * 1000L repeated in two places. > > > > > > > > Only, do take care to assign this variable in the right place in > > > > this loop in case the configuration is changed. > > > > > > Fixed. Calculations are done on two lines - first one is the > > > entrance of the loop, and second one is the after SIGHUP is detected. > > > > > > > TBH, I expected you would write this as just a *single* variable > > assignment before the condition like below: > > > > SUGGESTION (tweaked comment and put single assignment before > > condition) > > /* > > * Call send_feedback() to prevent the publisher from exiting by > > * timeout during the delay, when the status interval is greater than > > * zero. > > */ > > status_interval_ms = wal_receiver_status_interval * 1000L; if > > (status_interval_ms > 0 && diffms > status_interval_ms) { ... > > > > ~ > > > > I understand in theory, your code is more efficient, but in practice, > > I think the overhead of a single variable assignment every loop > > iteration (which is doing WaitLatch anyway) is of insignificant > > concern, whereas having one assignment is simpler than having two IMO. > > > > Yeah, that sounds better to me as well. OK, fixed. The comment adjustment suggested by Peter-san above was also included in this v33. Please have a look at the attached patch. Best Regards, Takamichi Osumi
> The comment adjustment suggested by Peter-san above > was also included in this v33. > Please have a look at the attached patch. Patch v33 LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia
At Thu, 9 Feb 2023 13:26:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in amit.kapila16> On Thu, Feb 9, 2023 at 12:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > I understand in theory, your code is more efficient, but in practice, > > I think the overhead of a single variable assignment every loop > > iteration (which is doing WaitLatch anyway) is of insignificant > > concern, whereas having one assignment is simpler than having two IMO. > > > > Yeah, that sounds better to me as well. FWIW, I'm on board with this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > Thank you for reviewing! PSA new version. > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > The next expected feedback time is measured from the last status > > report. Thus, it seems to me this may suppress feedbacks from being > > sent for an unexpectedly long time especially when min_apply_delay is > > shorter than wal_r_s_interval. > > > > I think the minimum time before we send any feedback during the wait > is wal_r_s_interval. Now, I think if there is no transaction for a > long time before we get a new transaction, there should be keep-alive > messages in between which would allow us to send feedback at regular > intervals (wal_receiver_status_interval). So, I think we should be Right. > able to send feedback in less than 2 * wal_receiver_status_interval > unless wal_sender/receiver timeout is very large and there is a very > low volume of transactions. Now, we can try to send the feedback We have suffered this kind of feedback silence many times. Thus I don't want to rely on luck here. I had in mind exposing last_send itself or providing an interval-calculation function to the logic. > before we start waiting or maybe after every > wal_receiver_status_interval / 2 but I think that will lead to more > spurious feedback messages than we get the benefit from them. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Mmm. A part of the previous mail has gone missing for an uncertain reason and been replaced by mysterious blank lines... At Fri, 10 Feb 2023 09:57:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > > Thank you for reviewing! PSA new version. > > > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > > > The next expected feedback time is measured from the last status > > > report. Thus, it seems to me this may suppress feedbacks from being > > > sent for an unexpectedly long time especially when min_apply_delay is > > > shorter than wal_r_s_interval. > > > > > > > I think the minimum time before we send any feedback during the wait > > is wal_r_s_interval. Now, I think if there is no transaction for a > > long time before we get a new transaction, there should be keep-alive > > messages in between which would allow us to send feedback at regular > > intervals (wal_receiver_status_interval). So, I think we should be > > Right. > > > able to send feedback in less than 2 * wal_receiver_status_interval > > unless wal_sender/receiver timeout is very large and there is a very > > low volume of transactions. Now, we can try to send the feedback > > We have suffered this kind of feedback silence many times. Thus I > don't want to rely on luck here. I had in mind of exposing last_send > itself or providing interval-calclation function to the logic. > > > before we start waiting or maybe after every > > wal_receiver_status_interval / 2 but I think that will lead to more > > spurious feedback messages than we get the benefit from them. Agreed. I think we don't want to do that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > > Thank you for reviewing! PSA new version. > > > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > > > The next expected feedback time is measured from the last status > > > report. Thus, it seems to me this may suppress feedbacks from being > > > sent for an unexpectedly long time especially when min_apply_delay is > > > shorter than wal_r_s_interval. > > > > > > > I think the minimum time before we send any feedback during the wait > > is wal_r_s_interval. Now, I think if there is no transaction for a > > long time before we get a new transaction, there should be keep-alive > > messages in between which would allow us to send feedback at regular > > intervals (wal_receiver_status_interval). So, I think we should be > > Right. > > > able to send feedback in less than 2 * wal_receiver_status_interval > > unless wal_sender/receiver timeout is very large and there is a very > > low volume of transactions. Now, we can try to send the feedback > > We have suffered this kind of feedback silence many times. Thus I > don't want to rely on luck here. I had in mind of exposing last_send > itself or providing interval-calclation function to the logic. > I think we have last_send time in send_feedback(), so we can expose it if we want but how would that solve the problem you are worried about? The one simple idea as I shared in my last email was to send feedback every wal_receiver_status_interval / 2. I think this should avoid any timeout problem because we already recommend setting it to lesser than wal_sender_timeout. -- With Regards, Amit Kapila.
On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > > > Thank you for reviewing! PSA new version. > > > > > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > > > > > The next expected feedback time is measured from the last status > > > > report. Thus, it seems to me this may suppress feedbacks from being > > > > sent for an unexpectedly long time especially when min_apply_delay is > > > > shorter than wal_r_s_interval. > > > > > > > > > > I think the minimum time before we send any feedback during the wait > > > is wal_r_s_interval. Now, I think if there is no transaction for a > > > long time before we get a new transaction, there should be keep-alive > > > messages in between which would allow us to send feedback at regular > > > intervals (wal_receiver_status_interval). So, I think we should be > > > > Right. > > > > > able to send feedback in less than 2 * wal_receiver_status_interval > > > unless wal_sender/receiver timeout is very large and there is a very > > > low volume of transactions. Now, we can try to send the feedback > > > > We have suffered this kind of feedback silence many times. Thus I > > don't want to rely on luck here. I had in mind of exposing last_send > > itself or providing interval-calclation function to the logic. > > > > I think we have last_send time in send_feedback(), so we can expose it > if we want but how would that solve the problem you are worried about? > I have an idea to use last_send time to avoid walsenders being timeout. Instead of directly using wal_receiver_status_interval as a minimum interval to send the feedback, we should check if it is greater than last_send time then we should send the feedback after (wal_receiver_status_interval - last_send). I think they can probably be different only on the very first time. Any better ideas? -- With Regards, Amit Kapila.
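A rough sketch of the adjustment described above, assuming last_send is a timestamp recorded when send_feedback() last sent a message (the variable names and exact placement are illustrative, not the committed design):

/*
 * Sketch only: measure the first feedback wait from the time the last
 * feedback message was actually sent, rather than always waiting a full
 * wal_receiver_status_interval from the start of the delay.
 */
long    status_interval_ms = wal_receiver_status_interval * 1000L;
long    elapsed_ms = TimestampDifferenceMilliseconds(last_send,
                                                     GetCurrentTimestamp());
long    first_wait_ms;

if (elapsed_ms < status_interval_ms)
    first_wait_ms = status_interval_ms - elapsed_ms;    /* only the remainder */
else
    first_wait_ms = 0;          /* a feedback message is already overdue */

On later iterations of the delay loop the full interval would apply again, since a feedback message has just been sent, which matches the observation that the adjustment matters only the very first time.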
Hi, On Friday, February 10, 2023 2:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila > > > <amit.kapila16@gmail.com> wrote in > > > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > > > At Wed, 8 Feb 2023 09:03:03 +0000, "Hayato Kuroda (Fujitsu)" > > > > > <kuroda.hayato@fujitsu.com> wrote in > > > > > > Thank you for reviewing! PSA new version. > > > > > > > > > > + if (statusinterval_ms > 0 && diffms > > > > > > + statusinterval_ms) > > > > > > > > > > The next expected feedback time is measured from the last status > > > > > report. Thus, it seems to me this may suppress feedbacks from > > > > > being sent for an unexpectedly long time especially when > > > > > min_apply_delay is shorter than wal_r_s_interval. > > > > > > > > > > > > > I think the minimum time before we send any feedback during the > > > > wait is wal_r_s_interval. Now, I think if there is no transaction > > > > for a long time before we get a new transaction, there should be > > > > keep-alive messages in between which would allow us to send > > > > feedback at regular intervals (wal_receiver_status_interval). So, > > > > I think we should be > > > > > > Right. > > > > > > > able to send feedback in less than 2 * > > > > wal_receiver_status_interval unless wal_sender/receiver timeout is > > > > very large and there is a very low volume of transactions. Now, we > > > > can try to send the feedback > > > > > > We have suffered this kind of feedback silence many times. Thus I > > > don't want to rely on luck here. I had in mind of exposing last_send > > > itself or providing interval-calclation function to the logic. > > > > > > > I think we have last_send time in send_feedback(), so we can expose it > > if we want but how would that solve the problem you are worried about? > > > > I have an idea to use last_send time to avoid walsenders being timeout. > Instead of directly using wal_receiver_status_interval as a minimum interval > to send the feedback, we should check if it is greater than last_send time > then we should send the feedback after (wal_receiver_status_interval - > last_send). I think they can probably be different only on the very first time. > Any better ideas? This idea sounds good to me, and I have implemented it in the attached patch v34. With the previous patch, we couldn't prevent the publisher from timing out in the scenario suggested by Horiguchi-san, which is reproduced in the attached test file 'test.sh'. But now we handle it by adjusting the first wait time. FYI, we first thought of adding the new variable 'send_time' to the LogicalRepWorker structure. But this structure is used when the launcher controls workers or reports statistics, and it stores TimestampTz values recorded in the received WAL, so we are not sure the struct is the right place for the variable. Moreover, there are other similar variables such as last_recv_time or reply_time, which could become confusing if the new variable were placed alongside them. So it is declared separately. The new patch also includes some changes for the wait event. Kindly have a look at the v34 patch. Best Regards, Takamichi Osumi
Hi, On 2023-02-10 11:26:21 +0000, Takamichi Osumi (Fujitsu) wrote: > Subject: [PATCH v34] Time-delayed logical replication subscriber > > Similar to physical replication, a time-delayed copy of the data for > logical replication is useful for some scenarios (particularly to fix > errors that might cause data loss). > > This patch implements a new subscription parameter called 'min_apply_delay'. Sorry for not reading through the thread, but it's very long. Has there been any discussion about whether this is actually best implemented on the client side? You could alternatively implement it on the sender. That'd have quite a few advantages, I think - you e.g. wouldn't remove the ability to *receive* and send feedback messages. We'd not end up filling up the network buffer with data that we'll not process anytime soon. Greetings, Andres Freund
Hi, On Saturday, February 11, 2023 11:10 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-02-10 11:26:21 +0000, Takamichi Osumi (Fujitsu) wrote: > > Subject: [PATCH v34] Time-delayed logical replication subscriber > > > > Similar to physical replication, a time-delayed copy of the data for > > logical replication is useful for some scenarios (particularly to fix > > errors that might cause data loss). > > > > This patch implements a new subscription parameter called > 'min_apply_delay'. > Has there been any discussion about whether this is actually best > implemented on the client side? You could alternatively implement it on the > sender. > > That'd have quite a few advantages, I think - you e.g. wouldn't remove the > ability to *receive* and send feedback messages. We'd not end up filling up > the network buffer with data that we'll not process anytime soon. Thanks for your comments! We discussed the publisher-side idea around here [1], but we chose the current direction. Kindly have a look at the discussion. If we apply the delay on the publisher, it can lead to extra delay in cases where it is not needed. The currently proposed approach can take other loads or factors (network, busyness of the publisher, etc.) into account because it calculates the required delay on the subscriber. [1] - https://www.postgresql.org/message-id/20221215.105200.268327207020006785.horikyota.ntt%40gmail.com Best Regards, Takamichi Osumi
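The reason the subscriber-side calculation absorbs those factors is that the delay is anchored to the commit/prepare timestamp carried with the replicated transaction. A simplified sketch, using the finish_ts and minapplydelay names from the patch (the early return is only illustrative):

/*
 * Simplified sketch: the wait is computed from the transaction's commit
 * time on the publisher (finish_ts), so time already spent in decoding
 * and network transfer counts toward min_apply_delay rather than being
 * added on top of it.
 */
TimestampTz delay_until = TimestampTzPlusMilliseconds(finish_ts,
                                                      MySubscription->minapplydelay);
long        remaining_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
                                                           delay_until);

if (remaining_ms <= 0)
    return;                 /* the delay has already elapsed; apply now */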
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > We have suffered this kind of feedback silence many times. Thus I > > > don't want to rely on luck here. I had in mind of exposing last_send > > > itself or providing interval-calclation function to the logic. > > > > I think we have last_send time in send_feedback(), so we can expose it > > if we want but how would that solve the problem you are worried about? The walreceiver can avoid a too-long sleep by knowing when to wake up for the next feedback. > I have an idea to use last_send time to avoid walsenders being > timeout. Instead of directly using wal_receiver_status_interval as a > minimum interval to send the feedback, we should check if it is > greater than last_send time then we should send the feedback after > (wal_receiver_status_interval - last_send). I think they can probably > be different only on the very first time. Any better ideas? If it means MyLogicalRepWorker->last_send_time, that is not the last time the walreceiver sent feedback but the last time the wal*sender* sent data. So I'm not sure that works. We could use the variable that way, but AFAIU, when so many changes have been spooled that control doesn't return to LogicalRepApplyLoop for longer than wal_r_s_interval, maybe_apply_delay() starts calling send_feedback() on every call after the first feedback deadline. Even then, send_feedback() won't actually send one until the next feedback deadline; I don't think that behavior is great. The only packets the walreceiver sends back are feedback packets, and currently only send_feedback() knows the last feedback time. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, Horiguchi-san On Monday, February 13, 2023 10:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila <amit.kapila16@gmail.com> > wrote in > > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > We have suffered this kind of feedback silence many times. Thus I > > > > don't want to rely on luck here. I had in mind of exposing > > > > last_send itself or providing interval-calclation function to the logic. > > > > > > I think we have last_send time in send_feedback(), so we can expose > > > it if we want but how would that solve the problem you are worried > about? > > Wal receiver can avoid a too-long sleep by knowing when to wake up for the > next feedback. > > > I have an idea to use last_send time to avoid walsenders being > > timeout. Instead of directly using wal_receiver_status_interval as a > > minimum interval to send the feedback, we should check if it is > > greater than last_send time then we should send the feedback after > > (wal_receiver_status_interval - last_send). I think they can probably > > be different only on the very first time. Any better ideas? > > If it means MyLogicalRepWorker->last_send_time, it is not the last time when > walreceiver sent a feedback but the last time when > wal*sender* sent a data. So I'm not sure that works. > > We could use the variable that way, but AFAIU in turn when so many changes > have been spooled that the control doesn't return to LogicalRepApplyLoop > longer than wal_r_s_interval, maybe_apply_delay() starts calling > send_feedback() at every call after the first feedback timing. Even in that > case, send_feedback() won't send one actually until the next feedback timing, > I don't think that behavior is great. > > The only packets walreceiver sends back is the feedback packets and > currently only send_feedback knows the last feedback time. Thanks for your comments ! As described in your last sentence, in the latest patch v34 [1], we use the last time set in send_feedback() and based on it, we calculate and adjust the first timing of feedback message in maybe_apply_delay() so that we can send the feedback message following the interval of wal_receiver_status_interval. I wasn't sure if the above concern is still valid for this implementation. Could you please have a look at the latest patch and share your opinion ? [1] - https://www.postgresql.org/message-id/TYCPR01MB83736C50C98CB2153728A7A8EDDE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Here are my review comments for the v34 patch. ====== src/backend/replication/logical/worker.c +/* The last time we send a feedback message */ +static TimestampTz send_time = 0; + IMO this is a bad variable name. When this variable was changed to be global it ought to have been renamed. The name "send_time" is almost meaningless without any contextual information. But also it's bad because this global name is "shadowed" by several other parameters and other local variables using that same name (e.g. see UpdateWorkerStats, LogicalRepApplyLoop, etc). It is too confusing. How about using a unique/meaningful name with a comment to match to improve readability and remove unwanted shadowing? SUGGESTION /* Timestamp of when the last feedback message was sent. */ static TimestampTz last_sent_feedback_ts = 0; ~~~ 2. maybe_apply_delay + /* Apply the delay by the latch mechanism */ + do + { + TimestampTz delayUntil; + long diffms; + + ResetLatch(MyLatch); + + CHECK_FOR_INTERRUPTS(); + + /* This might change wal_receiver_status_interval */ + if (ConfigReloadPending) + { + ConfigReloadPending = false; + ProcessConfigFile(PGC_SIGHUP); + } + + /* + * Before calculating the time duration, reload the catalog if needed. + */ + if (!in_remote_transaction && !in_streamed_transaction) + { + AcceptInvalidationMessages(); + maybe_reread_subscription(); + } + + delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay); + diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil); + + /* + * Exit without arming the latch if it's already past time to apply + * this transaction. + */ + if (diffms <= 0) + break; + + elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms", + xid, MySubscription->minapplydelay, diffms); + + /* + * Call send_feedback() to prevent the publisher from exiting by + * timeout during the delay, when the status interval is greater than + * zero. + */ + if (!status_interval_ms) + { + TimestampTz nextFeedback; + + /* + * Based on the last time when we send a feedback message, adjust + * the first delay time for this transaction. This ensures that + * the first feedback message follows wal_receiver_status_interval + * interval. + */ + nextFeedback = TimestampTzPlusMilliseconds(send_time, + wal_receiver_status_interval * 1000L); + status_interval_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), nextFeedback); + } + else + status_interval_ms = wal_receiver_status_interval * 1000L; + + if (status_interval_ms > 0 && diffms > status_interval_ms) + { + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + status_interval_ms, + WAIT_EVENT_LOGICAL_APPLY_DELAY); + send_feedback(last_received, true, false, true); + } + else + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + diffms, + WAIT_EVENT_LOGICAL_APPLY_DELAY); + + } while (true); ~ IMO this logic has been tweaked too many times without revisiting the variable names and logic from scratch, so it has become over-complex - some variable names are assuming multiple meanings - multiple * 1000L have crept back in again - the 'diffms' is too generic now with so many vars so it has lost its meaning - GetCurrentTimestamp call in multiple places SUGGESTIONS - rename some variables and simplify the logic. 
- reduce all the if/else - don't be sneaky with the meaning of status_interval_ms - 'diffms' --> 'remaining_delay_ms' - 'DelayUntil' --> 'delay_until_ts' - introduce 'now' variable - simplify the check of (next_feedback_due_ms < remaining_delay_ms) SUGGESTION (WFM) /* Apply the delay by the latch mechanism */ while (true) { TimestampTz now; TimestampTz delay_until_ts; long remaining_delay_ms; long status_interval_ms; ResetLatch(MyLatch); CHECK_FOR_INTERRUPTS(); /* This might change wal_receiver_status_interval */ if (ConfigReloadPending) { ConfigReloadPending = false; ProcessConfigFile(PGC_SIGHUP); } /* * Before calculating the time duration, reload the catalog if needed. */ if (!in_remote_transaction && !in_streamed_transaction) { AcceptInvalidationMessages(); maybe_reread_subscription(); } now = GetCurrentTimestamp(); delay_until_ts = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay); remaining_delay_ms = TimestampDifferenceMilliseconds(now, delay_until_ts); /* * Exit without arming the latch if it's already past time to apply * this transaction. */ if (remaining_delay_ms <= 0) break; elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms", xid, MySubscription->minapplydelay, remaining_delay_ms); /* * If a status interval is defined then we may need to call send_feedback() * early to prevent the publisher from exiting during a long apply delay. */ status_interval_ms = wal_receiver_status_interval * 1000L; if (status_interval_ms > 0) { TimestampTz next_feedback_due_ts; long next_feedback_due_ms; /* * Find if the next feedback is due earlier than the remaining delay ms. */ next_feedback_due_ts = TimestampTzPlusMilliseconds(send_time, status_interval_ms); next_feedback_due_ms = TimestampDifferenceMilliseconds(now, next_feedback_due_ts); if (next_feedback_due_ms < remaining_delay_ms) { /* delay before feedback */ WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, next_feedback_due_ms, WAIT_EVENT_LOGICAL_APPLY_DELAY); send_feedback(last_received, true, false, true); continue; } } /* delay before apply */ WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, remaining_delay_ms, WAIT_EVENT_LOGICAL_APPLY_DELAY); } ====== src/include/utils/wait_event.h 3. @@ -149,7 +149,8 @@ typedef enum WAIT_EVENT_REGISTER_SYNC_REQUEST, WAIT_EVENT_SPIN_DELAY, WAIT_EVENT_VACUUM_DELAY, - WAIT_EVENT_VACUUM_TRUNCATE + WAIT_EVENT_VACUUM_TRUNCATE, + WAIT_EVENT_LOGICAL_APPLY_DELAY } WaitEventTimeout; FYI - The PGDOCS has a section with "Table 28.13. Wait Events of Type Timeout" so if you a going to add a new Timeout Event then you also need to document it (alphabetically) in that table. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Feb 10, 2023 at 4:56 PM Takamichi Osumi (Fujitsu) <osumi.takamichi@fujitsu.com> wrote: > > On Friday, February 10, 2023 2:05 PM wrote: > > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > In the previous patch, we couldn't solve the > timeout of the publisher, when we conduct a scenario suggested by Horiguchi-san > and reproduced in the scenario attached test file 'test.sh'. > But now we handle it by adjusting the timing of the first wait time. > > FYI, we thought to implement the new variable 'send_time' > in the LogicalRepWorker structure at first. But, this structure > is used when launcher controls workers or reports statistics > and it stores TimestampTz recorded in the received WAL, > so not sure if the struct is the right place to implement the variable. > Moreover, there are other similar variables such as last_recv_time > or reply_time. So, those will be confusing when we decide to have > new variable together. Then, it's declared separately. > I think we can introduce a new variable last_feedback_time in the LogicalRepWorker structure, and probably, for last_received, we can use last_lsn in MyLogicalRepWorker as that seems to be updated correctly. I think it would be good to avoid global variables. -- With Regards, Amit Kapila.
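For illustration, a minimal sketch of how maybe_apply_delay() could compute the first wait from such a field; the MyLogicalRepWorker->last_feedback_time member is a hypothetical name used only for this sketch, and the four-argument send_feedback() call follows the v34 patch quoted earlier in the thread:

    TimestampTz next_feedback_due;
    long        wait_ms;

    /*
     * Sketch only: base the first feedback timing on a (hypothetical)
     * last_feedback_time field in MyLogicalRepWorker rather than on a
     * file-level global in worker.c.
     */
    next_feedback_due =
        TimestampTzPlusMilliseconds(MyLogicalRepWorker->last_feedback_time,
                                    wal_receiver_status_interval * 1000L);
    wait_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
                                              next_feedback_due);
    if (wait_ms <= 0)
        send_feedback(MyLogicalRepWorker->last_lsn, true, false, true);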
Hi, On 2023-02-11 05:44:47 +0000, Takamichi Osumi (Fujitsu) wrote: > On Saturday, February 11, 2023 11:10 AM Andres Freund <andres@anarazel.de> wrote: > > Has there been any discussion about whether this is actually best > > implemented on the client side? You could alternatively implement it on the > > sender. > > > > That'd have quite a few advantages, I think - you e.g. wouldn't remove the > > ability to *receive* and send feedback messages. We'd not end up filling up > > the network buffer with data that we'll not process anytime soon. > Thanks for your comments ! > > We have discussed about the publisher side idea around here [1] > but, we chose the current direction. Kindly have a look at the discussion. > > If we apply the delay on the publisher, then > it can lead to extra delay where we don't need to apply. > The current proposed approach can take other loads or factors > (network, busyness of the publisher, etc) into account > because it calculates the required delay on the subscriber. I don't think it's OK to just lose the ability to read / reply to keepalive messages. I think, as-is, we should seriously consider just rejecting the feature; it adds too much complexity without a corresponding gain. Greetings, Andres Freund
At Mon, 13 Feb 2023 15:51:25 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > I think we can introduce a new variable last_feedback_time in the > LogicalRepWorker structure and probably, for last_received, we can > use last_lsn in MyLogicalRepWorker as that seems to be updated correctly. > I think it would be good to avoid global variables. MyLogicalRepWorker is a global variable:p, but it is far better than a bare one. By the way, we are trying to send the status messages regularly, but as Andres pointed out, the worker neither reads nor replies to keepalive messages from the publisher while delaying. That is not possible as long as we choke the stream at the subscriber end. It doesn't seem to be a practical problem, but IMHO he's right in terms of adherence to the wire protocol, which was also one of my own initial concerns. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, Andres-san On Tuesday, February 14, 2023 1:47 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-02-11 05:44:47 +0000, Takamichi Osumi (Fujitsu) wrote: > > On Saturday, February 11, 2023 11:10 AM Andres Freund > <andres@anarazel.de> wrote: > > > Has there been any discussion about whether this is actually best > > > implemented on the client side? You could alternatively implement it > > > on the sender. > > > > > > That'd have quite a few advantages, I think - you e.g. wouldn't > > > remove the ability to *receive* and send feedback messages. We'd > > > not end up filling up the network buffer with data that we'll not process > anytime soon. > > Thanks for your comments ! > > > > We have discussed about the publisher side idea around here [1] but, > > we chose the current direction. Kindly have a look at the discussion. > > > > If we apply the delay on the publisher, then it can lead to extra > > delay where we don't need to apply. > > The current proposed approach can take other loads or factors > > (network, busyness of the publisher, etc) into account because it > > calculates the required delay on the subscriber. > > I don't think it's OK to just lose the ability to read / reply to keepalive > messages. > > I think, as-is, we should seriously consider just rejecting the feature; it adds too much > complexity without a corresponding gain. Thanks for your comments ! Could you please tell us about your concern a bit more? The keepalive/reply messages are currently used for two purposes: (a) send the updated write/flush/apply locations; (b) avoid timeouts in case of idle times. Neither case should be impacted by this time-delayed LR patch, because during the delay there won't be any progress, and to avoid timeouts we allow sending keepalive messages during the delay. We would just like to clarify the issue you have in mind. OTOH, if we want to implement the functionality on the publisher side, I think we need to first consider the interface. We can think of two options: (a) Have it as a subscription parameter as the patch has now and then pass it as an option to the publisher, which it will use to delay; (b) Have it defined on the publisher side, say via a GUC or some other way. The basic idea could be that while processing the commit record (in DecodeCommit), we can somehow check the value of the delay and then use it there to delay sending the xact. Also, during the delay, we need to somehow send keepalives and process replies, probably via a new callback or by some existing callback. We also need to handle in-progress and 2PC xacts in a similar way. For the former, we would probably need to apply the delay before sending the first stream. Could you please share what you feel about this direction as well? Best Regards, Takamichi Osumi
Dear Andres and other hackers, > OTOH, if we want to implement the functionality on publisher-side, > I think we need to first consider the interface. > We can think of two options (a) Have it as a subscription parameter as the patch > has now and > then pass it as an option to the publisher which it will use to delay; > (b) Have it defined on publisher-side, say via GUC or some other way. > The basic idea could be that while processing commit record (in DecodeCommit), > we can somehow check the value of delay and then use it there to delay sending > the xact. > Also, during delay, we need to somehow send the keepalive and process replies, > probably via a new callback or by some existing callback. > We also need to handle in-progress and 2PC xacts in a similar way. > For the former, probably we would need to apply the delay before sending the first > stream. > Could you please share what you feel on this direction as well ? I implemented a patch in which the delay is applied on the publisher side. In this patch, approach (a) was chosen: min_apply_delay is specified as a subscription parameter, and the apply worker then passes it to the publisher as an output plugin option. During the delay, the walsender periodically checks and processes replies from the apply worker and sends keepalive messages if needed. Therefore, the ability to handle keepalives is not lost. To delay the transaction in the output plugin layer, a new LogicalOutputPlugin API was added. For now, I chose the output plugin layer but can consider doing it from the core if there is a better way. Could you please share your opinion? Note: thanks to Osumi-san for helping with the implementation. Best Regards, Hayato Kuroda FUJITSU LIMITED
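To make "the walsender periodically checks and processes replies ... and sends keepalive messages" a bit more concrete, here is a rough sketch of what such a delay loop inside walsender.c could look like. The function name, the once-per-second wakeup, and the reuse of WAIT_EVENT_WALSENDER_SEND_DELAY are assumptions for illustration only, not the posted patch; ProcessRepliesIfAny() and WalSndKeepaliveIfNecessary() are existing walsender-internal helpers.

    /*
     * Sketch only: wait until delay_until has passed, while still reading
     * subscriber replies and sending keepalives so that the connection does
     * not hit wal_sender_timeout during the delay.
     */
    static void
    WalSndDelay(TimestampTz delay_until)
    {
        for (;;)
        {
            long        remaining_ms;

            CHECK_FOR_INTERRUPTS();

            /* Keep servicing the connection while we wait. */
            ProcessRepliesIfAny();
            WalSndKeepaliveIfNecessary();

            remaining_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
                                                           delay_until);
            if (remaining_ms <= 0)
                break;

            /* Wake up at least once per second to send keepalives if needed. */
            WalSndWait(WL_SOCKET_READABLE,
                       Min(remaining_ms, 1000),
                       WAIT_EVENT_WALSENDER_SEND_DELAY);
        }
    }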
At Wed, 15 Feb 2023 11:29:18 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > Dear Andres and other hackers, > > > OTOH, if we want to implement the functionality on publisher-side, > > I think we need to first consider the interface. > > We can think of two options (a) Have it as a subscription parameter as the patch > > has now and > > then pass it as an option to the publisher which it will use to delay; > > (b) Have it defined on publisher-side, say via GUC or some other way. > > The basic idea could be that while processing commit record (in DecodeCommit), > > we can somehow check the value of delay and then use it there to delay sending > > the xact. > > Also, during delay, we need to somehow send the keepalive and process replies, > > probably via a new callback or by some existing callback. > > We also need to handle in-progress and 2PC xacts in a similar way. > > For the former, probably we would need to apply the delay before sending the first > > stream. > > Could you please share what you feel on this direction as well ? > > I implemented a patch that the delaying is done on the publisher side. In this patch, > approach (a) was chosen, in which min_apply_delay is specified as a subscription > parameter, and then apply worker passes it to the publisher as an output plugin option. As Amit-K mentioned, we may need to change the name of the option in this version, since the delay mechanism in this version causes a delay in sending from publisher than delaying apply on the subscriber side. I'm not sure why output plugin is involved in the delay mechanism. It appears to me that it would be simpler if the delay occurred in reorder buffer or logical decoder instead. Perhaps what I understand correctly is that we could delay right before only sending commit records in this case. If we delay at publisher end, all changes will be sent at once if !streaming, and otherwise, all changes in a transaction will be spooled at subscriber end. In any case, apply worker won't be holding an active transaction unnecessarily. Of course we need add the mechanism to process keep-alive and status report messages. > During the delay, the walsender periodically checks and processes replies from the > apply worker and sends keepalive messages if needed. Therefore, the ability to handle > keepalives is not loosed. My understanding is that the keep-alives is a different mechanism with a different objective from status reports. Even if subscriber doesn't send a spontaneous or extra status reports at all, connection can be checked and maintained by keep-alive packets. It is possible to setup an asymmetric configuration where only walsender sends keep-alives, but none are sent from the peer. Those setups work fine when no apply-delay involved, but they won't work with the patches we're talking about because the subscriber won't respond to the keep-alive packets from the peer. So when I wrote "practically works" in the last mail, this is what I meant. Thus if someone plans to enable apply_delay for logical replication, that person should be aware of some additional subtle restrictions that are required compared to a non-delayed setups. > To delay the transaction in the output plugin layer, the new LogicalOutputPlugin > API was added. For now, I choose the output plugin layer but can consider to do > it from the core if there is a better way. > > Could you please share your opinion? > > Note: thanks for Osumi-san to help implementing. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
Dear Horiguchi-san, Thank you for responding! Before modifying the patches, I want to confirm something you said. > As Amit-K mentioned, we may need to change the name of the option in > this version, since the delay mechanism in this version causes a delay > in sending from publisher than delaying apply on the subscriber side. Right, will be changed. > I'm not sure why output plugin is involved in the delay mechanism. It > appears to me that it would be simpler if the delay occurred in > reorder buffer or logical decoder instead. I'm planning to change it, but.. > Perhaps what I understand > correctly is that we could delay right before only sending commit > records in this case. If we delay at publisher end, all changes will > be sent at once if !streaming, and otherwise, all changes in a > transaction will be spooled at subscriber end. In any case, apply > worker won't be holding an active transaction unnecessarily. What about the parallel case? The latest patch does not reject the combination of parallel streaming mode and delay. If the delay is done at commit and the subscriber uses a parallel apply worker, it may hold locks for a long time. > Of > course we need add the mechanism to process keep-alive and status > report messages. Could you share a good way to handle keep-alive and status messages if you have one? If we moved this to the decoding layer, it would be strange to call walsender functions directly. > Those setups work fine when no > apply-delay involved, but they won't work with the patches we're > talking about because the subscriber won't respond to the keep-alive > packets from the peer. So when I wrote "practically works" in the > last mail, this is what I meant. I'm not sure about that part. I think that in the latest patch the subscriber can respond to the keepalive packets from the peer, and the publisher can respond to the peer as well. Could you please tell me if you know of a case where the publisher or subscriber cannot respond to the opposite side? Note that if we apply the publisher-side patch, we don't have to apply the subscriber-side patch. Best Regards, Hayato Kuroda FUJITSU LIMITED
At Thu, 16 Feb 2023 06:20:23 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > Dear Horiguchi-san, > > Thank you for responding! Before modifying patches, I want to confirm something > you said. > > > As Amit-K mentioned, we may need to change the name of the option in > > this version, since the delay mechanism in this version causes a delay > > in sending from publisher than delaying apply on the subscriber side. > > Right, will be changed. > > > I'm not sure why output plugin is involved in the delay mechanism. It > > appears to me that it would be simpler if the delay occurred in > > reorder buffer or logical decoder instead. > > I'm planning to change, but.. Yeah, I don't think we've made up our minds about which way to go yet, so it's a bit too early to work on that. > > Perhaps what I understand > > correctly is that we could delay right before only sending commit > > records in this case. If we delay at publisher end, all changes will > > be sent at once if !streaming, and otherwise, all changes in a > > transaction will be spooled at subscriber end. In any case, apply > > worker won't be holding an active transaction unnecessarily. > > What about parallel case? Latest patch does not reject the combination of parallel > streaming mode and delay. If delay is done at commit and subscriber uses an parallel > apply worker, it may acquire lock for a long time. I didn't looked too closely, but my guess is that transactions are conveyed in spool files in parallel mode, with each file storing a complete transaction. > > Of > > course we need add the mechanism to process keep-alive and status > > report messages. > > Could you share the good way to handle keep-alive and status messages if you have? > If we changed to the decoding layer, it is strange to call walsender function > directly. I'm sorry, but I don't have a concrete idea at the moment. When I read through the last patch, I missed that WalSndDelay is actually a subset of WalSndLoop. Although it can handle keep-alives correctly, I'm not sure we can accept that structure.. > > Those setups work fine when no > > apply-delay involved, but they won't work with the patches we're > > talking about because the subscriber won't respond to the keep-alive > > packets from the peer. So when I wrote "practically works" in the > > last mail, this is what I meant. > > I'm not sure around the part. I think in the latest patch, subscriber can respond > to the keepalive packets from the peer. Also, publisher can respond to the peer. > Could you please tell me if you know a case that publisher or subscriber cannot > respond to the opposite side? Note that if we apply the publisher-side patch, we > don't have to apply subscriber-side patch. Sorry about that again, I missed that part in the last patch as mentioned earlier.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Feb 16, 2023 at 2:25 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 16 Feb 2023 06:20:23 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > Dear Horiguchi-san, > > > > Thank you for responding! Before modifying patches, I want to confirm something > > you said. > > > > > As Amit-K mentioned, we may need to change the name of the option in > > > this version, since the delay mechanism in this version causes a delay > > > in sending from publisher than delaying apply on the subscriber side. > > > > Right, will be changed. > > > > > I'm not sure why output plugin is involved in the delay mechanism. It > > > appears to me that it would be simpler if the delay occurred in > > > reorder buffer or logical decoder instead. > > > > I'm planning to change, but.. > > Yeah, I don't think we've made up our minds about which way to go yet, > so it's a bit too early to work on that. > > > > Perhaps what I understand > > > correctly is that we could delay right before only sending commit > > > records in this case. If we delay at publisher end, all changes will > > > be sent at once if !streaming, and otherwise, all changes in a > > > transaction will be spooled at subscriber end. In any case, apply > > > worker won't be holding an active transaction unnecessarily. > > > > What about parallel case? Latest patch does not reject the combination of parallel > > streaming mode and delay. If delay is done at commit and subscriber uses an parallel > > apply worker, it may acquire lock for a long time. > > I didn't looked too closely, but my guess is that transactions are > conveyed in spool files in parallel mode, with each file storing a > complete transaction. > No, we don't try to collect all the data in files for parallel mode. Having said that, it doesn't matter because we won't know the time of the commit (which is used to compute delay) before we encounter the commit record in WAL. So, I feel for this approach, we can follow what you said. > > > Of > > > course we need add the mechanism to process keep-alive and status > > > report messages. > > > > Could you share the good way to handle keep-alive and status messages if you have? > > If we changed to the decoding layer, it is strange to call walsender function > > directly. > > I'm sorry, but I don't have a concrete idea at the moment. When I read > through the last patch, I missed that WalSndDelay is actually a subset > of WalSndLoop. Although it can handle keep-alives correctly, I'm not > sure we can accept that structure.. > I think we can use update_progress_txn() for this purpose but note that we are discussing to change the same in thread [1]. [1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de -- With Regards, Amit Kapila.
Hi, On 2023-02-16 14:21:01 +0900, Kyotaro Horiguchi wrote: > I'm not sure why output plugin is involved in the delay mechanism. +many The output plugin absolutely never should be involved in something like this. It was a grave mistake that OutputPluginUpdateProgress() calls were added to the commit callback and then proliferated. > It appears to me that it would be simpler if the delay occurred in reorder > buffer or logical decoder instead. This is a feature specific to the walsender. So the triggering logic should either directly live in the walsender, or in a callback set in LogicalDecodingContext. That could be called from decode.c or such. Greetings, Andres Freund
Dear Horiguchi-san, Thank you for replying! This direction seems OK, so I started to revise the patch. PSA new version. > > > As Amit-K mentioned, we may need to change the name of the option in > > > this version, since the delay mechanism in this version causes a delay > > > in sending from publisher than delaying apply on the subscriber side. > > > > Right, will be changed. > > > > > I'm not sure why output plugin is involved in the delay mechanism. It > > > appears to me that it would be simpler if the delay occurred in > > > reorder buffer or logical decoder instead. > > > > I'm planning to change, but.. > > Yeah, I don't think we've made up our minds about which way to go yet, > so it's a bit too early to work on that. The parameter name is changed to min_send_delay. And the delaying spot is changed to logical decoder. > > > Perhaps what I understand > > > correctly is that we could delay right before only sending commit > > > records in this case. If we delay at publisher end, all changes will > > > be sent at once if !streaming, and otherwise, all changes in a > > > transaction will be spooled at subscriber end. In any case, apply > > > worker won't be holding an active transaction unnecessarily. > > > > What about parallel case? Latest patch does not reject the combination of > parallel > > streaming mode and delay. If delay is done at commit and subscriber uses an > parallel > > apply worker, it may acquire lock for a long time. > > I didn't looked too closely, but my guess is that transactions are > conveyed in spool files in parallel mode, with each file storing a > complete transaction. Based on the advice, I moved the delaying to DecodeCommit(). And the combination of parallel streaming mode and min_send_delay is rejected again. > > > Of > > > course we need add the mechanism to process keep-alive and status > > > report messages. > > > > Could you share the good way to handle keep-alive and status messages if you > have? > > If we changed to the decoding layer, it is strange to call walsender function > > directly. > > I'm sorry, but I don't have a concrete idea at the moment. When I read > through the last patch, I missed that WalSndDelay is actually a subset > of WalSndLoop. Although it can handle keep-alives correctly, I'm not > sure we can accept that structure.. No issues. I have kept the current implementation. Some bugs I found are also fixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Amit, > > > > Perhaps what I understand > > > > correctly is that we could delay right before only sending commit > > > > records in this case. If we delay at publisher end, all changes will > > > > be sent at once if !streaming, and otherwise, all changes in a > > > > transaction will be spooled at subscriber end. In any case, apply > > > > worker won't be holding an active transaction unnecessarily. > > > > > > What about parallel case? Latest patch does not reject the combination of > parallel > > > streaming mode and delay. If delay is done at commit and subscriber uses an > parallel > > > apply worker, it may acquire lock for a long time. > > > > I didn't looked too closely, but my guess is that transactions are > > conveyed in spool files in parallel mode, with each file storing a > > complete transaction. > > > > No, we don't try to collect all the data in files for parallel mode. > Having said that, it doesn't matter because we won't know the time of > the commit (which is used to compute delay) before we encounter the > commit record in WAL. So, I feel for this approach, we can follow what > you said. Right. And new patch follows the opinion. > > > > Of > > > > course we need add the mechanism to process keep-alive and status > > > > report messages. > > > > > > Could you share the good way to handle keep-alive and status messages if > you have? > > > If we changed to the decoding layer, it is strange to call walsender function > > > directly. > > > > I'm sorry, but I don't have a concrete idea at the moment. When I read > > through the last patch, I missed that WalSndDelay is actually a subset > > of WalSndLoop. Although it can handle keep-alives correctly, I'm not > > sure we can accept that structure.. > > > > I think we can use update_progress_txn() for this purpose but note > that we are discussing to change the same in thread [1]. > > [1] - > https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40 > awork3.anarazel.de I did not reuse update_progress_txn() because we cannot use it straightforward, But I can change if we have better idea than present. New patch was posted in [1]. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866F00191375D0193320A4DF5A19%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Andres, Thank you for giving comments! I understand that you have agreed with the approach in which the publisher delays sending data. > > I'm not sure why output plugin is involved in the delay mechanism. > > +many > > The output plugin absolutely never should be involved in something like > this. It was a grave mistake that OutputPluginUpdateProgress() calls were > added to the commit callback and then proliferated. > > > > It appears to me that it would be simpler if the delay occurred in reorder > > buffer or logical decoder instead. > > This is a feature specific to the walsender. So the triggering logic should either > directly live in the walsender, or in a callback set in > LogicalDecodingContext. That could be called from decode.c or such. OK, I can follow that opinion. I think walsender functions should not be called directly from decode.c, so I implemented it as a callback in LogicalDecodingContext that is called from decode.c if set. A new patch was posted in [1]. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866F00191375D0193320A4DF5A19%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
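As a concrete illustration of that wiring (the member and callback names below are hypothetical, not the posted patch): the walsender installs its own routine on the decoding context it creates in StartLogicalReplication(), and decode.c only ever calls it through the context, so no walsender symbol is referenced from the decoder. This mirrors how prepare_write, write, and update_progress are already passed around on the context.

    /* Sketch only: in StartLogicalReplication() (walsender.c). */
    ctx = CreateDecodingContext(cmd->startpoint, cmd->options, false,
                                XL_ROUTINE(.page_read = logical_read_xlog_page,
                                           .segment_open = WalSndSegmentOpen,
                                           .segment_close = wal_segment_close),
                                WalSndPrepareWrite, WalSndWriteData,
                                WalSndUpdateProgress);
    ctx->delay_send = WalSndDelaySend;  /* hypothetical delay callback */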
On Fri, Feb 17, 2023 at 12:14 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Thank you for replying! This direction seems OK, so I started to revise the patch. > PSA new version. > Few comments: ============= 1. + <para> + The minimum delay for publisher sends data, in milliseconds + </para></entry> + </row> It would probably be better to write it as "The minimum delay, in milliseconds, by the publisher to send changes" 2. The subminsenddelay is placed inconsistently in the patch. In the docs (catalogs.sgml), system_views.sql, and in some places in the code, it is after subskiplsn, but in the catalog table and corresponding structure, it is placed after subowner. It should be consistently placed after the subscription owner. 3. + <row> + <entry><literal>WalSenderSendDelay</literal></entry> + <entry>Waiting for sending changes to subscriber in WAL sender + process.</entry> How about writing it as follows: "Waiting while sending changes for time-delayed logical replication in the WAL sender process."? 4. + <para> + Any delay becomes effective only after all initial table + synchronization has finished and occurs before each transaction + starts to get applied on the subscriber. The delay does not take into + account the overhead of time spent in transferring the transaction, + which means that the arrival time at the subscriber may be delayed + more than the given time. + </para> This needs to change based on a new approach. It should be something like: "The delay is effective only when the publisher decides to send a particular transaction downstream." 5. + * allowed. This is because in parallel streaming mode, we start applying + * the transaction stream as soon as the first change arrives without + * knowing the transaction's prepare/commit time. Always waiting for the + * full 'min_send_delay' period might include unnecessary delay. + * + * The other possibility was to apply the delay at the end of the parallel + * apply transaction but that would cause issues related to resource bloat + * and locks being held for a long time. + */ This part of the comments seems to imply more of a subscriber-side delay approach. I think we should try to adjust these as per the changed approach. 6. @@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, buf->origptr, buf->endptr); } + /* Delay given time if the context has 'delay' callback */ + if (ctx->delay) + ctx->delay(ctx, commit_time); + I think we should invoke delay functionality only when ctx->min_send_delay > 0. Otherwise, there will be some unnecessary overhead. We can change the comment along the lines of: "Delay sending the changes if required. For streaming transactions, this means a delay in sending the last stream but that is okay because on the downstream the changes will be applied only after receiving the last stream." 7. For 2PC transactions, I think we should add the delay in DecodePrerpare. Because after receiving the PREPARE, the downstream will apply the xact. In this case, we shouldn't add a delay for the commit_prepared. 8. +# +# If the subscription sets min_send_delay parameter, the logical replication +# worker will delay the transaction apply for min_send_delay milliseconds. I think here also comments should be updated as per the changed approach for applying the delay on the publisher side. -- With Regards, Amit Kapila.
Dear Amit, Thank you for reviewing! PSA new version. > 1. > + <para> > + The minimum delay for publisher sends data, in milliseconds > + </para></entry> > + </row> > > It would probably be better to write it as "The minimum delay, in > milliseconds, by the publisher to send changes" Fixed. > 2. The subminsenddelay is placed inconsistently in the patch. In the > docs (catalogs.sgml), system_views.sql, and in some places in the > code, it is after subskiplsn, but in the catalog table and > corresponding structure, it is placed after subowner. It should be > consistently placed after the subscription owner. Basically moved. Note that some parts were not changed like maybe_reread_subscription() because the ordering had been already broken. > 3. > + <row> > + <entry><literal>WalSenderSendDelay</literal></entry> > + <entry>Waiting for sending changes to subscriber in WAL sender > + process.</entry> > > How about writing it as follows: "Waiting while sending changes for > time-delayed logical replication in the WAL sender process."? Fixed. > 4. > + <para> > + Any delay becomes effective only after all initial table > + synchronization has finished and occurs before each transaction > + starts to get applied on the subscriber. The delay does not take into > + account the overhead of time spent in transferring the transaction, > + which means that the arrival time at the subscriber may be delayed > + more than the given time. > + </para> > > This needs to change based on a new approach. It should be something > like: "The delay is effective only when the publisher decides to send > a particular transaction downstream." Right, the first sentence is partially changed as you said. > 5. > + * allowed. This is because in parallel streaming mode, we start applying > + * the transaction stream as soon as the first change arrives without > + * knowing the transaction's prepare/commit time. Always waiting for the > + * full 'min_send_delay' period might include unnecessary delay. > + * > + * The other possibility was to apply the delay at the end of the parallel > + * apply transaction but that would cause issues related to resource bloat > + * and locks being held for a long time. > + */ > > This part of the comments seems to imply more of a subscriber-side > delay approach. I think we should try to adjust these as per the > changed approach. Adjusted. > 6. > @@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > buf->origptr, buf->endptr); > } > > + /* Delay given time if the context has 'delay' callback */ > + if (ctx->delay) > + ctx->delay(ctx, commit_time); > + > > I think we should invoke delay functionality only when > ctx->min_send_delay > 0. Otherwise, there will be some unnecessary > overhead. We can change the comment along the lines of: "Delay sending > the changes if required. For streaming transactions, this means a > delay in sending the last stream but that is okay because on the > downstream the changes will be applied only after receiving the last > stream." Changed accordingly. > 7. For 2PC transactions, I think we should add the delay in > DecodePrerpare. Because after receiving the PREPARE, the downstream > will apply the xact. In this case, we shouldn't add a delay for the > commit_prepared. Right, the transaction will be end when it receive PREPARE. Fixed. I've tested locally and the delay seemed to be occurred at PREPARE phase. > 8. 
> +# > +# If the subscription sets min_send_delay parameter, the logical replication > +# worker will delay the transaction apply for min_send_delay milliseconds. > > I think here also comments should be updated as per the changed > approach for applying the delay on the publisher side. Fixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
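Putting review comments 6 and 7 together, the decoder ends up with exactly two hook points; a minimal sketch is below (it reuses the ctx->delay member from the hunk quoted above and adds the min_send_delay guard suggested in comment 6; the variable names are assumptions):

    /* Sketch only: in DecodeCommit(), before sending the transaction downstream. */
    if (ctx->min_send_delay > 0 && ctx->delay)
        ctx->delay(ctx, commit_time);

    /*
     * Sketch only: in DecodePrepare(). The subscriber applies a two-phase
     * transaction at PREPARE, so the delay belongs here; the subsequent
     * COMMIT PREPARED is then sent without any additional delay.
     */
    if (ctx->min_send_delay > 0 && ctx->delay)
        ctx->delay(ctx, prepare_time);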
Here are some review comments for patch v3-0001. (I haven't looked at the test code yet) ====== Commit Message 1. If the subscription sets min_send_delay parameter, an apply worker passes the value to the publisher as an output plugin option. And then, the walsender will delay the transaction sending for given milliseconds. ~ 1a. "an apply worker" --> "the apply worker (via walrcv_startstreaming)". ~ 1b. "And then, the walsender" --> "The walsender" ~~~ 2. The combination of parallel streaming mode and min_send_delay is not allowed. This is because in parallel streaming mode, we start applying the transaction stream as soon as the first change arrives without knowing the transaction's prepare/commit time. Always waiting for the full 'min_send_delay' period might include unnecessary delay. ~ Is there another reason not to support this? Even if streaming + min_send_delay incurs some extra delay, is that a reason to reject outright the combination? What difference will the potential of a few extra seconds overhead make when min_send_delay is more likely to be far greater (e.g. minutes or hours)? ~~~ 3. The other possibility was to apply the delay at the end of the parallel apply transaction but that would cause issues related to resource bloat and locks being held for a long time. ~ Is this explanation still relevant now you are doing pub-side delays? ====== doc/src/sgml/catalogs.sgml 4. + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>subminsenddelay</structfield> <type>int4</type> + </para> + <para> + The minimum delay, in milliseconds, by the publisher to send changes + </para></entry> + </row> "by the publisher to send changes" --> "by the publisher before sending changes" ====== doc/src/sgml/logical-replication.sgml 5. + <para> + A publication can delay sending changes to the subscription by specifying + the <literal>min_send_delay</literal> subscription parameter. See + <xref linkend="sql-createsubscription"/> for details. + </para> ~ This description seemed backwards because IIUC the PUBLICATION has nothing to do with the delay really, the walsender is told what to do by the SUBSCRIPTION. Anyway, this paragraph is in the "Subscriber" section, so mentioning publications was a bit confusing. SUGGESTION A subscription can delay the receipt of changes by specifying the min_send_delay subscription parameter. See ... ====== doc/src/sgml/monitoring.sgml 6. + <row> + <entry><literal>WalSenderSendDelay</literal></entry> + <entry>Waiting while sending changes for time-delayed logical replication + in the WAL sender process.</entry> + </row> Should this say "Waiting before sending changes", instead of "Waiting while sending changes"? ====== doc/src/sgml/ref/create_subscription.sgml 7. + <para> + By default, the publisher sends changes as soon as possible. This + parameter allows the user to delay the publisher to send changes by + given time period. If the value is specified without units, it is + taken as milliseconds. The default is zero (no delay). See + <xref linkend="config-setting-names-values"/> for details on the + available valid time units. + </para> "to delay the publisher to send changes" --> "to delay changes" ~~~ 8. + <para> + The delay is effective only when the initial table synchronization + has been finished and the publisher decides to send a particular + transaction downstream. 
The delay does not take into account the + overhead of time spent in transferring the transaction, which means + that the arrival time at the subscriber may be delayed more than the + given time. + </para> I'm not sure about this mention about only "effective only when the initial table synchronization has been finished"... Now that the delay is pub-side I don't know that it is true anymore. The tablesync worker will try to synchronize with the apply worker. IIUC during this "synchronization" phase the apply worker might be getting delayed by its own walsender, so therefore the tablesync might also be delayed (due to syncing with the apply worker) won't it? ====== src/backend/commands/subscriptioncmds.c 9. + /* + * translator: the first %s is a string of the form "parameter > 0" + * and the second one is "option = value". + */ + errmsg("%s and %s are mutually exclusive options", + "min_send_delay > 0", "streaming = parallel")); + + } Excessive whitespace. ====== src/backend/replication/logical/worker.c 10. ApplyWorkerMain + /* + * Time-delayed logical replication does not support tablesync + * workers, so only the leader apply worker can request walsenders to + * apply delay on the publisher side. + */ + if (server_version >= 160000 && MySubscription->minsenddelay > 0) + options.proto.logical.min_send_delay = MySubscription->minsenddelay; "apply delay" --> "delay" ====== src/backend/replication/pgoutput/pgoutput.c 11. + errno = 0; + parsed = strtoul(strVal(defel->arg), &endptr, 10); + if (errno != 0 || *endptr != '\0') + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid min_send_delay"))); + + if (parsed > PG_INT32_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("min_send_delay \"%s\" out of range", + strVal(defel->arg)))); Should the validation be also checking/asserting no negative numbers, or actually should the min_send_delay be defined as a uint32 in the first place? ~~~ 12. pgoutput_startup @@ -501,6 +528,15 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt, else ctx->twophase_opt_given = true; + if (data->min_send_delay && + data->protocol_version < LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("requested proto_version=%d does not support delay sending data, need %d or higher", + data->protocol_version, LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM))); + else + ctx->min_send_delay = data->min_send_delay; IMO it doesn't make sense to compare this new feature with the unrelated LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM protocol version. I think we should define a new constant LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM (even if it has the same value as the LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM). ====== src/backend/replication/walsender.c 13. WalSndDelay + long diffms; + long timeout_interval_ms; IMO some more informative name for these would make the code read better: 'diffms' --> 'remaining_wait_time_ms' 'timeout_interval_ms' --> 'timeout_sleeptime_ms' ~~~ 14. + /* Sleep until we get reply from worker or we time out */ + WalSndWait(WL_SOCKET_READABLE, + Min(timeout_interval_ms, diffms), + WAIT_EVENT_WALSENDER_SEND_DELAY); Sorry, I didn't understand this comment "reply from worker"... AFAIK here we are just sleeping, not waiting for replies from anywhere (???) ====== src/include/replication/logical.h 15. 
@@ -64,6 +68,7 @@ typedef struct LogicalDecodingContext LogicalOutputPluginWriterPrepareWrite prepare_write; LogicalOutputPluginWriterWrite write; LogicalOutputPluginWriterUpdateProgress update_progress; + LogicalOutputPluginWriterDelay delay; ~ 15a. Question: Is there some advantage to introducing another callback, instead of just doing the delay inline? ~ 15b. Should this be a more informative member name like 'delay_send'? ~~~ 16. @@ -100,6 +105,8 @@ typedef struct LogicalDecodingContext */ bool twophase_opt_given; + int32 min_send_delay; + Missing comment for this new member. ------ Kind Regards, Peter Smith. Fujitsu Australia
Here are some review comments for the v3-0001 test code. ====== src/test/regress/sql/subscription.sql 1. +-- fail - utilizing streaming = parallel with time-delayed replication is not supported +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_send_delay = 123); "utilizing" --> "specifying" ~~~ 2. +-- success -- min_send_delay value without unit is take as milliseconds +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexit' PUBLICATION testpub WITH (connect = false, min_send_delay = 123); +\dRs+ "without unit is take as" --> "without units is taken as" ~~~ 3. +-- success -- min_send_delay value with unit is converted into ms and stored as an integer +ALTER SUBSCRIPTION regress_testsub SET (min_send_delay = '1 d'); +\dRs+ "with unit is converted into ms" --> "with units other than ms is converted to ms" ~~~ 4. Missing tests? Why have the previous ALTER SUBSCRIPTION tests been removed? AFAIK, currently, there are no regression tests for error messages like: test_sub=# ALTER SUBSCRIPTION sub1 SET (min_send_delay = 123); ERROR: cannot set min_send_delay for subscription in parallel streaming mode ====== src/test/subscription/t/001_rep_changes.pl 5. +# This test is successful if and only if the LSN has been applied with at least +# the configured apply delay. +ok( time() - $publisher_insert_time >= $delay, + "subscriber applies WAL only after replication delay for non-streaming transaction" +); It's not strictly an "apply delay". Maybe this comment only needs to say like below: SUGGESTION # This test is successful only if at least the configured delay has elapsed. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Feb 21, 2023 at 3:31 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > 2. > The combination of parallel streaming mode and min_send_delay is not allowed. > This is because in parallel streaming mode, we start applying the transaction > stream as soon as the first change arrives without knowing the transaction's > prepare/commit time. Always waiting for the full 'min_send_delay' period might > include unnecessary delay. > > ~ > > Is there another reason not to support this? > > Even if streaming + min_send_delay incurs some extra delay, is that a > reason to reject outright the combination? What difference will the > potential of a few extra seconds overhead make when min_send_delay is > more likely to be far greater (e.g. minutes or hours)? > I think the point is that we don't know the commit time at the start of streaming and even the transaction can be quite long in which case adding the delay is not expected. > > ====== > doc/src/sgml/catalogs.sgml > > 4. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>subminsenddelay</structfield> <type>int4</type> > + </para> > + <para> > + The minimum delay, in milliseconds, by the publisher to send changes > + </para></entry> > + </row> > > "by the publisher to send changes" --> "by the publisher before sending changes" > For the streaming (=on) case, we may end up sending changes before we start to apply delay. > ====== > doc/src/sgml/monitoring.sgml > > 6. > + <row> > + <entry><literal>WalSenderSendDelay</literal></entry> > + <entry>Waiting while sending changes for time-delayed logical replication > + in the WAL sender process.</entry> > + </row> > > Should this say "Waiting before sending changes", instead of "Waiting > while sending changes"? > In the streaming (non-parallel) case, we may have sent some changes before wait as we wait only at commit/prepare time. The downstream won't apply such changes till commit. So, this description makes sense and this matches similar nearby descriptions. > > 8. > + <para> > + The delay is effective only when the initial table synchronization > + has been finished and the publisher decides to send a particular > + transaction downstream. The delay does not take into account the > + overhead of time spent in transferring the transaction, which means > + that the arrival time at the subscriber may be delayed more than the > + given time. > + </para> > > I'm not sure about this mention about only "effective only when the > initial table synchronization has been finished"... Now that the delay > is pub-side I don't know that it is true anymore. > This will still be true because we don't wait during the initial copy (sync). The delay happens only when the replication starts. > ====== > src/backend/commands/subscriptioncmds.c > ====== > src/backend/replication/pgoutput/pgoutput.c > > 11. > + errno = 0; > + parsed = strtoul(strVal(defel->arg), &endptr, 10); > + if (errno != 0 || *endptr != '\0') > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid min_send_delay"))); > + > + if (parsed > PG_INT32_MAX) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("min_send_delay \"%s\" out of range", > + strVal(defel->arg)))); > > Should the validation be also checking/asserting no negative numbers, > or actually should the min_send_delay be defined as a uint32 in the > first place? > I don't see the need to change the datatype of min_send_delay as compared to what we have min_apply_delay. 
> ====== > src/include/replication/logical.h > > 15. > @@ -64,6 +68,7 @@ typedef struct LogicalDecodingContext > LogicalOutputPluginWriterPrepareWrite prepare_write; > LogicalOutputPluginWriterWrite write; > LogicalOutputPluginWriterUpdateProgress update_progress; > + LogicalOutputPluginWriterDelay delay; > > ~ > > 15a. > Question: Is there some advantage to introducing another callback, > instead of just doing the delay inline? > This is required because we need to check walsender's timeout and or process replies during the delay. -- With Regards, Amit Kapila.
Dear Peter, Thank you for reviewing! PSA new version. > 1. > If the subscription sets min_send_delay parameter, an apply worker passes the > value to the publisher as an output plugin option. And then, the walsender will > delay the transaction sending for given milliseconds. > > ~ > > 1a. > "an apply worker" --> "the apply worker (via walrcv_startstreaming)". > > ~ > > 1b. > "And then, the walsender" --> "The walsender" Fixed. > 2. > The combination of parallel streaming mode and min_send_delay is not allowed. > This is because in parallel streaming mode, we start applying the transaction > stream as soon as the first change arrives without knowing the transaction's > prepare/commit time. Always waiting for the full 'min_send_delay' period might > include unnecessary delay. > > ~ > > Is there another reason not to support this? > > Even if streaming + min_send_delay incurs some extra delay, is that a > reason to reject outright the combination? What difference will the > potential of a few extra seconds overhead make when min_send_delay is > more likely to be far greater (e.g. minutes or hours)? Another case I came up with is that streaming transactions are come continuously. If there are many transactions to be streamed, the walsender must delay to send for every transactions, for the given period. It means that arrival of transactions at the subscriber may delay for approximately min_send_delay x # of transactions. > 3. > The other possibility was to apply the delay at the end of the parallel apply > transaction but that would cause issues related to resource bloat and > locks being > held for a long time. > > ~ > > Is this explanation still relevant now you are doing pub-side delays? Slightly reworded. I think the problem may be occurred if we delay sending COMMIT record for parallel applied transactions. > doc/src/sgml/catalogs.sgml > > 4. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>subminsenddelay</structfield> <type>int4</type> > + </para> > + <para> > + The minimum delay, in milliseconds, by the publisher to send changes > + </para></entry> > + </row> > > "by the publisher to send changes" --> "by the publisher before sending changes" As Amit said[1], there is a possibility to delay after sending delay. So I changed to "before sending COMMIT record". How do you think? > doc/src/sgml/logical-replication.sgml > > 5. > + <para> > + A publication can delay sending changes to the subscription by specifying > + the <literal>min_send_delay</literal> subscription parameter. See > + <xref linkend="sql-createsubscription"/> for details. > + </para> > > ~ > > This description seemed backwards because IIUC the PUBLICATION has > nothing to do with the delay really, the walsender is told what to do > by the SUBSCRIPTION. Anyway, this paragraph is in the "Subscriber" > section, so mentioning publications was a bit confusing. > > SUGGESTION > A subscription can delay the receipt of changes by specifying the > min_send_delay subscription parameter. See ... Changed. > doc/src/sgml/monitoring.sgml > > 6. > + <row> > + <entry><literal>WalSenderSendDelay</literal></entry> > + <entry>Waiting while sending changes for time-delayed logical > replication > + in the WAL sender process.</entry> > + </row> > > Should this say "Waiting before sending changes", instead of "Waiting > while sending changes"? Per discussion[1], I did not fix. > doc/src/sgml/ref/create_subscription.sgml > > 7. 
> + <para> > + By default, the publisher sends changes as soon as possible. This > + parameter allows the user to delay the publisher to send changes by > + given time period. If the value is specified without units, it is > + taken as milliseconds. The default is zero (no delay). See > + <xref linkend="config-setting-names-values"/> for details on the > + available valid time units. > + </para> > > "to delay the publisher to send changes" --> "to delay changes" Fixed. > 8. > + <para> > + The delay is effective only when the initial table synchronization > + has been finished and the publisher decides to send a particular > + transaction downstream. The delay does not take into account the > + overhead of time spent in transferring the transaction, which means > + that the arrival time at the subscriber may be delayed more than the > + given time. > + </para> > > I'm not sure about this mention about only "effective only when the > initial table synchronization has been finished"... Now that the delay > is pub-side I don't know that it is true anymore. The tablesync worker > will try to synchronize with the apply worker. IIUC during this > "synchronization" phase the apply worker might be getting delayed by > its own walsender, so therefore the tablesync might also be delayed > (due to syncing with the apply worker) won't it? I tested and checked codes. First of all, the tablesync worker request to send WALs without min_send_delay, so changes will be sent and applied with no delays. In this meaning, the table synchronization has not been affected by the feature. While checking, however, there is a possibility that the state of table will be delayed to get 'readly' because the changing of status from SYNCDONE from READY is done by apply worker. It may lead that two-phase will be delayed in getting to "enabled". I added descriptions about it. > src/backend/commands/subscriptioncmds.c > > 9. > + /* > + * translator: the first %s is a string of the form "parameter > 0" > + * and the second one is "option = value". > + */ > + errmsg("%s and %s are mutually exclusive options", > + "min_send_delay > 0", "streaming = parallel")); > + > + > } > > Excessive whitespace. Adjusted. > src/backend/replication/logical/worker.c > > 10. ApplyWorkerMain > > + /* > + * Time-delayed logical replication does not support tablesync > + * workers, so only the leader apply worker can request walsenders to > + * apply delay on the publisher side. > + */ > + if (server_version >= 160000 && MySubscription->minsenddelay > 0) > + options.proto.logical.min_send_delay = MySubscription->minsenddelay; > > "apply delay" --> "delay" Fixed. > src/backend/replication/pgoutput/pgoutput.c > > 11. > + errno = 0; > + parsed = strtoul(strVal(defel->arg), &endptr, 10); > + if (errno != 0 || *endptr != '\0') > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid min_send_delay"))); > + > + if (parsed > PG_INT32_MAX) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("min_send_delay \"%s\" out of range", > + strVal(defel->arg)))); > > Should the validation be also checking/asserting no negative numbers, > or actually should the min_send_delay be defined as a uint32 in the > first place? I think you are right because min_apply_delay does not have related code. we must consider additional possibility that user may send START_REPLICATION by hand and it has minus value. Fixed. > 12. 
pgoutput_startup > > @@ -501,6 +528,15 @@ pgoutput_startup(LogicalDecodingContext *ctx, > OutputPluginOptions *opt, > else > ctx->twophase_opt_given = true; > > + if (data->min_send_delay && > + data->protocol_version < > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("requested proto_version=%d does not support delay sending > data, need %d or higher", > + data->protocol_version, > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM))); > + else > + ctx->min_send_delay = data->min_send_delay; > > > IMO it doesn't make sense to compare this new feature with the > unrelated LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM protocol > version. I think we should define a new constant > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM (even if it has the > same > value as the LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM). Added. > src/backend/replication/walsender.c > > 13. WalSndDelay > > + long diffms; > + long timeout_interval_ms; > > IMO some more informative name for these would make the code read better: > > 'diffms' --> 'remaining_wait_time_ms' > 'timeout_interval_ms' --> 'timeout_sleeptime_ms' Changed. > 14. > + /* Sleep until we get reply from worker or we time out */ > + WalSndWait(WL_SOCKET_READABLE, > + Min(timeout_interval_ms, diffms), > + WAIT_EVENT_WALSENDER_SEND_DELAY); > > Sorry, I didn't understand this comment "reply from worker"... AFAIK > here we are just sleeping, not waiting for replies from anywhere (???) > > ====== > src/include/replication/logical.h > > 15. > @@ -64,6 +68,7 @@ typedef struct LogicalDecodingContext > LogicalOutputPluginWriterPrepareWrite prepare_write; > LogicalOutputPluginWriterWrite write; > LogicalOutputPluginWriterUpdateProgress update_progress; > + LogicalOutputPluginWriterDelay delay; > > ~ > > 15a. > Question: Is there some advantage to introducing another callback, > instead of just doing the delay inline? IIUC functions related with walsender should not be called directly, because there is a possibility that replication slots are manipulated from the backed. > 15b. > Should this be a more informative member name like 'delay_send'? Changed. > 16. > @@ -100,6 +105,8 @@ typedef struct LogicalDecodingContext > */ > bool twophase_opt_given; > > + int32 min_send_delay; > + > > Missing comment for this new member. Added. [1]: https://www.postgresql.org/message-id/CAA4eK1+JwLAVAOphnZ1YTiEV_jOMAE6JgJmBE98oek2cg7XF0w@mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
Dear Peter, > 1. > +-- fail - utilizing streaming = parallel with time-delayed > replication is not supported > +CREATE SUBSCRIPTION regress_testsub CONNECTION > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = > false, streaming = parallel, min_send_delay = 123); > > "utilizing" --> "specifying" Fixed. > 2. > +-- success -- min_send_delay value without unit is take as milliseconds > +CREATE SUBSCRIPTION regress_testsub CONNECTION > 'dbname=regress_doesnotexit' PUBLICATION testpub WITH (connect = > false, min_send_delay = 123); > +\dRs+ > > "without unit is take as" --> "without units is taken as" Fixed. > 3. > +-- success -- min_send_delay value with unit is converted into ms and > stored as an integer > +ALTER SUBSCRIPTION regress_testsub SET (min_send_delay = '1 d'); > +\dRs+ > > > "with unit is converted into ms" --> "with units other than ms is > converted to ms" Fixed. > 4. Missing tests? > > Why have the previous ALTER SUBSCRIPTION tests been removed? AFAIK, > currently, there are no regression tests for error messages like: > > test_sub=# ALTER SUBSCRIPTION sub1 SET (min_send_delay = 123); > ERROR: cannot set min_send_delay for subscription in parallel streaming mode These tests were missed while changing the basic design. Added. > src/test/subscription/t/001_rep_changes.pl > > 5. > +# This test is successful if and only if the LSN has been applied with at least > +# the configured apply delay. > +ok( time() - $publisher_insert_time >= $delay, > + "subscriber applies WAL only after replication delay for > non-streaming transaction" > +); > > It's not strictly an "apply delay". Maybe this comment only needs to > say like below: > > SUGGESTION > # This test is successful only if at least the configured delay has elapsed. Changed. New patch is available on [1]. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866C6BCA4D9386D9C486033F5A59%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Amit, Thank you for commenting! > > 8. > > + <para> > > + The delay is effective only when the initial table synchronization > > + has been finished and the publisher decides to send a particular > > + transaction downstream. The delay does not take into account the > > + overhead of time spent in transferring the transaction, which > means > > + that the arrival time at the subscriber may be delayed more than > the > > + given time. > > + </para> > > > > I'm not sure about this mention about only "effective only when the > > initial table synchronization has been finished"... Now that the delay > > is pub-side I don't know that it is true anymore. > > > > This will still be true because we don't wait during the initial copy > (sync). The delay happens only when the replication starts. Maybe this depends on the definition of initial copy and sync. I checked and added descriptions in [1]. > > 11. > > + errno = 0; > > + parsed = strtoul(strVal(defel->arg), &endptr, 10); > > + if (errno != 0 || *endptr != '\0') > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("invalid min_send_delay"))); > > + > > + if (parsed > PG_INT32_MAX) > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("min_send_delay \"%s\" out of range", > > + strVal(defel->arg)))); > > > > Should the validation be also checking/asserting no negative numbers, > > or actually should the min_send_delay be defined as a uint32 in the > > first place? > > > > I don't see the need to change the datatype of min_send_delay as > compared to what we have min_apply_delay. I think it is OK to change "long" to "unsigned long", because we use strtoul() for reading and should reject negative values. Of course we can modify them, but I want to keep consistency with the proto_version part. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866C6BCA4D9386D9C486033F5A59@TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Here are some very minor review comments for the patch v4-0001 ====== Commit Message 1. The other possibility was to apply the delay at the end of the parallel apply transaction but that would cause issues related to resource bloat and locks being held for a long time. ~ The reply [1] for review comment #2 says that this was "slightly reworded", but AFAICT nothing is changed here. ~~~ 2. Eariler versions were written by Euler Taveira, Takamichi Osumi, and Kuroda Hayato Typo: "Eariler" ====== doc/src/sgml/ref/create_subscription.sgml 3. + <para> + By default, the publisher sends changes as soon as possible. This + parameter allows the user to delay changes by given time period. If + the value is specified without units, it is taken as milliseconds. + The default is zero (no delay). See <xref linkend="config-setting-names-values"/> + for details on the available valid time units. + </para> "by given time period" --> "by the given time period" ====== src/backend/replication/pgoutput/pgoutput.c 4. parse_output_parameters + else if (strcmp(defel->defname, "min_send_delay") == 0) + { + unsigned long parsed; + char *endptr; I think 'parsed' is a fairly meaningless variable name. How about calling this variable something useful like 'delay_val' or 'min_send_delay_value', or something like those? Yes, I recognize that you copied this from some existing code fragment, but IMO that doesn't make it good. ====== src/backend/replication/walsender.c 5. + /* Sleep until we get reply from worker or we time out */ + WalSndWait(WL_SOCKET_READABLE, + Min(timeout_sleeptime_ms, remaining_wait_time_ms), + WAIT_EVENT_WALSENDER_SEND_DELAY); In my previous review [2] comment #14, I questioned if this comment was correct. It looks like that was accidentally missed. ====== src/include/replication/logical.h 6. + /* + * The minimum delay, in milliseconds, by the publisher before sending + * COMMIT/PREPARE record + */ + int32 min_send_delay; The comment is missing a period. ------ [1] Kuroda-san replied to my review v3-0001. https://www.postgresql.org/message-id/TYAPR01MB5866C6BCA4D9386D9C486033F5A59%40TYAPR01MB5866.jpnprd01.prod.outlook.com [2] My previous review v3-0001. https://www.postgresql.org/message-id/CAHut%2BPu6Y%2BBkYKg6MYGi2zGnx6imeK4QzxBVhpQoPMeDr1npnQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Dear Peter, Thank you for reviewing! PSA new version. > 1. > The other possibility was to apply the delay at the end of the parallel apply > transaction but that would cause issues related to resource bloat and > locks being > held for a long time. > > ~ > > The reply [1] for review comment #2 says that this was "slightly > reworded", but AFAICT nothing is changed here. Oh, my git operation might be wrong and it was disappeared. Sorry for inconvenience, reworded again. > 2. > Eariler versions were written by Euler Taveira, Takamichi Osumi, and > Kuroda Hayato > > Typo: "Eariler" Fixed. > ====== > doc/src/sgml/ref/create_subscription.sgml > > 3. > + <para> > + By default, the publisher sends changes as soon as possible. This > + parameter allows the user to delay changes by given time period. If > + the value is specified without units, it is taken as milliseconds. > + The default is zero (no delay). See <xref > linkend="config-setting-names-values"/> > + for details on the available valid time units. > + </para> > > "by given time period" --> "by the given time period" Fixed. > src/backend/replication/pgoutput/pgoutput.c > > 4. parse_output_parameters > > + else if (strcmp(defel->defname, "min_send_delay") == 0) > + { > + unsigned long parsed; > + char *endptr; > > I think 'parsed' is a fairly meaningless variable name. How about > calling this variable something useful like 'delay_val' or > 'min_send_delay_value', or something like those? Yes, I recognize that > you copied this from some existing code fragment, but IMO that doesn't > make it good. OK, changed to 'delay_val'. > > ====== > src/backend/replication/walsender.c > > 5. > + /* Sleep until we get reply from worker or we time out */ > + WalSndWait(WL_SOCKET_READABLE, > + Min(timeout_sleeptime_ms, remaining_wait_time_ms), > + WAIT_EVENT_WALSENDER_SEND_DELAY); > > In my previous review [2] comment #14, I questioned if this comment > was correct. It looks like that was accidentally missed. Sorry, I missed that. But I think this does not have to be changed. Important point here is that WalSndWait() is used, not WaitLatch(). According to comment atop WalSndWait(), the function waits till following events: - the socket becomes readable or writable - a timeout occurs Logical walsender process is always connected to worker, so the socket becomes readable when apply worker sends feedback message. That's why I wrote "Sleep until we get reply from worker or we time out". > src/include/replication/logical.h > > 6. > + /* > + * The minimum delay, in milliseconds, by the publisher before sending > + * COMMIT/PREPARE record > + */ > + int32 min_send_delay; > > The comment is missing a period. Right, added. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Tue, Feb 21, 2023 at 1:28 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > doc/src/sgml/catalogs.sgml > > > > 4. > > + <row> > > + <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>subminsenddelay</structfield> <type>int4</type> > > + </para> > > + <para> > > + The minimum delay, in milliseconds, by the publisher to send changes > > + </para></entry> > > + </row> > > > > "by the publisher to send changes" --> "by the publisher before sending changes" > > As Amit said[1], there is a possibility to delay after sending delay. So I changed to > "before sending COMMIT record". How do you think? > I think it would be better to say: "The minimum delay, in milliseconds, by the publisher before sending all the changes". If you agree then similar change is required in below comment as well: + /* + * The minimum delay, in milliseconds, by the publisher before sending + * COMMIT/PREPARE record. + */ + int32 min_send_delay; + > > > src/backend/replication/pgoutput/pgoutput.c > > > > 11. > > + errno = 0; > > + parsed = strtoul(strVal(defel->arg), &endptr, 10); > > + if (errno != 0 || *endptr != '\0') > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("invalid min_send_delay"))); > > + > > + if (parsed > PG_INT32_MAX) > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("min_send_delay \"%s\" out of range", > > + strVal(defel->arg)))); > > > > Should the validation be also checking/asserting no negative numbers, > > or actually should the min_send_delay be defined as a uint32 in the > > first place? > > I think you are right because min_apply_delay does not have related code. > we must consider additional possibility that user may send START_REPLICATION > by hand and it has minus value. > Fixed. > Your reasoning for adding the additional check seems good to me but I don't see it in the patch. The check as I see is as below: + if (delay_val > PG_INT32_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("min_send_delay \"%s\" out of range", + strVal(defel->arg)))); Am, I missing something, and the new check is at some other place? + has been finished. However, there is a possibility that the table + status written in <link linkend="catalog-pg-subscription-rel"><structname>pg_subscription_rel</structname></link> + will be delayed in getting to "ready" state, and also two-phase + (if specified) will be delayed in getting to "enabled". + </para> There appears to be a special value <0x00> after "ready". I think that is added by mistake or probably you have used some editor which has added this value. Can we slightly reword this to: "However, there is a possibility that the table status updated in <link linkend="catalog-pg-subscription-rel"><structname>pg_subscription_rel</structname></link> could be delayed in getting to the "ready" state, and also two-phase (if specified) could be delayed in getting to "enabled"."? -- With Regards, Amit Kapila.
Dear Amit, Thank you for reviewing! PSA new version. > I think it would be better to say: "The minimum delay, in > milliseconds, by the publisher before sending all the changes". If you > agree then similar change is required in below comment as well: > + /* > + * The minimum delay, in milliseconds, by the publisher before sending > + * COMMIT/PREPARE record. > + */ > + int32 min_send_delay; OK, both of them were fixed. > > > Should the validation be also checking/asserting no negative numbers, > > > or actually should the min_send_delay be defined as a uint32 in the > > > first place? > > > > I think you are right because min_apply_delay does not have related code. > > we must consider additional possibility that user may send > START_REPLICATION > > by hand and it has minus value. > > Fixed. > > > > Your reasoning for adding the additional check seems good to me but I > don't see it in the patch. The check as I see is as below: > + if (delay_val > PG_INT32_MAX) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("min_send_delay \"%s\" out of range", > + strVal(defel->arg)))); > > Am, I missing something, and the new check is at some other place? For extracting value from the string, strtoul() is used. This is an important point. ``` delay_val = strtoul(strVal(defel->arg), &endptr, 10); ``` If user specifies min_send_delay as '-1', the value is read as a bit string '0xFFFFFFFFFFFFFFFF', and it is interpreted as PG_UINT64_MAX. After that such a strange value is rejected by the part you copied. I have tested the case and it has correctly rejected. ``` postgres=# START_REPLICATION SLOT "sub" LOGICAL 0/0 (min_send_delay '-1'); ERROR: min_send_delay "-1" out of range CONTEXT: slot "sub", output plugin "pgoutput", in the startup callback ``` > + has been finished. However, there is a possibility that the table > + status written in <link > linkend="catalog-pg-subscription-rel"><structname>pg_subscription_rel</stru > ctname></link> > + will be delayed in getting to "ready" state, and also two-phase > + (if specified) will be delayed in getting to "enabled". > + </para> > > There appears to be a special value <0x00> after "ready". I think that > is added by mistake or probably you have used some editor which has > added this value. Can we slightly reword this to: "However, there is a > possibility that the table status updated in <link > linkend="catalog-pg-subscription-rel"><structname>pg_subscription_rel</stru > ctname></link> > could be delayed in getting to the "ready" state, and also two-phase > (if specified) could be delayed in getting to "enabled"."? Oh, my Visual Studio Code did not detect the strange character. And reworded accordingly. Additionally, I modified the commit message to describe more clearly the reason why the do not allow combination of min_send_delay and streaming = parallel. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
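A minimal standalone sketch of the strtoul() behaviour discussed above (not part of the posted patch): strtoul() accepts a leading '-' and negates the converted value, so "-1" comes back as ULONG_MAX (0xFFFFFFFFFFFFFFFF on a 64-bit platform) and is then caught by the same out-of-range check. The PG_INT32_MAX definition is copied in only so the example compiles on its own.
```
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define PG_INT32_MAX (0x7FFFFFFF)	/* same value as PostgreSQL's c.h */

int
main(void)
{
	const char *inputs[] = {"123", "2147483648", "-1", "10abc"};

	for (int i = 0; i < 4; i++)
	{
		char	   *endptr;
		unsigned long delay_val;

		errno = 0;
		delay_val = strtoul(inputs[i], &endptr, 10);

		if (errno != 0 || *endptr != '\0')
			printf("%-10s -> invalid min_send_delay\n", inputs[i]);
		else if (delay_val > PG_INT32_MAX)
			printf("%-10s -> out of range (parsed as %lu)\n",
				   inputs[i], delay_val);
		else
			printf("%-10s -> accepted (%lu ms)\n", inputs[i], delay_val);
	}
	return 0;
}
```
On a platform with a 32-bit unsigned long the "-1" input still parses as 0xFFFFFFFF, which is also greater than PG_INT32_MAX, so it is rejected either way.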
Patch v6 LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Feb 22, 2023 9:48 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > Thank you for reviewing! PSA new version. > Thanks for your patch. Here is a comment. + elog(DEBUG2, "time-delayed replication for txid %u, delay_time = %d ms, remaining wait time: %ld ms", + ctx->write_xid, (int) ctx->min_send_delay, + remaining_wait_time_ms); I tried this and saw that the xid here looks wrong: what it got is the xid of the previous transaction. It seems `ctx->write_xid` has not been updated and we can't use it. Regards, Shi Yu
Dear Shi, Thank you for reviewing! PSA new version. > + elog(DEBUG2, "time-delayed replication for txid %u, delay_time > = %d ms, remaining wait time: %ld ms", > + ctx->write_xid, (int) ctx->min_send_delay, > + remaining_wait_time_ms); > > I tried this and saw that the xid here looks wrong: what it got is the xid of the > previous transaction. It seems `ctx->write_xid` has not been updated and we > can't use it. > Good catch. There are several approaches to fix that; I chose the simplest one: the TransactionId was added as an argument of the functions. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shi, > > Thank you for reviewing! PSA new version. > > > + elog(DEBUG2, "time-delayed replication for txid %u, delay_time > > = %d ms, remaining wait time: %ld ms", > > + ctx->write_xid, (int) ctx->min_send_delay, > > + remaining_wait_time_ms); > > > > I tried and saw that the xid here looks wrong, what it got is the xid of the > > previous transaction. It seems `ctx->write_xid` has not been updated and we > > can't use it. > > > > Good catch. There are several approaches to fix that, I choose the simplest way. > TransactionId was added as an argument of functions. > Thank you for updating the patch. Here are some comments on v7 patch: + * + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol version + * with support for delaying to send transactions. Introduced in PG16. */ #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 #define LOGICALREP_PROTO_VERSION_NUM 1 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3 #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4 -#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4 +#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM What is the usecase of the old macro, LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this way, we will end up adding macros every time when adding a new option, which seems not a good idea. I'm really not sure we need to change the protocol version or the macro. Commit 366283961ac0ed6d89014444c6090f3fd02fce0a introduced the 'origin' subscription parameter that is also sent to the publisher, but we didn't touch the protocol version at all. --- Why do we not to delay sending COMMIT PREPARED messages? --- + /* + * If we've requested to shut down, exit the process. + * + * Note that WalSndDone() cannot be used here because the delaying + * changes will be sent in the function. + */ + if (got_STOPPING) + WalSndShutdown(); Since the walsender exits without sending the done message at a server shutdown, we get the following log message on the subscriber: ERROR: could not receive data from WAL stream: server closed the connection unexpectedly I think that since the walsender is just waiting for sending data, it can send the done message if the socket is writable. --- + delayUntil = TimestampTzPlusMilliseconds(delay_start, ctx->min_send_delay); + remaining_wait_time_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil); + (snip) + + /* Sleep until appropriate time. */ + timeout_sleeptime_ms = WalSndComputeSleeptime(GetCurrentTimestamp()); I think it's better to call GetCurrentTimestamp() only once. --- +# This test is successful only if at least the configured delay has elapsed. +ok( time() - $publisher_insert_time >= $delay, + "subscriber applies WAL only after replication delay for non-streaming transaction" +); The subscriber doesn't actually apply WAL records, but logically replicated changes. How about "subscriber applies changes only after..."? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
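A rough standalone analogue (not the patch code) of the walsender-side wait under review: the deadline is computed from the decode time plus min_send_delay, and the process then sleeps in bounded naps until the deadline passes, reading the clock once per iteration as suggested above. clock_gettime() stands in for GetCurrentTimestamp() and nanosleep() for WalSndWait(); in the patch each nap is additionally capped by the walsender's keepalive sleep time so that replies and timeouts can still be handled.
```
#include <stdio.h>
#include <time.h>

static long
now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

static void
nap_ms(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&ts, NULL);
}

int
main(void)
{
	const long	min_send_delay = 1500;		/* ms, hypothetical setting */
	const long	timeout_sleeptime_ms = 500;	/* stand-in for the keepalive cap */
	long		delay_until = now_ms() + min_send_delay;

	for (;;)
	{
		/* single clock reading per iteration */
		long		now = now_ms();
		long		remaining_wait_time_ms = delay_until - now;

		if (remaining_wait_time_ms <= 0)
			break;

		printf("remaining wait: %ld ms\n", remaining_wait_time_ms);
		nap_ms(remaining_wait_time_ms < timeout_sleeptime_ms
			   ? remaining_wait_time_ms : timeout_sleeptime_ms);
	}

	printf("delay elapsed, the transaction would be sent now\n");
	return 0;
}
```
Built with a plain cc invocation, this prints the shrinking remaining wait a few times and then reports that the delay has elapsed.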
On Thu, Feb 23, 2023 at 5:40 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Thank you for reviewing! PSA new version. > I was trying to think if there is any better way to implement the newly added callback (WalSndDelay()) but couldn't find any. For example, one idea I tried to evaluate is whether we can merge it with the existing callback WalSndUpdateProgress() or maybe extract the part other than progress tracking from that function into a new callback and then try to reuse it here as well. Though there is some common functionality like checking for timeout and processing replies still they are different enough that they seem to need separate callbacks. The prime purpose of a callback for the patch being discussed here is to delay the xact before sending the commit/prepare whereas the existing callback (WalSndUpdateProgress()) or what we are discussing at [1] allows sending the keepalive message in some special cases where there is no communication between walsender and walreceiver. Now, the WalSndDelay() also tries to check for timeout and send keepalive if necessary but there is also dependency on the delay parameter, so don't think it is a good idea of trying to combine those functionalities into one API. Thoughts? [1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de -- With Regards, Amit Kapila.
On Mon, Feb 27, 2023 at 11:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Thank you for reviewing! PSA new version. > > > > > > Thank you for updating the patch. Here are some comments on v7 patch: > > + * > + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol version > + * with support for delaying to send transactions. Introduced in PG16. > */ > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3 > #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4 > -#define LOGICALREP_PROTO_MAX_VERSION_NUM > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM > +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4 > +#define LOGICALREP_PROTO_MAX_VERSION_NUM > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM > > What is the usecase of the old macro, > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this > way, we will end up adding macros every time when adding a new option, > which seems not a good idea. I'm really not sure we need to change the > protocol version or the macro. Commit > 366283961ac0ed6d89014444c6090f3fd02fce0a introduced the 'origin' > subscription parameter that is also sent to the publisher, but we > didn't touch the protocol version at all. > Right, I also don't see a reason to do anything for this. We have previously bumped the protocol version when we send extra/additional information from walsender but here that is not the requirement, so this change doesn't seem to be required. > --- > Why do we not to delay sending COMMIT PREPARED messages? > I think we need to either add delay for prepare or commit prepared as otherwise, it will lead to delaying the xact more than required. The patch seems to add a delay before sending a PREPARE as that is the time when the subscriber will apply the changes. -- With Regards, Amit Kapila.
Dear Amit, > I was trying to think if there is any better way to implement the > newly added callback (WalSndDelay()) but couldn't find any. For > example, one idea I tried to evaluate is whether we can merge it with > the existing callback WalSndUpdateProgress() or maybe extract the part > other than progress tracking from that function into a new callback > and then try to reuse it here as well. Though there is some common > functionality like checking for timeout and processing replies still > they are different enough that they seem to need separate callbacks. > The prime purpose of a callback for the patch being discussed here is > to delay the xact before sending the commit/prepare whereas the > existing callback (WalSndUpdateProgress()) or what we are discussing > at [1] allows sending the keepalive message in some special cases > where there is no communication between walsender and walreceiver. > Now, the WalSndDelay() also tries to check for timeout and send > keepalive if necessary but there is also dependency on the delay > parameter, so don't think it is a good idea of trying to combine those > functionalities into one API. > > Thoughts? > > [1] - > https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40 > awork3.anarazel.de Thank you for confirming. My understanding was that we should keep the current design. I agree with your posting. In the current callback and modified version in [1], sending keepalives is done via ProcessPendingWrites(). It is called by many functions and should not be changed, like adding end_time only for us. Moreover, the name is not suitable because time-delayed logical replication does not wait until the send buffer becomes empty. If we reconstruct WalSndUpdateProgress() and change mechanisms around that, codes will become dirty. As Amit said, in one path, the lag will be tracked and the walsender will wait until the buffer is empty. In another path, the lag calculation will be ignored, and the walsender will wait until the process spends time till a given period. Such a function is painful to read later. I think callbacks that have different purposes should not be mixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Feb 27, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 27, 2023 at 11:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Thank you for reviewing! PSA new version. > > > > > > > > > > Thank you for updating the patch. Here are some comments on v7 patch: > > > > + * > > + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol version > > + * with support for delaying to send transactions. Introduced in PG16. > > */ > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3 > > #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4 > > -#define LOGICALREP_PROTO_MAX_VERSION_NUM > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM > > +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4 > > +#define LOGICALREP_PROTO_MAX_VERSION_NUM > > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM > > > > What is the usecase of the old macro, > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding > > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this > > way, we will end up adding macros every time when adding a new option, > > which seems not a good idea. I'm really not sure we need to change the > > protocol version or the macro. Commit > > 366283961ac0ed6d89014444c6090f3fd02fce0a introduced the 'origin' > > subscription parameter that is also sent to the publisher, but we > > didn't touch the protocol version at all. > > > > Right, I also don't see a reason to do anything for this. We have > previously bumped the protocol version when we send extra/additional > information from walsender but here that is not the requirement, so > this change doesn't seem to be required. > > > --- > > Why do we not to delay sending COMMIT PREPARED messages? > > > > I think we need to either add delay for prepare or commit prepared as > otherwise, it will lead to delaying the xact more than required. Agreed. > The > patch seems to add a delay before sending a PREPARE as that is the > time when the subscriber will apply the changes. Considering the purpose of this feature mentioned in the commit message "particularly to fix errors that might cause data loss", delaying sending PREPARE would really help that situation? For example, even after (mistakenly) executing PREPARE for a transaction executing DELETE without WHERE clause on the publisher the user still can rollback the transaction. They don't lose data on both nodes yet. After executing (and replicating) COMMIT PREPARED for that transaction, they lose the data on both nodes. IIUC the time-delayed logical replication should help this situation by delaying sending COMMIT PREPARED so that, for example, the user can stop logical replication before COMMIT PREPARED message arrives to the subscriber. So I think we should delay sending COMMIT PREPARED (and COMMIT) instead of PREPARE. This would help users to correct data loss errors, and would be more consistent with what recovery_min_apply_delay does. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Dear Sawada-san, Amit, Thank you for reviewing! > + * > + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum > protocol version > + * with support for delaying to send transactions. Introduced in PG16. > */ > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3 > #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4 > -#define LOGICALREP_PROTO_MAX_VERSION_NUM > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM > +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4 > +#define LOGICALREP_PROTO_MAX_VERSION_NUM > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM > > What is the usecase of the old macro, > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this > way, we will end up adding macros every time when adding a new option, > which seems not a good idea. I'm really not sure we need to change the > protocol version or the macro. Commit > 366283961ac0ed6d89014444c6090f3fd02fce0a introduced the 'origin' > subscription parameter that is also sent to the publisher, but we > didn't touch the protocol version at all. I removed the protocol number. I checked the previous discussion [1]. According to it, the protocol version must be modified when a new message is added or existing messages are changed. This patch intentionally makes walsenders delay sending data, and no extra information is added in the process, so I think a version bump is not needed. > --- > Why do we not to delay sending COMMIT PREPARED messages? This was motivated by the comment [2], but I preferred your opinion [3]. Now COMMIT PREPARED is delayed instead of the PREPARE message. > --- > + /* > + * If we've requested to shut down, exit the process. > + * > + * Note that WalSndDone() cannot be used here because > the delaying > + * changes will be sent in the function. > + */ > + if (got_STOPPING) > + WalSndShutdown(); > > Since the walsender exits without sending the done message at a server > shutdown, we get the following log message on the subscriber: > > ERROR: could not receive data from WAL stream: server closed the > connection unexpectedly > > I think that since the walsender is just waiting for sending data, it > can send the done message if the socket is writable. You are right. I was confused with the previous implementation, in which workers could not accept any messages. I have made walsenders send the end-command message directly. Is that what you expected? > --- > + delayUntil = TimestampTzPlusMilliseconds(delay_start, > ctx->min_send_delay); > + remaining_wait_time_ms = > TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil); > + > (snip) > + > + /* Sleep until appropriate time. */ > + timeout_sleeptime_ms = > WalSndComputeSleeptime(GetCurrentTimestamp()); > > I think it's better to call GetCurrentTimestamp() only once. Right, fixed. > --- > +# This test is successful only if at least the configured delay has elapsed. > +ok( time() - $publisher_insert_time >= $delay, > + "subscriber applies WAL only after replication delay for > non-streaming transaction" > +); > > The subscriber doesn't actually apply WAL records, but logically > replicated changes. How about "subscriber applies changes only > after..."? I grepped other tests and could not find the same usage of the word "WAL", so I fixed it as you said. In the next version I will use a grammar checker like Chat-GPT to modify the commit messages... 
[1]: https://www.postgresql.org/message-id/CAA4eK1LjOm6-OHggYVH35dQ_v40jOXrJW0GFy3kuwTd2J48%3DUg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1K4uPbudrNdH%2B%3D_vN-Hpe9wYh%3D3vBS5Ww9dHn-LOWMV0g%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAD21AoA0mPq_m6USfAC8DAkvFfwjqGvGq++Uv=avryYotvq98A@mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Mon, Feb 27, 2023 at 1:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Feb 27, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > --- > > > Why do we not to delay sending COMMIT PREPARED messages? > > > > > > > I think we need to either add delay for prepare or commit prepared as > > otherwise, it will lead to delaying the xact more than required. > > Agreed. > > > The > > patch seems to add a delay before sending a PREPARE as that is the > > time when the subscriber will apply the changes. > > Considering the purpose of this feature mentioned in the commit > message "particularly to fix errors that might cause data loss", > delaying sending PREPARE would really help that situation? For > example, even after (mistakenly) executing PREPARE for a transaction > executing DELETE without WHERE clause on the publisher the user still > can rollback the transaction. They don't lose data on both nodes yet. > After executing (and replicating) COMMIT PREPARED for that > transaction, they lose the data on both nodes. IIUC the time-delayed > logical replication should help this situation by delaying sending > COMMIT PREPARED so that, for example, the user can stop logical > replication before COMMIT PREPARED message arrives to the subscriber. > So I think we should delay sending COMMIT PREPARED (and COMMIT) > instead of PREPARE. This would help users to correct data loss errors, > and would be more consistent with what recovery_min_apply_delay does. > The one difference w.r.t recovery_min_apply_delay is that here we will hold locks for the duration of the delay which didn't seem to be a good idea. This will also probably lead to more bloat as we will keep transactions open for a long time. Doing it before DecodePrepare won't have such problems. This is the reason that we decide to perform a delay at the start of the transaction instead at commit/prepare in the subscriber-side approach. -- With Regards, Amit Kapila.
At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > The one difference w.r.t recovery_min_apply_delay is that here we will > hold locks for the duration of the delay which didn't seem to be a > good idea. This will also probably lead to more bloat as we will keep > transactions open for a long time. Doing it before DecodePrepare won't I don't have a concrete picture but could we tell reorder buffer to retain a PREPAREd transaction until a COMMIT PREPARED comes? If delaying non-prepared transactions until COMMIT is adequate, then the same thing seems to work for prepared transactions. > have such problems. This is the reason that we decide to perform a > delay at the start of the transaction instead at commit/prepare in the > subscriber-side approach. It seems that there are no technical obstacles to do that on the publisher side. The only observable difference would be that relatively large prepared transactions may experience noticeable additional delays. IMHO I don't think it's a good practice protocol-wise to intentionally choke a stream at the receiving end when it has not been flow-controlled on the transmitting end. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > The one difference w.r.t recovery_min_apply_delay is that here we will > > hold locks for the duration of the delay which didn't seem to be a > > good idea. This will also probably lead to more bloat as we will keep > > transactions open for a long time. Doing it before DecodePrepare won't > > I don't have a concrete picture but could we tell reorder buffer to > retain a PREPAREd transaction until a COMMIT PREPARED comes? > Yeah, we could do that and that is what is the behavior unless the user enables 2PC via 'two_phase' subscription option. But, I don't see the need to unnecessarily delay the prepare till the commit if a user has specified 'two_phase' option. It is quite possible that COMMIT PREPARED happens at a much later time frame than the amount of delay the user is expecting. > If > delaying non-prepared transactions until COMMIT is adequate, then the > same thing seems to work for prepared transactions. > > > have such problems. This is the reason that we decide to perform a > > delay at the start of the transaction instead at commit/prepare in the > > subscriber-side approach. > > It seems that there are no technical obstacles to do that on the > publisher side. The only observable difference would be that > relatively large prepared transactions may experience noticeable > additional delays. IMHO I don't think it's a good practice > protocol-wise to intentionally choke a stream at the receiving end > when it has not been flow-controlled on the transmitting end. > But in this proposal, we are not choking/delaying anything on the receiving end. -- With Regards, Amit Kapila.
On Mon, Feb 27, 2023 at 2:21 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > Few comments: 1. + /* + * If we've requested to shut down, exit the process. + * + * Note that WalSndDone() cannot be used here because the delaying + * changes will be sent in the function. + */ + if (got_STOPPING) + { + QueryCompletion qc; + + /* Inform the standby that XLOG streaming is done */ + SetQueryCompletion(&qc, CMDTAG_COPY, 0); + EndCommand(&qc, DestRemote, false); + pq_flush(); Do we really need to do anything except break the loop and let the exit handling happen in the main loop when 'got_STOPPING' is set? AFAICS, this is what we are doing in some other places (See WalSndWaitForWal). Won't that work? It seems that would help us send all the pending WAL. 2. + /* Try to flush pending output to the client */ + if (pq_flush_if_writable() != 0) + WalSndShutdown(); Is there a reason to try flushing here? Apart from the above, I have made a few changes in the comments in the attached diff patch. If you agree with those then please include them in the next version. -- With Regards, Amit Kapila.
Attachment
Dear Amit, > Few comments: Thank you for reviewing! PSA new version. Note that the starting point of the delay for 2PC was not changed; I think that is still under discussion. > 1. > + /* > + * If we've requested to shut down, exit the process. > + * > + * Note that WalSndDone() cannot be used here because the delaying > + * changes will be sent in the function. > + */ > + if (got_STOPPING) > + { > + QueryCompletion qc; > + > + /* Inform the standby that XLOG streaming is done */ > + SetQueryCompletion(&qc, CMDTAG_COPY, 0); > + EndCommand(&qc, DestRemote, false); > + pq_flush(); > > Do we really need to do anything except break the loop and let > the exit handling happen in the main loop when 'got_STOPPING' is set? > AFAICS, this is what we are doing in some other places (See > WalSndWaitForWal). Won't that work? It seems that would help us send > all the pending WAL. If we exit the loop after got_STOPPING is set, as you said, the walsender will send the delayed changes and then exit. The behavior is the same as when WalSndDone() is called, but I think it is not suitable for the motivation of the feature. If users notice a mistaken operation like TRUNCATE, they must shut down the publisher once and then recover from a backup or from the old subscriber. If the walsender sends all pending changes, the mistaken operations will also be propagated to the subscriber and the data cannot be protected. So currently I want to keep this style. FYI - in the case of physical replication, received WAL is not applied when the secondary is shut down. > 2. > + /* Try to flush pending output to the client */ > + if (pq_flush_if_writable() != 0) > + WalSndShutdown(); > > Is there a reason to try flushing here? IIUC, if pq_flush_if_writable() returns non-zero (EOF), it means that something is wrong and the walsender has failed to send messages to the subscriber. On Linux, the call stack from pq_flush_if_writable() finally reaches the send() system call, and according to the man page [1], that failure can be triggered by some unexpected state or because the connection is closed. Based on the above, I think the returned value should be checked. > Apart from the above, I have made a few changes in the comments in the > attached diff patch. If you agree with those then please include them > in the next version. Thanks! I checked, and I think all of them should be included. Moreover, I used a grammar checker and slightly reworded the commit message. [1]: https://man7.org/linux/man-pages/man3/send.3p.html Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
Here are some review comments for v9-0001, but these are only very trivial. ====== Commit Message 1. Nitpick. The new text is jagged-looking. It should wrap at ~80 chars. ~~~ 2. 2. Another reason is for that parallel streaming, the transaction will be opened immediately by the parallel apply worker. Therefore, if the walsender is delayed in sending the final record of the transaction, the parallel apply worker must wait to receive it with an open transaction. This would result in the locks acquired during the transaction not being released until the min_send_delay has elapsed. ~ The text already said there are "two reasons", and already this is numbered as reason 2. So it doesn't need to keep saying "Another reason" here. "Another reason is for that parallel streaming" --> "For parallel streaming..." ====== src/backend/replication/walsender.c 3. WalSndDelay + /* die if timeout was reached */ + WalSndCheckTimeOut(); Other nearby comments start uppercase, so this should too. ====== src/include/replication/walreceiver.h 4. WalRcvStreamOptions @@ -187,6 +187,7 @@ typedef struct * prepare time */ char *origin; /* Only publish data originating from the * specified origin */ + int32 min_send_delay; /* The minimum send delay */ } logical; } proto; } WalRcvStreamOptions; ~ Should that comment mention the units are "(ms)" ------ Kind Regards, Peter Smith. Fujitsu Australia
At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > The one difference w.r.t recovery_min_apply_delay is that here we will > > > hold locks for the duration of the delay which didn't seem to be a > > > good idea. This will also probably lead to more bloat as we will keep > > > transactions open for a long time. Doing it before DecodePrepare won't > > > > I don't have a concrete picture but could we tell reorder buffer to > > retain a PREPAREd transaction until a COMMIT PREPARED comes? > > > > Yeah, we could do that and that is what is the behavior unless the > user enables 2PC via 'two_phase' subscription option. But, I don't see > the need to unnecessarily delay the prepare till the commit if a user > has specified 'two_phase' option. It is quite possible that COMMIT > PREPARED happens at a much later time frame than the amount of delay > the user is expecting. It looks like the user should decide between potential long locks or extra delays, and this choice ought to be documented. > > If > > delaying non-prepared transactions until COMMIT is adequate, then the > > same thing seems to work for prepared transactions. > > > > > have such problems. This is the reason that we decide to perform a > > > delay at the start of the transaction instead at commit/prepare in the > > > subscriber-side approach. > > > > It seems that there are no technical obstacles to do that on the > > publisher side. The only observable difference would be that > > relatively large prepared transactions may experience noticeable > > additional delays. IMHO I don't think it's a good practice > > protocol-wise to intentionally choke a stream at the receiving end > > when it has not been flow-controlled on the transmitting end. > > > > But in this proposal, we are not choking/delaying anything on the receiving end. I didn't say that to the latest patch. I interpreted the quote of your description as saying that the subscriber-side solution is effective in solving the long-lock problems, so I replied that that can be solved with the publisher-side solution and the subscriber-side solution could cause some unwanted behavior. Do you think we have decided to go with the publisher-side solution? I'm fine if so. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > > Few comments: > > Thank you for reviewing! PSA new version. > Note that the starting point of delay for 2PC was not changed, > I think it has been under discussion. > > > 1. > > + /* > > + * If we've requested to shut down, exit the process. > > + * > > + * Note that WalSndDone() cannot be used here because the delaying > > + * changes will be sent in the function. > > + */ > > + if (got_STOPPING) > > + { > > + QueryCompletion qc; > > + > > + /* Inform the standby that XLOG streaming is done */ > > + SetQueryCompletion(&qc, CMDTAG_COPY, 0); > > + EndCommand(&qc, DestRemote, false); > > + pq_flush(); > > > > Do we really need to do anything except for breaking the loop and let > > the exit handling happen in the main loop when 'got_STOPPING' is set? > > AFAICS, this is what we are doing in some other palces (See > > WalSndWaitForWal). Won't that work? It seems that will help us sending > > all the pending WAL. > > If we exit the loop after got_STOPPING is set, as you said, the walsender will > send delaying changes and then exit. The behavior is same as the case that WalSndDone() > is called. But I think it is not suitable for the motivation of the feature. > If users notice the miss operation like TRUNCATE, they must shut down the publisher > once and then recovery from back up or old subscriber. If the walsender sends all > pending changes, miss operations will be also propagated to subscriber and data > cannot be protected. So currently I want to keep the style. > FYI - In case of physical replication, received WALs are not applied when the > secondary is shutted down. > > > 2. > > + /* Try to flush pending output to the client */ > > + if (pq_flush_if_writable() != 0) > > + WalSndShutdown(); > > > > Is there a reason to try flushing here? > > IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there is a > trouble and walsender fails to send messages to subscriber. > > In Linux, the stuck trace from pq_flush_if_writable() will finally reach the send() system call. > And according to man page[1], it will be triggered by some unexpected state or the connection is closed. > > Based on above, I think the returned value should be confirmed. > > > Apart from the above, I have made a few changes in the comments in the > > attached diff patch. If you agree with those then please include them > > in the next version. > > Thanks! I checked and I thought all of them should be included. > > Moreover, I used grammar checker and slightly reworded the commit message. Thinking of side effects of this feature (no matter where we delay applying the changes), on the publisher, vacuum cannot collect garbage and WAL cannot be recycled. Is that okay in the first place? The point is that the subscription setting affects the publisher. That is, min_send_delay is specified on the subscriber but the symptoms that could ultimately lead to a server crash appear on the publisher, which sounds dangerous to me. Imagine a service or system like where there is a publication server and it's somewhat exposed so that a user (or a subsystem) arbitrarily can create a subscriber to replicate a subset of the data. A malicious user can have the publisher crash by creating a subscription with, say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation but it's -1 by default. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 1, 2023 at 8:06 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > > The one difference w.r.t recovery_min_apply_delay is that here we will > > > > hold locks for the duration of the delay which didn't seem to be a > > > > good idea. This will also probably lead to more bloat as we will keep > > > > transactions open for a long time. Doing it before DecodePrepare won't > > > > > > I don't have a concrete picture but could we tell reorder buffer to > > > retain a PREPAREd transaction until a COMMIT PREPARED comes? > > > > > > > Yeah, we could do that and that is what is the behavior unless the > > user enables 2PC via 'two_phase' subscription option. But, I don't see > > the need to unnecessarily delay the prepare till the commit if a user > > has specified 'two_phase' option. It is quite possible that COMMIT > > PREPARED happens at a much later time frame than the amount of delay > > the user is expecting. > > It looks like the user should decide between potential long locks or > extra delays, and this choice ought to be documented. > Sure, we can do that. However, I think the way this feature works is that we keep standby/subscriber behind the primary/publisher by a certain time period and if there is any unwanted transaction (say Delete * .. without where clause), we can recover it from the receiver side. So, it may not matter much even if we wait at PREPARE to avoid long locks instead of documenting it. > > > If > > > delaying non-prepared transactions until COMMIT is adequate, then the > > > same thing seems to work for prepared transactions. > > > > > > > have such problems. This is the reason that we decide to perform a > > > > delay at the start of the transaction instead at commit/prepare in the > > > > subscriber-side approach. > > > > > > It seems that there are no technical obstacles to do that on the > > > publisher side. The only observable difference would be that > > > relatively large prepared transactions may experience noticeable > > > additional delays. IMHO I don't think it's a good practice > > > protocol-wise to intentionally choke a stream at the receiving end > > > when it has not been flow-controlled on the transmitting end. > > > > > > > But in this proposal, we are not choking/delaying anything on the receiving end. > > I didn't say that to the latest patch. I interpreted the quote of > your description as saying that the subscriber-side solution is > effective in solving the long-lock problems, so I replied that that > can be solved with the publisher-side solution and the subscriber-side > solution could cause some unwanted behavior. > > Do you think we have decided to go with the publisher-side solution? > I'm fine if so. > I am fine too unless we discover any major challenges with publisher-side implementation. -- With Regards, Amit Kapila.
On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > Thinking of side effects of this feature (no matter where we delay > applying the changes), on the publisher, vacuum cannot collect garbage > and WAL cannot be recycled. Is that okay in the first place? The point > is that the subscription setting affects the publisher. That is, > min_send_delay is specified on the subscriber but the symptoms that > could ultimately lead to a server crash appear on the publisher, which > sounds dangerous to me. > > Imagine a service or system like where there is a publication server > and it's somewhat exposed so that a user (or a subsystem) arbitrarily > can create a subscriber to replicate a subset of the data. A malicious > user can have the publisher crash by creating a subscription with, > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation > but it's -1 by default. > By publisher crash, do you mean due to the disk full situation, it can lead the publisher to stop/panic? Won't a malicious user can block the replication in other ways as well and let the publisher stall (or crash the publisher) even without setting min_send_delay? Basically, one needs to either disable the subscription or create a constraint-violating row in the table to make that happen. If the system is exposed for arbitrarily allowing the creation of a subscription then a malicious user can create a subscription similar to one existing subscription and block the replication due to constraint violations. I don't think it would be so easy to bypass the current system that a malicious user will be allowed to create/alter subscriptions arbitrarily. Similarly, if there is a network issue (unreachable or slow), one will see similar symptoms. I think retention of data and WAL on publisher do rely on acknowledgment from subscribers and delay in that due to any reason can lead to the symptoms you describe above. We have documented at least one such case already where during Drop Subscription, if the network is not reachable then also, a similar problem can happen and users need to be careful about it [1]. [1] - https://www.postgresql.org/docs/devel/logical-replication-subscription.html -- With Regards, Amit Kapila.
On Tue, Feb 28, 2023 at 9:21 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > 1. > > + /* > > + * If we've requested to shut down, exit the process. > > + * > > + * Note that WalSndDone() cannot be used here because the delaying > > + * changes will be sent in the function. > > + */ > > + if (got_STOPPING) > > + { > > + QueryCompletion qc; > > + > > + /* Inform the standby that XLOG streaming is done */ > > + SetQueryCompletion(&qc, CMDTAG_COPY, 0); > > + EndCommand(&qc, DestRemote, false); > > + pq_flush(); > > > > Do we really need to do anything except for breaking the loop and let > > the exit handling happen in the main loop when 'got_STOPPING' is set? > > AFAICS, this is what we are doing in some other palces (See > > WalSndWaitForWal). Won't that work? It seems that will help us sending > > all the pending WAL. > > If we exit the loop after got_STOPPING is set, as you said, the walsender will > send delaying changes and then exit. The behavior is same as the case that WalSndDone() > is called. But I think it is not suitable for the motivation of the feature. > If users notice the miss operation like TRUNCATE, they must shut down the publisher > once and then recovery from back up or old subscriber. If the walsender sends all > pending changes, miss operations will be also propagated to subscriber and data > cannot be protected. So currently I want to keep the style. > FYI - In case of physical replication, received WALs are not applied when the > secondary is shutted down. > Fair point but I think the current comment should explain why we are doing something different here. How about extending the existing comments to something like: "If we've requested to shut down, exit the process. This is unlike handling at other places where we allow complete WAL to be sent before shutdown because we don't want the delayed transactions to be applied downstream. This will allow one to use the data from downstream in case of some unwanted operations on the current node." -- With Regards, Amit Kapila.
On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > Thinking of side effects of this feature (no matter where we delay > > applying the changes), on the publisher, vacuum cannot collect garbage > > and WAL cannot be recycled. Is that okay in the first place? The point > > is that the subscription setting affects the publisher. That is, > > min_send_delay is specified on the subscriber but the symptoms that > > could ultimately lead to a server crash appear on the publisher, which > > sounds dangerous to me. > > > > Imagine a service or system like where there is a publication server > > and it's somewhat exposed so that a user (or a subsystem) arbitrarily > > can create a subscriber to replicate a subset of the data. A malicious > > user can have the publisher crash by creating a subscription with, > > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation > > but it's -1 by default. > > > > By publisher crash, do you mean due to the disk full situation, it can > lead the publisher to stop/panic? Exactly. > Won't a malicious user can block the > replication in other ways as well and let the publisher stall (or > crash the publisher) even without setting min_send_delay? Basically, > one needs to either disable the subscription or create a > constraint-violating row in the table to make that happen. If the > system is exposed for arbitrarily allowing the creation of a > subscription then a malicious user can create a subscription similar > to one existing subscription and block the replication due to > constraint violations. I don't think it would be so easy to bypass the > current system that a malicious user will be allowed to create/alter > subscriptions arbitrarily. Right. But a difference is that with min_send_delay, it's just to create a subscription. > Similarly, if there is a network issue > (unreachable or slow), one will see similar symptoms. I think > retention of data and WAL on publisher do rely on acknowledgment from > subscribers and delay in that due to any reason can lead to the > symptoms you describe above. I think that piling up WAL files due to a slow network is a different story since it's a problem not only on the subscriber side. > We have documented at least one such case > already where during Drop Subscription, if the network is not > reachable then also, a similar problem can happen and users need to be > careful about it [1]. Apart from a bad-use case example I mentioned, in general, piling up WAL files due to the replication slot has many bad effects on the system. I'm concerned that the side effect of this feature (at least of the current design) is too huge compared to the benefit, and afraid that users might end up using this feature without understanding the side effect well. It might be okay if we thoroughly document it but I'm not sure. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 1, 2023 at 10:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Won't a malicious user can block the > > replication in other ways as well and let the publisher stall (or > > crash the publisher) even without setting min_send_delay? Basically, > > one needs to either disable the subscription or create a > > constraint-violating row in the table to make that happen. If the > > system is exposed for arbitrarily allowing the creation of a > > subscription then a malicious user can create a subscription similar > > to one existing subscription and block the replication due to > > constraint violations. I don't think it would be so easy to bypass the > > current system that a malicious user will be allowed to create/alter > > subscriptions arbitrarily. > > Right. But a difference is that with min_send_delay, it's just to > create a subscription. > But, currently, only superusers would be allowed to create subscriptions. Even, if we change it and allow it based on some pre-defined role, it won't be allowed to create a subscription arbitrarily. So, not sure, if any malicious user can easily bypass it as you are envisioning it. -- With Regards, Amit Kapila.
Dear Sawada-san, Thank you for giving your consideration! > > We have documented at least one such case > > already where during Drop Subscription, if the network is not > > reachable then also, a similar problem can happen and users need to be > > careful about it [1]. > > Apart from a bad-use case example I mentioned, in general, piling up > WAL files due to the replication slot has many bad effects on the > system. I'm concerned that the side effect of this feature (at least > of the current design) is too huge compared to the benefit, and afraid > that users might end up using this feature without understanding the > side effect well. It might be okay if we thoroughly document it but > I'm not sure. One approach is to change max_slot_wal_keep_size forcibly when min_send_delay is set. But it may lead to disabling the slot, because WAL still needed by the time-delayed replication may also be removed. We cannot choose just the right value ourselves because it quite depends on min_send_delay and the workload. How about throwing a WARNING when min_send_delay > 0 but max_slot_wal_keep_size < 0? Unlike the previous version, the subscription parameter min_send_delay will now be sent to the publisher. Therefore, we can compare min_send_delay and max_slot_wal_keep_size when the publisher receives the parameter. Of course we could reject such a setup by using ereport(ERROR), but that may leave an abandoned replication slot, because we send the parameter at START_REPLICATION and the slot has already been created by then. Best Regards, Hayato Kuroda FUJITSU LIMITED
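For illustration, here is a minimal sketch of the combination being discussed, assuming the proposed min_send_delay subscription parameter from this patch series (not part of released PostgreSQL); the subscription name, connection string, publication, and retention limit are placeholders:

```sql
-- Hypothetical delayed subscription using the proposed parameter.
CREATE SUBSCRIPTION delayed_sub
    CONNECTION 'host=publisher.example dbname=postgres'
    PUBLICATION pub
    WITH (min_send_delay = '8h');

-- On the publisher, bound how much WAL a slot may retain so that a long
-- delay cannot fill the disk; the default of -1 means unlimited retention,
-- which is the case the proposed WARNING would flag.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```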
On Tue, 28 Feb 2023 at 21:21, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > > Few comments: > > Thank you for reviewing! PSA new version. Thanks for the updated patch, few comments: 1) Currently we have added the delay during the decode of commit, while decoding the commit walsender process will stop decoding any further transaction until delay is completed. There might be a possibility that a lot of transactions will happen in parallel and there will be a lot of transactions to be decoded after the delay is completed. Will it be possible to decode the WAL if any WAL is generated instead of staying idle in the meantime, I'm not sure if this is feasible just throwing my thought to see if it might be possible. --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -676,6 +676,15 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, buf->origptr, buf->endptr); } + /* + * Delay sending the changes if required. For streaming transactions, + * this means a delay in sending the last stream but that is OK + * because on the downstream the changes will be applied only after + * receiving the last stream. + */ + if (ctx->min_send_delay > 0 && ctx->delay_send) + ctx->delay_send(ctx, xid, commit_time); + 2) Generally single line comments are not terminated by ".", The comment "/* Sleep until appropriate time. */" can be changed appropriately: + + /* Sleep until appropriate time. */ + timeout_sleeptime_ms = WalSndComputeSleeptime(now); + + elog(DEBUG2, "time-delayed replication for txid %u, delay_time = %d ms, remaining wait time: %ld ms", + xid, (int) ctx->min_send_delay, remaining_wait_time_ms); + + /* Sleep until we get reply from worker or we time out */ + WalSndWait(WL_SOCKET_READABLE, 3) In some places we mention as min_send_delay and in some places we mention it as time-delayed replication, we can keep the comment consistent by using the similar wordings. +-- fail - specifying streaming = parallel with time-delayed replication is not +-- supported +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_send_delay = 123); +-- fail - alter subscription with streaming = parallel should fail when +-- time-delayed replication is set +ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); +-- fail - alter subscription with min_send_delay should fail when +-- streaming = parallel is set 4) Since the value is stored in ms, we need not add ms again as the default value is in ms: @@ -4686,6 +4694,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo) if (strcmp(subinfo->subsynccommit, "off") != 0) appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit)); + if (subinfo->subminsenddelay > 0) + appendPQExpBuffer(query, ", min_send_delay = '%d ms'", subinfo->subminsenddelay); + 5) we can use the new error reporting style: 5.a) brackets around errcode can be removed + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid value for parameter \"%s\": \"%s\"", + "min_send_delay", input_string), + hintmsg ? errhint("%s", _(hintmsg)) : 0)); 5.b) Similarly here too; + if (result < 0 || result > PG_INT32_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. 
%d)", + result, + "min_send_delay", + 0, PG_INT32_MAX))); 5.c) Similarly here too; + delay_val = strtoul(strVal(defel->arg), &endptr, 10); + if (errno != 0 || *endptr != '\0') + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid min_send_delay"))); 5.d) Similarly here too; + if (delay_val > PG_INT32_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("min_send_delay \"%s\" out of range", + strVal(defel->arg)))); 6) This can be changed to a single line comment: + /* + * Parse given string as parameter which has millisecond unit + */ + if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg)) + ereport(ERROR, 7) In the expect we have specifically mention "for non-streaming transaction", is the behavior different for streaming transaction, if not we can change the message accordingly +# The publisher waits for the replication to complete +$node_publisher->wait_for_catchup('tap_sub_renamed'); + +# This test is successful only if at least the configured delay has elapsed. +ok( time() - $publisher_insert_time >= $delay, + "subscriber applies changes only after replication delay for non-streaming transaction" +); Regards, Vignesh
On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > Thank you for giving your consideration! > > > > We have documented at least one such case > > > already where during Drop Subscription, if the network is not > > > reachable then also, a similar problem can happen and users need to be > > > careful about it [1]. > > > > Apart from a bad-use case example I mentioned, in general, piling up > > WAL files due to the replication slot has many bad effects on the > > system. I'm concerned that the side effect of this feature (at least > > of the current design) is too huge compared to the benefit, and afraid > > that users might end up using this feature without understanding the > > side effect well. It might be okay if we thoroughly document it but > > I'm not sure. > > One approach is that change max_slot_wal_keep_size forcibly when min_send_delay > is set. But it may lead to disable the slot because WALs needed by the time-delayed > replication may be also removed. Just the right value cannot be set by us because > it is quite depends on the min_send_delay and workload. > > How about throwing the WARNING when min_send_delay > 0 but > max_slot_wal_keep_size < 0? Differ from previous, version the subscription > parameter min_send_delay will be sent to publisher. Therefore, we can compare > min_send_delay and max_slot_wal_keep_size when publisher receives the parameter. Since max_slot_wal_keep_size can be changed by reloading the config file, each walsender warns it also at that time? Not sure it's helpful. I think it's a legitimate use case to set min_send_delay > 0 and max_slot_wal_keep_size = -1, and users might not even notice the WARNING message. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 2, 2023 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > Apart from a bad-use case example I mentioned, in general, piling up > > > WAL files due to the replication slot has many bad effects on the > > > system. I'm concerned that the side effect of this feature (at least > > > of the current design) is too huge compared to the benefit, and afraid > > > that users might end up using this feature without understanding the > > > side effect well. It might be okay if we thoroughly document it but > > > I'm not sure. > > > > One approach is that change max_slot_wal_keep_size forcibly when min_send_delay > > is set. But it may lead to disable the slot because WALs needed by the time-delayed > > replication may be also removed. Just the right value cannot be set by us because > > it is quite depends on the min_send_delay and workload. > > > > How about throwing the WARNING when min_send_delay > 0 but > > max_slot_wal_keep_size < 0? Differ from previous, version the subscription > > parameter min_send_delay will be sent to publisher. Therefore, we can compare > > min_send_delay and max_slot_wal_keep_size when publisher receives the parameter. > > Since max_slot_wal_keep_size can be changed by reloading the config > file, each walsender warns it also at that time? > I think Kuroda-San wants to emit a WARNING at the time of CREATE SUBSCRIPTION. But it won't be possible to emit a WARNING at the time of ALTER SUBSCRIPTION. Also, as you say if the user later changes the value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in walsender, it may go unnoticed. If we really want to give WARNING for this then we can probably give it as soon as user has set non-default value of min_send_delay to indicate that this can lead to retaining WAL on the publisher and they should consider setting max_slot_wal_keep_size. Having said that, I think users can always tune max_slot_wal_keep_size and min_send_delay (as none of these requires restart) if they see any indication of unexpected WAL size growth. There could be multiple ways to check it but I think one can refer wal_status in pg_replication_slots, the extended value can be an indicator of this. > Not sure it's > helpful. I think it's a legitimate use case to set min_send_delay > 0 > and max_slot_wal_keep_size = -1, and users might not even notice the > WARNING message. > I think it would be better to tell about this in the docs along with the 'min_send_delay' description. The key point is whether this would be an acceptable trade-off for users who want to use this feature. I think it can harm only if users use this without understanding the corresponding trade-off. As we kept the default to no delay, it is expected from users using this have an understanding of the trade-off. -- With Regards, Amit Kapila.
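As a concrete illustration of the monitoring mentioned here, a query such as the following against the existing pg_replication_slots view shows whether a slot has started retaining extra WAL; the slot_type filter is only one example of how the delayed subscription's slot might be identified:

```sql
-- 'extended' means the slot is holding WAL beyond max_wal_size; 'unreserved'
-- or 'lost' mean required WAL is (about to be) removed. safe_wal_size is
-- non-NULL only when max_slot_wal_keep_size is set and shows how much more
-- WAL can be written before the slot is invalidated.
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical';
```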
Dear Amit, Sawada-san, > I think Kuroda-San wants to emit a WARNING at the time of CREATE > SUBSCRIPTION. But it won't be possible to emit a WARNING at the time > of ALTER SUBSCRIPTION. Also, as you say if the user later changes the > value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in > walsender, it may go unnoticed. If we really want to give WARNING for > this then we can probably give it as soon as user has set non-default > value of min_send_delay to indicate that this can lead to retaining > WAL on the publisher and they should consider setting > max_slot_wal_keep_size. Yeah, my motivation is to emit a WARNING at CREATE SUBSCRIPTION, but I had not noticed that the approach does not cover ALTER SUBSCRIPTION. > Having said that, I think users can always tune max_slot_wal_keep_size > and min_send_delay (as none of these requires restart) if they see any > indication of unexpected WAL size growth. There could be multiple ways > to check it but I think one can refer wal_status in > pg_replication_slots, the extended value can be an indicator of this. Yeah, min_send_delay and max_slot_wal_keep_size should be easily tunable because the appropriate value depends on the environment and workload. However, pg_replication_slots.wal_status cannot show the exact amount of WAL, so it may not be suitable for tuning. I think the user can compare pg_replication_slots.restart_lsn (or pg_stat_replication.sent_lsn) with pg_current_wal_lsn() to calculate the amount of WAL being held back, like

```
postgres=# select pg_current_wal_lsn() - pg_replication_slots.restart_lsn as delayed from pg_replication_slots;
  delayed
------------
 1689153760
(1 row)
```

> I think it would be better to tell about this in the docs along with > the 'min_send_delay' description. The key point is whether this would > be an acceptable trade-off for users who want to use this feature. I > think it can harm only if users use this without understanding the > corresponding trade-off. As we kept the default to no delay, it is > expected from users using this have an understanding of the trade-off. Yes, the trade-off should be emphasized. Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Peter, Thank you for reviewing! PSA new version. > 1. > Nitpick. The new text is jagged-looking. It should wrap at ~80 chars. Addressed. > > 2. > 2. Another reason is for that parallel streaming, the transaction will be opened > immediately by the parallel apply worker. Therefore, if the walsender > is delayed in sending the final record of the transaction, the > parallel apply worker must wait to receive it with an open > transaction. This would result in the locks acquired during the > transaction not being released until the min_send_delay has elapsed. > > ~ > > The text already said there are "two reasons", and already this is > numbered as reason 2. So it doesn't need to keep saying "Another > reason" here. > > "Another reason is for that parallel streaming" --> "For parallel streaming..." Changed. > ====== > src/backend/replication/walsender.c > > 3. WalSndDelay > > + /* die if timeout was reached */ > + WalSndCheckTimeOut(); > > Other nearby comments start uppercase, so this should too. I just picked from other part and they have lowercase, but fixed. > ====== > src/include/replication/walreceiver.h > > 4. WalRcvStreamOptions > > @@ -187,6 +187,7 @@ typedef struct > * prepare time */ > char *origin; /* Only publish data originating from the > * specified origin */ > + int32 min_send_delay; /* The minimum send delay */ > } logical; > } proto; > } WalRcvStreamOptions; > > ~ > > Should that comment mention the units are "(ms)" Added. Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Vignesh, Thank you for reviewing! New version can be available at [1]. > 1) Currently we have added the delay during the decode of commit, > while decoding the commit walsender process will stop decoding any > further transaction until delay is completed. There might be a > possibility that a lot of transactions will happen in parallel and > there will be a lot of transactions to be decoded after the delay is > completed. > Will it be possible to decode the WAL if any WAL is generated instead > of staying idle in the meantime, I'm not sure if this is feasible just > throwing my thought to see if it might be possible. > --- a/src/backend/replication/logical/decode.c > +++ b/src/backend/replication/logical/decode.c > @@ -676,6 +676,15 @@ DecodeCommit(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > > buf->origptr, buf->endptr); > } > > + /* > + * Delay sending the changes if required. For streaming transactions, > + * this means a delay in sending the last stream but that is OK > + * because on the downstream the changes will be applied only after > + * receiving the last stream. > + */ > + if (ctx->min_send_delay > 0 && ctx->delay_send) > + ctx->delay_send(ctx, xid, commit_time); > + I see your point, but I think that extension can be done in future version if needed. This is because we must change some parts and introduce some complexities. If we have decoded but have not wanted to send changes yet, we must store them in the memory one and skip sending. In order to do that we must add new data structure, and we must add another path in DecodeCommit, DecodePrepare not to send changes and in WalSndLoop() and other functions to send pending changes. These may not be sufficient. I'm now thinking aboves are not needed, we can modify later if the overhead of decoding is quite large and we must do them very efficiently. > 2) Generally single line comments are not terminated by ".", The > comment "/* Sleep until appropriate time. */" can be changed > appropriately: > + > + /* Sleep until appropriate time. */ > + timeout_sleeptime_ms = WalSndComputeSleeptime(now); > + > + elog(DEBUG2, "time-delayed replication for txid %u, > delay_time = %d ms, remaining wait time: %ld ms", > + xid, (int) ctx->min_send_delay, > remaining_wait_time_ms); > + > + /* Sleep until we get reply from worker or we time out */ > + WalSndWait(WL_SOCKET_READABLE, Right, removed. > 3) In some places we mention as min_send_delay and in some places we > mention it as time-delayed replication, we can keep the comment > consistent by using the similar wordings. > +-- fail - specifying streaming = parallel with time-delayed replication is not > +-- supported > +CREATE SUBSCRIPTION regress_testsub CONNECTION > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = > false, streaming = parallel, min_send_delay = 123); > > +-- fail - alter subscription with streaming = parallel should fail when > +-- time-delayed replication is set > +ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel); > > +-- fail - alter subscription with min_send_delay should fail when > +-- streaming = parallel is set "time-delayed replication" was removed. 
> 4) Since the value is stored in ms, we need not add ms again as the > default value is in ms: > @@ -4686,6 +4694,9 @@ dumpSubscription(Archive *fout, const > SubscriptionInfo *subinfo) > if (strcmp(subinfo->subsynccommit, "off") != 0) > appendPQExpBuffer(query, ", synchronous_commit = %s", > fmtId(subinfo->subsynccommit)); > > + if (subinfo->subminsenddelay > 0) > + appendPQExpBuffer(query, ", min_send_delay = '%d ms'", > subinfo->subminsenddelay); Right, fixed. > 5) we can use the new error reporting style: > 5.a) brackets around errcode can be removed > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid value for parameter > \"%s\": \"%s\"", > + "min_send_delay", > input_string), > + hintmsg ? errhint("%s", _(hintmsg)) : 0)); > > 5.b) Similarly here too; > + if (result < 0 || result > PG_INT32_MAX) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("%d ms is outside the valid > range for parameter \"%s\" (%d .. %d)", > + result, > + "min_send_delay", > + 0, PG_INT32_MAX))); > > 5.c) Similarly here too; > + delay_val = strtoul(strVal(defel->arg), &endptr, 10); > + if (errno != 0 || *endptr != '\0') > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid > min_send_delay"))); > > > 5.d) Similarly here too; > + if (delay_val > PG_INT32_MAX) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + > errmsg("min_send_delay \"%s\" out of range", > + > strVal(defel->arg)))); All of them are fixed. > 6) This can be changed to a single line comment: > + /* > + * Parse given string as parameter which has millisecond unit > + */ > + if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg)) > + ereport(ERROR, Changed. I grepped ereport() in the patch and I thought there were no similar one. > 7) In the expect we have specifically mention "for non-streaming > transaction", is the behavior different for streaming transaction, if > not we can change the message accordingly > +# The publisher waits for the replication to complete > +$node_publisher->wait_for_catchup('tap_sub_renamed'); > + > +# This test is successful only if at least the configured delay has elapsed. > +ok( time() - $publisher_insert_time >= $delay, > + "subscriber applies changes only after replication delay for > non-streaming transaction" > +); There is no difference, both of normal and streamed transaction could be delayed to apply. So removed. [1]: https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear Amit, > Fair point but I think the current comment should explain why we are > doing something different here. How about extending the existing > comments to something like: "If we've requested to shut down, exit the > process. This is unlike handling at other places where we allow > complete WAL to be sent before shutdown because we don't want the > delayed transactions to be applied downstream. This will allow one to > use the data from downstream in case of some unwanted operations on > the current node." Thank you for the suggestion. I think it is better, so I changed it. Please see the new patch at [1] [1]: https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
> Yeah, min_send_delay and max_slot_wal_keep_size should be easily tunable > because > the appropriate value depends on the environment and workload. > However, pg_replication_slots.wal_status cannot show the exact amount > of WAL, > so it may not be suitable for tuning. I think the user can compare > pg_replication_slots.restart_lsn (or pg_stat_replication.sent_lsn) with > pg_current_wal_lsn() to calculate the amount of WAL being held back, like > > ``` > postgres=# select pg_current_wal_lsn() - pg_replication_slots.restart_lsn as > delayed from pg_replication_slots; > delayed > ------------ > 1689153760 > (1 row) > ``` > > > I think it would be better to tell about this in the docs along with > > the 'min_send_delay' description. The key point is whether this would > > be an acceptable trade-off for users who want to use this feature. I > > think it can harm only if users use this without understanding the > > corresponding trade-off. As we kept the default to no delay, it is > > expected from users using this have an understanding of the trade-off. > > Yes, the trade-off should be emphasized. Based on that understanding, I added this to the docs in the new version of the patch. Please see [1]. [1]: https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
On Thu, Mar 2, 2023 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 2, 2023 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > > > > Apart from a bad-use case example I mentioned, in general, piling up > > > > WAL files due to the replication slot has many bad effects on the > > > > system. I'm concerned that the side effect of this feature (at least > > > > of the current design) is too huge compared to the benefit, and afraid > > > > that users might end up using this feature without understanding the > > > > side effect well. It might be okay if we thoroughly document it but > > > > I'm not sure. > > > > > > One approach is that change max_slot_wal_keep_size forcibly when min_send_delay > > > is set. But it may lead to disable the slot because WALs needed by the time-delayed > > > replication may be also removed. Just the right value cannot be set by us because > > > it is quite depends on the min_send_delay and workload. > > > > > > How about throwing the WARNING when min_send_delay > 0 but > > > max_slot_wal_keep_size < 0? Differ from previous, version the subscription > > > parameter min_send_delay will be sent to publisher. Therefore, we can compare > > > min_send_delay and max_slot_wal_keep_size when publisher receives the parameter. > > > > Since max_slot_wal_keep_size can be changed by reloading the config > > file, each walsender warns it also at that time? > > > > I think Kuroda-San wants to emit a WARNING at the time of CREATE > SUBSCRIPTION. But it won't be possible to emit a WARNING at the time > of ALTER SUBSCRIPTION. Also, as you say if the user later changes the > value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in > walsender, it may go unnoticed. If we really want to give WARNING for > this then we can probably give it as soon as user has set non-default > value of min_send_delay to indicate that this can lead to retaining > WAL on the publisher and they should consider setting > max_slot_wal_keep_size. > > Having said that, I think users can always tune max_slot_wal_keep_size > and min_send_delay (as none of these requires restart) if they see any > indication of unexpected WAL size growth. There could be multiple ways > to check it but I think one can refer wal_status in > pg_replication_slots, the extended value can be an indicator of this. > > > Not sure it's > > helpful. I think it's a legitimate use case to set min_send_delay > 0 > > and max_slot_wal_keep_size = -1, and users might not even notice the > > WARNING message. > > > > I think it would be better to tell about this in the docs along with > the 'min_send_delay' description. The key point is whether this would > be an acceptable trade-off for users who want to use this feature. I > think it can harm only if users use this without understanding the > corresponding trade-off. As we kept the default to no delay, it is > expected from users using this have an understanding of the trade-off. I imagine that a typical use case would be to set min_send_delay to several hours to days. I'm concerned that it could not be an acceptable trade-off for many users that the system cannot collect any garbage during that. I think we can have the apply process write the decoded changes somewhere on the disk (as not temporary files) and return the flush LSN so that the apply worker can apply them later and the publisher can advance slot's LSN. 
The feature would be more complex but from the user perspective it would be better. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi all,
Thanks for working on this.
I imagine that a typical use case would be to set min_send_delay to
several hours to days. I'm concerned that it could not be an
acceptable trade-off for many users that the system cannot collect any
garbage during that.
I'm not too worried about the WAL recycling; that mostly looks like
a documentation issue to me. It is not a problem that many PG users
are unfamiliar with. Also, even if creating/altering subscriptions is
one day relaxed so that regular users can do it, one option could be to
require this particular setting to be changed by a superuser. That would
alleviate my concern regarding WAL recycling. A superuser should be able
to monitor the system and adjust the settings/hardware accordingly.
However, VACUUM being blocked by replication with a configuration
change on the subscription sounds more concerning to me. Blocking
VACUUM for hours could quickly escalate to performance problems.
On the other hand, we already have a similar problem with
recovery_min_apply_delay combined with hot_standby_feedback [1].
So, that probably is an acceptable trade-off for the pgsql-hackers.
If you use this feature, you should be even more careful.
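For comparison, the physical-standby combination referred to above is, roughly, the following standby-side configuration; this is a sketch of existing GUCs, not of the proposed logical-replication parameter:

```sql
-- On a physical standby: delay apply by 8 hours while sending feedback to
-- the primary, which likewise prevents the primary from removing dead rows
-- during the delay window (and, if a physical slot is used, retains WAL).
ALTER SYSTEM SET recovery_min_apply_delay = '8h';
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```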
I think we can have the apply process write the decoded changes
somewhere on the disk (as not temporary files) and return the flush
LSN so that the apply worker can apply them later and the publisher
can advance slot's LSN. The feature would be more complex but from the
user perspective it would be better.
Yes, this is probably one of the ideal solutions to the problem at hand. But
my current guess is that it'd be a non-trivial change with its own concurrency/failure
scenarios. So, I'm not sure if that is going to be a realistic patch to pursue.
Thanks,
Onder KALACI
On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > On the other hand, we already have a similar problem with > recovery_min_apply_delay combined with hot_standby_feedback [1]. > So, that probably is an acceptable trade-off for the pgsql-hackers. > If you use this feature, you should be even more careful. Yes, but it's possible to turn off hot_standby_feedback so that you don't incur bloat on the primary. And you don't need to store hours or days of WAL on the primary. I'm very late to this thread, but IIUC you cannot avoid blocking VACUUM with the proposed feature. IMO the current set of trade-offs (e.g., unavoidable bloat and WAL buildup) would make this feature virtually unusable for a lot of workloads, so it's probably worth exploring an alternative approach. In any case, we probably shouldn't rush this into v16 in its current form. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > > On the other hand, we already have a similar problem with > > recovery_min_apply_delay combined with hot_standby_feedback [1]. > > So, that probably is an acceptable trade-off for the pgsql-hackers. > > If you use this feature, you should be even more careful. > > Yes, but it's possible to turn off hot_standby_feedback so that you don't > incur bloat on the primary. And you don't need to store hours or days of > WAL on the primary. Right. This side effect belongs to the combination of recovery_min_apply_delay and hot_standby_feedback/replication slot. recovery_min_apply_delay itself can be used even without this side effect if we accept other trade-offs. When it comes to this time-delayed logical replication feature, there is no choice to avoid the side effect for users who want to use this feature. > I'm very late to this thread, but IIUC you cannot > avoid blocking VACUUM with the proposed feature. Right. > IMO the current set of > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > feature virtually unusable for a lot of workloads, so it's probably worth > exploring an alternative approach. It might require more engineering effort for alternative approaches such as one I proposed but the feature could become better from the user perspective. I also think it would be worth exploring it if we've not yet. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > > IMO the current set of > > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > > feature virtually unusable for a lot of workloads, so it's probably worth > > exploring an alternative approach. > > It might require more engineering effort for alternative approaches > such as one I proposed but the feature could become better from the > user perspective. I also think it would be worth exploring it if we've > not yet. > Fair enough. I think as of now most people think that we should consider alternative approaches for this feature. The two ideas at a high level are that the apply worker itself first flushes the decoded WAL (maybe only when time-delay is configured) or have a separate walreceiver process as we have for standby. I think we need to analyze the pros and cons of each of those approaches and see if they would be useful even for other things on the apply side. -- With Regards, Amit Kapila.
At Thu, 9 Mar 2023 11:00:46 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > > > > > IMO the current set of > > > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > > > feature virtually unusable for a lot of workloads, so it's probably worth > > > exploring an alternative approach. > > > > It might require more engineering effort for alternative approaches > > such as one I proposed but the feature could become better from the > > user perspective. I also think it would be worth exploring it if we've > > not yet. > > > > Fair enough. I think as of now most people think that we should > consider alternative approaches for this feature. The two ideas at a If we can notify subscriber of the transaction start time, will that solve the current problem? If not, or if it is not possible, +1 to look for other solutions. > high level are that the apply worker itself first flushes the decoded > WAL (maybe only when time-delay is configured) or have a separate > walreceiver process as we have for standby. I think we need to analyze > the pros and cons of each of those approaches and see if they would be > useful even for other things on the apply side. My understanding of the requirements here is that the publisher should not hold changes, the subscriber should not hold data reads, and all transactions including two-phase ones should be applied at once upon committing. Both sides need to respond to the requests from the other side. We expect apply-delay of several hours or more. My thoughts considering the requirements are as follows: If we expect delays of several hours or more, I don't think it's feasible to stack received changes in the process memory. So, if apply-delay is in effect, I think it would be better to process transactions through files regardless of process configuration. I'm not sure whether we should have a separate process for protocol processing. On one hand, it would simplify the protocol processing part, but on the other hand, changes would always have to be applied through files. If we plan to integrate the paths with and without apply-delay by the file-passing method, this might work. If we want to maintain the current approach when not applying apply-delay, I think we would have to implement it in a single process, but I feel the protocol processing part could become complicated. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 9, 2023 at 2:56 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 9 Mar 2023 11:00:46 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > > > > > > > > IMO the current set of > > > > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > > > > feature virtually unusable for a lot of workloads, so it's probably worth > > > > exploring an alternative approach. > > > > > > It might require more engineering effort for alternative approaches > > > such as one I proposed but the feature could become better from the > > > user perspective. I also think it would be worth exploring it if we've > > > not yet. > > > > > > > Fair enough. I think as of now most people think that we should > > consider alternative approaches for this feature. The two ideas at a > > If we can notify subscriber of the transaction start time, will that > solve the current problem? > I don't think that will solve the current problem because the problem is related to confirming back the flush LSN (commit LSN) to the publisher which we do only after we commit the delayed transaction. Due to this, we are not able to advance WAL(restart_lsn)/XMIN on the publisher which causes an accumulation of WAL and does not allow the vacuum to remove deleted rows. Do you have something else in mind which makes you think that it can solve the problem? -- With Regards, Amit Kapila.
Dear hackers,

Since the discussion showed, as Sawada-san pointed out [1], that the current approach to time-delayed logical replication prevents WAL from being recycled, I'm planning to close the CF entry for now. This or a forked thread will be registered again once the alternative approach is decided. Thank you very much for taking the time to join our discussions so far.

I think that to solve the issue, the logical changes must first be flushed on the subscriber, and the workers must apply them only after the specified time has passed. The straightforward approach is to follow physical replication - introduce a walreceiver process on the subscriber. We must research this more, but at least there are some benefits:

* The publisher can be shut down even if the apply worker is stuck. Getting stuck is more likely here than in physical replication, so this may improve robustness. For more detail, please see another thread [2].
* In the case of synchronous_commit = 'remote_write', the publisher can COMMIT faster. This is because the walreceiver will flush changes immediately and reply soon. Even if the time delay is enabled, the wait time will not increase.
* It may be used as infrastructure for parallel apply of non-streaming transactions. The basic designs are similar - one process receives the changes and others apply them.

I searched old discussions [3] and wiki pages, and I found that the initial prototype had a logical walreceiver, but in a later version [4] the apply worker received changes directly. I could not find the reason for that decision, but I suspect the following reasons. Could you please tell me the correct background?

* Performance bottlenecks. If the walreceiver flushes the changes and the worker applies them, fsync() is called for every reception.
* Complexity. In this design the walreceiver and the apply worker must share the progress of flush/apply. For crash recovery, more consideration is needed. The related discussion can be found in [5].
* Extensibility. In-core logical replication should serve as a sample for external projects. The apply worker is just a background worker that can be launched from an extension, so it can be easily understood. If it deeply depended on the walreceiver, other projects could not follow the same pattern.

[1]: https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/flat/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/201206131327.24092.andres%402ndquadrant.com
[4]: https://www.postgresql.org/message-id/37e19ad5-f667-2fe2-b95b-bba69c5b6c68@2ndquadrant.com
[5]: https://www.postgresql.org/message-id/1339586927-13156-12-git-send-email-andres%402ndquadrant.com

Best Regards, Hayato Kuroda FUJITSU LIMITED
Hi hackers,

I have made a rough prototype that can serialize changes to a permanent file and apply them after the configured time has elapsed, based on the v30 patch. I think the 2PC and restore mechanisms need more analysis, but I can share the code for discussion. What do you think?

## Interfaces

Not changed from the old versions. The subscription parameter "min_apply_delay" is used to enable time-delayed logical replication.

## Advantages

Two big problems are solved.

* The apply worker can respond to the walsender's keepalive while delaying the application, because the process no longer sleeps.
* The publisher can recycle WAL even if a transaction contained in that WAL has not been applied yet, because the apply worker flushes all the changes to a file and replies that the WAL has been flushed.

## Disadvantages

Code complexity.

## Basic design

The basic idea is quite simple - create a new file when the apply worker receives a BEGIN message, write the received changes to it, and flush them when the COMMIT message arrives. The commit time of each delayed transaction is checked in every iteration of the main loop, and the transaction is applied when the elapsed time exceeds min_apply_delay. To handle the files, APIs that use plain kernel FDs are used. This approach is similar to the physical walreceiver process, except that the worker does not flush for every message - it is done at the end of the transaction.

### For 2PC

The delay starts when COMMIT PREPARED arrives. To avoid the long-lock-holding issue, the prepared transaction is just written to a file without being applied. When BEGIN PREPARE is received, the worker creates a file and starts to write changes, the same as for normal transactions. When the PREPARE message is reached, the worker just writes it to the file, flushes, and closes it. This means that no transactions are prepared on the subscriber. When COMMIT PREPARED is received, the worker opens the file again and writes the message. After that the transaction is treated the same as a normal committed transaction.

### For streamed transactions

Nothing special is done while a streamed transaction arrives. When it is committed or prepared, all the changes are read and written into the permanent file. apply_spooled_changes() is used to read and write the changes, which means the basic workflow is not changed.

### Restore from files

To check the time elapsed since the commit, the commit_time of every delayed transaction must be kept in memory. Basically it can be stored when the worker handles the COMMIT message, but special treatment is needed for restarting. When an apply worker receives a COMMIT/PREPARE/COMMIT PREPARED message, it writes the message, flushes, and caches the commit_time. When the worker restarts, it opens the files, checks the final message (by seeking back some bytes from the end of each file), and then caches the written commit_time.

## Restrictions

* The combination with ALTER SUBSCRIPTION .. SKIP LSN is not considered.

Thanks to Osumi-san for helping with the implementation.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear hackers, > I have made a rough prototype that can serialize changes to permanent file and > apply after time elapsed from v30 patch. I think the 2PC and restore mechanism > needs more analysis, but I can share codes for discussion. How do you think? I have noticed that it could not be applied due to the recent commit. Here is a rebased version. Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear hackers, The previous patch could not be applied due to commits 482675, 1e10d4, and c3afe8. PSA rebased version. Also, I have done some code cleanups. Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear hackers,

I have rebased and updated the PoC. Please see attached. In [1], I wrote:

> ### Restore from files To check the time elapsed since the commit, the commit_time of every delayed transaction must be kept in memory. Basically it can be stored when the worker handles the COMMIT message, but special treatment is needed for restarting. When an apply worker receives a COMMIT/PREPARE/COMMIT PREPARED message, it writes the message, flushes, and caches the commit_time. When the worker restarts, it opens the files, checks the final message (by seeking back some bytes from the end of each file), and then caches the written commit_time.

But I have been thinking that this spec is terrible. Therefore, I have implemented a new approach which encodes the information needed for restoring in the filename itself, set when the transaction is committed. The following is a summary.

When a worker receives a BEGIN message, it creates a new file and writes the changes to it. The filename contains the following dash-separated components:

1. Subscription OID
2. XID of the delayed transaction on the publisher
3. Status of the delayed transaction
4. Upper 32 bits of the commit_lsn
5. Lower 32 bits of the commit_lsn
6. Upper 32 bits of the end_lsn
7. Lower 32 bits of the end_lsn
8. Commit time

At the beginning, the new file has components 4-8 set to 0 because the worker does not know their values yet. When it receives a COMMIT message, the changes are written to the permanent file, and the file is renamed to the appropriate value. While restarting, the worker reads the directory containing the files and caches their commit times in memory from the filenames. The files do not need to be opened at this point. Therefore, PREPARE/COMMIT PREPARED messages are no longer written into the file; the status of a transaction can be distinguished from its filename.

Another notable change is the addition of a replication option. If min_apply_delay is greater than 0, a new parameter called "require_schema" is passed via the START_REPLICATION command. When "require_schema" is enabled, the publisher sends its schema (RELATION and TYPE messages) every time it sends decoded DMLs. This is necessary because delayed transactions may be applied after the subscriber has restarted, and the LogicalRepRelMap hash is destroyed at that point. If the RELATION message were not written into the delay file and the worker restarted just before applying the transaction, it would fail to open the local relation and report "ERROR: no relation map entry".

And some small bugs were also fixed.

[1]: https://www.postgresql.org/message-id/TYAPR01MB5866D871F60DDFD8FAA2CDE4F5BD9@TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards, Hayato Kuroda FUJITSU LIMITED
Dear hackers,

I rebased and refined my PoC. The following are the changes:

* Added support for ALTER SUBSCRIPTION .. SKIP LSN. The skip is performed when application of the transaction starts. The user must specify the commit_lsn of the transaction in order to skip it; if the apply worker hits an ERROR, it will output the commit_lsn. Unlike non-delayed transactions, a transaction that is prepared but not yet committed cannot be skipped, because the prepare_lsn is currently not recorded in the file.
* Added integrity checks. When a debug build is used, each message written to the files carries a CRC checksum. When the message is read, the apply worker verifies the checksum and raises PANIC if the comparison fails. I'm not sure the performance degradation would be acceptable, so I added it only when USE_ASSERT_CHECKING is on.

Best Regards, Hayato Kuroda FUJITSU LIMITED
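For context, the user-facing command involved is the existing ALTER SUBSCRIPTION ... SKIP; the subscription name and LSN below are purely illustrative, and with this prototype the LSN would be the commit_lsn reported in the apply worker's error message:

```sql
-- Skip the delayed transaction whose commit LSN was reported in the error
-- message, then let the apply worker proceed past it.
ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/14C0378');
```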
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear hackers, > > I rebased and refined my PoC. Followings are the changes: Thanks. Apologies for being late here. Please bear with me if I'm repeating any of the discussed points. I'm mainly trying to understand the production level use-case behind this feature, and for that matter, recovery_min_apply_delay. AFAIK, people try to keep the replication lag as minimum as possible i.e. near zero to avoid the extreme problems on production servers - wal file growth, blocked vacuum, crash and downtime. The proposed feature commit message and existing docs about recovery_min_apply_delay justify the reason as 'offering opportunities to correct data loss errors'. If someone wants to enable recovery_min_apply_delay/min_apply_delay on production servers, I'm guessing their values will be in hours, not in minutes; for the simple reason that when a data loss occurs, people/infrastructure monitoring postgres need to know it first and need time to respond with corrective actions to recover data loss. When these parameters are set, the primary server mustn't be generating too much WAL to avoid eventual crash/downtime. Who would really want to be so defensive against somebody who may or may not accidentally cause data loss and enable these features on production servers (especially when these can take down the primary server) and live happily with the induced replication lag? AFAIK, PITR is what people use for recovering from data loss errors in production. IMO, before we even go implement the apply delay feature for logical replication, it's worth to understand if induced replication lags have any production level significance. We can also debate if providing apply delay hooks is any better with simple out-of-the-box extensions as opposed to the core providing these features. Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, May 10, 2023 at 5:35 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear hackers, > > > > I rebased and refined my PoC. Followings are the changes: > > Thanks. > > Apologies for being late here. Please bear with me if I'm repeating > any of the discussed points. > > I'm mainly trying to understand the production level use-case behind > this feature, and for that matter, recovery_min_apply_delay. AFAIK, > people try to keep the replication lag as minimum as possible i.e. > near zero to avoid the extreme problems on production servers - wal > file growth, blocked vacuum, crash and downtime. > > The proposed feature commit message and existing docs about > recovery_min_apply_delay justify the reason as 'offering opportunities > to correct data loss errors'. If someone wants to enable > recovery_min_apply_delay/min_apply_delay on production servers, I'm > guessing their values will be in hours, not in minutes; for the simple > reason that when a data loss occurs, people/infrastructure monitoring > postgres need to know it first and need time to respond with > corrective actions to recover data loss. When these parameters are > set, the primary server mustn't be generating too much WAL to avoid > eventual crash/downtime. Who would really want to be so defensive > against somebody who may or may not accidentally cause data loss and > enable these features on production servers (especially when these can > take down the primary server) and live happily with the induced > replication lag? > > AFAIK, PITR is what people use for recovering from data loss errors in > production. > I think PITR is not a preferred way to achieve this because it can be quite time-consuming. See how Gitlab[1] uses delayed replication in PostgreSQL. This is one of the use cases I came across but I am sure there will be others as well, otherwise, we would not have introduced this feature in the first place. Some of the other solutions like MySQL also have this feature. See [2], you can also read the other use cases in that article. It seems pglogical has this feature and there is a customer demand for the same [3] > IMO, before we even go implement the apply delay feature for logical > replication, it's worth to understand if induced replication lags have > any production level significance. > I think the main thing here is to come up with the right design to implement this feature. In the last release, we found some blocking problems with the proposed patch at that time but Kuroda-San came up with a new patch with a different design based on the discussion here. I haven't looked at it yet though. [1] - https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/ [2] - https://dev.mysql.com/doc/refman/8.0/en/replication-delayed.html [3] - https://www.postgresql.org/message-id/73b06a32-56ab-4056-86ff-e307f3c316f1%40www.fastmail.com -- With Regards, Amit Kapila.
Dear Amit-san, Bharath, Thank you for giving your opinion! > Some of the other solutions like MySQL also have this feature. See > [2], you can also read the other use cases in that article. It seems > pglogical has this feature and there is a customer demand for the same > [3] Additionally, Db2 [1] seems to have a similar feature. If we extend the comparison to DBaaS offerings, RDS for MySQL [2] and TencentDB [3] have it as well. These may indicate the need for delayed replication. [1]: https://www.ibm.com/docs/en/db2/11.5?topic=parameters-hadr-replay-delay-hadr-replay-delay [2]: https://aws.amazon.com/jp/blogs/database/recover-from-a-disaster-with-delayed-replication-in-amazon-rds-for-mysql/ [3]: https://www.tencentcloud.com/document/product/236/41085 Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear hackers, > > I rebased and refined my PoC. Followings are the changes: > 1. Is my understanding correct that this patch creates the delay files for each transaction? If so, did you consider other approaches such as using one file to avoid creating many files? 2. For streaming transactions, first the changes are written in the temp file and then moved to the delay file. It seems like there is a double work. Is it possible to unify it such that when min_apply_delay is specified, we just use the delay file without sacrificing the advantages like stream sub-abort can truncate the changes? 3. Ideally, there shouldn't be a performance impact of this feature on regular transactions because the delay file is created only when min_apply_delay is active but better to do some testing of the same. Overall, I think such an approach can address comments by Sawada-San [1] but not sure if Sawada-San or others have any better ideas to achieve this feature. It would be good to see what others think of this approach. [1] - https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com -- With Regards, Amit Kapila.
On Thu, May 11, 2023 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear hackers, > > > > I rebased and refined my PoC. Followings are the changes: > > > > 1. Is my understanding correct that this patch creates the delay files > for each transaction? If so, did you consider other approaches such as > using one file to avoid creating many files? > 2. For streaming transactions, first the changes are written in the > temp file and then moved to the delay file. It seems like there is a > double work. Is it possible to unify it such that when min_apply_delay > is specified, we just use the delay file without sacrificing the > advantages like stream sub-abort can truncate the changes? > 3. Ideally, there shouldn't be a performance impact of this feature on > regular transactions because the delay file is created only when > min_apply_delay is active but better to do some testing of the same. > In addition to the points Amit raised, if the 'required_schema' option is specified in START_REPLICATION, the publisher sends schema information for every change. I think it leads to significant overhead. Did you consider alternative approaches such as sending the schema information for every transaction or the subscriber requests the publisher to send it? > Overall, I think such an approach can address comments by Sawada-San > [1] but not sure if Sawada-San or others have any better ideas to > achieve this feature. It would be good to see what others think of > this approach. > I agree with this approach. When it comes to the idea of writing logical changes to permanent files, I think it would also be a good idea (and perhaps could be a building block of this feature) that we write streamed changes to a permanent file so that the apply worker can retry to apply them without retrieving the same changes again from the publisher. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Dear Amit, Sawada-san, Thank you for replying! > On Thu, May 11, 2023 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear hackers, > > > > > > I rebased and refined my PoC. Followings are the changes: > > > > > > > 1. Is my understanding correct that this patch creates the delay files > > for each transaction? If so, did you consider other approaches such as > > using one file to avoid creating many files? > > 2. For streaming transactions, first the changes are written in the > > temp file and then moved to the delay file. It seems like there is a > > double work. Is it possible to unify it such that when min_apply_delay > > is specified, we just use the delay file without sacrificing the > > advantages like stream sub-abort can truncate the changes? > > 3. Ideally, there shouldn't be a performance impact of this feature on > > regular transactions because the delay file is created only when > > min_apply_delay is active but better to do some testing of the same. > > > > In addition to the points Amit raised, if the 'required_schema' option > is specified in START_REPLICATION, the publisher sends schema > information for every change. I think it leads to significant > overhead. Did you consider alternative approaches such as sending the > schema information for every transaction or the subscriber requests > the publisher to send it? Thanks for giving your opinions. Except for suggestion 2, I have never considered. I will analyze them and share my opinion later. About 2, I chose the style in order to simplify the source code, but I'm now planning to follow suggestions. > > Overall, I think such an approach can address comments by Sawada-San > > [1] but not sure if Sawada-San or others have any better ideas to > > achieve this feature. It would be good to see what others think of > > this approach. > > > > I agree with this approach. > > When it comes to the idea of writing logical changes to permanent > files, I think it would also be a good idea (and perhaps could be a > building block of this feature) that we write streamed changes to a > permanent file so that the apply worker can retry to apply them > without retrieving the same changes again from the publisher. I'm very relieved to hear that. One question: did you mean to say that serializing changes into the permanent files can be extend to the non-delay case, right? I think once I will treat for delayed replication, and then we can consider later. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, May 12, 2023 at 12:48 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > > Overall, I think such an approach can address comments by Sawada-San [1] but not sure if Sawada-San or others have any better ideas to achieve this feature. It would be good to see what others think of this approach.
> > >
> > I agree with this approach.
> >
> > When it comes to the idea of writing logical changes to permanent files, I think it would also be a good idea (and perhaps could be a building block of this feature) that we write streamed changes to a permanent file so that the apply worker can retry to apply them without retrieving the same changes again from the publisher.
>
> I'm very relieved to hear that. One question: did you mean that serializing changes into permanent files could be extended to the non-delay case as well? I think I will first handle delayed replication, and then we can consider that.

What I was thinking of is that we implement non-delay cases (only for streamed transactions) and then extend it to delay cases (i.e. adding non-streamed transaction support and the delay mechanism). It might be helpful if this patch becomes large, since this approach could enable us to reduce the complexity or divide the patch. That being said, I've not considered this approach enough yet and it's just an idea. Extending this feature to non-delay cases later also makes sense to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, May 12, 2023 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, May 11, 2023 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear hackers,
> > >
> > > I rebased and refined my PoC. Followings are the changes:
> > >
> > 1. Is my understanding correct that this patch creates the delay files for each transaction? If so, did you consider other approaches such as using one file to avoid creating many files?
> > 2. For streaming transactions, first the changes are written in the temp file and then moved to the delay file. It seems like there is a double work. Is it possible to unify it such that when min_apply_delay is specified, we just use the delay file without sacrificing the advantages like stream sub-abort can truncate the changes?
> > 3. Ideally, there shouldn't be a performance impact of this feature on regular transactions because the delay file is created only when min_apply_delay is active but better to do some testing of the same.
> >
> In addition to the points Amit raised, if the 'required_schema' option is specified in START_REPLICATION, the publisher sends schema information for every change. I think it leads to significant overhead. Did you consider alternative approaches such as sending the schema information for every transaction, or having the subscriber request the publisher to send it?
>

Why do we need this new flag? I can't see any comments in the related code which explain its need.

> > Overall, I think such an approach can address comments by Sawada-San [1] but not sure if Sawada-San or others have any better ideas to achieve this feature. It would be good to see what others think of this approach.
> >
> I agree with this approach.
>
> When it comes to the idea of writing logical changes to permanent files, I think it would also be a good idea (and perhaps could be a building block of this feature) that we write streamed changes to a permanent file so that the apply worker can retry to apply them without retrieving the same changes again from the publisher.
>

I think we anyway won't be able to send a confirmation till we write or process the commit. If it gets interrupted anytime in between, we need to get all the changes again. I think using Fileset with temp files has quite a few advantages for streaming, as noted in the header comments of worker.c. We can investigate replacing that with permanent files, but I don't see that the advantages outweigh the change. Also, after parallel apply, I expect most users would prefer that mode for large transactions, so making changes in the serialized path doesn't seem like a good idea to me.

Having said that, I also thought it would be a good idea if both streaming and time-delayed replication could use the same code path in some way w.r.t writing to files, but I couldn't come up with any good idea without more downsides. I see that Kuroda-San has tried to keep the code path isolated for this feature, but I still see that one can question the implementation approach.

--
With Regards,
Amit Kapila.
On Fri, May 12, 2023 at 10:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 12, 2023 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, May 11, 2023 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> > > >
> > > > Dear hackers,
> > > >
> > > > I rebased and refined my PoC. Followings are the changes:
> > > >
> > > 1. Is my understanding correct that this patch creates the delay files for each transaction? If so, did you consider other approaches such as using one file to avoid creating many files?
> > > 2. For streaming transactions, first the changes are written in the temp file and then moved to the delay file. It seems like there is a double work. Is it possible to unify it such that when min_apply_delay is specified, we just use the delay file without sacrificing the advantages like stream sub-abort can truncate the changes?
> > > 3. Ideally, there shouldn't be a performance impact of this feature on regular transactions because the delay file is created only when min_apply_delay is active but better to do some testing of the same.
> > >
> > In addition to the points Amit raised, if the 'required_schema' option is specified in START_REPLICATION, the publisher sends schema information for every change. I think it leads to significant overhead. Did you consider alternative approaches such as sending the schema information for every transaction, or having the subscriber request the publisher to send it?
> >
> Why do we need this new flag? I can't see any comments in the related code which explain its need.
>

So as per the email [1], this would be required after the subscriber restart. I guess we ideally need it once per delay file (considering that we have one file for all delayed xacts). In the worst case, we can have it per transaction as suggested by Sawada-San.

[1] - https://www.postgresql.org/message-id/TYAPR01MB5866568A5C1E71338328B20CF5629%40TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Dear Amit,

Thank you for giving suggestions.

> > Dear hackers,
> >
> > I rebased and refined my PoC. Followings are the changes:
> >
> 1. Is my understanding correct that this patch creates the delay files for each transaction? If so, did you consider other approaches such as using one file to avoid creating many files?

I have been analyzing the approach which uses only one file per subscription, per your suggestion. Currently I'm not sure whether it is a good approach or not, so could you please give me feedback?

TL;DR: rotating segment files like WAL may be used, but there are several issues.

# Assumptions

* Streamed txns are also serialized to the same permanent file, in the received order.
* No additional sorting is considered.

# Considerations

As a premise, applied txns must be removed from the files, otherwise the disk will eventually become full, which leads to a PANIC.

## Naive approach - serialize all the changes to one large file

If workers naively keep appending received changes to one file, it may be difficult to purge applied txns because there does not seem to be a good way to truncate the first part of a file. I could not find related functions in fd.h.

## Alternative approach - separate the file into segments

The alternative approach I came up with is to divide the file into segments - like WAL - and remove a segment once all the txns written to it have been applied. It may work well in the non-streaming, 1PC case, but may not in other cases.

### Regarding PREPARE transactions

In this case it becomes more likely that the segment which contains the actual txn differs from the segment which contains the COMMIT PREPARED. Hence the worker must check all the remaining segments to find the actual messages. Isn't that inefficient? Another approach is that the worker applies the PREPARE immediately and spills only the COMMIT PREPARED to a file, but then the worker would keep holding the acquired locks until the delay has elapsed.

### Regarding streamed transactions

As for the streaming case, the chunks of a txn are spread across several segments. Hence the worker must check all the remaining segments to find the chunk messages, same as above. Isn't that inefficient too? Additionally, segments which contain prepared or streamed transactions cannot be removed, so even in this case many files may be generated and remain.

Anyway, it may be difficult to accept streaming in-progress transactions while delaying the application. IIUC the motivation of streaming is to reduce the lag between nodes, which is the opposite of this feature. So it might be okay, I'm not sure.

### Regarding the publisher - the timing to send the schema may be fuzzy

Another issue is that the timing when the publisher sends the schema information cannot be determined by the publisher itself. As discussed on hackers, the publisher must send the schema information once per segment file, but segments are controlled on the subscriber side. I'm thinking that the walsender cannot recognize segment switches and so cannot know when to send the schema.

That's it. I'm very happy to get ideas.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
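To make the "remove a segment once all the txns written to it have been applied" rule above more concrete, here is a minimal standalone sketch in C. It is only an illustration under assumptions, not code from the PoC patch: the structure, the pg_logical_delay/... path, and the function names are all hypothetical, and it deliberately ignores the PREPARE/streaming complications described above (a segment holding a PREPARE or stream chunks whose commit lands in a later segment could not be counted as fully applied).

/*
 * Minimal standalone sketch (not the PoC patch): one way the "remove a
 * segment once every txn written to it has been applied" rule could look.
 * All names here (DelaySegment, delay_segment_path, ...) are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef struct DelaySegment
{
	uint32_t	segno;			/* segment number, used in the file name */
	int			nxacts_written; /* txns serialized into this segment */
	int			nxacts_applied; /* txns already applied from it */
} DelaySegment;

/* Build the hypothetical on-disk name of a segment. */
static void
delay_segment_path(char *buf, size_t len, unsigned subid, uint32_t segno)
{
	snprintf(buf, len, "pg_logical_delay/%u/%08X.seg", subid, (unsigned) segno);
}

/*
 * Called after one txn from this segment has been applied.  Once every
 * written txn is applied, the whole segment file can be unlinked.  A real
 * implementation would also have to cope with PREPAREs and streamed chunks
 * whose commit record lives in a later segment, as discussed above.
 */
static bool
delay_segment_release(DelaySegment *seg, unsigned subid)
{
	char		path[64];

	seg->nxacts_applied++;
	if (seg->nxacts_applied < seg->nxacts_written)
		return false;			/* segment still has pending txns */

	delay_segment_path(path, sizeof(path), subid, seg->segno);
	if (unlink(path) != 0)
		perror("unlink delay segment");	/* harmless here if it never existed */
	return true;
}

int
main(void)
{
	DelaySegment seg = {.segno = 1, .nxacts_written = 2, .nxacts_applied = 0};

	/* Pretend the two txns serialized into segment 1 get applied one by one. */
	printf("removed after 1st apply: %d\n", delay_segment_release(&seg, 16384));
	printf("removed after 2nd apply: %d\n", delay_segment_release(&seg, 16384));
	return 0;
}

Run standalone it just reports that the segment is released after the second apply; the hard part in a real worker is exactly what the sketch dodges, namely deciding when a segment may be counted as fully applied.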
Dear hackers,

At PGCon and other places we have discussed time-delayed logical replication, and we now understand that there is no easy way to implement it. The following is our analysis.

# Abstract

To implement time-delayed logical replication with a proper approach, the worker must serialize all the received messages into permanent files. But PostgreSQL does not have good infrastructure for this purpose, so a huge engineering effort is needed.

## Review: problems of the without-file approach

In the without-file approach, the apply worker process sleeps while delaying the application. This approach was chosen in earlier versions like [1], but it has problems, which were shared by Sawada-san [2]. They lead to a PANIC error due to a full disk.

A) WAL cannot be recycled on the publisher because it is not flushed on the subscriber.
B) Moreover, vacuum cannot remove dead tuples on the publisher.

## Alternative approach: serializing messages to files

To prevent these issues, the worker should serialize all incoming messages to a permanent file, like what the physical walreceiver does. Here, messages are first written into files at the beginning of a transaction and then flushed at the end. This approach could solve problems A) and B), but it still has many considerations and difficulties.

### How to separate messages into files?

There are two possibilities for dividing messages into files, but neither of them is ideal.

1. Create a file per received transaction. Files are removed once the delay period has passed and the transaction has been applied. This is the simplest approach, but the number of files bloats.

2. Use one large file or segmented files (like WAL). This reduces the number of files, but we must consider further things:

A) Purge - we must purge applied transactions, but we do not have a good way to remove one transaction from a large file.
B) 2PC - it is likely that the segment which contains the actual transaction differs from the segment which contains the COMMIT PREPARED. Hence the worker must check all the segments to find the actual messages.
C) Streamed in-progress transactions - chunks of a transaction are spread across several segments. Hence the worker must check all the segments to find the chunk messages, same as above.

### Handling the case where a file exceeds the size limit

Regardless of the option chosen above, the file size could exceed the file system's limit, since the publisher can send transactions of any length. PostgreSQL provides a mechanism for working with such large files - the BufFile data structure - but it could not be used as-is for several reasons:

A) It only supports buffered I/O. A read or write of the low-level File occurs only when the buffer is filled or emptied, so we cannot control when data is persisted.
B) It can be used only for temporary purposes. Internally, BufFile creates physical files under the $PGDATA/base/pgsql_tmp directories, and files in that subdirectory are removed when the postmaster restarts.
C) It has no mechanism for restoring its state after a restart. BufFile keeps virtual positions such as the file index and offset, but these fields live in a memory structure, so BufFile forgets the ordering of files and its initial/final position after a restart.
D) It cannot remove a part of the virtual file. Even if a large file is separated into multiple physical files and all transactions in one physical file have already been applied, BufFile cannot remove only that part.

[1]: https://www.postgresql.org/message-id/f026292b-c9ee-472e-beaa-d32c5c3a2ced%40www.fastmail.com
[2]: https://www.postgresql.org/message-id/CAD21AoAeG2+RsUYD9+mEwr8-rrt8R1bqpe56T2D=euO-Qs-GAg@mail.gmail.com

Acknowledgement: Amit, Peter, Sawada-san. Thank you for discussing this with me off-list.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
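Independent of how the changes are stored, the delay itself is just a comparison between the commit timestamp carried with the transaction and min_apply_delay, analogous to recovery_min_apply_delay on a physical standby. Below is a minimal standalone sketch of that computation; the function name and the millisecond-based interface are hypothetical, and a real apply worker would additionally need an interruptible wait so it can still respond to shutdown and report status while delaying.

/*
 * Minimal standalone sketch, not the PoC's code: compute how long a
 * transaction must still be held back, given the commit timestamp recorded
 * with it (e.g. in a delay-file header) and min_apply_delay.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Milliseconds still to wait before the transaction may be applied. */
static int64_t
remaining_delay_ms(time_t commit_time, time_t now, int64_t min_apply_delay_ms)
{
	int64_t		elapsed_ms = (int64_t) (difftime(now, commit_time) * 1000.0);

	if (elapsed_ms >= min_apply_delay_ms)
		return 0;				/* delay already satisfied, apply now */
	return min_apply_delay_ms - elapsed_ms;
}

int
main(void)
{
	time_t		now = time(NULL);
	time_t		commit_time = now - 3600;				/* committed 1h ago */
	int64_t		min_apply_delay_ms = 8LL * 3600 * 1000; /* e.g. '8h' */

	printf("wait another %lld ms before applying\n",
		   (long long) remaining_delay_ms(commit_time, now, min_apply_delay_ms));
	return 0;
}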
Dear hackers,

> At PGCon and other places we have discussed time-delayed logical replication, and we now understand that there is no easy way to implement it. The following is our analysis.

At this point, I do not plan to develop the PoC any further, unless a better idea or infrastructure comes along.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED