Thread: Add two missing tests in 035_standby_logical_decoding.pl
hi hackers, In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did not find the time to complete before the finish line): - Test that we can subscribe to the standby (with the publication created on the primary) - Verify that invalidated logical slots do not lead to retaining WAL Please find those 2 missing tests in the patch proposal attached. A few words about them: 1) Regarding the subscription test: It modifies wait_for_catchup() to take into account the case where the requesting node is in recovery mode. Indeed, without that change, wait_for_subscription_sync() was failing with: " error running SQL: 'psql:<stdin>:1: ERROR: recovery is in progress HINT: WAL control functions cannot be executed during recovery.' while running 'psql -XAtq -d port=61441 host=/tmp/45dt3wqs2p dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT pg_current_wal_lsn()' " 2) Regarding the WAL file not retained test: As it's not possible to execute pg_switch_wal() and friends on a standby, this is done on the primary. Also, checking that the WAL file (linked to the restart_lsn of an invalidated slot) has been removed is done directly at the OS/directory level. The attached patch also removes: " -log_min_messages = 'debug2' -log_error_verbosity = verbose " as also discussed in [1]. I'm not sure if adding those 2 tests should be considered as an open item. I can add this open item if we think that makes sense. I'd be happy to do so but it looks like I don't have the privileges to edit https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com [1]: https://www.postgresql.org/message-id/6d801661-e21b-7326-be1b-f90d904da66a%40gmail.com
Attachment
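For context, the failure quoted above happens because wait_for_catchup() computes its default target LSN with pg_current_wal_lsn(), a WAL control function that cannot be executed while the node is in recovery. A minimal sketch of the idea behind the change, assuming the $node_standby object from the TAP test (the actual diff is shown later in the thread):

# Pick a target LSN that is valid on a standby: WAL control functions such as
# pg_current_wal_lsn() cannot be executed during recovery.
my $is_in_recovery =
  $node_standby->safe_psql('postgres', 'SELECT pg_is_in_recovery()');
chomp($is_in_recovery);

# On a standby use the last replayed LSN, otherwise use the current write LSN.
my $target_lsn =
  $is_in_recovery eq 't'
  ? $node_standby->lsn('replay')
  : $node_standby->lsn('write');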
On 2023-Apr-12, Drouvot, Bertrand wrote: > I'm not sure if adding those 2 tests should be considered as an open > item. I can add this open item if we think that makes sense. I'd be > happy to do so but it looks like I don't have the privileges to edit > https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items I think adding extra tests for new code can definitely be considered an open item, since those tests might help to discover issues in said new code. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Hi, On 4/17/23 11:55 AM, Alvaro Herrera wrote: > On 2023-Apr-12, Drouvot, Bertrand wrote: > >> I'm not sure if adding those 2 tests should be considered as an open >> item. I can add this open item if we think that makes sense. I'd be >> happy to do so but it looks like I don't have the privileges to edit >> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items > > I think adding extra tests for new code can definitely be considered an > open item, since those tests might help to discover issues in said new > code. > Thanks for the feedback! Added as an open item. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > hi hackers, > > In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did > not find the time to complete before the finish line): > > - Test that we can subscribe to the standby (with the publication created on the primary) > - Verify that invalidated logical slots do not lead to retaining WAL > > Please find those 2 missing tests in the patch proposal attached. Few comments: 1) Should this change be committed as a separate patch instead of mixing it with the new test addition patch? I feel it would be better to split it into 0001 and 0002 patches. # Name for the physical slot on primary @@ -235,8 +241,6 @@ $node_primary->append_conf('postgresql.conf', q{ wal_level = 'logical' max_replication_slots = 4 max_wal_senders = 4 -log_min_messages = 'debug2' -log_error_verbosity = verbose }); $node_primary->dump_info; $node_primary->start; 2) We could add a commitfest entry for this, which will help in checking cfbot results across platforms. 3) Should the comment say subscription instead of subscriber here? +# We do not need the subscriber anymore +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub"); +$node_subscriber->stop; 4) we could add a commit message for the patch Regards, Vignesh
Hi, On 4/24/23 6:04 AM, vignesh C wrote: > On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> hi hackers, >> >> In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did >> not find the time to complete before the finish line): >> >> - Test that we can subscribe to the standby (with the publication created on the primary) >> - Verify that invalidated logical slots do not lead to retaining WAL >> >> Please find those 2 missing tests in the patch proposal attached. > > Few comments: Thanks for looking at it! > 1) Should this change be committed as a separate patch instead of > mixing it with the new test addition patch? I feel it would be better > to split it into 0001 and 0002 patches. Agree, done in V2 attached. > 2) We could add a commitfest entry for this, which will help in > checking cfbot results across platforms. Good point, done in [1]. > 3) Should the comment say subscription instead of subscriber here? > +# We do not need the subscriber anymore > +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub"); > +$node_subscriber->stop; Comment was due to the node_subscriber being stopped. Changed to "We do not need the subscription and the subscriber anymore" in V2. > > 4) we could add a commit message for the patch > Good point, done. [1]: https://commitfest.postgresql.org/43/4295/ Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > Few comments: ============ 1. +$node_subscriber->init(allows_streaming => 'logical'); +$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4'); Why do we need slots on the subscriber? 2. +# Speed up the subscription creation +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); + +# Explicitly shut down psql instance gracefully - to avoid hangs +# or worse on windows +$psql_subscriber{subscriber_stdin} .= "\\q\n"; +$psql_subscriber{run}->finish; + +# Insert some rows on the primary +$node_primary->safe_psql('postgres', + qq[INSERT INTO tab_rep select generate_series(1,10);]); + +$node_primary->wait_for_replay_catchup($node_standby); + +# To speed up the wait_for_subscription_sync +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub'); It is not clear to me why you need to do pg_log_standby_snapshot() twice. 3. Why do you need $psql_subscriber to be used in a different way instead of using safe_psql as is used for node_primary? -- With Regards, Amit Kapila.
On Wed, Apr 12, 2023 at 9:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > The attached patch also removes: > > " > -log_min_messages = 'debug2' > -log_error_verbosity = verbose > " > > as also discussed in [1]. > I agree that we should reduce the log level here. It is discussed in an email [1]. I'll push this part tomorrow unless Andres or someone else thinks that we still need this. [1] - https://www.postgresql.org/message-id/523315.1681245505%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Few comments: > ============ > +# We can not test if the WAL file still exists immediately. +# We need to let some time to the standby to actually "remove" it. +my $i = 0; +while (1) +{ + last if !-f $standby_walfile; + if ($i++ == 10 * $default_timeout) + { + die + "could not determine if WAL file has been retained or not, can't continue"; + } + usleep(100_000); +} Is this adhoc wait required because we can't guarantee that the checkpoint is complete on standby even after using wait_for_catchup? Is there a guarantee that it can never fail on some slower machines? BTW, for the second test is it necessary that we first ensure that the WAL file has not been retained on the primary? -- With Regards, Amit Kapila.
Hi, On 4/24/23 8:24 AM, Amit Kapila wrote: > On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> > > Few comments: > ============ Thanks for looking at it! > 1. > +$node_subscriber->init(allows_streaming => 'logical'); > +$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4'); > > Why do we need slots on the subscriber? > Good point, it's not needed. I guess it has been missed during my initial patch cleanup. Fixed in V3 attached. > 2. > +# Speed up the subscription creation > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); > + > +# Explicitly shut down psql instance gracefully - to avoid hangs > +# or worse on windows > +$psql_subscriber{subscriber_stdin} .= "\\q\n"; > +$psql_subscriber{run}->finish; > + > +# Insert some rows on the primary > +$node_primary->safe_psql('postgres', > + qq[INSERT INTO tab_rep select generate_series(1,10);]); > + > +$node_primary->wait_for_replay_catchup($node_standby); > + > +# To speed up the wait_for_subscription_sync > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); > +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub'); > > It is not clear to me why you need to do pg_log_standby_snapshot() twice. That's because there are 2 logical slot creations that have to be done on the standby. The one for the subscription: " CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing') " And the one for the data sync: " CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use') " Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');" Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot(). > > 3. Why do you need $psql_subscriber to be used in a different way > instead of using safe_psql as is used for node_primary? > Because safe_psql() would block waiting for activity on the primary, without being able to launch pg_log_standby_snapshot() on the primary while waiting. psql_subscriber() allows us to not wait synchronously. Also adding a comment in V3 to explain why safe_psql() is not being used here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
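To illustrate the answer to point 3 above: the patch drives the subscriber through a long-lived background psql session, so the test script keeps control and can run pg_log_standby_snapshot() on the primary while CREATE SUBSCRIPTION is still waiting for its slot on the standby. A condensed sketch of that pattern, assuming the node objects and the tap_pub/tap_sub names used in the patch (the full version is in the attachment):

use IPC::Run;

# Background psql connected to the subscriber; statements are pushed through
# stdin so the test script does not block while they run.
my %psql_subscriber = (
	'subscriber_stdin'  => '',
	'subscriber_stdout' => '',
	'subscriber_stderr' => '');
$psql_subscriber{run} = IPC::Run::start(
	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
	'<',  \$psql_subscriber{subscriber_stdin},
	'>',  \$psql_subscriber{subscriber_stdout},
	'2>', \$psql_subscriber{subscriber_stderr});

# Queue the subscription creation; it will wait for a logical slot on the standby...
$psql_subscriber{subscriber_stdin} .=
	"CREATE SUBSCRIPTION tap_sub CONNECTION '"
  . $node_standby->connstr('postgres')
  . "' PUBLICATION tap_pub;\n";
$psql_subscriber{run}->pump_nb;

# ...and unblock that slot creation by logging a standby snapshot on the primary.
$node_primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');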
Hi, On 4/24/23 11:45 AM, Amit Kapila wrote: > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >>> >> >> Few comments: >> ============ >> > > +# We can not test if the WAL file still exists immediately. > +# We need to let some time to the standby to actually "remove" it. > +my $i = 0; > +while (1) > +{ > + last if !-f $standby_walfile; > + if ($i++ == 10 * $default_timeout) > + { > + die > + "could not determine if WAL file has been retained or not, can't continue"; > + } > + usleep(100_000); > +} > > Is this adhoc wait required because we can't guarantee that the > checkpoint is complete on standby even after using wait_for_catchup? Yes, the restart point on the standby is not necessarily completed even after wait_for_catchup is done. > Is there a guarantee that it can never fail on some slower machines? > We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before we time out. Would you prefer to wait more than 3 minutes at a maximum? > BTW, for the second test is it necessary that we first ensure that the > WAL file has not been retained on the primary? > I was not sure it was worth it either. The idea was more: it's useless to verify it is removed on the standby if we are not 100% sure it has been removed on the primary first. But yeah, we can get rid of this test if you prefer. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 4/24/23 8:24 AM, Amit Kapila wrote: > > > 2. > > +# Speed up the subscription creation > > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); > > + > > +# Explicitly shut down psql instance gracefully - to avoid hangs > > +# or worse on windows > > +$psql_subscriber{subscriber_stdin} .= "\\q\n"; > > +$psql_subscriber{run}->finish; > > + > > +# Insert some rows on the primary > > +$node_primary->safe_psql('postgres', > > + qq[INSERT INTO tab_rep select generate_series(1,10);]); > > + > > +$node_primary->wait_for_replay_catchup($node_standby); > > + > > +# To speed up the wait_for_subscription_sync > > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); > > +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub'); > > > > It is not clear to me why you need to do pg_log_standby_snapshot() twice. > > That's because there is 2 logical slot creations that have the be done on the standby. > > The one for the subscription: > > " > CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing') > " > > And the one for the data sync: > > " > CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use') > " > > Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting > some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');" > > Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot(). > Won't this still be unpredictable because it is possible that the tablesync worker may take more time to get launched or create a replication slot? If that happens after your second pg_log_standby_snapshot() then wait_for_subscription_sync() will be hanging. Wouldn't it be better to create a subscription with (copy_data = false) to make it predictable and then we won't need pg_log_standby_snapshot() to be performed twice? If you agree with the above suggestion then you probably need to move wait_for_subscription_sync() before Insert. -- With Regards, Amit Kapila.
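For clarity, here is roughly what that suggestion amounts to (a sketch only, reusing the hypothetical background-psql session and the tap_pub/tap_sub names from the patch): with copy_data = false no tablesync worker, and therefore no second replication slot, is created on the standby, so a single pg_log_standby_snapshot() on the primary is enough to unblock the subscription creation.

# With copy_data = false no tablesync slot is created on the standby,
# so only the subscription's own slot has to wait for a standby snapshot.
$psql_subscriber{subscriber_stdin} .=
	"CREATE SUBSCRIPTION tap_sub"
  . " CONNECTION '" . $node_standby->connstr('postgres') . "'"
  . " PUBLICATION tap_pub"
  . " WITH (copy_data = false);\n";
$psql_subscriber{run}->pump_nb;

# A single standby snapshot on the primary unblocks the slot creation.
$node_primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');

With no initial table synchronization, the rows inserted afterwards are then verified through normal replication rather than through the initial sync.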
On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 4/24/23 11:45 AM, Amit Kapila wrote: > > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand > >> <bertranddrouvot.pg@gmail.com> wrote: > >>> > >> > >> Few comments: > >> ============ > >> > > > > +# We can not test if the WAL file still exists immediately. > > +# We need to let some time to the standby to actually "remove" it. > > +my $i = 0; > > +while (1) > > +{ > > + last if !-f $standby_walfile; > > + if ($i++ == 10 * $default_timeout) > > + { > > + die > > + "could not determine if WAL file has been retained or not, can't continue"; > > + } > > + usleep(100_000); > > +} > > > > Is this adhoc wait required because we can't guarantee that the > > checkpoint is complete on standby even after using wait_for_catchup? > > Yes, the restart point on the standby is not necessary completed even after wait_for_catchup is done. > > > Is there a guarantee that it can never fail on some slower machines? > > > > We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before > we time out. Would you prefer to wait more than 3 minutes at a maximum? > No, because I don't know what would be a suitable timeout here. At this stage, I don't have a good idea on how to implement this test in a better way. Can we split this into a separate patch as the first test is a bit straightforward, we can push that one and then brainstorm on if there is a better way to test this functionality. -- With Regards, Amit Kapila.
Hi, On 4/25/23 6:23 AM, Amit Kapila wrote: > On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting >> some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');" >> >> Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot(). >> > > Won't this still be unpredictable because it is possible that the > tablesync worker may take more time to get launched or create a > replication slot? If that happens after your second > pg_log_standby_snapshot() then wait_for_subscription_sync() will be > hanging. Oh right, that looks like a possible scenario. > Wouldn't it be better to create a subscription with > (copy_data = false) to make it predictable and then we won't need > pg_log_standby_snapshot() to be performed twice? > > If you agree with the above suggestion then you probably need to move > wait_for_subscription_sync() before Insert. > I like that idea, thanks! Done in V4 attached. Not related to the above corner case, but while re-reading the patch I also added: " $node_primary->wait_for_replay_catchup($node_standby); " between the publication creation on the primary and the subscription to the standby (to ensure the publication gets replicated before we request for the subscription creation). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 4/25/23 6:43 AM, Amit Kapila wrote: > On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before >> we time out. Would you prefer to wait more than 3 minutes at a maximum? >> > > No, because I don't know what would be a suitable timeout here. Yeah, I understand that. On the other hand, there are other places that rely on a timeout, for example: - wait_for_catchup(), wait_for_slot_catchup(), wait_for_subscription_sync() by making use of poll_query_until. - wait_for_log() by setting a max_attempts. Couldn't we have the same concern for those ones (aka being suitable on slower machines)? > At > this stage, I don't have a good idea on how to implement this test in > a better way. Can we split this into a separate patch as the first > test is a bit straightforward, we can push that one and then > brainstorm on if there is a better way to test this functionality. > I created a dedicated v4-0002-Add-retained-WAL-test-in-035_standby_logical_deco.patch just shared up-thread. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 4/25/23 6:23 AM, Amit Kapila wrote: > > On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting > >> some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');" > >> > >> Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot(). > >> > > > > Won't this still be unpredictable because it is possible that the > > tablesync worker may take more time to get launched or create a > > replication slot? If that happens after your second > > pg_log_standby_snapshot() then wait_for_subscription_sync() will be > > hanging. > > Oh right, that looks like a possible scenario. > > > Wouldn't it be better to create a subscription with > > (copy_data = false) to make it predictable and then we won't need > > pg_log_standby_snapshot() to be performed twice? > > > > If you agree with the above suggestion then you probably need to move > > wait_for_subscription_sync() before Insert. > > > > I like that idea, thanks! Done in V4 attached. > > Not related to the above corner case, but while re-reading the patch I also added: > > " > $node_primary->wait_for_replay_catchup($node_standby); > " > > between the publication creation on the primary and the subscription to the standby > (to ensure the publication gets replicated before we request for the subscription creation). Thanks for the updated patch. Few comments: 1) subscriber_stdout and subscriber_stderr are not required for this test case, we could remove it, I was able to remove those variables and run the test successfully: +$node_subscriber->start; + +my %psql_subscriber = ( + 'subscriber_stdin' => '', + 'subscriber_stdout' => '', + 'subscriber_stderr' => ''); +$psql_subscriber{run} = IPC::Run::start( + [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ], + '<', + \$psql_subscriber{subscriber_stdin}, + '>', + \$psql_subscriber{subscriber_stdout}, + '2>', + \$psql_subscriber{subscriber_stderr}, + $psql_timeout); I ran it like: my %psql_subscriber = ( 'subscriber_stdin' => ''); $psql_subscriber{run} = IPC::Run::start( [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ], '<', \$psql_subscriber{subscriber_stdin}, $psql_timeout); 2) Also we have changed the default timeout here, why is this change required: my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby'); +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); my $default_timeout = $PostgreSQL::Test::Utils::timeout_default; +my $psql_timeout = IPC::Run::timer(2 * $default_timeout); Regards, Vignesh
Hi, On 4/26/23 6:06 AM, vignesh C wrote: > On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > Thanks for the updated patch. > Few comments: Thanks for looking at it! > 1) subscriber_stdout and subscriber_stderr are not required for this > test case, we could remove it, I was able to remove those variables > and run the test successfully: > +$node_subscriber->start; > + > +my %psql_subscriber = ( > + 'subscriber_stdin' => '', > + 'subscriber_stdout' => '', > + 'subscriber_stderr' => ''); > +$psql_subscriber{run} = IPC::Run::start( > + [ 'psql', '-XA', '-f', '-', '-d', > $node_subscriber->connstr('postgres') ], > + '<', > + \$psql_subscriber{subscriber_stdin}, > + '>', > + \$psql_subscriber{subscriber_stdout}, > + '2>', > + \$psql_subscriber{subscriber_stderr}, > + $psql_timeout); > > I ran it like: > my %psql_subscriber = ( > 'subscriber_stdin' => ''); > $psql_subscriber{run} = IPC::Run::start( > [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ], > '<', > \$psql_subscriber{subscriber_stdin}, > $psql_timeout); > Not using the 3 std* is also the case for example in 021_row_visibility.pl and 032_relfilenode_reuse.pl where the "stderr" is set but does not seem to be used. I don't think that's a problem to keep them all and I think it's better to have them re-directed to dedicated places. > 2) Also we have changed the default timeout here, why is this change required: > my $node_cascading_standby = > PostgreSQL::Test::Cluster->new('cascading_standby'); > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); > my $default_timeout = $PostgreSQL::Test::Utils::timeout_default; > +my $psql_timeout = IPC::Run::timer(2 * $default_timeout); I think I used 021_row_visibility.pl as an example. But agree there is others .pl that are using the timeout_default as the psql_timeout and that the default is enough in our case. So, using the default in V5 attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 4/26/23 6:06 AM, vignesh C wrote: > > On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > Thanks for the updated patch. > > Few comments: > > Thanks for looking at it! > > > 1) subscriber_stdout and subscriber_stderr are not required for this > > test case, we could remove it, I was able to remove those variables > > and run the test successfully: > > +$node_subscriber->start; > > + > > +my %psql_subscriber = ( > > + 'subscriber_stdin' => '', > > + 'subscriber_stdout' => '', > > + 'subscriber_stderr' => ''); > > +$psql_subscriber{run} = IPC::Run::start( > > + [ 'psql', '-XA', '-f', '-', '-d', > > $node_subscriber->connstr('postgres') ], > > + '<', > > + \$psql_subscriber{subscriber_stdin}, > > + '>', > > + \$psql_subscriber{subscriber_stdout}, > > + '2>', > > + \$psql_subscriber{subscriber_stderr}, > > + $psql_timeout); > > > > I ran it like: > > my %psql_subscriber = ( > > 'subscriber_stdin' => ''); > > $psql_subscriber{run} = IPC::Run::start( > > [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ], > > '<', > > \$psql_subscriber{subscriber_stdin}, > > $psql_timeout); > > > > Not using the 3 std* is also the case for example in 021_row_visibility.pl and 032_relfilenode_reuse.pl > where the "stderr" is set but does not seem to be used. > > I don't think that's a problem to keep them all and I think it's better to have > them re-directed to dedicated places. ok, that way it will be consistent across others too. > > 2) Also we have changed the default timeout here, why is this change required: > > my $node_cascading_standby = > > PostgreSQL::Test::Cluster->new('cascading_standby'); > > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); > > my $default_timeout = $PostgreSQL::Test::Utils::timeout_default; > > +my $psql_timeout = IPC::Run::timer(2 * $default_timeout); > > I think I used 021_row_visibility.pl as an example. But agree there is > others .pl that are using the timeout_default as the psql_timeout and that > the default is enough in our case. So, using the default in V5 attached. > Thanks for fixing this. There was one typo in the commit message, subscribtion should be subscription, the rest of the changes looks good to me: Subject: [PATCH v5] Add subscribtion to the standby test in 035_standby_logical_decoding.pl Adding one test, to verify that subscribtion to the standby is possible. Regards, Vignesh
On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 4/24/23 11:45 AM, Amit Kapila wrote: > > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > >> > >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand > >> <bertranddrouvot.pg@gmail.com> wrote: > >>> > >> > >> Few comments: > >> ============ > >> > > > > +# We can not test if the WAL file still exists immediately. > > +# We need to let some time to the standby to actually "remove" it. > > +my $i = 0; > > +while (1) > > +{ > > + last if !-f $standby_walfile; > > + if ($i++ == 10 * $default_timeout) > > + { > > + die > > + "could not determine if WAL file has been retained or not, can't continue"; > > + } > > + usleep(100_000); > > +} > > > > Is this adhoc wait required because we can't guarantee that the > > checkpoint is complete on standby even after using wait_for_catchup? > > Yes, the restart point on the standby is not necessarily completed even after > wait_for_catchup is done. > I think that's because when replaying a checkpoint record, the startup process of standby only saves the information of the checkpoint, and we need to wait for the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If so, could we force a checkpoint on standby? After this, the standby should have completed the restartpoint and we don't need to wait. Besides, would it be better to wait for the cascading standby? If the wal log file needed for cascading standby is removed on the standby, the subsequent test will fail. Do we need to consider this scenario? I saw the following error message after setting recovery_min_apply_delay to 5s on the cascading standby, and the test failed due to a timeout while waiting for the cascading standby. Log of cascading standby node: FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed Regards, Shi Yu
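A minimal sketch of these two suggestions, assuming the node objects of 035_standby_logical_decoding.pl and the wait_for_replay_catchup() helper already used elsewhere in the thread: force a restartpoint on the standby rather than polling for one, and make sure the cascading standby has replayed far enough before the standby is allowed to remove the old WAL file.

# Make sure the cascading standby no longer needs the old WAL segment
# before the standby removes it.
$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);

# Force a restartpoint on the standby so the WAL file removal happens now,
# instead of waiting for the checkpointer to trigger one on its own.
$node_standby->safe_psql('postgres', 'checkpoint;');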
> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm > index 6f7f4e5de4..819667d42a 100644 > --- a/src/test/perl/PostgreSQL/Test/Cluster.pm > +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm > @@ -2644,7 +2644,16 @@ sub wait_for_catchup > } > if (!defined($target_lsn)) > { > - $target_lsn = $self->lsn('write'); > + my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()"); > + chomp($isrecovery); > + if ($isrecovery eq 't') > + { > + $target_lsn = $self->lsn('replay'); > + } > + else > + { > + $target_lsn = $self->lsn('write'); > + } Please modify the function's documentation to account for this code change. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Porque Kim no hacía nada, pero, eso sí, con extraordinario éxito" ("Kim", Kipling)
Hi, On 4/26/23 12:27 PM, Alvaro Herrera wrote: >> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm >> index 6f7f4e5de4..819667d42a 100644 >> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm >> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm >> @@ -2644,7 +2644,16 @@ sub wait_for_catchup >> } >> if (!defined($target_lsn)) >> { >> - $target_lsn = $self->lsn('write'); >> + my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()"); >> + chomp($isrecovery); >> + if ($isrecovery eq 't') >> + { >> + $target_lsn = $self->lsn('replay'); >> + } >> + else >> + { >> + $target_lsn = $self->lsn('write'); >> + } > > Please modify the function's documentation to account for this code change. > Good point, thanks! Done in V6 attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 4/26/23 11:12 AM, vignesh C wrote: > On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand > > There was one typo in the commit message, subscribtion should be > subscription, the rest of the changes looks good to me: > Subject: [PATCH v5] Add subscribtion to the standby test in > 035_standby_logical_decoding.pl > > Adding one test, to verify that subscribtion to the standby is possible. > Oops, at least I repeated it twice ;-) Fixed in V6 that I just shared up-thread. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 4/26/23 11:58 AM, Yu Shi (Fujitsu) wrote: > On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > I think that's because when replaying a checkpoint record, the startup process > of standby only saves the information of the checkpoint, and we need to wait for > the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If > so, could we force a checkpoint on standby? After this, the standby should have > completed the restartpoint and we don't need to wait. > Thanks for looking at it! Oh right, that looks like a good way to ensure the WAL file is removed on the standby so that we don't need to wait. Implemented that way in V6 attached and that works fine. > Besides, would it be better to wait for the cascading standby? If the wal log > file needed for cascading standby is removed on the standby, the subsequent test > will fail. Good catch! I agree that we have to wait on the cascading standby before removing the WAL files. It's done in V6 (and the test is not failing anymore if we set a recovery_min_apply_delay to 5s on the cascading standby). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
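For reference, the race described above (the standby removing a WAL segment the cascading standby still needs) can be made much easier to hit while testing; a hypothetical repro sketch:

# Hypothetical repro helper: delay replay on the cascading standby so it lags
# behind while the standby performs its restartpoint.
$node_cascading_standby->append_conf('postgresql.conf',
	"recovery_min_apply_delay = '5s'");
$node_cascading_standby->restart;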
On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 4/26/23 12:27 PM, Alvaro Herrera wrote: > >> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm > >> index 6f7f4e5de4..819667d42a 100644 > >> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm > >> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm > >> @@ -2644,7 +2644,16 @@ sub wait_for_catchup > >> } > >> if (!defined($target_lsn)) > >> { > >> - $target_lsn = $self->lsn('write'); > >> + my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()"); > >> + chomp($isrecovery); > >> + if ($isrecovery eq 't') > >> + { > >> + $target_lsn = $self->lsn('replay'); > >> + } > >> + else > >> + { > >> + $target_lsn = $self->lsn('write'); > >> + } > > > > Please modify the function's documentation to account for this code change. > > > > Good point, thanks! Done in V6 attached. > +When in recovery, the default value of target_lsn is $node->lsn('replay') +instead. This is needed when the publisher passed to wait_for_subscription_sync() +is a standby node. I think this will be useful whenever wait_for_catchup has been called for a standby node (where self is a standby node). I have tried even by commenting wait_for_subscription_sync in the new test then it fails for $node_standby->wait_for_catchup('tap_sub');. So instead, how about a comment like: "When in recovery, the default value of target_lsn is $node->lsn('replay') instead which ensures that the cascaded standby has caught up to what has been replayed on the standby."? -- With Regards, Amit Kapila.
Hi, On 4/27/23 5:37 AM, Amit Kapila wrote: > On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > +When in recovery, the default value of target_lsn is $node->lsn('replay') > +instead. This is needed when the publisher passed to > wait_for_subscription_sync() > +is a standby node. > > I think this will be useful whenever wait_for_catchup has been called > for a standby node (where self is a standby node). I have tried even > by commenting wait_for_subscription_sync in the new test then it fails > for $node_standby->wait_for_catchup('tap_sub');. So instead, how about > a comment like: "When in recovery, the default value of target_lsn is > $node->lsn('replay') instead which ensures that the cascaded standby > has caught up to what has been replayed on the standby."? > I did it that way because wait_for_subscription_sync() was the first case I had to work on but I do agree that your wording better describe the intend of the new code. Changed in V7 attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 4/27/23 5:37 AM, Amit Kapila wrote: > > On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > +When in recovery, the default value of target_lsn is $node->lsn('replay') > > +instead. This is needed when the publisher passed to > > wait_for_subscription_sync() > > +is a standby node. > > > > I think this will be useful whenever wait_for_catchup has been called > > for a standby node (where self is a standby node). I have tried even > > by commenting wait_for_subscription_sync in the new test then it fails > > for $node_standby->wait_for_catchup('tap_sub');. So instead, how about > > a comment like: "When in recovery, the default value of target_lsn is > > $node->lsn('replay') instead which ensures that the cascaded standby > > has caught up to what has been replayed on the standby."? > > > > I did it that way because wait_for_subscription_sync() was the first case I had > to work on but I do agree that your wording better describe the intend of the new > code. > > Changed in V7 attached. > Pushed. -- With Regards, Amit Kapila.
On 4/27/23 11:54 AM, Amit Kapila wrote: > On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 4/27/23 5:37 AM, Amit Kapila wrote: >>> On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>> >>> +When in recovery, the default value of target_lsn is $node->lsn('replay') >>> +instead. This is needed when the publisher passed to >>> wait_for_subscription_sync() >>> +is a standby node. >>> >>> I think this will be useful whenever wait_for_catchup has been called >>> for a standby node (where self is a standby node). I have tried even >>> by commenting wait_for_subscription_sync in the new test then it fails >>> for $node_standby->wait_for_catchup('tap_sub');. So instead, how about >>> a comment like: "When in recovery, the default value of target_lsn is >>> $node->lsn('replay') instead which ensures that the cascaded standby >>> has caught up to what has been replayed on the standby."? >>> >> >> I did it that way because wait_for_subscription_sync() was the first case I had >> to work on but I do agree that your wording better describe the intend of the new >> code. >> >> Changed in V7 attached. >> > > Pushed. > Thanks! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > Besides, would it be better to wait for the cascading standby? If the wal log > > file needed for cascading standby is removed on the standby, the subsequent test > > will fail. > > Good catch! I agree that we have to wait on the cascading standby before removing > the WAL files. It's done in V6 (and the test is not failing anymore if we set a > recovery_min_apply_delay to 5s on the cascading standby). > +# Get the restart_lsn from an invalidated slot +my $restart_lsn = $node_standby->safe_psql('postgres', + "SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;" +); + +chomp($restart_lsn); + +# Get the WAL file name associated to this lsn on the primary +my $walfile_name = $node_primary->safe_psql('postgres', + "SELECT pg_walfile_name('$restart_lsn')"); + +chomp($walfile_name); + +# Check the WAL file is still on the primary +ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name, + "WAL file still on the primary"); How is it guaranteed that the WAL file corresponding to the invalidated slot on standby will still be present on primary? Can you please explain the logic behind this test a bit more like how the WAL file switch helps you to achieve the purpose? -- With Regards, Amit Kapila.
Hi, On 4/28/23 5:55 AM, Amit Kapila wrote: > On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > +# Get the restart_lsn from an invalidated slot > +my $restart_lsn = $node_standby->safe_psql('postgres', > + "SELECT restart_lsn from pg_replication_slots WHERE slot_name = > 'vacuum_full_activeslot' and conflicting is true;" > +); > + > +chomp($restart_lsn); > + > +# Get the WAL file name associated to this lsn on the primary > +my $walfile_name = $node_primary->safe_psql('postgres', > + "SELECT pg_walfile_name('$restart_lsn')"); > + > +chomp($walfile_name); > + > +# Check the WAL file is still on the primary > +ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name, > + "WAL file still on the primary"); > > How is it guaranteed that the WAL file corresponding to the > invalidated slot on standby will still be present on primary? The slot(s) have been invalidated by the "vacuum full" test just above this one. So I think the WAL we are looking for is the last one being used by the primary. As no activity happened on it since the vacuum full, it looks to me that it should still be present. But I may have missed something and maybe there's no guarantee that this WAL file is still there in all cases. In that case I think it's better to remove this test (it does not provide added value here). Test removed in V7 attached. > Can you > please explain the logic behind this test a bit more like how the WAL > file switch helps you to achieve the purpose? > The idea was to generate enough "wal switch" on the primary to ensure the WAL file has been removed. I gave another thought on it and I think we can skip the test that the WAL is not on the primary any more. That way, one "wal switch" seems to be enough to see it removed on the standby. It's done in V7. V7 is not doing "extra tests" than necessary and I think it's probably better like this. I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining). It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064 I'm not sure why it is failing, any idea? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > Can you > > please explain the logic behind this test a bit more like how the WAL > > file switch helps you to achieve the purpose? > > > > The idea was to generate enough "wal switch" on the primary to ensure > the WAL file has been removed. > > I gave another thought on it and I think we can skip the test that the WAL is > not on the primary any more. That way, one "wal switch" seems to be enough > to see it removed on the standby. > > It's done in V7. > > V7 is not doing "extra tests" than necessary and I think it's probably better like this. > > I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining). > > It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064 > > I'm not sure why it is failing, any idea? > I think the reason for the failure is that on standby, the test is not able to remove the file corresponding to the invalid slot. You are using pg_switch_wal() to generate a switch record and I think you need one more WAL-generating statement after that to achieve your purpose which is that during checkpoint, the test removes the WAL file corresponding to an invalid slot. Just doing checkpoint on primary may not serve the need as that doesn't lead to any new insertion of WAL on standby. Is your v6 failing in the same environment? If not, then it is probably due to the reason that the test is doing insert after pg_switch_wal() in that version. Why did you change the order of insert in v7? BTW, you can confirm the failure by changing the DEBUG2 message in RemoveOldXlogFiles() to LOG. In the case where the test fails, it may not remove the WAL file corresponding to an invalid slot whereas it will remove the WAL file when the test succeeds. -- With Regards, Amit Kapila.
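In other words, the WAL-generating statement has to come after pg_switch_wal() so that the standby replays records beyond the old segment before the restartpoint runs. A sketch of the resulting ordering, assuming a pre-created retain_test table and the $walfile_name computed from the invalidated slot's restart_lsn as in the patch:

# Ask the primary to switch to a new WAL segment...
$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()');

# ...and generate some WAL afterwards, so the standby replays records that
# live beyond the segment holding the invalidated slot's restart_lsn.
$node_primary->safe_psql('postgres', 'INSERT INTO retain_test VALUES (1)');
$node_primary->wait_for_replay_catchup($node_standby);

# A restartpoint on the standby can now remove the old segment.
$node_standby->safe_psql('postgres', 'checkpoint;');

# The old WAL file should be gone from the standby's pg_wal directory.
ok( !-f $node_standby->data_dir . '/pg_wal/' . $walfile_name,
	'invalidated logical slots do not lead to retaining WAL');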
Hi, On 5/2/23 8:28 AM, Amit Kapila wrote: > On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining). >> >> It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064 >> >> I'm not sure why it is failing, any idea? >> > > I think the reason for the failure is that on standby, the test is not > able to remove the file corresponding to the invalid slot. You are > using pg_switch_wal() to generate a switch record and I think you need > one more WAL-generating statement after that to achieve your purpose > which is that during checkpoint, the test removes the WAL file > corresponding to an invalid slot. Just doing checkpoint on primary may > not serve the need as that doesn't lead to any new insertion of WAL on > standby. Is your v6 failing in the same environment? Thanks for the feedback! No, V6 was working fine. > If not, then it > is probably due to the reason that the test is doing insert after > pg_switch_wal() in that version. Why did you change the order of > insert in v7? > I thought doing the insert before the switch was OK and, as my local test was running fine, I did not reconsider the ordering. > BTW, you can confirm the failure by changing the DEBUG2 message in > RemoveOldXlogFiles() to LOG. In the case where the test fails, it may > not remove the WAL file corresponding to an invalid slot whereas it > will remove the WAL file when the test succeeds. Yeah, I added more debug information and what I can see is that the WAL file we want to see removed is "000000010000000000000003" while the standby emits: " 2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG: recycled write-ahead log file "000000010000000000000002" " As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too. > I think it is better to use wait_for_replay_catchup() to wait for standby to catch up. I have changed that and a comment in the attached. I'll push this tomorrow unless there are further comments. -- With Regards, Amit Kapila.
Attachment
Hi, On 5/3/23 12:29 PM, Amit Kapila wrote: > On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> >> As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too. >> > > I think it is better to use wait_for_replay_catchup() to wait for > standby to catch up. Oh right, that's a discussion we already had in [1], I should have thought about it. > I have changed that and a comment in the > attached. I'll push this tomorrow unless there are further comments. > LGTM, thanks! [1]: https://www.postgresql.org/message-id/acbac69e-9ae8-c546-3216-8ecb38e7a93d%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, 3 May 2023 at 15:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too. > > > > I think it is better to use wait_for_replay_catchup() to wait for > standby to catch up. I have changed that and a comment in the > attached. I'll push this tomorrow unless there are further comments. Thanks for posting the updated patch, I had run this test in a loop of 100 times to verify that there was no failure because of race conditions. The 100 times execution passed successfully. One suggestion: "wal file" should be changed to "WAL file": +# Request a checkpoint on the standby to trigger the WAL file(s) removal +$node_standby->safe_psql('postgres', 'checkpoint;'); + +# Verify that the wal file has not been retained on the standby +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; Regards, Vignesh
On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: > > Thanks for posting the updated patch, I had run this test in a loop of > 100 times to verify that there was no failure because of race > conditions. The 100 times execution passed successfully. > > One suggestion: > "wal file" should be changed to "WAL file": > +# Request a checkpoint on the standby to trigger the WAL file(s) removal > +$node_standby->safe_psql('postgres', 'checkpoint;'); > + > +# Verify that the wal file has not been retained on the standby > +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; > Thanks for the verification. I have pushed the patch. -- With Regards, Amit Kapila.
Hi, On 5/4/23 6:43 AM, Amit Kapila wrote: > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: >> >> Thanks for posting the updated patch, I had run this test in a loop of >> 100 times to verify that there was no failure because of race >> conditions. The 100 times execution passed successfully. >> >> One suggestion: >> "wal file" should be changed to "WAL file": >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal >> +$node_standby->safe_psql('postgres', 'checkpoint;'); >> + >> +# Verify that the wal file has not been retained on the standby >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; >> > > Thanks for the verification. I have pushed the patch. > Thanks! I've marked the CF entry as Committed and moved the associated PostgreSQL 16 Open Item to "resolved before 16beta1". Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 5/4/23 6:43 AM, Amit Kapila wrote: > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: >> >> Thanks for posting the updated patch, I had run this test in a loop of >> 100 times to verify that there was no failure because of race >> conditions. The 100 times execution passed successfully. >> >> One suggestion: >> "wal file" should be changed to "WAL file": >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal >> +$node_standby->safe_psql('postgres', 'checkpoint;'); >> + >> +# Verify that the wal file has not been retained on the standby >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; >> > > Thanks for the verification. I have pushed the patch. > It looks like there is still something wrong with this test as there are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson). I'll try to reproduce with more debug infos. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/4/23 6:43 AM, Amit Kapila wrote: > > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: > >> > >> Thanks for posting the updated patch, I had run this test in a loop of > >> 100 times to verify that there was no failure because of race > >> conditions. The 100 times execution passed successfully. > >> > >> One suggestion: > >> "wal file" should be changed to "WAL file": > >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal > >> +$node_standby->safe_psql('postgres', 'checkpoint;'); > >> + > >> +# Verify that the wal file has not been retained on the standby > >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; > >> > > > > Thanks for the verification. I have pushed the patch. > > > > It looks like there is still something wrong with this test as there > are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson). > Is it possible for you to point me to those failures? > I'll try to reproduce with more debug infos. > Okay, thanks! -- With Regards, Amit Kapila.
On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 5/4/23 6:43 AM, Amit Kapila wrote: > > > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: > > >> > > >> Thanks for posting the updated patch, I had run this test in a loop of > > >> 100 times to verify that there was no failure because of race > > >> conditions. The 100 times execution passed successfully. > > >> > > >> One suggestion: > > >> "wal file" should be changed to "WAL file": > > >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal > > >> +$node_standby->safe_psql('postgres', 'checkpoint;'); > > >> + > > >> +# Verify that the wal file has not been retained on the standby > > >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; > > >> > > > > > > Thanks for the verification. I have pushed the patch. > > > > > > > It looks like there is still something wrong with this test as there > > are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson). > > > > Is it possible for you to point me to those failures? I think these failures are occuring in CFBOT, once such instance is at: https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39 https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding Regards, Vignesh
Hi, On 5/5/23 9:11 AM, vignesh C wrote: > On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >>> >>> On 5/4/23 6:43 AM, Amit Kapila wrote: >>>> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: >>>>> >>>>> Thanks for posting the updated patch, I had run this test in a loop of >>>>> 100 times to verify that there was no failure because of race >>>>> conditions. The 100 times execution passed successfully. >>>>> >>>>> One suggestion: >>>>> "wal file" should be changed to "WAL file": >>>>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal >>>>> +$node_standby->safe_psql('postgres', 'checkpoint;'); >>>>> + >>>>> +# Verify that the wal file has not been retained on the standby >>>>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; >>>>> >>>> >>>> Thanks for the verification. I have pushed the patch. >>>> >>> >>> It looks like there is still something wrong with this test as there >>> are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson). >>> >> >> Is it possible for you to point me to those failures? > > I think these failures are occuring in CFBOT, once such instance is at: > https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39 > https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding > Yeah, thanks, that's one of them. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 5/5/23 9:04 AM, Amit Kapila wrote: > On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 5/4/23 6:43 AM, Amit Kapila wrote: >>> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote: >>>> >>>> Thanks for posting the updated patch, I had run this test in a loop of >>>> 100 times to verify that there was no failure because of race >>>> conditions. The 100 times execution passed successfully. >>>> >>>> One suggestion: >>>> "wal file" should be changed to "WAL file": >>>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal >>>> +$node_standby->safe_psql('postgres', 'checkpoint;'); >>>> + >>>> +# Verify that the wal file has not been retained on the standby >>>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; >>>> >>> >>> Thanks for the verification. I have pushed the patch. >>> >> >> It looks like there is still something wrong with this test as there >> are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson). >> > > Is it possible for you to point me to those failures? > >> I'll try to reproduce with more debug infos. >> > > Okay, thanks! > After multiple attempts, I got one failing one. Issue is that we expect this file to be removed: [07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 But the standby emits: 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint; 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 So it seems the test is not right (missing activity??), not sure why yet. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > After multiple attempts, I got one failing one. > > Issue is that we expect this file to be removed: > > [07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 > > But the standby emits: > > 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint; > 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait > 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 > > So it seems the test is not right (missing activity??), not sure why yet. > Can you try to print the value returned by XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also, please try to print "attempting to remove WAL segments ..." on the primary. We can see, if by any chance some slot is holding us to remove the required WAL file. -- With Regards, Amit Kapila.
On Fri, May 5, 2023 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > After multiple attempts, I got one failing one. > > > > Issue is that we expect this file to be removed: > > > > [07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 > > > > But the standby emits: > > > > 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint; > > 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait > > 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 > > > > So it seems the test is not right (missing activity??), not sure why yet. > > > > Can you try to print the value returned by > XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also, > please try to print "attempting to remove WAL segments ..." on the > primary. We can see, if by any chance some slot is holding us to > remove the required WAL file. > We can also probably check the values of 'endptr', 'receivePtr', and replayPtr on standby in the below code: CreateRestartPoint() { ... /* * Retreat _logSegNo using the current end of xlog replayed or received, * whichever is later. */ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL); replayPtr = GetXLogReplayRecPtr(&replayTLI); endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr; KeepLogSeg(endptr, &_logSegNo); ... } -- With Regards, Amit Kapila.
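For illustration, the kind of instrumentation suggested above could look roughly like the sketch below. This is only a minimal sketch: the exact placement and the message wording are assumptions, not the debugging patch that was actually used; it relies on the standard elog() and LSN_FORMAT_ARGS() helpers, and the variable names are the ones from the snippet quoted above.

"
/* Sketch only: in KeepLogSeg(), right after the slot minimum is fetched */
keep = XLogGetReplicationSlotMinimumLSN();
elog(LOG, "KeepLogSeg: replication slot minimum LSN is %X/%X",
	 LSN_FORMAT_ARGS(keep));

/* Sketch only: in CreateRestartPoint(), around the existing KeepLogSeg() call */
receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
elog(LOG, "CreateRestartPoint: receivePtr %X/%X, replayPtr %X/%X, endptr %X/%X",
	 LSN_FORMAT_ARGS(receivePtr), LSN_FORMAT_ARGS(replayPtr),
	 LSN_FORMAT_ARGS(endptr));
KeepLogSeg(endptr, &_logSegNo);
"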
On 5/5/23 11:29 AM, Amit Kapila wrote: > On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> >> After multiple attempts, I got one failing one. >> >> Issue is that we expect this file to be removed: >> >> [07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 >> >> But the standby emits: >> >> 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint; >> 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait >> 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 >> >> So it seems the test is not right (missing activity??), not sure why yet. >> > > Can you try to print the value returned by > XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also, > please try to print "attempting to remove WAL segments ..." on the > primary. We can see, if by any chance some slot is holding us to > remove the required WAL file. > I turned DEBUG2 on. We can also see on the primary: 2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576 2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG: 00000: snapshot of 0+0 running transaction ids (lsn 0/40000D0 oldestxid 746 latest complete 745 next xid 746) 2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION: LogCurrentRunningXacts, standby.c:1377 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateCheckPoint 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: CreateCheckPoint, xlog.c:6835 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveOldXlogFiles, xlog.c:3560 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001" 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002" 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: SlruScanDirectory invoking callback on pg_subtrans/0000 So, 000000010000000000000003 is not removed on the primary. It has been recycled on: 2023-05-05 08:23:38.605 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003" Which is later than the test: [08:23:31.931](0.000s) not ok 19 - invalidated logical slots do not lead to retaining WAL FWIW, the failing test with DEBUG2 can be found there: https://cirrus-ci.com/task/5615316688961536 Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/5/23 11:29 AM, Amit Kapila wrote: > > On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> > >> After multiple attempts, I got one failing one. > >> > >> Issue is that we expect this file to be removed: > >> > >> [07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 > >> > >> But the standby emits: > >> > >> 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint; > >> 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait > >> 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002 > >> > >> So it seems the test is not right (missing activity??), not sure why yet. > >> > > > > Can you try to print the value returned by > > XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also, > > please try to print "attempting to remove WAL segments ..." on the > > primary. We can see, if by any chance some slot is holding us to > > remove the required WAL file. > > > > I turned DEBUG2 on. We can also see on the primary: > > 2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576 > 2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG: 00000: snapshot of 0+0 running transaction ids (lsn 0/40000D0oldest xid 746 latest complete 745 next xid 746) > 2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION: LogCurrentRunningXacts, standby.c:1377 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateCheckPoint > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: CreateCheckPoint, xlog.c:6835 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveOldXlogFiles, xlog.c:3560 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001" > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002" > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708 > 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: SlruScanDirectory invoking callback on pg_subtrans/0000 > > So, 000000010000000000000003 is not removed on the primary. > How did you concluded that 000000010000000000000003 is the file the test is expecting to be removed? -- With Regards, Amit Kapila.
On 5/5/23 12:58 PM, Amit Kapila wrote: > On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > How did you concluded that 000000010000000000000003 is the file the > test is expecting to be removed? > because I added a note in the test that way: " @@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;'); # Verify that the WAL file has not been retained on the standby my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; +note "BDT WAL file is $standby_walfile"; ok(!-f "$standby_walfile", "invalidated logical slots do not lead to retaining WAL"); " so that I can check in the test log file: grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding [08:23:31.931](2.217s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/5/23 12:58 PM, Amit Kapila wrote: > > On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > How did you concluded that 000000010000000000000003 is the file the > > test is expecting to be removed? > > > because I added a note in the test that way: > > " > @@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;'); > > # Verify that the WAL file has not been retained on the standby > my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name; > +note "BDT WAL file is $standby_walfile"; > ok(!-f "$standby_walfile", > "invalidated logical slots do not lead to retaining WAL"); > " > > so that I can check in the test log file: > > grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding > [08:23:31.931](2.217s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 > It seems the current WAL file is not switched for some reason. I think we need to add more DEBUG info to find that out. Can you please try to print 'RedoRecPtr', '_logSegNo', and recptr? /* * Delete old log files, those no longer needed for last checkpoint to * prevent the disk holding the xlog from growing full. */ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); KeepLogSeg(recptr, &_logSegNo); -- With Regards, Amit Kapila.
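The prints asked for here could be sketched like this (again only illustrative; the function and variable names come from the snippet quoted above, while the message wording and the use of UINT64_FORMAT for the segment number are assumptions):

"
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
elog(LOG, "after XLByteToSeg: RedoRecPtr %X/%X, _logSegNo " UINT64_FORMAT,
	 LSN_FORMAT_ARGS(RedoRecPtr), _logSegNo);

KeepLogSeg(recptr, &_logSegNo);
elog(LOG, "after KeepLogSeg: recptr %X/%X, _logSegNo " UINT64_FORMAT,
	 LSN_FORMAT_ARGS(recptr), _logSegNo);
"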
On 5/5/23 2:28 PM, Amit Kapila wrote: > On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > It seems due to some reason the current wal file is not switched due > to some reason. Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the .cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing). So, in this testrun.zip we can see, that the test is ok: $ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding [10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL and that the WAL file we expect to be removed is: $ grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding [10:06:08.789](0.925s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 This WAL file has been removed by the standby: $ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log| grep -i recy 2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003" But on the primary, it has been recycled way after that time: $ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log| grep -i recy 2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003" As, the checkpoint on the primary after the WAL file switch only recycled (001 and 002): $ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log 2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s)added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s, average=0.000s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001" 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002" 2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL file(s)added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s, average=0.000s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098 So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycledon the primary before the test. > I think we need to add more DEBUG info to find that > out. Can you please try to print 'RedoRecPtr', '_logSegNo', and > recptr? > Yes, will do. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > On 5/5/23 2:28 PM, Amit Kapila wrote: > > On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > It seems due to some reason the current wal file is not switched due > > to some reason. > > Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the > .cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing). > > So, in this testrun.zip we can see, that the test is ok: > > $ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding > [10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL > > and that the WAL file we expect to be removed is: > > $ grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding > [10:06:08.789](0.925s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003 > > This WAL file has been removed by the standby: > > $ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log| grep -i recy > 2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003" > > But on the primary, it has been recycled way after that time: > > $ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log| grep -i recy > 2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003" > > As, the checkpoint on the primary after the WAL file switch only recycled (001 and 002): > > $ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log > 2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s)added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s, average=0.000s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028 > 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001" > 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002" > 2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL file(s)added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s, average=0.000s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098 > > > So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycledon the primary before the test. > Okay, one possibility of not removing on primary is that at the time of checkpoint (when we compute RedoRecPtr), the wal_swtich and insert is not yet performed because in that case it will compute the RedoRecPtr as a location before those operations which would be 0000*3 file. 
However, it is not clear how that would be possible, except for a background checkpoint happening at that point; and from the LOGs, it appears that the checkpoint triggered by the test has recycled the WAL files. > > I think we need to add more DEBUG info to find that > > out. Can you please try to print 'RedoRecPtr', '_logSegNo', and > > recptr? > > > > Yes, will do. > Okay, thanks, please try to print similar locations on standby in CreateRestartPoint(). -- With Regards, Amit Kapila.
Hi, On 5/6/23 4:10 AM, Amit Kapila wrote: > On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 5/5/23 2:28 PM, Amit Kapila wrote: >>> On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand >> >> So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycledon the primary before the test. >> > > Okay, one possibility of not removing on primary is that at the time > of checkpoint (when we compute RedoRecPtr), the wal_swtich and insert > is not yet performed because in that case it will compute the > RedoRecPtr as a location before those operations which would be 0000*3 > file. However, it is not clear how is that possible except from a > background checkpoint happening at that point but from LOGs, it > appears that the checkpoint triggered by test has recycled the wal > files. > >>> I think we need to add more DEBUG info to find that >>> out. Can you please try to print 'RedoRecPtr', '_logSegNo', and >>> recptr? >>> >> >> Yes, will do. >> > > Okay, thanks, please try to print similar locations on standby in > CreateRestartPoint(). > The extra information is displayed that way: https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR6822-R6830 https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7269-R7271 https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7281-R7284 There is 2 runs with this extra info in place: A successful one: https://cirrus-ci.com/task/6528745436086272 A failed one: https://cirrus-ci.com/task/4558139312308224 For both the testrun.zip is available in the Artifacts section. Sharing this now in case you want to have a look (I'll have a look at them early next week on my side). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > There is 2 runs with this extra info in place: > > A successful one: https://cirrus-ci.com/task/6528745436086272 > A failed one: https://cirrus-ci.com/task/4558139312308224 > Thanks, I think I got some clue as to why this test is failing randomly. Following is the comparison of successful and failed run logs for standby: Success case ============ 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000000 apply 0/4000000 reply_time 2023-05-06 07:23:05.496365+00 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000148 apply 0/4000000 reply_time 2023-05-06 07:23:05.4964+00 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000148 apply 0/4000148 reply_time 2023-05-06 07:23:05.496531+00 2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101 2023-05-06 07:23:05.500 UTC [17706][client backend] [035_standby_logical_decoding.pl][2/12:0] LOG: 00000: statement: checkpoint; 2023-05-06 07:23:05.500 UTC [17706][client backend] [035_standby_logical_decoding.pl][2/12:0] LOCATION: exec_simple_query, postgres.c:1074 2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOG: 00000: restartpoint starting: immediate wait ... ... 2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576 2023-05-06 07:23:05.501 UTC [17550][checkpointer] DEBUG: 00000: updated min recovery point to 0/4000148 on timeline 1 2023-05-06 07:23:05.501 UTC [17550][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500 2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4 2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271 2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 4 Failed case: ========== 2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush 0/3CFA000 apply 0/4000000 reply_time 2023-05-06 07:53:19.65207+00 2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101 2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush 0/3D1A000 apply 0/4000000 reply_time 2023-05-06 07:53:19.656471+00 2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101 ... ... 
2023-05-06 07:53:19.686 UTC [17881][checkpointer] DEBUG: 00000: updated min recovery point to 0/4000148 on timeline 1 2023-05-06 07:53:19.686 UTC [17881][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500 2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4 2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271 2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 3 Observations: ============ 1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3 which is the reason for the failure because now the standby won't be able to remove/recycle the WAL file corresponding to segment number 3 which the test was expecting. 2. We didn't expect the KeepLogSeg() to reduce the _logSegNo because all logical slots were invalidated. However, I think we forgot that both standby and primary have physical slots which might also influence the XLogGetReplicationSlotMinimumLSN() calculation in KeepLogSeg(). 3. Now, the reason for its success in some of the runs is that restart_lsn of physical slots also moved ahead by the time checkpoint happens. You can see the difference of LSNs for ProcessStandbyReplyMessage in failed and successful cases. Next steps: ========= 1. The first thing is we should verify this theory by adding some LOG in KeepLogSeg() to see if the _logSegNo is reduced due to the value returned by XLogGetReplicationSlotMinimumLSN(). 2. The reason for the required file not being removed in the primary is also that it has a physical slot which prevents the file removal. 3. If the above theory is correct then I see a few possibilities to fix this test (a) somehow ensure that restart_lsn of the physical slot on standby is advanced up to the point where we can safely remove the required files; (b) just create a separate test case by initializing a fresh node for primary and standby where we only have logical slots on standby. This will be a bit costly but probably less risky. (c) any better ideas? -- With Regards, Amit Kapila.
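To make observation 2 above concrete, the slot-related part of KeepLogSeg() behaves roughly like the simplified sketch below. This is not the exact PostgreSQL source (the real function also accounts for wal_keep_size and max_slot_wal_keep_size), but it shows why a physical slot whose restart_lsn still points into an older segment can pull _logSegNo back even though the logical slots were invalidated:

"
/*
 * Simplified sketch, not the exact source: recptr is the end of WAL
 * considered by the checkpoint/restartpoint, *logSegNo the oldest
 * segment that was going to be kept before looking at the slots.
 */
XLogSegNo	segno;
XLogRecPtr	keep = XLogGetReplicationSlotMinimumLSN();

XLByteToSeg(recptr, segno, wal_segment_size);

/*
 * The minimum restart_lsn across the slots that still count decides how
 * far back WAL must be kept; a physical slot lagging in an older segment
 * lowers segno here, and hence _logSegNo.
 */
if (!XLogRecPtrIsInvalid(keep) && keep < recptr)
{
	XLogSegNo	slotSegNo;

	XLByteToSeg(keep, slotSegNo, wal_segment_size);
	if (slotSegNo < segno)
		segno = slotSegNo;
}

/* Keep the lower of the two: WAL from this segment onwards is retained. */
if (segno < *logSegNo)
	*logSegNo = segno;
"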
Hi, On 5/6/23 3:28 PM, Amit Kapila wrote: > On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> There is 2 runs with this extra info in place: >> >> A successful one: https://cirrus-ci.com/task/6528745436086272 >> A failed one: https://cirrus-ci.com/task/4558139312308224 >> > > Thanks, I think I got some clue as to why this test is failing > randomly. Great, thanks! > Observations: > ============ > 1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3 > which is the reason for the failure because now the standby won't be > able to remove/recycle the WAL file corresponding to segment number 3 > which the test was expecting. Agree. > 2. We didn't expect the KeepLogSeg() to reduce the _logSegNo because > all logical slots were invalidated. However, I think we forgot that > both standby and primary have physical slots which might also > influence the XLogGetReplicationSlotMinimumLSN() calculation in > KeepLogSeg(). Oh right... > Next steps: > ========= > 1. The first thing is we should verify this theory by adding some LOG > in KeepLogSeg() to see if the _logSegNo is reduced due to the value > returned by XLogGetReplicationSlotMinimumLSN(). Yeah, will do that early next week. > 2. The reason for the required file not being removed in the primary > is also that it has a physical slot which prevents the file removal. Yeah, agree. But this one is not an issue as we are not checking for the WAL file removal on the primary, do you agree? > 3. If the above theory is correct then I see a few possibilities to > fix this test (a) somehow ensure that restart_lsn of the physical slot > on standby is advanced up to the point where we can safely remove the > required files; (b) just create a separate test case by initializing a > fresh node for primary and standby where we only have logical slots on > standby. This will be a bit costly but probably less risky. (c) any > better ideas? > (c): Since, I think, the physical slot on the primary is not a concern for the reason mentioned above, then instead of (b): What about postponing the physical slot creation on the standby and the cascading standby node initialization after this test? That way, this test would be done without a physical slot on the standby and we could also get rid of the "Wait for the cascading standby to catchup before removing the WAL file(s)" part. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/6/23 3:28 PM, Amit Kapila wrote: > > On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > Next steps: > > ========= > > 1. The first thing is we should verify this theory by adding some LOG > > in KeepLogSeg() to see if the _logSegNo is reduced due to the value > > returned by XLogGetReplicationSlotMinimumLSN(). > > Yeah, will do that early next week. > > > 2. The reason for the required file not being removed in the primary > > is also that it has a physical slot which prevents the file removal. > > Yeah, agree. But this one is not an issue as we are not > checking for the WAL file removal on the primary, do you agree? > Agreed. > > 3. If the above theory is correct then I see a few possibilities to > > fix this test (a) somehow ensure that restart_lsn of the physical slot > > on standby is advanced up to the point where we can safely remove the > > required files; (b) just create a separate test case by initializing a > > fresh node for primary and standby where we only have logical slots on > > standby. This will be a bit costly but probably less risky. (c) any > > better ideas? > > > > (c): Since, I think, the physical slot on the primary is not a concern for > the reason mentioned above, then instead of (b): > > What about postponing the physical slot creation on the standby and the > cascading standby node initialization after this test? > Yeah, that is also possible. But, I have a few questions regarding that: (a) There doesn't seem to be a physical slot on cascading standby, if I am missing something, can you please point me to the relevant part of the test? (b) Which test is currently dependent on the physical slot on standby? -- With Regards, Amit Kapila.
Hi, On 5/8/23 4:42 AM, Amit Kapila wrote: > On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 5/6/23 3:28 PM, Amit Kapila wrote: >>> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >> >>> Next steps: >>> ========= >>> 1. The first thing is we should verify this theory by adding some LOG >>> in KeepLogSeg() to see if the _logSegNo is reduced due to the value >>> returned by XLogGetReplicationSlotMinimumLSN(). >> >> Yeah, will do that early next week. It's done with the following changes: https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514 With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560 Where we can see: 2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,_logSegNo is 4 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7473 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 3 due to XLogGetReplicationSlotMinimumLSN() 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7483 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098,endptr is 0/4000148, _logSegNo is 3 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7284 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateRestartPoint 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7313 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002 So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from 4 to 3 due to XLogGetReplicationSlotMinimumLSN()). >> What about postponing the physical slot creation on the standby and the >> cascading standby node initialization after this test? >> > > Yeah, that is also possible. But, I have a few questions regarding > that: (a) There doesn't seem to be a physical slot on cascading > standby, if I am missing something, can you please point me to the > relevant part of the test? That's right. There is a physical slot only on the primary and on the standby. What I meant up-thread is to also postpone the cascading standby node initialization after this test (once the physical slot on the standby is created). Please find attached a proposal doing so. > (b) Which test is currently dependent on > the physical slot on standby? Not a test but the cascading standby initialization with the "primary_slot_name" parameter. Also, now I think that's better to have the physical slot on the standby + hsf set to on on the cascading standby (coming from the standby backup). Idea is to avoid any risk of logical slot invalidation on the cascading standby in the standby promotion test. That was not the case before the attached proposal though (hsf was off on the cascading standby). 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/8/23 4:42 AM, Amit Kapila wrote: > > On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 5/6/23 3:28 PM, Amit Kapila wrote: > >>> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >> > >>> Next steps: > >>> ========= > >>> 1. The first thing is we should verify this theory by adding some LOG > >>> in KeepLogSeg() to see if the _logSegNo is reduced due to the value > >>> returned by XLogGetReplicationSlotMinimumLSN(). > >> > >> Yeah, will do that early next week. > > It's done with the following changes: > > https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514 > > With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560 > > Where we can see: > > 2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,_logSegNo is 4 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7473 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 3 due to XLogGetReplicationSlotMinimumLSN() > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7483 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098,endptr is 0/4000148, _logSegNo is 3 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7284 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateRestartPoint > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7313 > 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002 > > So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from > 4 to 3 due to XLogGetReplicationSlotMinimumLSN()). > > >> What about postponing the physical slot creation on the standby and the > >> cascading standby node initialization after this test? > >> > > > > Yeah, that is also possible. But, I have a few questions regarding > > that: (a) There doesn't seem to be a physical slot on cascading > > standby, if I am missing something, can you please point me to the > > relevant part of the test? > > That's right. There is a physical slot only on the primary and on the standby. > > What I meant up-thread is to also postpone the cascading standby node initialization > after this test (once the physical slot on the standby is created). > > Please find attached a proposal doing so. > > > (b) Which test is currently dependent on > > the physical slot on standby? > > Not a test but the cascading standby initialization with the "primary_slot_name" parameter. > > Also, now I think that's better to have the physical slot on the standby + hsf set to on on the > cascading standby (coming from the standby backup). 
> > Idea is to avoid any risk of logical slot invalidation on the cascading standby in the > standby promotion test. > Why not initialize the cascading standby node just before the standby promotion test: "Test standby promotion and logical decoding behavior after the standby gets promoted."? That way we will avoid any unknown side-effects of cascading standby and it will anyway look more logical to initialize it where the test needs it. -- With Regards, Amit Kapila.
Hi, On 5/9/23 8:02 AM, Amit Kapila wrote: > On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > Why not initialize the cascading standby node just before the standby > promotion test: "Test standby promotion and logical decoding behavior > after the standby gets promoted."? That way we will avoid any unknown > side-effects of cascading standby and it will anyway look more logical > to initialize it where the test needs it. > Yeah, that's even better. In the attached V2, the physical slot creation on the standby and the cascading standby initialization are moved to where they are strictly needed. V2 also ensures that hsf (hot_standby_feedback) is set to on on the cascading standby, to be on the safe side. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 5/9/23 8:02 AM, Amit Kapila wrote: > > On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > Why not initialize the cascading standby node just before the standby > > promotion test: "Test standby promotion and logical decoding behavior > > after the standby gets promoted."? That way we will avoid any unknown > > side-effects of cascading standby and it will anyway look more logical > > to initialize it where the test needs it. > > > > Yeah, that's even better. Moved the physical slot creation on the standby > and the cascading standby initialization where "strictly" needed in V2 > attached. > > Also ensuring that hsf is set to on on the cascading standby to be on the > safe side of thing. > Pushed this yesterday. -- With Regards, Amit Kapila.
On 5/10/23 12:41 PM, Amit Kapila wrote: > On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand > > Pushed this yesterday. > Thanks! -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com