Thread: Add two missing tests in 035_standby_logical_decoding.pl

Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
hi hackers,

In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
not find the time to complete before the finish line):

- Test that we can subscribe to the standby (with the publication created on the primary)
- Verify that invalidated logical slots do not lead to retaining WAL

Please find those 2 missing tests in the patch proposal attached.

A few words about them:

1) Regarding the subscription test:

It modifies wait_for_catchup() to take into account the case where the requesting
node is in recovery mode. Indeed, without that change, wait_for_subscription_sync() was
failing with:

"
error running SQL: 'psql:<stdin>:1: ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.'
while running 'psql -XAtq -d port=61441 host=/tmp/45dt3wqs2p dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql
'SELECTpg_current_wal_lsn()'
 
"

2) Regarding the WAL file not retained test:

As it's not possible to execute pg_switch_wal() and friends on a standby, this is
done on the primary. Also checking that the WAL file (linked to a restart_lsn of an invalidate
slot) has been removed is done directly at the os/directory level.

The attached patch also removes:

"
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
"

as also discussed in [1].

I'm not sure if adding those 2 tests should be considered as an open item. I can add this open item
if we think that makes sense. I'd be happy to do so but it looks like I don't have the privileges
to edit https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/6d801661-e21b-7326-be1b-f90d904da66a%40gmail.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Alvaro Herrera
Date:
On 2023-Apr-12, Drouvot, Bertrand wrote:

> I'm not sure if adding those 2 tests should be considered as an open
> item. I can add this open item if we think that makes sense. I'd be
> happy to do so but it looks like I don't have the privileges to edit
> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items

I think adding extra tests for new code can definitely be considered an
open item, since those tests might help to discover issues in said new
code.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/17/23 11:55 AM, Alvaro Herrera wrote:
> On 2023-Apr-12, Drouvot, Bertrand wrote:
> 
>> I'm not sure if adding those 2 tests should be considered as an open
>> item. I can add this open item if we think that makes sense. I'd be
>> happy to do so but it looks like I don't have the privileges to edit
>> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items
> 
> I think adding extra tests for new code can definitely be considered an
> open item, since those tests might help to discover issues in said new
> code.
> 

Thanks for the feedback! Added as an open item.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
vignesh C
Date:
On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> hi hackers,
>
> In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
> not find the time to complete before the finish line):
>
> - Test that we can subscribe to the standby (with the publication created on the primary)
> - Verify that invalidated logical slots do not lead to retaining WAL
>
> Please find those 2 missing tests in the patch proposal attached.

Few comments:
1) Should this change be committed as a separate patch instead of
mixing it with the new test addition patch? I feel it would be better
to split it into 0001 and 0002 patches.
 # Name for the physical slot on primary
@@ -235,8 +241,6 @@ $node_primary->append_conf('postgresql.conf', q{
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
 });
 $node_primary->dump_info;
 $node_primary->start;

2) We could add a commitfest entry for this, which will help in
checking cfbot results across platforms.

3) Should the comment say subscription instead of subscriber here?
+# We do not need the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;

4) we could add a commit message for the patch

Regards,
Vignesh



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/24/23 6:04 AM, vignesh C wrote:
> On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> hi hackers,
>>
>> In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
>> not find the time to complete before the finish line):
>>
>> - Test that we can subscribe to the standby (with the publication created on the primary)
>> - Verify that invalidated logical slots do not lead to retaining WAL
>>
>> Please find those 2 missing tests in the patch proposal attached.
> 
> Few comments:

Thanks for looking at it!

> 1) Should this change be committed as a separate patch instead of
> mixing it with the new test addition patch? I feel it would be better
> to split it into 0001 and 0002 patches.

Agree, done in V2 attached.
  
> 2) We could add a commitfest entry for this, which will help in
> checking cfbot results across platforms.

Good point, done in [1].

> 3) Should the comment say subscription instead of subscriber here?
> +# We do not need the subscriber anymore
> +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
> +$node_subscriber->stop;

Comment was due to the node_subscriber being stopped. Changed to
  "We do not need the subscription and the subscriber anymore"
in V2.

> 
> 4) we could add a commit message for the patch
> 

Good point, done.

[1]: https://commitfest.postgresql.org/43/4295/

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>

Few comments:
============
1.
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');

Why do we need slots on the subscriber?

2.
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+ qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

It is not clear to me why you need to do pg_log_standby_snapshot() twice.

3. Why do you need $psql_subscriber to be used in a different way
instead of using safe_psql as is used for node_primary?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Wed, Apr 12, 2023 at 9:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
>
> The attached patch also removes:
>
> "
> -log_min_messages = 'debug2'
> -log_error_verbosity = verbose
> "
>
> as also discussed in [1].
>

I agree that we should reduce the log level here. It is discussed in
an email [1]. I'll push this part tomorrow unless Andres or someone
else thinks that we still need this.

[1] - https://www.postgresql.org/message-id/523315.1681245505%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
>
> Few comments:
> ============
>

+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?
Is there a guarantee that it can never fail on some slower machines?

BTW, for the second test is it necessary that we first ensure that the
WAL file has not been retained on the primary?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/24/23 8:24 AM, Amit Kapila wrote:
> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
> 
> Few comments:
> ============

Thanks for looking at it!

> 1.
> +$node_subscriber->init(allows_streaming => 'logical');
> +$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');
> 
> Why do we need slots on the subscriber?
> 

Good point, it's not needed. I guess it has been missed during my initial patch clean up.

Fixed in V3 attached.

> 2.
> +# Speed up the subscription creation
> +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> +
> +# Explicitly shut down psql instance gracefully - to avoid hangs
> +# or worse on windows
> +$psql_subscriber{subscriber_stdin} .= "\\q\n";
> +$psql_subscriber{run}->finish;
> +
> +# Insert some rows on the primary
> +$node_primary->safe_psql('postgres',
> + qq[INSERT INTO tab_rep select generate_series(1,10);]);
> +
> +$node_primary->wait_for_replay_catchup($node_standby);
> +
> +# To speed up the wait_for_subscription_sync
> +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
> 
> It is not clear to me why you need to do pg_log_standby_snapshot() twice.

That's because there is 2 logical slot creations that have the be done on the standby.

The one for the subscription:

"
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
"

And the one for the data sync:

"
CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use')
"

Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().

> 
> 3. Why do you need $psql_subscriber to be used in a different way
> instead of using safe_psql as is used for node_primary?
> 

Because safe_psql() would wait for activity on the primary without being able to launch
pg_log_standby_snapshot() on the primary while waiting. psql_subscriber() allows
to not wait synchronously.

Also adding a comment in V3 to explain why safe_psql() is not being used here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/24/23 11:45 AM, Amit Kapila wrote:
> On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>
>> Few comments:
>> ============
>>
> 
> +# We can not test if the WAL file still exists immediately.
> +# We need to let some time to the standby to actually "remove" it.
> +my $i = 0;
> +while (1)
> +{
> + last if !-f $standby_walfile;
> + if ($i++ == 10 * $default_timeout)
> + {
> + die
> +   "could not determine if WAL file has been retained or not, can't continue";
> + }
> + usleep(100_000);
> +}
> 
> Is this adhoc wait required because we can't guarantee that the
> checkpoint is complete on standby even after using wait_for_catchup?

Yes, the restart point on the standby is not necessary completed even after wait_for_catchup is done.

> Is there a guarantee that it can never fail on some slower machines?
> 

We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
we time out. Would you prefer to wait more than 3 minutes at a maximum?

> BTW, for the second test is it necessary that we first ensure that the
> WAL file has not been retained on the primary?
> 

I was not sure it's worth it too. Idea was more: it's useless to verify it is removed on
the standby if we are not 100% sure it has been removed on the primary first. But yeah, we can get
rid of this test if you prefer.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/24/23 8:24 AM, Amit Kapila wrote:
>
> > 2.
> > +# Speed up the subscription creation
> > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> > +
> > +# Explicitly shut down psql instance gracefully - to avoid hangs
> > +# or worse on windows
> > +$psql_subscriber{subscriber_stdin} .= "\\q\n";
> > +$psql_subscriber{run}->finish;
> > +
> > +# Insert some rows on the primary
> > +$node_primary->safe_psql('postgres',
> > + qq[INSERT INTO tab_rep select generate_series(1,10);]);
> > +
> > +$node_primary->wait_for_replay_catchup($node_standby);
> > +
> > +# To speed up the wait_for_subscription_sync
> > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> > +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
> >
> > It is not clear to me why you need to do pg_log_standby_snapshot() twice.
>
> That's because there is 2 logical slot creations that have the be done on the standby.
>
> The one for the subscription:
>
> "
> CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
> "
>
> And the one for the data sync:
>
> "
> CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use')
> "
>
> Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
> some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"
>
> Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().
>

Won't this still be unpredictable because it is possible that the
tablesync worker may take more time to get launched or create a
replication slot? If that happens after your second
pg_log_standby_snapshot() then wait_for_subscription_sync() will be
hanging. Wouldn't it be better to create a subscription with
(copy_data = false) to make it predictable and then we won't need
pg_log_standby_snapshot() to be performed twice?

If you agree with the above suggestion then you probably need to move
wait_for_subscription_sync() before Insert.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/24/23 11:45 AM, Amit Kapila wrote:
> > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> >> <bertranddrouvot.pg@gmail.com> wrote:
> >>>
> >>
> >> Few comments:
> >> ============
> >>
> >
> > +# We can not test if the WAL file still exists immediately.
> > +# We need to let some time to the standby to actually "remove" it.
> > +my $i = 0;
> > +while (1)
> > +{
> > + last if !-f $standby_walfile;
> > + if ($i++ == 10 * $default_timeout)
> > + {
> > + die
> > +   "could not determine if WAL file has been retained or not, can't continue";
> > + }
> > + usleep(100_000);
> > +}
> >
> > Is this adhoc wait required because we can't guarantee that the
> > checkpoint is complete on standby even after using wait_for_catchup?
>
> Yes, the restart point on the standby is not necessary completed even after wait_for_catchup is done.
>
> > Is there a guarantee that it can never fail on some slower machines?
> >
>
> We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
> we time out. Would you prefer to wait more than 3 minutes at a maximum?
>

No, because I don't know what would be a suitable timeout here. At
this stage, I don't have a good idea on how to implement this test in
a better way. Can we split this into a separate patch as the first
test is a bit straightforward, we can push that one and then
brainstorm on if there is a better way to test this functionality.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/25/23 6:23 AM, Amit Kapila wrote:
> On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
>> some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"
>>
>> Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().
>>
> 
> Won't this still be unpredictable because it is possible that the
> tablesync worker may take more time to get launched or create a
> replication slot? If that happens after your second
> pg_log_standby_snapshot() then wait_for_subscription_sync() will be
> hanging. 

Oh right, that looks like a possible scenario.

> Wouldn't it be better to create a subscription with
> (copy_data = false) to make it predictable and then we won't need
> pg_log_standby_snapshot() to be performed twice?
> 
> If you agree with the above suggestion then you probably need to move
> wait_for_subscription_sync() before Insert.
> 

I like that idea, thanks! Done in V4 attached.

Not related to the above corner case, but while re-reading the patch I also added:

"
$node_primary->wait_for_replay_catchup($node_standby);
"

between the publication creation on the primary and the subscription to the standby
(to ensure the publication gets replicated before we request for the subscription creation).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/25/23 6:43 AM, Amit Kapila wrote:
> On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
>> we time out. Would you prefer to wait more than 3 minutes at a maximum?
>>
> 
> No, because I don't know what would be a suitable timeout here.

Yeah, I understand that. On the other hand, there is other places that
rely on a timeout, for example:

- wait_for_catchup(), wait_for_slot_catchup(),
wait_for_subscription_sync() by making use of poll_query_until.
- wait_for_log() by setting a max_attempts.

Couldn't we have the same concern for those ones? (aka be suitable on
slower machines).

> At
> this stage, I don't have a good idea on how to implement this test in
> a better way. Can we split this into a separate patch as the first
> test is a bit straightforward, we can push that one and then
> brainstorm on if there is a better way to test this functionality.
> 

I created a dedicated v4-0002-Add-retained-WAL-test-in-035_standby_logical_deco.patch
just shared up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
vignesh C
Date:
On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 4/25/23 6:23 AM, Amit Kapila wrote:
> > On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >> Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
> >> some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"
> >>
> >> Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().
> >>
> >
> > Won't this still be unpredictable because it is possible that the
> > tablesync worker may take more time to get launched or create a
> > replication slot? If that happens after your second
> > pg_log_standby_snapshot() then wait_for_subscription_sync() will be
> > hanging.
>
> Oh right, that looks like a possible scenario.
>
> > Wouldn't it be better to create a subscription with
> > (copy_data = false) to make it predictable and then we won't need
> > pg_log_standby_snapshot() to be performed twice?
> >
> > If you agree with the above suggestion then you probably need to move
> > wait_for_subscription_sync() before Insert.
> >
>
> I like that idea, thanks! Done in V4 attached.
>
> Not related to the above corner case, but while re-reading the patch I also added:
>
> "
> $node_primary->wait_for_replay_catchup($node_standby);
> "
>
> between the publication creation on the primary and the subscription to the standby
> (to ensure the publication gets replicated before we request for the subscription creation).

Thanks for the updated patch.
Few comments:
1) subscriber_stdout and  subscriber_stderr are not required for this
test case, we could remove it, I was able to remove those variables
and run the test successfully:
+$node_subscriber->start;
+
+my %psql_subscriber = (
+       'subscriber_stdin'  => '',
+       'subscriber_stdout' => '',
+       'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+       [ 'psql', '-XA', '-f', '-', '-d',
$node_subscriber->connstr('postgres') ],
+       '<',
+       \$psql_subscriber{subscriber_stdin},
+       '>',
+       \$psql_subscriber{subscriber_stdout},
+       '2>',
+       \$psql_subscriber{subscriber_stderr},
+       $psql_timeout);

I ran it like:
my %psql_subscriber = (
'subscriber_stdin' => '');
$psql_subscriber{run} = IPC::Run::start(
[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
'<',
\$psql_subscriber{subscriber_stdin},
$psql_timeout);

2) Also we have changed the default timeout here, why is this change required:
 my $node_cascading_standby =
PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);

Regards,
Vignesh



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/26/23 6:06 AM, vignesh C wrote:
> On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> Thanks for the updated patch.
> Few comments:

Thanks for looking at it!

> 1) subscriber_stdout and  subscriber_stderr are not required for this
> test case, we could remove it, I was able to remove those variables
> and run the test successfully:
> +$node_subscriber->start;
> +
> +my %psql_subscriber = (
> +       'subscriber_stdin'  => '',
> +       'subscriber_stdout' => '',
> +       'subscriber_stderr' => '');
> +$psql_subscriber{run} = IPC::Run::start(
> +       [ 'psql', '-XA', '-f', '-', '-d',
> $node_subscriber->connstr('postgres') ],
> +       '<',
> +       \$psql_subscriber{subscriber_stdin},
> +       '>',
> +       \$psql_subscriber{subscriber_stdout},
> +       '2>',
> +       \$psql_subscriber{subscriber_stderr},
> +       $psql_timeout);
> 
> I ran it like:
> my %psql_subscriber = (
> 'subscriber_stdin' => '');
> $psql_subscriber{run} = IPC::Run::start(
> [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
> '<',
> \$psql_subscriber{subscriber_stdin},
> $psql_timeout);
> 

Not using the 3 std* is also the case for example in 021_row_visibility.pl and 032_relfilenode_reuse.pl
where the "stderr" is set but does not seem to be used.

I don't think that's a problem to keep them all and I think it's better to have
them re-directed to dedicated places.

> 2) Also we have changed the default timeout here, why is this change required:
>   my $node_cascading_standby =
> PostgreSQL::Test::Cluster->new('cascading_standby');
> +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
>   my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
> +my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);

I think I used 021_row_visibility.pl as an example. But agree there is
others .pl that are using the timeout_default as the psql_timeout and that
the default is enough in our case. So, using the default in V5 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
vignesh C
Date:
On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 4/26/23 6:06 AM, vignesh C wrote:
> > On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> > Thanks for the updated patch.
> > Few comments:
>
> Thanks for looking at it!
>
> > 1) subscriber_stdout and  subscriber_stderr are not required for this
> > test case, we could remove it, I was able to remove those variables
> > and run the test successfully:
> > +$node_subscriber->start;
> > +
> > +my %psql_subscriber = (
> > +       'subscriber_stdin'  => '',
> > +       'subscriber_stdout' => '',
> > +       'subscriber_stderr' => '');
> > +$psql_subscriber{run} = IPC::Run::start(
> > +       [ 'psql', '-XA', '-f', '-', '-d',
> > $node_subscriber->connstr('postgres') ],
> > +       '<',
> > +       \$psql_subscriber{subscriber_stdin},
> > +       '>',
> > +       \$psql_subscriber{subscriber_stdout},
> > +       '2>',
> > +       \$psql_subscriber{subscriber_stderr},
> > +       $psql_timeout);
> >
> > I ran it like:
> > my %psql_subscriber = (
> > 'subscriber_stdin' => '');
> > $psql_subscriber{run} = IPC::Run::start(
> > [ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
> > '<',
> > \$psql_subscriber{subscriber_stdin},
> > $psql_timeout);
> >
>
> Not using the 3 std* is also the case for example in 021_row_visibility.pl and 032_relfilenode_reuse.pl
> where the "stderr" is set but does not seem to be used.
>
> I don't think that's a problem to keep them all and I think it's better to have
> them re-directed to dedicated places.

ok, that way it will be consistent across others too.

> > 2) Also we have changed the default timeout here, why is this change required:
> >   my $node_cascading_standby =
> > PostgreSQL::Test::Cluster->new('cascading_standby');
> > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
> >   my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
> > +my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);
>
> I think I used 021_row_visibility.pl as an example. But agree there is
> others .pl that are using the timeout_default as the psql_timeout and that
> the default is enough in our case. So, using the default in V5 attached.
>

Thanks for fixing this.

There was one typo in the commit message, subscribtion should be
subscription, the rest of the changes looks good to me:
Subject: [PATCH v5] Add subscribtion to the standby test in
 035_standby_logical_decoding.pl

Adding one test, to verify that subscribtion to the standby is possible.

Regards,
Vignesh



RE: Add two missing tests in 035_standby_logical_decoding.pl

From
"Yu Shi (Fujitsu)"
Date:
On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
> 
> On 4/24/23 11:45 AM, Amit Kapila wrote:
> > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >>
> >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> >> <bertranddrouvot.pg@gmail.com> wrote:
> >>>
> >>
> >> Few comments:
> >> ============
> >>
> >
> > +# We can not test if the WAL file still exists immediately.
> > +# We need to let some time to the standby to actually "remove" it.
> > +my $i = 0;
> > +while (1)
> > +{
> > + last if !-f $standby_walfile;
> > + if ($i++ == 10 * $default_timeout)
> > + {
> > + die
> > +   "could not determine if WAL file has been retained or not, can't continue";
> > + }
> > + usleep(100_000);
> > +}
> >
> > Is this adhoc wait required because we can't guarantee that the
> > checkpoint is complete on standby even after using wait_for_catchup?
> 
> Yes, the restart point on the standby is not necessary completed even after
> wait_for_catchup is done.
> 

I think that's because when replaying a checkpoint record, the startup process
of standby only saves the information of the checkpoint, and we need to wait for
the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If
so, could we force a checkpoint on standby? After this, the standby should have
completed the restartpoint and we don't need to wait.

Besides, would it be better to wait for the cascading standby? If the wal log
file needed for cascading standby is removed on the standby, the subsequent test
will fail. Do we need to consider this scenario? I saw the following error
message after setting recovery_min_apply_delay to 5s on the cascading standby,
and the test failed due to a timeout while waiting for cascading standby.

Log of cascading standby node:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been
removed

Regards,
Shi Yu

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Alvaro Herrera
Date:
> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 6f7f4e5de4..819667d42a 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -2644,7 +2644,16 @@ sub wait_for_catchup
>      }
>      if (!defined($target_lsn))
>      {
> -        $target_lsn = $self->lsn('write');
> +        my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
> +        chomp($isrecovery);
> +        if ($isrecovery eq 't')
> +        {
> +            $target_lsn = $self->lsn('replay');
> +        }
> +        else
> +        {
> +            $target_lsn = $self->lsn('write');
> +        }

Please modify the function's documentation to account for this code change.


-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Porque Kim no hacía nada, pero, eso sí,
con extraordinario éxito" ("Kim", Kipling)



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/26/23 12:27 PM, Alvaro Herrera wrote:
>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> index 6f7f4e5de4..819667d42a 100644
>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> @@ -2644,7 +2644,16 @@ sub wait_for_catchup
>>       }
>>       if (!defined($target_lsn))
>>       {
>> -        $target_lsn = $self->lsn('write');
>> +        my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
>> +        chomp($isrecovery);
>> +        if ($isrecovery eq 't')
>> +        {
>> +            $target_lsn = $self->lsn('replay');
>> +        }
>> +        else
>> +        {
>> +            $target_lsn = $self->lsn('write');
>> +        }
> 
> Please modify the function's documentation to account for this code change.
> 

Good point, thanks! Done in V6 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/26/23 11:12 AM, vignesh C wrote:
> On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand
> 
> There was one typo in the commit message, subscribtion should be
> subscription, the rest of the changes looks good to me:
> Subject: [PATCH v5] Add subscribtion to the standby test in
>   035_standby_logical_decoding.pl
> 
> Adding one test, to verify that subscribtion to the standby is possible.
> 

Oops, at least I repeated it twice ;-)
Fixed in V6 that I just shared up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/26/23 11:58 AM, Yu Shi (Fujitsu) wrote:
> On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:

> I think that's because when replaying a checkpoint record, the startup process
> of standby only saves the information of the checkpoint, and we need to wait for
> the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If
> so, could we force a checkpoint on standby? After this, the standby should have
> completed the restartpoint and we don't need to wait.
> 

Thanks for looking at it!

Oh right, that looks like good a good way to ensure the WAL file is removed on the standby
so that we don't need to wait.

Implemented that way in V6 attached and that works fine.

> Besides, would it be better to wait for the cascading standby? If the wal log
> file needed for cascading standby is removed on the standby, the subsequent test
> will fail. 

Good catch! I agree that we have to wait on the cascading standby before removing
the WAL files. It's done in V6 (and the test is not failing anymore if we set a
recovery_min_apply_delay to 5s on the cascading standby).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/26/23 12:27 PM, Alvaro Herrera wrote:
> >> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> >> index 6f7f4e5de4..819667d42a 100644
> >> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> >> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> >> @@ -2644,7 +2644,16 @@ sub wait_for_catchup
> >>      }
> >>      if (!defined($target_lsn))
> >>      {
> >> -            $target_lsn = $self->lsn('write');
> >> +            my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
> >> +            chomp($isrecovery);
> >> +            if ($isrecovery eq 't')
> >> +            {
> >> +                    $target_lsn = $self->lsn('replay');
> >> +            }
> >> +            else
> >> +            {
> >> +                    $target_lsn = $self->lsn('write');
> >> +            }
> >
> > Please modify the function's documentation to account for this code change.
> >
>
> Good point, thanks! Done in V6 attached.
>

+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to
wait_for_subscription_sync()
+is a standby node.

I think this will be useful whenever wait_for_catchup has been called
for a standby node (where self is a standby node). I have tried even
by commenting wait_for_subscription_sync in the new test then it fails
for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
a comment like: "When in recovery, the default value of target_lsn is
$node->lsn('replay') instead which ensures that the cascaded standby
has caught up to what has been replayed on the standby."?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/27/23 5:37 AM, Amit Kapila wrote:
> On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> +When in recovery, the default value of target_lsn is $node->lsn('replay')
> +instead. This is needed when the publisher passed to
> wait_for_subscription_sync()
> +is a standby node.
> 
> I think this will be useful whenever wait_for_catchup has been called
> for a standby node (where self is a standby node). I have tried even
> by commenting wait_for_subscription_sync in the new test then it fails
> for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
> a comment like: "When in recovery, the default value of target_lsn is
> $node->lsn('replay') instead which ensures that the cascaded standby
> has caught up to what has been replayed on the standby."?
> 

I did it that way because wait_for_subscription_sync() was the first case I had
to work on but I do agree that your wording better describe the intend of the new
code.

Changed in V7 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/27/23 5:37 AM, Amit Kapila wrote:
> > On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > +When in recovery, the default value of target_lsn is $node->lsn('replay')
> > +instead. This is needed when the publisher passed to
> > wait_for_subscription_sync()
> > +is a standby node.
> >
> > I think this will be useful whenever wait_for_catchup has been called
> > for a standby node (where self is a standby node). I have tried even
> > by commenting wait_for_subscription_sync in the new test then it fails
> > for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
> > a comment like: "When in recovery, the default value of target_lsn is
> > $node->lsn('replay') instead which ensures that the cascaded standby
> > has caught up to what has been replayed on the standby."?
> >
>
> I did it that way because wait_for_subscription_sync() was the first case I had
> to work on but I do agree that your wording better describe the intend of the new
> code.
>
> Changed in V7 attached.
>

Pushed.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 4/27/23 11:54 AM, Amit Kapila wrote:
> On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 4/27/23 5:37 AM, Amit Kapila wrote:
>>> On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>> +When in recovery, the default value of target_lsn is $node->lsn('replay')
>>> +instead. This is needed when the publisher passed to
>>> wait_for_subscription_sync()
>>> +is a standby node.
>>>
>>> I think this will be useful whenever wait_for_catchup has been called
>>> for a standby node (where self is a standby node). I have tried even
>>> by commenting wait_for_subscription_sync in the new test then it fails
>>> for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
>>> a comment like: "When in recovery, the default value of target_lsn is
>>> $node->lsn('replay') instead which ensures that the cascaded standby
>>> has caught up to what has been replayed on the standby."?
>>>
>>
>> I did it that way because wait_for_subscription_sync() was the first case I had
>> to work on but I do agree that your wording better describe the intend of the new
>> code.
>>
>> Changed in V7 attached.
>>
> 
> Pushed.
> 

Thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> > Besides, would it be better to wait for the cascading standby? If the wal log
> > file needed for cascading standby is removed on the standby, the subsequent test
> > will fail.
>
> Good catch! I agree that we have to wait on the cascading standby before removing
> the WAL files. It's done in V6 (and the test is not failing anymore if we set a
> recovery_min_apply_delay to 5s on the cascading standby).
>

+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+ "SELECT restart_lsn from pg_replication_slots WHERE slot_name =
'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+ "SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+ "WAL file still on the primary");

How is it guaranteed that the WAL file corresponding to the
invalidated slot on standby will still be present on primary? Can you
please explain the logic behind this test a bit more like how the WAL
file switch helps you to achieve the purpose?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/28/23 5:55 AM, Amit Kapila wrote:
> On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> +# Get the restart_lsn from an invalidated slot
> +my $restart_lsn = $node_standby->safe_psql('postgres',
> + "SELECT restart_lsn from pg_replication_slots WHERE slot_name =
> 'vacuum_full_activeslot' and conflicting is true;"
> +);
> +
> +chomp($restart_lsn);
> +
> +# Get the WAL file name associated to this lsn on the primary
> +my $walfile_name = $node_primary->safe_psql('postgres',
> + "SELECT pg_walfile_name('$restart_lsn')");
> +
> +chomp($walfile_name);
> +
> +# Check the WAL file is still on the primary
> +ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
> + "WAL file still on the primary");
> 
> How is it guaranteed that the WAL file corresponding to the
> invalidated slot on standby will still be present on primary?

The slot(s) have been invalidated by the "vacuum full" test just above
this one. So I think the WAL we are looking for is the last one being used
by the primary. As no activity happened on it since the vacuum full it looks to
me that it should still be present.

But I may have missed something and maybe that's not guarantee that this WAL is still there in all the cases.
In that case I think it's better to remove this test (it does not provide added value here).

Test removed in V7 attached.

> Can you
> please explain the logic behind this test a bit more like how the WAL
> file switch helps you to achieve the purpose?
> 

The idea was to generate enough "wal switch" on the primary to ensure
the WAL file has been removed.

I gave another thought on it and I think we can skip the test that the WAL is
not on the primary any more. That way, one "wal switch" seems to be enough
to see it removed on the standby.

It's done in V7.

V7 is not doing "extra tests" than necessary and I think it's probably better like this.

I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).

It does fail on "invalidated logical slots do not lead to retaining WAL", see
https://cirrus-ci.com/task/4518083541336064

I'm not sure why it is failing, any idea?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> > Can you
> > please explain the logic behind this test a bit more like how the WAL
> > file switch helps you to achieve the purpose?
> >
>
> The idea was to generate enough "wal switch" on the primary to ensure
> the WAL file has been removed.
>
> I gave another thought on it and I think we can skip the test that the WAL is
> not on the primary any more. That way, one "wal switch" seems to be enough
> to see it removed on the standby.
>
> It's done in V7.
>
> V7 is not doing "extra tests" than necessary and I think it's probably better like this.
>
> I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).
>
> It does fail on "invalidated logical slots do not lead to retaining WAL", see
https://cirrus-ci.com/task/4518083541336064
>
> I'm not sure why it is failing, any idea?
>

I think the reason for the failure is that on standby, the test is not
able to remove the file corresponding to the invalid slot. You are
using pg_switch_wal() to generate a switch record and I think you need
one more WAL-generating statement after that to achieve your purpose
which is that during checkpoint, the tes removes the WAL file
corresponding to an invalid slot. Just doing checkpoint on primary may
not serve the need as that doesn't lead to any new insertion of WAL on
standby. Is your v6 failing in the same environment? If not, then it
is probably due to the reason that the test is doing insert after
pg_switch_wal() in that version. Why did you change the order of
insert in v7?

BTW, you can confirm the failure by changing the DEBUG2 message in
RemoveOldXlogFiles() to LOG. In the case, where the test fails, it may
not remove the WAL file corresponding to an invalid slot whereas it
will remove the WAL file when the test succeeds.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/2/23 8:28 AM, Amit Kapila wrote:
> On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).
>>
>> It does fail on "invalidated logical slots do not lead to retaining WAL", see
https://cirrus-ci.com/task/4518083541336064
>>
>> I'm not sure why it is failing, any idea?
>>
> 
> I think the reason for the failure is that on standby, the test is not
> able to remove the file corresponding to the invalid slot. You are
> using pg_switch_wal() to generate a switch record and I think you need
> one more WAL-generating statement after that to achieve your purpose
> which is that during checkpoint, the tes removes the WAL file
> corresponding to an invalid slot. Just doing checkpoint on primary may
> not serve the need as that doesn't lead to any new insertion of WAL on
> standby. Is your v6 failing in the same environment?

Thanks for the feedback!

No V6 was working fine.

> If not, then it
> is probably due to the reason that the test is doing insert after
> pg_switch_wal() in that version. Why did you change the order of
> insert in v7?
> 

I thought doing the insert before the switch was ok and as my local test
was running fine I did not re-consider the ordering.

> BTW, you can confirm the failure by changing the DEBUG2 message in
> RemoveOldXlogFiles() to LOG. In the case, where the test fails, it may
> not remove the WAL file corresponding to an invalid slot whereas it
> will remove the WAL file when the test succeeds.

Yeah, I added more debug information and what I can see is that the WAL file
we want to see removed is "000000010000000000000003" while the standby emits:

"
2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002
2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG:  recycled write-ahead log file "000000010000000000000002"
"

As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment
too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
>
> As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing
environmenttoo. 
>

I think it is better to use wait_for_replay_catchup() to wait for
standby to catch up. I have changed that and a comment in the
attached. I'll push this tomorrow unless there are further comments.

--
With Regards,
Amit Kapila.

Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/3/23 12:29 PM, Amit Kapila wrote:
> On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>>
>> As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing
environmenttoo.
 
>>
> 
> I think it is better to use wait_for_replay_catchup() to wait for
> standby to catch up.

Oh right, that's a discussion we already had in [1], I should have thought about it.

>  I have changed that and a comment in the
> attached. I'll push this tomorrow unless there are further comments.
> 

LGTM, thanks!

[1]: https://www.postgresql.org/message-id/acbac69e-9ae8-c546-3216-8ecb38e7a93d%40gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
vignesh C
Date:
On Wed, 3 May 2023 at 15:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> >
> > As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing
environmenttoo. 
> >
>
> I think it is better to use wait_for_replay_catchup() to wait for
> standby to catch up. I have changed that and a comment in the
> attached. I'll push this tomorrow unless there are further comments.

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Regards,
Vignesh



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for posting the updated patch, I had run this test in a loop of
> 100 times to verify that there was no failure because of race
> conditions. The 100 times execution passed successfully.
>
> One suggestion:
> "wal file" should be changed to "WAL file":
> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
> +$node_standby->safe_psql('postgres', 'checkpoint;');
> +
> +# Verify that the wal file has not been retained on the standby
> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
>

Thanks for the verification. I have pushed the patch.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/4/23 6:43 AM, Amit Kapila wrote:
> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Thanks for posting the updated patch, I had run this test in a loop of
>> 100 times to verify that there was no failure because of race
>> conditions. The 100 times execution passed successfully.
>>
>> One suggestion:
>> "wal file" should be changed to "WAL file":
>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
>> +$node_standby->safe_psql('postgres', 'checkpoint;');
>> +
>> +# Verify that the wal file has not been retained on the standby
>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
>>
> 
> Thanks for the verification. I have pushed the patch.
> 

Thanks!

I've marked the CF entry as Committed and moved the associated PostgreSQL 16 Open Item
to "resolved before 16beta1".

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/4/23 6:43 AM, Amit Kapila wrote:
> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Thanks for posting the updated patch, I had run this test in a loop of
>> 100 times to verify that there was no failure because of race
>> conditions. The 100 times execution passed successfully.
>>
>> One suggestion:
>> "wal file" should be changed to "WAL file":
>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
>> +$node_standby->safe_psql('postgres', 'checkpoint;');
>> +
>> +# Verify that the wal file has not been retained on the standby
>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
>>
> 
> Thanks for the verification. I have pushed the patch.
> 

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

I'll try to reproduce with more debug infos.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/4/23 6:43 AM, Amit Kapila wrote:
> > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
> >>
> >> Thanks for posting the updated patch, I had run this test in a loop of
> >> 100 times to verify that there was no failure because of race
> >> conditions. The 100 times execution passed successfully.
> >>
> >> One suggestion:
> >> "wal file" should be changed to "WAL file":
> >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
> >> +$node_standby->safe_psql('postgres', 'checkpoint;');
> >> +
> >> +# Verify that the wal file has not been retained on the standby
> >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
> >>
> >
> > Thanks for the verification. I have pushed the patch.
> >
>
> It looks like there is still something wrong with this test as there
> are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).
>

Is it possible for you to point me to those failures?

> I'll try to reproduce with more debug infos.
>

Okay, thanks!

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
vignesh C
Date:
On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > On 5/4/23 6:43 AM, Amit Kapila wrote:
> > > On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
> > >>
> > >> Thanks for posting the updated patch, I had run this test in a loop of
> > >> 100 times to verify that there was no failure because of race
> > >> conditions. The 100 times execution passed successfully.
> > >>
> > >> One suggestion:
> > >> "wal file" should be changed to "WAL file":
> > >> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
> > >> +$node_standby->safe_psql('postgres', 'checkpoint;');
> > >> +
> > >> +# Verify that the wal file has not been retained on the standby
> > >> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
> > >>
> > >
> > > Thanks for the verification. I have pushed the patch.
> > >
> >
> > It looks like there is still something wrong with this test as there
> > are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).
> >
>
> Is it possible for you to point me to those failures?

I think these failures are occuring in CFBOT, once such instance is at:
https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39

https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding

Regards,
Vignesh



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/5/23 9:11 AM, vignesh C wrote:
> On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>> On 5/4/23 6:43 AM, Amit Kapila wrote:
>>>> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
>>>>>
>>>>> Thanks for posting the updated patch, I had run this test in a loop of
>>>>> 100 times to verify that there was no failure because of race
>>>>> conditions. The 100 times execution passed successfully.
>>>>>
>>>>> One suggestion:
>>>>> "wal file" should be changed to "WAL file":
>>>>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
>>>>> +$node_standby->safe_psql('postgres', 'checkpoint;');
>>>>> +
>>>>> +# Verify that the wal file has not been retained on the standby
>>>>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
>>>>>
>>>>
>>>> Thanks for the verification. I have pushed the patch.
>>>>
>>>
>>> It looks like there is still something wrong with this test as there
>>> are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).
>>>
>>
>> Is it possible for you to point me to those failures?
> 
> I think these failures are occuring in CFBOT, once such instance is at:
> https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39
>
https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
> 

Yeah, thanks, that's one of them.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 5/5/23 9:04 AM, Amit Kapila wrote:
> On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 5/4/23 6:43 AM, Amit Kapila wrote:
>>> On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:
>>>>
>>>> Thanks for posting the updated patch, I had run this test in a loop of
>>>> 100 times to verify that there was no failure because of race
>>>> conditions. The 100 times execution passed successfully.
>>>>
>>>> One suggestion:
>>>> "wal file" should be changed to "WAL file":
>>>> +# Request a checkpoint on the standby to trigger the WAL file(s) removal
>>>> +$node_standby->safe_psql('postgres', 'checkpoint;');
>>>> +
>>>> +# Verify that the wal file has not been retained on the standby
>>>> +my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
>>>>
>>>
>>> Thanks for the verification. I have pushed the patch.
>>>
>>
>> It looks like there is still something wrong with this test as there
>> are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).
>>
> 
> Is it possible for you to point me to those failures?
> 
>> I'll try to reproduce with more debug infos.
>>
> 
> Okay, thanks!
> 

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG:  statement:
checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG:  restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
>
> After multiple attempts, I got one failing one.
>
> Issue is that we expect this file to be removed:
>
> [07:24:27.261](0.899s) #WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
>
> But the standby emits:
>
> 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG:  statement:
checkpoint;
> 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG:  restartpoint starting: immediate wait
> 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002
>
> So it seems the test is not right (missing activity??), not sure why yet.
>

Can you try to print the value returned by
XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
please try to print "attempting to remove WAL segments ..." on the
primary. We can see, if by any chance some slot is holding us to
remove the required WAL file.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> >
> > After multiple attempts, I got one failing one.
> >
> > Issue is that we expect this file to be removed:
> >
> > [07:24:27.261](0.899s) #WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
> >
> > But the standby emits:
> >
> > 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG:  statement:
checkpoint;
> > 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG:  restartpoint starting: immediate wait
> > 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002
> >
> > So it seems the test is not right (missing activity??), not sure why yet.
> >
>
> Can you try to print the value returned by
> XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
> please try to print "attempting to remove WAL segments ..." on the
> primary. We can see, if by any chance some slot is holding us to
> remove the required WAL file.
>

We can also probably check the values of 'endptr',  'receivePtr', and
replayPtr on standby in the below code:

CreateRestartPoint()
{
...
/*
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
...
}

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 5/5/23 11:29 AM, Amit Kapila wrote:
> On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>>
>> After multiple attempts, I got one failing one.
>>
>> Issue is that we expect this file to be removed:
>>
>> [07:24:27.261](0.899s) #WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
>>
>> But the standby emits:
>>
>> 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG:  statement:
checkpoint;
>> 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG:  restartpoint starting: immediate wait
>> 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002
>>
>> So it seems the test is not right (missing activity??), not sure why yet.
>>
> 
> Can you try to print the value returned by
> XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
> please try to print "attempting to remove WAL segments ..." on the
> primary. We can see, if by any chance some slot is holding us to
> remove the required WAL file.
> 

I turned DEBUG2 on. We can also see on the primary:

2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION:  CheckPointReplicationSlots, slot.c:1576
2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG:  00000: snapshot of 0+0 running transaction ids (lsn 0/40000D0
oldestxid 746 latest complete 745 next xid 746)
 
2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION:  LogCurrentRunningXacts, standby.c:1377
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG:  00000: BDT1 about to call RemoveOldXlogFiles in
CreateCheckPoint
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  CreateCheckPoint, xlog.c:6835
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG:  00000: attempting to remove WAL segments older than log file
000000000000000000000002
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveOldXlogFiles, xlog.c:3560
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000001"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000002"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: SlruScanDirectory invoking callback on
pg_subtrans/0000

So, 000000010000000000000003 is not removed on the primary.

It has been recycled on:

2023-05-05 08:23:38.605 UTC [16833][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000003"

Which is later than the test:

[08:23:31.931](0.000s) not ok 19 - invalidated logical slots do not lead to retaining WAL

FWIW, the failing test with DEBUG2 can be found there: https://cirrus-ci.com/task/5615316688961536

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/5/23 11:29 AM, Amit Kapila wrote:
> > On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >>
> >> After multiple attempts, I got one failing one.
> >>
> >> Issue is that we expect this file to be removed:
> >>
> >> [07:24:27.261](0.899s) #WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
> >>
> >> But the standby emits:
> >>
> >> 2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG:  statement:
checkpoint;
> >> 2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG:  restartpoint starting: immediate wait
> >> 2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG:  attempting to remove WAL segments older than log file
000000000000000000000002
> >>
> >> So it seems the test is not right (missing activity??), not sure why yet.
> >>
> >
> > Can you try to print the value returned by
> > XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
> > please try to print "attempting to remove WAL segments ..." on the
> > primary. We can see, if by any chance some slot is holding us to
> > remove the required WAL file.
> >
>
> I turned DEBUG2 on. We can also see on the primary:
>
> 2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION:  CheckPointReplicationSlots, slot.c:1576
> 2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG:  00000: snapshot of 0+0 running transaction ids (lsn
0/40000D0oldest xid 746 latest complete 745 next xid 746) 
> 2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION:  LogCurrentRunningXacts, standby.c:1377
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG:  00000: BDT1 about to call RemoveOldXlogFiles in
CreateCheckPoint
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  CreateCheckPoint, xlog.c:6835
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG:  00000: attempting to remove WAL segments older than log file
000000000000000000000002
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveOldXlogFiles, xlog.c:3560
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000001"
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveXlogFile, xlog.c:3708
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000002"
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION:  RemoveXlogFile, xlog.c:3708
> 2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG:  00000: SlruScanDirectory invoking callback on
pg_subtrans/0000
>
> So, 000000010000000000000003 is not removed on the primary.
>

How did you concluded that 000000010000000000000003 is the file the
test is expecting to be removed?


--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 5/5/23 12:58 PM, Amit Kapila wrote:
> On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:

> How did you concluded that 000000010000000000000003 is the file the
> test is expecting to be removed?
> 
because I added a note in the test that way:

"
@@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;');

  # Verify that the WAL file has not been retained on the standby
  my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+note "BDT WAL file is $standby_walfile";
  ok(!-f "$standby_walfile",
         "invalidated logical slots do not lead to retaining WAL");
"

so that I can check in the test log file:

grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[08:23:31.931](2.217s) # BDT WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/5/23 12:58 PM, Amit Kapila wrote:
> > On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
>
> > How did you concluded that 000000010000000000000003 is the file the
> > test is expecting to be removed?
> >
> because I added a note in the test that way:
>
> "
> @@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;');
>
>   # Verify that the WAL file has not been retained on the standby
>   my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
> +note "BDT WAL file is $standby_walfile";
>   ok(!-f "$standby_walfile",
>          "invalidated logical slots do not lead to retaining WAL");
> "
>
> so that I can check in the test log file:
>
> grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
> [08:23:31.931](2.217s) # BDT WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
>

It seems due to some reason the current wal file is not switched due
to some reason. I think we need to add more DEBUG info to find that
out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
recptr?

/*
* Delete old log files, those no longer needed for last checkpoint to
* prevent the disk holding the xlog from growing full.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);


--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 5/5/23 2:28 PM, Amit Kapila wrote:
> On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> It seems due to some reason the current wal file is not switched due
> to some reason.

Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the
.cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing).

So, in this testrun.zip we can see, that the test is ok:

$ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL

and that the WAL file we expect to be removed is:

$ grep "WAL file is"
./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.925s) # BDT WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

This WAL file has been removed by the standby:

$ grep -i 000000010000000000000003
./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log| grep -i recy
 
2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000003"

But on the primary, it has been recycled way after that time:

$ grep -i 000000010000000000000003
./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log| grep -i recy
 
2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000003"

As, the checkpoint on the primary after the WAL file switch only recycled (001 and 002):

$ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log
2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG:  00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL
file(s)added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s,
average=0.000s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028
 
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000001"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000002"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG:  00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL
file(s)added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s,
average=0.000s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098
 


So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been
recycledon the primary before the test.
 

> I think we need to add more DEBUG info to find that
> out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
> recptr?
>

Yes, will do.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
>
> On 5/5/23 2:28 PM, Amit Kapila wrote:
> > On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > It seems due to some reason the current wal file is not switched due
> > to some reason.
>
> Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the
> .cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing).
>
> So, in this testrun.zip we can see, that the test is ok:
>
> $ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
> [10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL
>
> and that the WAL file we expect to be removed is:
>
> $ grep "WAL file is"
./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
> [10:06:08.789](0.925s) # BDT WAL file is
/Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003
>
> This WAL file has been removed by the standby:
>
> $ grep -i 000000010000000000000003
./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log| grep -i recy 
> 2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000003"
>
> But on the primary, it has been recycled way after that time:
>
> $ grep -i 000000010000000000000003
./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log| grep -i recy 
> 2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000003"
>
> As, the checkpoint on the primary after the WAL file switch only recycled (001 and 002):
>
> $ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log
> 2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG:  00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL
file(s)added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s,
average=0.000s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028 
> 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000001"
> 2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG:  00000: recycled write-ahead log file
"000000010000000000000002"
> 2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG:  00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL
file(s)added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s,
average=0.000s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098 
>
>
> So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been
recycledon the primary before the test. 
>

Okay, one possibility of not removing on primary is that at the time
of checkpoint (when we compute RedoRecPtr), the wal_swtich and insert
is not yet performed because in that case it will compute the
RedoRecPtr as a location before those operations which would be 0000*3
file. However, it is not clear how is that possible except from a
background checkpoint happening at that point but from LOGs, it
appears that the checkpoint triggered by test has recycled the wal
files.

> > I think we need to add more DEBUG info to find that
> > out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
> > recptr?
> >
>
> Yes, will do.
>

Okay, thanks, please try to print similar locations on standby in
CreateRestartPoint().

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/6/23 4:10 AM, Amit Kapila wrote:
> On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 5/5/23 2:28 PM, Amit Kapila wrote:
>>> On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
>>
>> So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been
recycledon the primary before the test.
 
>>
> 
> Okay, one possibility of not removing on primary is that at the time
> of checkpoint (when we compute RedoRecPtr), the wal_swtich and insert
> is not yet performed because in that case it will compute the
> RedoRecPtr as a location before those operations which would be 0000*3
> file. However, it is not clear how is that possible except from a
> background checkpoint happening at that point but from LOGs, it
> appears that the checkpoint triggered by test has recycled the wal
> files.
> 
>>> I think we need to add more DEBUG info to find that
>>> out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
>>> recptr?
>>>
>>
>> Yes, will do.
>>
> 
> Okay, thanks, please try to print similar locations on standby in
> CreateRestartPoint().
> 

The extra information is displayed that way:


https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR6822-R6830

https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7269-R7271

https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7281-R7284

There is 2 runs with this extra info in place:

A successful one: https://cirrus-ci.com/task/6528745436086272
A failed one: https://cirrus-ci.com/task/4558139312308224

For both the testrun.zip is available in the Artifacts section.

Sharing this now in case you want to have a look (I'll have a look at them early next week on my side).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> There is 2 runs with this extra info in place:
>
> A successful one: https://cirrus-ci.com/task/6528745436086272
> A failed one: https://cirrus-ci.com/task/4558139312308224
>

Thanks, I think I got some clue as to why this test is failing
randomly. Following is the comparison of successful and failed run
logs for standby:

Success case
============
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000000 apply 0/4000000 reply_time 2023-05-06 07:23:05.496365+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000148 apply 0/4000000 reply_time 2023-05-06 07:23:05.4964+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000148 apply 0/4000148 reply_time 2023-05-06 07:23:05.496531+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOG:  00000: statement:
checkpoint;
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOCATION:
exec_simple_query, postgres.c:1074
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOG:  00000:
restartpoint starting: immediate wait
...
...
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOCATION:
CheckPointReplicationSlots, slot.c:1576
2023-05-06 07:23:05.501 UTC [17550][checkpointer] DEBUG:  00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:23:05.501 UTC [17550][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG:  00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG:  00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 4

Failed case:
==========
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/3D1A000 flush
0/3CFA000 apply 0/4000000 reply_time 2023-05-06 07:53:19.65207+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/3D1A000 flush
0/3D1A000 apply 0/4000000 reply_time 2023-05-06 07:53:19.656471+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
...
...
2023-05-06 07:53:19.686 UTC [17881][checkpointer] DEBUG:  00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:53:19.686 UTC [17881][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG:  00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG:  00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 3

Observations:
============
1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3
which is the reason for the failure because now the standby won't be
able to remove/recycle the WAL file corresponding to segment number 3
which the test was expecting.
2.  We didn't expect the KeepLogSeg() to reduce the _logSegNo because
all logical slots were invalidated. However, I think we forgot that
both standby and primary have physical slots which might also
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg().
3. Now, the reason for its success in some of the runs is that
restart_lsn of physical slots also moved ahead by the time checkpoint
happens. You can see the difference of LSNs for
ProcessStandbyReplyMessage in failed and successful cases.

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().
2. The reason for the required file not being removed in the primary
is also that it has a physical slot which prevents the file removal.
3. If the above theory is correct then I see a few possibilities to
fix this test (a) somehow ensure that restart_lsn of the physical slot
on standby is advanced up to the point where we can safely remove the
required files; (b) just create a separate test case by initializing a
fresh node for primary and standby where we only have logical slots on
standby. This will be a bit costly but probably less risky. (c) any
better ideas?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/6/23 3:28 PM, Amit Kapila wrote:
> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> There is 2 runs with this extra info in place:
>>
>> A successful one: https://cirrus-ci.com/task/6528745436086272
>> A failed one: https://cirrus-ci.com/task/4558139312308224
>>
> 
> Thanks, I think I got some clue as to why this test is failing
> randomly. 

Great, thanks!

> Observations:
> ============
> 1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3
> which is the reason for the failure because now the standby won't be
> able to remove/recycle the WAL file corresponding to segment number 3
> which the test was expecting.

Agree.

> 2.  We didn't expect the KeepLogSeg() to reduce the _logSegNo because
> all logical slots were invalidated. However, I think we forgot that
> both standby and primary have physical slots which might also
> influence the XLogGetReplicationSlotMinimumLSN() calculation in
> KeepLogSeg().

Oh right...

> Next steps:
> =========
> 1. The first thing is we should verify this theory by adding some LOG
> in KeepLogSeg() to see if the _logSegNo is reduced due to the value
> returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

> 2. The reason for the required file not being removed in the primary
> is also that it has a physical slot which prevents the file removal.

Yeah, agree. But this one is not an issue as we are not
checking for the WAL file removal on the primary, do you agree?

> 3. If the above theory is correct then I see a few possibilities to
> fix this test (a) somehow ensure that restart_lsn of the physical slot
> on standby is advanced up to the point where we can safely remove the
> required files; (b) just create a separate test case by initializing a
> fresh node for primary and standby where we only have logical slots on
> standby. This will be a bit costly but probably less risky. (c) any
> better ideas?
> 

(c): Since, I think, the physical slot on the primary is not a concern for
the reason mentioned above, then instead of (b):

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

That way, this test would be done without a physical slot on the standby and
we could also get rid of the "Wait for the cascading standby to catchup before
removing the WAL file(s)" part.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/6/23 3:28 PM, Amit Kapila wrote:
> > On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
>
> > Next steps:
> > =========
> > 1. The first thing is we should verify this theory by adding some LOG
> > in KeepLogSeg() to see if the _logSegNo is reduced due to the value
> > returned by XLogGetReplicationSlotMinimumLSN().
>
> Yeah, will do that early next week.
>
> > 2. The reason for the required file not being removed in the primary
> > is also that it has a physical slot which prevents the file removal.
>
> Yeah, agree. But this one is not an issue as we are not
> checking for the WAL file removal on the primary, do you agree?
>

Agreed.

> > 3. If the above theory is correct then I see a few possibilities to
> > fix this test (a) somehow ensure that restart_lsn of the physical slot
> > on standby is advanced up to the point where we can safely remove the
> > required files; (b) just create a separate test case by initializing a
> > fresh node for primary and standby where we only have logical slots on
> > standby. This will be a bit costly but probably less risky. (c) any
> > better ideas?
> >
>
> (c): Since, I think, the physical slot on the primary is not a concern for
> the reason mentioned above, then instead of (b):
>
> What about postponing the physical slot creation on the standby and the
> cascading standby node initialization after this test?
>

Yeah, that is also possible. But, I have a few questions regarding
that: (a) There doesn't seem to be a physical slot on cascading
standby, if I am missing something, can you please point me to the
relevant part of the test? (b) Which test is currently dependent on
the physical slot on standby?

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/8/23 4:42 AM, Amit Kapila wrote:
> On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 5/6/23 3:28 PM, Amit Kapila wrote:
>>> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>>
>>> Next steps:
>>> =========
>>> 1. The first thing is we should verify this theory by adding some LOG
>>> in KeepLogSeg() to see if the _logSegNo is reduced due to the value
>>> returned by XLogGetReplicationSlotMinimumLSN().
>>
>> Yeah, will do that early next week.

It's done with the following changes:


https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514

With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560

Where we can see:

2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION:  UpdateMinRecoveryPoint, xlog.c:2500
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is
0/4000098,_logSegNo is 4
 
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7271
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  KeepLogSeg, xlog.c:7473
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: KeepLogSeg: segno changed to 3 due to
XLogGetReplicationSlotMinimumLSN()
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  KeepLogSeg, xlog.c:7483
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is
0/4000098,endptr is 0/4000148, _logSegNo is 3
 
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7284
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: BDT1 about to call RemoveOldXlogFiles in
CreateRestartPoint
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7313
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: attempting to remove WAL segments older than log file
000000000000000000000002

So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from
4 to 3 due to XLogGetReplicationSlotMinimumLSN()).

>> What about postponing the physical slot creation on the standby and the
>> cascading standby node initialization after this test?
>>
> 
> Yeah, that is also possible. But, I have a few questions regarding
> that: (a) There doesn't seem to be a physical slot on cascading
> standby, if I am missing something, can you please point me to the
> relevant part of the test?

That's right. There is a physical slot only on the primary and on the standby.

What I meant up-thread is to also postpone the cascading standby node initialization
after this test (once the physical slot on the standby is created).

Please find attached a proposal doing so.

> (b) Which test is currently dependent on
> the physical slot on standby?

Not a test but the cascading standby initialization with the "primary_slot_name" parameter.

Also, now I think that's better to have the physical slot on the standby + hsf set to on on the
cascading standby (coming from the standby backup).

Idea is to avoid any risk of logical slot invalidation on the cascading standby in the
standby promotion test.

That was not the case before the attached proposal though (hsf was off on the cascading standby).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/8/23 4:42 AM, Amit Kapila wrote:
> > On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >> On 5/6/23 3:28 PM, Amit Kapila wrote:
> >>> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
> >>> <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >>> Next steps:
> >>> =========
> >>> 1. The first thing is we should verify this theory by adding some LOG
> >>> in KeepLogSeg() to see if the _logSegNo is reduced due to the value
> >>> returned by XLogGetReplicationSlotMinimumLSN().
> >>
> >> Yeah, will do that early next week.
>
> It's done with the following changes:
>
>
https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514
>
> With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560
>
> Where we can see:
>
> 2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION:  UpdateMinRecoveryPoint, xlog.c:2500
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is
0/4000098,_logSegNo is 4 
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7271
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  KeepLogSeg, xlog.c:7473
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: KeepLogSeg: segno changed to 3 due to
XLogGetReplicationSlotMinimumLSN()
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  KeepLogSeg, xlog.c:7483
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is
0/4000098,endptr is 0/4000148, _logSegNo is 3 
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7284
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: BDT1 about to call RemoveOldXlogFiles in
CreateRestartPoint
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION:  CreateRestartPoint, xlog.c:7313
> 2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG:  00000: attempting to remove WAL segments older than log file
000000000000000000000002
>
> So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from
> 4 to 3 due to XLogGetReplicationSlotMinimumLSN()).
>
> >> What about postponing the physical slot creation on the standby and the
> >> cascading standby node initialization after this test?
> >>
> >
> > Yeah, that is also possible. But, I have a few questions regarding
> > that: (a) There doesn't seem to be a physical slot on cascading
> > standby, if I am missing something, can you please point me to the
> > relevant part of the test?
>
> That's right. There is a physical slot only on the primary and on the standby.
>
> What I meant up-thread is to also postpone the cascading standby node initialization
> after this test (once the physical slot on the standby is created).
>
> Please find attached a proposal doing so.
>
> > (b) Which test is currently dependent on
> > the physical slot on standby?
>
> Not a test but the cascading standby initialization with the "primary_slot_name" parameter.
>
> Also, now I think that's better to have the physical slot on the standby + hsf set to on on the
> cascading standby (coming from the standby backup).
>
> Idea is to avoid any risk of logical slot invalidation on the cascading standby in the
> standby promotion test.
>

Why not initialize the cascading standby node just before the standby
promotion test: "Test standby promotion and logical decoding behavior
after the standby gets promoted."? That way we will avoid any unknown
side-effects of cascading standby and it will anyway look more logical
to initialize it where the test needs it.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/9/23 8:02 AM, Amit Kapila wrote:
> On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:

> 
> Why not initialize the cascading standby node just before the standby
> promotion test: "Test standby promotion and logical decoding behavior
> after the standby gets promoted."? That way we will avoid any unknown
> side-effects of cascading standby and it will anyway look more logical
> to initialize it where the test needs it.
> 

Yeah, that's even better. Moved the physical slot creation on the standby
and the cascading standby initialization where "strictly" needed in V2
attached.

Also ensuring that hsf is set to on on the cascading standby to be on the
safe side of thing.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Add two missing tests in 035_standby_logical_decoding.pl

From
Amit Kapila
Date:
On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 5/9/23 8:02 AM, Amit Kapila wrote:
> > On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
>
> >
> > Why not initialize the cascading standby node just before the standby
> > promotion test: "Test standby promotion and logical decoding behavior
> > after the standby gets promoted."? That way we will avoid any unknown
> > side-effects of cascading standby and it will anyway look more logical
> > to initialize it where the test needs it.
> >
>
> Yeah, that's even better. Moved the physical slot creation on the standby
> and the cascading standby initialization where "strictly" needed in V2
> attached.
>
> Also ensuring that hsf is set to on on the cascading standby to be on the
> safe side of thing.
>

Pushed this yesterday.

--
With Regards,
Amit Kapila.



Re: Add two missing tests in 035_standby_logical_decoding.pl

From
"Drouvot, Bertrand"
Date:

On 5/10/23 12:41 PM, Amit Kapila wrote:
> On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand

> 
> Pushed this yesterday.
> 

Thanks!

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com