Thread: A failure in t/038_save_logical_slots_shutdown.pl
Hi,

I've been observing a failure in t/038_save_logical_slots_shutdown.pl of late
on my developer system:

t/038_save_logical_slots_shutdown.pl .. 1/?
# Failed test 'Check that the slot's confirmed_flush LSN is the same
as the latest_checkpoint location'
# at t/038_save_logical_slots_shutdown.pl line 35.
# Looks like you failed 1 test of 2.
t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests

I did a quick analysis of the failure and of commit
https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
that introduced this test. I think the issue is that the slot's
confirmed_flush LSN (0/1508000) and the shutdown checkpoint LSN (0/1508018)
are not the same:

tmp_check/log/038_save_logical_slots_shutdown_pub.log:

2024-01-10 07:55:44.539 UTC [57621] sub LOG:  starting logical decoding for slot "sub"
2024-01-10 07:55:44.539 UTC [57621] sub DETAIL:  Streaming transactions committing after 0/1508000, reading WAL from 0/1507FC8.
2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT:  START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub"')

ubuntu:~/postgres$ pg17/bin/pg_controldata -D src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
Database cluster state:               in production
pg_control last modified:             Wed Jan 10 07:55:44 2024
Latest checkpoint location:           0/1508018
Latest checkpoint's REDO location:    0/1508018

But the test added by t/038_save_logical_slots_shutdown.pl expects both LSNs
to be the same:

sub compare_confirmed_flush
{
	# Is it same as the value read from log?
	ok( $latest_checkpoint eq $confirmed_flush_from_log,
		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
	);

I suspect that it's not quite right to expect the slot's confirmed_flush LSN
and the latest checkpoint location to be the same in this test. This is
because the shutdown checkpoint gets an LSN that's greater than the slot's
confirmed_flush LSN - see the shutdown checkpoint record getting inserted
into WAL after the slot is marked dirty in CheckPointReplicationSlots().

With this analysis in mind, I think the test needs to do something like the
following:

diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
index 5a4f5dc1d4..d49e6014fc 100644
--- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -32,7 +32,7 @@ sub compare_confirmed_flush
 		unless defined($latest_checkpoint);
 
 	# Is it same as the value read from log?
-	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+	ok( $latest_checkpoint ge $confirmed_flush_from_log,
 		"Check that the slot's confirmed_flush LSN is the same
 as the latest_checkpoint location"
 	);

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> # Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> # at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/2 subtests
>
> [...]
>
> I suspect that it's quite not right to expect the slot's
> confirmed_flush and latest checkpoint location to be same in the test.
> This is because the shutdown checkpoint gets an LSN that's greater
> than the slot's confirmed_flush LSN - see the shutdown checkpoint
> record getting inserted into WAL after the slot is marked dirty in
> CheckPointReplicationSlots().
>
> [...]
>
> Thoughts?

I got the log files from Bharath offline; thanks, Bharath, for sharing them.
The WAL record sequence is exactly the same in the failing and passing runs.
One observation: in the passing case the confirmed_flush LSN points exactly
to the shutdown checkpoint record, but in the failing case the LSN it points
to is not a valid record - pg_waldump reports that the first record is after
that address, skips over 24 bytes, and only then sees a valid record.

Passing case, confirmed_flush LSN (0/150D158), from my machine:

pg_waldump 000000010000000000000001 -s 0/150D158
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158; tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

Failing case, confirmed_flush LSN (0/1508000), from the failing test's log file:

pg_waldump 000000010000000000000001 -s 0/1508000
pg_waldump: first record is after 0/1508000, at 0/1508018, skipping over 24 bytes
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018; tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

I'm still not sure why, in this case, the confirmed_flush LSN does not point
exactly to a valid WAL record and 24 bytes have to be skipped to reach the
valid CHECKPOINT_SHUTDOWN record. I will investigate this further and share
the analysis.

Regards,
Vignesh
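[Editor's note] The 24-byte skip reported above is consistent with the confirmed_flush LSN falling exactly on a WAL page boundary: a WAL page that is not the start of a segment begins with a 24-byte short page header, so the first record on such a page starts 24 bytes past the boundary. A quick arithmetic check of the LSNs quoted in this message (an illustrative sketch, assuming the default 8192-byte WAL block size, not code from the thread):

```python
# Sketch: the failing confirmed_flush LSN (0/1508000) sits exactly on a WAL
# page boundary, which matches pg_waldump's 24-byte skip (the size of the
# short XLOG page header). Assumes the default wal_blcksz of 8192 bytes.
XLOG_BLCKSZ = 8192          # default WAL page size
SHORT_PAGE_HEADER = 24      # short page header size, MAXALIGN'd

def parse_lsn(lsn):
    """Parse an 'X/Y' LSN string into a 64-bit byte position."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

confirmed_flush = parse_lsn('0/1508000')   # failing case
first_record    = parse_lsn('0/1508018')   # where pg_waldump found a record

# The failing confirmed_flush LSN is exactly page-aligned ...
assert confirmed_flush % XLOG_BLCKSZ == 0
# ... so the first record on that page begins right after the page header.
assert confirmed_flush + SHORT_PAGE_HEADER == first_record

# The passing case's LSN (0/150D158) is not page-aligned, so it can point
# directly at the CHECKPOINT_SHUTDOWN record.
assert parse_lsn('0/150D158') % XLOG_BLCKSZ != 0
print("page-boundary arithmetic matches pg_waldump's 24-byte skip")
```

This matches the root cause identified later in the thread: the record after confirmed_flush started on a new WAL page, with the page header in between.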
On Wed, Jan 10, 2024 at 6:37 PM vignesh C <vignesh21@gmail.com> wrote:
>
> I got the log files from Bharath offline. Thanks Bharath for sharing
> the log files offline.
> The WAL record sequence is exactly the same in the failing test and
> tests which are passing.
> One observation in our case the confirmed flush lsn points exactly to
> shutdown checkpoint, but in the failing test the lsn pointed is
> invalid, pg_waldump says that address is invalid and skips about 24
> bytes and then sees a valid record
>
> [...]
>
> I'm still not sure why in this case, it is not exactly pointing to a
> valid WAL record, it has to skip 24 bytes to find the valid checkpoint
> shutdown record.

Can we see the previous record (as pointed to by "prev" in the WAL record) in
both cases? It would also help to see a few prior records in both cases.

--
With Regards,
Amit Kapila.
On Wed, Jan 10, 2024 at 2:08 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> # Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> # at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/2 subtests
>
> [...]
>
> I suspect that it's quite not right to expect the slot's
> confirmed_flush and latest checkpoint location to be same in the test.

As per my understanding, the reason we expect them to be the same is that we
ensure that during shutdown the walsender sends all the WAL just before the
shutdown_checkpoint, and the confirmed_flush also points to the end of the
last WAL record before the shutdown_checkpoint. So, the next starting
location should be that of the shutdown_checkpoint record, which should
ideally be the same. Do you see this failure consistently?

--
With Regards,
Amit Kapila.
On Wed, 10 Jan 2024 at 18:37, vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> > of late on my developer system:
> >
> > t/038_save_logical_slots_shutdown.pl .. 1/?
> > # Failed test 'Check that the slot's confirmed_flush LSN is the same
> > as the latest_checkpoint location'
> > # at t/038_save_logical_slots_shutdown.pl line 35.
> > # Looks like you failed 1 test of 2.
> > t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> > Failed 1/2 subtests
> >
> > [...]
> >
> > With this analysis in mind, I think the tests need to do something
> > like the following:
> >
> > diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > index 5a4f5dc1d4..d49e6014fc 100644
> > --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > @@ -32,7 +32,7 @@ sub compare_confirmed_flush
> >                 unless defined($latest_checkpoint);
> >
> >         # Is it same as the value read from log?
> > -       ok( $latest_checkpoint eq $confirmed_flush_from_log,
> > +       ok( $latest_checkpoint ge $confirmed_flush_from_log,
> >                 "Check that the slot's confirmed_flush LSN is the same
> > as the latest_checkpoint location"
> >         );
> >
> > Thoughts?
>
> I got the log files from Bharath offline. Thanks Bharath for sharing
> the log files offline.
> The WAL record sequence is exactly the same in the failing test and
> tests which are passing.
> One observation in our case the confirmed flush lsn points exactly to
> shutdown checkpoint, but in the failing test the lsn pointed is
> invalid, pg_waldump says that address is invalid and skips about 24
> bytes and then sees a valid record
>
> [...]
>
> I'm still not sure why in this case, it is not exactly pointing to a
> valid WAL record, it has to skip 24 bytes to find the valid checkpoint
> shutdown record.
> I will investigate this further and share the analysis.

On further analysis, it was found that in the failing test the
CHECKPOINT_SHUTDOWN record started on a new WAL page, so there was a WAL page
header present just before the CHECKPOINT_SHUTDOWN record, and this was
causing the failure. We could alternatively reproduce the issue by switching
the WAL file before restarting the server, as in the attached test change
patch.

There are a couple of ways to fix this issue: a) by switching the WAL before
the insertion of records, so that the CHECKPOINT_SHUTDOWN record does not get
inserted on a new page, as in the attached test_fix.patch; b) by using
pg_walinspect to check that the next WAL record is CHECKPOINT_SHUTDOWN. I
have yet to try this approach.

Thanks to Bharath and Kuroda-san for the offline discussions and for helping
to get to the root cause.

Thoughts?

Regards,
Vignesh
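[Editor's note] Both proposed fixes work around the same physical detail: when the confirmed_flush LSN lands exactly on a page (or segment) boundary, the first record after it begins past a page header, so a raw comparison with the checkpoint LSN fails. As a rough illustration of the tolerance involved (a hypothetical helper, not code from any posted patch; constants assume the default 8 kB WAL page and 16 MB WAL segment sizes):

```python
# Hypothetical helper (illustration only): normalize an LSN that falls on a
# WAL page boundary by skipping the page header that physically precedes the
# first record on that page. Assumes the default wal_blcksz of 8192 bytes
# and wal_segment_size of 16 MB.
XLOG_BLCKSZ = 8192
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SHORT_PAGE_HEADER = 24   # short page header, used mid-segment
LONG_PAGE_HEADER = 40    # long page header, used on a segment's first page

def parse_lsn(lsn):
    """Parse an 'X/Y' LSN string into a 64-bit byte position."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def fmt_lsn(pos):
    """Format a 64-bit byte position back into 'X/Y' form."""
    return '%X/%X' % (pos >> 32, pos & 0xFFFFFFFF)

def skip_page_header(lsn):
    """If lsn sits exactly on a page boundary, advance past the page header
    to where the first record on that page actually begins."""
    pos = parse_lsn(lsn)
    if pos % WAL_SEGMENT_SIZE == 0:
        pos += LONG_PAGE_HEADER       # segment start: long header
    elif pos % XLOG_BLCKSZ == 0:
        pos += SHORT_PAGE_HEADER      # mid-segment page start: short header
    return fmt_lsn(pos)

print(skip_page_header('0/1508000'))   # page boundary -> 0/1508018
print(skip_page_header('0/150D158'))   # mid-page      -> 0/150D158, unchanged
```

Applied to the failing case above, this maps 0/1508000 to 0/1508018, exactly where pg_waldump found the CHECKPOINT_SHUTDOWN record; the fixes discussed in the thread instead arrange for the boundary case not to arise (a) or check the record type directly (b).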
Attachment
On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On further analysis, it was found that in the failing test,
> CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL
> page header present just before the CHECKPOINT_SHUTDOWN which was
> causing the failure. We could alternatively reproduce the issue by
> switching the WAL file before restarting the server like in the
> attached test change patch.
> There are a couple of ways to fix this issue a) one by switching the
> WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN
> does not get inserted in a new page as in the attached test_fix.patch
> b) by using pg_walinspect to check that the next WAL record is
> CHECKPOINT_SHUTDOWN. I have to try this approach.
>
> Thanks to Bharath and Kuroda-san for offline discussions and helping
> in getting to the root cause.

IIUC, the problem commit e0b2eed tries to solve is to ensure there are no
left-over decodable WAL records between the confirmed_flush LSN and the
shutdown checkpoint, which is what is expected from
t/038_save_logical_slots_shutdown.pl. How about having a PG function that
returns true if there are any decodable WAL records between a given start_lsn
and end_lsn? Using this new function will make the test more concrete and
stable. The function doesn't have to be something entirely new; we can just
turn binary_upgrade_logical_slot_has_caught_up into a general PG function not
restricted to binary-upgrade mode; this idea has come up before:
https://www.postgresql.org/message-id/CAA4eK1KZXaBgVOAdV8ZfG6AdDbKYFVz7teDa7GORgQ3RVYS93g%40mail.gmail.com.
If that's okay, I can offer to write a patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > [...]
>
> IIUC, the problem the commit e0b2eed tries to solve is to ensure there
> are no left-over decodable WAL records between confirmed_flush LSN and
> a shutdown checkpoint, which is what it is expected from the
> t/038_save_logical_slots_shutdown.pl. How about we have a PG function
> returning true if there are any decodable WAL records between the
> given start_lsn and end_lsn?

But we already test this in 003_logical_slots during a successful upgrade.
Having an explicit test to do the same thing has some merit, but I am not
sure if it is worth it. The current test tries to ensure that during
shutdown, after we shut down the walsender and ensure that it has sent all
the WAL records and received an ack for the same, there is no other WAL
except the shutdown_checkpoint. Vignesh's suggestion (a) makes the test
robust enough that it shouldn't show spurious failures like the current one
reported by you, so shall we try to proceed with that?

--
With Regards,
Amit Kapila.
On Fri, Jan 12, 2024 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > [...]
> >
> > IIUC, the problem the commit e0b2eed tries to solve is to ensure there
> > are no left-over decodable WAL records between confirmed_flush LSN and
> > a shutdown checkpoint, which is what it is expected from the
> > t/038_save_logical_slots_shutdown.pl. How about we have a PG function
> > returning true if there are any decodable WAL records between the
> > given start_lsn and end_lsn?
>
> But, we already test this in 003_logical_slots during a successful
> upgrade. Having an explicit test to do the same thing has some merits
> but not sure if it is worth it.

If the code added by commit e0b2eed is covered by the new upgrade test, why
not remove 038_save_logical_slots_shutdown.pl altogether?

> The current test tries to ensure that
> during shutdown after we shutdown walsender and ensures that it sends
> all the wal records and receives an ack for the same, there is no
> other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes
> the test robust enough that it shouldn't show spurious failures like
> the current one reported by you, so shall we try to proceed with that?

Do you mean something like [1]? It ensures the test passes unless any writes
are added (in future) before the publisher restarts in the test, which can
again make the test flaky. How do we ensure no one adds anything before the
publisher restart in 038_save_logical_slots_shutdown.pl? A note before the
restart, perhaps? I might be okay with a simple solution like [1] with a note
before the restart, instead of other more complicated ones.

[1]
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
index 5a4f5dc1d4..493fdbce2f 100644
--- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -60,6 +60,14 @@ $node_subscriber->start;
 $node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
 $node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
 
+# On some machines, it was detected that the shutdown checkpoint WAL record
+# that gets generated as part of the publisher restart below falls exactly in
+# the new page in the WAL file. Due to this, the latest checkpoint location and
+# confirmed flush check in compare_confirmed_flush() was failing. Hence, we
+# advance WAL by 1 segment before generating some data so that the shutdown
+# checkpoint doesn't fall exactly in the new WAL file page.
+$node_publisher->advance_wal(1);
+
 # Insert some data
 $node_publisher->safe_psql('postgres',
 	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Jan 12, 2024 at 3:36 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> [...]
>
> If the code added by commit e0b2eed is covered by the new upgrade
> test, why not remove 038_save_logical_slots_shutdown.pl altogether?

This is a stricter check, because it is possible that, even if the latest
confirmed_flush location is not persisted, there is no meaningful decodable
WAL between whatever confirmed_flush location was last saved on disk and the
shutdown_checkpoint record.

Kuroda-San/Vignesh, do you have any suggestion on this one?

> > The current test tries to ensure that
> > during shutdown after we shutdown walsender and ensures that it sends
> > all the wal records and receives an ack for the same, there is no
> > other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes
> > the test robust enough that it shouldn't show spurious failures like
> > the current one reported by you, so shall we try to proceed with that?
>
> Do you mean something like [1]? It ensures the test passes unless any
> writes are added (in future) before the publisher restarts in the test
> which can again make the tests flaky. How do we ensure no one adds
> anything in before the publisher restart
> 038_save_logical_slots_shutdown.pl? A note before the restart perhaps?

I am fine with adding the note.

--
With Regards,
Amit Kapila.
On Sat, Jan 13, 2024 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Do you mean something like [1]? It ensures the test passes unless any
> > writes are added (in future) before the publisher restarts in the test
> > which can again make the tests flaky. How do we ensure no one adds
> > anything in before the publisher restart
> > 038_save_logical_slots_shutdown.pl? A note before the restart perhaps?
>
> I am fine with adding the note.

Okay. Please see the attached patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
Dear Amit, Bharath,

> This is a more strict check because it is possible that even if the
> latest confirmed_flush location is not persisted there is no
> meaningful decodable WAL between whatever the last confirmed_flush
> location saved on disk and the shutdown_checkpoint record.
> Kuroda-San/Vignesh, do you have any suggestion on this one?

I think it should be tested explicitly as a test case. There are two reasons:

* e0b2eed is a commit to backend code, so it should be tested by src/test/*
files. Each src/bin/XXX/*.pl file should test only its own executable.
* Suppose the feature were broken. In that case 003_logical_slots.pl would
fail, but we would not have a way to recognize the cause on the buildfarm;
038_save_logical_slots_shutdown.pl helps to distinguish the case.

Based on that, I think it is OK to add advance_wal() and comments, as in
Bharath's patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Jan 16, 2024 at 12:13 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit, Bharath,
>
> > This is a more strict check because it is possible that even if the
> > latest confirmed_flush location is not persisted there is no
> > meaningful decodable WAL between whatever the last confirmed_flush
> > location saved on disk and the shutdown_checkpoint record.
> > Kuroda-San/Vignesh, do you have any suggestion on this one?
>
> I think it should be tested explicitly as a test case. There are two reasons:
>
> * e0b2eed is a commit to backend code, so it should be tested by src/test/*
> files. Each src/bin/XXX/*.pl file should test only its own executable.
> * Suppose the feature were broken. In that case 003_logical_slots.pl
> would fail, but we would not have a way to recognize the cause on the
> buildfarm; 038_save_logical_slots_shutdown.pl helps to distinguish the case.

+1 to keep 038_save_logical_slots_shutdown.pl as-is.

> Based on that, I think it is OK to add advance_wal() and comments, as in
> Bharath's patch.

Thanks. I'll wait a while and then add it to CF so as not to lose it in the
wild.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 25, 2024 at 4:27 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Thanks. I'll wait a while and then add it to CF to not lose it in the wild.

Feel free to add it to CF. However, I do plan to look at it in the next few
days.

--
With Regards,
Amit Kapila.
On Thu, Jan 25, 2024 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 25, 2024 at 4:27 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Thanks. I'll wait a while and then add it to CF to not lose it in the wild.
>
> Feel free to add it to CF. However, I do plan to look at it in the
> next few days.

Thanks. The CF entry is here: https://commitfest.postgresql.org/47/4796/.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 25, 2024 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 25, 2024 at 4:27 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Thanks. I'll wait a while and then add it to CF to not lose it in the wild.
>
> Feel free to add it to CF. However, I do plan to look at it in the
> next few days.

The patch looks mostly good to me. I have changed the comments and commit
message in the attached. I am planning to push this tomorrow unless you or
others have any comments on it.

--
With Regards,
Amit Kapila.
Attachment
On Tue, Mar 12, 2024 at 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The patch looks mostly good to me. I have changed the comments and
> commit message in the attached. I am planning to push this tomorrow
> unless you or others have any comments on it.

LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 12, 2024 at 8:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The patch looks mostly good to me. I have changed the comments and
> > commit message in the attached. I am planning to push this tomorrow
> > unless you or others have any comments on it.
>
> LGTM.

Pushed.

--
With Regards,
Amit Kapila.