Thread: persist logical slots to disk during shutdown checkpoint

persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
It's entirely possible for a logical slot to have a confirmed_flush
LSN higher than the last value saved on disk while not being marked as
dirty.  It's currently not a problem to lose that value during a clean
shutdown / restart cycle but to support the upgrade of logical slots
[1] (see latest patch at [2]), we seem to rely on that value being
properly persisted to disk. During the upgrade, we need to verify that
all the data prior to the shutdown_checkpoint for the logical slots has
been consumed; otherwise, the downstream may miss some data. Now, to
ensure the same, we are planning to compare the confirm_flush LSN
location with the latest shutdown_checkpoint location, which means that
the confirm_flush LSN must be up to date after the restart.
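
To make the planned comparison concrete, here is a self-contained toy
sketch (illustrative only, not pg_upgrade's actual code) of checking a
slot's confirmed_flush LSN, as printed by pg_replication_slots, against
the "Latest checkpoint location" reported by pg_controldata:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse an LSN of the form "X/Y" (two hex halves) into a 64-bit value. */
static uint64_t
parse_lsn(const char *s)
{
	unsigned int hi = 0;
	unsigned int lo = 0;

	if (sscanf(s, "%X/%X", &hi, &lo) != 2)
		return 0;
	return ((uint64_t) hi << 32) | lo;
}

int
main(void)
{
	/* Hypothetical inputs for illustration. */
	const char *confirmed_flush = "0/1532198";
	const char *shutdown_checkpoint = "0/1532198";

	/* After a clean shutdown the two are expected to match; a slot that
	 * is behind has not consumed all WAL written before the shutdown. */
	bool		caught_up =
		parse_lsn(confirmed_flush) >= parse_lsn(shutdown_checkpoint);

	printf("slot caught up: %s\n", caught_up ? "yes" : "no");
	return 0;
}
```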

I think this is inefficient even without an upgrade because, after the
restart, this may lead to decoding some data again. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.
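
A toy illustration of that cost (the values are hypothetical): decoding
restarts from the on-disk confirmed_flush, so any advance that was never
flushed is redone after the restart:

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* In-memory value advanced by keepalives vs. the stale on-disk copy. */
	uint64_t	confirmed_flush_in_memory = 0x156AFB0;
	uint64_t	confirmed_flush_on_disk = 0x156AD10;

	/* After a clean shutdown that does not flush the slot, decoding
	 * restarts from the on-disk value, so this much WAL is re-read: */
	printf("WAL re-decoded after restart: %llu bytes\n",
		   (unsigned long long) (confirmed_flush_in_memory -
								 confirmed_flush_on_disk));
	return 0;
}
```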

The idea discussed in the thread [1] is to always persist logical
slots to disk during the shutdown checkpoint. I have extracted the
patch to achieve the same from that thread and attached it here. This
could lead to some overhead during shutdown (checkpoint) if there are
many slots, but it is probably one-time work.

I couldn't think of better ideas, but another possibility is to mark
the slot as dirty when we update the confirm_flush LSN (see
LogicalConfirmReceivedLocation()). However, that would be a bigger
overhead in a running server as it could be a frequent operation and
could lead to more writes.
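
For reference, the rejected alternative would amount to a change along
these lines inside LogicalConfirmReceivedLocation() (a sketch against the
existing function; ReplicationSlotMarkDirty() is the existing routine
that flags a slot for the next save):

```c
SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);

/* Hypothetical addition: every advance of confirmed_flush (including ones
 * driven only by keepalives) would now schedule a disk write, hence the
 * extra write overhead on a running server. */
ReplicationSlotMarkDirty();
```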

Thoughts?

[1]  -
https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] -
https://www.postgresql.org/message-id/TYAPR01MB5866562EF047F2C9DDD1F9DEF51BA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Julien Rouhaud
Date:
On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
It's entirely possible for a logical slot to have a confirmed_flush
LSN higher than the last value saved on disk while not being marked as
dirty.  It's currently not a problem to lose that value during a clean
shutdown / restart cycle but to support the upgrade of logical slots
[1] (see latest patch at [2]), we seem to rely on that value being
properly persisted to disk. During the upgrade, we need to verify that
all the data prior to the shutdown_checkpoint for the logical slots has
been consumed; otherwise, the downstream may miss some data. Now, to
ensure the same, we are planning to compare the confirm_flush LSN
location with the latest shutdown_checkpoint location, which means that
the confirm_flush LSN must be up to date after the restart.

I think this is inefficient even without an upgrade because, after the
restart, this may lead to decoding some data again. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.

In most cases there shouldn't be a lot of records to decode after restart, but I agree it's better to avoid decoding those again. 

The idea discussed in the thread [1] is to always persist logical
slots to disk during the shutdown checkpoint. I have extracted the
patch to achieve the same from that thread and attached it here. This
could lead to some overhead during shutdown (checkpoint) if there are
many slots but it is probably a one-time work.

I couldn't think of better ideas but another possibility is to mark
the slot as dirty when we update the confirm_flush LSN (see
LogicalConfirmReceivedLocation()). However, that would be a bigger
overhead in the running server as it could be a frequent operation and
could lead to more writes.

Yeah, I didn't find any better option either at that time. I still think that forcing persistence on shutdown is the best compromise. If we tried to always mark the slot as dirty, we would be sure to add regular overhead, but we would probably end up persisting the slot on disk at shutdown anyway most of the time, so I don't think it would be a good compromise.

My biggest concern was that some switchover scenarios might be a bit slower in some cases, but if that really is a problem, it's hard to imagine a workload where the slots wouldn't have to be persisted anyway due to continuous activity needing to be sent just before the shutdown.

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
>>
>
>> The idea discussed in the thread [1] is to always persist logical
>> slots to disk during the shutdown checkpoint. I have extracted the
>> patch to achieve the same from that thread and attached it here. This
>> could lead to some overhead during shutdown (checkpoint) if there are
>> many slots but it is probably a one-time work.
>>
>> I couldn't think of better ideas but another possibility is to mark
>> the slot as dirty when we update the confirm_flush LSN (see
>> LogicalConfirmReceivedLocation()). However, that would be a bigger
>> overhead in the running server as it could be a frequent operation and
>> could lead to more writes.
>
>
> Yeah I didn't find any better option either at that time. I still think that forcing persistence on shutdown is the
> best compromise. If we tried to always mark the slot as dirty, we would be sure to add regular overhead but we would
> probably end up persisting the slot on disk on shutdown anyway most of the time, so I don't think it would be a good
> compromise.
>

The other possibility is that we introduce yet another dirty flag for
slots, say dirty_for_shutdown_checkpoint, which will be set when we
update the confirmed_flush LSN. The flag will be cleared each time we
persist the slot, but we won't persist if only this flag is set. We can
then use it during the shutdown checkpoint to decide whether to
persist the slot.
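
A sketch of that idea (the flag and its handling are hypothetical,
following the description above):

```c
/* Hypothetical extra member in ReplicationSlot */
bool		dirty_for_shutdown_checkpoint;

/* Set whenever confirmed_flush is updated: */
slot->dirty_for_shutdown_checkpoint = true;

/* In the save path: the flag is cleared on every save, but on its own it
 * triggers a save only during the shutdown checkpoint. */
if (slot->dirty || (is_shutdown && slot->dirty_for_shutdown_checkpoint))
{
	/* write the slot out, then clear both flags */
}
```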

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
> >>
> >
> >> The idea discussed in the thread [1] is to always persist logical
> >> slots to disk during the shutdown checkpoint. I have extracted the
> >> patch to achieve the same from that thread and attached it here. This
> >> could lead to some overhead during shutdown (checkpoint) if there are
> >> many slots but it is probably a one-time work.
> >>
> >> I couldn't think of better ideas but another possibility is to mark
> >> the slot as dirty when we update the confirm_flush LSN (see
> >> LogicalConfirmReceivedLocation()). However, that would be a bigger
> >> overhead in the running server as it could be a frequent operation and
> >> could lead to more writes.
> >
> >
> > Yeah I didn't find any better option either at that time. I still think that forcing persistence on shutdown is the
> > best compromise. If we tried to always mark the slot as dirty, we would be sure to add regular overhead but we would
> > probably end up persisting the slot on disk on shutdown anyway most of the time, so I don't think it would be a good
> > compromise.
> >
>
> The other possibility is that we introduce yet another dirty flag for
> slots, say dirty_for_shutdown_checkpoint which will be set when we
> update confirmed_flush LSN. The flag will be cleared each time we
> persist the slot but we won't persist if only this flag is set. We can
> then use it during the shutdown checkpoint to decide whether to
> persist the slot.

There are already two booleans controlling the dirtiness of a slot, dirty
and just_dirtied. Adding a third will create more confusion.

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways: 1. to mark a
slot dirty and persist it if the confirm_flush_lsn when the slot was last
persisted is too far behind the current confirm_flush_lsn of the slot;
2. at shutdown checkpoint, persist all the slots whose two
confirm_flush_lsns differ.
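
A self-contained toy model of way (2) (all names illustrative; this only
demonstrates the decision, not PostgreSQL's slot machinery):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Illustrative stand-in for a replication slot, with the proposed extra
 * field recording the confirm_flush LSN at the time of the last save. */
typedef struct ToySlot
{
	bool		is_logical;
	bool		dirty;
	XLogRecPtr	confirmed_flush;			/* in-memory value */
	XLogRecPtr	persisted_confirmed_flush;	/* value in the on-disk copy */
} ToySlot;

/* Decision for way (2): at the shutdown checkpoint, also save any logical
 * slot whose confirmed_flush differs from the last persisted value. */
static bool
needs_save_at_shutdown(const ToySlot *slot)
{
	return slot->dirty ||
		(slot->is_logical &&
		 slot->confirmed_flush != slot->persisted_confirmed_flush);
}

int
main(void)
{
	ToySlot		slot = {true, false, 0x1532198, 0x1531E60};

	printf("save at shutdown: %s\n",
		   needs_save_at_shutdown(&slot) ? "yes" : "no");
	return 0;
}
```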

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Mon, Aug 21, 2023 at 6:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The other possibility is that we introduce yet another dirty flag for
> > slots, say dirty_for_shutdown_checkpoint which will be set when we
> > update confirmed_flush LSN. The flag will be cleared each time we
> > persist the slot but we won't persist if only this flag is set. We can
> > then use it during the shutdown checkpoint to decide whether to
> > persist the slot.
>
> There are already two booleans controlling dirty-ness of slot, dirty
> and just_dirty. Adding third will created more confusion.
>
> Another idea is to record the confirm_flush_lsn at the time of
> persisting the slot. We can use it in two different ways 1. to mark a
> slot dirty and persist if the last confirm_flush_lsn when slot was
> persisted was too far from the current confirm_flush_lsn of the slot.
> 2. at shutdown checkpoint, persist all the slots which have these two
> confirm_flush_lsns different.
>

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag, because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint would require us to
update it each time we update confirm_flush_lsn under a spinlock at
multiple places. But I don't see the need to do what you proposed
in (1) as the use case for it is very minor; basically, it may
sometimes help us to avoid decoding after crash recovery.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Another idea is to record the confirm_flush_lsn at the time of
> > persisting the slot. We can use it in two different ways 1. to mark a
> > slot dirty and persist if the last confirm_flush_lsn when slot was
> > persisted was too far from the current confirm_flush_lsn of the slot.
> > 2. at shutdown checkpoint, persist all the slots which have these two
> > confirm_flush_lsns different.
> >
>
> I think using it in the second (2) way sounds advantageous as compared
> to storing another dirty flag because this requires us to update
> last_persisted_confirm_flush_lsn only while writing the slot info.
> OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
> update it each time we update confirm_flush_lsn under spinlock at
> multiple places. But, I don't see the need of doing what you proposed
> in (1) as the use case for it is very minor, basically this may
> sometimes help us to avoid decoding after crash recovery.

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With that, we take the opportunity to
persist a confirmed_flush_lsn which is much farther ahead of the currently
persisted value, thus improving the chances of updating restart_lsn and
catalog_xmin faster after a WAL sender restart. We need to keep that
in mind when implementing (2). The problem is that if we don't implement
(1) right now, we might just forget to make that small incremental
change in the future. My preference is: 1. do both (1) and (2) together;
2. do (2) first and then (1) as a separate commit; 3. just implement (2)
if we don't have time at all for the first two options.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Another idea is to record the confirm_flush_lsn at the time of
> > > persisting the slot. We can use it in two different ways 1. to mark a
> > > slot dirty and persist if the last confirm_flush_lsn when slot was
> > > persisted was too far from the current confirm_flush_lsn of the slot.
> > > 2. at shutdown checkpoint, persist all the slots which have these two
> > > confirm_flush_lsns different.
> > >
> >
> > I think using it in the second (2) way sounds advantageous as compared
> > to storing another dirty flag because this requires us to update
> > last_persisted_confirm_flush_lsn only while writing the slot info.
> > OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
> > update it each time we update confirm_flush_lsn under spinlock at
> > multiple places. But, I don't see the need of doing what you proposed
> > in (1) as the use case for it is very minor, basically this may
> > sometimes help us to avoid decoding after crash recovery.
>
> Once we have last_persisted_confirm_flush_lsn, (1) is just an
> optimization on top of that. With that we take the opportunity to
> persist confirmed_flush_lsn which is much farther than the current
> persisted value and thus improving chances of updating restart_lsn and
> catalog_xmin faster after a WAL sender restart. We need to keep that
> in mind when implementing (2). The problem is if we don't implement
> (1) right now, we might just forget to do that small incremental
> change in future. My preference is 1. Do both (1) and (2) together 2.
> Do (2) first and then (1) as a separate commit. 3. Just implement (2)
> if we don't have time at all for first two options.
>

I prefer one of (2) or (3). Anyway, it is better to do that
optimization (persist confirm_flush_lsn at a regular interval) as a
separate patch as we need to test and prove its value separately.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Tue, Aug 22, 2023 at 3:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Once we have last_persisted_confirm_flush_lsn, (1) is just an
> > optimization on top of that. With that we take the opportunity to
> > persist confirmed_flush_lsn which is much farther than the current
> > persisted value and thus improving chances of updating restart_lsn and
> > catalog_xmin faster after a WAL sender restart. We need to keep that
> > in mind when implementing (2). The problem is if we don't implement
> > (1) right now, we might just forget to do that small incremental
> > change in future. My preference is 1. Do both (1) and (2) together 2.
> > Do (2) first and then (1) as a separate commit. 3. Just implement (2)
> > if we don't have time at all for first two options.
> >
>
> I prefer one of (2) or (3). Anyway, it is better to do that
> optimization (persist confirm_flush_lsn at a regular interval) as a
> separate patch as we need to test and prove its value separately.

Fine with me.


--
Best Wishes,
Ashutosh Bapat



RE: persist logical slots to disk during shutdown checkpoint

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

Thanks for forking the thread! I think we may end up choosing another design, but I wanted
to post the updated version once with the current approach. All comments came
from the parent thread [1].

> 1. GENERAL -- git apply
>
> The patch fails to apply cleanly. There are whitespace warnings.
>
> [postgres(at)CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch
> ../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch:102:
> trailing whitespace.
> # SHUTDOWN_CHECKPOINT record.
> warning: 1 line adds whitespace errors.

There was an extra blank, removed.

> 2. GENERAL -- which patch is the real one and which is the copy?
>
> IMO this patch has become muddled.
>
> Amit recently created a new thread [1] "persist logical slots to disk
> during shutdown checkpoint", which I thought was dedicated to the
> discussion/implementation of this 0001 patch. Therefore, I expected any
> 0001 patch changes to would be made only in that new thread from now on,
> (and maybe you would mirror them here in this thread).
>
> But now I see there are v23-0001 patch changes here again. So, now the same
> patch is in 2 places and they are different. It is no longer clear to me
> which 0001 ("Always persist...") patch is the definitive one, and which one
> is the copy.

The one attached in the other thread is just a copy to make cfbot happy; it can be
ignored.

> contrib/test_decoding/t/002_always_persist.pl
>
> 3.
> +
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
> +# Test logical replication slots are always persist to disk during a
> shutdown
> +# checkpoint.
> +
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
>
> /always persist/always persisted/

Fixed.

> 4.
> +
> +# Test set-up
> my $node = PostgreSQL::Test::Cluster->new('test');
> $node->init(allows_streaming => 'logical');
> $node->append_conf('postgresql.conf', q{
> autovacuum = off
> checkpoint_timeout = 1h
> });
>
> $node->start;
>
> # Create table
> $node->safe_psql('postgres', "CREATE TABLE test (id int)");
>
> Maybe it is better to call the table something different instead of the
> same name as the cluster. e.g. 'test_tbl' would be better.

Changed to 'test_tbl'.

> 5.
> +# Shutdown the node once to do shutdown checkpoint
> $node->stop();
>
> SUGGESTION
> # Stop the node to cause a shutdown checkpoint

Fixed.

> 6.
> +# Fetch checkPoint from the control file itself
> my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
> my @control_data = split("\n", $stdout);
> my $latest_checkpoint = undef;
> foreach (@control_data)
> {
>  if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
>  {
>  $latest_checkpoint = $1;
>  last;
>  }
> }
> die "No checkPoint in control file found\n"
>   unless defined($latest_checkpoint);
>
> 6a.
> /checkPoint/checkpoint/  (2x)
>
> 6b.
> +die "No checkPoint in control file found\n"
>
> SUGGESTION
> "No checkpoint found in control file\n"

Hmm, these notations followed the test recovery/t/016_min_consistency.pl,
which uses the word "minRecoveryPoint". So I preferred the current one.

[1]: https://www.postgresql.org/message-id/CAHut%2BPtb%3DZYTM_awoLy3sJ5m9Oxe%3DJYn6Gve5rSW9cUdThpsVA%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Tue, 22 Aug 2023 at 15:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > Another idea is to record the confirm_flush_lsn at the time of
> > > > persisting the slot. We can use it in two different ways 1. to mark a
> > > > slot dirty and persist if the last confirm_flush_lsn when slot was
> > > > persisted was too far from the current confirm_flush_lsn of the slot.
> > > > 2. at shutdown checkpoint, persist all the slots which have these two
> > > > confirm_flush_lsns different.
> > > >
> > >
> > > I think using it in the second (2) way sounds advantageous as compared
> > > to storing another dirty flag because this requires us to update
> > > last_persisted_confirm_flush_lsn only while writing the slot info.
> > > OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
> > > update it each time we update confirm_flush_lsn under spinlock at
> > > multiple places. But, I don't see the need of doing what you proposed
> > > in (1) as the use case for it is very minor, basically this may
> > > sometimes help us to avoid decoding after crash recovery.
> >
> > Once we have last_persisted_confirm_flush_lsn, (1) is just an
> > optimization on top of that. With that we take the opportunity to
> > persist confirmed_flush_lsn which is much farther than the current
> > persisted value and thus improving chances of updating restart_lsn and
> > catalog_xmin faster after a WAL sender restart. We need to keep that
> > in mind when implementing (2). The problem is if we don't implement
> > (1) right now, we might just forget to do that small incremental
> > change in future. My preference is 1. Do both (1) and (2) together 2.
> > Do (2) first and then (1) as a separate commit. 3. Just implement (2)
> > if we don't have time at all for first two options.
> >
>
> I prefer one of (2) or (3). Anyway, it is better to do that
> optimization (persist confirm_flush_lsn at a regular interval) as a
> separate patch as we need to test and prove its value separately.

Here is a patch to persist logical slots to disk during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.
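
Roughly, the core of the change (reconstructed from the review excerpts
later in this thread; not the exact diff) is a new field remembering the
confirmed_flush value last written to disk, plus an extended condition in
SaveSlotToPath():

```c
/* New field in ReplicationSlot (renamed last_saved_confirmed_flush in
 * later versions): the confirmed_flush value in the last on-disk copy. */
XLogRecPtr	last_persisted_confirmed_flush;

/* In SaveSlotToPath(), checked under the slot's spinlock: */
confirmed_flush_has_changed =
	(slot->data.confirmed_flush != slot->last_persisted_confirmed_flush);

/* Don't do anything if there's nothing to write, unless this is the
 * shutdown checkpoint and a logical slot's confirmed_flush advanced. */
if (!was_dirty &&
	!(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
	return;
```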

Regards,
Vignesh

Attachment

RE: persist logical slots to disk during shutdown checkpoint

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Vignesh,

> Here is a patch to persist to disk logical slots during a shutdown
> checkpoint if the updated confirmed_flush_lsn has not yet been
> persisted.

Thanks for making the patch with a different approach! Here are some comments.

01. RestoreSlotFromDisk

```
                slot->candidate_xmin_lsn = InvalidXLogRecPtr;
                slot->candidate_restart_lsn = InvalidXLogRecPtr;
                slot->candidate_restart_valid = InvalidXLogRecPtr;
+               slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
```

last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
to use cp.slotdata.confirmed_flush? If the server is shut down immediately
after a restore, your patch forces a redundant save.
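
That is, something like the following in RestoreSlotFromDisk() (a sketch
of the suggestion, using the patch's field name):

```c
/* Treat the restored on-disk value as the last-persisted one, so an
 * immediate shutdown does not force a redundant save. */
slot->last_persisted_confirmed_flush = cp.slotdata.confirmed_flush;
```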

02. t/002_always_persist.pl

The original author of the patch is me, but I found that the test could pass
without your patch. This is because pg_logical_slot_get_changes() ->
pg_logical_slot_get_changes_guts(confirm = true) always marks the slot as dirty.
IIUC we must use the logical replication system to verify the persistence.
The attached test passes only when the patch is applied.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> > Here is a patch to persist to disk logical slots during a shutdown
> > checkpoint if the updated confirmed_flush_lsn has not yet been
> > persisted.
>
> Thanks for making the patch with different approach! Here are comments.
>
> 01. RestoreSlotFromDisk
>
> ```
>                 slot->candidate_xmin_lsn = InvalidXLogRecPtr;
>                 slot->candidate_restart_lsn = InvalidXLogRecPtr;
>                 slot->candidate_restart_valid = InvalidXLogRecPtr;
> +               slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
> ```
>
> last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
> to use cp.slotdata. confirmed_flush? Assuming that the server is shut down immediately,
> your patch forces to save.
>
> 02. t/002_always_persist.pl
>
> The original author of the patch is me, but I found that the test could pass
> without your patch. This is because pg_logical_slot_get_changes()->
> pg_logical_slot_get_changes_guts(confirm = true)  always mark the slot as dirty.
> IIUC we must use the logical replication system to verify the persistence.
> Attached test can pass only when patch is applied.

Here are few other comments that I noticed:

1) I too noticed that the test passes both with and without the patch:
diff --git a/contrib/test_decoding/meson.build
b/contrib/test_decoding/meson.build
index 7b05cc25a3..12afb9ea8c 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -72,6 +72,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_repl_stats.pl',
+      't/002_always_persist.pl',

2) change checkPoint to checkpoint:
2.a) checkPoint should be checkpoint to maintain consistency across the file:
+# Shutdown the node once to do shutdown checkpoint
+$node->stop();
+
+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;

2.b) similarly here:
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

2.c) similarly here too:
+# Compare confirmed_flush_lsn and checkPoint
+ok($confirmed_flush eq $latest_checkpoint,
+       "Check confirmed_flush is same as latest checkpoint location");

3) change checkpoint to "Latest checkpoint location":
3.a) We should change "No checkPoint in control file found\n" to:
"Latest checkpoint location not found in control file\n" as there are
many checkpoint entries in control data

+foreach (@control_data)
+{
+       if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+       {
+               $latest_checkpoint = $1;
+               last;
+       }
+}
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

3.b) We should change "Fetch checkPoint from the control file itself" to:
"Fetch Latest checkpoint location from the control file"

+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;
+foreach (@control_data)
+{

Regards,
Vignesh



Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> > Here is a patch to persist to disk logical slots during a shutdown
> > checkpoint if the updated confirmed_flush_lsn has not yet been
> > persisted.
>
> Thanks for making the patch with different approach! Here are comments.
>
> 01. RestoreSlotFromDisk
>
> ```
>                 slot->candidate_xmin_lsn = InvalidXLogRecPtr;
>                 slot->candidate_restart_lsn = InvalidXLogRecPtr;
>                 slot->candidate_restart_valid = InvalidXLogRecPtr;
> +               slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
> ```
>
> last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
> to use cp.slotdata. confirmed_flush? Assuming that the server is shut down immediately,
> your patch forces to save.

Modified

> 02. t/002_always_persist.pl
>
> The original author of the patch is me, but I found that the test could pass
> without your patch. This is because pg_logical_slot_get_changes()->
> pg_logical_slot_get_changes_guts(confirm = true)  always mark the slot as dirty.
> IIUC we must use the logical replication system to verify the persistence.
> Attached test can pass only when patch is applied.

Updated the test based on your another_test with slight modifications.

The attached v4 version patch has the changes for the same.

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> It's entirely possible for a logical slot to have a confirmed_flush
> LSN higher than the last value saved on disk while not being marked as
> dirty.  It's currently not a problem to lose that value during a clean
> shutdown / restart cycle but to support the upgrade of logical slots
> [1] (see latest patch at [2]), we seem to rely on that value being
> properly persisted to disk. During the upgrade, we need to verify that
> all the data prior to the shutdown_checkpoint for the logical slots has
> been consumed; otherwise, the downstream may miss some data. Now, to
> ensure the same, we are planning to compare the confirm_flush LSN
> location with the latest shutdown_checkpoint location, which means that
> the confirm_flush LSN must be up to date after the restart.
>
> I think this is inefficient even without an upgrade because, after the
> restart, this may lead to decoding some data again. Say, we process
> some transactions for which we didn't send anything downstream (the
> changes got filtered) but the confirm_flush LSN is updated due to
> keepalives. As we don't flush the latest value of confirm_flush LSN,
> it may lead to processing the same changes again.

I was able to test and verify that we were not processing the same
changes again.
Note: The 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
has logs to print if a decoded transaction is skipped and also a log to
mention if any operation is filtered.
The test.sh script has the steps for: a) setting up logical replication
for a table; b) performing an insert on a table that is published (this
will be replicated to the subscriber); c) performing an insert on a table
that is not published (this insert will be filtered and not
replicated); d) sleeping for 5 seconds; e) stopping the server; f) starting the
server.
I used the following steps. First, on HEAD:
a) apply the 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
patch on HEAD and build the binaries, b) execute test.sh, c) view the N1.log
file to see that the insert operations were filtered again, per
the following logs:
LOG:  Filter insert for table tbl2
...
===restart===
...
LOG:  Skipping transaction 0/156AD10 as start decode at is greater 0/156AE40
...
LOG:  Filter insert for table tbl2

We can see that the insert operations on tbl2 that were filtered
before the server was stopped are filtered again after the restart on HEAD.

Let's verify that the same changes are not processed again with the patch:
a) apply v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patch
from [1] along with
0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
and build the binaries, b) execute test.sh, c) view the N1.log file to see
that the insert operations were skipped after the restart of the server, per
the following logs:
LOG:  Filter insert for table tbl2
...
===restart===
...
Skipping transaction 0/156AD10 as start decode at is greater 0/156AFB0
...
Skipping transaction 0/156AE80 as start decode at is greater 0/156AFB0

We can see that the insert operations on tbl2 are not processed again
after restart with the patch.

[1] - https://www.postgresql.org/message-id/CALDaNm0VrAt24e2FxbOX6eJQ-G_tZ0gVpsFBjzQM99NxG0hZfg%40mail.gmail.com

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Fri, 25 Aug 2023 at 17:40, vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It's entirely possible for a logical slot to have a confirmed_flush
> > LSN higher than the last value saved on disk while not being marked as
> > dirty.  It's currently not a problem to lose that value during a clean
> > shutdown / restart cycle but to support the upgrade of logical slots
> > [1] (see latest patch at [2]), we seem to rely on that value being
> > properly persisted to disk. During the upgrade, we need to verify that
> > all the data prior to the shutdown_checkpoint for the logical slots has
> > been consumed; otherwise, the downstream may miss some data. Now, to
> > ensure the same, we are planning to compare the confirm_flush LSN
> > location with the latest shutdown_checkpoint location, which means that
> > the confirm_flush LSN must be up to date after the restart.
> >
> > I think this is inefficient even without an upgrade because, after the
> > restart, this may lead to decoding some data again. Say, we process
> > some transactions for which we didn't send anything downstream (the
> > changes got filtered) but the confirm_flush LSN is updated due to
> > keepalives. As we don't flush the latest value of confirm_flush LSN,
> > it may lead to processing the same changes again.
>
> I was able to test and verify that we were not processing the same
> changes again.
> Note: The 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
> has logs to print if a decode transaction is skipped and also a log to
> mention if any operation is filtered.
> The test.sh script has the steps for a) setting up logical replication
> for a table b) perform insert on table that need to be published (this
> will be replicated to the subscriber) c) perform insert on a table
> that will not be published (this insert will be filtered, it will not
> be replicated) d) sleep for 5 seconds e) stop the server f) start the
> server
> I used the following steps, do the following in HEAD:
> a) Apply 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
> patch in Head and build the binaries b) execute test.sh c) view N1.log
> file to see that the insert operations were filtered again by seeing
> the following logs:
> LOG:  Filter insert for table tbl2
> ...
> ===restart===
> ...
> LOG:  Skipping transaction 0/156AD10 as start decode at is greater 0/156AE40
> ...
> LOG:  Filter insert for table tbl2
>
> We can see that the insert operations on tbl2 which was filtered
> before server was stopped is again filtered after restart too in HEAD.
>
> Lets see that the same changes were not processed again with patch:
> a) Apply v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patch
> from [1] also apply
> 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch patch
> and build the binaries b) execute test.sh c) view N1.log file to see
> that the insert operations were skipped after restart of server by
> seeing the following logs:
> LOG:  Filter insert for table tbl2
> ...
> ===restart===
> ...
> Skipping transaction 0/156AD10 as start decode at is greater 0/156AFB0
> ...
> Skipping transaction 0/156AE80 as start decode at is greater 0/156AFB0
>
> We can see that the insert operations on tbl2 are not processed again
> after restart with the patch.

Here is another way to test, using the pg_replslotdata approach that was
proposed earlier at [1].
I have rebased this on top of HEAD and the v5 version for the same is attached.

We can use the same test as test.sh shared at [2].
When executed on HEAD, it was noticed that confirmed_flush points to a
WAL location before both transactions:
 slot_name | slot_type | datoid | persistency | xmin | catalog_xmin | restart_lsn | confirmed_flush | two_phase_at | two_phase | plugin
-----------+-----------+--------+-------------+------+--------------+-------------+-----------------+--------------+-----------+----------
 sub       | logical   |      5 | persistent  |    0 |          735 | 0/1531E28   | 0/1531E60       | 0/0          | 0         | pgoutput

WAL record information generated using pg_walinspect for various
records at and after confirmed_flush WAL 0/1531E60:
 row | start_lsn | end_lsn   | xid | rmgr        | record_type         | description / block_ref
-----+-----------+-----------+-----+-------------+---------------------+-----------------------------------------------------------------
   1 | 0/1531E60 | 0/1531EA0 |   0 | Heap2       | PRUNE               | snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [1], unused: []; blkref #0: rel 1663/5/1255 fork main blk 58
   2 | 0/1531EA0 | 0/1531EE0 | 735 | Heap        | INSERT+INIT         | off: 1, flags: 0x08; blkref #0: rel 1663/5/16384 fork main blk 0
   3 | 0/1531EE0 | 0/1531F20 | 735 | Heap        | INSERT              | off: 2, flags: 0x08; blkref #0: rel 1663/5/16384 fork main blk 0
   4 | 0/1531F20 | 0/1531F60 | 735 | Heap        | INSERT              | off: 3, flags: 0x08; blkref #0: rel 1663/5/16384 fork main blk 0
   5 | 0/1531F60 | 0/1531FA0 | 735 | Heap        | INSERT              | off: 4, flags: 0x08; blkref #0: rel 1663/5/16384 fork main blk 0
   6 | 0/1531FA0 | 0/1531FE0 | 735 | Heap        | INSERT              | off: 5, flags: 0x08; blkref #0: rel 1663/5/16384 fork main blk 0
   7 | 0/1531FE0 | 0/1532028 | 735 | Transaction | COMMIT              | 2023-08-27 23:22:17.161215+05:30
   8 | 0/1532028 | 0/1532068 | 736 | Heap        | INSERT+INIT         | off: 1, flags: 0x08; blkref #0: rel 1663/5/16387 fork main blk 0
   9 | 0/1532068 | 0/15320A8 | 736 | Heap        | INSERT              | off: 2, flags: 0x08; blkref #0: rel 1663/5/16387 fork main blk 0
  10 | 0/15320A8 | 0/15320E8 | 736 | Heap        | INSERT              | off: 3, flags: 0x08; blkref #0: rel 1663/5/16387 fork main blk 0
  11 | 0/15320E8 | 0/1532128 | 736 | Heap        | INSERT              | off: 4, flags: 0x08; blkref #0: rel 1663/5/16387 fork main blk 0
  12 | 0/1532128 | 0/1532168 | 736 | Heap        | INSERT              | off: 5, flags: 0x08; blkref #0: rel 1663/5/16387 fork main blk 0
  13 | 0/1532168 | 0/1532198 | 736 | Transaction | COMMIT              | 2023-08-27 23:22:17.174756+05:30
  14 | 0/1532198 | 0/1532210 |   0 | XLOG        | CHECKPOINT_SHUTDOWN | redo 0/1532198; tli 1; prev tli 1; fpw true; xid 0:737; oid 16399; multi 1; offset 0; oldest xid 723 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

Whereas the same test executed with the patch applied shows that
confirmed_flush points to the CHECKPOINT_SHUTDOWN record:
 slot_name | slot_type | datoid | persistency | xmin | catalog_xmin | restart_lsn | confirmed_flush | two_phase_at | two_phase | plugin
-----------+-----------+--------+-------------+------+--------------+-------------+-----------------+--------------+-----------+----------
 sub       | logical   |      5 | persistent  |    0 |          735 | 0/1531E28   | 0/1532198       | 0/0          | 0         | pgoutput

WAL record information generated using pg_walinspect for various
records at and after confirmed_flush WAL 0/1532198:
 row | start_lsn | end_lsn   | xid | rmgr | record_type         | description
-----+-----------+-----------+-----+------+---------------------+-----------------------------------------------------------------
   1 | 0/1532198 | 0/1532210 |   0 | XLOG | CHECKPOINT_SHUTDOWN | redo 0/1532198; tli 1; prev tli 1; fpw true; xid 0:737; oid 16399; multi 1; offset 0; oldest xid 723 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
(1 row)

[1] -
https://www.postgresql.org/message-id/flat/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm2BboFuFVYxyzP4wkv7%3D8%2B_TwsD%2BugyGhtibTSF4m4XRg%40mail.gmail.com

Regards,
Vignesh

Attachment

RE: persist logical slots to disk during shutdown checkpoint

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

I also tested logical slots on a physical standby. PSA the script.
confirmed_flush_lsn for such slots was successfully persisted.

# Topology

In this test, the nodes are connected to each other:

node1 --(physical replication)-->node2--(logical replication)-->node3

# Test method

The attached script performs the following steps:

1. construct the above configuration
2. insert data on node1
3. read confirmed_flush_lsn on node2 (a)
4. restart node2
5. read confirmed_flush_lsn again on node2 (b)
6. compare (a) and (b)

# result

Before patching, (a) and (b) had different values, which means that logical
slots on a physical standby were not saved at shutdown.

```
slot_name | confirmed_flush_lsn 
-----------+---------------------
 sub       | 0/30003E8
(1 row)

waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
 slot_name | confirmed_flush_lsn 
-----------+---------------------
 sub       | 0/30000D8
(1 row)
```

After patching, (a) and (b) have the same value. The v4 patch works well even
when the node is a physical standby.

```
slot_name | confirmed_flush_lsn 
-----------+---------------------
 sub       | 0/30003E8
(1 row)

waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
 slot_name | confirmed_flush_lsn 
-----------+---------------------
 sub       | 0/30003E8
(1 row)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Aug 24, 2023 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:
>

The patch looks mostly good to me. I have made minor changes, which are
as follows: (a) removed the autovacuum = off and
wal_receiver_status_interval = 0 settings as those don't seem to be
required for the test; (b) changed a few comments and variable names
in the code and test.

Shall we change the test file name from always_persist to
save_logical_slots_shutdown and move it to recovery/t/, as this test is
about verification after the restart of the server?

--
With Regards,
Amit Kapila.

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Mon, 28 Aug 2023 at 18:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:
> >
>
> The patch looks mostly good to me. I have made minor changes which are
> as follows: (a) removed the autovacuum =off and
> wal_receiver_status_interval = 0 setting as those doesn't seem to be
> required for the test; (b) changed a few comments and variable names
> in the code and test;
>
> Shall we change the test file name from always_persist to
> save_logical_slots_shutdown and move to recovery/t/ as this test is
> about verification after the restart of the server?

That makes sense. The attached v6 version has the changes for the
same; apart from this, I have also fixed a) pgindent issues, b) perltidy
issues, c) one variable rename (flush_lsn_changed to
confirmed_flush_has_changed), and d) corrected a few comments in the test
file. Thanks to Peter for providing a few offline comments.

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:
>
> That makes sense. The attached v6 version has the changes for the
> same, apart from this I have also fixed a) pgindent issues b) perltidy
> issues c) one variable change (flush_lsn_changed to
> confirmed_flush_has_changed) d) corrected few comments in the test
> file. Thanks to Peter for providing few offline comments.
>

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push
this in a day or two.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Tue, Aug 29, 2023 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > That makes sense. The attached v6 version has the changes for the
> > same, apart from this I have also fixed a) pgindent issues b) perltidy
> > issues c) one variable change (flush_lsn_changed to
> > confirmed_flush_has_changed) d) corrected few comments in the test
> > file. Thanks to Peter for providing few offline comments.
> >
>
> The latest version looks good to me. Julien, Ashutosh, and others,
> unless you have more comments or suggestions, I would like to push
> this in a day or two.

I am looking at it. If you can wait till the end of the week, that
will be great.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Julien Rouhaud
Date:
Hi,

On Tue, Aug 29, 2023 at 02:21:15PM +0530, Amit Kapila wrote:
> On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > That makes sense. The attached v6 version has the changes for the
> > same, apart from this I have also fixed a) pgindent issues b) perltidy
> > issues c) one variable change (flush_lsn_changed to
> > confirmed_flush_has_changed) d) corrected few comments in the test
> > file. Thanks to Peter for providing few offline comments.
> >
>
> The latest version looks good to me. Julien, Ashutosh, and others,
> unless you have more comments or suggestions, I would like to push
> this in a day or two.

Unfortunately I'm currently swamped with some internal escalations, so I
couldn't keep up closely with the latest activity here.

I think I recall that you wanted to change the timing at which logical
slots are shut down; I'm assuming that this change won't always lead to a
difference between the in-memory LSN and the latest persisted LSN?
Otherwise, saving the latest persisted LSN to avoid persisting all
logical slots again on shutdown seems reasonable to me.



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Wed, Aug 30, 2023 at 9:03 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Aug 29, 2023 at 02:21:15PM +0530, Amit Kapila wrote:
> > On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > That makes sense. The attached v6 version has the changes for the
> > > same, apart from this I have also fixed a) pgindent issues b) perltidy
> > > issues c) one variable change (flush_lsn_changed to
> > > confirmed_flush_has_changed) d) corrected few comments in the test
> > > file. Thanks to Peter for providing few offline comments.
> > >
> >
> > The latest version looks good to me. Julien, Ashutosh, and others,
> > unless you have more comments or suggestions, I would like to push
> > this in a day or two.
>
> Unfortunately I'm currently swamped with some internal escalations so I
> couldn't keep up closely with the latest activity here.
>
> I think I recall that you wanted to
> change the timing at which logical slots are shutdown, I'm assuming that this
> change won't lead to always have a difference between the LSN and latest
> persisted LSN being different?
>

I think here by LSN you are referring to the confirmed_flush LSN. If so,
this doesn't create any new difference between the values for the
confirmed_flush LSN in memory and on disk. We just remember the last
persisted value to avoid writes of slots at shutdown time.

>  Otherwise saving the latest persisted LSN to
> try to avoid persisting again all logical slots on shutdown seems reasonable to
> me.

Thanks for responding.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I am looking at it. If you can wait till the end of the week, that
> will be great.

     /*
      * Successfully wrote, unset dirty bit, unless somebody dirtied again
-     * already.
+     * already and remember the confirmed_flush LSN value.
      */
     SpinLockAcquire(&slot->mutex);
     if (!slot->just_dirtied)
         slot->dirty = false;
+    slot->last_saved_confirmed_flush = slot->data.confirmed_flush;

If the slot->data.confirmed_flush gets updated concurrently between copying it
to be written to the disk and when it's written to last_saved_confirmed_flush,
we will miss one update. I think we need to update last_saved_confirmed_flush
based on the cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
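
That is, the save path would remember the snapshot that was actually
written, along the lines of (a sketch of the suggested fix):

```c
SpinLockAcquire(&slot->mutex);
if (!slot->just_dirtied)
	slot->dirty = false;
/* Remember the value that went into the on-disk copy (cp), not the live
 * slot->data.confirmed_flush, which may have advanced concurrently. */
slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
SpinLockRelease(&slot->mutex);
```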

We are setting last_saved_confirmed_flush for all types of slots but using it
only when the slot is logical. Should we also set it only for logical slots?

     /* first check whether there's something to write out */
     SpinLockAcquire(&slot->mutex);
     was_dirty = slot->dirty;
     slot->just_dirtied = false;
+    confirmed_flush_has_changed = (slot->data.confirmed_flush !=
slot->last_saved_confirmed_flush);

The confirmed_flush LSN should always move forward, otherwise there may not be
enough WAL retained for the slot to work. I am wondering whether we should take
this opportunity to make sure, via
Assert(slot->data.confirmed_flush >= slot->last_saved_confirmed_flush)

-    /* and don't do anything if there's nothing to write */
-    if (!was_dirty)
+    /* Don't do anything if there's nothing to write. See ReplicationSlot. */
+    if (!was_dirty &&
+        !(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))

Rather than complicating this condition, I wonder whether it's better to just
set was_dirty = true when is_shutdown && SlotIsLogical(slot) &&
confirmed_flush_has_changed, or even set slot->dirty = true. See also the note at
the end of the email.

+
+    /*
+     * We won't ensure that the slot is persisted after the confirmed_flush
+     * LSN is updated as that could lead to frequent writes.  However, we need
+     * to ensure that we do persist the slots at the time of shutdown whose
+     * confirmed_flush LSN is changed since we last saved the slot to disk.
+     * This will help in avoiding retreat of the confirmed_flush LSN after
+     * restart.  This variable is used to track the last saved confirmed_flush
+     * LSN value.
+     */

This comment makes more sense in SaveSlotToPath() than here. We may decide to
use last_saved_confirmed_flush for something else in future.
+
+sub compare_confirmed_flush
+{
+    my ($node, $confirmed_flush_from_log) = @_;
+
+    # Fetch Latest checkpoint location from the control file
+    my ($stdout, $stderr) =
+      run_command([ 'pg_controldata', $node->data_dir ]);
+    my @control_data      = split("\n", $stdout);
+    my $latest_checkpoint = undef;
+    foreach (@control_data)
+    {
+        if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+        {
+            $latest_checkpoint = $1;
+            last;
+        }
+    }
+    die "Latest checkpoint location not found in control file\n"
+      unless defined($latest_checkpoint);
+
+    # Is it same as the value read from log?
+    ok( $latest_checkpoint eq $confirmed_flush_from_log,
+        "Check that the slot's confirmed_flush LSN is the same as the
latest_checkpoint location"
+    );

This function assumes that the subscriber will receive and confirm WAL up to
the checkpoint location and that the publisher's WAL sender will update it in the
slot. Where is the code to ensure that? Does the WAL sender process wait for the
checkpoint LSN to be confirmed when shutting down?
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();

Can we add this test, comparing LSNs after a publisher restart, to an existing
test itself, like basic replication? That's the only extra thing that this
test does beyond the usual replication stuff.

+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+    qr/Streaming transactions committing after
([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+    $offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";

Why are we figuring out the LSN from the log file? Is it not available from
the pg_replication_slots view? If we do so, I think this test will fail if the slot
gets written after the restart because of concurrent activity on the publisher
(like autovacuum, or other things that cause an empty transaction to be
replicated) and the subscriber. A very rare chance, but not a zero-probability
one. I think we should shut down the subscriber, restart the publisher, and then
make this check based on the contents of the replication slot instead of the
server log. Shutting down the subscriber will ensure that it won't send any new
confirmed flush location to the publisher after the restart.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty, except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down the is_shutdown flag to SaveSlotToPath().
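
In other words, something like this in CheckPointReplicationSlots() (a
sketch of the suggestion, reusing the field and flag names from the patch):

```c
/* At the shutdown checkpoint, dirty any logical slot whose confirmed_flush
 * advanced since the last save, then let the ordinary SaveSlotToPath()
 * path write it out without any extra is_shutdown parameter. */
if (is_shutdown && SlotIsLogical(s))
{
	SpinLockAcquire(&s->mutex);
	if (s->data.confirmed_flush != s->last_saved_confirmed_flush)
	{
		s->just_dirtied = true;
		s->dirty = true;
	}
	SpinLockRelease(&s->mutex);
}
```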

Unrelated to this patch, I noticed that all the callers of SaveSlotToPath()
have the same code to craft the replication slot file path. I wonder if that needs
to be macro'ised, or added to some common function, or pushed into
SaveSlotToPath() itself, to make sure that any changes to the path in future are
consistent for all callers of SaveSlotToPath(). Interestingly, slot->data.name
is accessed without a lock here. The name of the slot does not change after
creation, so this isn't a problem right now, but it is generally against the
principle of accessing data protected by a mutex.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Wed, Aug 30, 2023 at 6:33 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > I am looking at it. If you can wait till the end of the week, that
> > will be great.
>
>      /*
>       * Successfully wrote, unset dirty bit, unless somebody dirtied again
> -     * already.
> +     * already and remember the confirmed_flush LSN value.
>       */
>      SpinLockAcquire(&slot->mutex);
>      if (!slot->just_dirtied)
>          slot->dirty = false;
> +    slot->last_saved_confirmed_flush = slot->data.confirmed_flush;
>
> If the slot->data.confirmed_flush gets updated concurrently between copying it
> to be written to the disk and when it's written to last_saved_confirmed_flush,
> we will miss one update. I think we need to update last_saved_confirmed_flush
> based on the cp.slotdata.confirmed_flush rather than
> slot->data.confirmed_flush.
>

Yeah, this appears to be a problem.
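
For illustration, the fix in SaveSlotToPath() could look like the following,
remembering the value from the on-disk copy (cp) that was actually written
rather than the live value:

	SpinLockAcquire(&slot->mutex);
	if (!slot->just_dirtied)
		slot->dirty = false;
	/* remember what actually reached disk; slot->data.confirmed_flush
	 * may have advanced concurrently in the meantime */
	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
	SpinLockRelease(&slot->mutex);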

> We are setting last_saved_confirmed_flush for all types of slots but using it
> only when the slot is logical. Should we also set it only for logical slots?
>

We can do that but not sure if there is any advantage of it other than
adding extra condition. BTW, won't even confirmed_flush LSN be used
only for logical slots?

>      /* first check whether there's something to write out */
>      SpinLockAcquire(&slot->mutex);
>      was_dirty = slot->dirty;
>      slot->just_dirtied = false;
> +    confirmed_flush_has_changed = (slot->data.confirmed_flush !=
> slot->last_saved_confirmed_flush);
>
> The confirmed_flush LSN should always move forward, otherwise there may not be
> enough WAL retained for the slot to work. I am wondering whether we should take
> an opportunity to make sure
> Assert(slot->data.confirmed_flush >= slot->last_saved_confirmed_flush)
>

Theoretically, what you are saying makes sense to me but we don't have
such a protection while updating the confirmed_flush LSN. It would be
better to first add such a protection for confirmed_flush LSN update
as a separate patch.
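
For illustration, such a protection could be a simple guard at the top of
LogicalConfirmReceivedLocation() (a sketch only, material for a separate
patch as noted above):

	/* confirmed_flush should never move backwards */
	Assert(lsn >= MyReplicationSlot->data.confirmed_flush);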

> -    /* and don't do anything if there's nothing to write */
> -    if (!was_dirty)
> +    /* Don't do anything if there's nothing to write. See ReplicationSlot. */
> +    if (!was_dirty &&
> +        !(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
>
> Rather than complicating this condition, I wonder whether it's better to just
> set was_dirty = true when (is_shutdown && SlotIsLogical(slot) &&
> confirmed_flush_has_changed), or even slot->dirty = true.
>

I think it is better to keep the slot's dirty property separate. We
could introduce another variable that is the result of both was_dirty
and the other checks together; however, that doesn't seem much better
than the current code.

> +
> +    /*
> +     * We won't ensure that the slot is persisted after the confirmed_flush
> +     * LSN is updated as that could lead to frequent writes.  However, we need
> +     * to ensure that we do persist the slots at the time of shutdown whose
> +     * confirmed_flush LSN is changed since we last saved the slot to disk.
> +     * This will help in avoiding retreat of the confirmed_flush LSN after
> +     * restart.  This variable is used to track the last saved confirmed_flush
> +     * LSN value.
> +     */
>
> This comment makes more sense in SaveSlotToPath() than here. We may decide to
> use last_saved_confirmed_flush for something else in future.
>

I have kept it here because it contains some information that is not
specific to SaveSlotToPath. So, it seems easier to follow the whole
theory if we keep it at the central place in the structure and then
add the reference wherever required but I am fine if you and others
feel strongly about moving this to SaveSlotToPath().

> +
> +sub compare_confirmed_flush
> +{
> +    my ($node, $confirmed_flush_from_log) = @_;
> +
> +    # Fetch Latest checkpoint location from the control file
> +    my ($stdout, $stderr) =
> +      run_command([ 'pg_controldata', $node->data_dir ]);
> +    my @control_data      = split("\n", $stdout);
> +    my $latest_checkpoint = undef;
> +    foreach (@control_data)
> +    {
> +        if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
> +        {
> +            $latest_checkpoint = $1;
> +            last;
> +        }
> +    }
> +    die "Latest checkpoint location not found in control file\n"
> +      unless defined($latest_checkpoint);
> +
> +    # Is it same as the value read from log?
> +    ok( $latest_checkpoint eq $confirmed_flush_from_log,
> +        "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
> +    );
>
> This function assumes that the subscriber will receive and confirm WAL upto
> checkpoint location and publisher's WAL sender will update it in the slot.
> Where is the code to ensure that? Does the WAL sender process wait for
> checkpoint
> LSN to be confirmed when shutting down?
>

Note, that we need to compare if all the WAL before the
shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
do ensure that all the pending WAL is confirmed back. See the use of
WalSndDone() in WalSndLoop().

> +
> +# Restart the publisher to ensure that the slot will be persisted if required
> +$node_publisher->restart();
>
> Can we add this test comparing LSNs after publisher restart, to an existing
> test itself - like basic replication. That's the only extra thing that this
> test does beyond usual replication stuff.
>

As this is a test after the restart of the server, I thought it best to
keep it with the recovery tests. However, once the upgrade (of
publisher nodes) patch is ready, we should keep this test with those
tests or somehow merge it with them, but till that patch is ready,
let's keep this as a separate test.

> +
> +# Wait until the walsender creates decoding context
> +$node_publisher->wait_for_log(
> +    qr/Streaming transactions committing after
> ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
> +    $offset);
> +
> +# Extract confirmed_flush from the logfile
> +my $log_contents = slurp_file($node_publisher->logfile, $offset);
> +$log_contents =~
> +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
> +  or die "could not get confirmed_flush_lsn";
>
> Why are we figuring out the LSN from the log file? Is it not available from
> pg_replication_slots view? If we do so, I think this test will fail if the slot
> gets written after the restart because of concurrent activity on the publisher
> (like autovacuum, or other things that cause empty transaction to be
> replicated) and subscriber. A very rare chance but not 0 probability one.
>

Yes, that is a possibility.

> I
> think we should shut down subscriber, restart publisher and then make this
> check based on the contents of the replication slot instead of server log.
> Shutting down subscriber will ensure that the subscriber won't send any new
> confirmed flush location to the publisher after restart.
>

But if we shut down the subscriber before the publisher, there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

> All the places which call ReplicationSlotSave() mark the slot as dirty.  All
> the places where SaveSlotToPath() is called, the slot is marked dirty except
> when calling from CheckPointReplicationSlots(). So I am wondering whether we
> should be marking the slot dirty in CheckPointReplicationSlots() and avoid
> passing down is_shutdown flag to SaveSlotToPath().
>

I feel that will add another spinlock acquire/release pair without
much benefit. Sure, it may not be performance-sensitive but still
adding another pair of lock/release doesn't seem like a better idea.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > +
> > +    /*
> > +     * We won't ensure that the slot is persisted after the confirmed_flush
> > +     * LSN is updated as that could lead to frequent writes.  However, we need
> > +     * to ensure that we do persist the slots at the time of shutdown whose
> > +     * confirmed_flush LSN is changed since we last saved the slot to disk.
> > +     * This will help in avoiding retreat of the confirmed_flush LSN after
> > +     * restart.  This variable is used to track the last saved confirmed_flush
> > +     * LSN value.
> > +     */
> >
> > This comment makes more sense in SaveSlotToPath() than here. We may decide to
> > use last_saved_confirmed_flush for something else in future.
> >
>
> I have kept it here because it contains some information that is not
> specific to SaveSlotToPath. So, it seems easier to follow the whole
> theory if we keep it at the central place in the structure and then
> add the reference wherever required but I am fine if you and others
> feel strongly about moving this to SaveSlotToPath().

Saving a slot to disk happens only in SaveSlotToPath(), so except for the
last sentence, the rest of the comment makes sense in SaveSlotToPath().

> >
> > This function assumes that the subscriber will receive and confirm WAL upto
> > checkpoint location and publisher's WAL sender will update it in the slot.
> > Where is the code to ensure that? Does the WAL sender process wait for
> > checkpoint
> > LSN to be confirmed when shutting down?
> >
>
> Note, that we need to compare if all the WAL before the
> shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
> do ensure that all the pending WAL is confirmed back. See the use of
> WalSndDone() in WalSndLoop().

Ok. Thanks for pointing that out to me.

>
> > I
> > think we should shut down subscriber, restart publisher and then make this
> > check based on the contents of the replication slot instead of server log.
> > Shutting down subscriber will ensure that the subscriber won't send any new
> > confirmed flush location to the publisher after restart.
> >
>
> But if we shutdown the subscriber before the publisher there is no
> guarantee that the publisher has sent all outstanding logs up to the
> shutdown checkpoint record (i.e., the latest record). Such a guarantee
> can only be there if we do a clean shutdown of the publisher before
> the subscriber.

So the sequence is: shut down the publisher node, shut down the subscriber
node, start the publisher node, and carry out the checks.

>
> > All the places which call ReplicationSlotSave() mark the slot as dirty.  All
> > the places where SaveSlotToPath() is called, the slot is marked dirty except
> > when calling from CheckPointReplicationSlots(). So I am wondering whether we
> > should be marking the slot dirty in CheckPointReplicationSlots() and avoid
> > passing down is_shutdown flag to SaveSlotToPath().
> >
>
> I feel that will add another spinlock acquire/release pair without
> much benefit. Sure, it may not be performance-sensitive but still
> adding another pair of lock/release doesn't seem like a better idea.

We call ReplicationSlotMarkDirty() followed by ReplicationSlotSave()
at all the places, even those which are more frequent than this. So I
think it's better to stick to that protocol rather than adding a new
flag.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Aug 31, 2023 at 12:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > +
> > > +    /*
> > > +     * We won't ensure that the slot is persisted after the confirmed_flush
> > > +     * LSN is updated as that could lead to frequent writes.  However, we need
> > > +     * to ensure that we do persist the slots at the time of shutdown whose
> > > +     * confirmed_flush LSN is changed since we last saved the slot to disk.
> > > +     * This will help in avoiding retreat of the confirmed_flush LSN after
> > > +     * restart.  This variable is used to track the last saved confirmed_flush
> > > +     * LSN value.
> > > +     */
> > >
> > > This comment makes more sense in SaveSlotToPath() than here. We may decide to
> > > use last_saved_confirmed_flush for something else in future.
> > >
> >
> > I have kept it here because it contains some information that is not
> > specific to SaveSlotToPath. So, it seems easier to follow the whole
> > theory if we keep it at the central place in the structure and then
> > add the reference wherever required but I am fine if you and others
> > feel strongly about moving this to SaveSlotToPath().
>
> Saving slot to disk happens only in SaveSlotToPath, so except the last
> sentence rest of the comment makes sense in SaveSlotToPath().
>
> > >
> > > This function assumes that the subscriber will receive and confirm WAL upto
> > > checkpoint location and publisher's WAL sender will update it in the slot.
> > > Where is the code to ensure that? Does the WAL sender process wait for
> > > checkpoint
> > > LSN to be confirmed when shutting down?
> > >
> >
> > Note, that we need to compare if all the WAL before the
> > shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
> > do ensure that all the pending WAL is confirmed back. See the use of
> > WalSndDone() in WalSndLoop().
>
> Ok. Thanks for pointing that out to me.
>
> >
> > > I
> > > think we should shut down subscriber, restart publisher and then make this
> > > check based on the contents of the replication slot instead of server log.
> > > Shutting down subscriber will ensure that the subscriber won't send any new
> > > confirmed flush location to the publisher after restart.
> > >
> >
> > But if we shutdown the subscriber before the publisher there is no
> > guarantee that the publisher has sent all outstanding logs up to the
> > shutdown checkpoint record (i.e., the latest record). Such a guarantee
> > can only be there if we do a clean shutdown of the publisher before
> > the subscriber.
>
> So the sequence is shutdown publisher node, shutdown subscriber node,
> start publisher node and carry out the checks.
>

This can probably work but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of a comparison
with a later in-memory value of the confirmed_flush LSN. Ideally, if we
had a tool like pg_replslotdata which can read the on-disk state of
slots, that would be better, but lacking that, the current one sounds
like the next best possibility. Do you see any problem with the
current approach of the test?

BTW, I think we can keep autovacuum = off for this test just to avoid
any extra record generation even though that doesn't matter for the
purpose of test.

> >
> > > All the places which call ReplicationSlotSave() mark the slot as dirty.  All
> > > the places where SaveSlotToPath() is called, the slot is marked dirty except
> > > when calling from CheckPointReplicationSlots(). So I am wondering whether we
> > > should be marking the slot dirty in CheckPointReplicationSlots() and avoid
> > > passing down is_shutdown flag to SaveSlotToPath().
> > >
> >
> > I feel that will add another spinlock acquire/release pair without
> > much benefit. Sure, it may not be performance-sensitive but still
> > adding another pair of lock/release doesn't seem like a better idea.
>
> We call ReplicationSlotMarkDirty() followed by ReplicationSlotSave()
> at all the places, even those which are more frequent than this.
>

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicationSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > I
> > > > think we should shut down subscriber, restart publisher and then make this
> > > > check based on the contents of the replication slot instead of server log.
> > > > Shutting down subscriber will ensure that the subscriber won't send any new
> > > > confirmed flush location to the publisher after restart.
> > > >
> > >
> > > But if we shutdown the subscriber before the publisher there is no
> > > guarantee that the publisher has sent all outstanding logs up to the
> > > shutdown checkpoint record (i.e., the latest record). Such a guarantee
> > > can only be there if we do a clean shutdown of the publisher before
> > > the subscriber.
> >
> > So the sequence is shutdown publisher node, shutdown subscriber node,
> > start publisher node and carry out the checks.
> >
>
> This can probably work but I still prefer the current approach as that
> will be closer to the ideal values on the disk instead of comparison
> with a later in-memory value of confirmed_flush LSN. Ideally, if we
> would have a tool like pg_replslotdata which can read the on-disk
> state of slots that would be better but missing that, the current one
> sounds like the next best possibility. Do you see any problem with the
> current approach of test?

> +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush but not
always. It's the LSN that the snapshot builder computes based on factors
including confirmed_flush. There's a chance that this test will fail
sometimes because of this behaviour. Reading directly from the
replication slot is better than this. pg_replslotdata might help if we
read the replication slot content between shutdown and restart of the
publisher.

>
> BTW, I think we can keep autovacuum = off for this test just to avoid
> any extra record generation even though that doesn't matter for the
> purpose of test.

Autovacuum is one thing, but we can't guarantee the absence of any
concurrent activity forever.

>
> > >
> > > > All the places which call ReplicationSlotSave() mark the slot as dirty.  All
> > > > the places where SaveSlotToPath() is called, the slot is marked dirty except
> > > > when calling from CheckPointReplicationSlots(). So I am wondering whether we
> > > > should be marking the slot dirty in CheckPointReplicationSlots() and avoid
> > > > passing down is_shutdown flag to SaveSlotToPath().
> > > >
> > >
> > > I feel that will add another spinlock acquire/release pair without
> > > much benefit. Sure, it may not be performance-sensitive but still
> > > adding another pair of lock/release doesn't seem like a better idea.
> >
> > We call ReplicationSlotMarkDirty() followed by ReplicationSlotSave()
> > at all the places, even those which are more frequent than this.
> >
>
> All but one. Normally, the idea of marking dirty is to indicate that
> we will actually write/flush the contents at a later point (except
> when required for correctness) as even indicated in the comments atop
> ReplicationSlotMarkDirty(). However, I see your point that we use
> that protocol at all the current places including CreateSlotOnDisk().
> So, we can probably do it here as well.

yes

I didn't see this entry in commitfest. Since we are discussing it and
the next CF is about to begin, probably it's good to add one there.
--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > I
> > > > > think we should shut down subscriber, restart publisher and then make this
> > > > > check based on the contents of the replication slot instead of server log.
> > > > > Shutting down subscriber will ensure that the subscriber won't send any new
> > > > > confirmed flush location to the publisher after restart.
> > > > >
> > > >
> > > > But if we shutdown the subscriber before the publisher there is no
> > > > guarantee that the publisher has sent all outstanding logs up to the
> > > > shutdown checkpoint record (i.e., the latest record). Such a guarantee
> > > > can only be there if we do a clean shutdown of the publisher before
> > > > the subscriber.
> > >
> > > So the sequence is shutdown publisher node, shutdown subscriber node,
> > > start publisher node and carry out the checks.
> > >
> >
> > This can probably work but I still prefer the current approach as that
> > will be closer to the ideal values on the disk instead of comparison
> > with a later in-memory value of confirmed_flush LSN. Ideally, if we
> > would have a tool like pg_replslotdata which can read the on-disk
> > state of slots that would be better but missing that, the current one
> > sounds like the next best possibility. Do you see any problem with the
> > current approach of test?
>
> > +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> > reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
>
> I don't think the LSN reported in this message is guaranteed to be the
> confirmed_flush LSN of the slot. It's usually confirmed_flush but not
> always. It's the LSN that snapshot builder computes based on factors
> including confirmed_flush. There's a chance that this test will fail
> sometimes because  of this behaviour.
>

I think I am missing something here because, as per my understanding,
the LOG referred to by the test is generated in CreateDecodingContext(),
before which we shouldn't be changing the slot's confirmed_flush LSN.
The LOG [1] refers to the slot's persistent value for confirmed_flush,
so how could it be different from what the test is expecting?

[1]
errdetail("Streaming transactions committing after %X/%X, reading WAL
from %X/%X.",
   LSN_FORMAT_ARGS(slot->data.confirmed_flush),


--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > All but one. Normally, the idea of marking dirty is to indicate that
> > we will actually write/flush the contents at a later point (except
> > when required for correctness) as even indicated in the comments atop
> > ReplicationSlotMarkDirty(). However, I see your point that we use
> > that protocol at all the current places including CreateSlotOnDisk().
> > So, we can probably do it here as well.
>
> yes
>

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > All but one. Normally, the idea of marking dirty is to indicate that
> > > we will actually write/flush the contents at a later point (except
> > > when required for correctness) as even indicated in the comments atop
> > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > that protocol at all the current places including CreateSlotOnDisk().
> > > So, we can probably do it here as well.
> >
> > yes
> >
>
> I think we should also ensure that slots are not invalidated
> (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> because we don't allow decoding from such slots, so we shouldn't
> include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath. b) The comments were moved
from ReplicationSlot to CheckPointReplicationSlots. c) Tests will be
run with autovacuum = off. d) Updating last_saved_confirmed_flush
based on cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
I have also added the commitfest entry for this at [1].

Thanks to Ashutosh/Amit for the feedback.
Attached v7 version patch has the changes for the same.
[1] - https://commitfest.postgresql.org/44/4536/

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > I
> > > > > > think we should shut down subscriber, restart publisher and then make this
> > > > > > check based on the contents of the replication slot instead of server log.
> > > > > > Shutting down subscriber will ensure that the subscriber won't send any new
> > > > > > confirmed flush location to the publisher after restart.
> > > > > >
> > > > >
> > > > > But if we shutdown the subscriber before the publisher there is no
> > > > > guarantee that the publisher has sent all outstanding logs up to the
> > > > > shutdown checkpoint record (i.e., the latest record). Such a guarantee
> > > > > can only be there if we do a clean shutdown of the publisher before
> > > > > the subscriber.
> > > >
> > > > So the sequence is shutdown publisher node, shutdown subscriber node,
> > > > start publisher node and carry out the checks.
> > > >
> > >
> > > This can probably work but I still prefer the current approach as that
> > > will be closer to the ideal values on the disk instead of comparison
> > > with a later in-memory value of confirmed_flush LSN. Ideally, if we
> > > would have a tool like pg_replslotdata which can read the on-disk
> > > state of slots that would be better but missing that, the current one
> > > sounds like the next best possibility. Do you see any problem with the
> > > current approach of test?
> >
> > > +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> > > reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
> >
> > I don't think the LSN reported in this message is guaranteed to be the
> > confirmed_flush LSN of the slot. It's usually confirmed_flush but not
> > always. It's the LSN that snapshot builder computes based on factors
> > including confirmed_flush. There's a chance that this test will fail
> > sometimes because  of this behaviour.
> >
>
> I think I am missing something here because as per my understanding,
> the LOG referred by the test is generated in CreateDecodingContext()
> before which we shouldn't be changing the slot's confirmed_flush LSN.
> The LOG [1] refers to the slot's persistent value for confirmed_flush,
> so how it could be different from what the test is expecting.
>
> [1]
> errdetail("Streaming transactions committing after %X/%X, reading WAL
> from %X/%X.",
>    LSN_FORMAT_ARGS(slot->data.confirmed_flush),

I was afraid that we may move confirmed_flush while creating the
snapshot builder when creating the decoding context. But I don't see
any code doing that. So maybe we are safe. But if the log message
changes, this test would fail; depending upon the log message makes
the test look a bit fragile, esp. when we have a way to access the
data directly and reliably.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Fri, Sep 1, 2023 at 1:11 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > > +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> > > > reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
> > >
> > > I don't think the LSN reported in this message is guaranteed to be the
> > > confirmed_flush LSN of the slot. It's usually confirmed_flush but not
> > > always. It's the LSN that snapshot builder computes based on factors
> > > including confirmed_flush. There's a chance that this test will fail
> > > sometimes because  of this behaviour.
> > >
> >
> > I think I am missing something here because as per my understanding,
> > the LOG referred by the test is generated in CreateDecodingContext()
> > before which we shouldn't be changing the slot's confirmed_flush LSN.
> > The LOG [1] refers to the slot's persistent value for confirmed_flush,
> > so how it could be different from what the test is expecting.
> >
> > [1]
> > errdetail("Streaming transactions committing after %X/%X, reading WAL
> > from %X/%X.",
> >    LSN_FORMAT_ARGS(slot->data.confirmed_flush),
>
> I was afraid that we may move confirmed_flush while creating the
> snapshot builder when creating the decoding context. But I don't see
> any code doing that. So may be we are safe.
>

We are safe in that respect. As far as I understand there is no reason
to be worried.

>
 But if the log message
> changes, this test would fail - depending upon the log message looks a
> bit fragile, esp. when we have a way to access the data directly
> reliably.
>

This message is there from the very beginning (b89e1510) and I can't
foresee a reason to change such a message. But even if we change it, we
can always adjust the test output or the test accordingly, if required. I
think it is a matter of preference as to which way we write the test,
so let's not argue too much on this. I find the current way slightly more
reliable but we can change it if we see any problem.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > All but one. Normally, the idea of marking dirty is to indicate that
> > > > we will actually write/flush the contents at a later point (except
> > > > when required for correctness) as even indicated in the comments atop
> > > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > > that protocol at all the current places including CreateSlotOnDisk().
> > > > So, we can probably do it here as well.
> > >
> > > yes
> > >
> >
> > I think we should also ensure that slots are not invalidated
> > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > because we don't allow decoding from such slots, so we shouldn't
> > include those.
>
> Added this check.
>
> Apart from this I have also fixed the following issues that were
> agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> instead of setting it in SaveSlotToPath
>

+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;

I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > All but one. Normally, the idea of marking dirty is to indicate that
> > > > > we will actually write/flush the contents at a later point (except
> > > > > when required for correctness) as even indicated in the comments atop
> > > > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > > > that protocol at all the current places including CreateSlotOnDisk().
> > > > > So, we can probably do it here as well.
> > > >
> > > > yes
> > > >
> > >
> > > I think we should also ensure that slots are not invalidated
> > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > because we don't allow decoding from such slots, so we shouldn't
> > > include those.
> >
> > Added this check.
> >
> > Apart from this I have also fixed the following issues that were
> > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > instead of setting it in SaveSlotToPath
> >
>
> + if (is_shutdown && SlotIsLogical(s))
> + {
> + SpinLockAcquire(&s->mutex);
> + if (s->data.invalidated == RS_INVAL_NONE &&
> + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> + s->dirty = true;
>
> I think it is better to use ReplicationSlotMarkDirty() as that would
> be consistent with all other usages.

ReplicationSlotMarkDirty works only on MyReplicationSlot, whereas
CheckPointReplicationSlots loops through all the slots and marks the
appropriate slot as dirty; we might have to change
ReplicationSlotMarkDirty to take the slot as an input parameter, with
all existing callers passing MyReplicationSlot. Another thing is that
we have already taken the spinlock to access last_saved_confirmed_flush
in CheckPointReplicationSlots, so we could set the dirty flag there
itself; otherwise, we will have to release the lock and call
ReplicationSlotMarkDirty, which will take the lock again. Instead,
shall we set just_dirtied as well in CheckPointReplicationSlots?
Thoughts?

Regards,
Vignesh



RE: persist logical slots to disk during shutdown checkpoint

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > > >
> > > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > All but one. Normally, the idea of marking dirty is to
> > > > > > indicate that we will actually write/flush the contents at a
> > > > > > later point (except when required for correctness) as even
> > > > > > indicated in the comments atop ReplicationSlotMarkDirty().
> > > > > > However, I see your point that we use that protocol at all the current
> places including CreateSlotOnDisk().
> > > > > > So, we can probably do it here as well.
> > > > >
> > > > > yes
> > > > >
> > > >
> > > > I think we should also ensure that slots are not invalidated
> > > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > > because we don't allow decoding from such slots, so we shouldn't
> > > > include those.
> > >
> > > Added this check.
> > >
> > > Apart from this I have also fixed the following issues that were
> > > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > > instead of setting it in SaveSlotToPath
> > >
> >
> > + if (is_shutdown && SlotIsLogical(s))
> > + {
> > + SpinLockAcquire(&s->mutex);
> > + if (s->data.invalidated == RS_INVAL_NONE &&
> > + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > + s->dirty = true;
> >
> > I think it is better to use ReplicationSlotMarkDirty() as that would
> > be consistent with all other usages.
> 
> ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
> CheckpointReplicationSlots loops through all the slots and marks the
> appropriate slot as dirty, we might have to change ReplicationSlotMarkDirty to
> take the slot as input parameter and all caller should pass MyReplicationSlot.

Personally, I feel if we want to centralize the code of marking dirty into a
function, we can introduce a new static function MarkSlotDirty(slot) to mark
the passed slot dirty and let ReplicationSlotMarkDirty and
CheckPointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
    MarkSlotDirty(MyReplicationSlot);
}

+static void
+MarkSlotDirty(ReplicationSlot *slot)
+{
+    Assert(slot != NULL);
+
+    SpinLockAcquire(&slot->mutex);
+    slot->just_dirtied = true;
+    slot->dirty = true;
+    SpinLockRelease(&slot->mutex);
+}

This is somewhat similar to the relation between ReplicationSlotSave()
(serialize my backend's replication slot) and SaveSlotToPath() (save the
passed slot).

> Another thing is we have already taken spin lock to access
> last_confirmed_flush_lsn from CheckpointReplicationSlots, we could set dirty
> flag here itself, else we will have to release the lock and call
> ReplicationSlotMarkDirty which will take lock again.

Yes, this is unavoidable, but maybe it's not a big problem as
we only do it at shutdown.

> Instead shall we set just_dirtied also in CheckpointReplicationSlots?
> Thoughts?

I agree we'd better set just_dirtied to true to ensure we will serialize the
slot info here, because if some other process has just serialized the slot,
the dirty flag will be reset to false if we don't set just_dirtied to true in
CheckPointReplicationSlots(). This race condition may not exist for now, but
it seems better to completely forbid it by setting just_dirtied.
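
So the shutdown path in CheckPointReplicationSlots() would become roughly
the following (a sketch of what has been agreed above, not the final patch):

	if (is_shutdown && SlotIsLogical(s))
	{
		SpinLockAcquire(&s->mutex);
		if (s->data.invalidated == RS_INVAL_NONE &&
			s->data.confirmed_flush != s->last_saved_confirmed_flush)
		{
			s->just_dirtied = true;
			s->dirty = true;
		}
		SpinLockRelease(&s->mutex);
	}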

Best Regards,
Hou zj


Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > > All but one. Normally, the idea of marking dirty is to
> > > > > > > indicate that we will actually write/flush the contents at a
> > > > > > > later point (except when required for correctness) as even
> > > > > > > indicated in the comments atop ReplicationSlotMarkDirty().
> > > > > > > However, I see your point that we use that protocol at all the current
> > places including CreateSlotOnDisk().
> > > > > > > So, we can probably do it here as well.
> > > > > >
> > > > > > yes
> > > > > >
> > > > >
> > > > > I think we should also ensure that slots are not invalidated
> > > > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > > > because we don't allow decoding from such slots, so we shouldn't
> > > > > include those.
> > > >
> > > > Added this check.
> > > >
> > > > Apart from this I have also fixed the following issues that were
> > > > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > > > instead of setting it in SaveSlotToPath
> > > >
> > >
> > > + if (is_shutdown && SlotIsLogical(s))
> > > + {
> > > + SpinLockAcquire(&s->mutex);
> > > + if (s->data.invalidated == RS_INVAL_NONE &&
> > > + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > > + s->dirty = true;
> > >
> > > I think it is better to use ReplicationSlotMarkDirty() as that would
> > > be consistent with all other usages.
> >
> > ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
> > CheckpointReplicationSlots loops through all the slots and marks the
> > appropriate slot as dirty, we might have to change ReplicationSlotMarkDirty to
> > take the slot as input parameter and all caller should pass MyReplicationSlot.
>
> Personally, I feel if we want to centralize the code of marking dirty into a
> function, we can introduce a new static function MarkSlotDirty(slot) to mark
> passed slot dirty and let ReplicationSlotMarkDirty and
> CheckpointReplicationSlots call it. Like:
>
> void
> ReplicationSlotMarkDirty(void)
> {
>         MarkSlotDirty(MyReplicationSlot);
> }
>
> +static void
> +MarkSlotDirty(ReplicationSlot *slot)
> +{
> +       Assert(slot != NULL);
> +
> +       SpinLockAcquire(&slot->mutex);
> +       slot->just_dirtied = true;
> +       slot->dirty = true;
> +       SpinLockRelease(&slot->mutex);
> +}
>
> This is somewhat similar to the relation between ReplicationSlotSave(serialize
> my backend's replications slot) and SaveSlotToPath(save the passed slot).
>
> > Another thing is we have already taken spin lock to access
> > last_confirmed_flush_lsn from CheckpointReplicationSlots, we could set dirty
> > flag here itself, else we will have to release the lock and call
> > ReplicationSlotMarkDirty which will take lock again.
>
> Yes, this is unavoidable, but maybe it's not a big problem as
> we only do it at shutdown.
>

True, but it still doesn't look elegant. I also thought about having an
inline function that marks both the just_dirtied and dirty fields.
However, that requires us to assert that the caller has already
acquired a spinlock. I see a macro SpinLockFree() that might help, but
it didn't seem to be used anywhere in the code, so I am not sure if we
can rely on it.
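
For reference, such a helper would look something like this (a sketch only;
note there is no reliable way to assert that the caller holds the spinlock):

	/* hypothetical; the caller must hold the slot's spinlock */
	static inline void
	ReplicationSlotMarkDirtyLocked(ReplicationSlot *slot)
	{
		slot->just_dirtied = true;
		slot->dirty = true;
	}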

> > Instead shall we set just_dirtied also in CheckpointReplicationSlots?
> > Thoughts?
>
> I agree we'd better set just_dirtied to true to ensure we will serialize slot
> info here, because if some other processes just serialized the slot, the dirty
> flag will be reset to false if we don't set just_dirtied to true in
> CheckpointReplicationSlots(), this race condition may not exists for now, but
> seems better to completely forbid it by setting just_dirtied.
>

Agreed, and it is better to close any such possibility because we
can't say with certainty about manual slots. This seems better than
the other ideas we discussed.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > All but one. Normally, the idea of marking dirty is to indicate that
> > > > we will actually write/flush the contents at a later point (except
> > > > when required for correctness) as even indicated in the comments atop
> > > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > > that protocol at all the current places including CreateSlotOnDisk().
> > > > So, we can probably do it here as well.
> > >
> > > yes
> > >
> >
> > I think we should also ensure that slots are not invalidated
> > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > because we don't allow decoding from such slots, so we shouldn't
> > include those.
>
> Added this check.
>
> Apart from this I have also fixed the following issues that were
> agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> instead of setting it in SaveSlotToPath b) The comments were moved
> from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
> will be run in autovacuum = off d) updating last_saved_confirmed_flush
> based on cp.slotdata.confirmed_flush rather than
> slot->data.confirmed_flush.
> I have also added the commitfest entry for this at [1].

The overall idea looks fine to me.

+
+ /*
+ * We won't ensure that the slot is persisted after the
+ * confirmed_flush LSN is updated as that could lead to frequent
+ * writes.  However, we need to ensure that we do persist the slots at
+ * the time of shutdown whose confirmed_flush LSN is changed since we
+ * last saved the slot to disk. This will help in avoiding retreat of
+ * the confirmed_flush LSN after restart.
+ */
+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
+ SpinLockRelease(&s->mutex);
+ }

The comments don't mention anything about why we are flushing only at
the shutdown checkpoint.  I mean, the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints, so isn't it
wise to flush during every checkpoint?  We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery,
but OTOH there is no reason for not doing so for other checkpoints; or
are we worried about concurrency with a parallel walsender running
during a non-shutdown checkpoint? If so, better we explain that as
well.  If it is already discussed in the thread and we have a
conclusion on this then maybe we can mention this in the comments.


/*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)

It seems we have removed two lines from the function header comments;
is this intentional or accidental?

Other than this patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Sep 5, 2023 at 9:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > All but one. Normally, the idea of marking dirty is to indicate that
> > > > > we will actually write/flush the contents at a later point (except
> > > > > when required for correctness) as even indicated in the comments atop
> > > > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > > > that protocol at all the current places including CreateSlotOnDisk().
> > > > > So, we can probably do it here as well.
> > > >
> > > > yes
> > > >
> > >
> > > I think we should also ensure that slots are not invalidated
> > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > because we don't allow decoding from such slots, so we shouldn't
> > > include those.
> >
> > Added this check.
> >
> > Apart from this I have also fixed the following issues that were
> > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > instead of setting it in SaveSlotToPath b) The comments were moved
> > from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
> > will be run in autovacuum = off d) updating last_saved_confirmed_flush
> > based on cp.slotdata.confirmed_flush rather than
> > slot->data.confirmed_flush.
> > I have also added the commitfest entry for this at [1].
>
> The overall idea looks fine to me
>
> +
> + /*
> + * We won't ensure that the slot is persisted after the
> + * confirmed_flush LSN is updated as that could lead to frequent
> + * writes.  However, we need to ensure that we do persist the slots at
> + * the time of shutdown whose confirmed_flush LSN is changed since we
> + * last saved the slot to disk. This will help in avoiding retreat of
> + * the confirmed_flush LSN after restart.
> + */
> + if (is_shutdown && SlotIsLogical(s))
> + {
> + SpinLockAcquire(&s->mutex);
> + if (s->data.invalidated == RS_INVAL_NONE &&
> + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> + s->dirty = true;
> + SpinLockRelease(&s->mutex);
> + }
>
> The comments don't mention anything about why we are just flushing at
> the shutdown checkpoint.  I mean the checkpoint is not that frequent
> and we already perform a lot of I/O during checkpoints so isn't it
> wise to flush during every checkpoint.  We may argue that there is no
> extra advantage of that as we are not optimizing for crash recovery
> but OTOH there is no reason for not doing so for other checkpoints or
> we are worried about the concurrency with parallel walsender running
> during non shutdown checkpoint if so then better we explain that as
> well?  If it is already discussed in the thread and we have a
> conclusion on this then maybe we can mention this in comments?
>

The point is that at the time of non-shutdown checkpoints, it is not
clear that there is an extra advantage, but we will definitely add
extra I/O for this, because at other times we will already be saving
the slot from time to time as the replication makes progress. And we
also need to flush such slots during shutdown for correctness in some
use cases like upgrades. We can probably add something like: "At other
times, the walsender keeps saving the slot from time to time as the
replication progresses, so there is no clear advantage of flushing
additional slots at the time of checkpoint". Will that work for you?
Having said that, I am not opposed to doing it for non-shutdown
checkpoints if one makes a separate case for it.
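
Putting both pieces together, the comment could read roughly as follows (a
draft combining the existing comment with the suggested addition):

	/*
	 * We won't ensure that the slot is persisted after the
	 * confirmed_flush LSN is updated as that could lead to frequent
	 * writes.  However, we need to ensure that we do persist the slots
	 * at the time of shutdown whose confirmed_flush LSN is changed
	 * since we last saved the slot to disk.  This will help in avoiding
	 * retreat of the confirmed_flush LSN after restart.  At other
	 * times, the walsender keeps saving the slot from time to time as
	 * the replication progresses, so there is no clear advantage of
	 * flushing additional slots at the time of checkpoint.
	 */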

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Tue, Sep 5, 2023 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 9:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > The comments don't mention anything about why we are just flushing at
> > the shutdown checkpoint.  I mean the checkpoint is not that frequent
> > and we already perform a lot of I/O during checkpoints so isn't it
> > wise to flush during every checkpoint.  We may argue that there is no
> > extra advantage of that as we are not optimizing for crash recovery
> > but OTOH there is no reason for not doing so for other checkpoints or
> > we are worried about the concurrency with parallel walsender running
> > during non shutdown checkpoint if so then better we explain that as
> > well?  If it is already discussed in the thread and we have a
> > conclusion on this then maybe we can mention this in comments?
> >
>
> The point is that at the time of non-shutdown checkpoints, it is not
> clear that there is an extra advantage but we will definitely add
> extra I/O for this. Because at other times, we will already be saving
> the slot from time to time as the replication makes progress. And, we
> also need to flush such slots during shutdown for correctness for some
> use cases like upgrades. We can probably add something like: "At other
> times, the walsender keeps saving the slot from time to time as the
> replication progresses, so there is no clear advantage of flushing
> additional slots at the time of checkpoint". Will that work for you?

Yeah, that comment will work. My only concern was that since we added
an explicit condition that it should be synced only during the shutdown
checkpoint, the comments should also explicitly explain the reason.
Anyway, I am fine either way, whether we sync at the shutdown
checkpoint or at all checkpoints.  The I/O for slot sync at
checkpoint time should not be a real worry, and with that, if we can
avoid additional code with extra conditions then it's better, because
such code branches will be frequently hit, and I think from a
testability POV we prefer to add code in the common path unless there is
some overhead or it is specifically meant for that branch only.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Tue, 5 Sept 2023 at 09:10, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > All but one. Normally, the idea of marking dirty is to indicate that
> > > > > we will actually write/flush the contents at a later point (except
> > > > > when required for correctness) as even indicated in the comments atop
> > > > > ReplicationSlotMarkDirty(). However, I see your point that we use
> > > > > that protocol at all the current places including CreateSlotOnDisk().
> > > > > So, we can probably do it here as well.
> > > >
> > > > yes
> > > >
> > >
> > > I think we should also ensure that slots are not invalidated
> > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > because we don't allow decoding from such slots, so we shouldn't
> > > include those.
> >
> > Added this check.
> >
> > Apart from this I have also fixed the following issues that were
> > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > instead of setting it in SaveSlotToPath b) The comments were moved
> > from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
> > will be run in autovacuum = off d) updating last_saved_confirmed_flush
> > based on cp.slotdata.confirmed_flush rather than
> > slot->data.confirmed_flush.
> > I have also added the commitfest entry for this at [1].
>
> The overall idea looks fine to me
>
> +
> + /*
> + * We won't ensure that the slot is persisted after the
> + * confirmed_flush LSN is updated as that could lead to frequent
> + * writes.  However, we need to ensure that we do persist the slots at
> + * the time of shutdown whose confirmed_flush LSN is changed since we
> + * last saved the slot to disk. This will help in avoiding retreat of
> + * the confirmed_flush LSN after restart.
> + */
> + if (is_shutdown && SlotIsLogical(s))
> + {
> + SpinLockAcquire(&s->mutex);
> + if (s->data.invalidated == RS_INVAL_NONE &&
> + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> + s->dirty = true;
> + SpinLockRelease(&s->mutex);
> + }
>
> The comments don't mention anything about why we are just flushing at
> the shutdown checkpoint.  I mean the checkpoint is not that frequent
> and we already perform a lot of I/O during checkpoints so isn't it
> wise to flush during every checkpoint.  We may argue that there is no
> extra advantage of that as we are not optimizing for crash recovery
> but OTOH there is no reason for not doing so for other checkpoints or
> we are worried about the concurrency with parallel walsender running
> during non shutdown checkpoint if so then better we explain that as
> well?  If it is already discussed in the thread and we have a
> conclusion on this then maybe we can mention this in comments?

I felt it is better to do this only during the shutdown checkpoint, as
in other cases it is being saved periodically as and when the
replication happens. Added comments for the same.

> /*
>   * Flush all replication slots to disk.
>   *
> - * This needn't actually be part of a checkpoint, but it's a convenient
> - * location.
> + * is_shutdown is true in case of a shutdown checkpoint.
>   */
>  void
> -CheckPointReplicationSlots(void)
> +CheckPointReplicationSlots(bool is_shutdown)
>
> It seems we have removed two lines from the function header comments,
> is this intentional or accidental?

Modified.
The updated v8 version patch has the changes for the same.

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > >
> > > > > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> > > > > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila
> > > <amit.kapila16@gmail.com> wrote:
> > > > > > > >
> > > > > > > > All but one. Normally, the idea of marking dirty is to
> > > > > > > > indicate that we will actually write/flush the contents at a
> > > > > > > > later point (except when required for correctness) as even
> > > > > > > > indicated in the comments atop ReplicationSlotMarkDirty().
> > > > > > > > However, I see your point that we use that protocol at all the current
> > > places including CreateSlotOnDisk().
> > > > > > > > So, we can probably do it here as well.
> > > > > > >
> > > > > > > yes
> > > > > > >
> > > > > >
> > > > > > I think we should also ensure that slots are not invalidated
> > > > > > (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> > > > > > because we don't allow decoding from such slots, so we shouldn't
> > > > > > include those.
> > > > >
> > > > > Added this check.
> > > > >
> > > > > Apart from this I have also fixed the following issues that were
> > > > > agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
> > > > > instead of setting it in SaveSlotToPath
> > > > >
> > > >
> > > > + if (is_shutdown && SlotIsLogical(s))
> > > > + {
> > > > + SpinLockAcquire(&s->mutex);
> > > > + if (s->data.invalidated == RS_INVAL_NONE &&
> > > > + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > > > + s->dirty = true;
> > > >
> > > > I think it is better to use ReplicationSlotMarkDirty() as that would
> > > > be consistent with all other usages.
> > >
> > > ReplicationSlotMarkDirty works only on MyReplicationSlot, whereas
> > > CheckpointReplicationSlots loops through all the slots and marks the
> > > appropriate slots as dirty; we might have to change
> > > ReplicationSlotMarkDirty to take the slot as an input parameter, and
> > > all callers should pass MyReplicationSlot.
> >
> > Personally, I feel if we want to centralize the code of marking dirty into a
> > function, we can introduce a new static function MarkSlotDirty(slot) to mark
> > passed slot dirty and let ReplicationSlotMarkDirty and
> > CheckpointReplicationSlots call it. Like:
> >
> > void
> > ReplicationSlotMarkDirty(void)
> > {
> >         MarkSlotDirty(MyReplicationSlot);
> > }
> >
> > +static void
> > +MarkSlotDirty(ReplicationSlot *slot)
> > +{
> > +       Assert(slot != NULL);
> > +
> > +       SpinLockAcquire(&slot->mutex);
> > +       slot->just_dirtied = true;
> > +       slot->dirty = true;
> > +       SpinLockRelease(&slot->mutex);
> > +}
> >
> > This is somewhat similar to the relation between ReplicationSlotSave(serialize
> > my backend's replications slot) and SaveSlotToPath(save the passed slot).
> >
> > > Another thing: we have already taken the spinlock to access
> > > last_saved_confirmed_flush in CheckpointReplicationSlots, so we could
> > > set the dirty flag here itself; otherwise, we will have to release
> > > the lock and call ReplicationSlotMarkDirty, which will take the lock
> > > again.
> >
> > Yes, this is unavoidable, but maybe it's not a big problem as
> > we only do it at shutdown.
> >
>
> True, but it still doesn't look elegant. I also thought about having a
> (probably inline) function that marks both the just_dirtied and dirty fields.
> However, that requires us to assert that the caller has already
> acquired a spinlock. I see a macro SpinLockFree() that might help but
> it didn't seem to be used anywhere in the code so not sure if we can
> rely on it.

Can't we just have code like this?  I mean we will have to make
ReplicationSlotMarkDirty take the slot as an argument, or have another
version that takes the slot as an argument and is called by us as
well as by ReplicationSlotMarkDirty().  I mean, why do we need these
checks (s->data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex?  The walsender is shut down, so the confirmed_flush LSN cannot
move concurrently, and the slot cannot be invalidated either, because
that is done by the checkpointer and we are in the checkpointer.

+ if (is_shutdown && SlotIsLogical(s))
+ {
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ {
+ ReplicationSlotMarkDirty(s);
+ }
+ }


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: persist logical slots to disk during shutdown checkpoint

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 5, 2023 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> 
> On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > > I think we should also ensure that slots are not invalidated
> > > > > > > (slot.data.invalidated != RS_INVAL_NONE) before marking them
> > > > > > > dirty because we don't allow decoding from such slots, so we
> > > > > > > shouldn't include those.
> > > > > >
> > > > > > Added this check.
> > > > > >
> > > > > > Apart from this I have also fixed the following issues that
> > > > > > were agreed on: a) Setting slots to dirty in
> > > > > > CheckPointReplicationSlots instead of setting it in
> > > > > > SaveSlotToPath
> > > > > >
> > > > >
> > > > > + if (is_shutdown && SlotIsLogical(s))
> > > > > + {
> > > > > + SpinLockAcquire(&s->mutex);
> > > > > + if (s->data.invalidated == RS_INVAL_NONE &&
> > > > > + s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > > > > + s->dirty = true;
> > > > >
> > > > > I think it is better to use ReplicationSlotMarkDirty() as that
> > > > > would be consistent with all other usages.
> > > >
> > > > ReplicationSlotMarkDirty works only on MyReplicationSlot, whereas
> > > > CheckpointReplicationSlots loops through all the slots and marks
> > > > the appropriate slots as dirty; we might have to change
> > > > ReplicationSlotMarkDirty to take the slot as an input parameter,
> > > > and all callers should pass MyReplicationSlot.
> > >
> > > Personally, I feel if we want to centralize the code of marking
> > > dirty into a function, we can introduce a new static function
> > > MarkSlotDirty(slot) to mark passed slot dirty and let
> > > ReplicationSlotMarkDirty and CheckpointReplicationSlots call it. Like:
> > >
> > > void
> > > ReplicationSlotMarkDirty(void)
> > > {
> > >         MarkSlotDirty(MyReplicationSlot);
> > > }
> > >
> > > +static void
> > > +MarkSlotDirty(ReplicationSlot *slot)
> > > +{
> > > +       Assert(slot != NULL);
> > > +
> > > +       SpinLockAcquire(&slot->mutex);
> > > +       slot->just_dirtied = true;
> > > +       slot->dirty = true;
> > > +       SpinLockRelease(&slot->mutex);
> > > +}
> > >
> > > This is somewhat similar to the relation between
> > > ReplicationSlotSave(serialize my backend's replications slot) and
> SaveSlotToPath(save the passed slot).
> > >
> > > > Another thing: we have already taken the spinlock to access
> > > > last_saved_confirmed_flush in CheckpointReplicationSlots, so we
> > > > could set the dirty flag here itself; otherwise, we will have to
> > > > release the lock and call ReplicationSlotMarkDirty, which will
> > > > take the lock again.
> > >
> > > Yes, this is unavoidable, but maybe it's not a big problem as we
> > > only do it at shutdown.
> > >
> >
> > True, but it still doesn't look elegant. I also thought about having a
> > (probably inline) function that marks both the just_dirtied and dirty fields.
> > However, that requires us to assert that the caller has already
> > acquired a spinlock. I see a macro SpinLockFree() that might help but
> > it didn't seem to be used anywhere in the code so not sure if we can
> > rely on it.
> 
> Can't we just have code like this?  I mean we will have to make
> ReplicationSlotMarkDirty take the slot as an argument, or have another
> version that takes the slot as an argument and is called by us as well
> as by ReplicationSlotMarkDirty().  I mean, why do we need these checks
> (s->data.invalidated == RS_INVAL_NONE &&
> s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> mutex?  The walsender is shut down, so the confirmed_flush LSN cannot
> move concurrently, and the slot cannot be invalidated either, because
> that is done by the checkpointer and we are in the checkpointer.

I agree with your analysis that the lock may be unnecessary for now and the
code will work, but I personally feel we'd better take the spinlock.

Firstly, considering our discussion on the potential extension of persisting
the slot for online checkpoints in the future, we anyway need the lock at that
time, so taking the lock here could avoid overlooking the need to update it
later. And the lock also won't cause any performance or concurrency issue.

Additionally, if we don't take the lock, we rely on the assumption that the
walsender will exit before the shutdown checkpoint; currently, that's true
for logical walsenders, but a physical walsender can exit later than the
checkpointer. So, I am slightly worried that if we change the logical
walsender's exit timing in the future, the assumption may not hold.

Besides, for non-built-in logical replication, if someone creates their own
walsender or other processes to send the changes and the process doesn't
exit before the shutdown checkpoint, it may also be a problem. Although I
don't have existing examples of such extensions, I feel taking the lock
would make it more robust.

Best Regards,
Hou zj

Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>

> > Can't we just have code like this?  I mean we will have to make
> > ReplicationSlotMarkDirty take the slot as an argument, or have another
> > version that takes the slot as an argument and is called by us as well
> > as by ReplicationSlotMarkDirty().  I mean, why do we need these checks
> > (s->data.invalidated == RS_INVAL_NONE &&
> > s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> > mutex?  The walsender is shut down, so the confirmed_flush LSN cannot
> > move concurrently, and the slot cannot be invalidated either, because
> > that is done by the checkpointer and we are in the checkpointer.
>
> I agree with your analysis that the lock may be unnecessary for now and the
> code will work, but I personally feel we'd better take the spinlock.
>
> Firstly, considering our discussion on the potential extension of persisting
> the slot for online checkpoints in the future, we anyway need the lock at that
> time, so taking the lock here could avoid overlooking the need to update it
> later. And the lock also won't cause any performance or concurrency issue.

If we think that we might plan to persist during online checkpoints as
well, then it is better to do it now, because this is not an extension
of the feature; instead, we are thinking that it is wise to just
persist during the shutdown checkpoint, and I think that's the
conclusion at this point. If that's the conclusion, then there is no
point in writing code on the assumption that we will change it in the
future.

> Additionally, if we don't take the lock, we rely on the assumption that the
> walsender will exit before the shutdown checkpoint; currently, that's true
> for logical walsenders, but a physical walsender can exit later than the
> checkpointer. So, I am slightly worried that if we change the logical
> walsender's exit timing in the future, the assumption may not hold.
>
> Besides, for non-built-in logical replication, if someone creates their own
> walsender or other processes to send the changes and the process doesn't
> exit before the shutdown checkpoint, it may also be a problem. Although I
> don't have existing examples of such extensions, I feel taking the lock
> would make it more robust.

I think all our logic is based on the walsender having already exited.
If not, then even if you check under the mutex that the confirmed_flush
LSN has not changed, it can change right after you release the lock;
then we will not be flushing the latest value of the confirmed_flush
LSN to disk, and our logic of comparing checkpoint.redo with the
confirmed_flush LSN might not work?
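Just to spell out the dependency, the upgrade-time check we are talking
about is roughly of this shape (purely illustrative sketch; the names
are invented and this is not the actual upgrade code):

/*
 * Hypothetical sketch: after a clean shutdown, every usable logical
 * slot must have confirmed everything up to the shutdown checkpoint.
 */
static void
check_slot_caught_up(const char *slot_name,
                     XLogRecPtr slot_confirmed_flush,
                     XLogRecPtr shutdown_checkpoint_lsn)
{
    if (slot_confirmed_flush != shutdown_checkpoint_lsn)
        pg_fatal("logical slot \"%s\" has not consumed all the WAL",
                 slot_name);
}

If we flush a stale confirmed_flush value at shutdown, this comparison
fails even though the downstream may have actually received everything.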

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
>
> > > Can't we just have code like this?  I mean we will have to make
> > > ReplicationSlotMarkDirty take the slot as an argument, or have another
> > > version that takes the slot as an argument and is called by us as well
> > > as by ReplicationSlotMarkDirty().  I mean, why do we need these checks
> > > (s->data.invalidated == RS_INVAL_NONE &&
> > > s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> > > mutex?  The walsender is shut down, so the confirmed_flush LSN cannot
> > > move concurrently, and the slot cannot be invalidated either, because
> > > that is done by the checkpointer and we are in the checkpointer.
> >
>
...
> > Additionally, if we don't take the lock, we rely on the assumption that the
> > walsender will exit before the shutdown checkpoint; currently, that's true
> > for logical walsenders, but a physical walsender can exit later than the
> > checkpointer. So, I am slightly worried that if we change the logical
> > walsender's exit timing in the future, the assumption may not hold.
> >
> > Besides, for non-built-in logical replication, if someone creates their own
> > walsender or other processes to send the changes and the process doesn't
> > exit before the shutdown checkpoint, it may also be a problem. Although I
> > don't have existing examples of such extensions, I feel taking the lock
> > would make it more robust.
>
> I think all our logic is based on the walsender having already exited.
> If not, then even if you check under the mutex that the confirmed_flush
> LSN has not changed, it can change right after you release the lock;
> then we will not be flushing the latest value of the confirmed_flush
> LSN to disk, and our logic of comparing checkpoint.redo with the
> confirmed_flush LSN might not work?
>

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock. We can do that but I feel we have to be careful for
all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.
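To be clear, that alternative would look roughly like this (an untested
sketch, where MarkSlotDirty is the assumed common helper discussed
above):

bool        needs_dirty;

SpinLockAcquire(&s->mutex);
needs_dirty = (s->data.invalidated == RS_INVAL_NONE &&
               s->data.confirmed_flush != s->last_saved_confirmed_flush);
SpinLockRelease(&s->mutex);

/* the helper would re-acquire the spinlock internally */
if (needs_dirty)
    MarkSlotDirty(s);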

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

>
> Right, it can change and in that case, the check related to
> confirm_flush LSN will fail during the upgrade. However, the point is
> that if we don't take spinlock, we need to properly write comments on
> why unlike in other places it is safe here to check these values
> without spinlock.

I agree with that, but it is also not true that we always read this
under the spinlock; for example, in [1][2] below we can see that we
read it without the spinlock.
[1]
StartLogicalReplication
{
/*
* Report the location after which we'll send out further commits as the
* current sentPtr.
*/
sentPtr = MyReplicationSlot->data.confirmed_flush;
}
[2]
LogicalIncreaseRestartDecodingForSlot
{
/* candidates are already valid with the current flush position, apply */
if (updated_lsn)
LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}

> We can do that but I feel we have to be careful for
> all future usages of these variables, so, having spinlock makes them
> follow the normal coding pattern which I feel makes it more robust.
> Yes, marking dirty via common function also has merits but personally,
> I find it better to follow the normal coding practice of checking the
> required fields under spinlock. The other possibility is to first
> check if we need to mark the slot dirty under spinlock, then release
> the spinlock, and then call the common MarkDirty function, but again
> that will be under the assumption that these flags won't change.

That's true, but we are already making that assumption, because we
currently take the spinlock and decide to mark the slot dirty. After
that, we release the spinlock, and if we have no guarantee that it
cannot change concurrently, then many things can go wrong, no?

Anyway, having said that, I do not have any strong objection to what we
are doing now.  There was discussion around making the code use a
common function, and I was suggesting how that could be achieved, but I
am not against the current way either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Wed, Sep 6, 2023 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > Right, it can change and in that case, the check related to
> > confirm_flush LSN will fail during the upgrade. However, the point is
> > that if we don't take spinlock, we need to properly write comments on
> > why unlike in other places it is safe here to check these values
> > without spinlock.
>
> I agree with that, but it is also not true that we always read this
> under the spinlock; for example, in [1][2] below we can see that we
> read it without the spinlock.
> [1]
> StartLogicalReplication
> {
> /*
> * Report the location after which we'll send out further commits as the
> * current sentPtr.
> */
> sentPtr = MyReplicationSlot->data.confirmed_flush;
> }
> [2]
> LogicalIncreaseRestartDecodingForSlot
> {
> /* candidates are already valid with the current flush position, apply */
> if (updated_lsn)
> LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
> }
>

These are accessed only in walsender and confirmed_flush is always
updated by walsender. So, this is clearly okay.

> > We can do that but I feel we have to be careful for
> > all future usages of these variables, so, having spinlock makes them
> > follow the normal coding pattern which I feel makes it more robust.
> > Yes, marking dirty via common function also has merits but personally,
> > I find it better to follow the normal coding practice of checking the
> > required fields under spinlock. The other possibility is to first
> > check if we need to mark the slot dirty under spinlock, then release
> > the spinlock, and then call the common MarkDirty function, but again
> > that will be under the assumption that these flags won't change.
>
> That's true, but we are already making that assumption, because we
> currently take the spinlock and decide to mark the slot dirty. After
> that, we release the spinlock, and if we have no guarantee that it
> cannot change concurrently, then many things can go wrong, no?
>

Also, note that the invalidated field could be updated by the startup
process, but that is only possible on a standby, so it is safe; again,
though, that would be another assumption.

> Anyway, having said that, I do not have any strong objection to what we
> are doing now.  There was discussion around making the code use a
> common function, and I was suggesting how that could be achieved, but I
> am not against the current way either.
>

Okay, thanks for looking into it.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Dilip Kumar
Date:
On Wed, Sep 6, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 6, 2023 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > >
> > > Right, it can change and in that case, the check related to
> > > confirm_flush LSN will fail during the upgrade. However, the point is
> > > that if we don't take spinlock, we need to properly write comments on
> > > why unlike in other places it is safe here to check these values
> > > without spinlock.
> >
> > I agree with that, but it is also not true that we always read this
> > under the spinlock; for example, in [1][2] below we can see that we
> > read it without the spinlock.
> > [1]
> > StartLogicalReplication
> > {
> > /*
> > * Report the location after which we'll send out further commits as the
> > * current sentPtr.
> > */
> > sentPtr = MyReplicationSlot->data.confirmed_flush;
> > }
> > [2]
> > LogicalIncreaseRestartDecodingForSlot
> > {
> > /* candidates are already valid with the current flush position, apply */
> > if (updated_lsn)
> > LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
> > }
> >
>
> These are accessed only in walsender and confirmed_flush is always
> updated by walsender. So, this is clearly okay.

Hmm, that's a valid point.

> > > We can do that but I feel we have to be careful for
> > > all future usages of these variables, so, having spinlock makes them
> > > follow the normal coding pattern which I feel makes it more robust.
> > > Yes, marking dirty via common function also has merits but personally,
> > > I find it better to follow the normal coding practice of checking the
> > > required fields under spinlock. The other possibility is to first
> > > check if we need to mark the slot dirty under spinlock, then release
> > > the spinlock, and then call the common MarkDirty function, but again
> > > that will be under the assumption that these flags won't change.
> >
> > That's true, but we are already making that assumption, because we
> > currently take the spinlock and decide to mark the slot dirty. After
> > that, we release the spinlock, and if we have no guarantee that it
> > cannot change concurrently, then many things can go wrong, no?
> >
>
> Also, note that the invalidated field could be updated by the startup
> process, but that is only possible on a standby, so it is safe; again,
> though, that would be another assumption.

Okay, so I also agree to go with the current patch, because, as you
said above, if we access this without a spinlock outside the walsender,
then we would be making a new exception, and I agree with the decision
not to make a new exception.

Other than that the patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: persist logical slots to disk during shutdown checkpoint

From
vignesh C
Date:
On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Other than that the patch LGTM.

pgindent reported that the newly added comments need to be re-adjusted;
here is an updated patch for the same.
I also verified the following: a) the patch applies neatly on HEAD, b)
make check-world passes, c) pgindent looks good, d) pgperltidy was
fine, and e) meson test runs were successful. Also, I checked that the
CFBot run was fine for the last patch.

Regards,
Vignesh

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 10:13 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Other than that the patch LGTM.
>
> pgindent reported that the newly added comments need to be re-adjusted;
> here is an updated patch for the same.
>

Thanks, the patch looks good to me as well.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> Thanks, the patch looks good to me as well.

+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;

This does not feel sufficient, as the comment explaining what this
variable does uses the same terms as the variable name (aka it is the
last save of the confirmed_lsn).  Why is it here and why is it useful?
In which context and/or code paths is it used?  Okay, there are some
explanations when saving a slot, restoring a slot or when a checkpoint
processes slots, but it seems important to me to document more things
in ReplicationSlot where this is defined.

(Just passing by, I have not checked the patch logic in detail but
that looks under-documented to me.)
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> > Thanks, the patch looks good to me as well.
>
> +   /* This is used to track the last saved confirmed_flush LSN value */
> +   XLogRecPtr  last_saved_confirmed_flush;
>
> This does not feel sufficient, as the comment explaining what this
> variable does uses the same terms as the variable name (aka it is the
> last save of the confirmed_lsn).  Why is it here and why is it useful?
> In which context and/or code paths is it used?  Okay, there are some
> explanations when saving a slot, restoring a slot or when a checkpoint
> processes slots, but it seems important to me to document more things
> in ReplicationSlot where this is defined.
>

Hmm, this is quite debatable as different people feel differently
about this. The patch author kept it where it is now but in one of my
revisions, I rewrote and added it in the ReplicationSlot. Then
Ashutosh argued that it is better to keep it near where we are saving
the slot (aka where the patch has it) [1]. Anyway, as I also preferred
the core part of the theory about this variable to be in
ReplicationSlot, I'll move it there before commit unless someone
argues against it or has any other comments.

[1] - https://www.postgresql.org/message-id/CAExHW5uXq_CU80XJtmWbPJinRjJx54kbQJ9DT%3DUFySKXpjVwrw%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Sep 7, 2023 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> > > Thanks, the patch looks good to me as well.
> >
> > +   /* This is used to track the last saved confirmed_flush LSN value */
> > +   XLogRecPtr  last_saved_confirmed_flush;
> >
> > This does not feel sufficient, as the comment explaining what this
> > variable does uses the same terms as the variable name (aka it is the
> > last save of the confirmed_lsn).  Why is it here and why is it useful?
> > In which context and/or code paths is it used?  Okay, there are some
> > explanations when saving a slot, restoring a slot or when a checkpoint
> > processes slots, but it seems important to me to document more things
> > in ReplicationSlot where this is defined.
> >
>
> Hmm, this is quite debatable as different people feel differently
> about this. The patch author kept it where it is now but in one of my
> revisions, I rewrote and added it in the ReplicationSlot. Then
> Ashutosh argued that it is better to keep it near where we are saving
> the slot (aka where the patch has it) [1]. Anyway, as I also preferred
> the core part of the theory about this variable to be in
> ReplicationSlot, I'll move it there before commit unless someone
> argues against it or has any other comments.

If we want it to be in ReplicationSlot, I suggest we just say: "saves
the last confirmed flush LSN to cheaply detect any divergence between
the in-memory and on-disk confirmed flush LSN".

When to detect that divergence and what to do when there is divergence
should be documented at the relevant places in the code. In the future,
if we expand when and how we use this variable, the comment in
ReplicationSlot will be insufficient.

We follow this commenting style in several places, e.g.:
/* any outstanding modifications? */
bool just_dirtied;
bool dirty;

How and when these variables are used is commented upon in the relevant code.

  * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.

Relying on the first sentence, if we decide to not persist the
replication slot at the time of checkpoint, would that be OK? It
doesn't look like a convenience thing to me any more.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
>   * This needn't actually be part of a checkpoint, but it's a convenient
> - * location.
> + * location. is_shutdown is true in case of a shutdown checkpoint.
>
> Relying on the first sentence, if we decide to not persist the
> replication slot at the time of checkpoint, would that be OK? It
> doesn't look like a convenience thing to me any more.
>

Instead of removing that comment, how about something like this: "This
needn't actually be part of a checkpoint except for shutdown
checkpoint, but it's a convenient location."?

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Sep 7, 2023 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> >   * This needn't actually be part of a checkpoint, but it's a convenient
> > - * location.
> > + * location. is_shutdown is true in case of a shutdown checkpoint.
> >
> > Relying on the first sentence, if we decide to not persist the
> > replication slot at the time of checkpoint, would that be OK? It
> > doesn't look like a convenience thing to me any more.
> >
>
> Instead of removing that comment, how about something like this: "This
> needn't actually be part of a checkpoint except for shutdown
> checkpoint, but it's a convenient location."?
>

I find the wording a bit awkward. My version would be "Checkpoint is a
convenient location to persist all the slots. But in a shutdown
checkpoint, indicated by is_shutdown = true, we also update
confirmed_flush." But please feel free to choose whichever version you
are comfortable with.

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 4:30 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > >   * This needn't actually be part of a checkpoint, but it's a convenient
> > > - * location.
> > > + * location. is_shutdown is true in case of a shutdown checkpoint.
> > >
> > > Relying on the first sentence, if we decide to not persist the
> > > replication slot at the time of checkpoint, would that be OK? It
> > > doesn't look like a convenience thing to me any more.
> > >
> >
> > Instead of removing that comment, how about something like this: "This
> > needn't actually be part of a checkpoint except for shutdown
> > checkpoint, but it's a convenient location."?
> >
>
> I find the wording a bit awkward. My version would be "Checkpoint is a
> convenient location to persist all the slots. But in a shutdown
> checkpoint, indicated by is_shutdown = true, we also update
> confirmed_flush." But please feel free to choose whichever version you
> are comfortable with.
>

I think saying that we also update confirmed_flush is unclear.
So, I tried another version by changing the entire comment to:
"Normally, we can flush dirty replication slots at regular intervals
by any background process like bgwriter but checkpoint is a convenient
location to persist. Additionally, in case of a shutdown checkpoint,
we also identify the slots for which confirmed_flush has been updated
since the last time it was persisted and flush them."

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> > > > Thanks, the patch looks good to me as well.
> > >
> > > +   /* This is used to track the last saved confirmed_flush LSN value */
> > > +   XLogRecPtr  last_saved_confirmed_flush;
> > >
> > > This does not feel sufficient, as the comment explaining what this
> > > variable does uses the same terms as the variable name (aka it is the
> > > last save of the confirmed_lsn).  Why is it here and why is it useful?
> > > In which context and/or code paths is it used?  Okay, there are some
> > > explanations when saving a slot, restoring a slot or when a checkpoint
> > > processes slots, but it seems important to me to document more things
> > > in ReplicationSlot where this is defined.
> > >
> >
> > Hmm, this is quite debatable as different people feel differently
> > about this. The patch author kept it where it is now but in one of my
> > revisions, I rewrote and added it in the ReplicationSlot. Then
> > Ashutosh argued that it is better to keep it near where we are saving
> > the slot (aka where the patch has it) [1]. Anyway, as I also preferred
> > the core part of the theory about this variable to be in
> > ReplicationSlot, I'll move it there before commit unless someone
> > argues against it or has any other comments.
>
> If we want it to be in ReplicationSlot, I suggest we just say: "saves
> the last confirmed flush LSN to cheaply detect any divergence between
> the in-memory and on-disk confirmed flush LSN".
>

I have added something on these lines and also changed the other
comment pointed out by you. In passing, I made minor cosmetic
changes as well.

--
With Regards,
Amit Kapila.

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 09:04:43AM +0530, Amit Kapila wrote:
> I have added something on these lines and also changed the other
> comment pointed out by you. In passing, I made minor cosmetic
> changes as well.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I don't see a need to refer to the bgwriter here.  On the contrary, it
can be confusing as one could think that this flush happens in the
bgwriter, but that's not the case currently as only the checkpointer
does that.

+        * We won't ensure that the slot is persisted after the

s/persisted/flushed/?  Or just refer to the "slot's data being
flushed", or refer to "the slot's data is made durable" instead?  The
use of "persist" here is confusing, because a slot's persistence
refers to it as being a *candidate* for flush (compared to an
ephemeral slot), and it does not refer to the *fact* of flushing its
data to make sure that it survives a crash.  In the context of this
patch, the LSN value tracked in the slot's in-memory data refers to
the last point where the slot's data has been flushed.

+    /*
+     * This is used to track the last persisted confirmed_flush LSN value to
+     * detect any divergence in the in-memory and on-disk values for the same.
+     */

"This value tracks is the last LSN where this slot's data has been
flushed to disk.  This is used during a checkpoint shutdown to decide
if a logical slot's data should be forcibly flushed or not."

Hmm.  WAL senders are shut down *after* the checkpointer and *after*
the shutdown checkpoint is finished (see PM_SHUTDOWN and
PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
checkpoint record before shutting down the primary.  In order to limit
the number of records to work on after a restart, what this patch is
proposing is an improvement.  Perhaps it would be better to document
that we don't care about the potential concurrent activity of logical
WAL senders in this case and that the LSN we are saving is a best
effort, meaning that last_saved_confirmed_flush is just here to reduce
the damage on a follow-up restart?  The comment added in
CheckPointReplicationSlots() goes in this direction, but perhaps this
potential concurrent activity should be mentioned?
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
>
> +    /*
> +     * This is used to track the last persisted confirmed_flush LSN value to
> +     * detect any divergence in the in-memory and on-disk values for the same.
> +     */
>
> "This value tracks is the last LSN where this slot's data has been
> flushed to disk.
>

This makes the comment vague as this sounds like we are saving a slot
corresponding to some LSN which is not the case. If you prefer this
tone then we can instead say: "This value tracks the last
confirmed_flush LSN flushed which is used during a checkpoint shutdown
to decide if a logical slot's data should be forcibly flushed or not."

>
> This is used during a checkpoint shutdown to decide
> if a logical slot's data should be forcibly flushed or not."
>
> Hmm.  WAL senders are shut down *after* the checkpointer and *after*
> the shutdown checkpoint is finished (see PM_SHUTDOWN and
> PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
> checkpoint record before shutting down the primary.
>

As per my understanding, this is not true for logical walsenders. As
per the code, in HandleCheckpointerInterrupts(), we call ShutdownXLOG(),
which sends a signal to the walsender to stop and waits for it to stop;
only after that does it write the shutdown checkpoint WAL record.
After getting the InitStopping signal, the walsender sets the
got_STOPPING flag. Then the *logical* walsender ensures that it sends
all the pending WAL and exits. What you have quoted is probably true
for physical walsenders.
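To make that ordering concrete, the flow is roughly like this (a
simplified sketch from memory of the code, not a verbatim excerpt):

/*
 * Shutdown sequence in the checkpointer (simplified):
 *
 * HandleCheckpointerInterrupts()
 *   -> ShutdownXLOG()
 *        -> WalSndInitStopping()      -- signal walsenders to stop
 *        -> WalSndWaitStopping()      -- logical walsenders send all
 *                                        pending WAL and exit here
 *        -> CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | ...)
 */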

>
>  In order to limit
> the number of records to work on after a restart, what this patch is
> proposing is an improvement.  Perhaps it would be better to document
> that we don't care about the potential concurrent activity of logical
> WAL senders in this case and that the LSN we are saving is a best
> effort, meaning that last_saved_confirmed_flush is just here to reduce
> the damage on a follow-up restart?
>

Unless I am wrong, there shouldn't be any concurrent activity for
logical walsenders. IIRC, it is a mandatory requirement for logical
walsenders to stop before the shutdown checkpoint to avoid a panic
error. We handle logical walsenders differently because they can
generate WAL during decoding.

>
>  The comment added in
> CheckPointReplicationSlots() goes in this direction, but perhaps this
> potential concurrent activity should be mentioned?
>

Sure, we can change it if required.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:
> On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>>
>>
>> +    /*
>> +     * This is used to track the last persisted confirmed_flush LSN value to
>> +     * detect any divergence in the in-memory and on-disk values for the same.
>> +     */
>>
>> "This value tracks is the last LSN where this slot's data has been
>> flushed to disk.
>>
>
> This makes the comment vague as this sounds like we are saving a slot
> corresponding to some LSN which is not the case. If you prefer this
> tone then we can instead say: "This value tracks the last
> confirmed_flush LSN flushed which is used during a checkpoint shutdown
> to decide if a logical slot's data should be forcibly flushed or not."

Okay, that looks like an improvement over the term "persisted".

>> This is used during a checkpoint shutdown to decide
>> if a logical slot's data should be forcibly flushed or not."
>>
>> Hmm.  WAL senders are shut down *after* the checkpointer and *after*
>> the shutdown checkpoint is finished (see PM_SHUTDOWN and
>> PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
>> checkpoint record before shutting down the primary.
>>
>
> As per my understanding, this is not true for logical walsenders. As
> per the code, in HandleCheckpointerInterrupts(), we call ShutdownXLOG(),
> which sends a signal to the walsender to stop and waits for it to stop;
> only after that does it write the shutdown checkpoint WAL record.
> After getting the InitStopping signal, the walsender sets the
> got_STOPPING flag. Then the *logical* walsender ensures that it sends
> all the pending WAL and exits. What you have quoted is probably true
> for physical walsenders.

Hm, reminding me about this area..  This roots down to the handling of
WalSndCaughtUp in the send_data callback for logical or physical.
This is switched to true for logical WAL senders much earlier than
physical WAL senders, aka before the shutdown checkpoint begins in the
latter.  What was itching me a bit is that the postmaster logic could
be made more solid.  Logical and physical WAL senders are both marked
as BACKEND_TYPE_WALSND, but we don't actually check that the WAL
senders remaining at the end of PM_SHUTDOWN_2 are *not* connected to a
database.  This would require a new BACKEND_TYPE_* perhaps, or perhaps
we're fine with the current state because we'll catch up problems in
the checkpointer if any WAL is generated while the shutdown checkpoint
is running anyway.  Just something I got in mind, unrelated to this
patch.

>>  In order to limit
>> the number of records to work on after a restart, what this patch is
>> proposing is an improvement.  Perhaps it would be better to document
>> that we don't care about the potential concurrent activity of logical
>> WAL senders in this case and that the LSN we are saving is a best
>> effort, meaning that last_saved_confirmed_flush is just here to reduce
>> the damage on a follow-up restart?
>
> Unless I am wrong, there shouldn't be any concurrent activity for
> logical walsenders. IIRC, it is a mandatory requirement for logical
> walsenders to stop before the shutdown checkpoint to avoid a panic
> error. We handle logical walsenders differently because they can
> generate WAL during decoding.

Yeah.  See above.

>>  The comment added in
>> CheckPointReplicationSlots() goes in this direction, but perhaps this
>> potential concurrent activity should be mentioned?
>
> Sure, we can change it if required.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I still don't quite see a need to mention the bgwriter at all here..
That's just unrelated.

The comment block in CheckPointReplicationSlots() from v10 uses
"persist", but you mean "flush", I guess..
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Mon, Sep 11, 2023 at 12:08 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:
> > On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> >>
> >>
> >> +    /*
> >> +     * This is used to track the last persisted confirmed_flush LSN value to
> >> +     * detect any divergence in the in-memory and on-disk values for the same.
> >> +     */
> >>
> >> "This value tracks is the last LSN where this slot's data has been
> >> flushed to disk.
> >>
> >
> > This makes the comment vague as this sounds like we are saving a slot
> > corresponding to some LSN which is not the case. If you prefer this
> > tone then we can instead say: "This value tracks the last
> > confirmed_flush LSN flushed which is used during a checkpoint shutdown
> > to decide if a logical slot's data should be forcibly flushed or not."
>
> Okay, that looks like an improvement over the term "persisted".
>

Changed accordingly.

> >> This is used during a checkpoint shutdown to decide
> >> if a logical slot's data should be forcibly flushed or not."
> >>
> >> Hmm.  WAL senders are shut down *after* the checkpointer and *after*
> >> the shutdown checkpoint is finished (see PM_SHUTDOWN and
> >> PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
> >> checkpoint record before shutting down the primary.
> >>
> >
> > As per my understanding, this is not true for logical walsenders. As
> > per the code, in HandleCheckpointerInterrupts(), we call ShutdownXLOG(),
> > which sends a signal to the walsender to stop and waits for it to stop;
> > only after that does it write the shutdown checkpoint WAL record.
> > After getting the InitStopping signal, the walsender sets the
> > got_STOPPING flag. Then the *logical* walsender ensures that it sends
> > all the pending WAL and exits. What you have quoted is probably true
> > for physical walsenders.
>
> Hm, reminding me about this area..  This roots down to the handling of
> WalSndCaughtUp in the send_data callback for logical or physical.
> This is switched to true for logical WAL senders much earlier than
> physical WAL senders, aka before the shutdown checkpoint begins in the
> latter.  What was itching me a bit is that the postmaster logic could
> be made more solid.  Logical and physical WAL senders are both marked
> as BACKEND_TYPE_WALSND, but we don't actually check that the WAL
> senders remaining at the end of PM_SHUTDOWN_2 are *not* connected to a
> database.  This would require a new BACKEND_TYPE_* perhaps, or perhaps
> we're fine with the current state because we'll catch up problems in
> the checkpointer if any WAL is generated while the shutdown checkpoint
> is running anyway.  Just something I got in mind, unrelated to this
> patch.
>

I don't know if the difference is worth inventing a new BACKEND_TYPE_*
but if you think so then we can probably discuss this in a new thread.
I think we may want to improve some comments as a separate patch to
make this evident.

>
> + * We can flush dirty replication slots at regular intervals by any
> + * background process like bgwriter but checkpoint is a convenient location.
>
> I still don't quite see a need to mention the bgwriter at all here..
> That's just unrelated.
>

I don't disagree with it, so changed it in the attached patch.

> The comment block in CheckPointReplicationSlots() from v10 uses
> "persist", but you mean "flush", I guess..
>

This point is not very clear to me. Can you please quote the exact
comment if you think something needs to be changed?

--
With Regards,
Amit Kapila.

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Mon, Sep 11, 2023 at 02:49:49PM +0530, Amit Kapila wrote:
> I don't know if the difference is worth inventing a new BACKEND_TYPE_*
> but if you think so then we can probably discuss this in a new thread.
> I think we may want to improve some comments as a separate patch to
> make this evident.

The comments in postmaster.c could be improved, at least.  There is no
need to discuss that here.

> This point is not very clear to me. Can you please quote the exact
> comment if you think something needs to be changed?

Hmm.  Don't think that's it yet..

Please see the attached v11, which rewords all the places in the patch
that need clarification IMO.  I've found the comment additions
in CheckPointReplicationSlots() to be overcomplicated:
- The key point of forcing a flush of a slot if its confirmed_lsn has
moved ahead of the last LSN where it was saved is to make the follow-up
restart more responsive.
- Not sure that there is any point to mention the other code paths in
the tree where ReplicationSlotSave() can be called, and a slot can be
saved in other processes than just WAL senders (like slot
manipulations in normal backends, for one).  This was the last
sentence in v10.
- Persist is incorrect in this context in the tests, slot.c and
slot.h, as it should refer to the slot's data being flushed, saved or
just "made durable" because this is what the new last saved LSN is
here for.  Persistence is a slot property, and does not refer to the
fact of flushing the data IMO.

+           if (s->data.invalidated == RS_INVAL_NONE &&
+               s->data.confirmed_flush != s->last_saved_confirmed_flush)

Actually this is incorrect, no?  Shouldn't we make sure that the
confirmed_flush is strictly higher than the last saved LSN?
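In other words, something like this, I would assume (sketch):

    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush > s->last_saved_confirmed_flush)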
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Tue, Sep 12, 2023 at 10:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 11, 2023 at 02:49:49PM +0530, Amit Kapila wrote:
>
> Please see the attached v11, which rewords all the places in the patch
> that need clarification IMO.  I've found the comment additions
> in CheckPointReplicationSlots() to be overcomplicated:
> - The key point of forcing a flush of a slot if its confirmed_lsn has
> moved ahead of the last LSN where it was saved is to make the
> follow-up restart more responsive.
>

I don't think it will become more responsive in any way; I am not sure
what made you think that. The key idea is that after restart we want
to ensure that all the WAL data up to the shutdown checkpoint record
is sent to the downstream. As mentioned in the commit message, this
will help in ensuring that upgrades don't miss any data, and there is
also another small advantage.

> - Not sure that there is any point to mention the other code paths in
> the tree where ReplicationSlotSave() can be called, and a slot can be
> saved in other processes than just WAL senders (like slot
> manipulations in normal backends, for one).  This was the last
> sentence in v10.
>

The point was the earlier sentence is no longer true and keeping it as
it is could be wrong or at least misleading. For example, earlier it
was okay to say, "This needn't actually be part of a checkpoint, ..."
but now that is no longer true as we want to invoke this at the time
of shutdown checkpoint for correctness. If we want to be precise, we
can say, "It is convenient to flush dirty replication slots at the
time of checkpoint. Additionally, .."

>
> +           if (s->data.invalidated == RS_INVAL_NONE &&
> +               s->data.confirmed_flush != s->last_saved_confirmed_flush)
>
> Actually this is incorrect, no?  Shouldn't we make sure that the
> confirmed_flush is strictly higher than the last saved LSN?
>

I can't see why it is incorrect. Do you see how (in what scenario) it
could go wrong? As per my understanding, the confirmed_flush LSN will
always be greater than or equal to last_saved_confirmed_flush, but we
don't want to ensure that point here because we just want that, if the
latest value is not the same, we mark the slot dirty and flush it, as
that will be the location we have ensured to update before walsender
shutdown. I think it is better to add an assert if you are worried
about any such case; we had thought of adding it as well but then
didn't do it because we don't have matching asserts to ensure that we
never assign a prior LSN value to the confirmed_flush LSN.
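For example, the shutdown-path check could carry the assert like this
(an untested sketch of what I have in mind, using the fields discussed
above):

if (is_shutdown && SlotIsLogical(s))
{
    SpinLockAcquire(&s->mutex);

    /* confirmed_flush should never retreat below the last saved value */
    Assert(s->data.confirmed_flush >= s->last_saved_confirmed_flush);

    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush != s->last_saved_confirmed_flush)
    {
        s->just_dirtied = true;
        s->dirty = true;
    }
    SpinLockRelease(&s->mutex);
}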

+ /*
+ * LSN used to track the last confirmed_flush LSN where the slot's data
+ * has been flushed to disk.
+ */
+ XLogRecPtr last_saved_confirmed_flush;

I don't want to argue about such a point because it is a little bit a
matter of personal choice, but I find this comment unclear. It reads as
if the confirmed_flush LSN is some LSN position which is where we
flushed the slot's data, and that is not true. I found the last comment
in the patch sent by me: "This value tracks the last confirmed_flush
LSN flushed which is used during a shutdown checkpoint to decide if
a logical slot's data should be forcibly flushed or not." which I feel
we agreed upon is better.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> I don't think it will become more responsive in any way; I am not sure
> what made you think that. The key idea is that after restart we want
> to ensure that all the WAL data up to the shutdown checkpoint record
> is sent to the downstream. As mentioned in the commit message, this
> will help in ensuring that upgrades don't miss any data, and there is
> also another small advantage.

Good thing I did not use the term "responsive" in the previous patch I
posted.  My apologies if you found that confusing.  Let's say, "to
prevent unnecessary retreat", then ;)

>> - Not sure that there is any point to mention the other code paths in
>> the tree where ReplicationSlotSave() can be called, and a slot can be
>> saved in other processes than just WAL senders (like slot
>> manipulations in normal backends, for one).  This was the last
>> sentence in v10.
>
> The point was the earlier sentence is no longer true and keeping it as
> it is could be wrong or at least misleading. For example, earlier it
> was okay to say, "This needn't actually be part of a checkpoint, ..."
> but now that is no longer true as we want to invoke this at the time
> of shutdown checkpoint for correctness. If we want to be precise, we

How so?  This is just a reference to the fact that using a
checkpoint path for this stuff is useful.  A shutdown checkpoint is
still a checkpoint, done by the checkpointer.  The background writer
is not concerned by that.

> can say, "It is convenient to flush dirty replication slots at the
> time of checkpoint. Additionally, .."

Okay by me to reword the top comment of CheckPointReplicationSlots()
to use these terms, if you feel strongly about it.

>> +           if (s->data.invalidated == RS_INVAL_NONE &&
>> +               s->data.confirmed_flush != s->last_saved_confirmed_flush)
>>
>> Actually this is incorrect, no?  Shouldn't we make sure that the
>> confirmed_flush is strictly higher than the last saved LSN?
>
> I can't see why it is incorrect. Do you see how (in what scenario) it
> could go wrong? As per my understanding, the confirmed_flush LSN will
> always be greater than or equal to last_saved_confirmed_flush, but we
> don't want to ensure that point here because we just want that, if the
> latest value is not the same, we mark the slot dirty and flush it, as
> that will be the location we have ensured to update before walsender
> shutdown. I think it is better to add an assert if you are worried
> about any such case; we had thought of adding it as well but then
> didn't do it because we don't have matching asserts to ensure that we
> never assign a prior LSN value to the confirmed_flush LSN.

Because that's just safer in the long run, and I don't see why we
cannot just do that?  Imagine, for instance, that a bug doing an
incorrect manipulation of a logical slot's data does an incorrect
computation of this field, and that we finish with in-memory data
that's older than what was previously saved.  The code may cause a
flush at an incorrect, past, position.  That's just an assumption from
my side, of course.

> I don't want to argue about such a point because it is a little bit a
> matter of personal choice, but I find this comment unclear. It reads as
> if the confirmed_flush LSN is some LSN position which is where we
> flushed the slot's data, and that is not true. I found the last comment
> in the patch sent by me: "This value tracks the last confirmed_flush
> LSN flushed which is used during a shutdown checkpoint to decide if
> a logical slot's data should be forcibly flushed or not." which I feel
> we agreed upon is better.

Okay, fine by me here.
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Wed, Sep 13, 2023 at 10:57 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> >>
> >> - Not sure that there is any point to mention the other code paths in
> >> the tree where ReplicationSlotSave() can be called, and a slot can be
> >> saved in other processes than just WAL senders (like slot
> >> manipulations in normal backends, for one).  This was the last
> >> sentence in v10.
> >
> > The point was the earlier sentence is no longer true and keeping it as
> > it is could be wrong or at least misleading. For example, earlier it
> > was okay to say, "This needn't actually be part of a checkpoint, ..."
> > but now that is no longer true as we want to invoke this at the time
> > of shutdown checkpoint for correctness. If we want to be precise, we
>
> How so?
>

Consider if we move this call to the bgwriter (i.e., flushing slots
is no longer part of a checkpoint): would that be okay? Previously, I
think it would have been, but not now. I see an argument to keep that
comment as it is as well, because we have already mentioned the
special shutdown checkpoint case. By the way, I have changed this
because Ashutosh felt it is no longer correct to keep the first
sentence as it is. See his email [1] (Relying on the first
sentence, ...). It has happened previously as well that different
reviewers working in this area have different views on this sort of
thing. I am trying my best to address the review comments, especially
from experienced hackers, but personally I feel this is a minor
nitpick and isn't worth too much argument; either way should be okay.

>
> >> +           if (s->data.invalidated == RS_INVAL_NONE &&
> >> +               s->data.confirmed_flush != s->last_saved_confirmed_flush)
> >>
> >> Actually this is incorrect, no?  Shouldn't we make sure that the
> >> confirmed_flush is strictly higher than the last saved LSN?
> >
> > I can't see why it is incorrect. Do you see how (in what scenario) it
> > could go wrong? As per my understanding, confirmed_flush LSN will
> > always be greater than or equal to last_saved_confirmed_flush, but we
> > don't want to enforce that point here: we just want that if the
> > latest value is not the same, we mark the slot dirty and flush it,
> > as that is the location we have ensured is updated before walsender
> > shutdown. I think it is better to add an assert if you are worried
> > about any such case; we had thought of adding it as well but then
> > didn't, because we don't have matching asserts ensuring that we never
> > assign a prior LSN value to confirmed_flush LSN.
>
> Because that's just safer in the long run, and I don't see why we
> cannot just do that.  Imagine, for instance, a bug that manipulates a
> logical slot's data incorrectly and computes this field wrongly, so
> that we finish with in-memory data that's older than what was
> previously saved.  The code may then flush at an incorrect, past
> position.  That's just an assumption from my side, of course.
>

If you are worried about such bugs, it would be better to have an
Assert, as suggested previously, rather than a greater-than check,
because then we will at least catch such bugs; otherwise they can go
unnoticed or, in the worst case, lead to unknown consequences. I am
saying this because if there are such bugs (or they get introduced
later), the slot could be flushed with a prior confirmed_flush
location from other code paths as well. Just for reference, we don't
have any check preventing the confirmed_flush LSN from moving backward
in LogicalConfirmReceivedLocation(); see also the other place where we
update it:
    else
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.confirmed_flush = lsn;
        SpinLockRelease(&MyReplicationSlot->mutex);
    }

As other places don't have an assert, I didn't add one here either,
but we can do so.
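
For instance, something along these lines next to the check in
CheckPointReplicationSlots() (a sketch of the suggestion, not
necessarily what will end up committed):

    SpinLockAcquire(&s->mutex);

    /* confirmed_flush should never move backward past the saved value. */
    Assert(s->data.confirmed_flush >= s->last_saved_confirmed_flush);

    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush != s->last_saved_confirmed_flush)
    {
        s->just_dirtied = true;
        s->dirty = true;
    }
    SpinLockRelease(&s->mutex);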

> > I don't want to argue on such a point because it is a little bit of
> > a matter of personal choice, but I find this comment unclear. It
> > reads as if confirmed_flush LSN were some LSN position at which we
> > flushed the slot's data, and that is not true. I prefer the comment
> > from the last patch I sent, which I feel we agreed upon is better:
> > "This value tracks the last confirmed_flush LSN flushed, which is
> > used during a shutdown checkpoint to decide if a logical slot's
> > data should be forcibly flushed or not."
>
> Okay, fine by me here.
>

Thanks, will change once we agree on the remaining points.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> Consider if we move this call to the bgwriter (i.e., flushing slots
> is no longer part of a checkpoint): would that be okay? Previously, I
> think it would have been, but not now. I see an argument to keep that
> comment as it is as well, because we have already mentioned the
> special shutdown checkpoint case. By the way, I have changed this
> because Ashutosh felt it is no longer correct to keep the first
> sentence as it is. See his email [1] (Relying on the first
> sentence, ...).

Hmmm..  Okay..

> As other places don't have an assert, I didn't add one here either,
> but we can do so.

I'd be OK with an assertion here at the end, though I'd still choose a
stricter run-time check if I were to apply that myself.
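
That is, something like this hypothetical variant, where the run-time
comparison itself refuses to consider a backward value:

    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush > s->last_saved_confirmed_flush)
    {
        s->just_dirtied = true;
        s->dirty = true;
    }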
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Wed, Sep 13, 2023 at 12:45 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> > Consider if we move this call to the bgwriter (i.e., flushing slots
> > is no longer part of a checkpoint): would that be okay? Previously,
> > I think it would have been, but not now. I see an argument to keep
> > that comment as it is as well, because we have already mentioned the
> > special shutdown checkpoint case. By the way, I have changed this
> > because Ashutosh felt it is no longer correct to keep the first
> > sentence as it is. See his email [1] (Relying on the first
> > sentence, ...).
>
> Hmmm..  Okay..
>

The patch is updated as per recent discussion.

--
With Regards,
Amit Kapila.

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> The patch is updated as per recent discussion.

WFM.  Thanks for the updated version.
--
Michael

Attachment

Re: persist logical slots to disk during shutdown checkpoint

From
Amit Kapila
Date:
On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> > The patch is updated as per recent discussion.
>
> WFM.  Thanks for the updated version.
>

Pushed.

--
With Regards,
Amit Kapila.



Re: persist logical slots to disk during shutdown checkpoint

From
Ashutosh Bapat
Date:
On Thu, Sep 14, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> > > The patch is updated as per recent discussion.
> >
> > WFM.  Thanks for the updated version.
> >
>
> Pushed.

Commitfest entry https://commitfest.postgresql.org/44/4536/ is in
"Ready for committer" state. Is there something remaining here? We
should probably set it to "Committed".

--
Best Wishes,
Ashutosh Bapat



Re: persist logical slots to disk during shutdown checkpoint

From
Michael Paquier
Date:
On Wed, Sep 20, 2023 at 04:48:00PM +0530, Ashutosh Bapat wrote:
> Commitfest entry https://commitfest.postgresql.org/44/4536/ is in
> "Ready for committer" state. Is there something remaining here? We
> should probably set it to "Committed".

Thanks, I've switched that to "Committed".
--
Michael

Attachment