Thread: persist logical slots to disk during shutdown checkpoint
It's entirely possible for a logical slot to have a confirmed_flush
LSN higher than the last value saved on disk while not being marked as
dirty. It's currently not a problem to lose that value during a clean
shutdown / restart cycle, but to support the upgrade of logical slots
[1] (see latest patch at [2]), we seem to rely on that value being
properly persisted to disk. During the upgrade, we need to verify that
all the data prior to shutdown_checkpoint for the logical slots has
been consumed; otherwise, the downstream may miss some data. Now, to
ensure the same, we are planning to compare the confirmed_flush LSN
location with the latest shutdown_checkpoint location, which means that
the confirmed_flush LSN should be updated after restart.

I think this is inefficient even without an upgrade because, after the
restart, this may lead to decoding some data again. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirmed_flush LSN is updated due to
keepalives. As we don't flush the latest value of the confirmed_flush
LSN, it may lead to processing the same changes again.

The idea discussed in the thread [1] is to always persist logical
slots to disk during the shutdown checkpoint. I have extracted the
patch to achieve the same from that thread and attached it here. This
could lead to some overhead during shutdown (checkpoint) if there are
many slots, but it is probably a one-time cost.

I couldn't think of better ideas, but another possibility is to mark
the slot as dirty when we update the confirmed_flush LSN (see
LogicalConfirmReceivedLocation()). However, that would be a bigger
overhead in the running server as it could be a frequent operation and
could lead to more writes.

Thoughts?

[1] - https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/TYAPR01MB5866562EF047F2C9DDD1F9DEF51BA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
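[Editor's note: to make the proposal concrete, here is a minimal sketch of the "always persist at shutdown checkpoint" idea, assuming an is_shutdown flag is plumbed from the checkpointer into CheckPointReplicationSlots(). Field and function names follow src/backend/replication/slot.c, but this is an illustration, not the attached patch.]

```c
/*
 * Illustrative sketch only (not the attached patch): during a shutdown
 * checkpoint, mark every logical slot dirty so that SaveSlotToPath()
 * writes it out even if only confirmed_flush has advanced.
 */
void
CheckPointReplicationSlots(bool is_shutdown)
{
	elog(DEBUG1, "performing replication slot checkpoint");

	/* Prevent slots from being created/dropped while we iterate. */
	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);

	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		char		path[MAXPGPATH];

		if (!s->in_use)
			continue;

		if (is_shutdown && SlotIsLogical(s))
		{
			/* Force a write even if the slot isn't marked dirty. */
			SpinLockAcquire(&s->mutex);
			s->just_dirtied = true;
			s->dirty = true;
			SpinLockRelease(&s->mutex);
		}

		/* Save the slot to disk; locking is handled in SaveSlotToPath(). */
		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
		SaveSlotToPath(s, path, LOG);
	}
	LWLockRelease(ReplicationSlotAllocationLock);
}
```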
On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
> It's entirely possible for a logical slot to have a confirmed_flush
> LSN higher than the last value saved on disk while not being marked as
> dirty. It's currently not a problem to lose that value during a clean
> shutdown / restart cycle, but to support the upgrade of logical slots
> [1] (see latest patch at [2]), we seem to rely on that value being
> properly persisted to disk. During the upgrade, we need to verify that
> all the data prior to shutdown_checkpoint for the logical slots has
> been consumed; otherwise, the downstream may miss some data. Now, to
> ensure the same, we are planning to compare the confirmed_flush LSN
> location with the latest shutdown_checkpoint location, which means that
> the confirmed_flush LSN should be updated after restart.
>
> I think this is inefficient even without an upgrade because, after the
> restart, this may lead to decoding some data again. Say, we process
> some transactions for which we didn't send anything downstream (the
> changes got filtered) but the confirmed_flush LSN is updated due to
> keepalives. As we don't flush the latest value of the confirmed_flush
> LSN, it may lead to processing the same changes again.
In most cases there shouldn't be a lot of records to decode after restart, but I agree it's better to avoid decoding those again.
> The idea discussed in the thread [1] is to always persist logical
> slots to disk during the shutdown checkpoint. I have extracted the
> patch to achieve the same from that thread and attached it here. This
> could lead to some overhead during shutdown (checkpoint) if there are
> many slots, but it is probably a one-time cost.
>
> I couldn't think of better ideas, but another possibility is to mark
> the slot as dirty when we update the confirmed_flush LSN (see
> LogicalConfirmReceivedLocation()). However, that would be a bigger
> overhead in the running server as it could be a frequent operation and
> could lead to more writes.
Yeah, I didn't find any better option either at the time. I still think that forcing persistence on shutdown is the best compromise. If we always marked the slot as dirty, we would be sure to add regular overhead, and we would probably still end up persisting the slot to disk at shutdown most of the time anyway, so I don't think it would be a good compromise.

My biggest concern was that some switchover scenarios might be a bit slower in some cases, but if that really is a problem, it's hard to imagine a workload where the slots wouldn't have to be persisted anyway, due to continuous activity needing to be sent just before the shutdown.
On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
>>
>> The idea discussed in the thread [1] is to always persist logical
>> slots to disk during the shutdown checkpoint. I have extracted the
>> patch to achieve the same from that thread and attached it here. This
>> could lead to some overhead during shutdown (checkpoint) if there are
>> many slots, but it is probably a one-time cost.
>>
>> I couldn't think of better ideas, but another possibility is to mark
>> the slot as dirty when we update the confirmed_flush LSN (see
>> LogicalConfirmReceivedLocation()). However, that would be a bigger
>> overhead in the running server as it could be a frequent operation and
>> could lead to more writes.
>
> Yeah, I didn't find any better option either at the time. I still think
> that forcing persistence on shutdown is the best compromise. If we tried
> to always mark the slot as dirty, we would be sure to add regular
> overhead, but we would probably end up persisting the slot to disk at
> shutdown anyway most of the time, so I don't think it would be a good
> compromise.
>

The other possibility is that we introduce yet another dirty flag for
slots, say dirty_for_shutdown_checkpoint, which will be set when we
update the confirmed_flush LSN. The flag will be cleared each time we
persist the slot, but we won't persist if only this flag is set. We can
then use it during the shutdown checkpoint to decide whether to persist
the slot.

--
With Regards,
Amit Kapila.
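[Editor's note: to visualize the flag idea just proposed, a rough sketch follows. The field placement and names are assumptions, not code from any posted patch, and, as the replies below indicate, this alternative was not pursued.]

```c
/* Illustrative sketch only; this alternative was not adopted. */
typedef struct ReplicationSlot
{
	/* ... existing fields, including data.confirmed_flush ... */

	bool		dirty;			/* needs saving at the next checkpoint */
	bool		just_dirtied;	/* dirtied again while being saved */

	/* set on confirmed_flush updates; honored only at shutdown */
	bool		dirty_for_shutdown_checkpoint;
} ReplicationSlot;

/* In LogicalConfirmReceivedLocation(), under the slot's spinlock: */
SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;
MyReplicationSlot->dirty_for_shutdown_checkpoint = true;
SpinLockRelease(&MyReplicationSlot->mutex);
```

The flag would be cleared whenever the slot is written out; a shutdown checkpoint would treat it like dirty, while at any other time a set flag alone would not trigger a write.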
On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:
> >>
> >> The idea discussed in the thread [1] is to always persist logical
> >> slots to disk during the shutdown checkpoint. I have extracted the
> >> patch to achieve the same from that thread and attached it here. This
> >> could lead to some overhead during shutdown (checkpoint) if there are
> >> many slots, but it is probably a one-time cost.
> >>
> >> I couldn't think of better ideas, but another possibility is to mark
> >> the slot as dirty when we update the confirmed_flush LSN (see
> >> LogicalConfirmReceivedLocation()). However, that would be a bigger
> >> overhead in the running server as it could be a frequent operation and
> >> could lead to more writes.
> >
> > Yeah, I didn't find any better option either at the time. I still think
> > that forcing persistence on shutdown is the best compromise. If we tried
> > to always mark the slot as dirty, we would be sure to add regular
> > overhead, but we would probably end up persisting the slot to disk at
> > shutdown anyway most of the time, so I don't think it would be a good
> > compromise.
>
> The other possibility is that we introduce yet another dirty flag for
> slots, say dirty_for_shutdown_checkpoint, which will be set when we
> update the confirmed_flush LSN. The flag will be cleared each time we
> persist the slot, but we won't persist if only this flag is set. We can
> then use it during the shutdown checkpoint to decide whether to persist
> the slot.

There are already two booleans controlling the dirtiness of a slot,
dirty and just_dirtied. Adding a third will create more confusion.

Another idea is to record the confirmed_flush LSN at the time of
persisting the slot. We can use it in two different ways: 1. to mark a
slot dirty and persist it if the confirmed_flush LSN at the time the
slot was last persisted is too far from the slot's current
confirmed_flush LSN; 2. at the shutdown checkpoint, persist all the
slots for which these two confirmed_flush LSNs differ.

--
Best Wishes,
Ashutosh Bapat
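[Editor's note: a rough sketch of what idea (2) could look like, assuming a new ReplicationSlot field (here called last_saved_confirmed_flush, an illustrative name) and an is_shutdown flag reaching SaveSlotToPath(). This is a sketch of the proposal, not a posted patch, though the v6 patch reviewed later in the thread contains essentially this check.]

```c
typedef struct ReplicationSlot
{
	/* ... existing fields ... */

	/*
	 * confirmed_flush as of the last time this slot was written to
	 * disk; updated after each successful SaveSlotToPath().
	 */
	XLogRecPtr	last_saved_confirmed_flush;
} ReplicationSlot;

/* In SaveSlotToPath(), when deciding whether a write is needed: */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
confirmed_flush_has_changed =
	(slot->data.confirmed_flush != slot->last_saved_confirmed_flush);
SpinLockRelease(&slot->mutex);

/* Skip the write unless dirty, or at shutdown with a moved LSN. */
if (!was_dirty &&
	!(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
	return;
```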
On Mon, Aug 21, 2023 at 6:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The other possibility is that we introduce yet another dirty flag for
> > slots, say dirty_for_shutdown_checkpoint, which will be set when we
> > update the confirmed_flush LSN. The flag will be cleared each time we
> > persist the slot, but we won't persist if only this flag is set. We can
> > then use it during the shutdown checkpoint to decide whether to persist
> > the slot.
>
> There are already two booleans controlling the dirtiness of a slot,
> dirty and just_dirtied. Adding a third will create more confusion.
>
> Another idea is to record the confirmed_flush LSN at the time of
> persisting the slot. We can use it in two different ways: 1. to mark a
> slot dirty and persist it if the confirmed_flush LSN at the time the
> slot was last persisted is too far from the slot's current
> confirmed_flush LSN; 2. at the shutdown checkpoint, persist all the
> slots for which these two confirmed_flush LSNs differ.

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag, because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint would require us to
update it each time we update the confirmed_flush LSN under a spinlock
at multiple places. But I don't see the need for what you proposed in
(1), as the use case for it is very minor; basically, this may
sometimes help us avoid decoding after crash recovery.

--
With Regards,
Amit Kapila.
On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Another idea is to record the confirmed_flush LSN at the time of
> > persisting the slot. We can use it in two different ways: 1. to mark a
> > slot dirty and persist it if the confirmed_flush LSN at the time the
> > slot was last persisted is too far from the slot's current
> > confirmed_flush LSN; 2. at the shutdown checkpoint, persist all the
> > slots for which these two confirmed_flush LSNs differ.
>
> I think using it in the second (2) way sounds advantageous as compared
> to storing another dirty flag, because this requires us to update
> last_persisted_confirm_flush_lsn only while writing the slot info.
> OTOH, having a flag dirty_for_shutdown_checkpoint would require us to
> update it each time we update the confirmed_flush LSN under a spinlock
> at multiple places. But I don't see the need for what you proposed in
> (1), as the use case for it is very minor; basically, this may
> sometimes help us avoid decoding after crash recovery.

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With it we take the opportunity to persist
the confirmed_flush LSN when it is much farther ahead than the
currently persisted value, and thus improve the chances of updating
restart_lsn and catalog_xmin faster after a WAL sender restart. We need
to keep that in mind when implementing (2). The problem is that if we
don't implement (1) right now, we might just forget to make that small
incremental change in the future. My preference is:
1. Do both (1) and (2) together.
2. Do (2) first and then (1) as a separate commit.
3. Just implement (2) if we don't have time at all for the first two
options.

--
Best Wishes,
Ashutosh Bapat
On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Another idea is to record the confirmed_flush LSN at the time of
> > > persisting the slot. We can use it in two different ways: 1. to mark a
> > > slot dirty and persist it if the confirmed_flush LSN at the time the
> > > slot was last persisted is too far from the slot's current
> > > confirmed_flush LSN; 2. at the shutdown checkpoint, persist all the
> > > slots for which these two confirmed_flush LSNs differ.
> >
> > I think using it in the second (2) way sounds advantageous as compared
> > to storing another dirty flag, because this requires us to update
> > last_persisted_confirm_flush_lsn only while writing the slot info.
> > OTOH, having a flag dirty_for_shutdown_checkpoint would require us to
> > update it each time we update the confirmed_flush LSN under a spinlock
> > at multiple places. But I don't see the need for what you proposed in
> > (1), as the use case for it is very minor; basically, this may
> > sometimes help us avoid decoding after crash recovery.
>
> Once we have last_persisted_confirm_flush_lsn, (1) is just an
> optimization on top of that. With it we take the opportunity to persist
> the confirmed_flush LSN when it is much farther ahead than the
> currently persisted value, and thus improve the chances of updating
> restart_lsn and catalog_xmin faster after a WAL sender restart. We need
> to keep that in mind when implementing (2). The problem is that if we
> don't implement (1) right now, we might just forget to make that small
> incremental change in the future. My preference is:
> 1. Do both (1) and (2) together.
> 2. Do (2) first and then (1) as a separate commit.
> 3. Just implement (2) if we don't have time at all for the first two
> options.

I prefer one of (2) or (3). Anyway, it is better to do that
optimization (persisting the confirmed_flush LSN at a regular interval)
as a separate patch, as we need to test and prove its value separately.

--
With Regards,
Amit Kapila.
On Tue, Aug 22, 2023 at 3:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Once we have last_persisted_confirm_flush_lsn, (1) is just an
> > optimization on top of that. With it we take the opportunity to persist
> > the confirmed_flush LSN when it is much farther ahead than the
> > currently persisted value, and thus improve the chances of updating
> > restart_lsn and catalog_xmin faster after a WAL sender restart. We need
> > to keep that in mind when implementing (2). The problem is that if we
> > don't implement (1) right now, we might just forget to make that small
> > incremental change in the future. My preference is:
> > 1. Do both (1) and (2) together.
> > 2. Do (2) first and then (1) as a separate commit.
> > 3. Just implement (2) if we don't have time at all for the first two
> > options.
>
> I prefer one of (2) or (3). Anyway, it is better to do that
> optimization (persisting the confirmed_flush LSN at a regular interval)
> as a separate patch, as we need to test and prove its value separately.

Fine with me.

--
Best Wishes,
Ashutosh Bapat
Dear hackers,

Thanks for forking the thread! I think we will choose another design,
but I wanted to post the updated version once with the current
approach. All comments came from the parent thread [1].

> 1. GENERAL -- git apply
>
> The patch fails to apply cleanly. There are whitespace warnings.
>
> [postgres(at)CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch
> ../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch:102:
> trailing whitespace.
> # SHUTDOWN_CHECKPOINT record.
> warning: 1 line adds whitespace errors.

There was an extra blank; removed.

> 2. GENERAL -- which patch is the real one and which is the copy?
>
> IMO this patch has become muddled.
>
> Amit recently created a new thread [1] "persist logical slots to disk
> during shutdown checkpoint", which I thought was dedicated to the
> discussion/implementation of this 0001 patch. Therefore, I expected any
> 0001 patch changes would be made only in that new thread from now on
> (and maybe you would mirror them here in this thread).
>
> But now I see there are v23-0001 patch changes here again. So, now the same
> patch is in 2 places and they are different. It is no longer clear to me
> which 0001 ("Always persist...") patch is the definitive one, and which one
> is the copy.

The one attached in the other thread is just a copy to make cfbot
happy; it can be ignored.

> contrib/test_decoding/t/002_always_persist.pl
>
> 3.
> +
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
> +# Test logical replication slots are always persist to disk during a
> shutdown
> +# checkpoint.
> +
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
>
> /always persist/always persisted/

Fixed.

> 4.
> +
> +# Test set-up
> my $node = PostgreSQL::Test::Cluster->new('test');
> $node->init(allows_streaming => 'logical');
> $node->append_conf('postgresql.conf', q{
> autovacuum = off
> checkpoint_timeout = 1h
> });
>
> $node->start;
>
> # Create table
> $node->safe_psql('postgres', "CREATE TABLE test (id int)");
>
> Maybe it is better to call the table something different instead of the
> same name as the cluster. e.g. 'test_tbl' would be better.

Changed to 'test_tbl'.

> 5.
> +# Shutdown the node once to do shutdown checkpoint
> $node->stop();
>
> SUGGESTION
> # Stop the node to cause a shutdown checkpoint

Fixed.

> 6.
> +# Fetch checkPoint from the control file itself
> my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
> my @control_data = split("\n", $stdout);
> my $latest_checkpoint = undef;
> foreach (@control_data)
> {
> 	if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
> 	{
> 		$latest_checkpoint = $1;
> 		last;
> 	}
> }
> die "No checkPoint in control file found\n"
> 	unless defined($latest_checkpoint);
>
> 6a.
> /checkPoint/checkpoint/ (2x)
>
> 6b.
> +die "No checkPoint in control file found\n"
>
> SUGGESTION
> "No checkpoint found in control file\n"

Hmm, these notations follow the test recovery/t/016_min_consistency.pl,
which uses the word "minRecoveryPoint". So I preferred the current one.

[1]: https://www.postgresql.org/message-id/CAHut%2BPtb%3DZYTM_awoLy3sJ5m9Oxe%3DJYn6Gve5rSW9cUdThpsVA%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, 22 Aug 2023 at 15:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Once we have last_persisted_confirm_flush_lsn, (1) is just an
> > optimization on top of that. With it we take the opportunity to persist
> > the confirmed_flush LSN when it is much farther ahead than the
> > currently persisted value, and thus improve the chances of updating
> > restart_lsn and catalog_xmin faster after a WAL sender restart. We need
> > to keep that in mind when implementing (2). The problem is that if we
> > don't implement (1) right now, we might just forget to make that small
> > incremental change in the future. My preference is:
> > 1. Do both (1) and (2) together.
> > 2. Do (2) first and then (1) as a separate commit.
> > 3. Just implement (2) if we don't have time at all for the first two
> > options.
>
> I prefer one of (2) or (3). Anyway, it is better to do that
> optimization (persisting the confirmed_flush LSN at a regular interval)
> as a separate patch, as we need to test and prove its value separately.

Here is a patch to persist logical slots to disk during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.

Regards,
Vignesh
Dear Vignesh,

> Here is a patch to persist logical slots to disk during a shutdown
> checkpoint if the updated confirmed_flush_lsn has not yet been
> persisted.

Thanks for making the patch with a different approach! Here are
comments.

01. RestoreSlotFromDisk

```
		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
		slot->candidate_restart_lsn = InvalidXLogRecPtr;
		slot->candidate_restart_valid = InvalidXLogRecPtr;
+		slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
```

last_persisted_confirmed_flush is set to InvalidXLogRecPtr, but isn't
it better to use cp.slotdata.confirmed_flush? Assuming that the server
is shut down immediately after the restore, your patch forces a save.

02. t/002_always_persist.pl

The original author of the patch is me, but I found that the test could
pass without your patch. This is because pg_logical_slot_get_changes()
-> pg_logical_slot_get_changes_guts(confirm = true) always marks the
slot as dirty. IIUC we must use the logical replication system to
verify the persistence. The attached test can pass only when the patch
is applied.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
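[Editor's note: for comment 01, the suggested change in RestoreSlotFromDisk() would look roughly as follows. This is a sketch against the patch's field naming, not a tested diff.]

```c
	/* initialize in-memory state (existing code in RestoreSlotFromDisk) */
	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
	slot->candidate_restart_lsn = InvalidXLogRecPtr;
	slot->candidate_restart_valid = InvalidXLogRecPtr;

	/*
	 * Seed from the value just read from disk (cp.slotdata) instead of
	 * InvalidXLogRecPtr, so that an immediate shutdown after startup
	 * doesn't force a pointless save of an unchanged slot.
	 */
	slot->last_persisted_confirmed_flush = cp.slotdata.confirmed_flush;
```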
On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> > Here is a patch to persist logical slots to disk during a shutdown
> > checkpoint if the updated confirmed_flush_lsn has not yet been
> > persisted.
>
> Thanks for making the patch with a different approach! Here are
> comments.
>
> 01. RestoreSlotFromDisk
>
> ```
> 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
> 		slot->candidate_restart_lsn = InvalidXLogRecPtr;
> 		slot->candidate_restart_valid = InvalidXLogRecPtr;
> +		slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
> ```
>
> last_persisted_confirmed_flush is set to InvalidXLogRecPtr, but isn't
> it better to use cp.slotdata.confirmed_flush? Assuming that the server
> is shut down immediately after the restore, your patch forces a save.
>
> 02. t/002_always_persist.pl
>
> The original author of the patch is me, but I found that the test could
> pass without your patch. This is because pg_logical_slot_get_changes()
> -> pg_logical_slot_get_changes_guts(confirm = true) always marks the
> slot as dirty. IIUC we must use the logical replication system to
> verify the persistence. The attached test can pass only when the patch
> is applied.

Here are a few other comments that I noticed:

1) I too noticed that the test passes both with and without the patch:

diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 7b05cc25a3..12afb9ea8c 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -72,6 +72,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_repl_stats.pl',
+      't/002_always_persist.pl',

2) Change checkPoint to checkpoint:
2.a) checkPoint should be checkpoint to maintain consistency across the file:

+# Shutdown the node once to do shutdown checkpoint
+$node->stop();
+
+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;

2.b) similarly here:

+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

2.c) similarly here too:

+# Compare confirmed_flush_lsn and checkPoint
+ok($confirmed_flush eq $latest_checkpoint,
+   "Check confirmed_flush is same as latest checkpoint location");

3) Change checkpoint to "Latest checkpoint location":
3.a) We should change "No checkPoint in control file found\n" to
"Latest checkpoint location not found in control file\n", as there are
many checkpoint entries in the control data:

+foreach (@control_data)
+{
+	if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+	{
+		$latest_checkpoint = $1;
+		last;
+	}
+}
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

3.b) We should change "Fetch checkPoint from the control file itself" to
"Fetch Latest checkpoint location from the control file":

+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;
+foreach (@control_data)
+{

Regards,
Vignesh
On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> > Here is a patch to persist logical slots to disk during a shutdown
> > checkpoint if the updated confirmed_flush_lsn has not yet been
> > persisted.
>
> Thanks for making the patch with a different approach! Here are
> comments.
>
> 01. RestoreSlotFromDisk
>
> ```
> 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
> 		slot->candidate_restart_lsn = InvalidXLogRecPtr;
> 		slot->candidate_restart_valid = InvalidXLogRecPtr;
> +		slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
> ```
>
> last_persisted_confirmed_flush is set to InvalidXLogRecPtr, but isn't
> it better to use cp.slotdata.confirmed_flush? Assuming that the server
> is shut down immediately after the restore, your patch forces a save.

Modified.

> 02. t/002_always_persist.pl
>
> The original author of the patch is me, but I found that the test could
> pass without your patch. This is because pg_logical_slot_get_changes()
> -> pg_logical_slot_get_changes_guts(confirm = true) always marks the
> slot as dirty. IIUC we must use the logical replication system to
> verify the persistence. The attached test can pass only when the patch
> is applied.

Updated the test based on your another_test with slight modifications.

The attached v4 version patch has the changes for the same.

Regards,
Vignesh
On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> It's entirely possible for a logical slot to have a confirmed_flush
> LSN higher than the last value saved on disk while not being marked as
> dirty. It's currently not a problem to lose that value during a clean
> shutdown / restart cycle, but to support the upgrade of logical slots
> [1] (see latest patch at [2]), we seem to rely on that value being
> properly persisted to disk. During the upgrade, we need to verify that
> all the data prior to shutdown_checkpoint for the logical slots has
> been consumed; otherwise, the downstream may miss some data. Now, to
> ensure the same, we are planning to compare the confirmed_flush LSN
> location with the latest shutdown_checkpoint location, which means that
> the confirmed_flush LSN should be updated after restart.
>
> I think this is inefficient even without an upgrade because, after the
> restart, this may lead to decoding some data again. Say, we process
> some transactions for which we didn't send anything downstream (the
> changes got filtered) but the confirmed_flush LSN is updated due to
> keepalives. As we don't flush the latest value of the confirmed_flush
> LSN, it may lead to processing the same changes again.

I was able to test and verify that we were not processing the same
changes again.

Note: The 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
has logs to print if a decoded transaction is skipped and also a log to
mention if any operation is filtered.
The test.sh script has the steps for: a) setting up logical replication
for a table; b) performing an insert on a table that needs to be
published (this will be replicated to the subscriber); c) performing an
insert on a table that will not be published (this insert will be
filtered and not replicated); d) sleeping for 5 seconds; e) stopping
the server; f) starting the server.

I used the following steps; do the following on HEAD:
a) Apply 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
on HEAD and build the binaries; b) execute test.sh; c) view the N1.log
file to see that the insert operations were filtered again, by looking
for the following logs:

LOG:  Filter insert for table tbl2
...
===restart===
...
LOG:  Skipping transaction 0/156AD10 as start decode at is greater 0/156AE40
...
LOG:  Filter insert for table tbl2

We can see that the insert operations on tbl2 which were filtered
before the server was stopped are filtered again after restart on HEAD.

Let's verify that the same changes are not processed again with the patch:
a) Apply v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patch
from [1] along with
0001-Add-logs-to-skip-transaction-filter-insert-operation.patch and
build the binaries; b) execute test.sh; c) view the N1.log file to see
that the insert operations were skipped after the server restart, by
looking for the following logs:

LOG:  Filter insert for table tbl2
...
===restart===
...
Skipping transaction 0/156AD10 as start decode at is greater 0/156AFB0
...
Skipping transaction 0/156AE80 as start decode at is greater 0/156AFB0

We can see that the insert operations on tbl2 are not processed again
after restart with the patch.

[1] - https://www.postgresql.org/message-id/CALDaNm0VrAt24e2FxbOX6eJQ-G_tZ0gVpsFBjzQM99NxG0hZfg%40mail.gmail.com

Regards,
Vignesh
On Fri, 25 Aug 2023 at 17:40, vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think this is inefficient even without an upgrade because, after the
> > restart, this may lead to decoding some data again. Say, we process
> > some transactions for which we didn't send anything downstream (the
> > changes got filtered) but the confirmed_flush LSN is updated due to
> > keepalives. As we don't flush the latest value of the confirmed_flush
> > LSN, it may lead to processing the same changes again.
>
> I was able to test and verify that we were not processing the same
> changes again.

Here is another way to test, using the pg_replslotdata approach that
was proposed earlier at [1]. I have rebased this on top of HEAD, and
the v5 version of it is attached.
We can use the same test as the test.sh shared at [2]. When executed
with HEAD, it was noticed that confirmed_flush points to the WAL
location before both transactions:

 slot_name | slot_type | datoid | persistency | xmin | catalog_xmin | restart_lsn | confirmed_flush | two_phase_at | two_phase |  plugin
-----------+-----------+--------+-------------+------+--------------+-------------+-----------------+--------------+-----------+----------
 sub       | logical   |      5 | persistent  |    0 |          735 | 0/1531E28   | 0/1531E60       | 0/0          | 0         | pgoutput

WAL record information generated using pg_walinspect for the records at
and after confirmed_flush WAL 0/1531E60:

 row_number | start_lsn | end_lsn | prev_lsn | xid | resource_manager | record_type | record_length | main_data_length | fpi_length | description | block_ref
------------+-----------+---------+----------+-----+------------------+-------------+---------------+------------------+------------+-------------+-----------
  1 | 0/1531E60 | 0/1531EA0 | 0/1531E28 |   0 | Heap2       | PRUNE               | 57 |  9 | 0 | snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [1], unused: [] | blkref #0: rel 1663/5/1255 fork main blk 58
  2 | 0/1531EA0 | 0/1531EE0 | 0/1531E60 | 735 | Heap        | INSERT+INIT         | 59 |  3 | 0 | off: 1, flags: 0x08 | blkref #0: rel 1663/5/16384 fork main blk 0
  3 | 0/1531EE0 | 0/1531F20 | 0/1531EA0 | 735 | Heap        | INSERT              | 59 |  3 | 0 | off: 2, flags: 0x08 | blkref #0: rel 1663/5/16384 fork main blk 0
  4 | 0/1531F20 | 0/1531F60 | 0/1531EE0 | 735 | Heap        | INSERT              | 59 |  3 | 0 | off: 3, flags: 0x08 | blkref #0: rel 1663/5/16384 fork main blk 0
  5 | 0/1531F60 | 0/1531FA0 | 0/1531F20 | 735 | Heap        | INSERT              | 59 |  3 | 0 | off: 4, flags: 0x08 | blkref #0: rel 1663/5/16384 fork main blk 0
  6 | 0/1531FA0 | 0/1531FE0 | 0/1531F60 | 735 | Heap        | INSERT              | 59 |  3 | 0 | off: 5, flags: 0x08 | blkref #0: rel 1663/5/16384 fork main blk 0
  7 | 0/1531FE0 | 0/1532028 | 0/1531FA0 | 735 | Transaction | COMMIT              | 46 | 20 | 0 | 2023-08-27 23:22:17.161215+05:30 |
  8 | 0/1532028 | 0/1532068 | 0/1531FE0 | 736 | Heap        | INSERT+INIT         | 59 |  3 | 0 | off: 1, flags: 0x08 | blkref #0: rel 1663/5/16387 fork main blk 0
  9 | 0/1532068 | 0/15320A8 | 0/1532028 | 736 | Heap        | INSERT              | 59 |  3 | 0 | off: 2, flags: 0x08 | blkref #0: rel 1663/5/16387 fork main blk 0
 10 | 0/15320A8 | 0/15320E8 | 0/1532068 | 736 | Heap        | INSERT              | 59 |  3 | 0 | off: 3, flags: 0x08 | blkref #0: rel 1663/5/16387 fork main blk 0
 11 | 0/15320E8 | 0/1532128 | 0/15320A8 | 736 | Heap        | INSERT              | 59 |  3 | 0 | off: 4, flags: 0x08 | blkref #0: rel 1663/5/16387 fork main blk 0
 12 | 0/1532128 | 0/1532168 | 0/15320E8 | 736 | Heap        | INSERT              | 59 |  3 | 0 | off: 5, flags: 0x08 | blkref #0: rel 1663/5/16387 fork main blk 0
 13 | 0/1532168 | 0/1532198 | 0/1532128 | 736 | Transaction | COMMIT              | 46 | 20 | 0 | 2023-08-27 23:22:17.174756+05:30 |
 14 | 0/1532198 | 0/1532210 | 0/1532168 |   0 | XLOG        | CHECKPOINT_SHUTDOWN | 114 | 88 | 0 | redo 0/1532198; tli 1; prev tli 1; fpw true; xid 0:737; oid 16399; multi 1; offset 0; oldest xid 723 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown |

Whereas the same test executed with the patch applied shows that
confirmed_flush points to the CHECKPOINT_SHUTDOWN record:

 slot_name | slot_type | datoid | persistency | xmin | catalog_xmin | restart_lsn | confirmed_flush | two_phase_at | two_phase |  plugin
-----------+-----------+--------+-------------+------+--------------+-------------+-----------------+--------------+-----------+----------
 sub       | logical   |      5 | persistent  |    0 |          735 | 0/1531E28   | 0/1532198       | 0/0          | 0         | pgoutput

WAL record information generated using pg_walinspect for the records at
and after confirmed_flush WAL 0/1532198:

 row_number | start_lsn | end_lsn | prev_lsn | xid | resource_manager | record_type | record_length | main_data_length | fpi_length | description | block_ref
------------+-----------+---------+----------+-----+------------------+-------------+---------------+------------------+------------+-------------+-----------
  1 | 0/1532198 | 0/1532210 | 0/1532168 |   0 | XLOG        | CHECKPOINT_SHUTDOWN | 114 | 88 | 0 | redo 0/1532198; tli 1; prev tli 1; fpw true; xid 0:737; oid 16399; multi 1; offset 0; oldest xid 723 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown |
(1 row)

[1] - https://www.postgresql.org/message-id/flat/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm2BboFuFVYxyzP4wkv7%3D8%2B_TwsD%2BugyGhtibTSF4m4XRg%40mail.gmail.com

Regards,
Vignesh
Dear hackers,

I also tested logical slots on a physical standby. PSA the script.
confirmed_flush_lsn for such slots was successfully persisted.

# Topology

In this test the nodes are connected to each other:

node1 --(physical replication)--> node2 --(logical replication)--> node3

# Test method

The attached script did the following steps:

1. constructed the above configuration
2. inserted data on node1
3. read confirmed_flush_lsn on node2 (a)
4. restarted node2
5. read confirmed_flush_lsn again on node2 (b)
6. compared (a) and (b)

# Result

Before patching, (a) and (b) were different values, which meant that
logical slots on the physical standby were not saved at shutdown.

```
 slot_name | confirmed_flush_lsn
-----------+---------------------
 sub       | 0/30003E8
(1 row)

waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started

 slot_name | confirmed_flush_lsn
-----------+---------------------
 sub       | 0/30000D8
(1 row)
```

After patching, (a) and (b) became the same value. The v4 patch worked
well even when the node is a physical standby.

```
 slot_name | confirmed_flush_lsn
-----------+---------------------
 sub       | 0/30003E8
(1 row)

waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started

 slot_name | confirmed_flush_lsn
-----------+---------------------
 sub       | 0/30003E8
(1 row)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Aug 24, 2023 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:
>

The patch looks mostly good to me. I have made minor changes, which are
as follows: (a) removed the autovacuum = off and
wal_receiver_status_interval = 0 settings as those don't seem to be
required for the test; (b) changed a few comments and variable names in
the code and test.

Shall we change the test file name from always_persist to
save_logical_slots_shutdown and move it to recovery/t/, as this test is
about verification after a server restart?

--
With Regards,
Amit Kapila.
On Mon, 28 Aug 2023 at 18:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The patch looks mostly good to me. I have made minor changes, which are
> as follows: (a) removed the autovacuum = off and
> wal_receiver_status_interval = 0 settings as those don't seem to be
> required for the test; (b) changed a few comments and variable names in
> the code and test.
>
> Shall we change the test file name from always_persist to
> save_logical_slots_shutdown and move it to recovery/t/, as this test is
> about verification after a server restart?

That makes sense. The attached v6 version has the changes for the same.
Apart from this, I have also fixed: a) pgindent issues; b) perltidy
issues; c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed); d) a few comments in the test file.
Thanks to Peter for providing a few offline comments.

Regards,
Vignesh
On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:
>
> That makes sense. The attached v6 version has the changes for the same.
> Apart from this, I have also fixed: a) pgindent issues; b) perltidy
> issues; c) one variable change (flush_lsn_changed to
> confirmed_flush_has_changed); d) a few comments in the test file.
> Thanks to Peter for providing a few offline comments.

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push this
in a day or two.

--
With Regards,
Amit Kapila.
On Tue, Aug 29, 2023 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The latest version looks good to me. Julien, Ashutosh, and others,
> unless you have more comments or suggestions, I would like to push this
> in a day or two.

I am looking at it. If you can wait till the end of the week, that will
be great.

--
Best Wishes,
Ashutosh Bapat
Hi,

On Tue, Aug 29, 2023 at 02:21:15PM +0530, Amit Kapila wrote:
>
> The latest version looks good to me. Julien, Ashutosh, and others,
> unless you have more comments or suggestions, I would like to push this
> in a day or two.

Unfortunately, I'm currently swamped with some internal escalations, so
I couldn't keep up closely with the latest activity here.

I think I recall that you wanted to change the timing at which logical
slots are shut down. I'm assuming that this change won't lead to the
in-memory confirmed_flush LSN and the latest persisted LSN always being
different? Otherwise, saving the latest persisted LSN to try to avoid
persisting all logical slots again on shutdown seems reasonable to me.
On Wed, Aug 30, 2023 at 9:03 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I think I recall that you wanted to change the timing at which logical
> slots are shut down. I'm assuming that this change won't lead to the
> in-memory confirmed_flush LSN and the latest persisted LSN always being
> different?

I think here by LSN you are referring to the confirmed_flush LSN. If
so, this doesn't create any new difference between the values of the
confirmed_flush LSN in memory and on disk. We just remember the last
persisted value to avoid writing the slots at shutdown time.

> Otherwise, saving the latest persisted LSN to try to avoid persisting
> all logical slots again on shutdown seems reasonable to me.

Thanks for responding.

--
With Regards,
Amit Kapila.
On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I am looking at it. If you can wait till the end of the week, that will
> be great.

 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = slot->data.confirmed_flush;

If slot->data.confirmed_flush gets updated concurrently between copying
it to be written to disk and writing it to last_saved_confirmed_flush,
we will miss one update. I think we need to update
last_saved_confirmed_flush based on cp.slotdata.confirmed_flush rather
than slot->data.confirmed_flush.

We are setting last_saved_confirmed_flush for all types of slots but
using it only when the slot is logical. Should we also set it only for
logical slots?

 	/* first check whether there's something to write out */
 	SpinLockAcquire(&slot->mutex);
 	was_dirty = slot->dirty;
 	slot->just_dirtied = false;
+	confirmed_flush_has_changed = (slot->data.confirmed_flush !=
slot->last_saved_confirmed_flush);

The confirmed_flush LSN should always move forward; otherwise there may
not be enough WAL retained for the slot to work. I am wondering whether
we should take the opportunity to make sure of that with
Assert(slot->data.confirmed_flush >= slot->last_saved_confirmed_flush).

-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/* Don't do anything if there's nothing to write. See ReplicationSlot. */
+	if (!was_dirty &&
+		!(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))

Rather than complicating this condition, I wonder whether it's better
to just set was_dirty = true when is_shutdown && SlotIsLogical(slot) &&
confirmed_flush_has_changed, or even slot->dirty = true. See also the
note at the end of this email.

+
+	/*
+	 * We won't ensure that the slot is persisted after the confirmed_flush
+	 * LSN is updated as that could lead to frequent writes. However, we need
+	 * to ensure that we do persist the slots at the time of shutdown whose
+	 * confirmed_flush LSN is changed since we last saved the slot to disk.
+	 * This will help in avoiding retreat of the confirmed_flush LSN after
+	 * restart. This variable is used to track the last saved confirmed_flush
+	 * LSN value.
+	 */

This comment makes more sense in SaveSlotToPath() than here. We may
decide to use last_saved_confirmed_flush for something else in the
future.

+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);

This function assumes that the subscriber will receive and confirm WAL
up to the checkpoint location, and that the publisher's WAL sender will
update it in the slot. Where is the code to ensure that? Does the WAL
sender process wait for the checkpoint LSN to be confirmed when
shutting down?

+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();

Can we add this test, comparing LSNs after a publisher restart, to an
existing test itself, like basic replication? That's the only extra
thing this test does beyond the usual replication stuff.

+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";

Why are we figuring out the LSN from the log file? Is it not available
from the pg_replication_slots view? As written, I think this test will
fail if the slot gets written after the restart because of concurrent
activity on the publisher (like autovacuum, or other things that cause
an empty transaction to be replicated) and the subscriber. A very rare
chance, but not a zero-probability one. I think we should shut down the
subscriber, restart the publisher, and then make this check based on
the contents of the replication slot instead of the server log.
Shutting down the subscriber will ensure that the subscriber won't send
any new confirmed flush location to the publisher after the restart.

All the places which call ReplicationSlotSave() mark the slot as dirty.
All the places where SaveSlotToPath() is called, the slot is marked
dirty, except when calling from CheckPointReplicationSlots(). So I am
wondering whether we should be marking the slot dirty in
CheckPointReplicationSlots() and avoid passing down the is_shutdown
flag to SaveSlotToPath().

Unrelated to this patch, I noticed that all the callers of
SaveSlotToPath() have the same code to craft the replication slot file
path. I wonder if that needs to be macro'ised, added to some common
function, or pushed into SaveSlotToPath() itself to make sure that any
future changes to the path are consistent for all callers of
SaveSlotToPath(). Interestingly, slot->data.name is accessed without a
lock here. The name of the slot does not change after creation, so this
isn't a problem right now, but it is generally against the principle of
accessing data protected by a mutex.

--
Best Wishes,
Ashutosh Bapat
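[Editor's note: to spell out the first review point, the post-write bookkeeping would remember the value that was actually serialized rather than the live one. A sketch of the suggested fix follows; cp is the on-disk copy that SaveSlotToPath() writes, and the names follow the v6 patch as quoted above.]

```c
	/*
	 * Successfully wrote. Unset the dirty bit unless somebody dirtied the
	 * slot again already, and remember the confirmed_flush value that was
	 * actually written out (cp.slotdata), not slot->data.confirmed_flush,
	 * which may have advanced while the file was being written.
	 */
	SpinLockAcquire(&slot->mutex);
	if (!slot->just_dirtied)
		slot->dirty = false;
	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
	SpinLockRelease(&slot->mutex);
```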
On Wed, Aug 30, 2023 at 6:33 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > I am looking at it. If you can wait till the end of the week, that
> > will be great.
>
> /*
>  * Successfully wrote, unset dirty bit, unless somebody dirtied again
> - * already.
> + * already and remember the confirmed_flush LSN value.
>  */
> SpinLockAcquire(&slot->mutex);
> if (!slot->just_dirtied)
>     slot->dirty = false;
> + slot->last_saved_confirmed_flush = slot->data.confirmed_flush;
>
> If the slot->data.confirmed_flush gets updated concurrently between copying it
> to be written to the disk and when it's written to last_saved_confirmed_flush,
> we will miss one update. I think we need to update last_saved_confirmed_flush
> based on the cp.slotdata.confirmed_flush rather than
> slot->data.confirmed_flush.
>

Yeah, this appears to be a problem.

> We are setting last_saved_confirmed_flush for all types of slots but using it
> only when the slot is logical. Should we also set it only for logical slots?
>

We can do that but I am not sure if there is any advantage in it other
than adding an extra condition. BTW, won't the confirmed_flush LSN be
used only for logical slots anyway?

> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> + confirmed_flush_has_changed = (slot->data.confirmed_flush !=
> slot->last_saved_confirmed_flush);
>
> The confirmed_flush LSN should always move forward, otherwise there may not be
> enough WAL retained for the slot to work. I am wondering whether we should take
> this opportunity to make sure of that:
> Assert(slot->data.confirmed_flush >= slot->last_saved_confirmed_flush)
>

Theoretically, what you are saying makes sense to me but we don't have
such a protection while updating the confirmed_flush LSN. It would be
better to first add such a protection for the confirmed_flush LSN
update as a separate patch.

> - /* and don't do anything if there's nothing to write */
> - if (!was_dirty)
> + /* Don't do anything if there's nothing to write. See ReplicationSlot. */
> + if (!was_dirty &&
> +     !(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
>
> Rather than complicating this condition, I wonder whether it's better to just
> set was_dirty = true when is_shutdown && SlotIsLogical(slot) &&
> confirmed_flush_has_changed, or even slot->dirty = true.
>

I think it is better to keep the slot's dirty property separate. We
could introduce another variable that is the result of both was_dirty
and the other checks together; however, that doesn't seem much better
than the current code.

> +
> + /*
> +  * We won't ensure that the slot is persisted after the confirmed_flush
> +  * LSN is updated as that could lead to frequent writes. However, we need
> +  * to ensure that we do persist the slots at the time of shutdown whose
> +  * confirmed_flush LSN is changed since we last saved the slot to disk.
> +  * This will help in avoiding retreat of the confirmed_flush LSN after
> +  * restart. This variable is used to track the last saved confirmed_flush
> +  * LSN value.
> +  */
>
> This comment makes more sense in SaveSlotToPath() than here. We may decide to
> use last_saved_confirmed_flush for something else in future.
>

I have kept it here because it contains some information that is not
specific to SaveSlotToPath(). So, it seems easier to follow the whole
theory if we keep it at the central place in the structure and then add
a reference wherever required, but I am fine if you and others feel
strongly about moving this to SaveSlotToPath().

> +
> +sub compare_confirmed_flush
> +{
> +    my ($node, $confirmed_flush_from_log) = @_;
> +
> +    # Fetch Latest checkpoint location from the control file
> +    my ($stdout, $stderr) =
> +      run_command([ 'pg_controldata', $node->data_dir ]);
> +    my @control_data = split("\n", $stdout);
> +    my $latest_checkpoint = undef;
> +    foreach (@control_data)
> +    {
> +        if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
> +        {
> +            $latest_checkpoint = $1;
> +            last;
> +        }
> +    }
> +    die "Latest checkpoint location not found in control file\n"
> +      unless defined($latest_checkpoint);
> +
> +    # Is it the same as the value read from the log?
> +    ok( $latest_checkpoint eq $confirmed_flush_from_log,
> +        "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
> +    );
>
> This function assumes that the subscriber will receive and confirm WAL up to
> the checkpoint location and the publisher's WAL sender will update it in the
> slot. Where is the code to ensure that? Does the WAL sender process wait for
> the checkpoint LSN to be confirmed when shutting down?
>

Note that we need to compare whether all the WAL before the
shutdown_checkpoint WAL record is sent. Before a clean shutdown, we do
ensure that all the pending WAL is confirmed back. See the use of
WalSndDone() in WalSndLoop().

> +
> +# Restart the publisher to ensure that the slot will be persisted if required
> +$node_publisher->restart();
>
> Can we add this test, comparing LSNs after publisher restart, to an existing
> test itself - like basic replication? That's the only extra thing that this
> test does beyond the usual replication stuff.
>

As this is a test after the restart of the server, I thought to keep it
with the recovery tests. However, once the upgrade (of publisher nodes)
patch is ready, we should keep this test with those tests or somehow
merge it with them; but till that patch is ready, let's keep this as a
separate test.

> +
> +# Wait until the walsender creates decoding context
> +$node_publisher->wait_for_log(
> +    qr/Streaming transactions committing after
> ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
> +    $offset);
> +
> +# Extract confirmed_flush from the logfile
> +my $log_contents = slurp_file($node_publisher->logfile, $offset);
> +$log_contents =~
> +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
> +  or die "could not get confirmed_flush_lsn";
>
> Why are we figuring out the LSN from the log file? Is it not available from
> the pg_replication_slots view? If we read it from the log, I think this test
> will fail if the slot gets written after the restart because of concurrent
> activity on the publisher (like autovacuum, or other things that cause an
> empty transaction to be replicated) and the subscriber. A very rare chance,
> but not a zero-probability one.
>

Yes, that is a possibility.

> I think we should shut down the subscriber, restart the publisher and then
> make this check based on the contents of the replication slot instead of the
> server log. Shutting down the subscriber will ensure that the subscriber
> won't send any new confirmed flush location to the publisher after restart.
>

But if we shut down the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before the
subscriber.

> All the places which call ReplicationSlotSave() mark the slot as dirty. All
> the places where SaveSlotToPath() is called, the slot is marked dirty except
> when calling from CheckPointReplicationSlots(). So I am wondering whether we
> should be marking the slot dirty in CheckPointReplicationSlots() and avoid
> passing down the is_shutdown flag to SaveSlotToPath().
>

I feel that will add another spinlock acquire/release pair without much
benefit. Sure, it may not be performance-sensitive, but adding another
pair of lock/release still doesn't seem like a good idea.

--
With Regards,
Amit Kapila.
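For reference, a minimal sketch of the concurrency fix agreed above: update
last_saved_confirmed_flush from the on-disk image that was just written (the
patch's "cp" struct), not from the live slot, so an advance of
slot->data.confirmed_flush between the write and the bookkeeping cannot be
recorded as already saved. This is a sketch of the discussed change, not the
committed patch:

    /* Tail of SaveSlotToPath() with the suggested fix (sketch). "cp" is
     * the ReplicationSlotOnDisk image that was just written and fsynced,
     * so its confirmed_flush value is exactly what is now on disk. */
    SpinLockAcquire(&slot->mutex);
    if (!slot->just_dirtied)
        slot->dirty = false;
    slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
    SpinLockRelease(&slot->mutex);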
On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have kept it here because it contains some information that is not
> specific to SaveSlotToPath(). So, it seems easier to follow the whole
> theory if we keep it at the central place in the structure and then add
> a reference wherever required, but I am fine if you and others feel
> strongly about moving this to SaveSlotToPath().

Saving a slot to disk happens only in SaveSlotToPath(), so except for
the last sentence, the rest of the comment makes sense in
SaveSlotToPath().

> Note that we need to compare whether all the WAL before the
> shutdown_checkpoint WAL record is sent. Before a clean shutdown, we do
> ensure that all the pending WAL is confirmed back. See the use of
> WalSndDone() in WalSndLoop().

Ok. Thanks for pointing that out to me.

> But if we shut down the subscriber before the publisher there is no
> guarantee that the publisher has sent all outstanding logs up to the
> shutdown checkpoint record (i.e., the latest record). Such a guarantee
> can only be there if we do a clean shutdown of the publisher before the
> subscriber.

So the sequence is: shut down the publisher node, shut down the
subscriber node, start the publisher node and carry out the checks.

> I feel that will add another spinlock acquire/release pair without much
> benefit. Sure, it may not be performance-sensitive, but adding another
> pair of lock/release still doesn't seem like a good idea.

We call ReplicationSlotMarkDirty() followed by ReplicationSlotSave() at
all the places, even those which are more frequent than this. So I
think it's better to stick to that protocol rather than adding a new
flag.

--
Best Wishes,
Ashutosh Bapat
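The protocol referred to here is the existing two-step pattern used by
slot-updating code paths in slot.c; a minimal sketch of a typical caller:

    /* Usual persistence protocol for a slot the backend has updated */
    ReplicationSlotMarkDirty();  /* sets just_dirtied and dirty under the mutex */
    ReplicationSlotSave();       /* serializes MyReplicationSlot to
                                  * pg_replslot/<slot_name>/state */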
On Thu, Aug 31, 2023 at 12:25 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> So the sequence is: shut down the publisher node, shut down the
> subscriber node, start the publisher node and carry out the checks.

This can probably work, but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of a comparison
with a later in-memory value of the confirmed_flush LSN. Ideally, if we
had a tool like pg_replslotdata which can read the on-disk state of
slots, that would be better; but missing that, the current one sounds
like the next best possibility. Do you see any problem with the current
approach of the test?

BTW, I think we can keep autovacuum = off for this test just to avoid
any extra record generation, even though that doesn't matter for the
purpose of the test.

> We call ReplicationSlotMarkDirty() followed by ReplicationSlotSave() at
> all the places, even those which are more frequent than this.

All but one. Normally, the idea of marking dirty is to indicate that we
will actually write/flush the contents at a later point (except when
required for correctness), as also indicated in the comments atop
ReplicationSlotMarkDirty(). However, I see your point that we use that
protocol at all the current places including CreateSlotOnDisk(). So, we
can probably do it here as well.

--
With Regards,
Amit Kapila.
On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This can probably work, but I still prefer the current approach as that
> will be closer to the ideal values on the disk instead of a comparison
> with a later in-memory value of the confirmed_flush LSN. Ideally, if we
> had a tool like pg_replslotdata which can read the on-disk state of
> slots, that would be better; but missing that, the current one sounds
> like the next best possibility. Do you see any problem with the current
> approach of the test?

> + qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush, but not
always. It's the LSN that the snapshot builder computes based on
factors including confirmed_flush. There's a chance that this test will
fail sometimes because of this behaviour. Reading directly from the
replication slot is better than this. pg_replslotdata might help if we
read the replication slot content between shutdown and restart of the
publisher.

> BTW, I think we can keep autovacuum = off for this test just to avoid
> any extra record generation, even though that doesn't matter for the
> purpose of the test.

Autovacuum is one thing, but we can't guarantee the absence of any
concurrent activity forever.

> All but one. Normally, the idea of marking dirty is to indicate that we
> will actually write/flush the contents at a later point (except when
> required for correctness), as also indicated in the comments atop
> ReplicationSlotMarkDirty(). However, I see your point that we use that
> protocol at all the current places including CreateSlotOnDisk(). So, we
> can probably do it here as well.

Yes.

I didn't see this entry in the commitfest. Since we are discussing it
and the next CF is about to begin, it's probably good to add one there.

--
Best Wishes,
Ashutosh Bapat
On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I don't think the LSN reported in this message is guaranteed to be the
> confirmed_flush LSN of the slot. It's usually confirmed_flush, but not
> always. It's the LSN that the snapshot builder computes based on
> factors including confirmed_flush. There's a chance that this test will
> fail sometimes because of this behaviour.

I think I am missing something here because, as per my understanding,
the LOG referred to by the test is generated in CreateDecodingContext()
before which we shouldn't be changing the slot's confirmed_flush LSN.
The LOG [1] refers to the slot's persistent value for confirmed_flush,
so how could it be different from what the test is expecting?

[1]
errdetail("Streaming transactions committing after %X/%X, reading WAL
from %X/%X.",
LSN_FORMAT_ARGS(slot->data.confirmed_flush),

--
With Regards,
Amit Kapila.
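For context, the errdetail cited above is emitted when the decoding context
is (re)created, before decoding advances anything; abridged and line-wrapped
(a paraphrase of the ereport in CreateDecodingContext(), not an exact copy):

    ereport(LOG,
            (errmsg("starting logical decoding for slot \"%s\"",
                    NameStr(slot->data.name)),
             errdetail("Streaming transactions committing after %X/%X, "
                       "reading WAL from %X/%X.",
                       LSN_FORMAT_ARGS(slot->data.confirmed_flush),
                       LSN_FORMAT_ARGS(slot->data.restart_lsn))));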
On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > All but one. Normally, the idea of marking dirty is to indicate that we
> > will actually write/flush the contents at a later point (except when
> > required for correctness), as also indicated in the comments atop
> > ReplicationSlotMarkDirty(). However, I see your point that we use that
> > protocol at all the current places including CreateSlotOnDisk(). So, we
> > can probably do it here as well.
>
> Yes.

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

--
With Regards,
Amit Kapila.
On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think we should also ensure that slots are not invalidated
> (slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
> because we don't allow decoding from such slots, so we shouldn't
> include those.

Added this check.

Apart from this I have also fixed the following issues that were agreed
on: a) Setting slots to dirty in CheckPointReplicationSlots instead of
setting it in SaveSlotToPath; b) The comments were moved from
ReplicationSlot to CheckPointReplicationSlots; c) Tests will be run
with autovacuum = off; d) Updating last_saved_confirmed_flush based on
cp.slotdata.confirmed_flush rather than slot->data.confirmed_flush.

I have also added the commitfest entry for this at [1]. Thanks to
Ashutosh/Amit for the feedback.

The attached v7 version patch has the changes for the same.

[1] - https://commitfest.postgresql.org/44/4536/

Regards,
Vignesh
Attachment
On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think I am missing something here because, as per my understanding,
> the LOG referred to by the test is generated in CreateDecodingContext()
> before which we shouldn't be changing the slot's confirmed_flush LSN.
> The LOG [1] refers to the slot's persistent value for confirmed_flush,
> so how could it be different from what the test is expecting?
>
> [1]
> errdetail("Streaming transactions committing after %X/%X, reading WAL
> from %X/%X.",
> LSN_FORMAT_ARGS(slot->data.confirmed_flush),

I was afraid that we may move confirmed_flush while creating the
snapshot builder when creating the decoding context. But I don't see
any code doing that, so maybe we are safe. But if the log message
changes, this test would fail; depending on the log message looks a bit
fragile, especially when we have a way to access the data directly and
reliably.

--
Best Wishes,
Ashutosh Bapat
On Fri, Sep 1, 2023 at 1:11 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I was afraid that we may move confirmed_flush while creating the
> snapshot builder when creating the decoding context. But I don't see
> any code doing that, so maybe we are safe.

We are safe in that respect. As far as I understand, there is no reason
to be worried.

> But if the log message changes, this test would fail; depending on the
> log message looks a bit fragile, especially when we have a way to
> access the data directly and reliably.

This message has been there from the very beginning (b89e1510) and I
can't foresee a reason to change such a message. But even if we change
it, we can always adjust the test output or the test accordingly, if
required. I think it is a matter of preference which way we write the
test, so let's not argue too much on this. I find the current way
slightly more reliable, but we can change it if we see any problem.

--
With Regards,
Amit Kapila.
On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Added this check.
>
> Apart from this I have also fixed the following issues that were agreed
> on: a) Setting slots to dirty in CheckPointReplicationSlots instead of
> setting it in SaveSlotToPath

+ if (is_shutdown && SlotIsLogical(s))
+ {
+     SpinLockAcquire(&s->mutex);
+     if (s->data.invalidated == RS_INVAL_NONE &&
+         s->data.confirmed_flush != s->last_saved_confirmed_flush)
+         s->dirty = true;
+     SpinLockRelease(&s->mutex);
+ }

I think it is better to use ReplicationSlotMarkDirty() as that would be
consistent with all other usages.

--
With Regards,
Amit Kapila.
On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think it is better to use ReplicationSlotMarkDirty() as that would be
> consistent with all other usages.

ReplicationSlotMarkDirty() works only on MyReplicationSlot, whereas
CheckPointReplicationSlots() loops through all the slots and marks the
appropriate slot as dirty; we might have to change
ReplicationSlotMarkDirty() to take the slot as an input parameter, and
all callers should then pass MyReplicationSlot. Another thing is that
we have already taken the spinlock to access
last_saved_confirmed_flush in CheckPointReplicationSlots(); we could
set the dirty flag here itself, else we will have to release the lock
and call ReplicationSlotMarkDirty(), which will take the lock again.
Instead, shall we set just_dirtied as well in
CheckPointReplicationSlots()? Thoughts?

Regards,
Vignesh
On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > + if (is_shutdown && SlotIsLogical(s))
> > + {
> > +     SpinLockAcquire(&s->mutex);
> > +     if (s->data.invalidated == RS_INVAL_NONE &&
> > +         s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > +         s->dirty = true;
> >
> > I think it is better to use ReplicationSlotMarkDirty() as that would be
> > consistent with all other usages.
>
> ReplicationSlotMarkDirty() works only on MyReplicationSlot, whereas
> CheckPointReplicationSlots() loops through all the slots and marks the
> appropriate slot as dirty; we might have to change
> ReplicationSlotMarkDirty() to take the slot as an input parameter, and
> all callers should then pass MyReplicationSlot.

Personally, I feel if we want to centralize the code of marking dirty
into a function, we can introduce a new static function
MarkSlotDirty(slot) to mark the passed slot dirty and let
ReplicationSlotMarkDirty and CheckPointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
    MarkSlotDirty(MyReplicationSlot);
}

+static void
+MarkSlotDirty(ReplicationSlot *slot)
+{
+    Assert(slot != NULL);
+
+    SpinLockAcquire(&slot->mutex);
+    slot->just_dirtied = true;
+    slot->dirty = true;
+    SpinLockRelease(&slot->mutex);
+}

This is somewhat similar to the relation between ReplicationSlotSave()
(serialize my backend's replication slot) and SaveSlotToPath() (save
the passed slot).

> Another thing is that we have already taken the spinlock to access
> last_saved_confirmed_flush in CheckPointReplicationSlots(); we could
> set the dirty flag here itself, else we will have to release the lock
> and call ReplicationSlotMarkDirty(), which will take the lock again.

Yes, this is unavoidable, but maybe it's not a big problem as we only
do it at shutdown.

> Instead, shall we set just_dirtied as well in
> CheckPointReplicationSlots()? Thoughts?

I agree we'd better set just_dirtied to true to ensure we will
serialize the slot info here, because if some other process has just
serialized the slot, the dirty flag will be reset to false if we don't
set just_dirtied to true in CheckPointReplicationSlots(). This race
condition may not exist for now, but it seems better to completely
forbid it by setting just_dirtied.

Best Regards,
Hou zj
On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> Personally, I feel if we want to centralize the code of marking dirty
> into a function, we can introduce a new static function
> MarkSlotDirty(slot) to mark the passed slot dirty and let
> ReplicationSlotMarkDirty and CheckPointReplicationSlots call it.
>
> > Another thing is that we have already taken the spinlock to access
> > last_saved_confirmed_flush in CheckPointReplicationSlots(); we could
> > set the dirty flag here itself, else we will have to release the lock
> > and call ReplicationSlotMarkDirty(), which will take the lock again.
>
> Yes, this is unavoidable, but maybe it's not a big problem as we only
> do it at shutdown.

True, but it still doesn't look elegant. I also thought about having a
(probably inline) function that marks both the just_dirtied and dirty
fields. However, that requires us to assert that the caller has already
acquired a spinlock. I see a macro SpinLockFree() that might help, but
it didn't seem to be used anywhere in the code, so I am not sure if we
can rely on it.

> > Instead, shall we set just_dirtied as well in
> > CheckPointReplicationSlots()? Thoughts?
>
> I agree we'd better set just_dirtied to true to ensure we will
> serialize the slot info here, because if some other process has just
> serialized the slot, the dirty flag will be reset to false if we don't
> set just_dirtied to true in CheckPointReplicationSlots(). This race
> condition may not exist for now, but it seems better to completely
> forbid it by setting just_dirtied.

Agreed, and it is better to close any such possibility because we can't
say anything with certainty about manual slots. This seems better than
the other ideas we discussed.

--
With Regards,
Amit Kapila.
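If one wanted to try the inline-function idea anyway, a sketch might look
like the following. Note that SpinLockFree() is defined in spin.h but, as
noted above, is unused in the tree, so treating it as an assertion primitive
here is an assumption rather than established practice; the helper name is
hypothetical:

    /* Hypothetical helper; assumes the caller already holds slot->mutex. */
    static inline void
    ReplicationSlotSetDirty(ReplicationSlot *slot)
    {
        /* Unproven use of SpinLockFree(); see the doubts raised above. */
        Assert(!SpinLockFree(&slot->mutex));

        slot->just_dirtied = true;
        slot->dirty = true;
    }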
On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Apart from this I have also fixed the following issues that were agreed
> on: a) Setting slots to dirty in CheckPointReplicationSlots instead of
> setting it in SaveSlotToPath; b) The comments were moved from
> ReplicationSlot to CheckPointReplicationSlots; c) Tests will be run
> with autovacuum = off; d) Updating last_saved_confirmed_flush based on
> cp.slotdata.confirmed_flush rather than slot->data.confirmed_flush.
>
> I have also added the commitfest entry for this at [1].

The overall idea looks fine to me.

+
+ /*
+  * We won't ensure that the slot is persisted after the
+  * confirmed_flush LSN is updated as that could lead to frequent
+  * writes. However, we need to ensure that we do persist the slots at
+  * the time of shutdown whose confirmed_flush LSN is changed since we
+  * last saved the slot to disk. This will help in avoiding retreat of
+  * the confirmed_flush LSN after restart.
+  */
+ if (is_shutdown && SlotIsLogical(s))
+ {
+     SpinLockAcquire(&s->mutex);
+     if (s->data.invalidated == RS_INVAL_NONE &&
+         s->data.confirmed_flush != s->last_saved_confirmed_flush)
+         s->dirty = true;
+     SpinLockRelease(&s->mutex);
+ }

The comments don't mention anything about why we are flushing only at
the shutdown checkpoint. I mean, the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints, so isn't it
wise to flush during every checkpoint? We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery,
but OTOH there is no reason for not doing so for other checkpoints. Or
are we worried about concurrency with a parallel walsender running
during a non-shutdown checkpoint? If so, better to explain that as
well. If it is already discussed in the thread and we have a conclusion
on this, then maybe we can mention it in the comments.

/*
 * Flush all replication slots to disk.
 *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
 */
void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)

It seems we have removed two lines from the function header comments;
is this intentional or accidental?

Other than this, the patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 5, 2023 at 9:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The comments don't mention anything about why we are flushing only at
> the shutdown checkpoint. I mean, the checkpoint is not that frequent
> and we already perform a lot of I/O during checkpoints, so isn't it
> wise to flush during every checkpoint? We may argue that there is no
> extra advantage of that as we are not optimizing for crash recovery,
> but OTOH there is no reason for not doing so for other checkpoints. Or
> are we worried about concurrency with a parallel walsender running
> during a non-shutdown checkpoint? If so, better to explain that as
> well. If it is already discussed in the thread and we have a conclusion
> on this, then maybe we can mention it in the comments.

The point is that at the time of non-shutdown checkpoints, it is not
clear that there is an extra advantage, but we will definitely add
extra I/O for this, because at other times we will already be saving
the slot from time to time as the replication makes progress. And we
also need to flush such slots during shutdown for correctness for some
use cases like upgrades. We can probably add something like: "At other
times, the walsender keeps saving the slot from time to time as the
replication progresses, so there is no clear advantage of flushing
additional slots at the time of checkpoint". Will that work for you?

Having said that, I am not opposed to doing it for non-shutdown
checkpoints if one makes a separate case for it.

--
With Regards,
Amit Kapila.
On Tue, Sep 5, 2023 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The point is that at the time of non-shutdown checkpoints, it is not
> clear that there is an extra advantage, but we will definitely add
> extra I/O for this, because at other times we will already be saving
> the slot from time to time as the replication makes progress. And we
> also need to flush such slots during shutdown for correctness for some
> use cases like upgrades. We can probably add something like: "At other
> times, the walsender keeps saving the slot from time to time as the
> replication progresses, so there is no clear advantage of flushing
> additional slots at the time of checkpoint". Will that work for you?

Yeah, that comment will work. My only concern was that since we added
an explicit condition that the slot should be synced only during the
shutdown checkpoint, the comments should also explicitly explain the
reason.

Anyway, I am fine either way, whether we sync at the shutdown
checkpoint or at every checkpoint, because I/O for slot sync at
checkpoint time should not be a real worry. That said, if we can avoid
additional code with extra conditions, then it's better, because such
code branches will be frequently hit, and from a testability point of
view we prefer to add code in the common path unless there is some
overhead or it is specifically meant for that branch only.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, 5 Sept 2023 at 09:10, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The comments don't mention anything about why we are flushing only at
> the shutdown checkpoint. [...] If it is already discussed in the thread
> and we have a conclusion on this, then maybe we can mention it in the
> comments.

I felt it is better to do this only during the shutdown checkpoint, as
in other cases the slot is saved periodically as and when the
replication happens. Added comments for the same.

> /*
>  * Flush all replication slots to disk.
>  *
> - * This needn't actually be part of a checkpoint, but it's a convenient
> - * location.
> + * is_shutdown is true in case of a shutdown checkpoint.
>  */
> void
> -CheckPointReplicationSlots(void)
> +CheckPointReplicationSlots(bool is_shutdown)
>
> It seems we have removed two lines from the function header comments;
> is this intentional or accidental?

Modified.

The updated v8 version patch has the changes for the same.

Regards,
Vignesh
Attachment
On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > > Another thing is that we have already taken the spinlock to access
> > > last_saved_confirmed_flush in CheckPointReplicationSlots(); we could
> > > set the dirty flag here itself, else we will have to release the
> > > lock and call ReplicationSlotMarkDirty(), which will take the lock
> > > again.
> >
> > Yes, this is unavoidable, but maybe it's not a big problem as we only
> > do it at shutdown.
>
> True, but it still doesn't look elegant. I also thought about having a
> (probably inline) function that marks both the just_dirtied and dirty
> fields. However, that requires us to assert that the caller has already
> acquired a spinlock. I see a macro SpinLockFree() that might help, but
> it didn't seem to be used anywhere in the code, so I am not sure if we
> can rely on it.

Can't we just have code like this? I mean, we will have to make
ReplicationSlotMarkDirty() take the slot as an argument, or have
another version which takes the slot as an argument and would be called
by us as well as by ReplicationSlotMarkDirty(). I mean, why do we need
these checks (s->data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex? The walsender is shut down, so the confirmed_flush LSN cannot
move concurrently, and the slot cannot be invalidated either, because
that is done by the checkpointer and we are in the checkpointer.

+ if (is_shutdown && SlotIsLogical(s))
+ {
+     if (s->data.invalidated == RS_INVAL_NONE &&
+         s->data.confirmed_flush != s->last_saved_confirmed_flush)
+     {
+         ReplicationSlotMarkDirty(s);
+     }
+ }

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tuesday, September 5, 2023 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> Can't we just have code like this? I mean, we will have to make
> ReplicationSlotMarkDirty() take the slot as an argument, or have
> another version which takes the slot as an argument and would be called
> by us as well as by ReplicationSlotMarkDirty(). I mean, why do we need
> these checks (s->data.invalidated == RS_INVAL_NONE &&
> s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> mutex? The walsender is shut down, so the confirmed_flush LSN cannot
> move concurrently, and the slot cannot be invalidated either, because
> that is done by the checkpointer and we are in the checkpointer.

I agree with your analysis that the lock may be unnecessary for now and
the code will work, but I personally feel we'd better take the
spinlock.

Firstly, considering our discussion on the potential extension of
persisting the slot for online checkpoints in the future, we would
anyway need the lock at that time, so taking the lock here could avoid
overlooking the need to add it later. And the lock also won't cause any
performance or concurrency issue.

Additionally, if we don't take the lock, we rely on the assumption that
the walsender will exit before the shutdown checkpoint. Currently,
that's true for a logical walsender, but a physical walsender can exit
later than the checkpointer. So, I am slightly worried that if we
change the logical walsender's exit timing in the future, the
assumption may not hold.

Besides, for non-built-in logical replication, if someone creates their
own walsender or other processes to send the changes and the process
doesn't exit before the shutdown checkpoint, it may also be a problem.
Although I don't have existing examples of such extensions, I feel
taking the lock would make it more robust.

Best Regards,
Hou zj
On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> > Can't we just have code like this? I mean we will have to make ReplicationSlotMarkDirty take the slot as an argument, or have another version which takes the slot as an argument and that would be called by us as well as by ReplicationSlotMarkDirty(). I mean, why do we need these checks (s->data.invalidated == RS_INVAL_NONE && s->data.confirmed_flush != s->last_saved_confirmed_flush) under the mutex? The walsender is shut down, so the confirmed flush LSN cannot move concurrently, and the slot cannot be invalidated either, because that is done by the checkpointer and we are in the checkpointer?
>
> I agree with your analysis that the lock may be unnecessary for now and the code will work, but I personally feel we'd better take the spinlock.
>
> Firstly, considering our discussion on the potential extension of persisting the slot for online checkpoints in the future, we would need the lock at that time anyway, so taking the lock here could avoid overlooking the need to add it later. And the lock also won't cause any performance or concurrency issue.

If we think that we might plan to persist during online checkpoints as well, then it is better to do it now. But that is not an extension of the feature; rather, the conclusion at this point is that it is wise to persist only at the shutdown checkpoint, and if that's the conclusion then there is no point in writing code on the assumption that we will change our conclusion in the future.

> Additionally, if we don't take the lock, we rely on the assumption that the walsender will exit before the shutdown checkpoint. Currently, that's true for logical walsenders, but a physical walsender can exit later than the checkpointer. So, I am slightly worried that if we change the logical walsender's exit timing in the future, the assumption may not hold.
>
> Besides, for non-built-in logical replication, if someone creates their own walsender or other process to send the changes, and that process doesn't exit before the shutdown checkpoint, it may also be a problem. Although I don't have existing examples of such extensions, I feel taking the lock would make it more robust.

I think all our logic is based on the walsender having already exited. If not, then even if you check under the mutex that the confirmed flush LSN has not changed, it can change right after you release the lock, and then we will not be flushing the latest value of the confirmed flush LSN to disk, and our logic of comparing checkpoint.redo with the confirmed flush LSN might not work?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > > Can't we just have code like this? I mean we will have to make ReplicationSlotMarkDirty take the slot as an argument, or have another version which takes the slot as an argument and that would be called by us as well as by ReplicationSlotMarkDirty(). I mean, why do we need these checks (s->data.invalidated == RS_INVAL_NONE && s->data.confirmed_flush != s->last_saved_confirmed_flush) under the mutex? The walsender is shut down, so the confirmed flush LSN cannot move concurrently, and the slot cannot be invalidated either, because that is done by the checkpointer and we are in the checkpointer?
>
> ...
>
> > Additionally, if we don't take the lock, we rely on the assumption that the walsender will exit before the shutdown checkpoint. Currently, that's true for logical walsenders, but a physical walsender can exit later than the checkpointer. So, I am slightly worried that if we change the logical walsender's exit timing in the future, the assumption may not hold.
> >
> > Besides, for non-built-in logical replication, if someone creates their own walsender or other process to send the changes, and that process doesn't exit before the shutdown checkpoint, it may also be a problem. Although I don't have existing examples of such extensions, I feel taking the lock would make it more robust.
>
> I think all our logic is based on the walsender having already exited. If not, then even if you check under the mutex that the confirmed flush LSN has not changed, it can change right after you release the lock, and then we will not be flushing the latest value of the confirmed flush LSN to disk, and our logic of comparing checkpoint.redo with the confirmed flush LSN might not work?
>

Right, it can change, and in that case, the check related to confirm_flush LSN will fail during the upgrade. However, the point is that if we don't take the spinlock, we need to properly write comments on why, unlike in other places, it is safe here to check these values without the spinlock. We can do that, but I feel we have to be careful about all future usages of these variables, so having the spinlock makes them follow the normal coding pattern, which I feel makes it more robust. Yes, marking dirty via a common function also has merits, but personally, I find it better to follow the normal coding practice of checking the required fields under the spinlock. The other possibility is to first check if we need to mark the slot dirty under the spinlock, then release the spinlock, and then call the common MarkDirty function, but again that will be under the assumption that these flags won't change.

--
With Regards,
Amit Kapila.
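That last alternative, checking under the spinlock and then calling a common helper after releasing it, would look roughly like the sketch below. MarkSlotDirty() here is the hypothetical static helper from Hou's sketch upthread, not an existing function, and `s` is assumed to be the ReplicationSlot pointer from the checkpoint loop:

    bool    needs_flush;

    SpinLockAcquire(&s->mutex);
    needs_flush = (s->data.invalidated == RS_INVAL_NONE &&
                   s->data.confirmed_flush != s->last_saved_confirmed_flush);
    SpinLockRelease(&s->mutex);

    /*
     * Assumes the flags cannot change between the release above and the
     * call below, which is exactly the assumption being debated here.
     */
    if (needs_flush)
        MarkSlotDirty(s);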
On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Right, it can change, and in that case, the check related to confirm_flush LSN will fail during the upgrade. However, the point is that if we don't take the spinlock, we need to properly write comments on why, unlike in other places, it is safe here to check these values without the spinlock.

I agree with that, but even now it is not true that we always read this under the spinlock; for example [1][2], we can see we read it without the spinlock.

[1]
StartLogicalReplication
{
    /*
     * Report the location after which we'll send out further commits as the
     * current sentPtr.
     */
    sentPtr = MyReplicationSlot->data.confirmed_flush;
}

[2]
LogicalIncreaseRestartDecodingForSlot
{
    /* candidates are already valid with the current flush position, apply */
    if (updated_lsn)
        LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}

> We can do that, but I feel we have to be careful about all future usages of these variables, so having the spinlock makes them follow the normal coding pattern, which I feel makes it more robust. Yes, marking dirty via a common function also has merits, but personally, I find it better to follow the normal coding practice of checking the required fields under the spinlock. The other possibility is to first check if we need to mark the slot dirty under the spinlock, then release the spinlock, and then call the common MarkDirty function, but again that will be under the assumption that these flags won't change.

That's true, but we are already making that assumption, because even now we take the spinlock and decide to mark the slot dirty, and after that we release the spinlock; if we do not have a guarantee that it cannot change concurrently, then many things can go wrong, no?

Anyway, having said that, I do not have any strong objection to what we are doing now. There was discussion around restructuring the code so that it can use a common function, and I was suggesting how that could be achieved, but I am not against the current way either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 6, 2023 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I agree with that, but even now it is not true that we always read this under the spinlock; for example [1][2], we can see we read it without the spinlock.
>
> [1]
> StartLogicalReplication
> {
>     /*
>      * Report the location after which we'll send out further commits as the
>      * current sentPtr.
>      */
>     sentPtr = MyReplicationSlot->data.confirmed_flush;
> }
>
> [2]
> LogicalIncreaseRestartDecodingForSlot
> {
>     /* candidates are already valid with the current flush position, apply */
>     if (updated_lsn)
>         LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
> }
>

These are accessed only in the walsender, and confirmed_flush is always updated by the walsender. So, this is clearly okay.

> That's true, but we are already making that assumption, because even now we take the spinlock and decide to mark the slot dirty, and after that we release the spinlock; if we do not have a guarantee that it cannot change concurrently, then many things can go wrong, no?
>

Also, note that the invalidated field could be updated by the startup process, but that is only possible on a standby, so it is safe; but again, that would be another assumption.

> Anyway, having said that, I do not have any strong objection to what we are doing now. There was discussion around restructuring the code so that it can use a common function, and I was suggesting how that could be achieved, but I am not against the current way either.
>

Okay, thanks for looking into it.

--
With Regards,
Amit Kapila.
On Wed, Sep 6, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> These are accessed only in the walsender, and confirmed_flush is always updated by the walsender. So, this is clearly okay.

Hmm, that's a valid point.

> Also, note that the invalidated field could be updated by the startup process, but that is only possible on a standby, so it is safe; but again, that would be another assumption.

Okay, so I also agree to go with the current patch, because, as you said above, if we accessed this without a spinlock outside the walsender then we would be making a new exception, and I agree with the decision not to make a new exception.

Other than that, the patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Other than that, the patch LGTM.

pgindent reported that the newly added comments need to be re-adjusted; here is an updated patch for the same. I also verified the following:
a) the patch applies neatly on HEAD
b) make check-world passes
c) pgindent looks good
d) pgperltidy was fine
e) meson test runs were successful.
Also checked that the CFBot run was fine for the last patch.

Regards,
Vignesh
Attachment
On Thu, Sep 7, 2023 at 10:13 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Other than that, the patch LGTM.
>
> pgindent reported that the newly added comments need to be re-adjusted; here is an updated patch for the same.
>

Thanks, the patch looks good to me as well.

--
With Regards,
Amit Kapila.
On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> Thanks, the patch looks good to me as well.

+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;

This does not feel sufficient, as the comment explaining what this variable does uses the same terms as the variable name (aka it is the last save of the confirmed_lsn). Why is it here and why is it useful? In which context and/or code paths is it used? Okay, there are some explanations when saving a slot, restoring a slot or when a checkpoint processes slots, but it seems important to me to document more things in ReplicationSlot, where this is defined.

(Just passing by, I have not checked the patch logic in detail, but that looks under-documented to me.)

--
Michael
Attachment
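Concretely, the kind of expanded field comment being requested might read as in the sketch below. This is only an illustration of the request, not the wording that was ultimately committed, which was still being negotiated at this point in the thread:

    typedef struct ReplicationSlot
    {
        ...

        /*
         * The confirmed_flush LSN at the time this slot's data was last
         * saved to disk.  Used during a shutdown checkpoint to decide
         * whether a logical slot must be flushed again even though it was
         * not otherwise marked dirty, so that an advance of confirmed_flush
         * is not lost across a clean restart.
         */
        XLogRecPtr  last_saved_confirmed_flush;
    } ReplicationSlot;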
On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:
> > Thanks, the patch looks good to me as well.
>
> +   /* This is used to track the last saved confirmed_flush LSN value */
> +   XLogRecPtr  last_saved_confirmed_flush;
>
> This does not feel sufficient, as the comment explaining what this variable does uses the same terms as the variable name (aka it is the last save of the confirmed_lsn). Why is it here and why is it useful? In which context and/or code paths is it used? Okay, there are some explanations when saving a slot, restoring a slot or when a checkpoint processes slots, but it seems important to me to document more things in ReplicationSlot, where this is defined.
>

Hmm, this is quite debatable, as different people feel differently about this. The patch author kept it where it is now, but in one of my revisions, I rewrote it and added it in ReplicationSlot. Then Ashutosh argued that it is better to keep it near where we are saving the slot (aka where the patch has it) [1]. Anyway, as I also preferred the core part of the theory about this variable to be in ReplicationSlot, I'll move it there before commit unless someone argues against it or has any other comments.

[1] - https://www.postgresql.org/message-id/CAExHW5uXq_CU80XJtmWbPJinRjJx54kbQJ9DT%3DUFySKXpjVwrw%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Thu, Sep 7, 2023 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Hmm, this is quite debatable, as different people feel differently about this. The patch author kept it where it is now, but in one of my revisions, I rewrote it and added it in ReplicationSlot. Then Ashutosh argued that it is better to keep it near where we are saving the slot (aka where the patch has it) [1]. Anyway, as I also preferred the core part of the theory about this variable to be in ReplicationSlot, I'll move it there before commit unless someone argues against it or has any other comments.

If we want it to be in ReplicationSlot, I suggest we just say: "saves the last confirmed flush LSN, to detect any divergence between the in-memory and on-disk confirmed flush LSN cheaply". When to detect that divergence, and what to do when there is divergence, should be documented at the relevant places in the code. If, in the future, we expand when and how we use this variable, a longer comment in ReplicationSlot would become insufficient anyway. We follow this commenting style in several places, e.g.

    /* any outstanding modifications? */
    bool        just_dirtied;
    bool        dirty;

How and when these variables are used is commented upon in the relevant code.

     * This needn't actually be part of a checkpoint, but it's a convenient
-    * location.
+    * location. is_shutdown is true in case of a shutdown checkpoint.

Relying on the first sentence, if we decided not to persist the replication slots at the time of checkpoint, would that be OK? It doesn't look like a convenience thing to me any more.

--
Best Wishes,
Ashutosh Bapat
On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
>      * This needn't actually be part of a checkpoint, but it's a convenient
> -    * location.
> +    * location. is_shutdown is true in case of a shutdown checkpoint.
>
> Relying on the first sentence, if we decided not to persist the replication slots at the time of checkpoint, would that be OK? It doesn't look like a convenience thing to me any more.
>

Instead of removing that comment, how about something like this: "This needn't actually be part of a checkpoint except for the shutdown checkpoint, but it's a convenient location."?

--
With Regards,
Amit Kapila.
On Thu, Sep 7, 2023 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Instead of removing that comment, how about something like this: "This needn't actually be part of a checkpoint except for the shutdown checkpoint, but it's a convenient location."?
>

I find the wording a bit awkward. My version would be "Checkpoint is a convenient location to persist all the slots. But in a shutdown checkpoint, indicated by is_shutdown = true, we also update confirmed_flush." But please feel free to choose whichever version you are comfortable with.

--
Best Wishes,
Ashutosh Bapat
On Thu, Sep 7, 2023 at 4:30 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I find the wording a bit awkward. My version would be "Checkpoint is a convenient location to persist all the slots. But in a shutdown checkpoint, indicated by is_shutdown = true, we also update confirmed_flush." But please feel free to choose whichever version you are comfortable with.
>

Saying that we also update confirmed_flush appears unclear to me. So, I tried another version by changing the entire comment to: "Normally, we can flush dirty replication slots at regular intervals by any background process like bgwriter, but checkpoint is a convenient location to persist them. Additionally, in case of a shutdown checkpoint, we also identify the slots for which confirmed_flush has been updated since the last time it was persisted and flush them."

--
With Regards,
Amit Kapila.
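Rendered as the function's header comment, that suggestion would read roughly as below. This is only the wording proposed at this point in the thread; the bgwriter sentence is objected to and dropped again further downthread:

    /*
     * Flush all replication slots to disk.
     *
     * Normally, we can flush dirty replication slots at regular intervals
     * by any background process like bgwriter, but checkpoint is a
     * convenient location to persist them.  Additionally, in case of a
     * shutdown checkpoint, we also identify the slots for which
     * confirmed_flush has been updated since the last time it was
     * persisted and flush them.
     */
    void
    CheckPointReplicationSlots(bool is_shutdown)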
On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> If we want it to be in ReplicationSlot, I suggest we just say: "saves the last confirmed flush LSN, to detect any divergence between the in-memory and on-disk confirmed flush LSN cheaply".
>

I have added something along these lines and also changed the other comment pointed out by you. In passing, I made minor cosmetic changes as well.

--
With Regards,
Amit Kapila.
Attachment
On Fri, Sep 08, 2023 at 09:04:43AM +0530, Amit Kapila wrote:
> I have added something along these lines and also changed the other comment pointed out by you. In passing, I made minor cosmetic changes as well.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I don't see a need to refer to the bgwriter here. On the contrary, it can be confusing, as one could think that this flush happens in the bgwriter, but that's not the case currently, as only the checkpointer does that.

+ * We won't ensure that the slot is persisted after the

s/persisted/flushed/? Or just refer to the "slot's data being flushed", or refer to "the slot's data is made durable" instead? The use of "persist" here is confusing, because a slot's persistence refers to it being a *candidate* for flush (compared to an ephemeral slot), and it does not refer to the *fact* of flushing its data to make sure that it survives a crash. In the context of this patch, the LSN value tracked in the slot's in-memory data refers to the last point where the slot's data has been flushed.

+   /*
+    * This is used to track the last persisted confirmed_flush LSN value to
+    * detect any divergence in the in-memory and on-disk values for the same.
+    */

"This value tracks the last LSN where this slot's data has been flushed to disk. This is used during a checkpoint shutdown to decide if a logical slot's data should be forcibly flushed or not."

Hmm. WAL senders are shut down *after* the checkpointer and *after* the shutdown checkpoint is finished (see PM_SHUTDOWN and PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the checkpoint record before shutting down the primary. In order to limit the number of records to work on after a restart, what this patch is proposing is an improvement. Perhaps it would be better to document that we don't care about the potential concurrent activity of logical WAL senders in this case, and that the LSN we are saving is a best effort, meaning that last_saved_confirmed_flush is just here to reduce the damage on a follow-up restart? The comment added in CheckPointReplicationSlots() goes in this direction, but perhaps this potential concurrent activity should be mentioned?

--
Michael
Attachment
On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> +   /*
> +    * This is used to track the last persisted confirmed_flush LSN value to
> +    * detect any divergence in the in-memory and on-disk values for the same.
> +    */
>
> "This value tracks the last LSN where this slot's data has been flushed to disk.
>

This makes the comment vague, as it sounds like we are saving a slot corresponding to some LSN, which is not the case. If you prefer this tone, then we can instead say: "This value tracks the last confirmed_flush LSN flushed, which is used during a checkpoint shutdown to decide if a logical slot's data should be forcibly flushed or not."

> This is used during a checkpoint shutdown to decide if a logical slot's data should be forcibly flushed or not."
>
> Hmm. WAL senders are shut down *after* the checkpointer and *after* the shutdown checkpoint is finished (see PM_SHUTDOWN and PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the checkpoint record before shutting down the primary.
>

As per my understanding, this is not true for logical walsenders. As per the code, in HandleCheckpointerInterrupts(), we call ShutdownXLOG(), which sends a signal to the walsenders to stop and waits for them to stop, and only after that does it write the shutdown checkpoint WAL record. After getting the InitStopping signal, a walsender sets the got_STOPPING flag. Then a *logical* walsender ensures that it sends all the pending WAL and exits. What you have quoted is probably true for physical walsenders.

> In order to limit the number of records to work on after a restart, what this patch is proposing is an improvement. Perhaps it would be better to document that we don't care about the potential concurrent activity of logical WAL senders in this case, and that the LSN we are saving is a best effort, meaning that last_saved_confirmed_flush is just here to reduce the damage on a follow-up restart?
>

Unless I am wrong, there shouldn't be any concurrent activity for logical walsenders. IIRC, it is a mandatory requirement for logical walsenders to stop before the shutdown checkpoint to avoid a panic error. We do handle logical walsenders differently because they can generate WAL during decoding.

> The comment added in CheckPointReplicationSlots() goes in this direction, but perhaps this potential concurrent activity should be mentioned?
>

Sure, we can change it if required.

--
With Regards,
Amit Kapila.
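A condensed view of the ordering described above, paraphrased from ShutdownXLOG() in xlog.c rather than quoted verbatim, so treat it as a sketch:

    /* Checkpointer, acting on a shutdown request: */
    WalSndInitStopping();   /* signal all walsenders to enter stopping
                             * mode; logical walsenders send any pending
                             * WAL and then exit */
    WalSndWaitStopping();   /* wait until walsenders have exited or, for
                             * physical ones, entered the stopping state */
    CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
                            /* the shutdown checkpoint record is written
                             * only after logical walsenders are gone;
                             * physical walsenders stay until they have
                             * replicated this record */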
On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:
> On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > "This value tracks the last LSN where this slot's data has been flushed to disk.
> >
>
> This makes the comment vague, as it sounds like we are saving a slot corresponding to some LSN, which is not the case. If you prefer this tone, then we can instead say: "This value tracks the last confirmed_flush LSN flushed, which is used during a checkpoint shutdown to decide if a logical slot's data should be forcibly flushed or not."

Okay, that looks like an improvement over the term "persisted".

> > Hmm. WAL senders are shut down *after* the checkpointer and *after* the shutdown checkpoint is finished (see PM_SHUTDOWN and PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the checkpoint record before shutting down the primary.
> >
>
> As per my understanding, this is not true for logical walsenders. As per the code, in HandleCheckpointerInterrupts(), we call ShutdownXLOG(), which sends a signal to the walsenders to stop and waits for them to stop, and only after that does it write the shutdown checkpoint WAL record. After getting the InitStopping signal, a walsender sets the got_STOPPING flag. Then a *logical* walsender ensures that it sends all the pending WAL and exits. What you have quoted is probably true for physical walsenders.

Hm, reminding me about this area.. This roots down to the handling of WalSndCaughtUp in the send_data callback for logical or physical. This is switched to true for logical WAL senders much earlier than for physical WAL senders, aka before the shutdown checkpoint begins in the latter case. What was itching me a bit is that the postmaster logic could be made more solid. Logical and physical WAL senders are both marked as BACKEND_TYPE_WALSND, but we don't actually check that the WAL senders remaining at the end of PM_SHUTDOWN_2 are *not* connected to a database. This would require a new BACKEND_TYPE_* perhaps, or perhaps we're fine with the current state because we'll catch problems in the checkpointer if any WAL is generated while the shutdown checkpoint is running anyway. Just something I got in mind, unrelated to this patch.

> Unless I am wrong, there shouldn't be any concurrent activity for logical walsenders. IIRC, it is a mandatory requirement for logical walsenders to stop before the shutdown checkpoint to avoid a panic error. We do handle logical walsenders differently because they can generate WAL during decoding.

Yeah. See above.

> > The comment added in CheckPointReplicationSlots() goes in this direction, but perhaps this potential concurrent activity should be mentioned?
>
> Sure, we can change it if required.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I still don't quite see a need to mention the bgwriter at all here.. That's just unrelated.

The comment block in CheckPointReplicationSlots() from v10 uses "persist", but you mean "flush", I guess..

--
Michael
Attachment
On Mon, Sep 11, 2023 at 12:08 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:
> > This makes the comment vague, as it sounds like we are saving a slot corresponding to some LSN, which is not the case. If you prefer this tone, then we can instead say: "This value tracks the last confirmed_flush LSN flushed, which is used during a checkpoint shutdown to decide if a logical slot's data should be forcibly flushed or not."
>
> Okay, that looks like an improvement over the term "persisted".
>

Changed accordingly.

> Hm, reminding me about this area.. [...] This would require a new BACKEND_TYPE_* perhaps, or perhaps we're fine with the current state because we'll catch problems in the checkpointer if any WAL is generated while the shutdown checkpoint is running anyway. Just something I got in mind, unrelated to this patch.
>

I don't know if the difference is worth inventing a new BACKEND_TYPE_*, but if you think so, then we can probably discuss this in a new thread. I think we may want to improve some comments as a separate patch to make this evident.

> + * We can flush dirty replication slots at regular intervals by any
> + * background process like bgwriter but checkpoint is a convenient location.
>
> I still don't quite see a need to mention the bgwriter at all here.. That's just unrelated.
>

I don't disagree with it, so I have changed it in the attached patch.

> The comment block in CheckPointReplicationSlots() from v10 uses "persist", but you mean "flush", I guess..
>

This point is not very clear to me. Can you please quote the exact comment if you think something needs to be changed?

--
With Regards,
Amit Kapila.
Attachment
On Mon, Sep 11, 2023 at 02:49:49PM +0530, Amit Kapila wrote:
> I don't know if the difference is worth inventing a new BACKEND_TYPE_*, but if you think so, then we can probably discuss this in a new thread. I think we may want to improve some comments as a separate patch to make this evident.

The comments in postmaster.c could be improved, at least. There is no need to discuss that here.

> This point is not very clear to me. Can you please quote the exact comment if you think something needs to be changed?

Hmm. Don't think that's it yet.. Please see the v11 attached, which rewords all the places of the patch that need clarification, IMO. I've found the comment additions in CheckPointReplicationSlots() to be overcomplicated:
- The key point of forcing a flush of a slot, if its confirmed_lsn has moved ahead of the last LSN where it was saved, is to make the follow-up restart more responsive.
- Not sure that there is any point in mentioning the other code paths in the tree where ReplicationSlotSave() can be called, and a slot can be saved in other processes than just WAL senders (like slot manipulations in normal backends, for one). This was the last sentence in v10.
- "Persist" is incorrect in this context in the tests, slot.c and slot.h, as it should refer to the slot's data being flushed, saved or just "made durable", because this is what the new last saved LSN is here for. Persistence is a slot property, and does not refer to the fact of flushing the data, IMO.

+ if (s->data.invalidated == RS_INVAL_NONE &&
+     s->data.confirmed_flush != s->last_saved_confirmed_flush)

Actually, this is incorrect, no? Shouldn't we make sure that the confirmed_flush is strictly higher than the last saved LSN?

--
Michael
Attachment
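The difference being raised here is between the patch's inequality test and a stricter ordering test, roughly as follows (a sketch of the two variants, not committed code):

    /* v11 as posted: flush on any divergence from the saved value */
    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush != s->last_saved_confirmed_flush)

    /* stricter variant: flush only if the LSN has moved forward */
    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush > s->last_saved_confirmed_flush)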
On Tue, Sep 12, 2023 at 10:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Please see the v11 attached, which rewords all the places of the patch that need clarification, IMO. I've found the comment additions in CheckPointReplicationSlots() to be overcomplicated:
> - The key point of forcing a flush of a slot, if its confirmed_lsn has moved ahead of the last LSN where it was saved, is to make the follow-up restart more responsive.
>

I don't think it will become more responsive in any way; not sure what made you think like that. The key idea is that after a restart we want to ensure that all the WAL data up to the shutdown checkpoint record is sent to the downstream. As mentioned in the commit message, this will help in ensuring that upgrades don't miss any data, and then there is another small advantage as mentioned in the commit message.

> - Not sure that there is any point in mentioning the other code paths in the tree where ReplicationSlotSave() can be called, and a slot can be saved in other processes than just WAL senders (like slot manipulations in normal backends, for one). This was the last sentence in v10.
>

The point was that the earlier sentence is no longer true, and keeping it as it is could be wrong or at least misleading. For example, earlier it was okay to say, "This needn't actually be part of a checkpoint, ..." but now that is no longer true, as we want to invoke this at the time of the shutdown checkpoint for correctness. If we want to be precise, we can say, "It is convenient to flush dirty replication slots at the time of checkpoint. Additionally, .."

> + if (s->data.invalidated == RS_INVAL_NONE &&
> +     s->data.confirmed_flush != s->last_saved_confirmed_flush)
>
> Actually, this is incorrect, no? Shouldn't we make sure that the confirmed_flush is strictly higher than the last saved LSN?
>

I can't see why it is incorrect. Do you see how (in what scenario) it could go wrong? As per my understanding, confirmed_flush LSN will always be greater than or equal to last_saved_confirmed_flush, but we don't want to ensure that point here, because we just want that, if the latest value is not the same, we mark the slot dirty and flush it, as that will be the location we have ensured to update before walsender shutdown. I think it is better to add an assert if you are worried about any such case; we had thought of adding it as well, but then didn't do it because we don't have matching asserts to ensure that we never assign a prior LSN value to confirmed_flush LSN.

+   /*
+    * LSN used to track the last confirmed_flush LSN where the slot's data
+    * has been flushed to disk.
+    */
+   XLogRecPtr  last_saved_confirmed_flush;

I don't want to argue on such a point because it is a little bit of a matter of personal choice, but I find this comment unclear. It seems to read that confirmed_flush LSN is some LSN position which is where we flushed the slot's data, and that is not true. I found the last comment in the patch sent by me: "This value tracks the last confirmed_flush LSN flushed, which is used during a shutdown checkpoint to decide if a logical slot's data should be forcibly flushed or not." which I feel we agreed upon is better.

--
With Regards,
Amit Kapila.
On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> I don't think it will become more responsive in any way; not sure what made you think like that. The key idea is that after a restart we want to ensure that all the WAL data up to the shutdown checkpoint record is sent to the downstream. As mentioned in the commit message, this will help in ensuring that upgrades don't miss any data, and then there is another small advantage as mentioned in the commit message.

Good thing I did not use the term "responsive" in the previous patch I posted. My apologies if you found that confusing. Let's say, "to prevent unnecessary retreat", then ;)

> The point was that the earlier sentence is no longer true, and keeping it as it is could be wrong or at least misleading. For example, earlier it was okay to say, "This needn't actually be part of a checkpoint, ..." but now that is no longer true, as we want to invoke this at the time of the shutdown checkpoint for correctness.

How so? This is just a reference to the fact that using a checkpoint path for this stuff is useful. A shutdown checkpoint is still a checkpoint, done by the checkpointer. The background writer is not concerned by that.

> If we want to be precise, we can say, "It is convenient to flush dirty replication slots at the time of checkpoint. Additionally, .."

Okay by me to reword the top comment of CheckPointReplicationSlots() to use these terms, if you feel strongly about it.

> I can't see why it is incorrect. Do you see how (in what scenario) it could go wrong? As per my understanding, confirmed_flush LSN will always be greater than or equal to last_saved_confirmed_flush, but we don't want to ensure that point here, because we just want that, if the latest value is not the same, we mark the slot dirty and flush it, as that will be the location we have ensured to update before walsender shutdown. I think it is better to add an assert if you are worried about any such case; we had thought of adding it as well, but then didn't do it because we don't have matching asserts to ensure that we never assign a prior LSN value to confirmed_flush LSN.

Because that's just safer in the long run, and I don't see why we cannot just do that? Imagine, for instance, that a bug doing an incorrect manipulation of a logical slot's data does an incorrect computation of this field, and that we finish with in-memory data that's older than what was previously saved. The code may cause a flush at an incorrect, past, position. That's just an assumption from my side, of course.

> I don't want to argue on such a point because it is a little bit of a matter of personal choice, but I find this comment unclear. It seems to read that confirmed_flush LSN is some LSN position which is where we flushed the slot's data, and that is not true. I found the last comment in the patch sent by me: "This value tracks the last confirmed_flush LSN flushed, which is used during a shutdown checkpoint to decide if a logical slot's data should be forcibly flushed or not." which I feel we agreed upon is better.

Okay, fine by me here.

--
Michael
Attachment
On Wed, Sep 13, 2023 at 10:57 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> > The point was that the earlier sentence is no longer true, and keeping it as it is could be wrong or at least misleading. For example, earlier it was okay to say, "This needn't actually be part of a checkpoint, ..." but now that is no longer true, as we want to invoke this at the time of the shutdown checkpoint for correctness.
>
> How so?
>

Consider if we moved this call to the bgwriter (aka flushing slots would no longer be part of a checkpoint); would that be okay? Previously, I think it was okay, but not now. I see an argument to keep that as it is as well, because we have already mentioned the special shutdown checkpoint case. By the way, I have changed this because Ashutosh felt it is no longer correct to keep the first sentence as it is. See his email [1] (Relying on the first sentence, ...). It has happened previously as well that different reviewers working in this area have different views on this sort of thing. I am trying my best to address the review comments, especially from experienced hackers, but personally, I feel this is a minor nitpick and isn't worth too much argument; either way should be okay.

> > I can't see why it is incorrect. Do you see how (in what scenario) it could go wrong? [...] I think it is better to add an assert if you are worried about any such case; we had thought of adding it as well, but then didn't do it because we don't have matching asserts to ensure that we never assign a prior LSN value to confirmed_flush LSN.
>
> Because that's just safer in the long run, and I don't see why we cannot just do that? Imagine, for instance, that a bug doing an incorrect manipulation of a logical slot's data does an incorrect computation of this field, and that we finish with in-memory data that's older than what was previously saved. The code may cause a flush at an incorrect, past, position. That's just an assumption from my side, of course.
>

If you are worried about such bugs, it would be better to have an Assert, as suggested previously, rather than a greater-than check, because then we will at least catch such bugs; otherwise, they can go unnoticed or, in the worst case, will lead to unknown consequences. I am saying this because if there are such bugs (or they get introduced later), then the slot can be flushed with a prior confirmed_flush location even from other code paths.

Just for reference, we don't have any check ensuring that the confirmed_flush LSN cannot move backward in LogicalConfirmReceivedLocation(); see that function, and also the other place where we update it:

    else
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.confirmed_flush = lsn;
        SpinLockRelease(&MyReplicationSlot->mutex);
    }

As other places don't have an assert, I didn't add one here, but we can add one here.

> > I don't want to argue on such a point because it is a little bit of a matter of personal choice, but I find this comment unclear. [...] I found the last comment in the patch sent by me: "This value tracks the last confirmed_flush LSN flushed, which is used during a shutdown checkpoint to decide if a logical slot's data should be forcibly flushed or not." which I feel we agreed upon is better.
>
> Okay, fine by me here.
>

Thanks, will change once we agree on the remaining points.

--
With Regards,
Amit Kapila.
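With the assertion mentioned here, the shutdown-time check would look roughly like the sketch below. Whether an assertion or a stricter run-time check ended up in the final commit is settled only downthread, so this is just the variant under discussion:

    if (is_shutdown && SlotIsLogical(s))
    {
        SpinLockAcquire(&s->mutex);
        if (s->data.invalidated == RS_INVAL_NONE &&
            s->data.confirmed_flush != s->last_saved_confirmed_flush)
        {
            /* confirmed_flush should never retreat past the saved value */
            Assert(s->data.confirmed_flush >= s->last_saved_confirmed_flush);
            s->just_dirtied = true;
            s->dirty = true;
        }
        SpinLockRelease(&s->mutex);
    }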
On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> Consider if we moved this call to the bgwriter (aka flushing slots would no longer be part of a checkpoint); would that be okay? Previously, I think it was okay, but not now. I see an argument to keep that as it is as well, because we have already mentioned the special shutdown checkpoint case. By the way, I have changed this because Ashutosh felt it is no longer correct to keep the first sentence as it is. See his email [1] (Relying on the first sentence, ...).

Hmmm.. Okay..

> As other places don't have an assert, I didn't add one here, but we can add one here.

I'd be OK with an assertion here at the end, though I'd still choose a stricter run-time check if I were to apply that myself.

--
Michael
Attachment
On Wed, Sep 13, 2023 at 12:45 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> > Consider if we moved this call to the bgwriter (aka flushing slots would no longer be part of a checkpoint); would that be okay? Previously, I think it was okay, but not now. I see an argument to keep that as it is as well, because we have already mentioned the special shutdown checkpoint case. By the way, I have changed this because Ashutosh felt it is no longer correct to keep the first sentence as it is. See his email [1] (Relying on the first sentence, ...).
>
> Hmmm.. Okay..
>

The patch is updated as per the recent discussion.

--
With Regards,
Amit Kapila.
Attachment
On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> The patch is updated as per the recent discussion.

WFM. Thanks for the updated version.

--
Michael
Attachment
On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> > The patch is updated as per the recent discussion.
>
> WFM. Thanks for the updated version.
>

Pushed.

--
With Regards,
Amit Kapila.
On Thu, Sep 14, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > WFM. Thanks for the updated version.
> >
>
> Pushed.

The commitfest entry https://commitfest.postgresql.org/44/4536/ is in the "Ready for Committer" state. Is there something remaining here? We should probably set it to "Committed".

--
Best Wishes,
Ashutosh Bapat
On Wed, Sep 20, 2023 at 04:48:00PM +0530, Ashutosh Bapat wrote:
> The commitfest entry https://commitfest.postgresql.org/44/4536/ is in the "Ready for Committer" state. Is there something remaining here? We should probably set it to "Committed".

Thanks, I've switched that to "Committed".

--
Michael