Thread: Reviving lost replication slots
Hi,
A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded (pg_dump) from scratch, and this can take a long time. I am investigating options to revive a lost slot. With the attached patch, and by copying the WAL files from the archive back into the pg_wal directory, I was able to revive the lost slot. I also verified that a lost slot doesn't let vacuum clean up the catalog tuples deleted by any transaction later than catalog_xmin. One side effect of this approach is that the checkpointer creates .ready files in the archive_status folder corresponding to the copied WAL files; the archive command has to handle this case. At the same time, the checkpointer can potentially delete a copied file again before the subscriber has consumed it. In the proposed patch, I am not setting restart_lsn to InvalidXLogRecPtr but instead relying on the invalidated_at field to tell whether the slot is lost. Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow reviving the slot?
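For reference, this is roughly how the lost state surfaces at the SQL level (a minimal sketch; the 1GB cap is only an example value):

-- Cap how much WAL reserved slots may retain (example value only).
ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';
SELECT pg_reload_conf();

-- Once a slot falls behind by more than the cap, a later checkpoint invalidates it:
-- wal_status flips from 'reserved' to 'lost' and restart_lsn is cleared.
SELECT slot_name, slot_type, restart_lsn, wal_status, safe_wal_size
FROM pg_replication_slots;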
If the overall direction seems OK, I will continue the work to revive the slot by copying the WAL files from the archive. I appreciate your feedback.
Thanks,
Sirisha
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>
Why in the first place one has to set max_slot_wal_keep_size if they care for WAL more than that? If you have a case where you want to handle this case for some particular slot (where you are okay with the invalidation of other slots exceeding max_slot_wal_keep_size) then the other possibility could be to have a similar variable at the slot level but not sure if that is a good idea because you haven't presented any such case.
--
With Regards,
Amit Kapila.
Hi Amit,
Thanks for your comments!
On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>
Why in the first place one has to set max_slot_wal_keep_size if they
care for WAL more than that?
Disk full is a typical use case where we can't wait for the logical slots to catch up before truncating the log.
If you have a case where you want to
handle this case for some particular slot (where you are okay with the
invalidation of other slots exceeding max_slot_wal_keep_size) then the
other possibility could be to have a similar variable at the slot
level but not sure if that is a good idea because you haven't
presented any such case.
IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot? If the idea is not to download the WAL files into the pg_wal directory, they can be placed in a slot-specific folder (data/pg_replslot/<slot>/) until they are needed for decoding and then removed.
--
With Regards,
Amit Kapila.
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
> Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.

If the max_slot_wal_keep_size is set appropriately and the replication lag is monitored properly along with some automatic actions such as replacing/rebuilding the standbys or subscribers (which may not be easy and cheap though), the chances of hitting the "lost replication" problem becomes less, but not zero always.

>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot? If the idea is not to download the wal files in the pg_wal directory, they can be placed in a slot specific folder (data/pg_replslot/<slot>/) until they are needed while decoding and can be removed.

Is the idea here the core copying back the WAL files from the archive? If yes, I think it is not something the core needs to do. This very well fits the job of an extension or an external module that revives the lost replication slots by copying WAL files from archive location.

Having said above, what's the best way to revive a lost replication slot today? Any automated way exists today? It seems like pg_replication_slot_advance() doesn't do anything for the invalidated/lost slots.

If it's a streaming replication slot, the standby will anyway jump to archive mode ignoring the replication slot and the slot will never be usable again unless somebody creates a new replication slot and provides it to the standby for reuse. If it's a logical replication slot, the subscriber will start to diverge from the publisher and the slot will have to be revived manually i.e. created again.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
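A minimal sketch of the kind of lag monitoring mentioned above; thresholds and any automated actions (rebuilding standbys/subscribers, alerting) are left to the operator:

-- Per-slot view of how much WAL each slot still pins and how much headroom
-- remains before max_slot_wal_keep_size invalidates it (safe_wal_size, in bytes).
SELECT slot_name,
       slot_type,
       active,
       wal_status,
       safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;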
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
> Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.
>

Ideally, in such a case the subscriber should fall back to the physical standby of the publisher but unfortunately, we don't yet have a functionality where subscribers can continue logical replication from physical standby. Do you think if we had such functionality it would serve our purpose?

>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot?
>

No, I was just trying to see if your use case can be addressed in some other way. BTW, won't copying the WAL back from the archive again lead to a disk full situation?

--
With Regards,
Amit Kapila.
On Tue, Nov 8, 2022 at 1:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
> Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.
>
Ideally, in such a case the subscriber should fall back to the
physical standby of the publisher but unfortunately, we don't yet have
a functionality where subscribers can continue logical replication
from physical standby. Do you think if we had such functionality it
would serve our purpose?
I don't think streaming from the standby helps, as the disk layout is expected to remain the same on the physical standby and the primary.
>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot?
>
No, I was just trying to see if your use case can be addressed in some
other way. BTW, won't copying the WAL back from the archive again lead
to a disk full situation?
The idea is to download the WAL segments from the archive on demand as the slot requires them and throw each segment away once it is processed.
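As a rough illustration of the on-demand part, the first segment a slot needs can be derived from its restart_lsn (assuming, as in the proposed patch, that restart_lsn stays set even after the slot is marked lost); fetching it from the archive and cleaning it up afterwards would be the external tool's job:

-- Name of the WAL segment each slot would need first; the archive location
-- and the copy/cleanup mechanism are deployment-specific and not shown here.
SELECT slot_name,
       restart_lsn,
       pg_walfile_name(restart_lsn) AS first_needed_segment
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;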
--
With Regards,
Amit Kapila.
On Mon, Nov 7, 2022 at 11:17 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
> Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.
If the max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy and cheap though), the chances of hitting the "lost replication"
problem becomes less, but not zero always.
pg_dump and pg_restore can take several hours to days on a large database. Keeping all that WAL in the pg_wal folder (typically on faster, smaller, and costlier disks) is not always an option.
>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot? If the idea is not to download the wal files in the pg_wal directory, they can be placed in a slot specific folder (data/pg_replslot/<slot>/) until they are needed while decoding and can be removed.
Is the idea here the core copying back the WAL files from the archive?
If yes, I think it is not something the core needs to do. This very
well fits the job of an extension or an external module that revives
the lost replication slots by copying WAL files from archive location.
The current code throws an error that the slot is lost because restart_lsn is set to an invalid LSN when the WAL is truncated by the checkpointer. In order to build an external service that can revive a lost slot, at a minimum we need the attached patch.
Having said above, what's the best way to revive a lost replication
slot today? Any automated way exists today? It seems like
pg_replication_slot_advance() doesn't do anything for the
invalidated/lost slots.
With the patch I posted, if the WAL is available in the pg_wal directory, the replication stream resumes normally when the client connects.
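As a quick sanity check (a sketch that assumes the attached patch and a test_decoding slot named s2, as in the demo later in this thread), decoding can also be exercised from psql once the segments are back:

-- Peek at pending changes without advancing the slot; this still errors out
-- if the needed WAL is missing from pg_wal.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('s2', NULL, NULL)
LIMIT 10;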
If it's a streaming replication slot, the standby will anyway jump to
archive mode ignoring the replication slot and the slot will never be
usable again unless somebody creates a new replication slot and
provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher and the slot will have to be revived
manually i.e. created again.
Physical slots can be revived by the standby downloading the WAL directly from the archive. This patch is helpful for logical slots.
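For the physical-slot case, that fallback is just the standby's usual restore_command; a sketch, with the cp path being only a placeholder for the real archive location:

-- On the standby: fetch segments from the archive when they are no longer
-- available from the primary (placeholder path; adjust to the actual archive).
ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f "%p"';
SELECT pg_reload_conf();  -- picked up on reload in recent releases, restart on older ones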
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I don't think walsenders fetching segment from archive is totally stupid. With that feature, we can use fast and expensive but small storage for pg_wal, while avoiding replication from dying even in emergency.

At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi <sirichamarthi22@gmail.com> wrote in
> > If it's a streaming replication slot, the standby will anyway jump to
> > archive mode ignoring the replication slot and the slot will never be
> > usable again unless somebody creates a new replication slot and
> > provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher and the slot will have to be revived
> > manually i.e. created again.
> >
>
> Physical slots can be revived with standby downloading the WAL from the
> archive directly. This patch is helpful for the logical slots.

However, supposing that WalSndSegmentOpen() fetches segments from archive as the fallback and that succeeds, the slot can survive missing WAL in pg_wal in the first place. So this patch doesn't seem to be needed for the purpose.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I don't think walsenders fetching segment from archive is totally
> stupid. With that feature, we can use fast and expensive but small
> storage for pg_wal, while avoiding replication from dying even in
> emergency.

It seems like a useful feature to have at least as an option and it saves a lot of work - failovers, expensive rebuilds of standbys/subscribers, manual interventions etc.

If you're saying that even the walsenders serving logical replication subscribers would go fetch from the archive location for the removed WAL files, it mandates enabling archiving on the subscribers. And we know that archiving is not cheap and has its own advantages and disadvantages, so the feature may or may not help.

If you're saying that only the walsenders serving streaming replication standbys would go fetch from the archive location for the removed WAL files, it's easy to implement, however it is not a complete feature and doesn't solve the problem for logical replication.

With the feature, it'll be something like 'you, as primary/publisher, archive the WAL files and when you don't have them, you'll restore them'; it may not sound elegant, however, it can solve the lost replication slots problem. And, the cost of restoring WAL files from the archive might further slow down the replication, thus increasing the replication lag. And, one needs to think about how many such WAL files are restored and kept, whether they'll be kept in pg_wal or some other directory, and how disk full situations, fetching too old or too many WAL files for replication slots lagging behind, removal of unnecessary WAL files, etc. will be handled. I'm not sure about other implications at this point of time. Perhaps, implementing this feature as a core/external extension by introducing segment_open() or other necessary hooks might be worth it.

If implemented in some way, I think the scope of the replication slot invalidation/max_slot_wal_keep_size feature gets reduced or it can be removed completely, no?

> However, supposing that WalSndSegmentOpen() fetches segments from
> archive as the fallback and that succeeds, the slot can survive
> missing WAL in pg_wal in the first place. So this patch doesn't seem
> to be needed for the purpose.

That is a simple solution one can think of and provide for streaming replication standbys, however, is it worth implementing it in the core as explained above?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > I don't think walsenders fetching segment from archive is totally
> > stupid. With that feature, we can use fast and expensive but small
> > storage for pg_wal, while avoiding replication from dying even in
> > emergency.
>
> It seems like a useful feature to have at least as an option and it
> saves a lot of work - failovers, expensive rebuilds of
> standbys/subscribers, manual interventions etc.
>
> If you're saying that even the walsenders serving logical replication
> subscribers would go fetch from the archive location for the removed
> WAL files, it mandates enabling archiving on the subscribers.
>

Why is archiving on subscribers required? Won't it be sufficient if that is enabled on the publisher where we have the walsender?

--
With Regards,
Amit Kapila.
On Wed, Nov 9, 2022 at 3:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > I don't think walsenders fetching segment from archive is totally
> > > stupid. With that feature, we can use fast and expensive but small
> > > storage for pg_wal, while avoiding replication from dying even in
> > > emergency.
> >
> > It seems like a useful feature to have at least as an option and it
> > saves a lot of work - failovers, expensive rebuilds of
> > standbys/subscribers, manual interventions etc.
> >
> > If you're saying that even the walsenders serving logical replication
> > subscribers would go fetch from the archive location for the removed
> > WAL files, it mandates enabling archiving on the subscribers.
> >
>
> Why is archiving on subscribers required? Won't it be sufficient if
> that is enabled on the publisher where we have the walsender?

Ugh. A typo. I meant it mandates enabling archiving on publishers.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow reviving the slot?
>

I think the intent is to compute the correct value for replicationSlotMinLSN as we use restart_lsn for it and using the invalidated slot's restart_lsn value for it doesn't make sense.

--
With Regards,
Amit Kapila.
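At the SQL level, replicationSlotMinLSN corresponds roughly to the sketch below; because a lost slot's restart_lsn is cleared (NULL) today, it naturally drops out of the minimum:

-- Oldest WAL position any slot still needs; WAL from this point onwards is retained.
SELECT min(restart_lsn) AS oldest_required_restart_lsn
FROM pg_replication_slots;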
On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
Was the intent of setting restart_lsn to InvalidXLogRecPtr to
disallow reviving the slot?
>
I think the intent is to compute the correct value for
replicationSlotMinLSN as we use restart_lsn for it and using the
invalidated slot's restart_lsn value for it doesn't make sense.
Correct. If a slot is invalidated (lost), shouldn't we exclude the slot when computing the catalog_xmin? I don't see it being set to InvalidTransactionId in ReplicationSlotsComputeRequiredXmin. Attached is a small patch to address this; the output after the patch is shown below.
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | test_decoding | logical | 5 | postgres | f | f | | | 771 | 0/30466368 | 0/304663A0 | reserved | 28903824 | f
(1 row)
postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
n_dead_tup
------------
2
(1 row)
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_pri
ority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------
------+------------+------------
(0 rows)
postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | test_decoding | logical | 5 | postgres | f | f | | | 771 | | 0/304663A0 | lost | | f
(1 row)
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
n_dead_tup
------------
0
(1 row)
--
With Regards,
Amit Kapila.
On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I don't think walsenders fetching segment from archive is totally
stupid. With that feature, we can use fast and expensive but small
storage for pg_wal, while avoiding replication from dying even in
emergency.
Thanks! If there is general agreement on this in this forum, I would like to start working on this patch.
At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi <sirichamarthi22@gmail.com> wrote in
> > If it's a streaming replication slot, the standby will anyway jump to
> > archive mode ignoring the replication slot and the slot will never be
> > usable again unless somebody creates a new replication slot and
> > provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher and the slot will have to be revived
> > manually i.e. created again.
> >
>
> Physical slots can be revived with standby downloading the WAL from the
> archive directly. This patch is helpful for the logical slots.
However, supposing that WalSndSegmentOpen() fetches segments from
archive as the fallback and that succeeds, the slot can survive
missing WAL in pg_wal in the first place. So this patch doesn't seem
to be needed for the purpose.
Agreed. If we add the proposed support, we don't need this patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 10, 2022 at 4:07 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> Was the intent of setting restart_lsn to InvalidXLogRecPtr to
>> disallow reviving the slot?
>> >
>>
>> I think the intent is to compute the correct value for
>> replicationSlotMinLSN as we use restart_lsn for it and using the
>> invalidated slot's restart_lsn value for it doesn't make sense.
>
> Correct. If a slot is invalidated (lost), shouldn't we exclude the slot when computing the catalog_xmin? I don't see it being set to InvalidTransactionId in ReplicationSlotsComputeRequiredXmin. Attached is a small patch to address this; the output after the patch is shown below.
>

I think you forgot to attach the patch. However, I suggest you start a separate thread for this because the patch you are talking about here seems to be for an existing problem.

--
With Regards,
Amit Kapila.
On Thu, Nov 10, 2022 at 4:12 PM sirisha chamarthi <sirichamarthi22@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>
>> I don't think walsenders fetching segment from archive is totally
>> stupid. With that feature, we can use fast and expensive but small
>> storage for pg_wal, while avoiding replication from dying even in
>> emergency.
>
> Thanks! If there is general agreement on this in this forum, I would like to start working on this patch.

I think starting with establishing/summarizing the problem, design approaches, implications, etc. is a better idea than a patch. It might invite more thoughts from the hackers.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com