Thread: Reviving lost replication slots

Reviving lost replication slots

From
sirisha chamarthi
Date:
Hi,

A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch (pg_dump), and this can take a long time. I am investigating options to revive a lost slot. With the attached patch, and by copying the WAL files from the archive back into the pg_wal directory, I was able to revive a lost slot. I also verified that a lost slot does not let vacuum clean up the catalog tuples deleted by any transaction later than catalog_xmin. One side effect of this approach is that the checkpointer creates .ready files in the archive_status folder corresponding to the copied WAL files, so the archive_command has to handle this case. At the same time, the checkpointer can potentially delete a copied file again before the subscriber has consumed it. In the proposed patch, I am not setting restart_lsn to InvalidXLogRecPtr but instead relying on the invalidated_at field to tell whether the slot is lost. Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow reviving the slot?
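For reference, a minimal sketch of how to observe this state from SQL (PostgreSQL 13 or later; the setting value and the slot filter are illustrative):

-- Cap the WAL a slot may retain; a slot lagging past this is invalidated.
ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';
SELECT pg_reload_conf();

-- A lost slot shows wal_status = 'lost' and a cleared restart_lsn:
SELECT slot_name, wal_status, restart_lsn, safe_wal_size
FROM pg_replication_slots
WHERE wal_status = 'lost';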

If the overall direction seems OK, I will continue working on reviving the slot by copying the WAL files from the archive. I'd appreciate your feedback.

Thanks,
Sirisha
Attachment

Re: Reviving lost replication slots

From
Amit Kapila
Date:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL
> required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch
> (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>

Why would one set max_slot_wal_keep_size in the first place if they
care about WAL more than that? If you have a case where you want to
handle this for some particular slot (where you are okay with other
slots being invalidated on exceeding max_slot_wal_keep_size), then
another possibility could be a similar variable at the slot level, but
I am not sure that is a good idea because you haven't presented any
such case.
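For context, a sketch of the knob in question; by default it is -1, i.e., slots retain WAL without limit:

SHOW max_slot_wal_keep_size;
-- Default is -1: never invalidate slots to reclaim WAL.
ALTER SYSTEM SET max_slot_wal_keep_size = -1;
SELECT pg_reload_conf();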

--
With Regards,
Amit Kapila.



Re: Reviving lost replication slots

From
sirisha chamarthi
Date:
Hi Amit,

Thanks for your comments!

On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>

Why would one set max_slot_wal_keep_size in the first place if they
care about WAL more than that?
Disk full is a typical case, where we can't wait for the logical slots to catch up before truncating the log.
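For illustration, one way to watch the disk pressure that motivates this (pg_ls_waldir() requires superuser or pg_monitor membership):

-- Total size of WAL currently kept in pg_wal (segment files only):
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir()
WHERE name ~ '^[0-9A-F]{24}$';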

If you have a case where you want to
handle this for some particular slot (where you are okay with other
slots being invalidated on exceeding max_slot_wal_keep_size), then
another possibility could be a similar variable at the slot level, but
I am not sure that is a good idea because you haven't presented any
such case.
IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the lost slots. Do you see a need to handle a specific slot? If the idea is not to download the WAL files into the pg_wal directory, they can be placed in a slot-specific folder (data/pg_replslot/<slot>/) until they are needed for decoding, and removed afterwards.
 


Re: Reviving lost replication slots

From
Bharath Rupireddy
Date:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL
>> > required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch
>> > (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>> >
>>
>> Why would one set max_slot_wal_keep_size in the first place if they
>> care about WAL more than that?
>
>  Disk full is a typical case, where we can't wait for the logical slots to catch up before truncating the log.

If max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly, along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy or cheap), the chances of hitting the "lost replication slot"
problem become smaller, but never zero.
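As an aside, a hedged sketch of such monitoring via pg_replication_slots (PostgreSQL 13+; safe_wal_size goes to NULL once a slot is lost):

SELECT slot_name,
       wal_status,
       pg_size_pretty(safe_wal_size) AS margin_before_invalidation,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY safe_wal_size ASC NULLS FIRST;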

>> If you have a case where you want to
>> handle this for some particular slot (where you are okay with other
>> slots being invalidated on exceeding max_slot_wal_keep_size), then
>> another possibility could be a similar variable at the slot level,
>> but I am not sure that is a good idea because you haven't presented
>> any such case.
>
> IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the
> lost slots. Do you see a need to handle a specific slot? If the idea is not to download the WAL files into the pg_wal
> directory, they can be placed in a slot-specific folder (data/pg_replslot/<slot>/) until they are needed for decoding,
> and removed afterwards.

Is the idea here that the core copies the WAL files back from the
archive? If so, I think that is not something the core needs to do.
This fits very well as the job of an extension or an external module
that revives lost replication slots by copying WAL files from the
archive location.

Having said that, what's the best way to revive a lost replication
slot today? Does any automated way exist? It seems that
pg_replication_slot_advance() doesn't do anything for
invalidated/lost slots.
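For the record, attempting it today fails roughly as below, since a lost slot's restart_lsn has been cleared (slot name is illustrative; error text as observed on recent releases):

postgres=# SELECT pg_replication_slot_advance('s2', pg_current_wal_lsn());
ERROR:  cannot advance replication slot that has not previously reserved WAL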

If it's a streaming replication slot, the standby will anyway fall
back to archive mode, ignoring the replication slot, and the slot will
never be usable again unless somebody creates a new replication slot
and provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher, and the slot will have to be revived
manually, i.e., created again.
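For completeness, the manual revival amounts to recreating the slot and re-pointing the subscription, roughly as sketched below (subscription and slot names are hypothetical, and any changes the subscriber missed still have to be re-synced separately, e.g. by refreshing or reseeding the affected tables):

-- On the subscriber:
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
-- On the publisher:
SELECT pg_create_logical_replication_slot('mysub_slot2', 'pgoutput');
-- On the subscriber again:
ALTER SUBSCRIPTION mysub SET (slot_name = 'mysub_slot2');
ALTER SUBSCRIPTION mysub ENABLE;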

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reviving lost replication slots

From
Amit Kapila
Date:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL
>> > required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch
>> > (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>> >
>>
>> Why would one set max_slot_wal_keep_size in the first place if they
>> care about WAL more than that?
>
>  Disk full is a typical case, where we can't wait for the logical slots to catch up before truncating the log.
>

Ideally, in such a case the subscriber should fall back to the
physical standby of the publisher, but unfortunately we don't yet have
functionality where subscribers can continue logical replication from
a physical standby. Do you think such functionality would serve our
purpose?

>> If you have a case where you want to
>> handle this for some particular slot (where you are okay with other
>> slots being invalidated on exceeding max_slot_wal_keep_size), then
>> another possibility could be a similar variable at the slot level,
>> but I am not sure that is a good idea because you haven't presented
>> any such case.
>
> IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the
> lost slots. Do you see a need to handle a specific slot?
>

No, I was just trying to see if your use case could be addressed in
some other way. BTW, won't copying the WAL back from the archive again
lead to a disk-full situation?

--
With Regards,
Amit Kapila.



Re: Reviving lost replication slots

From
sirisha chamarthi
Date:


On Tue, Nov 8, 2022 at 1:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>> >
>>
>> Why would one set max_slot_wal_keep_size in the first place if they
>> care about WAL more than that?
>
>  Disk full is a typical case, where we can't wait for the logical slots to catch up before truncating the log.
>

Ideally, in such a case the subscriber should fall back to the
physical standby of the publisher, but unfortunately we don't yet have
functionality where subscribers can continue logical replication from
a physical standby. Do you think such functionality would serve our
purpose?
 
I don't think streaming from the standby helps, as the disk layout is expected to remain the same on the physical standby and the primary.

 
>> If you have a case where you want to
>> handle this for some particular slot (where you are okay with other
>> slots being invalidated on exceeding max_slot_wal_keep_size), then
>> another possibility could be a similar variable at the slot level,
>> but I am not sure that is a good idea because you haven't presented
>> any such case.
>
> IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the lost slots. Do you see a need to handle a specific slot?
>

No, I was just trying to see if your use case could be addressed in
some other way. BTW, won't copying the WAL back from the archive again
lead to a disk-full situation?
The idea is to download the WAL from the archive on demand as the slot requires it, and to throw away each segment once it is processed.
 


Re: Reviving lost replication slots

From
sirisha chamarthi
Date:


On Mon, Nov 7, 2022 at 11:17 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL required to catch up exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch (pg_dump), and this can take a long time. I am investigating options to revive a lost slot.
>> >
>>
>> Why would one set max_slot_wal_keep_size in the first place if they
>> care about WAL more than that?
>
>  Disk full is a typical case, where we can't wait for the logical slots to catch up before truncating the log.

If max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly, along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy or cheap), the chances of hitting the "lost replication slot"
problem become smaller, but never zero.

pg_dump and pg_restore can take several hours to days on a large database. Keeping all the WAL in the pg_wal folder (typically on faster, smaller, and costlier disks) is not always an option.
 

>> If you have a case where you want to
>> handle this for some particular slot (where you are okay with other
>> slots being invalidated on exceeding max_slot_wal_keep_size), then
>> another possibility could be a similar variable at the slot level,
>> but I am not sure that is a good idea because you haven't presented
>> any such case.
>
> IIUC, the ability to fetch WAL from the archive as a fallback mechanism should automatically take care of all the lost slots. Do you see a need to handle a specific slot? If the idea is not to download the WAL files into the pg_wal directory, they can be placed in a slot-specific folder (data/pg_replslot/<slot>/) until they are needed for decoding, and removed afterwards.

Is the idea here that the core copies the WAL files back from the
archive? If so, I think that is not something the core needs to do.
This fits very well as the job of an extension or an external module
that revives lost replication slots by copying WAL files from the
archive location.
 
The current code throws an error saying the slot is lost, because restart_lsn is set to the invalid LSN when the WAL is truncated by the checkpointer. To build an external service that can revive a lost slot, at a minimum we need the attached patch.
 

Having said that, what's the best way to revive a lost replication
slot today? Does any automated way exist? It seems that
pg_replication_slot_advance() doesn't do anything for
invalidated/lost slots.
 
With the patch I posted, if the WAL is available in the pg_wal directory, the replication stream resumes normally when the client connects.
 

If it's a streaming replication slot, the standby will anyway fall
back to archive mode, ignoring the replication slot, and the slot will
never be usable again unless somebody creates a new replication slot
and provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher, and the slot will have to be revived
manually, i.e., created again.

Physical slots can be revived by the standby downloading the WAL from the archive directly. This patch is helpful for logical slots.
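(For context, the standby-side fallback is the ordinary restore_command mechanism; a sketch with an illustrative command string follows. Before PostgreSQL 15 a restart, rather than a reload, is needed for restore_command to take effect.)

ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f "%p"';
SELECT pg_reload_conf();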
 


Re: Reviving lost replication slots

From
Kyotaro Horiguchi
Date:
I don't think walsenders fetching segments from the archive is totally
stupid. With that feature, we can use fast, expensive, but small
storage for pg_wal while keeping replication from dying even in an
emergency.

At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi <sirichamarthi22@gmail.com> wrote in 
> > If it's a streaming replication slot, the standby will anyway jump to
> > archive mode ignoring the replication slot and the slot will never be
> > usable again unless somebody creates a new replication slot and
> > provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher and the slot will have to be revived
> > manually i.e. created again.
> >
> 
> Physical slots can be revived by the standby downloading the WAL from
> the archive directly. This patch is helpful for logical slots.

However, supposing that WalSndSegmentOpen() fetches segments from the
archive as a fallback and that succeeds, the slot survives the missing
WAL in pg_wal in the first place. So this patch doesn't seem to be
needed for that purpose.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Reviving lost replication slots

From
Bharath Rupireddy
Date:
On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> I don't think walsenders fetching segments from the archive is totally
> stupid. With that feature, we can use fast, expensive, but small
> storage for pg_wal while keeping replication from dying even in an
> emergency.

It seems like a useful feature to have, at least as an option, and it
saves a lot of work - failovers, expensive rebuilds of
standbys/subscribers, manual interventions, etc.

If you're saying that even the walsenders serving logical replication
subscribers would go fetch from the archive location for the removed
WAL files, it mandates enabling archiving on the subscribers. And we
know that archiving is not cheap and has its own advantages and
disadvantages, so the feature may or may not help.
If you're saying that only the walsenders serving streaming
replication standbys would go fetch from the archive location for the
removed WAL files, that's easy to implement; however, it is not a
complete feature and doesn't solve the problem for logical
replication.
With the feature, it'll be something like 'you, as primary/publisher,
archive the WAL files and when you don't have them, you'll restore
them'; it may not sound elegant, however, it can solve the lost
replication slots problem.
Also, the cost of restoring WAL files from the archive might further
slow down replication, thus increasing the replication lag.
And one needs to think about how many such WAL files are restored and
kept, whether they'll be kept in pg_wal or some other directory, how
disk-full situations are handled, how fetching too-old or too many
WAL files for replication slots lagging far behind is bounded, how
unnecessary WAL files are removed, and so on.
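A sketch of the publisher-side prerequisite under discussion (command strings are illustrative; enabling archive_mode requires a server restart):

ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';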

I'm not sure about other implications at this point in time.

Perhaps implementing this feature as a core/external extension, by
introducing a segment_open() or other necessary hooks, might be worth it.

If this is implemented in some way, I think the scope of the
replication slot invalidation/max_slot_wal_keep_size feature gets
reduced, or it can be removed completely, no?

> However, supposing that WalSndSegmentOpen() fetches segments from the
> archive as a fallback and that succeeds, the slot survives the missing
> WAL in pg_wal in the first place. So this patch doesn't seem to be
> needed for that purpose.

That is a simple solution one can think of and provide for streaming
replication standbys; however, is it worth implementing in the core,
as explained above?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reviving lost replication slots

From
Amit Kapila
Date:
On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > I don't think walsenders fetching segments from the archive is totally
> > stupid. With that feature, we can use fast, expensive, but small
> > storage for pg_wal while keeping replication from dying even in an
> > emergency.
>
> It seems like a useful feature to have, at least as an option, and it
> saves a lot of work - failovers, expensive rebuilds of
> standbys/subscribers, manual interventions, etc.
>
> If you're saying that even the walsenders serving logical replication
> subscribers would go fetch from the archive location for the removed
> WAL files, it mandates enabling archiving on the subscribers.
>

Why is archiving on subscribers required? Won't it be sufficient if
that is enabled on the publisher, where we have the walsender?

-- 
With Regards,
Amit Kapila.



Re: Reviving lost replication slots

From
Bharath Rupireddy
Date:
On Wed, Nov 9, 2022 at 3:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > I don't think walsenders fetching segments from the archive is totally
> > > stupid. With that feature, we can use fast, expensive, but small
> > > storage for pg_wal while keeping replication from dying even in an
> > > emergency.
> >
> > It seems like a useful feature to have, at least as an option, and it
> > saves a lot of work - failovers, expensive rebuilds of
> > standbys/subscribers, manual interventions, etc.
> >
> > If you're saying that even the walsenders serving logical replication
> > subscribers would go fetch from the archive location for the removed
> > WAL files, it mandates enabling archiving on the subscribers.
> >
>
> Why is archiving on subscribers required? Won't it be sufficient if
> that is enabled on the publisher, where we have the walsender?

Ugh. A typo. I meant it mandates enabling archiving on publishers.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reviving lost replication slots

From
Amit Kapila
Date:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> Was the intent of setting restart_lsn to InvalidXLogRecPtr to
> disallow reviving the slot?
>

I think the intent is to compute the correct value for
replicationSlotMinLSN: we use restart_lsn for it, and using an
invalidated slot's restart_lsn value for that doesn't make sense.

-- 
With Regards,
Amit Kapila.



Re: Reviving lost replication slots

From
sirisha chamarthi
Date:


On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> Was the intent of setting restart_lsn to InvalidXLogRecPtr to
> disallow reviving the slot?
>

I think the intent is to compute the correct value for
replicationSlotMinLSN: we use restart_lsn for it, and using an
invalidated slot's restart_lsn value for that doesn't make sense.

Correct. If a slot is invalidated (lost), then shouldn't we exclude the slot when computing catalog_xmin? I don't see it being set to InvalidTransactionId in ReplicationSlotsComputeRequiredXmin. I've attached a small patch to address this; the output after the patch is shown below.

postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 | 0/30466368  | 0/304663A0          | reserved   |      28903824 | f
(1 row)

postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          2
(1 row)

postgres=# select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_pri
ority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------
------+------------+------------
(0 rows)

postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 |             | 0/304663A0          | lost       |               | f
(1 row)

postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          0
(1 row)
 


Re: Reviving lost replication slots

From
sirisha chamarthi
Date:


On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I don't think walsenders fetching segments from the archive is totally
stupid. With that feature, we can use fast, expensive, but small
storage for pg_wal while keeping replication from dying even in an
emergency.

Thanks! If there is general agreement on this in this forum, I would like to start working on this patch.
 

At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi <sirichamarthi22@gmail.com> wrote in
> > If it's a streaming replication slot, the standby will anyway fall
> > back to archive mode, ignoring the replication slot, and the slot will
> > never be usable again unless somebody creates a new replication slot
> > and provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher, and the slot will have to be revived
> > manually, i.e., created again.
> >
>
> Physical slots can be revived by the standby downloading the WAL from
> the archive directly. This patch is helpful for logical slots.

However, supposing that WalSndSegmentOpen() fetches segments from the
archive as a fallback and that succeeds, the slot survives the missing
WAL in pg_wal in the first place. So this patch doesn't seem to be
needed for that purpose.

Agreed. If we add the proposed support, we don't need this patch.
 



Re: Reviving lost replication slots

From
Amit Kapila
Date:
On Thu, Nov 10, 2022 at 4:07 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> Was the intent of setting restart_lsn to InvalidXLogRecPtr to
>> disallow reviving the slot?
>> >
>>
>> I think the intent is to compute the correct value for
>> replicationSlotMinLSN: we use restart_lsn for it, and using an
>> invalidated slot's restart_lsn value for that doesn't make sense.
>
>
> Correct. If a slot is invalidated (lost), then shouldn't we exclude the slot when computing catalog_xmin? I don't
> see it being set to InvalidTransactionId in ReplicationSlotsComputeRequiredXmin. I've attached a small patch to
> address this; the output after the patch is shown below.
>

I think you forgot to attach the patch. However, I suggest you start a
separate thread for this because the patch you are talking about here
seems to be for an existing problem.

--
With Regards,
Amit Kapila.



Re: Reviving lost replication slots

From
Bharath Rupireddy
Date:
On Thu, Nov 10, 2022 at 4:12 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>
>> I don't think walsenders fetching segment from archive is totally
>> stupid. With that feature, we can use fast and expensive but small
>> storage for pg_wal, while avoiding replciation from dying even in
>> emergency.
>
> Thanks! If there is general agreement on this in this forum, I would like to start working on this patch.

I think starting with establishing/summarizing the problem, design
approaches, implications etc. is a better idea than a patch. It might
invite more thoughts from the hackers.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com