Re: Reviving lost replication slots - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Reviving lost replication slots
Date
Msg-id CALj2ACWeT-HxX0xtHY-5e+aEBXGA-7s6hkstr1-8V55v5YK-2A@mail.gmail.com
Whole thread Raw
In response to Re: Reviving lost replication slots  (sirisha chamarthi <sirichamarthi22@gmail.com>)
Responses Re: Reviving lost replication slots
List pgsql-hackers
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL
tocatch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and
thiscan take longer. I am investigating the options to revive a lost slot. 
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
>  Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.

If the max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy and cheap though), the chances of hitting the "lost replication"
problem becomes less, but not zero always.

>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost
slots.Do you see a need to take care of a specific slot? If the idea is not to download the wal files in the pg_wal
directory,they can be placed in a slot specific folder (data/pg_replslot/<slot>/) until they are needed while decoding
andcan be removed. 

Is the idea here the core copying back the WAL files from the archive?
If yes, I think it is not something the core needs to do. This very
well fits the job of an extension or an external module that revives
the lost replication slots by copying WAL files from archive location.

Having said above, what's the best way to revive a lost replication
slot today? Any automated way exists today? It seems like
pg_replication_slot_advance() doesn't do anything for the
invalidated/lost slots.

If it's a streaming replication slot, the standby will anyway jump to
archive mode ignoring the replication slot and the slot will never be
usable again unless somebody creates a new replication slot and
provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher and the slot will have to be revived
manually i.e. created again.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: rajesh singarapu
Date:
Subject: Re: Use proc instead of MyProc in ProcArrayGroupClearXid()/TransactionGroupUpdateXidStatus()
Next
From: Thomas Munro
Date:
Subject: Re: Suppressing useless wakeups in walreceiver