Re: Reviving lost replication slots - Mailing list pgsql-hackers

From sirisha chamarthi
Subject Re: Reviving lost replication slots
Date
Msg-id CAKrAKeUMHbM7gUDNa1zwugFak-FpqnF+ZwV29e8HRvytXPkWVw@mail.gmail.com
Whole thread Raw
In response to Re: Reviving lost replication slots  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Reviving lost replication slots
List pgsql-hackers


On Mon, Nov 7, 2022 at 11:17 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22@gmail.com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
>  Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.

If the max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy and cheap though), the chances of hitting the "lost replication"
problem becomes less, but not zero always.

pg_dump and pg_restore can take several hours to days on a large database. Keeping the WAL in the pg_wal folder (faster, smaller and costly disks?) is not always an option.
 

>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot? If the idea is not to download the wal files in the pg_wal directory, they can be placed in a slot specific folder (data/pg_replslot/<slot>/) until they are needed while decoding and can be removed.

Is the idea here the core copying back the WAL files from the archive?
If yes, I think it is not something the core needs to do. This very
well fits the job of an extension or an external module that revives
the lost replication slots by copying WAL files from archive location.
 
The current code is throwing an error that the slot is lost because the restart_lsn is set  to invalid LSN when the WAL is truncated by checkpointer. In order to build an external service that can revive a lost slot, at the minimum we needed the patch attached.
 

Having said above, what's the best way to revive a lost replication
slot today? Any automated way exists today? It seems like
pg_replication_slot_advance() doesn't do anything for the
invalidated/lost slots.
 
 If the WAL is available in the pg_wal directory, the replication stream resumes normally when the client connects with the patch I posted. 
 

If it's a streaming replication slot, the standby will anyway jump to
archive mode ignoring the replication slot and the slot will never be
usable again unless somebody creates a new replication slot and
provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher and the slot will have to be revived
manually i.e. created again.

Physical slots can be revived with standby downloading the WAL from the archive directly. This patch is helpful for the logical slots.
 

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

Previous
From: sirisha chamarthi
Date:
Subject: Re: Reviving lost replication slots
Next
From: Julien Rouhaud
Date:
Subject: Re: Allow file inclusion in pg_hba and pg_ident files