Reviving lost replication slots - Mailing list pgsql-hackers

From sirisha chamarthi
Subject Reviving lost replication slots
Date
Msg-id CAKrAKeW-sGqvkw-2zKuVYiVv=EOG4LEqJn01RJPsHfS2rQGYng@mail.gmail.com
List pgsql-hackers
Hi,

A replication slot can be lost when a subscriber cannot keep up with the load on the primary and the WAL it still needs exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded (pg_dump) from scratch, which can take a long time. I am investigating options to revive a lost slot.

With the attached patch, and by copying the WAL files from the archive to the pg_wal directory, I was able to revive a lost slot. I also verified that a lost slot does not let vacuum clean up catalog tuples deleted by any transaction later than catalog_xmin.

One side effect of this approach is that the checkpointer creates .ready files in the archive_status folder for the copied WAL files, so the archive command has to handle this case. The checkpointer can also potentially delete a copied file again before the subscriber has consumed it.

In the proposed patch, I am not setting restart_lsn to InvalidXLogRecPtr; instead I rely on the invalidated_at field to tell whether the slot is lost. Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow reviving the slot?
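To make the "copy WAL files from the archive" step concrete, here is a minimal sketch (not part of the patch) that lists the WAL segment file names covering the range between a slot's restart LSN and the current WAL position. It assumes the default 16 MB wal_segment_size and a single timeline; the function names are hypothetical and chosen only for illustration. The file-name layout follows the standard 24-hex-digit WAL segment naming.

```python
# Illustrative sketch only: helper names are hypothetical, not from the patch.
# Assumes the default 16 MB segment size; check wal_segment_size on your server.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'X/Y' hex notation to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def segment_name(timeline: int, segno: int,
                 seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-hex-digit WAL file name for a segment number."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline, segno // segs_per_xlogid, segno % segs_per_xlogid)


def segments_needed(restart_lsn: str, current_lsn: str, timeline: int = 1,
                    seg_size: int = WAL_SEGMENT_SIZE) -> list:
    """Names of all segments from the one holding restart_lsn through the
    one holding current_lsn (inclusive), assuming no timeline switch."""
    first = parse_lsn(restart_lsn) // seg_size
    last = parse_lsn(current_lsn) // seg_size
    return [segment_name(timeline, n, seg_size)
            for n in range(first, last + 1)]


def wal_lag_bytes(restart_lsn: str, current_lsn: str) -> int:
    """Bytes of WAL between the two positions; a rough number to compare
    against max_slot_wal_keep_size."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


if __name__ == "__main__":
    for name in segments_needed("0/FF000000", "1/1000000"):
        print(name)
```

Any of the listed files that are no longer present in pg_wal would have to be fetched from the archive before the slot could resume. The byte lag is only an approximation of the server's own check, which works in terms of whole segments at checkpoint time.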

If the overall direction seems OK, I will continue working on reviving the slot by copying the WAL files from the archive. I would appreciate your feedback.

Thanks,
Sirisha