Hi,
A replication slot can be lost when a subscriber cannot keep up with the load on the primary and the WAL it still needs exceeds max_slot_wal_keep_size. When this happens, the target has to be reseeded from scratch (pg_dump), which can take a long time. I am investigating options to revive a lost slot.

With the attached patch, and after copying the WAL files from the archive into the pg_wal directory, I was able to revive a lost slot. I also verified that a lost slot still prevents vacuum from cleaning up catalog tuples deleted by any transaction later than catalog_xmin.

One side effect of this approach is that the checkpointer creates .ready files in the archive_status folder for the copied WAL files, so the archive command has to handle this case. At the same time, the checkpointer can potentially delete a copied file again before the subscriber consumes it.

In the proposed patch, I am not setting restart_lsn to InvalidXLogRecPtr; instead I rely on the invalidated_at field to tell whether the slot is lost. Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow reviving the slot?
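To make the copy step more concrete, here is a minimal sketch of how the WAL segment file names to fetch from the archive could be derived from the slot's retained restart_lsn and the current WAL position. It is not part of the patch. It assumes the default 16MB wal_segment_size and a single timeline (1), and the helper names (parse_lsn, wal_file_name, segments_needed) are made up for illustration.

```python
# Illustrative only: assumes the default 16MB wal_segment_size and timeline 1.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN in 'X/Y' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, segno, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-character WAL segment file name for a segment number."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segments_needed(restart_lsn, current_lsn, timeline=1,
                    seg_size=WAL_SEGMENT_SIZE):
    """List segment files covering restart_lsn through current_lsn."""
    first = parse_lsn(restart_lsn) // seg_size
    last = parse_lsn(current_lsn) // seg_size
    return [wal_file_name(timeline, s, seg_size)
            for s in range(first, last + 1)]


if __name__ == "__main__":
    # Hypothetical LSN values, chosen only to show the output format.
    for name in segments_needed("0/3000000", "0/5000028"):
        print(name)
```

Any file in this list that is no longer present in pg_wal would have to be copied back from the archive before the subscriber reconnects.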
If the overall direction seems OK, I will continue working on reviving the slot by copying the WAL files from the archive. I would appreciate your feedback.
Thanks,
Sirisha