Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Replication slot stats misgivings
Date
Msg-id CAA4eK1Jt40-gHsFkinREKMKU+0C4RPgzYu3SwFezcZFc9srnfA@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Replication slot stats misgivings
List pgsql-hackers
On Sat, Apr 10, 2021 at 7:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> IIUC there are two problems in the case where the drop message is lost:
>
> 1. Writing beyond the end of replSlotStats.
> This can happen if after restarting the number of slots whose stats
> are stored in the stats file exceeds max_replication_slots. Vignesh's
> patch addresses this problem.
>
> 2. The stats for the new slot are not recorded.
> If the stats for already-dropped slots remain in replSlotStats, the
> stats for the new slot cannot be registered due to the full of
> replSlotStats. This can happen even when after restarting the number
> of slots whose stats are stored in the stat file does NOT exceed
> max_replication_slots as well as even during the server running. The
> patch doesn’t address this problem. (If this happens, we will have to
> reset all slot stats since pg_stat_reset_replication_slot() cannot
> remove the slot stats with the non-existing name).
>
> I think we can use HTAB to store slot stats and have
> pg_stat_get_replication_slot() inquire about stats by the slot name,
> resolving both problems. By using HTAB we're no longer concerned about
> the problem of writing stats beyond the end of the replSlotStats
> array. Instead, we have to consider how and when to clean up the stats
> for already-dropped slots. We can have the startup process send slot
> names at startup time, which borrows the idea proposed by Amit. But
> maybe we need to consider the case again where the message from the
> startup process is lost? Another idea would be to have
> pgstat_vacuum_stat() check the existing slots and call
> pgstat_report_replslot_drop() if the slot in the stats file doesn't
> exist. That way, we can continuously check the stats for
> already-dropped slots.
>

Agreed, I think checking periodically via pgstat_vacuum_stat is a
better idea then sending once at start up time. I also think using
slot_name is better than using 'idx' (index in
ReplicationSlotCtl->replication_slots) in this scheme because even
after startup 'idx' changes we will be able to drop the dead slot.

> I've written a PoC patch for the above idea; using HTAB and cleaning
> up slot stats at pgstat_vacuum_stat(). The patch can be applied on top
> of 0001 patch Vignesh proposed before[1].
>

It seems Vignesh has changed patches based on the latest set of
comments so you might want to rebase.

> Please note that this cannot resolve the problem of ending up
> accumulating the stats to the old slot if the slot is re-created with
> the same name and the drop message is lost. To deal with this problem
> I think we would need to use something unique identifier for each slot
> instead of slot name.
>

Right, we can probably write it in comments and or docs about this
caveat and the user can probably use pg_stat_reset_replication_slot
for such slots.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: truncating timestamps on arbitrary intervals
Next
From: John Naylor
Date:
Subject: Re: truncating timestamps on arbitrary intervals