Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Replication slot stats misgivings
Date
Msg-id CAD21AoCjJUKsuDSxOPAJ=jR+8YhqbU8Ynbrt1mfzLeE2DMrS3Q@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Apr 10, 2021 at 7:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > IIUC there are two problems in the case where the drop message is lost:
> >
> > 1. Writing beyond the end of replSlotStats.
> > This can happen if after restarting the number of slots whose stats
> > are stored in the stats file exceeds max_replication_slots. Vignesh's
> > patch addresses this problem.
> >
> > 2. The stats for the new slot are not recorded.
> > If the stats for already-dropped slots remain in replSlotStats, the
> > stats for the new slot cannot be registered due to the full of
> > replSlotStats. This can happen even when after restarting the number
> > of slots whose stats are stored in the stat file does NOT exceed
> > max_replication_slots as well as even during the server running. The
> > patch doesn’t address this problem. (If this happens, we will have to
> > reset all slot stats since pg_stat_reset_replication_slot() cannot
> > remove the slot stats with the non-existing name).
> >
> > I think we can use HTAB to store slot stats and have
> > pg_stat_get_replication_slot() inquire about stats by the slot name,
> > resolving both problems. By using HTAB we're no longer concerned about
> > the problem of writing stats beyond the end of the replSlotStats
> > array. Instead, we have to consider how and when to clean up the stats
> > for already-dropped slots. We can have the startup process send slot
> > names at startup time, which borrows the idea proposed by Amit. But
> > maybe we need to consider the case again where the message from the
> > startup process is lost? Another idea would be to have
> > pgstat_vacuum_stat() check the existing slots and call
> > pgstat_report_replslot_drop() if the slot in the stats file doesn't
> > exist. That way, we can continuously check the stats for
> > already-dropped slots.
> >

Thanks for your comments.

>
> Agreed, I think checking periodically via pgstat_vacuum_stat is a
> better idea then sending once at start up time. I also think using
> slot_name is better than using 'idx' (index in
> ReplicationSlotCtl->replication_slots) in this scheme because even
> after startup 'idx' changes we will be able to drop the dead slot.
>
> > I've written a PoC patch for the above idea; using HTAB and cleaning
> > up slot stats at pgstat_vacuum_stat(). The patch can be applied on top
> > of 0001 patch Vignesh proposed before[1].
> >
>
> It seems Vignesh has changed patches based on the latest set of
> comments so you might want to rebase.

I've merged my patch into the v6 patch set Vignesh submitted.

I've attached the updated version of the patches. I didn't change
anything in the patch that changes char[NAMEDATALEN] to NameData (0001
patch) and patches that add tests. In 0003 patch I reordered the
output parameters of pg_stat_replication_slots; showing total number
of transactions and total bytes followed by statistics for spilled and
streamed transactions seems appropriate to me. Since my patch resolved
the issue of writing stats beyond the end of the array, I've removed
the patch that writes the number of stats into the stats file
(v6-0004-Handle-overwriting-of-replication-slot-statistic-.patch).

Apart from the above updates, the
contrib/test_decoding/001_repl_stats.pl add wait_for_decode_stats()
function during testing but I think we can use poll_query_until()
instead. Also, I think we can merge 0004 and 0005 patches.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

pgsql-hackers by date:

Previous
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: Truncate in synchronous logical replication failed
Next
From: Michael Paquier
Date:
Subject: Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)