Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Replication slot stats misgivings
Date
Msg-id CAA4eK1J3S539N_87f1wt2jPBfAX869sa0KVBRudbx1KcQhWykg@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Replication slot stats misgivings
List pgsql-hackers
On Fri, May 7, 2021 at 6:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Mar 20, 2021 at 3:52 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > I started to write this as a reply to
> > https://postgr.es/m/20210318015105.dcfa4ceybdjubf2i%40alap3.anarazel.de
> > but I think it doesn't really fit under that header anymore.
> >
> > On 2021-03-17 18:51:05 -0700, Andres Freund wrote:
> > > It does make it easier for the shared memory stats patch, because if
> > > there's a fixed number + location, the relevant stats reporting doesn't
> > > need to go through a hashtable with the associated locking.  I guess
> > > that may have colored my perception that it's better to just have a
> > > statically sized memory allocation for this.  Noteworthy that SLRU stats
> > > are done in a fixed size allocation as well...
>
> Through a long discussion, all review comments pointed out here have
> been addressed. I summarized that individual review comments are fixed
> by which commit to avoid any confusion later.
>
> >
> > As part of reviewing the replication slot stats patch I looked at
> > replication slot stats a fair bit, and I've a few misgivings. First,
> > about the pgstat.c side of things:
> >
> > - If somehow slot stat drop messages got lost (remember pgstat
> >   communication is lossy!), we'll just stop maintaining stats for slots
> >   created later, because there'll eventually be no space for keeping
> >   stats for another slot.
> >
> > - If max_replication_slots was lowered between a restart,
> >   pgstat_read_statfile() will happily write beyond the end of
> >   replSlotStats.
> >
> > - pgstat_reset_replslot_counter() acquires ReplicationSlotControlLock. I
> >   think pgstat.c has absolutely no business doing things on that level.
> >
> > - We do a linear search through all replication slots whenever receiving
> >   stats for a slot. Even though there'd be a perfectly good index to
> >   just use all throughout - the slots index itself. It looks to me like
> >   slots stat reports can be fairly frequent in some workloads, so that
> >   doesn't seem great.
>
> Fixed by 3fa17d37716.
>
> >
> > - PgStat_ReplSlotStats etc use slotname[NAMEDATALEN]. Why not just NameData?
> >
> > - pgstat_report_replslot() already has a lot of stats parameters, it
> >   seems likely that we'll get more. Seems like we should just use a
> >   struct of stats updates.
>
> Fixed by cca57c1d9bf.
>
> >
> > And then more generally about the feature:
> > - If a slot was used to stream out a large amount of changes (say an
> >   initial data load), but then replication is interrupted before the
> >   transaction is committed/aborted, stream_bytes will not reflect the
> >   many gigabytes of data we may have sent.
>
> Fixed by 592f00f8d.
>
> > - I seems weird that we went to the trouble of inventing replication
> >   slot stats, but then limit them to logical slots, and even there don't
> >   record the obvious things like the total amount of data sent.
>
> Fixed by f5fc2f5b23d.
>
> >
> > I think the best way to address the more fundamental "pgstat related"
> > complaints is to change how replication slot stats are
> > "addressed". Instead of using the slots name, report stats using the
> > index in ReplicationSlotCtl->replication_slots.
> >
> > That removes the risk of running out of "replication slot stat slots":
> > If we loose a drop message, the index eventually will be reused and we
> > likely can detect that the stats were for a different slot by comparing
> > the slot name.
> >
> > It also makes it easy to handle the issue of max_replication_slots being
> > lowered and there still being stats for a slot - we simply can skip
> > restoring that slots data, because we know the relevant slot can't exist
> > anymore. And we can make the initial pgstat_report_replslot() during
> > slot creation use a
> >
> > I'm wondering if we should just remove the slot name entirely from the
> > pgstat.c side of things, and have pg_stat_get_replication_slots()
> > inquire about slots by index as well and get the list of slots to report
> > stats for from slot.c infrastructure.
>
> We fixed the problem of "running out of replication slot stat slots"
> by using HTAB to store slot stats, see 3fa17d37716. The slot stats
> could be orphaned if a slot drop message gets lost. But we constantly
> check and remove them in pgstat_vacuum_stat().
>

Thanks for the summarization. I don't find anything that is left
unaddressed. I think we can wait for a day or two to see if Andres or
anyone else sees anything that is left unaddressed and then we can
close the open item.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Anti-critical-section assertion failure in mcxt.c reached by walsender
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Race condition in recovery?