Re: Replication slot stats misgivings - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Replication slot stats misgivings |
Date | |
Msg-id | CAA4eK1J3S539N_87f1wt2jPBfAX869sa0KVBRudbx1KcQhWykg@mail.gmail.com Whole thread Raw |
In response to | Re: Replication slot stats misgivings (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Replication slot stats misgivings
|
List | pgsql-hackers |
On Fri, May 7, 2021 at 6:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Mar 20, 2021 at 3:52 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > I started to write this as a reply to > > https://postgr.es/m/20210318015105.dcfa4ceybdjubf2i%40alap3.anarazel.de > > but I think it doesn't really fit under that header anymore. > > > > On 2021-03-17 18:51:05 -0700, Andres Freund wrote: > > > It does make it easier for the shared memory stats patch, because if > > > there's a fixed number + location, the relevant stats reporting doesn't > > > need to go through a hashtable with the associated locking. I guess > > > that may have colored my perception that it's better to just have a > > > statically sized memory allocation for this. Noteworthy that SLRU stats > > > are done in a fixed size allocation as well... > > Through a long discussion, all review comments pointed out here have > been addressed. I summarized that individual review comments are fixed > by which commit to avoid any confusion later. > > > > > As part of reviewing the replication slot stats patch I looked at > > replication slot stats a fair bit, and I've a few misgivings. First, > > about the pgstat.c side of things: > > > > - If somehow slot stat drop messages got lost (remember pgstat > > communication is lossy!), we'll just stop maintaining stats for slots > > created later, because there'll eventually be no space for keeping > > stats for another slot. > > > > - If max_replication_slots was lowered between a restart, > > pgstat_read_statfile() will happily write beyond the end of > > replSlotStats. > > > > - pgstat_reset_replslot_counter() acquires ReplicationSlotControlLock. I > > think pgstat.c has absolutely no business doing things on that level. > > > > - We do a linear search through all replication slots whenever receiving > > stats for a slot. Even though there'd be a perfectly good index to > > just use all throughout - the slots index itself. It looks to me like > > slots stat reports can be fairly frequent in some workloads, so that > > doesn't seem great. > > Fixed by 3fa17d37716. > > > > > - PgStat_ReplSlotStats etc use slotname[NAMEDATALEN]. Why not just NameData? > > > > - pgstat_report_replslot() already has a lot of stats parameters, it > > seems likely that we'll get more. Seems like we should just use a > > struct of stats updates. > > Fixed by cca57c1d9bf. > > > > > And then more generally about the feature: > > - If a slot was used to stream out a large amount of changes (say an > > initial data load), but then replication is interrupted before the > > transaction is committed/aborted, stream_bytes will not reflect the > > many gigabytes of data we may have sent. > > Fixed by 592f00f8d. > > > - I seems weird that we went to the trouble of inventing replication > > slot stats, but then limit them to logical slots, and even there don't > > record the obvious things like the total amount of data sent. > > Fixed by f5fc2f5b23d. > > > > > I think the best way to address the more fundamental "pgstat related" > > complaints is to change how replication slot stats are > > "addressed". Instead of using the slots name, report stats using the > > index in ReplicationSlotCtl->replication_slots. > > > > That removes the risk of running out of "replication slot stat slots": > > If we loose a drop message, the index eventually will be reused and we > > likely can detect that the stats were for a different slot by comparing > > the slot name. > > > > It also makes it easy to handle the issue of max_replication_slots being > > lowered and there still being stats for a slot - we simply can skip > > restoring that slots data, because we know the relevant slot can't exist > > anymore. And we can make the initial pgstat_report_replslot() during > > slot creation use a > > > > I'm wondering if we should just remove the slot name entirely from the > > pgstat.c side of things, and have pg_stat_get_replication_slots() > > inquire about slots by index as well and get the list of slots to report > > stats for from slot.c infrastructure. > > We fixed the problem of "running out of replication slot stat slots" > by using HTAB to store slot stats, see 3fa17d37716. The slot stats > could be orphaned if a slot drop message gets lost. But we constantly > check and remove them in pgstat_vacuum_stat(). > Thanks for the summarization. I don't find anything that is left unaddressed. I think we can wait for a day or two to see if Andres or anyone else sees anything that is left unaddressed and then we can close the open item. -- With Regards, Amit Kapila.
pgsql-hackers by date: