Re: Resetting spilled txn statistics in pg_stat_replication - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Resetting spilled txn statistics in pg_stat_replication
Date
Msg-id CA+fd4k59RXHiCGqFE-FK3rjHyihU0sxKEoF3YRQC26EAbO2THg@mail.gmail.com
In response to Re: Resetting spilled txn statistics in pg_stat_replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Resetting spilled txn statistics in pg_stat_replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, 6 Jul 2020 at 20:45, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 2, 2020 at 9:01 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > I've attached PoC patch that implements a simple approach. I'd like to
> > discuss how we collect the replication slot statistics in the stats
> > collector before I bring the patch to completion.
> >
> > In this PoC patch, we have the array of PgStat_ReplSlotStats struct
> > which has max_replication_slots entries. The backend and wal sender
> > send the slot statistics to the stats collector when decoding a commit
> > WAL record.
> >
> > typedef struct PgStat_ReplSlotStats
> > {
> >     char            slotname[NAMEDATALEN];
> >     PgStat_Counter  spill_txns;
> >     PgStat_Counter  spill_count;
> >     PgStat_Counter  spill_bytes;
> > } PgStat_ReplSlotStats;
> >
> > What I'd like to discuss are:
> >
> > Since the unique identifier of replication slots is the name, the
> > process sends slot statistics along with slot name to stats collector.
> > I'm concerned about the amount of data sent to the stats collector and
> > the cost of searching the statistics within the statistics array (it’s
> > O(N) where N is max_replication_slots). Since the maximum length of
> > slot name is NAMEDATALEN (64 bytes default) and max_replication_slots
> > is unlikely to be a large number, I may be worrying too much, but it
> > seems better to avoid that if we can do so easily. An idea I came up
> > with is to use the index of slots (i.e., the index into
> > ReplicationSlotCtl->replication_slots[]) as the index of the slot's
> > statistics in the stats collector. But since a slot's index can change
> > after a restart, we would need to somehow synchronize the indexes
> > between the two arrays. So I thought we could determine the index of a
> > slot's statistics at ReplicationSlotAcquire() or
> > ReplicationSlotCreate(), but that would in turn require reading the
> > stats file while holding ReplicationSlotControlLock, to prevent the
> > same statistics index from being used by a concurrent process creating
> > a slot. I might be missing something, though.
> >
>
> I don't think we should be bothered about large values of
> max_replication_slots.  The default value is 10, and I am not sure
> users will select values large enough that we should worry about
> searching the slots by name.  If it turns out to be a problem, we can
> try to invent something to mitigate it.

Agreed.
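
For what it's worth, the name-based O(N) lookup being discussed can be
sketched as below. The struct layout mirrors PgStat_ReplSlotStats from the
PoC patch; the function name pgstat_get_replslot_entry and the
PgStat_Counter stand-in typedef are hypothetical names for illustration,
not part of the patch.

```c
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64          /* default slot name limit */

typedef long PgStat_Counter;    /* stand-in for the real typedef */

typedef struct PgStat_ReplSlotStats
{
    char            slotname[NAMEDATALEN];
    PgStat_Counter  spill_txns;
    PgStat_Counter  spill_count;
    PgStat_Counter  spill_bytes;
} PgStat_ReplSlotStats;

/*
 * Linear scan over the stats array keyed by slot name: O(N) where
 * N is at most max_replication_slots, which is normally small.
 */
static PgStat_ReplSlotStats *
pgstat_get_replslot_entry(PgStat_ReplSlotStats *stats, int nstats,
                          const char *slotname)
{
    int         i;

    for (i = 0; i < nstats; i++)
    {
        if (strncmp(stats[i].slotname, slotname, NAMEDATALEN) == 0)
            return &stats[i];
    }
    return NULL;                /* no stats tracked for this slot yet */
}
```

With the default max_replication_slots of 10 and 64-byte names, the scan
touches at most a few hundred bytes per stats message, which supports the
view that the O(N) cost is negligible in practice.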

>
> > Second, as long as the unique identifier is the slot name, there is no
> > convenient way to distinguish between old and new replication slots
> > with the same name, so the backend or wal sender process sends a
> > message to the stats collector to drop the replication slot at
> > ReplicationSlotDropPtr(). This strategy differs from what we do for
> > table, index, and function statistics. It might not be a problem, but
> > I'm thinking about a better way.
> >
>
> Can we rely on message ordering in the transmission mechanism (UDP)
> for stats?  The wiki suggests [1] we can't.  If so, then this might
> not work.

Yeah, I'm also concerned about this. Another idea would be to have an
additional unique identifier to distinguish old and new replication
slots with the same name, for example a creation timestamp. We could
then reclaim the stats of unused slots later, as we do for table and
function statistics.

On the other hand, if the ordering were reversed, we would miss those
stats, but the next stats report would create the new entry. If the
problem is unlikely to happen in the common case, we can live with
that.
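
To make the timestamp idea concrete, a hedged sketch: the entry carries a
creation timestamp assigned when the slot is created, and a stats message
only matches an entry if both name and timestamp agree, so a
dropped-and-recreated slot of the same name starts fresh. SlotStatsEntry
and slot_stats_match are hypothetical names for illustration only.

```c
#include <string.h>
#include <time.h>

#define NAMEDATALEN 64

/*
 * Hypothetical stats entry extended with a creation timestamp, which
 * serves as the second component of the identifier alongside the name.
 */
typedef struct SlotStatsEntry
{
    char    slotname[NAMEDATALEN];
    time_t  create_time;        /* assigned when the slot is created */
    long    spill_txns;
} SlotStatsEntry;

/*
 * A stats message is attributed to an entry only if both the slot name
 * and the creation timestamp agree; a recreated slot with the same name
 * but a newer timestamp will not match the stale entry.
 */
static int
slot_stats_match(const SlotStatsEntry *entry,
                 const char *slotname, time_t create_time)
{
    return strncmp(entry->slotname, slotname, NAMEDATALEN) == 0 &&
           entry->create_time == create_time;
}
```

Stale entries whose (name, timestamp) pair no longer corresponds to a
live slot could then be reclaimed lazily, in the same spirit as dead
table and function stats.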

> > The new view name is also an open question. I prefer
> > pg_stat_replication_slots and to add stats of physical replication
> > slots to the same view in the future, rather than a separate view.
> >
>
> This sounds okay to me.
>
> > Aside from the above, this patch will rework most of the changes
> > introduced by commit 9290ad198b1 and introduce a fair amount of new
> > code. I'm concerned about whether such changes are acceptable at the
> > time of beta 2.
> >
>
> I think it depends on the final patch. My initial thought was that we
> should do this for PG14 but if you are suggesting removing the changes
> done by commit 9290ad198b1 then we need to think over it.  I could
> think of below options:
> a. Revert 9290ad198b1 and introduce stats for spilling in PG14.
> Spilling was happening in PG13 anyway without any extra work; we just
> didn't have stats for it.
> b. Try to get your patch in PG13 if we can, otherwise, revert the
> feature 9290ad198b1.
> c. Get whatever we have in commit 9290ad198b1 for PG13 and
> additionally have what we are discussing here for PG14.  This means
> that spilled stats at slot level will be available in PG14 via
> pg_stat_replication_slots and for individual WAL senders it will be
> available via pg_stat_replication both in PG13 and PG14.  Even if we
> can get your patch in PG13, we can still keep those in
> pg_stat_replication.
> d. Get whatever we have in commit 9290ad198b1 for PG13 and change it
> for PG14.  I don't think this will be a popular approach.

I was thinking option (a) or (b). I'm inclined toward option (a),
since the PoC patch adds a certain amount of new code. I agree with
you that it depends on the final patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


