Re: Resetting spilled txn statistics in pg_stat_replication - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Resetting spilled txn statistics in pg_stat_replication
Date
Msg-id CA+fd4k7hOvdihZ-mwkzOURY_VGqA4jzMCX5d9HNSG93gQ8f0+Q@mail.gmail.com
Whole thread Raw
In response to Re: Resetting spilled txn statistics in pg_stat_replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Resetting spilled txn statistics in pg_stat_replication  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Resetting spilled txn statistics in pg_stat_replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, 7 Sep 2020 at 15:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > I've updated the patch so that the stats collector ignores the
> > 'update' message if the slot stats array is already full.
> >
>
> This patch needs a rebase. I don't see this patch in the CF app. I
> hope you are still interested in working on this.

Thank you for reviewing this patch!

I'm still going to work on this patch although I might be slow
response this month.

>
> Review comments:
> ===============
> 1.
> +CREATE VIEW pg_stat_replication_slots AS
> +    SELECT
> +            s.name,
> +            s.spill_txns,
> +            s.spill_count,
> +            s.spill_bytes
> +    FROM pg_stat_get_replication_slots() AS s;
>
> The view pg_stat_replication_slots should have a column 'stats_reset'
> (datatype: timestamp with time zone) as we provide a facitily to reset
> the slots. A similar column exists in pg_stat_slru as well, so is
> there a reason of not providing it here?

I had missed adding the column. Fixed.

>
> 2.
> +   </para>
> +  </sect2>
> +
> +  <sect2 id="monitoring-pg-stat-wal-receiver-view">
>    <title><structname>pg_stat_wal_receiver</structname></title>
>
> It is better to keep one empty line between </para> and </sect2> to
> keep it consistent with the documentation of other views.

Fixed.

>
> 3.
>   <primary>pg_stat_reset_replication_slot</primary>
> +        </indexterm>
> +        <function>pg_stat_reset_replication_slot</function> (
> <type>text</type> )
> +        <returnvalue>void</returnvalue>
> +       </para>
> +       <para>
> +         Resets statistics to zero for a single replication slot, or for all
> +         replication slots in the cluster.  If the argument is NULL,
> all counters
> +         shown in the
> <structname>pg_stat_replication_slots</structname> view for
> +         all replication slots are reset.
> +       </para>
>
> I think the information about the parameter for this function is not
> completely clear. It seems to me that it should be the name of the
> slot for which we want to reset the stats, if so, let's try to be
> clear.

Fixed.

>
> 4.
> +pgstat_reset_replslot_counter(const char *name)
> +{
> + PgStat_MsgResetreplslotcounter msg;
> +
> + if (pgStatSock == PGINVALID_SOCKET)
> + return;
> +
> + pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
> + if (name)
> + {
> + memcpy(&msg.m_slotname, name, NAMEDATALEN);
> + msg.clearall = false;
> + }
>
> Don't we want to verify here or in the caller of this function whether
> the passed slot_name is a valid slot or not? For ex. see
> pgstat_reset_shared_counters where we return an error if the target is
> not valid.

Agreed. Fixed.

>
> 5.
> +static void
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> +{
> + int i;
> + int idx = -1;
> + TimestampTz ts;
> +
> + if (!msg->clearall)
> + {
> + /* Get the index of replication slot statistics to reset */
> + idx = pgstat_replslot_index(msg->m_slotname, false);
> +
> + if (idx < 0)
> + return; /* not found */
>
> Can we add a comment to describe when we don't expect to find the slot
> here unless there is no way that can happen?

Added.

>
> 6.
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> {
> ..
> + for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
> ..
> }
>
> I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS.

Fixed.

>
> 7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We
> can probably do at the end but better to change it now so that it
> doesn't slip from our mind.

Yes, changed.

>
> 8.
> @@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool
> permanent, bool deep)
>
>   break;
>
> + /*
> + * 'R' A PgStat_ReplSlotStats struct describing a replication slot
> + * follows.
> + */
> + case 'R':
> + if (fread(&replSlotStats[nReplSlotStats], 1,
> sizeof(PgStat_ReplSlotStats), fpin)
> + != sizeof(PgStat_ReplSlotStats))
> + {
> + ereport(pgStatRunningInCollector ? LOG : WARNING,
> + (errmsg("corrupted statistics file \"%s\"",
> + statfile)));
> + memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
> + goto done;
> + }
> + nReplSlotStats++;
> + break;
>
> Both here and in pgstat_read_db_statsfile_timestamp(), the patch
> handles 'R' message after 'D' whereas while writing the 'R' is written
> before 'D'. So, I think it is better if we keep the order during read
> the same as during write.

Changed the code so that it writes 'R' after 'D'.

>
> 9. While reviewing this patch, I noticed that in
> pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats
> or SLRUStats, we return 'false' from this function but OTOH, if we
> fail to read 'D' or 'R' message, we will return 'true'. I feel the
> handling of 'D' and 'R' message is fine because once we read
> GlobalStats, we can return the stats_timestamp. So the other two
> stands corrected. I understand that this is not directly related to
> this patch but if you agree we can do this as a separate patch.

It seems to make sense to me. We can set *ts and then read both
ArchiverStats and SLRUStats so we can return a valid timestamp even if
we fail to read.

I've attached both patches: 0001 patch fixes the issue you reported.
0002 patch is the patch that incorporated all review comments.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Junfeng Yang
Date:
Subject: 回复: Is it possible to set end-of-data marker for COPY statement.
Next
From: Craig Ringer
Date:
Subject: Re: [PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )