Thread: Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

From
wenhui qiu
Date:
Hi Tomas 
     This is a great feature.  
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("stats_history.size",
+ "Sets the amount of memory available for past events.",
+ NULL,
+ &statsHistorySizeMB,
+ 1,
+ 1,
+ 128,
+ PGC_POSTMASTER,
+ GUC_UNIT_MB,
+ NULL,
+ NULL,
+ NULL);
+
RAM is in terabytes now, the statsHistorySize is 128MB ,I think can increase to store more history record ?

Thanks 

On Sun, Dec 22, 2024 at 4:28 AM Tomas Vondra <tomas@vondra.me> wrote:
Hi,

Our runtime stats system is great, but it only keeps a snapshot of
cumulative stats / snapshot. And while that works for user workloads, it
may not be quite sufficient when analyzing maintenance operations like
vacuum/checkpoint, etc.

For those operations it's useful to have information about individual
runs, not just the cumulative counters (or even deltas between regular
snapshots). There's also the issue that we only keep a very limited
subset of available information - just look at the info included in
VACUUM VERBOSE or with log_checkpoints=on, and how little of that is
available in pg_stats_. For vacuum we have the vacuum/analyze counts,
and timestamp of the last operation, but that's it. VACUUM VERBOSE
provides way more information, but we can only guess based on the
regular pgstat counters.

Yes, we can get this info written to server log using log_checkpoints
and log_autovacuum_min_duration (AFAIK there's no way to ensure logging
for manual VACUUM). But processing this information is not particularly
convenient, as it requires parsing the log, the message format is
suitable more for humans, etc. And it'd be very convenient to be able to
query this information, just like the other pgstat catalogs.

I wonder if we might/should do two things, to improve this:

1) Introduce hooks that allow doing some custom stuff with info about
those actions, after logging it. The attached 0001 and 0002 patches do
this for vacuum and checkpoint.

2) Allow keeping information about events. The 0003 patch does that in
an extension, leveraging the new hooks, but it'd certainly possible to
do in core too.

I realize our current pgstat collector is keeping per-object stats, not
per-event. We might add this to per-object stats (e.g. each table would
have stats about vacuum runs), but that wouldn't work for checkpoints.
There's also the question of memory consumption - I'm sure we don't want
to keep infinite history of vacuum runs, for example.

So the extension simply maintains a fixed-size circular queue, i.e. when
it gets full it starts evicting oldest entries. 1MB is enough for
storing ~4k VACUUM runs - I'm sure it can be made more compact.

I don't think there's a better way to do this. I've considered if this
might be done using emit_log_hook, but (a) that only helps when we end
up logging the event (and I'd like to do this always), and (b) it'd
require parsing the server log. So it's not much better than just doing
that, I think ...


Opinions?

--
Tomas Vondra

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

From
Tomas Vondra
Date:
Hi,

On 12/23/24 07:35, wenhui qiu wrote:
> Hi Tomas 
>      This is a great feature.  
> + /*
> + * Define (or redefine) custom GUC variables.
> + */
> + DefineCustomIntVariable("stats_history.size",
> + "Sets the amount of memory available for past events.",
> + NULL,
> + &statsHistorySizeMB,
> + 1,
> + 1,
> + 128,
> + PGC_POSTMASTER,
> + GUC_UNIT_MB,
> + NULL,
> + NULL,
> + NULL);
> +
> RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
> increase to store more history record ?
> 

Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
for ~500k events, which seems good enough for most systems. Of course,
on systems with many relations might need more space, not sure.

I was thinking about specifying the space in more natural terms, either
as amount of time ("keep 1 day of history") or number of entries ("10k
entries"). That would probably mean the memory can't be allocated as
fixed size.

But maybe it'd be possible to just write the entries to a file. We don't
need random access to past entries (unlike e.g. pg_stat_statements), and
people won't query that very often either.


regards

-- 
Tomas Vondra



Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

From
Robert Treat
Date:
On Wed, Dec 25, 2024 at 12:26 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> On 12/23/24 07:35, wenhui qiu wrote:
> > Hi Tomas
> >      This is a great feature.
> > + /*
> > + * Define (or redefine) custom GUC variables.
> > + */
> > + DefineCustomIntVariable("stats_history.size",
> > + "Sets the amount of memory available for past events.",
> > + NULL,
> > + &statsHistorySizeMB,
> > + 1,
> > + 1,
> > + 128,
> > + PGC_POSTMASTER,
> > + GUC_UNIT_MB,
> > + NULL,
> > + NULL,
> > + NULL);
> > +
> > RAM is in terabytes now, the statsHistorySize is 128MB ,I think can
> > increase to store more history record ?
> >
>
> Maybe, the 128MB is an arbitrary (and conservative) limit - it's enough
> for ~500k events, which seems good enough for most systems. Of course,
> on systems with many relations might need more space, not sure.
>
> I was thinking about specifying the space in more natural terms, either
> as amount of time ("keep 1 day of history") or number of entries ("10k
> entries"). That would probably mean the memory can't be allocated as
> fixed size.
>

Based on the above, a rough calculation is that this is enough for
holding 1 year of hourly vacuum runs for 50 tables, or a year of daily
vacuums for 1000 tables. Most folks will fall somewhere in that range
(and won't really need a year's history) but that seems like plenty
for a default.

> But maybe it'd be possible to just write the entries to a file. We don't
> need random access to past entries (unlike e.g. pg_stat_statements), and
> people won't query that very often either.
>

Yeah, workloads will vary, but it doesn't seem like they would more
than query workloads do.

Robert Treat
https://xzilla.net