Re: PoC: history of recent vacuum/checkpoint runs (using new hooks) - Mailing list pgsql-hackers

From Robert Treat
Subject Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)
Date
Msg-id CAJSLCQ1dAmjuj=kV3yXvy3QDWpYpy+TDGgr3LGpLSh96HwYkFA@mail.gmail.com
Whole thread Raw
In response to Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)  (Tomas Vondra <tomas@vondra.me>)
Responses Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)
List pgsql-hackers
On Fri, Dec 27, 2024 at 8:25 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 12/27/24 05:00, Michael Paquier wrote:
> > On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:
> >> If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
> >> make a fundamental difference ...
> >>
> >> Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
> >> limit, or possibly removing it entirely (and accepting anything the
> >> system can handle).
> >
> > +    DefineCustomIntVariable("stats_history.size",
> > +                            "Sets the amount of memory available for past events.",
> >
> > How about some time-based retention?  Data size can be hard to think
> > about for the end user, while there would be a set of users that would
> > want to retain data for the past week, month, etc?  If both size and
> > time upper-bound are define, then entries that match one condition or
> > the other are removed.
> >
>
> Right. In my response [1] I suggested replacing the simple memory limit
> with a time-based limit, but I haven't done anything about that yet.
>
> And the more I think about it the more I'm convinced we don't need to
> keep the data about past runs in memory, a file should be enough (except
> maybe for a small buffer). That would mean we don't need to worry about
> dynamic shared memory, etc. I initially rejected this because it seemed
> like a regression to how pgstat worked initially (sharing data through
> files), but I don't think that's actually true - this data is different
> (almost append-only), etc.
>
> The one case when we may need co read the data is in response to DROP of
> a table, when we need to discard entries for that object. Or we could
> handle that by recording OIDs of dropped objects ... not sure how
> complex this would need to be.
>
> We'd still want to enforce some limits, of course - a time-based limit
> of the minimum amount of time to cover, and maximum amount of disk space
> to use (more as a safety).
>
> FWIW there's one "issue" with enforcing the time-based limit - we can
> only do that for the "end time", because that's when we see the entry.
> If you configure the limit to keep "1 day" history, and then a vacuum
> completes after running for 2 days, we want to keep it, so that anyone
> can actually see it.
>

I can't say I recall all the reasoning involved in making
pg_stat_statements just be based on a fixed number of entries, but the
ability to come up with corner cases was certainly a factor. For
example, imagine the scenario where you set a max at 30 days, but you
have some tables only being vacuumed every few months. Ideally you
probably want the last entry no matter what, and honestly probably the
last 2 (in case you are troubleshooting something, having the last run
and something to compare against is ideal). In any case, it can get
complicated pretty quickly.

> [1]
> https://www.postgresql.org/message-id/8df7cee1-31aa-4db3-bbb7-83157ca139da%40vondra.me
>
> > +        checkpoint_log_hook(
> > +          CheckpointStats.ckpt_start_t,            /* start_time */
> > +          CheckpointStats.ckpt_end_t,                /* end_time */
> > +          (flags & CHECKPOINT_IS_SHUTDOWN),        /* is_shutdown */
> > +          (flags & CHECKPOINT_END_OF_RECOVERY),    /* is_end_of_recovery */
> > +          (flags & CHECKPOINT_IMMEDIATE),            /* is_immediate */
> > +          (flags & CHECKPOINT_FORCE),                /* is_force */
> > +          (flags & CHECKPOINT_WAIT),                /* is_wait */
> > +          (flags & CHECKPOINT_CAUSE_XLOG),        /* is_wal */
> > +          (flags & CHECKPOINT_CAUSE_TIME),        /* is_time */
> > +          (flags & CHECKPOINT_FLUSH_ALL),            /* is_flush_all */
> > +          CheckpointStats.ckpt_bufs_written,        /* buffers_written */
> > +          CheckpointStats.ckpt_slru_written,        /* slru_written */
> > +          CheckpointStats.ckpt_segs_added,        /* segs_added */
> > +          CheckpointStats.ckpt_segs_removed,        /* segs_removed */
> > +          CheckpointStats.ckpt_segs_recycled,        /* segs_recycled */
> >
> > That's a lot of arguments.  CheckpointStatsData and the various
> > CHECKPOINT_* flags are exposed, why not just send these values to the
> > hook?
> >
> > For v1-0001 as well, I'd suggest some grouping with existing
> > structures, or expose these structures so as they can be reused for
> > out-of-core code via the proposed hook.  More arguments lead to more
> > mistakes that could be easily avoided.
>
> Yes, I admit the number of parameters seemed a bit annoying to me too,
> and maybe we could reduce that somewhat. Certainly for checkpoints,
> where we already have a reasonable CheckpointStats struct and flags,
> wrapping most of the fields.
>
> With vacuum it's a bit more complicated, for two reasons: (a) the
> counters are simply in LVRelState, mixed with all kinds of other info,
> and it seems "not great" to pass it to a "log" hook, and (b) there are
> some calculated values, so I guess the hook would need to do that
> calculation on it's own, and it'd be easy to diverge over time.
>
> For (a) we could introduce some "stats" struct to keep these counters
> for vacuum (the nearby parallel vacuum patch actually does something
> like that, I think). Not sure if (b) is actually a problem in practice,
> but I guess we could add those fields to the new "stats" struct too.
>

At the risk of increasing scope, since you already are working on
checkpoints along with vacuums, I'm curious if there was a reason not
to do analyze stats retention as well? It seems pretty correlated in
the same area/problems as vacuum history.

Robert Treat
https://xzilla.net



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: table_tuple_lock's snapshot argument
Next
From: Tomas Vondra
Date:
Subject: Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)