Re: Proposal: Expose oldest xmin as SQL function for monitoring - Mailing list pgsql-hackers

From James Coleman
Subject Re: Proposal: Expose oldest xmin as SQL function for monitoring
Date
Msg-id CAAaqYe9O9DZx-Jw0Eo9KmSud8wcs+xv5mHW-0p3_e+LjZsh2_g@mail.gmail.com
Whole thread Raw
In response to Re: Proposal: Expose oldest xmin as SQL function for monitoring  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Wed, Apr 1, 2020 at 5:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Apr-01, Tom Lane wrote:
>
> > James Coleman <jtc331@gmail.com> writes:
> > > To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> > > mistaken) isn't exposed directly in any view or function by Postgres.
> >
> > You could do something like
> >
> > select max(age(backend_xmin)) from pg_stat_activity;
> >
> > though I'm not sure whether that accounts for absolutely every process.

That doesn't account for for replication slots that's aren't active, right?

> > > Am I missing anything in the above description? And if not, would
> > > there be any reason why we would want to avoid exposing that
> > > information? And if not, then would exposing it as a function be
> > > acceptable?
> >
> > The fact that I had to use max(age(...)) in that sample query
> > hints at one reason: it's really hard to do arithmetic correctly
> > on raw XIDs.  Dealing with wraparound is a problem, and knowing
> > what's past or future is even harder.  What use-case do you
> > foresee exactly?

So the use case we've encountered multiple times is some (at that
point unknown) process or object that's preventing the xmin from
advancing, and thus blocking vacuum. That kind of situation can pretty
quickly lead to query plans that can result in significant business
impact.

As such, it'd be helpful to be able to monitor something like "how old
is the current xmin on the cluster". Ideally it would also tell you
what process or object is holding that xmin, but starting with the
xmin itself would at least alert to the problem so you can
investigate.

On that note, for this particular use case it would be sufficient to
have something like pg_timestamp_of_oldest_xmin() (given we have
commit timestamps tracking enabled) or even a function returning the
number of xids consumed since the oldest xmin, but it seems more
broadly useful to provide a function that gives the oldest xmin and
allow users to build on top of that.

> Maybe it would make sense to start exposing fullXids in these views and
> functions, for this reason.  There's no good reason to continue to
> expose bare Xids to userspace, we should use them only for storage.

This would be useful too (and for more reasons than the above).

> But I think James' point is precisely that it's not easy to know where
> to look for things that keep Xmin from advancing.  Currently it's
> backends, replication slots, prepared transactions, and replicas with
> hot_standby_feedback.  If you forget to monitor just one of these, your
> vacuums might be useless and you won't notice until disaster strikes.
>
>
> Maybe a useful value to publish in some monitoring view is
> RecentGlobalXmin -- which has a valid value when reading a view, since
> you had to acquire a snapshot to read the view in the first place.

If we went down that path what view do you think would be best -- an
existing one or a new one?

I go back and forth on whether this is best exposed as a monitoring
oriented view or as part of the suite of txid functions we already
have that seem to have broader applicability.

James



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.
Next
From: Andres Freund
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.