Re: shared-memory based stats collector - v70

From: Greg Stark
Subject: Re: shared-memory based stats collector - v70
Date:
Msg-id: CAM-w4HM22VqM=aW4c5i_Lzk1fYBqpSzTD1kK7CqDVbbsOOr-Rg@mail.gmail.com
In response to: Re: shared-memory based stats collector - v70 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, 9 Aug 2022 at 12:48, Andres Freund <andres@anarazel.de> wrote:

> > The reason I want a C function is I'm trying to get as far as I can
> > without a connection to a database, without a transaction, without
> > accessing the catalog, and as much as possible without taking locks.
>
> I assume you don't include lwlocks under locks?

I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. And taking a lock that is itself the cause of the problem
elsewhere, the very problem you need stats to debug, would be a deal
breaker.
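
To illustrate what I mean by not interfering: a reader could use the
non-blocking lwlock variant and simply skip a contended entry. This is
only a sketch and the helper name is made up, but
LWLockConditionalAcquire() is the existing non-waiting acquire:

    #include "postgres.h"
    #include "storage/lwlock.h"

    /*
     * Hypothetical monitoring helper: copy a stats entry out of shared
     * memory only if the shared lock can be taken without waiting;
     * otherwise tell the caller to skip it (or retry later) rather than
     * block behind whoever holds the lock.
     */
    static bool
    read_entry_nowait(LWLock *lock, const void *shared_entry,
                      void *dst, Size len)
    {
        if (!LWLockConditionalAcquire(lock, LW_SHARED))
            return false;       /* contended: don't wait */
        memcpy(dst, shared_entry, len);
        LWLockRelease(lock);
        return true;
    }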

> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a backend
> can't get the lock for some stats change it'll defer flushing the stats a bit,
> so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high rate".

I've seen people try to push Prometheus and similar systems to 5s poll
intervals. That would be challenging for Postgres due to the volume of
statistics. The default is 30s, and people often struggle to make even
that work for large fleets. But with a small fleet, perhaps an IoT-style
system with a "one large table" kind of schema, you might well want
stats every 5s or even every 1s.
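
Coming back to the deferral you describe above, I take it to be roughly
this shape on the writer side; the struct and function names here are
hypothetical, only LWLockConditionalAcquire() is real:

    #include "postgres.h"
    #include "storage/lwlock.h"

    typedef struct StatsCounters    /* hypothetical */
    {
        int64       tuples_inserted;
        int64       tuples_updated;
    } StatsCounters;

    /*
     * Writer-side sketch: if the shared entry's lock is contended, leave
     * the counts in the backend-local pending struct and let a later
     * report cycle flush them, instead of blocking on the lwlock.
     */
    static bool
    flush_entry_nowait(LWLock *lock, StatsCounters *shared,
                       StatsCounters *pending)
    {
        if (!LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
            return false;       /* keep pending counts for next time */
        shared->tuples_inserted += pending->tuples_inserted;
        shared->tuples_updated += pending->tuples_updated;
        LWLockRelease(lock);
        memset(pending, 0, sizeof(*pending));
        return true;
    }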

> I'm *dead* set against including catalog names in shared memory stats. That'll
> add a good amount of memory usage and complexity, without any sort of
> comensurate gain.

Well, it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to
each database, gather all the names into some local cache, and then
worry about keeping that cache up to date. And if there are any
database problems, such as disk errors or catalog objects being locked,
then your monitoring breaks, though perhaps the damage can be limited
to missing or out-of-date object names.
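
Concretely, the exporter-side work is something like this per database
(libpq sketch; the function and the exact query are mine, and error
handling is trimmed):

    #include <stdio.h>
    #include <libpq-fe.h>

    /*
     * Build an OID -> name mapping for one database so stats entries
     * keyed by OID can be labelled for the monitoring system. The
     * exporter has to repeat this for every database and keep the
     * cache fresh as objects are created, renamed and dropped.
     */
    static void
    load_relname_cache(PGconn *conn)
    {
        PGresult   *res = PQexec(conn,
            "SELECT c.oid, n.nspname, c.relname "
            "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
                printf("%s -> %s.%s\n",
                       PQgetvalue(res, i, 0),   /* relation oid */
                       PQgetvalue(res, i, 1),   /* schema */
                       PQgetvalue(res, i, 2));  /* relation name */
        }
        PQclear(res);
    }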



> > I also think it would be nice to have a change counter for every stat
> > object, or perhaps a change time. Prometheus wouldn't be able to make
> > use of it but other monitoring software might be able to receive only
> > metrics that have changed since the last update which would really
> > help on databases with large numbers of mostly static objects.
>
> I think you're proposing adding overhead that doesn't even have a real user.

I guess I'm just brainstorming here. I don't currently need it, no. But
it doesn't seem like significant overhead compared to the locking and
copying?
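
Something as simple as this is what I had in mind (all names
hypothetical; lock handling omitted, since the writer already holds the
entry's lock exclusively when it flushes):

    #include "postgres.h"

    typedef struct StatsEntry
    {
        uint64      change_count;       /* bumped on every flush */
        int64       tuples_inserted;    /* ... plus the real counters */
    } StatsEntry;

    /* Writer: bump the counter under the lock the flush already takes. */
    static void
    flush_and_bump(StatsEntry *shared, const StatsEntry *pending)
    {
        shared->tuples_inserted += pending->tuples_inserted;
        shared->change_count++;
    }

    /* Reader: only copy entries that changed since the last poll. */
    static bool
    copy_if_changed(const StatsEntry *shared, StatsEntry *dst,
                    uint64 *last_seen)
    {
        if (shared->change_count == *last_seen)
            return false;
        *dst = *shared;
        *last_seen = shared->change_count;
        return true;
    }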

-- 
greg


