Thread: shared memory stats ideas

From
Andres Freund
Date:
Hi,

shortly after shared memory stats went in I had a conversation with Lukas
about what it'd enable us going forward. I also chatted with Peter about
autovacuum related stats. I started to write an email, but then somehow lost
the draft and couldn't bring myself to start from scratch.


Here's a largely unordered list of ideas. I'm not planning to work on them
myself, but thought it'd nevertheless be useful to have them memorialized
somewhere.


1) Track some statistics based on relfilenodes rather than oids

We currently track IO related statistics as part of the normal relation
stats. The problem is that that prevents us from collecting stats whenever we
operate on a relfilenode, rather than a Relation.

We e.g. currently can't track the number of blocks written out in a relation,
because we don't have a Relation at that point. Nor can we really get hold
of one, as the writeback can happen in a different database without access to
pg_class. Which is also the reason why the per-relation IO stats aren't
populated by the startup process, even though it'd obviously sometimes be
helpful to know where the most IO time is spent on a standby.

There are also quite a few contortions of the bufmgr interface related to
this.

I think the solution to this is actually fairly simple: We split the IO
related statistics out from the relation statistics, and track them on a
relfilenode basis instead. That'd allow us to track all the IO stats from all
the places, rather than the partial job we do right now.
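To make the keying scheme concrete, here is a minimal Python sketch (PostgreSQL itself is C; all names and OIDs below are invented for illustration). The point is that the key is purely physical — (database, tablespace, relfilenode) — so any process that knows which file it is writing, including one without access to that database's pg_class, can bump the counters:

```python
from collections import defaultdict

# Hypothetical model: IO stats keyed by relfilenode identity rather than
# relation OID. No Relation (and hence no pg_class lookup) is needed to
# find the right entry.
io_stats = defaultdict(lambda: {"blocks_written": 0, "blocks_read": 0})

def count_block_write(db_oid, tablespace_oid, relfilenode):
    # The key identifies the physical file, not the logical relation.
    io_stats[(db_oid, tablespace_oid, relfilenode)]["blocks_written"] += 1

# e.g. a process writing back buffers for a relation in another database:
count_block_write(16384, 1663, 24576)
count_block_write(16384, 1663, 24576)
```

Under this model the startup process on a standby could populate the same counters during WAL replay, which the OID-keyed scheme cannot do.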


2) Split index and table statistics into different types of stats

We track both types of statistics in the same format and rename columns in
views etc. to make them somewhat sensible. A number of the "columns" in index
stats are currently unused.

If we split the stats for indexes and relations we can have reasonable names
for the fields, shrink the current memory usage by halving the set of fields
we keep for indexes, and extend the stats in a more targeted fashion.


This e.g. would allow us to keep track of the number of index entries killed via
the killtuples mechanism, which in turn would allow us to more intelligently
decide whether we should vacuum indexes (often the most expensive part of
vacuum). In a lot of workloads killtuples takes care of most of the cleanup,
but in others it doesn't do much.
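A toy Python sketch of the split (field names invented; the real stats structs are C): separate types mean index entries carry only index-relevant fields, and new index-specific counters like a killtuples count can be added without widening table stats too:

```python
from dataclasses import dataclass

# Hypothetical sketch: two stats types instead of one shared format.

@dataclass
class TableStats:
    seq_scans: int = 0
    tuples_inserted: int = 0
    tuples_updated: int = 0
    tuples_deleted: int = 0

@dataclass
class IndexStats:
    # Only index-relevant fields, roughly halving what is kept per index...
    scans: int = 0
    tuples_returned: int = 0
    # ...plus room for targeted extensions, e.g. killtuples-based cleanup:
    entries_killed: int = 0

idx = IndexStats()
idx.entries_killed += 10  # killtuples removed 10 entries during scans
```

An autovacuum heuristic could then compare `entries_killed` against the table's dead-tuple counters to judge whether index vacuuming is still worthwhile.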


3) Maintain more historical statistics about vacuuming

We currently track the last time a table was vacuumed, the number of times it
was vacuumed and a bunch of counters for the number of modified tuples since
the last vacuum.

However, none of that allows the user to identify which relations are causing
autovacuum to not keep up. Even just keeping track of the total time
autovacuum has spent on certain relations would be a significant improvement,
with more metrics easily imaginable (total IO [time], autovacuum delay time, xid age).


4) Make the stats mechanism extensible

Most of the work towards this has already been done, but a bit more work is
necessary. The hardest part is likely how to identify stats belonging to an
extension across restarts.

There's a bunch of extensions with their own stats mechanisms, but it's hard
to get this stuff right from the outside.
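One way to picture the restart-identification problem (a toy Python sketch; the mechanism and names are hypothetical, not an existing design): if extension stats kinds are keyed by a stable name rather than a runtime-assigned id, stats serialized at shutdown can be reattached after restart even if extensions register in a different order:

```python
# Hypothetical sketch: name-based registration of extension stats kinds.
registered_kinds = {}   # stable name -> runtime kind id
serialized = {}         # what survived the restart: name -> stats payload

def register_stats_kind(name):
    # The runtime id may differ across restarts; the name must not.
    registered_kinds[name] = len(registered_kinds) + 1
    # On registration, reattach any stats serialized under that name:
    return serialized.get(name)

# Pretend this was written out at the previous shutdown:
serialized["pg_stat_kcache"] = {"reads": 1234}

restored = register_stats_kind("pg_stat_kcache")
```

The sketch only illustrates why some stable identifier is needed; how the real mechanism would allocate and persist kind ids is exactly the open question.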


5) Use extensible shared memory stats to store pg_stat_statements data

pg_stat_statements' current mechanism has a few issues. The top ones I know of
are:

- Contention on individual stats entries when the same queryid is executed
  concurrently. pgstats deals with this by allowing stats to be collected in
  backend local memory and to be flushed into shared stats at a lower
  frequency.

- The querytext file can get huge (I've seen > 100GB) and cause massive
  slowdowns. It's better than the old fixed-length, fixed-shared-memory
  mechanism, don't get me wrong. But we can do better by storing the data in
  dynamic shared memory and also supporting trimming based on the total
  size.
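The contention-avoidance pattern described in the first bullet can be modeled in a few lines of Python (a toy model, not the pgstats implementation; the threshold and names are invented): each backend accumulates counts locally and only takes the shared entry's lock at flush time, so concurrent executions of the same queryid don't fight over one shared counter on every execution:

```python
shared_stats = {}   # stands in for the shared hash, one entry per queryid
FLUSH_EVERY = 100   # hypothetical flush threshold

class Backend:
    def __init__(self):
        self.local = {}     # backend-local pending counts
        self.pending = 0

    def record_execution(self, queryid):
        # Hot path: purely local, no shared-memory contention.
        self.local[queryid] = self.local.get(queryid, 0) + 1
        self.pending += 1
        if self.pending >= FLUSH_EVERY:
            self.flush()

    def flush(self):
        # Only here would a lock on the shared entries be taken.
        for qid, n in self.local.items():
            shared_stats[qid] = shared_stats.get(qid, 0) + n
        self.local.clear()
        self.pending = 0

b = Backend()
for _ in range(250):
    b.record_execution(42)
b.flush()  # final flush of the remainder
```

The trade-off is that shared stats lag slightly behind reality between flushes, which pgstats already accepts for relation stats.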


There were some other things, but I can't remember them right now.


Greetings,

Andres Freund



Re: shared memory stats ideas

From
Peter Geoghegan
Date:
On Wed, Oct 19, 2022 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:
> We e.g. currently can't track the number of blocks written out in a relation,
> because we don't have a Relation at that point. Nor can we really get hold
> of one, as the writeback can happen in a different database without access to
> pg_class. Which is also the reason why the per-relation IO stats aren't
> populated by the startup process, even though it'd obviously sometimes be
> helpful to know where the most IO time is spent on a standby.
>
> There's also quite a bit of contortions of the bufmgr interface related to
> this.

This seems related to the difficulty with distinguishing between
internal pages and leaf pages (or some generalized AM-agnostic
definition) in views like pg_statio_*_indexes.

Differentiating between leaf pages and internal pages would definitely
be a big improvement, but it's kind of an awkward thing to implement
[1] because you have to somehow invent the general concept of multiple
distinct kinds of buffers/pages within a relation. A lot of code would
need to be taught about that.

This work would be more likely to actually happen if it was tied to
some bigger project that promised other benefits.

> 2) Split index and table statistics into different types of stats

> This e.g. would allow us to keep track of the number of index entries killed via
> the killtuples mechanism, which in turn would allow us to more intelligently
> decide whether we should vacuum indexes (often the most expensive part of
> vacuum). In a lot of workloads killtuples takes care of most of the cleanup,
> but in others it doesn't do much.

While I do agree that it would be nice to record information about the
number of deletion operations per index, that information will still
be tricky to interpret and act upon relative to other kinds of
information. As a general rule, we should prefer to focus on signals
that show things really aren't going well in some specific and
unambiguous way. Signals about things that are going well seem harder
to work with -- they don't generalize well.

What I really mean here is this: I think that page split stuff is
going to be much more interesting than index deletion stuff. Index
deletion exists to prevent page splits. So it's natural to ask
questions about where that seems like it ought to have happened, but
didn't actually happen. This likely requires bucketing page splits
into different categories (since most individual page splits aren't
like that at all). Then it becomes much easier to (say) compare
indexes on the same table -- the user can follow a procedure that is
likely to generalize well to many different kinds of situations.

It's not completely clear how the bucketization would work. We ought
to remember how many page splits were caused by INSERT statements
rather than non-HOT UPDATEs, though -- that much seems likely to be
very useful and actionable. The DBA can probably consume this
information in a low context way by looking at the proportions of one
kind of split to the other at the level of each index.

One type of split is mostly just a "cost of doing business" for B-Tree
indexing. The other type really isn't.
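A toy Python model of the bucketing idea (everything here is invented for illustration): count page splits separately by what triggered them, so the "cost of doing business" splits from INSERTs can be told apart from splits driven by non-HOT UPDATE version churn:

```python
# Hypothetical per-index split accounting, bucketed by cause.
split_counts = {"insert": 0, "non_hot_update": 0}

def record_split(caused_by_update: bool):
    bucket = "non_hot_update" if caused_by_update else "insert"
    split_counts[bucket] += 1

# Simulate an index: mostly benign growth splits, some churn-driven ones.
for _ in range(90):
    record_split(caused_by_update=False)
for _ in range(10):
    record_split(caused_by_update=True)

total = sum(split_counts.values())
update_split_fraction = split_counts["non_hot_update"] / total
# A high fraction would flag an index worth a closer look, and the
# proportion is comparable across indexes on the same table.
```

This matches the "low context" consumption model described above: the DBA compares proportions, not absolute counts.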

> 3) Maintain more historical statistics about vacuuming

> However, none of that allows the user to identify which relations are causing
> autovacuum to not keep up. Even just keeping track of the total time
> autovacuum has spent on certain relations would be a significant improvement,
> with more easily imaginable (total IO [time], autovacuum delay time, xid age).

With VACUUM in particular the picture over time can be far easier to
work with than any given snapshot, from any single VACUUM operation.
Focusing on how things seem to be changing can make it a lot easier to
spot concerning trends, especially if you're a non-expert.

I would also expect a similar focus on the picture over time to be
useful with the indexing stuff, for roughly the same underlying
reasons.

[1] https://postgr.es/m/CAA8Fd-pB=mr42YQuoaLPO_o2=XO9YJnjQ23CYJDFwC8SXGM8zg@mail.gmail.com
--
Peter Geoghegan



Re: shared memory stats ideas

From
"Drouvot, Bertrand"
Date:
Hi,

On 10/19/22 8:19 PM, Andres Freund wrote:
> 
> Hi,
> 
> 
> Here's a largely unordered list of ideas. I'm not planning to work on them
> myself, but thought it'd nevertheless be useful to have them memorialized
> somewhere.
> 

Thanks for sharing this list of ideas!

> 
> 
> 2) Split index and table statistics into different types of stats
> 
> We track both types of statistics in the same format and rename columns in
> views etc. to make them somewhat sensible. A number of the "columns" in index
> stats are currently unused.
> 
> If we split the stats for indexes and relations we can have reasonable names
> for the fields, shrink the current memory usage by halving the set of fields
> we keep for indexes, and extend the stats in a more targeted fashion.

I started to work on this.
I should be able to provide a patch attempt in the next couple of weeks.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: shared memory stats ideas

From
Kyotaro Horiguchi
Date:
Thanks for the nice list.

At Wed, 19 Oct 2022 12:37:30 -0700, Peter Geoghegan <pg@bowt.ie> wrote in 
> On Wed, Oct 19, 2022 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:
> > We e.g. currently can't track the number of blocks written out in a relation,
> > because we don't have a Relation at that point. Nor can we really get hold
> > of one, as the writeback can happen in a different database without access to
> > pg_class. Which is also the reason why the per-relation IO stats aren't
> > populated by the startup process, even though it'd obviously sometimes be
> > helpful to know where the most IO time is spent on a standby.
> >
> > There's also quite a bit of contortions of the bufmgr interface related to
> > this.
> 
> This seems related to the difficulty with distinguishing between
> internal pages and leaf pages (or some generalized AM-agnostic
> definition) in views like pg_statio_*_indexes.
>
> Differentiating between leaf pages and internal pages would definitely
> be a big improvement, but it's kind of an awkward thing to implement
> [1] because you have to somehow invent the general concept of multiple
> distinct kinds of buffers/pages within a relation. A lot of code would
> need to be taught about that.
> 
> This work would be more likely to actually happen if it was tied to
> some bigger project that promised other benefits.

Stickier buffers for index pages seem to be related. I haven't seen that
work even get started, though.  But this might be an additional reason for
starting it.

> > 2) Split index and table statistics into different types of stats
> 
> > This e.g. would allow us to keep track of the number of index entries killed via
> > the killtuples mechanism, which in turn would allow us to more intelligently
> > decide whether we should vacuum indexes (often the most expensive part of
> > vacuum). In a lot of workloads killtuples takes care of most of the cleanup,
> > but in others it doesn't do much.
> 
> While I do agree that it would be nice to record information about the
> number of deletion operations per index, that information will still
> be tricky to interpret and act upon relative to other kinds of
> information. As a general rule, we should prefer to focus on signals
> that show things really aren't going well in some specific and
> unambiguous way. Signals about things that are going well seem harder
> to work with -- they don't generalize well.

I think some statistics can serve a purely internal purpose. We can maintain
some statistics hidden from users, if we want. (However, I suspect people
will eventually request that the numbers be revealed..)

> What I really mean here is this: I think that page split stuff is
> going to be much more interesting than index deletion stuff. Index
> deletion exists to prevent page splits. So it's natural to ask
> questions about where that seems like it ought to have happened, but
> didn't actually happen. This likely requires bucketing page splits
> into different categories (since most individual page splits aren't
> like that at all). Then it becomes much easier to (say) compare
> indexes on the same table -- the user can follow a procedure that is
> likely to generalize well to many different kinds of situations.
> 
> It's not completely clear how the bucketization would work. We ought
> to remember how many page splits were caused by INSERT statements
> rather than non-HOT UPDATEs, though -- that much seems likely to be
> very useful and actionable. The DBA can probably consume this
> information in a low context way by looking at the proportions of one
> kind of split to the other at the level of each index.
> 
> One type of split is mostly just a "cost of doing business" for B-Tree
> indexing. The other type really isn't.
>
> > 3) Maintain more historical statistics about vacuuming
> 
> > However, none of that allows the user to identify which relations are causing
> > autovacuum to not keep up. Even just keeping track of the total time
> > autovacuum has spent on certain relations would be a significant improvement,
> > with more easily imaginable (total IO [time], autovacuum delay time, xid age).
> 
> With VACUUM in particular the picture over time can be far easier to
> work with than any given snapshot, from any single VACUUM operation.
> Focusing on how things seem to be changing can make it a lot easier to
> spot concerning trends, especially if you're a non-expert.

Agreed.  It seems like a relatively easy (low-hanging) one. I'll give it a
try.  There should be some other numbers for which timeseries stats are
useful.

> I would also expect a similar focus on the picture over time to be
> useful with the indexing stuff, for roughly the same underlying
> reasons.
> 
> [1] https://postgr.es/m/CAA8Fd-pB=mr42YQuoaLPO_o2=XO9YJnjQ23CYJDFwC8SXGM8zg@mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared memory stats ideas

From
"Drouvot, Bertrand"
Date:
Hi,

On 10/20/22 9:17 AM, Drouvot, Bertrand wrote:
> On 10/19/22 8:19 PM, Andres Freund wrote:
>>
>> 2) Split index and table statistics into different types of stats
>>
>> We track both types of statistics in the same format and rename columns in
>> views etc. to make them somewhat sensible. A number of the "columns" in
>> index stats are currently unused.
>>
>> If we split the stats for indexes and relations we can have reasonable
>> names for the fields, shrink the current memory usage by halving the set
>> of fields we keep for indexes, and extend the stats in a more targeted
>> fashion.
> 
> I started to work on this.
> I should be able to provide a patch attempt in the next couple of weeks.

Patch submitted and CF entry created: 
https://commitfest.postgresql.org/40/3984/

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: shared memory stats ideas

From
Peter Geoghegan
Date:
On Fri, Oct 21, 2022 at 2:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Stickier buffers for index pages seem to be related. I haven't seen that
> work even get started, though.  But this might be an additional reason
> for starting it.

Maybe, but FWIW I think that that will mostly just need to distinguish
leaf pages from heap pages (and mostly ignore internal pages). Within
each index, internal pages are typically no more than a fraction of 1%
of all pages. There are already so few internal pages that it seems
very likely that they're practically guaranteed to be cached already.
There is a huge asymmetry in how pages are naturally accessed, which
justifies treating them as fundamentally different things.

Separating leaf pages from internal pages for instrumentation purposes
is valuable because it allows the DBA to completely *ignore* internal
pages. Internal pages are accessed far, far more frequently than leaf
pages. In effect, internal pages add "noise" to the instrumentation,
obscuring the useful "signal" that the DBA should focus on (by
considering leaf level hits and misses in isolation). So the value is
from "removing noise", not from "adding signal".

You only need about 1% of the memory required to cache a big index to
get a "hit rate" of 75% (assuming you don't have a workload that's
very scan heavy, which would be unusual). Obviously the standard naive
definition of "index hit rate" isn't particularly useful.

> > While I do agree that it would be nice to record information about the
> > number of deletion operations per index, that information will still
> > be tricky to interpret and act upon relative to other kinds of
> > information. As a general rule, we should prefer to focus on signals
> > that show things really aren't going well in some specific and
> > unambiguous way. Signals about things that are going well seem harder
> > to work with -- they don't generalize well.
>
> I think some statistics can be pure-internal purpose. We can maintain
> some statistics hidden from users, if we want. (However, I think
> people will request for the numbers to be revealed, finally..)

It will probably be easy to add information about index tuple
deletions, with almost no downside, so of course we should do it.
My point was just that it's probably not the single most informative
thing that could be instrumented to help users to understand index
bloat. It's just much easier to understand what's not working than
what is going well. It's a stronger and more informative signal.

> > With VACUUM in particular the picture over time can be far easier to
> > work with than any given snapshot, from any single VACUUM operation.
> > Focusing on how things seem to be changing can make it a lot easier to
> > spot concerning trends, especially if you're a non-expert.
>
> Agreed.  It seems like a relatively easy (low-hanging) one. I'll give it a
> try.  There should be some other numbers for which timeseries stats are
> useful.

Great!

There probably is some way that VACUUM itself will ultimately use this
information to decide what to do. For example, if we go too long
without doing any index vacuuming, we might want to do it despite the
fact that there are relatively few LP_DEAD items in heap pages.

I don't think that we need to worry too much about how VACUUM itself
might apply the same information for now, but it's something that you
might want to consider.

-- 
Peter Geoghegan