Thread: Stats collector's idx_blks_hit value is highly misleading in practice

Stats collector's idx_blks_hit value is highly misleading in practice

From
Peter Geoghegan
Date:
It occurs to mean that statistics collector stats such as
pg_statio_*_tables.idx_blks_hit are highly misleading in practice
because they fail to take account of the difference between internal
pages and leaf pages in B-Tree indexes. These two types of pages are
in fundamentally different categories, and I think that failing to
recognize that at the level of these system views makes them much less
useful. Somebody should probably write a patch that makes this
difference clear from the system views. Possibly by using some
generalized notion of "record" pages instead of leaf pages, and
"metadata" pages instead of internal pages. That would even work with
hash indexes, I think.

Consider the following example, which is based on a standard nbtree
index, but could work in almost the same way with other index access
methods:

We have a pgbench_accounts pkey after initialization by pgbench at
scale 1500. It has 409,837 leaf pages and 1,451 internal pages,
meaning that about one third of one percent of all pages in the index
are internal pages. Occasionally, with indexes on large text strings
we might notice that as many as 1% of all index pages are internal
pages, but that's very much on the high side. Generally speaking,
we're virtually guaranteed to have *all* internal pages in
shared_buffers once a steady state has been reached. Once the cache
warms up, point lookups (like the queries pgbench performs) will only
have to access one leaf page at most, which amounts to only one I/O at
most. (This asymmetry is the main reason why B-Trees are generally
very effective when buffered in a buffer cache.)

If we run the pgbench queries against this database/example index
we'll find that we have to access 4 index pages per query execution --
the root, two additional internal pages, plus a leaf page. Based on
the reasonable assumptions I'm making, 3 out of 4 of those pages will
be hits when steady state is reached with pgbench's SELECT-only
workload, regardless of how large shared_buffers is or how bloated the
index is (we only need 1451 buffers for that, and those are bound to
get hot quickly).

The overall effect is idx_blks_hit changes over time in a way that
makes no sense -- even to an expert. Let's say we start with this
entire 3213 MB pgbench index in shared_buffers. We should only get
increments in idx_blks_hit, never increments in idx_blks_read - that
much makes sense. If we then iteratively shrink shared_buffers (or
equivalently, make the index grow without adding a new level), the
proportion of page accesses that increment idx_blks_read (rather than
incrementing idx_blks_hit) goes up roughly linearly as misses increase
linearly - which also makes sense. But here is the silly part: we
cannot really have a hit rate of less than 75% if you compare
idx_blks_hit to idx_blks_read, unless and until we can barely even fit
1% of the index in memory (at which point it's hard to distinguish
from noise). So if we naively consume the current view we'll see a hit
rate that starts at 100%, and very slowly shrinks to 75%, which is
where we bottom out (more or less, roughly speaking). This behavior
seems pretty hard to defend to me.

If somebody fixed this by putting internal pages into their own bucket
in the system view, then motivated users would quickly learn that
internal page stats aren't really useful -- they are only included for
completeness. They're such a small contributor to the overall hit rate
that they can simply be ignored completely. The thing that users ought
to focus on is leaf page hit rate. Now index hit rate (by which I mean
leaf page hit rate) actually makes sense. Note that Heroku promoted
simple heuristics like this for many years.

I suppose that a change like this could end up affecting other things,
such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
separately for bitmap scans at the moment, so maybe it could be fairly
well targeted. And, maybe this is unappealing given the current
statistics collector limitations. I'm not volunteering to work on it
right now, but it would be nice to fix this. Please don't wait for me
to do it.

-- 
Peter Geoghegan



Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
Tomas Vondra
Date:
On Fri, Oct 16, 2020 at 03:35:51PM -0700, Peter Geoghegan wrote:
>It occurs to mean that statistics collector stats such as
>pg_statio_*_tables.idx_blks_hit are highly misleading in practice
>because they fail to take account of the difference between internal
>pages and leaf pages in B-Tree indexes. These two types of pages are
>in fundamentally different categories, and I think that failing to
>recognize that at the level of these system views makes them much less
>useful. Somebody should probably write a patch that makes this
>difference clear from the system views. Possibly by using some
>generalized notion of "record" pages instead of leaf pages, and
>"metadata" pages instead of internal pages. That would even work with
>hash indexes, I think.
>
>Consider the following example, which is based on a standard nbtree
>index, but could work in almost the same way with other index access
>methods:
>
>We have a pgbench_accounts pkey after initialization by pgbench at
>scale 1500. It has 409,837 leaf pages and 1,451 internal pages,
>meaning that about one third of one percent of all pages in the index
>are internal pages. Occasionally, with indexes on large text strings
>we might notice that as many as 1% of all index pages are internal
>pages, but that's very much on the high side. Generally speaking,
>we're virtually guaranteed to have *all* internal pages in
>shared_buffers once a steady state has been reached. Once the cache
>warms up, point lookups (like the queries pgbench performs) will only
>have to access one leaf page at most, which amounts to only one I/O at
>most. (This asymmetry is the main reason why B-Trees are generally
>very effective when buffered in a buffer cache.)
>
>If we run the pgbench queries against this database/example index
>we'll find that we have to access 4 index pages per query execution --
>the root, two additional internal pages, plus a leaf page. Based on
>the reasonable assumptions I'm making, 3 out of 4 of those pages will
>be hits when steady state is reached with pgbench's SELECT-only
>workload, regardless of how large shared_buffers is or how bloated the
>index is (we only need 1451 buffers for that, and those are bound to
>get hot quickly).
>
>The overall effect is idx_blks_hit changes over time in a way that
>makes no sense -- even to an expert. Let's say we start with this
>entire 3213 MB pgbench index in shared_buffers. We should only get
>increments in idx_blks_hit, never increments in idx_blks_read - that
>much makes sense. If we then iteratively shrink shared_buffers (or
>equivalently, make the index grow without adding a new level), the
>proportion of page accesses that increment idx_blks_read (rather than
>incrementing idx_blks_hit) goes up roughly linearly as misses increase
>linearly - which also makes sense. But here is the silly part: we
>cannot really have a hit rate of less than 75% if you compare
>idx_blks_hit to idx_blks_read, unless and until we can barely even fit
>1% of the index in memory (at which point it's hard to distinguish
>from noise). So if we naively consume the current view we'll see a hit
>rate that starts at 100%, and very slowly shrinks to 75%, which is
>where we bottom out (more or less, roughly speaking). This behavior
>seems pretty hard to defend to me.
>

Yeah. The behavior is technically correct, but it's not very useful for
practical purposes. And most people don't even realize it behaves like
this :-( It's possible to compensate for this effect and estimate the
actually "interesting" hit rate, but if we could have it directly that
would be great.

>If somebody fixed this by putting internal pages into their own bucket
>in the system view, then motivated users would quickly learn that
>internal page stats aren't really useful -- they are only included for
>completeness. They're such a small contributor to the overall hit rate
>that they can simply be ignored completely. The thing that users ought
>to focus on is leaf page hit rate. Now index hit rate (by which I mean
>leaf page hit rate) actually makes sense. Note that Heroku promoted
>simple heuristics like this for many years.
>
>I suppose that a change like this could end up affecting other things,
>such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
>separately for bitmap scans at the moment, so maybe it could be fairly
>well targeted. And, maybe this is unappealing given the current
>statistics collector limitations. I'm not volunteering to work on it
>right now, but it would be nice to fix this. Please don't wait for me
>to do it.
>

It seems to me this should not be a particularly difficult patch in
principle, so suitable for new contributors. The main challenge would be
passing information about what page we're dealing with (internal/leaf)
to the place actually calling pgstat_count_buffer_(read|hit). That
happens in ReadBufferExtended, which just has no idea what page it's
dealing with. Not sure how to do that cleanly ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
Peter Geoghegan
Date:
On Fri, Oct 30, 2020 at 6:46 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yeah. The behavior is technically correct, but it's not very useful for
> practical purposes. And most people don't even realize it behaves like
> this :-( It's possible to compensate for this effect and estimate the
> actually "interesting" hit rate, but if we could have it directly that
> would be great.

It's important that the information we provide in system views (and
other instrumentation) reflect reality, even when the underlying
mechanisms are not well understood by most users. DBAs often observe
correlations and arrive at useful conclusions without truly
understanding what's happening. Individual hackers have occasionally
expressed skepticism of exposing the internals of the system through
instrumentation; they object on the grounds that users are unlikely to
understand what they see anyway. It seems to me that this completely
misses the point. You don't necessarily have to truly understand
what's going on to have mechanical sympathy for the system. You don't
need to be a physicist to do folk physics.

To my mind the best example of this is wait events, which first
appeared in proprietary database systems. Wait events expose
information about mechanisms that couldn't possibly be fully
understood by the end consumer. Because technically the details were
trade secrets. That didn't stop them from being very useful in
practice.

> It seems to me this should not be a particularly difficult patch in
> principle, so suitable for new contributors. The main challenge would be
> passing information about what page we're dealing with (internal/leaf)
> to the place actually calling pgstat_count_buffer_(read|hit). That
> happens in ReadBufferExtended, which just has no idea what page it's
> dealing with. Not sure how to do that cleanly ...

It would be a bit messy to pass down a flag like that, but it could be
done. I think the idea of generalized definitions of internal pages
and leaf pages ("metadata pages and record pages") could work well,
but would require a little thought in some cases. I'm thinking of GIN.
I doubt it would really matter what the final determination is about
(say) which particular generalized page bucket GIN pending list pages
get placed in. It will be a little arbitrary in a few corner cases,
but it hardly matters at all. Right now we have something that's
technically correct but also practically useless.

-- 
Peter Geoghegan



Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
John Naylor
Date:
On Fri, Oct 30, 2020 at 9:46 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Oct 16, 2020 at 03:35:51PM -0700, Peter Geoghegan wrote:

> >I suppose that a change like this could end up affecting other things,
> >such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
> >separately for bitmap scans at the moment, so maybe it could be fairly
> >well targeted. And, maybe this is unappealing given the current
> >statistics collector limitations. I'm not volunteering to work on it
> >right now, but it would be nice to fix this. Please don't wait for me
> >to do it.
> >
>
> It seems to me this should not be a particularly difficult patch in
> principle, so suitable for new contributors. The main challenge would be
> passing information about what page we're dealing with (internal/leaf)
> to the place actually calling pgstat_count_buffer_(read|hit). That
> happens in ReadBufferExtended, which just has no idea what page it's
> dealing with. Not sure how to do that cleanly ...

Is this a TODO candidate? What would be a succinct title for it?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
Peter Geoghegan
Date:
On Thu, Feb 3, 2022 at 7:08 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Is this a TODO candidate? What would be a succinct title for it?

I definitely think that it's worth working on. I suppose it follows
that it should go on the TODO list.

-- 
Peter Geoghegan



Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
John Naylor
Date:
On Fri, Feb 4, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Feb 3, 2022 at 7:08 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > Is this a TODO candidate? What would be a succinct title for it?
>
> I definitely think that it's worth working on. I suppose it follows
> that it should go on the TODO list.

Added TODO item "Teach stats collector to differentiate between
internal and leaf index pages"

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
Sergey Dudoladov
Date:
Hello,

I would like to get some feedback on that task.

> pg_statio_*_tables.idx_blks_hit are highly misleading in practice
> because they fail to take account of the difference between internal
> pages and leaf pages in B-Tree indexes.

I see it is still the case, so the issue is relevant, isn't it ?

> The main challenge would be
> passing information about what page we're dealing with (internal/leaf)
> to the place actually calling pgstat_count_buffer_(read|hit). That
> happens in ReadBufferExtended, which just has no idea what page it's
> dealing with. Not sure how to do that cleanly ...

I do not immediately see the way to pass the information in a
completely clean manner.

Either
(1) ReadBufferExtended needs to know the type of an index page (leaf/internal)
or
(2) caller of ReadBufferExtended that can check the page type needs to learn
if there was a hit and call pgstat_count_buffer_(read|hit) accordingly.

In either case necessary code changes seem quite invasive to me.
I have attached a code snippet to illustrate the second idea.

Regards,
Sergey

Attachment

Re: Stats collector's idx_blks_hit value is highly misleading in practice

From
Sergey Dudoladov
Date:
Hi again,

Having played with the task for a little while,  I am no longer sure
it completely justifies the effort involved.
The reason being the task requires modifying the buffer pool in one
way or the other, which implies
(a) significant effort on performance testing and
(b) changes in the buffer pool interfaces that community might not
welcome just to get 1-2 extra statistics numbers.

Any ideas ?

Regards,
Sergey