Re: Stats collector's idx_blks_hit value is highly misleading in practice - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Stats collector's idx_blks_hit value is highly misleading in practice |
Date | |
Msg-id | 20201031014617.ukvqjlis7tkdo5ib@development Whole thread Raw |
In response to | Stats collector's idx_blks_hit value is highly misleading in practice (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Stats collector's idx_blks_hit value is highly misleading in practice
Re: Stats collector's idx_blks_hit value is highly misleading in practice |
List | pgsql-hackers |
On Fri, Oct 16, 2020 at 03:35:51PM -0700, Peter Geoghegan wrote: >It occurs to mean that statistics collector stats such as >pg_statio_*_tables.idx_blks_hit are highly misleading in practice >because they fail to take account of the difference between internal >pages and leaf pages in B-Tree indexes. These two types of pages are >in fundamentally different categories, and I think that failing to >recognize that at the level of these system views makes them much less >useful. Somebody should probably write a patch that makes this >difference clear from the system views. Possibly by using some >generalized notion of "record" pages instead of leaf pages, and >"metadata" pages instead of internal pages. That would even work with >hash indexes, I think. > >Consider the following example, which is based on a standard nbtree >index, but could work in almost the same way with other index access >methods: > >We have a pgbench_accounts pkey after initialization by pgbench at >scale 1500. It has 409,837 leaf pages and 1,451 internal pages, >meaning that about one third of one percent of all pages in the index >are internal pages. Occasionally, with indexes on large text strings >we might notice that as many as 1% of all index pages are internal >pages, but that's very much on the high side. Generally speaking, >we're virtually guaranteed to have *all* internal pages in >shared_buffers once a steady state has been reached. Once the cache >warms up, point lookups (like the queries pgbench performs) will only >have to access one leaf page at most, which amounts to only one I/O at >most. (This asymmetry is the main reason why B-Trees are generally >very effective when buffered in a buffer cache.) > >If we run the pgbench queries against this database/example index >we'll find that we have to access 4 index pages per query execution -- >the root, two additional internal pages, plus a leaf page. Based on >the reasonable assumptions I'm making, 3 out of 4 of those pages will >be hits when steady state is reached with pgbench's SELECT-only >workload, regardless of how large shared_buffers is or how bloated the >index is (we only need 1451 buffers for that, and those are bound to >get hot quickly). > >The overall effect is idx_blks_hit changes over time in a way that >makes no sense -- even to an expert. Let's say we start with this >entire 3213 MB pgbench index in shared_buffers. We should only get >increments in idx_blks_hit, never increments in idx_blks_read - that >much makes sense. If we then iteratively shrink shared_buffers (or >equivalently, make the index grow without adding a new level), the >proportion of page accesses that increment idx_blks_read (rather than >incrementing idx_blks_hit) goes up roughly linearly as misses increase >linearly - which also makes sense. But here is the silly part: we >cannot really have a hit rate of less than 75% if you compare >idx_blks_hit to idx_blks_read, unless and until we can barely even fit >1% of the index in memory (at which point it's hard to distinguish >from noise). So if we naively consume the current view we'll see a hit >rate that starts at 100%, and very slowly shrinks to 75%, which is >where we bottom out (more or less, roughly speaking). This behavior >seems pretty hard to defend to me. > Yeah. The behavior is technically correct, but it's not very useful for practical purposes. And most people don't even realize it behaves like this :-( It's possible to compensate for this effect and estimate the actually "interesting" hit rate, but if we could have it directly that would be great. >If somebody fixed this by putting internal pages into their own bucket >in the system view, then motivated users would quickly learn that >internal page stats aren't really useful -- they are only included for >completeness. They're such a small contributor to the overall hit rate >that they can simply be ignored completely. The thing that users ought >to focus on is leaf page hit rate. Now index hit rate (by which I mean >leaf page hit rate) actually makes sense. Note that Heroku promoted >simple heuristics like this for many years. > >I suppose that a change like this could end up affecting other things, >such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages >separately for bitmap scans at the moment, so maybe it could be fairly >well targeted. And, maybe this is unappealing given the current >statistics collector limitations. I'm not volunteering to work on it >right now, but it would be nice to fix this. Please don't wait for me >to do it. > It seems to me this should not be a particularly difficult patch in principle, so suitable for new contributors. The main challenge would be passing information about what page we're dealing with (internal/leaf) to the place actually calling pgstat_count_buffer_(read|hit). That happens in ReadBufferExtended, which just has no idea what page it's dealing with. Not sure how to do that cleanly ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: