Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
Date
Msg-id CAH2-Wzk-NiFFVc=XayoqO4ymSbmJm121H370X8xPqCBYNQYyKw@mail.gmail.com
Whole thread Raw
In response to Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Feb 27, 2025 at 3:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> +1 for having some instrumentation. I do not agree with Tom that these
> are numbers that only Peter Geoghegan and 2-3 other people will ever
> understand. I grant that it's not going to make sense to everyone, but
> the number of people to which it will make sense I would guess is
> probably in the hundreds or the thousands rather than the single
> digits.

I agree that it's likely to be of interest to only a minority of
users. But a fairly large minority.

> Good documentation could help.

It's easy to produce an example that makes intuitive sense. For
example, with skip scan that has a qual such as "WHERE a BETWEEN 1 and
5 AND b = 12345", it is likely that EXPLAIN ANALYZE will show "Index
Searches: 5" -- one search per "a" value. Such an example might be
more useful than my original pgbench_accounts example.

Do you think that that would help?

> So, where should we store that information?
>
> The thing that's odd about using IndexScanDesc is that it doesn't
> contain any other instrumentation currently, or at least not that I
> can see.

It is unique right now, but perhaps only because this is the first
piece of instrumentation that:

A. We must track at the level of an individual index scan -- not at
the level of an index relation, like pgstat_count_index_scan(), nor at
the level of the whole system, like BufferUsage state.

AND

B. Requires that we count something that fundamentally lives inside
the index AM -- something that we cannot reasonably track/infer from
the executor proper (more on that below, in my response to your scheme
#1).

> First, that struct has just four counters (ntuples,
> ntuples2, nfiltered, nfiltered2). For an index-only scan, all four of
> them are in use -- a plain index scan does not use ntuples2 -- but I
> think this optimization can apply to both index and index-only scans,
> so we just don't have room.

Right, index-only scans have exactly the same requirements as plain
index scans/bitmap index scans.

> 1. Add a new field to struct Instrumentation. Make
> index_getnext_slot() and possibly other similar functions return a
> count of index searches via a new parameter, and use that to bump the
> new struct Instrumentation counter. It's a little ugly to have to add
> a special-purpose parameter for this, but it doesn't look like there
> are tons and tons of calls so maybe it's OK.

That seems like the strongest possible alternative to the original
scheme used in the current draft patch (scheme #3).

This scheme #1 has the same issue as scheme #3, though: it still
requires an integer counter that tracks the number of index searches
(something a bit simpler than that, such as a boolean flag set once
per amgettuple call, won't do). This is due to there being no fixed
limit on the number of index searches required during any single
amgettuple call: in general the index AM may perform quite a few index
searches before it is able to return the first tuple to the scan (or
before it can return the next tuple).

The only difference that I can see between scheme #1 and scheme #3 is
that the former requires 2 counters instead of just 1. And, we'd still
need to have 1 out of the 2 counters located either in IndexScanDesc
itself (just like scheme #3), or located in some other struct that can
at least be accessed through IndexScanDesc (like the index AM opaque
state, per scheme #4). After all, *every* piece of state known to any
amgettuple routine must ultimately come from IndexScanDesc (or from
backend global state, per scheme #2).

(Actually, I supposed it is technically possible to avoid storing
anything in IndexScanDesc by inventing another amgettuple argument,
just for this. That seems like a distinction without a difference,
though.)

> 2. Add a new field to BufferUsage. Then the AM can bump this field and
> explain.c will know about it the same way it knows about other changes
> to pgBufferUsage. However, this is not really about buffer usage and
> the new field seems utterly unlike the other things that are in that
> structure, so this seems really bad to me.

I agree. The use of global variables seems quite inappropriate for
something like this. It'll result in wrong output whenever an index
scan uses an InitPlan in its qual when that InitPlan is itself a plan
that contains an index scan (this is already an issue with "Buffers:"
instrumentation, but this would be much worse).

> 3. Add a new field to IndexScanDesc, as you originally proposed. This
> seems similar to #1: it's still shoveling instrumentation data around
> in a way that we don't currently, but instead of shoveling it through
> a new parameter, we shovel it through a new structure member. Either
> way, the new thing (parameter or structure member) doesn't really look
> like it belongs with what's already there, so it seems like
> conservation of ugliness.

Perhaps a comment noting why the new counter lives in IndexScanDesc would help?

> 4. Add a new field to the btree-specific structure referenced by the
> IndexScanDesc's opaque pointer. I think this is what Matthias was
> proposing. It doesn't seem particularly hard to implement. and seems
> similar to #1 and #3.

It definitely isn't hard to implement. But...

> It is not clear to me that any of #1, #3, and #4 are radically better
> than any of the other ones, with the following exception: it would be
> a poor idea to choose #4 over #3 if this field will ultimately be used
> for a bunch of different AMs, and a poor idea to choose #3 over #4 if
> it's always going to be interesting only for btree.

...the requirements here really are 100% index-AM-generic. So I just
don't see any advantage to scheme #4.

All that I propose to do here is to display the information that is
already tracked by
pgstat_count_index_scan()/pg_stat_user_tables.idx_scan (at the level
of each index relation) when EXPLAIN ANALYZE is run (at the level of
each index scan). I'm not inventing a new concept; I'm extending an
existing index-AM-generic approach. It just so happens that there is a
greater practical need for that information within B-Tree scans.

Admittedly, the actual amount of code the patch adds to nbtree is
slightly more than it'll add to other index AMs (all of which require
exactly one added line of code). But that's only because nbtree alone
supports parallel index scans. Were it not for that, nbtree would also
require only a single additional line of code.

> I'll defer to you on which of those things is the case, but with the request that you
> think about what is practically likely to happen and not advocate too
> vigorously based on an argument that makes prominent use of the phrase
> "in theory".

Right now, the only instrumentation that lives inside index AMs is
pgstat_count_index_scan(), which works at the relation level (not the
scan level).

Matthias may well be right that we'll eventually want to add more
stuff like this. For example, maybe we'd show the number of index
tuples that we evaluated against the scan's qual that were not
actually returned to the executor proper. Then we really would be
inventing a whole new concept -- but a concept that was also
index-AM-neutral. I think that that's likely to be true generally, for
all manner of possible instrumentation improvements that live inside
index AMs -- so we'll likely end up putting more fields in
IndexScanDesc for that stuff.

> To be honest, I don't really like any of these options
> very much: they all seem a tad inelegant. But sometimes that is a
> necessary evil when inventing something new. I believe if I were
> implementing this myself I would probably try #1 first; if that ended
> up seeming too ugly, then I would fall back to #3 or #4.

I do get that. I hope that you don't think that I've failed to take
your feedback on board.

I ended up at #3 only through yak-shaving/trial-and-error coding.
Hopefully my reasoning makes sense.

> Does that help?

Yes, it does -- hugely. I'll need to think about it some more. The
documentation definitely needs more work, at a minimum.

Thanks for the review!

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: long-standing data loss bug in initial sync of logical replication
Next
From: Michael Paquier
Date:
Subject: Re: Small memory fixes for pg_createsubcriber