Thread: Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Alena Rybakina

Date:

10 November 2024, 22:00:33

On 09.11.2024 21:46, Peter Geoghegan wrote:

On Sat, Nov 9, 2024 at 12:37 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:

I noticed that the "Index Searches" cases shown in the regression tests are only for partitioned tables, maybe something you should add some tests for regular tables like tenk1.

I allowed the patch on this thread to bitrot, but I've been
maintaining this same patch as part of the skip scan patchset.
Attached is the latest version of this patch (technically this is the
first patch in the skip scan patch series). Just to keep things
passing on the CFTester app.

Thank you)


I haven't done anything about the implementation (still using a
counter that lives in IndexScanDesc) due to a lack of clarity on
what'll work best.

I've been still researching this to be honest and also haven't yet opinion when the counter will be more suitable.

Hopefully discussion of those aspects of this patch
will pick up again soon.

I hope too.

Note that I have changed the patch to divide "Index Searches:" by
nloops, since Tomas Vondra seemed to want to do it that way
(personally I don't feel strongly about that either way). So that's
one behavioral change, not seen in any of the versions of the patch
that have been posted to this thread so far.

Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation and your first letter that it specifies how many times in total the tuple search would be performed during the index execution. Is that not quite right?

The documentation: <para>
    <command>EXPLAIN ANALYZE</command> breaks down the total number of index
    searches performed by each index scan node. <literal>Index Searches: N</literal>
    indicates the total number of searches across <emphasis>all</emphasis>
    executor node executions/loops.
   </para>

In general, I support the initiative to display this information in the query plan output. I think it is necessary for finding the reasons for low query performance.

I just know that if Postgres 18 has skip scan, but doesn't have basic
instrumentation of the number of index searches in EXPLAIN ANALYZE
when skip scan is in use, we're going to get lots of complaints about
it. It'll be very different from the current status quo. My main
motivation here is to avoid complaints about the behavior of skip scan
being completely opaque to users.

Yes, we can expect users to be concerned about this, but it is wrong not to display information about it at all. The right thing to do is to see the problem and try to solve it in the future.
I think this patch is the first step towards a solution, right?
It may also encourage the user to consider other options for solving this problem, such as not to use index scan (for example, use pg_hint_plan extension) or building a view from this table or something else, if it significantly harms their performance.

I think that the same issue could also happen with your OR
transformation patch, if we don't get this EXPLAIN ANALYZE
instrumentation. Users will still naturally want to know if a query
"WHERE a = 2 OR a = 4 OR a = 6" required only one index search during
its index scan, or if it required as many as 3 searches. They can
already see this information with a BitmapOr-based plan, today. 

Why wouldn't they expect to continue to see the same information (or
similar information) when the index searches happen to be coordinated
by the index scan node/index AM itself?

To be honest, I don't quite understand this. Can you please explain in more detail?

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Peter Geoghegan

Date:

12 November 2024, 23:00:01

On Sun, Nov 10, 2024 at 2:00 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
> Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation and
yourfirst letter that it specifies how many times in total the tuple search would be performed during the index
execution.Is that not quite right? 

Well, nodes that appear on the inner side of a nested loop join (and
in a few other contexts) generally have their row counts (and a few
other things) divided by the total number of executions. The idea is
that we're showing the average across all executions of the node -- if
the user wants the true absolute number, they're expected to multiply
nrows by nloops themselves. This is slightly controversial behavior,
but it is long established (weirdly, we never divide by nloops for
"Buffers").

Initial versions of my patch didn't do this. The latest version does
divide like this, though. In general it isn't all that likely that an
inner index scan would have more than a single primitive index scan,
in any case, so which particular behavior I use here (divide vs don't
divide) is not something that I feel strongly about.

>> Why wouldn't they expect to continue to see the same information (or
>> similar information) when the index searches happen to be coordinated
>> by the index scan node/index AM itself?
>
> To be honest, I don't quite understand this. Can you please explain in more detail?

I just meant that your OR transformation patch is another case where
we shouldn't obscure the count of primitive index scans.

It would be inconsistent of us to allow users to see the number of
index scans today (without your patch), while denying users the
ability to see essentially the same information in the future (with
your patch). The fact that an index scan has its own executor node
today and won't have one tomorrow shouldn't in itself affect
instrumentation of the number of (primitive) index scans shown by
EXPLAIN ANALYZE (it certainly won't affect the instrumentation within
the pg_stat_all_indexes view, as things stand, even without my patch).

--
Peter Geoghegan

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Alena Rybakina

Date:

27 November 2024, 16:22:38

Sorry it took me so long to answer, I had some minor health complications

On 12.11.2024 23:00, Peter Geoghegan wrote:

On Sun, Nov 10, 2024 at 2:00 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:

Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation and your first letter that it specifies how many times in total the tuple search would be performed during the index execution. Is that not quite right?

Well, nodes that appear on the inner side of a nested loop join (and
in a few other contexts) generally have their row counts (and a few
other things) divided by the total number of executions. The idea is
that we're showing the average across all executions of the node -- if
the user wants the true absolute number, they're expected to multiply
nrows by nloops themselves. This is slightly controversial behavior,
but it is long established (weirdly, we never divide by nloops for
"Buffers").

I understood what you mean and I faced this situation before when I saw extremely more number of actual rows that could be and it was caused by the number of scanned tuples per cycles. [0]

[0] https://www.postgresql.org/message-id/flat/9f4a159b-f527-465f-b82e-38b4b7df812f@postgrespro.ru

Initial versions of my patch didn't do this. The latest version does
divide like this, though. In general it isn't all that likely that an
inner index scan would have more than a single primitive index scan,
in any case, so which particular behavior I use here (divide vs don't
divide) is not something that I feel strongly about.

I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the average buffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of work.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Matthias van de Meent

Date:

27 November 2024, 16:36:11

On Wed, 27 Nov 2024 at 14:22, Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>
> Sorry it took me so long to answer, I had some minor health complications
>
> On 12.11.2024 23:00, Peter Geoghegan wrote:
>
> On Sun, Nov 10, 2024 at 2:00 PM Alena Rybakina
> <a.rybakina@postgrespro.ru> wrote:
>
> Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation and
yourfirst letter that it specifies how many times in total the tuple search would be performed during the index
execution.Is that not quite right? 
>
> Well, nodes that appear on the inner side of a nested loop join (and
> in a few other contexts) generally have their row counts (and a few
> other things) divided by the total number of executions. The idea is
> that we're showing the average across all executions of the node -- if
> the user wants the true absolute number, they're expected to multiply
> nrows by nloops themselves. This is slightly controversial behavior,
> but it is long established (weirdly, we never divide by nloops for
> "Buffers").
>
> I understood what you mean and I faced this situation before when I saw extremely more number of actual rows that
couldbe and it was caused by the number of scanned tuples per cycles. [0] 
>
> [0] https://www.postgresql.org/message-id/flat/9f4a159b-f527-465f-b82e-38b4b7df812f@postgrespro.ru
>
> Initial versions of my patch didn't do this. The latest version does
> divide like this, though. In general it isn't all that likely that an
> inner index scan would have more than a single primitive index scan,
> in any case, so which particular behavior I use here (divide vs don't
> divide) is not something that I feel strongly about.
>
> I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the
averagebuffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of work. 

I disagree; I think the whole "dividing by number of loops and
rounding up to integer" was the wrong choice for tuple count, as that
makes it difficult if not impossible to determine the actual produced
count when it's less than the number of loops. Data is lost in the
rounding/processing, and I don't want to have lost that data.

Same applies for ~scans~ searches: If we do an index search, we should
show it in the count as total sum, not partial processed value. If a
user is interested in per-loopcount values, then they can derive that
value from the data they're presented with; but that isn't true when
we present only the divided-and-rounded value.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Alena Rybakina

Date:

29 November 2024, 00:09:27

Hi!

On 27.11.2024 16:36, Matthias van de Meent wrote:
> On Wed, 27 Nov 2024 at 14:22, Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>> Sorry it took me so long to answer, I had some minor health complications
>>
>> On 12.11.2024 23:00, Peter Geoghegan wrote:
>>
>> On Sun, Nov 10, 2024 at 2:00 PM Alena Rybakina
>> <a.rybakina@postgrespro.ru> wrote:
>>
>> Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation
andyour first letter that it specifies how many times in total the tuple search would be performed during the index
execution.Is that not quite right?
 
>>
>> Well, nodes that appear on the inner side of a nested loop join (and
>> in a few other contexts) generally have their row counts (and a few
>> other things) divided by the total number of executions. The idea is
>> that we're showing the average across all executions of the node -- if
>> the user wants the true absolute number, they're expected to multiply
>> nrows by nloops themselves. This is slightly controversial behavior,
>> but it is long established (weirdly, we never divide by nloops for
>> "Buffers").
>>
>> I understood what you mean and I faced this situation before when I saw extremely more number of actual rows that
couldbe and it was caused by the number of scanned tuples per cycles. [0]
 
>>
>> [0] https://www.postgresql.org/message-id/flat/9f4a159b-f527-465f-b82e-38b4b7df812f@postgrespro.ru
>>
>> Initial versions of my patch didn't do this. The latest version does
>> divide like this, though. In general it isn't all that likely that an
>> inner index scan would have more than a single primitive index scan,
>> in any case, so which particular behavior I use here (divide vs don't
>> divide) is not something that I feel strongly about.
>>
>> I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the
averagebuffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of
work.
> I disagree; I think the whole "dividing by number of loops and
> rounding up to integer" was the wrong choice for tuple count, as that
> makes it difficult if not impossible to determine the actual produced
> count when it's less than the number of loops. Data is lost in the
> rounding/processing, and I don't want to have lost that data.
>
> Same applies for ~scans~ searches: If we do an index search, we should
> show it in the count as total sum, not partial processed value. If a
> user is interested in per-loopcount values, then they can derive that
> value from the data they're presented with; but that isn't true when
> we present only the divided-and-rounded value.
>
To be honest, I didn't understand how it will be helpful because there 
is an uneven distribution of buffer usage from cycle to cycle, isn't it?

I thought that the dividing memory on number of cycles helps us to 
normalize the metric to account for the repeated iterations. This gives 
us a clearer picture of the resource intensity per basic unit of work, 
rather than just the overall total. Each loop may consume a different 
amount of buffer space, but by averaging it out, we're smoothing those 
fluctuations into a more representative measure.

Moreover, this does not correspond to another metric that is nearby - 
the number of lines processed by the algorithm for the inner node. Will 
not the user who evaluates the query plan be confused by such a discrepancy?

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Matthias van de Meent

Date:

03 December 2024, 18:01:39

On Thu, 28 Nov 2024 at 22:09, Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>
> Hi!
>
> On 27.11.2024 16:36, Matthias van de Meent wrote:
>> On Wed, 27 Nov 2024 at 14:22, Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>>> Sorry it took me so long to answer, I had some minor health complications
>>>
>>> On 12.11.2024 23:00, Peter Geoghegan wrote:
>>>
>>> On Sun, Nov 10, 2024 at 2:00 PM Alena Rybakina
>>> <a.rybakina@postgrespro.ru> wrote:
>>>
>>> Or maybe I was affected by fatigue, but I don’t understand this point, to be honest. I see from the documentation
andyour first letter that it specifies how many times in total the tuple search would be performed during the index
execution.Is that not quite right? 
>>>
>>> Well, nodes that appear on the inner side of a nested loop join (and
>>> in a few other contexts) generally have their row counts (and a few
>>> other things) divided by the total number of executions. The idea is
>>> that we're showing the average across all executions of the node -- if
>>> the user wants the true absolute number, they're expected to multiply
>>> nrows by nloops themselves. This is slightly controversial behavior,
>>> but it is long established (weirdly, we never divide by nloops for
>>> "Buffers").
>>>
>>> I understood what you mean and I faced this situation before when I saw extremely more number of actual rows that
couldbe and it was caused by the number of scanned tuples per cycles. [0] 
>>>
>>> [0] https://www.postgresql.org/message-id/flat/9f4a159b-f527-465f-b82e-38b4b7df812f@postgrespro.ru
>>>
>>> Initial versions of my patch didn't do this. The latest version does
>>> divide like this, though. In general it isn't all that likely that an
>>> inner index scan would have more than a single primitive index scan,
>>> in any case, so which particular behavior I use here (divide vs don't
>>> divide) is not something that I feel strongly about.
>>>
>>> I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the
averagebuffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of work. 
>> I disagree; I think the whole "dividing by number of loops and
>> rounding up to integer" was the wrong choice for tuple count, as that
>> makes it difficult if not impossible to determine the actual produced
>> count when it's less than the number of loops. Data is lost in the
>> rounding/processing, and I don't want to have lost that data.
>>
>> Same applies for ~scans~ searches: If we do an index search, we should
>> show it in the count as total sum, not partial processed value. If a
>> user is interested in per-loopcount values, then they can derive that
>> value from the data they're presented with; but that isn't true when
>> we present only the divided-and-rounded value.
>>
> To be honest, I didn't understand how it will be helpful because there
> is an uneven distribution of buffer usage from cycle to cycle, isn't it?

I'm sorry, I don't quite understand what you mean by cycle here.

> I thought that the dividing memory on number of cycles helps us to
> normalize the metric to account for the repeated iterations. This gives
> us a clearer picture of the resource intensity per basic unit of work,
> rather than just the overall total. Each loop may consume a different
> amount of buffer space, but by averaging it out, we're smoothing those
> fluctuations into a more representative measure.

The issue I see here is that users can get those numbers from raw
results, but they can't get the raw (more accurate) data from the
current output; if we only show processed data (like the 'rows' metric
in text output, which is a divided-and-rounded value) you can't get
the original data back with good confidence.

E.g., I have a table 'twentyone' with values 1..21, and I left join it
on a table 'ten' with values 1..10. The current text explain output
-once the planner is convinced to execute (nested loop left join
(seqscan 'thousand'), (index scan 'ten'))- will show that the index
scan path produced 0 rows, which is clearly wrong, and I can't get the
original value back with accuracy by multiplying rows with loops due
to the rounding.

> Moreover, this does not correspond to another metric that is nearby -
> the number of lines processed by the algorithm for the inner node.

It doesn't have much correspondence to that anyway, as we don't count
lines that were accessed but didn't match index quals, nor heap tuples
filtered by rechecks, in the `rows` metric.

> Will
> not the user who evaluates the query plan be confused by such a discrepancy?

I think users will be more confused about a discrepancy between buffer
accesses and index searches (which are more closely related to
eachother) than a discrepancy between index searches and
rounded-average-number-of-tuples-produced-per-loop, or the discrepancy
between not-quite-average-tuples-procuded-per-loop vs the "heap
fetches" counter of an IndexOnlyScan, etc.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Peter Geoghegan

Date:

17 February, 20:56:12

On Wed, Nov 27, 2024 at 8:36 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the
averagebuffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of work. 
>
> I disagree; I think the whole "dividing by number of loops and
> rounding up to integer" was the wrong choice for tuple count, as that
> makes it difficult if not impossible to determine the actual produced
> count when it's less than the number of loops. Data is lost in the
> rounding/processing, and I don't want to have lost that data.

I think that you're definitely right about this. I changed my mind (or
changed it back to my original position) recently, when I noticed how
bad the problem was with parallel index scans: nloops generally comes
from the number of workers (including the leader) for parallel scans,
and so it wasn't that hard to see "Index Searches: 0" with the latest
version (the version that started to divide by nloops). Obviously,
that behavior is completely ridiculous. Let's not do that.

The precedent to follow here is "Heap Fetches: N" (in the context of
index-only scans), which also doesn't divide by nloops. Likely because
the same sorts of issues arise with heap fetches.

> Same applies for ~scans~ searches: If we do an index search, we should
> show it in the count as total sum, not partial processed value. If a
> user is interested in per-loopcount values, then they can derive that
> value from the data they're presented with; but that isn't true when
> we present only the divided-and-rounded value.

I recently came across a good example of how showing "Index Searches:
N" is likely to be useful in the context of nested loop joins. The
example comes from the recently committed ORs-to-SAOP join
transformation work (commit 627d6341).

If I run the test case (taken from src/test/regress/sql/join.sql) with
EXPLAIN ANALYZE, the output confirms that the optimization added by
that commit works particularly well:

pg@regression:5432 [205457]=# explain (analyze, costs off, SUMMARY off)
select count(*)
from tenk1 t1, tenk1 t2
where t2.thousand = t1.tenthous or t2.thousand = t1.unique1 or
t2.thousand = t1.unique2;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                    QUERY PLAN
                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (actual time=13.761..13.762 rows=1 loops=1)
                                            │
│   Buffers: shared hit=24201
                                            │
│   ->  Nested Loop (actual time=0.011..12.928 rows=20000 loops=1)
                                            │
│         Buffers: shared hit=24201
                                            │
│         ->  Seq Scan on tenk1 t1 (actual time=0.004..0.507
rows=10000 loops=1)                                   │
│               Buffers: shared hit=345
                                            │
│         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t2
(actual time=0.001..0.001 rows=2 loops=10000) │
│               Index Cond: (thousand = ANY (ARRAY[t1.tenthous,
t1.unique1, t1.unique2]))                          │
│               Index Searches: 11885
                                            │
│               Heap Fetches: 0
                                            │
│               Buffers: shared hit=23856
                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(11 rows)

As you can see, there are 10,000 executions of the inner index-only
scan here, which has a SAOP qual whose array will always have 3 array
elements. That means that the best possible case is 10,000 index
searches, and the worst possible case is 30,000 index searches. We
actually see "Index Searches: 11885" -- not bad!

The main factor that gets us pretty close to that best possible case
is a certain kind of redundancy: many individual inner index scans
have duplicate array elements, allowing nbtree preprocessing to shrink
the array when as it is sorted and deduplicated -- the array used
during many individual inner scan executions has as few as one or two
array elements. Another contributing factor is the prevalence of "out
of bounds" array elements: many individual SAOP arrays/inner scans
have 2 array elements that are both greater than 1,000. That'll allow
nbtree to get away with needing only one index search for all
out-of-bounds array elements. That is, it allows nbtree to determine
that all out-of-bounds elements can't possibly have any matches using
only one index search (a search that lands on the rightmost leaf page,
where no matches for any out-of-bounds element will be found).

Of course, this can only be surmised from the EXPLAIN ANALYZE output
shown because I went back to not dividing by nloops within explain.c.
A huge amount of useful information would be lost in cases like this
if we divide by nloops. So, again, let's not do it that way.

It'd be just as easy to surmise what's going on here if the inner
index scan happened to be a plain index scan. That would make the
"Buffers" output include heap buffer hits, which would usually make it
impossible to infer how many of the "Buffers hit" came from the index
structure. My analysis didn't rely on "Buffers" at all, though (only
on "Index Searches: 11885" + "loops=10000"), so everything I pointed
out would be just as readily apparent.

--
Peter Geoghegan

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

From

Alena Rybakina

Date:

17 February, 21:36:14

On 17.02.2025 20:56, Peter Geoghegan wrote:

On Wed, Nov 27, 2024 at 8:36 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I think we should divide them because by dividing the total buffer usage by the number of loops, user finds the average buffer consumption per loop. This gives them a clearer picture of the resource intensity per basic unit of work.
I disagree; I think the whole "dividing by number of loops and
rounding up to integer" was the wrong choice for tuple count, as that
makes it difficult if not impossible to determine the actual produced
count when it's less than the number of loops. Data is lost in the
rounding/processing, and I don't want to have lost that data.
I think that you're definitely right about this. I changed my mind (or
changed it back to my original position) recently, when I noticed how
bad the problem was with parallel index scans: nloops generally comes
from the number of workers (including the leader) for parallel scans,
and so it wasn't that hard to see "Index Searches: 0" with the latest
version (the version that started to divide by nloops). Obviously,
that behavior is completely ridiculous. Let's not do that.

The precedent to follow here is "Heap Fetches: N" (in the context of
index-only scans), which also doesn't divide by nloops. Likely because
the same sorts of issues arise with heap fetches.

Yes, you are right, I agree with both of you.

-- 
Regards,
Alena Rybakina
Postgres Professional