Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id 1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On 7/22/25 19:35, Peter Geoghegan wrote:
> On Tue, Jul 22, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
>> Real workloads are likely to have multiple misses in a row, which indeed
>> ramps up the distance quickly. So maybe it's not that bad. Could we
>> track a longer history of hits/misses, and consider that when adjusting
>> the distance? Not just the most recent hit/miss?
> 
> +1
> 
>> FWIW I re-ran the index-prefetch-test benchmarks with restoring the
>> distance for the "simple" patch. The results are in the same github
>> repository, in a separate branch:
>>
>> https://github.com/tvondra/indexscan-prefetch-tests/tree/with-distance-restore-after-reset
> 
> These results make way more sense. There was absolutely no reason why
> the "simple" patch should have done so much worse than the "complex"
> one for most of the tests you've been running.
> 
> Obviously, whatever advantage that the "complex" patch has is bound to
> be limited to cases where index characteristics are naturally the
> limiting factor. For example, with the pgbench_accounts_pkey table
> there are only ever 6 distinct heap blocks on each leaf page. I bet
> that your "linear" test more or less looks like that, too.
> 

Yes. It's definitely true we could construct examples where the complex
patch beats the simple one for this reason. And I believe some of those
examples could be quite realistic, even if not very common (like when
very few index tuples fit on a leaf page).

However, I'm not sure the pgbench example with only 6 heap blocks per
leaf is very significant. Sure, the simple patch can't prefetch TIDs
from the following leaf, but AFAICS the complex patch won't do that
either. Not because it couldn't, but because with that many hits the
distance will drop to ~1 (or close to it). (It'll probably prefetch a
couple TIDs from the next leaf at the very end of the page, but I don't
think that matters overall.)

I'm not sure what prefetch distances will be sensible in queries that do
other stuff. The queries in the benchmark do just the index scan, but if
the query does something with the tuple (in the nodes on top), that
shortens the required prefetch distance. Of course, simple queries will
benefit from prefetching far ahead.


> I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that
> (among other things) outputs "nhblks" for each leaf page from a given
> index (while showing the details of each leaf page in index key space
> order). It also shows results for pgbench_accounts_pkey with pgbench
> scale 1. This is how I determined that every pgbench_accounts_pkey
> leaf page points to exactly 6 distinct heap blocks -- "nhblks" is
> always 6. Note that this is what I see regardless of the pgbench
> scale, indicating that things always perfectly line up (even more than
> I would expect for very synthetic data such as this).
> 

Thanks. I wonder how difficult it would be to add something like this to
pgstattuple. I mean, it shouldn't be difficult to look at the leaf pages
and count the distinct heap blocks, right? Seems quite useful.
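
In the meantime, pageinspect can already produce per-leaf counts with
something like the rough sketch below (not necessarily the query from
Peter's attachment; it doesn't skip the high key on non-rightmost leaves
or handle posting-list tuples, so the numbers are only approximate):

-- count index tuples and distinct heap blocks for each leaf page
WITH leaf AS (
  SELECT blkno
  FROM generate_series(1, pg_relation_size('pgbench_accounts_pkey')
                          / current_setting('block_size')::int - 1) AS blkno
  WHERE (bt_page_stats('pgbench_accounts_pkey', blkno::int)).type = 'l'
)
SELECT l.blkno,
       count(*) AS nhtids,
       count(DISTINCT (i.ctid::text::point)[0]::bigint) AS nhblks
FROM leaf l,
     LATERAL bt_page_items('pgbench_accounts_pkey', l.blkno::int) AS i
GROUP BY l.blkno
ORDER BY l.blkno;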

EXPLAIN would also greatly benefit from tracking something like this. The
buffer "hits" and "reads" counts can be very difficult to interpret.

> This query is unwieldy when run against larger indexes, but that
> shouldn't be necessary. As with pgbench_accounts_pkey, it's typical
> for synthetically generated data to have a very consistent "nhblks",
> regardless of the total amount of data.
> 
> With your "uniform" test cases, I'd expect this query to show "nhtids
> == nhblks" (or very close to it), which of course makes our ability to
> eagerly read further leaf pages almost irrelevant. If there are
> hundreds of distinct heap blocks on each leaf page, but
> effective_io_concurrency is 16 (or even 64), there's little we can do
> about it.
> 

Right.

>> I'm attaching two PDFs with results for eic=16 (linear and log-scale, to
>> compare timings for quick queries). This shows that with restoring
>> distance after reset, the simple patch is pretty much the same as the
>> complex patch.
>>
>> The only data set where that's not the case is the "linear" data set,
>> when everything is perfectly sequential. In this case the simple patch
>> performs like "master" (i.e. no prefetching). I'm not sure why that is.
> 
> Did you restore the distance for the "complex" patch, too? I think
> that it might well matter there too.
> 

No, I did not. I did consider it, but it seemed to me it couldn't really
make a difference (for these data sets), because each leaf has ~300
items, and the patch limits the prefetch to 64 leaf pages. That means it
can prefetch ~20k TIDs ahead, and with ~20 items per heap page that's
roughly a thousand distinct heap blocks of look-ahead. So this should be
good enough for eic=1000. It should never hit a stream reset.

It'd be useful to show some prefetch info in EXPLAIN, I guess. It should
not be difficult to track how many times the stream was reset, the
average prefetch distance, and perhaps even a histogram of distances.
The simple patch tracks the average distance, at least.
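
Just to illustrate, a purely hypothetical mock-up of what that might look
like (no such output exists in any patch, and the table/column names and
numbers are made up):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM linear WHERE a BETWEEN 1000 AND 2000;
--  Index Scan using linear_a_idx on linear (actual rows=... loops=1)
--    Index Cond: ((a >= 1000) AND (a <= 2000))
--    Buffers: shared hit=... read=...
--    Prefetch: resets=3, avg distance=41.5,
--              distances: [1,8)=120 [8,64)=950 [64,256)=310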

> Isn't the obvious explanation that the complex patch benefits from
> being able to prefetch without being limited by index
> characteristics/leaf page boundaries, while the simple patch doesn't?
> 

That's a valid interpretation, yes. Although the benefit comes mostly
from cases where the index is well correlated, so that each leaf page
points to only a handful of distinct heap blocks.

>> Anyway, it seems to confirm most of the differences between the two
>> patches are due to the "distance collapse". The impact of the resets in
>> the first benchmarks surprised me quite a bit, but if we don't ramp up
>> the distance that makes perfect sense.
>>
>> The issue probably affects other queries that do a lot of resets. Index
>> scan prefetching just makes it very obvious.
> 
> What is the difference between cases like "linear / eic=16 / sync" and
> "linear_1 / eic=16 / sync"?
> 
> One would imagine that these tests are very similar, based on the fact
> that they have very similar names. But we see very different results
> for each: with the former ("linear") test results, the "complex" patch
> is 2x-4x faster than the "simple" patch. But, with the latter test
> results ("linear_1", and other similar pairs of "linear_N" tests) the
> advantage for the "complex" patch *completely* evaporates. I find that
> very suspicious, and wonder if it might be due to a bug/inefficiency
> that could easily be fixed (possibly an issue on the read stream side,
> like the one you mentioned to Nazir just now).
> 

Yes, there's some similarity. Attached is the script I use to create the
tables and load the data.

The "linear" is a table with a simple sequence of values (0 to 100k).
More or less - the value is a floating point, and there are 10M rows.
But you get the idea.

The "linear_X" variants mean the value has a noise of X% of the range.
So with "linear_1" you get the "linear" value, and then random(0,1000),
with normal distribution.

The "cyclic" data sets are similar, except that the "sequence" also
wraps around 100x.

There's nothing "special" about the particular values. I simply wanted
different "levels" of noise, and 1, 10 and 25 seemed good. I initially
had a couple higher values, but that was pretty close to "uniform".
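
For reference, a minimal sketch of what the "linear_1" load roughly looks
like (this is not the attached script; the column name and the exact
noise function are guesses, and random_normal() needs PostgreSQL 16+):

CREATE TABLE linear_1 (a float8);

-- 10M rows, values 0..100k in sequence, plus noise of ~1% of the range
INSERT INTO linear_1
SELECT i * 100000.0 / 10000000
       + random_normal(0, 1000)
FROM generate_series(1, 10000000) AS i;

CREATE INDEX ON linear_1 (a);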

regards

-- 
Tomas Vondra

