Thread: index prefetching

index prefetching

From

Tomas Vondra

Date:

08 June 2023, 15:40:12

Hi,

At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.


Motivation
----------

Imagine we have a huge table (much larger than RAM), with an index, and
that we're doing a regular index scan (e.g. using a btree index). We
first walk the index to the leaf page, read the item pointers from the
leaf page and then start issuing fetches from the heap.

The index access is usually pretty cheap, because non-leaf pages are
very likely cached, so we may do perhaps one I/O for the leaf. But the
fetches from heap are likely very expensive - unless the page is
clustered, we'll do a random I/O for each item pointer. Easily ~200 or
more I/O requests per leaf page. The problem is index scans do these
requests synchronously at the moment - we get the next TID, fetch the
heap page, process the tuple, continue to the next TID etc.

That is slow and can't really leverage the bandwidth of modern storage,
which requires longer queues. This patch aims to improve this by async
prefetching.
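
To make the idea concrete, here is a minimal sketch of "hint ahead, then read" on a POSIX system that provides posix_fadvise (e.g. Linux). It is not code from the patch: the function name, the 8 kB block size constant and the temporary file standing in for a heap relation are all assumptions made for illustration.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size (PostgreSQL's default)


def read_with_prefetch(fd, blocks, distance):
    """Read `blocks` in the given order, hinting up to `distance` entries ahead."""
    hinted = 0        # number of leading entries of `blocks` already hinted or read
    hints_issued = 0
    pages = []
    for pos, block in enumerate(blocks):
        # Keep the hint window `distance` entries ahead of the read position.
        target = min(len(blocks), pos + 1 + distance)
        hinted = max(hinted, pos + 1)
        while hinted < target:
            # Ask the kernel to start reading this block in the background.
            os.posix_fadvise(fd, blocks[hinted] * BLOCK_SIZE, BLOCK_SIZE,
                             os.POSIX_FADV_WILLNEED)
            hints_issued += 1
            hinted += 1
        # The synchronous read; ideally the data is already in the page cache.
        pages.append(os.pread(fd, BLOCK_SIZE, block * BLOCK_SIZE))
    return pages, hints_issued


# Stand-in for a heap file: 8 blocks, block i filled with byte value i.
with tempfile.TemporaryFile() as f:
    for i in range(8):
        f.write(bytes([i]) * BLOCK_SIZE)
    f.flush()
    pages, hints = read_with_prefetch(f.fileno(), [5, 1, 7, 1], distance=2)

print([page[0] for page in pages], hints)
```

With distance=0 the loop degenerates to the purely synchronous behaviour described above: every read waits for its own I/O.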

We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly prefetching was implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there are many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.

But there are three shortcomings in that logic:

1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.

2) Our estimates / planning are not perfect, so we may easily pick an
index scan instead of a bitmap scan. It'd be nice to limit the damage a
bit by still prefetching.

3) There are queries that can't do a bitmap scan (at all, or because
it's hopelessly inefficient). Consider queries that require ordering, or
queries by distance with GiST/SP-GiST index.


Implementation
--------------

When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.

The easiest thing would be to just do prefetching from the btree code.
But then I realized there's no particular reason why other index types
(except for GIN, which only allows bitmap scans) couldn't do prefetching
too. We could have a copy in each AM, of course, but that seems sloppy
and also a violation of layering. After all, bitmap heap scans do prefetch
from the executor, so AM seems way too low level.

So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).

So what I did is introduce an IndexPrefetch struct, which is part of
IndexScanDesc, maintaining all the info about prefetching for that
particular scan - current/maximum distance, progress, etc.

It also contains two AM-specific callbacks (get_range and get_block)
which return the valid range of indexes (into the internal array), and
the block number for a given index.
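
A rough sketch of how such a struct and the two callbacks might fit together is below. Only the names IndexPrefetch, index_prefetch, get_range and get_block come from the description above; the field names, the Python modelling and the list standing in for actual prefetch requests are assumptions, not the patch's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class IndexPrefetch:
    """Hypothetical per-scan prefetch state (field names are illustrative)."""
    get_range: Callable[[], Tuple[int, int]]  # AM callback: (first, last) valid slot
    get_block: Callable[[int], int]           # AM callback: slot -> heap block number
    distance: int                             # how many items to stay ahead
    next_slot: int = 0                        # progress: first slot not yet prefetched
    requested: List[int] = field(default_factory=list)  # stand-in for real prefetch calls


def index_prefetch(state: IndexPrefetch, current_slot: int) -> None:
    """Request heap blocks for up to `distance` slots beyond `current_slot`."""
    first, last = state.get_range()
    start = max(state.next_slot, current_slot + 1, first)
    stop = min(last, current_slot + state.distance)
    for slot in range(start, stop + 1):
        state.requested.append(state.get_block(slot))
    state.next_slot = max(state.next_slot, stop + 1)


# A btree-like AM: one leaf page's worth of (block, offset) item pointers.
leaf_items = [(10, 1), (42, 3), (10, 2), (7, 5), (99, 1)]
state = IndexPrefetch(
    get_range=lambda: (0, len(leaf_items) - 1),
    get_block=lambda slot: leaf_items[slot][0],
    distance=2,
)
for slot in range(len(leaf_items)):  # the scan returns one TID at a time
    index_prefetch(state, slot)

print(state.requested)
```

The point of the callbacks is that index_prefetch() never looks at leaf_items directly, so an AM with a different internal representation only has to supply its own get_range / get_block pair.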

This mostly does the trick, although index_prefetch() is still called
from the amgettuple() functions. That seems wrong, we should call it
from indexam.c right after calling amgettuple.


Problems / Open questions
-------------------------

There are a couple of issues I ran into; I'll try to list them in the order
of importance (most serious ones first).

1) pairing-heap in GiST / SP-GiST

For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.


2) prefetching from executor

Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?

I'm also not entirely sure the way this interfaces with the AM (through
the get_range / get_block callbacks) is very elegant. It did the trick,
but it seems a bit cumbersome. I wonder if someone has a better/nicer
idea how to do this ...


3) prefetch distance

I think we can do various smart things about the prefetch distance.

The current code does about the same thing bitmap scans do - it starts
with distance 0 (no prefetching), and then simply ramps the distance up
until the maximum value from get_tablespace_io_concurrency(). Which is
either effective_io_concurrency, or per-tablespace value.
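
For illustration, the ramp-up could look roughly like the sketch below. The doubling step and the function name are assumptions made for the example; the description above only says that the distance starts at 0 and grows until it reaches the maximum.

```python
def next_distance(current: int, maximum: int) -> int:
    """Return the prefetch distance to use for the next item (assumed doubling ramp)."""
    if current >= maximum:
        return maximum              # already at the cap (a cap of 0 disables prefetching)
    if current == 0:
        return 1                    # first step away from "no prefetching"
    return min(current * 2, maximum)


# Ramp from 0 towards an assumed maximum of 16 (think effective_io_concurrency).
distance, ramp = 0, []
for _ in range(7):
    ramp.append(distance)
    distance = next_distance(distance, 16)

print(ramp)
```

Starting low keeps short scans (a handful of matching rows) from issuing prefetches they will never benefit from, while long scans reach the full distance after a few items.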

I think we could be a bit smarter, and also consider e.g. the estimated
number of matching rows (but we shouldn't be too strict, because it's
just an estimate). We could also track some statistics for each scan and
use that during rescans (think index scan in a nested loop).

But the patch doesn't do any of that now.


4) per-leaf prefetching

The code is restricted: it only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.


5) index-only scans

I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
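
A sketch of what "considering visibility during prefetching" might mean is below. The set of all-visible blocks is a stand-in for a visibility map lookup, and the function name is hypothetical; none of this is in the patch.

```python
def blocks_to_prefetch(tids, all_visible):
    """Return the heap blocks an index-only scan would still have to visit."""
    wanted = []
    for block, _offset in tids:
        if block in all_visible:
            continue    # the index tuple alone is enough, no heap access needed
        if wanted and wanted[-1] == block:
            continue    # same block as the previous request, nothing new to prefetch
        wanted.append(block)
    return wanted


# TIDs from one leaf page as (block, offset); block 2 is assumed all-visible.
tids = [(1, 1), (1, 2), (2, 1), (3, 1), (3, 4), (2, 7)]
print(blocks_to_prefetch(tids, all_visible={2}))
```

If most pages are all-visible the list stays nearly empty and the scan behaves like an index-only scan today; if few are, it converges to the plain index scan prefetching case.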


Benchmarks
----------

1) OLTP

For OLTP, this tested different queries with various index types, on
data sets constructed to have certain number of matching rows, forcing
different types of query plans (bitmap, index, seqscan).

The data sets have ~34GB, which is much more than available RAM (8GB).

For example for BTREE, we have a query like this:

   SELECT * FROM btree_test WHERE a = $v

with data matching 1, 10, 100, ..., 100000 rows for each $v. The results
look like this:

   rows    bitmapscan     master    patched    seqscan
   1             19.8       20.4       18.8    31875.5
   10            24.4       23.8       23.2    30642.4
   100           27.7       40.0       26.3    31871.3
   1000          45.8      178.0       45.4    30754.1
   10000        171.8     1514.9      174.5    30743.3
   100000      1799.0    15993.3     1777.4    30937.3

This says that the query takes ~31s with a seqscan, 1.8s with a bitmap
scan and 16s with an index scan (on master). With the prefetching patch, it
takes about 1.8s, i.e. about the same as the bitmap scan.

I don't know where exactly the plan would switch from index scan to
bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd
bet most of the cases would do plain index scan.


For a query with ordering:

    SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n

the results look a bit different:

    rows      bitmapscan     master     patched     seqscan
    1            52703.9       19.5        19.5     31145.6
    10           51208.1       22.7        24.7     30983.5
    100          49038.6       39.0        26.3     32085.3
    1000         53760.4      193.9        48.4     31479.4
    10000        56898.4     1600.7       187.5     32064.5
    100000       50975.2    15978.7      1848.9     31587.1

This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).

Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
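
As a quick check of the "order of magnitude" remark, the master and patched columns of the ORDER BY table can be divided directly. This is plain arithmetic over the numbers already shown; nothing new is measured.

```python
# (rows, master, patched) timings copied from the ORDER BY table above
results = [
    (1, 19.5, 19.5),
    (10, 22.7, 24.7),
    (100, 39.0, 26.3),
    (1000, 193.9, 48.4),
    (10000, 1600.7, 187.5),
    (100000, 15978.7, 1848.9),
]

# Ratio > 1 means the patched build is faster than master.
speedups = {rows: round(master / patched, 1) for rows, master, patched in results}
for rows, ratio in speedups.items():
    print(f"{rows:>6} rows: master / patched = {ratio}x")
```

The ratio stays close to 1 for 1 and 10 rows and reaches roughly 8.5x for 10000 and 100000 rows.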

The results for other index types (HASH, GiST, SP-GiST) follow roughly
the same pattern. See the attached PDF for more charts, and [1] for
complete results.


Benchmark / TPC-H
-----------------

I ran the 22 queries on 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):

    query    serial    parallel
    1          101%         99%
    2          119%        100%
    3          100%         99%
    4          101%        100%
    5          101%        100%
    6           12%         99%
    7          100%        100%
    8           52%         67%
    10         102%        101%
    11         100%         72%
    12         101%        100%
    13         100%        101%
    14          13%        100%
    15         101%        100%
    16          99%         99%
    17          95%        101%
    18         101%        106%
    19          30%         40%
    20          99%        100%
    21         101%        100%
    22         101%        107%

The percentage is (timing patched / timing master), so <100% means faster,
and >100% means slower.

The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.

My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.

There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.


regards


[1] https://github.com/tvondra/index-prefetch-tests
[2] https://github.com/tvondra/postgres/tree/dev/index-prefetch


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>     Normal index scans are an even more interesting case but I'm not
>     sure how hard it would be to get that information. It may only be
>     convenient to get the blocks from the last leaf page we looked at,
>     for example.
>
> So this suggests we simply started prefetching for the case where the
> information was readily available, and it'd be harder to do for index
> scans so that's it.

What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

> Even if SAOP (probably) wasn't the reason, I think you're right it may
> be an issue for prefetching, causing regressions. It didn't occur to me
> before, because I'm not that familiar with the btree code and/or how it
> deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.

> So if you're planning to work on this for PG17, collaborating on it
> would be great.
>
> For now I plan to just ignore SAOP, or maybe just disabling prefetching
> for SAOP index scans if it proves to be prone to regressions. That's not
> great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that.

IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.

> I guess something like this might be a "nice" bad case:
>
>     insert into btree_test select mod(i,100000), md5(i::text)
>       from generate_series(1, $ROWS) s(i)
>
>     select * from btree_test where a in (999, 1000, 1001, 1002)
>
> The values are likely colocated on the same heap page, the bitmap scan
> is going to do a single prefetch. With index scan we'll prefetch them
> repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?

I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.

I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
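
The sort-then-restore idea can be sketched as follows. This is only an illustration under simplifying assumptions: TIDs are (block, offset) pairs, read_page is a hypothetical stand-in for pinning and reading a heap page, and every name is invented for the example rather than taken from the patch or from nbtree.

```python
def fetch_in_block_order(tids, read_page):
    """Return tuples in index order while reading each heap block only once."""
    # Visit the TIDs sorted by heap location, remembering their index-order slot.
    by_heap = sorted(range(len(tids)), key=lambda i: tids[i])
    results = [None] * len(tids)
    current_block, page = None, None
    for i in by_heap:
        block, offset = tids[i]
        if block != current_block:      # new heap page: one "pin" per distinct block
            page = read_page(block)
            current_block = block
        results[i] = page[offset]       # store at the original index-order slot
    return results


# Toy heap: block -> {offset -> tuple}; reads records every page access.
heap = {1: {1: "a", 2: "b"}, 2: {1: "c", 2: "d"}, 3: {1: "e"}}
reads = []


def read_page(block):
    reads.append(block)
    return heap[block]


# Index order alternates between blocks 2 and 1 before reaching block 3.
tids = [(2, 1), (1, 1), (2, 2), (1, 2), (3, 1)]
rows = fetch_in_block_order(tids, read_page)
print(rows, reads)
```

Walking the same TIDs strictly in index order would switch heap pages five times (2, 1, 2, 1, 3), whereas the sorted pass touches each of the three pages once and still returns rows in index order.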

This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.

I'm talking about problems that exist today, without your patch.

I'll show a concrete example of the kind of index/index scan that
might be affected.

Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.

Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LP_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?

While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.

We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan, and bitmap
index scan.

(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows that match a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)

Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.

> > It's important to carefully distinguish between cases where plain
> > index scans really are at an inherent disadvantage relative to bitmap
> > index scans (because there really is no getting around the need to
> > access the same heap page many times with an index scan) versus cases
> > that merely *appear* that way. Implementation restrictions that only
> > really affect the plain index scan case (e.g., the lack of a
> > reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> > should be accounted for when assessing the viability of index scan +
> > prefetch over bitmap index scan + prefetch. This is very subtle, but
> > important.
> >
>
> I do agree, but what do you mean by "assessing"?

I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.

> Wasn't the agreement at
> the unconference session that we'd not tweak costing? So ultimately, this
> does not really affect which scan type we pick. We'll keep doing the
> same planning decisions as today, no?

I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.

One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.

The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.

> I'm all for building a more comprehensive set of test cases - the stuff
> presented at pgcon was good for demonstration, but it certainly is not
> enough for testing. The SAOP queries are a great addition, I also plan
> to run those queries on different (less random) data sets, etc. We'll
> probably discover more interesting cases as the patch improves.

Definitely.

> There are two aspects why I think AM is not the right place:
>
> - accessing table from index code seems backwards
>
> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>
> It feels kinda wrong in hindsight.

I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.

> >> I think this is an acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.

> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
> to do that. But it seems like work for future someone.

Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.

--
Peter Geoghegan

Attachment

pg_type_typname_nsp_index_index_example_log.txt

Re: index prefetching

From

Andres Freund

Date:

09 June 2023, 00:06:00

Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.

I'm really excited about this work.


> 1) pairing-heap in GiST / SP-GiST
> 
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
> 
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
> 
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
> 
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.


> 2) prefetching from executor
> 
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...


> 4) per-leaf prefetching
> 
> The code is restricted: it only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
> 
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.


> 5) index-only scans
> 
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access that. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?



> Benchmark / TPC-H
> -----------------
> 
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
> 
>     query    serial    parallel
>     1          101%         99%
>     2          119%        100%
>     3          100%         99%
>     4          101%        100%
>     5          101%        100%
>     6           12%         99%
>     7          100%        100%
>     8           52%         67%
>     10         102%        101%
>     11         100%         72%
>     12         101%        100%
>     13         100%        101%
>     14          13%        100%
>     15         101%        100%
>     16          99%         99%
>     17          95%        101%
>     18         101%        106%
>     19          30%         40%
>     20          99%        100%
>     21         101%        100%
>     22         101%        107%
> 
> The percentage is (timing patched / timing master), so <100% means faster,
> and >100% means slower.
> 
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
> 
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
> 
> There are a couple regressions too, I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

Greetings,

Andres Freund

Re: index prefetching

From

Peter Geoghegan

Date:

09 June 2023, 00:40:15

On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.

I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:

create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;

This gives me a table with 4 pages, and an index with 2 pages.

The following query selects about half of the rows from the table:

select * from bitmap_parity_test where randkey < 500;

If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.

I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 4 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.
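
The effect can be approximated without a server by a small model. The assumptions here are mine: rows are packed about 63 per heap page, index order is (key, heap TID), a plain index scan counts one heap buffer access each time consecutive TIDs land on a different block, and a bitmap scan counts one access per distinct block. The exact counts depend on the random data, so no specific output is claimed.

```python
import random

random.seed(0)
ROWS, ROWS_PER_PAGE = 250, 63   # 250 rows spread over 4 heap pages (assumed fill)

# Each row: (randkey, (heap block, offset)), inserted in heap order.
rows = [(int(random.random() * 1000), (i // ROWS_PER_PAGE, i % ROWS_PER_PAGE))
        for i in range(ROWS)]

# "randkey < 500", visited in index order: by key, then by heap TID.
matches = sorted(row for row in rows if row[0] < 500)
blocks = [tid[0] for _key, tid in matches]

# Bitmap scan: every distinct heap block is visited once.
bitmap_heap_hits = len(set(blocks))

# Plain index scan: a new heap access whenever the block differs from the previous TID's.
plain_heap_hits = sum(1 for prev, cur in zip([None] + blocks, blocks) if prev != cur)

print("bitmap heap accesses:", bitmap_heap_hits)
print("plain index scan heap accesses:", plain_heap_hits)
```

Adding one access for the leaf page to each count gives numbers comparable to the EXPLAIN (ANALYZE, BUFFERS) totals discussed above.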

Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.

(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10)  from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)

--
Peter Geoghegan

Re: index prefetching

From

Tomas Vondra

Date:

09 June 2023, 10:18:11

On 6/9/23 02:06, Andres Freund wrote:
> Hi,
> 
> On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
>> At pgcon unconference I presented a PoC patch adding prefetching for
>> indexes, along with some benchmark results demonstrating the (pretty
>> significant) benefits etc. The feedback was quite positive, so let me
>> share the current patch more widely.
> 
> I'm really excited about this work.
> 
> 
>> 1) pairing-heap in GiST / SP-GiST
>>
>> For most AMs, the index state is pretty trivial - matching items from a
>> single leaf page. Prefetching that is pretty trivial, even if the
>> current API is a bit cumbersome.
>>
>> Distance queries on GiST and SP-GiST are a problem, though, because
>> those do not just read the pointers into a simple array, as the distance
>> ordering requires passing stuff through a pairing-heap :-(
>>
>> I don't know how to best deal with that, especially not in the simple
>> API. I don't think we can "scan forward" stuff from the pairing heap, so
>> the only idea I have is actually having two pairing-heaps. Or maybe
>> using the pairing heap for prefetching, but stashing the prefetched
>> pointers into an array and then returning stuff from it.
>>
>> In the patch I simply prefetch items before we add them to the pairing
>> heap, which is good enough for demonstrating the benefits.
> 
> I think it'd be perfectly fair to just not tackle distance queries for now.
> 

My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.

> 
>> 2) prefetching from executor
>>
>> Another question is whether the prefetching shouldn't actually happen
>> even higher - in the executor. That's what Andres suggested during the
>> unconference, and it kinda makes sense. That's where we do prefetching
>> for bitmap heap scans, so why should this happen lower, right?
> 
> Yea. I think it also provides potential for further optimizations in the
> future to do it at that layer.
> 
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...
> 

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?

> 
>> 4) per-leaf prefetching
>>
>> The code is restricted and only prefetches items from one leaf page. If the
>> index scan needs to scan multiple (many) leaf pages, we have to process
>> the first leaf page first before reading / prefetching the next one.
>>
>> I think this is an acceptable limitation, certainly for v0. Prefetching
>> across multiple leaf pages seems way more complex (particularly for the
>> cases using pairing heap), so let's leave this for the future.
> 
> Hm. I think that really depends on the shape of the API we end up with. If we
> move the responsibility more towards the executor, I think it very well
> could end up being just as simple to prefetch across index pages.
> 

Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).

> 
>> 5) index-only scans
>>
>> I'm not sure what to do about index-only scans. On the one hand, the
>> point of IOS is not to read stuff from the heap at all, so why prefetch
>> it. OTOH if there are many allvisible=false pages, we still have to
>> access that. And if that happens, this leads to the bizarre situation
>> that IOS is slower than regular index scan. But to address this, we'd
>> have to consider the visibility during prefetching.
> 
> That should be easy to do, right?
> 

It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway so it seems like it wouldn't add a lot
of overhead either.

> 
> 
>> Benchmark / TPC-H
>> -----------------
>>
>> I ran the 22 queries on 100GB data set, with parallel query either
>> disabled or enabled. And I measured timing (and speedup) for each query.
>> The speedup results look like this (see the attached PDF for details):
>>
>>     query    serial    parallel
>>     1          101%         99%
>>     2          119%        100%
>>     3          100%         99%
>>     4          101%        100%
>>     5          101%        100%
>>     6           12%         99%
>>     7          100%        100%
>>     8           52%         67%
>>     10         102%        101%
>>     11         100%         72%
>>     12         101%        100%
>>     13         100%        101%
>>     14          13%        100%
>>     15         101%        100%
>>     16          99%         99%
>>     17          95%        101%
>>     18         101%        106%
>>     19          30%         40%
>>     20          99%        100%
>>     21         101%        100%
>>     22         101%        107%
>>
>> The percentage is (timing patched / master), so <100% means faster, >100%
>> means slower.
>>
>> The different queries are affected depending on the query plan - many
>> queries are close to 100%, which means "no difference". For the serial
>> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
>> while for the parallel case the benefits are somewhat less significant.
>>
>> My explanation is that either (a) parallel case used a different plan
>> with fewer index scans or (b) the parallel query does more concurrent
>> I/O simply by using parallel workers. Or maybe both.
>>
>> There are a couple regressions too, I believe those are due to doing too
>> much prefetching in some cases, and some of the heuristics mentioned
>> earlier should eliminate most of this, I think.
> 
> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
> Unless I missed what "xeon / cached (speedup)" indicates?
> 

I forgot to explain what "cached" means in the TPC-H case. It means the
second execution of the query, so you can imagine it like this:

for q in `seq 1 22`; do

   1. drop caches and restart postgres

   2. run query $q -> uncached

   3. run query $q -> cached

done

So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.

I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.

> I think it'd be good to run a performance comparison of the unpatched vs
> patched cases, with prefetching disabled for both. It's possible that
> something in the patch caused unintended changes (say spilling during a
> hashagg, due to larger struct sizes).
> 

That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on a data set that fits into RAM, to test the
"properly cached" case.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Tomas Vondra

Date:

09 June 2023, 10:44:46


On 6/9/23 01:38, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>     Normal index scans are an even more interesting case but I'm not
>>     sure how hard it would be to get that information. It may only be
>>     convenient to get the blocks from the last leaf page we looked at,
>>     for example.
>>
>> So this suggests we simply started prefetching for the case where the
>> information was readily available, and it'd be harder to do for index
>> scans so that's it.
> 
> What the exact historical timeline is may not be that important. My
> emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> compelling case for both parallel index scan and prefetching, in
> general. There are many queries that have huge in() lists that
> naturally benefit a great deal from prefetching. Plus they're common.
> 

Did you mean parallel index scan or bitmap index scan?

But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.

>> Even if SAOP (probably) wasn't the reason, I think you're right it may
>> be an issue for prefetching, causing regressions. It didn't occur to me
>> before, because I'm not that familiar with the btree code and/or how it
>> deals with SAOP (and didn't really intend to study it too deeply).
> 
> I'm pretty sure that you understand this already, but just in case:
> ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
> page" in many important cases. Not really -- not in the sense that
> you'd hope and expect. We're senselessly processing the same index
> leaf page multiple times and treating it as a different, independent
> leaf page. That makes heap prefetching of the kind you're working on
> utterly hopeless, since it effectively throws away lots of useful
> context. Obviously that's the fault of nbtree ScalarArrayOpExpr
> handling, not the fault of your patch.
> 

I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?

I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.

>> So if you're planning to work on this for PG17, collaborating on it
>> would be great.
>>
>> For now I plan to just ignore SAOP, or maybe just disabling prefetching
>> for SAOP index scans if it proves to be prone to regressions. That's not
>> great, but at least it won't make matters worse.
> 
> Makes sense, but I hope that it won't come to that.
> 
> IMV it's actually quite reasonable that you didn't expect to have to
> think about ScalarArrayOpExpr at all -- it would make a lot of sense
> if that was already true. But the fact is that it works in a way
> that's pretty silly and naive right now, which will impact
> prefetching. I wasn't really thinking about regressions, though. I was
> actually more concerned about missing opportunities to get the most
> out of prefetching. ScalarArrayOpExpr really matters here.
> 

OK

>> I guess something like this might be a "nice" bad case:
>>
>>     insert into btree_test select mod(i,100000), md5(i::text)
>>       from generate_series(1, $ROWS) s(i)
>>
>>     select * from btree_test where a in (999, 1000, 1001, 1002)
>>
>> The values are likely colocated on the same heap page, the bitmap scan
>> is going to do a single prefetch. With index scan we'll prefetch them
>> repeatedly. I'll give it a try.
> 
> This is the sort of thing that I was thinking of. What are the
> conditions under which bitmap index scan starts to make sense? Why is
> the break-even point whatever it is in each case, roughly? And, is it
> actually because of laws-of-physics level trade-off? Might it not be
> due to implementation-level issues that are much less fundamental? In
> other words, might it actually be that we're just doing something
> stoopid in the case of plain index scans? Something that is just
> papered-over by bitmap index scans right now?
> 

Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in unexpected ways (say,
cases that seem like they should improve but don't).

> I see that your patch has logic that avoids repeated prefetching of
> the same block -- plus you have comments that wonder about going
> further by adding a "small lru array" in your new index_prefetch()
> function. I asked you about this during the unconference presentation.
> But I think that my understanding of the situation was slightly
> different to yours. That's relevant here.
> 
> I wonder if you should go further than this, by actually sorting the
> items that you need to fetch as part of processing a given leaf page
> (I said this at the unconference, you may recall). Why should we
> *ever* pin/access the same heap page more than once per leaf page
> processed per index scan? Nothing stops us from returning the tuples
> to the executor in the original logical/index-wise order, despite
> having actually accessed each leaf page's pointed-to heap pages
> slightly out of order (with the aim of avoiding extra pin/unpin
> traffic that isn't truly necessary). We can sort the heap TIDs in
> scratch memory, then do our actual prefetching + heap access, and then
> restore the original order before returning anything.
> 

I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be a 1% or a 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.

I mean, imagine we have TIDs

    [T1, T2, T3, T4, T5, T6]

Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:

    [T1, T5, T6, T2, T3, T4]

But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
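
A small Python sketch of that concern, as a toy model only. The email says just that T1, T5 and T6 share a heap page, so the page labels for T2, T3 and T4 below are assumptions, and the function names are hypothetical.

```python
def reorder_by_page(tids, page_of):
    """Group TIDs by heap page, keeping pages in first-appearance order."""
    groups = {}
    for tid in tids:
        groups.setdefault(page_of[tid], []).append(tid)
    return [tid for group in groups.values() for tid in group]


def wasted_under_limit(tids, page_of, limit):
    """TIDs processed in the reordered sequence that a LIMIT never needed."""
    needed = set(tids[:limit])
    processed = []
    for tid in reorder_by_page(tids, page_of):
        processed.append(tid)
        if needed.issubset(processed):
            break  # everything the LIMIT needs has been fetched
    return [tid for tid in processed if tid not in needed]


tids = ["T1", "T2", "T3", "T4", "T5", "T6"]
# Assumed layout: T1, T5, T6 on page A; the others on separate pages.
page_of = {"T1": "A", "T5": "A", "T6": "A", "T2": "B", "T3": "C", "T4": "D"}

print(reorder_by_page(tids, page_of))        # ['T1', 'T5', 'T6', 'T2', 'T3', 'T4']
print(wasted_under_limit(tids, page_of, 2))  # ['T5', 'T6']
```

In this model the wasted work is limited to extra tuples on pages that had to be visited anyway, but whether that is cheap in practice is exactly the open question.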

> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.
> 
> I'm talking about problems that exist today, without your patch.
> 
> I'll show a concrete example of the kind of index/index scan that
> might be affected.
> 
> Attached is an extract of the server log when the regression tests ran
> against a server patched to show custom instrumentation. The log
> output shows exactly what's going on with one particular nbtree
> opportunistic deletion (my point has nothing to do with deletion, but
> it happens to be convenient to make my point in this fashion). This
> specific example involves deletion of tuples from the system catalog
> index "pg_type_typname_nsp_index". There is nothing very atypical
> about it; it just shows a certain kind of heap fragmentation that's
> probably very common.
> 
> Imagine a plain index scan involving a query along the lines of
> "select * from pg_type where typname like 'part%' ", or similar. This
> query runs an instant before the example LP_DEAD-bit-driven
> opportunistic deletion (a "simple deletion" in nbtree parlance) took
> place. You'll be able to piece together from the log output that there
> would only be about 4 heap blocks involved with such a query. Ideally,
> our hypothetical index scan would pin each buffer/heap page exactly
> once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
> we're talking about a fairly selective query here, that only needs to
> scan precisely one leaf page (I verified this part too) -- so why
> wouldn't we expect "index scan parity"?
> 
> While there is significant clustering on this example leaf page/key
> space, heap TID is not *perfectly* correlated with the
> logical/keyspace order of the index -- which can have outsized
> consequences. Notice that some heap blocks are non-contiguous
> relative to logical/keyspace/index scan/index page offset number order.
> 
> We'll end up pinning each of the 4 or so heap pages more than once
> (sometimes several times each), when in principle we could have pinned
> each heap page exactly once. In other words, there is way too much of
> a difference between the case where the tuples we scan are *almost*
> perfectly clustered (which is what you see in my example) and the case
> where they're exactly perfectly clustered. In other other words, there
> is way too much of a difference between plain index scan, and bitmap
> index scan.
> 
> (What I'm saying here is only true because this is a composite index
> and our query uses "like", returning rows that match a prefix -- if our
> index was on the column "typname" alone and we used a simple equality
> condition in our query then the Postgres 12 nbtree work would be
> enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
> that there are still relatively many important cases where we perform
> extra PinBuffer()/UnpinBuffer() calls during plain index scans that
> only touch one leaf page anyway.)
> 
> Obviously we should expect bitmap index scans to have a natural
> advantage over plain index scans whenever there is little or no
> correlation -- that's clear. But that's not what we see here -- we're
> way too sensitive to minor imperfections in clustering that are
> naturally present on some kinds of leaf pages. The potential
> difference in pin/unpin traffic (relative to the bitmap index scan
> case) seems pathological to me. Ideally, we wouldn't have these kinds
> of differences at all. It's going to disrupt usage_count on the
> buffers.
> 

I'm not sure I understand all the nuance here, but the thing I take away
is to add tests with different levels of correlation, and probably also
some multi-column indexes.

>>> It's important to carefully distinguish between cases where plain
>>> index scans really are at an inherent disadvantage relative to bitmap
>>> index scans (because there really is no getting around the need to
>>> access the same heap page many times with an index scan) versus cases
>>> that merely *appear* that way. Implementation restrictions that only
>>> really affect the plain index scan case (e.g., the lack of a
>>> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
>>> should be accounted for when assessing the viability of index scan +
>>> prefetch over bitmap index scan + prefetch. This is very subtle, but
>>> important.
>>>
>>
>> I do agree, but what do you mean by "assessing"?
> 
> I mean performance validation. There ought to be a theoretical model
> that describes the relationship between index scan and bitmap index
> scan, that has actual predictive power in the real world, across a
> variety of different cases. Something that isn't sensitive to the
> current phase of the moon (e.g., heap fragmentation along the lines of
> my pg_type_typname_nsp_index log output). I particularly want to avoid
> nasty discontinuities that really make no sense.
> 
>> Wasn't the agreement at
>> the unconference session that we'd not tweak costing? So ultimately, this
>> does not really affect which scan type we pick. We'll keep doing the
>> same planning decisions as today, no?
> 
> I'm not really talking about tweaking the costing. What I'm saying is
> that we really should expect index scans to behave similarly to bitmap
> index scans at runtime, for queries that really don't have much to
> gain from using a bitmap heap scan (queries that may or may not also
> benefit from prefetching). There are several reasons why this makes
> sense to me.
> 
> One reason is that it makes tweaking the actual costing easier later
> on. Also, your point about plan robustness was a good one. If we make
> the wrong choice about index scan vs bitmap index scan, and the
> consequences aren't so bad, that's a very useful enhancement in
> itself.
> 
> The most important reason of all may just be to build confidence in
> the design. I'm interested in understanding when and how prefetching
> stops helping.
> 

Agreed.

>> I'm all for building a more comprehensive set of test cases - the stuff
>> presented at pgcon was good for demonstration, but it certainly is not
>> enough for testing. The SAOP queries are a great addition, I also plan
>> to run those queries on different (less random) data sets, etc. We'll
>> probably discover more interesting cases as the patch improves.
> 
> Definitely.
> 
>> There are two aspects why I think AM is not the right place:
>>
>> - accessing table from index code seems backwards
>>
>> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>>
>> It feels kinda wrong in hindsight.
> 
> I'm willing to accept that we should do it the way you've done it in
> the patch provisionally. It's complicated enough that it feels like I
> should reserve the right to change my mind.
> 
>>>> I think this is an acceptable limitation, certainly for v0. Prefetching
>>>> across multiple leaf pages seems way more complex (particularly for the
>>>> cases using pairing heap), so let's leave this for the future.
> 
>> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
>> to do that. But it seems like work for future someone.
> 
> Right. You probably noticed that this is another case where we'd be
> making index scans behave more like bitmap index scans (perhaps even
> including the downsides for kill_prior_tuple that accompany not
> processing each leaf page inline). There is probably a point where
> that ceases to be sensible, but I don't know what that point is.
> They're way more similar than we seem to imagine.
> 

OK. Thanks for all the comments.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Peter Geoghegan

Date:

09 June 2023, 18:23:56

On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > What the exact historical timeline is may not be that important. My
> > emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> > compelling case for both parallel index scan and prefetching, in
> > general. There are many queries that have huge in() lists that
> > naturally benefit a great deal from prefetching. Plus they're common.
> >
>
> Did you mean parallel index scan or bitmap index scan?

I meant parallel index scan (also parallel bitmap index scan). Note
that nbtree parallel index scans have special ScalarArrayOpExpr
handling code.

ScalarArrayOpExpr is kind of special -- it is simultaneously one big
index scan (to the executor), and lots of small index scans (to
nbtree). Unlike the queries that you've looked at so far, which really
only have one plausible behavior at execution time, there are many
ways that ScalarArrayOpExpr index scans can be executed at runtime --
some much faster than others. The nbtree implementation can in
principle reorder how it processes ranges from the key space (i.e.
each range of array elements) with significant flexibility.

> I think I understand, although maybe my mental model is wrong. I agree
> it seems inefficient, but I'm not sure why it would make prefetching
> hopeless. Sure, it puts index scans at a disadvantage (compared to
> bitmap scans), but if we pick an index scan it should still be an
> improvement, right?

Hopeless might have been too strong of a word. More like it'd fall far
short of what is possible to do with a ScalarArrayOpExpr with a given
high end server.

The quality of the implementation (including prefetching) could make a
huge difference to how well we make use of the available hardware
resources. A really high quality implementation of ScalarArrayOpExpr +
prefetching can keep the system busy with useful work, which is less
true with other types of queries, which have inherently less
predictable I/O (and often have less I/O overall). What could be more
amenable to predicting I/O patterns than a query with a large IN()
list, with many constants that can be processed in whatever order
makes sense at runtime?

What I'd like to do with ScalarArrayOpExpr is to teach nbtree to
coalesce together those "small index scans" into "medium index scans"
dynamically, where that makes sense. That's the main part that's
missing right now. Dynamic behavior matters a lot with
ScalarArrayOpExpr stuff -- that's where the challenge lies, but also
where the opportunities are. Prefetching builds on all that.

> I guess I need to do some testing on a range of data sets / queries, and
> see how it works in practice.

If I can figure out a way of getting ScalarArrayOpExpr to visit each
leaf page exactly once, that might be enough to make things work
really well most of the time. Maybe it won't even be necessary to
coordinate very much, in the end. Unsure.
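
As a rough illustration of the "visit each leaf page exactly once" goal, here is a toy Python model, not a description of how nbtree works internally. It assumes each leaf page covers a contiguous key range described by an inclusive upper bound, that every constant falls within the last bound, and it compares one descent per array element with a coalesced pass that groups the constants by leaf page. The names `leaf_for`, `per_element_visits` and `coalesced_visits` are hypothetical.

```python
from bisect import bisect_left
from itertools import groupby


def leaf_for(key, high_keys):
    """Index of the leaf page whose key range contains `key`.

    Simplification: high_keys[i] is the inclusive upper bound of leaf i.
    """
    return bisect_left(high_keys, key)


def per_element_visits(constants, high_keys):
    """One leaf page visit per array element, in the order given."""
    return [leaf_for(c, high_keys) for c in constants]


def coalesced_visits(constants, high_keys):
    """Sort the constants and handle all those on the same leaf together."""
    ordered = sorted(set(constants))
    return [
        (leaf, list(group))
        for leaf, group in groupby(ordered, key=lambda c: leaf_for(c, high_keys))
    ]


high_keys = [100, 200, 300, 400]        # four hypothetical leaf pages
constants = [5, 150, 17, 160, 42, 390]  # hypothetical IN() list

print(per_element_visits(constants, high_keys))  # [0, 1, 0, 1, 0, 3]
print(coalesced_visits(constants, high_keys))    # [(0, [5, 17, 42]), (1, [150, 160]), (3, [390])]
```

In the coalesced form each leaf page shows up once, together with the full set of constants to look for on it, which is also the context a per-leaf-page prefetcher would want.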

I've already done a lot of work that tries to minimize the chances of
regular (non-ScalarArrayOpExpr) queries accessing more than a single
leaf page, which will help your strategy of just prefetching items
from a single leaf page at a time -- that will get you pretty far
already. Consider the example of the tenk2_hundred index from the
bt_page_items documentation. You'll notice that the high key for the
page shown in the docs (and every other page in the same index) nicely
makes the leaf page boundaries "aligned" with natural keyspace
boundaries, due to suffix truncation. That helps index scans to access
no more than a single leaf page when accessing any one distinct
"hundred" value.

We are careful to do the right thing with the "boundary cases" when we
descend the tree, too. This _bt_search behavior builds on the way that
suffix truncation influences the on-disk structure of indexes. Queries
such as "select * from tenk2 where hundred = ?" will each return 100
rows spread across almost as many heap pages. That's a fairly large
number of rows/heap pages, but we still only need to access one leaf
page for every possible constant value (every "hundred" value that
might be specified as the ? in my point query example). It doesn't
matter if it's the leftmost or rightmost item on a leaf page -- we
always descend to exactly the correct leaf page directly, and we
always terminate the scan without having to move to the right sibling
page (we check the high key before going to the right page in some
cases, per the optimization added by commit 29b64d1d).

The same kind of behavior is also seen with the TPC-C line items
primary key index, which is a composite index. We want to access the
items from a whole order in one go, from one leaf page -- and we
reliably do the right thing there too (though with some caveats about
CREATE INDEX). We should never have to access more than one leaf page
to read a single order's line items. This matters because it's quite
natural to want to access whole orders with that particular
table/workload (it's also unnatural to only access one single item
from any given order).

Obviously there are many queries that need to access two or more leaf
pages, because that's just what needs to happen. My point is that we
*should* only do that when it's truly necessary on modern Postgres
versions, since the boundaries between pages are "aligned" with the
"natural boundaries" from the keyspace/application. Maybe your testing
should verify that this effect is actually present, though. It would
be a shame if we sometimes messed up prefetching that could have
worked well due to some issue with how page splits divide up items.

CREATE INDEX is much less smart about suffix truncation -- it isn't
capable of the same kind of tricks as nbtsplitloc.c, even though it
could be taught to do roughly the same thing. Hopefully this won't be
an issue for your work. The tenk2 case still works as expected with
CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the
TPC-C line items PK will leave the index with some "orders" (or
whatever the natural grouping of things is) that span more than a
single leaf page, which is undesirable, and might hinder your
prefetching work. I wouldn't mind fixing that if it turned out to hurt
your leaf-page-at-a-time prefetching patch. Something to consider.

We can fit at most 17 TPC-C orders on each order line PK leaf page.
Could be as few as 15. If we do the wrong thing with prefetching for 2
out of every 15 orders then that's a real problem, but is still subtle enough
to easily miss with conventional benchmarking. I've had a lot of success
with paying close attention to all the little boundary cases, which is why
I'm kind of zealous about it now.

> > I wonder if you should go further than this, by actually sorting the
> > items that you need to fetch as part of processing a given leaf page
> > (I said this at the unconference, you may recall). Why should we
> > *ever* pin/access the same heap page more than once per leaf page
> > processed per index scan? Nothing stops us from returning the tuples
> > to the executor in the original logical/index-wise order, despite
> > having actually accessed each leaf page's pointed-to heap pages
> > slightly out of order (with the aim of avoiding extra pin/unpin
> > traffic that isn't truly necessary). We can sort the heap TIDs in
> > scratch memory, then do our actual prefetching + heap access, and then
> > restore the original order before returning anything.
> >
>
> I think that's possible, and I thought about that a bit (not just for
> btree, but especially for the distance queries on GiST). But I don't
> have a good idea if this would be 1% or 50% improvement, and I was
> concerned it might easily lead to regressions if we don't actually need
> all the tuples.

I get that it could be invasive. I have the sense that just pinning
the same heap page more than once in very close succession is just the
wrong thing to do, with or without prefetching.

> I mean, imagine we have TIDs
>
>     [T1, T2, T3, T4, T5, T6]
>
> Maybe T1, T5, T6 are from the same page, so per your proposal we might
> reorder and prefetch them in this order:
>
>     [T1, T5, T6, T2, T3, T4]
>
> But maybe we only need [T1, T2] because of a LIMIT, and the extra work
> we did on processing T5, T6 is wasted.

Yeah, that's possible. But isn't that par for the course? Any
optimization that involves speculation (including all prefetching)
comes with similar risks. They can be managed.

I don't think that we'd literally order by TID...we wouldn't change
the order that each heap page was *initially* pinned. We'd just
reorder the tuples minimally using an approach that is sufficient to
avoid repeated pinning of heap pages during processing of any one leaf
page's heap TIDs. ISTM that the risk of wasting work is limited to
wasting cycles on processing extra tuples from a heap page that we
definitely had to process at least one tuple from already. That
doesn't seem particularly risky, as speculative optimizations go. The
downside is bounded and well understood, while the upside could be
significant.
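
A minimal sketch of that idea in Python, under stated assumptions: TIDs are modelled as (block, offset) pairs taken from one leaf page, "fetching" a tuple is just copying the pair, and `fetch_leaf_page` is a hypothetical name rather than anything from the patch. Heap pages are pinned in the order they are first seen, each one exactly once, and the results are handed back in the original index order.

```python
def fetch_leaf_page(tids):
    """Fetch all tuples for one leaf page, pinning each heap page once.

    tids: list of (block, offset) pairs in index order.
    Returns (results in index order, number of heap page pins).
    """
    # Group by heap block; dict insertion order keeps the blocks in the
    # order in which the index scan would first have pinned them.
    by_block = {}
    for position, (block, offset) in enumerate(tids):
        by_block.setdefault(block, []).append((position, offset))

    results = [None] * len(tids)
    pins = 0
    for block, items in by_block.items():
        pins += 1  # stand-in for a single pin/unpin of this heap page
        for position, offset in items:
            # Stand-in for fetching the tuple; store it at its index position.
            results[position] = (block, offset)
    return results, pins


# Hypothetical leaf page: almost clustered, with blocks slightly interleaved.
tids = [(10, 1), (11, 1), (10, 2), (12, 1), (11, 2), (10, 3)]
results, pins = fetch_leaf_page(tids)
print(results == tids, pins)  # True 3
```

Visiting the same list strictly in index order would switch heap pages six times, so the toy model shows the saving, but not the cost of holding back tuples that a LIMIT might never ask for.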

I really don't have that much confidence in any of this just yet. I'm
not trying to make this project more difficult. I just can't help but
notice that the order that index scans end up pinning heap pages
already has significant problems, and is sensitive to things like
small amounts of heap fragmentation -- maybe that's not a great basis
for prefetching. I *really* hate any kind of sharp discontinuity,
where a minor change in an input (e.g., from minor amounts of heap
fragmentation) has outsized impact on an output (e.g., buffers
pinned). Interactions like that tend to be really pernicious -- they
lead to bad performance that goes unnoticed and unfixed because the
problem effectively camouflages itself. It may even be easier to make
the conservative (perhaps paranoid) assumption that weird nasty
interactions will cause harm somewhere down the line...why take a
chance?

I might end up prototyping this myself. I may have to put my money
where my mouth is.  :-)

--
Peter Geoghegan

Re: index prefetching

From

Gregory Smith

Date:

09 June 2023, 21:19:47

On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans

At the point Greg Stark was hacking on this, the underlying OS async I/O features were tricky to fit into PG's I/O model, and both of us did much review work just to find working common ground that PG could plug into. Linux POSIX advisories were completely different from Solaris's async model, the other OS used for validation that the feature worked, with the hope being that designing against two APIs would be better than just focusing on Linux. Since that foundation was all so brittle and limited, scope was limited to just the heap scan, since it seemed to have the best return on time invested given the parts of async I/O that did and didn't scale as expected.

As I remember it, the idea was to get the basic feature out the door and gather feedback about things like whether the effective_io_concurrency knob worked as expected before moving onto other prefetching. Then that got lost in filesystem upheaval land, with so much drama around Solaris/ZFS and Oracle's btrfs work. I think it's just that no one ever got back to it.

I have all the workloads that I use for testing automated into pgbench-tools now, and this change would be easy to fit into testing on them as I'm very heavy on block I/O tests. To get PG to reach full read speed on newer storage I've had to do some strange tests, like doing index range scans that touch 25+ pages. Here's that one as a pgbench script:

\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid LIMIT :range;

And then you use '-Dmultiplier=10' or such to crank it up. Database 4X RAM, multiplier=25 with 16 clients is my starting point on it when I want to saturate storage. Anything that lets me bring those numbers down would be valuable.
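
To make the variable arithmetic concrete, here is a small Python sketch of what the \set lines compute. The helper name range_scan_params and the example scale of 1000 are illustrative assumptions, not part of pgbench-tools; the 100000 rows per scale unit is how pgbench sizes pgbench_accounts.

```python
def range_scan_params(multiplier, scale):
    """Mirror the \\set arithmetic of the pgbench script above."""
    # Number of rows each query fetches (the LIMIT).
    rows = 67 * (multiplier + 1)
    # Highest starting aid that still leaves a full range before the table ends.
    max_start = 100000 * scale - rows
    return rows, max_start


# Hypothetical run: multiplier=25 on a scale-1000 database.
rows, max_start = range_scan_params(25, 1000)
print(rows, max_start)
```

So with multiplier=25 every query asks for 1742 consecutive accounts, starting at a random aid.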

--
Greg Smith greg.smith@crunchydata.com
Director of Open Source Strategy

Re: index prefetching

From

Andres Freund

Date:

10 June 2023, 20:34:56

Hi,

On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
> > 
> >> 2) prefetching from executor
> >>
> >> Another question is whether the prefetching shouldn't actually happen
> >> even higher - in the executor. That's what Andres suggested during the
> >> unconference, and it kinda makes sense. That's where we do prefetching
> >> for bitmap heap scans, so why should this happen lower, right?
> > 
> > Yea. I think it also provides potential for further optimizations in the
> > future to do it at that layer.
> > 
> > One thing I have been wondering around this is whether we should not have
> > split the code for IOS and plain indexscans...
> > 
> 
> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
> did you mean something else?

Yes, I meant that.

> >> 4) per-leaf prefetching
> >>
> >> The code only prefetches items from one leaf page. If the
> >> index scan needs to scan multiple (many) leaf pages, we have to process
> >> the first leaf page first before reading / prefetching the next one.
> >>
> >> I think this is acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
> > 
> > Hm. I think that really depends on the shape of the API we end up with. If we
> > move the responsibility more towards the executor, I think it very well
> > could end up being just as simple to prefetch across index pages.
> > 
> 
> Maybe. I'm open to that idea if you have idea how to shape the API to
> make this possible (although perhaps not in v0).

I'll try to have a look.


> > I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> > to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
> > Unless I missed what "xeon / cached (speedup)" indicates?
> > 
> 
> I forgot to explain what "cached" means in the TPC-H case. It means
> second execution of the query, so you can imagine it like this:
> 
> for q in `seq 1 22`; do
> 
>    1. drop caches and restart postgres

Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.


>    2. run query $q -> uncached
> 
>    3. run query $q -> cached
> 
> done
> 
> So the second execution has a chance of having data in memory - but
> maybe not all, because this is a 100GB data set (so ~200GB after
> loading), but the machine only has 64GB of RAM.
> 
> I think a likely explanation is some of the data wasn't actually in
> memory, so prefetching still did something.

Ah, ok.


> > I think it'd be good to run a performance comparison of the unpatched vs
> > patched cases, with prefetching disabled for both. It's possible that
> > something in the patch caused unintended changes (say spilling during a
> > hashagg, due to larger struct sizes).
> > 
> 
> That's certainly a good idea. I'll do that in the next round of tests. I
> also plan to do a test on data set that fits into RAM, to test "properly
> cached" case.

Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).

Greetings,

Andres Freund

Re: index prefetching

From

Tomas Vondra

Date:

10 June 2023, 21:10:59


On 6/10/23 22:34, Andres Freund wrote:
> Hi,
> 
> On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
>>>
>>>> 2) prefetching from executor
>>>>
>>>> Another question is whether the prefetching shouldn't actually happen
>>>> even higher - in the executor. That's what Andres suggested during the
>>>> unconference, and it kinda makes sense. That's where we do prefetching
>>>> for bitmap heap scans, so why should this happen lower, right?
>>>
>>> Yea. I think it also provides potential for further optimizations in the
>>> future to do it at that layer.
>>>
>>> One thing I have been wondering around this is whether we should not have
>>> split the code for IOS and plain indexscans...
>>>
>>
>> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
>> did you mean something else?
> 
> Yes, I meant that.
> 

Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.

>>>> 4) per-leaf prefetching
>>>>
>>>> The code only prefetches items from one leaf page. If the
>>>> index scan needs to scan multiple (many) leaf pages, we have to process
>>>> the first leaf page first before reading / prefetching the next one.
>>>>
>>>> I think this is acceptable limitation, certainly for v0. Prefetching
>>>> across multiple leaf pages seems way more complex (particularly for the
>>>> cases using pairing heap), so let's leave this for the future.
>>>
>>> Hm. I think that really depends on the shape of the API we end up with. If we
>>> move the responsibility more towards the executor, I think it very well
>>> could end up being just as simple to prefetch across index pages.
>>>
>>
>> Maybe. I'm open to that idea if you have idea how to shape the API to
>> make this possible (although perhaps not in v0).
> 
> I'll try to have a look.
> 
> 
>>> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
>>> to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
>>> Unless I missed what "xeon / cached (speedup)" indicates?
>>>
>>
>> I forgot to explain what "cached" means in the TPC-H case. It means
>> second execution of the query, so you can imagine it like this:
>>
>> for q in `seq 1 22`; do
>>
>>    1. drop caches and restart postgres
> 
> Are you doing it in that order? If so, the pagecache can end up being seeded
> by postgres writing out dirty buffers.
> 

Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.

> 
>>    2. run query $q -> uncached
>>
>>    3. run query $q -> cached
>>
>> done
>>
>> So the second execution has a chance of having data in memory - but
>> maybe not all, because this is a 100GB data set (so ~200GB after
>> loading), but the machine only has 64GB of RAM.
>>
>> I think a likely explanation is some of the data wasn't actually in
>> memory, so prefetching still did something.
> 
> Ah, ok.
> 
> 
>>> I think it'd be good to run a performance comparison of the unpatched vs
>>> patched cases, with prefetching disabled for both. It's possible that
>>> something in the patch caused unintended changes (say spilling during a
>>> hashagg, due to larger struct sizes).
>>>
>>
>> That's certainly a good idea. I'll do that in the next round of tests. I
>> also plan to do a test on data set that fits into RAM, to test "properly
>> cached" case.
> 
> Cool. It'd be good to measure both the case of all data already being in s_b
> (to see the overhead of the buffer mapping lookups) and the case where the
> data is in the kernel pagecache (to see the overhead of pointless
> posix_fadvise calls).
> 

OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Tomasz Rybak

Date:

12 June 2023, 21:27:04

On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
> Hi,
>
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
>

I added entry to
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.

--
Tomasz Rybak, Debian Developer <serpent@debian.org>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C

Re: index prefetching

From

Dilip Kumar

Date:

13 June 2023, 04:26:46

On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

One of the reasons IMHO is that in the bitmap scan before starting the
heap fetch TIDs are already sorted in heap block order.  So it is
quite obvious that once we prefetch a heap block most of the
subsequent TIDs will fall on that block i.e. each prefetch will
satisfy many immediate requests.  OTOH, in the index scan the I/O
request is very random so we might have to prefetch many blocks even
for satisfying the request for TIDs falling on one index page.  I
agree that prefetching with an index scan will definitely help in
reducing the random I/O, but my guess is that prefetching with a
bitmap scan appeared more natural, and that would have been one of
the reasons for implementing this only for a bitmap scan.
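
A rough way to see the difference is to count how often consecutive TIDs land on a different heap block, once in index order and once sorted by block as a bitmap scan would process them. This is only an illustrative Python sketch; the function name and the sample TIDs are made up.

```python
def block_switches(tids):
    """Count how many times consecutive TIDs land on a different heap block."""
    switches = 0
    previous = None
    for block, _offset in tids:
        if block != previous:
            switches += 1
        previous = block
    return switches


# Hypothetical TIDs as (heap block, offset), in the order an index scan returns them.
index_order = [(3, 1), (1, 1), (3, 2), (2, 1), (1, 2), (3, 3)]
# A bitmap scan visits the same TIDs in heap block order.
bitmap_order = sorted(index_order)

print(block_switches(index_order), block_switches(bitmap_order))
```

In this made-up sample the index order changes block on every TID, while the sorted order touches each of the three blocks once.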

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: index prefetching

From

Tomas Vondra

Date:

19 June 2023, 19:27:46

Hi,

I have results from the new extended round of prefetch tests. I've
pushed everything to

https://github.com/tvondra/index-prefetch-tests-2

There are scripts I used to run this (run-*.sh), raw results and various
kinds of processed summaries (pdf, ods, ...) that I'll mention later.

As before, this tests a number of query types:

- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))

It's probably futile to go through details of all the tests - it's
easier to go through the (hopefully fairly readable) shell scripts.

But in principle, it runs some simple queries while varying both the data
set and workload:

- data set may be random, sequential or cyclic (with different length)

- the number of matches per value differs (i.e. equality condition may
match 1, 10, 100, ..., 100k rows)

- forces a particular scan type (indexscan, bitmapscan, seqscan)

- each query is executed twice - first run (right after restarting DB
and dropping caches) is uncached, second run should have data cached

- the query is executed 5x with different parameters (so 10x in total)

This is tested with three basic data sizes - fits into shared buffers,
fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and
20GB (i5) / 40GB (xeon).

Note: xeon has 64GB RAM, so technically the largest scale fits into RAM.
But that should not matter, thanks to drop-caches and restart.

I also attempted to pin the backend to a particular core, in an effort to
eliminate scheduling-related noise. It's mostly what taskset does, but I
did that from an extension (https://github.com/tvondra/taskset) which
allows me to do that as part of the SQL script.

For the results, I'll talk about the v1 patch (as submitted here) first.
I'll use the PDF results in the "pdf" directory which generally show a
pivot table by different test parameters, comparing the results by
different parameters (prefetching on/off, master/patched).

Feel free to do your own analysis from the raw CSV data, ofc.

For example, this:

https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf

shows how the prefetching affects timing for point queries with
different numbers of matches (1 to 100k). The numbers are timings for
master and patched build. The last group is (patched/master), so the
lower the number the better - 50% means patch makes the query 2x faster.
There's also a heatmap, with green=good, red=bad, which makes it easier
to spot cases that got slower/faster.

The really interesting stuff starts on page 7 (in this PDF), because the
first couple pages are "cached" (so it's more about measuring overhead
when prefetching has no benefit).

Right on page 7 you can see a couple cases with a mix of slower/faster
cases, roughly in the +/- 30% range. However, this is unrelated to
the patch because those are results for bitmapheapscan.

For indexscans (page 8), the results are invariably improved - the more
matches the better (up to ~10x faster for 100k matches).

Those were results for the "cyclic" data set. For random data set (pages
9-11) the results are pretty similar, but for "sequential" data (11-13)
the prefetching is actually harmful - there are red clusters, with up to
500% slowdowns.

I'm not going to explain the summary for SAOP queries
(https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf),
the story is roughly the same, except that there are more tested query
combinations (because we also vary the pattern in the IN() list - number
of values etc.).

So, the conclusion from this is - generally very good results for random
and cyclic data sets, but pretty bad results for sequential. But even
for the random/cyclic cases there are combinations (especially with many
matches) where prefetching doesn't help or even hurts.

The only way to deal with this is (I think) a cheap way to identify and
skip inefficient prefetches, essentially by doing two things:

a) remembering more recently prefetched blocks (say, 1000+) and not
prefetching them over and over

b) the ability to identify a sequential pattern, when readahead seems to
do a pretty good job already (although I heard some disagreement)

I've been thinking about how to do this - doing (a) seems pretty hard,
because on the one hand we want to remember a fair number of blocks and
we want the check "did we prefetch X" to be very cheap. So a hash table
seems nice. OTOH we want to expire "old" blocks and only keep the most
recent ones, and hash table doesn't really support that.

Perhaps there is a great data structure for this, not sure. But after
thinking about this I realized we don't need perfect accuracy - it's
fine to have false positives/negatives - it's fine to forget we already
prefetched block X and prefetch it again. It's not a matter of
correctness, just a matter of efficiency - after all, we can't know if
it's still in memory, we only know if we prefetched it fairly recently.

This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU
cache that's small enough to be searched linearly (say, 8 blocks). And
we have many of them (e.g. 128), so that in total we can remember 1024
block numbers. Now, every block number is mapped to a single LRU by
hashing, as if we had a hash table

index = hash(blockno) % 128

and we only use that one LRU to track this block. It's tiny so we can
search it linearly.

To expire prefetched blocks, there's a counter incremented every time we
prefetch a block, and we store it in the LRU with the block number. When
checking the LRU we ignore old entries (with counter more than 1000
values back), and we also evict/replace the oldest entry if needed.
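
A minimal Python sketch of this "hash table of LRU caches" idea might look as follows. It is not the code from the patch: the class and method names are hypothetical, Python's built-in hash() stands in for the real hash function, and bumping the counter on every lookup is an assumption made to keep the example short.

```python
LRU_COUNT = 128      # number of tiny LRUs
LRU_SIZE = 8         # entries per LRU, small enough to search linearly
EXPIRE_AFTER = 1000  # entries older than this many requests are ignored


class PrefetchCache:
    """Remembers roughly the last LRU_COUNT * LRU_SIZE prefetched blocks."""

    def __init__(self):
        # Each entry is [block, stamp]; empty entries carry stamp -1 so they
        # are always picked first when we need a slot.
        self.lrus = [[[None, -1] for _ in range(LRU_SIZE)]
                     for _ in range(LRU_COUNT)]
        self.counter = 0

    def recently_prefetched(self, block):
        """Return True if block was seen recently; otherwise remember it."""
        self.counter += 1
        lru = self.lrus[hash(block) % LRU_COUNT]
        for entry in lru:
            if entry[0] == block:
                fresh = (self.counter - entry[1]) <= EXPIRE_AFTER
                entry[1] = self.counter
                return fresh
        # Not found: replace the oldest (or an empty) entry in this LRU.
        victim = min(lru, key=lambda e: e[1])
        victim[0], victim[1] = block, self.counter
        return False


cache = PrefetchCache()
for block in [42, 42, 43]:
    if not cache.recently_prefetched(block):
        print("would prefetch block", block)
```

A caller would issue the actual prefetch only when recently_prefetched() returns False, accepting that an evicted or expired block gets prefetched again.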

This seems to work pretty well for the first requirement, but it doesn't
allow identifying the sequential pattern cheaply. To do that, I added a
tiny queue with a couple entries that can be checked to see if the last
couple entries are sequential.
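
The sequential check could be sketched like this in Python. The history length of 4 and the strict "each block is exactly the previous one plus 1" rule are assumptions for illustration; the thread does not say what the patch actually uses.

```python
from collections import deque


class SequentialDetector:
    """Tiny queue of recent block numbers used to spot sequential access."""

    def __init__(self, history=4):
        self.recent = deque(maxlen=history)

    def is_sequential(self, block):
        """Record block and report whether the recent blocks are consecutive."""
        self.recent.append(block)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet
        blocks = list(self.recent)
        return all(b - a == 1 for a, b in zip(blocks, blocks[1:]))


detector = SequentialDetector(history=4)
for block in [10, 11, 12, 13, 14, 50, 51]:
    print(block, detector.is_sequential(block))
```

When the detector reports a sequential run, the prefetch can be skipped and left to kernel readahead.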

And this is what the attached 0002+0003 patches do. There are PDFs with
results for this build prefixed with "patch-v3" and the results are
pretty good - the regressions are largely gone.

It's even clearer in the PDFs comparing the impact of the two patches:

https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf

https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf

Which simply shows the "speedup heatmap" for the two patches, and the
"v3" heatmap has far fewer red regression clusters.

Note: The comparison-point.pdf summary has another group of columns
illustrating if this scan type would be actually used, with "green"
meaning "yes". This provides additional context, because e.g. for the
"noisy bitmapscans" it's all white, i.e. without setting the GUCs the
optimizer would pick something else (hence it's a non-issue).

Let me know if the results are not clear enough (I tried to cover the
important stuff, but I'm sure there's a lot of details I didn't cover),
or if you think some other summary would be better.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: index prefetching

From

Tomas Vondra

Date:

30 June 2023, 11:38:06

Hi,

attached is a v4 of the patch, with a fairly major shift in the approach.

Until now the patch very much relied on the AM to provide information
about which blocks to prefetch next (based on the current leaf index page).
This seemed like a natural approach when I started working on the PoC,
but over time I ran into various drawbacks:

* a lot of the logic is at the AM level

* can't prefetch across the index page boundary (have to wait until the
  next index leaf page is read by the indexscan)

* doesn't work for distance searches (gist/spgist)

After thinking about this, I decided to ditch this whole idea of
exchanging prefetch information through an API, and do the prefetching
almost entirely in the indexam code.

The new patch maintains a queue of TIDs (read from index_getnext_tid),
with up to effective_io_concurrency entries - calling getnext_slot()
adds a TID at the queue tail, issues a prefetch for the block, and then
returns TID from the queue head.
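
As a rough illustration of that queue (not the patch itself), here is a Python sketch. TIDs are modeled as (block, offset) tuples, the prefetch function is a caller-supplied stand-in, and the queue is filled eagerly up to its size, which is a simplification.

```python
from collections import deque


def scan_with_prefetch(tids, prefetch, queue_size):
    """Yield TIDs in index order while keeping up to queue_size blocks prefetched."""
    queue = deque()
    source = iter(tids)
    exhausted = False
    while True:
        # Top up the queue: read a TID, prefetch its block, add it at the tail.
        while not exhausted and len(queue) < queue_size:
            tid = next(source, None)
            if tid is None:
                exhausted = True
                break
            prefetch(tid[0])
            queue.append(tid)
        if not queue:
            return
        # Hand the caller the TID at the head of the queue.
        yield queue.popleft()


issued = []
for tid in scan_with_prefetch([(5, 1), (9, 2), (5, 3), (2, 1)],
                              issued.append, queue_size=2):
    print("fetch", tid, "prefetched so far:", issued)
```

The queue size plays the role of effective_io_concurrency: it bounds how far ahead of the caller the prefetches run.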

Maintaining the queue is up to index_getnext_slot() - it can't be done
in index_getnext_tid(), because then it'd affect IOS (and prefetching
heap would mostly defeat the whole point of IOS). And we can't do that
above index_getnext_slot() because that already fetched the heap page.

I still think prefetching for IOS is doable (and desirable), in mostly
the same way - except that we'd need to maintain the queue from some
other place, as IOS doesn't do index_getnext_slot().

FWIW there's also the "index-only filters without IOS" patch [1] which
switches even regular index scans to index_getnext_tid(), so maybe
relying on index_getnext_slot() is a lost cause anyway.

Anyway, this has the nice consequence that it makes AM code entirely
oblivious of prefetching - there's no need for an API, we just get TIDs as
before, and the prefetching magic happens after that. Thus it also works
for searches ordered by distance (gist/spgist). The patch got much
smaller (about 40kB, down from 80kB), which is nice.

I ran the benchmarks [2] with this v4 patch, and the results for the
"point" queries are almost exactly the same as for v3. The SAOP part is
still running - I'll add those results in a day or two, but I expect
similar outcome as for point queries.


regards


[1] https://commitfest.postgresql.org/43/4352/

[2] https://github.com/tvondra/index-prefetch-tests-2/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

index-prefetch-v4.patch

Re: index prefetching

From

Tomas Vondra

Date:

14 July 2023, 20:31:57

Here's a v5 of the patch, rebased to current master and fixing a couple
compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug
messages). No other changes compared to v4.

cfbot also reported a failure on windows in pg_dump [1], but it seems
pretty strange:

[11:42:48.708] ------------------------------------- 8<
-------------------------------------
[11:42:48.708] stderr:
[11:42:48.708] #   Failed test 'connecting to an invalid database: matches'

The patch does nothing related to pg_dump, and the test works perfectly
fine for me (I don't have a windows machine, but 32-bit and 64-bit linux
works fine for me).


regards


[1] https://cirrus-ci.com/task/6398095366291456

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

index-prefetch-v5.patch

Re: index prefetching

From

Tomas Vondra

Date:

16 October 2023, 15:34:44

Hi,

Attached is a v6 of the patch, which rebases v5 (just some minor
bitrot), and also does a couple changes which I kept in separate patches
to make it obvious what changed.

0001-v5-20231016.patch
----------------------

Rebase to current master.

0002-comments-and-minor-cleanup-20231012.patch
----------------------------------------------

Various comment improvements (remove obsolete ones, clarify a bunch of
other comments, etc.). I tried to explain the reasoning why some places
disable prefetching (e.g. in catalogs, replication, ...), explain how
the caching / LRU works etc.

0003-remove-prefetch_reset-20231016.patch
-----------------------------------------

I decided to remove the separate prefetch_reset parameter, so that all
the index_beginscan() methods only take a parameter specifying the
maximum prefetch target. The reset was added early when the prefetch
happened much lower in the AM code, at the index page level, and the
reset was when moving to the next index page. But now after the prefetch
moved to the executor, this doesn't make much sense - the resets happen
on rescans, and it seems right to just reset to 0 (just like for bitmap
heap scans).

0004-PoC-prefetch-for-IOS-20231016.patch
----------------------------------------

This is a PoC adding the prefetch to index-only scans too. At first that
may seem rather strange, considering eliminating the heap fetches is the
whole point of IOS. But if the pages are not marked as all-visible (say,
the most recent part of the table), we may still have to fetch them. In
which case it'd be easy to see cases where IOS is slower than a regular
index scan (with prefetching).

The code is quite rough. It adds a separate index_getnext_tid_prefetch()
function, adding prefetching on top of index_getnext_tid(). I'm not sure
it's the right pattern, but it's pretty much what index_getnext_slot()
does too, except that it also does the fetch + store to the slot.

Note: There's a second patch adding index-only filters, which requires
switching the regular index scans from index_getnext_slot() to
index_getnext_tid() too.

The prefetching then happens only after checking the visibility map (if
requested). This part definitely needs improvements - for example
there's no attempt to reuse the VM buffer, which I guess might be expensive.

index-prefetch.pdf
------------------

Attached is also a PDF with results of the same benchmark I did before,
comparing master vs. patched with various data patterns and scan types.
It's not 100% comparable to earlier results as I only ran it on a
laptop, and it's a bit noisier too. The overall behavior and conclusions
are however the same.

I was specifically interested in the IOS behavior, so I added two more
cases to test - indexonlyscan and indexonlyscan-clean. The first is the
worst-case scenario, with no pages marked as all-visible in VM (the test
simply deletes the VM), while indexonlyscan-clean is the good-case (no
heap fetches needed).

The results mostly match the expected behavior, particularly for the
uncached runs (when the data is expected to not be in memory):

* indexonlyscan (i.e. bad case) - About the same results as
"indexscans", with the same speedups etc. Which is a good thing
(i.e. IOS is not unexpectedly slower than regular indexscans).

Hi,

Here's a somewhat reworked version of the patch. My initial goal was to
see if it could adopt the StreamingRead API proposed in [1], but that
turned out to be less straight-forward than I hoped, for two reasons:

(1) The StreamingRead API seems to be designed for pages, but the index
code naturally works with TIDs/tuples. Yes, the callbacks can associate
the blocks with custom data (in this case that'd be the TID), but it
seemed a bit strange ...

(2) The place adding requests to the StreamingRead queue is pretty far
from the place actually reading the pages - for prefetching, the
requests would be generated in nodeIndexscan, but the page reading
happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple.
Sure, the TIDs would come from a callback, so it's a bit as if the
requests were generated in heapam_index_fetch_tuple - but it has no idea
StreamingRead exists, so where would it get it.

We might teach it about it, but what if there are multiple places
calling index_fetch_heap()? Not all of which may be using StreamingRead
(only indexscans would do that). Or if there are multiple index scans,
there'd need to be separate StreamingRead queues, right?

In any case, I felt a bit out of my depth here, and I chose not to do
all this work without discussing the direction here. (Also, see the
point about cursors and xs_heap_continue a bit later in this post.)

I did however like the general StreamingRead API - how it splits the
work between the API and the callback. The patch used to do everything,
which meant it hardcoded a lot of the IOS-specific logic etc. I did plan
to have some sort of "callback" for reading from the queue, but that
didn't quite solve this issue - a lot of the stuff remained hard-coded.
But the StreamingRead API made me realize that having a callback for the
first phase (that adds requests to the queue) would fix that.

So I did that - there's now one simple callback for index scans, and
a bit more complex callback for index-only scans. Thanks to this the
hard-coded stuff mostly disappears, which is good.
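
To illustrate the split between generic queueing and scan-specific callbacks, here is a hypothetical Python sketch. None of these names come from the patch; the visibility map is modeled as a plain set of all-visible block numbers, and TIDs are (block, offset) tuples.

```python
def plain_scan_callback(tid, state):
    """Plain index scan: every TID needs its heap page, so always prefetch."""
    return True


def index_only_callback(tid, state):
    """Index-only scan: skip heap pages marked all-visible."""
    block, _offset = tid
    return block not in state["all_visible"]


def enqueue(tids, should_prefetch, state, issue_prefetch):
    """First phase: queue every TID, prefetching only when the callback says so."""
    queue = []
    for tid in tids:
        if should_prefetch(tid, state):
            issue_prefetch(tid[0])
        queue.append(tid)
    return queue


tids = [(1, 1), (2, 1), (3, 1)]
issued = []
enqueue(tids, index_only_callback, {"all_visible": {2}}, issued.append)
print(issued)
```

The generic part never needs to know whether it serves a plain or an index-only scan; that knowledge lives entirely in the callback and its state.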

Perhaps a bigger change is that I decided to move this into a separate
API on top of indexam.c. The original idea was to integrate this into
index_getnext_tid/index_getnext_slot, so that all callers benefit from
the prefetching automatically. Which would be nice, but it also meant
it'd need to happen in the indexam.c code, which seemed dirty.

This patch introduces an API similar to StreamingRead. It calls the
indexam.c stuff, but does all the prefetching on top of it, not in it.
If a place calling index_getnext_tid() wants to allow prefetching, it
needs to switch to IndexPrefetchNext(). (There's no function that would
replace index_getnext_slot, at the moment. Maybe there should be.)

Note 1: The IndexPrefetch name is a bit misleading, because it's used
even with prefetching disabled - all index reads from the index scan
happen through it. Maybe it should be called IndexReader or something
like that.

Note 2: I left the code in indexam.c for now, but in principle it could
(should) be moved to a different place.

I think this layering makes sense, and it's probably much closer to what
Andres meant when he said the prefetching should happen in the executor.
Even if the patch ends up using StreamingRead in the future, I guess
we'll want something like IndexPrefetch - it might use the StreamingRead
internally, but it would still need to do some custom stuff to detect
I/O patterns or something that does not quite fit into the StreamingRead.

Now, let's talk about two (mostly unrelated) problems I ran into.

Firstly, I realized there's a bit of a problem with cursors. The
prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff
like that. I'm not sure how difficult it'd be to make this work. I
suppose we could simply discard the prefetched entries and do the right
number of steps back for the index scan. But I haven't tried, and maybe
it's more complex than I'm imagining. Also, if the cursor changes the
direction a lot, it'd make the prefetching harmful.

The patch simply disables prefetching for such queries, using the same
logic that we do for parallelism. This may be over-zealous.

FWIW this is one of the things that probably should remain outside of
StreamingRead API - it seems pretty index-specific, and I'm not sure
we'd even want to support these "backward" movements in the API.

The other issue I'm aware of is handling xs_heap_continue. I believe it
works fine for "false" but I need to take a look at non-MVCC snapshots
(i.e. when xs_heap_continue=true).

I haven't done any benchmarks with this reworked API - there's a couple
more allocations etc. but it did not change in a fundamental way. I
don't expect any major difference.

regards

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: index prefetching

From

Robert Haas

Date:

09 January 2024, 20:31:39

On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Here's a somewhat reworked version of the patch. My initial goal was to
> see if it could adopt the StreamingRead API proposed in [1], but that
> turned out to be less straight-forward than I hoped, for two reasons:

I guess we need Thomas or Andres or maybe Melanie to comment on this.

> Perhaps a bigger change is that I decided to move this into a separate
> API on top of indexam.c. The original idea was to integrate this into
> index_getnext_tid/index_getnext_slot, so that all callers benefit from
> the prefetching automatically. Which would be nice, but it also meant
> it'd need to happen in the indexam.c code, which seemed dirty.

This patch is hard to review right now because there's a bunch of
comment updating that doesn't seem to have been done for the new
design. For instance:

+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use index_getnext_tid().

But not any more.

+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a somewhat wrong). Also, maybe we should consider the filter selectivity

I'm not sure whether all the problems in this area are solved, but I
think you've solved enough of them that this at least needs rewording,
if not removing.

+     * XXX Comment/check seems obsolete.

This occurs in two places. I'm not sure if it's accurate or not.

+     * XXX Could this be an issue for the prefetching? What if we prefetch something
+     * but the direction changes before we get to the read? If that could happen,
+     * maybe we should discard the prefetched data and go back? But can we even
+     * do that, if we already fetched some TIDs from the index? I don't think
+     * indexorderdir can't change, but es_direction maybe can?

But your email claims that "The patch simply disables prefetching for
such queries, using the same logic that we do for parallelism." FWIW,
I think that's a fine way to handle that case.

+     * XXX Maybe we should enable prefetching, but prefetch only pages that
+     * are not all-visible (but checking that from the index code seems like
+     * a violation of layering etc).

Isn't this fixed now? Note this comment occurs twice.

+     * XXX We need to disable this in some cases (e.g. when using index-only
+     * scans, we don't want to prefetch pages). Or maybe we should prefetch
+     * only pages that are not all-visible, that'd be even better.

Here again.

And now for some comments on other parts of the patch, mostly other
XXX comments:

+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use index_getnext_tid().

There's probably no reason to write XXX here. The comment is fine.

+     * XXX Notice we haven't added the block to the block queue yet, and there
+     * is a preceding block (i.e. blockIndex-1 is valid).

Same here, possibly? If this XXX indicates a defect in the code, I
don't know what the defect is, so I guess it needs to be more clear.
If it is just explaining the code, then there's no reason for the
comment to say XXX.

+     * XXX Could it be harmful that we read the queue backwards? Maybe memory
+     * prefetching works better for the forward direction?

It does. But I don't know whether that matters here or not.

+             * XXX We do add the cache size to the request in order not to
+             * have issues with uint64 underflows.

I don't know what this means.

+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.

You've got a bunch of comments about xs_heap_continue here -- and I
don't fully understand what the issues are here with respect to this
particular patch, but I think that the general purpose of
xs_heap_continue is to handle the case where we need to return more
than one tuple from the same HOT chain. With an MVCC snapshot that
doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
As far as possible, the prefetcher shouldn't be involved at all when
xs_heap_continue is set, I believe, because in that case we're just
returning a bunch of tuples from the same page, and the extra fetches
from that heap page shouldn't trigger or require any further
prefetching.

+     * XXX Should this also look at plan.plan_rows and maybe cap the target
+     * to that? Pointless to prefetch more than we expect to use. Or maybe
+     * just reset to that value during prefetching, after reading the next
+     * index page (or rather after rescan)?

It seems questionable to use plan_rows here because (1) I don't think
we have existing cases where we use the estimated row count in the
executor for anything, we just carry it through so EXPLAIN can print
it and (2) row count estimates can be really far off, especially if
we're on the inner side of a nested loop, we might like to figure that
out eventually instead of just DTWT forever. But on the other hand
this does feel like an important case where we have a clue that
prefetching might need to be done less aggressively or not at all, and
it doesn't seem right to ignore that signal either. I wonder if we
want this shaped in some other way, like a Boolean that says
are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
side of a semi-join or anti-join.

+     * We reach here if the index only scan is not parallel, or if we're
+     * serially executing an index only scan that was planned to be
+     * parallel.

Well, this seems sad.

+     * XXX This might lead to IOS being slower than plain index scan, if the
+     * table has a lot of pages that need recheck.

How?

+    /*
+     * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
+     * of a misuse of the flag, but we need to disable prefetching for cursors
+     * (which might change direction), and parallelModeOK does that. But maybe
+     * we might (or should) have a separate flag.
+     */

I think the correct flag to be using here is execute_once, which
captures whether the executor could potentially be invoked a second
time for the same portal. Changes in the fetch direction are possible
if and only if !execute_once.

> Note 1: The IndexPrefetch name is a bit misleading, because it's used
> even with prefetching disabled - all index reads from the index scan
> happen through it. Maybe it should be called IndexReader or something
> like that.

My biggest gripe here is the capitalization. This version adds, inter
alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
index_heap_prefetch_target, which seems like one or two too many
conventions. But maybe the PREFETCH_* macros don't even belong in a
public header.

I do like the index_heap_prefetch_* naming. Possibly that's too
verbose to use for everything, but calling this index-heap-prefetch
rather than index-prefetch seems clearer.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: index prefetching

From

Tomas Vondra

Date:

12 January 2024, 16:42:39

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I'll briefly go through the main changes in the patch, and then will
respond in-line to Robert's points.

1) I moved the code from indexam.c to (new) execPrefetch.c. All the
prototypes / typedefs now live in executor.h, with only minimal changes
in execnodes.h (adding it to scan descriptors).

I believe this finally moves the code to the right place - it feels much
nicer and cleaner than in indexam.c.  And it allowed me to hide a bunch
of internal structs and improve the general API, I think.

I'm sure there's stuff that could be named differently, but the layering
feels about right, I think.

2) A bunch of stuff got renamed to start with IndexPrefetch... to make
the naming consistent / clearer. I'm not entirely sure IndexPrefetch is
the right name, though - it's still a bit misleading, as it might seem
it's about prefetching index stuff, but really it's about heap pages
from indexes. Maybe IndexScanPrefetch() or something like that?

3) If there's a way to make this work with the streaming I/O API, I'm
not aware of it. But the overall design seems somewhat similar (based on
"next" callback etc.) so hopefully that'd make it easier to adopt it.

4) I initially relied on parallelModeOK to disable prefetching, which
kinda worked, but not really. Robert suggested to use the execute_once
flag directly, and I think that's much better - not only is it cleaner,
it also seems more appropriate (the parallel flag considers other stuff
that is not quite relevant to prefetching).

Thinking about this, I think it should be possible to make prefetching
work even for plans with execute_once=false. In particular, when the
plan changes direction it should be possible to simply "walk back" the
prefetch queue, to get to the "correct" place in the scan. But I'm
not sure it's worth it, because plans that change direction often can't
really benefit from prefetches anyway - they'll often visit stuff they
accessed shortly before anyway. For plans that don't change direction
but may pause, we don't know if the plan pauses long enough for the
prefetched pages to get evicted or something. So I think it's OK that
execute_once=false means no prefetching.

5) I haven't done anything about the xs_heap_continue=true case yet.

6) I went through all the comments and reworked them considerably.
There's the main comment at the start of execPrefetch.c, with the
overall design etc., and then there are comments for each function,
explaining that bit in more detail. Or at least that's the goal -
there's still work to do.

There's two trivial FIXMEs, but you can ignore those - it's not that
there's a bug, but that I'd like to rework something and just don't know
how yet.

There's also a couple of XXX comments. Some are a bit wild ideas for the
future, others are somewhat "open questions" to be discussed during a
review.

Anyway, there should be no outright obsolete comments - if there's
something I missed, let me know.

Now to Robert's message ...

On 1/9/24 21:31, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Here's a somewhat reworked version of the patch. My initial goal was to
>> see if it could adopt the StreamingRead API proposed in [1], but that
>> turned out to be less straight-forward than I hoped, for two reasons:
> 
> I guess we need Thomas or Andres or maybe Melanie to comment on this.
> 

Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
streaming I/O stuff.

>> Perhaps a bigger change is that I decided to move this into a separate
>> API on top of indexam.c. The original idea was to integrate this into
>> index_getnext_tid/index_getnext_slot, so that all callers benefit from
>> the prefetching automatically. Which would be nice, but it also meant
>> it'd need to happen in the indexam.c code, which seemed dirty.
> 
> This patch is hard to review right now because there's a bunch of
> comment updating that doesn't seem to have been done for the new
> design. For instance:
> 
> + * XXX This does not support prefetching of heap pages. When such prefetching is
> + * desirable, use index_getnext_tid().
> 
> But not any more.
> 

True. And this is now even more obsolete, as the prefetching was moved
from indexam.c layer to the executor.

> + * XXX The prefetching may interfere with the patch allowing us to evaluate
> + * conditions on the index tuple, in which case we may not need the heap
> + * tuple. Maybe if there's such filter, we should prefetch only pages that
> + * are not all-visible (and the same idea would also work for IOS), but
> + * it also makes the indexing a bit "aware" of the visibility stuff (which
> + * seems a somewhat wrong). Also, maybe we should consider the filter selectivity
> 
> I'm not sure whether all the problems in this area are solved, but I
> think you've solved enough of them that this at least needs rewording,
> if not removing.
> 
> +     * XXX Comment/check seems obsolete.
> 
> This occurs in two places. I'm not sure if it's accurate or not.
> 
> +     * XXX Could this be an issue for the prefetching? What if we prefetch something
> +     * but the direction changes before we get to the read? If that could happen,
> +     * maybe we should discard the prefetched data and go back? But can we even
> +     * do that, if we already fetched some TIDs from the index? I don't think
> +     * indexorderdir can't change, but es_direction maybe can?
> 
> But your email claims that "The patch simply disables prefetching for
> such queries, using the same logic that we do for parallelism." FWIW,
> I think that's a fine way to handle that case.
> 

True. I left behind this comment partly intentionally, to point out why
we disable the prefetching in these cases, but you're right the comment
now explains something that can't happen.

> +     * XXX Maybe we should enable prefetching, but prefetch only pages that
> +     * are not all-visible (but checking that from the index code seems like
> +     * a violation of layering etc).
> 
> Isn't this fixed now? Note this comment occurs twice.
> 
> +     * XXX We need to disable this in some cases (e.g. when using index-only
> +     * scans, we don't want to prefetch pages). Or maybe we should prefetch
> +     * only pages that are not all-visible, that'd be even better.
> 
> Here again.
> 

Sorry, you're right those comments (and a couple more nearby) were
stale. Removed / clarified.

> And now for some comments on other parts of the patch, mostly other
> XXX comments:
> 
> + * XXX This does not support prefetching of heap pages. When such prefetching is
> + * desirable, use index_getnext_tid().
> 
> There's probably no reason to write XXX here. The comment is fine.
> 
> +     * XXX Notice we haven't added the block to the block queue yet, and there
> +     * is a preceding block (i.e. blockIndex-1 is valid).
> 
> Same here, possibly? If this XXX indicates a defect in the code, I
> don't know what the defect is, so I guess it needs to be more clear.
> If it is just explaining the code, then there's no reason for the
> comment to say XXX.
> 

Yeah, removed the XXX / reworded a bit.

> +     * XXX Could it be harmful that we read the queue backwards? Maybe memory
> +     * prefetching works better for the forward direction?
> 
> It does. But I don't know whether that matters here or not.
> 
> +             * XXX We do add the cache size to the request in order not to
> +             * have issues with uint64 underflows.
> 
> I don't know what this means.
> 

There's a check that does this:

      (x + PREFETCH_CACHE_SIZE) >= y

it might also be done as "mathematically equivalent"

      x >= (y - PREFETCH_CACHE_SIZE)

but if the "y" is a uint64, and the value is smaller than the constant,
this would underflow. The problem would disappear once the "y" gets large
enough, ofc.
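
To make the difference concrete, here is a minimal standalone sketch of the two forms. It is not code from the patch: the constant's value and the function names are made up for illustration, and "x" and "y" simply stand for the two unsigned counters compared in the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value, chosen only for this illustration. */
#define PREFETCH_CACHE_SIZE 1024

/*
 * Additive form: the constant is added on the left-hand side, so nothing
 * wraps around as long as x is not within the constant of UINT64_MAX.
 */
static bool
within_cache_add(uint64_t x, uint64_t y)
{
    assert(x <= UINT64_MAX - PREFETCH_CACHE_SIZE);
    return (x + PREFETCH_CACHE_SIZE) >= y;
}

/*
 * "Mathematically equivalent" form: the subtraction wraps around to a huge
 * value whenever y is smaller than the constant, so the comparison is
 * wrong for small y.
 */
static bool
within_cache_sub(uint64_t x, uint64_t y)
{
    return x >= (y - PREFETCH_CACHE_SIZE);
}
```

With x = 5 and y = 10 the additive form yields true, while the subtractive form compares 5 against a wrapped-around value close to UINT64_MAX and yields false. Once y is at least as large as the constant, both forms agree.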

> + * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
> + * maybe nodeIndexscan needs to do something more to handle this? Although, that
> + * should be in the indexscan next_cb callback, probably.
> + *
> + * XXX If xs_heap_continue=true, we need to return the last TID.
> 
> You've got a bunch of comments about xs_heap_continue here -- and I
> don't fully understand what the issues are here with respect to this
> particular patch, but I think that the general purpose of
> xs_heap_continue is to handle the case where we need to return more
> than one tuple from the same HOT chain. With an MVCC snapshot that
> doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
> As far as possible, the prefetcher shouldn't be involved at all when
> xs_heap_continue is set, I believe, because in that case we're just
> returning a bunch of tuples from the same page, and the extra fetches
> from that heap page shouldn't trigger or require any further
> prefetching.
> 

Yes, that's correct. The current code simply ignores that flag and just
proceeds to the next TID. Which is correct for xs_heap_continue=false,
and thus all MVCC snapshots work fine. But for the Any/Dirty case it
needs to work a bit differently.
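
A rough sketch of what "a bit differently" could look like, assuming a much simplified reader: when the continue flag is set, the last TID is handed out again and the prefetch state is left alone. All types and names here (DemoTid, DemoReader, demo_next_tid) are hypothetical stand-ins, not the structs or functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the structs from the patch. */
typedef struct DemoTid
{
    unsigned    block;
    unsigned    offset;
} DemoTid;

typedef struct DemoReader
{
    const DemoTid *tids;        /* TIDs as returned by the index */
    size_t      ntids;
    size_t      next;           /* index of the next TID to hand out */
    size_t      nprefetches;    /* number of prefetch requests issued */
} DemoReader;

/*
 * Return the TID to fetch next, or NULL when the scan is exhausted.
 *
 * If heap_continue is set, the caller still has tuples to return from the
 * HOT chain of the last TID, so that TID is returned again and the
 * prefetch state is not touched.
 */
static const DemoTid *
demo_next_tid(DemoReader *r, bool heap_continue)
{
    if (heap_continue)
    {
        assert(r->next > 0);    /* there must be a "last" TID */
        return &r->tids[r->next - 1];
    }

    if (r->next >= r->ntids)
        return NULL;

    r->nprefetches++;           /* stand-in for prefetching a future TID */
    return &r->tids[r->next++];
}
```

The point of the sketch is only the early return: repeated fetches from the same heap page neither advance the scan nor trigger further prefetching.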

> +     * XXX Should this also look at plan.plan_rows and maybe cap the target
> +     * to that? Pointless to prefetch more than we expect to use. Or maybe
> +     * just reset to that value during prefetching, after reading the next
> +     * index page (or rather after rescan)?
> 
> It seems questionable to use plan_rows here because (1) I don't think
> we have existing cases where we use the estimated row count in the
> executor for anything, we just carry it through so EXPLAIN can print
> it and (2) row count estimates can be really far off, especially if
> we're on the inner side of a nested loop, we might like to figure that
> out eventually instead of just DTWT forever. But on the other hand
> this does feel like an important case where we have a clue that
> prefetching might need to be done less aggressively or not at all, and
> it doesn't seem right to ignore that signal either. I wonder if we
> want this shaped in some other way, like a Boolean that says
> are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
> side of a semi-join or anti-join.
> 

The current code actually does look at plan_rows when calculating the
prefetch target:

  prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
                                            node->ss.ps.plan->plan_rows,
                                            estate->es_use_prefetching);

but I agree maybe it should not, for the reasons you explain. I'm not
attached to this part.
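
For readers following along, capping the target by the row estimate could look roughly like the sketch below. The body of IndexPrefetchComputeTarget is not shown in this thread, so this is only one plausible shape; demo_compute_target and its parameters are hypothetical, with io_concurrency standing in for whatever limit the relation's tablespace provides.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: compute the maximum number of heap pages to
 * prefetch ahead, capped by the planner's row estimate.
 */
static int
demo_compute_target(int io_concurrency, double plan_rows, bool use_prefetching)
{
    assert(io_concurrency >= 0);

    /* Prefetching disabled for this query (e.g. execute_once is false). */
    if (!use_prefetching)
        return 0;

    /* Guard against nonsensical estimates. */
    if (plan_rows < 0)
        plan_rows = 0;

    /* Pointless to prefetch more than we expect to use. */
    if (plan_rows < (double) io_concurrency)
        return (int) plan_rows;

    return io_concurrency;
}
```

The weakness discussed above is visible here: a badly underestimated plan_rows permanently limits the target, which is why a Boolean "possibly row-limited" signal may be the better input.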

> +     * We reach here if the index only scan is not parallel, or if we're
> +     * serially executing an index only scan that was planned to be
> +     * parallel.
> 
> Well, this seems sad.
> 

Stale comment, I believe. However, I didn't see much benefit with
parallel index scan during testing. Having I/O from multiple workers
generally had the same effect, I think.

> +     * XXX This might lead to IOS being slower than plain index scan, if the
> +     * table has a lot of pages that need recheck.
> 
> How?
> 

The comment is not particularly clear what "this" means, but I believe
this was about index-only scan with many not-all-visible pages. If it
didn't do prefetching, a regular index scan with prefetching may be way
faster. But the code actually allows doing prefetching even for IOS, by
checking the vm in the "next" callback.
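
As a rough illustration of that idea (not the patch's code), the decision in the callback boils down to skipping all-visible blocks. The visibility map is modelled here as a plain array of flags, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the visibility map: one flag per heap block. */
typedef struct DemoVisibilityMap
{
    const bool *all_visible;
    size_t      nblocks;
} DemoVisibilityMap;

/*
 * Decide whether a heap block referenced by an index-only scan is worth
 * prefetching. All-visible blocks are answered from the index tuple alone,
 * so only the remaining blocks are actually read from the heap.
 */
static bool
demo_ios_should_prefetch(const DemoVisibilityMap *vm, size_t block)
{
    assert(block < vm->nblocks);
    return !vm->all_visible[block];
}

/*
 * Count the prefetch requests an IOS would issue for a list of heap
 * blocks (repeated blocks are counted each time, there is no dedup here).
 */
static size_t
demo_ios_count_prefetches(const DemoVisibilityMap *vm,
                          const size_t *blocks, size_t n)
{
    size_t      count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (demo_ios_should_prefetch(vm, blocks[i]))
            count++;
    }
    return count;
}
```

With many not-all-visible pages nearly every block gets prefetched, so the IOS behaves like a regular index scan with prefetching instead of falling behind it.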

> +    /*
> +     * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
> +     * of a misuse of the flag, but we need to disable prefetching for cursors
> +     * (which might change direction), and parallelModeOK does that. But maybe
> +     * we might (or should) have a separate flag.
> +     */
> 
> I think the correct flag to be using here is execute_once, which
> captures whether the executor could potentially be invoked a second
> time for the same portal. Changes in the fetch direction are possible
> if and only if !execute_once.
> 

Right. The new patch version does that.

>> Note 1: The IndexPrefetch name is a bit misleading, because it's used
>> even with prefetching disabled - all index reads from the index scan
>> happen through it. Maybe it should be called IndexReader or something
>> like that.
> 
> My biggest gripe here is the capitalization. This version adds, inter
> alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
> index_heap_prefetch_target, which seems like one or two too many
> conventions. But maybe the PREFETCH_* macros don't even belong in a
> public header.
> 
> I do like the index_heap_prefetch_* naming. Possibly that's too
> verbose to use for everything, but calling this index-heap-prefetch
> rather than index-prefetch seems clearer.
> 

Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
to keep it consistent. And then the constants are all capital, ofc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

v20240112-0001-Prefetch-heap-pages-during-index-scans.patch

Re: index prefetching

From

Robert Haas

Date:

12 January 2024, 16:52:53

Not a full response, but just to address a few points:

On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Thinking about this, I think it should be possible to make prefetching
> work even for plans with execute_once=false. In particular, when the
> plan changes direction it should be possible to simply "walk back" the
> prefetch queue, to get to the "correct" place in the scan. But I'm
> not sure it's worth it, because plans that change direction often can't
> really benefit from prefetches anyway - they'll often visit stuff they
> accessed shortly before anyway. For plans that don't change direction
> but may pause, we don't know if the plan pauses long enough for the
> prefetched pages to get evicted or something. So I think it's OK that
> execute_once=false means no prefetching.

+1.

> > +             * XXX We do add the cache size to the request in order not to
> > +             * have issues with uint64 underflows.
> >
> > I don't know what this means.
> >
>
> There's a check that does this:
>
>       (x + PREFETCH_CACHE_SIZE) >= y
>
> it might also be done as "mathematically equivalent"
>
>       x >= (y - PREFETCH_CACHE_SIZE)
>
> but if the "y" is a uint64, and the value is smaller than the constant,
> this would underflow. The problem would disappear once the "y" gets large
> enough, ofc.

The problem is, I think, that there's no particular reason that
someone reading the existing code should imagine that it might have
been done in that "mathematically equivalent" fashion. I imagined that
you were trying to make a point about adding the cache size to the
request vs. adding nothing, whereas in reality you were trying to make
a point about adding from one side vs. subtracting from the other.

> > +     * We reach here if the index only scan is not parallel, or if we're
> > +     * serially executing an index only scan that was planned to be
> > +     * parallel.
> >
> > Well, this seems sad.
>
> Stale comment, I believe. However, I didn't see much benefit with
> parallel index scan during testing. Having I/O from multiple workers
> generally had the same effect, I think.

Fair point, likely worth mentioning explicitly in the comment.

> Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
> to keep it consistent. And then the constants are all capital, ofc.

It'd still be nice to get table or heap in there, IMHO, but maybe we
can't, and consistency is certainly a good thing regardless of the
details, so thanks for that.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: index prefetching

From

Konstantin Knizhnik

Date:

16 January 2024, 08:13:43

Hi,

On 12/01/2024 6:42 pm, Tomas Vondra wrote:

> Hi,
>
> Here's an improved version of this patch, finishing a lot of the stuff
> that I alluded to earlier - moving the code from indexam.c, renaming a
> bunch of stuff, etc. I've also squashed it into a single patch, to make
> it easier to review.

I am thinking about testing your patch with Neon (cloud Postgres). Since Neon separates compute and storage, prefetch is much more critical for the Neon
architecture than for vanilla Postgres.

I have a few complaints:

1. It disables prefetch for sequential access patterns (i.e. INDEX MERGE), on the grounds that in this case OS read-ahead will be more efficient than prefetch. That may be true for normal storage devices, but not for Neon storage, and maybe also not for Postgres on top of a DFS (i.e. Amazon RDS). I wonder if we can delegate the decision whether to perform prefetch in this case to some other level. I do not know precisely where it should be handled. The best candidate IMHO is the storage manager. But that most likely requires an extension of the SMGR API. Not sure if you want to do it... A straightforward solution is to move this logic to some callback, which can be overridden by the user.

2. It disables prefetch for direct_io. This seems even more obvious than 1), because prefetching using `posix_fadvise` is definitely not possible when using direct_io. But in theory, if SMGR provides some alternative prefetch implementation (as in the case of Neon), this may also not be true. It is still unclear why we would want to use direct_io in Neon... But I would still prefer to move this decision outside the executor.

3. It doesn't perform prefetch of leaf pages for IOS, only of referenced heap pages which are not marked as all-visible. It seems to me that if the optimizer has chosen IOS (and not a bitmap heap scan, for example), then there should be a large enough fraction of all-visible pages. Also, index prefetch is most efficient for OLAP queries, and those are usually performed on historical data, which is all-visible. But IOS can really be handled separately in some other PR. Frankly speaking, combining prefetch of leaf B-Tree pages and referenced heap pages seems to be a very challenging task.

4. I think that performing prefetch at the executor level is a really great idea, as prefetch can then be used by all indexes, including custom indexes. But prefetch will be efficient only if the index can provide fast access to the next TID (located on the same page). I am not sure that is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AMs. I wonder if we should extend the AM API to let the index decide whether to perform prefetch of TIDs or not.

5. Minor notice: there are a few places where index_getnext_slot is called with the last parameter NULL (disabled prefetch) with the following comment
"XXX Would be nice to also benefit from prefetching here." But all these places correspond to "point lookups", i.e. unique constraint check, finding a replication tuple by index... Prefetch seems unlikely to be useful here, unless there is index bloat and we have to skip a lot of tuples before locating the right one. But should we try to optimize the case of bloated indexes?

Re: index prefetching

From

Tomas Vondra

Date:

16 January 2024, 16:25:05

On 1/16/24 09:13, Konstantin Knizhnik wrote:
> Hi,
> 
> On 12/01/2024 6:42 pm, Tomas Vondra wrote:
>> Hi,
>>
>> Here's an improved version of this patch, finishing a lot of the stuff
>> that I alluded to earlier - moving the code from indexam.c, renaming a
>> bunch of stuff, etc. I've also squashed it into a single patch, to make
>> it easier to review.
> 
> I am thinking about testing your patch with Neon (cloud Postgres). Since
> Neon separates compute and storage, prefetch is much more critical for
> the Neon architecture than for vanilla Postgres.
> 
> I have a few complaints:
> 
> 1. It disables prefetch for sequential access patterns (i.e. INDEX
> MERGE), on the grounds that in this case OS read-ahead will be more
> efficient than prefetch. That may be true for normal storage devices,
> but not for Neon storage, and maybe also not for Postgres on top of a
> DFS (i.e. Amazon RDS). I wonder if we can delegate the decision whether
> to perform prefetch in this case to some other level. I do not know
> precisely where it should be handled. The best candidate IMHO is the
> storage manager. But that most likely requires an extension of the SMGR
> API. Not sure if you want to do it... A straightforward solution is to
> move this logic to some callback, which can be overridden by the user.
> 

Interesting point. You're right these decisions (whether to prefetch
particular patterns) are closely tied to the capabilities of the storage
system. So it might make sense to maybe define it at that level.

Not sure what exactly RDS does with the storage - my understanding is
that it's mostly regular Postgres code, but managed by Amazon. So how
would that modify the prefetching logic?

However, I'm not against making this modular / wrapping this in some
sort of callbacks, for example.

> 2. It disables prefetch for direct_io. This seems even more obvious
> than 1), because prefetching using `posix_fadvise` is definitely not
> possible when using direct_io. But in theory, if SMGR provides some
> alternative prefetch implementation (as in the case of Neon), this may
> also not be true. It is still unclear why we would want to use
> direct_io in Neon... But I would still prefer to move this decision
> outside the executor.
> 

True. I think this would / should be customizable by the callback.
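
A minimal sketch of what such a callback might look like, assuming a replaceable function pointer that a storage layer or extension could override. Nothing here exists in the patch or in the SMGR API: the struct, the hook variable, and both policies are hypothetical names used only to show the shape of the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the access pattern being considered. */
typedef struct DemoPrefetchRequest
{
    bool        sequential;     /* blocks are requested in sequential order */
    bool        direct_io;      /* relation is read using direct I/O */
} DemoPrefetchRequest;

typedef bool (*demo_prefetch_policy_fn) (const DemoPrefetchRequest *req);

/*
 * Default policy, mirroring the behaviour described in points 1 and 2:
 * leave sequential patterns to OS read-ahead, and skip prefetching with
 * direct I/O, where posix_fadvise cannot help.
 */
static bool
demo_default_policy(const DemoPrefetchRequest *req)
{
    assert(req != NULL);
    return !req->sequential && !req->direct_io;
}

/* Policy for storage that has its own prefetch path: always prefetch. */
static bool
demo_remote_storage_policy(const DemoPrefetchRequest *req)
{
    assert(req != NULL);
    return true;
}

/* The hook; an extension or storage manager would replace this pointer. */
static demo_prefetch_policy_fn demo_prefetch_policy = demo_default_policy;

/* What the executor would call instead of hard-coding the decision. */
static bool
demo_should_prefetch(const DemoPrefetchRequest *req)
{
    return demo_prefetch_policy(req);
}
```

Keeping the decision behind a single pointer means the executor code stays unaware of storage details, which is the layering concern raised in both points.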

> 3. It doesn't perform prefetch of leaf pages for IOS, only of referenced
> heap pages which are not marked as all-visible. It seems to me that if
> the optimizer has chosen IOS (and not a bitmap heap scan, for example),
> then there should be a large enough fraction of all-visible pages. Also,
> index prefetch is most efficient for OLAP queries, and those are usually
> performed on historical data, which is all-visible. But IOS can really
> be handled separately in some other PR. Frankly speaking, combining
> prefetch of leaf B-Tree pages and referenced heap pages seems to be a
> very challenging task.
> 

I see prefetching of leaf pages as interesting / worthwhile improvement,
but out of scope for this patch. I don't think it can be done at the
executor level - the prefetch requests need to be submitted from the
index AM code (by calling PrefetchBuffer, etc.)

> 4. I think that performing prefetch at the executor level is a really
> great idea, as prefetch can then be used by all indexes, including
> custom indexes. But prefetch will be efficient only if the index can
> provide fast access to the next TID (located on the same page). I am not
> sure that is true for all builtin indexes (GIN, GIST, BRIN,...) and
> especially for custom AMs. I wonder if we should extend the AM API to
> let the index decide whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.

> 
> 5. Minor notice: there are a few places where index_getnext_slot is
> called with the last parameter NULL (disabled prefetch) with the
> following comment "XXX Would be nice to also benefit from prefetching
> here." But all these places correspond to "point lookups", i.e. unique
> constraint check, finding a replication tuple by index... Prefetch seems
> unlikely to be useful here, unless there is index bloat and we have to
> skip a lot of tuples before locating the right one. But should we try to
> optimize the case of bloated indexes?
> 

Are you sure you're looking at the last patch version? Because the
current patch does not have any new parameters in index_getnext_* and
the comments were removed too (I suppose you're talking about
execIndexing, execReplication and those places).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Robert Haas

Date:

16 January 2024, 17:08:14

On Tue, Jan 16, 2024 at 11:25 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > 3. It doesn't perform prefetch of leaf pages for IOS, only of referenced
> > heap pages which are not marked as all-visible. It seems to me that if
> > the optimizer has chosen IOS (and not a bitmap heap scan, for example),
> > then there should be a large enough fraction of all-visible pages. Also,
> > index prefetch is most efficient for OLAP queries, and those are usually
> > performed on historical data, which is all-visible. But IOS can really
> > be handled separately in some other PR. Frankly speaking, combining
> > prefetch of leaf B-Tree pages and referenced heap pages seems to be a
> > very challenging task.
>
> I see prefetching of leaf pages as interesting / worthwhile improvement,
> but out of scope for this patch. I don't think it can be done at the
> executor level - the prefetch requests need to be submitted from the
> index AM code (by calling PrefetchBuffer, etc.)

+1. This is a good feature, and so is that, but they're not the same
feature, despite the naming problems.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: index prefetching

From

Konstantin Knizhnik

Date:

16 January 2024, 20:10:23

On 16/01/2024 6:25 pm, Tomas Vondra wrote:
> On 1/16/24 09:13, Konstantin Knizhnik wrote:
>> Hi,
>>
>> On 12/01/2024 6:42 pm, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Here's an improved version of this patch, finishing a lot of the stuff
>>> that I alluded to earlier - moving the code from indexam.c, renaming a
>>> bunch of stuff, etc. I've also squashed it into a single patch, to make
>>> it easier to review.
>>
>> I am thinking about testing you patch with Neon (cloud Postgres). As far
>> as Neon seaprates compute and storage, prefetch is much more critical
>> for Neon architecture than for vanilla Postgres.
>>
>> I have few complaints:
>>
>> 1. It disables prefetch for sequential access pattern (i.e. INDEX
>> MERGE), motivating it that in this case OS read-ahead will be more
>> efficient than prefetch. It may be true for normal storage devices, bit
>> not for Neon storage and may be also for Postgres on top of DFS (i.e.
>> Amazon RDS). I wonder if we can delegate decision whether to perform
>> prefetch in this case or not to some other level. I do not know
>> precisely where is should be handled. The best candidate IMHO is
>> storager manager. But it most likely requires extension of SMGR API. Not
>> sure if you want to do it... Straightforward solution is to move this
>> logic to some callback, which can be overwritten by user.
>
> Interesting point. You're right these decisions (whether to prefetch
> particular patterns) are closely tied to the capabilities of the storage
> system. So it might make sense to maybe define it at that level.
>
> Not sure what exactly RDS does with the storage - my understanding is
> that it's mostly regular Postgres code, but managed by Amazon. So how
> would that modify the prefetching logic?
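
To make the callback idea above concrete, here is a minimal sketch. All names in it (prefetch_sequential_hook, should_prefetch, and the two policy functions) are hypothetical and are not part of the patch or of the SMGR API. It also assumes a "sequential" access is simply next == prev + 1, which is probably simpler than what the patch actually detects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int BlockNumber;

/* Hypothetical hook type: should a sequential block be prefetched? */
typedef bool (*prefetch_sequential_hook_type) (BlockNumber prev,
                                               BlockNumber next);

/* NULL means "use the built-in default"; storage code could override it. */
prefetch_sequential_hook_type prefetch_sequential_hook = NULL;

/* Default policy: skip sequential blocks and rely on OS read-ahead. */
bool
default_prefetch_sequential(BlockNumber prev, BlockNumber next)
{
    (void) prev;
    (void) next;
    return false;
}

/* Example override for remote storage without useful OS read-ahead. */
bool
remote_storage_prefetch_sequential(BlockNumber prev, BlockNumber next)
{
    (void) prev;
    (void) next;
    return true;
}

/* Decide whether to issue a prefetch for "next", given the previous block. */
bool
should_prefetch(BlockNumber prev, BlockNumber next)
{
    bool        sequential = (next == prev + 1);

    if (!sequential)
        return true;            /* random access: always prefetch */

    if (prefetch_sequential_hook != NULL)
        return prefetch_sequential_hook(prev, next);

    return default_prefetch_sequential(prev, next);
}
```

With the hook unset, should_prefetch(10, 11) is false and should_prefetch(10, 50) is true. After assigning remote_storage_prefetch_sequential to the hook, both return true. Whether such a decision belongs in a hook, in the storage manager, or somewhere else is exactly the open question here.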

Amazon RDS is just vanilla Postgres with a file system mounted on EBS (Amazon distributed file system).
EBS provides good throughput but higher latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.

>> 4. I think that performing prefetch at executor level is really great
>> idea and so prefetch can be used by all indexes, including custom
>> indexes. But prefetch will be efficient only if index can provide fast
>> access to next TID (located at the same page). I am not sure that it is
>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
>> custom AM. I wonder if we should extend AM API to make index make a
>> decision weather to perform prefetch of TIDs or not.
>
> I'm not against having a flag to enable/disable prefetching, but the
> question is whether doing prefetching for such indexes can be harmful.
> I'm not sure about that.

I tend to agree with you - it is hard to imagine an index implementation which doesn't benefit from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will fetch the heap tuple in any case.

>> 5. Minor notice: there are few places where index_getnext_slot is called
>> with last NULL parameter (disabled prefetch) with the following comment
>> "XXX Would be nice to also benefit from prefetching here." But all this
>> places corresponds to "point loopkup", i.e. unique constraint check,
>> find replication tuple by index... Prefetch seems to be unlikely useful
>> here, unlkess there is index bloating and and we have to skip a lot of
>> tuples before locating right one. But should we try to optimize case of
>> bloated indexes?
>
> Are you sure you're looking at the last patch version? Because the
> current patch does not have any new parameters in index_getnext_* and
> the comments were removed too (I suppose you're talking about
> execIndexing, execReplication and those places).

Sorry, I looked at v20240103-0001-prefetch-2023-12-09.patch; I didn't notice v20240112-0001-Prefetch-heap-pages-during-index-scans.patch.

regards

Re: index prefetching

From

Jim Nasby

Date:

16 January 2024, 21:58:42

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
> Amazon RDS is just vanilla Postgres with file system mounted on EBS 
> (Amazon  distributed file system).
> EBS provides good throughput but larger latencies comparing with local SSDs.
> I am not sure if read-ahead works for EBS.

Actually, EBS only provides a block device - it's definitely not a 
filesystem itself (*EFS* is a filesystem - but it's also significantly 
different than EBS). So as long as readahead is happening somewhere
above the block device I would expect it to JustWork on EBS.

Of course, Aurora Postgres (like Neon) is completely different. If you 
look at page 53 of [1] you'll note that there's two different terms 
used: prefetch and batch. I'm not sure how much practical difference 
there is, but batched IO (one IO request to Aurora Storage for many 
blocks) predates index prefetch; VACUUM in APG has used batched IO for a 
very long time (it also *only* reads blocks that aren't marked all 
visible/frozen; none of the "only skip if skipping at least 32 blocks"
logic is used).

[1] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf
-- 
Jim Nasby, Data Architect, Austin TX

Re: index prefetching

From

Konstantin Knizhnik

Date:

17 January 2024, 06:10:01

On 16/01/2024 11:58 pm, Jim Nasby wrote:
> On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
>> Amazon RDS is just vanilla Postgres with file system mounted on EBS 
>> (Amazon  distributed file system).
>> EBS provides good throughput but larger latencies comparing with 
>> local SSDs.
>> I am not sure if read-ahead works for EBS.
>
> Actually, EBS only provides a block device - it's definitely not a 
> filesystem itself (*EFS* is a filesystem - but it's also significantly 
> different than EBS). So as long as readahead is happening somewheer 
> above the block device I would expect it to JustWork on EBS.

Thank you for the clarification.
Yes, EBS is just a block device, and read-ahead can be used for it as for
any other local device.
There is actually a recommendation to increase read-ahead for EBS devices
to reach better performance on some workloads:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

So it looks like manual prefetching is not needed for the sequential access
pattern on EBS.
But at Neon the situation is quite different. Maybe Aurora Postgres is
using some other mechanism to speed up vacuum and seqscan,
but Neon is using the Postgres prefetch mechanism for it.

Re: index prefetching

From

Konstantin Knizhnik

Date:

17 January 2024, 08:45:01

I have integrated your prefetch patch in Neon and it actually works!
Moreover, I combined it with prefetch of leaf pages for IOS and it also 
seems to work.

Just a small note: you are reporting `blks_prefetch_rounds` in explain,
but it is not incremented anywhere.
Moreover, I do not precisely understand what it means and wonder if such
information is useful for analyzing a query execution plan.
Also, your patch always reports the number of prefetched blocks (and rounds)
if they are not zero.

I think that adding new information to explain may cause some
problems because there are a lot of different tools which parse the explain
report to visualize it,
make some recommendations to improve performance, ... Certainly good
practice for such tools is to ignore all unknown tags. But I am not sure
that everybody follows this practice.
It seems to be safer and at the same time more convenient for users to
add an extra tag to explain to enable/disable prefetch info (as it was done
in Neon).

Here we come back to my custom explain patch;) Actually using it is not 
necessary. You can manually add "prefetch" option to Postgres core (as 
it is currently done in Neon).

Best regards,
Konstantin

Re: index prefetching

From

Tomas Vondra

Date:

18 January 2024, 15:57:47

On 1/16/24 21:10, Konstantin Knizhnik wrote:
> 
> ...
> 
>>> 4. I think that performing prefetch at executor level is really great
>>> idea and so prefetch can be used by all indexes, including custom
>>> indexes. But prefetch will be efficient only if index can provide fast
>>> access to next TID (located at the same page). I am not sure that it is
>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
>>> custom AM. I wonder if we should extend AM API to make index make a
>>> decision weather to perform prefetch of TIDs or not.
>> I'm not against having a flag to enable/disable prefetching, but the
>> question is whether doing prefetching for such indexes can be harmful.
>> I'm not sure about that.
> 
> I tend to agree with you - it is hard to imagine index implementation
> which doesn't win from prefetching heap pages.
> May be only the filtering case you have mentioned. But it seems to me
> that current B-Tree index scan (not IOS) implementation in Postgres
> doesn't try to use index tuple to check extra condition - it will fetch
> heap tuple in any case.
> 

That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good
thing is that both changes require fetching TIDs, not slots. I think the
condition can be simply added to the prefetch callback.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Tomas Vondra

Date:

18 January 2024, 16:00:32

On 1/17/24 09:45, Konstantin Knizhnik wrote:
> I have integrated your prefetch patch in Neon and it actually works!
> Moreover, I combined it with prefetch of leaf pages for IOS and it also
> seems to work.
> 

Cool! And do you think this is the right design/way to do this?

> Just small notice: you are reporting `blks_prefetch_rounds` in explain,
> but it is not incremented anywhere.
> Moreover, I do not precisely understand what it mean and wonder if such
> information is useful for analyzing query executing plan.
> Also your patch always report number of prefetched blocks (and rounds)
> if them are not zero.
> 

Right, this needs fixing.

> I think that adding new information to explain it may cause some
> problems because there are a lot of different tools which parse explain
> report to visualize it,
> make some recommendations top improve performance, ... Certainly good
> practice for such tools is to ignore all unknown tags. But I am not sure
> that everybody follow this practice.
> It seems to be more safe and at the same time convenient for users to
> add extra tag to explain to enable/disable prefetch info (as it was done
> in Neon).
> 

I think we want to add this info to explain, but maybe it should be
behind a new flag and disabled by default.

> Here we come back to my custom explain patch;) Actually using it is not
> necessary. You can manually add "prefetch" option to Postgres core (as
> it is currently done in Neon).
> 

Yeah, I think that's the right solution.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Konstantin Knizhnik

Date:

19 January 2024, 08:34:42

On 18/01/2024 6:00 pm, Tomas Vondra wrote:
> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>> I have integrated your prefetch patch in Neon and it actually works!
>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>> seems to work.
>>
> Cool! And do you think this is the right design/way to do this?

I like the idea of prefetching TIDs in executor.

But looking through your patch I have some questions:

1. Why is it necessary to allocate and store the all_visible flag in the data
buffer? Why can't the caller of IndexPrefetchNext look at the prefetch field?

+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;

2. The names of the functions `IndexPrefetchNext` and
`IndexOnlyPrefetchNext` are IMHO confusing because they look similar, and
one can assume that the first is used for normal index scans and the second
for index-only scans. But actually `IndexOnlyPrefetchNext` is a callback,
and `IndexPrefetchNext` is used in both nodeIndexscan.c and
nodeIndexonlyscan.c.

Re: index prefetching

From

Tomas Vondra

Date:

19 January 2024, 12:35:25


On 1/19/24 09:34, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>> I have integrated your prefetch patch in Neon and it actually works!
>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>>> seems to work.
>>>
>> Cool! And do you think this is the right design/way to do this?
> 
> I like the idea of prefetching TIDs in executor.
> 
> But looking though your patch I have some questions:
> 
> 
> 1. Why it is necessary to allocate and store all_visible flag in data
> buffer. Why caller of  IndexPrefetchNext can not look at prefetch field?
> 
> +        /* store the all_visible flag in the private part of the entry */
> +        entry->data = palloc(sizeof(bool));
> +        *(bool *) entry->data = all_visible;
> 

What do you mean by "prefetch field"? The reason why it's done like this is
to only do the VM check once - without keeping the value, we'd have to
do it in the "next" callback, to determine if we need to prefetch the
heap tuple, and then later in the index-only scan itself. That's a
significant overhead, especially in the case when everything is visible.
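
A minimal sketch of the "check the VM only once" idea, not the patch code itself. PrefetchEntry is only loosely modeled on the IndexPrefetchEntry from the patch, vm_all_visible() is a made-up stand-in for the real visibility map probe, and malloc() stands in for palloc(). The counter is there only to show that the scan side does not probe the VM a second time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;

/* Counts how often the (simulated) visibility map is consulted. */
int         vm_lookups = 0;

/* Stand-in for a VM probe: here even-numbered blocks are all-visible. */
bool
vm_all_visible(BlockNumber blkno)
{
    vm_lookups++;
    return (blkno % 2) == 0;
}

/* Simplified queue entry, loosely modeled on IndexPrefetchEntry. */
typedef struct PrefetchEntry
{
    BlockNumber blkno;
    bool        prefetch;       /* should we prefetch the heap page? */
    void       *data;           /* private data kept for the scan */
} PrefetchEntry;

/* "next" callback side: probe the VM once and remember the answer. */
void
entry_fill(PrefetchEntry *entry, BlockNumber blkno)
{
    bool        all_visible = vm_all_visible(blkno);

    entry->blkno = blkno;
    entry->prefetch = !all_visible;
    entry->data = malloc(sizeof(bool));
    assert(entry->data != NULL);
    *(bool *) entry->data = all_visible;
}

/* Scan side: reuse the cached flag instead of probing the VM again. */
bool
entry_needs_heap_fetch(const PrefetchEntry *entry)
{
    return !*(const bool *) entry->data;
}

/* Release the private data allocated by entry_fill(). */
void
entry_free(PrefetchEntry *entry)
{
    free(entry->data);
    entry->data = NULL;
}
```

After entry_fill() for one block, vm_lookups is 1 and stays at 1 no matter how often entry_needs_heap_fetch() is called for that entry.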

> 2. Names of the functions `IndexPrefetchNext` and
> `IndexOnlyPrefetchNext` are IMHO confusing because they look similar and
> one can assume that for one is used for normal index scan and last one -
> for index only scan. But actually `IndexOnlyPrefetchNext` is callback
> and `IndexPrefetchNext` is used in both nodeIndexscan.c and
> nodeIndexonlyscan.c
> 

Yeah, that's a good point. The naming probably needs rethinking.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Konstantin Knizhnik

Date:

19 January 2024, 15:19:22

On 18/01/2024 5:57 pm, Tomas Vondra wrote:
> On 1/16/24 21:10, Konstantin Knizhnik wrote:
>> ...
>>
>>>> 4. I think that performing prefetch at executor level is really great
>>>> idea and so prefetch can be used by all indexes, including custom
>>>> indexes. But prefetch will be efficient only if index can provide fast
>>>> access to next TID (located at the same page). I am not sure that it is
>>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
>>>> custom AM. I wonder if we should extend AM API to make index make a
>>>> decision weather to perform prefetch of TIDs or not.
>>>
>>> I'm not against having a flag to enable/disable prefetching, but the
>>> question is whether doing prefetching for such indexes can be harmful.
>>> I'm not sure about that.
>>
>> I tend to agree with you - it is hard to imagine index implementation
>> which doesn't win from prefetching heap pages.
>> May be only the filtering case you have mentioned. But it seems to me
>> that current B-Tree index scan (not IOS) implementation in Postgres
>> doesn't try to use index tuple to check extra condition - it will fetch
>> heap tuple in any case.
>
> That's true, but that's why I started working on this:
>
> https://commitfest.postgresql.org/46/4352/
>
> I need to think about how to combine that with the prefetching. The good
> thing is that both changes require fetching TIDs, not slots. I think the
> condition can be simply added to the prefetch callback.
>
>
> regards

Looks like I was wrong: even if it is not an index-only scan, if the index condition involves only index attributes, the heap is not accessed until we find a tuple satisfying the search condition.
The inclusive index case described above (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO an exotic case. If the keys are actually used in the search, then why not create a normal compound index instead?

Re: index prefetching

From

Melanie Plageman

Date:

19 January 2024, 21:43:37

On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 1/9/24 21:31, Robert Haas wrote:
> > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> Here's a somewhat reworked version of the patch. My initial goal was to
> >> see if it could adopt the StreamingRead API proposed in [1], but that
> >> turned out to be less straight-forward than I hoped, for two reasons:
> >
> > I guess we need Thomas or Andres or maybe Melanie to comment on this.
> >
>
> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
> streaming I/O stuff.

I've been studying your patch with the intent of finding a way to
change it and/or the streaming read API to work together. I've
attached a very rough sketch of how I think it could work.

We fill a queue with blocks from TIDs that we fetched from the index.
The queue is saved in a scan descriptor that is made available to the
streaming read callback. Once the queue is full, we invoke the table
AM specific index_fetch_tuple() function which calls
pg_streaming_read_buffer_get_next(). When the streaming read API
invokes the callback we registered, it simply dequeues a block number
for prefetching. The only change to the streaming read API is that
now, even if the callback returns InvalidBlockNumber, we may not be
finished, so make it resumable.

Structurally, this changes the timing of when the heap blocks are
prefetched. Your code would get a tid from the index and then prefetch
the heap block -- doing this until it filled a queue that had the
actual tids saved in it. With my approach and the streaming read API,
you fetch tids from the index until you've filled up a queue of block
numbers. Then the streaming read API will prefetch those heap blocks.

I didn't actually implement the block queue -- I just saved a single
block number and pretended it was a block queue. I was imagining we
replace this with something like your IndexPrefetch->blockItems --
which has light deduplication. We'd probably have to flesh it out more
than that.
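
To illustrate the shape of such a block queue, here is a minimal standalone sketch. It is not taken from either patch: BlockQueue, block_queue_push() and block_queue_next() are hypothetical names, the callback signature is much simpler than the real streaming read callback, and the "light deduplication" here only skips a block equal to the one enqueued just before it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

#define BLOCK_QUEUE_SIZE 8      /* hypothetical capacity */

typedef struct BlockQueue
{
    BlockNumber blocks[BLOCK_QUEUE_SIZE];
    int         head;           /* position of the next block to dequeue */
    int         count;          /* number of queued blocks */
    BlockNumber last;           /* most recently enqueued block */
} BlockQueue;

void
block_queue_init(BlockQueue *q)
{
    q->head = 0;
    q->count = 0;
    q->last = InvalidBlockNumber;
}

/*
 * Index side: add the heap block of a TID. Returns false when the queue
 * is full, i.e. the caller should start consuming before adding more.
 */
bool
block_queue_push(BlockQueue *q, BlockNumber blkno)
{
    if (blkno == q->last)
        return true;            /* light dedup: same block as previous TID */

    if (q->count == BLOCK_QUEUE_SIZE)
        return false;

    q->blocks[(q->head + q->count) % BLOCK_QUEUE_SIZE] = blkno;
    q->count++;
    q->last = blkno;
    return true;
}

/*
 * Callback side: hand out the next block to read. InvalidBlockNumber only
 * means "nothing queued right now"; the caller may refill and call again.
 */
BlockNumber
block_queue_next(void *private_data)
{
    BlockQueue *q = (BlockQueue *) private_data;
    BlockNumber blkno;

    if (q->count == 0)
        return InvalidBlockNumber;

    blkno = q->blocks[q->head];
    q->head = (q->head + 1) % BLOCK_QUEUE_SIZE;
    q->count--;
    return blkno;
}
```

The point of the last function is the "resumable" part: after it returns InvalidBlockNumber, more blocks can be pushed and the same callback keeps working.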

There are also table AM layering violations in my sketch which would
have to be worked out (not to mention some resource leakage I didn't
bother investigating [which causes it to fail tests]).

0001 is all of Thomas' streaming read API code that isn't yet in
master and 0002 is my rough sketch of index prefetching using the
streaming read API.

There are also numerous optimizations that your index prefetching
patch set does that would need to be added in some way. I haven't
thought much about it yet. I wanted to see what you thought of this
approach first. Basically, is it workable?

- Melanie

Re: index prefetching

From

Tomas Vondra

Date:

19 January 2024, 22:14:12


On 1/19/24 16:19, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 5:57 pm, Tomas Vondra wrote:
>> On 1/16/24 21:10, Konstantin Knizhnik wrote:
>>> ...
>>>
>>>>> 4. I think that performing prefetch at executor level is really great
>>>>> idea and so prefetch can be used by all indexes, including custom
>>>>> indexes. But prefetch will be efficient only if index can provide fast
>>>>> access to next TID (located at the same page). I am not sure that
>>>>> it is
>>>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
>>>>> custom AM. I wonder if we should extend AM API to make index make a
>>>>> decision weather to perform prefetch of TIDs or not.
>>>> I'm not against having a flag to enable/disable prefetching, but the
>>>> question is whether doing prefetching for such indexes can be harmful.
>>>> I'm not sure about that.
>>> I tend to agree with you - it is hard to imagine index implementation
>>> which doesn't win from prefetching heap pages.
>>> May be only the filtering case you have mentioned. But it seems to me
>>> that current B-Tree index scan (not IOS) implementation in Postgres
>>> doesn't try to use index tuple to check extra condition - it will fetch
>>> heap tuple in any case.
>>>
>> That's true, but that's why I started working on this:
>>
>> https://commitfest.postgresql.org/46/4352/
>>
>> I need to think about how to combine that with the prefetching. The good
>> thing is that both changes require fetching TIDs, not slots. I think the
>> condition can be simply added to the prefetch callback.
>>
>>
>> regards
>>
> Looks like I was not true, even if it is not index-only scan but index
> condition involves only index attributes, then heap is not accessed
> until we find tuple satisfying search condition.
> Inclusive index case described above
> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
> exotic case. If keys are actually used in search, then why not to create
> normal compound index instead?
> 

Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is
that exotic. IMHO it's quite possible it's actually quite common, but
the users do not realize the possible gains.

Also, there are reasons to not want very wide indexes - it has overhead
associated with maintenance, disk space, etc. I think it's perfectly
rational to design indexes in a way that eliminates most heap fetches
necessary to evaluate conditions, but does not guarantee IOS (so the
last heap fetch is still needed).

What do you mean by "create normal compound index"? The patch addresses
a limitation that not every condition can be translated into a proper
scan key. Even if we improve this, there will always be such conditions.
The IOS can evaluate them on index tuple, the regular index scan
can't do that (currently).

Can you share an example demonstrating the alternative approach?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Konstantin Knizhnik

Date:

21 January 2024, 19:50:17

On 20/01/2024 12:14 am, Tomas Vondra wrote:
>> Looks like I was not true, even if it is not index-only scan but index
>> condition involves only index attributes, then heap is not accessed
>> until we find tuple satisfying search condition.
>> Inclusive index case described above
>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
>> exotic case. If keys are actually used in search, then why not to create
>> normal compound index instead?
>
> Not sure I follow ...
>
> Firstly, I'm not convinced the example addressed by that other patch is
> that exotic. IMHO it's quite possible it's actually quite common, but
> the users do no realize the possible gains.
>
> Also, there are reasons to not want very wide indexes - it has overhead
> associated with maintenance, disk space, etc. I think it's perfectly
> rational to design indexes in a way eliminates most heap fetches
> necessary to evaluate conditions, but does not guarantee IOS (so the
> last heap fetch is still needed).

We are comparing a compound index (a,b) and a covering (inclusive) index (a) include (b).
These indexes have exactly the same width and size and almost the same maintenance overhead.

The first index has a more expensive comparison function (involving two columns), but I do not think that it can significantly affect
performance and maintenance cost. Also, if the selectivity of "a" is good enough, then there is no need to compare "b".

Why would we prefer a covering index to a compound index? I see only two good reasons:
1. The type of the extra columns does not have the comparison function needed by the AM.
2. The extra columns are never used in query predicates.

If you are going to use these columns in query predicates, I do not see much sense in creating an inclusive index rather than a compound index.
Do you?

> What do you mean by "create normal compound index"? The patch addresses
> a limitation that not every condition can be translated into a proper
> scan key. Even if we improve this, there will always be such conditions.
> The the IOS can evaluate them on index tuple, the regular index scan
> can't do that (currently).
>
> Can you share an example demonstrating the alternative approach?

Maybe I missed something.

This is the example from https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me :

```

And here is the plan with index on (a,b).

Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
  Output: a, b, d
  Buffers: shared hit=613
  -> Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
       Output: a, b, d
       Index Cond: ((t.a > 1000000) AND (t.b = 4))
       Buffers: shared hit=613
Planning:
  Buffers: shared hit=41
Planning Time: 0.314 ms
Execution Time: 6.910 ms
```

Isn't it an optimal plan for this query?

And cite from self reproducible example https://dbfiddle.uk/iehtq44L :
```
create unique index t_a_include_b on t(a) include (b);
-- I'd expecd index above to behave the same as index below for this query
--create unique index on t(a,b);
```

I agree that it is natural to expect the same result for both indexes. So this PR definitely makes sense.
My point is only that the compound index (a,b) is more natural and preferable in this case.

Re: index prefetching

From

Konstantin Knizhnik

Date:

21 January 2024, 19:56:36

On 19/01/2024 2:35 pm, Tomas Vondra wrote:
>
> On 1/19/24 09:34, Konstantin Knizhnik wrote:
>> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>>> I have integrated your prefetch patch in Neon and it actually works!
>>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>>>> seems to work.
>>>>
>>> Cool! And do you think this is the right design/way to do this?
>> I like the idea of prefetching TIDs in executor.
>>
>> But looking though your patch I have some questions:
>>
>>
>> 1. Why it is necessary to allocate and store all_visible flag in data
>> buffer. Why caller of  IndexPrefetchNext can not look at prefetch field?
>>
>> +        /* store the all_visible flag in the private part of the entry */
>> +        entry->data = palloc(sizeof(bool));
>> +        *(bool *) entry->data = all_visible;
>>
> What you mean by "prefetch field"?


I mean the "prefetch" field of IndexPrefetchEntry:

+
+typedef struct IndexPrefetchEntry
+{
+    ItemPointerData tid;
+
+    /* should we prefetch heap page for this TID? */
+    bool        prefetch;
+

You store the same flag twice:

+        /* prefetch only if not all visible */
+        entry->prefetch = !all_visible;
+
+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;

My question was: why do we need to allocate something in entry->data and
store all_visible in it, when we have already stored !all_visible in
entry->prefetch?

Re: index prefetching

From

Tomas Vondra

Date:

21 January 2024, 23:39:14


On 1/21/24 20:50, Konstantin Knizhnik wrote:
> 
> On 20/01/2024 12:14 am, Tomas Vondra wrote:
>>> Looks like I was not true, even if it is not index-only scan but index
>>> condition involves only index attributes, then heap is not accessed
>>> until we find tuple satisfying search condition.
>>> Inclusive index case described above
>>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
>>> exotic case. If keys are actually used in search, then why not to create
>>> normal compound index instead?
>>>
>> Not sure I follow ...
>>
>> Firstly, I'm not convinced the example addressed by that other patch is
>> that exotic. IMHO it's quite possible it's actually quite common, but
>> the users do no realize the possible gains.
>>
>> Also, there are reasons to not want very wide indexes - it has overhead
>> associated with maintenance, disk space, etc. I think it's perfectly
>> rational to design indexes in a way eliminates most heap fetches
>> necessary to evaluate conditions, but does not guarantee IOS (so the
>> last heap fetch is still needed).
> 
> We are comparing compound index (a,b) and covering (inclusive) index (a)
> include (b)
> This indexes have exactly the same width and size and almost the same
> maintenance overhead.
> 
> First index has more expensive comparison function (involving two
> columns)  but I do not think that it can significantly affect
> performance and maintenance cost. Also if selectivity of "a" is good
> enough, then there is no need to compare "b"
> 
> Why we can prefer covering index  to compound index? I see only two good
> reasons:
> 1. Extra columns type do not  have comparison function need for AM.
> 2. The extra columns are never used in query predicate.
> 

Or maybe you don't want to include the columns in a UNIQUE constraint?

> If you are going to use this columns in query predicates I do not see
> much sense in creating inclusive index rather than compound index.
> Do you?
> 

But this is also about conditions that can't be translated into index
scan keys. Consider this:

create table t (a int, b int, c int);
insert into t select 1000 * random(), 1000 * random(), 1000 * random()
from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
1111111;
                                                   QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
(actual time=4.562..4.564 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: (mod(b, 10) = 1111111)
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.304 ms
 Execution Time: 5.146 ms
(8 rows)

Notice that this still fetched ~1000 buffers in order to evaluate the
filter on "b", because it's complex and can't be transformed into a nice
scan key. Or this:

explain (analyze, buffers) select a from t where a = 10 and (b+1) < 100
                                             and c < 0;


                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3673.22 rows=1 width=4)
(actual time=4.446..4.448 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: ((c < 0) AND ((b + 1) < 100))
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.313 ms
 Execution Time: 4.878 ms
(8 rows)

where it's "broken" by the extra unindexed column.

FWIW these are the primary cases I had in mind for this patch.
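
As a rough illustration of what evaluating such a filter on the index tuple would save, here is a small standalone sketch. Nothing in it is from the patch: IdxTuple, filter_mod_b() and count_heap_fetches() are made-up names, and the "index" is just an array of (a, b) pairs. It only counts how many heap fetches remain when a condition that is not a scan key is checked before going to the heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified index tuple for an index on (a, b). */
typedef struct IdxTuple
{
    int         a;
    int         b;
} IdxTuple;

/* A condition that cannot be turned into a scan key. */
typedef bool (*index_filter_fn) (const IdxTuple *itup);

/* Example filter: mod(b, 10) = 1 */
bool
filter_mod_b(const IdxTuple *itup)
{
    return itup->b % 10 == 1;
}

/*
 * Count the heap fetches needed for "a = key". With filter == NULL every
 * matching TID goes to the heap (the filter would run on the heap tuple).
 * With a filter, tuples rejected on the index tuple never touch the heap.
 */
size_t
count_heap_fetches(const IdxTuple *tuples, size_t ntuples,
                   int key, index_filter_fn filter)
{
    size_t      fetches = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        if (tuples[i].a != key)
            continue;           /* index condition (scan key) */

        if (filter != NULL && !filter(&tuples[i]))
            continue;           /* rejected without a heap fetch */

        fetches++;
    }

    return fetches;
}
```

For the tuples (10,1), (10,2), (10,11), (11,1) and key 10, this gives 3 heap fetches without the filter and 2 with filter_mod_b. The same check could presumably also decide which blocks are worth prefetching.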


> 
>> What do you mean by "create normal compound index"? The patch addresses
>> a limitation that not every condition can be translated into a proper
>> scan key. Even if we improve this, there will always be such conditions.
>> The the IOS can evaluate them on index tuple, the regular index scan
>> can't do that (currently).
>>
>> Can you share an example demonstrating the alternative approach?
> 
> May be I missed something.
> 
> This is the example from
> https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me :
> 
> ```
> 
> And here is the plan with index on (a,b).
> 
> Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884
> rows=0 loops=1)    Output: a, b, d    Buffers: shared hit=613    ->
> Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1
> width=12) (actual time=6.880..6.881 rows=0 loops=1)          Output: a,
> b, d          Index Cond: ((t.a > 1000000) AND (t.b = 4))      
>    Buffers: shared hit=613 Planning:    Buffers: shared hit=41 Planning
> Time: 0.314 ms Execution Time: 6.910 ms ```
> 
> 
> Isn't it an optimal plan for this query?
> 
> And cite from self reproducible example https://dbfiddle.uk/iehtq44L :
> ```
> create unique index t_a_include_b on t(a) include (b);
> -- I'd expecd index above to behave the same as index below for this query
> --create unique index on t(a,b);
> ```
> 
> I agree that it is natural to expect the same result for both indexes.
> So this PR definitely makes sense.
> My point is only that compound index (a,b) in this case is more natural
> and preferable.
> 

Yes, perhaps. But you may also see it from the other direction - if you
already have an index with included columns (for whatever reason), it
would be nice to leverage that if possible. And as I mentioned above,
it's not always possible to move a column from "included" to a proper
key, or stuff like that.

Anyway, it seems entirely unrelated to this prefetching thread.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Tomas Vondra

Date:

21 January 2024, 23:47:27


On 1/21/24 20:56, Konstantin Knizhnik wrote:
> 
> On 19/01/2024 2:35 pm, Tomas Vondra wrote:
>>
>> On 1/19/24 09:34, Konstantin Knizhnik wrote:
>>> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>>>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>>>> I have integrated your prefetch patch in Neon and it actually works!
>>>>> Moreover, I combined it with prefetch of leaf pages for IOS and it
>>>>> also
>>>>> seems to work.
>>>>>
>>>> Cool! And do you think this is the right design/way to do this?
>>> I like the idea of prefetching TIDs in executor.
>>>
>>> But looking through your patch I have some questions:
>>>
>>>
>>> 1. Why is it necessary to allocate and store the all_visible flag in the
>>> data buffer? Why can't the caller of IndexPrefetchNext look at the
>>> prefetch field?
>>>
>>> +        /* store the all_visible flag in the private part of the
>>> entry */
>>> +        entry->data = palloc(sizeof(bool));
>>> +        *(bool *) entry->data = all_visible;
>>>
>> What you mean by "prefetch field"?
> 
> 
> I mean "prefetch" field of IndexPrefetchEntry:
> 
> +
> +typedef struct IndexPrefetchEntry
> +{
> +    ItemPointerData tid;
> +
> +    /* should we prefetch heap page for this TID? */
> +    bool        prefetch;
> +
> 
> You store the same flag twice:
> 
> +        /* prefetch only if not all visible */
> +        entry->prefetch = !all_visible;
> +
> +        /* store the all_visible flag in the private part of the entry */
> +        entry->data = palloc(sizeof(bool));
> +        *(bool *) entry->data = all_visible;
> 
> My question was: why do we need to allocate something in entry->data and
> store all_visible in it, while we already stored !all-visible in
> entry->prefetch.
> 

Ah, right. Well, you're right in this case we perhaps could set just one
of those flags, but the "purpose" of the two places is quite different.

The "prefetch" flag is fully controlled by the prefetcher, and it's up
to it to change it (e.g. I can easily imagine some new logic setting it
to "false" for some reason).

The "data" flag is fully controlled by the custom callbacks, so whatever
the callback stores, will be there.

I don't think it's worth simplifying this. In particular, I don't think
the callback can assume it can rely on the "prefetch" flag.
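
To illustrate the distinction between the two flags, here is a minimal
sketch (not code from the patch). ToyTid, ToyPrefetchEntry and the three
toy_* functions are hypothetical names, and the type of the "data" field
is assumed to be a plain pointer. It shows why a callback that derived
all_visible from "prefetch" could get a wrong answer once the prefetcher
overrides the flag for its own reasons.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for ItemPointerData: block number plus offset. */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/* Mirrors the two fields discussed above; the "data" type is assumed. */
typedef struct ToyPrefetchEntry
{
    ToyTid      tid;
    bool        prefetch;   /* owned by the prefetcher */
    void       *data;       /* owned by the access-path callback */
} ToyPrefetchEntry;

/* Hypothetical index-only-scan callback: keeps all_visible privately. */
static void
toy_ios_callback(ToyPrefetchEntry *entry, bool all_visible)
{
    bool       *flag = malloc(sizeof(bool));

    assert(flag != NULL);
    *flag = all_visible;
    entry->data = flag;
    entry->prefetch = !all_visible;     /* initial hint only */
}

/* Hypothetical prefetcher logic that overrides the hint, e.g. because
 * the same block was prefetched a moment ago. */
static void
toy_prefetcher_skip(ToyPrefetchEntry *entry)
{
    entry->prefetch = false;
}

/* What the scan reads back later: the callback's own copy. */
static bool
toy_entry_all_visible(const ToyPrefetchEntry *entry)
{
    return *(const bool *) entry->data;
}
```

After toy_prefetcher_skip() the entry has prefetch == false even though
the page is not all-visible, so only the private copy remains reliable.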


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Peter Smith

Date:

22 January 2024, 04:53:15

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
like there were CFbot test failures last time it was run [2]. Please
have a look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4351/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4351

Kind Regards,
Peter Smith.

Re: index prefetching

From

Konstantin Knizhnik

Date:

22 January 2024, 06:35:59

On 22/01/2024 1:47 am, Tomas Vondra wrote:

> Ah, right. Well, you're right in this case we perhaps could set just one
> of those flags, but the "purpose" of the two places is quite different.
>
> The "prefetch" flag is fully controlled by the prefetcher, and it's up
> to it to change it (e.g. I can easily imagine some new logic touching
> setting it to "false" for some reason).
>
> The "data" flag is fully controlled by the custom callbacks, so whatever
> the callback stores, will be there.
>
> I don't think it's worth simplifying this. In particular, I don't think
> the callback can assume it can rely on the "prefetch" flag.

Why not add an "all_visible" flag to IndexPrefetchEntry? It will not
cause any extra space overhead (because of alignment), but it avoids
dynamic memory allocation (not sure if it is critical, but nice to
avoid if possible).
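
The alignment argument can be checked with a small sketch. The layouts
below are assumptions for illustration, not the struct from the patch:
ToyItemPointer models the 6-byte ItemPointerData (three 16-bit fields),
and the entry is assumed to hold only the TID, the flag(s) and a "data"
pointer. On common ABIs the second bool lands in padding that already
precedes the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for ItemPointerData: three 16-bit fields, 6 bytes. */
typedef struct ToyItemPointer
{
    uint16_t    bi_hi;
    uint16_t    bi_lo;
    uint16_t    offset;
} ToyItemPointer;

/* Assumed current layout: one flag, then the private pointer. */
typedef struct EntryOneFlag
{
    ToyItemPointer tid;
    bool        prefetch;
    void       *data;
} EntryOneFlag;

/* Proposed layout: an extra all_visible flag next to "prefetch". */
typedef struct EntryTwoFlags
{
    ToyItemPointer tid;
    bool        prefetch;
    bool        all_visible;
    void       *data;
} EntryTwoFlags;

/* Number of bytes the extra flag adds to each entry. */
static size_t
toy_extra_bytes(void)
{
    /* The argument relies on bool being a single byte. */
    assert(sizeof(bool) == 1);
    return sizeof(EntryTwoFlags) - sizeof(EntryOneFlag);
}
```

This only shows the space side of the argument. Whether a scan-specific
field belongs in a generic struct is a separate design question.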

Re: index prefetching

From

Konstantin Knizhnik

Date:

22 January 2024, 07:21:14

On 22/01/2024 1:39 am, Tomas Vondra wrote:

>> Why we can prefer covering index to compound index? I see only two good
>> reasons:
>> 1. Extra columns type do not have comparison function need for AM.
>> 2. The extra columns are never used in query predicate.
>
> Or maybe you don't want to include the columns in a UNIQUE constraint?

Do you mean that compound index (a,b) can not be used to enforce uniqueness of "a"?
If so, I agree.

>> If you are going to use this columns in query predicates I do not see
>> much sense in creating inclusive index rather than compound index.
>> Do you?
>
> But this is also about conditions that can't be translated into index
> scan keys. Consider this:
>
> create table t (a int, b int, c int);
> insert into t select 1000 * random(), 1000 * random(), 1000 * random()
> from generate_series(1,1000000) s(i);
> create index on t (a,b);
> vacuum analyze t;
>
> explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
> 1111111;
>                                                    QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
> (actual time=4.562..4.564 rows=0 loops=1)
>    Index Cond: (a = 10)
>    Filter: (mod(b, 10) = 1111111)
>    Rows Removed by Filter: 974
>    Buffers: shared hit=980
>    Prefetches: blocks=901
>  Planning Time: 0.304 ms
>  Execution Time: 5.146 ms
> (8 rows)
>
> Notice that this still fetched ~1000 buffers in order to evaluate the
> filter on "b", because it's complex and can't be transformed into a nice
> scan key.

Oh yes.
It looks like I didn't understand the logic that determines when a
predicate is included in the index condition and when it is not.
It seems natural that only predicates specifying some range can be
included in the index condition. But that is not the case:

postgres=# explain select * from t where a = 10 and b in (10,20,30);
                             QUERY PLAN
---------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
   Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
(2 rows)

So I thought any predicate using index keys is included in the index
condition. But that is not true (as your example shows).

But IMHO mod(b,10)=111111 and (b+1) < 100 are both quite rare
predicates, which is why I called these use cases "exotic".

In any case, if we have some columns in the index tuple, it is desirable
to use them for filtering before fetching the heap tuple.
But I'm afraid it will not be so easy to implement...

Re: index prefetching

From

Tomas Vondra

Date:

23 January 2024, 17:43:25

On 1/19/24 22:43, Melanie Plageman wrote:
> On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/9/24 21:31, Robert Haas wrote:
>>> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Here's a somewhat reworked version of the patch. My initial goal was to
>>>> see if it could adopt the StreamingRead API proposed in [1], but that
>>>> turned out to be less straight-forward than I hoped, for two reasons:
>>>
>>> I guess we need Thomas or Andres or maybe Melanie to comment on this.
>>>
>>
>> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
>> streaming I/O stuff.
> 
> I've been studying your patch with the intent of finding a way to
> change it and or the streaming read API to work together. I've
> attached a very rough sketch of how I think it could work.
> 

Thanks.

> We fill a queue with blocks from TIDs that we fetched from the index.
> The queue is saved in a scan descriptor that is made available to the
> streaming read callback. Once the queue is full, we invoke the table
> AM specific index_fetch_tuple() function which calls
> pg_streaming_read_buffer_get_next(). When the streaming read API
> invokes the callback we registered, it simply dequeues a block number
> for prefetching.

So in a way there are two queues in IndexFetchTableData. One (blk_queue)
is being filled from IndexNext, and then the queue in StreamingRead.

> The only change to the streaming read API is that now, even if the
> callback returns InvalidBlockNumber, we may not be finished, so make
> it resumable.
> 

Hmm, not sure when can the callback return InvalidBlockNumber before
reaching the end. Perhaps for the first index_fetch_heap call? Any
reason not to fill the blk_queue before calling index_fetch_heap?

> Structurally, this changes the timing of when the heap blocks are
> prefetched. Your code would get a tid from the index and then prefetch
> the heap block -- doing this until it filled a queue that had the
> actual tids saved in it. With my approach and the streaming read API,
> you fetch tids from the index until you've filled up a queue of block
> numbers. Then the streaming read API will prefetch those heap blocks.
> 

And is that a good/desirable change? I'm not saying it's not, but maybe
we should not be filling either queue in one go - we don't want to
overload the prefetching.

> I didn't actually implement the block queue -- I just saved a single
> block number and pretended it was a block queue. I was imagining we
> replace this with something like your IndexPrefetch->blockItems --
> which has light deduplication. We'd probably have to flesh it out more
> than that.
> 

I don't understand how this passes the TID to the index_fetch_heap.
Isn't it working only by accident, due to blk_queue only having a single
entry? Shouldn't the first queue (blk_queue) store TIDs instead?

> There are also table AM layering violations in my sketch which would
> have to be worked out (not to mention some resource leakage I didn't
> bother investigating [which causes it to fail tests]).
> 
> 0001 is all of Thomas' streaming read API code that isn't yet in
> master and 0002 is my rough sketch of index prefetching using the
> streaming read API
> 
> There are also numerous optimizations that your index prefetching
> patch set does that would need to be added in some way. I haven't
> thought much about it yet. I wanted to see what you thought of this
> approach first. Basically, is it workable?
> 

It seems workable, yes. I'm not sure it's much simpler than my patch
(considering a lot of the code is in the optimizations, which are
missing from this patch).

I think the question is where should the optimizations happen. I suppose
some of them might/should happen in the StreamingRead API itself - like
the detection of sequential patterns, recently prefetched blocks, ...

But I'm not sure what to do about optimizations that are more specific
to the access path. Consider for example the index-only scans. We don't
want to prefetch all the pages, we need to inspect the VM and prefetch
just the not-all-visible ones. And then pass the info to the index scan,
so that it does not need to check the VM again. It's not clear to me how
to do this with this approach.

The main

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Melanie Plageman

Date:

24 January 2024, 00:51:24

On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 1/19/24 22:43, Melanie Plageman wrote:
>
> > We fill a queue with blocks from TIDs that we fetched from the index.
> > The queue is saved in a scan descriptor that is made available to the
> > streaming read callback. Once the queue is full, we invoke the table
> > AM specific index_fetch_tuple() function which calls
> > pg_streaming_read_buffer_get_next(). When the streaming read API
> > invokes the callback we registered, it simply dequeues a block number
> > for prefetching.
>
> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
> is being filled from IndexNext, and then the queue in StreamingRead.

I've changed the name from blk_queue to tid_queue to fix the issue you
mention in your later remarks.
I suppose there are two queues. The tid_queue is just to pass the
block requests to the streaming read API. The prefetch distance will
be the smaller of the two sizes.

> > The only change to the streaming read API is that now, even if the
> > callback returns InvalidBlockNumber, we may not be finished, so make
> > it resumable.
>
> Hmm, not sure when can the callback return InvalidBlockNumber before
> reaching the end. Perhaps for the first index_fetch_heap call? Any
> reason not to fill the blk_queue before calling index_fetch_heap?

The callback will return InvalidBlockNumber whenever the queue is
empty. Let's say your queue size is 5 and your effective prefetch
distance is 10 (some combination of the PgStreamingReadRange sizes and
PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
the callback returns InvalidBlockNumber. Then the tid_queue is filled
with 5 tids. Then index_fetch_heap() is called.
pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
blocks, emptying the queue. Once all 5 have been dequeued, the
callback will return InvalidBlockNumber.
pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
a buffer and save the associated TID in the per_buffer_data. Before
index_fetch_heap() is called again, we will see that the queue is not
full and fill it up again with 5 TIDs. So, the callback will return
InvalidBlockNumber 3 times in this scenario.
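
The resumable-callback behaviour can be sketched with a toy model. This
is not the streaming read API: ToyTidQueue, toy_next_block_cb and
toy_run_scan are hypothetical names, TOY_INVALID_BLOCK stands in for
InvalidBlockNumber, and the look-ahead is simplified to "drain the queue
until the callback reports it is empty". The point is only that an
empty-queue return means "nothing right now", not "end of scan".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX    /* stand-in for InvalidBlockNumber */
#define TOY_QUEUE_SIZE 5

typedef struct ToyTidQueue
{
    uint32_t    blocks[TOY_QUEUE_SIZE];
    int         head;
    int         count;
} ToyTidQueue;

/* Add a block to the queue; returns false when the queue is full. */
static bool
toy_queue_push(ToyTidQueue *q, uint32_t block)
{
    if (q->count == TOY_QUEUE_SIZE)
        return false;
    q->blocks[(q->head + q->count) % TOY_QUEUE_SIZE] = block;
    q->count++;
    return true;
}

/* The callback: dequeue one block, or report that the queue is empty. */
static uint32_t
toy_next_block_cb(ToyTidQueue *q)
{
    uint32_t    block;

    assert(q->count >= 0);
    if (q->count == 0)
        return TOY_INVALID_BLOCK;
    block = q->blocks[q->head];
    q->head = (q->head + 1) % TOY_QUEUE_SIZE;
    q->count--;
    return block;
}

/* Drive a scan over ntids TIDs (one block each) and return how often
 * the callback reported an empty queue. */
static int
toy_run_scan(int ntids, int *nprefetched)
{
    ToyTidQueue q = {{0}, 0, 0};
    int         next_tid = 0;
    int         invalid_returns = 0;

    *nprefetched = 0;

    /* The first heap fetch happens before anything was queued. */
    if (toy_next_block_cb(&q) == TOY_INVALID_BLOCK)
        invalid_returns++;

    while (next_tid < ntids)
    {
        /* Refill the queue from the index. */
        while (next_tid < ntids && toy_queue_push(&q, (uint32_t) next_tid))
            next_tid++;

        /* Look-ahead drains the queue until the callback says "empty". */
        while (toy_next_block_cb(&q) != TOY_INVALID_BLOCK)
            (*nprefetched)++;
        invalid_returns++;
    }
    return invalid_returns;
}
```

In this toy run with 10 TIDs and a queue of 5, the callback reports an
empty queue once before the first fill and once after each of the two
fills, and the scan still prefetches every block.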

> > Structurally, this changes the timing of when the heap blocks are
> > prefetched. Your code would get a tid from the index and then prefetch
> > the heap block -- doing this until it filled a queue that had the
> > actual tids saved in it. With my approach and the streaming read API,
> > you fetch tids from the index until you've filled up a queue of block
> > numbers. Then the streaming read API will prefetch those heap blocks.
>
> And is that a good/desirable change? I'm not saying it's not, but maybe
> we should not be filling either queue in one go - we don't want to
> overload the prefetching.

We can focus on the prefetch distance algorithm maintained in the
streaming read API and then make sure that the tid_queue is larger
than the desired prefetch distance maintained by the streaming read
API.

> > I didn't actually implement the block queue -- I just saved a single
> > block number and pretended it was a block queue. I was imagining we
> > replace this with something like your IndexPrefetch->blockItems --
> > which has light deduplication. We'd probably have to flesh it out more
> > than that.
>
> I don't understand how this passes the TID to the index_fetch_heap.
> Isn't it working only by accident, due to blk_queue only having a single
> entry? Shouldn't the first queue (blk_queue) store TIDs instead?

Oh dear! Fixed in the attached v2. I've replaced the single
BlockNumber with a single ItemPointerData. I will work on implementing
an actual queue next week.

> > There are also table AM layering violations in my sketch which would
> > have to be worked out (not to mention some resource leakage I didn't
> > bother investigating [which causes it to fail tests]).
> >
> > 0001 is all of Thomas' streaming read API code that isn't yet in
> > master and 0002 is my rough sketch of index prefetching using the
> > streaming read API
> >
> > There are also numerous optimizations that your index prefetching
> > patch set does that would need to be added in some way. I haven't
> > thought much about it yet. I wanted to see what you thought of this
> > approach first. Basically, is it workable?
>
> It seems workable, yes. I'm not sure it's much simpler than my patch
> (considering a lot of the code is in the optimizations, which are
> missing from this patch).
>
> I think the question is where should the optimizations happen. I suppose
> some of them might/should happen in the StreamingRead API itself - like
> the detection of sequential patterns, recently prefetched blocks, ...

So, the streaming read API does detection of sequential patterns and
not prefetching things that are in shared buffers. It doesn't handle
avoiding prefetching recently prefetched blocks yet AFAIK. But I
daresay this would be relevant for other streaming read users and
could certainly be implemented there.

> But I'm not sure what to do about optimizations that are more specific
> to the access path. Consider for example the index-only scans. We don't
> want to prefetch all the pages, we need to inspect the VM and prefetch
> just the not-all-visible ones. And then pass the info to the index scan,
> so that it does not need to check the VM again. It's not clear to me how
> to do this with this approach.

Yea, this is an issue I'll need to think about. To really spell out
the problem: the callback dequeues a TID from the tid_queue and looks
up its block in the VM. It's all visible. So, it shouldn't return that
block to the streaming read API to fetch from the heap because it
doesn't need to be read. But, where does the callback put the TID so
that the caller can get it? I'm going to think more about this.

As for passing around the all visible status so as to not reread the
VM block -- that feels solvable but I haven't looked into it.

- Melanie

On 1/22/24 07:35, Konstantin Knizhnik wrote:
> 
> On 22/01/2024 1:47 am, Tomas Vondra wrote:
>> Ah, right. Well, you're right in this case we perhaps could set just one
>> of those flags, but the "purpose" of the two places is quite different.
>>
>> The "prefetch" flag is fully controlled by the prefetcher, and it's up
>> to it to change it (e.g. I can easily imagine some new logic setting it
>> to "false" for some reason).
>>
>> The "data" flag is fully controlled by the custom callbacks, so whatever
>> the callback stores, will be there.
>>
>> I don't think it's worth simplifying this. In particular, I don't think
>> the callback can assume it can rely on the "prefetch" flag.
>>
> Why not add an "all_visible" flag to IndexPrefetchEntry? It will not
> cause any extra space overhead (because of alignment), but it avoids
> dynamic memory allocation (not sure if it is critical, but nice to
> avoid if possible).
> 

Because it's specific to index-only scans, while IndexPrefetchEntry is a
generic thing, for all places.

However:

(1) Melanie actually presented a very different way to implement this,
relying on the StreamingRead API. So chances are this struct won't
actually be used.

(2) After going through Melanie's patch, I realized this is actually
broken. The IOS case needs to keep more stuff, not just the all-visible
flag, but also the index tuple. Otherwise it'll just operate on the last
tuple read from the index, which happens to be in xs_ituple. Attached is
a patch with a trivial fix.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: index prefetching

From

Melanie Plageman

Date:

24 January 2024, 20:20:28

On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 1/24/24 01:51, Melanie Plageman wrote:
>
> >>> There are also table AM layering violations in my sketch which would
> >>> have to be worked out (not to mention some resource leakage I didn't
> >>> bother investigating [which causes it to fail tests]).
> >>>
> >>> 0001 is all of Thomas' streaming read API code that isn't yet in
> >>> master and 0002 is my rough sketch of index prefetching using the
> >>> streaming read API
> >>>
> >>> There are also numerous optimizations that your index prefetching
> >>> patch set does that would need to be added in some way. I haven't
> >>> thought much about it yet. I wanted to see what you thought of this
> >>> approach first. Basically, is it workable?
> >>
> >> It seems workable, yes. I'm not sure it's much simpler than my patch
> >> (considering a lot of the code is in the optimizations, which are
> >> missing from this patch).
> >>
> >> I think the question is where should the optimizations happen. I suppose
> >> some of them might/should happen in the StreamingRead API itself - like
> >> the detection of sequential patterns, recently prefetched blocks, ...
> >
> > So, the streaming read API does detection of sequential patterns and
> > not prefetching things that are in shared buffers. It doesn't handle
> > avoiding prefetching recently prefetched blocks yet AFAIK. But I
> > daresay this would be relevant for other streaming read users and
> > could certainly be implemented there.
> >
>
> Yes, the "recently prefetched stuff" cache seems like a fairly natural
> complement to the pattern detection and shared-buffers check.
>
> FWIW I wonder if we should make some of this customizable, so that
> systems with customized storage (e.g. neon or with direct I/O) can e.g.
> disable some of these checks. Or replace them with their version.

That's a promising idea.

> >> But I'm not sure what to do about optimizations that are more specific
> >> to the access path. Consider for example the index-only scans. We don't
> >> want to prefetch all the pages, we need to inspect the VM and prefetch
> >> just the not-all-visible ones. And then pass the info to the index scan,
> >> so that it does not need to check the VM again. It's not clear to me how
> >> to do this with this approach.
> >
> > Yea, this is an issue I'll need to think about. To really spell out
> > the problem: the callback dequeues a TID from the tid_queue and looks
> > up its block in the VM. It's all visible. So, it shouldn't return that
> > block to the streaming read API to fetch from the heap because it
> > doesn't need to be read. But, where does the callback put the TID so
> > that the caller can get it? I'm going to think more about this.
> >
>
> Yes, that's the problem for index-only scans. I'd generalize it so that
> it's about the callback being able to (a) decide if it needs to read the
> heap page, and (b) store some custom info for the TID.

Actually, I think this is no big deal. See attached. I just don't
enqueue tids whose blocks are all visible. I had to switch the order
from fetch heap then fill queue to fill queue then fetch heap.
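
A rough sketch of the "don't enqueue all-visible TIDs" idea, under
stated assumptions: the visibility map is modelled as a plain bool
array, and ToyTid, ToyQueue, toy_vm_all_visible and toy_fill_queue are
hypothetical names, not functions from either patch. The sketch only
counts the TIDs that can be answered from the index. It does not show
how those TIDs get returned to the caller in index order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NBLOCKS 8
#define TOY_QUEUE_CAP 4

typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

typedef struct ToyQueue
{
    ToyTid      items[TOY_QUEUE_CAP];
    int         count;
} ToyQueue;

/* Toy visibility map lookup: one bool per heap block. */
static bool
toy_vm_all_visible(const bool *vm, uint32_t block)
{
    assert(block < TOY_NBLOCKS);
    return vm[block];
}

/*
 * Walk TIDs coming from the index and enqueue only those whose heap
 * block must be read. Returns how many TIDs were consumed;
 * *from_index counts the ones answerable from the index alone.
 */
static int
toy_fill_queue(ToyQueue *q, const ToyTid *tids, int ntids,
               const bool *vm, int *from_index)
{
    int         consumed = 0;

    *from_index = 0;
    while (consumed < ntids && q->count < TOY_QUEUE_CAP)
    {
        ToyTid      tid = tids[consumed++];

        if (toy_vm_all_visible(vm, tid.block))
            (*from_index)++;            /* no heap read needed */
        else
            q->items[q->count++] = tid; /* heap block will be fetched */
    }
    return consumed;
}
```

Filling the queue before fetching from the heap matches the reordering
described above: the visibility check happens at enqueue time, so the
read side never sees all-visible blocks.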

While doing this I noticed some wrong results in the regression tests
(like in the alter table test), so I suspect I have some kind of
control flow issue. Perhaps I should fix the resource leak so I can
actually see the failing tests :)

As for your a) and b) above.

Regarding a): We discussed allowing speculative prefetching and
separating the logic for prefetching from actually reading blocks (so
you can prefetch blocks you ultimately don't read). We decided this
may not belong in a streaming read API. What do you think?

Regarding b): We can store per buffer data for anything that actually
goes down through the streaming read API, but, in the index only case,
we don't want the streaming read API to know about blocks that it
doesn't actually need to read.

- Melanie

On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 1/24/24 01:51, Melanie Plageman wrote:
> > >> But I'm not sure what to do about optimizations that are more specific
> > >> to the access path. Consider for example the index-only scans. We don't
> > >> want to prefetch all the pages, we need to inspect the VM and prefetch
> > >> just the not-all-visible ones. And then pass the info to the index scan,
> > >> so that it does not need to check the VM again. It's not clear to me how
> > >> to do this with this approach.
> > >
> > > Yea, this is an issue I'll need to think about. To really spell out
> > > the problem: the callback dequeues a TID from the tid_queue and looks
> > > up its block in the VM. It's all visible. So, it shouldn't return that
> > > block to the streaming read API to fetch from the heap because it
> > > doesn't need to be read. But, where does the callback put the TID so
> > > that the caller can get it? I'm going to think more about this.
> > >
> >
> > Yes, that's the problem for index-only scans. I'd generalize it so that
> > it's about the callback being able to (a) decide if it needs to read the
> > heap page, and (b) store some custom info for the TID.
>
> Actually, I think this is no big deal. See attached. I just don't
> enqueue tids whose blocks are all visible. I had to switch the order
> from fetch heap then fill queue to fill queue then fetch heap.
>
> While doing this I noticed some wrong results in the regression tests
> (like in the alter table test), so I suspect I have some kind of
> control flow issue. Perhaps I should fix the resource leak so I can
> actually see the failing tests :)

Attached is a patch which implements a real queue and fixes some of
the issues with the previous version. It doesn't pass tests yet and
has issues. Some are bugs in my implementation I need to fix. Some are
issues we would need to solve in the streaming read API. Some are
issues with index prefetching generally.

Note that these two patches have to be applied before 21d9c3ee4e
because Thomas hasn't released a rebased version of the streaming read
API patches yet.

Issues
---
- kill prior tuple

This optimization doesn't work with index prefetching with the current
design. Kill prior tuple relies on alternating between fetching a
single index tuple and visiting the heap. After visiting the heap we
can potentially kill the immediately preceding index tuple. Once we
fetch multiple index tuples, enqueue their TIDs, and later visit the
heap, the next index page we visit may not contain all of the index
tuples deemed killable by our visit to the heap.

In our case, we could try and fix this by prefetching only heap blocks
referred to by index tuples on the same index page. Or we could try
and keep a pool of index pages pinned and go back and kill index
tuples on those pages.

Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
there is an easier way to fix this, as I don't think the mvcc test
failed on Tomas' version.

- switching scan directions

If the index scan switches directions on a given invocation of
IndexNext(), heap blocks may have already been prefetched and read for
blocks containing tuples beyond the point at which we want to switch
directions.

We could fix this by having some kind of streaming read "reset"
callback to drop all of the buffers which have been prefetched which
are now no longer needed. We'd have to go backwards from the last TID
which was yielded to the caller and figure out which buffers in the
pgsr buffer ranges are associated with all of the TIDs which were
prefetched after that TID. The TIDs are in the per_buffer_data
associated with each buffer in pgsr. The issue would be searching
through those efficiently.

The other issue is that the streaming read API does not currently
support backwards scans. So, if we switch to a backwards scan from a
forwards scan, we would need to fall back to the non streaming read
method. We could do this by just setting the TID queue size to 1
(which is what I have currently implemented). Or we could add
backwards scan support to the streaming read API.

- mark and restore

Similar to the issue with switching the scan direction, mark and
restore requires us to reset the TID queue and streaming read queue.
For now, I've hacked in something to the PlannerInfo and Plan to set
the TID queue size to 1 for plans containing a merge join (yikes).

- multiple executions

For reasons I don't entirely understand yet, multiple executions (not
rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
patch, I have disabled prefetching (and made the TID queue size 1)
when execute_once is false.

- Index Only Scans need to return IndexTuples

Because index only scans return either the IndexTuple pointed to by
IndexScanDesc->xs_itup or the HeapTuple pointed to by
IndexScanDesc->xs_hitup -- both of which are populated by the index
AM, we have to save copies of those IndexTupleData and HeapTupleDatas
for every TID whose block we prefetch.

This might be okay, but it is a bit sad to have to make copies of those tuples.

In this patch, I still haven't figured out the memory management part.
I copy over the tuples when enqueuing a TID queue item and then copy
them back again when the streaming read API returns the
per_buffer_data to us. Something is still not quite right here. I
suspect this is part of the reason why some of the other tests are
failing.

Other issues/gaps in my implementation:

Determining where to allocate the memory for the streaming read object
and the TID queue is an outstanding TODO. To implement a fallback
method for cases in which streaming read doesn't work, I set the queue
size to 1. This is obviously not good.

Right now, I allocate the TID queue and streaming read objects in
IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
index_beginscan() (and index_beginscan_parallel()) is tricky though
because we don't know the scan direction at that point (and the scan
direction can change). There are also callers of index_beginscan() who
do not call Index[Only]Next() (like systable_getnext() which calls
index_getnext_slot() directly).

Also, my implementation does not yet have the optimization Tomas does
to skip prefetching recently prefetched blocks. As he has said, it
probably makes sense to add something to do this in a lower layer --
such as in the streaming read API or even in bufmgr.c (maybe in
PrefetchSharedBuffer()).
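
For illustration, skipping recently prefetched blocks could look roughly
like the small ring buffer below. This is a sketch under assumptions,
not the logic in Tomas' patch or anything in bufmgr.c: ToyRecentBlocks,
toy_recent_init and toy_recent_should_prefetch are hypothetical names,
and a real version would need a larger, cheaper lookup structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RECENT_SIZE 4
#define TOY_INVALID_BLOCK UINT32_MAX    /* stand-in for InvalidBlockNumber */

/* Ring buffer remembering the last few prefetched block numbers. */
typedef struct ToyRecentBlocks
{
    uint32_t    blocks[TOY_RECENT_SIZE];
    int         next;       /* slot to overwrite next */
} ToyRecentBlocks;

static void
toy_recent_init(ToyRecentBlocks *r)
{
    for (int i = 0; i < TOY_RECENT_SIZE; i++)
        r->blocks[i] = TOY_INVALID_BLOCK;
    r->next = 0;
}

/*
 * Returns true if the block was not prefetched recently (and remembers
 * it), false if a prefetch for it was issued within the last
 * TOY_RECENT_SIZE distinct blocks.
 */
static bool
toy_recent_should_prefetch(ToyRecentBlocks *r, uint32_t block)
{
    assert(block != TOY_INVALID_BLOCK);

    for (int i = 0; i < TOY_RECENT_SIZE; i++)
    {
        if (r->blocks[i] == block)
            return false;
    }

    /* Not seen recently: remember it, evicting the oldest entry. */
    r->blocks[r->next] = block;
    r->next = (r->next + 1) % TOY_RECENT_SIZE;
    return true;
}
```

Index scans on correlated data tend to produce runs of TIDs on the same
heap block, which is the case a small cache like this is meant to catch.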

- Melanie

Hi,

here's an improved (rebased + updated) version of the patch series, with
some significant fixes and changes. The patch adds infrastructure and
modifies btree indexes to do prefetching - and AFAIK it passes all tests
(no results, correct results). There's still a fair amount of work to be
done, of course - the btree changes are not very polished, more time
needs to be spent on profiling and optimization, etc. And I'm sure that
while the patch passes tests, there certainly are bugs.

Compared to the last patch version [1] shared on list (in November),
there's a number of significant design changes - a lot of this is based
on a number of off-list discussions I had with Peter Geoghegan, which
was very helpful. Let me try to sum the main conclusions and changes:

1) patch now relies on read_stream

The November patch still relied on sync I/O and PrefetchBuffer(). At
some point I added a commit switching it to read_stream - which turned
out non-trivial, especially for index-only scans. But it works, and for
a while I kept it separate - with PrefetchBuffer first, and a switch to
read_stream later. But then I realized it does not make much sense to
keep the first part - why would we introduce a custom fadvise-based
prefetch, only to immediately rip it out and replace it with
read_stream code with a comparable amount of complexity, right?

So I squashed these two parts, and the patch now does read_stream (for
the table reads) from the beginning.

2) two new index AM callbacks - amgetbatch + amfreebatch

The [1] patch introduced a new callback for reading a "batch"
(essentially a leaf page) from the index. But there was a limitation of
only allowing a single batch at a time, which was causing trouble with
prefetch distance and read_stream stalls at the end of the batch, etc.

Based on the discussions with Peter I decided to make this a bit more
ambitious, moving the whole batch management from the index AM to the
indexam.c level. So now there are two callbacks - amgetbatch and
amfreebatch, and it's up to indexam.c to manage the batches - decide how
many batches to allow, etc. The index AM is responsible merely for
loading the next batch, but does not decide when to load or free a
batch, how many to keep in memory, etc.
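
To illustrate the division of responsibilities, here's a minimal sketch
in Python. It is a toy model, not code from the patch: ToyIndexAM,
getbatch, freebatch, scan and max_batches are made-up names standing in
for the AM callbacks and the indexam.c logic. The AM only loads and
frees batches, while the generic layer decides when to do either and how
many batches to hold.

```python
from collections import deque


class ToyIndexAM:
    """Stand-in for an index AM: it only loads and frees batches."""

    def __init__(self, leaf_pages):
        self.leaf_pages = leaf_pages   # one list of TIDs per leaf page
        self.next_page = 0
        self.freed = []                # leaf page numbers released so far

    def getbatch(self):
        """Return (page_no, tids) for the next leaf page, or None at the end."""
        if self.next_page >= len(self.leaf_pages):
            return None
        batch = (self.next_page, self.leaf_pages[self.next_page])
        self.next_page += 1
        return batch

    def freebatch(self, batch):
        """Release a batch (a real AM could do its killtuples work here)."""
        self.freed.append(batch[0])


def scan(am, max_batches=2):
    """Generic layer: decides when to load and when to free batches."""
    queue = deque()
    tids = []
    max_held = 0
    exhausted = False
    while True:
        # Load ahead, up to the limit chosen by the generic layer.
        while not exhausted and len(queue) < max_batches:
            batch = am.getbatch()
            if batch is None:
                exhausted = True
            else:
                queue.append(batch)
        if not queue:
            return tids, max_held
        max_held = max(max_held, len(queue))
        batch = queue.popleft()
        tids.extend(batch[1])
        am.freebatch(batch)


am = ToyIndexAM([[1, 2], [3], [4, 5, 6]])
tids, max_held = scan(am, max_batches=2)
```

In this model the AM never sees max_batches, which is the point of
moving the batch management out of the AM.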

There's a section in indexam.c with a more detailed description of the
design, I'm not going to explain all the design details here.

In a way, this design is a compromise between the initial AM-level
approach I presented as a PoC at pgconf.dev 2023, and the executor level
approach I shared a couple months back. Each of those "extreme" cases
had its issues with either happening "too deep" or "too high" - being
too integrated in the AM, or not having enough info about the AM.

I think the indexam.c is a sensible layer for this. I was hoping doing
this at the "executor level" would mean no need for AM code changes, but
that turned out not to be possible - the AM clearly needs to know about the
batch boundaries, so that it can e.g. do killtuples, etc. That's why we
need the two callbacks (not just the "amgetbatch" one). At least this
way it's "hidden" by the indexam.c API, like index_getnext_slot().

(You could argue indexam.c is "executor" and maybe it is - I don't know
where exactly to draw the line. I don't think it matters, really. The
"hidden in indexam API" is the important bit.)

3) btree prefetch

The patch implements the new callbacks only for btree indexes, and it's
not very pretty / clean - it's mostly a massaged version of the old code
backing amgettuple(). This needs cleanup/improvements, and maybe
refactoring to allow reusing more of the code, etc. Or maybe we should
even rip out the amgettuple() entirely, and only support one of those
for each AM? That's what Peter suggested, but I'm not convinced we
should do that.

For now it was very useful to be able to flip between the APIs by
setting a GUC, and I left prefetching disabled in some places (e.g. when
accessing catalogs, ...) that are unlikely to benefit. But more
importantly, I'm not 100% sure we want to require the index AMs to support
prefetching for all cases - if we do, a single "can't prefetch" case
would mean we can't prefetch anything for that AM.

In particular, I'm thinking about GiST / SP-GiST and indexes ordered by
distance, which don't return items in leaf pages but sort them through a
binary heap. Maybe we can do prefetch for that, but if we can't it would
be silly if it meant we can't do prefetch for any other SP-GiST queries.

Anyway, the current patch only implements prefetch for btree. I expect
it won't be difficult to do this for other index AMs, considering how
similar the design usually is to btree.

This is one of the next things on my TODO. I want to be able to validate
the design works for multiple AMs, not just btree.

4) duplicate blocks

While working on the patch, I realized the old index_fetch_heap code
skips reads for duplicate blocks - if the TID matches the immediately
preceding block, ReleaseAndReadBuffer() skips most of the work. But
read_stream() doesn't do that - if the callback returns the same block,
it starts a new read for it, pins it, etc. That can be quite expensive,
and I've seen a couple cases where the impact was not negligible
(correlated index, fits in memory, ...).

I've speculated that maybe read_stream_next_buffer() should detect and
handle these cases better - not unlike it detects sequential reads. It
might even keep a small cache of already requested reads, etc. so that
it can handle a wider range of workloads, not just perfect duplicates.

But it does not do that, and I'm not sure if/when that will happen. So
for now I simply reproduced the "skip duplicate blocks" behavior. It's
not as simple with read_stream, because this logic needs to happen in
two places - in the callback (when generating reads), and then also when
reading the blocks from the stream - if these places get "out of sync"
the stream won't return the blocks expected by the reader.
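
A small Python sketch of why the two places have to agree. This is a toy
model, not the patch code: make_read_next and read_tuples are made-up
names, and a plain generator stands in for the stream. Both sides apply
the same "same block as the previous TID" rule, and the reader asserts
that the block it gets is the one it expects.

```python
def make_read_next(tids):
    """Callback side: produce block numbers, skipping immediate repeats."""
    last_block = None
    for block, _offset in tids:
        if block == last_block:
            continue              # same block as the previous TID, no new read
        last_block = block
        yield block


def read_tuples(tids, stream):
    """Reader side: has to apply the same rule as the callback."""
    current = None                # block number of the buffer we hold
    reads = 0
    out = []
    for block, offset in tids:
        if block != current:
            current = next(stream)    # take the next buffer from the stream
            reads += 1
            assert current == block, "reader and callback out of sync"
        out.append((current, offset))
    return out, reads


tids = [(1, 1), (1, 2), (2, 1), (1, 3), (1, 4)]
out, reads = read_tuples(tids, make_read_next(tids))
```

Here five TIDs need only three reads. If only one of the two functions
skipped duplicates, the assert in the reader would fail on the second
TID.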

But it does work, and it's not that complex. But there's an issue with
prefetch distance ...

5) prefetch distance

Traditionally, we measure distance in "tuples" - e.g. in bitmap heap
scan, we make sure we prefetched pages for X tuples ahead. But that's
not what read_stream does for prefetching - it works with pages. That
can cause various issues.

Consider for example the "skip duplicate blocks" optimization described
in (4). And imagine a perfectly correlated index, with ~200 items per
leaf page. The heap tuples are likely wider, let's say we have 50 of
them per page. That means we have only ~4 heap blocks per leaf page.
With effective_io_concurrency=16 the read_stream will try to prefetch
16 heap pages, and that's ~800 index entries (about 4 leaf pages).

Is that what we want? I'm not quite sure, maybe it's OK? It sure is not
quite what I expected.

But now imagine an index-only scan on a nearly all-visible table. If the
fraction of index entries that don't pass the visibility check is very
low, we can quickly get into a situation when the read_stream has to
read a lot of leaf pages to get the next block number.

Sure, we'd need to read that block number eventually, but doing it this
early means we may need to keep the batch (leaf page) - a lot of them,
actually. Essentially, pick a number and I can construct an IOS that
needs to keep more batches.
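
A back-of-the-envelope calculation in Python shows how quickly this
grows. It is only a sketch under simplifying assumptions that are mine,
not measurements: one index entry in every one_in_n needs a heap fetch,
each such entry points to a different heap block, and the stream wants
16 blocks ahead with ~200 items per leaf page.

```python
def leaf_pages_needed(distance, items_per_leaf, one_in_n):
    """Leaf pages the callback has to walk to find `distance` heap blocks.

    Assumes one index entry in every `one_in_n` needs a heap fetch, and
    that each such entry points to a different heap block.
    """
    entries = distance * one_in_n
    return -(-entries // items_per_leaf)   # ceiling division


for one_in_n in (100, 1000, 10000):
    print(one_in_n, leaf_pages_needed(16, 200, one_in_n))
```

With 1 in 100 entries needing the heap that's 8 leaf pages, with 1 in
1000 it's 80, and with 1 in 10000 it's 800 - all of which would have to
be kept as batches.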

I think this is a consequence of read_stream having an internal idea how
far ahead to prefetch, based on the number of requests it got so far,
measured in heap blocks. It has no idea about the context (how that
maps to index entries, batches we need to keep in memory, ...).

Ideally, we'd be able to give this feedback to read_stream in some way,
say by "pausing" it when we get too far ahead in the index. But we don't
have that - the only thing we can do is to return InvalidBlockNumber to
the stream, so that it stops. And then we need to "reset" the stream,
and let it continue - but only after we consumed all scheduled reads.

In principle it's very similar to the "pause/resume" I mentioned, except
that it requires completely draining the queue - a pipeline stall.
That's not great, but hopefully it's not very common, and more
importantly - it only happens when only a tiny fraction of the index
items requires a heap block.
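
The stop / drain / reset cycle can be modeled roughly like this in
Python. Everything here is a simplification for illustration: ToyStream
is not the read_stream implementation, and scan, max_ahead and the state
fields are hypothetical names. The callback ends the stream when it gets
max_ahead leaf pages past the reader, and the stream can be reset only
after its queue is empty.

```python
INVALID_BLOCK = None  # stand-in for InvalidBlockNumber


class ToyStream:
    """Very rough model of a read stream with a look-ahead queue."""

    def __init__(self, read_next, distance):
        self.read_next = read_next
        self.distance = distance
        self.queue = []
        self.ended = False

    def next_buffer(self):
        # Look ahead until the queue is full or the callback ends the stream.
        while not self.ended and len(self.queue) < self.distance:
            item = self.read_next()
            if item is INVALID_BLOCK:
                self.ended = True
            else:
                self.queue.append(item)
        return self.queue.pop(0) if self.queue else INVALID_BLOCK

    def reset(self):
        # Only legal once every scheduled read was consumed.
        assert not self.queue
        self.ended = False


def scan(batches, distance, max_ahead):
    """batches[i] lists the heap blocks needed by leaf page i (may be empty)."""
    state = {"cb_batch": 0, "cb_item": 0, "reader_batch": 0}

    def read_next():
        while state["cb_batch"] < len(batches):
            if state["cb_batch"] - state["reader_batch"] >= max_ahead:
                return INVALID_BLOCK          # too far ahead: "pause"
            batch = batches[state["cb_batch"]]
            if state["cb_item"] < len(batch):
                item = (state["cb_batch"], batch[state["cb_item"]])
                state["cb_item"] += 1
                return item
            state["cb_batch"] += 1
            state["cb_item"] = 0
        return INVALID_BLOCK                  # really at the end

    stream = ToyStream(read_next, distance)
    blocks, stalls = [], 0
    while True:
        item = stream.next_buffer()
        if item is INVALID_BLOCK:
            if state["cb_batch"] >= len(batches):
                return blocks, stalls
            # Queue is drained: catch up with the callback and "resume".
            state["reader_batch"] = state["cb_batch"]
            stalls += 1
            stream.reset()
            continue
        state["reader_batch"], block = item
        blocks.append(block)


# Leaf pages 1-3 are all-visible, so they need no heap blocks at all.
blocks, stalls = scan([[10], [], [], [], [11], [12]], distance=4, max_ahead=2)
```

In this example the scan still returns all three blocks, but it goes
through two stalls while crossing the run of all-visible leaf pages.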

So that's what the patch does. I think it's acceptable, but some
optimizations may be necessary (see next section).

6) performance and optimization

It's not difficult to construct cases where the prefetching is a huge
improvement - 5-10x speedup for a query is common, depending on the
hardware, dataset, etc.

But there are also cases where it doesn't (and can't) help very much.
For example fully-cached data, or index-only scans of all-visible
tables. I've done basic benchmarking based on that (I'll share some
results in the coming days), and in various cases I see a consistent
regression in the 10-20% range. The queries are very short (~1ms) and
there's a fair amount of noise, but it seems fairly consistent.

I haven't figured out the root cause(s) yet, but I believe there's a
couple contributing factors:

(a) read_stream adds a bit of complexity/overhead, but these cases
worked great with just the sync API, and can't benefit from that.

(b) There's inefficiencies in how I integrated read_stream into the
btree AM. For example every batch allocates the same buffer as btbeginscan,
which turned out to be an issue before [2] - and now we do that for
every batch, not just once per scan - that's not great.

regards

From

Tomas Vondra

Date:

02 May, 02:02:06

Hi,

Here's a rebased version of the patch, addressing a couple bugs with
scrollable cursors that Peter reported to me off-list. The patch did not
handle that quite right, resulting either in incorrect results (when the
position happened to be off by one), or crashes (when it got out of sync
with the read stream).

But then there are some issues with array keys and mark/restore,
triggered by Peter's "dynamic SAOP advancement" tests in extra tests
(some of the tests use data files too large to post on hackers, it's
available in the github branch). The patch used to handle mark/restore
entirely in indexam.c, and for simple scans that works. But with array
keys the btree code needs to update the moreLeft/moreRight/needPrimScan
flags, so that after restoring it knows where to continue.

There's two "fix" patches trying to make this work - it does not crash,
and almost all the "incorrect" query results are actually stats about
buffer hits etc. Those are expected to change with prefetching, so
that's not a bug. But then there are a bunch of explains where the
number of index searches changed, e.g. like

-         Index Searches: 5
+         Index Searches: 4

And that is almost certainly a bug.

I haven't figured this out yet, and I feel a bit lost again :-(

It made me think again whether it makes sense to make this fundamental
redesign of the index AM interface a prerequisite for prefetching. I
don't dispute the advantages of this new design, with indexam.c
responsible for more stuff (e.g. when a batch gets freed). It seems more
flexible and might make some stuff easier, and if we were designing it
now, we'd do it that way ...

Even if I eventually manage to fix this issue, will I ever be sufficiently
confident about correctness of the new code, enough to commit that?
Perhaps I'm too skeptical, but I'm not really sure about that anymore.

After thinking about this for a while, I decided to revisit the approach
used in the experimental patch I spoke about at pgconf.dev unconference
in 2023, and see if maybe it could be made to work.

That patch was pretty dumb - it simply initiated prefetches from the AM,
by calling PrefetchBuffer(). And the arguments against that were that
doing this from the AM seems like a layering violation, and that every
AM would need its own copy of this, because each AM has a different
representation of the internal scan state.

But after looking at it with fresh eyes, this seems fixable. It might
have been "more true" with the fadvise-based prefetching, but with the
ReadStream the amount of new AM code is *much* smaller. It doesn't need
to track the distance, or anything like that - that's handled by the
ReadStream. It just needs to respond to read_next callback. It also
doesn't feel like a layering violation, for the same reason.

I gave this a try last week, and I was surprised how easy it was to make
this work, and how small and simple the patches are - see the attached
simple-prefetch.tgz archive:

  infrastructure - 22kB
  btree          - 10kB
  hash           - 7kB
  gist           - 10kB
  spgist         - 16kB

That's a grand total of ~64kB (there might be some more improvements
necessary, esp. in the gist/spgist part).

Now compare that with the more complex patch, where we have

  infrastructure - 100kB
  nbtree         - 100kB

And that's just one index type. The other index types would probably
need a comparable amount of new code eventually ...

Sure, it can probably be made somewhat smaller (e.g. the nbtree code
copies a lot of stuff to support both the old and new approach, and that
might be reduced if we ditch the old one), and some of the diff is
comments. But even considering all that the size/complexity difference
will remain significant.

The one real limitation of the simpler approach is that prefetching is
limited to a single leaf page - we can't prefetch from the next one,
until the scan advances to it. But based on experiments comparing this
simpler and the "complex" approach, I don't think that really matters
that much. I haven't seen any difference for regular queries.
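
The single-leaf-page limitation can be sketched in Python like this.
It's a toy model with made-up names (ToyScan, read_next, advance_leaf),
not the code from the simple patches: the callback only hands out heap
blocks from the leaf page the scan is positioned on, and starts over
once the scan itself moves to the next leaf page.

```python
from dataclasses import dataclass


@dataclass
class ToyScan:
    leaf_pages: list          # heap block numbers, grouped by leaf page
    current: int = 0          # leaf page the scan is positioned on
    prefetch_pos: int = 0     # next item on that page to hand to the stream


def read_next(scan):
    """Next heap block from the current leaf page, or None when it runs out."""
    page = scan.leaf_pages[scan.current]
    if scan.prefetch_pos >= len(page):
        return None           # can't look into the next leaf page
    block = page[scan.prefetch_pos]
    scan.prefetch_pos += 1
    return block


def advance_leaf(scan):
    """The scan moves to the next leaf page; prefetching restarts there."""
    if scan.current + 1 >= len(scan.leaf_pages):
        return False
    scan.current += 1
    scan.prefetch_pos = 0
    return True


scan = ToyScan(leaf_pages=[[1, 2], [3]])
first_page = [read_next(scan), read_next(scan), read_next(scan)]
moved = advance_leaf(scan)
second_page = [read_next(scan), read_next(scan)]
```

The callback needs nothing beyond a position within the current page,
which is why the per-AM code stays small in this model.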

The one case where I think it might matter is queries with array keys,
where each array key matches a single tuple on a different leaf page.
The complex patch might prefetch tuples for later array values, while
the simpler patch won't be able to do that. If an array key matches
multiple tuples, the simple patch can prefetch those just fine, of
course. I don't know which case is more likely.


One argument for moving more stuff (including prefetching) to indexam.c
was it seems desirable to have one "component" aware of all the relevant
information, so that it can adjust prefetching in some way. I believe
that's still possible even with the simpler patch - nothing prevents
adding a "struct" to the scan descriptor, and using it from the
read_next callback or something like that.


regards


[1] https://github.com/tvondra/postgres/tree/index-prefetch-2025

-- 
Tomas Vondra