Thread: PoC: prefetching index leaf pages (for inserts)
Hi,

Some time ago I started a thread about prefetching heap pages during index scans [1]. That however only helps when reading rows, not when inserting them.

Imagine inserting a row into a table with many indexes - we insert the row into the table, and then into the indexes one by one, synchronously. We walk the index, determine the appropriate leaf page, read it from disk into memory, insert the index tuple, and then do the same thing for the next index. If there are many indexes, and there's not much correlation with the table, this may easily result in I/O happening synchronously with queue depth 1. Hard to make it even slower ...

This can be a problem even with a modest number of indexes - imagine bulk-loading data into a table using COPY (say, in 100-row batches). Inserting the rows into the heap happens in bulk, but the indexes are still modified in a loop, as if for single-row inserts. Not great. With multiple connections, concurrent I/O may be generated that way, but for low-concurrency workloads (e.g. batch jobs) that may not really work.

I had an idea what we might do about this - we can walk the index almost as if we're inserting the index tuple, but visit only the "inner" non-leaf pages. And instead of descending to the leaf page, we just prefetch it. The non-leaf pages are typically <1% of the index, and hot, so likely already cached and not worth prefetching.

The attached patch does a PoC of this. It adds a new AM function "amprefetch", with an implementation for btree indexes, mimicking the index lookup, except that it only prefetches the leaf page as explained a bit earlier. In the executor, this is wrapped in ExecInsertPrefetchIndexes(), which gets called in various places right before ExecInsertIndexTuples(). I thought about doing the prefetching in ExecInsertIndexTuples() directly, but that would not work for COPY, where we want to issue the prefetches for the whole batch, not for individual tuples.
This may need various improvements - the prefetch duplicates a couple of steps that could be expensive (e.g. evaluation of index predicates, forming index tuples, and so on). It would be nice to improve this, but it's good enough for a PoC I think.

Another gap is the lack of incremental prefetch (ramp-up). We just prefetch all the indexes, for all tuples. But I think that's OK. We know we'll need those pages, and the number is fairly limited.

There's a GUC enable_insert_prefetch that can be used to enable this insert prefetching.

I did a simple test on two machines - one with a SATA SSD RAID, one with an NVMe SSD. In both cases the data (table + indexes) is an order of magnitude larger than RAM. The indexes are on UUID columns, so access is pretty random and there's no correlation. Then batches of 100, 1000 and 10000 rows are inserted, with/without the prefetching. With 5 indexes, the results look like this:

SATA SSD RAID
-------------
  rows    no prefetch       prefetch
   100     176.872 ms      70.910 ms
  1000    1035.056 ms     590.495 ms
 10000    8494.836 ms    3216.206 ms

NVMe
----
  rows    no prefetch       prefetch
   100     133.365 ms      72.899 ms
  1000    1572.379 ms     829.298 ms
 10000   11889.143 ms    3621.981 ms

Not bad, I guess. Cutting the time to ~30% is nice. The fewer the indexes, the smaller the difference (with 1 index there is almost no difference), of course.

regards

[1] https://commitfest.postgresql.org/45/4351/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I had a chance to discuss this patch with Andres off-list a couple days ago. He said he had tried sorting the (index) tuples before inserting them, and that yielded significant benefits, possibly comparable to the prefetching. I was somewhat skeptical that the sorting might be very beneficial, but I decided to do some more tests comparing the benefits.

The sorting only really works for bulk inserts (e.g. from COPY), when we have multiple index tuples for each index. But I didn't have time to rework the code like that, so I took a simpler approach - do the sort in the INSERT, so that the inserted tuples are sorted too. So, something like:

  WITH data AS (SELECT md5(random()::text)::uuid
                  FROM generate_series(1,100) ORDER BY 1)
  INSERT INTO t SELECT * FROM data;

Obviously, this can only sort the rows in one way - if there are multiple indexes, then only one of them will see sorted input, limiting the possible benefit of the optimization. In the COPY code we could do a separate sort for each index, so the tuples would be sorted for all indexes (which also has a cost, of course). But as I said, I decided to do the simple SQL-level sort. The test uses multiple indexes on the same column, so it's a bit as if we sorted the tuples for each index independently.

Anyway, the test inserts batches of 100, ..., 100k rows into tables of different sizes (10M - 1B rows), with/without prefetching, and possibly sorts the batches before the insert. See the attached index.sql script.

The PDF shows the results, and also compares the different combinations. For example, the first 5 lines are results without and with prefetching, followed by the speedup, where green = faster and red = slower. For example, 50% means prefetching makes it 2x as fast. Similarly, the first column has timings without/with sorting, with the speedup right under it. Diagonally, we have the speedup for enabling both sorting and prefetching. I did that with different table sizes, where 10M easily fits into RAM while 1B certainly exceeds it.
And I did that on the usual two machines with different amounts of RAM and storage (SATA SSD vs. NVMe). The results are mostly similar on both, I think:

* On 10M tables (fits into RAM, including indexes), prefetching doesn't really help (assuming the data is not evicted from memory for other reasons), and actually hurts a bit (~20%). Sorting does help, depending on the number of indexes - it can be 10-40% faster.

* On 1B tables (exceeds RAM), prefetching is a clear winner. Sorting does not make any difference except for a couple of rare "blips".

* On 100M tables it's a mix/transition of those two cases.

So maybe we should try doing both, perhaps with some heuristics to only do the prefetching on sufficiently large/random indexes, and sorting only on smaller ones? Another option would be to make the prefetching smarter, so that we don't prefetch data that is already in memory (either in shared buffers or in the page cache). That would reduce the overhead/slowdown visible in the results on the 10M table.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Seems cfbot was not entirely happy about the patch, for two reasons:

1) the enable_insert_prefetching definition was inconsistent (different boot/default values, missing in .conf, and so on)

2) a stupid bug in execReplication, inserting index entries twice

The attached v3 should fix all of that, I believe.

As for the path forward, I think the prefetching is demonstrably beneficial, although there are cases where it can't help or even harms performance. I think the success depends on three areas:

(a) reducing the costs of the prefetching - For example, right now we build the index tuples twice (once for the prefetch, once for the insert), but maybe there's a way to do that only once? There are also predicate indexes, and so on.

(b) being smarter about when to prefetch - For example, if we only have one "prefetchable" index, it's somewhat pointless to prefetch (for single-row cases). And so on.

(c) not prefetching when already cached - This is somewhat related to the previous case, but perhaps it'd be cheaper to first check whether the data is already cached. For shared buffers it should not be difficult; for the page cache we could use preadv2 with the RWF_NOWAIT flag. The question is whether this is cheap enough to be cheaper than just doing posix_fadvise (which however only deals with the page cache).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 06/11/2023 19:05, Tomas Vondra wrote:
> As for the path forward, I think the prefetching is demonstrably
> beneficial. There are cases where it can't help or even harms
> performance. I think the success depends on three areas:
>
> (a) reducing the costs of the prefetching - For example right now we
> build the index tuples twice (once for prefetch, once for the insert),
> but maybe there's a way to do that only once? There are also predicate
> indexes, and so on.
>
> (b) being smarter about when to prefetch - For example if we only have
> one "prefetchable" index, it's somewhat pointless to prefetch (for
> single-row cases). And so on.
>
> (c) not prefetching when already cached - This is somewhat related to
> the previous case, but perhaps it'd be cheaper to first check if the
> data is already cached. For shared buffers it should not be difficult,
> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
> is if this is cheap enough to be cheaper than just doing posix_fadvise
> (which however only deals with the page cache).

I don't like this approach. It duplicates the tree-descend code, and it also duplicates the work of descending the tree at runtime. And it only addresses index insertion; there are a lot of places that could benefit from prefetching or async execution like this.

I think we should think of this as async execution rather than prefetching. We don't have the general infrastructure for writing async code, but if we did, this would be much simpler. In an async programming model, like you have in many other languages like Rust, python or javascript, there would be no separate prefetching function. Instead, aminsert() would return a future that can pause execution if it needs to do I/O.
Something like this:

    aminsert_futures = NIL;
    /* create a future for each index insert */
    for (<all indexes>)
    {
        aminsert_futures = lappend(aminsert_futures, aminsert(...));
    }
    /* wait for all the futures to finish */
    await aminsert_futures;

The async-aware aminsert function would run to completion quickly if all the pages are already in cache. If you get a cache miss, it would start an async I/O read for the page, and yield to the other insertions until the I/O completes.

We already support async execution of FDWs now, with the ForeignAsyncRequest() and ForeignAsyncConfigureWait() callbacks. Can we generalize that?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 11/23/23 14:26, Heikki Linnakangas wrote:
> On 06/11/2023 19:05, Tomas Vondra wrote:
>> As for the path forward, I think the prefetching is demonstrably
>> beneficial. There are cases where it can't help or even harms
>> performance. I think the success depends on three areas:
>>
>> (a) reducing the costs of the prefetching - For example right now we
>> build the index tuples twice (once for prefetch, once for the insert),
>> but maybe there's a way to do that only once? There are also predicate
>> indexes, and so on.
>>
>> (b) being smarter about when to prefetch - For example if we only have
>> one "prefetchable" index, it's somewhat pointless to prefetch (for
>> single-row cases). And so on.
>>
>> (c) not prefetching when already cached - This is somewhat related to
>> the previous case, but perhaps it'd be cheaper to first check if the
>> data is already cached. For shared buffers it should not be difficult,
>> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
>> is if this is cheap enough to be cheaper than just doing posix_fadvise
>> (which however only deals with the page cache).
>
> I don't like this approach. It duplicates the tree-descend code, and it
> also duplicates the work of descending the tree at runtime. And it only
> addresses index insertion; there are a lot of places that could benefit
> from prefetching or async execution like this.

Yeah, I think that's a fair assessment, although I think the amount of duplicate code is pretty small (and perhaps it could be refactored into a common function, which I chose not to do in the PoC patch).

> I think we should think of this as async execution rather than
> prefetching. We don't have the general infrastructure for writing async
> code, but if we did, this would be much simpler. In an async programming
> model, like you have in many other languages like Rust, python or
> javascript, there would be no separate prefetching function.
> Instead, aminsert() would return a future that can pause execution if
> it needs to do I/O. Something like this:
>
>     aminsert_futures = NIL;
>     /* create a future for each index insert */
>     for (<all indexes>)
>     {
>         aminsert_futures = lappend(aminsert_futures, aminsert(...));
>     }
>     /* wait for all the futures to finish */
>     await aminsert_futures;
>
> The async-aware aminsert function would run to completion quickly if all
> the pages are already in cache. If you get a cache miss, it would start
> an async I/O read for the page, and yield to the other insertions until
> the I/O completes.
>
> We already support async execution of FDWs now, with the
> ForeignAsyncRequest() and ForeignAsyncConfigureWait() callbacks. Can we
> generalize that?

Interesting idea. I kinda like it in principle. However, I'm not very familiar with how our async execution works (and perhaps even with async implementations in general), so I can't quite say how difficult it would be to do something like that in an AM (instead of the executor). Where exactly would the boundary be between who "creates" and who "executes" the requests, and what would the flow of execution be?

For the FDW it seems fairly straightforward, because the boundary is local/remote, and the async request is executed in a separate process. But here everything would happen locally, so how would that work?

Imagine we're inserting a tuple into two indexes. There's a bunch of index pages we may need to read for each index - first some internal pages, then some leaf pages. Presumably we'd want to make each page read asynchronous, and allow transfer of control to the other index.

IIUC the async-aware aminsert() would execute an async request for the first page it needs to read, with a callback that essentially does the next step of the index descent - reads the page, determines the next page to read, and then issues another async request. Then it'd sleep, which would allow transfer of control to the aminsert() on the other index.
And then we'd do a similar thing for the leaf pages. Or do I imagine things wrong?

The thing I like about this async approach is that it would allow prefetching all index pages, while my PoC patch simply assumes all internal pages are in cache and prefetches only the first leaf page. That's much simpler in terms of control flow, but has clear limits.

I do wonder if there are concurrency issues, however. Imagine there's a COPY, i.e. we're inserting a batch of tuples. Can you run the aminsert() for all the tuples concurrently? Won't that have issues with the different "async threads" modifying the index under the other threads?

If those concurrent "insert threads" would be an issue, maybe we could make amprefetch() async-aware instead ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 6 Nov 2023 at 22:36, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Seems cfbot was not entirely happy about the patch, for two reasons:
>
> 1) enable_insert_prefetching definition was inconsistent (different
> boot/default values, missing in .conf and so on)
>
> 2) stupid bug in execReplication, inserting index entries twice
>
> The attached v3 should fix all of that, I believe.
>
> As for the path forward, I think the prefetching is demonstrably
> beneficial. There are cases where it can't help or even harms
> performance. I think the success depends on three areas:
>
> (a) reducing the costs of the prefetching - For example right now we
> build the index tuples twice (once for prefetch, once for the insert),
> but maybe there's a way to do that only once? There are also predicate
> indexes, and so on.
>
> (b) being smarter about when to prefetch - For example if we only have
> one "prefetchable" index, it's somewhat pointless to prefetch (for
> single-row cases). And so on.
>
> (c) not prefetching when already cached - This is somewhat related to
> the previous case, but perhaps it'd be cheaper to first check if the
> data is already cached. For shared buffers it should not be difficult,
> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
> is if this is cheap enough to be cheaper than just doing posix_fadvise
> (which however only deals with the page cache).

CFBot shows that the patch does not apply anymore, as in [1]:

=== Applying patches on top of PostgreSQL commit ID 7014c9a4bba2d1b67d60687afb5b2091c1d07f73 ===
=== applying patch ./0001-insert-prefetch-v3.patch
patching file src/backend/access/brin/brin.c
Hunk #1 FAILED at 117.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/brin/brin.c.rej
patching file src/backend/access/gin/ginutil.c
Hunk #1 FAILED at 64.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/gin/ginutil.c.rej
patching file src/backend/access/gist/gist.c
Hunk #1 FAILED at 86.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/gist/gist.c.rej
patching file src/backend/access/hash/hash.c
Hunk #1 FAILED at 83.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/hash/hash.c.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4622.log

Regards,
Vignesh