Re: index prefetching - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: index prefetching
Msg-id CAKZiRmzao9mQC3aX1nXFOCybwL+EX9YWmy+GLkQ2u56SdAVBWw@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick&dirty way to do this.

... that's a pity! :( Well, then let's just finish that subthread - I
gave some explanations there, and I'll try to take another look in
future revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> >     + * XXX Maybe we could consider effective_cache_size when sizing the cache?
> >     + * Not to size the cache for that, ofc, but maybe as a guidance of how many
> >     + * heap pages it might keep. Maybe just a fraction fraction of the value,
> >     + * say Max(8MB, effective_cache_size / max_connections) or something.
> >     + */
> >     +#define        PREFETCH_LRU_SIZE        8    /* slots in one LRU */
> >     +#define        PREFETCH_LRU_COUNT        128 /* number of LRUs */
> >     +#define        PREFETCH_CACHE_SIZE        (PREFETCH_LRU_SIZE *
> > PREFETCH_LRU_COUNT)
> >
>
> I don't see why would this be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache. It's entirely separate from the prefetch distance.

Well, my thought was that the higher eic is, the more I/O parallelism
we are introducing, and in such a case the more requests we need to
remember from the past to avoid prefetching the same pages again (N *
eic, where N would be some multiplier).
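Something along these lines is what I had in mind (just a rough
sketch, not from the patch - PREFETCH_EIC_MULT is a name I'm inventing
here, and scaling with a GUC of course means the cache could no longer
be sized statically):

/* sketch: keep each LRU small and static, but scale the number of LRUs
 * with effective_io_concurrency so the history remembers roughly
 * PREFETCH_EIC_MULT * eic recently prefetched blocks */
#define PREFETCH_LRU_SIZE       8   /* slots in one LRU */
#define PREFETCH_EIC_MULT       4   /* the "N" multiplier */
#define PREFETCH_LRU_COUNT \
    Max(128, (effective_io_concurrency * PREFETCH_EIC_MULT) / PREFETCH_LRU_SIZE)
#define PREFETCH_CACHE_SIZE     (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)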

> > 7. in IndexPrefetchComputeTarget()
> >
> >     + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch
> >     + * more than we expect to use.
> >
> > That's a nice fact that's already in patch, so XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it. I had taken all of the XXXs literally as FIXMEs
that you wanted to go away (things to be removed before the patch is
considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a
different codepath - get_actual_variable_endpoint() - see [1],
9c6ad5eaa95. The final fix there was to get away without adding a new
GUC (which is always an option...) and instead introduce a sensible
hard limit (fence), sticking to a limit of 100 visited heap pages.
Here we could have a similar heuristic from the start: if (plan_rows <
we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows
and ramp the prefetch distance back up to the full eic value.
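In code that could be as simple as the following (again just a sketch;
the struct fields and avg_rows_per_block are invented names, not
something from the patch):

/* hypothetical fence: once we have demonstrably read past what the
 * planner predicted, stop trusting plan_rows and go back to the full
 * eic-derived prefetch distance */
if (plan_rows < prefetch->pages_visited * avg_rows_per_block)
    prefetch->target = eic_target;          /* estimate proven wrong, ignore it */
else
    prefetch->target = Min(eic_target, plan_rows);  /* the cap the XXX describes */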

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> > 3000) FROM generate_series(1, 10000) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> > (10*random())::int), 3000) from (select 10000 * random() as r from
> > generate_series(1, 10000)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX on test (id) ;
> >
>
> It's not clear to me what's the purpose of this test? Can you explain?

It's just schema&data preparation for the tests below:

> >
> > 2. Prefetching for TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption that we should go like this:
> > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually
> > see the code path where it could be added - certainly it's not blocker
> > -- but maybe in commit message a list of improvements for future could
> > be listed?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated the data using the cyclic /
random version, and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of string_agg(),
but that's not enough to actually force reading the TOASTed values)

where fds 45, 54 and 55 correspond to:
    lrwx------ 1 postgres postgres 64 Mar  5 12:56 /proc/8221/fd/45 ->
/tmp/blah/base/5/16384 // "test"
    lrwx------ 1 postgres postgres 64 Mar  5 12:56 /proc/8221/fd/54 ->
/tmp/blah/base/5/16388 // "pg_toast_16384_index"
    lrwx------ 1 postgres postgres 64 Mar  5 12:56 /proc/8221/fd/55 ->
/tmp/blah/base/5/16387 // "pg_toast_16384"

I've got the following:
- 83 pread64() and 83 fadvise64() calls at random offsets for fd=45 -
the main intent of this patch (main relation heap prefetching) works
well
- 54 pread64() calls for fd=54 (no fadvise64())
- 1789 (!) pread64() calls for fd=55 at RANDOM offsets (TOAST heap,
no prefetching)

so at least in theory it makes a lot of sense to prefetch TOAST too;
the pattern looks like cyclic random:

// pread(fd, "", blocksz, offset)
fadvise64(45, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(55, ""..., 8192, 38002688)      = 8192
pread64(55, ""..., 8192, 12034048)      = 8192
pread64(55, ""..., 8192, 36560896)      = 8192
pread64(55, ""..., 8192, 8871936)       = 8192
pread64(55, ""..., 8192, 17965056)      = 8192
pread64(55, ""..., 8192, 18710528)      = 8192
pread64(55, ""..., 8192, 35635200)      = 8192
pread64(55, ""..., 8192, 23379968)      = 8192
pread64(55, ""..., 8192, 25141248)      = 8192
pread64(55, ""..., 8192, 3457024)       = 8192
pread64(55, ""..., 8192, 24633344)      = 8192
pread64(55, ""..., 8192, 36462592)      = 8192
pread64(55, ""..., 8192, 18120704)      = 8192
pread64(55, ""..., 8192, 27066368)      = 8192
pread64(45, ""..., 8192, 40960)         = 8192
pread64(55, ""..., 8192, 2768896)       = 8192
pread64(55, ""..., 8192, 10846208)      = 8192
pread64(55, ""..., 8192, 30179328)      = 8192
pread64(55, ""..., 8192, 7700480)       = 8192
pread64(55, ""..., 8192, 38846464)      = 8192
pread64(55, ""..., 8192, 1040384)       = 8192
pread64(55, ""..., 8192, 10985472)      = 8192

It's probably a separate feature (prefetching blocks from TOAST), but
it could be mentioned in the commit message that this patch does not
do that (I was assuming it would).
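Just to sketch the idea (purely hypothetical - PrefetchBuffer() is the
existing API, but toastrel / chunk_blocks[] / nchunks are placeholders
for whatever we'd collect from the TOAST index before fetching the
chunks):

/* hypothetical TOAST prefetch: after walking the TOAST index for a
 * value, hint the TOAST heap blocks we are about to read */
for (int i = 0; i < nchunks; i++)
    PrefetchBuffer(toastrel, MAIN_FORKNUM, chunk_blocks[i]);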

> > 3. I'm not sure if I got good-enough results for DESCending index
> > `create  index on test (id DESC);`- with eic=16 it doesnt seem to be
> > be able prefetch 16 blocks in advance? (e.g. highlight offset 557056
> > below in some text editor and it's distance is far lower between that
> > fadvise<->pread):
> >
[..]
> >
>
> I'm not sure I understand these strace snippets. Can you elaborate a
> bit, explain what the strace log says?

set enable_seqscan to off;
set enable_bitmapscan to off;
drop index test_id_idx;
create index on test (id DESC); -- DESC one
SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;

OK, so here is cleaner output of strace -s 0 for the PID running that
SELECT with eic=16, annotated with [*]:

lseek(45, 0, SEEK_END)                  = 688128
lseek(47, 0, SEEK_END)                  = 212992
pread64(47, ""..., 8192, 172032)        = 8192
pread64(45, ""..., 8192, 90112)         = 8192
fadvise64(45, 172032, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 172032)        = 8192
fadvise64(45, 319488, 8192, POSIX_FADV_WILLNEED) = 0 [*off 319488 start]
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 319488)        = 8192       [*off 319488,
read, distance=1 fadvises]
fadvise64(45, 466944, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 393216, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 335872)        = 8192
fadvise64(45, 540672, 8192, POSIX_FADV_WILLNEED) = 0 [*off 540672 start]
fadvise64(45, 262144, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 466944)        = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 393216)        = 8192
fadvise64(45, 163840, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 385024, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 540672)        = 8192       [*off 540672,
read, distance=4 fadvises]
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
[..]
I was wondering why the distance never got above 4 in this case with
eic=16 - it should issue more fadvise() calls, shouldn't it? (It was
happening only for DESC; with the normal ASC index the prefetch
distance easily reaches ~eic.) I think today I've got the answer:
after dropping/creating the DESC index I did NOT run ANALYZE, so
probably the Min(..., plan_rows) cap was kicking in and preventing
full prefetching.

Hitting the above makes me think that the XXX for plan_rows should
really be a FIXME.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com


