From: Jakub Wartak
Subject: Re: index prefetching
Msg-id: CAKZiRmz_gzOJ+wUdKXSBrxH8UxQ+WHV+38JHz07WtsjBxPo97Q@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> (1) Melanie actually presented a very different way to implement this,
> relying on the StreamingRead API. So chances are this struct won't
> actually be used.

Given the effort already spent on this, and the fact that this thread
is actually two:

a. index/table prefetching, from Jun 2023 till ~Jan 2024
b. afterwards, index/table prefetching with the Streaming API, though
there are some doubts whether that can happen for v17 [1]

... it would be a pity not to take advantage of such work (even if the
Streaming API isn't ready for this; although there's lots of movement
in the area). So I've played a little with the earlier implementation
from [2], without the Streaming API, as it has already received
feedback, demonstrated big benefits, and earlier got attention at the
pgcon unconference. Perhaps some of these comments can be carried over
to the "b" patch later (once that's feasible):

1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
not apply cleanly anymore, due to show_buffer_usage() having been
refactored quite recently in 5de890e3610d5a12cdaea36413d967cf5c544e20:

patching file src/backend/commands/explain.c
Hunk #1 FAILED at 3568.
Hunk #2 FAILED at 3679.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/commands/explain.c.rej

2. v2 (the fixup) applies, but it would be nice to see it integrated
into the main patch (it adds IndexOnlyPrefetchInfo), so that there's
just one patch.

3. execMain.c :

    +     * XXX It might be possible to improve the prefetching code to handle this
    +     * by "walking back" the TID queue, but it's not clear if it's worth it.

Shouldn't we just remove the XXX? Walking back seems to be a niche
case, as are backward fetches using cursors, when looking at
real-world user queries (support-case bias here, from looking at
people's pg_stat_activity).

4. Wouldn't it be better to leave PREFETCH_LRU_SIZE static at 8, but
base PREFETCH_LRU_COUNT on effective_io_concurrency instead? That
would let it scale dynamically: the more prefetches the user wants to
perform, the more you spread them across shared LRUs, and the more
memory for history is required. (See the sketch after the quoted code
below.)

    + * XXX Maybe we could consider effective_cache_size when sizing the cache?
    + * Not to size the cache for that, ofc, but maybe as a guidance of how many
    + * heap pages it might keep. Maybe just a fraction fraction of the value,
    + * say Max(8MB, effective_cache_size / max_connections) or something.
    + */
    +#define        PREFETCH_LRU_SIZE        8    /* slots in one LRU */
    +#define        PREFETCH_LRU_COUNT        128 /* number of LRUs */
    +#define        PREFETCH_CACHE_SIZE        (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)

BTW:
    + * heap pages it might keep. Maybe just a fraction fraction of the value,
that's a duplicated "fraction" word over there.
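
To illustrate the idea, a minimal sketch only, not patch code (the
scale factor of 4 and the floor of 16 are numbers I made up, and
prefetch_lru_count() is a hypothetical helper):

    #include "postgres.h"
    #include "port/pg_bitutils.h"
    #include "storage/bufmgr.h"        /* for effective_io_concurrency */

    #define PREFETCH_LRU_SIZE    8    /* slots in one LRU, still static */

    /* hypothetical: derive the LRU count from effective_io_concurrency */
    static inline uint32
    prefetch_lru_count(void)
    {
        uint32        count = Max(effective_io_concurrency * 4, 16);

        /* keep it a power of two so the LRU hash stays a cheap mask */
        return pg_nextpower2_32(count);
    }

The cache capacity would then be prefetch_lru_count() *
PREFETCH_LRU_SIZE, growing with the configured I/O concurrency instead
of staying fixed at 8 * 128 = 1024 entries.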

5.
    +     * XXX Could it be harmful that we read the queue backwards? Maybe memory
    +     * prefetching works better for the forward direction?

I wouldn't care: we are optimizing I/O (and context switching), which
weighs much more than any memory-access-direction impact, and Dilip
earlier also expressed no concern, so maybe it could also be removed
(one less "XXX" to care about).

6. in IndexPrefetchFillQueue()

    +    while (!PREFETCH_QUEUE_FULL(prefetch))
    +    {
    +        IndexPrefetchEntry *entry
    +        = prefetch->next_cb(scan, direction, prefetch->data);

While we are at it... that's a strange line split, and the assignment is not indented. :^)
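
E.g. something like this instead (just a formatting suggestion in
pgindent style, no behavior change intended):

    while (!PREFETCH_QUEUE_FULL(prefetch))
    {
        IndexPrefetchEntry *entry;

        entry = prefetch->next_cb(scan, direction, prefetch->data);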

7. in IndexPrefetchComputeTarget()

    + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch
    + * more than we expect to use.

That's a nice fact that's already in the patch, so the XXX isn't needed?

8.
    + * XXX Maybe we should reduce the value with parallel workers?

I was assuming it could be a good idea, but the same (eic / actual
parallel workers per gather, e.g. a per-worker target of 4 with eic=16
and 4 workers) doesn't seem to be done for bitmap heap scan
prefetches, so no?

9.
    +    /*
    +     * No prefetching for direct I/O.
    +     *
    +     * XXX Shouldn't we do prefetching even for direct I/O? We would only
    +     * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
    +     * once the code starts loading into shared buffers, that'd work.
    +     */
    +    if ((io_direct_flags & IO_DIRECT_DATA) != 0)
    +        return 0;

It's redundant (?) and could be removed, as
PrefetchBuffer() -> PrefetchSharedBuffer() already has this check at
line 571:

    #ifdef USE_PREFETCH
            /*
             * Try to initiate an asynchronous read.  This returns false in
             * recovery if the relation file doesn't exist.
             */
            if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
                smgrprefetch(smgr_reln, forkNum, blockNum, 1))
            {
                result.initiated_io = true;
            }
    #endif                            /* USE_PREFETCH */

11. in IndexPrefetchStats() and ExecReScanIndexScan()

    + * FIXME Should be only in debug builds, or something like that.

    +    /* XXX Print some debug stats. Should be removed. */
    +    IndexPrefetchStats(indexScanDesc, node->iss_prefetch);

Hmm, but it could be useful for tuning real-world systems, no? E.g.
the recovery prefetcher gives some info through the
pg_stat_recovery_prefetch view, but e.g. bitmap heap scans do not
provide us with anything at all. I don't have a strong opinion.
Exposing such stuff would take away your main doubt (XXX) in
execPrefetch.c about "auto-tuning/self-adjustment". And while we are
at it, we could think, in the far future, about adding a new session
GUC track_cachestat or EXPLAIN (cachestat/prefetch, analyze) (using
the new cachestat() syscall on Linux >= 6.5), where we could present
both the index prefetch stats (what IndexPrefetchStats() reports)
*and* cachestat() results for interested users. Of course it would
have to be generic enough for the bitmap heap scan case too. Such
insight would also allow fine-tuning eic, PREFETCH_LRU_COUNT,
PREFETCH_QUEUE_HISTORY. Just an idea.
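
To make the cachestat() part concrete, a minimal standalone sketch
(assumptions: the struct layouts are copied from the kernel uapi
headers, syscall number 451 is the x86_64 one, and there may be no
libc wrapper yet, hence the raw syscall(2)):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_cachestat
    #define __NR_cachestat 451        /* x86_64; verify for your arch */
    #endif

    /* mirrors the Linux >= 6.5 uapi definitions */
    struct cachestat_range
    {
        uint64_t    off;
        uint64_t    len;            /* 0 = from off to end of file */
    };

    struct cachestat
    {
        uint64_t    nr_cache;        /* pages present in the page cache */
        uint64_t    nr_dirty;
        uint64_t    nr_writeback;
        uint64_t    nr_evicted;
        uint64_t    nr_recently_evicted;
    };

    int
    main(int argc, char **argv)
    {
        int            fd = open(argv[1], O_RDONLY);
        struct cachestat_range range = {0, 0};
        struct cachestat cs;

        if (fd < 0 || syscall(__NR_cachestat, fd, &range, &cs, 0) != 0)
        {
            perror("cachestat");
            return 1;
        }
        printf("cached=%lu dirty=%lu evicted=%lu recently_evicted=%lu\n",
               (unsigned long) cs.nr_cache, (unsigned long) cs.nr_dirty,
               (unsigned long) cs.nr_evicted,
               (unsigned long) cs.nr_recently_evicted);
        close(fd);
        return 0;
    }

Run against a relation's heap file, that prints the same numbers such
an EXPLAIN option could report per relation.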

12.

    +         * XXX Maybe we should reduce the target in case this is a parallel index
    +         * scan. We don't want to issue a multiple of effective_io_concurrency.

in IndexOnlyPrefetchCleanup() and IndexNext()

+ * XXX Maybe we should reduce the value with parallel workers?

That's a redundant XXX comment (there are two for the same thing), as
it was already there just before IndexPrefetchComputeTarget().

13. The pre-existing bitmap prefetch code uses #ifdef USE_PREFETCH;
maybe it would make sense to follow that pattern for consistency, to
avoid compiling in the implementation on platforms without
prefetching?
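
E.g. (a sketch only; the call and its arguments are just stand-ins for
the patch's actual prefetch call sites):

    #ifdef USE_PREFETCH
        /* compiles away entirely where posix_fadvise() is unavailable */
        IndexPrefetchFillQueue(scan, direction, prefetch);
    #endif                            /* USE_PREFETCH */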

14. The patch is missing documentation, so how about just this?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2527,7 +2527,8 @@ include_dir 'conf.d'
          operations that any individual <productname>PostgreSQL</productname> session
          attempts to initiate in parallel.  The allowed range is 1 to 1000,
          or zero to disable issuance of asynchronous I/O requests. Currently,
-         this setting only affects bitmap heap scans.
+         this setting only enables prefetching of heap data blocks when performing
+         bitmap heap scans and index (only) scans.
         </para>

Some further tests, given this data:

CREATE TABLE test (id bigint, val bigint, str text);
ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), 3000)
  FROM generate_series(1, 10000) g;
-- or: INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + (10*random())::int), 3000)
--     FROM (SELECT 10000 * random() AS r FROM generate_series(1, 10000)) x;
VACUUM ANALYZE test;
CREATE INDEX ON test (id);

1. The patch correctly detects sequential access: e.g. we issue only
up to 6 fadvise() syscalls (8kB each) and 17 pread()s on the heap fd
for a query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND
2000;` (the offsets of the fadvise() calls and the pread()s match), so
that's good.

2. Prefetching for TOASTed heap seems not to be implemented at all,
correct? (Is my assumption right that we should go like this:
t_index -> t -> toast_idx -> toast_heap?) I'm too much of a newbie to
actually see the code path where it could be added. It's certainly not
a blocker, but maybe the commit message could include a list of
improvements left for the future?

2024-02-29 11:45:14.259 CET [11098] LOG:  index prefetch stats:
requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973
2024-02-29 11:45:14.259 CET [11098] STATEMENT:  SELECT
md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4
\0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192
pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237
\0\320\237 \0"..., 8192, 40960) = 8192
pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4
\0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192
pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4
\0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192
pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4
\0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192
[..no fadvises for fd=50 which was pg_toast_rel..]

3. I'm not sure I got good-enough results for a DESCending index
(`CREATE INDEX ON test (id DESC);`): with eic=16 it doesn't seem to be
able to prefetch 16 blocks in advance? (E.g. highlight offset 557056
below in some text editor; the distance between that fadvise() and the
matching pread() is far lower than that.)

pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192
fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192
fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192
fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192
fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192

-Jakub Wartak.

[1] - https://www.postgresql.org/message-id/20240215201337.7amzw3hpvng7wphb%40awork3.anarazel.de
[2] - https://www.postgresql.org/message-id/777e981c-bf0c-4eb9-a9e0-42d677e94327%40enterprisedb.com


