Re: index prefetching - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: index prefetching
Msg-id: CAKZiRmz_gzOJ+wUdKXSBrxH8UxQ+WHV+38JHz07WtsjBxPo97Q@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
[..]
> (1) Melanie actually presented a very different way to implement this,
> relying on the StreamingRead API. So chances are this struct won't
> actually be used.

Given the lots of effort already spent on this, and the fact that this thread
is actually two:

a. index/table prefetching, from Jun 2023 till ~Jan 2024
b. afterwards, index/table prefetching with the Streaming API, with some
   doubts about whether it could happen for v17 [1]

... it would be a pity not to take advantage of such work (even if the
Streaming API weren't ready for this; although there's lots of movement in
that area), so I've played a little with the earlier implementation from [2]
without the Streaming API, as it already received feedback, demonstrated big
benefits, and got attention earlier at the pgcon unconference. Perhaps some of
these comments can be passed on later to the "b" patch (once that's feasible):

1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does not apply
cleanly anymore, due to show_buffer_usage() being quite recently refactored in
5de890e3610d5a12cdaea36413d967cf5c544e20:

patching file src/backend/commands/explain.c
Hunk #1 FAILED at 3568.
Hunk #2 FAILED at 3679.
2 out of 2 hunks FAILED -- saving rejects to file src/backend/commands/explain.c.rej

2. v2 applies (as a fixup), but it would be nice to see it integrated into the
main patch (it adds IndexOnlyPrefetchInfo), so that there is just one patch.

3. In execMain.c:

+ * XXX It might be possible to improve the prefetching code to handle this
+ * by "walking back" the TID queue, but it's not clear if it's worth it.

Shouldn't we just remove the XXX? The walking-back case seems to be niche, and
so are backward fetches using cursors when looking at real-world user queries
(support-case bias here, from looking at people's pg_stat_activity).

4. Wouldn't it be better to leave PREFETCH_LRU_SIZE static at 8, but base
PREFETCH_LRU_COUNT on effective_io_concurrency instead? That would let it
follow the setting dynamically: the more prefetches the user wants to perform,
the more you spread them across shared LRUs, and the more memory for history
is required. A rough sketch of what I mean is below.

+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#define PREFETCH_LRU_SIZE   8    /* slots in one LRU */
+#define PREFETCH_LRU_COUNT  128  /* number of LRUs */
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)

BTW, in

+ * heap pages it might keep. Maybe just a fraction fraction of the value,

there's a duplicated "fraction" word.
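To make point 4 a bit more concrete, here is a minimal sketch of what I mean
(not the patch's code; the multiplier and the clamping bounds are made-up
numbers, purely for illustration):

/*
 * Hypothetical sizing: keep the per-LRU size fixed, but derive the number
 * of LRUs from effective_io_concurrency, so that a user asking for more
 * concurrent prefetches also gets a proportionally larger cache of
 * recently prefetched heap blocks.
 */
#define PREFETCH_LRU_SIZE   8       /* slots in one LRU, as in the patch */

static int
PrefetchLRUCount(int io_concurrency)
{
    int         count = io_concurrency * 16;    /* 16 is an arbitrary factor */

    /* clamp to something sane; 128 LRUs is the patch's current static value */
    return Min(Max(count, 16), 1024);
}

PREFETCH_CACHE_SIZE would then be computed at scan setup time instead of being
a compile-time constant.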
5.

+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?

I wouldn't worry about it: we are optimizing I/O (and context switching),
which weighs much more than the impact of memory access direction, and Dilip
earlier also expressed no concern, so maybe this could be removed as well (one
less "XXX" to care about).

6. In IndexPrefetchFillQueue():

+    while (!PREFETCH_QUEUE_FULL(prefetch))
+    {
+        IndexPrefetchEntry *entry
+            = prefetch->next_cb(scan, direction, prefetch->data);

While we are at it... that's a strange split, and the assignment is not
indented. :^)

7. In IndexPrefetchComputeTarget():

+ * XXX We cap the target to plan_rows, becausse it's pointless to prefetch
+ * more than we expect to use.

That's a nice fact that's already in the patch, so the XXX isn't needed?

8.

+ * XXX Maybe we should reduce the value with parallel workers?

I was assuming that could be a good idea, but the same
(eic / actual_parallel_workers_per_gather) doesn't seem to be done for bitmap
heap scan prefetches, so maybe not?

9.

+    /*
+     * No prefetching for direct I/O.
+     *
+     * XXX Shouldn't we do prefetching even for direct I/O? We would only
+     * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+     * once the code starts loading into shared buffers, that'd work.
+     */
+    if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+        return 0;

It's redundant (?) and could be removed, as PrefetchBuffer() ->
PrefetchSharedBuffer() already has this (at line 571):

#ifdef USE_PREFETCH
    /*
     * Try to initiate an asynchronous read.  This returns false in
     * recovery if the relation file doesn't exist.
     */
    if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
        smgrprefetch(smgr_reln, forkNum, blockNum, 1))
    {
        result.initiated_io = true;
    }
#endif                          /* USE_PREFETCH */

11. In IndexPrefetchStats() and ExecReScanIndexScan():

+ * FIXME Should be only in debug builds, or something like that.

+    /* XXX Print some debug stats. Should be removed. */
+    IndexPrefetchStats(indexScanDesc, node->iss_prefetch);

Hmm, but it could be useful for tuning real-world systems, no? E.g. the
recovery prefetcher gives some info through the pg_stat_recovery_prefetch
view, while bitmap heap scans do not provide us with anything at all. I don't
have a strong opinion. Exposing such stuff would take away your main doubt
(XXX) in execPrefetch.c about "auto-tuning/self-adjustment". And while we are
at it, we could think, in the far future, about adding a new session GUC
track_cachestat or EXPLAIN (cachestat/prefetch, analyze) (cachestat() is a new
syscall for Linux >= 6.5), where we could present both the index stats (what
IndexPrefetchStats() does) *and* cachestat() results for interested users. Of
course it would have to be generic enough for the bitmap heap scan case too.
Such insight would also allow fine-tuning eic, PREFETCH_LRU_COUNT and
PREFETCH_QUEUE_HISTORY. Just an idea; a rough sketch of how the existing
counters could be surfaced in EXPLAIN follows below.
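To illustrate the idea from point 11, here is a rough sketch (not a concrete
proposal) of how the counters printed by IndexPrefetchStats() could be shown
in EXPLAIN ANALYZE instead of a LOG line, modeled on show_buffer_usage() in
src/backend/commands/explain.c. The struct and field names (IndexPrefetch,
countAll, countPrefetch, countSkipCached, countSkipSequential) are my guesses
based on the log output quoted further below, not the patch's actual members:

/* would live in explain.c, next to show_buffer_usage() */
static void
show_index_prefetch_info(IndexPrefetch *prefetch, ExplainState *es)
{
    /* nothing to print unless the scan was actually executed */
    if (prefetch == NULL || !es->analyze)
        return;

    if (es->format == EXPLAIN_FORMAT_TEXT)
    {
        ExplainIndentText(es);
        appendStringInfo(es->str,
                         "Prefetch: requests=%lld issued=%lld skipped_cached=%lld skipped_sequential=%lld\n",
                         (long long) prefetch->countAll,
                         (long long) prefetch->countPrefetch,
                         (long long) prefetch->countSkipCached,
                         (long long) prefetch->countSkipSequential);
    }
    else
    {
        ExplainPropertyInteger("Prefetch Requests", NULL, prefetch->countAll, es);
        ExplainPropertyInteger("Prefetches Issued", NULL, prefetch->countPrefetch, es);
        ExplainPropertyInteger("Prefetches Skipped Cached", NULL, prefetch->countSkipCached, es);
        ExplainPropertyInteger("Prefetches Skipped Sequential", NULL, prefetch->countSkipSequential, es);
    }
}

Something along those lines would keep the debug value of the counters while
making them available to users without scraping the server log.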
12.

+ * XXX Maybe we should reduce the target in case this is a parallel index
+ * scan. We don't want to issue a multiple of effective_io_concurrency.

and in IndexOnlyPrefetchCleanup() and IndexNext():

+ * XXX Maybe we should reduce the value with parallel workers?

That's a redundant XXX comment (there are two for the same thing), as it was
already there just before IndexPrefetchComputeTarget().

13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH; maybe it would
make sense to follow the same pattern for consistency, and to avoid adding the
implementation on platforms without prefetching?

14. The patch is missing documentation, so how about just this?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2527,7 +2527,8 @@ include_dir 'conf.d'
         operations that any individual <productname>PostgreSQL</productname> session
         attempts to initiate in parallel.  The allowed range is 1 to 1000,
         or zero to disable issuance of asynchronous I/O requests. Currently,
-        this setting only affects bitmap heap scans.
+        this setting only enables prefetching for HEAP data blocks when performing
+        bitmap heap scans and index (only) scans.
        </para>

Some further tests, given this data:

CREATE TABLE test (id bigint, val bigint, str text);
ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), 3000)
  FROM generate_series(1, 10000) g;
-- or
INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + (10*random())::int), 3000)
  FROM (SELECT 10000 * random() AS r FROM generate_series(1, 10000)) x;
VACUUM ANALYZE test;
CREATE INDEX ON test (id);

1. The patch correctly detects sequential access: e.g. we issue only up to 6
fadvise() syscalls (8kB each) and 17 pread()s on the heap fd for a query like
`SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;` (the offsets of the
fadvise calls and the preads match), so that's good.
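Just to spell out what I mean by "detects sequential access" (a stripped-down
illustration only; the patch's actual logic keeps a small history of recently
requested blocks rather than just the previous one):

/*
 * Sketch: skip issuing posix_fadvise() when the block to prefetch is the
 * same as, or immediately follows, the previously requested block, since
 * OS readahead already covers strictly sequential access.
 */
#include "storage/block.h"

static bool
prefetch_block_is_sequential(BlockNumber prev_block, BlockNumber block)
{
    return (prev_block != InvalidBlockNumber &&
            (block == prev_block || block == prev_block + 1));
}

Blocks classified this way end up in the "sequential" counter of the index
prefetch stats line and, presumably, explain why so few fadvise calls appear
next to the preads above.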
2. Prefetching for the TOASTed heap seems not to be implemented at all,
correct? (Is my assumption right that we should go like this:
t_index -> t -> toast_idx -> toast_heap?) I'm too much of a newbie to actually
see the code path where it could be added. It's certainly not a blocker, but
maybe the commit message could list it among possible future improvements?

2024-02-29 11:45:14.259 CET [11098] LOG:  index prefetch stats: requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973
2024-02-29 11:45:14.259 CET [11098] STATEMENT:  SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192
pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 40960) = 8192
pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192
pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192
pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192
[..no fadvises for fd=50 which was pg_toast_rel..]

3. I'm not sure I got good-enough results for a DESCending index
(`CREATE INDEX ON test (id DESC);`): with eic=16 it doesn't seem to be able to
prefetch 16 blocks in advance? E.g. highlight offset 557056 below in a text
editor: the distance between that fadvise and its pread is far lower than
that.

pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192
fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192
fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192
fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192
fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192

-Jakub Wartak

[1] - https://www.postgresql.org/message-id/20240215201337.7amzw3hpvng7wphb%40awork3.anarazel.de
[2] - https://www.postgresql.org/message-id/777e981c-bf0c-4eb9-a9e0-42d677e94327%40enterprisedb.com