Re: index prefetching - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: index prefetching
Date:
Msg-id: CAKZiRmzao9mQC3aX1nXFOCybwL+EX9YWmy+GLkQ2u56SdAVBWw@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick&dirty way to do this.

... that's a pity! :( Well, then let's just finish that subthread. I gave
some explanations, and I'll try to take another look in future revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at a static 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> > + * XXX Maybe we could consider effective_cache_size when sizing the cache?
> > + * Not to size the cache for that, ofc, but maybe as a guidance of how many
> > + * heap pages it might keep. Maybe just a fraction of the value,
> > + * say Max(8MB, effective_cache_size / max_connections) or something.
> > + */
> > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */
> > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */
> > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
> >
>
> I don't see why would this be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache.
> It's entirely separate from the prefetch distance.

Well, my thought was: the higher eic is, the more I/O parallelism we are
introducing - and in such a case, the more requests we need to remember
from the past to avoid prefetching the same blocks again (N * eic, where
N would be some multiplier).

> > 7. in IndexPrefetchComputeTarget()
> >
> > + * XXX We cap the target to plan_rows, because it's pointless to prefetch
> > + * more than we expect to use.
> >
> > That's a nice fact that's already in the patch, so the XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it to me. I had taken all of the XXXs literally as
FIXMEs that you wanted to go away (things to be removed before the patch
is considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a different
codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95. The
final fix there managed to avoid adding a new GUC (which is always an
option...) by introducing a sensible hard limit (fence) of 100 visited
heap pages. Here we could have a similar heuristic from the start:
if (plan_rows < we_have_already_visited_pages * avgRowsPerBlock) -->
ignore plan_rows and ramp prefetches back up to the full eic value.

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> > 3000) FROM generate_series(1, 10000) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> > (10*random())::int), 3000) from (select 10000 * random() as r from
> > generate_series(1, 10000)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX on test (id);
> >
>
> It's not clear to me what's the purpose of this test? Can you explain?
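As an aside, to make suggestion 4 above concrete, here is a rough sketch of
what "derive PREFETCH_LRU_COUNT from eic" could look like. Only the #define
names come from the patch; the functions, the multiplier N=8, and the clamp
to the current default are all hypothetical choices of mine, not patch code:

```c
#include <assert.h>

#define PREFETCH_LRU_SIZE 8          /* slots in one LRU (from the patch) */
#define PREFETCH_LRU_COUNT_MIN 128   /* the patch's current static count */

/* Hypothetical: scale the number of shared LRUs with eic (N * eic, N = 8
 * here), never dropping below the current static default, so the request
 * history grows with the requested I/O parallelism. */
static inline int
prefetch_lru_count(int effective_io_concurrency)
{
    int n = 8 * effective_io_concurrency;

    return (n > PREFETCH_LRU_COUNT_MIN) ? n : PREFETCH_LRU_COUNT_MIN;
}

/* Total remembered-request slots, replacing the static PREFETCH_CACHE_SIZE */
static inline int
prefetch_cache_size(int effective_io_concurrency)
{
    return PREFETCH_LRU_SIZE * prefetch_lru_count(effective_io_concurrency);
}
```

With eic=1 this degenerates to the patch's current 8*128 slots; only larger
eic values grow the history.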
It's just the schema & data preparation for the tests below:

> >
> > 2. Prefetching for the TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption right that we should go like this:
> > t_index->t->toast_idx->toast_heap?), but I'm too much of a newbie to
> > actually see the code path where it could be added - certainly it's not
> > a blocker -- but maybe a list of future improvements could be included
> > in the commit message?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in the executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated the data using the cyclic/random
version, and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of string_agg(), but
that's not enough)

where fds 45, 54 and 55 correspond to:

lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/45 -> /tmp/blah/base/5/16384 // "test"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/54 -> /tmp/blah/base/5/16388 // "pg_toast_16384_index"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/55 -> /tmp/blah/base/5/16387 // "pg_toast_16384"

I've got the following:
- 83 pread64 and 83 fadvise() calls at random offsets for fd=45 - the main
intent of this patch (main relation heap prefetching) works well
- 54 pread64 calls for fd=54 (no fadvises())
- 1789 (!)
calls to pread64 for fd=55 at RANDOM offsets (TOAST heap, no prefetch)

so at least in theory it makes a lot of sense to prefetch TOAST too; the
pattern looks like cyclic random:

// pread(fd, "", blocksz, offset)
fadvise64(45, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(55, ""..., 8192, 38002688) = 8192
pread64(55, ""..., 8192, 12034048) = 8192
pread64(55, ""..., 8192, 36560896) = 8192
pread64(55, ""..., 8192, 8871936) = 8192
pread64(55, ""..., 8192, 17965056) = 8192
pread64(55, ""..., 8192, 18710528) = 8192
pread64(55, ""..., 8192, 35635200) = 8192
pread64(55, ""..., 8192, 23379968) = 8192
pread64(55, ""..., 8192, 25141248) = 8192
pread64(55, ""..., 8192, 3457024) = 8192
pread64(55, ""..., 8192, 24633344) = 8192
pread64(55, ""..., 8192, 36462592) = 8192
pread64(55, ""..., 8192, 18120704) = 8192
pread64(55, ""..., 8192, 27066368) = 8192
pread64(45, ""..., 8192, 40960) = 8192
pread64(55, ""..., 8192, 2768896) = 8192
pread64(55, ""..., 8192, 10846208) = 8192
pread64(55, ""..., 8192, 30179328) = 8192
pread64(55, ""..., 8192, 7700480) = 8192
pread64(55, ""..., 8192, 38846464) = 8192
pread64(55, ""..., 8192, 1040384) = 8192
pread64(55, ""..., 8192, 10985472) = 8192

It's probably a separate feature (prefetching blocks from TOAST), but it
could be mentioned that this patch is not doing it (I was assuming it could).

> > 3. I'm not sure if I got good-enough results for the DESCending index
> > `create index on test (id DESC);` - with eic=16 it doesn't seem to be
> > able to prefetch 16 blocks in advance? (e.g. highlight offset 557056
> > below in some text editor; the distance between that fadvise<->pread
> > pair is far lower):
> > [..]
> >
>
> I'm not sure I understand these strace snippets. Can you elaborate a
> bit, explain what the strace log says?
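Returning to the TOAST point for a moment: at the syscall level, the missing
piece would amount to issuing the same WILLNEED advice for the TOAST chunk
blocks (fd 55 in my traces) that the patch already issues for the main heap
(fd 45). A minimal sketch of that idea, with the caveat that this is NOT
patch code - in reality the block numbers would come from walking the TOAST
index, and PostgreSQL does its I/O through smgr/PrefetchBuffer rather than
raw fds:

```c
#define _XOPEN_SOURCE 600   /* for posix_fadvise */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192         /* PostgreSQL's default block size */

/* Hypothetical helper: advise the kernel that one TOAST heap block will be
 * needed soon, mirroring the fadvise64(..., POSIX_FADV_WILLNEED) calls seen
 * for the main relation above.  Returns 0 on success, an errno otherwise. */
static int
prefetch_toast_block(int toast_fd, unsigned int blockno)
{
    return posix_fadvise(toast_fd, (off_t) blockno * BLCKSZ,
                         BLCKSZ, POSIX_FADV_WILLNEED);
}
```

The advice is purely a hint: the kernel may start readahead for that 8kB
range so the later pread64() hits the page cache instead of blocking.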
set enable_seqscan to off;
set enable_bitmapscan to off;
drop index test_id_idx;
create index on test (id DESC); -- the DESC one
SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;

Ok, so here's cleaner output of `strace -s 0` for the PID doing that SELECT
with eic=16, annotated with [*]:

lseek(45, 0, SEEK_END) = 688128
lseek(47, 0, SEEK_END) = 212992
pread64(47, ""..., 8192, 172032) = 8192
pread64(45, ""..., 8192, 90112) = 8192
fadvise64(45, 172032, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 172032) = 8192
fadvise64(45, 319488, 8192, POSIX_FADV_WILLNEED) = 0 [*off 319488 start]
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 319488) = 8192 [*off 319488, read, distance=1 fadvise]
fadvise64(45, 466944, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 393216, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 335872) = 8192
fadvise64(45, 540672, 8192, POSIX_FADV_WILLNEED) = 0 [*off 540672 start]
fadvise64(45, 262144, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 466944) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 393216) = 8192
fadvise64(45, 163840, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 385024, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 540672) = 8192 [*off 540672, read, distance=4 fadvises]
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
[..]

I was wondering why the distance never got >4 in this case for eic=16 - it
should issue more fadvise() calls, shouldn't it? (It was happening only for
DESC; with the normal ASC index the prefetching distance easily reaches
~eic values.) And I think today I've got the answer: after dropping and
recreating the DESC index I did NOT execute ANALYZE, so probably the
Min(..., plan_rows) cap was kicking in and preventing the full prefetching.

Hitting the above makes me think that the XXX for plan_rows should really
be a real FIXME.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com
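PS. For what it's worth, the fence heuristic I floated above (in the spirit
of the 100-page limit from 9c6ad5eaa95) could look roughly like this. A
sketch only - every name here is hypothetical, not the patch's actual API:

```c
#include <assert.h>

/* Hypothetical replacement for the bare Min(eic, plan_rows) cap: trust the
 * planner's row estimate only while the pages actually visited are still
 * consistent with it.  Once the scan has clearly outrun the estimate (e.g.
 * a stale estimate after an index rebuild without ANALYZE), ignore
 * plan_rows and ramp the prefetch target back up to the full eic. */
static int
prefetch_target(double plan_rows, long visited_pages,
                double avg_rows_per_block, int eic)
{
    if (plan_rows < visited_pages * avg_rows_per_block)
        return eic;             /* estimate busted: ignore plan_rows */

    /* otherwise keep the current Min(eic, plan_rows) behavior */
    return (plan_rows < eic) ? (int) plan_rows : eic;
}
```

This would keep the "don't prefetch more than we expect to use" property for
accurate estimates, while bounding the damage a wrong estimate can do.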