Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Date:
Msg-id: a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
Responses: Re: index prefetching
List: pgsql-hackers
On 6/9/23 02:06, Andres Freund wrote:
> Hi,
>
> On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
>> At pgcon unconference I presented a PoC patch adding prefetching for
>> indexes, along with some benchmark results demonstrating the (pretty
>> significant) benefits etc. The feedback was quite positive, so let me
>> share the current patch more widely.
>
> I'm really excited about this work.
>
>
>> 1) pairing-heap in GiST / SP-GiST
>>
>> For most AMs, the index state is pretty trivial - matching items from
>> a single leaf page. Prefetching that is pretty trivial, even if the
>> current API is a bit cumbersome.
>>
>> Distance queries on GiST and SP-GiST are a problem, though, because
>> those do not just read the pointers into a simple array, as the
>> distance ordering requires passing stuff through a pairing-heap :-(
>>
>> I don't know how to best deal with that, especially not in the simple
>> API. I don't think we can "scan forward" stuff from the pairing heap,
>> so the only idea I have is actually having two pairing-heaps. Or maybe
>> using the pairing heap for prefetching, but stashing the prefetched
>> pointers into an array and then returning stuff from it.
>>
>> In the patch I simply prefetch items before we add them to the
>> pairing heap, which is good enough for demonstrating the benefits.
>
> I think it'd be perfectly fair to just not tackle distance queries for
> now.
>

My concern is that if we cut this from v0 entirely, we'll end up with an
API that won't be suitable for adding distance queries later.

>
>> 2) prefetching from executor
>>
>> Another question is whether the prefetching shouldn't actually happen
>> even higher - in the executor. That's what Andres suggested during the
>> unconference, and it kinda makes sense. That's where we do prefetching
>> for bitmap heap scans, so why should this happen lower, right?
>
> Yea. I think it also provides potential for further optimizations in
> the future to do it at that layer.
>
> One thing I have been wondering around this is whether we should not
> have split the code for IOS and plain indexscans...
>

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?

>
>> 4) per-leaf prefetching
>>
>> The code is restricted to prefetching items from only one leaf page.
>> If the index scan needs to scan multiple (many) leaf pages, we have to
>> process the first leaf page first before reading / prefetching the
>> next one.
>>
>> I think this is an acceptable limitation, certainly for v0.
>> Prefetching across multiple leaf pages seems way more complex
>> (particularly for the cases using pairing heap), so let's leave this
>> for the future.
>
> Hm. I think that really depends on the shape of the API we end up with.
> If we move the responsibility more towards the executor, I think it
> very well could end up being just as simple to prefetch across index
> pages.
>

Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).

>
>> 5) index-only scans
>>
>> I'm not sure what to do about index-only scans. On the one hand, the
>> point of IOS is not to read stuff from the heap at all, so why
>> prefetch it. OTOH if there are many allvisible=false pages, we still
>> have to access them. And if that happens, this leads to the bizarre
>> situation that IOS is slower than a regular index scan. But to address
>> this, we'd have to consider the visibility during prefetching.
>
> That should be easy to do, right? It doesn't seem particularly
> complicated (famous last words), and we need to do the VM checks anyway,
> so it seems like it wouldn't add a lot of overhead either.
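For illustration, here is a minimal sketch of what a VM-aware prefetch
could look like (a hypothetical helper, not what the patch currently
does; it assumes the caller keeps the VM buffer pinned across calls, the
way nodeIndexonlyscan.c already does for its visibility checks):

#include "postgres.h"

#include "access/visibilitymap.h"
#include "common/relpath.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/*
 * Prefetch the heap block a TID points at, but only when the visibility
 * map says the block is not all-visible. For IOS the all-visible blocks
 * are never read from the heap, so prefetching them would be wasted work.
 */
static void
ios_prefetch_heap_block(Relation heapRel, ItemPointer tid, Buffer *vmbuffer)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    if (!VM_ALL_VISIBLE(heapRel, blkno, vmbuffer))
        (void) PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}

Whether a check like this belongs in the AM or in the executor ties back
to the API question above, of course.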
>
>> Benchmark / TPC-H
>> -----------------
>>
>> I ran the 22 queries on a 100GB data set, with parallel query either
>> disabled or enabled. And I measured timing (and speedup) for each
>> query. The speedup results look like this (see the attached PDF for
>> details):
>>
>>   query    serial    parallel
>>       1      101%         99%
>>       2      119%        100%
>>       3      100%         99%
>>       4      101%        100%
>>       5      101%        100%
>>       6       12%         99%
>>       7      100%        100%
>>       8       52%         67%
>>      10      102%        101%
>>      11      100%         72%
>>      12      101%        100%
>>      13      100%        101%
>>      14       13%        100%
>>      15      101%        100%
>>      16       99%         99%
>>      17       95%        101%
>>      18      101%        106%
>>      19       30%         40%
>>      20       99%        100%
>>      21      101%        100%
>>      22      101%        107%
>>
>> The percentage is (timing patched / timing master), so <100% means
>> faster and >100% means slower.
>>
>> The different queries are affected depending on the query plan - many
>> queries are close to 100%, which means "no difference". For the serial
>> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
>> while for the parallel case the benefits are somewhat less
>> significant.
>>
>> My explanation is that either (a) the parallel case used a different
>> plan with fewer index scans or (b) the parallel query does more
>> concurrent I/O simply by using parallel workers. Or maybe both.
>>
>> There are a couple of regressions too. I believe those are due to
>> doing too much prefetching in some cases, and some of the heuristics
>> mentioned earlier should eliminate most of this, I think.
>
> I'm a bit confused by some of these numbers. How can OS-level
> prefetching lead to massive prefetching in the already cached case,
> e.g. in tpch q06 and q08? Unless I missed what "xeon / cached
> (speedup)" indicates?
>

I forgot to explain what "cached" means in the TPC-H case. It means the
second execution of the query, so you can imagine it like this:

  for q in `seq 1 22`; do
    1. drop caches and restart postgres
    2. run query $q  -> uncached
    3. run query $q  -> cached
  done

So the second execution has a chance of having data in memory - but
maybe not all of it, because this is a 100GB data set (so ~200GB after
loading) and the machine only has 64GB of RAM.

I think a likely explanation is that some of the data wasn't actually in
memory, so prefetching still did something.

> I think it'd be good to run a performance comparison of the unpatched
> vs patched cases, with prefetching disabled for both. It's possible
> that something in the patch caused unintended changes (say spilling
> during a hashagg, due to larger struct sizes).
>

That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on a data set that fits into RAM, to test the
"properly cached" case.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company