index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: index prefetching
Msg-id: cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com
List: pgsql-hackers
Hi,

At the pgcon unconference I presented a PoC patch adding prefetching for indexes, along with some benchmark results demonstrating the (pretty significant) benefits etc. The feedback was quite positive, so let me share the current patch more widely.

Motivation
----------

Imagine we have a huge table (much larger than RAM), with an index, and that we're doing a regular index scan (e.g. using a btree index). We first walk the index to the leaf page, read the item pointers from the leaf page, and then start issuing fetches from the heap. The index access is usually pretty cheap, because non-leaf pages are very likely cached, so we may do perhaps one I/O for the leaf. But the fetches from the heap are likely very expensive - unless the table is well clustered, we'll do a random I/O for each item pointer. Easily ~200 or more I/O requests per leaf page.

The problem is that index scans do these requests synchronously at the moment - we get the next TID, fetch the heap page, process the tuple, continue to the next TID, etc. That is slow and can't really leverage the bandwidth of modern storage, which requires longer queues. This patch aims to improve this by async prefetching.

We already do prefetching for bitmap index scans, where the bitmap heap scan prefetches future pages based on effective_io_concurrency. I'm not sure why exactly prefetching was implemented only for bitmap scans, but I suspect the reasoning was that it only helps when there are many matching tuples, and that's what bitmap index scans are for - so it was not worth the implementation effort. But there are three shortcomings in that logic:

1) It's not clear that the threshold at which prefetching becomes beneficial and the threshold for switching to a bitmap index scan are the same value. And as I'll demonstrate later, the prefetching threshold is indeed much lower (perhaps a couple dozen matching tuples) on large tables.

2) Our estimates / planning are not perfect, so we may easily pick an index scan instead of a bitmap scan.
It'd be nice to limit the damage a bit by still prefetching.

3) There are queries that can't use a bitmap scan at all, or for which it is hopelessly inefficient. Consider queries that require ordering, or distance queries with a GiST/SP-GiST index.

Implementation
--------------

When I started looking at this, I only really thought about btree. If you look at BTScanPosData, which is what the index scans use to represent the current leaf page, you'll notice it has "items", an array of item pointers (TIDs) that we'll fetch from the heap - which is exactly the thing we need. The easiest thing would be to just do the prefetching from the btree code. But then I realized there's no particular reason why other index types (except for GIN, which only allows bitmap scans) couldn't do prefetching too. We could have a copy of the logic in each AM, of course, but that seems sloppy and also a violation of layering. After all, bitmap heap scans prefetch from the executor, so the AM seems way too low a level.

So I ended up moving most of the prefetching logic up into indexam.c; see the index_prefetch() function. It can't be entirely separate, because each AM represents its current state in a different way (e.g. SpGistScanOpaque and BTScanOpaque are very different). So what I did is introduce an IndexPrefetch struct, which is part of IndexScanDesc, maintaining all the info about prefetching for that particular scan - current/maximum distance, progress, etc. It also contains two AM-specific callbacks (get_range and get_block) which return the valid range of indexes (into the internal array) and the block number for a given index.

This mostly does the trick, although index_prefetch() is still called from the amgettuple() functions. That seems wrong; we should call it from indexam.c right after calling amgettuple.

Problems / Open questions
-------------------------

There's a couple of issues I ran into. I'll try to list them in the order of importance (most serious ones first).
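Before getting to the issues, here's a minimal standalone sketch to make the interface above concrete. Only the IndexPrefetch name and the get_range / get_block callbacks come from the description above; the field layout, signatures, and the MockLeaf "AM" are invented for illustration, and prefetches are recorded into an array rather than handed to PrefetchBuffer():

```c
#include <assert.h>

/*
 * Standalone sketch of the IndexPrefetch state. The struct name and the
 * get_range / get_block callbacks follow the description above; everything
 * else is guessed, and the real patch would issue PrefetchBuffer() calls
 * where this mock just records block numbers.
 */
typedef unsigned int BlockNumber;

typedef struct IndexPrefetch
{
    int  prefetch_index;    /* next item (index into the AM's array) */
    int  distance;          /* current prefetch distance */
    int  max_distance;      /* cap, e.g. effective_io_concurrency */

    /* AM-specific callbacks, hiding BTScanOpaque / SpGistScanOpaque etc. */
    void        (*get_range)(void *opaque, int *first, int *last);
    BlockNumber (*get_block)(void *opaque, int index);
} IndexPrefetch;

/*
 * Prefetch up to "distance" items ahead of the current position. Returns
 * the number of prefetches issued; block numbers go into "issued" instead
 * of PrefetchBuffer(), so the logic is testable.
 */
static int
index_prefetch(IndexPrefetch *p, void *opaque, int current,
               BlockNumber *issued, int max_issued)
{
    int first, last, n = 0;
    int dist = (p->distance < p->max_distance) ? p->distance : p->max_distance;

    p->get_range(opaque, &first, &last);

    /* never prefetch behind the current position, or before the range */
    if (p->prefetch_index < current + 1)
        p->prefetch_index = current + 1;
    if (p->prefetch_index < first)
        p->prefetch_index = first;

    while (p->prefetch_index <= last &&
           p->prefetch_index <= current + dist &&
           n < max_issued)
        issued[n++] = p->get_block(opaque, p->prefetch_index++);

    return n;
}

/* A mock AM: one "leaf page" backed by a plain array of heap blocks. */
typedef struct MockLeaf { BlockNumber *blocks; int nitems; } MockLeaf;

static void
mock_get_range(void *opaque, int *first, int *last)
{
    MockLeaf *leaf = (MockLeaf *) opaque;

    *first = 0;
    *last = leaf->nitems - 1;
}

static BlockNumber
mock_get_block(void *opaque, int index)
{
    return ((MockLeaf *) opaque)->blocks[index];
}
```

In this sketch the caller would invoke index_prefetch() once per returned tuple (i.e. where indexam.c would call it after amgettuple), so each call issues only the prefetches newly uncovered by the advancing position.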
1) pairing-heap in GiST / SP-GiST

For most AMs, the index state is pretty trivial - matching items from a single leaf page. Prefetching that is pretty trivial, even if the current API is a bit cumbersome. Distance queries on GiST and SP-GiST are a problem, though, because those do not just read the pointers into a simple array - the distance ordering requires passing stuff through a pairing heap :-(

I don't know how best to deal with that, especially not within the simple API. I don't think we can "scan forward" through the pairing heap, so the only idea I have is actually having two pairing heaps. Or maybe using the pairing heap for prefetching, but stashing the prefetched pointers into an array and then returning stuff from it. In the patch I simply prefetch items before we add them to the pairing heap, which is good enough for demonstrating the benefits.

2) prefetching from executor

Another question is whether the prefetching shouldn't actually happen even higher - in the executor. That's what Andres suggested during the unconference, and it kinda makes sense. That's where we do prefetching for bitmap heap scans, so why should this happen lower, right? I'm also not entirely sure the way this interfaces with the AM (through the get_range / get_block callbacks) is very elegant. It did the trick, but it seems a bit cumbersome. I wonder if someone has a better/nicer idea how to do this ...

3) prefetch distance

I think we can do various smart things about the prefetch distance. The current code does about the same thing bitmap scans do - it starts with distance 0 (no prefetching), and then simply ramps the distance up until the maximum value from get_tablespace_io_concurrency(), which is either effective_io_concurrency or the per-tablespace value. I think we could be a bit smarter, and also consider e.g. the estimated number of matching rows (but we shouldn't be too strict, because it's just an estimate).
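For reference, that bitmap-scan-style ramp-up is roughly the following (a simplified standalone sketch of the heuristic the bitmap heap scan node uses to grow its prefetch target, not the patch's actual code):

```c
#include <assert.h>

/*
 * Grow the prefetch distance roughly the way bitmap heap scans grow their
 * prefetch target: start at 0 (no prefetching), ramp up exponentially, and
 * clamp at the maximum taken from get_tablespace_io_concurrency().
 * Simplified sketch, not the real code.
 */
static int
grow_prefetch_distance(int distance, int maximum)
{
    if (distance >= maximum)
        return maximum;         /* already at the cap */
    if (distance >= maximum / 2)
        return maximum;         /* close enough - jump straight to the cap */
    if (distance > 0)
        return 2 * distance;    /* exponential ramp-up */
    return 1;                   /* first step: start prefetching */
}
```

The point of the slow start is that short scans (a handful of matching tuples) never pay for prefetches they won't use, while long scans reach the full distance after only a few steps.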
We could also track some statistics for each scan and use that during rescans (think index scan in a nested loop). But the patch doesn't do any of that now.

4) per-leaf prefetching

The code only prefetches items from one leaf page. If the index scan needs to scan multiple (many) leaf pages, we have to process the first leaf page before reading / prefetching the next one. I think this is an acceptable limitation, certainly for v0. Prefetching across multiple leaf pages seems way more complex (particularly for the cases using a pairing heap), so let's leave this for the future.

5) index-only scans

I'm not sure what to do about index-only scans. On the one hand, the point of IOS is not to read stuff from the heap at all, so why prefetch it? OTOH if there are many allvisible=false pages, we still have to access the heap, and if that happens this leads to the bizarre situation that an IOS is slower than a regular index scan. But to address this, we'd have to consider the visibility during prefetching.

Benchmarks
----------

1) OLTP

For OLTP, this tested different queries with various index types, on data sets constructed to have a certain number of matching rows, forcing different types of query plans (bitmap, index, seqscan). The data sets are ~34GB, which is much more than the available RAM (8GB).

For example for BTREE, we have a query like this:

    SELECT * FROM btree_test WHERE a = $v

with data matching 1, 10, 100, ..., 100000 rows for each $v. The results (timings in milliseconds) look like this:

      rows   bitmapscan    master   patched   seqscan
         1         19.8      20.4      18.8   31875.5
        10         24.4      23.8      23.2   30642.4
       100         27.7      40.0      26.3   31871.3
      1000         45.8     178.0      45.4   30754.1
     10000        171.8    1514.9     174.5   30743.3
    100000       1799.0   15993.3    1777.4   30937.3

This says that the query takes ~31s with a seqscan, 1.8s with a bitmap scan and 16s with an index scan (on master). With the prefetching patch, it takes ~1.8s, i.e. about the same as the bitmap scan.
I don't know where exactly the plan would switch from index scan to bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd bet most of these cases would do a plain index scan.

For a query with ordering:

    SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n

the results look a bit different:

      rows   bitmapscan    master   patched   seqscan
         1      52703.9      19.5      19.5   31145.6
        10      51208.1      22.7      24.7   30983.5
       100      49038.6      39.0      26.3   32085.3
      1000      53760.4     193.9      48.4   31479.4
     10000      56898.4    1600.7     187.5   32064.5
    100000      50975.2   15978.7    1848.9   31587.1

This is a good illustration of a query where bitmapscan is terrible (much worse than seqscan, in fact), and the patch is a massive improvement over master (about an order of magnitude). Of course, if you only scan a couple of rows, the benefits are much more modest (say 40% for 100 rows, which is still significant).

The results for other index types (HASH, GiST, SP-GiST) follow roughly the same pattern. See the attached PDF for more charts, and [1] for complete results.

Benchmark / TPC-H
-----------------

I ran the 22 queries on a 100GB data set, with parallel query either disabled or enabled, and measured the timing (and speedup) for each query. The speedup results look like this (see the attached PDF for details):

    query   serial   parallel
        1     101%       99%
        2     119%      100%
        3     100%       99%
        4     101%      100%
        5     101%      100%
        6      12%       99%
        7     100%      100%
        8      52%       67%
       10     102%      101%
       11     100%       72%
       12     101%      100%
       13     100%      101%
       14      13%      100%
       15     101%      100%
       16      99%       99%
       17      95%      101%
       18     101%      106%
       19      30%       40%
       20      99%      100%
       21     101%      100%
       22     101%      107%

The percentage is (timing patched / master), so <100% means faster and >100% means slower. Queries are affected depending on the query plan - many are close to 100%, which means "no difference". For the serial case, there are about 4 queries that improved a lot (6, 8, 14, 19), while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different plan with fewer index scans, or (b) the parallel query does more concurrent I/O simply by using parallel workers. Or maybe both.

There are a couple of regressions too. I believe those are due to doing too much prefetching in some cases, and some of the heuristics mentioned earlier should eliminate most of this, I think.

regards

[1] https://github.com/tvondra/index-prefetch-tests
[2] https://github.com/tvondra/postgres/tree/dev/index-prefetch

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company