Re: index prefetching - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: index prefetching
Date:
Msg-id: CAH2-WznUX_Sk8LXPifEK9wjsVVD0SOnwz6x63inpisJ1T5R9ig@mail.gmail.com
In response to: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: index prefetching
List: pgsql-hackers
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.

Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).

ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck there.
At the same time, adding useful prefetching for ScalarArrayOpExpr
execution more or less requires that you first improve how nbtree
executes ScalarArrayOpExpr quals in general.
Bear in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.

Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).

This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on.

Not so for the superficially similar (though actually rather
different) plain index scan case -- at least not once you add
prefetching. If you're uselessly processing the same leaf page
multiple times, then there is no way that heap prefetching can notice
that it should be batching things up. The context that would allow
prefetching to work well isn't really available right now. So the
plain index scan case is kinda at a gratuitous disadvantage (with
prefetching) relative to the bitmap index scan case.

Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.

> But there's three shortcomings in logic:
>
> 1) It's not clear the thresholds for prefetching being beneficial and
> switching to bitmap index scans are the same value. And as I'll
> demonstrate later, the prefetching threshold is indeed much lower
> (perhaps a couple dozen matching tuples) on large tables.

As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar.

I suspect that it will be useful to ask the same question for various
specific cases that you might not have thought about just yet. Things
like ScalarArrayOpExpr queries, where bitmap index scans might look
like they have a natural advantage due to an inherent need for random
heap access in the plain index scan case. It's important to carefully
distinguish between cases where plain index scans really are at an
inherent disadvantage relative to bitmap index scans (because there
really is no getting around the need to access the same heap page
many times with an index scan) versus cases that merely *appear* that
way.
Implementation restrictions that only really affect the plain index
scan case (e.g., the lack of a reasonably sized prefetch buffer, or
the ScalarArrayOpExpr thing) should be accounted for when assessing
the viability of index scan + prefetch over bitmap index scan +
prefetch. This is very subtle, but important.

That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
distinguishing cases where plain index scan + prefetch does badly due
to implementation restrictions from cases where it's *inherently* the
wrong strategy. And a testing strategy that starts with very basic
ideas about what I/O is truly necessary might help you to notice and
fix regressions. The difference will never be perfectly crisp, of
course (isn't a bitmap index scan basically just an index scan with a
really huge prefetch buffer anyway?), but it still seems like a useful
direction to go in.

> Implementation
> --------------
>
> When I started looking at this, I only really thought about btree. If
> you look at BTScanPosData, which is what the index scans use to
> represent the current leaf page, you'll notice it has "items", which is
> the array of item pointers (TIDs) that we'll fetch from the heap. Which
> is exactly the thing we need.
>
> So I ended up moving most of the prefetching logic up into indexam.c,
> see the index_prefetch() function. It can't be entirely separate,
> because each AM represents the current state in a different way (e.g.
> SpGistScanOpaque and BTScanOpaque are very different).

Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks to the
executor like a single index scan whose qual involves an array, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element) -- index
scans whose results can be "OR'd together". Is that a modularity
violation? And if so, why? As I've pointed out earlier in this email,
we don't do very much with that context right now -- but clearly we
should.

In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place.

I myself struggled with this when I reviewed the skip scan patch. I
was sure that Tom wouldn't like the way that the skip-scan patch
doubles down on adding more intelligence/planning around how to
execute queries with skippable leading columns. But it turned out that
he saw the merit in it, and basically accepted that general approach.
Maybe this will turn out to be a little like that situation, where
(counter to intuition) what you really need to do is add a new
"layering violation". Sometimes that's the only thing that'll allow
the information to flow to the right place. It's tricky.

> 4) per-leaf prefetching
>
> The code is restricted only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
>
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

I tend to agree that this sort of thing doesn't need to happen in the
first committed version.
But FWIW nbtree could be taught to scan multiple index pages and act
as if it had just processed them as one single index page -- up to a
point. This is at least possible with plain index scans that use MVCC
snapshots (though not index-only scans), since we already drop the pin
on the leaf page there anyway. AFAICT nothing stops us from teaching
nbtree to "lie" to the executor and tell it that we processed 1 leaf
page, even though it was actually 5 leaf pages (maybe there would also
have to be restrictions for the markpos stuff).

> the results look a bit different:
>
>     rows  bitmapscan   master  patched  seqscan
>        1     52703.9     19.5     19.5  31145.6
>       10     51208.1     22.7     24.7  30983.5
>      100     49038.6     39.0     26.3  32085.3
>     1000     53760.4    193.9     48.4  31479.4
>    10000     56898.4   1600.7    187.5  32064.5
>   100000     50975.2  15978.7   1848.9  31587.1
>
> This is a good illustration of a query where bitmapscan is terrible
> (much worse than seqscan, in fact), and the patch is a massive
> improvement over master (about an order of magnitude).
>
> Of course, if you only scan a couple rows, the benefits are much more
> modest (say 40% for 100 rows, which is still significant).

Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer
to favor index scans over bitmap index scans more often).

--
Peter Geoghegan