Thread: index prefetching
Hi,

At pgcon unconference I presented a PoC patch adding prefetching for indexes, along with some benchmark results demonstrating the (pretty significant) benefits etc. The feedback was quite positive, so let me share the current patch more widely.

Motivation
----------

Imagine we have a huge table (much larger than RAM), with an index, and that we're doing a regular index scan (e.g. using a btree index). We first walk the index to the leaf page, read the item pointers from the leaf page and then start issuing fetches from the heap.

The index access is usually pretty cheap, because non-leaf pages are very likely cached, so we may do perhaps one I/O for the leaf. But the fetches from heap are likely very expensive - unless the page is clustered, we'll do a random I/O for each item pointer. Easily ~200 or more I/O requests per leaf page. The problem is index scans do these requests synchronously at the moment - we get the next TID, fetch the heap page, process the tuple, continue to the next TID etc.

That is slow and can't really leverage the bandwidth of modern storage, which requires longer queues. This patch aims to improve this by async prefetching.

We already do prefetching for bitmap index scans, where the bitmap heap scan prefetches future pages based on effective_io_concurrency. I'm not sure why exactly prefetching was implemented only for bitmap scans, but I suspect the reasoning was that it only helps when there are many matching tuples, and that's what bitmap index scans are for. So it was not worth the implementation effort.

But there are three shortcomings in that logic:

1) It's not clear the thresholds for prefetching being beneficial and switching to bitmap index scans are the same value. And as I'll demonstrate later, the prefetching threshold is indeed much lower (perhaps a couple dozen matching tuples) on large tables.

2) Our estimates / planning are not perfect, so we may easily pick an index scan instead of a bitmap scan.
It'd be nice to limit the damage a bit by still prefetching.

3) There are queries that can't do a bitmap scan (at all, or because it's hopelessly inefficient). Consider queries that require ordering, or queries by distance with GiST/SP-GiST index.

Implementation
--------------

When I started looking at this, I only really thought about btree. If you look at BTScanPosData, which is what the index scans use to represent the current leaf page, you'll notice it has "items", which is the array of item pointers (TIDs) that we'll fetch from the heap. Which is exactly the thing we need.

The easiest thing would be to just do prefetching from the btree code. But then I realized there's no particular reason why other index types (except for GIN, which only allows bitmap scans) couldn't do prefetching too. We could have a copy in each AM, of course, but that seems sloppy and also a violation of layering. After all, bitmap heap scans do prefetch from the executor, so AM seems way too low level.

So I ended up moving most of the prefetching logic up into indexam.c, see the index_prefetch() function. It can't be entirely separate, because each AM represents the current state in a different way (e.g. SpGistScanOpaque and BTScanOpaque are very different).

So what I did is introduce an IndexPrefetch struct, which is part of IndexScanDesc, maintaining all the info about prefetching for that particular scan - current/maximum distance, progress, etc.

It also contains two AM-specific callbacks (get_range and get_block) which return the valid range of indexes (into the internal array) and the block number for a given index.

This mostly does the trick, although index_prefetch() is still called from the amgettuple() functions. That seems wrong, we should call it from indexam.c right after calling amgettuple.

Problems / Open questions
-------------------------

There are a couple of issues I ran into, I'll try to list them in the order of importance (most serious ones first).
1) pairing-heap in GiST / SP-GiST

For most AMs, the index state is pretty trivial - matching items from a single leaf page. Prefetching that is pretty trivial, even if the current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because those do not just read the pointers into a simple array, as the distance ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple API. I don't think we can "scan forward" stuff from the pairing heap, so the only idea I have is actually having two pairing-heaps. Or maybe using the pairing heap for prefetching, but stashing the prefetched pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing heap, which is good enough for demonstrating the benefits.

2) prefetching from executor

Another question is whether the prefetching shouldn't actually happen even higher - in the executor. That's what Andres suggested during the unconference, and it kinda makes sense. That's where we do prefetching for bitmap heap scans, so why should this happen lower, right?

I'm also not entirely sure the way this interfaces with the AM (through the get_range / get_block callbacks) is very elegant. It did the trick, but it seems a bit cumbersome. I wonder if someone has a better/nicer idea how to do this ...

3) prefetch distance

I think we can do various smart things about the prefetch distance.

The current code does about the same thing bitmap scans do - it starts with distance 0 (no prefetching), and then simply ramps the distance up until the maximum value from get_tablespace_io_concurrency(). Which is either effective_io_concurrency, or per-tablespace value.

I think we could be a bit smarter, and also consider e.g. the estimated number of matching rows (but we shouldn't be too strict, because it's just an estimate).
We could also track some statistics for each scan and use that during rescans (think index scan in a nested loop).

But the patch doesn't do any of that now.

4) per-leaf prefetching

The code only prefetches items from one leaf page. If the index scan needs to scan multiple (many) leaf pages, we have to process the first leaf page first before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching across multiple leaf pages seems way more complex (particularly for the cases using pairing heap), so let's leave this for the future.

5) index-only scans

I'm not sure what to do about index-only scans. On the one hand, the point of IOS is not to read stuff from the heap at all, so why prefetch it. OTOH if there are many allvisible=false pages, we still have to access those. And if that happens, this leads to the bizarre situation that IOS is slower than a regular index scan. But to address this, we'd have to consider the visibility during prefetching.

Benchmarks
----------

1) OLTP

For OLTP, this tested different queries with various index types, on data sets constructed to have a certain number of matching rows, forcing different types of query plans (bitmap, index, seqscan).

The data sets are ~34GB, which is much more than available RAM (8GB).

For example for BTREE, we have a query like this:

    SELECT * FROM btree_test WHERE a = $v

with data matching 1, 10, 100, ..., 100000 rows for each $v. The results look like this (timings in ms):

      rows   bitmapscan     master    patched    seqscan
         1         19.8       20.4       18.8    31875.5
        10         24.4       23.8       23.2    30642.4
       100         27.7       40.0       26.3    31871.3
      1000         45.8      178.0       45.4    30754.1
     10000        171.8     1514.9      174.5    30743.3
    100000       1799.0    15993.3     1777.4    30937.3

This says that the query takes ~31s with a seqscan, 1.8s with a bitmap scan and 16s with an index scan (on master). With the prefetching patch, it takes about ~1.8s, i.e. about the same as the bitmap scan.
I don't know where exactly the plan would switch from index scan to bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd bet most of the cases would do plain index scan.

For a query with ordering:

    SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n

the results look a bit different:

      rows   bitmapscan     master    patched    seqscan
         1      52703.9       19.5       19.5    31145.6
        10      51208.1       22.7       24.7    30983.5
       100      49038.6       39.0       26.3    32085.3
      1000      53760.4      193.9       48.4    31479.4
     10000      56898.4     1600.7      187.5    32064.5
    100000      50975.2    15978.7     1848.9    31587.1

This is a good illustration of a query where bitmapscan is terrible (much worse than seqscan, in fact), and the patch is a massive improvement over master (about an order of magnitude).

Of course, if you only scan a couple rows, the benefits are much more modest (say 40% for 100 rows, which is still significant).

The results for other index types (HASH, GiST, SP-GiST) follow roughly the same pattern. See the attached PDF for more charts, and [1] for complete results.

Benchmark / TPC-H
-----------------

I ran the 22 queries on a 100GB data set, with parallel query either disabled or enabled. And I measured timing (and speedup) for each query. The speedup results look like this (see the attached PDF for details):

    query   serial   parallel
        1     101%        99%
        2     119%       100%
        3     100%        99%
        4     101%       100%
        5     101%       100%
        6      12%        99%
        7     100%       100%
        8      52%        67%
       10     102%       101%
       11     100%        72%
       12     101%       100%
       13     100%       101%
       14      13%       100%
       15     101%       100%
       16      99%        99%
       17      95%       101%
       18     101%       106%
       19      30%        40%
       20      99%       100%
       21     101%       100%
       22     101%       107%

The percentage is (timing patched / master), so <100% means faster, >100% means slower.

The different queries are affected depending on the query plan - many queries are close to 100%, which means "no difference". For the serial case, there are about 4 queries that improved a lot (6, 8, 14, 19), while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different plan with fewer index scans or (b) the parallel query does more concurrent I/O simply by using parallel workers. Or maybe both.

There are a couple of regressions too. I believe those are due to doing too much prefetching in some cases, and some of the heuristics mentioned earlier should eliminate most of this, I think.

regards

[1] https://github.com/tvondra/index-prefetch-tests
[2] https://github.com/tvondra/postgres/tree/dev/index-prefetch

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
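To make the callback interface from the Implementation section and the ramp-up from point 3 a bit more concrete, here is a toy Python model. It is only a sketch under assumptions, not the patch itself: the class and function names are invented for illustration, the distance is assumed to grow by one per returned tuple, and "issuing a prefetch" just appends a block number to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PrefetchModel:
    """Hypothetical stand-in for the per-scan prefetch state (not the real struct)."""
    get_range: Callable[[], Tuple[int, int]]  # valid [start, end] indexes in the AM's item array
    get_block: Callable[[int], int]           # heap block number for a given index
    max_distance: int                         # e.g. the effective_io_concurrency value
    distance: int = 0                         # current distance, starts at 0 (no prefetching)
    next_to_prefetch: int = 0                 # first index that was not prefetched yet
    issued: List[int] = field(default_factory=list)

    def prefetch(self, current: int) -> None:
        """Called once per returned tuple; `current` is the index just returned."""
        # Assumed ramp-up: grow by one per call until the maximum is reached.
        if self.distance < self.max_distance:
            self.distance += 1
        start, end = self.get_range()
        first = max(self.next_to_prefetch, current + 1, start)
        last = min(current + self.distance, end)  # never look past the current leaf
        for i in range(first, last + 1):
            self.issued.append(self.get_block(i))  # stands in for an async prefetch request
        self.next_to_prefetch = max(self.next_to_prefetch, last + 1)


def simulate_leaf(blocks: List[int], max_distance: int) -> List[int]:
    """Scan one leaf page whose items point at `blocks`; return the prefetched blocks."""
    state = PrefetchModel(
        get_range=lambda: (0, len(blocks) - 1),
        get_block=lambda i: blocks[i],
        max_distance=max_distance,
    )
    for current in range(len(blocks)):
        state.prefetch(current)
    return state.issued
```

In this model the first item of a leaf is always read synchronously and the prefetcher stops at the end of the leaf, which mirrors the per-leaf limitation from point 4. Repeated block numbers are not deduplicated here.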
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

I have an educated guess as to why prefetching was limited to bitmap index scans this whole time: it might have been due to issues with ScalarArrayOpExpr quals.

Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions were supported by both index scans and index-only scans -- not just bitmap scans, which could handle ScalarArrayOpExpr quals even without nbtree directly understanding them. The commit was in late 2011, shortly after the introduction of index-only scans -- which seems to have been the real motivation. And so it seems to me that support for ScalarArrayOpExpr was built with bitmap scans and index-only scans in mind. Plain index scan ScalarArrayOpExpr quals do work, but support for them seems kinda perfunctory to me (maybe you can think of a specific counter-example where plain index scans really benefit from ScalarArrayOpExpr, but that doesn't seem particularly relevant to the original motivation).

ScalarArrayOpExpr for plain index scans don't really make that much sense right now because there is no heap prefetching in the index scan case, which is almost certainly going to be the major bottleneck there. At the same time, adding useful prefetching for ScalarArrayOpExpr execution more or less requires that you first improve how nbtree executes ScalarArrayOpExpr quals in general.
Bear in mind that ScalarArrayOpExpr execution (whether for bitmap index scans or index scans) is related to skip scan/MDAM techniques -- so there are tricky dependencies that need to be considered together.

Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to descend the B-Tree for each array constant -- even though in principle we could avoid all that work in cases that happen to have locality. In other words we'll often descend the tree multiple times and land on exactly the same leaf page again and again, without ever noticing that we could have gotten away with only descending the tree once (it'd also be possible to start the next "descent" one level up, not at the root, intelligently reusing some of the work from an initial descent -- but you don't need anything so fancy to greatly improve matters here).

This lack of smarts around how many times we call _bt_first() to descend the index is merely a silly annoyance when it happens in btgetbitmap(). We do at least sort and deduplicate the array up-front (inside _bt_sort_array_elements()), so there will be significant locality of access each time we needlessly descend the tree. Importantly, there is no prefetching "pipeline" to mess up in the bitmap index scan case -- since that all happens later on. Not so for the superficially similar (though actually rather different) plain index scan case -- at least not once you add prefetching. If you're uselessly processing the same leaf page multiple times, then there is no way that heap prefetching can notice that it should be batching things up. The context that would allow prefetching to work well isn't really available right now. So the plain index scan case is kinda at a gratuitous disadvantage (with prefetching) relative to the bitmap index scan case.

Queries with (say) quals with many constants appearing in an "IN()" are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a committable patch. But you should definitely think about it now. I'm strongly considering working on this problem for 17 anyway, so we may end up collaborating on these aspects of prefetching. Smarter ScalarArrayOpExpr execution for index scans is likely to be quite compelling if it enables heap prefetching.

> But there's three shortcomings in logic:
>
> 1) It's not clear the thresholds for prefetching being beneficial and
> switching to bitmap index scans are the same value. And as I'll
> demonstrate later, the prefetching threshold is indeed much lower
> (perhaps a couple dozen matching tuples) on large tables.

As I mentioned during the pgCon unconference session, I really like your framing of the problem; it makes a lot of sense to directly compare an index scan's execution against a very similar bitmap index scan execution -- there is an imaginary continuum between index scan and bitmap index scan. If the details of when and how we scan the index are rather similar in each case, then there is really no reason why the performance shouldn't be fairly similar. I suspect that it will be useful to ask the same question for various specific cases, that you might not have thought about just yet. Things like ScalarArrayOpExpr queries, where bitmap index scans might look like they have a natural advantage due to an inherent need for random heap access in the plain index scan case.

It's important to carefully distinguish between cases where plain index scans really are at an inherent disadvantage relative to bitmap index scans (because there really is no getting around the need to access the same heap page many times with an index scan) versus cases that merely *appear* that way.
Implementation restrictions that only really affect the plain index scan case (e.g., the lack of a reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) should be accounted for when assessing the viability of index scan + prefetch over bitmap index scan + prefetch. This is very subtle, but important.

That's what I was mostly trying to get at when I talked about testing strategy at the unconference session (this may have been unclear at the time). It could be done in a way that helps you to think about the problem from first principles. It could be really useful as a way of avoiding confusing cases where plain index scan + prefetch does badly due to implementation restrictions, versus cases where it's *inherently* the wrong strategy. And a testing strategy that starts with very basic ideas about what I/O is truly necessary might help you to notice and fix regressions. The difference will never be perfectly crisp, of course (isn't bitmap index scan basically just index scan with a really huge prefetch buffer anyway?), but it still seems like a useful direction to go in.

> Implementation
> --------------
>
> When I started looking at this, I only really thought about btree. If
> you look at BTScanPosData, which is what the index scans use to
> represent the current leaf page, you'll notice it has "items", which is
> the array of item pointers (TIDs) that we'll fetch from the heap. Which
> is exactly the thing we need.

> So I ended up moving most of the prefetching logic up into indexam.c,
> see the index_prefetch() function. It can't be entirely separate,
> because each AM represents the current state in a different way (e.g.
> SpGistScanOpaque and BTScanOpaque are very different).

Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a single index scan whose qual involves an array to the executor, even though nbtree more or less implements it as multiple index scans with plain constant quals (one per unique-ified array element). Index scans whose results can be "OR'd together". Is that a modularity violation? And if so, why? As I've pointed out earlier in this email, we don't do very much with that context right now -- but clearly we should.

In other words, maybe you're right to suspect that doing this in AMs like nbtree is a modularity violation. OTOH, maybe it'll turn out that that's exactly the right place to do it, because that's the only way to make the full context available in one place. I myself struggled with this when I reviewed the skip scan patch. I was sure that Tom wouldn't like the way that the skip-scan patch doubles-down on adding more intelligence/planning around how to execute queries with skippable leading columns. But, it turned out that he saw the merit in it, and basically accepted that general approach. Maybe this will turn out to be a little like that situation, where (counter to intuition) what you really need to do is add a new "layering violation". Sometimes that's the only thing that'll allow the information to flow to the right place. It's tricky.

> 4) per-leaf prefetching
>
> The code is restricted only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
>
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

I tend to agree that this sort of thing doesn't need to happen in the first committed version.
But FWIW nbtree could be taught to scan multiple index pages and act as if it had just processed them as one single index page -- up to a point. This is at least possible with plain index scans that use MVCC snapshots (though not index-only scans), since we already drop the pin on the leaf page there anyway. AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell it that we processed 1 leaf page, even though it was actually 5 leaf pages (maybe there would also have to be restrictions for the markpos stuff).

> the results look a bit different:
>
>    rows   bitmapscan     master    patched    seqscan
>       1      52703.9       19.5       19.5    31145.6
>      10      51208.1       22.7       24.7    30983.5
>     100      49038.6       39.0       26.3    32085.3
>    1000      53760.4      193.9       48.4    31479.4
>   10000      56898.4     1600.7      187.5    32064.5
>  100000      50975.2    15978.7     1848.9    31587.1
>
> This is a good illustration of a query where bitmapscan is terrible
> (much worse than seqscan, in fact), and the patch is a massive
> improvement over master (about an order of magnitude).
>
> Of course, if you only scan a couple rows, the benefits are much more
> modest (say 40% for 100 rows, which is still significant).

Nice! And, it'll be nice to be able to use the kill_prior_tuple optimization in many more cases (possible by teaching the optimizer to favor index scans over bitmap index scans more often).

--
Peter Geoghegan
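The point in this message about repeated _bt_first() descents landing on the same leaf page can be illustrated with a small Python model. This is only a sketch under assumptions: the index is reduced to a sorted list of per-leaf upper bounds, the function name is invented, and it says nothing about how nbtree would actually implement the optimization.

```python
import math
from bisect import bisect_left
from typing import Sequence, Tuple


def count_descents(leaf_high_keys: Sequence[int],
                   constants: Sequence[int]) -> Tuple[int, int]:
    """Return (naive_descents, locality_aware_descents) for an IN-list.

    leaf_high_keys[i] is the largest key that leaf page i can hold; the
    rightmost leaf is treated as having no upper bound.
    """
    # Sort and deduplicate the array up-front, as described for nbtree.
    consts = sorted(set(constants))

    # Naive execution: one descent from the root per unique constant.
    naive = len(consts)

    # Locality-aware execution: descend only when the next constant
    # cannot be on the leaf page we are already positioned on.
    aware = 0
    current_high = None
    for c in consts:
        if current_high is not None and c <= current_high:
            continue  # same leaf page, no new descent needed
        idx = bisect_left(leaf_high_keys, c)  # models the tree descent
        if idx >= len(leaf_high_keys) - 1:
            current_high = math.inf           # rightmost leaf
        else:
            current_high = leaf_high_keys[idx]
        aware += 1
    return naive, aware
```

For example, with three leaf pages bounded at 100, 200 and 300, the list (5, 7, 7, 50, 150, 160, 250) needs six descents in the naive model but only three when consecutive constants on the same leaf are batched.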
On 6/8/23 20:56, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> We already do prefetching for bitmap index scans, where the bitmap heap
>> scan prefetches future pages based on effective_io_concurrency. I'm not
>> sure why exactly was prefetching implemented only for bitmap scans, but
>> I suspect the reasoning was that it only helps when there's many
>> matching tuples, and that's what bitmap index scans are for. So it was
>> not worth the implementation effort.
>
> I have an educated guess as to why prefetching was limited to bitmap
> index scans this whole time: it might have been due to issues with
> ScalarArrayOpExpr quals.
>
> Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
> "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
> were supported by both index scans and index-only scans -- not just
> bitmap scans, which could handle ScalarArrayOpExpr quals even without
> nbtree directly understanding them. The commit was in late 2011,
> shortly after the introduction of index-only scans -- which seems to
> have been the real motivation. And so it seems to me that support for
> ScalarArrayOpExpr was built with bitmap scans and index-only scans in
> mind. Plain index scan ScalarArrayOpExpr quals do work, but support
> for them seems kinda perfunctory to me (maybe you can think of a
> specific counter-example where plain index scans really benefit from
> ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
> original motivation).
>

I don't think SAOP is the reason. I did a bit of digging in the list archives, and found thread [1], which says:

    Regardless of what mechanism is used and who is responsible for doing it someone is going to have to figure out which blocks are specifically interesting to prefetch. Bitmap index scans happen to be the easiest since we've already built up a list of blocks we plan to read.
    Somehow that information has to be pushed to the storage manager to be acted upon.

    Normal index scans are an even more interesting case but I'm not sure how hard it would be to get that information. It may only be convenient to get the blocks from the last leaf page we looked at, for example.

So this suggests we simply started prefetching for the case where the information was readily available, and it'd be harder to do for index scans so that's it.

There's a couple more ~2008 threads mentioning prefetching, bitmap scans and even regular index scans (like [2]). None of them even mentions SAOP stuff at all.

[1] https://www.postgresql.org/message-id/871wa17vxb.fsf%40oxford.xeocode.com
[2] https://www.postgresql.org/message-id/87wsnnz046.fsf%40oxford.xeocode.com

> ScalarArrayOpExpr for plain index scans don't really make that much
> sense right now because there is no heap prefetching in the index scan
> case, which is almost certainly going to be the major bottleneck
> there. At the same time, adding useful prefetching for
> ScalarArrayOpExpr execution more or less requires that you first
> improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
> in mind that ScalarArrayOpExpr execution (whether for bitmap index
> scans or index scans) is related to skip scan/MDAM techniques -- so
> there are tricky dependencies that need to be considered together.
>
> Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
> descend the B-Tree for each array constant -- even though in principle
> we could avoid all that work in cases that happen to have locality.
> In other words we'll often descend the tree multiple times and land on
> exactly the same leaf page again and again, without ever noticing that
> we could have gotten away with only descending the tree once (it'd
> also be possible to start the next "descent" one level up, not at the
> root, intelligently reusing some of the work from an initial descent
> -- but you don't need anything so fancy to greatly improve matters
> here).
>
> This lack of smarts around how many times we call _bt_first() to
> descend the index is merely a silly annoyance when it happens in
> btgetbitmap(). We do at least sort and deduplicate the array up-front
> (inside _bt_sort_array_elements()), so there will be significant
> locality of access each time we needlessly descend the tree.
> Importantly, there is no prefetching "pipeline" to mess up in the
> bitmap index scan case -- since that all happens later on. Not so for
> the superficially similar (though actually rather different) plain
> index scan case -- at least not once you add prefetching. If you're
> uselessly processing the same leaf page multiple times, then there is
> no way that heap prefetching can notice that it should be batching
> things up. The context that would allow prefetching to work well isn't
> really available right now. So the plain index scan case is kinda at a
> gratuitous disadvantage (with prefetching) relative to the bitmap
> index scan case.
>
> Queries with (say) quals with many constants appearing in an "IN()"
> are both common and particularly likely to benefit from prefetching.
> I'm not suggesting that you need to address this to get to a
> committable patch. But you should definitely think about it now. I'm
> strongly considering working on this problem for 17 anyway, so we may
> end up collaborating on these aspects of prefetching. Smarter
> ScalarArrayOpExpr execution for index scans is likely to be quite
> compelling if it enables heap prefetching.
>

Even if SAOP (probably) wasn't the reason, I think you're right it may be an issue for prefetching, causing regressions. It didn't occur to me before, because I'm not that familiar with the btree code and/or how it deals with SAOP (and didn't really intend to study it too deeply).

So if you're planning to work on this for PG17, collaborating on it would be great.

For now I plan to just ignore SAOP, or maybe just disable prefetching for SAOP index scans if it proves to be prone to regressions. That's not great, but at least it won't make matters worse.

>> But there's three shortcomings in logic:
>>
>> 1) It's not clear the thresholds for prefetching being beneficial and
>> switching to bitmap index scans are the same value. And as I'll
>> demonstrate later, the prefetching threshold is indeed much lower
>> (perhaps a couple dozen matching tuples) on large tables.
>
> As I mentioned during the pgCon unconference session, I really like
> your framing of the problem; it makes a lot of sense to directly
> compare an index scan's execution against a very similar bitmap index
> scan execution -- there is an imaginary continuum between index scan
> and bitmap index scan. If the details of when and how we scan the
> index are rather similar in each case, then there is really no reason
> why the performance shouldn't be fairly similar. I suspect that it
> will be useful to ask the same question for various specific cases,
> that you might not have thought about just yet. Things like
> ScalarArrayOpExpr queries, where bitmap index scans might look like
> they have a natural advantage due to an inherent need for random heap
> access in the plain index scan case.
>

Yeah, although all the tests were done with a random table generated like this:

    insert into btree_test select $d * random(), md5(i::text)
      from generate_series(1, $ROWS) s(i)

So it's damn random anyway.
Although maybe it's random even for the bitmap case, so maybe if the SAOP had some sort of locality, that'd be an advantage for the bitmap scan. But what would such a table look like?

I guess something like this might be a "nice" bad case:

    insert into btree_test select mod(i,100000), md5(i::text)
      from generate_series(1, $ROWS) s(i)

    select * from btree_test where a in (999, 1000, 1001, 1002)

The values are likely colocated on the same heap page, so the bitmap scan is going to do a single prefetch. With index scan we'll prefetch them repeatedly. I'll give it a try.

> It's important to carefully distinguish between cases where plain
> index scans really are at an inherent disadvantage relative to bitmap
> index scans (because there really is no getting around the need to
> access the same heap page many times with an index scan) versus cases
> that merely *appear* that way. Implementation restrictions that only
> really affect the plain index scan case (e.g., the lack of a
> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> should be accounted for when assessing the viability of index scan +
> prefetch over bitmap index scan + prefetch. This is very subtle, but
> important.
>

I do agree, but what do you mean by "assessing"? Wasn't the agreement at the unconference session that we'd not tweak costing? So ultimately, this does not really affect which scan type we pick. We'll keep making the same planning decisions as today, no?

If we pick index scan and enable prefetching, causing a regression (e.g. for the SAOP with locality), that'd be bad. But how is that related to viability of index scans over bitmap index scans?

> That's what I was mostly trying to get at when I talked about testing
> strategy at the unconference session (this may have been unclear at
> the time). It could be done in a way that helps you to think about the
> problem from first principles.
> It could be really useful as a way of
> avoiding confusing cases where plain index scan + prefetch does badly
> due to implementation restrictions, versus cases where it's
> *inherently* the wrong strategy. And a testing strategy that starts
> with very basic ideas about what I/O is truly necessary might help you
> to notice and fix regressions. The difference will never be perfectly
> crisp, of course (isn't bitmap index scan basically just index scan
> with a really huge prefetch buffer anyway?), but it still seems like a
> useful direction to go in.
>

I'm all for building a more comprehensive set of test cases - the stuff presented at pgcon was good for demonstration, but it certainly is not enough for testing. The SAOP queries are a great addition, I also plan to run those queries on different (less random) data sets, etc. We'll probably discover more interesting cases as the patch improves.

>> Implementation
>> --------------
>>
>> When I started looking at this, I only really thought about btree. If
>> you look at BTScanPosData, which is what the index scans use to
>> represent the current leaf page, you'll notice it has "items", which is
>> the array of item pointers (TIDs) that we'll fetch from the heap. Which
>> is exactly the thing we need.
>
>> So I ended up moving most of the prefetching logic up into indexam.c,
>> see the index_prefetch() function. It can't be entirely separate,
>> because each AM represents the current state in a different way (e.g.
>> SpGistScanOpaque and BTScanOpaque are very different).
>
> Maybe you were right to do that, but I'm not entirely sure.
>
> Bear in mind that the ScalarArrayOpExpr case already looks like a
> single index scan whose qual involves an array to the executor, even
> though nbtree more or less implements it as multiple index scans with
> plain constant quals (one per unique-ified array element). Index scans
> whose results can be "OR'd together". Is that a modularity violation?
> And if so, why?
As I've pointed out earlier in this email, we don't do > very much with that context right now -- but clearly we should. > > In other words, maybe you're right to suspect that doing this in AMs > like nbtree is a modularity violation. OTOH, maybe it'll turn out that > that's exactly the right place to do it, because that's the only way > to make the full context available in one place. I myself struggled > with this when I reviewed the skip scan patch. I was sure that Tom > wouldn't like the way that the skip-scan patch doubles-down on adding > more intelligence/planning around how to execute queries with > skippable leading columns. But, it turned out that he saw the merit in > it, and basically accepted that general approach. Maybe this will turn > out to be a little like that situation, where (counter to intuition) > what you really need to do is add a new "layering violation". > Sometimes that's the only thing that'll allow the information to flow > to the right place. It's tricky. > There are two aspects why I think AM is not the right place: - accessing table from index code seems backwards - we already do prefetching from the executor (nodeBitmapHeapscan.c) It feels kinda wrong in hindsight. >> 4) per-leaf prefetching >> >> The code is restricted only prefetches items from one leaf page. If the >> index scan needs to scan multiple (many) leaf pages, we have to process >> the first leaf page first before reading / prefetching the next one. >> >> I think this is acceptable limitation, certainly for v0. Prefetching >> across multiple leaf pages seems way more complex (particularly for the >> cases using pairing heap), so let's leave this for the future. > > I tend to agree that this sort of thing doesn't need to happen in the > first committed version. But FWIW nbtree could be taught to scan > multiple index pages and act as if it had just processed them as one > single index page -- up to a point. 
This is at least possible with
> plain index scans that use MVCC snapshots (though not index-only
> scans), since we already drop the pin on the leaf page there anyway.
> AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
> it that we processed 1 leaf page, even though it was actually 5 leaf pages
> (maybe there would also have to be restrictions for the markpos stuff).
>

Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.

>> the results look a bit different:
>>
>>     rows   bitmapscan     master    patched    seqscan
>>        1      52703.9       19.5       19.5    31145.6
>>       10      51208.1       22.7       24.7    30983.5
>>      100      49038.6       39.0       26.3    32085.3
>>     1000      53760.4      193.9       48.4    31479.4
>>    10000      56898.4     1600.7      187.5    32064.5
>>   100000      50975.2    15978.7     1848.9    31587.1
>>
>> This is a good illustration of a query where bitmapscan is terrible
>> (much worse than seqscan, in fact), and the patch is a massive
>> improvement over master (about an order of magnitude).
>>
>> Of course, if you only scan a couple rows, the benefits are much more
>> modest (say 40% for 100 rows, which is still significant).
>
> Nice! And, it'll be nice to be able to use the kill_prior_tuple
> optimization in many more cases (possible by teaching the optimizer to
> favor index scans over bitmap index scans more often).
>

Right, I forgot to mention that benefit. Although, that'd only happen if
we actually choose index scans in more places, which I guess would
require tweaking the costing model ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Normal index scans are an even more interesting case but I'm not > sure how hard it would be to get that information. It may only be > convenient to get the blocks from the last leaf page we looked at, > for example. > > So this suggests we simply started prefetching for the case where the > information was readily available, and it'd be harder to do for index > scans so that's it. What the exact historical timeline is may not be that important. My emphasis on ScalarArrayOpExpr is partly due to it being a particularly compelling case for both parallel index scan and prefetching, in general. There are many queries that have huge in() lists that naturally benefit a great deal from prefetching. Plus they're common. > Even if SAOP (probably) wasn't the reason, I think you're right it may > be an issue for prefetching, causing regressions. It didn't occur to me > before, because I'm not that familiar with the btree code and/or how it > deals with SAOP (and didn't really intend to study it too deeply). I'm pretty sure that you understand this already, but just in case: ScalarArrayOpExpr doesn't even "get the blocks from the last leaf page" in many important cases. Not really -- not in the sense that you'd hope and expect. We're senselessly processing the same index leaf page multiple times and treating it as a different, independent leaf page. That makes heap prefetching of the kind you're working on utterly hopeless, since it effectively throws away lots of useful context. Obviously that's the fault of nbtree ScalarArrayOpExpr handling, not the fault of your patch. > So if you're planning to work on this for PG17, collaborating on it > would be great. > > For now I plan to just ignore SAOP, or maybe just disabling prefetching > for SAOP index scans if it proves to be prone to regressions. That's not > great, but at least it won't make matters worse. 
Makes sense, but I hope that it won't come to that. IMV it's actually quite reasonable that you didn't expect to have to think about ScalarArrayOpExpr at all -- it would make a lot of sense if that was already true. But the fact is that it works in a way that's pretty silly and naive right now, which will impact prefetching. I wasn't really thinking about regressions, though. I was actually more concerned about missing opportunities to get the most out of prefetching. ScalarArrayOpExpr really matters here. > I guess something like this might be a "nice" bad case: > > insert into btree_test mod(i,100000), md5(i::text) > from generate_series(1, $ROWS) s(i) > > select * from btree_test where a in (999, 1000, 1001, 1002) > > The values are likely colocated on the same heap page, the bitmap scan > is going to do a single prefetch. With index scan we'll prefetch them > repeatedly. I'll give it a try. This is the sort of thing that I was thinking of. What are the conditions under which bitmap index scan starts to make sense? Why is the break-even point whatever it is in each case, roughly? And, is it actually because of laws-of-physics level trade-off? Might it not be due to implementation-level issues that are much less fundamental? In other words, might it actually be that we're just doing something stoopid in the case of plain index scans? Something that is just papered-over by bitmap index scans right now? I see that your patch has logic that avoids repeated prefetching of the same block -- plus you have comments that wonder about going further by adding a "small lru array" in your new index_prefetch() function. I asked you about this during the unconference presentation. But I think that my understanding of the situation was slightly different to yours. That's relevant here. I wonder if you should go further than this, by actually sorting the items that you need to fetch as part of processing a given leaf page (I said this at the unconference, you may recall). 
Why should we *ever* pin/access the same heap page more than once per leaf page processed per index scan? Nothing stops us from returning the tuples to the executor in the original logical/index-wise order, despite having actually accessed each leaf page's pointed-to heap pages slightly out of order (with the aim of avoiding extra pin/unpin traffic that isn't truly necessary). We can sort the heap TIDs in scratch memory, then do our actual prefetching + heap access, and then restore the original order before returning anything. This is conceptually a "mini bitmap index scan", though one that takes place "inside" a plain index scan, as it processes one particular leaf page. That's the kind of design that "plain index scan vs bitmap index scan as a continuum" leads me to (a little like the continuum between nested loop joins, block nested loop joins, and merge joins). I bet it would be practical to do things this way, and help a lot with some kinds of queries. It might even be simpler than avoiding excessive prefetching using an LRU cache thing. I'm talking about problems that exist today, without your patch. I'll show a concrete example of the kind of index/index scan that might be affected. Attached is an extract of the server log when the regression tests ran against a server patched to show custom instrumentation. The log output shows exactly what's going on with one particular nbtree opportunistic deletion (my point has nothing to do with deletion, but it happens to be convenient to make my point in this fashion). This specific example involves deletion of tuples from the system catalog index "pg_type_typname_nsp_index". There is nothing very atypical about it; it just shows a certain kind of heap fragmentation that's probably very common. Imagine a plain index scan involving a query along the lines of "select * from pg_type where typname like 'part%' ", or similar. 
This query runs an instant before the example LD_DEAD-bit-driven opportunistic deletion (a "simple deletion" in nbtree parlance) took place. You'll be able to piece together from the log output that there would only be about 4 heap blocks involved with such a query. Ideally, our hypothetical index scan would pin each buffer/heap page exactly once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all, we're talking about a fairly selective query here, that only needs to scan precisely one leaf page (I verified this part too) -- so why wouldn't we expect "index scan parity"? While there is significant clustering on this example leaf page/key space, heap TID is not *perfectly* correlated with the logical/keyspace order of the index -- which can have outsized consequences. Notice that some heap blocks are non-contiguous relative to logical/keyspace/index scan/index page offset number order. We'll end up pinning each of the 4 or so heap pages more than once (sometimes several times each), when in principle we could have pinned each heap page exactly once. In other words, there is way too much of a difference between the case where the tuples we scan are *almost* perfectly clustered (which is what you see in my example) and the case where they're exactly perfectly clustered. In other other words, there is way too much of a difference between plain index scan, and bitmap index scan. (What I'm saying here is only true because this is a composite index and our query uses "like", returning rows matches a prefix -- if our index was on the column "typname" alone and we used a simple equality condition in our query then the Postgres 12 nbtree work would be enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect that there are still relatively many important cases where we perform extra PinBuffer()/UnpinBuffer() calls during plain index scans that only touch one leaf page anyway.) 
Obviously we should expect bitmap index scans to have a natural advantage over plain index scans whenever there is little or no correlation -- that's clear. But that's not what we see here -- we're way too sensitive to minor imperfections in clustering that are naturally present on some kinds of leaf pages. The potential difference in pin/unpin traffic (relative to the bitmap index scan case) seems pathological to me. Ideally, we wouldn't have these kinds of differences at all. It's going to disrupt usage_count on the buffers. > > It's important to carefully distinguish between cases where plain > > index scans really are at an inherent disadvantage relative to bitmap > > index scans (because there really is no getting around the need to > > access the same heap page many times with an index scan) versus cases > > that merely *appear* that way. Implementation restrictions that only > > really affect the plain index scan case (e.g., the lack of a > > reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) > > should be accounted for when assessing the viability of index scan + > > prefetch over bitmap index scan + prefetch. This is very subtle, but > > important. > > > > I do agree, but what do you mean by "assessing"? I mean performance validation. There ought to be a theoretical model that describes the relationship between index scan and bitmap index scan, that has actual predictive power in the real world, across a variety of different cases. Something that isn't sensitive to the current phase of the moon (e.g., heap fragmentation along the lines of my pg_type_typname_nsp_index log output). I particularly want to avoid nasty discontinuities that really make no sense. > Wasn't the agreement at > the unconference session was we'd not tweak costing? So ultimately, this > does not really affect which scan type we pick. We'll keep doing the > same planning decisions as today, no? I'm not really talking about tweaking the costing. 
What I'm saying is that we really should expect index scans to behave similarly to bitmap index scans at runtime, for queries that really don't have much to gain from using a bitmap heap scan (queries that may or may not also benefit from prefetching). There are several reasons why this makes sense to me. One reason is that it makes tweaking the actual costing easier later on. Also, your point about plan robustness was a good one. If we make the wrong choice about index scan vs bitmap index scan, and the consequences aren't so bad, that's a very useful enhancement in itself. The most important reason of all may just be to build confidence in the design. I'm interested in understanding when and how prefetching stops helping. > I'm all for building a more comprehensive set of test cases - the stuff > presented at pgcon was good for demonstration, but it certainly is not > enough for testing. The SAOP queries are a great addition, I also plan > to run those queries on different (less random) data sets, etc. We'll > probably discover more interesting cases as the patch improves. Definitely. > There are two aspects why I think AM is not the right place: > > - accessing table from index code seems backwards > > - we already do prefetching from the executor (nodeBitmapHeapscan.c) > > It feels kinda wrong in hindsight. I'm willing to accept that we should do it the way you've done it in the patch provisionally. It's complicated enough that it feels like I should reserve the right to change my mind. > >> I think this is acceptable limitation, certainly for v0. Prefetching > >> across multiple leaf pages seems way more complex (particularly for the > >> cases using pairing heap), so let's leave this for the future. > Yeah, I'm not saying it's impossible, and imagined we might teach nbtree > to do that. But it seems like work for future someone. Right. 
You probably noticed that this is another case where we'd be making index scans behave more like bitmap index scans (perhaps even including the downsides for kill_prior_tuple that accompany not processing each leaf page inline). There is probably a point where that ceases to be sensible, but I don't know what that point is. They're way more similar than we seem to imagine. -- Peter Geoghegan
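The "sort the heap TIDs, fetch, then restore the original order" idea from this message can be sketched in a few lines. This is only a toy model under stated assumptions, not nbtree or executor code: the heap is a plain dict, a "pin" is just a counter, and the names `count_pins_index_order`, `fetch_leaf_page_sorted` and `read_block` are made up for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

TID = Tuple[int, int]  # (heap block number, offset within that block)


def count_pins_index_order(tids: List[TID]) -> int:
    """Pins taken when TIDs are visited in index order, assuming only the
    most recently used heap page stays pinned (a simplifying assumption)."""
    pins = 0
    current_block = None
    for block, _offset in tids:
        if block != current_block:
            pins += 1
            current_block = block
    return pins


def fetch_leaf_page_sorted(
    tids: List[TID], read_block: Callable[[int], Dict[int, Any]]
) -> Tuple[List[Any], int]:
    """Visit each heap block once, yet return tuples in index order."""
    # Group by heap block, remembering each TID's position in index order.
    by_block: Dict[int, List[Tuple[int, int]]] = {}
    for pos, (block, offset) in enumerate(tids):
        by_block.setdefault(block, []).append((pos, offset))

    results: List[Any] = [None] * len(tids)
    pins = 0
    for block in sorted(by_block):      # heap order, one visit per block
        page = read_block(block)        # hypothetical stand-in for a buffer read
        pins += 1
        for pos, offset in by_block[block]:
            results[pos] = page[offset]  # slot back into index order
    return results, pins


# Toy heap: block number -> {offset: tuple}
heap = {10: {1: "a", 2: "b"}, 11: {1: "c"}, 12: {1: "d", 2: "e"}}
leaf_tids = [(10, 1), (12, 1), (10, 2), (11, 1), (12, 2)]

rows, sorted_pins = fetch_leaf_page_sorted(leaf_tids, heap.__getitem__)
print(rows, sorted_pins, count_pins_index_order(leaf_tids))
```

In this toy case the index-order walk changes heap page on every TID (5 pins), while the grouped walk touches each of the 3 blocks once and still returns the rows in index order. The sketch reads the whole leaf page's worth of heap blocks up front, so it says nothing about scans that stop early.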
Hi, On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote: > At pgcon unconference I presented a PoC patch adding prefetching for > indexes, along with some benchmark results demonstrating the (pretty > significant) benefits etc. The feedback was quite positive, so let me > share the current patch more widely. I'm really excited about this work. > 1) pairing-heap in GiST / SP-GiST > > For most AMs, the index state is pretty trivial - matching items from a > single leaf page. Prefetching that is pretty trivial, even if the > current API is a bit cumbersome. > > Distance queries on GiST and SP-GiST are a problem, though, because > those do not just read the pointers into a simple array, as the distance > ordering requires passing stuff through a pairing-heap :-( > > I don't know how to best deal with that, especially not in the simple > API. I don't think we can "scan forward" stuff from the pairing heap, so > the only idea I have is actually having two pairing-heaps. Or maybe > using the pairing heap for prefetching, but stashing the prefetched > pointers into an array and then returning stuff from it. > > In the patch I simply prefetch items before we add them to the pairing > heap, which is good enough for demonstrating the benefits. I think it'd be perfectly fair to just not tackle distance queries for now. > 2) prefetching from executor > > Another question is whether the prefetching shouldn't actually happen > even higher - in the executor. That's what Andres suggested during the > unconference, and it kinda makes sense. That's where we do prefetching > for bitmap heap scans, so why should this happen lower, right? Yea. I think it also provides potential for further optimizations in the future to do it at that layer. One thing I have been wondering around this is whether we should not have split the code for IOS and plain indexscans... > 4) per-leaf prefetching > > The code is restricted only prefetches items from one leaf page. 
If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
>
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.

> 5) index-only scans
>
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access that. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?

> Benchmark / TPC-H
> -----------------
>
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
>
>     query    serial    parallel
>         1      101%         99%
>         2      119%        100%
>         3      100%         99%
>         4      101%        100%
>         5      101%        100%
>         6       12%         99%
>         7      100%        100%
>         8       52%         67%
>        10      102%        101%
>        11      100%         72%
>        12      101%        100%
>        13      100%        101%
>        14       13%        100%
>        15      101%        100%
>        16       99%         99%
>        17       95%        101%
>        18      101%        106%
>        19       30%         40%
>        20       99%        100%
>        21      101%        100%
>        22      101%        107%
>
> The percentage is (timing patched / master, so <100% means faster, >100%
> means slower).
>
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference".
For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
>
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
>
> There are a couple regressions too, I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

Greetings,

Andres Freund
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote: > This is conceptually a "mini bitmap index scan", though one that takes > place "inside" a plain index scan, as it processes one particular leaf > page. That's the kind of design that "plain index scan vs bitmap index > scan as a continuum" leads me to (a little like the continuum between > nested loop joins, block nested loop joins, and merge joins). I bet it > would be practical to do things this way, and help a lot with some > kinds of queries. It might even be simpler than avoiding excessive > prefetching using an LRU cache thing. I'll now give a simpler (though less realistic) example of a case where "mini bitmap index scan" would be expected to help index scans in general, and prefetching during index scans in particular. Something very simple: create table bitmap_parity_test(randkey int4, filler text); create index on bitmap_parity_test (randkey); insert into bitmap_parity_test select (random()*1000), repeat('filler',10) from generate_series(1,250) i; This gives me a table with 4 pages, and an index with 2 pages. The following query selects about half of the rows from the table: select * from bitmap_parity_test where randkey < 500; If I force the query to use a bitmap index scan, I see that the total number of buffers hit is exactly as expected (according to EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We need to access every single heap page once, and we need to access the only leaf page in the index once. I'm sure that you know where I'm going with this already. I'll force the same query to use a plain index scan, and get a very different result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of 89 buffers hit -- 88 of which must just be the same 5 heap pages, again and again. That's just silly. It's probably not all that much slower, but it's not helping things. And it's likely that this effect interferes with the prefetching in your patch. 
Obviously you can come up with a variant of this test case where bitmap index scan does way fewer buffer accesses in a way that really makes sense -- that's not in question. This is a fairly selective index scan, since it only touches one index page -- and yet we still see this difference. (Anybody pedantic enough to want to dispute whether or not this index scan counts as "selective" should run "insert into bitmap_parity_test select i, repeat('actshually',10) from generate_series(2000,1e5) i" before running the "randkey < 500" query, which will make the index much larger without changing any of the details of how the query pins pages -- non-pedants should just skip that step.) -- Peter Geoghegan
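The gap between the two plans in the bitmap_parity_test example can be approximated with a small simulation. This is a rough model under loud assumptions, not a reproduction of the EXPLAIN (ANALYZE, BUFFERS) numbers above: rows are assumed to fill the heap pages evenly in insertion order, keys are drawn with Python's `random` rather than Postgres' `random()`, the index leaf page itself is not counted, and a "pin" is counted every time consecutive index entries point at a different heap block. The function name and parameters are hypothetical.

```python
import random
from typing import Tuple


def simulate_pins(
    n_rows: int = 250,
    n_heap_pages: int = 4,
    cutoff: int = 500,
    seed: int = 0,
) -> Tuple[int, int, int]:
    """Return (matching rows, heap pins in index order, heap pins for a bitmap scan)."""
    rng = random.Random(seed)
    rows_per_page = -(-n_rows // n_heap_pages)  # ceiling division

    # One random key per row; the row's insertion position decides its heap block.
    keys = [rng.randrange(1000) for _ in range(n_rows)]

    # Index order: sorted by key, ties broken by heap position.
    matches = sorted((key, pos) for pos, key in enumerate(keys) if key < cutoff)
    blocks = [pos // rows_per_page for _key, pos in matches]

    # Plain index scan: a new pin whenever the heap block changes.
    index_order_pins = sum(
        1 for i, block in enumerate(blocks) if i == 0 or block != blocks[i - 1]
    )
    # Bitmap scan: each distinct heap block is visited exactly once.
    bitmap_pins = len(set(blocks))
    return len(matches), index_order_pins, bitmap_pins


matched, index_pins, bitmap_pins = simulate_pins()
print(f"{matched} rows: {index_pins} pins in index order, {bitmap_pins} for bitmap")
```

Because the keys are uncorrelated with heap position, nearly every step in index order lands on a different heap block, which is the same shape of result as the buffer counts reported above. Grouping the TIDs by block, as in the "mini bitmap index scan" idea, would bring the first number down to the second in this model.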
On 6/9/23 02:06, Andres Freund wrote: > Hi, > > On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote: >> At pgcon unconference I presented a PoC patch adding prefetching for >> indexes, along with some benchmark results demonstrating the (pretty >> significant) benefits etc. The feedback was quite positive, so let me >> share the current patch more widely. > > I'm really excited about this work. > > >> 1) pairing-heap in GiST / SP-GiST >> >> For most AMs, the index state is pretty trivial - matching items from a >> single leaf page. Prefetching that is pretty trivial, even if the >> current API is a bit cumbersome. >> >> Distance queries on GiST and SP-GiST are a problem, though, because >> those do not just read the pointers into a simple array, as the distance >> ordering requires passing stuff through a pairing-heap :-( >> >> I don't know how to best deal with that, especially not in the simple >> API. I don't think we can "scan forward" stuff from the pairing heap, so >> the only idea I have is actually having two pairing-heaps. Or maybe >> using the pairing heap for prefetching, but stashing the prefetched >> pointers into an array and then returning stuff from it. >> >> In the patch I simply prefetch items before we add them to the pairing >> heap, which is good enough for demonstrating the benefits. > > I think it'd be perfectly fair to just not tackle distance queries for now. > My concern is that if we cut this from v0 entirely, we'll end up with an API that'll not be suitable for adding distance queries later. > >> 2) prefetching from executor >> >> Another question is whether the prefetching shouldn't actually happen >> even higher - in the executor. That's what Andres suggested during the >> unconference, and it kinda makes sense. That's where we do prefetching >> for bitmap heap scans, so why should this happen lower, right? > > Yea. I think it also provides potential for further optimizations in the > future to do it at that layer. 
> > One thing I have been wondering around this is whether we should not have > split the code for IOS and plain indexscans... > Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or did you mean something else? > >> 4) per-leaf prefetching >> >> The code is restricted only prefetches items from one leaf page. If the >> index scan needs to scan multiple (many) leaf pages, we have to process >> the first leaf page first before reading / prefetching the next one. >> >> I think this is acceptable limitation, certainly for v0. Prefetching >> across multiple leaf pages seems way more complex (particularly for the >> cases using pairing heap), so let's leave this for the future. > > Hm. I think that really depends on the shape of the API we end up with. If we > move the responsibility more twoards to the executor, I think it very well > could end up being just as simple to prefetch across index pages. > Maybe. I'm open to that idea if you have idea how to shape the API to make this possible (although perhaps not in v0). > >> 5) index-only scans >> >> I'm not sure what to do about index-only scans. On the one hand, the >> point of IOS is not to read stuff from the heap at all, so why prefetch >> it. OTOH if there are many allvisible=false pages, we still have to >> access that. And if that happens, this leads to the bizarre situation >> that IOS is slower than regular index scan. But to address this, we'd >> have to consider the visibility during prefetching. > > That should be easy to do, right? > It doesn't seem particularly complicated (famous last words), and we need to do the VM checks anyway so it seems like it wouldn't add a lot of overhead either > > >> Benchmark / TPC-H >> ----------------- >> >> I ran the 22 queries on 100GB data set, with parallel query either >> disabled or enabled. And I measured timing (and speedup) for each query. 
>> The speedup results look like this (see the attached PDF for details): >> >> query serial parallel >> 1 101% 99% >> 2 119% 100% >> 3 100% 99% >> 4 101% 100% >> 5 101% 100% >> 6 12% 99% >> 7 100% 100% >> 8 52% 67% >> 10 102% 101% >> 11 100% 72% >> 12 101% 100% >> 13 100% 101% >> 14 13% 100% >> 15 101% 100% >> 16 99% 99% >> 17 95% 101% >> 18 101% 106% >> 19 30% 40% >> 20 99% 100% >> 21 101% 100% >> 22 101% 107% >> >> The percentage is (timing patched / master, so <100% means faster, >100% >> means slower). >> >> The different queries are affected depending on the query plan - many >> queries are close to 100%, which means "no difference". For the serial >> case, there are about 4 queries that improved a lot (6, 8, 14, 19), >> while for the parallel case the benefits are somewhat less significant. >> >> My explanation is that either (a) parallel case used a different plan >> with fewer index scans or (b) the parallel query does more concurrent >> I/O simply by using parallel workers. Or maybe both. >> >> There are a couple regressions too, I believe those are due to doing too >> much prefetching in some cases, and some of the heuristics mentioned >> earlier should eliminate most of this, I think. > > I'm a bit confused by some of these numbers. How can OS-level prefetching lead > to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? > Unless I missed what "xeon / cached (speedup)" indicates? > I forgot to explain what "cached" means in the TPC-H case. It means second execution of the query, so you can imagine it like this: for q in `seq 1 22`; do 1. drop caches and restart postgres 2. run query $q -> uncached 3. run query $q -> cached done So the second execution has a chance of having data in memory - but maybe not all, because this is a 100GB data set (so ~200GB after loading), but the machine only has 64GB of RAM. I think a likely explanation is some of the data wasn't actually in memory, so prefetching still did something. 
> I think it'd be good to run a performance comparison of the unpatched vs > patched cases, with prefetching disabled for both. It's possible that > something in the patch caused unintended changes (say spilling during a > hashagg, due to larger struct sizes). > That's certainly a good idea. I'll do that in the next round of tests. I also plan to do a test on data set that fits into RAM, to test "properly cached" case. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/9/23 01:38, Peter Geoghegan wrote: > On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Normal index scans are an even more interesting case but I'm not >> sure how hard it would be to get that information. It may only be >> convenient to get the blocks from the last leaf page we looked at, >> for example. >> >> So this suggests we simply started prefetching for the case where the >> information was readily available, and it'd be harder to do for index >> scans so that's it. > > What the exact historical timeline is may not be that important. My > emphasis on ScalarArrayOpExpr is partly due to it being a particularly > compelling case for both parallel index scan and prefetching, in > general. There are many queries that have huge in() lists that > naturally benefit a great deal from prefetching. Plus they're common. > Did you mean parallel index scan or bitmap index scan? But yeah, I get the point that SAOP queries are an interesting example of queries to explore. I'll add some to the next round of tests. >> Even if SAOP (probably) wasn't the reason, I think you're right it may >> be an issue for prefetching, causing regressions. It didn't occur to me >> before, because I'm not that familiar with the btree code and/or how it >> deals with SAOP (and didn't really intend to study it too deeply). > > I'm pretty sure that you understand this already, but just in case: > ScalarArrayOpExpr doesn't even "get the blocks from the last leaf > page" in many important cases. Not really -- not in the sense that > you'd hope and expect. We're senselessly processing the same index > leaf page multiple times and treating it as a different, independent > leaf page. That makes heap prefetching of the kind you're working on > utterly hopeless, since it effectively throws away lots of useful > context. Obviously that's the fault of nbtree ScalarArrayOpExpr > handling, not the fault of your patch. 
> I think I understand, although maybe my mental model is wrong. I agree it seems inefficient, but I'm not sure why would it make prefetching hopeless. Sure, it puts index scans at a disadvantage (compared to bitmap scans), but it we pick index scan it should still be an improvement, right? I guess I need to do some testing on a range of data sets / queries, and see how it works in practice. >> So if you're planning to work on this for PG17, collaborating on it >> would be great. >> >> For now I plan to just ignore SAOP, or maybe just disabling prefetching >> for SAOP index scans if it proves to be prone to regressions. That's not >> great, but at least it won't make matters worse. > > Makes sense, but I hope that it won't come to that. > > IMV it's actually quite reasonable that you didn't expect to have to > think about ScalarArrayOpExpr at all -- it would make a lot of sense > if that was already true. But the fact is that it works in a way > that's pretty silly and naive right now, which will impact > prefetching. I wasn't really thinking about regressions, though. I was > actually more concerned about missing opportunities to get the most > out of prefetching. ScalarArrayOpExpr really matters here. > OK >> I guess something like this might be a "nice" bad case: >> >> insert into btree_test mod(i,100000), md5(i::text) >> from generate_series(1, $ROWS) s(i) >> >> select * from btree_test where a in (999, 1000, 1001, 1002) >> >> The values are likely colocated on the same heap page, the bitmap scan >> is going to do a single prefetch. With index scan we'll prefetch them >> repeatedly. I'll give it a try. > > This is the sort of thing that I was thinking of. What are the > conditions under which bitmap index scan starts to make sense? Why is > the break-even point whatever it is in each case, roughly? And, is it > actually because of laws-of-physics level trade-off? Might it not be > due to implementation-level issues that are much less fundamental? 
In > other words, might it actually be that we're just doing something > stoopid in the case of plain index scans? Something that is just > papered-over by bitmap index scans right now? > Yeah, that's partially why I do this kind of testing on a wide range of synthetic data sets - to find cases that behave in unexpected way (say, seem like they should improve but don't). > I see that your patch has logic that avoids repeated prefetching of > the same block -- plus you have comments that wonder about going > further by adding a "small lru array" in your new index_prefetch() > function. I asked you about this during the unconference presentation. > But I think that my understanding of the situation was slightly > different to yours. That's relevant here. > > I wonder if you should go further than this, by actually sorting the > items that you need to fetch as part of processing a given leaf page > (I said this at the unconference, you may recall). Why should we > *ever* pin/access the same heap page more than once per leaf page > processed per index scan? Nothing stops us from returning the tuples > to the executor in the original logical/index-wise order, despite > having actually accessed each leaf page's pointed-to heap pages > slightly out of order (with the aim of avoiding extra pin/unpin > traffic that isn't truly necessary). We can sort the heap TIDs in > scratch memory, then do our actual prefetching + heap access, and then > restore the original order before returning anything. > I think that's possible, and I thought about that a bit (not just for btree, but especially for the distance queries on GiST). But I don't have a good idea if this would be 1% or 50% improvement, and I was concerned it might easily lead to regressions if we don't actually need all the tuples. 
I mean, imagine we have TIDs

  [T1, T2, T3, T4, T5, T6]

Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:

  [T1, T5, T6, T2, T3, T4]

But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.

> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.
>
> I'm talking about problems that exist today, without your patch.
>
> I'll show a concrete example of the kind of index/index scan that
> might be affected.
>
> Attached is an extract of the server log when the regression tests ran
> against a server patched to show custom instrumentation. The log
> output shows exactly what's going on with one particular nbtree
> opportunistic deletion (my point has nothing to do with deletion, but
> it happens to be convenient to make my point in this fashion). This
> specific example involves deletion of tuples from the system catalog
> index "pg_type_typname_nsp_index". There is nothing very atypical
> about it; it just shows a certain kind of heap fragmentation that's
> probably very common.
>
> Imagine a plain index scan involving a query along the lines of
> "select * from pg_type where typname like 'part%' ", or similar. This
> query runs an instant before the example LP_DEAD-bit-driven
> opportunistic deletion (a "simple deletion" in nbtree parlance) took
> place. You'll be able to piece together from the log output that there
> would only be about 4 heap blocks involved with such a query.
> Ideally, our hypothetical index scan would pin each buffer/heap page
> exactly once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After
> all, we're talking about a fairly selective query here, that only needs
> to scan precisely one leaf page (I verified this part too) -- so why
> wouldn't we expect "index scan parity"?
>
> While there is significant clustering on this example leaf page/key
> space, heap TID is not *perfectly* correlated with the
> logical/keyspace order of the index -- which can have outsized
> consequences. Notice that some heap blocks are non-contiguous
> relative to logical/keyspace/index scan/index page offset number order.
>
> We'll end up pinning each of the 4 or so heap pages more than once
> (sometimes several times each), when in principle we could have pinned
> each heap page exactly once. In other words, there is way too much of
> a difference between the case where the tuples we scan are *almost*
> perfectly clustered (which is what you see in my example) and the case
> where they're exactly perfectly clustered. In other other words, there
> is way too much of a difference between plain index scan, and bitmap
> index scan.
>
> (What I'm saying here is only true because this is a composite index
> and our query uses "like", returning rows that match a prefix -- if our
> index was on the column "typname" alone and we used a simple equality
> condition in our query then the Postgres 12 nbtree work would be
> enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
> that there are still relatively many important cases where we perform
> extra PinBuffer()/UnpinBuffer() calls during plain index scans that
> only touch one leaf page anyway.)
>
> Obviously we should expect bitmap index scans to have a natural
> advantage over plain index scans whenever there is little or no
> correlation -- that's clear.
But that's not what we see here -- we're > way too sensitive to minor imperfections in clustering that are > naturally present on some kinds of leaf pages. The potential > difference in pin/unpin traffic (relative to the bitmap index scan > case) seems pathological to me. Ideally, we wouldn't have these kinds > of differences at all. It's going to disrupt usage_count on the > buffers. > I'm not sure I understand all the nuance here, but the thing I take away is to add tests with different levels of correlation, and probably also some multi-column indexes. >>> It's important to carefully distinguish between cases where plain >>> index scans really are at an inherent disadvantage relative to bitmap >>> index scans (because there really is no getting around the need to >>> access the same heap page many times with an index scan) versus cases >>> that merely *appear* that way. Implementation restrictions that only >>> really affect the plain index scan case (e.g., the lack of a >>> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) >>> should be accounted for when assessing the viability of index scan + >>> prefetch over bitmap index scan + prefetch. This is very subtle, but >>> important. >>> >> >> I do agree, but what do you mean by "assessing"? > > I mean performance validation. There ought to be a theoretical model > that describes the relationship between index scan and bitmap index > scan, that has actual predictive power in the real world, across a > variety of different cases. Something that isn't sensitive to the > current phase of the moon (e.g., heap fragmentation along the lines of > my pg_type_typname_nsp_index log output). I particularly want to avoid > nasty discontinuities that really make no sense. > >> Wasn't the agreement at >> the unconference session was we'd not tweak costing? So ultimately, this >> does not really affect which scan type we pick. We'll keep doing the >> same planning decisions as today, no? 
> > I'm not really talking about tweaking the costing. What I'm saying is > that we really should expect index scans to behave similarly to bitmap > index scans at runtime, for queries that really don't have much to > gain from using a bitmap heap scan (queries that may or may not also > benefit from prefetching). There are several reasons why this makes > sense to me. > > One reason is that it makes tweaking the actual costing easier later > on. Also, your point about plan robustness was a good one. If we make > the wrong choice about index scan vs bitmap index scan, and the > consequences aren't so bad, that's a very useful enhancement in > itself. > > The most important reason of all may just be to build confidence in > the design. I'm interested in understanding when and how prefetching > stops helping. > Agreed. >> I'm all for building a more comprehensive set of test cases - the stuff >> presented at pgcon was good for demonstration, but it certainly is not >> enough for testing. The SAOP queries are a great addition, I also plan >> to run those queries on different (less random) data sets, etc. We'll >> probably discover more interesting cases as the patch improves. > > Definitely. > >> There are two aspects why I think AM is not the right place: >> >> - accessing table from index code seems backwards >> >> - we already do prefetching from the executor (nodeBitmapHeapscan.c) >> >> It feels kinda wrong in hindsight. > > I'm willing to accept that we should do it the way you've done it in > the patch provisionally. It's complicated enough that it feels like I > should reserve the right to change my mind. > >>>> I think this is acceptable limitation, certainly for v0. Prefetching >>>> across multiple leaf pages seems way more complex (particularly for the >>>> cases using pairing heap), so let's leave this for the future. > >> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree >> to do that. But it seems like work for future someone. 
> > Right. You probably noticed that this is another case where we'd be > making index scans behave more like bitmap index scans (perhaps even > including the downsides for kill_prior_tuple that accompany not > processing each leaf page inline). There is probably a point where > that ceases to be sensible, but I don't know what that point is. > They're way more similar than we seem to imagine. > OK. Thanks for all the comments. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
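The pin-count discontinuity discussed in the message above can be illustrated with a toy model. This is only a sketch under stated assumptions: TIDs are modelled as (heap block, offset) pairs, and the scan is assumed to keep the current heap page pinned only while consecutive TIDs point at the same block. The function names and the sample TID lists are hypothetical and are not taken from the patch or from the attached log.

```python
from typing import List, Tuple

TID = Tuple[int, int]  # (heap block number, offset within the block)


def pins_in_index_order(tids: List[TID]) -> int:
    """Count pins when heap pages are visited strictly in index order.

    Simplified model: a new pin is needed whenever the heap block differs
    from the block of the previous TID.
    """
    pins = 0
    prev_block = None
    for block, _offset in tids:
        if block != prev_block:
            pins += 1
            prev_block = block
    return pins


def distinct_heap_blocks(tids: List[TID]) -> int:
    """Lower bound on pins: each heap page pinned exactly once."""
    return len({block for block, _offset in tids})


# Perfectly clustered: index order matches heap order.
clustered = [(10, 1), (10, 2), (11, 1), (11, 2),
             (12, 1), (12, 2), (13, 1), (13, 2)]
# Almost clustered: same 4 heap blocks, slightly interleaved.
almost = [(10, 1), (11, 1), (10, 2), (12, 1),
          (11, 2), (13, 1), (12, 2), (13, 2)]

for name, tids in (("clustered", clustered), ("almost", almost)):
    print(name, pins_in_index_order(tids), distinct_heap_blocks(tids))
```

In this model both lists touch the same 4 heap blocks, yet the slightly interleaved one needs twice as many pins, which is the kind of sensitivity to minor fragmentation described above.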
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > What the exact historical timeline is may not be that important. My > > emphasis on ScalarArrayOpExpr is partly due to it being a particularly > > compelling case for both parallel index scan and prefetching, in > > general. There are many queries that have huge in() lists that > > naturally benefit a great deal from prefetching. Plus they're common. > > > > Did you mean parallel index scan or bitmap index scan? I meant parallel index scan (also parallel bitmap index scan). Note that nbtree parallel index scans have special ScalarArrayOpExpr handling code. ScalarArrayOpExpr is kind of special -- it is simultaneously one big index scan (to the executor), and lots of small index scans (to nbtree). Unlike the queries that you've looked at so far, which really only have one plausible behavior at execution time, there are many ways that ScalarArrayOpExpr index scans can be executed at runtime -- some much faster than others. The nbtree implementation can in principle reorder how it processes ranges from the key space (i.e. each range of array elements) with significant flexibility. > I think I understand, although maybe my mental model is wrong. I agree > it seems inefficient, but I'm not sure why would it make prefetching > hopeless. Sure, it puts index scans at a disadvantage (compared to > bitmap scans), but it we pick index scan it should still be an > improvement, right? Hopeless might have been too strong of a word. More like it'd fall far short of what is possible to do with a ScalarArrayOpExpr with a given high end server. The quality of the implementation (including prefetching) could make a huge difference to how well we make use of the available hardware resources. 
A really high quality implementation of ScalarArrayOpExpr + prefetching can keep the system busy with useful work, which is less true with other types of queries, which have inherently less predictable I/O (and often have less I/O overall). What could be more amenable to predicting I/O patterns than a query with a large IN() list, with many constants that can be processed in whatever order makes sense at runtime? What I'd like to do with ScalarArrayOpExpr is to teach nbtree to coalesce together those "small index scans" into "medium index scans" dynamically, where that makes sense. That's the main part that's missing right now. Dynamic behavior matters a lot with ScalarArrayOpExpr stuff -- that's where the challenge lies, but also where the opportunities are. Prefetching builds on all that. > I guess I need to do some testing on a range of data sets / queries, and > see how it works in practice. If I can figure out a way of getting ScalarArrayOpExpr to visit each leaf page exactly once, that might be enough to make things work really well most of the time. Maybe it won't even be necessary to coordinate very much, in the end. Unsure. I've already done a lot of work that tries to minimize the chances of regular (non-ScalarArrayOpExpr) queries accessing more than a single leaf page, which will help your strategy of just prefetching items from a single leaf page at a time -- that will get you pretty far already. Consider the example of the tenk2_hundred index from the bt_page_items documentation. You'll notice that the high key for the page shown in the docs (and every other page in the same index) nicely makes the leaf page boundaries "aligned" with natural keyspace boundaries, due to suffix truncation. That helps index scans to access no more than a single leaf page when accessing any one distinct "hundred" value. We are careful to do the right thing with the "boundary cases" when we descend the tree, too. 
This _bt_search behavior builds on the way that suffix truncation influences the on-disk structure of indexes. Queries such as "select * from tenk2 where hundred = ?" will each return 100 rows spread across almost as many heap pages. That's a fairly large number of rows/heap pages, but we still only need to access one leaf page for every possible constant value (every "hundred" value that might be specified as the ? in my point query example). It doesn't matter if it's the leftmost or rightmost item on a leaf page -- we always descend to exactly the correct leaf page directly, and we always terminate the scan without having to move to the right sibling page (we check the high key before going to the right page in some cases, per the optimization added by commit 29b64d1d). The same kind of behavior is also seen with the TPC-C line items primary key index, which is a composite index. We want to access the items from a whole order in one go, from one leaf page -- and we reliably do the right thing there too (though with some caveats about CREATE INDEX). We should never have to access more than one leaf page to read a single order's line items. This matters because it's quite natural to want to access whole orders with that particular table/workload (it's also unnatural to only access one single item from any given order). Obviously there are many queries that need to access two or more leaf pages, because that's just what needs to happen. My point is that we *should* only do that when it's truly necessary on modern Postgres versions, since the boundaries between pages are "aligned" with the "natural boundaries" from the keyspace/application. Maybe your testing should verify that this effect is actually present, though. It would be a shame if we sometimes messed up prefetching that could have worked well due to some issue with how page splits divide up items. 
CREATE INDEX is much less smart about suffix truncation -- it isn't capable of the same kind of tricks as nbtsplitloc.c, even though it could be taught to do roughly the same thing. Hopefully this won't be an issue for your work. The tenk2 case still works as expected with CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the TPC-C line items PK will leave the index with some "orders" (or whatever the natural grouping of things is) that span more than a single leaf page, which is undesirable, and might hinder your prefetching work. I wouldn't mind fixing that if it turned out to hurt your leaf-page-at-a-time prefetching patch. Something to consider. We can fit at most 17 TPC-C orders on each order line PK leaf page. Could be as few as 15. If we do the wrong thing with prefetching for 2 out of every 15 orders then that's a real problem, but is still subtle enough to easily miss with conventional benchmarking. I've had a lot of success with paying close attention to all the little boundary cases, which is why I'm kind of zealous about it now. > > I wonder if you should go further than this, by actually sorting the > > items that you need to fetch as part of processing a given leaf page > > (I said this at the unconference, you may recall). Why should we > > *ever* pin/access the same heap page more than once per leaf page > > processed per index scan? Nothing stops us from returning the tuples > > to the executor in the original logical/index-wise order, despite > > having actually accessed each leaf page's pointed-to heap pages > > slightly out of order (with the aim of avoiding extra pin/unpin > > traffic that isn't truly necessary). We can sort the heap TIDs in > > scratch memory, then do our actual prefetching + heap access, and then > > restore the original order before returning anything. > > > > I think that's possible, and I thought about that a bit (not just for > btree, but especially for the distance queries on GiST). 
But I don't > have a good idea if this would be 1% or 50% improvement, and I was > concerned it might easily lead to regressions if we don't actually need > all the tuples. I get that it could be invasive. I have the sense that just pinning the same heap page more than once in very close succession is just the wrong thing to do, with or without prefetching. > I mean, imagine we have TIDs > > [T1, T2, T3, T4, T5, T6] > > Maybe T1, T5, T6 are from the same page, so per your proposal we might > reorder and prefetch them in this order: > > [T1, T5, T6, T2, T3, T4] > > But maybe we only need [T1, T2] because of a LIMIT, and the extra work > we did on processing T5, T6 is wasted. Yeah, that's possible. But isn't that par for the course? Any optimization that involves speculation (including all prefetching) comes with similar risks. They can be managed. I don't think that we'd literally order by TID...we wouldn't change the order that each heap page was *initially* pinned. We'd just reorder the tuples minimally using an approach that is sufficient to avoid repeated pinning of heap pages during processing of any one leaf page's heap TIDs. ISTM that the risk of wasting work is limited to wasting cycles on processing extra tuples from a heap page that we definitely had to process at least one tuple from already. That doesn't seem particularly risky, as speculative optimizations go. The downside is bounded and well understood, while the upside could be significant. I really don't have that much confidence in any of this just yet. I'm not trying to make this project more difficult. I just can't help but notice that the order that index scans end up pinning heap pages already has significant problems, and is sensitive to things like small amounts of heap fragmentation -- maybe that's not a great basis for prefetching. 
I *really* hate any kind of sharp discontinuity, where a minor change in an input (e.g., from minor amounts of heap fragmentation) has outsized impact on an output (e.g., buffers pinned). Interactions like that tend to be really pernicious -- they lead to bad performance that goes unnoticed and unfixed because the problem effectively camouflages itself. It may even be easier to make the conservative (perhaps paranoid) assumption that weird nasty interactions will cause harm somewhere down the line...why take a chance? I might end up prototyping this myself. I may have to put my money where my mouth is. :-) -- Peter Geoghegan
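The minimal reordering described in the message above (group the TIDs of one leaf page by heap block, keep the order in which heap pages are first pinned, then hand tuples back in index order) might be sketched roughly as follows. This is a hypothetical illustration, not code from nbtree or from the patch; `heap_access_order`, `fetch_in_index_order` and the `fetch` callback are made-up names, and TIDs are modelled as (heap block, offset) pairs.

```python
from typing import Callable, List, Tuple

TID = Tuple[int, int]  # (heap block number, offset within the block)


def heap_access_order(tids: List[TID]) -> List[int]:
    """Return positions into `tids`, grouped by heap block.

    Blocks keep the order in which they first appear in the index, so the
    first pin of each heap page happens in the same order as today.
    """
    by_block = {}  # dicts preserve insertion order
    for pos, (block, _offset) in enumerate(tids):
        by_block.setdefault(block, []).append(pos)
    return [pos for positions in by_block.values() for pos in positions]


def fetch_in_index_order(tids: List[TID], fetch: Callable[[TID], object]) -> list:
    """Fetch tuples block by block, but return them in index order."""
    results = [None] * len(tids)
    for pos in heap_access_order(tids):
        results[pos] = fetch(tids[pos])  # slot by original index position
    return results


# T1..T6 from the example: T1, T5, T6 share a heap page.
tids = [(1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (1, 3)]
print(heap_access_order(tids))  # positions of T1, T5, T6, T2, T3, T4
```

The sketch fetches everything eagerly, so it does not address the LIMIT concern raised earlier in the thread; a real implementation would have to bound how far ahead it speculates.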
\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
Hi,

On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
> >
> >> 2) prefetching from executor
> >>
> >> Another question is whether the prefetching shouldn't actually happen
> >> even higher - in the executor. That's what Andres suggested during the
> >> unconference, and it kinda makes sense. That's where we do prefetching
> >> for bitmap heap scans, so why should this happen lower, right?
> >
> > Yea. I think it also provides potential for further optimizations in the
> > future to do it at that layer.
> >
> > One thing I have been wondering around this is whether we should not have
> > split the code for IOS and plain indexscans...
> >
>
> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
> did you mean something else?

Yes, I meant that.

> >> 4) per-leaf prefetching
> >>
> >> The code only prefetches items from one leaf page. If the
> >> index scan needs to scan multiple (many) leaf pages, we have to process
> >> the first leaf page first before reading / prefetching the next one.
> >>
> >> I think this is an acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
> >
> > Hm. I think that really depends on the shape of the API we end up with. If we
> > move the responsibility more towards the executor, I think it very well
> > could end up being just as simple to prefetch across index pages.
> >
>
> Maybe. I'm open to that idea if you have an idea how to shape the API to
> make this possible (although perhaps not in v0).

I'll try to have a look.

> > I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> > to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
> > Unless I missed what "xeon / cached (speedup)" indicates?
> >
>
> I forgot to explain what "cached" means in the TPC-H case.
It means > second execution of the query, so you can imagine it like this: > > for q in `seq 1 22`; do > > 1. drop caches and restart postgres Are you doing it in that order? If so, the pagecache can end up being seeded by postgres writing out dirty buffers. > 2. run query $q -> uncached > > 3. run query $q -> cached > > done > > So the second execution has a chance of having data in memory - but > maybe not all, because this is a 100GB data set (so ~200GB after > loading), but the machine only has 64GB of RAM. > > I think a likely explanation is some of the data wasn't actually in > memory, so prefetching still did something. Ah, ok. > > I think it'd be good to run a performance comparison of the unpatched vs > > patched cases, with prefetching disabled for both. It's possible that > > something in the patch caused unintended changes (say spilling during a > > hashagg, due to larger struct sizes). > > > > That's certainly a good idea. I'll do that in the next round of tests. I > also plan to do a test on data set that fits into RAM, to test "properly > cached" case. Cool. It'd be good to measure both the case of all data already being in s_b (to see the overhead of the buffer mapping lookups) and the case where the data is in the kernel pagecache (to see the overhead of pointless posix_fadvise calls). Greetings, Andres Freund
On 6/10/23 22:34, Andres Freund wrote:
> Hi,
>
> On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
>>>
>>>> 2) prefetching from executor
>>>>
>>>> Another question is whether the prefetching shouldn't actually happen
>>>> even higher - in the executor. That's what Andres suggested during the
>>>> unconference, and it kinda makes sense. That's where we do prefetching
>>>> for bitmap heap scans, so why should this happen lower, right?
>>>
>>> Yea. I think it also provides potential for further optimizations in the
>>> future to do it at that layer.
>>>
>>> One thing I have been wondering around this is whether we should not have
>>> split the code for IOS and plain indexscans...
>>>
>>
>> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
>> did you mean something else?
>
> Yes, I meant that.
>

Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.

>>>> 4) per-leaf prefetching
>>>>
>>>> The code only prefetches items from one leaf page. If the
>>>> index scan needs to scan multiple (many) leaf pages, we have to process
>>>> the first leaf page first before reading / prefetching the next one.
>>>>
>>>> I think this is an acceptable limitation, certainly for v0. Prefetching
>>>> across multiple leaf pages seems way more complex (particularly for the
>>>> cases using pairing heap), so let's leave this for the future.
>>>
>>> Hm. I think that really depends on the shape of the API we end up with. If we
>>> move the responsibility more towards the executor, I think it very well
>>> could end up being just as simple to prefetch across index pages.
>>>
>>
>> Maybe. I'm open to that idea if you have an idea how to shape the API to
>> make this possible (although perhaps not in v0).
>
> I'll try to have a look.
>
>
>>> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
>>> to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
>>> Unless I missed what "xeon / cached (speedup)" indicates? >>> >> >> I forgot to explain what "cached" means in the TPC-H case. It means >> second execution of the query, so you can imagine it like this: >> >> for q in `seq 1 22`; do >> >> 1. drop caches and restart postgres > > Are you doing it in that order? If so, the pagecache can end up being seeded > by postgres writing out dirty buffers. > Actually no, I do it the other way around - first restart, then drop. It shouldn't matter much, though, because after building the data set (and vacuum + checkpoint), the data is not modified - all the queries run on the same data set. So there shouldn't be any dirty buffers. > >> 2. run query $q -> uncached >> >> 3. run query $q -> cached >> >> done >> >> So the second execution has a chance of having data in memory - but >> maybe not all, because this is a 100GB data set (so ~200GB after >> loading), but the machine only has 64GB of RAM. >> >> I think a likely explanation is some of the data wasn't actually in >> memory, so prefetching still did something. > > Ah, ok. > > >>> I think it'd be good to run a performance comparison of the unpatched vs >>> patched cases, with prefetching disabled for both. It's possible that >>> something in the patch caused unintended changes (say spilling during a >>> hashagg, due to larger struct sizes). >>> >> >> That's certainly a good idea. I'll do that in the next round of tests. I >> also plan to do a test on data set that fits into RAM, to test "properly >> cached" case. > > Cool. It'd be good to measure both the case of all data already being in s_b > (to see the overhead of the buffer mapping lookups) and the case where the > data is in the kernel pagecache (to see the overhead of pointless > posix_fadvise calls). > OK, I'll make sure the next round of tests includes a sufficiently small data set too. I should have some numbers sometime early next week. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
> Hi,
>
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
>

I added an entry to
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.

--
Tomasz Rybak, Debian Developer <serpent@debian.org>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

One of the reasons IMHO is that in a bitmap scan the TIDs are already
sorted in heap block order before the heap fetch starts. So it is quite
obvious that once we prefetch a heap block, most of the subsequent TIDs
will fall on that block, i.e. each prefetch will satisfy many immediate
requests. OTOH, in an index scan the I/O requests are very random, so we
might have to prefetch many blocks just to satisfy the requests for the
TIDs from one index page. I agree that prefetching with an index scan
will definitely help in reducing the random I/O, but my guess is that
prefetching with a bitmap scan seemed more natural, and that would have
been one of the reasons for implementing this only for bitmap scans.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,

I have results from the new extended round of prefetch tests. I've
pushed everything to

  https://github.com/tvondra/index-prefetch-tests-2

There are scripts I used to run this (run-*.sh), raw results and various
kinds of processed summaries (pdf, ods, ...) that I'll mention later.

As before, this tests a number of query types:

- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))

It's probably futile to go through details of all the tests - it's
easier to go through the (hopefully fairly readable) shell scripts.

But in principle, it runs some simple queries while varying both the
data set and workload:

- data set may be random, sequential or cyclic (with different length)
- the number of matches per value differs (i.e. equality condition may
  match 1, 10, 100, ..., 100k rows)
- forces a particular scan type (indexscan, bitmapscan, seqscan)
- each query is executed twice - first run (right after restarting DB
  and dropping caches) is uncached, second run should have data cached
- the query is executed 5x with different parameters (so 10x in total)

This is tested with three basic data sizes - fits into shared buffers,
fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and
20GB (i5) / 40GB (xeon).

Note: xeon has 64GB RAM, so technically the largest scale fits into RAM.
But that should not matter, thanks to drop-caches and restart.

I also attempted to pin the backend to a particular core, in an effort
to eliminate scheduling-related noise. It's mostly what taskset does,
but I did that from an extension (https://github.com/tvondra/taskset)
which allows me to do that as part of the SQL script.

For the results, I'll talk about the v1 patch (as submitted here) first.
I'll use the PDF results in the "pdf" directory which generally show a
pivot table by different test parameters, comparing the results by
different parameters (prefetching on/off, master/patched).
Feel free to do your own analysis from the raw CSV data, ofc.

For example, this:

  https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf

shows how the prefetching affects timing for point queries with
different numbers of matches (1 to 100k). The numbers are timings for
master and patched build. The last group is (patched/master), so the
lower the number the better - 50% means patch makes the query 2x faster.
There's also a heatmap, with green=good, red=bad, which makes it easier
to spot cases that got slower/faster.

The really interesting stuff starts on page 7 (in this PDF), because the
first couple pages are "cached" (so it's more about measuring overhead
when prefetching has no benefit).

Right on page 7 you can see a couple cases with a mix of slower/faster
cases, roughly in the +/- 30% range. However, this is unrelated to the
patch because those are results for bitmapheapscan.

For indexscans (page 8), the results are invariably improved - the more
matches the better (up to ~10x faster for 100k matches).

Those were results for the "cyclic" data set. For random data set (pages
9-11) the results are pretty similar, but for "sequential" data (11-13)
the prefetching is actually harmful - there are red clusters, with up to
500% slowdowns.

I'm not going to explain the summary for SAOP queries
(https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf),
the story is roughly the same, except that there are more tested query
combinations (because we also vary the pattern in the IN() list - number
of values etc.).

So, the conclusion from this is - generally very good results for random
and cyclic data sets, but pretty bad results for sequential. But even
for the random/cyclic cases there are combinations (especially with many
matches) where prefetching doesn't help or even hurts.
The only way to deal with this is (I think) a cheap way to identify and skip inefficient prefetches, essentially by doing two things: a) remembering more recently prefetched blocks (say, 1000+) and not prefetching them over and over b) ability to identify sequential pattern, when readahead seems to do pretty good job already (although I heard some disagreement) I've been thinking about how to do this - doing (a) seems pretty hard, because on the one hand we want to remember a fair number of blocks and we want the check "did we prefetch X" to be very cheap. So a hash table seems nice. OTOH we want to expire "old" blocks and only keep the most recent ones, and hash table doesn't really support that. Perhaps there is a great data structure for this, not sure. But after thinking about this I realized we don't need a perfect accuracy - it's fine to have false positives/negatives - it's fine to forget we already prefetched block X and prefetch it again, or to skip a prefetch for a block that is no longer in memory. It's not a matter of correctness, just a matter of efficiency - after all, we can't know if it's still in memory, we only know if we prefetched it fairly recently. This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU cache that's small enough to be searched linearly (say, 8 blocks). And we have many of them (e.g. 128), so that in total we can remember 1024 block numbers. Now, every block number is mapped to a single LRU by hashing, as if we had a hash table index = hash(blockno) % 128 and we only use that one LRU to track this block. It's tiny so we can search it linearly. To expire prefetched blocks, there's a counter incremented every time we prefetch a block, and we store it in the LRU with the block number. When checking the LRU we ignore old entries (with counter more than 1000 values back), and we also evict/replace the oldest entry if needed. This seems to work pretty well for the first requirement, but it doesn't allow identifying the sequential pattern cheaply.
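For illustration, the "hash table of LRU caches" described above can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the patch: the names (PrefetchCache, cache_check_and_add, hash_block), the hash function, and the choice to bump the counter on every lookup are all hypothetical. Only the sizes (128 LRUs of 8 entries, expiry after about 1000 requests) come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LRU_COUNT 128   /* number of tiny LRUs (from the description) */
#define LRU_SIZE  8     /* entries per LRU, small enough for a linear search */
#define MAX_AGE   1000  /* entries older than this many requests are ignored */

typedef struct CacheEntry
{
    uint32_t block;     /* block number */
    uint64_t request;   /* counter value when last requested, 0 = unused */
} CacheEntry;

typedef struct PrefetchCache
{
    uint64_t   counter; /* bumped on every lookup (a simplification) */
    CacheEntry lru[LRU_COUNT][LRU_SIZE];
} PrefetchCache;

static void
cache_init(PrefetchCache *cache)
{
    assert(cache != NULL);
    memset(cache, 0, sizeof(*cache));
}

/* Hypothetical integer hash; any reasonable mixing function would do. */
static uint32_t
hash_block(uint32_t block)
{
    block ^= block >> 16;
    block *= 0x45d9f3bU;
    block ^= block >> 16;
    return block;
}

/*
 * Returns true if the block was requested recently (so the caller may skip
 * the prefetch), false otherwise. Either way the block is remembered as the
 * most recent entry of its LRU.
 */
static bool
cache_check_and_add(PrefetchCache *cache, uint32_t block)
{
    CacheEntry *lru = cache->lru[hash_block(block) % LRU_COUNT];
    uint64_t    now = ++cache->counter;
    int         oldest = 0;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        if (lru[i].request != 0 && lru[i].block == block)
        {
            /* found, but treat entries that are too old as misses */
            bool recent = (now - lru[i].request) <= MAX_AGE;

            lru[i].request = now;
            return recent;
        }

        /* track the oldest (or unused) slot as the eviction candidate */
        if (lru[i].request < lru[oldest].request)
            oldest = i;
    }

    /* not found: replace the oldest entry of this LRU */
    lru[oldest].block = block;
    lru[oldest].request = now;
    return false;
}
```

A caller would issue the prefetch only when cache_check_and_add() returns false. Forgetting a block merely costs one redundant prefetch, which matches the "efficiency, not correctness" argument above.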
To do that, I added a tiny queue with a couple entries that can be checked to see if the last couple entries are sequential. And this is what the attached 0002+0003 patches do. There are PDFs with results for this build prefixed with "patch-v3" and the results are pretty good - the regressions are largely gone. It's even clearer in the PDFs comparing the impact of the two patches: https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf Which simply shows the "speedup heatmap" for the two patches, and the "v3" heatmap has far fewer red regression clusters. Note: The comparison-point.pdf summary has another group of columns illustrating if this scan type would be actually used, with "green" meaning "yes". This provides additional context, because e.g. for the "noisy bitmapscans" it's all white, i.e. without setting the GUCs the optimizer would pick something else (hence it's a non-issue). Let me know if the results are not clear enough (I tried to cover the important stuff, but I'm sure there's a lot of details I didn't cover), or if you think some other summary would be better. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
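The "tiny queue" used in the message above to detect sequential access could look something like the following. This is a hedged sketch, not the patch's code: SeqQueue, seq_add_and_check and the queue size of 8 are assumptions made for illustration, and the real check may use different rules (for example, how it treats repeated blocks).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEQ_QUEUE_SIZE 8    /* "a couple entries"; the exact size is a guess */

typedef struct SeqQueue
{
    uint32_t blocks[SEQ_QUEUE_SIZE];    /* ring buffer of recent block numbers */
    uint64_t count;                     /* total number of blocks added so far */
} SeqQueue;

static void
seq_init(SeqQueue *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof(*q));
}

/*
 * Remember the block and report whether the last SEQ_QUEUE_SIZE requests
 * (including this one) were for strictly consecutive block numbers.
 */
static bool
seq_add_and_check(SeqQueue *q, uint32_t block)
{
    q->blocks[q->count % SEQ_QUEUE_SIZE] = block;
    q->count++;

    /* not enough history yet to call the pattern sequential */
    if (q->count < SEQ_QUEUE_SIZE)
        return false;

    /* the i-th previous request must be for block - i */
    for (uint32_t i = 1; i < SEQ_QUEUE_SIZE; i++)
    {
        uint32_t prev = q->blocks[(q->count - 1 - i) % SEQ_QUEUE_SIZE];

        if (prev + i != block)
            return false;
    }

    return true;
}
```

When the function returns true, the caller would skip the explicit prefetch and leave the work to kernel readahead. A single out-of-order block resets the pattern until the queue fills with consecutive blocks again.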
Attachment
Hi, attached is a v4 of the patch, with a fairly major shift in the approach. Until now the patch very much relied on the AM to provide information which blocks to prefetch next (based on the current leaf index page). This seemed like a natural approach when I started working on the PoC, but over time I ran into various drawbacks: * a lot of the logic is at the AM level * can't prefetch across the index page boundary (have to wait until the next index leaf page is read by the indexscan) * doesn't work for distance searches (gist/spgist) After thinking about this, I decided to ditch this whole idea of exchanging prefetch information through an API, and make the prefetching almost entirely in the indexam code. The new patch maintains a queue of TIDs (read from index_getnext_tid), with up to effective_io_concurrency entries - calling getnext_slot() adds a TID at the queue tail, issues a prefetch for the block, and then returns TID from the queue head. Maintaining the queue is up to index_getnext_slot() - it can't be done in index_getnext_tid(), because then it'd affect IOS (and prefetching heap would mostly defeat the whole point of IOS). And we can't do that above index_getnext_slot() because that already fetched the heap page. I still think prefetching for IOS is doable (and desirable), in mostly the same way - except that we'd need to maintain the queue from some other place, as IOS doesn't do index_getnext_slot(). FWIW there's also the "index-only filters without IOS" patch [1] which switches even regular index scans to index_getnext_tid(), so maybe relying on index_getnext_slot() is a lost cause anyway. Anyway, this has the nice consequence that it makes AM code entirely oblivious of prefetching - there's no need for an API, we just get TIDs as before, and the prefetching magic happens after that. Thus it also works for searches ordered by distance (gist/spgist). The patch got much smaller (about 40kB, down from 80kB), which is nice.
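The TID queue described above (add a TID at the tail, prefetch its block, return the TID at the head) can be sketched as below. This is an illustrative sketch only: TidQueue, tidqueue_next and the two callbacks are hypothetical stand-ins for index_getnext_tid() and PrefetchBuffer(), the queue is filled straight up to the target (no gradual ramp-up of the prefetch distance), and the demo helpers exist just to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TID_QUEUE_MAX 64    /* hypothetical upper bound on the queue size */

typedef struct Tid
{
    uint32_t block;         /* heap block number */
    uint16_t offset;        /* item offset within the block */
} Tid;

/* Stand-ins for index_getnext_tid() and PrefetchBuffer(). */
typedef bool (*NextTidFn) (void *arg, Tid *tid);
typedef void (*PrefetchFn) (void *arg, uint32_t block);

typedef struct TidQueue
{
    Tid        items[TID_QUEUE_MAX];
    uint64_t   head;        /* next TID to return */
    uint64_t   tail;        /* next free slot */
    int        target;      /* plays the role of effective_io_concurrency */
    bool       done;        /* underlying scan is exhausted */
    NextTidFn  next;
    PrefetchFn prefetch;
    void      *arg;
} TidQueue;

static void
tidqueue_init(TidQueue *q, int target, NextTidFn next, PrefetchFn prefetch,
              void *arg)
{
    assert(target >= 1 && target <= TID_QUEUE_MAX);
    memset(q, 0, sizeof(*q));
    q->target = target;
    q->next = next;
    q->prefetch = prefetch;
    q->arg = arg;
}

/* Top up the queue (prefetching each new block), then return the head TID. */
static bool
tidqueue_next(TidQueue *q, Tid *tid)
{
    while (!q->done && (q->tail - q->head) < (uint64_t) q->target)
    {
        Tid t;

        if (!q->next(q->arg, &t))
        {
            q->done = true;
            break;
        }
        q->items[q->tail % TID_QUEUE_MAX] = t;
        q->tail++;
        q->prefetch(q->arg, t.block);
    }

    if (q->head == q->tail)
        return false;       /* queue drained and scan exhausted */

    *tid = q->items[q->head % TID_QUEUE_MAX];
    q->head++;
    return true;
}

/* Demo "index scan" over an array, recording the prefetch requests. */
typedef struct DemoScan
{
    const Tid *tids;
    int        ntids;
    int        pos;
    int        nprefetched;
    uint32_t   last_prefetched;
} DemoScan;

static bool
demo_next(void *arg, Tid *tid)
{
    DemoScan *scan = arg;

    if (scan->pos >= scan->ntids)
        return false;
    *tid = scan->tids[scan->pos++];
    return true;
}

static void
demo_prefetch(void *arg, uint32_t block)
{
    DemoScan *scan = arg;

    scan->nprefetched++;
    scan->last_prefetched = block;
}
```

Because the queue sits entirely above the TID source, the source (the AM in the patch) needs no knowledge of prefetching, which is the property the message highlights.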
I ran the benchmarks [2] with this v4 patch, and the results for the "point" queries are almost exactly the same as for v3. The SAOP part is still running - I'll add those results in a day or two, but I expect similar outcome as for point queries. regards [1] https://commitfest.postgresql.org/43/4352/ [2] https://github.com/tvondra/index-prefetch-tests-2/ -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Here's a v5 of the patch, rebased to current master and fixing a couple compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug messages). No other changes compared to v4. cfbot also reported a failure on windows in pg_dump [1], but it seems pretty strange: [11:42:48.708] ------------------------------------- 8< ------------------------------------- [11:42:48.708] stderr: [11:42:48.708] # Failed test 'connecting to an invalid database: matches' The patch does nothing related to pg_dump, and the test works perfectly fine for me (I don't have a windows machine, but 32-bit and 64-bit linux works fine for me). regards [1] https://cirrus-ci.com/task/6398095366291456 -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hi, Attached is a v6 of the patch, which rebases v5 (just some minor bitrot), and also does a couple changes which I kept in separate patches to make it obvious what changed. 0001-v5-20231016.patch ---------------------- Rebase to current master. 0002-comments-and-minor-cleanup-20231012.patch ---------------------------------------------- Various comment improvements (remove obsolete ones, clarify a bunch of other comments, etc.). I tried to explain the reasoning why some places disable prefetching (e.g. in catalogs, replication, ...), explain how the caching / LRU works etc. 0003-remove-prefetch_reset-20231016.patch ----------------------------------------- I decided to remove the separate prefetch_reset parameter, so that all the index_beginscan() methods only take a parameter specifying the maximum prefetch target. The reset was added early when the prefetch happened much lower in the AM code, at the index page level, and the reset was when moving to the next index page. But now after the prefetch moved to the executor, this doesn't make much sense - the resets happen on rescans, and it seems right to just reset to 0 (just like for bitmap heap scans). 0004-PoC-prefetch-for-IOS-20231016.patch ---------------------------------------- This is a PoC adding the prefetch to index-only scans too. At first that may seem rather strange, considering eliminating the heap fetches is the whole point of IOS. But if the pages are not marked as all-visible (say, the most recent part of the table), we may still have to fetch them. In which case it'd be easy to see cases where IOS is slower than a regular index scan (with prefetching). The code is quite rough. It adds a separate index_getnext_tid_prefetch() function, adding prefetching on top of index_getnext_tid(). I'm not sure it's the right pattern, but it's pretty much what index_getnext_slot() does too, except that it also does the fetch + store to the slot.
Note: There's a second patch adding index-only filters, which requires switching the regular index scans from index_getnext_slot() to _tid() too. The prefetching then happens only after checking the visibility map (if requested). This part definitely needs improvements - for example there's no attempt to reuse the VM buffer, which I guess might be expensive. index-prefetch.pdf ------------------ Attached is also a PDF with results of the same benchmark I did before, comparing master vs. patched with various data patterns and scan types. It's not 100% comparable to earlier results as I only ran it on a laptop, and it's a bit noisier too. The overall behavior and conclusions are however the same. I was specifically interested in the IOS behavior, so I added two more cases to test - indexonlyscan and indexonlyscan-clean. The first is the worst-case scenario, with no pages marked as all-visible in VM (the test simply deletes the VM), while indexonlyscan-clean is the good-case (no heap fetches needed). The results mostly match the expected behavior, particularly for the uncached runs (when the data is expected to not be in memory): * indexonlyscan (i.e. bad case) - About the same results as "indexscans", with the same speedups etc. Which is a good thing (i.e. IOS is not unexpectedly slower than regular indexscans). * indexonlyscan-clean (i.e. good case) - Seems to have mostly the same performance as without the prefetching, except for the low-cardinality runs with many rows per key. I haven't checked what's causing this, but I'd bet it's the extra buffer lookups/management I mentioned. I noticed there's another prefetching-related patch [1] from Thomas Munro. I haven't looked at it yet, so hard to say how much it interferes with this patch. But the idea looks interesting.
[1] https://www.postgresql.org/message-id/flat/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hi, Here's a new WIP version of the patch set adding prefetching to indexes, exploring a couple alternative approaches. After the 2023/10/16 version of the patch, I happened to have an off-list discussion with Andres, and he suggested to try a couple things, and there's a couple more things I tried on my own too. Attached is the patch series starting with the 2023/10/16 patch, and then trying different things in separate patches (discussed later). As usual, there's also a bunch of benchmark results - due to size I'm unable to attach all of them here (the PDFs are pretty large), but you can find them at (with all the scripts etc.): https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23 I'll attach only a couple small PNGs with highlighted speedup/regression patterns, but they're unreadable and more of a pointer to the PDFs.

A quick overview of the patches
-------------------------------

v20231124-0001-prefetch-2023-10-16.patch - same as the October 16 patch, with only minor comment tweaks

v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch - removes custom cache of recently prefetched blocks, replaces it simply by calling PrefetchBuffer (which checks shared buffers)

v20231124-0003-check-page-cache-using-preadv2.patch - adds a check using preadv2(RWF_NOWAIT) to check if the whole page is in page cache

v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch - adds back a small LRU cache to identify sequential patterns (based on benchmarks of 0002/0003 patches)

v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
v20231124-0006-poc-reuse-vm-information.patch - optimizes the visibilitymap handling when prefetching for IOS (to deal with overhead in the all-visible cases)

v20231124-0007-20231016-reworked.patch - returns back to the 20231016 patch, but this time with the VM optimizations in patches 0005/0006 (in retrospect I might have simply moved 0005+0006 right after 0001, but the patch evolved differently - shouldn't matter here)

Now,
let's talk about the patches one by one ... PrefetchBuffer + preadv2 (0002+0003) ------------------------------------ After I posted the patch in October, I happened to have an off-list discussion with Andres, and he suggested to try ditching the local cache of recently prefetched blocks, and instead: 1) call PrefetchBuffer (which checks if the page is in shared buffers, and skips the prefetch if it's already there) 2) if the page is not in shared buffers, use preadv2(RWF_NOWAIT) to check if it's in the kernel page cache Doing (1) is trivial - PrefetchBuffer() already does the shared buffer check, so 0002 simply removes the custom cache code. Doing (2) needs a bit more code to actually call preadv2() - 0003 adds FileCached() to fd.c, smgrcached() to smgr.c, and then calls it from PrefetchBuffer() right before smgrprefetch(). There's a couple loose ends (e.g. configure should check if preadv2 is supported), but in principle I think this is generally correct. Unfortunately, these changes led to a bunch of clear regressions :-( Take a look at the attached point-4-regressions-small.png, which is page 5 from the full results PDF [1][2]. As before, I plotted this as a huge pivot table with various parameters (test, dataset, prefetch, ...) on the left, and (build, nmatches) on the top. So each column shows timings for a particular patch and query returning nmatches rows. After the pivot table (on the right) is a heatmap, comparing timings for each build to master (the first couple of columns). As usual, the numbers are "timing compared to master" so e.g. 50% means the query completed in 1/2 the time compared to master. Color coding is simple too, green means "good" (speedup), red means "bad" (regression). The higher the saturation, the bigger the difference. I find this visualization handy as it quickly highlights differences between the various patches. Just look for changes in red/green areas. 
In the point-4-regressions-small.png image, you can see three areas of clear regressions, either compared to the master or the 20231016 patch. All of this is for "uncached" runs, i.e. after instance got restarted and the page cache was dropped too. The first regression is for bitmapscan. The first two builds show no difference compared to master - which makes sense, because the 20231016 patch does not touch any code used by bitmapscan, and the 0002 patch simply uses PrefetchBuffer as is. But then 0003 adds preadv2 to it, and the performance immediately sinks, with timings being ~5-6x higher for queries matching 1k-100k rows. The patches 0005/0006 can't possibly improve this, because the visibilitymap is entirely unrelated to bitmapscans, and so is the small LRU to detect sequential patterns. The indexscan regression #1 shows a similar pattern, but in the opposite direction - indexscan cases massively improved with the 20231016 patch (and even after just using PrefetchBuffer) revert back to master with 0003 (adding preadv2). Ditching the preadv2 restores the gains (the last build results are nicely green again). The indexscan regression #2 is interesting too, and it illustrates the importance of detecting sequential access patterns. It shows that as soon as we call PrefetchBuffer() directly, the timings increase to maybe 2-5x compared to master. That's pretty terrible. Once the small LRU cache used to detect sequential patterns is added back, the performance recovers and regression disappears. Clearly, this detection matters. Unfortunately, the LRU can't do anything for the two other regressions, because those are on random/cyclic patterns, so the LRU won't work (certainly not for the random case). preadv2 issues? --------------- I'm not entirely sure if I'm using preadv2 somehow wrong, but it doesn't seem to perform terribly well in this use case. I decided to do some microbenchmarks, measuring how long it takes to do preadv2 when the pages are [not] in cache etc.
The C files are at [3]. preadv2-test simply reads file twice, first with NOWAIT and then without it. With clean page cache, the results look like this:

  file: ./tmp.img  size: 1073741824 (131072)  block 8192  check 8192
  preadv2 NOWAIT  time 78472 us  calls 131072  hits 0  misses 131072
  preadv2 WAIT  time 9849082 us  calls 131072  hits 131072  misses 0

and then, if you run it again with the file still being in page cache:

  file: ./tmp.img  size: 1073741824 (131072)  block 8192  check 8192
  preadv2 NOWAIT  time 258880 us  calls 131072  hits 131072  misses 0
  preadv2 WAIT  time 213196 us  calls 131072  hits 131072  misses 0

This is pretty terrible, IMO. It says that if the page is not in cache, the preadv2 calls take ~80ms. Which is very cheap, compared to the total read time (so if we can speed that up by prefetching, it's worth it). But if the file is already in cache, it takes ~260ms, and actually exceeds the time needed to just do preadv2() without the NOWAIT flag. AFAICS the problem is preadv2() doesn't just check if the data is available, it also copies the data and all that. But even if we only ask for the first byte, it's still way more expensive than with empty cache:

  file: ./tmp.img  size: 1073741824 (131072)  block 8192  check 1
  preadv2 NOWAIT  time 119751 us  calls 131072  hits 131072  misses 0
  preadv2 WAIT  time 208136 us  calls 131072  hits 131072  misses 0

There's also a fadvise-test microbenchmark that just does fadvise all the time, and even that is way cheaper than using preadv2(NOWAIT) in both cases:

  no cache:

  file: ./tmp.img  size: 1073741824 (131072)  block 8192
  fadvise  time 631686 us  calls 131072  hits 0  misses 0
  preadv2  time 207483 us  calls 131072  hits 131072  misses 0

  cache:

  file: ./tmp.img  size: 1073741824 (131072)  block 8192
  fadvise  time 79874 us  calls 131072  hits 0  misses 0
  preadv2  time 239141 us  calls 131072  hits 131072  misses 0

So that's 300ms vs. 500ms in the cached case (the difference in the no-cache case is even more significant).
It's entirely possible I'm doing something wrong, or maybe I just think about this the wrong way, but I can't quite imagine this working well - at least not for reasonably good local storage. Maybe it could help for slow/remote storage, or something? For now, I think the right approach is to go back to the cache of recently prefetched blocks. What I liked about the preadv2 approach is that it knows exactly what is currently in page cache, while the local cache is just an approximate cache of recently prefetched blocks. And it also knows about stuff prefetched by other backends, while the local cache is private to the particular backend (or even to the particular scan node). But the local cache seems to perform much better, so there's that. LRU cache of recent blocks (0004) --------------------------------- The importance of this optimization is clearly visible in the regression image mentioned earlier - the "indexscan regression #2" shows that the sequential pattern regresses with 0002+0003 patches, but once the small LRU cache is introduced back and used to skip prefetching for sequential patterns, the regression disappears. Ofc, this is part of the original 20231016 patch, so going back to that version naturally includes this. visibility map optimizations (0005/0006) ---------------------------------------- Earlier benchmark results showed a somewhat annoying regression for index-only scans that don't need prefetching (i.e. with all pages all-visible). There was quite a bit of inefficiency because both the prefetcher and IOS code accessed the visibilitymap independently, and the prefetcher did that in a rather inefficient way. These patches make the prefetcher more efficient by reusing buffer, and also share the visibility info between prefetcher and the IOS code. I'm sure this needs more work / cleanup, but the regression is mostly gone, as illustrated by the attached point-0-ios-improvement-small.png.
layering questions ------------------ Aside from the preadv2() question, the main open question remains the "layering", i.e. which code should be responsible for prefetching. At the moment all the magic happens in indexam.c, in index_getnext_* functions, so that all callers benefit from prefetching. But as mentioned earlier in this thread, indexam.c seems to be the wrong layer, and I think I agree. The problem is - the prefetching needs to happen in index_getnext_* so that all index_getnext_* callers benefit from it. We could do that in the executor for index_getnext_tid(), but that's a bit weird - it'd work for index-only scans, but the primary target is regular index scans, which call index_getnext_slot(). However, it seems it'd be good if the prefetcher and the executor code could exchange/share information more easily. Take for example the visibilitymap stuff in IOS (in patches 0005/0006). I made it work, but it sure looks inconvenient, partially due to the split between executor and indexam code. The only idea I have is to have the prefetcher code somewhere in the executor, but then pass it to index_getnext_* functions, either as a new parameter (with NULL => no prefetching), or maybe as a field of scandesc (but that seems wrong, to point from the desc to something that's essentially a part of the executor state). There's also the thing that the prefetcher is part of IndexScanDesc, but it really should be in the IndexScanState. That's weird, but mostly down to my general laziness. regards [1] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/pdf/point.pdf [2] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/png/point-4.png [3] https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23/preadv-tests -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v20231124-0006-poc-reuse-vm-information.patch
- v20231124-0001-prefetch-2023-10-16.patch
- v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch
- v20231124-0003-check-page-cache-using-preadv2.patch
- v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch
- v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
- v20231124-0007-20231016-reworked.patch
- point-0-ios-improvement-small.png
- point-4-regressions-small.png
Hi, Here's a simplified version of the patch series, with two important changes from the last version shared on 2023/11/24. Firstly, it abandons the idea to use preadv2() to check page cache. This initially seemed like a great way to check if prefetching is needed, but in practice it seems so expensive it's not really beneficial (especially in the "cached" case, which is where it matters most). Note: There's one more reason to not want to rely on preadv2() that I forgot to mention - it's a Linux-specific thing. I wouldn't mind using it to improve already acceptable behavior, but it doesn't seem like a great idea if performance without it would be poor. Secondly, this reworks multiple aspects of the "layering". Until now, the prefetching info was stored in IndexScanDesc and initialized in indexam.c in the various "beginscan" functions. That was obviously wrong - IndexScanDesc is just a description of what the scan should do, not a place where execution state (which the prefetch queue is) should be stored. IndexScanState (and IndexOnlyScanState) is a more appropriate place, so I moved it there. This also means the various "beginscan" functions don't need any changes (i.e. not even get prefetch_max), which is nice, because the prefetch state is created/initialized elsewhere. But there's a layering problem that I don't know how to solve - I don't see how we could make indexam.c entirely oblivious to the prefetching, and move it entirely to the executor. Because how else would you know what to prefetch? With index_getnext_tid() I can imagine fetching TIDs ahead, stashing them into a queue, and prefetching based on that. That's kinda what the patch does, except that it does it from inside index_getnext_tid(). But that does not work for index_getnext_slot(), because that already reads the heap tuples. We could say prefetching only works for index_getnext_tid(), but that seems a bit weird because that's what regular index scans do.
(There's a patch to evaluate filters on index, which switches index scans to index_getnext_tid(), so that'd make prefetching work too, but I'd ignore that here.) There are other index_getnext_slot() callers, and I don't think we should accept that prefetching does not work for those places (e.g. execIndexing/execReplication would benefit from prefetching, I think). The patch just adds a "prefetcher" argument to index_getnext_*(), and the prefetching still happens there. I guess we could move most of the prefetcher typedefs/code somewhere, but I don't quite see how it could be done in executor entirely. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > But there's a layering problem that I don't know how to solve - I don't > see how we could make indexam.c entirely oblivious to the prefetching, > and move it entirely to the executor. Because how else would you know > what to prefetch? Yeah, that seems impossible. Some thoughts: * I think perhaps the subject line of this thread is misleading. It doesn't seem like there is any index prefetching going on here at all, and there couldn't be, unless you extended the index AM API with new methods. What you're actually doing is prefetching heap pages that will be needed by a scan of the index. I think this confusing naming has propagated itself into some parts of the patch, e.g. index_prefetch() reads *from the heap* which is not at all clear from the comment saying "Prefetch the TID, unless it's sequential or recently prefetched." You're not prefetching the TID: you're prefetching the heap tuple to which the TID points. That's not an academic distinction IMHO -- the TID would be stored in the index, so if we were prefetching the TID, we'd have to be reading index pages, not heap pages. * Regarding layering, my first thought was that the changes to index_getnext_tid() and index_getnext_slot() are sensible: read ahead by some number of TIDs, keep the TIDs you've fetched in an array someplace, use that to drive prefetching of blocks on disk, and return the previously-read TIDs from the queue without letting the caller know that the queue exists. I think that's the obvious design for a feature of this type, to the point where I don't really see that there's a viable alternative design. Driving something down into the individual index AMs would make sense if you wanted to prefetch *from the indexes*, but it's unnecessary otherwise, and best avoided. 
* But that said, the skip_all_visible flag passed down to index_prefetch() looks like a VERY strong sign that the layering here is not what it should be. Right now, when some code calls index_getnext_tid(), that function does not need to know or care whether the caller is going to fetch the heap tuple or not. But with this patch, the code does need to care. So knowledge of the executor concept of an index-only scan trickles down into indexam.c, which now has to be able to make decisions that are consistent with the ones that the executor will make. That doesn't seem good at all. * I think it might make sense to have two different prefetching schemes. Ideally they could share some structure. If a caller is using index_getnext_slot(), then it's easy for prefetching to be fully transparent. The caller can just ask for TIDs and the prefetching distance and TID queue can be fully under the control of something that is hidden from the caller. But when using index_getnext_tid(), the caller needs to have an opportunity to evaluate each TID and decide whether we even want the heap tuple. If yes, then we feed that TID to the prefetcher; if no, we don't. That way, we're not replicating executor logic in lower-level code. However, that also means that the IOS logic needs to be aware that this TID queue exists and interact with whatever controls the prefetch distance. Perhaps after calling index_getnext_tid() you call index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and then you call index_prefetcher_get_tid() to drain the queue. Perhaps also the prefetcher has a "fill" callback that gets invoked when the TID queue isn't as full as the prefetcher wants it to be. Then index_getnext_slot() can just install a trivial fill callback that says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...), true), but IOS can use a more sophisticated callback that checks the VM to determine what to pass for the third argument.
* I realize that I'm being a little inconsistent in what I just said, because in the first bullet point I said that this wasn't really index prefetching, and now I'm proposing function names that still start with index_prefetch. It's not entirely clear to me what the best thing to do about the terminology is here -- could it be a heap prefetcher, or a TID prefetcher, or an index scan prefetcher? I don't really know, but whatever we can do to make the naming more clear seems like a really good idea. Maybe there should be a clearer separation between the queue of TIDs that we're going to return from the index and the queue of blocks that we want to prefetch to get the corresponding heap tuples -- making that separation crisper might ease some of the naming issues. * Not that I want to be critical because I think this is a great start on an important project, but it does look like there's an awful lot of stuff here that still needs to be sorted out before it would be reasonable to think of committing this, both in terms of design decisions and just general polish. There's a lot of stuff marked with XXX and I think that's great because most of those seem to be good questions but that does leave the, err, small problem of figuring out the answers. index_prefetch_is_sequential() makes me really nervous because it seems to depend an awful lot on whether the OS is doing prefetching, and how the OS is doing prefetching, and I think those might not be consistent across all systems and kernel versions. Similarly with index_prefetch(). There's a lot of "magical" assumptions here. Even index_prefetch_add_cache() has this problem -- the function assumes that it's OK if we sometimes fail to detect a duplicate prefetch request, which makes sense, but under what circumstances is it necessary to detect duplicates and in what cases is it optional? The function comments are silent about that, which makes it hard to assess whether the algorithm is good enough. 
* In terms of polish, one thing I noticed is that index_getnext_slot() calls index_prefetch_tids() even when scan->xs_heap_continue is set, which seems like it must be a waste, since we can't really need to kick off more prefetch requests halfway through a HOT chain referenced by a single index tuple, can we? Also, blks_prefetch_rounds doesn't seem to be used anywhere, and neither that nor blks_prefetches are documented. In fact there's no new documentation at all, which seems probably not right. That's partly because there are no new GUCs, which I feel like typically for a feature like this would be the place where the feature behavior would be mentioned in the documentation. I don't think it's a good idea to tie the behavior of this feature to effective_io_concurrency partly because it's usually a bad idea to make one setting control multiple different things, but perhaps even more because effective_io_concurrency doesn't actually work in a useful way AFAICT and people typically have to set it to some very artificially large value compared to how much real I/O parallelism they have. So probably there should be new GUCs with hopefully-better semantics, but at least the documentation for any existing ones would need updating, I would think. -- Robert Haas EDB: http://www.enterprisedb.com
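The put/get/fill interface floated in the review above could be sketched along these lines. To be clear about what is assumed: only the names index_prefetcher_put_tid() and index_prefetcher_get_tid(), the fetch_heap_tuple argument and the idea of a "fill" callback come from the message. The struct layout, the callback signature, the fake visibility map and the prefetch counter (standing in for an actual prefetch request) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PREFETCH_QUEUE_MAX 64   /* hypothetical queue capacity */

typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

typedef struct IndexPrefetcher IndexPrefetcher;

/* Fill callback: put one TID into the queue, or return false at end of scan. */
typedef bool (*PrefetchFillFn) (IndexPrefetcher *p, void *arg);

struct IndexPrefetcher
{
    Tid            queue[PREFETCH_QUEUE_MAX];
    uint64_t       head;
    uint64_t       tail;
    int            target;      /* how full the queue should be kept */
    bool           exhausted;
    int            nprefetches; /* stand-in for issued prefetch requests */
    PrefetchFillFn fill;
    void          *fill_arg;
};

static void
index_prefetcher_init(IndexPrefetcher *p, int target, PrefetchFillFn fill,
                      void *fill_arg)
{
    assert(target >= 1 && target <= PREFETCH_QUEUE_MAX);
    memset(p, 0, sizeof(*p));
    p->target = target;
    p->fill = fill;
    p->fill_arg = fill_arg;
}

static void
index_prefetcher_put_tid(IndexPrefetcher *p, Tid tid, bool fetch_heap_tuple)
{
    assert(p->tail - p->head < PREFETCH_QUEUE_MAX);
    p->queue[p->tail % PREFETCH_QUEUE_MAX] = tid;
    p->tail++;
    if (fetch_heap_tuple)
        p->nprefetches++;   /* real code would prefetch tid.block here */
}

static bool
index_prefetcher_get_tid(IndexPrefetcher *p, Tid *tid)
{
    /* invoke the fill callback while the queue is less full than wanted */
    while (!p->exhausted && (p->tail - p->head) < (uint64_t) p->target)
    {
        if (!p->fill(p, p->fill_arg))
            p->exhausted = true;
    }

    if (p->head == p->tail)
        return false;

    *tid = p->queue[p->head % PREFETCH_QUEUE_MAX];
    p->head++;
    return true;
}

/* Demo index: an array of TIDs plus a fake per-block visibility map. */
typedef struct DemoIndex
{
    const Tid  *tids;
    int         ntids;
    int         pos;
    const bool *all_visible;
} DemoIndex;

/* Trivial callback for plain index scans: every heap tuple is needed. */
static bool
fill_plain(IndexPrefetcher *p, void *arg)
{
    DemoIndex *idx = arg;

    if (idx->pos >= idx->ntids)
        return false;
    index_prefetcher_put_tid(p, idx->tids[idx->pos++], true);
    return true;
}

/* IOS-style callback: consult the (fake) VM, skip all-visible blocks. */
static bool
fill_index_only(IndexPrefetcher *p, void *arg)
{
    DemoIndex *idx = arg;

    if (idx->pos >= idx->ntids)
        return false;

    Tid tid = idx->tids[idx->pos++];

    index_prefetcher_put_tid(p, tid, !idx->all_visible[tid.block]);
    return true;
}
```

The point of the split is that the decision about whether a heap fetch is needed lives in the callback supplied by the caller, so the queue code itself never has to replicate executor logic.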
On 12/18/23 22:00, Robert Haas wrote:
> On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> But there's a layering problem that I don't know how to solve - I don't
>> see how we could make indexam.c entirely oblivious to the prefetching,
>> and move it entirely to the executor. Because how else would you know
>> what to prefetch?
>
> Yeah, that seems impossible.
>
> Some thoughts:
>
> * I think perhaps the subject line of this thread is misleading. It
> doesn't seem like there is any index prefetching going on here at all,
> and there couldn't be, unless you extended the index AM API with new
> methods. What you're actually doing is prefetching heap pages that
> will be needed by a scan of the index. I think this confusing naming
> has propagated itself into some parts of the patch, e.g.
> index_prefetch() reads *from the heap* which is not at all clear from
> the comment saying "Prefetch the TID, unless it's sequential or
> recently prefetched." You're not prefetching the TID: you're
> prefetching the heap tuple to which the TID points. That's not an
> academic distinction IMHO -- the TID would be stored in the index, so
> if we were prefetching the TID, we'd have to be reading index pages,
> not heap pages.

Yes, that's a fair complaint. I think the naming is mostly obsolete - the
prefetching initially happened way way lower - in the index AMs. It was
prefetching the heap pages, ofc, but it kinda seemed reasonable to call
it "index prefetching". And even now it's called from indexam.c where
most functions start with "index_".

But I'll think about some better / clearer name.
> > * Regarding layering, my first thought was that the changes to > index_getnext_tid() and index_getnext_slot() are sensible: read ahead > by some number of TIDs, keep the TIDs you've fetched in an array > someplace, use that to drive prefetching of blocks on disk, and return > the previously-read TIDs from the queue without letting the caller > know that the queue exists. I think that's the obvious design for a > feature of this type, to the point where I don't really see that > there's a viable alternative design. I agree. > Driving something down into the individual index AMs would make sense > if you wanted to prefetch *from the indexes*, but it's unnecessary > otherwise, and best avoided. > Right. In fact, the patch moved exactly in the opposite direction - it was originally done at the AM level, and moved up. First to indexam.c, then even more to the executor. > * But that said, the skip_all_visible flag passed down to > index_prefetch() looks like a VERY strong sign that the layering here > is not what it should be. Right now, when some code calls > index_getnext_tid(), that function does not need to know or care > whether the caller is going to fetch the heap tuple or not. But with > this patch, the code does need to care. So knowledge of the executor > concept of an index-only scan trickles down into indexam.c, which now > has to be able to make decisions that are consistent with the ones > that the executor will make. That doesn't seem good at all. > I agree the all_visible flag is a sign the abstraction is not quite right. I did that mostly to quickly verify whether the duplicate VM checks are causing for the perf regression (and they are). Whatever the right abstraction is, it probably needs to do these VM checks only once. > * I think it might make sense to have two different prefetching > schemes. Ideally they could share some structure. If a caller is using > index_getnext_slot(), then it's easy for prefetching to be fully > transparent. 
The caller can just ask for TIDs and the prefetching > distance and TID queue can be fully under the control of something > that is hidden from the caller. But when using index_getnext_tid(), > the caller needs to have an opportunity to evaluate each TID and > decide whether we even want the heap tuple. If yes, then we feed that > TID to the prefetcher; if no, we don't. That way, we're not > replicating executor logic in lower-level code. However, that also > means that the IOS logic needs to be aware that this TID queue exists > and interact with whatever controls the prefetch distance. Perhaps > after calling index_getnext_tid() you call > index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and > then you call index_prefetcher_get_tid() to drain the queue. Perhaps > also the prefetcher has a "fill" callback that gets invoked when the > TID queue isn't as full as the prefetcher wants it to be. Then > index_getnext_slot() can just install a trivial fill callback that > says index_prefetecher_put_tid(prefetcher, index_getnext_tid(...), > true), but IOS can use a more sophisticated callback that checks the > VM to determine what to pass for the third argument. > Yeah, after you pointed out the "leaky" abstraction, I also started to think about customizing the behavior using a callback. Not sure what exactly you mean by "fully transparent" but as I explained above I think we need to allow passing some information between the prefetcher and the executor - for example results of the visibility map checks in IOS. 
I have imagined something like this: nodeIndexscan / index_getnext_slot() -> no callback, all TIDs are prefetched nodeIndexonlyscan / index_getnext_tid() -> callback checks VM for the TID, prefetches if not all-visible -> the VM check result is stored in the queue with the VM (but in an extensible way, so that other callback can store other stuff) -> index_getnext_tid() also returns this extra information So not that different from the WIP patch, but in a "generic" and extensible way. Instead of hard-coding the all-visible flag, there'd be a something custom information. A bit like qsort_r() has a void* arg to pass custom context. Or if envisioned something different, could you elaborate a bit? > * I realize that I'm being a little inconsistent in what I just said, > because in the first bullet point I said that this wasn't really index > prefetching, and now I'm proposing function names that still start > with index_prefetch. It's not entirely clear to me what the best thing > to do about the terminology is here -- could it be a heap prefetcher, > or a TID prefetcher, or an index scan prefetcher? I don't really know, > but whatever we can do to make the naming more clear seems like a > really good idea. Maybe there should be a clearer separation between > the queue of TIDs that we're going to return from the index and the > queue of blocks that we want to prefetch to get the corresponding heap > tuples -- making that separation crisper might ease some of the naming > issues. > I think if the code stays in indexam.c, it's sensible to keep the index_ prefix, but then also have a more appropriate rest of the name. For example it might be index_prefetch_heap_pages() or something like that. 
> * Not that I want to be critical because I think this is a great start > on an important project, but it does look like there's an awful lot of > stuff here that still needs to be sorted out before it would be > reasonable to think of committing this, both in terms of design > decisions and just general polish. There's a lot of stuff marked with > XXX and I think that's great because most of those seem to be good > questions but that does leave the, err, small problem of figuring out > the answers. Absolutely. I certainly don't claim this is close to commit ... > index_prefetch_is_sequential() makes me really nervous > because it seems to depend an awful lot on whether the OS is doing > prefetching, and how the OS is doing prefetching, and I think those > might not be consistent across all systems and kernel versions. If the OS does not have read-ahead, or it's not configured properly, then the patch does not perform worse than what we have now. I'm far more concerned about the opposite issue, i.e. causing regressions with OS-level read-ahead. And the check handles that well, I think. > Similarly with index_prefetch(). There's a lot of "magical" > assumptions here. Even index_prefetch_add_cache() has this problem -- > the function assumes that it's OK if we sometimes fail to detect a > duplicate prefetch request, which makes sense, but under what > circumstances is it necessary to detect duplicates and in what cases > is it optional? The function comments are silent about that, which > makes it hard to assess whether the algorithm is good enough. > I don't quite understand what problem with duplicates you envision here. Strictly speaking, we don't need to detect/prevent duplicates - it's just that if you do posix_fadvise() for a block that's already in memory, it's overhead / wasted time. The whole point is to not do that very often. In this sense it's entirely optional, but desirable. I'm in no way claiming the comments are perfect, ofc. 
> * In terms of polish, one thing I noticed is that index_getnext_slot() > calls index_prefetch_tids() even when scan->xs_heap_continue is set, > which seems like it must be a waste, since we can't really need to > kick off more prefetch requests halfway through a HOT chain referenced > by a single index tuple, can we? Yeah, I think that's true. > Also, blks_prefetch_rounds doesn't > seem to be used anywhere, and neither that nor blks_prefetches are > documented. In fact there's no new documentation at all, which seems > probably not right. That's partly because there are no new GUCs, which > I feel like typically for a feature like this would be the place where > the feature behavior would be mentioned in the documentation. That's mostly because the explain fields were added to help during development. I'm not sure we actually want to make them part of EXPLAIN. > I don't > think it's a good idea to tie the behavior of this feature to > effective_io_concurrency partly because it's usually a bad idea to > make one setting control multiple different things, but perhaps even > more because effective_io_concurrency doesn't actually work in a > useful way AFAICT and people typically have to set it to some very > artificially large value compared to how much real I/O parallelism > they have. So probably there should be new GUCs with hopefully-better > semantics, but at least the documentation for any existing ones would > need updating, I would think. > I really don't want to have multiple knobs. At this point we have three GUCs, each tuning prefetching for a fairly large part of the system: effective_io_concurrency = regular queries maintenance_io_concurrency = utility commands recovery_prefetch = recovery / PITR This seems sensible, but I really don't want many more GUCs tuning prefetching for different executor nodes or something like that. 
If we have issues with how effective_io_concurrency works (and I'm not sure that's actually true), then perhaps we should fix that rather than inventing new GUCs. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
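The callback idea discussed above (a TID queue hidden behind the prefetcher, a per-TID decision made once and stored in the queue, and a void* context in the style of qsort_r()) can be sketched as a small standalone C program. This is only an illustration under stated assumptions, not code from the patch: every name here (Prefetcher, prefetcher_fill, prefetcher_get, NeedHeapFn, DemoScan, QUEUE_SIZE) is hypothetical, the queue has a fixed size instead of an adaptive prefetch distance, and the actual prefetch request is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE 8            /* hypothetical maximum prefetch distance */

typedef struct Tid { uint32_t block; uint16_t offset; } Tid;

typedef struct QueueEntry
{
    Tid  tid;
    bool fetch_heap;            /* decided once, when the TID is queued */
} QueueEntry;

/* Returns the next TID from the index, or false at the end of the scan. */
typedef bool (*NextTidFn) (void *arg, Tid *tid);
/* Decides whether the heap tuple is needed (for IOS: a visibility map check). */
typedef bool (*NeedHeapFn) (void *arg, Tid tid);

typedef struct Prefetcher
{
    QueueEntry queue[QUEUE_SIZE];
    uint64_t   head;            /* next entry to hand to the caller */
    uint64_t   tail;            /* next free slot */
    bool       exhausted;       /* next_tid has reported the end of the scan */
    NextTidFn  next_tid;
    NeedHeapFn need_heap;       /* NULL: always fetch (plain index scan) */
    void      *arg;             /* opaque context, like qsort_r()'s arg */
    uint64_t   prefetches;      /* stand-in for issued prefetch requests */
} Prefetcher;

static void
prefetcher_init(Prefetcher *p, NextTidFn next_tid, NeedHeapFn need_heap, void *arg)
{
    assert(next_tid != NULL);
    memset(p, 0, sizeof(*p));
    p->next_tid = next_tid;
    p->need_heap = need_heap;
    p->arg = arg;
}

/* Top up the queue; the need_heap decision is stored with each TID. */
static void
prefetcher_fill(Prefetcher *p)
{
    Tid tid;

    while (!p->exhausted && p->tail - p->head < QUEUE_SIZE)
    {
        QueueEntry *entry;

        if (!p->next_tid(p->arg, &tid))
        {
            p->exhausted = true;
            break;
        }
        entry = &p->queue[p->tail++ % QUEUE_SIZE];
        entry->tid = tid;
        entry->fetch_heap = (p->need_heap == NULL) || p->need_heap(p->arg, tid);
        if (entry->fetch_heap)
            p->prefetches++;    /* real code would issue the prefetch here */
    }
}

/* Return the next queued TID along with the stored decision. */
static bool
prefetcher_get(Prefetcher *p, Tid *tid, bool *fetch_heap)
{
    QueueEntry *entry;

    prefetcher_fill(p);
    if (p->head == p->tail)
        return false;
    entry = &p->queue[p->head++ % QUEUE_SIZE];
    *tid = entry->tid;
    *fetch_heap = entry->fetch_heap;
    return true;
}

/* Demo callbacks: TIDs come from an array; even blocks count as all-visible. */
typedef struct DemoScan { const Tid *tids; size_t ntids; size_t pos; } DemoScan;

static bool
demo_next_tid(void *arg, Tid *tid)
{
    DemoScan *scan = arg;

    if (scan->pos >= scan->ntids)
        return false;
    *tid = scan->tids[scan->pos++];
    return true;
}

static bool
demo_need_heap(void *arg, Tid tid)
{
    (void) arg;
    return (tid.block % 2) != 0;
}
```

In this sketch a plain index scan would pass NULL for need_heap, so every TID is prefetched, while an index-only scan would pass a callback that consults the visibility map. Because the decision travels with the TID, the caller reads it back from prefetcher_get() instead of repeating the check.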
On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Whatever the right abstraction is, it probably needs to do these VM > checks only once. Makes sense. > Yeah, after you pointed out the "leaky" abstraction, I also started to > think about customizing the behavior using a callback. Not sure what > exactly you mean by "fully transparent" but as I explained above I think > we need to allow passing some information between the prefetcher and the > executor - for example results of the visibility map checks in IOS. Agreed. > I have imagined something like this: > > nodeIndexscan / index_getnext_slot() > -> no callback, all TIDs are prefetched > > nodeIndexonlyscan / index_getnext_tid() > -> callback checks VM for the TID, prefetches if not all-visible > -> the VM check result is stored in the queue with the VM (but in an > extensible way, so that other callback can store other stuff) > -> index_getnext_tid() also returns this extra information > > So not that different from the WIP patch, but in a "generic" and > extensible way. Instead of hard-coding the all-visible flag, there'd be > a something custom information. A bit like qsort_r() has a void* arg to > pass custom context. > > Or if envisioned something different, could you elaborate a bit? I can't totally follow the sketch you give above, but I think we're thinking along similar lines, at least. > I think if the code stays in indexam.c, it's sensible to keep the index_ > prefix, but then also have a more appropriate rest of the name. For > example it might be index_prefetch_heap_pages() or something like that. Yeah, that's not a bad idea. > > index_prefetch_is_sequential() makes me really nervous > > because it seems to depend an awful lot on whether the OS is doing > > prefetching, and how the OS is doing prefetching, and I think those > > might not be consistent across all systems and kernel versions. 
> > If the OS does not have read-ahead, or it's not configured properly, > then the patch does not perform worse than what we have now. I'm far > more concerned about the opposite issue, i.e. causing regressions with > OS-level read-ahead. And the check handles that well, I think. I'm just not sure how much I believe that it's going to work well everywhere. I mean, I have no evidence that it doesn't, it just kind of looks like guesswork to me. For instance, the behavior of the algorithm depends heavily on PREFETCH_QUEUE_HISTORY and PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is to say that on some system or workload you didn't test the required values aren't entirely different, or that the whole algorithm doesn't need rethinking? Maybe we can't really answer that question perfectly, but the patch doesn't really explain the reasoning behind this choice of algorithm. > > Similarly with index_prefetch(). There's a lot of "magical" > > assumptions here. Even index_prefetch_add_cache() has this problem -- > > the function assumes that it's OK if we sometimes fail to detect a > > duplicate prefetch request, which makes sense, but under what > > circumstances is it necessary to detect duplicates and in what cases > > is it optional? The function comments are silent about that, which > > makes it hard to assess whether the algorithm is good enough. > > I don't quite understand what problem with duplicates you envision here. > Strictly speaking, we don't need to detect/prevent duplicates - it's > just that if you do posix_fadvise() for a block that's already in > memory, it's overhead / wasted time. The whole point is to not do that > very often. In this sense it's entirely optional, but desirable. Right ... but the patch sets up some data structure that will eliminate duplicates in some circumstances and fail to eliminate them in others. 
So it's making a judgement that the things it catches are the cases that are important enough that we need to catch them, and the things that it doesn't catch are cases that aren't particularly important to catch. Here again, PREFETCH_LRU_SIZE and PREFETCH_LRU_COUNT seem like they will have a big impact, but why these values? The comments suggest that it's because we want to cover ~8MB of data, but it's not clear why that should be the right amount of data to cover. My naive thought is that we'd want to avoid prefetching a block during the time between we had prefetched it and when we later read it, but then the value that is here magically 8MB should really be replaced by the operative prefetch distance. > I really don't want to have multiple knobs. At this point we have three > GUCs, each tuning prefetching for a fairly large part of the system: > > effective_io_concurrency = regular queries > maintenance_io_concurrency = utility commands > recovery_prefetch = recovery / PITR > > This seems sensible, but I really don't want many more GUCs tuning > prefetching for different executor nodes or something like that. > > If we have issues with how effective_io_concurrency works (and I'm not > sure that's actually true), then perhaps we should fix that rather than > inventing new GUCs. Well, that would very possibly be a good idea, but I still think using the same GUC for two different purposes is likely to cause trouble. I think what effective_io_concurrency currently controls is basically the heap prefetch distance for bitmap scans, and what you want to control here is the heap prefetch distance for index scans. If those are necessarily related in some understandable way (e.g. always the same, one twice the other, one the square of the other) then it's fine to use the same parameter for both, but it's not clear to me that this is the case. 
I fear someone will find that if they crank up effective_io_concurrency high enough to get the amount of prefetching they want for bitmap scans, it will be too much for index scans, or the other way around. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>

I was going through to understand the idea, couple of observations

--
+    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+    {
+        entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+        /* Is this the oldest prefetch request in this LRU? */
+        if (entry->request < oldestRequest)
+        {
+            oldestRequest = entry->request;
+            oldestIndex = i;
+        }
+
+        /*
+         * If the entry is unused (identified by request being set to 0),
+         * we're done. Notice the field is uint64, so empty entry is
+         * guaranteed to be the oldest one.
+         */
+        if (entry->request == 0)
+            continue;

If the 'entry->request == 0' then we should break instead of continue, right?

---
/*
 * Used to detect sequential patterns (and disable prefetching).
 */
#define PREFETCH_QUEUE_HISTORY 8
#define PREFETCH_SEQ_PATTERN_BLOCKS 4

If for sequential patterns we search only 4 blocks then why we are
maintaining history for 8 blocks

---
+ *
+ * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+ *
+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?
+ */
+    for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)

Correct, I think if we fetch this forward it will have an advantage
with memory prefetching.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
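To make the break-versus-continue question concrete, here is a minimal standalone sketch of a cache built from many small LRUs that stops scanning at the first unused entry. It is not the patch's code. The names (PrefetchCache, cache_add, LRU_SIZE, LRU_COUNT) are hypothetical, a simple modulo stands in for whatever hash the patch uses, and there is no "how recent is recent" threshold. The 8 entries per LRU come from the thread; the count of 128 is an assumption chosen so that 8 x 128 entries of 8kB blocks cover the ~8MB mentioned there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LRU_SIZE   8            /* entries per small LRU */
#define LRU_COUNT  128          /* number of small LRUs (assumed) */

typedef struct CacheEntry
{
    uint32_t block;             /* heap block number */
    uint64_t request;           /* prefetch sequence number; 0 = unused */
} CacheEntry;

typedef struct PrefetchCache
{
    uint64_t   next_request;    /* starts at 1 so that 0 can mean "unused" */
    CacheEntry entries[LRU_COUNT * LRU_SIZE];
} PrefetchCache;

static void
cache_init(PrefetchCache *cache)
{
    memset(cache, 0, sizeof(*cache));
    cache->next_request = 1;
}

/*
 * Returns true if the block was already cached (skip the prefetch) and
 * false if it was added just now (the caller should issue the prefetch).
 */
static bool
cache_add(PrefetchCache *cache, uint32_t block)
{
    CacheEntry *lru = &cache->entries[(block % LRU_COUNT) * LRU_SIZE];
    uint64_t    oldest = UINT64_MAX;
    int         victim = 0;

    assert(cache->next_request > 0);    /* cache_init() must have run */

    for (int i = 0; i < LRU_SIZE; i++)
    {
        CacheEntry *entry = &lru[i];

        /*
         * Entries are filled front to back and never cleared, so the first
         * unused entry means the rest are unused too: stop scanning.
         */
        if (entry->request == 0)
        {
            victim = i;
            break;
        }

        /* Found the block: refresh its age and report a hit. */
        if (entry->block == block)
        {
            entry->request = cache->next_request++;
            return true;
        }

        /* Track the oldest entry in case we have to evict one. */
        if (entry->request < oldest)
        {
            oldest = entry->request;
            victim = i;
        }
    }

    lru[victim].block = block;
    lru[victim].request = cache->next_request++;
    return false;
}
```

With break, a nearly empty LRU costs one or two comparisons per lookup instead of a full pass over all eight entries.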
On 12/20/23 20:09, Robert Haas wrote: > On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra > ... >> I have imagined something like this: >> >> nodeIndexscan / index_getnext_slot() >> -> no callback, all TIDs are prefetched >> >> nodeIndexonlyscan / index_getnext_tid() >> -> callback checks VM for the TID, prefetches if not all-visible >> -> the VM check result is stored in the queue with the VM (but in an >> extensible way, so that other callback can store other stuff) >> -> index_getnext_tid() also returns this extra information >> >> So not that different from the WIP patch, but in a "generic" and >> extensible way. Instead of hard-coding the all-visible flag, there'd be >> a something custom information. A bit like qsort_r() has a void* arg to >> pass custom context. >> >> Or if envisioned something different, could you elaborate a bit? > > I can't totally follow the sketch you give above, but I think we're > thinking along similar lines, at least. > Yeah, it's hard to discuss vague descriptions of code that does not exist yet. I'll try to do the actual patch, then we can discuss. >>> index_prefetch_is_sequential() makes me really nervous >>> because it seems to depend an awful lot on whether the OS is doing >>> prefetching, and how the OS is doing prefetching, and I think those >>> might not be consistent across all systems and kernel versions. >> >> If the OS does not have read-ahead, or it's not configured properly, >> then the patch does not perform worse than what we have now. I'm far >> more concerned about the opposite issue, i.e. causing regressions with >> OS-level read-ahead. And the check handles that well, I think. > > I'm just not sure how much I believe that it's going to work well > everywhere. I mean, I have no evidence that it doesn't, it just kind > of looks like guesswork to me. For instance, the behavior of the > algorithm depends heavily on PREFETCH_QUEUE_HISTORY and > PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. 
Who is > to say that on some system or workload you didn't test the required > values aren't entirely different, or that the whole algorithm doesn't > need rethinking? Maybe we can't really answer that question perfectly, > but the patch doesn't really explain the reasoning behind this choice > of algorithm. > You're right a lot of this is a guesswork. I don't think we can do much better, because it depends on stuff that's out of our control - each OS may do things differently, or perhaps it's just configured differently. But I don't think this is really a serious issue - all the read-ahead implementations need to work about the same, because they are meant to work in a transparent way. So it's about deciding at which point we think this is a sequential pattern. Yes, the OS may use a slightly different threshold, but the exact value does not really matter - in the worst case we prefetch a couple more/fewer blocks. The OS read-ahead can't really prefetch anything except sequential cases, so the whole question is "When does the access pattern get sequential enough?". I don't think there's a perfect answer, and I don't think we need a perfect one - we just need to be reasonably close. Also, while I don't want to lazily dismiss valid cases that might be affected by this, I think that sequential access for index paths is not that common (with the exception of clustered indexes). FWIW bitmap index scans have exactly the same "problem" except that no one cares about it because that's how it worked from the start, so it's not considered a regression. >>> Similarly with index_prefetch(). There's a lot of "magical" >>> assumptions here. Even index_prefetch_add_cache() has this problem -- >>> the function assumes that it's OK if we sometimes fail to detect a >>> duplicate prefetch request, which makes sense, but under what >>> circumstances is it necessary to detect duplicates and in what cases >>> is it optional? 
The function comments are silent about that, which >>> makes it hard to assess whether the algorithm is good enough. >> >> I don't quite understand what problem with duplicates you envision here. >> Strictly speaking, we don't need to detect/prevent duplicates - it's >> just that if you do posix_fadvise() for a block that's already in >> memory, it's overhead / wasted time. The whole point is to not do that >> very often. In this sense it's entirely optional, but desirable. > > Right ... but the patch sets up some data structure that will > eliminate duplicates in some circumstances and fail to eliminate them > in others. So it's making a judgement that the things it catches are > the cases that are important enough that we need to catch them, and > the things that it doesn't catch are cases that aren't particularly > important to catch. Here again, PREFETCH_LRU_SIZE and > PREFETCH_LRU_COUNT seem like they will have a big impact, but why > these values? The comments suggest that it's because we want to cover > ~8MB of data, but it's not clear why that should be the right amount > of data to cover. My naive thought is that we'd want to avoid > prefetching a block during the time between we had prefetched it and > when we later read it, but then the value that is here magically 8MB > should really be replaced by the operative prefetch distance. > True. Ideally we'd not issue prefetch request for data that's already in memory - either in shared buffers or page cache (or whatever). And we already do that for shared buffers, but not for page cache. The preadv2 experiment was an attempt to do that, but it's too expensive to help. So we have to approximate, and the only way I can think of is checking if we recently prefetched that block. Which is the whole point of this simple cache - remembering which blocks we prefetched, so that we don't prefetch them over and over again. I don't understand what you mean by "cases that are important enough". 
In a way, all the blocks are equally important, with exactly the same impact of making the wrong decision. You're certainly right the 8MB is a pretty arbitrary value, though. It seemed reasonable, so I used that, but I might just as well use 32MB or some other sensible value. Ultimately, any hard-coded value is going to be wrong, but the negative consequences are a bit asymmetrical. If the cache is too small, we may end up doing prefetches for data that's already in cache. If it's too large, we may not prefetch data that's not in memory at that point. Obviously, the latter case has much more severe impact, but it depends on the exact workload / access pattern etc. The only "perfect" solution would be to actually check the page cache, but well - that seems to be fairly expensive. What I was envisioning was something self-tuning, based on the I/O we may do later. If the prefetcher decides to prefetch something, but finds it's already in cache, we'd increase the distance, to remember more blocks. Likewise, if a block is not prefetched but then requires I/O later, decrease the distance. That'd make it adaptive, but I don't think we actually have the info about I/O. A bigger "flaw" is that these caches are per-backend, so there's no way to check if a block was recently prefetched by some other backend. I actually wonder if maybe this cache should be in shared memory, but I haven't tried. Alternatively, I was thinking about moving the prefetches into a separate worker process (or multiple workers), so we'd just queue the request and all the overhead would be done by the worker. The main problem is the overhead of calling posix_fadvise() for blocks that are already in memory, and this would just move it to a separate backend. I wonder if that might even make the custom cache unnecessary / optional. AFAICS this seems similar to some of the AIO patch, I wonder what that plans to do. I need to check. >> I really don't want to have multiple knobs. 
At this point we have three >> GUCs, each tuning prefetching for a fairly large part of the system: >> >> effective_io_concurrency = regular queries >> maintenance_io_concurrency = utility commands >> recovery_prefetch = recovery / PITR >> >> This seems sensible, but I really don't want many more GUCs tuning >> prefetching for different executor nodes or something like that. >> >> If we have issues with how effective_io_concurrency works (and I'm not >> sure that's actually true), then perhaps we should fix that rather than >> inventing new GUCs. > > Well, that would very possibly be a good idea, but I still think using > the same GUC for two different purposes is likely to cause trouble. I > think what effective_io_concurrency currently controls is basically > the heap prefetch distance for bitmap scans, and what you want to > control here is the heap prefetch distance for index scans. If those > are necessarily related in some understandable way (e.g. always the > same, one twice the other, one the square of the other) then it's fine > to use the same parameter for both, but it's not clear to me that this > is the case. I fear someone will find that if they crank up > effective_io_concurrency high enough to get the amount of prefetching > they want for bitmap scans, it will be too much for index scans, or > the other way around. > I understand, but I think we should really try to keep the number of knobs as low as possible, unless we actually have very good arguments for having separate GUCs. And I don't think we have that. This is very much about how many concurrent requests the storage can handle (or rather requires to benefit from the capabilities), and that's pretty orthogonal to which operation is generating the requests. I think this is pretty similar to what we do with work_mem - there's one value for all possible parts of the query plan, no matter if it's sort, group by, or something else. 
We do have separate limits for maintenance commands, because that's a
different matter, and we have the same for the two I/O GUCs.

If we come to the realization that we really need two GUCs, fine with me.
But at this point I don't see a reason to do that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/21/23 07:49, Dilip Kumar wrote:
> On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
> I was going through to understand the idea, couple of observations
>
> --
> +    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
> +    {
> +        entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
> +
> +        /* Is this the oldest prefetch request in this LRU? */
> +        if (entry->request < oldestRequest)
> +        {
> +            oldestRequest = entry->request;
> +            oldestIndex = i;
> +        }
> +
> +        /*
> +         * If the entry is unused (identified by request being set to 0),
> +         * we're done. Notice the field is uint64, so empty entry is
> +         * guaranteed to be the oldest one.
> +         */
> +        if (entry->request == 0)
> +            continue;
>
> If the 'entry->request == 0' then we should break instead of continue, right?
>

Yes, I think that's true. The small LRU caches are accessed/filled
linearly, so once we find an empty entry, all following entries are
going to be empty too.

I thought this shouldn't make any difference, because the LRUs are very
small (only 8 entries, and I don't think we should make them larger).
And it's going to go away once the cache gets full. But now that I think
about it, maybe this could matter for small queries that only ever hit a
couple rows. Hmmm, I'll have to check.

Thanks for noticing this!

> ---
> /*
>  * Used to detect sequential patterns (and disable prefetching).
>  */
> #define PREFETCH_QUEUE_HISTORY 8
> #define PREFETCH_SEQ_PATTERN_BLOCKS 4
>
> If for sequential patterns we search only 4 blocks then why we are
> maintaining history for 8 blocks
>
> ---

Right, I think there's no reason to keep these two separate constants. I
believe this is a remnant from an earlier patch version which tried to
do something smarter, but I ended up abandoning that.

>
> + *
> + * XXX Perhaps this should be tied to effective_io_concurrency somehow?
> + *
> + * XXX Could it be harmful that we read the queue backwards? Maybe memory
> + * prefetching works better for the forward direction?
> + */
> +    for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
>
> Correct, I think if we fetch this forward it will have an advantage
> with memory prefetching.
>

OK, although we only really have a couple uint32 values, so it should be
the same cacheline I guess.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
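The sequential-pattern check discussed above can be sketched with a single constant, as suggested in the reply. This is a simplified standalone illustration, not the patch's index_prefetch_is_sequential(). BlockHistory, history_add and SEQ_PATTERN_BLOCKS are hypothetical names, the history is a ring buffer sized to exactly the number of blocks compared, and the buffer is read backwards from the most recent entry, as the XXX comment describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEQ_PATTERN_BLOCKS 4    /* length of the run treated as sequential */

typedef struct BlockHistory
{
    uint32_t blocks[SEQ_PATTERN_BLOCKS];    /* ring buffer of recent blocks */
    uint64_t count;                         /* blocks recorded so far */
} BlockHistory;

/*
 * Record a block and report whether it completes a sequential run, i.e.
 * whether the previous SEQ_PATTERN_BLOCKS - 1 blocks were block - 1,
 * block - 2, and so on. A true result means OS read-ahead likely covers
 * the block already, so an explicit prefetch would be wasted effort.
 */
static bool
history_add(BlockHistory *h, uint32_t block)
{
    bool sequential;

    assert(h != NULL);

    /* Not enough history yet to call anything sequential. */
    sequential = (h->count >= SEQ_PATTERN_BLOCKS - 1);

    /* Walk the history backwards, starting at the most recent block. */
    for (uint32_t i = 1; sequential && i < SEQ_PATTERN_BLOCKS; i++)
    {
        uint32_t prev = h->blocks[(h->count - i) % SEQ_PATTERN_BLOCKS];

        if (block < i || prev != block - i)
            sequential = false;
    }

    h->blocks[h->count % SEQ_PATTERN_BLOCKS] = block;
    h->count++;
    return sequential;
}
```

With four uint32 values the whole history is 16 bytes, so it will normally fit in one cache line, which is why the direction of the scan should matter little here.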
Hi,

On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote:
> But there's a layering problem that I don't know how to solve - I don't
> see how we could make indexam.c entirely oblivious to the prefetching,
> and move it entirely to the executor. Because how else would you know
> what to prefetch?

> With index_getnext_tid() I can imagine fetching XIDs ahead, stashing
> them into a queue, and prefetching based on that. That's kinda what the
> patch does, except that it does it from inside index_getnext_tid(). But
> that does not work for index_getnext_slot(), because that already reads
> the heap tuples.

> We could say prefetching only works for index_getnext_tid(), but that
> seems a bit weird because that's what regular index scans do. (There's a
> patch to evaluate filters on index, which switches index scans to
> index_getnext_tid(), so that'd make prefetching work too, but I'd ignore
> that here.

I think we should just switch plain index scans to index_getnext_tid(). It's
one of the primary places triggering index scans, so a few additional lines
don't seem problematic.

I continue to think that we should not have split plain and index only scans
into separate files...

> There are other index_getnext_slot() callers, and I don't
> think we should accept does not work for those places seems wrong (e.g.
> execIndexing/execReplication would benefit from prefetching, I think).

I don't think it'd be a problem to have to opt into supporting
prefetching. There's plenty places where it doesn't really seem likely to be
useful, e.g. doing prefetching during syscache lookups is very likely just a
waste of time.

I don't think e.g. execReplication is likely to benefit from prefetching -
you're just fetching a single row after all. You'd need a lot of dead rows to
make it beneficial. I think it's similar in execIndexing.c.

I suspect we should work on providing executor nodes with some estimates about
the number of rows that are likely to be consumed.
If an index scan is under a LIMIT 1, we shouldn't prefetch. Similar for sequential scan with the infrastructure in https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com Greetings, Andres Freund
Hi, On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote: > You're right a lot of this is a guesswork. I don't think we can do much > better, because it depends on stuff that's out of our control - each OS > may do things differently, or perhaps it's just configured differently. > > But I don't think this is really a serious issue - all the read-ahead > implementations need to work about the same, because they are meant to > work in a transparent way. > > So it's about deciding at which point we think this is a sequential > pattern. Yes, the OS may use a slightly different threshold, but the > exact value does not really matter - in the worst case we prefetch a > couple more/fewer blocks. > > The OS read-ahead can't really prefetch anything except sequential > cases, so the whole question is "When does the access pattern get > sequential enough?". I don't think there's a perfect answer, and I don't > think we need a perfect one - we just need to be reasonably close. For the streaming read interface (initially backed by fadvise, to then be replaced by AIO) we found that it's clearly necessary to avoid fadvises in cases of actual sequential IO - the overhead otherwise leads to easily reproducible regressions. So I don't think we have much choice. > Also, while I don't want to lazily dismiss valid cases that might be > affected by this, I think that sequential access for index paths is not > that common (with the exception of clustered indexes). I think sequential access is common in other cases as well. There's lots of indexes where heap tids are almost perfectly correlated with index entries, consider insert-only tables and serial PKs or inserted_at timestamp columns. Even leaving those aside, for indexes with many entries for the same key, we sort by tid these days, which will also result in "runs" of sequential access. > Obviously, the latter case has much more severe impact, but it depends > on the exact workload / access pattern etc.
The only "perfect" solution > would be to actually check the page cache, but well - that seems to be > fairly expensive. > What I was envisioning was something self-tuning, based on the I/O we > may do later. If the prefetcher decides to prefetch something, but finds > it's already in cache, we'd increase the distance, to remember more > blocks. Likewise, if a block is not prefetched but then requires I/O > later, decrease the distance. That'd make it adaptive, but I don't think > we actually have the info about I/O. How would the prefetcher know that the data wasn't in cache? > Alternatively, I was thinking about moving the prefetches into a > separate worker process (or multiple workers), so we'd just queue the > request and all the overhead would be done by the worker. The main > problem is the overhead of calling posix_fadvise() for blocks that are > already in memory, and this would just move it to a separate backend. I > wonder if that might even make the custom cache unnecessary / optional. The AIO patchset provides this. > AFAICS this seems similar to some of the AIO patch, I wonder what that > plans to do. I need to check. Yes, most of this exists there. The difference is that with AIO you don't need to prefetch, as you can just initiate the IO for real, and wait for it to complete. Greetings, Andres Freund
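The "is this access pattern sequential enough?" check discussed in this message could be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names `SEQ_PATTERN_BLOCKS`, `BlockHistory`, `history_add` and `history_is_sequential` are hypothetical, and the threshold of 4 blocks is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: how many consecutive blocks count as "sequential". */
#define SEQ_PATTERN_BLOCKS 4

/* Small ring buffer remembering the most recently requested heap blocks. */
typedef struct BlockHistory
{
    uint32_t blocks[SEQ_PATTERN_BLOCKS];
    uint64_t count;             /* total number of blocks recorded so far */
} BlockHistory;

/* Record one more requested block, overwriting the oldest entry. */
static void
history_add(BlockHistory *h, uint32_t block)
{
    h->blocks[h->count % SEQ_PATTERN_BLOCKS] = block;
    h->count++;
}

/*
 * Return true if the last SEQ_PATTERN_BLOCKS recorded blocks form a run
 * b, b+1, b+2, ... In that case the OS read-ahead is expected to handle
 * the I/O, and issuing an explicit prefetch would only add overhead.
 */
static bool
history_is_sequential(const BlockHistory *h)
{
    if (h->count < SEQ_PATTERN_BLOCKS)
        return false;

    /* Walk backwards from the newest entry, comparing neighbours. */
    for (int i = 1; i < SEQ_PATTERN_BLOCKS; i++)
    {
        uint32_t prev = h->blocks[(h->count - i - 1) % SEQ_PATTERN_BLOCKS];
        uint32_t curr = h->blocks[(h->count - i) % SEQ_PATTERN_BLOCKS];

        if (curr != prev + 1)
            return false;
    }
    return true;
}
```

A caller would record each block with `history_add()` and skip the fadvise call whenever `history_is_sequential()` returns true. Whether the history is walked forwards or backwards should matter little here, since the whole array fits in a single cache line.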
On 12/21/23 14:43, Andres Freund wrote: > Hi, > > On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote: >> You're right a lot of this is a guesswork. I don't think we can do much >> better, because it depends on stuff that's out of our control - each OS >> may do things differently, or perhaps it's just configured differently. >> >> But I don't think this is really a serious issue - all the read-ahead >> implementations need to work about the same, because they are meant to >> work in a transparent way. >> >> So it's about deciding at which point we think this is a sequential >> pattern. Yes, the OS may use a slightly different threshold, but the >> exact value does not really matter - in the worst case we prefetch a >> couple more/fewer blocks. >> >> The OS read-ahead can't really prefetch anything except sequential >> cases, so the whole question is "When does the access pattern get >> sequential enough?". I don't think there's a perfect answer, and I don't >> think we need a perfect one - we just need to be reasonably close. > > For the streaming read interface (initially backed by fadvise, to then be > replaced by AIO) we found that it's clearly necessary to avoid fadvises in > cases of actual sequential IO - the overhead otherwise leads to easily > reproducible regressions. So I don't think we have much choice. > Yeah, the regression are pretty easy to demonstrate. In fact, I didn't have such detection in the first patch, but after the first round of benchmarks it became obvious it's needed. > >> Also, while I don't want to lazily dismiss valid cases that might be >> affected by this, I think that sequential access for index paths is not >> that common (with the exception of clustered indexes). > > I think sequential access is common in other cases as well. There's lots of > indexes where heap tids are almost perfectly correlated with index entries, > consider insert only insert-only tables and serial PKs or inserted_at > timestamp columns. 
Even leaving those aside, for indexes with many entries > for the same key, we sort by tid these days, which will also result in > "runs" of sequential access. > True. I should have thought about those cases. > >> Obviously, the latter case has much more severe impact, but it depends >> on the exact workload / access pattern etc. The only "perfect" solution >> would be to actually check the page cache, but well - that seems to be >> fairly expensive. > >> What I was envisioning was something self-tuning, based on the I/O we >> may do later. If the prefetcher decides to prefetch something, but finds >> it's already in cache, we'd increase the distance, to remember more >> blocks. Likewise, if a block is not prefetched but then requires I/O >> later, decrease the distance. That'd make it adaptive, but I don't think >> we actually have the info about I/O. > > How would the prefetcher know that hte data wasn't in cache? > I don't think there's a good way to do that, unfortunately, or at least I'm not aware of it. That's what I meant by "we don't have the info" at the end. Which is why I haven't tried implementing it. The only "solution" I could come up with was some sort of "timing" for the I/O requests and deducing what was cached. Not great, of course. > >> Alternatively, I was thinking about moving the prefetches into a >> separate worker process (or multiple workers), so we'd just queue the >> request and all the overhead would be done by the worker. The main >> problem is the overhead of calling posix_fadvise() for blocks that are >> already in memory, and this would just move it to a separate backend. I >> wonder if that might even make the custom cache unnecessary / optional. > > The AIO patchset provides this. > OK, I guess it's time for me to take a look at the patch again. > >> AFAICS this seems similar to some of the AIO patch, I wonder what that >> plans to do. I need to check. > > Yes, most of this exists there. 
The difference is that with AIO you don't > need to prefetch, as you can just initiate the IO for real, and wait for it to > complete. > Right, although the line where things stop being "prefetch" and become "async" seems a bit unclear to me / perhaps more a point of view. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/23 14:27, Andres Freund wrote: > Hi, > > On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote: >> But there's a layering problem that I don't know how to solve - I don't >> see how we could make indexam.c entirely oblivious to the prefetching, >> and move it entirely to the executor. Because how else would you know >> what to prefetch? > >> With index_getnext_tid() I can imagine fetching XIDs ahead, stashing >> them into a queue, and prefetching based on that. That's kinda what the >> patch does, except that it does it from inside index_getnext_tid(). But >> that does not work for index_getnext_slot(), because that already reads >> the heap tuples. > >> We could say prefetching only works for index_getnext_tid(), but that >> seems a bit weird because that's what regular index scans do. (There's a >> patch to evaluate filters on index, which switches index scans to >> index_getnext_tid(), so that'd make prefetching work too, but I'd ignore >> that here. > > I think we should just switch plain index scans to index_getnext_tid(). It's > one of the primary places triggering index scans, so a few additional lines > don't seem problematic. > > I continue to think that we should not have split plain and index only scans > into separate files... > I do agree with that opinion. Not just because of this prefetching thread, but also because of the discussions about index-only filters in a nearby thread. > >> There are other index_getnext_slot() callers, and I don't >> think we should accept does not work for those places seems wrong (e.g. >> execIndexing/execReplication would benefit from prefetching, I think). > > I don't think it'd be a problem to have to opt into supporting > prefetching. There's plenty places where it doesn't really seem likely to be > useful, e.g. doing prefetching during syscache lookups is very likely just a > waste of time. > > I don't think e.g. 
execReplication is likely to benefit from prefetching - > you're just fetching a single row after all. You'd need a lot of dead rows to > make it beneficial. I think it's similar in execIndexing.c. > Yeah, systable scans are unlikely to benefit from prefetching of this type. I'm not sure about execIndexing/execReplication, it wasn't clear to me but maybe you're right. > > I suspect we should work on providing executor nodes with some estimates about > the number of rows that are likely to be consumed. If an index scan is under a > LIMIT 1, we shouldn't prefetch. Similar for sequential scan with the > infrastructure in > https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com > Isn't this mostly addressed by the incremental ramp-up at the beginning? Even with target set to 1000, we only start prefetching 1, 2, 3, ... blocks ahead, it's not like we'll prefetch 1000 blocks right away. I did initially plan to also consider the number of rows we're expected to need, but I think it's actually harder than it might seem. With LIMIT for example we often don't know how selective the qual is, it's not like we can just stop prefetching after reading the first N tids. With other nodes it's good to remember those are just estimates - it'd be silly to be bitten both by a wrong estimate and also prefetching doing the wrong thing based on an estimate. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
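The incremental ramp-up mentioned above can be illustrated with a tiny sketch. The names `PrefetchRamp` and `ramp_next_distance` are hypothetical and not taken from the patch. The assumption is simply that the prefetch distance grows by one for each TID consumed, until it reaches the configured target.

```c
/* Hypothetical state for a gradually increasing prefetch distance. */
typedef struct PrefetchRamp
{
    int target;     /* maximum distance, e.g. derived from effective_io_concurrency */
    int distance;   /* current distance, starts at 0 */
} PrefetchRamp;

/*
 * Called once per TID returned to the caller. The distance grows
 * 1, 2, 3, ... and is capped at the target, so a scan that stops after
 * a single row never asks for more than one block ahead.
 */
static int
ramp_next_distance(PrefetchRamp *ramp)
{
    if (ramp->distance < ramp->target)
        ramp->distance++;

    return ramp->distance;
}
```

With a target of 1000, a scan under LIMIT 1 that consumes one TID sees a distance of 1, while a long scan reaches the full target only after 1000 TIDs.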
Hi, On 2023-12-21 16:20:45 +0100, Tomas Vondra wrote: > On 12/21/23 14:43, Andres Freund wrote: > >> AFAICS this seems similar to some of the AIO patch, I wonder what that > >> plans to do. I need to check. > > > > Yes, most of this exists there. The difference is that with AIO you don't > > need to prefetch, as you can just initiate the IO for real, and wait for it to > > complete. > > > > Right, although the line where things stop being "prefetch" and become > "async" seems a bit unclear to me / perhaps more a point of view. Agreed. What I meant with not needing prefetching was that you'd not use fadvise(), because it's better to instead just asynchronously read data into shared buffers. That way you don't have the doubling of syscalls and you need to care less about the buffering rate in the kernel. Greetings, Andres Freund
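To make the "doubling of syscalls" concrete, here is a minimal sketch of the fadvise-based approach using plain POSIX calls. It is not PostgreSQL code: `SKETCH_BLCKSZ`, `hint_block` and `read_block` are hypothetical names, and the 8 kB block size is an assumption matching the PostgreSQL default.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed block size */

/*
 * First syscall: ask the kernel to start reading the block into its
 * page cache. posix_fadvise() returns 0 on success, or an error number.
 */
static int
hint_block(int fd, long blockno)
{
    return posix_fadvise(fd, (off_t) blockno * SKETCH_BLCKSZ,
                         SKETCH_BLCKSZ, POSIX_FADV_WILLNEED);
}

/*
 * Second syscall, issued later for the same block: copy the data from
 * the kernel page cache into our own buffer.
 */
static ssize_t
read_block(int fd, long blockno, char *buf)
{
    return pread(fd, buf, SKETCH_BLCKSZ, (off_t) blockno * SKETCH_BLCKSZ);
}
```

Every prefetched block costs two system calls here, and the data still lands in the kernel cache first. With asynchronous I/O the read is started once and completes directly into the target buffer, which is the difference described above.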
On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I continue to think that we should not have split plain and index only scans > > into separate files... > > I do agree with that opinion. Not just because of this prefetching > thread, but also because of the discussions about index-only filters in > a nearby thread. For the record, in the original patch I submitted for this feature, it wasn't in separate files. If memory serves, Tom changed it. So don't blame me. :-) -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-12-21 11:00:34 -0500, Robert Haas wrote: > On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > I continue to think that we should not have split plain and index only scans > > > into separate files... > > > > I do agree with that opinion. Not just because of this prefetching > > thread, but also because of the discussions about index-only filters in > > a nearby thread. > > For the record, in the original patch I submitted for this feature, it > wasn't in separate files. If memory serves, Tom changed it. > > So don't blame me. :-) But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) Greetings, Andres Freund
On Thu, Dec 21, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote: > But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) Sadly, you're more likely to get the first one than you are to get the second one. I can't really see going back to revisit that decision as a basis for somebody else's new work -- it'd be better if the person doing the new work figured out what makes sense here. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/21/23 18:14, Robert Haas wrote: > On Thu, Dec 21, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote: >> But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) > > Sadly, you're more likely to get the first one than you are to get the > second one. I can't really see going back to revisit that decision as > a basis for somebody else's new work -- it'd be better if the person > doing the new work figured out what makes sense here. > I think it's a great example of "hindsight is 20/20". There were perfectly valid reasons to have two separate nodes, and it's not like these reasons somehow disappeared. It still is a perfectly reasonable decision. It's just that allowing index-only filters for regular index scans seems to eliminate pretty much all executor differences between the two nodes. But that's hard to predict - I certainly would not have even thought about that back when index-only scans were added. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Here's a somewhat reworked version of the patch. My initial goal was to see if it could adopt the StreamingRead API proposed in [1], but that turned out to be less straight-forward than I hoped, for two reasons: (1) The StreamingRead API seems to be designed for pages, but the index code naturally works with TIDs/tuples. Yes, the callbacks can associate the blocks with custom data (in this case that'd be the TID), but it seemed a bit strange ... (2) The place adding requests to the StreamingRead queue is pretty far from the place actually reading the pages - for prefetching, the requests would be generated in nodeIndexscan, but the page reading happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple. Sure, the TIDs would come from a callback, so it's a bit as if the requests were generated in heapam_index_fetch_tuple - but it has no idea StreamingRead exists, so where would it get it? We might teach it about it, but what if there are multiple places calling index_fetch_heap()? Not all of which may be using StreamingRead (only indexscans would do that). Or if there are multiple index scans, there'd need to be separate StreamingRead queues, right? In any case, I felt a bit out of my depth here, and I chose not to do all this work without discussing the direction here. (Also, see the point about cursors and xs_heap_continue a bit later in this post.) I did however like the general StreamingRead API - how it splits the work between the API and the callback. The patch used to do everything, which meant it hardcoded a lot of the IOS-specific logic etc. I did plan to have some sort of "callback" for reading from the queue, but that didn't quite solve this issue - a lot of the stuff remained hard-coded. But the StreamingRead API made me realize that having a callback for the first phase (that adds requests to the queue) would fix that. So I did that - there's now one simple callback for index scans, and a bit more complex callback for index-only scans.
Thanks to this the hard-coded stuff mostly disappears, which is good. Perhaps a bigger change is that I decided to move this into a separate API on top of indexam.c. The original idea was to integrate this into index_getnext_tid/index_getnext_slot, so that all callers benefit from the prefetching automatically. Which would be nice, but it also meant it'd need to happen in the indexam.c code, which seemed dirty. This patch introduces an API similar to StreamingRead. It calls the indexam.c stuff, but does all the prefetching on top of it, not in it. If a place calling index_getnext_tid() wants to allow prefetching, it needs to switch to IndexPrefetchNext(). (There's no function that would replace index_getnext_slot, at the moment. Maybe there should be.) Note 1: The IndexPrefetch name is a bit misleading, because it's used even with prefetching disabled - all index reads from the index scan happen through it. Maybe it should be called IndexReader or something like that. Note 2: I left the code in indexam.c for now, but in principle it could (should) be moved to a different place. I think this layering makes sense, and it's probably much closer to what Andres meant when he said the prefetching should happen in the executor. Even if the patch ends up using StreamingRead in the future, I guess we'll want something like IndexPrefetch - it might use the StreamingRead internally, but it would still need to do some custom stuff to detect I/O patterns or something that does not quite fit into the StreamingRead. Now, let's talk about two (mostly unrelated) problems I ran into. Firstly, I realized there's a bit of a problem with cursors. The prefetching works like this: 1) reading TIDs from the index 2) stashing them into a queue in IndexPrefetch 3) doing prefetches for the new TIDs added to the queue 4) returning the TIDs to the caller, one by one And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff like that. I'm not sure how difficult it'd be to make this work. I suppose we could simply discard the prefetched entries and do the right number of steps back for the index scan. But I haven't tried, and maybe it's more complex than I'm imagining. Also, if the cursor changes the direction a lot, it'd make the prefetching harmful. The patch simply disables prefetching for such queries, using the same logic that we do for parallelism. This may be over-zealous. FWIW this is one of the things that probably should remain outside of StreamingRead API - it seems pretty index-specific, and I'm not sure we'd even want to support these "backward" movements in the API. The other issue I'm aware of is handling xs_heap_continue. I believe it works fine for "false" but I need to take a look at non-MVCC snapshots (i.e. when xs_heap_continue=true). I haven't done any benchmarks with this reworked API - there's a couple more allocations etc. but it did not change in a fundamental way. I don't expect any major difference. regards [1] https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
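The four steps listed in this message (read TIDs, stash them in a queue, prefetch the new entries, return TIDs one by one) can be sketched as a small callback-driven queue. This is a simplified illustration under stated assumptions, not the patch's IndexPrefetch code: all names (`Tid`, `TidPrefetcher`, `prefetcher_next`, `DemoState` and the `demo_*` callbacks) are hypothetical, the queue has a fixed size with no ramp-up, and scan direction changes are not handled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 8            /* hypothetical fixed prefetch distance */

typedef struct Tid
{
    uint32_t block;             /* heap block number */
    uint16_t offset;            /* item offset within the block */
} Tid;

/* Callback producing the next TID from the index; false when exhausted. */
typedef bool (*NextTidFn) (void *state, Tid *tid);

/* Callback issuing a prefetch request for one heap block. */
typedef void (*PrefetchFn) (void *state, uint32_t block);

typedef struct TidPrefetcher
{
    Tid        queue[QUEUE_SIZE];   /* ring buffer of TIDs read ahead */
    uint64_t   head;                /* next entry to return to the caller */
    uint64_t   tail;                /* next entry to fill */
    bool       done;                /* index has no more TIDs */
    NextTidFn  next_cb;
    PrefetchFn prefetch_cb;
    void      *state;               /* opaque state passed to both callbacks */
} TidPrefetcher;

/* Return the next TID in scan order, reading ahead and prefetching as needed. */
static bool
prefetcher_next(TidPrefetcher *p, Tid *result)
{
    /* Steps 1-3: top up the queue and prefetch every newly added TID. */
    while (!p->done && p->tail - p->head < QUEUE_SIZE)
    {
        Tid tid;

        if (!p->next_cb(p->state, &tid))
        {
            p->done = true;
            break;
        }
        p->queue[p->tail % QUEUE_SIZE] = tid;
        p->tail++;
        p->prefetch_cb(p->state, tid.block);
    }

    /* Step 4: hand out the oldest queued TID, if there is one. */
    if (p->head == p->tail)
        return false;

    *result = p->queue[p->head % QUEUE_SIZE];
    p->head++;
    return true;
}

/* Demo callbacks: an "index" yielding blocks 0..nblocks-1, counting prefetches. */
typedef struct DemoState
{
    uint32_t next_block;
    uint32_t nblocks;
    int      prefetches;
} DemoState;

static bool
demo_next(void *state, Tid *tid)
{
    DemoState *s = (DemoState *) state;

    if (s->next_block >= s->nblocks)
        return false;
    tid->block = s->next_block++;
    tid->offset = 1;
    return true;
}

static void
demo_prefetch(void *state, uint32_t block)
{
    (void) block;               /* a real callback would issue the I/O hint here */
    ((DemoState *) state)->prefetches++;
}
```

Once the queue holds TIDs read ahead in one direction, a FETCH BACKWARD would need them discarded or replayed in reverse, which is why the patch disables prefetching when the direction may change.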
On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Here's a somewhat reworked version of the patch. My initial goal was to > see if it could adopt the StreamingRead API proposed in [1], but that > turned out to be less straight-forward than I hoped, for two reasons: I guess we need Thomas or Andres or maybe Melanie to comment on this. > Perhaps a bigger change is that I decided to move this into a separate > API on top of indexam.c. The original idea was to integrate this into > index_getnext_tid/index_getnext_slot, so that all callers benefit from > the prefetching automatically. Which would be nice, but it also meant > it's need to happen in the indexam.c code, which seemed dirty. This patch is hard to review right now because there's a bunch of comment updating that doesn't seem to have been done for the new design. For instance: + * XXX This does not support prefetching of heap pages. When such prefetching is + * desirable, use index_getnext_tid(). But not any more. + * XXX The prefetching may interfere with the patch allowing us to evaluate + * conditions on the index tuple, in which case we may not need the heap + * tuple. Maybe if there's such filter, we should prefetch only pages that + * are not all-visible (and the same idea would also work for IOS), but + * it also makes the indexing a bit "aware" of the visibility stuff (which + * seems a somewhat wrong). Also, maybe we should consider the filter selectivity I'm not sure whether all the problems in this area are solved, but I think you've solved enough of them that this at least needs rewording, if not removing. + * XXX Comment/check seems obsolete. This occurs in two places. I'm not sure if it's accurate or not. + * XXX Could this be an issue for the prefetching? What if we prefetch something + * but the direction changes before we get to the read? If that could happen, + * maybe we should discard the prefetched data and go back? 
But can we even + * do that, if we already fetched some TIDs from the index? I don't think + * indexorderdir can't change, but es_direction maybe can? But your email claims that "The patch simply disables prefetching for such queries, using the same logic that we do for parallelism." FWIW, I think that's a fine way to handle that case. + * XXX Maybe we should enable prefetching, but prefetch only pages that + * are not all-visible (but checking that from the index code seems like + * a violation of layering etc). Isn't this fixed now? Note this comment occurs twice. + * XXX We need to disable this in some cases (e.g. when using index-only + * scans, we don't want to prefetch pages). Or maybe we should prefetch + * only pages that are not all-visible, that'd be even better. Here again. And now for some comments on other parts of the patch, mostly other XXX comments: + * XXX This does not support prefetching of heap pages. When such prefetching is + * desirable, use index_getnext_tid(). There's probably no reason to write XXX here. The comment is fine. + * XXX Notice we haven't added the block to the block queue yet, and there + * is a preceding block (i.e. blockIndex-1 is valid). Same here, possibly? If this XXX indicates a defect in the code, I don't know what the defect is, so I guess it needs to be more clear. If it is just explaining the code, then there's no reason for the comment to say XXX. + * XXX Could it be harmful that we read the queue backwards? Maybe memory + * prefetching works better for the forward direction? It does. But I don't know whether that matters here or not. + * XXX We do add the cache size to the request in order not to + * have issues with uint64 underflows. I don't know what this means. + * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot, + * maybe nodeIndexscan needs to do something more to handle this? Although, that + * should be in the indexscan next_cb callback, probably. 
+ * + * XXX If xs_heap_continue=true, we need to return the last TID. You've got a bunch of comments about xs_heap_continue here -- and I don't fully understand what the issues are here with respect to this particular patch, but I think that the general purpose of xs_heap_continue is to handle the case where we need to return more than one tuple from the same HOT chain. With an MVCC snapshot that doesn't happen, but with say SnapshotAny or SnapshotDirty, it could. As far as possible, the prefetcher shouldn't be involved at all when xs_heap_continue is set, I believe, because in that case we're just returning a bunch of tuples from the same page, and the extra fetches from that heap page shouldn't trigger or require any further prefetching. + * XXX Should this also look at plan.plan_rows and maybe cap the target + * to that? Pointless to prefetch more than we expect to use. Or maybe + * just reset to that value during prefetching, after reading the next + * index page (or rather after rescan)? It seems questionable to use plan_rows here because (1) I don't think we have existing cases where we use the estimated row count in the executor for anything, we just carry it through so EXPLAIN can print it and (2) row count estimates can be really far off, especially if we're on the inner side of a nested loop, we might like to figure that out eventually instead of just DTWT forever. But on the other hand this does feel like an important case where we have a clue that prefetching might need to be done less aggressively or not at all, and it doesn't seem right to ignore that signal either. I wonder if we want this shaped in some other way, like a Boolean that says are-we-under-a-potentially-row-limiting-construct e.g. limit or inner side of a semi-join or anti-join. + * We reach here if the index only scan is not parallel, or if we're + * serially executing an index only scan that was planned to be + * parallel. Well, this seems sad. 
+ * XXX This might lead to IOS being slower than plain index scan, if the + * table has a lot of pages that need recheck. How? + /* + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit + * of a misuse of the flag, but we need to disable prefetching for cursors + * (which might change direction), and parallelModeOK does that. But maybe + * we might (or should) have a separate flag. + */ I think the correct flag to be using here is execute_once, which captures whether the executor could potentially be invoked a second time for the same portal. Changes in the fetch direction are possible if and only if !execute_once. > Note 1: The IndexPrefetch name is a bit misleading, because it's used > even with prefetching disabled - all index reads from the index scan > happen through it. Maybe it should be called IndexReader or something > like that. My biggest gripe here is the capitalization. This version adds, inter alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and index_heap_prefetch_target, which seems like one or two too many conventions. But maybe the PREFETCH_* macros don't even belong in a public header. I do like the index_heap_prefetch_* naming. Possibly that's too verbose to use for everything, but calling this index-heap-prefetch rather than index-prefetch seems clearer. -- Robert Haas EDB: http://www.enterprisedb.com
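The two signals suggested in this review (the execute_once flag, and a Boolean for "are we under a potentially row-limiting construct") could feed into the prefetch target roughly as sketched below. This is only an illustration of the idea: `ScanContext`, `maybe_row_limited`, `effective_prefetch_target` and the cap of 4 are hypothetical, and nothing like the Boolean exists in the executor today as far as this thread indicates.

```c
#include <stdbool.h>

/* Hypothetical cap applied when the scan may stop early; chosen arbitrarily. */
#define ROW_LIMITED_TARGET_CAP 4

typedef struct ScanContext
{
    bool execute_once;          /* executor will not be invoked again for this portal */
    bool maybe_row_limited;     /* hypothetical: under LIMIT, or inner side of a semi/anti-join */
} ScanContext;

/*
 * Decide how far ahead to prefetch. Returns 0 to disable prefetching
 * when the fetch direction could change (i.e. !execute_once), and a
 * reduced target when the scan may consume only a few rows.
 */
static int
effective_prefetch_target(const ScanContext *ctx, int configured_target)
{
    if (!ctx->execute_once)
        return 0;

    if (ctx->maybe_row_limited && configured_target > ROW_LIMITED_TARGET_CAP)
        return ROW_LIMITED_TARGET_CAP;

    return configured_target;
}
```

A Boolean of this kind avoids trusting plan_rows directly, so a badly wrong row estimate cannot by itself switch prefetching off.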
Hi, Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review. I'll briefly go through the main changes in the patch, and then will respond in-line to Robert's points. 1) I moved the code from indexam.c to (new) execPrefetch.c. All the prototypes / typedefs now live in executor.h, with only minimal changes in execnodes.h (adding it to scan descriptors). I believe this finally moves the code to the right place - it feels much nicer and cleaner than in indexam.c. And it allowed me to hide a bunch of internal structs and improve the general API, I think. I'm sure there's stuff that could be named differently, but the layering feels about right, I think. 2) A bunch of stuff got renamed to start with IndexPrefetch... to make the naming consistent / clearer. I'm not entirely sure IndexPrefetch is the right name, though - it's still a bit misleading, as it might seem it's about prefetching index stuff, but really it's about heap pages from indexes. Maybe IndexScanPrefetch() or something like that? 3) If there's a way to make this work with the streaming I/O API, I'm not aware of it. But the overall design seems somewhat similar (based on "next" callback etc.) so hopefully that'd make it easier to adopt it. 4) I initially relied on parallelModeOK to disable prefetching, which kinda worked, but not really. Robert suggested to use the execute_once flag directly, and I think that's much better - not only is it cleaner, it also seems more appropriate (the parallel flag considers other stuff that is not quite relevant to prefetching). Thinking about this, I think it should be possible to make prefetching work even for plans with execute_once=false. In particular, when the plan changes direction it should be possible to simply "walk back" the prefetch queue, to get to the "correct" place in the scan.
But I'm not sure it's worth it, because plans that change direction often can't really benefit from prefetches anyway - they'll often visit stuff they accessed shortly before anyway. For plans that don't change direction but may pause, we don't know if the plan pauses long enough for the prefetched pages to get evicted or something. So I think it's OK that execute_once=false means no prefetching. 5) I haven't done anything about the xs_heap_continue=true case yet. 6) I went through all the comments and reworked them considerably. The main comment at execPrefetch.c start, with some overall design etc. And then there are comments for each function, explaining that bit in more detail. Or at least that's the goal - there's still work to do. There's two trivial FIXMEs, but you can ignore those - it's not that there's a bug, but that I'd like to rework something and just don't know how yet. There's also a couple of XXX comments. Some are a bit wild ideas for the future, others are somewhat "open questions" to be discussed during a review. Anyway, there should be no outright obsolete comments - if there's something I missed, let me know. Now to Robert's message ... On 1/9/24 21:31, Robert Haas wrote: > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Here's a somewhat reworked version of the patch. My initial goal was to >> see if it could adopt the StreamingRead API proposed in [1], but that >> turned out to be less straight-forward than I hoped, for two reasons: > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the streaming I/O stuff. >> Perhaps a bigger change is that I decided to move this into a separate >> API on top of indexam.c. The original idea was to integrate this into >> index_getnext_tid/index_getnext_slot, so that all callers benefit from >> the prefetching automatically. 
Which would be nice, but it also meant >> it's need to happen in the indexam.c code, which seemed dirty. > > This patch is hard to review right now because there's a bunch of > comment updating that doesn't seem to have been done for the new > design. For instance: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > But not any more. > True. And this is now even more obsolete, as the prefetching was moved from indexam.c layer to the executor. > + * XXX The prefetching may interfere with the patch allowing us to evaluate > + * conditions on the index tuple, in which case we may not need the heap > + * tuple. Maybe if there's such filter, we should prefetch only pages that > + * are not all-visible (and the same idea would also work for IOS), but > + * it also makes the indexing a bit "aware" of the visibility stuff (which > + * seems a somewhat wrong). Also, maybe we should consider the filter > selectivity > > I'm not sure whether all the problems in this area are solved, but I > think you've solved enough of them that this at least needs rewording, > if not removing. > > + * XXX Comment/check seems obsolete. > > This occurs in two places. I'm not sure if it's accurate or not. > > + * XXX Could this be an issue for the prefetching? What if we > prefetch something > + * but the direction changes before we get to the read? If that > could happen, > + * maybe we should discard the prefetched data and go back? But can we even > + * do that, if we already fetched some TIDs from the index? I don't think > + * indexorderdir can't change, but es_direction maybe can? > > But your email claims that "The patch simply disables prefetching for > such queries, using the same logic that we do for parallelism." FWIW, > I think that's a fine way to handle that case. > True. 
I left behind this comment partly intentionally, to point out why we disable the prefetching in these cases, but you're right the comment now explains something that can't happen. > + * XXX Maybe we should enable prefetching, but prefetch only pages that > + * are not all-visible (but checking that from the index code seems like > + * a violation of layering etc). > > Isn't this fixed now? Note this comment occurs twice. > > + * XXX We need to disable this in some cases (e.g. when using index-only > + * scans, we don't want to prefetch pages). Or maybe we should prefetch > + * only pages that are not all-visible, that'd be even better. > > Here again. > Sorry, you're right those comments (and a couple more nearby) were stale. Removed / clarified. > And now for some comments on other parts of the patch, mostly other > XXX comments: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > There's probably no reason to write XXX here. The comment is fine. > > + * XXX Notice we haven't added the block to the block queue yet, and there > + * is a preceding block (i.e. blockIndex-1 is valid). > > Same here, possibly? If this XXX indicates a defect in the code, I > don't know what the defect is, so I guess it needs to be more clear. > If it is just explaining the code, then there's no reason for the > comment to say XXX. > Yeah, removed the XXX / reworded a bit. > + * XXX Could it be harmful that we read the queue backwards? Maybe memory > + * prefetching works better for the forward direction? > > It does. But I don't know whether that matters here or not. > > + * XXX We do add the cache size to the request in order not to > + * have issues with uint64 underflows. > > I don't know what this means. 
> There's a check that does this: (x + PREFETCH_CACHE_SIZE) >= y it might also be done as "mathematically equivalent" x >= (y - PREFETCH_CACHE_SIZE) but if the "y" is an uint64, and the value is smaller than the constant, this would underflow. It'd eventually disappear, once the "y" gets large enough, ofc. > + * XXX not sure this correctly handles xs_heap_continue - see > index_getnext_slot, > + * maybe nodeIndexscan needs to do something more to handle this? > Although, that > + * should be in the indexscan next_cb callback, probably. > + * > + * XXX If xs_heap_continue=true, we need to return the last TID. > > You've got a bunch of comments about xs_heap_continue here -- and I > don't fully understand what the issues are here with respect to this > particular patch, but I think that the general purpose of > xs_heap_continue is to handle the case where we need to return more > than one tuple from the same HOT chain. With an MVCC snapshot that > doesn't happen, but with say SnapshotAny or SnapshotDirty, it could. > As far as possible, the prefetcher shouldn't be involved at all when > xs_heap_continue is set, I believe, because in that case we're just > returning a bunch of tuples from the same page, and the extra fetches > from that heap page shouldn't trigger or require any further > prefetching. > Yes, that's correct. The current code simply ignores that flag and just proceeds to the next TID. Which is correct for xs_heap_continue=false, and thus all MVCC snapshots work fine. But for the Any/Dirty case it needs to work a bit differently. > + * XXX Should this also look at plan.plan_rows and maybe cap the target > + * to that? Pointless to prefetch more than we expect to use. Or maybe > + * just reset to that value during prefetching, after reading the next > + * index page (or rather after rescan)? 
> > It seems questionable to use plan_rows here because (1) I don't think > we have existing cases where we use the estimated row count in the > executor for anything, we just carry it through so EXPLAIN can print > it and (2) row count estimates can be really far off, especially if > we're on the inner side of a nested loop, we might like to figure that > out eventually instead of just DTWT forever. But on the other hand > this does feel like an important case where we have a clue that > prefetching might need to be done less aggressively or not at all, and > it doesn't seem right to ignore that signal either. I wonder if we > want this shaped in some other way, like a Boolean that says > are-we-under-a-potentially-row-limiting-construct e.g. limit or inner > side of a semi-join or anti-join. > The current code actually does look at plan_rows when calculating the prefetch target: prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation, node->ss.ps.plan->plan_rows, estate->es_use_prefetching); but I agree maybe it should not, for the reasons you explain. I'm not attached to this part. > + * We reach here if the index only scan is not parallel, or if we're > + * serially executing an index only scan that was planned to be > + * parallel. > > Well, this seems sad. > Stale comment, I believe. However, I didn't see much benefits with parallel index scan during testing. Having I/O from multiple workers generally had the same effect, I think. > + * XXX This might lead to IOS being slower than plain index scan, if the > + * table has a lot of pages that need recheck. > > How? > The comment is not particularly clear what "this" means, but I believe this was about index-only scan with many not-all-visible pages. If it didn't do prefetching, a regular index scan with prefetching may be way faster. But the code actually allows doing prefetching even for IOS, by checking the vm in the "next" callback. 
> + /* > + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit > + * of a misuse of the flag, but we need to disable prefetching for cursors > + * (which might change direction), and parallelModeOK does that. But maybe > + * we might (or should) have a separate flag. > + */ > > I think the correct flag to be using here is execute_once, which > captures whether the executor could potentially be invoked a second > time for the same portal. Changes in the fetch direction are possible > if and only if !execute_once. > Right. The new patch version does that. >> Note 1: The IndexPrefetch name is a bit misleading, because it's used >> even with prefetching disabled - all index reads from the index scan >> happen through it. Maybe it should be called IndexReader or something >> like that. > > My biggest gripe here is the capitalization. This version adds, inter > alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and > index_heap_prefetch_target, which seems like one or two too many > conventions. But maybe the PREFETCH_* macros don't even belong in a > public header. > > I do like the index_heap_prefetch_* naming. Possibly that's too > verbose to use for everything, but calling this index-heap-prefetch > rather than index-prefetch seems clearer. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, to keep it consistent. And then the constants are all capital, ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Not a full response, but just to address a few points: On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Thinking about this, I think it should be possible to make prefetching > work even for plans with execute_once=false. In particular, when the > plan changes direction it should be possible to simply "walk back" the > prefetch queue, to get to the "correct" place in in the scan. But I'm > not sure it's worth it, because plans that change direction often can't > really benefit from prefetches anyway - they'll often visit stuff they > accessed shortly before anyway. For plans that don't change direction > but may pause, we don't know if the plan pauses long enough for the > prefetched pages to get evicted or something. So I think it's OK that > execute_once=false means no prefetching. +1. > > + * XXX We do add the cache size to the request in order not to > > + * have issues with uint64 underflows. > > > > I don't know what this means. > > > > There's a check that does this: > > (x + PREFETCH_CACHE_SIZE) >= y > > it might also be done as "mathematically equivalent" > > x >= (y - PREFETCH_CACHE_SIZE) > > but if the "y" is an uint64, and the value is smaller than the constant, > this would underflow. It'd eventually disappear, once the "y" gets large > enough, ofc. The problem is, I think, that there's no particular reason that someone reading the existing code should imagine that it might have been done in that "mathematically equivalent" fashion. I imagined that you were trying to make a point about adding the cache size to the request vs. adding nothing, whereas in reality you were trying to make a point about adding from one side vs. subtracting from the other. > > + * We reach here if the index only scan is not parallel, or if we're > > + * serially executing an index only scan that was planned to be > > + * parallel. > > > > Well, this seems sad. > > Stale comment, I believe. 
However, I didn't see much benefits with > parallel index scan during testing. Having I/O from multiple workers > generally had the same effect, I think. Fair point, likely worth mentioning explicitly in the comment. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, > to keep it consistent. And then the constants are all capital, ofc. It'd still be nice to get table or heap in there, IMHO, but maybe we can't, and consistency is certainly a good thing regardless of the details, so thanks for that. -- Robert Haas EDB: http://www.enterprisedb.com
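To make the uint64 point above concrete, here is a minimal sketch of the two "mathematically equivalent" forms of the check. The function names and the value of PREFETCH_CACHE_SIZE are assumptions made for illustration and are not taken from the patch; only the shape of the comparison comes from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value, for illustration only; the patch may use another. */
#define PREFETCH_CACHE_SIZE 1024

/*
 * Form used in the check quoted above: add the constant on the left.
 * For realistic request counters this cannot wrap around.
 */
bool
within_window_add(uint64_t x, uint64_t y)
{
    return (x + PREFETCH_CACHE_SIZE) >= y;
}

/*
 * "Mathematically equivalent" form: subtract the constant on the right.
 * With unsigned arithmetic, y - PREFETCH_CACHE_SIZE wraps around to a
 * huge value whenever y < PREFETCH_CACHE_SIZE, so the comparison gives
 * the wrong answer until y grows past the constant.
 */
bool
within_window_sub(uint64_t x, uint64_t y)
{
    return x >= (y - PREFETCH_CACHE_SIZE);
}
```

With x = 0 and y = 10 the first form returns true while the second returns false, because 10 - 1024 wraps around in uint64 arithmetic. Once y is at least PREFETCH_CACHE_SIZE the two forms agree, which matches the remark that the problem would "eventually disappear".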
Hi,
On 12/01/2024 6:42 pm, Tomas Vondra wrote:
> Hi,
>
> Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review.
I am thinking about testing your patch with Neon (cloud Postgres). Since Neon separates compute and storage, prefetch is much more critical for the Neon
architecture than for vanilla Postgres.
I have a few complaints:
1. It disables prefetch for sequential access patterns (i.e. INDEX MERGE), on the grounds that in this case OS read-ahead will be more efficient than prefetch. That may be true for normal storage devices, but not for Neon storage, and maybe also not for Postgres on top of a DFS (i.e. Amazon RDS). I wonder if we can delegate the decision whether to perform prefetch in this case to some other level. I do not know precisely where it should be handled. The best candidate IMHO is the storage manager, but that most likely requires an extension of the SMGR API. Not sure if you want to do it... A straightforward solution is to move this logic to some callback, which can be overridden by the user.
2. It disables prefetch for direct_io. This seems even more obvious than 1), because prefetching using `posix_fadvise` is definitely not possible when using direct_io. But in theory, if SMGR provides some alternative prefetch implementation (as in the case of Neon), this also may not be true. It is still unclear why we would want to use direct_io in Neon... But I would still prefer to move this decision outside the executor.
3. It doesn't prefetch leaf pages for IOS, only the referenced heap pages which are not marked as all-visible. It seems to me that if the optimizer has chosen IOS (and not a bitmap heap scan, for example), then there should be a large enough fraction of all-visible pages. Also, index prefetch is most efficient for OLAP queries, and those are usually performed on historical data, which is all-visible. But IOS can really be handled separately in some other PR. Frankly speaking, combining prefetch of leaf B-Tree pages and referenced heap pages seems to be a very challenging task.
4. I think that performing prefetch at the executor level is a really great idea, so prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if the index can provide fast access to the next TID (located on the same page). I am not sure that this is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AMs. I wonder if we should extend the AM API to let the index decide whether to perform prefetch of TIDs or not.
5. Minor notice: there are a few places where index_getnext_slot is called with a NULL last parameter (disabled prefetch) with the following comment
"XXX Would be nice to also benefit from prefetching here." But all these places correspond to "point lookups", i.e. unique constraint checks, finding a replication tuple by index... Prefetch seems unlikely to be useful here, unless the index is bloated and we have to skip a lot of tuples before locating the right one. But should we try to optimize the case of bloated indexes?
On 1/16/24 09:13, Konstantin Knizhnik wrote: > Hi, > > On 12/01/2024 6:42 pm, Tomas Vondra wrote: >> Hi, >> >> Here's an improved version of this patch, finishing a lot of the stuff >> that I alluded to earlier - moving the code from indexam.c, renaming a >> bunch of stuff, etc. I've also squashed it into a single patch, to make >> it easier to review. > > I am thinking about testing you patch with Neon (cloud Postgres). As far > as Neon seaprates compute and storage, prefetch is much more critical > for Neon > architecture than for vanilla Postgres. > > I have few complaints: > > 1. It disables prefetch for sequential access pattern (i.e. INDEX > MERGE), motivating it that in this case OS read-ahead will be more > efficient than prefetch. It may be true for normal storage devices, bit > not for Neon storage and may be also for Postgres on top of DFS (i.e. > Amazon RDS). I wonder if we can delegate decision whether to perform > prefetch in this case or not to some other level. I do not know > precisely where is should be handled. The best candidate IMHO is > storager manager. But it most likely requires extension of SMGR API. Not > sure if you want to do it... Straightforward solution is to move this > logic to some callback, which can be overwritten by user. > Interesting point. You're right these decisions (whether to prefetch particular patterns) are closely tied to the capabilities of the storage system. So it might make sense to maybe define it at that level. Not sure what exactly RDS does with the storage - my understanding is that it's mostly regular Postgres code, but managed by Amazon. So how would that modify the prefetching logic? However, I'm not against making this modular / wrapping this in some sort of callbacks, for example. > 2. It disables prefetch for direct_io. It seems to be even more obvious > than 1), because prefetching using `posix_fadvise` definitely not > possible in case of using direct_io. 
But in theory if SMGR provides some > alternative prefetch implementation (as in case of Neon), this also may > be not true. Still unclear why we can want to use direct_io in Neon... > But still I prefer to mo.ve this decision outside executor. > True. I think this would / should be customizable by the callback. > 3. It doesn't perform prefetch of leave pages for IOS, only referenced > heap pages which are not marked as all-visible. It seems to me that if > optimized has chosen IOS (and not bitmap heap scan for example), then > there should be large enough fraction for all-visible pages. Also index > prefetch is most efficient for OLAp queries and them are used to be > performance for historical data which is all-visible. But IOS can be > really handled separately in some other PR. Frankly speaking combining > prefetch of leave B-Tree pages and referenced heap pages seems to be > very challenged task. > I see prefetching of leaf pages as interesting / worthwhile improvement, but out of scope for this patch. I don't think it can be done at the executor level - the prefetch requests need to be submitted from the index AM code (by calling PrefetchBuffer, etc.) > 4. I think that performing prefetch at executor level is really great > idea and so prefetch can be used by all indexes, including custom > indexes. But prefetch will be efficient only if index can provide fast > access to next TID (located at the same page). I am not sure that it is > true for all builtin indexes (GIN, GIST, BRIN,...) and especially for > custom AM. I wonder if we should extend AM API to make index make a > decision weather to perform prefetch of TIDs or not. I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that. > > 5. 
Minor notice: there are few places where index_getnext_slot is called > with last NULL parameter (disabled prefetch) with the following comment > "XXX Would be nice to also benefit from prefetching here." But all this > places corresponds to "point loopkup", i.e. unique constraint check, > find replication tuple by index... Prefetch seems to be unlikely useful > here, unlkess there is index bloating and and we have to skip a lot of > tuples before locating right one. But should we try to optimize case of > bloated indexes? > Are you sure you're looking at the last patch version? Because the current patch does not have any new parameters in index_getnext_* and the comments were removed too (I suppose you're talking about execIndexing, execReplication and those places). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 16, 2024 at 11:25 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > 3. It doesn't perform prefetch of leave pages for IOS, only referenced > > heap pages which are not marked as all-visible. It seems to me that if > > optimized has chosen IOS (and not bitmap heap scan for example), then > > there should be large enough fraction for all-visible pages. Also index > > prefetch is most efficient for OLAp queries and them are used to be > > performance for historical data which is all-visible. But IOS can be > > really handled separately in some other PR. Frankly speaking combining > > prefetch of leave B-Tree pages and referenced heap pages seems to be > > very challenged task. > > I see prefetching of leaf pages as interesting / worthwhile improvement, > but out of scope for this patch. I don't think it can be done at the > executor level - the prefetch requests need to be submitted from the > index AM code (by calling PrefetchBuffer, etc.) +1. This is a good feature, and so is that, but they're not the same feature, despite the naming problems. -- Robert Haas EDB: http://www.enterprisedb.com
> On 1/16/24 09:13, Konstantin Knizhnik wrote:
>> Hi,
>>
>> On 12/01/2024 6:42 pm, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review.
>>
>> I am thinking about testing you patch with Neon (cloud Postgres). As far as Neon seaprates compute and storage, prefetch is much more critical for Neon architecture than for vanilla Postgres.
>>
>> I have few complaints:
>>
>> 1. It disables prefetch for sequential access pattern (i.e. INDEX MERGE), motivating it that in this case OS read-ahead will be more efficient than prefetch. It may be true for normal storage devices, bit not for Neon storage and may be also for Postgres on top of DFS (i.e. Amazon RDS). I wonder if we can delegate decision whether to perform prefetch in this case or not to some other level. I do not know precisely where is should be handled. The best candidate IMHO is storager manager. But it most likely requires extension of SMGR API. Not sure if you want to do it... Straightforward solution is to move this logic to some callback, which can be overwritten by user.
>
> Interesting point. You're right these decisions (whether to prefetch particular patterns) are closely tied to the capabilities of the storage system. So it might make sense to maybe define it at that level. Not sure what exactly RDS does with the storage - my understanding is that it's mostly regular Postgres code, but managed by Amazon. So how would that modify the prefetching logic?
Amazon RDS is just vanilla Postgres with a file system mounted on EBS (Amazon distributed file system).
EBS provides good throughput but larger latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.
>> 4. I think that performing prefetch at executor level is really great idea and so prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if index can provide fast access to next TID (located at the same page). I am not sure that it is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AM. I wonder if we should extend AM API to make index make a decision weather to perform prefetch of TIDs or not.
>
> I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that.
I tend to agree with you - it is hard to imagine an index implementation which doesn't benefit from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will fetch the heap tuple in any case.
>> 5. Minor notice: there are few places where index_getnext_slot is called with last NULL parameter (disabled prefetch) with the following comment "XXX Would be nice to also benefit from prefetching here." But all this places corresponds to "point loopkup", i.e. unique constraint check, find replication tuple by index... Prefetch seems to be unlikely useful here, unlkess there is index bloating and and we have to skip a lot of tuples before locating right one. But should we try to optimize case of bloated indexes?
>
> Are you sure you're looking at the last patch version? Because the current patch does not have any new parameters in index_getnext_* and the comments were removed too (I suppose you're talking about execIndexing, execReplication and those places).
Sorry, I looked at v20240103-0001-prefetch-2023-12-09.patch; I didn't notice v20240112-0001-Prefetch-heap-pages-during-index-scans.patch.
regards
On 1/16/24 2:10 PM, Konstantin Knizhnik wrote: > Amazon RDS is just vanilla Postgres with file system mounted on EBS > (Amazon distributed file system). > EBS provides good throughput but larger latencies comparing with local SSDs. > I am not sure if read-ahead works for EBS. Actually, EBS only provides a block device - it's definitely not a filesystem itself (*EFS* is a filesystem - but it's also significantly different than EBS). So as long as readahead is happening somewhere above the block device I would expect it to JustWork on EBS. Of course, Aurora Postgres (like Neon) is completely different. If you look at page 53 of [1] you'll note that there's two different terms used: prefetch and batch. I'm not sure how much practical difference there is, but batched IO (one IO request to Aurora Storage for many blocks) predates index prefetch; VACUUM in APG has used batched IO for a very long time (it also *only* reads blocks that aren't marked all visible/frozen; none of the "only skip if skipping at least 32 blocks" logic is used). 1: https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf -- Jim Nasby, Data Architect, Austin TX
On 16/01/2024 11:58 pm, Jim Nasby wrote: > On 1/16/24 2:10 PM, Konstantin Knizhnik wrote: >> Amazon RDS is just vanilla Postgres with file system mounted on EBS >> (Amazon distributed file system). >> EBS provides good throughput but larger latencies comparing with >> local SSDs. >> I am not sure if read-ahead works for EBS. > > Actually, EBS only provides a block device - it's definitely not a > filesystem itself (*EFS* is a filesystem - but it's also significantly > different than EBS). So as long as readahead is happening somewheer > above the block device I would expect it to JustWork on EBS. Thank you for the clarification. Yes, EBS is just a block device and read-ahead can be used for it as for any other local device. There is actually a recommendation to increase read-ahead for EBS devices to reach better performance on some workloads: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html So it looks like manual prefetching is not needed for sequential access patterns on EBS. But in Neon the situation is quite different. Maybe Aurora Postgres is using some other mechanism to speed up vacuum and seqscan, but Neon is using the Postgres prefetch mechanism for it.
I have integrated your prefetch patch in Neon and it actually works! Moreover, I combined it with prefetch of leaf pages for IOS and it also seems to work. Just a small notice: you are reporting `blks_prefetch_rounds` in explain, but it is not incremented anywhere. Moreover, I do not precisely understand what it means and wonder if such information is useful for analyzing a query execution plan. Also, your patch always reports the number of prefetched blocks (and rounds) if they are not zero. I think that adding new information to explain may cause some problems, because there are a lot of different tools which parse the explain report to visualize it, make some recommendations to improve performance, ... Certainly good practice for such tools is to ignore all unknown tags. But I am not sure that everybody follows this practice. It seems safer, and at the same time more convenient for users, to add an extra option to explain to enable/disable prefetch info (as was done in Neon). Here we come back to my custom explain patch;) Actually using it is not necessary. You can manually add a "prefetch" option to Postgres core (as is currently done in Neon). Best regards, Konstantin
On 1/16/24 21:10, Konstantin Knizhnik wrote: > > ... > >> 4. I think that performing prefetch at executor level is really great >>> idea and so prefetch can be used by all indexes, including custom >>> indexes. But prefetch will be efficient only if index can provide fast >>> access to next TID (located at the same page). I am not sure that it is >>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for >>> custom AM. I wonder if we should extend AM API to make index make a >>> decision weather to perform prefetch of TIDs or not. >> I'm not against having a flag to enable/disable prefetching, but the >> question is whether doing prefetching for such indexes can be harmful. >> I'm not sure about that. > > I tend to agree with you - it is hard to imagine index implementation > which doesn't win from prefetching heap pages. > May be only the filtering case you have mentioned. But it seems to me > that current B-Tree index scan (not IOS) implementation in Postgres > doesn't try to use index tuple to check extra condition - it will fetch > heap tuple in any case. > That's true, but that's why I started working on this: https://commitfest.postgresql.org/46/4352/ I need to think about how to combine that with the prefetching. The good thing is that both changes require fetching TIDs, not slots. I think the condition can be simply added to the prefetch callback. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/17/24 09:45, Konstantin Knizhnik wrote: > I have integrated your prefetch patch in Neon and it actually works! > Moreover, I combined it with prefetch of leaf pages for IOS and it also > seems to work. > Cool! And do you think this is the right design/way to do this? > Just small notice: you are reporting `blks_prefetch_rounds` in explain, > but it is not incremented anywhere. > Moreover, I do not precisely understand what it mean and wonder if such > information is useful for analyzing query executing plan. > Also your patch always report number of prefetched blocks (and rounds) > if them are not zero. > Right, this needs fixing. > I think that adding new information to explain it may cause some > problems because there are a lot of different tools which parse explain > report to visualize it, > make some recommendations top improve performance, ... Certainly good > practice for such tools is to ignore all unknown tags. But I am not sure > that everybody follow this practice. > It seems to be more safe and at the same time convenient for users to > add extra tag to explain to enable/disable prefetch info (as it was done > in Neon). > I think we want to add this info to explain, but maybe it should be behind a new flag and disabled by default. > Here we come back to my custom explain patch;) Actually using it is not > necessary. You can manually add "prefetch" option to Postgres core (as > it is currently done in Neon). > Yeah, I think that's the right solution. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 18/01/2024 6:00 pm, Tomas Vondra wrote: > On 1/17/24 09:45, Konstantin Knizhnik wrote: >> I have integrated your prefetch patch in Neon and it actually works! >> Moreover, I combined it with prefetch of leaf pages for IOS and it also >> seems to work. >> > Cool! And do you think this is the right design/way to do this? I like the idea of prefetching TIDs in the executor. But looking through your patch I have some questions: 1. Why is it necessary to allocate and store the all_visible flag in the data buffer? Why can't the caller of IndexPrefetchNext look at the prefetch field? + /* store the all_visible flag in the private part of the entry */ + entry->data = palloc(sizeof(bool)); + *(bool *) entry->data = all_visible; 2. The names of the functions `IndexPrefetchNext` and `IndexOnlyPrefetchNext` are IMHO confusing, because they look similar and one can assume that the first is used for a normal index scan and the second for an index-only scan. But actually `IndexOnlyPrefetchNext` is a callback, and `IndexPrefetchNext` is used in both nodeIndexscan.c and nodeIndexonlyscan.c
On 1/19/24 09:34, Konstantin Knizhnik wrote: > > On 18/01/2024 6:00 pm, Tomas Vondra wrote: >> On 1/17/24 09:45, Konstantin Knizhnik wrote: >>> I have integrated your prefetch patch in Neon and it actually works! >>> Moreover, I combined it with prefetch of leaf pages for IOS and it also >>> seems to work. >>> >> Cool! And do you think this is the right design/way to do this? > > I like the idea of prefetching TIDs in executor. > > But looking though your patch I have some questions: > > > 1. Why it is necessary to allocate and store all_visible flag in data > buffer. Why caller of IndexPrefetchNext can not look at prefetch field? > > + /* store the all_visible flag in the private part of the entry */ > + entry->data = palloc(sizeof(bool)); > + *(bool *) entry->data = all_visible; > What you mean by "prefetch field"? The reason why it's done like this is to only do the VM check once - without keeping the value, we'd have to do it in the "next" callback, to determine if we need to prefetch the heap tuple, and then later in the index-only scan itself. That's a significant overhead, especially in the case when everything is visible. > 2. Names of the functions `IndexPrefetchNext` and > `IndexOnlyPrefetchNext` are IMHO confusing because they look similar and > one can assume that for one is used for normal index scan and last one - > for index only scan. But actually `IndexOnlyPrefetchNext` is callback > and `IndexPrefetchNext` is used in both nodeIndexscan.c and > nodeIndexonlyscan.c > Yeah, that's a good point. The naming probably needs rethinking. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
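The "check the VM only once" reasoning above can be sketched roughly as follows. This is a simplified, standalone illustration and not the patch's code: the PrefetchEntry struct, the function names, the fake visibility-map lookup and the use of malloc instead of palloc are all assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical queue entry: a heap block plus opaque private data. */
typedef struct PrefetchEntry
{
    unsigned    block;
    void       *data;       /* private part, here a cached bool */
} PrefetchEntry;

/* Counts visibility-map lookups, to show each entry is checked once. */
int vm_checks = 0;

/* Stand-in for a visibility-map lookup: even blocks are "all-visible". */
bool
fake_vm_all_visible(unsigned block)
{
    vm_checks++;
    return (block % 2) == 0;
}

/*
 * "next" callback side: do the VM check once and stash the result in the
 * entry. Returns true if the heap block needs to be prefetched.
 */
bool
entry_prepare(PrefetchEntry *entry, unsigned block)
{
    bool        all_visible = fake_vm_all_visible(block);

    entry->block = block;
    entry->data = malloc(sizeof(bool));
    assert(entry->data != NULL);
    *(bool *) entry->data = all_visible;

    return !all_visible;
}

/* Scan side: reuse the stored flag instead of consulting the VM again. */
bool
entry_needs_heap_fetch(const PrefetchEntry *entry)
{
    assert(entry->data != NULL);
    return !*(const bool *) entry->data;
}

/* Release the private data once the entry has been consumed. */
void
entry_release(PrefetchEntry *entry)
{
    free(entry->data);
    entry->data = NULL;
}
```

The design choice being illustrated is only that the prefetch decision and the later fetch decision share one lookup result; how the real patch lays out the entry and manages its memory may differ.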
On 1/16/24 21:10, Konstantin Knizhnik wrote:
...
4. I think that performing prefetch at executor level is really great idea and so prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if index can provide fast access to next TID (located at the same page). I am not sure that it is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AM. I wonder if we should extend AM API to make index make a decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that.

I tend to agree with you - it is hard to imagine index implementation which doesn't win from prefetching heap pages. Maybe only the filtering case you have mentioned. But it seems to me that current B-Tree index scan (not IOS) implementation in Postgres doesn't try to use index tuple to check extra condition - it will fetch heap tuple in any case.

That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good thing is that both changes require fetching TIDs, not slots. I think the condition can be simply added to the prefetch callback.

regards
Looks like I was wrong: even if it is not an index-only scan, when the index condition involves only index attributes, the heap is not accessed until we find a tuple satisfying the search condition.
The inclusive index case described above (https://commitfest.postgresql.org/46/4352/) is interesting, but IMHO an exotic case. If the keys are actually used in the search, then why not create a normal compound index instead?
On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/9/24 21:31, Robert Haas wrote: > > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> Here's a somewhat reworked version of the patch. My initial goal was to > >> see if it could adopt the StreamingRead API proposed in [1], but that > >> turned out to be less straight-forward than I hoped, for two reasons: > > > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > > > > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the > streaming I/O stuff. I've been studying your patch with the intent of finding a way to change it and or the streaming read API to work together. I've attached a very rough sketch of how I think it could work. We fill a queue with blocks from TIDs that we fetched from the index. The queue is saved in a scan descriptor that is made available to the streaming read callback. Once the queue is full, we invoke the table AM specific index_fetch_tuple() function which calls pg_streaming_read_buffer_get_next(). When the streaming read API invokes the callback we registered, it simply dequeues a block number for prefetching. The only change to the streaming read API is that now, even if the callback returns InvalidBlockNumber, we may not be finished, so make it resumable. Structurally, this changes the timing of when the heap blocks are prefetched. Your code would get a tid from the index and then prefetch the heap block -- doing this until it filled a queue that had the actual tids saved in it. With my approach and the streaming read API, you fetch tids from the index until you've filled up a queue of block numbers. Then the streaming read API will prefetch those heap blocks. I didn't actually implement the block queue -- I just saved a single block number and pretended it was a block queue. 
I was imagining we replace this with something like your IndexPrefetch->blockItems -- which has light deduplication. We'd probably have to flesh it out more than that. There are also table AM layering violations in my sketch which would have to be worked out (not to mention some resource leakage I didn't bother investigating [which causes it to fail tests]). 0001 is all of Thomas' streaming read API code that isn't yet in master and 0002 is my rough sketch of index prefetching using the streaming read API There are also numerous optimizations that your index prefetching patch set does that would need to be added in some way. I haven't thought much about it yet. I wanted to see what you thought of this approach first. Basically, is it workable? - Melanie
On 1/19/24 16:19, Konstantin Knizhnik wrote: > > On 18/01/2024 5:57 pm, Tomas Vondra wrote: >> On 1/16/24 21:10, Konstantin Knizhnik wrote: >>> ... >>> >>>> 4. I think that performing prefetch at executor level is really great >>>>> idea and so prefetch can be used by all indexes, including custom >>>>> indexes. But prefetch will be efficient only if index can provide fast >>>>> access to next TID (located at the same page). I am not sure that >>>>> it is >>>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for >>>>> custom AM. I wonder if we should extend AM API to make index make a >>>>> decision weather to perform prefetch of TIDs or not. >>>> I'm not against having a flag to enable/disable prefetching, but the >>>> question is whether doing prefetching for such indexes can be harmful. >>>> I'm not sure about that. >>> I tend to agree with you - it is hard to imagine index implementation >>> which doesn't win from prefetching heap pages. >>> May be only the filtering case you have mentioned. But it seems to me >>> that current B-Tree index scan (not IOS) implementation in Postgres >>> doesn't try to use index tuple to check extra condition - it will fetch >>> heap tuple in any case. >>> >> That's true, but that's why I started working on this: >> >> https://commitfest.postgresql.org/46/4352/ >> >> I need to think about how to combine that with the prefetching. The good >> thing is that both changes require fetching TIDs, not slots. I think the >> condition can be simply added to the prefetch callback. >> >> >> regards >> > Looks like I was not true, even if it is not index-only scan but index > condition involves only index attributes, then heap is not accessed > until we find tuple satisfying search condition. > Inclusive index case described above > (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO > exotic case. If keys are actually used in search, then why not to create > normal compound index instead? 
>
Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is that exotic. IMHO it's quite possible it's actually quite common, but the users do not realize the possible gains.

Also, there are reasons to not want very wide indexes - it has overhead associated with maintenance, disk space, etc. I think it's perfectly rational to design indexes in a way that eliminates most heap fetches necessary to evaluate conditions, but does not guarantee IOS (so the last heap fetch is still needed).

What do you mean by "create normal compound index"? The patch addresses a limitation that not every condition can be translated into a proper scan key. Even if we improve this, there will always be such conditions. The IOS can evaluate them on index tuple, the regular index scan can't do that (currently).

Can you share an example demonstrating the alternative approach?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Looks like I was not true, even if it is not index-only scan but index condition involves only index attributes, then heap is not accessed until we find tuple satisfying search condition. Inclusive index case described above (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO exotic case. If keys are actually used in search, then why not to create normal compound index instead?

Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is that exotic. IMHO it's quite possible it's actually quite common, but the users do not realize the possible gains.

Also, there are reasons to not want very wide indexes - it has overhead associated with maintenance, disk space, etc. I think it's perfectly rational to design indexes in a way that eliminates most heap fetches necessary to evaluate conditions, but does not guarantee IOS (so the last heap fetch is still needed).
We are comparing a compound index (a,b) and a covering (inclusive) index (a) include (b).
These indexes have exactly the same width and size, and almost the same maintenance overhead.
The first index has a more expensive comparison function (involving two columns), but I do not think that it can significantly affect
performance and maintenance cost. Also, if the selectivity of "a" is good enough, then there is no need to compare "b".
Why would we prefer a covering index to a compound index? I see only two good reasons:
1. The extra columns' type does not have the comparison function needed by the AM.
2. The extra columns are never used in query predicates.
If you are going to use these columns in query predicates, I do not see much sense in creating an inclusive index rather than a compound index.
Do you?
What do you mean by "create normal compound index"? The patch addresses a limitation that not every condition can be translated into a proper scan key. Even if we improve this, there will always be such conditions. The IOS can evaluate them on index tuple, the regular index scan can't do that (currently). Can you share an example demonstrating the alternative approach?
Maybe I missed something.
This is the example from https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me :
```
And here is the plan with index on (a,b).

Limit  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
  Output: a, b, d
  Buffers: shared hit=613
  ->  Index Scan using t_a_b_idx on public.t  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
        Output: a, b, d
        Index Cond: ((t.a > 1000000) AND (t.b = 4))
        Buffers: shared hit=613
Planning:
  Buffers: shared hit=41
Planning Time: 0.314 ms
Execution Time: 6.910 ms
```
Isn't it an optimal plan for this query?
And a quote from the self-contained reproducible example https://dbfiddle.uk/iehtq44L :
```
create unique index t_a_include_b on t(a) include (b);
-- I'd expecd index above to behave the same as index below for this query
--create unique index on t(a,b);
```
I agree that it is natural to expect the same result for both indexes. So this PR definitely makes sense.
My point is only that the compound index (a,b) is more natural and preferable in this case.
On 19/01/2024 2:35 pm, Tomas Vondra wrote:
>
> On 1/19/24 09:34, Konstantin Knizhnik wrote:
>> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>>> I have integrated your prefetch patch in Neon and it actually works!
>>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>>>> seems to work.
>>>>
>>> Cool! And do you think this is the right design/way to do this?
>> I like the idea of prefetching TIDs in executor.
>>
>> But looking though your patch I have some questions:
>>
>>
>> 1. Why it is necessary to allocate and store all_visible flag in data
>> buffer. Why caller of IndexPrefetchNext can not look at prefetch field?
>>
>> +    /* store the all_visible flag in the private part of the entry */
>> +    entry->data = palloc(sizeof(bool));
>> +    *(bool *) entry->data = all_visible;
>>
> What you mean by "prefetch field"?

I mean "prefetch" field of IndexPrefetchEntry:

+
+typedef struct IndexPrefetchEntry
+{
+    ItemPointerData tid;
+
+    /* should we prefetch heap page for this TID? */
+    bool prefetch;
+

You store the same flag twice:

+    /* prefetch only if not all visible */
+    entry->prefetch = !all_visible;
+
+    /* store the all_visible flag in the private part of the entry */
+    entry->data = palloc(sizeof(bool));
+    *(bool *) entry->data = all_visible;

My question was: why do we need to allocate something in entry->data and store all_visible in it, while we already stored !all_visible in entry->prefetch.
On 1/21/24 20:50, Konstantin Knizhnik wrote: > > On 20/01/2024 12:14 am, Tomas Vondra wrote: >> Looks like I was not true, even if it is not index-only scan but index >>> condition involves only index attributes, then heap is not accessed >>> until we find tuple satisfying search condition. >>> Inclusive index case described above >>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO >>> exotic case. If keys are actually used in search, then why not to create >>> normal compound index instead? >>> >> Not sure I follow ... >> >> Firstly, I'm not convinced the example addressed by that other patch is >> that exotic. IMHO it's quite possible it's actually quite common, but >> the users do no realize the possible gains. >> >> Also, there are reasons to not want very wide indexes - it has overhead >> associated with maintenance, disk space, etc. I think it's perfectly >> rational to design indexes in a way eliminates most heap fetches >> necessary to evaluate conditions, but does not guarantee IOS (so the >> last heap fetch is still needed). > > We are comparing compound index (a,b) and covering (inclusive) index (a) > include (b) > This indexes have exactly the same width and size and almost the same > maintenance overhead. > > First index has more expensive comparison function (involving two > columns) but I do not think that it can significantly affect > performance and maintenance cost. Also if selectivity of "a" is good > enough, then there is no need to compare "b" > > Why we can prefer covering index to compound index? I see only two good > reasons: > 1. Extra columns type do not have comparison function need for AM. > 2. The extra columns are never used in query predicate. > Or maybe you don't want to include the columns in a UNIQUE constraint? > If you are going to use this columns in query predicates I do not see > much sense in creating inclusive index rather than compound index. > Do you? 
>
But this is also about conditions that can't be translated into index scan keys. Consider this:

create table t (a int, b int, c int);
insert into t select 1000 * random(), 1000 * random(), 1000 * random()
  from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select * from t where a = 10 and mod(b,10) = 1111111;

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12) (actual time=4.562..4.564 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: (mod(b, 10) = 1111111)
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.304 ms
 Execution Time: 5.146 ms
(8 rows)

Notice that this still fetched ~1000 buffers in order to evaluate the filter on "b", because it's complex and can't be transformed into a nice scan key.

Or this:

explain (analyze, buffers) select a from t where a = 10 and (b+1) < 100 and c < 0;

                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3673.22 rows=1 width=4) (actual time=4.446..4.448 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: ((c < 0) AND ((b + 1) < 100))
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.313 ms
 Execution Time: 4.878 ms
(8 rows)

where it's "broken" by the extra unindexed column.

FWIW these are the primary cases I had in mind for this patch.

>
>> What do you mean by "create normal compound index"? The patch addresses
>> a limitation that not every condition can be translated into a proper
>> scan key. Even if we improve this, there will always be such conditions.
>> The IOS can evaluate them on index tuple, the regular index scan
>> can't do that (currently).
>>
>> Can you share an example demonstrating the alternative approach?
> > May be I missed something. > > This is the example from > https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me : > > ``` > > And here is the plan with index on (a,b). > > Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 > rows=0 loops=1) Output: a, b, d Buffers: shared hit=613 -> > Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 > width=12) (actual time=6.880..6.881 rows=0 loops=1) Output: a, > b, d Index Cond: ((t.a > 1000000) AND (t.b = 4)) > Buffers: shared hit=613 Planning: Buffers: shared hit=41 Planning > Time: 0.314 ms Execution Time: 6.910 ms ``` > > > Isn't it an optimal plan for this query? > > And cite from self reproducible example https://dbfiddle.uk/iehtq44L : > ``` > create unique index t_a_include_b on t(a) include (b); > -- I'd expecd index above to behave the same as index below for this query > --create unique index on t(a,b); > ``` > > I agree that it is natural to expect the same result for both indexes. > So this PR definitely makes sense. > My point is only that compound index (a,b) in this case is more natural > and preferable. > Yes, perhaps. But you may also see it from the other direction - if you already have an index with included columns (for whatever reason), it would be nice to leverage that if possible. And as I mentioned above, it's not always the case that move a column from "included" to a proper key, or stuff like that. Anyway, it seems entirely unrelated to this prefetching thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/21/24 20:56, Konstantin Knizhnik wrote: > > On 19/01/2024 2:35 pm, Tomas Vondra wrote: >> >> On 1/19/24 09:34, Konstantin Knizhnik wrote: >>> On 18/01/2024 6:00 pm, Tomas Vondra wrote: >>>> On 1/17/24 09:45, Konstantin Knizhnik wrote: >>>>> I have integrated your prefetch patch in Neon and it actually works! >>>>> Moreover, I combined it with prefetch of leaf pages for IOS and it >>>>> also >>>>> seems to work. >>>>> >>>> Cool! And do you think this is the right design/way to do this? >>> I like the idea of prefetching TIDs in executor. >>> >>> But looking though your patch I have some questions: >>> >>> >>> 1. Why it is necessary to allocate and store all_visible flag in data >>> buffer. Why caller of IndexPrefetchNext can not look at prefetch field? >>> >>> + /* store the all_visible flag in the private part of the >>> entry */ >>> + entry->data = palloc(sizeof(bool)); >>> + *(bool *) entry->data = all_visible; >>> >> What you mean by "prefetch field"? > > > I mean "prefetch" field of IndexPrefetchEntry: > > + > +typedef struct IndexPrefetchEntry > +{ > + ItemPointerData tid; > + > + /* should we prefetch heap page for this TID? */ > + bool prefetch; > + > > You store the same flag twice: > > + /* prefetch only if not all visible */ > + entry->prefetch = !all_visible; > + > + /* store the all_visible flag in the private part of the entry */ > + entry->data = palloc(sizeof(bool)); > + *(bool *) entry->data = all_visible; > > My question was: why do we need to allocate something in entry->data and > store all_visible in it, while we already stored !all-visible in > entry->prefetch. > Ah, right. Well, you're right in this case we perhaps could set just one of those flags, but the "purpose" of the two places is quite different. The "prefetch" flag is fully controlled by the prefetcher, and it's up to it to change it (e.g. I can easily imagine some new logic touching setting it to "false" for some reason). 
The "data" flag is fully controlled by the custom callbacks, so whatever the callback stores, will be there. I don't think it's worth simplifying this. In particular, I don't think the callback can assume it can rely on the "prefetch" flag. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems like there were CFbot test failures last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/4351/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4351 Kind Regards, Peter Smith.
Ah, right. Well, you're right in this case we perhaps could set just one of those flags, but the "purpose" of the two places is quite different. The "prefetch" flag is fully controlled by the prefetcher, and it's up to it to change it (e.g. I can easily imagine some new logic touching setting it to "false" for some reason). The "data" flag is fully controlled by the custom callbacks, so whatever the callback stores, will be there.

I don't think it's worth simplifying this. In particular, I don't think the callback can assume it can rely on the "prefetch" flag.
Why not add an "all_visible" flag to IndexPrefetchEntry? It will not cause any extra space overhead (because of alignment), but it avoids a dynamic memory allocation (not sure if that is critical, but it is nice to avoid if possible).
Why we can prefer covering index to compound index? I see only two good reasons:
1. Extra columns type do not have comparison function need for AM.
2. The extra columns are never used in query predicate.

Or maybe you don't want to include the columns in a UNIQUE constraint?
Do you mean that a compound index (a,b) cannot be used to enforce uniqueness of "a"?
If so, I agree.
If you are going to use this columns in query predicates I do not see much sense in creating inclusive index rather than compound index. Do you?

But this is also about conditions that can't be translated into index scan keys. Consider this:

create table t (a int, b int, c int);
insert into t select 1000 * random(), 1000 * random(), 1000 * random()
  from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select * from t where a = 10 and mod(b,10) = 1111111;

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12) (actual time=4.562..4.564 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: (mod(b, 10) = 1111111)
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.304 ms
 Execution Time: 5.146 ms
(8 rows)

Notice that this still fetched ~1000 buffers in order to evaluate the filter on "b", because it's complex and can't be transformed into a nice scan key.
Oh yes.
It looks like I didn't understand the logic of when a predicate is included in the index condition and when it is not.
It seems natural that only a predicate which specifies some range can be included in the index condition.
But that is not the case:

postgres=# explain select * from t where a = 10 and b in (10,20,30);
                             QUERY PLAN
---------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
   Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
(2 rows)

So I thought ANY predicate using index keys is included in the index condition. But that is not true (as your example shows).
But IMHO mod(b,10)=1111111 and (b+1) < 100 are both quite rare predicates, which is why I called these use cases "exotic".
In any case, if we have some columns in the index tuple, it is desirable to use them for filtering before fetching the heap tuple.
But I am afraid it will not be so easy to implement...
On 1/19/24 22:43, Melanie Plageman wrote: > On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 1/9/24 21:31, Robert Haas wrote: >>> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> Here's a somewhat reworked version of the patch. My initial goal was to >>>> see if it could adopt the StreamingRead API proposed in [1], but that >>>> turned out to be less straight-forward than I hoped, for two reasons: >>> >>> I guess we need Thomas or Andres or maybe Melanie to comment on this. >>> >> >> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the >> streaming I/O stuff. > > I've been studying your patch with the intent of finding a way to > change it and or the streaming read API to work together. I've > attached a very rough sketch of how I think it could work. > Thanks. > We fill a queue with blocks from TIDs that we fetched from the index. > The queue is saved in a scan descriptor that is made available to the > streaming read callback. Once the queue is full, we invoke the table > AM specific index_fetch_tuple() function which calls > pg_streaming_read_buffer_get_next(). When the streaming read API > invokes the callback we registered, it simply dequeues a block number > for prefetching. So in a way there are two queues in IndexFetchTableData. One (blk_queue) is being filled from IndexNext, and then the queue in StreamingRead. > The only change to the streaming read API is that now, even if the > callback returns InvalidBlockNumber, we may not be finished, so make > it resumable. > Hmm, not sure when can the callback return InvalidBlockNumber before reaching the end. Perhaps for the first index_fetch_heap call? Any reason not to fill the blk_queue before calling index_fetch_heap? > Structurally, this changes the timing of when the heap blocks are > prefetched. 
Your code would get a tid from the index and then prefetch > the heap block -- doing this until it filled a queue that had the > actual tids saved in it. With my approach and the streaming read API, > you fetch tids from the index until you've filled up a queue of block > numbers. Then the streaming read API will prefetch those heap blocks. > And is that a good/desirable change? I'm not saying it's not, but maybe we should not be filling either queue in one go - we don't want to overload the prefetching. > I didn't actually implement the block queue -- I just saved a single > block number and pretended it was a block queue. I was imagining we > replace this with something like your IndexPrefetch->blockItems -- > which has light deduplication. We'd probably have to flesh it out more > than that. > I don't understand how this passes the TID to the index_fetch_heap. Isn't it working only by accident, due to blk_queue only having a single entry? Shouldn't the first queue (blk_queue) store TIDs instead? > There are also table AM layering violations in my sketch which would > have to be worked out (not to mention some resource leakage I didn't > bother investigating [which causes it to fail tests]). > > 0001 is all of Thomas' streaming read API code that isn't yet in > master and 0002 is my rough sketch of index prefetching using the > streaming read API > > There are also numerous optimizations that your index prefetching > patch set does that would need to be added in some way. I haven't > thought much about it yet. I wanted to see what you thought of this > approach first. Basically, is it workable? > It seems workable, yes. I'm not sure it's much simpler than my patch (considering a lot of the code is in the optimizations, which are missing from this patch). I think the question is where should the optimizations happen. 
I suppose some of them might/should happen in the StreamingRead API itself - like the detection of sequential patterns, recently prefetched blocks, ... But I'm not sure what to do about optimizations that are more specific to the access path. Consider for example the index-only scans. We don't want to prefetch all the pages, we need to inspect the VM and prefetch just the not-all-visible ones. And then pass the info to the index scan, so that it does not need to check the VM again. It's not clear to me how to do this with this approach. The main -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/19/24 22:43, Melanie Plageman wrote: > > > We fill a queue with blocks from TIDs that we fetched from the index. > > The queue is saved in a scan descriptor that is made available to the > > streaming read callback. Once the queue is full, we invoke the table > > AM specific index_fetch_tuple() function which calls > > pg_streaming_read_buffer_get_next(). When the streaming read API > > invokes the callback we registered, it simply dequeues a block number > > for prefetching. > > So in a way there are two queues in IndexFetchTableData. One (blk_queue) > is being filled from IndexNext, and then the queue in StreamingRead. I've changed the name from blk_queue to tid_queue to fix the issue you mention in your later remarks. I suppose there are two queues. The tid_queue is just to pass the block requests to the streaming read API. The prefetch distance will be the smaller of the two sizes. > > The only change to the streaming read API is that now, even if the > > callback returns InvalidBlockNumber, we may not be finished, so make > > it resumable. > > Hmm, not sure when can the callback return InvalidBlockNumber before > reaching the end. Perhaps for the first index_fetch_heap call? Any > reason not to fill the blk_queue before calling index_fetch_heap? The callback will return InvalidBlockNumber whenever the queue is empty. Let's say your queue size is 5 and your effective prefetch distance is 10 (some combination of the PgStreamingReadRange sizes and PgStreamingRead->max_ios). The first time you call index_fetch_heap(), the callback returns InvalidBlockNumber. Then the tid_queue is filled with 5 tids. Then index_fetch_heap() is called. pg_streaming_read_look_ahead() will prefetch all 5 of these TID's blocks, emptying the queue. Once all 5 have been dequeued, the callback will return InvalidBlockNumber. 
pg_streaming_read_buffer_get_next() will return one of the 5 blocks in a buffer and save the associated TID in the per_buffer_data. Before index_fetch_heap() is called again, we will see that the queue is not full and fill it up again with 5 TIDs. So, the callback will return InvalidBlockNumber 3 times in this scenario. > > Structurally, this changes the timing of when the heap blocks are > > prefetched. Your code would get a tid from the index and then prefetch > > the heap block -- doing this until it filled a queue that had the > > actual tids saved in it. With my approach and the streaming read API, > > you fetch tids from the index until you've filled up a queue of block > > numbers. Then the streaming read API will prefetch those heap blocks. > > And is that a good/desirable change? I'm not saying it's not, but maybe > we should not be filling either queue in one go - we don't want to > overload the prefetching. We can focus on the prefetch distance algorithm maintained in the streaming read API and then make sure that the tid_queue is larger than the desired prefetch distance maintained by the streaming read API. > > I didn't actually implement the block queue -- I just saved a single > > block number and pretended it was a block queue. I was imagining we > > replace this with something like your IndexPrefetch->blockItems -- > > which has light deduplication. We'd probably have to flesh it out more > > than that. > > I don't understand how this passes the TID to the index_fetch_heap. > Isn't it working only by accident, due to blk_queue only having a single > entry? Shouldn't the first queue (blk_queue) store TIDs instead? Oh dear! Fixed in the attached v2. I've replaced the single BlockNumber with a single ItemPointerData. I will work on implementing an actual queue next week. 
> > There are also table AM layering violations in my sketch which would > > have to be worked out (not to mention some resource leakage I didn't > > bother investigating [which causes it to fail tests]). > > > > 0001 is all of Thomas' streaming read API code that isn't yet in > > master and 0002 is my rough sketch of index prefetching using the > > streaming read API > > > > There are also numerous optimizations that your index prefetching > > patch set does that would need to be added in some way. I haven't > > thought much about it yet. I wanted to see what you thought of this > > approach first. Basically, is it workable? > > It seems workable, yes. I'm not sure it's much simpler than my patch > (considering a lot of the code is in the optimizations, which are > missing from this patch). > > I think the question is where should the optimizations happen. I suppose > some of them might/should happen in the StreamingRead API itself - like > the detection of sequential patterns, recently prefetched blocks, ... So, the streaming read API does detection of sequential patterns and not prefetching things that are in shared buffers. It doesn't handle avoiding prefetching recently prefetched blocks yet AFAIK. But I daresay this would be relevant for other streaming read users and could certainly be implemented there. > But I'm not sure what to do about optimizations that are more specific > to the access path. Consider for example the index-only scans. We don't > want to prefetch all the pages, we need to inspect the VM and prefetch > just the not-all-visible ones. And then pass the info to the index scan, > so that it does not need to check the VM again. It's not clear to me how > to do this with this approach. Yea, this is an issue I'll need to think about. To really spell out the problem: the callback dequeues a TID from the tid_queue and looks up its block in the VM. It's all visible. 
So, it shouldn't return that block to the streaming read API to fetch from the heap because it doesn't need to be read. But, where does the callback put the TID so that the caller can get it? I'm going to think more about this. As for passing around the all visible status so as to not reread the VM block -- that feels solvable but I haven't looked into it. - Melanie
On 1/24/24 01:51, Melanie Plageman wrote: > On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 1/19/24 22:43, Melanie Plageman wrote: >> >>> We fill a queue with blocks from TIDs that we fetched from the index. >>> The queue is saved in a scan descriptor that is made available to the >>> streaming read callback. Once the queue is full, we invoke the table >>> AM specific index_fetch_tuple() function which calls >>> pg_streaming_read_buffer_get_next(). When the streaming read API >>> invokes the callback we registered, it simply dequeues a block number >>> for prefetching. >> >> So in a way there are two queues in IndexFetchTableData. One (blk_queue) >> is being filled from IndexNext, and then the queue in StreamingRead. > > I've changed the name from blk_queue to tid_queue to fix the issue you > mention in your later remarks. > I suppose there are two queues. The tid_queue is just to pass the > block requests to the streaming read API. The prefetch distance will > be the smaller of the two sizes. > FWIW I think the two queues are a nice / elegant approach. In hindsight my problems with trying to utilize the StreamingRead were due to trying to use the block-oriented API directly from places that work with TIDs, and this just makes that go away. I wonder what the overhead of shuffling stuff between queues will be, but hopefully not too high (that's my assumption). >>> The only change to the streaming read API is that now, even if the >>> callback returns InvalidBlockNumber, we may not be finished, so make >>> it resumable. >> >> Hmm, not sure when can the callback return InvalidBlockNumber before >> reaching the end. Perhaps for the first index_fetch_heap call? Any >> reason not to fill the blk_queue before calling index_fetch_heap? > > The callback will return InvalidBlockNumber whenever the queue is > empty. 
Let's say your queue size is 5 and your effective prefetch > distance is 10 (some combination of the PgStreamingReadRange sizes and > PgStreamingRead->max_ios). The first time you call index_fetch_heap(), > the callback returns InvalidBlockNumber. Then the tid_queue is filled > with 5 tids. Then index_fetch_heap() is called. > pg_streaming_read_look_ahead() will prefetch all 5 of these TID's > blocks, emptying the queue. Once all 5 have been dequeued, the > callback will return InvalidBlockNumber. > pg_streaming_read_buffer_get_next() will return one of the 5 blocks in > a buffer and save the associated TID in the per_buffer_data. Before > index_fetch_heap() is called again, we will see that the queue is not > full and fill it up again with 5 TIDs. So, the callback will return > InvalidBlockNumber 3 times in this scenario. > Thanks for the explanation. Yes, I didn't realize that the queues may be of different length, at which point it makes sense to return invalid block to signal the TID queue is empty. >>> Structurally, this changes the timing of when the heap blocks are >>> prefetched. Your code would get a tid from the index and then prefetch >>> the heap block -- doing this until it filled a queue that had the >>> actual tids saved in it. With my approach and the streaming read API, >>> you fetch tids from the index until you've filled up a queue of block >>> numbers. Then the streaming read API will prefetch those heap blocks. >> >> And is that a good/desirable change? I'm not saying it's not, but maybe >> we should not be filling either queue in one go - we don't want to >> overload the prefetching. > > We can focus on the prefetch distance algorithm maintained in the > streaming read API and then make sure that the tid_queue is larger > than the desired prefetch distance maintained by the streaming read > API. > Agreed. I think I wasn't quite right when concerned about "overloading" the prefetch, because that depends entirely on the StreamingRead API queue. 
A large TID queue can't cause overload of anything. What could happen is a TID queue being too small, so the prefetch can't hit the target distance. But that can happen already, e.g. with indexes that are correlated and/or index-only scans with all-visible pages. >>> There are also table AM layering violations in my sketch which would >>> have to be worked out (not to mention some resource leakage I didn't >>> bother investigating [which causes it to fail tests]). >>> >>> 0001 is all of Thomas' streaming read API code that isn't yet in >>> master and 0002 is my rough sketch of index prefetching using the >>> streaming read API >>> >>> There are also numerous optimizations that your index prefetching >>> patch set does that would need to be added in some way. I haven't >>> thought much about it yet. I wanted to see what you thought of this >>> approach first. Basically, is it workable? >> >> It seems workable, yes. I'm not sure it's much simpler than my patch >> (considering a lot of the code is in the optimizations, which are >> missing from this patch). >> >> I think the question is where should the optimizations happen. I suppose >> some of them might/should happen in the StreamingRead API itself - like >> the detection of sequential patterns, recently prefetched blocks, ... > > So, the streaming read API does detection of sequential patterns and > not prefetching things that are in shared buffers. It doesn't handle > avoiding prefetching recently prefetched blocks yet AFAIK. But I > daresay this would be relevant for other streaming read users and > could certainly be implemented there. > Yes, the "recently prefetched stuff" cache seems like a fairly natural complement to the pattern detection and shared-buffers check. FWIW I wonder if we should make some of this customizable, so that systems with customized storage (e.g. neon or with direct I/O) can e.g. disable some of these checks. Or replace them with their own version.
>> But I'm not sure what to do about optimizations that are more specific >> to the access path. Consider for example the index-only scans. We don't >> want to prefetch all the pages, we need to inspect the VM and prefetch >> just the not-all-visible ones. And then pass the info to the index scan, >> so that it does not need to check the VM again. It's not clear to me how >> to do this with this approach. > > Yea, this is an issue I'll need to think about. To really spell out > the problem: the callback dequeues a TID from the tid_queue and looks > up its block in the VM. It's all visible. So, it shouldn't return that > block to the streaming read API to fetch from the heap because it > doesn't need to be read. But, where does the callback put the TID so > that the caller can get it? I'm going to think more about this. > Yes, that's the problem for index-only scans. I'd generalize it so that it's about the callback being able to (a) decide if it needs to read the heap page, and (b) store some custom info for the TID. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
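To make the "queue size 5" walkthrough quoted above a bit more concrete, here is a minimal standalone sketch of a TID queue feeding a block-number callback. It is not code from either patch: TidQueue, DemoTid, tid_queue_push() and tid_queue_next_block() are made-up names, and BlockNumber / InvalidBlockNumber are redefined locally only so the snippet compiles without the PostgreSQL headers. The one point it illustrates is that the callback returns InvalidBlockNumber whenever the queue happens to be empty, which does not by itself mean the scan is finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL types, so the sketch is standalone. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Simplified TID: heap block plus line pointer offset (illustrative only). */
typedef struct DemoTid
{
    BlockNumber block;
    uint16_t    offset;
} DemoTid;

#define TID_QUEUE_SIZE 5

/* Hypothetical fixed-size ring buffer of TIDs read from the index. */
typedef struct TidQueue
{
    DemoTid items[TID_QUEUE_SIZE];
    int     head;   /* next slot to dequeue */
    int     count;  /* number of TIDs currently queued */
} TidQueue;

/* Add a TID; returns false (and changes nothing) when the queue is full. */
static bool
tid_queue_push(TidQueue *q, DemoTid tid)
{
    if (q->count == TID_QUEUE_SIZE)
        return false;
    q->items[(q->head + q->count) % TID_QUEUE_SIZE] = tid;
    q->count++;
    return true;
}

/*
 * Block-number callback in the spirit of the one described in the thread:
 * dequeue one TID, remember it in the caller-provided per-buffer slot, and
 * return its heap block. An empty queue yields InvalidBlockNumber, which
 * here only means "nothing queued right now".
 */
static BlockNumber
tid_queue_next_block(TidQueue *q, DemoTid *per_buffer_data)
{
    DemoTid tid;

    assert(per_buffer_data != NULL);
    if (q->count == 0)
        return InvalidBlockNumber;

    tid = q->items[q->head];
    q->head = (q->head + 1) % TID_QUEUE_SIZE;
    q->count--;
    *per_buffer_data = tid;
    return tid.block;
}
```

With a queue of 5 and a larger look-ahead distance, the consumer drains all 5 entries and then sees InvalidBlockNumber, so the caller has to refill the queue and resume, which is why the callback needs to be resumable.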
On 1/22/24 08:21, Konstantin Knizhnik wrote: > > On 22/01/2024 1:39 am, Tomas Vondra wrote: >>> Why we can prefer covering index to compound index? I see only two good >>> reasons: >>> 1. Extra columns type do not have comparison function need for AM. >>> 2. The extra columns are never used in query predicate. >>> >> Or maybe you don't want to include the columns in a UNIQUE constraint? >> > Do you mean that compound index (a,b) can not be used to enforce > uniqueness of "a"? > If so, I agree. > Yes. >>> If you are going to use this columns in query predicates I do not see >>> much sense in creating inclusive index rather than compound index. >>> Do you? >>> >> But this is also about conditions that can't be translated into index >> scan keys. Consider this: >> >> create table t (a int, b int, c int); >> insert into t select 1000 * random(), 1000 * random(), 1000 * random() >> from generate_series(1,1000000) s(i); >> create index on t (a,b); >> vacuum analyze t; >> >> explain (analyze, buffers) select * from t where a = 10 and mod(b,10) = >> 1111111; >> QUERY PLAN >> >> ----------------------------------------------------------------------------------------------------------------- >> Index Scan using t_a_b_idx on t (cost=0.42..3670.74 rows=5 width=12) >> (actual time=4.562..4.564 rows=0 loops=1) >> Index Cond: (a = 10) >> Filter: (mod(b, 10) = 1111111) >> Rows Removed by Filter: 974 >> Buffers: shared hit=980 >> Prefetches: blocks=901 >> Planning Time: 0.304 ms >> Execution Time: 5.146 ms >> (8 rows) >> >> Notice that this still fetched ~1000 buffers in order to evaluate the >> filter on "b", because it's complex and can't be transformed into a nice >> scan key. > > O yes. > Looks like I didn't understand the logic when predicate is included in > index condition and when not. > It seems to be natural that only such predicate which specifies some > range can be included in index condition. 
> But it is not the case: > > postgres=# explain select * from t where a = 10 and b in (10,20,30); > QUERY PLAN > --------------------------------------------------------------------- > Index Scan using t_a_b_idx on t (cost=0.42..25.33 rows=3 width=12) > Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[]))) > (2 rows) > > So I though ANY predicate using index keys is included in index condition. > But it is not true (as your example shows). > > But IMHO mod(b,10)=111111 or (b+1) < 100 are both quite rare predicates > this is why I named this use cases "exotic". Not sure I agree with describing this as "exotic". The same thing applies to an arbitrary function call. And those are pretty common in conditions - date_part/date_trunc. Arithmetic expressions are not that uncommon either. Also, users sometimes have conditions comparing multiple keys (a<b) etc. But even if it was "uncommon", the whole point of this patch is to eliminate these corner cases where a user does something minor (like adding an output column), and the executor disables an optimization unnecessarily, causing unexpected regressions. > > In any case, if we have some columns in index tuple it is desired to use > them for filtering before extracting heap tuple. > But I afraid it will be not so easy to implement... > I'm not sure what you mean. The patch does that, more or less. There's issues that need to be solved (e.g. to decide when not to do this), and how to integrate that into the scan interface (where the quals are evaluated at the end). What do you mean when you say "will not be easy to implement"? What problems do you foresee? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/22/24 07:35, Konstantin Knizhnik wrote: > > On 22/01/2024 1:47 am, Tomas Vondra wrote: >> h, right. Well, you're right in this case we perhaps could set just one >> of those flags, but the "purpose" of the two places is quite different. >> >> The "prefetch" flag is fully controlled by the prefetcher, and it's up >> to it to change it (e.g. I can easily imagine some new logic touching >> setting it to "false" for some reason). >> >> The "data" flag is fully controlled by the custom callbacks, so whatever >> the callback stores, will be there. >> >> I don't think it's worth simplifying this. In particular, I don't think >> the callback can assume it can rely on the "prefetch" flag. >> > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > cause any extra space overhead (because of alignment), but allows to > avoid dynamic memory allocation (not sure if it is critical, but nice to > avoid if possible). > Because it's specific to index-only scans, while IndexPrefetchEntry is a generic thing, for all places. However: (1) Melanie actually presented a very different way to implement this, relying on the StreamingRead API. So chances are this struct won't actually be used. (2) After going through Melanie's patch, I realized this is actually broken. The IOS case needs to keep more stuff, not just the all-visible flag, but also the index tuple. Otherwise it'll just operate on the last tuple read from the index, which happens to be in xs_ituple. Attached is a patch with a trivial fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/24/24 01:51, Melanie Plageman wrote: > > >>> There are also table AM layering violations in my sketch which would > >>> have to be worked out (not to mention some resource leakage I didn't > >>> bother investigating [which causes it to fail tests]). > >>> > >>> 0001 is all of Thomas' streaming read API code that isn't yet in > >>> master and 0002 is my rough sketch of index prefetching using the > >>> streaming read API > >>> > >>> There are also numerous optimizations that your index prefetching > >>> patch set does that would need to be added in some way. I haven't > >>> thought much about it yet. I wanted to see what you thought of this > >>> approach first. Basically, is it workable? > >> > >> It seems workable, yes. I'm not sure it's much simpler than my patch > >> (considering a lot of the code is in the optimizations, which are > >> missing from this patch). > >> > >> I think the question is where should the optimizations happen. I suppose > >> some of them might/should happen in the StreamingRead API itself - like > >> the detection of sequential patterns, recently prefetched blocks, ... > > > > So, the streaming read API does detection of sequential patterns and > > not prefetching things that are in shared buffers. It doesn't handle > > avoiding prefetching recently prefetched blocks yet AFAIK. But I > > daresay this would be relevant for other streaming read users and > > could certainly be implemented there. > > > > Yes, the "recently prefetched stuff" cache seems like a fairly natural > complement to the pattern detection and shared-buffers check. > > FWIW I wonder if we should make some of this customizable, so that > systems with customized storage (e.g. neon or with direct I/O) can e.g. > disable some of these checks. Or replace them with their version. That's a promising idea. 
> >> But I'm not sure what to do about optimizations that are more specific > >> to the access path. Consider for example the index-only scans. We don't > >> want to prefetch all the pages, we need to inspect the VM and prefetch > >> just the not-all-visible ones. And then pass the info to the index scan, > >> so that it does not need to check the VM again. It's not clear to me how > >> to do this with this approach. > > > > Yea, this is an issue I'll need to think about. To really spell out > > the problem: the callback dequeues a TID from the tid_queue and looks > > up its block in the VM. It's all visible. So, it shouldn't return that > > block to the streaming read API to fetch from the heap because it > > doesn't need to be read. But, where does the callback put the TID so > > that the caller can get it? I'm going to think more about this. > > > > Yes, that's the problem for index-only scans. I'd generalize it so that > it's about the callback being able to (a) decide if it needs to read the > heap page, and (b) store some custom info for the TID. Actually, I think this is no big deal. See attached. I just don't enqueue tids whose blocks are all visible. I had to switch the order from fetch heap then fill queue to fill queue then fetch heap. While doing this I noticed some wrong results in the regression tests (like in the alter table test), so I suspect I have some kind of control flow issue. Perhaps I should fix the resource leak so I can actually see the failing tests :) As for your a) and b) above. Regarding a): We discussed allowing speculative prefetching and separating the logic for prefetching from actually reading blocks (so you can prefetch blocks you ultimately don't read). We decided this may not belong in a streaming read API. What do you think? 
Regarding b): We can store per buffer data for anything that actually goes down through the streaming read API, but, in the index only case, we don't want the streaming read API to know about blocks that it doesn't actually need to read. - Melanie
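As a rough illustration of "just don't enqueue tids whose blocks are all visible", the filling step could look something like the sketch below. This is a simplification under stated assumptions, not the attached patch: the visibility map is replaced by a plain bool array indexed by block number, and DemoTid and fill_queue_skipping_all_visible() are hypothetical names. How the skipped (all-visible) TIDs get handed back to the index-only scan is deliberately left out, since that part is still open above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* local stand-in for the PostgreSQL type */

/* Simplified TID: heap block plus line pointer offset (illustrative only). */
typedef struct DemoTid
{
    BlockNumber block;
    uint16_t    offset;
} DemoTid;

/*
 * Walk TIDs coming from the index and copy only those that need a heap
 * visit into "queue", stopping once "capacity" entries are queued.
 *
 * all_visible[] is a stand-in for a visibility-map lookup (assumption:
 * one flag per heap block). Returns how many input TIDs were consumed;
 * *nqueued receives how many were actually enqueued.
 */
static int
fill_queue_skipping_all_visible(const DemoTid *tids, int ntids,
                                const bool *all_visible,
                                DemoTid *queue, int capacity,
                                int *nqueued)
{
    int consumed = 0;

    *nqueued = 0;
    while (consumed < ntids && *nqueued < capacity)
    {
        DemoTid tid = tids[consumed++];

        /* All-visible block: no heap read needed, so never enqueue it. */
        if (all_visible[tid.block])
            continue;

        queue[(*nqueued)++] = tid;
    }
    return consumed;
}
```

Note that the number of TIDs consumed from the index can be much larger than the number enqueued, so on a mostly all-visible table the streaming read side may see very few blocks.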
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/22/24 07:35, Konstantin Knizhnik wrote: > > > > On 22/01/2024 1:47 am, Tomas Vondra wrote: > >> h, right. Well, you're right in this case we perhaps could set just one > >> of those flags, but the "purpose" of the two places is quite different. > >> > >> The "prefetch" flag is fully controlled by the prefetcher, and it's up > >> to it to change it (e.g. I can easily imagine some new logic touching > >> setting it to "false" for some reason). > >> > >> The "data" flag is fully controlled by the custom callbacks, so whatever > >> the callback stores, will be there. > >> > >> I don't think it's worth simplifying this. In particular, I don't think > >> the callback can assume it can rely on the "prefetch" flag. > >> > > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > > cause any extra space overhead (because of alignment), but allows to > > avoid dynamic memory allocation (not sure if it is critical, but nice to > > avoid if possible). > > > While reading through the first patch I got some questions, I haven't read it complete yet but this is what I got so far. 1. +static bool +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) +{ + int idx; ... + if (prefetch->blockItems[idx] != (block - i)) + return false; + + /* Don't prefetch if the block happens to be the same. */ + if (prefetch->blockItems[idx] == block) + return false; + } + + /* not sequential, not recently prefetched */ + return true; +} The above function name is BlockIsSequential but at the end, it returns true if it is not sequential, seem like a problem? Also other 2 checks right above the end of the function are returning false if the block is the same or the pattern is sequential I think those are wrong too. 2. 
I have noticed that the prefetch history is maintained at the backend level, but what if multiple backends are trying to fetch the same heap blocks maybe scanning the same index, so should that be in some shared structure? I haven't thought much deeper about this from the implementation POV, but should we think about it, or it doesn't matter? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 1/25/24 11:45, Dilip Kumar wrote: > On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> On 1/22/24 07:35, Konstantin Knizhnik wrote: >>> >>> On 22/01/2024 1:47 am, Tomas Vondra wrote: >>>> h, right. Well, you're right in this case we perhaps could set just one >>>> of those flags, but the "purpose" of the two places is quite different. >>>> >>>> The "prefetch" flag is fully controlled by the prefetcher, and it's up >>>> to it to change it (e.g. I can easily imagine some new logic touching >>>> setting it to "false" for some reason). >>>> >>>> The "data" flag is fully controlled by the custom callbacks, so whatever >>>> the callback stores, will be there. >>>> >>>> I don't think it's worth simplifying this. In particular, I don't think >>>> the callback can assume it can rely on the "prefetch" flag. >>>> >>> Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not >>> cause any extra space overhead (because of alignment), but allows to >>> avoid dynamic memory allocation (not sure if it is critical, but nice to >>> avoid if possible). >>> >> > While reading through the first patch I got some questions, I haven't > read it complete yet but this is what I got so far. > > 1. > +static bool > +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) > +{ > + int idx; > ... > + if (prefetch->blockItems[idx] != (block - i)) > + return false; > + > + /* Don't prefetch if the block happens to be the same. */ > + if (prefetch->blockItems[idx] == block) > + return false; > + } > + > + /* not sequential, not recently prefetched */ > + return true; > +} > > The above function name is BlockIsSequential but at the end, it > returns true if it is not sequential, seem like a problem? Actually, I think it's the comment that's wrong - the last return is reached only for a sequential pattern (and when the block was not accessed recently). 
> Also other 2 checks right above the end of the function are returning > false if the block is the same or the pattern is sequential I think > those are wrong too. > Hmmm. You're right, this is partially wrong. There are two checks: /* * For a sequential pattern, blocks "k" step ago needs to have block * number by "k" smaller compared to the current block. */ if (prefetch->blockItems[idx] != (block - i)) return false; /* Don't prefetch if the block happens to be the same. */ if (prefetch->blockItems[idx] == block) return false; The first condition is correct - we want to return "false" when the pattern is not sequential. But the second condition is wrong - we want to skip prefetching when the block was already prefetched recently, so this should return true (which is a bit misleading, as it seems to imply the pattern is sequential, when it's not). However, this is harmless, because we then identify this block as recently prefetched in the "full" cache check, so we won't prefetch it anyway. It's just a bit more expensive. There's another inefficiency - we stop looking for the same block once we find the first block breaking the sequential pattern. Imagine a sequence of blocks 1, 2, 3, 1, 2, 3, ... in which case we never notice the block was recently prefetched, because we always find the break of the sequential pattern. But again, it's harmless, thanks to the full cache of recently prefetched blocks. > 2. > I have noticed that the prefetch history is maintained at the backend > level, but what if multiple backends are trying to fetch the same heap > blocks maybe scanning the same index, so should that be in some shared > structure? I haven't thought much deeper about this from the > implementation POV, but should we think about it, or it doesn't > matter? Yes, the cache is at the backend level - it's a known limitation, but I see it more as a conscious tradeoff.
Firstly, while the LRU cache is at backend level, PrefetchBuffer also checks shared buffers for each prefetch request. So with sufficiently large shared buffers we're likely to find it there (and for direct I/O there won't be any other place to check). Secondly, the only other place to check is page cache, but there's no good (sufficiently cheap) way to check that. See the preadv2/nowait experiment earlier in this thread. I suppose we could implement a similar LRU cache for shared memory (and I don't think it'd be very complicated), but I did not plan to do that in this patch unless absolutely necessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
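One way to avoid the confusion discussed above would be to keep the two questions ("is this a sequential pattern?" and "was this block requested recently?") in separate functions. The sketch below is only an illustration of that idea, not the code from the patch: BlockHistory, HISTORY_SIZE, SEQ_CHECK and all function names are hypothetical, and the small ring buffer stands in for both the blockItems array and the "full" cache of recently prefetched blocks. It assumes, as the thread does, that prefetching is skipped both for sequential runs and for recently requested blocks.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* local stand-in for the PostgreSQL type */

#define HISTORY_SIZE 8          /* hypothetical size of the recent-block history */
#define SEQ_CHECK    4          /* hypothetical run length that counts as sequential */

typedef struct BlockHistory
{
    BlockNumber blocks[HISTORY_SIZE];
    uint64_t    count;          /* total number of blocks recorded so far */
} BlockHistory;

static void
history_add(BlockHistory *h, BlockNumber block)
{
    h->blocks[h->count % HISTORY_SIZE] = block;
    h->count++;
}

/* True only if the last SEQ_CHECK blocks were block-1, block-2, ... */
static bool
block_is_sequential(const BlockHistory *h, BlockNumber block)
{
    if (h->count < SEQ_CHECK)
        return false;

    for (uint64_t i = 1; i <= SEQ_CHECK; i++)
    {
        BlockNumber prev = h->blocks[(h->count - i) % HISTORY_SIZE];

        if (block < (BlockNumber) i || prev != block - (BlockNumber) i)
            return false;
    }
    return true;
}

/* True if the block appears anywhere in the remembered history. */
static bool
block_recently_seen(const BlockHistory *h, BlockNumber block)
{
    uint64_t n = (h->count < HISTORY_SIZE) ? h->count : HISTORY_SIZE;

    for (uint64_t i = 1; i <= n; i++)
    {
        if (h->blocks[(h->count - i) % HISTORY_SIZE] == block)
            return true;
    }
    return false;
}

/* Decide whether to issue a prefetch, then record the block either way. */
static bool
should_prefetch(BlockHistory *h, BlockNumber block)
{
    bool skip = block_is_sequential(h, block) || block_recently_seen(h, block);

    history_add(h, block);
    return !skip;
}
```

Because the recently-seen check scans the whole history instead of stopping at the first break in the sequence, the 1, 2, 3, 1, 2, 3, ... case is caught by this check directly rather than only by a later cache lookup.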
On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/24/24 01:51, Melanie Plageman wrote: > > >> But I'm not sure what to do about optimizations that are more specific > > >> to the access path. Consider for example the index-only scans. We don't > > >> want to prefetch all the pages, we need to inspect the VM and prefetch > > >> just the not-all-visible ones. And then pass the info to the index scan, > > >> so that it does not need to check the VM again. It's not clear to me how > > >> to do this with this approach. > > > > > > Yea, this is an issue I'll need to think about. To really spell out > > > the problem: the callback dequeues a TID from the tid_queue and looks > > > up its block in the VM. It's all visible. So, it shouldn't return that > > > block to the streaming read API to fetch from the heap because it > > > doesn't need to be read. But, where does the callback put the TID so > > > that the caller can get it? I'm going to think more about this. > > > > > > > Yes, that's the problem for index-only scans. I'd generalize it so that > > it's about the callback being able to (a) decide if it needs to read the > > heap page, and (b) store some custom info for the TID. > > Actually, I think this is no big deal. See attached. I just don't > enqueue tids whose blocks are all visible. I had to switch the order > from fetch heap then fill queue to fill queue then fetch heap. > > While doing this I noticed some wrong results in the regression tests > (like in the alter table test), so I suspect I have some kind of > control flow issue. Perhaps I should fix the resource leak so I can > actually see the failing tests :) Attached is a patch which implements a real queue and fixes some of the issues with the previous version. It doesn't pass tests yet and has issues. Some are bugs in my implementation I need to fix. 
Some are issues we would need to solve in the streaming read API. Some are issues with index prefetching generally. Note that these two patches have to be applied before 21d9c3ee4e because Thomas hasn't released a rebased version of the streaming read API patches yet. Issues --- - kill prior tuple This optimization doesn't work with index prefetching with the current design. Kill prior tuple relies on alternating between fetching a single index tuple and visiting the heap. After visiting the heap we can potentially kill the immediately preceding index tuple. Once we fetch multiple index tuples, enqueue their TIDs, and later visit the heap, the next index page we visit may not contain all of the index tuples deemed killable by our visit to the heap. In our case, we could try and fix this by prefetching only heap blocks referred to by index tuples on the same index page. Or we could try and keep a pool of index pages pinned and go back and kill index tuples on those pages. Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps there is an easier way to fix this, as I don't think the mvcc test failed on Tomas' version. - switching scan directions If the index scan switches directions on a given invocation of IndexNext(), heap blocks may have already been prefetched and read for blocks containing tuples beyond the point at which we want to switch directions. We could fix this by having some kind of streaming read "reset" callback to drop all of the buffers which have been prefetched which are now no longer needed. We'd have to go backwards from the last TID which was yielded to the caller and figure out which buffers in the pgsr buffer ranges are associated with all of the TIDs which were prefetched after that TID. The TIDs are in the per_buffer_data associated with each buffer in pgsr. The issue would be searching through those efficiently. The other issue is that the streaming read API does not currently support backwards scans. 
So, if we switch to a backwards scan from a forwards scan, we would need to fallback to the non streaming read method. We could do this by just setting the TID queue size to 1 (which is what I have currently implemented). Or we could add backwards scan support to the streaming read API. - mark and restore Similar to the issue with switching the scan direction, mark and restore requires us to reset the TID queue and streaming read queue. For now, I've hacked in something to the PlannerInfo and Plan to set the TID queue size to 1 for plans containing a merge join (yikes). - multiple executions For reasons I don't entirely understand yet, multiple executions (not rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' patch, I have disabled prefetching (and made the TID queue size 1) when execute_once is false. - Index Only Scans need to return IndexTuples Because index only scans return either the IndexTuple pointed to by IndexScanDesc->xs_itup or the HeapTuple pointed to by IndexScanDesc->xs_hitup -- both of which are populated by the index AM, we have to save copies of those IndexTupleData and HeapTupleDatas for every TID whose block we prefetch. This might be okay, but it is a bit sad to have to make copies of those tuples. In this patch, I still haven't figured out the memory management part. I copy over the tuples when enqueuing a TID queue item and then copy them back again when the streaming read API returns the per_buffer_data to us. Something is still not quite right here. I suspect this is part of the reason why some of the other tests are failing. Other issues/gaps in my implementation: Determining where to allocate the memory for the streaming read object and the TID queue is an outstanding TODO. To implement a fallback method for cases in which streaming read doesn't work, I set the queue size to 1. This is obviously not good. Right now, I allocate the TID queue and streaming read objects in IndexNext() and IndexOnlyNext(). 
This doesn't seem ideal. Doing it in index_beginscan() (and index_beginscan_parallel()) is tricky though because we don't know the scan direction at that point (and the scan direction can change). There are also callers of index_beginscan() who do not call Index[Only]Next() (like systable_getnext() which calls index_getnext_slot() directly). Also, my implementation does not yet have the optimization Tomas does to skip prefetching recently prefetched blocks. As he has said, it probably makes sense to add something to do this in a lower layer -- such as in the streaming read API or even in bufmgr.c (maybe in PrefetchSharedBuffer()). - Melanie
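For reference, the fallback rule described in the issues above (use a TID queue of size 1 whenever streaming read can't be used) boils down to something like the following. This is a condensed sketch, not the patch itself: DemoScanInfo, its fields and choose_tid_queue_size() are hypothetical names, and where the real patch gets these facts from (PlannerInfo, Plan, execute_once) is only described in prose above.

```c
#include <stdbool.h>

/* Hypothetical scan direction enum, local to this sketch. */
typedef enum DemoScanDirection
{
    DEMO_BACKWARD = -1,
    DEMO_FORWARD = 1
} DemoScanDirection;

/* Hypothetical summary of the facts the fallback decision depends on. */
typedef struct DemoScanInfo
{
    DemoScanDirection direction;    /* current scan direction */
    bool        needs_mark_restore; /* e.g. plan contains a merge join */
    bool        execute_once;       /* false for multiple executions */
} DemoScanInfo;

/*
 * Pick the TID queue size. A size of 1 effectively disables prefetching
 * and is used for every case the thread lists as not working yet.
 */
static int
choose_tid_queue_size(const DemoScanInfo *scan, int desired_size)
{
    if (scan->direction == DEMO_BACKWARD)
        return 1;               /* no backwards scans in the streaming read API */
    if (scan->needs_mark_restore)
        return 1;               /* mark/restore would need a queue reset */
    if (!scan->execute_once)
        return 1;               /* multiple executions don't work yet */

    return (desired_size < 1) ? 1 : desired_size;
}
```

Since the scan direction can change mid-scan, a function like this would have to be re-evaluated whenever the direction flips, not just once at scan start.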
On 2/7/24 22:48, Melanie Plageman wrote: > ... > > Attached is a patch which implements a real queue and fixes some of > the issues with the previous version. It doesn't pass tests yet and > has issues. Some are bugs in my implementation I need to fix. Some are > issues we would need to solve in the streaming read API. Some are > issues with index prefetching generally. > > Note that these two patches have to be applied before 21d9c3ee4e > because Thomas hasn't released a rebased version of the streaming read > API patches yet. > Thanks for working on this, and for investigating the various issues. > Issues > --- > - kill prior tuple > > This optimization doesn't work with index prefetching with the current > design. Kill prior tuple relies on alternating between fetching a > single index tuple and visiting the heap. After visiting the heap we > can potentially kill the immediately preceding index tuple. Once we > fetch multiple index tuples, enqueue their TIDs, and later visit the > heap, the next index page we visit may not contain all of the index > tuples deemed killable by our visit to the heap. > I admit I hadn't thought about kill_prior_tuple until you pointed it out. Yeah, prefetching separates (de-synchronizes) the two scans (index and heap) in a way that prevents this optimization. Or at least makes it much more complex :-( > In our case, we could try and fix this by prefetching only heap blocks > referred to by index tuples on the same index page. Or we could try > and keep a pool of index pages pinned and go back and kill index > tuples on those pages. > I think restricting the prefetching to a single index page would not be a huge issue performance-wise - that's what the initial patch version (implemented at the index AM level) did, pretty much. The prefetch queue would get drained as we approach the end of the index page, but luckily index pages tend to have a lot of entries.
But it'd put an upper bound on the prefetch distance (much lower than the e_i_c maximum of 1000, but I'd say common values are 10-100 anyway). But how would we know we're on the same index page? That knowledge is not available outside the index AM - the executor or indexam.c does not know this, right? Presumably we could expose this, somehow, but it seems like a violation of the abstraction ... The same thing affects keeping multiple index pages pinned, for TIDs that are yet to be used by the index scan. We'd need to know when to release a pinned page, once we're done with processing all items. FWIW I haven't tried implementing any of this, so maybe I'm missing something and it can be made to work in a nice way. > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps > there is an easier way to fix this, as I don't think the mvcc test > failed on Tomas' version. > I kinda doubt it worked correctly, considering I simply ignored the optimization. It's far more likely it just worked by luck. > - switching scan directions > > If the index scan switches directions on a given invocation of > IndexNext(), heap blocks may have already been prefetched and read for > blocks containing tuples beyond the point at which we want to switch > directions. > > We could fix this by having some kind of streaming read "reset" > callback to drop all of the buffers which have been prefetched which > are now no longer needed. We'd have to go backwards from the last TID > which was yielded to the caller and figure out which buffers in the > pgsr buffer ranges are associated with all of the TIDs which were > prefetched after that TID. The TIDs are in the per_buffer_data > associated with each buffer in pgsr. The issue would be searching > through those efficiently. > Yeah, that's roughly what I envisioned in one of my previous messages about this issue - walking back the TIDs read from the index and added to the prefetch queue.
> The other issue is that the streaming read API does not currently > support backwards scans. So, if we switch to a backwards scan from a > forwards scan, we would need to fallback to the non streaming read > method. We could do this by just setting the TID queue size to 1 > (which is what I have currently implemented). Or we could add > backwards scan support to the streaming read API. > What do you mean by "support for backwards scans" in the streaming read API? I imagined it naively as 1) drop all requests in the streaming read API queue 2) walk back all "future" requests in the TID queue 3) start prefetching as if from scratch Maybe there's a way to optimize this and reuse some of the work more efficiently, but my assumption is that the scan direction does not change very often, and that we process many items in between. > - mark and restore > > Similar to the issue with switching the scan direction, mark and > restore requires us to reset the TID queue and streaming read queue. > For now, I've hacked in something to the PlannerInfo and Plan to set > the TID queue size to 1 for plans containing a merge join (yikes). > Haven't thought about this very much, will take a closer look. > - multiple executions > > For reasons I don't entirely understand yet, multiple executions (not > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' > patch, I have disabled prefetching (and made the TID queue size 1) > when execute_once is false. > Don't work in what sense? What is (not) happening? > - Index Only Scans need to return IndexTuples > > Because index only scans return either the IndexTuple pointed to by > IndexScanDesc->xs_itup or the HeapTuple pointed to by > IndexScanDesc->xs_hitup -- both of which are populated by the index > AM, we have to save copies of those IndexTupleData and HeapTupleDatas > for every TID whose block we prefetch. > > This might be okay, but it is a bit sad to have to make copies of those tuples. 
>
> In this patch, I still haven't figured out the memory management part.
> I copy over the tuples when enqueuing a TID queue item and then copy
> them back again when the streaming read API returns the
> per_buffer_data to us. Something is still not quite right here. I
> suspect this is part of the reason why some of the other tests are
> failing.
>

It's not clear to me why you need to copy the tuples back - shouldn't it be enough to copy the tuple just once?

FWIW if we decide to pin multiple index pages (to make kill_prior_tuple work), that would also mean we don't need to copy any tuples, right? We could point into the buffers for all of them, right?

> Other issues/gaps in my implementation:
>
> Determining where to allocate the memory for the streaming read object
> and the TID queue is an outstanding TODO. To implement a fallback
> method for cases in which streaming read doesn't work, I set the queue
> size to 1. This is obviously not good.
>

I think IndexFetchTableData seems like a not entirely terrible place for allocating the pgsr, but I wonder what Andres thinks about this. IIRC he advocated for doing the prefetching in executor, and I'm not sure heapam_handler.c + relscan.h is what he imagined ...

Also, when you say "obviously not good" - why? Are you concerned about the extra overhead of shuffling stuff between queues, or something else?

> Right now, I allocate the TID queue and streaming read objects in
> IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> index_beginscan() (and index_beginscan_parallel()) is tricky though
> because we don't know the scan direction at that point (and the scan
> direction can change). There are also callers of index_beginscan() who
> do not call Index[Only]Next() (like systable_getnext() which calls
> index_getnext_slot() directly).
>

Yeah, not sure this is the right layering ... the initial patch did everything in individual index AMs, then it moved to indexam.c, then to executor.
And this seems to move it to lower layers again ...

> Also, my implementation does not yet have the optimization Tomas does
> to skip prefetching recently prefetched blocks. As he has said, it
> probably makes sense to add something to do this in a lower layer --
> such as in the streaming read API or even in bufmgr.c (maybe in
> PrefetchSharedBuffer()).
>

I agree this should happen in lower layers. I'd probably do this in the streaming read API, because that would define the "scope" of the cache (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems like it would mean a single cache (for that particular backend).

But that's just an initial thought ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
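To make the "skip recently prefetched blocks" idea above a bit more concrete, here is a minimal sketch of a small per-read cache. It is only an illustration: the names (RecentBlocks, recent_should_prefetch), the cache size, and the round-robin eviction are all assumptions, not anything taken from the patches or from the streaming read API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is PostgreSQL API. */
typedef uint32_t BlockNum;

#define RECENT_SLOTS  8            /* assumed size, purely illustrative */
#define INVALID_BLOCK UINT32_MAX   /* marks an empty slot */

typedef struct RecentBlocks
{
    BlockNum slots[RECENT_SLOTS];  /* recently prefetched block numbers */
    unsigned next;                 /* slot to overwrite next (round robin) */
} RecentBlocks;

static void
recent_init(RecentBlocks *rb)
{
    for (unsigned i = 0; i < RECENT_SLOTS; i++)
        rb->slots[i] = INVALID_BLOCK;
    rb->next = 0;
}

/*
 * Returns true if the caller should issue a prefetch for blk, and records
 * the block. Returns false if the block was prefetched recently enough to
 * still be in the cache, in which case the request can be skipped.
 */
static bool
recent_should_prefetch(RecentBlocks *rb, BlockNum blk)
{
    assert(blk != INVALID_BLOCK);

    for (unsigned i = 0; i < RECENT_SLOTS; i++)
        if (rb->slots[i] == blk)
            return false;

    rb->slots[rb->next] = blk;
    rb->next = (rb->next + 1) % RECENT_SLOTS;
    return true;
}
```

Keeping one such struct per streaming read would give the per-read "scope" described above, while a single static instance would behave like the per-backend variant.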
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 2/7/24 22:48, Melanie Plageman wrote:
> I admit I haven't thought about kill_prior_tuple until you pointed out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(

Another thing that argues against doing this is that we might not need to visit any more B-Tree leaf pages when there is a LIMIT n involved. We could end up scanning a whole extra leaf page (including all of its tuples) for want of the ability to "push down" a LIMIT to the index AM (that's not what happens right now, but it isn't really needed at all right now).

This property of index scans is fundamental to how index scans work. Pinning an index page as an interlock against concurrent TID recycling by VACUUM is directly described by the index API docs [1], even (the docs actually use terms like "buffer pin" rather than something more abstract sounding). I don't think that anything affecting that behavior should be considered an implementation detail of the nbtree index AM as such (nor any particular index AM).

I think that it makes sense to put the index AM in control here -- that almost follows from what I said about the index AM API. The index AM already needs to be in control, in about the same way, to deal with kill_prior_tuple (plus it helps with the LIMIT issue I described).

There doesn't necessarily need to be much code duplication to make that work. Offhand I suspect it would be kind of similar to how deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets by with generic logic implemented by index_compute_xid_horizon_for_tuples -- that's all that we need to determine a snapshotConflictHorizon value for recovery conflict purposes. Note that index_compute_xid_horizon_for_tuples() reads *index* pages, despite not being aware of the caller's index AM and index tuple format.
(The only reason why nbtree needs a custom solution is because it has posting list tuples to worry about, unlike GiST and unlike Hash, which consistently use unadorned generic IndexTuple structs with heap TID represented in the standard/generic way only. While these concepts probably all originated in nbtree, they're still not nbtree implementation details.)

> > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> > there is an easier way to fix this, as I don't think the mvcc test
> > failed on Tomas' version.
> >
>
> I kinda doubt it worked correctly, considering I simply ignored the
> optimization. It's far more likely it just worked by luck.

The test that did fail will have only revealed that the kill_prior_tuple wasn't operating as expected -- which isn't the same thing as giving wrong answers.

Note that there are various ways that concurrent TID recycling might prevent _bt_killitems() from setting LP_DEAD bits. It's totally unsurprising that breaking kill_prior_tuple in some way could be missed. Andres wrote the MVCC test in question precisely because certain aspects of kill_prior_tuple were broken for months without anybody noticing.

[1] https://www.postgresql.org/docs/devel/index-locking.html

--
Peter Geoghegan
On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> - kill prior tuple
>
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.

Is this maybe just a bookkeeping problem? A Boolean that says "you can kill the prior tuple" is well-suited if and only if the prior tuple is well-defined. But perhaps it could be replaced with something more sophisticated that tells you which tuples are eligible to be killed.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2/13/24 20:54, Peter Geoghegan wrote: > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 2/7/24 22:48, Melanie Plageman wrote: >> I admit I haven't thought about kill_prior_tuple until you pointed out. >> Yeah, prefetching separates (de-synchronizes) the two scans (index and >> heap) in a way that prevents this optimization. Or at least makes it >> much more complex :-( > > Another thing that argues against doing this is that we might not need > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > We could end up scanning a whole extra leaf page (including all of its > tuples) for want of the ability to "push down" a LIMIT to the index AM > (that's not what happens right now, but it isn't really needed at all > right now). > I'm not quite sure I understand what is "this" that you argue against. Are you saying we should not separate the two scans? If yes, is there a better way to do this? The LIMIT problem is not very clear to me either. Yes, if we get close to the end of the leaf page, we may need to visit the next leaf page. But that's kinda the whole point of prefetching - reading stuff ahead, and reading too far ahead is an inherent risk. Isn't that a problem we have even without LIMIT? The prefetch distance ramp up is meant to limit the impact. > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). > Good point. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. 
The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). > In control how? What would be the control flow - what part would be managed by the index AM? I initially did the prefetching entirely in each index AM, but it was suggested doing this in the executor would be better. So I gradually moved it to executor. But the idea to combine this with the streaming read API seems as a move from executor back to the lower levels ... and now you're suggesting to make the index AM responsible for this again. I'm not saying any of those layering options is wrong, but it's not clear to me which is the right one. > There doesn't necessarily need to be much code duplication to make > that work. Offhand I suspect it would be kind of similar to how > deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets > by with generic logic implemented by > index_compute_xid_horizon_for_tuples -- that's all that we need to > determine a snapshotConflictHorizon value for recovery conflict > purposes. Note that index_compute_xid_horizon_for_tuples() reads > *index* pages, despite not being aware of the caller's index AM and > index tuple format. > > (The only reason why nbtree needs a custom solution is because it has > posting list tuples to worry about, unlike GiST and unlike Hash, which > consistently use unadorned generic IndexTuple structs with heap TID > represented in the standard/generic way only. While these concepts > probably all originated in nbtree, they're still not nbtree > implementation details.) > I haven't looked at the details, but I agree the LP_DEAD deletion seems like a sensible inspiration. >>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps >>> there is an easier way to fix this, as I don't think the mvcc test >>> failed on Tomas' version. >>> >> >> I kinda doubt it worked correctly, considering I simply ignored the >> optimization. 
>> It's far more likely it just worked by luck.
>
> The test that did fail will have only revealed that the
> kill_prior_tuple wasn't operating as expected -- which isn't the same
> thing as giving wrong answers.
>

Possible. But AFAIK it did fail for Melanie, and I don't have a very good explanation for the difference in behavior.

> Note that there are various ways that concurrent TID recycling might
> prevent _bt_killitems() from setting LP_DEAD bits. It's totally
> unsurprising that breaking kill_prior_tuple in some way could be
> missed. Andres wrote the MVCC test in question precisely because
> certain aspects of kill_prior_tuple were broken for months without
> anybody noticing.
>
> [1] https://www.postgresql.org/docs/devel/index-locking.html

Yeah. There's clearly plenty of space for subtle issues.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
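For readers trying to picture the "prefetch distance ramp up" mentioned in this message, a rough sketch follows. The doubling policy, the struct, and the function names are assumptions made for illustration; the patch may ramp up differently. The point is only that a scan which stops early (say, under a small LIMIT) has issued few speculative requests by the time it stops.

```c
#include <assert.h>

/* Hypothetical ramp-up state; not taken from the patch. */
typedef struct PrefetchRamp
{
    int distance;       /* how many TIDs ahead we currently prefetch */
    int max_distance;   /* cap, e.g. derived from effective_io_concurrency */
} PrefetchRamp;

static void
ramp_init(PrefetchRamp *r, int max_distance)
{
    assert(max_distance >= 0);
    r->distance = 0;
    r->max_distance = max_distance;
}

/*
 * Called once per tuple returned by the scan. Starts at 1 and doubles
 * until the cap is reached (assumed policy). Returns the new distance.
 */
static int
ramp_advance(PrefetchRamp *r)
{
    if (r->distance == 0)
        r->distance = (r->max_distance > 0) ? 1 : 0;
    else if (r->distance < r->max_distance)
    {
        r->distance *= 2;
        if (r->distance > r->max_distance)
            r->distance = r->max_distance;
    }
    return r->distance;
}

/*
 * Distance reached after the scan has returned `returned` tuples, i.e. an
 * upper bound on how many prefetched-but-unused heap blocks are left
 * behind if the scan is shut down at that point.
 */
static int
ramp_distance_after(int max_distance, int returned)
{
    PrefetchRamp r;
    int          d = 0;

    ramp_init(&r, max_distance);
    for (int i = 0; i < returned; i++)
        d = ramp_advance(&r);
    return d;
}
```

Note that this only bounds wasted heap prefetches. It does not by itself address the separate concern raised in the thread about reading an extra index leaf page that a LIMIT would have made unnecessary.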
On 2/14/24 08:10, Robert Haas wrote:
> On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> - kill prior tuple
>>
>> This optimization doesn't work with index prefetching with the current
>> design. Kill prior tuple relies on alternating between fetching a
>> single index tuple and visiting the heap. After visiting the heap we
>> can potentially kill the immediately preceding index tuple. Once we
>> fetch multiple index tuples, enqueue their TIDs, and later visit the
>> heap, the next index page we visit may not contain all of the index
>> tuples deemed killable by our visit to the heap.
>
> Is this maybe just a bookkeeping problem? A Boolean that says "you can
> kill the prior tuple" is well-suited if and only if the prior tuple is
> well-defined. But perhaps it could be replaced with something more
> sophisticated that tells you which tuples are eligible to be killed.
>

I don't think it's just a bookkeeping problem. In a way, nbtree already does keep an array of tuples to kill (see btgettuple), but it's always for the current index page. So it's not that we immediately go and kill the prior tuple - nbtree already stashes it in an array, and kills all those tuples when moving to the next index page.

The way I understand the problem is that with prefetching we're bound to determine the kill_prior_tuple flag with a delay, in which case we might have already moved to the next index page ...

So to make this work, we'd need to:

1) keep index pages pinned for all "in flight" TIDs (read from the index, not yet consumed by the index scan)

2) keep a separate array of "to be killed" index tuples for each page

3) have a more sophisticated way to decide when to kill tuples and unpin the index page (instead of just doing it when moving to the next index page)

Maybe that's what you meant by "more sophisticated bookkeeping", ofc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
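The three requirements listed in this message can be sketched as a small per-page state machine. This is a toy model under stated assumptions: PendingPage and its functions are hypothetical, the "pin" is just a flag, and "killing" tuples is reduced to counting them. It only shows the bookkeeping shape: a page stays pinned until the scan has moved past it and every TID read from it has been consumed, and only then are the collected kills applied.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS_PER_PAGE 16   /* illustrative limit */

/* Hypothetical per-index-page bookkeeping; not PostgreSQL code. */
typedef struct PendingPage
{
    int  page_id;        /* stand-in for an index block number */
    bool pinned;         /* stand-in for holding a buffer pin (item 1) */
    bool scan_moved_on;  /* index scan already advanced past this page */
    int  in_flight;      /* TIDs read from the page, not yet consumed */
    int  nkilled;        /* entries used in killed[] */
    int  killed[MAX_ITEMS_PER_PAGE];  /* per-page kill array (item 2) */
    int  flushed;        /* kills applied at release time (demo only) */
} PendingPage;

static void
page_begin(PendingPage *p, int page_id, int ntids)
{
    p->page_id = page_id;
    p->pinned = true;
    p->scan_moved_on = false;
    p->in_flight = ntids;
    p->nkilled = 0;
    p->flushed = 0;
}

/* Item 3: release only when nothing can refer to the page any more. */
static void
page_try_release(PendingPage *p)
{
    if (p->pinned && p->scan_moved_on && p->in_flight == 0)
    {
        p->flushed = p->nkilled;  /* a real AM would set LP_DEAD bits here */
        p->pinned = false;        /* ... and drop the buffer pin */
    }
}

static void
page_scan_moved_on(PendingPage *p)
{
    p->scan_moved_on = true;
    page_try_release(p);
}

/* The scan consumed one TID; heap_dead reports what the heap visit found. */
static void
page_consume(PendingPage *p, int item, bool heap_dead)
{
    assert(p->pinned && p->in_flight > 0);
    if (heap_dead && p->nkilled < MAX_ITEMS_PER_PAGE)
        p->killed[p->nkilled++] = item;
    p->in_flight--;
    page_try_release(p);
}
```

With this shape, the information passed back from the heap visit is an (index page, item) pair rather than a single Boolean about "the prior tuple", which is roughly the replacement Robert's question hints at.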
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 2/7/24 22:48, Melanie Plageman wrote: > > ... Issues > > --- > > - kill prior tuple > > > > This optimization doesn't work with index prefetching with the current > > design. Kill prior tuple relies on alternating between fetching a > > single index tuple and visiting the heap. After visiting the heap we > > can potentially kill the immediately preceding index tuple. Once we > > fetch multiple index tuples, enqueue their TIDs, and later visit the > > heap, the next index page we visit may not contain all of the index > > tuples deemed killable by our visit to the heap. > > > > I admit I haven't thought about kill_prior_tuple until you pointed out. > Yeah, prefetching separates (de-synchronizes) the two scans (index and > heap) in a way that prevents this optimization. Or at least makes it > much more complex :-( > > > In our case, we could try and fix this by prefetching only heap blocks > > referred to by index tuples on the same index page. Or we could try > > and keep a pool of index pages pinned and go back and kill index > > tuples on those pages. > > > > I think restricting the prefetching to a single index page would not be > a huge issue performance-wise - that's what the initial patch version > (implemented at the index AM level) did, pretty much. The prefetch queue > would get drained as we approach the end of the index page, but luckily > index pages tend to have a lot of entries. But it'd put an upper bound > on the prefetch distance (much lower than the e_i_c maximum 1000, but > I'd say common values are 10-100 anyway). > > But how would we know we're on the same index page? That knowledge is > not available outside the index AM - the executor or indexam.c does not > know this, right? Presumably we could expose this, somehow, but it seems > like a violation of the abstraction ... 
The easiest way to do this would be to have the index AM amgettuple() functions set a new member in the IndexScanDescData which is either the index page identifier or a boolean that indicates we have moved on to the next page. Then, when filling the queue, we would stop doing so when the page switches. Now, this wouldn't really work for the first index tuple on each new page, so, perhaps we would need the index AMs to implement some kind of "peek" functionality. Or, we could provide the index AM with a max queue size and allow it to fill up the queue with the TIDs it wants (which it could keep to the same index page). And, for the index-only scan case, could have some kind of flag which indicates if the caller is putting TIDs+HeapTuples or TIDS+IndexTuples on the queue, which might reduce the amount of space we need. I'm not sure who manages the memory here. I wasn't quite sure how we could use index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's suggestion. But, I'd like to understand. > > - switching scan directions > > > > If the index scan switches directions on a given invocation of > > IndexNext(), heap blocks may have already been prefetched and read for > > blocks containing tuples beyond the point at which we want to switch > > directions. > > > > We could fix this by having some kind of streaming read "reset" > > callback to drop all of the buffers which have been prefetched which > > are now no longer needed. We'd have to go backwards from the last TID > > which was yielded to the caller and figure out which buffers in the > > pgsr buffer ranges are associated with all of the TIDs which were > > prefetched after that TID. The TIDs are in the per_buffer_data > > associated with each buffer in pgsr. The issue would be searching > > through those efficiently. > > > > Yeah, that's roughly what I envisioned in one of my previous messages > about this issue - walking back the TIDs read from the index and added > to the prefetch queue. 
> > > The other issue is that the streaming read API does not currently > > support backwards scans. So, if we switch to a backwards scan from a > > forwards scan, we would need to fallback to the non streaming read > > method. We could do this by just setting the TID queue size to 1 > > (which is what I have currently implemented). Or we could add > > backwards scan support to the streaming read API. > > > > What do you mean by "support for backwards scans" in the streaming read > API? I imagined it naively as > > 1) drop all requests in the streaming read API queue > > 2) walk back all "future" requests in the TID queue > > 3) start prefetching as if from scratch > > Maybe there's a way to optimize this and reuse some of the work more > efficiently, but my assumption is that the scan direction does not > change very often, and that we process many items in between. Yes, the steps you mention for resetting the queues make sense. What I meant by "backwards scan is not supported by the streaming read API" is that Thomas/Andres had mentioned that the streaming read API does not support backwards scans right now. Though, since the callback just returns a block number, I don't know how it would break. When switching between a forwards and backwards scan, does it go backwards from the current position or start at the end (or beginning) of the relation? If it is the former, then the blocks would most likely be in shared buffers -- which the streaming read API handles. It is not obvious to me from looking at the code what the gap is, so perhaps Thomas could weigh in. As for handling this in index prefetching, if you think a TID queue size of 1 is a sufficient fallback method, then resetting the pgsr queue and resizing the TID queue to 1 would work with no issues. If the fallback method requires the streaming read code path not be used at all, then that is more work. 
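The three reset steps quoted above (drop the streaming read requests, walk back the "future" TIDs, start prefetching from scratch) might look roughly like this. All names are hypothetical and the streaming read queue is reduced to a counter; this is a sketch of the control flow, not of the pgsr API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TIDQ_CAP 8   /* illustrative capacity */

typedef struct HeapTid
{
    unsigned       block;
    unsigned short offset;
} HeapTid;

/* Hypothetical TID queue; head/tail are ever-increasing counters. */
typedef struct TidQueue
{
    HeapTid items[TIDQ_CAP];
    size_t  head;           /* next entry to hand to the caller */
    size_t  tail;           /* one past the last entry read from the index */
    size_t  stream_queued;  /* stand-in for the streaming read queue */
} TidQueue;

static void
tidq_init(TidQueue *q)
{
    q->head = q->tail = 0;
    q->stream_queued = 0;
}

static size_t
tidq_count(const TidQueue *q)
{
    return q->tail - q->head;
}

static bool
tidq_push(TidQueue *q, HeapTid tid)
{
    if (tidq_count(q) == TIDQ_CAP)
        return false;
    q->items[q->tail++ % TIDQ_CAP] = tid;
    q->stream_queued++;          /* a prefetch request was issued */
    return true;
}

static HeapTid
tidq_pop(TidQueue *q)
{
    assert(tidq_count(q) > 0);
    if (q->stream_queued > 0)
        q->stream_queued--;      /* its buffer was consumed */
    return q->items[q->head++ % TIDQ_CAP];
}

/*
 * Direction change: 1) forget queued streaming read requests, 2) discard
 * TIDs read from the index but not yet returned. The return value tells
 * the caller how far the index scan ran ahead, so it can step back that
 * many entries and 3) start prefetching from scratch.
 */
static size_t
tidq_reset_for_direction_change(TidQueue *q)
{
    size_t dropped = tidq_count(q);

    q->stream_queued = 0;
    q->tail = q->head;
    return dropped;
}
```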
> > - multiple executions > > > > For reasons I don't entirely understand yet, multiple executions (not > > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' > > patch, I have disabled prefetching (and made the TID queue size 1) > > when execute_once is false. > > > > Don't work in what sense? What is (not) happening? I got wrong results for this. I'll have to do more investigation, but I assumed that not resetting the TID queue and pgsr queue was also the source of this issue. What I imagined we would do is figure out if there is a viable solution for the larger design issues and then investigate what seemed like smaller issues. But, perhaps I should dig into this first to ensure there isn't a larger issue. > > - Index Only Scans need to return IndexTuples > > > > Because index only scans return either the IndexTuple pointed to by > > IndexScanDesc->xs_itup or the HeapTuple pointed to by > > IndexScanDesc->xs_hitup -- both of which are populated by the index > > AM, we have to save copies of those IndexTupleData and HeapTupleDatas > > for every TID whose block we prefetch. > > > > This might be okay, but it is a bit sad to have to make copies of those tuples. > > > > In this patch, I still haven't figured out the memory management part. > > I copy over the tuples when enqueuing a TID queue item and then copy > > them back again when the streaming read API returns the > > per_buffer_data to us. Something is still not quite right here. I > > suspect this is part of the reason why some of the other tests are > > failing. > > > > It's not clear to me what you need to copy the tuples back - shouldn't > it be enough to copy the tuple just once? When enqueueing it, IndexTuple has to be copied from the scan descriptor to somewhere in memory with a TIDQueueItem pointing to it. 
Once we do this, the IndexTuple memory should stick around until we free it, so yes, I'm not sure why I was seeing the IndexTuple no longer be valid when I tried to put it in a slot. I'll have to do more investigation. > FWIW if we decide to pin multiple index pages (to make kill_prior_tuple > work), that would also mean we don't need to copy any tuples, right? We > could point into the buffers for all of them, right? Yes, this would be a nice benefit. > > Other issues/gaps in my implementation: > > > > Determining where to allocate the memory for the streaming read object > > and the TID queue is an outstanding TODO. To implement a fallback > > method for cases in which streaming read doesn't work, I set the queue > > size to 1. This is obviously not good. > > > > I think IndexFetchTableData seems like a not entirely terrible place for > allocating the pgsr, but I wonder what Andres thinks about this. IIRC he > advocated for doing the prefetching in executor, and I'm not sure > heapam_handled.c + relscan.h is what he imagined ... > > Also, when you say "obviously not good" - why? Are you concerned about > the extra overhead of shuffling stuff between queues, or something else? Well, I didn't resize the queue, I just limited how much of it we can use to a single member (thus wasting the other memory). But resizing a queue isn't free either. Also, I wondered if a queue size of 1 for index AMs using the fallback method is too confusing (like it is a fake queue?). But, I'd really, really rather not maintain both a queue and non-queue control flow for Index[Only]Next(). The maintenance overhead seems like it would outweigh the potential downsides. > > Right now, I allocate the TID queue and streaming read objects in > > IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in > > index_beginscan() (and index_beginscan_parallel()) is tricky though > > because we don't know the scan direction at that point (and the scan > > direction can change). 
There are also callers of index_beginscan() who > > do not call Index[Only]Next() (like systable_getnext() which calls > > index_getnext_slot() directly). > > > > Yeah, not sure this is the right layering ... the initial patch did > everything in individual index AMs, then it moved to indexam.c, then to > executor. And this seems to move it to lower layers again ... If we do something like make the index AM responsible for the TID queue (as mentioned above as a potential solution to the kill prior tuple issue), then we might be able to allocate the TID queue in the index AMs? As for the streaming read object, if we were able to solve the issue where callers of index_beginscan() don't call Index[Only]Next() (and thus shouldn't allocate a streaming read object), then it seems easy enough to move the streaming read object allocation into the table AM-specific begin scan method. > > Also, my implementation does not yet have the optimization Tomas does > > to skip prefetching recently prefetched blocks. As he has said, it > > probably makes sense to add something to do this in a lower layer -- > > such as in the streaming read API or even in bufmgr.c (maybe in > > PrefetchSharedBuffer()). > > > > I agree this should happen in lower layers. I'd probably do this in the > streaming read API, because that would define "scope" of the cache > (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems > like it would do a single cache (for that particular backend). Hmm. I wonder if there are any upsides to having the cache be per-backend. Though, that does sound like a whole other project... - Melanie
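One option floated earlier in this message is to hand the index AM a maximum queue size and let it fill the queue with TIDs from the current index page only. A minimal sketch of that contract, with made-up types and a made-up am_fill_batch() function (nothing here is an existing index AM callback), could be:

```c
#include <assert.h>
#include <stddef.h>

typedef struct HeapTid
{
    unsigned       block;
    unsigned short offset;
} HeapTid;

/* Hypothetical view of the current leaf page's matching items. */
typedef struct LeafPage
{
    const HeapTid *items;   /* TIDs on this index page, in scan order */
    size_t         nitems;
    size_t         next;    /* first item not yet handed out */
} LeafPage;

/*
 * AM-side batch fill: copy at most max TIDs into out, never crossing the
 * page boundary. A return value of 0 means the page is exhausted and the
 * caller has to ask the AM to step to the next page before prefetching
 * any further.
 */
static size_t
am_fill_batch(LeafPage *page, HeapTid *out, size_t max)
{
    size_t n = 0;

    while (n < max && page->next < page->nitems)
        out[n++] = page->items[page->next++];
    return n;
}
```

Because the AM decides where a batch ends, the executor never needs to know which index page a TID came from, which sidesteps the "peek" problem for the first tuple on each new page.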
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Another thing that argues against doing this is that we might not need > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > We could end up scanning a whole extra leaf page (including all of its > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > (that's not what happens right now, but it isn't really needed at all > > right now). > > > > I'm not quite sure I understand what is "this" that you argue against. > Are you saying we should not separate the two scans? If yes, is there a > better way to do this? What I'm concerned about is the difficulty and complexity of any design that requires revising "63.4. Index Locking Considerations", since that's pretty subtle stuff. In particular, if prefetching "de-synchronizes" (to use your term) the index leaf page level scan and the heap page scan, then we'll probably have to totally revise the basic API. Maybe that'll actually turn out to be the right thing to do -- it could just be the only thing that can unleash the full potential of prefetching. But I'm not aware of any evidence that points in that direction. Are you? (I might have just missed it.) > The LIMIT problem is not very clear to me either. Yes, if we get close > to the end of the leaf page, we may need to visit the next leaf page. > But that's kinda the whole point of prefetching - reading stuff ahead, > and reading too far ahead is an inherent risk. Isn't that a problem we > have even without LIMIT? The prefetch distance ramp up is meant to limit > the impact. Right now, the index AM doesn't know anything about LIMIT at all. That doesn't matter, since the index AM can only read/scan one full leaf page before returning control back to the executor proper. The executor proper can just shut down the whole index scan upon finding that we've already returned N tuples for a LIMIT N. 
We don't do prefetching right now, but we also don't risk reading a leaf page that'll just never be needed. Those two things are in tension, but I don't think that that's quite the same thing as the usual standard prefetching tension/problem. Here there is uncertainty about whether what we're prefetching will *ever* be required -- not uncertainty about when exactly it'll be required. (Perhaps this distinction doesn't mean much to you. I'm just telling you how I think about it, in case it helps move the discussion forward.) > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrently TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > > > Good point. The main reason why the index AM docs require this interlock is because we need such an interlock to make non-MVCC snapshot scans safe. If you remove the interlock (the buffer pin interlock that protects against TID recycling by VACUUM), you can still avoid the same race condition by using an MVCC snapshot. This is why using an MVCC snapshot is a requirement for bitmap index scans. I believe that it's also a requirement for index-only scans, but the index AM docs don't spell that out. Another factor that complicates things here is mark/restore processing. The design for that has the idea of processing one page at a time baked-in. Kinda like with the kill_prior_tuple issue. It's certainly possible that you could figure out various workarounds for each of these issues (plus the kill_prior_tuple issue) with a prefetching design that "de-synchronizes" the index access and the heap access. 
But it might well be better to extend the existing design in a way that just avoids all these problems in the first place. Maybe "de-synchronization" really can pay for itself (because the benefits will outweigh these costs), but if you go that way then I'd really prefer it that way. > > I think that it makes sense to put the index AM in control here -- > > that almost follows from what I said about the index AM API. The index > > AM already needs to be in control, in about the same way, to deal with > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > In control how? What would be the control flow - what part would be > managed by the index AM? ISTM that prefetching for an index scan is about the index scan itself, first and foremost. The heap accesses are usually the dominant cost, of course, but sometimes the index leaf page accesses really do make up a significant fraction of the overall cost of the index scan. Especially with an expensive index qual. So if you just assume that the TIDs returned by the index scan are the only thing that matters, you might have a model that's basically correct on average, but is occasionally very wrong. That's one reason for "putting the index AM in control". As I said back in June, we should probably be marrying information from the index scan with information from the heap. This is something that is arguably a modularity violation. But it might just be that you really do need to take information from both places to consistently make the right trade-off. Perhaps the best arguments for "putting the index AM in control" only work when you go to fix the problems that "naive de-synchronization" creates. Thinking about that side of things some more might make "putting the index AM in control" seem more natural. Suppose, for example, you try to make a prefetching design based on "de-synchronization" work with kill_prior_tuple -- suppose you try to fix that problem. 
You're likely going to need to make some kind of trade-off that gets you most of the advantages that that approach offers (assuming that there really are significant advantages), while still retaining most of the advantages that we already get from kill_prior_tuple (basically we want to LP_DEAD-mark index tuples with almost or exactly the same consistency as we manage today). Maybe your approach involves tracking multiple LSNs for each prefetch-pending leaf page, or perhaps you hold on to a pin on some number of leaf pages instead (right now nbtree does both [1], which I go into more below). Either way, you're pushing stuff down into the index AM. Note that we already hang onto more than one pin at a time in rare cases involving mark/restore processing. For example, it can happen for a merge join that happens to involve an unlogged index, if the markpos and curpos are a certain way relative to the current leaf page (yeah, really). So putting stuff like that under the control of the index AM (while also applying basic information that comes from the heap) in order to fix the kill_prior_tuple issue is arguably something that has a kind of a precedent for us to follow. Even if you disagree with me here ("precedent" might be overstating it), perhaps you still get some general sense of why I have an inkling that putting prefetching in the index AM is the way to go. It's very hard to provide one really strong justification for all this, and I'm certainly not expecting you to just agree with me right away. I'm also not trying to impose any conditions on committing this patch. Thinking about this some more, "making kill_prior_tuple work with de-synchronization" is a bit of a misleading way of putting it. The way that you'd actually work around this is (at a very high level) *dynamically* making some kind of *trade-off* between synchronization and desynchronization. 
Up until now, we've been talking in terms of a strict dichotomy between the old index AM API design (index-page-at-a-time synchronization), and a "de-synchronizing" prefetching design that embraces the opposite extreme -- a design where we only think in terms of heap TIDs, and completely ignore anything that happens in the index structure (and consequently makes kill_prior_tuple ineffective). That now seems like a false dichotomy. > I initially did the prefetching entirely in each index AM, but it was > suggested doing this in the executor would be better. So I gradually > moved it to executor. But the idea to combine this with the streaming > read API seems as a move from executor back to the lower levels ... and > now you're suggesting to make the index AM responsible for this again. I did predict that there'd be lots of difficulties around the layering back in June. :-) > I'm not saying any of those layering options is wrong, but it's not > clear to me which is the right one. I don't claim to know what the right trade-off is myself. The fact that all of these things are in tension doesn't surprise me. It's just a hard problem. > Possible. But AFAIK it did fail for Melanie, and I don't have a very > good explanation for the difference in behavior. If you take a look at _bt_killitems(), you'll see that it actually has two fairly different strategies for avoiding TID recycling race condition issues, applied in each of two different cases: 1. Cases where we really have held onto a buffer pin, per the index AM API -- the "index AM orthodox" approach. (The aforementioned issue with unlogged indexes exists because with an unlogged index we must use approach 1, per the nbtree README section [1]). 2. Cases where we drop the pin as an optimization (also per [1]), and now have to detect the possibility of concurrent modifications by VACUUM (that could have led to concurrent TID recycling). 
We conservatively do nothing (don't mark any index tuples LP_DEAD), unless the LSN is exactly the same as it was back when the page was scanned/read by _bt_readpage(). So some accidental detail with LSNs (like using or not using an unlogged index) could cause bugs in this area to "accidentally fail to fail". The nbtree index AM has its own optimizations here, which probably have a tendency to mask problems/bugs. (I sometimes use unlogged indexes for some of my nbtree related test cases, just to reduce certain kinds of variability, including variability in this area.) [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=52e646c7f759a5d9cfdc32b86f6aff8460891e12;hb=3e8235ba4f9cc3375b061fb5d3f3575434539b5f#l443 -- Peter Geoghegan
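The two _bt_killitems() strategies described in the message above can be sketched in a few lines. This is only a toy Python model, not the nbtree code: `LeafPageState`, `can_set_lp_dead` and all field names are hypothetical, and the only rule modelled is the one stated in the message (LP_DEAD marking is allowed if the pin was held throughout, or if the page LSN is exactly what it was when the page was read).

```python
from dataclasses import dataclass


@dataclass
class LeafPageState:
    # All fields are hypothetical names used only for this illustration.
    pin_held: bool      # did the scan keep its buffer pin on the leaf page?
    lsn_at_read: int    # page LSN remembered when the page was read
    lsn_now: int        # page LSN observed when we come back to kill items


def can_set_lp_dead(page: LeafPageState) -> bool:
    """Return True if marking index tuples LP_DEAD is considered safe."""
    if page.pin_held:
        # Case 1: the pin acted as the interlock against TID recycling.
        return True
    # Case 2: the pin was dropped, so any LSN change means the page may
    # have been modified concurrently; conservatively do nothing.
    return page.lsn_now == page.lsn_at_read
```

In this model an unlogged index would always have to take the `pin_held` branch, since (per the message) it has no LSN that can serve as the change detector.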
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I wasn't quite sure how we could use > index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's > suggestion. But, I'd like to understand. The point I was trying to make with that example was: a highly generic mechanism can sometimes work across disparate index AMs (that all at least support plain index scans) when it just so happens that these AMs don't actually differ in a way that could possibly matter to that mechanism. While it's true that (say) nbtree and hash are very different at a high level, it's nevertheless also true that the way things work at the level of individual index pages is much more similar than different. With index deletion, we know that the differences between each supported index AM either don't matter at all (which is what obviates the need for index_compute_xid_horizon_for_tuples() to be directly aware of which index AM the page it is passed comes from), or matter only in small, incidental ways (e.g., nbtree stores posting lists in its tuples, despite using IndexTuple structs). With prefetching, it seems reasonable to suppose that an index-AM specific approach would end up needing very little truly custom code. This is pretty strongly suggested by the fact that the rules around buffer pins (as an interlock against concurrent TID recycling by VACUUM) are standardized by the index AM API itself. Those rules might be slightly more natural with nbtree, but that's kinda beside the point. While the basic organizing principle for where each index tuple goes can vary enormously, it doesn't necessarily matter at all -- in the end, you're really just reading each index page (that has TIDs to read) exactly once per scan, in some fixed order, with interlaced inline heap accesses (that go fetch heap tuples for each individual TID read from each index page). 
In general I don't accept that we need to do things outside the index AM, because software architecture encapsulation something something. I suspect that we'll need to share some limited information across different layers of abstraction, because that's just fundamentally what's required by the constraints we're operating under. Can't really prove it, though. -- Peter Geoghegan
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 2/7/24 22:48, Melanie Plageman wrote: > > > ... > > > - switching scan directions > > > > > > If the index scan switches directions on a given invocation of > > > IndexNext(), heap blocks may have already been prefetched and read for > > > blocks containing tuples beyond the point at which we want to switch > > > directions. > > > > > > We could fix this by having some kind of streaming read "reset" > > > callback to drop all of the buffers which have been prefetched which > > > are now no longer needed. We'd have to go backwards from the last TID > > > which was yielded to the caller and figure out which buffers in the > > > pgsr buffer ranges are associated with all of the TIDs which were > > > prefetched after that TID. The TIDs are in the per_buffer_data > > > associated with each buffer in pgsr. The issue would be searching > > > through those efficiently. > > > > > > > Yeah, that's roughly what I envisioned in one of my previous messages > > about this issue - walking back the TIDs read from the index and added > > to the prefetch queue. > > > > > The other issue is that the streaming read API does not currently > > > support backwards scans. So, if we switch to a backwards scan from a > > > forwards scan, we would need to fallback to the non streaming read > > > method. We could do this by just setting the TID queue size to 1 > > > (which is what I have currently implemented). Or we could add > > > backwards scan support to the streaming read API. > > > > > > > What do you mean by "support for backwards scans" in the streaming read > > API? 
I imagined it naively as > > > > 1) drop all requests in the streaming read API queue > > > > 2) walk back all "future" requests in the TID queue > > > > 3) start prefetching as if from scratch > > > > Maybe there's a way to optimize this and reuse some of the work more > > efficiently, but my assumption is that the scan direction does not > > change very often, and that we process many items in between. > > Yes, the steps you mention for resetting the queues make sense. What I > meant by "backwards scan is not supported by the streaming read API" > is that Thomas/Andres had mentioned that the streaming read API does > not support backwards scans right now. Though, since the callback just > returns a block number, I don't know how it would break. > > When switching between a forwards and backwards scan, does it go > backwards from the current position or start at the end (or beginning) > of the relation? Okay, well I answered this question for myself, by, um, trying it :). FETCH backward will go backwards from the current cursor position. So, I don't see exactly why this would be an issue. > If it is the former, then the blocks would most > likely be in shared buffers -- which the streaming read API handles. > It is not obvious to me from looking at the code what the gap is, so > perhaps Thomas could weigh in. I have the same problem with the sequential scan streaming read user, so I am going to try and figure this backwards scan and switching scan direction thing there (where we don't have other issues). - Melanie
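The three reset steps discussed above (drop the queued requests, walk back the "future" TIDs, start prefetching from scratch) can be modelled roughly as follows. This is a sketch under stated assumptions, not the patch or the streaming read API: `PrefetchState` and its methods are hypothetical, TIDs are plain `(block, offset)` tuples, and the model only covers the reset itself, not the actual backwards traversal.

```python
from collections import deque


class PrefetchState:
    """Hypothetical model: inflight always mirrors tids[next_to_yield:next_to_queue]."""

    def __init__(self, tids):
        self.tids = list(tids)    # (block, offset) pairs in index order
        self.next_to_yield = 0    # next TID the scan returns to the caller
        self.next_to_queue = 0    # next TID to hand to the prefetch queue
        self.inflight = deque()   # heap blocks with outstanding requests

    def prefetch(self, distance):
        # Queue heap blocks until `distance` requests are outstanding.
        while self.next_to_queue < len(self.tids) and len(self.inflight) < distance:
            self.inflight.append(self.tids[self.next_to_queue][0])
            self.next_to_queue += 1

    def yield_next(self):
        tid = self.tids[self.next_to_yield]
        self.next_to_yield += 1
        if self.inflight:
            self.inflight.popleft()   # this TID's request has been consumed
        else:
            self.next_to_queue = self.next_to_yield  # nothing was prefetched
        return tid

    def reset_for_direction_change(self):
        self.inflight.clear()                    # 1) drop all queued requests
        self.next_to_queue = self.next_to_yield  # 2) walk back "future" TIDs
        # 3) the caller resumes prefetching as if from scratch
```

The model leans on the same assumption stated in the thread: direction changes are rare, so throwing away the queued work is acceptable.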
On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > Another thing that argues against doing this is that we might not need > > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > > We could end up scanning a whole extra leaf page (including all of its > > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > > (that's not what happens right now, but it isn't really needed at all > > > right now). > > > > > > > I'm not quite sure I understand what is "this" that you argue against. > > Are you saying we should not separate the two scans? If yes, is there a > > better way to do this? > > What I'm concerned about is the difficulty and complexity of any > design that requires revising "63.4. Index Locking Considerations", > since that's pretty subtle stuff. In particular, if prefetching > "de-synchronizes" (to use your term) the index leaf page level scan > and the heap page scan, then we'll probably have to totally revise the > basic API. So, a pin on the index leaf page is sufficient to keep line pointers from being reused? If we stick to prefetching heap blocks referred to by index tuples in a single index leaf page, and we keep that page pinned, will we still have a problem? > > The LIMIT problem is not very clear to me either. Yes, if we get close > > to the end of the leaf page, we may need to visit the next leaf page. > > But that's kinda the whole point of prefetching - reading stuff ahead, > > and reading too far ahead is an inherent risk. Isn't that a problem we > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > the impact. > > Right now, the index AM doesn't know anything about LIMIT at all. That > doesn't matter, since the index AM can only read/scan one full leaf > page before returning control back to the executor proper. 
The > executor proper can just shut down the whole index scan upon finding > that we've already returned N tuples for a LIMIT N. > > We don't do prefetching right now, but we also don't risk reading a > leaf page that'll just never be needed. Those two things are in > tension, but I don't think that that's quite the same thing as the > usual standard prefetching tension/problem. Here there is uncertainty > about whether what we're prefetching will *ever* be required -- not > uncertainty about when exactly it'll be required. (Perhaps this > distinction doesn't mean much to you. I'm just telling you how I think > about it, in case it helps move the discussion forward.) I don't think that the LIMIT problem is too different for index scans than heap scans. We will need some advice from planner to come down to prevent over-eager prefetching in all cases. > Another factor that complicates things here is mark/restore > processing. The design for that has the idea of processing one page at > a time baked-in. Kinda like with the kill_prior_tuple issue. Yes, I mentioned this in my earlier email. I think we can resolve mark/restore by resetting the prefetch and TID queues and restoring the last used heap TID in the index scan descriptor. > It's certainly possible that you could figure out various workarounds > for each of these issues (plus the kill_prior_tuple issue) with a > prefetching design that "de-synchronizes" the index access and the > heap access. But it might well be better to extend the existing design > in a way that just avoids all these problems in the first place. Maybe > "de-synchronization" really can pay for itself (because the benefits > will outweigh these costs), but if you go that way then I'd really > prefer it that way. Forcing each index access to be synchronous and interleaved with each table access seems like an unprincipled design constraint. 
While it is true that we rely on that in our current implementation (when using non-MVCC snapshots), it doesn't seem like a principle inherent to accessing indexes and tables. > > > I think that it makes sense to put the index AM in control here -- > > > that almost follows from what I said about the index AM API. The index > > > AM already needs to be in control, in about the same way, to deal with > > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > > > > In control how? What would be the control flow - what part would be > > managed by the index AM? > > ISTM that prefetching for an index scan is about the index scan > itself, first and foremost. The heap accesses are usually the dominant > cost, of course, but sometimes the index leaf page accesses really do > make up a significant fraction of the overall cost of the index scan. > Especially with an expensive index qual. So if you just assume that > the TIDs returned by the index scan are the only thing that matters, > you might have a model that's basically correct on average, but is > occasionally very wrong. That's one reason for "putting the index AM > in control". I don't think the fact that it would also be valuable to do index prefetching is a reason not to do prefetching of heap pages. And, while it is true that were you to add index interior or leaf page prefetching, it would impact the heap prefetching, at the end of the day, the table AM needs some TID or TID-equivalents whose blocks it can go fetch. The index AM has to produce something that the table AM will consume. So, if we add prefetching of heap pages and get the table AM input right, it shouldn't require a full redesign to add index page prefetching later. You could argue that my suggestion to have the index AM manage and populate a queue of TIDs for use by the table AM puts the index AM in control. 
I do think having so many members of the IndexScanDescriptor which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous interplay between fetching an index tuple and fetching a heap tuple is confusing and error prone. > As I said back in June, we should probably be marrying information > from the index scan with information from the heap. This is something > that is arguably a modularity violation. But it might just be that you > really do need to take information from both places to consistently > make the right trade-off. Agreed that we are going to need to mix information from both places. > If you take a look at _bt_killitems(), you'll see that it actually has > two fairly different strategies for avoiding TID recycling race > condition issues, applied in each of two different cases: > > 1. Cases where we really have held onto a buffer pin, per the index AM > API -- the "inde AM orthodox" approach. (The aforementioned issue > with unlogged indexes exists because with an unlogged index we must > use approach 1, per the nbtree README section [1]). > > 2. Cases where we drop the pin as an optimization (also per [1]), and > now have to detect the possibility of concurrent modifications by > VACUUM (that could have led to concurrent TID recycling). We > conservatively do nothing (don't mark any index tuples LP_DEAD), > unless the LSN is exactly the same as it was back when the page was > scanned/read by _bt_readpage(). Re 2: so the LSN could have been changed by some other process (i.e. not vacuum), so how often in practice is the LSN actually the same as when the page was scanned/read? Do you think we would catch a meaningful number of kill prior tuple opportunities if we used an LSN tracking method like this? Something that let us drop the pin on the page would obviously be better. - Melanie
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, a pin on the index leaf page is sufficient to keep line pointers > from being reused? If we stick to prefetching heap blocks referred to > by index tuples in a single index leaf page, and we keep that page > pinned, will we still have a problem? That's certainly one way of dealing with it. Obviously, there are questions about how you do that in a way that consistently avoids creating new problems. > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I think that I'd rather use information at execution time instead, if at all possible (perhaps in addition to a hint given by the planner). But it seems a bit premature to discuss this problem now, except to say that it might indeed be a problem. > > It's certainly possible that you could figure out various workarounds > > for each of these issues (plus the kill_prior_tuple issue) with a > > prefetching design that "de-synchronizes" the index access and the > > heap access. But it might well be better to extend the existing design > > in a way that just avoids all these problems in the first place. Maybe > > "de-synchronization" really can pay for itself (because the benefits > > will outweigh these costs), but if you go that way then I'd really > > prefer it that way. > > Forcing each index access to be synchronous and interleaved with each > table access seems like an unprincipled design constraint. While it is > true that we rely on that in our current implementation (when using > non-MVCC snapshots), it doesn't seem like a principle inherent to > accessing indexes and tables. There is nothing sacred about the way plain index scans work right now -- especially the part about buffer pins as an interlock. 
If the pin thing really was sacred, then we could never have allowed nbtree to selectively opt-out in cases where it's possible to provide an equivalent correctness guarantee without holding onto buffer pins, which, as I went into, is how it actually works in nbtree's _bt_killitems() today (see commit 2ed5b87f96 for full details). And so in principle I have no problem with the idea of revising the basic definition of plain index scans -- especially if it's to make the definition more abstract, without fundamentally changing it (e.g., to make it no longer reference buffer pins, making life easier for prefetching, while at the same time still implying the same underlying guarantees sufficient to allow nbtree to mostly work the same way as today). All I'm really saying is: 1. The sort of tricks that we can do in nbtree's _bt_killitems() are quite useful, and ought to be preserved in something like their current form, even when prefetching is in use. This seems to push things in the direction of centralizing control of the process in index scan code. For example, it has to understand that _bt_killitems() will be called at some regular cadence that is well defined and sensible from an index point of view. 2. Are you sure that the leaf-page-at-a-time thing is such a huge hindrance to effective prefetching? I suppose that it might be much more important than I imagine it is right now, but it'd be nice to have something a bit more concrete to go on. 3. Even if it is somewhat important, do you really need to get that part working in v1? Tomas' original prototype worked with the leaf-page-at-a-time thing, and that still seemed like a big improvement to me. While being less invasive, in effect. If we can agree that something like that represents a useful step in the right direction (not an evolutionary dead end), then we can make good incremental progress within a single release. 
> I don't think the fact that it would also be valuable to do index > prefetching is a reason not to do prefetching of heap pages. And, > while it is true that were you to add index interior or leaf page > prefetching, it would impact the heap prefetching, at the end of the > day, the table AM needs some TID or TID-equivalents that whose blocks > it can go fetch. I wasn't really thinking of index page prefetching at all. Just the cost of applying index quals to read leaf pages that might never actually need to be read, due to the presence of a LIMIT. That is kind of a new problem created by eagerly reading (without actually prefetching) leaf pages. > You could argue that my suggestion to have the index AM manage and > populate a queue of TIDs for use by the table AM puts the index AM in > control. I do think having so many members of the IndexScanDescriptor > which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous > interplay between fetching an index tuple and fetching a heap tuple is > confusing and error prone. But that's kinda how amgettuple is supposed to work -- cursors need it to work that way. Having some kind of general notion of scan order is also important to avoid returning duplicate TIDs to the scan. In contrast, GIN heavily relies on the fact that it only supports bitmap scans -- that allows it to not have to reason about returning duplicate TIDs (when dealing with a concurrently merged pending list, and other stuff like that). And so nbtree (and basically every other index AM that supports plain index scans) kinda pretends to process a single tuple at a time, in some fixed order that's convenient for the scan to work with (that's how the executor thinks of things). In reality these index AMs actually process batches consisting of a single leaf page worth of tuples. 
I don't see how the IndexScanDescData side of things makes life any harder for this patch -- ISTM that you'll always need to pretend to return one tuple at a time from the index scan, regardless of what happens under the hood, with pins and whatnot. The page-at-a-time thing is more or less an implementation detail that's private to index AMs (albeit in a way that follows certain standard conventions across index AMs) -- it's a leaky abstraction only due to the interactions with VACUUM/TID recycle safety. > Re 2: so the LSN could have been changed by some other process (i.e. > not vacuum), so how often in practice is the LSN actually the same as > when the page was scanned/read? It seems very hard to make generalizations about that sort of thing. It doesn't help that we now have batching logic inside _bt_simpledel_pass() that will make up for the problem of not setting as many LP_DEAD bits as we could in many important cases. (I recall that that was one factor that allowed the bug that Andres fixed in commit 90c885cd to go undetected for months. I recall discussing the issue with Andres around that time.) > Do you think we would catch a > meaningful number of kill prior tuple opportunities if we used an LSN > tracking method like this? Something that let us drop the pin on the > page would obviously be better. Quite possibly, yes. But it's hard to say for sure without far more detailed analysis. Plus you have problems with things like unlogged indexes not having an LSN to use as a canary condition, which makes it a bit messy (it's already kind of weird that we treat unlogged indexes differently here IMV). -- Peter Geoghegan
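The "pretend to return one tuple at a time while really processing one leaf page per batch" behaviour described above, and why a LIMIT never causes an unneeded leaf page to be read today, can be illustrated with a small generator. This is a toy model only: `LeafPageScan`, `pages_read` and the list-of-lists page layout are hypothetical and do not correspond to real structures.

```python
from itertools import islice


class LeafPageScan:
    """Hypothetical model of a page-at-a-time index scan."""

    def __init__(self, leaf_pages):
        self.leaf_pages = leaf_pages  # each page is a list of TIDs
        self.pages_read = 0

    def tuples(self):
        for page in self.leaf_pages:
            self.pages_read += 1      # the whole page is read as one batch
            batch = list(page)
            for tid in batch:
                yield tid             # but the caller sees one TID at a time


# Three leaf pages with two TIDs each; the "executor" stops after LIMIT 3.
scan = LeafPageScan([[(1, 1), (1, 2)], [(2, 1), (2, 2)], [(3, 1), (3, 2)]])
first_three = list(islice(scan.tuples(), 3))
```

Here the third leaf page is never touched, because control returns to the consumer after every tuple. A design that reads leaf pages ahead of the consumer would give up that property unless it is told about the limit in some way.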
Hi, On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote: > > > The LIMIT problem is not very clear to me either. Yes, if we get close > > > to the end of the leaf page, we may need to visit the next leaf page. > > > But that's kinda the whole point of prefetching - reading stuff ahead, > > > and reading too far ahead is an inherent risk. Isn't that a problem we > > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > > the impact. > > > > Right now, the index AM doesn't know anything about LIMIT at all. That > > doesn't matter, since the index AM can only read/scan one full leaf > > page before returning control back to the executor proper. The > > executor proper can just shut down the whole index scan upon finding > > that we've already returned N tuples for a LIMIT N. > > > > We don't do prefetching right now, but we also don't risk reading a > > leaf page that'll just never be needed. Those two things are in > > tension, but I don't think that that's quite the same thing as the > > usual standard prefetching tension/problem. Here there is uncertainty > > about whether what we're prefetching will *ever* be required -- not > > uncertainty about when exactly it'll be required. (Perhaps this > > distinction doesn't mean much to you. I'm just telling you how I think > > about it, in case it helps move the discussion forward.) > > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I'm not sure that that's really true. I think the more common and more problematic case for partially executing a sub-tree of a query are nested loops (worse because that happens many times within a query). Particularly for anti-joins prefetching too aggressively could lead to a significant IO amplification. 
At the same time it's IMO more important to ramp up prefetching distance fairly aggressively for index scans than it is for sequential scans. For sequential scans it's quite likely that either the whole scan takes quite a while (thus slowly ramping doesn't affect overall time that much) or that the data is cached anyway because the tables are small and frequently used (in which case we don't need to ramp). And even if smaller tables aren't cached, because it's sequential IO, the IOs are cheaper as they're sequential. Contrast that to index scans, where it's much more likely that you have cache misses in queries that do an overall fairly small number of IOs and where that IO is largely random. I think we'll need some awareness at ExecInitNode() time about how the results of the nodes are used. I see a few "classes": 1) All rows are needed, because the node is below an Agg, Hash, Materialize, Sort, .... Can be determined purely by the plan shape. 2) All rows are needed, because the node is completely consumed by the top-level (i.e. no limit, anti-joins or such in between) and the top-level wants to run the whole query. Unfortunately I don't think we know this at plan time at the moment (it's just determined by what's passed to ExecutorRun()). 3) Some rows are needed, but it's hard to know the precise number. E.g. because of a LIMIT further up. 4) Only a single row is going to be needed, albeit possibly after filtering on the node level. E.g. the anti-join case. There are different times at which we could determine how each node is consumed: a) Determine node consumption "class" purely within ExecInit*, via different eflags. Today that couldn't deal with 2), but I think it'd not be too hard to modify callers that consume query results completely to tell that ExecutorStart(), not just ExecutorRun(). A disadvantage would be that this prevents us from taking IO depth into account during costing. 
There very well might be plans that are cheaper than others because the plan shape allows more concurrent IO. b) Determine node consumption class at plan time. This also couldn't deal with 2), but fixing that probably would be harder, because we'll often not know at plan time how the query will be executed. And in fact the same plan might be executed multiple ways, in case of prepared statements. The obvious advantage is of course that we can influence the choice of paths. I suspect we'd eventually want a mix of both. Plan time to be able to influence plan shape, ExecInit* to deal with not knowing how the query will be consumed at plan time. Which suggests that we could start with whichever is easier and extend later. Greetings, Andres Freund
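The four consumption "classes" listed above could be derived from plan shape along these lines. This is a loose sketch, not executor code: the `Consumption` enum, the string node labels and the `run_to_completion` flag are all hypothetical, and "AntiJoin" is taken to mean that the scan sits on the inner side of an anti-join.

```python
from enum import Enum


class Consumption(Enum):
    ALL_ROWS_PLAN_SHAPE = 1   # below Agg, Hash, Materialize, Sort, ...
    ALL_ROWS_TOP_LEVEL = 2    # fully consumed by the top level
    SOME_ROWS = 3             # e.g. a LIMIT further up
    SINGLE_ROW = 4            # e.g. inner side of an anti-join


# Hypothetical labels for nodes that always consume their whole input.
BLOCKING_PARENTS = {"Agg", "Hash", "Materialize", "Sort"}


def classify(ancestors, run_to_completion):
    """Classify a scan node.

    ancestors: node labels from the scan's parent up to the plan root.
    run_to_completion: whether the caller intends to fetch every row,
    which the message notes is not known at plan time today.
    """
    for node in ancestors:            # the nearest relevant ancestor decides
        if node in BLOCKING_PARENTS:
            return Consumption.ALL_ROWS_PLAN_SHAPE
        if node == "AntiJoin":
            return Consumption.SINGLE_ROW
        if node == "Limit":
            return Consumption.SOME_ROWS
    if run_to_completion:
        return Consumption.ALL_ROWS_TOP_LEVEL
    return Consumption.SOME_ROWS
```

Only the first three checks depend purely on plan shape; the `run_to_completion` input corresponds to class 2), which is the part that would need to be passed in at ExecutorStart() or later.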
Hi, On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). Given that the interlock is only needed for non-mvcc scans, that non-mvcc scans are rare due to catalog accesses using snapshots these days and that most non-mvcc scans do single-tuple lookups, it might be viable to be more restrictive about prefetching iff non-mvcc snapshots are in use and to use a method of cleanup that allows multiple pages to be cleaned up otherwise. However, I don't think we would necessarily have to relax the IAM pinning rules, just to be able to do prefetching of more than one index leaf page. Restricting prefetching to entries within a single leaf page obviously has the disadvantage of not being able to benefit from concurrent IO whenever crossing a leaf page boundary, but at the same time processing entries from just two leaf pages would often allow for a sufficiently aggressive prefetching. Pinning a small number of leaf pages instead of a single leaf page shouldn't be a problem. One argument for loosening the tight coupling between kill_prior_tuples and index scan progress is that the lack of kill_prior_tuples for bitmap scans is quite problematic. I've seen numerous production issues with bitmap scans caused by subsequent scans processing a growing set of dead tuples, where plain index scans were substantially slower initially but didn't get much slower over time. 
We might be able to design a system where the bitmap contains a certain number of back-references to the index, allowing later cleanup if there weren't any page splits or such. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). Depending on what "control" means I'm doubtful: Imo there are decisions influencing prefetching that an index AM shouldn't need to know about directly, e.g. how the plan shape influences how many tuples are actually going to be consumed. Of course that determination could be made in planner/executor and handed to IAMs, for the IAM to then "control" the prefetching. Another aspect is that *long* term I think we want to be able to execute different parts of the plan tree when one part is blocked for IO. Of course that's not always possible. But particularly with partitioned queries it often is. Depending on the form of "control" that's harder if IAMs are in control, because control flow needs to return to the executor to be able to switch to a different node, so we can't wait for IO inside the AM. There probably are ways IAMs could be in "control" that would be compatible with such constraints however. Greetings, Andres Freund
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund <andres@anarazel.de> wrote: > On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrently TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > Given that the interlock is only needed for non-mvcc scans, that non-mvcc > scans are rare due to catalog accesses using snapshots these days and that > most non-mvcc scans do single-tuple lookups, it might be viable to be more > restrictive about prefetching iff non-mvcc snapshots are in use and to use > method of cleanup that allows multiple pages to be cleaned up otherwise. I agree, but don't think that it matters all that much. If you have an MVCC snapshot, that doesn't mean that TID recycle safety problems automatically go away. It only means that you have one known and supported alternative approach to dealing with such problems. It's not like you just get that for free, just by using an MVCC snapshot, though -- it has downsides. Downsides such as the current _bt_killitems() behavior with a concurrently-modified leaf page (modified when we didn't hold a leaf page pin). It'll just give up on setting any LP_DEAD bits due to noticing that the leaf page's LSN changed. (Plus there are implementation restrictions that I won't repeat again now.) When I refer to the buffer pin interlock, I'm mostly referring to the general need for something like that in the context of index scans. Principally in order to make kill_prior_tuple continue to work in something more or less like its current form. 
> However, I don't think we would necessarily have to relax the IAM pinning > rules, just to be able to do prefetching of more than one index leaf > page. To be clear, we already do relax the IAM pinning rules. Or at least nbtree selectively opts out, as I've gone into already. > Restricting prefetching to entries within a single leaf page obviously > has the disadvantage of not being able to benefit from concurrent IO whenever > crossing a leaf page boundary, but at the same time processing entries from > just two leaf pages would often allow for a sufficiently aggressive > prefetching. Pinning a small number of leaf pages instead of a single leaf > page shouldn't be a problem. You're probably right. I just don't see any need to solve that problem in v1. > One argument for loosening the tight coupling between kill_prior_tuples and > index scan progress is that the lack of kill_prior_tuples for bitmap scans is > quite problematic. I've seen numerous production issues with bitmap scans > caused by subsequent scans processing a growing set of dead tuples, where > plain index scans were substantially slower initially but didn't get much > slower over time. I've seen production issues like that too. No doubt it's a problem. > We might be able to design a system where the bitmap > contains a certain number of back-references to the index, allowing later > cleanup if there weren't any page splits or such. That does seem possible, but do you really want a design for index prefetching that relies on that massive enhancement (a total redesign of kill_prior_tuple) happening at some point in the not-too-distant future? Seems risky, from a project management point of view. This back-references idea seems rather complicated, especially if it needs to work with very large bitmap index scans. Since you'll still have the basic problem of TID recycle safety to deal with (even with an MVCC snapshot), you don't just have to revisit the leaf pages. 
You also have to revisit the corresponding heap pages (generally they'll be a lot more numerous than leaf pages). You'll have traded one problem for another (which is not to say that it's not a good trade-off). Right now the executor uses a amgettuple interface, and knows nothing about index related costs (e.g., pages accessed in any index, index qual costs). While the index AM has some limited understanding of heap access costs. So the index AM kinda knows a small bit about both types of costs (possibly not enough, but something). That informs the language I'm using to describe all this. To do something like your "back-references to the index" thing well, I think that you need more dynamic behavior around when you visit the heap to get heap tuples pointed to by TIDs from index pages (i.e. dynamic behavior that determines how many leaf pages to go before going to the heap to get pointed-to TIDs). That is basically what I meant by "put the index AM in control" -- it doesn't *strictly* require that the index AM actually do that. Just that a single piece of code has to have access to the full context, in order to make the right trade-offs around how both index and heap accesses are scheduled. > > I think that it makes sense to put the index AM in control here -- > > that almost follows from what I said about the index AM API. The index > > AM already needs to be in control, in about the same way, to deal with > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > Depending on what "control" means I'm doubtful: > > Imo there are decisions influencing prefetching that an index AM shouldn't > need to know about directly, e.g. how the plan shape influences how many > tuples are actually going to be consumed. Of course that determination could > be made in planner/executor and handed to IAMs, for the IAM to then "control" > the prefetching. I agree with all this. -- Peter Geoghegan
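The "give up if the leaf page's LSN changed" behavior described above can be modeled with a minimal sketch. This is a toy, assuming hypothetical names (ToyLeafPage, ToyScanPos, toy_killitems) and a fixed-size page; it is not the real _bt_killitems() and ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 8

/* Hypothetical, simplified stand-in for an index leaf page. */
typedef struct ToyLeafPage
{
    uint64_t lsn;                   /* changes whenever the page is modified */
    bool     dead[TOY_MAX_ITEMS];   /* LP_DEAD-style hint bit per item */
} ToyLeafPage;

/* Hypothetical stand-in for the scan's notion of the current leaf page. */
typedef struct ToyScanPos
{
    uint64_t lsn_when_read;         /* page LSN remembered when items were read */
    bool     pin_held;              /* was the pin kept the whole time? */
    size_t   nkilled;               /* number of entries in killed[] */
    size_t   killed[TOY_MAX_ITEMS]; /* item offsets found dead in the heap */
} ToyScanPos;

/*
 * Set hint bits for the remembered dead items and return how many were set.
 * If the pin was dropped and the page LSN no longer matches, the remembered
 * offsets may point at recycled TIDs, so give up and set nothing.
 */
size_t
toy_killitems(ToyLeafPage *page, const ToyScanPos *pos)
{
    size_t nset = 0;

    assert(pos->nkilled <= TOY_MAX_ITEMS);

    if (!pos->pin_held && page->lsn != pos->lsn_when_read)
        return 0;               /* modified while not pinned: hinting unsafe */

    for (size_t i = 0; i < pos->nkilled; i++)
    {
        size_t off = pos->killed[i];

        if (off < TOY_MAX_ITEMS && !page->dead[off])
        {
            page->dead[off] = true;
            nset++;
        }
    }
    return nset;
}
```

The all-or-nothing check is the downside mentioned above: one unrelated modification of the leaf page discards every pending hint for that page.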
On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I don't think it's just a bookkeeping problem. In a way, nbtree already
> does keep an array of tuples to kill (see btgettuple), but it's always
> for the current index page. So it's not that we immediately go and kill
> the prior tuple - nbtree already stashes it in an array, and kills all
> those tuples when moving to the next index page.
>
> The way I understand the problem is that with prefetching we're bound to
> determine the kill_prior_tuple flag with a delay, in which case we might
> have already moved to the next index page ...

Well... I'm not clear on all of the details of how this works, but this sounds broken to me, for the reasons that Peter G. mentions in his comments about desynchronization. If we currently have a rule that you hold a pin on the index page while processing the heap tuples it references, you can't just throw that out the window and expect things to keep working. Saying that kill_prior_tuple doesn't work when you throw that rule out the window is probably understating the extent of the problem very considerably.

I would have thought that the way this prefetching would work is that we would bring pages into shared_buffers sooner than we currently do, but not actually pin them until we're ready to use them, so that it's possible they might be evicted again before we get around to them, if we prefetch too far and the system is too busy. Alternately, it also seems OK to read those later pages and pin them right away, as long as (1) we don't also give up pins that we would have held in the absence of prefetching and (2) we have some mechanism for limiting the number of extra pins that we're holding to a reasonable number given the size of shared_buffers. However, it doesn't seem OK at all to give up pins that the current code holds sooner than the current code would do.

--
Robert Haas
EDB: http://www.enterprisedb.com
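Condition (2) above, capping extra pins relative to the size of shared_buffers, could look roughly like the following. This is only a toy model under stated assumptions: the function name toy_limit_extra_pins and its parameters are hypothetical, and the "fair share per backend" policy is an illustrative choice rather than PostgreSQL's actual buffer manager logic. It does nothing about condition (1).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: each backend may hold at most nbuffers / max_backends pins.
 * Given how many pins the backend already holds, return how many of the
 * `requested` extra pins it may take. At least one pin is always allowed
 * (when requested), so a scan can make progress even on a tiny buffer pool.
 */
uint32_t
toy_limit_extra_pins(uint32_t requested, uint32_t nbuffers,
                     uint32_t max_backends, uint32_t already_pinned)
{
    uint32_t fair_share;
    uint32_t allowed;

    assert(max_backends > 0);

    fair_share = nbuffers / max_backends;
    allowed = (fair_share > already_pinned) ? fair_share - already_pinned : 0;

    if (allowed == 0)
        allowed = 1;            /* never block the scan entirely */

    return (requested < allowed) ? requested : allowed;
}
```

With a large buffer pool the request passes through unchanged; with a small one the prefetch distance is effectively clamped by the pin budget.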
Hi,

On 2024-02-15 09:59:27 +0530, Robert Haas wrote:
> I would have thought that the way this prefetching would work is that
> we would bring pages into shared_buffers sooner than we currently do,
> but not actually pin them until we're ready to use them, so that it's
> possible they might be evicted again before we get around to them, if
> we prefetch too far and the system is too busy.

The issue here is that we need to read index leaf pages (synchronously for now!) to get the tids to do readahead of table data. What you describe is done for the table data (IMO not a good idea medium term [1]), but the problem at hand is that once we've done readahead for all the tids on one index page, we can't do more readahead without looking at the next index leaf page. Obviously that would lead to a sawtooth like IO pattern, where you'd regularly have to wait for IO for the first tuples referenced by an index leaf page.

However, if we want to issue table readahead for tids on the neighboring index leaf page, we'll - as the patch stands - not hold a pin on the "current" index leaf page. Which makes index prefetching as currently implemented incompatible with kill_prior_tuple, as that requires the index leaf page pin being held.

> Alternately, it also seems OK to read those later pages and pin them right
> away, as long as (1) we don't also give up pins that we would have held in
> the absence of prefetching and (2) we have some mechanism for limiting the
> number of extra pins that we're holding to a reasonable number given the
> size of shared_buffers.

FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently used to limit how many buffers a backend may pin for bulk relation extension.
Greetings,

Andres Freund

[1] The main reasons that I think that just doing readahead without keeping a pin is a bad idea, at least medium term, are:

a) To do AIO you need to hold a pin on the page while the IO is in progress, as the target buffer contents will be modified at some moment you don't control, so that buffer had better not be replaced while IO is in progress. So at the very least you need to hold a pin until the IO is over.

b) If you do not keep a pin until you actually use the page, you need to either do another buffer lookup (expensive!) or you need to remember the buffer id and revalidate that it's still pointing to the same block (cheaper, but still not cheap). That's not just bad because it's slow in an absolute sense, more importantly it increases the potential performance downside of doing readahead for fully cached workloads, because you don't gain anything, but pay the price of two lookups/revalidation.

Note that these reasons really just apply to cases where we read ahead because we are quite certain we'll need exactly those blocks (leaving errors or queries ending early aside), not for "heuristic" prefetching. If we e.g. were to issue prefetch requests for neighboring index pages while descending during an ordered index scan, without checking that we'll need those, it'd make sense to just do a "throwaway" prefetch request.
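The two options in (b), a full lookup versus remembering the buffer id and revalidating it, can be illustrated with a small sketch. Everything here is hypothetical (ToyBufferPool, toy_lookup, toy_find_prefetched), and a linear scan stands in for the real buffer mapping lookup; the point is only the control flow, not the cost model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUFFERS 4

/* Toy buffer pool: block[i] is the block number cached in slot i. */
typedef struct ToyBufferPool
{
    uint32_t block[TOY_NBUFFERS];
} ToyBufferPool;

/* "Expensive" path: search the whole pool. Returns the slot, or -1. */
int
toy_lookup(const ToyBufferPool *pool, uint32_t blkno)
{
    for (int i = 0; i < TOY_NBUFFERS; i++)
        if (pool->block[i] == blkno)
            return i;
    return -1;
}

/*
 * Find a previously prefetched block without having held a pin on it.
 * First revalidate the remembered slot (cheap); if the slot was reused for
 * another block in the meantime, fall back to a full lookup. *used_lookup
 * reports which path was taken. Returns the slot, or -1 if the block was
 * evicted and would have to be read again.
 */
int
toy_find_prefetched(const ToyBufferPool *pool, int remembered_slot,
                    uint32_t blkno, bool *used_lookup)
{
    assert(pool != NULL && used_lookup != NULL);

    if (remembered_slot >= 0 && remembered_slot < TOY_NBUFFERS &&
        pool->block[remembered_slot] == blkno)
    {
        *used_lookup = false;
        return remembered_slot;
    }

    *used_lookup = true;
    return toy_lookup(pool, blkno);
}
```

Even on the cheap path the block is touched twice (once at prefetch time, once at use time), which is the overhead for fully cached workloads that the footnote worries about.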
On Thu, Feb 15, 2024 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
> The issue here is that we need to read index leaf pages (synchronously for
> now!) to get the tids to do readahead of table data. What you describe is done
> for the table data (IMO not a good idea medium term [1]), but the problem at
> hand is that once we've done readahead for all the tids on one index page, we
> can't do more readahead without looking at the next index leaf page.

Oh, right.

> However, if we want to issue table readahead for tids on the neighboring index
> leaf page, we'll - as the patch stands - not hold a pin on the "current" index
> leaf page. Which makes index prefetching as currently implemented incompatible
> with kill_prior_tuple, as that requires the index leaf page pin being held.

But I think it probably also breaks MVCC, as Peter was saying.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2/15/24 00:06, Peter Geoghegan wrote:
> On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
>> ...
>
> 2. Are you sure that the leaf-page-at-a-time thing is such a huge
> hindrance to effective prefetching?
>
> I suppose that it might be much more important than I imagine it is
> right now, but it'd be nice to have something a bit more concrete to
> go on.
>

This probably depends on which corner cases are considered important.

The page-at-a-time approach essentially means index items at the beginning of the page won't get prefetched (or vice versa, prefetch distance drops to 0 when we get to end of index page).

That may be acceptable, considering we can usually fit 200+ index items on a single page. Even then it limits what effective_io_concurrency values are sensible, but in my experience the benefits quickly diminish past ~32.

> 3. Even if it is somewhat important, do you really need to get that
> part working in v1?
>
> Tomas' original prototype worked with the leaf-page-at-a-time thing,
> and that still seemed like a big improvement to me. While being less
> invasive, in effect. If we can agree that something like that
> represents a useful step in the right direction (not an evolutionary
> dead end), then we can make good incremental progress within a single
> release.
>

It certainly was a great improvement, no doubt about that. I dislike the restriction, but that's partially for aesthetic reasons - it just seems it'd be nice to not have this.

That being said, I'd be OK with having this restriction if it makes v1 feasible. For me, the big question is whether it'd mean we're stuck with this restriction forever, or whether there's a viable way to improve this in v2.

And I don't have an answer to that :-( I got completely lost in the ongoing discussion about the locking implications (which I happily ignored while working on the PoC patch), layering tensions and questions about which part should be "in control".
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
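The "prefetch distance drops to 0 at the end of the index page" effect is easy to see in a small sketch. The function name toy_prefetch_distance and its parameters are hypothetical; it assumes prefetching may only look at items remaining on the current leaf page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * With page-at-a-time prefetching, the scan can only look ahead at items
 * that are still on the current leaf page. Given the current position `pos`
 * (0-based), the number of items on the page and the target distance (e.g.
 * derived from effective_io_concurrency), return the usable distance.
 */
size_t
toy_prefetch_distance(size_t pos, size_t nitems, size_t target)
{
    size_t remaining;

    assert(nitems == 0 || pos < nitems);

    if (nitems == 0)
        return 0;

    remaining = nitems - 1 - pos;   /* items after the current one */
    return (remaining < target) ? remaining : target;
}
```

For a page with 200 items and a target of 32, the full distance is available for the first 168 positions and then shrinks linearly to 0 at the last item, which is where the stall at the page boundary comes from.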
On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 2/15/24 00:06, Peter Geoghegan wrote: > > I suppose that it might be much more important than I imagine it is > > right now, but it'd be nice to have something a bit more concrete to > > go on. > > > > This probably depends on which corner cases are considered important. > > The page-at-a-time approach essentially means index items at the > beginning of the page won't get prefetched (or vice versa, prefetch > distance drops to 0 when we get to end of index page). I don't think that's true. At least not for nbtree scans. As I went into last year, you'd get the benefit of the work I've done on "boundary cases" (most recently in commit c9c0589f from just a couple of months back), which helps us get the most out of suffix truncation. This maximizes the chances of only having to scan a single index leaf page in many important cases. So I can see no reason why index items at the beginning of the page are at any particular disadvantage (compared to those from the middle or the end of the page). Where you might have a problem is cases where it's just inherently necessary to visit more than a single leaf page, despite the best efforts of the nbtsplitloc.c logic -- cases where the scan just inherently needs to return tuples that "straddle the boundary between two neighboring pages". That isn't a particularly natural restriction, but it's also not obvious that it's all that much of a disadvantage in practice. > It certainly was a great improvement, no doubt about that. I dislike the > restriction, but that's partially for aesthetic reasons - it just seems > it'd be nice to not have this. > > That being said, I'd be OK with having this restriction if it makes v1 > feasible. For me, the big question is whether it'd mean we're stuck with > this restriction forever, or whether there's a viable way to improve > this in v2. 
I think that there is no question that this will need to not completely disable kill_prior_tuple -- I'd be surprised if one single person disagreed with me on this point. There is also a more nuanced way of describing this same restriction, but we don't necessarily need to agree on what exactly that is right now. > And I don't have answer to that :-( I got completely lost in the ongoing > discussion about the locking implications (which I happily ignored while > working on the PoC patch), layering tensions and questions which part > should be "in control". Honestly, I always thought that it made sense to do things on the index AM side. When you went the other way I was surprised. Perhaps I should have said more about that, sooner, but I'd already said quite a bit at that point, so... Anyway, I think that it's pretty clear that "naive desynchronization" is just not acceptable, because that'll disable kill_prior_tuple altogether. So you're going to have to do this in a way that more or less preserves something like the current kill_prior_tuple behavior. It's going to have some downsides, but those can be managed. They can be managed from within the index AM itself, a bit like the _bt_killitems() no-pin stuff does things already. Obviously this interpretation suggests that doing things at the index AM level is indeed the right way to go, layering-wise. Does it make sense to you, though? -- Peter Geoghegan
On 2/15/24 17:42, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 2/15/24 00:06, Peter Geoghegan wrote: >>> I suppose that it might be much more important than I imagine it is >>> right now, but it'd be nice to have something a bit more concrete to >>> go on. >>> >> >> This probably depends on which corner cases are considered important. >> >> The page-at-a-time approach essentially means index items at the >> beginning of the page won't get prefetched (or vice versa, prefetch >> distance drops to 0 when we get to end of index page). > > I don't think that's true. At least not for nbtree scans. > > As I went into last year, you'd get the benefit of the work I've done > on "boundary cases" (most recently in commit c9c0589f from just a > couple of months back), which helps us get the most out of suffix > truncation. This maximizes the chances of only having to scan a single > index leaf page in many important cases. So I can see no reason why > index items at the beginning of the page are at any particular > disadvantage (compared to those from the middle or the end of the > page). > I may be missing something, but it seems fairly self-evident to me an entry at the beginning of an index page won't get prefetched (assuming the page-at-a-time thing). If I understand your point about boundary cases / suffix truncation, that helps us by (a) picking the split in a way to minimize a single key spanning multiple pages, if possible and (b) increasing the number of entries that fit onto a single index page. That's certainly true / helpful, and it makes the "first entry" issue much less common. But the issue is still there. Of course, this says nothing about the importance of the issue - the impact may easily be so small it's not worth worrying about. 
> Where you might have a problem is cases where it's just inherently > necessary to visit more than a single leaf page, despite the best > efforts of the nbtsplitloc.c logic -- cases where the scan just > inherently needs to return tuples that "straddle the boundary between > two neighboring pages". That isn't a particularly natural restriction, > but it's also not obvious that it's all that much of a disadvantage in > practice. > One case I've been thinking about is sorting using index, where we often read large part of the index. >> It certainly was a great improvement, no doubt about that. I dislike the >> restriction, but that's partially for aesthetic reasons - it just seems >> it'd be nice to not have this. >> >> That being said, I'd be OK with having this restriction if it makes v1 >> feasible. For me, the big question is whether it'd mean we're stuck with >> this restriction forever, or whether there's a viable way to improve >> this in v2. > > I think that there is no question that this will need to not > completely disable kill_prior_tuple -- I'd be surprised if one single > person disagreed with me on this point. There is also a more nuanced > way of describing this same restriction, but we don't necessarily need > to agree on what exactly that is right now. > Even for the page-at-a-time approach? Or are you talking about the v2? >> And I don't have answer to that :-( I got completely lost in the ongoing >> discussion about the locking implications (which I happily ignored while >> working on the PoC patch), layering tensions and questions which part >> should be "in control". > > Honestly, I always thought that it made sense to do things on the > index AM side. When you went the other way I was surprised. Perhaps I > should have said more about that, sooner, but I'd already said quite a > bit at that point, so... 
> > Anyway, I think that it's pretty clear that "naive desynchronization" > is just not acceptable, because that'll disable kill_prior_tuple > altogether. So you're going to have to do this in a way that more or > less preserves something like the current kill_prior_tuple behavior. > It's going to have some downsides, but those can be managed. They can > be managed from within the index AM itself, a bit like the > _bt_killitems() no-pin stuff does things already. > > Obviously this interpretation suggests that doing things at the index > AM level is indeed the right way to go, layering-wise. Does it make > sense to you, though? > Yeah. The basic idea was that by moving this above index AM it will work for all indexes automatically - but given the current discussion about kill_prior_tuple, locking etc. I'm not sure that's really feasible. The index AM clearly needs to have more control over this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I may be missing something, but it seems fairly self-evident to me an > entry at the beginning of an index page won't get prefetched (assuming > the page-at-a-time thing). Sure, if the first item on the page is also the first item that we need the scan to return (having just descended the tree), then it won't get prefetched under a scheme that sticks with the current page-at-a-time behavior (at least in v1). Just like when the first item that we need the scan to return is from the middle of the page, or more towards the end of the page. It is of course also true that we can't prefetch the next page's first item until we actually visit the next page -- clearly that's suboptimal. Just like we can't prefetch any other, later tuples from the next page (until such time as we have determined for sure that there really will be a next page, and have called _bt_readpage for that next page.) This is why I don't think that the tuples with lower page offset numbers are in any way significant here. The significant part is whether or not you'll actually need to visit more than one leaf page in the first place (plus the penalty from not being able to reorder the work across page boundaries in your initial v1 of prefetching). > If I understand your point about boundary cases / suffix truncation, > that helps us by (a) picking the split in a way to minimize a single key > spanning multiple pages, if possible and (b) increasing the number of > entries that fit onto a single index page. More like it makes the boundaries between leaf pages (i.e. high keys) align with the "natural boundaries of the key space". Simple point queries should practically never require more than a single leaf page access as a result. 
Even somewhat complicated index scans that are reasonably selective (think tens to low hundreds of matches) don't tend to need to read more than a single leaf page match, at least with equality type scan keys for the index qual. > That's certainly true / helpful, and it makes the "first entry" issue > much less common. But the issue is still there. Of course, this says > nothing about the importance of the issue - the impact may easily be so > small it's not worth worrying about. Right. And I want to be clear: I'm really *not* sure how much it matters. I just doubt that it's worth worrying about in v1 -- time grows short. Although I agree that we should commit a v1 that leaves the door open to improving matters in this area in v2. > One case I've been thinking about is sorting using index, where we often > read large part of the index. That definitely seems like a case where reordering work/desynchronization of the heap and index scans might be relatively important. > > I think that there is no question that this will need to not > > completely disable kill_prior_tuple -- I'd be surprised if one single > > person disagreed with me on this point. There is also a more nuanced > > way of describing this same restriction, but we don't necessarily need > > to agree on what exactly that is right now. > > > > Even for the page-at-a-time approach? Or are you talking about the v2? I meant that the current kill_prior_tuple behavior isn't sacred, and can be revised in v2, for the benefit of lifting the restriction on prefetching. But that's going to involve a trade-off of some kind. And not a particularly simple one. > Yeah. The basic idea was that by moving this above index AM it will work > for all indexes automatically - but given the current discussion about > kill_prior_tuple, locking etc. I'm not sure that's really feasible. > > The index AM clearly needs to have more control over this. Cool. I think that that makes the layering question a lot clearer, then. 
-- Peter Geoghegan
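The point that what matters is how many leaf pages a scan has to visit, not where on a page it starts, can be put into a small sketch. It assumes a hypothetical helper toy_leaf_pages_visited, densely packed matches, and a fixed number of items per leaf page; under page-at-a-time prefetching each visited leaf page then contributes one heap fetch that could not be prefetched.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of leaf pages a scan touches when its first match sits at
 * `first_offset` (0-based) on the first page and `nmatches` consecutive
 * items are returned, with `items_per_page` items per leaf page.
 */
size_t
toy_leaf_pages_visited(size_t first_offset, size_t nmatches,
                       size_t items_per_page)
{
    assert(items_per_page > 0 && first_offset < items_per_page);

    if (nmatches == 0)
        return 0;

    /* ceil((first_offset + nmatches) / items_per_page) */
    return (first_offset + nmatches + items_per_page - 1) / items_per_page;
}
```

A scan returning 150 items from the start of a 200-item page stays on one page, while a scan returning 100 items starting at offset 150 straddles a boundary and pays the un-prefetched first fetch twice.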
Hi,

On 2024-02-15 12:53:10 -0500, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > I may be missing something, but it seems fairly self-evident to me an
> > entry at the beginning of an index page won't get prefetched (assuming
> > the page-at-a-time thing).
>
> Sure, if the first item on the page is also the first item that we
> need the scan to return (having just descended the tree), then it
> won't get prefetched under a scheme that sticks with the current
> page-at-a-time behavior (at least in v1). Just like when the first
> item that we need the scan to return is from the middle of the page,
> or more towards the end of the page.
>
> It is of course also true that we can't prefetch the next page's
> first item until we actually visit the next page -- clearly that's
> suboptimal. Just like we can't prefetch any other, later tuples from
> the next page (until such time as we have determined for sure that
> there really will be a next page, and have called _bt_readpage for
> that next page.)
>
> This is why I don't think that the tuples with lower page offset
> numbers are in any way significant here. The significant part is
> whether or not you'll actually need to visit more than one leaf page
> in the first place (plus the penalty from not being able to reorder
> the work across page boundaries in your initial v1 of prefetching).

To me your phrasing just seems to reformulate the issue.

In practical terms you'll have to wait for the full IO latency when fetching the table tuple corresponding to the first tid on a leaf page. Of course that's also the moment you had to visit another leaf page. Whether the stall is due to visiting another leaf page or due to processing the first entry on such a leaf page is a distinction without a difference.

> > That's certainly true / helpful, and it makes the "first entry" issue
> > much less common. But the issue is still there.
> > Of course, this says nothing about the importance of the issue - the
> > impact may easily be so small it's not worth worrying about.
>
> Right. And I want to be clear: I'm really *not* sure how much it
> matters. I just doubt that it's worth worrying about in v1 -- time
> grows short. Although I agree that we should commit a v1 that leaves
> the door open to improving matters in this area in v2.

I somewhat doubt that it's realistic to aim for 17 at this point. We seem to still be doing fairly fundamental architectural work. I think it might be the right thing even for 18 to go for the simpler only-a-single-leaf-page approach though.

I wonder if there are prerequisites that can be tackled for 17. One idea is to work on infrastructure to provide executor nodes with information about the number of tuples likely to be fetched - I suspect we'll trigger regressions without that in place.

One way to *sometimes* process more than a single leaf page, without having to redesign kill_prior_tuple, would be to use the visibilitymap to check if the target pages are all-visible. If all the table pages referenced by a leaf page are all-visible, we know that we don't need to kill index entries, and thus can move on to the next leaf page.

Greetings,

Andres Freund
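The visibility map idea above might be sketched as follows. This is a toy under stated assumptions: the names toy_vm_all_visible and toy_leaf_can_skip_kill are hypothetical, and the map uses one bit per heap block, unlike PostgreSQL's real visibility map, which is accessed through its own API and also tracks all-frozen status. It also ignores that a bit can be cleared concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy visibility map: bit N set means heap block N is all-visible.
 * Blocks beyond the map are treated as not all-visible (conservative).
 */
bool
toy_vm_all_visible(const uint8_t *vm, size_t vm_nblocks, uint32_t blkno)
{
    assert(vm != NULL);

    if (blkno >= vm_nblocks)
        return false;
    return ((vm[blkno / 8] >> (blkno % 8)) & 1) != 0;
}

/*
 * Given the heap block numbers of all TIDs on one leaf page, decide whether
 * the scan could move on to the next leaf page without worrying about
 * kill_prior_tuple: if every referenced heap page is all-visible, no index
 * entry on this leaf page can turn out to be dead.
 */
bool
toy_leaf_can_skip_kill(const uint8_t *vm, size_t vm_nblocks,
                       const uint32_t *heap_blocks, size_t nitems)
{
    for (size_t i = 0; i < nitems; i++)
        if (!toy_vm_all_visible(vm, vm_nblocks, heap_blocks[i]))
            return false;
    return true;
}
```

A single not-all-visible heap page is enough to fall back to the single-leaf-page behavior, so this only helps "sometimes", as stated above.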
On Thu, Feb 15, 2024 at 3:13 PM Andres Freund <andres@anarazel.de> wrote: > > This is why I don't think that the tuples with lower page offset > > numbers are in any way significant here. The significant part is > > whether or not you'll actually need to visit more than one leaf page > > in the first place (plus the penalty from not being able to reorder > > the work across page boundaries in your initial v1 of prefetching). > > To me this your phrasing just seems to reformulate the issue. What I said to Tomas seems very obvious to me. I think that there might have been some kind of miscommunication (not a real disagreement). I was just trying to work through that. > In practical terms you'll have to wait for the full IO latency when fetching > the table tuple corresponding to the first tid on a leaf page. Of course > that's also the moment you had to visit another leaf page. Whether the stall > is due to visit another leaf page or due to processing the first entry on such > a leaf page is a distinction without a difference. I don't think anybody said otherwise? > > > That's certainly true / helpful, and it makes the "first entry" issue > > > much less common. But the issue is still there. Of course, this says > > > nothing about the importance of the issue - the impact may easily be so > > > small it's not worth worrying about. > > > > Right. And I want to be clear: I'm really *not* sure how much it > > matters. I just doubt that it's worth worrying about in v1 -- time > > grows short. Although I agree that we should commit a v1 that leaves > > the door open to improving matters in this area in v2. > > I somewhat doubt that it's realistic to aim for 17 at this point. That's a fair point. Tomas? > We seem to > still be doing fairly fundamental architectual work. I think it might be the > right thing even for 18 to go for the simpler only-a-single-leaf-page > approach though. I definitely think it's a good idea to have that as a fall back option. 
And to not commit ourselves to having something better than that for v1 (though we probably should commit to making that possible in v2). > I wonder if there are prerequisites that can be tackled for 17. One idea is to > work on infrastructure to provide executor nodes with information about the > number of tuples likely to be fetched - I suspect we'll trigger regressions > without that in place. I don't think that there'll be regressions if we just take the simpler only-a-single-leaf-page approach. At least it seems much less likely. > One way to *sometimes* process more than a single leaf page, without having to > redesign kill_prior_tuple, would be to use the visibilitymap to check if the > target pages are all-visible. If all the table pages on a leaf page are > all-visible, we know that we don't need to kill index entries, and thus can > move on to the next leaf page It's possible that we'll need a variety of different strategies. nbtree already has two such strategies in _bt_killitems(), in a way. Though its "Modified while not pinned means hinting is not safe" path (LSN doesn't match canary value path) seems pretty naive. The prefetching stuff might present us with a good opportunity to replace that with something fundamentally better. -- Peter Geoghegan
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
[
>
> (1) Melanie actually presented a very different way to implement this,
> relying on the StreamingRead API. So chances are this struct won't
> actually be used.

Given the lots of effort already spent on this, and the fact that this thread is actually two:

a. index/table prefetching, from Jun 2023 till ~Jan 2024
b. afterwards, index/table prefetching with the Streaming API, but there are some doubts about whether it could happen for v17 [1]

... it would be a pity not to benefit from such work (even if the Streaming API wouldn't be ready for this; although there's lots of movement in the area), so I've played a little with the earlier implementation from [2], without the streaming API, as it already received feedback, demonstrated big benefits, and earlier got attention at the pgcon unconference. Perhaps some of these comments might be passed later to the "b" patch (once that's feasible):

1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does not apply cleanly anymore, due to show_buffer_usage() being quite recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20:

patching file src/backend/commands/explain.c
Hunk #1 FAILED at 3568.
Hunk #2 FAILED at 3679.
2 out of 2 hunks FAILED -- saving rejects to file src/backend/commands/explain.c.rej

2. v2 applies (fixup), but it would be nice to see that integrated into the main patch (it adds IndexOnlyPrefetchInfo), as one patch.

3. execMain.c:

+ * XXX It might be possible to improve the prefetching code to handle this
+ * by "walking back" the TID queue, but it's not clear if it's worth it.

Shouldn't we just remove the XXX? The walking-back seems to be niche, and so are fetches using cursors, when looking at real-world user queries? (support-case bias here, from looking at people's pg_stat_activity)

4. Wouldn't it be better to leave PREFETCH_LRU_SIZE static at 8, but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? (allowing it to follow dynamically; the more prefetches the user wants to perform, the more you spread them across shared LRUs and the more memory for history is required?)

+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */
+#define PREFETCH_LRU_COUNT 128 /* number of LRUs */
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)

BTW:
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
that's a duplicated "fraction" word over there.

5.
+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?

I wouldn't care: we are optimizing I/O (and context-switching), which weighs much more than the impact of memory access direction, and Dilipi earlier also expressed no concern, so maybe it could also be removed (one less "XXX" to care about).

6. in IndexPrefetchFillQueue()

+ while (!PREFETCH_QUEUE_FULL(prefetch))
+ {
+     IndexPrefetchEntry *entry
+     = prefetch->next_cb(scan, direction, prefetch->data);

If we are at it... that's a strange split, and the assignment is not indented :^)

7. in IndexPrefetchComputeTarget()

+ * XXX We cap the target to plan_rows, becausse it's pointless to prefetch
+ * more than we expect to use.

That's a nice fact that's already in the patch, so the XXX isn't needed?

8.
+ * XXX Maybe we should reduce the value with parallel workers?

I was assuming it could be a good idea, but the same (eic/actual_parallel_works_per_gather) doesn't seem to be performed for bitmap heap scan prefetches, so no?

9.
+ /*
+  * No prefetching for direct I/O.
+  *
+  * XXX Shouldn't we do prefetching even for direct I/O? We would only
+  * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+  * once the code starts loading into shared buffers, that'd work.
+  */
+ if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+     return 0;

It's redundant (?) and could be removed, as PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571:

#ifdef USE_PREFETCH
        /*
         * Try to initiate an asynchronous read. This returns false in
         * recovery if the relation file doesn't exist.
         */
        if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
            smgrprefetch(smgr_reln, forkNum, blockNum, 1))
        {
            result.initiated_io = true;
        }
#endif                          /* USE_PREFETCH */

11. in IndexPrefetchStats() and ExecReScanIndexScan()

+ * FIXME Should be only in debug builds, or something like that.

+ /* XXX Print some debug stats. Should be removed. */
+ IndexPrefetchStats(indexScanDesc, node->iss_prefetch);

Hmm, but it could be useful in tuning real-world systems, no? E.g. the recovery prefetcher gives some info through the pg_stat_recovery_prefetch view, but e.g. bitmap heap scans do not provide us with anything at all. I don't have a strong opinion. Exposing such stuff would take away your main doubt (XXX) from execPrefetch.c "auto-tuning/self-adjustment". And if we are at it, we could think in the far future about adding a new session GUC track_cachestat or EXPLAIN (cachestat/prefetch, analyze) (this new syscall for Linux >= 6.5) where we could present both index stats (as what IndexPrefetchStats() does) *and* cachestat() results for interested users. Of course it would have to be generic enough for the bitmap heap scan case too. Such insight would also allow fine tuning eic, PREFETCH_LRU_COUNT, PREFETCH_QUEUE_HISTORY. Just an idea.

12.

+ * XXX Maybe we should reduce the target in case this is a parallel index
+ * scan. We don't want to issue a multiple of effective_io_concurrency.

in IndexOnlyPrefetchCleanup() and IndexNext()

+ * XXX Maybe we should reduce the value with parallel workers?

It's a redundant XXX comment (there are two for the same thing), as it was already there just before IndexPrefetchComputeTarget().

13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH; maybe it would make sense to follow that pattern for consistency, to avoid adding the implementation on platforms without prefetching?

14. The patch is missing documentation, so how about just this?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2527,7 +2527,8 @@ include_dir 'conf.d'
 operations that any individual <productname>PostgreSQL</productname> session
 attempts to initiate in parallel. The allowed range is 1 to 1000,
 or zero to disable issuance of asynchronous I/O requests. Currently,
- this setting only affects bitmap heap scans.
+ this setting only enables prefetching for HEAP data blocks when performing
+ bitmap heap scans and index (only) scans.
 </para>

Some further tests, given data:

CREATE TABLE test (id bigint, val bigint, str text);
ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), 3000) FROM generate_series(1, 10000) g;
-- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + (10*random())::int), 3000) from (select 10000 * random() as r from generate_series(1, 10000)) x;
VACUUM ANALYZE test;
CREATE INDEX on test (id) ;

1. the patch correctly detects sequential access (e.g. we issue up to 6 fadvise() syscalls (8kB each) and 17 preads() to the heap fd for a query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;` -- the offsets of the fadvise calls and preads match), so that's good.

2. Prefetching for TOASTed heap seems to be not implemented at all, correct?
(Is my assumption right that we should go like this: t_index->t->toast_idx->toast_heap?) I'm too much of a newbie to actually see the code path where it could be added. It's certainly not a blocker, but maybe the commit message could list improvements for the future?

2024-02-29 11:45:14.259 CET [11098] LOG: index prefetch stats: requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973
2024-02-29 11:45:14.259 CET [11098] STATEMENT: SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192
pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 40960) = 8192
pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192
pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192
pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192
[..no fadvises for fd=50 which was pg_toast_rel..]

3. I'm not sure if I got good-enough results for a DESCending index `create index on test (id DESC);` - with eic=16 it doesn't seem to be able to prefetch 16 blocks in advance? (e.g. highlight offset 557056 below in some text editor; the distance between that fadvise<->pread is far lower):

pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192
fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192
fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192
fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192
fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192

-Jakub Wartak.

[1] - https://www.postgresql.org/message-id/20240215201337.7amzw3hpvng7wphb%40awork3.anarazel.de
[2] - https://www.postgresql.org/message-id/777e981c-bf0c-4eb9-a9e0-42d677e94327%40enterprisedb.com
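The fadvise<->pread distance in the DESC-index trace above can be measured mechanically rather than by eye. The sketch below assumes one particular definition of "distance" (the number of other pread calls issued between the fadvise for an offset and the pread of that same offset); that definition, and the names `TraceEvent`, `desc_scan_trace` and `prefetch_distance()`, are illustrative assumptions and not something the patch defines. The offsets are copied from the fd=45 lines of the strace excerpt.

```c
#include <stddef.h>

typedef enum { TRACE_FADVISE, TRACE_PREAD } TraceKind;

typedef struct TraceEvent
{
    TraceKind kind;
    long      offset;           /* file offset in bytes */
} TraceEvent;

/* Events for fd=45, in the order they appear in the strace excerpt. */
const TraceEvent desc_scan_trace[] = {
    {TRACE_PREAD, 0},
    {TRACE_FADVISE, 417792},
    {TRACE_PREAD, 417792},
    {TRACE_FADVISE, 671744},
    {TRACE_FADVISE, 237568},
    {TRACE_PREAD, 671744},
    {TRACE_FADVISE, 491520},
    {TRACE_FADVISE, 360448},
    {TRACE_PREAD, 237568},
    {TRACE_FADVISE, 557056},
    {TRACE_FADVISE, 106496},
    {TRACE_PREAD, 491520},
    {TRACE_FADVISE, 401408},
    {TRACE_FADVISE, 335872},
    {TRACE_PREAD, 360448},
    {TRACE_FADVISE, 524288},
    {TRACE_FADVISE, 352256},
    {TRACE_PREAD, 557056},
};
const size_t desc_scan_trace_len =
    sizeof(desc_scan_trace) / sizeof(desc_scan_trace[0]);

/*
 * Count the pread calls for *other* offsets that happen between the
 * fadvise of "offset" and the pread of "offset".  Returns -1 if the offset
 * was never advised, or was advised but not read within the trace.
 */
int
prefetch_distance(const TraceEvent *events, size_t nevents, long offset)
{
    int reads_between = 0;
    int advised = 0;

    for (size_t i = 0; i < nevents; i++)
    {
        const TraceEvent *ev = &events[i];

        if (!advised)
        {
            /* still looking for the fadvise that starts the measurement */
            if (ev->kind == TRACE_FADVISE && ev->offset == offset)
                advised = 1;
            continue;
        }
        if (ev->kind == TRACE_PREAD)
        {
            if (ev->offset == offset)
                return reads_between;
            reads_between++;
        }
    }
    return -1;
}
```

With this definition, `prefetch_distance(desc_scan_trace, desc_scan_trace_len, 557056)` returns 2, i.e. only two other reads separate the hint from the read, which is consistent with the observation that the distance is well below eic=16 in this excerpt.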
Hi, Thanks for looking at the patch! On 3/1/24 09:20, Jakub Wartak wrote: > On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > [ >> >> (1) Melanie actually presented a very different way to implement this, >> relying on the StreamingRead API. So chances are this struct won't >> actually be used. > > Given lots of effort already spent on this and the fact that is thread > is actually two: > > a. index/table prefetching since Jun 2023 till ~Jan 2024 > b. afterwards index/table prefetching with Streaming API, but there > are some doubts of whether it could happen for v17 [1] > > ... it would be pitty to not take benefits of such work (even if > Streaming API wouldn't be ready for this; although there's lots of > movement in the area), so I've played a little with with the earlier > implementation from [2] without streaming API as it already received > feedback, it demonstrated big benefits, and earlier it got attention > on pgcon unconference. Perhaps, some of those comment might be passed > later to the "b"-patch (once that's feasible): > TBH I don't have a clear idea what to do. It'd be cool to have at least some benefits in v17, but I don't know how to do that in a way that would be useful in the future. For example, the v20240124 patch implements this in the executor, but based on the recent discussions it seems that's not the right layer - the index AM needs to have some control, and I'm not convinced it's possible to improve it in that direction (even ignoring the various issues we identified in the executor-based approach). I think it might be more practical to do this from the index AM, even if it has various limitations. Ironically, that's what I proposed at pgcon, but mostly because it was the quick&dirty way to do this. > 1. 
v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does > not apply cleanly anymore, due show_buffer_usage() being quite > recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 : > > patching file src/backend/commands/explain.c > Hunk #1 FAILED at 3568. > Hunk #2 FAILED at 3679. > 2 out of 2 hunks FAILED -- saving rejects to file > src/backend/commands/explain.c.rej > > 2. v2 applies (fixup), but it would nice to see that integrated into > main patch (it adds IndexOnlyPrefetchInfo) into one patch > Yeah, but I think it was an old patch version, no point in rebasing that forever. Also, I'm not really convinced the executor-level approach is the right path forward. > 3. execMain.c : > > + * XXX It might be possible to improve the prefetching code > to handle this > + * by "walking back" the TID queue, but it's not clear if > it's worth it. > > Shouldn't we just remove the XXX? The walking-back seems to be niche > so are fetches using cursors when looking at real world users queries > ? (support cases bias here when looking at peopel's pg_stat_activity) > > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8, > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? > (allowing it to follow dynamically; the more prefetches the user wants > to perform, the more you spread them across shared LRUs and the more > memory for history is required?) > > + * XXX Maybe we could consider effective_cache_size when sizing the cache? > + * Not to size the cache for that, ofc, but maybe as a guidance of how many > + * heap pages it might keep. Maybe just a fraction fraction of the value, > + * say Max(8MB, effective_cache_size / max_connections) or something. > + */ > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */ > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */ > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * > PREFETCH_LRU_COUNT) > I don't see why would this be related to effective_io_concurrency? 
It's merely about how many recently accessed pages we expect to find in the page cache. It's entirely separate from the prefetch distance. > BTW: > + * heap pages it might keep. Maybe just a fraction fraction of the value, > that's a duplicated "fraction" word over there. > > 5. > + * XXX Could it be harmful that we read the queue backwards? > Maybe memory > + * prefetching works better for the forward direction? > > I wouldn't care, we are optimizing I/O (and context-switching) which > weighs much more than memory access direction impact and Dilipi > earlier also expressed no concern, so maybe it could be also removed > (one less "XXX" to care about) > Yeah, I think it's negligible. Probably a microoptimization we can investigate later, I don't want to complicate the code unnecessarily. > 6. in IndexPrefetchFillQueue() > > + while (!PREFETCH_QUEUE_FULL(prefetch)) > + { > + IndexPrefetchEntry *entry > + = prefetch->next_cb(scan, direction, prefetch->data); > > If we are at it... that's a strange split and assignment not indented :^) > > 7. in IndexPrefetchComputeTarget() > > + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch > + * more than we expect to use. > > That's a nice fact that's already in patch, so XXX isn't needed? > Right, which is why it's not a TODO/FIXME. But I think it's good to point this out - I'm not 100% convinced we should be using plan_rows like this (because what happens if the estimate happens to be wrong?). > 8. > + * XXX Maybe we should reduce the value with parallel workers? > > I was assuming it could be a good idea, but the same doesn't seem > (eic/actual_parallel_works_per_gather) to be performed for bitmap heap > scan prefetches, so no? > Yeah, if we don't do that now, I'm not sure this patch should change that behavior. > 9. > + /* > + * No prefetching for direct I/O. > + * > + * XXX Shouldn't we do prefetching even for direct I/O? 
We would only > + * pretend doing it now, ofc, because we'd not do posix_fadvise(), but > + * once the code starts loading into shared buffers, that'd work. > + */ > + if ((io_direct_flags & IO_DIRECT_DATA) != 0) > + return 0; > > It's redundant (?) and could be removed as > PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571: > > 5 #ifdef USE_PREFETCH > 4 │ │ /* > 3 │ │ │* Try to initiate an asynchronous read. This > returns false in > 2 │ │ │* recovery if the relation file doesn't exist. > 1 │ │ │*/ > 571 │ │ if ((io_direct_flags & IO_DIRECT_DATA) == 0 && > 1 │ │ │ smgrprefetch(smgr_reln, forkNum, blockNum, 1)) > 2 │ │ { > 3 │ │ │ result.initiated_io = true; > 4 │ │ } > 5 #endif> > > > > > > /* USE_PREFETCH */ > Yeah, I think it might be redundant. I think it allowed skipping a bunch things without prefetching (like initialization of the prefetcher), but after the reworks that's no longer true. > 11. in IndexPrefetchStats() and ExecReScanIndexScan() > > + * FIXME Should be only in debug builds, or something like that. > > + /* XXX Print some debug stats. Should be removed. */ > + IndexPrefetchStats(indexScanDesc, node->iss_prefetch); > > Hmm, but it could be useful in tuning the real world systems, no? E.g. > recovery prefetcher gives some info through pg_stat_recovery_prefetch > view, but e.g. bitmap heap scans do not provide us with anything at > all. I don't have a strong opinion. Exposing such stuff would take > away your main doubt (XXX) from execPrefetch.c You're right it'd be good to collect/expose such statistics, to help with monitoring/tuning, etc. But I think there are better / more convenient ways to do this - exposing that in EXPLAIN, and adding a counter to pgstat_all_tables / pgstat_all_indexes. > ``auto-tuning/self-adjustment". 
And if we are at it, we could think in > far future about adding new session GUC track_cachestat or EXPLAIN > (cachestat/prefetch, analyze) (this new syscall for Linux >= 6.5) > where we could present both index stats (as what IndexPrefetchStats() > does) *and* cachestat() results there for interested users. Of course > it would have to be generic enough for the bitmap heap scan case too. > Such insight would also allow fine tuning eic, PREFETCH_LRU_COUNT, > PREFETCH_QUEUE_HISTORY. Just an idea. > I haven't really thought about this, but I agree some auto-tuning would be very helpful (assuming it's sufficiently reliable). > 12. > > + * XXX Maybe we should reduce the target in case this is > a parallel index > + * scan. We don't want to issue a multiple of > effective_io_concurrency. > > in IndexOnlyPrefetchCleanup() and IndexNext() > > + * XXX Maybe we should reduce the value with parallel workers? > > It's redundant XXX-comment (there are two for the same), as you it was > already there just before IndexPrefetchComputeTarget() > > 13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH, maybe > it would make some sense to follow the consistency pattern , to avoid > adding implementation on platforms without prefetching ? > Perhaps, but I'm not sure how to do that with the executor-based approach, where essentially everything goes through the prefetch queue (except that the prefetch distance is 0). So the amount of code that would be disabled by the ifdef would be tiny. > 14. The patch is missing documentation, so how about just this? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2527,7 +2527,8 @@ include_dir 'conf.d' > operations that any individual > <productname>PostgreSQL</productname> session > attempts to initiate in parallel. The allowed range is 1 to 1000, > or zero to disable issuance of asynchronous I/O requests. Currently, > - this setting only affects bitmap heap scans. 
> + this setting only enables prefetching for HEAP data blocks > when performing > + bitmap heap scans and index (only) scans. > </para> > > Some further tests, given data: > > CREATE TABLE test (id bigint, val bigint, str text); > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL; > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), > 3000) FROM generate_series(1, 10000) g; > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + > (10*random())::int), 3000) from (select 10000 * random() as r from > generate_series(1, 10000)) x; > VACUUM ANALYZE test; > CREATE INDEX on test (id) ; > It's not clear to me what's the purpose of this test? Can you explain? > 1. the patch correctly detects sequential access (e.g. we issue up to > 6 fadvise() syscalls (8kB each) out and 17 preads() to heap fd for > query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;` > -- offset of fadvise calls and pread match), so that's good. > > 2. Prefetching for TOASTed heap seems to be not implemented at all, > correct? (Is my assumption that we should go like this: > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually > see the code path where it could be added - certainly it's not blocker > -- but maybe in commit message a list of improvements for future could > be listed?): > Yes, that's true. I haven't thought about TOAST very much, but with prefetching happening in executor, that does not work. There'd need to be some extra code for TOAST prefetching. I'm not sure how beneficial that would be, considering most TOAST values tend to be stored on consecutive heap pages. 
> 2024-02-29 11:45:14.259 CET [11098] LOG: index prefetch stats: > requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973 > 2024-02-29 11:45:14.259 CET [11098] STATEMENT: SELECT > md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000; > > fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192 > pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 > \0\320\237 \0"..., 8192, 40960) = 8192 > pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192 > pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192 > pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192 > [..no fadvises for fd=50 which was pg_toast_rel..] > > 3. I'm not sure if I got good-enough results for DESCending index > `create index on test (id DESC);`- with eic=16 it doesnt seem to be > be able prefetch 16 blocks in advance? (e.g. 
highlight offset 557056 > below in some text editor and it's distance is far lower between that > fadvise<->pread): > > pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192 > fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192 > fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192 > fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192 > fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192 > fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192 > fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192 > I'm not sure I understand these strace snippets. Can you elaborate a bit, explain what the strace log says? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/15/24 21:30, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 3:13 PM Andres Freund <andres@anarazel.de> wrote: >>> This is why I don't think that the tuples with lower page offset >>> numbers are in any way significant here. The significant part is >>> whether or not you'll actually need to visit more than one leaf page >>> in the first place (plus the penalty from not being able to reorder >>> the work across page boundaries in your initial v1 of prefetching). >> >> To me this your phrasing just seems to reformulate the issue. > > What I said to Tomas seems very obvious to me. I think that there > might have been some kind of miscommunication (not a real > disagreement). I was just trying to work through that. > >> In practical terms you'll have to wait for the full IO latency when fetching >> the table tuple corresponding to the first tid on a leaf page. Of course >> that's also the moment you had to visit another leaf page. Whether the stall >> is due to visit another leaf page or due to processing the first entry on such >> a leaf page is a distinction without a difference. > > I don't think anybody said otherwise? > >>>> That's certainly true / helpful, and it makes the "first entry" issue >>>> much less common. But the issue is still there. Of course, this says >>>> nothing about the importance of the issue - the impact may easily be so >>>> small it's not worth worrying about. >>> >>> Right. And I want to be clear: I'm really *not* sure how much it >>> matters. I just doubt that it's worth worrying about in v1 -- time >>> grows short. Although I agree that we should commit a v1 that leaves >>> the door open to improving matters in this area in v2. >> >> I somewhat doubt that it's realistic to aim for 17 at this point. > > That's a fair point. Tomas? > I think that's a fair assessment. To me it seems doing the prefetching solely at the executor level is not really workable. 
And even if it can be made to work, there are far too many open questions to do that in the last commitfest. I think the consensus is at least some of the logic/control needs to move back to the index AM. Maybe there's some minimal part that we could do for v17, even if it has various limitations, and then improve that in v18. Say, doing the leaf-page-at-a-time and passing a little bit of information from the index scan to drive this. But I have a very hard time figuring out what the MVP version should be, because I have very limited understanding of how much control the index AM ought to have :-( And it'd be a bit silly to do something in v17, only to have to rip it out in v18 because it turned out to not get the split right. >> We seem to >> still be doing fairly fundamental architectual work. I think it might be the >> right thing even for 18 to go for the simpler only-a-single-leaf-page >> approach though. > > I definitely think it's a good idea to have that as a fall back > option. And to not commit ourselves to having something better than > that for v1 (though we probably should commit to making that possible > in v2). > Yeah, I agree with that. >> I wonder if there are prerequisites that can be tackled for 17. One idea is to >> work on infrastructure to provide executor nodes with information about the >> number of tuples likely to be fetched - I suspect we'll trigger regressions >> without that in place. > > I don't think that there'll be regressions if we just take the simpler > only-a-single-leaf-page approach. At least it seems much less likely. > I'm sure we could pass additional information from the index scans to improve that further. But I think the gradual ramp-up would deal with most regressions. At least that's my experience from benchmarking the early version. The hard thing is what to do about cases where neither of these helps.
The example I keep thinking about is IOS - if we don't do prefetching, it's not hard to construct cases where regular index scan gets much faster than IOS (with many not-all-visible pages). But we can't just prefetch all pages, because that'd hurt IOS cases with most pages fully visible (when we don't need to actually access the heap). I managed to deal with this in the executor-level version, but I'm not sure how to do this if the control moves closer to the index AM. >> One way to *sometimes* process more than a single leaf page, without having to >> redesign kill_prior_tuple, would be to use the visibilitymap to check if the >> target pages are all-visible. If all the table pages on a leaf page are >> all-visible, we know that we don't need to kill index entries, and thus can >> move on to the next leaf page > > It's possible that we'll need a variety of different strategies. > nbtree already has two such strategies in _bt_killitems(), in a way. > Though its "Modified while not pinned means hinting is not safe" path > (LSN doesn't match canary value path) seems pretty naive. The > prefetching stuff might present us with a good opportunity to replace > that with something fundamentally better. > No opinion. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
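The "gradual ramp-up" mentioned above can be pictured with a small sketch. The doubling rule used here is an assumption chosen for illustration (the thread does not spell out the exact rule the patch uses), and `PrefetchRamp`, `prefetch_ramp_init()` and `prefetch_ramp_next()` are hypothetical names. The point is only that a scan returning a handful of tuples never issues many prefetches, while a long scan quickly reaches the configured maximum (e.g. effective_io_concurrency).

```c
#include <assert.h>

/* Hypothetical ramp-up state: current and maximum prefetch distance. */
typedef struct PrefetchRamp
{
    int distance;               /* how many blocks ahead we currently prefetch */
    int max_distance;           /* upper bound, e.g. effective_io_concurrency */
} PrefetchRamp;

void
prefetch_ramp_init(PrefetchRamp *ramp, int max_distance)
{
    assert(max_distance >= 0);
    ramp->distance = 0;         /* start with no prefetching at all */
    ramp->max_distance = max_distance;
}

/*
 * Called once per tuple returned by the scan.  Grows the distance as
 * 1, 2, 4, ... until it is capped at max_distance, and returns the
 * distance to use for the next prefetch decision.
 */
int
prefetch_ramp_next(PrefetchRamp *ramp)
{
    if (ramp->distance < ramp->max_distance)
    {
        ramp->distance = (ramp->distance == 0) ? 1 : ramp->distance * 2;
        if (ramp->distance > ramp->max_distance)
            ramp->distance = ramp->max_distance;
    }
    return ramp->distance;
}
```

A maximum of 0 keeps the distance at 0 forever, which in this sketch corresponds to prefetching being disabled.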
On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > But I have very hard time figuring out what the MVP version should be, > because I have very limited understanding on how much control the index > AM ought to have :-( And it'd be a bit silly to do something in v17, > only to have to rip it out in v18 because it turned out to not get the > split right. I suspect that you're overestimating the difficulty of getting the layering right (at least relative to the difficulty of everything else). The executor proper doesn't know anything about pins on leaf pages (and in reality nbtree usually doesn't hold any pins these days). All the executor knows is that it had better not be possible for an in-flight index scan to get confused by concurrent TID recycling by VACUUM. When amgettuple/btgettuple is called, nbtree usually just returns TIDs it collected from a just-scanned leaf page. This sort of stuff already lives in the index AM. It seems to me that everything at the API and executor level can continue to work in essentially the same way as it always has, with only minimal revision to the wording around buffer pins (in fact that really should have happened back in 2015, as part of commit 2ed5b87f). The hard part will be figuring out how to make the physical index scan prefetch optimally, in a way that balances various considerations. These include: * Managing heap prefetch distance. * Avoiding making kill_prior_tuple significantly less effective (perhaps the new design could even make it more effective, in some scenarios, by holding onto multiple buffer pins based on a dynamic model). * Figuring out how many leaf pages it makes sense to read ahead of accessing the heap, since there is no fixed relationship between the number of leaf pages we need to scan to collect a given number of distinct heap blocks that we need for prefetching. (This is made more complicated by things like LIMIT, but is actually an independent problem.) 
So I think that you need to teach index AMs to behave roughly as if multiple leaf pages were read as one single leaf page, at least in terms of things like how the BTScanOpaqueData.currPos state is managed. I imagine that currPos will need to be filled with TIDs from multiple index pages, instead of just one, with entries that are organized in a way that preserves the illusion of one continuous scan from the point of view of the executor proper. By the time we actually start really returning TIDs via btgettuple, it looks like we scanned one giant leaf page instead of several (the exact number of leaf pages scanned will probably have to be indeterminate, because it'll depend on things like heap prefetch distance). The good news (assuming that I'm right here) is that you don't need to have specific answers to most of these questions in order to commit a v1 of index prefetching. ISTM that all you really need is to have confidence that the general approach that I've outlined is the right approach, long term (certainly not nothing, but I'm at least reasonably confident here). > The hard thing is what to do about cases where neither of this helps. > The example I keep thinking about is IOS - if we don't do prefetching, > it's not hard to construct cases where regular index scan gets much > faster than IOS (with many not-all-visible pages). But we can't just > prefetch all pages, because that'd hurt IOS cases with most pages fully > visible (when we don't need to actually access the heap). > > I managed to deal with this in the executor-level version, but I'm not > sure how to do this if the control moves closer to the index AM. The reality is that nbtree already knows about index-only scans. It has to, because it wouldn't be safe to drop the pin on a leaf page's buffer when the scan is "between pages" in the specific case of index-only scans (so the _bt_killitems code path used when kill_prior_tuple has index tuples to kill knows about index-only scans).
I actually added commentary to the nbtree README that goes into TID recycling by VACUUM not too long ago. This includes stuff about how LP_UNUSED items in the heap are considered dead to all index scans (which can actually try to look at a TID that just became LP_UNUSED in the heap!), even though LP_UNUSED items don't prevent VACUUM from setting heap pages all-visible. This seemed like the only way of explaining the _bt_killitems IOS issue, that actually seemed to make sense. What you really want to do here is to balance costs and benefits. That's just what's required. The fact that those costs and benefits span multiple levels of abstractions makes it a bit awkward, but doesn't (and can't) change the basic shape of the problem. -- Peter Geoghegan
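To make the "several leaf pages presented as one giant leaf page" idea a bit more concrete, here is a minimal sketch in plain C. All names (MergedBatch, merged_batch_add_page, merged_batch_next) and the fixed capacities are hypothetical and exist only for illustration; this is not nbtree code, and it ignores buffer pins, locking and scan direction entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH_ITEMS 1024    /* hypothetical capacity, not from nbtree */
#define MAX_BATCH_PAGES 16      /* hypothetical capacity, not from nbtree */

/* Simplified stand-in for a heap TID. */
typedef struct ItemTid
{
    uint32_t    block;
    uint16_t    offset;
} ItemTid;

/* TIDs from several leaf pages, exposed to the caller as one flat array. */
typedef struct MergedBatch
{
    ItemTid     items[MAX_BATCH_ITEMS];
    int         page_of_item[MAX_BATCH_ITEMS];  /* index into pages[] */
    uint32_t    pages[MAX_BATCH_PAGES];         /* leaf block numbers */
    int         nitems;
    int         npages;
    int         next;                           /* next item to return */
} MergedBatch;

void
merged_batch_init(MergedBatch *b)
{
    b->nitems = b->npages = b->next = 0;
}

/* Append all TIDs of one leaf page; returns 0 if the batch is full. */
int
merged_batch_add_page(MergedBatch *b, uint32_t leaf_block,
                      const ItemTid *tids, int ntids)
{
    assert(ntids >= 0);
    if (b->npages == MAX_BATCH_PAGES || b->nitems + ntids > MAX_BATCH_ITEMS)
        return 0;
    b->pages[b->npages] = leaf_block;
    for (int i = 0; i < ntids; i++)
    {
        b->items[b->nitems] = tids[i];
        b->page_of_item[b->nitems] = b->npages;
        b->nitems++;
    }
    b->npages++;
    return 1;
}

/*
 * Return the next TID as if it came from a single page, or NULL when the
 * batch is exhausted. The originating leaf page is reported separately, so
 * that per-page bookkeeping (e.g. killed items) stays possible.
 */
const ItemTid *
merged_batch_next(MergedBatch *b, uint32_t *leaf_block)
{
    if (b->next >= b->nitems)
        return NULL;
    if (leaf_block != NULL)
        *leaf_block = b->pages[b->page_of_item[b->next]];
    return &b->items[b->next++];
}
```

The caller only ever sees a flat sequence of TIDs; how many leaf pages were merged is a private decision of whoever fills the batch, which is roughly the "indeterminate number of leaf pages" property described above.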
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: [..] > TBH I don't have a clear idea what to do. It'd be cool to have at least > some benefits in v17, but I don't know how to do that in a way that > would be useful in the future. > > For example, the v20240124 patch implements this in the executor, but > based on the recent discussions it seems that's not the right layer - > the index AM needs to have some control, and I'm not convinced it's > possible to improve it in that direction (even ignoring the various > issues we identified in the executor-based approach). > > I think it might be more practical to do this from the index AM, even if > it has various limitations. Ironically, that's what I proposed at pgcon, > but mostly because it was the quick&dirty way to do this. ... that's a pity! :( Well, then let's just finish that subthread, I gave some explanations, but I'll try to take a look in future revisions. > > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8, > > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? > > (allowing it to follow dynamically; the more prefetches the user wants > > to perform, the more you spread them across shared LRUs and the more > > memory for history is required?) > > > > + * XXX Maybe we could consider effective_cache_size when sizing the cache? > > + * Not to size the cache for that, ofc, but maybe as a guidance of how many > > + * heap pages it might keep. Maybe just a fraction fraction of the value, > > + * say Max(8MB, effective_cache_size / max_connections) or something. > > + */ > > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */ > > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */ > > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * > > PREFETCH_LRU_COUNT) > > > > I don't see why would this be related to effective_io_concurrency? It's > merely about how many recently accessed pages we expect to find in the > page cache. 
It's entirely separate from the prefetch distance. Well, my thought was that the higher eic is, the more I/O parallelism we are introducing, and in such a case the more requests we need to remember from the past to avoid prefetching the same blocks again (N * eic, where N would be some multiplier). > > 7. in IndexPrefetchComputeTarget() > > > > + * XXX We cap the target to plan_rows, because it's pointless to prefetch > > + * more than we expect to use. > > > > That's a nice fact that's already in patch, so XXX isn't needed? > > > > Right, which is why it's not a TODO/FIXME. OH! That explains it to me. I've taken all of the XXXs as literally FIXME that you wanted to go away (things to be removed before the patch is considered mature). > But I think it's good to > point this out - I'm not 100% convinced we should be using plan_rows > like this (because what happens if the estimate happens to be wrong?). Well, a somewhat similar problematic pattern was present in a different codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95. There the final fix was to get away without adding a new GUC (which is always an option...) and just introduce a sensible hard limit (fence), sticking to the limit of 100 visited heap pages. Here we could have a similar heuristic right from the start: if (plan_rows < we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows and ramp prefetches back up to the full eic value. > > Some further tests, given data: > > > > CREATE TABLE test (id bigint, val bigint, str text); > > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL; > > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), > > 3000) FROM generate_series(1, 10000) g; > > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + > > (10*random())::int), 3000) from (select 10000 * random() as r from > > generate_series(1, 10000)) x; > > VACUUM ANALYZE test; > > CREATE INDEX on test (id) ; > > > > It's not clear to me what's the purpose of this test? Can you explain?
It's just schema&data preparation for the tests below:

> > 2. Prefetching for TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption that we should go like this:
> > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually
> > see the code path where it could be added - certainly it's not blocker
> > -- but maybe in commit message a list of improvements for future could
> > be listed?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated the data using the cyclic / random version and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of string_agg(), but that's not enough)

where fd 45, 54 and 55 correspond to:

lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/45 -> /tmp/blah/base/5/16384 // "test"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/54 -> /tmp/blah/base/5/16388 // "pg_toast_16384_index"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/55 -> /tmp/blah/base/5/16387 // "pg_toast_16384"

I got the following:
- 83 pread64 calls and 83 fadvise() calls for random offsets for fd=45 - the main intent of this patch (main relation heap prefetching) works well
- 54 pread64 calls for fd=54 (no fadvise() calls)
- 1789 (!) calls to pread64 for fd=55 for RANDOM offsets (TOAST heap, no prefetch)

so at least in theory it makes a lot of sense to prefetch TOAST too; the pattern looks like cyclic random:

// pread(fd, "", blocksz, offset)
fadvise64(45, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(55, ""..., 8192, 38002688) = 8192
pread64(55, ""..., 8192, 12034048) = 8192
pread64(55, ""..., 8192, 36560896) = 8192
pread64(55, ""..., 8192, 8871936) = 8192
pread64(55, ""..., 8192, 17965056) = 8192
pread64(55, ""..., 8192, 18710528) = 8192
pread64(55, ""..., 8192, 35635200) = 8192
pread64(55, ""..., 8192, 23379968) = 8192
pread64(55, ""..., 8192, 25141248) = 8192
pread64(55, ""..., 8192, 3457024) = 8192
pread64(55, ""..., 8192, 24633344) = 8192
pread64(55, ""..., 8192, 36462592) = 8192
pread64(55, ""..., 8192, 18120704) = 8192
pread64(55, ""..., 8192, 27066368) = 8192
pread64(45, ""..., 8192, 40960) = 8192
pread64(55, ""..., 8192, 2768896) = 8192
pread64(55, ""..., 8192, 10846208) = 8192
pread64(55, ""..., 8192, 30179328) = 8192
pread64(55, ""..., 8192, 7700480) = 8192
pread64(55, ""..., 8192, 38846464) = 8192
pread64(55, ""..., 8192, 1040384) = 8192
pread64(55, ""..., 8192, 10985472) = 8192

It's probably a separate feature (prefetching blocks from TOAST), but it could be mentioned that this patch is not doing that (I was assuming it could).

> > 3. I'm not sure if I got good-enough results for DESCending index
> > `create index on test (id DESC);`- with eic=16 it doesn't seem to be
> > able to prefetch 16 blocks in advance? (e.g. highlight offset 557056
> > below in some text editor and its distance is far lower between that
> > fadvise<->pread):
> > [..]
> >
>
> I'm not sure I understand these strace snippets. Can you elaborate a
> bit, explain what the strace log says?
set enable_seqscan to off;
set enable_bitmapscan to off;
drop index test_id_idx;
create index on test (id DESC); -- DESC one
SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;

Ok, so here is cleaner output of strace -s 0 for the PID doing that SELECT with eic=16, annotated with [*]:

lseek(45, 0, SEEK_END) = 688128
lseek(47, 0, SEEK_END) = 212992
pread64(47, ""..., 8192, 172032) = 8192
pread64(45, ""..., 8192, 90112) = 8192
fadvise64(45, 172032, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 172032) = 8192
fadvise64(45, 319488, 8192, POSIX_FADV_WILLNEED) = 0 [*off 319488 start]
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 319488) = 8192 [*off 319488, read, distance=1 fadvises]
fadvise64(45, 466944, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 393216, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 335872) = 8192
fadvise64(45, 540672, 8192, POSIX_FADV_WILLNEED) = 0 [*off 540672 start]
fadvise64(45, 262144, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 466944) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 393216) = 8192
fadvise64(45, 163840, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 385024, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 540672) = 8192 [*off 540672, read, distance=4 fadvises]
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
[..]

I was wondering why the distance never got >4 in this case for eic=16; it should issue more fadvise calls, shouldn't it? (It was happening only for DESC; with a normal ASC index the prefetching distance easily reaches ~eic values.) I think today I've got the answer -- after dropping/creating the DESC index I did NOT execute ANALYZE, so probably the Min(..., plan_rows) was kicking in and preventing the full prefetching.

Hitting the above makes me think that the XXX for plan_rows should really be a real FIXME.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com
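The plan_rows fallback suggested earlier in this message (stop trusting the estimate once the scan has visibly outrun it) could be sketched roughly as below. This is only an illustration under stated assumptions: prefetch_target() and its parameters are hypothetical names, not the patch's IndexPrefetchComputeTarget(), and avg_rows_per_block is assumed to be available from somewhere such as table statistics.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the patch: compute the prefetch
 * target for a scan given effective_io_concurrency (eic), the planner's
 * row estimate, and how many heap pages the scan has visited so far.
 */
int
prefetch_target(int eic, double plan_rows,
                long visited_pages, double avg_rows_per_block)
{
    assert(eic >= 0 && plan_rows >= 0 && visited_pages >= 0);

    /*
     * The scan has already processed more rows than the planner expected,
     * so the estimate is evidently wrong: ignore it and use the full eic.
     */
    if (plan_rows < (double) visited_pages * avg_rows_per_block)
        return eic;

    /* Otherwise keep the existing cap, i.e. Min(eic, plan_rows). */
    if (plan_rows < (double) eic)
        return (int) plan_rows;
    return eic;
}
```

With a stale estimate of, say, 4 rows, the target starts at 4 but goes back to the full eic as soon as the first visited page shows that more rows are coming, which is the behavior that was missing in the un-ANALYZEd DESC index case above.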
On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote: > Attached is an updated version of this patch series. The first couple > parts (adding batching + updating built-in index AMs) remain the same, > the new part is 0007 which switches index scans to read stream API. The first thing that I notice about this patch series is that it doesn't fully remove amgettuple as a concept. That seems a bit odd to me. After all, you've invented a single page batching mechanism, which is duplicative of the single page batching mechanism that each affected index AM has to use already, just to be able to allow the amgettuple interface to iterate backwards and forwards with a scrollable cursor (and to make mark/restore work). ISTM that you have one too many batching interfaces here. I can think of nothing that makes the task of completely replacing amgettuple particularly difficult. I don't think that the need to do the _bt_killitems stuff actually makes this task all that much harder. It will need to be generalized, too, by keeping track of multiple BTScanOpaqueData.killedItems[] style states, each of which is associated with its own page-level currPos state. But that's not rocket science. (I also don't think that mark/restore support is all that hard.) The current way in which _bt_kill_batch() is called from _bt_steppage() by the patch seems weird to me. You're copying what you actually know to be the current page's kill items such that _bt_steppage() will magically do what it does already when the amgettuple/btgettuple interface is in use, just as we're stepping off the page. It seems to be working at the wrong level. Notice that the current way of doing things in your patch means that your new batching interface tacitly knows about the nbtree batching interface, and that it too works along page boundaries -- that's the only reason why it can hook into _bt_steppage like this in the first place.
Things are way too tightly coupled, and the old and new way of doing things are hopelessly intertwined. What's being abstracted away here, really? I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at all -- nor should it even call _bt_killitems(). Things need to be broken down into smaller units of work that can be reordered, instead. The first half of the current _bt_steppage() function, which deals with finishing off the current leaf page, should be moved to some other function -- let's call it _bt_finishpage. A new callback should be called as part of the new API when the time comes to tell nbtree that we're now done with a given leaf page -- that's what this new _bt_finishpage function is for. All that remains of _bt_steppage() are the parts that deal with figuring out which page should be visited next -- the second half of _bt_steppage stays put. That way stepping to the next page and reading multiple pages can be executed as eagerly as makes sense -- we don't need to "coordinate" the heap accesses in lockstep with the leaf page accesses. Maybe you won't take advantage of this flexibility right away, but ISTM that you need nominal support for this kind of reordering to make the new API really make sense. There are some problems with this scheme, but they seem reasonably tractable to me. We already have strategies for dealing with the risk of concurrent TID recycling when _bt_killitems is called with some maybe-recycled TIDs -- we're already dropping the pin on the leaf page early in many cases. I've pointed this out many times already (again, see _bt_drop_lock_and_maybe_pin). It's true that we're still going to have to hold onto a buffer pin on leaf pages whose TIDs haven't all been read from the table AM side yet, unless we know that it's a case where that's safe for other reasons -- otherwise index-only scans might give wrong answers.
But that other problem shouldn't be confused with the _bt_killitems problem, just because of the superficial similarity around holding onto a leaf page pin. To repeat: it is important that you not conflate the problems on the table AM side (TID recycle safety for index scans) with the problems on the index AM side (safely setting LP_DEAD bits in _bt_killitems). They're two separate problems that are currently dealt with as one problem on the nbtree side -- but that isn't fundamental. Teasing them apart seems likely to be helpful here. > I speculated that with the batching concept it might work better, and I > think that turned out to be the case. The batching is still the core > idea, giving the index AM enough control to make kill tuples work (by > not generating batches spanning multiple leaf pages, or doing something > smarter). And the read stream leverages that too - the next_block > callback returns items from the current batch, and the stream is reset > between batches. This is the same prefetch restriction as with the > explicit prefetching (done using posix_fadvise), except that the > prefetching is done by the read stream. ISTM that the central feature of the new API should be the ability to reorder certain kinds of work. There will have to be certain constraints, of course. Sometimes these will principally be problems for the table AM (e.g., we mustn't allow concurrent TID recycling unless it's for a plain index scan using an MVCC snapshot), other times they're principally problems for the index AM (e.g., the _bt_killitems safety issues). I get that you're not that excited about multi-page batches; it's not the priority. Fair enough. I just think that the API needs to work in terms of batches that are sized as one or more pages, in order for it to make sense. BTW, the README changes you made are slightly wrong about pins and locks.
We don't actually keep around C pointers to IndexTuples for index-only scans that point into shared memory -- that won't work. We simply copy whatever IndexTuples the scan returns into local state, associated with so->currPos. So that isn't a complicating factor, at all. That's all I have right now. Hope it helps. -- Peter Geoghegan
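As a rough illustration of the decomposition described in this message (one operation that reads a leaf page, a separate one that finishes it off, and per-page killed-items state so the two can be reordered), here is a minimal sketch. LeafPageState, page_read(), page_mark_killed() and page_finish() are hypothetical names; the "pinned" flag merely stands in for a buffer pin, and nothing here is actual nbtree code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAGE_ITEMS 256      /* hypothetical per-page item limit */

/* Everything the index AM remembers about one leaf page it has read. */
typedef struct LeafPageState
{
    uint32_t    leaf_block;
    int         nitems;
    bool        killed[MAX_PAGE_ITEMS]; /* per-page killedItems[] stand-in */
    bool        pinned;                 /* stand-in for a buffer pin */
} LeafPageState;

/* "Read the next page": record the page's details, nothing about progress. */
void
page_read(LeafPageState *p, uint32_t leaf_block, int nitems)
{
    assert(nitems >= 0 && nitems <= MAX_PAGE_ITEMS);
    memset(p, 0, sizeof(*p));
    p->leaf_block = leaf_block;
    p->nitems = nitems;
    p->pinned = true;
}

/* Called by whoever drives the scan when a heap fetch found a dead tuple. */
void
page_mark_killed(LeafPageState *p, int item)
{
    assert(item >= 0 && item < p->nitems);
    p->killed[item] = true;
}

/*
 * "Finish off" a page: this is where LP_DEAD hinting would happen in a real
 * AM. Here we only count the killed items and drop the stand-in pin.
 */
int
page_finish(LeafPageState *p)
{
    int         nkilled = 0;

    for (int i = 0; i < p->nitems; i++)
    {
        if (p->killed[i])
            nkilled++;
    }
    p->pinned = false;
    return nkilled;
}
```

Because each page carries its own killed-items state, a caller is free to read page N+1 (or several pages) before finishing page N, which is the kind of reordering the message argues the new API should at least nominally allow.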
On 11/7/24 01:38, Peter Geoghegan wrote: > On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote: >> Attached is an updated version of this patch series. The first couple >> parts (adding batching + updating built-in index AMs) remain the same, >> the new part is 0007 which switches index scans to read stream API. > > The first thing that I notice about this patch series is that it > doesn't fully remove amgettuple as a concept. That seems a bit odd to > me. After all, you've invented a single page batching mechanism, which > is duplicative of the single page batching mechanism that each > affected index AM has to use already, just to be able to allow the > amgettuple interface to iterate backwards and forwards with a > scrollable cursor (and to make mark/restore work). ISTM that you have > one too many batching interfaces here. > > I can think of nothing that makes the task of completely replacing > amgettuple particularly difficult. I don't think that the need to do > the _bt_killitems stuff actually makes this task all that much harder. > It will need to be generalized, too, by keeping track of multiple > BTScanOpaqueData.killedItems[] style states, each of which is > associated with its own page-level currPos state. But that's not > rocket science. (Also don't think that mark/restore support is all > that hard.) > The primary reason why I kept amgettuple() as is, and added a new AM callback for the "batch" mode is backwards compatibility. I did not want to force all AMs to do this, I think it should be optional. Not only to limit the disruption for out-of-core AMs, but also because I'm not 100% sure every AM will be able to do batching in a reasonable way. I do agree having an AM-level batching, and then another batching in the indexam.c is a bit ... weird. To some extent this is a remainder of an earlier patch version, but it's also based on some suggestions by Andres about batching these calls into AM for efficiency reasons. 
To be fair, I was jetlagged and I'm not 100% sure this is what he meant, or that it makes a difference in practice. Yes, we could ditch the batching in indexam.c, and just rely on the AM batching, just like now. There are a couple details why the separate batching seemed convenient: 1) We may need to stash some custom data for each TID (e.g. so that IOS does not need to check VM repeatedly). But perhaps that could be delegated to the index AM too ... 2) We need to maintain two "positions" in the index. One for the item the executor is currently processing (and which might end up getting marked as "killed" etc). And another one for "read" position, i.e. items passed to the read stream API / prefetching, etc. 3) It makes it clear when the items are no longer needed, and the AM can do cleanup, process kill tuples, etc. > The current way in which _bt_kill_batch() is called from > _bt_steppage() by the patch seems weird to me. You're copying what you > actually know to be the current page's kill items such that > _bt_steppage() will magically do what it does already when the > amgettuple/btgettuple interface is in use, just as we're stepping off > the page. It seems to be working at the wrong level. > True, but that's how it was working before, it wasn't my ambition to rework that. > Notice that the current way of doing things in your patch means that > your new batching interface tacitly knows about the nbtree batching > interface, and that it too works along page boundaries -- that's the > only reason why it can hook into _bt_steppage like this in the first > place. Things are way too tightly coupled, and the old and new way of > doing things are hopelessly intertwined. What's being abstracted away > here, really? > I'm not sure if by "new batching interface" you mean the indexam.c code, or the code in btgetbatch() etc. I don't think indexam.c knows all that much about the nbtree internal batching.
It "just" relies on amgetbatch() producing items the AM can handle later (during killtuples/cleanup etc.). It does not even need to be a single-leaf-page batch, if the AM knows how to track/deal with that internally. It just was easier to do by restricting to a single leaf page for now. But that's internal to AM. Yes, it's true inside the AM it's more intertwined, and some of it sets things up so that the existing code does the right thing ... > I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at > all -- nor should it even call _bt_killitems(). Things need to be > broken down into smaller units of work that can be reordered, instead. > > The first half of the current _bt_steppage() function deals with > finishing off the current leaf page should be moved to some other > function -- let's call it _bt_finishpage. A new callback should be > called as part of the new API when the time comes to tell nbtree that > we're now done with a given leaf page -- that's what this new > _bt_finishpage function is for. All that remains of _bt_steppage() are > the parts that deal with figuring out which page should be visited > next -- the second half of _bt_steppage stays put. > > That way stepping to the next page and reading multiple pages can be > executed as eagerly as makes sense -- we don't need to "coordinate" > the heap accesses in lockstep with the leaf page accesses. Maybe you > won't take advantage of this flexibility right away, but ISTM that you > need nominal support for this kind of reordering to make the new API > really make sense. > Yes, splitting _bt_steppage() like this makes sense to me, and I agree being able to proceed to the next page before we're done with the current page seems perfectly reasonable for batches spanning multiple leaf pages. > There are some problems with this scheme, but they seem reasonably > tractable to me. 
We already have strategies for dealing with the risk > of concurrent TID recycling when _bt_killitems is called with some > maybe-recycled TIDs -- we're already dropping the pin on the leaf page > early in many cases. I've pointed this out many times already (again, > see _bt_drop_lock_and_maybe_pin). > > It's true that we're still going to have to hold onto a buffer pin on > leaf pages whose TIDs haven't all been read from the table AM side > yet, unless we know that it's a case where that's safe for other > reasons -- otherwise index-only scans might give wrong answers. But > that other problem shouldn't be confused with the _bt_killitems > problem, just because of the superficial similarity around holding > onto a leaf page pin. > > To repeat: it is important that you not conflate the problems on the > table AM side (TID recycle safety for index scans) with the problems > on the index AM side (safely setting LP_DEAD bits in _bt_killitems). > They're two separate problems that are currently dealt with as one > problem on the nbtree side -- but that isn't fundamental. Teasing them > apart seems likely to be helpful here. > Hmm. I've intentionally tried to ignore these issues, or rather to limit the scope of the patch so that v1 does not require dealing with it. Hence the restriction to single-leaf batches, for example. But I guess I may have to look at this after all ... not great. >> I speculated that with the batching concept it might work better, and I >> think that turned out to be the case. The batching is still the core >> idea, giving the index AM enough control to make kill tuples work (by >> not generating batches spanning multiple leaf pages, or doing something >> smarter). And the read stream leverages that too - the next_block >> callback returns items from the current batch, and the stream is reset >> between batches. 
This is the same prefetch restriction as with the >> explicit prefetching (done using posix_fadvise), except that the >> prefetching is done by the read stream. > > ISTM that the central feature of the new API should be the ability to > reorder certain kinds of work. There will have to be certain > constraints, of course. Sometimes these will principally be problems > for the table AM (e.g., we mustn't allow concurrent TID recycling > unless it's for a plain index scan using an MVCC snapshot), other > times they're principally problems for the index AM (e.g., the > _bt_killitems safety issues). > Not sure. By "new API" you mean the read stream API, or the index AM API to allow batching? > I get that you're not that excited about multi-page batches; it's not > the priority. Fair enough. I just think that the API needs to work in > terms of batches that are sized as one or more pages, in order for it > to make sense. > True, but isn't that already the case? I mean, what exactly prevents an index AM to "build" a batch for multiple leaf pages? The current patch does not implement that for any of the AMs, true, but isn't that already possible if the AM chooses to? If you were to design the index AM API to support this (instead of adding the amgetbatch callback etc.), how would it look? In one of the previous patch versions I tried to rely on amgettuple(). It got a bunch of TIDs ahead from that, depending on prefetch distance. Then those TIDs were prefetched/passed to the read stream, and stashed in a queue (in IndexScanDesc). And then indexam would get the TIDs from the queue, and pass them to index scans etc. Unfortunately that didn't work because of killtuples etc. because the index AM had no idea about the indexam queue and has its own concept of "current item", so it was confused about which item to mark as killed. And that old item might even be from an earlier leaf page (not the "current" currPos).
I was thinking maybe the AM could keep the leaf pages, and then free them once they're no longer needed. But it wasn't clear to me how to exchange this information between indexam.c and the index AM, because right now the AM only knows about a single (current) position. But imagine we have this: a) A way to switch the scan into "batch" mode, where the AM keeps the leaf page (and a way for the AM to indicate it supports this). b) Some way to track two "positions" in the scan - one for read, one for prefetch. I'm not sure if this would be internal in each index AM, or at the indexam.c level. c) A way to get the index tuple for either of the two positions (and advance the position). It might be a flag for amgettuple(), or maybe even a callback for the "prefetch" position. d) A way to inform the AM that items up to some position are no longer needed, and thus the leaf pages can be cleaned up and freed. AFAICS it could always be "up to the current read position". Does that sound reasonable / better than the current approach, or have I finally reached the "raving lunatic" stage? > BTW, the README changes you made are slightly wrong about pins and > locks. We don't actually keep around C pointers to IndexTuples for > index-only scans that point into shared memory -- that won't work. We > simply copy whatever IndexTuples the scan returns into local state, > associated with so->currPos. So that isn't a complicating factor, at > all. > Ah, OK. Thanks for the correction. > That's all I have right now. Hope it helps. > Yes, very interesting insights. Thanks! regards -- Tomas Vondra
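Points (b) through (d) above can be sketched with a small queue of batches and two independent positions. This is a minimal illustration only: BatchQueue, ScanPos and the queue_* functions are hypothetical names, batches are reduced to item counts, and the rule that the read position may never overtake the prefetch position is an assumption of this sketch rather than something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BATCHES 8           /* hypothetical limit on loaded batches */

/* A position names the next item to consume: (batch number, item in batch). */
typedef struct ScanPos
{
    int         batch;
    int         item;
} ScanPos;

typedef struct BatchQueue
{
    int         nitems[MAX_BATCHES];    /* item count of each loaded batch */
    int         nbatches;               /* batches loaded so far */
    int         first_kept;             /* oldest batch not yet released */
    ScanPos     read;                   /* (b) position used by the executor */
    ScanPos     prefetch;               /* (b) position used for prefetching */
} BatchQueue;

void
queue_init(BatchQueue *q)
{
    q->nbatches = 0;
    q->first_kept = 0;
    q->read.batch = q->read.item = 0;
    q->prefetch.batch = q->prefetch.item = 0;
}

bool
queue_add_batch(BatchQueue *q, int nitems)
{
    assert(nitems > 0);
    if (q->nbatches == MAX_BATCHES)
        return false;
    q->nitems[q->nbatches++] = nitems;
    return true;
}

/* (c) Consume the item at a position and advance it; false if none loaded. */
static bool
pos_advance(const BatchQueue *q, ScanPos *pos)
{
    if (pos->batch >= q->nbatches)
        return false;
    if (++pos->item == q->nitems[pos->batch])
    {
        pos->batch++;
        pos->item = 0;
    }
    return true;
}

bool
queue_next_prefetch(BatchQueue *q)
{
    return pos_advance(q, &q->prefetch);
}

/* Assumption of this sketch: reads never overtake the prefetch position. */
bool
queue_next_read(BatchQueue *q)
{
    if (q->read.batch == q->prefetch.batch && q->read.item == q->prefetch.item)
        return false;
    return pos_advance(q, &q->read);
}

/* (d) Release all batches entirely behind the read position. */
int
queue_release(BatchQueue *q)
{
    int         released = q->read.batch - q->first_kept;

    q->first_kept = q->read.batch;
    return released;
}
```

In this toy model the release step is always "up to the current read position", matching the AFAICS remark in (d); a real AM would do its kill-items processing and drop pins for each released batch at that point.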
On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas@vondra.me> wrote: > The primary reason why I kept amgettuple() as is, and added a new AM > callback for the "batch" mode is backwards compatibility. I did not want > to force all AMs to do this, I think it should be optional. Not only to > limit the disruption for out-of-core AMs, but also because I'm not 100% > sure every AM will be able to do batching in a reasonable way. All index AMs that implement amgettuple are fairly similar to nbtree. They are: * nbtree itself * GiST * Hash * SP-GiST They all have the same general notion of page-at-a-time processing, with buffering of items for the amgettuple callback to return. There are perhaps enough differences to be annoying in SP-GiST, and with GiST's ordered scans (which use a pairing heap rather than true page-at-a-time processing). I guess you're right that you'll need to maintain amgettuple support for the foreseeable future, to support these special cases. I still think that you shouldn't need to use amgettuple in either nbtree or hash, since neither AM does anything non-generic in this area. It should be normal to never need to use amgettuple. > Yes, we could ditch the batching in indexam.c, and just rely on the AM > batching, just like now. To be clear, I had imagined completely extracting the batching from the index AM, since it isn't really at all coupled to individual index AM implementation details anyway. I don't hate the idea of doing more in the index AM, but whether or not it happens there vs. somewhere else isn't my main concern at this point. My main concern right now is that one single place be made to see every relevant piece of information about costs and benefits. Probably something inside indexam.c. > There are a couple details why the separate > batching seemed convenient: > > 1) We may need to stash some custom data for each TID (e.g. so that IOS > does not need to check VM repeatedly). 
But perhaps that could be > delegated to the index AM too ... > > 2) We need to maintain two "positions" in the index. One for the item > the executor is currently processing (and which might end up getting > marked as "killed" etc). And another one for "read" position, i.e. items > passed to the read stream API / prefetching, etc. That all makes sense. > 3) It makes it clear when the items are no longer needed, and the AM can > do cleanup. process kill tuples, etc. But it doesn't, really. The index AM is still subject to exactly the same constraints in terms of page-at-a-time processing. These existing constraints always came from the table AM side, so it's not as if your patch can remain totally neutral on these questions. Basically, it looks like you've invented a shadow batching interface that is technically not known to the index AM, but nevertheless coordinates with the existing so->currPos batching interface. > I don't think indexam.c knows all that much about the nbtree internal > batching. It "just" relies on amgetbatch() producing items the AM can > handle later (during killtuples/cleanup etc.). It does not even need to > be a single-leaf-page batch, if the AM knows how to track/deal with that > internally. I'm concerned that no single place will know about everything under this scheme. Having one single place that has visibility into all relevant costs, whether they're index AM or table AM related, is what I think you should be aiming for. I think that you should be removing the parts of the nbtree (and other index AM) code that deal with the progress of the scan explicitly. What remains is code that simply reads the next page, and saves its details in the relevant data structures. Or code that "finishes off" a leaf page by dropping its pin, and maybe doing the _bt_killitems stuff. The index AM itself should no longer know about the current next tuple to return, nor about mark/restore. It is no longer directly in control of the scan's progress. 
It loses all context that survives across API calls. > Yes, splitting _bt_steppage() like this makes sense to me, and I agree > being able to proceed to the next page before we're done with the > current page seems perfectly reasonable for batches spanning multiple > leaf pages. I think that it's entirely possible that it'll just be easier to do things this way from the start. I understand that that may be far from obvious right now, but, again, I just don't see what's so special about the way that each index AM batches results. What about it is so hard to generalize across index AMs that must support amgettuple right now? (At least in the case of nbtree and hash, which have no special requirements for things like KNN-GiST.) Most individual calls to btgettuple just return the next batched-up so->currPos tuple/TID via another call to _bt_next. Things like the _bt_first-new-primitive-scan case don't really add any complexity -- the core concept of processing a page at a time still applies. It really is just a simple batching scheme, with a couple of extra fiddly details attached to it -- but nothing too hairy. The hardest part will probably be rigorously describing the rules for not breaking index-only scans due to concurrent TID recycling by VACUUM, and the rules for doing _bt_killitems. But that's also not a huge problem, in the grand scheme of things. > Hmm. I've intentionally tried to ignore these issues, or rather to limit > the scope of the patch so that v1 does not require dealing with it. > Hence the restriction to single-leaf batches, for example. > > But I guess I may have to look at this after all ... not great. To be clear, I don't think that you necessarily have to apply these capabilities in v1 of this project. I would be satisfied if the patch could just break things out in the right way, so that some later patch could improve things later on.
I only really want to see the capabilities within the index AM decomposed, such that one central place can see a global view of the costs and benefits of the index scan.

You should be able to validate the new API by stress-testing the code. You can make the index AM read several leaf pages at a time when a certain debug mode is enabled. Once you prove that the index AM correctly performs the same processing as today, without any needless restrictions on the ordering that these decomposed operators perform (only required restrictions that are well explained and formalized), then things should be on the right path.

> > ISTM that the central feature of the new API should be the ability to
> > reorder certain kinds of work. There will have to be certain
> > constraints, of course. Sometimes these will principally be problems
> > for the table AM (e.g., we mustn't allow concurrent TID recycling
> > unless it's for a plain index scan using an MVCC snapshot), other
> > times they're principally problems for the index AM (e.g., the
> > _bt_killitems safety issues).
> >
>
> Not sure. By "new API" you mean the read stream API, or the index AM API
> to allow batching?

Right now those two concepts seem incredibly blurred to me.

> > I get that you're not that excited about multi-page batches; it's not
> > the priority. Fair enough. I just think that the API needs to work in
> > terms of batches that are sized as one or more pages, in order for it
> > to make sense.
> >
>
> True, but isn't that already the case? I mean, what exactly prevents an
> index AM to "build" a batch for multiple leaf pages? The current patch
> does not implement that for any of the AMs, true, but isn't that already
> possible if the AM chooses to?

That's unclear, but overall I'd say no.

The index AM API says that they need to hold on to a buffer pin to avoid confusing scans due to concurrent TID recycling by VACUUM. The index AM API fails to adequately describe what is expected here.
And it provides no useful context for larger batching of index pages. nbtree already does its own thing by dropping leaf page pins selectively. Whether or not it's technically possible is a matter of interpretation (I came down on the "no" side, but it's still ambiguous). I would prefer it if the index AM API was much simpler for ordered scans. As I said already, something along the lines of "when you're told to scan the next index page, here's how we'll call you, here's the data structure that you need to fill up". Or "when we tell you that we're done fetching tuples from a recently read index page, here's how we'll call you". These discussions about where the exact boundaries lie don't seem very helpful. The simple fact is that nobody is ever going to invent an index AM side interface that batches up more than a single leaf page. Why would they? It just doesn't make sense to, since the index AM has no idea about certain clearly-relevant context. For example, it has no idea whether or not there's a LIMIT involved. The value that comes from using larger batches on the index AM side comes from making life easier for heap prefetching, which index AMs know nothing about whatsoever. Again, the goal should be to marry information from the index AM and the table AM in one central place. > Unfortunately that didn't work because of killtuples etc. because the > index AM had no idea about the indexam queue and has it's own concept of > "current item", so it was confused about which item to mark as killed. > And that old item might even be from an earlier leaf page (not the > "current" currPos). Currently, during a call to btgettuple, so->currPos.itemIndex is updated within _bt_next. But before _bt_next is called, so->currPos.itemIndex indicates the item returned by the most recent prior call to btgettuple -- which is also the tuple that the scan->kill_prior_tuple reports on. 
In short, btgettuple does some trivial things to remember which entries from so->currPos ought to be marked dead later on due to the scan->kill_prior_tuple flag having been set for those entries. This can be moved outside of each index AM. The index AM shouldn't need to use a scan->kill_prior_tuple style flag under the new batching API at all, though. It should work at a higher level than that. The index AM should be called through a callback that tells it to drop the pin on a page that the table AM has been reading from, and maybe perform _bt_killitems on these relevant known-dead TIDs first. In short, all of the bookkeeping for so->killedItems[] should be happening at a completely different layer. And the so->killedItems[] structure should be directly associated with a single index page subset of a batch (a subset similar to the current so->currPos batches). The first time the index AM sees anything about dead TIDs, it should see a whole leaf page worth of them. > I was thinking maybe the AM could keep the leaf pages, and then free > them once they're no longer needed. But it wasn't clear to me how to > exchange this information between indexam.c and the index AM, because > right now the AM only knows about a single (current) position. I'm imagining a world in which the index AM doesn't even know about the current position. Basically, it has no real context about the progress of the scan to maintain at all. It merely does what it is told by some higher level, that is sensitive to the requirements of both the index AM and the table AM. > But imagine we have this: > > a) A way to switch the scan into "batch" mode, where the AM keeps the > leaf page (and a way for the AM to indicate it supports this). I don't think that there needs to be a batch mode. There could simply be the total absence of batching, which is one point along a continuum, rather than a discrete mode. > b) Some way to track two "positions" in the scan - one for read, one for > prefetch. 
I'm not sure if this would be internal in each index AM, or at
> the indexam.c level.

I think that it would be at the indexam.c level.

> c) A way to get the index tuple for either of the two positions (and
> advance the position). It might be a flag for amgettuple(), or maybe
> even a callback for the "prefetch" position.

Why does the index AM need to know anything about the fact that the next tuple has been requested? Why can't it just be 100% ignorant of all that? (Perhaps barring a few special cases, such as KNN-GiST scans, which continue to use the legacy amgettuple interface.)

> d) A way to inform the AM items up to some position are no longer
> needed, and thus the leaf pages can be cleaned up and freed. AFAICS it
> could always be "up to the current read position".

Yeah, I like this idea. But the index AM doesn't need to know about positions and whatnot. It just needs to do what it's told: to drop the pin, and maybe to perform _bt_killitems first. Or maybe just to drop the pin, with the instruction to do _bt_killitems coming some time later (the index AM will need to be a bit more careful within its _bt_killitems step when this happens).

The index AM doesn't need to drop the current pin for the current position -- not as such. The index AM doesn't directly know about what pins are held, since that'll all be tracked elsewhere. Again, the index AM should need to hold onto zero context, beyond the immediate request to perform one additional unit of work, which will usually/always happen at the index page level (all of which is tracked by data structures that are under the control of the new indexam.c level).

I don't think that it'll ultimately be all that hard to schedule when and how index pages are read from outside of the index AM in question. In general all relevant index AMs already work in much the same way here. Maybe we can ultimately invent a way for the index AM to influence that scheduling, but that might never be required.
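To make the division of labour described above concrete, here is a minimal sketch in plain C. It is only an illustration under stated assumptions: ToyIndex, PageBatch, toy_read_page and toy_release_page are hypothetical names, not PostgreSQL or patch APIs, and plain integers stand in for TIDs, buffers and pins. What it tries to show is that the AM-side functions keep no state between calls: the caller owns the page batch, records killed items in it, and hands over a whole page's worth of them when it tells the AM to finish off the page.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PAGES 4
#define MAX_ITEMS 8

/* Hypothetical: one leaf page's worth of items, owned by the caller. */
typedef struct PageBatch
{
    int  pageno;             /* leaf page these items came from */
    int  nitems;             /* number of valid entries in tids[] */
    int  tids[MAX_ITEMS];    /* stand-in for heap TIDs */
    bool killed[MAX_ITEMS];  /* set by the caller, never by the AM */
    bool pinned;             /* pin state lives here, not inside the AM */
} PageBatch;

/* Hypothetical toy "index": a fixed array of leaf pages. */
typedef struct ToyIndex
{
    int  npages;
    int  nitems[MAX_PAGES];
    int  tids[MAX_PAGES][MAX_ITEMS];
    bool dead[MAX_PAGES][MAX_ITEMS];  /* stand-in for LP_DEAD hints */
} ToyIndex;

/*
 * "Read" callback: copy the requested leaf page into the caller's batch.
 * The AM remembers nothing afterwards. Returns false past the last page.
 */
static bool
toy_read_page(const ToyIndex *idx, int pageno, PageBatch *out)
{
    if (pageno < 0 || pageno >= idx->npages)
        return false;

    out->pageno = pageno;
    out->nitems = idx->nitems[pageno];
    for (int i = 0; i < out->nitems; i++)
    {
        out->tids[i] = idx->tids[pageno][i];
        out->killed[i] = false;
    }
    out->pinned = true;
    return true;
}

/*
 * "Release" callback: the caller says it is done with this page. The AM
 * sees all killed items for the page at once, marks them dead, and drops
 * the pin. Returns the number of items marked dead.
 */
static int
toy_release_page(ToyIndex *idx, PageBatch *batch)
{
    int nkilled = 0;

    for (int i = 0; i < batch->nitems; i++)
    {
        if (batch->killed[i])
        {
            idx->dead[batch->pageno][i] = true;
            nkilled++;
        }
    }
    batch->pinned = false;
    return nkilled;
}
```

In this toy version the caller passes the page number explicitly; a real AM would presumably find the next page from information saved in the batch (for example a sibling link), which is an assumption rather than something settled in this thread.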
> Does that sound reasonable / better than the current approach, or have I > finally reached the "raving lunatic" stage? The stage after "raving lunatic" is enlightenment. :-) -- Peter Geoghegan
On 11/7/24 18:55, Peter Geoghegan wrote: > On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas@vondra.me> wrote: >> The primary reason why I kept amgettuple() as is, and added a new AM >> callback for the "batch" mode is backwards compatibility. I did not want >> to force all AMs to do this, I think it should be optional. Not only to >> limit the disruption for out-of-core AMs, but also because I'm not 100% >> sure every AM will be able to do batching in a reasonable way. > > All index AMs that implement amgettuple are fairly similar to nbtree. They are: > > * nbtree itself > * GiST > * Hash > * SP-GiST > > They all have the same general notion of page-at-a-time processing, > with buffering of items for the amgettuple callback to return. There > are perhaps enough differences to be annoying in SP-GiST, and with > GiST's ordered scans (which use a pairing heap rather than true > page-at-a-time processing). I guess you're right that you'll need to > maintain amgettuple support for the foreseeable future, to support > these special cases. > > I still think that you shouldn't need to use amgettuple in either > nbtree or hash, since neither AM does anything non-generic in this > area. It should be normal to never need to use amgettuple. > Right, I can imagine not using amgettuple() in nbtree/hash. I guess we could even remove it altogether, although I'm not sure that'd work right now (haven't tried). >> Yes, we could ditch the batching in indexam.c, and just rely on the AM >> batching, just like now. > > To be clear, I had imagined completely extracting the batching from > the index AM, since it isn't really at all coupled to individual index > AM implementation details anyway. I don't hate the idea of doing more > in the index AM, but whether or not it happens there vs. somewhere > else isn't my main concern at this point. > > My main concern right now is that one single place be made to see > every relevant piece of information about costs and benefits. 
Probably
> something inside indexam.c.
>

Not sure I understand, but I think I'm somewhat confused by "index AM" vs. indexam. Are you suggesting the individual index AMs should know as little about the batching as possible, and instead it should be up to indexam.c to orchestrate most of the stuff?

If yes, then I agree in principle, and I think indexam.c is the right place to do that (or at least I can't think of a better one).

That's what the current patch aimed to do, more or less. I'm not saying it got it perfectly right, and I'm sure there is stuff that can be improved (like reworking _steppage to not deal with killed tuples). But surely the index AMs need to have some knowledge about batching, because how else would they know which leaf pages to still keep, etc?

>> There are a couple details why the separate
>> batching seemed convenient:
>>
>> 1) We may need to stash some custom data for each TID (e.g. so that IOS
>> does not need to check VM repeatedly). But perhaps that could be
>> delegated to the index AM too ...
>>
>> 2) We need to maintain two "positions" in the index. One for the item
>> the executor is currently processing (and which might end up getting
>> marked as "killed" etc). And another one for "read" position, i.e. items
>> passed to the read stream API / prefetching, etc.
>
> That all makes sense.
>

OK

>> 3) It makes it clear when the items are no longer needed, and the AM can
>> do cleanup. process kill tuples, etc.
>
> But it doesn't, really. The index AM is still subject to exactly the
> same constraints in terms of page-at-a-time processing. These existing
> constraints always came from the table AM side, so it's not as if your
> patch can remain totally neutral on these questions.
>

Not sure I understand. Which part of my sentence do you disagree with? Or what constraints do you mean?
The interface does not require page-at-a-time processing - the index AM is perfectly within its rights to produce a batch spanning 10 leaf pages, as long as it keeps track of them, and perhaps keeps some mapping of items (returned in the batch) to leaf pages, so that when the next batch is requested, it can do the cleanup and move to the next batch.

Yes, the current implementation does not do that, to keep the patches simple. But it should be possible, I believe.

> Basically, it looks like you've invented a shadow batching interface
> that is technically not known to the index AM, but nevertheless
> coordinates with the existing so->currPos batching interface.
>

Perhaps, but which part of that do you consider a problem? Are you saying this shouldn't use the currPos stuff at all, and instead do stuff in some other way?

>> I don't think indexam.c knows all that much about the nbtree internal
>> batching. It "just" relies on amgetbatch() producing items the AM can
>> handle later (during killtuples/cleanup etc.). It does not even need to
>> be a single-leaf-page batch, if the AM knows how to track/deal with that
>> internally.
>
> I'm concerned that no single place will know about everything under
> this scheme. Having one single place that has visibility into all
> relevant costs, whether they're index AM or table AM related, is what
> I think you should be aiming for.
>
> I think that you should be removing the parts of the nbtree (and other
> index AM) code that deal with the progress of the scan explicitly.
> What remains is code that simply reads the next page, and saves its
> details in the relevant data structures. Or code that "finishes off" a
> leaf page by dropping its pin, and maybe doing the _bt_killitems
> stuff.

Does that mean not having a simple amgetbatch() callback, but some finer-grained interface? Or maybe one callback that returns the next "AM page" (essentially the currPos), and then another callback to release it?
(This is what I mean by "two-callback API" later.) Or what would it look like? > The index AM itself should no longer know about the current next tuple > to return, nor about mark/restore. It is no longer directly in control > of the scan's progress. It loses all context that survives across API > calls. > I'm lost. How could the index AM not know about mark/restore? >> Yes, splitting _bt_steppage() like this makes sense to me, and I agree >> being able to proceed to the next page before we're done with the >> current page seems perfectly reasonable for batches spanning multiple >> leaf pages. > > I think that it's entirely possible that it'll just be easier to do > things this way from the start. I understand that that may be far from > obvious right now, but, again, I just don't see what's so special > about the way that each index AM batches results. What about that it > is so hard to generalize across index AMs that must support amgettuple > right now? (At least in the case of nbtree and hash, which have no > special requirements for things like KNN-GiST.) > I don't think the batching in various AMs is particularly unique, that's true. But my goal was to wrap that in a single amgetbatch callback, because that seemed natural, and that moves some of the responsibilities to the AM. I still don't quite understand what API you imagine, but if we want to make more of this the responsibility of indexam.c, I guess it will require multiple smaller callbacks (I'm not opposed to that, but I also don't know if that's what you imagine). > Most individual calls to btgettuple just return the next batched-up > so->currPos tuple/TID via another call to _bt_next. Things like the > _bt_first-new-primitive-scan case don't really add any complexity -- > the core concept of processing a page at a time still applies. It > really is just a simple batching scheme, with a couple of extra fiddly > details attached to it -- but nothing too hairy. 
> True, although the details (how the batches are represented etc.) are often quite different, so did you imagine some shared structure to represent that, or wrapping that in a new callback? Or how would indexam.c work with that? > The hardest part will probably be rigorously describing the rules for > not breaking index-only scans due to concurrent TID recycling by > VACUUM, and the rules for doing _bt_killitems. But that's also not a > huge problem, in the grand scheme of things. > It probably is not a huge problem ... for someone who's already familiar with the rules, at least intuitively. But TBH this part really scares me a little bit. >> Hmm. I've intentionally tried to ignore these issues, or rather to limit >> the scope of the patch so that v1 does not require dealing with it. >> Hence the restriction to single-leaf batches, for example. >> >> But I guess I may have to look at this after all ... not great. > > To be clear, I don't think that you necessarily have to apply these > capabilities in v1 of this project. I would be satisfied if the patch > could just break things out in the right way, so that some later patch > could improve things later on. I only really want to see the > capabilities within the index AM decomposed, such that one central > place can see a global view of the costs and benefits of the index > scan. > Yes, I understand that. Getting the overall design right is my main concern, even if some of the advanced stuff is not implemented until later. But with the wrong design, that may turn out to be difficult. That's the feedback I was hoping for when I kept bugging you, and this discussion was already very useful in this regard. Thank you for that. > You should be able to validate the new API by stress-testing the code. > You can make the index AM read several leaf pages at a time when a > certain debug mode is enabled. 
Once you prove that the index AM > correctly performs the same processing as today correctly, without any > needless restrictions on the ordering that these decomposed operators > perform (only required restrictions that are well explained and > formalized), then things should be on the right path. > Yeah, stress testing is my primary tool ... >>> ISTM that the central feature of the new API should be the ability to >>> reorder certain kinds of work. There will have to be certain >>> constraints, of course. Sometimes these will principally be problems >>> for the table AM (e.g., we musn't allow concurrent TID recycling >>> unless it's for a plain index scan using an MVCC snapshot), other >>> times they're principally problems for the index AM (e.g., the >>> _bt_killitems safety issues). >>> >> >> Not sure. By "new API" you mean the read stream API, or the index AM API >> to allow batching? > > Right now those two concepts seem incredibly blurred to me. > Same here. >>> I get that you're not that excited about multi-page batches; it's not >>> the priority. Fair enough. I just think that the API needs to work in >>> terms of batches that are sized as one or more pages, in order for it >>> to make sense. >>> >> >> True, but isn't that already the case? I mean, what exactly prevents an >> index AM to "build" a batch for multiple leaf pages? The current patch >> does not implement that for any of the AMs, true, but isn't that already >> possible if the AM chooses to? > > That's unclear, but overall I'd say no. > > The index AM API says that they need to hold on to a buffer pin to > avoid confusing scans due to concurrent TID recycling by VACUUM. The > index AM API fails to adequately describe what is expected here. And > it provides no useful context for larger batching of index pages. > nbtree already does its own thing by dropping leaf page pins > selectively. > Not sure I understand. 
I imagined the index AM would just read a sequence of leaf pages, keeping all the same pins etc. just like it does for the one leaf it reads right now (pins, etc.). I'm probably too dumb for that, but I still don't quite understand how that's different from just reading and processing that sequence of leaf pages by amgettuple without batching. > Whether or not it's technically possible is a matter of interpretation > (I came down on the "no" side, but it's still ambiguous). I would > prefer it if the index AM API was much simpler for ordered scans. As I > said already, something along the lines of "when you're told to scan > the next index page, here's how we'll call you, here's the data > structure that you need to fill up". Or "when we tell you that we're > done fetching tuples from a recently read index page, here's how we'll > call you". > I think this is pretty much "two-callback API" I mentioned earlier. > These discussions about where the exact boundaries lie don't seem very > helpful. The simple fact is that nobody is ever going to invent an > index AM side interface that batches up more than a single leaf page. > Why would they? It just doesn't make sense to, since the index AM has > no idea about certain clearly-relevant context. For example, it has no > idea whether or not there's a LIMIT involved. > > The value that comes from using larger batches on the index AM side > comes from making life easier for heap prefetching, which index AMs > know nothing about whatsoever. Again, the goal should be to marry > information from the index AM and the table AM in one central place. > True, although the necessary context could be passed to the index AM in some way. That's what happens in the current patch, where indexam.c could size the batch just right for a LIMIT clause, before asking the index AM to fill it with items. >> Unfortunately that didn't work because of killtuples etc. 
because the >> index AM had no idea about the indexam queue and has it's own concept of >> "current item", so it was confused about which item to mark as killed. >> And that old item might even be from an earlier leaf page (not the >> "current" currPos). > > Currently, during a call to btgettuple, so->currPos.itemIndex is > updated within _bt_next. But before _bt_next is called, > so->currPos.itemIndex indicates the item returned by the most recent > prior call to btgettuple -- which is also the tuple that the > scan->kill_prior_tuple reports on. In short, btgettuple does some > trivial things to remember which entries from so->currPos ought to be > marked dead later on due to the scan->kill_prior_tuple flag having > been set for those entries. This can be moved outside of each index > AM. > > The index AM shouldn't need to use a scan->kill_prior_tuple style flag > under the new batching API at all, though. It should work at a higher > level than that. The index AM should be called through a callback that > tells it to drop the pin on a page that the table AM has been reading > from, and maybe perform _bt_killitems on these relevant known-dead > TIDs first. In short, all of the bookkeeping for so->killedItems[] > should be happening at a completely different layer. And the > so->killedItems[] structure should be directly associated with a > single index page subset of a batch (a subset similar to the current > so->currPos batches). > > The first time the index AM sees anything about dead TIDs, it should > see a whole leaf page worth of them. > I need to think about this a bit, but I agree passing this information to an index AM through the kill_prior_tuple seems weird. >> I was thinking maybe the AM could keep the leaf pages, and then free >> them once they're no longer needed. But it wasn't clear to me how to >> exchange this information between indexam.c and the index AM, because >> right now the AM only knows about a single (current) position. 
> > I'm imagining a world in which the index AM doesn't even know about > the current position. Basically, it has no real context about the > progress of the scan to maintain at all. It merely does what it is > told by some higher level, that is sensitive to the requirements of > both the index AM and the table AM. > Hmmm, OK. If the idea is to just return a leaf page as an array of items (in some fancy way) to indexam.c, then it'd be indexam.c responsible for tracking what the current position (or multiple positions are), I guess. >> But imagine we have this: >> >> a) A way to switch the scan into "batch" mode, where the AM keeps the >> leaf page (and a way for the AM to indicate it supports this). > > I don't think that there needs to be a batch mode. There could simply > be the total absence of batching, which is one point along a > continuum, rather than a discrete mode. > >> b) Some way to track two "positions" in the scan - one for read, one for >> prefetch. I'm not sure if this would be internal in each index AM, or at >> the indexam.c level. > > I think that it would be at the indexam.c level. > Yes, if the index AM returns page as a set of items, then it'd be up to indexam.c to maintain all this information. >> c) A way to get the index tuple for either of the two positions (and >> advance the position). It might be a flag for amgettuple(), or maybe >> even a callaback for the "prefetch" position. > > Why does the index AM need to know anything about the fact that the > next tuple has been requested? Why can't it just be 100% ignorant of > all that? (Perhaps barring a few special cases, such as KNN-GiST > scans, which continue to use the legacy amgettuple interface.) > Well, I was thinking about how it works now, for the "current" position. And I was thinking about how would it need to change to handle the prefetch position too, in the same way ... 
But if you're suggesting to move this logic and context to the upper layer indexam.c, that changes things ofc. >> d) A way to inform the AM items up to some position are no longer >> needed, and thus the leaf pages can be cleaned up and freed. AFAICS it >> could always be "up to the current read position". > > Yeah, I like this idea. But the index AM doesn't need to know about > positions and whatnot. It just needs to do what it's told: to drop the > pin, and maybe to perform _bt_killitems first. Or maybe just to drop > the pin, with instruction to do _bt_killitems coming some time later > (the index AM will need to be a bit more careful within its > _bt_killitems step when this happens). > Well, if the AM works with "batches of tuples for a leaf page" (through the two callbacks to read / release a page), then positions to exact items are no longer needed. It just needs to know which pages are still needed, etc. Correct? > The index AM doesn't need to drop the current pin for the current > position -- not as such. The index AM doesn't directly know about what > pins are held, since that'll all be tracked elsewhere. Again, the > index AM should need to hold onto zero context, beyond the immediate > request to perform one additional unit of work, which will > usually/always happen at the index page level (all of which is tracked > by data structures that are under the control of the new indexam.c > level). > No idea. > I don't think that it'll ultimately be all that hard to schedule when > and how index pages are read from outside of the index AM in question. > In general all relevant index AMs already work in much the same way > here. Maybe we can ultimately invent a way for the index AM to > influence that scheduling, but that might never be required. > I haven't thought about scheduling at all. Maybe there's something we could improve in the future, but I don't see what would it look like, and it seems unrelated to this patch. 
>> Does that sound reasonable / better than the current approach, or have I >> finally reached the "raving lunatic" stage? > > The stage after "raving lunatic" is enlightenment. :-) > That's my hope. regards -- Tomas Vondra
On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas@vondra.me> wrote: > Not sure I understand, but I think I'm somewhat confused by "index AM" > vs. indexam. Are you suggesting the individual index AMs should know as > little about the batching as possible, and instead it should be up to > indexam.c to orchestrate most of the stuff? Yes, that's what I'm saying. Knowing "as little as possible" turns out to be pretty close to knowing nothing at all. There might be some minor exceptions, such as the way that nbtree needs to remember the scan's array keys. But that already works in a way that's very insensitive to the exact position in the scan. For example, right now if you restore a mark that doesn't just come from the existing so->currPos batch then we cheat and reset the array keys. > If yes, then I agree in principle, and I think indexam.c is the right > place to do that (or at least I can't think of a better one). Good. > That's what the current patch aimed to do, more or less. I'm not saying > it got it perfectly right, and I'm sure there is stuff that can be > improved (like reworking _steppage to not deal with killed tuples). But > surely the index AMs need to have some knowledge about batching, because > how else would it know which leaf pages to still keep, etc? I think that your new thing can directly track which leaf pages have pins. As well as tracking the order that it has to return tuples from among those leaf page batch subsets. Your new thing can think about this in very general terms, that really aren't tied to any index AM specifics. It'll have some general notion of an ordered sequence of pages (in scan/key space order), each of which contains one or more tuples to return. It needs to track which pages have tuples that we've already done all the required visibility checks for, in order to be able to instruct the index AM to drop the pin. 
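One possible shape for that higher-level bookkeeping, sketched in plain C. Everything here (BatchState, PageGroup, ScanPos, pos_next, release_consumed_pages) is a hypothetical name invented for illustration, not an existing PostgreSQL structure, and integers stand in for TIDs and pins. The sketch keeps the pages in scan order, tracks a read position and a prefetch position that both cross page boundaries, and drops pins only on pages the read position has already moved past.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PAGES 4
#define MAX_ITEMS 8

/* Hypothetical: a position is (index into pages[], index into tids[]). */
typedef struct ScanPos
{
    int page;
    int item;
} ScanPos;

/* Hypothetical: the tuples returned from one leaf page. */
typedef struct PageGroup
{
    int  nitems;
    int  tids[MAX_ITEMS];   /* stand-in for heap TIDs */
    bool pinned;            /* tracked here, outside the index AM */
} PageGroup;

/* Hypothetical: the whole batch, in scan/key space order. */
typedef struct BatchState
{
    int       npages;
    PageGroup pages[MAX_PAGES];
    ScanPos   read;      /* next item to hand to the executor */
    ScanPos   prefetch;  /* next item to hand to the prefetcher */
} BatchState;

/*
 * Return the next TID at *pos and advance it, stepping over page
 * boundaries as needed. Returns false once the batch is exhausted.
 */
static bool
pos_next(const BatchState *bs, ScanPos *pos, int *tid)
{
    while (pos->page < bs->npages)
    {
        const PageGroup *pg = &bs->pages[pos->page];

        if (pos->item < pg->nitems)
        {
            *tid = pg->tids[pos->item++];
            return true;
        }
        pos->page++;
        pos->item = 0;
    }
    return false;
}

/*
 * Drop pins on every page strictly before the read position's page.
 * Real code would call into the index AM here; this only flips a flag.
 * Returns the number of pages released by this call.
 */
static int
release_consumed_pages(BatchState *bs)
{
    int released = 0;

    for (int i = 0; i < bs->read.page && i < bs->npages; i++)
    {
        if (bs->pages[i].pinned)
        {
            bs->pages[i].pinned = false;
            released++;
        }
    }
    return released;
}
```

Note that in this sketch a page counts as consumed only once the read position has stepped onto a later page, which is deliberately conservative; the prefetch position running ahead never causes a release.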
Suppose, for example, that we're doing an SAOP index scan, where the leaf pages that our multi-page batch consists of aren't direct siblings. That literally doesn't matter at all. The pages still have to be in the same familiar key space/scan order, regardless. And that factor shouldn't really need to influence how many pins we're willing to hold on to (no more than it would when there are large numbers of index leaf pages with no interesting tuples to return that we must still scan over). > >> 3) It makes it clear when the items are no longer needed, and the AM can > >> do cleanup. process kill tuples, etc. > > > > But it doesn't, really. The index AM is still subject to exactly the > > same constraints in terms of page-at-a-time processing. These existing > > constraints always came from the table AM side, so it's not as if your > > patch can remain totally neutral on these questions. > > > > Not sure I understand. Which part of my sentence you disagree with? Or > what constraints you mean? What I was saying here was something I said more clearly a bit further down: it's technically possible to do multi-page batches within the confines of the current index AM API, but that's not true in any practical sense. And it'll never be true with an API that looks very much like the current amgettuple API. > The interface does not require page-at-a-time processing - the index AM > is perfectly within it's rights to produce a batch spanning 10 leaf > pages, as long as it keeps track of them, and perhaps keeps some mapping > of items (returned in the batch) to leaf pages. So that when the next > batch is requested, it can do the cleanup, and move to the next batch. How does an index AM actually do that in a way that's useful? It only sees a small part of the picture. That's why it's the wrong place for it. 
> > Basically, it looks like you've invented a shadow batching interface > > that is technically not known to the index AM, but nevertheless > > coordinates with the existing so->currPos batching interface. > > > > Perhaps, but which part of that you consider a problem? Are you saying > this shouldn't use the currPos stuff at all, and instead do stuff in > some other way? I think that you should generalize the currPos stuff, and move it to some other, higher level module. > Does that mean not having a simple amgetbatch() callback, but some finer > grained interface? Or maybe one callback that returns the next "AM page" > (essentially the currPos), and then another callback to release it? > > (This is what I mean by "two-callback API" later.) I'm not sure. Why does the index AM need to care about the batch size at all? It merely needs to read the next leaf page. The high level understanding of batches and the leaf pages that constitute batches lives elsewhere. The nbtree code will know about buffer pins held, in the sense that it'll be the one setting the Buffer variables in the new scan descriptor thing. But it's not going to remember to drop those buffer pins on its own. It'll need to be told. So it's not ever really in control. > > The index AM itself should no longer know about the current next tuple > > to return, nor about mark/restore. It is no longer directly in control > > of the scan's progress. It loses all context that survives across API > > calls. > > > > I'm lost. How could the index AM not know about mark/restore? Restoring a mark already works by restoring an earlier so->currPos batch. Actually, more often it works by storing an offset into the current so->currPos, without actually copying anything into so->markPos, and without restoring so->markPos into so->currPos. In short, there is virtually nothing about how mark/restore works that really needs to live inside nbtree. It's all just restoring an earlier batch and/or offset into a batch. 
The only minor caveat is the stuff about array keys that I went into already -- that isn't quite a piece of state that lives in so->currPos, but it's a little bit like that. You can probably poke one or two more minor holes in some of this -- it's not 100% trivial. But it's doable. > I don't think the batching in various AMs is particularly unique, that's > true. But my goal was to wrap that in a single amgetbatch callback, > because that seemed natural, and that moves some of the responsibilities > to the AM. Why is it natural? I mean all of the index AMs that support amgettuple copied everything from nbtree already. Including all of the kill_prior_tuple stuff. It's already quite generic. > I still don't quite understand what API you imagine, but if > we want to make more of this the responsibility of indexam.c, I guess it > will require multiple smaller callbacks (I'm not opposed to that, but I > also don't know if that's what you imagine). I think that you understood me correctly here. > > Most individual calls to btgettuple just return the next batched-up > > so->currPos tuple/TID via another call to _bt_next. Things like the > > _bt_first-new-primitive-scan case don't really add any complexity -- > > the core concept of processing a page at a time still applies. It > > really is just a simple batching scheme, with a couple of extra fiddly > > details attached to it -- but nothing too hairy. > > > > True, although the details (how the batches are represented etc.) are > often quite different, so did you imagine some shared structure to > represent that, or wrapping that in a new callback? In what sense are they sometimes different? In general batches will consist of one or more groups of tuples, each of which is associated with a particular leaf page (if the scan returns no tuples for a given scanned leaf page then it won't form a part of the final batch). You can do amgettuple style scrolling back and forth with this structure, across page boundaries.
Seems pretty general to me. > Yes, I understand that. Getting the overall design right is my main > concern, even if some of the advanced stuff is not implemented until > later. But with the wrong design, that may turn out to be difficult. > > That's the feedback I was hoping for when I kept bugging you, and this > discussion was already very useful in this regard. Thank you for that. I don't want to insist on doing all this. But it just seems really weird to have this shadow batching system for the so->currPos batches. > > The index AM API says that they need to hold on to a buffer pin to > > avoid confusing scans due to concurrent TID recycling by VACUUM. The > > index AM API fails to adequately describe what is expected here. And > > it provides no useful context for larger batching of index pages. > > nbtree already does its own thing by dropping leaf page pins > > selectively. > > > > Not sure I understand. I imagined the index AM would just read a > sequence of leaf pages, keeping all the same pins etc. just like it does > for the one leaf it reads right now (pins, etc.). Right. But it wouldn't necessarily drop the leaf pages right away. It might try to coalesce together multiple heap page accesses, for index tuples that happen to span page boundaries (but are part of the same higher level batch). > I'm probably too dumb for that, but I still don't quite understand how > that's different from just reading and processing that sequence of leaf > pages by amgettuple without batching. It's not so much different, as just more flexible. It's possible that v1 would effectively do exactly the same thing in practice. It'd only be able to do fancier things with holding onto leaf pages in a debug build, that validated the general approach. > True, although the necessary context could be passed to the index AM in > some way. 
That's what happens in the current patch, where indexam.c > could size the batch just right for a LIMIT clause, before asking the > index AM to fill it with items. What difference does it make where it happens? It might make some difference, but as I keep saying, the important point is that *somebody* has to know all of these things at the same time. > I need to think about this a bit, but I agree passing this information > to an index AM through the kill_prior_tuple seems weird. Right. Because it's a tuple-at-a-time interface, which isn't suitable for the direction you want to take things in. > Hmmm, OK. If the idea is to just return a leaf page as an array of items > (in some fancy way) to indexam.c, then it'd be indexam.c responsible for > tracking what the current position (or multiple positions are), I guess. Right. It would have to have some basic idea of the laws-of-physics underlying the index scan. It would have to sensibly limit the number of index page buffer pins held at any given time. > > Why does the index AM need to know anything about the fact that the > > next tuple has been requested? Why can't it just be 100% ignorant of > > all that? (Perhaps barring a few special cases, such as KNN-GiST > > scans, which continue to use the legacy amgettuple interface.) > > > > Well, I was thinking about how it works now, for the "current" position. > And I was thinking about how would it need to change to handle the > prefetch position too, in the same way ... > > But if you're suggesting to move this logic and context to the upper > layer indexam.c, that changes things ofc. Yes, I am suggesting that. > Well, if the AM works with "batches of tuples for a leaf page" (through > the two callbacks to read / release a page), then positions to exact > items are no longer needed. It just needs to know which pages are still > needed, etc. Correct? Right, correct. 
> > I don't think that it'll ultimately be all that hard to schedule when > > and how index pages are read from outside of the index AM in question. > > In general all relevant index AMs already work in much the same way > > here. Maybe we can ultimately invent a way for the index AM to > > influence that scheduling, but that might never be required. > > > > I haven't thought about scheduling at all. Maybe there's something we > could improve in the future, but I don't see what would it look like, > and it seems unrelated to this patch. It's only related to this patch in the sense that we have to imagine that it'll be worth having in some form in the future. It might also be a good exercise architecturally. We don't need to do the same thing in several slightly different ways in each index AM. -- Peter Geoghegan
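To make the shape of that proposal a bit more concrete, here is a minimal, self-contained sketch of a generalized "currPos" batch that lives above the index AM. Everything in it is hypothetical and only for illustration: the names (ToyBatch, ToyAmRoutine, toy_getnext and so on), the fixed array sizes, and the toy AM itself are assumptions, not anything from the patch or from nbtree. It assumes the AM exposes just two callbacks (read the next leaf page, release a previously returned one), and that the upper layer owns the current position and the mark. Real code would also have to deal with array keys, kill_prior_tuple and scan direction, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ITEMS   8
#define TOY_MAX_BATCHES 4

/* One leaf page's worth of matching items: a generalized "currPos". */
typedef struct ToyBatch
{
    int  leaf_page;             /* leaf page the items came from */
    bool pinned;                /* is a pin still held on our behalf? */
    int  nitems;
    int  items[TOY_MAX_ITEMS];  /* stand-in for heap TIDs */
} ToyBatch;

/* Hypothetical two-callback AM interface: read a leaf page, release one. */
typedef struct ToyAmRoutine
{
    bool (*read_next) (void *am_state, ToyBatch *out);
    void (*release) (void *am_state, ToyBatch *batch);
} ToyAmRoutine;

/* A position is just (batch, offset). The AM never sees it. */
typedef struct ToyPos { int batch; int item; } ToyPos;

typedef struct ToyScan
{
    const ToyAmRoutine *am;
    void    *am_state;
    ToyBatch batches[TOY_MAX_BATCHES];
    int      nbatches;
    ToyPos   cur;
    ToyPos   mark;
    bool     have_mark;
} ToyScan;

/* Mark/restore only saves and restores a position in the upper layer. */
void toy_markpos(ToyScan *scan) { scan->mark = scan->cur; scan->have_mark = true; }
void toy_restrpos(ToyScan *scan) { if (scan->have_mark) scan->cur = scan->mark; }

/* Return the next item in scan order, loading leaf pages on demand. */
bool toy_getnext(ToyScan *scan, int *item)
{
    for (;;)
    {
        if (scan->cur.batch < scan->nbatches)
        {
            ToyBatch *b = &scan->batches[scan->cur.batch];

            if (scan->cur.item < b->nitems)
            {
                *item = b->items[scan->cur.item++];
                return true;
            }
            /* Batch exhausted: tell the AM to release it, unless a mark
             * at or before this batch might bring the scan back here. */
            if (b->pinned &&
                !(scan->have_mark && scan->mark.batch <= scan->cur.batch))
            {
                scan->am->release(scan->am_state, b);
                b->pinned = false;
            }
            scan->cur.batch++;
            scan->cur.item = 0;
            continue;
        }
        if (scan->nbatches == TOY_MAX_BATCHES)
            return false;       /* toy limit; a real queue would recycle slots */
        if (!scan->am->read_next(scan->am_state, &scan->batches[scan->nbatches]))
            return false;       /* no more leaf pages */
        assert(scan->batches[scan->nbatches].nitems <= TOY_MAX_ITEMS);
        scan->batches[scan->nbatches++].pinned = true;
    }
}

/* Toy AM: npages leaf pages, two items each (page*10 and page*10 + 1). */
typedef struct ToyIndex { int next_page; int npages; int released; } ToyIndex;

bool toy_read_next(void *state, ToyBatch *out)
{
    ToyIndex *idx = state;

    if (idx->next_page >= idx->npages)
        return false;
    out->leaf_page = idx->next_page;
    out->nitems = 2;
    out->items[0] = idx->next_page * 10;
    out->items[1] = idx->next_page * 10 + 1;
    idx->next_page++;
    return true;
}

void toy_release(void *state, ToyBatch *batch)
{
    (void) batch;
    ((ToyIndex *) state)->released++;
}
```

The point of the sketch is only that the toy AM keeps no notion of "current tuple" or of the mark: it reads a page when asked and drops its pin when told. Whether a real AM can be reduced that far is exactly what is being debated in this thread.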
On 11/8/24 02:35, Peter Geoghegan wrote: > On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas@vondra.me> wrote: >> Not sure I understand, but I think I'm somewhat confused by "index AM" >> vs. indexam. Are you suggesting the individual index AMs should know as >> little about the batching as possible, and instead it should be up to >> indexam.c to orchestrate most of the stuff? > > Yes, that's what I'm saying. Knowing "as little as possible" turns out > to be pretty close to knowing nothing at all. > > There might be some minor exceptions, such as the way that nbtree > needs to remember the scan's array keys. But that already works in a > way that's very insensitive to the exact position in the scan. For > example, right now if you restore a mark that doesn't just come from > the existing so->currPos batch then we cheat and reset the array keys. > >> If yes, then I agree in principle, and I think indexam.c is the right >> place to do that (or at least I can't think of a better one). > > Good. > >> That's what the current patch aimed to do, more or less. I'm not saying >> it got it perfectly right, and I'm sure there is stuff that can be >> improved (like reworking _steppage to not deal with killed tuples). But >> surely the index AMs need to have some knowledge about batching, because >> how else would it know which leaf pages to still keep, etc? > > I think that your new thing can directly track which leaf pages have > pins. As well as tracking the order that it has to return tuples from > among those leaf page batch subsets. > > Your new thing can think about this in very general terms, that really > aren't tied to any index AM specifics. It'll have some general notion > of an ordered sequence of pages (in scan/key space order), each of > which contains one or more tuples to return. It needs to track which > pages have tuples that we've already done all the required visibility > checks for, in order to be able to instruct the index AM to drop the > pin. 
> Is it a good idea to make this part (in indexam.c) aware of / responsible for managing stuff like pins? Perhaps it'd work fine for index AMs that always return an array of items for a single leaf-page (like btree or hash). But I'm still thinking about cases like gist with ORDER BY clauses, or maybe something even weirder in custom AMs. It seems to me knowing which pages may be pinned is very AM-specific knowledge, and my intention was to let the AM to manage that. That is, the new indexam code would be responsible for deciding when the "AM batches" are loaded and released, using the two new callbacks. But it'd be the AM responsible for making sure everything is released. > Suppose, for example, that we're doing an SAOP index scan, where the > leaf pages that our multi-page batch consists of aren't direct > siblings. That literally doesn't matter at all. The pages still have > to be in the same familiar key space/scan order, regardless. And that > factor shouldn't really need to influence how many pins we're willing > to hold on to (no more than it would when there are large numbers of > index leaf pages with no interesting tuples to return that we must > still scan over). > I agree that in the simple cases it's not difficult to determine what pins we need for the sequence of tuples/pages. But is it guaranteed to be that easy, and is it easy to communicate this information to the indexam.c layer? I'm not sure about that. In an extreme case it may be that each tuple comes from entirely different leaf page, and stuff like that. And while most out-of-core AMs that I'm aware of are rather close to nbtree/gist/gin, I wonder what weird things can be out there. >>>> 3) It makes it clear when the items are no longer needed, and the AM can >>>> do cleanup. process kill tuples, etc. >>> >>> But it doesn't, really. The index AM is still subject to exactly the >>> same constraints in terms of page-at-a-time processing. 
These existing >>> constraints always came from the table AM side, so it's not as if your >>> patch can remain totally neutral on these questions. >>> >> >> Not sure I understand. Which part of my sentence you disagree with? Or >> what constraints you mean? > > What I was saying here was something I said more clearly a bit further > down: it's technically possible to do multi-page batches within the > confines of the current index AM API, but that's not true in any > practical sense. And it'll never be true with an API that looks very > much like the current amgettuple API. > OK >> The interface does not require page-at-a-time processing - the index AM >> is perfectly within it's rights to produce a batch spanning 10 leaf >> pages, as long as it keeps track of them, and perhaps keeps some mapping >> of items (returned in the batch) to leaf pages. So that when the next >> batch is requested, it can do the cleanup, and move to the next batch. > > How does an index AM actually do that in a way that's useful? It only > sees a small part of the picture. That's why it's the wrong place for > it. > Sure, maybe it'd need some more information - say, how many items we expect to read, but if indexam knows that bit, surely it can pass it down to the AM. But yeah, I agree doing it in amgettuple() would be inconvenient and maybe even awkward. I can imagine the AM maintaining an array of currPos, but then it'd also need to be made aware of multiple positions, and stuff like that. Which it shouldn't need to know about. >>> Basically, it looks like you've invented a shadow batching interface >>> that is technically not known to the index AM, but nevertheless >>> coordinates with the existing so->currPos batching interface. >>> >> >> Perhaps, but which part of that you consider a problem? Are you saying >> this shouldn't use the currPos stuff at all, and instead do stuff in >> some other way? 
> > I think that you should generalize the currPos stuff, and move it to > some other, higher level module. > By generalizing you mean defining a common struct serving the same purpose, but for all the index AMs? And the new AM callbacks would produce/consume this new struct, right? >> Does that mean not having a simple amgetbatch() callback, but some finer >> grained interface? Or maybe one callback that returns the next "AM page" >> (essentially the currPos), and then another callback to release it? >> >> (This is what I mean by "two-callback API" later.) > > I'm not sure. Why does the index AM need to care about the batch size > at all? It merely needs to read the next leaf page. The high level > understanding of batches and the leaf pages that constitute batches > lives elsewhere. > I don't think I suggested the index AM would need to know about the batch size. Only indexam.c would be aware of that, and would read enough stuff from the index to satisfy that. > The nbtree code will know about buffer pins held, in the sense that > it'll be the one setting the Buffer variables in the new scan > descriptor thing. But it's not going to remember to drop those buffer > pins on its own. It'll need to be told. So it's not ever really in > control. > Right. So those pins would be released after indexam invokes the second new callback, instructing the index AM to release everything associated with a chunk of items returned sometime earlier. >>> The index AM itself should no longer know about the current next tuple >>> to return, nor about mark/restore. It is no longer directly in control >>> of the scan's progress. It loses all context that survives across API >>> calls. >>> >> >> I'm lost. How could the index AM not know about mark/restore? > > Restoring a mark already works by restoring an earlier so->currPos > batch. 
Actually, more often it works by storing an offset into the > current so->currPos, without actually copying anything into > so->markPos, and without restoring so->markPos into so->currPos. > > In short, there is virtually nothing about how mark/restore works that > really needs to live inside nbtree. It's all just restoring an earlier > batch and/or offset into a batch. The only minor caveat is the stuff > about array keys that I went into already -- that isn't quite a piece > of state that lives in so->currPos, but it's a little bit like that. > > You can probably poke one or two more minor holes in some of this -- > it's not 100% trivial. But it's doable. > OK. The thing that worries me is whether it's going to be this simple for other AMs. Maybe it is, I don't know. >> I don't think the batching in various AMs is particularly unique, that's >> true. But my goal was to wrap that in a single amgetbatch callback, >> because that seemed natural, and that moves some of the responsibilities >> to the AM. > > Why is it natural? I mean all of the index AMs that support amgettuple > copied everything from nbtree already. Including all of the > kill_prior_tuple stuff. It's already quite generic. > I don't recall my reasoning, and I'm not saying it was the right instinct. But if we have one callback to read tuples, it seemed like maybe we should have one callback to read a bunch of tuples in a similar way. >> I still don't quite understand what API you imagine, but if >> we want to make more of this the responsibility of indexam.c, I guess it >> will require multiple smaller callbacks (I'm not opposed to that, but I >> also don't know if that's what you imagine). > > I think that you understood me correctly here. > >>> Most individual calls to btgettuple just return the next batched-up >>> so->currPos tuple/TID via another call to _bt_next.
Things like the >>> _bt_first-new-primitive-scan case don't really add any complexity -- >>> the core concept of processing a page at a time still applies. It >>> really is just a simple batching scheme, with a couple of extra fiddly >>> details attached to it -- but nothing too hairy. >>> >> >> True, although the details (how the batches are represented etc.) are >> often quite different, so did you imagine some shared structure to >> represent that, or wrapping that in a new callback? > > In what sense are they sometimes different? > > In general batches will consist of one or more groups of tuples, each > of which is associated with a particular leaf page (if the scan > returns no tuples for a given scanned leaf page then it won't form a > part of the final batch). You can do amgettuple style scrolling back > and forth with this structure, across page boundaries. Seems pretty > general to me. > I meant that each of the AMs uses a separate typedef, with different fields, etc. I'm sure there are similarities (it's always an array of elements, either TIDs, index or heap tuples, or some combination of that). But maybe there is stuff unique to some AMs - chances are that can be either "generalized" or extended using some private member. >> Yes, I understand that. Getting the overall design right is my main >> concern, even if some of the advanced stuff is not implemented until >> later. But with the wrong design, that may turn out to be difficult. >> >> That's the feedback I was hoping for when I kept bugging you, and this >> discussion was already very useful in this regard. Thank you for that. > > I don't want to insist on doing all this. But it just seems really > weird to have this shadow batching system for the so->currPos batches. > >>> The index AM API says that they need to hold on to a buffer pin to >>> avoid confusing scans due to concurrent TID recycling by VACUUM. The >>> index AM API fails to adequately describe what is expected here. 
And >>> it provides no useful context for larger batching of index pages. >>> nbtree already does its own thing by dropping leaf page pins >>> selectively. >>> >> >> Not sure I understand. I imagined the index AM would just read a >> sequence of leaf pages, keeping all the same pins etc. just like it does >> for the one leaf it reads right now (pins, etc.). > > Right. But it wouldn't necessarily drop the leaf pages right away. It > might try to coalesce together multiple heap page accesses, for index > tuples that happen to span page boundaries (but are part of the same > higher level batch). > No opinion, but it's not clear to me how exactly would this work. I've imagined we'd just acquire (and release) multiple pins as we go. >> I'm probably too dumb for that, but I still don't quite understand how >> that's different from just reading and processing that sequence of leaf >> pages by amgettuple without batching. > > It's not so much different, as just more flexible. It's possible that > v1 would effectively do exactly the same thing in practice. It'd only > be able to do fancier things with holding onto leaf pages in a debug > build, that validated the general approach. > >> True, although the necessary context could be passed to the index AM in >> some way. That's what happens in the current patch, where indexam.c >> could size the batch just right for a LIMIT clause, before asking the >> index AM to fill it with items. > > What difference does it make where it happens? It might make some > difference, but as I keep saying, the important point is that > *somebody* has to know all of these things at the same time. > Agreed. >>> I don't think that it'll ultimately be all that hard to schedule when >>> and how index pages are read from outside of the index AM in question. >>> In general all relevant index AMs already work in much the same way >>> here. 
Maybe we can ultimately invent a way for the index AM to >>> influence that scheduling, but that might never be required. >>> >> >> I haven't thought about scheduling at all. Maybe there's something we >> could improve in the future, but I don't see what would it look like, >> and it seems unrelated to this patch. > > It's only related to this patch in the sense that we have to imagine > that it'll be worth having in some form in the future. > > It might also be a good exercise architecturally. We don't need to do > the same thing in several slightly different ways in each index AM. > Could you briefly outline how you think this might interact with the scheduling of index page reads? I can imagine telling someone about which future index pages we might need to read (say, the next leaf page), or something like that. But this patch is about prefetching the heap pages it seems like an entirely independent thing. And ISTM there are concurrency challenges with prefetching index pages (at least when leveraging read stream API to do async reads). regards -- Tomas Vondra
On Sun, Nov 10, 2024 at 4:41 PM Tomas Vondra <tomas@vondra.me> wrote: > Is it a good idea to make this part (in indexam.c) aware of / > responsible for managing stuff like pins? My sense is that that's the right long term architectural direction. I can't really prove it. > Perhaps it'd work fine for > index AMs that always return an array of items for a single leaf-page > (like btree or hash). But I'm still thinking about cases like gist with > ORDER BY clauses, or maybe something even weirder in custom AMs. Nothing is perfect. What you really have to worry about not supporting is index AMs that implement amgettuple -- AMs that aren't quite a natural fit for this. At least for in-core index AMs that's really just GiST (iff KNN GiST is in use, which it usually isn't) plus SP-GiST. AFAIK most out-of-core index AMs only support lossy index scans in practice. Just limiting yourself to that makes an awful lot of things easier. For example I think that GIN gets away with a lot by only supporting lossy scans -- there's a comment above ginInsertCleanup() that says "On first glance it looks completely not crash-safe", but stuff like that is automatically okay with lossy scans. So many index AMs automatically don't need to be considered here at all. > It seems to me knowing which pages may be pinned is very AM-specific > knowledge, and my intention was to let the AM to manage that. This is useful information, because it helps me to understand how you're viewing this. I totally disagree with this characterization. This is an important difference in perspective. IMV index AMs hardly care at all about holding onto buffer pins, very much unlike heapam. I think that holding onto pins and whatnot has almost nothing to do with the index AM as such -- it's about protecting against unsafe concurrent TID recycling, which is a table AM/heap issue. 
You can make a rather weak argument that the index AM needs it for _bt_killitems, but that seems very secondary to me (if you go back long enough there are no _bt_killitems, but the pin thing itself still existed). As I pointed out before, the index AM API docs (at https://www.postgresql.org/docs/devel/index-locking.html) talk about holding onto buffer pins on leaf pages during amgettuple. So the need to mess around with pins just doesn't come from the index AM side, at all. The cleanup lock interlock against TID recycling protects the scan from seeing transient wrong answers -- it doesn't protect the index structure itself. The only thing that's a bit novel about what I'm proposing now is that I'm imagining that it'll be possible to eventually usefully schedule multi-leaf-page batches using code that has no more than a very general notion of how an ordered index scan works. That might turn out to be more complicated than I suppose it will now. If it is then it should still be fixable. > That is, > the new indexam code would be responsible for deciding when the "AM > batches" are loaded and released, using the two new callbacks. But it'd > be the AM responsible for making sure everything is released. What does it really mean for the index AM to be responsible for a thing? I think that the ReleaseBuffer() calls would be happening in index AM code, for sure. But that would probably always be called through your new index scan management code in practice. I don't have any fixed ideas about the resource management aspects of this. That doesn't seem particularly fundamental to the design. > I agree that in the simple cases it's not difficult to determine what > pins we need for the sequence of tuples/pages. But is it guaranteed to > be that easy, and is it easy to communicate this information to the > indexam.c layer? I think that it's fairly generic. The amount of work required to read an index page is (in very round numbers) more or less uniform across index AMs. 
Maybe you'd need to have some kind of way of measuring how many pages you had to read without returning any tuples, for scheduling purposes -- that cost is a relevant cost, and so would probably have to be tracked. But that still seems fairly general -- any kind of ordered index scan is liable to sometimes scan multiple pages without having any index tuples to return. > Sure, maybe it'd need some more information - say, how many items we > expect to read, but if indexam knows that bit, surely it can pass it > down to the AM. What are you arguing for here? Practically speaking, I think that the best way to do it is to have one layer that manages all this stuff. It would also be possible to split it up any way you can think of, but why would you want to? I'm not asking you to solve these problems. I'm only suggesting that you move things in a direction that is amenable to adding these things later on. > By generalizing you mean defining a common struct serving the same > purpose, but for all the index AMs? And the new AM callbacks would > produce/consume this new struct, right? Yes. > I don't think I suggested the index AM would need to know about the > batch size. Only indexam.c would be aware of that, and would read enough > stuff from the index to satisfy that. I don't think that you'd ultimately want to make the batch sizes fixed (though they'd probably always consist of tuples taken from 1 or more index pages). Ultimately the size would vary over time, based on competing considerations. > > The nbtree code will know about buffer pins held, in the sense that > > it'll be the one setting the Buffer variables in the new scan > > descriptor thing. But it's not going to remember to drop those buffer > > pins on its own. It'll need to be told. So it's not ever really in > > control. > > > Right.
So those pins would be released after indexam invokes the second > new callback, instructing the index AM to release everything associated > with a chunk of items returned sometime earlier. Yes. It might all look very similar to today, at least for your initial committed version. You might also want to combine reading the next page with dropping the pin on the previous page. But also maybe not. > OK. The thing that worries me is whether it's going to be this simple > for other AMs. Maybe it is, I don't know. Really? I mean if we're just talking about the subset of GiST scans that use KNN-GiST as well as SP-GiST scans not using your new facility, that seems quite acceptable to me. > I don't recall my reasoning, and I'm not saying it was the right > instinct. But if we have one callback to read tuples, it seemed like > maybe we should have one callback to read a bunch of tuples in a similar > way. The tuple-level interface will still need to exist, of course. It just won't be directly owned by affected index AMs. > I meant that each of the AMs uses a separate typedef, with different > fields, etc. I'm sure there are similarities (it's always an array of > elements, either TIDs, index or heap tuples, or some combination of > that). But maybe there is stuff unique to some AMs - chances are that > can be either "generalized" or extended using some private member. Right. Maybe it won't even be that hard to do SP-GiST and KNN-GiST index scans with this too. > No opinion, but it's not clear to me how exactly would this work. I've > imagined we'd just acquire (and release) multiple pins as we go. More experimentation is required to get good intuitions about how useful it is to reorder stuff, to make heap prefetching work best. > Could you briefly outline how you think this might interact with the > scheduling of index page reads? I can imagine telling someone about > which future index pages we might need to read (say, the next leaf > page), or something like that.
But this patch is about prefetching the > heap pages it seems like an entirely independent thing. I agree that prefetching of index pages themselves would be entirely independent (and probably much less useful). I wasn't talking about that at all, though. I was talking about the potential value in reading multiple leaf pages at a time as an enabler of heap prefetching -- to avoid "pipeline stalls" for heap prefetching, with certain workloads. The simplest example of how these two things (heap prefetching and eager leaf page reading) could be complementary is the idea of coalescing together accesses to the same heap page from TIDs that don't quite appear in order (when read from the index), but are clustered together. Not just clustered together on one leaf page -- clustered together on a few sibling leaf pages. (The exact degree to which you'd vary how many leaf pages you read at a time might need to be fully dynamic/adaptive.) We've talked about this already. Reading multiple index pages at a time could in general result in pinning/reading the same heap pages far less often. Imagine if our scan will inherently need to read a total of no more than 3 or 4 index leaf pages. Reading all of those leaf pages in one go probably doesn't add any real latency, but literally guarantees that no heap page will need to be accessed twice. So it's almost a hybrid of an index scan and bitmap index scan, offering the best of both worlds. -- Peter Geoghegan
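As a rough illustration of the coalescing idea, the sketch below compares two ways of touching heap pages for the TIDs gathered from a few leaf pages: strictly in index order (one access every time the heap block changes), versus once per distinct heap block. The ToyTid type and both function names are made up for this example and are not PostgreSQL APIs. The sketch only counts page accesses; it says nothing about how tuples would still be returned in index order, which a real implementation has to preserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a heap TID: (heap block, line pointer offset). */
typedef struct ToyTid { uint32_t block; uint16_t offset; } ToyTid;

/*
 * Heap page accesses needed when TIDs are processed strictly one at a time
 * in index order, keeping only the current heap page pinned.
 */
size_t toy_accesses_in_order(const ToyTid *tids, size_t ntids)
{
    size_t naccesses = 0;

    for (size_t i = 0; i < ntids; i++)
        if (i == 0 || tids[i].block != tids[i - 1].block)
            naccesses++;
    return naccesses;
}

/*
 * Distinct heap blocks referenced by tids[], in order of first appearance.
 * out_blocks[] must have room for ntids entries. Quadratic, which is fine
 * for a handful of leaf pages; a real implementation would want better.
 */
size_t toy_coalesce_blocks(const ToyTid *tids, size_t ntids, uint32_t *out_blocks)
{
    size_t nblocks = 0;

    assert(tids != NULL && out_blocks != NULL);
    for (size_t i = 0; i < ntids; i++)
    {
        bool seen = false;

        for (size_t j = 0; j < nblocks && !seen; j++)
            seen = (out_blocks[j] == tids[i].block);
        if (!seen)
            out_blocks[nblocks++] = tids[i].block;
    }
    return nblocks;
}
```

For instance, if three small leaf pages yield TIDs whose heap blocks are 7, 8, 7, 8, 7, 9, 9, 8, 7, the in-order count is 8 accesses while the coalesced list has only 3 blocks (7, 8, 9). The gap between those two numbers is the potential saving from reading several leaf pages before touching the heap.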
On Sun, Nov 10, 2024 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > It seems to me knowing which pages may be pinned is very AM-specific
> > knowledge, and my intention was to let the AM to manage that.
>
> This is useful information, because it helps me to understand how
> you're viewing this.
>
> I totally disagree with this characterization. This is an important
> difference in perspective. IMV index AMs hardly care at all about
> holding onto buffer pins, very much unlike heapam.
>
> I think that holding onto pins and whatnot has almost nothing to do
> with the index AM as such -- it's about protecting against unsafe
> concurrent TID recycling, which is a table AM/heap issue. You can make
> a rather weak argument that the index AM needs it for _bt_killitems,
> but that seems very secondary to me (if you go back long enough there
> are no _bt_killitems, but the pin thing itself still existed).

Much of this discussion is going over my head, but I have a comment on this part. I suppose that when any code in the system takes a pin on a buffer page, the initial concern is almost always to keep the page from disappearing out from under it. There might be a few exceptions, but hopefully not many. So I suppose what is happening here is that index AM pins an index page so that it can read that page -- and then it defers releasing the pin because of some interlocking concern. So at any given moment, there's some set of pins (possibly empty) that the index AM is holding for its own purposes, and some other set of pins (also possibly empty) that the index AM no longer requires for its own purposes but which are still required for heap/index interlocking. The second set of pins could possibly be managed in some AM-agnostic way. The AM could communicate that after the heap is done with X set of TIDs, it can unpin Y set of pages. But the first set of pins are of direct and immediate concern to the AM.

Or at least, so it seems to me. Am I confused?

--
Robert Haas
EDB: http://www.enterprisedb.com
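The "after the heap is done with X set of TIDs, it can unpin Y set of pages" idea above could be modeled roughly as follows. This is a minimal sketch and not an API from the patch or from PostgreSQL: the class `InterlockPins` and its methods are hypothetical names, index pages are plain strings, TIDs are `(heap_block, offset)` tuples, and each TID is assumed to come from exactly one index page. It covers only the second set of pins (those kept for heap/index interlocking), not pins the AM holds for its own purposes.

```python
class InterlockPins:
    """Toy AM-agnostic bookkeeping for pins held only for heap/index interlocking."""

    def __init__(self):
        self._remaining = {}  # index page -> number of its TIDs the heap still needs
        self._page_of = {}    # TID -> index page the TID was read from

    def am_returns_batch(self, index_page, tids):
        """The AM reports that `tids` were read from `index_page`."""
        tids = set(tids)
        if not tids:
            return  # nothing to interlock, so no pin needs to be retained
        self._remaining[index_page] = len(tids)
        for tid in tids:
            self._page_of[tid] = index_page

    def heap_done_with(self, tid):
        """The heap side reports it is finished with `tid`.

        Returns the index page that may now be unpinned, or None."""
        page = self._page_of.pop(tid)
        self._remaining[page] -= 1
        if self._remaining[page] == 0:
            del self._remaining[page]
            return page
        return None

    def pinned_pages(self):
        return set(self._remaining)


pins = InterlockPins()
pins.am_returns_batch("leaf-1", [(10, 1), (11, 1)])
pins.am_returns_batch("leaf-2", [(12, 1)])

first = pins.heap_done_with((10, 1))   # leaf-1 still has one outstanding TID
second = pins.heap_done_with((12, 1))  # last TID from leaf-2
third = pins.heap_done_with((11, 1))   # last TID from leaf-1
```

The point of the sketch is that the release decision depends only on which TIDs the heap has finished with, so it does not need any knowledge of the index AM's internal page format.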
On Mon, Nov 11, 2024 at 12:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think that holding onto pins and whatnot has almost nothing to do
> > with the index AM as such -- it's about protecting against unsafe
> > concurrent TID recycling, which is a table AM/heap issue. You can make
> > a rather weak argument that the index AM needs it for _bt_killitems,
> > but that seems very secondary to me (if you go back long enough there
> > are no _bt_killitems, but the pin thing itself still existed).
>
> Much of this discussion is going over my head, but I have a comment on
> this part. I suppose that when any code in the system takes a pin on a
> buffer page, the initial concern is almost always to keep the page
> from disappearing out from under it.

That almost never comes up in index AM code, though -- cases where you simply want to avoid having an index page evicted do exist, but are naturally very rare. I think that nbtree only does this during page deletion by VACUUM, since it works out to be slightly more convenient to hold onto just the pin at one point where we quickly drop and reacquire the lock.

Index AMs find very little use for pins that don't naturally coexist with buffer locks. And even the supposed exception that happens for page deletion could easily be replaced by just dropping the pin and the lock (there'd just be no point in it).

I almost think of "pin held" and "buffer lock held" as synonymous when working on the nbtree code, even though you have this one obscure page deletion case where that isn't quite true (plus the TID recycle safety business imposed by heapam). As far as protecting the structure of the index itself is concerned, holding on to buffer pins alone does not matter at all.

I have a vague recollection of hash doing something novel with cleanup locks, but I also seem to recall that that had problems -- I think that we got rid of it not too long back. In any case my mental model is that cleanup locks are for the benefit of heapam, never for the benefit of index AMs themselves. This is why we require cleanup locks for nbtree VACUUM but not nbtree page deletion, even though both operations perform precisely the same kinds of page-level modifications to the index leaf page.

> There might be a few exceptions,
> but hopefully not many. So I suppose what is happening here is that
> index AM pins an index page so that it can read that page -- and then
> it defers releasing the pin because of some interlocking concern. So
> at any given moment, there's some set of pins (possibly empty) that
> the index AM is holding for its own purposes, and some other set of
> pins (also possibly empty) that the index AM no longer requires for
> its own purposes but which are still required for heap/index
> interlocking.

That summary is correct, but FWIW I find the emphasis on index pins slightly odd from an index AM point of view.

The nbtree code virtually always calls _bt_getbuf and _bt_relbuf, as opposed to independently acquiring pins and locks -- that's why "lock" and "pin" seem almost synonymous to me in nbtree contexts. Clearly no index AM should hold onto a buffer lock for more than an instant, so my natural instinct is to wonder why you're even talking about buffer pins or buffer locks that the index AM cares about directly.

As I said to Tomas, yeah, the index AM kinda sometimes needs to hold onto a leaf page pin to be able to correctly perform _bt_killitems. But this is only because it needs to reason about concurrent TID recycling. So this is also not really any kind of exception. (_bt_killitems is even prepared to reason about cases where no pin was held at all, and has been since commit 2ed5b87f96.)

> The second set of pins could possibly be managed in some
> AM-agnostic way. The AM could communicate that after the heap is done
> with X set of TIDs, it can unpin Y set of pages. But the first set of
> pins are of direct and immediate concern to the AM.
>
> Or at least, so it seems to me. Am I confused?

I think that this is exactly what I propose to do, said in a different way. (Again, I wouldn't have expressed it in this way because it seems obvious to me that buffer pins don't have nearly the same significance to an index AM as they do to heapam -- they have no value in protecting the index structure, or helping an index scan to reason about concurrency that isn't due to a heapam issue.)

Does that make sense?

--
Peter Geoghegan
On Mon, Nov 11, 2024 at 1:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I almost think of "pin held" and "buffer lock held" as synonymous when
> working on the nbtree code, even though you have this one obscure page
> deletion case where that isn't quite true (plus the TID recycle safety
> business imposed by heapam). As far as protecting the structure of the
> index itself is concerned, holding on to buffer pins alone does not
> matter at all.

That makes sense from the point of view of working with the btree code itself, but from a system-wide perspective, it's weird to pretend like the pins don't exist or don't matter just because a buffer lock is also held. I had actually forgotten that the btree code tends to pin+lock together; now that you mention it, I remember that I knew it at one point, but it fell out of my head a long time ago...

> I think that this is exactly what I propose to do, said in a different
> way. (Again, I wouldn't have expressed it in this way because it seems
> obvious to me that buffer pins don't have nearly the same significance
> to an index AM as they do to heapam -- they have no value in
> protecting the index structure, or helping an index scan to reason
> about concurrency that isn't due to a heapam issue.)
>
> Does that make sense?

Yeah, it just really throws me for a loop that you're using "pin" to mean "pin at a time when we don't also hold a lock." The fundamental purpose of a pin is to prevent a buffer from being evicted while someone is in the middle of looking at it, and nothing that uses buffers can possibly work correctly without that guarantee. Everything you've written in parentheses there is, AFAICT, 100% wrong if you mean "any pin" and 100% correct if you mean "a pin held without a corresponding lock."

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 11, 2024 at 1:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> That makes sense from the point of view of working with the btree code
> itself, but from a system-wide perspective, it's weird to pretend like
> the pins don't exist or don't matter just because a buffer lock is
> also held.

I can see how that could cause confusion. If you're working on nbtree all day long, it becomes natural, though. Both points are true, and relevant to the discussion.

I prefer to over-communicate when discussing these points -- it's too easy to talk past each other here. I think that the precise reasons why the index AM does things with buffer pins will need to be put on a more rigorous and formalized footing with Tomas' patch. The different requirements/safety considerations will have to be carefully teased apart.

> I had actually forgotten that the btree code tends to
> pin+lock together; now that you mention it, I remember that I knew it
> at one point, but it fell out of my head a long time ago...

The same thing appears to mostly be true of hash, which mostly uses _hash_getbuf + _hash_relbuf (hash's idiosyncratic use of cleanup locks notwithstanding).

To be fair it does look like GiST's gistdoinsert function holds onto multiple buffer pins at a time, for its own reasons -- index AM reasons. But this looks to be more or less an optimization to deal with navigating the tree with a loose index order, where multiple descents and ascents are absolutely expected. (This makes it a bit like the nbtree "drop lock but not pin" case that I mentioned in my last email.)

It's not as if these gistdoinsert buffer pins persist across calls to amgettuple, though, so for the purposes of this discussion about the new batch API to replace amgettuple they are not relevant -- they don't actually undermine my point. (Though to be fair their existence does help to explain why you found my characterization of buffer pins as irrelevant to index AMs confusing.)

The real sign that what I said is generally true of index AMs is that you'll see so few calls to LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls ConditionalLockBufferForCleanup at all (which I find a bit weird). Both GiST and SP-GiST call neither function -- even during VACUUM. So GiST and SP-GiST make clear that index AMs (that support only MVCC snapshot scans) can easily get by without any use of cleanup locks (and with no externally significant use of buffer pins).

> > I think that this is exactly what I propose to do, said in a different
> > way. (Again, I wouldn't have expressed it in this way because it seems
> > obvious to me that buffer pins don't have nearly the same significance
> > to an index AM as they do to heapam -- they have no value in
> > protecting the index structure, or helping an index scan to reason
> > about concurrency that isn't due to a heapam issue.)
> >
> > Does that make sense?
>
> Yeah, it just really throws me for a loop that you're using "pin" to
> mean "pin at a time when we don't also hold a lock."

I'll try to be more careful about that in the future, then.

> The fundamental
> purpose of a pin is to prevent a buffer from being evicted while
> someone is in the middle of looking at it, and nothing that uses
> buffers can possibly work correctly without that guarantee. Everything
> you've written in parentheses there is, AFAICT, 100% wrong if you mean
> "any pin" and 100% correct if you mean "a pin held without a
> corresponding lock."

I agree.

--
Peter Geoghegan
On Mon, Nov 11, 2024 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The real sign that what I said is generally true of index AMs is that
> you'll see so few calls to
> LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls
> ConditionalLockBufferForCleanup at all (which I find a bit weird).
> Both GiST and SP-GiST call neither function -- even during VACUUM. So
> GiST and SP-GiST make clear that index AMs (that support only MVCC
> snapshot scans) can easily get by without any use of cleanup locks
> (and with no externally significant use of buffer pins).

Actually, I'm pretty sure that it's wrong for GiST VACUUM to not acquire a full cleanup lock (which used to be called a super-exclusive lock in index AM contexts), as I went into some years ago:

https://www.postgresql.org/message-id/flat/CAH2-Wz%3DPqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt%3D5qxWQ%40mail.gmail.com

I plan on playing around with injection points soon. I might try my hand at proving that GiST VACUUM needs to do more here to avoid breaking concurrent GiST index-only scans.

Issues such as this are why I place so much emphasis on formalizing all the rules around TID recycling and dropping pins with index scans. I think that we're still a bit sloppy about things in this area.

--
Peter Geoghegan