Re: index prefetching - Mailing list pgsql-hackers

From:           Tomas Vondra
Subject:        Re: index prefetching
Msg-id:         8ed1d326-5c6e-476e-b3fd-30d3da210546@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List:           pgsql-hackers
On 7/22/25 23:35, Peter Geoghegan wrote:
> On Tue, Jul 22, 2025 at 4:50 PM Tomas Vondra <tomas@vondra.me> wrote:
>>> Obviously, whatever advantage that the "complex" patch has is bound to
>>> be limited to cases where index characteristics are naturally the
>>> limiting factor. For example, with the pgbench_accounts_pkey table
>>> there are only ever 6 distinct heap blocks on each leaf page. I bet
>>> that your "linear" test more or less looks like that, too.
>>>
>>
>> Yes. It's definitely true we could construct examples where the complex
>> patch beats the simple one for this reason.
>
> It's literally the only possible valid reason why the complex patch could win!
>
> The sole performance justification for the complex patch is that it
> can prevent the heap prefetching from getting bottlenecked on factors
> tied to physical index characteristics (when it's possible in
> principle to avoid getting bottlenecked in that way). Unsurprisingly,
> if you assume that that'll never happen, then yeah, the complex patch
> has no performance advantage over the simple one.
>
> I happen to think that that's a very unrealistic assumption. Most
> standard benchmarks have indexes that almost all look fairly similar
> to pgbench_accounts_pkey, from the point of view of "heap page blocks
> per leaf page". There are exceptions, of course (e.g., the TPC-C order
> table's primary key suffers from fragmentation).
>

I agree with all of this.

>> And I believe some of those
>> examples could be quite realistic, even if not very common (like when
>> very few index tuples fit on a leaf page).
>
> I don't think cases like that matter very much at all. The only thing
> that *really* matters on the index AM side is the logical/physical
> correlation. Which your testing seems largely unconcerned with.
>
>> However, I'm not sure the pgbench example with only 6 heap blocks per
>> leaf is very significant. Sure, the simple patch can't prefetch TIDs
>> from the following leaf, but AFAICS the complex patch won't do that
>> either.
>
> Why not?
>
>> Not because it couldn't, but because with that many hits the
>> distance will drop to ~1 (or close to it). (It'll probably prefetch a
>> couple TIDs from the next leaf at the very end of the page, but I don't
>> think that matters overall.)
>
> Then why do your own test results continue to show such a big
> advantage for the complex patch, over the simple patch?
>

I assume you mean results for the "linear" data set, because for every
other data set the patches perform almost exactly the same (when
restoring the distance after stream reset):

https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf

And it's a very good point. I was puzzled by this too, and it took me a
while to understand how/why it happens. It pretty much boils down to the
"duplicate block" detection and how it interacts with the stream resets
(again!).

Both patches detect duplicate blocks the same way - using a lastBlock
field, checked in the next_block callback - and skip reading the same
block multiple times. For the "linear" data set this happens a lot,
because the index is correlated and so every block repeats ~20x.
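(For illustration, that repetition is easy to see with pageinspect. A
quick sketch - the index name and leaf block number are only examples,
not taken from the actual test script:

    -- how many times each heap block is referenced from one leaf page
    SELECT (htid::text::point)[0]::bigint AS heap_block, count(*) AS refs
      FROM bt_page_items('linear_a_idx', 4)
     WHERE htid IS NOT NULL
     GROUP BY 1
     ORDER BY refs DESC;

On a perfectly correlated index each heap block should show up ~20 times,
which is exactly the repetition the lastBlock check is meant to absorb.)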
This seems to trigger entirely different behaviors in the two patches.

For the complex patch, this results in a very high prefetch distance,
about 270. Which seems like less than one leaf page (which has ~360
items). But if I log the read/stream positions seen in
index_batch_getnext_tid, I often see this:

    LOG:  index_batch_getnext_tid match 0 read (9,271) stream (22,264)

That is, the stream is ~13 batches ahead. AFAICS this happens because
the read_next callback (which "produces" block numbers for the stream)
skips the duplicate blocks, so the stream never even knows about them.
So the stream thinks the distance is 270, but it's really 20x that
(when measured in index items).

I realize this is another way to trigger the stream resets with the
complex patch, even though that didn't happen here (the limit is 64
leafs, we used 13).

So you're right that the complex patch prefetches far ahead. I thought
the distance would quickly decrease because of the duplicate blocks,
but I missed the fact that the read stream will not see them at all.

I'm not sure it's desirable to "hide" blocks from the read stream like
this - it'll never see the misses. How could it make good decisions
when we skew the data used by its heuristics like this?

For the simple patch, the effect seems to be exactly the opposite. It
detects duplicate blocks the same way, but there's a caveat - resetting
the stream invalidates the lastBlock field, so it can't detect
duplicate blocks from the previous leaf. And so the distance drops. But
I don't think this should matter much (it's just a single miss for the
first item), so the rest really has to be about the single-leaf limit.

(This is my working theory, I still need to investigate it a bit more.)

>> I'm not sure what prefetch distances will be sensible in queries that do
>> other stuff. The queries in the benchmark do just the index scan, but if
>> the query does something with the tuple (in the nodes on top), that
>> shortens the required prefetch distance. Of course, simple queries will
>> benefit from prefetching far ahead.
>
> Doing *no* prefetching will usually be the right thing to do. Does
> that make index prefetching pointless in general?
>

I don't think so. Why would it? There's plenty of queries that can
benefit from it a lot, and as long as it doesn't cause harm to other
queries it's a win.

>> Thanks. I wonder how difficult it would be to add something like this to
>> pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and
>> count distinct blocks, right? Seems quite useful.
>
> I agree that that would be quite useful.
>

Good first patch for someone ;-)
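(Until that patch exists, a rough approximation is already possible with
pageinspect, using the same TID-to-block trick as above. A sketch only -
illustrative index name, assumes no deleted pages, and not meant to be
fast on large indexes:

    -- average number of distinct heap blocks referenced per leaf page
    SELECT avg(s.heap_blocks) AS avg_heap_blocks_per_leaf
      FROM generate_series(1, pg_relation_size('pgbench_accounts_pkey') / 8192 - 1) AS p(blkno),
           LATERAL (SELECT count(DISTINCT (i.htid::text::point)[0]::bigint) AS heap_blocks
                      FROM bt_page_items('pgbench_accounts_pkey', p.blkno::int) AS i
                     WHERE i.htid IS NOT NULL) AS s
     WHERE (bt_page_stats('pgbench_accounts_pkey', p.blkno::int)).type = 'l';

For pgbench_accounts_pkey this should come out around the ~6 mentioned
upthread.)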
>>> Did you restore the distance for the "complex" patch, too? I think
>>> that it might well matter there too.
>>>
>>
>> No, I did not. I did consider it, but it seemed to me it can't really
>> make a difference (for these data sets), because each leaf has ~300
>> items, and the patch limits the prefetch to 64 leafs. That means it can
>> prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this
>> should be good enough for eic=1000. It should never hit stream reset.
>
> It looks like the complex patch can reset the read stream for a couple
> of reasons, which I don't fully understand right now.
>
> I'm mostly thinking of this stuff:
>
>         /*
>          * If we advanced to the next batch, release the batch we no
>          * longer need. The position is the "read" position, and we can
>          * compare it to firstBatch.
>          */
>         if (pos->batch != scan->batchState->firstBatch)
>         {
>             batch = INDEX_SCAN_BATCH(scan, scan->batchState->firstBatch);
>             Assert(batch != NULL);
>
>             /*
>              * XXX When advancing readPos, the streamPos may get behind as
>              * we're only advancing it when actually requesting heap
>              * blocks. But we may not do that often enough - e.g. IOS may
>              * not need to access all-visible heap blocks, so the
>              * read_next callback does not get invoked for a long time.
>              * It's possible the stream gets so much behind that the
>              * position gets invalid, as we already removed the batch. But
>              * that means we don't need any heap blocks until the current
>              * read position - if we did, we would not be in this situation
>              * (or it's a sign of a bug, as those two places are expected
>              * to be in sync). So if the streamPos still points at the
>              * batch we're about to free, just reset the position - we'll
>              * set it to readPos in the read_next callback later.
>              *
>              * XXX This can happen after the queue gets full, we "pause"
>              * the stream, and then reset it to continue. But I think that
>              * just increases the probability of hitting the issue, it's
>              * just more chance to not advance the streamPos, which
>              * depends on when we try to fetch the first heap block after
>              * calling read_stream_reset().
>              */
>             if (scan->batchState->streamPos.batch ==
>                 scan->batchState->firstBatch)
>                 index_batch_pos_reset(scan, &scan->batchState->streamPos);
>

This is not resetting the stream, though. This is resetting the position
tracking how far the stream got. This happens because the stream moves
forward only in response to reading buffers from it. So without calling
read_stream_next_buffer() it won't call the read_next callback
generating the blocks. And it's the callback that advances the
streamPos field, so it may get stale.

This happens e.g. for index-only scans, when we read a couple of blocks
that are not all-visible (so those reads go through the stream), and
then we get a bunch of all-visible blocks, so we only return the TIDs
and index tuples. The stream gets "behind" the readPos, and may even
point at a batch that was already freed.

>>> Isn't the obvious explanation that the complex patch benefits from
>>> being able to prefetch without being limited by index
>>> characteristics/leaf page boundaries, while the simple patch doesn't?
>>>
>>
>> That's a valid interpretation, yes. Although the benefit comes mostly
>
> The benefit comes mostly from....?
>

Sorry, got distracted and forgot to complete the sentence. I think I
wanted to write "mostly from not resetting the distance to 1". Which is
true, but the earlier "linear" example also shows there are cases where
the page boundaries are significant.

>> Yes, there's some similarity. Attached is the script I use to create the
>> tables and load the data.
>
> Another issue with the testing that biases it against the complex
> patch: heap fill factor is set to only 25 (but you use the default
> index fill-factor).
>

That's actually intentional. I wanted to model tables with wider tuples,
without having to generate all the data etc. Maybe 25% is too much, and
real tables have more than 20 tuples per page. It's true 400B is fairly
large. I'm not against testing with other parameters, of course. The
test was not originally written for comparing different prefetching
patches, so it may not be quite fair (and I'm not sure how to define
"fair").

>> The "linear" is a table with a simple sequence of values (0 to 100k).
>> More or less - the value is a floating point, and there are 10M rows.
>> But you get the idea.
>>
>> The "linear_X" variants mean the value has a noise of X% of the range.
>> So with "linear_1" you get the "linear" value, and then random(0,1000),
>> with normal distribution.
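(For reference, data along those lines can be generated with something
like the following. This is a sketch based on the description above, not
the actual attached script; the table/column names are illustrative, and
random_normal() requires Postgres 16+:

    -- "linear_1": correlated values 0..100k over 10M rows, plus noise
    -- with a standard deviation of 1% of the range; the low heap
    -- fillfactor stands in for wider (~400B) tuples
    CREATE TABLE linear_1 (a float8) WITH (fillfactor = 25);

    INSERT INTO linear_1
    SELECT i / 100.0 + random_normal(0, 1000)
      FROM generate_series(1, 10000000) AS i;

    CREATE INDEX linear_1_a_idx ON linear_1 (a);
)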
> I don't get why this is helpful to test, except perhaps as a general smoke test.
>
> If I zoom into any given "linear_1" leaf page, I see TIDs that appear
> in an order that isn't technically uniformly random order, but is
> fairly close to it. At least in a practical sense. At least for the
> purposes of prefetching.
>

It's not uniformly random - I wrote that it uses a normal distribution.
The query in the SQL script does this:

    select x + random_normal(0, 1000) from ...

It is a synthetic test data set, of course. It's meant to be simple to
generate, reason about, and somewhere in between the "linear" and
"uniform" data sets.

But it also has realistic motivation - real tables are usually not as
clean as "linear", nor as random as the "uniform" data sets (not for all
columns, at least). If you're looking at data sets like "orders" or
whatever, there's usually a bit of noise even for columns like "date"
etc. People modify the orders, or fill in data from a couple days ago,
etc. Perfect correlation for one column implies slightly worse
correlation for another column (order date vs. delivery date).

> For example:
>
> pg@regression:5432 [104789]=# select
>     itemoffset,
>     htid
> from
>     bt_page_items('linear_1_a_idx', 4);
> ┌────────────┬───────────┐
> │ itemoffset │ htid      │
> ├────────────┼───────────┤
> │          1 │ ∅         │
> │          2 │ (10,18)   │
> │          3 │ (463,9)   │
> │          4 │ (66,8)    │
> │          5 │ (79,9)    │
> │          6 │ (594,7)   │
> │          7 │ (289,13)  │
> │          8 │ (568,2)   │
> │          9 │ (237,2)   │
> │         10 │ (156,10)  │
> │         11 │ (432,9)   │
> │         12 │ (372,17)  │
> │         13 │ (554,6)   │
> │         14 │ (1698,11) │
> │         15 │ (389,6)   │
> *** SNIP ***
> │        288 │ (1264,5)  │
> │        289 │ (738,16)  │
> │        290 │ (1143,3)  │
> │        291 │ (400,1)   │
> │        292 │ (1157,10) │
> │        293 │ (266,2)   │
> │        294 │ (502,9)   │
> │        295 │ (85,15)   │
> │        296 │ (282,2)   │
> │        297 │ (453,5)   │
> │        298 │ (396,6)   │
> │        299 │ (267,18)  │
> │        300 │ (733,15)  │
> │        301 │ (108,8)   │
> │        302 │ (356,16)  │
> │        303 │ (235,10)  │
> │        304 │ (812,18)  │
> │        305 │ (675,1)   │
> │        306 │ (258,13)  │
> │        307 │ (1187,9)  │
> │        308 │ (185,2)   │
> │        309 │ (179,2)   │
> │        310 │ (951,2)   │
> └────────────┴───────────┘
> (310 rows)
>
> There's actually 55,556 heap blocks in total in the underlying table.
> So clearly there is some correlation here. Just not enough to ever
> matter very much to prefetching. Again, the sole test case that has
> that quality to it is the "linear" test case.
>

Right. I don't see a problem with this. I'm not saying the parameters
for this particular data set are "perfect", but the intent is to have a
range of data sets from "perfectly clean" to "random" and see how the
patch(es) behave on all of them.

If you have a suggestion for different data sets, or how to tweak the
parameters to make it more realistic, I'm happy to try those.

regards

--
Tomas Vondra