Re: index prefetching - Mailing list pgsql-hackers

From | Peter Geoghegan
---|---
Subject | Re: index prefetching
Date |
Msg-id | CAH2-Wz=UL7Zi+a1qtJp8Rp370z4rpOPgvJJfkGSToPuMGpaYFQ@mail.gmail.com
In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List | pgsql-hackers
On Sun, Jul 13, 2025 at 5:57 PM Tomas Vondra <tomas@vondra.me> wrote:
> Thank you! I'll take a look next week, but these numbers suggest you
> simplified it a lot..

Right. I'm still not done removing code from nbtree here. I still
haven't done things like generalize _bt_killitems across all index AMs.
That can largely (though not entirely) work the same way for every
index AM, including the stuff about checking the page LSN/not dropping
pins to avoid blocking VACUUM. That logic is already totally
index-AM-agnostic, even though the avoid-blocking-VACUUM behavior
happens to be nbtree-only right now. (See the first sketch below for
roughly what I have in mind.)

> Another thing is hardware. I've been testing on local NVMe drives, and
> those don't seem to need very long queues (it's diminishing returns).
> Maybe the results would be different on systems with more I/O latency
> (e.g. because the storage is not local).

That seems likely. Cloud storage with 1ms latency is going to have very
different performance characteristics. The benefit of reading multiple
leaf pages will also only be seen with certain workloads. Another
factor is that leaf pages are typically much denser and more likely to
be cached than heap pages. And the potential to combine heap I/Os for
TIDs that appear on adjacent index leaf pages seems like an interesting
avenue to explore.

> I don't remember the array key details, I'll need to swap the context
> back in. But I think the thing I've been concerned about the most is the
> coordination of advancing to the next leaf page vs. the next array key
> (and then perhaps having to go back when the scan direction changes).

But we don't require anything like that. That's just not how it works.
The scan can change direction, and the array keys will automatically be
maintained correctly; _bt_advance_array_keys will be called as needed,
taking care of everything. This all happens in a way that code in
nbtree.c and nbtsearch.c knows nothing about (obviously that means that
your patch won't need to, either).

We do need to be careful about the scan direction changing when the
so->needPrimscan flag is set, but that won't affect your
patch/indexam.c, either. It also isn't very complicated; we only have
to be sure to *unset* the flag when we detect a *change* in direction
at the point where we're stepping off a page/pos (see the second sketch
below). We don't need to modify the array keys themselves at this point
-- the next call to _bt_advance_array_keys will just take care of that
for us automatically (we lean on _bt_advance_array_keys like this in a
number of places).

The only thing my revised version of your "complex" patch set does in
indexam.c that is in any way related to nbtree arrays is the call to
amrestrpos. But you'd never be able to tell -- since the amrestrpos
call is nothing new. It just so happens that the only reason we still
need the amrestrpos call/the whole concept of amrestrpos (having
completely moved mark/restore out of nbtree and into indexam.c) is so
that the index AM (nbtree) gets a signal that we (indexam.c) are going
to restore *some* mark. Because nbtree *will* need to reset its array
keys (if any) at that point. But that's it. We don't need to tell the
index AM any specific details about the mark, and indexam.c is
blissfully unaware of why it is that an index AM might need this. So
it's a total non-issue, from a layering cleanliness point of view.
There is no mutable state involved at *any* layer.
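To make a few of these points more concrete, here are some sketches.
None of this is the actual patch -- it's just the shape of the thing.
First, the generalized kill-items step. It follows the approach
_bt_killitems already takes: if we didn't hold a pin the whole time (so
as not to block VACUUM), we may only set LP_DEAD bits when the page's
LSN proves that it hasn't changed since we read it. The function name
and arguments here are invented for illustration; only the bufmgr calls
are real:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical AM-agnostic kill-items step.  'lsn_when_read' is the
 * page LSN remembered from when the scan originally read the leaf page.
 */
static void
index_killitems_generic(Relation indexrel, BlockNumber blkno,
                        XLogRecPtr lsn_when_read)
{
    Buffer      buf = ReadBuffer(indexrel, blkno);

    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /*
     * If the page changed after we read it, VACUUM may have recycled
     * some of the TIDs we remembered as dead.  Setting LP_DEAD bits now
     * could mark the wrong tuples, so just give up in that case.
     */
    if (BufferGetLSNAtomic(buf) == lsn_when_read)
    {
        /* ... match remembered dead items to line pointers, set LP_DEAD ... */
    }

    UnlockReleaseBuffer(buf);
}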
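Second, the direction-change rule for so->needPrimscan. The check
amounts to something like the following. The containing function is
made up (think of it as whatever step-off-a-pos path applies);
so->needPrimscan and so->currPos.dir are real nbtree state:

/* Simplified sketch -- not the actual nbtree code */
static void
_bt_step_off_pos(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    if (so->needPrimscan &&
        ScanDirectionIsForward(dir) !=
        ScanDirectionIsForward(so->currPos.dir))
    {
        /*
         * The scan changed direction while another primitive index scan
         * was scheduled.  Just unset the flag; we deliberately don't
         * touch the array keys themselves.  The next call to
         * _bt_advance_array_keys will take care of them automatically.
         */
        so->needPrimscan = false;
    }

    /* ... step to the next page/pos in direction 'dir' ... */
}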
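And third, the amrestrpos arrangement. With the mark itself owned by
indexam.c, all that's left of the callback is a bare notification. The
index_restrpos body is a simplification, and the _bt_start_array_keys
call is my shorthand for the remaining hard reset (the real details may
differ):

/* indexam.c -- the marked position itself now lives here, not in the AM */
void
index_restrpos(IndexScanDesc scan)
{
    /* ... restore indexam.c's own copy of the marked position ... */

    /*
     * Let the index AM know that *some* mark is being restored.  We
     * pass no details, and we don't know (or care) what the AM does
     * with the notification.
     */
    scan->indexRelation->rd_indam->amrestrpos(scan);
}

/* nbtree.c -- everything btrestrpos still has to do */
void
btrestrpos(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    /* Hard-reset the array keys to their initial positions, if any */
    if (so->numArrayKeys)
        _bt_start_array_keys(scan, so->currPos.dir);
}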
(FWIW, even when we restore a mark like this, nbtree is still mostly
leaning on _bt_advance_array_keys to advance the array keys properly
later on. If you're interested in why we need the remaining hard reset
of the arrays within amrestrpos/btrestrpos, let me know and I'll
explain.)

--
Peter Geoghegan