Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | CAH2-WzmDWjbVAD26HTQNjyY5Vxj74fAZMcDNNt8KJaycpr7dAA@mail.gmail.com |
| In response to | Re: index prefetching (Andres Freund <andres@anarazel.de>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |
On Fri, Nov 21, 2025 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:
> Another benefit is that it helps even more when there are multiple queries
> running concurrently - the high rate of lock/unlock on the buffer rather
> badly hurts scalability.

I haven't noticed that effect myself. In fact, it seemed to be the other way around; it looked like it helped most with very low client count workloads. It's possible that that had something to do with my hacky approach to validating the general idea of optimizing heapam buffer locking/avoiding repeated locking. This was a very rough prototype.

> Besides the locking overhead, it turns out that doing visibility checks
> one-by-one is a good bit slower than doing so in batches (or for the whole
> page). So that's another perf improvement this would enable.

Isn't that just what you automatically get by only locking once per contiguous group of TIDs that all point to the same heap page? Or did you mean that we could do the visibility checks in a separate pass, or something like that? I know that we do something like that when pruning these days, but that seems quite different.

> Yes, I think that's clearly required. I think one nice bonus of such a change
> is that it'd resolve one of the biggest existing layering violations around
> tableam - namely that nodeIndexonlyscan.c does VM_ALL_VISIBLE() calls, which
> it really has no business doing.

Right. One relevant artefact of that layering violation is the way that it forces index I/O prefetching (as implemented in the current draft patch) to cache visibility lookup info. But with an I/O prefetching design that puts exactly one place (namely the new table AM index scan implementation) in charge of everything, that is no longer necessary.

> I wonder if we could actually do part of the redesign in an even more
> piecemeal fashion:
>
> 1) Move the responsibility for getting the next tid from the index into
>    tableam, but do so by basically using index_getnext_tid().

I would prefer it if the new table AM interface was able to totally replace the existing one, for all types of index scans that currently use amgettuple. Individual table AMs would generally be expected to fully move over to the new interface in one go. That means that we'll need to have index_getnext_tid() support built into the heapam implementation of said new interface anyway. We'll need it so that it is compatible with index AMs that still use amgettuple (i.e. that haven't switched over to amgetbatch), because switching over to the amgetbatch interface isn't going to happen for every index AM in a single release -- that definitely isn't practical.

Anyway, I don't see that much point in doing just step 1 in a single release. If we don't use amgetbatch in some fashion, then we risk committing something that solves the wrong problem.

> 2) Have the new interface get a single batch of tuples from the index, instead
>    of doing it on a single tid-by-tid basis.

That was already what I had in mind for this new plan/direction. There isn't any point in having more than 1 index-AM-wise batch when we just need index AM batching to implement the heapam buffer locking optimization. (Actually, we'll need up to 2 such batches to handle things like mark + restore, much like the way that nbtree uses a separate currPos and markPos today.)

> 3) Have heapam not acquire the page lock for each tuple, but do so for all the
>    tuples on the same page.

I think that this is very easy compared to 1 and 2. It doesn't really seem like it makes sense as a separate step/item? Unlike with I/O prefetching, there'll be nothing speculative about the way that the heap buffer lock optimization needs to schedule work done by index AMs. We'll do it only when we can readily see that there's a group of contiguous TIDs to be returned by the scan that all point to the same heap page. There's no eager work done by index AMs compared to today/compared to amgettuple -- we're just testing if the very next TID has a matching block number, and including it when it does.
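To make that concrete, the heapam side would look something like the sketch below (just a sketch, not from any posted patch -- the batch layout and fetch_one_tuple() are stand-ins, while the bufmgr calls are the existing ones):

```c
static void
heapam_fetch_batch(Relation heaprel, ItemPointerData *tids, int ntids)
{
    int         i = 0;

    while (i < ntids)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&tids[i]);
        Buffer      buf = ReadBuffer(heaprel, blkno);

        LockBuffer(buf, BUFFER_LOCK_SHARE);

        /* consume the whole contiguous run of TIDs on this heap page */
        while (i < ntids && ItemPointerGetBlockNumber(&tids[i]) == blkno)
        {
            /* visibility check + tuple fetch, with the lock already held */
            fetch_one_tuple(heaprel, buf, &tids[i]);    /* stand-in */
            i++;
        }

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buf);
    }
}
```

The point being that the buffer is pinned and share-locked once per heap page actually visited, rather than once per TID returned by the index.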
> 4) Add awareness of multiple batches
>
> 5) Use read stream

I think that it makes sense to do these 2 together. But if we were going to break them up, my guess is that it'd make the most sense to start with the read stream work, and only then add support for reading multiple index-AM-wise batches at a time.

I think that it's essential that the design of amgetbatch be able to accommodate reading leaf pages that are ahead of the current leaf page, to maintain heap I/O prefetch distance with certain workloads. But I don't think it has to do it in the first committed version.
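For what it's worth, the read stream side could plausibly be wired up along these lines (a sketch only; IndexBatchState and load_next_batch() are invented names, but the callback signature matches ReadStreamBlockNumberCB from the existing read stream API):

```c
/* Hypothetical per-scan state holding the TIDs of the current batch */
typedef struct IndexBatchState
{
    ItemPointerData *tids;      /* TIDs from the current index-AM batch */
    int         ntids;
    int         nexttid;
} IndexBatchState;

/* Tells the read stream which heap block to read next */
static BlockNumber
index_scan_next_block_cb(ReadStream *stream, void *callback_private_data,
                         void *per_buffer_data)
{
    IndexBatchState *state = (IndexBatchState *) callback_private_data;

    /*
     * When the current batch runs dry, pull the next one -- this is where
     * reading leaf pages ahead of the current leaf page would happen, in
     * order to maintain the heap prefetch distance.
     */
    if (state->nexttid >= state->ntids &&
        !load_next_batch(state))        /* invented helper */
        return InvalidBlockNumber;      /* end of stream */

    return ItemPointerGetBlockNumber(&state->tids[state->nexttid++]);
}
```

The scan would then set the stream up with something like read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, heaprel, MAIN_FORKNUM, index_scan_next_block_cb, state, 0), and consume heap buffers via read_stream_next_buffer().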
--
Peter Geoghegan