Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Date:
Msg-id: 4f5f16ef-df1e-4e09-9b3f-2e0961ab5117@enterprisedb.com
In response to: Re: index prefetching (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: index prefetching; Re: index prefetching
List: pgsql-hackers
On 2/7/24 22:48, Melanie Plageman wrote:
> ...
>
> Attached is a patch which implements a real queue and fixes some of
> the issues with the previous version. It doesn't pass tests yet and
> has issues. Some are bugs in my implementation I need to fix. Some are
> issues we would need to solve in the streaming read API. Some are
> issues with index prefetching generally.
>
> Note that these two patches have to be applied before 21d9c3ee4e
> because Thomas hasn't released a rebased version of the streaming read
> API patches yet.
>

Thanks for working on this, and for investigating the various issues.

> Issues
> ---
> - kill prior tuple
>
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.
>

I admit I hadn't thought about kill_prior_tuple until you pointed it out. Yeah, prefetching separates (de-synchronizes) the two scans (index and heap) in a way that prevents this optimization. Or at least makes it much more complex :-(

> In our case, we could try and fix this by prefetching only heap blocks
> referred to by index tuples on the same index page. Or we could try
> and keep a pool of index pages pinned and go back and kill index
> tuples on those pages.
>

I think restricting the prefetching to a single index page would not be a huge issue performance-wise - that's what the initial patch version (implemented at the index AM level) did, pretty much. The prefetch queue would get drained as we approach the end of the index page, but luckily index pages tend to have a lot of entries. It'd put an upper bound on the prefetch distance, though (much lower than the e_i_c maximum of 1000, but I'd say common values are 10-100 anyway).

But how would we know we're on the same index page? That knowledge is not available outside the index AM - the executor or indexam.c does not know this, right? Presumably we could expose this somehow, but it seems like a violation of the abstraction ...

The same thing affects keeping multiple index pages pinned, for TIDs that are yet to be used by the index scan. We'd need to know when to release a pinned page, once we're done processing all of its items.

FWIW I haven't tried implementing any of this, so maybe I'm missing something and it can be made to work in a nice way.
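That said, to make the "single index page" option a bit more concrete, here's a rough standalone sketch of the kind of queue I have in mind (made-up structs, not the actual patch or any PostgreSQL API). Note that it assumes the index AM somehow exposes which leaf page each TID came from, which is exactly the abstraction problem mentioned above:

/*
 * Standalone sketch only - hypothetical types, not PostgreSQL code.
 * Each queued TID remembers the index leaf page it was read from, and the
 * queue refuses entries from the next page until the current page has been
 * drained (so that page could stay pinned and its tuples could still be
 * killed afterwards).
 */
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 64

typedef struct PrefetchTid
{
    uint32_t    heap_block;     /* heap block we want to prefetch */
    uint16_t    heap_offset;    /* offset of the tuple in that block */
    uint32_t    index_page;     /* index leaf page the TID came from */
} PrefetchTid;

typedef struct PrefetchQueue
{
    PrefetchTid items[QUEUE_SIZE];
    uint64_t    head;           /* next entry to hand to the heap fetch */
    uint64_t    tail;           /* next free slot */
} PrefetchQueue;

static inline uint64_t
queue_count(const PrefetchQueue *q)
{
    return q->tail - q->head;
}

/*
 * Enqueue a TID, but only if it comes from the same index page as the
 * entries already queued (or the queue is empty).  Returns false when the
 * caller should first drain the queue, and only then release the current
 * index page and move on to the next one.
 */
static bool
queue_add_same_page(PrefetchQueue *q, PrefetchTid tid)
{
    if (queue_count(q) == QUEUE_SIZE)
        return false;

    if (queue_count(q) > 0 &&
        q->items[q->head % QUEUE_SIZE].index_page != tid.index_page)
        return false;

    q->items[q->tail % QUEUE_SIZE] = tid;
    q->tail++;
    return true;
}

Obviously the interesting part is not the queue itself but where the index_page value would come from - that would have to be provided by the AM one way or another.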
> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> there is an easier way to fix this, as I don't think the mvcc test
> failed on Tomas' version.
>

I kinda doubt it worked correctly, considering I simply ignored the optimization. It's far more likely it just worked by luck.

> - switching scan directions
>
> If the index scan switches directions on a given invocation of
> IndexNext(), heap blocks may have already been prefetched and read for
> blocks containing tuples beyond the point at which we want to switch
> directions.
>
> We could fix this by having some kind of streaming read "reset"
> callback to drop all of the buffers which have been prefetched which
> are now no longer needed. We'd have to go backwards from the last TID
> which was yielded to the caller and figure out which buffers in the
> pgsr buffer ranges are associated with all of the TIDs which were
> prefetched after that TID. The TIDs are in the per_buffer_data
> associated with each buffer in pgsr. The issue would be searching
> through those efficiently.
>

Yeah, that's roughly what I envisioned in one of my previous messages about this issue - walking back the TIDs read from the index and added to the prefetch queue.

> The other issue is that the streaming read API does not currently
> support backwards scans. So, if we switch to a backwards scan from a
> forwards scan, we would need to fallback to the non streaming read
> method. We could do this by just setting the TID queue size to 1
> (which is what I have currently implemented). Or we could add
> backwards scan support to the streaming read API.
>

What do you mean by "support for backwards scans" in the streaming read API? I imagined it naively as

1) drop all requests in the streaming read API queue

2) walk back all "future" requests in the TID queue

3) start prefetching as if from scratch

Maybe there's a way to optimize this and reuse some of the work more efficiently, but my assumption is that the scan direction does not change very often, and that we process many items in between.
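To spell out those three steps, a trivial standalone sketch (made-up types, not the streaming read API - the real reset would have to live in there):

/*
 * Standalone sketch of the "walk back on direction change" idea.
 * Everything queued after the last TID actually returned to the caller was
 * read ahead for the old direction, so we simply forget it and let the scan
 * refill the queue (and the read stream) in the new direction.
 */
#include <stdint.h>

typedef struct QueuedTid
{
    uint32_t    heap_block;
    uint16_t    heap_offset;
} QueuedTid;

typedef struct TidQueue
{
    QueuedTid   items[64];
    uint64_t    next_to_return; /* next TID to hand to the caller */
    uint64_t    next_free;      /* next free slot */
} TidQueue;

/* hypothetical stand-in for dropping the pending read-stream requests */
static void
stream_forget_pending(void *stream)
{
    (void) stream;              /* no-op in this sketch */
}

static void
queue_reset_for_direction_change(TidQueue *q, void *stream)
{
    /* 1) drop all requests in the streaming read queue */
    stream_forget_pending(stream);

    /* 2) walk back all "future" TIDs we read ahead from the index */
    q->next_free = q->next_to_return;

    /*
     * 3) prefetching restarts from scratch as the scan produces TIDs in
     * the new direction
     */
}

It throws away some read-ahead work on every direction change, but if that's rare the simplicity seems worth it.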
> - mark and restore
>
> Similar to the issue with switching the scan direction, mark and
> restore requires us to reset the TID queue and streaming read queue.
> For now, I've hacked in something to the PlannerInfo and Plan to set
> the TID queue size to 1 for plans containing a merge join (yikes).
>

Haven't thought about this very much, will take a closer look.

> - multiple executions
>
> For reasons I don't entirely understand yet, multiple executions (not
> rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
> patch, I have disabled prefetching (and made the TID queue size 1)
> when execute_once is false.
>

Don't work in what sense? What is (not) happening?

> - Index Only Scans need to return IndexTuples
>
> Because index only scans return either the IndexTuple pointed to by
> IndexScanDesc->xs_itup or the HeapTuple pointed to by
> IndexScanDesc->xs_hitup -- both of which are populated by the index
> AM, we have to save copies of those IndexTupleData and HeapTupleDatas
> for every TID whose block we prefetch.
>
> This might be okay, but it is a bit sad to have to make copies of those tuples.
>
> In this patch, I still haven't figured out the memory management part.
> I copy over the tuples when enqueuing a TID queue item and then copy
> them back again when the streaming read API returns the
> per_buffer_data to us. Something is still not quite right here. I
> suspect this is part of the reason why some of the other tests are
> failing.
>

It's not clear to me why you need to copy the tuples back - shouldn't it be enough to copy each tuple just once?

FWIW if we decide to pin multiple index pages (to make kill_prior_tuple work), that would also mean we don't need to copy any tuples, right? We could simply point into the buffers for all of them.

> Other issues/gaps in my implementation:
>
> Determining where to allocate the memory for the streaming read object
> and the TID queue is an outstanding TODO. To implement a fallback
> method for cases in which streaming read doesn't work, I set the queue
> size to 1. This is obviously not good.
>

I think IndexFetchTableData seems like a not entirely terrible place for allocating the pgsr, but I wonder what Andres thinks about this. IIRC he advocated for doing the prefetching in the executor, and I'm not sure heapam_handled.c + relscan.h is what he imagined ...

Also, when you say "obviously not good" - why? Are you concerned about the extra overhead of shuffling stuff between queues, or something else?

> Right now, I allocate the TID queue and streaming read objects in
> IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> index_beginscan() (and index_beginscan_parallel()) is tricky though
> because we don't know the scan direction at that point (and the scan
> direction can change). There are also callers of index_beginscan() who
> do not call Index[Only]Next() (like systable_getnext() which calls
> index_getnext_slot() directly).
>

Yeah, I'm not sure this is the right layering ... the initial patch did everything in the individual index AMs, then it moved to indexam.c, then to the executor. And this seems to move it to lower layers again ...

> Also, my implementation does not yet have the optimization Tomas does
> to skip prefetching recently prefetched blocks. As he has said, it
> probably makes sense to add something to do this in a lower layer --
> such as in the streaming read API or even in bufmgr.c (maybe in
> PrefetchSharedBuffer()).
>

I agree this should happen in a lower layer. I'd probably do it in the streaming read API, because that would define the "scope" of the cache (pages prefetched for that read). Doing it in PrefetchSharedBuffer would instead mean a single cache for that particular backend. But that's just an initial thought ...
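Just to illustrate the kind of cache I mean, scoped to the read stream, a trivial standalone sketch (made-up types again, not the streaming read API and not necessarily what my patch does):

/*
 * Standalone sketch of a per-stream "recently prefetched" filter.  Before
 * issuing a prefetch, check a tiny fixed-size cache of recent block
 * numbers, so sequential or highly correlated TIDs don't generate a
 * prefetch call for every single tuple.
 */
#include <stdbool.h>
#include <stdint.h>

#define RECENT_BLOCKS 8

typedef struct RecentBlocks
{
    uint32_t    blocks[RECENT_BLOCKS];
    int         nused;          /* number of valid slots */
    int         next;           /* slot to overwrite next (round robin) */
} RecentBlocks;

/*
 * Returns true if the block still needs a prefetch, false if it was
 * prefetched "recently" (i.e. it is still in the small cache).
 */
static bool
should_prefetch_block(RecentBlocks *rb, uint32_t block)
{
    for (int i = 0; i < rb->nused; i++)
    {
        if (rb->blocks[i] == block)
            return false;
    }

    /* remember the block, overwriting the oldest entry once full */
    rb->blocks[rb->next] = block;
    rb->next = (rb->next + 1) % RECENT_BLOCKS;
    if (rb->nused < RECENT_BLOCKS)
        rb->nused++;

    return true;
}

Keeping something like this inside the read stream means the cache naturally lives and dies with the scan, which is the "scope" I was referring to.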
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company