Re: index prefetching - Mailing list pgsql-hackers
From | Melanie Plageman |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | CAAKRu_ZPuaO5XwXfM7wowf4sSmPSJ2LT9+zfmD5=LQ=WhV_j=Q@mail.gmail.com Whole thread Raw |
In response to | Re: index prefetching (Melanie Plageman <melanieplageman@gmail.com>) |
Responses |
Re: index prefetching
Re: index prefetching |
List | pgsql-hackers |
On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/24/24 01:51, Melanie Plageman wrote: > > >> But I'm not sure what to do about optimizations that are more specific > > >> to the access path. Consider for example the index-only scans. We don't > > >> want to prefetch all the pages, we need to inspect the VM and prefetch > > >> just the not-all-visible ones. And then pass the info to the index scan, > > >> so that it does not need to check the VM again. It's not clear to me how > > >> to do this with this approach. > > > > > > Yea, this is an issue I'll need to think about. To really spell out > > > the problem: the callback dequeues a TID from the tid_queue and looks > > > up its block in the VM. It's all visible. So, it shouldn't return that > > > block to the streaming read API to fetch from the heap because it > > > doesn't need to be read. But, where does the callback put the TID so > > > that the caller can get it? I'm going to think more about this. > > > > > > > Yes, that's the problem for index-only scans. I'd generalize it so that > > it's about the callback being able to (a) decide if it needs to read the > > heap page, and (b) store some custom info for the TID. > > Actually, I think this is no big deal. See attached. I just don't > enqueue tids whose blocks are all visible. I had to switch the order > from fetch heap then fill queue to fill queue then fetch heap. > > While doing this I noticed some wrong results in the regression tests > (like in the alter table test), so I suspect I have some kind of > control flow issue. Perhaps I should fix the resource leak so I can > actually see the failing tests :) Attached is a patch which implements a real queue and fixes some of the issues with the previous version. It doesn't pass tests yet and has issues. Some are bugs in my implementation I need to fix. Some are issues we would need to solve in the streaming read API. Some are issues with index prefetching generally. Note that these two patches have to be applied before 21d9c3ee4e because Thomas hasn't released a rebased version of the streaming read API patches yet. Issues --- - kill prior tuple This optimization doesn't work with index prefetching with the current design. Kill prior tuple relies on alternating between fetching a single index tuple and visiting the heap. After visiting the heap we can potentially kill the immediately preceding index tuple. Once we fetch multiple index tuples, enqueue their TIDs, and later visit the heap, the next index page we visit may not contain all of the index tuples deemed killable by our visit to the heap. In our case, we could try and fix this by prefetching only heap blocks referred to by index tuples on the same index page. Or we could try and keep a pool of index pages pinned and go back and kill index tuples on those pages. Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps there is an easier way to fix this, as I don't think the mvcc test failed on Tomas' version. - switching scan directions If the index scan switches directions on a given invocation of IndexNext(), heap blocks may have already been prefetched and read for blocks containing tuples beyond the point at which we want to switch directions. We could fix this by having some kind of streaming read "reset" callback to drop all of the buffers which have been prefetched which are now no longer needed. We'd have to go backwards from the last TID which was yielded to the caller and figure out which buffers in the pgsr buffer ranges are associated with all of the TIDs which were prefetched after that TID. The TIDs are in the per_buffer_data associated with each buffer in pgsr. The issue would be searching through those efficiently. The other issue is that the streaming read API does not currently support backwards scans. So, if we switch to a backwards scan from a forwards scan, we would need to fallback to the non streaming read method. We could do this by just setting the TID queue size to 1 (which is what I have currently implemented). Or we could add backwards scan support to the streaming read API. - mark and restore Similar to the issue with switching the scan direction, mark and restore requires us to reset the TID queue and streaming read queue. For now, I've hacked in something to the PlannerInfo and Plan to set the TID queue size to 1 for plans containing a merge join (yikes). - multiple executions For reasons I don't entirely understand yet, multiple executions (not rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' patch, I have disabled prefetching (and made the TID queue size 1) when execute_once is false. - Index Only Scans need to return IndexTuples Because index only scans return either the IndexTuple pointed to by IndexScanDesc->xs_itup or the HeapTuple pointed to by IndexScanDesc->xs_hitup -- both of which are populated by the index AM, we have to save copies of those IndexTupleData and HeapTupleDatas for every TID whose block we prefetch. This might be okay, but it is a bit sad to have to make copies of those tuples. In this patch, I still haven't figured out the memory management part. I copy over the tuples when enqueuing a TID queue item and then copy them back again when the streaming read API returns the per_buffer_data to us. Something is still not quite right here. I suspect this is part of the reason why some of the other tests are failing. Other issues/gaps in my implementation: Determining where to allocate the memory for the streaming read object and the TID queue is an outstanding TODO. To implement a fallback method for cases in which streaming read doesn't work, I set the queue size to 1. This is obviously not good. Right now, I allocate the TID queue and streaming read objects in IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in index_beginscan() (and index_beginscan_parallel()) is tricky though because we don't know the scan direction at that point (and the scan direction can change). There are also callers of index_beginscan() who do not call Index[Only]Next() (like systable_getnext() which calls index_getnext_slot() directly). Also, my implementation does not yet have the optimization Tomas does to skip prefetching recently prefetched blocks. As he has said, it probably makes sense to add something to do this in a lower layer -- such as in the streaming read API or even in bufmgr.c (maybe in PrefetchSharedBuffer()). - Melanie
Attachment
pgsql-hackers by date: