Re: index-only scans - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: index-only scans |
Date | |
Msg-id | 10870.1318089128@sss.pgh.pa.us Whole thread Raw |
In response to | Re: index-only scans (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: index-only scans
Re: index-only scans Re: index-only scans |
List | pgsql-hackers |
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Not really. We have detected a small performance regression when both >> heap and index fit in shared_buffers and an index-only scan is used. >> I have a couple of ideas for improving this. One is to store a >> virtual tuple into the slot instead of building a regular tuple, but >> what do we do about tuples with OIDs? [ that's done ] > I was also toying with the notion of pushing the slot fill-in into the > AM, so that the AM API is to return a filled TupleSlot not an > IndexTuple. This would not save any cycles AFAICT --- at least in > btree, we still have to make a palloc'd copy of the IndexTuple so that > we can release lock on the index page. The point of it would be to > avoid the assumption that the index's internal storage has exactly the > format of IndexTuple. Don't know whether we'd ever have any actual use > for that flexibility, but it seems like it wouldn't cost much to > preserve the option. BTW, I concluded that that would be a bad idea, because it would involve the index AM in some choices that we're likely to want to change. In particular it would foreclose ever doing anything with expression indexes, without an AM API change. Better to just define the AM's responsibility as to hand back a tuple defined according to the index's columns. >> Another is to avoid locking the >> index buffer multiple times - right now it locks the index buffer to >> get the TID, and then relocks it to extract the index tuple (after >> checking that nothing disturbing has happened meanwhile). It seems >> likely that with some refactoring we could get this down to a single >> lock/unlock cycle, but I haven't figured out exactly where the TID >> gets copied out. > Yeah, maybe, but let's get the patch committed before we start looking > for second-order optimizations. On reflection I'm starting to think that the above would be a good idea because there are a couple of bogosities in the basic choices this patch made. In particular, I'm thinking about how we could use an index on f(x) to avoid recalculating f() in something like select f(x) from tab where f(x) < 42; assuming that f() is expensive but immutable. The planner side of this is already a bit daunting, because it's not clear how to recognize that an index on f(x) is a covering index (the existing code is going to think that x itself needs to be available). But the executor side is a real problem, because it will fail to make use of the f() value fetched from the index anytime the heap visibility test fails. I believe that we should rejigger things so that when an index-only scan is selected, the executor *always* works from the data supplied by the index. Even if it has to visit the heap --- it will do that but just to consult the tuple's visibility data, and then use what it got from the index anyway. This means we'd build the plan node's filter quals and targetlist to reference the index tuple columns not the underlying table's. (Which in passing gets rid of the behavior you were complaining about that EXPLAIN VERBOSE shows a lot of columns that aren't actually being computed.) In order to make this work, we have to remove the API wart that says the index AM is allowed to choose to not return the index tuple. And that ties into what you were saying above. What we ought to do is have _bt_readpage() save off copies of the whole tuples not only the TIDs when it is extracting data from the page. This is no more net copying work than what happens now (assuming that all the tuples get fetched) since we won't need the per-tuple memcpy that occurs now in bt_getindextuple. The tuples can go into a page-sized workspace buffer associated with the BTScanPosData structure, and then just be referenced in-place in that workspace with no second copy step. I'm inclined to do the last part immediately, since there's a performance argument for it, and then work on revising the executor implementation. regards, tom lane
pgsql-hackers by date: