Re: index-only scans - Mailing list pgsql-hackers

From Tom Lane
Subject Re: index-only scans
Date
Msg-id 10870.1318089128@sss.pgh.pa.us
Whole thread Raw
In response to Re: index-only scans  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: index-only scans
Re: index-only scans
Re: index-only scans
List pgsql-hackers
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Not really.  We have detected a small performance regression when both
>> heap and index fit in shared_buffers and an index-only scan is used.
>> I have a couple of ideas for improving this.  One is to store a
>> virtual tuple into the slot instead of building a regular tuple, but
>> what do we do about tuples with OIDs?

[ that's done ]

> I was also toying with the notion of pushing the slot fill-in into the
> AM, so that the AM API is to return a filled TupleSlot not an
> IndexTuple.  This would not save any cycles AFAICT --- at least in
> btree, we still have to make a palloc'd copy of the IndexTuple so that
> we can release lock on the index page.  The point of it would be to
> avoid the assumption that the index's internal storage has exactly the
> format of IndexTuple.  Don't know whether we'd ever have any actual use
> for that flexibility, but it seems like it wouldn't cost much to
> preserve the option.

BTW, I concluded that that would be a bad idea, because it would involve
the index AM in some choices that we're likely to want to change.  In
particular it would foreclose ever doing anything with expression
indexes, without an AM API change.  Better to just define the AM's
responsibility as to hand back a tuple defined according to the index's
columns.

>> Another is to avoid locking the
>> index buffer multiple times - right now it locks the index buffer to
>> get the TID, and then relocks it to extract the index tuple (after
>> checking that nothing disturbing has happened meanwhile).  It seems
>> likely that with some refactoring we could get this down to a single
>> lock/unlock cycle, but I haven't figured out exactly where the TID
>> gets copied out.

> Yeah, maybe, but let's get the patch committed before we start looking
> for second-order optimizations.

On reflection I'm starting to think that the above would be a good idea
because there are a couple of bogosities in the basic choices this patch
made.  In particular, I'm thinking about how we could use an index on
f(x) to avoid recalculating f() in something like
select f(x) from tab where f(x) < 42;

assuming that f() is expensive but immutable.  The planner side of this
is already a bit daunting, because it's not clear how to recognize that
an index on f(x) is a covering index (the existing code is going to
think that x itself needs to be available).  But the executor side is a
real problem, because it will fail to make use of the f() value fetched
from the index anytime the heap visibility test fails.

I believe that we should rejigger things so that when an index-only scan
is selected, the executor *always* works from the data supplied by the
index.  Even if it has to visit the heap --- it will do that but just to
consult the tuple's visibility data, and then use what it got from the
index anyway.  This means we'd build the plan node's filter quals and
targetlist to reference the index tuple columns not the underlying
table's.  (Which in passing gets rid of the behavior you were
complaining about that EXPLAIN VERBOSE shows a lot of columns that
aren't actually being computed.)

In order to make this work, we have to remove the API wart that says the
index AM is allowed to choose to not return the index tuple.  And that
ties into what you were saying above.  What we ought to do is have
_bt_readpage() save off copies of the whole tuples not only the TIDs
when it is extracting data from the page.  This is no more net copying
work than what happens now (assuming that all the tuples get fetched)
since we won't need the per-tuple memcpy that occurs now in
bt_getindextuple.  The tuples can go into a page-sized workspace buffer
associated with the BTScanPosData structure, and then just be referenced
in-place in that workspace with no second copy step.

I'm inclined to do the last part immediately, since there's a
performance argument for it, and then work on revising the executor
implementation.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Cédric Villemain
Date:
Subject: Re: Intermittent regression test failure from index-only scans patch
Next
From: Tom Lane
Date:
Subject: Re: Intermittent regression test failure from index-only scans patch