Thread: SP-GiST versus index-only scans

SP-GiST versus index-only scans

From
Tom Lane
Date:
It would not take very much at all to make the SP-GiST stuff support
index-only scans, except for one thing: it's conceivable that an opclass
might choose to store only a lossy version of the original indexed
datum, in which case it'd not be able to reconstruct the datum on
demand.

I'm not sure how likely this is, because the design of SP-GiST forces
root-level leaf tuples to be stored with the original datum, but in
principle leaf tuples at lower levels might have lossy representations.
None of the current opclasses do that, but Oleg and Teodor evidently
foresee such cases, else they'd not have bothered with the recheck
support that's in there.  (If the original datum is reconstructable then
the opclass can surely deliver a correct answer for every leaf tuple.)

So the problem is that we have to either disallow such opclass designs,
or support per-opclass rather than per-index-AM decisions about whether
index-only scans are possible.

We discussed that idea when the index-only-scans patch went in a few
months ago, but blew it off on the grounds that there didn't seem to be
any immediate need for it.  Well, the need is here.

I think we should get rid of the pg_am.amcanreturn boolean column and
replace it with an AM support function.  btree's would just return
constant true, but I envision SP-GiST's version interrogating the
opclass config function.  (If we make the config function return this
info, we don't need to add anything to CREATE OPERATOR CLASS, which
seems like a good thing to me.)
        regards, tom lane


Re: SP-GiST versus index-only scans

From
Jesper Krogh
Date:
On 2011-12-14 19:00, Tom Lane wrote:
> So the problem is that we have to either disallow such opclass designs,
> or support per-opclass rather than per-index-AM decisions about whether
> index-only scans are possible.

Just a quick comment, for some queries like the famous
select count(*) from table where column @@ to_tsquery('something');
I do think index-only-scans does make sense even on indices
where the original tuple cannot be re-constructed. This also goes
for gin indices as well.

.. and yes, I do have a real-world application that would utillize this.
(and love it)

Jesper
-- 
Jesper



Re: SP-GiST versus index-only scans

From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> On 2011-12-14 19:00, Tom Lane wrote:
>> So the problem is that we have to either disallow such opclass designs,
>> or support per-opclass rather than per-index-AM decisions about whether
>> index-only scans are possible.

> Just a quick comment, for some queries like the famous
> select count(*) from table where column @@ to_tsquery('something');
> I do think index-only-scans does make sense even on indices
> where the original tuple cannot be re-constructed. This also goes
> for gin indices as well.

I think this is somewhat wishful thinking unfortunately.  The difficulty
is that if the index isn't capable of reconstructing the original value,
then it's probably giving only an approximate (lossy) answer, which
means we'll have to visit the heap to recheck each result, which
pretty much defeats the purpose of an index-only scan.  So I can't get
excited about contorting things to support this.
        regards, tom lane


Re: SP-GiST versus index-only scans

From
Jesper Krogh
Date:
On 2011-12-14 19:48, Tom Lane wrote:
> Jesper Krogh<jesper@krogh.cc>  writes:
>> On 2011-12-14 19:00, Tom Lane wrote:
>>> So the problem is that we have to either disallow such opclass designs,
>>> or support per-opclass rather than per-index-AM decisions about whether
>>> index-only scans are possible.
>> Just a quick comment, for some queries like the famous
>> select count(*) from table where column @@ to_tsquery('something');
>> I do think index-only-scans does make sense even on indices
>> where the original tuple cannot be re-constructed. This also goes
>> for gin indices as well.
> I think this is somewhat wishful thinking unfortunately.  The difficulty
> is that if the index isn't capable of reconstructing the original value,
> then it's probably giving only an approximate (lossy) answer, which
> means we'll have to visit the heap to recheck each result, which
> pretty much defeats the purpose of an index-only scan.  So I can't get
> excited about contorting things to support this.

I can see that it is hard to generalize, but in the tsvector case the
we are indeed not capable of reconstructing the row since the
positions are not stored in the index, the actual lookup is not a
lossy and I'm fairly sure (based on experience) that pg dont
revisit heap-tuples for checking (only for visibillity).

-- 
Jesper
-- 
Jesper


Re: SP-GiST versus index-only scans

From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> On 2011-12-14 19:48, Tom Lane wrote:
>> I think this is somewhat wishful thinking unfortunately.  The difficulty
>> is that if the index isn't capable of reconstructing the original value,
>> then it's probably giving only an approximate (lossy) answer, which
>> means we'll have to visit the heap to recheck each result, which
>> pretty much defeats the purpose of an index-only scan.

> I can see that it is hard to generalize, but in the tsvector case the
> we are indeed not capable of reconstructing the row since the
> positions are not stored in the index, the actual lookup is not a
> lossy and I'm fairly sure (based on experience) that pg dont
> revisit heap-tuples for checking (only for visibillity).

Well, the way the tsvector code handles this stuff is that it reports
the result as lossy only if the query actually poses a constraint on
position (some do, some don't).  That case was actually what made us
move the determination of lossiness from plan time to execution time,
since in the case of a non-constant tsquery, there's no way for the
planner to know about it (and even with the constant case, you'd need a
helper function that doesn't exist today).  But this behavior is
problematic for index-only scans because the planner can't tell whether
a query will be lossy or not, and it makes a heck of a lot bigger
difference than it used to.
        regards, tom lane


Re: SP-GiST versus index-only scans

From
Stefan Keller
Date:
Tom,
There seems to exist some opportunities now with GIST which relate to
geometry/geography types (but not only...):
1. Index-only scans on geometry columns with SP-GIST (being able to do
a "SELECT id FROM my_table WHERE mygeom...;").
2. Index clustering incuding NULL values (i.e. being able to do a
"CLUSTER mygeom_index ON mytable;" ).
This discussion suggests that at least 1. is close to be implemented.
The problem of 2. has to do with handling NULL values; it's mentioned
in the PostGIS manual [1]. I'm aware of kd-tree index development [2].
Don't know if clustering and index-only scans would be resolved there.

But I can't find neither in the Todo List [3] ?  What do you think?
Yours, Stefan
[2] http://postgis.refractions.net/docs/ch06.html#id2635907
[3] http://old.nabble.com/IMPORTANT%3A-%28Still%29-Seeking-Funding-for-Faster-PostGIS-Indexes-td32633545.html
[3] http://wiki.postgresql.org/wiki/Todo#Indexes

2011/12/14 Tom Lane <tgl@sss.pgh.pa.us>:
> Jesper Krogh <jesper@krogh.cc> writes:
>> On 2011-12-14 19:48, Tom Lane wrote:
>>> I think this is somewhat wishful thinking unfortunately.  The difficulty
>>> is that if the index isn't capable of reconstructing the original value,
>>> then it's probably giving only an approximate (lossy) answer, which
>>> means we'll have to visit the heap to recheck each result, which
>>> pretty much defeats the purpose of an index-only scan.
>
>> I can see that it is hard to generalize, but in the tsvector case the
>> we are indeed not capable of reconstructing the row since the
>> positions are not stored in the index, the actual lookup is not a
>> lossy and I'm fairly sure (based on experience) that pg dont
>> revisit heap-tuples for checking (only for visibillity).
>
> Well, the way the tsvector code handles this stuff is that it reports
> the result as lossy only if the query actually poses a constraint on
> position (some do, some don't).  That case was actually what made us
> move the determination of lossiness from plan time to execution time,
> since in the case of a non-constant tsquery, there's no way for the
> planner to know about it (and even with the constant case, you'd need a
> helper function that doesn't exist today).  But this behavior is
> problematic for index-only scans because the planner can't tell whether
> a query will be lossy or not, and it makes a heck of a lot bigger
> difference than it used to.
>
>                        regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Re: SP-GiST versus index-only scans

From
Tom Lane
Date:
Stefan Keller <sfkeller@gmail.com> writes:
> There seems to exist some opportunities now with GIST which relate to
> geometry/geography types (but not only...):
> 1. Index-only scans on geometry columns with SP-GIST (being able to do
> a "SELECT id FROM my_table WHERE mygeom...;").

Well, if somebody builds an SP-GiST opclass for geometry, that should
work.

> 2. Index clustering incuding NULL values (i.e. being able to do a
> "CLUSTER mygeom_index ON mytable;" ).

Huh?  GiST has supported nulls, and CLUSTER, since 8.2 or so.  The
section of the PostGIS manual you're referring to seems to be years
out of date.
        regards, tom lane