Thread: Index-only scans

Index-only scans

From
Heikki Linnakangas
Date:
Implementing index-only scans requires a few changes:

1. indexam API changes

There's currently no way to return data from an index scan. You only get
TID pointers to heap tuples.

2. Make visibility map crash-safe

After crash, the visibility map can currently be left in state where it
has some bits set that shouldn't be. That's OK for the 8.4 use, but
until that's fixed we can't use it to skip visibility checks from the
heap. The problem is described in the comments at the top of
visibilitymap.c. I'll start a new thread on how to tackle that.

3. Executor changes

The IndexScan node needs to be able to evaluate expressions and return
tuples straight from the index, skipping heap fetches. Those are
actually two separate features. Even if we don't solve the visibility
map problem, just allowing the executor to evaluate quals that are not
directly indexable using data from the index, would be useful. For
example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a
b-tree index on textcol, the planner could choose a full-index-scan,
apply the '%bar%' filter on the index tuples, and only fetch those heap
tuples that match that qual.

4. Planner changes

The planner obviously needs to be changed to support the executor
changes. It needs to identify those quals that can be evaluated using
index columns only, and if the target list can be satisfied using index
columns only. It also needs to consider using a full index scan in some
queries where we didn't consider using an index before. The cost
evaluation also needs to be adjusted to take all that into account.


I have a prototype of this, except for the visibility map changes, at
http://git.postgresql.org/gitweb?p=heikki.git;a=shortlog;h=refs/heads/index-only-scans.
It's still work-in-progress, but should give an indication of where I'm
heading. For July commitfest, I'd like to get the indexam API changes
committed, and support for evaluating quals using data from index (which
doesn't require the visibility map). I'll submit patches for those shortly.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Index-only scans

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Even if we don't solve the visibility
> map problem, just allowing the executor to evaluate quals that are not
> directly indexable using data from the index, would be useful. For
> example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a
> b-tree index on textcol, the planner could choose a full-index-scan,
> apply the '%bar%' filter on the index tuples, and only fetch those heap
> tuples that match that qual.

Interesting, I had not considered that.  You are using the index as a
single-column table that can be scanned more quickly than the heap.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Index-only scans

From
"Kevin Grittner"
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Implementing index-only scans requires a few changes:
I'm happy to see this work!  Now that the EXISTS predicates have been
optimized to consider semi-join and anti-join techniques, I believe
that these index issues (evaluating quals before heap access and
skipping heap access when all columns of interest are in an index) are
the last significant optimizations missing from PostgreSQL to allow it
to outperform our previous DBMS on *all* queries, rather than just the
vast majority of them.  :-)
As far as our production queries go, based on our experience with
several other products against this schema, this is the area where the
biggest performance gains remain to be realized.
-Kevin


Re: Index-only scans

From
Simon Riggs
Date:
On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote:

> Implementing index-only scans requires a few changes:

I would like to see a clear exposition of the use cases and an an
analysis of the costs and benefits of doing this. It sounds cool, but I
want to know it is cool before we spend time solving all of the juicy
problems. Perhaps a glue-and-string patch would help.

Extra buffer accesses for vismap, crash-safe vismap sound like
performance issues, as well as planner time, not to mention all the
tuits needed. Will it damage the general case?

Oracle-does-index-only-scans, true, but we have other overheads that
mean it may not be a slamdunk win for Postgres in every case. I dealt
with a company last year that insisted this was a requirement and yet
was resistant to any ideas of how the solution could be achieved in
other ways not available and not typically known to Oracle users.

The single SQL example mentioned already has at least two mechanisms for
improving performance of that type of query. We probably don't need
another, or at least we need a good analysis of why.

The benefit that occurs to me most is covered indexes, i.e. it opens up
new and interesting indexing strategies. Covered indexes are also one
kind of materialized view. It may be better to implement mat views and
gain wider benefits too. Or maybe index-only scans are mat views, via
some cunning plan? 

No doubts about the implementation details if we do it.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Index-only scans

From
Greg Stark
Date:
On Mon, Jul 13, 2009 at 7:04 PM, Kevin
Grittner<Kevin.Grittner@wicourts.gov> wrote:
> As far as our production queries go, based on our experience with
> several other products against this schema, this is the area where the
> biggest performance gains remain to be realized.


There's a logical fallacy implicit in this statement. Namely that all
potential performance gains out there are already implemented by other
products :)

-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: Index-only scans

From
"Kevin Grittner"
Date:
Greg Stark <gsstark@mit.edu> wrote: 
> On Mon, Jul 13, 2009 at 7:04 PM, Kevin
> Grittner<Kevin.Grittner@wicourts.gov> wrote:
>> As far as our production queries go, based on our experience with
>> several other products against this schema, this is the area where
>> the biggest performance gains remain to be realized.
> 
> 
> There's a logical fallacy implicit in this statement. Namely that
> all potential performance gains out there are already implemented by
> other products :)
No, I'm basing it on having tested our application software with
various DBMS products and obtained benchmark timings.  PostgreSQL was
faster overall than products to which I compared it, but there were a
few standout slow queries under PostgreSQL.  When reviewing the plans
under the different products, the absence of particular optimizations
within PostgreSQL, present in the other products, became apparent.  I
don't base my observations on academic theory or speculation, even if
licensing agreements rarely allow me to share the supporting detail.
-Kevin


Re: Index-only scans

From
Mischa Sandberg
Date:
Does PG have an intermediate execution node to sort/batch index entries (heap tuple ptrs) by heap page prior to lookup?
Somethingmssql does ... 

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
> Sent: Monday, July 13, 2009 6:38 AM
> To: Heikki Linnakangas
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Index-only scans
>
> Heikki Linnakangas wrote:
> > Even if we don't solve the visibility
> > map problem, just allowing the executor to evaluate quals
> that are not
> > directly indexable using data from the index, would be useful. For
> > example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and
> you have a
> > b-tree index on textcol, the planner could choose a
> full-index-scan,
> > apply the '%bar%' filter on the index tuples, and only fetch those
> > heap tuples that match that qual.
>
> Interesting, I had not considered that.  You are using the
> index as a single-column table that can be scanned more
> quickly than the heap.
>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + If your life is a hard drive, Christ can be your backup. +
>
> --
> Sent via pgsql-hackers mailing list
> (pgsql-hackers@postgresql.org) To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: Index-only scans

From
Jaime Casanova
Date:
On Mon, Jul 13, 2009 at 5:38 PM, Mischa
Sandberg<mischa.sandberg@sophos.com> wrote:
> Does PG have an intermediate execution node to sort/batch index entries (heap tuple ptrs) by heap page prior to
lookup?Something mssql does ... 
>

it sounds a lot like a bitmap index scan


--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157


Re: Index-only scans

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote:
> 
>> Implementing index-only scans requires a few changes:
> 
> I would like to see a clear exposition of the use cases and an an
> analysis of the costs and benefits of doing this. It sounds cool, but I
> want to know it is cool before we spend time solving all of the juicy
> problems. Perhaps a glue-and-string patch would help.

There's a working prototype at in my git repository at git.postgresql.org.

> Extra buffer accesses for vismap, crash-safe vismap sound like
> performance issues, as well as planner time, not to mention all the
> tuits needed. Will it damage the general case?

It does add some work to the planner, but I don't think it's noticeable.
The visibility map accesses are only needed when we're doing an
index-only scan, not in the general case, so the impact of those come
down to how well we can estimate the cost of index-only scans, so that
an index-only scan is not chosen when not beneficial.

> The single SQL example mentioned already has at least two mechanisms for
> improving performance of that type of query. We probably don't need
> another, or at least we need a good analysis of why.

Well, another class of queries where index-only scans are beneficial is
when you fetch a range of rows from index, where the heap fetches result
in a lot of random I/O. Clustering helps with that, but you can only
cluster a table on one column. A classic example where that's a problem
is a many-to-many relationship:

CREATE TABLE a (aid integer, ...);
CREATE TABLE b (bid integer, ...);
CREATE TABLE manytomany (aid integer, bid integer);
CREATE INDEX a_b ON manytomany (aid, bid);
CREATE INDEX b_a ON manytomany (bid, aid);

If you need to query the many-to-many relationship in "both directions", ie:
SELECT bid FROm manytomany WHERE aid = ?
SELECT aid FROM manytomany WHERE bid = ?

You have to choose which index you cluster the table on, which will be
fast, and the other query will be slow.

> The benefit that occurs to me most is covered indexes, i.e. it opens up
> new and interesting indexing strategies. Covered indexes are also one
> kind of materialized view. It may be better to implement mat views and
> gain wider benefits too.

Materialized view sure would be nice, but doesn't address quite the same
use cases. Doesn't help with the many-to-many example above, for
example. We should have both.

> Or maybe index-only scans are mat views, via
> some cunning plan? 

Heh, no :-).

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Index-only scans

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote:
> 
>> Implementing index-only scans requires a few changes:
> 
> I would like to see a clear exposition of the use cases and an an
> analysis of the costs and benefits of doing this. It sounds cool, but I
> want to know it is cool before we spend time solving all of the juicy
> problems.

BTW, there's another trick that I'm *not* going to implement yet, which
is to allow joins using data from indexes only, and fetching the rest of
the columns after the join. For example:

CREATE TABLE a (aid integer PRIMARY KEY, adata text);
CREATE TABLE b (bid integer PRIMARY KEY, bdata text);

SELECT aid, adata, bid, bdata FROM a, b WHERE aid = bid;

If the join is very selective, IOW there's only a few matching rows, it
is a lot more efficient to perform the join first using just the
indexes, and fetch adata and bdata columns and check visibility for the
matching rows only.

You can get pretty close by clustering the tables, but the wider the
tables the bigger the difference.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Index-only scans

From
Simon Riggs
Date:
On Tue, 2009-07-14 at 11:23 +0300, Heikki Linnakangas wrote:

> > The single SQL example mentioned already has at least two mechanisms for
> > improving performance of that type of query. We probably don't need
> > another, or at least we need a good analysis of why.
> 
> Well, another class of queries where index-only scans are beneficial is
> when you fetch a range of rows from index, where the heap fetches result
> in a lot of random I/O. 

Which we just optimised, no? I see it could be even better, but do we
really need that class of query to be optimized again? (Possibly...)

I'm not doubting your ability to solve every problem we face, just
advising that we crunch a few numbers before we spend months on
implementing all of this. It would be useful to have projected gains and
a useful test case ahead of time, so we can measure the potential as
early as possible. Yes, you don't need to convince me before you
proceed, but if you can at least share the thoughts that have convinced
you that would help.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Index-only scans

From
Peter Eisentraut
Date:
On Monday 13 July 2009 16:38:18 Bruce Momjian wrote:
> Heikki Linnakangas wrote:
> > Even if we don't solve the visibility
> > map problem, just allowing the executor to evaluate quals that are not
> > directly indexable using data from the index, would be useful. For
> > example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a
> > b-tree index on textcol, the planner could choose a full-index-scan,
> > apply the '%bar%' filter on the index tuples, and only fetch those heap
> > tuples that match that qual.
>
> Interesting, I had not considered that.  You are using the index as a
> single-column table that can be scanned more quickly than the heap.

On slightly bizarre application of this could be that you create a poor man's 
column storage by creating heap "indexes" on tables.  These would just be 
narrower copies of the original heap, but allow faster fetching.  This might 
actually be useful for data types that don't support btree indexing.


Re: Index-only scans

From
Mischa Sandberg
Date:
Now I'm back where I can go look at the source code :-)
Thanks.

> -----Original Message-----
> From: Jaime Casanova [mailto:jcasanov@systemguards.com.ec]
> Sent: Monday, July 13, 2009 8:40 PM
> To: Mischa Sandberg
> Cc: Heikki Linnakangas; PostgreSQL-development
> Subject: Re: [HACKERS] Index-only scans
>
> On Mon, Jul 13, 2009 at 5:38 PM, Mischa
> Sandberg<mischa.sandberg@sophos.com> wrote:
> > Does PG have an intermediate execution node to sort/batch
> index entries (heap tuple ptrs) by heap page prior to lookup?
> Something mssql does ...
> >
>
> it sounds a lot like a bitmap index scan
>
>
> --
> Atentamente,
> Jaime Casanova
> Soporte y capacitación de PostgreSQL
> Asesoría y desarrollo de sistemas
> Guayaquil - Ecuador
> Cel. +59387171157
>


Re: Index-only scans

From
Ron Mayer
Date:
Heikki Linnakangas wrote:
> ...
> CREATE TABLE manytomany (aid integer, bid integer);
> CREATE INDEX a_b ON manytomany (aid, bid);
> CREATE INDEX b_a ON manytomany (bid, aid);
> ...
>> new and interesting indexing strategies. Covered indexes are also one
>> kind of materialized view. It may be better to implement mat views and
>> gain wider benefits too.
> 
> Materialized view sure would be nice, but doesn't address quite the same
> use cases. Doesn't help with the many-to-many example above, for
> example. We should have both.

Really?  I'd have thought that index is similar to materializing
these views: create view a_b as select aid,bid from manytomany order by aid,bid; create view b_a as select bid,aid from
manytomanyorder by bid,aid;
 
Or perhaps create view a_b as select aid,array_agg(bid) from manytomany group by aid;


But I like the index-only scan better anyway because I already have
the indexes so the benefit would come to me automatically rather than
having to pick and choose what views to materialize.



Re: Index-only scans

From
Greg Stark
Date:
On Wed, Jul 15, 2009 at 1:21 AM, Ron Mayer<rm_pg@cheapcomplexdevices.com> wrote:
> Really?  I'd have thought that index is similar to materializing
> these views:
>  create view a_b as select aid,bid from manytomany order by aid,bid;
>  create view b_a as select bid,aid from manytomany order by bid,aid;
> Or perhaps
>  create view a_b as select aid,array_agg(bid) from manytomany group by aid;


How do any of these views help you answer a query like "select aid
from manytomany where bid in (subquery)"?

The last one could help you answer the dual of that but not without
rewriting the query quite heavily to use array operations. The first
two I'm puzzled how they're useful at all since unless you add indexes
to the materialized views they'll just contain a complete copy of the
original table -- the most they could help with is avoiding a sort but
they'll never be updatable so they'll rarely actually be usable.

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Index-only scans

From
Gokulakannan Somasundaram
Date:
Hi Heikki,
           I was recollecting our conversation that, if the index-only quals were implemented through indexes by storing snapshots, broken data-types may not be supported. I feel this problem might exist, if we go on design a IOT(Index Organized Tables) / Clustered Indexes. IOT is again a index which stores snapshot info together with it.
           So are we not going to have this feature ever in PG? What is our stand in that case? I am asking this because i have an IOT implementation on PG 8.2.

Thanks,
Gokul

On Tue, Jul 14, 2009 at 2:08 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Simon Riggs wrote:
> On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote:
>
>> Implementing index-only scans requires a few changes:
>
> I would like to see a clear exposition of the use cases and an an
> analysis of the costs and benefits of doing this. It sounds cool, but I
> want to know it is cool before we spend time solving all of the juicy
> problems.

BTW, there's another trick that I'm *not* going to implement yet, which
is to allow joins using data from indexes only, and fetching the rest of
the columns after the join. For example:

CREATE TABLE a (aid integer PRIMARY KEY, adata text);
CREATE TABLE b (bid integer PRIMARY KEY, bdata text);

SELECT aid, adata, bid, bdata FROM a, b WHERE aid = bid;

If the join is very selective, IOW there's only a few matching rows, it
is a lot more efficient to perform the join first using just the
indexes, and fetch adata and bdata columns and check visibility for the
matching rows only.

You can get pretty close by clustering the tables, but the wider the
tables the bigger the difference.

--
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers