Thread: Index-only scans
Implementing index-only scans requires a few changes: 1. indexam API changes There's currently no way to return data from an index scan. You only get TID pointers to heap tuples. 2. Make visibility map crash-safe After crash, the visibility map can currently be left in state where it has some bits set that shouldn't be. That's OK for the 8.4 use, but until that's fixed we can't use it to skip visibility checks from the heap. The problem is described in the comments at the top of visibilitymap.c. I'll start a new thread on how to tackle that. 3. Executor changes The IndexScan node needs to be able to evaluate expressions and return tuples straight from the index, skipping heap fetches. Those are actually two separate features. Even if we don't solve the visibility map problem, just allowing the executor to evaluate quals that are not directly indexable using data from the index, would be useful. For example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a b-tree index on textcol, the planner could choose a full-index-scan, apply the '%bar%' filter on the index tuples, and only fetch those heap tuples that match that qual. 4. Planner changes The planner obviously needs to be changed to support the executor changes. It needs to identify those quals that can be evaluated using index columns only, and if the target list can be satisfied using index columns only. It also needs to consider using a full index scan in some queries where we didn't consider using an index before. The cost evaluation also needs to be adjusted to take all that into account. I have a prototype of this, except for the visibility map changes, at http://git.postgresql.org/gitweb?p=heikki.git;a=shortlog;h=refs/heads/index-only-scans. It's still work-in-progress, but should give an indication of where I'm heading. For July commitfest, I'd like to get the indexam API changes committed, and support for evaluating quals using data from index (which doesn't require the visibility map). I'll submit patches for those shortly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Even if we don't solve the visibility > map problem, just allowing the executor to evaluate quals that are not > directly indexable using data from the index, would be useful. For > example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a > b-tree index on textcol, the planner could choose a full-index-scan, > apply the '%bar%' filter on the index tuples, and only fetch those heap > tuples that match that qual. Interesting, I had not considered that. You are using the index as a single-column table that can be scanned more quickly than the heap. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Implementing index-only scans requires a few changes: I'm happy to see this work! Now that the EXISTS predicates have been optimized to consider semi-join and anti-join techniques, I believe that these index issues (evaluating quals before heap access and skipping heap access when all columns of interest are in an index) are the last significant optimizations missing from PostgreSQL to allow it to outperform our previous DBMS on *all* queries, rather than just the vast majority of them. :-) As far as our production queries go, based on our experience with several other products against this schema, this is the area where the biggest performance gains remain to be realized. -Kevin
On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote: > Implementing index-only scans requires a few changes: I would like to see a clear exposition of the use cases and an an analysis of the costs and benefits of doing this. It sounds cool, but I want to know it is cool before we spend time solving all of the juicy problems. Perhaps a glue-and-string patch would help. Extra buffer accesses for vismap, crash-safe vismap sound like performance issues, as well as planner time, not to mention all the tuits needed. Will it damage the general case? Oracle-does-index-only-scans, true, but we have other overheads that mean it may not be a slamdunk win for Postgres in every case. I dealt with a company last year that insisted this was a requirement and yet was resistant to any ideas of how the solution could be achieved in other ways not available and not typically known to Oracle users. The single SQL example mentioned already has at least two mechanisms for improving performance of that type of query. We probably don't need another, or at least we need a good analysis of why. The benefit that occurs to me most is covered indexes, i.e. it opens up new and interesting indexing strategies. Covered indexes are also one kind of materialized view. It may be better to implement mat views and gain wider benefits too. Or maybe index-only scans are mat views, via some cunning plan? No doubts about the implementation details if we do it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Mon, Jul 13, 2009 at 7:04 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > As far as our production queries go, based on our experience with > several other products against this schema, this is the area where the > biggest performance gains remain to be realized. There's a logical fallacy implicit in this statement. Namely that all potential performance gains out there are already implemented by other products :) -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> wrote: > On Mon, Jul 13, 2009 at 7:04 PM, Kevin > Grittner<Kevin.Grittner@wicourts.gov> wrote: >> As far as our production queries go, based on our experience with >> several other products against this schema, this is the area where >> the biggest performance gains remain to be realized. > > > There's a logical fallacy implicit in this statement. Namely that > all potential performance gains out there are already implemented by > other products :) No, I'm basing it on having tested our application software with various DBMS products and obtained benchmark timings. PostgreSQL was faster overall than products to which I compared it, but there were a few standout slow queries under PostgreSQL. When reviewing the plans under the different products, the absence of particular optimizations within PostgreSQL, present in the other products, became apparent. I don't base my observations on academic theory or speculation, even if licensing agreements rarely allow me to share the supporting detail. -Kevin
Does PG have an intermediate execution node to sort/batch index entries (heap tuple ptrs) by heap page prior to lookup? Somethingmssql does ... > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian > Sent: Monday, July 13, 2009 6:38 AM > To: Heikki Linnakangas > Cc: PostgreSQL-development > Subject: Re: [HACKERS] Index-only scans > > Heikki Linnakangas wrote: > > Even if we don't solve the visibility > > map problem, just allowing the executor to evaluate quals > that are not > > directly indexable using data from the index, would be useful. For > > example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and > you have a > > b-tree index on textcol, the planner could choose a > full-index-scan, > > apply the '%bar%' filter on the index tuples, and only fetch those > > heap tuples that match that qual. > > Interesting, I had not considered that. You are using the > index as a single-column table that can be scanned more > quickly than the heap. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > -- > Sent via pgsql-hackers mailing list > (pgsql-hackers@postgresql.org) To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Mon, Jul 13, 2009 at 5:38 PM, Mischa Sandberg<mischa.sandberg@sophos.com> wrote: > Does PG have an intermediate execution node to sort/batch index entries (heap tuple ptrs) by heap page prior to lookup?Something mssql does ... > it sounds a lot like a bitmap index scan -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Simon Riggs wrote: > On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote: > >> Implementing index-only scans requires a few changes: > > I would like to see a clear exposition of the use cases and an an > analysis of the costs and benefits of doing this. It sounds cool, but I > want to know it is cool before we spend time solving all of the juicy > problems. Perhaps a glue-and-string patch would help. There's a working prototype at in my git repository at git.postgresql.org. > Extra buffer accesses for vismap, crash-safe vismap sound like > performance issues, as well as planner time, not to mention all the > tuits needed. Will it damage the general case? It does add some work to the planner, but I don't think it's noticeable. The visibility map accesses are only needed when we're doing an index-only scan, not in the general case, so the impact of those come down to how well we can estimate the cost of index-only scans, so that an index-only scan is not chosen when not beneficial. > The single SQL example mentioned already has at least two mechanisms for > improving performance of that type of query. We probably don't need > another, or at least we need a good analysis of why. Well, another class of queries where index-only scans are beneficial is when you fetch a range of rows from index, where the heap fetches result in a lot of random I/O. Clustering helps with that, but you can only cluster a table on one column. A classic example where that's a problem is a many-to-many relationship: CREATE TABLE a (aid integer, ...); CREATE TABLE b (bid integer, ...); CREATE TABLE manytomany (aid integer, bid integer); CREATE INDEX a_b ON manytomany (aid, bid); CREATE INDEX b_a ON manytomany (bid, aid); If you need to query the many-to-many relationship in "both directions", ie: SELECT bid FROm manytomany WHERE aid = ? SELECT aid FROM manytomany WHERE bid = ? You have to choose which index you cluster the table on, which will be fast, and the other query will be slow. > The benefit that occurs to me most is covered indexes, i.e. it opens up > new and interesting indexing strategies. Covered indexes are also one > kind of materialized view. It may be better to implement mat views and > gain wider benefits too. Materialized view sure would be nice, but doesn't address quite the same use cases. Doesn't help with the many-to-many example above, for example. We should have both. > Or maybe index-only scans are mat views, via > some cunning plan? Heh, no :-). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote: > >> Implementing index-only scans requires a few changes: > > I would like to see a clear exposition of the use cases and an an > analysis of the costs and benefits of doing this. It sounds cool, but I > want to know it is cool before we spend time solving all of the juicy > problems. BTW, there's another trick that I'm *not* going to implement yet, which is to allow joins using data from indexes only, and fetching the rest of the columns after the join. For example: CREATE TABLE a (aid integer PRIMARY KEY, adata text); CREATE TABLE b (bid integer PRIMARY KEY, bdata text); SELECT aid, adata, bid, bdata FROM a, b WHERE aid = bid; If the join is very selective, IOW there's only a few matching rows, it is a lot more efficient to perform the join first using just the indexes, and fetch adata and bdata columns and check visibility for the matching rows only. You can get pretty close by clustering the tables, but the wider the tables the bigger the difference. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-07-14 at 11:23 +0300, Heikki Linnakangas wrote: > > The single SQL example mentioned already has at least two mechanisms for > > improving performance of that type of query. We probably don't need > > another, or at least we need a good analysis of why. > > Well, another class of queries where index-only scans are beneficial is > when you fetch a range of rows from index, where the heap fetches result > in a lot of random I/O. Which we just optimised, no? I see it could be even better, but do we really need that class of query to be optimized again? (Possibly...) I'm not doubting your ability to solve every problem we face, just advising that we crunch a few numbers before we spend months on implementing all of this. It would be useful to have projected gains and a useful test case ahead of time, so we can measure the potential as early as possible. Yes, you don't need to convince me before you proceed, but if you can at least share the thoughts that have convinced you that would help. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Monday 13 July 2009 16:38:18 Bruce Momjian wrote: > Heikki Linnakangas wrote: > > Even if we don't solve the visibility > > map problem, just allowing the executor to evaluate quals that are not > > directly indexable using data from the index, would be useful. For > > example, "SELECT * FROM foo WHERE textcol LIKE '%bar%', and you have a > > b-tree index on textcol, the planner could choose a full-index-scan, > > apply the '%bar%' filter on the index tuples, and only fetch those heap > > tuples that match that qual. > > Interesting, I had not considered that. You are using the index as a > single-column table that can be scanned more quickly than the heap. On slightly bizarre application of this could be that you create a poor man's column storage by creating heap "indexes" on tables. These would just be narrower copies of the original heap, but allow faster fetching. This might actually be useful for data types that don't support btree indexing.
Now I'm back where I can go look at the source code :-) Thanks. > -----Original Message----- > From: Jaime Casanova [mailto:jcasanov@systemguards.com.ec] > Sent: Monday, July 13, 2009 8:40 PM > To: Mischa Sandberg > Cc: Heikki Linnakangas; PostgreSQL-development > Subject: Re: [HACKERS] Index-only scans > > On Mon, Jul 13, 2009 at 5:38 PM, Mischa > Sandberg<mischa.sandberg@sophos.com> wrote: > > Does PG have an intermediate execution node to sort/batch > index entries (heap tuple ptrs) by heap page prior to lookup? > Something mssql does ... > > > > it sounds a lot like a bitmap index scan > > > -- > Atentamente, > Jaime Casanova > Soporte y capacitación de PostgreSQL > Asesoría y desarrollo de sistemas > Guayaquil - Ecuador > Cel. +59387171157 >
Heikki Linnakangas wrote: > ... > CREATE TABLE manytomany (aid integer, bid integer); > CREATE INDEX a_b ON manytomany (aid, bid); > CREATE INDEX b_a ON manytomany (bid, aid); > ... >> new and interesting indexing strategies. Covered indexes are also one >> kind of materialized view. It may be better to implement mat views and >> gain wider benefits too. > > Materialized view sure would be nice, but doesn't address quite the same > use cases. Doesn't help with the many-to-many example above, for > example. We should have both. Really? I'd have thought that index is similar to materializing these views: create view a_b as select aid,bid from manytomany order by aid,bid; create view b_a as select bid,aid from manytomanyorder by bid,aid; Or perhaps create view a_b as select aid,array_agg(bid) from manytomany group by aid; But I like the index-only scan better anyway because I already have the indexes so the benefit would come to me automatically rather than having to pick and choose what views to materialize.
On Wed, Jul 15, 2009 at 1:21 AM, Ron Mayer<rm_pg@cheapcomplexdevices.com> wrote: > Really? I'd have thought that index is similar to materializing > these views: > create view a_b as select aid,bid from manytomany order by aid,bid; > create view b_a as select bid,aid from manytomany order by bid,aid; > Or perhaps > create view a_b as select aid,array_agg(bid) from manytomany group by aid; How do any of these views help you answer a query like "select aid from manytomany where bid in (subquery)"? The last one could help you answer the dual of that but not without rewriting the query quite heavily to use array operations. The first two I'm puzzled how they're useful at all since unless you add indexes to the materialized views they'll just contain a complete copy of the original table -- the most they could help with is avoiding a sort but they'll never be updatable so they'll rarely actually be usable. -- greg http://mit.edu/~gsstark/resume.pdf
Hi Heikki,
I was recollecting our conversation that, if the index-only quals were implemented through indexes by storing snapshots, broken data-types may not be supported. I feel this problem might exist, if we go on design a IOT(Index Organized Tables) / Clustered Indexes. IOT is again a index which stores snapshot info together with it.
So are we not going to have this feature ever in PG? What is our stand in that case? I am asking this because i have an IOT implementation on PG 8.2.
Thanks,
Gokul
I was recollecting our conversation that, if the index-only quals were implemented through indexes by storing snapshots, broken data-types may not be supported. I feel this problem might exist, if we go on design a IOT(Index Organized Tables) / Clustered Indexes. IOT is again a index which stores snapshot info together with it.
So are we not going to have this feature ever in PG? What is our stand in that case? I am asking this because i have an IOT implementation on PG 8.2.
Thanks,
Gokul
On Tue, Jul 14, 2009 at 2:08 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Simon Riggs wrote:BTW, there's another trick that I'm *not* going to implement yet, which
> On Mon, 2009-07-13 at 10:16 +0300, Heikki Linnakangas wrote:
>
>> Implementing index-only scans requires a few changes:
>
> I would like to see a clear exposition of the use cases and an an
> analysis of the costs and benefits of doing this. It sounds cool, but I
> want to know it is cool before we spend time solving all of the juicy
> problems.
is to allow joins using data from indexes only, and fetching the rest of
the columns after the join. For example:
CREATE TABLE a (aid integer PRIMARY KEY, adata text);
CREATE TABLE b (bid integer PRIMARY KEY, bdata text);
SELECT aid, adata, bid, bdata FROM a, b WHERE aid = bid;
If the join is very selective, IOW there's only a few matching rows, it
is a lot more efficient to perform the join first using just the
indexes, and fetch adata and bdata columns and check visibility for the
matching rows only.
You can get pretty close by clustering the tables, but the wider the
tables the bigger the difference.Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers