Thread: Indexed views?

Indexed views?

From
Tiago Wright
Date:
Are there any plans to support indexed views, or cross-table indexes,
or any form of "materialized views" in postgresql? How complex would
the implementation be? Indexed views are sometimes the best way to
improve the performance of complex queries.

-Tiago


Re: Indexed views?

From
"Joshua D. Drake"
Date:
Tiago Wright wrote:

>Are there any plans to support indexed views, or cross-table indexes,
>or any form of "materialized views" in postgresql? How complex would
>the implementation be? Indexed views are sometimes the best way to
>improve the performance of complex queries.
>  
>
The planner will use an index across multiple tables, even when called 
from a view as long as the comparing
tables are of the same type...

As far as materialized views, you can use triggers to create them.

Sincerely,

Joshua D. Drake




>-Tiago
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>  
>


-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL



Re: Indexed views?

From
Tiago Wright
Date:
I meant indexes on fields from multiple tables, or equivaltently
indexes on a view spanning fields from multiple tables.

For example, consider the view
CREATE VIEW vw_lot AS
 SELECT productid, lotid, parentlotid, lottype, lotname, productname
   FROM lot
NATURAL JOIN product;

where productname is in the product table, and lotname in the lot
table. I would be interested in creating an index such as

CREATE INDEX ix_vw_lot ON vw_lot (lotname, productname);

for performance reasons, since both my lot and product tables are very
large. The index would be enough to cover 90% of the queries against
lot the lot and inventory tables.

-Tiago



On Mon, 06 Sep 2004 10:17:24 -0700, Joshua D. Drake
<jd@commandprompt.com> wrote:
>
>
> Tiago Wright wrote:
>
> >Are there any plans to support indexed views, or cross-table indexes,
> >or any form of "materialized views" in postgresql? How complex would
> >the implementation be? Indexed views are sometimes the best way to
> >improve the performance of complex queries.
> >
> >
> The planner will use an index across multiple tables, even when called
> from a view as long as the comparing
> tables are of the same type...
>
> As far as materialized views, you can use triggers to create them.
>
> Sincerely,
>
> Joshua D. Drake
>
> >-Tiago
> >
> >---------------------------(end of broadcast)---------------------------
> >TIP 4: Don't 'kill -9' the postmaster
> >
> >
>
> --
> Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
> Postgresql support, programming shared hosting and dedicated hosting.
> +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
> PostgreSQL Replicator -- production quality replication for PostgreSQL
>
>


Re: Indexed views?

From
Tom Lane
Date:
Tiago Wright <tiagowright@gmail.com> writes:
> For example, consider the view
> CREATE VIEW vw_lot AS
> �SELECT�productid,�lotid,�parentlotid,�lottype,�lotname,�productname
> ���FROM�lot
> NATURAL�JOIN�product;

> where productname is in the product table, and lotname in the lot
> table. I would be interested in creating an index such as

> CREATE INDEX ix_vw_lot ON vw_lot (lotname, productname);

What purpose would this serve that indexes on the separate tables
wouldn't serve?

> The index would be enough to cover 90% of the queries against
> lot the lot and inventory tables.

This sounds to me like you are suffering from a common misconception.
Postgres cannot answer queries from the contents of indexes alone.
        regards, tom lane


Re: Indexed views?

From
Tiago Wright
Date:
> > where productname is in the product table, and lotname in the lot
> > table. I would be interested in creating an index such as
> 
> > CREATE INDEX ix_vw_lot ON vw_lot (lotname, productname);
> 
> What purpose would this serve that indexes on the separate tables
> wouldn't serve?
> 
> > The index would be enough to cover 90% of the queries against
> > lot the lot and inventory tables.
> 
> This sounds to me like you are suffering from a common misconception.
> Postgres cannot answer queries from the contents of indexes alone.

Yes, thanks Tom. This is precisely what I was missing. I searched the
archives for the reason why this is so, but I found only one message
mentioning the MVCC mechanism. Can you point me in the right
direction? I would like to understand the issue.

IMHO, a change in this area could deliver great performance improvements.

-Tiago


Re: Indexed views?

From
Doug McNaught
Date:
Tiago Wright <tiagowright@gmail.com> writes:

> Yes, thanks Tom. This is precisely what I was missing. I searched the
> archives for the reason why this is so, but I found only one message
> mentioning the MVCC mechanism. Can you point me in the right
> direction? I would like to understand the issue.

Short answer: MVCC tuple visibility status isn't (and can't be) stored
in the index.  So the backend has to visit the actual tuple to see if
it is visible to the transaction that is currently running.

> IMHO, a change in this area could deliver great performance improvements.

Hard to say how it would work, but come up with a good design and
quality patch and it'll probably go in.  :)

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.  --T. J. Jackson, 1863


Re: Indexed views?

From
Alvaro Herrera
Date:
On Tue, Sep 07, 2004 at 07:58:56PM -0400, Doug McNaught wrote:
> Tiago Wright <tiagowright@gmail.com> writes:

> > Yes, thanks Tom. This is precisely what I was missing. I searched the
> > archives for the reason why this is so, but I found only one message
> > mentioning the MVCC mechanism. Can you point me in the right
> > direction? I would like to understand the issue.
> 
> > IMHO, a change in this area could deliver great performance improvements.
> 
> Hard to say how it would work, but come up with a good design and
> quality patch and it'll probably go in.  :)

Probably not.  This has been discussed before; what's needed is that the
visibility information is stored also in the index.  This is hard and
inefficient to do, because it requires updating the index at the same
time that the heap is updated.  Which is a bad proposition as soon as
there is more than one index, and when there is a seqscan involved (i.e.
no index), because it means a lot of extra I/O.


A proposal that would be better received would be to implement "bitmap
scanning" of indexes, which could mean retrieving heap pages in order,
yielding much better performance.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
We take risks not to escape from life, but to prevent life escaping from us.



Re: Indexed views?

From
Doug McNaught
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

> On Tue, Sep 07, 2004 at 07:58:56PM -0400, Doug McNaught wrote:

>> Hard to say how it would work, but come up with a good design and
>> quality patch and it'll probably go in.  :)
>
> Probably not.  This has been discussed before; what's needed is that the
> visibility information is stored also in the index.  This is hard and
> inefficient to do, because it requires updating the index at the same
> time that the heap is updated.  Which is a bad proposition as soon as
> there is more than one index, and when there is a seqscan involved (i.e.
> no index), because it means a lot of extra I/O.

Yeah, hence the smiley. 

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.  --T. J. Jackson, 1863


Re: Indexed views?

From
Tom Lane
Date:
Doug McNaught <doug@mcnaught.org> writes:
> Short answer: MVCC tuple visibility status isn't (and can't be) stored
> in the index.

Well, in principle it *could* be, but there are strong arguments why it
shouldn't be: the costs of updating N index entries instead of just one
tuple entry, the potential reliability hit (what happens when the index
entries disagree with the master?), and the increase in index size
(adding an extra dozen bytes to an index entry is a very nontrivial
I/O hit).

I don't say we wouldn't ever do it, but it will take a lot more than
newbies opining that it might be a good idea to persuade us that it
is a good idea.
        regards, tom lane


Re: Indexed views?

From
Greg Stark
Date:
Doug McNaught <doug@mcnaught.org> writes:

> Short answer: MVCC tuple visibility status isn't (and can't be) stored
> in the index.  

Well the "can't" part is false or at least unproven. From prior discussion the
only thing that would be technically challenging would be avoiding deadlocks.

The main issue raised is that storing the visibility information in the index
would incur other performance costs. It would increase the storage size of the
index and it would dramatically increase the i/o needed to maintain indexes
for updates and deletes.

So it becomes a question of performance trade-offs, as many things do. 
It would be an experiment that would require a lot of work to even try,
and people are skeptical that it would really have any benefits.

-- 
greg



Re: Indexed views?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Doug McNaught <doug@mcnaught.org> writes:
> > Short answer: MVCC tuple visibility status isn't (and can't be) stored
> > in the index.
> 
> Well, in principle it *could* be, but there are strong arguments why it
> shouldn't be: the costs of updating N index entries instead of just one
> tuple entry, the potential reliability hit (what happens when the index
> entries disagree with the master?), and the increase in index size
> (adding an extra dozen bytes to an index entry is a very nontrivial
> I/O hit).

Hm. Just thinking aloud here. But what if there was an option to store the
visibility information separately from the heap entirely. There would still
only be one copy of the visibility information and it wouldn't increase
storage or i/o requirements.

I'm assuming this would only make sense if the visibility information could be
stored on a separate spindle. Or at least if the application never uses
sequential scans, especially if the indexes cover the needed columns.

But if the table has particularly wide records, then it might be useful to
avoid having to read in the many blocks of records. Even if the index doesn't
cover the columns needed if there are many dead tuples (or not-yet-alive
tuples) reading the very densely packed visibility information might be faster
than reading the wide records.

Even for narrow tables, if the index covers the columns it would be faster to
read the even narrower visibility information alone. If the user opted to
*only ever* access the data via the index he could drop the actual heap
information and end up with a 90% solution for "index organized tables". The
visibility information would still be in a heap but not all the column data.

I'm not sure the benefits would really outweigh the costs, but it would
probably be simpler than storing duplicate visibility information in an index.

-- 
greg



Re: Indexed views?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Hm. Just thinking aloud here. But what if there was an option to store the
> visibility information separately from the heap entirely. There would still
> only be one copy of the visibility information and it wouldn't increase
> storage or i/o requirements.

How do you figure that it wouldn't increase I/O?  In most scenarios that
would be still another block to read in to find out whether a tuple is
valid.  (And that's assuming that you don't need any fancy indexing
scheme to associate tuples with visibility records.)

> But if the table has particularly wide records, then it might be useful to
> avoid having to read in the many blocks of records.

I think the TOAST scheme already gets much of the low-hanging fruit
here.  It might be interesting to expose more TOAST control knobs,
though --- for instance make the thresholds settable per-table.
        regards, tom lane


Re: Indexed views?

From
Mischa Sandberg
Date:
Greg Stark wrote:
> Doug McNaught <doug@mcnaught.org> writes:
> 
> 
>>Short answer: MVCC tuple visibility status isn't (and can't be) stored
>>in the index.  
> 
> 
> Well the "can't" part is false or at least unproven. From prior discussion the
> only thing that would be technically challenging would be avoiding deadlocks.

Rather than yank the MVCC visibility around, how about a (relatively 
small) change to the query plan ...

I take it that it is a very reasonable assumption that only a small 
proportion of index records are actually invalid (else Yurk why use the 
index?). In that case, how about tacking the heap table row ptr onto 
result tuples, and letting them percolate up through the tree?

Since you're using an index at all, the planner must be expecting a 
restricted set of rows to make it up through to the root. If there is 
any  filter criteria against the values from the index rows, you won't 
even have to check rows for tuple visibility, that don't pass that filter.

As soon as you rise to a point where (index) rows will either be 
multiplied (a join) or combined (a group-by/distinct), you can validate 
them against the heap file in relatively large batches, with a hash 
caching of which ones have been checked for visibility.

Just a thought.


Re: Indexed views?

From
Greg Stark
Date:
Mischa Sandberg <ischamay.andbergsay@activestateway.com> writes:

> I take it that it is a very reasonable assumption that only a small proportion
> of index records are actually invalid (else Yurk why use the index?). 

That's faulty logic, the percentage of tuples that are valid is entirely
independent from the percentage of tuples that match your range criterion. Ie,
I could be selecting 100 tuples out of a million -- even if 99 are invalid
it's still worthwhile to use the index.

> Since you're using an index at all, the planner must be expecting a restricted
> set of rows to make it up through to the root. If there is any  filter criteria
> against the values from the index rows, you won't even have to check rows for
> tuple visibility, that don't pass that filter.

It's an interesting idea though. But I can't think of many queries where it
would be interesting. The query would still have to visit every page
containing a record used in the final result. So the only time this would be a
significant win is if you're applying very selective restrictions to columns
that were in the index but weren't able to put in the index condition. 

This seems like a pretty rare situation; usually the reason you put columns in
an index definition is because it is going to be useful for index conditions--
especially if it's a particularly selective column.

-- 
greg



Re: Indexed views?

From
Tiago Wright
Date:
IMHO, it is worth duplicating the mvcc data to all index entries. To
summarize what I understand from this discussion, with the current
method:

a1 - Index seeks must return invisible tuples because mvcc data is not
found in the index. These tuples are eliminated once the data is read
from the actual data pages.

a2 - Covered queries are not possible since the data page must be
visited to determine visibility

If mvcc data is replicated to the index entries:

b1 - Index seeks will never return invisible tuples, possibly
eliminating some page reads

b2 - Covered queries are possible

b3 - Inserts are not affected performancewise. Deletes must now visit
every index entry, which is a larger cost. Updates must visit every
index entry too. It may be possible to reduce the cost of update if
the indexed data is not affected, since the new index entry will
likely end up in the same page as the index entry that must be
deleted, so no extra page reads would be necessary in this scenario.

Since the great majority of performance issues are related to select
queries, the benefit of eliminating invisible tuple page fetches and
supporting covered queries probably outweight the extra cost of
updating index entries. And once covered queries are supported, it
would be possible to build indexed views or multi-table indexes that
can address some of the most performance demanding queries out there.

I am wondering whether it would be possible to measure the costs of a1
and a2 above and compare with the probable costs for b3. It seems to
me that applications for which b3 are most expensive are also those
for which a1 would be most expensive, and since selects are much more
common than updates, could one offset the other in the long run? Can
anyone shed some light on these?

-Tiago


On 11 Sep 2004 01:58:01 -0400, Greg Stark <gsstark@mit.edu> wrote:
> 
> Mischa Sandberg <ischamay.andbergsay@activestateway.com> writes:
> 
> > I take it that it is a very reasonable assumption that only a small proportion
> > of index records are actually invalid (else Yurk why use the index?).
> 
> That's faulty logic, the percentage of tuples that are valid is entirely
> independent from the percentage of tuples that match your range criterion. Ie,
> I could be selecting 100 tuples out of a million -- even if 99 are invalid
> it's still worthwhile to use the index.
> 
> > Since you're using an index at all, the planner must be expecting a restricted
> > set of rows to make it up through to the root. If there is any  filter criteria
> > against the values from the index rows, you won't even have to check rows for
> > tuple visibility, that don't pass that filter.
> 
> It's an interesting idea though. But I can't think of many queries where it
> would be interesting. The query would still have to visit every page
> containing a record used in the final result. So the only time this would be a
> significant win is if you're applying very selective restrictions to columns
> that were in the index but weren't able to put in the index condition.
> 
> This seems like a pretty rare situation; usually the reason you put columns in
> an index definition is because it is going to be useful for index conditions--
> especially if it's a particularly selective column.
> 
> --
> greg
> 
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>


Re: Indexed views?

From
Heikki Linnakangas
Date:
On Sat, 11 Sep 2004, Tiago Wright wrote:

> IMHO, it is worth duplicating the mvcc data to all index entries. To
> summarize what I understand from this discussion, with the current
> method:
>
> a1 - Index seeks must return invisible tuples because mvcc data is not
> found in the index. These tuples are eliminated once the data is read
> from the actual data pages.
>
> a2 - Covered queries are not possible since the data page must be
> visited to determine visibility

a4 - Indexes must be fully vacuumed before vacuuming the corresponding 
heap entries

> If mvcc data is replicated to the index entries:
>
> b1 - Index seeks will never return invisible tuples, possibly
> eliminating some page reads
>
> b2 - Covered queries are possible
>
> b3 - Inserts are not affected performancewise. Deletes must now visit
> every index entry, which is a larger cost. Updates must visit every
> index entry too. It may be possible to reduce the cost of update if
> the indexed data is not affected, since the new index entry will
> likely end up in the same page as the index entry that must be
> deleted, so no extra page reads would be necessary in this scenario.

b4 - Heap and index pages can be vacuumed independently.

> Since the great majority of performance issues are related to select
> queries, the benefit of eliminating invisible tuple page fetches and
> supporting covered queries probably outweight the extra cost of
> updating index entries. And once covered queries are supported, it
> would be possible to build indexed views or multi-table indexes that
> can address some of the most performance demanding queries out there.
>
> I am wondering whether it would be possible to measure the costs of a1
> and a2 above and compare with the probable costs for b3. It seems to
> me that applications for which b3 are most expensive are also those
> for which a1 would be most expensive, and since selects are much more
> common than updates, could one offset the other in the long run? Can
> anyone shed some light on these?

If it seems that there are some cases where it's better to have the 
visibility information in the index and some cases where not, I think we 
could support both kinds of indexes and let the DBA choose.

- Heikki



Re: Indexed views?

From
Mischa Sandberg
Date:
Greg Stark wrote:

> Mischa Sandberg <ischamay.andbergsay@activestateway.com> writes:
>>I take it that it is a very reasonable assumption that only a small proportion
>>of index records are actually invalid (else Yurk why use the index?). 

> That's faulty logic, the percentage of tuples that are valid is entirely
> independent from the percentage of tuples that match your range criterion. Ie,
> I could be selecting 100 tuples out of a million -- even if 99 are invalid
> it's still worthwhile to use the index.

Ummm ... perhaps I glossed over a bit of inference. If only a small 
proportion of the index contains invalid row references, then (in the 
absence of specific biases otherwise) arbitrary queries using that index 
will average the same proportion of invalid row references. And agreed, 
it would still be worthwhile to use the index in that case. Your analyze 
stats would be a bit queered, though.

>>Since you're using an index at all, the planner must be expecting a restricted
>>set of rows to make it up through to the root. If there is any  filter criteria
>>against the values from the index rows, you won't even have to check rows for
>>tuple visibility, that don't pass that filter.
> 
> It's an interesting idea though. But I can't think of many queries where it
> would be interesting. The query would still have to visit every page
> containing a record used in the final result. So the only time this would be a
> significant win is if you're applying very selective restrictions to columns
> that were in the index but weren't able to put in the index condition. 
> 
> This seems like a pretty rare situation; usually the reason you put columns in
> an index definition is because it is going to be useful for index conditions--
> especially if it's a particularly selective column.

Ummm ... two situations where filters on index columns do not fit the 
standard index probe are:

- filtering by restrictive join. Whether the index is the source or 
target of restriction, you get a better choice of join operators/orders.
For example, if the index is the restrictor, you may be able to build a 
hash table of (filtered) index rows, where building a hash table from a 
heap scan would be a bad choice.

- filtering of non-root index fields, or filtering with inequalities. 
Normally, the planner will not bother with the index for these, and may 
do a serial scan of the table. This can be done with a serial scan of 
the index, with possible optimizations like Oracle's "skip scan".

Furthermore, what 'covering' indexes buy you is, they have all the data 
you need for the query results, whether you apply predicates to them all 
or not.

At another level, people are talking about decomposition storage models 
for data in disk pages, versus n-ary storage models. That's more or less 
a fancy way of saying, organize data by columns instead of groups. This 
storage model pays you back in CPU cycles on most computers with L1/L2 
cache splits. At such point as PG might consider moving to that, then 
the row validity columns would be grouped together in a page, and the 
verification of index rows would be significantly faster: only a small 
portion of a large page need be read and pushed through the CPU.

[For more on DSM vs NSM, google: NSM n-ary DSM ]