Thread: [PATCH] Partial indicies almost working (I think)

[PATCH] Partial indicies almost working (I think)

From
Martijn van Oosterhout
Date:
Hmm, not much a response to the last email. No matter, I've made some more
progress.

* The planner now correctly determines that the index is usable without
  segfaulting.

Really wasn't that hard. The code that deals with checking the predicate
when inserting still worked fine.

So, as far as I can tell, partial indecies would be completely usable, *if*
I could get the planner to use them. I'm pretty sure it goes wrong in
create_index_paths. The pred_test works fine but somewhere in the lines
below it doesn't realise it can use the index. Since I can't quite work out
what those functions are supposed to do I'm don't see how to fix it either.

Could somebody look into it?

I have a sneaking suspicion the cost estimator will need to be fixed too,
but I havn't got to that yet.

The patch is against 7.1release.
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.

Attachment

Re: [PATCH] Partial indicies almost working (I think)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> So, as far as I can tell, partial indecies would be completely usable, *if*
> I could get the planner to use them. I'm pretty sure it goes wrong in
> create_index_paths. The pred_test works fine but somewhere in the lines
> below it doesn't realise it can use the index.

Offhand I don't see why the existence of a predicate would matter.  If
you set enable_seqscan to FALSE, does it start using the index?

            regards, tom lane

PS: please don't use // comments in Postgres code.  They're unportable.

Re: [PATCH] Partial indicies almost working (I think)

From
Martijn van Oosterhout
Date:
On Tue, Jul 03, 2001 at 01:57:57PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > So, as far as I can tell, partial indecies would be completely usable, *if*
> > I could get the planner to use them. I'm pretty sure it goes wrong in
> > create_index_paths. The pred_test works fine but somewhere in the lines
> > below it doesn't realise it can use the index.
>
> Offhand I don't see why the existence of a predicate would matter.  If
> you set enable_seqscan to FALSE, does it start using the index?

Hmm, maybe it's because it doesn't realise it would be an advantage to use
it. This is the test I used:

create table test (
  seq serial,
  clid int4,
  billid text,
  amount int4
);

-- Fill table with 300,000 rows
-- The billid column has about 10,000 nulls
-- Other values range from 0 to 9
create index test_ind on test(clid) where billid < '4';
-- The predicate matches about 4% of all the rows

explain select sum(clid) from test where billid < '3';

What I think is happening is that while the WHERE clause matches the
predicate, since my actual query doesn't reference clid anywhere other than
the output, it doesn't consider the index useful at all.

It just occured to me that there is an assumption there that there is no
point just trying a straight index scan with no constraints since a
sequential scan will always be faster in that case. But with partial
indicies that assumption is no longer valid.

I need to do some more tests when I get home this afternoon. Thanks for your
help.

> PS: please don't use // comments in Postgres code.  They're unportable.

OK
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.

Re: [PATCH] Partial indicies almost working (I think)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> It just occured to me that there is an assumption there that there is no
> point just trying a straight index scan with no constraints since a
> sequential scan will always be faster in that case. But with partial
> indicies that assumption is no longer valid.

You're right, there is nothing that considers the possibility that the
partial index condition itself might make the use of the index
desirable.  AFAIR, the index condition is only checked to see if the
index can legally be used.  Beyond that, the index subject variables
(not the condition) have to match some aspect of the WHERE clause or
query ordering before the planner will expend any cycles to consider
the index further.

            regards, tom lane

Re: [PATCH] Partial indicies almost working (I think)

From
Martijn van Oosterhout
Date:
On Tue, Jul 03, 2001 at 11:36:39PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > It just occured to me that there is an assumption there that there is no
> > point just trying a straight index scan with no constraints since a
> > sequential scan will always be faster in that case. But with partial
> > indicies that assumption is no longer valid.
>
> You're right, there is nothing that considers the possibility that the
> partial index condition itself might make the use of the index
> desirable.  AFAIR, the index condition is only checked to see if the
> index can legally be used.  Beyond that, the index subject variables
> (not the condition) have to match some aspect of the WHERE clause or
> query ordering before the planner will expend any cycles to consider
> the index further.

OK, so how to fix this. It seems to me that all that is required is that for
partial indicies it should try to add a path for a plain index scan over the
index and let the cost estimator decide if it's cheaper than a sequential
scan.

The cost would be negligable for the common case (no partial index). The
only thing is, will the cost estimator get it right. At least the partial
index has an accurate count of the number of matching rows.

We currently only have statistics by table, not statistics by index. This
could be problematic since the statistics for the whole table may not apply
to the section the predicate selects. (In fact, you can pretty much
guarentee it for the cases where the partial index is truly useful.)

Hmm, could take a while to get optimal use of these indicies.

--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.

Re: [PATCH] Partial indicies almost working (I think)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> OK, so how to fix this. It seems to me that all that is required is that for
> partial indicies it should try to add a path for a plain index scan over the
> index and let the cost estimator decide if it's cheaper than a sequential
> scan.

Probably a reasonable first cut: if the index predicate fits, then always
generate a path.  This assumes that partial indexes are usually made to
cover relatively small fractions of the table, but I'm willing to
believe that for starters.

> only thing is, will the cost estimator get it right. At least the partial
> index has an accurate count of the number of matching rows.

Why would you think that?

I can pretty much guarantee that the cost estimator will *not* get it
right --- it has no idea about partial indexes.  You'll need to do some
fooling with the index selectivity estimation routines to account for
the effects of the predicate.

            regards, tom lane

Re: [PATCH] Partial indicies almost working (I think)

From
Martijn van Oosterhout
Date:
On Wed, Jul 04, 2001 at 11:37:36AM -0400, Tom Lane wrote:
> Probably a reasonable first cut: if the index predicate fits, then always
> generate a path.  This assumes that partial indexes are usually made to
> cover relatively small fractions of the table, but I'm willing to
> believe that for starters.

Done that, see my other email.

> > only thing is, will the cost estimator get it right. At least the partial
> > index has an accurate count of the number of matching rows.
>
> Why would you think that?

Well, index->tuples has the number of tuples in the index. By matching I
meant that matching the predicate on the index.

> I can pretty much guarantee that the cost estimator will *not* get it
> right --- it has no idea about partial indexes.  You'll need to do some
> fooling with the index selectivity estimation routines to account for
> the effects of the predicate.

See my other email. All I've done is let the normal estimator estimate the
number of rows based on the whole table and then capped it if it comes out
more than the number of rows in the index.

I think the only real way to do it would be to keep seperate statistics for
the part matching the predicate, since it can be considered to be a separate
table, but that requires a few bigger changes. It gets it pretty right at
the moment.

--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.

Re: [PATCH] Partial indicies almost working (I think)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
>> I can pretty much guarantee that the cost estimator will *not* get it
>> right --- it has no idea about partial indexes.

> See my other email. All I've done is let the normal estimator estimate the
> number of rows based on the whole table and then capped it if it comes out
> more than the number of rows in the index.

To put it politely, that's a crock; besides which it has no effect on
the estimation of the access costs for the index path.  The correct fix
has to be in genericcostestimate() in selfuncs.c.

One possibility is for genericcostestimate() to multiply the initial
indexSelectivity estimate (which considers only the WHERE clauses about
the indexed variable) by index->tuples/rel->tuples (the selectivity of
the index predicate).  This would be fully correct only if the index
predicate is independent of the indexed variable, which could be
completely wrong.  That could lead to underestimating the access costs
for the index ... on the other hand, if a partial index is applicable,
we probably want the system to use it, so underestimating is better than
overestimating.

Another approach would be to compute indexSelectivity based on the AND
of the given indexQuals and the index predicate.  That would rely on
clause_selectivity to recognize overlapping/redundant conditions, which
it is not very good at, but it'll get some simple cases right (and the
infrastructure is there to get more cases right).  Offhand this seems
like the best way to go in the long term.

            regards, tom lane

Re: [PATCH] Partial indicies almost working (I think)

From
Martijn van Oosterhout
Date:
On Thu, Jul 05, 2001 at 10:44:18AM -0400, Tom Lane wrote:
> One possibility is for genericcostestimate() to multiply the initial
> indexSelectivity estimate (which considers only the WHERE clauses about
> the indexed variable) by index->tuples/rel->tuples (the selectivity of
> the index predicate).  This would be fully correct only if the index
> predicate is independent of the indexed variable, which could be
> completely wrong.  That could lead to underestimating the access costs
> for the index ... on the other hand, if a partial index is applicable,
> we probably want the system to use it, so underestimating is better than
> overestimating.

Ah, I see what you mean. I thought that the indexSelectivity referred to the
number of index tuples that matched as a proportion of all index tuples, not
the tuples in the main table. The rest of the code there definitly assumes
index->tuples == rel->tuples.

I'll look into it.

> Another approach would be to compute indexSelectivity based on the AND
> of the given indexQuals and the index predicate.  That would rely on
> clause_selectivity to recognize overlapping/redundant conditions, which
> it is not very good at, but it'll get some simple cases right (and the
> infrastructure is there to get more cases right).  Offhand this seems
> like the best way to go in the long term.

Yes, my test had the two columns independant, thus the where clause produced
correct results.

--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.