Thread: [PATCH] Partial indexes almost working (I think)
Hmm, not much of a response to the last email. No matter, I've made some more progress.

* The planner now correctly determines that the index is usable without segfaulting. Really wasn't that hard. The code that deals with checking the predicate when inserting still worked fine.

So, as far as I can tell, partial indexes would be completely usable, *if* I could get the planner to use them. I'm pretty sure it goes wrong in create_index_paths. The pred_test works fine, but somewhere in the lines below it, it doesn't realise it can use the index. Since I can't quite work out what those functions are supposed to do, I don't see how to fix it either. Could somebody look into it?

I have a sneaking suspicion the cost estimator will need to be fixed too, but I haven't got to that yet.

The patch is against 7.1release.

-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.
Martijn van Oosterhout <kleptog@svana.org> writes:
> So, as far as I can tell, partial indexes would be completely usable, *if*
> I could get the planner to use them. I'm pretty sure it goes wrong in
> create_index_paths. The pred_test works fine but somewhere in the lines
> below it doesn't realise it can use the index.

Offhand I don't see why the existence of a predicate would matter. If you set enable_seqscan to FALSE, does it start using the index?

			regards, tom lane

PS: please don't use // comments in Postgres code. They're unportable.
On Tue, Jul 03, 2001 at 01:57:57PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > So, as far as I can tell, partial indexes would be completely usable, *if*
> > I could get the planner to use them. I'm pretty sure it goes wrong in
> > create_index_paths. The pred_test works fine but somewhere in the lines
> > below it doesn't realise it can use the index.
>
> Offhand I don't see why the existence of a predicate would matter. If
> you set enable_seqscan to FALSE, does it start using the index?

Hmm, maybe it's because it doesn't realise it would be an advantage to use it. This is the test I used:

    create table test (
        seq serial,
        clid int4,
        billid text,
        amount int4
    );

    -- Fill table with 300,000 rows
    -- The billid column has about 10,000 nulls
    -- Other values range from 0 to 9

    create index test_ind on test(clid) where billid < '4';
    -- The predicate matches about 4% of all the rows

    explain select sum(clid) from test where billid < '3';

What I think is happening is that while the WHERE clause matches the predicate, since my actual query doesn't reference clid anywhere other than the output, it doesn't consider the index useful at all.

It just occurred to me that there is an assumption there that there is no point just trying a straight index scan with no constraints, since a sequential scan will always be faster in that case. But with partial indexes that assumption is no longer valid. I need to do some more tests when I get home this afternoon.

Thanks for your help.

> PS: please don't use // comments in Postgres code. They're unportable.

OK

-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
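[Editor's note: a rough sketch of why the "seqscan is always faster than an unconstrained index scan" assumption fails for partial indexes. The tuple counts are taken from the test case above; the simple read-count model is an illustration, not PostgreSQL's actual cost estimator.]

```python
# Illustrative sketch only: made-up tuple counts, not PostgreSQL's real
# cost model. An unconstrained scan of a *full* index is never cheaper
# than a seqscan, since it visits every heap tuple anyway. But a partial
# index only stores the rows matching its predicate, so an unconstrained
# scan of it can touch far fewer tuples.

TABLE_ROWS = 300_000         # rows in the test table
PREDICATE_FRACTION = 0.04    # "billid < '4'" matches ~4% of the rows

seqscan_tuples = TABLE_ROWS                                  # reads everything
partial_scan_tuples = int(TABLE_ROWS * PREDICATE_FRACTION)   # ~12,000

print(f"seqscan reads:            {seqscan_tuples} tuples")
print(f"partial index scan reads: ~{partial_scan_tuples} tuples")
```

Under these numbers the unconstrained partial-index scan touches roughly 4% of the tuples a seqscan does, which is exactly the case the planner's assumption rules out.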
Martijn van Oosterhout <kleptog@svana.org> writes:
> It just occurred to me that there is an assumption there that there is no
> point just trying a straight index scan with no constraints since a
> sequential scan will always be faster in that case. But with partial
> indexes that assumption is no longer valid.

You're right, there is nothing that considers the possibility that the partial index condition itself might make the use of the index desirable. AFAIR, the index condition is only checked to see if the index can legally be used. Beyond that, the index subject variables (not the condition) have to match some aspect of the WHERE clause or query ordering before the planner will expend any cycles to consider the index further.

			regards, tom lane
On Tue, Jul 03, 2001 at 11:36:39PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > It just occurred to me that there is an assumption there that there is no
> > point just trying a straight index scan with no constraints since a
> > sequential scan will always be faster in that case. But with partial
> > indexes that assumption is no longer valid.
>
> You're right, there is nothing that considers the possibility that the
> partial index condition itself might make the use of the index
> desirable. AFAIR, the index condition is only checked to see if the
> index can legally be used. Beyond that, the index subject variables
> (not the condition) have to match some aspect of the WHERE clause or
> query ordering before the planner will expend any cycles to consider
> the index further.

OK, so how to fix this. It seems to me that all that is required is that for partial indexes it should try to add a path for a plain index scan over the index and let the cost estimator decide if it's cheaper than a sequential scan. The cost would be negligible for the common case (no partial index).

The only thing is, will the cost estimator get it right? At least the partial index has an accurate count of the number of matching rows.

We currently only have statistics by table, not statistics by index. This could be problematic since the statistics for the whole table may not apply to the section the predicate selects. (In fact, you can pretty much guarantee it for the cases where the partial index is truly useful.)

Hmm, could take a while to get optimal use of these indexes.

-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@svana.org> writes:
> OK, so how to fix this. It seems to me that all that is required is that for
> partial indexes it should try to add a path for a plain index scan over the
> index and let the cost estimator decide if it's cheaper than a sequential
> scan.

Probably a reasonable first cut: if the index predicate fits, then always generate a path. This assumes that partial indexes are usually made to cover relatively small fractions of the table, but I'm willing to believe that for starters.

> The only thing is, will the cost estimator get it right? At least the partial
> index has an accurate count of the number of matching rows.

Why would you think that?

I can pretty much guarantee that the cost estimator will *not* get it right --- it has no idea about partial indexes. You'll need to do some fooling with the index selectivity estimation routines to account for the effects of the predicate.

			regards, tom lane
On Wed, Jul 04, 2001 at 11:37:36AM -0400, Tom Lane wrote:
> Probably a reasonable first cut: if the index predicate fits, then always
> generate a path. This assumes that partial indexes are usually made to
> cover relatively small fractions of the table, but I'm willing to
> believe that for starters.

Done that, see my other email.

> > The only thing is, will the cost estimator get it right? At least the partial
> > index has an accurate count of the number of matching rows.
>
> Why would you think that?

Well, index->tuples has the number of tuples in the index. By matching I meant matching the predicate on the index.

> I can pretty much guarantee that the cost estimator will *not* get it
> right --- it has no idea about partial indexes. You'll need to do some
> fooling with the index selectivity estimation routines to account for
> the effects of the predicate.

See my other email. All I've done is let the normal estimator estimate the number of rows based on the whole table and then capped it if it comes out more than the number of rows in the index.

I think the only real way to do it would be to keep separate statistics for the part matching the predicate, since it can be considered to be a separate table, but that requires a few bigger changes. It gets it pretty right at the moment.

-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
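[Editor's note: a minimal sketch of the stopgap Martijn describes here. The function name and numbers are hypothetical; the real change lives in PostgreSQL's planner, not in standalone code like this.]

```python
# Sketch of the "cap at index size" stopgap: estimate matching rows
# from whole-table statistics as usual, then clamp the estimate to the
# number of tuples actually stored in the partial index (index->tuples).
# Hypothetical helper, for illustration only.

def capped_row_estimate(whole_table_estimate: float,
                        index_tuples: float) -> float:
    """Cap a whole-table row estimate at the partial index's size."""
    return min(whole_table_estimate, index_tuples)

# The planner thinks the quals match 90,000 rows of the table, but the
# partial index only holds 12,000 tuples, so the estimate is capped:
print(capped_row_estimate(90_000, 12_000))

# An estimate already below the index size passes through unchanged:
print(capped_row_estimate(5_000, 12_000))
```

As Tom points out in the next message, this fixes only the row-count estimate, not the access-cost estimate for the index path itself.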
Martijn van Oosterhout <kleptog@svana.org> writes:
>> I can pretty much guarantee that the cost estimator will *not* get it
>> right --- it has no idea about partial indexes.

> See my other email. All I've done is let the normal estimator estimate the
> number of rows based on the whole table and then capped it if it comes out
> more than the number of rows in the index.

To put it politely, that's a crock; besides which it has no effect on the estimation of the access costs for the index path. The correct fix has to be in genericcostestimate() in selfuncs.c.

One possibility is for genericcostestimate() to multiply the initial indexSelectivity estimate (which considers only the WHERE clauses about the indexed variable) by index->tuples/rel->tuples (the selectivity of the index predicate). This would be fully correct only if the index predicate is independent of the indexed variable, which could be completely wrong. That could lead to underestimating the access costs for the index ... on the other hand, if a partial index is applicable, we probably want the system to use it, so underestimating is better than overestimating.

Another approach would be to compute indexSelectivity based on the AND of the given indexQuals and the index predicate. That would rely on clause_selectivity to recognize overlapping/redundant conditions, which it is not very good at, but it'll get some simple cases right (and the infrastructure is there to get more cases right). Offhand this seems like the best way to go in the long term.

			regards, tom lane
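[Editor's note: a numeric sketch of Tom's first suggestion. This is not the selfuncs.c code; the function and the example numbers are illustrative, and the independence caveat in the message above applies to the multiplication.]

```python
# Sketch of scaling indexSelectivity by the predicate's own selectivity,
# index->tuples / rel->tuples. Exact only if the index predicate is
# statistically independent of the indexed variable.

def adjusted_index_selectivity(qual_selectivity: float,
                               index_tuples: float,
                               rel_tuples: float) -> float:
    """Scale a quals-only selectivity by the partial-index predicate's
    selectivity, assuming the two conditions are independent."""
    predicate_selectivity = index_tuples / rel_tuples
    return qual_selectivity * predicate_selectivity

# The WHERE clauses on the indexed column alone select 30% of the table;
# the predicate keeps 12,000 of 300,000 tuples (4%), giving 1.2%:
print(adjusted_index_selectivity(0.30, 12_000, 300_000))
```

When the predicate and the indexed column are correlated, this multiplication can be far off in either direction, which is why the second approach (ANDing the indexQuals with the predicate via clause_selectivity) is framed as the better long-term fix.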
On Thu, Jul 05, 2001 at 10:44:18AM -0400, Tom Lane wrote:
> One possibility is for genericcostestimate() to multiply the initial
> indexSelectivity estimate (which considers only the WHERE clauses about
> the indexed variable) by index->tuples/rel->tuples (the selectivity of
> the index predicate). This would be fully correct only if the index
> predicate is independent of the indexed variable, which could be
> completely wrong. That could lead to underestimating the access costs
> for the index ... on the other hand, if a partial index is applicable,
> we probably want the system to use it, so underestimating is better than
> overestimating.

Ah, I see what you mean. I thought that the indexSelectivity referred to the number of index tuples that matched as a proportion of all index tuples, not of the tuples in the main table. The rest of the code there definitely assumes index->tuples == rel->tuples. I'll look into it.

> Another approach would be to compute indexSelectivity based on the AND
> of the given indexQuals and the index predicate. That would rely on
> clause_selectivity to recognize overlapping/redundant conditions, which
> it is not very good at, but it'll get some simple cases right (and the
> infrastructure is there to get more cases right). Offhand this seems
> like the best way to go in the long term.

Yes, my test had the two columns independent, thus the WHERE clause produced correct results.

-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/