Thread: will the planner ever use an index when the condition is <> ?
I have a query that used <> against an indexed column. In this case I can use the reverse and use in or = and get the performance I need... but "in general"... will the planner ever use an index when the related column is compared using <>? I feel like the answer is no, but wanted to ask. Roxanne Postgres Version 8.4.9 PostGIS version 1.5.2 Context for question: I have the following query: select * from op_region opr, yield_segment_info ysi, data_location dl where opr.op_region_id in (select distinct op_region_id from yield_point where yield > 0 and area > 0 and ST_GeometryType(location) <> 'ST_Point' ) and ysi.op_region_id = opr.op_region_id and dl.data_set_id = opr.data_set_id Yield_Point has 161,575,599 records where yield >0 and area > 0 has 161,263,193 records, where ST_GeometryType(location)<> 'ST_Point' has just 231 records yield_segment_info has 165,929 records op_region has 566,212 records data_location has 394,763 All of these have a high volume of insert/delete's. The tables have recently been vacuum full'd and the indexes reindexed. [they are under the management of the autovacuum, but we forced a cleanup on the chance that things had degraded...] If I run an explain analyze: "Nested Loop (cost=5068203.00..5068230.31 rows=3 width=225308) (actual time=192571.730..193625.728 rows=236 loops=1)" "->Nested Loop (cost=5068203.00..5068219.66 rows=1 width=57329) (actual time=192522.573..192786.698 rows=230 loops=1)" " ->Nested Loop (cost=5068203.00..5068211.36 rows=1 width=57268) (actual time=192509.822..192638.446 rows=230 loops=1)" " ->HashAggregate (cost=5068203.00..5068203.01 rows=1 width=4) (actual time=192471.507..192471.682 rows=230 loops=1)" " ->Seq Scan on yield_point (cost=0.00..5068203.00 rows=1 width=4) (actual time=602.174..192471.177 rows=230 loops=1)" " Filter: ((yield > 0::double precision) AND (area > 0::double precision) AND (st_geometrytype(location) <> 'ST_Point'::text))" " ->Index Scan using op_region_pkey on op_region opr (cost=0.00..8.33 rows=1 width=57264) (actual time=0.723..0.723 rows=1 loops=230)" " Index Cond: (opr.op_region_id = yield_point.op_region_id)" " ->Index Scan using yield_segment_info_key on yield_segment_info ysi (cost=0.00..8.29 rows=1 width=65) (actual time=0.643..0.643 rows=1 loops=230)" " Index Cond: (ysi.op_region_id = opr.op_region_id)" "->Index Scan using data_location_data_set_idx on data_location dl (cost=0.00..10.61 rows=3 width=167979) (actual time=3.611..3.646 rows=1 loops=230)" "Index Cond: (dl.data_set_id = opr.data_set_id)" "Total runtime: 193625.955 ms" yield_point has the following indexes: btree on ST_GeometryType(location) gist on location btree on op_region_id I've also tried an index on ((yield > 0::double precision) AND (area > 0::double precision) AND (st_geometrytype(location) <> 'ST_Point'::text)) ... it still goes for the sequential scan. But if I change it to st_geometrytype(location) = 'ST_Polygon' or even in ('ST_Polygon','ST_MultiPolygon') the planner uses the index. Roxanne
Normally there is no chance it could work, because (a) the planner does not know all possible values of a column, and (b) btree indexes cannot search on "not equal" operator. BTW I've just made a case where - logically - it could work, but it still does not: create table nums ( num int4 not null, check(num=1 or num=2) ); insert into nums select case when random()<=0.99 then 1 else 2 end from generate_series(1,1000000); create index nums_idx on nums(num); analyze nums; set constraint_exclusion to 'on'; explain select * from nums where num<>1; --planner could estimate selectivity as 1%, and use index with "=2" filter basing on check constraint? 2011/12/17 Roxanne Reid-Bennett <rox@tara-lu.com>: > I have a query that used <> against an indexed column. In this > case I can use the reverse and use in or = and get the performance > I need... but "in general"... will the planner ever use an index when the > related column is compared using <>? > > I feel like the answer is no, but wanted to ask. > > Roxanne > Postgres Version 8.4.9 PostGIS version 1.5.2 > > > > Context for question: > > I have the following query: > > select * > from op_region opr, yield_segment_info ysi, data_location dl > where opr.op_region_id in > (select distinct op_region_id > from yield_point > where yield > 0 > and area > 0 > and ST_GeometryType(location) <> 'ST_Point' > ) > and ysi.op_region_id = opr.op_region_id > and dl.data_set_id = opr.data_set_id > > Yield_Point has 161,575,599 records > where yield >0 and area > 0 has 161,263,193 records, > where ST_GeometryType(location)<> 'ST_Point' has just 231 records > > yield_segment_info has 165,929 records > op_region has 566,212 records > data_location has 394,763 > > All of these have a high volume of insert/delete's. > The tables have recently been vacuum full'd and the indexes reindexed. > [they are under the management of the autovacuum, but we forced a cleanup on > the chance that things had degraded...] > > If I run an explain analyze: > > "Nested Loop > (cost=5068203.00..5068230.31 rows=3 width=225308) > (actual time=192571.730..193625.728 rows=236 loops=1)" > "->Nested Loop > (cost=5068203.00..5068219.66 rows=1 width=57329) > (actual time=192522.573..192786.698 rows=230 loops=1)" > " ->Nested Loop > (cost=5068203.00..5068211.36 rows=1 width=57268) > (actual time=192509.822..192638.446 rows=230 loops=1)" > " ->HashAggregate > (cost=5068203.00..5068203.01 rows=1 width=4) > (actual time=192471.507..192471.682 rows=230 loops=1)" > " ->Seq Scan on yield_point > (cost=0.00..5068203.00 rows=1 width=4) > (actual time=602.174..192471.177 rows=230 loops=1)" > " Filter: ((yield > 0::double precision) AND > (area > 0::double precision) AND > (st_geometrytype(location) <> 'ST_Point'::text))" > " ->Index Scan using op_region_pkey on op_region opr > (cost=0.00..8.33 rows=1 width=57264) > (actual time=0.723..0.723 rows=1 loops=230)" > " Index Cond: (opr.op_region_id = yield_point.op_region_id)" > " ->Index Scan using yield_segment_info_key on yield_segment_info ysi > (cost=0.00..8.29 rows=1 width=65) > (actual time=0.643..0.643 rows=1 loops=230)" > " Index Cond: (ysi.op_region_id = opr.op_region_id)" > "->Index Scan using data_location_data_set_idx on data_location dl > (cost=0.00..10.61 rows=3 width=167979) > (actual time=3.611..3.646 rows=1 loops=230)" > "Index Cond: (dl.data_set_id = opr.data_set_id)" > "Total runtime: 193625.955 ms" > > yield_point has the following indexes: > btree on ST_GeometryType(location) > gist on location > btree on op_region_id > > I've also tried an index on > ((yield > 0::double precision) AND (area > 0::double precision) AND > (st_geometrytype(location) <> 'ST_Point'::text)) > ... it still goes for the sequential scan. > > But if I change it to st_geometrytype(location) = 'ST_Polygon' or > even in ('ST_Polygon','ST_MultiPolygon') > > the planner uses the index. > > Roxanne > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance
17.12.2011 18:25 пользователь "Filip Rembiałkowski" <plk.zuber@gmail.com> написал:
>
> Normally there is no chance it could work,
> because (a) the planner does not know all possible values of a column,
> and (b) btree indexes cannot search on "not equal" operator.
>
Why so? a<>b is same as (a<b or a>b), so, planner should chech this option.
On 12/17/2011 11:24 AM, Filip Rembiałkowski wrote: > Normally there is no chance it could work, > because (a) the planner does not know all possible values of a column, > and (b) btree indexes cannot search on "not equal" operator. Is there an index type that can check "not equal"? This specific column has a limited number of possible values - it is essentially an enumerated list. Roxanne > > > BTW I've just made a case where - logically - it could work, but it > still does not: > > create table nums ( num int4 not null, check(num=1 or num=2) ); > insert into nums select case when random()<=0.99 then 1 else 2 end > from generate_series(1,1000000); > create index nums_idx on nums(num); > analyze nums; > set constraint_exclusion to 'on'; > explain select * from nums where num<>1; > --planner could estimate selectivity as 1%, and use index with "=2" > filter basing on check constraint? > > > > > 2011/12/17 Roxanne Reid-Bennett<rox@tara-lu.com>: >> I have a query that used<> against an indexed column. In this >> case I can use the reverse and use in or = and get the performance >> I need... but "in general"... will the planner ever use an index when the >> related column is compared using<>? >> >> I feel like the answer is no, but wanted to ask. >> >> Roxanne >> Postgres Version 8.4.9 PostGIS version 1.5.2 >> >> >> >> Context for question: >> >> I have the following query: >> >> select * >> from op_region opr, yield_segment_info ysi, data_location dl >> where opr.op_region_id in >> (select distinct op_region_id >> from yield_point >> where yield> 0 >> and area> 0 >> and ST_GeometryType(location)<> 'ST_Point' >> ) >> and ysi.op_region_id = opr.op_region_id >> and dl.data_set_id = opr.data_set_id >> >> Yield_Point has 161,575,599 records >> where yield>0 and area> 0 has 161,263,193 records, >> where ST_GeometryType(location)<> 'ST_Point' has just 231 records >> >> yield_segment_info has 165,929 records >> op_region has 566,212 records >> data_location has 394,763 >> >> All of these have a high volume of insert/delete's. >> The tables have recently been vacuum full'd and the indexes reindexed. >> [they are under the management of the autovacuum, but we forced a cleanup on >> the chance that things had degraded...] >> >> If I run an explain analyze: >> >> "Nested Loop >> (cost=5068203.00..5068230.31 rows=3 width=225308) >> (actual time=192571.730..193625.728 rows=236 loops=1)" >> "->Nested Loop >> (cost=5068203.00..5068219.66 rows=1 width=57329) >> (actual time=192522.573..192786.698 rows=230 loops=1)" >> " ->Nested Loop >> (cost=5068203.00..5068211.36 rows=1 width=57268) >> (actual time=192509.822..192638.446 rows=230 loops=1)" >> " ->HashAggregate >> (cost=5068203.00..5068203.01 rows=1 width=4) >> (actual time=192471.507..192471.682 rows=230 loops=1)" >> " ->Seq Scan on yield_point >> (cost=0.00..5068203.00 rows=1 width=4) >> (actual time=602.174..192471.177 rows=230 loops=1)" >> " Filter: ((yield> 0::double precision) AND >> (area> 0::double precision) AND >> (st_geometrytype(location)<> 'ST_Point'::text))" >> " ->Index Scan using op_region_pkey on op_region opr >> (cost=0.00..8.33 rows=1 width=57264) >> (actual time=0.723..0.723 rows=1 loops=230)" >> " Index Cond: (opr.op_region_id = yield_point.op_region_id)" >> " ->Index Scan using yield_segment_info_key on yield_segment_info ysi >> (cost=0.00..8.29 rows=1 width=65) >> (actual time=0.643..0.643 rows=1 loops=230)" >> " Index Cond: (ysi.op_region_id = opr.op_region_id)" >> "->Index Scan using data_location_data_set_idx on data_location dl >> (cost=0.00..10.61 rows=3 width=167979) >> (actual time=3.611..3.646 rows=1 loops=230)" >> "Index Cond: (dl.data_set_id = opr.data_set_id)" >> "Total runtime: 193625.955 ms" >> >> yield_point has the following indexes: >> btree on ST_GeometryType(location) >> gist on location >> btree on op_region_id >> >> I've also tried an index on >> ((yield> 0::double precision) AND (area> 0::double precision) AND >> (st_geometrytype(location)<> 'ST_Point'::text)) >> ... it still goes for the sequential scan. >> >> But if I change it to st_geometrytype(location) = 'ST_Polygon' or >> even in ('ST_Polygon','ST_MultiPolygon') >> >> the planner uses the index. >> >> Roxanne >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance
Roxanne Reid-Bennett <rox@tara-lu.com> writes: > On 12/17/2011 11:24 AM, Filip Rembiałkowski wrote: >> Normally there is no chance it could work, >> because (a) the planner does not know all possible values of a column, >> and (b) btree indexes cannot search on "not equal" operator. > Is there an index type that can check "not equal"? There is not. It's not so much that it's logically impossible as that it doesn't seem worth the trouble to implement and maintain, because most of the time a query like "where x <> constant" is going to fetch most of the table, and so it would be better done as a seqscan anyway. If you have a specific case where that's not true, you might consider a partial index (CREATE INDEX ... WHERE x <> constant). But the details of that would depend a lot on the queries you're concerned about. regards, tom lane
On Sun, Dec 18, 2011 at 16:52, Roxanne Reid-Bennett <rox@tara-lu.com> wrote: > Is there an index type that can check "not equal"? > This specific column has a limited number of possible values - it is > essentially an enumerated list. Instead of writing WHERE foo<>3 you could rewrite it as WHERE foo IN (1,2,4,...) or WHERE foo < 3 OR foo > 3. Both of these are indexable queries, but obviously the planner may choose not to use index scan if it's not worth it. Regards, Marti
On 12/18/2011 1:31 PM, Tom Lane wrote: > If you have a specific case where that's not true, you might consider > a partial index (CREATE INDEX ... WHERE x <> constant). But the > details of that would depend a lot on the queries you're concerned > about. regards, tom lane Which I had tried in the form of (st_geometrytype(location) <> 'ST_Point'::text)... planner never picked it (for the scenario given before). But this thread was all pretty much .. design/plan/future use. This specific instance I've handled with in ("ST_Polygon","ST_MultiPolygon"). Thank you for the feedback. Roxanne