Thread: Use of Functional Indexes and Planner Estimates
Dear All,

I have a table with approximately 570k rows.

                       Table "filter.rules"
  Column  |          Type          |               Modifiers
----------+------------------------+----------------------------------------
 rulename | character varying(16)  | not null default ''::character varying
 uri      | character varying(200) | not null default ''::character varying
 redirect | character varying(200) | not null default ''::character varying
 moddate  | date                   | not null default ('now'::text)::date
 active   | boolean                | not null default true
 comment  | character varying(255) | not null default ''::character varying
Indexes:
    "rules_pkey" primary key, btree (rulename, uri)
    "moddate_idx" btree (moddate)
    "rules_idx" btree (lower((uri)::text))

Statistics on the uri column have been set to 1000. VACUUM FULL and ANALYZE were run before the tests, with no alterations to the table since then.

# analyze verbose filter.rules;
INFO:  analyzing "filter.rules"
INFO:  "rules": 5228 pages, 300000 rows sampled, 570533 estimated total rows
ANALYZE

# explain analyze SELECT rulename, redirect FROM filter.rules
    WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
    GROUP BY rulename, redirect;

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22352.79..22352.79 rows=1 width=12) (actual time=2047.331..2047.332 rows=1 loops=1)
   ->  Seq Scan on rules  (cost=0.00..22296.32 rows=11294 width=12) (actual time=540.149..2047.308 rows=1 loops=1)
         Filter: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
 Total runtime: 2047.420 ms
(4 rows)

# SET enable_seqscan = off;

# explain analyze SELECT rulename, redirect FROM filter.rules
    WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
    GROUP BY rulename, redirect;

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=38970.68..38970.68 rows=1 width=12) (actual time=0.328..0.328 rows=1 loops=1)
   ->  Index Scan using rules_idx, rules_idx, rules_idx, rules_idx on rules  (cost=0.00..38914.21 rows=11294 width=12) (actual time=0.210..0.312 rows=1 loops=1)
         Index Cond: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
 Total runtime: 0.700 ms
(4 rows)

Could anybody offer an explanation of why the planner does such a terrible job of estimating the number of rows for this query, with the stats set so high? Tests were also done with stats set to 100 and to 1; the results were exactly the same, which is what I would have assumed. I am also interested in how functional indexes have statistics collected for them, if they do, so as to possibly minimize or avoid this problem in the future.

Thanks for your consideration of this matter.

Regards,
Russell Smith
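For reference, the per-column statistics target described above would have been set with something along these lines (a sketch; the column and target value come from the post, the syntax is standard ALTER TABLE):

    -- Raise the statistics target for uri, then re-analyze so the
    -- larger sample takes effect.
    ALTER TABLE filter.rules ALTER COLUMN uri SET STATISTICS 1000;
    ANALYZE filter.rules;

Note that this gathers statistics for the plain uri column only, not for the lower(uri) expression that rules_idx is built on, which turns out to be the crux of the problem (see the replies below).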
On Tue, 8 Jun 2004 17:24:36 +1000, Russell Smith <mr-russ@pws.com.au> wrote:
> Also I am interested in how functional indexes have statistics
> collected for them, if they do.

Not in any released version.

http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/backend/commands/analyze.c

| Revision 1.70 / Sun Feb 15 21:01:39 2004 UTC (3 months, 3 weeks ago) by tgl
| Changes since 1.69: +323 -16 lines
|
| First steps towards statistics on expressional (nee functional) indexes.
| This commit teaches ANALYZE to store such stats in pg_statistic, but
| nothing is done yet about teaching the planner to use 'em.

So statistics gathering for expressional indexes will be in 7.5, but I don't know about the state of the planner ...

Servus
 Manfred
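Once running a build with that commit, one way to confirm that ANALYZE has gathered statistics for the index expression would be to query the pg_stats view (a sketch; it assumes the expression statistics are exposed under the index's own relation name, since pg_statistic entries are keyed by the relation they were gathered for):

    -- If ANALYZE stored stats for the expression, an entry keyed by the
    -- index should appear here alongside the plain-column entries.
    SELECT tablename, attname, n_distinct, most_common_vals
      FROM pg_stats
     WHERE tablename = 'rules_idx';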
Manfred Koizar <mkoi-pg@aon.at> writes:
> So statistics gathering for expressional indexes will be in 7.5, but I
> don't know about the state of the planner ...

Planner support is there too:

2004-02-16 19:52  tgl

        * src/: backend/optimizer/path/costsize.c,
        backend/optimizer/util/relnode.c, backend/utils/adt/selfuncs.c,
        include/optimizer/pathnode.h, include/utils/selfuncs.h: Make use
        of statistics on index expressions.  There are still some corner
        cases that could stand improvement, but it does all the basic
        stuff.  A byproduct is that the selectivity routines are no
        longer constrained to working on simple Vars; we might in future
        be able to improve the behavior for subexpressions that don't
        match indexes.

I don't recall anymore what "corner cases" I had in mind for future improvement.

                        regards, tom lane
On Tue, 2004-06-08 at 01:24, Russell Smith wrote:
> [...]
>    ->  Seq Scan on rules  (cost=0.00..22296.32 rows=11294 width=12)
>        (actual time=540.149..2047.308 rows=1 loops=1)
> [...]
> Could anybody offer an explanation of why the planner does such a
> terrible job of estimating the number of rows for this query, with
> the stats set so high?

Simple: the planner is choosing a sequential scan when it should be choosing an index scan. This is usually because random_page_cost is set too high, at the default of 4. Try settings between 1.2 and 2.x or so to see how that helps. Be sure to test with various queries of your own to make sure you've got about the right setting.
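A minimal sketch of that experiment, reusing the query from the original post (the value 1.5 is just an illustrative starting point within the suggested range, not a recommendation for this particular machine):

    -- Lower the planner's estimate of random I/O cost for this session,
    -- then check whether the index scan is chosen without disabling seqscans.
    SET random_page_cost = 1.5;
    EXPLAIN ANALYZE
    SELECT rulename, redirect
      FROM filter.rules
     WHERE lower(uri) IN (lower('land.com'), lower('com'),
                          lower('land.com/'), lower('com/'))
     GROUP BY rulename, redirect;
    -- Once a good value is found, it can be made permanent in postgresql.conf.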
"Scott Marlowe" <smarlowe@qwest.net> writes: > > -> Seq Scan on rules > > (cost=0.00..22296.32 rows=11294 width=12) > > (actual time=540.149..2047.308 rows=1 loops=1) > Simple, the planner is choosing a sequential scan when it should be > choosing an index scan. This is usually because random_page_cost is set > too high, at the default of 4. Try settings between 1.2 and 2.x or so > to see how that helps. Be sure and test with various queries of your > own to be sure you've got about the right setting. Unless you make random_page_cost about .0004 (4/11294) it isn't going to be costing this query right (That's a joke, don't do it:). It's thinking there are 11,000 records matching the where clause when in fact there is only 1. If you know how an upper bound on how many records the query should be finding you might try a kludge involving putting a LIMIT inside the group by. ie, something like select rulename,redirect from (select rulename,redirect from ... where ... limit 100) as kludge group by rulename,redirect This would at least tell the planner not to expect more than 100 rows and to take the plan likely to produce the first 100 rows fastest. But this has the disadvantage of uglifying your code and introducing an arbitrary limit. When 7.5 comes out it you'll want to rip this out. -- greg