Thread: Use of Functional Indexes and Planner estimates

Use of Functional Indexes and Planner estimates

From: Russell Smith
Dear All,

I have a table with approximately 570k rows.


                  Table "filter.rules"
  Column  |          Type          |               Modifiers
----------+------------------------+----------------------------------------
 rulename | character varying(16)  | not null default ''::character varying
 uri      | character varying(200) | not null default ''::character varying
 redirect | character varying(200) | not null default ''::character varying
 moddate  | date                   | not null default ('now'::text)::date
 active   | boolean                | not null default true
 comment  | character varying(255) | not null default ''::character varying
Indexes:
    "rules_pkey" primary key, btree (rulename, uri)
    "moddate_idx" btree (moddate)
    "rules_idx" btree (lower((uri)::text))

The statistics target on the uri column has been set to 1000.
VACUUM FULL and ANALYZE were run before the tests; the table has not been altered since.
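For reference, raising the per-column statistics target and re-analyzing would typically be done with something like this (a sketch; the exact commands used aren't shown in this message):

```sql
-- Hypothetical reconstruction of the setup described above:
ALTER TABLE filter.rules ALTER COLUMN uri SET STATISTICS 1000;
VACUUM FULL filter.rules;
ANALYZE filter.rules;
```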

# analyze verbose filter.rules;
INFO:  analyzing "filter.rules"
INFO:  "rules": 5228 pages, 300000 rows sampled, 570533 estimated total rows
ANALYZE

# explain analyze SELECT rulename, redirect from filter.rules
WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
GROUP BY rulename, redirect;
                                                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22352.79..22352.79 rows=1 width=12) (actual time=2047.331..2047.332 rows=1 loops=1)
   ->  Seq Scan on rules  (cost=0.00..22296.32 rows=11294 width=12) (actual time=540.149..2047.308 rows=1 loops=1)
         Filter: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
 Total runtime: 2047.420 ms
(4 rows)

# SET enable_seqscan=off;

# explain analyze SELECT rulename, redirect from filter.rules
WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
GROUP BY rulename, redirect;
                                                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=38970.68..38970.68 rows=1 width=12) (actual time=0.328..0.328 rows=1 loops=1)
   ->  Index Scan using rules_idx, rules_idx, rules_idx, rules_idx on rules  (cost=0.00..38914.21 rows=11294 width=12) (actual time=0.210..0.312 rows=1 loops=1)
         Index Cond: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
 Total runtime: 0.700 ms
(4 rows)

Could anybody offer an explanation of why the planner does such a terrible job of estimating the number of rows for this
query, with the stats set so high?
Tests were also done with stats set to 100 and to 1; the results were exactly the same, as I would have expected.

I am also interested in whether, and how, statistics are collected for functional indexes, so as to minimize
or avoid this problem in the future.

Thanks for your consideration of this matter.

Regards

Russell Smith.

Re: Use of Functional Indexes and Planner estimates

From: Manfred Koizar
On Tue, 8 Jun 2004 17:24:36 +1000, Russell Smith <mr-russ@pws.com.au>
wrote:
>Also I am interested in how functional indexes have statistics collected for them, if they do.

Not in any released version.

http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/backend/commands/analyze.c

| Revision 1.70 / Sun Feb 15 21:01:39 2004 UTC (3 months, 3 weeks ago) by tgl
| Changes since 1.69: +323 -16 lines
|
| First steps towards statistics on expressional (nee functional) indexes.
| This commit teaches ANALYZE to store such stats in pg_statistic, but
| nothing is done yet about teaching the planner to use 'em.

So statistics gathering for expressional indexes will be in 7.5, but I
don't know about the state of the planner ...
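Once that lands, the gathered numbers should be visible through the pg_stats view, keyed by the index's own name rather than the table's. A sketch of what inspecting them might look like (assuming the 7.5-era catalogs keep the usual pg_stats columns):

```sql
-- Expression-index statistics are stored under the index relation itself:
SELECT tablename, attname, n_distinct, null_frac
  FROM pg_stats
 WHERE tablename = 'rules_idx';
```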

Servus
 Manfred

Re: Use of Functional Indexes and Planner estimates

From: Tom Lane
Manfred Koizar <mkoi-pg@aon.at> writes:
> So statistics gathering for expressional indexes will be in 7.5, but I
> don't know about the state of the planner ...

Planner support is there too:

2004-02-16 19:52  tgl

    * src/: backend/optimizer/path/costsize.c,
    backend/optimizer/util/relnode.c, backend/utils/adt/selfuncs.c,
    include/optimizer/pathnode.h, include/utils/selfuncs.h: Make use of
    statistics on index expressions.  There are still some corner cases
    that could stand improvement, but it does all the basic stuff.    A
    byproduct is that the selectivity routines are no longer
    constrained to working on simple Vars; we might in future be able
    to improve the behavior for subexpressions that don't match
    indexes.

I don't recall anymore what "corner cases" I had in mind for future
improvement.

            regards, tom lane

Re: Use of Functional Indexes and Planner estimates

From: "Scott Marlowe"
On Tue, 2004-06-08 at 01:24, Russell Smith wrote:
> Dear All,
>
> I have a table with approximately 570k Rows.
>
>
>                   Table "filter.rules"
>   Column  |          Type          |               Modifiers
> ----------+------------------------+----------------------------------------
>  rulename | character varying(16)  | not null default ''::character varying
>  uri      | character varying(200) | not null default ''::character varying
>  redirect | character varying(200) | not null default ''::character varying
>  moddate  | date                   | not null default ('now'::text)::date
>  active   | boolean                | not null default true
>  comment  | character varying(255) | not null default ''::character varying
> Indexes:
>     "rules_pkey" primary key, btree (rulename, uri)
>     "moddate_idx" btree (moddate)
>     "rules_idx" btree (lower((uri)::text))
>
> The statistics target on the uri column has been set to 1000.
> VACUUM FULL and ANALYZE were run before the tests; the table has not been altered since.
>
> # analyze verbose filter.rules;
> INFO:  analyzing "filter.rules"
> INFO:  "rules": 5228 pages, 300000 rows sampled, 570533 estimated total rows
> ANALYZE
>
> # explain analyze SELECT rulename, redirect from filter.rules
> WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
> GROUP BY rulename, redirect;
>                                                                                            QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=22352.79..22352.79 rows=1 width=12) (actual time=2047.331..2047.332 rows=1 loops=1)
>    ->  Seq Scan on rules  (cost=0.00..22296.32 rows=11294 width=12) (actual time=540.149..2047.308 rows=1 loops=1)
>          Filter: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
>  Total runtime: 2047.420 ms
> (4 rows)
>
> # SET enable_seqscan=off;
>
> # explain analyze SELECT rulename, redirect from filter.rules
> WHERE lower(uri) IN (lower('land.com'), lower('com'), lower('land.com/'), lower('com/'))
> GROUP BY rulename, redirect;
>                                                                                            QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=38970.68..38970.68 rows=1 width=12) (actual time=0.328..0.328 rows=1 loops=1)
>    ->  Index Scan using rules_idx, rules_idx, rules_idx, rules_idx on rules  (cost=0.00..38914.21 rows=11294 width=12) (actual time=0.210..0.312 rows=1 loops=1)
>          Index Cond: ((lower((uri)::text) = 'land.com'::text) OR (lower((uri)::text) = 'com'::text) OR (lower((uri)::text) = 'land.com/'::text) OR (lower((uri)::text) = 'com/'::text))
>  Total runtime: 0.700 ms
> (4 rows)
>
> Could anybody offer an explanation of why the planner does such a terrible job of estimating the number of rows for this
> query, with the stats set so high?
> Tests were also done with stats set to 100 and to 1; the results were exactly the same, as I would have expected.

Simple: the planner is choosing a sequential scan when it should be
choosing an index scan.  This is usually because random_page_cost is set
too high, at the default of 4.  Try settings between 1.2 and 2.x or so
to see how that helps.  Be sure to test with various queries of your
own to confirm you've got about the right setting.
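A minimal sketch of that tuning loop, assuming you experiment per-session before touching postgresql.conf:

```sql
-- Lower random_page_cost for this session only, then re-check the plan:
SET random_page_cost = 1.5;
EXPLAIN ANALYZE
SELECT rulename, redirect
  FROM filter.rules
 WHERE lower(uri) IN (lower('land.com'), lower('com'),
                      lower('land.com/'), lower('com/'))
 GROUP BY rulename, redirect;
-- If plans improve across your whole workload, persist the value in postgresql.conf.
```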


Re: Use of Functional Indexes and Planner estimates

From: Greg Stark
"Scott Marlowe" <smarlowe@qwest.net> writes:

> >    ->  Seq Scan on rules
> >          (cost=0.00..22296.32 rows=11294 width=12)
> >          (actual time=540.149..2047.308 rows=1 loops=1)


> Simple, the planner is choosing a sequential scan when it should be
> choosing an index scan.  This is usually because random_page_cost is set
> too high, at the default of 4.  Try settings between 1.2 and 2.x or so
> to see how that helps.  Be sure and test with various queries of your
> own to be sure you've got about the right setting.

Unless you make random_page_cost about .0004 (4/11294) it isn't going to
cost this query right (that's a joke -- don't do it!).  The planner thinks
there are 11,000 records matching the WHERE clause when in fact there is only 1.

If you know an upper bound on how many records the query should find,
you might try a kludge involving putting a LIMIT inside the GROUP BY, i.e.
something like

select rulename,redirect
  from (select rulename,redirect
          from ...
         where ...
         limit 100) as kludge
 group by rulename,redirect

This would at least tell the planner not to expect more than 100 rows and to
take the plan likely to produce the first 100 rows fastest.
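Filled in with the actual query from this thread, the kludge would look something like this (the 100 is an arbitrary bound you'd pick from your own knowledge of the data):

```sql
SELECT rulename, redirect
  FROM (SELECT rulename, redirect
          FROM filter.rules
         WHERE lower(uri) IN (lower('land.com'), lower('com'),
                              lower('land.com/'), lower('com/'))
         LIMIT 100) AS kludge
 GROUP BY rulename, redirect;
```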

But this has the disadvantage of uglifying your code and introducing an
arbitrary limit.  When 7.5 comes out you'll want to rip it out.

--
greg