Thread: optimizer question

optimizer question

From
"Reinoud van Leeuwen"
Date:
Hi, 

I have a table that contains almost 8 milion rows. The primary key is a 
sequence, so the index should have a good distribution. Why does the 
optimizer refuse to use the index for getting the maximum value?
(even after a vacuum analyze of the table)

radius=# explain select max(radiuspk) from radius ;
NOTICE:  QUERY PLAN:

Aggregate  (cost=257484.70..257484.70 rows=1 width=8) ->  Seq Scan on radius  (cost=0.00..237616.76 rows=7947176
width=8)


Table and key info:

Did not find any relation named "radius_pk".
radius=# \d radius                                        Table "radius"     Attribute      |           Type
|            Modifier   
 
---------------------+--------------------------+---------------------------sessionid           | character varying(30)
  | not nullusername            | character varying(30)    | not nullnas_ip              | character varying(50)    |
notnulllogfileid           | integer                  |login_ip_host       | character varying(50)    | not
nullframed_ip_address  | character varying(50)    |file_timestamp      | timestamp with time zone | not
nullcorrected_timestamp| timestamp with time zone | not nullacct_status_type    | smallint                 | not
nullbytesin            | bigint                   |bytesout            | bigint                   |handled
|boolean                  | not null default 'f'sessionhandled      | boolean                  | not null default
'f'radiuspk           | bigint                   | not null default nextval
 
('radiuspk_seq'::text)
Indices: pk_radius,        radius_us

radius=# \d pk_radiusIndex "pk_radius"Attribute |  Type
-----------+--------radiuspk  | bigint
unique btree (primary key)




Re: optimizer question

From
Tom Lane
Date:
"Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> I have a table that contains almost 8 milion rows. The primary key is a 
> sequence, so the index should have a good distribution. Why does the 
> optimizer refuse to use the index for getting the maximum value?

The optimizer has no idea that max() has anything to do with indexes.
You could try something like
select * from tab order by foo desc limit 1;
        regards, tom lane


Re: optimizer question

From
Bruce Momjian
Date:
> "Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> > I have a table that contains almost 8 milion rows. The primary key is a 
> > sequence, so the index should have a good distribution. Why does the 
> > optimizer refuse to use the index for getting the maximum value?
> 
> The optimizer has no idea that max() has anything to do with indexes.
> You could try something like
> 
>     select * from tab order by foo desc limit 1;

Can we consider doing this optimization automatically?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: optimizer question

From
mlw
Date:
Bruce Momjian wrote:
> 
> > "Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> > > I have a table that contains almost 8 milion rows. The primary key is a
> > > sequence, so the index should have a good distribution. Why does the
> > > optimizer refuse to use the index for getting the maximum value?
> >
> > The optimizer has no idea that max() has anything to do with indexes.
> > You could try something like
> >
> >       select * from tab order by foo desc limit 1;
> 
> Can we consider doing this optimization automatically?

That would be real cool. I don't know of any database that does that. (I do
know that our Oracle 8i does not)


On a side note (can of worms?) is there the notion of a "rule based optimizer"
vs the cost based optimizer in Postgres?


Re: optimizer question

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > "Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> > > I have a table that contains almost 8 milion rows. The primary key is a
> > > sequence, so the index should have a good distribution. Why does the
> > > optimizer refuse to use the index for getting the maximum value?
> >
> > The optimizer has no idea that max() has anything to do with indexes.
> > You could try something like
> >
> >       select * from tab order by foo desc limit 1;
> 
> Can we consider doing this optimization automatically?

Only if we assume that people do not define their own max() that does
something 
that can't be calculated using the above formula like calculating AVG().

---------------
Hannu


Re: optimizer question

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > Bruce Momjian wrote:
> > >
> > > > "Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> > > > > I have a table that contains almost 8 milion rows. The primary key is a
> > > > > sequence, so the index should have a good distribution. Why does the
> > > > > optimizer refuse to use the index for getting the maximum value?
> > > >
> > > > The optimizer has no idea that max() has anything to do with indexes.
> > > > You could try something like
> > > >
> > > >       select * from tab order by foo desc limit 1;
> > >
> > > Can we consider doing this optimization automatically?
> >
> > Only if we assume that people do not define their own max() that does
> > something
> > that can't be calculated using the above formula like calculating AVG().
> 
> I hadn't thought of that one.  I can't imagine a max() that doesn't
> match the ORDER BY collating.

But suppose you could have different indexes on the same column. For
example 
for IP address you can theoretically define one index that indexes by
mask 
length and other that indexes by numeric value of IP and yet another
that 
indexes by some combination of both.

when doing an ORDER BY you can specify 'USING operator'

> Updated TODO item:
> 
> * Use indexes for min() and max() or convert to SELECT col FROM tab
>   ORDER BY col DESC LIMIT 1;

Maybe rather

* Use indexes for min() and max() or convert to "SELECT col FROM tab ORDER BY col DESC USING max_index_op LIMIT 1" if
thereis an index  on tab that uses btree(col max_index_op)
 

it seems that in most other cases the rewrite would be either a 
misoptimisation or plain wrong.

----------------
Hannu


Re: optimizer question

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> > 
> > > "Reinoud van Leeuwen" <reinoud@xs4all.nl> writes:
> > > > I have a table that contains almost 8 milion rows. The primary key is a
> > > > sequence, so the index should have a good distribution. Why does the
> > > > optimizer refuse to use the index for getting the maximum value?
> > >
> > > The optimizer has no idea that max() has anything to do with indexes.
> > > You could try something like
> > >
> > >       select * from tab order by foo desc limit 1;
> > 
> > Can we consider doing this optimization automatically?
> 
> Only if we assume that people do not define their own max() that does
> something 
> that can't be calculated using the above formula like calculating AVG().

I hadn't thought of that one.  I can't imagine a max() that doesn't
match the ORDER BY collating.

Updated TODO item:

* Use indexes for min() and max() or convert to SELECT col FROM tab ORDER BY col DESC LIMIT 1;   

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: optimizer question

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> Maybe rather

> * Use indexes for min() and max() or convert to "SELECT col FROM tab
>   ORDER BY col DESC USING max_index_op LIMIT 1" if there is an index 
>   on tab that uses btree(col max_index_op)

> it seems that in most other cases the rewrite would be either a 
> misoptimisation or plain wrong.

We would clearly need to add information to the system catalogs to allow
the planner to determine whether a given aggregate matches up to a given
index opclass.  This has been discussed before.

A more interesting question is how to determine whether such a rewrite
would be a win.  That is NOT a foregone conclusion.  Consider
SELECT max(col1) FROM tab WHERE col2 BETWEEN 12 AND 42;

Depending on the selectivity of the WHERE condition, we might be far
better off to scan on a col2 index and use our traditional max()
code than to scan on a col1 index until we find a row passing the
WHERE condition.  I'm not sure whether the planner currently has
statistics appropriate for such estimates or not ...
        regards, tom lane


Re: optimizer question

From
Bruce Momjian
Date:
> Hannu Krosing <hannu@tm.ee> writes:
> > Maybe rather
> 
> > * Use indexes for min() and max() or convert to "SELECT col FROM tab
> >   ORDER BY col DESC USING max_index_op LIMIT 1" if there is an index 
> >   on tab that uses btree(col max_index_op)
> 
> > it seems that in most other cases the rewrite would be either a 
> > misoptimisation or plain wrong.
> 
> We would clearly need to add information to the system catalogs to allow
> the planner to determine whether a given aggregate matches up to a given
> index opclass.  This has been discussed before.
> 
> A more interesting question is how to determine whether such a rewrite
> would be a win.  That is NOT a foregone conclusion.  Consider
> 
>     SELECT max(col1) FROM tab WHERE col2 BETWEEN 12 AND 42;
> 
> Depending on the selectivity of the WHERE condition, we might be far
> better off to scan on a col2 index and use our traditional max()
> code than to scan on a col1 index until we find a row passing the
> WHERE condition.  I'm not sure whether the planner currently has
> statistics appropriate for such estimates or not ...

Yes, agreed.  This would be just for limited cases.  Updated to:

* Use indexes for min() and max() or convert to SELECT col FROM tab ORDER BY col DESC LIMIT 1 if appropriate index
existsand WHERE clause acceptible                        ^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^
 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: optimizer question

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > Hannu Krosing <hannu@tm.ee> writes:
> > > Maybe rather
> >
> > > * Use indexes for min() and max() or convert to "SELECT col FROM tab
> > >   ORDER BY col DESC USING max_index_op LIMIT 1" if there is an index
> > >   on tab that uses btree(col max_index_op)
> >
> > > it seems that in most other cases the rewrite would be either a
> > > misoptimisation or plain wrong.
> >
> > We would clearly need to add information to the system catalogs to allow
> > the planner to determine whether a given aggregate matches up to a given
> > index opclass.  This has been discussed before.
> >
> > A more interesting question is how to determine whether such a rewrite
> > would be a win.  That is NOT a foregone conclusion.  Consider
> >
> >       SELECT max(col1) FROM tab WHERE col2 BETWEEN 12 AND 42;
> >
> > Depending on the selectivity of the WHERE condition, we might be far
> > better off to scan on a col2 index and use our traditional max()
> > code than to scan on a col1 index until we find a row passing the
> > WHERE condition.  I'm not sure whether the planner currently has
> > statistics appropriate for such estimates or not ...
> 
> Yes, agreed.  This would be just for limited cases.  Updated to:
> 
> * Use indexes for min() and max() or convert to SELECT col FROM tab ORDER
>   BY col DESC LIMIT 1 if appropriate index exists and WHERE clause acceptible
>                          ^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^
It would be probably a win if only exact match of
 SELECT MAX(*) FROM TAB ;

would be rewritten if appropriate index exists.

The appropriateness should be explicitly declared in aggregate
definition.

-----------------
Hannu


Re: optimizer question

From
mlw
Date:
Hannu Krosing wrote:
> 
> Bruce Momjian wrote:
> >
> > > Hannu Krosing <hannu@tm.ee> writes:
> > > > Maybe rather
> > >
> > > > * Use indexes for min() and max() or convert to "SELECT col FROM tab
> > > >   ORDER BY col DESC USING max_index_op LIMIT 1" if there is an index
> > > >   on tab that uses btree(col max_index_op)
> > >
> > > > it seems that in most other cases the rewrite would be either a
> > > > misoptimisation or plain wrong.
> > >
> > > We would clearly need to add information to the system catalogs to allow
> > > the planner to determine whether a given aggregate matches up to a given
> > > index opclass.  This has been discussed before.
> > >
> > > A more interesting question is how to determine whether such a rewrite
> > > would be a win.  That is NOT a foregone conclusion.  Consider
> > >
> > >       SELECT max(col1) FROM tab WHERE col2 BETWEEN 12 AND 42;
> > >
> > > Depending on the selectivity of the WHERE condition, we might be far
> > > better off to scan on a col2 index and use our traditional max()
> > > code than to scan on a col1 index until we find a row passing the
> > > WHERE condition.  I'm not sure whether the planner currently has
> > > statistics appropriate for such estimates or not ...
> >
> > Yes, agreed.  This would be just for limited cases.  Updated to:
> >
> > * Use indexes for min() and max() or convert to SELECT col FROM tab ORDER
> >   BY col DESC LIMIT 1 if appropriate index exists and WHERE clause acceptible
> >                          ^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^
> It would be probably a win if only exact match of
> 
>   SELECT MAX(*) FROM TAB ;
> 
> would be rewritten if appropriate index exists.
> 
> The appropriateness should be explicitly declared in aggregate
> definition.

I want to chime in here. If the ability exists to evaluate that max() or min()
is appropriate, and that using the equivilent of "select select col1 from tab
desc limit 1" for "select max(col1) from tab" would be a huge gain for
Postgres. I know our Oracle8i can't do it, and it would be a very usefull
optimization. 

At issue is the the "limit" clause is very very cool and not available in
Oracle, and since it isn't available, one does not think to use it, and in
queries where they my execute on both Postgres AND oracle, you can't use it.