Thread: Index optimization ?

Index optimization ?

From
Bo Lorentsen
Date:
Hi ...

In my quest to get rid of the oid dependency, i have made some new low
level code with the help from many nice people from this community
(thanks for that), but I still have one somewhat big problem.

I am running PG 7.4.6, btw.

I have a "sale" table that have a BIGSERIAL as primary key, but explain
keeps telling me (and so does the performance) that it will perform a
"seq scan" of my table when performing this statement (not using the
pkey index) :

select * from sale where id = currval( 'sale_id_seq' );

I tried this too :

select * from sale where id = currval( 'sale_id_seq' )::bigint;

But this still did not work (still using seq scan) :-(

At last I did a :

explain select * from sale where id = 42::bigint;

Just to make sure, and this made the proper optimizations (used the pkey
index).

What have I done wrong, or do PG still have some casting problems in the
optimizer ?

/BL

Re: Index optimization ?

From
Tom Lane
Date:
Bo Lorentsen <bl@netgroup.dk> writes:
> select * from sale where id = currval( 'sale_id_seq' );

This is not legally optimizable into an indexscan, because currval() is
a volatile function.  (It's easy to construct cases where its value
actually does change from row to row --- just use a nextval() as well.)

You can fake it out in a couple of ways --- the recommended method is to
wrap currval in a user-defined function that is misleadingly marked
stable.  I think it still works to just put the call in a sub-select:
    select * from sale where id = (select currval( 'sale_id_seq' ));
but I take no responsibility if future improvements in the planner break
that trick.

            regards, tom lane

Re: Index optimization ?

From
Michael Fuhr
Date:
On Sat, Jan 15, 2005 at 07:03:43PM +0100, Bo Lorentsen wrote:

> select * from sale where id = currval( 'sale_id_seq' )::bigint;
>
> But this still did not work (still using seq scan) :-(

currval() is volatile.  See "Function Volatility Categories" in the
"Extending SQL" chapter of the documentation and search the list
archives for past discussion of currval()'s volatility.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: Index optimization ?

From
"Jim C. Nasby"
Date:
On Sat, Jan 15, 2005 at 01:27:49PM -0500, Tom Lane wrote:
> Bo Lorentsen <bl@netgroup.dk> writes:
> > select * from sale where id = currval( 'sale_id_seq' );
>
> This is not legally optimizable into an indexscan, because currval() is
> a volatile function.  (It's easy to construct cases where its value
> actually does change from row to row --- just use a nextval() as well.)
>
> You can fake it out in a couple of ways --- the recommended method is to
> wrap currval in a user-defined function that is misleadingly marked
> stable.  I think it still works to just put the call in a sub-select:
>     select * from sale where id = (select currval( 'sale_id_seq' ));
> but I take no responsibility if future improvements in the planner break
> that trick.

Would it make sense to have a version of currval that will only return
one value in a statement/transaction? So the first time it's called it
remembers what currval for that sequence is and always returns the same
value?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Index optimization ?

From
Alvaro Herrera
Date:
On Sat, Jan 15, 2005 at 03:11:22PM -0600, Jim C. Nasby wrote:

> Would it make sense to have a version of currval that will only return
> one value in a statement/transaction? So the first time it's called it
> remembers what currval for that sequence is and always returns the same
> value?

What would nextval() do in that case?  Return the nextval on the first
call, and act like currval() from then until the end of the transaction?

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... �Qui�n es el machito que tendr�a carnet?"  (Mafalda)

Re: Index optimization ?

From
Bo Lorentsen
Date:
Tom Lane wrote:

>This is not legally optimizable into an indexscan, because currval() is
>a volatile function.  (It's easy to construct cases where its value
>actually does change from row to row --- just use a nextval() as well.)
>
>
I am not sure what you mean by a "volatile function", and how this
affect the returned types, I guess this demands some more low level PG
knowledge to understand.

>You can fake it out in a couple of ways --- the recommended method is to
>wrap currval in a user-defined function that is misleadingly marked
>stable.  I think it still works to just put the call in a sub-select:
>    select * from sale where id = (select currval( 'sale_id_seq' ));
>
>
>but I take no responsibility if future improvements in the planner break
>that trick.
>
>
The select trick works just fine, but I don't understand why :-(

Do you have any idea to how I may learn more about function types, or is
this a "read the source, luke" thing (I am not sure I have time for that
right now) ?

Well, thanks anyway ... this just work so nice, and I a looking forward
to fill up this database with plenty of data, and still being able to
sleep :-)

/BL

Re: Index optimization ?

From
Bo Lorentsen
Date:
Michael Fuhr wrote:

>currval() is volatile.  See "Function Volatility Categories" in the
>"Extending SQL" chapter of the documentation and search the list
>archives for past discussion of currval()'s volatility.
>
>
Hmm, I can't find that chapter in the 7.4 manual, or am I looking the
wrong place ? I really like to understand this !

/BL

Re: Index optimization ?

From
Tom Lane
Date:
Bo Lorentsen <bl@netgroup.dk> writes:
> Do you have any idea to how I may learn more about function types, or is
> this a "read the source, luke" thing (I am not sure I have time for that
> right now) ?

http://developer.postgresql.org/docs/postgres/xfunc-volatility.html

            regards, tom lane

Re: Index optimization ?

From
Michael Fuhr
Date:
On Sat, Jan 15, 2005 at 11:28:08PM +0100, Bo Lorentsen wrote:
> Michael Fuhr wrote:
>
> >currval() is volatile.  See "Function Volatility Categories" in the
> >"Extending SQL" chapter of the documentation and search the list
> >archives for past discussion of currval()'s volatility.
> >
> Hmm, I can't find that chapter in the 7.4 manual, or am I looking the
> wrong place ? I really like to understand this !

Sorry...apparently that section is in the development documentation
but not in the 7.x doc.  Tom Lane posted the link but here it is
again:

http://developer.postgresql.org/docs/postgres/xfunc-volatility.html

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: Index optimization ?

From
"Jim C. Nasby"
Date:
On Sat, Jan 15, 2005 at 06:34:11PM -0300, Alvaro Herrera wrote:
> On Sat, Jan 15, 2005 at 03:11:22PM -0600, Jim C. Nasby wrote:
>
> > Would it make sense to have a version of currval that will only return
> > one value in a statement/transaction? So the first time it's called it
> > remembers what currval for that sequence is and always returns the same
> > value?
>
> What would nextval() do in that case?  Return the nextval on the first
> call, and act like currval() from then until the end of the transaction?

I'm not sure which would be best. It could either do what you suggest,
or it could operate as normal, or it could possibly throw an error if
run a second time.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Index optimization ?

From
Bo Lorentsen
Date:
Tom Lane wrote:

>http://developer.postgresql.org/docs/postgres/xfunc-volatility.html
>
>
Ok, thanks I see why there is these three differant function types, but
I don't quite understand why the value from a volatile function, can't
be used as a index key. Is this because there is no return type garanti,
for the voilatile function too ?

Will the only possible way to fix this be to make a volatile function
with a return type (I know this is not possible now, but in theory) ?

/BL

Re: Index optimization ?

From
Michael Glaesemann
Date:
On Jan 17, 2005, at 0:25, Bo Lorentsen wrote:

> Tom Lane wrote:
>
>> http://developer.postgresql.org/docs/postgres/xfunc-volatility.html
>>
> Ok, thanks I see why there is these three differant function types,
> but I don't quite understand why the value from a volatile function,
> can't be used as a index key. Is this because there is no return type
> garanti, for the voilatile function too ?

I don't believe it has necessarily anything to do with the return type,
but rather the return value. An index only works if you know what the
value is, and the return value for a volatile function is not
guaranteed to be the same for given parameters. Here's a contrived (and
untestsd) example, but one I think makes it clear:

CREATE FUNCTION plus_random ( INTEGER )
    RETURNS INTEGER
    LANGUAGE SQL AS '
        SELECT round( $1 + random() * 100 );
    ';

One could conceivably attempt to make a functional index using
plus_random(), but the result it gives every time is indeterminant. How
would you be able to usefully search for values in an index that is
based on this function? Would it make sense do to do so?

Does this help? (And if I'm completely off base, someone please let me
know :)

Michael Glaesemann
grzm myrealbox com


Re: Index optimization ?

From
Bo Lorentsen
Date:
Michael Glaesemann wrote:

> I don't believe it has necessarily anything to do with the return
> type, but rather the return value. An index only works if you know
> what the value is, and the return value for a volatile function is not
> guaranteed to be the same for given parameters. Here's a contrived
> (and untestsd) example, but one I think makes it clear:
>
> CREATE FUNCTION plus_random ( INTEGER )
>     RETURNS INTEGER
>     LANGUAGE SQL AS '
>         SELECT round( $1 + random() * 100 );
>     ';
>
> One could conceivably attempt to make a functional index using
> plus_random(), but the result it gives every time is indeterminant.
> How would you be able to usefully search for values in an index that
> is based on this function? Would it make sense do to do so?

What you say is that PG can't see the difference between this
"plus_random" and the "currval", right.

But if I have a select (a quite strange one), like this :

SELECT * FROM test_table WHERE id = plus_random( test_col );

I don't understand the problem. The function always return an integer as
specified in the function decl. so why not use the PK index for search,
instead of using seq scan ? The value is totally unpredictable but it is
still an integer and the pk index is still useful regarding performance !

I know there is something I don't understand, so I just have to ask :-)

> Does this help? (And if I'm completely off base, someone please let me
> know :)

No this time I think missed the point, I understand the volatility of
functions, so the planer know what to expect from the function,
regarding side effect, but I still don't understand why this influences
the choice of valid indexes.

/BL

Re: Index optimization ?

From
Ragnar Hafstað
Date:
On Sun, 2005-01-16 at 16:25 +0100, Bo Lorentsen wrote:
[about a volatile function in a where clause not generating index scan]

> Will the only possible way to fix this be to make a volatile function
> with a return type (I know this is not possible now, but in theory) ?

this has nothing to do with the return type. a volatile function is a
function that is not garanteed to return the same value given same
input parameters, (such as currval()).

when a volatile function is used thus:
  SELECT * FROM mytable WHERE col=myvolatilefunc();
the planner must call the function once per table row, and assume
possibly different return values each time, so an indexscan will
not improve timings.

on the other hand, if the function is labeled STABLE, the planner
can assume that the same value will alway be returned, so only
one call to it can be made, and an indexscan might be found the
most effective.

hope this helps

gnari



Re: Index optimization ?

From
Bo Lorentsen
Date:
Ragnar Hafstað wrote:

>this has nothing to do with the return type. a volatile function is a
>function that is not garanteed to return the same value given same
>input parameters, (such as currval()).
>
>when a volatile function is used thus:
>  SELECT * FROM mytable WHERE col=myvolatilefunc();
>the planner must call the function once per table row, and assume
>possibly different return values each time, so an indexscan will
>not improve timings.
>
>
Why not use the index scan for every row, is this a "limit" in the
planner ? I think there is something in the planner I don't understand :-)

>on the other hand, if the function is labeled STABLE, the planner
>can assume that the same value will alway be returned, so only
>one call to it can be made, and an indexscan might be found the
>most effective.
>
>
The two other function types are not interesting, but I don't understand
the planners use of index optimization.

/BL

Re: Index optimization ?

From
Martijn van Oosterhout
Date:
On Sun, Jan 16, 2005 at 05:30:22PM +0100, Bo Lorentsen wrote:
> >One could conceivably attempt to make a functional index using
> >plus_random(), but the result it gives every time is indeterminant.
> >How would you be able to usefully search for values in an index that
> >is based on this function? Would it make sense do to do so?
>
> What you say is that PG can't see the difference between this
> "plus_random" and the "currval", right.
>
> But if I have a select (a quite strange one), like this :
>
> SELECT * FROM test_table WHERE id = plus_random( test_col );
>
> I don't understand the problem. The function always return an integer as
> specified in the function decl. so why not use the PK index for search,
> instead of using seq scan ? The value is totally unpredictable but it is
> still an integer and the pk index is still useful regarding performance !

No, it depends on your interpretation of the query. Note, I'm not up
with the SQL standard so maybe it doesn't work like this, but this is
what I think the problem is.

The above query can be interpreted as: for each row in test_table,
compare id against plus_random( test_col ). Now, in theory the
plus_random function needs to be evaluated for every row, each time
giving a different value, thus it may or may not match id.

You can see that with that interpretation an index on id doesn't help.
If you interpret the query so plus_random is evaluted only once, then
an index will help. If test_col is a column of the table then there is
no way an index can help you.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Index optimization ?

From
Ragnar Hafstað
Date:
On Sun, 2005-01-16 at 17:45 +0100, Bo Lorentsen wrote:
> Ragnar Hafstað wrote:
> >when a volatile function is used thus:
> >  SELECT * FROM mytable WHERE col=myvolatilefunc();
> >the planner must call the function once per table row, and assume
> >possibly different return values each time, so an indexscan will
> >not improve timings.
> >
> >
> Why not use the index scan for every row, is this a "limit" in the
> planner ? I think there is something in the planner I don't understand :-)

the planner will just use the plan it estimates will be fastest.
because of how indexscans work in postgresql, in this case it would be
slower than a tablescan (assuming the function really is volatile)

gnari



Re: Index optimization ?

From
Roman Neuhauser
Date:
# kleptog@svana.org / 2005-01-16 17:48:08 +0100:
> On Sun, Jan 16, 2005 at 05:30:22PM +0100, Bo Lorentsen wrote:
> > >One could conceivably attempt to make a functional index using
> > >plus_random(), but the result it gives every time is indeterminant.
> > >How would you be able to usefully search for values in an index that
> > >is based on this function? Would it make sense do to do so?
> >
> > What you say is that PG can't see the difference between this
> > "plus_random" and the "currval", right.
> >
> > But if I have a select (a quite strange one), like this :
> >
> > SELECT * FROM test_table WHERE id = plus_random( test_col );
> >
> > I don't understand the problem. The function always return an integer as
> > specified in the function decl. so why not use the PK index for search,
> > instead of using seq scan ? The value is totally unpredictable but it is
> > still an integer and the pk index is still useful regarding performance !
>
> No, it depends on your interpretation of the query. Note, I'm not up
> with the SQL standard so maybe it doesn't work like this, but this is
> what I think the problem is.
>
> The above query can be interpreted as: for each row in test_table,
> compare id against plus_random( test_col ).

    That's what happens if you declare the function VOLATILE.
    Make it STABLE, and the function call will be evaluated only once
    for the whole table scan. That's just what Tom Lane suggested in
    his post.

--
If you cc me or remove the list(s) completely I'll most likely ignore
your message.    see http://www.eyrie.org./~eagle/faqs/questions.html

Re: Index optimization ?

From
Tom Lane
Date:
Ragnar =?ISO-8859-1?Q?Hafsta=F0?= <gnari@simnet.is> writes:
> On Sun, 2005-01-16 at 17:45 +0100, Bo Lorentsen wrote:
>> Why not use the index scan for every row, is this a "limit" in the
>> planner ? I think there is something in the planner I don't understand :-)

> the planner will just use the plan it estimates will be fastest.
> because of how indexscans work in postgresql, in this case it would be
> slower than a tablescan (assuming the function really is volatile)

It has nothing to do with speed, it has to do with giving the correct
answer.  We define "correct answer" as being the result you would get
from a naive interpretation of the SQL semantics --- that is, for every
row in the FROM table, actually execute the WHERE clause, and return the
rows where it produces TRUE.

As an example, a query like
    SELECT * FROM mytable WHERE random() < 0.1;
should produce a random sampling of about one-tenth of the rows in mytable.
If we evaluated random() only once in this query, we would get either
all or none of the rows, clearly not the right answer.

An indexscan is a legal optimization only if the function(s) in the
WHERE clause are all STABLE or better.  This is because the index access
code will only evaluate the righthand side of the "indexcol = something"
clause once, and then will use that value to descend the btree and
select matching index entries.  We must be certain that this gives the
same result we would get from a seqscan.

The definition of STABLE that PostgreSQL uses was crafted specifically
to capture the property that a function is safe to use in an indexscan
qualification ...

            regards, tom lane

Re: Index optimization ?

From
Bo Lorentsen
Date:
Martijn van Oosterhout wrote:

>No, it depends on your interpretation of the query. Note, I'm not up
>with the SQL standard so maybe it doesn't work like this, but this is
>what I think the problem is.
>
>
I just try to learn, so that is ok :-) Tom gave me a solution that
works, so now I struggle to understand why.

>The above query can be interpreted as: for each row in test_table,
>compare id against plus_random( test_col ). Now, in theory the
>plus_random function needs to be evaluated for every row, each time
>giving a different value, thus it may or may not match id.
>
>
But if you take a look at a function, it has a return type. So "currval"
always returns a BIGINT no matter what kind of parameters are given,
that is a part of the declaration, as far as I can see. Why are this
type info not used to match an index, as the type is the same no matter
what row we are in, or no matter its parameter value (or context). The
value change, but not the type. The type is used to find a matching
index is it not ?

Am I misunderstanding you ?

>You can see that with that interpretation an index on id doesn't help.
>
>
No, I think this is the problem, I don't see :-) The function promise to
return a certain type, and type can be used to find the prober index (if
any).

>If you interpret the query so plus_random is evaluted only once, then
>an index will help. If test_col is a column of the table then there is
>no way an index can help you.
>
>
If and only if the function returns a different value TYPE, otherwise it
can use the same index but with different values, of the same type alias
use index scan.

But again, I am sure there is something I have misunderstud :-)

Thanks for trying :-)

/BL

Re: Index optimization ?

From
Bo Lorentsen
Date:
Tom Lane wrote:

>It has nothing to do with speed, it has to do with giving the correct
>answer.  We define "correct answer" as being the result you would get
>from a naive interpretation of the SQL semantics --- that is, for every
>row in the FROM table, actually execute the WHERE clause, and return the
>rows where it produces TRUE.
>
>As an example, a query like
>    SELECT * FROM mytable WHERE random() < 0.1;
>should produce a random sampling of about one-tenth of the rows in mytable.
>
>
Nice explaination ...

>If we evaluated random() only once in this query, we would get either
>all or none of the rows, clearly not the right answer.
>
>
So if the random function was stable, you either get all or none, as et
gets executed only ones ?

>An indexscan is a legal optimization only if the function(s) in the
>WHERE clause are all STABLE or better.  This is because the index access
>code will only evaluate the righthand side of the "indexcol = something"
>clause once, and then will use that value to descend the btree and
>select matching index entries.  We must be certain that this gives the
>same result we would get from a seqscan.
>
>
Now this sounds like a blink of the problem that I don't understand :-)
When you say it evaluate "right side" ones, what kind of information are
you (the executer) then getting, and how is the index match then
performed. Is all the where clause expression marked as volatile at this
level, just to be sure ?

Well maybe the real question is how does the executer match an index, or
am I off track ?

/BL

Re: Index optimization ?

From
Ragnar Hafstað
Date:
On Sun, 2005-01-16 at 14:11 -0500, Tom Lane wrote:
> Ragnar =?ISO-8859-1?Q?Hafsta=F0?= <gnari@simnet.is> writes:
> > On Sun, 2005-01-16 at 17:45 +0100, Bo Lorentsen wrote:
> >> Why not use the index scan for every row, is this a "limit" in the
> >> planner ? I think there is something in the planner I don't understand :-)
>
> > the planner will just use the plan it estimates will be fastest.
> > because of how indexscans work in postgresql, in this case it would be
> > slower than a tablescan (assuming the function really is volatile)
>
> It has nothing to do with speed, it has to do with giving the correct
> answer.  We define "correct answer" as being the result you would get
> from a naive interpretation of the SQL semantics --- that is, for every
> row in the FROM table, actually execute the WHERE clause, and return the
> rows where it produces TRUE.

I should not have used the word 'indexscan'. I just meant that it would
be less effective to use an index to look up each result of the volatile
function than using a tablescan. It was clear that the function would
have to be called for each row, but the OP was asking (I think) why
the index was not used.

gnari




Re: Index optimization ?

From
"Florian G. Pflug"
Date:
Bo Lorentsen wrote:
> So if the random function was stable, you either get all or none, as et
> gets executed only ones ?
>
>> An indexscan is a legal optimization only if the function(s) in the
>> WHERE clause are all STABLE or better.  This is because the index access
>> code will only evaluate the righthand side of the "indexcol = something"
>> clause once, and then will use that value to descend the btree and
>> select matching index entries.  We must be certain that this gives the
>> same result we would get from a seqscan.
>>
>>
> Now this sounds like a blink of the problem that I don't understand :-)
> When you say it evaluate "right side" ones, what kind of information are
> you (the executer) then getting, and how is the index match then
> performed. Is all the where clause expression marked as volatile at this
> level, just to be sure ?
Lets say, you have an query "select * from table where field =
function()". Now, according to the sql-spec, you would have to
scan each row in "table", call the function "functio()", and compare the
result. If the result of the call to "function()" matches the value in
"field", the you return the row, otherwise you don't return it.

An index is a tree, where each node has two subnodes/children. Each node
representents a row of your table (or, rather, references a row - it
contails only the value of the field you indexed). Additionally,
the value of the field of the "left" child (and the value of the field
of its children, and so on) is always guaranteed to be smaller-or-equal
to the value of field of the node itself, and the value
of the field of the "right" child is always guaranteed to be
greater-or-equal to the value of the field of the node.
So, if you have three records in the table "table", like this:
f1
--
1
2
3

Then your index looks the following:
    2
   / \
  1   3

Now, when doing an index lookup, you have to know which value to look
for (lets say you look for 3). Then you look at the top node, and
compare the value you are looking for to the value of the node. In our
case, the node has a smaller value then the one we are looking for.
Because we know that the left child of the toplevel node will have an
even smaller value, we don't need to look at the left child at all. We
just check the right child, and there we find our record with "field"=3.

_BUT_ this only works, because we knew for which value to look before we
started searching. If we the value we look for is constantly changing,
our index lookup would return bogus results. So, if the value is unknown
_at_the_beginning_ of the search, you can't use the index, because the
power of an index search comes from the idea to skip a whole subtree (in
our case the left one), because we _know_ it can't contain the value we
are looking for.

Functions marked "immutable" or "stable" is _guaranteed_ to never change
their value during a select statement, or at least not in an
unpredictable way. Thats why you can use return values of "immutable" or
"stable" functions for an index search.

> Well maybe the real question is how does the executer match an index, or
> am I off track ?
Lets say, you are doing the following query
"select * from table where field1 = currval('seq')" and field2 =
nextval('seq')

Now, the value of "currval('seq')" changes with every row visited. In
your example, the value of "currval" is actually stable - but postgres
has no way to know this. To use an index scan for your query, postgres
would need to know, that only a call to nextval can change the value of
currval - but this, of course, is a quite difficult dependency for a
database to track.

greetings, Florian Pflug



Attachment

Re: Index optimization ?

From
Tom Lane
Date:
Bo Lorentsen <bl@netgroup.dk> writes:
> Tom Lane wrote:
>>    SELECT * FROM mytable WHERE random() < 0.1;
>> If we evaluated random() only once in this query, we would get either
>> all or none of the rows, clearly not the right answer.
>>
> So if the random function was stable, you either get all or none, as et
> gets executed only ones ?

No, you'd still end up with a seqscan, because this WHERE clause offers
no chance of matching an index, and we don't do anything special with
stable functions beyond trying to match them to index conditions.
But consider something like

    SELECT * FROM mytable WHERE keycol = int(random() * 1000);

where keycol is indexed and contains integers 0..1000; let's say each
such value appears ten times.  With a seqscan implementation (which I
consider is what SQL defines the semantics to be) random() would be
recomputed at each row and there would be about a 1/1000 chance of
selecting each row.  You might get more or less than exactly ten result
rows, and they'd almost certainly contain different values of keycol.
Now if random() were marked stable (and of course both multiply and
int() are immutable), then the planner would consider an indexscan on
keycol to be a valid optimization.  But that would produce
distinguishably different results, because random() would be evaluated
only once: you would always get exactly ten rows and they'd always all
have the same keycol value.


>> An indexscan is a legal optimization only if the function(s) in the
>> WHERE clause are all STABLE or better.  This is because the index access
>> code will only evaluate the righthand side of the "indexcol = something"
>> clause once, and then will use that value to descend the btree and
>> select matching index entries.  We must be certain that this gives the
>> same result we would get from a seqscan.
>>
> Now this sounds like a blink of the problem that I don't understand :-)
> When you say it evaluate "right side" ones, what kind of information are
> you (the executer) then getting, and how is the index match then
> performed.

An index can basically implement conditions like "WHERE indexedcol =
constant" --- it takes the constant value and searches the index for
matches.  (Btrees can also do things like WHERE indexedcol <= constant,
but let's just think about equality to keep things simple.)  We can deal
with a nonconstant righthand side, so long as it's okay to evaluate the
value just once before the index starts to do its thing.  That
assumption is what STABLE is all about.

            regards, tom lane

Re: Index optimization ?

From
Greg Stark
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:

> Lets say, you have an query "select * from table where field = function()".

Maybe this would be clearer with a more egregious example of volatility.

Say you had a function odd() that returns 1 and 0 alternating. That is, it
returns 1 the first time it's called, 0 the second time it's called, then 1,
then 0, etc.

If you did "select * from tab where col = odd()" you would expect to get half
of the rows where col=0 or col=1. Of course since the order is unpredictable
there's no way to know which ones but you should still be pretty sure it'll be
half of the rows.

If Postgres used an index it would call odd(), which would return 1 because
it's the first time, and then Postgres would go look up the rows where col is
1 and return all of them. That's a very different behaviour from if the index
isn't used. If all the records have col=1 then you're getting all of the
records instead of half of them. If col=0 then you're getting none of them
instead of half of them.

--
greg

Re: Index optimization ?

From
Bo Lorentsen
Date:
Tom Lane wrote:

>No, you'd still end up with a seqscan, because this WHERE clause offers
>no chance of matching an index, and we don't do anything special with
>stable functions beyond trying to match them to index conditions.
>
>
So, the executer uses the (first) value to find the index to use for ALL
rows, and if this value change on each row, this can't be optimized and
a seq scan is initiated.

Is this not a problem for joins ?

>But consider something like
>
>    SELECT * FROM mytable WHERE keycol = int(random() * 1000);
>
>where keycol is indexed and contains integers 0..1000; let's say each
>such value appears ten times.  With a seqscan implementation (which I
>consider is what SQL defines the semantics to be) random() would be
>recomputed at each row and there would be about a 1/1000 chance of
>selecting each row.
>
This would demand a new index lookup for each row, right ?

>You might get more or less than exactly ten result
>rows, and they'd almost certainly contain different values of keycol.
>
>
This much i do understand :-)

>Now if random() were marked stable (and of course both multiply and
>int() are immutable), then the planner would consider an indexscan on
>keycol to be a valid optimization.  But that would produce
>distinguishably different results, because random() would be evaluated
>only once: you would always get exactly ten rows and they'd always all
>have the same keycol value.
>
>
I know why random (and currval) is not stabel, but I just don't
understand why a variable righthand result in seq scan, and not an index
scan, even when the data types match an index type.

To me it sounds like an index lookup is a one time a query (not per row)
thing, but I don't understand why. This can be because, this is the way
it turned up, but there is more possibly an aspect of SQL that I don't
know too much about.

>An index can basically implement conditions like "WHERE indexedcol =
>constant" --- it takes the constant value and searches the index for
>matches.  (Btrees can also do things like WHERE indexedcol <= constant,
>but let's just think about equality to keep things simple.)
>
:-)

>We can deal
>with a nonconstant righthand side, so long as it's okay to evaluate the
>value just once before the index starts to do its thing.  That
>assumption is what STABLE is all about.
>
>
So righthand value can't evaluate per row, and the value type of the
righthand expression can't be used as a index match.

I just hoped for the executer to work like this :

find indexedcol indexs

evaluate the righthand expression, and find its type (not value)

match the righthand value type and match it on index types (is both
sides integer)

if index is found use this together with the per row righthand value

or just use seq scan (I don't understand why, this works if indexes don't)

This is what I thought PG was doing :-)

Hope, I did not miss any important points.

/BL


Re: Index optimization ?

From
Bo Lorentsen
Date:
Florian G. Pflug wrote:

> Lets say, you have an query "select * from table where field =
> function()". Now, according to the sql-spec, you would have to
> scan each row in "table", call the function "functio()", and compare the
> result. If the result of the call to "function()" matches the value in
> "field", the you return the row, otherwise you don't return it.

This far I follow :-)

> An index is a tree, where each node has two subnodes/children. Each node
> representents a row of your table (or, rather, references a row - it
> contails only the value of the field you indexed). Additionally,
> the value of the field of the "left" child (and the value of the field
> of its children, and so on) is always guaranteed to be
> smaller-or-equal to the value of field of the node itself, and the value
> of the field of the "right" child is always guaranteed to be
> greater-or-equal to the value of the field of the node.

What you say, is based on the result of the evaluation, the executer
will optimize or performe the index lookup, if the righthand is a
constant, but it will perform a seq scan if the value is not known on
the first query (volatile) ?

To me this sound like the indexes can't be performed per row, but only
per query ? Or, PG is not type aware when maching indexes ?

> Now, when doing an index lookup, you have to know which value to look
> for (lets say you look for 3). Then you look at the top node, and
> compare the value you are looking for to the value of the node. In our
> case, the node has a smaller value then the one we are looking for.
> Because we know that the left child of the toplevel node will have an
> even smaller value, we don't need to look at the left child at all. We
> just check the right child, and there we find our record with "field"=3.

But why can't we evaluate righthand and use this new row value (could
still be 3, but not 'foobar' :-)) as a key to the index ?

> _BUT_ this only works, because we knew for which value to look before we
> started searching. If we the value we look for is constantly changing,
> our index lookup would return bogus results. So, if the value is unknown
> _at_the_beginning_ of the search, you can't use the index, because the
> power of an index search comes from the idea to skip a whole subtree (in
> our case the left one), because we _know_ it can't contain the value
> we are looking for.

But the "value" is unknow yes ... but the type of the value is not, this
will not change from row to row only the value do this.

> Functions marked "immutable" or "stable" is _guaranteed_ to never
> change their value during a select statement, or at least not in an
> unpredictable way. Thats why you can use return values of "immutable"
> or "stable" functions for an index search.

I understand that, I just can't see why an index lookup can't be used on
"per row" basis.

> Lets say, you are doing the following query
> "select * from table where field1 = currval('seq')" and field2 =
> nextval('seq')
>
> Now, the value of "currval('seq')" changes with every row visited. In
> your example, the value of "currval" is actually stable - but postgres
> has no way to know this. To use an index scan for your query, postgres
> would need to know, that only a call to nextval can change the value
> of currval - but this, of course, is a quite difficult dependency for a
> database to track.

But why can't the integer index for "field1" be used, with the "currval"
result, even if it changes ? Are the index lookup only performed ones
(asked this before ) per query ?

/BL

Re: Index optimization ?

From
"Frank D. Engel, Jr."
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Okay, let's look at this a different way.

When you look at a volatile function or variable, let's say
CURRENT_TIMESTAMP (which returns the current date and time as of the
beginning of the transaction), you see a function or variable whose
value changes unpredictably between calls/reads.

Now let's look at that value being inserted into a table.  Say table X
has column Y.  We insert a few rows into X, and use CURRENT_TIMESTAMP
as the value for Y.

So we have something like this (example only):

SELECT Y FROM X;

Y
- -----
1
4
6
9
14

Later, we want to find all of the rows we just inserted.  We can't very
well do this using Y, since:

SELECT * FROM X WHERE Y = CURRENT_TIMESTAMP;

When CURRENT_TIMESTAMP is something like, say, 44, will not find any of
the same rows.


Now expand on this to look at an index on a volatile function.  When
the row inserts/updates take place, the function has a specific value
which it returns.  Now we can index this return value, but when we go
to search the results, we evaluate the function and get a *DIFFERENT
VALUE* -- thus, our search of the index reveals a *different* set of
rows than the one we were hoping to find, since the value it is looking
for (the result of the function call) is not the same as it was when we
built the index.

Why would you *want* to do this?  It is roughly equivalent to building
an index on random values, to return a random set of rows -- and this
is just as. if not more efficiently done without the index.


Again, looking at the ODD function, consider the above where column Z
becomes odd():

Y       Z
- ------------
1        1
4        0
6        1
9        0
14      1

Now run SELECT Y, Z FROM X WHERE Z = ODD() to get:

Y       Z
- ------------
4        0
9        0


Run the same query again to get:

Y       Z
- ------------
1        1
6        1
14      1



You see that the results will be inconsistent, since ODD() is volatile
and its result can change at any time (at least as far as PostgreSQL is
aware of it).


On Jan 17, 2005, at 10:30 AM, Bo Lorentsen wrote:

> Tom Lane wrote:
>
>> No, you'd still end up with a seqscan, because this WHERE clause
>> offers
>> no chance of matching an index, and we don't do anything special with
>> stable functions beyond trying to match them to index conditions.
>>
> So, the executer uses the (first) value to find the index to use for
> ALL rows, and if this value change on each row, this can't be
> optimized and a seq scan is initiated.
>
> Is this not a problem for joins ?
>
>> But consider something like
>>
>>     SELECT * FROM mytable WHERE keycol = int(random() * 1000);
>>
>> where keycol is indexed and contains integers 0..1000; let's say each
>> such value appears ten times.  With a seqscan implementation (which I
>> consider is what SQL defines the semantics to be) random() would be
>> recomputed at each row and there would be about a 1/1000 chance of
>> selecting each row.
>>
> This would demand a new index lookup for each row, right ?
>
>> You might get more or less than exactly ten result
>> rows, and they'd almost certainly contain different values of keycol.
>>
> This much i do understand :-)
>
>> Now if random() were marked stable (and of course both multiply and
>> int() are immutable), then the planner would consider an indexscan on
>> keycol to be a valid optimization.  But that would produce
>> distinguishably different results, because random() would be evaluated
>> only once: you would always get exactly ten rows and they'd always all
>> have the same keycol value.
>>
> I know why random (and currval) is not stabel, but I just don't
> understand why a variable righthand result in seq scan, and not an
> index scan, even when the data types match an index type.
>
> To me it sounds like an index lookup is a one time a query (not per
> row) thing, but I don't understand why. This can be because, this is
> the way it turned up, but there is more possibly an aspect of SQL that
> I don't know too much about.
>
>> An index can basically implement conditions like "WHERE indexedcol =
>> constant" --- it takes the constant value and searches the index for
>> matches.  (Btrees can also do things like WHERE indexedcol <=
>> constant,
>> but let's just think about equality to keep things simple.)
>>
> :-)
>
>> We can deal
>> with a nonconstant righthand side, so long as it's okay to evaluate
>> the
>> value just once before the index starts to do its thing.  That
>> assumption is what STABLE is all about.
>>
> So righthand value can't evaluate per row, and the value type of the
> righthand expression can't be used as a index match.
>
> I just hoped for the executer to work like this :
>
> find indexedcol indexs
>
> evaluate the righthand expression, and find its type (not value)
>
> match the righthand value type and match it on index types (is both
> sides integer)
>
> if index is found use this together with the per row righthand value
>
> or just use seq scan (I don't understand why, this works if indexes
> don't)
>
> This is what I thought PG was doing :-)
>
> Hope, I did not miss any important points.
>
> /BL
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>
>
- -----------------------------------------------------------
Frank D. Engel, Jr.  <fde101@fjrhome.net>

$ ln -s /usr/share/kjvbible /usr/manual
$ true | cat /usr/manual | grep "John 3:16"
John 3:16 For God so loved the world, that he gave his only begotten
Son, that whosoever believeth in him should not perish, but have
everlasting life.
$
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Darwin)

iD8DBQFB698N7aqtWrR9cZoRAqxeAJsFGjPqvRlk5tBkW50uxzbarrJfbwCeOIj4
aHnMMXCeFXt61Ziji25h1+E=
=PCKJ
-----END PGP SIGNATURE-----



___________________________________________________________
$0 Web Hosting with up to 120MB web space, 1000 MB Transfer
10 Personalized POP and Web E-mail Accounts, and much more.
Signup at www.doteasy.com


Re: Index optimization ?

From
Bo Lorentsen
Date:
Greg Stark wrote:

>If Postgres used an index it would call odd(), which would return 1 because
>it's the first time, and then Postgres would go look up the rows where col is
>1 and return all of them. That's a very different behaviour from if the index
>isn't used. If all the records have col=1 then you're getting all of the
>records instead of half of them. If col=0 then you're getting none of them
>instead of half of them.
>
>
This is the differance between "volatile" and "stable" functions, but
not the answer to why an index lookuo are per query and not per row, is
it not ?

/BL

Re: Index optimization ?

From
Tom Lane
Date:
Bo Lorentsen <bl@netgroup.dk> writes:
> To me it sounds like an index lookup is a one time a query (not per row)
> thing, but I don't understand why.

I can't explain it any more clearly than Florian did:
http://archives.postgresql.org/pgsql-general/2005-01/msg00769.php

            regards, tom lane

Re: Index optimization ?

From
Greg Stark
Date:
Bo Lorentsen <bl@netgroup.dk> writes:

> I understand that, I just can't see why an index lookup can't be used on "per
> row" basis.

Well, how would that work?

--
greg

Re: Index optimization ?

From
"Florian G. Pflug"
Date:
Bo Lorentsen wrote:
> Greg Stark wrote:
>> If Postgres used an index it would call odd(), which would return 1
>> because
>> it's the first time, and then Postgres would go look up the rows where
>> col is
>> 1 and return all of them. That's a very different behaviour from if
>> the index
>> isn't used. If all the records have col=1 then you're getting all of the
>> records instead of half of them. If col=0 then you're getting none of
>> them
>> instead of half of them.
> This is the differance between "volatile" and "stable" functions, but
> not the answer to why an index lookuo are per query and not per row, is
> it not ?

Because the _whole_ _point_ of an index is to find matching rows
_without_ scanning the whole table. IF you have to look at every row
anyway, then just might as well to an sequential scan.

The performance boost an index gives you comes from that very fact that
you can avoid looking most of the rows. You only have to traverse the
index tree from the root to the node(s) you are looking for. If you have
an unique index, this means you have to traverse (roughly) log2(n)
nodes, if your table has n rows. (If your table has 2^32, or about 4
billion rows, you just need to do 32(!) comparisons when walking the
index tree, but whould need to do at worst 2^32 if you sequentially
scanned the table.

_But_ the "downside" is that you skipped rows. You just didn't look at
them _ever_. You don't even _know_ how many rows you skipped (Thats why
count(*) on a hughe table is slow). So you _can't_ use an index for a
where-condition that has to inspect every row. But, since the SQL-spec
requires you to call volatile functions _for_every_row_, you _can't_ use
an index in your case.

You can, howevery, accelerate something like "where f in (1,2,3,4)". You
just scan the index 4 times, each time for a different value. Of course,
if the number of values becomes larger and larger, there is a point
where it's more efficient to do a sequential scan _once_, instead of a
few tousand index scans (depends on the number of rows in the table).
The postgres optimizer tries to estimate this, and will switch to an
seq-scan, if it would have to do too many index lookups.

Your example (the one with currval()) would translate to a IN-clause
with about as many entries in the IN-list is you have rows in the table
(Because the function has to be called _for_ _every_ _row_). Clearly,
it's not feasable to use an index scan here - it would be slower than a
seq scan.

greetings, florian pflug

Attachment

Re: Index optimization ?

From
"Jim C. Nasby"
Date:
On Wed, Jan 19, 2005 at 02:15:42AM +0100, Florian G. Pflug wrote:
> You can, howevery, accelerate something like "where f in (1,2,3,4)". You
> just scan the index 4 times, each time for a different value. Of course,
> if the number of values becomes larger and larger, there is a point
> where it's more efficient to do a sequential scan _once_, instead of a
> few tousand index scans (depends on the number of rows in the table).
> The postgres optimizer tries to estimate this, and will switch to an
> seq-scan, if it would have to do too many index lookups.

Are PostgreSQL Btree indexes setup as a linked-list so you can scan
forwards and backwards in them? If so, is the IN processor smart enough
to collapse ranges of values into a single index scan (ie,
IN(1,2,3,4,8,9,10) would best be done as an index scan starting at 1 and
stoping at >4 and a second scan starting at 8 and stopping at >10).
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Index optimization ?

From
Alvaro Herrera
Date:
On Tue, Jan 18, 2005 at 07:33:51PM -0600, Jim C. Nasby wrote:
> On Wed, Jan 19, 2005 at 02:15:42AM +0100, Florian G. Pflug wrote:
> > You can, howevery, accelerate something like "where f in (1,2,3,4)". You
> > just scan the index 4 times, each time for a different value. Of course,
> > if the number of values becomes larger and larger, there is a point
> > where it's more efficient to do a sequential scan _once_, instead of a
> > few tousand index scans (depends on the number of rows in the table).
> > The postgres optimizer tries to estimate this, and will switch to an
> > seq-scan, if it would have to do too many index lookups.
>
> Are PostgreSQL Btree indexes setup as a linked-list so you can scan
> forwards and backwards in them?

Yes, they are.

> If so, is the IN processor smart enough to collapse ranges of values
> into a single index scan

No, it isn't AFAIK.

> (ie, IN(1,2,3,4,8,9,10) would best be done as an index scan starting
> at 1 and stoping at >4 and a second scan starting at 8 and stopping at
> >10).

That assumes the optimizer knows that the domain can contain integer
values ... seems a complex and infructuous analysis to do in general.

Maybe the optimizer could collapse that to a single index scan from 1 to
10 and then apply a filter to extract only the values actually mentioned.
Not sure how workable that is.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
La web junta la gente porque no importa que clase de mutante sexual seas,
tienes millones de posibles parejas. Pon "buscar gente que tengan sexo con
ciervos incendiándose", y el computador dirá "especifique el tipo de ciervo"
(Jason Alexander)

Re: Index optimization ?

From
"Jim C. Nasby"
Date:
On Tue, Jan 18, 2005 at 11:03:22PM -0300, Alvaro Herrera wrote:
> On Tue, Jan 18, 2005 at 07:33:51PM -0600, Jim C. Nasby wrote:
> > On Wed, Jan 19, 2005 at 02:15:42AM +0100, Florian G. Pflug wrote:
> > > You can, howevery, accelerate something like "where f in (1,2,3,4)". You
> > > just scan the index 4 times, each time for a different value. Of course,
> > > if the number of values becomes larger and larger, there is a point
> > > where it's more efficient to do a sequential scan _once_, instead of a
> > > few tousand index scans (depends on the number of rows in the table).
> > > The postgres optimizer tries to estimate this, and will switch to an
> > > seq-scan, if it would have to do too many index lookups.
> >
> > Are PostgreSQL Btree indexes setup as a linked-list so you can scan
> > forwards and backwards in them?
>
> Yes, they are.
>
> > If so, is the IN processor smart enough to collapse ranges of values
> > into a single index scan
>
> No, it isn't AFAIK.
>
> > (ie, IN(1,2,3,4,8,9,10) would best be done as an index scan starting
> > at 1 and stoping at >4 and a second scan starting at 8 and stopping at
> > >10).
>
> That assumes the optimizer knows that the domain can contain integer
> values ... seems a complex and infructuous analysis to do in general.
>
> Maybe the optimizer could collapse that to a single index scan from 1 to
> 10 and then apply a filter to extract only the values actually mentioned.
> Not sure how workable that is.

It seems like it should just be a SMOC (simple matter of code), though I
suspect better index statistics might be needed first. Given good enough
statistics, it should be easy to ascertain how many index tuples you can
expect between two values (even if it's not an int column, in this
case). The optimizer should also be able to know what the cost is to
scan down the index structure, so it should then be easy to figure out
which plan is better.

Speaking of index statistics, has anyone looked at ways to determine how
many tuples exist between two different values? For example, Sybase used
to keep statistics in 256 buckets. It would define the range of values
that each bucket covered, and how many tuples in the table fell into
that bucket. It's a very elegant way to deal with tables that have
skewed numbers, as well as clusters of different numbers.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Index optimization ?

From
Bo Lorentsen
Date:
Florian G. Pflug wrote:

> Because the _whole_ _point_ of an index is to find matching rows
> _without_ scanning the whole table. IF you have to look at every row
> anyway, then just might as well to an sequential scan.

I am sorry it took me this long to understand this, but I think I got it
now thanks to all Your (all in this thread) efforts to make me :-)

What confuses me was the difference between what controls a query and
what filters a row in a selected data set, I just could not see the
difference, but that have changed :-) I think the problem was I just
could not see the general principles from the special case where I
thought I had the use for this index lookup.

Thanks to anyone for there effort to make me understand, hope I will be
a faster learner next time :-)

/BL


Re: Index optimization ?

From
Bo Lorentsen
Date:
Greg Stark wrote:

>>I understand that, I just can't see why an index lookup can't be used on "per
>>row" basis.
>>
>>
>
>Well, how would that work?
>
>
>
Well, good point, the "per row" is a set of data selected as a product
of the "static" part os the query (non volatile parts), the only thing
you can do with this dataset is seq scan this, and use the volatile
functions on each row. I just could not see this ... until now :-)

If your query is only volatile (like a currval)  then PG will select all
the table as a dataset and use the volatile expression on each, and that
I did not understand, as it semmed more logical to use the index, but I
learned :-)

/BL