Re: More efficient OR - Mailing list pgsql-sql

From PFC
Subject Re: More efficient OR
Date
Msg-id opsmaz1bxath1vuj@musicbox
Whole thread Raw
In response to Re: More efficient OR  ("Keith Worthington" <keithw@narrowpathinc.com>)
List pgsql-sql
You sound like you don't like the performance you get with OR or IN, from  
this I deduce that you have a very large list of values to OR from. These  
make huge queries which are not necessarily very fast ; also they are  
un-preparable by their very nature (ie. the planner has to look at each  
value, ponder its stats, think about it...) Basically a query with, say,  
'column IN (100 values)' will make postgres work a lot more than a query  
with 'column in (SELECT something which yields 100 values)'.
I have tested the following with good results :

- Write a very simple set returning function which takes an array as a  
parameter and RETURN NEXT each array element in turn. It's just a FOR...  
RETURN NEXT. Say you call it array_flatten( INTEGER[] ) or something.

- Then, instead of doing SELECT * FROM table WHERE id IN (1,4,77,586,1025)
do:
SELECT * FROM table WHERE id IN (SELECT * FROM  
array_flatten( '{1,4,77,586,1025}' ) );
or :
SELECT t.* FROM table t, (SELECT * FROM array_flatten( '{1,4,77,586,1025}'  
) ) foo WHERE t.id=foo.id;

The first one will do a uniqu'ing on the array, the second one will not.

You can also LEFT JOIN against your SRF to get the id's of the rows that  
were not in the table (you cannot do this with IN)

And you can PREPARE the statement to something that will take an array as  
a parameter and won't have to be parsed everytime.

Sometimes it can be a big performance boost. Try it !

However, if some value in your array matches a lot of rows in the table,  
it will be slower than the seq scan which would have been triggered by the  
planner actually seeing that value in the IN list and acting on it. But if  
you KNOW your column is unique, there is no point in forcing the planner  
to ponder each value in your list !

For an additional performance boost (likely negligible), you could sort  
your array in the function (or even in your application code) to ease the  
work of the index scanner, which will get a better cache hit rate.

If you have, say, 20.000 values to get, this is the only way.

Note that you could ask yourself why you need to get a lot of values. Are  
you fetching stuff from the database, computing a list of rows to get,  
then SELECTing them ? Then maybe you put something in the application that  
should really be in the database ?

As a side note, it would be nice :

- if that set returning function was a fast C built-in (I think there's  
one in contrib/intagg but obviously it works only for integers) because  
it's a useful tool and building brick ; same for array_accum, and some  
other commonly used five-lines aggregates and functions that everybody  
recodes once.

- if postgresql used this kind of optimization for the SELECT * FROM table  
WHERE id =ANY( array ) which currently uses a seq scan.

However, what IS nice from pg is that you can actually do the array SRF  
trick and pull a lot of rows by id's, at blazing speed, just by writing a  
three line function and tweaking your wuery.



> It sounds like IN will save some typing and code space but not decrease  
> the
> execution time.
>
> BETWEEN won't work for my real life query because the limiting values are
> quite disparate.
>
> Kind Regards,
> Keith
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>




pgsql-sql by date:

Previous
From: KÖPFERL Robert
Date:
Subject: Re: Relation in tables
Next
From: PFC
Date:
Subject: Re: Relation in tables