Thread: SQL Query Optimization

SQL Query Optimization

From: Dav Coleman
Date:

Hello,

I am using PostgreSQL to house chemical informatics data which consists of
several interlinked tables with tens of thousands (maximum) of rows.  When
doing search queries against these tables (which always require multiple
joins) I have noticed that semantically equivalent SQL queries can differ
vastly in performance depending on the order of clauses ANDed together
("WHERE cond1 AND cond2" takes forever, but "WHERE cond2 AND cond1" comes
right back).

So it appears I need to do some pre-optimization of the SQL query generated
by the user before submitting it to postgresql in order to guarantee (or
at least increase the likelihood of) the fastest results. I've tried STFW and
RTFM but haven't found any good pointers on where to start with this, 
although I feel that there must be some published algorithms or theories. 
Can anyone point me to a URL or other source to get me on my way?

Also, I wonder if this sort of query optimization is built into other 
databases such as Oracle?

I did find this URL: http://redbook.cs.berkeley.edu/lec7.html
which seems to be interesting, but honestly I'm far from a DB expert so I
can't follow most of it, and I can't tell if it is talking about optimization 
that can be done in application space (query rewrite) or something that has
to be done in the database engine itself. I'm going to try to find the book
it references though.

Basically I feel a bit in over my head, which is ok but I don't want to
waste time paddling in the wrong direction, so I'm hoping someone can
recognize where I need to look and nudge me in that direction. Maybe I 
just need proper terminology to plug into google.

Thanks,
Dav



-- 
Dav Coleman
http://www.danger-island.com/dav/


Re: SQL Query Optimization

From: "Josh Berkus"
Date:

Dav,

> I am using postgresql to house chemical informatics data which
> consists of
> several interlinked tables with tens of thousands (maximum) of rows.
>  When
> doing search queries against these tables (which always requires
> multiple
> joins) I have noticed that the semantically equivalent SQL queries
> can differ
> vastly in speed performance depending on the order of clauses ANDed
> together ( "WHERE cond1 AND cond2" takes forever, but  "WHERE cond2
> AND cond1" comes right back).

In most cases, the above kind of optimization difference is due to how
you indexed the table.  If, for example, you have an index on (field2,
field1), and you do a "WHERE field1 = y and field2 = x" then the query
parser probably won't use the index because the field order is
different.

Fortunately, in Postgres 7.2, you now get index usage statistics.
Hopefully another user will follow up this e-mail by explaining how to
access them.

The idea is that, if you find that certain views and queries are very
slow, then check what tables they all have in common.  Then check the
indexes and statistics for each table.  If you see a large table with
only 3 indexes, none of which are getting much use, then they are
probably the wrong indexes or you need to change the structure of your
WHERE clause.  Also, EXPLAIN can be a big help here.

See http://techdocs.postgresql.org for more stuff about optimization.

I can understand that you'd like a tool to make all this easier for
you, but I haven't seen any such thing, unless it ships with the
Enterprise version of Oracle.

-Josh Berkus


Re: SQL Query Optimization

From: Tom Lane
Date:

"Josh Berkus" <josh@agliodbs.com> writes:
>> ( "WHERE cond1 AND cond2" takes forever, but  "WHERE cond2
>> AND cond1" comes right back).

> In most cases, the above kind of optimization difference is due to how
> you indexed the table.  If, for example, you have an index on (field2,
> field1), and you do a "WHERE field1 = y and field2 = x" then the query
> parser probably won't use the index because the field order is
> different.

Not at all.  Postgres understands very well that it's allowed to
rearrange AND'ed clauses.  Using current sources (so that you can
see the index condition in EXPLAIN):

regression=# create table foo (f1 int, f2 int, unique(f1,f2));
NOTICE:  CREATE TABLE / UNIQUE will create implicit index 'foo_f1_key' for table 'foo'
CREATE
regression=# explain select * from foo where f1 = 1 and f2 = 42;
                                  QUERY PLAN
----------------------------------------------------------------------
 Index Scan using foo_f1_key on foo  (cost=0.00..4.83 rows=1 width=8)
   Index Cond: ((f1 = 1) AND (f2 = 42))
(2 rows)

regression=# explain select * from foo where f2 = 42 and f1 = 1;
                                  QUERY PLAN
----------------------------------------------------------------------
 Index Scan using foo_f1_key on foo  (cost=0.00..4.83 rows=1 width=8)
   Index Cond: ((f1 = 1) AND (f2 = 42))
(2 rows)


I was curious about the details of Dav's query because it wasn't obvious
why he'd be getting a different result.  Perhaps the two query plans are
mistakenly estimated to have exactly the same cost?  (Although WHERE
clause order doesn't affect the set of plans considered, it can affect
the order in which they're considered, which might result in a different
choice between two plans that are estimated to have identical costs.)
Another possibility: perhaps neither condition is indexable, but cond1
is vastly more expensive to compute than cond2?  (Maybe it's a
sub-SELECT.)  Right now I don't believe there's any code in there that
will rearrange AND-clause order strictly on the basis of
cost-to-compute-the-clauses-themselves.
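
Given that, one application-side workaround is to reorder the ANDed
conditions yourself before submitting the query, pushing anything that
looks expensive (a sub-SELECT, say) to the end.  A naive Python sketch —
it only understands top-level ANDs, ignores string literals and OR
precedence, and "contains SELECT" is just a crude stand-in for a real
cost heuristic:

```python
def split_top_level_and(where):
    """Split a WHERE expression on ANDs at parenthesis depth zero.
    Naive sketch: no handling of string literals or OR precedence."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(where):
        c = where[i]
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif depth == 0 and where[i:i + 5].upper() == ' AND ':
            parts.append(where[start:i].strip())
            start = i + 5
            i += 4
        i += 1
    parts.append(where[start:].strip())
    return parts

def subselects_last(where):
    """Reorder so clauses containing a sub-SELECT come after cheap ones."""
    conds = split_top_level_and(where)
    conds.sort(key=lambda c: 'SELECT' in c.upper())  # stable: cheap first
    return ' AND '.join(conds)

# subselects_last("f1 IN (SELECT id FROM big) AND f2 = 42")
#   -> "f2 = 42 AND f1 IN (SELECT id FROM big)"
```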
        regards, tom lane


Re: SQL Query Optimization

From: Dav Coleman
Date:

I should be more clear: the problem is that the application user can 
basically construct the SQL query dynamically, so I have no control over
how the original SQL query will be formed or what it will consist of.
It can be any possible query in practice. Because of this, it is not just
a matter of analyzing any specific queries, and I don't want to start
creating every possible index (although I might, if I have to).

But I can see where I was heading in the wrong direction already. I was
thinking that what I needed was to find theories/algorithms on how to 
rewrite the SQL before submitting it to postgresql, and I maybe still
need to do that, but I guess I also need to EXPLAIN and analyze the 
bad vs good forms of the queries so I'll know what makes a 'good' vs 
'bad' query (so I'll get a sense on how to rewrite queries).  Perhaps
with that understanding, an algorithm for rewriting the queries will
be apparent.

I just figured I couldn't be the first person to run into this problem,
but I can't find it mentioned anywhere.

-Dav

btw, I'm running postgresql-7.1.2 (compiled from source) on rh7.0

Josh Berkus [josh@agliodbs.com] wrote:
> Dav,
> 
> > I am using postgresql to house chemical informatics data which
> > consists of
> > several interlinked tables with tens of thousands (maximum) of rows.
> >  When
> > doing search queries against these tables (which always requires
> > multiple
> > joins) I have noticed that the semantically equivalent SQL queries
> > can differ
> > vastly in speed performance depending on the order of clauses ANDed
> > together ( "WHERE cond1 AND cond2" takes forever, but  "WHERE cond2
> > AND cond1" comes right back).
> 
> In most cases, the above kind of optimization difference is due to how
> you indexed the table.  If, for example, you have an index on (field2,
> field1), and you do a "WHERE field1 = y and field2 = x" then the query
> parser probably won't use the index because the field order is
> different.
> 
> Fortunately, in Postgres 7.2, you now get index usage statistics.
>  Hopefully another user will follow-up this e-mail by explaining how to
> access them.
> 
> The idea is that, if you find that certain views and queries are very
> slow, then check what tables they all have in common.  Then check the
> indexes and statistics for each table.  If you see a large table with
> only 3 indexes, none of which are getting much use, then they are
> probably the wrong indexes or you need to change the structure of your
> WHERE clause.  Also, EXPLAIN can be a big help here.
> 
> See http://techdocs.postgresql.org for more stuff about optimization.
> 
> I can understand that you'd like a tool to make all this easier for
> you, but I haven't seen any such thing, unless it ships with the
> Enterprise version of Oracle.
> 
> -Josh Berkus

-- 
Dav Coleman
http://www.danger-island.com/dav/


Re: SQL Query Optimization

From: "Josh Berkus"
Date:

Dav,

> I should be more clear, the problem is that the application user can 
> basically construct the SQL query dynamically, so I have no control
> on
> how the original SQL query will be formed or what it will consist of.
> It can be any possible query in practice. Because of this, it is not
> just
> a matter of analyzing any specific queries, and I don't want to start
> creating every possible index (although I might, if I have to).

See Tom's response.  He's the expert.

However, if the user is allowed to write any query they wish, it does
sound like you'll need to construct every reasonable index.  This will
make UPDATES on your tables very expensive, but you have no way of
anticipating what the user will ask.
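
If it does come to that, the DDL is at least easy to generate
mechanically.  A sketch — the table and column names here are made up
for illustration, not taken from Dav's schema, and it emits only
single-column indexes to keep the UPDATE penalty bounded:

```python
def index_ddl(table, columns):
    """Emit one CREATE INDEX statement per searchable column.
    Naive sketch: single-column btree indexes only; no attempt at
    multi-column combinations."""
    return [
        f"CREATE INDEX {table}_{col}_idx ON {table} ({col});"
        for col in columns
    ]

# Hypothetical chemistry table, just for illustration:
ddl = index_ddl("compound", ["formula", "mol_weight", "cas_no"])
```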

You'll also need to take a really hard look at the relational structure
of your database.  Seemingly trivial lack of Normal Form in your table
structures can become very costly as far as subqueries are concerned.
Also, DISTINCT can be very costly on large tables.

> I just figured I couldn't be the first person to run into this
> problem,
> but I can't find it mentioned anywhere.

Good luck.  I can't even think of any books I've reviewed that would
address this issue.  Part of the problem, I think, is that optimization
is so circumstantial.

> btw, I'm running postgresql-7.1.2 (compiled from source) on rh7.0

I very much suggest that you upgrade to 7.2.1.  Tom and company have
made substantial improvements to the query parser and the tracking of
index statistics in 7.2.

-Josh


Re: SQL Query Optimization

From: Richard Huxton
Date:

On Thursday 18 April 2002 17:35, Dav Coleman wrote:
> I should be more clear, the problem is that the application user can
> basically construct the SQL query dynamically

> But I can see where I was heading in the wrong direction already. I was
> thinking that what I needed was to find theories/algorithms on how to
> rewrite the SQL before submitting it to postgresql, and I maybe still
> need to do that

Sort clauses alphabetically (or whatever makes sense to you) so you always get 
SELECT * FROM a,b WHERE c AND d rather than "b,a" or "d AND c". That way at 
least you're not getting variations.
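
Since the application is assembling the SQL itself, the easiest place to
canonicalize is before the string exists, while the table list and the
conditions are still lists.  A minimal sketch, assuming every condition
is a top-level AND:

```python
def build_query(tables, conds):
    """Canonicalize dynamically built queries: sort the FROM list and
    the ANDed conditions so equivalent requests yield identical SQL."""
    from_clause = ", ".join(sorted(tables))
    where_clause = " AND ".join(sorted(conds))
    return f"SELECT * FROM {from_clause} WHERE {where_clause}"

# build_query(["b", "a"], ["d", "c"])
#   -> "SELECT * FROM a, b WHERE c AND d"
```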

> but I guess I also need to EXPLAIN and analyze the

Record the queries and times either in PG's log or in the application.

> bad vs good forms of the queries so I'll know what makes a 'good' vs
> 'bad' query (so I'll get a sense on how to rewrite queries).  Perhaps
> with that understanding, an algorithm for rewriting the queries will
> be apparent.
>
> I just figured I couldn't be the first person to run into this problem,
> but I can't find it mentioned anywhere.

After the basics (index on fields involved in joins etc) it all gets a bit 
specific to the size of the tables/indexes involved and the quirks of the 
parser.

If you logged the query-plan and cost estimates for each query processed it 
shouldn't be too difficult to automatically add indexes where required and 
see if it makes any difference. That assumes you have good clean patterns of 
usage in your queries. We're getting a bit AI there mind.
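
The cost estimates are easy enough to scrape out of the logged plan
text.  A small sketch that pulls the planner's estimate from an EXPLAIN
line of the form shown earlier in the thread
("(cost=0.00..4.83 rows=1 width=8)") so slow queries can be ranked:

```python
import re

_COST = re.compile(r"cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+)")

def plan_cost(plan_line):
    """Extract (startup_cost, total_cost, rows) from an EXPLAIN output
    line; returns None for lines that carry no cost estimate."""
    m = _COST.search(plan_line)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2)), int(m.group(3))

# plan_cost("Index Scan using foo_f1_key on foo  (cost=0.00..4.83 rows=1 width=8)")
#   -> (0.0, 4.83, 1)
```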

- Richard Huxton