Thread: intelligence in writing a query ...

intelligence in writing a query ...

From
The Hermit Hacker
Date:
Tom, with all the work you've been doing inside planner and optimizer, has
there been anything done for 7.1.2 to make how a query is written cause
the backend to be more intelligent?

I'm playing with a query that I just don't like, since its taking ~3min to
run ...

It started as:

EXPLAIN SELECT distinct s.gid, s.created, count(i.title) AS images             FROM status s LEFT JOIN images i ON
(s.gid= i.gid AND i.active), personal_data pd, relationship_wanted rw            WHERE s.active AND s.status != 0
      AND (s.gid = pd.gid AND pd.gender = 0)              AND (s.gid = rw.gid AND rw.gender = 1 )              AND ( (
age('now',pd.dob) > '26 years' ) AND ( age('now', pd.dob) < '46 years' ) )              AND country IN ( 'US' )
GROUPBY s.gid,s.created         ORDER BY  images desc;
 
NOTICE:  QUERY PLAN:

Unique  (cost=2365.87..2365.88 rows=1 width=37) ->  Sort  (cost=2365.87..2365.87 rows=1 width=37)       ->  Aggregate
(cost=2365.86..2365.86rows=1 width=37)             ->  Group  (cost=2365.86..2365.86 rows=1 width=37)
-> Sort  (cost=2365.86..2365.86 rows=1 width=37)                         ->  Nested Loop  (cost=167.62..2365.85 rows=1
width=37)                              ->  Nested Loop  (cost=0.00..600.30 rows=1 width=8)
      ->  Index Scan using personal_data_gender on personal_data pd  (cost=0.00..590.79 rows=4 width=4)
                   ->  Index Scan using relationship_wanted_gid on relationship_wanted rw  (cost=0.00..2.12 rows=1
width=4)                              ->  Materialize  (cost=1508.62..1508.62 rows=17128 width=29)
              ->  Hash Join  (cost=167.62..1508.62 rows=17128 width=29)                                           ->
SeqScan on status s  (cost=0.00..566.24 rows=17128 width=12)                                           ->  Hash
(cost=149.70..149.70rows=7170 width=17)                                                 ->  Seq Scan on images i
(cost=0.00..149.70rows=7170 width=17)
 

EXPLAIN

And, after playing a bit, I've got it to:

2EXPLAIN SELECT distinct s.gid, s.created, count(i.title) AS images             FROM status s LEFT JOIN images i ON
(s.gid= i.gid AND i.active), relationship_wanted rw            WHERE s.active AND s.status != 0              AND EXISTS
(SELECT gid                               FROM relationship_wanted                              WHERE gender = 1 )
       AND EXISTS ( SELECT gid                               FROM personal_data                              WHERE
gender= 0                                AND ( ( age('now', dob) > '26 years' ) AND ( age('now', dob) < '46 years' ) )
                             AND country IN ( 'US' ) )         GROUP BY s.gid,s.created         ORDER BY  images desc;
 
NOTICE:  QUERY PLAN:

Unique  (cost=313742358.09..314445331.35 rows=9372977 width=29) InitPlan   ->  Seq Scan on relationship_wanted
(cost=0.00..1006.03rows=1446 width=4)   ->  Index Scan using personal_data_gender on personal_data  (cost=0.00..590.79
rows=4width=4) ->  Sort  (cost=313742358.09..313742358.09 rows=93729769 width=29)       ->  Aggregate
(cost=285211774.88..292241507.54rows=93729769 width=29)             ->  Group  (cost=285211774.88..289898263.32
rows=937297688width=29)                   ->  Sort  (cost=285211774.88..285211774.88 rows=937297688 width=29)
             ->  Result  (cost=167.62..24262791.77 rows=937297688 width=29)                               ->  Nested
Loop (cost=167.62..24262791.77 rows=937297688 width=29)                                     ->  Hash Join
(cost=167.62..1508.62rows=17128 width=29)                                           ->  Seq Scan on status s
(cost=0.00..566.24rows=17128 width=12)                                           ->  Hash  (cost=149.70..149.70
rows=7170width=17)                                                 ->  Seq Scan on images i  (cost=0.00..149.70
rows=7170width=17)                                     ->  Seq Scan on relationship_wanted rw  (cost=0.00..869.22
rows=54722width=0)
 

EXPLAIN

Not much of an improvement ...

The 'personal_data' EXISTS clause:

SELECT gid FROM personal_dataWHERE gender = 0  AND ( ( age('now', dob) > '26 years' ) AND ( age('now', dob) < '46
years') )  AND country IN ( 'US' ) ;
 

NOTICE:  QUERY PLAN:

Index Scan using personal_data_gender on personal_data  (cost=0.00..590.79 rows=4 width=4)

EXPLAIN

returns 1893 rows, while status contains 26260 rows ... status and
personal_data have a 1-to-1 relationship, so out of 26260 rows in status,
*max* I'm ever going to deal with are the 1893 that are found in
personal_data ...

so, what I'd like to do is have the subselect on personal_data used first,
so as to reduce the set of data that the rest of the query will work only
on those 1893 gid's, instead of all 26260 of them ...

Make sense?


Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org



Re: intelligence in writing a query ...

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> 2EXPLAIN SELECT distinct s.gid, s.created, count(i.title) AS images
>               FROM status s LEFT JOIN images i ON (s.gid = i.gid AND i.active), relationship_wanted rw
>              WHERE s.active AND s.status != 0
>                AND EXISTS ( SELECT gid
>                                 FROM relationship_wanted
>                                WHERE gender = 1 )
>                AND EXISTS ( SELECT gid
>                                 FROM personal_data
>                                WHERE gender = 0
>                                  AND ( ( age('now', dob) > '26 years' ) AND ( age('now', dob) < '46 years' ) )
>                                  AND country IN ( 'US' ) )
>           GROUP BY s.gid,s.created
>           ORDER BY  images desc;

I don't understand what you're trying to do here.  The inner SELECTs
aren't dependent on anything in the outer query, so what are they for?

> ... status and
> personal_data have a 1-to-1 relationship,

Then why have two tables?  Merge them into one table and save yourself a
join.

Also, since status.gid is (I assume) unique, what's the use of the
DISTINCT clause at the top level?  Seems like that's costing you
a useless sort & unique pass ...
        regards, tom lane


Re: intelligence in writing a query ...

From
Don Baccus
Date:
At 05:35 PM 5/30/01 -0400, Tom Lane wrote:

>Also, since status.gid is (I assume) unique, what's the use of the
>DISTINCT clause at the top level?  Seems like that's costing you
>a useless sort & unique pass ...

>> 2EXPLAIN SELECT distinct s.gid, s.created, count(i.title) AS images
...
>>           GROUP BY s.gid,s.created

He's already paying for a sort due to the GROUP BY but of course that
makes the DISTINCT meaningless since s.gid and s.created are already
grouped...



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.