Thread: Subselect performance question

Subselect performance question

From
"Josh Berkus"
Date:
Folks,

I have a view which can be designed with eithher a sub-select in the
SELECT clause or the FROM clause.    The main query data set is small
(1000 records) but the table being filtered in the sub-select is large
(110,000 records).  

Under those circumstances, is there a hard-and-fast rule for which
query I should use?  Most of the time, users will only look at one
record at a time from the main data set.

SELECT client_id, client_name, (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
AND case_status = 'ACT') as active_count,(SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
AND case_status = 'STL') as settled_count,
FROM clients;

OR:


SELECT client_id, client_name,  active_count, settled_count,
FROM clients    LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM
case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON
ac.client_id = clients.client_id  LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM
case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON
sc.client_id = clients.client_id;

-Josh Berkus



Re: Subselect performance question

From
Manfred Koizar
Date:
On Thu, 02 May 2002 09:37:56 -0700, Josh Berkus wrote:

>Folks,
>
>I have a view which can be designed with eithher a sub-select in the
>SELECT clause or the FROM clause.    The main query data set is small
>(1000 records) but the table being filtered in the sub-select is large
>(110,000 records).

Josh,

I don't know, if it matters, but your queries are not equivalent.  If
for a client_id there is no row with case_status = 'ACT', active_count
is ...

>
>Under those circumstances, is there a hard-and-fast rule for which
>query I should use?  Most of the time, users will only look at one
>record at a time from the main data set.
>
>SELECT client_id, client_name,
>    (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
>AND case_status = 'ACT') as active_count,
>    (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
>AND case_status = 'STL') as settled_count,
>FROM clients;

... 0 (zero), because the sub-select counts 0 rows

>
>OR:
>
>
>SELECT client_id, client_name,  active_count, settled_count,
>FROM clients
>    LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM
>case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON
>ac.client_id = clients.client_id
>   LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM
>case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON
>sc.client_id = clients.client_id;
>

NULL, because you do a LEFT JOIN to an empty sub-select.

SELECT c.client_id,      c.client_name,      count(ac.client_id) active_count,      count(sc.client_id) settled_count
FROM clients c   LEFT JOIN case_clients ac ON ac.client_id=c.client_id AND
ac.case_status= 'ACT'   LEFT JOIN case_clients sc ON sc.client_id=c.client_id AND
sc.case_status= 'STL' 
GROUP BY c.client_id, c.client_name;

should match your first query and it doesn't use sub-selects at all.
I haven't tested it though, so watch out for typos ...
My personal hard-and-fast rule is "avoid sub-selects, use joins",  but
YMMV.

>-Josh Berkus

ServusManfred


Re: Subselect performance question

From
Tom Lane
Date:
"Josh Berkus" <josh@agliodbs.com> writes:
> Under those circumstances, is there a hard-and-fast rule for which
> query I should use?  Most of the time, users will only look at one
> record at a time from the main data set.

> SELECT client_id, client_name, 
>     (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
> AND case_status = 'ACT') as active_count,
>     (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id
> AND case_status = 'STL') as settled_count,
> FROM clients;

> OR:

> SELECT client_id, client_name,  active_count, settled_count,
> FROM clients 
>     LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM
> case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON
> ac.client_id = clients.client_id
>    LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM
> case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON
> sc.client_id = clients.client_id;

The second one looks like a sure loser to me, because of the GROUP BYs.

If you were expecting queries to retrieve many different client_ids,
it *might* be better to use the second form.  But I think for a small
number of client_ids the first will be quicker.
        regards, tom lane


Re: Subselect performance question

From
"Josh Berkus"
Date:
Manfred,

> I don't know, if it matters, but your queries are not equivalent.  If
> for a client_id there is no row with case_status = 'ACT',
> active_count
> is ...

Thanks.  Would have noticed that, and done a COALESCE.

> should match your first query and it doesn't use sub-selects at all.
> I haven't tested it though, so watch out for typos ...
> My personal hard-and-fast rule is "avoid sub-selects, use joins",
>  but
> YMMV.

Yes, usually.  There's other reasons in how I'm going to use the view,
though, that make me want to use a subselect.

Actually, somebody reminded me that if I have a fixed list of
categories, the fastest appoach performance-wise is to use a crosstab
identity martix.  What I did is to create and populate a table as
follows:

case_status_crosstab
status_group     act_count stl_count dis_count oth_count
---------------- --------- --------- --------- ---------
Closed\ Cases            0         0         0         0
Active\ Cases            1         0         0         0
Settled\ Cases           0         1         0         0
Dismissed\ Cases         0         0         1         0
Other\ Cases             0         0         0         1
Special\ Cases           0         0         0         1

Then I can count them as follows:

CREATE VIEW bv_clients AS
SELECT client_id, client_name, short_name, ljo.description as
office_location,contact_name, phone_no, fax_no, status_label,address_1, address_2, address_3, address_4, address_5,
address_6,city,state, zip_code, comments,active_count, settled_count, dismissed_count, other_count
 
FROM clients JOIN status ON (clients.status = status.status AND
status.relation = 'clients')
LEFT OUTER JOIN text_list_values ljo ON (clients.jw_office =
ljo.list_value and ljo.list_name ILIKE 'office location')
LEFT OUTER JOIN (SELECT client_id, sum(act_count) as active_count,
sum(stl_count) as settled_count,sum(dis_count) as dismissed_count, sum(oth_count) as other_count
FROM case_clients cc JOIN text_list_values tvl ON (cc.case_status =
tvl.list_value and list_name ilike 'case status')JOIN case_status_crosstab cst ON tvl.rollup1 = cst.status_group
GROUP BY client_id) as c_count USING (client_id);

This actually runs pretty fast, once everything is indexed.

-Josh Berkus