Thread: Subselect performance question
Folks, I have a view which can be designed with eithher a sub-select in the SELECT clause or the FROM clause. The main query data set is small (1000 records) but the table being filtered in the sub-select is large (110,000 records). Under those circumstances, is there a hard-and-fast rule for which query I should use? Most of the time, users will only look at one record at a time from the main data set. SELECT client_id, client_name, (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id AND case_status = 'ACT') as active_count,(SELECT count(*) FROM case_clients WHERE client_id = clients.client_id AND case_status = 'STL') as settled_count, FROM clients; OR: SELECT client_id, client_name, active_count, settled_count, FROM clients LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON ac.client_id = clients.client_id LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON sc.client_id = clients.client_id; -Josh Berkus
On Thu, 02 May 2002 09:37:56 -0700, Josh Berkus wrote: >Folks, > >I have a view which can be designed with eithher a sub-select in the >SELECT clause or the FROM clause. The main query data set is small >(1000 records) but the table being filtered in the sub-select is large >(110,000 records). Josh, I don't know, if it matters, but your queries are not equivalent. If for a client_id there is no row with case_status = 'ACT', active_count is ... > >Under those circumstances, is there a hard-and-fast rule for which >query I should use? Most of the time, users will only look at one >record at a time from the main data set. > >SELECT client_id, client_name, > (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id >AND case_status = 'ACT') as active_count, > (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id >AND case_status = 'STL') as settled_count, >FROM clients; ... 0 (zero), because the sub-select counts 0 rows > >OR: > > >SELECT client_id, client_name, active_count, settled_count, >FROM clients > LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM >case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON >ac.client_id = clients.client_id > LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM >case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON >sc.client_id = clients.client_id; > NULL, because you do a LEFT JOIN to an empty sub-select. SELECT c.client_id, c.client_name, count(ac.client_id) active_count, count(sc.client_id) settled_count FROM clients c LEFT JOIN case_clients ac ON ac.client_id=c.client_id AND ac.case_status= 'ACT' LEFT JOIN case_clients sc ON sc.client_id=c.client_id AND sc.case_status= 'STL' GROUP BY c.client_id, c.client_name; should match your first query and it doesn't use sub-selects at all. I haven't tested it though, so watch out for typos ... My personal hard-and-fast rule is "avoid sub-selects, use joins", but YMMV. >-Josh Berkus ServusManfred
"Josh Berkus" <josh@agliodbs.com> writes: > Under those circumstances, is there a hard-and-fast rule for which > query I should use? Most of the time, users will only look at one > record at a time from the main data set. > SELECT client_id, client_name, > (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id > AND case_status = 'ACT') as active_count, > (SELECT count(*) FROM case_clients WHERE client_id = clients.client_id > AND case_status = 'STL') as settled_count, > FROM clients; > OR: > SELECT client_id, client_name, active_count, settled_count, > FROM clients > LEFT OUTER JOIN (SELECT client_id, count(*) as active_count FROM > case_clients WHERE case_status = 'ACT' GROUP BY client_id) ac ON > ac.client_id = clients.client_id > LEFT OUTER JOIN (SELECT client_id, count(*) as settled_count FROM > case_clients WHERE case_status = 'STL' GROUP BY client_id) sc ON > sc.client_id = clients.client_id; The second one looks like a sure loser to me, because of the GROUP BYs. If you were expecting queries to retrieve many different client_ids, it *might* be better to use the second form. But I think for a small number of client_ids the first will be quicker. regards, tom lane
Manfred, > I don't know, if it matters, but your queries are not equivalent. If > for a client_id there is no row with case_status = 'ACT', > active_count > is ... Thanks. Would have noticed that, and done a COALESCE. > should match your first query and it doesn't use sub-selects at all. > I haven't tested it though, so watch out for typos ... > My personal hard-and-fast rule is "avoid sub-selects, use joins", > but > YMMV. Yes, usually. There's other reasons in how I'm going to use the view, though, that make me want to use a subselect. Actually, somebody reminded me that if I have a fixed list of categories, the fastest appoach performance-wise is to use a crosstab identity martix. What I did is to create and populate a table as follows: case_status_crosstab status_group act_count stl_count dis_count oth_count ---------------- --------- --------- --------- --------- Closed\ Cases 0 0 0 0 Active\ Cases 1 0 0 0 Settled\ Cases 0 1 0 0 Dismissed\ Cases 0 0 1 0 Other\ Cases 0 0 0 1 Special\ Cases 0 0 0 1 Then I can count them as follows: CREATE VIEW bv_clients AS SELECT client_id, client_name, short_name, ljo.description as office_location,contact_name, phone_no, fax_no, status_label,address_1, address_2, address_3, address_4, address_5, address_6,city,state, zip_code, comments,active_count, settled_count, dismissed_count, other_count FROM clients JOIN status ON (clients.status = status.status AND status.relation = 'clients') LEFT OUTER JOIN text_list_values ljo ON (clients.jw_office = ljo.list_value and ljo.list_name ILIKE 'office location') LEFT OUTER JOIN (SELECT client_id, sum(act_count) as active_count, sum(stl_count) as settled_count,sum(dis_count) as dismissed_count, sum(oth_count) as other_count FROM case_clients cc JOIN text_list_values tvl ON (cc.case_status = tvl.list_value and list_name ilike 'case status')JOIN case_status_crosstab cst ON tvl.rollup1 = cst.status_group GROUP BY client_id) as c_count USING (client_id); This actually runs pretty fast, once everything is indexed. -Josh Berkus