Thread: JOIN vs. WHERE ... IN (subselect)
Using PostgreSQL 7.3.2 on Linux. One of the types of queries that I like to do on our database is: SELECT * FROM table1 WHERE indexed_column1 IN (SELECT column2 FROM table2 WHERE <condition>); However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2 CONSISTS OFONLY ONE ROW! However, my third-party SQL book says that the above is equivalent to: SELECT table1.* FROM table1, table2 WHERE indexed_column1 = column2 AND <condition>; And indeed, the above query executes virtually instantaneously if "<condition>" results in a small number of rows from table2. I'm fairly new to SQL; are they really equivalent? If so, is there some reason that the first form is not optimized/transformedinto the second form, or is this a planned enhancement? -- Dean ps: If indexed_column1 has the same name as column2, then the query: SELECT table1.* from table1 JOIN table2 USING( column ) WHERE <condition>; Also executes quickly. I just like the readability of the first query above, but as they say, you gotta do what works ...
I was just given the following. Not your exact query but replacing WHERE IN with JOIN ... AS SUBSELECT fixed my performance problems. This query wnet from 15-30 seconds to under a second. I have attempted to research exactly what Mr. Arnold did, and found nothing regarding JOIN ... AS SUBSELECT in the docs. Any pointers appreciated. On Sun, 11 May 2003, Mathieu Arnold wrote: > > > --Le 11/05/2003 09:48 -0700, Chris Linstruth \xe9crivait : > > | > | But this can take 15-30 seconds: > | > | SELECT count(radacctid) AS sessions, > | sum(acctsessiontime) AS connecttime > | FROM radacct > | WHERE radacctid IN > | (SELECT DISTINCT ON (acctsessionid) radacctid FROM radacct > | WHERE username='cjl' > | AND acctstoptime IS NOT NULL > | AND date_trunc('month', now())=date_trunc('month', > | acctstoptime)); > | > | There are probably many different ways to perform this query. My > | main problem is trying to overcome the fact that try as I might, > | I can't stop the occasional duplicate accounting record from being > | inserted so I have to weed them out, hence the "DISTINCT ON > | (acctsessionid)". > > try this : > SELECT count(radacctid) AS sessions, > sum(acctsessiontime) AS connecttime > FROM radacct > JOIN (SELECT DISTINCT ON (acctsessionid) radacctid FROM radacct > WHERE username='cjl' > AND acctstoptime IS NOT NULL > AND date_trunc('month', now())=date_trunc('month', > acctstoptime)) AS subselect USING (radacctid); > -- Chris Linstruth <cjl@qnet.com> QNET 1529 East Palmdale Blvd Suite 200 Palmdale, CA 93550 (661) 538-2028 On Fri, 16 May 2003, Dean Gibson (DB Administrator) wrote: > Using PostgreSQL 7.3.2 on Linux. > > One of the types of queries that I like to do on our database is: > > SELECT * FROM table1 > WHERE indexed_column1 IN > (SELECT column2 FROM table2 WHERE <condition>); > > However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2 CONSISTSOF ONLY ONE ROW! > > However, my third-party SQL book says that the above is equivalent to: > > SELECT table1.* FROM table1, table2 > WHERE indexed_column1 = column2 AND <condition>; > > And indeed, the above query executes virtually instantaneously if "<condition>" results in a small number of rows fromtable2. > > I'm fairly new to SQL; are they really equivalent? If so, is there some reason that the first form is not optimized/transformedinto the second form, or is this a planned enhancement? > > -- Dean > > ps: If indexed_column1 has the same name as column2, then the query: > > SELECT table1.* from table1 > JOIN table2 USING( column ) > WHERE <condition>; > > Also executes quickly. I just like the readability of the first query above, but as they say, you gotta do what works... > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html >
The word 'subselect' is not a keyword but just an alias; see the syntax under SELECT in the docs under 'SQL Commands' (7.3.2):<br/><br /> | ( <i>select</i> ) [ AS ] <i>alias</i> [ ( <i>column_alias_list</i> ) ] <br /><br /><br /> Chris Linstruthwrote on 2003-05-17 07:45:<br /><blockquote cite="cite" class="cite" type="cite">I was just given the following. Not your exact query but replacing<br /> WHERE IN with JOIN ... AS SUBSELECT fixed my performance problems.<br/> This query wnet from 15-30 seconds to under a second.<br /><br /> I have attempted to research exactly whatMr. Arnold did, and found<br /> nothing regarding JOIN ... AS SUBSELECT in the docs. Any pointers<br /> appreciated.<br/><br /> On Sun, 11 May 2003, Mathieu Arnold wrote:<br /> ><br /> ><br /> > --Le 11/05/2003 09:48-0700, Chris Linstruth \xe9crivait :<br /> ><br /> > |<br /> > | But this can take 15-30 seconds:<br /> >|<br /> > | SELECT count(radacctid) AS sessions,<br /> > | sum(acctsessiontime) AS connecttime<br /> >| FROM radacct<br /> > | WHERE radacctid IN<br /> > | (SELECT DISTINCT ON (acctsessionid)radacctid FROM radacct<br /> > | WHERE username='cjl'<br /> > | AND acctstoptimeIS NOT NULL<br /> > | AND date_trunc('month', now())=date_trunc('month',<br /> > | acctstoptime));<br/> > |<br /> > | There are probably many different ways to perform this query. My<br /> > | mainproblem is trying to overcome the fact that try as I might,<br /> > | I can't stop the occasional duplicate accountingrecord from being<br /> > | inserted so I have to weed them out, hence the "DISTINCT ON<br /> > | (acctsessionid)".<br/> ><br /> > try this :<br /> > SELECT count(radacctid) AS sessions,<br /> > sum(acctsessiontime)AS connecttime<br /> > FROM radacct<br /> > JOIN (SELECT DISTINCT ON (acctsessionid)radacctid FROM radacct<br /> > WHERE username='cjl'<br /> > ANDacctstoptime IS NOT NULL<br /> > AND date_trunc('month', now())=date_trunc('month',<br /> > acctstoptime))AS subselect USING (radacctid);<br /> ><br /><br /> -- <br /> Chris Linstruth <cjl@qnet.com><br />QNET<br /> 1529 East Palmdale Blvd Suite 200<br /> Palmdale, CA 93550<br /> (661) 538-2028<br /><br /><br /> On Fri, 16May 2003, Dean Gibson (DB Administrator) wrote:<br /><br /> > Using PostgreSQL 7.3.2 on Linux.<br /> ><br /> >One of the types of queries that I like to do on our database is:<br /> ><br /> > SELECT * FROM table1<br /> > WHERE indexed_column1 IN<br /> > (SELECT column2 FROM table2 WHERE <condition>);<br /> ><br /> >However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2 CONSISTSOF ONLY ONE ROW!<br /> ><br /> > However, my third-party SQL book says that the above is equivalent to:<br/> ><br /> > SELECT table1.* FROM table1, table2<br /> > WHERE indexed_column1 = column2 AND <condition>;<br/> ><br /> > And indeed, the above query executes virtually instantaneously if "<condition>"results in a small number of rows from table2.<br /> ><br /> > I'm fairly new to SQL; are theyreally equivalent? If so, is there some reason that the first form is not optimized/transformed into the second form,or is this a planned enhancement?<br /> ><br /> > -- Dean<br /> ><br /> > ps: If indexed_column1 has thesame name as column2, then the query:<br /> ><br /> > SELECT table1.* from table1<br /> > JOIN table2 USING(column )<br /> > WHERE <condition>;<br /> ><br /> > Also executes quickly. I just like the readabilityof the first query above, but as they say, you gotta do what works ...<br /> ><br /> ><br /> > ---------------------------(endof broadcast)---------------------------<br /> > TIP 5: Have you checked our extensiveFAQ?<br /> ><br /> > <a eudora="autourl" href="http://www.postgresql.org/docs/faqs/FAQ.html">http://www.postgresql.org/docs/faqs/FAQ.html</a><br/> ></blockquote>
On Fri, 16 May 2003, Dean Gibson (DB Administrator) wrote: > Using PostgreSQL 7.3.2 on Linux. > > One of the types of queries that I like to do on our database is: > > SELECT * FROM table1 > WHERE indexed_column1 IN > (SELECT column2 FROM table2 WHERE <condition>); > > However, my third-party SQL book says that the above is equivalent to: > > SELECT table1.* FROM table1, table2 > WHERE indexed_column1 = column2 AND <condition>; > > And indeed, the above query executes virtually instantaneously if > "<condition>" results in a small number of rows from table2. > > I'm fairly new to SQL; are they really equivalent? If so, is there > some reason that the first form is not optimized/transformed into the > second form, or is this a planned enhancement? 7.4 already handles IN differently than previous versions and should runs queries like the first one better than current versions. And those two queries are not equivalent in any case unless the subselect is guaranteed to only have one matching row (or you use a distinct in the second query).