Thread: JOIN vs. WHERE ... IN (subselect)

JOIN vs. WHERE ... IN (subselect)

From
"Dean Gibson (DB Administrator)"
Date:
Using PostgreSQL 7.3.2 on Linux.

One of the types of queries that I like to do on our database is:

SELECT * FROM table1   WHERE indexed_column1 IN   (SELECT column2 FROM table2 WHERE <condition>);

However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2 CONSISTS
OFONLY ONE ROW!
 

However, my third-party SQL book says that the above is equivalent to:

SELECT table1.* FROM table1, table2   WHERE indexed_column1 = column2 AND <condition>;

And indeed, the above query executes virtually instantaneously if "<condition>" results in a small number of rows from
table2.

I'm fairly new to SQL;  are they really equivalent?  If so, is there some reason that the first form is not
optimized/transformedinto the second form, or is this a planned enhancement?
 

-- Dean

ps: If indexed_column1 has the same name as column2, then the query:

SELECT table1.* from table1   JOIN table2 USING( column )   WHERE <condition>;

Also executes quickly.  I just like the readability of the first query above, but as they say, you gotta do what works
...



Re: JOIN vs. WHERE ... IN (subselect)

From
Chris Linstruth
Date:
I was just given the following.  Not your exact query but replacing
WHERE IN with JOIN ... AS SUBSELECT fixed my performance problems.
This query wnet from 15-30 seconds to under a second.

I have attempted to research exactly what Mr. Arnold did, and found
nothing regarding JOIN ... AS SUBSELECT in the docs.  Any pointers
appreciated.

On Sun, 11 May 2003, Mathieu Arnold wrote:
>
>
> --Le 11/05/2003 09:48 -0700, Chris Linstruth \xe9crivait :
>
> |
> | But this can take 15-30 seconds:
> |
> | SELECT count(radacctid) AS sessions,
> |         sum(acctsessiontime) AS connecttime
> |         FROM radacct
> |         WHERE radacctid IN
> |         (SELECT DISTINCT ON (acctsessionid) radacctid FROM radacct
> |           WHERE username='cjl'
> |           AND acctstoptime IS NOT NULL
> |           AND date_trunc('month', now())=date_trunc('month',
> | acctstoptime));
> |
> | There are probably many different ways to perform this query.  My
> | main problem is trying to overcome the fact that try as I might,
> | I can't stop the occasional duplicate accounting record from being
> | inserted so I have to weed them out, hence the "DISTINCT ON
> | (acctsessionid)".
>
> try this :
> SELECT count(radacctid) AS sessions,
>         sum(acctsessiontime) AS connecttime
>         FROM radacct
>              JOIN (SELECT DISTINCT ON (acctsessionid) radacctid FROM radacct
>                    WHERE username='cjl'
>                    AND acctstoptime IS NOT NULL
>                    AND date_trunc('month', now())=date_trunc('month',
> acctstoptime)) AS subselect USING (radacctid);
>

-- 
Chris Linstruth <cjl@qnet.com>
QNET
1529 East Palmdale Blvd Suite 200
Palmdale, CA 93550
(661) 538-2028


On Fri, 16 May 2003, Dean Gibson (DB Administrator) wrote:

> Using PostgreSQL 7.3.2 on Linux.
>
> One of the types of queries that I like to do on our database is:
>
> SELECT * FROM table1
>     WHERE indexed_column1 IN
>     (SELECT column2 FROM table2 WHERE <condition>);
>
> However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2
CONSISTSOF ONLY ONE ROW!
 
>
> However, my third-party SQL book says that the above is equivalent to:
>
> SELECT table1.* FROM table1, table2
>     WHERE indexed_column1 = column2 AND <condition>;
>
> And indeed, the above query executes virtually instantaneously if "<condition>" results in a small number of rows
fromtable2.
 
>
> I'm fairly new to SQL;  are they really equivalent?  If so, is there some reason that the first form is not
optimized/transformedinto the second form, or is this a planned enhancement?
 
>
> -- Dean
>
> ps: If indexed_column1 has the same name as column2, then the query:
>
> SELECT table1.* from table1
>     JOIN table2 USING( column )
>     WHERE <condition>;
>
> Also executes quickly.  I just like the readability of the first query above, but as they say, you gotta do what
works...
 
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>



Re: JOIN vs. WHERE ... IN (subselect)

From
"Dean Gibson (DB Administrator)"
Date:
The word 'subselect' is not a keyword but just an alias;  see the syntax under SELECT in the docs under 'SQL Commands'
(7.3.2):<br/><br /> | ( <i>select</i> ) [ AS ] <i>alias</i> [ ( <i>column_alias_list</i> ) ] <br /><br /><br /> Chris
Linstruthwrote on 2003-05-17 07:45:<br /><blockquote cite="cite" class="cite" type="cite">I was just given the
following. Not your exact query but replacing<br /> WHERE IN with JOIN ... AS SUBSELECT fixed my performance
problems.<br/> This query wnet from 15-30 seconds to under a second.<br /><br /> I have attempted to research exactly
whatMr. Arnold did, and found<br /> nothing regarding JOIN ... AS SUBSELECT in the docs.  Any pointers<br />
appreciated.<br/><br /> On Sun, 11 May 2003, Mathieu Arnold wrote:<br /> ><br /> ><br /> > --Le 11/05/2003
09:48-0700, Chris Linstruth \xe9crivait :<br /> ><br /> > |<br /> > | But this can take 15-30 seconds:<br />
>|<br /> > | SELECT count(radacctid) AS sessions,<br /> > |         sum(acctsessiontime) AS connecttime<br />
>|         FROM radacct<br /> > |         WHERE radacctid IN<br /> > |         (SELECT DISTINCT ON
(acctsessionid)radacctid FROM radacct<br /> > |           WHERE username='cjl'<br /> > |           AND
acctstoptimeIS NOT NULL<br /> > |           AND date_trunc('month', now())=date_trunc('month',<br /> > |
acctstoptime));<br/> > |<br /> > | There are probably many different ways to perform this query.  My<br /> > |
mainproblem is trying to overcome the fact that try as I might,<br /> > | I can't stop the occasional duplicate
accountingrecord from being<br /> > | inserted so I have to weed them out, hence the "DISTINCT ON<br /> > |
(acctsessionid)".<br/> ><br /> > try this :<br /> > SELECT count(radacctid) AS sessions,<br /> >        
sum(acctsessiontime)AS connecttime<br /> >         FROM radacct<br /> >              JOIN (SELECT DISTINCT ON
(acctsessionid)radacctid FROM radacct<br /> >                    WHERE username='cjl'<br /> >                   
ANDacctstoptime IS NOT NULL<br /> >                    AND date_trunc('month', now())=date_trunc('month',<br /> >
acctstoptime))AS subselect USING (radacctid);<br /> ><br /><br /> -- <br /> Chris Linstruth <cjl@qnet.com><br
/>QNET<br /> 1529 East Palmdale Blvd Suite 200<br /> Palmdale, CA 93550<br /> (661) 538-2028<br /><br /><br /> On Fri,
16May 2003, Dean Gibson (DB Administrator) wrote:<br /><br /> > Using PostgreSQL 7.3.2 on Linux.<br /> ><br />
>One of the types of queries that I like to do on our database is:<br /> ><br /> > SELECT * FROM table1<br />
>    WHERE indexed_column1 IN<br /> >     (SELECT column2 FROM table2 WHERE <condition>);<br /> ><br />
>However, in our database table1 is quite large (~1M rows), and the above query takes "forever", EVEN IF table2
CONSISTSOF ONLY ONE ROW!<br /> ><br /> > However, my third-party SQL book says that the above is equivalent
to:<br/> ><br /> > SELECT table1.* FROM table1, table2<br /> >     WHERE indexed_column1 = column2 AND
<condition>;<br/> ><br /> > And indeed, the above query executes virtually instantaneously if
"<condition>"results in a small number of rows from table2.<br /> ><br /> > I'm fairly new to SQL;  are
theyreally equivalent?  If so, is there some reason that the first form is not optimized/transformed into the second
form,or is this a planned enhancement?<br /> ><br /> > -- Dean<br /> ><br /> > ps: If indexed_column1 has
thesame name as column2, then the query:<br /> ><br /> > SELECT table1.* from table1<br /> >     JOIN table2
USING(column )<br /> >     WHERE <condition>;<br /> ><br /> > Also executes quickly.  I just like the
readabilityof the first query above, but as they say, you gotta do what works ...<br /> ><br /> ><br /> >
---------------------------(endof broadcast)---------------------------<br /> > TIP 5: Have you checked our
extensiveFAQ?<br /> ><br /> > <a eudora="autourl"
href="http://www.postgresql.org/docs/faqs/FAQ.html">http://www.postgresql.org/docs/faqs/FAQ.html</a><br/>
></blockquote>

Re: JOIN vs. WHERE ... IN (subselect)

From
Stephan Szabo
Date:
On Fri, 16 May 2003, Dean Gibson (DB Administrator) wrote:

> Using PostgreSQL 7.3.2 on Linux.
>
> One of the types of queries that I like to do on our database is:
>
> SELECT * FROM table1
>     WHERE indexed_column1 IN
>     (SELECT column2 FROM table2 WHERE <condition>);
>
> However, my third-party SQL book says that the above is equivalent to:
>
> SELECT table1.* FROM table1, table2
>     WHERE indexed_column1 = column2 AND <condition>;
>
> And indeed, the above query executes virtually instantaneously if
> "<condition>" results in a small number of rows from table2.
>
> I'm fairly new to SQL;  are they really equivalent?  If so, is there
> some reason that the first form is not optimized/transformed into the
> second form, or is this a planned enhancement?

7.4 already handles IN differently than previous versions and should
runs queries like the first one better than current versions.

And those two queries are not equivalent in any case unless the subselect
is guaranteed to only have one matching row (or you use a distinct in the
second query).