Thread: Subqueries and the optimizer
So, I have a table with an index: dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND locus_id IN (1,2); QUERY PLAN ------------------------------------------------------------------------ ------------------------------ Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes (cost=0.00..88.21 rows=14 width=4) Index Cond: ((locus_id = 1) OR (locus_id = 2)) Filter: (blast_batch_id = 2) So far so good, but when I try it with a subquery: dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND locus_id IN (SELECT locus_id FROM ll_out_mm); QUERY PLAN ------------------------------------------------------------------------ Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4) Filter: ((blast_batch_id = 2) AND (subplan)) SubPlan -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4) How can I nudge the optimizer in the direction of using the index in the second case as well? Or is it supposed to be doing this in this case. Thanks, Dmitri
On Tue, 20 May 2003, Dmitri Bichko wrote: > So, I have a table with an index: > > dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND > locus_id IN (1,2); > QUERY PLAN > ------------------------------------------------------------------------ > ------------------------------ > Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes > (cost=0.00..88.21 rows=14 width=4) > Index Cond: ((locus_id = 1) OR (locus_id = 2)) > Filter: (blast_batch_id = 2) > > So far so good, but when I try it with a subquery: > > dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND > locus_id IN (SELECT locus_id FROM ll_out_mm); > QUERY PLAN > ------------------------------------------------------------------------ > Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4) > Filter: ((blast_batch_id = 2) AND (subplan)) > SubPlan > -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4) > > > How can I nudge the optimizer in the direction of using the index in the > second case as well? Or is it supposed to be doing this in this case. In current stable versions, IN is not optimized terribly well. 7.4 will do a much better job. Converting to a join or EXISTS may help in the short term.
Try: SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND genes.locus_id = ll_out_mm.locus_id; Using more recent versions of PostgreSQL, you can also write: SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE blast_batch_id = 2; -- Dean Dmitri Bichko wrote on 2003-05-20 10:50: >So, I have a table with an index: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (1,2); > QUERY PLAN >------------------------------------------------------------------------ >------------------------------ > Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes >(cost=0.00..88.21 rows=14 width=4) > Index Cond: ((locus_id = 1) OR (locus_id = 2)) > Filter: (blast_batch_id = 2) > >So far so good, but when I try it with a subquery: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (SELECT locus_id FROM ll_out_mm); >QUERY PLAN >------------------------------------------------------------------------ > Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4) > Filter: ((blast_batch_id = 2) AND (subplan)) > SubPlan > -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4) > > >How can I nudge the optimizer in the direction of using the index in the >second case as well? Or is it supposed to be doing this in this case. > >Thanks, >Dmitri > >---------------------------(end of broadcast)--------------------------- >TIP 3: if posting/reading through Usenet, please send an appropriate >subscribe-nomail command to majordomo@postgresql.org so that your >message can get through to the mailing list cleanly
I wish it were as easy as a join - the query is much simplified for the purpose of the example, in reality the subselect is more complicated and includes a GROUP BY (which, at least as far as I know, makes subqueries the only way of doing this). Thanks anway, guess I'll wait for 7.4 with this (and just split them up into two queries for the time being), Dmitri -----Original Message----- From: Dean Gibson (DB Administrator) [mailto:dba-sql@ultimeth.net] Sent: Tuesday, May 20, 2003 2:28 PM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] Subqueries and the optimizer Try: SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND genes.locus_id = ll_out_mm.locus_id; Using more recent versions of PostgreSQL, you can also write: SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE blast_batch_id = 2; -- Dean Dmitri Bichko wrote on 2003-05-20 10:50: >So, I have a table with an index: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (1,2); > QUERY PLAN >----------------------------------------------------------------------- - >------------------------------ > Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes >(cost=0.00..88.21 rows=14 width=4) > Index Cond: ((locus_id = 1) OR (locus_id = 2)) > Filter: (blast_batch_id = 2) > >So far so good, but when I try it with a subquery: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (SELECT locus_id FROM ll_out_mm); >QUERY PLAN >----------------------------------------------------------------------- - > Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4) > Filter: ((blast_batch_id = 2) AND (subplan)) > SubPlan > -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4) > > >How can I nudge the optimizer in the direction of using the index in the >second case as well? Or is it supposed to be doing this in this case. > >Thanks, >Dmitri > >---------------------------(end of broadcast)--------------------------- >TIP 3: if posting/reading through Usenet, please send an appropriate >subscribe-nomail command to majordomo@postgresql.org so that your >message can get through to the mailing list cleanly ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
I had the same problem (VERY SLOW when using a WHERE xxx IN (SELECT ...)), but fortunately, a join solved my problem. However, the WHERE xxx IN (SELECT ...) syntax is often more natural. I'm also waiting for 7.4.
-- Dean
Dmitri Bichko wrote on 2003-05-20 13:45:
-- Dean
Dmitri Bichko wrote on 2003-05-20 13:45:
I wish it were as easy as a join - the query is much simplified for the purpose of the example, in reality the subselect is more complicated and includes a GROUP BY (which, at least as far as I know, makes subqueries the only way of doing this).
Thanks anway, guess I'll wait for 7.4 with this (and just split them up into two queries for the time being),
Dmitri
-----Original Message-----
From: Dean Gibson (DB Administrator) [mailto:dba-sql@ultimeth.net]
Sent: Tuesday, May 20, 2003 2:28 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Subqueries and the optimizer
Try:
SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND
genes.locus_id = ll_out_mm.locus_id;
Using more recent versions of PostgreSQL, you can also write:
SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE
blast_batch_id = 2;
-- Dean
Dmitri Bichko wrote on 2003-05-20 10:50:
>So, I have a table with an index:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (1,2);
> QUERY PLAN
>-----------------------------------------------------------------------
-
>------------------------------
> Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
>(cost=0.00..88.21 rows=14 width=4)
> Index Cond: ((locus_id = 1) OR (locus_id = 2))
> Filter: (blast_batch_id = 2)
>
>So far so good, but when I try it with a subquery:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (SELECT locus_id FROM ll_out_mm);
>QUERY PLAN
>-----------------------------------------------------------------------
-
> Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4)
> Filter: ((blast_batch_id = 2) AND (subplan))
> SubPlan
> -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4)
>
>
>How can I nudge the optimizer in the direction of using the index in
the
>second case as well? Or is it supposed to be doing this in this case.
>
>Thanks,
>Dmitri
>
>---------------------------(end of
broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
On Tue, 20 May 2003, Dmitri Bichko wrote: > I wish it were as easy as a join - the query is much simplified for the > purpose of the example, in reality the subselect is more complicated and > includes a GROUP BY (which, at least as far as I know, makes subqueries > the only way of doing this). But you may be able to do it as a subselect in FROM rather than a subselect in IN (err, right). For example, barring possible NULL related wierdness: select * from a where col in (select count(*) from b group by col2); can probably be done as something like: select a.* from a, (select distinct count(*) as count from b group by col2) as b where a.col=b.count; which in some cases for 7.3 and earlier will be better.
Dmitri Bichko wrote: > I wish it were as easy as a join - the query is much simplified for the > purpose of the example, in reality the subselect is more complicated and > includes a GROUP BY (which, at least as far as I know, makes subqueries > the only way of doing this). > > Thanks anway, guess I'll wait for 7.4 with this (and just split them up > into two queries for the time being), Did you try a correlated subquery using EXISTS? You can make that subquery as complicated as you want. The semantics between IN and EXISTS vary though in the prescence of NULLs. SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND EXISTS ( SELECT 1 FROM ll_out_mm WHERE ll_out_mm.locusid = genes.locus_id AND ... }; Mike Mascari mascarm@mascari.com
On Tue, May 20, 2003 at 04:45:15PM -0400, Dmitri Bichko wrote: > I wish it were as easy as a join - the query is much simplified for the > purpose of the example, in reality the subselect is more complicated and > includes a GROUP BY (which, at least as far as I know, makes subqueries > the only way of doing this). Note that IN (1,2) is quite different from IN (SELECT something), so if your query is like the latter then you probably want to convert it to EXISTS as shown in the FAQ. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "La soledad es compañia"
"Dmitri Bichko" <dbichko@genpathpharma.com> writes: > Thanks anway, guess I'll wait for 7.4 with this (and just split them up > into two queries for the time being), As I was just saying to someone else, it'd be worth testing complex IN cases against CVS tip to make sure that 7.4 will do a decent job with them. It's not too late now to consider improving any gaps that might remain --- but by the time we start the formal beta process, it will be. So I encourage you to grab a CVS-tip snapshot and set up a test database to check out the cases you are interested in. Please report what you find out to pgsql-hackers. regards, tom lane
Try: SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND genes.locus_id = ll_out_mm.locus_id; Using more recent versions of PostgreSQL, you can also write: SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE blast_batch_id = 2; -- Dean Dmitri Bichko wrote on 2003-05-20 10:50: >So, I have a table with an index: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (1,2); > QUERY PLAN >------------------------------------------------------------------------ >------------------------------ > Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes >(cost=0.00..88.21 rows=14 width=4) > Index Cond: ((locus_id = 1) OR (locus_id = 2)) > Filter: (blast_batch_id = 2) > >So far so good, but when I try it with a subquery: > >dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND >locus_id IN (SELECT locus_id FROM ll_out_mm); >QUERY PLAN >------------------------------------------------------------------------ > Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4) > Filter: ((blast_batch_id = 2) AND (subplan)) > SubPlan > -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4) > > >How can I nudge the optimizer in the direction of using the index in the >second case as well? Or is it supposed to be doing this in this case. > >Thanks, >Dmitri > >---------------------------(end of broadcast)--------------------------- >TIP 3: if posting/reading through Usenet, please send an appropriate >subscribe-nomail command to majordomo@postgresql.org so that your >message can get through to the mailing list cleanly