Thread: Subqueries and the optimizer

Subqueries and the optimizer

From

"Dmitri Bichko"

Date:

20 May 2003, 16:50:42

So, I have a table with an index:

dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
locus_id IN (1,2);
                                              QUERY PLAN
------------------------------------------------------------------------
------------------------------
 Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
(cost=0.00..88.21 rows=14 width=4)
   Index Cond: ((locus_id = 1) OR (locus_id = 2))
   Filter: (blast_batch_id = 2)

So far so good, but when I try it with a subquery:

dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
locus_id IN (SELECT locus_id FROM ll_out_mm);
QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on genes  (cost=0.00..21414353.48 rows=11003 width=4)
   Filter: ((blast_batch_id = 2) AND (subplan))
   SubPlan
     ->  Seq Scan on ll_out_mm  (cost=0.00..1267.64 rows=59264 width=4)


How can I nudge the optimizer in the direction of using the index in the
second case as well?  Or is it supposed to be doing this in this case.

Thanks,
Dmitri

Re: Subqueries and the optimizer

From

Stephan Szabo

Date:

20 May 2003, 17:20:14

On Tue, 20 May 2003, Dmitri Bichko wrote:

> So, I have a table with an index:
>
> dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
> locus_id IN (1,2);
>                                               QUERY PLAN
> ------------------------------------------------------------------------
> ------------------------------
>  Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
> (cost=0.00..88.21 rows=14 width=4)
>    Index Cond: ((locus_id = 1) OR (locus_id = 2))
>    Filter: (blast_batch_id = 2)
>
> So far so good, but when I try it with a subquery:
>
> dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
> locus_id IN (SELECT locus_id FROM ll_out_mm);
> QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on genes  (cost=0.00..21414353.48 rows=11003 width=4)
>    Filter: ((blast_batch_id = 2) AND (subplan))
>    SubPlan
>      ->  Seq Scan on ll_out_mm  (cost=0.00..1267.64 rows=59264 width=4)
>
>
> How can I nudge the optimizer in the direction of using the index in the
> second case as well?  Or is it supposed to be doing this in this case.

In current stable versions, IN is not optimized terribly well.  7.4 will
do a much better job.  Converting to a join or EXISTS may help in the
short term.

Re: Subqueries and the optimizer

From

"Dean Gibson (DB Administrator)"

Date:

20 May 2003, 17:28:28

Try:

SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND
genes.locus_id = ll_out_mm.locus_id;

Using more recent versions of PostgreSQL, you can also write:

SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE blast_batch_id = 2;

-- Dean

Dmitri Bichko wrote on 2003-05-20 10:50:
>So, I have a table with an index:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (1,2);
>                                              QUERY PLAN
>------------------------------------------------------------------------
>------------------------------
> Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
>(cost=0.00..88.21 rows=14 width=4)
>   Index Cond: ((locus_id = 1) OR (locus_id = 2))
>   Filter: (blast_batch_id = 2)
>
>So far so good, but when I try it with a subquery:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (SELECT locus_id FROM ll_out_mm);
>QUERY PLAN
>------------------------------------------------------------------------
> Seq Scan on genes  (cost=0.00..21414353.48 rows=11003 width=4)
>   Filter: ((blast_batch_id = 2) AND (subplan))
>   SubPlan
>     ->  Seq Scan on ll_out_mm  (cost=0.00..1267.64 rows=59264 width=4)
>
>
>How can I nudge the optimizer in the direction of using the index in the
>second case as well?  Or is it supposed to be doing this in this case.
>
>Thanks,
>Dmitri
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly

Re: Subqueries and the optimizer

From

"Dmitri Bichko"

Date:

20 May 2003, 19:45:33

I wish it were as easy as a join - the query is much simplified for the
purpose of the example, in reality the subselect is more complicated and
includes a GROUP BY (which, at least as far as I know, makes subqueries
the only way of doing this).

Thanks anway, guess I'll wait for 7.4 with this (and just split them up
into two queries for the time being),
Dmitri

-----Original Message-----
From: Dean Gibson (DB Administrator) [mailto:dba-sql@ultimeth.net]
Sent: Tuesday, May 20, 2003 2:28 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Subqueries and the optimizer


Try:

SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND
genes.locus_id = ll_out_mm.locus_id;

Using more recent versions of PostgreSQL, you can also write:

SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE
blast_batch_id = 2;

-- Dean

Dmitri Bichko wrote on 2003-05-20 10:50:
>So, I have a table with an index:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (1,2);
>                                              QUERY PLAN
>-----------------------------------------------------------------------
-
>------------------------------
> Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
>(cost=0.00..88.21 rows=14 width=4)
>   Index Cond: ((locus_id = 1) OR (locus_id = 2))
>   Filter: (blast_batch_id = 2)
>
>So far so good, but when I try it with a subquery:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (SELECT locus_id FROM ll_out_mm);
>QUERY PLAN
>-----------------------------------------------------------------------
-
> Seq Scan on genes  (cost=0.00..21414353.48 rows=11003 width=4)
>   Filter: ((blast_batch_id = 2) AND (subplan))
>   SubPlan
>     ->  Seq Scan on ll_out_mm  (cost=0.00..1267.64 rows=59264 width=4)
>
>
>How can I nudge the optimizer in the direction of using the index in
the
>second case as well?  Or is it supposed to be doing this in this case.
>
>Thanks,
>Dmitri
>
>---------------------------(end of
broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Re: Subqueries and the optimizer

From

"Dean Gibson (DB Administrator)"

Date:

20 May 2003, 20:11:42

I had the same problem (VERY SLOW when using a WHERE xxx IN (SELECT ...)), but fortunately, a join solved my problem. However, the WHERE xxx IN (SELECT ...) syntax is often more natural. I'm also waiting for 7.4.

-- Dean

Dmitri Bichko wrote on 2003-05-20 13:45:

I wish it were as easy as a join - the query is much simplified for the purpose of the example, in reality the subselect is more complicated and includes a GROUP BY (which, at least as far as I know, makes subqueries the only way of doing this).

Thanks anway, guess I'll wait for 7.4 with this (and just split them up into two queries for the time being),
Dmitri

-----Original Message-----
From: Dean Gibson (DB Administrator) [mailto:dba-sql@ultimeth.net]
Sent: Tuesday, May 20, 2003 2:28 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Subqueries and the optimizer

Try:

SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND
genes.locus_id = ll_out_mm.locus_id;

Using more recent versions of PostgreSQL, you can also write:

SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE
blast_batch_id = 2;

-- Dean

Dmitri Bichko wrote on 2003-05-20 10:50:
>So, I have a table with an index:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (1,2);
>                                              QUERY PLAN
>-----------------------------------------------------------------------
-
>------------------------------
> Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
>(cost=0.00..88.21 rows=14 width=4)
>   Index Cond: ((locus_id = 1) OR (locus_id = 2))
>   Filter: (blast_batch_id = 2)
>
>So far so good, but when I try it with a subquery:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (SELECT locus_id FROM ll_out_mm);
>QUERY PLAN
>-----------------------------------------------------------------------
-
> Seq Scan on genes (cost=0.00..21414353.48 rows=11003 width=4)
>   Filter: ((blast_batch_id = 2) AND (subplan))
>   SubPlan
>     -> Seq Scan on ll_out_mm (cost=0.00..1267.64 rows=59264 width=4)
>
>
>How can I nudge the optimizer in the direction of using the index in
the
>second case as well? Or is it supposed to be doing this in this case.
>
>Thanks,
>Dmitri
>
>---------------------------(end of
broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Subqueries and the optimizer

From

Stephan Szabo

Date:

20 May 2003, 20:21:05

On Tue, 20 May 2003, Dmitri Bichko wrote:

> I wish it were as easy as a join - the query is much simplified for the
> purpose of the example, in reality the subselect is more complicated and
> includes a GROUP BY (which, at least as far as I know, makes subqueries
> the only way of doing this).

But you may be able to do it as a subselect in FROM rather than a
subselect in IN (err, right).

For example, barring possible NULL related wierdness:

select * from a where col in (select count(*) from b group by col2);

can probably be done as something like:

select a.* from a, (select distinct count(*) as count from b group by
 col2) as b where a.col=b.count;

which in some cases for 7.3 and earlier will be better.

Re: Subqueries and the optimizer

From

Mike Mascari

Date:

20 May 2003, 20:22:17

Dmitri Bichko wrote:
> I wish it were as easy as a join - the query is much simplified for the
> purpose of the example, in reality the subselect is more complicated and
> includes a GROUP BY (which, at least as far as I know, makes subqueries
> the only way of doing this).
>
> Thanks anway, guess I'll wait for 7.4 with this (and just split them up
> into two queries for the time being),

Did you try a correlated subquery using EXISTS? You can make that
subquery as complicated as you want. The semantics between IN and
EXISTS vary though in the prescence of NULLs.

SELECT blast_id
FROM genes
WHERE blast_batch_id = 2 AND EXISTS (
 SELECT 1
 FROM ll_out_mm
 WHERE ll_out_mm.locusid = genes.locus_id AND
 ...
};

Mike Mascari
mascarm@mascari.com

Re: Subqueries and the optimizer

From

Alvaro Herrera

Date:

20 May 2003, 22:30:31

On Tue, May 20, 2003 at 04:45:15PM -0400, Dmitri Bichko wrote:
> I wish it were as easy as a join - the query is much simplified for the
> purpose of the example, in reality the subselect is more complicated and
> includes a GROUP BY (which, at least as far as I know, makes subqueries
> the only way of doing this).

Note that IN (1,2) is quite different from IN (SELECT something), so if
your query is like the latter then you probably want to convert it to
EXISTS as shown in the FAQ.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La soledad es compañia"

Re: Subqueries and the optimizer

From

Tom Lane

Date:

21 May 2003, 02:51:09

"Dmitri Bichko" <dbichko@genpathpharma.com> writes:
> Thanks anway, guess I'll wait for 7.4 with this (and just split them up
> into two queries for the time being),

As I was just saying to someone else, it'd be worth testing complex IN
cases against CVS tip to make sure that 7.4 will do a decent job with
them.  It's not too late now to consider improving any gaps that might
remain --- but by the time we start the formal beta process, it will be.
So I encourage you to grab a CVS-tip snapshot and set up a test database
to check out the cases you are interested in.  Please report what you
find out to pgsql-hackers.

            regards, tom lane

Re: Subqueries and the optimizer

From

"Dean K. Gibson"

Date:

24 May 2003, 21:29:24

Try:

SELECT blast_id FROM genes, ll_out_mm WHERE blast_batch_id = 2 AND
genes.locus_id = ll_out_mm.locus_id;

Using more recent versions of PostgreSQL, you can also write:

SELECT blast_id FROM genes JOIN ll_out_mm USING ( locus_id) WHERE blast_batch_id = 2;

-- Dean

Dmitri Bichko wrote on 2003-05-20 10:50:
>So, I have a table with an index:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (1,2);
>                                              QUERY PLAN
>------------------------------------------------------------------------
>------------------------------
> Index Scan using idx_genes_locus_id, idx_genes_locus_id on genes
>(cost=0.00..88.21 rows=14 width=4)
>   Index Cond: ((locus_id = 1) OR (locus_id = 2))
>   Filter: (blast_batch_id = 2)
>
>So far so good, but when I try it with a subquery:
>
>dev=# explain SELECT blast_id FROM genes WHERE blast_batch_id = 2 AND
>locus_id IN (SELECT locus_id FROM ll_out_mm);
>QUERY PLAN
>------------------------------------------------------------------------
> Seq Scan on genes  (cost=0.00..21414353.48 rows=11003 width=4)
>   Filter: ((blast_batch_id = 2) AND (subplan))
>   SubPlan
>     ->  Seq Scan on ll_out_mm  (cost=0.00..1267.64 rows=59264 width=4)
>
>
>How can I nudge the optimizer in the direction of using the index in the
>second case as well?  Or is it supposed to be doing this in this case.
>
>Thanks,
>Dmitri
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly