Home > mailing lists

WHERE IN (subselect) versus WHERE IN (1,2,3,) - Mailing list pgsql-general

From	Kevin Goess
Subject	WHERE IN (subselect) versus WHERE IN (1,2,3,)
Date	March 15, 2012 09:46:06
Msg-id	CABZkbxg3JEi-KAzwgRr34EuyqmKKRTM0H5AyEvqeBaAYekpMbQ@mail.gmail.com Whole thread
Responses	Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)
List	pgsql-general

Tree view

My apologies, I'm sure this question has been asked before but I couldn't find anything on the list that meant anything to me.

We have a table "contexts" with 1.6 million rows, and a table "articles" with 1.4 million rows, where an "article" is a particular kind of "context". We want to select from a join on those two tables like this

    SELECT COUNT(*)
    FROM contexts
    JOIN articles ON (articles.context_key=contexts.context_key)
    WHERE contexts.context_key IN (...);
    /* and some combination of columns from articles and contexts */

If "IN(...)" is a query, then this guy does a seq scan on the contexts table, even if the subquery is "select col_a from kgtest" where kgtest has one row. If however I read the ids beforehand and write them into the query, a la "IN (111,222,333...)", then the everything is happy, up to at least 20,000 values written into the sql, at which point smaller machines will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get the distinct impression that I'm Doing It Wrong. Is this expected behavior? It seems surprising to me.

To demonstrate:

    /* nothing up my sleeve */
    # select * from kgtest;
    cola
    ---------
     1652729
    (1 row)

    /* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (1652729);
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189 rows=1 loops=1)
   Buffers: shared hit=7
   -> Nested Loop (cost=0.00..3.81 rows=1 width=0) (actual time=0.181..0.181 rows=0 loops=1)
         Buffers: shared hit=7
         -> Index Scan using contexts_pkey on contexts (cost=0.00..1.90 rows=1 width=4) (actual time=0.109..0.112 ro
               Index Cond: (context_key = 1652729)
               Buffers: shared hit=4
         -> Index Scan using articles_pkey on articles (cost=0.00..1.90 rows=1 width=4) (actual time=0.060..0.060 ro
               Index Cond: (articles.context_key = 1652729)
               Buffers: shared hit=3
Total runtime: 0.324 ms
(11 rows)

/* subselect, query plan does seq scan on contexts */

# explain (analyze, buffers) select count(*)from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (select cola from kgtest);
                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118505.72..118505.73 rows=1 width=0) (actual time=0.274..0.275 rows=1 loops=1)
   Buffers: shared hit=5
   -> Hash Join (cost=12512.61..116661.91 rows=737524 width=0) (actual time=0.269..0.269 rows=0 loops=1)
         Hash Cond: (contexts.context_key = articles.context_key)
         Buffers: shared hit=5
         -> Seq Scan on contexts (cost=0.00..64533.03 rows=1648203 width=4) (actual time=0.009..0.009 rows=1 loops=1
               Buffers: shared hit=1
         -> Hash (cost=412.56..412.56 rows=737524 width=8) (actual time=0.110..0.110 rows=0 loops=1)
               Buckets: 4096 Batches: 32 Memory Usage: 0kB
               Buffers: shared hit=4
               -> Nested Loop (cost=40.00..412.56 rows=737524 width=8) (actual time=0.107..0.107 rows=0 loops=1)
                     Buffers: shared hit=4
                     -> HashAggregate (cost=40.00..42.00 rows=200 width=4) (actual time=0.069..0.071 rows=1 loops=1)
                           Buffers: shared hit=1
                           -> Seq Scan on kgtest (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows
                                 Buffers: shared hit=1
                     -> Index Scan using articles_pkey on articles (cost=0.00..1.84 rows=1 width=4) (actual time=0.0
                           Index Cond: (articles.context_key = kgtest.cola)
                           Buffers: shared hit=3
Total runtime: 0.442 ms

--
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgoess@bepress.com

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing

pgsql-general by date:

From: Bèrto ëd Sèra
Date: 15 March 2012, 09:42:14
Subject: Re: Backups

From: Andrey Chursin
Date: 15 March 2012, 10:15:45
Subject: Fetch from cursor with indexed sorting

WHERE IN (subselect) versus WHERE IN (1,2,3,) - Mailing list pgsql-general

Previous

Next