Re: WHERE IN (subselect) versus WHERE IN (1,2,3,) - Mailing list pgsql-general

From Albe Laurenz
Subject Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)
Date
Msg-id D960CB61B694CF459DCFB4B0128514C207A2AC0D@exadv11.host.magwien.gv.at
Whole thread Raw
In response to WHERE IN (subselect) versus WHERE IN (1,2,3,)  (Kevin Goess <kgoess@bepress.com>)
Responses Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)  (Kevin Goess <kgoess@bepress.com>)
List pgsql-general
Kevin Goess wrote:
> We have a table "contexts" with 1.6 million rows, and a table
"articles" with 1.4 million rows, where
> an "article" is a particular kind of "context".  We want to select
from a join on those two tables
> like this
>
>     SELECT COUNT(*)
>     FROM contexts
>     JOIN articles ON (articles.context_key=contexts.context_key)
>     WHERE contexts.context_key IN (...);
>     /* and some combination of columns from articles and contexts */
>
> If "IN(...)" is a query, then this guy does a seq scan on the contexts
table, even if the subquery is
> "select col_a from kgtest" where kgtest has one row.  If however I
read the ids beforehand and write
> them into the query, a la "IN (111,222,333...)", then the everything
is happy, up to at least 20,000
> values written into the sql, at which point smaller machines will take
2-5 minutes to parse the query.
>
> I can certainly write the ids inline into the SQL, but when I do that
I get the distinct impression
> that I'm Doing It Wrong.  Is this expected behavior?  It seems
surprising to me.
>
>
> To demonstrate:
>
>     /* nothing up my sleeve */
>     # select * from kgtest;
>       cola
>     ---------
>      1652729
>     (1 row)

[...]

>   /* subselect, query plan does seq scan on contexts */
[...]
>     ->  Seq Scan on kgtest  (cost=0.00..34.00 rows=2400 width=4)
(actual time=0.048..0.050 rows
[...]

There is something missing in this line, but according to what you wrote
it must be "actual [...] rows=1", And yet the planner assumes that the
scan will return 2400 rows.
That means that your statistics are not accurate.

As a first measure, you should ANALYZE the tables involved and see if
the problem persists.  If yes, post the new plans.

Yours,
Laurenz Albe

pgsql-general by date:

Previous
From: Janning Vygen
Date:
Subject: Re: Anonymized database dumps
Next
From: Sergey Konoplev
Date:
Subject: Re: Multi server query