Kevin Goess wrote:
> We have a table "contexts" with 1.6 million rows, and a table
"articles" with 1.4 million rows, where
> an "article" is a particular kind of "context". We want to select
from a join on those two tables
> like this
>
> SELECT COUNT(*)
> FROM contexts
> JOIN articles ON (articles.context_key=contexts.context_key)
> WHERE contexts.context_key IN (...);
> /* and some combination of columns from articles and contexts */
>
> If "IN(...)" is a query, then this guy does a seq scan on the contexts
table, even if the subquery is
> "select col_a from kgtest" where kgtest has one row. If however I
read the ids beforehand and write
> them into the query, a la "IN (111,222,333...)", then the everything
is happy, up to at least 20,000
> values written into the sql, at which point smaller machines will take
2-5 minutes to parse the query.
>
> I can certainly write the ids inline into the SQL, but when I do that
I get the distinct impression
> that I'm Doing It Wrong. Is this expected behavior? It seems
surprising to me.
>
>
> To demonstrate:
>
> /* nothing up my sleeve */
> # select * from kgtest;
> cola
> ---------
> 1652729
> (1 row)
[...]
> /* subselect, query plan does seq scan on contexts */
[...]
> -> Seq Scan on kgtest (cost=0.00..34.00 rows=2400 width=4)
(actual time=0.048..0.050 rows
[...]
There is something missing in this line, but according to what you wrote
it must be "actual [...] rows=1", And yet the planner assumes that the
scan will return 2400 rows.
That means that your statistics are not accurate.
As a first measure, you should ANALYZE the tables involved and see if
the problem persists. If yes, post the new plans.
Yours,
Laurenz Albe