> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Thursday, May 10, 2007 11:53 AM
> To: Neil Conway
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Planning large IN lists
>
> Neil Conway <neilc@samurai.com> writes:
> > When planning queries with a large IN expression in the WHERE clause,
> > the planner transforms the IN list into a scalar array expression. In
> > clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
> > by calling scalararraysel(), which in turn estimates the selectivity of
> > *each* array element in order to determine the selectivity of the array
> > expression as a whole.
>
> > This is quite inefficient when the IN list is large.
>
> That's the least of the problems. We really ought to convert such cases
> into an IN (VALUES(...)) type of query, since often repeated indexscans
> aren't the best implementation.

It seems to me that if there is a unique index on the IN-list column,
the problem is simplified. In that case, you just have to estimate the
point at which that many index seeks cost more than a single table scan.
For the average database, the crossover is usually around 5-10% of the
table size; I'm not sure how it works out in PostgreSQL.

So in the special case of an IN list on a uniquely indexed column,
compare the cardinality of the table with the number of IN-list items
and decide between a table scan and index seeks based on that.
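As a back-of-the-envelope illustration of that crossover test (the
function name and the threshold are just made up from the 5-10% figure
above, not anything PostgreSQL's cost model actually does):

```python
# Toy crossover heuristic for a unique-indexed IN-list column: prefer
# index seeks only while the number of IN-list items stays below a fixed
# fraction of the table's cardinality. The 7.5% default is simply the
# midpoint of the 5-10% range mentioned above.
def choose_plan(num_in_items: int, table_cardinality: int,
                crossover_fraction: float = 0.075) -> str:
    if table_cardinality == 0:
        return "table scan"  # empty table: a scan is trivially cheap
    if num_in_items / table_cardinality < crossover_fraction:
        return "index seeks"
    return "table scan"

print(choose_plan(50, 1_000_000))       # tiny IN list vs. a big table
print(choose_plan(200_000, 1_000_000))  # IN list covering 20% of the table
```

A real planner would of course fold in random- vs. sequential-I/O costs
rather than a flat fraction, but the shape of the decision is the same.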
For arbitrary queries, it seems it would be necessary to keep histograms
for the columns in question. Perhaps they could be collected with an
enhanced analyze option.
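To sketch the histogram idea (a deliberately simplified model; the bucket
bounds, the even-spread assumption, and the distinct-values-per-bucket
parameter are all illustrative, not how PostgreSQL's ANALYZE statistics
actually work):

```python
# Equi-depth histogram sketch: `bounds` splits the column's value range
# into buckets holding roughly equal row counts, the way an analyze-style
# pass might collect them.
def in_list_selectivity(bounds, in_items, distinct_per_bucket):
    """Crudely estimate the fraction of rows matched by an IN list,
    assuming values are spread evenly within each bucket."""
    num_buckets = len(bounds) - 1
    selectivity = 0.0
    for v in in_items:
        if bounds[0] <= v <= bounds[-1]:
            # Each bucket covers 1/num_buckets of the rows; assume one
            # value matches 1/distinct_per_bucket of its bucket's rows.
            selectivity += (1.0 / num_buckets) / distinct_per_bucket
    return min(selectivity, 1.0)

# 4 buckets over values 0..100, ~10 distinct values per bucket:
sel = in_list_selectivity([0, 25, 50, 75, 100], [3, 42, 88], 10)
```

With per-column histograms like this, the planner could price the whole
IN list in one pass instead of consulting statistics once per element.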