On Mon, May 08, 2006 at 12:50:13PM -0500, Jim C. Nasby wrote:
> On Mon, May 08, 2006 at 10:42:21AM -0700, Mark Lewis wrote:
> > Doing a SELECT with a large list of variables inside an IN runs slowly
> > on every database we've tested. We've tested mostly in Oracle and
> > PostgreSQL, and both get very slow very quickly (actually Oracle refuses
> > to process the query at all after it gets too many bind parameters).
> >
> > In our case, we have a (potentially quite large) set of external values
> > that we want to look up in the database. We originally thought that
> > doing a single select with a large IN clause was the way to go, but then
> > we did some performance analysis on the optimal batch size (number of
> > items to include per IN clause), and discovered that for most databases,
> > the optimal batch size was 1. For PostgreSQL I think it was 2.
> >
> > The moral of the story is that you're probably better off running a
> > bunch of small selects than in trying to optimize things with one
> > gargantuan select.
>
> Ever experiment with loading the parameters into a temp table and
> joining to that?
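(For the temp-table approach Jim suggests, a minimal sketch, with purely
illustrative table and column names, might look like this:)

create temp table param_ids (id bigint);
-- load the external values, e.g. via COPY or one INSERT per value
insert into param_ids values (101);
insert into param_ids values (102);
analyze param_ids;   -- give the planner row estimates for the join
select t.*
  from your_table t
  join param_ids p on p.id = t.id;
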
The database layer used by the DSPAM software is very similar to your use
case. The fastest queries there use PostgreSQL's generate_series
functionality to expand the parameter array inside an IN (subselect),
instead of building one gargantuan literal IN list. Here is the lookup
function that DSPAM uses:
create function lookup_tokens(integer, bigint[])
returns setof dspam_token_data
language plpgsql stable
as '
declare
    v_rec record;
begin
    -- $1 is the user id, $2 is the array of token values to look up;
    -- generate_series yields one row per array index, so the IN clause
    -- is fed by a subselect rather than a long literal list.
    for v_rec in select * from dspam_token_data
                  where uid = $1
                    and token in (select $2[i]
                                    from generate_series(array_lower($2, 1),
                                                         array_upper($2, 1)) s(i))
    loop
        return next v_rec;
    end loop;
    return;
end;';
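For example, to look up a batch of tokens for user id 1 (the uid and token
values below are made up):

select * from lookup_tokens(1, '{123456789, 987654321}'::bigint[]);
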
You should be able to try something similar for your workload.
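As a rough sketch, the same rewrite against a hypothetical table of your own
(table, column, and parameter names are only placeholders) could be prepared
and executed like this:

prepare lookup_by_ids(bigint[]) as
  select t.*
    from your_table t
   where t.id in (select ($1)[i]
                    from generate_series(array_lower($1, 1),
                                         array_upper($1, 1)) s(i));

execute lookup_by_ids('{101, 102, 103}');
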
Ken Marshall