Re: slow IN() clause for many cases - Mailing list pgsql-hackers

From Jonah H. Harris
Subject Re: slow IN() clause for many cases
Date
Msg-id 36e682920510121229x2685b577y59ffb31b70fffd65@mail.gmail.com
Whole thread Raw
In response to Re: slow IN() clause for many cases  ("Ilia Kantor" <ilia@obnovlenie.ru>)
List pgsql-hackers
IMHO, we should first look at whether the O(N^2) behavior is needed.

On 10/12/05, Ilia Kantor <ilia@obnovlenie.ru > wrote:

1) merge join is faster for me then hash join (6 vs 3-4ms):

explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
27,28,29,30])[i] from generate_series(1,30) s(i));

Hash Join  (cost= 17.50..162.93 rows=223 width=600) (actual
time=1.148..32.654 rows=138 loops=1)
   Hash Cond: ("outer".id =
('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,2
8,29,30}'::integer[])["inner".i])
   ->  Seq Scan on objects_hier  (cost=0.00..134.80 rows=1680 width=600)
(actual time=0.235..12.271 rows=1680 loops=1)
   ->  Hash  (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
rows=30 loops=1)
         ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual
time=0.161..0.830 rows=30 loops=1)
               ->  Function Scan on generate_series s  (cost=0.00..12.50
rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)

2) Is it possible to make the planner plan that better, e.g use Bitmap-OR
for <20numbers, use some other method for >20 numbers (actual number depends
on costs) ?

Maybe that should be added to TODO ?




-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto: pgsql-hackers-owner@postgresql.org] On Behalf Of Andrew - Supernews
Sent: Wednesday, October 12, 2005 1:41 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] slow IN() clause for many cases

On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:
> When in clause becomes large enough (>20-30 cases),
> It is much better to use "join" way of processing..

or even a different way of writing the IN clause.

This one is one I've used after considerable research:

select * from table
where field in (select (some_array_of_N_items)[i]
                   from generate_series(1,N) as s(i));

This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the inner
path.

It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.

As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:

test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms

compare:

test=# prepare tstplan2 as select * from test where id in
  (select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms

(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend



--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

pgsql-hackers by date:

Previous
From: Emil Briggs
Date:
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Next
From: David Fetter
Date:
Subject: Comments on columns in the pg_catalog tables/views