Home > mailing lists

Re: Planning large IN lists - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: Planning large IN lists
Date	March 11, 2008 15:15:30
Msg-id	200803111815.m2BIF5e17619@momjian.us Whole thread Raw
In response to	Re: Planning large IN lists ("Atul Deopujari" <atul.deopujari@enterprisedb.com>)
List	pgsql-hackers

Tree view

Added to TODO:

* Consider using a hash for joining to a large IN (VALUES ...) list
 http://archives.postgresql.org/pgsql-hackers/2007-05/msg00450.php


---------------------------------------------------------------------------

Atul Deopujari wrote:
> Hi,
> 
> Tom Lane wrote:
> > Neil Conway <neilc@samurai.com> writes:
> >> When planning queries with a large IN expression in the WHERE clause,
> >> the planner transforms the IN list into a scalar array expression. In
> >> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
> >> by calling scalararraysel(), which in turn estimates the selectivity of
> >> *each* array element in order to determine the selectivity of the array
> >> expression as a whole.
> > 
> >> This is quite inefficient when the IN list is large.
> > 
> > That's the least of the problems.  We really ought to convert such cases
> > into an IN (VALUES(...)) type of query, since often repeated indexscans
> > aren't the best implementation.
> > 
> I thought of giving this a shot and while I was working on it, it 
> occurred to me that we need to decide on a threshold value of the IN 
> list size above which such transformation should take place. For small 
> sizes of the IN list, scalararraysel() of IN list wins over the hash 
> join involved in IN (VALUES(...)). But for larger sizes of the IN list, 
> IN (VALUES(...)) comes out to be a clear winner.
> I would like to know what does the community think should be a heuristic 
> value of the IN list size beyond which this transformation should take 
> place.
> I was thinking of a GUC variable (or a hard coded value) which defaults 
> to say 30. This is based on numbers from the following test:
> 
> postgres=# create table w (w text);
> CREATE TABLE
> 
> postgres=# \copy w from '/usr/share/dict/words'
> 
> And run the following query with different IN list sizes
> explain analyze select * from w where w in ('one', 'two', ...);
> 
> I got the following runtimes:
> ------------------------------------
> IN list  IN (VALUES(...))     IN
> size
> ------------------------------------
> 150     ~2000 ms           ~5500 ms
> 100     ~1500 ms           ~4000 ms
> 80      ~1400 ms           ~3000 ms
> 50      ~1400 ms           ~2500 ms
> 30      ~1500 ms           ~1500 ms
> 20      ~1400 ms           ~1200 ms
> 10      ~1400 ms           ~1200 ms
> ------------------------------------
> 
> The IN (VALUES(...)) gives an almost steady state behavior, while the IN 
>   runtimes deteriorate with growing list size.
> 
> There would obviously be different conditions on which to base this 
> value. I seek community opinion on this.
> 
> -- 
> Atul
> 
> EnterpriseDB
> www.enterprisedb.com
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

pgsql-hackers by date:

From: "Joshua D. Drake"
Date: 11 March 2008, 15:07:39
Subject: Re: Patch for Prevent pg_dump/pg_restore from being affected by statement_timeout

From: Bruce Momjian
Date: 11 March 2008, 17:06:03
Subject: Re: Sorted writes in checkpoint

Re: Planning large IN lists - Mailing list pgsql-hackers

Previous

Next