Re: Planning large IN lists - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Planning large IN lists |
Date | |
Msg-id | 200803111815.m2BIF5e17619@momjian.us Whole thread Raw |
In response to | Re: Planning large IN lists ("Atul Deopujari" <atul.deopujari@enterprisedb.com>) |
List | pgsql-hackers |
Added to TODO: * Consider using a hash for joining to a large IN (VALUES ...) list http://archives.postgresql.org/pgsql-hackers/2007-05/msg00450.php --------------------------------------------------------------------------- Atul Deopujari wrote: > Hi, > > Tom Lane wrote: > > Neil Conway <neilc@samurai.com> writes: > >> When planning queries with a large IN expression in the WHERE clause, > >> the planner transforms the IN list into a scalar array expression. In > >> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr > >> by calling scalararraysel(), which in turn estimates the selectivity of > >> *each* array element in order to determine the selectivity of the array > >> expression as a whole. > > > >> This is quite inefficient when the IN list is large. > > > > That's the least of the problems. We really ought to convert such cases > > into an IN (VALUES(...)) type of query, since often repeated indexscans > > aren't the best implementation. > > > I thought of giving this a shot and while I was working on it, it > occurred to me that we need to decide on a threshold value of the IN > list size above which such transformation should take place. For small > sizes of the IN list, scalararraysel() of IN list wins over the hash > join involved in IN (VALUES(...)). But for larger sizes of the IN list, > IN (VALUES(...)) comes out to be a clear winner. > I would like to know what does the community think should be a heuristic > value of the IN list size beyond which this transformation should take > place. > I was thinking of a GUC variable (or a hard coded value) which defaults > to say 30. This is based on numbers from the following test: > > postgres=# create table w (w text); > CREATE TABLE > > postgres=# \copy w from '/usr/share/dict/words' > > And run the following query with different IN list sizes > explain analyze select * from w where w in ('one', 'two', ...); > > I got the following runtimes: > ------------------------------------ > IN list IN (VALUES(...)) IN > size > ------------------------------------ > 150 ~2000 ms ~5500 ms > 100 ~1500 ms ~4000 ms > 80 ~1400 ms ~3000 ms > 50 ~1400 ms ~2500 ms > 30 ~1500 ms ~1500 ms > 20 ~1400 ms ~1200 ms > 10 ~1400 ms ~1200 ms > ------------------------------------ > > The IN (VALUES(...)) gives an almost steady state behavior, while the IN > runtimes deteriorate with growing list size. > > There would obviously be different conditions on which to base this > value. I seek community opinion on this. > > -- > Atul > > EnterpriseDB > www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
pgsql-hackers by date: