Home > mailing lists

Re: Optimization idea for long IN() lists - Mailing list pgsql-performance

From	Maxim Boguk
Subject	Re: Optimization idea for long IN() lists
Date	August 9, 2014 04:22:48
Msg-id	CAK-MWwRR80afBo6UHk201J+Q+gz8Wxg9kjE0-PFMtP3j15oZOQ@mail.gmail.com Whole thread
In response to	Optimization idea for long IN() lists (Josh Berkus <josh@agliodbs.com>)
List	pgsql-performance

Tree view

On Sat, Aug 9, 2014 at 5:15 AM, Josh Berkus <josh@agliodbs.com> wrote:

Folks,

So one thing we tell users who have chronically long IN() lists is that
they should create a temporary table and join against that instead.
Other than not having the code, is there a reason why PostgreSQL
shouldn't do something like this behind the scenes, automatically?

Hi Josh,

I know that problem for many years.

There are some workaround which doesn't require using the temporary tables (and I used that approach quite a lot when performance matter):

Instead of using:
SELECT * FROM sometable
WHERE
somefield IN (val1, val2, ...)
AND other_filters;

Query could be written as:
SELECT * FROM sometable
JOIN (VALUES ((val1), (val2) ...)) AS v(somefield) ON v.somefield=sometable.somefield
WHERE

other_filters;

When there no index on somefield query plans would look like as:

Original query:

Filter: (somefield = ANY ('{...}'::integer[]))

vs optimized query:

Hash Join (cost=0.25..117.89 rows=22 width=59) (actual time=5.332..5.332 rows=0 loops=1)
   Hash Cond: (sometable.somefield = "*VALUES*".somefield)
...
   -> Hash (cost=0.12..0.12 rows=10 width=4) (actual time=0.010..0.010 rows=10 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 1kB
         -> Values Scan on "*VALUES*" (cost=0.00..0.12 rows=10 width=4) (actual time=0.001..0.003 rows=10 loops=1)

In synthetic data I observed the following performance results (fully in-memory data with integer values):

List length            IN Performance         JOIN VALUES Performance
       10                     5.39ms                         5.38ms
      100                     9.74ms                         5.49ms
     1000                    53.02ms                         9.89ms
    10000                   231.10ms                        13.14ms

So starting from 10 elements VALUES/HASH JOIN approach is clear winner.

In case of the text literals IN list performance difference even more obvious (~2 order of magnitude for 10000 list).

However, if IN list used for the primary key lookup - there are no visible performance difference between these two approaches.

So yes there are some space for optimization of "Filter: (somefield = ANY ('{...}'::integer[]))" via hashing.

Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."

pgsql-performance by date:

From: Josh Berkus
Date: 08 August 2014, 19:15:45
Subject: Optimization idea for long IN() lists

From: 楊新波
Date: 12 August 2014, 02:59:15
Subject: how does the planer to estimate row when i use order by and group by

Re: Optimization idea for long IN() lists - Mailing list pgsql-performance

Previous

Next