Re: Any disadvantages of using =ANY(ARRAY()) instead of IN? - Mailing list pgsql-performance

From Noah Misch
Subject Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?
Date
Msg-id 20120510082039.GA27567@tornado.leadboat.com
Whole thread Raw
In response to Any disadvantages of using =ANY(ARRAY()) instead of IN?  (Clemens Eisserer <linuxhippy@gmail.com>)
Responses Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?
List pgsql-performance
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote:
> Quite often Hibernate ends up generating queries with a lot of joins
> which usually works well, except for queries which load some
> additional data based on a previous query (SUBSELECT collections),
> which look like:
>
> select ..... from table1 ... left outer join table 15 .... WHERE
> table1.id IN (select id .... join table16 ... join table20 WHERE
> table20.somevalue=?)
>
> Starting with some amount of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selctivity would
> very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

Since you have 15+ tables at the top level, the genetic query optimizer should
be kicking in and delivering a plan in reasonable time, albeit with plan
quality hazards.  There's a danger zone when the deterministic planner is
still in effect but {from,join}_collapse_limit have limited the scope of its
investigation.  If you're in that zone and have not hand-tailored your
explicit join order, poor plans are unsurprising.  What exact configuration
changes are you using?

http://wiki.postgresql.org/wiki/Server_Configuration

> However, when using " =ANY(ARRAY(select ...))" instead of "IN" the
> planner seems to do a lot better, most likely because it treats the
> subquery as a black-box that needs to be executed independently. I've
> hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot
> better than using "IN".

I have also used that transformation to get better plans.  It can help for the
reason you say.  Specifically, fewer and different plan types are available
for the ANY(ARRAY(...)) case, and the row count estimate from the inner
subquery does not propagate upward as it does with IN (SELECT ...).

> However, I am a bit uncertain:
> - Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query
> will only return a small amount (0-100s) of rows?

Hundreds of rows, no.  Consider this example:

CREATE TABLE t (c) AS SELECT * FROM generate_series(1,1000000);
ANALYZE t;
\set n 500
EXPLAIN ANALYZE SELECT * FROM t WHERE c IN (SELECT c FROM t WHERE c <= :n);
EXPLAIN ANALYZE SELECT * FROM t WHERE c = ANY (ARRAY(SELECT c FROM t WHERE c <= :n));

IN(...):
 Hash Semi Join  (cost=16931.12..33986.58 rows=490 width=4) (actual time=582.421..2200.322 rows=500 loops=1)
   Hash Cond: (public.t.c = public.t.c)
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.093..785.330 rows=1000000 loops=1)
   ->  Hash  (cost=16925.00..16925.00 rows=490 width=4) (actual time=582.289..582.289 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 18kB
         ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.026..581.766 rows=500 loops=1)
               Filter: (c <= 500)
               Rows Removed by Filter: 999500
 Total runtime: 2200.767 ms

ANY(ARRAY(...)):
 Seq Scan on t  (cost=16925.00..43850.00 rows=10 width=4) (actual time=305.543..11748.014 rows=500 loops=1)
   Filter: (c = ANY ($0))
   Rows Removed by Filter: 999500
   InitPlan 1 (returns $0)
     ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.012..304.748 rows=500 loops=1)
           Filter: (c <= 500)
           Rows Removed by Filter: 999500
 Total runtime: 11748.348 ms

Note also the difference in output row estimates; that doesn't affect planning
here, but it could matter a lot if this snippet became part of a larger query.

Cut "n" to 5, though, and the ANY plan beats the IN plan at 800ms vs. 2400ms.
(Exact timing figures are fairly unstable on this particular test.)  It
appears that, unsurprisingly, evaluating a short filter list is cheaper than
probing a hash table.

> - Shouldn't the optimizer be a bit smarter avoiding optimizing this
> case in the first place, instead of bailing out later? Should I file a
> bug-report about this problem?

Filing a bug report with the content you've already posted would not add much,
but a self-contained test case could prove useful.  Many of the deficiencies
that can make ANY(ARRAY(...)) win do represent unimplemented planner
intelligence more than bugs.

Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely
from suppressing the subquery collapse.  Keep "IN" but tack "OFFSET 0" onto
the subquery.  If this gives the same performance as ANY(ARRAY(...)), then the
subquery-collapse suppression was indeed the source of advantage.

nm

pgsql-performance by date:

Previous
From: Claudio Freire
Date:
Subject: Re: Could synchronous streaming replication really degrade the performance of the primary?
Next
From: "MauMau"
Date:
Subject: Re: Could synchronous streaming replication really degrade the performance of the primary?