Thread: Any disadvantages of using =ANY(ARRAY()) instead of IN?

Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Clemens Eisserer
Date:
Hi,

I am using postgresql as database for a hibernate based java oltp
project and as in previous projects am totally impressed by
postgresql's robustness, performance and feature-richness. Thanks for
this excellent piece of software.

Quite often Hibernate ends up generating queries with a lot of joins
which usually works well, except for queries which load some
additional data based on a previous query (SUBSELECT collections),
which look like:

select ..... from table1 ... left outer join table 15 .... WHERE
table1.id IN (select id .... join table16 ... join table20 WHERE
table20.somevalue=?)

Starting with some amount of joins, the optimizer starts to do quite
suboptimal things like hash-joining huge tables where selctivity would
very low.
I already raised join_collapse_limit and from_collapse_limit, but
after a certain point query planning starts to become very expensive.

However, when using " =ANY(ARRAY(select ...))" instead of "IN" the
planner seems to do a lot better, most likely because it treats the
subquery as a black-box that needs to be executed independently. I've
hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot
better than using "IN".

However, I am a bit uncertain:
- Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query
will only return a small amount (0-100s) of rows?
- Shouldn't the optimizer be a bit smarter avoiding optimizing this
case in the first place, instead of bailing out later? Should I file a
bug-report about this problem?

Thank you in advance, Clemens

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Tom Lane
Date:
Clemens Eisserer <linuxhippy@gmail.com> writes:
> Quite often Hibernate ends up generating queries with a lot of joins
> which usually works well, except for queries which load some
> additional data based on a previous query (SUBSELECT collections),
> which look like:

> select ..... from table1 ... left outer join table 15 .... WHERE
> table1.id IN (select id .... join table16 ... join table20 WHERE
> table20.somevalue=?)

> Starting with some amount of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selctivity would
> very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

What PG version are we talking about here?

> However, when using " =ANY(ARRAY(select ...))" instead of "IN" the
> planner seems to do a lot better, most likely because it treats the
> subquery as a black-box that needs to be executed independently. I've
> hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot
> better than using "IN".

That doesn't sound like a tremendously good idea to me.  But with
so few details, it's hard to comment intelligently.  Can you provide
a concrete test case?

            regards, tom lane

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Clemens Eisserer
Date:
Hi Tom,

Thanks for your reply.

> What PG version are we talking about here?
For development I use 9.1.3, on the production server is 8.4.7 -
happens with both cases.

> That doesn't sound like a tremendously good idea to me.
Could you elaborate on the downsides of this approach a bit?

> But with
> so few details, it's hard to comment intelligently.
> Can you provide a concrete test case?

A self contained testcase would take some time to create (and list
members willing to configure and run), so I hope a query as well as an
explain-analyze run will provide more information (done with 9.1.3):
http://pastebin.com/BGRdAPg2

Its kind of the worst-worst case which I will improve later (way too
much relations loaded through join-fetching), but its quite a good way
to show the issue. Replacing the IN with a ANY(ARRAY()) already yields
a way better plan.

Thank you in advance, Clemens

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Clemens Eisserer
Date:
Hi again,

>> That doesn't sound like a tremendously good idea to me.
> Could you elaborate on the downsides of this approach a bit?

Any other thoughts about the pro/cons replacing IN(subquery) with
=ANY(ARRAY(subquery))?
Are there patological cases, except when the subquery returns a huge
amount of rows?

Should Ifile a bug-report about the optimizer trying too hard to
collapse the subquery and therefor generating a bad plan?

Thank you in advance, Clemens

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Clemens Eisserer
Date:
Hi,

I would be really grateful for feedback regardding this issue. Tom?

Should Ifile a bug-report about the optimizer trying too hard to
collapse the subquery and therefor generating a bad plan?
Its my understanding that a IN shouldn't perform any worse than ANY on
an ARRAY, right?

Thank you in advance, Clemens

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Noah Misch
Date:
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote:
> Quite often Hibernate ends up generating queries with a lot of joins
> which usually works well, except for queries which load some
> additional data based on a previous query (SUBSELECT collections),
> which look like:
>
> select ..... from table1 ... left outer join table 15 .... WHERE
> table1.id IN (select id .... join table16 ... join table20 WHERE
> table20.somevalue=?)
>
> Starting with some amount of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selctivity would
> very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

Since you have 15+ tables at the top level, the genetic query optimizer should
be kicking in and delivering a plan in reasonable time, albeit with plan
quality hazards.  There's a danger zone when the deterministic planner is
still in effect but {from,join}_collapse_limit have limited the scope of its
investigation.  If you're in that zone and have not hand-tailored your
explicit join order, poor plans are unsurprising.  What exact configuration
changes are you using?

http://wiki.postgresql.org/wiki/Server_Configuration

> However, when using " =ANY(ARRAY(select ...))" instead of "IN" the
> planner seems to do a lot better, most likely because it treats the
> subquery as a black-box that needs to be executed independently. I've
> hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot
> better than using "IN".

I have also used that transformation to get better plans.  It can help for the
reason you say.  Specifically, fewer and different plan types are available
for the ANY(ARRAY(...)) case, and the row count estimate from the inner
subquery does not propagate upward as it does with IN (SELECT ...).

> However, I am a bit uncertain:
> - Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query
> will only return a small amount (0-100s) of rows?

Hundreds of rows, no.  Consider this example:

CREATE TABLE t (c) AS SELECT * FROM generate_series(1,1000000);
ANALYZE t;
\set n 500
EXPLAIN ANALYZE SELECT * FROM t WHERE c IN (SELECT c FROM t WHERE c <= :n);
EXPLAIN ANALYZE SELECT * FROM t WHERE c = ANY (ARRAY(SELECT c FROM t WHERE c <= :n));

IN(...):
 Hash Semi Join  (cost=16931.12..33986.58 rows=490 width=4) (actual time=582.421..2200.322 rows=500 loops=1)
   Hash Cond: (public.t.c = public.t.c)
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.093..785.330 rows=1000000 loops=1)
   ->  Hash  (cost=16925.00..16925.00 rows=490 width=4) (actual time=582.289..582.289 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 18kB
         ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.026..581.766 rows=500 loops=1)
               Filter: (c <= 500)
               Rows Removed by Filter: 999500
 Total runtime: 2200.767 ms

ANY(ARRAY(...)):
 Seq Scan on t  (cost=16925.00..43850.00 rows=10 width=4) (actual time=305.543..11748.014 rows=500 loops=1)
   Filter: (c = ANY ($0))
   Rows Removed by Filter: 999500
   InitPlan 1 (returns $0)
     ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.012..304.748 rows=500 loops=1)
           Filter: (c <= 500)
           Rows Removed by Filter: 999500
 Total runtime: 11748.348 ms

Note also the difference in output row estimates; that doesn't affect planning
here, but it could matter a lot if this snippet became part of a larger query.

Cut "n" to 5, though, and the ANY plan beats the IN plan at 800ms vs. 2400ms.
(Exact timing figures are fairly unstable on this particular test.)  It
appears that, unsurprisingly, evaluating a short filter list is cheaper than
probing a hash table.

> - Shouldn't the optimizer be a bit smarter avoiding optimizing this
> case in the first place, instead of bailing out later? Should I file a
> bug-report about this problem?

Filing a bug report with the content you've already posted would not add much,
but a self-contained test case could prove useful.  Many of the deficiencies
that can make ANY(ARRAY(...)) win do represent unimplemented planner
intelligence more than bugs.

Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely
from suppressing the subquery collapse.  Keep "IN" but tack "OFFSET 0" onto
the subquery.  If this gives the same performance as ANY(ARRAY(...)), then the
subquery-collapse suppression was indeed the source of advantage.

nm

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Clemens Eisserer
Date:
Hello Noah,

Thanks a lot for your feedback and explanations.

> Since you have 15+ tables at the top level, the genetic query optimizer should
> be kicking in and delivering a plan in reasonable time, albeit with plan
> quality hazards.  There's a danger zone when the deterministic planner is
> still in effect but {from,join}_collapse_limit have limited the scope of its
> investigation.  If you're in that zone and have not hand-tailored your
> explicit join order, poor plans are unsurprising.  What exact configuration
> changes are you using?

Basically only the changes, suggested here a year ago,  which made the
problem go away for less complex queries:

geqo_threshold = 20
from_collapse_limit = 13
join_collapse_limit = 13


> Hundreds of rows, no.  Consider this example:
> IN(...):
>  Total runtime: 2200.767 ms
>
> ANY(ARRAY(...)):
>  Total runtime: 11748.348 ms

In case there is an index on C, the resulting index scan is, even with
1000 elements, 3 times faster on my Notebook.
However, both queries execute in next-to-no time (15 vs 5ms).

> Filing a bug report with the content you've already posted would not add much,
> but a self-contained test case could prove useful.  Many of the deficiencies
> that can make ANY(ARRAY(...)) win do represent unimplemented planner
> intelligence more than bugs.
>
> Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely
> from suppressing the subquery collapse.  Keep "IN" but tack "OFFSET 0" onto
> the subquery.  If this gives the same performance as ANY(ARRAY(...)), then the
> subquery-collapse suppression was indeed the source of advantage.

I see your point, some dumb logic to replace IN with ANY(ARRAY
wouldn't always yield better results.
I'll try to come up with a self-containing testcase.

Thanks again, Clemens

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
Noah Misch
Date:
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote:
> select ..... from table1 ... left outer join table 15 .... WHERE
> table1.id IN (select id .... join table16 ... join table20 WHERE
> table20.somevalue=?)
>
> Starting with some amount of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selctivity would
> very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

On Sun, May 13, 2012 at 04:35:30PM +0200, Clemens Eisserer wrote:
> > Since you have 15+ tables at the top level, the genetic query optimizer should
> > be kicking in and delivering a plan in reasonable time, albeit with plan
> > quality hazards. ??There's a danger zone when the deterministic planner is
> > still in effect but {from,join}_collapse_limit have limited the scope of its
> > investigation. ??If you're in that zone and have not hand-tailored your
> > explicit join order, poor plans are unsurprising. ??What exact configuration
> > changes are you using?
>
> Basically only the changes, suggested here a year ago,  which made the
> problem go away for less complex queries:
>
> geqo_threshold = 20
> from_collapse_limit = 13
> join_collapse_limit = 13

Given those settings and the query above, the planner will break the 15
top-level tables into lists of 13 and 2 tables, the 20 subquery tables into
lists of 13 and 7 tables.  The split points arise from order of appearance in
the query text.  The planner then optimizes join order within each list but
not across lists.  That perfectly explains your observation of "hash-joining
huge tables where selctivity would very low".

If it were me, I would try two things.  First, set from_collapse_limit = 100,
join_collapse_limit = 100, geqo_threshold = 8.  This will let the planner
consider all join orders for the 35 tables; it will use the genetic query
optimizer to choose one.  See if the plan time and resulting plan are decent.

Second, set from_collapse_limit = 100, join_collapse_limit = 100, geqo = off
and EXPLAIN the query.  This will take forever and might exhaust all system
memory.  If it does complete, you'll see the standard planner's opinion of an
optimal join order.  You can then modify the textual join order in your query
to get the same plan despite returning to a lower join_collapse_limit.  Since
Hibernate is generating your queries, that may prove inconvenient.  It's the
remaining escape hatch if the genetic query optimizer does not suffice.

Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
geirB
Date:
Ah, forgot one query:
WHERE IN is of course fast when we supply id's directly, but not when
they are wrapped as array and UNNEST'ed in query 6. (previous post
from me)

-- Test 6b: Fast. WHERE IN(explicit id list)
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id IN
(100001,100002,100003,100004,100005,100006,100007,100008,100009,10010);

--
Geir Bostad
9.1.3(x64,win)


Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?

From
geirB
Date:
I've encoutered similar issues myself (with UNION so far), so I tried
to build a simple test case, which may or may not cover Clemens's
case.

Test case 1 and 2 illustrates the issue, and case 3-9 are variations.

My observation: Looks like the optimizer cannot be close friends with
both UNION and IN/JOIN at the same time.
Actually - it looks like the UNION SELECT kids don't wanna share the
IN/JOIN toy we gave them, but are happy when they get their own toys
to play with ;)


DROP TABLE IF EXISTS table1;
CREATE TABLE table1 AS  SELECT i AS id FROM generate_series(1, 300000)
S(i);
CREATE INDEX ON table1(id);
ANALYZE table1;


-- Test 1: Slow. IN()
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id IN (SELECT id FROM table1 LIMIT 10);


-- Test 2: Fast. ANY(ARRAY())
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id = ANY(ARRAY(SELECT id FROM table1 LIMIT 10));






-- Test 3: Fast. Duplicate IN. Symptom fix? Or would you call it a
"better" query in terms of sql? -except for the unnecessary subquery,
which I kept for readability.
SELECT * FROM (
    SELECT * FROM table1
    WHERE id IN (SELECT id FROM table1 LIMIT 10)
    UNION
    SELECT * FROM table1
    WHERE id IN (SELECT id FROM table1 LIMIT 10)
) Q;


-- Test 4: Fast. Duplicate JOIN CTE.
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
     SELECT * FROM table1 JOIN id_list USING(id)
     UNION
     SELECT * FROM table1 JOIN id_list USING(id)
) Q;


-- Test 5: Slow. IN(CTE)
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id IN (SELECT * FROM id_list);


-- Test 6: Slow. IN(explicit id list)
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id IN (SELECT
UNNEST('{100001,100002,100003,100004,100005,100006,100007,100008,100009,10010}'::BIGINT[] )
AS id);


-- Test 7: Slow. IN(UNNEST(ARRAY())
SELECT * FROM (
       SELECT * FROM table1
       UNION
       SELECT * FROM table1
) Q WHERE id IN (SELECT UNNEST(ARRAY(SELECT id FROM table1 LIMIT 10))
AS id);


-- Test 8: Slow. JOIN CTE
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
      SELECT * FROM table1
      UNION
      SELECT * FROM table1
) Q JOIN id_list USING(id);


-- Test 9: Fast. JOIN CTE + UNION ALL/DISTINCT (not quite the same
query)
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT DISTINCT * FROM (
      SELECT * FROM table1
      UNION ALL
      SELECT * FROM table1
) Q JOIN id_list USING(id);


--
Geir Bostad
9.1.3(x64,win)