Re: Any disadvantages of using =ANY(ARRAY()) instead of IN? - Mailing list pgsql-performance
From | Noah Misch |
---|---|
Subject | Re: Any disadvantages of using =ANY(ARRAY()) instead of IN? |
Date | |
Msg-id | 20120510082039.GA27567@tornado.leadboat.com Whole thread Raw |
In response to | Any disadvantages of using =ANY(ARRAY()) instead of IN? (Clemens Eisserer <linuxhippy@gmail.com>) |
Responses |
Re: Any disadvantages of using =ANY(ARRAY()) instead of IN?
|
List | pgsql-performance |
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote: > Quite often Hibernate ends up generating queries with a lot of joins > which usually works well, except for queries which load some > additional data based on a previous query (SUBSELECT collections), > which look like: > > select ..... from table1 ... left outer join table 15 .... WHERE > table1.id IN (select id .... join table16 ... join table20 WHERE > table20.somevalue=?) > > Starting with some amount of joins, the optimizer starts to do quite > suboptimal things like hash-joining huge tables where selctivity would > very low. > I already raised join_collapse_limit and from_collapse_limit, but > after a certain point query planning starts to become very expensive. Since you have 15+ tables at the top level, the genetic query optimizer should be kicking in and delivering a plan in reasonable time, albeit with plan quality hazards. There's a danger zone when the deterministic planner is still in effect but {from,join}_collapse_limit have limited the scope of its investigation. If you're in that zone and have not hand-tailored your explicit join order, poor plans are unsurprising. What exact configuration changes are you using? http://wiki.postgresql.org/wiki/Server_Configuration > However, when using " =ANY(ARRAY(select ...))" instead of "IN" the > planner seems to do a lot better, most likely because it treats the > subquery as a black-box that needs to be executed independently. I've > hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot > better than using "IN". I have also used that transformation to get better plans. It can help for the reason you say. Specifically, fewer and different plan types are available for the ANY(ARRAY(...)) case, and the row count estimate from the inner subquery does not propagate upward as it does with IN (SELECT ...). > However, I am a bit uncertain: > - Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query > will only return a small amount (0-100s) of rows? Hundreds of rows, no. Consider this example: CREATE TABLE t (c) AS SELECT * FROM generate_series(1,1000000); ANALYZE t; \set n 500 EXPLAIN ANALYZE SELECT * FROM t WHERE c IN (SELECT c FROM t WHERE c <= :n); EXPLAIN ANALYZE SELECT * FROM t WHERE c = ANY (ARRAY(SELECT c FROM t WHERE c <= :n)); IN(...): Hash Semi Join (cost=16931.12..33986.58 rows=490 width=4) (actual time=582.421..2200.322 rows=500 loops=1) Hash Cond: (public.t.c = public.t.c) -> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.093..785.330 rows=1000000 loops=1) -> Hash (cost=16925.00..16925.00 rows=490 width=4) (actual time=582.289..582.289 rows=500 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 18kB -> Seq Scan on t (cost=0.00..16925.00 rows=490 width=4) (actual time=0.026..581.766 rows=500 loops=1) Filter: (c <= 500) Rows Removed by Filter: 999500 Total runtime: 2200.767 ms ANY(ARRAY(...)): Seq Scan on t (cost=16925.00..43850.00 rows=10 width=4) (actual time=305.543..11748.014 rows=500 loops=1) Filter: (c = ANY ($0)) Rows Removed by Filter: 999500 InitPlan 1 (returns $0) -> Seq Scan on t (cost=0.00..16925.00 rows=490 width=4) (actual time=0.012..304.748 rows=500 loops=1) Filter: (c <= 500) Rows Removed by Filter: 999500 Total runtime: 11748.348 ms Note also the difference in output row estimates; that doesn't affect planning here, but it could matter a lot if this snippet became part of a larger query. Cut "n" to 5, though, and the ANY plan beats the IN plan at 800ms vs. 2400ms. (Exact timing figures are fairly unstable on this particular test.) It appears that, unsurprisingly, evaluating a short filter list is cheaper than probing a hash table. > - Shouldn't the optimizer be a bit smarter avoiding optimizing this > case in the first place, instead of bailing out later? Should I file a > bug-report about this problem? Filing a bug report with the content you've already posted would not add much, but a self-contained test case could prove useful. Many of the deficiencies that can make ANY(ARRAY(...)) win do represent unimplemented planner intelligence more than bugs. Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely from suppressing the subquery collapse. Keep "IN" but tack "OFFSET 0" onto the subquery. If this gives the same performance as ANY(ARRAY(...)), then the subquery-collapse suppression was indeed the source of advantage. nm
pgsql-performance by date: