Thread: Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hi,

I am using PostgreSQL as the database for a Hibernate-based Java OLTP project and, as in previous projects, am totally impressed by PostgreSQL's robustness, performance and feature-richness. Thanks for this excellent piece of software.

Quite often Hibernate ends up generating queries with a lot of joins, which usually works well, except for queries which load some additional data based on a previous query (SUBSELECT collections), which look like:

select ..... from table1 ... left outer join table15 ....
WHERE table1.id IN (select id .... join table16 ... join table20 WHERE table20.somevalue=?)

Starting with some number of joins, the optimizer starts to do quite suboptimal things like hash-joining huge tables where selectivity would be very low. I already raised join_collapse_limit and from_collapse_limit, but after a certain point query planning starts to become very expensive.

However, when using "=ANY(ARRAY(select ...))" instead of "IN", the planner seems to do a lot better, most likely because it treats the subquery as a black box that needs to be executed independently. I've hacked Hibernate a bit to use ANY+ARRAY, and it seems to work a lot better than using "IN".

However, I am a bit uncertain:
- Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query will only return a small number (0-100s) of rows?
- Shouldn't the optimizer be a bit smarter, avoiding optimizing this case in the first place instead of bailing out later? Should I file a bug report about this problem?

Thank you in advance, Clemens
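[The rewrite described above can be sketched on a generic query of the same shape. This is only an illustration: the table names, join columns, and aliases below are placeholders, not the real schema from the thread.]

```sql
-- Original Hibernate-style query: the planner may pull the IN-subquery up
-- into the outer join tree and plan all the tables together.
SELECT t1.*
FROM table1 t1
LEFT OUTER JOIN table15 t15 ON t15.table1_id = t1.id
WHERE t1.id IN (SELECT t16.id
                FROM table16 t16
                JOIN table20 t20 ON t20.table16_id = t16.id
                WHERE t20.somevalue = ?);

-- Rewritten form: ARRAY(SELECT ...) forces the subquery to run first as an
-- independent InitPlan; the outer query then just filters against the
-- resulting array. The array is materialized in memory, so this only suits
-- subqueries returning a small number of rows.
SELECT t1.*
FROM table1 t1
LEFT OUTER JOIN table15 t15 ON t15.table1_id = t1.id
WHERE t1.id = ANY (ARRAY(SELECT t16.id
                         FROM table16 t16
                         JOIN table20 t20 ON t20.table16_id = t16.id
                         WHERE t20.somevalue = ?));
```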
Clemens Eisserer <linuxhippy@gmail.com> writes:
> Quite often Hibernate ends up generating queries with a lot of joins
> which usually works well, except for queries which load some
> additional data based on a previous query (SUBSELECT collections),
> which look like:
> select ..... from table1 ... left outer join table15 ....
> WHERE table1.id IN (select id .... join table16 ... join table20
> WHERE table20.somevalue=?)
> Starting with some number of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selectivity
> would be very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

What PG version are we talking about here?

> However, when using "=ANY(ARRAY(select ...))" instead of "IN" the
> planner seems to do a lot better, most likely because it treats the
> subquery as a black box that needs to be executed independently. I've
> hacked Hibernate a bit to use ANY+ARRAY, and it seems to work a lot
> better than using "IN".

That doesn't sound like a tremendously good idea to me. But with so few details, it's hard to comment intelligently. Can you provide a concrete test case?

			regards, tom lane
Hi Tom,

Thanks for your reply.

> What PG version are we talking about here?

For development I use 9.1.3; the production server runs 8.4.7. It happens in both cases.

> That doesn't sound like a tremendously good idea to me.

Could you elaborate on the downsides of this approach a bit?

> But with so few details, it's hard to comment intelligently.
> Can you provide a concrete test case?

A self-contained testcase would take some time to create (and list members would have to be willing to configure and run it), so I hope a query as well as an EXPLAIN ANALYZE run will provide more information (done with 9.1.3): http://pastebin.com/BGRdAPg2

It's kind of the worst-worst case, which I will improve later (way too many relations loaded through join-fetching), but it's quite a good way to show the issue. Replacing the IN with an ANY(ARRAY()) already yields a far better plan.

Thank you in advance, Clemens
Hi again,

>> That doesn't sound like a tremendously good idea to me.
> Could you elaborate on the downsides of this approach a bit?

Any other thoughts about the pros/cons of replacing IN(subquery) with =ANY(ARRAY(subquery))? Are there pathological cases, other than when the subquery returns a huge number of rows?

Should I file a bug report about the optimizer trying too hard to collapse the subquery and therefore generating a bad plan?

Thank you in advance, Clemens
Hi,

I would be really grateful for feedback regarding this issue. Tom?

Should I file a bug report about the optimizer trying too hard to collapse the subquery and therefore generating a bad plan? It's my understanding that an IN shouldn't perform any worse than ANY on an ARRAY, right?

Thank you in advance, Clemens
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote:
> Quite often Hibernate ends up generating queries with a lot of joins
> which usually works well, except for queries which load some
> additional data based on a previous query (SUBSELECT collections),
> which look like:
>
> select ..... from table1 ... left outer join table15 ....
> WHERE table1.id IN (select id .... join table16 ... join table20
> WHERE table20.somevalue=?)
>
> Starting with some number of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selectivity
> would be very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

Since you have 15+ tables at the top level, the genetic query optimizer should be kicking in and delivering a plan in reasonable time, albeit with plan quality hazards. There's a danger zone when the deterministic planner is still in effect but {from,join}_collapse_limit have limited the scope of its investigation. If you're in that zone and have not hand-tailored your explicit join order, poor plans are unsurprising. What exact configuration changes are you using?

http://wiki.postgresql.org/wiki/Server_Configuration

> However, when using "=ANY(ARRAY(select ...))" instead of "IN" the
> planner seems to do a lot better, most likely because it treats the
> subquery as a black box that needs to be executed independently. I've
> hacked Hibernate a bit to use ANY+ARRAY, and it seems to work a lot
> better than using "IN".

I have also used that transformation to get better plans. It can help for the reason you say. Specifically, fewer and different plan types are available for the ANY(ARRAY(...)) case, and the row count estimate from the inner subquery does not propagate upward as it does with IN (SELECT ...).
> However, I am a bit uncertain:
> - Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query
> will only return a small amount (0-100s) of rows?

Hundreds of rows, no. Consider this example:

CREATE TABLE t (c) AS SELECT * FROM generate_series(1,1000000);
ANALYZE t;
\set n 500
EXPLAIN ANALYZE SELECT * FROM t WHERE c IN (SELECT c FROM t WHERE c <= :n);
EXPLAIN ANALYZE SELECT * FROM t WHERE c = ANY (ARRAY(SELECT c FROM t WHERE c <= :n));

IN(...):
 Hash Semi Join  (cost=16931.12..33986.58 rows=490 width=4) (actual time=582.421..2200.322 rows=500 loops=1)
   Hash Cond: (public.t.c = public.t.c)
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.093..785.330 rows=1000000 loops=1)
   ->  Hash  (cost=16925.00..16925.00 rows=490 width=4) (actual time=582.289..582.289 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 18kB
         ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.026..581.766 rows=500 loops=1)
               Filter: (c <= 500)
               Rows Removed by Filter: 999500
 Total runtime: 2200.767 ms

ANY(ARRAY(...)):
 Seq Scan on t  (cost=16925.00..43850.00 rows=10 width=4) (actual time=305.543..11748.014 rows=500 loops=1)
   Filter: (c = ANY ($0))
   Rows Removed by Filter: 999500
   InitPlan 1 (returns $0)
     ->  Seq Scan on t  (cost=0.00..16925.00 rows=490 width=4) (actual time=0.012..304.748 rows=500 loops=1)
           Filter: (c <= 500)
           Rows Removed by Filter: 999500
 Total runtime: 11748.348 ms

Note also the difference in output row estimates; that doesn't affect planning here, but it could matter a lot if this snippet became part of a larger query.

Cut "n" to 5, though, and the ANY plan beats the IN plan at 800ms vs. 2400ms. (Exact timing figures are fairly unstable on this particular test.) It appears that, unsurprisingly, evaluating a short filter list is cheaper than probing a hash table.

> - Shouldn't the optimizer be a bit smarter avoiding optimizing this
> case in the first place, instead of bailing out later? Should I file a
> bug-report about this problem?
Filing a bug report with the content you've already posted would not add much, but a self-contained test case could prove useful. Many of the deficiencies that can make ANY(ARRAY(...)) win represent unimplemented planner intelligence more than bugs.

Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely from suppressing the subquery collapse. Keep "IN" but tack "OFFSET 0" onto the subquery. If this gives the same performance as ANY(ARRAY(...)), then the subquery-collapse suppression was indeed the source of the advantage.

nm
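[The "OFFSET 0" experiment suggested above can be sketched on the same generic query shape used earlier in the thread; table and column names are placeholders.]

```sql
-- OFFSET 0 does not change the result, but PostgreSQL will not pull a
-- subquery containing OFFSET/LIMIT up into the outer join tree, so the
-- subquery is planned and executed on its own - the same collapse
-- suppression that ANY(ARRAY(...)) provides, while keeping IN semantics.
SELECT t1.*
FROM table1 t1
WHERE t1.id IN (SELECT t16.id
                FROM table16 t16
                JOIN table20 t20 ON t20.table16_id = t16.id
                WHERE t20.somevalue = ?
                OFFSET 0);
```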
Hello Noah,

Thanks a lot for your feedback and explanations.

> Since you have 15+ tables at the top level, the genetic query optimizer
> should be kicking in and delivering a plan in reasonable time, albeit
> with plan quality hazards. There's a danger zone when the deterministic
> planner is still in effect but {from,join}_collapse_limit have limited
> the scope of its investigation. If you're in that zone and have not
> hand-tailored your explicit join order, poor plans are unsurprising.
> What exact configuration changes are you using?

Basically only the changes suggested here a year ago, which made the problem go away for less complex queries:

geqo_threshold = 20
from_collapse_limit = 13
join_collapse_limit = 13

> Hundreds of rows, no. Consider this example:
> IN(...):
> Total runtime: 2200.767 ms
>
> ANY(ARRAY(...)):
> Total runtime: 11748.348 ms

In case there is an index on c, the resulting index scan is, even with 1000 elements, 3 times faster on my notebook. However, both queries execute in next to no time (15 vs. 5 ms).

> Filing a bug report with the content you've already posted would not add
> much, but a self-contained test case could prove useful. Many of the
> deficiencies that can make ANY(ARRAY(...)) win do represent unimplemented
> planner intelligence more than bugs.
>
> Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes
> solely from suppressing the subquery collapse. Keep "IN" but tack
> "OFFSET 0" onto the subquery. If this gives the same performance as
> ANY(ARRAY(...)), then the subquery-collapse suppression was indeed the
> source of advantage.

I see your point; some dumb logic to replace IN with ANY(ARRAY()) wouldn't always yield better results. I'll try to come up with a self-contained testcase.

Thanks again, Clemens
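[Clemens's index observation can be tried against Noah's earlier test table `t`. A sketch; the index name is chosen here, not taken from the thread, and exact timings will vary by machine.]

```sql
-- With an index on t.c, the ANY(ARRAY(...)) variant can switch from a
-- sequential scan filtering every row against the array to repeated
-- index probes, one per array element.
CREATE INDEX t_c_idx ON t (c);  -- hypothetical index name
ANALYZE t;

EXPLAIN ANALYZE
SELECT * FROM t WHERE c = ANY (ARRAY(SELECT c FROM t WHERE c <= 500));
```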
On Tue, May 01, 2012 at 04:34:10PM +0200, Clemens Eisserer wrote:
> select ..... from table1 ... left outer join table15 ....
> WHERE table1.id IN (select id .... join table16 ... join table20
> WHERE table20.somevalue=?)
>
> Starting with some number of joins, the optimizer starts to do quite
> suboptimal things like hash-joining huge tables where selectivity
> would be very low.
> I already raised join_collapse_limit and from_collapse_limit, but
> after a certain point query planning starts to become very expensive.

On Sun, May 13, 2012 at 04:35:30PM +0200, Clemens Eisserer wrote:
> > Since you have 15+ tables at the top level, the genetic query optimizer
> > should be kicking in and delivering a plan in reasonable time, albeit
> > with plan quality hazards. There's a danger zone when the deterministic
> > planner is still in effect but {from,join}_collapse_limit have limited
> > the scope of its investigation. If you're in that zone and have not
> > hand-tailored your explicit join order, poor plans are unsurprising.
> > What exact configuration changes are you using?
>
> Basically only the changes suggested here a year ago, which made the
> problem go away for less complex queries:
>
> geqo_threshold = 20
> from_collapse_limit = 13
> join_collapse_limit = 13

Given those settings and the query above, the planner will break the 15 top-level tables into lists of 13 and 2 tables, and the 20 subquery tables into lists of 13 and 7 tables. The split points arise from order of appearance in the query text. The planner then optimizes join order within each list but not across lists. That perfectly explains your observation of "hash-joining huge tables where selectivity would be very low".

If it were me, I would try two things. First, set from_collapse_limit = 100, join_collapse_limit = 100, geqo_threshold = 8. This will let the planner consider all join orders for the 35 tables; it will use the genetic query optimizer to choose one. See if the plan time and resulting plan are decent.

Second, set from_collapse_limit = 100, join_collapse_limit = 100, geqo = off and EXPLAIN the query. This will take forever and might exhaust all system memory. If it does complete, you'll see the standard planner's opinion of an optimal join order. You can then modify the textual join order in your query to get the same plan despite returning to a lower join_collapse_limit. Since Hibernate is generating your queries, that may prove inconvenient. It's the remaining escape hatch if the genetic query optimizer does not suffice.
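[The two experiments above correspond to session-level settings along these lines. A sketch: the settings come from the mail, but the query itself is abbreviated, as in the rest of the thread.]

```sql
-- Experiment 1: raise the collapse limits so the planner sees the full
-- 35-table search space, and lower geqo_threshold so the genetic query
-- optimizer (GEQO) handles it instead of exhaustive search.
SET from_collapse_limit = 100;
SET join_collapse_limit = 100;
SET geqo_threshold = 8;
EXPLAIN ANALYZE SELECT ...;  -- the original 15-join + 20-table-subquery query

-- Experiment 2: force the exhaustive standard planner (can be extremely
-- slow and memory-hungry at this join count) to see its preferred join
-- order, then hard-code that textual join order in the query.
SET geqo = off;
EXPLAIN SELECT ...;
```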
Ah, forgot one query: WHERE IN is of course fast when we supply ids directly, but not when they are wrapped as an array and UNNESTed in query 6 (previous post from me).

-- Test 6b: Fast. WHERE IN(explicit id list)
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id IN (100001,100002,100003,100004,100005,100006,100007,100008,100009,10010);

-- Geir Bostad

9.1.3(x64,win)
I've encountered similar issues myself (with UNION so far), so I tried to build a simple test case, which may or may not cover Clemens's case. Test cases 1 and 2 illustrate the issue, and cases 3-9 are variations.

My observation: it looks like the optimizer cannot be close friends with both UNION and IN/JOIN at the same time. Actually - it looks like the UNION SELECT kids don't wanna share the IN/JOIN toy we gave them, but are happy when they get their own toys to play with ;)

DROP TABLE IF EXISTS table1;
CREATE TABLE table1 AS SELECT i AS id FROM generate_series(1, 300000) S(i);
CREATE INDEX ON table1(id);
ANALYZE table1;

-- Test 1: Slow. IN()
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id IN (SELECT id FROM table1 LIMIT 10);

-- Test 2: Fast. ANY(ARRAY())
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id = ANY(ARRAY(SELECT id FROM table1 LIMIT 10));

-- Test 3: Fast. Duplicate IN. Symptom fix? Or would you call it a "better"
-- query in terms of SQL? (Except for the unnecessary subquery, which I kept
-- for readability.)
SELECT * FROM (
    SELECT * FROM table1 WHERE id IN (SELECT id FROM table1 LIMIT 10)
    UNION
    SELECT * FROM table1 WHERE id IN (SELECT id FROM table1 LIMIT 10)
) Q;

-- Test 4: Fast. Duplicate JOIN CTE.
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
    SELECT * FROM table1 JOIN id_list USING(id)
    UNION
    SELECT * FROM table1 JOIN id_list USING(id)
) Q;

-- Test 5: Slow. IN(CTE)
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id IN (SELECT * FROM id_list);

-- Test 6: Slow. IN(UNNEST of explicit id list)
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id IN (SELECT UNNEST('{100001,100002,100003,100004,100005,100006,100007,100008,100009,10010}'::BIGINT[]) AS id);

-- Test 7: Slow. IN(UNNEST(ARRAY()))
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
WHERE id IN (SELECT UNNEST(ARRAY(SELECT id FROM table1 LIMIT 10)) AS id);

-- Test 8: Slow. JOIN CTE
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT * FROM (
    SELECT * FROM table1
    UNION
    SELECT * FROM table1
) Q
JOIN id_list USING(id);

-- Test 9: Fast. JOIN CTE + UNION ALL/DISTINCT (not quite the same query)
WITH id_list AS (SELECT id FROM table1 LIMIT 10)
SELECT DISTINCT * FROM (
    SELECT * FROM table1
    UNION ALL
    SELECT * FROM table1
) Q
JOIN id_list USING(id);

-- Geir Bostad

9.1.3(x64,win)