Thread: Huge generated UNION ALL faster than JOIN?
Hi,

I recently stumbled across an interesting query performance question over at
StackOverflow [1], which caught my attention, so I started to investigate the
issue further.

One of the things consuming most of the time was an Index Only Scan executed
millions of times. On top of that came the Nested Loop, which finally reduced
the rows but also took a lot of time to do so.

Explain plan: https://explain.depesz.com/s/4GYT

My initial test using a ~200M row table [2] revealed a completely different
plan with a no-brainer performance of ~10.7 milliseconds vs. the original
6.25 minutes (10x more rows cannot explain that).

As the query included a LIMIT, I also started to play around with OFFSET,
because I know that at least OFFSET + LIMIT (sorted) rows have to be read.
One of my optimization attempts produced an interesting result: to eliminate
the JOIN and the massive number of loops, I decided to generate a huge
UNION ALL query on-the-fly using a function [3] (a simplified sketch of its
output follows below). Although it performs pretty poorly for no OFFSET, it
makes quite a difference at higher ones.

Here are the times; "exec_a" uses the plans as generated by PostgreSQL 11.2
and "exec_d" my UNION ALL version (all times include planning and execution):

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    10.694 |   746.892
    10000 |   175.858 |   653.218
   100000 |  1632.205 |   791.913
  1000000 | 11244.091 |  2274.160
  5000000 | 11567.438 |  9428.352
 10000000 | 13442.229 | 17026.783

Complete plans for all executions:

exec_a: https://explain.depesz.com/s/Ck1
exec_d: https://explain.depesz.com/s/ZoUu

A retest after upgrading to PostgreSQL 11.3 and adding another 200M rows
revealed yet another set of numbers:

 c_offset |    exec_a | exec_a_x2 |    exec_d | exec_d_x2
----------+-----------+-----------+-----------+-----------
        0 |    10.694 |    16.616 |   746.892 |   630.440
    10000 |   175.858 |   182.922 |   653.218 |   646.173
   100000 |  1632.205 |  1682.033 |   791.913 |   782.874
  1000000 | 11244.091 | 24781.706 |  2274.160 |  2306.577
  5000000 | 11567.438 | 24798.120 |  9428.352 |  8886.781
 10000000 | 13442.229 | 27315.650 | 17026.783 | 16808.223

One major difference in the "exec_a" plans is that starting with an OFFSET of
1000000, the planner switches from a "Merge Append" + "Nested Loop" to a
"Parallel Append" + "Hash Join" + "Sort" + "Gather Merge", whereas the plans
for "exec_d" always remain single-threaded.

My question now is: why can't the optimizer generate a plan that in this case
does 114 loops of "events" scans instead of a million loops on the
"subscription_signal"? There even is an index that spans both relevant
columns here (see [2]), which is used extensively in my UNION variant (as
intended and expected).

I also observed that while the Parallel Append eventually becomes faster due
to better I/O scalability (at least on my system, which uses a separate SSD
for the log and different index/data tablespaces), it saturates a lot of CPU
cores, causes a lot more I/O in general, and carries the bottleneck of
per-worker disk sorts. From the perspective of system resources this is not
really helpful, and it also doesn't seem to bring much benefit in my case, as
Parallel Append only saves ~10-20% (for OFFSET 1000000) vs. a standard Append
(with parallel index/seq scans of the partitions).
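For reference, the generated statement has roughly this shape (heavily
simplified; column names and literals are placeholders, the actual generator
function is in [3]):

    SELECT u.*
    FROM (
            SELECT e.id, e.signal_id, e.created_at
            FROM   events e
            WHERE  e.signal_id = 'signal-id-1'
          UNION ALL
            SELECT e.id, e.signal_id, e.created_at
            FROM   events e
            WHERE  e.signal_id = 'signal-id-2'
          -- ...one branch per matching signal_id (114 in this case)...
    ) u
    ORDER BY u.created_at DESC
    LIMIT 10 OFFSET 1000000;

Each branch can use the index spanning both relevant columns (see [2]), which
is why the per-branch scans stay cheap even at higher offsets.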
Using a single-threaded approach (to preserve resources for concurrent
queries, max_parallel_workers_per_gather = 0), the UNION ALL approach is
superior starting at an offset of 100000:

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    18.028 |   292.762
    10000 |   188.548 |   308.824
   100000 |  1710.029 |   455.101
  1000000 | 81325.527 |  1993.886
  5000000 | 84206.901 |  8638.194
 10000000 | 84846.488 | 16814.890

One thing that really disturbs me here is the optimizer's decision to
generate an Append + Hash starting with offset 1000000 instead of simply
continuing with a Merge Append, which pushes the limit down and returns just
10M intermediate rows, whereas Append does not (yet) and produces 270M
intermediate rows. With enable_hashjoin turned off to force a Merge Append
(the session settings are sketched after the refs below), I get these
numbers:

 c_offset |    exec_a |   exec_a_m
----------+-----------+------------
  1000000 | 81325.527 |  16517.566

...but it then degrades again because the plan switches back to an Append (so
there's no way to test Merge Append performance here, I guess):

  5000000 | 84206.901 | 107161.533
 10000000 | 84846.488 | 109368.087

Is there anything I can do about it (apart from my generated huge UNION ALL)
to speed things up?

Please note that I'm using the Timescale extension just as a simple way of
managing the partitions and indexes, and I intentionally set the time column
to one different from the query filter column so that it doesn't optimize
things away under the hood.

Looking forward to any pointers here.

Cheers,

	Ancoron

Refs:
[1] https://stackoverflow.com/questions/55470713
[2] https://paste.ofcode.org/szj7f7fCSYk7jQNdd5Wvbx
[3] https://paste.ofcode.org/ibZ8fNmNFDrsyxa3NktdWB
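P.S.: For completeness, the session settings behind the single-threaded and
forced-Merge-Append numbers above were essentially just:

    -- leave CPU/I/O headroom for concurrent queries: no parallel workers
    SET max_parallel_workers_per_gather = 0;

    -- discourage the Append + Hash Join so the planner picks Merge Append again
    SET enable_hashjoin = off;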
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> One of the things consuming most of the time was an Index Only Scan
> executed millions of times. And on top came the Nested Loop which
> finally reduced the rows but also took a lot of time to do so.

> Explain plan: https://explain.depesz.com/s/4GYT

The core problem you've got there is the misestimation of the join size:

  Nested Loop (cost=0.71..0.30 rows=72,839,557 width=33) (actual time=19,504.096..315,933.158 rows=274 loops=1)

Anytime the planner is off by a factor of 250000x, it's not going to end
well. In this case, it's imagining that the LIMIT will kick in after just
a very small part of the join is executed --- but in reality, the LIMIT
is larger than the join output, so that we have to execute the whole join.
With a more accurate idea of the join result size, it would have chosen
a different plan.

What you ought to look into is why that estimate is so badly off.
Maybe out-of-date stats, or you need to raise the stats target for
one or both tables?

> My question now is why can't the optimizer generate a plan that in this
> case does 114 loops of "events" scans instead of a million loops on the
> "subscription_signal"?

I don't see any "events" table in that query, so this question isn't
making a lot of sense to me. But in any case, the answer probably boils
down to "it's guessing that a plan like this will stop early without
having to scan all of the large table".

			regards, tom lane
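P.S.: e.g. something along these lines (table/column names are just
placeholders for whichever columns the join estimate actually depends on):

    -- raise the per-column statistics target, then refresh the stats
    ALTER TABLE subscription_signal ALTER COLUMN signal_id SET STATISTICS 1000;
    ANALYZE subscription_signal;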
On 12/05/2019 20:08, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> One of the things consuming most of the time was an Index Only Scan
>> executed millions of times. And on top came the Nested Loop which
>> finally reduced the rows but also took a lot of time to do so.
>
>> Explain plan: https://explain.depesz.com/s/4GYT
>
> The core problem you've got there is the misestimation of the join size:
>
>   Nested Loop (cost=0.71..0.30 rows=72,839,557 width=33) (actual time=19,504.096..315,933.158 rows=274 loops=1)
>
> Anytime the planner is off by a factor of 250000x, it's not going to end
> well. In this case, it's imagining that the LIMIT will kick in after just
> a very small part of the join is executed --- but in reality, the LIMIT
> is larger than the join output, so that we have to execute the whole join.
> With a more accurate idea of the join result size, it would have chosen
> a different plan.
>
> What you ought to look into is why that estimate is so badly off.
> Maybe out-of-date stats, or you need to raise the stats target for
> one or both tables?

I thought so as well, and that's why I started investigating, but even after
creating my own data set and a final ANALYZE of both tables I ended up with a
similar difference between estimate and actual:

https://explain.depesz.com/s/R7jp

  Nested Loop (cost=25.17..514,965,251.12 rows=27,021,979 width=56) (actual time=0.568..5.686 rows=274 loops=1)

...but this one was fast thanks to the Merge Append being used and the
pushed-down LIMIT.

>> My question now is why can't the optimizer generate a plan that in this
>> case does 114 loops of "events" scans instead of a million loops on the
>> "subscription_signal"?
>
> I don't see any "events" table in that query, so this question isn't
> making a lot of sense to me. But in any case, the answer probably boils
> down to "it's guessing that a plan like this will stop early without
> having to scan all of the large table".
>
> 			regards, tom lane

Yes, the Timescale extension mangles the partition names quite a lot.

I wonder if it would be possible to hold the result of the estimated smaller
reference data (the 114 subscription_signal.signal_id entries in this case)
in a VALUES list and then use that to filter the table with the larger
estimate instead of looping over it.

Cheers,

	Ancoron
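P.S.: Expressed as a manual rewrite, I mean something roughly like this (just
a sketch; column names and the id literals are placeholders, not from the
actual schema):

    SELECT e.*
    FROM   events e
    WHERE  e.signal_id IN (
               VALUES ('signal-id-1'), ('signal-id-2')
               -- ...one row per matching signal_id (~114 here)...
           )
    ORDER  BY e.created_at DESC
    LIMIT  10 OFFSET 1000000;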