Thread: Huge generated UNION ALL faster than JOIN?

Huge generated UNION ALL faster than JOIN?

From: Ancoron Luciferis
Hi,

I recently stumbled across an interesting query performance question
over at StackOverflow [1], which caught my attention, so I started to
investigate the issue further.

One of the things consuming most of the time was an Index Only Scan
executed millions of times. And on top came the Nested Loop which
finally reduced the rows but also took a lot of time to do so.

Explain plan: https://explain.depesz.com/s/4GYT

My initial test using a ~200M row table [2] revealed a completely
different plan and no-brainer performance: ~10.7 milliseconds vs. the
original 6.25 minutes (10x more rows cannot explain that difference).

As the query included a LIMIT, I also started to play around with
OFFSET, because I know that at least OFFSET + LIMIT (sorted) rows have
to be read. This led to an interesting result for one of my
optimization attempts:

To eliminate the JOIN and the massive number of loops, I decided to
generate a huge UNION ALL query on-the-fly using a function [3].
Although it performs rather poorly with no OFFSET, it makes quite a
difference at higher offsets.
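
Just to illustrate the shape of the generated statement (this is only
a sketch with placeholder table/column names; the real schema and the
generator function are in [2] and [3]): one UNION ALL branch per
signal_id, so each branch can range-scan the two-column index
directly, with LIMIT/OFFSET applied on top of the combined result:

    SELECT u.*
    FROM (
        SELECT e.id, e.signal_id, e.created
        FROM   events e
        WHERE  e.signal_id = 'sig-0001'
        UNION ALL
        SELECT e.id, e.signal_id, e.created
        FROM   events e
        WHERE  e.signal_id = 'sig-0002'
        -- ... one branch per signal_id (114 of them in this case) ...
    ) u
    ORDER BY u.created
    LIMIT 10 OFFSET 1000000;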

Here are the times (in milliseconds, including planning and execution)
for "exec_a", using the plans as generated by PostgreSQL 11.2, and
"exec_d", using my UNION ALL version:

 c_offset |  exec_a   |  exec_d
----------+-----------+-----------
        0 |    10.694 |   746.892
    10000 |   175.858 |   653.218
   100000 |  1632.205 |   791.913
  1000000 | 11244.091 |  2274.160
  5000000 | 11567.438 |  9428.352
 10000000 | 13442.229 | 17026.783

Complete plans for all executions here:
exec_a: https://explain.depesz.com/s/Ck1
exec_d: https://explain.depesz.com/s/ZoUu

A retest after upgrading to PostgreSQL 11.3 and adding another 200M
rows revealed yet another set of numbers:

 c_offset |  exec_a   | exec_a_x2 |  exec_d   | exec_d_x2
----------+-----------+-----------+-----------+-----------
        0 |    10.694 |    16.616 |   746.892 |   630.440
    10000 |   175.858 |   182.922 |   653.218 |   646.173
   100000 |  1632.205 |  1682.033 |   791.913 |   782.874
  1000000 | 11244.091 | 24781.706 |  2274.160 |  2306.577
  5000000 | 11567.438 | 24798.120 |  9428.352 |  8886.781
 10000000 | 13442.229 | 27315.650 | 17026.783 | 16808.223

One major difference for the "exec_a" plans is that starting with OFFSET
of 1000000, the planner switches from a "Merge Append" + "Nested Loop"
to a "Parallel Append" + "Hash Join" + "Sort" + "Gather Merge", whereas
the plans for "exec_d" always remain single-threaded.

My question now is why can't the optimizer generate a plan that in this
case does 114 loops of "events" scans instead of a million loops on the
"subscription_signal"? There even is an index that spans both relevant
columns here (see [2]) which is used extensively in my UNION variant (as
intended and expected).
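
For reference, the index I'm referring to is of this general shape
(table/column names are placeholders here; the real DDL is in [2]):

    -- two-column index covering both the filter and the sort column
    CREATE INDEX idx_events_signal_created
        ON events (signal_id, created);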

I also observed that, while the Parallel Append is eventually going to
be faster due to better I/O scalability (at least on my system, which
uses a separate SSD for the log and different index/data tablespaces),
it saturates a lot of CPU cores, causes a lot more I/O in general and
also includes the bottleneck of per-worker disk sorts. From the
perspective of system resources this is not really helpful, and it
doesn't seem to bring much benefit in my case either: the Parallel
Append only saves ~10-20% (for OFFSET 1000000) vs. a standard Append
(with parallel index/seq scans of the partitions).

Using a single-threaded approach to preserve resources for concurrent
queries (max_parallel_workers_per_gather = 0; the exact setting is
shown after the table below), the UNION ALL variant is superior
starting at offset 100000:

 c_offset |  exec_a   |  exec_d
----------+-----------+-----------
        0 |    18.028 |   292.762
    10000 |   188.548 |   308.824
   100000 |  1710.029 |   455.101
  1000000 | 81325.527 |  1993.886
  5000000 | 84206.901 |  8638.194
 10000000 | 84846.488 | 16814.890
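
As mentioned above, nothing else was changed for these runs besides
this session-level setting:

    -- disable parallel query for the current session only
    SET max_parallel_workers_per_gather = 0;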

One thing that really disturbs me in this case is the optimizer's
decision to switch to an Append + Hash Join starting at offset
1000000, instead of simply continuing with a Merge Append, which
pushes down the limit and returns just 10M intermediate rows, whereas
Append does not (yet) and produces 270M intermediate rows. Forcing the
Merge Append by turning off enable_hashjoin (the SET is shown after
the numbers below) results in these figures:

 c_offset |  exec_a   |  exec_a_m
----------+-----------+------------
  1000000 | 81325.527 |  16517.566

...but then it degrades again at higher offsets because the plan
switches back to an Append (no way to test Merge Append performance
there, I guess):
  5000000 | 84206.901 | 107161.533
 10000000 | 84846.488 | 109368.087
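
The "exec_a_m" numbers above were produced with nothing more than this
session-level planner switch:

    -- steer the planner away from the Hash Join (and thus the Append)
    SET enable_hashjoin = off;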


Is there anything I can do about it (apart from my generated huge UNION)
to speed things up?

Please note that I'm using the Timescale extension just as a simple
way of managing the partitions and indexes, and that I intentionally
set the time column to one different from the column used in the query
filter, so that it cannot optimize things away under the hood.
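
To clarify, the setup is roughly of the following form (names and the
chunk interval are placeholders here; the actual DDL is in [2]). The
hypertable is partitioned on a time column that the query filter does
not reference:

    -- placeholder names: chunk management via TimescaleDB, but
    -- partitioned on a column the query filter does not use
    SELECT create_hypertable('events', 'created',
                             chunk_time_interval => interval '1 week');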

Looking forward to any pointers here.

Cheers,

    Ancoron

Refs:
[1] https://stackoverflow.com/questions/55470713
[2] https://paste.ofcode.org/szj7f7fCSYk7jQNdd5Wvbx
[3] https://paste.ofcode.org/ibZ8fNmNFDrsyxa3NktdWB




Re: Huge generated UNION ALL faster than JOIN?

From: Tom Lane
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> One of the things consuming most of the time was an Index Only Scan
> executed millions of times. And on top came the Nested Loop which
> finally reduced the rows but also took a lot of time to do so.

> Explain plan: https://explain.depesz.com/s/4GYT

The core problem you've got there is the misestimation of the join size:

Nested Loop (cost=0.71..0.30 rows=72,839,557 width=33) (actual time=19,504.096..315,933.158 rows=274 loops=1)

Anytime the planner is off by a factor of 250000x, it's not going to end
well.  In this case, it's imagining that the LIMIT will kick in after just
a very small part of the join is executed --- but in reality, the LIMIT
is larger than the join output, so that we have to execute the whole join.
With a more accurate idea of the join result size, it would have chosen
a different plan.

What you ought to look into is why is that estimate so badly off.
Maybe out-of-date stats, or you need to raise the stats target for
one or both tables?
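
Something along these lines, with the actual table/column names from
the join condition substituted in:

    -- raise the per-column statistics target (default 100), then
    -- re-ANALYZE; table/column names here are placeholders
    ALTER TABLE subscription_signal
        ALTER COLUMN signal_id SET STATISTICS 1000;
    ANALYZE subscription_signal;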

> My question now is why can't the optimizer generate a plan that in this
> case does 114 loops of "events" scans instead of a million loops on the
> "subscription_signal"?

I don't see any "events" table in that query, so this question isn't
making a lot of sense to me.  But in any case, the answer probably boils
down to "it's guessing that a plan like this will stop early without
having to scan all of the large table".

            regards, tom lane



Re: Huge generated UNION ALL faster than JOIN?

From: Ancoron Luciferis
On 12/05/2019 20:08, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> One of the things consuming most of the time was an Index Only Scan
>> executed millions of times. And on top came the Nested Loop which
>> finally reduced the rows but also took a lot of time to do so.
> 
>> Explain plan: https://explain.depesz.com/s/4GYT
> 
> The core problem you've got there is the misestimation of the join size:
> 
> Nested Loop (cost=0.71..0.30 rows=72,839,557 width=33) (actual time=19,504.096..315,933.158 rows=274 loops=1)
> 
> Anytime the planner is off by a factor of 250000x, it's not going to end
> well.  In this case, it's imagining that the LIMIT will kick in after just
> a very small part of the join is executed --- but in reality, the LIMIT
> is larger than the join output, so that we have to execute the whole join.
> With a more accurate idea of the join result size, it would have chosen
> a different plan.
> 
> What you ought to look into is why is that estimate so badly off.
> Maybe out-of-date stats, or you need to raise the stats target for
> one or both tables?
> 

I thought so as well, and that's why I started investigating, but
after creating my own data set and a final ANALYZE of both tables I
ended up with a similarly large gap between estimate and actual:

https://explain.depesz.com/s/R7jp

Nested Loop (cost=25.17..514,965,251.12 rows=27,021,979 width=56)
(actual time=0.568..5.686 rows=274 loops=1)

...but this one was fast because a Merge Append was used and the LIMIT
was pushed down.

>> My question now is why can't the optimizer generate a plan that in this
>> case does 114 loops of "events" scans instead of a million loops on the
>> "subscription_signal"?
> 
> I don't see any "events" table in that query, so this question isn't
> making a lot of sense to me.  But in any case, the answer probably boils
> down to "it's guessing that a plan like this will stop early without
> having to scan all of the large table".
> 
>             regards, tom lane
> 

Yes, the Timescale extension mangles the partition names quite a lot.
I wonder whether it would be possible to hold the result of the side
with the smaller estimate (the 114 subscription_signal.signal_id
entries in this case) in a VALUES list and then use that to filter the
table with the larger estimate, instead of looping over it.
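
Something along these lines, as a rough sketch only (placeholder names
and values, not tested):

    -- feed the small set of signal_ids in as a literal VALUES list
    -- and let the planner (semi-)join against it instead of looping
    -- once per outer row
    SELECT e.*
    FROM   events e
    WHERE  e.signal_id IN (VALUES ('sig-0001'),
                                  ('sig-0002')
                           -- ... 114 entries in total ...
                          )
    ORDER  BY e.created
    LIMIT  10 OFFSET 1000000;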

Cheers,

    Ancoron