Thread: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

[HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Rafia Sabih
Date:
Hello all,

This is to bring to notice a peculiar instance I found recently while
running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
complete when run on a machine with 512 GB RAM and 32 cores with
following parameter settings on the commit id -
0c2070cefa0e5d097b715c9a3b9b5499470019aa

work_mem = 1 GB
shared_buffers = 8 GB
effective_cache_size = 10 GB
scale factor = 300
max_parallel_workers_per_gather = 4
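
(A minimal sketch of applying the server settings above, assuming superuser
access; the scale factor is a dbgen parameter rather than a GUC, and
shared_buffers only changes after a restart:

    ALTER SYSTEM SET work_mem = '1GB';
    ALTER SYSTEM SET shared_buffers = '8GB';
    ALTER SYSTEM SET effective_cache_size = '10GB';
    ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
    SELECT pg_reload_conf();   -- picks up everything except shared_buffers
)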

The output of EXPLAIN ANALYZE is as follows:
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=60187550.59..60187550.59 rows=1 width=52) (actual time=71064602.963..71064602.964 rows=1 loops=1)
   ->  Sort  (cost=60187550.59..60187550.59 rows=3 width=52) (actual time=71064602.961..71064602.961 rows=1 loops=1)
         Sort Key: supplier.s_name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop Semi Join  (cost=52960550.15..60187550.57 rows=3 width=52) (actual time=2163639.699..71063153.953 rows=52263 loops=1)
               Join Filter: (supplier.s_suppkey = lineitem.l_suppkey)
               Rows Removed by Join Filter: 155706242106
               ->  Nested Loop  (cost=963.43..10081.54 rows=120000 width=56) (actual time=178.589..6282.852 rows=120358 loops=1)
                     ->  Seq Scan on nation  (cost=0.00..0.41 rows=1 width=4) (actual time=0.018..0.042 rows=1 loops=1)
                           Filter: (n_name = 'EGYPT'::bpchar)
                           Rows Removed by Filter: 24
                     ->  Bitmap Heap Scan on supplier  (cost=963.43..8881.13 rows=120000 width=64) (actual time=178.563..6143.786 rows=120358 loops=1)
                           Recheck Cond: (s_nationkey = nation.n_nationkey)
                           Heap Blocks: exact=57317
                           ->  Bitmap Index Scan on idx_supplier_nation_key  (cost=0.00..933.43 rows=120000 width=0) (actual time=133.218..133.218 rows=120358 loops=1)
                                 Index Cond: (s_nationkey = nation.n_nationkey)
               ->  Materialize  (cost=52959586.72..60024469.24 rows=85 width=16) (actual time=12.679..175.456 rows=1293693 loops=120358)
                     ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16) (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
                           Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
                           Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
                           Rows Removed by Join Filter: 3771
                           ->  GroupAggregate  (cost=48448912.90..53325163.98 rows=144696357 width=48) (actual time=1342085.116..2178225.340 rows=156665300 loops=1)
                                 Group Key: lineitem.l_partkey, lineitem.l_suppkey
                                 ->  Sort  (cost=48448912.90..49125364.33 rows=270580572 width=21) (actual time=1342085.072..1495863.254 rows=273057944 loops=1)
                                       Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                                       Sort Method: external merge  Disk: 8282600kB
                                       ->  Bitmap Heap Scan on lineitem  (cost=2847641.84..10552097.42 rows=270580572 width=21) (actual time=241170.637..650122.225 rows=273058377 loops=1)
                                             Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                             Heap Blocks: exact=33444433
                                             ->  Bitmap Index Scan on idx_l_shipdate  (cost=0.00..2779996.70 rows=270580572 width=0) (actual time=190321.155..190321.155 rows=273058377 loops=1)
                                                   Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                           ->  Sort  (cost=4510673.81..4516734.42 rows=2424244 width=24) (actual time=183237.183..184039.914 rows=2602656 loops=1)
                                 Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
                                 Sort Method: quicksort  Memory: 301637kB
                                 ->  Hash Join  (cost=379620.36..4253593.60 rows=2424244 width=24) (actual time=8576.343..179703.563 rows=2602656 loops=1)
                                       Hash Cond: (partsupp.ps_partkey = part.p_partkey)
                                       ->  Seq Scan on partsupp  (cost=0.00..2949730.80 rows=240000000 width=20) (actual time=0.149..71902.279 rows=240000000 loops=1)
                                       ->  Hash  (cost=372044.59..372044.59 rows=606062 width=4) (actual time=8574.476..8574.476 rows=650664 loops=1)
                                             Buckets: 1048576  Batches: 1  Memory Usage: 31067kB
                                             ->  Gather  (cost=1000.00..372044.59 rows=606062 width=4) (actual time=0.677..8090.145 rows=650664 loops=1)
                                                   Workers Planned: 4
                                                   Workers Launched: 4
                                                   ->  Parallel Seq Scan on part  (cost=0.00..310438.39 rows=151516 width=4) (actual time=0.056..8259.935 rows=130133 loops=5)
                                                         Filter: ((p_name)::text ~~ 'beige%'::text)
                                                         Rows Removed by Filter: 11869867
 Planning time: 4.669 ms
 Execution time: 71065478.356 ms

It is clear that selectivity estimations are really bad in this case,
particularly at this node:

  ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16) (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
        Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
        Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
        Rows Removed by Join Filter: 3771
 

Still this puzzled me as during earlier runs of this benchmark I never
encountered such prolonged running times. On further investigation I
found that on reverting the commit
7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Dec 17 15:28:54 2016 -0500

    Fix FK-based join selectivity estimation for semi/antijoins.

the query completes in around 1.5 hours, with the following query plan:

 Limit  (cost=69766456.38..69766456.38 rows=1 width=52) (actual time=6056583.271..6056583.272 rows=1 loops=1)
   ->  Sort  (cost=69766456.38..69766606.38 rows=60000 width=52) (actual time=6056583.270..6056583.270 rows=1 loops=1)
         Sort Key: supplier.s_name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop  (cost=69742098.22..69766156.38 rows=60000 width=52) (actual time=6038178.125..6056492.537 rows=52263 loops=1)
               Join Filter: (supplier.s_nationkey = nation.n_nationkey)
               Rows Removed by Join Filter: 1247849
               ->  Seq Scan on nation  (cost=0.00..0.41 rows=1 width=4) (actual time=0.011..0.049 rows=1 loops=1)
                     Filter: (n_name = 'EGYPT'::bpchar)
                     Rows Removed by Filter: 24
               ->  Nested Loop  (cost=69742098.22..69747406.00 rows=1499997 width=60) (actual time=6038177.711..6056108.835 rows=1300112 loops=1)
                     ->  HashAggregate  (cost=69742097.79..69742182.18 rows=8439 width=16) (actual time=6038177.626..6039095.036 rows=1300126 loops=1)
                           Group Key: partsupp.ps_suppkey
                           ->  Nested Loop Semi Join  (cost=48766000.84..69742076.69 rows=8439 width=16) (actual time=2400414.925..6034282.917 rows=1696742 loops=1)
                                 ->  Merge Join  (cost=48766000.27..69735727.56 rows=8439 width=32) (actual time=2400413.227..4113580.275 rows=156321526 loops=1)
                                       Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
                                       Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
                                       Rows Removed by Join Filter: 344021
                                       ->  GroupAggregate  (cost=48765999.70..53642250.78 rows=144696357 width=48) (actual time=2400413.048..3404489.071 rows=156665547 loops=1)
                                             Group Key: lineitem.l_partkey, lineitem.l_suppkey
                                             ->  Sort  (cost=48765999.70..49442451.13 rows=270580572 width=21) (actual time=2400412.958..2692446.813 rows=273058377 loops=1)
                                                   Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                                                   Sort Method: external merge  Disk: 8282776kB
                                                   ->  Bitmap Heap Scan on lineitem  (cost=2847641.84..10552097.42 rows=270580572 width=21) (actual time=147080.637..1825300.462 rows=273058377 loops=1)
                                                         Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                                         Rows Removed by Index Recheck: 1346378434
                                                         Heap Blocks: exact=669636 lossy=32774797
                                                         ->  Bitmap Index Scan on idx_l_shipdate  (cost=0.00..2779996.70 rows=270580572 width=0) (actual time=146519.142..146519.142 rows=273058377 loops=1)
                                                               Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                       ->  Index Scan using partsupp_pkey on partsupp  (cost=0.57..12722651.69 rows=240000000 width=20) (actual time=0.160..197858.090 rows=240000000 loops=1)
                                 ->  Index Scan using part_pkey on part  (cost=0.56..0.75 rows=1 width=4) (actual time=0.012..0.012 rows=0 loops=156321526)
                                       Index Cond: (p_partkey = lineitem.l_partkey)
                                       Filter: ((p_name)::text ~~ 'beige%'::text)
                                       Rows Removed by Filter: 1
                     ->  Index Scan using supplier_pkey on supplier  (cost=0.43..0.61 rows=1 width=64) (actual time=0.011..0.012 rows=1 loops=1300126)
                           Index Cond: (s_suppkey = lineitem.l_suppkey)
 Planning time: 2.440 ms
 Execution time: 6057329.179 ms
 

I hope there might be some way out for such cases, one that keeps the
benefits of the commit without hurting other cases (like this one)
this badly.
Thoughts, comments...

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Re: TPC-H Q20 from 1 hour to 19 hours!

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> This is to bring to notice a peculiar instance I found recently while
> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
> complete ...

That's bad.

> It is clear that selectivity estimations are really bad in this case
> particularly at node,
>   ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16)
> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>                            Merge Cond: ((lineitem.l_partkey =
> partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>                            Join Filter:
> ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>                            Rows Removed by Join Filter: 3771

So, the selectivity estimation here is bad both before and after Tom's
commit, but it's significantly worse after (actual value 1696742, old
estimate 3771, new estimate 85).

> Still this puzzled me as during earlier runs of this benchmark I never
> encountered such prolonged running times. On further investigation I
> found that on reverting the commit
> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Dec 17 15:28:54 2016 -0500
>     Fix FK-based join selectivity estimation for semi/antijoins.

I don't think the problem originates at the Merge Join, though,
because the commit says that it is fixing semi and anti-join estimates
- this is a plain inner join, so in theory it shouldn't change.
However, it's a bit hard for me to piece through these plans, the
formatting kind of got messed up - things are wrapped.  Could you
possibly attach the plans as attachments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:

On 03/29/2017 09:00 PM, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> This is to bring to notice a peculiar instance I found recently while
>> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
>> complete ...
> 
> That's bad.
> 
>> It is clear that selectivity estimations are really bad in this case
>> particularly at node,
>>    ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16)
>> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>>                             Merge Cond: ((lineitem.l_partkey =
>> partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>>                             Join Filter:
>> ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>>                             Rows Removed by Join Filter: 3771
> 
> So, the selectivity estimation here is bad both before and after Tom's
> commit, but it's significantly worse after (actual value 1696742, old
> estimate 3771, new estimate 85).
> 
>> Still this puzzled me as during earlier runs of this benchmark I never
>> encountered such prolonged running times. On further investigation I
>> found that on reverting the commit
>> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Sat Dec 17 15:28:54 2016 -0500
>>      Fix FK-based join selectivity estimation for semi/antijoins.
> 
> I don't think the problem originates at the Merge Join, though,
> because the commit says that it is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped.  Could you
> possibly attach the plans as attachments?
>

I've been looking into this today, and it seems to me the simplest query 
triggering this issue (essentially a part of q20) is this:

select
         ps_suppkey
from
         partsupp,
         (
                 select
                         l_partkey agg_partkey,
                         l_suppkey agg_suppkey
                 from
                         lineitem
                 group by
                         l_partkey,
                         l_suppkey
         ) agg_lineitem
where
         agg_partkey = ps_partkey
         and agg_suppkey = ps_suppkey
         and ps_partkey in (
                 select
                         p_partkey
                 from
                         part
                 where
                         p_name like 'hot%'
         );

which does actually include a semijoin. What seems to trigger the issue 
is the join to the aggregated lineitem table - when replacing it with a 
plain table, everything seems to be estimated perfectly fine.
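
As a rough sketch, the plain-table variant has this shape (the attached
script has the exact queries, so treat this as an approximation):

select
         ps_suppkey
from
         partsupp,
         lineitem
where
         l_partkey = ps_partkey
         and l_suppkey = ps_suppkey
         and ps_partkey in (
                 select
                         p_partkey
                 from
                         part
                 where
                         p_name like 'hot%'
         );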

Attached is a simple SQL script that runs three variants of the query:

(1) with the join to the aggregated lineitem table
(2) with a join to a plain lineitem table
(3) with a join to a plain lineitem table and to 'part' (without the 
semijoin)

First the queries are executed on tables without any foreign keys 
(between those three), then with a FK between lineitem and partsupp, and 
finally with additional FK between partsupp and part.
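
The foreign keys are added roughly like this (a sketch; the constraint names
are mine, and the attached script has the exact commands):

    alter table lineitem
        add constraint lineitem_partsupp_fkey
        foreign key (l_partkey, l_suppkey)
        references partsupp (ps_partkey, ps_suppkey);

    alter table partsupp
        add constraint partsupp_part_fkey
        foreign key (ps_partkey)
        references part (p_partkey);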

Initially the estimates are bad, but once the first foreign key is 
added, the estimates get very accurate - except for the case (1).

I've only run the queries on a 10GB data set, but that should be enough. 
The plans are from current master - I'll rerun the script on an older 
release later today.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:

On 03/30/2017 12:14 AM, Tomas Vondra wrote:
>
> I've only run the queries on a 10GB data set, but that should be enough. 
> The plans are from current master - I'll rerun the script on an older 
> release later today.
> 

So, the plans from an older release (9.4) are attached. What seems to 
matter is the join between partsupp and part, which is estimated like 
this on 9.4:

    ->  Sort  (cost=172852.06..173741.94 rows=355951 width=12)
              (actual time=321.998..334.440 rows=86836 loops=1)
          Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
          Sort Method: quicksort  Memory: 7143kB
          ->  Nested Loop  (cost=0.43..140031.03 rows=355951 width=12)
                       (actual time=0.025..303.145 rows=86836 loops=1)

and like this on current master:

   ->  Sort  (cost=146617.86..146819.89 rows=80809 width=12)
             (actual time=562.513..575.599 rows=86836 loops=1)
          Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
          Sort Method: quicksort  Memory: 7143kB
          ->  Nested Loop  (cost=0.43..140031.03 rows=80809 width=12)
                      (actual time=0.054..536.003 rows=86836 loops=1)

which, however, seems clearly more accurate than 9.4. So perhaps there is
some bug in computing the mergejoin estimate, but it's also possible that
correcting the estimate (lowering it) has some impact of its own.

But joining to the aggregated subquery also clearly plays some role, 
because without it the estimates are higher.

Another interesting observation is that only the foreign key between 
part/partsupp seems to matter - once it gets dropped, the estimates get 
back close to 9.4 values.

What is however strange is that changing max_parallel_workers_per_gather 
affects row estimates *above* the Gather node. That seems a bit, um, 
suspicious, no? See the parallel-estimates.log.
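
A quick way to see the effect is to compare the join estimate with and
without parallelism, e.g. something along these lines (an illustrative
query, not the exact one from the script):

    set max_parallel_workers_per_gather = 4;
    explain select ps_suppkey
       from partsupp join part on ps_partkey = p_partkey
      where p_name like 'hot%';

    set max_parallel_workers_per_gather = 0;
    explain select ps_suppkey
       from partsupp join part on ps_partkey = p_partkey
      where p_name like 'hot%';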

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: TPC-H Q20 from 1 hour to 19 hours!

From
Robert Haas
Date:
On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> What is however strange is that changing max_parallel_workers_per_gather
> affects row estimates *above* the Gather node. That seems a bit, um,
> suspicious, no? See the parallel-estimates.log.

Thanks for looking at this!  Comparing the parallel plan vs. the
non-parallel plan:

part: parallel rows (after Gather) 20202, non-parallel rows 20202
partsupp: parallel rows 18, non-parallel rows 18
part-partsupp join: parallel rows 88988, non-parallel rows 355951
lineitem: parallel rows 59986112, non-parallel rows 59986112
lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
final join: parallel rows 131, non-parallel rows 524

I agree with you that that looks mighty suspicious.  Both the
part-partsupp join and the final join have exactly 4x as many
estimated rows in the non-parallel plan as in the parallel plan, and
it just so happens that the parallel divisor here will be 4.
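
A quick arithmetic check of that, worked from the row counts above:

    \[ 355951 / 88988 \approx 4.00, \qquad 524 / 131 = 4.00 \]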

Hmm... it looks like the parallel_workers value from the Gather node
is being erroneously propagated up to the higher levels of the plan
tree.   Wow.   Somehow, Gather Merge managed to get the logic correct
here, but Gather is totally wrong.  Argh.   Attached is a draft patch,
which I haven't really tested beyond checking that it passes 'make
check'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: TPC-H Q20 from 1 hour to 19 hours!

From
Rafia Sabih
Date:
On Thu, Mar 30, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I don't think the problem originates at the Merge Join, though,
> because the commit says that it is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped.  Could you
> possibly attach the plans as attachments?
>
Sure, please find attached file for the plans before and after commit.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment

Re: TPC-H Q20 from 1 hour to 19 hours!

From
Amit Kapila
Date:
On Thu, Mar 30, 2017 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> What is however strange is that changing max_parallel_workers_per_gather
>> affects row estimates *above* the Gather node. That seems a bit, um,
>> suspicious, no? See the parallel-estimates.log.
>
> Thanks for looking at this!  Comparing the parallel plan vs. the
> non-parallel plan:
>
> part: parallel rows (after Gather) 20202, non-parallel rows 20202
> partsupp: parallel rows 18, non-parallel rows 18
> part-partsupp join: parallel rows 88988, non-parallel rows 355951
> lineitem: parallel rows 59986112, non-parallel rows 59986112
> lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
> final join: parallel rows 131, non-parallel rows 524
>
> I agree with you that that looks mighty suspicious.  Both the
> part-partsupp join and the final join have exactly 4x as many
> estimated rows in the non-parallel plan as in the parallel plan, and
> it just so happens that the parallel divisor here will be 4.
>
> Hmm... it looks like the parallel_workers value from the Gather node
> is being erroneously propagated up to the higher levels of the plan
> tree.   Wow.   Somehow, Gather Merge managed to get the logic correct
> here, but Gather is totally wrong.  Argh.   Attached is a draft patch,
> which I haven't really tested beyond checking that it passes 'make
> check'.
>

Your patch looks good to me.  I have verified some join cases as well
where the behaviour is sane after patch.  I have also done testing
with force_parallel_mode=regress (ran make check-world) and everything
seems good.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: TPC-H Q20 from 1 hour to 19 hours!

From
Rafia Sabih
Date:
On Fri, Mar 31, 2017 at 5:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> What is however strange is that changing max_parallel_workers_per_gather
>>> affects row estimates *above* the Gather node. That seems a bit, um,
>>> suspicious, no? See the parallel-estimates.log.
>>
>> Thanks for looking at this!  Comparing the parallel plan vs. the
>> non-parallel plan:
>>
>> part: parallel rows (after Gather) 20202, non-parallel rows 20202
>> partsupp: parallel rows 18, non-parallel rows 18
>> part-partsupp join: parallel rows 88988, non-parallel rows 355951
>> lineitem: parallel rows 59986112, non-parallel rows 59986112
>> lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
>> final join: parallel rows 131, non-parallel rows 524
>>
>> I agree with you that that looks mighty suspicious.  Both the
>> part-partsupp join and the final join have exactly 4x as many
>> estimated rows in the non-parallel plan as in the parallel plan, and
>> it just so happens that the parallel divisor here will be 4.
>>
>> Hmm... it looks like the parallel_workers value from the Gather node
>> is being erroneously propagated up to the higher levels of the plan
>> tree.   Wow.   Somehow, Gather Merge managed to get the logic correct
>> here, but Gather is totally wrong.  Argh.   Attached is a draft patch,
>> which I haven't really tested beyond checking that it passes 'make
>> check'.
>>
>
> Your patch looks good to me.  I have verified some join cases as well
> where the behaviour is sane after patch.  I have also done testing
> with force_parallel_mode=regress (ran make check-world) and everything
> seems good.
>
> --

I tried checking the plan of Q20 with this patch, and got the following results,
with patch,
->  Merge Join  (cost=3025719.98..3499235.22 rows=6 width=16) (actual time=176440.801..245903.143 rows=118124 loops=1)
      Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
and without patch,
->  Merge Join  (cost=3014830.12..3511637.54 rows=2 width=16) (actual time=130994.237..208887.149 rows=118124 loops=1)
      Merge Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))
      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))

So, it looks like in the problematic area, it is not improving much.
Please find the attached file for the query plan of Q20 with and
without patch. I haven't evaluated the performance of this query with
patch.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment

Re: TPC-H Q20 from 1 hour to 19 hours!

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 7:53 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> So, it looks like in the problematic area, it is not improving much.
> Please find the attached file for the query plan of Q20 with and
> without patch. I haven't evaluated the performance of this query with
> patch.

Yeah, I don't think this patch is fixing whatever is going wrong with
Q20.  It's just a problem Tomas found while investigating that issue.
Unfortunately, I don't think we know what's really going on here yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:

On 04/01/2017 02:57 AM, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 7:53 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> So, it looks like in the problematic area, it is not improving much.
>> Please find the attached file for the query plan of Q20 with and
>> without patch. I haven't evaluated the performance of this query with
>> patch.
> 
> Yeah, I don't think this patch is fixing whatever is going wrong with
> Q20.  It's just a problem Tomas found while investigating that issue.
> Unfortunately, I don't think we know what's really going on here yet.
> 

Yeah, it's just a side-bug. I'll look into the issue again once I get 
back home from pgconf.us, it's a little bit difficult to investigate 
this over ssh on the in-flight wifi :-/

My hunch is that the two foreign keys interfere in some strange way, 
i.e. we should apply just one but end up applying both, or something 
like that. Or perhaps it gets confused by the join-to-aggregated-rel?

It's a bit weird, though, because the patch was originally meant to help 
with multi-column keys. It was then extended to also consider 
single-column FKs, but that should really be just 1/ndistinct anyway.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:
Hi,

I've been looking at this issue today, and so far I don't think it's a 
bug in the foreign key estimation. It seems mostly that the 9.5 
estimates were hopelessly bad, and the join estimation changes simply 
pushed it a tiny bit the wrong direction.

Although maybe there is a bug (or at least a change of behavior) in one 
case, but I'll get to that.

I've managed to extract a small part of Q20 that demonstrates the 
differences between versions quite nicely, I think. The part causing the 
trouble looks like this:

   explain select
       ps_suppkey
   from
       partsupp,
       (
           select
               l_partkey agg_partkey,
               l_suppkey agg_suppkey
           from
               lineitem
           where
               l_shipdate >= date '1997-01-01'
               and l_shipdate < date '1997-01-01' + interval '1' year
           group by
               l_partkey,
               l_suppkey
       ) agg_lineitem
   where
       agg_partkey = ps_partkey
       and agg_suppkey = ps_suppkey
       and ps_partkey in (
           select
               p_partkey
           from
               part
           where
               p_name like 'hot%'
       );

i.e. it aggregates the "lineitem" table, and then joins "partsupp" and 
"part" tables to it.

     "aggregated lineitem" <-> partsupp <-> part

I've collected estimates from four different variants of the query (see 
the attached exlain.sql):

1) SIMPLE
   - join directly to lineitem (without the aggregation)
   - remove the p_name LIKE pattern matching

2) SIMPLE+LIKE
   - like SIMPLE, but keep the LIKE condition

3) GROUPING
   - join to the aggregated lineitem table
   - remove the p_name LIKE pattern matching

4) GROUPING+LIKE
   - like GROUPING, but keep the LIKE condition

I've collected estimates on a 20GB data set, both from 9.5 (so without 
any of the FK estimation changes) and on master with different foreign 
keys between the tables.

no-keys  - no foreign keys between the three tables
lineitem - lineitem references partsupp
partsupp - partsupp references part
both     - both foreign keys

And the results look like this (actual row counts were collected on 9.5, 
but that should not matter - the results should be the same on all 
versions):

       branch     SIMPLE     SIMPLE+LIKE     GROUPING    GROUPING+LIKE
   --------------------------------------------------------------------
       actual  119994608         1311974     10897186           119238
          9.5       2863              35          160              160
      no-keys       2340              24          868              868
     lineitem  119994848         1229750          868              868
     partsupp       2340              24         1737               18
    both-keys  119994848         1212065         1737               18

This seems mostly sane, I guess, but let's look at various cases.

In the SIMPLE cases, the foreign key "lineitem->partsupp" makes a huge 
difference - the estimates are pretty exact, both with and without the 
LIKE condition. The "partsupp->part" key makes almost no difference, 
though - the minor differences (35/24 and 1229750/1212065) seem to be 
mostly due to minor differences in stats built by ANALYZE, particularly 
in histograms used by patternsel().

In the GROUPING cases, the situation is obviously much worse. The 
grouping makes it impossible to use the "lineitem->partsupp" foreign 
key, resulting in severe underestimates. The "partsupp->part" is used, 
but the difference is pretty negligible as it's a simple (one column) 
foreign key.

The change from 160 -> 868 is merely due to 84f9a35e3 changing how we 
estimate number of groups in a GROUP BY clause. In 9.5 we get this:

     ->  HashAggregate  (rows=1836028) (actual rows=10897186)

while since 9.6 we get this

     ->  GroupAggregate  (rows=9674242)

Not only is that much closer to the actual value than the 9.5 estimate, 
but it's almost exactly the factor between 160 and 868:

     9674242 / 1836028 = 5.27
      160 * 5.27 = 843

So I'd say the 160 vs. 868 is expected, although the result is still way 
off, of course.

Which brings me to the slightly suspicious bit. On 9.5, there's no 
difference between the GROUPING and GROUPING+LIKE cases - the estimates 
are exactly the same in both. This is true on master too, but only without 
the foreign key between "partsupp" and "part", i.e. the two non-grouped 
relations in the join. And what's more, the difference (1737 vs. 16) is 
pretty much exactly 100x, which is the estimate for the LIKE condition.

So it kinda seems 9.5 does not apply this condition for semi-joins, 
while >=9.6 does that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Robert Haas
Date:
On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Which brings me to the slightly suspicious bit. On 9.5, there's no
> difference between the GROUPING and GROUPING+LIKE cases - the estimates are
> exactly the same in both. This is true on master too, but only without the
> foreign key between "partsupp" and "part", i.e. the two non-grouped
> relations in the join. And what's more, the difference (1737 vs. 16) is
> pretty much exactly 100x, which is the estimate for the LIKE condition.

I don't follow this.  How does the foreign key between partsupp and
part change the selectivity of LIKE?

> So it kinda seems 9.5 does not apply this condition for semi-joins, while
> >=9.6 does that.

If 9.5 and prior are ignoring some of the quals, that's bad, but I
don't think I understand from your analysis why
7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:
Hi,

On 5/25/17 6:03 AM, Robert Haas wrote:
> On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which brings me to the slightly suspicious bit. On 9.5, there's no
>> difference between the GROUPING and GROUPING+LIKE cases - the estimates are
>> exactly the same in both. This is true on master too, but only without the
>> foreign key between "partsupp" and "part", i.e. the two non-grouped
>> relations in the join. And what's more, the difference (1737 vs. 16) is
>> pretty much exactly 100x, which is the estimate for the LIKE condition.
> 
> I don't follow this.  How does the foreign key between partsupp and
> part change the selectivity of LIKE?
> 
>> So it kinda seems 9.5 does not apply this condition for semi-joins, while
>> >=9.6 does that.
> 

Well, get_foreign_key_join_selectivity() does handle restrictions when 
calculating the joinrel size estimate in calc_joinrel_size_estimate(), so 
assuming there's some thinko there, it might easily cause this.

I haven't found any such thinko, but I don't dare to claim I fully 
understand what the current version of get_foreign_key_join_selectivity 
does :-/

> If 9.5 and prior are ignoring some of the quals, that's bad, but I
> don't think I understand from your analysis why
> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything.
> 

It's been quite a bit of time since I looked into this, but I think my 
main point was that it's hard to say it's a regression when both the old 
and new estimates are so utterly wrong.

I mean, 9.5 estimates 160, 9.6 estimates 18. We might fix the post-9.6 
estimate to return the same value as 9.5, and it might fix this 
particular query with this particular scale. But in the end it's just 
noise considering that the actual value is 120k (so 3 or 4 orders of 
magnitude off).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Which brings me to the slightly suspicious bit. On 9.5, there's no
> difference between the GROUPING and GROUPING+LIKE cases - the estimates are
> exactly the same in both. This is true on master too, but only without the
> foreign key between "partsupp" and "part", i.e. the two non-grouped
> relations in the join. And what's more, the difference (1737 vs. 16) is
> pretty much exactly 100x, which is the estimate for the LIKE condition.
> So it kinda seems 9.5 does not apply this condition for semi-joins, 
> while >=9.6 does that.

I got around to looking into this finally.  I assume that when you say
you added a foreign key from partsupp to part, you also created a unique
index on part.p_partkey --- there is no such index in dbt3's version of
the schema, anyway, so I had to add one to create a foreign key.
The unique index itself, never mind the foreign key, changes things
substantially for this query in HEAD, as a result of commit 92a43e485:
the semijoin to part becomes a plain join, and so we go through entirely
different code paths to estimate selectivity.  But without that, there's
clearly something odd going on, because the LIKE condition ought to have
some effect on the estimate of number of matched rows.
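
(For anyone reproducing this on the dbt3 schema, the missing index can be
added with something like the following; the index name is my choice:

    create unique index part_p_partkey_key on part (p_partkey);
)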

I poked into it and found that the problem stems from the fact that the
initial estimate of the top join relation's size is estimated using
agg_lineitem.agg_partkey = part.p_partkey as the join condition, not
the original partsupp.ps_partkey = part.p_partkey qual.  We can get
the former from the latter via equivalence-clause deductions, and
I think it's just bad luck that the join is formed first in that
direction.  What that means is that eqjoinsel_semi() finds no statistics
for the LHS variable, although it does have stats for the RHS.  There
is logic in eqjoinsel_semi() that attempts to reduce the semijoin
selectivity estimate proportionally to whatever restriction clauses were
applied to the RHS ... but if you look closely, that code has no effect
if we lack stats for the LHS.  We'll fall past both the MCV-dependent
calculation and the nd1-vs-nd2 test and end up taking the selectivity
as just 0.5, independently of anything else.

It seems reasonable to me that we ought to derate that default estimate
by whatever we'd concluded the restriction selectivity on the inner rel
is.  So I experimented with the attached patch and verified that it
results in the LIKE affecting the final rowcount estimate in this
situation.  I've not tested it much further than that, other than checking
that we still pass regression tests.
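
As a worked example of the proposed adjustment, taking (as an assumed figure)
a 1% restriction selectivity on the inner rel, roughly what the LIKE was
estimated at earlier in the thread:

    \[ \mathrm{uncertainfrac} = 0.5 \times \frac{\mathrm{rel{\to}rows}}{\mathrm{rel{\to}tuples}} \approx 0.5 \times 0.01 = 0.005 \]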

I'm not sure if we ought to think about applying this now.  It will likely
make things worse not better for the Q20 query, because it will cause the
underestimate for the full query to be even further off.  Still, in
principle it seems like it ought to be an improvement in most cases.

The thing that would actually have a chance of improving matters for Q20
would be if we could see our way to looking through the aggregation
subquery and applying the foreign key constraint for lineitem.  That
seems like a research project though; it's surely not happening for v10.

I'm also wondering idly if we could fix things so that the join size
estimate gets formed from a join condition that we have stats for rather
than one we lack them for.  But I have no clear ideas on ways to go
about that that wouldn't be horrid kluges.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6e491bb..b875d5d 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** eqjoinsel_semi(Oid operator,
*** 2603,2610 ****
           *
           * Crude as the above is, it's completely useless if we don't have
           * reliable ndistinct values for both sides.  Hence, if either nd1 or
!          * nd2 is default, punt and assume half of the uncertain rows have
!          * join partners.
           */
          if (!isdefault1 && !isdefault2)
          {
--- 2603,2609 ----
           *
           * Crude as the above is, it's completely useless if we don't have
           * reliable ndistinct values for both sides.  Hence, if either nd1 or
!          * nd2 is default, we can't use this.
           */
          if (!isdefault1 && !isdefault2)
          {
*************** eqjoinsel_semi(Oid operator,
*** 2616,2622 ****
--- 2615,2635 ----
                  uncertainfrac = nd2 / nd1;
          }
          else
+         {
+             /*
+              * In this situation, we basically assume half of the uncertain
+              * rows have join partners.  However, we'd still like to respond
+              * to restriction clauses applied to the inner rel, so what we
+              * really do is assume half of the uncertain rows have partners in
+              * the underlying inner rel, then reduce that fraction by the
+              * previously-determined selectivity of the inner restrictions.
+              */
              uncertainfrac = 0.5;
+             if (vardata2->rel &&
+                 vardata2->rel->rows > 0 &&
+                 vardata2->rel->rows < vardata2->rel->tuples)
+                 uncertainfrac *= vardata2->rel->rows / vardata2->rel->tuples;
+         }
          uncertain = 1.0 - matchfreq1 - nullfrac1;
          CLAMP_PROBABILITY(uncertain);
          selec = matchfreq1 + uncertainfrac * uncertain;
*************** eqjoinsel_semi(Oid operator,
*** 2624,2631 ****
      else
      {
          /*
!          * Without MCV lists for both sides, we can only use the heuristic
!          * about nd1 vs nd2.
           */
          double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

--- 2637,2644 ----
      else
      {
          /*
!          * Without MCV lists for both sides, we can only use the heuristics
!          * described above about nd1 vs nd2 and inner restriction clauses.
           */
          double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

*************** eqjoinsel_semi(Oid operator,
*** 2637,2643 ****
--- 2650,2662 ----
                  selec = (nd2 / nd1) * (1.0 - nullfrac1);
          }
          else
+         {
              selec = 0.5 * (1.0 - nullfrac1);
+             if (vardata2->rel &&
+                 vardata2->rel->rows > 0 &&
+                 vardata2->rel->rows < vardata2->rel->tuples)
+                 selec *= vardata2->rel->rows / vardata2->rel->tuples;
+         }
      }

      free_attstatsslot(&sslot1);


Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Peter Geoghegan
Date:
On Thu, Jun 1, 2017 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The thing that would actually have a chance of improving matters for Q20
> would be if we could see our way to looking through the aggregation
> subquery and applying the foreign key constraint for lineitem.  That
> seems like a research project though; it's surely not happening for v10.

Do you mean teaching the optimizer to do something like this?:

select
        ps_suppkey
from
        partsupp,
        (
                select
                        l_partkey agg_partkey,
                        l_suppkey agg_suppkey
                from
                        lineitem
                /* BEGIN my addition */
                where exists (
                        select
                                p_partkey
                        from
                                part
                        where
                                p_name like 'hot%'
                                and p_partkey = l_partkey
                )
                /* END my addition */
                group by
                        l_partkey,
                        l_suppkey
        ) agg_lineitem
where
        agg_partkey = ps_partkey
        and agg_suppkey = ps_suppkey
        and ps_partkey in (
                select
                        p_partkey
                from
                        part
                where
                        p_name like 'hot%'
        );

Note that I introduced a new, redundant exists() in the agg_lineitem
fact table subquery. It now takes 23 seconds for me on Tomas' 10GB
TPC-H dataset, whereas the original query took over 90 minutes.
Clearly we're missing a trick or two here. I think that you need a
DAG-shaped query plan to make this work well, though, so it is
certainly a big project.

Apparently selectivity estimation isn't particularly challenging with
the TPC-H queries. I think that the big challenge for us is
limitations like this; there are similar issues with a number of other
TPC-H queries. It would be great if someone looked into implementing
bitmap semi-join.

-- 
Peter Geoghegan



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jun 1, 2017 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The thing that would actually have a chance of improving matters for Q20
>> would be if we could see our way to looking through the aggregation
>> subquery and applying the foreign key constraint for lineitem.  That
>> seems like a research project though; it's surely not happening for v10.

> Do you mean teaching the optimizer to do something like this?:

Uh, no.  I don't think we want to add any run-time checks.  The point in
this example is that we'd get a better rowcount estimate if we noticed
that the FK constraint could be considered while estimating the size of
the partsupp-to-aggregated-subquery join.

> Apparently selectivity estimation isn't particularly challenging with
> the TPC-H queries.

Maybe not with the rest of them, but we're certainly having an issue there
with Q20.
        regards, tom lane



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Peter Geoghegan
Date:
On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do you mean teaching the optimizer to do something like this?:
>
> Uh, no.  I don't think we want to add any run-time checks.  The point in
> this example is that we'd get a better rowcount estimate if we noticed
> that the FK constraint could be considered while estimating the size of
> the partsupp-to-aggregated-subquery join.

Sorry for not considering the context of the thread more carefully.
Robert said something about selectivity estimation and TPC-H to me,
which I decided to research; I then rediscovered this thread.

Clearly Q20 is designed to reward systems that do better with moving
predicates into subqueries, as opposed to systems with better
selectivity estimation.

-- 
Peter Geoghegan



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Peter Geoghegan
Date:
On Sun, Jun 11, 2017 at 10:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Note that I introduced a new, redundant exists() in the agg_lineitem
> fact table subquery. It now takes 23 seconds for me on Tomas' 10GB
> TPC-H dataset, whereas the original query took over 90 minutes.
> Clearly we're missing a trick or two here. I think that you need a
> DAG-shaped query plan to make this work well, though, so it is
> certainly a big project.

On closer examination, this seems to be due to the number of heap
accesses required by an index-only scan that the original query plan
uses; my test case was invalid. All the same, I understand that moving
predicates into many (TPC-H) subqueries is an important optimization,
and hope that more work can be done in this area.

-- 
Peter Geoghegan



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Tomas Vondra
Date:
Hi,

On 6/11/17 7:54 PM, Peter Geoghegan wrote:
> On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Do you mean teaching the optimizer to do something like this?:
>>
>> Uh, no.  I don't think we want to add any run-time checks.  The point in
>> this example is that we'd get a better rowcount estimate if we noticed
>> that the FK constraint could be considered while estimating the size of
>> the partsupp-to-aggregated-subquery join.
> 
> Sorry for not considering the context of the thread more carefully.
> Robert said something about selectivity estimation and TPC-H to me,
> which I decide to research; I then rediscovered this thread.
> 
> Clearly Q20 is designed to reward systems that do better with moving
> predicates into subqueries, as opposed to systems with better
> selectivity estimation.
> 

I do strongly recommend reading this paper analyzing choke points of 
individual TPC-H queries:
    http://oai.cwi.nl/oai/asset/21424/21424B.pdf

It's slightly orthogonal to the issue at hand (a poor estimate in Q20 
causing the choice of an inefficient plan), but it's a great paper to read. 
I thought I had already posted a link to this paper sometime in the 
past, but I don't see it in the archives.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

From
Peter Geoghegan
Date:
On Sun, Jun 11, 2017 at 4:10 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I do strongly recommend reading this paper analyzing choke points of
> individual TPC-H queries:
>
>     http://oai.cwi.nl/oai/asset/21424/21424B.pdf
>
> It's slightly orthogonal to the issue at hand (a poor estimate in Q20 causing
> the choice of an inefficient plan), but it's a great paper to read. I thought
> I had already posted a link to this paper sometime in the past, but I don't
> see it in the archives.

Thanks for the tip!

The practical focus of this paper really appeals to me.

-- 
Peter Geoghegan