Thread: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
Hello all,

This is to bring to notice a peculiar instance I found recently while running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to complete when run on a machine with 512 GB RAM and 32 cores, with the following parameter settings, on commit 0c2070cefa0e5d097b715c9a3b9b5499470019aa:

work_mem = 1 GB
shared_buffers = 8 GB
effective_cache_size = 10 GB
scale factor = 300
max_parallel_workers_per_gather = 4

The output of EXPLAIN ANALYZE is as follows,

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=60187550.59..60187550.59 rows=1 width=52) (actual time=71064602.963..71064602.964 rows=1 loops=1)
  ->  Sort  (cost=60187550.59..60187550.59 rows=3 width=52) (actual time=71064602.961..71064602.961 rows=1 loops=1)
        Sort Key: supplier.s_name
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Nested Loop Semi Join  (cost=52960550.15..60187550.57 rows=3 width=52) (actual time=2163639.699..71063153.953 rows=52263 loops=1)
              Join Filter: (supplier.s_suppkey = lineitem.l_suppkey)
              Rows Removed by Join Filter: 155706242106
              ->  Nested Loop  (cost=963.43..10081.54 rows=120000 width=56) (actual time=178.589..6282.852 rows=120358 loops=1)
                    ->  Seq Scan on nation  (cost=0.00..0.41 rows=1 width=4) (actual time=0.018..0.042 rows=1 loops=1)
                          Filter: (n_name = 'EGYPT'::bpchar)
                          Rows Removed by Filter: 24
                    ->  Bitmap Heap Scan on supplier  (cost=963.43..8881.13 rows=120000 width=64) (actual time=178.563..6143.786 rows=120358 loops=1)
                          Recheck Cond: (s_nationkey = nation.n_nationkey)
                          Heap Blocks: exact=57317
                          ->  Bitmap Index Scan on idx_supplier_nation_key  (cost=0.00..933.43 rows=120000 width=0) (actual time=133.218..133.218 rows=120358 loops=1)
                                Index Cond: (s_nationkey = nation.n_nationkey)
              ->  Materialize  (cost=52959586.72..60024469.24 rows=85 width=16) (actual time=12.679..175.456 rows=1293693 loops=120358)
                    ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16) (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
                          Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
                          Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
                          Rows Removed by Join Filter: 3771
                          ->  GroupAggregate  (cost=48448912.90..53325163.98 rows=144696357 width=48) (actual time=1342085.116..2178225.340 rows=156665300 loops=1)
                                Group Key: lineitem.l_partkey, lineitem.l_suppkey
                                ->  Sort  (cost=48448912.90..49125364.33 rows=270580572 width=21) (actual time=1342085.072..1495863.254 rows=273057944 loops=1)
                                      Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                                      Sort Method: external merge  Disk: 8282600kB
                                      ->  Bitmap Heap Scan on lineitem  (cost=2847641.84..10552097.42 rows=270580572 width=21) (actual time=241170.637..650122.225 rows=273058377 loops=1)
                                            Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                            Heap Blocks: exact=33444433
                                            ->  Bitmap Index Scan on idx_l_shipdate  (cost=0.00..2779996.70 rows=270580572 width=0) (actual time=190321.155..190321.155 rows=273058377 loops=1)
                                                  Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                          ->  Sort  (cost=4510673.81..4516734.42 rows=2424244 width=24) (actual time=183237.183..184039.914 rows=2602656 loops=1)
                                Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
                                Sort Method: quicksort  Memory: 301637kB
                                ->  Hash Join  (cost=379620.36..4253593.60 rows=2424244 width=24) (actual time=8576.343..179703.563 rows=2602656 loops=1)
                                      Hash Cond: (partsupp.ps_partkey = part.p_partkey)
                                      ->  Seq Scan on partsupp  (cost=0.00..2949730.80 rows=240000000 width=20) (actual time=0.149..71902.279 rows=240000000 loops=1)
                                      ->  Hash  (cost=372044.59..372044.59 rows=606062 width=4) (actual time=8574.476..8574.476 rows=650664 loops=1)
                                            Buckets: 1048576  Batches: 1  Memory Usage: 31067kB
                                            ->  Gather  (cost=1000.00..372044.59 rows=606062 width=4) (actual time=0.677..8090.145 rows=650664 loops=1)
                                                  Workers Planned: 4
                                                  Workers Launched: 4
                                                  ->  Parallel Seq Scan on part  (cost=0.00..310438.39 rows=151516 width=4) (actual time=0.056..8259.935 rows=130133 loops=5)
                                                        Filter: ((p_name)::text ~~ 'beige%'::text)
                                                        Rows Removed by Filter: 11869867
Planning time: 4.669 ms
Execution time: 71065478.356 ms

It is clear that selectivity estimations are really bad in this case particularly at node,

->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16) (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
      Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
      Rows Removed by Join Filter: 3771

Still this puzzled me as during earlier runs of this benchmark I never encountered such prolonged running times. On further investigation I found that on reverting the commit

7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Dec 17 15:28:54 2016 -0500
    Fix FK-based join selectivity estimation for semi/antijoins.

the query completes in around 1.5 hours, with the following query plan,

Limit  (cost=69766456.38..69766456.38 rows=1 width=52) (actual time=6056583.271..6056583.272 rows=1 loops=1)
  ->  Sort  (cost=69766456.38..69766606.38 rows=60000 width=52) (actual time=6056583.270..6056583.270 rows=1 loops=1)
        Sort Key: supplier.s_name
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Nested Loop  (cost=69742098.22..69766156.38 rows=60000 width=52) (actual time=6038178.125..6056492.537 rows=52263 loops=1)
              Join Filter: (supplier.s_nationkey = nation.n_nationkey)
              Rows Removed by Join Filter: 1247849
              ->  Seq Scan on nation  (cost=0.00..0.41 rows=1 width=4) (actual time=0.011..0.049 rows=1 loops=1)
                    Filter: (n_name = 'EGYPT'::bpchar)
                    Rows Removed by Filter: 24
              ->  Nested Loop  (cost=69742098.22..69747406.00 rows=1499997 width=60) (actual time=6038177.711..6056108.835 rows=1300112 loops=1)
                    ->  HashAggregate  (cost=69742097.79..69742182.18 rows=8439 width=16) (actual time=6038177.626..6039095.036 rows=1300126 loops=1)
                          Group Key: partsupp.ps_suppkey
                          ->  Nested Loop Semi Join  (cost=48766000.84..69742076.69 rows=8439 width=16) (actual time=2400414.925..6034282.917 rows=1696742 loops=1)
                                ->  Merge Join  (cost=48766000.27..69735727.56 rows=8439 width=32) (actual time=2400413.227..4113580.275 rows=156321526 loops=1)
                                      Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
                                      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
                                      Rows Removed by Join Filter: 344021
                                      ->  GroupAggregate  (cost=48765999.70..53642250.78 rows=144696357 width=48) (actual time=2400413.048..3404489.071 rows=156665547 loops=1)
                                            Group Key: lineitem.l_partkey, lineitem.l_suppkey
                                            ->  Sort  (cost=48765999.70..49442451.13 rows=270580572 width=21) (actual time=2400412.958..2692446.813 rows=273058377 loops=1)
                                                  Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                                                  Sort Method: external merge  Disk: 8282776kB
                                                  ->  Bitmap Heap Scan on lineitem  (cost=2847641.84..10552097.42 rows=270580572 width=21) (actual time=147080.637..1825300.462 rows=273058377 loops=1)
                                                        Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                                        Rows Removed by Index Recheck: 1346378434
                                                        Heap Blocks: exact=669636 lossy=32774797
                                                        ->  Bitmap Index Scan on idx_l_shipdate  (cost=0.00..2779996.70 rows=270580572 width=0) (actual time=146519.142..146519.142 rows=273058377 loops=1)
                                                              Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                                      ->  Index Scan using partsupp_pkey on partsupp  (cost=0.57..12722651.69 rows=240000000 width=20) (actual time=0.160..197858.090 rows=240000000 loops=1)
                                ->  Index Scan using part_pkey on part  (cost=0.56..0.75 rows=1 width=4) (actual time=0.012..0.012 rows=0 loops=156321526)
                                      Index Cond: (p_partkey = lineitem.l_partkey)
                                      Filter: ((p_name)::text ~~ 'beige%'::text)
                                      Rows Removed by Filter: 1
                    ->  Index Scan using supplier_pkey on supplier  (cost=0.43..0.61 rows=1 width=64) (actual time=0.011..0.012 rows=1 loops=1300126)
                          Index Cond: (s_suppkey = lineitem.l_suppkey)
Planning time: 2.440 ms
Execution time: 6057329.179 ms

I hope there might be some way out for such a case that keeps the benefits of the commit without hurting other cases (like this one) this badly.

Thoughts, comments...

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
> This is to bring to notice a peculiar instance I found recently while
> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
> complete ...

That's bad.

> It is clear that selectivity estimations are really bad in this case
> particularly at node,
> ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16)
> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>       Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>       Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>       Rows Removed by Join Filter: 3771

So, the selectivity estimation here is bad both before and after Tom's commit, but it's significantly worse after (actual value 1696742, old estimate 3771, new estimate 85).

> Still this puzzled me as during earlier runs of this benchmark I never
> encountered such prolonged running times. On further investigation I
> found that on reverting the commit
> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Dec 17 15:28:54 2016 -0500
>     Fix FK-based join selectivity estimation for semi/antijoins.

I don't think the problem originates at the Merge Join, though, because the commit says that it is fixing semi and anti-join estimates - this is a plain inner join, so in theory it shouldn't change. However, it's a bit hard for me to piece through these plans; the formatting kind of got messed up - things are wrapped. Could you possibly attach the plans as attachments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/29/2017 09:00 PM, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> This is to bring to notice a peculiar instance I found recently while
>> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
>> complete ...
>
> That's bad.
>
>> It is clear that selectivity estimations are really bad in this case
>> particularly at node,
>> ->  Merge Join  (cost=52959586.72..60024468.82 rows=85 width=16)
>> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>>       Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>>       Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>>       Rows Removed by Join Filter: 3771
>
> So, the selectivity estimation here is bad both before and after Tom's
> commit, but it's significantly worse after (actual value 1696742, old
> estimate 3771, new estimate 85).
>
>> Still this puzzled me as during earlier runs of this benchmark I never
>> encountered such prolonged running times. On further investigation I
>> found that on reverting the commit
>> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Sat Dec 17 15:28:54 2016 -0500
>>     Fix FK-based join selectivity estimation for semi/antijoins.
>
> I don't think the problem originates at the Merge Join, though,
> because the commit says that it is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped. Could you
> possibly attach the plans as attachments?
>

I've been looking into this today, and it seems to me the simplest query triggering this issue (essentially a part of q20) is this:

select ps_suppkey
  from partsupp,
       (select l_partkey agg_partkey, l_suppkey agg_suppkey
          from lineitem
         group by l_partkey, l_suppkey) agg_lineitem
 where agg_partkey = ps_partkey
   and agg_suppkey = ps_suppkey
   and ps_partkey in (select p_partkey
                        from part
                       where p_name like 'hot%');

which does actually include a semijoin. What seems to trigger the issue is the join to the aggregated lineitem table - when replacing it with a plain table, everything seems to be estimated perfectly fine.

Attached is a simple SQL script that runs three variants of the query:

(1) with the join to the aggregated lineitem table
(2) with a join to a plain lineitem table
(3) with a join to a plain lineitem table and to 'part' (without the semijoin)

First the queries are executed on tables without any foreign keys (between those three), then with a FK between lineitem and partsupp, and finally with an additional FK between partsupp and part.

Initially the estimates are bad, but once the first foreign key is added, the estimates get very accurate - except for case (1).

I've only run the queries on a 10GB data set, but that should be enough. The plans are from current master - I'll rerun the script on an older release later today.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
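For reference, the foreign keys described above would presumably be created along these lines; this is only a minimal sketch assuming the usual TPC-H primary keys on partsupp and part, and the constraint names are made up (the attached script may differ):

-- two-column FK from lineitem to partsupp, then single-column FK from partsupp to part
ALTER TABLE lineitem ADD CONSTRAINT lineitem_partsupp_fkey
    FOREIGN KEY (l_partkey, l_suppkey)
    REFERENCES partsupp (ps_partkey, ps_suppkey);

ALTER TABLE partsupp ADD CONSTRAINT partsupp_part_fkey
    FOREIGN KEY (ps_partkey)
    REFERENCES part (p_partkey);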
On 03/30/2017 12:14 AM, Tomas Vondra wrote:
>
> I've only run the queries on a 10GB data set, but that should be enough.
> The plans are from current master - I'll rerun the script on an older
> release later today.
>

So, plans from an older release (9.4) are attached. What seems to matter is the join between partsupp and part, which is estimated like this on 9.4:

->  Sort  (cost=172852.06..173741.94 rows=355951 width=12) (actual time=321.998..334.440 rows=86836 loops=1)
      Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
      Sort Method: quicksort  Memory: 7143kB
      ->  Nested Loop  (cost=0.43..140031.03 rows=355951 width=12) (actual time=0.025..303.145 rows=86836 loops=1)

and like this on current master:

->  Sort  (cost=146617.86..146819.89 rows=80809 width=12) (actual time=562.513..575.599 rows=86836 loops=1)
      Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
      Sort Method: quicksort  Memory: 7143kB
      ->  Nested Loop  (cost=0.43..140031.03 rows=80809 width=12) (actual time=0.054..536.003 rows=86836 loops=1)

which, however, seems clearly more accurate than 9.4. So perhaps there is some bug in computing the mergejoin estimate, but it's also possible that correcting the estimate (lowering it) also has some impact. But joining to the aggregated subquery also clearly plays some role, because without it the estimates are higher.

Another interesting observation is that only the foreign key between part/partsupp seems to matter - once it gets dropped, the estimates get back close to the 9.4 values.

What is however strange is that changing max_parallel_workers_per_gather affects row estimates *above* the Gather node. That seems a bit, um, suspicious, no? See the parallel-estimates.log.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> What is however strange is that changing max_parallel_workers_per_gather
> affects row estimates *above* the Gather node. That seems a bit, um,
> suspicious, no? See the parallel-estimates.log.

Thanks for looking at this!  Comparing the parallel plan vs. the non-parallel plan:

part: parallel rows (after Gather) 20202, non-parallel rows 20202
partsupp: parallel rows 18, non-parallel rows 18
part-partsupp join: parallel rows 88988, non-parallel rows 355951
lineitem: parallel rows 59986112, non-parallel rows 59986112
lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
final join: parallel rows 131, non-parallel rows 524

I agree with you that that looks mighty suspicious.  Both the part-partsupp join and the final join have exactly 4x as many estimated rows in the non-parallel plan as in the parallel plan, and it just so happens that the parallel divisor here will be 4.

Hmm... it looks like the parallel_workers value from the Gather node is being erroneously propagated up to the higher levels of the plan tree.  Wow.  Somehow, Gather Merge managed to get the logic correct here, but Gather is totally wrong.  Argh.  Attached is a draft patch, which I haven't really tested beyond checking that it passes 'make check'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
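A quick way to observe the effect discussed above, assuming a TPC-H schema (the query below is illustrative and not taken from the attached log), is to compare the row estimate of a join sitting above a Gather with parallelism disabled and enabled:

SET max_parallel_workers_per_gather = 0;
EXPLAIN SELECT *
  FROM partsupp JOIN part ON ps_partkey = p_partkey
 WHERE p_name LIKE 'hot%';

SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT *
  FROM partsupp JOIN part ON ps_partkey = p_partkey
 WHERE p_name LIKE 'hot%';

-- Before the fix, the join row estimate in the second plan can come out about
-- the parallel divisor (here roughly 4x) lower than in the first, because the
-- per-worker row count below the Gather leaks into the estimates above it.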
On Thu, Mar 30, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think the problem originates at the Merge Join, though,
> because the commit says that it is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped. Could you
> possibly attach the plans as attachments?
>

Sure, please find attached the plans from before and after the commit.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Attachment
On Thu, Mar 30, 2017 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> What is however strange is that changing max_parallel_workers_per_gather
>> affects row estimates *above* the Gather node. That seems a bit, um,
>> suspicious, no? See the parallel-estimates.log.
>
> Thanks for looking at this!  Comparing the parallel plan vs. the
> non-parallel plan:
>
> part: parallel rows (after Gather) 20202, non-parallel rows 20202
> partsupp: parallel rows 18, non-parallel rows 18
> part-partsupp join: parallel rows 88988, non-parallel rows 355951
> lineitem: parallel rows 59986112, non-parallel rows 59986112
> lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
> final join: parallel rows 131, non-parallel rows 524
>
> I agree with you that that looks mighty suspicious.  Both the
> part-partsupp join and the final join have exactly 4x as many
> estimated rows in the non-parallel plan as in the parallel plan, and
> it just so happens that the parallel divisor here will be 4.
>
> Hmm... it looks like the parallel_workers value from the Gather node
> is being erroneously propagated up to the higher levels of the plan
> tree.  Wow.  Somehow, Gather Merge managed to get the logic correct
> here, but Gather is totally wrong.  Argh.  Attached is a draft patch,
> which I haven't really tested beyond checking that it passes 'make
> check'.
>

Your patch looks good to me. I have verified some join cases as well where the behaviour is sane after the patch. I have also done testing with force_parallel_mode=regress (ran make check-world) and everything seems good.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 31, 2017 at 5:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> What is however strange is that changing max_parallel_workers_per_gather
>>> affects row estimates *above* the Gather node. That seems a bit, um,
>>> suspicious, no? See the parallel-estimates.log.
>>
>> Thanks for looking at this!  Comparing the parallel plan vs. the
>> non-parallel plan:
>>
>> part: parallel rows (after Gather) 20202, non-parallel rows 20202
>> partsupp: parallel rows 18, non-parallel rows 18
>> part-partsupp join: parallel rows 88988, non-parallel rows 355951
>> lineitem: parallel rows 59986112, non-parallel rows 59986112
>> lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
>> final join: parallel rows 131, non-parallel rows 524
>>
>> I agree with you that that looks mighty suspicious.  Both the
>> part-partsupp join and the final join have exactly 4x as many
>> estimated rows in the non-parallel plan as in the parallel plan, and
>> it just so happens that the parallel divisor here will be 4.
>>
>> Hmm... it looks like the parallel_workers value from the Gather node
>> is being erroneously propagated up to the higher levels of the plan
>> tree.  Wow.  Somehow, Gather Merge managed to get the logic correct
>> here, but Gather is totally wrong.  Argh.  Attached is a draft patch,
>> which I haven't really tested beyond checking that it passes 'make
>> check'.
>>
>
> Your patch looks good to me. I have verified some join cases as well
> where the behaviour is sane after the patch. I have also done testing
> with force_parallel_mode=regress (ran make check-world) and everything
> seems good.
>

I tried checking the plan of Q20 with this patch, and got the following results.

With the patch:

->  Merge Join  (cost=3025719.98..3499235.22 rows=6 width=16) (actual time=176440.801..245903.143 rows=118124 loops=1)
      Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))

and without the patch:

->  Merge Join  (cost=3014830.12..3511637.54 rows=2 width=16) (actual time=130994.237..208887.149 rows=118124 loops=1)
      Merge Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))
      Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))

So, it looks like in the problematic area, it is not improving much. Please find the attached file for the query plan of Q20 with and without patch. I haven't evaluated the performance of this query with patch.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Attachment
On Fri, Mar 31, 2017 at 7:53 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
> So, it looks like in the problematic area, it is not improving much.
> Please find the attached file for the query plan of Q20 with and
> without patch. I haven't evaluated the performance of this query with
> patch.

Yeah, I don't think this patch is fixing whatever is going wrong with Q20. It's just a problem Tomas found while investigating that issue. Unfortunately, I don't think we know what's really going on here yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/01/2017 02:57 AM, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 7:53 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> So, it looks like in the problematic area, it is not improving much.
>> Please find the attached file for the query plan of Q20 with and
>> without patch. I haven't evaluated the performance of this query with
>> patch.
>
> Yeah, I don't think this patch is fixing whatever is going wrong with
> Q20. It's just a problem Tomas found while investigating that issue.
> Unfortunately, I don't think we know what's really going on here yet.
>

Yeah, it's just a side-bug. I'll look into the issue again once I get back home from pgconf.us; it's a little bit difficult to investigate this over ssh on the in-flight wifi :-/

My hunch is that the two foreign keys interfere in some strange way, i.e. we should apply just one but end up applying both, or something like that. Or perhaps it gets confused by the join to the aggregated rel?

It's a bit weird, though, because the patch was originally meant to help with multi-column keys. It was then extended to also consider single-column FKs, but that should really be just 1/ndistinct anyway.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

I've been looking at this issue today, and so far I don't think it's a bug in the foreign key estimation. It seems mostly that the 9.5 estimates were hopelessly bad, and the join estimation changes simply pushed it a tiny bit in the wrong direction. Although maybe there is a bug (or at least a change of behavior) in one case, but I'll get to that.

I've managed to extract a small part of Q20 that demonstrates the differences between versions quite nicely, I think. The part causing the trouble looks like this:

explain select ps_suppkey
  from partsupp,
       (select l_partkey agg_partkey, l_suppkey agg_suppkey
          from lineitem
         where l_shipdate >= date '1997-01-01'
           and l_shipdate < date '1997-01-01' + interval '1' year
         group by l_partkey, l_suppkey) agg_lineitem
 where agg_partkey = ps_partkey
   and agg_suppkey = ps_suppkey
   and ps_partkey in (select p_partkey
                        from part
                       where p_name like 'hot%');

i.e. it aggregates the "lineitem" table, and then joins the "partsupp" and "part" tables to it:

    "aggregated lineitem" <-> partsupp <-> part

I've collected estimates from four different variants of the query (see the attached explain.sql):

1) SIMPLE - join directly to lineitem (without the aggregation)
          - remove the p_name LIKE pattern matching

2) SIMPLE+LIKE - like SIMPLE, but keep the LIKE condition

3) GROUPING - join to the aggregated lineitem table
            - remove the p_name LIKE pattern matching

4) GROUPING+LIKE - like GROUPING, but keep the LIKE condition

I've collected estimates on a 20GB data set, both from 9.5 (so without any of the FK estimation changes) and on master with different foreign keys between the tables:

no-keys  - no foreign keys between the three tables
lineitem - lineitem references partsupp
partsupp - partsupp references part
both     - both foreign keys

And the results look like this (actual row counts were collected on 9.5, but that should not matter - the results should be the same on all versions):

    branch        SIMPLE    SIMPLE+LIKE    GROUPING   GROUPING+LIKE
    ---------------------------------------------------------------
    actual     119994608        1311974    10897186          119238
    9.5             2863             35         160             160
    no-keys         2340             24         868             868
    lineitem   119994848        1229750         868             868
    partsupp        2340             24        1737              18
    both-keys  119994848        1212065        1737              18

This seems mostly sane, I guess, but let's look at the various cases.

In the SIMPLE cases, the foreign key "lineitem->partsupp" makes a huge difference - the estimates are pretty exact, both with and without the LIKE condition. The "partsupp->part" key makes almost no difference, though - the minor differences (35/24 and 1229750/1212065) seem to be mostly due to minor differences in stats built by ANALYZE, particularly in the histograms used by patternsel().

In the GROUPING cases, the situation is obviously much worse. The grouping makes it impossible to use the "lineitem->partsupp" foreign key, resulting in severe underestimates. The "partsupp->part" key is used, but the difference is pretty negligible as it's a simple (one column) foreign key.

The change from 160 -> 868 is merely due to 84f9a35e3 changing how we estimate the number of groups in a GROUP BY clause. In 9.5 we get this:

    ->  HashAggregate  (rows=1836028) (actual rows=10897186)

while since 9.6 we get this:

    ->  GroupAggregate  (rows=9674242)

Not only is that much closer to the actual value than the 9.5 estimate, but it's almost exactly the factor between 160 and 868:

    9674242 / 1836028 = 5.27
    160 * 5.26 = 843

So I'd say the 160 vs. 868 is expected, although the result is still way off, of course.

Which brings me to the slightly suspicious bit. On 9.5, there's no difference between the GROUPING and GROUPING+LIKE cases - the estimates are exactly the same in both cases. This is true on master too, but only without the foreign key between "partsupp" and "part", i.e. the two non-grouped relations in the join. And what's more, the difference (1737 vs. 18) is pretty much exactly 100x, which is the estimate for the LIKE condition.

So it kinda seems 9.5 does not apply this condition for semi-joins, while >=9.6 does that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
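For readers without the attachment, the SIMPLE variant described above looks roughly like this (a sketch only, assuming explain.sql follows the description; the SIMPLE+LIKE variant just adds the p_name LIKE 'hot%' condition back, and the GROUPING variants join to the aggregated subquery from the original query instead):

-- SIMPLE: join directly to lineitem, no aggregation, no LIKE on part
explain select ps_suppkey
  from partsupp, lineitem
 where l_partkey = ps_partkey
   and l_suppkey = ps_suppkey
   and l_shipdate >= date '1997-01-01'
   and l_shipdate < date '1997-01-01' + interval '1' year
   and ps_partkey in (select p_partkey from part);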
On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Which brings me to the slightly suspicious bit. On 9.5, there's no
> difference between the GROUPING and GROUPING+LIKE cases - the estimates
> are exactly the same in both cases. This is true on master too, but only
> without the foreign key between "partsupp" and "part", i.e. the two
> non-grouped relations in the join. And what's more, the difference
> (1737 vs. 18) is pretty much exactly 100x, which is the estimate for the
> LIKE condition.

I don't follow this. How does the foreign key between partsupp and part change the selectivity of LIKE?

> So it kinda seems 9.5 does not apply this condition for semi-joins, while
> >=9.6 does that.

If 9.5 and prior are ignoring some of the quals, that's bad, but I don't think I understand from your analysis why 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 5/25/17 6:03 AM, Robert Haas wrote:
> On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which brings me to the slightly suspicious bit. On 9.5, there's no
>> difference between the GROUPING and GROUPING+LIKE cases - the estimates
>> are exactly the same in both cases. This is true on master too, but only
>> without the foreign key between "partsupp" and "part", i.e. the two
>> non-grouped relations in the join. And what's more, the difference
>> (1737 vs. 18) is pretty much exactly 100x, which is the estimate for the
>> LIKE condition.
>
> I don't follow this. How does the foreign key between partsupp and
> part change the selectivity of LIKE?
>
>> So it kinda seems 9.5 does not apply this condition for semi-joins, while
>> >=9.6 does that.
>

Well, get_foreign_key_join_selectivity() does handle restrictions when calculating the joinrel size estimate in calc_joinrel_size_estimate(), so assuming there's some thinko, it might easily cause this. I haven't found any such thinko, but I don't dare to claim I fully understand what the current version of get_foreign_key_join_selectivity does :-/

> If 9.5 and prior are ignoring some of the quals, that's bad, but I
> don't think I understand from your analysis why
> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything.
>

It's been quite a bit of time since I looked into this, but I think my main point was that it's hard to say it's a regression when both the old and new estimates are so utterly wrong. I mean, 9.5 estimates 160, 9.6 estimates 18. We might fix the post-9.6 estimate to return the same value as 9.5, and it might fix this particular query at this particular scale. But in the end it's just noise considering that the actual value is 120k (so 3 or 4 orders of magnitude off).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Which brings me to the slightly suspicious bit. On 9.5, there's no
> difference between the GROUPING and GROUPING+LIKE cases - the estimates
> are exactly the same in both cases. This is true on master too, but only
> without the foreign key between "partsupp" and "part", i.e. the two
> non-grouped relations in the join. And what's more, the difference
> (1737 vs. 18) is pretty much exactly 100x, which is the estimate for the
> LIKE condition.

> So it kinda seems 9.5 does not apply this condition for semi-joins,
> while >=9.6 does that.

I got around to looking into this finally. I assume that when you say you added a foreign key from partsupp to part, you also created a unique index on part.p_partkey --- there is no such index in dbt3's version of the schema, anyway, so I had to add one to create a foreign key.

The unique index itself, never mind the foreign key, changes things substantially for this query in HEAD, as a result of commit 92a43e485: the semijoin to part becomes a plain join, and so we go through entirely different code paths to estimate selectivity. But without that, there's clearly something odd going on, because the LIKE condition ought to have some effect on the estimate of the number of matched rows.

I poked into it and found that the problem stems from the fact that the initial estimate of the top join relation's size is estimated using agg_lineitem.agg_partkey = part.p_partkey as the join condition, not the original partsupp.ps_partkey = part.p_partkey qual. We can get the former from the latter via equivalence-clause deductions, and I think it's just bad luck that the join is formed first in that direction. What that means is that eqjoinsel_semi() finds no statistics for the LHS variable, although it does have stats for the RHS. There is logic in eqjoinsel_semi() that attempts to reduce the semijoin selectivity estimate proportionally to whatever restriction clauses were applied to the RHS ... but if you look closely, that code has no effect if we lack stats for the LHS. We'll fall past both the MCV-dependent calculation and the nd1-vs-nd2 test and end up taking the selectivity as just 0.5, independently of anything else.

It seems reasonable to me that we ought to derate that default estimate by whatever we'd concluded the restriction selectivity on the inner rel is. So I experimented with the attached patch and verified that it results in the LIKE affecting the final rowcount estimate in this situation. I've not tested it much further than that, other than checking that we still pass regression tests.

I'm not sure if we ought to think about applying this now. It will likely make things worse not better for the Q20 query, because it will cause the underestimate for the full query to be even further off. Still, in principle it seems like it ought to be an improvement in most cases.

The thing that would actually have a chance of improving matters for Q20 would be if we could see our way to looking through the aggregation subquery and applying the foreign key constraint for lineitem. That seems like a research project though; it's surely not happening for v10.

I'm also wondering idly if we could fix things so that the join size estimate gets formed from a join condition that we have stats for rather than one we lack them for. But I have no clear ideas on ways to go about that that wouldn't be horrid kluges.
			regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6e491bb..b875d5d 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** eqjoinsel_semi(Oid operator,
*** 2603,2610 ****
  		 *
  		 * Crude as the above is, it's completely useless if we don't have
  		 * reliable ndistinct values for both sides.  Hence, if either nd1 or
! 		 * nd2 is default, punt and assume half of the uncertain rows have
! 		 * join partners.
  		 */
  		if (!isdefault1 && !isdefault2)
  		{
--- 2603,2609 ----
  		 *
  		 * Crude as the above is, it's completely useless if we don't have
  		 * reliable ndistinct values for both sides.  Hence, if either nd1 or
! 		 * nd2 is default, we can't use this.
  		 */
  		if (!isdefault1 && !isdefault2)
  		{
*************** eqjoinsel_semi(Oid operator,
*** 2616,2622 ****
--- 2615,2635 ----
  			uncertainfrac = nd2 / nd1;
  		}
  		else
+ 		{
+ 			/*
+ 			 * In this situation, we basically assume half of the uncertain
+ 			 * rows have join partners.  However, we'd still like to respond
+ 			 * to restriction clauses applied to the inner rel, so what we
+ 			 * really do is assume half of the uncertain rows have partners in
+ 			 * the underlying inner rel, then reduce that fraction by the
+ 			 * previously-determined selectivity of the inner restrictions.
+ 			 */
  			uncertainfrac = 0.5;
+ 			if (vardata2->rel &&
+ 				vardata2->rel->rows > 0 &&
+ 				vardata2->rel->rows < vardata2->rel->tuples)
+ 				uncertainfrac *= vardata2->rel->rows / vardata2->rel->tuples;
+ 		}
  		uncertain = 1.0 - matchfreq1 - nullfrac1;
  		CLAMP_PROBABILITY(uncertain);
  		selec = matchfreq1 + uncertainfrac * uncertain;
*************** eqjoinsel_semi(Oid operator,
*** 2624,2631 ****
  	else
  	{
  		/*
! 		 * Without MCV lists for both sides, we can only use the heuristic
! 		 * about nd1 vs nd2.
  		 */
  		double		nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;
--- 2637,2644 ----
  	else
  	{
  		/*
! 		 * Without MCV lists for both sides, we can only use the heuristics
! 		 * described above about nd1 vs nd2 and inner restriction clauses.
  		 */
  		double		nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;
*************** eqjoinsel_semi(Oid operator,
*** 2637,2643 ****
--- 2650,2662 ----
  			selec = (nd2 / nd1) * (1.0 - nullfrac1);
  		}
  		else
+ 		{
  			selec = 0.5 * (1.0 - nullfrac1);
+ 			if (vardata2->rel &&
+ 				vardata2->rel->rows > 0 &&
+ 				vardata2->rel->rows < vardata2->rel->tuples)
+ 				selec *= vardata2->rel->rows / vardata2->rel->tuples;
+ 		}
  	}
  
  	free_attstatsslot(&sslot1);
On Thu, Jun 1, 2017 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The thing that would actually have a chance of improving matters for Q20
> would be if we could see our way to looking through the aggregation
> subquery and applying the foreign key constraint for lineitem. That
> seems like a research project though; it's surely not happening for v10.

Do you mean teaching the optimizer to do something like this?:

select ps_suppkey
  from partsupp,
       (select l_partkey agg_partkey, l_suppkey agg_suppkey
          from lineitem
         /* BEGIN my addition */
         where exists (select p_partkey
                         from part
                        where p_name like 'hot%'
                          and p_partkey = l_partkey)
         /* END my addition */
         group by l_partkey, l_suppkey) agg_lineitem
 where agg_partkey = ps_partkey
   and agg_suppkey = ps_suppkey
   and ps_partkey in (select p_partkey
                        from part
                       where p_name like 'hot%');

Note that I introduced a new, redundant exists() in the agg_lineitem fact table subquery. It now takes 23 seconds for me on Tomas' 10GB TPC-H dataset, whereas the original query took over 90 minutes. Clearly we're missing a trick or two here. I think that you need a DAG-shaped query plan to make this work well, though, so it is certainly a big project.

Apparently selectivity estimation isn't particularly challenging with the TPC-H queries. I think that the big challenge for us is limitations like this; there are similar issues with a number of other TPC-H queries. It would be great if someone looked into implementing bitmap semi-join.

-- 
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jun 1, 2017 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The thing that would actually have a chance of improving matters for Q20
>> would be if we could see our way to looking through the aggregation
>> subquery and applying the foreign key constraint for lineitem. That
>> seems like a research project though; it's surely not happening for v10.

> Do you mean teaching the optimizer to do something like this?:

Uh, no. I don't think we want to add any run-time checks. The point in this example is that we'd get a better rowcount estimate if we noticed that the FK constraint could be considered while estimating the size of the partsupp-to-aggregated-subquery join.

> Apparently selectivity estimation isn't particularly challenging with
> the TPC-H queries.

Maybe not with the rest of them, but we're certainly having an issue there with Q20.

			regards, tom lane
On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do you mean teaching the optimizer to do something like this?:
>
> Uh, no. I don't think we want to add any run-time checks. The point in
> this example is that we'd get a better rowcount estimate if we noticed
> that the FK constraint could be considered while estimating the size of
> the partsupp-to-aggregated-subquery join.

Sorry for not considering the context of the thread more carefully. Robert said something about selectivity estimation and TPC-H to me, which I decided to research; I then rediscovered this thread.

Clearly Q20 is designed to reward systems that do better with moving predicates into subqueries, as opposed to systems with better selectivity estimation.

-- 
Peter Geoghegan
On Sun, Jun 11, 2017 at 10:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Note that I introduced a new, redundant exists() in the agg_lineitem
> fact table subquery. It now takes 23 seconds for me on Tomas' 10GB
> TPC-H dataset, whereas the original query took over 90 minutes.
> Clearly we're missing a trick or two here. I think that you need a
> DAG-shaped query plan to make this work well, though, so it is
> certainly a big project.

On closer examination, this seems to be due to the number of heap accesses required by an index-only scan that the original query plan uses; my test case was invalid. All the same, I understand that moving predicates into many (TPC-H) subqueries is an important optimization, and hope that more work can be done in this area.

-- 
Peter Geoghegan
Hi,

On 6/11/17 7:54 PM, Peter Geoghegan wrote:
> On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Do you mean teaching the optimizer to do something like this?:
>>
>> Uh, no. I don't think we want to add any run-time checks. The point in
>> this example is that we'd get a better rowcount estimate if we noticed
>> that the FK constraint could be considered while estimating the size of
>> the partsupp-to-aggregated-subquery join.
>
> Sorry for not considering the context of the thread more carefully.
> Robert said something about selectivity estimation and TPC-H to me,
> which I decided to research; I then rediscovered this thread.
>
> Clearly Q20 is designed to reward systems that do better with moving
> predicates into subqueries, as opposed to systems with better
> selectivity estimation.
>

I do strongly recommend reading this paper analyzing the choke points of individual TPC-H queries:

http://oai.cwi.nl/oai/asset/21424/21424B.pdf

It's slightly orthogonal to the issue at hand (a poor estimate in Q20 causing the choice of an inefficient plan), but it's a great paper to read. I thought I had already posted a link to this paper sometime in the past, but I don't see it in the archives.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 11, 2017 at 4:10 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I do strongly recommend reading this paper analyzing the choke points of
> individual TPC-H queries:
>
> http://oai.cwi.nl/oai/asset/21424/21424B.pdf
>
> It's slightly orthogonal to the issue at hand (a poor estimate in Q20
> causing the choice of an inefficient plan), but it's a great paper to
> read. I thought I had already posted a link to this paper sometime in
> the past, but I don't see it in the archives.

Thanks for the tip! The practical focus of this paper really appeals to me.

-- 
Peter Geoghegan