Improving estimates for TPC-H Q2 - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Improving estimates for TPC-H Q2
Date |
Msg-id | 20200507235711.3wazsx3euusrutkl@development
Responses | Re: Improving estimates for TPC-H Q2
List | pgsql-hackers
Hi,

I've been re-running the TPC-H benchmark, to remind myself of the common
issues with OLAP workloads, and one of the most annoying problems seems to
be the misestimates in Q2. The query is not particularly complex, although
it does have a correlated subquery with an aggregate, but it's one of the
queries prone to a cascade of nested loops running forever. I wonder if
there's something we could do to handle this better.

A raw Q2 looks like this:

    select
        s_acctbal, s_name, n_name, p_partkey, p_mfgr,
        s_address, s_phone, s_comment
    from
        part, supplier, partsupp, nation, region
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 16
        and p_type like '%NICKEL'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'AMERICA'
        and ps_supplycost = (
            select min(ps_supplycost)
            from partsupp, supplier, nation, region
            where p_partkey = ps_partkey
              and s_suppkey = ps_suppkey
              and s_nationkey = n_nationkey
              and n_regionkey = r_regionkey
              and r_name = 'AMERICA'
        )
    order by
        s_acctbal desc, n_name, s_name, p_partkey;

and the full query plan is attached (q2-original-plan.txt). The relevant
part of the plan is the final join, which also considers the subplan result
(all estimates are for scale 10):

    ->  Merge Join  (cost=638655.36..1901120.61 rows=1 width=192) (actual time=7299.121..10993.517 rows=4737 loops=1)
          Merge Cond: (part.p_partkey = partsupp.ps_partkey)
          Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
          Rows Removed by Join Filter: 1661

Yeah, this is estimated as 1 row but actually returns 4737 rows. All the
other nodes are estimated very accurately; it's just this final join that
is entirely wrong. If you tweak the costs a bit (e.g. by reducing
random_page_cost etc.) the plan can easily switch to nested loops, with
this join much deeper in the plan. See the attached q2-nested-loops.txt for
an example (I had to disable merge/hash joins to trigger this on scale 10;
on larger scales it can happen much more easily).

Now, the query seems a bit complex, but we can easily simplify it by
creating an extra table and reducing the number of joins:

    create table t as
        select *
        from partsupp, supplier, nation, region
        where s_suppkey = ps_suppkey
          and s_nationkey = n_nationkey
          and n_regionkey = r_regionkey
          and r_name = 'AMERICA';

    create index on t (ps_partkey);

which allows us to rewrite Q2 like this (this also ditches the ORDER BY and
LIMIT clauses):

    select 1
    from part, t
    where p_partkey = ps_partkey
      and p_size = 16
      and p_type like '%NICKEL'
      and ps_supplycost = (
          select min(ps_supplycost)
          from t
          where p_partkey = ps_partkey
      );

In fact, we can ditch even the conditions on p_size/p_type, which makes the
issue even more severe:

    select 1
    from part, t
    where p_partkey = ps_partkey
      and ps_supplycost = (
          select min(ps_supplycost)
          from t
          where p_partkey = ps_partkey
      );

with the join estimated like this:

    Hash Join  (cost=89761.10..1239195.66 rows=17 width=4) (actual time=15379.356..29315.436 rows=1182889 loops=1)
      Hash Cond: ((t.ps_partkey = part.p_partkey) AND (t.ps_supplycost = (SubPlan 1)))

Yeah, that's underestimated by a factor of 70000 :-(

An interesting observation is that if you remove the condition on supply
cost (with the correlated subquery), the estimates get perfect again. So
this seems to be about this particular condition, or about how we combine
the selectivities ...

I'm not sure I've figured out all the details yet, but this seems to be due
to a dependency between the ps_partkey and ps_supplycost columns. When
estimating the second condition, we end up calling eqjoinsel() with SubPlan
and Var arguments.
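For reference, the per-column statistics the planner has to work with at
this point can be checked with a quick pg_stats query (the exact n_distinct
value depends on the ANALYZE sample):

    -- per-column n_distinct for the simplified table "t"
    -- (convenience query only; values depend on the ANALYZE sample)
    select tablename, attname, n_distinct
      from pg_stats
     where tablename = 't'
       and attname in ('ps_partkey', 'ps_supplycost');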
We clearly won't have ndistinct or MCVs for the SubPlan, so we use

    nd1 = 200;      /* default */
    nd2 = 94005;    /* n_distinct for t.ps_supplycost */

and end up (thanks to eqjoinsel_inner and no NULLs in the data) with

    selec_inner = 0.00001 = Min(1/nd1, 1/nd2)

But that's entirely bogus, because while there are ~100k distinct values in
t.ps_supplycost, those are for *all* ps_partkey values combined. Each
ps_partkey value has only about ~1.4 distinct ps_supplycost values on
average:

    select avg(x)
      from (select count(distinct ps_supplycost) as x
              from t group by ps_partkey) foo;

            avg
    --------------------
     1.3560712631162481
    (1 row)

Which I think is the root cause here ...

The fact that we're using the same table "t" in both the main query and the
correlated subquery seems rather irrelevant here, because we might instead
create a separate table:

    create table s as
        select ps_partkey, min(ps_supplycost) as min_ps_supplycost
        from t
        group by ps_partkey;

and use that instead, and we'd still have the same issue. It's just the
fact that for a given ps_partkey value there are only a couple of
ps_supplycost values, not the 100k we have for the table as a whole.

I wonder if we could use the ndistinct coefficients to improve this,
somehow. I suppose eqjoinsel/eqjoinsel_inner could look at

    ndistinct(ps_partkey, ps_supplycost) / ndistinct(ps_partkey)

when estimating the (SubPlan = Var) condition, and tweak selec_inner
accordingly (a sketch of the extended statistics that could provide such a
coefficient is at the end of this mail). I do see two challenges with this,
though:

1) This probably requires considering all join clauses at once (just like
we do for regular clauses), but eqjoinsel and friends seem rather heavily
designed for inspecting the clauses one by one.

2) I'm not sure what to do about the SubPlan side, for which we may not
have any reliable ndistinct estimates at all.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
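For reference, a minimal sketch of defining the extended ndistinct
statistics that could supply the ndistinct(ps_partkey, ps_supplycost) /
ndistinct(ps_partkey) coefficient mentioned above (the statistics object
name is arbitrary; as far as I know eqjoinsel does not consult these for a
(SubPlan = Var) clause today, which is exactly the gap being discussed):

    -- extended ndistinct statistics on the simplified table "t";
    -- the statistics object name is arbitrary
    create statistics t_pskey_supplycost (ndistinct)
        on ps_partkey, ps_supplycost from t;

    analyze t;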
Attachment