Hi,
I've been re-running the TPC-H benchmark to remind myself of the common
issues with OLAP workloads, and one of the most annoying problems seems
to be the misestimates in Q2. The query is not particularly complex
(although it does have a correlated subquery with an aggregate), but it's
one of the queries prone to a cascade of nested loops running forever.
I wonder if there's something we could do to handle this better.
A raw Q2 looks like this:
select
    s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
from
    part,
    supplier,
    partsupp,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 16
    and p_type like '%NICKEL'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'AMERICA'
    and ps_supplycost = (
        select
            min(ps_supplycost)
        from
            partsupp,
            supplier,
            nation,
            region
        where
            p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = 'AMERICA'
    )
order by
    s_acctbal desc,
    n_name,
    s_name,
    p_partkey;
and the full query plan is attached (q2-original-plan.txt).
The relevant part of the plan is the final join, which also considers
the subplan result (all estimates are for scale 10):
-> Merge Join  (cost=638655.36..1901120.61 rows=1 width=192)
               (actual time=7299.121..10993.517 rows=4737 loops=1)
     Merge Cond: (part.p_partkey = partsupp.ps_partkey)
     Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
     Rows Removed by Join Filter: 1661
Yeah, this is estimated as 1 row but actually returns 4737 rows. All
the other nodes are estimated very accurately; it's just this final join
that is entirely wrong.
If you tweak the costs a bit (e.g. by reducing random_page_cost etc.) the
plan can easily switch to nested loops, with this join much deeper in
the plan. See the attached q2-nested-loops.txt for an example (I had to
disable merge/hash joins to trigger this on scale 10; on larger scales
it happens much more easily).
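(In case it's not obvious, by disabling merge/hash joins I mean simply
doing something like

    set enable_mergejoin = off;
    set enable_hashjoin = off;

before running the query - not something for production, of course, just
a convenient way to nudge the planner towards the nested-loop plan.)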
Now, the query seems a bit complex, but we can easily simplify it by
creating an extra table "t" and reducing the number of joins:
create table t as
select
    *
from
    partsupp,
    supplier,
    nation,
    region
where
    s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'AMERICA';

create index on t (ps_partkey);
which allows us to rewrite Q2 like this (this also ditches the ORDER BY
and LIMIT clauses):
select
    1
from
    part,
    t
where
    p_partkey = ps_partkey
    and p_size = 16
    and p_type like '%NICKEL'
    and ps_supplycost = (
        select
            min(ps_supplycost)
        from t
        where
            p_partkey = ps_partkey
    );
In fact, we can even ditch the conditions on p_size/p_type, which makes
the issue even more severe:
select
    1
from
    part,
    t
where
    p_partkey = ps_partkey
    and ps_supplycost = (
        select
            min(ps_supplycost)
        from t
        where
            p_partkey = ps_partkey
    );
with the join estimated like this:
Hash Join  (cost=89761.10..1239195.66 rows=17 width=4)
           (actual time=15379.356..29315.436 rows=1182889 loops=1)
  Hash Cond: ((t.ps_partkey = part.p_partkey) AND
              (t.ps_supplycost = (SubPlan 1)))
Yeah, that's underestimated by a factor of 70000 :-(
An interesting observation is that if you remove the condition on the
supply cost (the one with the correlated subquery), the estimates become
perfect again. So this seems to be about that particular condition, or
about how we combine the selectivities ...
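(To be explicit, by removing that condition I mean keeping just the plain
join, i.e. something like

    select
        1
    from
        part,
        t
    where
        p_partkey = ps_partkey;

for which the row estimate is spot on.)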
I'm not sure I've figured all the details yet, but this seems to be due
to a dependency between the ps_partkey and ps_supplycost columns.
When estimating the second condition, we end up calling eqjoinsel()
with SubPlan and Var arguments. We clearly won't have ndistinct or MCV
stats for the SubPlan, so we use

    nd1 = 200;      /* default */
    nd2 = 94005;    /* n_distinct for t.ps_supplycost */

and end up (thanks to eqjoinsel_inner and no NULLs in the data) with

    selec_inner = 0.00001 = Min(1/nd1, 1/nd2)
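Just to show how that plays out (rough numbers, assuming the usual
scale-10 cardinalities of ~2M rows in part and ~1.6M rows in t, and
hand-waving over the exact ndistinct values the planner uses):

    cross product of part and t:    ~2,000,000 * ~1,600,000
    (p_partkey = ps_partkey):       * 1/max(nd(p_partkey), nd(ps_partkey))
                                      ~ * 1/2,000,000
    (ps_supplycost = (SubPlan 1)):  * 1/max(200, 94005) = * 1/94005

which works out to roughly 1,600,000 / 94005 ~ 17 rows, i.e. pretty much
the rows=17 estimate in the hash join above.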
But that's entirely bogus, because while there are ~100k distinct values
in t.ps_supplycost, those are for *all* ps_partkey values combined. Each
individual ps_partkey value has only about 1.4 distinct ps_supplycost
values on average:
select avg(x)
  from (select count(distinct ps_supplycost) as x
          from t group by ps_partkey) foo;

        avg
--------------------
 1.3560712631162481
(1 row)
Which I think is the root cause here ...
The fact that we're using the same table "t" in both the main query and
the correlated subquery seems rather irrelevant here, because we might
also create
create table s as
select
    ps_partkey,
    min(ps_supplycost) as min_ps_supplycost
from t
group by ps_partkey;

and use that instead, and we'd still have the same issue. It's just the
fact that for a given ps_partkey value there are only a couple of distinct
ps_supplycost values, not the ~100k we have for the table as a whole.
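That is, just to spell out what I mean by using "s" instead, something
like

    select
        1
    from
        part,
        t
    where
        p_partkey = ps_partkey
        and ps_supplycost = (
            select
                min_ps_supplycost
            from s
            where
                s.ps_partkey = p_partkey
        );

would run into exactly the same misestimate.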
I wonder if we could somehow use the ndistinct coefficients to improve
this. I suppose eqjoinsel/eqjoinsel_inner could look at

    ndistinct(ps_partkey, ps_supplycost) / ndistinct(ps_partkey)

when estimating the (SubPlan = Var) condition, and tweak selec_inner
accordingly.
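FWIW the infrastructure to collect that coefficient already exists, i.e.
something like

    create statistics t_partkey_supplycost (ndistinct)   -- arbitrary name
        on ps_partkey, ps_supplycost from t;
    analyze t;

would give us ndistinct(ps_partkey, ps_supplycost), and combined with the
per-column ndistinct for ps_partkey that's essentially the ~1.36
coefficient computed above. So this is mostly about teaching the join
selectivity code to actually use it.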
I do see two challenges with this, though:
1) This probably requires considering all the join clauses at once (just
like we already do for regular restriction clauses), but eqjoinsel and
friends seem rather firmly designed around inspecting the clauses one by
one.
2) I'm not sure what to do about the SubPlan side, for which we may not
have any reliable ndistinct estimates at all.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services