
From Tomas Vondra
Subject Improving estimates for TPC-H Q2
Date
Msg-id 20200507235711.3wazsx3euusrutkl@development
List pgsql-hackers
Hi,

I've been re-running the TPC-H benchmark, to remind myself of the common
issues with OLAP workloads, and one of the most annoying problems seems
to be the misestimates in Q2. The query is not particularly complex
(although it does have a correlated subquery with an aggregate), but it's
one of the queries prone to a cascade of nested loops running forever.
I wonder if there's something we could do to handle this better.

A raw Q2 looks like this:

     select
         s_acctbal,
         s_name,
         n_name,
         p_partkey,
         p_mfgr,
         s_address,
         s_phone,
         s_comment
     from
         part,
         supplier,
         partsupp,
         nation,
         region
     where
         p_partkey = ps_partkey
         and s_suppkey = ps_suppkey
         and p_size = 16
         and p_type like '%NICKEL'
         and s_nationkey = n_nationkey
         and n_regionkey = r_regionkey
         and r_name = 'AMERICA'
         and ps_supplycost = (
             select
                 min(ps_supplycost)
             from
                 partsupp,
                 supplier,
                 nation,
                 region
             where
                 p_partkey = ps_partkey
                 and s_suppkey = ps_suppkey
                 and s_nationkey = n_nationkey
                 and n_regionkey = r_regionkey
                 and r_name = 'AMERICA'
         )
     order by
         s_acctbal desc,
         n_name,
         s_name,
         p_partkey;

and the full query plan is attached (q2-original-plan.txt).

The relevant part of the plan is the final join, which also considers
the subplan result (all estimates are for scale 10):

    ->  Merge Join  (cost=638655.36..1901120.61 rows=1 width=192)
                    (actual time=7299.121..10993.517 rows=4737 loops=1)
          Merge Cond: (part.p_partkey = partsupp.ps_partkey)
          Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
          Rows Removed by Join Filter: 1661

Yeah, this is estimated as 1 row but actually returns 4737 rows. All
the other nodes are estimated very accurately; it's just this final join
that is entirely wrong.

If you tweak the costs a bit (e.g. by reducing random_page_cost etc.) the
plan can easily switch to nested loops, with this join much deeper in
the plan. See the attached q2-nested-loops.txt for an example (I had to
disable merge/hash joins to trigger this on scale 10, on larger scales
it happens much more easily).
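To reproduce the nested-loop variant on scale 10, something along these
lines before running the query should do it (the exact random_page_cost
value is just an example, anything well below the 4.0 default):

     set random_page_cost = 1.5;
     set enable_mergejoin = off;
     set enable_hashjoin = off;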

Now, the query seems a bit complex, but we can easily simplify it by
creating an extra table and reducing the number of joins:

     create table t as select
       *
     from
         partsupp,
         supplier,
         nation,
         region
     where
         s_suppkey = ps_suppkey
         and s_nationkey = n_nationkey
         and n_regionkey = r_regionkey
         and r_name = 'AMERICA';

     create index on t (ps_partkey);
     
which allows us to rewrite Q2 like this (this also ditches the ORDER BY
and LIMIT clauses):

     select
         1
     from
         part,
         t
     where
         p_partkey = ps_partkey
         and p_size = 16
         and p_type like '%NICKEL'
         and ps_supplycost = (
             select
                 min(ps_supplycost)
             from t
             where
                 p_partkey = ps_partkey
         );

In fact, we can even ditch the conditions on p_size/p_type, which makes
the issue even more severe:

     select
         1
     from
         part,
         t
     where
         p_partkey = ps_partkey
         and ps_supplycost = (
             select
                 min(ps_supplycost)
             from t
             where
                 p_partkey = ps_partkey
         );

with the join estimated like this:

    Hash Join  (cost=89761.10..1239195.66 rows=17 width=4)
               (actual time=15379.356..29315.436 rows=1182889 loops=1)
      Hash Cond: ((t.ps_partkey = part.p_partkey) AND
                  (t.ps_supplycost = (SubPlan 1)))

Yeah, that's underestimated by a factor of 70000 :-(

An interesting observation is that if you remove the condition on supply
cost (the one with the correlated subquery), the estimates become perfect
again. So this seems to be about this particular condition, or about how
we combine the selectivities ...
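That is, with just

     select 1 from part, t where p_partkey = ps_partkey;

the row estimate is fine; it's adding the supplycost condition back that
breaks it.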

I'm not sure I've figured all the details yet, but this seems to be due
to a dependency between the ps_partkey and ps_supplycost columns.

When estimating the second condition, we end up calling eqjoinsel()
with SubPlan and Var arguments. We clearly won't have ndistinct or MCVs
for the SubPlan, so we use

     nd1 = 200;   /* default */
     nd2 = 94005; /* n_distinct for t.ps_supplycost */

and end up (thanks to eqjoinsel_inner and no NULLs in the data) with

     selec_inner = 0.00001 = Min(1/nd1, 1/nd2)
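
Just to tie this back to the 17-row estimate above: a back-of-the-envelope
calculation (assuming t has roughly 1.6M rows at scale 10, i.e. a fifth of
partsupp, and the p_partkey join clause is estimated at about 1/2M, the
number of parts) gives

     rows ~= 2,000,000 * 1,600,000 * (1/2,000,000) * (1/94,005) ~= 17

which is where the tiny estimate comes from.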

But that's entirely bogus, because while there are ~100k distinct values
in t.ps_supplycost, those are across *all* ps_partkey values combined.
Each individual ps_partkey value has only about 1.4 distinct ps_supplycost
values on average:

     select avg(x) from (select count(distinct ps_supplycost) as x
                           from t group by ps_partkey) foo;

             avg         
     --------------------
      1.3560712631162481
     (1 row)

Which I think is the root cause here ...

The fact that we're using the same table "t" in both the main query and
the correlated subquery seems rather irrelevant here, because we might
also create

     create table s as select
         ps_partkey,
         min(ps_supplycost) as min_ps_supplycost
     from t group by ps_partkey;

and use that instead, and we'd still have the same issue. It's just that
for a given ps_partkey value there are only a couple of ps_supplycost
values, not the ~100k we have for the table as a whole.
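
For example, a rewrite against "s" might look like this (just a sketch):

     select
         1
     from
         part,
         t,
         s
     where
         p_partkey = t.ps_partkey
         and s.ps_partkey = t.ps_partkey
         and t.ps_supplycost = s.min_ps_supplycost;

and now the problematic condition is an ordinary Var = Var clause, yet the
estimate still only sees the two per-column ndistinct values, so it ends
up just as bad.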

I wonder if we could use the ndistinct coefficients to improve this,
somehow. I suppose eqjoinsel/eqjoinsel_inner could look at

     ndistinct(ps_partkey, ps_supplycost) / ndistinct(ps_partkey)

when estimating the (SubPlan = Var) condition, and tweak selec_inner
accordingly.
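
FWIW the coefficient itself is easy to get. Either directly from the data,
which should give roughly the 1.36 figure from before:

     select count(distinct (ps_partkey, ps_supplycost))::numeric
            / count(distinct ps_partkey)
     from t;

or by defining extended ndistinct statistics on the two columns (the
statistics name is arbitrary, of course):

     create statistics t_partkey_supplycost (ndistinct)
         on ps_partkey, ps_supplycost from t;
     analyze t;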

I do see two challenges with this, though:

1) This probably requires considering all join clauses at once (just
like we do for regular clauses), but eqjoinsel and friends seem rather
heavily designed for inspecting the clauses one by one.

2) I'm not sure what to do about the SubPlan side, for which we may not
have any reliable ndistinct estimates at all.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

