Thread: Proper usage of ndistinct vs. dependencies extended statistics

Proper usage of ndistinct vs. dependencies extended statistics

From
Paul Martinez
Date:
Hello,

I have some questions about the different types of extended statistics
that were introduced in Postgres 10.
- Which types of queries are each statistic type supposed to improve?
- When should one type of statistic be used over the other? Should they
  both always be used?

We have a multi-tenant application and all of our tables have a denormalized
tenant_id column. (Most tables actually use the tenant_id as part of a
composite primary key on (tenant_id, id).)

As the docs suggest, we haven't created extended STATISTICS except for when
we observe the query planner making poor query plans.

We've seen poor query plans on queries involving filters on foreign keys:

  Table: fk_table
--------------------
tenant_id  | integer
id         | integer
fk_id      | integer

PRIMARY KEY (tenant_id, id)
FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table(tenant_id, id)

The id columns on these tables are unique, so there is a functional dependence
between fk_id and tenant_id; if the fk_id columns are the same, then the
tenant_id columns must also be the same.

This table has ~4.6 million rows, ~1300 distinct values for tenant_id, and
~13000 distinct values for fk_id.

A single SELECT query that filters on tenant_id and fk_id erroneously
estimates that it will return a single row (4,600,000 / 1300 / 13,000 ~= 0.1):

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND
fk_id = 10000;
                                  QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table
   (cost=0.43..4.45 rows=1 width=44) (actual time=0.016..1.547
rows=3113 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 10000))

In other places we've used a ndistinct statistic to solve this issue, but that
doesn't help in this case. Postgres still estimates that the query will return
a single row.

=> CREATE STATISTICS ndistinct_stat (ndistinct) ON tenant_id, fk_id
FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxndistinct FROM pg_statistic_ext;
    stxname     |  stxndistinct   |
----------------+-----------------+
 ndistinct_stat | {"1, 3": 3433}  |
=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND
fk_id = 10000;
-- (unchanged)

Why doesn't the ndistinct statistic get used when planning this query? (We're
currently on Postgre 10.6.) In contrast, if we create a functional dependency
statistic then Postgres will accurately predict the result size.

=> CREATE STATISTICS dep_stat (dependencies) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxdependencies FROM pg_statistic_ext;
  stxname   |             stxdependencies
----------------+------------------------------------------
 dep_stat   | {"1 => 3": 1.000000, "3 => 1": 0.060300}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND
fk_id = 10000;
                                  QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table
   (cost=0.43..1042.23 rows=612 width=44) (actual time=0.011..0.813
rows=3056 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 10000))

So, in general, which type of extended statistic should be used? Where do the
different kinds of statistics get used in the query planner? Is there an
advantage to using one type of statistic vs the other, or should we always
create both?

And in our specific example, with a schema designed for multi-tenancy, which
types of statistics should we use for our foreign keys, where tenant_id is
functionally dependent on the other foreign_id columns?

To explain where some of our confusion is coming from, here's the example where
adding an ndistinct statistic helped: Postgres was adding a filter after an
index scan instead of including the filter as part of the index scan.

big_table had ~500,000,000 rows,
~3000 distinct values for column a,
~3000 distinct values for column b,
but just ~4500 distinct values for the (a, b) tuple,
and column b was functionally dependent on column c.

Postgres wanted to do:

=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
Index Scan using big_table_a_b_c on big_table  (cost=0.57..122.41
rows=1 width=16)
  Index Cond: ((a = 1) AND (b = 10))
  Filter: c = ANY ('{100, 101, 102, 103, 104, 105, ...}')

But then we did:

=> CREATE STATISTICS big_table_a_b_ndistinct (ndistinct) ON a, b FROM big_table;
=> ANALYZE big_table;
=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
Index Scan using big_table_a_b_c on big_table  (cost=0.57..122.41
rows=1 width=16)
  Index Cond: ((a = 1) AND (b = 10)) AND (c = ANY ('{100, 101, 102,
103, 104, 105, ...}'))

(This had very poor performance between Postgres thought it would have to
filter 500,000,000 / 3000 / 3000 ~= 55 rows, but actually it had to filter
500,000,000 / 4500 ~= 110,000 rows.)

Because of the functional dependency on b and c, maybe a dependencies statistic
on b and c would have also had the desired effect, but at that point we didn't
entirely understand how functional dependencies worked, so we didn't try them.


If anyone can give some insight about when one of these two statistic types is
more appropriate that would be extremely helpful!

- Paul



Re: Proper usage of ndistinct vs. dependencies extended statistics

From
David Rowley
Date:
On Thu, 11 Apr 2019 at 11:53, Paul Martinez <hellopfm@gmail.com> wrote:
>
> I have some questions about the different types of extended statistics
> that were introduced in Postgres 10.
> - Which types of queries are each statistic type supposed to improve?

Multivariate ndistinct stats are aimed to improve distinct estimates
over groups of columns.  These can help in cases like GROUP BY a,b,
SELECT DISTINCT a,b, SELECT a,b FROM x UNION SELECT a,b FROM y;  They
also help in determining the number of times an index will be
rescanned in cases like nested loops with a parameterised inner path.

I see multivariate ndistinct estimates are not used for normal
selectivity estimates for unknown values.  e.g PREPARE q1 (int, int)
AS SELECT * FROM t1 WHERE a = $1 and b = $2; still assumes a and b are
independent even when ndistinct stats exist on the two columns.

There are a few other usages too. See calls of estimate_num_groups()

dependency stats just handle WHERE clauses (or more accurately,
clauses containing a reference to a single relation.  These only
handle equality OpExprs.  e.g "a = 10 and y = 3", not "a < 6 and y =
3".  Further stat types (most common values) added in PG12 aim to
allow inequality operators too.

> - When should one type of statistic be used over the other? Should they
>   both always be used?

If they both always should be always used then we'd likely not have
bothered making the types optional.   Both ndistinct and dependency
stats are fairly cheap to calculate and store, so it might not be too
big an issue adding both types if you're not sure. With these two
types there's not really any choice for the planner to decide to use
one or the other, it just makes use of the ones it can use for the
given situation.   That won't be the case as more stats types get
added. In PG12, for example, we had to choose of MCV stats should be
applied before dependencies stats.  That might be a no-brainer, but
perhaps the future there will be stats types where the order to apply
them is not so clear, although in those cases it might be questionable
why you'd want to define more than one type of stats on the same set
of columns.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services