Proper usage of ndistinct vs. dependencies extended statistics - Mailing list pgsql-hackers

From Paul Martinez
Subject Proper usage of ndistinct vs. dependencies extended statistics
Msg-id CAF+2_SGhOV6_Nr_qxahqaYEcJYuWKRckuO16sMXk6uiueCZKGg@mail.gmail.com
List pgsql-hackers
Hello,

I have some questions about the different types of extended statistics
that were introduced in Postgres 10.
- Which types of queries are each statistic type supposed to improve?
- When should one type of statistic be used over the other? Should they
  both always be used?

We have a multi-tenant application and all of our tables have a denormalized
tenant_id column. (Most tables actually use the tenant_id as part of a
composite primary key on (tenant_id, id).)

As the docs suggest, we haven't created extended statistics except where we've
observed the query planner choosing poor plans.

We've seen poor query plans on queries involving filters on foreign keys:

  Table: fk_table
--------------------
tenant_id  | integer
id         | integer
fk_id      | integer

PRIMARY KEY (tenant_id, id)
FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table(tenant_id, id)
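
(For concreteness, that corresponds roughly to the following DDL; the NOT NULL
constraints and the exact index definition are simplifications on our part:)

=> CREATE TABLE left_table (
       tenant_id integer NOT NULL,
       id        integer NOT NULL,
       PRIMARY KEY (tenant_id, id)
   );
=> CREATE TABLE fk_table (
       tenant_id integer NOT NULL,
       id        integer NOT NULL,
       fk_id     integer,
       PRIMARY KEY (tenant_id, id),
       FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table (tenant_id, id)
   );
=> -- the index used by the plans below
=> CREATE INDEX fk_table_tenant_id_fk_id_index ON fk_table (tenant_id, fk_id);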

The id columns on these tables are unique on their own, so there is a functional
dependence from fk_id to tenant_id: if two rows have the same fk_id, they must
also have the same tenant_id.
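
(A quick, if expensive, way to sanity-check that dependence is something like
the following; it should return no rows if every fk_id maps to exactly one
tenant_id:)

=> SELECT fk_id
   FROM fk_table
   WHERE fk_id IS NOT NULL
   GROUP BY fk_id
   HAVING COUNT(DISTINCT tenant_id) > 1;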

This table has ~4.6 million rows, ~1300 distinct values for tenant_id, and
~13000 distinct values for fk_id.

A single SELECT query that filters on tenant_id and fk_id erroneously estimates
that it will return a single row (4,600,000 / 1,300 / 13,000 ~= 0.3, which gets
clamped to 1):

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000;
                                  QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table
   (cost=0.43..4.45 rows=1 width=44) (actual time=0.016..1.547 rows=3113 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 10000))
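
(The per-column numbers being multiplied there come from pg_stats; they can be
inspected with something like the query below. Note that n_distinct may be
reported as a negative value, meaning a fraction of the row count rather than
an absolute number:)

=> SELECT attname, n_distinct
   FROM pg_stats
   WHERE tablename = 'fk_table' AND attname IN ('tenant_id', 'fk_id');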

In other places we've used an ndistinct statistic to solve this issue, but that
doesn't help in this case; Postgres still estimates that the query will return
a single row.

=> CREATE STATISTICS ndistinct_stat (ndistinct) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxndistinct FROM pg_statistic_ext;
    stxname     |  stxndistinct
----------------+----------------
 ndistinct_stat | {"1, 3": 3433}
=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000;
-- (unchanged)

Why doesn't the ndistinct statistic get used when planning this query? (We're
currently on Postgres 10.6.) In contrast, if we create a functional dependency
statistic, Postgres accurately predicts the result size.

=> CREATE STATISTICS dep_stat (dependencies) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxdependencies FROM pg_statistic_ext;
 stxname  |              stxdependencies
----------+-------------------------------------------
 dep_stat | {"1 => 3": 1.000000, "3 => 1": 0.060300}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000;
                                  QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table
   (cost=0.43..1042.23 rows=612 width=44) (actual time=0.011..0.813 rows=3056 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 10000))

So, in general, which type of extended statistic should be used? Where do the
different kinds of statistics get used in the query planner? Is there an
advantage to using one type of statistic vs the other, or should we always
create both?

And in our specific example, with a schema designed for multi-tenancy, which
types of statistics should we use for our foreign keys, where tenant_id is
functionally dependent on the other foreign_id columns?

To explain where some of our confusion is coming from, here's the example where
adding an ndistinct statistic helped: Postgres was adding a filter after an
index scan instead of including the filter as part of the index scan.

big_table had ~500,000,000 rows,
~3000 distinct values for column a,
~3000 distinct values for column b,
but just ~4500 distinct values for the (a, b) tuple,
and column b was functionally dependent on column c.
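
(Roughly, the relevant parts of that table; the column types here are
placeholders, and only the index named in the plans below is shown:)

=> CREATE TABLE big_table (a integer, b integer, c integer /* , ... */);
=> CREATE INDEX big_table_a_b_c ON big_table (a, b, c);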

Postgres wanted to do:

=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
Index Scan using big_table_a_b_c on big_table  (cost=0.57..122.41 rows=1 width=16)
  Index Cond: ((a = 1) AND (b = 10))
  Filter: c = ANY ('{100, 101, 102, 103, 104, 105, ...}')

But then we did:

=> CREATE STATISTICS big_table_a_b_ndistinct (ndistinct) ON a, b FROM big_table;
=> ANALYZE big_table;
=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
Index Scan using big_table_a_b_c on big_table  (cost=0.57..122.41 rows=1 width=16)
  Index Cond: ((a = 1) AND (b = 10) AND (c = ANY ('{100, 101, 102, 103, 104, 105, ...}')))

(This had very poor performance because Postgres thought it would have to
filter 500,000,000 / 3,000 / 3,000 ~= 55 rows, but it actually had to filter
500,000,000 / 4,500 ~= 110,000 rows.)

Because of the functional dependency between b and c, a dependencies statistic
on b and c might also have had the desired effect, but at that point we didn't
entirely understand how functional dependencies worked, so we didn't try one.
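
(Presumably that would have looked something like the following; the statistic
name is made up, and we never actually ran it:)

=> CREATE STATISTICS big_table_b_c_dep (dependencies) ON b, c FROM big_table;
=> ANALYZE big_table;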


If anyone can give some insight about when one of these two statistic types is
more appropriate that would be extremely helpful!

- Paul


