Proper usage of ndistinct vs. dependencies extended statistics - Mailing list pgsql-hackers
From | Paul Martinez |
---|---|
Subject | Proper usage of ndistinct vs. dependencies extended statistics |
Date | |
Msg-id | CAF+2_SGhOV6_Nr_qxahqaYEcJYuWKRckuO16sMXk6uiueCZKGg@mail.gmail.com Whole thread Raw |
Responses |
Re: Proper usage of ndistinct vs. dependencies extended statistics
|
List | pgsql-hackers |
Hello, I have some questions about the different types of extended statistics that were introduced in Postgres 10. - Which types of queries are each statistic type supposed to improve? - When should one type of statistic be used over the other? Should they both always be used? We have a multi-tenant application and all of our tables have a denormalized tenant_id column. (Most tables actually use the tenant_id as part of a composite primary key on (tenant_id, id).) As the docs suggest, we haven't created extended STATISTICS except for when we observe the query planner making poor query plans. We've seen poor query plans on queries involving filters on foreign keys: Table: fk_table -------------------- tenant_id | integer id | integer fk_id | integer PRIMARY KEY (tenant_id, id) FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table(tenant_id, id) The id columns on these tables are unique, so there is a functional dependence between fk_id and tenant_id; if the fk_id columns are the same, then the tenant_id columns must also be the same. This table has ~4.6 million rows, ~1300 distinct values for tenant_id, and ~13000 distinct values for fk_id. A single SELECT query that filters on tenant_id and fk_id erroneously estimates that it will return a single row (4,600,000 / 1300 / 13,000 ~= 0.1): => EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000; QUERY PLAN ---------------------------------------------------------------------------------- Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..4.45 rows=1 width=44) (actual time=0.016..1.547 rows=3113 loops=1) Index Cond: ((tenant_id = 100) AND (fk_id = 10000)) In other places we've used a ndistinct statistic to solve this issue, but that doesn't help in this case. Postgres still estimates that the query will return a single row. => CREATE STATISTICS ndistinct_stat (ndistinct) ON tenant_id, fk_id FROM fk_table; => ANALYZE fk_table; => SELECT stxname, stxndistinct FROM pg_statistic_ext; stxname | stxndistinct | ----------------+-----------------+ ndistinct_stat | {"1, 3": 3433} | => EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000; -- (unchanged) Why doesn't the ndistinct statistic get used when planning this query? (We're currently on Postgre 10.6.) In contrast, if we create a functional dependency statistic then Postgres will accurately predict the result size. => CREATE STATISTICS dep_stat (dependencies) ON tenant_id, fk_id FROM fk_table; => ANALYZE fk_table; => SELECT stxname, stxdependencies FROM pg_statistic_ext; stxname | stxdependencies ----------------+------------------------------------------ dep_stat | {"1 => 3": 1.000000, "3 => 1": 0.060300} => EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 10000; QUERY PLAN ---------------------------------------------------------------------------------- Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..1042.23 rows=612 width=44) (actual time=0.011..0.813 rows=3056 loops=1) Index Cond: ((tenant_id = 100) AND (fk_id = 10000)) So, in general, which type of extended statistic should be used? Where do the different kinds of statistics get used in the query planner? Is there an advantage to using one type of statistic vs the other, or should we always create both? And in our specific example, with a schema designed for multi-tenancy, which types of statistics should we use for our foreign keys, where tenant_id is functionally dependent on the other foreign_id columns? To explain where some of our confusion is coming from, here's the example where adding an ndistinct statistic helped: Postgres was adding a filter after an index scan instead of including the filter as part of the index scan. big_table had ~500,000,000 rows, ~3000 distinct values for column a, ~3000 distinct values for column b, but just ~4500 distinct values for the (a, b) tuple, and column b was functionally dependent on column c. Postgres wanted to do: => SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...); Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16) Index Cond: ((a = 1) AND (b = 10)) Filter: c = ANY ('{100, 101, 102, 103, 104, 105, ...}') But then we did: => CREATE STATISTICS big_table_a_b_ndistinct (ndistinct) ON a, b FROM big_table; => ANALYZE big_table; => SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...); Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16) Index Cond: ((a = 1) AND (b = 10)) AND (c = ANY ('{100, 101, 102, 103, 104, 105, ...}')) (This had very poor performance between Postgres thought it would have to filter 500,000,000 / 3000 / 3000 ~= 55 rows, but actually it had to filter 500,000,000 / 4500 ~= 110,000 rows.) Because of the functional dependency on b and c, maybe a dependencies statistic on b and c would have also had the desired effect, but at that point we didn't entirely understand how functional dependencies worked, so we didn't try them. If anyone can give some insight about when one of these two statistic types is more appropriate that would be extremely helpful! - Paul
pgsql-hackers by date: