Multitenancy optimization - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: Multitenancy optimization
Msg-id: 5f89364f-1cd6-448b-6067-e279cbb16826@postgrespro.ru
List: pgsql-hackers
Hi hackers,

Postgres is quite frequently used in Internet services with a multi-tenant architecture. This means that all objects stored in the database carry something like a "tenant_id" foreign key, and this key appears in all queries, e.g.:

    select * from Product where tenant_id=? and product_name=?;

The problem is that columns like "tenant_id" and "product_name" are frequently highly correlated (for example, if the product is produced by just one company). Postgres knows nothing about this correlation and therefore makes an incorrect selectivity estimate for the predicate.

Certainly it is possible to create multicolumn statistics to tell Postgres about the correlation, but unfortunately this is not a workable solution. First, we would have to create multicolumn statistics for all possible combinations of the table's attributes that include "tenant_id", which is both inconvenient and inefficient. Second, multicolumn statistics are currently not used when calculating join selectivity, and it is precisely for joins that estimation errors are most critical, causing Postgres to choose bad execution plans.

From my point of view, the best solution would be to make Postgres take possible statistics errors into account and choose a "stable" plan whose cost does not grow significantly when the estimates are wrong. But that requires a huge refactoring of the optimizer. I know of Postgres customers who faced this problem and simply hacked the calc_joinrel_size_estimate function: if an attribute is named "tenant_id", its selectivity is ignored. This leads to good query plans, but certainly cannot be considered an acceptable solution.

I thought about more straightforward ways of reaching the same effect. Postgres already allows the number of distinct values of an attribute to be specified explicitly:

    alter table foo alter column x set (n_distinct=1);

Unfortunately, just setting it to 1 doesn't work.
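For completeness, the multicolumn statistics mentioned above would be created roughly like this (a sketch, assuming the extended-statistics support available since PostgreSQL 10; the statistics and table names are illustrative):

```sql
-- Sketch: extended statistics on the correlated column pair
-- (names "product_tenant_stats" and "product" are illustrative).
CREATE STATISTICS product_tenant_stats (dependencies)
    ON tenant_id, product_name FROM product;
ANALYZE product;
```

As noted above, this can help single-table predicates, but it does not currently feed into join selectivity estimation, and it has to be repeated for every column combination involving "tenant_id".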
Postgres calculates selectivity from the MCV list or histogram and does not use the n_distinct value. It is also possible to disable statistics collection for the column:

    alter table foo alter column x set statistics 0;

But in that case Postgres falls back to DEFAULT_NUM_DISTINCT despite the n_distinct option specified for the attribute. I propose a small patch which makes Postgres use the explicitly specified n_distinct attribute option when no statistics are available. This test illustrates how it works (without the patch the estimate for this query is 1 row):

    postgres=# create table foo(x integer, y integer);
    CREATE TABLE
    postgres=# insert into foo values (generate_series(1,100000)/10, generate_series(1,100000)/10);
    INSERT 0 100000
    postgres=# alter table foo alter column x set (n_distinct=1);
    ALTER TABLE
    postgres=# alter table foo alter column x set statistics 0;
    ALTER TABLE
    postgres=# analyze foo;
    ANALYZE
    postgres=# explain select * from foo where x=100 and y=100;
                            QUERY PLAN
    -------------------------------------------------------
     Seq Scan on foo  (cost=0.00..1943.00 rows=10 width=8)
       Filter: ((x = 100) AND (y = 100))
    (2 rows)

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
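P.S. The misestimate the patch addresses comes from the planner's attribute-independence assumption. A small Python sketch (illustrative arithmetic only, not Postgres code) reproduces it for the table in the example above:

```python
# Sketch of the planner's independence assumption for the correlated
# table from the example above (columns x and y are always equal).
N = 100_000
rows = [(i // 10, i // 10) for i in range(1, N + 1)]

n_distinct = len({x for x, _ in rows})  # 10001 distinct values

# Equality selectivity per column is roughly 1/n_distinct; under the
# independence assumption the two selectivities are multiplied.
est = N * (1.0 / n_distinct) * (1.0 / n_distinct)
est_clamped = max(1, round(est))        # planner clamps to at least 1 row

actual = sum(1 for x, y in rows if x == 100 and y == 100)

print(est_clamped, actual)  # prints "1 10": a 10x underestimate
```

With the correlation accounted for, the correct selectivity of the conjunction is just 1/n_distinct, giving the 10-row estimate seen in the EXPLAIN output above.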