Multi-column distinctness. - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Multi-column distinctness. |
Date | |
Msg-id | 20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
Responses |
Re: Multi-column distinctness.
Re: Multi-column distinctness. Re: Multi-column distinctness. |
List | pgsql-hackers |
Hello, this patch enables planner to be couscious of inter-column correlation. Sometimes two or more columns in a table has some correlation which brings underestimate, which leads to wrong join method and ends with slow execution. Tomas Vondra is now working on heavily-equipped multivariate statistics for OLAP usage. In contrast, this is a lightly implemented solution which calculates only the ratio between a rows estimated by current method and a actual row number. I think this doesn't conflict with his work except the grammar part. This would apply fewer cases but I suppose still in many cases the correlated colums would be in simple proportional relationship, so this can help the cases. The previous discussion is https://wiki.postgresql.org/wiki/Cross_Columns_Stats http://www.postgresql.org/message-id/4D0BA4D5.8080707@fuzzy.cz This patch is covers only the type A (Discrete values and equality conditions) but I think it is usable in many cases seen in the field. So I'd like to repropose for the latest version of PostgreSQL. - design outline Provide new system catalog pg_mvcoefficient to store theinformation required to do this. A user can instruct planner to correct the wrong estimationcaused by inter-column correlation by registering the columnsinpg_mvcoefficient using new DDL ALTER TABLE... ADD STATISTICS. Analyzing of the target table also stores the 'multivariatecoefficient' calculated by using the following formula intopg_mvcoefficient. mv_coef(c1, c2, ..) = ndistinct(c1 * c2 * ...) / (ndistinct(c1) * ndistinct(c2) * ...) In clauselist_selectivity, planner corrects the estimate ifgiven clauselist has equivalence-classes-compatible clauses forrequiredcolumns at the top-level. - Example The attached perl script gentbl.pl generates test data resemblessome tables in DBT-3 benchmark. > $ perl gentbl.pl | psql postgres > =# EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2501; ... > Seq Scan on t1 (cost=0.00..653.00 rows=1 width=12) (actual time=0.021..6.348 rows=8 loops=1) This doesn't have no harm but in a join case, > =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b; > Hash Join (cost=122.00..855.32 rows=32 width=24) > (actual time=2.009..29.208 rows=32000 loops=1) The correlation between a and b makes the estimate toosmall. Then register correlation setting. > =# ALTER TABLE t1 ADD STATISTICS (mvndistinct) ON (a, b); > =# ANALYZE t1; Then the estimate will be corrected. > =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b; > Hash Join (cost=122.00..855.32 rows=32000 width=24) > (actual time=1.907..29.025 rows=32000 loops=1) - Known limitations The coefficient calculated by this feature is applicble only forconjunctions of simple var-exprs on merge-joinable operator. The coefficient is applied regardless of whether the baseestimate has been calculated using MCV, so estimates fornon-joincases on the columns which has MCV can rather becomeinaccurate. Uniform correlation is assumed so some extent of correlationununiformity would lead to wrong estimation. This patch set doesn't contain any document yet. - Patche Files This patch consists of the following files. - 0001-New-system-catalog-pg_mvcoefficient.patch Adds new system catalog pg_mvcoefficient. - 0002-Analyze-part-for-multivariate-coefficient.patch Analyze part of multivariate coefficient. - 0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch Planner part to make it use the multivariate coefficient. - 0004-Syntactical-part-of-multivariate-coefficient.patch Add new DDL to define mv coefficient columns. The four files above are essential. The two following files areexperimental patch to add mvcattrs to index columns. One ofthemadds a new opclass for int2vector of btree but it would beoverkill. - 0005-Add-btree-operator-class-for-int2vector.patch Add btree operator class for int2vector. - 0006-Use-modified-index-of-pg_mvcoefficient.patch Use modified index of pg_mvcoefficient. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: