Multi-column distinctness. - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Multi-column distinctness.
Date
Msg-id 20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
Responses Re: Multi-column distinctness.
Re: Multi-column distinctness.
Re: Multi-column distinctness.
List pgsql-hackers
Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.


This would apply fewer cases but I suppose still in many cases
the correlated colums would be in simple proportional
relationship, so this can help the cases. The previous discussion
is

https://wiki.postgresql.org/wiki/Cross_Columns_Stats
http://www.postgresql.org/message-id/4D0BA4D5.8080707@fuzzy.cz

This patch is covers only the type A (Discrete values and
equality conditions) but I think it is usable in many cases seen
in the field. So I'd like to repropose for the latest version of
PostgreSQL.


- design outline
Provide new system catalog pg_mvcoefficient to store theinformation required to do this.
A user can instruct planner to correct the wrong estimationcaused by inter-column correlation by registering the
columnsinpg_mvcoefficient using new DDL ALTER TABLE... ADD STATISTICS.
 
Analyzing of the target table also stores the 'multivariatecoefficient' calculated by using the following formula
intopg_mvcoefficient.
 mv_coef(c1, c2, ..) =   ndistinct(c1 * c2 * ...) / (ndistinct(c1) * ndistinct(c2) * ...)
In clauselist_selectivity, planner corrects the estimate ifgiven clauselist has equivalence-classes-compatible clauses
forrequiredcolumns at the top-level.
 


- Example
The attached perl script gentbl.pl generates test data resemblessome tables in DBT-3 benchmark.

> $ perl gentbl.pl | psql postgres

>  =# EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2501; ...
>  Seq Scan on t1  (cost=0.00..653.00 rows=1 width=12) (actual time=0.021..6.348 rows=8 loops=1)
This doesn't have no harm but in a join case,

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
>  Hash Join  (cost=122.00..855.32 rows=32 width=24)
>             (actual time=2.009..29.208 rows=32000 loops=1)
The correlation between a and b makes the estimate toosmall. Then register correlation setting.

> =# ALTER TABLE t1 ADD STATISTICS (mvndistinct) ON (a, b);
> =# ANALYZE t1;
Then the estimate will be corrected.

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
>  Hash Join  (cost=122.00..855.32 rows=32000 width=24)
>             (actual time=1.907..29.025 rows=32000 loops=1)


- Known limitations
The coefficient calculated by this feature is applicble only forconjunctions of simple var-exprs on merge-joinable
operator.
The coefficient is applied regardless of whether the baseestimate has been calculated using MCV, so estimates
fornon-joincases on the columns which has MCV can rather becomeinaccurate.
 
Uniform correlation is assumed so some extent of correlationununiformity would lead to wrong estimation.
This patch set doesn't contain any document yet.


- Patche Files
This patch consists of the following files.
- 0001-New-system-catalog-pg_mvcoefficient.patch Adds new system catalog pg_mvcoefficient.
- 0002-Analyze-part-for-multivariate-coefficient.patch Analyze part of multivariate coefficient.
- 0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch Planner part to make it use the multivariate
coefficient.
- 0004-Syntactical-part-of-multivariate-coefficient.patch Add new DDL to define mv coefficient columns.
The four files above are essential. The two following files areexperimental patch to add mvcattrs to index columns. One
ofthemadds a new opclass for int2vector of btree but it would beoverkill.
 
- 0005-Add-btree-operator-class-for-int2vector.patch Add btree operator class for int2vector.
- 0006-Use-modified-index-of-pg_mvcoefficient.patch Use modified index of pg_mvcoefficient.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: 9.5 feature count
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Multiline-statement and multi-statement for pgbench custom script.