Home > mailing lists

Re: multivariate statistics v8 - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: multivariate statistics v8
Date	December 23, 2015 22:08:25
Msg-id	567AF104.80700@2ndquadrant.com Whole thread Raw
In response to	WIP: multivariate statistics / proof of concept (Tomas Vondra <tv@fuzzy.cz>)
Responses	Re: multivariate statistics v9 (Tomas Vondra <tomas.vondra@2ndquadrant.com>) Re: multivariate statistics v8 (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Tree view

Hi,

attached is v8 of the multivariate statistics patch (or rather a patch
series). The patch currently has 7 parts, but 0001 is just a fix of the
pull_varnos issue (possibly incorrect/temporary), and 0007 is just an
attempt to add the "multicolumn distinctness" (experimental for now).

There are three noteworthy changes:

1) Correct estimation of OR-clauses - this turned out to be a rather
minor change, thanks to simply transforming the OR-clauses to
AND-clauses, see clauselist_selectivity_or() for details.

2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead
adding separate commands CREATE STATISTICS / DROP STATISTICS, as
proposed in the "multicolumn distinctness" thread:

http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp

This seems a better approach than the ALTER TABLE one - not only it
nicely fixes the grammar issues, it also naturally extends to
multi-table statistics (despite we don't know how those should work
exactly).

The syntax is this:

CREATE STATISTICS name ON table (columns) WITH (options);

DROP STATISTICS name;

and the 'name' is optional (and if absent, should be generated just
like for indexes, but that's not implemented yet).

The remaining question is how unique the statistics name should be.
My initial plan was to make it unique within a table, but that of
course does not work well with the DROP STATISTICS (it'd have to
specify the table name also), and it'd also now work with statistics
on multiple tables (which is one of the reasons for abandoning ALTER
TABLE stuff).

So I think it should be unique across tables. Statistics are hardly
a global object, so it should be unique within a schema. I thought
that simply using the schema of the table would work, but that of
course breaks with multiple tables in different schemas. So the only
solution seems to be explicit schema for statistics.

3) I've also started hacking on adding the "multicolumn distinctness"
proposed by Horiguchi-san, but I haven't really got that working. It
seems to be a bit more complicated than I anticipated because of the
"only equality conditions" restriction. So the 0007 patch only
really adds basic syntax and trivial build.

I do have bunch of ideas/questions about this statistics type. For
example, should we compute just a single coefficient or the exact
combination of columns specified in CREATE STATISTICS, or perhaps
for some additional subsets? I.e. with

CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);

should we compute just the coefficient for (a,b,c), or maybe also
for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such
combinations, but perhaps it's acceptable.

Having the coefficient for just the single combination specified in
CREATE STATISTICS makes the estimation difficult when some of the
columns are not specified. For example, with coefficient just for
(a,b,c), what should happen for (WHERE a=1 AND b=2)?

Should we simply ignore the statistics, or apply it anyway and
somehow compensate for the missing columns?

I've also started working on something like a paper, hopefully
explaining the ideas and implementation more clearly and consistently
than possible on a mailing list (thanks to charts, figures and such).
It's available here (both the .tex source and .pdf with the current
version):

https://bitbucket.org/tvondra/mvstats-paper/src

It's not exactly short (~30 pages), and it's certainly incomplete with a
plenty of TODO notes, but hopefully it's already useful and not entirely
bogus.

Comments and questions are welcome - both to the patch and paper.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

From: "Joshua D. Drake"
Date: 23 December 2015, 21:55:47
Subject: Re: Parallel pg_dump's error reporting doesn't work worth squat

From: Robert Haas
Date: 23 December 2015, 22:14:41
Subject: Re: parallel joins, and better parallel explain

Re: multivariate statistics v8 - Mailing list pgsql-hackers

Attachment

Previous

Next