Re: multivariate statistics v8 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics v8
Date
Msg-id 567AF104.80700@2ndquadrant.com
Whole thread Raw
In response to WIP: multivariate statistics / proof of concept  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: multivariate statistics v9  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: multivariate statistics v8  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

attached is v8 of the multivariate statistics patch (or rather a patch
series). The patch currently has 7 parts, but 0001 is just a fix of the
pull_varnos issue (possibly incorrect/temporary), and 0007 is just an
attempt to add the "multicolumn distinctness" (experimental for now).

There are three noteworthy changes:

1) Correct estimation of OR-clauses - this turned out to be a rather
    minor change, thanks to simply transforming the OR-clauses to
    AND-clauses, see clauselist_selectivity_or() for details.

2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead
    adding separate commands CREATE STATISTICS / DROP STATISTICS, as
    proposed in the "multicolumn distinctness" thread:


http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp

    This seems a better approach than the ALTER TABLE one - not only it
    nicely fixes the grammar issues, it also naturally extends to
    multi-table statistics (despite we don't know how those should work
    exactly).

    The syntax is this:

      CREATE STATISTICS name ON table (columns) WITH (options);

      DROP STATISTICS name;

    and the 'name' is optional (and if absent, should be generated just
    like for indexes, but that's not implemented yet).

    The remaining question is how unique the statistics name should be.
    My initial plan was to make it unique within a table, but that of
    course does not work well with the DROP STATISTICS (it'd have to
    specify the table name also), and it'd also now work with statistics
    on multiple tables (which is one of the reasons for abandoning ALTER
    TABLE stuff).

    So I think it should be unique across tables. Statistics are hardly
    a global object, so it should be unique within a schema. I thought
    that simply using the schema of the table would work, but that of
    course breaks with multiple tables in different schemas. So the only
    solution seems to be explicit schema for statistics.

3) I've also started hacking on adding the "multicolumn distinctness"
    proposed by Horiguchi-san, but I haven't really got that working. It
    seems to be a bit more complicated than I anticipated because of the
    "only equality conditions" restriction. So the 0007 patch only
    really adds basic syntax and trivial build.

    I do have bunch of ideas/questions about this statistics type. For
    example, should we compute just a single coefficient or the exact
    combination of columns specified in CREATE STATISTICS, or perhaps
    for some additional subsets? I.e. with

      CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);

    should we compute just the coefficient for (a,b,c), or maybe also
    for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such
    combinations, but perhaps it's acceptable.

    Having the coefficient for just the single combination specified in
    CREATE STATISTICS makes the estimation difficult when some of the
    columns are not specified. For example, with coefficient just for
    (a,b,c), what should happen for (WHERE a=1 AND b=2)?

    Should we simply ignore the statistics, or apply it anyway and
    somehow compensate for the missing columns?


I've also started working on something like a paper, hopefully
explaining the ideas and implementation more clearly and consistently
than possible on a mailing list (thanks to charts, figures and such).
It's available here (both the .tex source and .pdf with the current
version):

     https://bitbucket.org/tvondra/mvstats-paper/src

It's not exactly short (~30 pages), and it's certainly incomplete with a
plenty of TODO notes, but hopefully it's already useful and not entirely
bogus.

Comments and questions are welcome - both to the patch and paper.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Parallel pg_dump's error reporting doesn't work worth squat
Next
From: Robert Haas
Date:
Subject: Re: parallel joins, and better parallel explain