Thread: WIP: multivariate statistics / proof of concept

WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hi,

attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.

I also cut some corners to make it work, and those limitations need to
be fixed before the eventual commit (those are not difficult problems,
but were not necessary for a proof-of-concept patch).

It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.

I expect to be busy over the next two weeks because of travel, so sorry
for somewhat delayed responses. If you happen to attend pgconf.eu next
week (Oct 20-24), we can of course discuss this patch in person.


Goals and basics
----------------

The goal of this patch is to allow users to define multivariate
statistics (i.e. statistics on multiple columns), and to improve
estimation when the columns are correlated.

Take for example a table like this:

    CREATE TABLE test (a INT, b INT, c INT);
    INSERT INTO test SELECT i/10000, i/10000, i/10000
                       FROM generate_series(1,1000000) s(i);
    ANALYZE test;

and do a query like this:

    SELECT * FROM test WHERE (a = 10) AND (b = 10) AND (c = 10);

which is estimated like this:

                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=1 width=12)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
 Planning time: 0.142 ms
(3 rows)

The query of course returns 10,000 rows, but the planner assumes the
columns are independent and thus multiplies the selectivities: 1/100
for each column means 1/1,000,000 in total, which is 1 row.
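
For reference (nothing patch-specific here), the per-column statistics
the planner works from can be inspected via the standard pg_stats view -
each column on its own looks like roughly a 1-in-100 filter:

    SELECT attname, n_distinct
      FROM pg_stats
     WHERE tablename = 'test' AND attname IN ('a', 'b', 'c');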

This example is of course somewhat artificial, but the problem is far
from uncommon, especially in denormalized datasets (e.g. star schemas).
If you ever got an index scan instead of a sequential scan due to a poor
estimate, resulting in a query running for hours instead of seconds, you
know the pain.

The patch allows you to do this:

    ALTER TABLE test ADD STATISTICS ON (a, b, c);
    ANALYZE test;

which then results in this estimate:

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=9667 width=12)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
 Planning time: 0.110 ms
(3 rows)

This however is not free - both building such statistics (during
ANALYZE) and using them (during planning) costs some cycles. Even if we
optimize the hell out of it, it won't be entirely free.

One of the design goals of this patch is not to make ANALYZE or
planning more expensive unless you actually add such statistics.

Those who add such statistics have probably decided that the price is
worth the improved estimates and the lower risk of inefficient plans. If
planning takes a few more milliseconds, that's probably worth it when
the alternative is queries running for minutes or hours because of
misestimates.

It also does not guarantee that the estimates will always be better.
There will be misestimates, although usually in the other direction (the
independence assumption typically leads to underestimates, this may lead
to overestimates). However, based on my experience from writing the
patch, I believe it's possible to reasonably limit the extent of such
errors (just like with single-column histograms, it's related to the
bucket size).

Of course, there will be cases when the old approach is lucky by
accident - there's not much we can do to beat luck. And we can't rely on
it either.


Design overview
---------------

The patch adds a new system catalog, called pg_mv_statistic, which is
used to keep track of requested statistics. There's also a pg_mv_stats
view, showing some basic info about the stats (not all the data).

There are three kinds of statistics:

  - list of most common combinations of values (MCV list)
  - multi-dimensional histogram
  - associative rules

The first two are extensions of the single-column stats we already have.
The MCV list is a trivial extension to multiple dimensions, just
tracking combinations of values and their frequencies. The histogram is
more complex - the structure itself is quite simple (multi-dimensional
rectangles), but there are many ways to build it. Even the current
naive and simple implementation seems to work quite well.

The last kind (associative rules) is an attempt to track "implications"
between columns. It is however an experiment and it's not really used in
the patch so I'll ignore it for now.

I'm not going to explain all the implementation details here - if you
want to learn more, the best way is probably by reading the changes in
those files (probably in this order):

    src/include/utils/mvstats.h
    src/backend/commands/analyze.c
    src/backend/optimizer/path/clausesel.c

I tried to explain the ideas thoroughly in the comments, along with a
lot of TODO/FIXME items related to limitations, explained in the next
section.


Limitations
-----------

As I mentioned, the current patch has a number of practical limitations,
most importantly:

  (a) only data types passed by value (no varlena types)
  (b) only data types with a sort operator (needed to build histograms)
  (c) no support for NULL values
  (d) no handling of DROP COLUMN, DROP TABLE and such
  (e) limited to stats on at most 8 columns
  (f) the optimizer uses a single statistics object per table
  (g) limited list of compatible WHERE clauses
  (h) incomplete ADD STATISTICS syntax

The first three conditions are really a shortcut to a working patch, and
fixing them should not be difficult.

The limited number of columns is really just a sanity check. It's
possible to increase it, but I doubt stats on more columns will be
practical because of excessive size or poor accuracy.

A better approach is to support combining multiple stats, defined on
various subsets of columns. This is not implemented at the moment, but
it's certainly on the roadmap. Currently, the "smallest" stats covering
the most columns are selected.
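
For illustration, with the proposed syntax one might define stats on
overlapping subsets of columns like this (just a sketch - since
combining is not implemented, a query on (a, b) would currently use only
one of them):

    ALTER TABLE test ADD STATISTICS ON (a, b);
    ALTER TABLE test ADD STATISTICS ON (a, b, c);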

Regarding the compatible WHERE clauses, the patch currently handles
conditions of the form

    column OPERATOR constant

where OPERATOR is one of the comparison operators (=, <, >, <=, >=). In
the future it's possible to add support for more conditions, e.g.
"column IS NULL" or "column OPERATOR column".

The last point is really just "unfinished implementation" - the syntax I
propose is this:

   ALTER TABLE ... ADD STATISTICS (options) ON (columns)

where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what this might do,
but they're not really used at the moment (except for being stored in
the pg_mv_statistic catalog).
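
For example, it might eventually be used like this - note that the
option names below are made up for illustration only, they are not what
the patch actually accepts:

    -- hypothetical option names; options are currently only stored
    -- in pg_mv_statistic, not acted upon
    ALTER TABLE test ADD STATISTICS (mcv_items 1000, histogram_buckets 10000)
                     ON (a, b, c);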



Examples
--------

Let's see a few examples of how to define the stats, and what difference
in estimates it makes:

CREATE TABLE test (a INT, b INT, c INT);

-- same value in all columns
INSERT INTO test SELECT mod(i,100), mod(i,100), mod(i,100)
       FROM generate_series(1,1000000) s(i);

ANALYZE test;

=============== no multivariate stats ============================

SELECT * FROM test WHERE a = 10 AND b = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=101 width=12)
                   (actual time=0.007..60.902 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.119 ms
 Execution time: 61.164 ms
(5 rows)


SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=1 width=12)
                   (actual time=0.010..56.780 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.061 ms
 Execution time: 56.994 ms
(5 rows)


=============== with multivariate stats ===========================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

SELECT * FROM test WHERE a = 10 AND b = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=10767 width=12)
                   (actual time=0.007..58.981 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.114 ms
 Execution time: 59.214 ms
(5 rows)

SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=10767 width=12)
                   (actual time=0.008..61.838 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.088 ms
 Execution time: 62.057 ms
(5 rows)


OK, that was a rather significant improvement, but it's also a trivial
dataset. Let's see something more complicated - the following table has
correlated columns with distributions skewed towards 0.

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT r*MOD(i,50),
                        pow(r,2)*MOD(i,100),
                        pow(r,4)*MOD(i,500)
       FROM (SELECT random() AS r, i
               FROM generate_series(1,1000000) s(i)) foo;
ANALYZE test;


SELECT * FROM test WHERE a = 0 AND b = 0;

=============== no multivariate stats ============================

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=9024 width=12)
                   (actual time=0.007..62.969 rows=49503 loops=1)
   Filter: ((a = 0) AND (b = 0))
   Rows Removed by Filter: 950497
 Planning time: 0.057 ms
 Execution time: 64.098 ms
(5 rows)

SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=2126 width=12)
                   (actual time=0.008..63.862 rows=40770 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 959230
 Planning time: 0.060 ms
 Execution time: 64.794 ms
(5 rows)


=============== with multivariate stats ============================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

db=> SELECT * FROM pg_mv_stats;
schemaname | public
tablename  | test
attnums    | 1 2 3
mcvbytes   | 25904
mcvinfo    | nitems=809
histbytes  | 568240
histinfo   | nbuckets=13772


SELECT * FROM test WHERE a = 0 AND b = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=47717 width=12)
                   (actual time=0.007..61.782 rows=49503 loops=1)
   Filter: ((a = 0) AND (b = 0))
   Rows Removed by Filter: 950497
 Planning time: 3.181 ms
 Execution time: 62.859 ms
(5 rows)


SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=40567 width=12)
                   (actual time=0.009..66.685 rows=40770 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 959230
 Planning time: 0.188 ms
 Execution time: 67.593 ms
(5 rows)


regards
Tomas

Attachment

Re: WIP: multivariate statistics / proof of concept

From
Albe Laurenz
Date:
Tomas Vondra wrote:
> attached is a WIP patch implementing multivariate statistics.

I think that is pretty useful.
Oracle has an identical feature called "extended statistics".

That's probably an entirely different thing, but it would be very
nice to have statistics to estimate the correlation between columns
of different tables, to improve the estimate for the number of rows
in a join.

Yours,
Laurenz Albe

Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hi!

On 13.10.2014 09:36, Albe Laurenz wrote:
> Tomas Vondra wrote:
>> attached is a WIP patch implementing multivariate statistics.
> 
> I think that is pretty useful.
> Oracle has an identical feature called "extended statistics".
> 
> That's probably an entirely different thing, but it would be very 
> nice to have statistics to estimate the correlation between columns 
> of different tables, to improve the estimate for the number of rows 
> in a join.

I don't have a clear idea of how that should work, but from the quick
look at how join selectivity estimation is implemented, I believe two
things might be possible:
(a) using conditional probabilities

    Say we have a join "ta JOIN tb ON (ta.x = tb.y)"

    Currently, the selectivity is derived from stats on the two keys.
    Essentially probabilities P(x), P(y), represented by the MCV lists.
    But if there are additional WHERE conditions on the tables, and we
    have suitable multivariate stats, it's possible to use conditional
    probabilities.

    E.g. if the query actually uses

        ... ta JOIN tb ON (ta.x = tb.y) WHERE ta.z = 10

    and we have stats on (ta.x, ta.z), we can use P(x|z=10) instead.
    If the two columns are correlated, this might be much different.

(b) using this for multi-column conditions

    If the join condition involves multiple columns, e.g.

        ON (ta.x = tb.y AND ta.p = tb.q)

    and we happen to have stats on (ta.x, ta.p) and (tb.y, tb.q), we may
    use this to compute the cardinality (pretty much as we do today).

But I haven't really worked on this so far, I suspect there are various
subtle issues and I certainly don't plan to address this in the first
phase of the patch.

Tomas



Re: WIP: multivariate statistics / proof of concept

From
David Rowley
Date:
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Hi,

attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.


I'm really glad you're working on this. I had been thinking of looking into doing this myself.


The last point is really just "unfinished implementation" - the syntax I
propose is this:

   ALTER TABLE ... ADD STATISTICS (options) ON (columns)

where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what it might do, but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).



I've not really gotten around to looking at the patch yet, but I'm also wondering if it would be simple to include functional statistics too. The pg_mv_statistic name seems to indicate multiple columns, but how about stats on date(datetime_column), or perhaps any non-volatile function? This would help to solve the problem highlighted here http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com . Without giving it too much thought, perhaps any expression that can be indexed should be allowed to have stats? Would that be really difficult to implement in comparison to what you've already done with the patch so far?


I'm quite interested in reviewing your work on this, but it appears that some of your changes are not C89:

 src\backend\commands\analyze.c(3774): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(3774): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown size [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown size [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
 src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size [D:\Postgres\a\postgres.vcxproj]

The compiler I'm using is a bit too stupid to understand the C99 syntax.

I guess you'd need to palloc() these arrays instead in order to comply with the project standards.


I'm going to sign myself up to review this, so probably my first feedback would be the compiling problem.

Regards

David Rowley

 

Re: WIP: multivariate statistics / proof of concept

From
"Tomas Vondra"
Date:
On 29 October 2014 10:41, David Rowley wrote:
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple include allowing functional statistics
> too.
> The pg_mv_statistic name seems to indicate multi columns, but how about
> stats on date(datetime_column), or perhaps any non-volatile function. This
> would help to solve the problem highlighted here
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
> . Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult to
> implement in comparison to what you've already done with the patch so far?

I don't know, but it seems mostly orthogonal to what the patch aims to do.
If we add collecting statistics on expressions (on a single column), then I'd
expect it to be reasonably simple to add this to the multi-column case.

There are features like join stats or range type stats, that are probably
more directly related to the patch (but out of scope for the initial
version).

> I'm quite interested in reviewing your work on this, but it appears that
> some of your changes are not C89:
>
>  src\backend\commands\analyze.c(3774): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> [D:\Postgres\a\postgres.vcxproj]
>
> The compiler I'm using is a bit too stupid to understand the C99 syntax.
>
> I guess you'd need to palloc() these arrays instead in order to comply
> with
> the project standards.
>
> http://www.postgresql.org/docs/devel/static/install-requirements.html
>
> I'm going to sign myself up to review this, so probably my first feedback
> would be the compiling problem.

I'll look into that. The thing is I don't have access to MSVC, so it's a bit
difficult to spot / fix those issues :-(

regards
Tomas




Re: WIP: multivariate statistics / proof of concept

From
Petr Jelinek
Date:
On 29/10/14 10:41, David Rowley wrote:
> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz
>
>     The last point is really just "unfinished implementation" - the syntax I
>     propose is this:
>
>         ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>
>     where the options influence the MCV list and histogram size, etc. The
>     options are recognized and may give you an idea of what it might do, but
>     it's not really used at the moment (except for storing in the
>     pg_mv_statistic catalog).
>
>
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple include allowing functional statistics
> too. The pg_mv_statistic name seems to indicate multi columns, but how
> about stats on date(datetime_column), or perhaps any non-volatile
> function. This would help to solve the problem highlighted here
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
> . Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult
> to implement in comparison to what you've already done with the patch so
> far?
>

I would not over-complicate requirements for the first version of this, 
I think it's already complicated enough.

Quick look at the patch suggests that it mainly needs discussion about 
design and particular implementation choices, there is fair amount of 
TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have 
time to do in depth review in this CF.

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: multivariate statistics / proof of concept

From
"Tomas Vondra"
Date:
On 29 October 2014 12:31, Petr Jelinek wrote:
> On 29/10/14 10:41, David Rowley wrote:
>> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz
>>
>>     The last point is really just "unfinished implementation" - the
>> syntax I
>>     propose is this:
>>
>>         ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>>
>>     where the options influence the MCV list and histogram size, etc.
>> The
>>     options are recognized and may give you an idea of what it might do,
>> but
>>     it's not really used at the moment (except for storing in the
>>     pg_mv_statistic catalog).
>>
>>
>>
>> I've not really gotten around to looking at the patch yet, but I'm also
>> wondering if it would be simple include allowing functional statistics
>> too. The pg_mv_statistic name seems to indicate multi columns, but how
>> about stats on date(datetime_column), or perhaps any non-volatile
>> function. This would help to solve the problem highlighted here
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
>> . Without giving it too much thought, perhaps any expression that can be
>> indexed should be allowed to have stats? Would that be really difficult
>> to implement in comparison to what you've already done with the patch so
>> far?
>>
>
> I would not over-complicate requirements for the first version of this,
> I think it's already complicated enough.

My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").

> Quick look at the patch suggests that it mainly needs discussion about
> design and particular implementation choices, there is fair amount of
> TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have
> time to do in depth review in this CF.

Yes. I think it's a bit premature to discuss the code thoroughly at this
point - I'd like to discuss the general approach to the feature (i.e.
minimizing the impact on those not using it, etc.).

The most interesting part of the code are probably the comments,
explaining the design in more detail, known shortcomings and possible ways
to address them.

regards
Tomas





Re: WIP: multivariate statistics / proof of concept

From
David Rowley
Date:
On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 29 October 2014 12:31, Petr Jelinek wrote:
>> I've not really gotten around to looking at the patch yet, but I'm also
>> wondering if it would be simple include allowing functional statistics
>> too. The pg_mv_statistic name seems to indicate multi columns, but how
>> about stats on date(datetime_column), or perhaps any non-volatile
>> function. This would help to solve the problem highlighted here
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
>> . Without giving it too much thought, perhaps any expression that can be
>> indexed should be allowed to have stats? Would that be really difficult
>> to implement in comparison to what you've already done with the patch so
>> far?
>>
>
> I would not over-complicate requirements for the first version of this,
> I think it's already complicated enough.

My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").


That's fair, but I didn't really mean to imply that you should go work on that too, or that it should be part of this patch.
I was thinking more along the lines that I don't really agree with the table name for the new stats, and that at some later date someone will want to add expression stats, so we'd better come up with a design that is friendly towards that. At this point my only concern is that the name of the table might not suit expression stats well; I'd hate to see someone have to invent a 3rd table to support these when we could likely come up with something that could be extended later and still make sense both today and in the future.

I was just looking at how expression indexes are stored in pg_index, and I see that for an expression index the expression is stored in the indexprs column, which is of type pg_node_tree. So quite possibly, at some point in the future, the new stats table could just have an extra column added; for today, we'd just need to come up with a future-proof name... Perhaps pg_statistic_ext or pg_statisticx, and name the functions and source files something along those lines instead?

Regards

David Rowley

Re: WIP: multivariate statistics / proof of concept

From
David Rowley
Date:
On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 29 October 2014 10:41, David Rowley wrote:
> I'm quite interested in reviewing your work on this, but it appears that
> some of your changes are not C89:
>
>  src\backend\commands\analyze.c(3774): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> [D:\Postgres\a\postgres.vcxproj]
>

I'll look into that. The thing is I don't have access to MSVC, so it's a bit
difficult to spot / fix those issues :-(


It should be a pretty simple fix - just use the files and line numbers from the errors above. The problem is simply that in those 3 places you're declaring an array of a variable size, which is not allowed in C89. The thing to do instead would just be to palloc() the size you need and then pfree() it when you're done.

Regards

David Rowley
 

Re: WIP: multivariate statistics / proof of concept

From
"Tomas Vondra"
Date:
On 30 October 2014 10:17, David Rowley wrote:
> On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> On 29 October 2014 12:31, Petr Jelinek wrote:
>> >> I've not really gotten around to looking at the patch yet, but I'm
>> also
>> >> wondering if it would be simple include allowing functional
>> statistics
>> >> too. The pg_mv_statistic name seems to indicate multi columns, but
>> how
>> >> about stats on date(datetime_column), or perhaps any non-volatile
>> >> function. This would help to solve the problem highlighted here
>> >>
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
>> >> . Without giving it too much thought, perhaps any expression that can
>> be
>> >> indexed should be allowed to have stats? Would that be really
>> difficult
>> >> to implement in comparison to what you've already done with the patch
>> so
>> >> far?
>> >>
>> >
>> > I would not over-complicate requirements for the first version of
>> this,
>> > I think it's already complicated enough.
>>
>> My thoughts, exactly. I'm not willing to put more features into the
>> initial version of the patch. Actually, I'm thinking about ripping out
>> some experimental features (particularly "hashed MCV" and "associative
>> rules").
>>
>>
> That's fair, but I didn't really mean to imply that you should go work on
> that too and that it should be part of this patch..
> I was thinking more along the lines of that I don't really agree with the
> table name for the new stats and that at some later date someone will want
> to add expression stats and we'd probably better come up design that would
> be friendly towards that. At this time I can only think that the name of
> the table might not suit well to expression stats, I'd hate to see someone
> have to invent a 3rd table to support these when we could likely come up
> with something that could be extended later and still make sense both
> today
> and in the future.
>
> I was just looking at how expression indexes are stored in pg_index and I
> see that if it's an expression index that the expression is stored in
> the indexprs column which is of type pg_node_tree, so quite possibly at
> some point in the future the new stats table could just have an extra
> column added, and for today, we'd just need to come up with a future proof
> name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
> source files something along those lines instead?

Ah, OK. I don't think the catalog name "pg_mv_statistic" is
inappropriate for this purpose, though. IMHO "multivariate" does not
mean "only columns" or "no expressions" - it simply describes that the
approximated density function has multiple input variables, be they
attributes or expressions.

But maybe there's a better name.

Tomas




Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
On 30.10.2014 10:23, David Rowley wrote:
> On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
>
>     On 29 October 2014 10:41, David Rowley wrote:
>     > I'm quite interested in reviewing your work on this, but it
>     appears that
>     > some of your changes are not C89:
>     >
>     >  src\backend\commands\analyze.c(3774): error C2057: expected constant
>     > expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
>     > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(3774): error C2133: 'indexes' :
>     unknown
>     > size [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2057: expected constant
>     > expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
>     > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :
>     unknown
>     > size [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2057: expected constant
>     > expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
>     > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2133: 'keys' :
>     unknown size
>     > [D:\Postgres\a\postgres.vcxproj]
>     >
>
> I'll look into that. The thing is I don't have access to MSVC, so
> it's a bit difficult to spot / fix those issues :-(
>
>
> It should be a pretty simple fix, just use the files and line
> numbers from the above. It's just a problem that in those 3 places
> you're declaring an array of a variable size, which is not allowed in
> C89. The thing to do instead would just be to palloc() the size you
> need and the pfree() it when you're done.

Attached is a patch that should fix these issues.

The bad news is there are a few installcheck failures (they were in the
previous patch too, but I hadn't noticed for some reason). Apparently,
there's some mixup in how the patch handles Var->varno in some cases,
causing issues with a handful of regression tests.

The problem is that is_mv_compatible (checking whether the condition is
compatible with multivariate stats) does this

    if (! ((varRelid == 0) || (varRelid == var->varno)))
        return false;

    /* Also skip special varno values, and system attributes ... */
    if ((IS_SPECIAL_VARNO(var->varno)) ||
        (! AttrNumberIsForUserDefinedAttr(var->varattno)))
        return false;

assuming that after this, varno represents an index into the range
table, and passes it out to the caller.

And the caller (collect_mv_attnums) does this:

    RelOptInfo *rel = find_base_rel(root, varno);

which fails with errors like these:

    ERROR:  no relation entry for relid 0
    ERROR:  no relation entry for relid 1880

or whatever. What's even stranger is this:

regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-#    WHERE table_name = 'rw_view1';
ERROR:  no relation entry for relid 0
regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-# ;
regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-#    WHERE table_name = 'rw_view1';
 table_name | is_updatable | is_insertable_into
------------+--------------+--------------------
(0 rows)

regression=# explain  SELECT table_name, is_updatable, is_insertable_into
    FROM information_schema.views
   WHERE table_name = 'rw_view1';
ERROR:  no relation entry for relid 0


So, the query fails. After removing the WHERE clause it works, and this
somehow fixes the original query (with the WHERE clause). Nevertheless,
I still can't do explain on the query.

Clearly, I'm doing something wrong. I suspect it's caused either by
conditions involving function calls, or the fact that the view is a join
of multiple tables. But what?

For simple queries (single table, ...) it seems to be working fine.

regards
Tomas

Attachment

Re: WIP: multivariate statistics / proof of concept

From
Simon Riggs
Date:
On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:

> It however seems to be working sufficiently well at this point, enough
> to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?

If so, I will work with you to review this patch.

One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: multivariate statistics / proof of concept

From
"Tomas Vondra"
Date:
On 13 November 2014 12:31, Simon Riggs wrote:
> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.

> If so, I will work with you to review this patch.

Thanks!

> One aspect of the patch that seems to be missing is a user declaration
> of correlation, just as we have for setting n_distinct. It seems like
> an even easier place to start to just let the user specify the stats
> declaratively. That way we can split the patch into two parts. First,
> allow multi column stats that are user declared. Then add user stats
> collected by ANALYZE. The first part is possibly contentious and thus
> a good initial focus. The second part will have lots of discussion, so
> good to skip for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more susceptible to becoming stale after
adding new data to the table.

Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, they become less and less
useful for improving the estimates.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work is associative rules.
It's simple to specify rules like ("ZIP code" implies "city"), and we could
even do a simple check against the data to see if the rule actually makes
sense (and 'disable' it if not).
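
Just a sketch of such a check, against a hypothetical "addresses" table
with "zip" and "city" columns - count the ZIP codes that map to more
than one city, and disable the rule if that fraction gets too high:

    SELECT count(*) AS zips,
           count(*) FILTER (WHERE cities > 1) AS violating
      FROM (SELECT zip, count(DISTINCT city) AS cities
              FROM addresses
             GROUP BY zip) z;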

But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas




Re: WIP: multivariate statistics / proof of concept

From
Katharina Büchse
Date:
On 13.11.2014 14:11, Tomas Vondra wrote:
> On 13 November 2014 12:31, Simon Riggs wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
> Sure. I planned to do such verification/demonstration anyway, after
> discussing the overall approach.
>
> I planned to give it a try on TPC-DS, but I can start with the TPC-H
> queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
> come from column correlation though - if it's due to some other issues
> (e.g. conditions that are difficult to estimate), this patch can't do
> anything with them. But it's a good start.
>
>> If so, I will work with you to review this patch.
> Thanks!
>
>> One aspect of the patch that seems to be missing is a user declaration
>> of correlation, just as we have for setting n_distinct. It seems like
>> an even easier place to start to just let the user specify the stats
>> declaratively. That way we can split the patch into two parts. First,
>> allow multi column stats that are user declared. Then add user stats
>> collected by ANALYZE. The first part is possibly contentious and thus
>> a good initial focus. The second part will have lots of discussion, so
>> good to skip for a first version.
> I'm not a big fan of this approach, for a number of reasons.
>
> Firstly, it only works for "simple" parameters that are trivial to specify
> (say, Pearson's correlation coefficient), and the patch does not work with
> those at all - it only works with histograms, MCV lists (and might work
> with associative rules in the future). And we certainly can't ask users to
> specify multivariate histograms - because it's very difficult to do, and
> also because complex stats are more susceptible to get stale after adding
> new data to the table.
>
> Secondly, even if we add such "simple" parameters to the patch, we have to
> come up with a  way to apply those parameters to the estimates. The
> problem is that as the parameters get simpler, it's less and less useful
> to compute the stats.
>
> Another question is whether it should support more than 2 columns ...
>
> The only place where I think this might work are the associative rules.
> It's simple to specify rules like ("ZIP code" implies "city") and we could
> even do some simple check against the data to see if it actually makes
> sense (and 'disable' the rule if not).
and even this simple example has its limits - at least in Germany, ZIP 
codes are not unique in rural areas, where several villages share the 
same ZIP code.

I guess there are only a few examples where columns are completely 
functionally dependent without any exceptions.
But of course, if the user gives this information just to optimize the 
statistics, some exceptions don't matter.
If this information should be used for creating different execution 
plans (e.g. there is an index on column A and column B is functionally 
dependent on A - one could think about using the index on A plus the 
dependency instead of scanning the whole table to find all tuples 
that match the query on column B), exceptions become a very important issue.
>
> But maybe I got it wrong and you have something particular in mind? Can
> you give an example of how it would work?
>
> regards
> Tomas
>
>
>


-- 
Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367
Webseite http://users.minet.uni-jena.de/~re89qen/




Re: WIP: multivariate statistics / proof of concept

From
"Tomas Vondra"
Date:
On 13 November 2014 16:51, Katharina Büchse wrote:
> On 13.11.2014 14:11, Tomas Vondra wrote:
>
>> The only place where I think this might work are the associative rules.
>> It's simple to specify rules like ("ZIP code" implies "city") and we
>> could
>> even do some simple check against the data to see if it actually makes
>> sense (and 'disable' the rule if not).
>
> and even this simple example has its limits, at least in Germany ZIP
> codes are not unique for rural areas, where several villages have the
> same ZIP code.
>
> I guess there are just a few examples where columns are completely
> functional dependent without any exceptions.
> But of course, if the user gives this information just for optimization
> the statistics, some exceptions don't matter.
> If this information should be used for creating different execution
> plans (e.g. on column A is an index and column B is functional
> dependent, one could think about using this index on A and the
> dependency instead of running through the whole table to find all tuples
> that fit the query on column B), exceptions are a very important issue.

Yes, exactly. The aim of this patch is "only" improving estimates, not
removing conditions from the plan (e.g. checking only the ZIP code and not
the city name). That certainly can't be done solely based on approximate
statistics, and as you point out most real-world data either contain bugs
or are inherently imperfect (we have the same kind of ZIP/city
inconsistencies in the Czech Republic). That's not a big issue for
estimates (assuming only a small fraction of rows violates the rule),
though.

Tomas




Re: WIP: multivariate statistics / proof of concept

From
Kevin Grittner
Date:
Tomas Vondra <tv@fuzzy.cz> wrote:
> On 13 November 2014 16:51, Katharina Büchse wrote:
>> On 13.11.2014 14:11, Tomas Vondra wrote:
>>
>>> The only place where I think this might work are the associative rules.
>>> It's simple to specify rules like ("ZIP code" implies "city") and we could
>>> even do some simple check against the data to see if it actually makes
>>> sense (and 'disable' the rule if not).
>>
>> and even this simple example has its limits, at least in Germany ZIP
>> codes are not unique for rural areas, where several villages have the
>> same ZIP code.

> as you point out most real-world data either contain bugs
> or are inherently imperfect (we have the same kind of ZIP/city
> inconsistencies in Czech).

You can have lots of fun with U.S. zip codes, too. Just on the
nominally "Madison, Wisconsin" zip codes (those starting with 537),
there are several exceptions:

select zipcode, city, locationtype
from zipcode
where zipcode like '537%'
and Decommisioned = 'false'
and zipcodetype = 'STANDARD'
and locationtype in ('PRIMARY', 'ACCEPTABLE')
order by zipcode, city;

 zipcode |   city    | locationtype
---------+-----------+--------------
 53703   | MADISON   | PRIMARY
 53704   | MADISON   | PRIMARY
 53705   | MADISON   | PRIMARY
 53706   | MADISON   | PRIMARY
 53711   | FITCHBURG | ACCEPTABLE
 53711   | MADISON   | PRIMARY
 53713   | FITCHBURG | ACCEPTABLE
 53713   | MADISON   | PRIMARY
 53713   | MONONA    | ACCEPTABLE
 53714   | MADISON   | PRIMARY
 53714   | MONONA    | ACCEPTABLE
 53715   | MADISON   | PRIMARY
 53716   | MADISON   | PRIMARY
 53716   | MONONA    | ACCEPTABLE
 53717   | MADISON   | PRIMARY
 53718   | MADISON   | PRIMARY
 53719   | FITCHBURG | ACCEPTABLE
 53719   | MADISON   | PRIMARY
 53725   | MADISON   | PRIMARY
 53726   | MADISON   | PRIMARY
 53744   | MADISON   | PRIMARY
(21 rows)

If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".

If you want to have your own fun with this data, you can download
it here:

http://federalgovernmentzipcodes.us/free-zipcode-database.csv

I was able to load it into PostgreSQL with this:

create table zipcode
(
recordnumber integer not null,
zipcode text not null,
zipcodetype text not null,
city text not null,
state text not null,
locationtype text not null,
lat double precision,
long double precision,
xaxis double precision not null,
yaxis double precision not null,
zaxis double precision not null,
worldregion text not null,
country text not null,
locationtext text,
location text,
decommisioned text not null,
taxreturnsfiled bigint,
estimatedpopulation bigint,
totalwages bigint,
notes text
);
comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)';
comment on column zipcode.city is 'offical city name(s)';
comment on column zipcode.state is 'offical state, territory, or quasi-state (AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current Zipcode;If not Primary, Yes implies Historical Placename';
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008';
copy zipcode from 'filepath' with (format csv, header);
alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);

I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables.  With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
On 15.11.2014 18:49, Kevin Grittner wrote:
> If you eliminate the quals besides the zipcode column you get 61
> rows and it gets much stranger, with legal municipalities that are
> completely surrounded by Madison that the postal service would
> rather you didn't use in addressing your envelopes, but they have
> to deliver to anyway, and organizations inside Madison receiving
> enough mail to (literally) have their own zip code -- where the
> postal service allows the organization name as a deliverable
> "city".
> 
> If you want to have your own fun with this data, you can download
> it here:
> 
> http://federalgovernmentzipcodes.us/free-zipcode-database.csv
>
...
> 
> I bet there are all sorts of correlation possibilities with, for
> example, latitude and longitude and other variables.  With 81831
> rows and so many correlations among the columns, it might be a
> useful data set to test with.

Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.

The current version of the patch supports only data types passed by
value (i.e. no varlena types such as text), which means it's impossible
to build multivariate stats on some of the interesting columns (state,
city, ...).

I guess it's time to start working on removing this limitation.

Tomas



Re: WIP: multivariate statistics / proof of concept

From
Michael Paquier
Date:
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Thanks for the link. I've been looking for a good dataset with such
> data, and this one is by far the best one.
>
> The current version of the patch supports only data types passed by
> value (i.e. no varlena types - text, ), which means it's impossible to
> build multivariate stats on some of the interesting columns (state,
> city, ...).
>
> I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a "Needs
Review" state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Regards,
-- 
Michael



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
On 8.12.2014 02:01, Michael Paquier wrote:
> On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Thanks for the link. I've been looking for a good dataset with such
>> data, and this one is by far the best one.
>>
>> The current version of the patch supports only data types passed by
>> value (i.e. no varlena types - text, ), which means it's impossible to
>> build multivariate stats on some of the interesting columns (state,
>> city, ...).
>>
>> I guess it's time to start working on removing this limitation.
> Tomas, what's your status on this patch? Are you planning to make it
> more complicated than it is? For now I have switched it to a "Needs
> Review" state because even your first version did not get advanced
> review (that's quite big btw). I guess that we should switch it to the
> next CF.

Hello Michael,

I agree with moving the patch to the next CF - I'm working on the patch,
but I will take a bit more time to submit a new version and I can do
that in the next CF.

regards
Tomas



Re: WIP: multivariate statistics / proof of concept

From
Heikki Linnakangas
Date:
On 10/13/2014 01:00 AM, Tomas Vondra wrote:
> Hi,
>
> attached is a WIP patch implementing multivariate statistics.

Great! Really glad to see you working on this.

> +     * FIXME This sample sizing is mostly OK when computing stats for
> +     *       individual columns, but when computing multi-variate stats
> +     *       for multivariate stats (histograms, mcv, ...) it's rather
> +     *       insufficient. For small number of dimensions it works, but
> +     *       for complex stats it'd be nice use sample proportional to
> +     *       the table (say, 0.5% - 1%) instead of a fixed size.

I don't think a fraction of the table is appropriate. As long as the 
sample is random, the accuracy of a sample doesn't depend much on the 
size of the population. For example, if you sample 1,000 rows from a 
table with 100,000 rows, or 1000 rows from a table with 100,000,000 
rows, the accuracy is pretty much the same. That doesn't change when you 
go from a single variable to multiple variables.

You do need a bigger sample with multiple variables, however. My gut 
feeling is that if you sample N rows for a single variable, with two 
variables you need to sample N^2 rows to get the same accuracy. But it's 
not proportional to the table size. (I have no proof for that, but I'm 
sure there is literature on this.)

> + * Multivariate histograms
> + *
> + * Histograms are a collection of buckets, represented by n-dimensional
> + * rectangles. Each rectangle is delimited by an array of lower and
> + * upper boundaries, so that for for the i-th attribute
> + *
> + *     min[i] <= value[i] <= max[i]
> + *
> + * Each bucket tracks frequency (fraction of tuples it contains),
> + * information about the inequalities, number of distinct values in
> + * each dimension (which is used when building the histogram) etc.
> + *
> + * The boundaries may be either inclusive or exclusive, or the whole
> + * dimension may be NULL.
> + *
> + * The buckets may overlap (assuming the build algorithm keeps the
> + * frequencies additive) or may not cover the whole space (i.e. allow
> + * gaps). This entirely depends on the algorithm used to build the
> + * histogram.

That sounds pretty exotic. These buckets are quite different from the 
single-dimension buckets we currently have.

The paper you reference in partition_bucket() function, M. 
Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
1988: 28-36, actually doesn't mention overlapping buckets at all. I 
haven't read the code in detail, but if it implements the algorithm from 
that paper, there will be no overlap.

- Heikki



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
On 11.12.2014 17:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
> 
> Great! Really glad to see you working on this.
> 
>> +     * FIXME This sample sizing is mostly OK when computing stats for
>> +     *       individual columns, but when computing multi-variate stats
>> +     *       for multivariate stats (histograms, mcv, ...) it's rather
>> +     *       insufficient. For small number of dimensions it works, but
>> +     *       for complex stats it'd be nice use sample proportional to
>> +     *       the table (say, 0.5% - 1%) instead of a fixed size.
> 
> I don't think a fraction of the table is appropriate. As long as the 
> sample is random, the accuracy of a sample doesn't depend much on
> the size of the population. For example, if you sample 1,000 rows
> from a table with 100,000 rows, or 1000 rows from a table with
> 100,000,000 rows, the accuracy is pretty much the same. That doesn't
> change when you go from a single variable to multiple variables.

I might be wrong, but I doubt that. First, I read a number of papers
while working on this patch, and all of them used samples proportional
to the data set. That's only indirect evidence, though.

> You do need a bigger sample with multiple variables, however. My gut 
> feeling is that if you sample N rows for a single variable, with two 
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

Maybe. I think it's related to the number of buckets (which largely
determines the precision of the histogram). If you want 1000
buckets, the number of rows scanned needs to be e.g. 10x that. With
multi-variate histograms, we may shoot for more buckets (say, 100 in
each dimension).

> 
>> + * Multivariate histograms
>> + *
>> + * Histograms are a collection of buckets, represented by n-dimensional
>> + * rectangles. Each rectangle is delimited by an array of lower and
>> + * upper boundaries, so that for for the i-th attribute
>> + *
>> + *     min[i] <= value[i] <= max[i]
>> + *
>> + * Each bucket tracks frequency (fraction of tuples it contains),
>> + * information about the inequalities, number of distinct values in
>> + * each dimension (which is used when building the histogram) etc.
>> + *
>> + * The boundaries may be either inclusive or exclusive, or the whole
>> + * dimension may be NULL.
>> + *
>> + * The buckets may overlap (assuming the build algorithm keeps the
>> + * frequencies additive) or may not cover the whole space (i.e. allow
>> + * gaps). This entirely depends on the algorithm used to build the
>> + * histogram.
> 
> That sounds pretty exotic. These buckets are quite different from
> the single-dimension buckets we currently have.
> 
> The paper you reference in partition_bucket() function, M. 
> Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
> Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
> 1988: 28-36, actually doesn't mention overlapping buckets at all. I 
> haven't read the code in detail, but if it implements the algorithm
> from that paper, there will be no overlap.

The algorithm implemented in partition_bucket() is very simple and
naive, and it mostly resembles the algorithm described in the paper. I'm
sure there are differences, it's not a 1:1 implementation, but you're
right it produces non-overlapping buckets.

The point is that I envision more complex algorithms or different
histogram types, and some of them may produce overlapping buckets. Maybe
that's a premature comment, and it will turn out it's not really necessary.

regards
Tomas



Re: WIP: multivariate statistics / proof of concept

From
Michael Paquier
Date:
On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> I agree with moving the patch to the next CF - I'm working on the patch,
> but I will take a bit more time to submit a new version and I can do
> that in the next CF.
OK cool. I just moved it myself. I didn't see it registered in 2014-12 yet.
Thanks,
-- 
Michael



Re: WIP: multivariate statistics / proof of concept

From
Michael Paquier
Date:
On Mon, Dec 15, 2014 at 11:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I agree with moving the patch to the next CF - I'm working on the patch,
>> but I will take a bit more time to submit a new version and I can do
>> that in the next CF.
> OK cool. I just moved it by myself. I didn't see it yet registered in 2014-12.
Marked as returned with feedback. No new version showed up in the last
month and this patch was waiting for input from author.
-- 
Michael



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hi,

attached is an updated version of the multivariate stats patch. This is
going to be a bit longer mail, so I'll put here a small ToC ;-)

1) patch split into 4 parts
2) where to start / documentation
3) state of the code
4) main changes/improvements
5) remaining limitations

The motivation and design ideas, explained in the first message of this
thread are still valid. It might be a good idea to read it first:

  http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz

BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
to the patch in person, or discuss the patch in general.


1) Patch split into 4 parts
---------------------------
Firstly, the patch got broken into the following four pieces, to make
the reviews somewhat easier:

1) 0001-shared-infrastructure-and-functional-dependencies.patch

   - infrastructure, shared by all the kinds of stats added
     in the following patches (catalog, ALTER TABLE, ANALYZE ...)

   - implementation of a simple statistics, tracking functional
     dependencies between columns (previously called "associative
     rules", but that's incorrect for several reasons)

   - this does not modify the optimizer in any way

2) 0002-clause-reduction-using-functional-dependencies.patch

   - applies the functional dependencies to optimizer (i.e. considers
     the rules in clauselist_selectivity())

3) 0003-multivariate-MCV-lists.patch

   - multivariate MCV lists (both ANALYZE and optimizer parts)

4) 0004-multivariate-histograms.patch

   - multivariate histograms (both ANALYZE and optimizer parts)


You may look at the patches at github here:

  https://github.com/tvondra/postgres/tree/multivariate-stats-squashed

The branch is not stable, i.e. I'll rebase / squash / force-push changes
in the future. (There's also multivariate-stats development branch with
unsquashed changes, but you don't want to look at that, trust me.)

The patches are not exactly small (being in the 50-100 kB range), but
that's mostly because of the amount of comments explaining the goals and
implementation details.


2) Where to start / documentation
---------------------------------
I strived to document all the pieces properly, mostly in the form of
comments. There's no sgml documentation at this point, which should
obviously change in the future.

Anyway, I'd suggest reading the first e-mail in this thread, explaining
the ideas, and then these comments:

1) functional dependencies (patch 0001)
   - src/backend/utils/mvstats/dependencies.c

2) MCV lists (patch 0003)
   - src/backend/utils/mvstats/mcv.c

3) histograms (patch 0004)
   - src/backend/utils/mvstats/mcv.c

   - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
   - also see clauselist_mv_selectivity_histogram() in clausesel.c

4) selectivity estimation (patches 0002-0004)
   - all in src/backend/optimizer/path/clausesel.c
   - clauselist_selectivity() - overview of how the stats are applied
   - clauselist_apply_dependencies() - functional dependencies reduction
   - clauselist_mv_selectivity_mcvlist() - MCV list estimation
   - clauselist_mv_selectivity_histogram() - histogram estimation


3) State of the code
--------------------
I've spent a fair amount of time testing the patches, and while I
believe there are no segfaults or so, I know parts of the code need a
bit more love.

The part most in need of improvements / comments is probably the code in
clausesel.c - that seems a bit quirky. Reviews / comments regarding this
part of the code are very welcome - I'm sure there are many ways to
improve this part.

There are a few FIXMEs elsewhere (e.g. about memory allocation in the
(de)serialization code), but those are mostly well-defined issues that I
know how to address (at least I believe so).


4) Main changes/improvements
----------------------------
There are many significant improvements. The previous patch version was
in the 'proof of concept' category (missing pieces, knowingly broken in
some areas), the current patch should 'mostly work'.

The patch fixes the most annoying limitations of the first version:

  (a) support for all data types (not just those passed by value)
  (b) handles NULL values properly
  (c) adds support for IS [NOT] NULL clauses

Aside from that the code was significantly improved, there are proper
regression tests and plenty of comments explaining the details.


5) Remaining limitations
------------------------

  (a) limited to stats on 8 columns

      This is mostly just a 'safeguard' restriction.

  (b) only data types with '<' operator

      I don't think this will change anytime soon, because all the
      algorithms for building the stats rely on this. I don't see
      this as a serious limitation though.

  (c) not handling DROP COLUMN or DROP TABLE and so on

      Currently this is not handled at all (so the regression tests
      do an explicit DELETE from the pg_mv_statistic catalog).

      Handling the DROP TABLE won't be difficult, it's similar to the
      current stats. Handling ALTER TABLE ... DROP COLUMN will be much
      more tricky I guess - should we drop all the stats referencing
      that column, or should we just remove it from the stats? Or
      should we keep it and treat it as NULL? Not sure what's the best
      solution.

  (d) limited list of compatible WHERE clauses

      The initial patch handled only simple operator clauses

          (Var op Constant)

      where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
      handles IS [NOT] NULL clauses (see the examples after this list).
      Adding more clause types should not be overly difficult - starting
      with more traditional
      'BooleanTest' conditions, or even multi-column conditions
          (Var op Var)

      which are difficult to estimate using simple-column stats.

  (e) optimizer uses single stats per table

      This is still true and I don't think this will change soon. I do
      have some ideas on how to merge multiple stats etc., but it's
      certainly complex stuff, unlikely to happen within this CF. The
      patch makes a lot of sense even without this particular feature,
      because you can create multiple stats, each suitable for different
      queries.

  (f) no JOIN conditions

      Similarly to the previous point, it's on the TODO but it's not
      going to happen in this CF.
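
To make point (d) above concrete, here are a couple of illustrative
queries, using a hypothetical table t(a int, b int, c int) like the ones
used elsewhere in this thread:

    -- compatible clauses: (Var op Const) with <, <=, =, >=, >
    -- plus IS [NOT] NULL
    SELECT * FROM t WHERE (a = 10) AND (b <= 20) AND (c IS NOT NULL);

    -- not handled yet: (Var op Var) conditions
    SELECT * FROM t WHERE (a = b) AND (b < c);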


kind regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: multivariate statistics / proof of concept

From
Kyotaro HORIGUCHI
Date:
Hello,


Patch 0001 needs changes for OIDs since my patch was
committed. The attached is compatible with current master.

And I tried this like this, and got the following error on
analyze. But unfortunately I don't have enough time to
investigate it now.

postgres=# create table t1 (a int, b int, c int);
insert into t1 (select a/ 10000, a / 10000, a / 10000 from generate_series(0, 99999) a);
postgres=# analyze t1;
ERROR:  invalid memory alloc request size 1485176862

regards,


At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<54C3FED3.1060600@2ndquadrant.com>
> Hi,
> 
> attached is an updated version of the multivariate stats patch. This is
> going to be a bit longer mail, so I'll put here a small ToC ;-)
> 
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
> 
> The motivation and design ideas, explained in the first message of this
> thread are still valid. It might be a good idea to read it first:
> 
>   http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz
> 
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
> into the patch in person, or discuss the patch in general.
> 
> 
> 1) Patch split into 4 parts
> ---------------------------
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
> 
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
> 
>    - infrastructure, shared by all the kinds of stats added
>      in the following patches (catalog, ALTER TABLE, ANALYZE ...)
> 
>    - implementation of a simple statistics, tracking functional
>      dependencies between columns (previously called "associative
>      rules", but that's incorrect for several reasons)
> 
>    - this does not modify the optimizer in any way
> 2) 0002-clause-reduction-using-functional-dependencies.patch
> 
>    - applies the functional dependencies to optimizer (i.e. considers
>      the rules in clauselist_selectivity())
> 
> 3) 0003-multivariate-MCV-lists.patch
> 
>    - multivariate MCV lists (both ANALYZE and optimizer parts)
> 
> 4) 0004-multivariate-histograms.patch
> 
>    - multivariate histograms (both ANALYZE and optimizer parts)
> 
> 
> You may look at the patches at github here:
> 
>   https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
> 
> The branch is not stable, i.e. I'll rebase / squash / force-push changes
> in the future. (There's also multivariate-stats development branch with
> unsquashed changes, but you don't want to look at that, trust me.)
> 
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals and
> implementation details.
> 
> 
> 2) Where to start / documentation
> ---------------------------------
> I strived to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
> 
> Anyway, I'd suggest reading the first e-mail in this thread, explaining
> the ideas, and then these comments:
> 
> 1) functional dependencies (patch 0001)
>    - src/backend/utils/mvstats/dependencies.c
> 
> 2) MCV lists (patch 0003)
>    - src/backend/utils/mvstats/mcv.c
> 
> 3) histograms (patch 0004)
>    - src/backend/utils/mvstats/mcv.c
> 
>    - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
>    - also see clauselist_mv_selectivity_histogram() in clausesel.c
> 
> 4) selectivity estimation (patches 0002-0004)
>    - all in src/backend/optimizer/path/clausesel.c
>    - clauselist_selectivity() - overview of how the stats are applied
>    - clauselist_apply_dependencies() - functional dependencies reduction
>    - clauselist_mv_selectivity_mcvlist() - MCV list estimation
>    - clauselist_mv_selectivity_histogram() - histogram estimation
> 
> 
> 3) State of the code
> --------------------
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or so, I know parts of the code need a
> bit more love.
> 
> The part most in need of improvements / comments is probably the code in
> clausesel.c - that seems a bit quirky. Reviews / comments regarding this
> part of the code are very welcome - I'm sure there are many ways to
> improve this part.
> 
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that I
> know how to address (at least I believe so).
> 
> 
> 4) Main changes/improvements
> ----------------------------
> There are many significant improvements. The previous patch version was
> in the 'proof of concept' category (missing pieces, knowingly broken in
> some areas), the current patch should 'mostly work'.
> 
> The patch fixes two most annoying limitations of the first version:
> 
>   (a) support for all data types (not just those passed by value)
>   (b) handles NULL values properly
>   (c) adds support for IS [NOT] NULL clauses
> 
> Aside from that the code was significantly improved, there are proper
> regression tests and plenty of comments explaining the details.
> 
> 
> 5) Remaining limitations
> ------------------------
> 
>   (a) limited to stats on 8 columns
> 
>       This is mostly just a 'safeguard' restriction.
> 
>   (b) only data types with '<' operator
> 
>       I don't think this will change anytime soon, because all the
>       algorithms for building the stats rely on this. I don't see
>       this as a serious limitation though.
> 
>   (c) not handling DROP COLUMN or DROP TABLE and so on
> 
>       Currently this is not handled at all (so the regression tests
>       do an explicit DELETE from the pg_mv_statistic catalog).
> 
>       Handling the DROP TABLE won't be difficult, it's similar to the
>       current stats. Handling ALTER TABLE ... DROP COLUMN will be much
>       more tricky I guess - should we drop all the stats referencing
>       that column, or should we just remove it from the stats? Or
>       should we keep it and treat it as NULL? Not sure what's the best
>       solution.
> 
>   (d) limited list of compatible WHERE clauses
> 
>       The initial patch handled only simple operator clauses
> 
>           (Var op Constant)
> 
>       where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
>       handles IS [NOT] NULL clauses. Adding more clause types should
>       not  be overly difficult - starting with more traditional
>       'BooleanTest' conditions, or even multi-column conditions
>           (Var op Var)
> 
>       which are difficult to estimate using simple-column stats.
> 
>   (e) optimizer uses single stats per table
> 
>       This is still true and I don't think this will change soon. i do
>       have some ideas on how to merge multiple stats etc. but it's
>       certainly complex stuff, unlikely to happen within this CF. The
>       patch makes a lot of sense even without this particular feature,
>       because you can create multiple stats, each suitable for different
>       queries.
> 
>   (f) no JOIN conditions
> 
>       Similarly to the previous point, it's on the TODO but it's not
>       going to happen in this CF.
> 
> 
> kind regards
> 
> -- 
> Tomas Vondra                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hello,

On 20.3.2015 09:33, Kyotaro HORIGUCHI wrote:
> Hello,
> 
> 
> Patch 0001 needs changes for OIDs since my patch was
> committed. The attached is compatible with current master.

Thanks. I plan to submit a new version of the patch in a few days, with
significant progress in various directions. I'll have to rebase to
current master before submitting the new version anyway (which includes
fixing duplicate OIDs).

> And I tried this like this, and got the following error on
> analyze. But unfortunately I don't have enough time to
> investigate it now.
> 
> postgres=# create table t1 (a int, b int, c int);
> insert into t1 (select a/ 10000, a / 10000, a / 10000 from
> generate_series(0, 99999) a);
> postgres=# analyze t1;
> ERROR:  invalid memory alloc request size 1485176862

Interesting - particularly because this does not involve any
multivariate stats. I can't reproduce it with the current version of the
patch, so either it's unrelated, or I've fixed it since posting the last
version.

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: multivariate statistics / proof of concept

From
Kyotaro HORIGUCHI
Date:
Hello,

> > Patch 0001 needs changes for OIDs since my patch was
> > committed. The attached is compatible with current master.
> 
> Thanks. I plan to submit a new version of the patch in a few days, with
> significant progress in various directions. I'll have to rebase to
> current master before submitting the new version anyway (which includes
> fixing duplicate OIDs).
> 
> > And I tried this like this, and got the following error on
> > analyze. But unfortunately I don't have enough time to
> > investigate it now.
> > 
> > postgres=# create table t1 (a int, b int, c int);
> > insert into t1 (select a/ 10000, a / 10000, a / 10000 from
> > generate_series(0, 99999) a);
> > postgres=# analyze t1;
> > ERROR:  invalid memory alloc request size 1485176862
> 
> Interesting - particularly because this does not involve any
> multivariate stats. I can't reproduce it with the current version of the
> patch, so either it's unrelated, or I've fixed it since posting the last
> version.

Sorry, not shown above: the *previous* t1 had had "alter table t1
add statistics (a, b, c)" run on it. Dropping t1 didn't remove the
setting; re-initing the cluster let me do that without error.

The full steps were as follows.
===
create table t1 (a int, b int, c int);
alter table t1 add statistics (histogram) on (a, b, c);
drop table t1;  -- This does not remove the above setting.
create table t1 (a int, b int, c int);
insert into t1 (select a/ 10000, a / 10000, a / 10000 from generate_series(0, 99999) a);
regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hello,

On 03/24/15 06:34, Kyotaro HORIGUCHI wrote:
>
> Sorry, not shown above, the *previous* t1 had been done "alter table
> t1 add statistics (a, b, c)". Removing t1 didn't remove the setting.
> reiniting cluster let me do that without error.

OK, thanks. My guess is this issue got already fixed in my working copy, 
but I will double-check that.

Admittedly, the management of the stats (e.g. removing stats when the 
table is dropped) is one of the incomplete parts. You have to delete the 
rows manually from pg_mv_statistic.
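
For reference, the manual cleanup is just a DELETE against that catalog,
something along these lines (the column referencing the table is a guess
here - use whatever pg_mv_statistic actually defines):

    -- remove rows orphaned by DROP TABLE (assumes a 'starelid' column)
    DELETE FROM pg_mv_statistic
     WHERE starelid NOT IN (SELECT oid FROM pg_class);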

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hello,

attached is a new version of the patch series, fixing various issues
(crashes, memory leaks). The patches are rebased to current master, and
I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

The main changes in the patches (requiring plenty of changes in the
other parts) are about these:


(1) combining multiple statistics on a table
--------------------------------------------

In the previous version of the patch, it was only possible to use a
single statistics on a table - when there was a statistics "covering"
all the conditions it worked fine, but that's not always the case.

The new patch is able to combine multiple statistics by decomposing the
probability (=selectivity) into conditional probabilities. Imagine
estimating selectivity of clauses

   WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for
example like this:

   P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)

where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the
second may be simplified like this:

   P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)

using the assumption "no multivariate stats => independent". Both these
probabilities match the existing statistics.

The idea is described a bit more in the part #5 of the patch.
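
To make this concrete, here is a sketch using the ALTER TABLE ... ADD
STATISTICS syntax mentioned elsewhere in this thread (the table and data
are made up purely for illustration):

    CREATE TABLE t (a int, b int, c int, d int);
    ALTER TABLE t ADD STATISTICS (a, b, c);
    ALTER TABLE t ADD STATISTICS (b, c, d);
    ANALYZE t;

    -- no single statistics covers all four clauses, so the planner may
    -- estimate P(a=1,b=1,c=1) from the first one and P(d=1|b=1,c=1)
    -- from the second one
    SELECT * FROM t WHERE (a = 1) AND (b = 1) AND (c = 1) AND (d = 1);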


(2) choosing the best combination of statistics
-----------------------------------------------

There may be more statistics on a table, and multiple possible ways to
use them to estimate the clauses (different ordering, overlapping
statistics, etc.).

The patch formulates this as an optimization task with two goals.

   (a) cover as many clauses as possible
   (b) reuse as many conditions (i.e. dependencies) as possible

and implements two algorithms to solve this: (a) exhaustive, walking
through all possible states (using dynamic programming), and (b) greedy,
choosing the best local solution in each step.

The time requirements for the exhaustive solution grow pretty quickly
with the number of clauses and statistics on a table (~ O(N!)). The
greedy is much faster, as it's ~O(N) and in fact much more time is spent
in actually processing the selected statistics (walking through the
histograms etc.).

I assume the exhaustive search may find a better solution in some cases
(that the greedy algorithm misses), but so far I've been unable to come
up with such example.

To make this easier to test, I've added a GUC to switch between these
algorithms (set to 'greedy' by default):

    mvstat_search = {'greedy', 'exhaustive'}

I assume this GUC will be removed eventually, after we figure out which
algorithm is the right one.
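
For example, to compare the plans produced by the two algorithms on a
query like the (made-up) one above:

    SET mvstat_search = 'exhaustive';
    EXPLAIN SELECT * FROM t WHERE (a = 1) AND (b = 1) AND (c = 1) AND (d = 1);

    SET mvstat_search = 'greedy';
    EXPLAIN SELECT * FROM t WHERE (a = 1) AND (b = 1) AND (c = 1) AND (d = 1);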


(3) estimation of more complex conditions (AND/OR clauses)
----------------------------------------------------------

I've added ability to estimate more complex clauses - combinations of
AND/OR clauses and such. It's somewhat incomplete at the moment, but
hopefully the ideas will be clear from the TODOs/FIXMEs along the way.
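
As an example of the kind of clause this part targets (again just an
illustration):

    SELECT * FROM t WHERE ((a = 1) OR (a = 2)) AND (b = 1) AND (c < 5);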

Let me know if you have any questions about this version of the patch,
or about the ideas it implements in general.

I also welcome real-world examples of poorly estimated queries, so that
I can test whether these patches improve those particular cases.


regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: multivariate statistics / proof of concept

From
Jeff Janes
Date:
On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hello,

attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks). The patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

Hi Tomas,

I get cascading conflicts in pg_proc.h.  It looked easy enough to fix, except then I get compiler errors:

funcapi.c: In function 'get_func_trftypes':
funcapi.c:890: warning: unused variable 'procStruct'
utils/fmgrtab.o:(.rodata+0x10cf8): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d18): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d38): undefined reference to `_null_'
utils/fmgrtab.o:(.rodata+0x10d58): undefined reference to `_null_'
collect2: ld returned 1 exit status
make[2]: *** [postgres] Error 1
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2
make: *** Waiting for unfinished jobs....
make: *** [temp-install] Error 2


Cheers,

Jeff

Re: WIP: multivariate statistics / proof of concept

From
Stephen Frost
Date:
* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > attached is a new version of the patch series. Aside from fixing various
> > issues (crashes, memory leaks). The patches are rebased to current
> > master, and I also attach a few SQL scripts I used for testing (nothing
> > fancy, just stress-testing all the parts the patch touches).
>
> I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc.  You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.
Thanks!
    Stephen

Re: WIP: multivariate statistics / proof of concept

From
Jeff Janes
Date:
On Tue, Apr 28, 2015 at 9:13 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > attached is a new version of the patch series. Aside from fixing various
> > issues (crashes, memory leaks). The patches are rebased to current
> > master, and I also attach a few SQL scripts I used for testing (nothing
> > fancy, just stress-testing all the parts the patch touches).
>
> I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc.  You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.

Thanks.  I think I tried that, but was still having trouble.  But it turns out that the trouble was for an unrelated reason, and I got it to compile now.

Some of the fdw's need a patch as well in order to compile, see attached.

Cheers,

Jeff
Attachment

Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
Hi,

On 04/28/15 19:36, Jeff Janes wrote:
> ...
>
> Thanks. I think I tried that, but was still having trouble. But it
> turns out that the trouble was for an unrelated reason, and I got it
> to compile now.

Yeah, a new column was added to pg_proc the day after I submitted the 
patch. Will address that in a new version, hopefully in a few days.

>
> Some of the fdw's need a patch as well in order to compile, see
> attached.

Thanks, I forgot to tweak the clauselist_selectivity() calls in contrib :-(


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v6

From
Tomas Vondra
Date:
Attached is v6 of the multivariate stats, with a number of improvements:

1) fix of the contrib compile-time errors (reported by Jeff)

2) fix of pg_proc issues (reported by Jeff)

3) rebase to current master

4) fix a bunch of issues in the previous patches, due to referencing
    some parts too early (e.g. histograms in the first patch, etc.)

5) remove the explicit DELETEs from pg_mv_statistic (in the regression
    tests), this is now handled automatically by DROP TABLE etc.

6) number of performance optimizations in selectivity estimations:

    (a) minimize calls to get_oprrest, significantly reducing
        syscache calls

    (b) significant reduction of palloc overhead in deserialization of
        MCV lists and histograms

    (c) use more compact serialized representation of MCV lists and
        histograms, often removing ~50% of the size

    (d) use histograms with limited deserialization, which also allows
        caching function calls

    (e) modified histogram bucket partitioning, resulting in more even
        bucket distribution (i.e. producing buckets with more equal
        density and about equal size of each dimension)

7) add functions for listing MCV list items and histogram buckets:

     - pg_mv_mcvlist_items(oid)
     - pg_mv_histogram_buckets(oid, type)

    This is quite useful when analyzing the MCV lists / histograms (see
    the example after this list).

8) improved support for OR clauses

9) allow calling pull_varnos() on expression trees containing
    RestrictInfo nodes (not sure if this is the right fix, it's being
    discussed in another thread)
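
As an illustration of item (7), the new functions can be used along these
lines (the statistics OID lookup is only a sketch - the exact column names
in pg_mv_statistic are whatever the patch defines):

    SELECT oid FROM pg_mv_statistic;              -- pick the entry of interest
    SELECT * FROM pg_mv_mcvlist_items(<oid>);
    SELECT * FROM pg_mv_histogram_buckets(<oid>, <type>);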



--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics / patch v6

From
Kyotaro HORIGUCHI
Date:
Hello, this might be somewhat out of place but strongly related
to this patch so I'll propose this here.

This is a proposal of a new feature for this patch, or a request
for your approval of my moving on with this as a different (but
very closely related) project.

===

> Attached is v6 of the multivariate stats, with a number of
> improvements:
...
> 2) fix of pg_proc issues (reported by Jeff)
> 
> 3) rebase to current master

Unfortunately, the v6 patch suffers some system OID conflicts
with recently added ones. And what is more unfortunate for me is
that the code for functional dependencies looks unfinished :)

I mention this because I recently hit an issue caused by strong
correlation between two columns in the dbt3 benchmark. Two columns
in one table are strongly correlated but not functionally
dependent; there are too many values and their distribution is
very uniform, so an MCV list is of no use for the table (and a
histogram does not help with equality conditions). As a result,
the planner estimates the number of rows largely wrong, as
expected, especially for joins.

I then tried calculating the ratio between the product of the
distinctness of every column and the distinctness of the set of
the columns (call it the multivariate coefficient here), and found
that it looks greatly useful given the small storage space, little
calculation, and simple code it needs.
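
For a two-column case, this coefficient can be computed by hand with a
query along these lines (just an illustration of the definition above,
not code from the attached patch):

    SELECT (count(DISTINCT a)::float8 * count(DISTINCT b))
           / count(DISTINCT (a, b)) AS mv_coefficient
    FROM t1;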

The first attachment is a script to generate the problematic
tables, and the second is a patch that makes use of the mv
coefficient on current master. The patch is a very primitive POC,
so no syntactical interface is involved.

For the case of your first example,

> =# create table t (a int, b int, c int);
> =# insert into t (select a/10000, a/10000, a/10000
>                   from generate_series(0, 999999) a);
> =# analyze t;
> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>  Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
>                 (actual time=3.878..250.628 rows=10000 loops=1)

Now, making use of the mv coefficient:

> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
> =# analyze t;
> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>  Seq Scan on t  (cost=0.00..22906.00 rows=9221 width=12)
>                 (actual time=3.740..242.330 rows=10000 loops=1)

Row number estimation was largely improved.

Well, my example,

> $ perl gentbl.pl 10000 | psql postgres
> $ psql postgres
> =# explain analyze select * from t1 where a = 1 and b = 2501;
>  Seq Scan on t1  (cost=0.00..6216.00 rows=1 width=8)
>                  (actual time=0.030..66.005 rows=8 loops=1)
> 
> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>  Hash Join  (cost=1177.00..11393.76 rows=76 width=16)
>             (actual time=29.811..322.271 rows=320000 loops=1)

A very bad estimate for the join.

> =# insert into pg_mvcoefficient values ('t1'::regclass, 1, 2, 0, 0);
> =# analyze t1;
> =# explain analyze select * from t1 where a = 1 and b = 2501;
>  Seq Scan on t1  (cost=0.00..6216.00         rows=8 width=8)
>                  (actual time=0.032..104.144 rows=8 loops=1)
> 
> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>  Hash Join  (cost=1177.00..11393.76      rows=305652 width=16)
>             (actual time=40.642..325.679 rows=320000 loops=1)

It gives almost correct estimates.

I think the results above show that the multivariate coefficient
significantly improves estimates when correlated columns are
involved.

Would you consider this for your patch? Otherwise I will move on
with this as a different project from yours, if you don't mind.
Apart from the user interface it won't conflict with yours, I
suppose, but eventually they would need some consolidation work.

regards,

> 1) fix of the contrib compile-time errors (reported by Jeff)
> 
> 2) fix of pg_proc issues (reported by Jeff)
> 
> 3) rebase to current master
> 
> 4) fix a bunch of issues in the previous patches, due to referencing
>    some parts too early (e.g. histograms in the first patch, etc.)
> 
> 5) remove the explicit DELETEs from pg_mv_statistic (in the regression
>    tests), this is now handled automatically by DROP TABLE etc.
> 
> 6) number of performance optimizations in selectivity estimations:
> 
>    (a) minimize calls to get_oprrest, significantly reducing
>        syscache calls
> 
>    (b) significant reduction of palloc overhead in deserialization of
>        MCV lists and histograms
> 
>    (c) use more compact serialized representation of MCV lists and
>        histograms, often removing ~50% of the size
> 
>    (d) use histograms with limited deserialization, which also allows
>        caching function calls
> 
>    (e) modified histogram bucket partitioning, resulting in more even
>        bucket distribution (i.e. producing buckets with more equal
>        density and about equal size of each dimension)
> 
> 7) add functions for listing MCV list items and histogram buckets:
> 
>     - pg_mv_mcvlist_items(oid)
>     - pg_mv_histogram_buckets(oid, type)
> 
>    This is quite useful when analyzing the MCV lists / histograms.
> 
> 8) improved support for OR clauses
> 
> 9) allow calling pull_varnos() on expression trees containing
>    RestrictInfo nodes (not sure if this is the right fix, it's being
>    discussed in another thread)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

$DBNAME="postgres";
$scale = 1000;

if ($#ARGV >= 0) {
    $scale = $ARGV[0];
}

print "create table t1 (a int, b int);\n";
print "create table t2 (a int, b int);\n";
print "insert into t1 values ";
$delim = "";
for $a (1..$scale) {
    for $x (1, 2500, 5000, 7500) {
        $b = $a + $x;
        print "\n";
        for $i (1..8) {
            print $delim;
            $delim = ", ";
            print "($a, $b)";
        }
    }
}
print ";\n";
print "insert into t2 values ";
$delim = "";
for $a (1..$scale) {
    print "\n";
    for $x (1, 2500, 5000, 7500) {
        $b = $a + $x;
        print $delim;
        $delim = ", ";
        print "($a, $b)";
    }
}
print ";\n";
print "analyze t1;\n";
print "analyze t2;\n";
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 37d05d1..d00835e 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -33,7 +33,8 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
-	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
+	pg_cast.h pg_enum.h pg_mvcoefficient.h pg_namespace.h pg_conversion.h \
+	pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
 	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
 	pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 15ec0ad..9edaa0f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -27,6 +27,7 @@
 #include "catalog/indexing.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits_fn.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_namespace.h"
 #include "commands/dbcommands.h"
 #include "commands/tablecmds.h"
@@ -45,7 +46,9 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/attoptcache.h"
+#include "utils/catcache.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -110,6 +113,12 @@ static void update_attstats(Oid relid, bool inh,
 				int natts, VacAttrStats **vacattrstats);
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
+static float4 compute_mv_distinct(int nattrs,
+                                  int *stacolnums,
+                                  VacAttrStats **stats,
+                                  AnalyzeAttrFetchFunc fetchfunc,
+                                  int samplerows,
+                                  double totalrows);
 
 /*
@@ -552,6 +561,92 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 			MemoryContextResetAndDeleteChildren(col_context);
 		}
 
+        /* Compute multivariate distinctness if ordered */
+        {
+            ScanKeyData    scankey;
+            SysScanDesc    sysscan;
+            Relation    mvcrel;
+            HeapTuple    oldtup, newtup;
+            int            i;
+
+            mvcrel = heap_open(MvCoefficientRelationId, RowExclusiveLock);
+
+            ScanKeyInit(&scankey,
+                        Anum_pg_mvcoefficient_mvcreloid,
+                        BTEqualStrategyNumber, F_OIDEQ,
+                        ObjectIdGetDatum(onerel->rd_id));
+            sysscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+                                         NULL, 1, &scankey);
+            oldtup = systable_getnext(sysscan);
+
+            while (HeapTupleIsValid(oldtup))
+            {
+                int        colnums[3];
+                int        ncols = 0;
+                float4    nd;
+                Datum    values[Natts_pg_mvcoefficient];
+                bool    nulls[Natts_pg_mvcoefficient];
+                bool    replaces[Natts_pg_mvcoefficient];
+                float4        simple_mv_distinct;
+                
+                Form_pg_mvcoefficient mvc =
+                    (Form_pg_mvcoefficient) GETSTRUCT (oldtup);
+
+                if (mvc->mvcattr1 > 0)
+                    colnums[ncols++] = mvc->mvcattr1 - 1;
+                if (mvc->mvcattr2 > 0)
+                    colnums[ncols++] = mvc->mvcattr2 - 1;
+                if (mvc->mvcattr3 > 0)
+                    colnums[ncols++] = mvc->mvcattr3 - 1;
+
+                if (ncols > 0)
+                {
+                    int        j;
+                    float4    nd_coef;
+
+                    simple_mv_distinct = 
+                        vacattrstats[colnums[0]]->stadistinct;
+                    if (simple_mv_distinct < 0)
+                        simple_mv_distinct = -simple_mv_distinct * totalrows;
+                    for (j = 1 ; j < ncols ; j++)
+                    {
+                        float4 t = vacattrstats[colnums[j]]->stadistinct;
+
+                        if (t < 0)
+                            t = -t * totalrows;
+                        simple_mv_distinct *= t;
+                    }
+
+                    nd = compute_mv_distinct(j, colnums, vacattrstats,
+                                     std_fetch_func, numrows, totalrows);
+
+                    nd_coef = nd / simple_mv_distinct;
+                    
+                    for (i = 0; i < Natts_pg_mvcoefficient ; ++i)
+                    {
+                        nulls[i] = false;
+                        replaces[i] = false;
+                    }
+                    values[Anum_pg_mvcoefficient_mvccoefficient - 1] =
+                        Float4GetDatum(nd_coef);
+                    replaces[Anum_pg_mvcoefficient_mvccoefficient - 1] = true;
+                    newtup = heap_modify_tuple(oldtup,
+                                               RelationGetDescr(mvcrel),
+                                               values,
+                                               nulls,
+                                               replaces);
+                    simple_heap_update(mvcrel, &oldtup->t_self, newtup);
+
+                    CatalogUpdateIndexes(mvcrel, newtup);
+
+                    oldtup = systable_getnext(sysscan);
+                }
+            }
+
+            systable_endscan(sysscan);
+            heap_close(mvcrel, RowExclusiveLock);
+        }
+
 		if (hasindex)
 			compute_index_stats(onerel, totalrows,
 								indexdata, nindexes,
@@ -1911,6 +2006,7 @@ static void compute_scalar_stats(VacAttrStatsP stats,
 					 int samplerows,
 					 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
+static int  compare_mv_scalars(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
 
 
@@ -2840,6 +2936,207 @@ compute_scalar_stats(VacAttrStatsP stats,
 }
 
 /*
+ *    compute_mv_distinct() -- compute multicolumn distinctness
+ */ 
+
+static float4
+compute_mv_distinct(int nattrs,
+                    int *stacolnums,
+                    VacAttrStats **stats,
+                    AnalyzeAttrFetchFunc fetchfunc,
+                    int samplerows,
+                    double totalrows)
+{
+    int            i, j;
+    int            null_cnt = 0;
+    int            nonnull_cnt = 0;
+    int            toowide_cnt = 0;
+    double        total_width = 0;
+    bool        is_varlena[3];
+    SortSupportData ssup[4];    /* up to 3 columns plus a NULL sentinel slot */
+    ScalarItem **values, *values2;
+    int            values_cnt = 0;
+    int           *tupnoLink;
+    StdAnalyzeData *mystats[3];
+    float4        fndistinct;
+
+    Assert (nattrs <= 3);
+    for (i = 0 ; i < nattrs ; i++)
+    {
+        VacAttrStats *vas = stats[stacolnums[i]];
+        is_varlena[i] =
+            !vas->attrtype->typbyval && vas->attrtype->typlen == -1;
+        mystats[i] =
+            (StdAnalyzeData*) vas->extra_data;
+    }
+
+    values2 = (ScalarItem *) palloc(nattrs * samplerows * sizeof(ScalarItem));
+    values  = (ScalarItem **) palloc(samplerows * sizeof(ScalarItem*));
+    tupnoLink = (int *) palloc(samplerows * sizeof(int));
+
+    for (i = 0 ; i < samplerows ; i++)
+        values[i] = &values2[i * nattrs];
+
+    memset(ssup, 0, sizeof(ssup));
+    for (i = 0 ; i < nattrs ; i++)
+    {
+        ssup[i].ssup_cxt = CurrentMemoryContext;
+        /* We always use the default collation for statistics */
+        ssup[i].ssup_collation = DEFAULT_COLLATION_OID;
+        ssup[i].ssup_nulls_first = false;
+        ssup[i].abbreviate = true;
+        PrepareSortSupportFromOrderingOp(mystats[i]->ltopr, &ssup[i]);
+    }
+    ssup[nattrs].ssup_cxt = NULL;
+
+    /* Initial scan to find sortable values */
+    for (i = 0; i < samplerows; i++)
+    {
+        Datum        value[3];    /* nattrs is asserted to be <= 3 above */
+        bool        isnull = false;
+        bool        toowide = false;
+
+        vacuum_delay_point();
+
+        for (j = 0 ; j < nattrs ; j++)
+        {
+
+            value[j] = fetchfunc(stats[stacolnums[j]], i, &isnull);
+
+            /* Check for null/nonnull */
+            if (isnull)
+                break;
+
+            if (is_varlena[j])
+            {
+                total_width += VARSIZE_ANY(DatumGetPointer(value[j]));
+                if (toast_raw_datum_size(value[j]) > WIDTH_THRESHOLD)
+                {
+                    toowide = true;
+                    break;
+                }
+                value[j] = PointerGetDatum(PG_DETOAST_DATUM(value[j]));
+            }
+        }
+        if (isnull)
+        {
+            null_cnt++;
+            continue;
+        }
+        else if (toowide)
+        {
+            toowide_cnt++;
+            continue;
+        }
+        nonnull_cnt++;
+
+        /* Add it to the list to be sorted */
+        for (j = 0 ; j < nattrs ; j++)
+            values[values_cnt][j].value = value[j];
+
+        values[values_cnt][0].tupno = values_cnt;
+        tupnoLink[values_cnt] = values_cnt;
+        values_cnt++;
+    }
+
+    /* We can only compute real stats if we found some sortable values. */
+    if (values_cnt > 0)
+    {
+        int            ndistinct,    /* # distinct values in sample */
+                    nmultiple,    /* # that appear multiple times */
+                    dups_cnt;
+        CompareScalarsContext cxt;
+
+        /* Sort the collected values */
+        cxt.ssup = ssup;
+        cxt.tupnoLink = tupnoLink;
+        qsort_arg((void *) values, values_cnt, sizeof(ScalarItem*),
+                  compare_mv_scalars, (void *) &cxt);
+
+        ndistinct = 0;
+        nmultiple = 0;
+        dups_cnt = 0;
+        for (i = 0; i < values_cnt; i++)
+        {
+            int            tupno = values[i][0].tupno;
+
+            dups_cnt++;
+            if (tupnoLink[tupno] == tupno)
+            {
+                /* Reached end of duplicates of this value */
+                ndistinct++;
+                if (dups_cnt > 1)
+                    nmultiple++;
+
+                dups_cnt = 0;
+            }
+        }
+
+        if (nmultiple == 0)
+        {
+            /* If we found no repeated values, assume it's a unique column */
+            fndistinct = totalrows;
+        }
+        else if (toowide_cnt == 0 && nmultiple == ndistinct)
+        {
+            /*
+             * Every value in the sample appeared more than once.  Assume the
+             * column has just these values.
+             */
+            fndistinct = (float4)ndistinct;
+        }
+        else
+        {
+            /*----------
+             * Estimate the number of distinct values using the estimator
+             * proposed by Haas and Stokes in IBM Research Report RJ 10025:
+             *        n*d / (n - f1 + f1*n/N)
+             * where f1 is the number of distinct values that occurred
+             * exactly once in our sample of n rows (from a total of N),
+             * and d is the total number of distinct values in the sample.
+             * This is their Duj1 estimator; the other estimators they
+             * recommend are considerably more complex, and are numerically
+             * very unstable when n is much smaller than N.
+             *
+             * Overwidth values are assumed to have been distinct.
+             *----------
+             */
+            int            f1 = ndistinct - nmultiple + toowide_cnt;
+            int            d = f1 + nmultiple;
+            double        numer,
+                        denom,
+                        stadistinct;
+
+            numer = (double) samplerows *(double) d;
+
+            denom = (double) (samplerows - f1) +
+                (double) f1 *(double) samplerows / totalrows;
+
+            stadistinct = numer / denom;
+            /* Clamp to sane range in case of roundoff error */
+            if (stadistinct < (double) d)
+                stadistinct = (double) d;
+            if (stadistinct > totalrows)
+                stadistinct = totalrows;
+            fndistinct = floor(stadistinct + 0.5);
+        }
+    }
+    else if (nonnull_cnt > 0)
+    {
+        /* Assume all too-wide values are distinct, so it's a unique column */
+        fndistinct = totalrows;
+    }
+    else if (null_cnt > 0)
+    {
+        fndistinct =  0.0;        /* "unknown" */
+    }
+
+    /* We don't need to bother cleaning up any of our temporary palloc's */
+    return fndistinct;
+}
+
+
+/*
  * qsort_arg comparator for sorting ScalarItems
  *
  * Aside from sorting the items, we update the tupnoLink[] array
@@ -2876,6 +3173,43 @@ compare_scalars(const void *a, const void *b, void *arg)
 	return ta - tb;
 }
 
+static int
+compare_mv_scalars(const void *a, const void *b, void *arg)
+{
+    CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
+    ScalarItem *va = *(ScalarItem**)a;
+    ScalarItem *vb = *(ScalarItem**)b;
+    Datum        da, db;
+    int            ta, tb;
+    int            compare;
+    int i;
+
+    for (i = 0 ; cxt->ssup[i].ssup_cxt ; i++)
+    {
+        da = va[i].value;
+        db = vb[i].value;
+
+        compare = ApplySortComparator(da, false, db, false, &cxt->ssup[i]);
+        if (compare != 0)
+            return compare;
+    }
+
+    /*
+     * The two datums are equal, so update cxt->tupnoLink[].
+     */
+    ta = va[0].tupno;
+    tb = vb[0].tupno;
+    if (cxt->tupnoLink[ta] < tb)
+        cxt->tupnoLink[ta] = tb;
+    if (cxt->tupnoLink[tb] < ta)
+        cxt->tupnoLink[tb] = ta;
+
+    /*
+     * For equal datums, sort by tupno
+     */
+    return ta - tb;
+}
+
 /*
  * qsort comparator for sorting ScalarMCVItems by position
  */
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index dcac1c1..43712ba 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -14,8 +14,14 @@
  */
 #include "postgres.h"
 
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -43,6 +49,93 @@ static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
 			   bool varonleft, bool isLTsel, Selectivity s2);
 

+static bool
+collect_collist_walker(Node *node, Bitmapset **colsetlist)
+{
+    if (node == NULL)
+        return false;
+    if (IsA(node, Var))
+    {
+        Var *var = (Var*)node;
+
+        if (AttrNumberIsForUserDefinedAttr(var->varattno))
+            colsetlist[var->varno] = 
+                bms_add_member(colsetlist[var->varno], var->varattno);
+    }
+    return expression_tree_walker(node, collect_collist_walker,
+                                  (void*)colsetlist);
+}
+
+/* Find multivariate distinctness coefficient for clauselist */
+static double
+find_mv_join_coeffeicient(PlannerInfo *root, List *clauses)
+{
+    int relid;
+    ListCell   *l;
+    Bitmapset **colsetlist = NULL;
+    double mv_coef = 1.0;
+
+    /* Collect columns this clauselist on */
+    colsetlist = (Bitmapset**)
+        palloc0(root->simple_rel_array_size * sizeof(Bitmapset*));
+
+    foreach(l, clauses)
+    {
+        RestrictInfo *rti = (RestrictInfo *) lfirst(l);
+
+        /* Consider only EC-derived clauses between the joinrels */
+        if (rti->left_ec && rti->left_ec == rti->right_ec)
+        {
+            if (IsA(rti, RestrictInfo))
+                collect_collist_walker((Node*)rti->clause, colsetlist);
+        }
+    }
+
+    /* Find pg_mv_coefficient entries match this columlist */
+    for (relid = 1 ; relid < root->simple_rel_array_size ; relid++)
+    {
+        Relation mvcrel;
+        SysScanDesc sscan;
+        ScanKeyData skeys[1];
+        HeapTuple tuple;
+        
+        if (bms_is_empty(colsetlist[relid])) continue;
+
+        if (root->simple_rte_array[relid]->rtekind != RTE_RELATION) continue;
+
+        ScanKeyInit(&skeys[0],
+                    Anum_pg_mvcoefficient_mvcreloid,
+                    BTEqualStrategyNumber, F_OIDEQ,
+                    ObjectIdGetDatum(root->simple_rte_array[relid]->relid));
+        
+        mvcrel = heap_open(MvCoefficientRelationId, AccessShareLock);
+        sscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+                                   NULL, 1, skeys);
+        while (HeapTupleIsValid(tuple = systable_getnext(sscan)))
+        {
+            Bitmapset *mvccols = NULL;
+            Form_pg_mvcoefficient mvc =
+                (Form_pg_mvcoefficient) GETSTRUCT (tuple);
+
+            mvccols = bms_add_member(mvccols, mvc->mvcattr1);
+            mvccols = bms_add_member(mvccols, mvc->mvcattr2);
+            if (mvc->mvcattr3 > 0)
+                mvccols = bms_add_member(mvccols, mvc->mvcattr3);
+
+            if (!bms_is_subset(mvccols, colsetlist[relid]))
+                continue;
+
+            /* Prefer smaller one */
+            if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)
+                mv_coef = mvc->mvccoefficient;
+        }
+        systable_endscan(sscan);
+        heap_close(mvcrel, AccessShareLock);
+    }
+
+    return mv_coef;
+}
+
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
  ****************************************************************************/
 
@@ -200,6 +293,9 @@ clauselist_selectivity(PlannerInfo *root,
 		s1 = s1 * s2;
 	}
 
+	/* Try multivariate distinctness correction for clauses */
+	s1 /= find_mv_join_coeffeicient(root, clauses);
+
 	/*
 	 * Now scan the rangequery pair list.
 	 */
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index f58e1ce..f4c1001 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_language.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_operator.h"
@@ -501,6 +502,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		4
 	},
+    {MvCoefficientRelationId,        /* MVCOEFFICIENT */
+        MvCoefficientIndexId,
+        4,
+        {
+            Anum_pg_mvcoefficient_mvcreloid,
+            Anum_pg_mvcoefficient_mvcattr1,
+            Anum_pg_mvcoefficient_mvcattr2,
+            Anum_pg_mvcoefficient_mvcattr3
+        },
+        4
+	},
 	{NamespaceRelationId,		/* NAMESPACENAME */
 		NamespaceNameIndexId,
 		1,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 71e0010..0c76f93 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -173,6 +173,9 @@ DECLARE_UNIQUE_INDEX(pg_largeobject_loid_pn_index, 2683, on pg_largeobject using
 DECLARE_UNIQUE_INDEX(pg_largeobject_metadata_oid_index, 2996, on pg_largeobject_metadata using btree(oid oid_ops));
 #define LargeObjectMetadataOidIndexId	2996
 
+DECLARE_UNIQUE_INDEX(pg_mvcoefficient_index, 3578, on pg_mvcoefficient using btree(mvcreloid oid_ops, mvcattr1 int2_ops, mvcattr2 int2_ops, mvcattr3 int2_ops));
+#define MvCoefficientIndexId  3578
+
 DECLARE_UNIQUE_INDEX(pg_namespace_nspname_index, 2684, on pg_namespace using btree(nspname name_ops));
 #define NamespaceNameIndexId  2684
 DECLARE_UNIQUE_INDEX(pg_namespace_oid_index, 2685, on pg_namespace using btree(oid oid_ops));
 
diff --git a/src/include/catalog/pg_mvcoefficient.h b/src/include/catalog/pg_mvcoefficient.h
new file mode 100644
index 0000000..56259fd
--- /dev/null
+++ b/src/include/catalog/pg_mvcoefficient.h
@@ -0,0 +1,68 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_mvcoefficient.h
+ *      definition of the system multivariate coefficient relation
+ *      (pg_mvcoefficient) along with the relation's initial contents.
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/include/catalog/pg_mvcoefficient.h
+ *
+ * NOTES
+ *      the genbki.pl script reads this file and generates .bki
+ *      information from the DATA() statements.
+ *
+ *      XXX do NOT break up DATA() statements into multiple lines!
+ *          the scripts are not as smart as you might think...
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_MVCOEFFICIENT_H
+#define PG_MVCOEFFICIENT_H
+
+#include "catalog/genbki.h"
+#include "nodes/pg_list.h"
+
+/* ----------------
+ *        pg_mvcoefficient definition.  cpp turns this into
+ *        typedef struct FormData_pg_mvcoefficient
+ * ----------------
+ */
+#define MvCoefficientRelationId    3577
+
+CATALOG(pg_mvcoefficient,3577) BKI_WITHOUT_OIDS
+{
+    Oid        mvcreloid;            /* OID of target relation */
+    int16    mvcattr1;            /* Column numbers */
+    int16    mvcattr2;
+    int16    mvcattr3;
+    float4    mvccoefficient;        /* multivariate distinctness coefficient */
+} FormData_pg_mvcoefficient;
+
+/* ----------------
+ *        Form_pg_mvcoefficient corresponds to a pointer to a tuple with the
+ *        format of pg_mvcoefficient relation.
+ * ----------------
+ */
+typedef FormData_pg_mvcoefficient *Form_pg_mvcoefficient;
+
+/* ----------------
+ *        compiler constants for pg_mvcoefficient
+ * ----------------
+ */
+#define Natts_pg_mvcoefficient                5
+#define Anum_pg_mvcoefficient_mvcreloid        1
+#define Anum_pg_mvcoefficient_mvcattr1            2
+#define Anum_pg_mvcoefficient_mvcattr2            3
+#define Anum_pg_mvcoefficient_mvcattr3            4
+#define Anum_pg_mvcoefficient_mvccoefficient    5
+
+/* ----------------
+ *        pg_mvcoefficient has no initial contents
+ * ----------------
+ */
+
+/*
+ * prototypes for functions in pg_enum.c
+ */
+#endif   /* PG_MVCOEFFICIENT_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 6634099..db8454c 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -66,6 +66,7 @@ enum SysCacheIdentifier
     INDEXRELID,
     LANGNAME,
     LANGOID,
+    MVDISTINCT,
     NAMESPACENAME,
     NAMESPACEOID,
     OPERNAMENSP,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index eb0bc88..7c77796 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -113,6 +113,7 @@
 pg_inherits|t
 pg_language|t
 pg_largeobject|t
 pg_largeobject_metadata|t
+pg_mvcoefficient|t
 pg_namespace|t
 pg_opclass|t
 pg_operator|t

Re: multivariate statistics / patch v6

From
Tomas Vondra
Date:

On 05/13/15 10:31, Kyotaro HORIGUCHI wrote:
> Hello, this might be somewhat out of place but strongly related
> to this patch so I'll propose this here.
>
> This is a proposal of new feature for this patch or asking for
> your approval for my moving on this as a different (but very
> close) project.
>
> ===
>
>> Attached is v6 of the multivariate stats, with a number of
>> improvements:
> ...
>> 2) fix of pg_proc issues (reported by Jeff)
>>
>> 3) rebase to current master
>
> Unfortunately, the v6 patch suffers some system oid conflicts
> with recently added ones. And what more unfortunate for me is
> that the code for functional dependencies looks undone:)

I'll fix the OID conflicts once the CF completes, which should be in a 
few days I guess. Until then you can apply it on top of master from 
about May 6 (that's when the v6 was created, and there should be no 
conflicts).

Regarding the functional dependencies - you're right there's room for 
improvement. For example it only works with dependencies between pairs 
of columns, not multi-column dependencies. Is this what you mean by 
incomplete?

> I mention this because I recently had a issue from strong
> correlation between two columns in dbt3 benchmark. Two columns in
> some table are in strong correlation but not in functional
> dependencies, there are too many values and the distribution of
> them is very uniform so MCV is no use for the table (histogram
> has nothing to do with equal conditions). As the result, planner
> estimates the number of rows largely wrong as expected especially
> for joins.

I think the other statistics types (esp. histograms) might be more 
useful here, but I assume you haven't tried that because of the conflicts.

The current patch does not handle joins at all, though.


> I, then, had a try calculating the ratio between the product of
> distinctness of every column and the distinctness of the set of
> the columns, call it multivariate coefficient here, and found
> that it looks greately useful for the small storage space, less
> calculation, and simple code.

So when you have two columns A and B, you compute this:

    ndistinct(A) * ndistinct(B)
    ---------------------------
           ndistinct(A,B)

where ndistinct(...) means the number of distinct values in the column(s)?
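
(For the record, something like this - untested - should produce the raw
inputs for that formula on your t1 table:

    SELECT count(DISTINCT a)      AS nd_a,
           count(DISTINCT b)      AS nd_b,
           count(DISTINCT (a, b)) AS nd_ab
      FROM t1;

and the coefficient would then simply be nd_a * nd_b / nd_ab.)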


> The attached first is a script to generate problematic tables.
> And the second is a patch to make use of the mv coef on current
> master.  The patch is a very primitive POC so no syntactical
> interfaces involved.
>
> For the case of your first example,
>
>> =# create table t (a int, b int, c int);
>> =# insert into t (select a/10000, a/10000, a/10000
>>                    from generate_series(0, 999999) a);
>> =# analyze t;
>> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>>   Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
>>                  (actual time=3.878..250.628 rows=10000 loops=1)
>
> Make use of mv coefficient.
>
>> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
>> =# analyze t;
>> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9221 width=12)
>>                  (actual time=3.740..242.330 rows=10000 loops=1)
>
> Row number estimation was largely improved.

With my patch:

alter table t add statistics (mcv) on (a,b,c);
analyze t;
select * from pg_mv_stats;
 tablename | attnums | mcvbytes |  mcvinfo
-----------+---------+----------+------------
 t         | 1 2 3   |     2964 | nitems=100

explain (analyze,timing off)  select * from t where a = 1 and b = 1 and c = 1;
                            QUERY PLAN
------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)
                (actual rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1) AND (c = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.233 ms
 Execution time: 93.212 ms
 
(5 rows)

alter table t drop statistics all;
alter table t add statistics (histogram) on (a,b,c);
analyze t;

explain (analyze,timing off) select * from t where a = 1 and b = 1 and c = 1;
                        QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)
                (actual rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1) AND (c = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.594 ms
 Execution time: 109.917 ms
 
(5 rows)

So both the MCV list and the histogram do quite a good job here, but there 
are certainly cases where that does not work and the mv coefficient works 
better.

> Well, my example,
>
>> $ perl gentbl.pl 10000 | psql postgres
>> $ psql postgres
>> =# explain analyze select * from t1 where a = 1 and b = 2501;
>>   Seq Scan on t1  (cost=0.00..6216.00 rows=1 width=8)
>>                   (actual time=0.030..66.005 rows=8 loops=1)
>>
>> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>>   Hash Join  (cost=1177.00..11393.76 rows=76 width=16)
>>              (actual time=29.811..322.271 rows=320000 loops=1)
>
> Too bad estimate for the join.
>
>> =# insert into pg_mvcoefficient values ('t1'::regclass, 1, 2, 0, 0);
>> =# analyze t1;
>> =# explain analyze select * from t1 where a = 1 and b = 2501;
>>   Seq Scan on t1  (cost=0.00..6216.00         rows=8 width=8)
>>                   (actual time=0.032..104.144 rows=8 loops=1)
>>
>> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>>   Hash Join  (cost=1177.00..11393.76      rows=305652 width=16)
>>              (actual time=40.642..325.679 rows=320000 loops=1)
>
> It gives almost correct estimations.

The current patch does not handle joins, but it's one of the TODO items.

>
> I think the result above shows that the multivariate coefficient
> is significant to imporove estimates when correlated colums are
> involved.

Yes, it looks interesting. I'm wondering what the "failure cases" are, i.e. 
when the coefficient approach does not work. It seems to me it relies on 
an assumption of consistency for all the ndistinct values. For example, 
let's assume you have two columns - A and B, each with 1000 distinct 
values, and that each value in A has 100 matching values in B, so the 
coefficient is ~10

   1,000 * 1,000 / 100,000 = 10

Now, let's assume the distribution looks different - with the first 100 
values in A matching all 1000 values of B, and the remaining 900 values 
just a single B value. Then

  1,000 * 1,000 / (100,000 + 900) = ~9.9

So a very different distribution, but almost the same coefficient.
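
(A quick sanity check of that, using a throwaway table and the same
count(DISTINCT ...) query as above - untested:

    CREATE TABLE skew (a INT, b INT);
    -- first 100 values of "a" paired with all 1000 values of "b"
    INSERT INTO skew SELECT i, j FROM generate_series(1,100) i,
                                      generate_series(1,1000) j;
    -- remaining 900 values of "a" paired with a single "b" value
    INSERT INTO skew SELECT i, 1 FROM generate_series(101,1000) i;

should give nd_a = 1000, nd_b = 1000 and nd_ab = 100,900, i.e. a
coefficient of ~9.9 despite the very different distribution.)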

Are there any other assumptions like this?

Also, does the coefficient work only for equality conditions?

>
> Would you consider this in your patch? Otherwise I move on this
> as a different project from yours if you don't mind. Except user
> interface won't conflict with yours, I suppose. But finally they
> should need some labor of consolidation.

I think it's a neat idea, and I think it might be added to the patch. It 
would fit in quite nicely, actually - I already do have other kinds of 
stats for addition, but I'm not going to work on that in the near 
future. It will require changes in some parts of the patch (selecting 
the stats for a list of clauses) and I'd like to complete the current 
patch first, and then add features in follow-up patches.

>
> regards,

regards
Tomas



Re: multivariate statistics / patch v6

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 14 May 2015 12:35:50 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<55547A86.8020400@2ndquadrant.com>
> 
> On 05/13/15 10:31, Kyotaro HORIGUCHI wrote:
> > Hello, this might be somewhat out of place but strongly related
> > to this patch so I'll propose this here.
> >
> > This is a proposal of new feature for this patch or asking for
> > your approval for my moving on this as a different (but very
> > close) project.
> >
> > ===
> >
> >> Attached is v6 of the multivariate stats, with a number of
> >> improvements:
> > ...
> >> 2) fix of pg_proc issues (reported by Jeff)
> >>
> >> 3) rebase to current master
> >
> > Unfortunately, the v6 patch suffers some system oid conflicts
> > with recently added ones. And what more unfortunate for me is
> > that the code for functional dependencies looks undone:)
> 
> I'll fix the OID conflicts once the CF completes, which should be in a
> few days I guess. Until then you can apply it on top of master from
> about May 6 (that's when the v6 was created, and there should be no
> conflicts).

I applied it after fixing those conflicts myself; it wasn't a problem :)

> Regarding the functional dependencies - you're right there's room for
> improvement. For example it only works with dependencies between pairs
> of columns, not multi-column dependencies. Is this what you mean by
> incomplete?

No, it overruns dependencies->deps, because build_mv_dependencies
stores many elements in dependencies->deps[n] although there is
room for only one element. I suppose you paused writing it when
you noticed that the number of required elements is unknown before
finishing the walk through all pairs of values. palloc'ing
numattrs^2 elements is reasonable enough as POC code for now. Am I
looking at the wrong version of the patch?

-    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData))
+    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData) +
+                                sizeof(MVDependency) * numattrs * numattrs);

> > I mention this because I recently had a issue from strong
> > correlation between two columns in dbt3 benchmark. Two columns in
> > some table are in strong correlation but not in functional
> > dependencies, there are too many values and the distribution of
> > them is very uniform so MCV is no use for the table (histogram
> > has nothing to do with equal conditions). As the result, planner
> > estimates the number of rows largely wrong as expected especially
> > for joins.
> 
> I think the other statistics types (esp. histograms) might be more
> useful here, but I assume you haven't tried that because of the
> conflicts.
> 
> The current patch does not handle joins at all, though.

Well, that's one of the reasons. But I understood that no
deterministic estimation can be applied to such a distribution
when I saw what caused the wrong estimate. eqsel and eqsel_join
ultimately rely on the random-match assumption over a uniform
distribution when the value is not found in the MCV list. And the
functional dependencies stuff in your old patch (which works)
(rightfully) failed to find such a relationship between the
problematic columns. So I tried ndistinct, which is not contained
in your patch, to see how well it works.

> > I, then, had a try calculating the ratio between the product of
> > distinctness of every column and the distinctness of the set of
> > the columns, call it multivariate coefficient here, and found
> > that it looks greately useful for the small storage space, less
> > calculation, and simple code.
> 
> So when you have two columns A and B, you compute this:
> 
>    ndistinct(A) * ndistinct(B)
>    ---------------------------
>           ndistinct(A,B)

Yes, I used the reciprocal of that, though.

> where ndistinc(...) means number of distinct values in the column(s)?

Yes.

> > The attached first is a script to generate problematic tables.
> > And the second is a patch to make use of the mv coef on current
> > master.  The patch is a very primitive POC so no syntactical
> > interfaces involved.
...
> > Make use of mv coefficient.
> >
> >> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
> >> =# analyze t;
> >> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
> >>   Seq Scan on t  (cost=0.00..22906.00 rows=9221 width=12)
> >>                  (actual time=3.740..242.330 rows=10000 loops=1)
> >
> > Row number estimation was largely improved.
> 
> With my patch:
> 
> alter table t add statistics (mcv) on (a,b,c);
...
>  Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)

Yes, your MV-MCV list should hold one third of all possible (sets
of) values, so it works fine, I guess. But my original problem
occurred under the condition that the (single column) MCVs contain
under 1% of the possible values. MCV would not work for such cases,
but the very uniform distribution helps the random assumption to
work.

> $ perl gentbl.pl 200000 | psql postgres
<takes a while..>
> posttres=# alter table t1 add statistics (mcv true) on (a, b);
> postgres=# analyze t1;
> postgres=# explain analyze select * from t1 where a = 1 and b = 2501;
> Seq Scan on t1  (cost=0.00..124319.00 rows=1 width=8)
>                 (actual time=0.051..1250.773 rows=8 loops=1)

The estimate "rows=1" is internally 2.4e-11, 3.33e+11 times
smaller than the real number. This will result in roughly the
same order of error for joins. This is because MV-MCV holds too
small part of the domain and then calculated using random
assumption. This won't be not saved by increasing
statistics_target to any sane amount.


> alter table t drop statistics all;
> alter table t add statistics (histogram) on (a,b,c);
...
>  Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)

> So both the MCV list and histogram do quite a good work here,

I understand how you calculate selectivity for equality clauses
using the histogram. It calculates the result rows as 2.3e-11,
which is almost the same as MV-MCV; it comes from the same cause
and thus yields the same result for joins.

> but there are certainly cases when that does not work and the
> mvcoefficient works better.

The mv-coef is effective in the conditions where, as mentioned above,
MV-MCV or MV-HISTOGRAM cannot hold a sufficient part of the domain.
The appropriate combination of MV-MCV and mv-coef would be the same
as var_eq_(non_)const/eqjoinsel_inner for a single column, that is,
applying the mv-coef to the part of the selectivity corresponding to
values not in the MV-MCV. I have no idea how to combine it with
MV-HISTOGRAM right now.
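
Just to spell out the idea for the MCV case (a rough sketch only, by
analogy with var_eq_const - not code from either patch):

    selec = (sum of frequencies of MV-MCV items matching the clauses)
          + (1 - sum of all MV-MCV frequencies) / otherdistinct

where otherdistinct is the estimated number of distinct combinations
not covered by the MV-MCV, something like

    otherdistinct = ndistinct(A) * ndistinct(B) * mv_coef - (number of MV-MCV items)

using the reciprocal form of the coefficient I mentioned above.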

> The current patch does not handle joins, but it's one of the TODO
> items.

Yes, but the result on the very large tables can be deduced from
the discussion above.

> > I think the result above shows that the multivariate coefficient
> > is significant to imporove estimates when correlated colums are
> > involved.
> 
> Yes, it looks interesting. I'm wondering what are the "failure cases"
> when the coefficient approach does not work. It seems to me it relies
> on an assumption of consistency for all the ndistinct values. For
> example lets assume you have two columns - A and B, each with 1000
> distinct values, and that each value in A has 100 matching values in
> B, so the coefficient is ~10
> 
>    1,000 * 1,000 / 100,000 = 10
> 
> Now, let's assume the distribution looks differently - with first 100
> values in A matching all 1000 values of B, and the remaining 900
> values just a single B value. Then
> 
>   1,000 * 1,000 / (100,000 + 900) = ~9,9
> 
> So a very different distribution, but almost the same coefficient.
> 
> Are there any other assumptions like this?

I think there are none for now. Just like the current var_eq_(non_)const
and eqjoinsel_inner, since no clue about *the true* distribution is
available, we have no choice other than to stand on the random (uniform
distribution) assumption. And it gives not-so-bad estimates for
not-so-extreme distributions. It's of course not perfect, but good
enough.

> Also, does the coefficient work only for equality conditions only?

The mvcoef is a parallel of ndistinct (it is a bit weird as an
expression, though). So I guess it is applicable in the current
estimation code wherever ndistinct is used; almost all of those
places look related to equality comparison.

> > Would you consider this in your patch? Otherwise I move on this
> > as a different project from yours if you don't mind. Except user
> > interface won't conflict with yours, I suppose. But finally they
> > should need some labor of consolidation.
> 
> I think it's a neat idea, and I think it might be added to the
> patch. It would fit in quite nicely, actually - I already do have
> other kinds of stats for addition, but I'm not going to work on that
> in the near future. It will require changes in some parts of the patch
> (selecting the stats for a list of clauses) and I'd like to complete
> the current patch first, and then add features in follow-up patches.

I see. Let's work on this for now.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: multivariate statistics / patch v6

From
Tomas Vondra
Date:
Hello,

On 05/15/15 08:29, Kyotaro HORIGUCHI wrote:
> Hello,
>
>> Regarding the functional dependencies - you're right there's room
>> for improvement. For example it only works with dependencies
>> between pairs of columns, not multi-column dependencies. Is this
>> what you mean by incomplete?
>
> No, It overruns dependencies->deps because build_mv_dependencies
> stores many elements into dependencies->deps[n] although it
> really has a room for only one element. I suppose that you paused
> writing it when you noticed that the number of required elements
> is unknown before finising walk through all pairs of
> values. palloc'ing numattrs^2 is reasonable enough as POC code
> for now. Am I looking wrong version of patch?
>
> -    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData))
> +    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData) +
> +                                sizeof(MVDependency) * numattrs * numattrs);

Ah! That's clearly a bug. Thanks for noticing that, will fix in the next 
version of the patch.

>>> I mention this because I recently had a issue from strong
>>> correlation between two columns in dbt3 benchmark. Two columns
>>> in some table are in strong correlation but not in functional
>>> dependencies, there are too many values and the distribution of
>>> them is very uniform so MCV is no use for the table (histogram
>>> has nothing to do with equal conditions). As the result, planner
>>> estimates the number of rows largely wrong as expected
>>> especially for joins.
>>
>> I think the other statistics types (esp. histograms) might be more
>> useful here, but I assume you haven't tried that because of the
>> conflicts.
>>
>> The current patch does not handle joins at all, though.
>
> Well, that's one of the resons. But I understood that any
> deterministic estimation cannot be applied for such distribution
> when I saw what made the wrong estimation. eqsel and eqsel_join
> finally relies on random match assumption on uniform distribution
> when the value is not found in MCV list. And functional
> dependencies stuff in your old patch (which works) (rightfully)
> failed to find such relationship between the problematic
> columns. So I tried ndistinct, which is not contained in your
> patch to see how it works well.

Yes, that's certainly true. I think you're right that mv coefficient 
might be quite useful in some cases.

>> With my patch:
>>
>> alter table t add statistics (mcv) on (a,b,c);
> ...
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)
>
> Yes, your MV-MCV list should have one third of all possible (set
> of) values so it works fine, I guess. But my original problem was
> occurred on the condition that (the single column) MCVs contain
> under 1% of possible values, MCV would not work for such cases,
> but its very uniform distribution helps random assumption to
> work.

Actually, I think the MCV list should contain all the items, because the 
MCV build decides the sample contains all the values from the data. The 
usual 1D MCV list uses the same logic. But you're right that on a data set 
with more MCV items and a mostly uniform distribution, this won't work.


>
>> $ perl gentbl.pl 200000 | psql postgres
> <takes a while..>
>> posttres=# alter table t1 add statistics (mcv true) on (a, b);
>> postgres=# analyze t1;
>> postgres=# explain analyze select * from t1 where a = 1 and b = 2501;
>> Seq Scan on t1  (cost=0.00..124319.00 rows=1 width=8)
>>                  (actual time=0.051..1250.773 rows=8 loops=1)
>
> The estimate "rows=1" is internally 2.4e-11, 3.33e+11 times
> smaller than the real number. This will result in roughly the
> same order of error for joins. This is because MV-MCV holds too
> small part of the domain and then calculated using random
> assumption. This won't be not saved by increasing
> statistics_target to any sane amount.

Yes, the MCV lists don't work well with data sets like this.

>> alter table t drop statistics all;
>> alter table t add statistics (histogram) on (a,b,c);
> ...
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)
>
>> So both the MCV list and histogram do quite a good work here,
>
> I understand how you calculate selectivity for equality clauses
> using histogram. And it calculates the result rows as 2.3e-11,
> which is almost same as MV-MCV, and this comes the same cause
> with it then yields the same result for joins.
>
>> but there are certainly cases when that does not work and the
>> mvcoefficient works better.

+1

> The condition mv-coef is effective where, as metioned above,
> MV-MCV or MV-HISTO cannot hold sufficient part of the domain. The
> appropriate combination of MV-MCV and mv-coef would be the same
> as va_eq_(non_)const/eqjoinsel_inner for single column, which is,
> applying mv-coef on the part of selectivity corresponding to
> values not in MV-MCV. I have no idea to combinate it with
> MV-HISTOGRAM right now.
>
>> The current patch does not handle joins, but it's one of the TODO
>> items.
>
> Yes, but the result on the very large tables can be deduced from
> the discussion above.
>
>>> I think the result above shows that the multivariate coefficient
>>> is significant to imporove estimates when correlated colums are
>>> involved.
>>
>> Yes, it looks interesting. I'm wondering what are the "failure cases"
>> when the coefficient approach does not work. It seems to me it relies
>> on an assumption of consistency for all the ndistinct values. For
>> example lets assume you have two columns - A and B, each with 1000
>> distinct values, and that each value in A has 100 matching values in
>> B, so the coefficient is ~10
>>
>>     1,000 * 1,000 / 100,000 = 10
>>
>> Now, let's assume the distribution looks differently - with first 100
>> values in A matching all 1000 values of B, and the remaining 900
>> values just a single B value. Then
>>
>>    1,000 * 1,000 / (100,000 + 900) = ~9,9
>>
>> So a very different distribution, but almost the same coefficient.
>>
>> Are there any other assumptions like this?
>
> I think no for now. Just like the current var_eq_(non_)const and
> eqjoinsel_inner does, since no clue for *the true* distribution
> available, we have no choice other than stand on the random (on
> uniform dist) assumption. And it gives not so bad estimates for
> not so extreme distributions. It's of course not perfect but good
> enough.
>
>> Also, does the coefficient work only for equality conditions only?
>
> The mvcoef is a parallel of ndistinct, (it is a bit wierd
> expression though). So I guess it is appliable on the current
> estimation codes where using ndistinct, almost of all of them
> look to relate to equiality comparison.

ISTM the estimation of GROUP BY might benefit tremendously from this 
statistic. That is, helping with cardinality estimation of analytical 
queries, etc.

Also, we've only discussed 2-column coefficients. Would it be useful to 
track those coefficients for larger groups of columns? For example

                  ndistinct(A,B,C)
    --------------------------------------------
     ndistinct(A) * ndistinct(B) * ndistinct(C)

which might work better for queries like

    SELECT a,b,c FROM t GROUP BY a,b,c;
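
(As a rough illustration: for the table from the very first example in this
thread, estimate_num_groups() would currently start from the product of the
per-column ndistinct estimates, i.e. roughly 100 * 100 * 100 = 1,000,000
groups, while the actual number of groups is ~100. A stored coefficient on
(a,b,c) would allow correcting exactly that - numbers approximate, of
course.)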

>>> Would you consider this in your patch? Otherwise I move on this
>>> as a different project from yours if you don't mind. Except user
>>> interface won't conflict with yours, I suppose. But finally they
>>> should need some labor of consolidation.
>>
>> I think it's a neat idea, and I think it might be added to the
>> patch. It would fit in quite nicely, actually - I already do have
>> other kinds of stats for addition, but I'm not going to work on
>> that in the near future. It will require changes in some parts of
>> the patch (selecting the stats for a list of clauses) and I'd like
>> to complete the current patch first, and then add features in
>> follow-up patches.
>
> I see. Let's work on this for now.

Thanks!

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v6

From
Tomas Vondra
Date:
Hello,

On 05/15/15 08:29, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Thu, 14 May 2015 12:35:50 +0200, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote in <55547A86.8020400@2ndquadrant.com>
...
>
>> Regarding the functional dependencies - you're right there's room for
>> improvement. For example it only works with dependencies between pairs
>> of columns, not multi-column dependencies. Is this what you mean by
>> incomplete?
>
> No, It overruns dependencies->deps because build_mv_dependencies
> stores many elements into dependencies->deps[n] although it
> really has a room for only one element. I suppose that you paused
> writing it when you noticed that the number of required elements
> is unknown before finising walk through all pairs of
> values. palloc'ing numattrs^2 is reasonable enough as POC code
> for now. Am I looking wrong version of patch?
>
> -    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData))
> +    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData) +
> +                                sizeof(MVDependency) * numattrs * numattrs);

Actually, looking at this a bit more, I think the current behavior is 
correct. I assume the line is from build_mv_dependencies(), but the 
whole block looks like this:
  if (dependencies == NULL)
  {
      dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData));
      dependencies->magic = MVSTAT_DEPS_MAGIC;
  }
  else
      dependencies = repalloc(dependencies,
                              offsetof(MVDependenciesData, deps) +
                              sizeof(MVDependency) * (dependencies->ndeps + 1));
 

which allocates space for a single element initially, and then extends 
that when other dependencies are added.



--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hello,

attached is v7 of the multivariate stats patch. The main improvement is
major refactoring of the clausesel.c portion - splitting the awfully
long spaghetti-style functions into smaller pieces, making it much more
understandable etc.

I assume some of those pieces are unnecessary because there already
is a helper function with the same purpose that I'm simply not aware of.
But IMHO this piece of code begins to look reasonable (especially when
compared to the previous state).

The other major improvement is a review of the comments (including FIXMEs
and TODOs), and removal of the obsolete / misplaced ones. And there were
plenty of those ...

These changes made this version ~20k smaller than v6.

The patch also rebases to current master, which I assume shall be quite
stable - so hopefully no more duplicate OIDs for a while.

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself. The first part makes it possible
to use pull_varnos() with expression trees containing RestrictInfo
nodes, but maybe this is not the right way to fix this (there's another
thread where this was discussed).

Also, the regression tests testing plan choice with multivariate stats
(e.g. that a bitmap index scan is chosen instead of index scan) fail
from time to time. I suppose this happens because the invalidation after
ANALYZE is not processed before executing the query, so the optimizer
does not see the stats, or something like that.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hello, I started to work on this patch.

> attached is v7 of the multivariate stats patch. The main improvement
> is major refactoring of the clausesel.c portion - splitting the
> awfully long spaghetti-style functions into smaller pieces, making it
> much more understandable etc.

Thank you, it looks clearer. I have some comments from a brief
look at it. This patchset is relatively large, so I will comment
on a "per-notice" basis, which means I'll send comments before
examining the entire patchset. Sorry in advance for the
desultory comments.

=======
General comments:

- You included unnecessary stuff, such as regression.diffs, in these patches.

- Now OID 3307 is used by pg_stat_file. I moved pg_mv_stats_dependencies_info/show to 3311/3312.

- Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and
  similar stuff. I want a similar mechanism for multivariate statistics, too.
 

0001:

- I also don't think it is the right thing for expression_tree_walker to recognize RestrictInfo, since it is not part of
  an expression.

0003:

- In clauselist_selectivity, find_stats is uselessly called for a single clause. It should be called only after the
  clause list is found to consist of more than one clause.
 

- The search for vars to be compared with mv-stat columns, which find_stats does, should stop at disjunctions. But this
  patch doesn't behave that way, which I think is unwanted behavior. The following steps show that.
 

====
=# CREATE TABLE t1 (a int, b int, c int);
=# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
=# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
=# ANALYZE t1;
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
 
====
Rows changed unwantedly.
It seems to be not as simple a thing as your code assumes.

> I do assume some of those pieces are unnecessary because there already
> is a helper function with the same purpose (but I'm not aware of
> that). But IMHO this piece of code begins to look reasonable
> (especially when compared to the previous state).

Yeah, that kind of work should be done later :p This patch is
not so invasive as to make that undoable.

> The other major improvement it review of the comments (including
> FIXMEs and TODOs), and removal of the obsolete / misplaced ones. And
> there was plenty of those ...
> 
> These changes made this version ~20k smaller than v6.
> 
> The patch also rebases to current master, which I assume shall be
> quite stable - so hopefully no more duplicate OIDs for a while.
> 
> There are 6 files attached, but only 0002-0006 are actually part of
> the multivariate statistics patch itself. The first part makes it
> possible to use pull_varnos() with expression trees containing
> RestrictInfo nodes, but maybe this is not the right way to fix this
> (there's another thread where this was discussed).

As mentioned above, checking whether mv stats can be applied would be
a more complex matter than you are now assuming. I will also think
about that.

> Also, the regression tests testing plan choice with multivariate stats
> (e.g. that a bitmap index scan is chosen instead of index scan) fail
> from time to time. I suppose this happens because the invalidation
> after ANALYZE is not processed before executing the query, so the
> optimizer does not see the stats, or something like that.

I have seen that happen, but so far I have no idea how it occurs.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hello Horiguchi-san!

On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote:
> Hello, I started to work on this patch.
>
>> attached is v7 of the multivariate stats patch. The main improvement
>> is major refactoring of the clausesel.c portion - splitting the
>> awfully long spaghetti-style functions into smaller pieces, making it
>> much more understandable etc.
>
> Thank you, it looks clearer. I have some comment for the brief look
> at this. This patchset is relatively large so I will comment on
> "per-notice" basis.. which means I'll send comment before examining
> the entire of this patchset. Sorry in advance for the desultory
> comments.

Sure. If you run into something that's not clear enough, I'm happy to 
explain that (I tried to cover all the important details in the 
comments, but it's a large patch, indeed.)

> =======
> General comments:
>
> - You included unnecessary stuffs such like regression.diffs in
>    these patches.

Ahhhh :-/ Will fix.

>
> - Now OID 3307 is used by pg_stat_file. I moved
>    pg_mv_stats_dependencies_info/show to 3311/3312.

Will fix while rebasing to current master.

>
> - Single-variate stats have a mechanism to inject arbitrary
>    values as statistics, that is, get_relation_stats_hook and the
>    similar stuffs. I want the similar mechanism for multivariate
>    statistics, too.

Fair point, although I'm not sure where we should place the hook, how 
exactly it should be defined, and how useful it would be in the end. 
Can you give an example of how you'd use such a hook?

I've never used get_relation_stats_hook, but if I get it right, the 
plugins can use the hook to create the stats (for each column), either 
from scratch or tweaking the existing stats.

I'm not sure how this should work with multivariate stats, though, 
because there can be an arbitrary number of stats for a column, and it 
really depends on all the clauses (so examine_variable() seems a bit 
inappropriate, as it only sees a single variable at a time).

Moreover, with multivariate stats
   (a) there may be an arbitrary number of stats for a column
   (b) only some of the stats end up being used for the estimation

I see two or three possible places for calling such a hook:

   (a) at the very beginning, after fetching the list of stats

       - sees all the existing stats on a table
       - may add entirely new stats or tweak the existing ones

   (b) after collecting the list of variables compatible with
       multivariate stats

       - like (a), and additionally knows which columns are interesting
         for the query (but only with respect to the existing stats)

   (c) after optimization (selection of the right combination of stats)

       - like (b), but can't affect the optimization

But I can't really imagine anyone building multivariate stats on the 
fly, in the hook.

It's more complicated, though, because the query may call 
clauselist_selectivity multiple times, depending on how complex the 
WHERE clauses are.


> 0001:
>
> - I also don't think it is right thing for expression_tree_walker
>    to recognize RestrictInfo since it is not a part of expression.

Yes. In my working git repo, I've reworked this to use the second 
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:

https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?

>
> 0003:
>
> - In clauselist_selectivity, find_stats is uselessly called for
>    single clause. This should be called after the clauselist found
>    to consist more than one clause.

Ok, will fix.

>
> - Searching vars to be compared with mv-stat columns which
>    find_stats does should stop at disjunctions. But this patch
>    doesn't behave so and it should be an unwanted behavior. The
>    following steps shows that.

Why should it stop at disjunctions? There's nothing wrong with using 
multivariate stats to estimate OR-clauses, IMHO.

>
> ====
>   =# CREATE TABLE t1 (a int, b int, c int);
>   =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
>   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>   =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>   =# ANALZYE t1;
>   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
> ====
>   Rows changed unwantedly.

That has nothing to do with OR clauses, but rather with using a type of 
statistics that does not fit the data and queries. Histograms are quite 
inaccurate for discrete data and equality conditions - in this case the 
clauses probably match one bucket, and so we use 1/2 the bucket as an 
estimate. There's nothing wrong with that.

So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
 
(2 rows)

>   It seems not so simple thing as your code assumes.

Maybe, but I don't see what assumption is invalid? I see nothing wrong 
with the previous query.

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hi, Tomas. I'll kick the gas pedal.

> > Thank you, it looks clearer. I have some comment for the brief look
> > at this. This patchset is relatively large so I will comment on
> > "per-notice" basis.. which means I'll send comment before examining
> > the entire of this patchset. Sorry in advance for the desultory
> > comments.
> 
> Sure. If you run into something that's not clear enough, I'm happy to
> explain that (I tried to cover all the important details in the
> comments, but it's a large patch, indeed.)


> > - Single-variate stats have a mechanism to inject arbitrary
> >    values as statistics, that is, get_relation_stats_hook and the
> >    similar stuffs. I want the similar mechanism for multivariate
> >    statistics, too.
> 
> Fair point, although I'm not sure where should we place the hook, how
> exactly should it be defined and how useful that would be in the
> end. Can you give an example of how you'd use such hook?

It's my secret, but it is out in the open :p. This is crucial for us
for examining the many planner-related problems that occur at our
customers, in vitro.

http://pgdbmsstats.osdn.jp/pg_dbms_stats-en.html

# Mmm, this doc is a bit too old..

One of our tools does something like the following:

- Copy pg_statistic and some attributes of pg_class into some table. Of course this is exportable.

- For example, in examine_simple_variable, use the hook get_relation_stats_hook to inject the saved statistics in place
  of the real statistics.
 

The hook point is placed where the parameters specifying which
statistics are needed are available in a compact shape, and all the
hook function has to do is return the corresponding statistics
values.

So the parallel stuff for this mv stats will look like this.

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid);

or 

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid, <bitmap or list of attnos>);

So by simply applying this, the current clauselist_selectivity
code would turn into something like the following.

> if (list_length(clauses) == 1)
>    return clause_selectivity(....);
> 
> Index varrelid = find_singleton_relid(root, clauses, varRelid);
> 
> if (varrelid)
> {
> // Bitmapset attnums = collect_attnums(root, clauses, varrelid);
>   if (get_mv_statistics_hook)
>     stats = get_mv_statistics_hook(root, varrelid /*, attnums */);
>   else
>     statis = get_mv_statistics(root, varrelid /*, attnums*/);
> 
>   ....

In comparison to single-column statistics, it might be preferable to
keep the statistics values separate from the definition.

> I've never used get_relation_stats_hook, but if I get it right, the
> plugins can use the hook to create the stats (for each column), either
> from scratch or tweaking the existing stats.

Mostly existing stats without change. I have seen a few hackers who
wanted to provide predefined statistics for typical cases. I haven't
seen anyone who tweaks existing stats.

> I'm not sure how this should work with multivariate stats, though,
> because there can be arbitrary number of stats for a column, and it
> really depends on all the clauses (so examine_variable() seems a bit
> inappropriate, as it only sees a single variable at a time).

Restriction clauses are not a problem. What is needed to replace
the stats values is to define a few APIs to retrieve them, and to
retrieve the stats values only in ways that are compatible with
those APIs. As an extreme case it would even be okay to have
substitute views for mv stats, but that is not good.

> Moreover, with multivariate stats
> 
>    (a) there may be arbitrary number of stats for a column
> 
>    (b) only some of the stats end up being used for the estimation
> 
> I see two or three possible places for calling such hook:
> 
>    (a) at the very beginning, after fetching the list of stats
> 
>        - sees all the existing stats on a table
>        - may add entirely new stats or tweak the existing ones

Getting all stats for a table would be okay, but an attnum list can
restrict the possibilities, as in the second form of the example
APIs above. And we can forget about the case of forged or tweaked
stats; that is their problem, not ours.


>    (b) after collecting the list of variables compatible with
>        multivariate stats
> 
>        - like (a) and additionally knows which columns are interesting
>          for the query (but only with respect to the existing stats)

We should carefully design the API to be able to point to the
pertinent stats in every situation. Mv stats are based on the
correlation of multiple columns, so I think a relid and an
attribute list are enough as parameters.

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|    /* This is the stats to be wanted  */

If we can filter the appropriate stats from all the stats using the
clause list, we can definitely build the appropriate parameter
(column set) before retrieving the mv statistics. Isn't that correct?

>    (c) after optimization (selection of the right combination if stats)
> 
>        - like (b), but can't affect the optimization
> 
> But I can't really imagine anyone building multivariate stats on the
> fly, in the hook.
> 
> It's more complicated, though, because the query may call
> clauselist_selectivity multiple times, depending on how complex the
> WHERE clauses are.
> 
> 
> > 0001:
> >
> > - I also don't think it is right thing for expression_tree_walker
> >    to recognize RestrictInfo since it is not a part of expression.
> 
> Yes. In my working git repo, I've reworked this to use the second
> option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:
> 
> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
> 
> Do you think this is the correct solution? If not, how to fix it?

The reason why I think it is not appropriate is that RestrictInfo
is not part of an expression.

Increasing the selectivity of a condition based on column correlation
occurs only for a set of conjunctive clauses. An OR operation
divides the sets. Is that agreeable? RestrictInfos can be nested in
each other, and we should be aware of the AND/OR operators. This
is what expression_tree_walker doesn't do.

Perhaps we should provide a dedicated function, something like
find_conjunctive_attr_set, which does this:

- Check the type of the top expression of the clause.

- If it is a RestrictInfo, check clause_relids, then check the
  clause.

- If it is a boolean OR, stop searching and return an empty set of
  attributes.

- If it is a boolean AND, check the components further. A list of
  RestrictInfos should be treated as an AND connection.

- If it is an operator expression, collect the relids and attrs
  used, walking the expression tree.

I may be missing something, but I think the outline is correct.

In addition to that, we should carefully avoid duplicate corrections
using the same mv statistics.

I haven't understood precisely what choose_mv_statistics does, but I
suppose what this function does could be split into a 'making the
parameter to find stats' part and a 'matching the parameter with
stats in order to retrieve the desired stats' part. Could you
reconstruct this process into that kind of form?

I feel it is too invasive, or excessively intermixed.

> > 0003:
> >
> > - In clauselist_selectivity, find_stats is uselessly called for
> >    single clause. This should be called after the clauselist found
> >    to consist more than one clause.
> 
> Ok, will fix.
> 
> >
> > - Searching vars to be compared with mv-stat columns which
> >    find_stats does should stop at disjunctions. But this patch
> >    doesn't behave so and it should be an unwanted behavior. The
> >    following steps shows that.
> 
> Why should it stop at disjunctions? There's nothing wrong with using
> multivariate stats to estimate OR-clauses, IMHO.

Mv statistics represent how often *every combination of the
column values* occurs. Is that correct? Here "combination" can
be read as "coexists", that is, AND. For example, MV-MCV:

(a, b, c) freq
(1, 2, 3)  100
(1, 2, 5)   50
(1, 3, 8)   20
(1, 7, 2)    5
===============
total      175

| select * from t where a = 1 and b = 2 and c = 3;
| SELECT 100

This is correct,

| select * from t where a = 1 and b = 2 or c = 3;
| SELECT 100

This is *not* correct. The correct number of tuples is 150.
This is a simple example of how OR breaks the MV stats assumption.

> > ====
> >   =# CREATE TABLE t1 (a int, b int, c int);
> >   =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0,
> >   9999) a);
> >   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> >    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
> >   =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
> >   =# ANALZYE t1;
> >   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> >    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
> > ====
> >   Rows changed unwantedly.
> 
> That has nothing to do with OR clauses, but rather with using a type
> of statistics that does not fit the data and queries. Histograms are
> quite inaccurate for discrete data and equality conditions - in this
> case the clauses probably match one bucket, and so we use 1/2 the
> bucket as an estimate. There's nothing wrong with that.
> 
> So let's use MCV instead:

Hmm, the problem is not which specific number is displayed as
rows. What is crucial is the fact that the row estimate has changed
even though it shouldn't have, as I demonstrated above.

> ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
> ANALYZE t1;
> EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>                      QUERY PLAN
> -----------------------------------------------------
>  Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>    Filter: (((a = 1) AND (b = 2)) OR (c = 3))
> (2 rows)
> 
> >   It seems not so simple thing as your code assumes.
> 
> Maybe, but I don't see what assumption is invalid? I see nothing wrong
> with the previous query.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:
> Hi, Tomas. I'll kick the gas pedal.
>
>>> Thank you, it looks clearer. I have some comment for the brief look
>>> at this. This patchset is relatively large so I will comment on
>>> "per-notice" basis.. which means I'll send comment before examining
>>> the entire of this patchset. Sorry in advance for the desultory
>>> comments.
>>
>> Sure. If you run into something that's not clear enough, I'm happy to
>> explain that (I tried to cover all the important details in the
>> comments, but it's a large patch, indeed.)
>
>
>>> - Single-variate stats have a mechanism to inject arbitrary
>>>     values as statistics, that is, get_relation_stats_hook and the
>>>     similar stuffs. I want the similar mechanism for multivariate
>>>     statistics, too.
>>
>> Fair point, although I'm not sure where should we place the hook,
>> how exactly should it be defined and how useful that would be in
>> the end. Can you give an example of how you'd use such hook?

...

>
> We should carefully design the API to be able to point the pertinent
> stats for every situation. Mv stats is based on the correlation of
> multiple columns so I think only relid and attributes list are
> enough as the parameter.
>
> | if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
> |    /* This is the stats to be wanted  */
>
> If we can filter the appropriate stats from all the stats using
> clauselist, we definitely can make the appropriate parameter (column
> set) prior to retrieving mv statistics. Isn't it correct?

Let me briefly explain how the current clauselist_selectivity 
implementation works.
  (1) check if there are multivariate statistics on the table - if not,
      skip the multivariate parts altogether (the point of this is to
      minimize impact on users who don't use the new feature)

  (2) see if there are clauses compatible with multivariate stats - this
      only checks "general compatibility" without actually checking the
      existing stats (the point is to terminate early, if the clauses
      are not compatible somehow - e.g. if the clauses reference only a
      single attribute, use unsupported operators etc.)

  (3) if there are multivariate stats and compatible clauses, the
      function choose_mv_stats tries to find the best combination of
      multivariate stats with respect to the clauses (details later)

  (4) the clauses are estimated using the stats, the remaining clauses
      are estimated using the current statistics (single attribute)
 

The only way to reliably inject new stats is by calling a hook before 
(1), allowing it to arbitrarily modify the list of stats. Based on the 
use cases you provided, I don't think it makes much sense to add 
additional hooks in the other phases.

At this place it's however not yet known which clauses are compatible with 
multivariate stats, or what attributes they are referencing. It might be 
possible to simply call pull_varattnos() and pass it to the hook, except 
that does not work with RestrictInfo :-/

Or maybe we could / should not put the hook into clauselist_selectivity 
but somewhere else? Say, to get_relation_info where we actually read the 
list of stats for the relation?

>>
>>
>>> 0001:
>>>
>>> - I also don't think it is right thing for expression_tree_walker
>>>     to recognize RestrictInfo since it is not a part of expression.
>>
>> Yes. In my working git repo, I've reworked this to use the second
>> option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:
>>
>> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
>>
>> Do you think this is the correct solution? If not, how to fix it?
>
> The reason why I think it is not appropreate is that RestrictInfo
> is not a part of expression.
>
> Increasing selectivity of a condition by column correlation is
> occurs only for a set of conjunctive clauses. OR operation
> devides the sets. Is it agreeable? RestrictInfos can be nested
> each other and we should be aware of the AND/OR operators. This
> is what expression_tree_walker doesn't.

I still don't understand why you think we need to differentiate between 
AND and OR operators. There's nothing wrong with estimating OR clauses 
using multivariate statistics.

>
> Perhaps we should provide the dedicate function such like
> find_conjunctive_attr_set which does this,

Perhaps. The reason why I added support for RestrictInfo into the 
existing walker implementations is that it seemed like the easiest way 
to fix the issue. But if there are reasons why that's incorrect, then 
inventing a new function is probably the right way.

>
> - Check the type top expression of the clause
>
>    - If it is a RestrictInfo, check clause_relids then check
>      clause.
>
>    - If it is a bool OR, stop to search and return empty set of
>      attributes.
>
>    - If it is a bool AND, make further check of the components. A
>      list of RestrictInfo should be treaed as AND connection.
>
>    - If it is operator exression, collect used relids and attrs
>      walking the expression tree.
>
> I should missing something but I think the outline is correct.

As I said before, there's nothing wrong with estimating OR clauses using 
multivariate statistics. So OR and AND should be handled exactly the same.

I think you're missing the fact that it's not enough to look at the 
relids from the RestrictInfo - we need to actually check what clauses 
are used inside, i.e. we need to check the clauses.

That's because only some of the clauses are compatible with multivariate 
stats, and only if all the clauses of the BoolExpr are "compatible" can 
we estimate the clause as a whole. If it's a mix of supported and 
unsupported clauses, we simply pass it to clauselist_selectivity, which 
will repeat the whole process with it.

> Addition to that we should carefully avoid duplicate correction
> using the same mv statistics.

Sure. That's what choose_mv_statistics does.

>
> I haven't understood what choose_mv_satistics precisely but I
> suppose what this function does would be split into the 'making
> parameter to find stats' part and 'matching the parameter with
> stats in order to retrieve desired stats' part. Could you
> reconstruct this process into the form like this?

The goal of choose_mv_statistics is very simple - given a list of 
clauses, it tries to find the best combination of statistics, exploiting 
as much information as possible.

So let's say you have clauses
   WHERE a=1 AND b=1 AND c=1 AND d=1

but you only have statistics on [a,b], [b,c] and [b,c,d].

The simplest approach would be to use the 'largest' statistics, covering 
the most columns from the clauses - in this case [b,c,d]. This is what 
the initial patches do.

The last patch improves this significantly, by combining the statistics 
using conditional probability. In this case it'd probably use all three 
statistics, effectively decomposing the selectivity like this:
  P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1)
                          [a,b]        [b,c]        [b,c,d]

And each of those probabilities can be estimated using one of the stats.
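
Just to make the arithmetic concrete, here is a tiny stand-alone C sketch 
of that decomposition. The selectivities are made-up numbers, purely for 
illustration - they are not taken from any real statistics:

    #include <stdio.h>

    int main(void)
    {
        /* made-up selectivities, one per statistics object */
        double p_ab   = 0.010;   /* P(a=1,b=1)      from stats on [a,b]   */
        double p_c_b  = 0.500;   /* P(c=1|b=1)      from stats on [b,c]   */
        double p_d_bc = 0.800;   /* P(d=1|b=1,c=1)  from stats on [b,c,d] */

        /* P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1) */
        double p_abcd = p_ab * p_c_b * p_d_bc;

        printf("combined selectivity = %.4f\n", p_abcd);   /* 0.0040 */
        return 0;
    }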


> I feel it is too invasive, or excessively intermix(?)ed.

I don't think it really fits your model - the hook has to be called much 
sooner, effectively at the very beginning of the clauselist_selectivity 
or even before that. Otherwise it might not get called at all (e.g. if 
there are no multivariate stats on the table, this whole part will be 
skipped).

>> Why should it stop at disjunctions? There's nothing wrong with using
>> multivariate stats to estimate OR-clauses, IMHO.
>
> Mv statistics represents how often *every combination of the
> column values* occurs. Is it correct? Where the combination can
> be replaced with coexists, that is AND. For example MV-MCV.
>
> (a, b, c) freq
> (1, 2, 3)  100
> (1, 2, 5)   50
> (1, 3, 8)   20
> (1, 7, 2)    5
> ===============
> total      175
>
> | select * from t where a = 1 and b = 2 and c = 3;
> | SELECT 100
>
> This is correct,
>
> | select * from t where a = 1 and b = 2 or c = 3;
> | SELECT 100
>
> This is *not* correct. The correct number of tuples is 150.
> This is a simple example where OR breaks MV stats assumption.

No, it does not.

I'm not sure where the numbers are coming from, though. So let's see how 
this actually works with multivariate statistics. I'll create a table 
with the 4 combinations you used in your example, but with 1000x more 
rows, to make the estimates a bit more accurate:
   CREATE TABLE t (a INT, b INT, c INT);

   INSERT INTO t SELECT 1, 2, 3 FROM generate_series(1,100000);
   INSERT INTO t SELECT 1, 2, 5 FROM generate_series(1,50000);
   INSERT INTO t SELECT 1, 3, 8 FROM generate_series(1,20000);
   INSERT INTO t SELECT 1, 7, 2 FROM generate_series(1,5000);

   ALTER TABLE t ADD STATISTICS (mcv) ON (a,b,c);
   ANALYZE t;

And now let's see the two queries:

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=100403 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=150103 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

So the first query estimates 100k rows, the second one 150k rows. 
Exactly as expected, because MCV lists are discrete, match the data 
perfectly, and behave exactly like your mental model.
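
Conceptually, an MCV-based estimate just sums the frequencies of the MCV 
items satisfying the predicate, and that works the same way for AND and 
OR. A small stand-alone C sketch of that idea (the struct and the 
hard-coded items are only illustrative, this is not the patch's code):

    #include <stdio.h>
    #include <stdbool.h>

    /* one item of a multivariate MCV list on (a,b,c) - illustrative only */
    typedef struct { int a, b, c; double frequency; } mcv_item;

    int main(void)
    {
        /* the four combinations, frequencies relative to 175k rows */
        mcv_item mcv[] = {
            {1, 2, 3, 100000.0 / 175000},
            {1, 2, 5,  50000.0 / 175000},
            {1, 3, 8,  20000.0 / 175000},
            {1, 7, 2,   5000.0 / 175000},
        };

        double sel_and = 0.0, sel_or = 0.0;

        for (int i = 0; i < 4; i++)
        {
            bool ab = (mcv[i].a == 1 && mcv[i].b == 2);
            bool c3 = (mcv[i].c == 3);

            if (ab && c3) sel_and += mcv[i].frequency; /* a=1 AND b=2 AND c=3 */
            if (ab || c3) sel_or  += mcv[i].frequency; /* a=1 AND b=2 OR c=3  */
        }

        /* prints ~100000 and ~150000 matching rows */
        printf("AND: %.0f rows, OR: %.0f rows\n",
               sel_and * 175000, sel_or * 175000);
        return 0;
    }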

If you try this with histograms though, you'll get the same estimate in 
both cases:
    ALTER TABLE t DROP STATISTICS ALL;
    ALTER TABLE t ADD STATISTICS (histogram) ON (a,b,c);
    ANALYZE t;

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

That's unfortunate, but it has nothing to do with some assumptions of 
multivariate statistics. The "problem" is that histograms are naturally 
fuzzy, and both conditions hit the same bucket.

The solution is simple - don't use histograms for such discrete data.


>>> ====
>>>    =# CREATE TABLE t1 (a int, b int, c int);
>>>    =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0,
>>>    9999) a);
>>>    =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>     Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>>>    =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>>>    =# ANALZYE t1;
>>>    =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>     Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
>>> ====
>>>    Rows changed unwantedly.
>>
>> That has nothing to do with OR clauses, but rather with using a
>> type of statistics that does not fit the data and queries.
>> Histograms are quite inaccurate for discrete data and equality
>> conditions - in this case the clauses probably match one bucket,
>> and so we use 1/2 the bucket as an estimate. There's nothing wrong
>> with that.
>>
>> So let's use MCV instead:
>
> Hmm, it's not a problem what specific number is displayed as
> rows. What is crucial is the fact that rows has changed even
> though it shouldn't have changed. As I demonstrated above.

Again, that has nothing to do with any assumptions, and it certainly 
does not demonstrate that OR clauses should not be handled by 
multivariate statistics.

In this case, you're observing two effects.
  (1) Natural inaccuracy of histograms when used for discrete data,
      especially in combination with equality conditions (because
      that's impossible to estimate accurately with histograms).

  (2) The original estimate (without multivariate statistics) is only
      seemingly accurate, because it falsely assumes independence.
      It simply assumes that each condition matches 1/10000 of the
      table, and multiplies that, getting ~0.00001 row estimate. This
      is rounded up to 1, which is accidentally the exact value.

Let me demonstrate this on two examples - one with discrete data, one 
with continuous distribution.

1) discrete data
    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t SELECT i/1000, 2*(i/1000), 3*(i/1000)
                    FROM generate_series(1, 1000000) s(i);
    ANALYZE t;
 
    -- no multivariate stats (so assumption of independence)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
                   (actual time=0.290..59.120 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                   (actual time=0.434..117.643 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
    Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                   (actual time=0.433..96.956 rows=2000 loops=1)

    -- now let's add a histogram

    ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
    ANALYZE t;

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=817 width=12)
                   (actual time=0.268..116.318 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                   (actual time=0.435..93.232 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
    Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                   (actual time=0.434..122.930 rows=2000 loops=1)

    -- now let's use a MCV list

    ALTER TABLE t DROP STATISTICS ALL;
    ALTER TABLE t ADD STATISTICS (mcv) on (a,b,c);
    ANALYZE t;

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                   (actual time=0.268..70.604 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
    Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                   (actual time=0.268..70.604 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
    Seq Scan on t  (cost=0.00..22906.00 rows=1767 width=12)
                   (actual time=0.428..100.607 rows=2000 loops=1)

The default estimate of the AND query is rather bad. For the OR clauses 
it's not that bad (OR selectivities are less sensitive to the column 
dependency, but it's not difficult to construct counter-examples).

The histogram is not that good - for the OR queries it often results in 
over-estimates (for equality conditions on discrete data).

But the MCV estimates are very accurate. The slight under-estimate is 
probably caused by the block sampling we're using to get sample rows.


2) continuous data (I'll only show histograms)

CREATE TABLE t (a FLOAT, b FLOAT, c FLOAT);
INSERT INTO t SELECT r,
                     r + r*(random() - 0.5)/2,
                     r + r*(random() - 0.5)/2
                FROM (SELECT random() as r
                        FROM generate_series(1,1000000)) foo;
ANALYZE t;

-- no multivariate stats
EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=28768 width=24)
               (actual time=0.026..323.383 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=372362 width=24)
               (actual time=0.026..375.005 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=192979 width=24)
               (actual time=0.026..431.376 rows=393528 loops=1)

-- histograms
ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
               (actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
               (actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
               (actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be 
lower than the estimate for the first query (with just AND), and it 
should not increase when increasing the boundary. I'd bet this is a bug 
in how the inequalities are handled with histograms, or how the AND/OR 
clauses are combined. I'll look into that.

But once again, there's nothing that would make OR clauses somehow 
incompatible with multivariate stats.


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hello Horiguchi-san!

On 07/07/2015 09:43 PM, Tomas Vondra wrote:
> -- histograms
> ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
> ANALYZE t;
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
> Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
>                 (actual time=0.021..330.487 rows=273897 loops=1)
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
> Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
>                 (actual time=0.027..906.321 rows=966870 loops=1)
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
> Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
>                 (actual time=0.028..452.494 rows=393528 loops=1)
>
> This seems wrong, because the estimate for the OR queries should not be
> lower than the estimate for the first query (with just AND), and it
> should not increase when increasing the boundary. I'd bet this is a bug
> in how the inequalities are handled with histograms, or how the AND/OR
> clauses are combined. I'll look into that.

FWIW this was a stupid bug in update_match_bitmap_histogram(), which 
initially handled only AND clauses, and thus assumed the "match" of a 
bucket can only decrease. But for OR clauses this is exactly the 
opposite (we assume no buckets match and add buckets matching at least 
one of the clauses).
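
In other words, when processing one clause after another, the per-bucket 
"match" flags may only be cleared for AND, and may only be set for OR. 
A minimal stand-alone C illustration of that difference (purely a 
sketch, not the actual update_match_bitmap_histogram() code):

    #include <stdio.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Fold the matches of one more clause into the running per-bucket
     * bitmap: for AND a bucket stays matched only if it also matches the
     * new clause, for OR it becomes matched once it matches any clause. */
    static void
    update_matches(bool *matches, const bool *clause, size_t nbuckets,
                   bool is_or)
    {
        for (size_t i = 0; i < nbuckets; i++)
            matches[i] = is_or ? (matches[i] || clause[i])  /* only grows   */
                               : (matches[i] && clause[i]); /* only shrinks */
    }

    int main(void)
    {
        bool and_matches[4] = {true, true, true, true};     /* AND: all match  */
        bool or_matches[4]  = {false, false, false, false}; /* OR: none match  */
        bool clause_a[4]    = {true, true, false, false};
        bool clause_b[4]    = {false, true, true, false};

        update_matches(and_matches, clause_a, 4, false);
        update_matches(and_matches, clause_b, 4, false); /* only bucket 1 left */

        update_matches(or_matches, clause_a, 4, true);
        update_matches(or_matches, clause_b, 4, true);   /* buckets 0..2 match */

        for (int i = 0; i < 4; i++)
            printf("bucket %d: AND=%d OR=%d\n",
                   i, and_matches[i], or_matches[i]);
        return 0;
    }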

With this fixed, the estimates look like this:

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
               (actual time=0.102..321.524 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=319400 width=24)
               (actual time=0.103..386.089 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=956833 width=24)
               (actual time=0.133..908.455 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=393633 width=24)
               (actual time=0.105..440.607 rows=393528 loops=1)

IMHO pretty accurate estimates - no issue with OR clauses.

I've pushed this to github [1] but I need to do some additional fixes. I 
also had to remove some optimizations while fixing this, and will have 
to reimplement those.

That's not to say that the handling of OR-clauses is perfectly correct. 
After looking at clauselist_selectivity_or(), I believe it's a bit 
broken and will need a bunch of fixes, as explained in the FIXMEs I 
pushed to github.

[1] https://github.com/tvondra/postgres/tree/mvstats

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hi, thanks for the detailed explanation. I misunderstood the
code (more honestly speaking, I didn't look so closely at it). Then I
looked at it more closely.


At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<559C76D4.2030805@2ndquadrant.com>
> FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> initially handled only AND clauses, and thus assumed the "match" of a
> bucket can only decrease. But for OR clauses this is exactly the
> opposite (we assume no buckets match and add buckets matching at least
> one of the clauses).
> 
> With this fixed, the estimates look like this:
> 

> IMHO pretty accurate estimates - no issue with OR clauses.

Ok, I understood the difference between what I thought and what
you say. The code is actually conscious of OR clauses but it looks
somewhat confused.

Currently choosing mv stats in clauselist_selectivity can be
outlined as follows:

1. find_stats finds candidate mv stats containing *all*
   attributes appearing in the whole clauses regardless of and/or
   exprs, by walking the whole clause tree.

   Perhaps this is the measure for early bailout.

2.1. Within every disjunction element, collect mv-related
   attributes while checking whether all the leaf nodes (binop
   or ifnull) are compatible, by (eventually) walking the whole
   clause tree.

2.2. Check if all the collected attributes are contained in
   mv-stats columns.

3. Finally, clauseset_mv_selectivity_histogram() (and others).

   This function applies every ExprOp onto every attribute in
   every histogram bucket and (tries to) make the boolean
   operation of the result bitmaps.

I have some comments on the implementation and I will also try to find
solutions for them.


1. The flow above looks like it is doing very similar things repeatedly.

2. I believe what the current code does can be simplified.

3. As you mentioned in comments, some additional infrastructure is
   needed.

After all, I think what we should do after this are as follows,
as the first step.

- Add the means to judge the selectivity operator(?) by other
  than oprrest of the op of ExprOp. (You missed neqsel already)

  I suppose one solution for this is adding oprmvstats taking
  'm', 'h' and 'f' and their combinations. Or for the
  convenience, it would be a fixed-length string like this.

  oprname | oprmvstats
  =       | 'mhf'
  <>      | 'mhf'
  <       | 'mh-'
  >       | 'mh-'
  >=      | 'mh-'
  <=      | 'mh-'

  This would make the code in clause_is_mv_compatible like this.

  > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
  > if (oprmvstats & types)
  > {
  >    *attnums = bms_add_member(*attnums, var->varattno);
  >    return true;
  > }
  > return false;
 

- Current design just manage to work but it is too complicated
  and hardly have affinity with the existing estimation
  framework. I proposed separation of finding stats phase and
  calculation phase, but I would like to propose transforming
  RestrictInfo (and finding mvstat) phase and running the
  transformed RestrictInfo phase after looking close to the
  patch.

  I think transforing RestrictInfo makes the situnation
  better. Since it nedds different information, maybe it is
  better to have new struct, say, RestrictInfoForEstimate
  (boo!). Then provide mvstatssel() to use in the new struct.
  The rough looking of the code would be like below.

  clauselist_selectivity()
  {
    ...
    RestrictInfoForEstmate *esclause =
      transformClauseListForEstimation(root, clauses, varRelid);
    ...

    return clause_selectivity(esclause):
  }

  clause_selectivity(RestrictInfoForEstmate *esclause)
  {
    if (IsA(clause, RestrictInfo))...
    if (IsA(clause, RestrictInfoForEstimate))
    {
       RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
       if (ecl->selfunc)
       {
          sx = ecl->selfunc(root, ecl);
       }
    }
    if (IsA(clause, Var))...
  }

  transformClauseListForEstimation(...)
  {
    ...

    relid = collect_mvstats_info(root, clause, &attlist);
    if (!relid) return;
    if (get_mvstats_hook)
         mvstats = (*get_mvstats_hoook) (root, relid, attset);
    else
         mvstats = find_mv_stats(root, relid, attset))
  }
  ...

> I've pushed this to github [1] but I need to do some additional
> fixes. I also had to remove some optimizations while fixing this, and
> will have to reimplement those.
> 
> That's not to say that the handling of OR-clauses is perfectly
> correct. After looking at clauselist_selectivity_or(), I believe it's
> a bit broken and will need a bunch of fixes, as explained in the
> FIXMEs I pushed to github.
> 
> [1] https://github.com/tvondra/postgres/tree/mvstats

I don't see whether it is doable or not, and I suppose you're
unwilling to change the big picture, so I will consider the idea
and will show you the result, if it turns out to be possible and
promising.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote:
>
> Ok, I understood the diferrence between what I thought and what you
> say. The code is actually concious of OR clause but is looks somewhat
> confused.

I'm not sure which part is confused by the OR clauses, but it's 
certainly possible. Initially it only handled AND clauses, and the 
support for OR clauses was added later, so it's possible some parts are 
not behaving correctly.

>
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
>
> 1. find_stats finds candidate mv stats containing *all*
>     attributes appeared in the whole clauses regardless of and/or
>     exprs by walking whole the clause tree.
>
>     Perhaps this is the measure to early bailout.

Not entirely. The goal of find_stats() is to look up all stats on the 
'current' relation - it's coded the way it is because I had to deal with 
varRelid=0 cases, in which case I have to inspect the Var nodes. But 
maybe I got this wrong and there's a much simpler way to do that?

It is an early bailout in the sense that if there are no multivariate 
stats defined on the table, there's no point in doing any of the 
following steps. So that we don't increase planning times for users not 
using multivariate stats.

> 2.1. Within every disjunction elements, collect mv-related
>     attributes while checking whether the all leaf nodes (binop or
>     ifnull) are compatible by (eventually) walking whole the
>     clause tree.

Generally, yes. The idea is to check whether there are clauses that 
might be estimated using multivariate statistics, and whether the 
clauses reference at least two different attributes. Imagine a query 
like this:
   SELECT * FROM t WHERE (a=1) AND (a>0) AND (a<100)

It makes no sense to process this using multivariate statistics, because 
all the Var nodes reference a single attribute.

Similarly, the check is not just about the leaf nodes - to be able to 
estimate a clause at this point, we have to be able to process the whole 
tree, starting from the top-level clause. Although maybe that's no 
longer true, now that support for OR clauses was added ... I wonder 
whether there are other BoolExpr-like nodes, that might make the tree 
incompatible with multivariate statistics (in the sense that the current 
implementation does not know how to handle them).

Also note that even though the clause may be "incompatible" at this 
level, it may get partially processed by multivariate statistics later. 
For example with a query:
   SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4)

the first clause is "incompatible" because it contains the unsupported 
operator '~*', but it will eventually be processed as a BoolExpr node, 
and should be split into two parts - (a=1 OR b=2) which is compatible, 
and (c ~* 'xyz') which is incompatible.

This split should happen in clauselist_selectivity_or(), and the other 
thing that may be interesting is that it uses (q=1 OR r=4) as a 
condition. So if there's a statistics built on (a,b,q,r) we'll compute 
conditional probability
    P(a=1,b=2 | q=1,r=4)

>> 2.2. Check if all the collected attribute are contained in
>>      mv-stats columns.

No, I think you got this wrong. We do not check that *all* the 
attributes are contained in mvstats columns - we only need two such 
columns (then there's a chance that the multivariate statistics will get 
applied).

Anyway, both 2.1 and 2.2 are meant as a quick bailout, before doing the 
most expensive part, which is choose_mv_statistics(). Which is however 
missing in this list.

> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
>
>     This funciton applies every ExprOp onto every attribute in
>     every histogram backes and (tries to) make the boolean
>     operation of the result bitmaps.

Yes, but this only happens after choose_mv_statistics(), because that's 
the code that decides which statistics will be used and in what order.

The list is also missing handling of the 'functional dependencies', so a 
complete list of steps would look like this:

1) find_stats - lookup stats on the current relation (from RelOptInfo)

2) apply functional dependencies
   a) check if there are equality clauses that may be reduced using
      functional dependencies, referencing at least two columns

   b) if yes, perform the clause reduction

3) apply MCV lists and histograms
   a) check if there are clauses 'compatible' with those types of
      statistics, again containing at least two columns

   b) if yes, use choose_mv_statistics() to decide which statistics to
      apply and in which order

   c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics
   a) this is where the clauselist_mv_selectivity_histogram and others
      are called

I tried to explain this in the comment before clauselist_selectivity(), 
but maybe it's not detailed enough / missing some important details.
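
For a very rough feel of how the steps combine numerically, here is a toy 
stand-alone C example. All the selectivities and the functional 
dependency are invented for illustration - this is not the patch's code, 
just the shape of the computation:

    #include <stdio.h>

    int main(void)
    {
        double rows = 1000000.0;

        /* clauses: a=1 AND b=2 AND c=3 AND d=4, with stats only on (a,b,c) */

        /* 2) functional dependencies: suppose a determines b, so the
         *    clause b=2 is redundant and gets reduced (dropped)            */

        /* 3) the MCV list / histogram on (a,b,c) estimates the remaining
         *    covered clauses (a=1 AND c=3) together                        */
        double sel_mv = 0.001;      /* made-up multivariate estimate        */

        /* 4) the clause d=4 is not covered by any statistics, so it is
         *    estimated the regular way and combined under independence     */
        double sel_d = 0.01;        /* made-up per-column estimate          */

        printf("estimated rows: %.0f\n", rows * sel_mv * sel_d);   /* 10 */
        return 0;
    }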

>
> I have some comments on the implement and I also try to find the
> solution for them.
>
>
> 1. The flow above looks doing  very similiar thins repeatedly.

I worked hard to remove such code duplication, and believe all the 
current steps are necessary - for example 2(a) and 3(a) may seem 
similar, but it's really necessary to do that twice.

>
> 2. I believe what the current code does can be simplified.

Possibly.

>
> 3. As you mentioned in comments, some additional infrastructure
>     needed.
>
> After all, I think what we should do after this are as follows,
> as the first step.

OK.

>
> - Add the means to judge the selectivity operator(?) by other
>    than oprrest of the op of ExprOp. (You missed neqsel already)

Yes, the way we use 'oprno' to determine how to estimate the selectivity
is a bit awkward. It's inspired by handling of range queries, and having
something better would be nice.

But I don't think this is the reason why I missed neqsel, and I don't
see this as a significant design issue at this point. But if we can come
up with a better solution, why not ...

>    I suppose one solution for this is adding oprmvstats taking
>    'm', 'h' and 'f' and their combinations. Or for the
>    convenience, it would be a fixed-length string like this.
>
>    oprname | oprmvstats
>    =       | 'mhf'
>    <>      | 'mhf'
>    <       | 'mh-'
>    >       | 'mh-'
>    >=      | 'mh-'
>    <=      | 'mh-'
>
>    This would make the code in clause_is_mv_compatible like this.
>
>    > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
>    > if (oprmvstats & types)
>    > {
>    >    *attnums = bms_add_member(*attnums, var->varattno);
>    >    return true;
>    > }
>    > return false;

So this only determines the compatibility of operators with respect to 
different types of statistics? How does that solve the neqsel case? It 
will probably decide the clause is compatible, but it will later fail at 
the actual estimation, no?

>
> - Current design just manage to work but it is too complicated
>    and hardly have affinity with the existing estimation
>    framework.

I respectfully disagree. I've strived to make it as affine to the 
current implementation as possible - maybe it's possible to improve 
that, but I believe there's a natural difference between the two types 
of statistics. It may be somewhat simplified, but it will never be 
exactly the same.

>    I proposed separation of finding stats phase and
>    calculation phase, but I would like to propose transforming
>    RestrictInfo(and finding mvstat) phase and running the
>    transformed RestrictInfo phase after looking close to the
>    patch.

Those phases are already separated, as is illustrated by the steps 
explained above.

So technically we might place a hook either right after the find_stats() 
call, so that it's possible to process all the stats on the table, or 
maybe after the choose_mv_statistics() call, so that we only process the 
actually used stats.

>
>    I think transforing RestrictInfo makes the situnation
>    better. Since it nedds different information, maybe it is
>    better to have new struct, say, RestrictInfoForEstimate
>    (boo!). Then provide mvstatssel() to use in the new struct.
>    The rough looking of the code would be like below.
>
>    clauselist_selectivity()
>    {
>      ...
>      RestrictInfoForEstmate *esclause =
>        transformClauseListForEstimation(root, clauses, varRelid);
>      ...
>
>      return clause_selectivity(esclause):
>    }
>
>    clause_selectivity(RestrictInfoForEstmate *esclause)
>    {
>      if (IsA(clause, RestrictInfo))...
>      if (IsA(clause, RestrictInfoForEstimate))
>      {
>         RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
>         if (ecl->selfunc)
>         {
>            sx = ecl->selfunc(root, ecl);
>         }
>      }
>      if (IsA(clause, Var))...
>    }
>
>
>    transformClauseListForEstimation(...)
>    {
>      ...
>
>      relid = collect_mvstats_info(root, clause, &attlist);
>      if (!relid) return;
>      if (get_mvstats_hook)
>           mvstats = (*get_mvstats_hoook) (root, relid, attset);
>      else
>           mvstats = find_mv_stats(root, relid, attset))
>    }
>    ...

So you'd transform the clause tree first, replacing parts of the tree 
(to be estimated by multivariate stats) by a new node type? That's an 
interesting idea, I think ...

I can't really say whether it's a good approach, though. Can you explain 
why you think it'd make the situation better?

The one benefit I can think of is being able to look at the processed 
tree and see which parts will be estimated using multivariate stats.

But we'd effectively have to do the same stuff (choosing the stats, 
...), and if we move this pre-processing before clauselist_selectivity 
(I assume that's the point), we'd end up repeating a lot of the code. Or 
maybe not, I'm not sure.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.

The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,
 - clause to be estimated by current logic (MV is not applicable) - clause to be estimated by MV-staistics. - list of
childRestrictStatDatas, which are to be run   recursively.
 

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.

What do you think about this?

regards,


> Hi, Thanks for the detailed explaination. I misunderstood the
> code (more honest speaking, din't look so close there). Then I
> looked it closer.
> 
> 
> At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<559C76D4.2030805@2ndquadrant.com>
> > FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> > initially handled only AND clauses, and thus assumed the "match" of a
> > bucket can only decrease. But for OR clauses this is exactly the
> > opposite (we assume no buckets match and add buckets matching at least
> > one of the clauses).
> > 
> > With this fixed, the estimates look like this:
> > 
> 
> > IMHO pretty accurate estimates - no issue with OR clauses.
> 
> Ok, I understood the diferrence between what I thought and what
> you say. The code is actually concious of OR clause but is looks
> somewhat confused.
> 
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
> 
> 1. find_stats finds candidate mv stats containing *all*
>    attributes appeared in the whole clauses regardless of and/or
>    exprs by walking whole the clause tree.
> 
>    Perhaps this is the measure to early bailout.
> 
> 2.1. Within every disjunction elements, collect mv-related
>    attributes while checking whether the all leaf nodes (binop or
>    ifnull) are compatible by (eventually) walking whole the
>    clause tree.
> 
> 2.2. Check if all the collected attribute are contained in
>    mv-stats columns.
> 
> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
> 
>    This funciton applies every ExprOp onto every attribute in
>    every histogram backes and (tries to) make the boolean
>    operation of the result bitmaps.
> 
> I have some comments on the implement and I also try to find the
> solution for them.
> 
> 
> 1. The flow above looks doing  very similiar thins repeatedly.
> 
> 2. I believe what the current code does can be simplified.
> 
> 3. As you mentioned in comments, some additional infrastructure
>    needed.
> 
> After all, I think what we should do after this are as follows,
> as the first step.
> 
> - Add the means to judge the selectivity operator(?) by other
>   than oprrest of the op of ExprOp. (You missed neqsel already)
> 
>   I suppose one solution for this is adding oprmvstats taking
>   'm', 'h' and 'f' and their combinations. Or for the
>   convenience, it would be a fixed-length string like this.
> 
>   oprname | oprmvstats
>   =       | 'mhf'
>   <>      | 'mhf'
>   <       | 'mh-'
>   >       | 'mh-'
>   >=      | 'mh-'
>   <=      | 'mh-'
> 
>   This would make the code in clause_is_mv_compatible like this.
> 
>   > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
>   > if (oprmvstats & types)
>   > {
>   >    *attnums = bms_add_member(*attnums, var->varattno);
>   >    return true;
>   > }
>   > return false;
> 
> - Current design just manage to work but it is too complicated
>   and hardly have affinity with the existing estimation
>   framework. I proposed separation of finding stats phase and
>   calculation phase, but I would like to propose transforming
>   RestrictInfo(and finding mvstat) phase and running the
>   transformed RestrictInfo phase after looking close to the
>   patch.
> 
>   I think transforing RestrictInfo makes the situnation
>   better. Since it nedds different information, maybe it is
>   better to have new struct, say, RestrictInfoForEstimate
>   (boo!). Then provide mvstatssel() to use in the new struct.
>   The rough looking of the code would be like below. 
> 
>   clauselist_selectivity()
>   {
>     ...
>     RestrictInfoForEstmate *esclause =
>       transformClauseListForEstimation(root, clauses, varRelid);
>     ...
> 
>     return clause_selectivity(esclause):
>   }
> 
>   clause_selectivity(RestrictInfoForEstmate *esclause)
>   {
>     if (IsA(clause, RestrictInfo))...
>     if (IsA(clause, RestrictInfoForEstimate))
>     {
>        RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
>        if (ecl->selfunc)
>        {
>           sx = ecl->selfunc(root, ecl);
>        }
>     }
>     if (IsA(clause, Var))...
>   }
> 
>   
>   transformClauseListForEstimation(...)
>   {
>     ...
> 
>     relid = collect_mvstats_info(root, clause, &attlist);
>     if (!relid) return;
>     if (get_mvstats_hook)
>          mvstats = (*get_mvstats_hoook) (root, relid, attset);
>     else
>          mvstats = find_mv_stats(root, relid, attset))
>   }
>   ...
> 
> > I've pushed this to github [1] but I need to do some additional
> > fixes. I also had to remove some optimizations while fixing this, and
> > will have to reimplement those.
> > 
> > That's not to say that the handling of OR-clauses is perfectly
> > correct. After looking at clauselist_selectivity_or(), I believe it's
> > a bit broken and will need a bunch of fixes, as explained in the
> > FIXMEs I pushed to github.
> > 
> > [1] https://github.com/tvondra/postgres/tree/mvstats
> 
> I don't see whether it is doable or not, and I suppose you're
> unwilling to change the big picture, so I will consider the idea
> and will show you the result, if it turns out to be possible and
> promising.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> Hi, I'd like to show you the modified constitution of
> multivariate statistics application logic. Please find the
> attached. They apply on your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on 
github as usual:
    https://github.com/tvondra/postgres/commits/mvstats

> The code to find mv-applicable clause is moved out of the main
> flow of clauselist_selectivity. As I said in the previous mail,
> the new function transformRestrictInfoForEstimate (too bad name
> but just for PoC:) scans clauselist and generates
> RestrictStatsData struct which drives mv-aware selectivity
> calculation. This struct isolates MV and non-MV estimation.
>
> The struct RestrictStatData mainly consists of the following
> three parts,
>
>    - clause to be estimated by current logic (MV is not applicable)
>    - clause to be estimated by MV-staistics.
>    - list of child RestrictStatDatas, which are to be run
>      recursively.
>
> mvclause_selectivty() is the topmost function where mv stats
> works. This structure effectively prevents main estimation flow
> from being broken by modifying mvstats part. Although I haven't
> measured but I'm positive the code is far reduced from yours.
>
> I attached two patches to this message. The first one is to
> rebase v7 patch to current(maybe) master and the second applies
> the refactoring.
>
> I'm a little anxious about performance but I think this makes the
> process to apply mv-stats far clearer. Regtests for mvstats
> succeeded asis except for fdep, which is not implememted in this
> patch.
>
> What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding how 
exactly the code works - there are pretty much no comments explaining 
the implementation, so it takes time to understand the code. This is 
especially true about transformRestrictInfoForEstimate which is also 
quite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into two 
phases - enrich the expression tree with nodes with details about stats 
etc, and then actually do the estimation in the second phase might be 
interesting. Not because it's somehow clearer, but because it gives us a 
chance to see the expression tree as a whole, with details about all the 
stats (with the current code we process/estimate the tree 
incrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. I 
know it's a PoC at this point, so I don't expect it to be perfect, but 
for me the original code is certainly clearer. Of course, I'm biased as 
I wrote the current code, and I (naturally) shaped it to match my ideas 
during the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, I 
think. Is that merely to make the PoC simpler, or is there something 
that makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the code 
that combined multiple stats (and selected the best combination of 
stats). In other words, you've reverted to the intermediate single 
statistics approach, including removing the improved handling of OR 
clauses and conditions. It's a bit difficult to judge the proposed 
approach not knowing how well it supports those (quite crucial) 
features. What if it can't support some of them, or what if it makes the 
code much more complicated (thus defeating the goal of making it more 
clear)?

I share your concern about the performance impact - one thing is that 
this new code might be slower than the original one, but a more serious 
issue IMHO is that the performance impact will happen even for relations 
with no multivariate stats at all. The original patch was very careful 
about getting ~0% overhead in such cases, and if the new code does not 
allow that, I don't see this approach as acceptable. We must not put 
additional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebase 
it to the current patch version (as on github) and add the missing 
pieces (functional dependencies, multi-statistics estimation and passing 
conditions)?

One more thing - I noticed you extended the pg_operator catalog with a 
oprmvstat attribute, used to flag operators that are compatible with 
multivariate stats. I'm not happy with the current approach (using 
oprrest to do this decision), but I'm not really sure this is a good 
solution either. The culprit is that it only answers one of the two 
important questions - Is it compatible? How to perform the estimation?

So we'd have to rely on oprrest anyway, when actually performing the 
estimation of a clause with "compatible" operator. And we'd have to keep 
in sync two places (catalog and checks in file), and we'd have to update 
the catalog after improving the implementation (adding support for 
another operator).


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<55B3FB0B.7000201@2ndquadrant.com>
> Hi,
> 
> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> > Hi, I'd like to show you the modified constitution of
> > multivariate statistics application logic. Please find the
> > attached. They apply on your v7 patch.
> 
> Sadly I do have some trouble getting it to apply correctly :-(
> So for now all my comments are based on just reading the code.

Ah. My modification to rebase to the master for the time should
be the culprit. Sorry for the dirty patch.

# I would recreate the patch if you complained before struggling
# with the thing..

The core of the modification is on closesel.c. I attached the
patched closesel.c.

> FWIW I've rebased my patch to the current master, it's available on
> github as usual:
> 
>     https://github.com/tvondra/postgres/commits/mvstats

Thanks.

> > The code to find mv-applicable clause is moved out of the main
> > flow of clauselist_selectivity. As I said in the previous mail,
> > the new function transformRestrictInfoForEstimate (too bad name
> > but just for PoC:) scans clauselist and generates
> > RestrictStatsData struct which drives mv-aware selectivity
> > calculation. This struct isolates MV and non-MV estimation.
> >
> > The struct RestrictStatData mainly consists of the following
> > three parts,
> >
> >    - clause to be estimated by current logic (MV is not applicable)
> >    - clause to be estimated by MV-staistics.
> >    - list of child RestrictStatDatas, which are to be run
> >      recursively.
> >
> > mvclause_selectivty() is the topmost function where mv stats
> > works. This structure effectively prevents main estimation flow
> > from being broken by modifying mvstats part. Although I haven't
> > measured but I'm positive the code is far reduced from yours.
> >
> > I attached two patches to this message. The first one is to
> > rebase v7 patch to current(maybe) master and the second applies
> > the refactoring.
> >
> > I'm a little anxious about performance but I think this makes the
> > process to apply mv-stats far clearer. Regtests for mvstats
> > succeeded asis except for fdep, which is not implememted in this
> > patch.
> >
> > What do you think about this?
> 
> I'm not sure, at this point. I'm having a hard time understanding how
> exactly the code works - there are pretty much no comments explaining
> the implementation, so it takes time to understand the code. This is
> especially true about transformRestrictInfoForEstimate which is also
> quite long. I understand it's a PoC, but comments would really help.

The patch itself should be hardly readable because it's not from
master but from your last patch plus something.

My concern about the code at the time was following,

- You embedded the logic of multivariate estimation into
  clauselist_selectivity. I think estimate using multivariate
  statistics is quite different from the ordinary estimate based
  on single column stats then they are logically separatable and
  we should do so.

- You are taking top-down approach and it runs tree-walking to
  check appliability of mv-stats for every stepping down in
  clause tree. If the subtree found to be mv-applicable, split it
  to two parts - mv-compatible and non-compatible. These steps
  requires expression tree walking, which looks using too-much
  CPU.

- You look to be considering the cases when users create many
  multivariate statistics on attribute sets having
  duplications. But it looks too-much for me. MV-stats are more
  resource-eating so we can assume the minimum usage of that.

My suggestion in the patch is a bottom-up approach to find
mv-applicable portion(s) in the expression tree, which is the
basic way of planner overall. The approach requires no repetitive
run of tree walker, that is, pull_varnos. It could fail to find
the 'optimal' solution for complex situations but needs far less
calculation for almost the same return (I think..).

Even though it doesn't consider the functional dependency, the
reduction of the code shows the efficiency. It does nothing
tricky.

> On a conceptual level, I think the idea to split the estimation into
> two phases - enrich the expression tree with nodes with details about
> stats etc, and then actually do the estimation in the second phase
> might be interesting. Not because it's somehow clearer, but because it
> gives us a chance to see the expression tree as a whole, with details
> about all the stats (with the current code we process/estimate the
> tree incrementally). But I don't really know how useful that would be.

It is difficult to say which approach is better since it is
affected by what we think is more important than other things.
However, I am concerned that your code substantially reconstructs
the expression (clause) tree in the midst of processing it. I
believe it should be a separate phase for simplicity. Of course
the additional required resources should also be considered, but
they are rather reduced in this case.

> I don't think the proposed change makes the process somehow clearer. I
> know it's a PoC at this point, so I don't expect it to be perfect, but
> for me the original code is certainly clearer. Of course, I'm biased
> as I wrote the current code, and I (naturally) shaped it to match my
> ideas during the development process, and I'm much more familiar with
> it.

Mmm. We need someone else's opinion :) What I think on this point
is described just above... OK, I'll try to describe this in other
words.

The embedded approach simply increases the state and code paths
on, roughly, a multiplicative basis. The separate approach adds
them on an additive basis. I think this is the most significant
point of why I feel it 'clear'.

Of course, the acceptable complexity differs according to the
fundamental complexity, performance, required memory or something
else, but I feel it is too much complexity for the objective.

> Omitting the support for functional dependencies is a bit unfortunate,
> I think. Is that merely to make the PoC simpler, or is there something
> that makes it impossible to support that kind of stats?

I don't think so. I omitted it simply because it would take more
time to implement.

> Another thing that I noticed is that you completely removed the code
> that combined multiple stats (and selected the best combination of
> stats). In other words, you've reverted to the intermediate single
> statistics approach, including removing the improved handling of OR
> clauses and conditions.

Yeah, good catch :p I noticed just after submitting the patch
that I retain only one statistics at the second level from the
bottom, but it is easily fixed by changing the pruning timing. The
struct can hold multiple statistics anyway.

And I don't omit OR case. It is handled along with the AND
case. (in wrong way?)

>  It's a bit difficult to judge the proposed
> approach not knowing how well it supports those (quite crucial)
> features. What if it can't support some them., or what if it makes the
> code much more complicated (thus defeating the goal of making it more
> clear)?

OR is supported, Fdep is maybe supportable, but all of them
occur within the function with the entangled name
(transform..something). But I should give more consideration to your
latest code before that.

> I share your concern about the performance impact - one thing is that
> this new code might be slower than the original one, but a more
> serious issue IMHO is that the performance impact will happen even for
> relations with no multivariate stats at all. The original patch was
> very careful about getting ~0% overhead in such cases,

I don't think so. find_stats runs pull_varnos and
transformRestric.. also uses pull_varnos to bail out at the top
level. They should have almost the same overhead for the case.

> and if the new
> code does not allow that, I don't see this approach as acceptable. We
> must not put additional overhead on people not using multivariate
> stats.
> 
> But I think it's worth exploring this idea a bit more - can you rebase
> it to the current patch version (as on github) and adding the missing
> pieces (functional dependencies, multi-statistics estimation and
> passing conditions)?

With pleasure. Please wait for a while.

> One more thing - I noticed you extended the pg_operator catalog with a
> oprmvstat attribute, used to flag operators that are compatible with
> multivariate stats. I'm not happy with the current approach (using
> oprrest to do this decision), but I'm not really sure this is a good
> solution either. The culprit is that it only answers one of the two
> important questions - Is it compatible? How to perform the estimation?

Honestly speaking, I also don't like this. But checking oprrest is
just as unpleasant.

> So we'd have to rely on oprrest anyway, when actually performing the
> estimation of a clause with "compatible" operator. And we'd have to
> keep in sync two places (catalog and checks in file), and we'd have to
> update the catalog after improving the implementation (adding support
> for another operator).

Mmm. It depends on what the developers think about the definition
of oprrest. More practically, I'm worried about whether it can ever
be anything other than eqsel for an equality operator. And the same
for comparison operators.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hello Horiguchi-san,

On 07/27/2015 09:04 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<55B3FB0B.7000201@2ndquadrant.com>
>> Hi,
>>
>> On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
>>> Hi, I'd like to show you the modified constitution of
>>> multivariate statistics application logic. Please find the
>>> attached. They apply on your v7 patch.
>>
>> Sadly I do have some trouble getting it to apply correctly :-(
>> So for now all my comments are based on just reading the code.
>
> Ah. My modification to rebase to the master for the time should
> be culprit. Sorry for the dirty patch.
>
> # I would recreate the patch if you complained before struggling
> # with the thing..
>
> The core of the modification is on closesel.c. I attached the
> patched closesel.c.

I don't see any attachment. Perhaps you forgot to actually attach it?

>
> My concern about the code at the time was following,
>
> - You embedded the logic of multivariate estimation into
>    clauselist_selectivity. I think estimate using multivariate
>    statistics is quite different from the ordinary estimate based
>    on single column stats then they are logically separatable and
>    we should do so.

I don't see them as very different, actually quite the opposite. The two 
kinds of statistics are complementary and should naturally coexist. 
Perhaps the current code is not perfect and a refactoring would make the 
code more readable, but I don't think its primary aim should be to 
separate regular and multivariate stats.

>
> - You are taking top-down approach and it runs tree-walking to
>    check appliability of mv-stats for every stepping down in
>    clause tree. If the subtree found to be mv-applicable, split it
>    to two parts - mv-compatible and non-compatible. These steps
>    requires expression tree walking, which looks using too-much
>    CPU.

I'm taking top-down approach because that's what the regular stats do, 
and also because that's what allows implementing the features that I 
think are interesting - ability to combine multiple stats in an 
efficient way, pass conditions and such. I think those two features are 
very useful and allow very interesting things.

The bottom-up would work too, probably - I mean, we could start from 
leaves of the expression tree, and build the largest "subtree" 
compatible with multivariate stats and then try to estimate it. I don't 
see how we could pass conditions though, which works naturally in the 
top-down approach.

Or maybe a combination of both - identify the "compatible" subtrees 
first, then perform the top-down phase.

> - You look to be considering the cases when users create many
>    multivariate statistics on attribute sets having
>    duplications. But it looks too-much for me. MV-stats are more
>    resource-eating so we can assume the minimum usage of that.

Not really. I don't expect huge numbers of multivariate stats to be 
built on the tables.

But I think restricting the users to use a single multivariate 
statistics per table would be a significant limitation. And once you 
allow using multiple multivariate statistics for the set of clauses, 
supporting overlapping stats is not that difficult.

What it however makes possible is combining multiple "small" stats into 
a larger one in a very efficient way - it assumes the overlap is 
sufficient, of course. But if that's true you may build multiple small 
(and very accurate) stats instead of one huge (or very inaccurate) 
statistics.

This also makes it possible to handle complex combinations of clauses 
that are compatible and incompatible with multivariate statistics, by 
passing the conditions.

>
> My suggestion in the patch is a bottom-up approach to find
> mv-applicable portion(s) in the expression tree, which is the
> basic way of planner overall. The approach requires no repetitive
> run of tree walker, that is, pull_varnos. It could fail to find
> the 'optimal' solution for complex situations but needs far less
> calculation for almost the same return (I think..).
>
> Even though it doesn't consider the functional dependency, the
> reduce of the code shows the efficiency. It does not nothing
> tricky.

OK

>> On a conceptual level, I think the idea to split the estimation into
>> two phases - enrich the expression tree with nodes with details about
>> stats etc, and then actually do the estimation in the second phase
>> might be interesting. Not because it's somehow clearer, but because it
>> gives us a chance to see the expression tree as a whole, with details
>> about all the stats (with the current code we process/estimate the
>> tree incrementally). But I don't really know how useful that would be.
>
> It is difficult to say which approach is better sinch it is
> affected by what we think important than other things. However I
> concern about that your code substantially reconstructs the
> expression (clause) tree midst of processing it. I believe it
> should be a separate phase for simplicity. Of course additional
> required resource is also should be considered but it is rather
> reduced for this case.

What do you mean by "reconstruct the expression tree"? It's true I'm 
walking the expression tree top-down, but how is that reconstructing?

>
>> I don't think the proposed change makes the process somehow clearer. I
>> know it's a PoC at this point, so I don't expect it to be perfect, but
>> for me the original code is certainly clearer. Of course, I'm biased
>> as I wrote the current code, and I (naturally) shaped it to match my
>> ideas during the development process, and I'm much more familiar with
>> it.
>
> Mmm. we need someone else's opition:) What I think on this point
> is described just above... OK, I try to describe this in other
> words.

I find your comments very valuable. I may not agree with some of them, 
but I certainly appreciate your point of view. So thank you very much 
for the time you spent reviewing this patch so far!

> The embedded approach increases the state and code paths roughly on
> a multiplicative basis. The separate approach adds them on an
> additive basis. I think this is the most significant reason why I
> feel it is 'clearer'.
>
> Of course, the acceptable complexity differs according to the
> fundamental complexity, performance, required memory or something
> else, but I feel it is too much complexity for the objective.

Yes, I think we might have slightly different objectives in mind.

Regarding the complexity - I am not too worried about spending more CPU 
cycles on this, as long as it does not impact the case where people have 
no multivariate statistics at all. That's because I expect people to use 
this for large DSS/DWH data sets with lots of dependencies in the (often 
denormalized) tables and complex conditions - in those cases the 
planning difference is negligible, especially if the improved estimates 
make the query run in seconds instead of hours.

This is why I was so careful to entirely skip the expensive processing 
when there were no multivariate stats, and why I don't like the fact 
that your approach makes this skip more difficult (or maybe impossible, 
I'm not sure).

It's also true that most OLTP queries (especially the short ones, thus 
most impacted by the increase of planning time) use rather short/simple 
clause lists, so even the top-down approach should be very cheap.

>
>> Omitting the support for functional dependencies is a bit unfortunate,
>> I think. Is that merely to make the PoC simpler, or is there something
>> that makes it impossible to support that kind of stats?
>
> I don't think so. I omitted it simply because it would take more
> time to implement.

OK, thanks for confirming this.

>
>> Another thing that I noticed is that you completely removed the code
>> that combined multiple stats (and selected the best combination of
>> stats). In other words, you've reverted to the intermediate single
>> statistics approach, including removing the improved handling of OR
>> clauses and conditions.
>
> Yeah, good catch :p I noticed just after submitting the patch that
> I retain only one statistics at the second level from the bottom,
> but it is easily fixed by changing the pruning timing. The struct
> can hold multiple statistics anyway.

Great!

>
> And I don't omit the OR case. It is handled along with the AND
> case. (in a wrong way?)

Oh, I see. I got a bit confused because you've removed the optimization 
step (and conditions), and that needs to be handled a bit differently 
for the OR clauses.

>
>>   It's a bit difficult to judge the proposed
>> approach not knowing how well it supports those (quite crucial)
>> features. What if it can't support some of them, or what if it makes the
>> code much more complicated (thus defeating the goal of making it more
>> clear)?
>
> OR is supported, Fdep is maybe supportable, but all of them
> occur within the function with the entangled name
> (transform..something). But I should give more consideration to
> your latest code before that.

Good. Likewise, I'd like to see more of your approach ;-)

>
>> I share your concern about the performance impact - one thing is that
>> this new code might be slower than the original one, but a more
>> serious issue IMHO is that the performance impact will happen even for
>> relations with no multivariate stats at all. The original patch was
>> very careful about getting ~0% overhead in such cases,
>
> I don't think so. find_stats runs pull_varnos and
> transformRestric.. also uses pull_varnos to bail out at the top
> level. They should have almost the same overhead for the case.

Understood. As I explained above, I'm not all that concerned about the 
performance impact, as long as we make sure it only applies to people 
using the multivariate stats.

I also think a combined approach - first a bottom-up step (identifying 
the largest compatible subtrees & caching the varnos), then a top-down 
step (doing the same optimization as implemented today) might minimize 
the performance impact.

>
>> and if the new
>> code does not allow that, I don't see this approach as acceptable. We
>> must not put additional overhead on people not using multivariate
>> stats.
>>
>> But I think it's worth exploring this idea a bit more - can you rebase
>> it to the current patch version (as on github) and adding the missing
>> pieces (functional dependencies, multi-statistics estimation and
>> passing conditions)?
>
> With pleasure. Please wait for a while.

Sure. Take your time.

>
>> One more thing - I noticed you extended the pg_operator catalog with a
>> oprmvstat attribute, used to flag operators that are compatible with
>> multivariate stats. I'm not happy with the current approach (using
>> oprrest to do this decision), but I'm not really sure this is a good
>> solution either. The culprit is that it only answers one of the two
>> important questions - Is it compatible? How to perform the estimation?
>
> Honestly speaking, I also don't like this. But checking oprrest is
> just as unpleasant.

The patch is already quite massive, so let's use the same approach as 
current stats, and leave this problem for another patch. If we come up 
with a great idea, we can work on it, but I see this as a loosely 
related annoyance rather than something this patch aims to address.

>> So we'd have to rely on oprrest anyway, when actually performing the
>> estimation of a clause with "compatible" operator. And we'd have to
>> keep in sync two places (catalog and checks in file), and we'd have to
>> update the catalog after improving the implementation (adding support
>> for another operator).
>
> Mmm. It depends on what the developers think about the definition
> of oprrest. More practically, I'm worried whether it can really
> never be anything other than eqsel for an equality operator. And
> the same for comparison operators.

OTOH if you define a new operator with oprrest=F_EQSEL, you're 
effectively saying "It's OK to estimate this using regular eq/lt/gt 
operators". If your operator is somehow incompatible with that, you 
should not set oprrest=F_EQSEL.
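
To make that concrete - a (hypothetical) user-defined equality operator 
declared along these lines gets oprrest = F_EQSEL, and thereby opts into 
being estimated as a regular equality:

    CREATE OPERATOR === (
        LEFTARG   = integer,
        RIGHTARG  = integer,
        PROCEDURE = int4eq,
        RESTRICT  = eqsel,      -- this is what ends up in pg_operator.oprrest
        JOIN      = eqjoinsel
    );

If the operator's semantics were incompatible with that, using eqsel 
here would already be questionable for the regular single-column 
estimates, multivariate stats or not.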

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Heikki Linnakangas
Date:
On 05/25/2015 11:43 PM, Tomas Vondra wrote:
> There are 6 files attached, but only 0002-0006 are actually part of the
> multivariate statistics patch itself.

All of these patches are huge. In order to review this in a reasonable 
amount of time, we need to do this in several steps. So let's see what 
would be the minimal set of these patches that could be reviewed and 
committed, while still being useful.

The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be 
enough to get a benefit from this?

I have some doubts about the clause reduction and functional 
dependencies part of this. It seems to treat functional dependency as a 
boolean property, but even with the classic zipcode and city case, it's 
not always an all or nothing thing. At least in some countries, there 
can be zipcodes that span multiple cities. So zipcode=X does not 
completely imply city=Y, although there is a strong correlation (if 
that's the right term). How strong does the correlation need to be for 
this patch to decide that zipcode implies city? I couldn't actually see 
a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think it 
would make more sense to attach a 0.0-1.0 number to it. That means that 
you can't do clause reduction like it's done in this patch, where you 
actually remove clauses from the query for cost estimation purposes. 
Instead, you need to calculate the selectivity for each clause 
independently, but instead of just multiplying the selectivities 
together, apply the "dependence factor" to them.
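
To sketch what I mean (just an illustration, not a concrete formula 
proposal): with a dependence factor f between 0.0 and 1.0 for "zipcode 
implies city", the two clauses could be combined roughly as

    P(zip = X AND city = Y) = P(zip = X) * (f + (1 - f) * P(city = Y))

so that f = 0.0 falls back to the independence assumption, while f = 1.0 
behaves like the clause reduction in the current patch (the implied 
clause contributes nothing extra).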

Does that make sense? I haven't really looked at the MCV, histogram and 
"multi-statistics estimation" patches yet. Do those patches make the 
clause reduction patch obsolete? Should we forget about the clause 
reduction and functional dependency patch, and focus on those later 
patches instead?

- Heikki




Re: multivariate statistics / patch v7

From
Kyotaro HORIGUCHI
Date:
Hello, I certainly attached the file this time.


At Mon, 27 Jul 2015 23:54:08 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<55B6A880.3050801@2ndquadrant.com>
> > The core of the modification is on closesel.c. I attached the
> > patched closesel.c.
> 
> I don't see any attachment. Perhaps you forgot to actually attach it?

Very sorry to have forgotten to attach it. I attached the new
patch applicable on the head of mvstats branch of your
repository.

> > My concern about the code at the time was following,
> >
> > - You embedded the logic of multivariate estimation into
> >    clauselist_selectivity. I think estimate using multivariate
> >    statistics is quite different from the ordinary estimate based
> >    on single column stats then they are logically separatable and
> >    we should do so.
> 
> I don't see them as very different, actually quite the opposite. The
> two kinds of statistics are complementary and should naturally
> coexist. Perhaps the current code is not perfect and a refactoring
> would make the code more readable, but I don't think it's primary aim
> should be to separate regular and multivariate stats.
> 
> > - You are taking top-down approach and it runs tree-walking to
> >    check appliability of mv-stats for every stepping down in
> >    clause tree. If the subtree found to be mv-applicable, split it
> >    to two parts - mv-compatible and non-compatible. These steps
> >    requires expression tree walking, which looks using too-much
> >    CPU.
> 
> I'm taking top-down approach because that's what the regular stats do,
> and also because that's what allows implementing the features that I
> think are interesting - ability to combine multiple stats in an
> efficient way, pass conditions and such. I think those two features
> are very useful and allow very interesting things.
> 
> The bottom-up would work too, probably - I mean, we could start from
> leaves of the expression tree, and build the largest "subtree"
> compatible with multivariate stats and then try to estimate it. I
> don't see how we could pass conditions though, which works naturally
> in the top-down approach.

By the way, the 'condition' seems to mean what is received by the
parameter of clause(list)_selectivity with the same name. But it
is always NIL. Looking at the comment for collect_mv_attnum, it is
prepared for 'multitable statistics'. If so, I think it's better
removed from the current patch, because it is useless now.

> Or maybe a combination of both - identify the "compatible" subtrees
> first, then perform the top-down phase.
> 
> > - You look to be considering the cases when users create many
> >    multivariate statistics on attribute sets having
> >    duplications. But it looks too-much for me. MV-stats are more
> >    resource-eating so we can assume the minimum usage of that.
> 
> Not really. I don't expect huge numbers of multivariate stats to be
> built on the tables.
> 
> But I think restricting the users to use a single multivariate
> statistics per table would be a significant limitation. And once you
> allow using multiple multivariate statistics for the set of clauses,
> supporting over-lapping stats is not that difficult.
> 
> What it however makes possible is combining multiple "small" stats
> into a larger one in a very efficient way - it assumes the overlap is
> sufficient, of course. But if that's true you may build multiple small
> (and very accurate) stats instead of one huge (or very inaccurate)
> statistics.
> 
> This also makes it possible to handle complex combinations of clauses
> that are compatible and incompatible with multivariate statistics, by
> passing the conditions.
> 
> >
> > My suggestion in the patch is a bottom-up approach to find
> > mv-applicable portion(s) in the expression tree, which is the
> > basic way of planner overall. The approach requires no repetitive
> > run of tree walker, that is, pull_varnos. It could fail to find
> > the 'optimal' solution for complex situations but needs far less
> > calculation for almost the same return (I think..).
> >
> > Even though it doesn't consider the functional dependency, the
> > reduction of the code shows the efficiency. It does nothing
> > tricky.
> 
> OK

The functional dependency code looks immature in both the detection
phase and the application phase, in comparison to MCV and
histograms. In addition to that, as the comment in dependencies.c
says, fdep is not as significant (as MCV/HIST), because it is
usually carefully avoided and should be noticed and dealt with in
the design of the application or the whole system.

Insisting on applying them all at once doesn't seem to be a good
strategy to adopt this early.

Or perhaps it might be better to register the dependency itself,
rather than registering incomplete information (only the set of
columns involved in the relationship) and trying to detect the
relationship from the given values. I suppose those who can
register the column set know the precise nature of the dependency
in advance.

> >> On a conceptual level, I think the idea to split the estimation into
> >> two phases - enrich the expression tree with nodes with details about
> >> stats etc, and then actually do the estimation in the second phase
> >> might be interesting. Not because it's somehow clearer, but because it
> >> gives us a chance to see the expression tree as a whole, with details
> >> about all the stats (with the current code we process/estimate the
> >> tree incrementally). But I don't really know how useful that would be.
> >
> > It is difficult to say which approach is better sinch it is
> > affected by what we think important than other things. However I
> > concern about that your code substantially reconstructs the
> > expression (clause) tree midst of processing it. I believe it
> > should be a separate phase for simplicity. Of course additional
> > required resource is also should be considered but it is rather
> > reduced for this case.
> 
> What do you mean by "reconstruct the expression tree"? It's true I'm
> walking the expression tree top-down, but how is that reconstructing?

For example, clauselist_mv_split does. It separates mv-clauses from
the original clause list, applies mv-stats at once and (perhaps)
lets the rest be processed via the 'normal' route. I called this
"reconstructing", which I tried to do explicitly and separately.

> >> I don't think the proposed change makes the process somehow clearer. I
> >> know it's a PoC at this point, so I don't expect it to be perfect, but
> >> for me the original code is certainly clearer. Of course, I'm biased
> >> as I wrote the current code, and I (naturally) shaped it to match my
> >> ideas during the development process, and I'm much more familiar with
> >> it.
> >
> > Mmm. We need someone else's opinion :) What I think on this point
> > is described just above... OK, I try to describe this in other
> > words.
> 
> I find your comments very valuable. I may not agree with some of them,
> but I certainly appreciate your point of view. So thank you very much
> for the time you spent reviewing this patch so far!

Yeah, thank you for your patience and kindness.

> > The embedded approach simply increases the state and code path by,
> > roughly, multiplication basis. The separate approach adds them in
> > addition basis. I think this is the most significant point of why I
> > feel it 'clear'.
> >
> > Of course, the acceptable complexity differs according to the
> > fundamental complexity, performance, required memory or something
> > others but I feel it is too-much complexity for the objective.
> 
> Yes, I think we might have slightly different objectives in mind.

Sure! Now I understand what the point is.

> Regarding the complexity - I am not too worried about spending more
> CPU cycles on this, as long as it does not impact the case where
> people have no multivariate statistics at all. That's because I expect
> people to use this for large DSS/DWH data sets with lots of
> dependencies in the (often denormalized) tables and complex conditions
> - in those cases the planning difference is negligible, especially if
> the improved estimates make the query run in seconds instead of hours.

I share that vision with you. If that is the case, the mv-stats
route should not intrude on the existing non-mv-stats route. I
feel you have intruded into clauselist_selectivity too much.

If that is the case, my mv-distinct code has a different objective
from yours. It aims to prevent the misestimations from multicolumn
correlations that occur more commonly in OLTP usage.

> This is why I was so careful to entirely skip the expensive processing
> when there were no multivariate stats, and why I don't like the fact
> that your approach makes this skip more difficult (or maybe
> impossible, I'm not sure).

My code skips everything if transformRestrictionForEstimate returns
NULL and runs clauselist_selectivity as usual. I think that's
almost the same as yours.

However, if that is the concern, I believe we should not only skip
the calculation but also hide the additional code blocks that are
overwhelming the normal route. That is one of the major objectives
of my approach.

> It's also true that most OLTP queries (especially the short ones, thus
> most impacted by the increase of planning time) use rather
> short/simple clause lists, so even the top-down approach should be
> very cheap.
> 
> >> Omitting the support for functional dependencies is a bit unfortunate,
> >> I think. Is that merely to make the PoC simpler, or is there something
> >> that makes it impossible to support that kind of stats?
> >
> > I don't think so. I omitted it simply because it would take more time
> > to implement.
> 
> OK, thanks for confirming this.
> 
> >
> >> Another thing that I noticed is that you completely removed the code
> >> that combined multiple stats (and selected the best combination of
> >> stats). In other words, you've reverted to the intermediate single
> >> statistics approach, including removing the improved handling of OR
> >> clauses and conditions.
> >
> > Yeah, good catch :p I noticed that just after submitting the
> > patch that I retain only one statistics at the second level from
> > the bottom but it is easily fixed by changing pruning timing. The
> > struct can hold multiple statistics anyway.
> 
> Great!

But sorry. I found that considering multiple stats at every level
cannot be done without exhaustively searching combinations among
child clauses, and it needs an additional data structure. It needs
more thought.. As mentioned later, top-down might be more suitable
for this optimization.

> > And I don't omit OR case. It is handled along with the AND
> > case. (in wrong way?)
> 
> Oh, I see. I got a bit confused because you've removed the
> optimization step (and conditions), and that needs to be handled a bit
> differently for the OR clauses.

Sorry to have forced you to read an inapplicable patch :p

> >>   It's a bit difficult to judge the proposed
> >> approach not knowing how well it supports those (quite crucial)
> >> features. What if it can't support some of them, or what if it makes the
> >> code much more complicated (thus defeating the goal of making it more
> >> clear)?
> >
> > OR is supported, Fdep is maybe supportable, but all of them
> > occurs within the function with the entangled name
> > (transform..something). But I should put more consider on your
> > latest code before that.
> 
> Good. Likewise, I'd like to see more of your approach ;-)
> 
> >
> >> I share your concern about the performance impact - one thing is that
> >> this new code might be slower than the original one, but a more
> >> serious issue IMHO is that the performance impact will happen even for
> >> relations with no multivariate stats at all. The original patch was
> >> very careful about getting ~0% overhead in such cases,
> >
> > I don't think so. find_stats runs pull_varnos and
> > transformRestric.. also uses pull_varnos to bail out at the top
> > level. They should have almost the same overhead for the case.
> 
> Understood. As I explained above, I'm not all that concerned about the
> performance impact, as long as we make sure it only applies to people
> using the multivariate stats.
> 
> I also think a combined approach - first a bottom-up step (identifying
> the largest compatible subtrees & caching the varnos), then a top-down
> step (doing the same optimization as implemented today) might minimize
> the performance impact.

I am almost reaching the same conclusion.

> >> and if the new
> >> code does not allow that, I don't see this approach as acceptable. We
> >> must not put additional overhead on people not using multivariate
> >> stats.
> >>
> >> But I think it's worth exploring this idea a bit more - can you rebase
> >> it to the current patch version (as on github) and adding the missing
> >> pieces (functional dependencies, multi-statistics estimation and
> >> passing conditions)?
> >
> > With pleasure. Please wait for a while.
> 
> Sure. Take your time.
> 
> >
> >> One more thing - I noticed you extended the pg_operator catalog with a
> >> oprmvstat attribute, used to flag operators that are compatible with
> >> multivariate stats. I'm not happy with the current approach (using
> >> oprrest to do this decision), but I'm not really sure this is a good
> >> solution either. The culprit is that it only answers one of the two
> >> important questions - Is it compatible? How to perform the estimation?
> >
> > Honestly speaking, I also don't like this. But checking oprrest is
> > unpleasant much the same.
> 
> The patch is already quite massive, so let's use the same approach as
> current stats, and leave this problem for another patch. If we come up
> with a great idea, we can work on it, but I see this as a loosely
> related annoyance rather than something this patch aims to address.

Agreed.

> >> So we'd have to rely on oprrest anyway, when actually performing the
> >> estimation of a clause with "compatible" operator. And we'd have to
> >> keep in sync two places (catalog and checks in file), and we'd have to
> >> update the catalog after improving the implementation (adding support
> >> for another operator).
> >
> > Mmm. It depends on what the developers think about the definition
> > of oprrest. More practically, I'm worried whether it cannot be
> > other than eqsel for any equality operator. And the same for
> > comparison operators.
> 
> OTOH if you define a new operator with oprrest=F_EQSEL, you're
> effectively saying "It's OK to estimate this using regular eq/lt/gt
> operators". If your operator is somehow incompatible with that, you
> should not set oprrest=F_EQSEL.

In contrast, some function other than F_EQSEL might be compatible
with mv-statistics.

For all that, it's not my main concern. Although I think they
really are effectively the same, I'm uneasy about using a field
apparently not intended (or suited) to distinguish this kind of
operator property.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
> On 05/25/2015 11:43 PM, Tomas Vondra wrote:
>> There are 6 files attached, but only 0002-0006 are actually part of the
>> multivariate statistics patch itself.
>
> All of these patches are huge. In order to review this in a reasonable
> amount of time, we need to do this in several steps. So let's see what
> would be the minimal set of these patches that could be reviewed and
> committed, while still being useful.
>
> The main patches are:
>
> 1. shared infrastructure and functional dependencies
> 2. clause reduction using functional dependencies
> 3. multivariate MCV lists
> 4. multivariate histograms
> 5. multi-statistics estimation
>
> Would it make sense to commit only patches 1 and 2 first? Would that be
> enough to get a benefit from this?

I agree that the patch can't be reviewed as a single chunk - that was 
the idea when I split the original (single chunk) patch into multiple 
smaller pieces.

And yes, I believe committing pieces 1&2 might be enough to get 
something useful, which can then be improved by adding the "usual" MCV 
and histogram stats on top of that.

> I have some doubts about the clause reduction and functional
> dependencies part of this. It seems to treat functional dependency as
> a boolean property, but even with the classic zipcode and city case,
> it's not always an all or nothing thing. At least in some countries,
> there can be zipcodes that span multiple cities. So zipcode=X does
> not completely imply city=Y, although there is a strong correlation
> (if that's the right term). How strong does the correlation need to
> be for this patch to decide that zipcode implies city? I couldn't
> actually see a clear threshold stated anywhere.
>
> So rather than treating functional dependence as a boolean, I think
> it would make more sense to put a 0.0-1.0 number to it. That means
> that you can't do clause reduction like it's done in this patch,
> where you actually remove clauses from the query for cost estimation
> purposes. Instead, you need to calculate the selectivity for each
> clause independently, but instead of just multiplying the
> selectivities together, apply the "dependence factor" to it.
>
> Does that make sense? I haven't really looked at the MCV, histogram
> and "multi-statistics estimation" patches yet. Do those patches make
> the clause reduction patch obsolete? Should we forget about the
> clause reduction and functional dependency patch, and focus on those
> later patches instead?

Perhaps. It's true that most real-world data sets are not 100% valid 
with respect to functional dependencies - either because of natural 
imperfections (multiple cities with the same ZIP code) or just noise in 
the data (incorrect entries ...). And it's even mentioned in the code 
comments somewhere, I guess.

But there are two main reasons why I chose not to extend the functional 
dependencies with the [0.0-1.0] value you propose.

Firstly, functional dependencies were meant to be the simplest possible 
implementation, illustrating how the "infrastructure" is supposed to 
work (which is the main topic of the first patch).

Secondly, all kinds of statistics are "simplifications" of the actual 
data. So I think it's not incorrect to ignore the exceptions up to some 
threshold.

I also don't think this will make the estimates globally better. Let's 
say you have 1% of rows that contradict the functional dependency - you 
may either ignore them and have good estimates for 99% of the values and 
incorrect estimates for 1%, or tweak the rule a bit and make the 
estimates worse for 99% (and possibly better for 1%).

That being said, I'm not against improving the functional dependencies. 
I already do have some improvements on my TODO - like for example 
dependencies on more columns (not just A=>B but [A,B]=>C and such), but 
I think we should not squash this into those two patches.

And yet another point - ISTM these cases might easily be handled better 
by the statistics based on ndistinct coefficients, as proposed by 
Kyotaro-san some time ago. That is, compute and track
    ndistinct(A) * ndistinct(B) / ndistinct(A,B)

for all pairs of columns (or possibly larger groups). That seems to be 
similar to the coefficient you propose.
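
Just for illustration (hypothetical table and column names - the patch 
would of course compute this from the ANALYZE sample, not by scanning 
the whole table), the coefficient for a pair of columns is essentially 
what this query computes:

    SELECT count(DISTINCT zip)::numeric
         * count(DISTINCT city)
         / count(DISTINCT (zip, city)) AS ndistinct_coefficient
      FROM addresses;

A result close to ndistinct(city) suggests that zip (almost) determines 
city, while a result close to 1.0 suggests the two columns are more or 
less independent.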

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hello,

On 07/30/2015 01:26 PM, Kyotaro HORIGUCHI wrote:
> Hello, I certainly attached the file this time.
>
>
> At Mon, 27 Jul 2015 23:54:08 +0200, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote in <55B6A880.3050801@2ndquadrant.com>
>
>> The bottom-up would work too, probably - I mean, we could start from
>> leaves of the expression tree, and build the largest "subtree"
>> compatible with multivariate stats and then try to estimate it. I
>> don't see how we could pass conditions though, which works naturally
>> in the top-down approach.
>
> By the way, the 'condition' looks to mean what will be received
> by the parameter of clause(list)_selectivity with the same
> name. But it is always NIL. Looking at the comment for
> collect_mv_attnum, it is prepared for 'multitable statistics'. If
> so, I think it's better removed from the current patch, because
> it is useless now.

I don't think so. Conditions certainly are not meant for multitable 
statistics only (I don't see any comment suggesting that at 
collect_mv_attnums), but are actually used with the current code.

For example try this:

create table t (a int, b int, c int);
insert into t select i/100, i/100, i/100 from generate_series(1,100000) s(i);
alter table t add statistics (mcv) on (a,b);
analyze t;

select * from t where a<10 and b < 10 and (a < 50 or b < 50 or c < 50);

What will happen when estimating this query is this:

(1) clauselist_selectivity is called, and sees a list of three clauses:

        (a<10)    (b<10)    (a<50 OR b<50 OR c<50)

    But there's only a single statistics on columns [a,b], so at this
    point we can process only the first two clauses. So we'll do that,
    computing

        P(a<10, b<10)

    and we'll pass the OR-clause to the clause_selectivity() call, along
    with the two already estimated clauses as conditions.

(2) clause_selectivity will receive (a<50 OR b<50 OR c<50) as the clause
    to estimate, and the two clauses as conditions, computing

        P(a<50 OR b<50 OR c<50 | a<10, b<10)

    so the overall selectivity is (roughly) the product of these two
    probabilities.

The current estimate for the OR-clause is off, but I believe that's a 
bug in the current implementation of clauselist_selectivity_or(), and 
we've already discussed that some time ago.

>
> The functional dependency code looks immature in both the
> detection phase and application phase in comparison to MCV and
> histogram. Addition to that, as the comment in dependencies.c
> says, fdep is not so significant (than MCV/HIST) because it is
> usually carefully avoided and should be noticed and considered in
> designing of application or the whole system.

The code is certainly imperfect and needs improvements, no doubt about 
that. I have certainly spent much more time on MCV/histograms.

I'm not sure about stating that functional dependencies are less 
significant than MCV/HIST (I don't see any such statement in 
dependencies.c). I might have thought that initially, when I opted to 
implement fdeps as the simplest possible type of statistics, but I think 
it's quite practical, actually.

I however disagree about the last point - it's true that in many cases 
the databases are carefully normalized, which mostly makes functional 
dependencies irrelevant. But this is only true for OLTP systems, while 
the primary target of the patch are DSS/DWH systems. And in those 
systems denormalization is a very common practice.

So I don't think fdeps are completely irrelevant - it's quite useful in 
some scenarios, actually. Similarly to the ndistinct coefficient stats 
that you proposed, for example.

>
> Persisting to apply them all at once doesn't seem to be a good
> strategy to be adopted earlier.

Why?

>
> Or perhaps it might be better to register the dependency itself
> than registering incomplete information (only the set of columns
> involved in the relationship) and try to detect the relationship
> from the given values. I suppose those who can register the
> columnset know the precise nature of the dependency in advance.

I don't see how that could be done? I mean, you only have the constants 
supplied in the query - how could you verify the functional dependency 
based on just those values (or even decide the direction)?

>>
>> What do you mean by "reconstruct the expression tree"? It's true I'm
>> walking the expression tree top-down, but how is that reconstructing?
>
> For example clauselist_mv_split does. It separates mvclauses from
> original clauselist and apply mv-stats at once and (perhaps) let
> the rest be processed in the 'normal' route. I called this as
> "reconstruct", which I tried to do explicitly and separately.

Ah, I see. Thanks for the explanation. I wouldn't call this 
"reconstruction" though - I merely need to track which clauses to 
estimate using multivariate stats (and which need to be estimated using 
the regular stats). That's pretty much what RestrictStatData does, no?

>>
>> I find your comments very valuable. I may not agree with some of
>> them, but I certainly appreciate your point of view. So thank you
>> very much for the time you spent reviewing this patch so far!
>
> Yeah, thank you for your patience and kindness.

Likewise. It's very frustrating trying to understand complex code 
written by someone else, and I appreciate your effort.

>> Regarding the complexity - I am not too worried about spending
>> more CPU cycles on this, as long as it does not impact the case
>> where people have no multivariate statistics at all. That's because
>> I expect people to use this for large DSS/DWH data sets with lots
>> of dependencies in the (often denormalized) tables and complex
>> conditions - in those cases the planning difference is negligible,
>> especially if the improved estimates make the query run in seconds
>> instead of hours.
>
> I share the vision with you. If that is the case, the mv-stats
> route should not be intrude the existing non-mv-stats route. I
> feel you have too much intruded clauselist_selectivity all the
> more.
>
> If that is the case, my mv-distinct code has different objective
> from you. It aims to save the misestimation from multicolumn
> correlations more commonly occurs in OLTP usage.

OK. Let's see if we can make it work for both use cases.

>
>> This is why I was so careful to entirely skip the expensive
>> processing when there were no multivariate stats, and why I don't
>> like the fact that your approach makes this skip more difficult (or
>> maybe impossible, I'm not sure).
>
> My code totally skips if transformRestrictionForEstimate returns
> NULL and runs clauselist_selectivity as usual. I think almost the
> same as yours.

Ah, OK. Perhaps I missed that as I've had trouble applying the patch.

>
> However, if you think it I believe we should not only skipping
> calculation but also hiding the additional code blocks which is
> overwhelming the normal route. The one of major objectives of my
> approach is that point.

My main concern at this point was planning time, so skipping the 
calculation should be enough I believe. Hiding the additional code 
blocks is a matter of aesthetics, and we can address that by moving it 
to a separate method or such.

>
> But sorry. I found that considering multiple stats at every level
> cannot be done without exhaustive searching of combinations among
> child clauses and needs additional data structure. It needs more
> thought.. As mentioned later, top-down might be suitable for
> this optimization.

Do you think a combined approach - first bottom-up preprocessing, then 
top-down optimization (using the results of the first phase to speed 
things up) - might work?

>> Understood. As I explained above, I'm not all that concerned about
>> the performance impact, as long as we make sure it only applies to
>> people using the multivariate stats.
>>
>> I also think a combined approach - first a bottom-up step
>> (identifying the largest compatible subtrees & caching the varnos),
>> then a top-down step (doing the same optimization as implemented
>> today) might minimize the performance impact.
>
> I almost reaching the same conclusion.

Ah, so the answer to my last question is "yes". Now we only need to 
actually code it ;-)


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Heikki Linnakangas
Date:
On 07/30/2015 03:55 PM, Tomas Vondra wrote:
> On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
>> I have some doubts about the clause reduction and functional
>> dependencies part of this. It seems to treat functional dependency as
>> a boolean property, but even with the classic zipcode and city case,
>> it's not always an all or nothing thing. At least in some countries,
>> there can be zipcodes that span multiple cities. So zipcode=X does
>> not completely imply city=Y, although there is a strong correlation
>> (if that's the right term). How strong does the correlation need to
>> be for this patch to decide that zipcode implies city? I couldn't
>> actually see a clear threshold stated anywhere.
>>
>> So rather than treating functional dependence as a boolean, I think
>> it would make more sense to put a 0.0-1.0 number to it. That means
>> that you can't do clause reduction like it's done in this patch,
>> where you actually remove clauses from the query for cost estimation
>> purposes. Instead, you need to calculate the selectivity for each
>> clause independently, but instead of just multiplying the
>> selectivities together, apply the "dependence factor" to it.
>>
>> Does that make sense? I haven't really looked at the MCV, histogram
>> and "multi-statistics estimation" patches yet. Do those patches make
>> the clause reduction patch obsolete? Should we forget about the
>> clause reduction and functional dependency patch, and focus on those
>> later patches instead?
>
> Perhaps. It's true that most real-world data sets are not 100% valid
> with respect to functional dependencies - either because of natural
> imperfections (multiple cities with the same ZIP code) or just noise in
> the data (incorrect entries ...). And it's even mentioned in the code
> comments somewhere, I guess.
>
> But there are two main reasons why I chose not to extend the functional
> dependencies with the [0.0-1.0] value you propose.
>
> Firstly, functional dependencies were meant to be the simplest possible
> implementation, illustrating how the "infrastructure" is supposed to
> work (which is the main topic of the first patch).
>
> Secondly, all kinds of statistics are "simplifications" of the actual
> data. So I think it's not incorrect to ignore the exceptions up to some
> threshold.

The problem with a threshold is that around that threshold, even a small 
change in the data set can drastically change the produced estimates. 
For example, imagine that we know from the stats that zip code implies 
city. But then someone adds a single row to the table with an odd zip 
code & city combination, which pushes the estimator over the threshold, 
and the columns are no longer considered dependent, and the estimates 
are now completely different. We should avoid steep cliffs like that.

BTW, what is the threshold in the current patch?

- Heikki



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
>
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table
> with an odd zip code & city combination, which pushes the estimator
> over the threshold, and the columns are no longer considered
> dependent, and the estimates are now completely different. We should
> avoid steep cliffs like that.
>
> BTW, what is the threshold in the current patch?

There's not a simple threshold - the algorithm mining the functional 
dependencies is a bit more complicated. I tried to explain it in the 
comment before build_mv_dependencies (in dependencies.c), but let me 
briefly summarize it here.

To mine dependency [A => B], build_mv_dependencies does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency
    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update
        n_supporting and n_supporting_rows
    (c) if the group is inconsistent, update
        n_contradicting and n_contradicting_rows

(4) decide whether the dependency is "valid" by checking
    n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary and yes - I can imagine a more complex 
condition (e.g. looking at average number of tuples per group etc.), but 
I haven't looked into that - the point was to use something very simple, 
only to illustrate the infrastructure.
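
To illustrate on the zipcode/city example (hypothetical table and column 
names - the patch does all of this in C on the ANALYZE sample, not by 
scanning the whole table), the decision is roughly equivalent to:

    SELECT sum(n_rows) FILTER (WHERE n_cities = 1) AS n_supporting_rows,
           sum(n_rows) FILTER (WHERE n_cities > 1) AS n_contradicting_rows
      FROM (SELECT zip, count(*) AS n_rows, count(DISTINCT city) AS n_cities
              FROM addresses
             GROUP BY zip
            HAVING count(*) >= 3) g;

with the dependency (zip => city) considered valid when 
n_supporting_rows >= n_contradicting_rows * 10.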

I think we might come up with some elaborate way of associating "degree" 
with the functional dependency, but at that point we really lose the 
simplicity, and also make it indistinguishable from the remaining 
statistics (because it won't be possible to reduce the clauses like 
this, before performing the regular estimation). Which is exactly what 
makes the functional dependencies so neat and efficient, so I'm not 
overly enthusiastic about doing that.

What seems more interesting is implementing the ndistinct coefficient 
instead, as proposed by Kyotaro-san - that seems to have the nice 
"smooth" behavior you desire, while keeping the simplicity.

Both statistics types (functional dependencies and ndistinct coeff) have 
one weak point, though - they somehow assume the queries use 
"compatible" values. For example if you use a query with
   WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect cases like this, because those statistics types are 
oblivious to individual values. I don't see this as a fatal flaw, though 
- it's rather a consequence of the nature of the stats. And I tend to 
look at the functional dependencies the same way.

If you need stats without these "issues" you'll have to use MCV list or 
a histogram. Trying to fix the simple statistics types is futile, IMHO.

regards
Tomas

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics / patch v7

From
Michael Paquier
Date:
On Fri, Jul 31, 2015 at 6:28 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> [series of arguments]
>
> If you need stats without these "issues" you'll have to use MCV list or a
> histogram. Trying to fix the simple statistics types is futile, IMHO.

Patch is marked as returned with feedback. There have been advanced
discussions and reviews as well.
-- 
Michael



Re: multivariate statistics / patch v7

From
Josh Berkus
Date:
Tomas,

> attached is v7 of the multivariate stats patch. The main improvement is
> major refactoring of the clausesel.c portion - splitting the awfully
> long spaghetti-style functions into smaller pieces, making it much more
> understandable etc.

So presumably v7 handles varlena attributes as well, yes?   I have a
destruction test case for correlated column stats, so I'd like to test
your patch on it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: multivariate statistics / patch v7

From
Tomas Vondra
Date:
Hi,

On 09/24/2015 06:43 PM, Josh Berkus wrote:
> Tomas,
>
>> attached is v7 of the multivariate stats patch. The main improvement is
>> major refactoring of the clausesel.c portion - splitting the awfully
>> long spaghetti-style functions into smaller pieces, making it much more
>> understandable etc.
>
> So presumably v7 handles varlena attributes as well, yes?   I have a
> destruction test case for correlated column stats, so I'd like to test
> your patch on it.

Yes, it should handle varlena OK. Let me know if you need help with 
that, and I'd like to hear feedback - whether it fixed your test case or 
not, etc.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v8

From
Tomas Vondra
Date:
Hi,

attached is v8 of the multivariate statistics patch (or rather a patch
series). The patch currently has 7 parts, but 0001 is just a fix of the
pull_varnos issue (possibly incorrect/temporary), and 0007 is just an
attempt to add the "multicolumn distinctness" (experimental for now).

There are three noteworthy changes:

1) Correct estimation of OR-clauses - this turned out to be a rather
    minor change, thanks to simply transforming the OR-clauses to
    AND-clauses, see clauselist_selectivity_or() for details.

2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead
    adding separate commands CREATE STATISTICS / DROP STATISTICS, as
    proposed in the "multicolumn distinctness" thread:


http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp

    This seems a better approach than the ALTER TABLE one - not only
    does it nicely fix the grammar issues, it also naturally extends to
    multi-table statistics (even though we don't know how those should
    work exactly).

    The syntax is this:

      CREATE STATISTICS name ON table (columns) WITH (options);

      DROP STATISTICS name;

    and the 'name' is optional (and if absent, should be generated just
    like for indexes, but that's not implemented yet).

    The remaining question is how unique the statistics name should be.
    My initial plan was to make it unique within a table, but that of
    course does not work well with the DROP STATISTICS (it'd have to
    specify the table name also), and it'd also not work with statistics
    on multiple tables (which is one of the reasons for abandoning ALTER
    TABLE stuff).

    So I think it should be unique across tables. Statistics are hardly
    a global object, so it should be unique within a schema. I thought
    that simply using the schema of the table would work, but that of
    course breaks with multiple tables in different schemas. So the only
    solution seems to be an explicit schema for statistics.

3) I've also started hacking on adding the "multicolumn distinctness"
    proposed by Horiguchi-san, but I haven't really got that working. It
    seems to be a bit more complicated than I anticipated because of the
    "only equality conditions" restriction. So the 0007 patch only
    really adds basic syntax and trivial build.

    I do have a bunch of ideas/questions about this statistics type. For
    example, should we compute just a single coefficient for the exact
    combination of columns specified in CREATE STATISTICS, or perhaps
    also for some additional subsets? I.e. with

      CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);

    should we compute just the coefficient for (a,b,c), or maybe also
    for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such
    combinations, but perhaps it's acceptable.

    Having the coefficient for just the single combination specified in
    CREATE STATISTICS makes the estimation difficult when some of the
    columns are not specified. For example, with coefficient just for
    (a,b,c), what should happen for (WHERE a=1 AND b=2)?

    Should we simply ignore the statistics, or apply it anyway and
    somehow compensate for the missing columns?


I've also started working on something like a paper, hopefully
explaining the ideas and implementation more clearly and consistently
than possible on a mailing list (thanks to charts, figures and such).
It's available here (both the .tex source and .pdf with the current
version):

     https://bitbucket.org/tvondra/mvstats-paper/src

It's not exactly short (~30 pages), and it's certainly incomplete with
plenty of TODO notes, but hopefully it's already useful and not entirely
bogus.

Comments and questions are welcome - both to the patch and paper.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v9

From
Tomas Vondra
Date:
Hi,

attached is v9 of the patch series, including mostly these changes:

1) CREATE STATISTICS cleanup

    Firstly, I forgot to make the STATISTICS keyword unreserved again.
    I've also removed additional stuff from the grammar that turned out
    to be unnecessary / could be replaced with existing pieces.

2) making statistics schema-specific

    Similarly to the other objects (e.g. types), statistics names are now
    unique within a schema. This also means that the statistics may be
    created using qualified name, and also may belong to a different
    schema than a table.

    It seems to me we probably also need to track owner, and only allow
    the owner (or superuser / schema owner) to manipulate the statistics.

    The initial intention was to inherit all this from the parent table,
    but as we're designing this for the multi-table case, it's not
    really working anymore.

3) adding IF [NOT] EXISTS to DROP STATISTICS / CREATE STATISTICS

4) basic documentation of the DDL commands

    It's really simple at this point and some of the paragraphs are
    still empty. I also think that we'll have to add stuff explaining
    how to use statistics, not just docs for the DDL commands.

5) various fixes of the regression tests, related to the above


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: multivariate statistics / proof of concept

From
Gavin Flower
Date:
On 12/12/14 05:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> +     * FIXME This sample sizing is mostly OK when computing stats for
>> +     *       individual columns, but when computing multi-variate stats
>> +     *       for multivariate stats (histograms, mcv, ...) it's rather
>> +     *       insufficient. For small number of dimensions it works, but
>> +     *       for complex stats it'd be nice use sample proportional to
>> +     *       the table (say, 0.5% - 1%) instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the 
> sample is random, the accuracy of a sample doesn't depend much on the 
> size of the population. For example, if you sample 1,000 rows from a 
> table with 100,000 rows, or 1000 rows from a table with 100,000,000 
> rows, the accuracy is pretty much the same. That doesn't change when 
> you go from a single variable to multiple variables.
>
> You do need a bigger sample with multiple variables, however. My gut 
> feeling is that if you sample N rows for a single variable, with two 
> variables you need to sample N^2 rows to get the same accuracy. But 
> it's not proportional to the table size. (I have no proof for that, 
> but I'm sure there is literature on this.)
[...]

I did stage III statistics at University many moons ago...

The accuracy of the sample only depends on the value of N, not the total 
size of the population, with the obvious constraint that N <= population 
size.

The standard error of a random sample is inversely proportional to the 
square root of N.  So using N = 100 would give a standard error of about 
10%, and to reduce it to 5% you would need N = 400.

For multiple variables, it will also be a function of N - I don't recall 
precisely how, I suspect it might be M * N where M is the number of 
parameters (but I'm not as certain).  I think M^N might be needed if you 
want all the possible correlations between sets of variables to be 
reasonably significant - but I'm mostly just guessing here.

So using a % of table size is somewhat silly, looking at the above. 
However, if you want to detect frequencies that occur at the 1% level, 
then you will need to sample 1% of the table or greater.  So which 
approach is 'best', depends on what you are trying to determine. The 
sample size is more useful when you need to decide between 2 different 
hypothesises.

The sampling methodology is far more important than the ratio of N to 
population size - consider the bias imposed by using random telephone 
numbers, even before the advent of mobile phones!


Cheers,
Gavin



Re: multivariate statistics v8

From
Robert Haas
Date:
On Wed, Dec 23, 2015 at 2:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>    The remaining question is how unique the statistics name should be.
>    My initial plan was to make it unique within a table, but that of
>    course does not work well with the DROP STATISTICS (it'd have to
>    specify the table name also), and it'd also now work with statistics
>    on multiple tables (which is one of the reasons for abandoning ALTER
>    TABLE stuff).
>
>    So I think it should be unique across tables. Statistics are hardly
>    a global object, so it should be unique within a schema. I thought
>    that simply using the schema of the table would work, but that of
>    course breaks with multiple tables in different schemas. So the only
>    solution seems to be explicit schema for statistics.

That solution seems good to me.

(with apologies for not having looked at the rest of this much at all)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics v8

From
Bruce Momjian
Date:
On Wed, Jan 20, 2016 at 02:20:38PM -0500, Robert Haas wrote:
> On Wed, Dec 23, 2015 at 2:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >    The remaining question is how unique the statistics name should be.
> >    My initial plan was to make it unique within a table, but that of
> >    course does not work well with the DROP STATISTICS (it'd have to
> >    specify the table name also), and it'd also now work with statistics
> >    on multiple tables (which is one of the reasons for abandoning ALTER
> >    TABLE stuff).
> >
> >    So I think it should be unique across tables. Statistics are hardly
> >    a global object, so it should be unique within a schema. I thought
> >    that simply using the schema of the table would work, but that of
> >    course breaks with multiple tables in different schemas. So the only
> >    solution seems to be explicit schema for statistics.
> 
> That solution seems good to me.
> 
> (with apologies for not having looked at the rest of this much at all)

Woh, this will be an optimizer game-changer, from the user perspective!

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: multivariate statistics v8

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Wed, Jan 20, 2016 at 02:20:38PM -0500, Robert Haas wrote:
> > On Wed, Dec 23, 2015 at 2:07 PM, Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >    The remaining question is how unique the statistics name should be.
> > >    My initial plan was to make it unique within a table, but that of
> > >    course does not work well with the DROP STATISTICS (it'd have to
> > >    specify the table name also), and it'd also now work with statistics
> > >    on multiple tables (which is one of the reasons for abandoning ALTER
> > >    TABLE stuff).
> > >
> > >    So I think it should be unique across tables. Statistics are hardly
> > >    a global object, so it should be unique within a schema. I thought
> > >    that simply using the schema of the table would work, but that of
> > >    course breaks with multiple tables in different schemas. So the only
> > >    solution seems to be explicit schema for statistics.
> > 
> > That solution seems good to me.
> > 
> > (with apologies for not having looked at the rest of this much at all)
> 
> Woh, this will be an optimizer game-changer, from the user perspective!

That is the intent.  The patch is huge, though -- any reviewing help is
welcome.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v8

From
Tomas Vondra
Date:

On 01/20/2016 10:54 PM, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>> On Wed, Jan 20, 2016 at 02:20:38PM -0500, Robert Haas wrote:
>>> On Wed, Dec 23, 2015 at 2:07 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>     The remaining question is how unique the statistics name should be.
>>>>     My initial plan was to make it unique within a table, but that of
>>>>     course does not work well with the DROP STATISTICS (it'd have to
>>>>     specify the table name also), and it'd also now work with statistics
>>>>     on multiple tables (which is one of the reasons for abandoning ALTER
>>>>     TABLE stuff).
>>>>
>>>>     So I think it should be unique across tables. Statistics are hardly
>>>>     a global object, so it should be unique within a schema. I thought
>>>>     that simply using the schema of the table would work, but that of
>>>>     course breaks with multiple tables in different schemas. So the only
>>>>     solution seems to be explicit schema for statistics.
>>>
>>> That solution seems good to me.
>>>
>>> (with apologies for not having looked at the rest of this much at all)
>>
>> Woh, this will be an optimizer game-changer, from the user perspective!
>
> That is the intent. The patch is huge, though -- any reviewing help
> is welcome.

It's also true that a significant fraction of the size is documentation 
(in the form of comments). However even after stripping them the patch 
is not exactly small ...

I'm afraid it may be rather difficult to understand the general idea of 
the patch. So if anyone is interested in discussing the patch in 
Brussels next week, I'm available.

Also, in December I posted a link to a "paper" I started writing 
about the stats:
    https://bitbucket.org/tvondra/mvstats-paper/src


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v10

From
Tomas Vondra
Date:
Hi,

Attached is v10 of the patch series. There are 9 parts at the moment:

   0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
   0002-shared-infrastructure-and-functional-dependencies.patch
   0003-clause-reduction-using-functional-dependencies.patch
   0004-multivariate-MCV-lists.patch
   0005-multivariate-histograms.patch
   0006-multi-statistics-estimation.patch
   0007-multivariate-ndistinct-coefficients.patch
   0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
   0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch

However, the first one is still just a temporary workaround that I plan
to address next, and the last 3 are all dealing with the ndistinct
coefficients (and shall be squashed into a single chunk).


README docs
-----------

Aside from fixing a few bugs, there are several major improvements, the
main one being that I've moved most of the comments explaining how it
all works into a set of regular README files, located in
src/backend/utils/mvstats:

1) README.stats - Overview of available types of statistics, what
    clauses can be estimated, how multiple statistics are combined etc.
    This is probably the right place to start.

2) docs for each type of statistics currently available

    README.dependencies - soft functional dependencies
    README.mcv          - MCV lists
    README.histogram    - histograms
    README.ndistinct    - ndistinct coefficients

The READMEs are added and modified through the patch series, so the best
thing to do is apply all the patches and start reading.

I have not improved the user-oriented SGML documentation in this patch,
that's one of the tasks I'd like to work on next. But the READMEs should
give you a good idea how it's supposed to work, and there are some
examples of use in the regression tests.


Significantly simplified places
-------------------------------

The patch version also significantly simplifies several places that were
needlessly complex in the previous ones - firstly the function
evaluating clauses on multivariate histograms was rather needlessly
bloated, so I've simplified it a lot. Similarly for the code in
clauselist_selectivity() that combines multiple statistics to estimate a list
of clauses - that's much simpler now too. And various other pieces.

That being said, I still think the code in clausesel.c can be
simplified. I feel there's a lot of cruft, mostly due to unknowingly
implementing something that could be solved by an existing function.

A prime example of that is inspecting the expression tree to check if we
know how to estimate the clauses using the multivariate statistics. That
sounds like a nice match for expression walker, but currently is done by
custom code. I plan to look at that next.

Also, I'm not quite sure I understand what the varRelid parameter of
clauselist_selectivity is for, so the code may be handling that wrong
(seems to be working though).


ndistinct coefficients
----------------------

The one new piece in this patch is the GROUP BY estimation, based on the
ndistinct coefficients. So for example you can do this:

     CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
                         FROM generate_series(1,1000000) s(i);
     ANALYZE t;
     EXPLAIN SELECT * FROM t GROUP BY a, b;

which currently does this:

                               QUERY PLAN
-----------------------------------------------------------------------
  Group  (cost=127757.34..135257.34 rows=99996 width=8)
    Group Key: a, b
    ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
          Sort Key: a, b
          ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

but we know that there are only 1000 groups because the columns are
correlated. So let's create ndistinct statistics on the two columns:

     CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
     ANALYZE t;

which results in estimates like this:

                            QUERY PLAN
-----------------------------------------------------------------
  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
    Group Key: a, b
    ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(3 rows)
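
The estimated 1000 rows also match the actual number of groups, which can
be verified directly (a hypothetical sanity check, not part of the patch):

     SELECT count(*) FROM (SELECT DISTINCT a, b FROM t) g;   -- returns 1000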

I'm not quite sure how to combine this type of statistics with MCV lists
and histograms, so for now it's used only for GROUP BY.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v10

From
Thom Brown
Date:
On 2 March 2016 at 14:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached is v10 of the patch series. There are 9 parts at the moment:
>
>   0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
>   0002-shared-infrastructure-and-functional-dependencies.patch
>   0003-clause-reduction-using-functional-dependencies.patch
>   0004-multivariate-MCV-lists.patch
>   0005-multivariate-histograms.patch
>   0006-multi-statistics-estimation.patch
>   0007-multivariate-ndistinct-coefficients.patch
>   0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
>   0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch
>
> However, the first one is still just a temporary workaround that I plan to address next, and the last 3 are all
> dealing with the ndistinct coefficients (and shall be squashed into a single chunk).
>
>
> README docs
> -----------
>
> Aside from fixing a few bugs, there are several major improvements, the main one being that I've moved most of the
> comments explaining how it all works into a set of regular README files, located in src/backend/utils/mvstats:
>
> 1) README.stats - Overview of available types of statistics, what
>    clauses can be estimated, how multiple statistics are combined etc.
>    This is probably the right place to start.
>
> 2) docs for each type of statistics currently available
>
>    README.dependencies - soft functional dependencies
>    README.mcv          - MCV lists
>    README.histogram    - histograms
>    README.ndistinct    - ndistinct coefficients
>
> The READMEs are added and modified through the patch series, so the best thing to do is apply all the patches and
> start reading.
>
> I have not improved the user-oriented SGML documentation in this patch, that's one of the tasks I'd like to work on
> next. But the READMEs should give you a good idea how it's supposed to work, and there are some examples of use in the
> regression tests.
>
>
> Significantly simplified places
> -------------------------------
>
> The patch version also significantly simplifies several places that were needlessly complex in the previous ones -
> firstly the function evaluating clauses on multivariate histograms was rather needlessly bloated, so I've simplified it
> a lot. Similarly for the code in clauselist_selectivity() that combines multiple statistics to estimate a list of clauses -
> that's much simpler now too. And various other pieces.
>
> That being said, I still think the code in clausesel.c can be simplified. I feel there's a lot of cruft, mostly due
> to unknowingly implementing something that could be solved by an existing function.
>
> A prime example of that is inspecting the expression tree to check if we know how to estimate the clauses using the
> multivariate statistics. That sounds like a nice match for expression walker, but currently is done by custom code. I
> plan to look at that next.
>
> Also, I'm not quite sure I understand what the varRelid parameter of clauselist_selectivity is for, so the code may
> be handling that wrong (seems to be working though).
>
>
> ndistinct coefficients
> ----------------------
>
> The one new piece in this patch is the GROUP BY estimation, based on the ndistinct coefficients. So for example you
> can do this:
>
>     CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
>                         FROM generate_series(1,1000000) s(i);
>     ANALYZE t;
>     EXPLAIN SELECT * FROM t GROUP BY a, b;
>
> which currently does this:
>
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Group  (cost=127757.34..135257.34 rows=99996 width=8)
>    Group Key: a, b
>    ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
>          Sort Key: a, b
>          ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
> (5 rows)
>
> but we know that there are only 1000 groups because the columns are correlated. So let's create ndistinct statistics
> on the two columns:
>
>     CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
>     ANALYZE t;
>
> which results in estimates like this:
>
>                            QUERY PLAN
> -----------------------------------------------------------------
>  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
>    Group Key: a, b
>    ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
> (3 rows)
>
> I'm not quite sure how to combine this type of statistics with MCV lists and histograms, so for now it's used only
> for GROUP BY.

Well, firstly, the patches all apply.

But I have a question (which is coming really late, but I'll ask it
anyway).  Is it intended that CREATE STATISTICS will only be for
multivariate statistics?  Or do you think we could add support for
expression statistics in future too?

e.g.

CREATE STATISTICS stats_comment_length ON comments (length(comment));


I also note that the docs contain this:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name ( [ { column_name } ] [, ...])
[ WITH ( statistics_parameter [= value] [, ... ] )

The open square bracket before WITH doesn't get closed.  Also, it
indicates that columns are entirely options, so () would be valid, but
that's not the case. Also, a space is missing after the first
ellipsis.  So I think this should read:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name ( { column_name } [, ... ])
[ WITH ( statistics_parameter [= value] [, ... ] ) ]
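
For instance, a statement matching the corrected synopsis might look like
this (hypothetical statistics name; the option names are the ones used
elsewhere in this thread):

CREATE STATISTICS stats_t_ab ON t (a, b) WITH (dependencies, histogram);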

Regards

Thom



Re: multivariate statistics v10

From
Tomas Vondra
Date:
Hi,

On 03/02/2016 05:17 PM, Thom Brown wrote:
...
> Well, firstly, the patches all apply.
>
> But I have a question (which is coming really late, but I'll ask it
> anyway).  Is it intended that CREATE STATISTICS will only be for
> multivariate statistics?  Or do you think we could add support for
> expression statistics in future too?
>
> e.g.
>
> CREATE STATISTICS stats_comment_length ON comments (length(comment));

Hmmm, that's not a use case I had in mind while working on the patch, 
but it sounds interesting. I don't see why the syntax would not support 
this - I'd like to add support for expressions into the multivariate 
patch, but that will still require at least 2 columns to build 
multivariate statistics. But perhaps it'd be possible to relax the "at 
least 2 columns" requirement, and collect regular statistics somewhere.

So I don't see why the syntax could not work for that case too, but I'm 
not going to work on that.
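
Just to sketch what that might look like if the syntax were relaxed (purely 
hypothetical - neither the column names nor the behaviour exist today):

     CREATE STATISTICS stats_comments ON comments (author_id, length(comment))
       WITH (dependencies);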

>
>
> I also note that the docs contain this:
>
> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name ( [
>    { column_name } ] [, ...])
> [ WITH ( statistics_parameter [= value] [, ... ] )
>
> The open square bracket before WITH doesn't get closed.  Also, it
> indicates that columns are entirely options, so () would be valid, but
> that's not the case. Also, a space is missing after the first
> ellipsis.  So I think this should read:
>
> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON table_name (
>    { column_name } [, ... ])
> [ WITH ( statistics_parameter [= value] [, ... ] ) ]

Yeah, will fix.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v11

From
Tomas Vondra
Date:
Hi,

attached is v11 of the patch - this is mostly a cleanup of v10, removing
redundant code, adding missing comments, removing obsolete FIXME/TODOs
and so on. Overall this shaves ~20kB from the patch (not a primary
objective, though).

The one thing this (hopefully) fixes is handling of varRelid. Apparently
I got that slightly wrong in the previous versions.

One thing I'm not quite sure about is schema of the new system catalog.
The existing catalog pg_statistic uses generic design with stakindN,
stanumbersN and stavaluesN columns, while the new catalog uses dedicated
columns for each type of stats (MCV, histogram, ...). Not sure whether
it's desirable to switch to the pg_statistic approach or not.

There are a few things I plan to look into next:

  * possibly more cleanups in clausesel.c (I'm wondering if some pieces
    should be moved to utils/mvstats/*.c)

  * a few FIXMEs in the infrastructure (e.g. deriving a name when not
    specified in CREATE STATISTICS)

  * move the ndistinct coefficients after functional dependencies in
    the patch series (but only use them for GROUP BY for now)

  * extend the functional dependencies to handle multiple columns on
    the left side (condition), i.e. dependencies like (a,b) -> c (see the
    sketch after this list)

  * address a few remaining FIXMEs in MCV/histograms building
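
As a rough illustration of the (a,b) -> c case (hypothetical data, just to
show the shape of such a dependency):

     CREATE TABLE t2 AS
       SELECT i % 7 AS a, i % 11 AS b, i % 77 AS c
         FROM generate_series(1,100000) s(i);
     -- c is determined by the pair (a,b), but not by a or b alone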


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v11

From
Jeff Janes
Date:
On Tue, Mar 8, 2016 at 12:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> attached is v11 of the patch - this is mostly a cleanup of v10, removing
> redundant code, adding missing comments, removing obsolete FIXME/TODOs
> and so on. Overall this shaves ~20kB from the patch (not a primary
> objective, though).

This has some conflicts with the pathification commit, in the
regression tests.

To avoid that, I applied it to the commit before that, 3fc6e2d7f5b652b417fa6^

Having done that, in my hands, it fails its own regression tests.
Diff attached.

It breaks contrib postgres_fdw, I'll look into that when I get a
chance if no one beats me to it.

postgres_fdw.c: In function 'postgresGetForeignJoinPaths':
postgres_fdw.c:3623: error: too few arguments to function
'clauselist_selectivity'
postgres_fdw.c:3642: error: too few arguments to function
'clauselist_selectivity'

Cheers,

Jeff

Attachment

Re: multivariate statistics v11

From
Tomas Vondra
Date:
Hi,

thanks for looking at the patch. Sorry for the issues; attached is
version v13, which should fix them (or most of them).

On Tue, 2016-03-08 at 18:24 -0800, Jeff Janes wrote:
> On Tue, Mar 8, 2016 at 12:13 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Hi,
> >
> > attached is v11 of the patch - this is mostly a cleanup of v10, removing
> > redundant code, adding missing comments, removing obsolete FIXME/TODOs
> > and so on. Overall this shaves ~20kB from the patch (not a primary
> > objective, though).
>
> This has some conflicts with the pathification commit, in the
> regression tests.

Yeah, there was one join plan difference, due to the ndistinct
estimation patch. Meh. Fixed.

>
> To avoid that, I applied it to the commit before that, 3fc6e2d7f5b652b417fa6^

Rebased to 51c0f63e.

>
> Having done that, In my hands, it fails its own regression tests.
> Diff attached.

Fixed. This was caused by making names of the statistics unique across
tables, thus the regression tests started to fail when executed through
'make check' (but 'make installcheck' was still fine).

The diff however also includes a segfault, apparently in processing of
functional dependencies somewhere in ANALYZE. Sadly I've been unable to
reproduce any such failure, despite running the tests many times (even
when applied on the same commit). Is there any chance this might be due
to a broken build, or something like that? If not, can you try
reproducing it and investigating a bit (enable core dumps etc.)?

>
> It breaks contrib postgres_fdw, I'll look into that when I get a
> chance if no one beats me to it.
>
> postgres_fdw.c: In function 'postgresGetForeignJoinPaths':
> postgres_fdw.c:3623: error: too few arguments to function
> 'clauselist_selectivity'
> postgres_fdw.c:3642: error: too few arguments to function
> 'clauselist_selectivity'

Yeah, apparently there are two new calls to clauselist_selectivity, so I
had to add NIL as the list of conditions.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v11

From
Alvaro Herrera
Date:
Hi,

I gave a very quick skim to patch 0002.  Not a real review yet.  But
there are a few trivial points to fix:

* You still have empty sections in the SGML docs (such as the EXAMPLES).
I suppose the syntax is now firm enough that we can get some.  (I looked
at the other patches to see whether it was filled in, but couldn't find
any additional text there.)

* check_object_ownership() needs to be filled in

* Since you're adding a new object type, please add a case to cover it
in the object_address.sql pg_regress test.

* in analyze.c (and elsewhere), please put new #include lines sorted.

* I think the AT_PASS_ADD_STATS is a leftover which should be removed.

* The XXX comment in get_relation_info should probably be handled
differently (namely, in a way that makes the syscache not contain OIDs
of dropped stats)

* The README.dependencies has a lot of TODOs.  Do we need to get them
done during the first cut?  If not, I suggest creating a new section
"Future work" in the file.

* Please put the common.h header in src/include.  Make sure not to
include "postgres.h" in it -- our policy is that postgres.h goes at the
top of every .c file and never in any .h file.  Also please find a
better name for it; even mvstats_common.h would be a lot more
convincing.  However:

* ISTM that the code in common.c properly belongs in
src/backend/catalog/pg_mvstats.c instead (or more properly
catalog/pg_mv_statistics.c), which probably means the common.h file
should be named something else; perhaps some of it could become
pg_mv_statistic_fn.h, while the rest continues to be
src/include/utils/mvstats_common.h?  Not sure.

* The version check in psql/describe.c uses 90500; should probably be
updated to 90600.

* _copyCreateStatsStmt is missing if_not_exists

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

thanks for the feedback. Attached is v14 of the patch series, fixing
most of the points you've raised.


On Wed, 2016-03-09 at 09:22 -0300, Alvaro Herrera wrote:
> Hi,
>
> I gave a very quick skim to patch 0002.  Not a real review yet.  But
> there are a few trivial points to fix:
>
> * You still have empty sections in the SGML docs (such as the EXAMPLES).
> I suppose the syntax is now firm enough that we can get some.  (I looked
> at the other patches to see whether it was filled in, but couldn't find
> any additional text there.)

Yes, that's one of the items I plan to work on next. Until now the
regression tests were a sufficient source of examples, but it's time to
do the SGML piece.

>
> * check_object_ownership() needs to be filled in

Done.

I've added pg_statistics_ownercheck, which also required adding OID of
the owner to the catalog. Initially the plan was to use the same owner
as for the table, but now that we've switched to CREATE STATISTICS,
partly because it will allow multi-table stats, that no longer makes
sense (multiple tables may have different owners).

This probably means we also need an 'ALTER STATISTICS ... OWNER TO'
command, which does not exist at this point.
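
Something along these lines, presumably (hypothetical syntax, not
implemented at this point):

    ALTER STATISTICS s1 OWNER TO some_other_role;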

>
> * Since you're adding a new object type, please add a case to cover it
> in the object_address.sql pg_regress test.

Done.

Apparently there was a bunch of missing pieces in objectaddress.c, so
this adds them too.

>
> * in analyze.c (and elsewhere), please put new #include lines sorted.

Done.

I've also significantly reduced the excessive list of includes in
statscmds.c. I expect the headers to require a bit more love, especially
in the subsequent patches (MCV, histograms etc.).

>
> * I think the AT_PASS_ADD_STATS is a leftover which should be removed.

Yeah. Now that we've invented CREATE STATISTICS, all the changes to
tablecmds.c were just unnecessary leftovers. Removed.

>
> * The XXX comment in get_relation_info should probably be handled
> differently (namely, in a way that makes the syscache not contain OIDs
> of dropped stats)

I believe that was actually an obsolete comment. Removed.

>
> * The README.dependencies has a lot of TODOs.  Do we need to get them
> done during the first cut?  If not, I suggest creating a new section
> "Future work" in the file.

Right. Most of those TODOs are future work, or rather ideas (more or
less crazy). The one thing I definitely want to address now is support
for dependencies with multiple columns on the left side, because that
requires changes to serialized format. I might also look at handling IS
NULL clauses, but that may wait.

>
> * Please put the common.h header in src/include.  Make sure not to
> include "postgres.h" in it -- our policy is that postgres.h goes at the
> top of every .c file and never in any .h file.  Also please find a
> better name for it; even mvstats_common.h would be a lot more
> convincing.  However:
>
> * ISTM that the code in common.c properly belongs in
> src/backend/catalog/pg_mvstats.c instead (or more properly
> catalog/pg_mv_statistics.c), which probably means the common.h file
> should be named something else; perhaps some of it could become
> pg_mv_statistic_fn.h, while the rest continues to be
> src/include/utils/mvstats_common.h?  Not sure.

Hmmm, not sure either. The idea was that the "common.h" is pretty much
just a private header with stuff that's not very useful anywhere else.

No changes here, for now.

>
> * The version check in psql/describe.c uses 90500; should probably be
> updated to 90600.

Fixed.

>
> * _copyCreateStatsStmt is missing if_not_exists

Fixed.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Jeff Janes
Date:
On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> thanks for the feedback. Attached is v14 of the patch series, fixing
> most of the points you've raised.


Hi Tomas,

Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
check if I configure without --enable-cassert.

With --enable-cassert, it passes the regression test.

I got the core file, configured and compiled with:
CFLAGS="-fno-omit-frame-pointer"  --enable-debug

The first core dump is on this statement:
 -- check explain (expect bitmap index scan, not plain index scan)
 INSERT INTO functional_dependencies
     SELECT i/10000, i/20000, i/40000 FROM generate_series(1,1000000) s(i);

bt

#0  0x00000000006e1160 in cost_qual_eval (cost=0x2494418,
quals=0x2495550, root=0x2541b88) at costsize.c:3181
#1  0x00000000006e1ee5 in set_baserel_size_estimates (root=0x2541b88,
rel=0x2494300) at costsize.c:3754
#2  0x00000000006d37e8 in set_plain_rel_size (root=0x2541b88,
rel=0x2494300, rte=0x247e660) at allpaths.c:480
#3  0x00000000006d353d in set_rel_size (root=0x2541b88, rel=0x2494300,
rti=1, rte=0x247e660) at allpaths.c:350
#4  0x00000000006d338f in set_base_rel_sizes (root=0x2541b88) at allpaths.c:270
#5  0x00000000006d3233 in make_one_rel (root=0x2541b88,
joinlist=0x2494628) at allpaths.c:169
#6  0x000000000070012e in query_planner (root=0x2541b88,
tlist=0x2541e58, qp_callback=0x7048d4 <standard_qp_callback>,
qp_extra=0x7ffefa6474e0)   at planmain.c:246
#7  0x0000000000702a33 in grouping_planner (root=0x2541b88,
inheritance_update=0 '\000', tuple_fraction=0) at planner.c:1647
#8  0x0000000000701310 in subquery_planner (glob=0x2541af8,
parse=0x246a838, parent_root=0x0, hasRecursion=0 '\000',
tuple_fraction=0) at planner.c:740
#9  0x000000000070055b in standard_planner (parse=0x246a838,
cursorOptions=256, boundParams=0x0) at planner.c:290
#10 0x000000000070023f in planner (parse=0x246a838, cursorOptions=256,
boundParams=0x0) at planner.c:160
#11 0x00000000007b8bf9 in pg_plan_query (querytree=0x246a838,
cursorOptions=256, boundParams=0x0) at postgres.c:798
#12 0x00000000005d1967 in ExplainOneQuery (query=0x246a838, into=0x0,
es=0x246a778,   queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM
mcv_list WHERE a = 10 AND b = 5;", params=0x0) at explain.c:350
#13 0x00000000005d16a3 in ExplainQuery (stmt=0x2444f90,
queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM mcv_list
WHERE a = 10 AND b = 5;",   params=0x0, dest=0x246a6e8) at explain.c:244
#14 0x00000000007c0afb in standard_ProcessUtility (parsetree=0x2444f90,   queryString=0x2443d80 "EXPLAIN (COSTS off)\n
SELECT* FROM
 
mcv_list WHERE a = 10 AND b = 5;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0,   dest=0x246a6e8, completionTag=0x7ffefa647b60 "") at utility.c:659
#15 0x00000000007c0299 in ProcessUtility (parsetree=0x2444f90,
queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM mcv_list
WHERE a = 10 AND b = 5;",   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x246a6e8,
completionTag=0x7ffefa647b60 "") at utility.c:335
#16 0x00000000007bf47b in PortalRunUtility (portal=0x23ed510,
utilityStmt=0x2444f90, isTopLevel=1 '\001', dest=0x246a6e8,
completionTag=0x7ffefa647b60 "")   at pquery.c:1183
#17 0x00000000007bf1ce in FillPortalStore (portal=0x23ed510,
isTopLevel=1 '\001') at pquery.c:1057
#18 0x00000000007beb19 in PortalRun (portal=0x23ed510,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x253f6c0,
altdest=0x253f6c0,   completionTag=0x7ffefa647d40 "") at pquery.c:781
#19 0x00000000007b90ae in exec_simple_query (query_string=0x2443d80
"EXPLAIN (COSTS off)\n SELECT * FROM mcv_list WHERE a = 10 AND b =
5;")   at postgres.c:1094
#20 0x00000000007bcfac in PostgresMain (argc=1, argv=0x23d5070,
dbname=0x23d4e48 "regression", username=0x23d4e30 "jjanes") at
postgres.c:4021
#21 0x0000000000745a62 in BackendRun (port=0x23f4110) at postmaster.c:4258
#22 0x00000000007451d6 in BackendStartup (port=0x23f4110) at postmaster.c:3932
#23 0x0000000000741ab7 in ServerLoop () at postmaster.c:1690
#24 0x00000000007411c0 in PostmasterMain (argc=8, argv=0x23d3f20) at
postmaster.c:1298
#25 0x0000000000690026 in main (argc=8, argv=0x23d3f20) at main.c:223

Cheers,

Jeff



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Hi,
> >
> > thanks for the feedback. Attached is v14 of the patch series, fixing
> > most of the points you've raised.
>
>
> Hi Tomas,
>
> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
> check if I configure without --enable-cassert.

Ah, after disabling asserts I can reproduce it too. And the reason why
it fails is quite simple - clauselist_selectivity modifies the original
list of clauses, which then confuses cost_qual_eval.

Can you try if the attached patch fixes the issue? I'll need to rework a
bit more of the code, but let's see if this fixes the issue on your
machine too.

> With --enable-cassert, it passes the regression test.

I wonder how it can work with casserts and fail without them. That's
kinda exactly the opposite of what I'd expect ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Jeff Janes
Date:
On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
>> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Hi,
>> >
>> > thanks for the feedback. Attached is v14 of the patch series, fixing
>> > most of the points you've raised.
>>
>>
>> Hi Tomas,
>>
>> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
>> check if I configure without --enable-cassert.
>
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.
>
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.

Yes, that fixes it.


>
>> With --enable-cassert, it passes the regression test.
>
> I wonder how can it work with casserts and fail without them. That's
> kinda exactly the opposite to what I'd expect ...

I too was surprised by that.  Maybe cassert makes a copy of some data
structure which is used in-place without cassert?

Thanks,

Jeff



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On Wed, 2016-03-09 at 18:21 +0100, Tomas Vondra wrote:
> Hi,
> 
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> > On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > > Hi,
> > >
> > > thanks for the feedback. Attached is v14 of the patch series, fixing
> > > most of the points you've raised.
> > 
> > 
> > Hi Tomas,
> > 
> > Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
> > check if I configure without --enable-cassert.
> 
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.

More precisely, it gets confused because the first clause in the list
gets deleted but cost_qual_eval never learns about that, and follows
a stale pointer to the next cell, hence the segfault.

> 
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.
> 
> > With --enable-cassert, it passes the regression test.
> 
> I wonder how can it work with casserts and fail without them. That's
> kinda exactly the opposite to what I'd expect ...

FWIW it seems to be somehow related to this assert in clausesel.c:
  Assert(count_mv_attnums(list_union(stat_clauses, stat_conditions),
                          relid, MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST) >= 2);
 

With the assert in place, the code passes without a failure. After
removing the assert (commenting it out), or even just changing it to
   Assert(count_mv_attnums(stat_clauses, relid,
                           MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST)
          + count_mv_attnums(stat_conditions, relid,
                             MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST) >= 2);
 

i.e. removing the list_union, it fails as expected.

The only thing that I can think of is that list_union happens to place
the right stuff at the right position in memory - pure luck.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: multivariate statistics v14

From
Jeff Janes
Date:
On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
>> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Hi,
>> >
>> > thanks for the feedback. Attached is v14 of the patch series, fixing
>> > most of the points you've raised.
>>
>>
>> Hi Tomas,
>>
>> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
>> check if I configure without --enable-cassert.
>
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.
>
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.

That patch on top of v14 did fix the original problem.  But I got
another segfault:

jjanes=# create table foo as select x, floor(x/(10000000/500))::int as
y  from generate_series(1,10000000) f(x);
jjanes=# create index on foo (x,y);
jjanes=# create index on foo (y,x);
jjanes=# create statistics jjj on foo (x,y) with (dependencies,histogram);
jjanes=# analyze ;
server closed the connection unexpectedly

#0  multi_sort_add_dimension (mss=mss@entry=0x7f45dafc7c88,
sortdim=sortdim@entry=0, dim=dim@entry=0,
vacattrstats=vacattrstats@entry=0x16f0dd0) at common.c:436
#1  0x00000000007d022a in update_bucket_ndistinct (attrs=0x166fdf8,
stats=0x16f0dd0, bucket=<optimized out>) at histogram.c:1384
#2  0x00000000007d09aa in create_initial_mv_bucket (stats=0x16f0dd0,
attrs=0x166fdf8, rows=0x17cda20, numrows=30000) at histogram.c:880
#3  build_mv_histogram (numrows=30000, rows=rows@entry=0x170ecf0,
attrs=attrs@entry=0x166fdf8, stats=stats@entry=0x16f0dd0,
numrows_total=numrows_total@entry=30000)   at histogram.c:156
#4  0x00000000007ced19 in build_mv_stats
(onerel=onerel@entry=0x7f45e797d040, totalrows=9999985,
numrows=numrows@entry=30000, rows=rows@entry=0x170ecf0,
natts=natts@entry=2,   vacattrstats=vacattrstats@entry=0x166efa0) at common.c:106
#5  0x000000000055ff6b in do_analyze_rel
(onerel=onerel@entry=0x7f45e797d040, options=options@entry=2,
va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>,
relpages=44248,   inh=inh@entry=0 '\000', in_outer_xact=in_outer_xact@entry=0
'\000', elevel=elevel@entry=13, params=0x7ffcbe382a30) at
analyze.c:585
#6  0x0000000000560ced in analyze_rel (relid=relid@entry=16441,
relation=relation@entry=0x16bc9d0, options=options@entry=2,
params=params@entry=0x7ffcbe382a30,   va_cols=va_cols@entry=0x0, in_outer_xact=<optimized out>,
bstrategy=0x16640f0) at analyze.c:262
#7  0x00000000005b70fd in vacuum (options=2, relation=0x16bc9d0,
relid=relid@entry=0, params=params@entry=0x7ffcbe382a30, va_cols=0x0,
bstrategy=<optimized out>,   bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:313
#8  0x00000000005b748e in ExecVacuum (vacstmt=vacstmt@entry=0x16bca20,
isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:121
#9  0x00000000006c90f3 in standard_ProcessUtility
(parsetree=0x16bca20, queryString=0x16bbfc0 "analyze foo ;",
context=<optimized out>, params=0x0, dest=0x16bcd60,   completionTag=0x7ffcbe382fa0 "") at utility.c:654
#10 0x00007f45e413b1d1 in pgss_ProcessUtility (parsetree=0x16bca20,
queryString=0x16bbfc0 "analyze foo ;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x16bcd60,   completionTag=0x7ffcbe382fa0 "") at
pg_stat_statements.c:986
#11 0x00000000006c6841 in PortalRunUtility (portal=0x16f7700,
utilityStmt=0x16bca20, isTopLevel=<optimized out>, dest=0x16bcd60,
completionTag=0x7ffcbe382fa0 "") at pquery.c:1175
#12 0x00000000006c73c5 in PortalRunMulti
(portal=portal@entry=0x16f7700, isTopLevel=isTopLevel@entry=1 '\001',
dest=dest@entry=0x16bcd60, altdest=altdest@entry=0x16bcd60,   completionTag=completionTag@entry=0x7ffcbe382fa0 "") at
pquery.c:1306
#13 0x00000000006c7dd9 in PortalRun (portal=portal@entry=0x16f7700,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x16bcd60,   altdest=altdest@entry=0x16bcd60,
completionTag=completionTag@entry=0x7ffcbe382fa0 "") at pquery.c:813
#14 0x00000000006c5c98 in exec_simple_query (query_string=0x16bbfc0
"analyze foo ;") at postgres.c:1094
#15 PostgresMain (argc=<optimized out>, argv=argv@entry=0x164baf8,
dbname=0x164b9a8 "jjanes", username=<optimized out>) at
postgres.c:4021
#16 0x000000000047cb1e in BackendRun (port=0x1669d40) at postmaster.c:4258
#17 BackendStartup (port=0x1669d40) at postmaster.c:3932
#18 ServerLoop () at postmaster.c:1690
#19 0x000000000066ff27 in PostmasterMain (argc=argc@entry=1,
argv=argv@entry=0x164aa10) at postmaster.c:1298
#20 0x000000000047d35e in main (argc=1, argv=0x164aa10) at main.c:228

Cheers,

Jeff



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On Sat, 2016-03-12 at 23:30 -0800, Jeff Janes wrote:
> On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> > >
> > > On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> > > <tomas.vondra@2ndquadrant.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > thanks for the feedback. Attached is v14 of the patch series,
> > > > fixing
> > > > most of the points you've raised.
> > >
> > > Hi Tomas,
> > >
> > > Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in
> > > make
> > > check if I configure without --enable-cassert.
> > Ah, after disabling asserts I can reproduce it too. And the reason
> > why
> > it fails is quite simple - clauselist_selectivity modifies the
> > original
> > list of clauses, which then confuses cost_qual_eval.
> >
> > Can you try if the attached patch fixes the issue? I'll need to
> > rework a
> > bit more of the code, but let's see if this fixes the issue on your
> > machine too.
> That patch on top of v14 did fix the original problem.  But I got
> another segfault:

Oh, yeah. There was an extra pfree().

Attached is v15 of the patch series, fixing this and also doing quite a
few additional improvements:

* added some basic examples into the SGML documentation

* addressing the objectaddress omissions, as pointed out by Alvaro

* support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA

* significant refactoring of MCV and histogram code, particularly 
  serialization, deserialization and building

* reworking the functional dependencies to support more complex 
  dependencies, with multiple columns as 'conditions'

* the reduction using functional dependencies is also significantly 
  simplified (I decided to get rid of computing the transitive closure 
  for now - it got too complex after the multi-condition dependencies, 
  so I'll leave that for the future)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
> Instead of simply multiplying the ndistinct estimate with selecticity,
> we instead use the formula for the expected number of distinct values
> observed in 'k' rows when there are 'd' distinct values in the bin
> 
>     d * (1 - ((d - 1) / d)^k)
> 
> This is 'with replacements' which seems appropriate for the use, and it
> mostly assumes uniform distribution of the distinct values. So if the
> distribution is not uniform (e.g. there are very frequent groups) this
> may be less accurate than the current algorithm in some cases, giving
> over-estimates. But that's probably better than OOM.
> ---
>  src/backend/utils/adt/selfuncs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index f8d39aa..6eceedf 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -3466,7 +3466,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
>              /*
>               * Multiply by restriction selectivity.
>               */
> -            reldistinct *= rel->rows / rel->tuples;
> +            reldistinct = reldistinct * (1 - powl((reldistinct - 1) / reldistinct,rel->rows));

Why do you change "*=" style? I see no reason to change this.
        reldistinct *= 1 - powl((reldistinct - 1) / reldistinct, rel->rows);

Looks better to me because it's shorter and cleaner.
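
For what it's worth, a quick numeric check of the formula (with made-up
values d = 1000 distinct values and k = 500 rows):

    SELECT 1000 * (1 - power(999::numeric / 1000, 500));   -- roughly 394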

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
I apologize if this has already been discussed. I am new to this patch.

> Attached is v15 of the patch series, fixing this and also doing quite a
> few additional improvements:
>
> * added some basic examples into the SGML documentation
>
> * addressing the objectaddress omissions, as pointed out by Alvaro
>
> * support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA
>
> * significant refactoring of MCV and histogram code, particularly 
>   serialization, deserialization and building
>
> * reworking the functional dependencies to support more complex 
>   dependencies, with multiple columns as 'conditions'
>
> * the reduction using functional dependencies is also significantly 
>   simplified (I decided to get rid of computing the transitive closure 
>   for now - it got too complex after the multi-condition dependencies, 
>   so I'll leave that for the future)

Do you have any other missing parts in this work? I am asking because
I wonder if you want to push this into 9.6 or rather 9.7.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Kyotaro HORIGUCHI
Date:
Hello, I returned to this.

At Sun, 13 Mar 2016 22:59:38 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<1457906378.27231.10.camel@2ndquadrant.com>
> Oh, yeah. There was an extra pfree().
>
> Attached is v15 of the patch series, fixing this and also doing quite a
> few additional improvements:
>
> * added some basic examples into the SGML documentation
>
> * addressing the objectaddress omissions, as pointed out by Alvaro
>
> * support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA
>
> * significant refactoring of MCV and histogram code, particularly 
>   serialization, deserialization and building
>
> * reworking the functional dependencies to support more complex 
>   dependencies, with multiple columns as 'conditions'
>
> * the reduction using functional dependencies is also significantly 
>   simplified (I decided to get rid of computing the transitive closure 
>   for now - it got too complex after the multi-condition dependencies, 
>   so I'll leave that for the future)

Many trailing white spaces found.

0002

+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
2014 should be 2016?

This patch defines many "magic"s for many structs, but magic(number)s seems to be used to identify file or buffer page
in PostgreSQL. They wouldn't be needed if you don't intend to dig out or identify the orphan memory blocks of mvstats.

+    MVDependency    deps[1];    /* XXX why not a pointer? */

MVDependency seems to be a pointer type.

+        if (numcols >= MVSTATS_MAX_DIMENSIONS)
+            ereport(ERROR,
and
+        Assert((attrs->dim1 >= 2) && (attrs->dim1 <= MVSTATS_MAX_DIMENSIONS));

seem to be contradicting.

.. Sorry, time is up..

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center





Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/16/2016 09:31 AM, Kyotaro HORIGUCHI wrote:
> Hello, I returned to this.
>
> At Sun, 13 Mar 2016 22:59:38 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<1457906378.27231.10.camel@2ndquadrant.com>
>> Oh, yeah. There was an extra pfree().
>>
>> Attached is v15 of the patch series, fixing this and also doing quite a
>> few additional improvements:
>>
>> * added some basic examples into the SGML documentation
>>
>> * addressing the objectaddress omissions, as pointed out by Alvaro
>>
>> * support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA
>>
>> * significant refactoring of MCV and histogram code, particularly
>>   serialization, deserialization and building
>>
>> * reworking the functional dependencies to support more complex
>>   dependencies, with multiple columns as 'conditions'
>>
>> * the reduction using functional dependencies is also significantly
>>   simplified (I decided to get rid of computing the transitive closure
>>   for now - it got too complex after the multi-condition dependencies,
>>   so I'll leave that for the future)
>
> Many trailing white spaces found.

Sorry, I hadn't noticed that after one of the rebases. Fixed in the
attached v15 of the patch.

>
> 0002
>
> + * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
>
>  2014 should be 2016?

Yes, the copyright info will need some tweaks. There are a few other files
with 2015, and I think the starting year should be the current one (and not 1996).

>
>
>  This patch defines many "magic"s for many structs, but
>  magic(number)s seems to be used to identify file or buffer page
>  in PostgreSQL. They wouldn't be needed if you don't intend to
>  dig out or identify the orphan memory blocks of mvstats.
>
> +    MVDependency    deps[1];    /* XXX why not a pointer? */
>
> MVDependency seems to be a pointer type.

Right, but we need an array of the structures here, so one way is to use
a pointer and the other is to use a variable-length field. I will remove
the comment; I think the structure is fine as is.

>
> +        if (numcols >= MVSTATS_MAX_DIMENSIONS)
> +            ereport(ERROR,
> and
> +        Assert((attrs->dim1 >= 2) && (attrs->dim1 <= MVSTATS_MAX_DIMENSIONS));
>
> seem to be contradicting.

Nope, because the first check is in a loop where 'numcols' is used as an
index into an array with MVSTATS_MAX_DIMENSIONS elements.

>
> .. Sorry, time is up..

Thanks for the comments!

Attached is v15 of the patch, which also fixes one mistake - after
reworking the functional dependencies to support multiple columns on the
left side (as conditions), I failed to move it to the proper place in
the patch series. So 0002 built the dependencies in the old way and 0003
changed it to the new one. That was pointless and added another 20kB to
the patch, so v15 moves the new code to 0002.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/16/2016 03:58 AM, Tatsuo Ishii wrote:
> I apologize if this has already been discussed. I am new to this patch.
>
>> Attached is v15 of the patch series, fixing this and also doing quite a
>> few additional improvements:
>>
>> * added some basic examples into the SGML documentation
>>
>> * addressing the objectaddress omissions, as pointed out by Alvaro
>>
>> * support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA
>>
>> * significant refactoring of MCV and histogram code, particularly
>>   serialization, deserialization and building
>>
>> * reworking the functional dependencies to support more complex
>>   dependencies, with multiple columns as 'conditions'
>>
>> * the reduction using functional dependencies is also significantly
>>   simplified (I decided to get rid of computing the transitive closure
>>   for now - it got too complex after the multi-condition dependencies,
>>   so I'll leave that for the future)
>
> Do you have any other missing parts in this work? I am asking
> because I wonder if you want to push this into 9.6 or rather 9.7.

I think the first few parts of the patch series, namely:
  * shared infrastructure (0002)
  * functional dependencies (0003)
  * MCV lists (0004)
  * histograms (0005)

might make it into 9.6. I believe the code for building and storing the 
different kinds of stats is reasonably solid. What probably needs more 
thorough review are the changes in clauselist_selectivity(), but the 
code in these parts is reasonably simple as it only supports using a 
single multivariate statistic per relation.

The part (0006) that allows using multiple statistics (i.e. selects 
which of the available stats to use and in what order) is probably the 
most complex part of the whole patch, and I myself do have some 
questions about some aspects of it. I don't think this part will get 
into 9.6 at this point (although it'd be nice if we managed to do that).

I can also imagine moving the ndistinct pieces forward, in front of 0006 
if that helps getting it into 9.6. There's a bit more work on making it 
more flexible, though, to allow handling subsets of columns (currently we 
need a perfect match).
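
To illustrate the "perfect match" limitation (hypothetical example): with 
ndistinct statistics defined on (a,b,c), a GROUP BY on just (a,b) is not 
able to use them yet:

     CREATE TABLE t3 AS SELECT mod(i,100) AS a, mod(i,100) AS b, mod(i,100) AS c
                          FROM generate_series(1,1000000) s(i);
     CREATE STATISTICS s3 ON t3 (a, b, c) WITH (ndistinct);
     ANALYZE t3;
     EXPLAIN SELECT 1 FROM t3 GROUP BY a, b;   -- only a subset of (a,b,c)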


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> Many trailing white spaces found.
> 
> Sorry, haven't noticed that after one of the rebases. Fixed in the
> attached v15 of the patch.

There are still a few trailing spaces.

/home/t-ishii/0002-shared-infrastructure-and-functional-dependencies.patch:3792: trailing whitespace.
/home/t-ishii/0004-multivariate-MCV-lists.patch:471: trailing whitespace.
/home/t-ishii/0004-multivariate-MCV-lists.patch:656: space before tab in indent.    {
/home/t-ishii/0004-multivariate-MCV-lists.patch:682: space before tab in indent.    }
/home/t-ishii/0004-multivariate-MCV-lists.patch:685: space before tab in indent.    {
/home/t-ishii/0004-multivariate-MCV-lists.patch:715: trailing whitespace.
/home/t-ishii/0006-multi-statistics-estimation.patch:2513: trailing whitespace.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/21/2016 12:00 AM, Tatsuo Ishii wrote:
>>> Many trailing white spaces found.
>>
>> Sorry, haven't noticed that after one of the rebases. Fixed in the
>> attached v15 of the patch.
>
> There are still few of traling spaces.
>
> /home/t-ishii/0002-shared-infrastructure-and-functional-dependencies.patch:3792: trailing whitespace.
> /home/t-ishii/0004-multivariate-MCV-lists.patch:471: trailing whitespace.
> /home/t-ishii/0004-multivariate-MCV-lists.patch:656: space before tab in indent.
>      {
> /home/t-ishii/0004-multivariate-MCV-lists.patch:682: space before tab in indent.
>      }
> /home/t-ishii/0004-multivariate-MCV-lists.patch:685: space before tab in indent.
>      {
> /home/t-ishii/0004-multivariate-MCV-lists.patch:715: trailing whitespace.
> /home/t-ishii/0006-multi-statistics-estimation.patch:2513: trailing whitespace.
>
> Best regards,

D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
remaining whitespace issues.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Alvaro Herrera
Date:
Another skim on 0002:

reference.sgml is missing a call to &alterStatistic.

ObjectProperty[] contains a comment that the ACL is "same as relation",
but is that still correct, given that now stats may be related to more
than one relation?  Do we even know what the rules for ACLs on
cross-relation stats are?  One very simple way to get around this is to
dictate that all the rels must have the same owner.  Perhaps we're not
considering the multi-relation case yet?

We have this FIXME comment in do_analyze_rel:

+     * FIXME This sample sizing is mostly OK when computing stats for
+     *       individual columns, but when computing multi-variate stats
+     *       for multivariate stats (histograms, mcv, ...) it's rather
+     *       insufficient. For stats on multiple columns / complex stats
+     *       we need larger sample sizes, because we need to build more
+     *       detailed stats (more MCV items / histogram buckets) to get
+     *       good accuracy. Maybe it'd be appropriate to use samples
+     *       proportional to the table (say, 0.5% - 1%) instead of a
+     *       fixed size might be more appropriate. Also, this should be
+     *       bound to the requested statistics size - e.g. number of MCV
+     *       items or histogram buckets should require several sample
+     *       rows per item/bucket (so the sample should be k*size).

Maybe this merits more discussion.  Right now we have an upper bound on
how much to scan for analyze; if we introduce the idea of scanning a
percentage of the relation, the time to analyze very large relations
could increase significantly.  Do we have an idea of what to do for
this?  For instance, a rule that would make me comfortable would say to
scan a sample 3x the current size when you have mvstats on 3 columns;
then the size of the fraction to scan is still bounded.  But does that
actually work?  From the wording of this comment, I assume you don't
actually know.

In this block (CreateStatistics)
+    /* look for duplicities */
+    for (i = 0; i < numcols; i++)
+        for (j = 0; j < numcols; j++)
+            if ((i != j) && (attnums[i] == attnums[j]))
+                ereport(ERROR,
+                        (errcode(ERRCODE_UNDEFINED_COLUMN),
+                         errmsg("duplicate column name in statistics definition")));

isn't it easier to have the inner loop go from i+1 to numcols?


I wonder if this is sensible with multi-relation statistics:
+    /*
+     * Store a dependency too, so that statistics are dropped on DROP TABLE
+     */
+    parentobject.classId = RelationRelationId;
+    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+    parentobject.objectSubId = 0;
+    childobject.classId = MvStatisticRelationId;
+    childobject.objectId = statoid;
+    childobject.objectSubId = 0;

I suppose the idea is to drop the stats if any of the rels they are for
is dropped.

Right after that you create a dependency on the schema.  Is that
necessary?  Since you have the dependency on the relation, the stats
would be dropped by recursion.

Why are you #include'ing builtins.h everywhere?

RelationGetMVStatList() needs a comment.

Please get rid of common.h.  It's totally unlike the way we structure
our header files.  We don't keep headers in src/backend; they're all in
src/include.  One reason is that the latter gets installed as a whole in
include/server, which this file will not be.  This file may be necessary
to build some extensions in the future, for example.

In mvstats.h, please mark function prototypes as "extern".

Many files need a pgindent pass.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Robert Haas
Date:
On Sun, Mar 20, 2016 at 11:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation?  Do we even know what the rules for ACLs on
> cross-relation stats are?  One very simple way to get around this is to
> dictate that all the rels must have the same owner.

That's not really all that simple - you'd have to forbid changing the
owner of a relation involved in multi-rel statistics, but that's
horrible.  Presumably at the very least you'd then have to find some
way of allowing the owner of everything in the group to be changed at
the same time, but that's a whole new innovation.  I think this is a
very messy line of attack.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/21/2016 10:34 AM, Robert Haas wrote:
> On Sun, Mar 20, 2016 at 11:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> ObjectProperty[] contains a comment that the ACL is "same as relation",
>> but is that still correct, given that now stats may be related to more
>> than one relation?  Do we even know what the rules for ACLs on
>> cross-relation stats are?  One very simple way to get around this is to
>> dictate that all the rels must have the same owner.
>
> That's not really all that simple - you'd have to forbid changing
> the owner of a relation involved in multi-rel statistics, but that's
> horrible. Presumably at the very least you'd then have to find some
> way of allowing the owner of everything in the group to be changed
> at the same time, but that's a whole new innovation. I think this is
> a very messy line of attack.

I agree. I don't think we should / need to impose such additional 
restrictions (e.g. same owner for all tables).

I think for using the statistics (to compute estimates for a query), it 
should be enough that the user can access all the tables it's built on. 
Which happens somehow implicitly, and currently it's trivial as each 
statistics is built on a single table.

I don't have a clear idea what should we do in the future with multiple 
tables (e.g. when the statistics is built on 3 tables, the query is on 2 
of them and the user does not have access to the remaining one).

But maybe we need to support ACLs because of ALTER STATISTICS?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/21/2016 04:34 AM, Alvaro Herrera wrote:
> Another skim on 0002:
>
> reference.sgml is missing a call to &alterStatistic.
>
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation?  Do we even know what the rules for ACLs on
> cross-relation stats are?  One very simple way to get around this is to
> dictate that all the rels must have the same owner.  Perhaps we're not
> considering the multi-relation case yet?

As I wrote in response to Robert's message, I don't think we need ACLs 
for statistics - the user should be able to use them when they can 
access all the underlying relations (in a query). For ALTER STATISTICS 
the (owner || superuser) check should be enough, right?

>
> We have this FIXME comment in do_analyze_rel:
>
> +     * FIXME This sample sizing is mostly OK when computing stats for
> +     *       individual columns, but when computing multi-variate stats
> +     *       for multivariate stats (histograms, mcv, ...) it's rather
> +     *       insufficient. For stats on multiple columns / complex stats
> +     *       we need larger sample sizes, because we need to build more
> +     *       detailed stats (more MCV items / histogram buckets) to get
> +     *       good accuracy. Maybe it'd be appropriate to use samples
> +     *       proportional to the table (say, 0.5% - 1%) instead of a
> +     *       fixed size might be more appropriate. Also, this should be
> +     *       bound to the requested statistics size - e.g. number of MCV
> +     *       items or histogram buckets should require several sample
> +     *       rows per item/bucket (so the sample should be k*size).
>
> Maybe this merits more discussion.  Right now we have an upper bound on
> how much to scan for analyze; if we introduce the idea of scanning a
> percentage of the relation, the time to analyze very large relations
> could increase significantly.  Do we have an idea of what to do for
> this?  For instance, a rule that would make me comfortable would say to
> scan a sample 3x the current size when you have a mvstats on 3 columns;
> then the size of fraction to scan is still bounded.  But does that
> actually work?  From the wording of this comment, I assume you don't
> actually know.

Yeah. I think more discussion is needed, because I myself am not sure 
the FIXME is actually correct. For now I think we're OK with using the 
same logic as statistics on a single column (300 * target).
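To put rough numbers on it: with the default statistics target (100) that
is 300 * 100 = 30,000 sampled rows, and the "3x for 3 columns" rule you
suggest would mean about 90,000 rows - still bounded, and independent of
the table size.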

>
> In this block (CreateStatistics)
> +    /* look for duplicities */
> +    for (i = 0; i < numcols; i++)
> +        for (j = 0; j < numcols; j++)
> +            if ((i != j) && (attnums[i] == attnums[j]))
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
> +                         errmsg("duplicate column name in statistics definition")));
>
> isn't it easier to have the inner loop go from i+1 to numcols?

It probably is.

>
> I wonder if this is sensible with multi-relation statistics:
> +    /*
> +     * Store a dependency too, so that statistics are dropped on DROP TABLE
> +     */
> +    parentobject.classId = RelationRelationId;
> +    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
> +    parentobject.objectSubId = 0;
> +    childobject.classId = MvStatisticRelationId;
> +    childobject.objectId = statoid;
> +    childobject.objectSubId = 0;
>
> I suppose the idea is to drop the stats if any of the rels they are for
> is dropped.

What do you mean by sensible? I mean, we don't support multiple tables 
at this point (except for choosing a syntax that should allow that), but 
the code assumes a single relation in a few places (like this one).

>
> Right after that you create a dependency on the schema.  Is that
> necessary?  Since you have the dependency on the relation, the stats
> would be dropped by recursion.

Hmmmm, that's probably right. Also, now that I think about it, it 
probably gets broken after ALTER STATISTICS ... SET SCHEMA, because the 
code does not remove the old dependency (and does not create a new one).

>
> Why are you #include'ing builtins.h everywhere?

Stupidity.

>
> RelationGetMVStatList() needs a comment.

OK.

>
> Please get rid of common.h.  It's totally unlike the way we structure
> our header files.  We don't keep headers in src/backend; they're all in
> src/include.  One reason is that the latter gets installed as a whole in
> include/server, which this file will not be.  This file may be necessary
> to build some extensions in the future, for example.

OK, I'll rework that and move it to src/include/.

>
> In mvstats.h, please mark function prototypes as "extern".
>
> Many files need a pgindent pass.

OK.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Jeff Janes
Date:
On Sun, Mar 20, 2016 at 4:34 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
> remaining whitespace issues.

Hi Tomas,

I'm trying out v16 against a common problem, where postgresql thinks
it is likely to stop early during an "order by (indexed expression) limit
1" but it doesn't actually stop early due to cross-column
correlations.  But the multivariate statistics don't seem to help.  Am
I doing this wrong, or just expecting too much?


jjanes=# create table foo as select x, floor(x/(10000000/500))::int as
y  from generate_series(1,10000000) f(x);
jjanes=# create index on foo (x,y);
jjanes=# create index on foo (y,x);
jjanes=# create statistics jjj on foo (x,y) with (dependencies,histogram);
jjanes=# vacuum analyze ;


jjanes=# explain (analyze, timing off)  select x from foo where y
between 478 and 480 order by x limit 1;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..4.92 rows=1 width=4) (actual rows=1 loops=1)
   ->  Index Only Scan using foo_x_y_idx on foo  (cost=0.43..210156.55 rows=46812 width=4) (actual rows=1 loops=1)
         Index Cond: ((y >= 478) AND (y <= 480))
         Heap Fetches: 0
 Planning time: 0.311 ms
 Execution time: 478.917 ms

Here it walks up the index on x, until it meets the first row meeting
the qualification on y. It thinks it will get to stop early and be
very fast, but it doesn't.

If I add a dummy addition to the ORDER BY, to force it not to walk
the index, I get a plan which uses the other index and is actually
much faster, but is planned to be several hundred times slower:


jjanes=# explain (analyze, timing off)  select x from foo where y
between 478 and 480 order by x+0 limit 1;
                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1803.77..1803.77 rows=1 width=8) (actual rows=1 loops=1)
   ->  Sort  (cost=1803.77..1920.80 rows=46812 width=8) (actual rows=1 loops=1)
         Sort Key: ((x + 0))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Only Scan using foo_y_x_idx on foo  (cost=0.43..1569.70 rows=46812 width=8) (actual rows=60000 loops=1)
               Index Cond: ((y >= 478) AND (y <= 480))
               Heap Fetches: 0
 Planning time: 0.175 ms
 Execution time: 20.264 ms

(I use the "timing off" option, because without it the second plan
spends most of its time calling "gettimeofday")

Cheers,

Jeff



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> Do you have any other missing parts in this work? I am asking
>> because I wonder if you want to push this into 9.6 or rather 9.7.
> 
> I think the first few parts of the patch series, namely:
> 
>   * shared infrastructure (0002)
>   * functional dependencies (0003)
>   * MCV lists (0004)
>   * histograms (0005)
> 
> might make it into 9.6. I believe the code for building and storing
> the different kinds of stats is reasonably solid. What probably needs
> more thorough review are the changes in clauselist_selectivity(), but
> the code in these parts is reasonably simple as it only supports using
> a single multi-variate statistics per relation.
> 
> The part (0006) that allows using multiple statistics (i.e. selects
> which of the available stats to use and in what order) is probably the
> most complex part of the whole patch, and I myself do have some
> questions about some aspects of it. I don't think this part might get
> into 9.6 at this point (although it'd be nice if we managed to do
> that).

Hum. So without 0006 or beyond, there's not much benefit for the
PostgreSQL users, and you are not too confident about 0006 or
beyond. Then I would think it is a little bit hard to justify
putting 000[2-5] into 9.6. I really like this feature and would like
to see it in PostgreSQL someday, but I'm not sure if we should put the
patches (0002-0005) into PostgreSQL now. Please let me know if there are
some reasons we should put the patches into PostgreSQL now.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 06:53 AM, Jeff Janes wrote:
> On Sun, Mar 20, 2016 at 4:34 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
>> remaining whitespace issues.
>
> Hi Tomas,
>
> I'm trying out v16 against a common problem, where postgresql thinks
> it is likely top stop early during a "order by (index express) limit
> 1" but it doesn't actually stop early due to cross-column
> correlations.  But the multivariate statistics don't seem to help.  Am
> I doing this wrong, or just expecting too much?

Yes, I think you're expecting too much from the current patch.

I've been thinking about perhaps addressing cases like this in the 
future, but it requires tracking position within the table somehow (e.g. 
by means of including ctid in the table, or something like that), and 
the current patch does not implement that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hello,

On 03/22/2016 09:13 AM, Tatsuo Ishii wrote:
>>> Do you have any other missing parts in this work? I am asking
>>> because I wonder if you want to push this into 9.6 or rather 9.7.
>>
>> I think the first few parts of the patch series, namely:
>>
>>   * shared infrastructure (0002)
>>   * functional dependencies (0003)
>>   * MCV lists (0004)
>>   * histograms (0005)
>>
>> might make it into 9.6. I believe the code for building and storing
>> the different kinds of stats is reasonably solid. What probably needs
>> more thorough review are the changes in clauselist_selectivity(), but
>> the code in these parts is reasonably simple as it only supports using
>> a single multi-variate statistics per relation.
>>
>> The part (0006) that allows using multiple statistics (i.e. selects
>> which of the available stats to use and in what order) is probably the
>> most complex part of the whole patch, and I myself do have some
>> questions about some aspects of it. I don't think this part might get
>> into 9.6 at this point (although it'd be nice if we managed to do
>> that).
>
> Hum. So without 0006 or beyond, there's not much benefit for the
> PostgreSQL users, and you are not too confident about 0006 or
> beyond. Then I would think it is a little bit hard to justify in
> putting 000[2-5] into 9.6. I really like this feature and would like
> to see in PostgreSQL someday, but I'm not sure if we should put the
> patches (0002-0005) into PostgreSQL now. Please let me know if there's
> some reaons we should put the patches into PostgreSQL now.

I don't think so. While being able to combine multiple statistics is 
certainly useful, I'm convinced that the initial patches add enough 
value on their own, even if the 0006 patch gets committed later.

A lot of queries will be just fine with the "single multivariate 
statistics" limitation, either because it's using less than 8 columns, 
or because only 8 columns are actually correlated. (FWIW the 8 column 
limit is mostly arbitrary, it may get increased if needed.)

I haven't really mentioned the aspects of 0006 that I think need more 
discussion, but it's mostly about the question whether combining the 
statistics by using the overlapping clauses as "conditions" is the right 
thing to do (or whether a more expensive approach is needed). None of 
that however invalidates the preceding patches.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> Hum. So without 0006 or beyond, there's not much benefit for the
>> PostgreSQL users, and you are not too confident about 0006 or
>> beyond. Then I would think it is a little bit hard to justify in
>> putting 000[2-5] into 9.6. I really like this feature and would like
>> to see in PostgreSQL someday, but I'm not sure if we should put the
>> patches (0002-0005) into PostgreSQL now. Please let me know if there's
>> some reaons we should put the patches into PostgreSQL now.
> 
> I don't think so. While being able to combine multiple statistics is
> certainly useful, I'm convinced that the initial patched add enough

Can you please elaborate a little bit more how combining multiple
statistics is useful?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>> PostgreSQL users, and you are not too confident about 0006 or
>>> beyond. Then I would think it is a little bit hard to justify in
>>> putting 000[2-5] into 9.6. I really like this feature and would
>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>> know if there's some reaons we should put the patches into
>>> PostgreSQL now.
>>
>> I don't think so. While being able to combine multiple statistics
>> is certainly useful, I'm convinced that the initial patched add
>> enough
>
> Can you please elaborate a little bit more how combining multiple
> statistics is useful?

Sure.

The goal of multivariate statistics is to approximate a probability 
distribution on a group of columns. The larger the number of columns, 
the less accurate the statistics will be (with respect to individual 
columns), assuming fixed size of the sample in ANALYZE, and fixed 
statistics size.

For example, if you add a column to multivariate histogram, you'll do 
some "bucket splits" by this dimension, thus reducing the accuracy for 
the other columns. You may of course allow larger statistics (e.g. 
histograms with more buckets), but that also requires larger samples, 
and so on.

Now, let's assume you have a query like this:

    WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)

and that "a" and "b" are correlated, and "c" and "d" are correlated, but 
that otherwise the columns are independent. It'd be a bit silly to 
require building statistics on (a,b,c,d), when two statistics on each of 
the column pairs would be cheaper and also more accurate.

That's of course a trivial case - independent groups of correlated 
columns. But I'd say this is actually a pretty common case, and I do 
believe there's not much controversy that we should support it.

Another reason to allow multiple statistics is that columns in one group 
may be a good fit for MCV list (which works well for discrete values), 
while the other group may be a good candidate for histogram (which works 
well for continuous values). This can't be solved by first building a 
MCV and then a histogram on the group.
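To illustrate (using the syntax from the current patches; the statistics
names and the exact option names here are just examples), the idea is
that you could do

    CREATE STATISTICS stats_ab ON t (a, b) WITH (mcv);
    CREATE STATISTICS stats_cd ON t (c, d) WITH (histogram);

instead of being forced into a single, larger and less accurate
statistics on (a, b, c, d).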

The question of course is what to do if the groups are not independent. 
The patch does that by assuming the statistics overlap, and uses 
conditions on the columns included in both statistics to combine them 
using conditional probabilities. I do believe this works quite well, but 
this is perhaps the part that needs further discussion. There are other 
ways to combine the statistics, but I do expect them to be considerably 
more expensive.
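Roughly speaking (this is a simplification, not the exact formula the
code uses): with one statistics on (a,b) and another on (b,c), and
clauses on all three columns, the overlapping column "b" serves as the
condition, i.e.

    P(a,b,c) ~ P(a,b) * P(c | b)

so the second statistics only contributes the part not already covered
by the first one.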

Is this a sufficient explanation?

Of course, there's a fair amount of additional complexity that I have 
not mentioned here (e.g. selecting the right combination of stats).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
> On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>>> PostgreSQL users, and you are not too confident about 0006 or
>>>> beyond. Then I would think it is a little bit hard to justify in
>>>> putting 000[2-5] into 9.6. I really like this feature and would
>>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>>> know if there's some reaons we should put the patches into
>>>> PostgreSQL now.
>>>
>>> I don't think so. While being able to combine multiple statistics
>>> is certainly useful, I'm convinced that the initial patched add
>>> enough
>>
>> Can you please elaborate a little bit more how combining multiple
>> statistics is useful?
> 
> Sure.
> 
> The goal of multivariate statistics is to approximate a probability
> distribution on a group of columns. The larger the number of columns,
> the less accurate the statistics will be (with respect to individual
> columns), assuming fixed size of the sample in ANALYZE, and fixed
> statistics size.
> 
> For example, if you add a column to multivariate histogram, you'll do
> some "bucket splits" by this dimension, thus reducing the accuracy for
> the other columns. You may of course allow larger statistics
> (e.g. histograms with more buckets), but that also requires larger
> samples, and so on.
> 
> Now, let's  assume you have a query like this:
> 
>     WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
> 
> and that "a" and "b" are correlated, and "c" and "d" are correlated,
> but that otherwise the columns are independent. It'd be a bit silly to
> require building statistics on (a,b,c,d), when two statistics on each
> of the column pairs would be cheaper and also more accurate.
> 
> That's of course a trivial case - independent groups of correlated
> columns. But I'd say this is actually a pretty common case, and I do
> believe there's not much controversy that we should support it.
> 
> Another reason to allow multiple statistics is that columns in one
> group may be a good fit for MCV list (which works well for discrete
> values), while the other group may be a good candidate for histogram
> (which works well for continuous values). This can't be solved by
> first building a MCV and then a histogram on the group.
> 
> The question of course is what to do if the groups are not
> independent. The patch does that by assuming the statistics overlap,
> and uses conditions on the columns included in both statistics to
> combine them using conditional probabilities. I do believe this works
> quite well, but this is perhaps the part that needs further
> discussion. There are other ways to combine the statistics, but I do
> expect them to be considerably more expensive.
> 
> Is this a sufficient explanation?
> 
> Of course, there's a fair amount of additional complexity that I have
> not mentioned here (e.g. selecting the right combination of stats).

Sorry, maybe I did not explain clearly. My question is, if we put
only patches 0002 to 0005 into 9.6, do they still give any visible
benefit to users?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 01:46 PM, Tatsuo Ishii wrote:
...
> Sorry, maybe I did not explain clearly. My question is, if put
> patches only 0002 to 0005 into 9.6, does it still give any visible
> benefit to users?

The users will be able to define statistics with the limitation that 
only a single one (the one covering the most columns referenced by the 
clauses) can be used when estimating a query. Which is not perfect, but 
I think it's a valuable improvement.

It might also be possible to split 0006 into smaller pieces, for example 
implementing the "non-overlapping statistics" case first and then 
extending it to more complicated cases. That might increase the chance 
of getting at least some of that into 9.6 ...

But it's not clear whether even the initial chunks are likely to make it 
into 9.6 - I kinda expect a fair amount of comments about the preceding 
parts from Tom Lane, who mentioned he might look at the patch this week. 
So I'm not sure splitting 0006 into smaller pieces makes sense at this 
point.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
> The users will be able to define statistics with the limitation that
> only a single one (the one covering the most columns referenced by the
> clauses) can be used when estimating a query. Which is not perfect,
> but I think it's a valuable improvement.
> 
> It might also be possible to split 0006 into smaller pieces, for
> example implementing the "non-overlapping statistics" case first and
> then extending it to more complicated cases. That might increase the
> change of getting at least some of that into 9.6 ...
> 
> But considering it's not clear whether the initial chunks are likely
> to make it into 9.6 - I kinda expect a fair amount of comments from TL
> about the preceding parts, who mentioned he might look at the patch
> this week. So I'm not sure splitting 0006 into smaller pieces makes
> sense at this point.

Thanks for the explanation. I will look into patch 0001 to 0005 so
that they could get into 9.6.

In the mean time after applying patch 0001 to 0005 of v16, I get this
while compiling SGML docs.

openjade:ref/create_statistics.sgml:281:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"
openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/23/2016 02:53 AM, Tatsuo Ishii wrote:
>> The users will be able to define statistics with the limitation that
>> only a single one (the one covering the most columns referenced by the
>> clauses) can be used when estimating a query. Which is not perfect,
>> but I think it's a valuable improvement.
>>
>> It might also be possible to split 0006 into smaller pieces, for
>> example implementing the "non-overlapping statistics" case first and
>> then extending it to more complicated cases. That might increase the
>> change of getting at least some of that into 9.6 ...
>>
>> But considering it's not clear whether the initial chunks are likely
>> to make it into 9.6 - I kinda expect a fair amount of comments from TL
>> about the preceding parts, who mentioned he might look at the patch
>> this week. So I'm not sure splitting 0006 into smaller pieces makes
>> sense at this point.
>
> Thanks for the explanation. I will look into patch 0001 to 0005 so
> that they could get into 9.6.
>
> In the mean time after applying patch 0001 to 0005 of v16, I get this
> while compiling SGML docs.
>
> openjade:ref/create_statistics.sgml:281:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"
> openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"

I believe this is because reference.sgml is missing a call to 
&alterStatistic (per report by Alvaro Herrera).

thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> Thanks for the explanation. I will look into patch 0001 to 0005 so
>> that they could get into 9.6.
>>
>> In the mean time after applying patch 0001 to 0005 of v16, I get this
>> while compiling SGML docs.
>>
>> openjade:ref/create_statistics.sgml:281:26:X: reference to
>> non-existent ID "SQL-ALTERSTATISTICS"
>> openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent
>> ID "SQL-ALTERSTATISTICS"
> 
> I believe this is because reference.sgml is missing a call to
> &alterStatistic (per report by Alvaro Herrera).

Ok, I will patch reference.sgml.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> I believe this is because reference.sgml is missing a call to
>> &alterStatistic (per report by Alvaro Herrera).
> 
> Ok, I will patch reference.sgml.

Here are some comments on docs.

- There's no docs for pg_mv_statistic (should be added to "49. System Catalogs")

- The word "multivariate statistics" or something like that should appear in the index.

- There are some explanation how to deal with multivariate statistics
  in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
  section.

I am now looking into the create statistics doc to see if the example
appearing in it is working. I will get back if I find any.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>>> I believe this is because reference.sgml is missing a call to
>>> &alterStatistic (per report by Alvaro Herrera).
>> 
>> Ok, I will patch reference.sgml.
> 
> Here are some comments on docs.
> 
> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
> 
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
Oops. Should read "There should be some explanations".

>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.
> 
> I am now looking into the create statistics doc to see if the example
> appearing in it is working. I will get back if I find any.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>> I am now looking into the create statistics doc to see if the example
>> appearing in it is working. I will get back if I find any.

I have the ref doc: CREATE STATISTICS

There are nice examples how the multivariate statistics gives better
row number estimation. So I gave them a try.

"Create table t1 with two functionally dependent columns,i.e. knowledge of a value in the first column is sufficient
fordeterminingthe value in the other column" The example creates table"t1", then populates it using generate_series.
AfterCREATESTATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrateshow result rows estimation is enhanced
byusing the multivariatestatistics.
 

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 1000000
 Planning time: 0.146 ms
 Execution time: 76.896 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 1000000
 Planning time: 0.102 ms
 Execution time: 78.885 ms
(5 rows)

It seems the row numbers estimation (98) using the multivariate
statistics is actually *worse* than the one (1) not using the
statistics because the actual row number is 0.

Next example (using table "t2") is much better than the case using t1.

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.107 ms
 Execution time: 75.680 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.067 ms
 Execution time: 76.935 ms
(5 rows)

This time it seems the row numbers estimation (9633) using the
multivariate statistics is much better than the one (91) not using the
statistics because the actual row number is 10000.

The last example (using table "t3") seems no effect by multivariate statistics.

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 500) AND (b > 500);
                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=0.154..132.509 rows=6002 loops=1)
   Filter: ((a < '500'::double precision) AND (b > '500'::double precision))
   Rows Removed by Filter: 993998
 Planning time: 0.080 ms
 Execution time: 132.735 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 400) AND (b > 600);
                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=110.518..110.518 rows=0 loops=1)
   Filter: ((a < '400'::double precision) AND (b > '600'::double precision))
   Rows Removed by Filter: 1000000
 Planning time: 0.052 ms
 Execution time: 110.531 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 500) AND (b > 500);
                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=0.149..129.718 rows=5999 loops=1)
   Filter: ((a < '500'::double precision) AND (b > '500'::double precision))
   Rows Removed by Filter: 994001
 Planning time: 0.058 ms
 Execution time: 129.893 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 400) AND (b > 600);
                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=108.015..108.015 rows=0 loops=1)
   Filter: ((a < '400'::double precision) AND (b > '600'::double precision))
   Rows Removed by Filter: 1000000
 Planning time: 0.037 ms
 Execution time: 108.027 ms
(5 rows)

This time it seems the row numbers estimation (111123) using the
multivariate statistics is the same as the one (111123) not
using the statistics, because the actual row number is 5999 or 0.

In summary, the only case which shows the effect of the multivariate
statistics is the "t2" case. So I don't see why other examples are
shown in the manual. Am I missing something?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/23/2016 06:20 AM, Tatsuo Ishii wrote:
>>> I am now looking into the create statistics doc to see if the example
>>> appearing in it is working. I will get back if I find any.
>
> I have the ref doc: CREATE STATISTICS
>
> There are nice examples how the multivariate statistics gives better
> row number estimation. So I gave them a try.
>
> "Create table t1 with two functionally dependent columns,
>  i.e. knowledge of a value in the first column is sufficient for
>  determining the value in the other column" The example creates table
>  "t1", then populates it using generate_series. After CREATE
>  STATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrates
>  how result rows estimation is enhanced by using the multivariate
>  statistics.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> ---------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.146 ms
>  Execution time: 76.896 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.102 ms
>  Execution time: 78.885 ms
> (5 rows)
>
> It seems the row numbers estimation (98) using the multivariate
> statistics is actually *worse* than the one (1) not using the
> statistics because the actual row number is 0.

Yes, there's a mistake in the first query, because the conditions 
actually are not compatible. I.e. (i/100)=1 and (i/500)=1 have no 
overlapping rows, clearly. It should be

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

instead. Will fix.
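For context, the example builds t1 roughly like this (the exact DDL in
the docs may differ slightly):

    CREATE TABLE t1 (a INT, b INT);
    INSERT INTO t1 SELECT i/100, i/500 FROM generate_series(1,1000000) s(i);

so (a = 1) means i in [100,199] while (b = 1) means i in [500,999], which
can't overlap, whereas (b = 0) covers i in [0,499].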

>
> Next example (using table "t2") is much better than the case using t1.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                                QUERY PLAN
> --------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.107 ms
>  Execution time: 75.680 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.067 ms
>  Execution time: 76.935 ms
> (5 rows)
>
> This time it seems the row numbers estimation (9633) using the
> multivariate statistics is much better than the one (91) not using the
> statistics because the actual row number is 10000.
>
> The last example (using table "t3") seems no effect by multivariate statistics.

Yes. There's a typo in the example - it analyzes the wrong table (t2 
instead of t3). Once I fix that, the estimates are much better.

> In summary, the only case which shows the effect of the multivariate
> statistics is the "t2" case. So I don't see why other examples are
> shown in the manual. Am I missing something?

No, thanks for spotting those mistakes. I'll fix them and submit a new 
version of the patch - either later today or perhaps tomorrow.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Petr Jelinek
Date:
Hi,

I'll add a couple of code comments from my first cursory read through 
(this is huge):

0002:
there is some whitespace noise between the varlistentries in 
alter_statistics.sgml

+    parentobject.classId = RelationRelationId;
+    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+    parentobject.objectSubId = 0;
+    childobject.classId = MvStatisticRelationId;
+    childobject.objectId = statoid;
+    childobject.objectSubId = 0;

I wonder if this (there is similar code in several places) would be 
simpler done using ObjectAddressSet().
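Untested, but I mean something along these lines:

    ObjectAddressSet(parentobject, RelationRelationId,
                     RelationGetRelid(rel));
    ObjectAddressSet(childobject, MvStatisticRelationId, statoid);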

The common.h in backend/utils/mvstat is slightly weird header file 
placement and naming.


0004:
+/* used for merging bitmaps - AND (min), OR (max) */
+#define MAX(x, y) (((x) > (y)) ? (x) : (y))
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))

Huh? We have Max and Min macros defined in c.h

+        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);

Why the double space (that's actually in several places in several of 
the patches).

I don't really understand why 0008 and 0009 are separate patches and 
aren't part of one of the other patches. But otherwise good job on 
splitting the functionality into patchset.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

attached is v17 of the patch series, with these changes:

* rebase to current master (the AM patch caused some conflicts)
* add alterStatistics to reference.sgml (Alvaro)
* move the sample size discussion to README.stats (Alvaro)
* tweak the inner for loop in CREATE STATISTICS (Alvaro)
* use ObjectAddressSet() to create dependencies in statscmds.c (Petr)
* fix whitespace in alterStatistics.sgml (Petr)
* replace custom MIN/MAX with Min/Max from c.h (Petr)
* fix examples in createStatistics.sgml (Tatsuo)

A few more comments inline:

On 03/23/2016 07:23 PM, Petr Jelinek wrote:
>
> The common.h in backend/utils/mvstat is slightly weird header file
> placement and naming.
>

True. I plan to move this header to

     src/include/catalog/pg_mv_statistic_fn.h

which is what the other catalogs do (as pointed out by Alvaro). Or do you
think another location/name would be more appropriate?

>
> +        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
>
> Why the double space (that's actually in several places in several of
> the patches).

To align the whole block like this:

     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
     nulls[Anum_pg_mv_statistic_standist -1] = true;

But I won't fight for this too hard, if it breaks rules somehow.

>
> I don't really understand why 0008 and 0009 are separate patches and
> aren't part of one of the other patches. But otherwise good job on
> splitting the functionality into patchset.

That is mostly because both 0007 and 0008 tweak the GROUP BY estimates,
but 0008 is not really part of this patch (it's discussed separately in
another thread). I admit it may be a bit confusing.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

> >+        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
> >
> >Why the double space (that's actually in several places in several of
> >the patches).
> 
> To align the whole block like this:
> 
>     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
>     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
>     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
>     nulls[Anum_pg_mv_statistic_standist -1] = true;
> 
> But I won't fight for this too hard, if it breaks rules somehow.

Yeah, it will be undone by pgindent.  I suggest you pgindent all the
patches in the series.  With some clever patch vs. patch -R application,
you can do it without having to resolve any conflicts when pgindent
modifies code that a patch further up in the series modifies again.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/24/2016 06:45 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>>> +        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
>>>
>>> Why the double space (that's actually in several places in several of
>>> the patches).
>>
>> To align the whole block like this:
>>
>>     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
>>     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
>>     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
>>     nulls[Anum_pg_mv_statistic_standist -1] = true;
>>
>> But I won't fight for this too hard, if it breaks rules somehow.
>
> Yeah, it will be undone by pgindent.  I suggest you pgindent all the
> patches in the series.  With some clever patch vs. patch -R application,
> you can do it without having to resolve any conflicts when pgindent
> modifies code that a patch further up in the series modifies again.
>

I could do that, but isn't that a bit pointless? I thought pgindent is 
run regularly on the whole codebase, not for individual patches. Sure, 
it'll tweak the formatting on a few places in the patch (including the 
code discussed above, as you pointed out), but there are many other such 
places coming from other committed patches.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I could do that, but isn't that a bit pointless? I thought pgindent is 
> run regularly on the whole codebase, not for individual patches. Sure, 
> it'll tweak the formatting on a few places in the patch (including the 
> code discussed above, as you pointed out), but there are many other such 
> places coming from other committed patches.

One point of running pgindent for yourself is to make sure you haven't set
up any code in a way that will look horrible after pgindent gets done with
it.
        regards, tom lane



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/25/2016 10:26 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I could do that, but isn't that a bit pointless? I thought pgindent is
>> run regularly on the whole codebase, not for individual patches. Sure,
>> it'll tweak the formatting on a few places in the patch (including the
>> code discussed above, as you pointed out), but there are many other such
>> places coming from other committed patches.
>
> One point of running pgindent for yourself is to make sure you
> haven't set up any code in a way that will look horrible after
> pgindent gets done with it.

Fair point. Attached is v18 of the patch, after pgindent cleanup.

FWIW, most of the tweaks were minor things like (! x) instead of (!x)
and so on. I also had to fix a few comments with internal formatting,
because pgindent decided to reformat the text using tabs etc.

There are a few places where I reverted the pgindent formatting, because
it seemed a bit too weird - the first is the lists of function
prototypes in common.h/mvstat.h, the second is the function calls to the
_greedy/_exhaustive methods.

None of those places would however qualify as 'horrible' in my opinion,
and the _greedy/_exhaustive functions are in the 0006 part, so fixing
that is not of immediate importance I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
> Fair point. Attached is v18 of the patch, after pgindent cleanup.

Here is some feedback on the v18 patch.

1) regarding examples in the create_statistics manual

Here are the numbers I got. "with statistics" refers to the case where
multivariate statistics are used.  "without statistics" refers to the
case where multivariate statistics are not used. The numbers denote
estimated_rows/actual_rows, so closer to 1.0 is better. Some numbers
are shown as a fraction to avoid division by 0. In my understanding cases
1, 3 and 4 showed that multivariate statistics are superior.

        with statistics    without statistics
case1   0.98               0.01
case2   98/0               1/0
case3   1.05               0.01
case4   1/0                103/0
case5   18.50              18.33
case6   111123/0           1111123/0

2) following comments by me are not addressed in the v18 patch.

> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
> 
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

> There are a few places where I reverted the pgindent formatting, because it
> seemed a bit too weird - the first one are the lists of function prototypes
> in common.h/mvstat.h, the second one are function calls to
> _greedy/_exhaustive methods.

Function prototypes being weird is something that we've learned to
accept.  There's no point in undoing pgindent decisions there, because
the next run will re-apply them anyway.  Best not to fight it.

What you should definitely look into fixing is the formatting of
comments, if the result is too horrible.  You can prevent it from
messing those by adding dashes /*----- at the beginning of the comment.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 03/26/2016 10:18 AM, Tatsuo Ishii wrote:
>> Fair point. Attached is v18 of the patch, after pgindent cleanup.
>
> Here are some feedbacks to v18 patch.
>
> 1) regarding examples in create_statistics manual
>
> Here are numbers I got. "with statistics" referrers to the case where
> multivariate statistics are used.  "without statistics" referrers to the
> case where multivariate statistics are not used. The numbers denote
> estimated_rows/actual_rows. Thus closer to 1.0 is better. Some numbers
> are shown as a fraction to avoid 0 division. In my understanding case
> 1, 3, 4 showed that multivariate statistics superior.
>
>     with statistics    without statistics
> case1    0.98        0.01
> case2    98/0        1/0

The case2 shows that functional dependencies assume that the conditions 
used in queries won't be incompatible - that's something this type of 
statistics can't fix.

> case3    1.05        0.01
> case4    1/0        103/0
> case5    18.50        18.33
> case6    111123/0    1111123/0

The last two lines (case5 + case6) seem a bit suspicious. I believe 
those are for the histogram data, and I do get these numbers:

case5    0.93 (5517 / 5949)         42.0 (249943 / 5949)
case6    100/0                      100/0

Perhaps you've been using the version before the bugfix, with ANALYZE on 
the wrong table?

>
> 2) following comments by me are not addressed in the v18 patch.
>
>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>   Catalogs")
>>
>> - The word "multivariate statistics" or something like that should
>>   appear in the index.
>>
>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.

Yes, those are valid omissions. I plan to address them, and I'm also 
considering adding a section to 65.1 (How the Planner Uses Statistics), 
explaining more thoroughly how the planner uses multivariate stats.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 03/26/2016 08:09 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> There are a few places where I reverted the pgindent formatting, because it
>> seemed a bit too weird - the first one are the lists of function prototypes
>> in common.h/mvstat.h, the second one are function calls to
>> _greedy/_exhaustive methods.
>
> Function prototypes being weird is something that we've learned to
> accept.  There's no point in undoing pgindent decisions there, because
> the next run will re-apply them anyway.  Best not to fight it.
>
> What you should definitely look into fixing is the formatting of
> comments, if the result is too horrible.  You can prevent it from
> messing those by adding dashes /*----- at the beginning of the comment.
>

Yep, formatting of some of the comments got slightly broken, but it 
wasn't difficult to fix that without the /*------- trick.

I'm not sure about the prototypes though. It was a bit weird because 
prototypes in the same header file were formatted very differently.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

> I'm not sure about the prototypes though. It was a bit weird because
> prototypes in the same header file were formatted very differently.

Yeah, it is very odd.  What happens is that the BSD indent binary does
one thing (return type is in one line and function name in following
line; subsequent argument lines are aligned to opening parens), then the
pgindent perl script changes it (moves function name to same line as
return type, but does not reindent subsequent lines of arguments).

You can imitate the effect by adding an extra newline just before the
function name, reflowing the arguments to align to the (, then deleting
the extra newline.  Rather annoying.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
David Steele
Date:
Hi Tomas,

On 3/28/16 4:42 AM, Tomas Vondra wrote:

> Yes, those are valid omissions. I plan to address them, and I'd also
> considering adding a section to 65.1 (How the Planner Uses Statistics),
> explaining more thoroughly how the planner uses multivariate stats.

It looks like you need to post a new patch, so I have marked this "waiting on 
author".

Thanks,
-- 
-David
david@pgmasters.net



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
>>     with statistics    without statistics
>> case1    0.98        0.01
>> case2    98/0        1/0
> 
> The case2 shows that functional dependencies assume that the
> conditions used in queries won't be incompatible - that's something
> this type of statistics can't fix.

It would be nice if that were mentioned in the manual, to avoid
confusing users.

>> case3    1.05        0.01
>> case4    1/0        103/0
>> case5    18.50        18.33
>> case6    111123/0    1111123/0
> 
> The last two lines (case5 + case6) seem a bit suspicious. I believe
> those are for the histogram data, and I do get these numbers:
> 
> case5    0.93 (5517 / 5949)         42.0 (249943 / 5949)
> case6    100/0                      100/0
> 
> Perhaps you've been using the version before the bugfix, with ANALYZE
> on the wrong table?

You are right. I accidentally ran ANALYZE on t2, not t3. Now I get these
numbers:

case5    1.23 (7367 / 5968)         41.7 (249118 / 5981)
case6    117/0                      162092/0

>> 2) following comments by me are not addressed in the v18 patch.
>>
>>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>>   Catalogs")
>>>
>>> - The word "multivariate statistics" or something like that should
>>>   appear in the index.
>>>
>>> - There are some explanation how to deal with multivariate statistics
>>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>>   section.
> 
> Yes, those are valid omissions. I plan to address them, and I'd also
> considering adding a section to 65.1 (How the Planner Uses
> Statistics), explaining more thoroughly how the planner uses
> multivariate stats.

Great.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Robert Haas
Date:
On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net> wrote:
> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>> Yes, those are valid omissions. I plan to address them, and I'd also
>> considering adding a section to 65.1 (How the Planner Uses Statistics),
>> explaining more thoroughly how the planner uses multivariate stats.
>
> It looks you need post a new patch so I have marked this "waiting on
> author".

Since no new version of this patch has been posted in the last 10
days, it seems clear that there will not be time for this to
reasonably become ready for committer and then get committed in the
few hours remaining before the deadline.  That is a bummer, since I
was hoping we would have this feature in this release, but hopefully
we will get it into 9.7.  I am marking it Returned with Feedback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 04/08/2016 05:55 PM, Robert Haas wrote:
> On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net> wrote:
>> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>>> Yes, those are valid omissions. I plan to address them, and I'd also
>>> considering adding a section to 65.1 (How the Planner Uses Statistics),
>>> explaining more thoroughly how the planner uses multivariate stats.
>>
>> It looks you need post a new patch so I have marked this "waiting on
>> author".
>
> Since no new version of this patch has been posted in the last 10
> days, it seems clear that there will not be time for this to
> reasonably become ready for committer and then get committed in the
> few hours remaining before the deadline. That is a bummer, since I
> was hoping we would have this feature in this release, but hopefully
> we will get it into 9.7. I am marking it Returned with Feedback.
>

Well, me too. But my feeling is the patch received an entirely 
insufficient amount of thorough code review, considering how important 
a part of the code it touches. I agree docs are an important part of a 
patch, but polishing user-level docs would hardly move the patch closer 
to being committable (especially when there's ~50kB of READMEs).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Robert Haas
Date:
On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Well, me to. But my feeling is the patch received entirely insufficient
> amount of thorough code review, considering how important part of the code
> it touches. I agree docs are an important part of a patch, but polishing
> user-level docs would hardly move the patch closer to being committable
> (especially when there's ~50kB of READMEs).

I have to admit that I was really hoping Tom would follow through on
his statement that he would look into this one, or that Dean Rasheed
would get involved.  I am sure I could do a good review of this patch
given enough time, but I am also sure that it would take an amount of
time that is at least one if not two orders of magnitude more than I
put into any patch this CommitFest.  I understand statistics at some
basic level, but I am not an expert on them the way some people here
are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics v14

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Well, me to. But my feeling is the patch received entirely insufficient
>> amount of thorough code review, considering how important part of the code
>> it touches. I agree docs are an important part of a patch, but polishing
>> user-level docs would hardly move the patch closer to being committable
>> (especially when there's ~50kB of READMEs).

> I have to admit that I was really hoping Tom would follow through on
> his statement that he would look into this one, or that Dean Rasheed
> would get involved.

I'm sorry I didn't get to it, but it's not like I have been slacking
during this commitfest.  At some point, you just have to accept that
not everything we could wish will get into 9.6.

I will make it a high priority for 9.7, though.
        regards, tom lane



Re: multivariate statistics v14

From
Robert Haas
Date:
On Fri, Apr 8, 2016 at 3:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Well, me to. But my feeling is the patch received entirely insufficient
>>> amount of thorough code review, considering how important part of the code
>>> it touches. I agree docs are an important part of a patch, but polishing
>>> user-level docs would hardly move the patch closer to being committable
>>> (especially when there's ~50kB of READMEs).
>
>> I have to admit that I was really hoping Tom would follow through on
>> his statement that he would look into this one, or that Dean Rasheed
>> would get involved.
>
> I'm sorry I didn't get to it, but it's not like I have been slacking
> during this commitfest.  At some point, you just have to accept that
> not everything we could wish will get into 9.6.

I did not mean to imply otherwise.  I'm just explaining why I didn't
spend time on it - I figured I was not the most qualified person, and
of course I have not been slacking either.  :-)

> I will make it a high priority for 9.7, though.

Woohoo!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Subject: Re: [HACKERS] multivariate statistics v14
Date: Fri, 8 Apr 2016 20:55:24 +0200
Message-ID: <5d1d62a6-6228-188c-e079-c1be59942168@2ndquadrant.com>

> On 04/08/2016 05:55 PM, Robert Haas wrote:
>> On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net>
>> wrote:
>>> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>>>> Yes, those are valid omissions. I plan to address them, and I'd also
>>>> considering adding a section to 65.1 (How the Planner Uses
>>>> Statistics),
>>>> explaining more thoroughly how the planner uses multivariate stats.
>>>
>>> It looks you need post a new patch so I have marked this "waiting on
>>> author".
>>
>> Since no new version of this patch has been posted in the last 10
>> days, it seems clear that there will not be time for this to
>> reasonably become ready for committer and then get committed in the
>> few hours remaining before the deadline. That is a bummer, since I
>> was hoping we would have this feature in this release, but hopefully
>> we will get it into 9.7. I am marking it Returned with Feedback.
>>
> 
> Well, me to. But my feeling is the patch received entirely
> insufficient amount of thorough code review, considering how important
> part of the code it touches. I agree docs are an important part of a
> patch, but polishing user-level docs would hardly move the patch
> closer to being committable (especially when there's ~50kB of
> READMEs).

My feedback regarding the docs was:
> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
>
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.

The second and the third points may be something like "polishing
user-level" docs, but I don't think the first one is "user-level".
Also, I think without the first one the patch will never be
committable. If someone adds a new system catalog, the docs should be
added to the "System Catalogs" section; that's our standard, at least
in my understanding.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Simon Riggs
Date:
On 8 April 2016 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
I will make it a high priority for 9.7, though.

That is my plan also. I've already started reviewing the non-planner parts anyway, specifically patch 0002.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hi,

On 04/09/2016 01:21 AM, Tatsuo Ishii wrote:
> From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
...
> My feedback regarding docs were:
>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>   Catalogs")
>>
>> - The word "multivariate statistics" or something like that should
>>   appear in the index.
>>
>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.
>
> The second and the third point maybe are something like "polishing
> user-level" docs, but I don't think the first one is for "user-level".
> Also I think without the first one the patch will be never
> committable. If someone add a new system catalog, the doc should be
> added to "System Catalogs" section, that's our standard, at least in
> my understanding.

I do apologize if it seemed that I don't value your review, and I do 
agree that those changes need to be done, although I still see them 
rather as user-level docs (as opposed to READMEs/comments, which I 
think are used by developers much more often).

But I still think it wouldn't move the patch any closer to committable 
state, because what it really needs is review whether the catalog 
definition makes sense, whether it should be more like pg_statistic, and 
so on. Only then it makes sense to describe the catalog structure in the 
SGML docs, I think. That's why I added some basic SGML docs for 
CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and 
not the catalog and other low-level stuff (which is commented heavily in 
the code anyway).

Had the patch been the Titanic, fixing the SGML docs a few days before 
the code freeze would be akin to washing the deck instead of looking 
for icebergs on April 15, 1912.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tatsuo Ishii
Date:
> But I still think it wouldn't move the patch any closer to committable
> state, because what it really needs is review whether the catalog
> definition makes sense, whether it should be more like pg_statistic,
> and so on. Only then it makes sense to describe the catalog structure
> in the SGML docs, I think. That's why I added some basic SGML docs for
> CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
> not the catalog and other low-level stuff (which is commented heavily
> in the code anyway).

Without "user-level docs" (now I understand that the term means all
SGML docs for you), it is very hard to find a visible
characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
just defines a user interface, and does not help how it affects to the
planning. The READMEs do not help either.

In this case reviewing your code is something like reviewing a program
which has no specification.

That's the reason why I made the suggestion below before, but it was
never seriously considered.

>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: multivariate statistics v14

From
Simon Riggs
Date:
On 9 April 2016 at 18:37, Tatsuo Ishii <ishii@postgresql.org> wrote:
> But I still think it wouldn't move the patch any closer to committable
> state, because what it really needs is review whether the catalog
> definition makes sense, whether it should be more like pg_statistic,
> and so on. Only then it makes sense to describe the catalog structure
> in the SGML docs, I think. That's why I added some basic SGML docs for
> CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
> not the catalog and other low-level stuff (which is commented heavily
> in the code anyway).

Without "user-level docs" (now I understand that the term means all
SGML docs for you), it is very hard to find a visible
characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
just defines a user interface, and does not help how it affects to the
planning. The READMEs do not help either.

In this case reviewing your code is something like reviewing a program
which has no specification.

That's the reason why I said before below, but it was never seriously
considered.

I would likely have said this myself but didn't even get that far. 

Your contribution was useful and went further than anybody else's review, so thank you.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From
Tomas Vondra
Date:
Hello,

On 04/09/2016 07:37 PM, Tatsuo Ishii wrote:
>> But I still think it wouldn't move the patch any closer to committable
>> state, because what it really needs is review whether the catalog
>> definition makes sense, whether it should be more like pg_statistic,
>> and so on. Only then it makes sense to describe the catalog structure
>> in the SGML docs, I think. That's why I added some basic SGML docs for
>> CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
>> not the catalog and other low-level stuff (which is commented heavily
>> in the code anyway).
>
> Without "user-level docs" (now I understand that the term means all
> SGML docs for you), it is very hard to find a visible
> characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
> just defines a user interface, and does not help how it affects to
> the planning. The READMEs do not help either.
>
> In this case reviewing your code is something like reviewing a
> program which has no specification.

I certainly agree that reviewing a patch without the context is hard. My 
intent was to provide such context / explanation in the READMEs, but 
perhaps I failed to do so with enough detail.

BTW when you say that READMEs do not help either, does that mean you 
consider READMEs unsuitable for this type of information in general, or 
that the current READMEs lack important information?

>
> That's the reason why I said before below, but it was never
> seriously considered.

I've considered it, but my plan was to have detailed READMEs, and then 
eventually distill that into something suitable for the SGML (perhaps 
without discussion of some implementation details). Maybe that's not the 
right approach.

FWIW providing the context is why I started working on a "paper" 
explaining both the motivation and implementation, including a bit of 
math and figures (which is what we don't have in READMEs or SGML). I 
haven't updated it recently, and it probably got buried in the thread, 
but perhaps this would be a better way to provide the context?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics v14

From
Tomas Vondra
Date:
On 04/10/2016 10:25 AM, Simon Riggs wrote:
> On 9 April 2016 at 18:37, Tatsuo Ishii <ishii@postgresql.org
> <mailto:ishii@postgresql.org>> wrote:
>
>     > But I still think it wouldn't move the patch any closer to committable
>     > state, because what it really needs is review whether the catalog
>     > definition makes sense, whether it should be more like pg_statistic,
>     > and so on. Only then it makes sense to describe the catalog structure
>     > in the SGML docs, I think. That's why I added some basic SGML docs for
>     > CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
>     > not the catalog and other low-level stuff (which is commented heavily
>     > in the code anyway).
>
>     Without "user-level docs" (now I understand that the term means all
>     SGML docs for you), it is very hard to find a visible
>     characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
>     just defines a user interface, and does not help how it affects to the
>     planning. The READMEs do not help either.
>
>     In this case reviewing your code is something like reviewing a program
>     which has no specification.
>
>     That's the reason why I said before below, but it was never seriously
>     considered.
>
>
> I would likely have said this myself but didn't even get that far.
>
> Your contribution was useful and went further than anybody else's
> review, so thank you.

100% agreed. Thanks for the useful feedback.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi,

Attached is v19 of the "multivariate stats" patch series - essentially
v18 rebased on top of current master. Aside from a few bug fixes, the
main improvement is addition of SGML docs demonstrating the statistics
in a way similar to the current "Row Estimation Examples" (and the docs
are actually in the same section). I've tried to keep the right amount
of technical detail (and pointing to the right README for additional
details), but this may need improvements. I have not written docs
explaining how statistics may be combined yet (more about this later).


There are two general design questions that I'd like to get feedback on:


1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty
much all this info is internal to clausesel.c and does not get outside.

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the
actual estimation, storing the information (which stats to use, etc.) in
a new type of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as
that builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in
EXPLAIN output. With per-column statistics it's fairly straightforward
to determine which statistics are used and how. But with multivariate
stats things are often more complicated - there may be multiple
candidate statistics (e.g. histograms covering different subsets of the
conditions), it's possible to apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only
within clausesel.c (and thrown away after the estimation).


2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different subsets of conditions) is important and useful, but I'm
starting to think that the current implementation may not be the correct
one (which is why I haven't written the SGML docs about this part of the
patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch
does about this:

    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient, but it only works as long as the query contains conditions
"connecting" the two statistics. So if we remove the "b=2" condition
from the query, this stops working.

But it's possible to do this differently, e.g. by doing this:

    P(a=1) * P(c=3|a=1)

where P(c=3|a=1) is estimated using (b,c), but uses (a,b) to restrict
the set of buckets to consider (if the statistics is a histogram). In
pseudo-code, it might look like this:

    buckets = {}
    foreach bucket x in (b,c):
        foreach bucket y in (a,b):
           if y matches (a=1) and overlap(x,y):
               buckets := buckets + x

which is the part of (b,c) matching (a=1), allowing us to compute the
conditional probability.
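
To make that a bit more concrete, here is a tiny self-contained C toy
(entirely made-up structures and values, nothing taken from the actual
patch) doing the same bucket matching on two hard-coded 2-D histograms:

    #include <stdio.h>
    #include <stdbool.h>

    /* toy 2-D bucket: per-column ranges plus a frequency */
    typedef struct
    {
        double  lo[2];
        double  hi[2];
        double  frequency;
    } Bucket;

    /* do two closed intervals overlap? */
    static bool
    ranges_overlap(double lo1, double hi1, double lo2, double hi2)
    {
        return (lo1 <= hi2) && (lo2 <= hi1);
    }

    int
    main(void)
    {
        /* histogram on (a,b): column 0 is a, column 1 is b */
        Bucket  ab[] = {
            {{0, 0}, {1, 4}, 0.5},
            {{2, 5}, {9, 9}, 0.5}
        };
        /* histogram on (b,c): column 0 is b, column 1 is c */
        Bucket  bc[] = {
            {{0, 0}, {4, 9}, 0.6},
            {{5, 0}, {9, 9}, 0.4}
        };
        double  a_value = 1.0;

        for (int i = 0; i < 2; i++)         /* bucket x in (b,c) */
        {
            bool    keep = false;

            for (int j = 0; j < 2; j++)     /* bucket y in (a,b) */
            {
                bool    matches_a = (ab[j].lo[0] <= a_value &&
                                     a_value <= ab[j].hi[0]);

                /* overlap is checked on the shared column b */
                if (matches_a &&
                    ranges_overlap(bc[i].lo[0], bc[i].hi[0],
                                   ab[j].lo[1], ab[j].hi[1]))
                {
                    keep = true;
                    break;
                }
            }

            printf("(b,c) bucket %d (freq %.1f): %s\n",
                   i, bc[i].frequency, keep ? "kept" : "discarded");
        }

        return 0;
    }

The conditional estimate would then be computed only from the buckets
that were kept.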

It may get more complicated, of course. In particular, there may be
different types of statistics, and we need to be able to "match" them
against each other. With just MCV lists and histograms that's probably
easy enough, but if we add other types of statistics, it may get way
more complicated.

I still think this is a useful capability, but perhaps there are better
ideas how to do that. In any case, it only affects the last part of the
patch (0006).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master. Aside from a few bug fixes, the main
> improvement is addition of SGML docs demonstrating the statistics in a way
> similar to the current "Row Estimation Examples" (and the docs are actually
> in the same section). I've tried to keep the right amount of technical
> detail (and pointing to the right README for additional details), but this
> may need improvements. I have not written docs explaining how statistics may
> be combined yet (more about this later).

What we have here is quite something:
$ git diff master --stat | tail -n1
 77 files changed, 12809 insertions(+), 65 deletions(-)
I will try to get familiar with the topic and have added myself as a
reviewer of this patch. Hopefully I'll have feedback soon.
-- 
Michael



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/05/2016 06:24 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v19 of the "multivariate stats" patch series - essentially v18
>> rebased on top of current master. Aside from a few bug fixes, the main
>> improvement is addition of SGML docs demonstrating the statistics in a way
>> similar to the current "Row Estimation Examples" (and the docs are actually
>> in the same section). I've tried to keep the right amount of technical
>> detail (and pointing to the right README for additional details), but this
>> may need improvements. I have not written docs explaining how statistics may
>> be combined yet (more about this later).
>
> What we have here is quite something:
> $ git diff master --stat | tail -n1
>  77 files changed, 12809 insertions(+), 65 deletions(-)
> I will try to get familiar on the topic and added myself as a reviewer
> of this patch. Hopefully I'll get feedback soon.

Yes, it's a large patch. Although 25% of the insertions are SGML docs, 
regression tests and READMEs, and large part of the remaining ~9k 
insertions are comments. But it may still be overwhelming, no doubt 
about that.

FWIW, if someone is interested in the patch but is unsure where to 
start, I'm ready to help with that as much as possible. For example if 
you happen to go to PostgresOpen, feel free to drag me to a corner and 
ask me as many questions as you want ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Sat, Aug 6, 2016 at 2:38 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/05/2016 06:24 AM, Michael Paquier wrote:
>>
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> Attached is v19 of the "multivariate stats" patch series - essentially
>>> v18
>>> rebased on top of current master. Aside from a few bug fixes, the main
>>> improvement is addition of SGML docs demonstrating the statistics in a
>>> way
>>> similar to the current "Row Estimation Examples" (and the docs are
>>> actually
>>> in the same section). I've tried to keep the right amount of technical
>>> detail (and pointing to the right README for additional details), but
>>> this
>>> may need improvements. I have not written docs explaining how statistics
>>> may
>>> be combined yet (more about this later).
>>
>>
>> What we have here is quite something:
>> $ git diff master --stat | tail -n1
>>  77 files changed, 12809 insertions(+), 65 deletions(-)
>> I will try to get familiar on the topic and added myself as a reviewer
>> of this patch. Hopefully I'll get feedback soon.
>
>
> Yes, it's a large patch. Although 25% of the insertions are SGML docs,
> regression tests and READMEs, and large part of the remaining ~9k insertions
> are comments. But it may still be overwhelming, no doubt about that.
>
> FWIW, if someone is interested in the patch but is unsure where to start,
> I'm ready to help with that as much as possible. For example if you happen
> to go to PostgresOpen, feel free to drag me to a corner and ask me as many
> questions as you want ...

Sure. Only PGconf SV is on my track this year.
-- 
Michael



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 1) enriching the query tree with multivariate statistics info
>
> Right now all the stuff related to multivariate statistics estimation
> happens in clausesel.c - matching condition to statistics, selection of
> statistics to use (if there are multiple usable stats), etc. So pretty much
> all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me at first sight, but...

> I'm starting to think that some of the steps (matching quals to stats,
> selection of stats) should happen in a "preprocess" step before the actual
> estimation, storing the information (which stats to use, etc.) in a new type
> of node in the query tree - something like RestrictInfo.
>
> I believe this needs to happen sometime after deconstruct_jointree() as that
> builds RestrictInfos nodes, and looking at planmain.c, right after
> extract_restriction_or_clauses seems about right. Haven't tried, though.
>
> This would move all the "statistics selection" logic from clausesel.c,
> separating it from the "actual estimation" and simplifying the code.
>
> But more importantly, I think we'll need to show some of the data in EXPLAIN
> output. With per-column statistics it's fairly straightforward to determine
> which statistics are used and how. But with multivariate stats things are
> often more complicated - there may be multiple candidate statistics (e.g.
> histograms covering different subsets of the conditions), it's possible to
> apply them in different orders, etc.
>
> But EXPLAIN can't show the info if it's ephemeral and available only within
> clausesel.c (and thrown away after the estimation).

This gives a good reason not to do that in clausesel.c; it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).
>
> Assume there's a table "t" with 3 columns (a, b, c), and that we're
> estimating query:
>
>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>
> but that we only have two statistics (a,b) and (b,c). The current patch does
> about this:
>
>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>
> i.e. it estimates the first two conditions using (a,b), and then estimates
> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
> but it only works as long as the query contains conditions "connecting" the
> two statistics. So if we remove the "b=2" condition from the query, this
> stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

Patch 0002:
+  <para>
+   <command>CREATE STATISTICS</command> will create a new multivariate
+   statistics on the table. The statistics will be created in the in the
+   current database. The statistics will be owned by the user issuing
+   the command.
+  </para>
s/in the in the/in the/.

+  <para>
+   Create table <structname>t1</> with two functionally dependent columns, i.e.
+   knowledge of a value in the first column is sufficient for detemining the
+   value in the other column. Then functional dependencies are built on those
+   columns:
s/detemining/determining/

+  <para>
+   If a schema name is given (for example, <literal>CREATE STATISTICS
+   myschema.mystat ...</>) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of
+   the table must be distinct from the name of any other statistics in the
+   same schema.
+  </para>
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.

+ALTER STATISTICS <replaceable class="parameter">name</replaceable>
OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
CURRENT_USER | SESSION_USER }
On the same line, is OWNER TO really necessary? I could have assumed
that if a user is able to query the set of columns related to a
statistics, he should have access to it.

=# create statistics aa_a_b3 on aam (a, b) with (dependencies);
ERROR:  23505: duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL:  Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
SCHEMA NAME:  pg_catalog
TABLE NAME:  pg_mv_statistic
CONSTRAINT NAME:  pg_mv_statistic_name_index
LOCATION:  _bt_check_unique, nbtinsert.c:433
When creating multivariate statistics with a name that already exists,
this error message should be friendlier.
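
Just to illustrate the shape of the check I have in mind (the lookup
helper below is purely hypothetical; only ereport() with
ERRCODE_DUPLICATE_OBJECT is the standard piece), an explicit pre-check
before inserting the catalog row would read much better:

    /* hypothetical pre-check in the CREATE STATISTICS code path;
     * statistics_name_exists() is invented here just for illustration */
    if (statistics_name_exists(staname, namespaceId))
        ereport(ERROR,
                (errcode(ERRCODE_DUPLICATE_OBJECT),
                 errmsg("statistics \"%s\" already exists", staname)));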

=# create table aa (a int, b int);
CREATE TABLE
=# create view aav as select * from aa;
CREATE VIEW
=# create statistics aab_v on aav (a, b) with (dependencies);
CREATE STATISTICS
Why do views and foreign tables support this command? This code also
mentions that this case is not actually supported:
+       /* multivariate stats are supported on tables and matviews */
+       if (rel->rd_rel->relkind == RELKIND_RELATION ||
+           rel->rd_rel->relkind == RELKIND_MATVIEW)
+           tupdesc = RelationGetDescr(rel);
};

+
 /*
Spurious noise in the patch.

+   /* check that at least some statistics were requested */
+   if (!build_dependencies)
+       ereport(ERROR,
+               (errcode(ERRCODE_SYNTAX_ERROR),
+                errmsg("no statistics type (dependencies) was requested")));
So, WITH (dependencies) is mandatory in any case. Why not just
drop it from the first cut then?

pg_mv_stats shows only the attribute numbers of the columns it has
stats on, I think that those should be the column names. [...after a
while...], as it is mentioned here:
+ * TODO  Would be nice if this printed column names (instead of just attnums).

Does this work properly with DDL deparsing? If yes, could it be
possible to add tests in test_ddl_deparse? This is a new object type,
so those look necessary I think.

Statistics definition reorder the columns by itself depending on their
order. For example:
create table aa (a int, b int);
create statistics aas on aa(b, a) with (dependencies);
\d aa
    "public.aas" (dependencies) ON (a, b)
As this defines a correlation between multiple columns, isn't it wrong
to assume that (b, a) and (a, b) are always the same correlation? I
don't recall such properties as being always commutative (old
memories, I suck at stats in general). [...reading README...] So this
is caused by the implementation limitations that only limit the
analysis between interactions of two columns. Still it seems incorrect
to reorder the user-visible portion.

The comment on top of get_relation_info needs to be updated to mention
that mvstatlist gets fetched as well.

+   while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+       /* TODO maybe include only already built statistics? */
+       result = insert_ordered_oid(result, HeapTupleGetOid(htup));
I haven't looked at the rest of the series yet, but I'd think that
including the ones not built may be a good idea, to let the caller do
more filtering itself. Of course this depends on the next series...

+typedef struct MVDependencyData
+{
+   int         nattributes;    /* number of attributes */
+   int16       attributes[1];  /* attribute numbers */
+} MVDependencyData;
You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.
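
Something along these lines, that is (the standard flexible-array idiom
from c.h; the exact layout in the patch may of course differ):

    typedef struct MVDependencyData
    {
        int         nattributes;    /* number of attributes */
        int16       attributes[FLEXIBLE_ARRAY_MEMBER]; /* attribute numbers */
    } MVDependencyData;

    /*
     * Allocations then size the struct explicitly, along the lines of:
     *
     *   dep = (MVDependencyData *) palloc0(offsetof(MVDependencyData, attributes)
     *                                      + nattrs * sizeof(int16));
     */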

+++ b/src/test/regress/serial_schedule
@@ -167,3 +167,4 @@ test: with
 test: xml
 test: event_trigger
 test: stats
+test: mv_dependencies
This test is not listed in parallel_schedule.

s/Apllying/Applying/

There is a lot of mumbo-jumbo regarding the way dependencies are
stored with mainly serialize_mv_dependencies and
deserialize_mv_dependencies that operates them from bytea/dep trees.
That's not cool and not portable because pg_mv_statistic represents
that as pure bytea. I would suggest creating a generic data type that
does those operations, named like pg_dependency_tree and then use that
in those new catalogs. pg_node_tree is a precedent of such a thing.
New features could as well make use of this new data type if we are
able to design that in a way generic enough, so that would be a base
patch that the current 0002 applies on top of.

Regarding psql:
- The new commands lack psql completion, that would ease the use of
the new commands.
- Would it make sense to have a backslash command to show the list of
statistics?

Congratulations. I just looked at 25% of the overall patch and my mind
is already blown away, but I am catching up with the rest...
-- 
Michael



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.
>

I think there are two separate questions:

(a) Whether the query plan is "enriched" with information about 
statistics, or whether this information is ephemeral and available only 
in clausesel.c.

(b) Where exactly this enrichment happens.

Theoretically we might enrich the query plan (add nodes with info about 
the statistics), so that EXPLAIN gets the info, and it might still 
happen in clausesel.c.

>> 2) combining multiple statistics
>>
>> I think the ability to combine multivariate statistics (covering different
>> subsets of conditions) is important and useful, but I'm starting to think
>> that the current implementation may not be the correct one (which is why I
>> haven't written the SGML docs about this part of the patch series yet).
>>
>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>> estimating query:
>>
>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>
>> but that we only have two statistics (a,b) and (b,c). The current patch does
>> about this:
>>
>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>
>> i.e. it estimates the first two conditions using (a,b), and then estimates
>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
>> but it only works as long as the query contains conditions "connecting" the
>> two statistics. So if we remove the "b=2" condition from the query, this
>> stops working.
>
> This is trying to make the algorithm smarter than the user, which is
> something I'd think we could live without. In this case statistics on
> (a,c) or (a,b,c) are missing. And what if the user does not want to
> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>

I don't think so. Obviously, if you have statistics covering all the 
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the 
statistics and accuracy of the statistics. Let's say you have statistics 
on 8 columns, and you split each dimension twice to build a histogram - 
that's 256 buckets right there, and we only get ~50% selectivity in each 
dimension (the actual histogram building algorithm is more complex, but 
you get the idea).

I see this as probably the most interesting part of the patch, and quite 
useful. But we'll definitely get the single-statistics estimate first, 
no doubt about that.

> Patch 0001: there have been comments about that before, and you have
> put the checks on RestrictInfo in a couple of variables of
> pull_varnos_walker, so nothing to say from here.
>

I don't follow. Are you suggesting 0001 is a reasonable fix, or that 
there's a proposed solution?

> Patch 0002:
> +  <para>
> +   <command>CREATE STATISTICS</command> will create a new multivariate
> +   statistics on the table. The statistics will be created in the in the
> +   current database. The statistics will be owned by the user issuing
> +   the command.
> +  </para>
> s/in the/in the/.
>
> +  <para>
> +   Create table <structname>t1</> with two functionally dependent columns, i.e.
> +   knowledge of a value in the first column is sufficient for detemining the
> +   value in the other column. Then functional dependencies are built on those
> +   columns:
> s/detemining/determining/
>
> +  <para>
> +   If a schema name is given (for example, <literal>CREATE STATISTICS
> +   myschema.mystat ...</>) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> +   the table must be distinct from the name of any other statistics in the
> +   same schema.
> +  </para>
> I would just assume that a statistics is located on the schema of the
> relation it depends on. So the thing that may be better to do is just:
> - Register the OID of the table a statistics depends on but not the schema.
> - Give up on those query extensions related to the schema.
> - Allow the same statistics name to be used for multiple tables.
> - Just fail if a statistics name is being reused on the table again.
> It may be better to complain about that even if the column list is
> different.
> - Register the dependency between the statistics and the table.

The idea is that the syntax should work even for statistics built on 
multiple tables, e.g. to provide better statistics for joins. That's why 
the schema may be specified (as each table might be in different 
schema), and so on.

>
> +ALTER STATISTICS <replaceable class="parameter">name</replaceable>
> OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
> CURRENT_USER | SESSION_USER }
> On the same line, is OWNER TO really necessary? I could have assumed
> that if a user is able to query the set of columns related to a
> statistics, he should have access to it.
>

Not sure, TBH. I think I've reused ALTER INDEX syntax, but now I see 
it's actually ignored with a warning.

> =# create statistics aa_a_b3 on aam (a, b) with (dependencies);
> ERROR:  23505: duplicate key value violates unique constraint
> "pg_mv_statistic_name_index"
> DETAIL:  Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
> SCHEMA NAME:  pg_catalog
> TABLE NAME:  pg_mv_statistic
> CONSTRAINT NAME:  pg_mv_statistic_name_index
> LOCATION:  _bt_check_unique, nbtinsert.c:433
> When creating a multivariate function with a name that already exists,
> this error message should be more friendly.

Yes, agreed.

>
> =# create table aa (a int, b int);
> CREATE TABLE
> =# create view aav as select * from aa;
> CREATE VIEW
> =# create statistics aab_v on aav (a, b) with (dependencies);
> CREATE STATISTICS
> Why do views and foreign tables support this command? This code also
> mentions that this case is not actually supported:
> +       /* multivariate stats are supported on tables and matviews */
> +       if (rel->rd_rel->relkind == RELKIND_RELATION ||
> +           rel->rd_rel->relkind == RELKIND_MATVIEW)
> +           tupdesc = RelationGetDescr(rel);
>
>  };

Yes, seems like a bug.

>
> +
>  /*
> Spurious noise in the patch.
>
> +   /* check that at least some statistics were requested */
> +   if (!build_dependencies)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_SYNTAX_ERROR),
> +                errmsg("no statistics type (dependencies) was requested")));
> So, WITH (dependencies) is mandatory in any case. Why not just
> dropping it from the first cut then?

Because the follow-up patches extend this to require at least one 
statistics type. So in 0004 it becomes
    if (!(build_dependencies || build_mcv))

and in 0005 it's
    if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and 
then add the check in 0004. But it seems a bit pointless.

>
> pg_mv_stats shows only the attribute numbers of the columns it has
> stats on, I think that those should be the column names. [...after a
> while...], as it is mentioned here:
> + * TODO  Would be nice if this printed column names (instead of just attnums).

Yeah.

>
> Does this work properly with DDL deparsing? If yes, could it be
> possible to add tests in test_ddl_deparse? This is a new object type,
> so those look necessary I think.
>

I haven't done anything with DDL deparsing, so I think the answer is 
"no" and needs to be added to a TODO.

> Statistics definition reorder the columns by itself depending on their
> order. For example:
> create table aa (a int, b int);
> create statistics aas on aa(b, a) with (dependencies);
> \d aa
>     "public.aas" (dependencies) ON (a, b)
> As this defines a correlation between multiple columns, isn't it wrong
> to assume that (b, a) and (a, b) are always the same correlation? I
> don't recall such properties as being always commutative (old
> memories, I suck at stats in general). [...reading README...] So this
> is caused by the implementation limitations that only limit the
> analysis between interactions of two columns. Still it seems incorrect
> to reorder the user-visible portion.

I don't follow. If you talk about Pearson's correlation, that clearly 
does not depend on the order of columns - it's perfectly independent of 
that. If you talk about correlation in the wider sense (i.e. 
arbitrary dependence between columns), that might depend - but I don't 
remember a single piece of the patch where this might be a problem.

Also, which README states that we can only analyze interactions between 
two columns? That's pretty clearly not the case - the patch should 
handle dependencies between more columns without any problems.

>
> The comment on top of get_relation_info needs to be updated to mention
> that mvstatlist gets fetched as well.
>
> +   while (HeapTupleIsValid(htup = systable_getnext(indscan)))
> +       /* TODO maybe include only already built statistics? */
> +       result = insert_ordered_oid(result, HeapTupleGetOid(htup));
> I haven't looked at the rest yet of the series yet, but I'd think that
> including the ones not built may be a good idea to let caller do
> itself more filtering. Of course this depends on the next series...
>

Probably, although the more I'm thinking about this the more I think 
I'll rework this along the lines of the foreign-key-estimation patch, 
i.e. preprocessing called from planmain.c (adding info to the query 
plan), estimation in clausesel.c etc. Which also affects this bit, 
because the foreign keys are also loaded elsewhere, IIRC.

> +typedef struct MVDependencyData
> +{
> +   int         nattributes;    /* number of attributes */
> +   int16       attributes[1];  /* attribute numbers */
> +} MVDependencyData;
> You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.
>
> +++ b/src/test/regress/serial_schedule
> @@ -167,3 +167,4 @@ test: with
>  test: xml
>  test: event_trigger
>  test: stats
> +test: mv_dependencies
> This test is not listed in parallel_schedule.
>
> s/Apllying/Applying/
>
> There is a lot of mumbo-jumbo regarding the way dependencies are
> stored with mainly serialize_mv_dependencies and
> deserialize_mv_dependencies that operates them from bytea/dep trees.
> That's not cool and not portable because pg_mv_statistic represents
> that as pure bytea. I would suggest creating a generic data type that
> does those operations, named like pg_dependency_tree and then use that
> in those new catalogs. pg_node_tree is a precedent of such a thing.
> New features could as well make use of this new data type of we are
> able to design that in a way generic enough, so that would be a base
> patch that the current 0002 applies on top of.

Interesting idea, haven't thought about that. So are you suggesting to 
add a data type for each statistics type (dependencies, MCV, histogram, 
...)?

>
> Regarding psql:
> - The new commands lack psql completion, that would ease the use of
> the new commands.
> - Would it make sense to have a backslash command to show the list of
> statistics?
>

Yeah, that's on the TODO.

> Congratulations. I just looked at 25% of the overall patch and my mind
> is already blown away, but I am catching up with the rest...
>

Thanks for looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Petr Jelinek
Date:
On 10/08/16 13:33, Tomas Vondra wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>> 2) combining multiple statistics
>>>
>>> I think the ability to combine multivariate statistics (covering
>>> different
>>> subsets of conditions) is important and useful, but I'm starting to
>>> think
>>> that the current implementation may not be the correct one (which is
>>> why I
>>> haven't written the SGML docs about this part of the patch series yet).
>>>
>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>> estimating query:
>>>
>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>
>>> but that we only have two statistics (a,b) and (b,c). The current
>>> patch does
>>> about this:
>>>
>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>
>>> i.e. it estimates the first two conditions using (a,b), and then
>>> estimates
>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>> efficient,
>>> but it only works as long as the query contains conditions
>>> "connecting" the
>>> two statistics. So if we remove the "b=2" condition from the query, this
>>> stops working.
>>
>> This is trying to make the algorithm smarter than the user, which is
>> something I'd think we could live without. In this case statistics on
>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>
>
> I don't think so. Obviously, if you have statistics covering all the
> conditions - great, we can't really do better than that.
>
> But there's a crucial relation between the number of dimensions of the
> statistics and accuracy of the statistics. Let's say you have statistics
> on 8 columns, and you split each dimension twice to build a histogram -
> that's 256 buckets right there, and we only get ~50% selectivity in each
> dimension (the actual histogram building algorithm is more complex, but
> you get the idea).
>

I think it makes sense to pursue this, but I also think we can easily 
live with not having it in the first version that gets committed and 
doing it as follow-up patch.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> Patch 0001: there have been comments about that before, and you have
>> put the checks on RestrictInfo in a couple of variables of
>> pull_varnos_walker, so nothing to say from here.
>>
>
> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
> a proposed solution?

I think that's reasonable.

>> Patch 0002:
>> +  <para>
>> +   <command>CREATE STATISTICS</command> will create a new multivariate
>> +   statistics on the table. The statistics will be created in the in the
>> +   current database. The statistics will be owned by the user issuing
>> +   the command.
>> +  </para>
>> s/in the/in the/.
>>
>> +  <para>
>> +   Create table <structname>t1</> with two functionally dependent
>> columns, i.e.
>> +   knowledge of a value in the first column is sufficient for detemining
>> the
>> +   value in the other column. Then functional dependencies are built on
>> those
>> +   columns:
>> s/detemining/determining/
>>
>> +  <para>
>> +   If a schema name is given (for example, <literal>CREATE STATISTICS
>> +   myschema.mystat ...</>) then the statistics is created in the
>> specified
>> +   schema.  Otherwise it is created in the current schema.  The name of
>> +   the table must be distinct from the name of any other statistics in
>> the
>> +   same schema.
>> +  </para>
>> I would just assume that a statistics is located on the schema of the
>> relation it depends on. So the thing that may be better to do is just:
>> - Register the OID of the table a statistics depends on but not the
>> schema.
>> - Give up on those query extensions related to the schema.
>> - Allow the same statistics name to be used for multiple tables.
>> - Just fail if a statistics name is being reused on the table again.
>> It may be better to complain about that even if the column list is
>> different.
>> - Register the dependency between the statistics and the table.
>
> The idea is that the syntax should work even for statistics built on
> multiple tables, e.g. to provide better statistics for joins. That's why the
> schema may be specified (as each table might be in different schema), and so
> on.

So you mean that the same statistics could be shared between tables?
But as this is clearly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole thing?
If there is no schema-related field in pg_mv_statistics, we could
still add it later if it proves to be useful.

>> +
>>  /*
>> Spurious noise in the patch.
>>
>> +   /* check that at least some statistics were requested */
>> +   if (!build_dependencies)
>> +       ereport(ERROR,
>> +               (errcode(ERRCODE_SYNTAX_ERROR),
>> +                errmsg("no statistics type (dependencies) was
>> requested")));
>> So, WITH (dependencies) is mandatory in any case. Why not just
>> dropping it from the first cut then?
>
>
> Because the follow-up patches extend this to require at least one statistics
> type. So in 0004 it becomes
>
>     if (!(build_dependencies || build_mcv))
>
> and in 0005 it's
>
>     if (!(build_dependencies || build_mcv || build_histogram))
>
> We might drop it from 0002 (and assume build_dependencies=true), and then
> add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

>> Statistics definition reorder the columns by itself depending on their
>> order. For example:
>> create table aa (a int, b int);
>> create statistics aas on aa(b, a) with (dependencies);
>> \d aa
>>     "public.aas" (dependencies) ON (a, b)
>> As this defines a correlation between multiple columns, isn't it wrong
>> to assume that (b, a) and (a, b) are always the same correlation? I
>> don't recall such properties as being always commutative (old
>> memories, I suck at stats in general). [...reading README...] So this
>> is caused by the implementation limitations that only limit the
>> analysis between interactions of two columns. Still it seems incorrect
>> to reorder the user-visible portion.
>
> I don't follow. If you talk about Pearson's correlation, that clearly does
> not depend on the order of columns - it's perfectly independent of that. If
> you talk about about correlation in the wider sense (i.e. arbitrary
> dependence between columns), that might depend - but I don't remember a
> single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

> Also, which README states that we can only analyze interactions between two
> columns? That's pretty clearly not the case - the patch should handle
> dependencies between more columns without any problems.

I have noticed that the patch evaluates all the possible permutations
of a column list; it seems to me, though, that if we have three
columns (a,b,c) listed in a statistics, (a,b) => c and
(b,a) => c are two different things.

>> There is a lot of mumbo-jumbo regarding the way dependencies are
>> stored with mainly serialize_mv_dependencies and
>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>> That's not cool and not portable because pg_mv_statistic represents
>> that as pure bytea. I would suggest creating a generic data type that
>> does those operations, named like pg_dependency_tree and then use that
>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>> New features could as well make use of this new data type of we are
>> able to design that in a way generic enough, so that would be a base
>> patch that the current 0002 applies on top of.
>
>
> Interesting idea, haven't thought about that. So are you suggesting to add a
> data type for each statistics type (dependencies, MCV, histogram, ...)?

Yes, something like that. It would actually perhaps be better to have
one single data type, and be able to switch between each model easily
instead of putting byteas in the catalog.
-- 
Michael



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>
>>>> 2) combining multiple statistics
>>>>
>>>>
>>>> I think the ability to combine multivariate statistics (covering
>>>> different
>>>> subsets of conditions) is important and useful, but I'm starting to
>>>> think
>>>> that the current implementation may not be the correct one (which is
>>>> why I
>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>
>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>> estimating query:
>>>>
>>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>
>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>> patch does
>>>> about this:
>>>>
>>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>
>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>> estimates
>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>> efficient,
>>>> but it only works as long as the query contains conditions
>>>> "connecting" the
>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>> stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece to get basic support in place, I'd suggest just dropping that for later.
--
Michael



Re: multivariate statistics (v19)

From
Ants Aasma
Date:
On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).

While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/10/2016 03:29 PM, Ants Aasma wrote:
> On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 2) combining multiple statistics
>>
>> I think the ability to combine multivariate statistics (covering different
>> subsets of conditions) is important and useful, but I'm starting to think
>> that the current implementation may not be the correct one (which is why I
>> haven't written the SGML docs about this part of the patch series yet).
>
> While researching this topic a few years ago I came across a paper on
> this exact topic called "Consistently Estimating the Selectivity of
> Conjuncts of Predicates" [1]. While effective it seems to be quite
> heavy-weight, so would probably need support for tiered optimization.
>
> [1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf
>

I think I've read that paper some time ago, and IIRC it's solving the 
same problem but in a very different way - instead of combining the 
statistics directly, it relies on the "partial" selectivities and then 
estimates the total selectivity using the maximum-entropy principle.

I think it's a nice idea and it probably works fine in many cases, but 
it kinda throws away part of the information (that we could get by 
matching the statistics against each other directly). But I'll keep that 
paper in mind, and we can revisit this solution later.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/10/2016 02:24 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> On 10/08/16 13:33, Tomas Vondra wrote:
>>>
>>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>>
>>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>>
>>>>> 2) combining multiple statistics
>>>>>
>>>>>
>>>>> I think the ability to combine multivariate statistics (covering
>>>>> different
>>>>> subsets of conditions) is important and useful, but I'm starting to
>>>>> think
>>>>> that the current implementation may not be the correct one (which is
>>>>> why I
>>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>>
>>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>>> estimating query:
>>>>>
>>>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>>
>>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>>> patch does
>>>>> about this:
>>>>>
>>>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>>
>>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>>> estimates
>>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>>> efficient,
>>>>> but it only works as long as the query contains conditions
>>>>> "connecting" the
>>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>>> stops working.
>>>>
>>>>
>>>> This is trying to make the algorithm smarter than the user, which is
>>>> something I'd think we could live without. In this case statistics on
>>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>>
>>>
>>> I don't think so. Obviously, if you have statistics covering all the
>>> conditions - great, we can't really do better than that.
>>>
>>> But there's a crucial relation between the number of dimensions of the
>>> statistics and accuracy of the statistics. Let's say you have statistics
>>> on 8 columns, and you split each dimension twice to build a histogram -
>>> that's 256 buckets right there, and we only get ~50% selectivity in each
>>> dimension (the actual histogram building algorithm is more complex, but
>>> you get the idea).
>>
>> I think it makes sense to pursue this, but I also think we can easily live
>> with not having it in the first version that gets committed and doing it as
>> follow-up patch.
>
> This patch is large and complicated enough. As this is not a mandatory
> piece to get basic support in place, I'd suggest just dropping that for later.

Which is why combining multiple statistics is in part 0006 and all the 
previous parts simply choose the single "best" statistics ;-)

I'm perfectly fine with committing just the first few parts, and leaving 
0006 for the next major version.

regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/10/2016 02:23 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>> Patch 0001: there have been comments about that before, and you have
>>> put the checks on RestrictInfo in a couple of variables of
>>> pull_varnos_walker, so nothing to say from here.
>>>
>>
>> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
>> a proposed solution?
>
> I think that's reasonable.
>

Well, to me the 0001 patch feels more like a temporary workaround than 
a proper solution. I just don't know how to deal with it, so I've kept it 
for now. Pretty sure there will be complaints that adding RestrictInfo 
to the expression walkers is not a nice idea.
>> ...
>>
>> The idea is that the syntax should work even for statistics built on
>> multiple tables, e.g. to provide better statistics for joins. That's why the
>> schema may be specified (as each table might be in different schema), and so
>> on.
>
> So you mean that the same statistics could be shared between tables?
> But as this is visibly not a concept introduced yet in this set of
> patches, why not just cut it off for now to simplify the whole? If
> there is no schema-related field in pg_mv_statistics we could still
> add it later if it proves to be useful.
>

Yes, I think creating statistics on multiple tables is one of the 
possible future directions. One of the previous patch versions 
introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into 
issues in gram.y, and given the multi-table possibilities the CREATE 
STATISTICS seems like a much better idea anyway.

But I guess you're right we may make this a bit more strict now, and 
relax it in the future if needed. For example as we only support 
single-table statistics at this point, we may remove the schema and 
always create the statistics in the schema of the table.

But I don't think we should make the statistics names unique only within 
a table (instead of within the schema).

The difference between those two cases is that if we allow multi-table 
statistics in the future, we can simply allow specifying the schema and 
everything will work just fine. But it'd break the second case, as it 
might result in conflicts in existing schemas.

I do realize this might be seen as a case of "future proofing" based on 
dubious predictions of how something might work, but OTOH this (schema 
inherited from table, unique within a schema) is pretty consistent with 
how this works for indexes.

>>> +
>>>  /*
>>> Spurious noise in the patch.
>>>
>>> +   /* check that at least some statistics were requested */
>>> +   if (!build_dependencies)
>>> +       ereport(ERROR,
>>> +               (errcode(ERRCODE_SYNTAX_ERROR),
>>> +                errmsg("no statistics type (dependencies) was
>>> requested")));
>>> So, WITH (dependencies) is mandatory in any case. Why not just
>>> dropping it from the first cut then?
>>
>>
>> Because the follow-up patches extend this to require at least one statistics
>> type. So in 0004 it becomes
>>
>>     if (!(build_dependencies || build_mcv))
>>
>> and in 0005 it's
>>
>>     if (!(build_dependencies || build_mcv || build_histogram))
>>
>> We might drop it from 0002 (and assume build_dependencies=true), and then
>> add the check in 0004. But it seems a bit pointless.
>
> This is a complicated set of patches. I'd think that we should try to
> simplify things as much as possible first, and the WITH clause is not
> mandatory to have as of 0002.
>

OK, I can remove the WITH from the 0002 part. Not a big deal.

>>> Statistics definition reorder the columns by itself depending on their
>>> order. For example:
>>> create table aa (a int, b int);
>>> create statistics aas on aa(b, a) with (dependencies);
>>> \d aa
>>>     "public.aas" (dependencies) ON (a, b)
>>> As this defines a correlation between multiple columns, isn't it wrong
>>> to assume that (b, a) and (a, b) are always the same correlation? I
>>> don't recall such properties as being always commutative (old
>>> memories, I suck at stats in general). [...reading README...] So this
>>> is caused by the implementation limitations that only limit the
>>> analysis between interactions of two columns. Still it seems incorrect
>>> to reorder the user-visible portion.
>>
>> I don't follow. If you talk about Pearson's correlation, that clearly does
>> not depend on the order of columns - it's perfectly independent of that. If
>> you talk about correlation in the wider sense (i.e. arbitrary
>> dependence between columns), that might depend - but I don't remember a
>> single piece of the patch where this might be a problem.
>
> Yes, based on what is done in the patch that may not be a problem, but
> I am wondering if this is not restricting things too much.
>

Let's keep the code as it is. If we run into this issue in the future, 
we can easily relax this - there's nothing depending on the ordering of 
attnums, IIRC.

>> Also, which README states that we can only analyze interactions between two
>> columns? That's pretty clearly not the case - the patch should handle
>> dependencies between more columns without any problems.
>
> I have noticed that the patch evaluates all the possible permutations
> of a column list; it seems to me though that, say, if we have three
> columns (a,b,c) listed in a statistics, (a,b) => c and (b,a) => c are
> two different things.
>

Yes, those are two different functional dependencies, of course. But the 
algorithm (during ANALYZE) should discover all of them, and even the 
examples are using three columns, so I'm not sure what you mean by 
"analyze interactions between two columns"?

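To make this concrete, here is a minimal sketch using the syntax from the
current patch (table and statistics names invented):

    CREATE TABLE t3 (a INT, b INT, c INT);
    CREATE STATISTICS s3 ON t3 (a, b, c) WITH (dependencies);
    ANALYZE t3;

and ANALYZE then searches for dependencies between all the columns in the
list - e.g. (a) => b, (a,b) => c, (b,c) => a and so on - not just between
pairs of columns.
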
>>> There is a lot of mumbo-jumbo regarding the way dependencies are
>>> stored with mainly serialize_mv_dependencies and
>>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>>> That's not cool and not portable because pg_mv_statistic represents
>>> that as pure bytea. I would suggest creating a generic data type that
>>> does those operations, named like pg_dependency_tree and then use that
>>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>>> New features could as well make use of this new data type if we are
>>> able to design that in a way generic enough, so that would be a base
>>> patch that the current 0002 applies on top of.
>>
>>
>> Interesting idea, haven't thought about that. So are you suggesting to add a
>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>
> Yes, something like that. It would perhaps be better to have one
> single data type, and be able to switch between each model easily
> instead of putting byteas in the catalog.

Hmmm, not sure about that. For example what about combinations of 
statistics - e.g. when we have MCV list on the most common values and a 
histogram on the rest? Should we store both as a single value, or would 
that be in two separate values, or what?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Thu, Aug 11, 2016 at 3:34 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/10/2016 02:23 PM, Michael Paquier wrote:
>>
>> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> The idea is that the syntax should work even for statistics built on
>>> multiple tables, e.g. to provide better statistics for joins. That's why
>>> the
>>> schema may be specified (as each table might be in different schema), and
>>> so
>>> on.
>>
>>
>> So you mean that the same statistics could be shared between tables?
>> But as this is visibly not a concept introduced yet in this set of
>> patches, why not just cut it off for now to simplify the whole? If
>> there is no schema-related field in pg_mv_statistics we could still
>> add it later if it proves to be useful.
>>
>
> Yes, I think creating statistics on multiple tables is one of the possible
> future directions. One of the previous patch versions introduced ALTER TABLE
> ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the
> multi-table possibilities the CREATE STATISTICS seems like a much better
> idea anyway.
>
> But I guess you're right we may make this a bit more strict now, and relax
> it in the future if needed. For example as we only support single-table
> statistics at this point, we may remove the schema and always create the
> statistics in the schema of the table.

This would simplify the code a bit, so I'd suggest removing that from
the first shot. If there is demand for it later, we had better keep the
infrastructure open for this extension.

> But I don't think we should make the statistics names unique only within a
> table (instead of within the schema).

They could be made unique using (name, table_oid, column_list).

>>>> There is a lot of mumbo-jumbo regarding the way dependencies are
>>>> stored with mainly serialize_mv_dependencies and
>>>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>>>> That's not cool and not portable because pg_mv_statistic represents
>>>> that as pure bytea. I would suggest creating a generic data type that
>>>> does those operations, named like pg_dependency_tree and then use that
>>>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>>>> New features could as well make use of this new data type if we are
>>>> able to design that in a way generic enough, so that would be a base
>>>> patch that the current 0002 applies on top of.
>>>
>>>
>>>
>>> Interesting idea, haven't thought about that. So are you suggesting to
>>> add a
>>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>>
>>
>> Yes, something like that. It would perhaps be better to have one
>> single data type, and be able to switch between each model easily
>> instead of putting byteas in the catalog.
>
> Hmmm, not sure about that. For example what about combinations of statistics
> - e.g. when we have MCV list on the most common values and a histogram on
> the rest? Should we store both as a single value, or would that be in two
> separate values, or what?

The same statistics can combine two different things; whether to use
different columns may depend on how readable things get...
Btw, for the format we could take inspiration from pg_node_tree, with pg_stat_tree:
{HISTOGRAM :arg {BUCKET :index 0 :minvals ... }}
{DEPENDENCY :arg {:elt "a => c" ...} ... }
{MCV :arg {:index 0 :values {0,0} ... } ... }
Please consider that as a tentative idea to make things more friendly.
Others may have a different opinion on the matter.
-- 
Michael



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) enriching the query tree with multivariate statistics info
>>
>> Right now all the stuff related to multivariate statistics estimation
>> happens in clausesel.c - matching condition to statistics, selection of
>> statistics to use (if there are multiple usable stats), etc. So pretty much
>> all this info is internal to clausesel.c and does not get outside.
>
This does not seem bad to me at first sight but...
>
>> I'm starting to think that some of the steps (matching quals to stats,
>> selection of stats) should happen in a "preprocess" step before the actual
>> estimation, storing the information (which stats to use, etc.) in a new type
>> of node in the query tree - something like RestrictInfo.
>>
>> I believe this needs to happen sometime after deconstruct_jointree() as that
>> builds RestrictInfos nodes, and looking at planmain.c, right after
>> extract_restriction_or_clauses seems about right. Haven't tried, though.
>>
>> This would move all the "statistics selection" logic from clausesel.c,
>> separating it from the "actual estimation" and simplifying the code.
>>
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated 
in practice. It essentially means doing something like
    rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the 
name suggests the list should only include RestrictInfo nodes, which 
seems to contradict the transformation.

For example with conditions
    WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on 
(a,b,c), we need to note that somehow - my plan was to inject a node 
storing this information, something like (a bit simplified):
    StatisticsInfo {
        Oid   statisticsoid;     /* OID of the statistics */
        List *mvconditions;      /* estimate using the statistics */
        List *otherconditions;   /* estimate the old way */
    }

But that'd clearly violate the assumption that baserestrictinfo only 
contains RestrictInfo. I don't think it's feasible (or desirable) to 
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it 
wants. But it also means we need to maintain two lists instead of one, 
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e. 
each node would track whether it's estimated the "old way" or using 
multivariate statistics (and which one). But it would require changing 
many of the existing nodes (at least those compatible with multivariate 
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during 
the estimation, as it'd be necessary to look for other nodes to be 
estimated by the same statistics. Which seems to defeat the idea of 
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e. 
keeping a separate list, but I'd welcome other ideas.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Robert Haas
Date:
On Tue, Aug 2, 2016 at 9:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master.

Tom:

ISTR that you were going to try to look at this patch set.  It seems
from the discussion that it's not really ready for serious
consideration for commit yet, but also that some high-level design
comments from you at this stage could go a long way toward making sure
that the final form of the patch is something that will be acceptable.

I'd really like to see us get some kind of capability along these
lines, but I'm sure it will go a lot better if you or Dean handle it
than if I try to do it ... not to mention that there are only so many
hours in the day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Aug 24, 2016 at 2:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> ISTR that you were going to try to look at this patch set.  It seems
> from the discussion that it's not really ready for serious
> consideration for commit yet, but also that some high-level design
> comments from you at this stage could go a long way toward making sure
> that the final form of the patch is something that will be acceptable.
>
> I'd really like to see us get some kind of capability along these
> lines, but I'm sure it will go a lot better if you or Dean handle it
> than if I try to do it ... not to mention that there are only so many
> hours in the day.

Agreed. What I have been able to look at until now is the high-level
structure of the patch, and I think that we should really shave 0002
and simplify it to get a core infrastructure in place. But the core
patch is at another level, and it would be good to get some feedback
regarding the structure of the patch and whether it is moving in a
good direction or not.
-- 
Michael



Re: multivariate statistics (v19)

From
Dean Rasheed
Date:
On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch, it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
  [ WITH (options) ]
  ON expression [, ...]
  FROM table [, ...]
  WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
  [ WITH (options) ]
  ON column1, column2 [, ...]
  FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.

Of course, I'm not suggesting that the current patch do any of that --
it's big enough as it is. I'm just throwing out possible future
directions this might go in, so that we don't get painted into a
corner when designing the syntax for the current patch.


Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using
 P(a,b) = P(a) * (f + (1-f) * P(b))
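
To illustrate how that behaves at the extremes (the numbers here are
invented, just to show the shape of the formula), with P(a) = P(b) = 0.01:

    SELECT 0.01 * (1.0 + (1 - 1.0) * 0.01) AS fully_dependent,  -- f = 1:   0.01
           0.01 * (0.5 + (1 - 0.5) * 0.01) AS half_dependent,   -- f = 0.5: 0.00505
           0.01 * (0.0 + (1 - 0.0) * 0.01) AS independent;      -- f = 0:   0.0001

so f = 1 collapses to P(a), f = 0 falls back to the plain product, and
intermediate degrees move smoothly between the two.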

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.
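
To see that concretely with the table above (counts worked out by hand, so
treat them as approximate):

    SELECT count(*) FROM t WHERE a < 50;              -- 500
    SELECT count(*) FROM t WHERE b < 50;              -- 500
    SELECT count(*) FROM t WHERE a < 50 AND b < 50;   -- ~260

i.e. the combined count stays close to the 250 predicted by multiplying the
individual selectivities (0.5 * 0.5 * 1000), despite the strong functional
dependence between the columns.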

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.


Looking at the ndistinct coefficient 'q', I think it would be better
if the recorded statistic were just the estimate for
ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
more fundamental statistic, and it's easier to document and easier to
interpret. Also, I don't believe that the coefficient 'q' is the right
number to use for clause estimation:

Looking at README.ndistinct, I'm skeptical about the selectivity
estimation argument. In the case where a -> b, you'd have q =
ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
is fine for a uniform distribution. But typically, there would be
univariate statistics on a and b, so if for example a=1 were 100x more
likely than average, you'd probably know that and the existing code
computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
= 1/ndistinct(a) would be a significant underestimate, since it would
be ignoring known information about the distribution of a.

But likewise if, as is later argued, you were to use 'q' as a
correction factor applied to the individual clause selectivities, you
could end up with significant overestimates: if you said P(a=1 & b=2)
= q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
a -> b, then b=2 would also be 100x more likely than average (assuming
that b=2 was the value implied by the functional dependency), and that
would also be reflected in the univariate statics on b, so then you'd
end up with an overall selectivity of around 10000/ndistinct(a), which
would be 100x too big. In fact, since a -> b means that q =
ndistinct(b), there's a good chance of hitting data for which q * P(b)
is greater than 1, so this formula would lead to a combined
selectivity greater than P(a), which is obviously nonsense.

Having a better estimate for ndistinct(a,b,...) looks very useful by
itself for GROUP BY estimation, and there may be other places that
would benefit from it, but I don't think it's the best statistic for
determining functional dependence or combining clause selectivities.
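
As a concrete illustration with the table above (the exact estimate the
planner produces will vary, so take the numbers as indicative only):

    EXPLAIN SELECT a, b FROM t GROUP BY a, b;

today the planner combines the per-column ndistinct estimates (100 for "a"
and 100 for "b") and so tends to assume far more groups than the 100 that
actually exist; an ndistinct(a,b) estimate would give the right answer
directly.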

That's as much as I've looked at so far. It's such a big patch that
it's difficult to consider all at once. I think perhaps the smallest
committable self-contained unit providing a tangible benefit would be
something containing the core infrastructure plus the ndistinct
estimate and the improved GROUP BY estimation.

Regards,
Dean



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:
> On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v19 of the "multivariate stats" patch series
>
> Hi,
>
> I started looking at this - just at a very high level - I've not read
> much of the detail yet, but here are some initial review comments.
>
> I think the overall infrastructure approach for CREATE STATISTICS
> makes sense, and I agree with other suggestions upthread that it would
> be useful to be able to build statistics on arbitrary expressions,
> although that doesn't need to be part of this patch, it's useful to
> keep that in mind as a possible future extension of this initial
> design.
>
> I can imagine it being useful to be able to create user-defined
> statistics on an arbitrary list of expressions, and I think that would
> include univariate as well as multivariate statistics. Perhaps that's
> something to take into account in the naming of things, e.g., as David
> Rowley suggested, something like pg_statistic_ext, rather than
> pg_mv_statistic.
>
> I also like the idea that this might one day be extended to support
> statistics across multiple tables, although I think that might be
> challenging to achieve -- you'd need a method of taking a random
> sample of rows from a join between 2 or more tables. However, if the
> intention is to be able to support that one day, I think that needs to
> be accounted for in the syntax now -- specifically, I think it will be
> too limiting to only support things extending the current syntax of
> the form table1(col1, col2, ...), table2(col1, col2, ...), because
> that precludes building statistics on an expression referring to
> columns from more than one table. So I think we should plan further
> ahead and use a syntax giving greater flexibility in the future, for
> example something structured more like a query (like CREATE VIEW):
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON expression [, ...]
>   FROM table [, ...]
>   WHERE condition
>
> where the first version of the patch would only support expressions
> that are simple column references, and would require at least 2 such
> columns from a single table with no WHERE clause, i.e.:
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON column1, column2 [, ...]
>   FROM table
>
> For multi-table statistics, a WHERE clause would typically be needed
> to specify how the tables are expected to be joined, but potentially
> such a clause might also be useful in single-table statistics, to
> build partial statistics on a commonly queried subset of the table,
> just like a partial index.

Hmm, the "partial statistics" idea seems interesting. It would allow us 
to provide additional / more detailed statistics only for a subset of a 
table.

I'm however not sure about the join case - how would the syntax work 
with outer joins? But as you said, we only need

    CREATE STATISTICS name
      [ WITH (options) ]
      ON (column1, column2 [, ...])
      FROM table
      WHERE condition

until we add support for join statistics.
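
Just to sketch what such partial statistics might look like one day
(entirely hypothetical syntax and names, nothing in the current patches
implements this):

    CREATE STATISTICS active_orders_stats
      WITH (dependencies)
      ON (customer_id, warehouse_id)
      FROM orders
      WHERE status = 'active';

i.e. the WHERE clause would restrict the sampled rows in the same way a
predicate restricts a partial index.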

>
> Regarding the statistics themselves, I read the description of soft
> functional dependencies, and I'm somewhat skeptical about that
> algorithm. I don't like the arbitrary thresholds or the sudden jump
> from independence to dependence and clause reduction. As others have
> said, I think this should account for a continuous spectrum of
> dependence from fully independent to fully dependent, and combine
> clause selectivities in a way based on the degree of dependence. For
> example, if you computed an estimate for the fraction 'f' of the
> table's rows for which a -> b, then it might be reasonable to combine
> the selectivities using
>
>   P(a,b) = P(a) * (f + (1-f) * P(b))
>

Yeah, I agree that the thresholds resulting in sudden changes between 
"dependent" and "not dependent" are annoying. The question is whether it 
makes sense to fix that, though - the functional dependencies were meant 
as the simplest form of statistics, allowing us to get the rest of the 
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of 
dependency between 0 and 1, but I'm a bit afraid it'll result in 
complaints that the first patch got too large / complicated.

It also contradicts the idea of using functional dependencies as a 
low-overhead type of statistics, filtering the list of clauses that need 
to be estimated using more expensive types of statistics (MCV lists, 
histograms, ...). Switching to a degree of dependency would prevent 
removal of "unnecessary" clauses.

> Of course, having just a single number that tells you the columns are
> correlated, tells you nothing about whether the clauses on those
> columns are consistent with that correlation. For example, in the
> following table
>
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>
> 'b' is functionally dependent on 'a' (and vice versa), but if you
> query the rows with a<50 and with b<50, those clauses behave
> essentially independently, because they're not consistent with the
> functional dependence between 'a' and 'b', so the best way to combine
> their selectivities is just to multiply them, as we currently do.
>
> So whilst it may be interesting to determine that 'b' is functionally
> dependent on 'a', it's not obvious whether that fact by itself should
> be used in the selectivity estimates. Perhaps it should, on the
> grounds that it's best to attempt to use all the available
> information, but only if there are no more detailed statistics
> available. In any case, knowing that there is a correlation can be
> used as an indicator that it may be worthwhile to build more detailed
> multivariate statistics like a MCV list or a histogram on those
> columns.
>

Right. IIRC this is actually described in the README as "incompatible 
conditions". While implementing it, I concluded that this is OK and it's 
up to the developer to decide whether the queries are compatible with 
the "assumption of compatibility". But maybe this reasoning is bogus 
and makes (the current implementation of) functional dependencies 
unusable in practice.

But I like the idea of reversing the order from

(a) look for functional dependencies
(b) reduce the clauses using functional dependencies
(c) estimate the rest using multivariate MCV/histograms

to

(a) estimate the clauses using multivariate MCV/histograms
(b) try to apply functional dependencies on the remaining clauses

It contradicts the idea of functional dependencies as "low-overhead 
statistics" but maybe it's worth it.

>
> Looking at the ndistinct coefficient 'q', I think it would be better
> if the recorded statistic were just the estimate for
> ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
> more fundamental statistic, and it's easier to document and easier to
> interpret. Also, I don't believe that the coefficient 'q' is the right
> number to use for clause estimation:
>

IIRC the reason why I stored the coefficient instead of the ndistinct() 
values is that the coefficients are not directly related to the number of 
rows in the original relation, so you can apply them directly to whatever 
cardinality estimate you have.

Otherwise it's mostly the same information - it's trivial to compute one 
from the other.
>
> Looking at README.ndistinct, I'm skeptical about the selectivity
> estimation argument. In the case where a -> b, you'd have q =
> ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
> is fine for a uniform distribution. But typically, there would be
> univariate statistics on a and b, so if for example a=1 were 100x more
> likely than average, you'd probably know that and the existing code
> computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
> = 1/ndistinct(a) would be a significant underestimate, since it would
> be ignoring known information about the distribution of a.
>
> But likewise if, as is later argued, you were to use 'q' as a
> correction factor applied to the individual clause selectivities, you
> could end up with significant overestimates: if you said P(a=1 & b=2)
> = q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
> a -> b, then b=2 would also be 100x more likely than average (assuming
> that b=2 was the value implied by the functional dependency), and that
> would also be reflected in the univariate statics on b, so then you'd
> end up with an overall selectivity of around 10000/ndistinct(a), which
> would be 100x too big. In fact, since a -> b means that q =
> ndistinct(b), there's a good chance of hitting data for which q * P(b)
> is greater than 1, so this formula would lead to a combined
> selectivity greater than P(a), which is obviously nonsense.

Well, yeah. The
    P(a=1) = 1/ndistinct(a)

was really just a simplification for the uniform distribution, and 
looking at "q" as a correction factor is much more practical - no doubt 
about that.

As for the overestimates and underestimates - I don't think we can 
entirely prevent that. We're essentially replacing one assumption (AVIA) 
with other assumptions (homogeneity for ndistinct, compatibility for 
functional dependencies), hoping that those assumptions are weaker in 
some sense. But there'll always be cases that break those assumptions 
and I don't think we can prevent that.

Unlike the functional dependencies, this "homogeneity" assumption is not 
dependent on the queries at all, so it should be possible to verify it 
during ANALYZE.

Also, maybe we could/should use the same approach as for functional 
dependencies, i.e. try using more detailed statistics first and then 
apply ndistinct coefficients only on the remaining clauses?

>
> Having a better estimate for ndistinct(a,b,...) looks very useful by
> itself for GROUP BY estimation, and there may be other places that
> would benefit from it, but I don't think it's the best statistic for
> determining functional dependence or combining clause selectivities.
>

Not sure. I think it may be a very useful type of statistics, but I'm not 
going to fight for this very hard. I'm fine with ignoring this 
statistics type for now, getting the other "detailed" statistics types 
(MCV, histograms) in and then revisiting this.

> That's as much as I've looked at so far. It's such a big patch that
> it's difficult to consider all at once. I think perhaps the smallest
> committable self-contained unit providing a tangible benefit would be
> something containing the core infrastructure plus the ndistinct
> estimate and the improved GROUP BY estimation.
>

FWIW I find the ndistinct statistics rather uninteresting (at least 
compared to the other types of statistics), which is why it's the last 
patch in the patch series. Perhaps I shouldn't have included it at all, 
as it's just a distraction.


regards



Re: multivariate statistics (v19)

From
Heikki Linnakangas
Date:
This patch set is in pretty good shape, the only problem is that it's so 
big that no-one seems to have the time or courage to do the final 
touches and commit it. If we just focus on the functional dependencies 
part for now, I think we might get somewhere. I peeked at the MCV and 
histogram patches too, and I think they make total sense as well, and 
are a natural extension of the functional dependencies patch. So if we 
just focus on that for now, I don't think we will paint ourselves in the 
corner.

(more below)

On 09/14/2016 01:01 AM, Tomas Vondra wrote:
> On 09/12/2016 04:08 PM, Dean Rasheed wrote:
>> Regarding the statistics themselves, I read the description of soft
>> functional dependencies, and I'm somewhat skeptical about that
>> algorithm. I don't like the arbitrary thresholds or the sudden jump
>> from independence to dependence and clause reduction. As others have
>> said, I think this should account for a continuous spectrum of
>> dependence from fully independent to fully dependent, and combine
>> clause selectivities in a way based on the degree of dependence. For
>> example, if you computed an estimate for the fraction 'f' of the
>> table's rows for which a -> b, then it might be reasonable to combine
>> the selectivities using
>>
>>   P(a,b) = P(a) * (f + (1-f) * P(b))
>>
>
> Yeah, I agree that the thresholds resulting in sudden changes between
> "dependent" and "not dependent" are annoying. The question is whether it
> makes sense to fix that, though - the functional dependencies were meant
> as the simplest form of statistics, allowing us to get the rest of the
> infrastructure in.
>
> I'm OK with replacing the true/false dependencies with a degree of
> dependency between 0 and 1, but I'm a bit afraid it'll result in
> complaints that the first patch got too large / complicated.

+1 for using a floating degree between 0 and 1, rather than a boolean.

> It also contradicts the idea of using functional dependencies as a
> low-overhead type of statistics, filtering the list of clauses that need
> to be estimated using more expensive types of statistics (MCV lists,
> histograms, ...). Switching to a degree of dependency would prevent
> removal of "unnecessary" clauses.

That sounds OK to me, although I'm not deeply familiar with this patch yet.

>> Of course, having just a single number that tells you the columns are
>> correlated, tells you nothing about whether the clauses on those
>> columns are consistent with that correlation. For example, in the
>> following table
>>
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>>
>> 'b' is functionally dependent on 'a' (and vice versa), but if you
>> query the rows with a<50 and with b<50, those clauses behave
>> essentially independently, because they're not consistent with the
>> functional dependence between 'a' and 'b', so the best way to combine
>> their selectivities is just to multiply them, as we currently do.
>>
>> So whilst it may be interesting to determine that 'b' is functionally
>> dependent on 'a', it's not obvious whether that fact by itself should
>> be used in the selectivity estimates. Perhaps it should, on the
>> grounds that it's best to attempt to use all the available
>> information, but only if there are no more detailed statistics
>> available. In any case, knowing that there is a correlation can be
>> used as an indicator that it may be worthwhile to build more detailed
>> multivariate statistics like a MCV list or a histogram on those
>> columns.
>
> Right. IIRC this is actually described in the README as "incompatible
> conditions". While implementing it, I concluded that this is OK and it's
> up to the developer to decide whether the queries are compatible with
> the "assumption of compatibility". But maybe this reasoning is bogus
> and makes (the current implementation of) functional dependencies
> unusable in practice.

I think that's OK. It seems like a good assumption that the conditions 
are "compatible" with the functional dependency. For two reasons:

1) A query with compatible clauses is much more likely to occur in real 
life. Why would you run a query with incompatible ZIP and city clauses?

2) If the conditions were in fact incompatible, the query is likely to 
return 0 rows, and will bail out very quickly, even if the estimates are 
way off and you choose a non-optimal plan. There are exceptions, of 
course: an index scan might be able to conclude that there are no rows 
much quicker than a seqscan, but as a general rule of thumb, a query 
that returns 0 rows isn't very sensitive to the chosen plan.

And of course, as long as we're not collecting these statistics 
automatically, if it doesn't work for your application, just don't 
collect them.


I fear that using "statistics" as the name of the new object might get a 
bit awkward. "statistics" is a plural, but we use it as the name of a 
single object, like "pants" or "scissors". Not sure I have any better 
ideas though. "estimator"? "statistics collection"? Or perhaps it should 
be singular, "statistic". I note that you actually called the system 
table "pg_mv_statistic", in singular.

I'm not a big fan of storing the stats as just a bytea blob, and having 
to use special functions to interpret it. By looking at the patch, it's 
not clear to me what we actually store for functional dependencies. A 
list of attribute numbers? Could we store them simply as an int[]? (I'm 
not a big fan of the hack in pg_statistic, that allows storing arrays of 
any data type in the same column, though. But for functional 
dependencies, I don't think we need that.)
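
Just to illustrate the kind of representation I have in mind (a throwaway
sketch, not a proposal for the actual catalog definition):

    -- purely illustrative: a dependency stored as plain columns/arrays
    CREATE TABLE demo_dependencies (attnums int2[], degree float4);
    INSERT INTO demo_dependencies VALUES ('{1,2,3}', 1.0);  -- (1,2) => 3
    SELECT * FROM demo_dependencies WHERE attnums @> '{3}'::int2[];

Something that simple would be inspectable with ordinary SQL, with no
special deserialization functions needed.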

Overall, this is going to be a great feature!

- Heikki




Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This patch set is in pretty good shape, the only problem is that it's so big
> that no-one seems to have the time or courage to do the final touches and
> commit it.

Did you see my suggestions about simplifying its SQL structure? You
could shave some code without impacting the base set of features.

> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>
> I'm not a big fan of storing the stats as just a bytea blob, and having to
> use special functions to interpret it. By looking at the patch, it's not
> clear to me what we actually store for functional dependencies. A list of
> attribute numbers? Could we store them simply as an int[]? (I'm not a big
> fan of the hack in pg_statistic, that allows storing arrays of any data type
> in the same column, though. But for functional dependencies, I don't think
> we need that.)

I am marking this patch as returned with feedback for now.

> Overall, this is going to be a great feature!

+1.
-- 
Michael



Re: multivariate statistics (v19)

From
Heikki Linnakangas
Date:
On 10/03/2016 04:46 AM, Michael Paquier wrote:
> On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This patch set is in pretty good shape, the only problem is that it's so big
>> that no-one seems to have the time or courage to do the final touches and
>> commit it.
>
> Did you see my suggestions about simplifying its SQL structure? You
> could shave some code without impacting the base set of features.

Yeah. The idea was to use something like pg_node_tree to store all the 
different kinds of statistics, the histogram, the MCV, and the 
functional dependencies, in one datum. Or JSON, maybe. It sounds better 
than an opaque bytea blob, although I'd prefer something more 
relational. For the functional dependencies, I think we could get away 
with a simple float array, so let's do that in the first cut, and 
revisit this for the MCV and histogram later. Separate columns for the 
functional dependencies, the MCVs, and the histogram, probably makes 
sense anyway.

- Heikki




Re: multivariate statistics (v19)

From
Michael Paquier
Date:
On Mon, Oct 3, 2016 at 8:25 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah. The idea was to use something like pg_node_tree to store all the
> different kinds of statistics, the histogram, the MCV, and the functional
> dependencies, in one datum. Or JSON, maybe. It sounds better than an opaque
> bytea blob, although I'd prefer something more relational. For the
> functional dependencies, I think we could get away with a simple float
> array, so let's do that in the first cut, and revisit this for the MCV and
> histogram later.

OK. A second thing was related to the use of schemas in the new system
catalogs. As mentioned in [1], those could be removed.
[1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.

> Separate columns for the functional dependencies, the MCVs,
> and the histogram, probably makes sense anyway.

Probably..
-- 
Michael



Re: multivariate statistics (v19)

From
Dean Rasheed
Date:
On 4 October 2016 at 04:25, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. A second thing was related to the use of schemas in the new system
> catalogs. As mentioned in [1], those could be removed.
> [1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.
>

That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e, store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also
answers Tomas' question about how JOIN might one day be supported.

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean



Re: multivariate statistics (v19)

From
Dean Rasheed
Date:
On 30 September 2016 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>

I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data. Also I
find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE
STATISTICS".

The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that. It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.

Regards,
Dean



Re: multivariate statistics (v19)

From
Heikki Linnakangas
Date:
On 10/04/2016 10:49 AM, Dean Rasheed wrote:
> On 30 September 2016 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I fear that using "statistics" as the name of the new object might get a bit
>> awkward. "statistics" is a plural, but we use it as the name of a single
>> object, like "pants" or "scissors". Not sure I have any better ideas though.
>> "estimator"? "statistics collection"? Or perhaps it should be singular,
>> "statistic". I note that you actually called the system table
>> "pg_mv_statistic", in singular.
>
> I think it's OK. The functional dependency is a single statistic, but
> MCV lists and histograms are multiple statistics (multiple facts about
> the data sampled), so in general when you create one of these new
> objects, you are creating multiple statistics about the data.

Ok. I don't really have any better ideas, was just hoping that someone 
else would.

> Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to
> "CREATE STATISTICS".

Agreed.

> The convention for naming system catalogs seems to be to use the
> singular for tables and plural for views, so I guess we should stick
> with that.

However, for tables and views, each object you store in those views is a 
"table" or "view", but with this thing, the object you store is 
"statistics". Would you have a catalog table called "pg_scissor"?

We call the current system table "pg_statistic", though. I agree we 
should call it pg_mv_statistic, in singular, to follow the example of 
pg_statistic.

Of course, the user-friendly system view on top of that is called 
"pg_stats", just to confuse things more :-).

> It doesn't seem like the end of the world that it doesn't
> match the user-facing syntax. A bigger concern is the use of "mv" in
> the name, because as has already been pointed out, this table may also
> in the future be used to store univariate expression and partial
> statistics, so I think we should drop the "mv" and go with something
> like pg_statistic_ext, or some other more general name.

Also, "mv" makes me think of materialized views, which is completely 
unrelated to this.

- Heikki




Re: multivariate statistics (v19)

From
Gavin Flower
Date:
On 04/10/16 20:37, Dean Rasheed wrote:
> On 4 October 2016 at 04:25, Michael Paquier <michael.paquier@gmail.com> wrote:
>> OK. A second thing was related to the use of schemas in the new system
>> catalogs. As mentioned in [1], those could be removed.
>> [1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.
>>
> That doesn't work, because if the intention is to be able to one day
> support statistics across multiple tables, you can't assume that the
> statistics are in the same schema as the table.
>
> In fact, if multi-table statistics are to be allowed in the future, I
> think you want to move away from thinking of statistics as depending
> on and referring to a single table, and handle them more like views --
> i.e, store a pg_node_tree representing the from_clause and add
> multiple dependencies at statistics creation time. That was what I was
> getting at upthread when I suggested the alternate syntax, and also
> answers Tomas' question about how JOIN might one day be supported.
>
> Of course, if we don't think that we will ever support multi-table
> statistics, that all goes away, and you may as well make the
> statistics name local to the table, but I think that's a bit limiting.
> One way or the other, I think this is a question that needs to be
> answered now. My vote is to leave expansion room to support
> multi-table statistics in the future.
>
> Regards,
> Dean
>
>
I can see multi-table statistics being useful if one is trying to 
optimise indexes for multiple joins.

Am assuming that the statistics can be accessed by the user as well as 
the planner? (I've only lightly followed this thread, so I might have 
missed significant relevant details!)


Cheers,
Gavin




Re: multivariate statistics (v19)

From
Dean Rasheed
Date:
On 4 October 2016 at 09:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> However, for tables and views, each object you store in those views is a
> "table" or "view", but with this thing, the object you store is
> "statistics". Would you have a catalog table called "pg_scissor"?
>

No, probably not (unless it was storing individual scissor blades).

However, in this case, we have related pre-existing catalog tables, so...

> We call the current system table "pg_statistic", though. I agree we should
> call it pg_mv_statistic, in singular, to follow the example of pg_statistic.
>
> Of course, the user-friendly system view on top of that is called
> "pg_stats", just to confuse things more :-).
>

I agree. Given where we are, with a pg_statistic table and a pg_stats
view, I think the least worst solution is to have a pg_statistic_ext
table, and then maybe a pg_stats_ext view.


>> It doesn't seem like the end of the world that it doesn't
>> match the user-facing syntax. A bigger concern is the use of "mv" in
>> the name, because as has already been pointed out, this table may also
>> in the future be used to store univariate expression and partial
>> statistics, so I think we should drop the "mv" and go with something
>> like pg_statistic_ext, or some other more general name.
>
>
> Also, "mv" makes me think of materialized views, which is completely
> unrelated to this.
>

Yeah, I hadn't thought of that.

Regards,
Dean



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi everyone,

thanks for the reviews. Let me sum up the feedback so far, and outline my 
plans for the next patch version that I'd like to submit for CF 2016-11.


1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the 
syntax is going to be supported until we add support for either join or 
partial statistics. So something like this:

    CREATE STATISTICS name [ WITH (options) ] ON (column1, column2 [, ...]) FROM table

That should not be a difficult change.
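
For illustration, the subset supported initially might look something like
this (just a sketch, with made-up names):

    CREATE STATISTICS stts1 ON (a, b, c) FROM test;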


2) catalog names

I'm not sure what the best names are, so I'm fine with using whatever is 
the consensus.

That being said, I'm not sure I like extending the catalog to also 
support non-multivariate statistics (like for example statistics on 
expressions). While that would be a clearly useful feature, it seems 
like a slightly different use case and perhaps a separate catalog would 
be better. So maybe pg_statistic_ext is not the best name.


3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki 
proposed using something like pg_node_tree, and maybe storing all the 
statistics in a single value.

I assume pg_node_tree was meant only as an inspiration for how to build a 
pseudo-type on top of a varlena value. I agree that's a good idea, and I 
plan to do something like that - say adding pg_mcv, pg_histogram, 
pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the 
statistics. I don't think so - firstly, it only supports a subset of 
data types, so we'd be unable to store statistics for some data types 
(or we'd have to store them as text, which sucks). Also, there's a fair 
amount of smartness in how the statistics are stored (e.g. how the 
histogram bucket boundaries are deduplicated, or how the estimation uses 
the serialized representation directly). We'd lose all of that when 
using JSONB.

Similarly for storing all the statistics in a single value - I see no 
reason why keeping the statistics in separate columns would be a bad 
idea (after all, that's kinda the point of relational databases). Also, 
there are perfectly valid cases when the caller only needs a particular 
type of statistic - e.g. when estimating GROUP BY we'll only need the 
ndistinct coefficients. Why should we force the caller to fetch and 
detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo types similar to how pg_node_tree is 
defined. That does not seem like a tremendous amount of work.


4) functional dependencies

Several people mentioned they don't like how functional dependencies are 
detected at ANALYZE time, particularly that there's a sudden jump 
between 0 and 1. Instead, a continuous "dependency degree" between 0 and 
1 was proposed.

I'm fine with that, although that makes "clause reduction" (deciding 
that we don't need to estimate one of the clauses at all, as it's 
implied by some other clause) impossible. But that's fine, the 
functional dependencies will still be much less expensive than the other 
statistics.

I'm wondering how this will interact with transitivity, though. IIRC the 
current implementation is able to detect transitive dependencies and use 
that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies, 
which were meant as a trivial type of statistics, mostly to establish 
the shared infrastructure. Which brings me to ndistinct.


5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the 
patch, and the statistic was only built but not used for any sort of 
estimation. I agree with Dean that perhaps it'd be better to move this 
to the very beginning, and use it as the simplest statistic to build the 
infrastructure instead of functional dependencies (which only gets truer 
due to the changes in functional dependencies, discussed in the 
preceding section).

I think it's probably a good idea and I plan to do that, so the patch 
series will probably look like this:
   * 001 - CREATE STATISTICS infrastructure with ndistinct coefficients
   * 002 - use ndistinct coefficients to improve GROUP BY estimates
   * 003 - use ndistinct coefficients in clausesel.c (not sure)
   * 004 - add functional dependencies (build + clausesel.c)
   * 005 - add multivariate MCV (build + clausesel.c)
   * 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to 
estimate regular conditions - it's the place for which ndistinct 
coefficients were originally proposed by Kyotaro-san, but I seem to 
remember it was non-trivial to choose the best statistics when there 
were other types of stats available. But I'll look into that.


6) combining statistics

I've decided not to re-submit this part of the patch until the basic 
functionality gets in. I do think it's a very useful feature (despite 
having my doubts about the existing implementation), but it clearly 
distracts people.

Instead, the patch will use some simple selection strategy (e.g. using a 
single statistics covering most conditions) or perhaps something more 
advanced (e.g. non-overlapping statistics). But nothing complicated.


7) enriching the query plan

Sadly, none of the reviews provides any sort of feedback on how to 
enrich the query plan with information about statistics (instead of 
doing that in clausesel.c in an ad-hoc, ephemeral manner).

So I'm still a bit stuck on this :-(


8) join statistics

Not directly related to the current patch, but I recommend reading this 
paper quantifying the impact of each part of the query optimizer 
(estimates, cost model, plan enumeration):
    http://www.vldb.org/pvldb/vol9/p204-leis.pdf

The one conclusion that I take from it is that we really need to think 
about improving the join estimates somehow, because that's by far the 
most significant source of issues (and the hardest one to fix).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi,

Attached is v20 of the multivariate statistics patch series, doing
mostly the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo
* 0002 : (PATCH) shared infrastructure and ndistinct coefficients
* 0003 : (PATCH) functional dependencies (only the ANALYZE part)
* 0004 : (PATCH) selectivity estimation using functional dependencies
* 0005 : (PATCH) multivariate MCV lists
* 0006 : (PATCH) multivariate histograms
* 0007 : (WIP) selectivity estimation using ndistinct coefficients
* 0008 : (WIP) use multiple statistics for estimation
* 0009 : (WIP) psql tab completion basics

Let me elaborate about the main changes in this version:


1) rework CREATE STATISTICS to what Dean Rasheed proposed in [1]:
-----------------------------------------------------------------------

      CREATE STATISTICS name WITH (options) ON (columns) FROM table

This allows adding support for statistics on joins, expressions
referencing multiple tables, and partial statistics (with WHERE
predicates, similar to indexes). Although those things are not
implemented (and I don't know if/when that will happen), it's good that
the syntax supports them.

I've been thinking about using "CREATE STATISTIC" instead, but I decided
to stick with "STATISTICS" for two reasons. Firstly, it's possible to
create multiple statistics in a single command, for example by using
WITH (mcv,histogram). And secondly, we already have "ALTER TABLE ... SET
STATISTICS n" (although that tweaks the statistics target for a column,
not the statistics on the column).
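
For example (a sketch with made-up names), a single command creating two
statistics types at once:

    CREATE STATISTICS s1 WITH (mcv, histogram) ON (a, b) FROM t;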


2) no changes to catalog names
-----------------------------------------------------------------------

Clearly, naming things is one of the hardest things in computer science.
I don't have a good idea what names would be better than the current
ones. In any case, this is fairly trivial to do.


3) special data types for statistics
-----------------------------------------------------------------------

Heikki proposed to invent a new data type, similar to pg_node_tree. I do
agree that storing the stats in plain bytea (i.e. catalog having bytea
columns) was not particularly convenient, but I'm not sure how much of
pg_node_tree Heikki wanted to copy.

In particular, I'm not sure whether Heikki's idea was to store all the
statistics together in a single Datum, serialized into a text string
(similar to pg_node_tree).

I don't think that would be a good idea, as the statistics may be quite
large and complex, and deserializing them from text format would be
quite expensive. For pg_node_tree that's not a major issue because the
values are usually fairly small. Similarly, packing everything into a
single datum would force the planner to parse/unpack everything, even if
it needs just a small piece (e.g. the ndistinct coefficients, but not
histograms).

So I've decided to invent new data types, one for each statistic type:

* pg_ndistinct
* pg_dependencies
* pg_mcv_list
* pg_histogram

Similarly to pg_node_tree those data types only support output, i.e.
both 'recv' and 'in' functions do elog(ERROR). But while pg_node_tree is
stored as text, those new data types are still bytea.

I do believe this is a good solution, and it allows casting the data
types to text easily, as it simply calls the out function.
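
So inspecting the serialized statistics might look something like this
(a sketch; stadeps is the functional dependencies column used elsewhere
in this thread):

    SELECT staname, stadeps::text FROM pg_mv_statistic WHERE staname = 's1';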

The statistics however do not store attnums in the bytea, just indexes
into pg_mv_statistic.stakeys. That means the out functions can't print
column names in the output, or values (because without the attnum we
don't know the type, and thus can't lookup the proper out function).

I don't think there's a good solution for that (I was thinking about
storing the attnums/typeoid in the statistics itself, but that seems
fairly ugly). And I'm quite happy with those new data types.


4) replace functional dependencies with ndistinct (in the first patch)
-----------------------------------------------------------------------

As the ndistinct coefficients are simpler than functional dependencies,
I've decided to use them in the first patch in the series, which
implements the shared infrastructure. This does not mean throwing away
functional dependencies entirely, just moving them to a later patch.


5) rework of ndistinct coefficients
-----------------------------------------------------------------------

The ndistinct coefficients were also significantly reworked. Instead of
computing and storing the value for the exact combination of attributes,
the new version computes ndistinct for all combinations of attributes.

So for example with CREATE STATISTICS x ON (a,b,c) the old patch only
computed ndistinct on (a,b,c), while the new patch computes ndistinct on
{(a,b,c), (a,b), (a,c), (b,c)}. This makes it way more powerful.

The first patch (0002) only uses this in estimate_num_groups to improve
GROUP BY estimates. A later patch (0007) shows how it might be used for
selectivity estimation, but it's a very early WIP at this point.
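
As a rough sketch of the kind of query 0002 targets (the table, data and
statistics names are made up for illustration):

    CREATE TABLE t (a INT, b INT);
    INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
    CREATE STATISTICS s1 WITH (ndistinct) ON (a, b) FROM t;
    ANALYZE t;

    -- without the statistics the planner estimates roughly 100 * 100 groups,
    -- with the ndistinct coefficient it should get close to the actual ~100
    EXPLAIN SELECT a, b FROM t GROUP BY a, b;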

Also, I'm not sure we should use ndistinct coefficients this way,
because of the "homogeneity" assumption, similarly to functional
dependencies. Functional dependencies are used only for selectivity
estimation, so it's quite easy not to use them if they don't work for
that purpose. But ndistinct coefficients are also used for GROUP BY
estimation, where the homogeneity assumption is not such a big deal. So I
expect people to add ndistinct, get better GROUP BY estimates but
sometimes worse selectivity estimates - not great, I guess.

But the selectivity estimation using ndistinct coefficients is very
simple right now - in particular it does not use the per-clause
selectivities at all, it simply assumes the whole selectivity is
1/ndistinct for the combination of columns.

Functional dependencies use this formula to combine the selectivities:

     P(a,b) = P(a) * [f + (1-f)*P(b)]

so maybe there's something similar for ndistinct coefficients? I mean,
let's say we know ndistinct(a), ndistinct(b), ndistinct(a,b) and P(a)
and P(b). How do we compute P(a,b)?
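
For reference, this is how the functional dependency formula behaves on
concrete (made-up) numbers, say f = 0.5 and P(a) = P(b) = 0.01:

    -- independence would give P(a) * P(b) = 0.0001, full dependence gives P(a) = 0.01
    SELECT 0.01 * (0.5 + (1 - 0.5) * 0.01) AS p_a_b;   -- 0.00505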


6) rework functional dependencies
-----------------------------------------------------------------------

Based on Dean's feedback, I've reworked functional dependencies to use
continuous "degree" of validity (instead of true/false behavior,
resulting in sudden changes in behavior).

This significantly reduced the amount of code, because the old patch
tried to identify transitive dependencies (to minimize time and storage
requirements). Switching to continuous degree makes this impossible (or
at least far more complicated), so I've simply ripped all of this out.

This means the statistics will be larger and ANALYZE will take more
time, but the differences are fairly small in practice, and the
estimation actually seems to work better.


7) MCV and histogram changes
-----------------------------------------------------------------------

Those statistics types are mostly unchanged, except for a few minor bug
fixes and removal of the max_mcv_items and max_buckets options.

Those options were meant to allow users to limit the size of the
statistics, but the implementation was ignoring them so far. So I've
ripped them out, and if needed we may reintroduce them later.


8) no more (elaborate) combinations of statistics
-----------------------------------------------------------------------

I've ripped out the patch that combined multiple statistics in a very
elaborate way - it was overly complex, possibly wrong, but most
importantly it distracted people from the preceding patches. Instead,
I've replaced it with a very simple approach that allows using multiple
statistics on different subsets of the clause list. So for example

      WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

may benefit from two statistics, one on (a,b) and a second on (c,d). It's
a very simple approach, but it does the trick for many cases and is better
than the "single statistics" limitation.
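
For that example the two statistics might be created like this (a sketch
with made-up names):

    CREATE STATISTICS s_ab WITH (mcv) ON (a, b) FROM t;
    CREATE STATISTICS s_cd WITH (mcv) ON (c, d) FROM t;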

The 0008 patch is actually very simple, essentially adding just a loop
into the code blocks, so I think it's quite likely this will get merged
into the preceding patches.


9) reduce table sizes used in regression tests
-----------------------------------------------------------------------

Some of the regression tests used quite large tables (with up to 1M
rows), which had two issues - long runtimes and instability (because the
ANALYZE sample is only 30k rows, so there were sometimes small changes
due to picking a different sample). I've limited the table sizes to 30k
rows.


10) open / unsolved questions
-----------------------------------------------------------------------

The main open question is still whether clausesel.c is the best place to
do all the heavy lifting (particularly matching clauses and statistics,
and deciding which statistics to use). I suspect some of that should be
done elsewhere (earlier in the planning), enriching the query tree
somehow. Then clausesel.c would "only" compute the estimates, and it
would also allow showing the info in EXPLAIN.

I'm not particularly happy with how the changes in clauselist_selectivity
look right now - there are three almost identical blocks, so this would
deserve some refactoring. But I'd like to get some feedback first.

regards

[1]
https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com

[2]
https://www.postgresql.org/message-id/1c7e4e63-769b-f8ce-f245-85ef4f59fcba%40iki.fi

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: WIP: multivariate statistics / proof of concept

From
Robert Haas
Date:
[ reviving an old multivariate statistics thread ]

On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Tomas, did you ever do any testing in this area?  One of my
colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
20 GB.  Q18 actually doesn't complete at all right now because of an
issue with the new simplehash implementation.  I reported it to Andres
and he tracked it down, but hasn't posted the patch yet - see
http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de

Of the remaining queries, the slowest are Q9 and Q20, and both of them
have serious estimation errors.  On Q9, things go wrong here:
 ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
                 (actual time=103592.821..149335.010 rows=6503988 loops=1)
       Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
       Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
       Rows Removed by Join Filter: 19511964
       ->  Index Scan using idx_partsupp_partkey on partsupp
               (cost=0.43..781956.32 rows=15999792 width=22)
               (actual time=0.044..11825.481 rows=15999881 loops=1)
       ->  Sort  (cost=5224967.03..5245348.02 rows=8152396 width=45)
                 (actual time=103592.505..112205.444 rows=26015949 loops=1)
             Sort Key: part.p_partkey
             Sort Method: quicksort  Memory: 704733kB
             ->  Hash Join  (cost=127278.36..4289121.18 rows=8152396 width=45)
                            (actual time=1084.370..94732.951 rows=6503988 loops=1)
                   Hash Cond: (lineitem.l_partkey = part.p_partkey)
                   ->  Seq Scan on lineitem
                           (cost=0.00..3630339.08 rows=119994608 width=41)
                           (actual time=0.015..33355.637 rows=119994608 loops=1)
                   ->  Hash  (cost=123743.07..123743.07 rows=282823 width=4)
                             (actual time=1083.686..1083.686 rows=216867 loops=1)
                         Buckets: 524288  Batches: 1  Memory Usage: 11721kB
                         ->  Gather  (cost=1000.00..123743.07 rows=282823 width=4)
                                     (actual time=0.418..926.283 rows=216867 loops=1)
                               Workers Planned: 4
                               Workers Launched: 4
                               ->  Parallel Seq Scan on part
                                       (cost=0.00..94460.77 rows=70706 width=4)
                                       (actual time=0.063..962.909 rows=43373 loops=5)
                                     Filter: ((p_name)::text ~~ '%grey%'::text)
                                     Rows Removed by Filter: 756627

The estimate for the index scan on partsupp is essentially perfect,
and the lineitem-part join is off by about 3x.  However, the merge
join is off by about 4000x, which is real bad.

On Q20, things go wrong here:
 ->  Merge Join  (cost=5928271.92..6411281.44 rows=278 width=16)
                 (actual time=77887.963..136614.284 rows=118124 loops=1)
       Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                    (lineitem.l_suppkey = partsupp.ps_suppkey))
       Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
       Rows Removed by Join Filter: 242
       ->  GroupAggregate  (cost=5363980.40..5691151.45 rows=9681876 width=48)
                           (actual time=76672.726..131482.677 rows=10890067 loops=1)
             Group Key: lineitem.l_partkey, lineitem.l_suppkey
             ->  Sort  (cost=5363980.40..5409466.13 rows=18194291 width=21)
                       (actual time=76672.661..86405.882 rows=18194084 loops=1)
                   Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                   Sort Method: external merge  Disk: 551376kB
                   ->  Bitmap Heap Scan on lineitem
                           (cost=466716.05..3170023.42 rows=18194291 width=21)
                           (actual time=13735.552..39289.995 rows=18195269 loops=1)
                         Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND
                                        (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                         Heap Blocks: exact=2230011
                         ->  Bitmap Index Scan on idx_lineitem_shipdate
                                 (cost=0.00..462167.48 rows=18194291 width=0)
                                 (actual time=11771.173..11771.173 rows=18195269 loops=1)
                               Index Cond: ((l_shipdate >= '1994-01-01'::date) AND
                                            (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
       ->  Sort  (cost=564291.52..567827.56 rows=1414417 width=24)
                 (actual time=1214.812..1264.356 rows=173936 loops=1)
             Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
             Sort Method: quicksort  Memory: 19733kB
             ->  Nested Loop  (cost=1000.43..419796.26 rows=1414417 width=24)
                              (actual time=0.447..985.562 rows=173936 loops=1)
                   ->  Gather  (cost=1000.00..99501.07 rows=40403 width=4)
                               (actual time=0.390..34.476 rows=43484 loops=1)
                         Workers Planned: 4
                         Workers Launched: 4
                         ->  Parallel Seq Scan on part
                                 (cost=0.00..94460.77 rows=10101 width=4)
                                 (actual time=0.143..527.665 rows=8697 loops=5)
                               Filter: ((p_name)::text ~~ 'beige%'::text)
                               Rows Removed by Filter: 791303
                   ->  Index Scan using idx_partsupp_partkey on partsupp
                           (cost=0.43..7.58 rows=35 width=20)
                           (actual time=0.017..0.019 rows=4 loops=43484)
                         Index Cond: (ps_partkey = part.p_partkey)

The estimate for the GroupAggregate feeding one side of the merge join
is quite accurate.  The estimate for the part-partsupp join on the
other side is off by 8x.  Then things get much worse: the estimate for
the merge join is off by 400x.

I'm not really sure whether the multivariate statistics stuff will fix
this kind of case or not, but if it did it would be awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: multivariate statistics / proof of concept

From
Tomas Vondra
Date:
On 11/21/2016 11:10 PM, Robert Haas wrote:
> [ reviving an old multivariate statistics thread ]
>
> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>>
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
>
> Tomas, did you ever do any testing in this area?  One of my
> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
> 20 GB.  Q18 actually doesn't complete at all right now because of an
> issue with the new simplehash implementation.  I reported it to Andres
> and he tracked it down, but hasn't posted the patch yet - see
> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de
>
> Of the remaining queries, the slowest are Q9 and Q20, and both of them
> have serious estimation errors.  On Q9, things go wrong here:
>
>                                  ->  Merge Join
> (cost=5225092.04..6595105.57 rows=154 width=47) (actual
> time=103592.821..149335.010 rows=6503988 loops=1)
>                                        Merge Cond:
> (partsupp.ps_partkey = lineitem.l_partkey)
>                                        Join Filter:
> (lineitem.l_suppkey = partsupp.ps_suppkey)
>                                        Rows Removed by Join Filter: 19511964
>                                        ->  Index Scan using> [snip]
>
> Rows Removed by Filter: 756627
>
> The estimate for the index scan on partsupp is essentially perfect,
> and the lineitem-part join is off by about 3x.  However, the merge
> join is off by about 4000x, which is real bad.
>

The patch only deals with statistics on base relations, no joins, at 
this point. It's meant to be extended in that direction, so the syntax 
supports it, but at this point that's all. No joins.

That being said, this estimate should be improved in 9.6, when you 
create a foreign key between the tables. In fact, that patch was exactly 
about Q9.
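
For reference, the foreign key in question would be something like this
(assuming the usual TPC-H schema, where partsupp has a primary key on
(ps_partkey, ps_suppkey)):

    ALTER TABLE lineitem ADD FOREIGN KEY (l_partkey, l_suppkey)
        REFERENCES partsupp (ps_partkey, ps_suppkey);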

This is how the join estimate looks on scale 1 without the FK between 
the two tables:
                          QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
                (cost=0.43..605856.84 rows=6001117 width=117)
   ->  Index Scan using partsupp_pkey on partsupp
                (cost=0.42..61141.76 rows=800000 width=144)
(4 rows)


and with the foreign key:
                             QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
             (actual rows=6001215 loops=1)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
                (cost=0.43..605856.84 rows=6001117 width=117)
                (actual rows=6001215 loops=1)
   ->  Index Scan using partsupp_pkey on partsupp
                (cost=0.42..61141.76 rows=800000 width=144)
                (actual rows=6001672 loops=1)
 Planning time: 3.840 ms
 Execution time: 21987.913 ms
(6 rows)


> On Q20, things go wrong here:>
> [snip]
>
> The estimate for the GroupAggregate feeding one side of the merge join
> is quite accurate.  The estimate for the part-partsupp join on the
> other side is off by 8x.  Then things get much worse: the estimate for
> the merge join is off by 400x.
>

Well, most of the estimation error comes from the join, but sadly the 
aggregate makes using the foreign keys impossible - at least in the 
current version. I don't know if it can be improved, somehow.

> I'm not really sure whether the multivariate statistics stuff will fix
> this kind of case or not, but if it did it would be awesome.
>

Join statistics are something I'd like to add eventually, but I don't 
see how it could happen in the first version. Also, the patch received 
no reviews this CF, and making it even larger is unlikely to make it 
more attractive.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: multivariate statistics / proof of concept

From
Haribabu Kommi
Date:


On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [snip]
>
> Join statistics are something I'd like to add eventually, but I don't
> see how it could happen in the first version. Also, the patch received
> no reviews this CF, and making it even larger is unlikely to make it
> more attractive.

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] multivariate statistics (v19)

From
Amit Langote
Date:
Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:
> Hi,
> 
> Attached is v20 of the multivariate statistics patch series, doing mostly
> the changes outlined in the preceding e-mail from October 11.
> 
> The patch series currently has these parts:
> 
> * 0001 : (FIX) teach pull_varno about RestrictInfo
> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
> * 0004 : (PATCH) selectivity estimation using functional dependencies
> * 0005 : (PATCH) multivariate MCV lists
> * 0006 : (PATCH) multivariate histograms
> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
> * 0008 : (WIP) use multiple statistics for estimation
> * 0009 : (WIP) psql tab completion basics

Unfortunately, this failed to compile because of the duplicate_oids error.
The partitioning patch consumed the same OIDs as used in this patch.

I will try to read the patches in some more detail, but in the meantime,
here are some comments/nitpicks on the documentation:

No updates to doc/src/sgml/catalogs.sgml?

+  <para>
+   The examples presented in <xref linkend="row-estimation-examples"> used
+   statistics about individual columns to compute selectivity estimates.
+   When estimating conditions on multiple columns, the planner assumes
+   independence and multiplies the selectivities. When the columns are
+   correlated, the independence assumption is violated, and the estimates
+   may be seriously off, resulting in poor plan choices.
+  </para>

The term independence is used in isolation - independence of what?
Independence of the distributions of values in separate columns?  Also,
the phrase "seriously off" could perhaps be replaced by more rigorous
terminology; it might be unclear to some readers.  Perhaps: wildly
inaccurate, :)

+<programlisting>
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
+                                           QUERY PLAN
+-------------------------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
time=0.031..2.870 rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms

Is there a reason why examples in "67.2. Multivariate Statistics" (like
the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
Examples" (also, other relevant chapters) use just EXPLAIN?

+   the final 0.01% estimate. The plan however shows that this results in
+   a significant under-estimate, as the actual number of rows matching the

s/under-estimate/underestimate/g

+  <para>
+   For additional details about multivariate statistics, see
+   <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
+   <literal>README</> for each type of statistics, mentioned in the following
+   sections.
+  </para>

Referring to source tree READMEs seems novel around this portion of the
documentation, but I think not too far away, there are some references.
This is under the VII. Internals chapter anyway, so that might be OK.

In any case, s/README.statsc/README.stats/g

Also, s/additional README/additional READMEs/g  (tags omitted for brevity)

+    used in definitions of database normal forms. When simplified, saying
that
+    <literal>b</> is functionally dependent on <literal>a</> means that

Maybe, s/When simplified/In simple terms/g

+    In normalized databases, only functional dependencies on primary keys
+    and super keys are allowed. In practice however many data sets are not
+    fully normalized, for example thanks to intentional denormalization for
+    performance reasons. The table <literal>t</> is an example of a data
+    with functional dependencies. As <literal>a=b</> for all rows in the
+    table, <literal>a</> is functionally dependent on <literal>b</> and
+    <literal>b</> is functionally dependent on <literal>a</literal>.

"super keys" sounds like a new term.

s/for example thanks to/for example, thanks to/g  (or due to instead of
thanks to)

How about: s/an example of a data with/an example of a schema with/g

Perhaps, s/a=b/a = b/g  (additional white space)

+    Similarly to per-column statistics, multivariate statistics are stored in

I notice that "similar to" is used more often than "similarly to".  But
that might be OK.

+     This shows that the statistics is defined on table <structname>t</>,

Perhaps: the statistics is -> the statistics are or the statistic is

+     lists <structfield>attnums</structfield> of the columns (references
+     <structname>pg_attribute</structname>).

While this text may be OK on the catalog description page, it might be
better to expand attnums here as "attribute numbers" dropping the
parenthesized phrase altogether.

+<programlisting>
+SELECT pg_mv_stats_dependencies_show(stadeps)
+  FROM pg_mv_statistic WHERE staname = 's1';
+
+ pg_mv_stats_dependencies_show
+-------------------------------
+ (1) => 2, (2) => 1
+(1 row)
+</programlisting>

Couldn't this somehow show actual column names, instead of attribute numbers?

Will read more later.

Thanks,
Amit





Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi Amit,

attached is v21 of the patch series, rebased to current master 
(resolving the duplicate OID and a few trivial merge conflicts), and 
also fixing some of the issues you reported.

On 12/12/2016 12:26 PM, Amit Langote wrote:
>
> Hi Tomas,
>
> On 2016/10/30 4:23, Tomas Vondra wrote:
>> Hi,
>>
>> Attached is v20 of the multivariate statistics patch series, doing mostly
>> the changes outlined in the preceding e-mail from October 11.
>>
>> The patch series currently has these parts:
>>
>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
>> * 0004 : (PATCH) selectivity estimation using functional dependencies
>> * 0005 : (PATCH) multivariate MCV lists
>> * 0006 : (PATCH) multivariate histograms
>> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
>> * 0008 : (WIP) use multiple statistics for estimation
>> * 0009 : (WIP) psql tab completion basics
>
> Unfortunately, this failed to compile because of the duplicate_oids error.
> The partitioning patch consumed the same OIDs as used in this patch.
>

Fixed, should compile fine now (even each patch in the series).

> I will try to read the patches in some more detail, but in the meantime,
> here are some comments/nitpicks on the documentation:
>
> No updates to doc/src/sgml/catalogs.sgml?
>

Good point. I've added a section for the pg_mv_statistic catalog.

> +  <para>
> +   The examples presented in <xref linkend="row-estimation-examples"> used
> +   statistics about individual columns to compute selectivity estimates.
> +   When estimating conditions on multiple columns, the planner assumes
> +   independence and multiplies the selectivities. When the columns are
> +   correlated, the independence assumption is violated, and the estimates
> +   may be seriously off, resulting in poor plan choices.
> +  </para>
>
> The term independence is used in isolation - independence of what?
> Independence of the distributions of values in separate columns?  Also,
> the phrase "seriously off" could perhaps be replaced by more rigorous
> terminology; it might be unclear to some readers.  Perhaps: wildly
> inaccurate, :)
>

I've reworded this to "independence of the conditions" and "off by 
several orders of magnitude". Hope that's better.

> +<programlisting>
> +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
> +                                           QUERY PLAN
> +-------------------------------------------------------------------------------------------------
> + Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
> time=0.031..2.870 rows=100 loops=1)
> +   Filter: (a = 1)
> +   Rows Removed by Filter: 9900
> + Planning time: 0.092 ms
> + Execution time: 3.103 ms
>
> Is there a reason why examples in "67.2. Multivariate Statistics" (like
> the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
> Examples" (also, other relevant chapters) use just EXPLAIN?
>

Yes, the reason is that while 67.1 shows how the optimizer estimates row 
counts and constructs the plan (so EXPLAIN is sufficient), 67.2 
demonstrates how the estimates are inaccurate with respect to the actual 
row counts. Thus the EXPLAIN ANALYZE.

> +   the final 0.01% estimate. The plan however shows that this results in
> +   a significant under-estimate, as the actual number of rows matching the
>
> s/under-estimate/underestimate/g
>
> +  <para>
> +   For additional details about multivariate statistics, see
> +   <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
> +   <literal>README</> for each type of statistics, mentioned in the following
> +   sections.
> +  </para>
>
> Referring to source tree READMEs seems novel around this portion of the
> documentation, but I think not too far away, there are some references.
> This is under the VII. Internals chapter anyway, so that might be OK.
>

I think there's a threshold where the detail becomes too low-level for 
the sgml docs - say, when it discusses implementation details - at which 
point a README is more appropriate. I don't know if I got it entirely 
right with the docs, though, so perhaps some bits may move in either 
direction.

> In any case, s/README.statsc/README.stats/g
>
> Also, s/additional README/additional READMEs/g  (tags omitted for brevity)
>
> +    used in definitions of database normal forms. When simplified, saying
> that
> +    <literal>b</> is functionally dependent on <literal>a</> means that
>

Fixed.

> Maybe, s/When simplified/In simple terms/g
>
> +    In normalized databases, only functional dependencies on primary keys
> +    and super keys are allowed. In practice however many data sets are not
> +    fully normalized, for example thanks to intentional denormalization for
> +    performance reasons. The table <literal>t</> is an example of a data
> +    with functional dependencies. As <literal>a=b</> for all rows in the
> +    table, <literal>a</> is functionally dependent on <literal>b</> and
> +    <literal>b</> is functionally dependent on <literal>a</literal>.
>
> "super keys" sounds like a new term.
>

Actually no, "super key" is a term defined in normal forms.

> s/for example thanks to/for example, thanks to/g  (or due to instead of
> thanks to)
>
> How about: s/an example of a data with/an example of a schema with/g
>

I think "example of data set" is better. Reworded.

> Perhaps, s/a=b/a = b/g  (additional white space)
>
> +    Similarly to per-column statistics, multivariate statistics are stored in
>
> I notice that "similar to" is used more often than "similarly to".  But
> that might be OK.
>

Not sure.

> +     This shows that the statistics is defined on table <structname>t</>,
>
> Perhaps: the statistics is -> the statistics are or the statistic is
>

As that paragraph is only about functional dependencies, I think 
'statistic is' is more appropriate.

> +     lists <structfield>attnums</structfield> of the columns (references
> +     <structname>pg_attribute</structname>).
>
> While this text may be OK on the catalog description page, it might be
> better to expand attnums here as "attribute numbers" dropping the
> parenthesized phrase altogether.
>

Not sure. I've reworded it like this:

    This shows that the statistic is defined on table <structname>t</>,
    <structfield>attnums</structfield> lists attribute numbers of columns
    (references <structname>pg_attribute</structname>). It also shows

Does that sound better?

> +<programlisting>
> +SELECT pg_mv_stats_dependencies_show(stadeps)
> +  FROM pg_mv_statistic WHERE staname = 's1';
> +
> + pg_mv_stats_dependencies_show
> +-------------------------------
> + (1) => 2, (2) => 1
> +(1 row)
> +</programlisting>
>
> Couldn't this somehow show actual column names, instead of attribute numbers?
>

Yeah, I was thinking about that too. The trouble is that's table-level 
metadata, so we don't have that kind of info serialized within the data 
type (e.g. because it would not handle column renames etc.).

It might be possible to explicitly pass the table OID as a parameter of 
the function, but it seemed a bit ugly to me.
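
If we did pass the OID explicitly, the call might look something like
this (a purely hypothetical signature):

    SELECT pg_mv_stats_dependencies_show(stadeps, starelid)
      FROM pg_mv_statistic WHERE staname = 's1';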


FWIW, as I wrote in this thread, the place where this patch series needs 
feedback most desperately is integration into the optimizer. Currently 
all the magic happens in clausesel.c and does not leave it. I think it 
would be good to move some of that (particularly the choice of 
statistics to apply) to an earlier stage, and store the information 
within the plan tree itself, so that it's available outside clausesel.c 
(e.g. for EXPLAIN - showing which stats were picked seems useful).

I was thinking it might work similarly to the foreign key estimation 
patch (100340e2). It might even be more efficient, as the current code 
may end up repeating the selection of statistics multiple times. But 
enriching the plan tree turned out to be way more invasive than I'm 
comfortable with (but maybe that'd be OK).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] multivariate statistics (v19)

From
Petr Jelinek
Date:
On 12/12/16 22:50, Tomas Vondra wrote:
>> +<programlisting>
>> +SELECT pg_mv_stats_dependencies_show(stadeps)
>> +  FROM pg_mv_statistic WHERE staname = 's1';
>> +
>> + pg_mv_stats_dependencies_show
>> +-------------------------------
>> + (1) => 2, (2) => 1
>> +(1 row)
>> +</programlisting>
>>
>> Couldn't this somehow show actual column names, instead of attribute
>> numbers?
>>
> 
> Yeah, I was thinking about that too. The trouble is that's table-level
> metadata, so we don't have that kind of info serialized within the data
> type (e.g. because it would not handle column renames etc.).
> 
> It might be possible to explicitly pass the table OID as a parameter of
> the function, but it seemed a bit ugly to me.

I think it makes sense to have such a function. This is not the type's out
function, so I think it's OK for it to take the OID as input, especially
since in the use-case shown above you can use starelid easily.

> 
> FWIW, as I wrote in this thread, the place where this patch series needs
> feedback most desperately is integration into the optimizer. Currently
> all the magic happens in clausesel.c and does not leave it.I think it
> would be good to move some of that (particularly the choice of
> statistics to apply) to an earlier stage, and store the information
> within the plan tree itself, so that it's available outside clausesel.c
> (e.g. for EXPLAIN - showing which stats were picked seems useful).
> 
> I was thinking it might work similarly to the foreign key estimation
> patch (100340e2). It might even be more efficient, as the current code
> may end repeating the selection of statistics multiple times. But
> enriching the plan tree turned out to be way more invasive than I'm
> comfortable with (but maybe that'd be OK).
>

In theory it seems like a possibly reasonable approach to me, mainly
because mv statistics are user defined objects. I guess we'd have to see
at least some PoC to see how invasive it is. But I ultimately think that
feedback from a committer who is more familiar with the planner is needed
here.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Petr Jelinek
Date:
On 12/12/16 22:50, Tomas Vondra wrote:
> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>
>> Hi Tomas,
>>
>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Attached is v20 of the multivariate statistics patch series, doing
>>> mostly
>>> the changes outlined in the preceding e-mail from October 11.
>>>
>>> The patch series currently has these parts:
>>>
>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients

Hi,

I went over these two (IMHO those could easily be considered a minimal
committable set, even if the user-visible functionality they provide is
rather limited).

> dropping statistics
> -------------------
> 
> The statistics may be dropped automatically using DROP STATISTICS.
> 
> After ALTER TABLE ... DROP COLUMN, statistics referencing the column are:
> 
>   (a) dropped, if the statistics would reference only one column
> 
>   (b) retained, but modified on the next ANALYZE

This should be documented in user visible form if you plan to keep it
(it does make sense to me).
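
A small illustration of the behaviour described above (names made up):

    CREATE STATISTICS s3 ON (a, b, c) FROM t;
    ALTER TABLE t DROP COLUMN c;   -- s3 retained, adjusted on the next ANALYZE
    ALTER TABLE t DROP COLUMN b;   -- s3 would reference a single column, so it is dropped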

> +   therefore perfectly correlated. Providing additional information about
> +   correlation between columns is the purpose of multivariate statistics,
> +   and the rest of this section thoroughly explains how the planner
> +   leverages them to improve estimates.
> +  </para>
> +
> +  <para>
> +   For additional details about multivariate statistics, see
> +   <filename>src/backend/utils/mvstats/README.stats</>. There are additional
> +   <literal>READMEs</> for each type of statistics, mentioned in the following
> +   sections.
> +  </para>
> +
> + </sect1>

I don't think this qualifies as "thoroughly explains" ;)

> +
> +Oid
> +get_statistics_oid(List *names, bool missing_ok)

No comment?

> +        case OBJECT_STATISTICS:
> +            msg = gettext_noop("statistics \"%s\" does not exist, skipping");
> +            name = NameListToString(objname);
> +            break;

This sounds somewhat weird (plural vs singular).

> + * XXX Maybe this should check for duplicate stats. Although it's not clear
> + * what "duplicate" would mean here (wheter to compare only keys or also
> + * options). Moreover, we don't do such checks for indexes, although those
> + * store tuples and recreating a new index may be a way to fix bloat (which
> + * is a problem statistics don't have).
> + */
> +ObjectAddress
> +CreateStatistics(CreateStatsStmt *stmt)

I don't think we should check duplicates TBH, so I would remove the XXX
(also "wheter" is a typo, but if you remove that paragraph it does not matter).

> +    if (true)
> +    {

Huh?

> +
> +List *
> +RelationGetMVStatList(Relation relation)
> +{
...
> +
> +void
> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
> +                int2vector *attrs, VacAttrStats **stats)
...
> +static double
> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
> +                   int2vector *attrs, VacAttrStats **stats,
> +                   int k, int *combination)
> +{


Again, these deserve comment.

I'll try to look at other patches in the series as time permits.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Dilip Kumar
Date:
On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> attached is v21 of the patch series, rebased to current master (resolving
> the duplicate OID and a few trivial merge conflicts), and also fixing some
> of the issues you reported.

I wanted to test the grouping estimation behaviour with TPCH, While
testing I found some crash so I thought of reporting it.

My setup details:
TPCH scale factor : 5
Applied all the patches of the v21 series, and ran the queries below.

postgres=# analyze part;
ANALYZE
postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I think it should be easily reproducible, in case it's not I can send
call stack or core dump.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/03/2017 02:42 PM, Dilip Kumar wrote:
> On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> attached is v21 of the patch series, rebased to current master (resolving
>> the duplicate OID and a few trivial merge conflicts), and also fixing some
>> of the issues you reported.
>
> I wanted to test the grouping estimation behaviour with TPCH, While
> testing I found some crash so I thought of reporting it.
>
> My setup details:
> TPCH scale factor : 5
> Applied all the patches of the v21 series, and ran the queries below.
>
> postgres=# analyze part;
> ANALYZE
> postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
> p_size) from part;
> CREATE STATISTICS
> postgres=# analyze part;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> I think it should be easily reproducible, in case it's not I can send
> call stack or core dump.
>

Thanks for the report. It was trivial to reproduce and it turned out to 
be a fairly simple bug. Will send a new version of the patch soon.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 12/30/2016 02:12 PM, Petr Jelinek wrote:
> On 12/12/16 22:50, Tomas Vondra wrote:
>> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>>
>>> Hi Tomas,
>>>
>>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>>> Hi,
>>>>
>>>> Attached is v20 of the multivariate statistics patch series, doing
>>>> mostly
>>>> the changes outlined in the preceding e-mail from October 11.
>>>>
>>>> The patch series currently has these parts:
>>>>
>>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>
> Hi,
>
> I went over these two (IMHO those could easily be considered as minimal
> committable set even if the user visible functionality they provide is
> rather limited).
>

Yes, although I still have my doubts 0001 is the right way to make 
pull_varnos work. It's probably related to the bigger design question, 
because moving the statistics selection to an earlier phase could make 
it unnecessary I guess.

>> dropping statistics
>> -------------------
>>
>> The statistics may be dropped automatically using DROP STATISTICS.
>>
>> After ALTER TABLE ... DROP COLUMN, statistics referencing are:
>>
>>   (a) dropped, if the statistics would reference only one column
>>
>>   (b) retained, but modified on the next ANALYZE
>
> This should be documented in user visible form if you plan to keep it
> (it does make sense to me).
>

Yes, I plan to keep it. I agree it should be documented, probably on the 
ALTER TABLE page (and linked from CREATE/DROP statistics pages).

>> +   therefore perfectly correlated. Providing additional information about
>> +   correlation between columns is the purpose of multivariate statistics,
>> +   and the rest of this section thoroughly explains how the planner
>> +   leverages them to improve estimates.
>> +  </para>
>> +
>> +  <para>
>> +   For additional details about multivariate statistics, see
>> +   <filename>src/backend/utils/mvstats/README.stats</>. There are additional
>> +   <literal>READMEs</> for each type of statistics, mentioned in the following
>> +   sections.
>> +  </para>
>> +
>> + </sect1>
>
> I don't think this qualifies as "thoroughly explains" ;)
>

OK, I'll drop the "thoroughly" ;-)

>> +
>> +Oid
>> +get_statistics_oid(List *names, bool missing_ok)
>
> No comment?
>
>> +        case OBJECT_STATISTICS:
>> +            msg = gettext_noop("statistics \"%s\" does not exist, skipping");
>> +            name = NameListToString(objname);
>> +            break;
>
> This sounds somewhat weird (plural vs singular).
>

Ah, right - it should be either "statistic ... does not" or "statistics 
... do not". I think "statistics" is the right choice here, because (a) 
we have CREATE STATISTICS and (b) it may be a combination of statistics, 
e.g. histogram + MCV.

>> + * XXX Maybe this should check for duplicate stats. Although it's not clear
>> + * what "duplicate" would mean here (wheter to compare only keys or also
>> + * options). Moreover, we don't do such checks for indexes, although those
>> + * store tuples and recreating a new index may be a way to fix bloat (which
>> + * is a problem statistics don't have).
>> + */
>> +ObjectAddress
>> +CreateStatistics(CreateStatsStmt *stmt)
>
> I don't think we should check duplicates TBH so I would remove the XXX
> (also "wheter" is typo but if you remove that paragraph it does not matter).
>

Yes, I came to the same conclusion - we can only really check for exact 
matches (same set of columns, same choice of statistic types), but 
that's fairly useless. I'll remove the XXX.

>> +    if (true)
>> +    {
>
> Huh?
>

Yeah, that's a bit of a weird pattern. It's a remnant of copy-pasting the 
preceding block, which looks like this:

    if (hasindex)
    {
        ...
    }

But we've decided not to add a similar flag for the statistics. I'll move 
the block to a separate function (instead of merging it directly into 
the function, which is already a bit largeish).

>> +
>> +List *
>> +RelationGetMVStatList(Relation relation)
>> +{
> ...
>> +
>> +void
>> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
>> +                int2vector *attrs, VacAttrStats **stats)
> ...
>> +static double
>> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
>> +                   int2vector *attrs, VacAttrStats **stats,
>> +                   int k, int *combination)
>> +{
>
>
> Again, these deserve comment.
>

OK, will add.

> I'll try to look at other patches in the series as time permits.

thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/03/2017 05:22 PM, Tomas Vondra wrote:
> On 01/03/2017 02:42 PM, Dilip Kumar wrote:
...
>> I think it should be easily reproducible, in case it's not I can send
>> call stack or core dump.
>>
>
> Thanks for the report. It was trivial to reproduce and it turned out to
> be a fairly simple bug. Will send a new version of the patch soon.
>

Attached is v22 of the patch series, rebased to current master and 
fixing the reported bug. I haven't made any other changes - the issues 
reported by Petr are mostly minor, so I've decided to wait a bit more 
for (hopefully) other reviews.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v19)

From
Dilip Kumar
Date:
On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

v22 fixes the problem I reported.  In my test, I observed that group
by estimation is much better with the ndistinct stats.

Here is one example:

postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=37992.00..38992.00 rows=100000 width=36) (actual
time=953.359..1011.302 rows=186607 loops=1)
   Group Key: p_brand, p_type, p_size
   ->  Seq Scan on part  (cost=0.00..30492.00 rows=1000000 width=36)
(actual time=0.013..163.672 rows=1000000 loops=1)
 Planning time: 0.194 ms
 Execution time: 1020.776 ms
(5 rows)

postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
ANALYZE
postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=37992.00..39622.46 rows=163046 width=36) (actual
time=935.162..992.944 rows=186607 loops=1)
   Group Key: p_brand, p_type, p_size
   ->  Seq Scan on part  (cost=0.00..30492.00 rows=1000000 width=36)
(actual time=0.013..156.746 rows=1000000 loops=1)
 Planning time: 0.308 ms
 Execution time: 1001.889 ms

In above example,
Without MVStat-> estimated: 100000 Actual: 186607
With MVStat-> estimated: 163046 Actual: 186607

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/04/2017 03:21 PM, Dilip Kumar wrote:
> On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v22 of the patch series, rebased to current master and fixing
>> the reported bug. I haven't made any other changes - the issues reported by
>> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
>> other reviews.
>
> v22 fixes the problem I reported.  In my test, I observed that group
> by estimation is much better with the ndistinct stats.
>
> Here is one example:
>
> postgres=# explain analyze select p_brand, p_type, p_size from part
> group by p_brand, p_type, p_size;
>                                                       QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=37992.00..38992.00 rows=100000 width=36) (actual
> time=953.359..1011.302 rows=186607 loops=1)
>    Group Key: p_brand, p_type, p_size
>    ->  Seq Scan on part  (cost=0.00..30492.00 rows=1000000 width=36)
> (actual time=0.013..163.672 rows=1000000 loops=1)
>  Planning time: 0.194 ms
>  Execution time: 1020.776 ms
> (5 rows)
>
> postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
> p_size) from part;
> CREATE STATISTICS
> postgres=# analyze part;
> ANALYZE
> postgres=# explain analyze select p_brand, p_type, p_size from part
> group by p_brand, p_type, p_size;
>                                                       QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=37992.00..39622.46 rows=163046 width=36) (actual
> time=935.162..992.944 rows=186607 loops=1)
>    Group Key: p_brand, p_type, p_size
>    ->  Seq Scan on part  (cost=0.00..30492.00 rows=1000000 width=36)
> (actual time=0.013..156.746 rows=1000000 loops=1)
>  Planning time: 0.308 ms
>  Execution time: 1001.889 ms
>
> In above example,
> Without MVStat-> estimated: 100000 Actual: 186607
> With MVStat-> estimated: 163046 Actual: 186607
>

Thanks. Those plans match my experiments with the TPC-H data set, 
although I've been playing with the smallest scale (1GB).

It's not very difficult to make the estimation error arbitrary large, 
e.g. by using perfectly correlated (identical) columns.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/03/2017 05:22 PM, Tomas Vondra wrote:
>>
>> On 01/03/2017 02:42 PM, Dilip Kumar wrote:
>
> ...
>>>
>>> I think it should be easily reproducible, in case it's not I can send
>>> call stack or core dump.
>>>
>>
>> Thanks for the report. It was trivial to reproduce and it turned out to
>> be a fairly simple bug. Will send a new version of the patch soon.
>>
>
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

And nothing has happened since. Are there people willing to review
this patch and help it proceed? As this patch is quite large, I am not
sure if it is fit to join the last CF. Thoughts?
-- 
Michael



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:

> > Attached is v22 of the patch series, rebased to current master and fixing
> > the reported bug. I haven't made any other changes - the issues reported by
> > Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> > other reviews.
> 
> And nothing has happened since. Are there people willing to review
> this patch and help it proceed?

I am going to grab this patch as committer.

> As this patch is quite large, I am not sure if it is fit to join the
> last CF. Thoughts?

All patches, regardless of size, are welcome to join any commitfest.
The last commitfest is not different in that regard.  The rule I
remember is that patches may not arrive *for the first time* in the last
commitfest.  This patch has already seen a lot of work in previous
commitfests, so it's fine.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Michael Paquier
Date:
On Wed, Jan 25, 2017 at 9:56 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> And nothing has happened since. Are there people willing to review
>> this patch and help it proceed?
>
> I am going to grab this patch as committer.

Thanks, that's good to know.
-- 
Michael



Re: [HACKERS] multivariate statistics (v19)

From
"Ideriha, Takeshi"
Date:
Hi

When you have time, could you rebase the patches? 
Some patches cannot be applied to the current HEAD.
0001 patch can be applied but the following 0002 patch cannot be.

I've just started reading your patch (mainly docs and README, not yet source code.)

Though these are minor things, I've found some typos or mistakes in the document and README.

>+   statistics on the table. The statistics will be created in the in the
>+   current database. The statistics will be owned by the user issuing

Regarding line 629 at 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch,
there is a double "in the".

>+   knowledge of a value in the first column is sufficient for detemining the
>+   value in the other column. Then functional dependencies are built on those

Regarding line 701 at 0002-PATCH,
"determining" is mistakenly spelled "detemining".


>@@ -0,0 +1,98 @@
>+Multivariate statististics
>+==========================

Regarding line 2415 at 0002-PATCH, "statististics" should be statistics


>+ <refnamediv>
>+  <refname>CREATE STATISTICS</refname>
>+  <refpurpose>define a new statistics</refpurpose>
>+ </refnamediv>

>+ <refnamediv>
>+  <refname>DROP STATISTICS</refname>
>+  <refpurpose>remove a statistics</refpurpose>
>+ </refnamediv>

Regarding line 612 and 771 at 0002-PATCH,
I assume explicitly saying "multiple statistics" would be easier for users to understand,
since these commands don't apply to the statistics we already have in pg_statistic, in my understanding.

>+   [1] http://en.wikipedia.org/wiki/Database_normalization

Regarding line 386 at 0003-PATCH, is it better to change this link to this one:
https://en.wikipedia.org/wiki/Functional_dependency ?
README.dependencies cites directly above link.

Though I pointed out these typos and so on,
I believe this feedback is of lower priority than the source code itself.

So please work on my feedback if you have time.

regards,
Ideriha Takeshi

Re: [HACKERS] multivariate statistics (v19)

From
Dilip Kumar
Date:
On Thu, Jan 5, 2017 at 3:27 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Thanks. Those plans match my experiments with the TPC-H data set, although
> I've been playing with the smallest scale (1GB).
>
> It's not very difficult to make the estimation error arbitrary large, e.g.
> by using perfectly correlated (identical) columns.

I have done an initial review of the ndistinct and histogram patches;
there are a few review comments.

ndistinct
---------
1. Duplicate statistics:
postgres=# create statistics s with (ndistinct) on (a,c) from t;
2017-01-07 16:21:54.575 IST [63817] ERROR:  duplicate key value
violates unique constraint "pg_mv_statistic_name_index"
2017-01-07 16:21:54.575 IST [63817] DETAIL:  Key (staname,
stanamespace)=(s, 2200) already exists.
2017-01-07 16:21:54.575 IST [63817] STATEMENT:  create statistics s
with (ndistinct) on (a,c) from t;
ERROR:  duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL:  Key (staname, stanamespace)=(s, 2200) already exists.

For duplicate statistics, I think we can check the existence of the
statistics and give a more meaningful error, something like: statistics
"s" already exists.

2. Typo
+ /*
+ * Sort the attnums, which makes detecting duplicies somewhat
+ * easier, and it does not hurt (it does not affect the efficiency,
+ * onlike for indexes, for example).
+ */
/onlike/unlike

3. Typo
/** Find attnims of MV stats using the mvoid.*/
int2vector *
find_mv_attnums(Oid mvoid, Oid *relid)

/attnims/attnums


histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Wouldn't it be better to take some percentage of the bucket, based
on the number of distinct elements, for partially matching buckets?


+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] multivariate statistics (v19)

From
Kyotaro HORIGUCHI
Date:
Hello, I'll return on this since this should welcome more eyeballs.

At Thu, 26 Jan 2017 09:03:10 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A565822A9@G01JPEXMBKW04>
> Hi
> 
> When you have time, could you rebase the patches? 
> Some patches cannot be applied to the current HEAD.

For those who are willing to look at this: commit
352a24a1f9d6f7d4abb1175bfd22acc358f43140 breaks these patches, so the
tree just before that commit accepts them cleanly.

> 0001 patch can be applied but the following 0002 patch cannot be.
> 
> I've just started reading your patch (mainly docs and README, not yet source code.)
> 
> Though these are minor things, I've found some typos or mistakes in the document and README.
> 
> >+   statistics on the table. The statistics will be created in the in the
> >+   current database. The statistics will be owned by the user issuing
> 
> Regarding line 629 at 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch,
> there is a double "in the".
> 
> >+   knowledge of a value in the first column is sufficient for detemining the
> >+   value in the other column. Then functional dependencies are built on those
> 
> Regarding line 701 at 0002-PATCH,
> "determining" is mistakenly spelled "detemining".
> 
> 
> >@@ -0,0 +1,98 @@
> >+Multivariate statististics
> >+==========================
> 
> Regarding line 2415 at 0002-PATCH, "statististics" should be statistics
> 
> 
> >+ <refnamediv>
> >+  <refname>CREATE STATISTICS</refname>
> >+  <refpurpose>define a new statistics</refpurpose>
> >+ </refnamediv>
> 
> >+ <refnamediv>
> >+  <refname>DROP STATISTICS</refname>
> >+  <refpurpose>remove a statistics</refpurpose>
> >+ </refnamediv>
> 
> Regarding line 612 and 771 at 0002-PATCH,
> I assume explicitly saying "multiple statistics" would be easier for users to understand,
> since these commands don't apply to the statistics we already have in pg_statistic, in my understanding.
> 
> >+   [1] http://en.wikipedia.org/wiki/Database_normalization
> 
> Regarding line 386 at 0003-PATCH, is it better to change this link to this one:
> https://en.wikipedia.org/wiki/Functional_dependency ?
> README.dependencies cites directly above link.
> 
> Though I pointed out these typos and so on,
> I believe this feedback is of lower priority than the source code itself.
> 
> So please work on my feedback if you have time.


README.dependencies

 > dependencies, and for each one count the number of rows rows consistent it.

 "of rows rows consistent it" => "of rows consistent with it"?

 > are in fact consistent with the functinal dependency, i.e. that given the a

 "that given the a" => "that given a" ?


dependencies.c:

 dependency_dgree():

  - The k is assumed larger than 1. I think assertion is required.

  - "/* end of the preceding group */" seems to be better if it
    is just after the "if (multi_sort.." currently just after it.

  - The following comment seems mis-edited.
    > * If there is a single are no contradicting rows, count the group
    > * as supporting, otherwise contradicting.

    maybe this would be like the following? The varialbe counting
    the first "contradiction" is named "n_violations". This seems
    somewhat confusing.

    > * If there are no violating rows up to here, count the group
    > * as supporting, otherwise contradicting.

  - "/* first columns match, but the last one does not"
    else if (multi_sort_compare_dims((k - 1), (k - 1), ...

    The above comparison should use multi_sort_compare_dim, not
    dims

  - This function counts "n_contradicting_rows" but it is not
    referenced. Anyway n_contradicting_rows = numrows -
    n_supporing_rows so it and n_contradicting seem
    unncecessary.

 build_mv_dependencies():

  - In the comment,
    "* covering jut 2 columns, to the largest ones, covering all columns"
    "* included int the statistics. We start from the smallest ones because we"

    l1: "jut" => "just", l2: "int" => "in"


mvstats.h:

  - struct MVDependencyData/ MVDependenciesData

    The varialbe length member at the last of the structs should
    be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.

  - I'm not sure how much it impacts performance, but some
    struct members seems to have a bit too wide types. For
    example, MVDepedenciesData.type is of int32 but it can have
    only '1' for now and it won't be two-digits. Also ndeps
    cannot be so large.


common.c:

 multi_sort_compare_dims needs comment.


general:

  This patch uses int16 as the type of attrubute number but it
  might be better to use AttrNumber for the purpose.
  (Specifically it seems defined as the type for an attribute
   index but also used as the varialbe for number of attributes)


Sorry for the random comment in advance. I'll learn this further.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Tomas Vondra wrote:
> On 01/03/2017 05:22 PM, Tomas Vondra wrote:
> > On 01/03/2017 02:42 PM, Dilip Kumar wrote:
> ...
> > > I think it should be easily reproducible, in case it's not I can send
> > > call stack or core dump.
> > > 
> > 
> > Thanks for the report. It was trivial to reproduce and it turned out to
> > be a fairly simple bug. Will send a new version of the patch soon.
> > 
> 
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

Hmm.  So we have a catalog pg_mv_statistics which stores two things:
1. the configuration regarding mvstats that have been requested by user  via CREATE/ALTER STATISTICS
2. the actual values captured from the above, via ANALYZE

I think this conflates two things that really are separate, given their
different timings and usage patterns.  This decision is causing the
catalog to have columns enabled/built flags for each set of stats
requested, which looks a bit odd.  In particular, the fact that you have
to heap_update the catalog in order to add more stuff as it's built
looks inconvenient.

Have you thought about having the "requested" bits be separate from the
actual computed values?  Something like

pg_mv_statistics
  starelid
  staname
  stanamespace
  staowner     -- all the above as currently
  staenabled    array of "char" {d,f,s}
  stakeys
// no CATALOG_VARLEN here

where each char in the staenabled array has a #define and indicates one
type, "ndistinct", "functional dep", "selectivity" etc.

The actual values computed by ANALYZE would live in a catalog like:

pg_mv_statistics_values
  stvstaid    -- OID of the corresponding pg_mv_statistics row.  Needed?
  stvrelid    -- same as starelid
  stvkeys    -- same as stakeys
#ifdef CATALOG_VARLEN
  stvkind    'd' or 'f' or 's', etc
  stvvalue    the bytea blob
#endif

I think that would be simpler, both conceptually and in terms of code.

The other angle to consider is planner-side: how does the planner gets
to the values?  I think as far as the planner goes, the first catalog
doesn't matter at all, because a statistics type that has been enabled
but not computed is not interesting at all; planner only cares about the
values in the second catalog (this is why I added stvkeys).  Currently
you're just caching a single pg_mv_statistics row in get_relation_info
(and only if any of the "built" flags is set), which is simple.  With my
proposed change, you'd need to keep multiple pg_mv_statistics_values
rows.

But maybe you already tried something like what I propose and there's a
reason not to do it?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Minor nitpicks:

Let me suggest to use get_attnum() in CreateStatistics instead of
SearchSysCacheAttName for each column.  Also, we use type AttrNumber for
attribute numbers rather than int16.  Finally in the same function you
have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.

May I suggest that compare_int16 be named attnum_cmp (just to be
consistent with other qsort comparators) and look like
    return *((const AttrNumber *) a) - *((const AttrNumber *) b);
instead of memcmp?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
Hi everyone,

thanks for the reviews! Attached is v23 of the patch series, addressing 
most of the points raised in the reviews.

A quick summary of the changes (I'll respond to the other threads for 
points that deserve a bit more detailed discussion):

0) Rebase to current master. The main culprit was the pesky logical 
replication patch committed a week ago, because SUBSCRIPTION and 
STATISTICS are right next to each other in gram.y, various switches etc.

1) Many typos, mentioned by all the reviewers.

2) I've added a short explanation (in alter_table.sgml) of how ALTER 
TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those 
are only dropped if there would be a single remaining column.

3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to 
make Petr happy ;-)

4) Added missing comments to get_statistics_oid, RelationGetMVStatList, 
update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was 
not used outside common.c, so I've made it static and removed the 
prototype from mvstats.h.

5) I've changed 'statistics does not exist' to 'statistics do not exist' 
on a number of places.

6) Removed XXX about checking for duplicates in CreateStatistics. I 
agree with Petr that we shouldn't do such checks, as we're not doing 
that for other objects (e.g. indexes).

7) I've moved the code loading statistics from get_relation_info 
into a new function get_relation_statistics, to get rid of the

   if (true)
   {
    ...
   }

block, which was there due to mimicking how index details are loaded 
without having hasindex-like flag. I like this better than merging the 
block into get_relation_info directly.

8) I've changed 'a statistics' to 'multivariate statistics' on a few 
places in sgml docs, to make it clear it's not referring to the 
'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by 
Ideriha Takeshi).

9) I've changed the link in README.dependencies to 
https://en.wikipedia.org/wiki/Functional_dependency as proposed by 
Ideriha Takeshi. I'm pretty sure the wiki page about database 
normalization, referenced by the original link, included a nice 
functional dependency example some time ago, but it seems to have 
changed and the new link is better.

But perhaps it's not a good idea to link to wikipedia, as the pages 
clearly change quite significantly?

10) The CREATE STATISTICS now reports a nice 'already exists' message, 
instead of the 'duplicate key', pointed out by Dilip.

11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for 
the array, just like the other structs.



On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote:
> dependencies.c:
>
>  dependency_dgree():
>
>   - The k is assumed larger than 1. I think assertion is required.
>
>   - "/* end of the preceding group */" seems to be better if it
>     is just after the "if (multi_sort.." currently just after it.
>
>   - The following comment seems mis-edited.
>     > * If there is a single are no contradicting rows, count the group
>     > * as supporting, otherwise contradicting.
>
>     maybe this would be like the following? The varialbe counting
>     the first "contradiction" is named "n_violations". This seems
>     somewhat confusing.
>
>     > * If there are no violating rows up to here, count the group
>     > * as supporting, otherwise contradicting.
>
>    - "/* first columns match, but the last one does not"
>      else if (multi_sort_compare_dims((k - 1), (k - 1), ...
>
>      The above comparison should use multi_sort_compare_dim, not
>      dims
>
>    - This function counts "n_contradicting_rows" but it is not
>      referenced. Anyway n_contradicting_rows = numrows -
>      n_supporing_rows so it and n_contradicting seem
>      unncecessary.
>

Yes, absolutely. This was clearly an unnecessary remnant of the original 
implementation, and I failed to clean it up after adopting Dean's idea 
of continuous dependency degree.

I've also reworked the method a bit, moving handling of the last group 
into the main loop (instead of doing that separately right after the 
loop, which I think was a bit ugly anyway). Can you check if you're 
happy with the code & comments now?
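
In case it helps to have the idea in front of us, here is a tiny
standalone model of that counting: rows sorted by (a, b), one group per
value of "a", a group counted as supporting only if "b" never changes
within it, and the last group handled inside the main loop (as described
above). The structs and data are made up for the example - this is an
illustration of the approach, not the actual code from dependencies.c:

    #include <stdio.h>

    typedef struct
    {
        int a;
        int b;
    } Row;

    /* degree of the functional dependency (a => b) on sorted rows */
    static double
    dependency_degree_model(const Row *rows, int nrows)
    {
        int     n_supporting_rows = 0;
        int     group_start = 0;
        int     violated = 0;
        int     i;

        for (i = 1; i <= nrows; i++)
        {
            /* end of the current group (new "a" value, or no more rows)? */
            if (i == nrows || rows[i].a != rows[group_start].a)
            {
                if (!violated)
                    n_supporting_rows += (i - group_start);
                group_start = i;
                violated = 0;
            }
            else if (rows[i].b != rows[group_start].b)
                violated = 1;   /* same "a", but a different "b" */
        }

        return (double) n_supporting_rows / nrows;
    }

    int
    main(void)
    {
        /* sorted by (a, b); the a=2 group violates the dependency */
        Row rows[] = {{1, 10}, {1, 10}, {2, 20}, {2, 21}, {3, 30}};

        printf("degree of (a => b): %f\n", dependency_degree_model(rows, 5));
        return 0;
    }

For this sample the degree comes out as 3/5 = 0.6, because the two rows
of the a=2 group are counted as contradicting.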

>
>  mvstats.h:
>
>    - struct MVDependencyData/ MVDependenciesData
>
>      The varialbe length member at the last of the structs should
>      be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.
>

Yes, fixed. The other structures already used that macro, but I failed 
to notice MVDependencyData/ MVDependenciesData need that fix too.

 >
>    - I'm not sure how much it impacts performance, but some
>      struct members seems to have a bit too wide types. For
>      example, MVDepedenciesData.type is of int32 but it can have
>      only '1' for now and it won't be two-digits. Also ndeps
>      cannot be so large.
>

I doubt the impact on performance is measurable, particularly for the 
global fields (e.g. nbuckets is tiny compared to the space needed for 
the buckets themselves).

But I think you're right we shouldn't use fields wider than actually 
needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would 
be just fine). It's not just a matter of performance, but also a way to 
document expected values etc.

I'll go through the fields and use smaller data types where appropriate.

>
> general:
>   This patch uses int16 as the type of attrubute number but it
>   might be better to use AttrNumber for the purpose.
>   (Specifically it seems defined as the type for an attribute
>    index but also used as the varialbe for number of attributes)
>

Agreed. Will check with the struct members.

>
> Sorry for the random comment in advance. I'll learn this further.
>

Thanks for the review!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/26/2017 10:43 AM, Dilip Kumar wrote:
>
> histograms
> --------------
> + if (matches[i] == MVSTATS_MATCH_FULL)
> + s += mvhist->buckets[i]->ntuples;
> + else if (matches[i] == MVSTATS_MATCH_PARTIAL)
> + s += 0.5 * mvhist->buckets[i]->ntuples;
>
> Wouldn't it be better to take some percentage of the bucket, based
> on the number of distinct elements, for partially matching buckets?
>

I don't think so, for the same reason why ineq_histogram_selectivity() 
in selfuncs.c uses
    binfrac = 0.5;

for partial bucket matches - it provides minimum average error. Even if 
we knew the number of distinct items in the bucket, we have no idea what 
the distribution within the bucket looks like. Maybe 99% of the bucket 
are covered by a single distinct value, maybe all the items are squashed 
on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the 
bucket - we only know the number of distinct items in the sample, and 
only while building the histogram. I don't think it makes much sense to 
estimate the number of distinct items in a bucket, because the buckets 
contain only very few rows so the estimates would be wildly inaccurate.

>
> +static int
> +update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
> +  int2vector *stakeys,
> +  MVSerializedHistogram mvhist,
> +  int nmatches, char *matches,
> +  bool is_or)
> +{
> + int i;
>
> For each clause we are processing all the buckets, can't we use some
> data structure which can make multi-dimensions information searching
> faster.>

No, we're not processing all buckets for each clause. We're only 
processing buckets that were not "ruled out" by preceding clauses. 
That's the whole point of the bitmap.

For example for condition (a=1) AND (b=2), the code will first evaluate 
(a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was 
evaluated as true. Similarly for OR clauses.
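
In case a concrete (if simplified) picture helps, below is a toy,
self-contained model of that pass. It uses made-up structs, two range
clauses ("column <= value") instead of the equalities above, and AND
semantics only, but it shows both things at once: each clause skips the
buckets already ruled out by the previous clauses, and partially matched
buckets are counted with the 0.5 weight discussed above. It's only an
illustration of the idea, not the patch's update_match_bitmap_histogram()
code:

    #include <stdio.h>

    #define MATCH_NONE     0        /* bucket ruled out */
    #define MATCH_PARTIAL  1        /* bucket overlaps the clause */
    #define MATCH_FULL     2        /* bucket entirely satisfies it */

    typedef struct
    {
        double  min[2];             /* per-dimension lower bounds */
        double  max[2];             /* per-dimension upper bounds */
        double  ntuples;            /* tuples represented by the bucket */
    } Bucket;

    typedef struct
    {
        int     dim;                /* column referenced by the clause */
        double  value;              /* clause is "column <= value" */
    } Clause;

    int
    main(void)
    {
        /* four buckets covering [0,20] x [0,20], 250 tuples each */
        Bucket  buckets[4] = {
            {{0, 0},   {10, 10}, 250},
            {{0, 10},  {10, 20}, 250},
            {{10, 0},  {20, 10}, 250},
            {{10, 10}, {20, 20}, 250}
        };
        Clause  clauses[2] = {{0, 10.0}, {1, 5.0}}; /* a <= 10 AND b <= 5 */
        char    matches[4];
        double  s = 0.0, total = 0.0;
        int     i, c;

        /* AND: start with "full match" and let each clause downgrade it */
        for (i = 0; i < 4; i++)
            matches[i] = MATCH_FULL;

        for (c = 0; c < 2; c++)
        {
            for (i = 0; i < 4; i++)
            {
                char    m;

                if (matches[i] == MATCH_NONE)
                    continue;       /* ruled out by preceding clauses */

                if (buckets[i].max[clauses[c].dim] <= clauses[c].value)
                    m = MATCH_FULL;
                else if (buckets[i].min[clauses[c].dim] > clauses[c].value)
                    m = MATCH_NONE;
                else
                    m = MATCH_PARTIAL;

                /* AND semantics: keep the weaker of the two levels */
                if (m < matches[i])
                    matches[i] = m;
            }
        }

        /* full buckets count entirely, partial buckets with weight 0.5 */
        for (i = 0; i < 4; i++)
        {
            total += buckets[i].ntuples;
            if (matches[i] == MATCH_FULL)
                s += buckets[i].ntuples;
            else if (matches[i] == MATCH_PARTIAL)
                s += 0.5 * buckets[i].ntuples;
        }

        printf("estimated selectivity = %f\n", s / total);
        return 0;
    }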
>
> Something like HTree, RTree, Maybe storing histogram in these formats
> will be difficult?
>

Maybe, but I don't want to do that in the first version. I'm not opposed 
to doing that in the future, if we find out the v1 histograms are not 
efficient (I don't think we will, based on tests I did while working on 
the patch). Support for other histogram implementations is pretty much 
why there is 'type' field in the struct.

For now I think we should stick with the simple implementation.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
Hello,

On 01/26/2017 10:03 AM, Ideriha, Takeshi wrote:
>
> Though I pointed out these typos and so on,
> I believe this feedback is of lower priority than the source code itself.
>
> So please work on my feedback if you have time.
>

I think getting the comments (and docs in general) right is just as 
important as the code. So thank you for your review!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/30/2017 05:55 PM, Alvaro Herrera wrote:
> Minor nitpicks:
>
> Let me suggest to use get_attnum() in CreateStatistics instead of
> SearchSysCacheAttName for each column.  Also, we use type AttrNumber for
> attribute numbers rather than int16.  Finally in the same function you
> have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
> ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.
>
> May I suggest that compare_int16 be named attnum_cmp (just to be
> consistent with other qsort comparators) and look like
>     return *((const AttrNumber *) a) - *((const AttrNumber *) b);
> instead of memcmp?
>

Yes, I think this is pretty much what Kyotaro-san pointed out in his 
review. I'll go through the patch and make sure the correct data types 
are used.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/30/2017 05:12 PM, Alvaro Herrera wrote:
>
> Hmm.  So we have a catalog pg_mv_statistics which stores two things:
> 1. the configuration regarding mvstats that have been requested by user
>    via CREATE/ALTER STATISTICS
> 2. the actual values captured from the above, via ANALYZE
>
> I think this conflates two things that really are separate, given their
> different timings and usage patterns.  This decision is causing the
> catalog to have columns enabled/built flags for each set of stats
> requested, which looks a bit odd.  In particular, the fact that you have
> to heap_update the catalog in order to add more stuff as it's built
> looks inconvenient.
>
> Have you thought about having the "requested" bits be separate from the
> actual computed values?  Something like
>
> pg_mv_statistics
>   starelid
>   staname
>   stanamespace
>   staowner     -- all the above as currently
>   staenabled    array of "char" {d,f,s}
>   stakeys
> // no CATALOG_VARLEN here
>
> where each char in the staenabled array has a #define and indicates one
> type, "ndistinct", "functional dep", "selectivity" etc.
>
> The actual values computed by ANALYZE would live in a catalog like:
>
> pg_mv_statistics_values
>   stvstaid    -- OID of the corresponding pg_mv_statistics row.  Needed?

Definitely needed. How else would you know which MCV list and histogram 
belong together? This works just like in pg_statistic - when both MCV 
and histograms are enabled for the statistic, we first build MCV list, 
then histogram on remaining rows. So we need to pair them.

>   stvrelid    -- same as starelid
>   stvkeys    -- same as stakeys
> #ifdef CATALOG_VARLEN
>   stvkind    'd' or 'f' or 's', etc
>   stvvalue    the bytea blob
> #endif
>
> I think that would be simpler, both conceptually and in terms of code.

I think the main issue here is that it throws away the special data 
types (pg_histogram, pg_mcv, pg_ndistinct, pg_dependencies), which I 
think is a neat idea and would like to keep it. This would throw that 
away, making everything bytea again. I don't like that.

>
> The other angle to consider is planner-side: how does the planner gets
> to the values?  I think as far as the planner goes, the first catalog
> doesn't matter at all, because a statistics type that has been enabled
> but not computed is not interesting at all; planner only cares about the
> values in the second catalog (this is why I added stvkeys).  Currently
> you're just caching a single pg_mv_statistics row in get_relation_info
> (and only if any of the "built" flags is set), which is simple.  With my
> proposed change, you'd need to keep multiple pg_mv_statistics_values
> rows.
>
> But maybe you already tried something like what I propose and there's a
> reason not to do it?
>

Honestly, I don't see how this improves the situation. We still need to 
cache data for exactly one catalog, so how is that simpler?

The way I see it, it actually makes things more complicated, because now 
we have two catalogs to manage instead of one (e.g. when doing DROP 
STATISTICS, or after ALTER TABLE ... DROP COLUMN).

The 'built' flags may be easily replaced with a check if the bytea-like 
columns are NULL, and the 'enabled' columns may be replaced by the array 
of char, just like you proposed.

That'd give us a single catalog looking like this:

pg_mv_statistics
  starelid
  staname
  stanamespace
  staowner      -- all the above as currently
  staenabled    array of "char" {d,f,s}
  stakeys
  stadeps  (dependencies)
  standist (ndistinct coefficients)
  stamcv   (MCV list)
  stahist  (histogram)

Which is probably a better / simpler structure than the current one.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

> The 'built' flags may be easily replaced with a check if the bytea-like
> columns are NULL, and the 'enabled' columns may be replaced by the array of
> char, just like you proposed.
> 
> That'd give us a single catalog looking like this:
> 
> pg_mv_statistics
>   starelid
>   staname
>   stanamespace
>   staowner      -- all the above as currently
>   staenabled    array of "char" {d,f,s}
>   stakeys
>   stadeps  (dependencies)
>   standist (ndistinct coefficients)
>   stamcv   (MCV list)
>   stahist  (histogram)
> 
> Which is probably a better / simpler structure than the current one.

Looks good to me.  I don't think we need to keep the names very short --
I would propose "standistinct", "stahistogram", "stadependencies".

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> The 'built' flags may be easily replaced with a check if the bytea-like
>> columns are NULL, and the 'enabled' columns may be replaced by the array of
>> char, just like you proposed.
>>
>> That'd give us a single catalog looking like this:
>>
>> pg_mv_statistics
>>   starelid
>>   staname
>>   stanamespace
>>   staowner      -- all the above as currently
>>   staenabled    array of "char" {d,f,s}
>>   stakeys
>>   stadeps  (dependencies)
>>   standist (ndistinct coefficients)
>>   stamcv   (MCV list)
>>   stahist  (histogram)
>>
>> Which is probably a better / simpler structure than the current one.
>
> Looks good to me.  I don't think we need to keep the names very short --
> I would propose "standistinct", "stahistogram", "stadependencies".
>

Yeah, I got annoyed by the short names too.

This however reminds me that perhaps pg_mv_statistic is not the best 
name. I know others proposed pg_statistic_ext (and pg_stats_ext), and 
while I wasn't a big fan initially, I think it's a better name. People 
generally don't know what 'multivariate' means, while 'extended' is 
better known (e.g. because Oracle uses it for similar stuff).

So I think I'll switch to that name too.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 6:57 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> This however reminds me that perhaps pg_mv_statistic is not the best name. I
> know others proposed pg_statistic_ext (and pg_stats_ext), and while I wasn't
> a big fan initially, I think it's a better name. People generally don't know
> what 'multivariate' means, while 'extended' is better known (e.g. because
> Oracle uses it for similar stuff).
>
> So I think I'll switch to that name too.

I have moved this patch to the next CF, with Álvaro as reviewer.
--
Michael



Re: [HACKERS] multivariate statistics (v19)

From
Amit Langote
Date:
On 2017/01/31 6:57, Tomas Vondra wrote:
> On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
>> Looks good to me.  I don't think we need to keep the names very short --
>> I would propose "standistinct", "stahistogram", "stadependencies".
>>
> 
> Yeah, I got annoyed by the short names too.
> 
> This however reminds me that perhaps pg_mv_statistic is not the best name.
> I know others proposed pg_statistic_ext (and pg_stats_ext), and while I
> wasn't a big fan initially, I think it's a better name. People generally
> don't know what 'multivariate' means, while 'extended' is better known
> (e.g. because Oracle uses it for similar stuff).
> 
> So I think I'll switch to that name too.

+1 to pg_statistics_ext.  Maybe, even pg_statistics_extended, however
being that verbose may not be warranted.

Thanks,
Amit






Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 01/31/2017 07:52 AM, Amit Langote wrote:
> On 2017/01/31 6:57, Tomas Vondra wrote:
>> On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
>>> Looks good to me.  I don't think we need to keep the names very short --
>>> I would propose "standistinct", "stahistogram", "stadependencies".
>>>
>>
>> Yeah, I got annoyed by the short names too.
>>
>> This however reminds me that perhaps pg_mv_statistic is not the best name.
>> I know others proposed pg_statistic_ext (and pg_stats_ext), and while I
>> wasn't a big fan initially, I think it's a better name. People generally
>> don't know what 'multivariate' means, while 'extended' is better known
>> (e.g. because Oracle uses it for similar stuff).
>>
>> So I think I'll switch to that name too.
>
> +1 to pg_statistics_ext. Maybe, even pg_statistics_extended, however
> being that verbose may not be warranted.
>

Yeah, I think pg_statistic_extended / pg_stats_extended seems fine.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Still looking at 0002.

pg_ndistinct_in disallows input, claiming that pg_node_tree does the
same thing.  But pg_node_tree does it for security reasons: you could
crash the backend if you supplied a malicious value.  I don't think that
applies to pg_ndistinct_in.  Perhaps it will be useful to inject fake
stats at some point, so why not allow it?  It shouldn't be complicated
(though it does require writing some additional code, so perhaps that's
one reason we don't want to allow input of these values).

The comment on top of pg_ndistinct_out is missing the "_out"; also it
talks about histograms, which is not what this is about.

In the same function, a trivial point you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context
-- just PG_RETURN_CSTRING(str.data) and forget about "ret".  Saves you
one line.

Nearby, some auxiliary functions such as n_choose_k and num_combinations
are not documented.  What is it that they do?  I'd move these to the end
of the file, keeping the important entry points at the top of the file.

I see this patch has a estimate_ndistinct() which claims to be a re-
implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does.  I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff.  (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h.  Also, it must not include
postgres.h, and it probably doesn't need most of the #includes it has;
those are better put into whatever includes it.  It definitely needs a
guarding #ifdef MVSTATS_H around its whole content too.  An include file
is not just a way to avoid #includes in other files; it is supposed to
be a minimally invasive way of exporting the structs and functions
implemented in some file into other files.  So it must be kept minimal.
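
i.e. the usual include-guard pattern, roughly:

    #ifndef MVSTATS_H
    #define MVSTATS_H

    /* exported struct definitions and function prototypes go here */

    #endif   /* MVSTATS_H */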

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea?  I think
it's odd that outputting it as bytea yields something completely
different than as text.  (The bytea is not human readable and cannot be
used for future input, so what is the point?)


In another subthread you seem to have surrendered to the opinion that
the new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put on it.  However, given
its schema, with a "starelid / stakeys", is it sensible to think that
we're going to get anything other than something that involves multiple
variables?  Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema.  Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
relid only?  Are you thinking that in the future you'd change starelid
into an oidvector column?

The comment in gram.y about the CREATE STATISTICS is at odds with what
is actually allowed by the grammar.

I think the name of a statistics is only useful to DROP/ALTER it, right?
I wonder why it's useful that statistics belongs in a schema.  Perhaps
it should be a global object?  I suppose the name collisions would
become bothersome if you have many mvstats.  

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:

On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> Still looking at 0002.
> 
> pg_ndistinct_in disallows input, claiming that pg_node_tree does the 
> same thing. But pg_node_tree does it for security reasons: you could 
> crash the backend if you supplied a malicious value. I don't think
> that applies to pg_ndistinct_in. Perhaps it will be useful to inject
> fake stats at some point, so why not allow it? It shouldn't be
> complicated (though it does require writing some additional code, so
> perhaps that's one reason we don't want to allow input of these
> values).
>

Yes, I haven't written the code, and I'm not sure it's a very practical 
way to inject custom statistics. But if we decide to allow that in the 
future, we can probably add the code.

There's a subtle difference between pg_node_tree and the data types for 
statistics - pg_node_tree stores the value as a string (matching the 
nodeToString output), so the _in function is fairly simple. Of course, 
stringToNode() assumes safe input, which is why the input is disabled.

OTOH the statistics are stored in an optimized binary format, allowing 
to use the value directly (without having to do expensive parsing etc).

I was thinking that the easiest way to add support for _in would be to 
add a bunch of Nodes for the statistics, along with in/out functions, 
but keeping the internal binary representation. But that'll be tricky to 
do in a safe way - even if those nodes are coded in a very defensive 
ways, I'd bet there'll be ways to inject unsafe nodes.

So I'm OK with not having the _in for now. If needed, it's possible to 
construct the statistics as a bytea using a bit of C code. That's at 
least obviously unsafe, as anything written in C, touching the memory.

> The comment on top of pg_ndistinct_out is missing the "_out"; also it
> talks about histograms, which is not what this is about.
> 

OK, will fix.

> In the same function, a trivial point you don't need to pstrdup() the
> .data out of a stringinfo; it's already palloc'ed in the right context
> -- just PG_RETURN_CSTRING(str.data) and forget about "ret".  Saves you
> one line.
> 

Will fix too.

> Nearby, some auxiliary functions such as n_choose_k and
> num_combinations are not documented. What is it that they do? I'd
> move these to the end of the file, keeping the important entry points
> at the top of the file.

I'd say n-choose-k is pretty widely known term from combinatorics. The 
comment would essentially say just 'this is n-choose-k' which seems 
rather pointless. So as much as I dislike the self-documenting code, 
this actually seems like a good case of that.

> I see this patch has a estimate_ndistinct() which claims to be a re-
> implementation of code already in analyze.c, but it is actually a lot
> simpler than what analyze.c does.  I've been wondering if it'd be a good
> idea to use some of this code so that some routines are moved out of
> analyze.c; good implementations of statistics-related functions would
> live in src/backend/statistics/ where they can be used both by analyze.c
> and your new mvstats stuff.  (More generally I am beginning to wonder if
> the new directory should be just src/backend/statistics.)
> 

I'll look into that. I have to check if I ignored some assumptions or 
corner cases the analyze.c deals with.

> common.h does not belong in src/backend/utils/mvstats; IMO it should be
> called src/include/utils/mvstat.h.  Also, it must not include
> postgres.h, and it probably doesn't need most of the #includes it has;
> those are better put into whatever includes it.  It definitely needs a
> guarding #ifdef MVSTATS_H around its whole content too.  An include file
> is not just a way to avoid #includes in other files; it is supposed to
> be a minimally invasive way of exporting the structs and functions
> implemented in some file into other files.  So it must be kept minimal.
> 

Will do.

> psql/tab-complete.c compares the wrong version number (9.6 instead of
> 10).
> 
> Is it important to have a cast from pg_ndistinct to bytea?  I think
> it's odd that outputting it as bytea yields something completely
> different than as text.  (The bytea is not human readable and cannot be
> used for future input, so what is the point?)
> 

Because it internally is a bytea, and it seems useful to have the 
ability to inspect the bytea value directly (e.g. to see the length of 
the bytea and not the string output).

> 
> In another subthread you seem to have surrendered to the opinion that
> the new catalog should be called pg_statistics_ext, just in case in the
> future we come up with additional things to put on it.  However, given
> its schema, with a "starelid / stakeys", is it sensible to think that
> we're going to get anything other than something that involves multiple
> variables?  Maybe it should just be "pg_statistics_multivar" and if
> something else comes along we create another catalog with an appropriate
> schema.  Heck, how does this catalog serve the purpose of cross-table
> statistics in the first place, given that it has room to record a single
> relid only?  Are you thinking that in the future you'd change starelid
> into an oidvector column?
> 

Yes, I think the starelid will turn into OID vector. The reason why I 
haven't done that in the current version of the catalog is to keep it 
simple. Supporting join statistics will require tracking OID for each 
attribute, because those will be from multiple relations. It'll also 
require tracking "join condition" and so on.

We've designed the CREATE STATISTICS syntax to support this extension, 
but I'm strongly against complicating the catalogs at this point.

> The comment in gram.y about the CREATE STATISTICS is at odds with what
> is actually allowed by the grammar.
> 

Which comment?

> I think the name of a statistics is only useful to DROP/ALTER it, right?
> I wonder why it's useful that statistics belongs in a schema.  Perhaps
> it should be a global object?  I suppose the name collisions would
> become bothersome if you have many mvstats.
> 

I think it shouldn't be a global object. I consider them to be a part of 
a schema (just like indexes, for example). Imagine you have a 
multi-tenant database, using exactly the same (tables/indexes) 
schema, but kept in different schemas. Why shouldn't it be possible to 
also use the same set of statistics for each tenant?


T.



Re: [HACKERS] multivariate statistics (v19)

From
Robert Haas
Date:
On Thu, Feb 2, 2017 at 3:59 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> There's a subtle difference between pg_node_tree and the data types for
> statistics - pg_node_tree stores the value as a string (matching the
> nodeToString output), so the _in function is fairly simple. Of course,
> stringToNode() assumes safe input, which is why the input is disabled.
>
> OTOH the statistics are stored in an optimized binary format, allowing to
> use the value directly (without having to do expensive parsing etc).
>
> I was thinking that the easiest way to add support for _in would be to add a
> bunch of Nodes for the statistics, along with in/out functions, but keeping
> the internal binary representation. But that'll be tricky to do in a safe
> way - even if those nodes are coded in a very defensive ways, I'd bet
> there'll be ways to inject unsafe nodes.
>
> So I'm OK with not having the _in for now. If needed, it's possible to
> construct the statistics as a bytea using a bit of C code. That's at least
> obviously unsafe, as anything written in C, touching the memory.

Since these data types are already special-purpose, I don't really see
why it would be desirable to entangle them with the existing code for
serializing and deserializing Nodes.  Whether or not it's absolutely
necessary for these types to have input functions, it seems at least
possible that it would be useful, and it becomes much less likely that
we can make that work if it's piggybacking on stringToNode().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Tomas Vondra wrote:
> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:

> > Nearby, some auxiliary functions such as n_choose_k and
> > num_combinations are not documented. What is it that they do? I'd
> > move these to the end of the file, keeping the important entry points
> > at the top of the file.
> 
> I'd say n-choose-k is pretty widely known term from combinatorics. The
> comment would essentially say just 'this is n-choose-k' which seems rather
> pointless. So as much as I dislike the self-documenting code, this actually
> seems like a good case of that.

Actually, we do have such comments all over the place.  I knew this as
"n sobre k", so the english name doesn't immediately ring a bell with me
until I look it up; I think the function comment could just say
"n_choose_k -- this function returns the binomial coefficient".

> > I see this patch has a estimate_ndistinct() which claims to be a re-
> > implementation of code already in analyze.c, but it is actually a lot
> > simpler than what analyze.c does.  I've been wondering if it'd be a good
> > idea to use some of this code so that some routines are moved out of
> > analyze.c; good implementations of statistics-related functions would
> > live in src/backend/statistics/ where they can be used both by analyze.c
> > and your new mvstats stuff.  (More generally I am beginning to wonder if
> > the new directory should be just src/backend/statistics.)
> 
> I'll look into that. I have to check if I ignored some assumptions or corner
> cases the analyze.c deals with.

Maybe it's not terribly important to refactor analyze.c from the get go,
but let's give the subdir a more general name.  Hence my vote for having
the subdir be "statistics" instead of "mvstats".

> > In another subthread you seem to have surrendered to the opinion that
> > the new catalog should be called pg_statistics_ext, just in case in the
> > future we come up with additional things to put on it.  However, given
> > its schema, with a "starelid / stakeys", is it sensible to think that
> > we're going to get anything other than something that involves multiple
> > variables?  Maybe it should just be "pg_statistics_multivar" and if
> > something else comes along we create another catalog with an appropriate
> > schema.  Heck, how does this catalog serve the purpose of cross-table
> > statistics in the first place, given that it has room to record a single
> > relid only?  Are you thinking that in the future you'd change starelid
> > into an oidvector column?
> 
> Yes, I think the starelid will turn into OID vector. The reason why I
> haven't done that in the current version of the catalog is to keep it
> simple.

OK -- as long as we know what the way forward is, I'm good.  Still, my
main point was that even if we have multiple rels, this catalog will be
about having multivariate statistics, and not different kinds of
statistical data.  I would keep pg_mv_statistics, really.

> > The comment in gram.y about the CREATE STATISTICS is at odds with what
> > is actually allowed by the grammar.
> 
> Which comment?

This one:

 *              CREATE STATISTICS stats_name ON relname (columns) WITH (options)

the production actually says:

    CREATE STATISTICS any_name ON '(' columnList ')' FROM qualified_name

> > I think the name of a statistics is only useful to DROP/ALTER it, right?
> > I wonder why it's useful that statistics belongs in a schema.  Perhaps
> > it should be a global object?  I suppose the name collisions would
> > become bothersome if you have many mvstats.
> 
> I think it shouldn't be a global object. I consider them to be a part of a
> schema (just like indexes, for example). Imagine you have a multi-tenant
> database, with each tenant using exactly the same (tables/indexes) schema,
> but kept in different schemas. Why shouldn't it be possible to also use the
> same set of statistics for each tenant?

True.  Suggestion withdrawn.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Looking at 0003, I notice that gram.y is changed to add a WITH ( .. )
clause.  If it's not specified, an error is raised.  If you create
stats with (ndistinct) then you can't alter it later to add
"dependencies" or whatever; unless I misunderstand, you have to drop the
statistics and create another one.  Probably in a forthcoming patch we
should have ALTER support to add a stats type.

Also, why isn't the default to build everything, rather than nothing?

BTW, almost everything in the backend could be inside "utils/", so let's
not do that -- let's just create src/backend/statistics/ for all your
code.

Here a few notes while reading README.dependencies -- some typos, two
questions.

diff --git a/src/backend/utils/mvstats/README.dependencies b/src/backend/utils/mvstats/README.dependencies
index 908f094..7f3ed3d 100644
--- a/src/backend/utils/mvstats/README.dependencies
+++ b/src/backend/utils/mvstats/README.dependencies
@@ -36,7 +36,7 @@ design choice to model the dataset in denormalized way, either because of
 performance or to make querying easier.
 
-soft dependencies
+Soft dependencies
 -----------------
 Real-world data sets often contain data errors, either because of data entry
@@ -48,7 +48,7 @@ rendering the approach mostly useless even for slightly noisy data sets, or
 result in sudden changes in behavior depending on minor differences between
 samples provided to ANALYZE.
 
-For this reason the statistics implementes "soft" functional dependencies,
+For this reason the statistics implements "soft" functional dependencies,
 associating each functional dependency with a degree of validity (a number
 between 0 and 1). This degree is then used to combine selectivities in a
 smooth manner.
@@ -75,6 +75,7 @@ The algorithm also requires a minimum size of the group to consider it
 consistent (currently 3 rows in the sample). Small groups make it less likely
 to break the consistency.
 
+## What is it that we store in the catalog?
 Clause reduction (planner/optimizer)
 ------------------------------------
@@ -95,12 +96,12 @@ example for (a,b,c) we first use (a,b=>c) to break the computation into
 and then apply (a=>b) the same way on P(a=?,b=?).
 
-Consistecy of clauses
+Consistency of clauses
 ---------------------
 Functional dependencies only express general dependencies between columns,
 without referencing particular values. This assumes that the equality clauses
 
-are in fact consistent with the functinal dependency, i.e. that given a
+are in fact consistent with the functional dependency, i.e. that given a
 dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
 If that's not the case, the clauses are "inconsistent" with the functional
 dependency and the result will be over-estimation.
 
@@ -111,6 +112,7 @@ set will be empty, but we'll estimate the selectivity using the ZIP condition.
 In this case the default estimation based on AVIA principle happens to work
 better, but mostly by chance.
 
+## what is AVIA principle?
 This issue is the price for the simplicity of functional dependencies. If the
 application frequently constructs queries with clauses inconsistent with
 

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Alvaro Herrera
Date:
Still about 0003.  dependencies.c comment at the top of the file should
contain some details about what it is implementing and a general
description of the algorithm and data structures.  As before, it's best
to have the main entry point build_mv_dependencies at the top, the other
public functions, keeping the internal routines at the bottom of the
file.  That eases code study for future readers.  (Minimizing number of
function prototypes is not a goal.)

What is MVSTAT_DEPS_TYPE_BASIC?  Is "functional dependencies" really
BASIC?  I wonder if it should be TYPE_FUNCTIONAL_DEPS or something.

As with pg_ndistinct_out, there's no need to pstrdup(str.data), as it's
already palloc'ed in the right context.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Dean Rasheed
Date:
On 6 February 2017 at 21:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Tomas Vondra wrote:
>> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
>
>> > Nearby, some auxiliary functions such as n_choose_k and
>> > num_combinations are not documented. What it is that they do? I'd
>> > move these at the end of the file, keeping the important entry points
>> > at the top of the file.
>>
>> I'd say n-choose-k is pretty widely known term from combinatorics. The
>> comment would essentially say just 'this is n-choose-k' which seems rather
>> pointless. So as much as I dislike the self-documenting code, this actually
>> seems like a good case of that.
>
> Actually, we do have such comments all over the place.  I knew this as
> "n sobre k", so the english name doesn't immediately ring a bell with me
> until I look it up; I think the function comment could just say
> "n_choose_k -- this function returns the binomial coefficient".
>

One of the things you have to watch out for when writing code to
compute binomial coefficients is integer overflow, since the numerator
and denominator get large very quickly. For example, the current code
will overflow for n=13, k=12, which really isn't that large.

This can be avoided by computing the product in reverse and using a
larger datatype like a 64-bit integer to store a single intermediate
result. The point about multiplying the terms in reverse is that it
guarantees that each intermediate result is an exact integer (a
smaller binomial coefficient), so there is no need to track separate
numerators and denominators, and you avoid huge intermediate
factorials. Here's what that looks like in pseudo-code:

binomial(int n, int k):
    # Save computational effort by using the symmetry of the binomial
    # coefficients
    k = min(k, n-k);

    # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
    # starting from binomial(n-k, 0) = 1, and computing the sequence
    # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
    #
    # Note that each intermediate result is an exact integer.
    int64 result = 1;
    for (int i = 1; i <= k; i++)
    {
        result = (result * (n-k+i)) / i;
        if (result > INT_MAX) Raise overflow error
    }
    return (int) result;
 


Note also that I think num_combinations(n) is just an expensive way of
calculating 2^n - n - 1.
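
To make both points concrete, here is a small compilable C sketch (just an
illustration of the approach, not code from the patch; the function names,
the plain error return, and the num_combinations interpretation are mine):

    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    /* binomial coefficient, erroring out if the result exceeds INT_MAX */
    static int
    n_choose_k(int n, int k)
    {
        int64_t result = 1;
        int     i;

        /* use the symmetry binomial(n, k) == binomial(n, n-k) */
        if (k > n - k)
            k = n - k;

        /* each intermediate value is itself a (smaller) binomial coefficient */
        for (i = 1; i <= k; i++)
        {
            result = (result * (n - k + i)) / i;
            if (result > INT_MAX)
            {
                fprintf(stderr, "binomial coefficient overflow\n");
                return -1;      /* the backend would raise an error instead */
            }
        }
        return (int) result;
    }

    /*
     * Number of subsets of at least two of n columns, i.e. 2^n - n - 1
     * (assuming that's what num_combinations() is meant to return).
     */
    static int
    num_combinations(int n)
    {
        return (1 << n) - n - 1;
    }

    int
    main(void)
    {
        printf("%d\n", n_choose_k(13, 12));  /* 13, overflows the naive version */
        printf("%d\n", n_choose_k(100, 99)); /* 100, needs the k = min(k, n-k) step */
        printf("%d\n", num_combinations(8)); /* 247 */
        return 0;
    }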

Regards,
Dean



Re: [HACKERS] multivariate statistics (v19)

From
David Fetter
Date:
On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Tomas Vondra wrote:
> >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> >
> >> > Nearby, some auxiliary functions such as n_choose_k and
> >> > num_combinations are not documented. What it is that they do? I'd
> >> > move these at the end of the file, keeping the important entry points
> >> > at the top of the file.
> >>
> >> I'd say n-choose-k is pretty widely known term from combinatorics. The
> >> comment would essentially say just 'this is n-choose-k' which seems rather
> >> pointless. So as much as I dislike the self-documenting code, this actually
> >> seems like a good case of that.
> >
> > Actually, we do have such comments all over the place.  I knew this as
> > "n sobre k", so the english name doesn't immediately ring a bell with me
> > until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
> 
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
> 
> This can be avoided by computing the product in reverse and using a
> larger datatype like a 64-bit integer to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a
> smaller binomial coefficient), so there is no need to track separate
> numerators and denominators, and you avoid huge intermediate
> factorials. Here's what that looks like in psuedo-code:
> 
> binomial(int n, int k):
>     # Save computational effort by using the symmetry of the binomial
>     # coefficients
>     k = min(k, n-k);
> 
>     # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
>     # starting from binomial(n-k, 0) = 1, and computing the sequence
>     # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
>     #
>     # Note that each intermediate result is an exact integer.
>     int64 result = 1;
>     for (int i = 1; i <= k; i++)
>     {
>         result = (result * (n-k+i)) / i;
>         if (result > INT_MAX) Raise overflow error
>     }
>     return (int) result;
> 
> 
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.

Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {
    if (k > n) {
        return 0;
    }
    unsigned long long r = 1;
    for (unsigned long long d = 1; d <= k; ++d) {
        r *= n--;
        r /= d;
    }
    return r;
}

which greatly reduces the chance of overflow.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] multivariate statistics (v19)

From
Dean Rasheed
Date:
On 8 February 2017 at 16:09, David Fetter <david@fetter.org> wrote:
> Combinations are n!/(k! * (n-k)!), so computing those is more
> along the lines of:
>
> unsigned long long
> choose(unsigned long long n, unsigned long long k) {
>     if (k > n) {
>         return 0;
>     }
>     unsigned long long r = 1;
>     for (unsigned long long d = 1; d <= k; ++d) {
>         r *= n--;
>         r /= d;
>     }
>     return r;
> }
>
> which greatly reduces the chance of overflow.
>

Hmm, but that doesn't actually prevent overflows, since it can
overflow in the multiplication step, and there is no protection
against that.

In the algorithm I presented, the inputs and the intermediate result
are kept below INT_MAX, so the multiplication step cannot overflow the
64-bit integer, and it will only raise an overflow error if the actual
result won't fit in a 32-bit int. Actually a crucial part of that,
which I failed to mention previously, is the first step replacing k
with min(k, n-k). This is necessary for inputs like (100,99), which
should return 100, and which must be computed as 100 choose 1, not 100
choose 99, otherwise it will overflow internally before getting to the
final result.

Regards,
Dean



Re: [HACKERS] multivariate statistics (v19)

From
Tomas Vondra
Date:
On 02/08/2017 07:40 PM, Dean Rasheed wrote:
> On 8 February 2017 at 16:09, David Fetter <david@fetter.org> wrote:
>> Combinations are n!/(k! * (n-k)!), so computing those is more
>> along the lines of:
>>
>> unsigned long long
>> choose(unsigned long long n, unsigned long long k) {
>>     if (k > n) {
>>         return 0;
>>     }
>>     unsigned long long r = 1;
>>     for (unsigned long long d = 1; d <= k; ++d) {
>>         r *= n--;
>>         r /= d;
>>     }
>>     return r;
>> }
>>
>> which greatly reduces the chance of overflow.
>>
>
> Hmm, but that doesn't actually prevent overflows, since it can
> overflow in the multiplication step, and there is no protection
> against that.
>
> In the algorithm I presented, the inputs and the intermediate result
> are kept below INT_MAX, so the multiplication step cannot overflow the
> 64-bit integer, and it will only raise an overflow error if the actual
> result won't fit in a 32-bit int. Actually a crucial part of that,
> which I failed to mention previously, is the first step replacing k
> with min(k, n-k). This is necessary for inputs like (100,99), which
> should return 100, and which must be computed as 100 choose 1, not 100
> choose 99, otherwise it will overflow internally before getting to the
> final result.
>

Thanks for the feedback, I'll fix this. I've allowed myself to be a bit 
sloppy because the number of attributes in the statistics is currently 
limited to 8, so the overflows are currently not an issue. But it 
doesn't hurt to make it future-proof, in case we change that mostly 
artificial limit sometime in the future.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v19)

From
Dean Rasheed
Date:
On 11 February 2017 at 01:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> sloppy because the number of attributes in the statistics is currently
> limited to 8, so the overflows are currently not an issue. But it doesn't
> hurt to make it future-proof, in case we change that mostly artificial limit
> sometime in the future.
>

Ah right, so it can't overflow at present, but it's neater to have an
overflow-proof algorithm.

Thinking about the exactness of the division steps is quite
interesting. Actually, the order of the multiplying factors doesn't
matter as long as the divisors are in increasing order. So in both my
proposal:
    result = 1
    for (i = 1; i <= k; i++)
        result = (result * (n-k+i)) / i;

and David's proposal, which is equivalent but has the multiplying
factors in the opposite order, equivalent to:
    result = 1
    for (i = 1; i <= k; i++)
        result = (result * (n-i+1)) / i;

the divisions are exact at each step. The first time through the loop
it divides by 1 which is trivially exact. The second time it divides
by 2, having multiplied by 2 consecutive factors, one of which is
therefore guaranteed to be divisible by 2. The third time it divides
by 3, having multiplied by 3 consecutive factors, one of which is
therefore guaranteed to be divisible by 3, and so on.

My approach originally seemed more logical to me because of the way it
derives from the recurrence relation binomial(n, k) = binomial(n-1,
k-1) * n / k, but they both work fine as long as they have suitable
overflow checks.

It's also interesting that descriptions of this algorithm tend to talk
about setting k to min(k, n-k) at the start as an optimisation step,
as I did in fact, whereas it's actually more than that -- it helps
prevent unnecessary intermediate overflows when k > n/2. Of course,
that's not a worry for the current use of this function, but it's good
to have a robust algorithm.

Regards,
Dean



Re: [HACKERS] multivariate statistics (v19)

From
David Fetter
Date:
On Sun, Feb 12, 2017 at 10:35:04AM +0000, Dean Rasheed wrote:
> On 11 February 2017 at 01:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> > sloppy because the number of attributes in the statistics is currently
> > limited to 8, so the overflows are currently not an issue. But it doesn't
> > hurt to make it future-proof, in case we change that mostly artificial limit
> > sometime in the future.
> >
> 
> Ah right, so it can't overflow at present, but it's neater to have an
> overflow-proof algorithm.
> 
> Thinking about the exactness of the division steps is quite
> interesting. Actually, the order of the multiplying factors doesn't
> matter as long as the divisors are in increasing order. So in both my
> proposal:
> 
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-k+i)) / i;
> 
> and David's proposal, which is equivalent but has the multiplying
> factors in the opposite order, equivalent to:
> 
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-i+1)) / i;
> 
> the divisions are exact at each step. The first time through the loop
> it divides by 1 which is trivially exact. The second time it divides
> by 2, having multiplied by 2 consecutive factors, one of which is
> therefore guaranteed to be divisible by 2. The third time it divides
> by 3, having multiplied by 3 consecutive factors, one of which is
> therefore guaranteed to be divisible by 3, and so on.

Right.  You know you can use integer division, which makes sense as
permutations of discrete sets are always integers.

> My approach originally seemed more logical to me because of the way it
> derives from the recurrence relation binomial(n, k) = binomial(n-1,
> k-1) * n / k, but they both work fine as long as they have suitable
> overflow checks.

Right.  We could even cache those checks (sorry) based on data type
limits by architecture and OS if performance on those operations ever
matters that much.

> It's also interesting that descriptions of this algorithm tend to
> talk about setting k to min(k, n-k) at the start as an optimisation
> step, as I did in fact, whereas it's actually more than that -- it
> helps prevent unnecessary intermediate overflows when k > n/2. Of
> course, that's not a worry for the current use of this function, but
> it's good to have a robust algorithm.

Indeed. :)

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



[HACKERS] Multivariate statistics and expression indexes

From
Bruce Momjian
Date:
At the risk of asking a stupid question, we already have optimizer
statistics on expression indexes.  In what sense are we using this for
multi-variate statistics, and in what sense can't we?

FYI, I just wrote a blog post about expression index statistics:
http://momjian.us/main/blogs/pgblog/2017.html#February_20_2017

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Multivariate statistics and expression indexes

From
Tomas Vondra
Date:
On 02/21/2017 12:13 AM, Bruce Momjian wrote:
> At the risk of asking a stupid question, we already have optimizer
> statistics on expression indexes.  In what sense are we using this for
> multi-variate statistics, and in what sense can't we.
>

We're not using that at all, because those are really orthogonal 
features. Even with expression indexes, the statistics are per 
attribute, and the attributes are treated as independent.

There was a proposal to also allow creating statistics on expressions 
(without having to create an index), but that's not supported yet.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Multivariate statistics and expression indexes

From
Bruce Momjian
Date:
On Tue, Feb 21, 2017 at 01:27:53AM +0100, Tomas Vondra wrote:
> On 02/21/2017 12:13 AM, Bruce Momjian wrote:
> >At the risk of asking a stupid question, we already have optimizer
> >statistics on expression indexes.  In what sense are we using this for
> >multi-variate statistics, and in what sense can't we.
> >
> 
> We're not using that at all, because those are really orthogonal features.
> Even with expression indexes, the statistics are per attribute, and the
> attributes are treated as independent.
> 
> There was a proposal to also allow creating statistics on expressions
> (without having to create an index), but that's not supported yet.

OK, thanks.  I had to ask.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] multivariate statistics (v24)

From
Tomas Vondra
Date:
OK,

attached is v24 of the patch series, addressing most of the reported 
issues and comments (at least I believe so). The main changes are:

1) I've mostly abandoned the "multivariate" name in favor of "extended", 
particularly in places referring to stats stored in the pg_statistic_ext 
in general. "Multivariate" is now used only in places talking about 
particular types (e.g. multivariate histograms).

The "extended" name is more widely used for this type of statistics, and 
the assumption is that we'll also add other (non-multivariate) types of 
statistics - e.g. statistics on custom expressions, or some form of join 
statistics.

2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and 
pg_mv_stats view renamed to pg_stats_ext).

3) The structure of pg_statistic_ext was changed as proposed by Alvaro, 
i.e. the boolean flags were removed and instead we have just a single 
"char[]" column with list of enabled statistics.

4) I also got rid of the "mv" part in most variable/function/constant 
names, replacing it by "ext" or something similar. Also mvstats.h got 
renamed to stats.h.

5) Moved the files from src/backend/utils/mvstats to backend/statistics.

6) Fixed the n_choose_k() overflow issues by using the algorithm 
proposed by Dean. Also, use the simple formula for num_combinations().

7) I've tweaked data types for a few struct members (in stats.h). I've 
kept most of the uint32 fields at the top level though, because int16 
might not be large enough for large statistics and the overhead is 
minimal (compared to the space needed e.g. for histogram buckets).


The renames/changes were quite widespread, but I've done my best to fix 
all the comments and various other places.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v24)

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<a78ffb17-70e8-a55a-c10c-66ab575e88ed@2ndquadrant.com>
> OK,
> 
> attached is v24 of the patch series, addressing most of the reported
> issues and comments (at least I believe so). The main changes are:

Unfortunately, 0002 conflicts with the current master
(4461a9b). Could you rebase them or tell us the commit where this
patches stand on?

I only saw the patch files but have some comments.

> 1) I've mostly abandoned the "multivariate" name in favor of
> "extended", particularly in places referring to stats stored in the
> pg_statistic_ext in general. "Multivariate" is now used only in places
> talking about particular types (e.g. multivariate histograms).
> 
> The "extended" name is more widely used for this type of statistics,
> and the assumption is that we'll also add other (non-multivariate)
> types of statistics - e.g. statistics on custom expressions, or some
> form of join statistics.

In 0005, and 

@@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root,
     * If there are no such stats or not enough attributes, don't waste time
     * simply skip to estimation using the plain per-column stats.
     */
 
+    if (has_stats(stats, STATS_TYPE_MCV) &&
...
+            /* compute the multivariate stats */
+            s1 *= clauselist_ext_selectivity(root, mvclauses, stat);
====
@@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, Index relid,
 }
 
 /*
+ * estimate selectivity of clauses using multivariate statistic

Were these comments left unchanged by oversight, or on purpose? 0007 adds
very similar text.

> 2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and
> pg_mv_stats view renamed to pg_stats_ext).

FWIW, "extended statistic" would be abbreviated as
"ext_statistic" or "extended_stats". Why have you exchanged the
words?

> 3) The structure of pg_statistic_ext was changed as proposed by
> Alvaro, i.e. the boolean flags were removed and instead we have just a
> single "char[]" column with list of enabled statistics.
> 
> 4) I also got rid of the "mv" part in most variable/function/constant
> names, replacing it by "ext" or something similar. Also mvstats.h got
> renamed to stats.h.
> 
> 5) Moved the files from src/backend/utils/mvstats to
> backend/statistics.
> 
> 6) Fixed the n_choose_k() overflow issues by using the algorithm
> proposed by Dean. Also, use the simple formula for num_combinations().
> 
> 7) I've tweaked data types for a few struct members (in stats.h). I've
> kept most of the uint32 fields at the top level though, because int16
> might not be large enough for large statistics and the overhead is
> minimal (compared to the space needed e.g. for histogram buckets).

Some formulated proof or boundary value test cases might be
needed (to prevent future trouble). Or any defined behavior on
overflow of them might be enough. I believe all (or most) of
overflow-able data has such behavior.

> The renames/changes were quite widespread, but I've done my best to
> fix all the comments and various other places.
> 
> regards

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] multivariate statistics (v25)

From
Tomas Vondra
Date:
On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<a78ffb17-70e8-a55a-c10c-66ab575e88ed@2ndquadrant.com>
>> OK,
>>
>> attached is v24 of the patch series, addressing most of the reported
>> issues and comments (at least I believe so). The main changes are:
>
> Unfortunately, 0002 conflicts with the current master
> (4461a9b). Could you rebase them or tell us the commit where this
> patches stand on?
>

Attached is a rebased patch series, otherwise it's the same as v24.

FWIW it was based on 016c990834 from Feb 28, but apparently some recent 
patch caused a minor conflict.

> I only saw the patch files but have some comments.
>
>> 1) I've mostly abandoned the "multivariate" name in favor of
>> "extended", particularly in places referring to stats stored in the
>> pg_statistic_ext in general. "Multivariate" is now used only in places
>> talking about particular types (e.g. multivariate histograms).
>>
>> The "extended" name is more widely used for this type of statistics,
>> and the assumption is that we'll also add other (non-multivariate)
>> types of statistics - e.g. statistics on custom expressions, or some
>> for of join statistics.
>
> In 0005, and
>
> @@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root,
>       * If there are no such stats or not enough attributes, don't waste time
>       * simply skip to estimation using the plain per-column stats.
>       */
> +    if (has_stats(stats, STATS_TYPE_MCV) &&
> ...
> +            /* compute the multivariate stats */
> +            s1 *= clauselist_ext_selectivity(root, mvclauses, stat);
> ====
> @@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, Index relid,
>  }
>
>  /*
> + * estimate selectivity of clauses using multivariate statistic
>
> Were these comments left unchanged by oversight, or on purpose? 0007 adds
> very similar text.
>

Hmm, those comments should probably be changed to "extended".

>> 2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and
>> pg_mv_stats view renamed to pg_stats_ext).
>
> FWIW, "extended statistic" would be abbreviated as
> "ext_statistic" or "extended_stats". Why have you exchanged the
> words?
>

Because this way it's clear it's a version of pg_statistic, and it will 
be sorted right next to it.

>> 3) The structure of pg_statistic_ext was changed as proposed by
>> Alvaro, i.e. the boolean flags were removed and instead we have just a
>> single "char[]" column with list of enabled statistics.
>>
>> 4) I also got rid of the "mv" part in most variable/function/constant
>> names, replacing it by "ext" or something similar. Also mvstats.h got
>> renamed to stats.h.
>>
>> 5) Moved the files from src/backend/utils/mvstats to
>> backend/statistics.
>>
>> 6) Fixed the n_choose_k() overflow issues by using the algorithm
>> proposed by Dean. Also, use the simple formula for num_combinations().
>>
>> 7) I've tweaked data types for a few struct members (in stats.h). I've
>> kept most of the uint32 fields at the top level though, because int16
>> might not be large enough for large statistics and the overhead is
>> minimal (compared to the space needed e.g. for histogram buckets).
>
> Some formulated proof or boundary value test cases might be
> needed (to prevent future trouble). Or any defined behavior on
> overflow of them might be enough. I believe all (or most) of
> overflow-able data has such behavior.
>

That is probably a good idea and I plan to do that.

>> The renames/changes were quite widespread, but I've done my best to
>> fix all the comments and various other places.
>>
>> regards
>
> regards,
>

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v25)

From
Tomas Vondra
Date:
On 03/02/2017 03:52 PM, Tomas Vondra wrote:
> On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:
>> Hello,
>>
>> At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote in
>> <a78ffb17-70e8-a55a-c10c-66ab575e88ed@2ndquadrant.com>
>>> OK,
>>>
>>> attached is v24 of the patch series, addressing most of the reported
>>> issues and comments (at least I believe so). The main changes are:
>>
>> Unfortunately, 0002 conflicts with the current master
>> (4461a9b). Could you rebase them or tell us the commit where this
>> patches stand on?
>>
>
> Attached is a rebased patch series, otherwise it's the same as v24.
>

This time with the attachments ....

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v24)

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 8:35 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> attached is v24 of the patch series, addressing most of the reported issues
> and comments (at least I believe so). The main changes are:
>
> 1) I've mostly abandoned the "multivariate" name in favor of "extended",
> particularly in places referring to stats stored in the pg_statistic_ext in
> general. "Multivariate" is now used only in places talking about particular
> types (e.g. multivariate histograms).
>
> The "extended" name is more widely used for this type of statistics, and the
> assumption is that we'll also add other (non-multivariate) types of
> statistics - e.g. statistics on custom expressions, or some form of join
> statistics.

Oh, I like that.  I found it hard to wrap my head around what
"multivariate" was supposed to mean, exactly.  I think "extended" will
be clearer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] multivariate statistics (v25)

From
David Rowley
Date:
On 3 March 2017 at 03:53, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
This time with the attachments ....

It's been a long while since I looked at this patch, but I'm now taking another look.

I've made a list of stuff I've found from making my first pass on 0001 and 0002. Some of the stuff may seem a little pedantic, so apologies about those ones. I merely SET nit_picking_threshold TO 0; and reviewed.

Here goes:

0001:

+ RestrictInfo *rinfo = (RestrictInfo*)node;

and 

+ RestrictInfo *rinfo = (RestrictInfo *)node;
+
+ return expression_tree_walker((Node*)rinfo->clause,
+   pull_varattnos_walker,
+   (void*) context);

spacing incorrect. Please space after type name in casts and after the closing parenthesis. 

0002:

+      dropped as well.  Multivariate statistics referencing the column will
+      be dropped only if there would remain a single non-dropped column.

I was initially confused by this. I think it should be worded as:

"Multivariate statistics referencing the dropped column will also be removed if the removal of the column would cause the statistics to contain data for only a single column"

I had been confused as I'd been thinking of dropping multiple columns at once with the same command, and only 1 column remained in the table. So I think it's best to clarify you mean the statistic here.

+ OCLASS_STATISTICS /* pg_statistics_ext */

I wonder if this should be named: OCLASS_STATISTICEXT. The comment is also incorrect and should read "pg_statistic_ext" (without 's')

I tried to perform a test in this area and received an error:

postgres=# create table ab1 (a int, b int);
CREATE TABLE
postgres=# create statistics ab1_a_b_stats on (a,b) from ab1;
CREATE STATISTICS
postgres=# alter table ab1 drop column a;
ALTER TABLE
postgres=# drop table ab1;
ERROR:  cache lookup failed for statistics 16399

+   When estimating conditions on multiple columns, the planner assumes
+   independence of the conditions and multiplies the selectivities. When the
+   columns are correlated, the independence assumption is violated, and the
+   estimates may be off by several orders of magnitude, resulting in poor
+   plan choices.

I don't think the assumption is violated. We still assume that they're independent, which is incorrect. Nothing gets violated.

Perhaps it would be more accurate to write:

"When estimating the selectivity of conditions over multiple columns, the planner normally assumes each condition is independent of other conditions, and simply multiplies the selectivity estimates of each condition together to produce a final selectivity estimation for all conditions. This method can often lead to inaccurate row estimations when the conditions have dependencies on one another. Such misestimations can result poor plan choices being made."

+   using <command>CREATE STATISTICS</> command.

using the ...

+   As explained in <xref linkend="planner-stats">, the planner can determine
+   cardinality of <structname>t</structname> using the number of pages and
+   rows is looked up in <structname>pg_class</structname>:

perhaps "rows is" should become "rows as" or "rows which are".

+ * delete multi-variate statistics
+ */
+ RemoveStatisticsExt(relid, 0);

I think it should be "delete extended statistics"


Should this not be rejected?

postgres=# create view v1 as select 1 a, 2 b;
CREATE VIEW
postgres=# create statistics v1_a_stats on (a,b) from v1;
CREATE STATISTICS

and this?

postgres=# create sequence test_seq;
CREATE SEQUENCE
postgres=# select * from test_seq;
 last_value | log_cnt | is_called
------------+---------+-----------
          1 |       0 | f
(1 row)
postgres=# create statistics test_seq_stats on (last_value,log_cnt) from test_seq;
CREATE STATISTICS

The patch does claim:

+ /* extended stats are supported on tables and matviews */

So I guess it should be disallowed.


+ /* OBJECT_STATISTICS */
+ {
+ "statistics", OBJECT_STATISTICS

Maybe this should be changed to be OBJECT_STATISTICEXT. Doing it this way would close the door a bit on pg_depends records existing for pg_statistic.

A quick test shows a problem here:

postgres=# create table ab (a int, b int);
CREATE TABLE
postgres=# create statistics ab_a_b_stats on (a,b) from ab;
CREATE STATISTICS
postgres=# create statistics ab_a_b_stats1 on (a,b) from ab;
CREATE STATISTICS
postgres=# alter statistics ab_a_b_stats1 rename to ab_a_b_stats;
ERROR:  unsupported object class 3381


+/*****************************************************************************
+ *
+ * QUERY :
+ * CREATE STATISTICS stats_name ON relname (columns) WITH (options)
+ *
+ *****************************************************************************/

Old Syntax?


+ $$ = (Node *)n;

Incorrect spacing.


+ * The returned list is guaranteed to be sorted in order by OID, although
+ * this is not currently needed.

hmm, what's the tie-breaker going to be for:

CREATE TABLE abc (a int, b int, c int);
create statistics abc_ab_stats (a,b) from abc;
create statistics abc_bc_stats (b,c) from abc;

select * from abc where a=1 and b=1 and c=1;

I've not gotten to that part of the code yet, but reading the comment made me wonder how you're handling this. I think predictable is a good way, so that would require some ordering on this list... I presume.


+ * happen if the statistics has fewer attributes than we have Vars.

"statistics" is plural, so "has" should be "have"

although I see you mix the plurals up a few lines later and write in singular form.

+ /* check that all Vars are covered by the statistic */


This one is more of a question:

+ bool found;
+ double ndist = find_ndistinct(root, rel, varinfos, &found);

would it be better to return the bool and pass the &ndist here? That way you could simply write:

if (!find_ndistinct(root, rel, varinfos, &reldistinct))
  clamp *= 0.1;


@@ -3450,6 +3467,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
  clamp = rel->tuples;
  }
  }
+

Adds a new line by mistake.


+ /*
+ * Only ndistinct stats covering all Vars are acceptable, which can't
+ * happen if the statistics has fewer attributes than we have Vars.
+ */
+ if (bms_num_members(attnums) > info->stakeys->dim1)
+ continue;

bms_num_members() done inside loop. Would you say it's OK to assume the compiler will do that before the loop, or do you think it's best to set it before looping? We already know we're going to loop at least once, since we'd have short-circuited at the start of the function otherwise.


+ k = -1;
+ while ((k = bms_next_member(attnums, k)) >= 0)
+ {
+ bool attr_found = false;
+ for (i = 0; i < info->stakeys->dim1; i++)
+ {
+ if (info->stakeys->values[i] == k)
+ {
+ attr_found = true;
+ break;
+ }
+ }
+
+ /* found attribute not covered by this ndistinct stats, skip */
+ if (!attr_found)
+ {
+ matches = false;
+ break;
+ }
+ }

Would it be better just to stuff info->stakeys->values into a bitmapset and check it's a subset of attnums? It would mean allocating memory in the loop, so maybe you think otherwise, but in that case maybe StatisticExtInfo should store the bitmapset?
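
Something like this is what I have in mind (just a sketch against the
existing bms_add_member()/bms_is_subset() API; "keys" is a name I made up
here, and ideally StatisticExtInfo would carry the bitmapset so it isn't
rebuilt inside the loop):

    Bitmapset  *keys = NULL;
    int         i;

    /* collect the statistics' attribute numbers into a bitmapset ... */
    for (i = 0; i < info->stakeys->dim1; i++)
        keys = bms_add_member(keys, info->stakeys->values[i]);

    /* ... and require that it covers all the Vars we collected */
    if (!bms_is_subset(attnums, keys))
        continue;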


+ if (! matches)
+ continue;

extra whitespace after !


+ /* not the right item (different number of attributes) */
+ if (item->nattrs != bms_num_members(attnums))
+ continue;

again using bms_num_members() inside a loop when its known before the loop.


+ Assert(!(*found));

This confused me for a minute as I mistakenly read this as Assert((*found)); can you comment this to say something along the lines of the fact that we should have returned already if we found a match.


+ appendPQExpBuffer(&buf, "(dependencies)");

I think it's better practice to use appendPQExpBufferStr() when there's no formatting. It'll perform marginally better, which might not be important here, but it sets a better example for people to follow when performance is more critical.


+ List   *keys; /* String nodes naming referenced column(s) */

column(s) should read columns. 's' is not optional.


+ bool rd_statvalid; /* state of rd_statlist: true/false */

so bool can only be true or false. Good to know ;-)  the comment is probably useless, can you improve?


+   change the definition of a extended statistics

"a" should be "an", Also is statistics plural here. It's commonly mixed up in the patch. I think it needs standardised. I personally think if you're speaking of a single pg_statatic_ext row, then it should be singular. Yet, I'm aware you're using plural for the CREATE STATISTICS command, to me that feels a bit like: CREATE TABLES mytable ();  am I somehow thinking wrongly somehow here?


+        The name (optionally schema-qualified) of a statistics to be altered.

"a" should be "the"


+   If a schema name is given (for example, <literal>CREATE STATISTICS
+   myschema.mystat ...</>) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of

What's created in the current schema? I thought this was just for naming?

+  <para>
+   To be able to create a table, you must have <literal>USAGE</literal>
+   privilege on all column types or the type in the <literal>OF</literal>
+   clause, respectively.
+  </para>

"create a table" ? create an extended statistic ?

+  <title>Examples</title>
+
+  <para>
+   ...
+  </para>

Why are the examples missing? I've not looked beyond patch 0002 yet, but I'd have assumed 0002 should be committable without requiring later patches to make it correct.

+ * statscmds.c
+ *  Commands for creating and altering extended statistics
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

2017.

+ * statistics might work with  equality only.

extra space

+ /* costruction of array of enabled statistic */

construction?

+ atttuple = SearchSysCacheAttName(relid, attname);
+
+ if (!HeapTupleIsValid(atttuple))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+  errmsg("column \"%s\" referenced in statistics does not exist",
+ attname)));
+
+ /* more than STATS_MAX_DIMENSIONS columns not allowed */
+ if (numcols >= STATS_MAX_DIMENSIONS)
+ ereport(ERROR,
+ (errcode(ERRCODE_TOO_MANY_COLUMNS),
+ errmsg("cannot have more than %d keys in statistics",
+ STATS_MAX_DIMENSIONS)));
+
+ attnums[numcols] = ((Form_pg_attribute) GETSTRUCT(atttuple))->attnum;
+ ReleaseSysCache(atttuple);


Looks like a syscache leak. No?


+ /*
+ * Delete the pg_proc tuple.
+ */
+ relation = heap_open(StatisticExtRelationId, RowExclusiveLock);

pg_proc?


+ * pg_statistic_ext.h
+ *  definition of the system "extended statistic" relation (pg_statistic_ext)
+ *  along with the relation's initial contents.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

2017

+ * stats.h
+ *  Multivariate statistics and selectivity estimation functions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

2017

"Multivariate" should be "Extended". My justification here is that stats_are_built() is contained within, which is used in get_relation_statistics() which is not specific to MV stats.

0003:

No more time today. Will try and get to those soon.

Setting to waiting on author in the meantime.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From
David Rowley
Date:
On 13 March 2017 at 23:00, David Rowley <david.rowley@2ndquadrant.com> wrote:
0003:

No more time today. Will try and get to those soon.

0003:

I've now read this patch. My main aim here was to learn what it does and how it works. I need to spend much longer understanding how you're calculating the functional dependencies.

In the meantime I've pasted the notes I took while reading over the patch. 

+ default:
+ elog(ERROR, "unexcpected statistics type requested: %d", type);

"unexpected", but we generally use "unknown".

@@ -1293,7 +1294,8 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
  info->rel = rel;
 
  /* built/available statistics */
- info->ndist_built = true;
+ info->ndist_built = stats_are_built(htup, STATS_EXT_NDISTINCT);
+ info->deps_built = stats_are_built(htup, STATS_EXT_DEPENDENCIES);

I don't really like how this function is shaping up. You're calling stats_are_built() potentially twice for each stats type. There must be a nicer way to do this. Are non-built stats common enough to optimize building a StatisticExtInfo regardless and throwing it away if it happens to be useless?  

Can you also rename mvoid to become something more esoid or similar. I seem to always read it as m-void instead of mv-oid and naturally I expect a void pointer rather than an Oid.

+dependencies, and for each one count the number of rows rows consistent it.

duplicate word "rows"

+Apllying the functional dependencies is fairly simple - given a list of

Applying


+In this case the default estimation based on AVIA principle happens to work


hmm, maybe I should know what AVIA principles are, but I don't. Is there something I should be reading?  I searched around the internet for a few minutes but didn't find a good answer either.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group

2017


+ Assert(tmp <= ((char *) output + len));

Shouldn't you just Assert(tmp == ((char *) output + len)); at the end of the loop?


+ if (dependencies->magic != STATS_DEPS_MAGIC)
+ elog(ERROR, "invalid dependency magic %d (expected %dd)",
+ dependencies->magic, STATS_DEPS_MAGIC);
+
+ if (dependencies->type != STATS_DEPS_TYPE_BASIC)
+ elog(ERROR, "invalid dependency type %d (expected %dd)",
+ dependencies->type, STATS_DEPS_TYPE_BASIC);

%dd ?

+ Assert(dependencies->ndeps > 0);

Why Assert() and not elog()? Wouldn't that mean that a corrupt dependency could fail an Assert?


+ dependencies = (MVDependencies) palloc0(sizeof(MVDependenciesData));

Why palloc0() and not palloc()?

Can you not just read it into a variable on the stack, then check the exact size using tempdeps.ndeps * sizeof(MVDependency), then memcpy() it over? That'll save you the realloc()


+ /* what minimum bytea size do we expect for those parameters */
+ expected_size = offsetof(MVDependenciesData, deps) +
+ dependencies->ndeps * (offsetof(MVDependencyData, attributes) +
+   sizeof(AttrNumber) * 2);

Can't quite make sense of this yet. Why * 2?


+ /* is the number of attributes valid? */
+ Assert((k >= 2) && (k <= STATS_MAX_DIMENSIONS));

Seems like a bad idea to Assert() this. Wouldn't some bad data being deserialized cause an Assert failure?


+ d = (MVDependency) palloc0(offsetof(MVDependencyData, attributes) +
+   (k * sizeof(AttrNumber)));

Why palloc0(), you seem to write out all the fields right away. Seems like a waste to zero the memory.

+ /* still within the bytea */
+ Assert(tmp <= ((char *) data + VARSIZE_ANY(data)));

Any point? You're already Asserting that you've consumed the entire array at the end anyway.

+ appendStringInfoString(&str, "[");

appendStringInfoChar(&str, '['); would be better.

+ ret = pstrdup(str.data);

ret = pnstrdup(str.data, str.len);


+CREATE STATISTICS s1 WITH (dependencies) ON (a,a) FROM functional_dependencies;
+ERROR:  duplicate column name in statistics definition

Is it worth mentioning which column here?

I'll try to spend more time understanding 0003 soon.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From
Alvaro Herrera
Date:
I tried patch 0002 today and again there are conflicts, so I rebased and
fixed the merge problems.  I also changed a number of minor things, all
AFAICS cosmetic in nature:

* moved src/backend/statistics/common.h to src/include/statistics/common.h,
  as previously commented.  I also took out postgres.h and most of the
  includes; instead, put all these into each .c source file.  That aligns
  with our established practice.  I also removed two prototypes that should
  actually be in stats.h.  I think statistics/common.h should be further
  renamed to statistics/stats_ext_internal.h, and statistics/stats.h to
  something different though I don't know what ATM.
 

* Moved src/include/utils/stats.h to src/include/statistics, clean it up a bit.

* Moved some structs from analyze.c into statistics/common.h, removing some duplication; have analyze.c include that
file.

* renamed src/test/regress/sql/mv_ndistinct.sql to stats_ext.sql, to collect
  all ext. stats related tests in a single file, instead of having a large
  number of them.  I also added one test that drops a column, per David
  Rowley's reported failure, but I didn't actually fix the problem nor add it
  to the expected file.  (I'll follow up with that tomorrow, if Tomas doesn't
  beat me to it).  Also, put the test in an earlier parallel test group,
  'cause I see no reason to put it last.

* A bunch of stylistic changes.

The added tests pass (or they passed before I added the drop column
tests; not a surprise really that they pass, since I didn't touch
anything functionally), but they aren't terribly exhaustive at the stage
of the first patch in the series.

I didn't get around to addressing all of David Rowley's input.  Also I
didn't try to rebase the remaining patches in the series on top of this
one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v25)

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> I tried patch 0002 today and again there are conflicts, so I rebased and
> fixed the merge problems.

... and attached the patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v25)

From
David Fetter
Date:
On Tue, Mar 14, 2017 at 07:10:49PM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > I tried patch 0002 today and again there are conflicts, so I rebased and
> > fixed the merge problems.
> 
> ... and attached the patch.

Is the plan to convert completely from "multivariate" to "extended?"
I ask because I found a "multivariate" in there.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] multivariate statistics (v25)

From
David Rowley
Date:
On 15 March 2017 at 12:18, David Fetter <david@fetter.org> wrote:

Is the plan to convert completely from "multivariate" to "extended?"
I ask because I found a "multivariate" in there.

I get the idea that Tomas would like to keep the "multivariate" name when it's actually referencing multivariate stats. The idea of the rename was to allow future expansion of the code to perhaps allow creation of stats on expressions, which is not multivariate. If you've found a multivariate reference in an area that should be generic to extended statistics then that's a bug and should be fixed.

I found a few of these and listed them during my review.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From
Alvaro Herrera
Date:
Here's another version of 0002 after cleaning up almost everything from
David's review.  I also added tests for ALTER STATISTICS in
sql/alter_generic.sql which made me realize there were three crasher bugs
in here; fixed all those.  It also made me realize that psql's \d was a
little bit too generous with dropped columns in a stats object.  That
should all behave better now.

One thing I didn't do was change StatisticExtInfo to use a bitmapset
instead of int2vector.  I think it's a good idea to do so.

I'll go rebase the followup patches now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v25)

From
Alvaro Herrera
Date:
David Rowley wrote:

> + k = -1;
> + while ((k = bms_next_member(attnums, k)) >= 0)
> + {
> + bool attr_found = false;
> + for (i = 0; i < info->stakeys->dim1; i++)
> + {
> + if (info->stakeys->values[i] == k)
> + {
> + attr_found = true;
> + break;
> + }
> + }
> +
> + /* found attribute not covered by this ndistinct stats, skip */
> + if (!attr_found)
> + {
> + matches = false;
> + break;
> + }
> + }
> 
> Would it be better just to stuff info->stakeys->values into a bitmapset and
> check it's a subset of attnums? It would mean allocating memory in the loop,
> so maybe you think otherwise, but in that case maybe StatisticExtInfo
> should store the bitmapset?

Yeah, I think StatisticExtInfo should have a bitmapset, not an
int2vector.

> + appendPQExpBuffer(&buf, "(dependencies)");
> 
> I think it's better practice to use appendPQExpBufferStr() when there's no
> formatting. It'll perform marginally better, which might not be important
> here, but it sets a better example for people to follow when performance is
> more critical.

FWIW this should have said "(ndistinct)" anyway :-)

> +   change the definition of a extended statistics
> 
> "a" should be "an", Also is statistics plural here. It's commonly mixed up
> in the patch. I think it needs standardised. I personally think if you're
> speaking of a single pg_statatic_ext row, then it should be singular. Yet,
> I'm aware you're using plural for the CREATE STATISTICS command, to me that
> feels a bit like: CREATE TABLES mytable ();  am I somehow thinking wrongly
> somehow here?

This was discussed upthread as I recall.  This is what Merriam-Webster says on
the topic:

statistic
1   :  a single term or datum in a collection of statistics
2 a :  a quantity (as the mean of a sample) that is computed from a sample;
       specifically :  estimate 3b
  b :  a random variable that takes on the possible values of a statistic

statistics
1   :  a branch of mathematics dealing with the collection, analysis,
       interpretation, and presentation of masses of numerical data
2   :  a collection of quantitative data

Now, I think there's room to say that a single object created by the new CREATE
STATISTICS is really the latter, not the former.  I find it very weird
that a single one of these objects is named in the plural form, though, and
it looks odd all over the place.  I would rather use the term
"statistics object", and then we can continue using the singular.

> +   If a schema name is given (for example, <literal>CREATE STATISTICS
> +   myschema.mystat ...</>) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> 
> What's created in the current schema? I thought this was just for naming?

Well, "created in a schema" means that the object is named after that
schema.  So both are the same thing.  Is this unclear in some way?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] multivariate statistics (v25)

From
David Rowley
Date:
On 16 March 2017 at 09:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Here's another version of 0002 after cleaning up almost everything from
> David's review.  I also added tests for ALTER STATISTICS in
> sql/alter_generic.sql which made me realize there were three crasher bugs
> in here; fixed all those.  It also made me realize that psql's \d was a
> little bit too generous with dropped columns in a stats object.  That
> should all behave better now.

Thanks for fixing.

As you mentioned to me off-list that pg_dump support was missing, I've gone and implemented that in the attached patch.

I followed how pg_dump works for indexes, and created pg_get_statisticsextdef() in ruleutils.c. I was unsure if I should be naming this pg_get_statisticsdef() instead.

I also noticed there's no COMMENT ON support either, so I added that too.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: [HACKERS] multivariate statistics (v25)

From
Alvaro Herrera
Date:
Here's a rebased series on top of today's a3eac988c267.  I call this
v28.

I put David's pg_dump and COMMENT patches as second in line, just after
the initial infrastructure patch.  I suppose those three have to be
committed together, while the others (which add support for additional
statistic types) can rightly remain as separate commits.

(I think I lost some regression test files.  I couldn't make up my mind
about putting each statistic type's tests in a separate file, or all
together in stats_ext.sql.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] multivariate statistics (v25)

From
David Rowley
Date:
On 17 March 2017 at 11:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> (I think I lost some regression test files.  I couldn't make up my mind
> about putting each statistic type's tests in a separate file, or all
> together in stats_ext.sql.)

+1 for stats_ext.sql. I wanted to add some tests for
pg_get_statisticsextdef(), but I didn't see a suitable location.
stats_ext.sql would have been a good spot.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: multivariate statistics (v25)

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Here's a rebased series on top of today's a3eac988c267.  I call this
> v28.
> 
> I put David's pg_dump and COMMENT patches as second in line, just after
> the initial infrastructure patch.  I suppose those three have to be
> committed together, while the others (which add support for additional
> statistic types) can rightly remain as separate commits.

As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
he would try to rebase patches 0001,0005,0006 on top of what was
committed.  My intention is to give that one a look as soon as it is
available.  So we will have n-distinct and functional dependencies in
PG10.  It sounds unlikely that we will get MCVs and histograms in, since
they're each a lot of code.

I suppose we need 0011 too (psql tab completion), but that can wait.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v25)

From
David Rowley
Date:
On 25 March 2017 at 07:35, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
> he would try to rebase patches 0001,0005,0006 on top of what was
> committed.  My intention is to give that one a look as soon as it is
> available.  So we will have n-distinct and functional dependencies in
> PG10.  It sounds unlikely that we will get MCVs and histograms in, since
> they're each a lot of code.

I've been working on the MV functional dependencies part of the patch to polish it up a bit. Tomas has been busy with a few other duties.

I've made some changes around how clauselist_selectivity() determines if it should try to apply any extended stats. The solution I came up with was to add two parameters to this function, one for the RelOptInfo in question, and a bool to control whether we should try to apply any extended stats. For clauselist_selectivity() usage involving join rels we just pass the rel as NULL, so that we can skip all the extended stats stuff with very low overhead. When we actually have a base relation to pass along we can do so, along with a true tryextstats value to have the function attempt to use any extended stats to assist with the selectivity estimation.

When adding these two parameters I had second thoughts about whether the "tryextstats" was required at all. We could just have this controlled by whether the rel is a base rel of kind RTE_RELATION. I ended up having to pass these parameters further, down to clauselist_selectivity's singleton counterpart, clause_selectivity(). This was due to clause_selectivity() calling clauselist_selectivity() for some clause types. I'm not entirely sure if this is actually required, but I can't see any reason for it to cause problems.
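
To make the call shape concrete, here's roughly what I mean (a sketch only;
the parameter names are illustrative and the surrounding declarations are
omitted):

    Selectivity clauselist_selectivity(PlannerInfo *root, List *clauses,
                                       int varRelid, JoinType jointype,
                                       SpecialJoinInfo *sjinfo,
                                       RelOptInfo *rel, bool tryextstats);

    /* base relation: allow extended statistics to be considered */
    s1 = clauselist_selectivity(root, clauses, varRelid, jointype, sjinfo,
                                rel, true);

    /* join clauses: no single base rel, so skip extended statistics */
    s2 = clauselist_selectivity(root, joinquals, 0, jointype, sjinfo,
                                NULL, false);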

I've also attempted to simplify some of the logic within clauselist_selectivity and some other parts of clausesel.c to remove some unneeded code and make it a bit more efficient. For example, we no longer count the attributes in the clause list before calling a similar function to retrieve the actual attnums. This is now done as a single step.

I've not yet quite gotten as far as I'd like with this. I'd quite like to see clauselist_ext_split() gone, and instead we could build up a bitmapset of clause list indexes to ignore when applying the selectivity of clauses that couldn't use any extended stats. I'm planning on having a bit more of a look at this tomorrow.
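
Roughly this sort of shape is what I'm thinking of (sketch only; names and
argument lists are abbreviated for illustration, and the real loop would keep
the existing special handling for range clause pairs):

    Bitmapset  *estimatedclauses = NULL;
    ListCell   *l;
    int         listidx;

    /* the extended-stats code records the 0-based positions it estimated */
    s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
                                              jointype, sjinfo, rel,
                                              &estimatedclauses);

    /* ... and the ordinary per-clause loop just skips those positions */
    listidx = -1;
    foreach(l, clauses)
    {
        Node       *clause = (Node *) lfirst(l);

        listidx++;
        if (bms_is_member(listidx, estimatedclauses))
            continue;

        s1 *= clause_selectivity(root, clause, varRelid, jointype, sjinfo);
    }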

The attached patch should apply to master as of f90d23d0c51895e0d7db7910538e85d3d38691f0.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: multivariate statistics (v25)

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in
<CAKJS1f-fqo97jasVF57yfVyG+=T5JLce5ynCi1vvezXxX=wgoA@mail.gmail.com>
> On 25 March 2017 at 07:35, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
> > he would try to rebase patches 0001,0005,0006 on top of what was
> > committed.  My intention is to give that one a look as soon as it is
> > available.  So we will have n-distinct and functional dependencies in
> > PG10.  It sounds unlikely that we will get MCVs and histograms in, since
> > they're each a lot of code.
> >
> 
> I've been working on the MV functional dependencies part of the patch to
> polish it up a bit. Tomas has been busy with a few other duties.
> 
> I've made some changes around how clauselist_selectivity() determines if it
> should try to apply any extended stats. The solution I came up with was to
> add two parameters to this function, one for the RelOptInfo in question,
> and one a bool to control if we should try to apply any extended stats.
> For clauselist_selectivity() usage involving join rels we just pass the rel
> as NULL, that way we can skip all the extended stats stuff with very low
> overhead. When we actually have a base relation to pass along we can do so,
> along with a true tryextstats value to have the function attempt to use any
> extended stats to assist with the selectivity estimation.
> 
> When adding these two parameters I had 2nd thoughts that the "tryextstats"
> was required at all. We could just have this controlled by if the rel is a
> base rel of kind RTE_RELATION. I ended up having to pass these parameters
> further, down to clauselist_selectivity's singleton couterpart,
> clause_selectivity(). This was due to clause_selectivity() calling
> clauselist_selectivity() for some clause types. I'm not entirely sure if
> this is actually required, but I can't see any reason for it to cause
> problems.

I understand that the reason for tryextstats is that the two are
perfectly correlated, but clause_selectivity requires the
RelOptInfo anyway. Some comment about that may be required in the
function comment.

> I've also attempted to simplify some of the logic within
> clauselist_selectivity and some other parts of clausesel.c to remove some
> unneeded code and make it a bit more efficient. For example, we no longer
> count the attributes in the clause list before calling a similar function
> to retrieve the actual attnums. This is now done as a single step.
> 
> I've not yet quite gotten as far as I'd like with this. I'd quite like to
> see clauselist_ext_split() gone, and instead we could build up a bitmapset
> of clause list indexes to ignore when applying the selectivity of clauses
> that couldn't use any extended stats. I'm planning on having a bit more of
> a look at this tomorrow.
> 
> The attached patch should apply to master as
> of f90d23d0c51895e0d7db7910538e85d3d38691f0.

FWIW, I tried this. It applied cleanly, but make ends with
the following error.

$ make -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrprotos.h
Writing fmgrtab.c
make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'.  Stop.
make[2]: *** [statistics-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2


Some random comments from just looking at the patch:

======
The names of the functions "collect_ext_attnums" and
"clause_is_ext_compatible" seem odd, since "ext" doesn't clearly
read as "extended statistics". Some other names look the
same, too.

Something like "collect_e(xt)stat_compatible_attnums" and
"clause_is_e(xt)stat_compatible" seem better to me.

======
The following comment seems to have something wrong in it.

+ * When applying functional dependencies, we start with the strongest ones
+ * strongest dependencies. That is, we select the dependency that:

======
dependency_is_fully_matched() is not found. Maybe some other
patches are assumed?

======
+        /* see if it actually has the right */
+        ok = (NumRelids((Node *) expr) == 1) &&
+            (is_pseudo_constant_clause(lsecond(expr->args)) ||
+             (varonleft = false,
+              is_pseudo_constant_clause(linitial(expr->args))));
+
+        /* unsupported structure (two variables or so) */
+        if (!ok)
+            return true;

"ok" is used only here. I don't think expressions with side
effects are a good idea here.

======
+        switch (get_oprrest(expr->opno))
+        {
+            case F_EQSEL:
+
+                /* equality conditions are compatible with all statistics */
+                break;
+
+            default:
+
+                /* unknown estimator */
+                return true;
+        }

This seems somewhat stupid..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: multivariate statistics (v25)

From
David Rowley
Date:
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
> 
> At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in <CAKJS1f-fqo97jasVF57yfVyG+=T5JLce5ynCi1vvezXxX=wgoA@mail.gmail.com>
> 
> FWIW, I tried this. It applied cleanly, but make ends with
> the following error.
> 
> $ make -s
> Writing postgres.bki
> Writing schemapg.h
> Writing postgres.description
> Writing postgres.shdescription
> Writing fmgroids.h
> Writing fmgrprotos.h
> Writing fmgrtab.c
> make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'.  Stop.
> make[2]: *** [statistics-recursive] Error 2
> make[1]: *** [all-backend-recurse] Error 2
> make: *** [all-src-recurse] Error 2

Apologies. I was caught out by patching back onto master, then committing, and git diff'ing the last commit, where I'd of course forgotten to git add those files.

I'm just in the middle of fixing up some other stuff. Hopefully I'll post a working patch soon.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: multivariate statistics (v25)

From
David Rowley
Date:
On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> When adding these two parameters I had 2nd thoughts that the "tryextstats"
> was required at all. We could just have this controlled by if the rel is a
> base rel of kind RTE_RELATION. I ended up having to pass these parameters
> further, down to clauselist_selectivity's singleton couterpart,
> clause_selectivity(). This was due to clause_selectivity() calling
> clauselist_selectivity() for some clause types. I'm not entirely sure if
> this is actually required, but I can't see any reason for it to cause
> problems.

> I understand that the reason for tryextstats is that the two are
> perfectly correlated, but clause_selectivity requires the
> RelOptInfo anyway. Some comment about that may be required in the
> function comment.

hmm, you could say one is functionally dependent on the other. I did consider removing it, but it seemed weird to pass a NULL relation when we don't want to attempt to use extended stats.
 
> Some random comments from just looking at the patch:
> 
> ======
> The names of the functions "collect_ext_attnums" and
> "clause_is_ext_compatible" seem odd, since "ext" doesn't clearly
> read as "extended statistics". Some other names look the
> same, too.

I agree. I've made some changes to the patch to change how the functional dependency estimations are applied. I've removed most of the code from clausesel.c and put it into dependencies.c. In doing so I've removed some of the inefficiencies that were in the patch.  For example clause_is_ext_compatible() was being called many times on the same clause at different times. I've now nailed that down to just once per clause.
 
> Something like "collect_e(xt)stat_compatible_attnums" and
> "clause_is_e(xt)stat_compatible" seem better to me.


Changed to dependency_compatible_clause(), since this was searching for equality clauses in the form Var = Const, or Const = Var. This seems specific to the functional dependencies checking. A multivariate histogram won't want the same.
 
> ======
> The following comment seems to have something wrong in it.
> 
> + * When applying functional dependencies, we start with the strongest ones
> + * strongest dependencies. That is, we select the dependency that:
> 
> ======
> dependency_is_fully_matched() is not found. Maybe some other
> patches are assumed?

> ======
> +               /* see if it actually has the right */
> +               ok = (NumRelids((Node *) expr) == 1) &&
> +                       (is_pseudo_constant_clause(lsecond(expr->args)) ||
> +                        (varonleft = false,
> +                         is_pseudo_constant_clause(linitial(expr->args))));
> +
> +               /* unsupported structure (two variables or so) */
> +               if (!ok)
> +                       return true;
> 
> "ok" is used only here. I don't think expressions with side
> effects are a good idea here.


I thought the same, but I happened to notice that Tomas must have taken it from clauselist_selectivity().
 
> ======
> +               switch (get_oprrest(expr->opno))
> +               {
> +                       case F_EQSEL:
> +
> +                               /* equality conditions are compatible with all statistics */
> +                               break;
> +
> +                       default:
> +
> +                               /* unknown estimator */
> +                               return true;
> +               }
> 
> This seems somewhat stupid..

I agree. Changed. 

I've attached an updated patch.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: multivariate statistics (v25)

From
David Rowley
Date:
On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached an updated patch.

I've made another pass at this and ended up removing the tryextstats
variable. We now only try to use extended statistics when
clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
RTE_RELATION, and of course, it must also have some extended stats
defined too.

I've also cleaned up a few more comments, many of which I managed to
omit updating when I refactored how the selectivity estimation ties
into clauselist_selectivity().

I'm quite happy with all of this now, and would also be happy for
other people to take a look and comment.

As a reviewer, I'd be marking this ready for committer, but I've moved
a little way from just reviewing this now, having spent two weeks
hacking at it.

The latest patch is attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: multivariate statistics (v25)

From
Tomas Vondra
Date:
On 04/04/2017 09:55 AM, David Rowley wrote:
> On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've attached an updated patch.
>
> I've made another pass at this and ended up removing the tryextstats
> variable. We now only try to use extended statistics when
> clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> RTE_RELATION, and of course, it must also have some extended stats
> defined too.
>
> I've also cleaned up a few more comments, many of which I managed to
> omit updating when I refactored how the selectivity estimates ties
> into clauselist_selectivity()
>
> I'm quite happy with all of this now, and would also be happy for
> other people to take a look and comment.
>
> As a reviewer, I'd be marking this ready for committer, but I've moved
> a little way from just reviewing this now, having spent two weeks
> hacking at it.
>
> The latest patch is attached.
>

Thanks David, I agree the reworked patch is much cleaner than the last
version I posted. Thanks for spending your time on it.

Two minor comments:

1) DEPENDENCY_MIN_GROUP_SIZE

I'm not sure we still need the min_group_size when evaluating
dependencies. It was meant to deal with 'noisy' data, but I think
after switching to the 'degree' it might actually be a bad idea.

Consider this:
    create table t (a int, b int);
    insert into t select 1, 1 from generate_series(1, 10000) s(i);
    insert into t select i, i from generate_series(2, 20000) s(i);
    create statistics s with (dependencies) on (a,b) from t;
    analyze t;

    select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
    (1 row)

So the degree of the dependency is just ~0.333 although it's obviously a 
perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason 
is that we discard 2/3 of rows, because those groups are only a single 
row each, except for the one large group (1/3 of rows).

Without the minimum group size limitation, the dependencies are:

    test=# select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
    (1 row)

which seems way more reasonable, I think.


2) A minor detail is that instead of this

    if (estimatedclauses != NULL &&
        bms_is_member(listidx, estimatedclauses))
        continue;

perhaps we should do just this:

    if (bms_is_member(listidx, estimatedclauses))
        continue;

bms_is_member does the same NULL check right at the beginning, so I
don't think this makes a measurable difference.


kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v25)

From
Kyotaro HORIGUCHI
Date:
At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com>
> On 04/04/2017 09:55 AM, David Rowley wrote:
> > On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com>
> > wrote:
> >> I've attached an updated patch.
> >
> > I've made another pass at this and ended up removing the tryextstats
> > variable. We now only try to use extended statistics when
> > clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> > RTE_RELATION, and of course, it must also have some extended stats
> > defined too.
> >
> > I've also cleaned up a few more comments, many of which I managed to
> > omit updating when I refactored how the selectivity estimates ties
> > into clauselist_selectivity()
> >
> > I'm quite happy with all of this now, and would also be happy for
> > other people to take a look and comment.
> >
> > As a reviewer, I'd be marking this ready for committer, but I've moved
> > a little way from just reviewing this now, having spent two weeks
> > hacking at it.
> >
> > The latest patch is attached.
> >
> 
> Thanks David, I agree the reworked patch is much cleaner that the last
> version I posted. Thanks for spending your time on it.
> 
> Two minor comments:
> 
> 1) DEPENDENCY_MIN_GROUP_SIZE
> 
> I'm not sure we still need the min_group_size, when evaluating
> dependencies. It was meant to deal with 'noisy' data, but I think it
> after switching to the 'degree' it might actually be a bad idea.
> 
> Consider this:
> 
>     create table t (a int, b int);
>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>     insert into t select i, i from generate_series(2, 20000) s(i);
>     create statistics s with (dependencies) on (a,b) from t;
>     analyze t;
> 
>     select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>     (1 row)
> 
> So the degree of the dependency is just ~0.333 although it's obviously
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The
> reason is that we discard 2/3 of rows, because those groups are only a
> single row each, except for the one large group (1/3 of rows).
> 
> Without the mininum group size limitation, the dependencies are:
> 
>     test=# select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
>     (1 row)
> 
> which seems way more reasonable, I think.

I think the same. Quite a large part of functional dependencies in
reality are of this kind.

> 2) A minor detail is that instead of this
> 
>     if (estimatedclauses != NULL &&
>         bms_is_member(listidx, estimatedclauses))
>         continue;
> 
> perhaps we should do just this:
> 
>     if (bms_is_member(listidx, estimatedclauses))
>         continue;
> 
> bms_is_member does the same NULL check right at the beginning, so I
> don't think this might make a measurable difference.


I have some other comments.

======
- The comment for clauselist_selectivity,
| + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
| + * selectivity estimates using any extended statistcs on 'rel'.

The 'rel' is actually a parameter, but rtekind means rel->rtekind,
so this might better be something like the following.

| When a relation of RTE_RELATION is given as 'rel', we try
| extended statistics on the relation.

Then the following line doesn't seem to be required.

| + * If we identify such extended statistics exist, we try to apply them.


=====
The following comment in the same function,

| +    if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
| +    {
| +        /*
| +         * Try to estimate with multivariate functional dependency statistics.
| +         *
| +         * The function will supply an estimate for the clauses which it
| +         * estimated for. Any clauses which were unsuitible were ignored.
| +         * Clauses which were estimated will have their 0-based list index set
| +         * in estimatedclauses.  We must ignore these clauses when processing
| +         * the remaining clauses later.
| +         */

(Notice that I'm not a good writer) This might better be the
following.

|  dependencies_clauselist_selectivity gives selectivity over
|  clauses to which functional dependencies on the given relation
|  are applicable. 0-based index numbers of consumed clauses are
|  returned in the bitmap set estimatedclauses so that the
|  estimation hereafter can ignore them.

=====
| +        s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
| +                                   jointype, sjinfo, rel, &estimatedclauses);

The name prefix "dependency_" means "functional_dependency" here
and omitting "functional" is confusing to me. On the other hand
"functional_dependency" is quite long as prefix. Could we use
"func_dependency" or something that is shorter but meaningful?
(But this change causes renaming of much other stuff..)

=====
The name "dependency_compatible_clause" might be meaningful if it
were "clause_is_compatible_with_(functional_)dependency" or such.

=====
dependency_compatible_walker() returns true if the given node is
*not* compatible. Isn't that confusing?

=====
dependency_compatible_walker() seems to implicitly expect that a
RestrictInfo will be given first. RestrictInfo should perhaps
be processed outside this function, in _compatible_clause().

=====
dependency_compatible_walker() can return two or more attributes,
but dependency_compatible_clause() errors out in that case. Since
_walker is called only from _clause, _walker can return
earlier with "incompatible" in such a case.

=====
In the comment in dependencies_clauselist_selectivity(), 

|  /*
|   * Technically we could find more than one clause for a given
|   * attnum. Since these clauses must be equality clauses, we choose
|   * to only take the selectivity estimate from the final clause in
|   * the list for this attnum. If the attnum happens to be compared
|   * to a different Const in another clause then no rows will match
|   * anyway. If it happens to be compared to the same Const, then
|   * ignoring the additional clause is just the thing to do.
|   */
|  if (dependency_implies_attribute(dependency,
|                                   list_attnums[listidx]))

If multiple clauses include the attribute, selectivity estimates
for clauses other than the last one are a waste of time. Why the
last one and not the first one?

Even if all clauses should be added into estimatedclauses,
calling clause_selectivity once is enough. Since
clause_selectivity may return 1.0 for some clauses, using s2 for
the decision seems reasonable.

|  if (dependency_implies_attribute(dependency,
|                                   list_attnums[listidx]))
|  {
|      clause = (Node *) lfirst(l);
+      if (s2 == 1.0)
|        s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,

# This '==' works since it is not a result of a calculation.

=====
Still in dependencies_clauselist_selectivity,
dependency_implies_attribute seems designed to return true for
at least one clause in the list, but any failure leads to an
infinite loop. I think some measure against that case is required.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: multivariate statistics (v25)

From
"Sven R. Kunze"
Date:
Thanks Tomas and David for hacking on this patch.

On 04.04.2017 20:19, Tomas Vondra wrote:
> I'm not sure we still need the min_group_size, when evaluating 
> dependencies. It was meant to deal with 'noisy' data, but I think it 
> after switching to the 'degree' it might actually be a bad idea.
>
> Consider this:
>
>     create table t (a int, b int);
>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>     insert into t select i, i from generate_series(2, 20000) s(i);
>     create statistics s with (dependencies) on (a,b) from t;
>     analyze t;
>
>     select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>     (1 row)
>
> So the degree of the dependency is just ~0.333 although it's obviously 
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
> reason is that we discard 2/3 of rows, because those groups are only a 
> single row each, except for the one large group (1/3 of rows).

Just for me to follow the comments better: is "dependency" roughly the
same as what statisticians call "conditional probability"?

Sven



Re: multivariate statistics (v25)

From
Tomas Vondra
Date:

On 04/05/2017 08:41 AM, Sven R. Kunze wrote:
> Thanks Tomas and David for hacking on this patch.
> 
> On 04.04.2017 20:19, Tomas Vondra wrote:
>> I'm not sure we still need the min_group_size, when evaluating 
>> dependencies. It was meant to deal with 'noisy' data, but I think it 
>> after switching to the 'degree' it might actually be a bad idea.
>>
>> Consider this:
>>
>>     create table t (a int, b int);
>>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>>     insert into t select i, i from generate_series(2, 20000) s(i);
>>     create statistics s with (dependencies) on (a,b) from t;
>>     analyze t;
>>
>>     select stadependencies from pg_statistic_ext ;
>>                   stadependencies
>>     --------------------------------------------
>>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>>     (1 row)
>>
>> So the degree of the dependency is just ~0.333 although it's obviously 
>> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
>> reason is that we discard 2/3 of rows, because those groups are only a 
>> single row each, except for the one large group (1/3 of rows).
> 
> Just for me to follow the comments better. Is "dependency" roughly the 
> same as when statisticians speak about " conditional probability"?
> 

No, it's more 'functional dependency' from relational normal forms.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v25)

From
David Rowley
Date:
On 5 April 2017 at 14:53, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com>
>> Two minor comments:
>>
>> 1) DEPENDENCY_MIN_GROUP_SIZE
>>
>> I'm not sure we still need the min_group_size, when evaluating
>> dependencies. It was meant to deal with 'noisy' data, but I think it
>> after switching to the 'degree' it might actually be a bad idea.

Yeah, I'd wondered about this when I first started testing the patch.
I failed to get any functional dependencies because my values were too
unique. Seems I'd gotten a bit used to it, and in the end thought that
if the values are unique enough then they won't suffer as much from
the underestimation problem you're trying to solve here.

I've removed that part of the code now.

> I think the same. Quite large part of functional dependency in
> reality is in this kind.
>
>> 2) A minor detail is that instead of this
>>
>>     if (estimatedclauses != NULL &&
>>         bms_is_member(listidx, estimatedclauses))
>>         continue;
>>
>> perhaps we should do just this:
>>
>>     if (bms_is_member(listidx, estimatedclauses))
>>         continue;
>>
>> bms_is_member does the same NULL check right at the beginning, so I
>> don't think this might make a measurable difference.

hmm yeah, I'd added that because I thought the estimatedclauses would
be NULL in 99.9% of cases and thought that I might be able to shave a
few cycles off. I see that there's an x < 0 test before the NULL test
in the function. Anyway, I'm not going to put up a fight here, so I've
removed it. I didn't ever benchmark anything to see if the extra test
actually helped anyway...

> I have some other comments.
>
> ======
> - The comment for clauselist_selectivity,
> | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
> | + * selectivity estimates using any extended statistcs on 'rel'.
>
> The 'rel' is actually a parameter but rtekind means rel->rtekind
> so this might be better be such like the following.
>
> | When a relation of RTE_RELATION is given as 'rel', we try
> | extended statistcs on the relation.
>
> Then the following line doesn't seem to be required.
>
> | + * If we identify such extended statistics exist, we try to apply them.

Yes, good point. I've revised this comment a bit now.

>
> =====
> The following comment in the same function,
>
> | +    if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
> | +    {
> | +        /*
> | +         * Try to estimate with multivariate functional dependency statistics.
> | +         *
> | +         * The function will supply an estimate for the clauses which it
> | +         * estimated for. Any clauses which were unsuitible were ignored.
> | +         * Clauses which were estimated will have their 0-based list index set
> | +         * in estimatedclauses.  We must ignore these clauses when processing
> | +         * the remaining clauses later.
> | +         */
>
> (Notice that I'm not a good writer) This might better be the
> following.
>
> |  dependencies_clauselist_selectivity gives selectivity over
> |  caluses that functional dependencies on the given relation is
> |  applicable. 0-based index numbers of consumed clauses are
> |  returned in the bitmap set estimatedclauses so that the
> |  estimation here after can ignore them.

I've changed this one too now.

> =====
> | +        s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
> | +                                   jointype, sjinfo, rel, &estimatedclauses);
>
> The name prefix "dependency_" means "functional_dependency" here
> and omitting "functional" is confusing to me. On the other hand
> "functional_dependency" is quite long as prefix. Could we use
> "func_dependency" or something that is shorter but meaningful?
> (But this change causes renaming of many other sutff..)

oh no! Many functions in dependencies.c start with dependencies_. To
me, it's a bit of an OOP thing, which if we'd been using some other
language would have been dependencies->clauselist_selectivity(). Of
course, not all functions in that file follow that rule, but I don't
feel a pressing need to go make that any worse.  Perhaps the prefix
could be func_dependency, but I really don't feel very excited about
having it that way, and even less so about making the change.

> =====
> The name "dependency_compatible_clause" might be meaningful if it
> were "clause_is_compatible_with_(functional_)dependency" or such.

I could maybe squeeze the word "is" in there.  ... OK done.

> =====
> dependency_compatible_walker() returns true if given node is
> *not* compatible. Isn't it confusing?

Yeah.

>
> =====
> dependency_compatible_walker() seems implicitly expecting that
> RestrictInfo will be given at the first. RestrictInfo might
> should be processed outside this function in _compatible_clause().

Actually, I don't really see a great need for this to be a recursive
walker type function. So I've just gone and stuck all that logic in
dependency_is_compatible_clause() instead.

> =====
> dependency_compatible_walker() can return two or more attriburtes
> but dependency_compatible_clause() errors out in the case. Since
> _walker is called only from the _clause, _walker can return
> earlier with "incompatible" in such a case.

I don't quite see how it's possible for it to ever have more than 1
attnum in there. We only capture Vars from one side of a binary
OpExpr. If one side of the OpExpr is an Expr, then we'd not capture
anything, and not recurse into the Expr. Anyway, I've pulled that code
out into dependency_is_compatible_clause now.

> =====
> In the comment in dependencies_clauselist_selectivity(),
>
> |  /*
> |   * Technically we could find more than one clause for a given
> |   * attnum. Since these clauses must be equality clauses, we choose
> |   * to only take the selectivity estimate from the final clause in
> |   * the list for this attnum. If the attnum happens to be compared
> |   * to a different Const in another clause then no rows will match
> |   * anyway. If it happens to be compared to the same Const, then
> |   * ignoring the additional clause is just the thing to do.
> |   */
> |  if (dependency_implies_attribute(dependency,
> |                                   list_attnums[listidx]))
>
> If multiple clauses include the attribute, selectivity estimates
> for clauses other than the last one are waste of time. Why not the
> first one but the last one?

Why not the middle one? Really it's not expected to be a common case.
If someone writes: WHERE a = 1 and a = 2; then they'll likely not get
many results back. If the same clause is duplicated then well, it
won't be the only thing that does a little needless extra work. I
don't think optimising for this is worth the trouble.

>
> Even if all clauses should be added into estimatedclauses,
> calling clause_selectivity once is enough. Since
> clause_selectivity may return 1.0 for some clauses, using s2 for
> the decision seems reasonable.
>
> |  if (dependency_implies_attribute(dependency,
> |                                   list_attnums[listidx]))
> |  {
> |      clause = (Node *) lfirst(l);
> +      if (s2 == 1.0)
> |        s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,
>
> # This '==' works since it is not a result of a calculation.

I don't think this is an important optimisation. It's a corner case if
more than one match, although not impossible. I vote to leave it as
is, and not optimise the corner case.

> =====
> Still in dependencies_clauselist_selectivity,
> dependency_implies_attributes seems designed to return true for
> at least one clause in the clauses but any failure leands to
> infinite loop. I think any measure against the case is required.

I did consider this, but I really can't see a scenario where this is
possible. find_strongest_dependency() would not have found a
dependency if dependency_implies_attribute() was going to fail, so
we'd have exited the loop already. I think it's safe providing that
'clauses_attnums' is in sync with the clauses that we'll examine in
the loop over the 'clauses' list. Perhaps the while loop should have
some safety valve, but I'm not all that sure what that would be, and
since I can't see how it could become an infinite loop, I've not
bothered to think too hard about what else might be done here.

I've attached an updated patch to address Tomas' concerns and yours too.

Thank you to both for looking at my changes

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: multivariate statistics (v25)

From
Simon Riggs
Date:
On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:

>> I have some other comments.

Me too.


CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.

This change is in line with other changes in this and earlier
releases. Comments and docs included.

Patch ready to be applied directly barring objections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics (v25)

From
"Tels"
Date:
Moin,

On Wed, April 5, 2017 2:52 pm, Simon Riggs wrote:
> On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>
>>> I have some other comments.
>
> Me too.
>
>
> CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.
>
> This change is in line with other changes in this and earlier
> releases. Comments and docs included.
>
> Patch ready to be applied directly barring objections.

I know I'm a bit late, but isn't the syntax backwards?

"CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"

These do it the other way round:

CREATE INDEX idx ON table (col_a);

AND:

  CREATE TABLE t (
    id INT  REFERENCES table_2 (col_b);
  );

Won't this be confusing and make things hard to remember?

Sorry for not asking earlier, I somehow missed this.

Regards,

Tels



Re: multivariate statistics (v25)

From
David Rowley
Date:
On 6 April 2017 at 07:19, Tels <nospam-abuse@bloodgate.com> wrote:
> I know I'm a bit late, but isn't the syntax backwards?
>
> "CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"
>
> These do it the other way round:
>
> CREATE INDEX idx ON table (col_a);
>
> AND:
>
>    CREATE TABLE t (
>      id INT  REFERENCES table_2 (col_b);
>    );
>
> Won't this be confusing and make things hard to remember?
>
> Sorry for not asking earlier, I somehow missed this.

The reasoning is in [1]

[1] https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: multivariate statistics (v25)

From
Simon Riggs
Date:
On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:

> I've attached an updated patch to address Tomas' concerns and yours too.

Committed, with some doc changes and additions based upon my explorations.

For the record, I measured the time to calculate extended statistics as
+800ms on a 2 million row sample.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: multivariate statistics (v25)

From
David Rowley
Date:
On 6 April 2017 at 10:17, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
>> I've attached an updated patch to address Tomas' concerns and yours too.
>
> Commited, with some doc changes and additions based upon my explorations.

Great. Thanks for committing!


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services