Thread: WIP: multivariate statistics / proof of concept

WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

12 October 2014, 22:01:16

Hi,

attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.

I also cut some corners to make it work, and those limitations need to
be fixed before the eventual commit (those are not difficult problems,
but were not necessary for a proof-of-concept patch).

It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.

I expect to be busy over the next two weeks because of travel, so sorry
for somewhat delayed responses. If you happen to attend pgconf.eu next
week (Oct 20-24), we can of course discuss this patch in person.


Goals and basics
----------------

The goal of this patch is allowing users to define multivariate
statistics (i.e. statistics on multiple columns), and improving
estimation when the columns are correlated.

Take for example a table like this:

    CREATE TABLE test (a INT, b INT, c INT);
    INSERT INTO test SELECT i/10000, i/10000, i/10000
                       FROM generate_series(1,1000000) s(i);
    ANALYZE test;

and do a query like this:

    SELECT * FROM test WHERE (a = 10) AND (b = 10) AND (c = 10);

which is estimated like this:

                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=1 width=12)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
 Planning time: 0.142 ms
(3 rows)

The query of course returns 10,000 rows, but the planner assumes the
columns are independent and thus multiplies the selectivities. And 1/100
for each column means 1/1000000 in total, which is 1 row.
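
To make the arithmetic explicit, here is a minimal C sketch of the
independence assumption. The function name and signature are hypothetical
(this is not planner code); it only multiplies the per-clause selectivities
and scales the product by the table size.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not planner code): row estimate under the
 * independence assumption, i.e. the product of the per-clause
 * selectivities scaled by the number of rows in the table.
 */
double
independent_row_estimate(double ntuples, const double *sel, size_t nclauses)
{
    double      total = 1.0;
    size_t      i;

    for (i = 0; i < nclauses; i++)
        total *= sel[i];

    return ntuples * total;
}
```

With ntuples = 1000000 and three selectivities of 0.01 the result is
approximately 1, which matches rows=1 in the plan above, while the query
actually returns 10,000 rows.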

This example is of course somewhat artificial, but the problem is far
from uncommon, especially in denormalized datasets (e.g. star schema).
If you ever got an index scan instead of a sequential scan due to poor
estimate, resulting in a query running for hours instead of seconds, you
know the pain.

The patch allows you to do this:

    ALTER TABLE test ADD STATISTICS ON (a, b, c);
    ANALYZE test;

which then results in this estimate:

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=9667 width=12)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
 Planning time: 0.110 ms
(3 rows)

This however is not free - both building such statistics (during
ANALYZE) and using it (during planning) costs some cycles. Even if we
optimize the hell out of it, it won't be entirely free.

One of the design goals in this patch is not to make the ANALYZE or
planning more expensive unless you add such statistics.

Those who add such statistics probably decided that the price is worth
the improved estimates, and lower risk of inefficient plans. If the
planning takes a few more milliseconds, it's probably worth it if you
risk queries running for minutes or hours because of misestimates.

It also does not guarantee the estimates to be always better. There will
be misestimates, although rather in the other direction (independence
assumption usually leads to underestimates, this may lead to
overestimates). However, based on my experience from writing the patch, I
believe it's possible to reasonably limit the extent of such errors
(just like in the single-column histograms, it's related to the bucket
size).

Of course, there will be cases when the old approach is lucky by
accident - there's not much we can do to beat luck. And we can't rely on
it either.


Design overview
---------------

The patch adds a new system catalog, called pg_mv_statistic, which is
used to keep track of requested statistics. There's also a pg_mv_stats
view, showing some basic info about the stats (not all the data).

There are three kinds of statistics:

  - list of most common combinations of values (MCV list)
  - multi-dimensional histogram
  - associative rules

The first two are extensions of the single-column stats we already have.
The MCV list is a trivial extension to multiple dimensions, just
tracking combinations and frequencies. The histogram is more complex -
the structure is quite simple (multi-dimensional rectangles) but there's
a lot of ways to build it. But even the current naive and simple
implementation seems to work quite well.
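
As an illustration of "tracking combinations and frequencies", here is a
minimal sketch of a multi-column MCV list and of how equality clauses might
be matched against it. The struct layout and function name are assumptions
made for this example; the patch's real structures are in mvstats.h and may
look quite different.

```c
#include <stddef.h>

#define MV_MAX_DIMS 8           /* assumed cap on the number of columns */

/* Hypothetical layout: one combination of values plus its frequency. */
typedef struct McvItem
{
    int         values[MV_MAX_DIMS];    /* one value per column */
    double      frequency;              /* fraction of sampled rows */
} McvItem;

/*
 * Sum the frequencies of all MCV items matching the equality conditions.
 * has_cond[d] says whether column d has a "column = constant" clause and
 * consts[d] holds that constant.
 */
double
mcv_equality_selectivity(const McvItem *items, size_t nitems, int ndims,
                         const int *has_cond, const int *consts)
{
    double      sel = 0.0;
    size_t      i;
    int         d;

    for (i = 0; i < nitems; i++)
    {
        int         match = 1;

        for (d = 0; d < ndims; d++)
        {
            if (has_cond[d] && items[i].values[d] != consts[d])
            {
                match = 0;
                break;
            }
        }

        if (match)
            sel += items[i].frequency;
    }

    return sel;
}
```

Because each item stores a whole combination, correlated columns are
handled without multiplying per-column selectivities.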

The last kind (associative rules) is an attempt to track "implications"
between columns. It is however an experiment and it's not really used in
the patch so I'll ignore it for now.

I'm not going to explain all the implementation details here - if you
want to learn more, the best way is probably by reading the changes in
those files (probably in this order):

    src/include/utils/mvstats.h
    src/backend/commands/analyze.c
    src/backend/optimizer/path/clausesel.c

I tried to explain the ideas thoroughly in the comments, along with a
lot of TODO/FIXME items related to limitations, explained in the next
section.


Limitations
-----------

As I mentioned, the current patch has a number of practical limitations,
most importantly:

  (a) only data types passed by value (no varlena types)
  (b) only data types with sort (to be able to build histogram)
  (c) no NULL values supported
  (d) not handling DROP COLUMN or DROP TABLE and such
  (e) limited to stats on 8 columns (max)
  (f) optimizer uses single stats per table
  (g) limited list of compatible WHERE clauses
  (h) incomplete ADD STATISTICS syntax

The first three conditions are really a shortcut to a working patch, and
fixing them should not be difficult.

The limited number of columns is really just a sanity check. It's
possible to increase it, but I doubt stats on more columns will be
practical because of excessive size or poor accuracy.

A better approach is to support combining multiple stats, defined on
various subsets of columns. This is not implemented at the moment, but
it's certainly on the roadmap. Currently the "smallest" stats covering
the most columns is selected.
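
One possible reading of that selection rule, as a sketch: columns are
represented as bits in a mask, the statistics covering the most referenced
columns win, and ties go to the statistics defined on fewer columns. The
function names, the bitmask representation and the "at least two columns"
cutoff are assumptions of this example, not taken from the patch.

```c
#include <stddef.h>

/* Count the bits set in a column bitmask (bit n stands for column n). */
static int
count_columns(unsigned int mask)
{
    int         n = 0;

    while (mask)
    {
        n += (int) (mask & 1u);
        mask >>= 1;
    }
    return n;
}

/*
 * Hypothetical selection rule: prefer the statistics covering the most
 * columns referenced by the clauses; on a tie, prefer the one defined on
 * fewer columns.  Returns an index into stats[], or -1 if no statistics
 * cover at least two of the referenced columns.
 */
int
choose_mv_stats(const unsigned int *stats, size_t nstats,
                unsigned int clause_columns)
{
    int         best = -1;
    int         best_covered = 0;
    int         best_size = 0;
    size_t      i;

    for (i = 0; i < nstats; i++)
    {
        int         covered = count_columns(stats[i] & clause_columns);
        int         size = count_columns(stats[i]);

        if (covered < 2)
            continue;           /* assumed: one column gains nothing */

        if (covered > best_covered ||
            (covered == best_covered && size < best_size))
        {
            best = (int) i;
            best_covered = covered;
            best_size = size;
        }
    }

    return best;
}
```

For clauses on (a, b) and statistics on (a, b, c) and (a, b), both cover two
columns, so the smaller (a, b) statistics are chosen.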

Regarding the compatible WHERE clauses, the patch currently handles
conditions of the form

    column OPERATOR constant

where operator is one of the comparison operators (=, <, >, <=, >=). In
the future it's possible to add support for more conditions, e.g.
"column IS NULL" or "column OPERATOR column".

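The compatibility rule can be sketched as a simple check. The clause
representation below is a deliberately simplified, hypothetical stand-in for
the planner's expression nodes, and the function is not the patch's own
check; it only encodes the "column OPERATOR constant" form with the five
comparison operators.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified clause representation (not planner nodes). */
typedef enum ArgKind
{
    ARG_COLUMN,
    ARG_CONSTANT,
    ARG_OTHER
} ArgKind;

typedef struct SimpleClause
{
    ArgKind     left;
    const char *opname;
    ArgKind     right;
} SimpleClause;

/*
 * Returns 1 if the clause has the form "column OPERATOR constant" with
 * one of the five comparison operators, 0 otherwise.
 */
int
clause_is_compatible(const SimpleClause *clause)
{
    static const char *const supported[] = {"=", "<", ">", "<=", ">="};
    size_t      i;

    if (clause->left != ARG_COLUMN || clause->right != ARG_CONSTANT)
        return 0;

    for (i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
    {
        if (strcmp(clause->opname, supported[i]) == 0)
            return 1;
    }

    return 0;
}
```

Clauses rejected by such a check would simply fall back to the existing
per-column estimation.
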
The last point is really just "unfinished implementation" - the syntax I
propose is this:

   ALTER TABLE ... ADD STATISTICS (options) ON (columns)

where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what it might do, but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).



Examples
--------

Let's see a few examples of how to define the stats, and what difference
in estimates it makes:

CREATE TABLE test (a INT, b INT, c INT);

-- same value in all columns
INSERT INTO test SELECT mod(i,100), mod(i,100), mod(i,100)
       FROM generate_series(1,1000000) s(i);

ANALYZE test;

=============== no multivariate stats ============================

SELECT * FROM test WHERE a = 10 AND b = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=101 width=12)
                   (actual time=0.007..60.902 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.119 ms
 Execution time: 61.164 ms
(5 rows)


SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=1 width=12)
                   (actual time=0.010..56.780 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.061 ms
 Execution time: 56.994 ms
(5 rows)


=============== with multivariate stats ===========================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

SELECT * FROM test WHERE a = 10 AND b = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=10767 width=12)
                   (actual time=0.007..58.981 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.114 ms
 Execution time: 59.214 ms
(5 rows)

SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=10767 width=12)
                   (actual time=0.008..61.838 rows=10000 loops=1)
   Filter: ((a = 10) AND (b = 10) AND (c = 10))
   Rows Removed by Filter: 990000
 Planning time: 0.088 ms
 Execution time: 62.057 ms
(5 rows)


OK, that was a rather significant improvement, but it's also a trivial
dataset. Let's see something more complicated - the following table has
correlated columns with distributions skewed to 0.

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT r*MOD(i,50),
                        pow(r,2)*MOD(i,100),
                        pow(r,4)*MOD(i,500)
       FROM (SELECT random() AS r, i
               FROM generate_series(1,1000000) s(i)) foo;
ANALYZE test;


=============== no multivariate stats ============================

SELECT * FROM test WHERE a = 0 AND b = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=9024 width=12)
                   (actual time=0.007..62.969 rows=49503 loops=1)
   Filter: ((a = 0) AND (b = 0))
   Rows Removed by Filter: 950497
 Planning time: 0.057 ms
 Execution time: 64.098 ms
(5 rows)

SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=2126 width=12)
                   (actual time=0.008..63.862 rows=40770 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 959230
 Planning time: 0.060 ms
 Execution time: 64.794 ms
(5 rows)


=============== with multivariate stats ============================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

db=> SELECT * FROM pg_mv_stats;
schemaname | public
tablename  | test
attnums    | 1 2 3
mcvbytes   | 25904
mcvinfo    | nitems=809
histbytes  | 568240
histinfo   | nbuckets=13772


SELECT * FROM test WHERE a = 0 AND b = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=47717 width=12)
                   (actual time=0.007..61.782 rows=49503 loops=1)
   Filter: ((a = 0) AND (b = 0))
   Rows Removed by Filter: 950497
 Planning time: 3.181 ms
 Execution time: 62.859 ms
(5 rows)


SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

                        QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22906.00 rows=40567 width=12)
                   (actual time=0.009..66.685 rows=40770 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 959230
 Planning time: 0.188 ms
 Execution time: 67.593 ms
(5 rows)


regards
Tomas

Attachment

multivar-stats-v1.patch

Re: WIP: multivariate statistics / proof of concept

From

Albe Laurenz

Date:

13 October 2014, 07:36:28

Tomas Vondra wrote:
> attached is a WIP patch implementing multivariate statistics.

I think that is pretty useful.
Oracle has an identical feature called "extended statistics".

That's probably an entirely different thing, but it would be very
nice to have statistics to estimate the correlation between columns
of different tables, to improve the estimate for the number of rows
in a join.

Yours,
Laurenz Albe

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

13 October 2014, 19:47:32

Hi!

On 13.10.2014 09:36, Albe Laurenz wrote:
> Tomas Vondra wrote:
>> attached is a WIP patch implementing multivariate statistics.
> 
> I think that is pretty useful.
> Oracle has an identical feature called "extended statistics".
> 
> That's probably an entirely different thing, but it would be very 
> nice to have statistics to estimate the correlation between columns 
> of different tables, to improve the estimate for the number of rows 
> in a join.

I don't have a clear idea of how that should work, but from the quick
look at how join selectivity estimation is implemented, I believe two
things might be possible:

(a) using conditional probabilities

    Say we have a join "ta JOIN tb ON (ta.x = tb.y)"

    Currently, the selectivity is derived from stats on the two keys.
    Essentially probabilities P(x), P(y), represented by the MCV lists.
    But if there are additional WHERE conditions on the tables, and we
    have suitable multivariate stats, it's possible to use conditional
    probabilities.

    E.g. if the query actually uses

        ... ta JOIN tb ON (ta.x = tb.y) WHERE ta.z = 10

    and we have stats on (ta.x, ta.z), we can use P(x|z=10) instead.
    If the two columns are correlated, this might be much different.

(b) using this for multi-column conditions

    If the join condition involves multiple columns, e.g.

        ON (ta.x = tb.y AND ta.p = tb.q)

    and we happen to have stats on (ta.x,ta.p) and (tb.y,tb.q), we may
    use this to compute the cardinality (pretty much as we do today).


But I haven't really worked on this so far, I suspect there are various
subtle issues and I certainly don't plan to address this in the first
phase of the patch.
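
To illustrate idea (a), here is a minimal sketch of deriving P(x|z) from a
two-column MCV list on (x, z), using P(x|z) = P(x,z) / P(z). The struct and
function names are hypothetical and only show the arithmetic, not how it
would plug into the join selectivity code.

```c
#include <stddef.h>

/* Hypothetical two-column MCV item on (x, z). */
typedef struct PairItem
{
    int         x;
    int         z;
    double      frequency;      /* fraction of rows with this (x, z) */
} PairItem;

/*
 * P(x = xval | z = zval) estimated from the MCV items alone:
 * joint frequency of (xval, zval) divided by the frequency of zval.
 * Returns 0 when zval does not appear in the list.
 */
double
conditional_probability(const PairItem *items, size_t nitems,
                        int xval, int zval)
{
    double      joint = 0.0;
    double      marginal = 0.0;
    size_t      i;

    for (i = 0; i < nitems; i++)
    {
        if (items[i].z != zval)
            continue;

        marginal += items[i].frequency;
        if (items[i].x == xval)
            joint += items[i].frequency;
    }

    return (marginal > 0.0) ? joint / marginal : 0.0;
}
```

When x and z are correlated, the conditional value can differ a lot from the
plain P(x), which is exactly the case where the extra statistics would help.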

Tomas

Re: WIP: multivariate statistics / proof of concept

From

David Rowley

Date:

29 October 2014, 09:41:16

On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

> Hi,
>
> attached is a WIP patch implementing multivariate statistics. The code
> certainly is not "ready" - parts of it look as if written by a rogue
> chimp who got bored of attempts to type the complete works of William
> Shakespeare, and decided to try something different.

I'm really glad you're working on this. I had been thinking of looking into doing this myself.

> The last point is really just "unfinished implementation" - the syntax I
> propose is this:
>
>     ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>
> where the options influence the MCV list and histogram size, etc. The
> options are recognized and may give you an idea of what it might do, but
> it's not really used at the moment (except for storing in the
> pg_mv_statistic catalog).

I've not really gotten around to looking at the patch yet, but I'm also wondering if it would be simple to allow functional statistics too. The pg_mv_statistic name seems to indicate multiple columns, but how about stats on date(datetime_column), or perhaps any non-volatile function? This would help to solve the problem highlighted here http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com . Without giving it too much thought, perhaps any expression that can be indexed should be allowed to have stats? Would that be really difficult to implement in comparison to what you've already done with the patch so far?

I'm quite interested in reviewing your work on this, but it appears that some of your changes are not C89:

src\backend\commands\analyze.c(3774): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(3774): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown size [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4302): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4302): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown size [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4775): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4775): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]

src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size [D:\Postgres\a\postgres.vcxproj]

The compiler I'm using is a bit too stupid to understand the C99 syntax.

I guess you'd need to palloc() these arrays instead in order to comply with the project standards.

http://www.postgresql.org/docs/devel/static/install-requirements.html

I'm going to sign myself up to review this, so probably my first feedback would be the compiling problem.

Regards

David Rowley

Re: WIP: multivariate statistics / proof of concept

From

"Tomas Vondra"

Date:

29 October 2014, 11:21:17

On 29 October 2014, 10:41, David Rowley wrote:
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple include allowing functional statistics
> too.
> The pg_mv_statistic name seems to indicate multi columns, but how about
> stats on date(datetime_column), or perhaps any non-volatile function. This
> would help to solve the problem highlighted here
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
> . Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult to
> implement in comparison to what you've already done with the patch so far?

I don't know, but it seems mostly orthogonal to what the patch aims to do.
If we add collecting statistics on expressions (on a single column), then I'd
expect it to be reasonably simple to add this to the multi-column case.

There are features like join stats or range type stats, that are probably
more directly related to the patch (but out of scope for the initial
version).

> I'm quite interested in reviewing your work on this, but it appears that
> some of your changes are not C89:
>
>  src\backend\commands\analyze.c(3774): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
> size [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2057: expected constant
> expression [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> [D:\Postgres\a\postgres.vcxproj]
>
> The compiler I'm using is a bit too stupid to understand the C99 syntax.
>
> I guess you'd need to palloc() these arrays instead in order to comply
> with
> the project standards.
>
> http://www.postgresql.org/docs/devel/static/install-requirements.html
>
> I'm going to sign myself up to review this, so probably my first feedback
> would be the compiling problem.

I'll look into that. The thing is I don't have access to MSVC, so it's a bit
difficult to spot / fix those issues :-(

regards
Tomas

Re: WIP: multivariate statistics / proof of concept

From

Petr Jelinek

Date:

29 October 2014, 11:32:00

On 29/10/14 10:41, David Rowley wrote:
> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>     The last point is really just "unfinished implementation" - the syntax I
>     propose is this:
>
>         ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>
>     where the options influence the MCV list and histogram size, etc. The
>     options are recognized and may give you an idea of what it might do, but
>     it's not really used at the moment (except for storing in the
>     pg_mv_statistic catalog).
>
>
>
> I've not really gotten around to looking at the patch yet, but I'm also
> wondering if it would be simple include allowing functional statistics
> too. The pg_mv_statistic name seems to indicate multi columns, but how
> about stats on date(datetime_column), or perhaps any non-volatile
> function. This would help to solve the problem highlighted here
> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
> . Without giving it too much thought, perhaps any expression that can be
> indexed should be allowed to have stats? Would that be really difficult
> to implement in comparison to what you've already done with the patch so
> far?
>

I would not over-complicate requirements for the first version of this, 
I think it's already complicated enough.

Quick look at the patch suggests that it mainly needs discussion about 
design and particular implementation choices, there is fair amount of 
TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have 
time to do in depth review in this CF.

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: multivariate statistics / proof of concept

From

"Tomas Vondra"

Date:

29 October 2014, 11:48:50

On 29 October 2014, 12:31, Petr Jelinek wrote:
> On 29/10/14 10:41, David Rowley wrote:
>> On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>     The last point is really just "unfinished implementation" - the
>> syntax I
>>     propose is this:
>>
>>         ALTER TABLE ... ADD STATISTICS (options) ON (columns)
>>
>>     where the options influence the MCV list and histogram size, etc.
>> The
>>     options are recognized and may give you an idea of what it might do,
>> but
>>     it's not really used at the moment (except for storing in the
>>     pg_mv_statistic catalog).
>>
>>
>>
>> I've not really gotten around to looking at the patch yet, but I'm also
>> wondering if it would be simple include allowing functional statistics
>> too. The pg_mv_statistic name seems to indicate multi columns, but how
>> about stats on date(datetime_column), or perhaps any non-volatile
>> function. This would help to solve the problem highlighted here
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
>> . Without giving it too much thought, perhaps any expression that can be
>> indexed should be allowed to have stats? Would that be really difficult
>> to implement in comparison to what you've already done with the patch so
>> far?
>>
>
> I would not over-complicate requirements for the first version of this,
> I think it's already complicated enough.

My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").

> Quick look at the patch suggests that it mainly needs discussion about
> design and particular implementation choices, there is fair amount of
> TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have
> time to do in depth review in this CF.

Yes. I think it's a bit premature to discuss the code thoroughly at this
point - I'd like to discuss the general approach to the feature (i.e.
minimizing the impact on those not using it, etc.).

The most interesting part of the code are probably the comments,
explaining the design in more detail, known shortcomings and possible ways
to address them.

regards
Tomas

Re: WIP: multivariate statistics / proof of concept

From

David Rowley

Date:

30 October 2014, 09:17:16

On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

> On 29 October 2014, 12:31, Petr Jelinek wrote:
> >> I've not really gotten around to looking at the patch yet, but I'm also
> >> wondering if it would be simple include allowing functional statistics
> >> too. The pg_mv_statistic name seems to indicate multi columns, but how
> >> about stats on date(datetime_column), or perhaps any non-volatile
> >> function. This would help to solve the problem highlighted here
> >> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
> >> . Without giving it too much thought, perhaps any expression that can be
> >> indexed should be allowed to have stats? Would that be really difficult
> >> to implement in comparison to what you've already done with the patch so
> >> far?
> >>
> >
> > I would not over-complicate requirements for the first version of this,
> > I think it's already complicated enough.
>
> My thoughts, exactly. I'm not willing to put more features into the
> initial version of the patch. Actually, I'm thinking about ripping out
> some experimental features (particularly "hashed MCV" and "associative
> rules").

That's fair, but I didn't really mean to imply that you should go work on that too and that it should be part of this patch.

I was thinking more along the lines that I don't really agree with the table name for the new stats, and that at some later date someone will want to add expression stats, so we'd probably better come up with a design that would be friendly towards that. At this time I can only think that the name of the table might not suit expression stats well. I'd hate to see someone have to invent a 3rd table to support these when we could likely come up with something that could be extended later and still make sense both today and in the future.

I was just looking at how expression indexes are stored in pg_index and I see that if it's an expression index that the expression is stored in the indexprs column which is of type pg_node_tree, so quite possibly at some point in the future the new stats table could just have an extra column added, and for today, we'd just need to come up with a future proof name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and source files something along those lines instead?

Regards

David Rowley

Re: WIP: multivariate statistics / proof of concept

From

David Rowley

Date:

30 October 2014, 09:23:47

On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

> On 29 October 2014, 10:41, David Rowley wrote:
> > I'm quite interested in reviewing your work on this, but it appears that
> > some of your changes are not C89:
> >
> > src\backend\commands\analyze.c(3774): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
> > size [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4302): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
> > size [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4775): error C2057: expected constant
> > expression [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
> > array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
> > src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
> > [D:\Postgres\a\postgres.vcxproj]
> >
>
> I'll look into that. The thing is I don't have access to MSVC, so it's a bit
> difficult to spot / fix those issues :-(

It should be a pretty simple fix, just use the files and line numbers from the above. It's just a problem that in those 3 places you're declaring an array of a variable size, which is not allowed in C89. The thing to do instead would just be to palloc() the size you need and then pfree() it when you're done.
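
A minimal sketch of that kind of change, under a few assumptions: the thread
does not show the declarations at the failing lines, so the int element type
and the function below are made up for illustration, and malloc()/free()
stand in for palloc()/pfree() so the example compiles outside the backend.

```c
#include <stdlib.h>

/*
 * C99 form that a C89-only compiler rejects (variable-length array):
 *
 *     int keys[nkeys];
 *
 * C89-compatible form: allocate on the heap and release explicitly.
 * Inside the backend this would be palloc()/pfree(); malloc()/free()
 * are used here only so the sketch builds standalone.
 */
int
sum_of_squares(int nkeys)
{
    int        *keys;
    int         i;
    int         sum = 0;

    if (nkeys <= 0)
        return 0;

    keys = (int *) malloc(sizeof(int) * (size_t) nkeys);
    if (keys == NULL)
        return -1;              /* palloc() raises an error instead */

    for (i = 0; i < nkeys; i++)
        keys[i] = i * i;

    for (i = 0; i < nkeys; i++)
        sum += keys[i];

    free(keys);
    return sum;
}
```

Note that all declarations sit at the top of the block, which C89 also
requires, and that with palloc() the NULL check is unnecessary because an
out-of-memory condition is reported as an error rather than returned.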

Regards

David Rowley

Re: WIP: multivariate statistics / proof of concept

From

"Tomas Vondra"

Date:

30 October 2014, 10:29:49

On 30 October 2014, 10:17, David Rowley wrote:
> On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> On 29 October 2014, 12:31, Petr Jelinek wrote:
>> >> I've not really gotten around to looking at the patch yet, but I'm
>> also
>> >> wondering if it would be simple include allowing functional
>> statistics
>> >> too. The pg_mv_statistic name seems to indicate multi columns, but
>> how
>> >> about stats on date(datetime_column), or perhaps any non-volatile
>> >> function. This would help to solve the problem highlighted here
>> >>
>> http://www.postgresql.org/message-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
>> >> . Without giving it too much thought, perhaps any expression that can
>> be
>> >> indexed should be allowed to have stats? Would that be really
>> difficult
>> >> to implement in comparison to what you've already done with the patch
>> so
>> >> far?
>> >>
>> >
>> > I would not over-complicate requirements for the first version of
>> this,
>> > I think it's already complicated enough.
>>
>> My thoughts, exactly. I'm not willing to put more features into the
>> initial version of the patch. Actually, I'm thinking about ripping out
>> some experimental features (particularly "hashed MCV" and "associative
>> rules").
>>
>>
> That's fair, but I didn't really mean to imply that you should go work on
> that too and that it should be part of this patch..
> I was thinking more along the lines of that I don't really agree with the
> table name for the new stats and that at some later date someone will want
> to add expression stats and we'd probably better come up design that would
> be friendly towards that. At this time I can only think that the name of
> the table might not suit well to expression stats, I'd hate to see someone
> have to invent a 3rd table to support these when we could likely come up
> with something that could be extended later and still make sense both
> today
> and in the future.
>
> I was just looking at how expression indexes are stored in pg_index and I
> see that if it's an expression index that the expression is stored in
> the indexprs column which is of type pg_node_tree, so quite possibly at
> some point in the future the new stats table could just have an extra
> column added, and for today, we'd just need to come up with a future proof
> name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
> source files something along those lines instead?

Ah, OK. I don't think the catalog name "pg_mv_statistic" is somehow
inappropriate for this purpose, though. IMHO the "multivariate" does not
mean "only columns" or "no expressions", it simply describes that the
approximated density function has multiple input variables, be it
attributes or expressions.

But maybe there's a better name.

Tomas

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

10 November 2014, 02:35:13

On 30.10.2014 10:23, David Rowley wrote:
> On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>     On 29 October 2014, 10:41, David Rowley wrote:
>     > I'm quite interested in reviewing your work on this, but it appears
>     > that some of your changes are not C89:
>     >
>     >  src\backend\commands\analyze.c(3774): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(3774): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown size [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown size [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2057: expected constant expression [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2466: cannot allocate an array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
>     >  src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size [D:\Postgres\a\postgres.vcxproj]
>     >
>
> I'll look into that. The thing is I don't have access to MSVC, so
> it's a bit difficult to spot / fix those issues :-(
>
>
> It should be a pretty simple fix, just use the files and line
> numbers from the above. It's just a problem that in those 3 places
> you're declaring an array of a variable size, which is not allowed in
> C89. The thing to do instead would just be to palloc() the size you
> need and then pfree() it when you're done.

Attached is a patch that should fix these issues.
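
For anyone hitting the same MSVC errors, here is a minimal standalone sketch of the kind of change David describes. It is not taken from the patch: the helper name count_distinct() is made up for illustration, and malloc()/free() stand in for palloc()/pfree() so the example compiles outside the backend.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for ints */
static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: count the distinct values in values[0..n-1].
 *
 * C99 would allow "int tmp[n];" here, but a C89 compiler such as MSVC
 * rejects it with C2057/C2466/C2133. Allocating the scratch array on
 * the heap avoids the variable-length array. In the backend this would
 * be palloc()/pfree(); malloc()/free() are stand-ins for this sketch.
 */
int
count_distinct(const int *values, int n)
{
    int    *tmp;
    int     i;
    int     ndistinct;

    if (n <= 0)
        return 0;

    tmp = (int *) malloc(n * sizeof(int));      /* was: int tmp[n]; */
    if (tmp == NULL)
        return -1;

    memcpy(tmp, values, n * sizeof(int));
    qsort(tmp, n, sizeof(int), cmp_int);

    /* after sorting, each change of value starts a new distinct value */
    ndistinct = 1;
    for (i = 1; i < n; i++)
        if (tmp[i] != tmp[i - 1])
            ndistinct++;

    free(tmp);                                  /* release explicitly */
    return ndistinct;
}
```

Calling count_distinct() on the array {3, 1, 3, 2, 1} with n = 5 returns 3. The only difference from the C99 version is the explicit allocation and release of the scratch array.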

The bad news is that there are a few installcheck failures (they were there
with the previous patch too, but I hadn't noticed for some reason). Apparently,
there's some mixup in how the patch handles Var->varno in some cases,
causing issues with a handful of regression tests.

The problem is that is_mv_compatible (checking whether the condition is
compatible with multivariate stats) does this

    if (! ((varRelid == 0) || (varRelid == var->varno)))
        return false;

    /* Also skip special varno values, and system attributes ... */
    if ((IS_SPECIAL_VARNO(var->varno)) ||
        (! AttrNumberIsForUserDefinedAttr(var->varattno)))
        return false;

assuming that after this, varno represents an index into the range
table, and passes it out to the caller.

And the caller (collect_mv_attnums) does this:

    RelOptInfo *rel = find_base_rel(root, varno);

which fails with errors like these:

    ERROR:  no relation entry for relid 0
    ERROR:  no relation entry for relid 1880

or whatever. What's even stranger is this:

regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-#    WHERE table_name = 'rw_view1';
ERROR:  no relation entry for relid 0
regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-# ;
regression=#   SELECT table_name, is_updatable, is_insertable_into
regression-#     FROM information_schema.views
regression-#    WHERE table_name = 'rw_view1';
 table_name | is_updatable | is_insertable_into
------------+--------------+--------------------
(0 rows)

regression=# explain  SELECT table_name, is_updatable, is_insertable_into
    FROM information_schema.views
   WHERE table_name = 'rw_view1';
ERROR:  no relation entry for relid 0


So, the query fails. After removing the WHERE clause it works, and this
somehow fixes the original query (with the WHERE clause). Nevertheless,
I still can't do explain on the query.

Clearly, I'm doing something wrong. I suspect it's caused either by
conditions involving function calls, or the fact that the view is a join
of multiple tables. But what?

For simple queries (single table, ...) it seems to be working fine.

regards
Tomas

Attachment

multivar-stats-v2.patch

Re: WIP: multivariate statistics / proof of concept

From

Simon Riggs

Date:

13 November 2014, 11:31:32

On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:

> It however seems to be working sufficiently well at this point, enough
> to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?

If so, I will work with you to review this patch.

One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: multivariate statistics / proof of concept

From

"Tomas Vondra"

Date:

13 November 2014, 13:12:00

On 13 November 2014, 12:31, Simon Riggs wrote:
> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.

> If so, I will work with you to review this patch.

Thanks!

> One aspect of the patch that seems to be missing is a user declaration
> of correlation, just as we have for setting n_distinct. It seems like
> an even easier place to start to just let the user specify the stats
> declaratively. That way we can split the patch into two parts. First,
> allow multi column stats that are user declared. Then add user stats
> collected by ANALYZE. The first part is possibly contentious and thus
> a good initial focus. The second part will have lots of discussion, so
> good to skip for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more likely to go stale after new data is
added to the table.

Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work is the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).

But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas

Re: WIP: multivariate statistics / proof of concept

From

Katharina Büchse

Date:

13 November 2014, 15:51:30

On 13.11.2014 14:11, Tomas Vondra wrote:
> On 13 November 2014, 12:31, Simon Riggs wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
> Sure. I planned to do such verification/demonstration anyway, after
> discussing the overall approach.
>
> I planned to give it a try on TPC-DS, but I can start with the TPC-H
> queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
> come from column correlation though - if it's due to some other issues
> (e.g. conditions that are difficult to estimate), this patch can't do
> anything with them. But it's a good start.
>
>> If so, I will work with you to review this patch.
> Thanks!
>
>> One aspect of the patch that seems to be missing is a user declaration
>> of correlation, just as we have for setting n_distinct. It seems like
>> an even easier place to start to just let the user specify the stats
>> declaratively. That way we can split the patch into two parts. First,
>> allow multi column stats that are user declared. Then add user stats
>> collected by ANALYZE. The first part is possibly contentious and thus
>> a good initial focus. The second part will have lots of discussion, so
>> good to skip for a first version.
> I'm not a big fan of this approach, for a number of reasons.
>
> Firstly, it only works for "simple" parameters that are trivial to specify
> (say, Pearson's correlation coefficient), and the patch does not work with
> those at all - it only works with histograms, MCV lists (and might work
> with associative rules in the future). And we certainly can't ask users to
> specify multivariate histograms - because it's very difficult to do, and
> also because complex stats are more susceptible to get stale after adding
> new data to the table.
>
> Secondly, even if we add such "simple" parameters to the patch, we have to
> come up with a  way to apply those parameters to the estimates. The
> problem is that as the parameters get simpler, it's less and less useful
> to compute the stats.
>
> Another question is whether it should support more than 2 columns ...
>
> The only place where I think this might work are the associative rules.
> It's simple to specify rules like ("ZIP code" implies "city") and we could
> even do some simple check against the data to see if it actually makes
> sense (and 'disable' the rule if not).
and even this simple example has its limits: at least in Germany, ZIP
codes are not unique in rural areas, where several villages share the
same ZIP code.

I guess there are just a few examples where columns are completely
functionally dependent without any exceptions.
But of course, if the user gives this information just for optimizing
the statistics, some exceptions don't matter.
If this information is to be used for creating different execution
plans (e.g. there is an index on column A and column B is functionally
dependent, so one could think about using this index on A and the
dependency instead of running through the whole table to find all tuples
that fit the query on column B), exceptions are a very important issue.
>
> But maybe I got it wrong and you have something particular in mind? Can
> you give an example of how it would work?
>
> regards
> Tomas
>
>
>


-- 
Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367
Webseite http://users.minet.uni-jena.de/~re89qen/

Re: WIP: multivariate statistics / proof of concept

From

"Tomas Vondra"

Date:

13 November 2014, 16:42:34

On 13 November 2014, 16:51, Katharina Büchse wrote:
> On 13.11.2014 14:11, Tomas Vondra wrote:
>
>> The only place where I think this might work are the associative rules.
>> It's simple to specify rules like ("ZIP code" implies "city") and we
>> could
>> even do some simple check against the data to see if it actually makes
>> sense (and 'disable' the rule if not).
>
> and even this simple example has its limits, at least in Germany ZIP
> codes are not unique for rural areas, where several villages have the
> same ZIP code.
>
> I guess there are just a few examples where columns are completely
> functional dependent without any exceptions.
> But of course, if the user gives this information just for optimization
> the statistics, some exceptions don't matter.
> If this information should be used for creating different execution
> plans (e.g. on column A is an index and column B is functional
> dependent, one could think about using this index on A and the
> dependency instead of running through the whole table to find all tuples
> that fit the query on column B), exceptions are a very important issue.

Yes, exactly. The aim of this patch is "only" improving estimates, not
removing conditions from the plan (e.g. checking only the ZIP code and not
the city name). That certainly can't be done solely based on approximate
statistics, and as you point out most real-world data either contain bugs
or are inherently imperfect (we have the same kind of ZIP/city
inconsistencies in the Czech Republic). That's not a big issue for estimates
(assuming only a small fraction of rows violates the rule) though.
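
To make "fraction of rows violating the rule" concrete, here is a rough sketch of one way to measure it. This is an illustration only and not how the patch detects dependencies: the function name and the integer coding of column values are assumptions, and the brute-force loops are meant for tiny inputs.

```c
#include <stddef.h>

/*
 * Fraction of rows that contradict the rule "a implies b". For each
 * distinct value of a, the most frequent b is treated as the expected
 * one, and every other row with that a counts as a violation.
 *
 * a and b hold integer-coded column values for n rows. The nested
 * loops are cubic in the worst case, which is fine for a sketch.
 */
double
dependency_violation_fraction(const int *a, const int *b, size_t n)
{
    size_t  i, j, k;
    size_t  violations = 0;

    for (i = 0; i < n; i++)
    {
        size_t  group = 0;      /* rows sharing a[i] */
        size_t  best = 0;       /* size of the largest (a, b) subgroup */
        int     first = 1;

        /* handle each distinct a only once, at its first occurrence */
        for (j = 0; j < i; j++)
        {
            if (a[j] == a[i])
            {
                first = 0;
                break;
            }
        }
        if (!first)
            continue;

        for (j = 0; j < n; j++)
        {
            size_t  same = 0;

            if (a[j] != a[i])
                continue;
            group++;

            /* count rows carrying the same (a, b) pair as row j */
            for (k = 0; k < n; k++)
                if (a[k] == a[i] && b[k] == b[j])
                    same++;
            if (same > best)
                best = same;
        }
        violations += group - best;
    }

    return n ? (double) violations / (double) n : 0.0;
}
```

A perfect dependency yields 0.0. A result close to zero is the case described above, where treating the rule as valid for estimation is harmless even though it could never justify dropping a condition from the plan.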

Tomas

Re: WIP: multivariate statistics / proof of concept

From

Kevin Grittner

Date:

15 November 2014, 17:50:45

Tomas Vondra <tv@fuzzy.cz> wrote:
> On 13 November 2014, 16:51, Katharina Büchse wrote:
>> On 13.11.2014 14:11, Tomas Vondra wrote:
>>
>>> The only place where I think this might work are the associative rules.
>>> It's simple to specify rules like ("ZIP code" implies "city") and we could
>>> even do some simple check against the data to see if it actually makes
>>> sense (and 'disable' the rule if not).
>>
>> and even this simple example has its limits, at least in Germany ZIP
>> codes are not unique for rural areas, where several villages have the
>> same ZIP code.

> as you point out most real-world data either contain bugs
> or are inherently imperfect (we have the same kind of ZIP/city
> inconsistencies in Czech).

You can have lots of fun with U.S. zip codes, too. Just on the
nominally "Madison, Wisconsin" zip codes (those starting with 537),
there are several exceptions:

select zipcode, city, locationtype
from zipcode
where zipcode like '537%'
and Decommisioned = 'false'
and zipcodetype = 'STANDARD'
and locationtype in ('PRIMARY', 'ACCEPTABLE')
order by zipcode, city;

 zipcode |   city    | locationtype
---------+-----------+--------------
 53703   | MADISON   | PRIMARY
 53704   | MADISON   | PRIMARY
 53705   | MADISON   | PRIMARY
 53706   | MADISON   | PRIMARY
 53711   | FITCHBURG | ACCEPTABLE
 53711   | MADISON   | PRIMARY
 53713   | FITCHBURG | ACCEPTABLE
 53713   | MADISON   | PRIMARY
 53713   | MONONA    | ACCEPTABLE
 53714   | MADISON   | PRIMARY
 53714   | MONONA    | ACCEPTABLE
 53715   | MADISON   | PRIMARY
 53716   | MADISON   | PRIMARY
 53716   | MONONA    | ACCEPTABLE
 53717   | MADISON   | PRIMARY
 53718   | MADISON   | PRIMARY
 53719   | FITCHBURG | ACCEPTABLE
 53719   | MADISON   | PRIMARY
 53725   | MADISON   | PRIMARY
 53726   | MADISON   | PRIMARY
 53744   | MADISON   | PRIMARY
(21 rows)

If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".

If you want to have your own fun with this data, you can download
it here:

http://federalgovernmentzipcodes.us/free-zipcode-database.csv

I was able to load it into PostgreSQL with this:

create table zipcode
(
    recordnumber integer not null,
    zipcode text not null,
    zipcodetype text not null,
    city text not null,
    state text not null,
    locationtype text not null,
    lat double precision,
    long double precision,
    xaxis double precision not null,
    yaxis double precision not null,
    zaxis double precision not null,
    worldregion text not null,
    country text not null,
    locationtext text,
    location text,
    decommisioned text not null,
    taxreturnsfiled bigint,
    estimatedpopulation bigint,
    totalwages bigint,
    notes text
);
comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)';
comment on column zipcode.city is 'offical city name(s)';
comment on column zipcode.state is 'offical state, territory, or quasi-state (AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current Zipcode;If not Primary, Yes implies Historical Placename';
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008';
copy zipcode from 'filepath' with (format csv, header);
alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);

I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables.  With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

15 November 2014, 18:36:11

On 15.11.2014 18:49, Kevin Grittner wrote:
> If you eliminate the quals besides the zipcode column you get 61
> rows and it gets much stranger, with legal municipalities that are
> completely surrounded by Madison that the postal service would
> rather you didn't use in addressing your envelopes, but they have
> to deliver to anyway, and organizations inside Madison receiving
> enough mail to (literally) have their own zip code -- where the
> postal service allows the organization name as a deliverable
> "city".
> 
> If you want to have your own fun with this data, you can download
> it here:
> 
> http://federalgovernmentzipcodes.us/free-zipcode-database.csv
>
...
> 
> I bet there are all sorts of correlation possibilities with, for
> example, latitude and longitude and other variables.  With 81831
> rows and so many correlations among the columns, it might be a
> useful data set to test with.

Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.

The current version of the patch supports only data types passed by
value (i.e. no varlena types such as text), which means it's impossible to
build multivariate stats on some of the interesting columns (state,
city, ...).

I guess it's time to start working on removing this limitation.

Tomas

Re: WIP: multivariate statistics / proof of concept

From

Michael Paquier

Date:

08 December 2014, 01:01:53

On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Thanks for the link. I've been looking for a good dataset with such
> data, and this one is by far the best one.
>
> The current version of the patch supports only data types passed by
> value (i.e. no varlena types - text, ), which means it's impossible to
> build multivariate stats on some of the interesting columns (state,
> city, ...).
>
> I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a "Needs
Review" state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Regards,
-- 
Michael

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

09 December 2014, 20:16:03

On 8.12.2014 02:01, Michael Paquier wrote:
> On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Thanks for the link. I've been looking for a good dataset with such
>> data, and this one is by far the best one.
>>
>> The current version of the patch supports only data types passed by
>> value (i.e. no varlena types - text, ), which means it's impossible to
>> build multivariate stats on some of the interesting columns (state,
>> city, ...).
>>
>> I guess it's time to start working on removing this limitation.
> Tomas, what's your status on this patch? Are you planning to make it
> more complicated than it is? For now I have switched it to a "Needs
> Review" state because even your first version did not get advanced
> review (that's quite big btw). I guess that we should switch it to the
> next CF.

Hello Michael,

I agree with moving the patch to the next CF - I'm working on the patch,
but I will take a bit more time to submit a new version and I can do
that in the next CF.

regards
Tomas

Re: WIP: multivariate statistics / proof of concept

From

Heikki Linnakangas

Date:

11 December 2014, 16:54:19

On 10/13/2014 01:00 AM, Tomas Vondra wrote:
> Hi,
>
> attached is a WIP patch implementing multivariate statistics.

Great! Really glad to see you working on this.

> +     * FIXME This sample sizing is mostly OK when computing stats for
> +     *       individual columns, but when computing multi-variate stats
> +     *       for multivariate stats (histograms, mcv, ...) it's rather
> +     *       insufficient. For small number of dimensions it works, but
> +     *       for complex stats it'd be nice use sample proportional to
> +     *       the table (say, 0.5% - 1%) instead of a fixed size.

I don't think a fraction of the table is appropriate. As long as the 
sample is random, the accuracy of a sample doesn't depend much on the 
size of the population. For example, if you sample 1,000 rows from a 
table with 100,000 rows, or 1,000 rows from a table with 100,000,000
rows, the accuracy is pretty much the same. That doesn't change when you 
go from a single variable to multiple variables.
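
As a rough numeric check of this point for the simplest case (estimating the fraction p of rows matching a condition from a simple random sample drawn without replacement), the textbook sampling variance with the finite population correction can be computed directly. The function name is made up for this sketch, and it says nothing about harder estimators such as the number of distinct values.

```c
/*
 * Sampling variance of a proportion p estimated from a simple random
 * sample of n rows drawn without replacement from a table of N rows:
 *
 *     Var = p * (1 - p) / n * (N - n) / (N - 1)
 *
 * The last factor is the finite population correction. The standard
 * error is the square root of the returned value.
 */
double
sampling_variance(double p, double n, double N)
{
    return p * (1.0 - p) / n * (N - n) / (N - 1.0);
}
```

With p = 0.5 and n = 1,000, the variance for N = 100,000 is about 1% smaller than for N = 100,000,000, so the two standard errors differ by roughly half a percent. The table size only matters once the sample becomes a noticeable fraction of the table.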

You do need a bigger sample with multiple variables, however. My gut 
feeling is that if you sample N rows for a single variable, with two 
variables you need to sample N^2 rows to get the same accuracy. But it's 
not proportional to the table size. (I have no proof for that, but I'm 
sure there is literature on this.)

> + * Multivariate histograms
> + *
> + * Histograms are a collection of buckets, represented by n-dimensional
> + * rectangles. Each rectangle is delimited by an array of lower and
> + * upper boundaries, so that for for the i-th attribute
> + *
> + *     min[i] <= value[i] <= max[i]
> + *
> + * Each bucket tracks frequency (fraction of tuples it contains),
> + * information about the inequalities, number of distinct values in
> + * each dimension (which is used when building the histogram) etc.
> + *
> + * The boundaries may be either inclusive or exclusive, or the whole
> + * dimension may be NULL.
> + *
> + * The buckets may overlap (assuming the build algorithm keeps the
> + * frequencies additive) or may not cover the whole space (i.e. allow
> + * gaps). This entirely depends on the algorithm used to build the
> + * histogram.

That sounds pretty exotic. These buckets are quite different from the 
single-dimension buckets we currently have.

The paper you reference in partition_bucket() function, M. 
Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
1988: 28-36, actually doesn't mention overlapping buckets at all. I 
haven't read the code in detail, but if it implements the algorithm from 
that paper, there will be no overlap.

- Heikki

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

11 December 2014, 20:08:07

On 11.12.2014 17:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
> 
> Great! Really glad to see you working on this.
> 
>> +     * FIXME This sample sizing is mostly OK when computing stats for
>> +     *       individual columns, but when computing multi-variate stats
>> +     *       for multivariate stats (histograms, mcv, ...) it's rather
>> +     *       insufficient. For small number of dimensions it works, but
>> +     *       for complex stats it'd be nice use sample proportional to
>> +     *       the table (say, 0.5% - 1%) instead of a fixed size.
> 
> I don't think a fraction of the table is appropriate. As long as the 
> sample is random, the accuracy of a sample doesn't depend much on
> the size of the population. For example, if you sample 1,000 rows
> from a table with 100,000 rows, or 1000 rows from a table with
> 100,000,000 rows, the accuracy is pretty much the same. That doesn't
> change when you go from a single variable to multiple variables.

I might be wrong, but I doubt that. First, I read a number of papers
while working on this patch, and all of them used samples proportional
to the data set. That's only indirect evidence, though.

> You do need a bigger sample with multiple variables, however. My gut 
> feeling is that if you sample N rows for a single variable, with two 
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

Maybe. I think it's related to the number of buckets (which
to some extent determines the precision of the histogram). If you want 1000
buckets, the number of rows scanned needs to be e.g. 10x that. With
multi-variate histograms, we may shoot for more buckets (say, 100 in
each dimension).
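
Taking the numbers above at face value (about 10 sampled rows per bucket, 100 buckets per dimension), the arithmetic can be sketched as follows. The function name is hypothetical and the "10x" factor is only the rule of thumb from this message, not something derived.

```c
/*
 * Rows to sample for a histogram with buckets_per_dim buckets in each
 * of ndims dimensions, at rows_per_bucket sampled rows per bucket.
 * Returns 0 if ndims is 0. Overflow is not checked in this sketch.
 */
unsigned long long
sample_rows_needed(unsigned int buckets_per_dim, unsigned int ndims,
                   unsigned int rows_per_bucket)
{
    unsigned long long buckets = 1;
    unsigned int i;

    if (ndims == 0)
        return 0;

    /* total number of buckets is buckets_per_dim ^ ndims */
    for (i = 0; i < ndims; i++)
        buckets *= buckets_per_dim;

    return buckets * rows_per_bucket;
}
```

A single column with 1000 buckets gives 10,000 rows; 100 buckets in each of two dimensions gives 100,000 rows; three dimensions give 10,000,000 rows. For comparison, ANALYZE currently samples 300 rows per unit of statistics target, i.e. 30,000 rows at the default target of 100.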

> 
>> + * Multivariate histograms
>> + *
>> + * Histograms are a collection of buckets, represented by n-dimensional
>> + * rectangles. Each rectangle is delimited by an array of lower and
>> + * upper boundaries, so that for for the i-th attribute
>> + *
>> + *     min[i] <= value[i] <= max[i]
>> + *
>> + * Each bucket tracks frequency (fraction of tuples it contains),
>> + * information about the inequalities, number of distinct values in
>> + * each dimension (which is used when building the histogram) etc.
>> + *
>> + * The boundaries may be either inclusive or exclusive, or the whole
>> + * dimension may be NULL.
>> + *
>> + * The buckets may overlap (assuming the build algorithm keeps the
>> + * frequencies additive) or may not cover the whole space (i.e. allow
>> + * gaps). This entirely depends on the algorithm used to build the
>> + * histogram.
> 
> That sounds pretty exotic. These buckets are quite different from
> the single-dimension buckets we currently have.
> 
> The paper you reference in partition_bucket() function, M. 
> Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating 
> Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 
> 1988: 28-36, actually doesn't mention overlapping buckets at all. I 
> haven't read the code in detail, but if it implements the algorithm
> from that paper, there will be no overlap.

The algorithm implemented in partition_bucket() is very simple and
naive, and it mostly resembles the algorithm described in the paper. I'm
sure there are differences, it's not a 1:1 implementation, but you're
right it produces non-overlapping buckets.

The point is that I envision more complex algorithms or different
histogram types, and some of them may produce overlapping buckets. Maybe
that comment is premature, and it will turn out it's not really necessary.
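
To illustrate what "overlapping" means for the n-dimensional buckets discussed above, here is a simplified sketch. The Bucket struct is hypothetical and much simpler than the one in the patch: bounds are always inclusive, values are plain doubles, and NULL dimensions are ignored.

```c
#include <stdbool.h>

#define MAX_DIMS 8      /* arbitrary limit for this sketch */

/* Hypothetical n-dimensional bucket: a box with inclusive bounds. */
typedef struct Bucket
{
    int     ndims;
    double  min[MAX_DIMS];
    double  max[MAX_DIMS];
    double  frequency;      /* fraction of sampled tuples in the box */
} Bucket;

/* true if min[i] <= value[i] <= max[i] holds in every dimension */
bool
bucket_contains(const Bucket *b, const double *value)
{
    int     i;

    for (i = 0; i < b->ndims; i++)
        if (value[i] < b->min[i] || value[i] > b->max[i])
            return false;
    return true;
}

/*
 * true if the two boxes share at least one point; both buckets are
 * assumed to have the same ndims. With inclusive bounds, two boxes
 * that merely touch at a boundary also count as overlapping.
 */
bool
buckets_overlap(const Bucket *x, const Bucket *y)
{
    int     i;

    for (i = 0; i < x->ndims; i++)
        if (x->max[i] < y->min[i] || y->max[i] < x->min[i])
            return false;
    return true;
}
```

Two boxes are disjoint as soon as they are separated in any single dimension. A partitioning algorithm that only ever splits an existing bucket along one dimension, as the equi-depth approach does, cannot produce boxes that share interior points, and it leaves no gaps inside the region it started from.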

regards
Tomas

Re: WIP: multivariate statistics / proof of concept

From

Michael Paquier

Date:

15 December 2014, 02:55:34

On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> I agree with moving the patch to the next CF - I'm working on the patch,
> but I will take a bit more time to submit a new version and I can do
> that in the next CF.
OK cool. I just moved it myself; I didn't see it registered in 2014-12 yet.
Thanks,
-- 
Michael

Re: WIP: multivariate statistics / proof of concept

From

Michael Paquier

Date:

15 January 2015, 08:00:32

On Mon, Dec 15, 2014 at 11:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 10, 2014 at 5:15 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I agree with moving the patch to the next CF - I'm working on the patch,
>> but I will take a bit more time to submit a new version and I can do
>> that in the next CF.
> OK cool. I just moved it by myself. I didn't see it yet registered in 2014-12.
Marked as returned with feedback. No new version showed up in the last
month and this patch was waiting for input from author.
-- 
Michael

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

24 January 2015, 20:22:14

Hi,

attached is an updated version of the multivariate stats patch. This is
going to be a somewhat longer mail, so here's a small ToC ;-)

1) patch split into 4 parts
2) where to start / documentation
3) state of the code
4) main changes/improvements
5) remaining limitations

The motivation and design ideas, explained in the first message of this
thread, are still valid. It might be a good idea to read it first:

  http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz

BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
into the patch in person, or discuss the patch in general.


1) Patch split into 4 parts
---------------------------
Firstly, the patch got broken into the following four pieces, to make
the reviews somewhat easier:

1) 0001-shared-infrastructure-and-functional-dependencies.patch

   - infrastructure, shared by all the kinds of stats added
     in the following patches (catalog, ALTER TABLE, ANALYZE ...)

   - implementation of a simple statistic, tracking functional
     dependencies between columns (previously called "associative
     rules", but that's incorrect for several reasons)

   - this does not modify the optimizer in any way

2) 0002-clause-reduction-using-functional-dependencies.patch

   - applies the functional dependencies in the optimizer (i.e. considers
     the rules in clauselist_selectivity())

3) 0003-multivariate-MCV-lists.patch

   - multivariate MCV lists (both ANALYZE and optimizer parts)

4) 0004-multivariate-histograms.patch

   - multivariate histograms (both ANALYZE and optimizer parts)
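
As a toy illustration of what "clause reduction" in part 2 means, the sketch below drops equality clauses that are implied by another clause through a functional dependency and multiplies the selectivities of the rest. It is not the code from the patch (that lives in clauselist_apply_dependencies()); all names here are made up, the dependency list is assumed to be transitively closed, and the constants in the clauses are assumed to be consistent with each other.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLAUSES 32  /* arbitrary limit for this sketch */

/* "attribute = constant" clause with its per-column selectivity */
typedef struct EqClause
{
    int     attnum;
    double  selectivity;
} EqClause;

/* functional dependency: the value of "from" determines "to" */
typedef struct FuncDep
{
    int     from;
    int     to;
} FuncDep;

/*
 * Multiply the selectivities of all clauses that are not implied by
 * another clause still in the list. Returns -1.0 if the list is too
 * long for this sketch.
 */
double
reduced_selectivity(const EqClause *clauses, size_t nclauses,
                    const FuncDep *deps, size_t ndeps)
{
    bool    dropped[MAX_CLAUSES] = {false};
    double  sel = 1.0;
    size_t  i, j, d;

    if (nclauses > MAX_CLAUSES)
        return -1.0;

    for (j = 0; j < nclauses; j++)
    {
        for (d = 0; d < ndeps && !dropped[j]; d++)
        {
            if (deps[d].to != clauses[j].attnum)
                continue;

            /*
             * Drop clause j only if the determining attribute has a
             * clause that is itself still kept; this keeps one clause
             * alive when two attributes determine each other.
             */
            for (i = 0; i < nclauses; i++)
            {
                if (i != j && !dropped[i] &&
                    clauses[i].attnum == deps[d].from)
                {
                    dropped[j] = true;
                    break;
                }
            }
        }
    }

    for (j = 0; j < nclauses; j++)
        if (!dropped[j])
            sel *= clauses[j].selectivity;

    return sel;
}
```

For the example from the first message (three clauses with selectivity 1/100 each), the independence assumption gives 1/1000000, i.e. 1 row out of 1,000,000. With dependencies a => b and a => c, only the clause on a remains, giving 1/100, i.e. the 10,000 rows the query actually returns.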


You may look at the patches on GitHub here:

  https://github.com/tvondra/postgres/tree/multivariate-stats-squashed

The branch is not stable, i.e. I'll rebase / squash / force-push changes
in the future. (There's also multivariate-stats development branch with
unsquashed changes, but you don't want to look at that, trust me.)

The patches are not exactly small (being in the 50-100 kB range), but
that's mostly because of the amount of comments explaining the goals and
implementation details.


2) Where to start / documentation
---------------------------------
I've tried to document all the pieces properly, mostly in the form of
comments. There's no sgml documentation at this point, which should
obviously change in the future.

Anyway, I'd suggest reading the first e-mail in this thread, explaining
the ideas, and then these comments:

1) functional dependencies (patch 0001)
   - src/backend/utils/mvstats/dependencies.c

2) MCV lists (patch 0003)
   - src/backend/utils/mvstats/mcv.c

3) histograms (patch 0004)
   - src/backend/utils/mvstats/histogram.c

   - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
   - also see clauselist_mv_selectivity_histogram() in clausesel.c

4) selectivity estimation (patches 0002-0004)
   - all in src/backend/optimizer/path/clausesel.c
   - clauselist_selectivity() - overview of how the stats are applied
   - clauselist_apply_dependencies() - functional dependencies reduction
   - clauselist_mv_selectivity_mcvlist() - MCV list estimation
   - clauselist_mv_selectivity_histogram() - histogram estimation


3) State of the code
--------------------
I've spent a fair amount of time testing the patches, and while I
believe there are no segfaults or so, I know parts of the code need a
bit more love.

The part most in need of improvements / comments is probably the code in
clausesel.c - that seems a bit quirky. Reviews / comments regarding this
part of the code are very welcome - I'm sure there are many ways to
improve this part.

There are a few FIXMEs elsewhere (e.g. about memory allocation in the
(de)serialization code), but those are mostly well-defined issues that I
know how to address (at least I believe so).


4) Main changes/improvements
----------------------------
There are many significant improvements. The previous patch version was
in the 'proof of concept' category (missing pieces, knowingly broken in
some areas), the current patch should 'mostly work'.

The patch fixes the three most annoying limitations of the first version:

  (a) support for all data types (not just those passed by value)
  (b) handles NULL values properly
  (c) adds support for IS [NOT] NULL clauses

Aside from that the code was significantly improved, there are proper
regression tests and plenty of comments explaining the details.


5) Remaining limitations
------------------------

  (a) limited to stats on 8 columns

      This is mostly just a 'safeguard' restriction.

  (b) only data types with '<' operator

      I don't think this will change anytime soon, because all the
      algorithms for building the stats rely on this. I don't see
      this as a serious limitation though.

  (c) not handling DROP COLUMN or DROP TABLE and so on

      Currently this is not handled at all (so the regression tests
      do an explicit DELETE from the pg_mv_statistic catalog).

      Handling the DROP TABLE won't be difficult, it's similar to the
      current stats. Handling ALTER TABLE ... DROP COLUMN will be much
      more tricky I guess - should we drop all the stats referencing
      that column, or should we just remove it from the stats? Or
      should we keep it and treat it as NULL? Not sure what's the best
      solution.

  (d) limited list of compatible WHERE clauses

      The initial patch handled only simple operator clauses

          (Var op Constant)

      where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
      handles IS [NOT] NULL clauses. Adding more clause types should
      not be overly difficult - starting with more traditional
      'BooleanTest' conditions, or even multi-column conditions

          (Var op Var)

      which are difficult to estimate using single-column stats.

  (e) optimizer uses single stats per table

      This is still true and I don't think this will change soon. I do
      have some ideas on how to merge multiple stats etc. but it's
      certainly complex stuff, unlikely to happen within this CF. The
      patch makes a lot of sense even without this particular feature,
      because you can create multiple stats, each suitable for different
      queries.

  (f) no JOIN conditions

      Similarly to the previous point, it's on the TODO but it's not
      going to happen in this CF.


kind regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hello,

attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks), the patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

The main changes in the patches (requiring plenty of changes in the
other parts) are about these:


(1) combining multiple statistics on a table
--------------------------------------------

In the previous version of the patch, it was only possible to use a
single statistics on a table - when there was a statistics "covering"
all the conditions it worked fine, but that's not always the case.

The new patch is able to combine multiple statistics by decomposing the
probability (=selectivity) into conditional probabilities. Imagine
estimating selectivity of clauses

   WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for
example like this:

   P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)

where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the
second may be simplified like this:

   P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)

using the assumption "no multivariate stats => independent". Both these
probabilities match the existing statistics.

The idea is described a bit more in the part #5 of the patch.
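
To make the decomposition concrete, here is a small Python sketch on made-up data. It is not code from the patch: the table contents, the `COLS` mapping and the `prob` helper are assumptions for illustration, and the "statistics" are simply exact frequencies over the column groups [a,b,c] and [b,c,d].

```python
# Hypothetical table with columns (a, b, c, d): b and c copy a, d is derived from a.
rows = [(i % 10, i % 10, i % 10, (i % 10 + 1) // 2) for i in range(10000)]
COLS = {"a": 0, "b": 1, "c": 2, "d": 3}

def prob(conds):
    """Fraction of rows matching every column = value condition in conds."""
    hits = sum(all(r[COLS[c]] == v for c, v in conds.items()) for r in rows)
    return hits / len(rows)

# P(a=1,b=1,c=1): answerable from statistics on [a,b,c].
p_abc = prob({"a": 1, "b": 1, "c": 1})

# P(d=1 | b=1,c=1): answerable from statistics on [b,c,d].
p_d_given_bc = prob({"b": 1, "c": 1, "d": 1}) / prob({"b": 1, "c": 1})

combined = p_abc * p_d_given_bc
actual = prob({"a": 1, "b": 1, "c": 1, "d": 1})
# Per-column estimate that treats all four columns as independent.
independent = prob({"a": 1}) * prob({"b": 1}) * prob({"c": 1}) * prob({"d": 1})

print(combined, actual, independent)
```

On this data the combined estimate equals the actual selectivity (0.1), while the fully independent estimate is about 0.0002. Real data will rarely be this clean, so treat the exact match as a property of the toy table only.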


(2) choosing the best combination of statistics
-----------------------------------------------

There may be more statistics on a table, and multiple possible ways to
use them to estimate the clauses (different ordering, overlapping
statistics, etc.).

The patch formulates this as an optimization task with two goals:

   (a) cover as many clauses as possible
   (b) reuse as many conditions (i.e. dependencies) as possible

and implements two algorithms to solve this: (a) exhaustive, walking
through all possible states (using dynamic programming), and (b) greedy,
choosing the best local solution in each step.

The time requirements for the exhaustive solution grow pretty quickly
with the number of clauses and statistics on a table (~ O(N!)). The
greedy algorithm is much faster, as it's ~O(N) and in fact much more
time is spent in actually processing the selected statistics (walking
through the histograms etc.).

I assume the exhaustive search may find a better solution in some cases
(that the greedy algorithm misses), but so far I've been unable to come
up with such example.

To make this easier to test, I've added a GUC to switch between these
algorithms easily (set to 'greedy' by default):

    mvstat_search = {'greedy', 'exhaustive'}

I assume this GUC will be removed eventually, after we figure out which
algorithm is the right one.
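
The two search strategies can be sketched in Python as follows. This is a deliberately simplified model and not the patch's algorithm: statistics and clauses are reduced to sets of column names, the `score` function (clauses covered, then conditions reused) is an assumed stand-in for the two goals above, and the names `score`, `exhaustive` and `greedy` are hypothetical.

```python
from itertools import permutations

def score(order, clauses):
    """Walk the statistics in order; return (clauses covered, conditions reused)."""
    covered, reused = set(), 0
    for stat in order:
        new = (clauses & stat) - covered
        if not new:
            continue  # this statistic estimates nothing new, skip it
        reused += len(covered & stat)  # already-estimated clauses usable as conditions
        covered |= new
    return len(covered), reused

def exhaustive(stats, clauses):
    """Try every ordering of the statistics (N! orderings) and keep the best."""
    return max(permutations(stats), key=lambda order: score(order, clauses))

def greedy(stats, clauses):
    """Repeatedly pick the statistic with the best local gain."""
    order, covered, remaining = [], set(), list(stats)
    while remaining:
        best = max(remaining,
                   key=lambda s: (len((clauses & s) - covered), len(covered & s)))
        if not (clauses & best) - covered:
            break  # nothing left that covers a new clause
        order.append(best)
        covered |= clauses & best
        remaining.remove(best)
    return tuple(order)

clauses = frozenset("abcd")                   # WHERE a=1 AND b=1 AND c=1 AND d=1
stats = [frozenset("abc"), frozenset("bcd")]  # statistics on [a,b,c] and [b,c,d]

print(score(greedy(stats, clauses), clauses))
print(score(exhaustive(stats, clauses), clauses))
```

For this input both strategies cover all four clauses and reuse two conditions, so they tie. That says nothing about whether the exhaustive search can win on other inputs.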


(3) estimation of more complex conditions (AND/OR clauses)
----------------------------------------------------------

I've added ability to estimate more complex clauses - combinations of
AND/OR clauses and such. It's somewhat incomplete at the moment, but
hopefully the ideas will be clear from the TODOs/FIXMEs along the way.

Let me know if you have any questions about this version of the patch,
or about the ideas it implements in general.

I also welcome real-world examples of poorly estimated queries, so that
I can test if these patches improve that particular case.


regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: multivariate statistics / proof of concept

From

Jeff Janes

Date:

28 April 2015, 16:09:48

On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hello,

attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks), the patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).

Hi Tomas,

I get cascading conflicts in pg_proc.h. It looked easy enough to fix, except then I get compiler errors:

funcapi.c: In function 'get_func_trftypes':

funcapi.c:890: warning: unused variable 'procStruct'

utils/fmgrtab.o:(.rodata+0x10cf8): undefined reference to `_null_'

utils/fmgrtab.o:(.rodata+0x10d18): undefined reference to `_null_'

utils/fmgrtab.o:(.rodata+0x10d38): undefined reference to `_null_'

utils/fmgrtab.o:(.rodata+0x10d58): undefined reference to `_null_'

collect2: ld returned 1 exit status

make[2]: *** [postgres] Error 1

make[1]: *** [all-backend-recurse] Error 2

make: *** [all-src-recurse] Error 2

make: *** Waiting for unfinished jobs....

make: *** [temp-install] Error 2

Cheers,

Jeff

Re: WIP: multivariate statistics / proof of concept

From

Stephen Frost

Date:

28 April 2015, 16:13:16

* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > attached is a new version of the patch series. Aside from fixing various
> > issues (crashes, memory leaks), the patches are rebased to current
> > master, and I also attach a few SQL scripts I used for testing (nothing
> > fancy, just stress-testing all the parts the patch touches).
>
> I get cascading conflicts in pg_proc.h.  It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc.  You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.
Thanks!
    Stephen

Re: WIP: multivariate statistics / proof of concept

From

Jeff Janes

Date:

28 April 2015, 17:37:05

On Tue, Apr 28, 2015 at 9:13 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Mon, Mar 30, 2015 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > attached is a new version of the patch series. Aside from fixing various
> > issues (crashes, memory leaks), the patches are rebased to current
> > master, and I also attach a few SQL scripts I used for testing (nothing
> > fancy, just stress-testing all the parts the patch touches).
>
> I get cascading conflicts in pg_proc.h. It looked easy enough to fix,
> except then I get compiler errors:

Yeah, those are because you didn't address the new column which was
added to pg_proc. You need to add another _null_ in the pg_proc.h lines
in the correct place, apparently on four lines.

Thanks. I think I tried that, but was still having trouble. But it turns out that the trouble was for an unrelated reason, and I got it to compile now.

Some of the fdw's need a patch as well in order to compile, see attached.

Cheers,

Jeff

Attachment

multivariate_contrib.patch

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

28 April 2015, 18:16:47

Hi,

On 04/28/15 19:36, Jeff Janes wrote:
...
>
> Thanks. I think I tried that, but was still having trouble. But it
> turns out that the trouble was for an unrelated reason, and I got it
> to compile now.

Yeah, a new column was added to pg_proc the day after I submitted the
patch. Will address that in a new version, hopefully in a few days.

>
> Some of the fdw's need a patch as well in order to compile, see
> attached.

Thanks, I forgot to tweak the clauselist_selectivity() calls in contrib :-(


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics / patch v6

From

Tomas Vondra

Date:

06 May 2015, 20:04:32

Attached is v6 of the multivariate stats, with a number of improvements:

1) fix of the contrib compile-time errors (reported by Jeff)

2) fix of pg_proc issues (reported by Jeff)

3) rebase to current master

4) fix a bunch of issues in the previous patches, due to referencing
    some parts too early (e.g. histograms in the first patch, etc.)

5) remove the explicit DELETEs from pg_mv_statistic (in the regression
    tests), this is now handled automatically by DROP TABLE etc.

6) a number of performance optimizations in selectivity estimations:

    (a) minimize calls to get_oprrest, significantly reducing
        syscache calls

    (b) significant reduction of palloc overhead in deserialization of
        MCV lists and histograms

    (c) use more compact serialized representation of MCV lists and
        histograms, often removing ~50% of the size

    (d) use histograms with limited deserialization, which also allows
        caching function calls

    (e) modified histogram bucket partitioning, resulting in more even
        bucket distribution (i.e. producing buckets with more equal
        density and about equal size of each dimension)

7) add functions for listing MCV list items and histogram buckets:

     - pg_mv_mcvlist_items(oid)
     - pg_mv_histogram_buckets(oid, type)

    This is quite useful when analyzing the MCV lists / histograms.

8) improved support for OR clauses

9) allow calling pull_varnos() on expression trees containing
    RestrictInfo nodes (not sure if this is the right fix, it's being
    discussed in another thread)



--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hello,

attached is v7 of the multivariate stats patch. The main improvement is
major refactoring of the clausesel.c portion - splitting the awfully
long spaghetti-style functions into smaller pieces, making it much more
understandable etc.

I do assume some of those pieces are unnecessary because there already
is a helper function with the same purpose (but I'm not aware of that).
But IMHO this piece of code begins to look reasonable (especially when
compared to the previous state).

The other major improvement is a review of the comments (including FIXMEs
and TODOs), and removal of the obsolete / misplaced ones. And there were
plenty of those ...

These changes made this version ~20k smaller than v6.

The patch also rebases to current master, which I assume shall be quite
stable - so hopefully no more duplicate OIDs for a while.

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself. The first part makes it possible
to use pull_varnos() with expression trees containing RestrictInfo
nodes, but maybe this is not the right way to fix this (there's another
thread where this was discussed).

Also, the regression tests testing plan choice with multivariate stats
(e.g. that a bitmap index scan is chosen instead of index scan) fail
from time to time. I suppose this happens because the invalidation after
ANALYZE is not processed before executing the query, so the optimizer
does not see the stats, or something like that.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi,

attached is v8 of the multivariate statistics patch (or rather a patch
series). The patch currently has 7 parts, but 0001 is just a fix of the
pull_varnos issue (possibly incorrect/temporary), and 0007 is just an
attempt to add the "multicolumn distinctness" (experimental for now).

There are three noteworthy changes:

1) Correct estimation of OR-clauses - this turned out to be a rather
minor change, thanks to simply transforming the OR-clauses to
AND-clauses, see clauselist_selectivity_or() for details.

2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead
adding separate commands CREATE STATISTICS / DROP STATISTICS, as
proposed in the "multicolumn distinctness" thread:

http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp

This seems a better approach than the ALTER TABLE one - not only does
it nicely fix the grammar issues, it also naturally extends to
multi-table statistics (even though we don't know how those should
work exactly).

The syntax is this:

CREATE STATISTICS name ON table (columns) WITH (options);

DROP STATISTICS name;

and the 'name' is optional (and if absent, should be generated just
like for indexes, but that's not implemented yet).

The remaining question is how unique the statistics name should be.
My initial plan was to make it unique within a table, but that of
course does not work well with the DROP STATISTICS (it'd have to
specify the table name also), and it'd also not work with statistics
on multiple tables (which is one of the reasons for abandoning ALTER
TABLE stuff).

So I think it should be unique across tables. Statistics are hardly
a global object, so it should be unique within a schema. I thought
that simply using the schema of the table would work, but that of
course breaks with multiple tables in different schemas. So the only
solution seems to be explicit schema for statistics.

3) I've also started hacking on adding the "multicolumn distinctness"
proposed by Horiguchi-san, but I haven't really got that working. It
seems to be a bit more complicated than I anticipated because of the
"only equality conditions" restriction. So the 0007 patch only
really adds basic syntax and trivial build.

I do have a bunch of ideas/questions about this statistics type. For
example, should we compute just a single coefficient for the exact
combination of columns specified in CREATE STATISTICS, or perhaps
for some additional subsets? I.e. with

CREATE STATISTICS ON t (a,b,c) WITH (ndistinct);

should we compute just the coefficient for (a,b,c), or maybe also
for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such
combinations, but perhaps it's acceptable.

Having the coefficient for just the single combination specified in
CREATE STATISTICS makes the estimation difficult when some of the
columns are not specified. For example, with coefficient just for
(a,b,c), what should happen for (WHERE a=1 AND b=2)?

Should we simply ignore the statistics, or apply it anyway and
somehow compensate for the missing columns?
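
To put a number on the subset question, this Python sketch enumerates every combination of two or more columns. The helper name `column_subsets` is hypothetical; it only counts the candidates and says nothing about how the patch would store or use them.

```python
from itertools import combinations

def column_subsets(columns, min_size=2):
    """All combinations of at least min_size columns, smallest first."""
    return [combo
            for size in range(min_size, len(columns) + 1)
            for combo in combinations(columns, size)]

# For (a,b,c): (a,b), (a,c), (b,c) and (a,b,c).
subsets = column_subsets(("a", "b", "c"))
print(subsets)

# With N columns there are 2^N - N - 1 such subsets.
print(len(column_subsets(tuple("abcdefgh"))))
```

With the 8-column limit mentioned earlier in the thread, that is 247 coefficients per statistics object in the worst case.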

I've also started working on something like a paper, hopefully
explaining the ideas and implementation more clearly and consistently
than possible on a mailing list (thanks to charts, figures and such).
It's available here (both the .tex source and .pdf with the current
version):

https://bitbucket.org/tvondra/mvstats-paper/src

It's not exactly short (~30 pages), and it's certainly incomplete with
plenty of TODO notes, but hopefully it's already useful and not entirely
bogus.

Comments and questions are welcome - both to the patch and paper.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v9

From

Tomas Vondra

Date:

19 January 2016, 04:24:23

Hi,

attached is v9 of the patch series, including mostly these changes:

1) CREATE STATISTICS cleanup

    Firstly, I forgot to make the STATISTICS keyword unreserved again.
    I've also removed additional stuff from the grammar that turned out
    to be unnecessary / could be replaced with existing pieces.

2) making statistics schema-specific

    Similarly to the other objects (e.g. types), statistics names are now
    unique within a schema. This also means that the statistics may be
    created using qualified name, and also may belong to a different
    schema than a table.

    It seems to me we probably also need to track owner, and only allow
    the owner (or superuser / schema owner) to manipulate the statistics.

    The initial intention was to inherit all this from the parent table,
    but as we're designing this for the multi-table case, it's not
    really working anymore.

3) adding IF [NOT] EXISTS to DROP STATISTICS / CREATE STATISTICS

4) basic documentation of the DDL commands

    It's really simple at this point and some of the paragraphs are
    still empty. I also think that we'll have to add stuff explaining
    how to use statistics, not just docs for the DDL commands.

5) various fixes of the regression tests, related to the above


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi,

Attached is v10 of the patch series. There are 9 parts at the moment:

   0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
   0002-shared-infrastructure-and-functional-dependencies.patch
   0003-clause-reduction-using-functional-dependencies.patch
   0004-multivariate-MCV-lists.patch
   0005-multivariate-histograms.patch
   0006-multi-statistics-estimation.patch
   0007-multivariate-ndistinct-coefficients.patch
   0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
   0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch

However, the first one is still just a temporary workaround that I plan
to address next, and the last 3 are all dealing with the ndistinct
coefficients (and shall be squashed into a single chunk).


README docs
-----------

Aside from fixing a few bugs, there are several major improvements, the
main one being that I've moved most of the comments explaining how it
all works into a set of regular README files, located in
src/backend/utils/mvstats:

1) README.stats - Overview of available types of statistics, what
    clauses can be estimated, how multiple statistics are combined etc.
    This is probably the right place to start.

2) docs for each type of statistics currently available

    README.dependencies - soft functional dependencies
    README.mcv          - MCV lists
    README.histogram    - histograms
    README.ndistinct    - ndistinct coefficients

The READMEs are added and modified through the patch series, so the best
thing to do is apply all the patches and start reading.

I have not improved the user-oriented SGML documentation in this patch,
that's one of the tasks I'd like to work on next. But the READMEs should
give you a good idea how it's supposed to work, and there are some
examples of use in the regression tests.


Significantly simplified places
-------------------------------

The patch version also significantly simplifies several places that were
needlessly complex in the previous ones - firstly the function
evaluating clauses on multivariate histograms was rather needlessly
bloated, so I've simplified it a lot. Similarly for the code in
clauselist_selectivity() that combines multiple statistics to estimate a list
of clauses - that's much simpler now too. And various other pieces.

That being said, I still think the code in clausesel.c can be
simplified. I feel there's a lot of cruft, mostly due to unknowingly
implementing something that could be solved by an existing function.

A prime example of that is inspecting the expression tree to check if we
know how to estimate the clauses using the multivariate statistics. That
sounds like a nice match for expression walker, but currently is done by
custom code. I plan to look at that next.

Also, I'm not quite sure I understand what the varRelid parameter of
clauselist_selectivity is for, so the code may be handling that wrong
(seems to be working though).


ndistinct coefficients
----------------------

The one new piece in this patch is the GROUP BY estimation, based on the
ndistinct coefficients. So for example you can do this:

     CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
                         FROM generate_series(1,1000000) s(i);
     ANALYZE t;
     EXPLAIN SELECT * FROM t GROUP BY a, b;

which currently does this:

                               QUERY PLAN
-----------------------------------------------------------------------
  Group  (cost=127757.34..135257.34 rows=99996 width=8)
    Group Key: a, b
    ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
          Sort Key: a, b
          ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

but we know that there are only 1000 groups because the columns are
correlated. So let's create ndistinct statistics on the two columns:

     CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
     ANALYZE t;

which results in estimates like this:

                            QUERY PLAN
-----------------------------------------------------------------
  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
    Group Key: a, b
    ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(3 rows)

I'm not quite sure how to combine this type of statistics with MCV lists
and histograms, so for now it's used only for GROUP BY.
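
The effect can be reproduced outside the database with a few lines of Python. This is only a sketch: the data is a scaled-down copy of the table above (100,000 rows instead of 1,000,000), and the "naive" estimate is a simplified independence-style product capped at the row count, not the planner's actual formula.

```python
# Scaled-down copy of the example table: a = mod(i,1000), b = mod(i,1000).
rows = [(i % 1000, i % 1000) for i in range(1, 100001)]

ndistinct_a = len({a for a, _ in rows})
ndistinct_b = len({b for _, b in rows})

# Without multivariate statistics: assume the columns are independent,
# multiply the per-column ndistinct values and cap at the number of rows.
naive_groups = min(ndistinct_a * ndistinct_b, len(rows))

# With ndistinct statistics on (a,b): the number of distinct combinations.
actual_groups = len(set(rows))

print(naive_groups, actual_groups)
```

The naive estimate saturates at the row count, while the combined ndistinct gives the 1000 groups that GROUP BY a, b actually produces.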


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi,

attached is v11 of the patch - this is mostly a cleanup of v10, removing
redundant code, adding missing comments, removing obsolete FIXME/TODOs
and so on. Overall this shaves ~20kB from the patch (not a primary
objective, though).

The one thing this (hopefully) fixes is handling of varRelid. Apparently
I got that slightly wrong in the previous versions.

One thing I'm not quite sure about is schema of the new system catalog.
The existing catalog pg_statistic uses generic design with stakindN,
stanumbersN and stavaluesN columns, while the new catalog uses dedicated
columns for each type of stats (MCV, histogram, ...). Not sure whether
it's desirable to switch to the pg_statistic approach or not.

There are a few things I plan to look into next:

  * possibly more cleanups in clausesel.c (I'm wondering if some pieces
    should be moved to utils/mvstats/*.c)

  * a few FIXMEs in the infrastructure (e.g. deriving a name when not
    specified in CREATE STATISTICS)

  * move the ndistinct coefficients after functional dependencies in
    the patch series (but only use them for GROUP BY for now)

  * extend the functional dependencies to handle multiple columns on
    the left side (condition), i.e. dependencies like (a,b) -> c

  * address a few remaining FIXMEs in MCV/histograms building


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v11

From

Jeff Janes

Date:

09 March 2016, 02:24:50

On Tue, Mar 8, 2016 at 12:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> attached is v11 of the patch - this is mostly a cleanup of v10, removing
> redundant code, adding missing comments, removing obsolete FIXME/TODOs
> and so on. Overall this shaves ~20kB from the patch (not a primary
> objective, though).

This has some conflicts with the pathification commit, in the
regression tests.

To avoid that, I applied it to the commit before that, 3fc6e2d7f5b652b417fa6^

Having done that, in my hands, it fails its own regression tests.
Diff attached.

It breaks contrib postgres_fdw, I'll look into that when I get a
chance if no one beats me to it.

postgres_fdw.c: In function 'postgresGetForeignJoinPaths':
postgres_fdw.c:3623: error: too few arguments to function
'clauselist_selectivity'
postgres_fdw.c:3642: error: too few arguments to function
'clauselist_selectivity'

Cheers,

Jeff

Attachment

regression.diffs

Re: multivariate statistics v11

From

Tomas Vondra

Date:

09 March 2016, 09:54:43

Hi,

thanks for looking at the patch. Sorry for the issues, attached is a
version v13 that should fix them (or most of them).

On Tue, 2016-03-08 at 18:24 -0800, Jeff Janes wrote:
> On Tue, Mar 8, 2016 at 12:13 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Hi,
> >
> > attached is v11 of the patch - this is mostly a cleanup of v10, removing
> > redundant code, adding missing comments, removing obsolete FIXME/TODOs
> > and so on. Overall this shaves ~20kB from the patch (not a primary
> > objective, though).
>
> This has some conflicts with the pathification commit, in the
> regression tests.

Yeah, there was one join plan difference, due to the ndistinct
estimation patch. Meh. Fixed.

>
> To avoid that, I applied it to the commit before that, 3fc6e2d7f5b652b417fa6^

Rebased to 51c0f63e.

>
> Having done that, in my hands, it fails its own regression tests.
> Diff attached.

Fixed. This was caused by making names of the statistics unique across
tables, thus the regression tests started to fail when executed through
'make check' (but 'make installcheck' was still fine).

The diff however also includes a segfault, apparently in processing of
functional dependencies somewhere in ANALYZE. Sadly I've been unable to
reproduce any such failure, despite running the tests many times (even
when applied on the same commit). Is there any chance this might be due
to a broken build, or something like that? If not, can you try
reproducing it and investigate a bit (enable core dumps etc.)?

>
> It breaks contrib postgres_fdw, I'll look into that when I get a
> chance if no one beats me to it.
>
> postgres_fdw.c: In function 'postgresGetForeignJoinPaths':
> postgres_fdw.c:3623: error: too few arguments to function
> 'clauselist_selectivity'
> postgres_fdw.c:3642: error: too few arguments to function
> 'clauselist_selectivity'

Yeah, apparently there are two new calls to clauselist_selectivity, so I
had to add NIL as list of conditions.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v11

From

Alvaro Herrera

Date:

09 March 2016, 12:22:59

Hi,

I gave a very quick skim to patch 0002.  Not a real review yet.  But
there are a few trivial points to fix:

* You still have empty sections in the SGML docs (such as the EXAMPLES).
I suppose the syntax is now firm enough that we can get some.  (I looked
at the other patches to see whether it was filled in, but couldn't find
any additional text there.)

* check_object_ownership() needs to be filled in

* Since you're adding a new object type, please add a case to cover it
in the object_address.sql pg_regress test.

* in analyze.c (and elsewhere), please put new #include lines sorted.

* I think the AT_PASS_ADD_STATS is a leftover which should be removed.

* The XXX comment in get_relation_info should probably be handled
differently (namely, in a way that makes the syscache not contain OIDs
of dropped stats)

* The README.dependencies has a lot of TODOs.  Do we need to get them
done during the first cut?  If not, I suggest creating a new section
"Future work" in the file.

* Please put the common.h header in src/include.  Make sure not to
include "postgres.h" in it -- our policy is that postgres.h goes at the
top of every .c file and never in any .h file.  Also please find a
better name for it; even mvstats_common.h would be a lot more
convincing.  However:

* ISTM that the code in common.c properly belongs in
src/backend/catalog/pg_mvstats.c instead (or more properly
catalog/pg_mv_statistics.c), which probably means the common.h file
should be named something else; perhaps some of it could become
pg_mv_statistic_fn.h, while the rest continues to be
src/include/utils/mvstats_common.h?  Not sure.

* The version check in psql/describe.c uses 90500; should probably be
updated to 90600.

* _copyCreateStatsStmt is missing if_not_exists

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

09 March 2016, 15:03:36

Hi,

thanks for the feedback. Attached is v14 of the patch series, fixing
most of the points you've raised.

On Wed, 2016-03-09 at 09:22 -0300, Alvaro Herrera wrote:
> Hi,
>
> I gave a very quick skim to patch 0002.  Not a real review yet.  But
> there are a few trivial points to fix:
>
> * You still have empty sections in the SGML docs (such as the EXAMPLES).
> I suppose the syntax is now firm enough that we can get some.  (I looked
> at the other patches to see whether it was filled in, but couldn't find
> any additional text there.)

Yes, that's one of the items I plan to work on next. Until now the
regression tests were a sufficient source of examples, but it's time to
do the SGML piece.

>
> * check_object_ownership() needs to be filled in

Done.

I've added pg_statistics_ownercheck, which also required adding OID of
the owner to the catalog. Initially the plan was to use the same owner
as for the table, but now that we've switched to CREATE STATISTICS
partially because it will allow multi-table stats, that does not make
sense (multiple tables with different owners).

This probably means we also need an 'ALTER STATISTICS ... OWNER TO'
command, which does not exist at this point.

>
> * Since you're adding a new object type, please add a case to cover it
> in the object_address.sql pg_regress test.

Done.

Apparently there was a bunch of missing pieces in objectaddress.c, so
this adds them too.

>
> * in analyze.c (and elsewhere), please put new #include lines sorted.

Done.

I've also significantly reduced the excessive list of includes in
statscmds.c. I expect the headers to require a bit more love, especially
in the subsequent patches (MCV, histograms etc.).

>
> * I think the AT_PASS_ADD_STATS is a leftover which should be removed.

Yeah. Now that we've invented CREATE STATISTICS, all the changes to
tablecmds.c were just unnecessary leftovers. Removed.

>
> * The XXX comment in get_relation_info should probably be handled
> differently (namely, in a way that makes the syscache not contain OIDs
> of dropped stats)

I believe that was actually an obsolete comment. Removed.

>
> * The README.dependencies has a lot of TODOs.  Do we need to get them
> done during the first cut?  If not, I suggest creating a new section
> "Future work" in the file.

Right. Most of those TODOs are future work, or rather ideas (more or
less crazy). The one thing I definitely want to address now is support
for dependencies with multiple columns on the left side, because that
requires changes to serialized format. I might also look at handling IS
NULL clauses, but that may wait.

>
> * Please put the common.h header in src/include.  Make sure not to
> include "postgres.h" in it -- our policy is that postgres.h goes at the
> top of every .c file and never in any .h file.  Also please find a
> better name for it; even mvstats_common.h would be a lot more
> convincing.  However:
>
> * ISTM that the code in common.c properly belongs in
> src/backend/catalog/pg_mvstats.c instead (or more properly
> catalog/pg_mv_statistics.c), which probably means the common.h file
> should be named something else; perhaps some of it could become
> pg_mv_statistic_fn.h, while the rest continues to be
> src/include/utils/mvstats_common.h?  Not sure.

Hmmm, not sure either. The idea was that the "common.h" is pretty much
just a private header with stuff that's not very useful anywhere else.

No changes here, for now.

>
> * The version check in psql/describe.c uses 90500; should probably be
> updated to 90600.

Fixed.

>
> * _copyCreateStatsStmt is missing if_not_exists

Fixed.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: multivariate statistics v14

From

Jeff Janes

Date:

09 March 2016, 16:46:20

On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> thanks for the feedback. Attached is v14 of the patch series, fixing
> most of the points you've raised.


Hi Tomas,

Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
check if I configure without --enable-cassert.

With --enable-cassert, it passes the regression test.

I got the core file, configured and compiled with:
CFLAGS="-fno-omit-frame-pointer"  --enable-debug

The first core dump is on this statement:
 -- check explain (expect bitmap index scan, not plain index scan)
 INSERT INTO functional_dependencies
      SELECT i/10000, i/20000, i/40000 FROM generate_series(1,1000000) s(i);

bt

#0  0x00000000006e1160 in cost_qual_eval (cost=0x2494418,
quals=0x2495550, root=0x2541b88) at costsize.c:3181
#1  0x00000000006e1ee5 in set_baserel_size_estimates (root=0x2541b88,
rel=0x2494300) at costsize.c:3754
#2  0x00000000006d37e8 in set_plain_rel_size (root=0x2541b88,
rel=0x2494300, rte=0x247e660) at allpaths.c:480
#3  0x00000000006d353d in set_rel_size (root=0x2541b88, rel=0x2494300,
rti=1, rte=0x247e660) at allpaths.c:350
#4  0x00000000006d338f in set_base_rel_sizes (root=0x2541b88) at allpaths.c:270
#5  0x00000000006d3233 in make_one_rel (root=0x2541b88,
joinlist=0x2494628) at allpaths.c:169
#6  0x000000000070012e in query_planner (root=0x2541b88,
tlist=0x2541e58, qp_callback=0x7048d4 <standard_qp_callback>,
qp_extra=0x7ffefa6474e0)   at planmain.c:246
#7  0x0000000000702a33 in grouping_planner (root=0x2541b88,
inheritance_update=0 '\000', tuple_fraction=0) at planner.c:1647
#8  0x0000000000701310 in subquery_planner (glob=0x2541af8,
parse=0x246a838, parent_root=0x0, hasRecursion=0 '\000',
tuple_fraction=0) at planner.c:740
#9  0x000000000070055b in standard_planner (parse=0x246a838,
cursorOptions=256, boundParams=0x0) at planner.c:290
#10 0x000000000070023f in planner (parse=0x246a838, cursorOptions=256,
boundParams=0x0) at planner.c:160
#11 0x00000000007b8bf9 in pg_plan_query (querytree=0x246a838,
cursorOptions=256, boundParams=0x0) at postgres.c:798
#12 0x00000000005d1967 in ExplainOneQuery (query=0x246a838, into=0x0,
es=0x246a778,   queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM
mcv_list WHERE a = 10 AND b = 5;", params=0x0) at explain.c:350
#13 0x00000000005d16a3 in ExplainQuery (stmt=0x2444f90,
queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM mcv_list
WHERE a = 10 AND b = 5;",   params=0x0, dest=0x246a6e8) at explain.c:244
#14 0x00000000007c0afb in standard_ProcessUtility (parsetree=0x2444f90,
queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM mcv_list
WHERE a = 10 AND b = 5;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0,   dest=0x246a6e8, completionTag=0x7ffefa647b60 "") at utility.c:659
#15 0x00000000007c0299 in ProcessUtility (parsetree=0x2444f90,
queryString=0x2443d80 "EXPLAIN (COSTS off)\n SELECT * FROM mcv_list
WHERE a = 10 AND b = 5;",   context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x246a6e8,
completionTag=0x7ffefa647b60 "") at utility.c:335
#16 0x00000000007bf47b in PortalRunUtility (portal=0x23ed510,
utilityStmt=0x2444f90, isTopLevel=1 '\001', dest=0x246a6e8,
completionTag=0x7ffefa647b60 "")   at pquery.c:1183
#17 0x00000000007bf1ce in FillPortalStore (portal=0x23ed510,
isTopLevel=1 '\001') at pquery.c:1057
#18 0x00000000007beb19 in PortalRun (portal=0x23ed510,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x253f6c0,
altdest=0x253f6c0,   completionTag=0x7ffefa647d40 "") at pquery.c:781
#19 0x00000000007b90ae in exec_simple_query (query_string=0x2443d80
"EXPLAIN (COSTS off)\n SELECT * FROM mcv_list WHERE a = 10 AND b =
5;")   at postgres.c:1094
#20 0x00000000007bcfac in PostgresMain (argc=1, argv=0x23d5070,
dbname=0x23d4e48 "regression", username=0x23d4e30 "jjanes") at
postgres.c:4021
#21 0x0000000000745a62 in BackendRun (port=0x23f4110) at postmaster.c:4258
#22 0x00000000007451d6 in BackendStartup (port=0x23f4110) at postmaster.c:3932
#23 0x0000000000741ab7 in ServerLoop () at postmaster.c:1690
#24 0x00000000007411c0 in PostmasterMain (argc=8, argv=0x23d3f20) at
postmaster.c:1298
#25 0x0000000000690026 in main (argc=8, argv=0x23d3f20) at main.c:223

Cheers,

Jeff

Re: multivariate statistics v14

From

Tomas Vondra

Date:

09 March 2016, 17:21:52

Hi,

On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Hi,
> >
> > thanks for the feedback. Attached is v14 of the patch series, fixing
> > most of the points you've raised.
>
>
> Hi Tomas,
>
> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
> check if I configure without --enable-cassert.

Ah, after disabling asserts I can reproduce it too. And the reason why
it fails is quite simple - clauselist_selectivity modifies the original
list of clauses, which then confuses cost_qual_eval.

Can you try if the attached patch fixes the issue? I'll need to rework a
bit more of the code, but let's see if this fixes the issue on your
machine too.

> With --enable-cassert, it passes the regression test.

I wonder how it can work with casserts and fail without them. That's
kinda exactly the opposite of what I'd expect ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

mvstats-segfault-fix.patch

Re: multivariate statistics v14

From

Jeff Janes

Date:

09 March 2016, 18:09:40

On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
>> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Hi,
>> >
>> > thanks for the feedback. Attached is v14 of the patch series, fixing
>> > most of the points you've raised.
>>
>>
>> Hi Tomas,
>>
>> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
>> check if I configure without --enable-cassert.
>
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.
>
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.

Yes, that fixes it.


>
>> With --enable-cassert, it passes the regression test.
>
> I wonder how can it work with casserts and fail without them. That's
> kinda exactly the opposite to what I'd expect ...

I too was surprised by that.  Maybe cassert makes a copy of some data
structure which is used in-place without cassert?

Thanks,

Jeff

Re: multivariate statistics v14

From

Tomas Vondra

Date:

09 March 2016, 18:18:20

On Wed, 2016-03-09 at 18:21 +0100, Tomas Vondra wrote:
> Hi,
> 
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> > On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > > Hi,
> > >
> > > thanks for the feedback. Attached is v14 of the patch series, fixing
> > > most of the points you've raised.
> > 
> > 
> > Hi Tomas,
> > 
> > Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
> > check if I configure without --enable-cassert.
> 
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.

More precisely, it gets confused because the first clause in the list
gets deleted but cost_qual_eval never learns about that, so it follows a
stale pointer to the next cell, hence the segfault.
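
The failure mode can be sketched outside the backend. This is only an
illustration, assuming a hand-rolled singly linked list: Cell, push,
copy_list, delete_first and consume_copy are hypothetical names, not
PostgreSQL's List API and not the code in the attached fix. A callee that
frees the caller's head cell leaves the caller with a dangling pointer;
consuming a private copy leaves the caller's list intact.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a list cell; not PostgreSQL's List/ListCell. */
typedef struct Cell
{
    int          value;
    struct Cell *next;
} Cell;

/* Prepend a value and return the new head. */
Cell *
push(Cell *head, int value)
{
    Cell *c = malloc(sizeof(Cell));

    assert(c != NULL);
    c->value = value;
    c->next = head;
    return c;
}

/* Copy every cell, preserving order. */
Cell *
copy_list(const Cell *head)
{
    Cell  *result = NULL;
    Cell **tail = &result;

    for (; head != NULL; head = head->next)
    {
        Cell *c = malloc(sizeof(Cell));

        assert(c != NULL);
        c->value = head->value;
        c->next = NULL;
        *tail = c;
        tail = &c->next;
    }
    return result;
}

/* Count the cells in a list. */
int
cell_count(const Cell *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Unlink and free the first cell, returning the new head. */
Cell *
delete_first(Cell *head)
{
    Cell *rest;

    assert(head != NULL);
    rest = head->next;
    free(head);
    return rest;
}

/*
 * Safe variant: delete cells from a private copy only.  Calling
 * delete_first() directly on the caller's list would instead free the
 * very cell the caller still points at.
 */
int
consume_copy(const Cell *clauses)
{
    Cell *work = copy_list(clauses);
    int   consumed = 0;

    while (work != NULL)
    {
        work = delete_first(work);
        consumed++;
    }
    return consumed;
}
```

After consume_copy() returns, the caller can still walk its own list
from the original head.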

> 
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.
> 
> > With --enable-cassert, it passes the regression test.
> 
> I wonder how can it work with casserts and fail without them. That's
> kinda exactly the opposite to what I'd expect ...

FWIW it seems to be somehow related to this assert in clausesel.c:
  Assert(count_mv_attnums(list_union(stat_clauses, stat_conditions),
                          relid, MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST) >= 2);

With the assert in place, the code passes without a failure. After
removing the assert (commenting it out), or even just changing it to
   Assert(count_mv_attnums(stat_clauses, relid,
                           MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST)
        + count_mv_attnums(stat_conditions, relid,
                           MV_CLAUSE_TYPE_MCV | MV_CLAUSE_TYPE_HIST) >= 2);

i.e. removing the list_union, it fails as expected.

The only thing that I can think of is that list_union happens to place
the right stuff at the right position in memory - pure luck.
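
One general point behind this kind of surprise: in a build without
assertions, the expression passed to an assertion macro is not evaluated
at all (as with standard C assert under NDEBUG), so any work it performs
simply does not happen. A minimal sketch, where CHECK_ENABLED and
CHECK_DISABLED are hypothetical macros modelling the two build modes and
count_with_side_effect stands in for a call that allocates. It does not
explain why the crash is masked, only why the two builds run different
code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical macros modelling an assert-enabled and a disabled build. */
#define CHECK_ENABLED(expr) \
    do { \
        if (!(expr)) \
        { \
            fprintf(stderr, "check failed: %s\n", #expr); \
            abort(); \
        } \
    } while (0)
#define CHECK_DISABLED(expr) ((void) 0)

/* Number of times the side-effecting function actually ran. */
static int side_effects = 0;

/* Stand-in for a function with a side effect, e.g. one that allocates. */
int
count_with_side_effect(int n)
{
    assert(n >= 0);
    side_effects++;
    return n;
}

/* Assert-enabled build: the argument is evaluated, the side effect happens. */
int
run_enabled(void)
{
    side_effects = 0;
    CHECK_ENABLED(count_with_side_effect(2) >= 2);
    return side_effects;
}

/* Assert-disabled build: the argument vanishes at compile time. */
int
run_disabled(void)
{
    side_effects = 0;
    CHECK_DISABLED(count_with_side_effect(2) >= 2);
    return side_effects;
}
```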

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Jeff Janes

Date:

13 March 2016, 07:30:13

On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
>> On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Hi,
>> >
>> > thanks for the feedback. Attached is v14 of the patch series, fixing
>> > most of the points you've raised.
>>
>>
>> Hi Tomas,
>>
>> Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in make
>> check if I configure without --enable-cassert.
>
> Ah, after disabling asserts I can reproduce it too. And the reason why
> it fails is quite simple - clauselist_selectivity modifies the original
> list of clauses, which then confuses cost_qual_eval.
>
> Can you try if the attached patch fixes the issue? I'll need to rework a
> bit more of the code, but let's see if this fixes the issue on your
> machine too.

That patch on top of v14 did fix the original problem.  But I got
another segfault:

jjanes=# create table foo as select x, floor(x/(10000000/500))::int as
y  from generate_series(1,10000000) f(x);
jjanes=# create index on foo (x,y);
jjanes=# create index on foo (y,x);
jjanes=# create statistics jjj on foo (x,y) with (dependencies,histogram);
jjanes=# analyze ;
server closed the connection unexpectedly

#0  multi_sort_add_dimension (mss=mss@entry=0x7f45dafc7c88,
sortdim=sortdim@entry=0, dim=dim@entry=0,
vacattrstats=vacattrstats@entry=0x16f0dd0) at common.c:436
#1  0x00000000007d022a in update_bucket_ndistinct (attrs=0x166fdf8,
stats=0x16f0dd0, bucket=<optimized out>) at histogram.c:1384
#2  0x00000000007d09aa in create_initial_mv_bucket (stats=0x16f0dd0,
attrs=0x166fdf8, rows=0x17cda20, numrows=30000) at histogram.c:880
#3  build_mv_histogram (numrows=30000, rows=rows@entry=0x170ecf0,
attrs=attrs@entry=0x166fdf8, stats=stats@entry=0x16f0dd0,
numrows_total=numrows_total@entry=30000)   at histogram.c:156
#4  0x00000000007ced19 in build_mv_stats
(onerel=onerel@entry=0x7f45e797d040, totalrows=9999985,
numrows=numrows@entry=30000, rows=rows@entry=0x170ecf0,
natts=natts@entry=2,   vacattrstats=vacattrstats@entry=0x166efa0) at common.c:106
#5  0x000000000055ff6b in do_analyze_rel
(onerel=onerel@entry=0x7f45e797d040, options=options@entry=2,
va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>,
relpages=44248,   inh=inh@entry=0 '\000', in_outer_xact=in_outer_xact@entry=0
'\000', elevel=elevel@entry=13, params=0x7ffcbe382a30) at
analyze.c:585
#6  0x0000000000560ced in analyze_rel (relid=relid@entry=16441,
relation=relation@entry=0x16bc9d0, options=options@entry=2,
params=params@entry=0x7ffcbe382a30,   va_cols=va_cols@entry=0x0, in_outer_xact=<optimized out>,
bstrategy=0x16640f0) at analyze.c:262
#7  0x00000000005b70fd in vacuum (options=2, relation=0x16bc9d0,
relid=relid@entry=0, params=params@entry=0x7ffcbe382a30, va_cols=0x0,
bstrategy=<optimized out>,   bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:313
#8  0x00000000005b748e in ExecVacuum (vacstmt=vacstmt@entry=0x16bca20,
isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:121
#9  0x00000000006c90f3 in standard_ProcessUtility
(parsetree=0x16bca20, queryString=0x16bbfc0 "analyze foo ;",
context=<optimized out>, params=0x0, dest=0x16bcd60,   completionTag=0x7ffcbe382fa0 "") at utility.c:654
#10 0x00007f45e413b1d1 in pgss_ProcessUtility (parsetree=0x16bca20,
queryString=0x16bbfc0 "analyze foo ;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x16bcd60,   completionTag=0x7ffcbe382fa0 "") at
pg_stat_statements.c:986
#11 0x00000000006c6841 in PortalRunUtility (portal=0x16f7700,
utilityStmt=0x16bca20, isTopLevel=<optimized out>, dest=0x16bcd60,
completionTag=0x7ffcbe382fa0 "") at pquery.c:1175
#12 0x00000000006c73c5 in PortalRunMulti
(portal=portal@entry=0x16f7700, isTopLevel=isTopLevel@entry=1 '\001',
dest=dest@entry=0x16bcd60, altdest=altdest@entry=0x16bcd60,   completionTag=completionTag@entry=0x7ffcbe382fa0 "") at
pquery.c:1306
#13 0x00000000006c7dd9 in PortalRun (portal=portal@entry=0x16f7700,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x16bcd60,   altdest=altdest@entry=0x16bcd60,
completionTag=completionTag@entry=0x7ffcbe382fa0 "") at pquery.c:813
#14 0x00000000006c5c98 in exec_simple_query (query_string=0x16bbfc0
"analyze foo ;") at postgres.c:1094
#15 PostgresMain (argc=<optimized out>, argv=argv@entry=0x164baf8,
dbname=0x164b9a8 "jjanes", username=<optimized out>) at
postgres.c:4021
#16 0x000000000047cb1e in BackendRun (port=0x1669d40) at postmaster.c:4258
#17 BackendStartup (port=0x1669d40) at postmaster.c:3932
#18 ServerLoop () at postmaster.c:1690
#19 0x000000000066ff27 in PostmasterMain (argc=argc@entry=1,
argv=argv@entry=0x164aa10) at postmaster.c:1298
#20 0x000000000047d35e in main (argc=1, argv=0x164aa10) at main.c:228

Cheers,

Jeff

Re: multivariate statistics v14

From

Tomas Vondra

Date:

13 March 2016, 22:00:08

On Sat, 2016-03-12 at 23:30 -0800, Jeff Janes wrote:
> On Wed, Mar 9, 2016 at 9:21 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > On Wed, 2016-03-09 at 08:45 -0800, Jeff Janes wrote:
> > >
> > > On Wed, Mar 9, 2016 at 7:02 AM, Tomas Vondra
> > > <tomas.vondra@2ndquadrant.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > thanks for the feedback. Attached is v14 of the patch series,
> > > > fixing
> > > > most of the points you've raised.
> > >
> > > Hi Tomas,
> > >
> > > Applied to aa09cd242fa7e3a694a31f, I still get the seg faults in
> > > make
> > > check if I configure without --enable-cassert.
> > Ah, after disabling asserts I can reproduce it too. And the reason
> > why
> > it fails is quite simple - clauselist_selectivity modifies the
> > original
> > list of clauses, which then confuses cost_qual_eval.
> >
> > Can you try if the attached patch fixes the issue? I'll need to
> > rework a
> > bit more of the code, but let's see if this fixes the issue on your
> > machine too.
> That patch on top of v14 did fix the original problem.  But I got
> another segfault:

Oh, yeah. There was an extra pfree().

Attached is v15 of the patch series, fixing this and also doing quite a
few additional improvements:

* added some basic examples into the SGML documentation

* addressing the objectaddress omissions, as pointed out by Alvaro

* support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA

* significant refactoring of MCV and histogram code, particularly 
  serialization, deserialization and building

* reworking the functional dependencies to support more complex 
  dependencies, with multiple columns as 'conditions'

* the reduction using functional dependencies is also significantly
  simplified (I decided to get rid of computing the transitive closure
  for now - it got too complex after the multi-condition dependencies,
  so I'll leave that for the future)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 03/16/2016 09:31 AM, Kyotaro HORIGUCHI wrote:
> Hello, I returned to this.
>
> At Sun, 13 Mar 2016 22:59:38 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<1457906378.27231.10.camel@2ndquadrant.com>
>> Oh, yeah. There was an extra pfree().
>>
>> Attached is v15 of the patch series, fixing this and also doing quite a
>> few additional improvements:
>>
>> * added some basic examples into the SGML documentation
>>
>> * addressing the objectaddress omissions, as pointed out by Alvaro
>>
>> * support for ALTER STATISTICS ... OWNER TO / RENAME / SET SCHEMA
>>
>> * significant refactoring of MCV and histogram code, particularly
>>   serialization, deserialization and building
>>
>> * reworking the functional dependencies to support more complex
>>   dependencies, with multiple columns as 'conditions'
>>
>> * the reduction using functional dependencies is also significantly
>>   simplified (I decided to get rid of computing the transitive closure
>>   for now - it got too complex after the multi-condition dependencies,
>>   so I'll leave that for the future
>
> Many trailing white spaces found.

Sorry, haven't noticed that after one of the rebases. Fixed in the
attached v15 of the patch.

>
> 0002
>
> + * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
>
>  2014 should be 2016?

Yes, the copyright info will need some tweaks. There are a few other files
with 2015, and I think the start should be the current year (and not 1996).

>
>
>  This patch defines many "magic"s for many structs, but
>  magic(number)s seems to be used to identify file or buffer page
>  in PostgreSQL. They wouldn't be needed if you don't intend to
>  dig out or identify the orphan memory blocks of mvstats.
>
> +    MVDependency    deps[1];    /* XXX why not a pointer? */
>
> MVDependency seems to be a pointer type.

Right, but we need an array of the structures here, so one way is to use
a pointer and the other is to use a variable-length field. Will remove
the comment, I think the structure is fine as is.
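
For reference, the variable-length-field approach can be sketched as
follows. This is an illustration only: DepData, Dep, DepList and
deplist_alloc are hypothetical names, and the patch's real MVDependency
layout is not reproduced here. The header and the array live in a single
allocation, which is what makes the structure easy to serialize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the structures from the patch. */
typedef struct DepData
{
    int from_attr;
    int to_attr;
} DepData;

typedef DepData *Dep;       /* a pointer type, as MVDependency is said to be */

typedef struct DepList
{
    int ndeps;              /* number of entries in deps[] */
    Dep deps[];             /* C99 flexible array member; older code
                             * spells this deps[1] */
} DepList;

/* Allocate the header plus n zero-filled array slots in one chunk. */
DepList *
deplist_alloc(int n)
{
    DepList *list;

    assert(n >= 0);
    list = calloc(1, offsetof(DepList, deps) + (size_t) n * sizeof(Dep));
    if (list != NULL)
        list->ndeps = n;
    return list;
}
```

With a plain pointer field instead, the array would need a second
allocation and a second free.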

>
> +        if (numcols >= MVSTATS_MAX_DIMENSIONS)
> +            ereport(ERROR,
> and
> +        Assert((attrs->dim1 >= 2) && (attrs->dim1 <= MVSTATS_MAX_DIMENSIONS));
>
> seem to be contradicting.

Nope, because the first check is in a loop where 'numcols' is used as an
index into an array with MVSTATS_MAX_DIMENSIONS elements.
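
A minimal sketch of why the two checks are consistent, with hypothetical
names (collect_columns, and MAX_DIMENSIONS with an arbitrary value
standing in for MVSTATS_MAX_DIMENSIONS). Inside the loop the counter is
the next index to write, so it must be strictly below the array size;
after the loop it is a count, and a completely full array is legal.

```c
#include <assert.h>

/* Hypothetical limit; the value 8 is an arbitrary choice for this sketch. */
#define MAX_DIMENSIONS 8

/*
 * Copy column numbers into out[].  Returns the number of columns stored,
 * or -1 if the input has more than MAX_DIMENSIONS columns.
 */
int
collect_columns(const int *cols, int ncols, int out[MAX_DIMENSIONS])
{
    int numcols = 0;
    int i;

    for (i = 0; i < ncols; i++)
    {
        /* numcols is used as an index here, hence the >= comparison */
        if (numcols >= MAX_DIMENSIONS)
            return -1;
        out[numcols++] = cols[i];
    }

    /* numcols is now a count, so equality with the limit is allowed */
    assert(numcols <= MAX_DIMENSIONS);
    return numcols;
}
```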

>
> .. Sorry, time is up..

Thanks for the comments!

Attached is v15 of the patch, that also fixes one mistake - after
reworking the functional dependencies to support multiple columns on the
left side (as conditions), I failed to move it to the proper place in
the patch series. So 0002 built the dependencies in the old way and 0003
changed it to the new one. That was pointless and added another 20kB to
the patch, so v15 moves the new code to 0002.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 03/21/2016 12:00 AM, Tatsuo Ishii wrote:
>>> Many trailing white spaces found.
>>
>> Sorry, haven't noticed that after one of the rebases. Fixed in the
>> attached v15 of the patch.
>
> There are still a few trailing spaces.
>
> /home/t-ishii/0002-shared-infrastructure-and-functional-dependencies.patch:3792: trailing whitespace.
> /home/t-ishii/0004-multivariate-MCV-lists.patch:471: trailing whitespace.
> /home/t-ishii/0004-multivariate-MCV-lists.patch:656: space before tab in indent.
>      {
> /home/t-ishii/0004-multivariate-MCV-lists.patch:682: space before tab in indent.
>      }
> /home/t-ishii/0004-multivariate-MCV-lists.patch:685: space before tab in indent.
>      {
> /home/t-ishii/0004-multivariate-MCV-lists.patch:715: trailing whitespace.
> /home/t-ishii/0006-multi-statistics-estimation.patch:2513: trailing whitespace.
>
> Best regards,

D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
remaining whitespace issues.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

multivariate-stats-v16.tgz

Re: multivariate statistics v14

From

Alvaro Herrera

Date:

21 March 2016, 03:34:48

Another skim on 0002:

reference.sgml is missing a call to &alterStatistic.

ObjectProperty[] contains a comment that the ACL is "same as relation",
but is that still correct, given that now stats may be related to more
than one relation?  Do we even know what the rules for ACLs on
cross-relation stats are?  One very simple way to get around this is to
dictate that all the rels must have the same owner.  Perhaps we're not
considering the multi-relation case yet?

We have this FIXME comment in do_analyze_rel:

+     * FIXME This sample sizing is mostly OK when computing stats for
+     *       individual columns, but when computing multi-variate stats
+     *       for multivariate stats (histograms, mcv, ...) it's rather
+     *       insufficient. For stats on multiple columns / complex stats
+     *       we need larger sample sizes, because we need to build more
+     *       detailed stats (more MCV items / histogram buckets) to get
+     *       good accuracy. Maybe it'd be appropriate to use samples
+     *       proportional to the table (say, 0.5% - 1%) instead of a
+     *       fixed size might be more appropriate. Also, this should be
+     *       bound to the requested statistics size - e.g. number of MCV
+     *       items or histogram buckets should require several sample
+     *       rows per item/bucket (so the sample should be k*size).

Maybe this merits more discussion.  Right now we have an upper bound on
how much to scan for analyze; if we introduce the idea of scanning a
percentage of the relation, the time to analyze very large relations
could increase significantly.  Do we have an idea of what to do for
this?  For instance, a rule that would make me comfortable would say to
scan a sample 3x the current size when you have a mvstats on 3 columns;
then the size of fraction to scan is still bounded.  But does that
actually work?  From the wording of this comment, I assume you don't
actually know.

In this block (CreateStatistics)
+    /* look for duplicities */
+    for (i = 0; i < numcols; i++)
+        for (j = 0; j < numcols; j++)
+            if ((i != j) && (attnums[i] == attnums[j]))
+                ereport(ERROR,
+                        (errcode(ERRCODE_UNDEFINED_COLUMN),
+                         errmsg("duplicate column name in statistics definition")));

isn't it easier to have the inner loop go from i+1 to numcols?
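
A sketch of the suggested loop shape, written as a standalone function.
The name has_duplicate_attnum and the use of plain int instead of
AttrNumber are assumptions for illustration; in the patch the check
raises an error via ereport rather than returning a flag. Starting the
inner loop at i + 1 visits each pair once and makes the (i != j) test
unnecessary.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if any value occurs more than once in
 * attnums[0 .. numcols - 1].
 */
bool
has_duplicate_attnum(const int *attnums, int numcols)
{
    int i;
    int j;

    assert(numcols >= 0);

    for (i = 0; i < numcols; i++)
        for (j = i + 1; j < numcols; j++)   /* only pairs with j > i */
            if (attnums[i] == attnums[j])
                return true;

    return false;
}
```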


I wonder if this is sensible with multi-relation statistics:
+    /*
+     * Store a dependency too, so that statistics are dropped on DROP TABLE
+     */
+    parentobject.classId = RelationRelationId;
+    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+    parentobject.objectSubId = 0;
+    childobject.classId = MvStatisticRelationId;
+    childobject.objectId = statoid;
+    childobject.objectSubId = 0;

I suppose the idea is to drop the stats if any of the rels they are for
is dropped.

Right after that you create a dependency on the schema.  Is that
necessary?  Since you have the dependency on the relation, the stats
would be dropped by recursion.

Why are you #include'ing builtins.h everywhere?

RelationGetMVStatList() needs a comment.

Please get rid of common.h.  It's totally unlike the way we structure
our header files.  We don't keep headers in src/backend; they're all in
src/include.  One reason is that the latter gets installed as a whole in
include/server, which this file will not be.  This file may be necessary
to build some extensions in the future, for example.

In mvstats.h, please mark function prototypes as "extern".

Many files need a pgindent pass.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Robert Haas

Date:

21 March 2016, 09:34:12

On Sun, Mar 20, 2016 at 11:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation?  Do we even know what the rules for ACLs on
> cross-relation stats are?  One very simple way to get around this is to
> dictate that all the rels must have the same owner.

That's not really all that simple - you'd have to forbid changing the
owner of a relation involved in multi-rel statistics, but that's
horrible.  Presumably at the very least you'd then have to find some
way of allowing the owner of everything in the group to be changed at
the same time, but that's a whole new innovation.  I think this is a
very messy line of attack.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: multivariate statistics v14

From

Tomas Vondra

Date:

21 March 2016, 10:08:41

Hi,

On 03/21/2016 10:34 AM, Robert Haas wrote:
> On Sun, Mar 20, 2016 at 11:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> ObjectProperty[] contains a comment that the ACL is "same as relation",
>> but is that still correct, given that now stats may be related to more
>> than one relation?  Do we even know what the rules for ACLs on
>> cross-relation stats are?  One very simple way to get around this is to
>> dictate that all the rels must have the same owner.
>
> That's not really all that simple - you'd have to forbid changing
> the owner of a relation involved in multi-rel statistics, but that's
> horrible. Presumably at the very least you'd then have to find some
> way of allowing the owner of everything in the group to be changed
> at the same time, but that's a whole new innovation. I think this is
> a very messy line of attack.

I agree. I don't think we should / need to impose such additional 
restrictions (e.g. same owner for all tables).

I think for using the statistics (to compute estimates for a query), it
should be enough that the user can access all the tables it's built on.
Which happens somewhat implicitly, and currently it's trivial as each
statistics object is built on a single table.

I don't have a clear idea what we should do in the future with multiple
tables (e.g. when the statistics is built on 3 tables, the query is on 2
of them and the user does not have access to the remaining one).

But maybe we need to support ACLs because of ALTER STATISTICS?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

21 March 2016, 10:30:57

On 03/21/2016 04:34 AM, Alvaro Herrera wrote:
> Another skim on 0002:
>
> reference.sgml is missing a call to &alterStatistic.
>
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation?  Do we even know what the rules for ACLs on
> cross-relation stats are?  One very simple way to get around this is to
> dictate that all the rels must have the same owner.  Perhaps we're not
> considering the multi-relation case yet?

As I wrote in response to Robert's message, I don't think we need ACLs 
for statistics - the user should be able to use them when they can 
access all the underlying relations (in a query). For ALTER STATISTICS 
the (owner || superuser) check should be enough, right?

>
> We have this FIXME comment in do_analyze_rel:
>
> +     * FIXME This sample sizing is mostly OK when computing stats for
> +     *       individual columns, but when computing multi-variate stats
> +     *       for multivariate stats (histograms, mcv, ...) it's rather
> +     *       insufficient. For stats on multiple columns / complex stats
> +     *       we need larger sample sizes, because we need to build more
> +     *       detailed stats (more MCV items / histogram buckets) to get
> +     *       good accuracy. Maybe it'd be appropriate to use samples
> +     *       proportional to the table (say, 0.5% - 1%) instead of a
> +     *       fixed size might be more appropriate. Also, this should be
> +     *       bound to the requested statistics size - e.g. number of MCV
> +     *       items or histogram buckets should require several sample
> +     *       rows per item/bucket (so the sample should be k*size).
>
> Maybe this merits more discussion.  Right now we have an upper bound on
> how much to scan for analyze; if we introduce the idea of scanning a
> percentage of the relation, the time to analyze very large relations
> could increase significantly.  Do we have an idea of what to do for
> this?  For instance, a rule that would make me comfortable would say to
> scan a sample 3x the current size when you have a mvstats on 3 columns;
> then the size of fraction to scan is still bounded.  But does that
> actually work?  From the wording of this comment, I assume you don't
> actually know.

Yeah. I think more discussion is needed, because I myself am not sure 
the FIXME is actually correct. For now I think we're OK with using the 
same logic as statistics on a single column (300 * target).
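
To put numbers on the two options discussed here, a small sketch (not code from the patch; `scaled_sample_rows` is a made-up name for the "3x the sample for 3 columns" rule suggested in the review, and 100 is PostgreSQL's default statistics target):

```python
def analyze_sample_rows(stats_target: int) -> int:
    """Rows sampled by ANALYZE for a single column: 300 * statistics target."""
    return 300 * stats_target


def scaled_sample_rows(stats_target: int, ncolumns: int) -> int:
    """Hypothetical rule: multiply the sample size by the number of columns."""
    return analyze_sample_rows(stats_target) * ncolumns


print(analyze_sample_rows(100))    # 30000
print(scaled_sample_rows(100, 3))  # 90000
```

Either way the sample stays bounded by the statistics target rather than growing with the table, which is the property the review asks for.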

>
> In this block (CreateStatistics)
> +    /* look for duplicities */
> +    for (i = 0; i < numcols; i++)
> +        for (j = 0; j < numcols; j++)
> +            if ((i != j) && (attnums[i] == attnums[j]))
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
> +                         errmsg("duplicate column name in statistics definition")));
>
> isn't it easier to have the inner loop go from i+1 to numcols?

It probably is.
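
For illustration only, the tightened loop could look like this standalone sketch (Python rather than the patch's C, and `has_duplicate_attnum` is a made-up helper name). Starting the inner loop at i + 1 compares each pair once and makes the (i != j) test unnecessary:

```python
def has_duplicate_attnum(attnums):
    """Return True if any attribute number appears more than once."""
    numcols = len(attnums)
    for i in range(numcols):
        # Only look at later positions; earlier pairs were already compared.
        for j in range(i + 1, numcols):
            if attnums[i] == attnums[j]:
                return True
    return False
```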

>
> I wonder if this is sensible with multi-relation statistics:
> +    /*
> +     * Store a dependency too, so that statistics are dropped on DROP TABLE
> +     */
> +    parentobject.classId = RelationRelationId;
> +    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
> +    parentobject.objectSubId = 0;
> +    childobject.classId = MvStatisticRelationId;
> +    childobject.objectId = statoid;
> +    childobject.objectSubId = 0;
>
> I suppose the idea is to drop the stats if any of the rels they are for
> is dropped.

What do you mean by sensible? I mean, we don't support multiple tables 
at this point (except for choosing a syntax that should allow that), and 
the code assumes a single relation in a few places (like this one).

>
> Right after that you create a dependency on the schema.  Is that
> necessary?  Since you have the dependency on the relation, the stats
> would be dropped by recursion.

Hmmmm, that's probably right. Also, now that I think about it, it 
probably gets broken after ALTER STATISTICS ... SET SCHEMA, because the 
code does not remove the old dependency (and does not create a new one).

>
> Why are you #include'ing builtins.h everywhere?

Stupidity.

>
> RelationGetMVStatList() needs a comment.

OK.

>
> Please get rid of common.h.  It's totally unlike the way we structure
> our header files.  We don't keep headers in src/backend; they're all in
> src/include.  One reason is that the latter gets installed as a whole in
> include/server, which this file will not be.  This file may be necessary
> to build some extensions in the future, for example.

OK, I'll rework that and move it to src/include/.

>
> In mvstats.h, please mark function prototypes as "extern".
>
> Many files need a pgindent pass.

OK.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Jeff Janes

Date:

22 March 2016, 05:53:19

On Sun, Mar 20, 2016 at 4:34 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
> remaining whitespace issues.

Hi Tomas,

I'm trying out v16 against a common problem, where postgresql thinks
it is likely to stop early during an "order by (index expression) limit
1" but it doesn't actually stop early due to cross-column
correlations.  But the multivariate statistics don't seem to help.  Am
I doing this wrong, or just expecting too much?


jjanes=# create table foo as select x, floor(x/(10000000/500))::int as
y  from generate_series(1,10000000) f(x);
jjanes=# create index on foo (x,y);
jjanes=# create index on foo (y,x);
jjanes=# create statistics jjj on foo (x,y) with (dependencies,histogram);
jjanes=# vacuum analyze ;


jjanes=# explain (analyze, timing off)  select x from foo where y
between 478 and 480 order by x limit 1;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..4.92 rows=1 width=4) (actual rows=1 loops=1)
   ->  Index Only Scan using foo_x_y_idx on foo  (cost=0.43..210156.55 rows=46812 width=4) (actual rows=1 loops=1)
         Index Cond: ((y >= 478) AND (y <= 480))
         Heap Fetches: 0
 Planning time: 0.311 ms
 Execution time: 478.917 ms

Here it walks up the index on x, until it meets the first row meeting
the qualification on y. It thinks it will get to stop early and be
very fast, but it doesn't.
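
A rough sketch of where the 4.92 estimate comes from, assuming the Limit node is charged a linear share of the child scan's run cost (the function name is made up for illustration; this is not planner code). With 46812 matching rows expected to be spread evenly through the index, fetching the first one looks almost free:

```python
def limit_cost(startup_cost, total_cost, est_rows, limit):
    """Estimated cost of fetching `limit` rows from a child node expected to
    return est_rows rows, assuming matches are spread evenly over the scan."""
    fraction = min(1.0, limit / est_rows)
    return startup_cost + (total_cost - startup_cost) * fraction


# Numbers taken from the Index Only Scan line of the plan above.
print(round(limit_cost(0.43, 210156.55, 46812, 1), 2))  # 4.92
```

The even-spread assumption is what fails here: all rows with y between 478 and 480 have x above 9.5 million, so the scan reads most of the index before it finds the first match.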

If I add a dummy addition to the ORDER BY, to force it not to walk
the index, I get a plan which uses the other index and is actually
much faster, but is planned to be several hundred times slower:


jjanes=# explain (analyze, timing off)  select x from foo where y
between 478 and 480 order by x+0 limit 1;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1803.77..1803.77 rows=1 width=8) (actual rows=1 loops=1)
   ->  Sort  (cost=1803.77..1920.80 rows=46812 width=8) (actual rows=1 loops=1)
         Sort Key: ((x + 0))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Only Scan using foo_y_x_idx on foo  (cost=0.43..1569.70 rows=46812 width=8) (actual rows=60000 loops=1)
               Index Cond: ((y >= 478) AND (y <= 480))
               Heap Fetches: 0
 Planning time: 0.175 ms
 Execution time: 20.264 ms

(I use the "timing off" option, because without it the second plan
spends most of its time calling "gettimeofday")

Cheers,

Jeff

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

22 March 2016, 08:13:38

>> Do you have any other missing parts in this work? I am asking
>> because I wonder if you want to push this into 9.6 or rather 9.7.
> 
> I think the first few parts of the patch series, namely:
> 
>   * shared infrastructure (0002)
>   * functional dependencies (0003)
>   * MCV lists (0004)
>   * histograms (0005)
> 
> might make it into 9.6. I believe the code for building and storing
> the different kinds of stats is reasonably solid. What probably needs
> more thorough review are the changes in clauselist_selectivity(), but
> the code in these parts is reasonably simple as it only supports using
> a single multi-variate statistics per relation.
> 
> The part (0006) that allows using multiple statistics (i.e. selects
> which of the available stats to use and in what order) is probably the
> most complex part of the whole patch, and I myself do have some
> questions about some aspects of it. I don't think this part might get
> into 9.6 at this point (although it'd be nice if we managed to do
> that).

Hum. So without 0006 or beyond, there's not much benefit for the
PostgreSQL users, and you are not too confident about 0006 or
beyond. Then I would think it is a little bit hard to justify in
putting 000[2-5] into 9.6. I really like this feature and would like
to see in PostgreSQL someday, but I'm not sure if we should put the
patches (0002-0005) into PostgreSQL now. Please let me know if there's
some reasons we should put the patches into PostgreSQL now.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tomas Vondra

Date:

22 March 2016, 09:23:32

Hi,

On 03/22/2016 06:53 AM, Jeff Janes wrote:
> On Sun, Mar 20, 2016 at 4:34 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
>> remaining whitespace issues.
>
> Hi Tomas,
>
> I'm trying out v16 against a common problem, where postgresql thinks
> it is likely top stop early during a "order by (index express) limit
> 1" but it doesn't actually stop early due to cross-column
> correlations.  But the multivariate statistics don't seem to help.  Am
> I doing this wrong, or just expecting too much?

Yes, I think you're expecting too much from the current patch.

I've been thinking about perhaps addressing cases like this in the 
future, but it requires tracking position within the table somehow (e.g. 
by means of including ctid in the table, or something like that), and 
the current patch does not implement that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

22 March 2016, 09:44:24

Hello,

On 03/22/2016 09:13 AM, Tatsuo Ishii wrote:
>>> Do you have any other missing parts in this work? I am asking
>>> because I wonder if you want to push this into 9.6 or rather 9.7.
>>
>> I think the first few parts of the patch series, namely:
>>
>>   * shared infrastructure (0002)
>>   * functional dependencies (0003)
>>   * MCV lists (0004)
>>   * histograms (0005)
>>
>> might make it into 9.6. I believe the code for building and storing
>> the different kinds of stats is reasonably solid. What probably needs
>> more thorough review are the changes in clauselist_selectivity(), but
>> the code in these parts is reasonably simple as it only supports using
>> a single multi-variate statistics per relation.
>>
>> The part (0006) that allows using multiple statistics (i.e. selects
>> which of the available stats to use and in what order) is probably the
>> most complex part of the whole patch, and I myself do have some
>> questions about some aspects of it. I don't think this part might get
>> into 9.6 at this point (although it'd be nice if we managed to do
>> that).
>
> Hum. So without 0006 or beyond, there's not much benefit for the
> PostgreSQL users, and you are not too confident about 0006 or
> beyond. Then I would think it is a little bit hard to justify in
> putting 000[2-5] into 9.6. I really like this feature and would like
> to see in PostgreSQL someday, but I'm not sure if we should put the
> patches (0002-0005) into PostgreSQL now. Please let me know if there's
> some reaons we should put the patches into PostgreSQL now.

I don't think so. While being able to combine multiple statistics is 
certainly useful, I'm convinced that the initial patches add enough 
value on their own, even if the 0006 patch gets committed later.

A lot of queries will be just fine with the "single multivariate 
statistics" limitation, either because they use fewer than 8 columns, 
or because only 8 columns are actually correlated. (FWIW the 8 column 
limit is mostly arbitrary, it may get increased if needed.)

I haven't really mentioned the aspects of 0006 that I think need more 
discussion, but it's mostly about the question whether combining the 
statistics by using the overlapping clauses as "conditions" is the right 
thing to do (or whether a more expensive approach is needed). None of 
that however invalidates the preceding patches.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

22 March 2016, 10:41:10

>> Hum. So without 0006 or beyond, there's not much benefit for the
>> PostgreSQL users, and you are not too confident about 0006 or
>> beyond. Then I would think it is a little bit hard to justify in
>> putting 000[2-5] into 9.6. I really like this feature and would like
>> to see in PostgreSQL someday, but I'm not sure if we should put the
>> patches (0002-0005) into PostgreSQL now. Please let me know if there's
>> some reaons we should put the patches into PostgreSQL now.
> 
> I don't think so. While being able to combine multiple statistics is
> certainly useful, I'm convinced that the initial patched add enough

Can you please elaborate a little bit more how combining multiple
statistics is useful?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tomas Vondra

Date:

22 March 2016, 12:37:33

Hi,

On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>> PostgreSQL users, and you are not too confident about 0006 or
>>> beyond. Then I would think it is a little bit hard to justify in
>>> putting 000[2-5] into 9.6. I really like this feature and would
>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>> know if there's some reaons we should put the patches into
>>> PostgreSQL now.
>>
>> I don't think so. While being able to combine multiple statistics
>> is certainly useful, I'm convinced that the initial patched add
>> enough
>
> Can you please elaborate a little bit more how combining multiple
> statistics is useful?

Sure.

The goal of multivariate statistics is to approximate a probability 
distribution on a group of columns. The larger the number of columns, 
the less accurate the statistics will be (with respect to individual 
columns), assuming fixed size of the sample in ANALYZE, and fixed 
statistics size.

For example, if you add a column to multivariate histogram, you'll do 
some "bucket splits" by this dimension, thus reducing the accuracy for 
the other columns. You may of course allow larger statistics (e.g. 
histograms with more buckets), but that also requires larger samples, 
and so on.

Now, let's assume you have a query like this:

    WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)

and that "a" and "b" are correlated, and "c" and "d" are correlated, but 
that otherwise the columns are independent. It'd be a bit silly to 
require building statistics on (a,b,c,d), when two statistics on each of 
the column pairs would be cheaper and also more accurate.

That's of course a trivial case - independent groups of correlated 
columns. But I'd say this is actually a pretty common case, and I do 
believe there's not much controversy that we should support it.
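
A small sketch of the arithmetic for this case. The selectivities are made-up numbers, not measurements: each column alone matches 1% of the rows, and within each pair the columns are assumed to be perfectly correlated, so a pair also matches 1%.

```python
import math


def combine_independent_groups(group_selectivities):
    """Multiply the selectivities of groups assumed to be independent."""
    return math.prod(group_selectivities)


total_rows = 1_000_000

# Per-column statistics only: all four columns treated as independent.
per_column = combine_independent_groups([0.01, 0.01, 0.01, 0.01])

# One statistic on (a,b) and one on (c,d): only the two groups are
# treated as independent of each other.
per_pair = combine_independent_groups([0.01, 0.01])

print(per_column * total_rows)  # roughly 0.01 rows
print(per_pair * total_rows)    # roughly 100 rows
```

Under these assumptions the two pair statistics recover the estimate without needing a single four-column statistic.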

Another reason to allow multiple statistics is that columns in one group 
may be a good fit for MCV list (which works well for discrete values), 
while the other group may be a good candidate for histogram (which works 
well for continuous values). This can't be solved by first building a 
MCV and then a histogram on the group.

The question of course is what to do if the groups are not independent. 
The patch does that by assuming the statistics overlap, and uses 
conditions on the columns included in both statistics to combine them 
using conditional probabilities. I do believe this works quite well, but 
this is perhaps the part that needs further discussion. There are other 
ways to combine the statistics, but I do expect them to be considerably 
more expensive.
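
One way to read "combine them using conditional probabilities", as a sketch only (this is not necessarily what 0006 does, and the function name and numbers are made up). With statistics on (a,b) and (b,c) overlapping in b, P(a,b,c) is approximated as P(a,b) * P(c|b), where P(c|b) = P(b,c) / P(b):

```python
import math


def combine_overlapping(sel_ab, sel_bc, sel_b):
    """Estimate P(a, b, c) as P(a, b) * P(b, c) / P(b).

    This assumes a and c are independent once b is fixed.
    """
    if sel_b == 0.0:
        return 0.0  # nothing matches the shared condition
    return sel_ab * (sel_bc / sel_b)


# Made-up selectivities: P(a,b) = 0.01, P(b,c) = 0.005, P(b) = 0.01
estimate = combine_overlapping(0.01, 0.005, 0.01)
```

Here the estimate works out to 0.005, i.e. half of the rows matching (a,b) are expected to also match the condition on c.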

Is this a sufficient explanation?

Of course, there's a fair amount of additional complexity that I have 
not mentioned here (e.g. selecting the right combination of stats).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

22 March 2016, 12:46:22

> On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>>> PostgreSQL users, and you are not too confident about 0006 or
>>>> beyond. Then I would think it is a little bit hard to justify in
>>>> putting 000[2-5] into 9.6. I really like this feature and would
>>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>>> know if there's some reaons we should put the patches into
>>>> PostgreSQL now.
>>>
>>> I don't think so. While being able to combine multiple statistics
>>> is certainly useful, I'm convinced that the initial patched add
>>> enough
>>
>> Can you please elaborate a little bit more how combining multiple
>> statistics is useful?
> 
> Sure.
> 
> The goal of multivariate statistics is to approximate a probability
> distribution on a group of columns. The larger the number of columns,
> the less accurate the statistics will be (with respect to individual
> columns), assuming fixed size of the sample in ANALYZE, and fixed
> statistics size.
> 
> For example, if you add a column to multivariate histogram, you'll do
> some "bucket splits" by this dimension, thus reducing the accuracy for
> the other columns. You may of course allow larger statistics
> (e.g. histograms with more buckets), but that also requires larger
> samples, and so on.
> 
> Now, let's  assume you have a query like this:
> 
>     WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
> 
> and that "a" and "b" are correlated, and "c" and "d" are correlated,
> but that otherwise the columns are independent. It'd be a bit silly to
> require building statistics on (a,b,c,d), when two statistics on each
> of the column pairs would be cheaper and also more accurate.
> 
> That's of course a trivial case - independent groups of correlated
> columns. But I'd say this is actually a pretty common case, and I do
> believe there's not much controversy that we should support it.
> 
> Another reason to allow multiple statistics is that columns in one
> group may be a good fit for MCV list (which works well for discrete
> values), while the other group may be a good candidate for histogram
> (which works well for continuous values). This can't be solved by
> first building a MCV and then a histogram on the group.
> 
> The question of course is what to do if the groups are not
> independent. The patch does that by assuming the statistics overlap,
> and uses conditions on the columns included in both statistics to
> combine them using conditional probabilities. I do believe this works
> quite well, but this is perhaps the part that needs further
> discussion. There are other ways to combine the statistics, but I do
> expect them to be considerably more expensive.
> 
> Is this a sufficient explanation?
> 
> Of course, there's a fair amount of additional complexity that I have
> not mentioned here (e.g. selecting the right combination of stats).

Sorry, maybe I did not explain clearly. My question is, if we put
only patches 0002 to 0005 into 9.6, does it still give any visible
benefit to users?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tomas Vondra

Date:

22 March 2016, 13:49:26

Hi,

On 03/22/2016 01:46 PM, Tatsuo Ishii wrote:
...
> Sorry, maybe I did not explain clearly. My question is, if put
> patches only 0002 to 0005 into 9.6, does it still give any visible
> benefit to users?

The users will be able to define statistics with the limitation that 
only a single one (the one covering the most columns referenced by the 
clauses) can be used when estimating a query. Which is not perfect, but 
I think it's a valuable improvement.
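
The selection rule can be sketched roughly like this (an illustration, not the patch's code; the names are made up, and requiring at least two covered columns is my assumption, since a statistic covering a single referenced column adds nothing over per-column stats):

```python
def choose_statistic(stats, clause_columns):
    """Pick the statistic covering the most columns referenced by the clauses.

    stats maps a statistic name to the set of columns it is defined on.
    Returns None if no statistic covers at least two referenced columns.
    """
    best_name, best_count = None, 1
    for name, columns in stats.items():
        covered = len(columns & clause_columns)
        if covered > best_count:
            best_name, best_count = name, covered
    return best_name


stats = {"s_ab": {"a", "b"}, "s_bcd": {"b", "c", "d"}}
chosen = choose_statistic(stats, {"a", "b", "c", "d"})
```

With clauses on all four columns this picks `s_bcd`, and the clause on "a" falls back to the regular per-column estimate.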

It might also be possible to split 0006 into smaller pieces, for example 
implementing the "non-overlapping statistics" case first and then 
extending it to more complicated cases. That might increase the chance 
of getting at least some of that into 9.6 ...

But it's not clear whether the initial chunks are likely to make it 
into 9.6 - I kinda expect a fair amount of comments about the preceding 
parts from TL, who mentioned he might look at the patch this week. So 
I'm not sure splitting 0006 into smaller pieces makes sense at this 
point.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

23 March 2016, 01:53:24

> The users will be able to define statistics with the limitation that
> only a single one (the one covering the most columns referenced by the
> clauses) can be used when estimating a query. Which is not perfect,
> but I think it's a valuable improvement.
> 
> It might also be possible to split 0006 into smaller pieces, for
> example implementing the "non-overlapping statistics" case first and
> then extending it to more complicated cases. That might increase the
> change of getting at least some of that into 9.6 ...
> 
> But considering it's not clear whether the initial chunks are likely
> to make it into 9.6 - I kinda expect a fair amount of comments from TL
> about the preceding parts, who mentioned he might look at the patch
> this week. So I'm not sure splitting 0006 into smaller pieces makes
> sense at this point.

Thanks for the explanation. I will look into patch 0001 to 0005 so
that they could get into 9.6.

In the mean time after applying patch 0001 to 0005 of v16, I get this
while compiling SGML docs.

openjade:ref/create_statistics.sgml:281:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"
openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tomas Vondra

Date:

23 March 2016, 01:57:31

On 03/23/2016 02:53 AM, Tatsuo Ishii wrote:
>> The users will be able to define statistics with the limitation that
>> only a single one (the one covering the most columns referenced by the
>> clauses) can be used when estimating a query. Which is not perfect,
>> but I think it's a valuable improvement.
>>
>> It might also be possible to split 0006 into smaller pieces, for
>> example implementing the "non-overlapping statistics" case first and
>> then extending it to more complicated cases. That might increase the
>> change of getting at least some of that into 9.6 ...
>>
>> But considering it's not clear whether the initial chunks are likely
>> to make it into 9.6 - I kinda expect a fair amount of comments from TL
>> about the preceding parts, who mentioned he might look at the patch
>> this week. So I'm not sure splitting 0006 into smaller pieces makes
>> sense at this point.
>
> Thanks for the explanation. I will look into patch 0001 to 0005 so
> that they could get into 9.6.
>
> In the mean time after applying patch 0001 to 0005 of v16, I get this
> while compiling SGML docs.
>
> openjade:ref/create_statistics.sgml:281:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"
> openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent ID "SQL-ALTERSTATISTICS"

I believe this is because reference.sgml is missing a call to 
&alterStatistic (per report by Alvaro Herrera).

thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

23 March 2016, 02:01:31

>> Thanks for the explanation. I will look into patch 0001 to 0005 so
>> that they could get into 9.6.
>>
>> In the mean time after applying patch 0001 to 0005 of v16, I get this
>> while compiling SGML docs.
>>
>> openjade:ref/create_statistics.sgml:281:26:X: reference to
>> non-existent ID "SQL-ALTERSTATISTICS"
>> openjade:ref/drop_statistics.sgml:86:26:X: reference to non-existent
>> ID "SQL-ALTERSTATISTICS"
> 
> I believe this is because reference.sgml is missing a call to
> &alterStatistic (per report by Alvaro Herrera).

Ok, I will patch reference.sgml.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

23 March 2016, 02:58:04

>> I believe this is because reference.sgml is missing a call to
>> &alterStatistic (per report by Alvaro Herrera).
> 
> Ok, I will patch reference.sgml.

Here are some comments on docs.

- There's no docs for pg_mv_statistic (should be added to "49. System Catalogs")

- The word "multivariate statistics" or something like that should appear in the index.

- There are some explanation how to deal with multivariate statistics in "14.1 Using Explain" and "14.2 Statistics used
by the Planner" section.

I am now looking into the create statistics doc to see if the example
appearing in it is working. I will get back if I find any.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

23 March 2016, 03:01:11

>>> I believe this is because reference.sgml is missing a call to
>>> &alterStatistic (per report by Alvaro Herrera).
>> 
>> Ok, I will patch reference.sgml.
> 
> Here are some comments on docs.
> 
> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
> 
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
Oops. Should read "There should be some explanations".

>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.
> 
> I am now looking into the create statistics doc to see if the example
> appearing in it is working. I will get back if I find any.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

23 March 2016, 05:20:11

>> I am now looking into the create statistics doc to see if the example
>> appearing in it is working. I will get back if I find any.

I have read the ref doc: CREATE STATISTICS

There are nice examples how the multivariate statistics gives better
row number estimation. So I gave them a try.

"Create table t1 with two functionally dependent columns,
i.e. knowledge of a value in the first column is sufficient for
determining the value in the other column" The example creates table
"t1", then populates it using generate_series. After CREATE
STATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrates
how result rows estimation is enhanced by using the multivariate
statistics.

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 1000000
 Planning time: 0.146 ms
 Execution time: 76.896 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 1000000
 Planning time: 0.102 ms
 Execution time: 78.885 ms
(5 rows)

It seems the row numbers estimation (98) using the multivariate
statistics is actually *worse* than the one (1) not using the
statistics because the actual row number is 0.

Next example (using table "t2") is much better than the case using t1.

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.107 ms
 Execution time: 75.680 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.067 ms
 Execution time: 76.935 ms
(5 rows)

This time it seems the row numbers estimation (9633) using the
multivariate statistics is much better than the one (91) not using the
statistics because the actual row number is 10000.

The last example (using table "t3") seems to show no effect from the multivariate statistics.

Here is the EXPLAIN output using the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 500) AND (b > 500);
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=0.154..132.509 rows=6002 loops=1)
   Filter: ((a < '500'::double precision) AND (b > '500'::double precision))
   Rows Removed by Filter: 993998
 Planning time: 0.080 ms
 Execution time: 132.735 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 400) AND (b > 600);
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=110.518..110.518 rows=0 loops=1)
   Filter: ((a < '400'::double precision) AND (b > '600'::double precision))
   Rows Removed by Filter: 1000000
 Planning time: 0.052 ms
 Execution time: 110.531 ms
(5 rows)

Here is the EXPLAIN output without the multivariate statistics:

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 500) AND (b > 500);
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=0.149..129.718 rows=5999 loops=1)
   Filter: ((a < '500'::double precision) AND (b > '500'::double precision))
   Rows Removed by Filter: 994001
 Planning time: 0.058 ms
 Execution time: 129.893 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM t3 WHERE (a < 400) AND (b > 600);
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..20407.65 rows=111123 width=16) (actual time=108.015..108.015 rows=0 loops=1)
   Filter: ((a < '400'::double precision) AND (b > '600'::double precision))
   Rows Removed by Filter: 1000000
 Planning time: 0.037 ms
 Execution time: 108.027 ms
(5 rows)

This time it seems the row numbers estimation (111123) using the
multivariate statistics is the same as the one (111123) not using
the statistics, while the actual row number is about 6000 or 0.

In summary, the only case which shows the effect of the multivariate
statistics is the "t2" case. So I don't see why other examples are
shown in the manual. Am I missing something?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Tomas Vondra

Date:

23 March 2016, 13:22:03

On 03/23/2016 06:20 AM, Tatsuo Ishii wrote:
>>> I am now looking into the create statistics doc to see if the example
>>> appearing in it is working. I will get back if I find any.
>
> I have the ref doc: CREATE STATISTICS
>
> There are nice examples how the multivariate statistics gives better
> row number estimation. So I gave them a try.
>
> "Create table t1 with two functionally dependent columns,
>  i.e. knowledge of a value in the first column is sufficient for
>  determining the value in the other column" The example creates table
>  "t1", then populates it using generate_series. After CREATE
>  STATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrates
>  how result rows estimation is enhanced by using the multivariate
>  statistics.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> ---------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.146 ms
>  Execution time: 76.896 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 1000000
>  Planning time: 0.102 ms
>  Execution time: 78.885 ms
> (5 rows)
>
> It seems the row numbers estimation (98) using the multivariate
> statistics is actually *worse* than the one (1) not using the
> statistics because the actual row number is 0.

Yes, there's a mistake in the first query, because the conditions 
actually are not compatible. I.e. (i/100)=1 and (i/500)=1 have no 
overlapping rows, clearly. It should be

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

instead. Will fix.
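
The incompatibility is easy to check by brute force. Below is a minimal
sketch in C, assuming t1 holds a = i/100 and b = i/500 for i = 1..1000000
(the two expressions are the ones quoted above; the range is inferred from
the 1000000 rows in the plans, and count_matching is a hypothetical helper,
not part of the patch):

```c
/*
 * Count rows of a hypothetical t1 (a = i/100, b = i/500 for i = 1..nrows)
 * that satisfy (a = want_a) AND (b = want_b).  C integer division
 * truncates, which matches SQL integer division on positive values.
 */
static long
count_matching(long nrows, long want_a, long want_b)
{
    long        count = 0;

    for (long i = 1; i <= nrows; i++)
    {
        if (i / 100 == want_a && i / 500 == want_b)
            count++;
    }
    return count;
}
```

Under that assumption, count_matching(1000000, 1, 1) finds no rows, while
count_matching(1000000, 1, 0) finds the 100 rows with i between 100 and 199.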

>
> Next example (using table "t2") is much better than the case using t1.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                                QUERY PLAN
> --------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.107 ms
>  Execution time: 75.680 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Seq Scan on t2  (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
>    Filter: ((a = 1) AND (b = 1))
>    Rows Removed by Filter: 990000
>  Planning time: 0.067 ms
>  Execution time: 76.935 ms
> (5 rows)
>
> This time it seems the row numbers estimation (9633) using the
> multivariate statistics is much better than the one (91) not using the
> statistics because the actual row number is 10000.
>
> The last example (using table "t3") seems no effect by multivariate statistics.

Yes. There's a typo in the example - it analyzes the wrong table (t2 
instead of t3). Once I fix that, the estimates are much better.

> In summary, the only case which shows the effect of the multivariate
> statistics is the "t2" case. So I don't see why other examples are
> shown in the manual. Am I missing something?

No, thanks for spotting those mistakes. I'll fix them and submit a new 
version of the patch - either later today or perhaps tomorrow.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Petr Jelinek

Date:

23 March 2016, 18:23:49

Hi,

I'll add a couple of code comments from my first cursory read through
(this is huge):

0002:
there is some whitespace noise between the varlistentries in 
alter_statistics.sgml

+    parentobject.classId = RelationRelationId;
+    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+    parentobject.objectSubId = 0;
+    childobject.classId = MvStatisticRelationId;
+    childobject.objectId = statoid;
+    childobject.objectSubId = 0;

I wonder if this (there is similar code in several places) would be
simpler using ObjectAddressSet()
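
A rough sketch of what that could look like. To keep it compilable outside
the PostgreSQL tree, Oid, ObjectAddress and the macro are simplified
stand-ins for the real definitions in catalog/objectaddress.h, the value
used for MvStatisticRelationId is a made-up placeholder, and
build_dependency_addresses is a hypothetical helper name:

```c
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions). */
typedef uint32_t Oid;

typedef struct ObjectAddress
{
    Oid         classId;        /* OID of the catalog the object lives in */
    Oid         objectId;       /* OID of the object itself */
    int32_t     objectSubId;    /* column number, or 0 for the whole object */
} ObjectAddress;

/* Same effect as the real macro: fill all three fields, sub-id = 0. */
#define ObjectAddressSet(addr, class_id, object_id) \
    do { \
        (addr).classId = (class_id); \
        (addr).objectId = (object_id); \
        (addr).objectSubId = 0; \
    } while (0)

#define RelationRelationId      1259    /* pg_class */
#define MvStatisticRelationId   9999    /* placeholder, not the patch's OID */

/* Replaces the six field assignments from the patch hunk above. */
static void
build_dependency_addresses(Oid relid, Oid statoid,
                           ObjectAddress *parentobject,
                           ObjectAddress *childobject)
{
    ObjectAddressSet(*parentobject, RelationRelationId, relid);
    ObjectAddressSet(*childobject, MvStatisticRelationId, statoid);
}
```

The gain is mostly readability: one line per address, and no chance of
forgetting to reset objectSubId.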

The common.h in backend/utils/mvstat is slightly weird header file 
placement and naming.


0004:
+/* used for merging bitmaps - AND (min), OR (max) */
+#define MAX(x, y) (((x) > (y)) ? (x) : (y))
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))

Huh? We have Max and Min macros defined in c.h
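
For illustration, a minimal sketch of the same merge written with the c.h
macros. The macro bodies mirror c.h; the match levels and the
merge_and/merge_or helpers are hypothetical names used only for this
example, not code taken from the patch:

```c
/* Same definitions as in c.h; guarded in case c.h is already included. */
#ifndef Max
#define Max(x, y)       ((x) > (y) ? (x) : (y))
#endif
#ifndef Min
#define Min(x, y)       ((x) < (y) ? (x) : (y))
#endif

/* Hypothetical match levels; a larger value means a stronger match. */
#define MATCH_NONE      0
#define MATCH_PARTIAL   1
#define MATCH_FULL      2

/* AND: an item matches only as well as the weaker of the two sides. */
static void
merge_and(const char *left, const char *right, char *result, int nitems)
{
    for (int i = 0; i < nitems; i++)
        result[i] = Min(left[i], right[i]);
}

/* OR: an item matches as well as the stronger of the two sides. */
static void
merge_or(const char *left, const char *right, char *result, int nitems)
{
    for (int i = 0; i < nitems; i++)
        result[i] = Max(left[i], right[i]);
}
```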

+        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);

Why the double space? (It's actually in several places across several
of the patches.)

I don't really understand why 0008 and 0009 are separate patches and
aren't part of one of the other patches. But otherwise good job on
splitting the functionality into a patchset.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

24 March 2016, 17:13:12

Hi,

attached is v17 of the patch series, with these changes:

* rebase to current master (the AM patch caused some conflicts)
* add alterStatistics to reference.sgml (Alvaro)
* move the sample size discussion to README.stats (Alvaro)
* tweak the inner for loop in CREATE STATISTICS (Alvaro)
* use ObjectAddressSet() to create dependencies in statscmds.c (Petr)
* fix whitespace in alterStatistics.sgml (Petr)
* replace custom MIN/MAX with Min/Max in c.h (Petr)
* fix examples in createStatistics.sgml (Tatsuo)

A few more comments inline:

On 03/23/2016 07:23 PM, Petr Jelinek wrote:
>
> The common.h in backend/utils/mvstat is slightly weird header file
> placement and naming.
>

True. I plan to move this header to

     src/include/catalog/pg_mv_statistic_fn.h

which is what the other catalogs do (as pointed out by Alvaro). Or do you
think another location/name would be more appropriate?

>
> +        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
>
> Why the double space (that's actually in several places in several of
> the patches).

To align the whole block like this:

     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
     nulls[Anum_pg_mv_statistic_standist -1] = true;

But I won't fight for this too hard, if it breaks rules somehow.

>
> I don't really understand why 0008 and 0009 are separate patches and
> aren't part of one of the other patches. But otherwise good job on
> splitting the functionality into patchset.

That is mostly because both 0007 and 0008 tweak the GROUP BY estimates,
but 0008 is not really part of this patch (it's discussed separately in
another thread). I admit it may be a bit confusing.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

multivariate-stats-v17.tgz

Re: multivariate statistics v14

From

Alvaro Herrera

Date:

24 March 2016, 17:45:39

Tomas Vondra wrote:

> >+        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
> >
> >Why the double space (that's actually in several places in several of
> >the patches).
> 
> To align the whole block like this:
> 
>     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
>     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
>     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
>     nulls[Anum_pg_mv_statistic_standist -1] = true;
> 
> But I won't fight for this too hard, if it breaks rules somehow.

Yeah, it will be undone by pgindent.  I suggest you pgindent all the
patches in the series.  With some clever patch vs. patch -R application,
you can do it without having to resolve any conflicts when pgindent
modifies code that a patch further up in the series modifies again.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

25 March 2016, 15:32:13

On 03/24/2016 06:45 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>>> +        values[Anum_pg_mv_statistic_stamcv  - 1] = PointerGetDatum(data);
>>>
>>> Why the double space (that's actually in several places in several of
>>> the patches).
>>
>> To align the whole block like this:
>>
>>     nulls[Anum_pg_mv_statistic_stadeps  -1] = true;
>>     nulls[Anum_pg_mv_statistic_stamcv   -1] = true;
>>     nulls[Anum_pg_mv_statistic_stahist  -1] = true;
>>     nulls[Anum_pg_mv_statistic_standist -1] = true;
>>
>> But I won't fight for this too hard, if it breaks rules somehow.
>
> Yeah, it will be undone by pgindent.  I suggest you pgindent all the
> patches in the series.  With some clever patch vs. patch -R application,
> you can do it without having to resolve any conflicts when pgindent
> modifies code that a patch further up in the series modifies again.
>

I could do that, but isn't that a bit pointless? I thought pgindent is 
run regularly on the whole codebase, not for individual patches. Sure, 
it'll tweak the formatting in a few places in the patch (including the
code discussed above, as you pointed out), but there are many other such 
places coming from other committed patches.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tom Lane

Date:

25 March 2016, 21:26:19

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I could do that, but isn't that a bit pointless? I thought pgindent is 
> run regularly on the whole codebase, not for individual patches. Sure, 
> it'll tweak the formatting on a few places in the patch (including the 
> code discussed above, as you pointed out), but there are many other such 
> places coming from other committed patches.

One point of running pgindent for yourself is to make sure you haven't set
up any code in a way that will look horrible after pgindent gets done with
it.
        regards, tom lane

Re: multivariate statistics v14

From

Tomas Vondra

Date:

26 March 2016, 02:02:21

On 03/25/2016 10:26 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I could do that, but isn't that a bit pointless? I thought pgindent is
>> run regularly on the whole codebase, not for individual patches. Sure,
>> it'll tweak the formatting on a few places in the patch (including the
>> code discussed above, as you pointed out), but there are many other such
>> places coming from other committed patches.
>
> One point of running pgindent for yourself is to make sure you
> haven't set up any code in a way that will look horrible after
> pgindent gets done with it.

Fair point. Attached is v18 of the patch, after pgindent cleanup.

FWIW, most of the tweaks were minor things like (! x) instead of (!x)
and so on. I also had to fix a few comments with internal formatting,
because pgindent decided to reformat the text using tabs etc.

There are a few places where I reverted the pgindent formatting, because
it seemed a bit too weird - the first is the lists of function
prototypes in common.h/mvstat.h, the second is the function calls to
the _greedy/_exhaustive methods.

None of those places would however qualify as 'horrible' in my opinion,
and the _greedy/_exhaustive functions are in the 0006 part, so fixing
that is not of immediate importance I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

multivariate-stats-v18.tgz

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

26 March 2016, 09:18:42

> Fair point. Attached is v18 of the patch, after pgindent cleanup.

Here is some feedback on the v18 patch.

1) regarding examples in create_statistics manual

Here are the numbers I got. "with statistics" refers to the case where
multivariate statistics are used.  "without statistics" refers to the
case where multivariate statistics are not used. The numbers denote
estimated_rows/actual_rows, so closer to 1.0 is better. Some numbers
are shown as a fraction to avoid division by zero. In my understanding,
cases 1, 3 and 4 show that multivariate statistics are superior.

         with statistics    without statistics
case1    0.98               0.01
case2    98/0               1/0
case3    1.05               0.01
case4    1/0                103/0
case5    18.50              18.33
case6    111123/0           1111123/0
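
The metric used in this table can be sketched as a small C helper. This
is only an illustration of the arithmetic: format_estimate_ratio is a
hypothetical name, and the two-decimal rounding is an assumption based on
how the numbers above are printed.

```c
#include <stdio.h>

/*
 * Format estimated/actual rows: a ratio with two decimals when
 * actual > 0, otherwise the raw "estimated/0" fraction, so that we
 * never divide by zero.
 */
static void
format_estimate_ratio(long estimated, long actual, char *buf, size_t buflen)
{
    if (actual == 0)
        snprintf(buf, buflen, "%ld/0", estimated);
    else
        snprintf(buf, buflen, "%.2f", (double) estimated / (double) actual);
}
```

With the t2 plans quoted earlier in the thread (9633 and 91 estimated
rows against 10000 actual rows) this gives 0.96 and 0.01, and an estimate
of 98 against 0 actual rows is reported as 98/0.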

2) following comments by me are not addressed in the v18 patch.

> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
> 
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Alvaro Herrera

Date:

26 March 2016, 19:09:41

Tomas Vondra wrote:

> There are a few places where I reverted the pgindent formatting, because it
> seemed a bit too weird - the first one are the lists of function prototypes
> in common.h/mvstat.h, the second one are function calls to
> _greedy/_exhaustive methods.

Function prototypes being weird is something that we've learned to
accept.  There's no point in undoing pgindent decisions there, because
the next run will re-apply them anyway.  Best not to fight it.

What you should definitely look into fixing is the formatting of
comments, if the result is too horrible.  You can prevent it from
messing those by adding dashes /*----- at the beginning of the comment.
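
A small sketch of that convention, as I understand it from the source
formatting notes: an indented block comment opened and closed with a run
of dashes should be left alone by pgindent instead of being re-flowed.
The function and the truth table are hypothetical, made up for this
example only:

```c
static int
and_match_levels(int left, int right)
{
    /*----------
     * The dashes ask pgindent to keep this layout exactly as typed.
     *
     *   left   right   result
     *   0      0       0
     *   0      1       0
     *   1      0       0
     *   1      1       1
     *----------
     */
    return (left < right) ? left : right;
}
```

Without the dashes, the table rows inside the comment would be treated as
plain text and re-wrapped.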

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

28 March 2016, 08:42:42

Hi,

On 03/26/2016 10:18 AM, Tatsuo Ishii wrote:
>> Fair point. Attached is v18 of the patch, after pgindent cleanup.
>
> Here are some feedbacks to v18 patch.
>
> 1) regarding examples in create_statistics manual
>
> Here are numbers I got. "with statistics" referrers to the case where
> multivariate statistics are used.  "without statistics" referrers to the
> case where multivariate statistics are not used. The numbers denote
> estimated_rows/actual_rows. Thus closer to 1.0 is better. Some numbers
> are shown as a fraction to avoid 0 division. In my understanding case
> 1, 3, 4 showed that multivariate statistics superior.
>
>     with statistics    without statistics
> case1    0.98        0.01
> case2    98/0        1/0

Case 2 shows that functional dependencies assume that the conditions
used in queries won't be incompatible - that's something this type of
statistics can't fix.

> case3    1.05        0.01
> case4    1/0        103/0
> case5    18.50        18.33
> case6    111123/0    1111123/0

The last two lines (case5 + case6) seem a bit suspicious. I believe 
those are for the histogram data, and I do get these numbers:

case5    0.93 (5517 / 5949)         42.0 (249943 / 5949)
case6    100/0                      100/0

Perhaps you've been using the version before the bugfix, with ANALYZE on 
the wrong table?

>
> 2) following comments by me are not addressed in the v18 patch.
>
>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>   Catalogs")
>>
>> - The word "multivariate statistics" or something like that should
>>   appear in the index.
>>
>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.

Yes, those are valid omissions. I plan to address them, and I'm also
considering adding a section to 65.1 (How the Planner Uses Statistics),
explaining more thoroughly how the planner uses multivariate stats.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

28 March 2016, 08:49:21

On 03/26/2016 08:09 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> There are a few places where I reverted the pgindent formatting, because it
>> seemed a bit too weird - the first one are the lists of function prototypes
>> in common.h/mvstat.h, the second one are function calls to
>> _greedy/_exhaustive methods.
>
> Function prototypes being weird is something that we've learned to
> accept.  There's no point in undoing pgindent decisions there, because
> the next run will re-apply them anyway.  Best not to fight it.
>
> What you should definitely look into fixing is the formatting of
> comments, if the result is too horrible.  You can prevent it from
> messing those by adding dashes /*----- at the beginning of the comment.
>

Yep, formatting of some of the comments got slightly broken, but it 
wasn't difficult to fix that without the /*------- trick.

I'm not sure about the prototypes though. It was a bit weird because 
prototypes in the same header file were formatted very differently.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Alvaro Herrera

Date:

28 March 2016, 15:54:59

Tomas Vondra wrote:

> I'm not sure about the prototypes though. It was a bit weird because
> prototypes in the same header file were formatted very differently.

Yeah, it is very odd.  What happens is that the BSD indent binary does
one thing (return type is in one line and function name in following
line; subsequent argument lines are aligned to opening parens), then the
pgindent perl script changes it (moves function name to same line as
return type, but does not reindent subsequent lines of arguments).

You can imitate the effect by adding an extra newline just before the
function name, reflowing the arguments to align to the (, then deleting
the extra newline.  Rather annoying.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

David Steele

Date:

29 March 2016, 15:18:08

Hi Tomas,

On 3/28/16 4:42 AM, Tomas Vondra wrote:

> Yes, those are valid omissions. I plan to address them, and I'd also
> considering adding a section to 65.1 (How the Planner Uses Statistics),
> explaining more thoroughly how the planner uses multivariate stats.

It looks like you need to post a new patch, so I have marked this
"waiting on author".

Thanks,
-- 
-David
david@pgmasters.net

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

30 March 2016, 05:15:47

>>     with statistics    without statistics
>> case1    0.98        0.01
>> case2    98/0        1/0
> 
> The case2 shows that functional dependencies assume that the
> conditions used in queries won't be incompatible - that's something
> this type of statistics can't fix.

It would be nice if that were mentioned in the manual to avoid
confusing users.

>> case3    1.05        0.01
>> case4    1/0        103/0
>> case5    18.50        18.33
>> case6    111123/0    1111123/0
> 
> The last two lines (case5 + case6) seem a bit suspicious. I believe
> those are for the histogram data, and I do get these numbers:
> 
> case5    0.93 (5517 / 5949)         42.0 (249943 / 5949)
> case6    100/0                      100/0
> 
> Perhaps you've been using the version before the bugfix, with ANALYZE
> on the wrong table?

You are right. I accidentally ran ANALYZE on t2, not t3. Now I get these
numbers:

case5    1.23 (7367 / 5968)         41.7 (249118 / 5981)
case6    117/0                      162092/0

>> 2) following comments by me are not addressed in the v18 patch.
>>
>>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>>   Catalogs")
>>>
>>> - The word "multivariate statistics" or something like that should
>>>   appear in the index.
>>>
>>> - There are some explanation how to deal with multivariate statistics
>>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>>   section.
> 
> Yes, those are valid omissions. I plan to address them, and I'd also
> considering adding a section to 65.1 (How the Planner Uses
> Statistics), explaining more thoroughly how the planner uses
> multivariate stats.

Great.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Robert Haas

Date:

08 April 2016, 15:55:58

On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net> wrote:
> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>> Yes, those are valid omissions. I plan to address them, and I'd also
>> considering adding a section to 65.1 (How the Planner Uses Statistics),
>> explaining more thoroughly how the planner uses multivariate stats.
>
> It looks you need post a new patch so I have marked this "waiting on
> author".

Since no new version of this patch has been posted in the last 10
days, it seems clear that there will not be time for this to
reasonably become ready for committer and then get committed in the
few hours remaining before the deadline.  That is a bummer, since I
was hoping we would have this feature in this release, but hopefully
we will get it into 9.7.  I am marking it Returned with Feedback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: multivariate statistics v14

From

Tomas Vondra

Date:

08 April 2016, 18:55:35

On 04/08/2016 05:55 PM, Robert Haas wrote:
> On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net> wrote:
>> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>>> Yes, those are valid omissions. I plan to address them, and I'd also
>>> considering adding a section to 65.1 (How the Planner Uses Statistics),
>>> explaining more thoroughly how the planner uses multivariate stats.
>>
>> It looks you need post a new patch so I have marked this "waiting on
>> author".
>
> Since no new version of this patch has been posted in the last 10
> days, it seems clear that there will not be time for this to
> reasonably become ready for committer and then get committed in the
> few hours remaining before the deadline. That is a bummer, since I
> was hoping we would have this feature in this release, but hopefully
> we will get it into 9.7. I am marking it Returned with Feedback.
>

Well, me too. But my feeling is the patch received an entirely
insufficient amount of thorough code review, considering how important a
part of the code it touches. I agree docs are an important part of a
patch, but polishing user-level docs would hardly move the patch closer
to being committable (especially when there's ~50kB of READMEs).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Robert Haas

Date:

08 April 2016, 19:03:54

On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Well, me to. But my feeling is the patch received entirely insufficient
> amount of thorough code review, considering how important part of the code
> it touches. I agree docs are an important part of a patch, but polishing
> user-level docs would hardly move the patch closer to being committable
> (especially when there's ~50kB of READMEs).

I have to admit that I was really hoping Tom would follow through on
his statement that he would look into this one, or that Dean Rasheed
would get involved.  I am sure I could do a good review of this patch
given enough time, but I am also sure that it would take an amount of
time that is at least one if not two orders of magnitude more than I
put into any patch this CommitFest.  I understand statistics at some
basic level, but I am not an expert on them the way some people here
are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: multivariate statistics v14

From

Tom Lane

Date:

08 April 2016, 19:13:55

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Well, me to. But my feeling is the patch received entirely insufficient
>> amount of thorough code review, considering how important part of the code
>> it touches. I agree docs are an important part of a patch, but polishing
>> user-level docs would hardly move the patch closer to being committable
>> (especially when there's ~50kB of READMEs).

> I have to admit that I was really hoping Tom would follow through on
> his statement that he would look into this one, or that Dean Rasheed
> would get involved.

I'm sorry I didn't get to it, but it's not like I have been slacking
during this commitfest.  At some point, you just have to accept that
not everything we could wish will get into 9.6.

I will make it a high priority for 9.7, though.
        regards, tom lane

Re: multivariate statistics v14

From

Robert Haas

Date:

08 April 2016, 19:26:54

On Fri, Apr 8, 2016 at 3:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Apr 8, 2016 at 2:55 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Well, me to. But my feeling is the patch received entirely insufficient
>>> amount of thorough code review, considering how important part of the code
>>> it touches. I agree docs are an important part of a patch, but polishing
>>> user-level docs would hardly move the patch closer to being committable
>>> (especially when there's ~50kB of READMEs).
>
>> I have to admit that I was really hoping Tom would follow through on
>> his statement that he would look into this one, or that Dean Rasheed
>> would get involved.
>
> I'm sorry I didn't get to it, but it's not like I have been slacking
> during this commitfest.  At some point, you just have to accept that
> not everything we could wish will get into 9.6.

I did not mean to imply otherwise.  I'm just explaining why I didn't
spend time on it - I figured I was not the most qualified person, and
of course I have not been slacking either.  :-)

> I will make it a high priority for 9.7, though.

Woohoo!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

08 April 2016, 23:21:53

From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Subject: Re: [HACKERS] multivariate statistics v14
Date: Fri, 8 Apr 2016 20:55:24 +0200
Message-ID: <5d1d62a6-6228-188c-e079-c1be59942168@2ndquadrant.com>

> On 04/08/2016 05:55 PM, Robert Haas wrote:
>> On Tue, Mar 29, 2016 at 11:18 AM, David Steele <david@pgmasters.net>
>> wrote:
>>> On 3/28/16 4:42 AM, Tomas Vondra wrote:
>>>> Yes, those are valid omissions. I plan to address them, and I'd also
>>>> considering adding a section to 65.1 (How the Planner Uses
>>>> Statistics),
>>>> explaining more thoroughly how the planner uses multivariate stats.
>>>
>>> It looks you need post a new patch so I have marked this "waiting on
>>> author".
>>
>> Since no new version of this patch has been posted in the last 10
>> days, it seems clear that there will not be time for this to
>> reasonably become ready for committer and then get committed in the
>> few hours remaining before the deadline. That is a bummer, since I
>> was hoping we would have this feature in this release, but hopefully
>> we will get it into 9.7. I am marking it Returned with Feedback.
>>
> 
> Well, me to. But my feeling is the patch received entirely
> insufficient amount of thorough code review, considering how important
> part of the code it touches. I agree docs are an important part of a
> patch, but polishing user-level docs would hardly move the patch
> closer to being committable (especially when there's ~50kB of
> READMEs).

My feedback regarding the docs was:
> - There's no docs for pg_mv_statistic (should be added to "49. System
>   Catalogs")
>
> - The word "multivariate statistics" or something like that should
>   appear in the index.
> 
> - There are some explanation how to deal with multivariate statistics
>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>   section.

The second and third points may be something like "polishing
user-level docs", but I don't think the first one is "user-level".
Also, I think the patch will never be committable without the first
one. If someone adds a new system catalog, its documentation should be
added to the "System Catalogs" section; that's our standard, at least
in my understanding.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Simon Riggs

Date:

09 April 2016, 10:00:51

On 8 April 2016 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I will make it a high priority for 9.7, though.

That is my plan also. I've already started reviewing the non-planner parts anyway, specifically patch 0002.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

09 April 2016, 11:32:48

Hi,

On 04/09/2016 01:21 AM, Tatsuo Ishii wrote:
> From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
...
> My feedback regarding docs were:
>> - There's no docs for pg_mv_statistic (should be added to "49. System
>>   Catalogs")
>>
>> - The word "multivariate statistics" or something like that should
>>   appear in the index.
>>
>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.
>
> The second and the third point maybe are something like "polishing
> user-level" docs, but I don't think the first one is for "user-level".
> Also I think without the first one the patch will be never
> committable. If someone add a new system catalog, the doc should be
> added to "System Catalogs" section, that's our standard, at least in
> my understanding.

I do apologize if it seemed that I don't value your review, and I do
agree that those changes need to be done, although I still see them
as user-level docs (as opposed to READMEs/comments, which I think are
used by developers much more often).

But I still think it wouldn't move the patch any closer to a committable
state, because what it really needs is a review of whether the catalog
definition makes sense, whether it should be more like pg_statistic, and
so on. Only then does it make sense to describe the catalog structure in
the SGML docs, I think. That's why I added some basic SGML docs for
CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
not for the catalog and other low-level stuff (which is commented
heavily in the code anyway).

Had the patch been a Titanic, fixing the SGML docs a few days before the 
code freeze would be akin to washing the deck instead of looking for 
icebergs on April 15, 1912.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tatsuo Ishii

Date:

09 April 2016, 17:38:10

> But I still think it wouldn't move the patch any closer to committable
> state, because what it really needs is review whether the catalog
> definition makes sense, whether it should be more like pg_statistic,
> and so on. Only then it makes sense to describe the catalog structure
> in the SGML docs, I think. That's why I added some basic SGML docs for
> CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
> not the catalog and other low-level stuff (which is commented heavily
> in the code anyway).

Without "user-level docs" (now I understand that for you the term means
all SGML docs), it is very hard to find the visible
characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
just defines a user interface, and does not explain how it affects
planning. The READMEs do not help either.

In this case reviewing your code is something like reviewing a program
which has no specification.

That's the reason why I made the point below earlier, but it was never
seriously considered.

>> - There are some explanation how to deal with multivariate statistics
>>   in "14.1 Using Explain" and "14.2 Statistics used by the Planner"
>>   section.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: multivariate statistics v14

From

Simon Riggs

Date:

10 April 2016, 08:25:55

On 9 April 2016 at 18:37, Tatsuo Ishii <ishii@postgresql.org> wrote:

> > But I still think it wouldn't move the patch any closer to committable
> > state, because what it really needs is review whether the catalog
> > definition makes sense, whether it should be more like pg_statistic,
> > and so on. Only then it makes sense to describe the catalog structure
> > in the SGML docs, I think. That's why I added some basic SGML docs for
> > CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
> > not the catalog and other low-level stuff (which is commented heavily
> > in the code anyway).
>
> Without "user-level docs" (now I understand that the term means all
> SGML docs for you), it is very hard to find a visible
> characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
> just defines a user interface, and does not help how it affects to the
> planning. The READMEs do not help either.
>
> In this case reviewing your code is something like reviewing a program
> which has no specification.
>
> That's the reason why I said before below, but it was never seriously
> considered.

I would likely have said this myself but didn't even get that far.

Your contribution was useful and went further than anybody else's review, so thank you.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

10 April 2016, 18:28:09

Hello,

On 04/09/2016 07:37 PM, Tatsuo Ishii wrote:
>> But I still think it wouldn't move the patch any closer to committable
>> state, because what it really needs is review whether the catalog
>> definition makes sense, whether it should be more like pg_statistic,
>> and so on. Only then it makes sense to describe the catalog structure
>> in the SGML docs, I think. That's why I added some basic SGML docs for
>> CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
>> not the catalog and other low-level stuff (which is commented heavily
>> in the code anyway).
>
> Without "user-level docs" (now I understand that the term means all
> SGML docs for you), it is very hard to find a visible
> characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
> just defines a user interface, and does not help how it affects to
> the planning. The READMEs do not help either.
>
> In this case reviewing your code is something like reviewing a
> program which has no specification.

I certainly agree that reviewing a patch without the context is hard. My 
intent was to provide such context / explanation in the READMEs, but 
perhaps I failed to do so with enough detail.

BTW when you say that READMEs do not help either, does that mean you 
consider READMEs unsuitable for this type of information in general, or 
that the current READMEs lack important information?

>
> That's the reason why I said before below, but it was never
> seriously considered.
>

I've considered it, but my plan was to have detailed READMEs, and then 
eventually distill that into something suitable for the SGML (perhaps 
without discussion of some implementation details). Maybe that's not the 
right approach.

FWIW providing the context is why I started working on a "paper" 
explaining both the motivation and implementation, including a bit of 
math and figures (which is what we don't have in READMEs or SGML). I 
haven't updated it recently, and it probably got buried in the thread, 
but perhaps this would be a better way to provide the context?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics v14

From

Tomas Vondra

Date:

10 April 2016, 18:29:25

On 04/10/2016 10:25 AM, Simon Riggs wrote:
> On 9 April 2016 at 18:37, Tatsuo Ishii <ishii@postgresql.org> wrote:
>
>     > But I still think it wouldn't move the patch any closer to committable
>     > state, because what it really needs is review whether the catalog
>     > definition makes sense, whether it should be more like pg_statistic,
>     > and so on. Only then it makes sense to describe the catalog structure
>     > in the SGML docs, I think. That's why I added some basic SGML docs for
>     > CREATE/DROP/ALTER STATISTICS, which I expect to be rather stable, and
>     > not the catalog and other low-level stuff (which is commented heavily
>     > in the code anyway).
>
>     Without "user-level docs" (now I understand that the term means all
>     SGML docs for you), it is very hard to find a visible
>     characteristics/behavior of the patch. CREATE/DROP/ALTER STATISTICS
>     just defines a user interface, and does not help how it affects to the
>     planning. The READMEs do not help either.
>
>     In this case reviewing your code is something like reviewing a program
>     which has no specification.
>
>     That's the reason why I said before below, but it was never seriously
>     considered.
>
>
> I would likely have said this myself but didn't even get that far.
>
> Your contribution was useful and went further than anybody else's
> review, so thank you.

100% agreed. Thanks for the useful feedback.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

03 August 2016, 01:58:28

Hi,

Attached is v19 of the "multivariate stats" patch series - essentially
v18 rebased on top of current master. Aside from a few bug fixes, the
main improvement is addition of SGML docs demonstrating the statistics
in a way similar to the current "Row Estimation Examples" (and the docs
are actually in the same section). I've tried to keep the right amount
of technical detail (and pointing to the right README for additional
details), but this may need improvements. I have not written docs
explaining how statistics may be combined yet (more about this later).


There are two general design questions that I'd like to get feedback on:


1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty
much all this info is internal to clausesel.c and does not get outside.

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the
actual estimation, storing the information (which stats to use, etc.) in
a new type of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as
that builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in
EXPLAIN output. With per-column statistics it's fairly straightforward
to determine which statistics are used and how. But with multivariate
stats things are often more complicated - there may be multiple
candidate statistics (e.g. histograms covering different subsets of the
conditions), it's possible to apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only
within clausesel.c (and thrown away after the estimation).


2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different subsets of conditions) is important and useful, but I'm
starting to think that the current implementation may not be the correct
one (which is why I haven't written the SGML docs about this part of the
patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics, on (a,b) and (b,c). The current
patch does roughly this:

    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient, but it only works as long as the query contains conditions
"connecting" the two statistics. So if we remove the "b=2" condition
from the query, this stops working.
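
To make the decomposition above concrete, here is a minimal Python sketch (not code from the patch). It stands in for the (a,b) and (b,c) statistics with exact value counts, which is an assumption made only for illustration, and compares the result with the plain independence estimate. Note that replacing P(c|a,b) with P(c|b) is exact only when a and c are conditionally independent given b. The function names and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical sample: three perfectly correlated columns, 100 distinct
# values with 100 rows each (10,000 rows in total).
ROWS = [(i // 100, i // 100, i // 100) for i in range(10000)]


def estimate_chain(rows, a, b, c):
    """Estimate P(a,b,c) as P(a,b) * P(c|b).

    Only pairwise counts on (a,b) and (b,c) are consulted, mimicking a
    situation where no statistics on (a,b,c) exist.
    """
    n = len(rows)
    ab = Counter((r[0], r[1]) for r in rows)  # stand-in for stats on (a,b)
    bc = Counter((r[1], r[2]) for r in rows)  # stand-in for stats on (b,c)
    # Marginal count of b, derived from the (b,c) counts alone.
    b_count = sum(cnt for (bv, _), cnt in bc.items() if bv == b)
    if b_count == 0:
        return 0.0
    p_ab = ab[(a, b)] / n
    p_c_given_b = bc[(b, c)] / b_count
    return p_ab * p_c_given_b


def estimate_independent(rows, a, b, c):
    """Estimate P(a,b,c) as P(a) * P(b) * P(c), assuming independence."""
    n = len(rows)
    p_a = sum(1 for r in rows if r[0] == a) / n
    p_b = sum(1 for r in rows if r[1] == b) / n
    p_c = sum(1 for r in rows if r[2] == c) / n
    return p_a * p_b * p_c


if __name__ == "__main__":
    n = len(ROWS)
    print("chain rule :", estimate_chain(ROWS, 1, 1, 1) * n, "rows")
    print("independent:", estimate_independent(ROWS, 1, 1, 1) * n, "rows")
```

On this sample the chain-rule estimate matches the true 100 rows, while the independence estimate comes out at about 0.01 rows.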

But it's possible to do this differently, e.g. by doing this:

    P(a=1,c=3) = P(a=1) * P(c=3|a=1)

where P(c=3|a=1) is using (b,c), but uses (a,b) to restrict the set of
buckets (if the statistics is a histogram) to consider. In pseudo-code,
it might look like this:

    buckets = {}
    foreach bucket x in (b,c):
        foreach bucket y in (a,b):
           if y matches (a=1) and overlap(x,y):
               buckets := buckets + x

which is the part of (b,c) matching (a=1), allowing us to compute the
conditional probability.
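
The pseudo-code above can be turned into a small runnable Python sketch. It is an illustration under stated assumptions rather than the patch's implementation: buckets are modelled as dictionaries of half-open [lo, hi) ranges per column, "overlap" is taken to mean overlap on the shared column b, and all names and sample histograms are hypothetical.

```python
def ranges_overlap(r1, r2):
    """True if the half-open ranges [lo, hi) r1 and r2 intersect."""
    return r1[0] < r2[1] and r2[0] < r1[1]


def buckets_matching_a(bc_hist, ab_hist, a_value):
    """Return the (b,c) buckets that overlap, on column b, with at least
    one (a,b) bucket whose a-range contains a_value."""
    selected = []
    for x in bc_hist:
        for y in ab_hist:
            a_lo, a_hi = y["a"]
            if a_lo <= a_value < a_hi and ranges_overlap(x["b"], y["b"]):
                selected.append(x)
                break  # add each (b,c) bucket at most once
    return selected


# Hypothetical two-bucket histograms; each bucket maps a column name to
# its [lo, hi) range and carries the fraction of rows it holds.
AB_HIST = [
    {"a": (0, 5), "b": (0, 5), "freq": 0.5},
    {"a": (5, 10), "b": (5, 10), "freq": 0.5},
]
BC_HIST = [
    {"b": (0, 5), "c": (0, 5), "freq": 0.5},
    {"b": (5, 10), "c": (5, 10), "freq": 0.5},
]
```

The early `break` is a design choice of this sketch: read literally, the pseudo-code would add the same (b,c) bucket once per matching (a,b) bucket.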

It may get more complicated, of course. In particular, there may be
different types of statistics, and we need to be able to "match" them
against each other. With just MCV lists and histograms that's probably
easy enough, but if we add other types of statistics, it may get way
more complicated.

I still think this is a useful capability, but perhaps there are better
ideas how to do that. In any case, it only affects the last part of the
patch (0006).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

multivariate-stats-v19.tgz

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

05 August 2016, 04:24:50

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master. Aside from a few bug fixes, the main
> improvement is addition of SGML docs demonstrating the statistics in a way
> similar to the current "Row Estimation Examples" (and the docs are actually
> in the same section). I've tried to keep the right amount of technical
> detail (and pointing to the right README for additional details), but this
> may need improvements. I have not written docs explaining how statistics may
> be combined yet (more about this later).

What we have here is quite something:
$ git diff master --stat | tail -n1
 77 files changed, 12809 insertions(+), 65 deletions(-)
I will try to get familiar with the topic, and have added myself as a
reviewer of this patch. Hopefully I'll have feedback soon.
-- 
Michael

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

05 August 2016, 17:38:15

On 08/05/2016 06:24 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v19 of the "multivariate stats" patch series - essentially v18
>> rebased on top of current master. Aside from a few bug fixes, the main
>> improvement is addition of SGML docs demonstrating the statistics in a way
>> similar to the current "Row Estimation Examples" (and the docs are actually
>> in the same section). I've tried to keep the right amount of technical
>> detail (and pointing to the right README for additional details), but this
>> may need improvements. I have not written docs explaining how statistics may
>> be combined yet (more about this later).
>
> What we have here is quite something:
> $ git diff master --stat | tail -n1
>  77 files changed, 12809 insertions(+), 65 deletions(-)
> I will try to get familiar on the topic and added myself as a reviewer
> of this patch. Hopefully I'll get feedback soon.

Yes, it's a large patch, although 25% of the insertions are SGML docs, 
regression tests and READMEs, and a large part of the remaining ~9k 
insertions are comments. But it may still be overwhelming, no doubt 
about that.

FWIW, if someone is interested in the patch but is unsure where to 
start, I'm ready to help with that as much as possible. For example if 
you happen to go to PostgresOpen, feel free to drag me to a corner and 
ask me as many questions as you want ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

05 August 2016, 22:21:51

On Sat, Aug 6, 2016 at 2:38 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/05/2016 06:24 AM, Michael Paquier wrote:
>>
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> Attached is v19 of the "multivariate stats" patch series - essentially
>>> v18
>>> rebased on top of current master. Aside from a few bug fixes, the main
>>> improvement is addition of SGML docs demonstrating the statistics in a
>>> way
>>> similar to the current "Row Estimation Examples" (and the docs are
>>> actually
>>> in the same section). I've tried to keep the right amount of technical
>>> detail (and pointing to the right README for additional details), but
>>> this
>>> may need improvements. I have not written docs explaining how statistics
>>> may
>>> be combined yet (more about this later).
>>
>>
>> What we have here is quite something:
>> $ git diff master --stat | tail -n1
>>  77 files changed, 12809 insertions(+), 65 deletions(-)
>> I will try to get familiar on the topic and added myself as a reviewer
>> of this patch. Hopefully I'll get feedback soon.
>
>
> Yes, it's a large patch. Although 25% of the insertions are SGML docs,
> regression tests and READMEs, and large part of the remaining ~9k insertions
> are comments. But it may still be overwhelming, no doubt about that.
>
> FWIW, if someone is interested in the patch but is unsure where to start,
> I'm ready to help with that as much as possible. For example if you happen
> to go to PostgresOpen, feel free to drag me to a corner and ask me as many
> questions as you want ...

Sure. Only PGconf SV is on my track this year.
-- 
Michael

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

10 August 2016, 04:41:38

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 1) enriching the query tree with multivariate statistics info
>
> Right now all the stuff related to multivariate statistics estimation
> happens in clausesel.c - matching condition to statistics, selection of
> statistics to use (if there are multiple usable stats), etc. So pretty much
> all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me at first sight but...

> I'm starting to think that some of the steps (matching quals to stats,
> selection of stats) should happen in a "preprocess" step before the actual
> estimation, storing the information (which stats to use, etc.) in a new type
> of node in the query tree - something like RestrictInfo.
>
> I believe this needs to happen sometime after deconstruct_jointree() as that
> builds RestrictInfos nodes, and looking at planmain.c, right after
> extract_restriction_or_clauses seems about right. Haven't tried, though.
>
> This would move all the "statistics selection" logic from clausesel.c,
> separating it from the "actual estimation" and simplifying the code.
>
> But more importantly, I think we'll need to show some of the data in EXPLAIN
> output. With per-column statistics it's fairly straightforward to determine
> which statistics are used and how. But with multivariate stats things are
> often more complicated - there may be multiple candidate statistics (e.g.
> histograms covering different subsets of the conditions), it's possible to
> apply them in different orders, etc.
>
> But EXPLAIN can't show the info if it's ephemeral and available only within
> clausesel.c (and thrown away after the estimation).

This gives a good reason not to do that in clausesel.c; it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).
>
> Assume there's a table "t" with 3 columns (a, b, c), and that we're
> estimating query:
>
>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>
> but that we only have two statistics (a,b) and (b,c). The current patch does
> about this:
>
>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>
> i.e. it estimates the first two conditions using (a,b), and then estimates
> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
> but it only works as long as the query contains conditions "connecting" the
> two statistics. So if we remove the "b=2" condition from the query, this
> stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

Patch 0002:
+  <para>
+   <command>CREATE STATISTICS</command> will create a new multivariate
+   statistics on the table. The statistics will be created in the in the
+   current database. The statistics will be owned by the user issuing
+   the command.
+  </para>
s/in the in the/in the/.

+  <para>
+   Create table <structname>t1</> with two functionally dependent columns, i.e.
+   knowledge of a value in the first column is sufficient for detemining the
+   value in the other column. Then functional dependencies are built on those
+   columns:
s/detemining/determining/

+  <para>
+   If a schema name is given (for example, <literal>CREATE STATISTICS
+   myschema.mystat ...</>) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of
+   the table must be distinct from the name of any other statistics in the
+   same schema.
+  </para>
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.

+ALTER STATISTICS <replaceable class="parameter">name</replaceable>
OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
CURRENT_USER | SESSION_USER }
On the same line, is OWNER TO really necessary? I could have assumed
that if a user is able to query the set of columns related to a
statistics, he should have access to it.

=# create statistics aa_a_b3 on aam (a, b) with (dependencies);
ERROR:  23505: duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL:  Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
SCHEMA NAME:  pg_catalog
TABLE NAME:  pg_mv_statistic
CONSTRAINT NAME:  pg_mv_statistic_name_index
LOCATION:  _bt_check_unique, nbtinsert.c:433
When creating a multivariate statistics with a name that already exists,
this error message should be more friendly.

=# create table aa (a int, b int);
CREATE TABLE
=# create view aav as select * from aa;
CREATE VIEW
=# create statistics aab_v on aav (a, b) with (dependencies);
CREATE STATISTICS
Why do views and foreign tables support this command? This code also
mentions that this case is not actually supported:
+       /* multivariate stats are supported on tables and matviews */
+       if (rel->rd_rel->relkind == RELKIND_RELATION ||
+           rel->rd_rel->relkind == RELKIND_MATVIEW)
+           tupdesc = RelationGetDescr(rel);
 };
+
 /*
Spurious noise in the patch.

+   /* check that at least some statistics were requested */
+   if (!build_dependencies)
+       ereport(ERROR,
+               (errcode(ERRCODE_SYNTAX_ERROR),
+                errmsg("no statistics type (dependencies) was requested")));
So, WITH (dependencies) is mandatory in any case. Why not just
drop it from the first cut then?

pg_mv_stats shows only the attribute numbers of the columns it has
stats on, I think that those should be the column names. [...after a
while...], as it is mentioned here:
+ * TODO  Would be nice if this printed column names (instead of just attnums).

Does this work properly with DDL deparsing? If yes, could it be
possible to add tests in test_ddl_deparse? This is a new object type,
so those look necessary I think.

The statistics definition reorders the columns by itself, depending on
their order. For example:
create table aa (a int, b int);
create statistics aas on aa(b, a) with (dependencies);
\d aa
    "public.aas" (dependencies) ON (a, b)
As this defines a correlation between multiple columns, isn't it wrong
to assume that (b, a) and (a, b) are always the same correlation? I
don't recall such properties as being always commutative (old
memories, I suck at stats in general). [...reading README...] So this
is caused by an implementation limitation: the analysis only considers
interactions between two columns. Still it seems incorrect
to reorder the user-visible portion.
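
On the question of whether column order matters: for functional dependencies in general, direction is significant, since a may determine b without b determining a. The following Python sketch only demonstrates that general property on made-up data; it says nothing about how the patch stores or evaluates dependencies, and `determines` is a hypothetical helper.

```python
def determines(rows, src, dst):
    """True if the value in column src fixes the value in column dst."""
    seen = {}
    for row in rows:
        # setdefault records the first dst value seen for this src value
        # and returns the recorded one on later encounters.
        if seen.setdefault(row[src], row[dst]) != row[dst]:
            return False
    return True


# Hypothetical data: a is unique, b groups ten consecutive a values.
ROWS = [(i, i // 10) for i in range(100)]

if __name__ == "__main__":
    print("a => b:", determines(ROWS, 0, 1))
    print("b => a:", determines(ROWS, 1, 0))
```

Here a determines b, but b does not determine a, so the two directions are not interchangeable even though the set of columns is the same.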

The comment on top of get_relation_info needs to be updated to mention
that mvstatlist gets fetched as well.

+   while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+       /* TODO maybe include only already built statistics? */
+       result = insert_ordered_oid(result, HeapTupleGetOid(htup));
I haven't looked at the rest of the series yet, but I'd think that
including the ones not built may be a good idea, to let the caller do
more filtering itself. Of course this depends on the rest of the series...

+typedef struct MVDependencyData
+{
+   int         nattributes;    /* number of attributes */
+   int16       attributes[1];  /* attribute numbers */
+} MVDependencyData;
You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.

+++ b/src/test/regress/serial_schedule
@@ -167,3 +167,4 @@ test: with
 test: xml
 test: event_trigger
 test: stats
+test: mv_dependencies
This test is not listed in parallel_schedule.

s/Apllying/Applying/

There is a lot of mumbo-jumbo regarding the way dependencies are
stored with mainly serialize_mv_dependencies and
deserialize_mv_dependencies that operates them from bytea/dep trees.
That's not cool and not portable because pg_mv_statistic represents
that as pure bytea. I would suggest creating a generic data type that
does those operations, named like pg_dependency_tree and then use that
in those new catalogs. pg_node_tree is a precedent of such a thing.
New features could as well make use of this new data type if we are
able to design it in a way that is generic enough, so that would be a
base patch that the current 0002 applies on top of.

Regarding psql:
- The new commands lack psql completion, that would ease the use of
the new commands.
- Would it make sense to have a backslash command to show the list of
statistics?

Congratulations. I just looked at 25% of the overall patch and my mind
is already blown away, but I am catching up with the rest...
-- 
Michael

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

10 August 2016, 11:33:35

On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clausesel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.
>

I think there are two separate questions:

(a) Whether the query plan is "enriched" with information about 
statistics, or whether this information is ephemeral and available only 
in clausesel.c.

(b) Where exactly this enrichment happens.

Theoretically we might enrich the query plan (add nodes with info about 
the statistics), so that EXPLAIN gets the info, and it might still 
happen in clausesel.c.

>> 2) combining multiple statistics
>>
>> I think the ability to combine multivariate statistics (covering different
>> subsets of conditions) is important and useful, but I'm starting to think
>> that the current implementation may not be the correct one (which is why I
>> haven't written the SGML docs about this part of the patch series yet).
>>
>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>> estimating query:
>>
>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>
>> but that we only have two statistics (a,b) and (b,c). The current patch does
>> about this:
>>
>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>
>> i.e. it estimates the first two conditions using (a,b), and then estimates
>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
>> but it only works as long as the query contains conditions "connecting" the
>> two statistics. So if we remove the "b=2" condition from the query, this
>> stops working.
>
> This is trying to make the algorithm smarter than the user, which is
> something I'd think we could live without. In this case statistics on
> (a,c) or (a,b,c) are missing. And what if the user does not want to
> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>

I don't think so. Obviously, if you have statistics covering all the 
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the 
statistics and the accuracy of the statistics. Let's say you have statistics 
on 8 columns, and you split each dimension in two to build a histogram - 
that's 256 buckets right there, and we only get ~50% selectivity in each 
dimension (the actual histogram building algorithm is more complex, but 
you get the idea).
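
The trade-off can be illustrated with a little arithmetic. The Python sketch below assumes a simple grid histogram in which every dimension is split into the same number of parts; as noted above, the actual histogram building algorithm is more complex, so this only shows the trend. The function names and the 256-bucket budget are illustrative assumptions.

```python
def bucket_count(ndims, parts_per_dim=2):
    """Number of grid buckets when each dimension is cut into equal parts."""
    return parts_per_dim ** ndims


def parts_per_dimension(budget, ndims):
    """Largest number of parts per dimension that fits a bucket budget."""
    k = 1
    while (k + 1) ** ndims <= budget:
        k += 1
    return k


if __name__ == "__main__":
    for ndims in (1, 2, 4, 8):
        parts = parts_per_dimension(256, ndims)
        # 1/parts is the finest selectivity a single dimension can resolve.
        print(ndims, "columns:", parts, "parts, selectivity", 1.0 / parts)
```

With a fixed budget of 256 buckets, two columns can be resolved into 16 parts each, while eight columns get only 2 parts each, which is one reason why combining several lower-dimensional statistics can be attractive.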

I see this as probably the most interesting part of the patch, and quite 
useful. But we'll definitely get the single-statistics estimate first, 
no doubt about that.

> Patch 0001: there have been comments about that before, and you have
> put the checks on RestrictInfo in a couple of variables of
> pull_varnos_walker, so nothing to say from here.
>

I don't follow. Are you suggesting 0001 is a reasonable fix, or that 
there's a proposed solution?

> Patch 0002:
> +  <para>
> +   <command>CREATE STATISTICS</command> will create a new multivariate
> +   statistics on the table. The statistics will be created in the in the
> +   current database. The statistics will be owned by the user issuing
> +   the command.
> +  </para>
> s/in the in the/in the/.
>
> +  <para>
> +   Create table <structname>t1</> with two functionally dependent columns, i.e.
> +   knowledge of a value in the first column is sufficient for detemining the
> +   value in the other column. Then functional dependencies are built on those
> +   columns:
> s/detemining/determining/
>
> +  <para>
> +   If a schema name is given (for example, <literal>CREATE STATISTICS
> +   myschema.mystat ...</>) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> +   the table must be distinct from the name of any other statistics in the
> +   same schema.
> +  </para>
> I would just assume that a statistics is located on the schema of the
> relation it depends on. So the thing that may be better to do is just:
> - Register the OID of the table a statistics depends on but not the schema.
> - Give up on those query extensions related to the schema.
> - Allow the same statistics name to be used for multiple tables.
> - Just fail if a statistics name is being reused on the table again.
> It may be better to complain about that even if the column list is
> different.
> - Register the dependency between the statistics and the table.

The idea is that the syntax should work even for statistics built on 
multiple tables, e.g. to provide better statistics for joins. That's why 
the schema may be specified (as each table might be in different 
schema), and so on.

>
> +ALTER STATISTICS <replaceable class="parameter">name</replaceable>
> OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
> CURRENT_USER | SESSION_USER }
> On the same line, is OWNER TO really necessary? I could have assumed
> that if a user is able to query the set of columns related to a
> statistics, he should have access to it.
>

Not sure, TBH. I think I've reused ALTER INDEX syntax, but now I see 
it's actually ignored with a warning.

> =# create statistics aa_a_b3 on aam (a, b) with (dependencies);
> ERROR:  23505: duplicate key value violates unique constraint
> "pg_mv_statistic_name_index"
> DETAIL:  Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
> SCHEMA NAME:  pg_catalog
> TABLE NAME:  pg_mv_statistic
> CONSTRAINT NAME:  pg_mv_statistic_name_index
> LOCATION:  _bt_check_unique, nbtinsert.c:433
> When creating a multivariate statistics with a name that already exists,
> this error message should be more friendly.

Yes, agreed.

>
> =# create table aa (a int, b int);
> CREATE TABLE
> =# create view aav as select * from aa;
> CREATE VIEW
> =# create statistics aab_v on aav (a, b) with (dependencies);
> CREATE STATISTICS
> Why do views and foreign tables support this command? This code also
> mentions that this case is not actually supported:
> +       /* multivariate stats are supported on tables and matviews */
> +       if (rel->rd_rel->relkind == RELKIND_RELATION ||
> +           rel->rd_rel->relkind == RELKIND_MATVIEW)
> +           tupdesc = RelationGetDescr(rel);
>
>  };

Yes, seems like a bug.

>
> +
>  /*
> Spurious noise in the patch.
>
> +   /* check that at least some statistics were requested */
> +   if (!build_dependencies)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_SYNTAX_ERROR),
> +                errmsg("no statistics type (dependencies) was requested")));
> So, WITH (dependencies) is mandatory in any case. Why not just
> dropping it from the first cut then?

Because the follow-up patches extend this to require at least one 
statistics type. So in 0004 it becomes
    if (!(build_dependencies || build_mcv))

and in 0005 it's
    if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and 
then add the check in 0004. But it seems a bit pointless.

>
> pg_mv_stats shows only the attribute numbers of the columns it has
> stats on, I think that those should be the column names. [...after a
> while...], as it is mentioned here:
> + * TODO  Would be nice if this printed column names (instead of just attnums).

Yeah.

>
> Does this work properly with DDL deparsing? If yes, could it be
> possible to add tests in test_ddl_deparse? This is a new object type,
> so those look necessary I think.
>

I haven't done anything with DDL deparsing, so I think the answer is
"no", and this needs to be added to the TODO.

> Statistics definition reorder the columns by itself depending on their
> order. For example:
> create table aa (a int, b int);
> create statistics aas on aa(b, a) with (dependencies);
> \d aa
>     "public.aas" (dependencies) ON (a, b)
> As this defines a correlation between multiple columns, isn't it wrong
> to assume that (b, a) and (a, b) are always the same correlation? I
> don't recall such properties as being always commutative (old
> memories, I suck at stats in general). [...reading README...] So this
> is caused by the implementation limitations that only limit the
> analysis between interactions of two columns. Still it seems incorrect
> to reorder the user-visible portion.

I don't follow. If you talk about Pearson's correlation, that clearly 
does not depend on the order of columns - it's perfectly independent of 
that. If you talk about correlation in the wider sense (i.e.
arbitrary dependence between columns), that might depend - but I don't 
remember a single piece of the patch where this might be a problem.
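
To make the "wider sense" concrete: a functional dependency is
directional, so a => b can hold on data where b => a does not. Below is
a minimal sketch in plain C, not code from the patch; the Row type, the
fd_holds() helper and the sample data are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-column row, for illustration only. */
typedef struct Row
{
    int a;
    int b;
} Row;

/*
 * Returns true if one column functionally determines the other in the
 * given rows, i.e. no two rows agree on the determining column but
 * differ on the dependent one.
 *
 * from_is_a selects the direction: true checks a => b, false checks
 * b => a. Quadratic on purpose; this is a sketch, not what ANALYZE does.
 */
bool
fd_holds(const Row *rows, size_t nrows, bool from_is_a)
{
    assert(rows != NULL || nrows == 0);

    for (size_t i = 0; i < nrows; i++)
    {
        for (size_t j = i + 1; j < nrows; j++)
        {
            int from_i = from_is_a ? rows[i].a : rows[i].b;
            int from_j = from_is_a ? rows[j].a : rows[j].b;
            int to_i = from_is_a ? rows[i].b : rows[i].a;
            int to_j = from_is_a ? rows[j].b : rows[j].a;

            /* same determining value, different dependent value */
            if (from_i == from_j && to_i != to_j)
                return false;
        }
    }
    return true;
}

/* Sample data with b = a / 2: a determines b, but not the other way. */
const Row sample_rows[] = {
    {0, 0}, {1, 0}, {2, 1}, {3, 1}, {4, 2}, {5, 2}
};
```

On this data fd_holds(sample_rows, 6, true) is true and
fd_holds(sample_rows, 6, false) is false, because b = 0 appears with
both a = 0 and a = 1. Pearson's correlation has no such direction.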

Also, which README states that we can only analyze interactions between 
two columns? That's pretty clearly not the case - the patch should 
handle dependencies between more columns without any problems.
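
As an illustration of a dependency over more than two columns, here is
a minimal sketch of checking (a,b) => c on a handful of rows. It is
plain C and not the patch's ANALYZE code; mv_fd_holds(), NCOLS and the
sample data are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCOLS 3                 /* columns a, b, c at indexes 0, 1, 2 */

/*
 * Returns true if the columns listed in det_cols together determine
 * dep_col, i.e. no two rows agree on all determining columns but
 * differ on the dependent column.
 */
bool
mv_fd_holds(const int rows[][NCOLS], size_t nrows,
            const int *det_cols, size_t ndet, int dep_col)
{
    assert(dep_col >= 0 && dep_col < NCOLS);

    for (size_t i = 0; i < nrows; i++)
    {
        for (size_t j = i + 1; j < nrows; j++)
        {
            bool same_det = true;

            for (size_t k = 0; k < ndet; k++)
            {
                if (rows[i][det_cols[k]] != rows[j][det_cols[k]])
                {
                    same_det = false;
                    break;
                }
            }

            /* rows match on the determinant but disagree on dep_col */
            if (same_det && rows[i][dep_col] != rows[j][dep_col])
                return false;
        }
    }
    return true;
}

/* Sample data with c = a + b. */
const int dep_rows[][NCOLS] = {
    {0, 0, 0}, {0, 1, 1}, {0, 1, 1}, {1, 0, 1}, {1, 1, 2}, {1, 1, 2}
};

const int det_ab[] = {0, 1};    /* determinant (a,b) */
const int det_a[] = {0};        /* determinant (a) */
const int det_b[] = {1};        /* determinant (b) */
```

With this data (a,b) => c holds, while neither a => c nor b => c does,
so the dependency only shows up when both columns are considered
together.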

>
> The comment on top of get_relation_info needs to be updated to mention
> that mvstatlist gets fetched as well.
>
> +   while (HeapTupleIsValid(htup = systable_getnext(indscan)))
> +       /* TODO maybe include only already built statistics? */
> +       result = insert_ordered_oid(result, HeapTupleGetOid(htup));
> I haven't looked at the rest of the series yet, but I'd think that
> including the ones not built may be a good idea to let the caller do
> more filtering itself. Of course this depends on the next series...
>

Probably, although the more I'm thinking about this the more I think 
I'll rework this along the lines of the foreign-key-estimation patch, 
i.e. preprocessing called from planmain.c (adding info to the query 
plan), estimation in clausesel.c etc. Which also affects this bit, 
because the foreign keys are also loaded elsewhere, IIRC.

> +typedef struct MVDependencyData
> +{
> +   int         nattributes;    /* number of attributes */
> +   int16       attributes[1];  /* attribute numbers */
> +} MVDependencyData;
> You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.
>
> +++ b/src/test/regress/serial_schedule
> @@ -167,3 +167,4 @@ test: with
>  test: xml
>  test: event_trigger
>  test: stats
> +test: mv_dependencies
> This test is not listed in parallel_schedule.
>
> s/Apllying/Applying/
>
> There is a lot of mumbo-jumbo regarding the way dependencies are
> stored with mainly serialize_mv_dependencies and
> deserialize_mv_dependencies that operates them from bytea/dep trees.
> That's not cool and not portable because pg_mv_statistic represents
> that as pure bytea. I would suggest creating a generic data type that
> does those operations, named like pg_dependency_tree and then use that
> in those new catalogs. pg_node_tree is a precedent of such a thing.
> New features could as well make use of this new data type if we are
> able to design that in a way generic enough, so that would be a base
> patch that the current 0002 applies on top of.

Interesting idea, haven't thought about that. So are you suggesting to 
add a data type for each statistics type (dependencies, MCV, histogram, 
...)?

>
> Regarding psql:
> - The new commands lack psql completion, that would ease the use of
> the new commands.
> - Would it make sense to have a backslash command to show the list of
> statistics?
>

Yeah, that's on the TODO.

> Congratulations. I just looked at 25% of the overall patch and my mind
> is already blown away, but I am catching up with the rest...
>

Thanks for looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Petr Jelinek

Date:

10 August 2016, 11:51:03

On 10/08/16 13:33, Tomas Vondra wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>> 2) combining multiple statistics
>>>
>>> I think the ability to combine multivariate statistics (covering
>>> different
>>> subsets of conditions) is important and useful, but I'm starting to
>>> think
>>> that the current implementation may not be the correct one (which is
>>> why I
>>> haven't written the SGML docs about this part of the patch series yet).
>>>
>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>> estimating query:
>>>
>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>
>>> but that we only have two statistics (a,b) and (b,c). The current
>>> patch does
>>> about this:
>>>
>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>
>>> i.e. it estimates the first two conditions using (a,b), and then
>>> estimates
>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>> efficient,
>>> but it only works as long as the query contains conditions
>>> "connecting" the
>>> two statistics. So if we remove the "b=2" condition from the query, this
>>> stops working.
>>
>> This is trying to make the algorithm smarter than the user, which is
>> something I'd think we could live without. In this case statistics on
>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>
>
> I don't think so. Obviously, if you have statistics covering all the
> conditions - great, we can't really do better than that.
>
> But there's a crucial relation between the number of dimensions of the
> statistics and accuracy of the statistics. Let's say you have statistics
> on 8 columns, and you split each dimension twice to build a histogram -
> that's 256 buckets right there, and we only get ~50% selectivity in each
> dimension (the actual histogram building algorithm is more complex, but
> you get the idea).
>

I think it makes sense to pursue this, but I also think we can easily 
live with not having it in the first version that gets committed and 
doing it as follow-up patch.
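
For reference, the rule quoted above,
P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2), can be worked through on a
tiny table. This is only a sketch under assumptions: the selectivities
are computed by counting rows directly rather than from real
multivariate statistics, and Row3, sel_ab(), sel_c_given_b(),
sel_combined() and demo_rows are hypothetical names, not part of the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-column row of table "t", for illustration only. */
typedef struct Row3
{
    int a;
    int b;
    int c;
} Row3;

/* P(a = va AND b = vb): what a statistics on (a,b) would be asked for. */
double
sel_ab(const Row3 *rows, size_t nrows, int va, int vb)
{
    size_t matches = 0;

    assert(nrows > 0);

    for (size_t i = 0; i < nrows; i++)
    {
        if (rows[i].a == va && rows[i].b == vb)
            matches++;
    }
    return (double) matches / (double) nrows;
}

/* P(c = vc | b = vb): uses columns (b,c) only, with "b = vb" as condition. */
double
sel_c_given_b(const Row3 *rows, size_t nrows, int vb, int vc)
{
    size_t b_matches = 0;
    size_t bc_matches = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        if (rows[i].b != vb)
            continue;
        b_matches++;
        if (rows[i].c == vc)
            bc_matches++;
    }

    /* no row satisfies the condition, so nothing can match */
    if (b_matches == 0)
        return 0.0;
    return (double) bc_matches / (double) b_matches;
}

/* P(a,b,c) estimated as P(a,b) * P(c|b), connected through column b. */
double
sel_combined(const Row3 *rows, size_t nrows, int va, int vb, int vc)
{
    return sel_ab(rows, nrows, va, vb) * sel_c_given_b(rows, nrows, vb, vc);
}

const Row3 demo_rows[] = {
    {1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {1, 2, 4},
    {0, 0, 0}, {0, 0, 0}, {0, 1, 1}, {0, 1, 1}
};
```

With demo_rows, sel_combined(demo_rows, 8, 1, 2, 3) is 0.5 * 0.5 = 0.25,
which matches the 2 of 8 rows that actually qualify; multiplying the
three per-column selectivities (0.5 * 0.5 * 0.25) would give 0.0625.
The rule is exact here because a and c are independent once b = 2 is
fixed, which real data need not satisfy.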

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

10 August 2016, 12:23:36

On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> Patch 0001: there have been comments about that before, and you have
>> put the checks on RestrictInfo in a couple of variables of
>> pull_varnos_walker, so nothing to say from here.
>>
>
> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
> a proposed solution?

I think that's reasonable.

>> Patch 0002:
>> +  <para>
>> +   <command>CREATE STATISTICS</command> will create a new multivariate
>> +   statistics on the table. The statistics will be created in the in the
>> +   current database. The statistics will be owned by the user issuing
>> +   the command.
>> +  </para>
>> s/in the in the/in the/.
>>
>> +  <para>
>> +   Create table <structname>t1</> with two functionally dependent
>> columns, i.e.
>> +   knowledge of a value in the first column is sufficient for detemining
>> the
>> +   value in the other column. Then functional dependencies are built on
>> those
>> +   columns:
>> s/detemining/determining/
>>
>> +  <para>
>> +   If a schema name is given (for example, <literal>CREATE STATISTICS
>> +   myschema.mystat ...</>) then the statistics is created in the
>> specified
>> +   schema.  Otherwise it is created in the current schema.  The name of
>> +   the table must be distinct from the name of any other statistics in
>> the
>> +   same schema.
>> +  </para>
>> I would just assume that a statistics is located on the schema of the
>> relation it depends on. So the thing that may be better to do is just:
>> - Register the OID of the table a statistics depends on but not the
>> schema.
>> - Give up on those query extensions related to the schema.
>> - Allow the same statistics name to be used for multiple tables.
>> - Just fail if a statistics name is being reused on the table again.
>> It may be better to complain about that even if the column list is
>> different.
>> - Register the dependency between the statistics and the table.
>
> The idea is that the syntax should work even for statistics built on
> multiple tables, e.g. to provide better statistics for joins. That's why the
> schema may be specified (as each table might be in different schema), and so
> on.

So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.

>> +
>>  /*
>> Spurious noise in the patch.
>>
>> +   /* check that at least some statistics were requested */
>> +   if (!build_dependencies)
>> +       ereport(ERROR,
>> +               (errcode(ERRCODE_SYNTAX_ERROR),
>> +                errmsg("no statistics type (dependencies) was
>> requested")));
>> So, WITH (dependencies) is mandatory in any case. Why not just
>> dropping it from the first cut then?
>
>
> Because the follow-up patches extend this to require at least one statistics
> type. So in 0004 it becomes
>
>     if (!(build_dependencies || build_mcv))
>
> and in 0005 it's
>
>     if (!(build_dependencies || build_mcv || build_histogram))
>
> We might drop it from 0002 (and assume build_dependencies=true), and then
> add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

>> Statistics definition reorder the columns by itself depending on their
>> order. For example:
>> create table aa (a int, b int);
>> create statistics aas on aa(b, a) with (dependencies);
>> \d aa
>>     "public.aas" (dependencies) ON (a, b)
>> As this defines a correlation between multiple columns, isn't it wrong
>> to assume that (b, a) and (a, b) are always the same correlation? I
>> don't recall such properties as being always commutative (old
>> memories, I suck at stats in general). [...reading README...] So this
>> is caused by the implementation limitations that only limit the
>> analysis between interactions of two columns. Still it seems incorrect
>> to reorder the user-visible portion.
>
> I don't follow. If you talk about Pearson's correlation, that clearly does
> not depend on the order of columns - it's perfectly independent of that. If
> you talk about correlation in the wider sense (i.e. arbitrary
> dependence between columns), that might depend - but I don't remember a
> single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

> Also, which README states that we can only analyze interactions between two
> columns? That's pretty clearly not the case - the patch should handle
> dependencies between more columns without any problems.

I have noticed that the patch evaluates the full set of permutations
possible for a column list. It seems to me, though, that if we have
three columns (a,b,c) listed in a statistics, (a,b) => c and
(b,a) => c are two different things.

>> There is a lot of mumbo-jumbo regarding the way dependencies are
>> stored with mainly serialize_mv_dependencies and
>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>> That's not cool and not portable because pg_mv_statistic represents
>> that as pure bytea. I would suggest creating a generic data type that
>> does those operations, named like pg_dependency_tree and then use that
>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>> New features could as well make use of this new data type if we are
>> able to design that in a way generic enough, so that would be a base
>> patch that the current 0002 applies on top of.
>
>
> Interesting idea, haven't thought about that. So are you suggesting to add a
> data type for each statistics type (dependencies, MCV, histogram, ...)?

Yes, something like that. It would actually perhaps be better to
have one single data type, and be able to switch between the models
easily, instead of putting byteas in the catalog.
-- 
Michael

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

10 August 2016, 12:25:04

On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>
>>>> 2) combining multiple statistics
>>>>
>>>>
>>>> I think the ability to combine multivariate statistics (covering
>>>> different
>>>> subsets of conditions) is important and useful, but I'm starting to
>>>> think
>>>> that the current implementation may not be the correct one (which is
>>>> why I
>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>
>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>> estimating query:
>>>>
>>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>
>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>> patch does
>>>> about this:
>>>>
>>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>
>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>> estimates
>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>> efficient,
>>>> but it only works as long as the query contains conditions
>>>> "connecting" the
>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>> stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece for getting basic support in, I'd suggest leaving it for later.
--
Michael

Re: multivariate statistics (v19)

From

Ants Aasma

Date:

10 August 2016, 13:29:11

On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).

While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

10 August 2016, 18:07:33

On 08/10/2016 03:29 PM, Ants Aasma wrote:
> On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 2) combining multiple statistics
>>
>> I think the ability to combine multivariate statistics (covering different
>> subsets of conditions) is important and useful, but I'm starting to think
>> that the current implementation may not be the correct one (which is why I
>> haven't written the SGML docs about this part of the patch series yet).
>
> While researching this topic a few years ago I came across a paper on
> this exact topic called "Consistently Estimating the Selectivity of
> Conjuncts of Predicates" [1]. While effective it seems to be quite
> heavy-weight, so would probably need support for tiered optimization.
>
> [1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf
>

I think I've read that paper some time ago, and IIRC it's solving the 
same problem but in a very different way - instead of combining the 
statistics directly, it relies on the "partial" selectivities and then 
estimates the total selectivity using the maximum-entropy principle.

I think it's a nice idea and it probably works fine in many cases, but 
it kinda throws away part of the information (that we could get by 
matching the statistics against each other directly). But I'll keep that 
paper in mind, and we can revisit this solution later.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

10 August 2016, 18:09:22

On 08/10/2016 02:24 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> On 10/08/16 13:33, Tomas Vondra wrote:
>>>
>>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>>
>>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>>
>>>>> 2) combining multiple statistics
>>>>>
>>>>>
>>>>> I think the ability to combine multivariate statistics (covering
>>>>> different
>>>>> subsets of conditions) is important and useful, but I'm starting to
>>>>> think
>>>>> that the current implementation may not be the correct one (which is
>>>>> why I
>>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>>
>>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>>> estimating query:
>>>>>
>>>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>>
>>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>>> patch does
>>>>> about this:
>>>>>
>>>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>>
>>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>>> estimates
>>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>>> efficient,
>>>>> but it only works as long as the query contains conditions
>>>>> "connecting" the
>>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>>> stops working.
>>>>
>>>>
>>>> This is trying to make the algorithm smarter than the user, which is
>>>> something I'd think we could live without. In this case statistics on
>>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>>
>>>
>>> I don't think so. Obviously, if you have statistics covering all the
>>> conditions - great, we can't really do better than that.
>>>
>>> But there's a crucial relation between the number of dimensions of the
>>> statistics and accuracy of the statistics. Let's say you have statistics
>>> on 8 columns, and you split each dimension twice to build a histogram -
>>> that's 256 buckets right there, and we only get ~50% selectivity in each
>>> dimension (the actual histogram building algorithm is more complex, but
>>> you get the idea).
>>
>> I think it makes sense to pursue this, but I also think we can easily live
>> with not having it in the first version that gets committed and doing it as
>> follow-up patch.
>
> This patch is large and complicated enough. As this is not a mandatory
> piece for getting basic support in, I'd suggest leaving it for later.

Which is why combining multiple statistics is in part 0006 and all the 
previous parts simply choose the single "best" statistics ;-)

I'm perfectly fine with committing just the first few parts, and leaving 
0006 for the next major version.

regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

10 August 2016, 18:35:08

On 08/10/2016 02:23 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>> Patch 0001: there have been comments about that before, and you have
>>> put the checks on RestrictInfo in a couple of variables of
>>> pull_varnos_walker, so nothing to say from here.
>>>
>>
>> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
>> a proposed solution?
>
> I think that's reasonable.
>

Well, to me the 0001 feels more like a temporary workaround rather than 
a proper solution. I just don't know how to deal with it so I've kept it 
for now. Pretty sure there will be complaints that adding RestrictInfo 
to the expression walkers is not a nice idea.
>> ...
>>
>> The idea is that the syntax should work even for statistics built on
>> multiple tables, e.g. to provide better statistics for joins. That's why the
>> schema may be specified (as each table might be in different schema), and so
>> on.
>
> So you mean that the same statistics could be shared between tables?
> But as this is visibly not a concept introduced yet in this set of
> patches, why not just cut it off for now to simplify the whole? If
> there is no schema-related field in pg_mv_statistics we could still
> add it later if it proves to be useful.
>

Yes, I think creating statistics on multiple tables is one of the 
possible future directions. One of the previous patch versions 
introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into 
issues in gram.y, and given the multi-table possibilities the CREATE 
STATISTICS seems like a much better idea anyway.

But I guess you're right we may make this a bit more strict now, and 
relax it in the future if needed. For example as we only support 
single-table statistics at this point, we may remove the schema and 
always create the statistics in the schema of the table.

But I don't think we should make the statistics names unique only within 
a table (instead of within the schema).

The difference between those two cases is that if we allow multi-table 
statistics in the future, we can simply allow specifying the schema and 
everything will work just fine. But it'd break the second case, as it 
might result in conflicts in existing schemas.

I do realize this might be seen as a case of "future proofing" based on 
dubious predictions of how something might work, but OTOH this (schema 
inherited from table, unique within a schema) is pretty consistent with 
how this works for indexes.

>>> +
>>>  /*
>>> Spurious noise in the patch.
>>>
>>> +   /* check that at least some statistics were requested */
>>> +   if (!build_dependencies)
>>> +       ereport(ERROR,
>>> +               (errcode(ERRCODE_SYNTAX_ERROR),
>>> +                errmsg("no statistics type (dependencies) was
>>> requested")));
>>> So, WITH (dependencies) is mandatory in any case. Why not just
>>> dropping it from the first cut then?
>>
>>
>> Because the follow-up patches extend this to require at least one statistics
>> type. So in 0004 it becomes
>>
>>     if (!(build_dependencies || build_mcv))
>>
>> and in 0005 it's
>>
>>     if (!(build_dependencies || build_mcv || build_histogram))
>>
>> We might drop it from 0002 (and assume build_dependencies=true), and then
>> add the check in 0004. But it seems a bit pointless.
>
> This is a complicated set of patches. I'd think that we should try to
> simplify things as much as possible first, and the WITH clause is not
> mandatory to have as of 0002.
>

OK, I can remove the WITH from the 0002 part. Not a big deal.

>>> Statistics definition reorder the columns by itself depending on their
>>> order. For example:
>>> create table aa (a int, b int);
>>> create statistics aas on aa(b, a) with (dependencies);
>>> \d aa
>>>     "public.aas" (dependencies) ON (a, b)
>>> As this defines a correlation between multiple columns, isn't it wrong
>>> to assume that (b, a) and (a, b) are always the same correlation? I
>>> don't recall such properties as being always commutative (old
>>> memories, I suck at stats in general). [...reading README...] So this
>>> is caused by the implementation limitations that only limit the
>>> analysis between interactions of two columns. Still it seems incorrect
>>> to reorder the user-visible portion.
>>
>> I don't follow. If you talk about Pearson's correlation, that clearly does
>> not depend on the order of columns - it's perfectly independent of that. If
>> you talk about correlation in the wider sense (i.e. arbitrary
>> dependence between columns), that might depend - but I don't remember a
>> single piece of the patch where this might be a problem.
>
> Yes, based on what is done in the patch that may not be a problem, but
> I am wondering if this is not restricting things too much.
>

Let's keep the code as it is. If we run into this issue in the future, 
we can easily relax this - there's nothing depending on the ordering of 
attnums, IIRC.

>> Also, which README states that we can only analyze interactions between two
>> columns? That's pretty clearly not the case - the patch should handle
>> dependencies between more columns without any problems.
>
> I have noticed that the patch evaluates the full set of permutations
> possible for a column list. It seems to me, though, that if we have
> three columns (a,b,c) listed in a statistics, (a,b) => c and
> (b,a) => c are two different things.
>

Yes, those are two different functional dependencies, of course. But the 
algorithm (during ANALYZE) should discover all of them, and even the 
examples are using three columns, so I'm not sure what you mean by 
"analyze interactions between two columns"?

>>> There is a lot of mumbo-jumbo regarding the way dependencies are
>>> stored with mainly serialize_mv_dependencies and
>>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>>> That's not cool and not portable because pg_mv_statistic represents
>>> that as pure bytea. I would suggest creating a generic data type that
>>> does those operations, named like pg_dependency_tree and then use that
>>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>>> New features could as well make use of this new data type if we are
>>> able to design that in a way generic enough, so that would be a base
>>> patch that the current 0002 applies on top of.
>>
>>
>> Interesting idea, haven't thought about that. So are you suggesting to add a
>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>
> Yes, something like that. It would actually perhaps be better to
> have one single data type, and be able to switch between the models
> easily, instead of putting byteas in the catalog.

Hmmm, not sure about that. For example what about combinations of 
statistics - e.g. when we have MCV list on the most common values and a 
histogram on the rest? Should we store both as a single value, or would 
that be in two separate values, or what?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

11 August 2016, 05:55:39

On Thu, Aug 11, 2016 at 3:34 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 08/10/2016 02:23 PM, Michael Paquier wrote:
>>
>> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> The idea is that the syntax should work even for statistics built on
>>> multiple tables, e.g. to provide better statistics for joins. That's why
>>> the
>>> schema may be specified (as each table might be in different schema), and
>>> so
>>> on.
>>
>>
>> So you mean that the same statistics could be shared between tables?
>> But as this is visibly not a concept introduced yet in this set of
>> patches, why not just cut it off for now to simplify the whole? If
>> there is no schema-related field in pg_mv_statistics we could still
>> add it later if it proves to be useful.
>>
>
> Yes, I think creating statistics on multiple tables is one of the possible
> future directions. One of the previous patch versions introduced ALTER TABLE
> ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the
> multi-table possibilities the CREATE STATISTICS seems like a much better
> idea anyway.
>
> But I guess you're right we may make this a bit more strict now, and relax
> it in the future if needed. For example as we only support single-table
> statistics at this point, we may remove the schema and always create the
> statistics in the schema of the table.

This would simplify the code a bit so I'd suggest removing
that from the first shot. If there is demand for it, keeping the
infrastructure open for this extension is what we had better do.

> But I don't think we should make the statistics names unique only within a
> table (instead of within the schema).

They could be made unique using (name, table_oid, column_list).

>>>> There is a lot of mumbo-jumbo regarding the way dependencies are
>>>> stored with mainly serialize_mv_dependencies and
>>>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>>>> That's not cool and not portable because pg_mv_statistic represents
>>>> that as pure bytea. I would suggest creating a generic data type that
>>>> does those operations, named like pg_dependency_tree and then use that
>>>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>>>> New features could as well make use of this new data type if we are
>>>> able to design that in a way generic enough, so that would be a base
>>>> patch that the current 0002 applies on top of.
>>>
>>>
>>>
>>> Interesting idea, haven't thought about that. So are you suggesting to
>>> add a data type for each statistics type (dependencies, MCV, histogram, ...)?
>>
>>
>> Yes that would be something like that, it would be actually perhaps
>> better to have one single data type, and be able to switch between
>> each model easily instead of putting byteas in the catalog.
>
> Hmmm, not sure about that. For example what about combinations of statistics
> - e.g. when we have MCV list on the most common values and a histogram on
> the rest? Should we store both as a single value, or would that be in two
> separate values, or what?

The same statistics object could combine two different things; whether to
use separate columns may depend on how readable things get...
Btw, for the format we could get inspired from pg_node_tree, with pg_stat_tree:
{HISTOGRAM :arg {BUCKET :index 0 :minvals ... }}
{DEPENDENCY :arg {:elt "a => c" ...} ... }
{MCV :arg {:index 0 :values {0,0} ... } ... }
Please consider that as a tentative idea to make things more friendly.
Others may have a different opinion on the matter.
-- 
Michael

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

15 August 2016, 20:50:26

On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) enriching the query tree with multivariate statistics info
>>
>> Right now all the stuff related to multivariate statistics estimation
>> happens in clausesel.c - matching condition to statistics, selection of
>> statistics to use (if there are multiple usable stats), etc. So pretty much
>> all this info is internal to clausesel.c and does not get outside.
>
> This does not seem bad to me at first sight but...
>
>> I'm starting to think that some of the steps (matching quals to stats,
>> selection of stats) should happen in a "preprocess" step before the actual
>> estimation, storing the information (which stats to use, etc.) in a new type
>> of node in the query tree - something like RestrictInfo.
>>
>> I believe this needs to happen sometime after deconstruct_jointree() as that
>> builds RestrictInfos nodes, and looking at planmain.c, right after
>> extract_restriction_or_clauses seems about right. Haven't tried, though.
>>
>> This would move all the "statistics selection" logic from clausesel.c,
>> separating it from the "actual estimation" and simplifying the code.
>>
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clausesel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated 
in practice. It essentially means doing something like
    rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the 
name suggests the list should only include RestrictInfo nodes, which 
seems to contradict the transformation.

For example with conditions
    WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on 
(a,b,c), we need to note that somehow - my plan was to inject a node 
storing this information, something like (a bit simplified):
    StatisticsInfo {
        Oid   statisticsoid;     /* OID of the statistics */
        List *mvconditions;      /* estimate using the statistics */
        List *otherconditions;   /* estimate the old way */
    }

But that'd clearly violate the assumption that baserestrictinfo only 
contains RestrictInfo. I don't think it's feasible (or desirable) to 
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it 
wants. But it also means we need to maintain two lists instead of one, 
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e. 
each node would track whether it's estimated the "old way" or using 
multivariate statistics (and which one). But it would require changing 
many of the existing nodes (at least those compatible with multivariate 
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during 
the estimation, as it'd be necessary to look for other nodes to be 
estimated by the same statistics. Which seems to defeat the idea of 
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e. 
keeping a separate list, but I'd welcome other ideas.
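
To make option (1) a bit more concrete, here is a minimal Python sketch (not
code from the patch) of building the separate structure next to the original
clause list. The names Clause and enrich_with_statistics are hypothetical, as
is the rule that the statistics are only used when they cover at least two
clauses. The input list is left untouched, which is the point of option (1).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Clause:
    """Stand-in for a RestrictInfo: one condition and the columns it references."""
    text: str
    columns: Set[str]


@dataclass
class StatisticsInfo:
    """Hypothetical mirror of the node sketched in the message above."""
    statistics_name: str  # stands in for the statistics OID
    mvconditions: List[Clause] = field(default_factory=list)     # estimate using the statistics
    otherconditions: List[Clause] = field(default_factory=list)  # estimate the old way


def enrich_with_statistics(clauses, stat_name, stat_columns):
    """Split clauses by whether the statistics cover all columns they reference.

    The input list is not modified; the result is a separate structure.
    """
    info = StatisticsInfo(stat_name)
    for clause in clauses:
        if clause.columns and clause.columns <= stat_columns:
            info.mvconditions.append(clause)
        else:
            info.otherconditions.append(clause)
    # Assumption: multivariate statistics only help with two or more clauses,
    # so otherwise everything falls back to the old per-column estimation.
    if len(info.mvconditions) < 2:
        info.mvconditions = []
        info.otherconditions = list(clauses)
    return info
```

For WHERE (a=1) AND (b=2) AND (d=4) with statistics on (a,b,c), the first two
clauses would end up in mvconditions and the last one in otherconditions.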

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Robert Haas

Date:

23 August 2016, 17:03:23

On Tue, Aug 2, 2016 at 9:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master.

Tom:

ISTR that you were going to try to look at this patch set.  It seems
from the discussion that it's not really ready for serious
consideration for commit yet, but also that some high-level design
comments from you at this stage could go a long way toward making sure
that the final form of the patch is something that will be acceptable.

I'd really like to see us get some kind of capability along these
lines, but I'm sure it will go a lot better if you or Dean handle it
than if I try to do it ... not to mention that there are only so many
hours in the day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

30 August 2016, 06:54:51

On Wed, Aug 24, 2016 at 2:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> ISTR that you were going to try to look at this patch set.  It seems
> from the discussion that it's not really ready for serious
> consideration for commit yet, but also that some high-level design
> comments from you at this stage could go a long way toward making sure
> that the final form of the patch is something that will be acceptable.
>
> I'd really like to see us get some kind of capability along these
> lines, but I'm sure it will go a lot better if you or Dean handle it
> than if I try to do it ... not to mention that there are only so many
> hours in the day.

Agreed. What I have been able to look at so far is the high-level
structure of the patch, and I think that we should really shave 0002
and simplify it to get a core infrastructure in place. But the core
patch is at another level, and it would be good to get some feedback
regarding the structure of the patch and whether it is moving in the
right direction.
-- 
Michael

Re: multivariate statistics (v19)

From

Dean Rasheed

Date:

12 September 2016, 14:08:09

On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch, it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name [ WITH (options) ] ON expression [, ...] FROM table [, ...] WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name [ WITH (options) ] ON column1, column2 [, ...] FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.

Of course, I'm not suggesting that the current patch do any of that --
it's big enough as it is. I'm just throwing out possible future
directions this might go in, so that we don't get painted into a
corner when designing the syntax for the current patch.


Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using
 P(a,b) = P(a) * (f + (1-f) * P(b))
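
As a quick illustration of how this formula behaves at the two extremes, here
is a small Python sketch. The function name combine_selectivities is made up
for the example; 'f' is the degree of dependence described above.

```python
def combine_selectivities(p_a, p_b, f):
    """Combine two clause selectivities given a degree of dependence f in [0, 1].

    f = 0 reduces to the usual independence assumption, P(a) * P(b);
    f = 1 treats the clause on b as fully implied by the clause on a, giving P(a).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    return p_a * (f + (1.0 - f) * p_b)
```

With P(a) = P(b) = 0.01, the result moves smoothly from 0.0001 at f = 0 to
0.01 at f = 1, so there is no threshold at which the estimate suddenly jumps.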

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.
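
This can be checked without a database. The Python sketch below rebuilds the
same 1000 rows (integer division and modulo behave like the SQL above for
non-negative values) and compares the joint fraction with the product of the
individual fractions. All variable names are just for illustration.

```python
# Rebuild the rows produced by the INSERT above: x = 0..999.
rows = [(x // 10, ((x // 10) * 789) % 100) for x in range(1000)]

# Check the functional dependencies in both directions.
b_values_for_a = {}
a_values_for_b = {}
for a, b in rows:
    b_values_for_a.setdefault(a, set()).add(b)
    a_values_for_b.setdefault(b, set()).add(a)
a_determines_b = all(len(v) == 1 for v in b_values_for_a.values())
b_determines_a = all(len(v) == 1 for v in a_values_for_b.values())

# Selectivities of the two range clauses, separately and combined.
p_a = sum(1 for a, b in rows if a < 50) / len(rows)
p_b = sum(1 for a, b in rows if b < 50) / len(rows)
p_ab = sum(1 for a, b in rows if a < 50 and b < 50) / len(rows)

print(a_determines_b, b_determines_a)
print(p_a, p_b, p_ab, p_a * p_b)
```

Both dependencies hold, yet the joint fraction stays very close to the plain
product p_a * p_b, and far from p_a, which is what treating the clause on 'b'
as implied by the clause on 'a' would give.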

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.


Looking at the ndistinct coefficient 'q', I think it would be better
if the recorded statistic were just the estimate for
ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
more fundamental statistic, and it's easier to document and easier to
interpret. Also, I don't believe that the coefficient 'q' is the right
number to use for clause estimation:

Looking at README.ndistinct, I'm skeptical about the selectivity
estimation argument. In the case where a -> b, you'd have q =
ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
is fine for a uniform distribution. But typically, there would be
univariate statistics on a and b, so if for example a=1 were 100x more
likely than average, you'd probably know that and the existing code
computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
= 1/ndistinct(a) would be a significant underestimate, since it would
be ignoring known information about the distribution of a.

But likewise if, as is later argued, you were to use 'q' as a
correction factor applied to the individual clause selectivities, you
could end up with significant overestimates: if you said P(a=1 & b=2)
= q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
a -> b, then b=2 would also be 100x more likely than average (assuming
that b=2 was the value implied by the functional dependency), and that
would also be reflected in the univariate statistics on b, so then you'd
end up with an overall selectivity of around 10000/ndistinct(a), which
would be 100x too big. In fact, since a -> b means that q =
ndistinct(b), there's a good chance of hitting data for which q * P(b)
is greater than 1, so this formula would lead to a combined
selectivity greater than P(a), which is obviously nonsense.
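
The arithmetic behind both cases can be written out with exact fractions. The
Python sketch below assumes q = ndistinct(a) * ndistinct(b) / ndistinct(a,b),
which is consistent with q = ndistinct(b) when a -> b; the concrete ndistinct
values and the 100x skew are made up for the example.

```python
from fractions import Fraction

# Made-up numbers for illustration only.
ndistinct_a = 100_000
ndistinct_b = 1_000
ndistinct_ab = ndistinct_a  # a -> b: every value of a pairs with one value of b

# Assumed definition of the coefficient; with a -> b it equals ndistinct(b).
q = Fraction(ndistinct_a * ndistinct_b, ndistinct_ab)

skew = 100  # a=1 (and hence b=2) is 100x more likely than average
p_a = Fraction(skew, ndistinct_a)
p_b = Fraction(skew, ndistinct_b)

true_selectivity = p_a                      # b=2 is implied by a=1
uniform_estimate = Fraction(1, ndistinct_a)  # P(a=1 & b=2) = 1/ndistinct(a)
corrected_estimate = q * p_a * p_b           # q used as a correction factor

print(float(uniform_estimate / true_selectivity))    # underestimate factor
print(float(corrected_estimate / true_selectivity))  # overestimate factor
print(float(q * p_b))                                # exceeds 1
```

The ndistinct(b) terms cancel in the corrected estimate, leaving
10000/ndistinct(a) regardless of how many distinct values b has.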

Having a better estimate for ndistinct(a,b,...) looks very useful by
itself for GROUP BY estimation, and there may be other places that
would benefit from it, but I don't think it's the best statistic for
determining functional dependence or combining clause selectivities.

That's as much as I've looked at so far. It's such a big patch that
it's difficult to consider all at once. I think perhaps the smallest
committable self-contained unit providing a tangible benefit would be
something containing the core infrastructure plus the ndistinct
estimate and the improved GROUP BY estimation.

Regards,
Dean

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

13 September 2016, 22:01:40

Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:
> On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v19 of the "multivariate stats" patch series
>
> Hi,
>
> I started looking at this - just at a very high level - I've not read
> much of the detail yet, but here are some initial review comments.
>
> I think the overall infrastructure approach for CREATE STATISTICS
> makes sense, and I agree with other suggestions upthread that it would
> be useful to be able to build statistics on arbitrary expressions,
> although that doesn't need to be part of this patch, it's useful to
> keep that in mind as a possible future extension of this initial
> design.
>
> I can imagine it being useful to be able to create user-defined
> statistics on an arbitrary list of expressions, and I think that would
> include univariate as well as multivariate statistics. Perhaps that's
> something to take into account in the naming of things, e.g., as David
> Rowley suggested, something like pg_statistic_ext, rather than
> pg_mv_statistic.
>
> I also like the idea that this might one day be extended to support
> statistics across multiple tables, although I think that might be
> challenging to achieve -- you'd need a method of taking a random
> sample of rows from a join between 2 or more tables. However, if the
> intention is to be able to support that one day, I think that needs to
> be accounted for in the syntax now -- specifically, I think it will be
> too limiting to only support things extending the current syntax of
> the form table1(col1, col2, ...), table2(col1, col2, ...), because
> that precludes building statistics on an expression referring to
> columns from more than one table. So I think we should plan further
> ahead and use a syntax giving greater flexibility in the future, for
> example something structured more like a query (like CREATE VIEW):
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON expression [, ...]
>   FROM table [, ...]
>   WHERE condition
>
> where the first version of the patch would only support expressions
> that are simple column references, and would require at least 2 such
> columns from a single table with no WHERE clause, i.e.:
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON column1, column2 [, ...]
>   FROM table
>
> For multi-table statistics, a WHERE clause would typically be needed
> to specify how the tables are expected to be joined, but potentially
> such a clause might also be useful in single-table statistics, to
> build partial statistics on a commonly queried subset of the table,
> just like a partial index.

Hmm, the "partial statistics" idea seems interesting. It would allow us 
to provide additional / more detailed statistics only for a subset of a 
table.

I'm however not sure about the join case - how would the syntax work 
with outer joins? But as you said, we only need
    CREATE STATISTICS name
      [ WITH (options) ]
      ON (column1, column2 [, ...])
      FROM table
      WHERE condition

until we add support for join statistics.

>
> Regarding the statistics themselves, I read the description of soft
> functional dependencies, and I'm somewhat skeptical about that
> algorithm. I don't like the arbitrary thresholds or the sudden jump
> from independence to dependence and clause reduction. As others have
> said, I think this should account for a continuous spectrum of
> dependence from fully independent to fully dependent, and combine
> clause selectivities in a way based on the degree of dependence. For
> example, if you computed an estimate for the fraction 'f' of the
> table's rows for which a -> b, then it might be reasonable to combine
> the selectivities using
>
>   P(a,b) = P(a) * (f + (1-f) * P(b))
>

Yeah, I agree that the thresholds resulting in sudden changes between 
"dependent" and "not dependent" are annoying. The question is whether it 
makes sense to fix that, though - the functional dependencies were meant 
as the simplest form of statistics, allowing us to get the rest of the 
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of 
dependency between 0 and 1, but I'm a bit afraid it'll result in 
complaints that the first patch got too large / complicated.

It also contradicts the idea of using functional dependencies as a 
low-overhead type of statistics, filtering the list of clauses that need 
to be estimated using more expensive types of statistics (MCV lists, 
histograms, ...). Switching to a degree of dependency would prevent 
removal of "unnecessary" clauses.

> Of course, having just a single number that tells you the columns are
> correlated, tells you nothing about whether the clauses on those
> columns are consistent with that correlation. For example, in the
> following table
>
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>
> 'b' is functionally dependent on 'a' (and vice versa), but if you
> query the rows with a<50 and with b<50, those clauses behave
> essentially independently, because they're not consistent with the
> functional dependence between 'a' and 'b', so the best way to combine
> their selectivities is just to multiply them, as we currently do.
>
> So whilst it may be interesting to determine that 'b' is functionally
> dependent on 'a', it's not obvious whether that fact by itself should
> be used in the selectivity estimates. Perhaps it should, on the
> grounds that it's best to attempt to use all the available
> information, but only if there are no more detailed statistics
> available. In any case, knowing that there is a correlation can be
> used as an indicator that it may be worthwhile to build more detailed
> multivariate statistics like a MCV list or a histogram on those
> columns.
>

Right. IIRC this is actually described in the README as "incompatible 
conditions". While implementing it, I concluded that this is OK and it's 
up to the developer to decide whether the queries are compatible with 
the "assumption of compatibility". But maybe this reasoning is bogus 
and makes (the current implementation of) functional dependencies 
unusable in practice.

But I like the idea of reversing the order from

(a) look for functional dependencies
(b) reduce the clauses using functional dependencies
(c) estimate the rest using multivariate MCV/histograms

to

(a) estimate using multivariate MCV/histograms first
(b) try to apply functional dependencies on the remaining clauses

It contradicts the idea of functional dependencies as "low-overhead 
statistics" but maybe it's worth it.
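
A minimal Python sketch of that reversed order, purely as an illustration of
the control flow and not of the patch itself. The function plan_estimation,
its arguments, and the rule that detailed statistics need at least two covered
clauses are all assumptions made for the example; each clause is represented
just by the column it references.

```python
def plan_estimation(clause_columns, detailed_stats, dependencies):
    """Decide which statistics estimate which clauses.

    clause_columns: column names, one per clause (e.g. ["a", "b", "c"]).
    detailed_stats: column sets covered by MCV lists / histograms.
    dependencies:   (determinant, dependent) pairs, e.g. ("c", "d") for c -> d.
    """
    remaining = list(clause_columns)
    steps = []

    # (a) use the detailed multivariate statistics first
    for stat_columns in detailed_stats:
        covered = [c for c in remaining if c in stat_columns]
        if len(covered) >= 2:
            steps.append(("mcv/histogram", covered))
            remaining = [c for c in remaining if c not in covered]

    # (b) apply functional dependencies only to what is left
    for determinant, dependent in dependencies:
        if determinant in remaining and dependent in remaining:
            steps.append(("dependency", [determinant, dependent]))
            remaining.remove(dependent)  # treated as implied by the determinant

    # anything still remaining is estimated with per-column statistics
    steps.append(("per-column", remaining))
    return steps
```

For clauses on a, b, c, d with a histogram on (a,b) and a dependency c -> d,
the clauses on a and b go to the histogram, the clause on d is reduced away by
the dependency, and only c is left for the per-column estimate.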

>
> Looking at the ndistinct coefficient 'q', I think it would be better
> if the recorded statistic were just the estimate for
> ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
> more fundamental statistic, and it's easier to document and easier to
> interpret. Also, I don't believe that the coefficient 'q' is the right
> number to use for clause estimation:
>

IIRC the reason why I stored the coefficient instead of the ndistinct() 
values is that the coefficient is not directly related to the number of 
rows in the original relation, so you can apply it directly to whatever 
cardinality estimate you have.

Otherwise it's mostly the same information - it's trivial to compute one 
from the other.
>
> Looking at README.ndistinct, I'm skeptical about the selectivity
> estimation argument. In the case where a -> b, you'd have q =
> ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
> is fine for a uniform distribution. But typically, there would be
> univariate statistics on a and b, so if for example a=1 were 100x more
> likely than average, you'd probably know that and the existing code
> computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
> = 1/ndistinct(a) would be a significant underestimate, since it would
> be ignoring known information about the distribution of a.
>
> But likewise if, as is later argued, you were to use 'q' as a
> correction factor applied to the individual clause selectivities, you
> could end up with significant overestimates: if you said P(a=1 & b=2)
> = q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
> a -> b, then b=2 would also be 100x more likely than average (assuming
> that b=2 was the value implied by the functional dependency), and that
> would also be reflected in the univariate statistics on b, so then you'd
> end up with an overall selectivity of around 10000/ndistinct(a), which
> would be 100x too big. In fact, since a -> b means that q =
> ndistinct(b), there's a good chance of hitting data for which q * P(b)
> is greater than 1, so this formula would lead to a combined
> selectivity greater than P(a), which is obviously nonsense.

Well, yeah. The
    P(a=1) = 1/ndistinct(a)

was really just a simplification for the uniform distribution, and 
looking at "q" as a correction factor is much more practical - no doubt 
about that.

As for the overestimates and underestimates - I don't think we can 
entirely prevent that. We're essentially replacing one assumption (AVIA) 
with other assumptions (homogeneity for ndistinct, compatibility for 
functional dependencies), hoping that those assumptions are weaker in 
some sense. But there'll always be cases that break those assumptions 
and I don't think we can prevent that.

Unlike the functional dependencies, this "homogeneity" assumption is not 
dependent on the queries at all, so it should be possible to verify it 
during ANALYZE.

Also, maybe we could/should use the same approach as for functional 
dependencies, i.e. try using more detailed statistics first and then 
apply ndistinct coefficients only on the remaining clauses?

>
> Having a better estimate for ndistinct(a,b,...) looks very useful by
> itself for GROUP BY estimation, and there may be other places that
> would benefit from it, but I don't think it's the best statistic for
> determining functional dependence or combining clause selectivities.
>

Not sure. I think it may be a very useful type of statistics, but I'm not 
going to fight for this very hard. I'm fine with ignoring this 
statistics type for now, getting the other "detailed" statistics types 
(MCV, histograms) in and then revisiting this.

> That's as much as I've looked at so far. It's such a big patch that
> it's difficult to consider all at once. I think perhaps the smallest
> committable self-contained unit providing a tangible benefit would be
> something containing the core infrastructure plus the ndistinct
> estimate and the improved GROUP BY estimation.
>

FWIW I find the ndistinct statistics rather uninteresting (at least 
compared to the other types of statistics), which is why it's the last 
patch in the patch series. Perhaps I shouldn't have included it at all, 
as it's just a distraction.

regards
Tomas

Re: multivariate statistics (v19)

From

Heikki Linnakangas

Date:

30 September 2016, 11:10:18

This patch set is in pretty good shape; the only problem is that it's so 
big that no-one seems to have the time or courage to do the final 
touches and commit it. If we just focus on the functional dependencies 
part for now, I think we might get somewhere. I peeked at the MCV and 
histogram patches too, and I think they make total sense as well, and 
are a natural extension of the functional dependencies patch. So if we 
just focus on that for now, I don't think we will paint ourselves into a 
corner.

(more below)

On 09/14/2016 01:01 AM, Tomas Vondra wrote:
> On 09/12/2016 04:08 PM, Dean Rasheed wrote:
>> Regarding the statistics themselves, I read the description of soft
>> functional dependencies, and I'm somewhat skeptical about that
>> algorithm. I don't like the arbitrary thresholds or the sudden jump
>> from independence to dependence and clause reduction. As others have
>> said, I think this should account for a continuous spectrum of
>> dependence from fully independent to fully dependent, and combine
>> clause selectivities in a way based on the degree of dependence. For
>> example, if you computed an estimate for the fraction 'f' of the
>> table's rows for which a -> b, then it might be reasonable to combine
>> the selectivities using
>>
>>   P(a,b) = P(a) * (f + (1-f) * P(b))
>>
>
> Yeah, I agree that the thresholds resulting in sudden changes between
> "dependent" and "not dependent" are annoying. The question is whether it
> makes sense to fix that, though - the functional dependencies were meant
> as the simplest form of statistics, allowing us to get the rest of the
> infrastructure in.
>
> I'm OK with replacing the true/false dependencies with a degree of
> dependency between 0 and 1, but I'm a bit afraid it'll result in
> complaints that the first patch got too large / complicated.

+1 for using a floating degree between 0 and 1, rather than a boolean.

> It also contradicts the idea of using functional dependencies as a
> low-overhead type of statistics, filtering the list of clauses that need
> to be estimated using more expensive types of statistics (MCV lists,
> histograms, ...). Switching to a degree of dependency would prevent
> removal of "unnecessary" clauses.

That sounds OK to me, although I'm not deeply familiar with this patch yet.

>> Of course, having just a single number that tells you the columns are
>> correlated, tells you nothing about whether the clauses on those
>> columns are consistent with that correlation. For example, in the
>> following table
>>
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>>
>> 'b' is functionally dependent on 'a' (and vice versa), but if you
>> query the rows with a<50 and with b<50, those clauses behave
>> essentially independently, because they're not consistent with the
>> functional dependence between 'a' and 'b', so the best way to combine
>> their selectivities is just to multiply them, as we currently do.
>>
>> So whilst it may be interesting to determine that 'b' is functionally
>> dependent on 'a', it's not obvious whether that fact by itself should
>> be used in the selectivity estimates. Perhaps it should, on the
>> grounds that it's best to attempt to use all the available
>> information, but only if there are no more detailed statistics
>> available. In any case, knowing that there is a correlation can be
>> used as an indicator that it may be worthwhile to build more detailed
>> multivariate statistics like a MCV list or a histogram on those
>> columns.
>
> Right. IIRC this is actually described in the README as "incompatible
> conditions". While implementing it, I concluded that this is OK and it's
> up to the developer to decide whether the queries are compatible with
> the "assumption of compatibility". But maybe this reasoning is bogus
> and makes (the current implementation of) functional dependencies
> unusable in practice.

I think that's OK. It seems like a good assumption that the conditions 
are "compatible" with the functional dependency. For two reasons:

1) A query with compatible clauses is much more likely to occur in real 
life. Why would you run a query with incompatible ZIP and city clauses?

2) If the conditions were in fact incompatible, the query is likely to 
return 0 rows, and will bail out very quickly, even if the estimates are 
way off and you choose a non-optimal plan. There are exceptions, of 
course: an index scan might be able to conclude that there are no rows 
much quicker than a seqscan, but as a general rule of thumb, a query 
that returns 0 rows isn't very sensitive to the chosen plan.

And of course, as long as we're not collecting these statistics 
automatically, if it doesn't work for your application, just don't 
collect them.

I fear that using "statistics" as the name of the new object might get a 
bit awkward. "statistics" is a plural, but we use it as the name of a 
single object, like "pants" or "scissors". Not sure I have any better 
ideas though. "estimator"? "statistics collection"? Or perhaps it should 
be singular, "statistic". I note that you actually called the system 
table "pg_mv_statistic", in singular.

I'm not a big fan of storing the stats as just a bytea blob, and having 
to use special functions to interpret it. By looking at the patch, it's 
not clear to me what we actually store for functional dependencies. A 
list of attribute numbers? Could we store them simply as an int[]? (I'm 
not a big fan of the hack in pg_statistic, that allows storing arrays of 
any data type in the same column, though. But for functional 
dependencies, I don't think we need that.)

Overall, this is going to be a great feature!

- Heikki

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

03 October 2016, 01:46:35

On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This patch set is in pretty good shape, the only problem is that it's so big
> that no-one seems to have the time or courage to do the final touches and
> commit it.

Did you see my suggestions about simplifying its SQL structure? You
could shave some code without impacting the base set of features.

> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>
> I'm not a big fan of storing the stats as just a bytea blob, and having to
> use special functions to interpret it. By looking at the patch, it's not
> clear to me what we actually store for functional dependencies. A list of
> attribute numbers? Could we store them simply as an int[]? (I'm not a big
> fan of the hack in pg_statistic, that allows storing arrays of any data type
> in the same column, though. But for functional dependencies, I don't think
> we need that.)

I am marking this patch as returned with feedback for now.

> Overall, this is going to be a great feature!

+1.
-- 
Michael

Re: multivariate statistics (v19)

From

Heikki Linnakangas

Date:

03 October 2016, 11:25:25

On 10/03/2016 04:46 AM, Michael Paquier wrote:
> On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This patch set is in pretty good shape, the only problem is that it's so big
>> that no-one seems to have the time or courage to do the final touches and
>> commit it.
>
> Did you see my suggestions about simplifying its SQL structure? You
> could shave some code without impacting the base set of features.

Yeah. The idea was to use something like pg_node_tree to store all the 
different kinds of statistics, the histogram, the MCV, and the 
functional dependencies, in one datum. Or JSON, maybe. It sounds better 
than an opaque bytea blob, although I'd prefer something more 
relational. For the functional dependencies, I think we could get away 
with a simple float array, so let's do that in the first cut, and 
revisit this for the MCV and histogram later. Separate columns for the 
functional dependencies, the MCVs, and the histogram, probably makes 
sense anyway.

- Heikki

Re: multivariate statistics (v19)

From

Michael Paquier

Date:

04 October 2016, 03:25:32

On Mon, Oct 3, 2016 at 8:25 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah. The idea was to use something like pg_node_tree to store all the
> different kinds of statistics, the histogram, the MCV, and the functional
> dependencies, in one datum. Or JSON, maybe. It sounds better than an opaque
> bytea blob, although I'd prefer something more relational. For the
> functional dependencies, I think we could get away with a simple float
> array, so let's do that in the first cut, and revisit this for the MCV and
> histogram later.

OK. A second thing was related to the use of schemas in the new system
catalogs. As mentioned in [1], those could be removed.
[1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.

> Separate columns for the functional dependencies, the MCVs,
> and the histogram, probably makes sense anyway.

Probably..
-- 
Michael

Re: multivariate statistics (v19)

From

Dean Rasheed

Date:

04 October 2016, 07:37:47

On 4 October 2016 at 04:25, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. A second thing was related to the use of schemas in the new system
> catalogs. As mentioned in [1], those could be removed.
> [1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.
>

That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e, store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also
answers Tomas' question about how JOIN might one day be supported.

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean

Re: multivariate statistics (v19)

From

Dean Rasheed

Date:

04 October 2016, 07:49:45

On 30 September 2016 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>

I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data. Also I
find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE
STATISTICS".

The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that. It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.

Regards,
Dean

Re: multivariate statistics (v19)

From

Heikki Linnakangas

Date:

04 October 2016, 08:15:37

On 10/04/2016 10:49 AM, Dean Rasheed wrote:
> On 30 September 2016 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I fear that using "statistics" as the name of the new object might get a bit
>> awkward. "statistics" is a plural, but we use it as the name of a single
>> object, like "pants" or "scissors". Not sure I have any better ideas though.
>> "estimator"? "statistics collection"? Or perhaps it should be singular,
>> "statistic". I note that you actually called the system table
>> "pg_mv_statistic", in singular.
>
> I think it's OK. The functional dependency is a single statistic, but
> MCV lists and histograms are multiple statistics (multiple facts about
> the data sampled), so in general when you create one of these new
> objects, you are creating multiple statistics about the data.

Ok. I don't really have any better ideas, was just hoping that someone 
else would.

> Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to
> "CREATE STATISTICS".

Agreed.

> The convention for naming system catalogs seems to be to use the
> singular for tables and plural for views, so I guess we should stick
> with that.

However, for tables and views, each object you store in those views is a 
"table" or "view", but with this thing, the object you store is 
"statistics". Would you have a catalog table called "pg_scissor"?

We call the current system table "pg_statistic", though. I agree we 
should call it pg_mv_statistic, in singular, to follow the example of 
pg_statistic.

Of course, the user-friendly system view on top of that is called 
"pg_stats", just to confuse things more :-).

> It doesn't seem like the end of the world that it doesn't
> match the user-facing syntax. A bigger concern is the use of "mv" in
> the name, because as has already been pointed out, this table may also
> in the future be used to store univariate expression and partial
> statistics, so I think we should drop the "mv" and go with something
> like pg_statistic_ext, or some other more general name.

Also, "mv" makes me think of materialized views, which is completely 
unrelated to this.

- Heikki

Re: multivariate statistics (v19)

From

Gavin Flower

Date:

04 October 2016, 08:51:36

On 04/10/16 20:37, Dean Rasheed wrote:
> On 4 October 2016 at 04:25, Michael Paquier <michael.paquier@gmail.com> wrote:
>> OK. A second thing was related to the use of schemas in the new system
>> catalogs. As mentioned in [1], those could be removed.
>> [1]: https://www.postgresql.org/message-id/CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com.
>>
> That doesn't work, because if the intention is to be able to one day
> support statistics across multiple tables, you can't assume that the
> statistics are in the same schema as the table.
>
> In fact, if multi-table statistics are to be allowed in the future, I
> think you want to move away from thinking of statistics as depending
> on and referring to a single table, and handle them more like views --
> i.e, store a pg_node_tree representing the from_clause and add
> multiple dependencies at statistics creation time. That was what I was
> getting at upthread when I suggested the alternate syntax, and also
> answers Tomas' question about how JOIN might one day be supported.
>
> Of course, if we don't think that we will ever support multi-table
> statistics, that all goes away, and you may as well make the
> statistics name local to the table, but I think that's a bit limiting.
> One way or the other, I think this is a question that needs to be
> answered now. My vote is to leave expansion room to support
> multi-table statistics in the future.
>
> Regards,
> Dean
>
>
I can see multi-table statistics being useful if one is trying to 
optimise indexes for multiple joins.

Am assuming that the statistics can be accessed by the user as well as 
the planner? (I've only lightly followed this thread, so I might have 
missed significant relevant details!)


Cheers,
Gavin

Re: multivariate statistics (v19)

From

Dean Rasheed

Date:

04 October 2016, 09:21:18

On 4 October 2016 at 09:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> However, for tables and views, each object you store in those views is a
> "table" or "view", but with this thing, the object you store is
> "statistics". Would you have a catalog table called "pg_scissor"?
>

No, probably not (unless it was storing individual scissor blades).

However, in this case, we have related pre-existing catalog tables, so...

> We call the current system table "pg_statistic", though. I agree we should
> call it pg_mv_statistic, in singular, to follow the example of pg_statistic.
>
> Of course, the user-friendly system view on top of that is called
> "pg_stats", just to confuse things more :-).
>

I agree. Given where we are, with a pg_statistic table and a pg_stats
view, I think the least worst solution is to have a pg_statistic_ext
table, and then maybe a pg_stats_ext view.


>> It doesn't seem like the end of the world that it doesn't
>> match the user-facing syntax. A bigger concern is the use of "mv" in
>> the name, because as has already been pointed out, this table may also
>> in the future be used to store univariate expression and partial
>> statistics, so I think we should drop the "mv" and go with something
>> like pg_statistic_ext, or some other more general name.
>
>
> Also, "mv" makes me think of materialized views, which is completely
> unrelated to this.
>

Yeah, I hadn't thought of that.

Regards,
Dean

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

11 October 2016, 03:39:34

Hi everyone,

thanks for the reviews. Let me sum up the feedback so far, and outline my
plans for the next patch version that I'd like to submit for CF 2016-11.

1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the
syntax is going to be supported until we add support for either join or
partial statistics. So something like this:

    CREATE STATISTICS name [ WITH (options) ] ON (column1, column2 [, ...]) FROM table

That should not be a difficult change.

2) catalog names

I'm not sure what the best names are, so I'm fine with using whatever is
the consensus.

That being said, I'm not sure I like extending the catalog to also
support non-multivariate statistics (like for example statistics on
expressions). While that would be a clearly useful feature, it seems
like a slightly different use case and perhaps a separate catalog would
be better. So maybe pg_statistic_ext is not the best name.

3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki
proposed using something like pg_node_tree, and maybe storing all the
statistics in a single value.

I assume pg_node_tree was meant only as inspiration for how to build a
pseudo-type on top of a varlena value. I agree that's a good idea, and I
plan to do something like that - say adding pg_mcv, pg_histogram,
pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the
statistics. I don't think so - firstly, it only supports a subset of
data types, so we'd be unable to store statistics for some data types
(or we'd have to store them as text, which sucks). Also, there's a fair
amount of smartness in how the statistics are stored (e.g. how the
histogram bucket boundaries are deduplicated, or how the estimation uses
the serialized representation directly). We'd lose all of that when
using JSONB.

Similarly for storing all the statistics in a single value - I see no
reason why keeping the statistics in separate columns would be a bad
idea (after all, that's kinda the point of relational databases). Also,
there are perfectly valid cases when the caller only needs a particular
type of statistic - e.g. when estimating GROUP BY we'll only need the
ndistinct coefficients. Why should we force the caller to fetch and
detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo types similar to how pg_node_tree is
defined. That does not seem like a tremendous amount of work.

4) functional dependencies

Several people mentioned they don't like how functional dependencies are
detected at ANALYZE time, particularly that there's a sudden jump
between 0 and 1. Instead, a continuous "dependency degree" between 0 and
1 was proposed.

I'm fine with that, although that makes "clause reduction" (deciding
that we don't need to estimate one of the clauses at all, as it's
implied by some other clause) impossible. But that's fine, the
functional dependencies will still be much less expensive than the other
statistics.

I'm wondering how this will interact with transitivity, though. IIRC the
current implementation is able to detect transitive dependencies and use
that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies,
which were meant as a trivial type of statistics, mostly to establish
the shared infrastructure. Which brings me to ndistinct.

5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the
patch, and the statistic was only built but not used for any sort of
estimation. I agree with Dean that perhaps it'd be better to move this
to the very beginning, and use it as the simplest statistic to build the
infrastructure instead of functional dependencies (which only gets truer
due to the changes in functional dependencies, discussed in the
preceding section).

I think it's probably a good idea and I plan to do that, so the patch
series will probably look like this:

* 001 - CREATE STATISTICS infrastructure with ndistinct coefficients
* 002 - use ndistinct coefficients to improve GROUP BY estimates
* 003 - use ndistinct coefficients in clausesel.c (not sure)
* 004 - add functional dependencies (build + clausesel.c)
* 005 - add multivariate MCV (build + clausesel.c)
* 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to
estimate regular conditions - it's the place for which ndistinct
coefficients were originally proposed by Kyotaro-san, but I seem to
remember it was non-trivial to choose the best statistics when there
were other types of stats available. But I'll look into that.

6) combining statistics

I've decided not to re-submit this part of the patch until the basic
functionality gets in. I do think it's a very useful feature (despite
having my doubts about the existing implementation), but it clearly
distracts people.

Instead, the patch will use some simple selection strategy (e.g. using a
single statistics covering most conditions) or perhaps something more
advanced (e.g. non-overlapping statistics). But nothing complicated.

7) enriching the query plan

Sadly, none of the reviews provides any sort of feedback on how to
enrich the query plan with information about statistics (instead of
doing that in clausesel.c in an ad-hoc, ephemeral manner).

So I'm still a bit stuck on this :-(

8) join statistics

Not directly related to the current patch, but I recommend reading this
paper quantifying the impact of each part of the query optimizer (estimates,
cost model, plan enumeration):
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

The one conclusion that I take from it is we really need to think about
improving the join estimates, somehow. Because it's by far the most
significant source of issues (and the hardest one to fix).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v19)

From

Tomas Vondra

Date:

29 October 2016, 19:23:51

Hi,

Attached is v20 of the multivariate statistics patch series, doing
mostly the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo
* 0002 : (PATCH) shared infrastructure and ndistinct coefficients
* 0003 : (PATCH) functional dependencies (only the ANALYZE part)
* 0004 : (PATCH) selectivity estimation using functional dependencies
* 0005 : (PATCH) multivariate MCV lists
* 0006 : (PATCH) multivariate histograms
* 0007 : (WIP) selectivity estimation using ndistinct coefficients
* 0008 : (WIP) use multiple statistics for estimation
* 0009 : (WIP) psql tab completion basics

Let me elaborate about the main changes in this version:

1) rework CREATE STATISTICS to what Dean Rasheed proposed in [1]:
-----------------------------------------------------------------------

CREATE STATISTICS name WITH (options) ON (columns) FROM table

This allows adding support for statistics on joins, expressions
referencing multiple tables, and partial statistics (with WHERE
predicates, similar to indexes). Although those things are not
implemented (and I don't know if/when that happens), it's good the
syntax supports it.

I've been thinking about using "CREATE STATISTIC" instead, but I decided
to stick with "STATISTICS" for two reasons. Firstly it's possible to
create multiple statistics in a single command, for example by using
WITH (mcv,histogram). And secondly, we already have "ALTER TABLE ... SET
STATISTICS n" (although that tweaks the statistics target for a column,
not the statistics on the column).

2) no changes to catalog names
-----------------------------------------------------------------------

Clearly, naming things is one of the hardest things in computer science.
I don't have a good idea what names would be better than the current
ones. In any case, this is fairly trivial to do.

3) special data types for statistics
-----------------------------------------------------------------------

Heikki proposed to invent a new data type, similar to pg_node_tree. I do
agree that storing the stats in plain bytea (i.e. catalog having bytea
columns) was not particularly convenient, but I'm not sure how much of
pg_node_tree Heikki wanted to copy.

In particular, I'm not sure whether Heikki's idea was to store all the
statistics together in a single Datum, serialized into a text string
(similar to pg_node_tree).

I don't think that would be a good idea, as the statistics may be quite
large and complex, and deserializing them from text format would be
quite expensive. For pg_node_tree that's not a major issue because the
values are usually fairly small. Similarly, packing everything into a
single datum would force the planner to parse/unpack everything, even if
it needs just a small piece (e.g. the ndistinct coefficients, but not
histograms).

So I've decided to invent new data types, one for each statistic type:

* pg_ndistinct
* pg_dependencies
* pg_mcv_list
* pg_histogram

Similarly to pg_node_tree those data types only support output, i.e.
both 'recv' and 'in' functions do elog(ERROR). But while pg_node_tree is
stored as text, those new data types are still bytea.

I do believe this is a good solution, and it allows casting the data
types to text easily, as it simply calls the out function.

The statistics however do not store attnums in the bytea, just indexes
into pg_mv_statistic.stakeys. That means the out functions can't print
column names in the output, or values (because without the attnum we
don't know the type, and thus can't lookup the proper out function).

I don't think there's a good solution for that (I was thinking about
storing the attnums/typeoid in the statistics itself, but that seems
fairly ugly). And I'm quite happy with those new data types.

4) replace functional dependencies with ndistinct (in the first patch)
-----------------------------------------------------------------------

As the ndistinct coefficients are simpler than functional dependencies,
I've decided to use them in the first patch in the series, which
implements the shared infrastructure. This does not mean throwing away
functional dependencies entirely, just moving them to a later patch.

5) rework of ndistinct coefficients
-----------------------------------------------------------------------

The ndistinct coefficients were also significantly reworked. Instead of
computing and storing the value for the exact combination of attributes,
the new version computes ndistinct for all combinations of attributes.

So for example with CREATE STATISTICS x ON (a,b,c) the old patch only
computed ndistinct on (a,b,c), while the new patch computes ndistinct on
{(a,b,c), (a,b), (a,c), (b,c)}. This makes it way more powerful.

The first patch (0002) only uses this in estimate_num_groups to improve
GROUP BY estimates. A later patch (0007) shows how it might be used for
selectivity estimation, but it's a very early WIP at this point.
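
To illustrate what "all combinations of attributes" means, here is a small
Python sketch. It is not code from the patch: the function name and the toy
data are made up for the example, and it counts distinct combinations exactly,
whereas ANALYZE only sees a sample of the table.

```python
from itertools import combinations


def ndistinct_all_combinations(rows, columns):
    """Count distinct value combinations for every subset of 2+ columns.

    rows    -- list of dicts, standing in for the sampled rows
    columns -- column names the statistics object is defined on
    """
    result = {}
    for size in range(2, len(columns) + 1):
        for combo in combinations(columns, size):
            distinct = {tuple(row[c] for c in combo) for row in rows}
            result[combo] = len(distinct)
    return result


# Toy data: a and b are perfectly correlated, c varies independently.
sample = [{"a": i % 10, "b": i % 10, "c": i % 7} for i in range(1000)]
coefficients = ndistinct_all_combinations(sample, ["a", "b", "c"])

for combo, count in coefficients.items():
    print(combo, count)
```

For three columns this yields four entries, matching the
{(a,b,c), (a,b), (a,c), (b,c)} set above. The (a,b) entry stays at the
ndistinct of a single column because the two columns move together.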

Also, I'm not sure we should use ndistinct coefficients this way,
because of the "homogeneity" assumption, similarly to functional
dependencies. Functional dependencies are used only for selectivity
estimation, so it's quite easy not to use them if they don't work for
that purpose. But ndistinct coefficients are also used for GROUP BY
estimation, where the homogeneity assumption is not such a big deal. So I
expect people to add ndistinct, get better GROUP BY estimates but
sometimes worse selectivity estimates - not great, I guess.

But the selectivity estimation using ndistinct coefficients is very
simple right now - in particular it does not use the per-clause
selectivities at all, it simply assumes the whole selectivity is
1/ndistinct for the combination of columns.

Functional dependencies use this formula to combine the selectivities:

P(a,b) = P(a) * [f + (1-f)*P(b)]

so maybe there's something similar for ndistinct coefficients? I mean,
let's say we know ndistinct(a), ndistinct(b), ndistinct(a,b) and P(a)
and P(b). How do we compute P(a,b)?
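
For reference, the two estimates mentioned above can be written down as a
short Python sketch. The function names are only illustrative and not taken
from the patch, and the numbers loosely follow the example table from the
start of the thread (about 100 distinct values per column).

```python
def dependency_selectivity(p_a, p_b, f):
    """P(a,b) = P(a) * [f + (1-f)*P(b)] for a dependency a -> b of degree f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("degree f must be between 0 and 1")
    return p_a * (f + (1.0 - f) * p_b)


def ndistinct_selectivity(ndistinct_ab):
    """The simple estimate: 1/ndistinct for the combination of columns."""
    return 1.0 / ndistinct_ab


p_a = p_b = 1.0 / 100  # per-column selectivities

independent = dependency_selectivity(p_a, p_b, 0.0)  # f = 0: plain product
dependent = dependency_selectivity(p_a, p_b, 1.0)    # f = 1: b implied by a
halfway = dependency_selectivity(p_a, p_b, 0.5)      # something in between

print(independent, dependent, halfway, ndistinct_selectivity(100))
```

With f = 0 the formula degenerates to the usual independence assumption, and
with f = 1 the clause on b does not reduce the estimate at all. The ndistinct
variant ignores P(a) and P(b) entirely, which is the limitation described
above.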

5) rework functional dependencies
-----------------------------------------------------------------------

Based on Dean's feedback, I've reworked functional dependencies to use
continuous "degree" of validity (instead of true/false behavior,
resulting in sudden changes in behavior).

This significantly reduced the amount of code, because the old patch
tried to identify transitive dependencies (to minimize time and storage
requirements). Switching to continuous degree makes this impossible (or
at least far more complicated), so I've simply ripped all of this out.

This means the statistics will be larger and ANALYZE will take more
time, but the differences are fairly small in practice, and the estimation
actually seems to work better.
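
One way to picture a continuous degree is the fraction of rows that are
consistent with the dependency. The Python sketch below uses that definition
purely as an assumption for illustration; this e-mail does not spell out the
exact formula, and the names are made up.

```python
from collections import defaultdict


def dependency_degree(rows, determinant, dependent):
    """Fraction of rows in groups where `determinant` maps to one `dependent` value."""
    if not rows:
        return 0.0
    values_seen = defaultdict(set)
    group_size = defaultdict(int)
    for row in rows:
        key = row[determinant]
        values_seen[key].add(row[dependent])
        group_size[key] += 1
    # Only groups with a single dependent value support the dependency.
    supporting = sum(size for key, size in group_size.items()
                     if len(values_seen[key]) == 1)
    return supporting / len(rows)


# zip -> city holds for every zip code, except for one contradicting row.
rows = [{"zip": z, "city": z // 10} for z in range(100) for _ in range(10)]
rows.append({"zip": 0, "city": 99})

degree = dependency_degree(rows, "zip", "city")
reverse = dependency_degree(rows, "city", "zip")
print(degree, reverse)
```

A single contradicting row only lowers the degree slightly instead of flipping
the dependency from "valid" to "invalid", which is the sudden change in
behavior the rework is meant to avoid.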

6) MCV and histogram changes
-----------------------------------------------------------------------

Those statistics types are mostly unchanged, except for a few minor bug
fixes and removal of the max_mcv_items and max_buckets options.

Those options were meant to allow users to limit the size of the
statistics, but the implementation was ignoring them so far. So I've
ripped them out, and if needed we may reintroduce them later.

7) no more (elaborate) combinations of statistics
-----------------------------------------------------------------------

I've ripped out the patch that combined multiple statistics in a very
elaborate way - it was overly complex, possibly wrong, but most
importantly it distracted people from the preceding patches. So I've
ripped this out, and instead replaced that with a very simple approach
that allows using multiple statistics on different subsets of the clause
list. So for example

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

may benefit from two statistics, one on (a,b) and a second on (c,d). It's
a very simple approach, but it does the trick for many cases and is better
than "single statistics" limitation.
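
A rough Python sketch of such a selection loop follows. This is not the logic
from the 0008 patch: the greedy rule (repeatedly pick the statistics covering
the most remaining columns) and all names are assumptions made for the example.

```python
def choose_statistics(clause_columns, available_stats):
    """Greedily pick statistics covering disjoint subsets of the clause columns.

    clause_columns  -- set of columns referenced by the AND-ed clauses
    available_stats -- list of column tuples, one per statistics object
    Returns the chosen statistics and the columns left uncovered.
    """
    remaining = set(clause_columns)
    chosen = []
    while True:
        best, best_cover = None, set()
        for stat in available_stats:
            cover = remaining & set(stat)
            # Covering fewer than two clauses adds nothing over per-column stats.
            if len(cover) >= 2 and len(cover) > len(best_cover):
                best, best_cover = stat, cover
        if best is None:
            break
        chosen.append(best)
        remaining -= best_cover
    return chosen, remaining


chosen, leftover = choose_statistics({"a", "b", "c", "d"},
                                     [("a", "b"), ("c", "d")])
print(chosen, leftover)
```

Columns left in the second return value would be estimated from the regular
per-column statistics, as before.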

The 0008 patch is actually very simple, essentially adding just a loop
into the code blocks, so I think it's quite likely this will get merged
into the preceding patches.

8) reduce table sizes used in regression tests
-----------------------------------------------------------------------

Some of the regression tests used quite large tables (with up to 1M
rows), which had two issues - long runtimes and instability (because the
ANALYZE sample is only 30k rows, so there were sometimes small changes
due to picking a different sample). I've limited the table sizes to 30k
rows.
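
The 30k figure follows from how ANALYZE sizes its sample: 300 rows per unit of
statistics target, with a default target of 100. The Python sketch below only
restates that arithmetic; the helper names are made up.

```python
DEFAULT_STATISTICS_TARGET = 100  # default value of default_statistics_target


def analyze_sample_rows(statistics_target=DEFAULT_STATISTICS_TARGET):
    """Number of rows ANALYZE samples: 300 times the statistics target."""
    return 300 * statistics_target


def is_fully_sampled(table_rows, statistics_target=DEFAULT_STATISTICS_TARGET):
    """True when the sample can hold every row of the table."""
    return table_rows <= analyze_sample_rows(statistics_target)


print(analyze_sample_rows())         # sample size with the default target
print(is_fully_sampled(30_000))      # table fits into the sample
print(is_fully_sampled(1_000_000))   # table is much larger than the sample
```

When the whole table fits into the sample, repeated ANALYZE runs see the same
rows, so the expected output of the tests no longer depends on which sample
happened to be picked.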

8) open / unsolved questions
-----------------------------------------------------------------------

The main open question is still whether clausesel.c is the best place to
do all the heavy lifting (particularly matching clauses and statistics,
and deciding which statistics to use). I suspect some of that should be
done elsewhere (earlier in the planning), enriching the query tree
somehow. Then clausesel.c would "only" compute the estimates, and it
would also allow showing the info in EXPLAIN.

I'm not particularly happy with how the changes in clauselist_selectivity
look right now - there are three almost identical blocks, so this would
deserve some refactoring. But I'd like to get some feedback first.

regards

[1]
https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com

[2]
https://www.postgresql.org/message-id/1c7e4e63-769b-f8ce-f245-85ef4f59fcba%40iki.fi

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

multivariate-stats-v20.tgz

Re: WIP: multivariate statistics / proof of concept

From

Robert Haas

Date:

21 November 2016, 22:11:03

[ reviving an old multivariate statistics thread ]

On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> It however seems to be working sufficiently well at this point, enough
>> to get some useful feedback. So here we go.
>
> This looks interesting and useful.
>
> What I'd like to check before a detailed review is that this has
> sufficient applicability to be useful.
>
> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
> result of multi-column stats errors.
>
> Could you look at those queries and confirm that this patch can
> produce better plans for them?

Tomas, did you ever do any testing in this area?  One of my
colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
20 GB.  Q18 actually doesn't complete at all right now because of an
issue with the new simplehash implementation.  I reported it to Andres
and he tracked it down, but hasn't posted the patch yet - see
http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de

Of the remaining queries, the slowest are Q9 and Q20, and both of them
have serious estimation errors.  On Q9, things go wrong here:

->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47) (actual time=103592.821..149335.010 rows=6503988 loops=1)
      Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
      Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
      Rows Removed by Join Filter: 19511964
      ->  Index Scan using idx_partsupp_partkey on partsupp  (cost=0.43..781956.32 rows=15999792 width=22) (actual time=0.044..11825.481 rows=15999881 loops=1)
      ->  Sort  (cost=5224967.03..5245348.02 rows=8152396 width=45) (actual time=103592.505..112205.444 rows=26015949 loops=1)
            Sort Key: part.p_partkey
            Sort Method: quicksort  Memory: 704733kB
            ->  Hash Join  (cost=127278.36..4289121.18 rows=8152396 width=45) (actual time=1084.370..94732.951 rows=6503988 loops=1)
                  Hash Cond: (lineitem.l_partkey = part.p_partkey)
                  ->  Seq Scan on lineitem  (cost=0.00..3630339.08 rows=119994608 width=41) (actual time=0.015..33355.637 rows=119994608 loops=1)
                  ->  Hash  (cost=123743.07..123743.07 rows=282823 width=4) (actual time=1083.686..1083.686 rows=216867 loops=1)
                        Buckets: 524288  Batches: 1  Memory Usage: 11721kB
                        ->  Gather  (cost=1000.00..123743.07 rows=282823 width=4) (actual time=0.418..926.283 rows=216867 loops=1)
                              Workers Planned: 4
                              Workers Launched: 4
                              ->  Parallel Seq Scan on part  (cost=0.00..94460.77 rows=70706 width=4) (actual time=0.063..962.909 rows=43373 loops=5)
                                    Filter: ((p_name)::text ~~ '%grey%'::text)
                                    Rows Removed by Filter: 756627

The estimate for the index scan on partsupp is essentially perfect,
and the lineitem-part join is off by about 3x.  However, the merge
join is off by about 4000x, which is real bad.

On Q20, things go wrong here:
    ->  Merge Join  (cost=5928271.92..6411281.44 rows=278 width=16) (actual time=77887.963..136614.284 rows=118124 loops=1)
          Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
          Join Filter: ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
          Rows Removed by Join Filter: 242
          ->  GroupAggregate  (cost=5363980.40..5691151.45 rows=9681876 width=48) (actual time=76672.726..131482.677 rows=10890067 loops=1)
                Group Key: lineitem.l_partkey, lineitem.l_suppkey
                ->  Sort  (cost=5363980.40..5409466.13 rows=18194291 width=21) (actual time=76672.661..86405.882 rows=18194084 loops=1)
                      Sort Key: lineitem.l_partkey, lineitem.l_suppkey
                      Sort Method: external merge  Disk: 551376kB
                      ->  Bitmap Heap Scan on lineitem  (cost=466716.05..3170023.42 rows=18194291 width=21) (actual time=13735.552..39289.995 rows=18195269 loops=1)
                            Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
                            Heap Blocks: exact=2230011
                            ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..462167.48 rows=18194291 width=0) (actual time=11771.173..11771.173 rows=18195269 loops=1)
                                  Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone))
          ->  Sort  (cost=564291.52..567827.56 rows=1414417 width=24) (actual time=1214.812..1264.356 rows=173936 loops=1)
                Sort Key: partsupp.ps_partkey, partsupp.ps_suppkey
                Sort Method: quicksort  Memory: 19733kB
                ->  Nested Loop  (cost=1000.43..419796.26 rows=1414417 width=24) (actual time=0.447..985.562 rows=173936 loops=1)
                      ->  Gather  (cost=1000.00..99501.07 rows=40403 width=4) (actual time=0.390..34.476 rows=43484 loops=1)
                            Workers Planned: 4
                            Workers Launched: 4
                            ->  Parallel Seq Scan on part  (cost=0.00..94460.77 rows=10101 width=4) (actual time=0.143..527.665 rows=8697 loops=5)
                                  Filter: ((p_name)::text ~~ 'beige%'::text)
                                  Rows Removed by Filter: 791303
                      ->  Index Scan using idx_partsupp_partkey on partsupp  (cost=0.43..7.58 rows=35 width=20) (actual time=0.017..0.019 rows=4 loops=43484)
                            Index Cond: (ps_partkey = part.p_partkey)

The estimate for the GroupAggregate feeding one side of the merge join
is quite accurate.  The estimate for the part-partsupp join on the
other side is off by 8x.  Then things get much worse: the estimate for
the merge join is off by 400x.
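
For reference, the error factors can be recomputed from the row counts in the plans. The following is only an illustrative Python sketch, not part of any patch; the node labels are made up for readability, and the Q9 figures are the ones on the Q9 Merge Join line (rows=154 estimated, rows=6503988 actual) as it is quoted in the replies.

```python
# Estimated vs. actual row counts copied from the EXPLAIN ANALYZE output.
# The dictionary keys are descriptive labels chosen for this sketch only.
NODES = {
    "Q20 GroupAggregate": (9681876, 10890067),
    "Q20 part-partsupp Nested Loop": (1414417, 173936),
    "Q20 Merge Join": (278, 118124),
    "Q9 Merge Join": (154, 6503988),
}


def error_factor(estimated: int, actual: int) -> float:
    """Return how many times the estimate is off, regardless of direction."""
    return max(estimated, actual) / min(estimated, actual)


factors = {node: error_factor(est, act) for node, (est, act) in NODES.items()}

for node, factor in factors.items():
    print(f"{node}: off by {factor:.1f}x")
```

The Q20 numbers agree with the prose (roughly 1.1x, 8x and 425x). For the Q9 merge join the same calculation gives a factor of about 42,000x, so the "about 4000x" figure appears to understate that error by an order of magnitude.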

I'm not really sure whether the multivariate statistics stuff will fix
this kind of case or not, but if it did it would be awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: multivariate statistics / proof of concept

From

Tomas Vondra

Date:

22 November 2016, 03:42:24

On 11/21/2016 11:10 PM, Robert Haas wrote:
> [ reviving an old multivariate statistics thread ]
>
> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>>
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
>
> Tomas, did you ever do any testing in this area?  One of my
> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
> 20 GB.  Q18 actually doesn't complete at all right now because of an
> issue with the new simplehash implementation.  I reported it to Andres
> and he tracked it down, but hasn't posted the patch yet - see
> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de
>
> Of the remaining queries, the slowest are Q9 and Q20, and both of them
> have serious estimation errors.  On Q9, things go wrong here:
>
>                                  ->  Merge Join
> (cost=5225092.04..6595105.57 rows=154 width=47) (actual
> time=103592.821..149335.010 rows=6503988 loops=1)
>                                        Merge Cond:
> (partsupp.ps_partkey = lineitem.l_partkey)
>                                        Join Filter:
> (lineitem.l_suppkey = partsupp.ps_suppkey)
>                                        Rows Removed by Join Filter: 19511964
>                                        ->  Index Scan using
> [snip]
>
> Rows Removed by Filter: 756627
>
> The estimate for the index scan on partsupp is essentially perfect,
> and the lineitem-part join is off by about 3x.  However, the merge
> join is off by about 4000x, which is real bad.
>

The patch only deals with statistics on base relations, no joins, at 
this point. It's meant to be extended in that direction, so the syntax 
supports it, but at this point that's all. No joins.

That being said, this estimate should be improved in 9.6, when you 
create a foreign key between the tables. In fact, that patch was exactly 
about Q9.

This is how the join estimate looks on scale 1 without the FK between 
the two tables:
                          QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
                (cost=0.43..605856.84 rows=6001117 width=117)
   ->  Index Scan using partsupp_pkey on partsupp
                (cost=0.42..61141.76 rows=800000 width=144)
(4 rows)


and with the foreign key:
                             QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
             (actual rows=6001215 loops=1)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
                (cost=0.43..605856.84 rows=6001117 width=117)
                (actual rows=6001215 loops=1)
   ->  Index Scan using partsupp_pkey on partsupp
                (cost=0.42..61141.76 rows=800000 width=144)
                (actual rows=6001672 loops=1)
 Planning time: 3.840 ms
 Execution time: 21987.913 ms
(6 rows)
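
The two row estimates can be roughly reproduced by hand. The sketch below is an illustration in Python, not planner code. It assumes that without the FK each join clause gets an equijoin selectivity of 1/max(ndistinct) and that the two clauses are treated as independent, and that the ndistinct values match the TPC-H scale 1 table sizes (200,000 parts, 10,000 suppliers). Those cardinalities are assumptions added here; they do not appear in the thread.

```python
# Row counts taken from the planner estimates in the plans above.
LINEITEM_ROWS = 6001117
PARTSUPP_ROWS = 800000

# Assumed ndistinct values (TPC-H scale 1: 200,000 parts, 10,000 suppliers).
NDISTINCT_PARTKEY = 200000
NDISTINCT_SUPPKEY = 10000


def equijoin_selectivity(ndistinct_left: int, ndistinct_right: int) -> float:
    """Selectivity of `left = right` when no MCV lists are used."""
    return 1.0 / max(ndistinct_left, ndistinct_right)


cross_product = LINEITEM_ROWS * PARTSUPP_ROWS

# Without the FK: the two clauses are assumed independent and multiplied.
sel_independent = (
    equijoin_selectivity(NDISTINCT_PARTKEY, NDISTINCT_PARTKEY)
    * equijoin_selectivity(NDISTINCT_SUPPKEY, NDISTINCT_SUPPKEY)
)
rows_without_fk = cross_product * sel_independent

# With the FK: every lineitem row matches exactly one partsupp row,
# so the combined selectivity is 1 / (rows in the referenced table).
rows_with_fk = cross_product * (1.0 / PARTSUPP_ROWS)

print(round(rows_without_fk), round(rows_with_fk))
```

Under these assumptions the first estimate comes out at about 2400 rows, close to the rows=2404 in the plan without the FK (the small gap would come from the planner's own ndistinct estimates), and the second equals the lineitem row count, as in the plan with the FK.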


> On Q20, things go wrong here:
>
> [snip]
>
> The estimate for the GroupAggregate feeding one side of the merge join
> is quite accurate.  The estimate for the part-partsupp join on the
> other side is off by 8x.  Then things get much worse: the estimate for
> the merge join is off by 400x.
>

Well, most of the estimation error comes from the join, but sadly the 
aggregate makes using the foreign keys impossible - at least in the 
current version. I don't know if it can be improved, somehow.

> I'm not really sure whether the multivariate statistics stuff will fix
> this kind of case or not, but if it did it would be awesome.
>

Join statistics are something I'd like to add eventually, but I don't 
see how it could happen in the first version. Also, the patch received 
no reviews this CF, and making it even larger is unlikely to make it 
more attractive.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: multivariate statistics / proof of concept

From

Haribabu Kommi

Date:

02 December 2016, 12:03:17

On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> On 11/21/2016 11:10 PM, Robert Haas wrote:
>> [ reviving an old multivariate statistics thread ]
>>
>> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>>
>>>> It however seems to be working sufficiently well at this point, enough
>>>> to get some useful feedback. So here we go.
>>>
>>> This looks interesting and useful.
>>>
>>> What I'd like to check before a detailed review is that this has
>>> sufficient applicability to be useful.
>>>
>>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>>> result of multi-column stats errors.
>>>
>>> Could you look at those queries and confirm that this patch can
>>> produce better plans for them?
>>
>> Tomas, did you ever do any testing in this area? One of my
>> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
>> 20 GB. Q18 actually doesn't complete at all right now because of an
>> issue with the new simplehash implementation. I reported it to Andres
>> and he tracked it down, but hasn't posted the patch yet - see
>> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de
>>
>> Of the remaining queries, the slowest are Q9 and Q20, and both of them
>> have serious estimation errors. On Q9, things go wrong here:
>>
>>   ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>>         (actual time=103592.821..149335.010 rows=6503988 loops=1)
>>         Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>>         Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>>         Rows Removed by Join Filter: 19511964
>>         ->  Index Scan using
>> [snip]
>>
>> Rows Removed by Filter: 756627
>>
>> The estimate for the index scan on partsupp is essentially perfect,
>> and the lineitem-part join is off by about 3x. However, the merge
>> join is off by about 4000x, which is real bad.
>
> The patch only deals with statistics on base relations, no joins, at
> this point. It's meant to be extended in that direction, so the syntax
> supports it, but at this point that's all. No joins.
>
> That being said, this estimate should be improved in 9.6, when you
> create a foreign key between the tables. In fact, that patch was exactly
> about Q9.
>
> This is how the join estimate looks on scale 1 without the FK between
> the two tables:
>
>                           QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>                 (cost=0.43..605856.84 rows=6001117 width=117)
>    ->  Index Scan using partsupp_pkey on partsupp
>                 (cost=0.42..61141.76 rows=800000 width=144)
> (4 rows)
>
> and with the foreign key:
>
>                              QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
>              (actual rows=6001215 loops=1)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>                 (cost=0.43..605856.84 rows=6001117 width=117)
>                 (actual rows=6001215 loops=1)
>    ->  Index Scan using partsupp_pkey on partsupp
>                 (cost=0.42..61141.76 rows=800000 width=144)
>                 (actual rows=6001672 loops=1)
>  Planning time: 3.840 ms
>  Execution time: 21987.913 ms
> (6 rows)
>
>> On Q20, things go wrong here:
>>
>> [snip]
>>
>> The estimate for the GroupAggregate feeding one side of the merge join
>> is quite accurate. The estimate for the part-partsupp join on the
>> other side is off by 8x. Then things get much worse: the estimate for
>> the merge join is off by 400x.
>
> Well, most of the estimation error comes from the join, but sadly the
> aggregate makes using the foreign keys impossible - at least in the
> current version. I don't know if it can be improved, somehow.
>
>> I'm not really sure whether the multivariate statistics stuff will fix
>> this kind of case or not, but if it did it would be awesome.
>
> Join statistics are something I'd like to add eventually, but I don't
> see how it could happen in the first version. Also, the patch received
> no reviews this CF, and making it even larger is unlikely to make it
> more attractive.

Moved to next CF with "needs review" status.

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] multivariate statistics (v19)

From

Amit Langote

Date:

12 December 2016, 14:26:33

Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:
> Hi,
> 
> Attached is v20 of the multivariate statistics patch series, doing mostly
> the changes outlined in the preceding e-mail from October 11.
> 
> The patch series currently has these parts:
> 
> * 0001 : (FIX) teach pull_varno about RestrictInfo
> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
> * 0004 : (PATCH) selectivity estimation using functional dependencies
> * 0005 : (PATCH) multivariate MCV lists
> * 0006 : (PATCH) multivariate histograms
> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
> * 0008 : (WIP) use multiple statistics for estimation
> * 0009 : (WIP) psql tab completion basics

Unfortunately, this failed to compile because of the duplicate_oids error.
The partitioning patch consumed the same OIDs as used in this patch.

I will try to read the patches in some more detail, but in the meantime,
here are some comments/nitpicks on the documentation:

No updates to doc/src/sgml/catalogs.sgml?

+  <para>
+   The examples presented in <xref linkend="row-estimation-examples"> used
+   statistics about individual columns to compute selectivity estimates.
+   When estimating conditions on multiple columns, the planner assumes
+   independence and multiplies the selectivities. When the columns are
+   correlated, the independence assumption is violated, and the estimates
+   may be seriously off, resulting in poor plan choices.
+  </para>

The term independence is used in isolation - independence of what?
Independence of the distributions of values in separate columns?  Also,
the phrase "seriously off" could perhaps be replaced by more rigorous
terminology; it might be unclear to some readers.  Perhaps: wildly
inaccurate, :)

+<programlisting>
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
+                                           QUERY PLAN
+-------------------------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
time=0.031..2.870 rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms

Is there a reason why the examples in "67.2. Multivariate Statistics" (like
the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
Examples" (and other relevant chapters) use just EXPLAIN?

+   the final 0.01% estimate. The plan however shows that this results in
+   a significant under-estimate, as the actual number of rows matching the

s/under-estimate/underestimate/g

+  <para>
+   For additional details about multivariate statistics, see
+   <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
+   <literal>README</> for each type of statistics, mentioned in the following
+   sections.
+  </para>

Referring to source tree READMEs seems novel around this portion of the
documentation, but I think not too far away, there are some references.
This is under the VII. Internals chapter anyway, so that might be OK.

In any case, s/README.statsc/README.stats/g

Also, s/additional README/additional READMEs/g  (tags omitted for brevity)

+    used in definitions of database normal forms. When simplified, saying
that
+    <literal>b</> is functionally dependent on <literal>a</> means that

Maybe, s/When simplified/In simple terms/g

+    In normalized databases, only functional dependencies on primary keys
+    and super keys are allowed. In practice however many data sets are not
+    fully normalized, for example thanks to intentional denormalization for
+    performance reasons. The table <literal>t</> is an example of a data
+    with functional dependencies. As <literal>a=b</> for all rows in the
+    table, <literal>a</> is functionally dependent on <literal>b</> and
+    <literal>b</> is functionally dependent on <literal>a</literal>.

"super keys" sounds like a new term.

s/for example thanks to/for example, thanks to/g  (or due to instead of
thanks to)

How about: s/an example of a data with/an example of a schema with/g

Perhaps, s/a=b/a = b/g  (additional white space)
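
To make the definition in the quoted paragraph concrete, here is a minimal Python sketch (an illustration only, not code from the patch). It checks whether one column functionally determines another, i.e. whether each value of the first column always appears with the same value of the second. The table contents are an assumption: 10,000 rows with a = b and 100 distinct values.

```python
def functionally_determines(rows, determinant, dependent):
    """Return True if each `determinant` value maps to one `dependent` value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it
        # on later calls; any mismatch breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Assumed contents of table t: a = b for all rows.
t = [{"a": i % 100, "b": i % 100} for i in range(10000)]

print(functionally_determines(t, "a", "b"))  # b depends on a
print(functionally_determines(t, "b", "a"))  # a depends on b
```

With a = b in every row, the dependency holds in both directions, which is what the paragraph states.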

+    Similarly to per-column statistics, multivariate statistics are stored in

I notice that "similar to" is used more often than "similarly to".  But
that might be OK.

+     This shows that the statistics is defined on table <structname>t</>,

Perhaps: the statistics is -> the statistics are or the statistic is

+     lists <structfield>attnums</structfield> of the columns (references
+     <structname>pg_attribute</structname>).

While this text may be OK on the catalog description page, it might be
better to expand attnums here as "attribute numbers" dropping the
parenthesized phrase altogether.

+<programlisting>
+SELECT pg_mv_stats_dependencies_show(stadeps)
+  FROM pg_mv_statistic WHERE staname = 's1';
+
+ pg_mv_stats_dependencies_show
+-------------------------------
+ (1) => 2, (2) => 1
+(1 row)
+</programlisting>

Couldn't this somehow show actual column names, instead of attribute numbers?

Will read more later.

Thanks,
Amit

Re: [HACKERS] multivariate statistics (v19)

From

Tomas Vondra

Date:

13 December 2016, 00:50:05

Hi Amit,

attached is v21 of the patch series, rebased to current master 
(resolving the duplicate OID and a few trivial merge conflicts), and 
also fixing some of the issues you reported.

On 12/12/2016 12:26 PM, Amit Langote wrote:
>
> Hi Tomas,
>
> On 2016/10/30 4:23, Tomas Vondra wrote:
>> Hi,
>>
>> Attached is v20 of the multivariate statistics patch series, doing mostly
>> the changes outlined in the preceding e-mail from October 11.
>>
>> The patch series currently has these parts:
>>
>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
>> * 0004 : (PATCH) selectivity estimation using functional dependencies
>> * 0005 : (PATCH) multivariate MCV lists
>> * 0006 : (PATCH) multivariate histograms
>> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
>> * 0008 : (WIP) use multiple statistics for estimation
>> * 0009 : (WIP) psql tab completion basics
>
> Unfortunately, this failed to compile because of the duplicate_oids error.
> The partitioning patch consumed the same OIDs as used in this patch.
>

Fixed, should compile fine now (even each patch in the series).

> I will try to read the patches in some more detail, but in the meantime,
> here are some comments/nitpicks on the documentation:
>
> No updates to doc/src/sgml/catalogs.sgml?
>

Good point. I've added a section for the pg_mv_statistic catalog.

> +  <para>
> +   The examples presented in <xref linkend="row-estimation-examples"> used
> +   statistics about individual columns to compute selectivity estimates.
> +   When estimating conditions on multiple columns, the planner assumes
> +   independence and multiplies the selectivities. When the columns are
> +   correlated, the independence assumption is violated, and the estimates
> +   may be seriously off, resulting in poor plan choices.
> +  </para>
>
> The term independence is used in isolation - independence of what?
> Independence of the distributions of values in separate columns?  Also,
> the phrase "seriously off" could perhaps be replaced by more rigorous
> terminology; it might be unclear to some readers.  Perhaps: wildly
> inaccurate, :)
>

I've reworded this to "independence of the conditions" and "off by 
several orders of magnitude". Hope that's better.

> +<programlisting>
> +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
> +                                           QUERY PLAN
> +-------------------------------------------------------------------------------------------------
> + Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
> time=0.031..2.870 rows=100 loops=1)
> +   Filter: (a = 1)
> +   Rows Removed by Filter: 9900
> + Planning time: 0.092 ms
> + Execution time: 3.103 ms
>
> Is there a reason why the examples in "67.2. Multivariate Statistics" (like
> the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
> Examples" (and other relevant chapters) use just EXPLAIN?
>

Yes, the reason is that while 67.1 shows how the optimizer estimates row 
counts and constructs the plan (so EXPLAIN is sufficient), 67.2 
demonstrates how the estimates are inaccurate with respect to the actual 
row counts. Thus the EXPLAIN ANALYZE.
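
The gap between the estimate and the actual row count can be reproduced outside the database. The following is a small Python sketch, assuming a table with 10,000 rows, a = b in every row and 100 distinct values (consistent with the plan quoted above, which returns 100 rows and removes 9900, but the exact table definition is an assumption here).

```python
# Assumed contents of table t: 10,000 rows, a = b, 100 distinct values.
rows = [(i % 100, i % 100) for i in range(10000)]
total = len(rows)

# Per-column selectivities, as single-column statistics would provide them.
sel_a = sum(1 for a, _ in rows if a == 1) / total
sel_b = sum(1 for _, b in rows if b == 1) / total

# Independence assumption: multiply the selectivities.
estimated_rows = total * sel_a * sel_b

# What EXPLAIN ANALYZE would report as the actual row count.
actual_rows = sum(1 for a, b in rows if a == 1 and b == 1)

print(f"estimated: {estimated_rows:.0f}, actual: {actual_rows}")
```

Each condition matches 1% of the rows, so the product is 0.01% (about 1 row), while 100 rows actually match. Only a plan with actual row counts makes that 100x underestimate visible.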

> +   the final 0.01% estimate. The plan however shows that this results in
> +   a significant under-estimate, as the actual number of rows matching the
>
> s/under-estimate/underestimate/g
>
> +  <para>
> +   For additional details about multivariate statistics, see
> +   <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
> +   <literal>README</> for each type of statistics, mentioned in the following
> +   sections.
> +  </para>
>
> Referring to source tree READMEs seems novel around this portion of the
> documentation, but I think not too far away, there are some references.
> This is under the VII. Internals chapter anyway, so that might be OK.
>

I think there's a threshold when the detail becomes too detailed for 
the sgml docs - say, when it discusses some implementation details, at 
which point a README is more appropriate. I don't know if I got it 
entirely right with the docs, though, so perhaps some bits may move in 
either direction.

> In any case, s/README.statsc/README.stats/g
>
> Also, s/additional README/additional READMEs/g  (tags omitted for brevity)
>
> +    used in definitions of database normal forms. When simplified, saying
> that
> +    <literal>b</> is functionally dependent on <literal>a</> means that
>

Fixed.

> Maybe, s/When simplified/In simple terms/g
>
> +    In normalized databases, only functional dependencies on primary keys
> +    and super keys are allowed. In practice however many data sets are not
> +    fully normalized, for example thanks to intentional denormalization for
> +    performance reasons. The table <literal>t</> is an example of a data
> +    with functional dependencies. As <literal>a=b</> for all rows in the
> +    table, <literal>a</> is functionally dependent on <literal>b</> and
> +    <literal>b</> is functionally dependent on <literal>a</literal>.
>
> "super keys" sounds like a new term.
>

Actually no, "super key" is a term defined in normal forms.

> s/for example thanks to/for example, thanks to/g  (or due to instead of
> thanks to)
>
> How about: s/an example of a data with/an example of a schema with/g
>

I think "example of data set" is better. Reworded.

> Perhaps, s/a=b/a = b/g  (additional white space)
>
> +    Similarly to per-column statistics, multivariate statistics are stored in
>
> I notice that "similar to" is used more often than "similarly to".  But
> that might be OK.
>

Not sure.

> +     This shows that the statistics is defined on table <structname>t</>,
>
> Perhaps: the statistics is -> the statistics are or the statistic is
>

As that paragraph is only about functional dependencies, I think 
'statistic is' is more appropriate.

> +     lists <structfield>attnums</structfield> of the columns (references
> +     <structname>pg_attribute</structname>).
>
> While this text may be OK on the catalog description page, it might be
> better to expand attnums here as "attribute numbers" dropping the
> parenthesized phrase altogether.
>

Not sure. I've reworded it like this:

    This shows that the statistic is defined on table <structname>t</>,
    <structfield>attnums</structfield> lists attribute numbers of columns
    (references <structname>pg_attribute</structname>). It also shows

Does that sound better?

> +<programlisting>
> +SELECT pg_mv_stats_dependencies_show(stadeps)
> +  FROM pg_mv_statistic WHERE staname = 's1';
> +
> + pg_mv_stats_dependencies_show
> +-------------------------------
> + (1) => 2, (2) => 1
> +(1 row)
> +</programlisting>
>
> Couldn't this somehow show actual column names, instead of attribute numbers?
>

Yeah, I was thinking about that too. The trouble is that's table-level 
metadata, so we don't have that kind of info serialized within the data 
type (e.g. because it would not handle column renames etc.).

It might be possible to explicitly pass the table OID as a parameter of 
the function, but it seemed a bit ugly to me.
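
As an illustration of what such a variant might do, here is a Python sketch that takes the text form of the dependencies plus an attnum-to-name mapping and substitutes the column names. The function name and the mapping are hypothetical; in the server the mapping would have to come from pg_attribute for the table OID passed in, which is not modeled here.

```python
import re
from typing import Mapping


def dependencies_with_names(deps: str, attnames: Mapping[int, str]) -> str:
    """Replace each attribute number in `deps` with its column name."""
    return re.sub(r"\d+", lambda m: attnames[int(m.group(0))], deps)


# Hypothetical attnum -> column name mapping for table t (columns a and b).
attnames_t = {1: "a", 2: "b"}

shown = dependencies_with_names("(1) => 2, (2) => 1", attnames_t)
print(shown)
```

Because the names are looked up at display time rather than stored in the serialized value, a column rename would not invalidate the stored statistics.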


FWIW, as I wrote in this thread, the place where this patch series needs 
feedback most desperately is integration into the optimizer. Currently 
all the magic happens in clausesel.c and does not leave it. I think it 
would be good to move some of that (particularly the choice of 
statistics to apply) to an earlier stage, and store the information 
within the plan tree itself, so that it's available outside clausesel.c 
(e.g. for EXPLAIN - showing which stats were picked seems useful).

I was thinking it might work similarly to the foreign key estimation 
patch (100340e2). It might even be more efficient, as the current code 
may end up repeating the selection of statistics multiple times. But 
enriching the plan tree turned out to be way more invasive than I'm 
comfortable with (but maybe that'd be OK).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

multivariate-stats-v21.tgz

Re: [HACKERS] multivariate statistics (v19)

From

Petr Jelinek

Date:

30 December 2016, 16:05:18

On 12/12/16 22:50, Tomas Vondra wrote:
>> +<programlisting>
>> +SELECT pg_mv_stats_dependencies_show(stadeps)
>> +  FROM pg_mv_statistic WHERE staname = 's1';
>> +
>> + pg_mv_stats_dependencies_show
>> +-------------------------------
>> + (1) => 2, (2) => 1
>> +(1 row)
>> +</programlisting>
>>
>> Couldn't this somehow show actual column names, instead of attribute
>> numbers?
>>
> 
> Yeah, I was thinking about that too. The trouble is that's table-level
> metadata, so we don't have that kind of info serialized within the data
> type (e.g. because it would not handle column renames etc.).
> 
> It might be possible to explicitly pass the table OID as a parameter of
> the function, but it seemed a bit ugly to me.

I think it makes sense to have such a function. This is not the type's
output function, so I think it's OK for it to take the OID as input,
especially since in the use case shown above you can easily use starelid.

> 
> FWIW, as I wrote in this thread, the place where this patch series needs
> feedback most desperately is integration into the optimizer. Currently
> all the magic happens in clausesel.c and does not leave it. I think it
> would be good to move some of that (particularly the choice of
> statistics to apply) to an earlier stage, and store the information
> within the plan tree itself, so that it's available outside clausesel.c
> (e.g. for EXPLAIN - showing which stats were picked seems useful).
> 
> I was thinking it might work similarly to the foreign key estimation
> patch (100340e2). It might even be more efficient, as the current code
> may end up repeating the selection of statistics multiple times. But
> enriching the plan tree turned out to be way more invasive than I'm
> comfortable with (but maybe that'd be OK).
>

In theory it seems like a possibly reasonable approach to me, mainly
because mv statistics are user-defined objects. I guess we'd have to see
at least some PoC to see how invasive it is. But I ultimately think that
feedback from a committer who is more familiar with the planner is needed here.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v19)

From

Petr Jelinek

Date:

30 December 2016, 16:12:25

On 12/12/16 22:50, Tomas Vondra wrote:
> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>
>> Hi Tomas,
>>
>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Attached is v20 of the multivariate statistics patch series, doing
>>> mostly
>>> the changes outlined in the preceding e-mail from October 11.
>>>
>>> The patch series currently has these parts:
>>>
>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients

Hi,

I went over these two (IMHO those could easily be considered a minimal
committable set even if the user-visible functionality they provide is
rather limited).

> dropping statistics
> -------------------
> 
> The statistics may be dropped automatically using DROP STATISTICS.
> 
> After ALTER TABLE ... DROP COLUMN, statistics referencing are:
> 
>   (a) dropped, if the statistics would reference only one column
> 
>   (b) retained, but modified on the next ANALYZE

This should be documented in user visible form if you plan to keep it
(it does make sense to me).
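
The rule quoted above could be summarized as follows. This is a Python sketch of the described behaviour only, with a made-up function name; it is not how the patch implements it.

```python
from typing import List, Optional


def stats_after_drop_column(stat_columns: List[str],
                            dropped: str) -> Optional[List[str]]:
    """Return the columns the statistics keep, or None if they are dropped."""
    remaining = [col for col in stat_columns if col != dropped]
    if len(remaining) < 2:
        # (a) the statistics would reference only one column: drop them
        return None
    # (b) retained; the contents are rebuilt on the next ANALYZE
    return remaining


print(stats_after_drop_column(["a", "b"], "b"))
print(stats_after_drop_column(["a", "b", "c"], "c"))
```

The first call returns None (case a), the second returns the two remaining columns (case b).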

> +   therefore perfectly correlated. Providing additional information about
> +   correlation between columns is the purpose of multivariate statistics,
> +   and the rest of this section thoroughly explains how the planner
> +   leverages them to improve estimates.
> +  </para>
> +
> +  <para>
> +   For additional details about multivariate statistics, see
> +   <filename>src/backend/utils/mvstats/README.stats</>. There are additional
> +   <literal>READMEs</> for each type of statistics, mentioned in the following
> +   sections.
> +  </para>
> +
> + </sect1>

I don't think this qualifies as "thoroughly explains" ;)

> +
> +Oid
> +get_statistics_oid(List *names, bool missing_ok)

No comment?

> +        case OBJECT_STATISTICS:
> +            msg = gettext_noop("statistics \"%s\" does not exist, skipping");
> +            name = NameListToString(objname);
> +            break;

This sounds somewhat weird (plural vs singular).

> + * XXX Maybe this should check for duplicate stats. Although it's not clear
> + * what "duplicate" would mean here (wheter to compare only keys or also
> + * options). Moreover, we don't do such checks for indexes, although those
> + * store tuples and recreating a new index may be a way to fix bloat (which
> + * is a problem statistics don't have).
> + */
> +ObjectAddress
> +CreateStatistics(CreateStatsStmt *stmt)

I don't think we should check duplicates TBH so I would remove the XXX
(also "wheter" is a typo, but if you remove that paragraph it does not matter).

> +    if (true)
> +    {

Huh?

> +
> +List *
> +RelationGetMVStatList(Relation relation)
> +{
...
> +
> +void
> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
> +                int2vector *attrs, VacAttrStats **stats)
...
> +static double
> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
> +                   int2vector *attrs, VacAttrStats **stats,
> +                   int k, int *combination)
> +{


Again, these deserve comment.
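
For readers without the patch at hand, the general idea behind counting distinct values per column combination can be sketched as below. This is a naive Python illustration that simply counts distinct tuples in a sample for every combination of two or more columns. It is an assumption about the intent, not a transcription of ndistinct_for_combination; in particular, the real function also receives totalrows, and any scaling from the sample to the whole table is not modeled here.

```python
from itertools import combinations


def count_distinct(rows, columns):
    """Count distinct value tuples over the given column positions."""
    return len({tuple(row[c] for c in columns) for row in rows})


def ndistinct_by_combination(rows, columns):
    """Map every combination of two or more columns to its distinct count."""
    result = {}
    for k in range(2, len(columns) + 1):
        for combo in combinations(columns, k):
            result[combo] = count_distinct(rows, combo)
    return result


# Made-up sample: column 1 is fully determined by column 0.
sample = [(i % 10, i % 5, i % 3) for i in range(300)]
counts = ndistinct_by_combination(sample, (0, 1, 2))

for combo, n in counts.items():
    print(combo, n)
```

In this sample, columns 0 and 1 together have only 10 distinct combinations rather than the 50 that multiplying the per-column counts would suggest, which is the kind of correlation these coefficients are meant to capture.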

I'll try to look at other patches in the series as time permits.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v19)

From

Dilip Kumar

Date:

03 January 2017, 16:42:04

On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> attached is v21 of the patch series, rebased to current master (resolving
> the duplicate OID and a few trivial merge conflicts), and also fixing some
> of the issues you reported.

I wanted to test the grouping estimation behaviour with TPC-H. While
testing I found a crash, so I thought I'd report it.

My setup detail:
TPCH scale factor : 5
Applied all the patches from the v21 series and ran the queries below.

postgres=# analyze part;
ANALYZE
postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I think it should be easily reproducible; if it's not, I can send a
call stack or core dump.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] multivariate statistics (v19)

From

Tomas Vondra

Date:

03 January 2017, 19:22:48

On 01/03/2017 02:42 PM, Dilip Kumar wrote:
> On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> attached is v21 of the patch series, rebased to current master (resolving
>> the duplicate OID and a few trivial merge conflicts), and also fixing some
>> of the issues you reported.
>
> I wanted to test the grouping estimation behaviour with TPCH. While
> testing I hit a crash, so I am reporting it.
>
> My setup detail:
> TPCH scale factor : 5
> Applied all the patches from the v21 series and ran the queries below.
>
> postgres=# analyze part;
> ANALYZE
> postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
> p_size) from part;
> CREATE STATISTICS
> postgres=# analyze part;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> I think it should be easily reproducible; if it's not, I can send a
> call stack or core dump.
>

Thanks for the report. It was trivial to reproduce and it turned out to 
be a fairly simple bug. Will send a new version of the patch soon.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] multivariate statistics (v19)

From

Tomas Vondra

Date:

04 January 2017, 00:55:01

On 12/30/2016 02:12 PM, Petr Jelinek wrote:
> On 12/12/16 22:50, Tomas Vondra wrote:
>> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>>
>>> Hi Tomas,
>>>
>>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>>> Hi,
>>>>
>>>> Attached is v20 of the multivariate statistics patch series, doing
>>>> mostly
>>>> the changes outlined in the preceding e-mail from October 11.
>>>>
>>>> The patch series currently has these parts:
>>>>
>>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>
> Hi,
>
> I went over these two (IMHO those could easily be considered as minimal
> committable set even if the user visible functionality they provide is
> rather limited).
>

Yes, although I still have my doubts 0001 is the right way to make 
pull_varnos work. It's probably related to the bigger design question, 
because moving the statistics selection to an earlier phase could make 
it unnecessary I guess.

>> dropping statistics
>> -------------------
>>
>> The statistics may be dropped automatically using DROP STATISTICS.
>>
>> After ALTER TABLE ... DROP COLUMN, statistics referencing are:
>>
>>   (a) dropped, if the statistics would reference only one column
>>
>>   (b) retained, but modified on the next ANALYZE
>
> This should be documented in user visible form if you plan to keep it
> (it does make sense to me).
>

Yes, I plan to keep it. I agree it should be documented, probably on the 
ALTER TABLE page (and linked from CREATE/DROP statistics pages).

>> +   therefore perfectly correlated. Providing additional information about
>> +   correlation between columns is the purpose of multivariate statistics,
>> +   and the rest of this section thoroughly explains how the planner
>> +   leverages them to improve estimates.
>> +  </para>
>> +
>> +  <para>
>> +   For additional details about multivariate statistics, see
>> +   <filename>src/backend/utils/mvstats/README.stats</>. There are additional
>> +   <literal>READMEs</> for each type of statistics, mentioned in the following
>> +   sections.
>> +  </para>
>> +
>> + </sect1>
>
> I don't think this qualifies as "thoroughly explains" ;)
>

OK, I'll drop the "thoroughly" ;-)

>> +
>> +Oid
>> +get_statistics_oid(List *names, bool missing_ok)
>
> No comment?
>
>> +        case OBJECT_STATISTICS:
>> +            msg = gettext_noop("statistics \"%s\" does not exist, skipping");
>> +            name = NameListToString(objname);
>> +            break;
>
> This sounds somewhat weird (plural vs singular).
>

Ah, right - it should be either "statistic ... does not" or "statistics 
... do not". I think "statistics" is the right choice here, because (a) 
we have CREATE STATISTICS and (b) it may be a combination of statistics, 
e.g. histogram + MCV.

>> + * XXX Maybe this should check for duplicate stats. Although it's not clear
>> + * what "duplicate" would mean here (wheter to compare only keys or also
>> + * options). Moreover, we don't do such checks for indexes, although those
>> + * store tuples and recreating a new index may be a way to fix bloat (which
>> + * is a problem statistics don't have).
>> + */
>> +ObjectAddress
>> +CreateStatistics(CreateStatsStmt *stmt)
>
> I don't think we should check duplicates TBH so I would remove the XXX
> (also "wheter" is typo but if you remove that paragraph it does not matter).
>

Yes, I came to the same conclusion - we can only really check for exact 
matches (same set of columns, same choice of statistic types), but 
that's fairly useless. I'll remove the XXX.

>> +    if (true)
>> +    {
>
> Huh?
>

Yeah, that's a bit of a weird pattern. It's a remnant of copy-pasting the 
preceding block, which looks like this

    if (hasindex)
    {
        ...
    }

But we've decided not to add a similar flag for the statistics. I'll move 
the block to a separate function (instead of merging it directly into 
the function, which is already a bit largeish).

>> +
>> +List *
>> +RelationGetMVStatList(Relation relation)
>> +{
> ...
>> +
>> +void
>> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
>> +                int2vector *attrs, VacAttrStats **stats)
> ...
>> +static double
>> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
>> +                   int2vector *attrs, VacAttrStats **stats,
>> +                   int k, int *combination)
>> +{
>
>
> Again, these deserve comment.
>

OK, will add.

> I'll try to look at other patches in the series as time permits.

thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] multivariate statistics (v19)

From

Tomas Vondra

Date:

04 January 2017, 05:35:15

On 01/03/2017 05:22 PM, Tomas Vondra wrote:
> On 01/03/2017 02:42 PM, Dilip Kumar wrote:
...
>> I think it should be easily reproducible, in case it's not I can send
>> call stack or core dump.
>>
>
> Thanks for the report. It was trivial to reproduce and it turned out to
> be a fairly simple bug. Will send a new version of the patch soon.
>

Attached is v22 of the patch series, rebased to current master and 
fixing the reported bug. I haven't made any other changes - the issues 
reported by Petr are mostly minor, so I've decided to wait a bit more 
for (hopefully) other reviews.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hi everyone,

thanks for the reviews! Attached is v23 of the patch series, addressing 
most of the points raised in the reviews.

A quick summary of the changes (I'll respond to the other threads for 
points that deserve a bit more detailed discussion):

0) Rebase to current master. The main culprit was the pesky logical 
replication patch committed a week ago, because SUBSCRIPTION and 
STATISTICS are right next to each other in gram.y, various switches etc.

1) Many typos, mentioned by all the reviewers.

2) I've added a short explanation (in alter_table.sgml) of how ALTER 
TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those 
are only dropped if there would be a single remaining column.

3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to 
make Petr happy ;-)

4) Added missing comments to get_statistics_oid, RelationGetMVStatList, 
update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was 
not used outside common.c, so I've made it static and removed the 
prototype from mvstats.h.

5) I've changed 'statistics does not exist' to 'statistics do not exist' 
in a number of places.

6) Removed XXX about checking for duplicates in CreateStatistics. I 
agree with Petr that we shouldn't do such checks, as we're not doing 
that for other objects (e.g. indexes).

7) I've moved the code loading statistics from get_relation_info 
into a new function get_relation_statistics, to get rid of the

   if (true)
   {
    ...
   }

block, which was there due to mimicking how index details are loaded 
without having a hasindex-like flag. I like this better than merging the 
block into get_relation_info directly.

8) I've changed 'a statistics' to 'multivariate statistics' in a few 
places in the sgml docs, to make it clear it's not referring to the 
'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by 
Ideriha Takeshi).

9) I've changed the link in README.dependencies to 
https://en.wikipedia.org/wiki/Functional_dependency as proposed by 
Ideriha Takeshi. I'm pretty sure the wiki page about database 
normalization, referenced by the original link, included a nice 
functional dependency example some time ago, but it seems to have 
changed and the new link is better.

But perhaps it's not a good idea to link to wikipedia, as the pages 
clearly change quite significantly?

10) The CREATE STATISTICS now reports a nice 'already exists' message, 
instead of the 'duplicate key', pointed out by Dilip.

11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for 
the array, just like the other structs.

On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote:
> dependencies.c:
>
>  dependency_degree():
>
>   - The k is assumed larger than 1. I think assertion is required.
>
>   - "/* end of the preceding group */" seems to be better if it
>     is just after the "if (multi_sort.." currently just after it.
>
>   - The following comment seems mis-edited.
>     > * If there is a single are no contradicting rows, count the group
>     > * as supporting, otherwise contradicting.
>
>     maybe this would be like the following? The variable counting
>     the first "contradiction" is named "n_violations". This seems
>     somewhat confusing.
>
>     > * If there are no violating rows up to here, count the group
>     > * as supporting, otherwise contradicting.
>
>    - "/* first columns match, but the last one does not"
>      else if (multi_sort_compare_dims((k - 1), (k - 1), ...
>
>      The above comparison should use multi_sort_compare_dim, not
>      dims
>
>    - This function counts "n_contradicting_rows" but it is not
>      referenced. Anyway n_contradicting_rows = numrows -
>      n_supporting_rows so it and n_contradicting seem
>      unnecessary.
>

Yes, absolutely. This was clearly an unnecessary remnant of the original 
implementation, and I failed to clean it up after adopting Dean's idea 
of continuous dependency degree.

I've also reworked the method a bit, moving handling of the last group 
into the main loop (instead of doing that separately right after the 
loop, which I think was a bit ugly anyway). Can you check if you're 
happy with the code & comments now?
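To make the discussion above easier to follow, here is a minimal, self-contained sketch of how a dependency degree for a two-column dependency (a => b) could be computed over a sample. It is not the patch's code: the DemoRow type and the demo_* names are invented for illustration, and it assumes the degree is simply the fraction of sampled rows that fall into groups (same value of a) containing no contradicting value of b. The last group is closed inside the main loop, mirroring the rework described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sample row with two attributes; not a PostgreSQL type. */
typedef struct DemoRow
{
    int a;
    int b;
} DemoRow;

/* Order rows by (a, b) so that rows sharing "a" are adjacent. */
static int
demo_row_cmp(const void *x, const void *y)
{
    const DemoRow *l = x;
    const DemoRow *r = y;

    if (l->a != r->a)
        return (l->a < r->a) ? -1 : 1;
    if (l->b != r->b)
        return (l->b < r->b) ? -1 : 1;
    return 0;
}

/*
 * Fraction of rows that sit in groups (same "a") with a single "b" value.
 * Sorts the input array in place.
 */
static double
demo_dependency_degree(DemoRow *rows, int nrows)
{
    int i;
    int group_size = 1;
    int n_violations = 0;
    int n_supporting_rows = 0;

    assert(nrows > 0);
    qsort(rows, (size_t) nrows, sizeof(DemoRow), demo_row_cmp);

    for (i = 1; i <= nrows; i++)
    {
        /* end of the preceding group; the last group closes at i == nrows */
        if (i == nrows || rows[i].a != rows[i - 1].a)
        {
            /* no violating rows in the group: count it as supporting */
            if (n_violations == 0)
                n_supporting_rows += group_size;

            group_size = 1;
            n_violations = 0;
            continue;
        }

        /* same "a" as the previous row; a different "b" contradicts a => b */
        if (rows[i].b != rows[i - 1].b)
            n_violations++;

        group_size++;
    }

    return (double) n_supporting_rows / nrows;
}
```

With this formulation n_contradicting_rows never needs to be tracked, since it is just nrows minus n_supporting_rows.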

>
>  mvstats.h:
>
>    - struct MVDependencyData/ MVDependenciesData
>
>      The variable length member at the end of the structs should
>      be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.
>

Yes, fixed. The other structures already used that macro, but I failed 
to notice MVDependencyData/ MVDependenciesData need that fix too.
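For readers unfamiliar with the convention, the following sketch shows the general shape of a struct ending in a flexible array member and how it is typically sized with offsetof(). The DemoDependency type, the DEMO_FLEXIBLE_ARRAY_MEMBER macro and the use of malloc() are assumptions made so the example compiles on its own; the patch itself uses PostgreSQL's FLEXIBLE_ARRAY_MEMBER macro and palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's FLEXIBLE_ARRAY_MEMBER; empty in C99. */
#define DEMO_FLEXIBLE_ARRAY_MEMBER

/* Hypothetical struct: a header followed by a variable number of attnums. */
typedef struct DemoDependency
{
    uint32_t nattributes;   /* number of entries in attributes[] */
    int16_t  attributes[DEMO_FLEXIBLE_ARRAY_MEMBER];
} DemoDependency;

/* Allocate a DemoDependency with room for "nattributes" entries. */
static DemoDependency *
demo_dependency_alloc(uint32_t nattributes)
{
    size_t          size;
    DemoDependency *dep;

    /* header size up to the array, plus the array payload itself */
    size = offsetof(DemoDependency, attributes) +
        (size_t) nattributes * sizeof(int16_t);

    dep = malloc(size);
    if (dep != NULL)
        dep->nattributes = nattributes;

    return dep;
}
```

Sizing with offsetof() rather than sizeof() avoids counting any trailing padding or placeholder element twice.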

>
>    - I'm not sure how much it impacts performance, but some
>      struct members seem to have types that are a bit too wide. For
>      example, MVDependenciesData.type is of int32 but it can have
>      only '1' for now and it won't be two-digits. Also ndeps
>      cannot be so large.
>

I doubt the impact on performance is measurable, particularly for the 
global fields (e.g. nbuckets is tiny compared to the space needed for 
the buckets themselves).

But I think you're right we shouldn't use fields wider than actually 
needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would 
be just fine). It's not just a matter of performance, but also a way to 
document expected values etc.

I'll go through the fields and use smaller data types where appropriate.

>
> general:
>   This patch uses int16 as the type of attribute number but it
>   might be better to use AttrNumber for the purpose.
>   (Specifically it seems defined as the type for an attribute
>    index but also used as the variable for number of attributes)
>

Agreed. Will check with the struct members.

>
> Sorry for the random comment in advance. I'll learn this further.
>

Thanks for the review!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


OK,

attached is v24 of the patch series, addressing most of the reported 
issues and comments (at least I believe so). The main changes are:

1) I've mostly abandoned the "multivariate" name in favor of "extended", 
particularly in places referring to stats stored in the pg_statistic_ext 
in general. "Multivariate" is now used only in places talking about 
particular types (e.g. multivariate histograms).

The "extended" name is more widely used for this type of statistics, and 
the assumption is that we'll also add other (non-multivariate) types of 
statistics - e.g. statistics on custom expressions, or some form of join 
statistics.

2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and 
pg_mv_stats view renamed to pg_stats_ext).

3) The structure of pg_statistic_ext was changed as proposed by Alvaro, 
i.e. the boolean flags were removed and instead we have just a single 
"char[]" column with list of enabled statistics.

4) I also got rid of the "mv" part in most variable/function/constant 
names, replacing it by "ext" or something similar. Also mvstats.h got 
renamed to stats.h.

5) Moved the files from src/backend/utils/mvstats to backend/statistics.

6) Fixed the n_choose_k() overflow issues by using the algorithm 
proposed by Dean. Also, use the simple formula for num_combinations().
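The exact algorithm proposed by Dean is not reproduced in this message. As an illustration of the general idea only, here is one common way to compute a binomial coefficient without building large factorials: multiply and divide incrementally, so every intermediate value is itself a binomial coefficient. The function name and the use of uint64_t are assumptions for this sketch, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute C(n, k) incrementally. After step i the running value equals
 * C(n - k + i, i), so each division is exact and intermediate values stay
 * within a factor of k of the final result.
 */
static uint64_t
n_choose_k_sketch(unsigned n, unsigned k)
{
    uint64_t result = 1;
    unsigned i;

    assert(k <= n);

    /* C(n, k) == C(n, n - k); use the smaller k to shorten the loop */
    if (k > n - k)
        k = n - k;

    for (i = 1; i <= k; i++)
        result = result * (n - k + i) / i;

    return result;
}
```

For the small dimension counts relevant to extended statistics this stays far from overflow, whereas a naive n! / (k! * (n - k)!) overflows a 32-bit integer already at n = 13.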

7) I've tweaked data types for a few struct members (in stats.h). I've 
kept most of the uint32 fields at the top level though, because int16 
might not be large enough for large statistics and the overhead is 
minimal (compared to the space needed e.g. for histogram buckets).


The renames/changes were quite widespread, but I've done my best to fix 
all the comments and various other places.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 03/02/2017 03:52 PM, Tomas Vondra wrote:
> On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:
>> Hello,
>>
>> At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote in
>> <a78ffb17-70e8-a55a-c10c-66ab575e88ed@2ndquadrant.com>
>>> OK,
>>>
>>> attached is v24 of the patch series, addressing most of the reported
>>> issues and comments (at least I believe so). The main changes are:
>>
>> Unfortunately, 0002 conflicts with the current master
>> (4461a9b). Could you rebase them or tell us which commit these
>> patches are based on?
>>
>
> Attached is a rebased patch series, otherwise it's the same as v24.
>

This time with the attachments ....

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] multivariate statistics (v24)

From

Robert Haas

Date:

04 March 2017, 10:03:32

On Thu, Mar 2, 2017 at 8:35 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> attached is v24 of the patch series, addressing most of the reported issues
> and comments (at least I believe so). The main changes are:
>
> 1) I've mostly abandoned the "multivariate" name in favor of "extended",
> particularly in places referring to stats stored in the pg_statistic_ext in
> general. "Multivariate" is now used only in places talking about particular
> types (e.g. multivariate histograms).
>
> The "extended" name is more widely used for this type of statistics, and the
> assumption is that we'll also add other (non-multivariate) types of
> statistics - e.g. statistics on custom expressions, or some form of join
> statistics.

Oh, I like that.  I found it hard to wrap my head around what
"multivariate" was supposed to mean, exactly.  I think "extended" will
be clearer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] multivariate statistics (v25)

From

David Rowley

Date:

13 March 2017, 13:00:57

On 3 March 2017 at 03:53, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> This time with the attachments ....

It's been a long while since I looked at this patch, but I'm now taking another look.

I've made a list of stuff I've found from making my first pass on 0001 and 0002. Some of the stuff may seem a little pedantic, so apologies about those ones. I merely SET nit_picking_threshold TO 0; and reviewed.

Here goes:

0001:

+ RestrictInfo *rinfo = (RestrictInfo*)node;

and

+ RestrictInfo *rinfo = (RestrictInfo *)node;
+ return expression_tree_walker((Node*)rinfo->clause,
+ pull_varattnos_walker,
+ (void*) context);

spacing incorrect. Please space after type name in casts and after the closing parenthesis.

0002:

+ dropped as well. Multivariate statistics referencing the column will
+ be dropped only if there would remain a single non-dropped column.

I was initially confused by this. I think it should be worded as:

"Multivariate statistics referencing the dropped column will also be removed if the removal of the column would cause the statistics to contain data for only a single column"

I had been confused as I'd been thinking of dropping multiple columns at once with the same command, and only 1 column remained in the table. So I think it's best to clarify you mean the statistic here.

+ OCLASS_STATISTICS /* pg_statistics_ext */

I wonder if this should be named: OCLASS_STATISTICEXT. The comment is also incorrect and should read "pg_statistic_ext" (without 's')

I tried to perform a test in this area and received an error:

postgres=# create table ab1 (a int, b int);
CREATE TABLE
postgres=# create statistics ab1_a_b_stats on (a,b) from ab1;
CREATE STATISTICS
postgres=# alter table ab1 drop column a;
ALTER TABLE
postgres=# drop table ab1;
ERROR: cache lookup failed for statistics 16399

+ When estimating conditions on multiple columns, the planner assumes
+ independence of the conditions and multiplies the selectivities. When the
+ columns are correlated, the independence assumption is violated, and the
+ estimates may be off by several orders of magnitude, resulting in poor
+ plan choices.

I don't think the assumption is violated. We still assume that they're independent, which is incorrect. Nothing gets violated.
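As a concrete illustration of the arithmetic being described, the sketch below multiplies per-column selectivities under the independence assumption. The helper name is hypothetical and this is not planner code; the numbers in the usage note come from the example at the start of this thread (1,000,000 rows, three perfectly correlated columns, each condition with selectivity 1/100).

```c
#include <assert.h>

/*
 * Estimate the row count for "c1 = x AND c2 = y AND ..." by assuming the
 * conditions are independent, i.e. by multiplying their selectivities.
 * Hypothetical helper for illustration only.
 */
static double
independent_rows_estimate(double total_rows, const double *selectivities, int n)
{
    double sel = 1.0;
    int    i;

    assert(n >= 0);

    for (i = 0; i < n; i++)
        sel *= selectivities[i];

    return total_rows * sel;
}
```

Called with 1,000,000 rows and three selectivities of 0.01, this yields an estimate of about 1 row, while the query in the opening example actually returns 10,000 rows, because the three conditions select exactly the same rows.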

Perhaps it would be more accurate to write:

"When estimating the selectivity of conditions over multiple columns, the planner normally assumes each condition is independent of other conditions, and simply multiplies the selectivity estimates of each condition together to produce a final selectivity estimation for all conditions. This method can often lead to inaccurate row estimations when the conditions have dependencies on one another. Such misestimations can result in poor plan choices being made."

+ using <command>CREATE STATISTICS</> command.

using the ...

+ As explained in <xref linkend="planner-stats">, the planner can determine
+ cardinality of <structname>t</structname> using the number of pages and
+ rows is looked up in <structname>pg_class</structname>:

perhaps "rows is" should become "rows as" or "rows which are".

+ * delete multi-variate statistics
+ */
+ RemoveStatisticsExt(relid, 0);

I think it should be "delete extended statistics"

Should this not be rejected?

postgres=# create view v1 as select 1 a, 2 b;
CREATE VIEW
postgres=# create statistics v1_a_stats on (a,b) from v1;
CREATE STATISTICS

and this?

postgres=# create sequence test_seq;
CREATE SEQUENCE
postgres=# select * from test_seq;
 last_value | log_cnt | is_called
------------+---------+-----------
          1 |       0 | f
(1 row)

postgres=# create statistics test_seq_stats on (last_value,log_cnt) from test_seq;
CREATE STATISTICS

The patch does claim:

+ /* extended stats are supported on tables and matviews */

So I guess it should be disallowed.

+ /* OBJECT_STATISTICS */
+ {
+ "statistics", OBJECT_STATISTICS

Maybe this should be changed to OBJECT_STATISTICEXT. Doing it this way would close the door a bit on pg_depend records existing for pg_statistic.

A quick test shows a problem here:

postgres=# create table ab (a int, b int);
CREATE TABLE
postgres=# create statistics ab_a_b_stats on (a,b) from ab;
CREATE STATISTICS
postgres=# create statistics ab_a_b_stats1 on (a,b) from ab;
CREATE STATISTICS
postgres=# alter statistics ab_a_b_stats1 rename to ab_a_b_stats;
ERROR: unsupported object class 3381

+/*****************************************************************************
+ *
+ * QUERY :
+ * CREATE STATISTICS stats_name ON relname (columns) WITH (options)
+ *
+ *****************************************************************************/

Old Syntax?

+ $$ = (Node *)n;

Incorrect spacing.

+ * The returned list is guaranteed to be sorted in order by OID, although
+ * this is not currently needed.

hmm, whats the tie-breaker going to be for:

CREATE TABLE abc (a int, b int, c int);
create statistics abc_ab_stats (a,b) from abc;
create statistics abc_bc_stats (b,c) from abc;
select * from abc where a=1 and b=1 and c=1;

I've not gotten to that part of the code yet, but reading the comment made me wonder how you're handling this. I think predictable is a good way, so that would require some ordering on this list... I presume.

+ * happen if the statistics has fewer attributes than we have Vars.

"statistics" is plural, so "has" should be "have"

although I see you mix the plurals up a few lines later and write in singular form.

+ /* check that all Vars are covered by the statistic */

This one is more of a question:

+ bool found;
+ double ndist = find_ndistinct(root, rel, varinfos, &found);

would it be better to return the bool and pass the &ndist here? That way you could simply write:

if (!find_ndistinct(root, rel, varinfos, &reldistinct))
    clamp *= 0.1;
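The calling convention suggested here (return a bool, pass the result through an output parameter) can be sketched in isolation as follows. Everything in this block is hypothetical: demo_find_ndistinct() fakes the lookup with a small table instead of planner structures, and the fallback in demo_estimate_groups() is only an assumption about what a caller might do when no statistic matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for find_ndistinct(): returns true and stores the
 * estimate in *ndistinct when a matching entry exists, false otherwise.
 */
static bool
demo_find_ndistinct(const double *table, int ntable, int key, double *ndistinct)
{
    assert(ndistinct != NULL);

    if (key < 0 || key >= ntable || table[key] <= 0.0)
        return false;

    *ndistinct = table[key];
    return true;
}

/* Caller: a single condition both tests for a match and receives the value. */
static double
demo_estimate_groups(const double *table, int ntable, int key, double clamp)
{
    double reldistinct;

    if (!demo_find_ndistinct(table, ntable, key, &reldistinct))
        return clamp * 0.1;     /* assumed fallback when nothing matches */

    return (reldistinct < clamp) ? reldistinct : clamp;
}
```

Compared with returning the double and passing a bool pointer, this form removes the separate "found" variable at the call site.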

@@ -3450,6 +3467,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
clamp = rel->tuples;
}

Adds a new line by mistake.

+ /*
+ * Only ndistinct stats covering all Vars are acceptable, which can't
+ * happen if the statistics has fewer attributes than we have Vars.
+ */
+ if (bms_num_members(attnums) > info->stakeys->dim1)
+ continue;

bms_num_members() done inside loop. Would you say it's OK to assume the compiler will do that before the loop?, or do you think it's best to set it before looping? We already know we're going to loop at least once, since we'd have short circuited at the start of the function otherwise.

+ k = -1;
+ while ((k = bms_next_member(attnums, k)) >= 0)
+ {
+ bool attr_found = false;
+ for (i = 0; i < info->stakeys->dim1; i++)
+ {
+ if (info->stakeys->values[i] == k)
+ {
+ attr_found = true;
+ break;
+ }
+ /* found attribute not covered by this ndistinct stats, skip */
+ if (!attr_found)
+ {
+ matches = false;
+ break;
+ }

Would it be better just to stuff info->stakeys->values into a bitmapset and check its a subset of attnums? It would mean allocating memory in the loop, so maybe you think otherwise, but in that case maybe StatisticExtInfo should store the bitmapset?
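The set-based check proposed here can be sketched with plain bitmasks. This block deliberately does not use PostgreSQL's Bitmapset (where bms_is_subset() would be the natural call); it assumes attribute numbers in the range 0..63 so that a uint64_t can hold the set, and all names are invented. The check is written as "every wanted attribute is covered by the statistic's keys", which is the condition the nested loop in the patch is testing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a bitmask with one bit per attribute number (assumes 0 <= attnum < 64). */
static uint64_t
demo_attnums_to_mask(const int *attnums, int n)
{
    uint64_t mask = 0;
    int      i;

    for (i = 0; i < n; i++)
    {
        assert(attnums[i] >= 0 && attnums[i] < 64);
        mask |= (UINT64_C(1) << attnums[i]);
    }

    return mask;
}

/* True if every attribute in "wanted" is also present in "keys". */
static bool
demo_mask_is_subset(uint64_t wanted, uint64_t keys)
{
    return (wanted & ~keys) == 0;
}
```

If the mask for the statistic's keys were stored once alongside the statistic, the per-statistic test inside the loop would reduce to a single AND and comparison.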

+ if (! matches)
+ continue;

extra whitespace after !

+ /* not the right item (different number of attributes) */
+ if (item->nattrs != bms_num_members(attnums))
+ continue;

again using bms_num_members() inside a loop when its known before the loop.

+ Assert(!(*found));

This confused me for a minute as I mistakenly read this as Assert((*found)); can you comment this to say something along the lines of the fact that we should have returned already if we found a match.

+ appendPQExpBuffer(&buf, "(dependencies)");

I think it's better practice to use appendPQExpBufferStr() when there's no formatting. It'll perform marginally better, which might not be important here, but it sets a better example for people to follow when performance is more critical.

+ List *keys; /* String nodes naming referenced column(s) */

column(s) should read columns. 's' is not optional.

+ bool rd_statvalid; /* state of rd_statlist: true/false */

so bool can only be true or false. Good to know ;-) the comment is probably useless, can you improve?

+ change the definition of a extended statistics

"a" should be "an". Also, is "statistics" plural here? It's commonly mixed up in the patch. I think it needs to be standardised. I personally think if you're speaking of a single pg_statistic_ext row, then it should be singular. Yet, I'm aware you're using plural for the CREATE STATISTICS command; to me that feels a bit like: CREATE TABLES mytable (); am I somehow thinking wrongly here?

+ The name (optionally schema-qualified) of a statistics to be altered.

"a" should be "the"

+ If a schema name is given (for example, <literal>CREATE STATISTICS
+ myschema.mystat ...</>) then the statistics is created in the specified
+ schema. Otherwise it is created in the current schema. The name of

What's created in the current schema? I thought this was just for naming?

+ <para>
+ To be able to create a table, you must have <literal>USAGE</literal>
+ privilege on all column types or the type in the <literal>OF</literal>
+ clause, respectively.
+ </para>

"create a table" ? create an extended statistic ?

+ <title>Examples</title>
+ <para>
+ ...
+ </para>

Why are the examples missing? I've not looked beyond patch 0002 yet, but I'd have assumed 0002 should be commitable without requiring later patches to make it correct.

+ * statscmds.c
+ * Commands for creating and altering extended statistics
+ *

2017.

+ * statistics might work with equality only.

extra space

+ /* costruction of array of enabled statistic */

construction?

+ atttuple = SearchSysCacheAttName(relid, attname);
+ if (!HeapTupleIsValid(atttuple))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("column \"%s\" referenced in statistics does not exist",
+ attname)));
+ /* more than STATS_MAX_DIMENSIONS columns not allowed */
+ if (numcols >= STATS_MAX_DIMENSIONS)
+ ereport(ERROR,
+ (errcode(ERRCODE_TOO_MANY_COLUMNS),
+ errmsg("cannot have more than %d keys in statistics",
+ STATS_MAX_DIMENSIONS)));
+ attnums[numcols] = ((Form_pg_attribute) GETSTRUCT(atttuple))->attnum;
+ ReleaseSysCache(atttuple);

Looks like a syscache leak. No?

+ /*
+ * Delete the pg_proc tuple.
+ */
+ relation = heap_open(StatisticExtRelationId, RowExclusiveLock);

pg_proc?

+ * pg_statistic_ext.h
+ * definition of the system "extended statistic" relation (pg_statistic_ext)
+ * along with the relation's initial contents.
+ *

2017

+ * stats.h
+ * Multivariate statistics and selectivity estimation functions.
+ *

2017

"Multivariate" should be "Extended". My justification here is that stats_are_built() is contained within, which is used in get_relation_statistics() which is not specific to MV stats.

0003:

No more time today. Will try and get to those soon.

Setting to waiting on author in the meantime.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From

David Rowley

Date:

14 March 2017, 13:59:57

On 13 March 2017 at 23:00, David Rowley <david.rowley@2ndquadrant.com> wrote:

> 0003:
>
> No more time today. Will try and get to those soon.

0003:

I've now read this patch. My main aim here was to learn what it does and how it works. I need to spend much longer understanding how you're calculating the functional dependencies.

In the meantime I've pasted the notes I took while reading over the patch.

+ default:
+ elog(ERROR, "unexcpected statistics type requested: %d", type);

"unexpected", but we generally use "unknown".

@@ -1293,7 +1294,8 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
info->rel = rel;
/* built/available statistics */
- info->ndist_built = true;
+ info->ndist_built = stats_are_built(htup, STATS_EXT_NDISTINCT);
+ info->deps_built = stats_are_built(htup, STATS_EXT_DEPENDENCIES);

I don't really like how this function is shaping up. You're calling stats_are_built() potentially twice for each stats type. There must be a nicer way to do this. Are non-built stats common enough to optimize building a StatisticExtInfo regardless and throwing it away if it happens to be useless?

Can you also rename mvoid to something more like esoid or similar? I seem to always read it as m-void instead of mv-oid and naturally I expect a void pointer rather than an Oid.

+dependencies, and for each one count the number of rows rows consistent it.

duplicate word "rows"

+Apllying the functional dependencies is fairly simple - given a list of

Applying

+In this case the default estimation based on AVIA principle happens to work

hmm, maybe I should know what AVIA principles are, but I don't. Is there something I should be reading? I searched around the internet for a few minutes, but it didn't seem to have a great idea either.

2017

+ Assert(tmp <= ((char *) output + len));

Shouldn't you just Assert(tmp == ((char *) output + len)); at the end of the loop?

+ if (dependencies->magic != STATS_DEPS_MAGIC)
+ elog(ERROR, "invalid dependency magic %d (expected %dd)",
+ dependencies->magic, STATS_DEPS_MAGIC);
+ if (dependencies->type != STATS_DEPS_TYPE_BASIC)
+ elog(ERROR, "invalid dependency type %d (expected %dd)",
+ dependencies->type, STATS_DEPS_TYPE_BASIC);

%dd ?

+ Assert(dependencies->ndeps > 0);

Why Assert() and not elog()? Wouldn't this mean that a corrupt dependency could fail an Assert?

+ dependencies = (MVDependencies) palloc0(sizeof(MVDependenciesData));

Why palloc0() and not palloc()?

Can you not just read it into a variable on the stack, then check the exact size using tempdeps.ndeps * sizeof(MVDependency), then memcpy() it over? That'll save you the realloc()
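A rough sketch of the approach suggested here: copy the fixed-size header into a stack variable, validate it and the total length, and only then allocate the exact amount and copy the payload. The DemoHeader and DemoList types, the DEMO_MAGIC value and the use of malloc() with NULL returns are assumptions for a standalone example; real backend code would use palloc() and report problems with elog().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAGIC 0xB4549A2CU  /* hypothetical marker value */

/* Fixed-size part of the serialized form. */
typedef struct DemoHeader
{
    uint32_t magic;
    uint32_t nitems;
} DemoHeader;

/* In-memory form: same header fields followed by the items. */
typedef struct DemoList
{
    uint32_t magic;
    uint32_t nitems;
    int16_t  items[];           /* flexible array member */
} DemoList;

/* Returns NULL on any validation failure instead of asserting. */
static DemoList *
demo_deserialize(const char *data, size_t len)
{
    DemoHeader hdr;             /* header lives on the stack first */
    size_t     payload;
    DemoList  *result;

    assert(data != NULL);

    if (len < sizeof(DemoHeader))
        return NULL;

    memcpy(&hdr, data, sizeof(DemoHeader));

    if (hdr.magic != DEMO_MAGIC || hdr.nitems == 0)
        return NULL;

    /* the input must have exactly the size the header implies */
    payload = (size_t) hdr.nitems * sizeof(int16_t);
    if (len != sizeof(DemoHeader) + payload)
        return NULL;

    /* one allocation of the final size, so no realloc is needed */
    result = malloc(offsetof(DemoList, items) + payload);
    if (result == NULL)
        return NULL;

    result->magic = hdr.magic;
    result->nitems = hdr.nitems;
    memcpy(result->items, data + sizeof(DemoHeader), payload);

    return result;
}
```

Because corrupt input is rejected through ordinary control flow, it cannot trip an assertion in an assert-enabled build, which is the concern raised about Assert() above.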

+ /* what minimum bytea size do we expect for those parameters */
+ expected_size = offsetof(MVDependenciesData, deps) +
+ dependencies->ndeps * (offsetof(MVDependencyData, attributes) +
+ sizeof(AttrNumber) * 2);

Can't quite make sense of this yet. Why * 2?

+ /* is the number of attributes valid? */
+ Assert((k >= 2) && (k <= STATS_MAX_DIMENSIONS));

Seems like a bad idea to Assert() this. Wouldn't some bad data being deserialized cause an Assert failure?

+ d = (MVDependency) palloc0(offsetof(MVDependencyData, attributes) +
+ (k * sizeof(AttrNumber)));

Why palloc0(), you seem to write out all the fields right away. Seems like a waste to zero the memory.

+ /* still within the bytea */
+ Assert(tmp <= ((char *) data + VARSIZE_ANY(data)));

Any point? You're already Asserting that you've consumed the entire array at the end anyway.

+ appendStringInfoString(&str, "[");

appendStringInfoChar(&str, '['); would be better.

+ ret = pstrdup(str.data);

ret = pnstrdup(str.data, str.len);

+CREATE STATISTICS s1 WITH (dependencies) ON (a,a) FROM functional_dependencies;

+ERROR: duplicate column name in statistics definition

Is it worth mentioning which column here?

I'll try to spend more time understanding 0003 soon.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From

Alvaro Herrera

Date:

15 March 2017, 01:08:32

I tried patch 0002 today and again there are conflicts, so I rebased and
fixed the merge problems. I also changed a number of minor things, all
AFAICS cosmetic in nature:

* moved src/backend/statistics/common.h to src/include/statistics/common.h, as previously commented. I also took out
postgres.h and most of the includes; instead, put all these into each .c source file. That aligns with our established
practice. I also removed two prototypes that should actually be in stats.h. I think statistics/common.h should be
further renamed to statistics/stats_ext_internal.h, and statistics/stats.h to something different though I don't know
what ATM.

* Moved src/include/utils/stats.h to src/include/statistics, clean it up a bit.

* Moved some structs from analyze.c into statistics/common.h, removing some duplication; have analyze.c include that
file.

* renamed src/test/regress/sql/mv_ndistinct.sql to stats_ext.sql, to collect all ext.stats. related tests in a single
file, instead of having a large number of them. I also added one test that drops a column, per David Rowley's reported
failure, but I didn't actually fix the problem nor add it to the expected file. (I'll follow up with that tomorrow, if
Tomas doesn't beat me to it). Also, put the test in an earlier parallel test group, 'cause I see no reason to put it
last.

* A bunch of stylistic changes.

The added tests pass (or they passed before I added the drop column
tests; not a surprise really that they pass, since I didn't touch
anything functionally), but they aren't terribly exhaustive at the stage
of the first patch in the series.

I didn't get around to addressing all of David Rowley's input. Also I
didn't try to rebase the remaining patches in the series on top of this
one.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From

Alvaro Herrera

Date:

15 March 2017, 01:10:49

Alvaro Herrera wrote:
> I tried patch 0002 today and again there are conflicts, so I rebased and
> fixed the merge problems.

... and attached the patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

stats-ext-shared-infra.patch

Re: [HACKERS] multivariate statistics (v25)

From

David Fetter

Date:

15 March 2017, 02:18:04

On Tue, Mar 14, 2017 at 07:10:49PM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > I tried patch 0002 today and again there are conflicts, so I rebased and
> > fixed the merge problems.
> 
> ... and attached the patch.

Is the plan to convert completely from "multivariate" to "extended?"
I ask because I found a "multivariate" in there.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: [HACKERS] multivariate statistics (v25)

From

David Rowley

Date:

15 March 2017, 05:05:38

On 15 March 2017 at 12:18, David Fetter <david@fetter.org> wrote:

Is the plan to convert completely from "multivariate" to "extended?"
I ask because I found a "multivariate" in there.

I get the idea that Tomas would like to keep the multivariate when it's actually referencing multivariate stats. The idea of the rename was to allow future expansion of the code to perhaps allow creation of stats on expressions, which is not multivariate. If you've found multivariate reference in an area that should be generic to extended statistics then that's a bug and should be fixed.

I found a few of these and listed them during my review.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From

Alvaro Herrera

Date:

15 March 2017, 23:45:18

Here's another version of 0002 after cleaning up almost everything from
David's review.  I also added tests for ALTER STATISTICS in
sql/alter_generic.sql which made me realize there were three crasher bugs
in here; fixed all those.  It also made me realize that psql's \d was a
little bit too generous with dropped columns in a stats object.  That
should all behave better now.

One thing I didn't do was change StatisticExtInfo to use a bitmapset
instead of int2vector.  I think it's a good idea to do so.

I'll go rebase the followup patches now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

stats-ext-shared-infra-27.patch

Re: [HACKERS] multivariate statistics (v25)

From

Alvaro Herrera

Date:

16 March 2017, 07:36:51

David Rowley wrote:

> + k = -1;
> + while ((k = bms_next_member(attnums, k)) >= 0)
> + {
> + bool attr_found = false;
> + for (i = 0; i < info->stakeys->dim1; i++)
> + {
> + if (info->stakeys->values[i] == k)
> + {
> + attr_found = true;
> + break;
> + }
> + }
> +
> + /* found attribute not covered by this ndistinct stats, skip */
> + if (!attr_found)
> + {
> + matches = false;
> + break;
> + }
> + }
> 
> Would it be better just to stuff info->stakeys->values into a bitmapset and
> check its a subset of attnums? It would mean allocating memory in the loop,
> so maybe you think otherwise, but in that case maybe StatisticExtInfo
> should store the bitmapset?

Yeah, I think StatisticExtInfo should have a bitmapset, not an
int2vector.
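To illustrate the suggestion, here is a minimal sketch of how the nested loop quoted above collapses into a single subset test once both sides are sets. A 64-bit mask stands in for a Bitmapset so the example is standalone; the function names are hypothetical, and real code would use the bitmapset API instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a bit mask from an array of attribute numbers (each in 0..63). */
static uint64_t
attnums_to_mask(const int *attnums, int n)
{
    uint64_t    mask = 0;

    for (int i = 0; i < n; i++)
        mask |= UINT64_C(1) << attnums[i];

    return mask;
}

/* True when every member of 'a' is also a member of 'b'. */
static bool
mask_is_subset(uint64_t a, uint64_t b)
{
    return (a & ~b) == 0;
}
```

The quoted loop sets matches = false as soon as one clause attribute is not among the statistics keys, which is the same as asking whether the clause attributes are a subset of the keys: mask_is_subset(clause_mask, keys_mask).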

> + appendPQExpBuffer(&buf, "(dependencies)");
> 
> I think it's better practice to use appendPQExpBufferStr() when there's no
> formatting. It'll perform marginally better, which might not be important
> here, but it sets a better example for people to follow when performance is
> more critical.

FWIW this should have said "(ndistinct)" anyway :-)

> +   change the definition of a extended statistics
> 
> "a" should be "an", Also is statistics plural here. It's commonly mixed up
> in the patch. I think it needs standardised. I personally think if you're
> speaking of a single pg_statistic_ext row, then it should be singular. Yet,
> I'm aware you're using plural for the CREATE STATISTICS command, to me that
> feels a bit like: CREATE TABLES mytable ();  am I thinking wrongly
> somehow here?

This was discussed upthread as I recall.  This is what Merriam-Webster says on
the topic:

statistic
1   :  a single term or datum in a collection of statistics
2 a :  a quantity (as the mean of a sample) that is computed from a sample;
       specifically :  estimate 3b
  b :  a random variable that takes on the possible values of a statistic

statistics
1   :  a branch of mathematics dealing with the collection, analysis,
       interpretation, and presentation of masses of numerical data
2   :  a collection of quantitative data

Now, I think there's room to say that a single object created by the new CREATE
STATISTICS is really the latter, not the former.  I find it very weird
that a single one of these objects is named in the plural form, though, and
it looks odd all over the place.  I would rather use the term
"statistics object", and then we can continue using the singular.

> +   If a schema name is given (for example, <literal>CREATE STATISTICS
> +   myschema.mystat ...</>) then the statistics is created in the specified
> +   schema.  Otherwise it is created in the current schema.  The name of
> 
> What's created in the current schema? I thought this was just for naming?

Well, "created in a schema" means that the object is named after that
schema.  So both are the same thing.  Is this unclear in some way?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] multivariate statistics (v25)

From

David Rowley

Date:

16 March 2017, 15:51:43

On 16 March 2017 at 09:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Here's another version of 0002 after cleaning up almost everything from
David's review. I also added tests for ALTER STATISTICS in
sql/alter_generic.sql which made me realize there were three crasher bugs
in here; fixed all those. It also made me realize that psql's \d was a
little bit too generous with dropped columns in a stats object. That
should all behave better now.

Thanks for fixing.

As you mentioned to me off-list about missing pg_dump support, I've gone and implemented that in the attached patch.

I followed how pg_dump works for indexes, and created pg_get_statisticsextdef() in ruleutils.c. I was unsure if I should be naming this pg_get_statisticsdef() instead.

I also noticed there's no COMMENT ON support either, so I added that too.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Attachment

extstats_pg_dump_and_comment_support.patch

Re: [HACKERS] multivariate statistics (v25)

From

Alvaro Herrera

Date:

17 March 2017, 01:20:33

Here's a rebased series on top of today's a3eac988c267.  I call this
v28.

I put David's pg_dump and COMMENT patches as second in line, just after
the initial infrastructure patch.  I suppose those three have to be
committed together, while the others (which add support for additional
statistic types) can rightly remain as separate commits.

(I think I lost some regression test files.  I couldn't make up my mind
about putting each statistic type's tests in a separate file, or all
together in stats_ext.sql.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 25 March 2017 at 07:35, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

As I said in another thread, I pushed parts 0002,0003,0004. Tomas said
he would try to rebase patches 0001,0005,0006 on top of what was
committed. My intention is to give that one a look as soon as it is
available. So we will have n-distinct and functional dependencies in
PG10. It sounds unlikely that we will get MCVs and histograms in, since
they're each a lot of code.

I've been working on the MV functional dependencies part of the patch to polish it up a bit. Tomas has been busy with a few other duties.

I've made some changes around how clauselist_selectivity() determines if it should try to apply any extended stats. The solution I came up with was to add two parameters to this function, one for the RelOptInfo in question, and one a bool to control if we should try to apply any extended stats. For clauselist_selectivity() usage involving join rels we just pass the rel as NULL, that way we can skip all the extended stats stuff with very low overhead. When we actually have a base relation to pass along we can do so, along with a true tryextstats value to have the function attempt to use any extended stats to assist with the selectivity estimation.

When adding these two parameters I had second thoughts about whether "tryextstats" was required at all. We could just have this controlled by whether the rel is a base rel of kind RTE_RELATION. I ended up having to pass these parameters further, down to clauselist_selectivity's singleton counterpart, clause_selectivity(). This was due to clause_selectivity() calling clauselist_selectivity() for some clause types. I'm not entirely sure if this is actually required, but I can't see any reason for it to cause problems.

I've also attempted to simplify some of the logic within clauselist_selectivity and some other parts of clausesel.c to remove some unneeded code and make it a bit more efficient. For example, we no longer count the attributes in the clause list before calling a similar function to retrieve the actual attnums. This is now done as a single step.

I've not yet quite gotten as far as I'd like with this. I'd quite like to see clauselist_ext_split() gone, and instead we could build up a bitmapset of clause list indexes to ignore when applying the selectivity of clauses that couldn't use any extended stats. I'm planning on having a bit more of a look at this tomorrow.

The attached patch should apply to master as of f90d23d0c51895e0d7db7910538e85d3d38691f0.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Attachment

mv_functional-deps_2017-03-31.patch

Re: multivariate statistics (v25)

From

Kyotaro HORIGUCHI

Date:

31 March 2017, 11:18:21

Hello,

At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in
<CAKJS1f-fqo97jasVF57yfVyG+=T5JLce5ynCi1vvezXxX=wgoA@mail.gmail.com>
> On 25 March 2017 at 07:35, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > As I said in another thread, I pushed parts 0002,0003,0004.  Tomas said
> > he would try to rebase patches 0001,0005,0006 on top of what was
> > committed.  My intention is to give that one a look as soon as it is
> > available.  So we will have n-distinct and functional dependencies in
> > PG10.  It sounds unlikely that we will get MCVs and histograms in, since
> > they're each a lot of code.
> >
> 
> I've been working on the MV functional dependencies part of the patch to
> polish it up a bit. Tomas has been busy with a few other duties.
> 
> I've made some changes around how clauselist_selectivity() determines if it
> should try to apply any extended stats. The solution I came up with was to
> add two parameters to this function, one for the RelOptInfo in question,
> and one a bool to control if we should try to apply any extended stats.
> For clauselist_selectivity() usage involving join rels we just pass the rel
> as NULL, that way we can skip all the extended stats stuff with very low
> overhead. When we actually have a base relation to pass along we can do so,
> along with a true tryextstats value to have the function attempt to use any
> extended stats to assist with the selectivity estimation.
> 
> When adding these two parameters I had 2nd thoughts that the "tryextstats"
> was required at all. We could just have this controlled by if the rel is a
> base rel of kind RTE_RELATION. I ended up having to pass these parameters
> further, down to clauselist_selectivity's singleton counterpart,
> clause_selectivity(). This was due to clause_selectivity() calling
> clauselist_selectivity() for some clause types. I'm not entirely sure if
> this is actually required, but I can't see any reason for it to cause
> problems.

I understand that the reason for tryextstats is that the two are
perfectly correlated but clause_selectivity requires the
RelOptInfo anyway. Some comment about that may be required in the
function comment.

> I've also attempted to simplify some of the logic within
> clauselist_selectivity and some other parts of clausesel.c to remove some
> unneeded code and make it a bit more efficient. For example, we no longer
> count the attributes in the clause list before calling a similar function
> to retrieve the actual attnums. This is now done as a single step.
> 
> I've not yet quite gotten as far as I'd like with this. I'd quite like to
> see clauselist_ext_split() gone, and instead we could build up a bitmapset
> of clause list indexes to ignore when applying the selectivity of clauses
> that couldn't use any extended stats. I'm planning on having a bit more of
> a look at this tomorrow.
> 
> The attached patch should apply to master as
> of f90d23d0c51895e0d7db7910538e85d3d38691f0.

FWIW, I tried this. It applied cleanly, but make ends with
the following error.

$ make -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrprotos.h
Writing fmgrtab.c
make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'.  Stop.
make[2]: *** [statistics-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2


Some random comments by just looking on the patch:

======
The names of the functions "collect_ext_attnums" and
"clause_is_ext_compatible" seem odd since "ext" doesn't seem to
be a part of "extended statistics". Some other names look the
same, too.

Something like "collect_e(xt)stat_compatible_attnums" and
"clause_is_e(xt)stat_compatible" seem better to me.

======
Something seems wrong with the following comment.

+ * When applying functional dependencies, we start with the strongest ones
+ * strongest dependencies. That is, we select the dependency that:

======
dependency_is_fully_matched() is not found. Maybe some other
patches are assumed?

======
+        /* see if it actually has the right */
+        ok = (NumRelids((Node *) expr) == 1) &&
+            (is_pseudo_constant_clause(lsecond(expr->args)) ||
+             (varonleft = false,
+              is_pseudo_constant_clause(linitial(expr->args))));
+
+        /* unsupported structure (two variables or so) */
+        if (!ok)
+            return true;

"ok" is used only here. I don't think an expression with a side
effect is a good idea here.
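For illustration, one way to express the same check without the comma operator is a small helper that sets varonleft explicitly. This is only a sketch under simplifying assumptions: the Arg type and its is_const flag are hypothetical stand-ins for the expression nodes and is_pseudo_constant_clause(), and the NumRelids() test from the quoted hunk is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an operator argument; not a PostgreSQL type. */
typedef struct
{
    bool        is_const;   /* would be is_pseudo_constant_clause(arg) */
} Arg;

/*
 * Returns true if the clause has the form "Var op Const" or
 * "Const op Var", and reports which side the Var is on.
 * *varonleft is left untouched when the clause does not match.
 */
static bool
match_var_const(const Arg *left, const Arg *right, bool *varonleft)
{
    if (right->is_const)
    {
        *varonleft = true;
        return true;
    }
    if (left->is_const)
    {
        *varonleft = false;
        return true;
    }
    return false;           /* two variables or some other structure */
}
```

The order of the two tests mirrors the quoted expression: the right-hand argument is checked first, so a clause with constants on both sides is treated as having the variable on the left.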

======
+        switch (get_oprrest(expr->opno))
+        {
+            case F_EQSEL:
+
+                /* equality conditions are compatible with all statistics */
+                break;
+
+            default:
+
+                /* unknown estimator */
+                return true;
+        }

This seems somewhat stupid..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: multivariate statistics (v25)

From

David Rowley

Date:

31 March 2017, 12:05:46

On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

At Fri, 31 Mar 2017 03:03:06 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in <CAKJS1f-fqo97jasVF57yfVyG+=T5JLce5ynCi1vvezXxX=wgoA@mail.gmail.com>

FWIW, I tried this. It applied cleanly, but make ends with
the following error.

$ make -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrprotos.h
Writing fmgrtab.c
make[3]: *** No rule to make target `dependencies.o', needed by `objfiles.txt'. Stop.
make[2]: *** [statistics-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2

Apologies. I was caught out by patching back on to master, then committing, and git diff'ing the last commit, where I'd of course forgotten to git add those files.

I'm just in the middle of fixing up some other stuff. Hopefully I'll post a working patch soon.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Re: multivariate statistics (v25)

From

David Rowley

Date:

31 March 2017, 18:25:12

On 31 March 2017 at 21:18, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> When adding these two parameters I had 2nd thoughts that the "tryextstats"
> was required at all. We could just have this controlled by if the rel is a
> base rel of kind RTE_RELATION. I ended up having to pass these parameters
> further, down to clauselist_selectivity's singleton counterpart,
> clause_selectivity(). This was due to clause_selectivity() calling
> clauselist_selectivity() for some clause types. I'm not entirely sure if
> this is actually required, but I can't see any reason for it to cause
> problems.

I understand that the reason for tryextstats is that the two are
perfectly correlated but clause_selectivity requires the
RelOptInfo anyway. Some comment about that may be required in the
function comment.

hmm, you could say one is functionally dependent on the other. I did consider removing it, but it seemed weird to pass a NULL relation when we don't want to attempt to use extended stats.

Some random comments by just looking on the patch:

======
The name of the function "collect_ext_attnums", and
"clause_is_ext_compatible" seems odd since "ext" doesn't seem to
be a part of "extended statistics". Some other names looks the
same, too.

I agree. I've made some changes to the patch to change how the functional dependency estimations are applied. I've removed most of the code from clausesel.c and put it into dependencies.c. In doing so I've removed some of the inefficiencies that were in the patch. For example clause_is_ext_compatible() was being called many times on the same clause at different times. I've now nailed that down to just once per clause.

Something like "collect_e(xt)stat_compatible_attnums" and
"clause_is_e(xt)stat_compatible" seem better to me.

Changed to dependency_compatible_clause(), since this was searching for equality clauses in the form Var = Const, or Const = Var. This seems specific to the functional dependencies checking. A multivariate histogram won't want the same.

======
The following comment seems something wrong.

+ * When applying functional dependencies, we start with the strongest ones
+ * strongest dependencies. That is, we select the dependency that:

======
dependency_is_fully_matched() is not found. Maybe some other
patches are assumed?

======
+ /* see if it actually has the right */
+ ok = (NumRelids((Node *) expr) == 1) &&
+ (is_pseudo_constant_clause(lsecond(expr->args)) ||
+ (varonleft = false,
+ is_pseudo_constant_clause(linitial(expr->args))));
+
+ /* unsupported structure (two variables or so) */
+ if (!ok)
+ return true;

"ok" is used only here. I don't think an expression with a side
effect is a good idea here.

I thought the same, but I happened to notice that Tomas must have taken it from clauselist_selectivity().

======
+ switch (get_oprrest(expr->opno))
+ {
+ case F_EQSEL:
+
+ /* equality conditions are compatible with all statistics */
+ break;
+
+ default:
+
+ /* unknown estimator */
+ return true;
+ }

This seems somewhat stupid..

I agree. Changed.

I've attached an updated patch.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Attachment

mv_functional-deps_2017-04-01.patch

Re: multivariate statistics (v25)

From

David Rowley

Date:

04 April 2017, 10:55:34

On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached an updated patch.

I've made another pass at this and ended up removing the tryextstats
variable. We now only try to use extended statistics when
clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
RTE_RELATION, and of course, it must also have some extended stats
defined too.

I've also cleaned up a few more comments, many of which I managed to
omit updating when I refactored how the selectivity estimates tie
into clauselist_selectivity()

I'm quite happy with all of this now, and would also be happy for
other people to take a look and comment.

As a reviewer, I'd be marking this ready for committer, but I've moved
a little way from just reviewing this now, having spent two weeks
hacking at it.

The latest patch is attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

mv_functional-deps_2017-04-04.patch

Re: multivariate statistics (v25)

From

Tomas Vondra

Date:

04 April 2017, 21:19:39

On 04/04/2017 09:55 AM, David Rowley wrote:
> On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've attached an updated patch.
>
> I've made another pass at this and ended up removing the tryextstats
> variable. We now only try to use extended statistics when
> clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> RTE_RELATION, and of course, it must also have some extended stats
> defined too.
>
> I've also cleaned up a few more comments, many of which I managed to
> omit updating when I refactored how the selectivity estimates ties
> into clauselist_selectivity()
>
> I'm quite happy with all of this now, and would also be happy for
> other people to take a look and comment.
>
> As a reviewer, I'd be marking this ready for committer, but I've moved
> a little way from just reviewing this now, having spent two weeks
> hacking at it.
>
> The latest patch is attached.
>

Thanks David, I agree the reworked patch is much cleaner than the last
version I posted. Thanks for spending your time on it.

Two minor comments:

1) DEPENDENCY_MIN_GROUP_SIZE

I'm not sure we still need the min_group_size when evaluating
dependencies. It was meant to deal with 'noisy' data, but I think that
after switching to the 'degree' it might actually be a bad idea.

Consider this:

    create table t (a int, b int);
    insert into t select 1, 1 from generate_series(1, 10000) s(i);
    insert into t select i, i from generate_series(2, 20000) s(i);
    create statistics s with (dependencies) on (a,b) from t;
    analyze t;

    select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
    (1 row)

So the degree of the dependency is just ~0.333 although it's obviously a 
perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason 
is that we discard 2/3 of rows, because those groups are only a single 
row each, except for the one large group (1/3 of rows).

Without the minimum group size limitation, the dependencies are:

    test=# select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
    (1 row)

which seems way more reasonable, I think.
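The arithmetic behind those two results can be reproduced with a small standalone sketch. It treats the whole table as the sample and counts, for the dependency a => b, the rows that fall into groups of equal 'a' with a single 'b' value; groups below the minimum size are not counted as supporting. The names and the threshold value of 3 are assumptions for illustration, not taken from the patch.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    int         a;
    int         b;
} Row;

/* Order rows by (a, b) so that rows with equal 'a' are adjacent. */
static int
cmp_rows(const void *x, const void *y)
{
    const Row  *r1 = x;
    const Row  *r2 = y;

    if (r1->a != r2->a)
        return (r1->a < r2->a) ? -1 : 1;
    if (r1->b != r2->b)
        return (r1->b < r2->b) ? -1 : 1;
    return 0;
}

/*
 * Degree of "a => b": fraction of rows in groups (by a) that have one
 * single b value and contain at least min_group_size rows.
 * Sorts the input array in place.
 */
static double
dependency_degree(Row *rows, int nrows, int min_group_size)
{
    int         supporting = 0;
    int         start = 0;

    qsort(rows, nrows, sizeof(Row), cmp_rows);

    while (start < nrows)
    {
        int         end = start;
        bool        consistent = true;

        while (end < nrows && rows[end].a == rows[start].a)
        {
            if (rows[end].b != rows[start].b)
                consistent = false;
            end++;
        }

        if (consistent && (end - start) >= min_group_size)
            supporting += end - start;

        start = end;
    }

    return (nrows > 0) ? (double) supporting / nrows : 0.0;
}

/* Same data as the example: 10000 x (1,1) plus (i,i) for i in 2..20000. */
static Row *
build_example(int *nrows)
{
    int         n = 10000 + 19999;
    int         k = 0;
    Row        *rows = malloc(n * sizeof(Row));

    if (rows == NULL)
        return NULL;

    for (int i = 0; i < 10000; i++)
        rows[k++] = (Row) {1, 1};
    for (int i = 2; i <= 20000; i++)
        rows[k++] = (Row) {i, i};

    *nrows = n;
    return rows;
}
```

With a minimum group size of 3 only the 10000-row group counts, giving 10000 / 29999, which is the 0.333344 shown above; with a minimum of 1 every group counts and the degree is 1.0.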


2) A minor detail is that instead of this

    if (estimatedclauses != NULL &&
        bms_is_member(listidx, estimatedclauses))
        continue;

perhaps we should do just this:

    if (bms_is_member(listidx, estimatedclauses))
        continue;

bms_is_member does the same NULL check right at the beginning, so I 
don't think this might make a measurable difference.
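As a toy illustration of why the outer check is redundant: when the membership test itself treats a NULL set as empty, the caller does not need its own NULL test. TinySet and tinyset_is_member() below are made-up stand-ins for Bitmapset and bms_is_member(), limited to 64 members so the sketch stays standalone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a Bitmapset; a NULL pointer represents the empty set. */
typedef struct
{
    uint64_t    bits;
} TinySet;

static bool
tinyset_is_member(int x, const TinySet *set)
{
    if (set == NULL)
        return false;       /* the empty set has no members */
    if (x < 0 || x >= 64)
        return false;       /* out of range for this toy set */

    return ((set->bits >> x) & 1) != 0;
}
```

A caller can therefore write "if (tinyset_is_member(listidx, estimated)) continue;" regardless of whether any clause has been estimated yet.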


kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v25)

From

Kyotaro HORIGUCHI

Date:

05 April 2017, 05:53:51

At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com>
> On 04/04/2017 09:55 AM, David Rowley wrote:
> > On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com>
> > wrote:
> >> I've attached an updated patch.
> >
> > I've made another pass at this and ended up removing the tryextstats
> > variable. We now only try to use extended statistics when
> > clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> > RTE_RELATION, and of course, it must also have some extended stats
> > defined too.
> >
> > I've also cleaned up a few more comments, many of which I managed to
> > omit updating when I refactored how the selectivity estimates tie
> > into clauselist_selectivity()
> >
> > I'm quite happy with all of this now, and would also be happy for
> > other people to take a look and comment.
> >
> > As a reviewer, I'd be marking this ready for committer, but I've moved
> > a little way from just reviewing this now, having spent two weeks
> > hacking at it.
> >
> > The latest patch is attached.
> >
> 
> Thanks David, I agree the reworked patch is much cleaner than the last
> version I posted. Thanks for spending your time on it.
> 
> Two minor comments:
> 
> 1) DEPENDENCY_MIN_GROUP_SIZE
> 
> I'm not sure we still need the min_group_size, when evaluating
> dependencies. It was meant to deal with 'noisy' data, but I think it
> after switching to the 'degree' it might actually be a bad idea.
> 
> Consider this:
> 
>     create table t (a int, b int);
>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>     insert into t select i, i from generate_series(2, 20000) s(i);
>     create statistics s with (dependencies) on (a,b) from t;
>     analyze t;
> 
>     select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>     (1 row)
> 
> So the degree of the dependency is just ~0.333 although it's obviously
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The
> reason is that we discard 2/3 of rows, because those groups are only a
> single row each, except for the one large group (1/3 of rows).
> 
> Without the minimum group size limitation, the dependencies are:
> 
>     test=# select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
>     (1 row)
> 
> which seems way more reasonable, I think.

I think the same. A fairly large share of real-world functional
dependencies are of this kind.

> 2) A minor detail is that instead of this
> 
>     if (estimatedclauses != NULL &&
>         bms_is_member(listidx, estimatedclauses))
>         continue;
> 
> perhaps we should do just this:
> 
>     if (bms_is_member(listidx, estimatedclauses))
>         continue;
> 
> bms_is_member does the same NULL check right at the beginning, so I
> don't think this might make a measurable difference.

I have some other comments.

======
- The comment for clauselist_selectivity,
| + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
| + * selectivity estimates using any extended statistcs on 'rel'.

The 'rel' is actually a parameter but rtekind means rel->rtekind,
so this might be better written like the following.

| When a relation of RTE_RELATION is given as 'rel', we try
| extended statistics on the relation.

Then the following line doesn't seem to be required.

| + * If we identify such extended statistics exist, we try to apply them.

=====
The following comment in the same function,

| +    if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
| +    {
| +        /*
| +         * Try to estimate with multivariate functional dependency statistics.
| +         *
| +         * The function will supply an estimate for the clauses which it
| +         * estimated for. Any clauses which were unsuitible were ignored.
| +         * Clauses which were estimated will have their 0-based list index set
| +         * in estimatedclauses.  We must ignore these clauses when processing
| +         * the remaining clauses later.
| +         */

(Note that I'm not a good writer.) This might be better as the
following.

|  dependencies_clauselist_selectivity gives the selectivity of the
|  clauses to which functional dependencies on the given relation are
|  applicable. The 0-based index numbers of consumed clauses are
|  returned in the bitmap set estimatedclauses so that the
|  estimation hereafter can ignore them.

=====
| +        s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
| +                                   jointype, sjinfo, rel, &estimatedclauses);

The name prefix "dependency_" means "functional_dependency" here
and omitting "functional" is confusing to me. On the other hand
"functional_dependency" is quite long as prefix. Could we use
"func_dependency" or something that is shorter but meaningful?
(But this change causes renaming of many other stuff..)

=====
The name "dependency_compatible_clause" might be meaningful if it
were "clause_is_compatible_with_(functional_)dependency" or such.

=====
dependency_compatible_walker() returns true if given node is
*not* compatible. Isn't it confusing?

=====
dependency_compatible_walker() seems to implicitly expect that a
RestrictInfo will be given first. Perhaps the RestrictInfo
should be processed outside this function, in _compatible_clause().

=====
dependency_compatible_walker() can return two or more attributes
but dependency_compatible_clause() errors out in the case. Since
_walker is called only from the _clause, _walker can return
earlier with "incompatible" in such a case.

=====
In the comment in dependencies_clauselist_selectivity(), 

|  /*
|   * Technically we could find more than one clause for a given
|   * attnum. Since these clauses must be equality clauses, we choose
|   * to only take the selectivity estimate from the final clause in
|   * the list for this attnum. If the attnum happens to be compared
|   * to a different Const in another clause then no rows will match
|   * anyway. If it happens to be compared to the same Const, then
|   * ignoring the additional clause is just the thing to do.
|   */
|  if (dependency_implies_attribute(dependency,
|                                   list_attnums[listidx]))

If multiple clauses include the attribute, selectivity estimates
for clauses other than the last one are a waste of time. Why take
the last one rather than the first?

Even if all clauses should be added into estimatedclauses,
calling clause_selectivity once is enough. Since
clause_selectivity may return 1.0 for some clauses, using s2 for
the decision seems reasonable.

|  if (dependency_implies_attribute(dependency,
|                                   list_attnums[listidx]))
|  {
|      clause = (Node *) lfirst(l);
+      if (s2 == 1.0)
|        s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,

# This '==' works since it is not a result of a calculation.

=====
Still in dependencies_clauselist_selectivity,
dependency_implies_attribute seems designed to return true for
at least one clause in the clauses, but any failure leads to an
infinite loop. I think some safeguard against that case is required.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: multivariate statistics (v25)

From

"Sven R. Kunze"

Date:

05 April 2017, 09:41:31

Thanks Tomas and David for hacking on this patch.

On 04.04.2017 20:19, Tomas Vondra wrote:
> I'm not sure we still need the min_group_size, when evaluating 
> dependencies. It was meant to deal with 'noisy' data, but I think
> that after switching to the 'degree' it might actually be a bad idea.
>
> Consider this:
>
>     create table t (a int, b int);
>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>     insert into t select i, i from generate_series(2, 20000) s(i);
>     create statistics s with (dependencies) on (a,b) from t;
>     analyze t;
>
>     select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>     (1 row)
>
> So the degree of the dependency is just ~0.333 although it's obviously 
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
> reason is that we discard 2/3 of rows, because those groups are only a 
> single row each, except for the one large group (1/3 of rows).

Just for me to follow the comments better. Is "dependency" roughly the 
same as when statisticians speak about "conditional probability"?

Sven

Re: multivariate statistics (v25)

From

Tomas Vondra

Date:

05 April 2017, 12:41:29


On 04/05/2017 08:41 AM, Sven R. Kunze wrote:
> Thanks Tomas and David for hacking on this patch.
> 
> On 04.04.2017 20:19, Tomas Vondra wrote:
>> I'm not sure we still need the min_group_size, when evaluating 
>> dependencies. It was meant to deal with 'noisy' data, but I think
>> that after switching to the 'degree' it might actually be a bad idea.
>>
>> Consider this:
>>
>>     create table t (a int, b int);
>>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>>     insert into t select i, i from generate_series(2, 20000) s(i);
>>     create statistics s with (dependencies) on (a,b) from t;
>>     analyze t;
>>
>>     select stadependencies from pg_statistic_ext ;
>>                   stadependencies
>>     --------------------------------------------
>>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>>     (1 row)
>>
>> So the degree of the dependency is just ~0.333 although it's obviously 
>> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
>> reason is that we discard 2/3 of rows, because those groups are only a 
>> single row each, except for the one large group (1/3 of rows).
> 
> Just for me to follow the comments better. Is "dependency" roughly the 
> same as when statisticians speak about "conditional probability"?
> 

No, it's more 'functional dependency' from relational normal forms.
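
To connect this with the example quoted above, here is a minimal sketch
(not the patch's code) of how a degree like 0.333344 can arise. It
assumes the sample is sorted by (a, b), that a group is a run of rows
sharing one value of 'a', and that groups below a minimum size are
discarded. Row, dependency_degree and min_group_size are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One sampled row, reduced to the two columns of interest (hypothetical). */
typedef struct
{
    int a;
    int b;
} Row;

/*
 * Degree of the dependency a => b over rows sorted by (a, b).
 *
 * A group is the run of rows sharing one value of a. It supports the
 * dependency when all of its rows also share one value of b and the
 * group has at least min_group_size rows. The result is the number of
 * supporting rows divided by the total number of rows.
 */
double
dependency_degree(const Row *rows, size_t n, size_t min_group_size)
{
    size_t supporting = 0;
    size_t start = 0;

    assert(n > 0);

    while (start < n)
    {
        size_t end = start;
        int    consistent = 1;

        /* Walk to the end of the current group, checking b as we go. */
        while (end < n && rows[end].a == rows[start].a)
        {
            if (rows[end].b != rows[start].b)
                consistent = 0;
            end++;
        }

        if (consistent && end - start >= min_group_size)
            supporting += end - start;

        start = end;
    }

    return (double) supporting / (double) n;
}
```

With the 29999 rows from the example (10000 rows of (1,1) plus one row
for each i in 2..20000) and an assumed threshold of 3, only the one
large group counts, giving 10000/29999 = 0.333344. With a threshold of
1 every group counts and the degree is 1.0.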


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v25)

From

David Rowley

Date:

05 April 2017, 17:47:44

On 5 April 2017 at 14:53, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
> <56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com>
>> Two minor comments:
>>
>> 1) DEPENDENCY_MIN_GROUP_SIZE
>>
>> I'm not sure we still need the min_group_size, when evaluating
>> dependencies. It was meant to deal with 'noisy' data, but I think
>> that after switching to the 'degree' it might actually be a bad idea.

Yeah, I'd wondered about this when I first started testing the patch.
I failed to get any functional dependencies because my values were too
unique. Seems I'd gotten a bit used to it, and in the end thought that
if the values are unique enough then they won't suffer as much from
the underestimation problem you're trying to solve here.

I've removed that part of the code now.

> I think the same. Quite a large part of functional dependencies in
> reality are of this kind.
>
>> 2) A minor detail is that instead of this
>>
>>     if (estimatedclauses != NULL &&
>>         bms_is_member(listidx, estimatedclauses))
>>         continue;
>>
>> perhaps we should do just this:
>>
>>     if (bms_is_member(listidx, estimatedclauses))
>>         continue;
>>
>> bms_is_member does the same NULL check right at the beginning, so I
>> don't think this might make a measurable difference.

hmm yeah, I'd added that because I thought the estimatedclauses would
be NULL in 99.9% of cases and thought that I might be able to shave a
few cycles off. I see that there's an x < 0 test before the NULL test
in the function. Anyway, I'm not going to put up a fight here, so I've
removed it. I didn't ever benchmark anything to see if the extra test
actually helped anyway...
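
For anyone following along, a simplified sketch of why the caller-side
NULL test is redundant. This is not the real bms_is_member from
bitmapset.c; TinySet and tiny_is_member are hypothetical stand-ins with
a fixed-size word array, and the x < 0 case is an assert here rather
than an error report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, fixed-capacity stand-in for a bitmap set. */
typedef struct
{
    int      nwords;
    uint32_t words[4];
} TinySet;

/*
 * Membership test that treats a NULL set as empty, so callers can pass
 * NULL without checking first.
 */
bool
tiny_is_member(int x, const TinySet *set)
{
    int wordnum;

    /* Negative members are rejected before the NULL test. */
    assert(x >= 0);

    /* NULL means "empty set": nothing is a member. */
    if (set == NULL)
        return false;

    wordnum = x / 32;
    if (wordnum >= set->nwords)
        return false;

    return (set->words[wordnum] & (UINT32_C(1) << (x % 32))) != 0;
}
```

So a loop can simply do "if (tiny_is_member(listidx, estimated))
continue;" whether or not anything has been estimated yet. The cost is
one extra comparison inside the function in the NULL case.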

> I have some other comments.
>
> ======
> - The comment for clauselist_selectivity,
> | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
> | + * selectivity estimates using any extended statistics on 'rel'.
>
> 'rel' is actually a parameter, but rtekind here means rel->rtekind,
> so the comment might be better worded like the following.
>
> | When a relation of RTE_RELATION is given as 'rel', we try
> | extended statistics on the relation.
>
> Then the following line doesn't seem to be required.
>
> | + * If we identify such extended statistics exist, we try to apply them.

Yes, good point. I've revised this comment a bit now.

>
> =====
> The following comment in the same function,
>
> | +    if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
> | +    {
> | +        /*
> | +         * Try to estimate with multivariate functional dependency statistics.
> | +         *
> | +         * The function will supply an estimate for the clauses which it
> | +         * estimated for. Any clauses which were unsuitable were ignored.
> | +         * Clauses which were estimated will have their 0-based list index set
> | +         * in estimatedclauses.  We must ignore these clauses when processing
> | +         * the remaining clauses later.
> | +         */
>
> (Notice that I'm not a good writer) This might better be the
> following.
>
> |  dependencies_clauselist_selectivity gives the selectivity over
> |  clauses to which functional dependencies on the given relation
> |  are applicable. 0-based index numbers of consumed clauses are
> |  returned in the bitmap set estimatedclauses so that the
> |  estimation hereafter can ignore them.

I've changed this one too now.

> =====
> | +        s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
> | +                                   jointype, sjinfo, rel, &estimatedclauses);
>
> The name prefix "dependency_" means "functional_dependency" here
> and omitting "functional" is confusing to me. On the other hand
> "functional_dependency" is quite long as prefix. Could we use
> "func_dependency" or something that is shorter but meaningful?
> (But this change causes renaming of many other stuff..)

oh no! Many functions in dependencies.c start with dependencies_. To
me, it's a bit of an OOP thing, which if we'd been using some other
language would have been dependencies->clauselist_selectivity(). Of
course, not all functions in that file follow that rule, but I don't
feel a pressing need to go make that any worse.  Perhaps the prefix
could be func_dependency, but I really don't feel very excited about
having it that way, and even less so about making the change.

> =====
> The name "dependency_compatible_clause" might be meaningful if it
> were "clause_is_compatible_with_(functional_)dependency" or such.

I could maybe squeeze the word "is" in there.  ... OK done.

> =====
> dependency_compatible_walker() returns true if given node is
> *not* compatible. Isn't it confusing?

Yeah.

>
> =====
> dependency_compatible_walker() seems to implicitly expect that a
> RestrictInfo will be given first. Perhaps the RestrictInfo
> should be processed outside this function, in _compatible_clause().

Actually, I don't really see a great need for this to be a recursive
walker type function. So I've just gone and stuck all that logic in
dependency_is_compatible_clause() instead.

> =====
> dependency_compatible_walker() can return two or more attributes
> but dependency_compatible_clause() errors out in the case. Since
> _walker is called only from the _clause, _walker can return
> earlier with "incompatible" in such a case.

I don't quite see how it's possible for it to ever have more than 1
attnum in there. We only capture Vars from one side of a binary
OpExpr. If one side of the OpExpr is an Expr, then we'd not capture
anything, and not recurse into the Expr. Anyway, I've pulled that code
out into dependency_is_compatible_clause now.

> =====
> In the comment in dependencies_clauselist_selectivity(),
>
> |  /*
> |   * Technically we could find more than one clause for a given
> |   * attnum. Since these clauses must be equality clauses, we choose
> |   * to only take the selectivity estimate from the final clause in
> |   * the list for this attnum. If the attnum happens to be compared
> |   * to a different Const in another clause then no rows will match
> |   * anyway. If it happens to be compared to the same Const, then
> |   * ignoring the additional clause is just the thing to do.
> |   */
> |  if (dependency_implies_attribute(dependency,
> |                                   list_attnums[listidx]))
>
> If multiple clauses include the attribute, selectivity estimates
> for clauses other than the last one are a waste of time. Why take
> the last one rather than the first?

Why not the middle one? Really it's not expected to be a common case.
If someone writes: WHERE a = 1 and a = 2; then they'll likely not get
many results back. If the same clause is duplicated then well, it
won't be the only thing that does a little needless extra work. I
don't think optimising for this is worth the trouble.

>
> Even if all clauses should be added into estimatedclauses,
> calling clause_selectivity once is enough. Since
> clause_selectivity may return 1.0 for some clauses, using s2 for
> the decision seems reasonable.
>
> |  if (dependency_implies_attribute(dependency,
> |                                   list_attnums[listidx]))
> |  {
> |      clause = (Node *) lfirst(l);
> +      if (s2 == 1.0)
> |        s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,
>
> # This '==' works since it is not a result of a calculation.

I don't think this is an important optimisation. It's a corner case if
more than one match, although not impossible. I vote to leave it as
is, and not optimise the corner case.
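
To make the trade-off concrete, a small sketch under stated
assumptions: Clause, clause_sel and selectivity_calls are hypothetical
stand-ins, not the planner's types or functions. The first function
keeps the estimate of the last matching clause, the second applies the
"s2 == 1.0" guard from the review so that later duplicates are skipped.

```c
#include <stddef.h>

/* Hypothetical clause: the attribute it restricts and its selectivity. */
typedef struct
{
    int    attnum;
    double selectivity;
} Clause;

/* Counts how often the (stand-in) estimator is invoked. */
int selectivity_calls = 0;

/* Stand-in for an expensive per-clause selectivity estimate. */
double
clause_sel(const Clause *clause)
{
    selectivity_calls++;
    return clause->selectivity;
}

/* Current behaviour: every matching clause is estimated, last one wins. */
double
implied_selectivity_last(const Clause *clauses, size_t n, int attnum)
{
    double s2 = 1.0;

    for (size_t i = 0; i < n; i++)
    {
        if (clauses[i].attnum == attnum)
            s2 = clause_sel(&clauses[i]);
    }
    return s2;
}

/* Reviewed variant: stop estimating once s2 has moved away from 1.0. */
double
implied_selectivity_guarded(const Clause *clauses, size_t n, int attnum)
{
    double s2 = 1.0;

    for (size_t i = 0; i < n; i++)
    {
        if (clauses[i].attnum == attnum && s2 == 1.0)
            s2 = clause_sel(&clauses[i]);
    }
    return s2;
}
```

For a list such as (a = 1), (b = 2), (a = 1) both variants return the
same estimate for 'a'; the guarded one just makes one call instead of
two. That is the whole saving, which is why it only matters when the
same attribute shows up in more than one clause.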

> =====
> Still in dependencies_clauselist_selectivity,
> dependency_implies_attribute seems designed to return true for
> at least one clause in the clauses, but any failure leads to an
> infinite loop. I think some safeguard against that case is required.

I did consider this, but I really can't see a scenario in which this
is possible. find_strongest_dependency() would not have found a
dependency if dependency_implies_attribute() was going to fail, so
we'd have exited the loop already. I think it's safe providing that
'clauses_attnums' is in sync with the clauses that we'll examine in
the loop over the 'clauses' list. Perhaps the while loop should have
some safety valve, but I'm not all that sure what that would be, and
since I can't see how it could become an infinite loop, I've not
bothered to think too hard about what else might be done here.

I've attached an updated patch to address Tomas' concerns and yours too.

Thank you both for looking at my changes.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

mv_functional-deps_2017-04-06.patch

Re: multivariate statistics (v25)

From

Simon Riggs

Date:

05 April 2017, 21:52:39

On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:

>> I have some other comments.

Me too.


CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.

This change is in line with other changes in this and earlier
releases. Comments and docs included.

Patch ready to be applied directly barring objections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

create_statistics_lock_reduction.v1.patch

Re: multivariate statistics (v25)

From

"Tels"

Date:

05 April 2017, 22:19:04

Moin,

On Wed, April 5, 2017 2:52 pm, Simon Riggs wrote:
> On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>
>>> I have some other comments.
>
> Me too.
>
>
> CREATE STATISTICS should take ShareUpdateExclusiveLock like ANALYZE.
>
> This change is in line with other changes in this and earlier
> releases. Comments and docs included.
>
> Patch ready to be applied directly barring objections.

I know I'm a bit late, but isn't the syntax backwards?

"CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"

These do it the other way round:

CREATE INDEX idx ON table (col_a);

AND:

   CREATE TABLE t (
     id INT  REFERENCES table_2 (col_b)
   );

Won't this be confusing and make things hard to remember?

Sorry for not asking earlier, I somehow missed this.

Regards,

Tels

Re: multivariate statistics (v25)

From

David Rowley

Date:

06 April 2017, 01:16:41

On 6 April 2017 at 07:19, Tels <nospam-abuse@bloodgate.com> wrote:
> I know I'm a bit late, but isn't the syntax backwards?
>
> "CREATE STATISTICS s1 WITH (dependencies) ON (col_a, col_b) FROM table;"
>
> These do it the other way round:
>
> CREATE INDEX idx ON table (col_a);
>
> AND:
>
>    CREATE TABLE t (
>      id INT  REFERENCES table_2 (col_b)
>    );
>
> Won't this be confusing and make things hard to remember?
>
> Sorry for not asking earlier, I somehow missed this.

The reasoning is in [1]

[1] https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: multivariate statistics (v25)

From

Simon Riggs

Date:

06 April 2017, 01:17:58

On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:

> I've attached an updated patch to address Tomas' concerns and yours too.

Committed, with some doc changes and additions based upon my explorations.

For the record, I measured the time to calculate extended statistics as
+800ms on a 2 million row sample.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: multivariate statistics (v25)

From

David Rowley

Date:

06 April 2017, 01:22:26

On 6 April 2017 at 10:17, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2017 at 10:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
>> I've attached an updated patch to address Tomas' concerns and yours too.
>
> Committed, with some doc changes and additions based upon my explorations.

Great. Thanks for committing!


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services