Thread: Multi-column distinctness.

Multi-column distinctness.

From

Kyotaro HORIGUCHI

Date:

28 August 2015, 08:33:52

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.


This would apply fewer cases but I suppose still in many cases
the correlated colums would be in simple proportional
relationship, so this can help the cases. The previous discussion
is

https://wiki.postgresql.org/wiki/Cross_Columns_Stats
http://www.postgresql.org/message-id/4D0BA4D5.8080707@fuzzy.cz

This patch is covers only the type A (Discrete values and
equality conditions) but I think it is usable in many cases seen
in the field. So I'd like to repropose for the latest version of
PostgreSQL.


- design outline
Provide new system catalog pg_mvcoefficient to store theinformation required to do this.
A user can instruct planner to correct the wrong estimationcaused by inter-column correlation by registering the
columnsinpg_mvcoefficient using new DDL ALTER TABLE... ADD STATISTICS.
 
Analyzing of the target table also stores the 'multivariatecoefficient' calculated by using the following formula
intopg_mvcoefficient.
 mv_coef(c1, c2, ..) =   ndistinct(c1 * c2 * ...) / (ndistinct(c1) * ndistinct(c2) * ...)
In clauselist_selectivity, planner corrects the estimate ifgiven clauselist has equivalence-classes-compatible clauses
forrequiredcolumns at the top-level.
 


- Example
The attached perl script gentbl.pl generates test data resemblessome tables in DBT-3 benchmark.

> $ perl gentbl.pl | psql postgres

>  =# EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2501; ...
>  Seq Scan on t1  (cost=0.00..653.00 rows=1 width=12) (actual time=0.021..6.348 rows=8 loops=1)
This doesn't have no harm but in a join case,

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
>  Hash Join  (cost=122.00..855.32 rows=32 width=24)
>             (actual time=2.009..29.208 rows=32000 loops=1)
The correlation between a and b makes the estimate toosmall. Then register correlation setting.

> =# ALTER TABLE t1 ADD STATISTICS (mvndistinct) ON (a, b);
> =# ANALYZE t1;
Then the estimate will be corrected.

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
>  Hash Join  (cost=122.00..855.32 rows=32000 width=24)
>             (actual time=1.907..29.025 rows=32000 loops=1)


- Known limitations
The coefficient calculated by this feature is applicble only forconjunctions of simple var-exprs on merge-joinable
operator.
The coefficient is applied regardless of whether the baseestimate has been calculated using MCV, so estimates
fornon-joincases on the columns which has MCV can rather becomeinaccurate.
 
Uniform correlation is assumed so some extent of correlationununiformity would lead to wrong estimation.
This patch set doesn't contain any document yet.


- Patche Files
This patch consists of the following files.
- 0001-New-system-catalog-pg_mvcoefficient.patch Adds new system catalog pg_mvcoefficient.
- 0002-Analyze-part-for-multivariate-coefficient.patch Analyze part of multivariate coefficient.
- 0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch Planner part to make it use the multivariate
coefficient.
- 0004-Syntactical-part-of-multivariate-coefficient.patch Add new DDL to define mv coefficient columns.
The four files above are essential. The two following files areexperimental patch to add mvcattrs to index columns. One
ofthemadds a new opclass for int2vector of btree but it would beoverkill.
 
- 0005-Add-btree-operator-class-for-int2vector.patch Add btree operator class for int2vector.
- 0006-Use-modified-index-of-pg_mvcoefficient.patch Use modified index of pg_mvcoefficient.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Multi-column distinctness.

From

Bruce Momjian

Date:

05 September 2015, 19:47:53

On Fri, Aug 28, 2015 at 05:33:34PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, this patch enables planner to be couscious of inter-column
> correlation.
> 
> Sometimes two or more columns in a table has some correlation
> which brings underestimate, which leads to wrong join method and
> ends with slow execution.
> 
> Tomas Vondra is now working on heavily-equipped multivariate
> statistics for OLAP usage. In contrast, this is a lightly
> implemented solution which calculates only the ratio between a
> rows estimated by current method and a actual row number. I think
> this doesn't conflict with his work except the grammar part.

I am very glad multi-variate statistics are being addressed.  I think
this is our most common cause of optimizer failures.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Multi-column distinctness.

From

Simon Riggs

Date:

06 September 2015, 08:25:02

On 28 August 2015 at 09:33, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I think it very obviously does conflict, so I don't see this patch as appropriate.

If you think a cut version of Tomas' patch is appropriate, then the usual response is to give a review that says "Tomas, I think a cut down version is appropriate here, can we reduce the scope of this patch for now?". If you have done that and he refuses to listen, then a separate patch version is appropriate. Otherwise we should just reject this second patchset to avoid confusion and to avoid encouraging people to take this approach.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Simon Riggs

Date:

06 September 2015, 08:26:52

On 5 September 2015 at 20:46, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Aug 28, 2015 at 05:33:34PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, this patch enables planner to be couscious of inter-column
> correlation.
>
> Sometimes two or more columns in a table has some correlation
> which brings underestimate, which leads to wrong join method and
> ends with slow execution.
>
> Tomas Vondra is now working on heavily-equipped multivariate
> statistics for OLAP usage. In contrast, this is a lightly
> implemented solution which calculates only the ratio between a
> rows estimated by current method and a actual row number. I think
> this doesn't conflict with his work except the grammar part.

I am very glad multi-variate statistics are being addressed. I think
this is our most common cause of optimizer failures.

Glad to see you agree.

Why have you said this on *this* patch, when Tomas' patch has been very obviously addressing this problem since January, with very little traction.

Please add your backing to Tomas' patch.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

06 September 2015, 08:31:35

Hello Horiguchi-san,

On 08/28/2015 10:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this patch enables planner to be couscious of inter-column
> correlation.
>
> Sometimes two or more columns in a table has some correlation
> which brings underestimate, which leads to wrong join method and
> ends with slow execution.
>
> Tomas Vondra is now working on heavily-equipped multivariate
> statistics for OLAP usage. In contrast, this is a lightly
> implemented solution which calculates only the ratio between a
> rows estimated by current method and a actual row number. I think
> this doesn't conflict with his work except the grammar part.

I'd like to sign up as a reviewer of this patch, if you're OK with that. 
I don't see the patch in any of the commitfests, though :-(

Now, a few comments based on reading the patches (will try testing them 
in the next few days, hopefully).

1) catalog design
-----------------
I see you've invented a catalog with a slightly different structure than 
I use in the multivariate stats patch, storing the coefficient for each 
combination of columns in a separate row. The idea in the multivariate 
patch is to generate all possible combinations of columns and store the 
coefficients packed in a single row, which requires implementing a more 
complicated data structure. But your patch does not do that (only 
computes coefficient for the one combination).

This is another potential incompatibility with the multivariate patch, 
although it's rather in the background and should not be difficult to 
rework if needed.

2) find_mv_coeffeicient
---------------------
I don't quite see why this is the right thing
    /* Prefer smaller one */    if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)        mv_coef =
mvc->mvccoefficient;

i.e. why it's correct to choose the record with the lower coefficient 
and not the one with most columns (the largest subset). Maybe there's 
some rule that those are in fact the same thing (seems like adding a 
column can only lower the coefficient), but it's not quite obvious and 
would deserve a comment.

3) single statistics limitation
-------------------------------
What annoys me a bit is that this patch only applies a single statistics 
- it won't try "combining" multiple smaller ones (even if they don't 
overlap, which shouldn't be that difficult).

4) no sub-combinations
----------------------
Another slightly annoying thing is that the patch computes statistics 
only for the specified combination of columns, and no subsets. So if you 
have enabled stats on [a,b,c] you won't be able to use that for 
conditions on [a,b], for example. Sure, the user can create the stats 
manually, but IMO the less burden we place on the user the better. It's 
enough that we force users to create the stats at all, we shouldn't 
really force them co create all possible combinations.

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to 
commit the multivariate stats patch. I have no intention in blocking 
this patch for that reasons, but if we might design the syntax to make 
it compatible with the multivariate patch, that'd be nice. But it seems 
to me the syntax is pretty much the same, no?

I.e. it uses
    ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the 
stattypes in ATExecAddDropMvStatistics except for checking there's a 
single entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit 
confusing. Maybe two separate methods would be better?

6) GROUP BY
-----------
One thing I'd really love to see is improvement of GROUP BY clauses. 
That's one of the main sources of pain in analytical queries, IMHO, 
resulting in either OOM issues in Hash Aggregate, or choice of sort 
paths where hash would be more efficient.

regards
Tomas

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

06 September 2015, 08:46:26

Hello,

On 09/06/2015 10:24 AM, Simon Riggs wrote:
> On 28 August 2015 at 09:33, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp
> <mailto:horiguchi.kyotaro@lab.ntt.co.jp>> wrote:
>
>     Tomas Vondra is now working on heavily-equipped multivariate
>     statistics for OLAP usage. In contrast, this is a lightly
>     implemented solution which calculates only the ratio between a
>     rows estimated by current method and a actual row number. I think
>     this doesn't conflict with his work except the grammar part.
>
>
> I think it very obviously does conflict, so I don't see this patch as
> appropriate.
>
> If you think a cut version of Tomas' patch is appropriate, then the
> usual response is to give a review that says "Tomas, I think a cut down
> version is appropriate here, can we reduce the scope of this patch for
> now?". If you have done that and he refuses to listen, then a separate
> patch version is appropriate. Otherwise we should just reject this
> second patchset to avoid confusion and to avoid encouraging people to
> take this approach.

FWIW Horiguchi-san is one of the few people who actually took time to 
review the multivariate stats patch, and I don't quite see this patch as 
conflicting with the multivariate one.

It implements a small subset of the (much larger) multivariate stats 
patch, and reusing it within my patch should not be a big deal. 
Actually, this type of statistics was proposed by Horiguchi-san himself, 
and the multivariate patch does not implement it yet (although I intend 
to address that soon).

So no conflict here. Of course, we need to be a bit careful to make it 
compatible (especially the syntax part).

kind regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

06 September 2015, 10:04:34

<p dir="ltr"><br /> On Sep 6, 2015 10:31, "Tomas Vondra" <<a
href="mailto:tomas.vondra@2ndquadrant.com">tomas.vondra@2ndquadrant.com</a>>wrote:<br /> ><br /> > 5)
syntax<br/> > ---------<br /> > The syntax might be one of the pain points if we eventually decide to commit the
multivariatestats patch. I have no intention in blocking this patch for that reasons, but if we might design the syntax
tomake it compatible with the multivariate patch, that'd be nice. But it seems to me the syntax is pretty much the
same,no?<br /> ><br /> > I.e. it uses<br /> ><br /> >     ADD STATISTICS (options) ON (columns)<br />
><br/> > just like the multivariate patch, no? Well, it doesn't really check the stattypes in
ATExecAddDropMvStatisticsexcept for checking there's a single entry, but the syntax seems OK.<br /> ><br /> > BTW
mixingADD and DROP in ATExecAddDropMvStatistics seems a bit confusing. Maybe two separate methods would be better?<br
/>><p dir="ltr">BTW one more comment about the syntax - you ran into the same conflict between "ADD [COLUMN] column"
and"ADD STATISTICS" like I did, but you solved it by making the COLUMN required while I made STATISTICS keyword.<p
dir="ltr">I'mnot enthusiastic about the keyword thing, but making the COLUMN required is certainly much worse as it
breaksmany existing scripts. The keyword inky breaks cases that manipulate "statistics" column.<p dir="ltr">If any of
thisis unacceptable, then we probably need to come up with a different syntax.<p dir="ltr">Regards<p dir="ltr">-- <br
/>Tomas Vondra                   <a href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a><p
dir="ltr">PostgreSQLDevelopment, 24x7 Support, Remote DBA, Training & Services<br />

Re: Multi-column distinctness.

From

Kyotaro HORIGUCHI

Date:

07 September 2015, 02:58:57

Hello,

Thank you for pointing that. It is one crucial point of this
patch. Sorry for not mentioning on the point.

At Sun, 6 Sep 2015 09:24:48 +0100, Simon Riggs <simon@2ndQuadrant.com> wrote in
<CANP8+j+F+DzrzCW1cjK1Up009TAytYN=P_DNsJ4OZEUJEXywjA@mail.gmail.com>
> > Tomas Vondra is now working on heavily-equipped multivariate
> > statistics for OLAP usage. In contrast, this is a lightly
> > implemented solution which calculates only the ratio between a
> > rows estimated by current method and a actual row number. I think
> > this doesn't conflict with his work except the grammar part.
> >
> 
> I think it very obviously does conflict, so I don't see this patch as
> appropriate.
> 
> If you think a cut version of Tomas' patch is appropriate, then the usual
> response is to give a review that says "Tomas, I think a cut down version
> is appropriate here, can we reduce the scope of this patch for now?". If
> you have done that and he refuses to listen, then a separate patch version
> is appropriate. Otherwise we should just reject this second patchset to
> avoid confusion and to avoid encouraging people to take this approach.

You are absolutely right generally and I agree if this is 'a cut
version' of Tomas's patch. I might have wrong concept about size
of a piece of work.

I will discontinue this patch if Tomas and/or Simon, or many
think this as inappropriate to be brought up now (or ever after)
after reading the following explanation.

======
I already asked Tomas to *add* this feature in his patch and got
a reply that it will be after the completion of undergoing work.

Tomas's patch and mine are generally aiming similar objective but
as discussed with Tomas I understood there's some crucial
differences between them.

He is considering more precise and widely-applicable estimation
baesd on a firm theoretical basis, and as described in the
ciation above, it is aiming OLAP usage and allowing rather
complex calculation. It will be reduced through future
discussions but the priority to do so is not so high for now.

Although Tomas's patch is very complex and needs more labor to
complete, resolving the wrong prediction caused by multicolumn
correlation (especially on OLTP usage) is demanded. So I tried a
patch that suit the objective. It has only rough distinctness
coefficient and doesn't have MV-MCV, MV-HISTOGRAM and strict
functional dependency but needs quire small storage and less
calculation.

The two are so different in concrete objective and
characteristics and does not have common functional piece exept
grammer part so I concluded that this patch doesn't break the
foothold (standpoint?) of Toamas's patch and we can continue to
work on it as we did until now.

This is why I think this is not a cut-down version of Tomas's and
dosn't break or conflict with it. But if many don't think so, I
should dismiss this, of course.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Multi-column distinctness.

From

Kyotaro HORIGUCHI

Date:

07 September 2015, 03:13:29

Hello,

> FWIW Horiguchi-san is one of the few people who actually took time to
> review

I personally think such kind of things should not to be counted
in judging this issue:)

> the multivariate stats patch, and I don't quite see this patch
> as conflicting with the multivariate one.
> 
> It implements a small subset of the (much larger) multivariate stats
> patch, and reusing it within my patch should not be a big
> deal. Actually, this type of statistics was proposed by Horiguchi-san
> himself, and the multivariate patch does not implement it yet
> (although I intend to address that soon).

Year. I think we can deal Tomas's patch as natural enhancement of
this patch and it is not so bad as a vangard(?).

> So no conflict here. Of course, we need to be a bit careful to make it
> compatible (especially the syntax part).

I agree with it. They don't make bad conflict but we might stuc
by wrong decision in API, like syntax will.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Multi-column distinctness.

From

Kyotaro HORIGUCHI

Date:

07 September 2015, 03:25:27

Hello,

> > 5) syntax
> > ---------
> > The syntax might be one of the pain points if we eventually decide to
> commit the multivariate stats patch. I have no intention in blocking this
> patch for that reasons, but if we might design the syntax to make it
> compatible with the multivariate patch, that'd be nice. But it seems to me
> the syntax is pretty much the same, no?
> >
> > I.e. it uses
> >
> >     ADD STATISTICS (options) ON (columns)
> >
> > just like the multivariate patch, no? Well, it doesn't really check the
> stattypes in ATExecAddDropMvStatistics except for checking there's a single
> entry, but the syntax seems OK.
>
> > BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit
> confusing. Maybe two separate methods would be better?

No problem.

> BTW one more comment about the syntax - you ran into the same conflict
> between "ADD [COLUMN] column" and "ADD STATISTICS" like I did, but you
> solved it by making the COLUMN required while I made STATISTICS keyword.
> I'm not enthusiastic about the keyword thing,

I don't have firm idea that it has the priority than duplication
of syntax, but I followed the policy to reduce keywords. Which do
you think is more acceptable between the additional definition
caused by new keywords and syntax splitting caused by avoiding
them for this case?

> but making the COLUMN
> required is certainly much worse as it breaks many existing scripts. The
> keyword inky breaks cases that manipulate "statistics" column.

Ouch! It is simply by accident, or my lack of carefulness. I will
come up with fixed syntax later..

> If any of this is unacceptable, then we probably need to come up with a
> different syntax.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

11 September 2015, 16:58:31

On 09/07/2015 05:25 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
>> but making the COLUMN
>> required is certainly much worse as it breaks many existing scripts. The
>> keyword inky breaks cases that manipulate "statistics" column.
>
> Ouch! It is simply by accident, or my lack of carefulness. I will
> come up with fixed syntax later..
>
>> If any of this is unacceptable, then we probably need to come up with a
>> different syntax.

I've been thinking about the syntax, and I think both options (making 
COLUMN required or making STATISTICS a reserved keyword) will break 
something no matter what we do, forcing the users to either always use 
ADD COLUMN or quote all the existing uses of "statistics" (as column 
names, for example).

Maybe the best solution is to abandon the ALTER TABLE approach entirely, 
and instead invent a new set of commands
  CREATE STATISTICS  DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics, just 
like we do for constraints, for example. Otherwise the DROP STATISTICS 
handling is rather awkward - for example if the user creates stats twice 
by mistake, he's unable to drop just one of them.

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

20 October 2015, 00:16:12

Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:>
> Maybe the best solution is to abandon the ALTER TABLE approach
> entirely, and instead invent a new set of commands
>
>    CREATE STATISTICS
>    DROP STATISTICS
>
> (ALTER STATISTICS seems a bit excessive at this point).
>
> Another thing is that perhaps we should add names for statistics,
> just like we do for constraints, for example. Otherwise the DROP
> STATISTICS handling is rather awkward - for example if the user
> creates stats twice by mistake, he's unable to drop just one of them.

Do you think this modified syntax makes sense? I'll have time to hack on 
this over the next few days.

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Simon Riggs

Date:

20 October 2015, 11:15:00

On 19 October 2015 at 20:16, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:
>
Maybe the best solution is to abandon the ALTER TABLE approach
entirely, and instead invent a new set of commands

CREATE STATISTICS
DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics,
just like we do for constraints, for example. Otherwise the DROP
STATISTICS handling is rather awkward - for example if the user
creates stats twice by mistake, he's unable to drop just one of them.

Do you think this modified syntax makes sense? I'll have time to hack on this over the next few days.

Koyotaro's changes to force COLUMN to be required won't get through.

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use ALTER TABLE rather than inventing a new command. 5 minute change...

Unless there is some better reason for a whole new command?

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

20 October 2015, 14:51:48

Hi,

On 10/20/2015 01:14 PM, Simon Riggs wrote:
> On 19 October 2015 at 20:16, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     Hello Kyotaro-san,
>
>     On 09/11/2015 06:58 PM, Tomas Vondra wrote:
>     >
>
>         Maybe the best solution is to abandon the ALTER TABLE approach
>         entirely, and instead invent a new set of commands
>
>             CREATE STATISTICS
>             DROP STATISTICS
>
>         (ALTER STATISTICS seems a bit excessive at this point).
>
>         Another thing is that perhaps we should add names for statistics,
>         just like we do for constraints, for example. Otherwise the DROP
>         STATISTICS handling is rather awkward - for example if the user
>         creates stats twice by mistake, he's unable to drop just one of
>         them.
>
>
>     Do you think this modified syntax makes sense? I'll have time to
>     hack on this over the next few days.
>
>
> Koyotaro's changes to force COLUMN to be required won't get through.
>
> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
> ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good 
choice as it suggest the statistics is actually built, but that only 
happens during ANALYZE. But otherwise this seems to solve the issues 
with keywords and it's quite simple.

>
> Unless there is some better reason for a whole new command?

Not really. The other proposal (adding names for statistics) does not 
require new command. The one thing that would require new command is 
building statistics on multiple tables (for join estimation), but I 
don't have any idea of how that would actually work.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Robert Haas

Date:

20 October 2015, 15:22:38

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Koyotaro's changes to force COLUMN to be required won't get through.
>>
>> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
>> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
>> ALTER TABLE rather than inventing a new command. 5 minute change...
>
> That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
> as it suggest the statistics is actually built, but that only happens during
> ANALYZE. But otherwise this seems to solve the issues with keywords and it's
> quite simple.

But ADD is no better there.  I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name.  In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Multi-column distinctness.

From

Tom Lane

Date:

20 October 2015, 15:59:49

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>>> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
>>>> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
>>>> ALTER TABLE rather than inventing a new command. 5 minute change...

>> That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
>> as it suggest the statistics is actually built, but that only happens during
>> ANALYZE. But otherwise this seems to solve the issues with keywords and it's
>> quite simple.

> But ADD is no better there.  I think ALTER TABLE .. COLLECT STATISTICS
> isn't any worse than ALTER TABLE ... CLUSTER ON index_name.  In both
> cases, it means, when you do this operation, do it this way.

> I would suggest that instead of DROP or REMOVE, the opposite should be
> ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?
        regards, tom lane

Re: Multi-column distinctness.

From

Simon Riggs

Date:

20 October 2015, 16:51:27

On 20 October 2015 at 11:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>>> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
>>>> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
>>>> ALTER TABLE rather than inventing a new command. 5 minute change...

>> That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
>> as it suggest the statistics is actually built, but that only happens during
>> ANALYZE. But otherwise this seems to solve the issues with keywords and it's
>> quite simple.

> But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
> isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
> cases, it means, when you do this operation, do it this way.

> I would suggest that instead of DROP or REMOVE, the opposite should be
> ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?

The earlier patch changed the grammar for the DROP (column) subcommand, which I am saying is not acceptable.

So by using an alternate keyword we are able to keep the existing syntax untouched.

I suggested the word COLLECT since that is another word commonly used in conjunction with STATISTICS (at least in DB2 and Teradata).

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Robert Haas

Date:

20 October 2015, 18:03:45

On Tue, Oct 20, 2015 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
>>>>> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
>>>>> ALTER TABLE rather than inventing a new command. 5 minute change...
>
>>> That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
>>> as it suggest the statistics is actually built, but that only happens during
>>> ANALYZE. But otherwise this seems to solve the issues with keywords and it's
>>> quite simple.
>
>> But ADD is no better there.  I think ALTER TABLE .. COLLECT STATISTICS
>> isn't any worse than ALTER TABLE ... CLUSTER ON index_name.  In both
>> cases, it means, when you do this operation, do it this way.
>
>> I would suggest that instead of DROP or REMOVE, the opposite should be
>> ALTER TABLE .. NO COLLECT STATISTICS.
>
> Why is this an improvement over using already-existing keywords?

Well, if we can use existing keywords, that's better still.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

20 October 2015, 18:22:22


On 10/20/2015 05:59 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>> ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
>>>>> use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
>>>>> ALTER TABLE rather than inventing a new command. 5 minute change...
>
>>> That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
>>> as it suggest the statistics is actually built, but that only happens during
>>> ANALYZE. But otherwise this seems to solve the issues with keywords and it's
>>> quite simple.
>
>> But ADD is no better there.  I think ALTER TABLE .. COLLECT STATISTICS
>> isn't any worse than ALTER TABLE ... CLUSTER ON index_name.  In both
>> cases, it means, when you do this operation, do it this way.
>
>> I would suggest that instead of DROP or REMOVE, the opposite should be
>> ALTER TABLE .. NO COLLECT STATISTICS.
>
> Why is this an improvement over using already-existing keywords?

The problem is that the actual syntax is ADD [COLUMN], so we can't 
simply use ADD STATISTICS as that would mean a conflict in the grammar. 
Resolving it means either making COLUMN non-optional, or adding 
STATISTICS to reserved keywords - both options break existing code.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tom Lane

Date:

20 October 2015, 18:51:17

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 10/20/2015 05:59 PM, Tom Lane wrote:
>> Why is this an improvement over using already-existing keywords?

> The problem is that the actual syntax is ADD [COLUMN], so we can't 
> simply use ADD STATISTICS as that would mean a conflict in the grammar. 
> Resolving it means either making COLUMN non-optional, or adding 
> STATISTICS to reserved keywords - both options break existing code.

I'm unconvinced that it cannot be made to work.  The proposal was
something like  ALTER TABLE t ADD STATISTICS ON column-names ...
no?  ON is already fully reserved, which means that this is not ambiguous.
Or if you would rather not use ON, another way of making it not ambiguous
would be to put the column-names list in parentheses.

It's entirely possible that some refactoring of the grammar would
be needed to make it work, of course.
        regards, tom lane

Re: Multi-column distinctness.

From

Tom Lane

Date:

20 October 2015, 19:12:07

I wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> The problem is that the actual syntax is ADD [COLUMN], so we can't 
>> simply use ADD STATISTICS as that would mean a conflict in the grammar. 
>> Resolving it means either making COLUMN non-optional, or adding 
>> STATISTICS to reserved keywords - both options break existing code.

> I'm unconvinced that it cannot be made to work.

Specifically, this works just fine for me as a variant of alter_table_cmd:
        | ADD_P STATISTICS ON '(' name_list ')'

and I can get this to work too
        | DROP STATISTICS ON '(' name_list ')'

if I expand out the "DROP opt_column" productions to be two separate
productions with and without COLUMN, as was done long ago for ADD COLUMN.

These also work without ON, actually, though it does not work to leave out
the parens.  (I think that has to do with the conflict against multiple
ALTER TABLE subcommands, not against ADD COLUMN.)

It doesn't look like we could easily stick a "name" in between STATISTICS
and ON, but I'm unconvinced that that's necessary.  Can't we just insist
that there be only one statistics declaration for a given column list?
        regards, tom lane

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

20 October 2015, 20:48:27

Hi,

On 10/20/2015 09:11 PM, Tom Lane wrote:
> I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> The problem is that the actual syntax is ADD [COLUMN], so we
>>> can't simply use ADD STATISTICS as that would mean a conflict in
>>> the grammar. Resolving it means either making COLUMN
>>> non-optional, or adding STATISTICS to reserved keywords - both
>>> options break existing code.
>
>> I'm unconvinced that it cannot be made to work.
>
> Specifically, this works just fine for me as a variant of
> alter_table_cmd:
>
>             | ADD_P STATISTICS ON '(' name_list ')'
>
> and I can get this to work too
>
>             | DROP STATISTICS ON '(' name_list ')'
>
> if I expand out the "DROP opt_column" productions to be two separate
> productions with and without COLUMN, as was done long ago for ADD
> COLUMN.
>
> These also work without ON, actually, though it does not work to
> leave out the parens. (I think that has to do with the conflict
> against multiple ALTER TABLE subcommands, not against ADD COLUMN.)
>
> It doesn't look like we could easily stick a "name" in between
> STATISTICS and ON, but I'm unconvinced that that's necessary.

Well, it's definitely easier to reference the statistics by name (e.g. 
in the DROP command).

> Can't we just insist that there be only one statistics declaration
> for a given column list?

I would rather not, because I envision preferring different types of 
statistics for different types of queries. For example the statistics 
discussed in this particular thread only supports equality clauses, so 
this restriction would mean we can't also define histogram, we'll be 
unable to estimate queries with ranges.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Simon Riggs

Date:

20 October 2015, 21:28:51

On 20 October 2015 at 16:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

and I can get this to work too

| DROP STATISTICS ON '(' name_list ')'

if I expand out the "DROP opt_column" productions to be two separate
productions with and without COLUMN, as was done long ago for ADD
COLUMN.

These also work without ON, actually, though it does not work to
leave out the parens. (I think that has to do with the conflict
against multiple ALTER TABLE subcommands, not against ADD COLUMN.)

It doesn't look like we could easily stick a "name" in between
STATISTICS and ON, but I'm unconvinced that that's necessary.

Well, it's definitely easier to reference the statistics by name (e.g. in the DROP command).

Can't we just insist that there be only one statistics declaration
for a given column list?

I would rather not, because I envision preferring different types of statistics for different types of queries. For example the statistics discussed in this particular thread only supports equality clauses, so this restriction would mean we can't also define histogram, we'll be unable to estimate queries with ranges.

Can we do something like this...

ADD STATISTICS ON (col list) USING (histogram, MFV)

so we have Types/Methods of statistic, rather than specific names for the statistic entry?

Since this command doesn't actually ADD the statistics, it just creates a specification used by the next ANALYZE, it would seem better to use a different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP STATISTICS ON

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tomas Vondra

Date:

21 October 2015, 00:57:47


On 10/20/2015 11:28 PM, Simon Riggs wrote:
> On 20 October 2015 at 16:48, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:>> On 10/20/2015 09:11 PM, Tom Lane wrote:
>>
>> Can't we just insist that there be only one statistics declaration
>> for a given column list? >>
>
> I would rather not, because I envision preferring different types of
> statistics for different types of queries. For example the statistics
> discussed in this particular thread only supports equality clauses,
> so this restriction would mean we can't also define histogram, we'll
> be unable to estimate queries with ranges.
>
>
> Can we do something like this...
>
> ADD STATISTICS ON (col list) USING (histogram, MFV)
> so we have Types/Methods of statistic, rather than specific names for
> the statistic entry?

That's how it works now (comparing columns and types of stats), but I 
find it awkward. That's why I proposed adding the name.

>
> Since this command doesn't actually ADD the statistics, it just creates
> a specification used by the next ANALYZE, it would seem better to use a
> different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP
> STATISTICS ON

Maybe, although we should not use DROP with DECLARE. Not only DROP has 
the same issue with name as ADD, but I think it's a good practice to 
commands "paired" ADD-DROP and DECLARE-X?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-column distinctness.

From

Tom Lane

Date:

21 October 2015, 01:06:26

Tomas Vondra <tomas.vondra@2ndQuadrant.com> writes:
> On 10/20/2015 11:28 PM, Simon Riggs wrote:
>> Since this command doesn't actually ADD the statistics, it just creates
>> a specification used by the next ANALYZE, it would seem better to use a
>> different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP
>> STATISTICS ON

> Maybe, although we should not use DROP with DECLARE. Not only DROP has 
> the same issue with name as ADD, but I think it's a good practice to 
> commands "paired" ADD-DROP and DECLARE-X?

If we spelled the drop case as "DROP STATISTICS name", I think the
conflict could be avoided.  I agree though that pairing DECLARE and DROP
seems rather unintuitive.

I am getting more attracted to your suggestion of making these things
stand-alone commands "CREATE STATISTICS" and "DROP STATISTICS".  Not only
does that fix the syntactic problems of shoehorning them into ALTER TABLE,
but it would be possible to design the syntax to allow for straightforward
extension to multiple-table cases.  Even if we don't know what we'd do
with that just yet, it seems to me like a possible future extension that
we should keep in mind.  And anything based on ALTER TABLE just cannot do
that reasonably.

So consider

CREATE STATISTICS name ON table(columnlist) [, table(columnlist) ...]
[ WITH options ]
        regards, tom lane

Re: Multi-column distinctness.

From

Simon Riggs

Date:

21 October 2015, 01:14:33

On 20 October 2015 at 21:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I am getting more attracted to your suggestion of making these things
stand-alone commands "CREATE STATISTICS" and "DROP STATISTICS". Not only
does that fix the syntactic problems of shoehorning them into ALTER TABLE,
but it would be possible to design the syntax to allow for straightforward
extension to multiple-table cases. Even if we don't know what we'd do
with that just yet, it seems to me like a possible future extension that
we should keep in mind. And anything based on ALTER TABLE just cannot do
that reasonably.

Good argument, so now I understand and agree with Tomas' original suggestion

So consider

CREATE STATISTICS name ON table(columnlist) [, table(columnlist) ...]
[ WITH options ]

Seems good

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services