Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 20210816013255.GS10479@telsasoft.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
> > Looking at the current behaviour, there are a couple of things that
> > seem a little odd, even though they are understandable. For example,
> > the fact that
> > 
> >   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> > 
> > fails, but
> > 
> >   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> > 
> > succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> > 
> >   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> > 
> > tends to suggest that it's going to create statistics on the pair of
> > expressions, describing their correlation, when actually it builds 2
> > independent statistics. Also, this error text isn't entirely accurate:
> > 
> >   CREATE STATISTICS s ON col FROM tbl;
> >   ERROR:  extended statistics require at least 2 columns
> > 
> > because extended statistics don't always require 2 columns, they can
> > also just have an expression, or multiple expressions and 0 or 1
> > columns.
> > 
> > I think a lot of this stems from treating "expressions" in the same
> > way as the other (multi-column) stats kinds, and it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
> > 
> > The first syntax would create single-column stats, and wouldn't accept
> > a statistics_kind argument, because there is only one kind of
> > single-column statistic. Maybe that might change in the future, but if
> > so, it's likely that the kinds of single-column stats will be
> > different from the kinds of multi-column stats.
> > 
> > In the second syntax, the only accepted kinds would be the current
> > multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> > would always build stats describing the correlations between the
> > columns listed. It would continue to build standard/expression stats
> > on any expressions in the list, but that's more of an implementation
> > detail.
> > 
> > It would no longer be possible to do "CREATE STATISTICS s
> > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> > issue 2 separate "CREATE STATISTICS" commands, but that seems more
> > logical, because they're independent stats.
> > 
> > The parsing code might not change much, but some of the errors would
> > be different. For example, the errors "building only extended
> > expression statistics on simple columns not allowed" and "extended
> > expression statistics require at least one expression" would go away,
> > and the error "extended statistics require at least 2 columns" might
> > become more specific, depending on the stats kind.

This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).

-- 
Justin



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: PoC/WIP: Extended statistics on expressions
Next
From: Andy Fan
Date:
Subject: Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)