Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id a2be7e05-e6b7-1467-d15a-49011e45a4b8@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers

On 8/16/21 3:32 AM, Justin Pryzby wrote:
> On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
>>> Looking at the current behaviour, there are a couple of things that
>>> seem a little odd, even though they are understandable. For example,
>>> the fact that
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
>>>
>>> fails, but
>>>
>>>    CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
>>>
>>> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
>>>
>>> tends to suggest that it's going to create statistics on the pair of
>>> expressions, describing their correlation, when actually it builds 2
>>> independent statistics. Also, this error text isn't entirely accurate:
>>>
>>>    CREATE STATISTICS s ON col FROM tbl;
>>>    ERROR:  extended statistics require at least 2 columns
>>>
>>> because extended statistics don't always require 2 columns, they can
>>> also just have an expression, or multiple expressions and 0 or 1
>>> columns.
>>>
>>> I think a lot of this stems from treating "expressions" in the same
>>> way as the other (multi-column) stats kinds, and it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      ON (expression)
>>>      FROM table_name
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      [ ( statistics_kind [, ... ] ) ]
>>>      ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>      FROM table_name
>>>
>>> The first syntax would create single-column stats, and wouldn't accept
>>> a statistics_kind argument, because there is only one kind of
>>> single-column statistic. Maybe that might change in the future, but if
>>> so, it's likely that the kinds of single-column stats will be
>>> different from the kinds of multi-column stats.
>>>
>>> In the second syntax, the only accepted kinds would be the current
>>> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
>>> would always build stats describing the correlations between the
>>> columns listed. It would continue to build standard/expression stats
>>> on any expressions in the list, but that's more of an implementation
>>> detail.
>>>
>>> It would no longer be possible to do "CREATE STATISTICS s
>>> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
>>> issue 2 separate "CREATE STATISTICS" commands, but that seems more
>>> logical, because they're independent stats.
>>>
>>> The parsing code might not change much, but some of the errors would
>>> be different. For example, the errors "building only extended
>>> expression statistics on simple columns not allowed" and "extended
>>> expression statistics require at least one expression" would go away,
>>> and the error "extended statistics require at least 2 columns" might
>>> become more specific, depending on the stats kind.
> 
> This still seems odd:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> It seems wrong that the command works with added parens, but builds expression
> stats on a simple column (which is redundant with what analyze does without
> extended stats).
> 

Well, yeah. But I think this is a behavior that was discussed somewhere 
in this thread, and the agreement was that it's not worth the 
complexity, as this comment explains

   * XXX We do only the bare minimum to separate simple attribute and
   * complex expressions - for example "(a)" will be treated as a complex
   * expression. No matter how elaborate the check is, there'll always be
   * a way around it, if the user is determined (consider e.g. "(a+0)"),
   * so it's not worth protecting against it.

Of course, maybe that wasn't the right decision - it's a bit weird that

   CREATE INDEX on t ((a), (b))

actually "extracts" the column references and stores that in indkeys, 
instead of treating that as expressions.

Patch 0001 fixes the "double parens" issue discussed elsewhere in this 
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a 
simple column reference.

But I'm not sure 0002 is something we can do without catversion bump. 
What if someone created such "bogus" statistics? It's mostly harmless, 
because the statistics is useless anyway (AFAICS we'll just use the 
regular one we have for the column), but if they do pg_dump, that'll 
fail because of this new restriction.

OTOH we're still "only" in beta, and IIRC the rule is not to bump 
catversion after rc1.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: Non-decimal integer literals
Next
From: Álvaro Herrera
Date:
Subject: Re: Autovacuum on partitioned table (autoanalyze)