Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: PoC/WIP: Extended statistics on expressions
Msg-id CAEZATCU9uPo7JYdx4k0-ufXXZH8t7itodibUwCva+s+AvAKcnw@mail.gmail.com
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user.

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns; they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON (expression)
    FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | (expression) } , { column_name | (expression) } [, ...]
    FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. That might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.
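
As a rough illustration of the second syntax (this assumes the
proposal above rather than the current patch, and reuses the
placeholder names from the earlier examples), something like

  -- multi-column MCV stats on the pair ((expr), col); standard stats
  -- on (expr) would also be built, as an implementation detail
  CREATE STATISTICS s (mcv) ON (expr), col FROM tbl;

would be accepted, with "mcv" being the only kind the user asks for.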

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.
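
A sketch of what that would look like under the proposed first
syntax (the names s1 and s2 are made up for illustration):

  -- two independent single-expression statistics, one command each
  CREATE STATISTICS s1 ON (expr1) FROM tbl;
  CREATE STATISTICS s2 ON (expr2) FROM tbl;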

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53


