PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

   Fairly straightforward, I think. I'm sure the logic deciding which
   expressions are allowed is not 100% correct (e.g. volatile functions
   etc.), but that's a detail we can deal with later.

2) store the expressions in pg_statistic_ext catalog

   I ended up adding a separate column, similar to indexprs, except that
   the order of columns/expressions does not matter, so we don't need to
   bother with storing 0s in stxkeys - we simply consider expressions to
   be "after" all the simple columns.

3) build statistics

   This should work for all three types of statistics we have (mcv,
   dependencies and ndistinct), although the code changes are often very
   hackish, done just "to make it work".

   The main challenge here was how to represent the expressions in the
   statistics - e.g. in ndistinct, which tracks ndistinct estimates for
   combinations of parameters, and so far we used attnums for that. I
   decided the easiest way is to keep doing that, but offset the
   expressions by MaxHeapAttributeNumber. That seems to work, but maybe
   there's a better way.

4) apply the statistics

   This is the hard part, really, and the exact state of the support
   depends on type of statistics.

   For ndistinct coefficients, it generally works. I'm sure there may be
   bugs in estimate_num_groups, etc. but in principle it works.

   For MCV lists, it generally works too - you can define statistics on
   the expressions and the estimates should improve. The main downside
   here is that it requires at least two expressions, otherwise we can't
   build/apply the extended statistics. So for example

      SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

   may be estimated "correctly", but once you drop either of the
   conditions it gets much worse, as we don't have stats for the
   individual expressions.
   That's rather annoying - it does not break the extended MCV, but the
   behavior will certainly cause confusion.

   For functional dependencies, the estimation does not work yet. Also,
   the missing per-column statistics have a bigger impact than on MCV
   lists, because while MCV can work fine without them, the dependencies
   heavily rely on the per-column estimates. We only apply "corrections"
   based on the dependency degree, so we still need (good) per-column
   estimates, which does not quite work with the expressions.


   Of course, the lack of per-expression statistics may be somewhat
   fixed by adding indexes on expressions, but that's kinda expensive.


Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and run queries with mod() expressions on
it. It might just as well be date_trunc() or something similar - that
would work too.


table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;


poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4)
                (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                              QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8)
                (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8)
                      (actual rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)


improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                              QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8)
                (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8)
                      (actual rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4)
                (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)


Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>    This is the hard part, really, and the exact state of the support
>    depends on type of statistics.
> 
>    For ndistinct coefficients, it generally works. I'm sure there may be
>    bugs in estimate_num_groups, etc. but in principle it works.
> 
>    For MCV lists, it generally works too - you can define statistics on
>    the expressions and the estimates should improve. The main downside
>    here is that it requires at least two expressions, otherwise we can't
>    build/apply the extended statistics. So for example
> 
>       SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>    may be estimated "correctly", but once you drop either of the
>    conditions it gets much worse, as we don't have stats for the
>    individual expressions.
>    That's rather annoying - it does not break the extended MCV, but the
>    behavior will certainly cause confusion.
> 
>    For functional dependencies, the estimation does not work yet. Also,
>    the missing per-column statistics have a bigger impact than on MCV
>    lists, because while MCV can work fine without them, the dependencies
>    heavily rely on the per-column estimates. We only apply "corrections"
>    based on the dependency degree, so we still need (good) per-column
>    estimates, which does not quite work with the expressions.
> 
> 
>    Of course, the lack of per-expression statistics may be somewhat
>    fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.
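
A minimal sketch of what I have in mind (the exact syntax is still open -
the point is that per-expression pg_statistic rows get built and stashed
in pg_statistic_ext_data automatically whenever the definition contains
an expression):

   CREATE TABLE t (a int);
   INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);

   -- a single expression, no explicit statistics kind needed
   CREATE STATISTICS s ON (mod(a,10)) FROM t;
   ANALYZE t;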

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.


Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression
statistics (i.e. it's similar to having an expression index)

* it allows using expressions in the definition of extended statistics,
and properly handles that in all existing statistics kinds (dependencies,
mcv, ndistinct)


The expression handling mostly copies what we do for indexes, with
similar restrictions - no volatile functions, aggregates, etc. The list
of expressions is stored in the pg_statistic_ext catalog, but unlike for
indexes we don't need to worry about the exact order of elements, so
there are no "0" entries for expressions in stxkeys, etc. We simply
assume the expressions come after the simple columns, and that's it.
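
For example (the column name stxexprs is just what I'm using here - the
point is that plain columns stay in stxkeys, while the expressions live
in a separate list of expression nodes):

    CREATE TABLE t (a int, b int);
    CREATE STATISTICS s ON a, (a + b) FROM t;

    -- stxkeys lists only the simple column; the expression is stored
    -- separately and is treated as coming "after" the plain columns
    SELECT stxkeys, pg_get_expr(stxexprs, stxrelid)
      FROM pg_statistic_ext WHERE stxname = 's';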

To reference expressions in the built statistics (e.g. in a dependency)
we use "special attnums" computed from the expression index by adding
MaxHeapAttributeNumber. So the first expression has attnum 1601 etc.

This mapping of expressions to attnums is used both while building the
statistics and while applying them to clauses, as it makes the whole
process much simpler than dealing with attnums and expressions entirely
separately.


The first part allows us to do something like this:

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

    CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

                                 QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=5000 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.216 ms
     Execution Time: 157.552 ms
    (5 rows)

while with the statistics we get this

                                 QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=100900 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.399 ms
     Execution Time: 157.530 ms
    (5 rows)

So that's pretty nice improvement. In practice you could get the same
effect by creating an expression index

    CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's a cost to maintain the
index, it blocks HOT updates, and it takes space on disk. The statistics
have none of these issues.

Implementation-wise, this simply builds per-column statistics for each
expression, and stashes them into a new column in the
pg_statistic_ext_data catalog as an array of elements of the
pg_statistic composite type. Then in selfuncs.c we look not just at
indexes, but also at this column when looking for expression stats.
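
To illustrate, a purely hypothetical inspection query for the statistics
object "s" from the example above (the column name stxdexprs and the
exact representation are just my assumptions):

    -- peek at the stashed per-expression pg_statistic rows
    SELECT e.stxname,
           (d.stxdexprs[1]).stanullfrac,
           (d.stxdexprs[1]).stadistinct
      FROM pg_statistic_ext e
      JOIN pg_statistic_ext_data d ON d.stxoid = e.oid
     WHERE e.stxname = 's';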

So that gives us the per-expression stats. This is enabled by default
when you don't specify any statistics kind and the definition includes
an expression that is not a simple column reference. You may also
request it explicitly by using "expressions" in CREATE STATISTICS.


Now, the second part is really just a natural extension of the existing
stats to also work with expressions. The easiest thing is probably to
show some examples, so consider this:

    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

which without any statistics gives us this:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                           QUERY PLAN
    ------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=25 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.080 ms
     Execution Time: 161.445 ms
    (5 rows)


    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    ------------------------------------------------------------------
     HashAggregate  (cost=76656.00..99468.50 rows=1000000 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.232 ms
     Execution Time: 514.446 ms
    (6 rows)

and now let's add statistics on the expressions:

    CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
    ANALYZE t;

which ends up with these spot-on estimates:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                            QUERY PLAN
    ---------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=97400 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.366 ms
     Execution Time: 159.207 ms
    (5 rows)

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    -----------------------------------------------------------------
     HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Batches: 1  Memory Usage: 24kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.299 ms
     Execution Time: 530.793 ms
    (6 rows)

Of course, this is a very simple query, but hopefully you get the idea.


There are two main areas where I think there might be hidden issues:

1) We're kinda faking the pg_statistic entries, and I suppose there
might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is
kinda simplistic; we're probably keeping the memory around longer than
needed, etc.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
> attached is a significantly improved version of the patch, allowing
> defining extended statistics on expressions. This fixes most of the
> problems in the previous WIP version and AFAICS it does pass all
> regression tests (including under valgrind). There's a bunch of FIXMEs
> and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach
some proposed fixes.

+  * This matters especially for * expensive expressions, of course.

+   The expression can refer only to columns of the underlying table, but
+   it can use all columns, not just the ones the statistics is defined
+   on.

I don't know what these are trying to say?

+                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates".  Or
should statistics support predicates, too ?

Idea: if a user specifies no stakinds, and there's no expression specified,
then you automatically build everything except for expressional stats.  But if
they specify only one statistics "column", it gives an error.  If that's a
non-simple column reference, should that instead build *only* expressional
stats (possibly with a NOTICE, since the user might be intending to make MV
stats)?

I think pg_stats_ext should allow inspecting the pg_statistic data in
pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
something, so maybe it should use ORDINALITY (?)

I hacked more on bootstrap.c so included that here.

-- 
Justin

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 11/23/20 3:26 AM, Justin Pryzby wrote:
> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
> 
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
> 
> +  * This matters especially for * expensive expressions, of course.
> 

The point this was trying to make is that we evaluate the expressions
only once, and use the results to build all the extended statistics,
instead of leaving it up to each "build" to re-evaluate them.

> +   The expression can refer only to columns of the underlying table, but
> +   it can use all columns, not just the ones the statistics is defined
> +   on.
> 
> I don't know what these are trying to say?
> 

D'oh. That's a bogus paragraph, copied from the CREATE INDEX docs (where
it talked about the index predicate, which is irrelevant here).

> +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
> +        * partial-index predicates.  Create it in the per-index context to be
> 
> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> should statistics support predicates, too ?
> 

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats.  But if
> they specify only one statistics "column", it gives an error.  If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats)?
> 

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

   CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.


> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)
> 

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's essentially a table nested in a table, with a non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part. Something like pg_stats_ext_expressions, with about the
same structure as pg_stats, or something.
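
Roughly what I'm imagining (the view and column names are entirely
tentative at this point, just to show the shape):

   -- one row per (statistics object, expression), similar to pg_stats
   SELECT statistics_name, expr, null_frac, n_distinct, most_common_vals
     FROM pg_stats_ext_expressions
    WHERE tablename = 't';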

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why you
renamed extended statistics to multi-variate statistics - we use
"extended" everywhere. Not sure what's the point of the
serialize_expr_stats changes, that code is mostly copy-paste from
update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.


Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is an updated version of the patch series, merging some of the
changes proposed by Justin. I've kept the bootstrap patches separate, at
least for now.

As for the individual 0004-0007 patches:

1) 0004 - merged as is

2) 0005 - I've merged some of the docs changes, but some of the wording
was copied from the CREATE INDEX docs, in which case I've kept that. I've
also not merged changes to pre-existing docs, like the t1 example, which
is unrelated to this patch.

OTOH I've corrected the t3 example description, which was somewhat bogus
and unrelated to what the example actually did. I've also removed the
irrelevant para which originally described index predicates and was
copied from CREATE INDEX docs by mistake.

3) 0006 - I've committed something similar / less invasive, achieving
the same goals (I think), and I've added a couple regression tests.

4) 0007 - I agreed we need a way to expose the stats, but including this
in pg_stats_ext seems rather inconvenient (a table nested in a table is
difficult to work with). Instead I've added a new view pg_stats_ext_exprs
with a structure similar to pg_stats. I've also added the expressions to
the pg_stats_ext view, which was only showing the attributes, and some
basic docs for the catalog changes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> > +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
> > +        * partial-index predicates.  Create it in the per-index context to be
> > 
> > I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> > should statistics support predicates, too ?
> > 
> 
> Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
> keeping it more like PG13 (good for backpatching). Not sure why you
> renamed extended statistics to multi-variate statistics - we use
> "extended" everywhere.

-       if (build_expressions && (list_length(stxexprs) == 0))
+       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
                ereport(ERROR,  
                                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-                                errmsg("extended expression statistics require at least one expression")));
+                                errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

> Not sure what's the point of the serialize_expr_stats changes,
> that code is mostly copy-paste from update_attstats.

Right.  I think "i" is a poor variable name when it isn't a loop variable
and isn't of limited scope.

> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
> IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done.  Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR:  extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 11/24/20 5:23 PM, Justin Pryzby wrote:
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> 0004 - Seems fine. IMHO not really "silly errors" but OK.
> 
> This is one of the same issues you pointed out - shadowing a variable.
> Could be backpatched.
> 
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>>> +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
>>> +        * partial-index predicates.  Create it in the per-index context to be
>>>
>>> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
>>> should statistics support predicates, too ?
>>>
>>
>> Right. Stupid copy-pasto.
> 
> Right, but then I was wondering if CREATE STATS should actually support
> predicates, since one use case is to do what indexes do without their overhead.
> I haven't thought about it enough yet.
> 

Well, it's not supported now, so the message is bogus. I'm not against
supporting "partial statistics" with predicates in the future, but it's
going to be a non-trivial project on its own. It's not something I can
bolt onto the current patch easily.

>> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
>> keeping it more like PG13 (good for backpatching). Not sure why you
>> renamed extended statistics to multi-variate statistics - we use
>> "extended" everywhere.
> 
> -       if (build_expressions && (list_length(stxexprs) == 0))
> +       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
>                 ereport(ERROR,  
>                                 (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -                                errmsg("extended expression statistics require at least one expression")));
> +                                errmsg("multi-variate statistics require at least two columns")));
> 
> I think all of "CREATE STATISTICS" has been known as "extended stats", so I
> think it may be confusing to say that it requires two columns for the general
> facility.
> 
>> Not sure what's the point of the serialize_expr_stats changes,
>> that code is mostly copy-paste from update_attstats.
> 
> Right.  I think "i" is a poor variable name when it isn't a loop variable
> and isn't of limited scope.
> 

OK, I understand. I'll consider tweaking that.

>> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
>> IMHO we should move this to a separate view.
> 
> Right - then unnest() the whole thing and return one row per expression rather
> than array, as you've done.  Maybe the docs should say that this returns one
> row per expression.
> 
> Looking quickly at your new patch: I guess you know there's a bunch of
> lingering references to "indexes" and "predicates":
> 
> I don't know if you want to go to the effort to prohibit this.
> postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
> CREATE STATISTICS
> 

Hmm, we're already rejecting system attributes, I suppose we should do
the same thing for expressions on system attributes.

> I think a lot of people will find this confusing:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
> ERROR:  extended expression statistics require at least one expression
> postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
> CREATE STATISTICS
> 
> I haven't looked, but is it possible to make it work without parens ?
> 

Hmm, you're right that may be surprising. I suppose we could walk the
expressions while creating the statistics, and replace such trivial
expressions with plain column references (the nested Var), but I haven't
tried. I wonder what the CREATE INDEX behavior would be in these cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a patch series rebased on top of 25a9e54d2d, which improves
estimation of OR clauses. There were only a couple of minor conflicts.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON (expression)
    FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | (expression) } , { column_name | (expression) } [, ...]
    FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.
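
So, for example, instead of the single statement above covering two
expressions, you'd write something like (placeholders as before):

  CREATE STATISTICS s1 ON (expr1) FROM tbl;
  CREATE STATISTICS s2 ON (expr2) FROM tbl;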

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
> 
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
> 
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
> 
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> 
> fails, but
> 
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> 
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> 
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> 
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
> 
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
> 
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
> 
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     ON (expression)
>     FROM table_name
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     [ ( statistics_kind [, ... ] ) ]
>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>     FROM table_name
> 
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
> 
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
> 
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
> 
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
> 

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     ON (expression)
>>>     FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     [ ( statistics_kind [, ... ] ) ]
>>>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>     FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
> 
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
> 

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem that forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
> 

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Possibly. But I don't think it's worth the extra complexity. I don't
> expect people to have a lot of overlapping stats, so the amount of
> wasted space and CPU time is expected to be fairly limited.
>
> So I don't think it's worth spending too much time on this now. Let's
> just do what you proposed, and revisit this later if needed.
>

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 12/11/20 1:58 PM, Dean Rasheed wrote:
> On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Possibly. But I don't think it's worth the extra complexity. I don't
>> expect people to have a lot of overlapping stats, so the amount of
>> wasted space and CPU time is expected to be fairly limited.
>>
>> So I don't think it's worth spending too much time on this now. Let's
>> just do what you proposed, and revisit this later if needed.
>>
> 
> Yes, I think that's a reasonable approach to take. As long as the
> documentation makes it clear that building MCV stats also causes
> standard expression stats to be built on any expressions included in
> the list, then the user will know and can avoid duplication most of
> the time. I don't think there's any need for code to try to prevent
> that -- just as we don't bother with code to prevent a user building
> multiple indexes on the same column.
> 
> The only case where duplication won't be avoidable is where there are
> multiple MCV stats sharing the same expression, but that's probably
> quite unlikely in practice, and it seems acceptable to leave improving
> that as a possible future optimisation.
> 

OK. Attached is an updated version, reworking it this way.

I tried tweaking the grammar to differentiate these two syntax variants,
but that led to shift/reduce conflicts with the existing rules. I tried
fixing that, but ended up handling the distinction in CreateStatistics()
instead.

The other thing is that we probably can't tie this to just MCV, because
functional dependencies need the per-expression stats too. So I simply
build expression stats whenever there's at least one expression.

I also decided to keep the "expressions" statistics kind - it's not
allowed to specify it in CREATE STATISTICS, but it's useful internally
as it allows deciding whether to build the stats in a single place.
Otherwise we'd need to do that every time we build the statistics, etc.
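
So, roughly, the behavior now looks like this (the exact handling of the
error cases may still change):

    CREATE TABLE t (a int, b int);

    -- builds just the per-expression statistics
    CREATE STATISTICS s1 ON (a + b) FROM t;

    -- builds mcv stats, plus per-expression stats for the expression
    CREATE STATISTICS s2 (mcv) ON a, (a + b) FROM t;

    -- rejected - "expressions" is an internal kind, not user-specifiable
    CREATE STATISTICS s3 (expressions) ON (a + b) FROM t;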

I added a brief explanation to the sgml docs, not sure if that's good
enough - maybe it needs more details.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it
into the next release.

I have started looking at it. So far I have only looked at the
catalog, parser and client changes, but I thought it's worth posting
my comments so far.

> I tried tweaking the grammar to differentiate these two syntax variants,
> but that led to shift/reduce conflicts with the existing rules. I tried
> fixing that, but ended up handling the distinction in CreateStatistics()
> instead.

Yeah, that makes sense. I wasn't expecting the grammar to change.

> The other thing is that we probably can't tie this to just MCV, because
> functional dependencies need the per-expression stats too. So I simply
> build expression stats whenever there's at least one expression.

Makes sense.

> I also decided to keep the "expressions" statistics kind - it's not
> allowed to specify it in CREATE STATISTICS, but it's useful internally
> as it allows deciding whether to build the stats in a single place.
> Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the
"expressions" stats kind is an internal implementation detail, hidden
from the user, because it's built automatically when required, so you
don't need to (and can't) explicitly ask for it. This new behaviour
seems much more logical to me.

> I added a brief explanation to the sgml docs, not sure if that's good
> enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked
too closely yet.


Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
 * Statistics attributes can be either simple column references, or arbitrary
 * expressions in parens.  For compatibility with index attributes permitted
 * in CREATE INDEX, we allow an expression that's just a function call to be
 * written without parens.
 */

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/4/21 4:34 PM, Dean Rasheed wrote:
>
> ... 
> 
> Some other comments:
> 
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.
> 

That's a bit down to Justin - I think it's fine to use the older version
repopulating the type array, but that question is somewhat unrelated to
this patch.

> * There's a comment typo in catalog/Makefile -- "are are reputedly
> other...", should be "there are reputedly other...".
> 
> * Looking at the pg_stats_ext view, I think perhaps expressions stats
> should be omitted entirely from that view, since it doesn't show any
> useful information about them. So it could remove "e" from the "kinds"
> array, and exclude rows whose only kind is "e", since such rows have
> no interesting data in them. Essentially, the new view
> pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
> redundant. Hiding this data in pg_stats_ext would also be consistent
> with making the "expressions" stats kind hidden from the user.
> 

Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea.
On the one hand it's an internal detail, on the other hand most of that
view is internal detail too. Excluding rows with only 'e' seems
reasonable, though. I need to think about this.

> * In gram.y, it wasn't quite obvious why you converted the column list
> for CREATE STATISTICS from an expr_list to a stats_params list. I
> figured it out, and it makes sense, but I think it could use a
> comment, perhaps something along the lines of the one for index_elem,
> e.g.:
> 
> /*
>   * Statistics attributes can be either simple column references, or arbitrary
>   * expressions in parens.  For compatibility with index attributes permitted
>   * in CREATE INDEX, we allow an expression that's just a function call to be
>   * written without parens.
>   */
> 

Oh, right. I'd have trouble figuring this out myself, even though I wrote
that code only one or two months ago.

> * In parse_func.c and parse_agg.c, there are a few new error strings
> that use the abbreviation "stats expressions", whereas most other
> errors refer to "statistics expressions". For consistency, I think
> they should all be the latter.
> 

OK, will fix.

> * In generateClonedExtStatsStmt(), I think the "expressions" stats
> kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
> ...) fails if the source table has expression stats.
> 

Yeah, will fix. I guess this also means we're missing some tests.

> * CreateStatistics() uses ShareUpdateExclusiveLock, but in
> tcop/utility.c the relation is opened with a ShareLock. ISTM that the
> latter lock mode should be made to match CreateStatistics().
> 

Not sure, will check.

> * Why does the new code in tcop/utility.c not use
> RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
> That seems preferable to doing the ACL check in CreateStatistics().
> For one thing, as it stands, it allows the lock to be taken even if
> the user doesn't own the table. Is it intentional that the current
> code allows extended stats to be created on system catalogs? That
> would be one thing that using RangeVarCallbackOwnsRelation would
> change, but I can't see a reason to allow it.
> 

I think I copied the code from somewhere - probably expression indexes,
or something like that. That's not proof that it's the right/better way
to do this, though.

> * In src/bin/psql/describe.c, I think the \d output should also
> exclude the "expressions" stats kind and just list the other kinds (or
> have no kinds list at all, if there are no other kinds), to make it
> consistent with the CREATE STATISTICS syntax.
> 

Not sure I understand. Why would this make it consistent with CREATE 
STATISTICS? Can you elaborate?

> * The pg_dump output for a stats object whose only kind is
> "expressions" is broken -- it includes a spurious "()" for the kinds
> list.
> 

Will fix. Again, this suggests there are TAP tests missing.

> That's it for now. I'll look at the optimiser changes next, and try to
> post more comments later this week.
> 

Thanks!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/4/21 4:34 PM, Dean Rasheed wrote:
> >
> > * In src/bin/psql/describe.c, I think the \d output should also
> > exclude the "expressions" stats kind and just list the other kinds (or
> > have no kinds list at all, if there are no other kinds), to make it
> > consistent with the CREATE STATISTICS syntax.
>
> Not sure I understand. Why would this make it consistent with CREATE
> STATISTICS? Can you elaborate?
>

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/5/21 3:10 PM, Dean Rasheed wrote:
> On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/4/21 4:34 PM, Dean Rasheed wrote:
>>>
>>> * In src/bin/psql/describe.c, I think the \d output should also
>>> exclude the "expressions" stats kind and just list the other kinds (or
>>> have no kinds list at all, if there are no other kinds), to make it
>>> consistent with the CREATE STATISTICS syntax.
>>
>> Not sure I understand. Why would this make it consistent with CREATE
>> STATISTICS? Can you elaborate?
>>
> 
> This isn't absolutely essential, but I think it would be neater. For
> example, if I have a table with stats like this:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;
> 
> then the \d output is as follows:
> 
> \d foo
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv) ON a, b FROM foo
> 
> and the stats line matches the DDL used to create the stats. It could,
> for example, be copy-pasted and tweaked to create similar stats on
> another table, but even if that's not very likely, it's neat that it
> reflects how the stats were created.
> 
> OTOH, if there are expressions in the list, it produces something like this:
> 
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo
> 
> which no longer matches the DDL used, and isn't part of an accepted
> syntax, so seems a bit inconsistent.
> 
> In general, if we're making the "expressions" kind an internal
> implementation detail that just gets built automatically when needed,
> then I think we should hide it from this sort of output, so the list
> of kinds matches the list that the user used when the stats were
> created.
> 

Hmm, I see. You're probably right it's not necessary to show this, given 
the modified handling of expression stats (which makes them an internal 
detail, not exposed to users). I'll tweak this.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Looking over the statscmds.c changes, there are a few XXX's and
FIXME's that need resolving, and I had a couple of other minor
comments:

+           /*
+            * An expression using mutable functions is probably wrong,
+            * since if you aren't going to get the same result for the
+            * same data every time, it's not clear what the index entries
+            * mean at all.
+            */
+           if (CheckMutability((Expr *) expr))
+               ereport(ERROR,

That comment is presumably copied from the index code, so needs updating.


+           /*
+            * Disallow data types without a less-than operator
+            *
+            * XXX Maybe allow this, but only for EXPRESSIONS stats and
+            * prevent building e.g. MCV etc.
+            */
+           atttype = exprType(expr);
+           type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+           if (type->lt_opr == InvalidOid)
+               ereport(ERROR,
+                       (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                        errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
+                               format_type_be(atttype))));

As the comment suggests, it's probably worth skipping this check if
numcols is 1 so that single-column stats can be built for more types
of expressions. (I'm assuming that it's basically no more effort to
make that work, so I think it falls into the might-as-well-do-it
category.)
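
Something along these lines, perhaps (just a sketch, reusing the 
variables from the hunk above):

/* only insist on a default btree opclass for multi-column statistics */
atttype = exprType(expr);
if (numcols > 1)
{
    type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
    if (type->lt_opr == InvalidOid)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
                        format_type_be(atttype))));
}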


+   /*
+    * Parse the statistics kinds.  Firstly, check that this is not the
+    * variant building statistics for a single expression, in which case
+    * we don't allow specifying any statistis kinds.  The simple variant
+    * only has one expression, and does not allow statistics kinds.
+    */
+   if ((list_length(stmt->exprs) == 1) && (list_length(stxexprs) == 1))
+   {

Typo: "statistis"
Nit-picking, this test could just be:

+   if ((numcols == 1) && (list_length(stxexprs) == 1))

which IMO is a little more readable, and matches a similar test a
little further down.


+   /*
+    * If there are no simply-referenced columns, give the statistics an
+    * auto dependency on the whole table.  In most cases, this will
+    * be redundant, but it might not be if the statistics expressions
+    * contain no Vars (which might seem strange but possible).
+    *
+    * XXX This is copied from index_create, not sure if it's applicable
+    * to extended statistics too.
+    */

Seems right to me.


+       /*
+        * FIXME use 'expr' for expressions, which have empty column names.
+        * For indexes this is handled in ChooseIndexColumnNames, but we
+        * have no such function for stats.
+        */
+       if (!name)
+           name = "expr";

In theory, this function could be made to duplicate the logic used for
indexes, creating names like "expr1", "expr2", etc. To be honest
though, I don't think it's worth the effort. The code for indexes
isn't really bulletproof anyway -- for example there might be a column
called "expr" that is or isn't included in the index, which would make
the generated name ambiguous. And in any case, a name like
"tbl_cola_expr_colb_expr1_colc_stat" isn't really any more useful than
"tbl_cola_expr_colb_expr_colc_stat". So I'd be tempted to leave that
code as it is.


+
+/*
+ * CheckMutability
+ *     Test whether given expression is mutable
+ *
+ * FIXME copied from indexcmds.c, maybe use some shared function?
+ */
+static bool
+CheckMutability(Expr *expr)
+{

As the comment says, it's quite messy duplicating this code, but I'm
wondering whether it would be OK to just skip this check entirely. I
think someone else suggested that elsewhere, and I think it might not
be a bad idea.

For indexes, it could easily lead to wrong query results, but for
stats the most likely problem is that the stats would get out of date
(which they tend to do all by themselves anyway) and need rebuilding.

If you ignore intentionally crazy examples (which are still possible
even with this check), then there are probably many legitimate cases
where someone might want to use non-immutable functions in stats, and
this check just forces them to create an immutable wrapper function.
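
For instance (a hypothetical table; date_trunc() on timestamptz is only 
stable, not immutable, because its result depends on the TimeZone setting):

CREATE TABLE events (id int, created_at timestamptz);
-- presumably rejected by the mutability check, although it seems a
-- perfectly legitimate thing to want statistics on
CREATE STATISTICS events_s ON (date_trunc('day', created_at)) FROM events;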

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Starting to look at the planner code, I found an oversight in the way
expression stats are read at the start of planning -- it is necessary
to call ChangeVarNodes() on any expressions if the relid isn't 1,
otherwise the stats expressions may contain Var nodes referring to the
wrong relation. Possibly the easiest place to do that would be in
get_relation_statistics(), if rel->relid != 1.

Here's a simple test case:

CREATE TABLE foo AS SELECT x FROM generate_series(1,100000) g(x);
CREATE STATISTICS foo_s ON (x%10) FROM foo;
ANALYSE foo;

EXPLAIN SELECT * FROM foo WHERE x%10 = 0;
EXPLAIN SELECT * FROM (SELECT 1) t, foo WHERE x%10 = 0;

(in the second query, the stats don't get applied).
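
For what it's worth, I imagine the fix being roughly this (a sketch only, 
with "exprs" standing for the list of statistics expressions just read 
from the catalog):

/* the stored expressions refer to varno 1, so remap them if needed */
if (rel->relid != 1)
    ChangeVarNodes((Node *) exprs, 1, rel->relid, 0);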

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a patch fixing most of the issues. There are a couple 
exceptions:


 > * Looking at the pg_stats_ext view, I think perhaps expressions stats
 > should be omitted entirely from that view, since it doesn't show any
 > useful information about them. So it could remove "e" from the "kinds"
 > array, and exclude rows whose only kind is "e", since such rows have
 > no interesting data in them. Essentially, the new view
 > pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
 > redundant. Hiding this data in pg_stats_ext would also be consistent
 > with making the "expressions" stats kind hidden from the user.

I haven't removed the expression stats from the pg_stats_ext view yet - 
I'm still not 100% sure about it.


 > * Why does the new code in tcop/utility.c not use
 > RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
 > That seems preferable to doing the ACL check in CreateStatistics().
 > For one thing, as it stands, it allows the lock to be taken even if
 > the user doesn't own the table. Is it intentional that the current
 > code allows extended stats to be created on system catalogs? That
 > would be one thing that using RangeVarCallbackOwnsRelation would
 > change, but I can't see a reason to allow it.

I haven't switched utility.c to RangeVarGetRelidExtended together with 
RangeVarCallbackOwnsRelation, because the current code allows checking 
for object type first. I don't recall why exactly it was done this way, 
but I didn't feel like changing that in this patch.

You're right, however, that it should not be possible to create statistics 
on system catalogs. For regular users that is rejected thanks to the 
ownership check, but a superuser may still create them. I've added a proper 
check to CreateStatistics() - this is probably worth backpatching.


 > * In src/bin/psql/describe.c, I think the \d output should also
 > exclude the "expressions" stats kind and just list the other kinds (or
 > have no kinds list at all, if there are no other kinds), to make it
 > consistent with the CREATE STATISTICS syntax.

I've done this, but I went one step further - we hide the list of kinds 
using the same rules as pg_dump, i.e. we don't list the kinds if all of 
them are selected. Not sure if that's the right thing, though.


The rest of the issues should be fixed, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
> Attached is a patch fixing most of the issues. There are a couple
> exceptions:

In the docs:

+   — at the cost that its schema must be extended whenever the structure
+   of statistics <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link> changes.

should say "of statistics *IN* pg_statistics changes" ?

+   to an expression index. The full variant allows defining statistics objects
+   on multiple columns and expressions, and pick which statistics kinds will
+   be built. The per-expression statistics are built automatically when there

"and pick" is wrong - maybe say "and selecting which.."

+   and run a query using an expression on that column.  Without the

remove "the" ?

+   extended statistics, the planner has no information about data
+   distribution for reasults of those expression, and uses default

*results

+   estimates as illustrated by the first query.  The planner also does
+   not realize the value of the second column fully defines the value
+   of the other column, because date truncated to day still identifies
+   the month). Then expression and ndistinct statistics are built on

The ")" is unbalanced

+                               /* all parts of thi expression are covered by this statistics */

this

+ * GrouExprInfos, but only if it's not known equal to any of the existing

Group

+        * we don't allow specifying any statistis kinds.  The simple variant

statistics

+        * If no statistic type was specified, build them all (but request

Say "kind" not "type" ?

+ * expression is a simple Var. OTOH we check that there's at least one
+ * statistics matching the expression.

one statistic (singular) ?

+                * the future, we might consider
+                */

consider ???

+-- (not it fails, when there are no simple column references)

note?

There's some remaining copy/paste stuff from index expressions:

errmsg("statistics expressions and predicates can refer only to the table being indexed")));
left behind by evaluating the predicate or index expressions.
Set up for predicate or expression evaluation
Need an EState for evaluation of index expressions and
/* Compute and save index expression values */
left behind by evaluating the predicate or index expressions.
Fetch function for analyzing index expressions.
partial-index predicates.  Create it in the per-index context to be
* When analyzing an expression index, believe the expression tree's type

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/8/21 3:35 AM, Justin Pryzby wrote:
> On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
>> Attached is a patch fixing most of the issues. There are a couple
>> exceptions:
> 
> In the docs:
> 
> ...
>

Thanks! Checking the docs and comments is tedious work; I appreciate
you going through all that. I'll fix that in the next version.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Attached is an updated version of the patch series, fixing a couple issues:

1) docs issues, pointed out by Justin Pryzby

2) adds ACL check to statext_extract_expression to verify access to 
attributes in the expression(s)

3) adds comment to statext_is_compatible_clause_internal explaining the 
ambiguity in extracting expressions for extended stats

4) fixes/improves memory management in compute_expr_stats

5) a bunch of minor comment and code fixes


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>expr</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Expression the extended statistics is defined on
> +      </para></entry>

Expression the extended statistics ARE defined on
Or maybe say "on which the extended statistics are defined"

> +  <para>
> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
> +   simple variant allows to build statistics for a single expression, does

.. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does)

> +   Expression statistics are per-expression and are similar to creating an
> +   index on the expression, except that they avoid the overhead of the index.

Maybe say "overhead of index maintenance"

> +   All functions and operators used in a statistics definition must be
> +   <quote>immutable</quote>, that is, their results must depend only on
> +   their arguments and never on any outside influence (such as
> +   the contents of another table or the current time).  This restriction

say "outside factor" or "external factor"

> +   results of those expression, and uses default estimates as illustrated
> +   by the first query.  The planner also does not realize the value of the

realize THAT

> +   second column fully defines the value of the other column, because date
> +   truncated to day still identifies the month. Then expression and
> +   ndistinct statistics are built on those two columns:

I got an error doing this:

CREATE TABLE t AS SELECT generate_series(1,9) AS i;
CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
ANALYZE t;
SELECT i+1 FROM t GROUP BY 1;
ERROR:  corrupt MVNDistinct entry

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/17/21 12:22 AM, Justin Pryzby wrote:
> On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>expr</structfield> <type>text</type>
>> +      </para>
>> +      <para>
>> +       Expression the extended statistics is defined on
>> +      </para></entry>
> 
> Expression the extended statistics ARE defined on
> Or maybe say "on which the extended statistics are defined"
> 

I'm pretty sure "is" is correct because "expression" is singular.

>> +  <para>
>> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
>> +   simple variant allows to build statistics for a single expression, does
> 
> .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does)
> 
>> +   Expression statistics are per-expression and are similar to creating an
>> +   index on the expression, except that they avoid the overhead of the index.
> 
> Maybe say "overhead of index maintenance"
> 

Yeah, that sounds better.

>> +   All functions and operators used in a statistics definition must be
>> +   <quote>immutable</quote>, that is, their results must depend only on
>> +   their arguments and never on any outside influence (such as
>> +   the contents of another table or the current time).  This restriction
> 
> say "outside factor" or "external factor"
> 

In fact, we've removed the immutability restriction, so this paragraph 
should have been removed.

>> +   results of those expression, and uses default estimates as illustrated
>> +   by the first query.  The planner also does not realize the value of the
> 
> realize THAT
> 

OK, changed.

>> +   second column fully defines the value of the other column, because date
>> +   truncated to day still identifies the month. Then expression and
>> +   ndistinct statistics are built on those two columns:
> 
> I got an error doing this:
> 
> CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> ANALYZE t;
> SELECT i+1 FROM t GROUP BY 1;
> ERROR:  corrupt MVNDistinct entry
> 

Thanks. There was a thinko in estimate_multivariate_ndistinct, which 
resulted in mismatched ndistinct coefficient items. The attached patch 
fixes that, but I've realized the way we pick the "best" statistics may 
need some improvement (I added an XXX comment about that).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Zhihong Yu
Date:
Hi,
+    * Check that only the base rel is mentioned.  (This should be dead code
+    * now that add_missing_from is history.)
+    */
+   if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right ?

For statext_mcv_clauselist_selectivity:

+                   // bms_free(list_attnums[listidx]);

The commented line can be removed.

+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool *expronleftp)

Better add some comment for examine_clause_args2 since there is examine_clause_args() already.

+   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have negative value ?

For serialize_expr_stats():

+   sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+   /* lookup OID of composite type for pg_statistic */
+   typOid = get_rel_type_id(StatisticRelationId);
+   if (!OidIsValid(typOid))
+       ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.

+       Datum       values[Natts_pg_statistic];
+       bool        nulls[Natts_pg_statistic];
+       HeapTuple   stup;
+
+       if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.

+           if (enabled[i] == STATS_EXT_NDISTINCT)
+               ndistinct_enabled = true;
+           if (enabled[i] == STATS_EXT_DEPENDENCIES)
+               dependencies_enabled = true;
+           if (enabled[i] == STATS_EXT_MCV)

the second and third if should be preceded with 'else'

+ReleaseDummy(HeapTuple tuple)
+{
+   pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need this method - call pfree in its place.

Cheers


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
> index 4363be50c3..5b8eb8d248 100644
> --- a/doc/src/sgml/ref/create_statistics.sgml
> +++ b/doc/src/sgml/ref/create_statistics.sgml
> @@ -21,9 +21,13 @@ PostgreSQL documentation
>  
>   <refsynopsisdiv>
>  <synopsis>
> +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
> +    ON ( <replaceable class="parameter">expression</replaceable> )
> +    FROM <replaceable class="parameter">table_name</replaceable>

>  CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>      [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
> -    ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...]
> +    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...]
>      FROM <replaceable class="parameter">table_name</replaceable>
>  </synopsis>

I think this part is wrong, since it's not possible to specify a single column
in form#2.

If I'm right, the current way is:

 - form#1 allows expression statistics of a single expression, and doesn't
   allow specifying "kinds";

 - form#2 allows specifying "kinds", but always computes expression statistics,
   and requires multiple columns/expressions.

So it'd need to be column_name|expression, column_name|expression [,...]
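
To spell out the two forms with a (hypothetical) table:

CREATE TABLE t (a int, b int);

-- form#1: a single expression, no statistics kinds allowed
CREATE STATISTICS s1 ON (a + b) FROM t;

-- form#2: kinds may be given, but at least two columns/expressions are required
CREATE STATISTICS s2 (ndistinct, mcv) ON a, (a + b) FROM t;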

> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
>     database and will be owned by the user issuing the command.
>    </para>
>  
> +  <para>
> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
> +   simple variant allows building statistics for a single expression, does
> +   not allow specifying any statistics kinds and provides benefits similar
> +   to an expression index. The full variant allows defining statistics objects
> +   on multiple columns and expressions, and selecting which statistics kinds will
> +   be built. The per-expression statistics are built automatically when there
> +   is at least one expression.
> +  </para>

> +   <varlistentry>
> +    <term><replaceable class="parameter">expression</replaceable></term>
> +    <listitem>
> +     <para>
> +      The expression to be covered by the computed statistics. In this case
> +      only a single expression is required, in which case only the expression
> +      statistics kind is allowed. The order of expressions is insignificant.

I think this part is wrong now ?
I guess there's no user-facing expression "kind" left in the CREATE command.
I guess "in which case" means "if only one expr is specified".
"expression" could be either form#1 or form#2.

Maybe it should just say:
> +      An expression to be covered by the computed statistics.

Maybe somewhere else, say:
> In the second form of the command, the order of expressions is insignificant.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/17/21 3:55 AM, Zhihong Yu wrote:
> Hi,
> +    * Check that only the base rel is mentioned.  (This should be dead code
> +    * now that add_missing_from is history.)
> +    */
> +   if (list_length(pstate->p_rtable) != 1)
> 
> If it is dead code, it can be removed, right ?
> 

Maybe. The question is whether it really is dead code. It's copied from 
transformIndexStmt so I kept it for now.

> For statext_mcv_clauselist_selectivity:
> 
> +                   // bms_free(list_attnums[listidx]);
> 
> The commented line can be removed.
>

Actually, this should probably do list_free on the list_exprs.

> +bool
> +examine_clause_args2(List *args, Node **exprp, Const **cstp, bool 
> *expronleftp)
> 
> Better add some comment for examine_clause_args2 since there 
> is examine_clause_args() already.
> 

Yeah, this probably needs a better name. I have a feeling this may need 
a refactoring to reuse more of the existing code for the expressions.

> +   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
> 
> When would stats->minrows have negative value ?
> 

The typanalyze function (e.g. std_typanalyze) can set it to a negative 
value. This is the same check as in examine_attribute, and we need the 
same protections, I think.

> For serialize_expr_stats():
> 
> +   sd = table_open(StatisticRelationId, RowExclusiveLock);
> +
> +   /* lookup OID of composite type for pg_statistic */
> +   typOid = get_rel_type_id(StatisticRelationId);
> +   if (!OidIsValid(typOid))
> +       ereport(ERROR,
> 
> Looks like the table_open() call can be made after the typOid check.
> 

Probably, but this failure is unlikely (should never happen) so I don't 
think this makes any real difference.

> +       Datum       values[Natts_pg_statistic];
> +       bool        nulls[Natts_pg_statistic];
> +       HeapTuple   stup;
> +
> +       if (!stats->stats_valid)
> 
> It seems the local arrays can be declared after the validity check.
> 

Nope, that would be against C99.

> +           if (enabled[i] == STATS_EXT_NDISTINCT)
> +               ndistinct_enabled = true;
> +           if (enabled[i] == STATS_EXT_DEPENDENCIES)
> +               dependencies_enabled = true;
> +           if (enabled[i] == STATS_EXT_MCV)
> 
> the second and third if should be preceded with 'else'
> 

Yeah, although this just moves existing code. But you're right it could 
use else.

> +ReleaseDummy(HeapTuple tuple)
> +{
> +   pfree(tuple);
> 
> Since ReleaseDummy() is just a single pfree call, maybe you don't need 
> this method - call pfree in its place.
> 

No, that's not possible because the freefunc callback expects signature

     void (*)(HeapTupleData *)

and pfree() does not match that.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/17/21 6:29 AM, Justin Pryzby wrote:
> On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
>> diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
>> index 4363be50c3..5b8eb8d248 100644
>> --- a/doc/src/sgml/ref/create_statistics.sgml
>> +++ b/doc/src/sgml/ref/create_statistics.sgml
>> @@ -21,9 +21,13 @@ PostgreSQL documentation
>>   
>>    <refsynopsisdiv>
>>   <synopsis>
>> +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>> +    ON ( <replaceable class="parameter">expression</replaceable> )
>> +    FROM <replaceable class="parameter">table_name</replaceable>
> 
>>   CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>>       [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
>> -    ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...]
>> +    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...]
>>       FROM <replaceable class="parameter">table_name</replaceable>
>>   </synopsis>
> 
> I think this part is wrong, since it's not possible to specify a single column
> in form#2.
> 
> If I'm right, the current way is:
> 
>   - form#1 allows expression statistics of a single expression, and doesn't
>     allow specifying "kinds";
> 
>   - form#2 allows specifying "kinds", but always computes expression statistics,
>     and requires multiple columns/expressions.
> 
> So it'd need to be column_name|expression, column_name|expression [,...]
> 

Strictly speaking you're probably correct - there should be at least two 
elements. But I'm somewhat hesitant about making this more complex, 
because it'll be harder to understand.


>> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
>>      database and will be owned by the user issuing the command.
>>     </para>
>>   
>> +  <para>
>> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
>> +   simple variant allows building statistics for a single expression, does
>> +   not allow specifying any statistics kinds and provides benefits similar
>> +   to an expression index. The full variant allows defining statistics objects
>> +   on multiple columns and expressions, and selecting which statistics kinds will
>> +   be built. The per-expression statistics are built automatically when there
>> +   is at least one expression.
>> +  </para>
> 
>> +   <varlistentry>
>> +    <term><replaceable class="parameter">expression</replaceable></term>
>> +    <listitem>
>> +     <para>
>> +      The expression to be covered by the computed statistics. In this case
>> +      only a single expression is required, in which case only the expression
>> +      statistics kind is allowed. The order of expressions is insignificant.
> 
> I think this part is wrong now ?
> I guess there's no user-facing expression "kind" left in the CREATE command.
> I guess "in which case" means "if only one expr is specified".
> "expression" could be either form#1 or form#2.
> 
> Maybe it should just say:
>> +      An expression to be covered by the computed statistics.
> 
> Maybe somewhere else, say:
>> In the second form of the command, the order of expressions is insignificant.
> 

Yeah, this is a leftover from when there was an "expressions" kind. I'll 
reword this a bit.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR:  corrupt MVNDistinct entry
> 
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in
> mismatching the ndistinct coefficient items. The attached patch fixes that,
> but I've realized the way we pick the "best" statistics may need some
> improvements (I added an XXX comment about that).

That may indicate a deficiency in testing and code coverage.

| postgres=# CREATE TABLE t(i int);
| postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
| postgres=# \d t
|                  Table "public.t"
|  Column |  Type   | Collation | Nullable | Default 
| --------+---------+-----------+----------+---------
|  i      | integer |           |          | 
| Indexes:
|     "t_i_idx" btree (i)
| Statistics objects:
|     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t

on ... what ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.
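
i.e. something like this near the top of lookup_var_attr_stats() (a 
sketch only; I'm assuming returning NULL there makes the caller skip 
building the statistics object):

/* nothing to do if the table has no analyzable columns at all */
if (nvacatts == 0)
    return NULL;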


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().


A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle of the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.
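
To illustrate the kind of offsetting I have in mind (hypothetical macro 
names, not from the patch):

/*
 * Sketch only: expressions get "attnums" -1, -2, ... and are shifted by
 * STATS_MAX_DIMENSIONS when stored in a Bitmapset, much like system
 * attributes are shifted by FirstLowInvalidHeapAttributeNumber.
 */
#define StatAttnumToIndex(attnum)   ((attnum) + STATS_MAX_DIMENSIONS)
#define StatIndexToAttnum(index)    ((AttrNumber) ((index) - STATS_MAX_DIMENSIONS))

/* so expression -1 becomes bitmapset member 7, column 1 becomes member 9, etc. */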


Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-----------------+-----------+---------+------------
 s               | foo       | (a + b) |         11
 s               | foo       | (a * b) |         11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+         LEFT JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+         JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON true;

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/18/21 4:48 PM, Dean Rasheed wrote:
> Looking through extended_stats.c, I found a corner case that can lead
> to a seg-fault:
> 
> CREATE TABLE foo();
> CREATE STATISTICS s ON (1) FROM foo;
> ANALYSE foo;
> 
> This crashes in lookup_var_attr_stats(), because it isn't expecting
> nvacatts to be 0. I can't think of any case where building stats on a
> table with no analysable columns is useful, so it should probably just
> exit early in that case.
> 
> 
> In BuildRelationExtStatistics(), it looks like min_attrs should be
> declared assert-only.
> 
> 
> In evaluate_expressions():
> 
> +   /* set the pointers */
> +   result = (ExprInfo *) ptr;
> +   ptr += sizeof(ExprInfo);
> 
> I think that should probably have a MAXALIGN().
> 

Thanks, I'll fix all of that.

> 
> A slightly bigger issue that I don't like is the way it assigns
> attribute numbers for expressions starting from
> MaxHeapAttributeNumber+1, so the first expression has an attnum of
> 1601. That leads to pretty inefficient use of Bitmapsets, since most
> tables only contain a handful of columns, and there's a large block of
> unused space in the middle the Bitmapset.
> 
> An alternative approach might be to use regular attnums for columns
> and use negative indexes -1, -2, -3, ... for expressions in the stored
> stats. Then when storing and retrieving attnums from Bitmapsets, it
> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> in the Bitmapsets, since there can't be more than that many
> expressions (just like other code stores system attributes using
> FirstLowInvalidHeapAttributeNumber).
> 
> That would be a somewhat bigger change, but hopefully fairly
> mechanical, and then some code like add_expressions_to_attributes()
> would go away.
> 

Well, I tried this but unfortunately it's not that simple. We still need 
to build the bitmaps, so I don't think add_expression_to_attributes can 
be just removed. I mean, we need to do the offsetting somewhere, even if 
we change how we do it.

But the main issue is that in some cases the number of expressions is 
not really limited by STATS_MAX_DIMENSIONS - for example when applying 
functional dependencies, we "merge" multiple statistics, so we may end 
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts 
the order of processing (so far we've assumed expressions are after 
regular attnums). So the changes are more extensive - I tried doing that 
anyway, and I'm still struggling with crashes and regression failures. 
Of course, that doesn't mean we shouldn't do it, but it's far from 
mechanical. (Some of that is probably a sign this code needs a bit more 
work to polish.)

But I wonder if it'd be easier to just calculate the actual max attnum 
and then use it instead of MaxHeapAttributeNumber ...

> 
> Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
> show expressions until the statistics have been built. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> but after populating and analysing the table, this becomes:
> 
>   statistics_name | tablename |  expr   | n_distinct
> -----------------+-----------+---------+------------
>   s               | foo       | (a + b) |         11
>   s               | foo       | (a * b) |         11
> (2 rows)
> 
> I think it should show the expressions even before the stats have been built.
> 
> Another issue is that it returns rows for non-expression stats as
> well. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON a, b FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> and those values will never be populated, since they're not
> expressions, so I would expect them to not be shown in the view.
> 
> So basically, instead of
> 
> +         LEFT JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON sd.stxdexpr IS NOT NULL;
> 
> perhaps just
> 
> +         JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON true;

Will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> > A slightly bigger issue that I don't like is the way it assigns
> > attribute numbers for expressions starting from
> > MaxHeapAttributeNumber+1, so the first expression has an attnum of
> > 1601. That leads to pretty inefficient use of Bitmapsets, since most
> > tables only contain a handful of columns, and there's a large block of
> > unused space in the middle the Bitmapset.
> >
> > An alternative approach might be to use regular attnums for columns
> > and use negative indexes -1, -2, -3, ... for expressions in the stored
> > stats. Then when storing and retrieving attnums from Bitmapsets, it
> > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> > in the Bitmapsets, since there can't be more than that many
> > expressions (just like other code stores system attributes using
> > FirstLowInvalidHeapAttributeNumber).
>
> Well, I tried this but unfortunately it's not that simple. We still need
> to build the bitmaps, so I don't think add_expression_to_attributes can
> be just removed. I mean, we need to do the offsetting somewhere, even if
> we change how we do it.

Hmm, I was imagining that the offsetting would happen in each place
that adds or retrieves an attnum from a Bitmapset, much like a lot of
other code does for system attributes, and then you'd know you had an
expression if the resulting attnum was negative.

I was also thinking that it would be these negative attnums that would
be stored in the stats data, so instead of something like "1, 2 =>
1601", it would be "1, 2 => -1", so in some sense "-1" would be the
"real" attnum associated with the expression.

> But the main issue is that in some cases the number of expressions is
> not really limited by STATS_MAX_DIMENSIONS - for example when applying
> functional dependencies, we "merge" multiple statistics, so we may end
> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Ah, I see. I hadn't really fully understood what that code was doing.

ISTM though that this is really an internal problem to the
dependencies code, in that these "merged" Bitmapsets containing attrs
from multiple different stats objects do not (and must not) ever go
outside that local code that uses them. So that code would be free to
use a different offset for its own purposes -- e.g., collect all the
distinct expressions across all the stats objects and then offset by
the number of distinct expressions.

> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
> the order of processing (so far we've assumed expressions are after
> regular attnums). So the changes are more extensive - I tried doing that
> anyway, and I'm still struggling with crashes and regression failures.
> Of course, that doesn't mean we shouldn't do it, but it's far from
> mechanical. (Some of that is probably a sign this code needs a bit more
> work to polish.)

Interesting. What code assumes expressions come after attributes?
Ideally, I think it would be cleaner if no code assumed any particular
order, but I can believe that it might be convenient in some
circumstances.

> But I wonder if it'd be easier to just calculate the actual max attnum
> and then use it instead of MaxHeapAttributeNumber ...

Hmm, I'm not sure how that would work. There still needs to be an
attnum that gets stored in the database, and it has to continue to
work if the user adds columns to the table. That's why I was
advocating storing negative values, though I haven't actually tried it
to see what might go wrong.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/21/21 12:11 PM, Dean Rasheed wrote:
> On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>> A slightly bigger issue that I don't like is the way it assigns
>>> attribute numbers for expressions starting from
>>> MaxHeapAttributeNumber+1, so the first expression has an attnum of
>>> 1601. That leads to pretty inefficient use of Bitmapsets, since most
>>> tables only contain a handful of columns, and there's a large block of
>>> unused space in the middle the Bitmapset.
>>>
>>> An alternative approach might be to use regular attnums for columns
>>> and use negative indexes -1, -2, -3, ... for expressions in the stored
>>> stats. Then when storing and retrieving attnums from Bitmapsets, it
>>> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
>>> in the Bitmapsets, since there can't be more than that many
>>> expressions (just like other code stores system attributes using
>>> FirstLowInvalidHeapAttributeNumber).
>>
>> Well, I tried this but unfortunately it's not that simple. We still need
>> to build the bitmaps, so I don't think add_expression_to_attributes can
>> be just removed. I mean, we need to do the offsetting somewhere, even if
>> we change how we do it.
> 
> Hmm, I was imagining that the offsetting would happen in each place
> that adds or retrieves an attnum from a Bitmapset, much like a lot of
> other code does for system attributes, and then you'd know you had an
> expression if the resulting attnum was negative.
> 
> I was also thinking that it would be these negative attnums that would
> be stored in the stats data, so instead of something like "1, 2 =>
> 1601", it would be "1, 2 => -1", so in some sense "-1" would be the
> "real" attnum associated with the expression.
> 
>> But the main issue is that in some cases the number of expressions is
>> not really limited by STATS_MAX_DIMENSIONS - for example when applying
>> functional dependencies, we "merge" multiple statistics, so we may end
>> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.
> 
> Ah, I see. I hadn't really fully understood what that code was doing.
> 
> ISTM though that this is really an internal problem to the
> dependencies code, in that these "merged" Bitmapsets containing attrs
> from multiple different stats objects do not (and must not) ever go
> outside that local code that uses them. So that code would be free to
> use a different offset for its own purposes -- e..g., collect all the
> distinct expressions across all the stats objects and then offset by
> the number of distinct expressions.
> 

>> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
>> the order of processing (so far we've assumed expressions are after
>> regular attnums). So the changes are more extensive - I tried doing that
>> anyway, and I'm still struggling with crashes and regression failures.
>> Of course, that doesn't mean we shouldn't do it, but it's far from
>> mechanical. (Some of that is probably a sign this code needs a bit more
>> work to polish.)
> 
> Interesting. What code assumes expressions come after attributes?
> Ideally, I think it would be cleaner if no code assumed any particular
> order, but I can believe that it might be convenient in some
> circumstances.
> 

Well, in a bunch of places we look at the index (from the bitmap) and 
use it to determine whether the value is a regular attribute or an 
expression, because the values are stored in separate arrays.

This is solvable - all I'm saying (both here and in the preceding part 
about dependencies) is that it's not an entirely mechanical task. But it 
might be better to rethink the separation of simple values and 
expressions, and make it more "unified" so that most of the code does not 
really need to deal with these differences.

>> But I wonder if it'd be easier to just calculate the actual max attnum
>> and then use it instead of MaxHeapAttributeNumber ...
> 
> Hmm, I'm not sure how that would work. There still needs to be an
> attnum that gets stored in the database, and it has to continue to
> work if the user adds columns to the table. That's why I was
> advocating storing negative values, though I haven't actually tried it
> to see what might go wrong.
> 

Well, yeah, we need to identify the expression in some statistics (e.g. 
in dependencies or ndistinct items). And yeah, offsetting the expression 
attnums by MaxHeapAttributeNumber may be inefficient in this case.


Attached is an updated version of the patch, hopefully addressing all 
issues pointed out by you, Justin and Zhihong, with the exception of the 
expression attnums discussed here.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
This already needs to be rebased on 55dc86eca.

And needs to update rules.out.

And doesn't address this one:

On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
> | postgres=# CREATE TABLE t(i int);
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
> | postgres=# \d t
> |                  Table "public.t"
> |  Column |  Type   | Collation | Nullable | Default 
> | --------+---------+-----------+----------+---------
> |  i      | integer |           |          | 
> | Indexes:
> |     "t_i_idx" btree (i)
> | Statistics objects:
> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> on ... what ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/22/21 3:29 AM, Justin Pryzby wrote:
> This already needs to be rebased on 55dc86eca.
> 
> And needs to update rules.out.
> 

Whooops. A fixed version attached.

> And doesn't address this one:
> 
> On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
>> | postgres=# CREATE TABLE t(i int);
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
>> | postgres=# \d t
>> |                  Table "public.t"
>> |  Column |  Type   | Collation | Nullable | Default
>> | --------+---------+-----------+----------+---------
>> |  i      | integer |           |          |
>> | Indexes:
>> |     "t_i_idx" btree (i)
>> | Statistics objects:
>> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
>>
>> on ... what ?
> 

Umm, for me that prints:

test=# CREATE TABLE t(i int);
CREATE TABLE
test=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
CREATE STATISTICS
test=# \d t
                  Table "public.t"
  Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
  i      | integer |           |          |
Statistics objects:
     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t

which I think is OK. But maybe there's something else to trigger the 
problem?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > | Statistics objects:
> > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> Umm, for me that prints:

>     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> 
> which I think is OK. But maybe there's something else to trigger the
> problem?

Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
I think it's considered ok if old clients' \d commands don't work on a new
server, but it's not clear to me if it's ok if they misbehave.  It would
almost be better if it raised an ERROR.

In any case, why are there so many parentheses ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Thu, Jan 21, 2021 at 10:01:01PM -0600, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > | Statistics objects:
> > > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > 
> > Umm, for me that prints:
> 
> >     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > 
> > which I think is OK. But maybe there's something else to trigger the
> > problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It's almost
> better it made an ERROR.

I think you'll maybe have to do something better - this seems a bit too weird:

| postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
| postgres=# \d t
| ...
|     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t

It suggests including additional columns in stxkeys for each expression.
Maybe that also helps give direction to a response to Dean's concern?

That doesn't make old psql do anything more desirable, though.
Unless you also added attributes, all you can do is make it say things like
"columns: ctid".

> In any case, why are there so many parentheses ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> I think you'll maybe have to do something better - this seems a bit too weird:
>
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
> | postgres=# \d t
> | ...
> |     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>

I guess that's not surprising, given that old psql knows nothing about
expressions in stats.

In general, I think connecting old versions of psql to newer servers
is not supported. You're lucky if \d works at all. So it shouldn't be
this patch's responsibility to make that output nicer.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/22/21 10:00 AM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>> I think you'll maybe have to do something better - this seems a bit too weird:
>>
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
>> | postgres=# \d t
>> | ...
>> |     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>>
> 
> I guess that's not surprising, given that old psql knows nothing about
> expressions in stats.
> 
> In general, I think connecting old versions of psql to newer servers
> is not supported. You're lucky if \d works at all. So it shouldn't be
> this patch's responsibility to make that output nicer.
> 

Yeah. It's not clear to me what exactly we could do about this, short of 
"backpatching" the old psql or making ruleutils.c consider the psql 
version. Neither of these seems possible/acceptable.

I'm sure this is not the only place showing "incomplete" information in 
an old psql connected to a new server.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/22/21 5:01 AM, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
>>>> | Statistics objects:
>>>> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
>>
>> Umm, for me that prints:
> 
>>      "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
>>
>> which I think is OK. But maybe there's something else to trigger the
>> problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It's almost
> better it made an ERROR.
> 

Well, how would the server know to throw an error? We can't quite patch 
the old psql (if we could, we could just tweak the query).

> In any case, why are there so many parentheses ?
> 

That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be 
adding extra parentheses, on top of what deparse_expression_pretty does. 
Will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Whooops. A fixed version attached.
>

The change to pg_stats_ext_exprs isn't quite right, because now it
cross joins expressions and their stats, which leads to too many rows,
with the wrong stats being listed against expressions. For example:

CREATE TABLE foo (a int, b text);
INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
ANALYSE foo;

SELECT tablename, statistics_name, expr, most_common_vals
  FROM pg_stats_ext_exprs;

 tablename | statistics_name |   expr   | most_common_vals
-----------+-----------------+----------+------------------
 foo       | foo_s           | (a * 10) | {10}
 foo       | foo_s           | (a * 10) | {XXX}
 foo       | foo_s           | upper(b) | {10}
 foo       | foo_s           | upper(b) | {XXX}
(4 rows)
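
(For illustration only -- this is not the actual view definition, just the
general bug class: unnesting two parallel arrays without aligning their
positions produces a cross join, whereas WITH ORDINALITY keeps element i
paired with element i.)

SELECT e.x AS expr, s.x AS stats
  FROM unnest(ARRAY['(a * 10)', 'upper(b)']) AS e(x),
       unnest(ARRAY['{10}', '{XXX}']) AS s(x);
-- 4 rows, the wrong pairing

SELECT e.x AS expr, s.x AS stats
  FROM unnest(ARRAY['(a * 10)', 'upper(b)']) WITH ORDINALITY AS e(x, i)
  JOIN unnest(ARRAY['{10}', '{XXX}']) WITH ORDINALITY AS s(x, i) USING (i);
-- 2 rows, each expression paired with its own stats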


More protection is still required for tables with no analysable
columns. For example:

CREATE TABLE foo();
CREATE STATISTICS foo_s ON (1) FROM foo;
INSERT INTO foo SELECT FROM generate_series(1,1000);
ANALYSE foo;

Program received signal SIGSEGV, Segmentation fault.
0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
    exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
664            stats[i]->tupDesc = vacatts[0]->tupDesc;

#0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
    attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
    at extended_stats.c:664
#1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
    totalrows=1000, numrows=100, rows=0x216d040, natts=0,
    vacattrstats=0x216cb40) at extended_stats.c:161
#2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
    params=0x7ffc06f7d450, va_cols=0x0,
    acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
    in_outer_xact=false, elevel=13) at analyze.c:595


Attached is an incremental update fixing those issues, together with a
few more suggested improvements:

There was quite a bit of code duplication in extended_stats.c which I
attempted to reduce by

1). Deleting examine_opclause_expression() in favour of examine_clause_args().
2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
3). Merging examine_clause_args() and examine_clause_args2(), renaming
it examine_opclause_args() (which was actually the name it had in its
original doc comment, despite the name in the code being different).
4). Merging statext_extract_expression() and
statext_extract_expression_internal() into
statext_is_compatible_clause() and
statext_is_compatible_clause_internal() respectively.

That last change goes beyond just removing code duplication. It allows
support for compound clauses that contain a mix of attribute and
expression clauses, for example, this simple test case wasn't
previously estimated well:

CREATE TABLE foo (a int, b int, c int);
INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);

I didn't add any new regression tests, but perhaps it would be worth
adding something to test a case like that.

I changed choose_best_statistics() in a couple of ways. Firstly, I
think it wants to only count expressions from fully covered clauses,
just as we only count attributes if the stat covers all the attributes
from a clause, since otherwise the stat cannot estimate the clause, so
it shouldn't count. Secondly, I think the number of expressions in the
stat needs to be added to its number of keys, so that the choice of
narrowest stat with the same number of matches counts expressions in
the same way as attributes.
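
A rough sketch of the rule I have in mind (hypothetical helper names and
fields, not the actual choose_best_statistics() code):

    StatisticExtInfo *best = NULL;
    int         best_matches = 0;
    int         best_numkeys = INT_MAX;
    ListCell   *lc;

    foreach(lc, stats_list)
    {
        StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(lc);
        int         numkeys = bms_num_members(stat->keys) + list_length(stat->exprs);
        int         nmatches = 0;
        ListCell   *lc2;

        foreach(lc2, clause_infos)
        {
            ClauseInfo *ci = (ClauseInfo *) lfirst(lc2);    /* hypothetical struct */

            /* a clause only counts if the stat covers all of its attnums and exprs */
            if (bms_is_subset(ci->attnums, stat->keys) &&
                stat_covers_exprs(stat, ci->exprs))         /* hypothetical helper */
                nmatches += bms_num_members(ci->attnums) + list_length(ci->exprs);
        }

        /* most covered items wins; ties go to the narrowest stat */
        if (nmatches > best_matches ||
            (nmatches == best_matches && numkeys < best_numkeys))
        {
            best = stat;
            best_matches = nmatches;
            best_numkeys = numkeys;
        }
    }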

I simplified the code in statext_mcv_clauselist_selectivity(), by
attempting to handle expressions and attributes together in the same
way, making it much closer to the original code. I don't think that
the check for the existence of a stat covering all the expressions in
a clause was necessary when pre-processing the list of clauses, since
that's checked later on, so it's enough to just detect compatible
clauses. Also, it now checks for stats that cover both the attributes
and the expressions from each clause, rather than one or the other, to
cope with examples like the one above. I also updated the check for
simple_clauses -- what's wanted there is to identify clauses that only
reference a single column or a single expression, so that the later
code doesn't apply multi-column estimates to it.

I'm attaching it as an incremental patch (0004) on top of your patches,
but if 0003 and 0004 are collapsed together, the total number of diffs
is less than 0003 alone.

Regards,
Dean

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/27/21 12:02 PM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Whooops. A fixed version attached.
>>
> 
> The change to pg_stats_ext_exprs isn't quite right, because now it
> cross joins expressions and their stats, which leads to too many rows,
> with the wrong stats being listed against expressions. For example:
> 
> CREATE TABLE foo (a int, b text);
> INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
> CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
> ANALYSE foo;
> 
> SELECT tablename, statistics_name, expr, most_common_vals
>   FROM pg_stats_ext_exprs;
> 
>  tablename | statistics_name |   expr   | most_common_vals
> -----------+-----------------+----------+------------------
>  foo       | foo_s           | (a * 10) | {10}
>  foo       | foo_s           | (a * 10) | {XXX}
>  foo       | foo_s           | upper(b) | {10}
>  foo       | foo_s           | upper(b) | {XXX}
> (4 rows)
> 
> 
> More protection is still required for tables with no analysable
> columns. For example:
> 
> CREATE TABLE foo();
> CREATE STATISTICS foo_s ON (1) FROM foo;
> INSERT INTO foo SELECT FROM generate_series(1,1000);
> ANALYSE foo;
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
>     exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
> 664            stats[i]->tupDesc = vacatts[0]->tupDesc;
> 
> #0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
>     attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
>     at extended_stats.c:664
> #1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
>     totalrows=1000, numrows=100, rows=0x216d040, natts=0,
>     vacattrstats=0x216cb40) at extended_stats.c:161
> #2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
>     params=0x7ffc06f7d450, va_cols=0x0,
>     acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
>     in_outer_xact=false, elevel=13) at analyze.c:595
> 
> 
> Attached is an incremental update fixing those issues, together with a
> few more suggested improvements:
> 
> There was quite a bit of code duplication in extended_stats.c which I
> attempted to reduce by
> 
> 1). Deleting examine_opclause_expression() in favour of examine_clause_args().
> 2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
> 3). Merging examine_clause_args() and examine_clause_args2(), renaming
> it examine_opclause_args() (which was actually the name it had in its
> original doc comment, despite the name in the code being different).
> 4). Merging statext_extract_expression() and
> statext_extract_expression_internal() into
> statext_is_compatible_clause() and
> statext_is_compatible_clause_internal() respectively.
> 
> That last change goes beyond just removing code duplication. It allows
> support for compound clauses that contain a mix of attribute and
> expression clauses, for example, this simple test case wasn't
> previously estimated well:
> 
> CREATE TABLE foo (a int, b int, c int);
> INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
> CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);
> 
> I didn't add any new regression tests, but perhaps it would be worth
> adding something to test a case like that.
> 
> I changed choose_best_statistics() in a couple of ways. Firstly, I
> think it wants to only count expressions from fully covered clauses,
> just as we only count attributes if the stat covers all the attributes
> from a clause, since otherwise the stat cannot estimate the clause, so
> it shouldn't count. Secondly, I think the number of expressions in the
> stat needs to be added to it's number of keys, so that the choice of
> narrowest stat with the same number of matches counts expressions in
> the same way as attributes.
> 
> I simplified the code in statext_mcv_clauselist_selectivity(), by
> attempting to handle expressions and attributes together in the same
> way, making it much closer to the original code. I don't think that
> the check for the existence of a stat covering all the expressions in
> a clause was necessary when pre-processing the list of clauses, since
> that's checked later on, so it's enough to just detect compatible
> clauses. Also, it now checks for stats that cover both the attributes
> and the expressions from each clause, rather than one or the other, to
> cope with examples like the one above. I also updated the check for
> simple_clauses -- what's wanted there is to identify clauses that only
> reference a single column or a single expression, so that the later
> code doesn't apply multi-column estimates to it.
> 
> I'm attaching it as a incremental patch (0004) on top of your patches,
> but if 0003 and 0004 are collapsed together, the total number of diffs
> is less than 0003 alone.
> 

Thanks. All of this seems like a clear improvement, both removing the
duplicate copy-pasted code, and fixing the handling of the cases that
mix plain variables and expressions. FWIW I agree there should be a
regression test for this, so I'll add one.

I think the main remaining issue is how we handle the expressions in
bitmapsets. I've been experimenting with this a bit, but shifting the
regular attnums and stashing expressions before them seems quite
complex, especially when we don't know how many expressions there are
(e.g. when merging functional dependencies). It's true using attnums
above MaxHeapAttributeNumber for expressions wastes ~200B, but is that
really an issue, considering it's a very short-lived allocation?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a rebased patch series, merging the changes from the last
review into the 0003 patch, and with a WIP patch 0004 reworking the
tracking of expressions (to address the inefficiency due to relying on
MaxHeapAttributeNumber).

The 0004 patch is very much experimental, with a lot of ad hoc changes.
It passes make check, but it definitely needs much more work, cleanup
and testing. At this point it's more a demonstration of what would be
needed to rework it like this.

The main change is that instead of handling expressions by assigning
them attnums above MaxHeapAttributeNumber, we assign them system-like
attnums, i.e. negative ones. So the first one gets -1, the second one
-2, etc. And then we shift all attnums above 0, to allow using the
bitmapset as before.
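
To make the mapping concrete, a minimal sketch (the real code is spread
across the patch, this is just the idea):

    /*
     * Illustrative only (not the exact patch code): with expressions
     * numbered -1, -2, ..., -nexprs and regular columns keeping their
     * attnums, everything fits into one bitmapset by offsetting with the
     * number of expressions in this particular statistics object.
     * Expression -1 maps to (nexprs - 1), expression -nexprs maps to 0,
     * and column k maps to (nexprs + k).
     */
    static inline int
    stat_member_index(AttrNumber attnum, int nexprs)
    {
        return (int) attnum + nexprs;
    }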

Overall, this works, but the shifting is kinda pointless - it allows us
to build a bitmapset, but it's mostly useless because the encoding depends
on how many expressions are in the statistics definition. So we can't compare
or combine bitmapsets for different statistics, and (more importantly)
we can't easily compare bitmapset on attnums from clauses.

Using MaxHeapAttributeNumber allowed using the bitmapsets at least for
regular attributes. Not sure if that's a major advantage, outweighing
wasting some space.

I wonder if we should just ditch the bitmapsets, and just use simple
arrays of attnums. I don't think we expect too many elements here,
especially when dealing with individual statistics. So now we're just
building and rebuilding the bitmapsets ... seems pointless.


One thing I'd like to improve (independently of what we do with the
bitmapsets) is getting rid of the distinction between attributes and
expressions when building the statistics - currently all the various
places have to care about whether the item is an attribute or an expression, 
and look either into the tuple or the array of pre-calculated values, do 
various shifts to get the indexes, etc. That's quite tedious, and I've
made a lot of errors in that (and I'm sure there are more). So IMO we
should simplify this by using a single structure that holds values 
for both attributes and expressions, handling them in a unified way.
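
Roughly something like this, perhaps (just a sketch of the idea, not an
actual definition from the patch):

    typedef struct StatsBuildData
    {
        int         numrows;        /* number of sampled rows */
        HeapTuple  *rows;           /* the sample rows */
        int         nattnums;       /* plain columns + expressions */
        AttrNumber *attnums;        /* >0 for columns, <0 for expressions */
        VacAttrStats **stats;       /* per-item statistics slots */
        Datum     **values;         /* pre-computed values[item][row] */
        bool      **nulls;          /* pre-computed nulls[item][row] */
    } StatsBuildData;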


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

On 3/17/21 4:55 PM, Dean Rasheed wrote:
> On Sun, 7 Mar 2021 at 21:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> 2) ndistinct
>>
>> There's one thing that's bugging me, in how we handle "partial" matches.
>> For each expression we track both the original expression and the Vars
>> we extract from it. If we can't find a statistics matching the whole
>> expression, we try to match those individual Vars, and we remove the
>> matching ones from the list. And in the end we multiply the estimates
>> for the remaining Vars.
>>
>> This works fine with one matching ndistinct statistics. Consider for example
>>
>>      GROUP BY (a+b), (c+d)
>>
>> with statistics on [(a+b),c] - that is, expression and one column.
> 
> I've just been going over this example, and I think it actually works
> slightly differently from how you described, though I haven't worked
> out the full general implications of that.
> 
>> We parse the expressions into two GroupExprInfo
>>
>>      {expr: (a+b), vars: [a, b]}
>>      {expr: (c+d), vars: [c, d]}
>>
> 
> Here, I think what you actually get, in the presence of stats on
> [(a+b),c] is actually the following two GroupExprInfos:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: [c, d]}
> 

Yeah, right. To be precise, we get

    {expr: (a+b), vars: [(a+b)]}

because in the first case we pass NIL, so add_unique_group_expr treats
the whole expression as a var (a bit strange, but OK).

> because of the code immediately after this comment in estimate_num_groups():
> 
>         /*
>          * If examine_variable is able to deduce anything about the GROUP BY
>          * expression, treat it as a single variable even if it's really more
>          * complicated.
>          */
> 
> As it happens, that makes no difference in this case, where there is
> just this one stats object, but it does change things when there are
> two stats objects.
> 
>> and the statistics matches the first item exactly (the expression). The
>> second expression is not in the statistics, but we match "c". So we end
>> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>>
>>      {expr: (c+d), vars: [d]}
> 
> Right.
> 
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>>      ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
> 
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
> 
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>>      ndistinct((a+b), c) * ndistinct((c+d))
> 
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: []}
> 
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
> 
>      ndistinct((a+b)) * ndistinct((c+d))
> 
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
> 

Yeah, that works pretty well in this case.

I wonder if we'd be better off extracting the Vars and doing what I
mistakenly described as the current behavior. That would essentially mean 
extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression
stats over multi-column stats to some extent) works fine as long as the
parts are independent, but once there's dependency it's probably more
likely to produce underestimates. I think underestimates for grouping
estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> 
> Not using the (c+d) stats would give either
> 
>      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> or
> 
>      ndistinct((a+b), c) * ndistinct(d)
> 
> depending on exactly how the algorithm was changed. In my test, these
> both gave worse estimates, but there are probably other datasets for
> which they might do better. It all depends on how much correlation
> there is between (a+b) and c.
> 
> I suspect that there is no optimal strategy for handling overlapping
> stats that works best for all datasets, but the current algorithm
> seems to do a pretty decent job.
> 
>> Of course, maybe there's a better way to pick the statistics, but I
>> think our conclusion so far was that people should just create
>> statistics covering all the columns in the query, to not have to match
>> multiple statistics like this.
> 
> Yes, I think that's always likely to work better, especially for
> ndistinct stats, where all possible permutations of subsets of the
> columns are included, so a single ndistinct stat can work well for a
> range of different queries.
> 

Yeah, I agree that's a reasonable mitigation. Ultimately, there's no
perfect algorithm for how to pick and combine stats when we don't know if
there even is a statistical dependency between the subsets of columns.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> My concern is that the current behavior (where we prefer expression
> stats over multi-column stats to some extent) works fine as long as the
> parts are independent, but once there's dependency it's probably more
> likely to produce underestimates. I think underestimates for grouping
> estimates were a risk in the past, so let's not make that worse.
>

I'm not sure the current behaviour really is preferring expression
stats over multi-column stats. In this example, where we're grouping
by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
those multi-column stats actually match more than one
column/expression. If anything, I'd go the other way and say that it
was wrong to use the [(a+b),c] stats in the first case, where they
were the only stats available, since those stats aren't really
applicable to (c+d), which probably ought to be treated as
independent. IOW, it might have been better to estimate the first case
as

     ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

and the second case as

     ndistinct((a+b)) * ndistinct((c+d))

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/17/21 7:54 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> My concern is that the current behavior (where we prefer expression
>> stats over multi-column stats to some extent) works fine as long as the
>> parts are independent, but once there's dependency it's probably more
>> likely to produce underestimates. I think underestimates for grouping
>> estimates were a risk in the past, so let's not make that worse.
>>
> 
> I'm not sure the current behaviour really is preferring expression
> stats over multi-column stats. In this example, where we're grouping
> by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
> those multi-column stats actually match more than one
> column/expression. If anything, I'd go the other way and say that it
> was wrong to use the [(a+b),c] stats in the first case, where they
> were the only stats available, since those stats aren't really
> applicable to (c+d), which probably ought to be treated as
> independent. IOW, it might have been better to estimate the first case
> as
> 
>      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> and the second case as
> 
>      ndistinct((a+b)) * ndistinct((c+d))
> 

OK. I might be confused, but isn't that what the algorithm currently
does? Or am I just confused about what the first/second case refers to?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> >      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> >      ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) *
ndistinct(d). Having said that, maybe that's OK after all. It at least
makes an effort to account for any correlation between (a+b) and
(c+d), using the known correlation between (a+b) and c. For reference,
here is the test case I was using (which isn't really very good for
catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 2964, Actual = 478
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 480, Actual = 478
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete
stats (the estimate with no extended stats is capped at 10000).

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> For reference, here is the test case I was using (which isn't really very good for
> catching dependence between columns):
>

And here's a test case with much more dependence between the columns:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 2
SELECT COUNT(DISTINCT b) FROM foo; -- 5
SELECT COUNT(DISTINCT c) FROM foo; -- 10
SELECT COUNT(DISTINCT d) FROM foo; -- 15
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 150, Actual = 30
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
  -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900
  -- Estimate with no stats = 1500

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 120, Actual = 30
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20

Again, I'd say the current behaviour is pretty good.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> For reference, here is the test case I was using (which isn't really very good for
>> catching dependence between columns):
>>
> 
> And here's a test case with much more dependence between the columns:
> 
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
> 
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 150, Actual = 30
>   -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
>   -- which is much better than ndistinct((a+b)) * ndistinct(c) *
> ndistinct(d) = 6*10*15 = 900
>   -- Estimate with no stats = 1500
> 
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 120, Actual = 30
>   -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
> 
> Again, I'd say the current behaviour is pretty good.
> 

Thanks!

I agree applying at least the [(a+b),c] stats is probably the right
approach, as it means we're considering at least the available
information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping
statistics, a bit like conditional probability. In this case we might do
something like this:

    ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

    228 * 24 / 12 = 456   (actual = 478, current estimate = 480)

or, for the "more correlated" case

    10 * 20 / 10 = 20     (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are
cases where this will produce worse estimates.


Anyway, I plan to go over the patches one more time, and start pushing
them sometime early next week. I don't want to leave it until the very
last moment in the CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 21:31, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I agree applying at least the [(a+b),c] stats is probably the right
> approach, as it means we're considering at least the available
> information about dependence between the columns.
>
> I think to improve this, we'll need to teach the code to use overlapping
> statistics, a bit like conditional probability. In this case we might do
> something like this:
>
>     ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Yes, I was thinking the same thing. That would be equivalent to
applying a multiplicative "correction" factor of

  ndistinct(a,b,c,...) / ( ndistinct(a) * ndistinct(b) * ndistinct(c) * ... )

for each multivariate stat applicable to more than one
column/expression, regardless of whether those columns were already
covered by other multivariate stats. That might well simplify the
implementation, as well as probably produce better estimates.
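
(To sanity-check that against the more correlated test case upthread: the
[(a+b),c] stat would give a correction factor of

     ndistinct((a+b),c) / (ndistinct((a+b)) * ndistinct(c)) = 10 / (6 * 10) = 1/6

and applying it to the independent estimate ndistinct((a+b)) * ndistinct((c+d))
= 6 * 20 = 120 gives 20, i.e. the same answer as the conditional-probability
form above.)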

> But that's clearly a matter for a future patch, and I'm sure there are
> cases where this will produce worse estimates.

Agreed.

> Anyway, I plan to go over the patches one more time, and start pushing
> them sometime early next week. I don't want to leave it until the very
> last moment in the CF.

+1. I think they're in good enough shape for that process to start.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

I've pushed the first two parts, dealing with composite types during
bootstrap. I've decided to commit both, including the array->list
conversion, as that makes the reloads simpler. I've made two tweaks:

1) I've renamed the function to reload_typ_list, which I think is better
(and it used to be reload_typ_array).

2) I've removed the did_reread assert. It'd allow just a single reload,
which blocks recursive composite types - seems unnecessary, although we
don't need that now. I think we can't have infinite recursion, because
we can only load types from already defined catalogs (so no cycles).


I've rebased and cleaned up the main part of the patch. There's a bunch
of comments slightly reworded / cleaned up, etc. The more significant
changes are:

1) The explanation of the example added to create_statistics.sgml was
somewhat wrong, so I corrected that.

2) I've renamed StatBuildData to StatsBuildData.

3) I've resolved the FIXMEs in examine_expression.

We don't need to do anything special about the collation, because unlike
indexes it's not possible to specify "collate" for the attributes. It's
possible to do that in the expression, but exprCollation handles that.

For statistics target, we simply use the value determined for the
statistics itself. There's no way to specify that for expressions.

4) I've updated the comments about ndistinct estimates in selfuncs.c,
because some of it was a bit wrong/misleading - per the discussion we
had about matching stats to expressions.

5) I've also tweaked the add_unique_group_expr so that we don't have to
run examine_variable() repeatedly if we already have it.

6) Resolved the FIXME about acl_ok in examine_variable. Instead of just
setting it to 'true' it now mimics what we do for indexes. I think it's
correct, but this is probably worth a review.

7) psql now prints expressions only for (version >= 14). I've considered
tweaking the existing block, but that was quite incomprehensible so I
just made a modified copy.

I think this is 99.999% ready to be pushed, so barring objections I'll
do that in a day or two.

The part 0003 is a small tweak I'll consider, preferring exact matches
of expressions over Var matches. It's not a clear win (but what is, in a
greedy algorithm), but it does improve one of the regression tests.
Minor change, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'

It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me.  For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".  

> +       Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table.  Say "Name of the schema which contains the table" ?

> +       Name of table

Say "name of table on which the extended statistics are defined"

> +       Name of extended statistics

"Name of the extended statistic object"

> +       Owner of the extended statistics

..object

> +       Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object".  Maybe "..on which the extended statistic is defined"

> +       of random access to the disk.  (This expression is null if the expression
> +       data type does not have a <literal><</literal> operator.)

expression's data type

> +   much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> +   planner has no information about relationship between the expressions, so it

the relationship

> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> +   conditions are independent, and multiplies their selectivities together to
> +   arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> +   This is further exacerbated by the lack of accurate statistics for the
> +   expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> +   expression derived from ndistinct for the column. With such statistics, the
> +   planner recognizes that the conditions are correlated and arrives at much
> +   more accurate estimates.

are correlated comma

> +            if (type->lt_opr == InvalidOid)

These could be !OidIsValid

> +     * expressions. It's either expensive or very easy to defeat for
> +     * determined used, and there's no risk if we allow such statistics (the
> +     * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> +     * If there are no simply-referenced columns, give the statistics an auto
> +     * dependency on the whole table.  In most cases, this will be redundant,
> +     * but it might not be if the statistics expressions contain no Vars
> +     * (which might seem strange but possible).
> +     */
> +    if (!nattnums)
> +    {
> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> +    }

Can this be unconditional ?

> +     * Translate the array of indexs to regular attnums for the dependency (we

sp: indexes

> +                     * Not found a matching expression, so we can simply skip

Found no matching expr

> +                /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ?  So it's easy to add a
breakpoint there, for example.

> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
> +                                                 * something else? */

> +        bool        nulls[Natts_pg_statistic];
...
> +         * Construct a new pg_statistic tuple
> +         */
> +        for (i = 0; i < Natts_pg_statistic; ++i)
> +        {
> +            nulls[i] = false;
> +        }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> +                 * We don't store collations used to build the statistics, but
> +                 * we can use the collation for the attribute itself, as
> +                 * stored in varcollid. We do reset the statistics after a
> +                 * type change (including collation change), so this is OK. We
> +                 * may need to relax this after allowing extended statistics
> +                 * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>          }
>  
>          /* print any extended statistics */
> -        if (pset.sversion >= 100000)
> +        if (pset.sversion >= 140000)
> +        {
> +            printfPQExpBuffer(&buf,
> +                              "SELECT oid, "
> +                              "stxrelid::pg_catalog.regclass, "
> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
> +                              "stxname,\n"
> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
> +
> +            if (pset.sversion >= 130000)
> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
> +            else
> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");

 >= 130000 is fully determined by >= 140000 :)

> +     * type of the opclass, which is not interesting for our purposes.  (Note:
> +     * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are
still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
> 
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates.  Create it in the per-index context to be
> Fetch function for analyzing index expressions.



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
I got this crash running sqlsmith:

#1  0x00007f907574b801 in __GI_abort () at abort.c:79
#2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db
    "bms_num_members(varnos) == 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion",
    fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
#3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100,
    root=0x5646ba9a0cb0) at selfuncs.c:3332
#4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200)
    at selfuncs.c:3307
#5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
#6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
#7  grouping_planner () at planner.c:2238
#8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80,
    parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
    at planner.c:1024
#9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256,
    boundParams=<optimized out>) at planner.c:404
#10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
    query_string=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
 
#11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
    query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at
postgres.c:912
#12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104

2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct 
          
            pg_catalog.variance(
              cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by
subq_0.c0)as c0, 
 
          subq_0.c2 as c1, 
          subq_0.c0 as c2, 
          subq_0.c2 as c3, 
          subq_0.c1 as c4, 
          subq_0.c1 as c5, 
          subq_0.c0 as c6
        from 
          (select  
                ref_1.foreign_server_catalog as c0, 
                ref_1.authorization_identifier as c1, 
                sample_2.tgname as c2, 
                ref_1.foreign_server_catalog as c3
              from 
                pg_catalog.pg_stat_database_conflicts as ref_0
                        left join information_schema._pg_user_mappings as ref_1
                        on (ref_0.datname < ref_0.datname)
                      inner join pg_catalog.pg_amproc as sample_0 tablesample system (5) 
                      on (cast(null as uuid) < cast(null as uuid))
                    left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9) 
                    on (sample_0.amprocnum = sample_1.aggnumdirectargs )
                  inner join pg_catalog.pg_trigger as sample_2 tablesampl



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi Justin,

Unfortunately the query is incomplete, so I can't quite determine what
went wrong. Can you extract the full query causing the crash, either
from the server log or from a core file?

thanks

On 3/24/21 9:14 AM, Justin Pryzby wrote:
> I got this crash running sqlsmith:
> 
> #1  0x00007f907574b801 in __GI_abort () at abort.c:79
> #2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db
"bms_num_members(varnos)== 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", 
 
>     fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
> #3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100,
root=0x5646ba9a0cb0)at selfuncs.c:3332
 
> #4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200)
atselfuncs.c:3307
 
> #5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
> #6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
> #7  grouping_planner () at planner.c:2238
> #8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80,
parent_root=parent_root@entry=0x0,hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
 
>     at planner.c:1024
> #9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256,
boundParams=<optimizedout>) at planner.c:404
 
> #10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
>     query_string=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
 
> #11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
>     query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at
postgres.c:912
> #12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104
> 
> 2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
> 2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct 
>           
>             pg_catalog.variance(
>               cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order
bysubq_0.c0) as c0, 
 
>           subq_0.c2 as c1, 
>           subq_0.c0 as c2, 
>           subq_0.c2 as c3, 
>           subq_0.c1 as c4, 
>           subq_0.c1 as c5, 
>           subq_0.c0 as c6
>         from 
>           (select  
>                 ref_1.foreign_server_catalog as c0, 
>                 ref_1.authorization_identifier as c1, 
>                 sample_2.tgname as c2, 
>                 ref_1.foreign_server_catalog as c3
>               from 
>                 pg_catalog.pg_stat_database_conflicts as ref_0
>                         left join information_schema._pg_user_mappings as ref_1
>                         on (ref_0.datname < ref_0.datname)
>                       inner join pg_catalog.pg_amproc as sample_0 tablesample system (5) 
>                       on (cast(null as uuid) < cast(null as uuid))
>                     left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9) 
>                     on (sample_0.amprocnum = sample_1.aggnumdirectargs )
>                   inner join pg_catalog.pg_trigger as sample_2 tablesampl
> 

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Mar 24, 2021 at 09:54:22AM +0100, Tomas Vondra wrote:
> Hi Justin,
> 
> Unfortunately the query is incomplete, so I can't quite determine what
> went wrong. Can you extract the full query causing the crash, either
> from the server log or from a core file?

Oh, shoot, I didn't realize it was truncated, and I already destroyed the core
and moved on to something else...

But this fails well enough, and may be much shorter than the original :)

select distinct
     pg_catalog.variance(
       cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,
   subq_0.c2 as c1, subq_0.c0 as c2, subq_0.c2 as c3, subq_0.c1 as c4, subq_0.c1 as c5, subq_0.c0 as c6
 from
   (select
         ref_1.foreign_server_catalog as c0,
         ref_1.authorization_identifier as c1,
         sample_2.tgname as c2,
         ref_1.foreign_server_catalog as c3
       from
         pg_catalog.pg_stat_database_conflicts as ref_0
                 left join information_schema._pg_user_mappings as ref_1
                 on (ref_0.datname < ref_0.datname)
               inner join pg_catalog.pg_amproc as sample_0 tablesample system (5)
               on (cast(null as uuid) < cast(null as uuid))
             left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9)
             on (sample_0.amprocnum = sample_1.aggnumdirectargs )
           inner join pg_catalog.pg_trigger as sample_2 tablesample system (1) on true )subq_0;

TRAP: FailedAssertion("bms_num_members(varnos) == 1", File: "selfuncs.c", Line: 3332, PID: 16422)

Also ... with this patch CREATE STATISTICS is no longer rejecting multiple
tables, and instead does this:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Thanks, it seems to be some thinko in the handling of PlaceHolderVars, which
seems to break the code's assumptions about varnos. This fixes it for me,
but I need to look at it more closely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
> seem to break the code's assumptions about varnos. This fixes it for me,
> but I need to look at it more closely.
>

I think that makes sense.

Reviewing the docs, I noticed a couple of omissions, and had a few
other suggestions (attached).

Regards,
Dean

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 2:36 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
>> seem to break the code's assumptions about varnos. This fixes it for me,
>> but I need to look at it more closely.
>>
> 
> I think that makes sense.
> 

AFAIK the primary issue here is that the two places disagree. While
estimate_num_groups does this

    varnos = pull_varnos(root, (Node *) varshere);
    if (bms_membership(varnos) == BMS_SINGLETON)
    { ... }

the add_unique_group_expr does this

    varnos = pull_varnos(root, (Node *) groupexpr);

That is, one looks at the group expression, while the other looks at the Vars
extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
this can differ, causing the crash.

So we need to change one of those places - my fix tweaked the second
place to also look at the vars, but maybe we should change the other
place? Or maybe it's not the right fix for PlaceHolderVars ...
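
For the record, the tweak amounts to roughly this (a sketch; the exact
flag set may differ from the actual patch):

    /* in add_unique_group_expr(), mirror what estimate_num_groups() does */
    varshere = pull_var_clause(groupexpr,
                               PVC_RECURSE_AGGREGATES |
                               PVC_RECURSE_WINDOWFUNCS |
                               PVC_RECURSE_PLACEHOLDERS);
    varnos = pull_varnos(root, (Node *) varshere);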

> Reviewing the docs, I noticed a couple of omissions, and had a few
> other suggestions (attached).
> 

Thanks! I'll include that in the next version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> AFAIK the primary issue here is that the two places disagree. While
> estimate_num_groups does this
>
>     varnos = pull_varnos(root, (Node *) varshere);
>     if (bms_membership(varnos) == BMS_SINGLETON)
>     { ... }
>
> the add_unique_group_expr does this
>
>     varnos = pull_varnos(root, (Node *) groupexpr);
>
> That is, one looks at the group expression, while the other look at vars
> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
> this can differ, causing the crash.
>
> So we need to change one of those places - my fix tweaked the second
> place to also look at the vars, but maybe we should change the other
> place? Or maybe it's not the right fix for PlaceHolderVars ...
>

I think that it doesn't make any difference which place is changed.

This is a case of an expression with no stats. With your change,
you'll get a single GroupExprInfo containing a list of
VariableStatData's for each of its Vars, whereas if you changed it
the other way, you'd get a separate GroupExprInfo for each Var. But I
think they'd both end up being treated the same by
estimate_multivariate_ndistinct(), since there wouldn't be any stats
matching the expression, only the individual Vars. Maybe changing the
first place would be the more bulletproof fix though.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
> 

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +       Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like a
> "schema-containing" table.  Say "Name of the schema which contains the table" ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +       Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +       Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +       Owner of the extended statistics
> 
> ..object
> 

OK

>> +       Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +       of random access to the disk.  (This expression is null if the expression
>> +       data type does not have a <literal><</literal> operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, so it
> 
> the relationship
> 

OK

>> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> +   conditions are independent, and multiplies their selectivities together to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +            if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> +     * expressions. It's either expensive or very easy to defeat for
>> +     * determined used, and there's no risk if we allow such statistics (the
>> +     * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> +     * If there are no simply-referenced columns, give the statistics an auto
>> +     * dependency on the whole table.  In most cases, this will be redundant,
>> +     * but it might not be if the statistics expressions contain no Vars
>> +     * (which might seem strange but possible).
>> +     */
>> +    if (!nattnums)
>> +    {
>> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> +    }
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior is copied from index_create, so
I'd prefer keeping it the same for consistency reasons. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> +     * Translate the array of indexs to regular attnums for the dependency (we
> 
> sp: indexes
> 

OK

>> +                     * Not found a matching expression, so we can simply skip
> 
> Found no matching expr
> 

OK

>> +                /* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using a line number, for example.

>> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
>> +                                                 * something else? */
> 
>> +        bool        nulls[Natts_pg_statistic];
> ...
>> +         * Construct a new pg_statistic tuple
>> +         */
>> +        for (i = 0; i < Natts_pg_statistic; ++i)
>> +        {
>> +            nulls[i] = false;
>> +        }
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what update_attstats() does, so I prefer
keeping it the same.

>> +                 * We don't store collations used to build the statistics, but
>> +                 * we can use the collation for the attribute itself, as
>> +                 * stored in varcollid. We do reset the statistics after a
>> +                 * type change (including collation change), so this is OK. We
>> +                 * may need to relax this after allowing extended statistics
>> +                 * on expressions.
> 
> This text should be updated or removed ?
> 

Yeah, the last sentence is obsolete. Updated.

>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>>          }
>>  
>>          /* print any extended statistics */
>> -        if (pset.sversion >= 100000)
>> +        if (pset.sversion >= 140000)
>> +        {
>> +            printfPQExpBuffer(&buf,
>> +                              "SELECT oid, "
>> +                              "stxrelid::pg_catalog.regclass, "
>> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
>> +                              "stxname,\n"
>> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
>> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
>> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
>> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
>> +
>> +            if (pset.sversion >= 130000)
>> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
>> +            else
>> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");
> 
>  >= 130000 is fully determined by >= 14000 :)
> 

Ah, right.

>> +     * type of the opclass, which is not interesting for our purposes.  (Note:
>> +     * if we did anything with non-expression index columns, we'd need to
> 
> index is wrong ?
> 

Fixed

> I mentioned a bunch of other references to "index" and "predicate" which are
> still around:
> 

Whooops, sorry. Fixed.


I'll post a cleaned-up version of the patch addressing Dean's review
comments too.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":
|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:
|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":
doc/src/sgml/perform.sgml:     Another type of statistics stored for each column are most-common value
doc/src/sgml/ref/psql-ref.sgml:        The status of each kind of extended statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:
src/backend/commands/statscmds.c:    errmsg("statistics creation on system columns is not supported")));
src/backend/commands/statscmds.c:    errmsg("cannot have more than %d columns in statistics",
src/backend/commands/statscmds.c:        * If we got here and the OID is not valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given the list of column names for it
src/backend/statistics/extended_stats.c:                /* compute statistics target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and there are no rows returned if

should say "for a statistics object" or "for statistics objects"
src/backend/statistics/extended_stats.c: * target for a statistics objects (from the object target, attribute targets

Your patch adds these:

Should say "object":
+        * Check if we actually have a matching statistics for the expression.
+               /* evaluate expressions (if the statistics has any) */
+        * for the extended statistics. The second option seems more reasonable.
+                * the statistics had all options enabled on the original version.
+                * But if the statistics is defined on just a single column, it has to
+       /* has the statistics expressions? */
+                       /* expression - see if it's in the statistics */
+                                        * column(s) the statistics depends on.  Also require all
+        * statistics is defined on more than one column/expression).
+        * statistics is useless, but harmless).
+        * If there are no simply-referenced columns, give the statistics an auto

+                        * Then the first statistics matches no expressions and 3 vars,
+                        * while the second statistics matches one expression and 1 var.
+                        * Currently the first statistics wins, which seems silly.

+                        * [(a+c), d]. But maybe it's better than failing to match the
+                        * second statistics?

I can make patches for these (separate patches for HEAD and your patch), but I
don't think your patch has to wait on it, since the user-facing documentation
is consistent with what's already there, and the rest are internal comments.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/24/21 5:28 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> AFAIK the primary issue here is that the two places disagree. While
>> estimate_num_groups does this
>>
>>     varnos = pull_varnos(root, (Node *) varshere);
>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>     { ... }
>>
>> the add_unique_group_expr does this
>>
>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>
>> That is, one looks at the group expression, while the other looks at vars
>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>> this can differ, causing the crash.
>>
>> So we need to change one of those places - my fix tweaked the second
>> place to also look at the vars, but maybe we should change the other
>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>
> 
> I think that it doesn't make any difference which place is changed.
> 
> This is a case of an expression with no stats. With your change,
> you'll get a single GroupExprInfo containing a list of
> VariableStatData's for each of its Vars, whereas if you changed it
> the other way, you'd get a separate GroupExprInfo for each Var. But I
> think they'd both end up being treated the same by
> estimate_multivariate_ndistinct(), since there wouldn't be any stats
> matching the expression, only the individual Vars. Maybe changing the
> first place would be the more bulletproof fix though.
> 

Yeah, I think that's true. I'll do a bit more research / experiments.

As for the changes proposed in the create_statistics, do we really want
to use univariate / multivariate there? Yes, the terms are correct, but
I'm not sure how many people looking at CREATE STATISTICS will
understand them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> As for the changes proposed in the create_statistics, do we really want
> to use univariate / multivariate there? Yes, the terms are correct, but
> I'm not sure how many people looking at CREATE STATISTICS will
> understand them.
>

Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
when used to describe statistics. You could use "single-column" and
"multi-column", but then "column" isn't really right anymore, since it
might be a column or an expression. I can't think of any other terms
that fit.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 5:48 PM, Tomas Vondra wrote:
> 
> 
> On 3/24/21 5:28 PM, Dean Rasheed wrote:
>> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> AFAIK the primary issue here is that the two places disagree. While
>>> estimate_num_groups does this
>>>
>>>     varnos = pull_varnos(root, (Node *) varshere);
>>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>>     { ... }
>>>
>>> the add_unique_group_expr does this
>>>
>>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>>
>>> That is, one looks at the group expression, while the other looks at vars
>>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>>> this can differ, causing the crash.
>>>
>>> So we need to change one of those places - my fix tweaked the second
>>> place to also look at the vars, but maybe we should change the other
>>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>>
>>
>> I think that it doesn't make any difference which place is changed.
>>
>> This is a case of an expression with no stats. With your change,
>> you'll get a single GroupExprInfo containing a list of
>> VariableStatData's for each of its Vars, whereas if you changed it
>> the other way, you'd get a separate GroupExprInfo for each Var. But I
>> think they'd both end up being treated the same by
>> estimate_multivariate_ndistinct(), since there wouldn't be any stats
>> matching the expression, only the individual Vars. Maybe changing the
>> first place would be the more bulletproof fix though.
>>
> 
> Yeah, I think that's true. I'll do a bit more research / experiments.
> 

Actually, I think we don't need that block at all - there's no point in
keeping the exact expression, because if there were a statistics object
matching it, it would already have been matched by examine_variable. So
if we get here, we have to just split it into the vars anyway, and the
second block is entirely useless.

That however means we don't need the processing with GroupExprInfo and
GroupVarInfo lists, i.e. we can revert back to the original simpler
processing, with a bit of extra logic to match expressions, that's all.

The patch 0003 does this (it's a bit crude, but hopefully enough to
demonstrate).

Here's an updated patch. 0001 should address most of today's review
items regarding comments etc.

0002 is an attempt to fix an issue I noticed today - we need to handle
type changes. Until now we did not have problems with that, because we
only had attnums - so we just reset the statistics (with the exception
of functional dependencies, on the assumption that those remain valid).

With expressions it's a bit more complicated, though.

1) we need to transform the expressions so that the Vars contain the
right type info etc. Otherwise an analyze with the old pg_node_tree crashes

2) we need to reset the pg_statistic[] data too, which however makes
keeping the functional dependencies a bit less useful, because those
rely on the expression stats :-(

So I'm wondering what to do about this. I looked into how ALTER TABLE
handles indexes, and 0003 is a PoC to do the same thing for statistics.
Of course, this is a bit unfortunate because it recreates the statistics
(so we don't keep anything, not even functional dependencies).

I think we have two options:

a) Make UpdateStatisticsForTypeChange smarter to also transform and
update the expression string, and reset pg_statistics[] data.

b) Just recreate the statistics, just like we do for indexes. Currently
this does not force analyze, so it just resets all the stats. Maybe it
should do analyze, though.

Any opinions? I need to think about this a bit more, but maybe (b) with
the analyze is the right thing to do. Keeping just some of the stats
always seemed a bit weird. (This is why the 0002 patch breaks one of the
regression tests.)

BTW I wonder how useful the updated statistics actually is. Consider
this example:

========================================================================
CREATE TABLE t (a int, b int, c int);

INSERT INTO t SELECT mod(i,10), mod(i,10), mod(i,10)
  FROM generate_series(1,1000000) s(i);

CREATE STATISTICS s (ndistinct) ON (a+b), (b+c) FROM t;

ANALYZE t;

EXPLAIN ANALYZE SELECT 1 FROM t GROUP BY (a+b), (b+c);

test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), ((b + c)) FROM t

test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
   Group Key: (a + b), (b + c)
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
(3 rows)
========================================================================

Great. Now let's change one of the data types to something else:

========================================================================
test=# alter table t alter column c type numeric;
ALTER TABLE
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | numeric |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), (((b)::numeric + c)) FROM t

test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27906.17 rows=10 width=40)
   Group Key: (a + b), ((b)::numeric + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=36)
(3 rows)
========================================================================

Great! Let's change it again:

========================================================================
test=# alter table t alter column c type double precision;
ALTER TABLE
test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27923.50 rows=1000 width=16)
   Group Key: (a + b), ((b)::double precision + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=12)
(3 rows)
========================================================================

Well, not that great, apparently. We clearly failed to match the second
expression, so we ended up with (b+c) estimated as (10 * 10). Why? Because
the expression now looks like this:

========================================================================
"public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
precision + c)) FROM t
========================================================================

But we're matching it to (((b)::double precision + c)), so that fails.

This is not specific to extended statistics - indexes have exactly the
same issue. Not sure how common this is in practice.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
> Here's an updated patch. 0001 should address most of today's review
> items regarding comments etc.

This is still an issue:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 1:30 AM, Justin Pryzby wrote:
> On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
>> here's an updated patch. 0001 should address most of the today's review
>> items regarding comments etc.
> 
> This is still an issue:
> 
> postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
> ERROR:  schema "i" does not exist
> 

Ah, right. That's a weird issue. I was really confused about this,
because nothing changes about the grammar or how we check the number of
relations. The problem is pretty trivial - the new code in utility.c
just grabs the first element and casts it to RangeVar, without checking
that it actually is a RangeVar. With joins it's a JoinExpr, so we get a
bogus error.

The attached version fixes it by simply doing the check in utility.c.
It's a bit redundant with what's in CreateStatistics() but I don't think
we can just postpone it easily - we need to do the transformation here,
with access to queryString. But maybe we don't need to pass the relid,
when we have the list of relations in CreateStatsStmt itself ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 1:05 AM, Tomas Vondra wrote:
> ...
>
> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes. Until now we did not have problems with that, because we
> only had attnums - so we just reset the statistics (with the exception
> of functional dependencies, on the assumption that those remain valid).
> 
> With expressions it's a bit more complicated, though.
> 
> 1) we need to transform the expressions so that the Vars contain the
> right type info etc. Otherwise an analyze with the old pg_node_tree crashes
> 
> 2) we need to reset the pg_statistic[] data too, which however makes
> keeping the functional dependencies a bit less useful, because those
> rely on the expression stats :-(
> 
> So I'm wondering what to do about this. I looked into how ALTER TABLE
> handles indexes, and 0003 is a PoC to do the same thing for statistics.
> Of course, this is a bit unfortunate because it recreates the statistics
> (so we don't keep anything, not even functional dependencies).
> 
> I think we have two options:
> 
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
> 
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.
> 
> Any opinions? I need to think about this a bit more, but maybe (b) with
> the analyze is the right thing to do. Keeping just some of the stats
> always seemed a bit weird. (This is why the 0002 patch breaks one of the
> regression tests.)
> 

After thinking about this a bit more I think (b) is the right choice,
and the analyze is not necessary. The reason is fairly simple - we drop
the per-column statistics, because ATExecAlterColumnType does

    RemoveStatistics(RelationGetRelid(rel), attnum);

so the user has to run analyze anyway, to get any reasonable estimates
(we keep the functional dependencies, but those still rely on per-column
statistics quite a bit). And we'll have to do the same thing with
per-expression stats too. It was a nice idea to keep at least the stats
that are not outright broken, but unfortunately it's not a very useful
optimization. It increases the instability of the system, because now we
have estimates with all statistics, no statistics, and something in
between after the partial reset. Not nice.

So my plan is to get rid of UpdateStatisticsForTypeChange, and just do
mostly what we do for indexes. It's not perfect (as demonstrated in the last
message), but that'd apply even to option (a).

Any better ideas?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Actually, I think we don't need that block at all - there's no point in
> keeping the exact expression, because if there were a statistics object
> matching it, it would already have been matched by examine_variable. So
> if we get here, we have to just split it into the vars anyway, and the
> second block is entirely useless.

Good point.

> That however means we don't need the processing with GroupExprInfo and
> GroupVarInfo lists, i.e. we can revert back to the original simpler
> processing, with a bit of extra logic to match expressions, that's all.
>
> The patch 0003 does this (it's a bit crude, but hopefully enough to
> demonstrate).

Cool. I did wonder about that, but I didn't fully think it through.
I'll take a look.

> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes.
>
> I think we have two options:
>
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
>
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.

I'd vote for (b) without an analyse, and I agree with getting rid of
UpdateStatisticsForTypeChange(). I've always been a bit skeptical
about trying to preserve extended statistics after a type change, when
we don't preserve regular per-column stats.

> BTW I wonder how useful the updated statistics actually is. Consider
> this example:
> ...
> the expression now looks like this:
>
> ========================================================================
> "public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
> precision + c)) FROM t
> ========================================================================
>
> But we're matching it to (((b)::double precision + c)), so that fails.
>
> This is not specific to extended statistics - indexes have exactly the
> same issue. Not sure how common this is in practice.

Hmm, that's unfortunate. Maybe it's not that common in practice
though. I'm not sure if there is any practical way to fix it, but if
there is, I guess we'd want to apply the same fix to both stats and
indexes, and that certainly seems out of scope for this patch.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> here's an updated patch. 0001

The change to the way that CreateStatistics() records dependencies
isn't quite right -- recordDependencyOnSingleRelExpr() will not create
any dependencies if the expression uses only a whole-row Var. However,
pull_varattnos() will include whole-row Vars, and so nattnums_exprs
will be non-zero, and CreateStatistics() will not create a whole-table
dependency when it should.

I suppose that could be fixed up by inspecting the bitmapset returned
by pull_varattnos() in more detail, but I think it's probably safer to
revert to the previous code, which matched what index_create() did.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 2:33 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> here's an updated patch. 0001
> 
> The change to the way that CreateStatistics() records dependencies
> isn't quite right -- recordDependencyOnSingleRelExpr() will not create
> any dependencies if the expression uses only a whole-row Var. However,
> pull_varattnos() will include whole-row Vars, and so nattnums_exprs
> will be non-zero, and CreateStatistics() will not create a whole-table
> dependency when it should.
> 
> I suppose that could be fixed up by inspecting the bitmapset returned
> by pull_varattnos() in more detail, but I think it's probably safer to
> revert to the previous code, which matched what index_create() did.
> 

Ah, good catch. I hadn't realized recordDependencyOnSingleRelExpr works
like that, so I've moved it after the whole-table dependency.

Attached is an updated patch series, with all the changes discussed
here. I've cleaned up the ndistinct stuff a bit more (essentially
reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
the UpdateStatisticsForTypeChange.

I've also looked at speeding up the stats_ext regression tests. The 0002
patch reduces the size of a couple of test tables, and removes a bunch
of queries. I've initially mostly just copied the original tests, but we
don't really need that many queries I think. This cuts the runtime about
in half, so it's mostly in line with other tests. Some of these changes
are in existing tests, so I'll consider moving them into a separate patch
applied before the main one.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Attached is an updated patch series, with all the changes discussed
> here. I've cleaned up the ndistinct stuff a bit more (essentially
> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
> the UpdateStatisticsForTypeChange.
>

I've looked over all that and I think it's in pretty good shape. I
particularly like how much simpler the ndistinct code has now become.

Some (hopefully final) review comments:

1). I don't think index.c is the right place for
StatisticsGetRelation(). I appreciate that it is very similar to the
adjacent IndexGetRelation() function, but index.c is really only for
index-related code, so I think StatisticsGetRelation() should go in
statscmds.c

2). Perhaps the error message at statscmds.c:293 should read

   "expression cannot be used in multivariate statistics because its
type %s has no default btree operator class"

(i.e., add the word "multivariate", since the same expression *can* be
used in univariate statistics even though it has no less-than
operator).

3). The comment for ATExecAddStatistics() should probably mention that
"ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
similar way to other similar functions, e.g.:

/*
 * ALTER TABLE ADD STATISTICS
 *
 * This is no such command in the grammar, but we use this internally to add
 * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
 * column type change.
 */

4). The comment at the start of ATPostAlterTypeParse() needs updating
to mention CREATE STATISTICS statements.

5). I think ATPostAlterTypeParse() should also attempt to preserve any
COMMENTs attached to statistics objects, i.e., something like:

--- src/backend/commands/tablecmds.c.orig    2021-03-26 10:39:38.328631864 +0000
+++ src/backend/commands/tablecmds.c    2021-03-26 10:47:21.042279580 +0000
@@ -12619,6 +12619,9 @@
             CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
             AlterTableCmd *newcmd;

+            /* keep the statistics object's comment */
+            stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
+
             newcmd = makeNode(AlterTableCmd);
             newcmd->subtype = AT_ReAddStatistics;
             newcmd->def = (Node *) stmt;

6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/

7). I don't think that the big XXX comment near the start of
estimate_multivariate_ndistinct() is really relevant anymore, now that
the code has been simplified and we no longer extract Vars from
expressions, so perhaps it can just be deleted.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/26/21 12:37 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is an updated patch series, with all the changes discussed
>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>> the UpdateStatisticsForTypeChange.
>>
> 
> I've looked over all that and I think it's in pretty good shape. I
> particularly like how much simpler the ndistinct code has now become.
> 
> Some (hopefully final) review comments:
> 
> 1). I don't think index.c is the right place for
> StatisticsGetRelation(). I appreciate that it is very similar to the
> adjacent IndexGetRelation() function, but index.c is really only for
> index-related code, so I think StatisticsGetRelation() should go in
> statscmds.c
> 

Ah, right, I forgot about this. I wonder if we should add
catalog/statistics.c, similar to catalog/index.c (instead of adding it
locally to statscmds.c).

> 2). Perhaps the error message at statscmds.c:293 should read
> 
>    "expression cannot be used in multivariate statistics because its
> type %s has no default btree operator class"
> 
> (i.e., add the word "multivariate", since the same expression *can* be
> used in univariate statistics even though it has no less-than
> operator).
> 
> 3). The comment for ATExecAddStatistics() should probably mention that
> "ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
> similar way to other similar functions, e.g.:
> 
> /*
>  * ALTER TABLE ADD STATISTICS
>  *
>  * This is no such command in the grammar, but we use this internally to add
>  * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
>  * column type change.
>  */
> 
> 4). The comment at the start of ATPostAlterTypeParse() needs updating
> to mention CREATE STATISTICS statements.
> 
> 5). I think ATPostAlterTypeParse() should also attempt to preserve any
> COMMENTs attached to statistics objects, i.e., something like:
> 
> --- src/backend/commands/tablecmds.c.orig    2021-03-26 10:39:38.328631864 +0000
> +++ src/backend/commands/tablecmds.c    2021-03-26 10:47:21.042279580 +0000
> @@ -12619,6 +12619,9 @@
>              CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
>              AlterTableCmd *newcmd;
> 
> +            /* keep the statistics object's comment */
> +            stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
> +
>              newcmd = makeNode(AlterTableCmd);
>              newcmd->subtype = AT_ReAddStatistics;
>              newcmd->def = (Node *) stmt;
> 
> 6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/
> 
> 7). I don't think that the big XXX comment near the start of
> estimate_multivariate_ndistinct() is really relevant anymore, now that
> the code has been simplified and we no longer extract Vars from
> expressions, so perhaps it can just be deleted.
> 

Thanks! I'll fix these, and then will consider getting it committed
sometime later today, once the buildfarm does some testing on the other
stuff I already committed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/26/21 1:54 PM, Tomas Vondra wrote:
> 
> 
> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Attached is an updated patch series, with all the changes discussed
>>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>>> the UpdateStatisticsForTypeChange.
>>>
>>
>> I've looked over all that and I think it's in pretty good shape. I
>> particularly like how much simpler the ndistinct code has now become.
>>
>> Some (hopefully final) review comments:
>>
>> ...
>>
> 
> Thanks! I'll fix these, and then will consider getting it committed
> sometime later today, once the buildfarm does some testing on the other
> stuff I already committed.
> 

OK, pushed after a bit more polishing and testing. I've noticed one more
missing piece in describe (expressions missing in \dX), so I fixed that.

May the buildfarm be merciful ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/27/21 1:17 AM, Tomas Vondra wrote:
> On 3/26/21 1:54 PM, Tomas Vondra wrote:
>>
>>
>> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Attached is an updated patch series, with all the changes discussed
>>>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>>>> the UpdateStatisticsForTypeChange.
>>>>
>>>
>>> I've looked over all that and I think it's in pretty good shape. I
>>> particularly like how much simpler the ndistinct code has now become.
>>>
>>> Some (hopefully final) review comments:
>>>
>>> ...
>>>
>>
>> Thanks! I'll fix these, and then will consider getting it committed
>> sometime later today, once the buildfarm does some testing on the other
>> stuff I already committed.
>>
> 
> OK, pushed after a bit more polishing and testing. I've noticed one more
> missing piece in describe (expressions missing in \dX), so I fixed that.
> 
> May the buildfarm be merciful ...
> 

LOL! It failed on *my* buildfarm machine, because apparently some of the
expressions used in stats_ext depend on locale and the machine is using
cs_CZ.UTF-8. Will fix later ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
I suggest to add some kind of reference to stats expressions here.

--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml

  <sect2 id="vacuum-for-statistics">
   <title>Updating Planner Statistics</title>

   <indexterm zone="vacuum-for-statistics">
    <primary>statistics</primary>
    <secondary>of the planner</secondary>
   </indexterm>

[...]

@@ -330,10 +330,12 @@
 
     <para>
      Also, by default there is limited information available about
-     the selectivity of functions.  However, if you create an expression
+     the selectivity of functions.  However, if you create a statistics
+     expression or an expression
      index that uses a function call, useful statistics will be
      gathered about the function, which can greatly improve query
      plans that use the expression index.


-- 
Justin



Re: PoC/WIP: Extended statistics on expressions (\d in old client)

From
Justin Pryzby
Date:
On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > > | Statistics objects:
> > > > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > > 
> > > Umm, for me that prints:
> > 
> > >      "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > > 
> > > which I think is OK. But maybe there's something else to trigger the
> > > problem?
> > 
> > Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> > I think it's considered ok if old client's \d commands don't work on new
> > server, but it's not clear to me if it's ok if they misbehave.  It's almost
> > better it made an ERROR.
> > 
> 
> Well, how would the server know to throw an error? We can't quite patch the
> old psql (if we could, we could just tweak the query).

To refresh: stats objects on a v14 server which include expressions are shown
by a pre-v14 psql client with the expressions elided (because the attnums don't
correspond to anything in pg_attribute).

I'm mentioning it again since, even though I knew about this earlier in the
year, it caused some confusion for me again just now while testing our
application.  I had the v14 server installed but the psql symlink still pointed
to the v13 client.

There may not be anything we can do about it.
And it may not be a significant issue outside the beta period: more typically,
the client version would match the server version.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Noah Misch
Date:
On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> OK, pushed after a bit more polishing and testing.

This added a "transformed" field to CreateStatsStmt, but it didn't mention
that field in src/backend/nodes.  Should those functions handle the field?



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 6/6/21 7:37 AM, Noah Misch wrote:
> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
>> OK, pushed after a bit more polishing and testing.
> 
> This added a "transformed" field to CreateStatsStmt, but it didn't mention
> that field in src/backend/nodes.  Should those functions handle the field?
> 

Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
sure if it can result in error/failure or just inefficiency (due to
transforming the expressions repeatedly), but it should do whatever
CREATE INDEX is doing.

Thanks for noticing! Fixed by d57ecebd12.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 6/6/21 7:37 AM, Noah Misch wrote:
>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>> that field in src/backend/nodes.  Should those functions handle the field?

> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.

I'm curious about how come the buildfarm didn't notice this.  The
animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
that they didn't implies that there's no test case that makes use
of a nonzero value for this field, which seems like a testing gap.

            regards, tom lane



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 6/6/21 9:17 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 6/6/21 7:37 AM, Noah Misch wrote:
>>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>>> that field in src/backend/nodes.  Should those functions handle the field?
> 
>> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
>> sure if it can result in error/failure or just inefficiency (due to
>> transforming the expressions repeatedly), but it should do whatever
>> CREATE INDEX is doing.
> 
> I'm curious about how come the buildfarm didn't notice this.  The
> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
> that they didn't implies that there's no test case that makes use
> of a nonzero value for this field, which seems like a testing gap.
> 

AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
look like this:

    List       *new_list = copyObject(raw_parsetree_list);

    /* This checks both copyObject() and the equal() routines... */
    if (!equal(new_list, raw_parsetree_list))
        elog(WARNING, "copyObject() failed to produce an equal raw parse tree");
    else
        raw_parsetree_list = new_list;

But if the field is missing from all the functions, equal() can't detect
that copyObject() did not actually copy it. It'd detect a case where the
field was added to just one place, but not this one. The CREATE INDEX code
(which served as an example for CREATE STATISTICS) has exactly the same issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 6/6/21 9:17 PM, Tom Lane wrote:
>> I'm curious about how come the buildfarm didn't notice this.  The
>> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
>> that they didn't implies that there's no test case that makes use
>> of a nonzero value for this field, which seems like a testing gap.

> AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
> look like this:
> ...
> But if the field is missing from all the functions, equal() can't detect
> that copyObject() did not actually copy it.

Right, that code would only detect a missing copyfuncs.c line if
equalfuncs.c did have the line, which isn't all that likely.  However,
we then pass the copied node on to further processing, which in principle
should result in visible failures when copyfuncs.c is missing a line.

I think the reason it didn't is that the transformed field would always
be zero (false) in grammar output.  We could only detect a problem if
we copied already-transformed nodes and then used them further.  Even
then it *might* not fail, because the consequence would likely be an
extra round of parse analysis on the expressions, which is likely to
be a no-op.

Not sure if there's a good way to improve that.  I hope sometime soon
we'll be able to auto-generate these functions, and then the risk of
this sort of mistake will go away (he says optimistically).

            regards, tom lane



Re: PoC/WIP: Extended statistics on expressions

From
Noah Misch
Date:
On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:
> 
> On 6/6/21 7:37 AM, Noah Misch wrote:
> > On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> >> OK, pushed after a bit more polishing and testing.
> > 
> > This added a "transformed" field to CreateStatsStmt, but it didn't mention
> > that field in src/backend/nodes.  Should those functions handle the field?
> > 
> 
> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.
> 
> Thanks for noticing! Fixed by d57ecebd12.

Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
changes need a catversion bump, since the catalogs might contain input for
each read function.  Other src/backend/nodes functions don't face that.  Also,
src/backend/nodes generally process fields in the order that they appear in
the struct.  The order you used in d57ecebd12 is nicer, being more like
IndexStmt, so I'm pushing an order change to the struct.



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 6/11/21 6:55 AM, Noah Misch wrote:
> On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:
>>
>> On 6/6/21 7:37 AM, Noah Misch wrote:
>>> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
>>>> OK, pushed after a bit more polishing and testing.
>>>
>>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>>> that field in src/backend/nodes.  Should those functions handle the field?
>>>
>>
>> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
>> sure if it can result in error/failure or just inefficiency (due to
>> transforming the expressions repeatedly), but it should do whatever
>> CREATE INDEX is doing.
>>
>> Thanks for noticing! Fixed by d57ecebd12.
> 
> Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
> changes need a catversion bump, since the catalogs might contain input for
> each read function.  Other src/backend/nodes functions don't face that.  Also,
> src/backend/nodes generally process fields in the order that they appear in
> the struct.  The order you used in d57ecebd12 is nicer, being more like
> IndexStmt, so I'm pushing an order change to the struct.
> 

OK, makes sense. Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > In any case, why are there so many parentheses ?

On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
> adding extra parentheses, on top of what deparse_expression_pretty does.
> Will fix.

The extra parens are still here - is it intended ?

postgres=# CREATE STATISTICS s ON i, (1+i), (2+i) FROM t;
CREATE STATISTICS
postgres=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 i      | integer |           |          | 
Statistics objects:
    "public"."s" ON i, ((1 + i)), ((2 + i)) FROM t

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
> > Looking at the current behaviour, there are a couple of things that
> > seem a little odd, even though they are understandable. For example,
> > the fact that
> > 
> >   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> > 
> > fails, but
> > 
> >   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> > 
> > succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> > 
> >   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> > 
> > tends to suggest that it's going to create statistics on the pair of
> > expressions, describing their correlation, when actually it builds 2
> > independent statistics. Also, this error text isn't entirely accurate:
> > 
> >   CREATE STATISTICS s ON col FROM tbl;
> >   ERROR:  extended statistics require at least 2 columns
> > 
> > because extended statistics don't always require 2 columns, they can
> > also just have an expression, or multiple expressions and 0 or 1
> > columns.
> > 
> > I think a lot of this stems from treating "expressions" in the same
> > way as the other (multi-column) stats kinds, and it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
> > 
> > The first syntax would create single-column stats, and wouldn't accept
> > a statistics_kind argument, because there is only one kind of
> > single-column statistic. Maybe that might change in the future, but if
> > so, it's likely that the kinds of single-column stats will be
> > different from the kinds of multi-column stats.
> > 
> > In the second syntax, the only accepted kinds would be the current
> > multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> > would always build stats describing the correlations between the
> > columns listed. It would continue to build standard/expression stats
> > on any expressions in the list, but that's more of an implementation
> > detail.
> > 
> > It would no longer be possible to do "CREATE STATISTICS s
> > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> > issue 2 separate "CREATE STATISTICS" commands, but that seems more
> > logical, because they're independent stats.
> > 
> > The parsing code might not change much, but some of the errors would
> > be different. For example, the errors "building only extended
> > expression statistics on simple columns not allowed" and "extended
> > expression statistics require at least one expression" would go away,
> > and the error "extended statistics require at least 2 columns" might
> > become more specific, depending on the stats kind.

This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/16/21 3:31 AM, Justin Pryzby wrote:
> On 1/22/21 5:01 AM, Justin Pryzby wrote:
>>> In any case, why are there so many parentheses ?
> 
> On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
>> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
>> adding extra parentheses, on top of what deparse_expression_pretty does.
>> Will fix.
> 
> The extra parens are still here - is it intended ?
> 

Ah, thanks for reminding me! I was looking at this, and the problem is 
that pg_get_statisticsobj_worker only does this:

     prettyFlags = PRETTYFLAG_INDENT;

Changing that to

     prettyFlags = PRETTYFLAG_INDENT | PRETTYFLAG_PAREN;

fixes this (not sure we need the INDENT flag - probably not).

I'm a bit confused, though. My assumption was "PRETTYFLAG_PAREN = true" 
would force the deparsing itself to add the parens, if needed, but in 
reality it works the other way around.

I guess it's more complicated due to deparsing multi-level expressions, 
but unfortunately, there's no comment explaining what it does.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/16/21 3:32 AM, Justin Pryzby wrote:
> On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
>>> Looking at the current behaviour, there are a couple of things that
>>> seem a little odd, even though they are understandable. For example,
>>> the fact that
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
>>>
>>> fails, but
>>>
>>>    CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
>>>
>>> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
>>>
>>> tends to suggest that it's going to create statistics on the pair of
>>> expressions, describing their correlation, when actually it builds 2
>>> independent statistics. Also, this error text isn't entirely accurate:
>>>
>>>    CREATE STATISTICS s ON col FROM tbl;
>>>    ERROR:  extended statistics require at least 2 columns
>>>
>>> because extended statistics don't always require 2 columns, they can
>>> also just have an expression, or multiple expressions and 0 or 1
>>> columns.
>>>
>>> I think a lot of this stems from treating "expressions" in the same
>>> way as the other (multi-column) stats kinds, and it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      ON (expression)
>>>      FROM table_name
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      [ ( statistics_kind [, ... ] ) ]
>>>      ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>      FROM table_name
>>>
>>> The first syntax would create single-column stats, and wouldn't accept
>>> a statistics_kind argument, because there is only one kind of
>>> single-column statistic. Maybe that might change in the future, but if
>>> so, it's likely that the kinds of single-column stats will be
>>> different from the kinds of multi-column stats.
>>>
>>> In the second syntax, the only accepted kinds would be the current
>>> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
>>> would always build stats describing the correlations between the
>>> columns listed. It would continue to build standard/expression stats
>>> on any expressions in the list, but that's more of an implementation
>>> detail.
>>>
>>> It would no longer be possible to do "CREATE STATISTICS s
>>> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
>>> issue 2 separate "CREATE STATISTICS" commands, but that seems more
>>> logical, because they're independent stats.
>>>
>>> The parsing code might not change much, but some of the errors would
>>> be different. For example, the errors "building only extended
>>> expression statistics on simple columns not allowed" and "extended
>>> expression statistics require at least one expression" would go away,
>>> and the error "extended statistics require at least 2 columns" might
>>> become more specific, depending on the stats kind.
> 
> This still seems odd:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> It seems wrong that the command works with added parens, but builds expression
> stats on a simple column (which is redundant with what analyze does without
> extended stats).
> 

Well, yeah. But I think this is a behavior that was discussed somewhere 
in this thread, and the agreement was that it's not worth the 
complexity, as this comment explains

   * XXX We do only the bare minimum to separate simple attribute and
   * complex expressions - for example "(a)" will be treated as a complex
   * expression. No matter how elaborate the check is, there'll always be
   * a way around it, if the user is determined (consider e.g. "(a+0)"),
   * so it's not worth protecting against it.

Of course, maybe that wasn't the right decision - it's a bit weird that

   CREATE INDEX on t ((a), (b))

actually "extracts" the column references and stores that in indkeys, 
instead of treating that as expressions.

Patch 0001 fixes the "double parens" issue discussed elsewhere in this 
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a 
simple column reference.
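
To make the intended behavior of 0002 concrete, a rough sketch (assuming a 
table "t" with integer columns "a" and "b"):

   -- "(a)" is now treated as a plain column reference, so this builds
   -- ordinary multi-column statistics on (a, b):
   CREATE STATISTICS s1 ON (a), (b) FROM t;

   -- and this is rejected, instead of silently building expression stats
   -- that just duplicate the regular per-column statistics:
   CREATE STATISTICS s2 ON (a) FROM t;
   -- ERROR:  extended statistics require at least 2 columns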

But I'm not sure 0002 is something we can do without a catversion bump. 
What if someone created such "bogus" statistics? It's mostly harmless, 
because the statistics object is useless anyway (AFAICS we'll just use 
the regular one we have for the column), but if they do pg_dump, that'll 
fail because of this new restriction.

OTOH we're still "only" in beta, and IIRC the rule is not to bump 
catversion after rc1.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

> From: Tomas Vondra <tomas.vondra@postgresql.org>
> Date: Mon, 16 Aug 2021 17:19:33 +0200
> Subject: [PATCH 2/2] fix: identify single-attribute references

> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
> index a4ee54d516..be1f3a5175 100644
> --- a/src/bin/pg_dump/t/002_pg_dump.pl
> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
> @@ -2811,7 +2811,7 @@ my %tests = (
>          create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>                              ON (2 * col1) FROM dump_test.test_fifth_table',
>          regexp => qr/^
> -            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E
> +            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E


This hunk should be in 0001, no ?

> But I'm not sure 0002 is something we can do without a catversion bump. What
> if someone created such "bogus" statistics? It's mostly harmless, because
> the statistics object is useless anyway (AFAICS we'll just use the regular
> one we have for the column), but if they do pg_dump, that'll fail because of
> this new restriction.

I think it's okay if pg_dump throws an error, since the fix is as easy as
dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)

Andres concluded similarly with the reverted autovacuum patch:
https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de

+RMT in case someone wants to argue otherwise.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/18/21 5:07 AM, Justin Pryzby wrote:
>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>> column reference.
> 
>> From: Tomas Vondra <tomas.vondra@postgresql.org>
>> Date: Mon, 16 Aug 2021 17:19:33 +0200
>> Subject: [PATCH 2/2] fix: identify single-attribute references
> 
>> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
>> index a4ee54d516..be1f3a5175 100644
>> --- a/src/bin/pg_dump/t/002_pg_dump.pl
>> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
>> @@ -2811,7 +2811,7 @@ my %tests = (
>>           create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>>                               ON (2 * col1) FROM dump_test.test_fifth_table',
>>           regexp => qr/^
>> -            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E
>> +            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E
> 
> 
> This hunk should be in 0001, no ?
>
Yeah, I mixed that up a bit.

>> But I'm not sure 0002 is something we can do without a catversion bump. What
>> if someone created such "bogus" statistics? It's mostly harmless, because
>> the statistics object is useless anyway (AFAICS we'll just use the regular
>> one we have for the column), but if they do pg_dump, that'll fail because of
>> this new restriction.
> 
> I think it's okay if pg_dump throws an error, since the fix is as easy as
> dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)
> 
> Andres concluded similarly with the reverted autovacuum patch:
> https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de
> 
 > +RMT in case someone wants to argue otherwise.
 >

I feel a bit uneasy about it, but if there's a precedent ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:
> > This still seems odd:
> > 
> > postgres=# CREATE STATISTICS asf ON i FROM t;
> > ERROR:  extended statistics require at least 2 columns
> > postgres=# CREATE STATISTICS asf ON (i) FROM t;
> > CREATE STATISTICS
> > 
> > It seems wrong that the command works with added parens, but builds expression
> > stats on a simple column (which is redundant with what analyze does without
> > extended stats).
> 
> Well, yeah. But I think this is a behavior that was discussed somewhere in
> this thread, and the agreement was that it's not worth the complexity, as
> this comment explains
> 
>   * XXX We do only the bare minimum to separate simple attribute and
>   * complex expressions - for example "(a)" will be treated as a complex
>   * expression. No matter how elaborate the check is, there'll always be
>   * a way around it, if the user is determined (consider e.g. "(a+0)"),
>   * so it's not worth protecting against it.
> 
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

0002 refuses to create expression stats on a simple column reference like
(a), which I think helps to avoid a user accidentally creating useless ext
stats objects (which are redundant with the table's column stats).

0002 does not attempt to refuse cases like (a+0), which I think is fine:
we don't try to reject useless cases if someone insists on it.
See 240971675, 701fd0bbc.

So I am +1 to apply both patches.

I added this as an Open Item for increased visibility.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 8/24/21 3:13 PM, Justin Pryzby wrote:
> On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:
>>> This still seems odd:
>>>
>>> postgres=# CREATE STATISTICS asf ON i FROM t;
>>> ERROR:  extended statistics require at least 2 columns
>>> postgres=# CREATE STATISTICS asf ON (i) FROM t;
>>> CREATE STATISTICS
>>>
>>> It seems wrong that the command works with added parens, but builds expression
>>> stats on a simple column (which is redundant with what analyze does without
>>> extended stats).
>>
>> Well, yeah. But I think this is a behavior that was discussed somewhere in
>> this thread, and the agreement was that it's not worth the complexity, as
>> this comment explains
>>
>>    * XXX We do only the bare minimum to separate simple attribute and
>>    * complex expressions - for example "(a)" will be treated as a complex
>>    * expression. No matter how elaborate the check is, there'll always be
>>    * a way around it, if the user is determined (consider e.g. "(a+0)"),
>>    * so it's not worth protecting against it.
>>
>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>> column reference.
> 
> 0002 refuses to create expression stats on a simple column reference like
> (a), which I think helps to avoid a user accidentally creating useless ext
> stats objects (which are redundant with the table's column stats).
> 
> 0002 does not attempt to refuse cases like (a+0), which I think is fine:
> we don't try to reject useless cases if someone insists on it.
> See 240971675, 701fd0bbc.
> 
> So I am +1 to apply both patches.
> 
> I added this as an Open Item for increased visibility.
> 

I've pushed both fixes, so the open item should be resolved.

However, while polishing the second patch, I realized we're allowing 
statistics on expressions referencing system attributes. So this fails:

CREATE STATISTICS s ON ctid, x FROM t;

but this passes:

CREATE STATISTICS s ON (ctid::text), x FROM t;

IMO we should reject such expressions, just like we reject direct 
references to system attributes - patch attached.

Furthermore, I wonder if we should reject expressions without any Vars? 
This works now:

CREATE STATISTICS s ON (11::text) FROM t;

but it seems rather silly / useless, so maybe we should reject it.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> > > Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> > > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> > > column reference.
> > 
> > 0002 refuses to create expression stats on a simple column reference like
> > (a), which I think helps to avoid a user accidentally creating useless ext
> > stats objects (which are redundant with the table's column stats).
> > 
> > 0002 does not attempt to refuse cases like (a+0), which I think is fine:
> > we don't try to reject useless cases if someone insists on it.
> > See 240971675, 701fd0bbc.
> > 
> > So I am +1 to apply both patches.
> > 
> > I added this as an Open Item for increased visibility.
> 
> I've pushed both fixes, so the open item should be resolved.

Thank you - I marked it as such.

There are some typos in 537ca68db (refenrece)
I'll add them to my typos branch if you don't want to patch them right now or
wait to see if someone notices anything else.

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 59369f8736..17cbd97808 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -205,27 +205,27 @@ CreateStatistics(CreateStatsStmt *stmt)
     numcols = list_length(stmt->exprs);
     if (numcols > STATS_MAX_DIMENSIONS)
         ereport(ERROR,
                 (errcode(ERRCODE_TOO_MANY_COLUMNS),
                  errmsg("cannot have more than %d columns in statistics",
                         STATS_MAX_DIMENSIONS)));
 
     /*
      * Convert the expression list to a simple array of attnums, but also keep
      * a list of more complex expressions.  While at it, enforce some
      * constraints - we don't allow extended statistics on system attributes,
-     * and we require the data type to have less-than operator.
+     * and we require the data type to have a less-than operator.
      *
-     * There are many ways how to "mask" a simple attribute refenrece as an
+     * There are many ways to "mask" a simple attribute reference as an
      * expression, for example "(a+0)" etc. We can't possibly detect all of
-     * them, but we handle at least the simple case with attribute in parens.
+     * them, but we handle at least the simple case with the attribute in parens.
      * There'll always be a way around this, if the user is determined (like
      * the "(a+0)" example), but this makes it somewhat consistent with how
      * indexes treat attributes/expressions.
      */
     foreach(cell, stmt->exprs)
     {
         StatsElem  *selem = lfirst_node(StatsElem, cell);
 
         if (selem->name)        /* column reference */
         {
             char       *attname;



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 9/1/21 9:38 PM, Justin Pryzby wrote:
> On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
>>>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>>>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>>>> column reference.
>>>
>>> 0002 refuses to create expression stats on a simple column reference like
>>> (a), which I think helps to avoid a user accidentally creating useless ext
>>> stats objects (which are redundant with the table's column stats).
>>>
>>> 0002 does not attempt to refuse cases like (a+0), which I think is fine:
>>> we don't try to reject useless cases if someone insists on it.
>>> See 240971675, 701fd0bbc.
>>>
>>> So I am +1 to apply both patches.
>>>
>>> I added this as an Open Item for increased visibility.
>>
>> I've pushed both fixes, so the open item should be resolved.
> 
> Thank you - I marked it as such.
> 
> There are some typos in 537ca68db (refenrece)
> I'll add them to my typos branch if you don't want to patch them right now or
> wait to see if someone notices anything else.
> 

Yeah, probably better to wait a bit. Any opinions on rejecting 
expressions referencing system attributes or no attributes at all?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> However, while polishing the second patch, I realized we're allowing
> statistics on expressions referencing system attributes. So this fails:
> 
> CREATE STATISTICS s ON ctid, x FROM t;
> 
> but this passes:
> 
> CREATE STATISTICS s ON (ctid::text), x FROM t;
> 
> IMO we should reject such expressions, just like we reject direct references
> to system attributes - patch attached.

Right, same as indexes.  +1

> Furthermore, I wonder if we should reject expressions without any Vars? This
> works now:
> 
> CREATE STATISTICS s ON (11::text) FROM t;
> 
> but it seems rather silly / useless, so maybe we should reject it.

To my surprise, this is also allowed for indexes...

But (maybe this is what I was remembering) it's prohibited to have a constant
expression as a partition key.

Expressions without a var seem like a case where the user did something
deliberately silly, and dissimilar from the case of making a stats expression
on a simple column - that seemed like it could be a legitimate
mistake/confusion (it's not unreasonable to write an extra set of parentheses,
but it's strange if that causes it to behave differently).

I think it's not worth too much effort to prohibit this: if they're determined,
they can still write an expression with a var which is constant.  I'm not going
to say it's worth zero effort, though.
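
To illustrate the comparison (a sketch, assuming a table "t" with an integer 
column "a"; I'm quoting the partition-key error text from memory, so it may 
not be exact):

   -- allowed for indexes, somewhat surprisingly:
   CREATE INDEX ON t ((11::text));

   -- but a constant expression is rejected as a partition key:
   CREATE TABLE p (a int) PARTITION BY RANGE ((11::text));
   -- ERROR:  cannot use constant expression as partition key

   -- and a determined user can always hide a constant behind a Var:
   CREATE STATISTICS s ON (a * 0 + 11) FROM t;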

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 9/3/21 5:56 AM, Justin Pryzby wrote:
> On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
>> However, while polishing the second patch, I realized we're allowing
>> statistics on expressions referencing system attributes. So this fails:
>>
>> CREATE STATISTICS s ON ctid, x FROM t;
>>
>> but this passes:
>>
>> CREATE STATISTICS s ON (ctid::text), x FROM t;
>>
>> IMO we should reject such expressions, just like we reject direct references
>> to system attributes - patch attached.
> 
> Right, same as indexes.  +1
>

I've pushed this check, disallowing extended stats on expressions 
referencing system attributes. This means we'll reject both ctid and 
(ctid::text), just like for indexes.
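
A quick sanity check of the parallel with indexes (a sketch, assuming a table 
"t" with a regular column "x"; error messages omitted):

   -- both forms are now rejected by CREATE STATISTICS:
   CREATE STATISTICS s1 ON ctid, x FROM t;           -- direct reference
   CREATE STATISTICS s2 ON (ctid::text), x FROM t;   -- inside an expression

   -- matching what CREATE INDEX already does (both rejected too):
   CREATE INDEX ON t (ctid);
   CREATE INDEX ON t ((ctid::text));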

>> Furthermore, I wonder if we should reject expressions without any Vars? This
>> works now:
>>
>> CREATE STATISTICS s ON (11::text) FROM t;
>>
>> but it seems rather silly / useless, so maybe we should reject it.
> 
> To my surprise, this is also allowed for indexes...
> 
> But (maybe this is what I was remembering) it's prohibited to have a constant
> expression as a partition key.
> 
> Expressions without a var seem like a case where the user did something
> deliberately silly, and dis-similar from the case of making a stats expression
> on a simple column - that seemed like it could be a legitimate
> mistake/confusion (it's not unreasonable to write an extra parenthesis, but
> it's strange if that causes it to behave differently).
> 
> I think it's not worth too much effort to prohibit this: if they're determined,
> they can still write an expresion with a var which is constant.  I'm not going
> to say it's worth zero effort, though.
> 

I've decided not to push this. The statistics objects on expressions not 
referencing any variables seem useless, but maybe not entirely - we 
allow volatile expressions, like

   CREATE STATISTICS s ON (random()) FROM t;

which I suppose might be useful. And we accept similar cases (except for 
the volatility, of course) for indexes too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company