PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

   Fairly straightforward, I think. I'm sure the logic deciding which
   expressions are allowed is not 100% correct (e.g. volatile functions
   etc.), but that's a detail we can deal with later.

2) store the expressions in pg_statistic_ext catalog

   I ended up adding a separate column, similar to indexprs, except that
   the order of columns/expressions does not matter, so we don't need to
   bother with storing 0s in stxkeys - we simply consider expressions to
   be "after" all the simple columns.

3) build statistics

   This should work for all three types of statistics we have (mcv,
   dependencies and ndistinct), although the code changes are often very
   hackish, done just "to make it work".

   The main challenge here was how to represent the expressions in the
   statistics - e.g. in ndistinct, which tracks ndistinct estimates for
   combinations of parameters, and so far we used attnums for that. I
   decided the easiest way is to keep doing that, but offset the
   expressions by MaxHeapAttributeNumber. That seems to work, but maybe
   there's a better way.

4) apply the statistics

   This is the hard part, really, and the exact state of the support
   depends on type of statistics.

   For ndistinct coefficients, it generally works. I'm sure there may be
   bugs in estimate_num_groups, etc. but in principle it works.

   For MCV lists, it generally works too - you can define statistics on
   the expressions and the estimates should improve. The main downside
   here is that it requires at least two expressions, otherwise we can't
   build/apply the extended statistics. So for example

      SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

   may be estimated "correctly", but once you drop either of the
   conditions it gets much worse, as we don't have stats for the
   individual expressions.
   That's rather annoying - it does not break the extended MCV, but the
   behavior will certainly cause confusion.

   For functional dependencies, the estimation does not work yet. Also,
   the missing per-column statistics have a bigger impact than on MCV
   lists, because while MCV can work fine without them, the dependencies
   heavily rely on the per-column estimates. We only apply "corrections"
   based on the dependency degree, so we still need (good) per-column
   estimates, which does not quite work with the expressions.


   Of course, the lack of per-expression statistics may be somewhat
   fixed by adding indexes on expressions, but that's kinda expensive.


Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and run queries with mod() expressions on
it. It might just as well be date_trunc() or something similar - that
would work too.


table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;


poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4)
                (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                              QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8)
                (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8)
                      (actual rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)


improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                              QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8)
                (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8)
                      (actual rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4)
                (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)


Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>    This is the hard part, really, and the exact state of the support
>    depends on type of statistics.
> 
>    For ndistinct coefficients, it generally works. I'm sure there may be
>    bugs in estimate_num_groups, etc. but in principle it works.
> 
>    For MCV lists, it generally works too - you can define statistics on
>    the expressions and the estimates should improve. The main downside
>    here is that it requires at least two expressions, otherwise we can't
>    build/apply the extended statistics. So for example
> 
>       SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>    may be estimated "correctly", but once you drop either of the
>    conditions it gets much worse, as we don't have stats for the
>    individual expressions.
>    That's rather annoying - it does not break the extended MCV, but the
>    behavior will certainly cause confusion.
> 
>    For functional dependencies, the estimation does not work yet. Also,
>    the missing per-column statistics have a bigger impact than on MCV
>    lists, because while MCV can work fine without them, the dependencies
>    heavily rely on the per-column estimates. We only apply "corrections"
>    based on the dependency degree, so we still need (good) per-column
>    estimates, which does not quite work with the expressions.
> 
> 
>    Of course, the lack of per-expression statistics may be somewhat
>    fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.
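
A minimal sketch of what I have in mind (the exact syntax is still open -
the point is that per-expression pg_statistic rows get built and stashed
in pg_statistic_ext_data automatically whenever the definition contains
an expression):

   CREATE TABLE t (a int);
   INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);

   -- a single expression, no explicit statistics kind needed
   CREATE STATISTICS s ON (mod(a,10)) FROM t;
   ANALYZE t;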

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.


Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression
statistics (i.e. it's similar to having an expression index)

* it allows using expressions in the definition of extended statistics,
and properly handles that in all existing statistics kinds (dependencies,
mcv, ndistinct)


The expression handling mostly copies what we do for indexes, with
similar restrictions - no volatile functions, aggregates, etc. The list
of expressions is stored in the pg_statistic_ext catalog, but unlike for
indexes we don't need to worry about the exact order of elements, so
there are no "0" entries for expressions in stxkeys, etc. We simply
assume the expressions come after the simple columns, and that's it.
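
For example (the column name stxexprs is just what I'm using here - the
point is that plain columns stay in stxkeys, while the expressions live
in a separate list of expression nodes):

    CREATE TABLE t (a int, b int);
    CREATE STATISTICS s ON a, (a + b) FROM t;

    -- stxkeys lists only the simple column; the expression is stored
    -- separately and is treated as coming "after" the plain columns
    SELECT stxkeys, pg_get_expr(stxexprs, stxrelid)
      FROM pg_statistic_ext WHERE stxname = 's';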

To reference expressions in the built statistics (e.g. in a dependency)
we use "special attnums" computed from the expression index by adding
MaxHeapAttributeNumber. So the first expression has attnum 1601 etc.

This mapping of expressions to attnums is used both while building the
statistics and while applying them to clauses, as it makes the whole
process much simpler than dealing with attnums and expressions entirely
separately.


The first part allows us to do something like this:

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

    CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

                                 QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=5000 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.216 ms
     Execution Time: 157.552 ms
    (5 rows)

while with the statistics we get this

                                 QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=100900 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.399 ms
     Execution Time: 157.530 ms
    (5 rows)

So that's pretty nice improvement. In practice you could get the same
effect by creating an expression index

    CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's a cost to maintain the
index, it blocks HOT updates, and it takes space on disk. The statistics
have none of these issues.

Implementation-wise, this simply builds per-column statistics for each
expression, and stashes them into a new column in the
pg_statistic_ext_data catalog as an array of elements of the
pg_statistic composite type. Then in selfuncs.c we look not just at
indexes, but also at this column when looking for expression stats.
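
To illustrate, a purely hypothetical inspection query for the statistics
object "s" from the example above (the column name stxdexprs and the
exact representation are just my assumptions):

    -- peek at the stashed per-expression pg_statistic rows
    SELECT e.stxname,
           (d.stxdexprs[1]).stanullfrac,
           (d.stxdexprs[1]).stadistinct
      FROM pg_statistic_ext e
      JOIN pg_statistic_ext_data d ON d.stxoid = e.oid
     WHERE e.stxname = 's';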

So that gives us the per-expression stats. This is enabled by default
when you don't specify any statistics kind and the definition includes
an expression that is not a simple column reference. You may also
request it explicitly by using "expressions" in CREATE STATISTICS.


Now, the second part is really just a natural extension of the existing
stats to also work with expressions. The easiest thing is probably to
show some examples, so consider this:

    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

which without any statistics gives us this:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                           QUERY PLAN
    ------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=25 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.080 ms
     Execution Time: 161.445 ms
    (5 rows)


    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    ------------------------------------------------------------------
     HashAggregate  (cost=76656.00..99468.50 rows=1000000 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.232 ms
     Execution Time: 514.446 ms
    (6 rows)

and now let's add statistics on the expressions:

    CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
    ANALYZE t;

which ends up with these spot-on estimates:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                            QUERY PLAN
    ---------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=97400 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.366 ms
     Execution Time: 159.207 ms
    (5 rows)

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    -----------------------------------------------------------------
     HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Batches: 1  Memory Usage: 24kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.299 ms
     Execution Time: 530.793 ms
    (6 rows)

Of course, this is a very simple query, but hopefully you get the idea.


There are two main areas where I think there might be hidden issues:

1) We're kinda faking the pg_statistic entries, and I suppose there
might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is
kinda simplistic; we're probably keeping the memory around longer than
needed, etc.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
> attached is a significantly improved version of the patch, allowing
> defining extended statistics on expressions. This fixes most of the
> problems in the previous WIP version and AFAICS it does pass all
> regression tests (including under valgrind). There's a bunch of FIXMEs
> and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach
some proposed fixes.

+  * This matters especially for * expensive expressions, of course.

+   The expression can refer only to columns of the underlying table, but
+   it can use all columns, not just the ones the statistics is defined
+   on.

I don't know what these are trying to say?

+                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates".  Or
should statistics support predicates, too ?

Idea: if a user specifies no stakinds, and there's no expression specified,
then you automatically build everything except for expressional stats.  But if
they specify only one statistics "column", it gives an error.  If that's a
non-simple column reference, should that instead build *only* expressional
stats (possibly with a NOTICE, since the user might be intending to make MV
stats)?

I think pg_stats_ext should allow inspecting the pg_statistic data in
pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
something, so maybe it should use ORDINALITY (?)

I hacked more on bootstrap.c so included that here.

-- 
Justin

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 11/23/20 3:26 AM, Justin Pryzby wrote:
> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
> 
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
> 
> +  * This matters especially for * expensive expressions, of course.
> 

The point this was trying to make is that we evaluate the expressions
only once, and use the results to build all the extended statistics,
instead of leaving it up to each "build" to re-evaluate them.

> +   The expression can refer only to columns of the underlying table, but
> +   it can use all columns, not just the ones the statistics is defined
> +   on.
> 
> I don't know what these are trying to say?
> 

D'oh. That's a bogus paragraph, copied from the CREATE INDEX docs (where
it talked about the index predicate, which is irrelevant here).

> +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
> +        * partial-index predicates.  Create it in the per-index context to be
> 
> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> should statistics support predicates, too ?
> 

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats.  But if
> they specify only one statistics "column", it gives an error.  If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats)?
> 

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

   CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.


> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)
> 

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's essentially a table nested in a table, with a non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part. Something like pg_stats_ext_expressions, with about the
same structure as pg_stats, or something.
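
Roughly what I'm imagining (the view and column names are entirely
tentative at this point, just to show the shape):

   -- one row per (statistics object, expression), similar to pg_stats
   SELECT statistics_name, expr, null_frac, n_distinct, most_common_vals
     FROM pg_stats_ext_expressions
    WHERE tablename = 't';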

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why you
renamed extended statistics to multi-variate statistics - we use
"extended" everywhere. Not sure what's the point of the
serialize_expr_stats changes, that code is mostly copy-paste from
update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.


Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is an updated version of the patch series, merging some of the
changes proposed by Justin. I've kept the bootstrap patches separate, at
least for now.

As for the individual 0004-0007 patches:

1) 0004 - merged as is

2) 0005 - I've merged some of the docs changes, but some of the wording
was copied from the CREATE INDEX docs, in which case I've kept that. I've
also not merged changes to pre-existing docs, like the t1 example, which
is unrelated to this patch.

OTOH I've corrected the t3 example description, which was somewhat bogus
and unrelated to what the example actually did. I've also removed the
irrelevant para which originally described index predicates and was
copied from CREATE INDEX docs by mistake.

3) 0006 - I've committed something similar / less invasive, achieving
the same goals (I think), and I've added a couple regression tests.

4) 0007 - I agreed we need a way to expose the stats, but including this
in pg_stats_ext seems rather inconvenient (a table nested in a table is
difficult to work with). Instead I've added a new view pg_stats_ext_exprs
with a structure similar to pg_stats. I've also added the expressions to
the pg_stats_ext view, which was only showing the attributes, and some
basic docs for the catalog changes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> > +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
> > +        * partial-index predicates.  Create it in the per-index context to be
> > 
> > I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> > should statistics support predicates, too ?
> > 
> 
> Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
> keeping it more like PG13 (good for backpatching). Not sure why you
> renamed extended statistics to multi-variate statistics - we use
> "extended" everywhere.

-       if (build_expressions && (list_length(stxexprs) == 0))
+       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
                ereport(ERROR,  
                                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-                                errmsg("extended expression statistics require at least one expression")));
+                                errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

> Not sure what's the point of the serialize_expr_stats changes,
> that code is mostly copy-paste from update_attstats.

Right.  I think "i" is a poor variable name when it isn't a loop variable
and isn't of limited scope.

> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
> IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done.  Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR:  extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 11/24/20 5:23 PM, Justin Pryzby wrote:
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> 0004 - Seems fine. IMHO not really "silly errors" but OK.
> 
> This is one of the same issues you pointed out - shadowing a variable.
> Could be backpatched.
> 
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>>> +                                errmsg("statistics expressions and predicates can refer only to the table being
indexed")));
>>> +        * partial-index predicates.  Create it in the per-index context to be
>>>
>>> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
>>> should statistics support predicates, too ?
>>>
>>
>> Right. Stupid copy-pasto.
> 
> Right, but then I was wondering if CREATE STATS should actually support
> predicates, since one use case is to do what indexes do without their overhead.
> I haven't thought about it enough yet.
> 

Well, it's not supported now, so the message is bogus. I'm not against
supporting "partial statistics" with predicates in the future, but it's
going to be a non-trivial project on its own. It's not something I can
bolt onto the current patch easily.

>> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
>> keeping it more like PG13 (good for backpatching). Not sure why you
>> renamed extended statistics to multi-variate statistics - we use
>> "extended" everywhere.
> 
> -       if (build_expressions && (list_length(stxexprs) == 0))
> +       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
>                 ereport(ERROR,  
>                                 (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -                                errmsg("extended expression statistics require at least one expression")));
> +                                errmsg("multi-variate statistics require at least two columns")));
> 
> I think all of "CREATE STATISTICS" has been known as "extended stats", so I
> think it may be confusing to say that it requires two columns for the general
> facility.
> 
>> Not sure what's the point of the serialize_expr_stats changes,
>> that code is mostly copy-paste from update_attstats.
> 
> Right.  I think "i" is a poor variable name when it isn't a loop variable
> and isn't of limited scope.
> 

OK, I understand. I'll consider tweaking that.

>> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
>> IMHO we should move this to a separate view.
> 
> Right - then unnest() the whole thing and return one row per expression rather
> than array, as you've done.  Maybe the docs should say that this returns one
> row per expression.
> 
> Looking quickly at your new patch: I guess you know there's a bunch of
> lingering references to "indexes" and "predicates":
> 
> I don't know if you want to go to the effort to prohibit this.
> postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
> CREATE STATISTICS
> 

Hmm, we're already rejecting system attributes, I suppose we should do
the same thing for expressions on system attributes.

> I think a lot of people will find this confusing:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
> ERROR:  extended expression statistics require at least one expression
> postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
> CREATE STATISTICS
> 
> I haven't looked, but is it possible to make it work without parens ?
> 

Hmm, you're right that may be surprising. I suppose we could walk the
expressions while creating the statistics, and replace such trivial
expressions with plain column references (the nested Var), but I haven't
tried. I wonder what the CREATE INDEX behavior would be in these cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a patch series rebased on top of 25a9e54d2d, which improves
estimation of OR clauses. There were only a couple of minor conflicts.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON (expression)
    FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | (expression) } , { column_name | (expression) } [, ...]
    FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.
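
So, for example, instead of the single statement above covering two
expressions, you'd write something like (placeholders as before):

  CREATE STATISTICS s1 ON (expr1) FROM tbl;
  CREATE STATISTICS s2 ON (expr2) FROM tbl;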

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
> 
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
> 
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
> 
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> 
> fails, but
> 
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> 
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> 
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> 
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
> 
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
> 
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
> 
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     ON (expression)
>     FROM table_name
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     [ ( statistics_kind [, ... ] ) ]
>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>     FROM table_name
> 
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
> 
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
> 
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
> 
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
> 

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     ON (expression)
>>>     FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     [ ( statistics_kind [, ... ] ) ]
>>>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>     FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
> 
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
> 

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem that forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
> 

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Possibly. But I don't think it's worth the extra complexity. I don't
> expect people to have a lot of overlapping stats, so the amount of
> wasted space and CPU time is expected to be fairly limited.
>
> So I don't think it's worth spending too much time on this now. Let's
> just do what you proposed, and revisit this later if needed.
>

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 12/11/20 1:58 PM, Dean Rasheed wrote:
> On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Possibly. But I don't think it's worth the extra complexity. I don't
>> expect people to have a lot of overlapping stats, so the amount of
>> wasted space and CPU time is expected to be fairly limited.
>>
>> So I don't think it's worth spending too much time on this now. Let's
>> just do what you proposed, and revisit this later if needed.
>>
> 
> Yes, I think that's a reasonable approach to take. As long as the
> documentation makes it clear that building MCV stats also causes
> standard expression stats to be built on any expressions included in
> the list, then the user will know and can avoid duplication most of
> the time. I don't think there's any need for code to try to prevent
> that -- just as we don't bother with code to prevent a user building
> multiple indexes on the same column.
> 
> The only case where duplication won't be avoidable is where there are
> multiple MCV stats sharing the same expression, but that's probably
> quite unlikely in practice, and it seems acceptable to leave improving
> that as a possible future optimisation.
> 

OK. Attached is an updated version, reworking it this way.

I tried tweaking the grammar to differentiate these two syntax variants,
but that led to shift/reduce conflicts with the existing rules. I tried
fixing that, but ended up handling the distinction in CreateStatistics()
instead.

The other thing is that we probably can't tie this to just MCV, because
functional dependencies need the per-expression stats too. So I simply
build expression stats whenever there's at least one expression.

I also decided to keep the "expressions" statistics kind - it's not
allowed to specify it in CREATE STATISTICS, but it's useful internally
as it allows deciding whether to build the stats in a single place.
Otherwise we'd need to do that every time we build the statistics, etc.
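
So, roughly, the behavior now looks like this (the exact handling of the
error cases may still change):

    CREATE TABLE t (a int, b int);

    -- builds just the per-expression statistics
    CREATE STATISTICS s1 ON (a + b) FROM t;

    -- builds mcv stats, plus per-expression stats for the expression
    CREATE STATISTICS s2 (mcv) ON a, (a + b) FROM t;

    -- rejected - "expressions" is an internal kind, not user-specifiable
    CREATE STATISTICS s3 (expressions) ON (a + b) FROM t;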

I added a brief explanation to the sgml docs, not sure if that's good
enough - maybe it needs more details.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it
into the next release.

I have started looking at it. So far I have only looked at the
catalog, parser and client changes, but I thought it's worth posting
my comments so far.

> I tried tweaking the grammar to differentiate these two syntax variants,
> but that led to shift/reduce conflicts with the existing rules. I tried
> fixing that, but ended up handling the distinction in CreateStatistics()
> instead.

Yeah, that makes sense. I wasn't expecting the grammar to change.

> The other thing is that we probably can't tie this to just MCV, because
> functional dependencies need the per-expression stats too. So I simply
> build expression stats whenever there's at least one expression.

Makes sense.

> I also decided to keep the "expressions" statistics kind - it's not
> allowed to specify it in CREATE STATISTICS, but it's useful internally
> as it allows deciding whether to build the stats in a single place.
> Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the
"expressions" stats kind is an internal implementation detail, hidden
from the user, because it's built automatically when required, so you
don't need to (and can't) explicitly ask for it. This new behaviour
seems much more logical to me.

> I added a brief explanation to the sgml docs, not sure if that's good
> enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked
too closely yet.


Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
 * Statistics attributes can be either simple column references, or arbitrary
 * expressions in parens.  For compatibility with index attributes permitted
 * in CREATE INDEX, we allow an expression that's just a function call to be
 * written without parens.
 */

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/4/21 4:34 PM, Dean Rasheed wrote:
>
> ... 
> 
> Some other comments:
> 
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.
> 

That's a bit down to Justin - I think it's fine to use the older version
repopulating the type array, but that question is somewhat unrelated to
this patch.

> * There's a comment typo in catalog/Makefile -- "are are reputedly
> other...", should be "there are reputedly other...".
> 
> * Looking at the pg_stats_ext view, I think perhaps expressions stats
> should be omitted entirely from that view, since it doesn't show any
> useful information about them. So it could remove "e" from the "kinds"
> array, and exclude rows whose only kind is "e", since such rows have
> no interesting data in them. Essentially, the new view
> pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
> redundant. Hiding this data in pg_stats_ext would also be consistent
> with making the "expressions" stats kind hidden from the user.
> 

Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea.
On the one hand it's an internal detail, on the other hand most of that
view is internal detail too. Excluding rows with only 'e' seems
reasonable, though. I need to think about this.

> * In gram.y, it wasn't quite obvious why you converted the column list
> for CREATE STATISTICS from an expr_list to a stats_params list. I
> figured it out, and it makes sense, but I think it could use a
> comment, perhaps something along the lines of the one for index_elem,
> e.g.:
> 
> /*
>   * Statistics attributes can be either simple column references, or arbitrary
>   * expressions in parens.  For compatibility with index attributes permitted
>   * in CREATE INDEX, we allow an expression that's just a function call to be
>   * written without parens.
>   */
> 

Oh, right. I'd have trouble figuring this out myself, even though I wrote
that code only one or two months ago.

> * In parse_func.c and parse_agg.c, there are a few new error strings
> that use the abbreviation "stats expressions", whereas most other
> errors refer to "statistics expressions". For consistency, I think
> they should all be the latter.
> 

OK, will fix.

> * In generateClonedExtStatsStmt(), I think the "expressions" stats
> kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
> ...) fails if the source table has expression stats.
> 

Yeah, will fix. I guess this also means we're missing some tests.

> * CreateStatistics() uses ShareUpdateExclusiveLock, but in
> tcop/utility.c the relation is opened with a ShareLock. ISTM that the
> latter lock mode should be made to match CreateStatistics().
> 

Not sure, will check.

> * Why does the new code in tcop/utility.c not use
> RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
> That seems preferable to doing the ACL check in CreateStatistics().
> For one thing, as it stands, it allows the lock to be taken even if
> the user doesn't own the table. Is it intentional that the current
> code allows extended stats to be created on system catalogs? That
> would be one thing that using RangeVarCallbackOwnsRelation would
> change, but I can't see a reason to allow it.
> 

I think I copied the code from somewhere - probably expression indexes,
or something like that. That's not proof that it's the right/better way
to do this, though.

> * In src/bin/psql/describe.c, I think the \d output should also
> exclude the "expressions" stats kind and just list the other kinds (or
> have no kinds list at all, if there are no other kinds), to make it
> consistent with the CREATE STATISTICS syntax.
> 

Not sure I understand. Why would this make it consistent with CREATE 
STATISTICS? Can you elaborate?

> * The pg_dump output for a stats object whose only kind is
> "expressions" is broken -- it includes a spurious "()" for the kinds
> list.
> 

Will fix. Again, this suggests there are TAP tests missing.

> That's it for now. I'll look at the optimiser changes next, and try to
> post more comments later this week.
> 

Thanks!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/4/21 4:34 PM, Dean Rasheed wrote:
> >
> > * In src/bin/psql/describe.c, I think the \d output should also
> > exclude the "expressions" stats kind and just list the other kinds (or
> > have no kinds list at all, if there are no other kinds), to make it
> > consistent with the CREATE STATISTICS syntax.
>
> Not sure I understand. Why would this make it consistent with CREATE
> STATISTICS? Can you elaborate?
>

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/5/21 3:10 PM, Dean Rasheed wrote:
> On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/4/21 4:34 PM, Dean Rasheed wrote:
>>>
>>> * In src/bin/psql/describe.c, I think the \d output should also
>>> exclude the "expressions" stats kind and just list the other kinds (or
>>> have no kinds list at all, if there are no other kinds), to make it
>>> consistent with the CREATE STATISTICS syntax.
>>
>> Not sure I understand. Why would this make it consistent with CREATE
>> STATISTICS? Can you elaborate?
>>
> 
> This isn't absolutely essential, but I think it would be neater. For
> example, if I have a table with stats like this:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;
> 
> then the \d output is as follows:
> 
> \d foo
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv) ON a, b FROM foo
> 
> and the stats line matches the DDL used to create the stats. It could,
> for example, be copy-pasted and tweaked to create similar stats on
> another table, but even if that's not very likely, it's neat that it
> reflects how the stats were created.
> 
> OTOH, if there are expressions in the list, it produces something like this:
> 
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo
> 
> which no longer matches the DDL used, and isn't part of an accepted
> syntax, so seems a bit inconsistent.
> 
> In general, if we're making the "expressions" kind an internal
> implementation detail that just gets built automatically when needed,
> then I think we should hide it from this sort of output, so the list
> of kinds matches the list that the user used when the stats were
> created.
> 

Hmm, I see. You're probably right it's not necessary to show this, given 
the modified handling of expression stats (which makes them an internal 
detail, not exposed to users). I'll tweak this.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Looking over the statscmds.c changes, there are a few XXX's and
FIXME's that need resolving, and I had a couple of other minor
comments:

+           /*
+            * An expression using mutable functions is probably wrong,
+            * since if you aren't going to get the same result for the
+            * same data every time, it's not clear what the index entries
+            * mean at all.
+            */
+           if (CheckMutability((Expr *) expr))
+               ereport(ERROR,

That comment is presumably copied from the index code, so needs updating.


+           /*
+            * Disallow data types without a less-than operator
+            *
+            * XXX Maybe allow this, but only for EXPRESSIONS stats and
+            * prevent building e.g. MCV etc.
+            */
+           atttype = exprType(expr);
+           type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+           if (type->lt_opr == InvalidOid)
+               ereport(ERROR,
+                       (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                        errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
+                               format_type_be(atttype))));

As the comment suggests, it's probably worth skipping this check if
numcols is 1 so that single-column stats can be built for more types
of expressions. (I'm assuming that it's basically no more effort to
make that work, so I think it falls into the might-as-well-do-it
category.)
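
Something along these lines, perhaps (just a sketch, reusing the 
variables from the hunk above):

/* only insist on a default btree opclass for multi-column statistics */
atttype = exprType(expr);
if (numcols > 1)
{
    type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
    if (type->lt_opr == InvalidOid)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
                        format_type_be(atttype))));
}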


+   /*
+    * Parse the statistics kinds.  Firstly, check that this is not the
+    * variant building statistics for a single expression, in which case
+    * we don't allow specifying any statistis kinds.  The simple variant
+    * only has one expression, and does not allow statistics kinds.
+    */
+   if ((list_length(stmt->exprs) == 1) && (list_length(stxexprs) == 1))
+   {

Typo: "statistis"
Nit-picking, this test could just be:

+   if ((numcols == 1) && (list_length(stxexprs) == 1))

which IMO is a little more readable, and matches a similar test a
little further down.


+   /*
+    * If there are no simply-referenced columns, give the statistics an
+    * auto dependency on the whole table.  In most cases, this will
+    * be redundant, but it might not be if the statistics expressions
+    * contain no Vars (which might seem strange but possible).
+    *
+    * XXX This is copied from index_create, not sure if it's applicable
+    * to extended statistics too.
+    */

Seems right to me.


+       /*
+        * FIXME use 'expr' for expressions, which have empty column names.
+        * For indexes this is handled in ChooseIndexColumnNames, but we
+        * have no such function for stats.
+        */
+       if (!name)
+           name = "expr";

In theory, this function could be made to duplicate the logic used for
indexes, creating names like "expr1", "expr2", etc. To be honest
though, I don't think it's worth the effort. The code for indexes
isn't really bulletproof anyway -- for example there might be a column
called "expr" that is or isn't included in the index, which would make
the generated name ambiguous. And in any case, a name like
"tbl_cola_expr_colb_expr1_colc_stat" isn't really any more useful than
"tbl_cola_expr_colb_expr_colc_stat". So I'd be tempted to leave that
code as it is.


+
+/*
+ * CheckMutability
+ *     Test whether given expression is mutable
+ *
+ * FIXME copied from indexcmds.c, maybe use some shared function?
+ */
+static bool
+CheckMutability(Expr *expr)
+{

As the comment says, it's quite messy duplicating this code, but I'm
wondering whether it would be OK to just skip this check entirely. I
think someone else suggested that elsewhere, and I think it might not
be a bad idea.

For indexes, it could easily lead to wrong query results, but for
stats the most likely problem is that the stats would get out of date
(which they tend to do all by themselves anyway) and need rebuilding.

If you ignore intentionally crazy examples (which are still possible
even with this check), then there are probably many legitimate cases
where someone might want to use non-immutable functions in stats, and
this check just forces them to create an immutable wrapper function.
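
For instance (a hypothetical table; date_trunc() on timestamptz is only 
stable, not immutable, because its result depends on the TimeZone setting):

CREATE TABLE events (id int, created_at timestamptz);
-- presumably rejected by the mutability check, although it seems a
-- perfectly legitimate thing to want statistics on
CREATE STATISTICS events_s ON (date_trunc('day', created_at)) FROM events;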

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Starting to look at the planner code, I found an oversight in the way
expression stats are read at the start of planning -- it is necessary
to call ChangeVarNodes() on any expressions if the relid isn't 1,
otherwise the stats expressions may contain Var nodes referring to the
wrong relation. Possibly the easiest place to do that would be in
get_relation_statistics(), if rel->relid != 1.

Here's a simple test case:

CREATE TABLE foo AS SELECT x FROM generate_series(1,100000) g(x);
CREATE STATISTICS foo_s ON (x%10) FROM foo;
ANALYSE foo;

EXPLAIN SELECT * FROM foo WHERE x%10 = 0;
EXPLAIN SELECT * FROM (SELECT 1) t, foo WHERE x%10 = 0;

(in the second query, the stats don't get applied).
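
For what it's worth, I imagine the fix being roughly this (a sketch only, 
with "exprs" standing for the list of statistics expressions just read 
from the catalog):

/* the stored expressions refer to varno 1, so remap them if needed */
if (rel->relid != 1)
    ChangeVarNodes((Node *) exprs, 1, rel->relid, 0);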

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a patch fixing most of the issues. There are a couple 
exceptions:


 > * Looking at the pg_stats_ext view, I think perhaps expressions stats
 > should be omitted entirely from that view, since it doesn't show any
 > useful information about them. So it could remove "e" from the "kinds"
 > array, and exclude rows whose only kind is "e", since such rows have
 > no interesting data in them. Essentially, the new view
 > pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
 > redundant. Hiding this data in pg_stats_ext would also be consistent
 > with making the "expressions" stats kind hidden from the user.

I haven't removed the expression stats from the pg_stats_ext view yet - 
I'm still not 100% sure about it.


 > * Why does the new code in tcop/utility.c not use
 > RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
 > That seems preferable to doing the ACL check in CreateStatistics().
 > For one thing, as it stands, it allows the lock to be taken even if
 > the user doesn't own the table. Is it intentional that the current
 > code allows extended stats to be created on system catalogs? That
 > would be one thing that using RangeVarCallbackOwnsRelation would
 > change, but I can't see a reason to allow it.

I haven't switched utility.c to RangeVarGetRelidExtended together with 
RangeVarCallbackOwnsRelation, because the current code allows checking 
for object type first. I don't recall why exactly it was done this way, 
but I didn't feel like changing that in this patch.

You're right, however, that it should not be possible to create statistics 
on system catalogs. For regular users that is rejected thanks to the 
ownership check, but a superuser may still create them. I've added a proper 
check to CreateStatistics() - this is probably worth backpatching.


 > * In src/bin/psql/describe.c, I think the \d output should also
 > exclude the "expressions" stats kind and just list the other kinds (or
 > have no kinds list at all, if there are no other kinds), to make it
 > consistent with the CREATE STATISTICS syntax.

I've done this, but I went one step further - we hide the list of kinds 
using the same rules as pg_dump, i.e. we don't list the kinds if all of 
them are selected. Not sure if that's the right thing, though.


The rest of the issues should be fixed, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
> Attached is a patch fixing most of the issues. There are a couple
> exceptions:

In the docs:

+   — at the cost that its schema must be extended whenever the structure
+   of statistics <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link> changes.

should say "of statistics *IN* pg_statistics changes" ?

+   to an expression index. The full variant allows defining statistics objects
+   on multiple columns and expressions, and pick which statistics kinds will
+   be built. The per-expression statistics are built automatically when there

"and pick" is wrong - maybe say "and selecting which.."

+   and run a query using an expression on that column.  Without the

remove "the" ?

+   extended statistics, the planner has no information about data
+   distribution for reasults of those expression, and uses default

*results

+   estimates as illustrated by the first query.  The planner also does
+   not realize the value of the second column fully defines the value
+   of the other column, because date truncated to day still identifies
+   the month). Then expression and ndistinct statistics are built on

The ")" is unbalanced

+                               /* all parts of thi expression are covered by this statistics */

this

+ * GrouExprInfos, but only if it's not known equal to any of the existing

Group

+        * we don't allow specifying any statistis kinds.  The simple variant

statistics

+        * If no statistic type was specified, build them all (but request

Say "kind" not "type" ?

+ * expression is a simple Var. OTOH we check that there's at least one
+ * statistics matching the expression.

one statistic (singular) ?

+                * the future, we might consider
+                */

consider ???

+-- (not it fails, when there are no simple column references)

note?

There's some remaining copy/paste stuff from index expressions:

errmsg("statistics expressions and predicates can refer only to the table being indexed")));
left behind by evaluating the predicate or index expressions.
Set up for predicate or expression evaluation
Need an EState for evaluation of index expressions and
/* Compute and save index expression values */
left behind by evaluating the predicate or index expressions.
Fetch function for analyzing index expressions.
partial-index predicates.  Create it in the per-index context to be
* When analyzing an expression index, believe the expression tree's type

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/8/21 3:35 AM, Justin Pryzby wrote:
> On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
>> Attached is a patch fixing most of the issues. There are a couple
>> exceptions:
> 
> In the docs:
> 
> ...
>

Thanks! Checking the docs and comments is tedious work; I appreciate
you going through all that. I'll fix that in the next version.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Attached is an updated version of the patch series, fixing a couple issues:

1) docs issues, pointed out by Justin Pryzby

2) adds ACL check to statext_extract_expression to verify access to 
attributes in the expression(s)

3) adds comment to statext_is_compatible_clause_internal explaining the 
ambiguity in extracting expressions for extended stats

4) fixes/improves memory management in compute_expr_stats

5) a bunch of minor comment and code fixes


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>expr</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Expression the extended statistics is defined on
> +      </para></entry>

Expression the extended statistics ARE defined on
Or maybe say "on which the extended statistics are defined"

> +  <para>
> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
> +   simple variant allows to build statistics for a single expression, does

.. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does)

> +   Expression statistics are per-expression and are similar to creating an
> +   index on the expression, except that they avoid the overhead of the index.

Maybe say "overhead of index maintenance"

> +   All functions and operators used in a statistics definition must be
> +   <quote>immutable</quote>, that is, their results must depend only on
> +   their arguments and never on any outside influence (such as
> +   the contents of another table or the current time).  This restriction

say "outside factor" or "external factor"

> +   results of those expression, and uses default estimates as illustrated
> +   by the first query.  The planner also does not realize the value of the

realize THAT

> +   second column fully defines the value of the other column, because date
> +   truncated to day still identifies the month. Then expression and
> +   ndistinct statistics are built on those two columns:

I got an error doing this:

CREATE TABLE t AS SELECT generate_series(1,9) AS i;
CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
ANALYZE t;
SELECT i+1 FROM t GROUP BY 1;
ERROR:  corrupt MVNDistinct entry

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/17/21 12:22 AM, Justin Pryzby wrote:
> On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>expr</structfield> <type>text</type>
>> +      </para>
>> +      <para>
>> +       Expression the extended statistics is defined on
>> +      </para></entry>
> 
> Expression the extended statistics ARE defined on
> Or maybe say "on which the extended statistics are defined"
> 

I'm pretty sure "is" is correct because "expression" is singular.

>> +  <para>
>> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
>> +   simple variant allows to build statistics for a single expression, does
> 
> .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does)
> 
>> +   Expression statistics are per-expression and are similar to creating an
>> +   index on the expression, except that they avoid the overhead of the index.
> 
> Maybe say "overhead of index maintenance"
> 

Yeah, that sounds better.

>> +   All functions and operators used in a statistics definition must be
>> +   <quote>immutable</quote>, that is, their results must depend only on
>> +   their arguments and never on any outside influence (such as
>> +   the contents of another table or the current time).  This restriction
> 
> say "outside factor" or "external factor"
> 

In fact, we've removed the immutability restriction, so this paragraph 
should have been removed.

>> +   results of those expression, and uses default estimates as illustrated
>> +   by the first query.  The planner also does not realize the value of the
> 
> realize THAT
> 

OK, changed.

>> +   second column fully defines the value of the other column, because date
>> +   truncated to day still identifies the month. Then expression and
>> +   ndistinct statistics are built on those two columns:
> 
> I got an error doing this:
> 
> CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> ANALYZE t;
> SELECT i+1 FROM t GROUP BY 1;
> ERROR:  corrupt MVNDistinct entry
> 

Thanks. There was a thinko in estimate_multivariate_ndistinct, which 
resulted in mismatched ndistinct coefficient items. The attached patch 
fixes that, but I've realized the way we pick the "best" statistics may 
need some improvement (I added an XXX comment about that).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Zhihong Yu
Date:
Hi,
+    * Check that only the base rel is mentioned.  (This should be dead code
+    * now that add_missing_from is history.)
+    */
+   if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right ?

For statext_mcv_clauselist_selectivity:

+                   // bms_free(list_attnums[listidx]);

The commented line can be removed.

+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool *expronleftp)

Better add some comment for examine_clause_args2 since there is examine_clause_args() already.

+   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have negative value ?

For serialize_expr_stats():

+   sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+   /* lookup OID of composite type for pg_statistic */
+   typOid = get_rel_type_id(StatisticRelationId);
+   if (!OidIsValid(typOid))
+       ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.

+       Datum       values[Natts_pg_statistic];
+       bool        nulls[Natts_pg_statistic];
+       HeapTuple   stup;
+
+       if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.

+           if (enabled[i] == STATS_EXT_NDISTINCT)
+               ndistinct_enabled = true;
+           if (enabled[i] == STATS_EXT_DEPENDENCIES)
+               dependencies_enabled = true;
+           if (enabled[i] == STATS_EXT_MCV)

the second and third if should be preceded with 'else'

+ReleaseDummy(HeapTuple tuple)
+{
+   pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need this method - call pfree in its place.

Cheers


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
> index 4363be50c3..5b8eb8d248 100644
> --- a/doc/src/sgml/ref/create_statistics.sgml
> +++ b/doc/src/sgml/ref/create_statistics.sgml
> @@ -21,9 +21,13 @@ PostgreSQL documentation
>  
>   <refsynopsisdiv>
>  <synopsis>
> +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
> +    ON ( <replaceable class="parameter">expression</replaceable> )
> +    FROM <replaceable class="parameter">table_name</replaceable>

>  CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>      [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
> -    ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...]
> +    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...]
>      FROM <replaceable class="parameter">table_name</replaceable>
>  </synopsis>

I think this part is wrong, since it's not possible to specify a single column
in form#2.

If I'm right, the current way is:

 - form#1 allows expression statistics of a single expression, and doesn't
   allow specifying "kinds";

 - form#2 allows specifying "kinds", but always computes expression statistics,
   and requires multiple columns/expressions.

So it'd need to be column_name|expression, column_name|expression [,...]
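
To spell out the two forms with a (hypothetical) table:

CREATE TABLE t (a int, b int);

-- form#1: a single expression, no statistics kinds allowed
CREATE STATISTICS s1 ON (a + b) FROM t;

-- form#2: kinds may be given, but at least two columns/expressions are required
CREATE STATISTICS s2 (ndistinct, mcv) ON a, (a + b) FROM t;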

> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
>     database and will be owned by the user issuing the command.
>    </para>
>  
> +  <para>
> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
> +   simple variant allows building statistics for a single expression, does
> +   not allow specifying any statistics kinds and provides benefits similar
> +   to an expression index. The full variant allows defining statistics objects
> +   on multiple columns and expressions, and selecting which statistics kinds will
> +   be built. The per-expression statistics are built automatically when there
> +   is at least one expression.
> +  </para>

> +   <varlistentry>
> +    <term><replaceable class="parameter">expression</replaceable></term>
> +    <listitem>
> +     <para>
> +      The expression to be covered by the computed statistics. In this case
> +      only a single expression is required, in which case only the expression
> +      statistics kind is allowed. The order of expressions is insignificant.

I think this part is wrong now ?
I guess there's no user-facing expression "kind" left in the CREATE command.
I guess "in which case" means "if only one expr is specified".
"expression" could be either form#1 or form#2.

Maybe it should just say:
> +      An expression to be covered by the computed statistics.

Maybe somewhere else, say:
> In the second form of the command, the order of expressions is insignificant.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/17/21 3:55 AM, Zhihong Yu wrote:
> Hi,
> +    * Check that only the base rel is mentioned.  (This should be dead code
> +    * now that add_missing_from is history.)
> +    */
> +   if (list_length(pstate->p_rtable) != 1)
> 
> If it is dead code, it can be removed, right ?
> 

Maybe. The question is whether it really is dead code. It's copied from 
transformIndexStmt so I kept it for now.

> For statext_mcv_clauselist_selectivity:
> 
> +                   // bms_free(list_attnums[listidx]);
> 
> The commented line can be removed.
>

Actually, this should probably do list_free on the list_exprs.

> +bool
> +examine_clause_args2(List *args, Node **exprp, Const **cstp, bool 
> *expronleftp)
> 
> Better add some comment for examine_clause_args2 since there 
> is examine_clause_args() already.
> 

Yeah, this probably needs a better name. I have a feeling this may need 
a refactoring to reuse more of the existing code for the expressions.

> +   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
> 
> When would stats->minrows have negative value ?
> 

The typanalyze function (e.g. std_typanalyze) can set it to a negative 
value. This is the same check as in examine_attribute, and we need the 
same protections, I think.

> For serialize_expr_stats():
> 
> +   sd = table_open(StatisticRelationId, RowExclusiveLock);
> +
> +   /* lookup OID of composite type for pg_statistic */
> +   typOid = get_rel_type_id(StatisticRelationId);
> +   if (!OidIsValid(typOid))
> +       ereport(ERROR,
> 
> Looks like the table_open() call can be made after the typOid check.
> 

Probably, but this failure is unlikely (should never happen) so I don't 
think this makes any real difference.

> +       Datum       values[Natts_pg_statistic];
> +       bool        nulls[Natts_pg_statistic];
> +       HeapTuple   stup;
> +
> +       if (!stats->stats_valid)
> 
> It seems the local arrays can be declared after the validity check.
> 

Nope, that would be against C99.

> +           if (enabled[i] == STATS_EXT_NDISTINCT)
> +               ndistinct_enabled = true;
> +           if (enabled[i] == STATS_EXT_DEPENDENCIES)
> +               dependencies_enabled = true;
> +           if (enabled[i] == STATS_EXT_MCV)
> 
> the second and third if should be preceded with 'else'
> 

Yeah, although this just moves existing code. But you're right it could 
use else.

> +ReleaseDummy(HeapTuple tuple)
> +{
> +   pfree(tuple);
> 
> Since ReleaseDummy() is just a single pfree call, maybe you don't need 
> this method - call pfree in its place.
> 

No, that's not possible because the freefunc callback expects signature

     void (*)(HeapTupleData *)

and pfree() does not match that.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/17/21 6:29 AM, Justin Pryzby wrote:
> On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
>> diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
>> index 4363be50c3..5b8eb8d248 100644
>> --- a/doc/src/sgml/ref/create_statistics.sgml
>> +++ b/doc/src/sgml/ref/create_statistics.sgml
>> @@ -21,9 +21,13 @@ PostgreSQL documentation
>>   
>>    <refsynopsisdiv>
>>   <synopsis>
>> +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>> +    ON ( <replaceable class="parameter">expression</replaceable> )
>> +    FROM <replaceable class="parameter">table_name</replaceable>
> 
>>   CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
>>       [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
>> -    ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...]
>> +    ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...]
>>       FROM <replaceable class="parameter">table_name</replaceable>
>>   </synopsis>
> 
> I think this part is wrong, since it's not possible to specify a single column
> in form#2.
> 
> If I'm right, the current way is:
> 
>   - form#1 allows expression statistics of a single expression, and doesn't
>     allow specifying "kinds";
> 
>   - form#2 allows specifying "kinds", but always computes expression statistics,
>     and requires multiple columns/expressions.
> 
> So it'd need to be column_name|expression, column_name|expression [,...]
> 

Strictly speaking you're probably correct - there should be at least two 
elements. But I'm somewhat hesitant about making this more complex, 
because it'll be harder to understand.


>> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
>>      database and will be owned by the user issuing the command.
>>     </para>
>>   
>> +  <para>
>> +   The <command>CREATE STATISTICS</command> command has two basic forms. The
>> +   simple variant allows building statistics for a single expression, does
>> +   not allow specifying any statistics kinds and provides benefits similar
>> +   to an expression index. The full variant allows defining statistics objects
>> +   on multiple columns and expressions, and selecting which statistics kinds will
>> +   be built. The per-expression statistics are built automatically when there
>> +   is at least one expression.
>> +  </para>
> 
>> +   <varlistentry>
>> +    <term><replaceable class="parameter">expression</replaceable></term>
>> +    <listitem>
>> +     <para>
>> +      The expression to be covered by the computed statistics. In this case
>> +      only a single expression is required, in which case only the expression
>> +      statistics kind is allowed. The order of expressions is insignificant.
> 
> I think this part is wrong now ?
> I guess there's no user-facing expression "kind" left in the CREATE command.
> I guess "in which case" means "if only one expr is specified".
> "expression" could be either form#1 or form#2.
> 
> Maybe it should just say:
>> +      An expression to be covered by the computed statistics.
> 
> Maybe somewhere else, say:
>> In the second form of the command, the order of expressions is insignificant.
> 

Yeah, this is a leftover from when there was an "expressions" kind. I'll 
reword this a bit.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR:  corrupt MVNDistinct entry
> 
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in
> mismatching the ndistinct coefficient items. The attached patch fixes that,
> but I've realized the way we pick the "best" statistics may need some
> improvements (I added an XXX comment about that).

That may indicate a deficiency in testing and code coverage.

| postgres=# CREATE TABLE t(i int);
| postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
| postgres=# \d t
|                  Table "public.t"
|  Column |  Type   | Collation | Nullable | Default 
| --------+---------+-----------+----------+---------
|  i      | integer |           |          | 
| Indexes:
|     "t_i_idx" btree (i)
| Statistics objects:
|     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t

on ... what ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.
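
i.e. something like this near the top of lookup_var_attr_stats() (a 
sketch only; I'm assuming returning NULL there makes the caller skip 
building the statistics object):

/* nothing to do if the table has no analyzable columns at all */
if (nvacatts == 0)
    return NULL;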


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().


A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle of the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.
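
To illustrate the kind of offsetting I have in mind (hypothetical macro 
names, not from the patch):

/*
 * Sketch only: expressions get "attnums" -1, -2, ... and are shifted by
 * STATS_MAX_DIMENSIONS when stored in a Bitmapset, much like system
 * attributes are shifted by FirstLowInvalidHeapAttributeNumber.
 */
#define StatAttnumToIndex(attnum)   ((attnum) + STATS_MAX_DIMENSIONS)
#define StatIndexToAttnum(index)    ((AttrNumber) ((index) - STATS_MAX_DIMENSIONS))

/* so expression -1 becomes bitmapset member 7, column 1 becomes member 9, etc. */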


Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-----------------+-----------+---------+------------
 s               | foo       | (a + b) |         11
 s               | foo       | (a * b) |         11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+         LEFT JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+         JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON true;

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/18/21 4:48 PM, Dean Rasheed wrote:
> Looking through extended_stats.c, I found a corner case that can lead
> to a seg-fault:
> 
> CREATE TABLE foo();
> CREATE STATISTICS s ON (1) FROM foo;
> ANALYSE foo;
> 
> This crashes in lookup_var_attr_stats(), because it isn't expecting
> nvacatts to be 0. I can't think of any case where building stats on a
> table with no analysable columns is useful, so it should probably just
> exit early in that case.
> 
> 
> In BuildRelationExtStatistics(), it looks like min_attrs should be
> declared assert-only.
> 
> 
> In evaluate_expressions():
> 
> +   /* set the pointers */
> +   result = (ExprInfo *) ptr;
> +   ptr += sizeof(ExprInfo);
> 
> I think that should probably have a MAXALIGN().
> 

Thanks, I'll fix all of that.

> 
> A slightly bigger issue that I don't like is the way it assigns
> attribute numbers for expressions starting from
> MaxHeapAttributeNumber+1, so the first expression has an attnum of
> 1601. That leads to pretty inefficient use of Bitmapsets, since most
> tables only contain a handful of columns, and there's a large block of
> unused space in the middle the Bitmapset.
> 
> An alternative approach might be to use regular attnums for columns
> and use negative indexes -1, -2, -3, ... for expressions in the stored
> stats. Then when storing and retrieving attnums from Bitmapsets, it
> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> in the Bitmapsets, since there can't be more than that many
> expressions (just like other code stores system attributes using
> FirstLowInvalidHeapAttributeNumber).
> 
> That would be a somewhat bigger change, but hopefully fairly
> mechanical, and then some code like add_expressions_to_attributes()
> would go away.
> 

Well, I tried this but unfortunately it's not that simple. We still need 
to build the bitmaps, so I don't think add_expression_to_attributes can 
be just removed. I mean, we need to do the offsetting somewhere, even if 
we change how we do it.

But the main issue is that in some cases the number of expressions is 
not really limited by STATS_MAX_DIMENSIONS - for example when applying 
functional dependencies, we "merge" multiple statistics, so we may end 
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts 
the order of processing (so far we've assumed expressions are after 
regular attnums). So the changes are more extensive - I tried doing that 
anyway, and I'm still struggling with crashes and regression failures. 
Of course, that doesn't mean we shouldn't do it, but it's far from 
mechanical. (Some of that is probably a sign this code needs a bit more 
work to polish.)

But I wonder if it'd be easier to just calculate the actual max attnum 
and then use it instead of MaxHeapAttributeNumber ...

> 
> Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
> show expressions until the statistics have been built. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> but after populating and analysing the table, this becomes:
> 
>   statistics_name | tablename |  expr   | n_distinct
> -----------------+-----------+---------+------------
>   s               | foo       | (a + b) |         11
>   s               | foo       | (a * b) |         11
> (2 rows)
> 
> I think it should show the expressions even before the stats have been built.
> 
> Another issue is that it returns rows for non-expression stats as
> well. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON a, b FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> and those values will never be populated, since they're not
> expressions, so I would expect them to not be shown in the view.
> 
> So basically, instead of
> 
> +         LEFT JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON sd.stxdexpr IS NOT NULL;
> 
> perhaps just
> 
> +         JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +                     unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON true;

Will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> > A slightly bigger issue that I don't like is the way it assigns
> > attribute numbers for expressions starting from
> > MaxHeapAttributeNumber+1, so the first expression has an attnum of
> > 1601. That leads to pretty inefficient use of Bitmapsets, since most
> > tables only contain a handful of columns, and there's a large block of
> > unused space in the middle the Bitmapset.
> >
> > An alternative approach might be to use regular attnums for columns
> > and use negative indexes -1, -2, -3, ... for expressions in the stored
> > stats. Then when storing and retrieving attnums from Bitmapsets, it
> > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> > in the Bitmapsets, since there can't be more than that many
> > expressions (just like other code stores system attributes using
> > FirstLowInvalidHeapAttributeNumber).
>
> Well, I tried this but unfortunately it's not that simple. We still need
> to build the bitmaps, so I don't think add_expression_to_attributes can
> be just removed. I mean, we need to do the offsetting somewhere, even if
> we change how we do it.

Hmm, I was imagining that the offsetting would happen in each place
that adds or retrieves an attnum from a Bitmapset, much like a lot of
other code does for system attributes, and then you'd know you had an
expression if the resulting attnum was negative.

I was also thinking that it would be these negative attnums that would
be stored in the stats data, so instead of something like "1, 2 =>
1601", it would be "1, 2 => -1", so in some sense "-1" would be the
"real" attnum associated with the expression.

> But the main issue is that in some cases the number of expressions is
> not really limited by STATS_MAX_DIMENSIONS - for example when applying
> functional dependencies, we "merge" multiple statistics, so we may end
> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Ah, I see. I hadn't really fully understood what that code was doing.

ISTM though that this is really an internal problem to the
dependencies code, in that these "merged" Bitmapsets containing attrs
from multiple different stats objects do not (and must not) ever go
outside that local code that uses them. So that code would be free to
use a different offset for its own purposes -- e.g., collect all the
distinct expressions across all the stats objects and then offset by
the number of distinct expressions.

> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
> the order of processing (so far we've assumed expressions are after
> regular attnums). So the changes are more extensive - I tried doing that
> anyway, and I'm still struggling with crashes and regression failures.
> Of course, that doesn't mean we shouldn't do it, but it's far from
> mechanical. (Some of that is probably a sign this code needs a bit more
> work to polish.)

Interesting. What code assumes expressions come after attributes?
Ideally, I think it would be cleaner if no code assumed any particular
order, but I can believe that it might be convenient in some
circumstances.

> But I wonder if it'd be easier to just calculate the actual max attnum
> and then use it instead of MaxHeapAttributeNumber ...

Hmm, I'm not sure how that would work. There still needs to be an
attnum that gets stored in the database, and it has to continue to
work if the user adds columns to the table. That's why I was
advocating storing negative values, though I haven't actually tried it
to see what might go wrong.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/21/21 12:11 PM, Dean Rasheed wrote:
> On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>> A slightly bigger issue that I don't like is the way it assigns
>>> attribute numbers for expressions starting from
>>> MaxHeapAttributeNumber+1, so the first expression has an attnum of
>>> 1601. That leads to pretty inefficient use of Bitmapsets, since most
>>> tables only contain a handful of columns, and there's a large block of
>>> unused space in the middle the Bitmapset.
>>>
>>> An alternative approach might be to use regular attnums for columns
>>> and use negative indexes -1, -2, -3, ... for expressions in the stored
>>> stats. Then when storing and retrieving attnums from Bitmapsets, it
>>> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
>>> in the Bitmapsets, since there can't be more than that many
>>> expressions (just like other code stores system attributes using
>>> FirstLowInvalidHeapAttributeNumber).
>>
>> Well, I tried this but unfortunately it's not that simple. We still need
>> to build the bitmaps, so I don't think add_expression_to_attributes can
>> be just removed. I mean, we need to do the offsetting somewhere, even if
>> we change how we do it.
> 
> Hmm, I was imagining that the offsetting would happen in each place
> that adds or retrieves an attnum from a Bitmapset, much like a lot of
> other code does for system attributes, and then you'd know you had an
> expression if the resulting attnum was negative.
> 
> I was also thinking that it would be these negative attnums that would
> be stored in the stats data, so instead of something like "1, 2 =>
> 1601", it would be "1, 2 => -1", so in some sense "-1" would be the
> "real" attnum associated with the expression.
> 
>> But the main issue is that in some cases the number of expressions is
>> not really limited by STATS_MAX_DIMENSIONS - for example when applying
>> functional dependencies, we "merge" multiple statistics, so we may end
>> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.
> 
> Ah, I see. I hadn't really fully understood what that code was doing.
> 
> ISTM though that this is really an internal problem to the
> dependencies code, in that these "merged" Bitmapsets containing attrs
> from multiple different stats objects do not (and must not) ever go
> outside that local code that uses them. So that code would be free to
> use a different offset for its own purposes -- e..g., collect all the
> distinct expressions across all the stats objects and then offset by
> the number of distinct expressions.
> 

>> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
>> the order of processing (so far we've assumed expressions are after
>> regular attnums). So the changes are more extensive - I tried doing that
>> anyway, and I'm still struggling with crashes and regression failures.
>> Of course, that doesn't mean we shouldn't do it, but it's far from
>> mechanical. (Some of that is probably a sign this code needs a bit more
>> work to polish.)
> 
> Interesting. What code assumes expressions come after attributes?
> Ideally, I think it would be cleaner if no code assumed any particular
> order, but I can believe that it might be convenient in some
> circumstances.
> 

Well, in a bunch of places we look at the index (from the bitmap) and 
use it to determine whether the value is a regular attribute or an 
expression, because the values are stored in separate arrays.

This is solvable - all I'm saying (both here and in the preceding part 
about dependencies) is that it's not an entirely mechanical task. But it 
might be better to rethink the separation of simple values and 
expressions, and make it more "unified" so that most of the code does not 
really need to deal with these differences.

>> But I wonder if it'd be easier to just calculate the actual max attnum
>> and then use it instead of MaxHeapAttributeNumber ...
> 
> Hmm, I'm not sure how that would work. There still needs to be an
> attnum that gets stored in the database, and it has to continue to
> work if the user adds columns to the table. That's why I was
> advocating storing negative values, though I haven't actually tried it
> to see what might go wrong.
> 

Well, yeah, we need to identify the expression in some statistics (e.g. 
in dependencies or ndistinct items). And yeah, offsetting the expression 
attnums by MaxHeapAttributeNumber may be inefficient in this case.


Attached is an updated version of the patch, hopefully addressing all 
issues pointed out by you, Justin and Zhihong, with the exception of the 
expression attnums discussed here.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
This already needs to be rebased on 55dc86eca.

And needs to update rules.out.

And doesn't address this one:

On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
> | postgres=# CREATE TABLE t(i int);
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
> | postgres=# \d t
> |                  Table "public.t"
> |  Column |  Type   | Collation | Nullable | Default 
> | --------+---------+-----------+----------+---------
> |  i      | integer |           |          | 
> | Indexes:
> |     "t_i_idx" btree (i)
> | Statistics objects:
> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> on ... what ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 1/22/21 3:29 AM, Justin Pryzby wrote:
> This already needs to be rebased on 55dc86eca.
> 
> And needs to update rules.out.
> 

Whooops. A fixed version attached.

> And doesn't address this one:
> 
> On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
>> | postgres=# CREATE TABLE t(i int);
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
>> | postgres=# \d t
>> |                  Table "public.t"
>> |  Column |  Type   | Collation | Nullable | Default
>> | --------+---------+-----------+----------+---------
>> |  i      | integer |           |          |
>> | Indexes:
>> |     "t_i_idx" btree (i)
>> | Statistics objects:
>> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
>>
>> on ... what ?
> 

Umm, for me that prints:

test=# CREATE TABLE t(i int);
CREATE TABLE
test=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
CREATE STATISTICS
test=# \d t
                  Table "public.t"
  Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
  i      | integer |           |          |
Statistics objects:
     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t

which I think is OK. But maybe there's something else to trigger the 
problem?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > | Statistics objects:
> > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> Umm, for me that prints:

>     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> 
> which I think is OK. But maybe there's something else to trigger the
> problem?

Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
I think it's considered ok if old clients' \d commands don't work on a new
server, but it's not clear to me if it's ok if they misbehave.  It would
almost be better if it raised an ERROR.

In any case, why are there so many parentheses ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Thu, Jan 21, 2021 at 10:01:01PM -0600, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > | Statistics objects:
> > > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > 
> > Umm, for me that prints:
> 
> >     "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > 
> > which I think is OK. But maybe there's something else to trigger the
> > problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It's almost
> better it made an ERROR.

I think you'll maybe have to do something better - this seems a bit too weird:

| postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
| postgres=# \d t
| ...
|     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t

It suggests including additional columns in stxkeys for each expression.
Maybe that also helps give direction to a response to Dean's concern?

That doesn't make old psql do anything more desirable, though.
Unless you also added attributes, all you can do is make it say things like
"columns: ctid".

> In any case, why are there so many parentheses ?

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> I think you'll maybe have to do something better - this seems a bit too weird:
>
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
> | postgres=# \d t
> | ...
> |     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>

I guess that's not surprising, given that old psql knows nothing about
expressions in stats.

In general, I think connecting old versions of psql to newer servers
is not supported. You're lucky if \d works at all. So it shouldn't be
this patch's responsibility to make that output nicer.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/22/21 10:00 AM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>> I think you'll maybe have to do something better - this seems a bit too weird:
>>
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
>> | postgres=# \d t
>> | ...
>> |     "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>>
> 
> I guess that's not surprising, given that old psql knows nothing about
> expressions in stats.
> 
> In general, I think connecting old versions of psql to newer servers
> is not supported. You're lucky if \d works at all. So it shouldn't be
> this patch's responsibility to make that output nicer.
> 

Yeah. It's not clear to me what exactly we could do about this, short of 
"backpatching" the old psql or making ruleutils.c consider the psql 
version. Neither of these seems possible/acceptable.

I'm sure this is not the only place showing "incomplete" information in 
an old psql connected to a new server.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/22/21 5:01 AM, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
>>>> | Statistics objects:
>>>> |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
>>
>> Umm, for me that prints:
> 
>>      "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
>>
>> which I think is OK. But maybe there's something else to trigger the
>> problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It's almost
> better it made an ERROR.
> 

Well, how would the server know to throw an error? We can't quite patch 
the old psql (if we could, we could just tweak the query).

> In any case, why are there so many parentheses ?
> 

That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be 
adding extra parentheses, on top of what deparse_expression_pretty does. 
Will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Whooops. A fixed version attached.
>

The change to pg_stats_ext_exprs isn't quite right, because now it
cross joins expressions and their stats, which leads to too many rows,
with the wrong stats being listed against expressions. For example:

CREATE TABLE foo (a int, b text);
INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
ANALYSE foo;

SELECT tablename, statistics_name, expr, most_common_vals
  FROM pg_stats_ext_exprs;

 tablename | statistics_name |   expr   | most_common_vals
-----------+-----------------+----------+------------------
 foo       | foo_s           | (a * 10) | {10}
 foo       | foo_s           | (a * 10) | {XXX}
 foo       | foo_s           | upper(b) | {10}
 foo       | foo_s           | upper(b) | {XXX}
(4 rows)
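
(For illustration only -- this is not the actual view definition, just the
general bug class: unnesting two parallel arrays without aligning their
positions produces a cross join, whereas WITH ORDINALITY keeps element i
paired with element i.)

SELECT e.x AS expr, s.x AS stats
  FROM unnest(ARRAY['(a * 10)', 'upper(b)']) AS e(x),
       unnest(ARRAY['{10}', '{XXX}']) AS s(x);
-- 4 rows, the wrong pairing

SELECT e.x AS expr, s.x AS stats
  FROM unnest(ARRAY['(a * 10)', 'upper(b)']) WITH ORDINALITY AS e(x, i)
  JOIN unnest(ARRAY['{10}', '{XXX}']) WITH ORDINALITY AS s(x, i) USING (i);
-- 2 rows, each expression paired with its own stats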


More protection is still required for tables with no analysable
columns. For example:

CREATE TABLE foo();
CREATE STATISTICS foo_s ON (1) FROM foo;
INSERT INTO foo SELECT FROM generate_series(1,1000);
ANALYSE foo;

Program received signal SIGSEGV, Segmentation fault.
0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
    exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
664            stats[i]->tupDesc = vacatts[0]->tupDesc;

#0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
    attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
    at extended_stats.c:664
#1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
    totalrows=1000, numrows=100, rows=0x216d040, natts=0,
    vacattrstats=0x216cb40) at extended_stats.c:161
#2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
    params=0x7ffc06f7d450, va_cols=0x0,
    acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
    in_outer_xact=false, elevel=13) at analyze.c:595


Attached is an incremental update fixing those issues, together with a
few more suggested improvements:

There was quite a bit of code duplication in extended_stats.c which I
attempted to reduce by

1). Deleting examine_opclause_expression() in favour of examine_clause_args().
2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
3). Merging examine_clause_args() and examine_clause_args2(), renaming
it examine_opclause_args() (which was actually the name it had in its
original doc comment, despite the name in the code being different).
4). Merging statext_extract_expression() and
statext_extract_expression_internal() into
statext_is_compatible_clause() and
statext_is_compatible_clause_internal() respectively.

That last change goes beyond just removing code duplication. It allows
support for compound clauses that contain a mix of attribute and
expression clauses, for example, this simple test case wasn't
previously estimated well:

CREATE TABLE foo (a int, b int, c int);
INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);

I didn't add any new regression tests, but perhaps it would be worth
adding something to test a case like that.

I changed choose_best_statistics() in a couple of ways. Firstly, I
think it wants to only count expressions from fully covered clauses,
just as we only count attributes if the stat covers all the attributes
from a clause, since otherwise the stat cannot estimate the clause, so
it shouldn't count. Secondly, I think the number of expressions in the
stat needs to be added to its number of keys, so that the choice of
narrowest stat with the same number of matches counts expressions in
the same way as attributes.
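
A rough sketch of the rule I have in mind (hypothetical helper names and
fields, not the actual choose_best_statistics() code):

    StatisticExtInfo *best = NULL;
    int         best_matches = 0;
    int         best_numkeys = INT_MAX;
    ListCell   *lc;

    foreach(lc, stats_list)
    {
        StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(lc);
        int         numkeys = bms_num_members(stat->keys) + list_length(stat->exprs);
        int         nmatches = 0;
        ListCell   *lc2;

        foreach(lc2, clause_infos)
        {
            ClauseInfo *ci = (ClauseInfo *) lfirst(lc2);    /* hypothetical struct */

            /* a clause only counts if the stat covers all of its attnums and exprs */
            if (bms_is_subset(ci->attnums, stat->keys) &&
                stat_covers_exprs(stat, ci->exprs))         /* hypothetical helper */
                nmatches += bms_num_members(ci->attnums) + list_length(ci->exprs);
        }

        /* most covered items wins; ties go to the narrowest stat */
        if (nmatches > best_matches ||
            (nmatches == best_matches && numkeys < best_numkeys))
        {
            best = stat;
            best_matches = nmatches;
            best_numkeys = numkeys;
        }
    }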

I simplified the code in statext_mcv_clauselist_selectivity(), by
attempting to handle expressions and attributes together in the same
way, making it much closer to the original code. I don't think that
the check for the existence of a stat covering all the expressions in
a clause was necessary when pre-processing the list of clauses, since
that's checked later on, so it's enough to just detect compatible
clauses. Also, it now checks for stats that cover both the attributes
and the expressions from each clause, rather than one or the other, to
cope with examples like the one above. I also updated the check for
simple_clauses -- what's wanted there is to identify clauses that only
reference a single column or a single expression, so that the later
code doesn't apply multi-column estimates to it.

I'm attaching it as an incremental patch (0004) on top of your patches,
but if 0003 and 0004 are collapsed together, the total number of diffs
is less than 0003 alone.

Regards,
Dean

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 1/27/21 12:02 PM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Whooops. A fixed version attached.
>>
> 
> The change to pg_stats_ext_exprs isn't quite right, because now it
> cross joins expressions and their stats, which leads to too many rows,
> with the wrong stats being listed against expressions. For example:
> 
> CREATE TABLE foo (a int, b text);
> INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
> CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
> ANALYSE foo;
> 
> SELECT tablename, statistics_name, expr, most_common_vals
>   FROM pg_stats_ext_exprs;
> 
>  tablename | statistics_name |   expr   | most_common_vals
> -----------+-----------------+----------+------------------
>  foo       | foo_s           | (a * 10) | {10}
>  foo       | foo_s           | (a * 10) | {XXX}
>  foo       | foo_s           | upper(b) | {10}
>  foo       | foo_s           | upper(b) | {XXX}
> (4 rows)
> 
> 
> More protection is still required for tables with no analysable
> columns. For example:
> 
> CREATE TABLE foo();
> CREATE STATISTICS foo_s ON (1) FROM foo;
> INSERT INTO foo SELECT FROM generate_series(1,1000);
> ANALYSE foo;
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
>     exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
> 664            stats[i]->tupDesc = vacatts[0]->tupDesc;
> 
> #0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
>     attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
>     at extended_stats.c:664
> #1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
>     totalrows=1000, numrows=100, rows=0x216d040, natts=0,
>     vacattrstats=0x216cb40) at extended_stats.c:161
> #2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
>     params=0x7ffc06f7d450, va_cols=0x0,
>     acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
>     in_outer_xact=false, elevel=13) at analyze.c:595
> 
> 
> Attached is an incremental update fixing those issues, together with a
> few more suggested improvements:
> 
> There was quite a bit of code duplication in extended_stats.c which I
> attempted to reduce by
> 
> 1). Deleting examine_opclause_expression() in favour of examine_clause_args().
> 2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
> 3). Merging examine_clause_args() and examine_clause_args2(), renaming
> it examine_opclause_args() (which was actually the name it had in its
> original doc comment, despite the name in the code being different).
> 4). Merging statext_extract_expression() and
> statext_extract_expression_internal() into
> statext_is_compatible_clause() and
> statext_is_compatible_clause_internal() respectively.
> 
> That last change goes beyond just removing code duplication. It allows
> support for compound clauses that contain a mix of attribute and
> expression clauses, for example, this simple test case wasn't
> previously estimated well:
> 
> CREATE TABLE foo (a int, b int, c int);
> INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
> CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);
> 
> I didn't add any new regression tests, but perhaps it would be worth
> adding something to test a case like that.
> 
> I changed choose_best_statistics() in a couple of ways. Firstly, I
> think it wants to only count expressions from fully covered clauses,
> just as we only count attributes if the stat covers all the attributes
> from a clause, since otherwise the stat cannot estimate the clause, so
> it shouldn't count. Secondly, I think the number of expressions in the
> stat needs to be added to it's number of keys, so that the choice of
> narrowest stat with the same number of matches counts expressions in
> the same way as attributes.
> 
> I simplified the code in statext_mcv_clauselist_selectivity(), by
> attempting to handle expressions and attributes together in the same
> way, making it much closer to the original code. I don't think that
> the check for the existence of a stat covering all the expressions in
> a clause was necessary when pre-processing the list of clauses, since
> that's checked later on, so it's enough to just detect compatible
> clauses. Also, it now checks for stats that cover both the attributes
> and the expressions from each clause, rather than one or the other, to
> cope with examples like the one above. I also updated the check for
> simple_clauses -- what's wanted there is to identify clauses that only
> reference a single column or a single expression, so that the later
> code doesn't apply multi-column estimates to it.
> 
> I'm attaching it as a incremental patch (0004) on top of your patches,
> but if 0003 and 0004 are collapsed together, the total number of diffs
> is less than 0003 alone.
> 

Thanks. All of this seems like a clear improvement, both removing the
duplicate copy-pasted code, and fixing the handling of the cases that
mix plain variables and expressions. FWIW I agree there should be a
regression test for this, so I'll add one.

I think the main remaining issue is how we handle the expressions in
bitmapsets. I've been experimenting with this a bit, but shifting the
regular attnums and stashing expressions before them seems quite
complex, especially when we don't know how many expressions there are
(e.g. when merging functional dependencies). It's true using attnums
above MaxHeapAttributeNumber for expressions wastes ~200B, but is that
really an issue, considering it's a very short-lived allocation?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

Attached is a rebased patch series, merging the changes from the last
review into the 0003 patch, and with a WIP patch 0004 reworking the
tracking of expressions (to address the inefficiency due to relying on
MaxHeapAttributeNumber).

The 0004 patch is very much experimental, with a lot of ad hoc changes.
It passes make check, but it definitely needs much more work, cleanup
and testing. At this point it's more a demonstration of what would be
needed to rework it like this.

The main change is that instead of handling expressions by assigning
them attnums above MaxHeapAttributeNumber, we assign them system-like
attnums, i.e. negative ones. So the first one gets -1, the second one
-2, etc. And then we shift all attnums above 0, to allow using the
bitmapset as before.
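
To make the mapping concrete, a minimal sketch (the real code is spread
across the patch, this is just the idea):

    /*
     * Illustrative only (not the exact patch code): with expressions
     * numbered -1, -2, ..., -nexprs and regular columns keeping their
     * attnums, everything fits into one bitmapset by offsetting with the
     * number of expressions in this particular statistics object.
     * Expression -1 maps to (nexprs - 1), expression -nexprs maps to 0,
     * and column k maps to (nexprs + k).
     */
    static inline int
    stat_member_index(AttrNumber attnum, int nexprs)
    {
        return (int) attnum + nexprs;
    }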

Overall, this works, but the shifting is kinda pointless - it allows us
to build a bitmapset, but it's mostly useless because the encoding depends
on how many expressions are in the statistics definition. So we can't compare
or combine bitmapsets for different statistics, and (more importantly)
we can't easily compare bitmapset on attnums from clauses.

Using MaxHeapAttributeNumber allowed using the bitmapsets at least for
regular attributes. Not sure if that's a major advantage, outweighing
wasting some space.

I wonder if we should just ditch the bitmapsets, and just use simple
arrays of attnums. I don't think we expect too many elements here,
especially when dealing with individual statistics. So now we're just
building and rebuilding the bitmapsets ... seems pointless.


One thing I'd like to improve (independently of what we do with the
bitmapsets) is getting rid of the distinction between attributes and
expressions when building the statistics - currently all the various
places have to care about whether the item is an attribute or an expression, 
and look either into the tuple or the array of pre-calculated values, do 
various shifts to get the indexes, etc. That's quite tedious, and I've
made a lot of errors in that (and I'm sure there are more). So IMO we
should simplify this by using a single structure that holds values 
for both attributes and expressions, handling them in a unified way.
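
Roughly something like this, perhaps (just a sketch of the idea, not an
actual definition from the patch):

    typedef struct StatsBuildData
    {
        int         numrows;        /* number of sampled rows */
        HeapTuple  *rows;           /* the sample rows */
        int         nattnums;       /* plain columns + expressions */
        AttrNumber *attnums;        /* >0 for columns, <0 for expressions */
        VacAttrStats **stats;       /* per-item statistics slots */
        Datum     **values;         /* pre-computed values[item][row] */
        bool      **nulls;          /* pre-computed nulls[item][row] */
    } StatsBuildData;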


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

On 3/17/21 4:55 PM, Dean Rasheed wrote:
> On Sun, 7 Mar 2021 at 21:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> 2) ndistinct
>>
>> There's one thing that's bugging me, in how we handle "partial" matches.
>> For each expression we track both the original expression and the Vars
>> we extract from it. If we can't find a statistics matching the whole
>> expression, we try to match those individual Vars, and we remove the
>> matching ones from the list. And in the end we multiply the estimates
>> for the remaining Vars.
>>
>> This works fine with one matching ndistinct statistics. Consider for example
>>
>>      GROUP BY (a+b), (c+d)
>>
>> with statistics on [(a+b),c] - that is, expression and one column.
> 
> I've just been going over this example, and I think it actually works
> slightly differently from how you described, though I haven't worked
> out the full general implications of that.
> 
>> We parse the expressions into two GroupExprInfo
>>
>>      {expr: (a+b), vars: [a, b]}
>>      {expr: (c+d), vars: [c, d]}
>>
> 
> Here, I think what you actually get, in the presence of stats on
> [(a+b),c] is actually the following two GroupExprInfos:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: [c, d]}
> 

Yeah, right. To be precise, we get

    {expr: (a+b), vars: [(a+b)]}

because in the first case we pass NIL, so add_unique_group_expr treats
the whole expression as a var (a bit strange, but OK).

> because of the code immediately after this comment in estimate_num_groups():
> 
>         /*
>          * If examine_variable is able to deduce anything about the GROUP BY
>          * expression, treat it as a single variable even if it's really more
>          * complicated.
>          */
> 
> As it happens, that makes no difference in this case, where there is
> just this one stats object, but it does change things when there are
> two stats objects.
> 
>> and the statistics matches the first item exactly (the expression). The
>> second expression is not in the statistics, but we match "c". So we end
>> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>>
>>      {expr: (c+d), vars: [d]}
> 
> Right.
> 
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>>      ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
> 
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
> 
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>>      ndistinct((a+b), c) * ndistinct((c+d))
> 
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: []}
> 
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
> 
>      ndistinct((a+b)) * ndistinct((c+d))
> 
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
> 

Yeah, that works pretty well in this case.

I wonder if we'd be better off extracting the Vars and doing what I
mistakenly described as the current behavior. That would essentially mean 
extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression
stats over multi-column stats to some extent) works fine as long as the
parts are independent, but once there's dependency it's probably more
likely to produce underestimates. I think underestimates for grouping
estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> 
> Not using the (c+d) stats would give either
> 
>      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> or
> 
>      ndistinct((a+b), c) * ndistinct(d)
> 
> depending on exactly how the algorithm was changed. In my test, these
> both gave worse estimates, but there are probably other datasets for
> which they might do better. It all depends on how much correlation
> there is between (a+b) and c.
> 
> I suspect that there is no optimal strategy for handling overlapping
> stats that works best for all datasets, but the current algorithm
> seems to do a pretty decent job.
> 
>> Of course, maybe there's a better way to pick the statistics, but I
>> think our conclusion so far was that people should just create
>> statistics covering all the columns in the query, to not have to match
>> multiple statistics like this.
> 
> Yes, I think that's always likely to work better, especially for
> ndistinct stats, where all possible permutations of subsets of the
> columns are included, so a single ndistinct stat can work well for a
> range of different queries.
> 

Yeah, I agree that's a reasonable mitigation. Ultimately, there's no
perfect algorithm for how to pick and combine stats when we don't know if
there even is a statistical dependency between the subsets of columns.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> My concern is that the current behavior (where we prefer expression
> stats over multi-column stats to some extent) works fine as long as the
> parts are independent, but once there's dependency it's probably more
> likely to produce underestimates. I think underestimates for grouping
> estimates were a risk in the past, so let's not make that worse.
>

I'm not sure the current behaviour really is preferring expression
stats over multi-column stats. In this example, where we're grouping
by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
those multi-column stats actually match more than one
column/expression. If anything, I'd go the other way and say that it
was wrong to use the [(a+b),c] stats in the first case, where they
were the only stats available, since those stats aren't really
applicable to (c+d), which probably ought to be treated as
independent. IOW, it might have been better to estimate the first case
as

     ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

and the second case as

     ndistinct((a+b)) * ndistinct((c+d))

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/17/21 7:54 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> My concern is that the current behavior (where we prefer expression
>> stats over multi-column stats to some extent) works fine as long as the
>> parts are independent, but once there's dependency it's probably more
>> likely to produce underestimates. I think underestimates for grouping
>> estimates were a risk in the past, so let's not make that worse.
>>
> 
> I'm not sure the current behaviour really is preferring expression
> stats over multi-column stats. In this example, where we're grouping
> by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
> those multi-column stats actually match more than one
> column/expression. If anything, I'd go the other way and say that it
> was wrong to use the [(a+b),c] stats in the first case, where they
> were the only stats available, since those stats aren't really
> applicable to (c+d), which probably ought to be treated as
> independent. IOW, it might have been better to estimate the first case
> as
> 
>      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> and the second case as
> 
>      ndistinct((a+b)) * ndistinct((c+d))
> 

OK. I might be confused, but isn't that what the algorithm currently
does? Or am I just confused about what the first/second case refers to?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> >      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> >      ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) *
ndistinct(d). Having said that, maybe that's OK after all. It at least
makes an effort to account for any correlation between (a+b) and
(c+d), using the known correlation between (a+b) and c. For reference,
here is the test case I was using (which isn't really very good for
catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 2964, Actual = 478
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 480, Actual = 478
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete
stats (the estimate with no extended stats is capped at 10000).

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> For reference, here is the test case I was using (which isn't really very good for
> catching dependence between columns):
>

And here's a test case with much more dependence between the columns:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 2
SELECT COUNT(DISTINCT b) FROM foo; -- 5
SELECT COUNT(DISTINCT c) FROM foo; -- 10
SELECT COUNT(DISTINCT d) FROM foo; -- 15
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 150, Actual = 30
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
  -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900
  -- Estimate with no stats = 1500

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 120, Actual = 30
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20

Again, I'd say the current behaviour is pretty good.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> For reference, here is the test case I was using (which isn't really very good for
>> catching dependence between columns):
>>
> 
> And here's a test case with much more dependence between the columns:
> 
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
> 
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 150, Actual = 30
>   -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
>   -- which is much better than ndistinct((a+b)) * ndistinct(c) *
> ndistinct(d) = 6*10*15 = 900
>   -- Estimate with no stats = 1500
> 
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 120, Actual = 30
>   -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
> 
> Again, I'd say the current behaviour is pretty good.
> 

Thanks!

I agree applying at least the [(a+b),c] stats is probably the right
approach, as it means we're considering at least the available
information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping
statistics, a bit like conditional probability. In this case we might do
something like this:

    ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

    228 * 24 / 12 = 456   (actual = 478, current estimate = 480)

or, for the "more correlated" case

    10 * 20 / 10 = 20     (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are
cases where this will produce worse estimates.


Anyway, I plan to go over the patches one more time, and start pushing
them sometime early next week. I don't want to leave it until the very
last moment in the CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 17 Mar 2021 at 21:31, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I agree applying at least the [(a+b),c] stats is probably the right
> approach, as it means we're considering at least the available
> information about dependence between the columns.
>
> I think to improve this, we'll need to teach the code to use overlapping
> statistics, a bit like conditional probability. In this case we might do
> something like this:
>
>     ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Yes, I was thinking the same thing. That would be equivalent to
applying a multiplicative "correction" factor of

  ndistinct(a,b,c,...) / ( ndistinct(a) * ndistinct(b) * ndistinct(c) * ... )

for each multivariate stat applicable to more than one
column/expression, regardless of whether those columns were already
covered by other multivariate stats. That might well simplify the
implementation, as well as probably produce better estimates.
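
(To sanity-check that against the more correlated test case upthread: the
[(a+b),c] stat would give a correction factor of

     ndistinct((a+b),c) / (ndistinct((a+b)) * ndistinct(c)) = 10 / (6 * 10) = 1/6

and applying it to the independent estimate ndistinct((a+b)) * ndistinct((c+d))
= 6 * 20 = 120 gives 20, i.e. the same answer as the conditional-probability
form above.)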

> But that's clearly a matter for a future patch, and I'm sure there are
> cases where this will produce worse estimates.

Agreed.

> Anyway, I plan to go over the patches one more time, and start pushing
> them sometime early next week. I don't want to leave it until the very
> last moment in the CF.

+1. I think they're in good enough shape for that process to start.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi,

I've pushed the first two parts, dealing with composite types during
bootstrap. I've decided to commit both, including the array->list
conversion, as that makes the reloads simpler. I've made two tweaks:

1) I've renamed the function to reload_typ_list, which I think is better
(and it used to be reload_typ_array).

2) I've removed the did_reread assert. It'd allow just a single reload,
which blocks recursive composite types - seems unnecessary, although we
don't need that now. I think we can't have infinite recursion, because
we can only load types from already defined catalogs (so no cycles).


I've rebased and cleaned up the main part of the patch. There's a bunch
of comments slightly reworded / cleaned up, etc. The more significant
changes are:

1) The explanation of the example added to create_statistics.sgml was
somewhat wrong, so I corrected that.

2) I've renamed StatBuildData to StatsBuildData.

3) I've resolved the FIXMEs in examine_expression.

We don't need to do anything special about the collation, because unlike
indexes it's not possible to specify "collate" for the attributes. It's
possible to do that in the expression, but exprCollation handles that.

For statistics target, we simply use the value determined for the
statistics itself. There's no way to specify that for expressions.

4) I've updated the comments about ndistinct estimates in selfuncs.c,
because some of it was a bit wrong/misleading - per the discussion we
had about matching stats to expressions.

5) I've also tweaked the add_unique_group_expr so that we don't have to
run examine_variable() repeatedly if we already have it.

6) Resolved the FIXME about acl_ok in examine_variable. Instead of just
setting it to 'true' it now mimics what we do for indexes. I think it's
correct, but this is probably worth a review.

7) psql now prints expressions only for (version >= 14). I've considered
tweaking the existing block, but that was quite incomprehensible so I
just made a modified copy.

I think this is 99.999% ready to be pushed, so barring objections I'll
do that in a day or two.

The part 0003 is a small tweak I'll consider, preferring exact matches
of expressions over Var matches. It's not a clear win (but what is, in a
greedy algorithm), but it does improve one of the regression tests.
Minor change, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'

It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me.  For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".  

> +       Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table.  Say "Name of the schema which contains the table" ?

> +       Name of table

Say "name of table on which the extended statistics are defined"

> +       Name of extended statistics

"Name of the extended statistic object"

> +       Owner of the extended statistics

..object

> +       Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object".  Maybe "..on which the extended statistic is defined"

> +       of random access to the disk.  (This expression is null if the expression
> +       data type does not have a <literal><</literal> operator.)

expression's data type

> +   much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> +   planner has no information about relationship between the expressions, so it

the relationship

> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> +   conditions are independent, and multiplies their selectivities together to
> +   arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> +   This is further exacerbated by the lack of accurate statistics for the
> +   expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> +   expression derived from ndistinct for the column. With such statistics, the
> +   planner recognizes that the conditions are correlated and arrives at much
> +   more accurate estimates.

are correlated comma

> +            if (type->lt_opr == InvalidOid)

These could be !OidIsValid

> +     * expressions. It's either expensive or very easy to defeat for
> +     * determined used, and there's no risk if we allow such statistics (the
> +     * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> +     * If there are no simply-referenced columns, give the statistics an auto
> +     * dependency on the whole table.  In most cases, this will be redundant,
> +     * but it might not be if the statistics expressions contain no Vars
> +     * (which might seem strange but possible).
> +     */
> +    if (!nattnums)
> +    {
> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> +    }

Can this be unconditional ?

> +     * Translate the array of indexs to regular attnums for the dependency (we

sp: indexes

> +                     * Not found a matching expression, so we can simply skip

Found no matching expr

> +                /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ?  So it's easy to add a
breakpoint there, for example.

> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
> +                                                 * something else? */

> +        bool        nulls[Natts_pg_statistic];
...
> +         * Construct a new pg_statistic tuple
> +         */
> +        for (i = 0; i < Natts_pg_statistic; ++i)
> +        {
> +            nulls[i] = false;
> +        }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> +                 * We don't store collations used to build the statistics, but
> +                 * we can use the collation for the attribute itself, as
> +                 * stored in varcollid. We do reset the statistics after a
> +                 * type change (including collation change), so this is OK. We
> +                 * may need to relax this after allowing extended statistics
> +                 * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>          }
>  
>          /* print any extended statistics */
> -        if (pset.sversion >= 100000)
> +        if (pset.sversion >= 140000)
> +        {
> +            printfPQExpBuffer(&buf,
> +                              "SELECT oid, "
> +                              "stxrelid::pg_catalog.regclass, "
> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
> +                              "stxname,\n"
> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
> +
> +            if (pset.sversion >= 130000)
> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
> +            else
> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");

 >= 130000 is fully determined by >= 140000 :)

> +     * type of the opclass, which is not interesting for our purposes.  (Note:
> +     * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are
still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
> 
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates.  Create it in the per-index context to be
> Fetch function for analyzing index expressions.



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
I got this crash running sqlsmith:

#1  0x00007f907574b801 in __GI_abort () at abort.c:79
#2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db
    "bms_num_members(varnos) == 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion",
    fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
#3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100,
    root=0x5646ba9a0cb0) at selfuncs.c:3332
#4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200)
    at selfuncs.c:3307
#5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
#6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
#7  grouping_planner () at planner.c:2238
#8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80,
    parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
    at planner.c:1024
#9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256,
    boundParams=<optimized out>) at planner.c:404
#10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
    query_string=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
 
#11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
    query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at
postgres.c:912
#12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104

2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct 
          
            pg_catalog.variance(
              cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by
subq_0.c0)as c0, 
 
          subq_0.c2 as c1, 
          subq_0.c0 as c2, 
          subq_0.c2 as c3, 
          subq_0.c1 as c4, 
          subq_0.c1 as c5, 
          subq_0.c0 as c6
        from 
          (select  
                ref_1.foreign_server_catalog as c0, 
                ref_1.authorization_identifier as c1, 
                sample_2.tgname as c2, 
                ref_1.foreign_server_catalog as c3
              from 
                pg_catalog.pg_stat_database_conflicts as ref_0
                        left join information_schema._pg_user_mappings as ref_1
                        on (ref_0.datname < ref_0.datname)
                      inner join pg_catalog.pg_amproc as sample_0 tablesample system (5) 
                      on (cast(null as uuid) < cast(null as uuid))
                    left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9) 
                    on (sample_0.amprocnum = sample_1.aggnumdirectargs )
                  inner join pg_catalog.pg_trigger as sample_2 tablesampl



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Hi Justin,

Unfortunately the query is incomplete, so I can't quite determine what
went wrong. Can you extract the full query causing the crash, either
from the server log or from a core file?

thanks

On 3/24/21 9:14 AM, Justin Pryzby wrote:
> I got this crash running sqlsmith:
> 
> #1  0x00007f907574b801 in __GI_abort () at abort.c:79
> #2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db
"bms_num_members(varnos)== 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", 
 
>     fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
> #3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100,
root=0x5646ba9a0cb0)at selfuncs.c:3332
 
> #4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200)
atselfuncs.c:3307
 
> #5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
> #6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
> #7  grouping_planner () at planner.c:2238
> #8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80,
parent_root=parent_root@entry=0x0,hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
 
>     at planner.c:1024
> #9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256,
boundParams=<optimizedout>) at planner.c:404
 
> #10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
>     query_string=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
 
> #11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
>     query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n    pg_catalog.variance(\n
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints()as int8)) over (partition by subq_0.c2 order by subq_0.c0) as
c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at
postgres.c:912
> #12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104
> 
> 2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
> 2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct 
>           
>             pg_catalog.variance(
>               cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order
bysubq_0.c0) as c0, 
 
>           subq_0.c2 as c1, 
>           subq_0.c0 as c2, 
>           subq_0.c2 as c3, 
>           subq_0.c1 as c4, 
>           subq_0.c1 as c5, 
>           subq_0.c0 as c6
>         from 
>           (select  
>                 ref_1.foreign_server_catalog as c0, 
>                 ref_1.authorization_identifier as c1, 
>                 sample_2.tgname as c2, 
>                 ref_1.foreign_server_catalog as c3
>               from 
>                 pg_catalog.pg_stat_database_conflicts as ref_0
>                         left join information_schema._pg_user_mappings as ref_1
>                         on (ref_0.datname < ref_0.datname)
>                       inner join pg_catalog.pg_amproc as sample_0 tablesample system (5) 
>                       on (cast(null as uuid) < cast(null as uuid))
>                     left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9) 
>                     on (sample_0.amprocnum = sample_1.aggnumdirectargs )
>                   inner join pg_catalog.pg_trigger as sample_2 tablesampl
> 

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Mar 24, 2021 at 09:54:22AM +0100, Tomas Vondra wrote:
> Hi Justin,
> 
> Unfortunately the query is incomplete, so I can't quite determine what
> went wrong. Can you extract the full query causing the crash, either
> from the server log or from a core file?

Oh, shoot, I didn't realize it was truncated, and I already destroyed the core
and moved on to something else...

But this fails well enough, and may be much shorter than the original :)

select distinct
     pg_catalog.variance(
       cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,
   subq_0.c2 as c1, subq_0.c0 as c2, subq_0.c2 as c3, subq_0.c1 as c4, subq_0.c1 as c5, subq_0.c0 as c6
 from
   (select
         ref_1.foreign_server_catalog as c0,
         ref_1.authorization_identifier as c1,
         sample_2.tgname as c2,
         ref_1.foreign_server_catalog as c3
       from
         pg_catalog.pg_stat_database_conflicts as ref_0
                 left join information_schema._pg_user_mappings as ref_1
                 on (ref_0.datname < ref_0.datname)
               inner join pg_catalog.pg_amproc as sample_0 tablesample system (5)
               on (cast(null as uuid) < cast(null as uuid))
             left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9)
             on (sample_0.amprocnum = sample_1.aggnumdirectargs )
           inner join pg_catalog.pg_trigger as sample_2 tablesample system (1) on true )subq_0;

TRAP: FailedAssertion("bms_num_members(varnos) == 1", File: "selfuncs.c", Line: 3332, PID: 16422)

Also ... with this patch CREATE STATISTICS is no longer rejecting multiple
tables, and instead does this:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
Thanks, it seems to be some thinko in the handling of PlaceHolderVars, which
seems to break the code's assumptions about varnos. This fixes it for me,
but I need to look at it more closely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
> seem to break the code's assumptions about varnos. This fixes it for me,
> but I need to look at it more closely.
>

I think that makes sense.

Reviewing the docs, I noticed a couple of omissions, and had a few
other suggestions (attached).

Regards,
Dean

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 2:36 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
>> seem to break the code's assumptions about varnos. This fixes it for me,
>> but I need to look at it more closely.
>>
> 
> I think that makes sense.
> 

AFAIK the primary issue here is that the two places disagree. While
estimate_num_groups does this

    varnos = pull_varnos(root, (Node *) varshere);
    if (bms_membership(varnos) == BMS_SINGLETON)
    { ... }

the add_unique_group_expr does this

    varnos = pull_varnos(root, (Node *) groupexpr);

That is, one looks at the group expression, while the other looks at the Vars
extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
this can differ, causing the crash.

So we need to change one of those places - my fix tweaked the second
place to also look at the vars, but maybe we should change the other
place? Or maybe it's not the right fix for PlaceHolderVars ...
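
For the record, the tweak amounts to roughly this (a sketch; the exact
flag set may differ from the actual patch):

    /* in add_unique_group_expr(), mirror what estimate_num_groups() does */
    varshere = pull_var_clause(groupexpr,
                               PVC_RECURSE_AGGREGATES |
                               PVC_RECURSE_WINDOWFUNCS |
                               PVC_RECURSE_PLACEHOLDERS);
    varnos = pull_varnos(root, (Node *) varshere);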

> Reviewing the docs, I noticed a couple of omissions, and had a few
> other suggestions (attached).
> 

Thanks! I'll include that in the next version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> AFAIK the primary issue here is that the two places disagree. While
> estimate_num_groups does this
>
>     varnos = pull_varnos(root, (Node *) varshere);
>     if (bms_membership(varnos) == BMS_SINGLETON)
>     { ... }
>
> the add_unique_group_expr does this
>
>     varnos = pull_varnos(root, (Node *) groupexpr);
>
> That is, one looks at the group expression, while the other look at vars
> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
> this can differ, causing the crash.
>
> So we need to change one of those places - my fix tweaked the second
> place to also look at the vars, but maybe we should change the other
> place? Or maybe it's not the right fix for PlaceHolderVars ...
>

I think that it doesn't make any difference which place is changed.

This is a case of an expression with no stats. With your change,
you'll get a single GroupExprInfo containing a list of
VariableStatData's for each of its Vars, whereas if you changed it
the other way, you'd get a separate GroupExprInfo for each Var. But I
think they'd both end up being treated the same by
estimate_multivariate_ndistinct(), since there wouldn't be any stats
matching the expression, only the individual Vars. Maybe changing the
first place would be the more bulletproof fix though.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
> 

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +       Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like a
> "schema-containing" table.  Say "Name of the schema which contains the table" ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +       Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +       Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +       Owner of the extended statistics
> 
> ..object
> 

OK

>> +       Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +       of random access to the disk.  (This expression is null if the expression
>> +       data type does not have a <literal><</literal> operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, so it
> 
> the relationship
> 

OK

>> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> +   conditions are independent, and multiplies their selectivities together to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +            if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> +     * expressions. It's either expensive or very easy to defeat for
>> +     * determined used, and there's no risk if we allow such statistics (the
>> +     * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> +     * If there are no simply-referenced columns, give the statistics an auto
>> +     * dependency on the whole table.  In most cases, this will be redundant,
>> +     * but it might not be if the statistics expressions contain no Vars
>> +     * (which might seem strange but possible).
>> +     */
>> +    if (!nattnums)
>> +    {
>> +        ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +        recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> +    }
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior is copied from index_create, so
I'd prefer keeping it the same for consistency reasons. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> +     * Translate the array of indexs to regular attnums for the dependency (we
> 
> sp: indexes
> 

OK

>> +                     * Not found a matching expression, so we can simply skip
> 
> Found no matching expr
> 

OK

>> +                /* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using a line number, for example.

>> +    stats->anl_context = CurrentMemoryContext;    /* XXX should be using
>> +                                                 * something else? */
> 
>> +        bool        nulls[Natts_pg_statistic];
> ...
>> +         * Construct a new pg_statistic tuple
>> +         */
>> +        for (i = 0; i < Natts_pg_statistic; ++i)
>> +        {
>> +            nulls[i] = false;
>> +        }
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what update_attstats() does, so I prefer
keeping it the same.

>> +                 * We don't store collations used to build the statistics, but
>> +                 * we can use the collation for the attribute itself, as
>> +                 * stored in varcollid. We do reset the statistics after a
>> +                 * type change (including collation change), so this is OK. We
>> +                 * may need to relax this after allowing extended statistics
>> +                 * on expressions.
> 
> This text should be updated or removed ?
> 

Yeah, the last sentence is obsolete. Updated.

>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>>          }
>>  
>>          /* print any extended statistics */
>> -        if (pset.sversion >= 100000)
>> +        if (pset.sversion >= 140000)
>> +        {
>> +            printfPQExpBuffer(&buf,
>> +                              "SELECT oid, "
>> +                              "stxrelid::pg_catalog.regclass, "
>> +                              "stxnamespace::pg_catalog.regnamespace AS nsp, "
>> +                              "stxname,\n"
>> +                              "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
>> +                              "  'd' = any(stxkind) AS ndist_enabled,\n"
>> +                              "  'f' = any(stxkind) AS deps_enabled,\n"
>> +                              "  'm' = any(stxkind) AS mcv_enabled,\n");
>> +
>> +            if (pset.sversion >= 130000)
>> +                appendPQExpBufferStr(&buf, "  stxstattarget\n");
>> +            else
>> +                appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");
> 
>  >= 130000 is fully determined by >= 14000 :)
> 

Ah, right.

>> +     * type of the opclass, which is not interesting for our purposes.  (Note:
>> +     * if we did anything with non-expression index columns, we'd need to
> 
> index is wrong ?
> 

Fixed

> I mentioned a bunch of other references to "index" and "predicate" which are
> still around:
> 

Whooops, sorry. Fixed.


I'll post a cleaned-up version of the patch addressing Dean's review
comments too.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":
|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:
|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":
doc/src/sgml/perform.sgml:     Another type of statistics stored for each column are most-common value
doc/src/sgml/ref/psql-ref.sgml:        The status of each kind of extended statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:
src/backend/commands/statscmds.c:    errmsg("statistics creation on system columns is not supported")));
src/backend/commands/statscmds.c:    errmsg("cannot have more than %d columns in statistics",
src/backend/commands/statscmds.c:        * If we got here and the OID is not valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given the list of column names for it
src/backend/statistics/extended_stats.c:                /* compute statistics target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and there are no rows returned if

should say "for a statistics object" or "for statistics objects"
src/backend/statistics/extended_stats.c: * target for a statistics objects (from the object target, attribute targets

Your patch adds these:

Should say "object":
+        * Check if we actually have a matching statistics for the expression.
+               /* evaluate expressions (if the statistics has any) */
+        * for the extended statistics. The second option seems more reasonable.
+                * the statistics had all options enabled on the original version.
+                * But if the statistics is defined on just a single column, it has to
+       /* has the statistics expressions? */
+                       /* expression - see if it's in the statistics */
+                                        * column(s) the statistics depends on.  Also require all
+        * statistics is defined on more than one column/expression).
+        * statistics is useless, but harmless).
+        * If there are no simply-referenced columns, give the statistics an auto

+                        * Then the first statistics matches no expressions and 3 vars,
+                        * while the second statistics matches one expression and 1 var.
+                        * Currently the first statistics wins, which seems silly.

+                        * [(a+c), d]. But maybe it's better than failing to match the
+                        * second statistics?

I can make patches for these (separate patches for HEAD and your patch), but I
don't think your patch has to wait on it, since the user-facing documentation
is consistent with what's already there, and the rest are internal comments.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/24/21 5:28 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> AFAIK the primary issue here is that the two places disagree. While
>> estimate_num_groups does this
>>
>>     varnos = pull_varnos(root, (Node *) varshere);
>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>     { ... }
>>
>> the add_unique_group_expr does this
>>
>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>
>> That is, one looks at the group expression, while the other looks at vars
>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>> this can differ, causing the crash.
>>
>> So we need to change one of those places - my fix tweaked the second
>> place to also look at the vars, but maybe we should change the other
>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>
> 
> I think that it doesn't make any difference which place is changed.
> 
> This is a case of an expression with no stats. With your change,
> you'll get a single GroupExprInfo containing a list of
> VariableStatData's for each of its Vars, whereas if you changed it
> the other way, you'd get a separate GroupExprInfo for each Var. But I
> think they'd both end up being treated the same by
> estimate_multivariate_ndistinct(), since there wouldn't be any stats
> matching the expression, only the individual Vars. Maybe changing the
> first place would be the more bulletproof fix though.
> 

Yeah, I think that's true. I'll do a bit more research / experiments.

As for the changes proposed in the create_statistics, do we really want
to use univariate / multivariate there? Yes, the terms are correct, but
I'm not sure how many people looking at CREATE STATISTICS will
understand them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> As for the changes proposed in the create_statistics, do we really want
> to use univariate / multivariate there? Yes, the terms are correct, but
> I'm not sure how many people looking at CREATE STATISTICS will
> understand them.
>

Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
when used to describe statistics. You could use "single-column" and
"multi-column", but then "column" isn't really right anymore, since it
might be a column or an expression. I can't think of any other terms
that fit.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/24/21 5:48 PM, Tomas Vondra wrote:
> 
> 
> On 3/24/21 5:28 PM, Dean Rasheed wrote:
>> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> AFAIK the primary issue here is that the two places disagree. While
>>> estimate_num_groups does this
>>>
>>>     varnos = pull_varnos(root, (Node *) varshere);
>>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>>     { ... }
>>>
>>> the add_unique_group_expr does this
>>>
>>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>>
>>> That is, one looks at the group expression, while the other looks at vars
>>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>>> this can differ, causing the crash.
>>>
>>> So we need to change one of those places - my fix tweaked the second
>>> place to also look at the vars, but maybe we should change the other
>>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>>
>>
>> I think that it doesn't make any difference which place is changed.
>>
>> This is a case of an expression with no stats. With your change,
>> you'll get a single GroupExprInfo containing a list of
>> VariableStatData's for each of its Vars, whereas if you changed it
>> the other way, you'd get a separate GroupExprInfo for each Var. But I
>> think they'd both end up being treated the same by
>> estimate_multivariate_ndistinct(), since there wouldn't be any stats
>> matching the expression, only the individual Vars. Maybe changing the
>> first place would be the more bulletproof fix though.
>>
> 
> Yeah, I think that's true. I'll do a bit more research / experiments.
> 

Actually, I think we don't need that block at all - there's no point in
keeping the exact expression, because if there were a statistics object
matching it, it would already have been matched by examine_variable. So
if we get here, we have to just split it into the vars anyway, and the
second block is entirely useless.

That however means we don't need the processing with GroupExprInfo and
GroupVarInfo lists, i.e. we can revert back to the original simpler
processing, with a bit of extra logic to match expressions, that's all.

The patch 0003 does this (it's a bit crude, but hopefully enough to
demonstrate).

Here's an updated patch. 0001 should address most of today's review
items regarding comments etc.

0002 is an attempt to fix an issue I noticed today - we need to handle
type changes. Until now we did not have problems with that, because we
only had attnums - so we just reset the statistics (with the exception
of functional dependencies, on the assumption that those remain valid).

With expressions it's a bit more complicated, though.

1) we need to transform the expressions so that the Vars contain the
right type info etc. Otherwise an analyze with the old pg_node_tree crashes

2) we need to reset the pg_statistic[] data too, which however makes
keeping the functional dependencies a bit less useful, because those
rely on the expression stats :-(

So I'm wondering what to do about this. I looked into how ALTER TABLE
handles indexes, and 0003 is a PoC to do the same thing for statistics.
Of course, this is a bit unfortunate because it recreates the statistics
(so we don't keep anything, not even functional dependencies).

I think we have two options:

a) Make UpdateStatisticsForTypeChange smarter to also transform and
update the expression string, and reset pg_statistics[] data.

b) Just recreate the statistics, just like we do for indexes. Currently
this does not force analyze, so it just resets all the stats. Maybe it
should do analyze, though.

Any opinions? I need to think about this a bit more, but maybe (b) with
the analyze is the right thing to do. Keeping just some of the stats
always seemed a bit weird. (This is why the 0002 patch breaks one of the
regression tests.)

BTW I wonder how useful the updated statistics actually is. Consider
this example:

========================================================================
CREATE TABLE t (a int, b int, c int);

INSERT INTO t SELECT mod(i,10), mod(i,10), mod(i,10)
  FROM generate_series(1,1000000) s(i);

CREATE STATISTICS s (ndistinct) ON (a+b), (b+c) FROM t;

ANALYZE t;

EXPLAIN ANALYZE SELECT 1 FROM t GROUP BY (a+b), (b+c);

test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), ((b + c)) FROM t

test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
   Group Key: (a + b), (b + c)
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
(3 rows)
========================================================================

Great. Now let's change one of the data types to something else:

========================================================================
test=# alter table t alter column c type numeric;
ALTER TABLE
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | numeric |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), (((b)::numeric + c)) FROM t

test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27906.17 rows=10 width=40)
   Group Key: (a + b), ((b)::numeric + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=36)
(3 rows)
========================================================================

Great! Let's change it again:

========================================================================
test=# alter table t alter column c type double precision;
ALTER TABLE
test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27923.50 rows=1000 width=16)
   Group Key: (a + b), ((b)::double precision + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=12)
(3 rows)
========================================================================

Well, not that great, apparently. We clearly failed to match the second
expression, so we ended up with (b+c) estimated as (10 * 10). Why? Because
the expression now looks like this:

========================================================================
"public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
precision + c)) FROM t
========================================================================

But we're matching it to (((b)::double precision + c)), so that fails.

This is not specific to extended statistics - indexes have exactly the
same issue. Not sure how common this is in practice.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
> Here's an updated patch. 0001 should address most of today's review
> items regarding comments etc.

This is still an issue:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 1:30 AM, Justin Pryzby wrote:
> On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
>> here's an updated patch. 0001 should address most of the today's review
>> items regarding comments etc.
> 
> This is still an issue:
> 
> postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
> ERROR:  schema "i" does not exist
> 

Ah, right. That's a weird issue. I was really confused about this,
because nothing changes about the grammar or how we check the number of
relations. The problem is pretty trivial - the new code in utility.c
just grabs the first element and casts it to RangeVar, without checking
that it actually is a RangeVar. With joins it's a JoinExpr, so we get a
bogus error.

The attached version fixes it by simply doing the check in utility.c.
It's a bit redundant with what's in CreateStatistics() but I don't think
we can just postpone it easily - we need to do the transformation here,
with access to queryString. But maybe we don't need to pass the relid,
when we have the list of relations in CreateStatsStmt itself ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 1:05 AM, Tomas Vondra wrote:
> ...
>
> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes. Until now we did not have problems with that, because we
> only had attnums - so we just reset the statistics (with the exception
> of functional dependencies, on the assumption that those remain valid).
> 
> With expressions it's a bit more complicated, though.
> 
> 1) we need to transform the expressions so that the Vars contain the
> right type info etc. Otherwise an analyze with the old pg_node_tree crashes
> 
> 2) we need to reset the pg_statistic[] data too, which however makes
> keeping the functional dependencies a bit less useful, because those
> rely on the expression stats :-(
> 
> So I'm wondering what to do about this. I looked into how ALTER TABLE
> handles indexes, and 0003 is a PoC to do the same thing for statistics.
> Of course, this is a bit unfortunate because it recreates the statistics
> (so we don't keep anything, not even functional dependencies).
> 
> I think we have two options:
> 
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
> 
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.
> 
> Any opinions? I need to think about this a bit more, but maybe (b) with
> the analyze is the right thing to do. Keeping just some of the stats
> always seemed a bit weird. (This is why the 0002 patch breaks one of the
> regression tests.)
> 

After thinking about this a bit more I think (b) is the right choice,
and the analyze is not necessary. The reason is fairly simple - we drop
the per-column statistics, because ATExecAlterColumnType does

    RemoveStatistics(RelationGetRelid(rel), attnum);

so the user has to run analyze anyway, to get any reasonable estimates
(we keep the functional dependencies, but those still rely on per-column
statistics quite a bit). And we'll have to do the same thing with
per-expression stats too. It was a nice idea to keep at least the stats
that are not outright broken, but unfortunately it's not a very useful
optimization. It increases the instability of the system, because now we
have estimates with all statistics, no statistics, and something in
between after the partial reset. Not nice.

So my plan is to get rid of UpdateStatisticsForTypeChange, and just do
mostly what we do for indexes. It's not perfect (as demonstrated in the last
message), but that'd apply even to option (a).

Any better ideas?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Actually, I think we don't need that block at all - there's no point in
> keeping the exact expression, because if there were a statistics object
> matching it, it would already have been matched by examine_variable. So
> if we get here, we have to just split it into the vars anyway, and the
> second block is entirely useless.

Good point.

> That however means we don't need the processing with GroupExprInfo and
> GroupVarInfo lists, i.e. we can revert back to the original simpler
> processing, with a bit of extra logic to match expressions, that's all.
>
> The patch 0003 does this (it's a bit crude, but hopefully enough to
> demonstrate).

Cool. I did wonder about that, but I didn't fully think it through.
I'll take a look.

> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes.
>
> I think we have two options:
>
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
>
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.

I'd vote for (b) without an analyse, and I agree with getting rid of
UpdateStatisticsForTypeChange(). I've always been a bit skeptical
about trying to preserve extended statistics after a type change, when
we don't preserve regular per-column stats.

> BTW I wonder how useful the updated statistics actually is. Consider
> this example:
> ...
> the expression now looks like this:
>
> ========================================================================
> "public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
> precision + c)) FROM t
> ========================================================================
>
> But we're matching it to (((b)::double precision + c)), so that fails.
>
> This is not specific to extended statistics - indexes have exactly the
> same issue. Not sure how common this is in practice.

Hmm, that's unfortunate. Maybe it's not that common in practice
though. I'm not sure if there is any practical way to fix it, but if
there is, I guess we'd want to apply the same fix to both stats and
indexes, and that certainly seems out of scope for this patch.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> here's an updated patch. 0001

The change to the way that CreateStatistics() records dependencies
isn't quite right -- recordDependencyOnSingleRelExpr() will not create
any dependencies if the expression uses only a whole-row Var. However,
pull_varattnos() will include whole-row Vars, and so nattnums_exprs
will be non-zero, and CreateStatistics() will not create a whole-table
dependency when it should.

I suppose that could be fixed up by inspecting the bitmapset returned
by pull_varattnos() in more detail, but I think it's probably safer to
revert to the previous code, which matched what index_create() did.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/25/21 2:33 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> here's an updated patch. 0001
> 
> The change to the way that CreateStatistics() records dependencies
> isn't quite right -- recordDependencyOnSingleRelExpr() will not create
> any dependencies if the expression uses only a whole-row Var. However,
> pull_varattnos() will include whole-row Vars, and so nattnums_exprs
> will be non-zero, and CreateStatistics() will not create a whole-table
> dependency when it should.
> 
> I suppose that could be fixed up by inspecting the bitmapset returned
> by pull_varattnos() in more detail, but I think it's probably safer to
> revert to the previous code, which matched what index_create() did.
> 

Ah, good catch. I hadn't realized recordDependencyOnSingleRelExpr works
like that, so I've moved it after the whole-table dependency.

Attached is an updated patch series, with all the changes discussed
here. I've cleaned up the ndistinct stuff a bit more (essentially
reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
the UpdateStatisticsForTypeChange.

I've also looked at speeding up the stats_ext regression tests. The 0002
patch reduces the size of a couple of test tables, and removes a bunch
of queries. I've initially mostly just copied the original tests, but we
don't really need that many queries I think. This cuts the runtime about
in half, so it's mostly in line with other tests. Some of these changes
are in existing tests, so I'll consider moving them into a separate patch
applied before the main one.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: PoC/WIP: Extended statistics on expressions

From
Dean Rasheed
Date:
On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Attached is an updated patch series, with all the changes discussed
> here. I've cleaned up the ndistinct stuff a bit more (essentially
> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
> the UpdateStatisticsForTypeChange.
>

I've looked over all that and I think it's in pretty good shape. I
particularly like how much simpler the ndistinct code has now become.

Some (hopefully final) review comments:

1). I don't think index.c is the right place for
StatisticsGetRelation(). I appreciate that it is very similar to the
adjacent IndexGetRelation() function, but index.c is really only for
index-related code, so I think StatisticsGetRelation() should go in
statscmds.c

2). Perhaps the error message at statscmds.c:293 should read

   "expression cannot be used in multivariate statistics because its
type %s has no default btree operator class"

(i.e., add the word "multivariate", since the same expression *can* be
used in univariate statistics even though it has no less-than
operator).

3). The comment for ATExecAddStatistics() should probably mention that
"ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
similar way to other similar functions, e.g.:

/*
 * ALTER TABLE ADD STATISTICS
 *
 * This is no such command in the grammar, but we use this internally to add
 * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
 * column type change.
 */

4). The comment at the start of ATPostAlterTypeParse() needs updating
to mention CREATE STATISTICS statements.

5). I think ATPostAlterTypeParse() should also attempt to preserve any
COMMENTs attached to statistics objects, i.e., something like:

--- src/backend/commands/tablecmds.c.orig    2021-03-26 10:39:38.328631864 +0000
+++ src/backend/commands/tablecmds.c    2021-03-26 10:47:21.042279580 +0000
@@ -12619,6 +12619,9 @@
             CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
             AlterTableCmd *newcmd;

+            /* keep the statistics object's comment */
+            stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
+
             newcmd = makeNode(AlterTableCmd);
             newcmd->subtype = AT_ReAddStatistics;
             newcmd->def = (Node *) stmt;

6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/

7). I don't think that the big XXX comment near the start of
estimate_multivariate_ndistinct() is really relevant anymore, now that
the code has been simplified and we no longer extract Vars from
expressions, so perhaps it can just be deleted.

Regards,
Dean



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/26/21 12:37 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is an updated patch series, with all the changes discussed
>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>> the UpdateStatisticsForTypeChange.
>>
> 
> I've looked over all that and I think it's in pretty good shape. I
> particularly like how much simpler the ndistinct code has now become.
> 
> Some (hopefully final) review comments:
> 
> 1). I don't think index.c is the right place for
> StatisticsGetRelation(). I appreciate that it is very similar to the
> adjacent IndexGetRelation() function, but index.c is really only for
> index-related code, so I think StatisticsGetRelation() should go in
> statscmds.c
> 

Ah, right, I forgot about this. I wonder if we should add
catalog/statistics.c, similar to catalog/index.c (instead of adding it
locally to statscmds.c).

> 2). Perhaps the error message at statscmds.c:293 should read
> 
>    "expression cannot be used in multivariate statistics because its
> type %s has no default btree operator class"
> 
> (i.e., add the word "multivariate", since the same expression *can* be
> used in univariate statistics even though it has no less-than
> operator).
> 
> 3). The comment for ATExecAddStatistics() should probably mention that
> "ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
> similar way to other similar functions, e.g.:
> 
> /*
>  * ALTER TABLE ADD STATISTICS
>  *
>  * This is no such command in the grammar, but we use this internally to add
>  * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
>  * column type change.
>  */
> 
> 4). The comment at the start of ATPostAlterTypeParse() needs updating
> to mention CREATE STATISTICS statements.
> 
> 5). I think ATPostAlterTypeParse() should also attempt to preserve any
> COMMENTs attached to statistics objects, i.e., something like:
> 
> --- src/backend/commands/tablecmds.c.orig    2021-03-26 10:39:38.328631864 +0000
> +++ src/backend/commands/tablecmds.c    2021-03-26 10:47:21.042279580 +0000
> @@ -12619,6 +12619,9 @@
>              CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
>              AlterTableCmd *newcmd;
> 
> +            /* keep the statistics object's comment */
> +            stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
> +
>              newcmd = makeNode(AlterTableCmd);
>              newcmd->subtype = AT_ReAddStatistics;
>              newcmd->def = (Node *) stmt;
> 
> 6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/
> 
> 7). I don't think that the big XXX comment near the start of
> estimate_multivariate_ndistinct() is really relevant anymore, now that
> the code has been simplified and we no longer extract Vars from
> expressions, so perhaps it can just be deleted.
> 

Thanks! I'll fix these, and then will consider getting it committed
sometime later today, once the buildfarm does some testing on the other
stuff I already committed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 3/26/21 1:54 PM, Tomas Vondra wrote:
> 
> 
> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Attached is an updated patch series, with all the changes discussed
>>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>>> the UpdateStatisticsForTypeChange.
>>>
>>
>> I've looked over all that and I think it's in pretty good shape. I
>> particularly like how much simpler the ndistinct code has now become.
>>
>> Some (hopefully final) review comments:
>>
>> ...
>>
> 
> Thanks! I'll fix these, and then will consider getting it committed
> sometime later today, once the buildfarm does some testing on the other
> stuff I already committed.
> 

OK, pushed after a bit more polishing and testing. I've noticed one more
missing piece in describe (expressions missing in \dX), so I fixed that.

May the buildfarm be merciful ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 3/27/21 1:17 AM, Tomas Vondra wrote:
> On 3/26/21 1:54 PM, Tomas Vondra wrote:
>>
>>
>> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Attached is an updated patch series, with all the changes discussed
>>>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>>>> the UpdateStatisticsForTypeChange.
>>>>
>>>
>>> I've looked over all that and I think it's in pretty good shape. I
>>> particularly like how much simpler the ndistinct code has now become.
>>>
>>> Some (hopefully final) review comments:
>>>
>>> ...
>>>
>>
>> Thanks! I'll fix these, and then will consider getting it committed
>> sometime later today, once the buildfarm does some testing on the other
>> stuff I already committed.
>>
> 
> OK, pushed after a bit more polishing and testing. I've noticed one more
> missing piece in describe (expressions missing in \dX), so I fixed that.
> 
> May the buildfarm be merciful ...
> 

LOL! It failed on *my* buildfarm machine, because apparently some of the
expressions used in stats_ext depend on locale and the machine is using
cs_CZ.UTF-8. Will fix later ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
I suggest to add some kind of reference to stats expressions here.

--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml

  <sect2 id="vacuum-for-statistics">
   <title>Updating Planner Statistics</title>

   <indexterm zone="vacuum-for-statistics">
    <primary>statistics</primary>
    <secondary>of the planner</secondary>
   </indexterm>

[...]

@@ -330,10 +330,12 @@
 
     <para>
      Also, by default there is limited information available about
-     the selectivity of functions.  However, if you create an expression
+     the selectivity of functions.  However, if you create a statistics
+     expression or an expression
      index that uses a function call, useful statistics will be
      gathered about the function, which can greatly improve query
      plans that use the expression index.


-- 
Justin



Re: PoC/WIP: Extended statistics on expressions (\d in old client)

From
Justin Pryzby
Date:
On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > > | Statistics objects:
> > > > > |     "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > > 
> > > Umm, for me that prints:
> > 
> > >      "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > > 
> > > which I think is OK. But maybe there's something else to trigger the
> > > problem?
> > 
> > Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> > I think it's considered ok if old client's \d commands don't work on new
> > server, but it's not clear to me if it's ok if they misbehave.  It's almost
> > better it made an ERROR.
> > 
> 
> Well, how would the server know to throw an error? We can't quite patch the
> old psql (if we could, we could just tweak the query).

To refresh: stats objects on a v14 server which include expressions are shown
by a pre-v14 psql client with the expressions elided (because the attnums don't
correspond to anything in pg_attribute).

I'm mentioning it again since, even though I knew about this earlier in the
year, it caused some confusion for me again just now while testing our
application.  I had the v14 server installed but the psql symlink still pointed
to the v13 client.

There may not be anything we can do about it.
And it may not be a significant issue outside the beta period: more typically,
the client version would match the server version.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Noah Misch
Date:
On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> OK, pushed after a bit more polishing and testing.

This added a "transformed" field to CreateStatsStmt, but it didn't mention
that field in src/backend/nodes.  Should those functions handle the field?



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 6/6/21 7:37 AM, Noah Misch wrote:
> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
>> OK, pushed after a bit more polishing and testing.
> 
> This added a "transformed" field to CreateStatsStmt, but it didn't mention
> that field in src/backend/nodes.  Should those functions handle the field?
> 

Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
sure if it can result in error/failure or just inefficiency (due to
transforming the expressions repeatedly), but it should do whatever
CREATE INDEX is doing.

Thanks for noticing! Fixed by d57ecebd12.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 6/6/21 7:37 AM, Noah Misch wrote:
>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>> that field in src/backend/nodes.  Should those functions handle the field?

> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.

I'm curious about how come the buildfarm didn't notice this.  The
animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
that they didn't implies that there's no test case that makes use
of a nonzero value for this field, which seems like a testing gap.

            regards, tom lane



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 6/6/21 9:17 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 6/6/21 7:37 AM, Noah Misch wrote:
>>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>>> that field in src/backend/nodes.  Should those functions handle the field?
> 
>> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
>> sure if it can result in error/failure or just inefficiency (due to
>> transforming the expressions repeatedly), but it should do whatever
>> CREATE INDEX is doing.
> 
> I'm curious about how come the buildfarm didn't notice this.  The
> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
> that they didn't implies that there's no test case that makes use
> of a nonzero value for this field, which seems like a testing gap.
> 

AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
look like this:

    List       *new_list = copyObject(raw_parsetree_list);

    /* This checks both copyObject() and the equal() routines... */
    if (!equal(new_list, raw_parsetree_list))
        elog(WARNING, "copyObject() failed to produce an equal raw parse tree");
    else
        raw_parsetree_list = new_list;

But if the field is missing from all the functions, equal() can't detect
that copyObject() did not actually copy it. It'd detect a case where the
field was added to just one place, but not this one. The CREATE INDEX code
(which served as an example for CREATE STATISTICS) has exactly the same issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 6/6/21 9:17 PM, Tom Lane wrote:
>> I'm curious about how come the buildfarm didn't notice this.  The
>> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
>> that they didn't implies that there's no test case that makes use
>> of a nonzero value for this field, which seems like a testing gap.

> AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
> look like this:
> ...
> But if the field is missing from all the functions, equal() can't detect
> that copyObject() did not actually copy it.

Right, that code would only detect a missing copyfuncs.c line if
equalfuncs.c did have the line, which isn't all that likely.  However,
we then pass the copied node on to further processing, which in principle
should result in visible failures when copyfuncs.c is missing a line.

I think the reason it didn't is that the transformed field would always
be zero (false) in grammar output.  We could only detect a problem if
we copied already-transformed nodes and then used them further.  Even
then it *might* not fail, because the consequence would likely be an
extra round of parse analysis on the expressions, which is likely to
be a no-op.

Not sure if there's a good way to improve that.  I hope sometime soon
we'll be able to auto-generate these functions, and then the risk of
this sort of mistake will go away (he says optimistically).

            regards, tom lane



Re: PoC/WIP: Extended statistics on expressions

From
Noah Misch
Date:
On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:
> 
> On 6/6/21 7:37 AM, Noah Misch wrote:
> > On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> >> OK, pushed after a bit more polishing and testing.
> > 
> > This added a "transformed" field to CreateStatsStmt, but it didn't mention
> > that field in src/backend/nodes.  Should those functions handle the field?
> > 
> 
> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.
> 
> Thanks for noticing! Fixed by d57ecebd12.

Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
changes need a catversion bump, since the catalogs might contain input for
each read function.  Other src/backend/nodes functions don't face that.  Also,
src/backend/nodes generally process fields in the order that they appear in
the struct.  The order you used in d57ecebd12 is nicer, being more like
IndexStmt, so I'm pushing an order change to the struct.



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 6/11/21 6:55 AM, Noah Misch wrote:
> On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:
>>
>> On 6/6/21 7:37 AM, Noah Misch wrote:
>>> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
>>>> OK, pushed after a bit more polishing and testing.
>>>
>>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>>> that field in src/backend/nodes.  Should those functions handle the field?
>>>
>>
>> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
>> sure if it can result in error/failure or just inefficiency (due to
>> transforming the expressions repeatedly), but it should do whatever
>> CREATE INDEX is doing.
>>
>> Thanks for noticing! Fixed by d57ecebd12.
> 
> Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
> changes need a catversion bump, since the catalogs might contain input for
> each read function.  Other src/backend/nodes functions don't face that.  Also,
> src/backend/nodes generally process fields in the order that they appear in
> the struct.  The order you used in d57ecebd12 is nicer, being more like
> IndexStmt, so I'm pushing an order change to the struct.
> 

OK, makes sense. Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > In any case, why are there so many parentheses ?

On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
> adding extra parentheses, on top of what deparse_expression_pretty does.
> Will fix.

The extra parens are still here - is it intended ?

postgres=# CREATE STATISTICS s ON i, (1+i), (2+i) FROM t;
CREATE STATISTICS
postgres=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 i      | integer |           |          | 
Statistics objects:
    "public"."s" ON i, ((1 + i)), ((2 + i)) FROM t

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
> > Looking at the current behaviour, there are a couple of things that
> > seem a little odd, even though they are understandable. For example,
> > the fact that
> > 
> >   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> > 
> > fails, but
> > 
> >   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> > 
> > succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> > 
> >   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> > 
> > tends to suggest that it's going to create statistics on the pair of
> > expressions, describing their correlation, when actually it builds 2
> > independent statistics. Also, this error text isn't entirely accurate:
> > 
> >   CREATE STATISTICS s ON col FROM tbl;
> >   ERROR:  extended statistics require at least 2 columns
> > 
> > because extended statistics don't always require 2 columns, they can
> > also just have an expression, or multiple expressions and 0 or 1
> > columns.
> > 
> > I think a lot of this stems from treating "expressions" in the same
> > way as the other (multi-column) stats kinds, and it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
> > 
> > The first syntax would create single-column stats, and wouldn't accept
> > a statistics_kind argument, because there is only one kind of
> > single-column statistic. Maybe that might change in the future, but if
> > so, it's likely that the kinds of single-column stats will be
> > different from the kinds of multi-column stats.
> > 
> > In the second syntax, the only accepted kinds would be the current
> > multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> > would always build stats describing the correlations between the
> > columns listed. It would continue to build standard/expression stats
> > on any expressions in the list, but that's more of an implementation
> > detail.
> > 
> > It would no longer be possible to do "CREATE STATISTICS s
> > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> > issue 2 separate "CREATE STATISTICS" commands, but that seems more
> > logical, because they're independent stats.
> > 
> > The parsing code might not change much, but some of the errors would
> > be different. For example, the errors "building only extended
> > expression statistics on simple columns not allowed" and "extended
> > expression statistics require at least one expression" would go away,
> > and the error "extended statistics require at least 2 columns" might
> > become more specific, depending on the stats kind.

This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/16/21 3:31 AM, Justin Pryzby wrote:
> On 1/22/21 5:01 AM, Justin Pryzby wrote:
>>> In any case, why are there so many parentheses ?
> 
> On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
>> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
>> adding extra parentheses, on top of what deparse_expression_pretty does.
>> Will fix.
> 
> The extra parens are still here - is it intended ?
> 

Ah, thanks for reminding me! I was looking at this, and the problem is 
that pg_get_statisticsobj_worker only does this:

     prettyFlags = PRETTYFLAG_INDENT;

Changing that to

     prettyFlags = PRETTYFLAG_INDENT | PRETTYFLAG_PAREN;

fixes this (not sure we need the INDENT flag - probably not).

I'm a bit confused, though. My assumption was "PRETTYFLAG_PAREN = true" 
would force the deparsing itself to add the parens, if needed, but in 
reality it works the other way around.

I guess it's more complicated due to deparsing multi-level expressions, 
but unfortunately, there's no comment explaining what it does.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/16/21 3:32 AM, Justin Pryzby wrote:
> On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
>>> Looking at the current behaviour, there are a couple of things that
>>> seem a little odd, even though they are understandable. For example,
>>> the fact that
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
>>>
>>> fails, but
>>>
>>>    CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
>>>
>>> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
>>>
>>>    CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
>>>
>>> tends to suggest that it's going to create statistics on the pair of
>>> expressions, describing their correlation, when actually it builds 2
>>> independent statistics. Also, this error text isn't entirely accurate:
>>>
>>>    CREATE STATISTICS s ON col FROM tbl;
>>>    ERROR:  extended statistics require at least 2 columns
>>>
>>> because extended statistics don't always require 2 columns, they can
>>> also just have an expression, or multiple expressions and 0 or 1
>>> columns.
>>>
>>> I think a lot of this stems from treating "expressions" in the same
>>> way as the other (multi-column) stats kinds, and it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      ON (expression)
>>>      FROM table_name
>>>
>>>    CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>      [ ( statistics_kind [, ... ] ) ]
>>>      ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>      FROM table_name
>>>
>>> The first syntax would create single-column stats, and wouldn't accept
>>> a statistics_kind argument, because there is only one kind of
>>> single-column statistic. Maybe that might change in the future, but if
>>> so, it's likely that the kinds of single-column stats will be
>>> different from the kinds of multi-column stats.
>>>
>>> In the second syntax, the only accepted kinds would be the current
>>> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
>>> would always build stats describing the correlations between the
>>> columns listed. It would continue to build standard/expression stats
>>> on any expressions in the list, but that's more of an implementation
>>> detail.
>>>
>>> It would no longer be possible to do "CREATE STATISTICS s
>>> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
>>> issue 2 separate "CREATE STATISTICS" commands, but that seems more
>>> logical, because they're independent stats.
>>>
>>> The parsing code might not change much, but some of the errors would
>>> be different. For example, the errors "building only extended
>>> expression statistics on simple columns not allowed" and "extended
>>> expression statistics require at least one expression" would go away,
>>> and the error "extended statistics require at least 2 columns" might
>>> become more specific, depending on the stats kind.
> 
> This still seems odd:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> It seems wrong that the command works with added parens, but builds expression
> stats on a simple column (which is redundant with what analyze does without
> extended stats).
> 

Well, yeah. But I think this is a behavior that was discussed somewhere 
in this thread, and the agreement was that it's not worth the 
complexity, as this comment explains

   * XXX We do only the bare minimum to separate simple attribute and
   * complex expressions - for example "(a)" will be treated as a complex
   * expression. No matter how elaborate the check is, there'll always be
   * a way around it, if the user is determined (consider e.g. "(a+0)"),
   * so it's not worth protecting against it.

Of course, maybe that wasn't the right decision - it's a bit weird that

   CREATE INDEX on t ((a), (b))

actually "extracts" the column references and stores that in indkeys, 
instead of treating that as expressions.

Patch 0001 fixes the "double parens" issue discussed elsewhere in this 
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a 
simple column reference.
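
To make the intended behavior of 0002 concrete, a rough sketch (assuming a 
table "t" with integer columns "a" and "b"):

   -- "(a)" is now treated as a plain column reference, so this builds
   -- ordinary multi-column statistics on (a, b):
   CREATE STATISTICS s1 ON (a), (b) FROM t;

   -- and this is rejected, instead of silently building expression stats
   -- that just duplicate the regular per-column statistics:
   CREATE STATISTICS s2 ON (a) FROM t;
   -- ERROR:  extended statistics require at least 2 columns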

But I'm not sure 0002 is something we can do without a catversion bump. 
What if someone created such "bogus" statistics? It's mostly harmless, 
because the statistics object is useless anyway (AFAICS we'll just use 
the regular one we have for the column), but if they do pg_dump, that'll 
fail because of this new restriction.

OTOH we're still "only" in beta, and IIRC the rule is not to bump 
catversion after rc1.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

> From: Tomas Vondra <tomas.vondra@postgresql.org>
> Date: Mon, 16 Aug 2021 17:19:33 +0200
> Subject: [PATCH 2/2] fix: identify single-attribute references

> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
> index a4ee54d516..be1f3a5175 100644
> --- a/src/bin/pg_dump/t/002_pg_dump.pl
> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
> @@ -2811,7 +2811,7 @@ my %tests = (
>          create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>                              ON (2 * col1) FROM dump_test.test_fifth_table',
>          regexp => qr/^
> -            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E
> +            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E


This hunk should be in 0001, no ?

> But I'm not sure 0002 is something we can do without a catversion bump. What
> if someone created such "bogus" statistics? It's mostly harmless, because
> the statistics object is useless anyway (AFAICS we'll just use the regular
> one we have for the column), but if they do pg_dump, that'll fail because of
> this new restriction.

I think it's okay if pg_dump throws an error, since the fix is as easy as
dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)

Andres concluded similarly with the reverted autovacuum patch:
https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de

+RMT in case someone wants to argue otherwise.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 8/18/21 5:07 AM, Justin Pryzby wrote:
>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>> column reference.
> 
>> From: Tomas Vondra <tomas.vondra@postgresql.org>
>> Date: Mon, 16 Aug 2021 17:19:33 +0200
>> Subject: [PATCH 2/2] fix: identify single-attribute references
> 
>> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
>> index a4ee54d516..be1f3a5175 100644
>> --- a/src/bin/pg_dump/t/002_pg_dump.pl
>> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
>> @@ -2811,7 +2811,7 @@ my %tests = (
>>           create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>>                               ON (2 * col1) FROM dump_test.test_fifth_table',
>>           regexp => qr/^
>> -            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E
>> +            \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E
> 
> 
> This hunk should be in 0001, no ?
>
Yeah, I mixed that up a bit.

>> But I'm not sure 0002 is something we can do without a catversion bump. What
>> if someone created such "bogus" statistics? It's mostly harmless, because
>> the statistics object is useless anyway (AFAICS we'll just use the regular
>> one we have for the column), but if they do pg_dump, that'll fail because of
>> this new restriction.
> 
> I think it's okay if pg_dump throws an error, since the fix is as easy as
> dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)
> 
> Andres concluded similarly with the reverted autovacuum patch:
> https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de
> 
 > +RMT in case someone wants to argue otherwise.
 >

I feel a bit uneasy about it, but if there's a precedent ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:
> > This still seems odd:
> > 
> > postgres=# CREATE STATISTICS asf ON i FROM t;
> > ERROR:  extended statistics require at least 2 columns
> > postgres=# CREATE STATISTICS asf ON (i) FROM t;
> > CREATE STATISTICS
> > 
> > It seems wrong that the command works with added parens, but builds expression
> > stats on a simple column (which is redundant with what analyze does without
> > extended stats).
> 
> Well, yeah. But I think this is a behavior that was discussed somewhere in
> this thread, and the agreement was that it's not worth the complexity, as
> this comment explains
> 
>   * XXX We do only the bare minimum to separate simple attribute and
>   * complex expressions - for example "(a)" will be treated as a complex
>   * expression. No matter how elaborate the check is, there'll always be
>   * a way around it, if the user is determined (consider e.g. "(a+0)"),
>   * so it's not worth protecting against it.
> 
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

0002 refuses to create expression stats on a simple column reference like
(a), which I think helps to avoid a user accidentally creating useless ext
stats objects (which are redundant with the table's column stats).

0002 does not attempt to refuse cases like (a+0), which I think is fine:
we don't try to reject useless cases if someone insists on it.
See 240971675, 701fd0bbc.

So I am +1 to apply both patches.

I added this as an Open Item for increased visibility.

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 8/24/21 3:13 PM, Justin Pryzby wrote:
> On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:
>>> This still seems odd:
>>>
>>> postgres=# CREATE STATISTICS asf ON i FROM t;
>>> ERROR:  extended statistics require at least 2 columns
>>> postgres=# CREATE STATISTICS asf ON (i) FROM t;
>>> CREATE STATISTICS
>>>
>>> It seems wrong that the command works with added parens, but builds expression
>>> stats on a simple column (which is redundant with what analyze does without
>>> extended stats).
>>
>> Well, yeah. But I think this is a behavior that was discussed somewhere in
>> this thread, and the agreement was that it's not worth the complexity, as
>> this comment explains
>>
>>    * XXX We do only the bare minimum to separate simple attribute and
>>    * complex expressions - for example "(a)" will be treated as a complex
>>    * expression. No matter how elaborate the check is, there'll always be
>>    * a way around it, if the user is determined (consider e.g. "(a+0)"),
>>    * so it's not worth protecting against it.
>>
>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>> column reference.
> 
> 0002 refuses to create expression stats on a simple column reference like
> (a), which I think helps to avoid a user accidentally creating useless ext
> stats objects (which are redundant with the table's column stats).
> 
> 0002 does not attempt to refuse cases like (a+0), which I think is fine:
> we don't try to reject useless cases if someone insists on it.
> See 240971675, 701fd0bbc.
> 
> So I am +1 to apply both patches.
> 
> I added this as an Open Item for increased visibility.
> 

I've pushed both fixes, so the open item should be resolved.

However, while polishing the second patch, I realized we're allowing 
statistics on expressions referencing system attributes. So this fails:

CREATE STATISTICS s ON ctid, x FROM t;

but this passes:

CREATE STATISTICS s ON (ctid::text), x FROM t;

IMO we should reject such expressions, just like we reject direct 
references to system attributes - patch attached.

Furthermore, I wonder if we should reject expressions without any Vars? 
This works now:

CREATE STATISTICS s ON (11::text) FROM t;

but it seems rather silly / useless, so maybe we should reject it.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> > > Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> > > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> > > column reference.
> > 
> > 0002 refuses to create expression stats on a simple column reference like
> > (a), which I think helps to avoid a user accidentally creating useless ext
> > stats objects (which are redundant with the table's column stats).
> > 
> > 0002 does not attempt to refuse cases like (a+0), which I think is fine:
> > we don't try to reject useless cases if someone insists on it.
> > See 240971675, 701fd0bbc.
> > 
> > So I am +1 to apply both patches.
> > 
> > I added this as an Open Item for increased visibility.
> 
> I've pushed both fixes, so the open item should be resolved.

Thank you - I marked it as such.

There are some typos in 537ca68db (refenrece)
I'll add them to my typos branch if you don't want to patch them right now or
wait to see if someone notices anything else.

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 59369f8736..17cbd97808 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -205,27 +205,27 @@ CreateStatistics(CreateStatsStmt *stmt)
     numcols = list_length(stmt->exprs);
     if (numcols > STATS_MAX_DIMENSIONS)
         ereport(ERROR,
                 (errcode(ERRCODE_TOO_MANY_COLUMNS),
                  errmsg("cannot have more than %d columns in statistics",
                         STATS_MAX_DIMENSIONS)));
 
     /*
      * Convert the expression list to a simple array of attnums, but also keep
      * a list of more complex expressions.  While at it, enforce some
      * constraints - we don't allow extended statistics on system attributes,
-     * and we require the data type to have less-than operator.
+     * and we require the data type to have a less-than operator.
      *
-     * There are many ways how to "mask" a simple attribute refenrece as an
+     * There are many ways to "mask" a simple attribute reference as an
      * expression, for example "(a+0)" etc. We can't possibly detect all of
-     * them, but we handle at least the simple case with attribute in parens.
+     * them, but we handle at least the simple case with the attribute in parens.
      * There'll always be a way around this, if the user is determined (like
      * the "(a+0)" example), but this makes it somewhat consistent with how
      * indexes treat attributes/expressions.
      */
     foreach(cell, stmt->exprs)
     {
         StatsElem  *selem = lfirst_node(StatsElem, cell);
 
         if (selem->name)        /* column reference */
         {
             char       *attname;



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:

On 9/1/21 9:38 PM, Justin Pryzby wrote:
> On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
>>>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
>>>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
>>>> column reference.
>>>
>>> 0002 refuses to create expression stats on a simple column reference like
>>> (a), which I think helps to avoid a user accidentally creating useless ext
>>> stats objects (which are redundant with the table's column stats).
>>>
>>> 0002 does not attempt to refuse cases like (a+0), which I think is fine:
>>> we don't try to reject useless cases if someone insists on it.
>>> See 240971675, 701fd0bbc.
>>>
>>> So I am +1 to apply both patches.
>>>
>>> I added this as an Open Item for increased visibility.
>>
>> I've pushed both fixes, so the open item should be resolved.
> 
> Thank you - I marked it as such.
> 
> There are some typos in 537ca68db (refenrece)
> I'll add them to my typos branch if you don't want to patch them right now or
> wait to see if someone notices anything else.
> 

Yeah, probably better to wait a bit. Any opinions on rejecting 
expressions referencing system attributes or no attributes at all?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC/WIP: Extended statistics on expressions

From
Justin Pryzby
Date:
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> However, while polishing the second patch, I realized we're allowing
> statistics on expressions referencing system attributes. So this fails:
> 
> CREATE STATISTICS s ON ctid, x FROM t;
> 
> but this passes:
> 
> CREATE STATISTICS s ON (ctid::text), x FROM t;
> 
> IMO we should reject such expressions, just like we reject direct references
> to system attributes - patch attached.

Right, same as indexes.  +1

> Furthermore, I wonder if we should reject expressions without any Vars? This
> works now:
> 
> CREATE STATISTICS s ON (11::text) FROM t;
> 
> but it seems rather silly / useless, so maybe we should reject it.

To my surprise, this is also allowed for indexes...

But (maybe this is what I was remembering) it's prohibited to have a constant
expression as a partition key.

Expressions without a var seem like a case where the user did something
deliberately silly, and dissimilar from the case of making a stats expression
on a simple column - that seemed like it could be a legitimate
mistake/confusion (it's not unreasonable to write an extra set of parentheses,
but it's strange if that causes it to behave differently).

I think it's not worth too much effort to prohibit this: if they're determined,
they can still write an expression with a var which is constant.  I'm not going
to say it's worth zero effort, though.
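
To illustrate the comparison (a sketch, assuming a table "t" with an integer 
column "a"; I'm quoting the partition-key error text from memory, so it may 
not be exact):

   -- allowed for indexes, somewhat surprisingly:
   CREATE INDEX ON t ((11::text));

   -- but a constant expression is rejected as a partition key:
   CREATE TABLE p (a int) PARTITION BY RANGE ((11::text));
   -- ERROR:  cannot use constant expression as partition key

   -- and a determined user can always hide a constant behind a Var:
   CREATE STATISTICS s ON (a * 0 + 11) FROM t;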

-- 
Justin



Re: PoC/WIP: Extended statistics on expressions

From
Tomas Vondra
Date:
On 9/3/21 5:56 AM, Justin Pryzby wrote:
> On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
>> However, while polishing the second patch, I realized we're allowing
>> statistics on expressions referencing system attributes. So this fails:
>>
>> CREATE STATISTICS s ON ctid, x FROM t;
>>
>> but this passes:
>>
>> CREATE STATISTICS s ON (ctid::text), x FROM t;
>>
>> IMO we should reject such expressions, just like we reject direct references
>> to system attributes - patch attached.
> 
> Right, same as indexes.  +1
>

I've pushed this check, disallowing extended stats on expressions 
referencing system attributes. This means we'll reject both ctid and 
(ctid::text), just like for indexes.
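
A quick sanity check of the parallel with indexes (a sketch, assuming a table 
"t" with a regular column "x"; error messages omitted):

   -- both forms are now rejected by CREATE STATISTICS:
   CREATE STATISTICS s1 ON ctid, x FROM t;           -- direct reference
   CREATE STATISTICS s2 ON (ctid::text), x FROM t;   -- inside an expression

   -- matching what CREATE INDEX already does (both rejected too):
   CREATE INDEX ON t (ctid);
   CREATE INDEX ON t ((ctid::text));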

>> Furthermore, I wonder if we should reject expressions without any Vars? This
>> works now:
>>
>> CREATE STATISTICS s ON (11::text) FROM t;
>>
>> but it seems rather silly / useless, so maybe we should reject it.
> 
> To my surprise, this is also allowed for indexes...
> 
> But (maybe this is what I was remembering) it's prohibited to have a constant
> expression as a partition key.
> 
> Expressions without a var seem like a case where the user did something
> deliberately silly, and dis-similar from the case of making a stats expression
> on a simple column - that seemed like it could be a legitimate
> mistake/confusion (it's not unreasonable to write an extra parenthesis, but
> it's strange if that causes it to behave differently).
> 
> I think it's not worth too much effort to prohibit this: if they're determined,
> they can still write an expresion with a var which is constant.  I'm not going
> to say it's worth zero effort, though.
> 

I've decided not to push this. The statistics objects on expressions not 
referencing any variables seem useless, but maybe not entirely - we 
allow volatile expressions, like

   CREATE STATISTICS s ON (random()) FROM t;

which I suppose might be useful. And we accept similar cases (except for 
the volatility, of course) for indexes too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company