Thread: planner support functions: handle GROUP BY estimates ?

planner support functions: handle GROUP BY estimates ?

From
Justin Pryzby
Date:
Tom implemented "Planner support functions":
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
https://www.postgresql.org/docs/12/xfunc-optimization.html

I wondered whether there was any consideration to extend that to allow
providing improved estimates of "group by".  That currently requires manually
creating an expression index, and is only possible if the function is
IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

ts=# explain analyze SELECT date_trunc('day', start_time) FROM child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..98.45 rows=889 width=8) (actual time=1.476..1.482 rows=19 loops=1)

ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..98.45 rows=889 width=8) (actual time=1.499..1.500 rows=1 loops=1)

ts=# CREATE INDEX ON child.alu_amms_201911 (date_trunc('year',start_time));
ts=# ANALYZE child.alu_amms_201911;
ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..87.35 rows=1 width=8) (actual time=1.414..1.414 rows=1 loops=1)



Re: planner support functions: handle GROUP BY estimates ?

From
Justin Pryzby
Date:
On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
> 
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by".  That currently requires manually
> creating an expression index, and is only possible if the function is
> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc().  Currently,
the planner assumes that a function outputs as many distinct groups as its
input variable has.  Most invocations of our reports use date_trunc() (or
similar), so my earlier attempt to alert on rowcount misestimates was
short-lived.

I currently assume that the input data has 1 second granularity:
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;

|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group  (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate  (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group  (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.
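The arithmetic above can be sketched as follows. This is my illustrative model of the corrective-factor idea, not the patch's actual code; all names are hypothetical:

```python
# Hedged sketch: divide the input's ndistinct by the truncation step,
# taking the input data's granularity as an assumed constant (1 second
# in the patch as described above).  Names here are illustrative.
SECONDS_PER = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def groups_after_trunc(ndistinct_input, input_granularity_sec, field):
    """Estimated distinct groups after date_trunc(field, col), assuming
    input values are spaced input_granularity_sec apart."""
    step = SECONDS_PER[field]
    if step <= input_granularity_sec:
        return ndistinct_input  # truncation changes nothing at this grain
    return max(1, ndistinct_input * input_granularity_sec // step)

# 7 days of 1-second data: 604800 distinct inputs.
print(groups_after_trunc(604800, 1, "hour"))  # 168 hourly groups
print(groups_after_trunc(604800, 1, "day"))   # 7 daily groups

# If the data is really hourly (168 distinct inputs) but we assume
# 1-second granularity, "day" divides by 86400 instead of by 24:
print(groups_after_trunc(168, 3600, "day"))   # correct: 7
print(groups_after_trunc(168, 1, "day"))      # wrong assumption: 1
```

The last two lines show the underestimate described above: the wrong granularity assumption is off by exactly the 3600x ratio between the assumed and actual input spacing.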

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes when it would make
the rowcount less than (say) 2 or 100.  That would still apply the /24 for
"day" but avoid the final /60 or /60/60.  Ultimately, that's more "fudge" than
anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc., each including
the given corrective factor, which should allow a more accurate rowcount for
input data with granularity of the given number of seconds.

Ideally, that would be a user-specified factor, but I don't think that's
possible to specify in SQL; the constant has to be built into the C function.
At Telsasoft, our data mostly has 15-minute granularity (900 sec), so we'd
maybe make a "date_trunc" function in the user schema which calls
pg_catalog.date_trunc with support function timestamp_support_10pow3;

There could be a "base" support function that accepts a multiplier argument,
and then any user-provided C extension would be a one-liner specifying an
arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;
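Option (2)'s multiplier idea can be sketched like this. Illustrative Python, not the proposed C support function; the names are hypothetical:

```python
# Sketch of a "base" support function parameterized by an assumed input
# granularity (the multiplier); each template like timestamp_support_10pow3
# would then be a one-liner fixing that constant.  Names are hypothetical.
SECONDS_PER = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def make_support(granularity_sec):
    """Return an estimator assuming inputs spaced granularity_sec apart."""
    def estimate(ndistinct_input, field):
        step = SECONDS_PER[field]
        if step <= granularity_sec:
            return ndistinct_input
        return max(1, ndistinct_input * granularity_sec // step)
    return estimate

support_900s = make_support(900)  # 15-minute source data
# One day of 15-minute samples = 96 distinct inputs -> 24 hourly groups:
print(support_900s(96, "hour"))   # 24
```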

Justin

Attachment

Re: planner support functions: handle GROUP BY estimates ?

From
Justin Pryzby
Date:
On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> > Tom implemented "Planner support functions":
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> > https://www.postgresql.org/docs/12/xfunc-optimization.html
> > 
> > I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by".  That currently requires manually
> creating an expression index, and is only possible if the function is
> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
> 
> I didn't hear back so tried implementing this for date_trunc().  Currently, the

> I currently assume that the input data has 1 second granularity:
...
> If the input timestamps have (say) hourly granularity, rowcount will be
> *underestimated* by 3600x, which is worse than the behavior in master of
> overestimating by (for "day") 24x.
> 
> I'm trying to think of ways to address that:

In the attached, I handled that by using the histogram and the variable's
initial ndistinct estimate, giving good estimates even for intermediate
granularities of input timestamps.

|postgres=# DROP TABLE IF EXISTS t; CREATE TABLE t(i) AS SELECT a FROM generate_series(now(), now()+'11 day'::interval, '15minutes')a, generate_series(1,9)b; ANALYZE t;
|
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..188.99 rows=264 width=8) (actual time=42.110..42.317 rows=265 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..198.91 rows=1057 width=8) (actual time=41.685..42.264 rows=1057 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..185.83 rows=11 width=8) (actual time=46.672..46.681 rows=12 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('second',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..198.91 rows=1057 width=8) (actual time=41.816..42.435 rows=1057 loops=1)

Attachment

Re: planner support functions: handle GROUP BY estimates ?

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>> Tom implemented "Planner support functions":
>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>> I wondered whether there was any consideration to extend that to allow
>>> providing improved estimates of "group by".  That currently requires manually
>>> creating an expression index, and is only possible if the function is
>>> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

>> I didn't hear back so tried implementing this for date_trunc().  Currently, the
>> ...
>> If the input timestamps have (say) hourly granularity, rowcount will be
>> *underestimated* by 3600x, which is worse than the behavior in master of
>> overestimating by (for "day") 24x.

While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem.  There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression.  The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions.  (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s).  We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.

I'm not sure how much new infrastructure would have to be built
for this.  We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it.  I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.

            regards, tom lane



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>Justin Pryzby <pryzby@telsasoft.com> writes:
>> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>>> Tom implemented "Planner support functions":
>>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>>> I wondered whether there was any consideration to extend that to allow
>>>> providing improved estimates of "group by".  That currently requires manually
>>>> creating an expression index, and is only possible if the function is
>>>> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
>
>>> I didn't hear back so tried implementing this for date_trunc().  Currently, the
>>> ...
>>> If the input timestamps have (say) hourly granularity, rowcount will be
>>> *underestimated* by 3600x, which is worse than the behavior in master of
>>> overestimating by (for "day") 24x.
>
>While I don't have any objection in principle to extending the set of
>things planner support functions can do, it doesn't seem like the idea is
>giving you all that much traction for this problem.  There isn't that much
>knowledge that's specific to date_trunc in this, and instead you've got a
>bunch of generic problems (that would have to be solved again in every
>other function's planner support).
>
>Another issue is that it seems like this doesn't compose nicely ---
>if the GROUP BY expression is "f(g(x))", how do f's support function
>and g's support function interact?
>
>The direction that I've been wanting to go in for this kind of problem
>is to allow CREATE STATISTICS on an expression, ie if you were concerned
>about the estimation accuracy for GROUP BY or anything else, you could do
>something like
>
>CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
>
>This would have the effect of cueing ANALYZE to gather stats on the
>value of that expression, which the planner could then use, very much
>as if you'd created an index on the expression.  The advantages of
>doing this rather than making an index are
>
>(1) you don't have to pay the maintenance costs for an index,
>
>(2) we don't have to restrict it to immutable expressions.  (Volatile
>expressions would have to be disallowed, if only because of fear of
>side-effects; but I think we could allow stable expressions just fine.
>Worst case problem is that the stats are stale, but so what?)
>
>With a solution like this, we don't have to solve any of the difficult
>problems of how the pieces of the expression interact with each other
>or with the statistics of the underlying column(s).  We just use the
>stats if available, and the estimate will be as good as it'd be for
>a plain column reference.
>
>I'm not sure how much new infrastructure would have to be built
>for this.  We designed the CREATE STATISTICS syntax to support
>this (partly at my insistence IIRC) but I do not think any of the
>existing plumbing is ready for it.  I don't think it'd be very
>hard to plug this into ANALYZE or the planner, but there might be
>quite some work to be done on the catalog infrastructure, pg_dump,
>etc.
>
>cc'ing Tomas in case he has any thoughts about it.
>

Well, I certainly do have thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:

   https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development

The third part of that patch series is exactly about supporting extended
statistics on expressions, along the lines you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

The main thing I'm not sure about is how to represent this in catalogs,
whether to have two fields (like for indexes) or maybe a single list of
expressions.


I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,
which means we currently require at least two attributes in CREATE
STATISTICS and so on. So maybe what we want is to collect "regular"
per-column stats just like we do for indexes, but without the index
maintenance overhead?

The advantage would be we'd get exactly the same stats as for indexes,
and we could use them in the same places out of the box. While with
extended stats we'll have to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having
a relation (i.e. table / index / ...) but maybe we could invent a new 
relation type for this purpose. Of course, it'd require some catalog
work to represent this ...


Ultimately I think we'd want both things, it's not one or the other.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: planner support functions: handle GROUP BY estimates ?

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>> cc'ing Tomas in case he has any thoughts about it.

> Well, I certainly do have thoughts about this - it's pretty much exactly what
> I proposed yesterday in this thread:
>    https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
> The third part of that patch series is exactly about supporting extended
> statistics on expressions, along the lines you described here. The current
> status of the WIP patch is that grammar + ANALYZE mostly works, but
> there is no support in the planner. It's obviously still very hackish.

Cool.  We should probably take the discussion to that thread, then.

> I'm also wondering if we could/should 100% rely on extended statistics,
> because those are really meant to track correlations between columns,

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data.  Or at least entries that look the same as what
you could find in pg_statistic.

            regards, tom lane



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>>> cc'ing Tomas in case he has any thoughts about it.
>
>> Well, I certainly do have thoughts about this - it's pretty much exactly what
>> I proposed yesterday in this thread:
>>    https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
>> The third part of that patch series is exactly about supporting extended
>> statistics on expressions, along the lines you described here. The current
>> status of the WIP patch is that grammar + ANALYZE mostly works, but
>> there is no support in the planner. It's obviously still very hackish.
>
>Cool.  We should probably take the discussion to that thread, then.
>
>> I'm also wondering if we could/should 100% rely on extended statistics,
>> because those are really meant to track correlations between columns,
>
>Yeah, it seems likely to me that the infrastructure for this would be
>somewhat different --- the user-facing syntax could be basically the
>same, but ultimately we want to generate entries in pg_statistic not
>pg_statistic_ext_data.  Or at least entries that look the same as what
>you could find in pg_statistic.
>

Yeah. I think we could invent a new type of statistics, "expressions",
which would simply build these per-column stats. So for example

   CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

would build per-column stats stored in pg_statistic, while

   CREATE STATISTICS s (mcv) ON (a*b), sqrt(c) FROM t;

would build the multi-column MCV list on expressions.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: planner support functions: handle GROUP BY estimates ?

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>> Yeah, it seems likely to me that the infrastructure for this would be
>> somewhat different --- the user-facing syntax could be basically the
>> same, but ultimately we want to generate entries in pg_statistic not
>> pg_statistic_ext_data.  Or at least entries that look the same as what
>> you could find in pg_statistic.

> Yeah. I think we could invent a new type of statistics "expressions"
> which would simply build these per-column stats. So for example
>    CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

I was imagining the type keyword as being "standard" or something
like that, since what it's going to build are the "standard" kinds
of stats for the expression's datatype.  But yeah, has to be some other
keyword than the existing ones.

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum.  I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

            regards, tom lane



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>>> Yeah, it seems likely to me that the infrastructure for this would be
>>> somewhat different --- the user-facing syntax could be basically the
>>> same, but ultimately we want to generate entries in pg_statistic not
>>> pg_statistic_ext_data.  Or at least entries that look the same as what
>>> you could find in pg_statistic.
>
>> Yeah. I think we could invent a new type of statistics "expressions"
>> which would simply build these per-column stats. So for example
>>    CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;
>
>I was imagining the type keyword as being "standard" or something
>like that, since what it's going to build are the "standard" kinds
>of stats for the expression's datatype.  But yeah, has to be some other
>keyword than the existing ones.
>
>The main issue for sticking the results into pg_statistic is that
>the primary key there is (starelid, staattnum), and we haven't got
>a suitable attnum.  I wouldn't much object to putting the data into
>pg_statistic_ext_data, but it doesn't really have a suitable
>rowtype ...

Well, that's why I proposed to essentially build a fake "relation" just
for this purpose. So we'd have a pg_class entry with a special relkind,
attnums and all that. And the expressions would be stored either in
pg_statistic_ext or in a new catalog. But maybe that's nonsense.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: planner support functions: handle GROUP BY estimates ?

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>> The main issue for sticking the results into pg_statistic is that
>> the primary key there is (starelid, staattnum), and we haven't got
>> a suitable attnum.  I wouldn't much object to putting the data into
>> pg_statistic_ext_data, but it doesn't really have a suitable
>> rowtype ...

> Well, that's why I proposed to essentially build a fake "relation" just
> for this purpose. So we'd have a pg_class entry with a special relkind,
> attnums and all that. And the expressions would be stored either in
> pg_statistic_ext or in a new catalog. But maybe that's nonsense.

Seems pretty yucky.  I realize we've already got "fake relations" like
foreign tables and composite types, but the number of special cases
those create is very annoying.  And you still don't have anyplace to
put the expressions themselves in such a structure --- I hope you
weren't going to propose fake pg_index rows for that.

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that.  We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

            regards, tom lane



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>>> The main issue for sticking the results into pg_statistic is that
>>> the primary key there is (starelid, staattnum), and we haven't got
>>> a suitable attnum.  I wouldn't much object to putting the data into
>>> pg_statistic_ext_data, but it doesn't really have a suitable
>>> rowtype ...
>
>> Well, that's why I proposed to essentially build a fake "relation" just
>> for this purpose. So we'd have a pg_class entry with a special relkind,
>> attnums and all that. And the expressions would be stored either in
>> pg_statistic_ext or in a new catalog. But maybe that's nonsense.
>
>Seems pretty yucky.  I realize we've already got "fake relations" like
>foreign tables and composite types, but the number of special cases
>those create is very annoying.  And you still don't have anyplace to
>put the expressions themselves in such a structure --- I hope you
>weren't going to propose fake pg_index rows for that.
>

No, I wasn't going to propose fake pg_index rows.  I actually wrote "stored
either in pg_statistic_ext or in a new catalog", so I was thinking about a new
catalog (a dedicated and simplified copy of pg_index).

>I wonder just how messy it would be to add a column to pg_statistic_ext
>whose type is the composite type "pg_statistic", and drop the required
>data into that.  We've not yet used any composite types in the system
>catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>it seems like we might be able to get away with it.
>

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: planner support functions: handle GROUP BY estimates ?

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>> I wonder just how messy it would be to add a column to pg_statistic_ext
>> whose type is the composite type "pg_statistic", and drop the required
>> data into that.  We've not yet used any composite types in the system
>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>> it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

> I don't know, but feels a bit awkward to store this type of stats into
> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
> work fine, not sure.

If we wanted to allow a single statistics object to contain data for
multiple expressions, we'd actually need that to be array-of-pg_statistic
not just pg_statistic.  Seems do-able, but on the other hand we could
just prohibit having more than one output column in the "query" for this
type of extended statistic.  Either way, this seems far less invasive
than either a new catalog or a new relation relkind (to say nothing of
needing both, which is where you seemed to be headed).

            regards, tom lane



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:
On 1/15/20 12:44 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>> whose type is the composite type "pg_statistic", and drop the required
>>> data into that.  We've not yet used any composite types in the system
>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>> it seems like we might be able to get away with it.
> 
> [ I meant pg_statistic_ext_data, obviously ]
> 
>> I don't know, but feels a bit awkward to store this type of stats into
>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>> work fine, not sure.
> 
> If we wanted to allow a single statistics object to contain data for
> multiple expressions, we'd actually need that to be array-of-pg_statistic
> not just pg_statistic.  Seems do-able, but on the other hand we could
> just prohibit having more than one output column in the "query" for this
> type of extended statistic.  Either way, this seems far less invasive
> than either a new catalog or a new relation relkind (to say nothing of
> needing both, which is where you seemed to be headed).
> 

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistic_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.


1) We don't have any composite types in any catalog yet, and
naive attempts to just use something like

    pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(


2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

    test=# create table t (a pg_statistic[]);
    ERROR:  column "stavalues1" has pseudo-type anyarray

and it seems unlikely just using this in a catalog would make it work.


regards


[1]
https://www.postgresql.org/message-id/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: planner support functions: handle GROUP BY estimates ?

From
Tomas Vondra
Date:

On 11/17/20 5:18 PM, Justin Pryzby wrote:
> On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:
>> On 1/15/20 12:44 AM, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>>>> whose type is the composite type "pg_statistic", and drop the required
>>>>> data into that.  We've not yet used any composite types in the system
>>>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>>>> it seems like we might be able to get away with it.
>>>
>>> [ I meant pg_statistic_ext_data, obviously ]
>>>
>>>> I don't know, but feels a bit awkward to store this type of stats into
>>>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>>>> work fine, not sure.
>>
>> I've started looking at statistics on expressions too, mostly because it
>> seems the extended stats improvements (as discussed in [1]) need that.
>>
>> The "stash pg_statistic records into pg_statistic_ext_data" approach
>> seems simple, but it's not clear to me how to make it work, so I'd
>> appreciate some guidance.
>>
>>
>> 1) We don't have any composite types in any catalog yet, and
>> naive attempts to just use something like
>>
>>     pg_statistic stxdexprs[1];
>>
>> did not work. So I suppose this will require changes to genbki.pl, but
>> honestly, my Perl-fu is non-existent :-(
> 
> In the attached, I didn't need to mess with perl.
> 
>> 2) Won't it be an issue that pg_statistic contains pseudo-types? That
>> is, this does not work, for example:
>>
>>     test=# create table t (a pg_statistic[]);
>>     ERROR:  column "stavalues1" has pseudo-type anyarray
> 
> It works during initdb for the same reason it's allowed for pg_statistic.
> 

Oh, wow! I hadn't expected a patch implementing this; that's great. I
owe you a beer or a drink of your choice.

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company