Thread: planner support functions: handle GROUP BY estimates ?
Tom implemented "Planner support functions": https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b https://www.postgresql.org/docs/12/xfunc-optimization.html I wondered whether there was any consideration to extend that to allow providing improved estimates of "group by". That currently requires manually by creating an expression index, if the function is IMMUTABLE (which is not true for eg. date_trunc of timestamptz). ts=# explain analyze SELECT date_trunc('day', start_time) FROM child.alu_amms_201911 GROUP BY 1; HashAggregate (cost=87.34..98.45 rows=889 width=8) (actual time=1.476..1.482 rows=19 loops=1) ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1; HashAggregate (cost=87.34..98.45 rows=889 width=8) (actual time=1.499..1.500 rows=1 loops=1) ts=# CREATE INDEX ON child.alu_amms_201911 (date_trunc('year',start_time)); ts=# ANALYZE child.alu_amms_201911; ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1; HashAggregate (cost=87.34..87.35 rows=1 width=8) (actual time=1.414..1.414 rows=1 loops=1)
On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
>
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by". That currently requires
> manually creating an expression index, and only works if the function is
> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

I didn't hear back so tried implementing this for date_trunc(). Currently,
the planner assumes that a function outputs as many distinct groups as its
input variable has distinct values. Most invocations of our reports use
date_trunc (or similar), so my earlier attempt to alert on rowcount
misestimates was short-lived.

I currently assume that the input data has 1 second granularity:

|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;
|
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group  (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate  (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group  (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps have (say) hourly granularity, the rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x. I'm trying to think of ways to address
that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes when it would
make the rowcount less than (say) 2 or 100. That would still divide by 24
but might then skip the final /60 or /60/60. Ultimately, that's more
"fudge" than anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc., each with
the given corrective factor built in, which should allow a more accurate
rowcount for input data with granularity of the given number of seconds.
Ideally the factor would be user-specified, but I don't think that's
possible to specify in SQL; the constant has to be built into the C
function. At Telsasoft, our data mostly has 15-minute granularity (900sec),
so we'd maybe make a "date_trunc" function in the user schema which calls
pg_catalog.date_trunc with the support function timestamp_support_10pow3
(a sketch of that follows below). There could be a "base" support function
that accepts a multiplier argument, and then any user-provided C extension
would be a one-liner specifying an arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;

Justin
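P.S. To make option 2 concrete, a sketch: the SUPPORT clause is real v12
syntax, but timestamp_support_10pow3 is the hypothetical support function
named above and does not exist.

    -- Sketch only: timestamp_support_10pow3 is hypothetical.
    -- timestamptz_trunc is the internal C function that implements
    -- pg_catalog.date_trunc(text, timestamptz); creating LANGUAGE
    -- internal functions requires superuser.
    CREATE FUNCTION myschema.date_trunc(text, timestamptz)
        RETURNS timestamptz
        LANGUAGE internal STABLE PARALLEL SAFE
        SUPPORT timestamp_support_10pow3
        AS 'timestamptz_trunc';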
On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> > Tom implemented "Planner support functions":
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> > https://www.postgresql.org/docs/12/xfunc-optimization.html
> >
> > I wondered whether there was any consideration to extend that to allow
> > providing improved estimates of "group by". That currently requires
> > manually creating an expression index, and only works if the function is
> > IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
>
> I didn't hear back so tried implementing this for date_trunc().
> I currently assume that the input data has 1 second granularity:
> ...
> If the input timestamps have (say) hourly granularity, the rowcount will be
> *underestimated* by 3600x, which is worse than the behavior in master of
> overestimating by (for "day") 24x.
>
> I'm trying to think of ways to address that:

In the attached, I handled that by using the histogram and the variable's
initial ndistinct estimate, giving good estimates even for intermediate
granularities of input timestamps.

|postgres=# DROP TABLE IF EXISTS t; CREATE TABLE t(i) AS SELECT a FROM generate_series(now(), now()+'11 day'::interval, '15minutes')a, generate_series(1,9)b; ANALYZE t;
|
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..188.99 rows=264 width=8) (actual time=42.110..42.317 rows=265 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..198.91 rows=1057 width=8) (actual time=41.685..42.264 rows=1057 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..185.83 rows=11 width=8) (actual time=46.672..46.681 rows=12 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('second',i) i FROM t GROUP BY 1;
| HashAggregate  (cost=185.69..198.91 rows=1057 width=8) (actual time=41.816..42.435 rows=1057 loops=1)
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>> Tom implemented "Planner support functions":
>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>> I wondered whether there was any consideration to extend that to allow
>>> providing improved estimates of "group by". That currently requires
>>> manually creating an expression index, and only works if the function is
>>> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

>> I didn't hear back so tried implementing this for date_trunc().
>> ...
>> If the input timestamps have (say) hourly granularity, the rowcount will be
>> *underestimated* by 3600x, which is worse than the behavior in master of
>> overestimating by (for "day") 24x.

While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem. There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression. The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions. (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s). We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.

I'm not sure how much new infrastructure would have to be built
for this. We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it. I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.

			regards, tom lane
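To illustrate, the proposed flow would look something like the sketch below,
assuming the syntax above were implemented (it was not, at the time of this
mail); the table and statistics names are made up:

    CREATE TABLE my_table (id int, mod_time timestamptz);
    -- proposed syntax: cue ANALYZE to gather standard per-column stats
    -- for the expression's value
    CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
    ANALYZE my_table;
    -- the planner could then estimate this as well as a plain column
    -- reference, without the maintenance cost of an expression index:
    EXPLAIN SELECT date_trunc('day', mod_time) FROM my_table GROUP BY 1;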
On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>Justin Pryzby <pryzby@telsasoft.com> writes:
>> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>>> Tom implemented "Planner support functions":
>>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>>> I wondered whether there was any consideration to extend that to allow
>>>> providing improved estimates of "group by". That currently requires
>>>> manually creating an expression index, and only works if the function
>>>> is IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
>
>>> I didn't hear back so tried implementing this for date_trunc().
>>> ...
>>> If the input timestamps have (say) hourly granularity, the rowcount will
>>> be *underestimated* by 3600x, which is worse than the behavior in master
>>> of overestimating by (for "day") 24x.
>
>While I don't have any objection in principle to extending the set of
>things planner support functions can do, it doesn't seem like the idea is
>giving you all that much traction for this problem. There isn't that much
>knowledge that's specific to date_trunc in this, and instead you've got a
>bunch of generic problems (that would have to be solved again in every
>other function's planner support).
>
>Another issue is that it seems like this doesn't compose nicely ---
>if the GROUP BY expression is "f(g(x))", how do f's support function
>and g's support function interact?
>
>The direction that I've been wanting to go in for this kind of problem
>is to allow CREATE STATISTICS on an expression, ie if you were concerned
>about the estimation accuracy for GROUP BY or anything else, you could do
>something like
>
>CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
>
>This would have the effect of cueing ANALYZE to gather stats on the
>value of that expression, which the planner could then use, very much
>as if you'd created an index on the expression. The advantages of
>doing this rather than making an index are
>
>(1) you don't have to pay the maintenance costs for an index,
>
>(2) we don't have to restrict it to immutable expressions. (Volatile
>expressions would have to be disallowed, if only because of fear of
>side-effects; but I think we could allow stable expressions just fine.
>Worst case problem is that the stats are stale, but so what?)
>
>With a solution like this, we don't have to solve any of the difficult
>problems of how the pieces of the expression interact with each other
>or with the statistics of the underlying column(s). We just use the
>stats if available, and the estimate will be as good as it'd be for
>a plain column reference.
>
>I'm not sure how much new infrastructure would have to be built
>for this. We designed the CREATE STATISTICS syntax to support
>this (partly at my insistence IIRC) but I do not think any of the
>existing plumbing is ready for it. I don't think it'd be very
>hard to plug this into ANALYZE or the planner, but there might be
>quite some work to be done on the catalog infrastructure, pg_dump,
>etc.
>
>cc'ing Tomas in case he has any thoughts about it.

Well, I certainly do have thoughts about this - it's pretty much exactly
what I proposed yesterday in this thread:

https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development

The third part of that patch series is exactly about supporting extended
statistics on expressions, in roughly the way you described here.
The current status of the WIP patch is that grammar + ANALYZE mostly work,
but there is no support in the planner. It's obviously still very hackish.

The main thing I'm not sure about is how to represent this in the catalogs
- whether to have two fields (like for indexes) or maybe a single list of
expressions.

I'm also wondering if we could/should rely 100% on extended statistics,
because those are really meant to track correlations between columns, which
means we currently require at least two attributes in CREATE STATISTICS and
so on. So maybe what we want is to collect "regular" per-column stats just
like we do for indexes, but without the index maintenance overhead? The
advantage would be that we'd get exactly the same stats as for indexes, and
we could use them in the same places out of the box, while with extended
stats we'll have to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having a
relation (i.e. table / index / ...), but maybe we could invent a new
relation type for this purpose. Of course, it'd require some catalog work
to represent this ...

Ultimately I think we'd want both things; it's not one or the other.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>> cc'ing Tomas in case he has any thoughts about it.

> Well, I certainly do have thoughts about this - it's pretty much exactly
> what I proposed yesterday in this thread:
> https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
> The third part of that patch series is exactly about supporting extended
> statistics on expressions, in roughly the way you described here. The
> current status of the WIP patch is that grammar + ANALYZE mostly work,
> but there is no support in the planner. It's obviously still very hackish.

Cool. We should probably take the discussion to that thread, then.

> I'm also wondering if we could/should rely 100% on extended statistics,
> because those are really meant to track correlations between columns,

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data. Or at least entries that look the same as what
you could find in pg_statistic.

			regards, tom lane
On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>>> cc'ing Tomas in case he has any thoughts about it.
>
>> Well, I certainly do have thoughts about this - it's pretty much exactly
>> what I proposed yesterday in this thread:
>> https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
>> The third part of that patch series is exactly about supporting extended
>> statistics on expressions, in roughly the way you described here.
>
>Cool. We should probably take the discussion to that thread, then.
>
>> I'm also wondering if we could/should rely 100% on extended statistics,
>> because those are really meant to track correlations between columns,
>
>Yeah, it seems likely to me that the infrastructure for this would be
>somewhat different --- the user-facing syntax could be basically the
>same, but ultimately we want to generate entries in pg_statistic not
>pg_statistic_ext_data. Or at least entries that look the same as what
>you could find in pg_statistic.

Yeah. I think we could invent a new type of statistics, "expressions",
which would simply build these per-column stats. So for example

CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

would build per-column stats stored in pg_statistic, while

CREATE STATISTICS s (mcv) ON (a*b), sqrt(c) FROM t;

would build the multi-column MCV list on the expressions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>> Yeah, it seems likely to me that the infrastructure for this would be
>> somewhat different --- the user-facing syntax could be basically the
>> same, but ultimately we want to generate entries in pg_statistic not
>> pg_statistic_ext_data. Or at least entries that look the same as what
>> you could find in pg_statistic.

> Yeah. I think we could invent a new type of statistics, "expressions",
> which would simply build these per-column stats. So for example
> CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

I was imagining the type keyword as being "standard" or something
like that, since what it's going to build are the "standard" kinds
of stats for the expression's datatype. But yeah, it has to be some
other keyword than the existing ones.

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum. I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

			regards, tom lane
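For reference, the keying constraint described here can be seen directly in
the catalog (an illustrative query, not part of any proposal):

    -- every pg_statistic row is identified by a relation and an attribute
    -- number; an expression that is not a stored column has neither
    SELECT starelid::regclass, staattnum FROM pg_statistic LIMIT 3;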
On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:
>>> Yeah, it seems likely to me that the infrastructure for this would be
>>> somewhat different --- the user-facing syntax could be basically the
>>> same, but ultimately we want to generate entries in pg_statistic not
>>> pg_statistic_ext_data. Or at least entries that look the same as what
>>> you could find in pg_statistic.
>
>> Yeah. I think we could invent a new type of statistics, "expressions",
>> which would simply build these per-column stats. So for example
>> CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;
>
>I was imagining the type keyword as being "standard" or something
>like that, since what it's going to build are the "standard" kinds
>of stats for the expression's datatype. But yeah, it has to be some
>other keyword than the existing ones.
>
>The main issue for sticking the results into pg_statistic is that
>the primary key there is (starelid, staattnum), and we haven't got
>a suitable attnum. I wouldn't much object to putting the data into
>pg_statistic_ext_data, but it doesn't really have a suitable
>rowtype ...

Well, that's why I proposed to essentially build a fake "relation" just
for this purpose. So we'd have a pg_class entry with a special relkind,
attnums and all that. And the expressions would be stored either in
pg_statistic_ext or in a new catalog. But maybe that's nonsense.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>> The main issue for sticking the results into pg_statistic is that
>> the primary key there is (starelid, staattnum), and we haven't got
>> a suitable attnum. I wouldn't much object to putting the data into
>> pg_statistic_ext_data, but it doesn't really have a suitable
>> rowtype ...

> Well, that's why I proposed to essentially build a fake "relation" just
> for this purpose. So we'd have a pg_class entry with a special relkind,
> attnums and all that. And the expressions would be stored either in
> pg_statistic_ext or in a new catalog. But maybe that's nonsense.

Seems pretty yucky. I realize we've already got "fake relations" like
foreign tables and composite types, but the number of special cases
those create is very annoying. And you still don't have anyplace to
put the expressions themselves in such a structure --- I hope you
weren't going to propose fake pg_index rows for that.

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

			regards, tom lane
On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:
>>> The main issue for sticking the results into pg_statistic is that
>>> the primary key there is (starelid, staattnum), and we haven't got
>>> a suitable attnum. I wouldn't much object to putting the data into
>>> pg_statistic_ext_data, but it doesn't really have a suitable
>>> rowtype ...
>
>> Well, that's why I proposed to essentially build a fake "relation" just
>> for this purpose. So we'd have a pg_class entry with a special relkind,
>> attnums and all that. And the expressions would be stored either in
>> pg_statistic_ext or in a new catalog. But maybe that's nonsense.
>
>Seems pretty yucky. I realize we've already got "fake relations" like
>foreign tables and composite types, but the number of special cases
>those create is very annoying. And you still don't have anyplace to
>put the expressions themselves in such a structure --- I hope you
>weren't going to propose fake pg_index rows for that.

No, I wasn't going to propose fake pg_index rows. I actually wrote
"stored either in pg_statistic_ext or in a new catalog", so I was
thinking of a new catalog (a dedicated and simplified copy of pg_index).

>I wonder just how messy it would be to add a column to pg_statistic_ext
>whose type is the composite type "pg_statistic", and drop the required
>data into that. We've not yet used any composite types in the system
>catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>it seems like we might be able to get away with it.

I don't know, but it feels a bit awkward to store this type of stats in
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>> I wonder just how messy it would be to add a column to pg_statistic_ext
>> whose type is the composite type "pg_statistic", and drop the required
>> data into that. We've not yet used any composite types in the system
>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>> it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

> I don't know, but it feels a bit awkward to store this type of stats in
> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
> work fine, not sure.

If we wanted to allow a single statistics object to contain data for
multiple expressions, we'd actually need that to be array-of-pg_statistic
not just pg_statistic. Seems do-able, but on the other hand we could
just prohibit having more than one output column in the "query" for this
type of extended statistic. Either way, this seems far less invasive
than either a new catalog or a new relation relkind (to say nothing of
needing both, which is where you seemed to be headed).

			regards, tom lane
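Schematically, the idea being floated amounts to something like the sketch
below. This shows the shape only: the column name is made up, and a real
change would be made in the catalog headers, not via ALTER TABLE (which is
not allowed on system catalogs without allow_system_table_mods):

    -- one pg_statistic-shaped record per expression in the statistics object
    ALTER TABLE pg_statistic_ext_data ADD COLUMN stxdexprs pg_statistic[];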
On 1/15/20 12:44 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>> whose type is the composite type "pg_statistic", and drop the required
>>> data into that. We've not yet used any composite types in the system
>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>> it seems like we might be able to get away with it.
>
> [ I meant pg_statistic_ext_data, obviously ]
>
>> I don't know, but it feels a bit awkward to store this type of stats in
>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>> work fine, not sure.

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistic_ext_data" approach seems
simple, but it's not clear to me how to make it work, so I'd appreciate
some guidance.

1) We don't have any composite types in any catalog yet, and naive attempts
to just use something like

    pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(

2) Won't it be an issue that pg_statistic contains pseudo-types? That is,
this does not work, for example:

    test=# create table t (a pg_statistic[]);
    ERROR:  column "stavalues1" has pseudo-type anyarray

and it seems unlikely that just using this in a catalog would make it work.

regards

[1] https://www.postgresql.org/message-id/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/17/20 5:18 PM, Justin Pryzby wrote:
> On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:
>> On 1/15/20 12:44 AM, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>>>> whose type is the composite type "pg_statistic", and drop the required
>>>>> data into that. We've not yet used any composite types in the system
>>>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>>>> it seems like we might be able to get away with it.
>>>
>>> [ I meant pg_statistic_ext_data, obviously ]
>>>
>>>> I don't know, but it feels a bit awkward to store this type of stats in
>>>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>>>> work fine, not sure.
>>
>> I've started looking at statistics on expressions too, mostly because it
>> seems the extended stats improvements (as discussed in [1]) need that.
>>
>> The "stash pg_statistic records into pg_statistic_ext_data" approach
>> seems simple, but it's not clear to me how to make it work, so I'd
>> appreciate some guidance.
>>
>> 1) We don't have any composite types in any catalog yet, and naive
>> attempts to just use something like
>>
>>     pg_statistic stxdexprs[1];
>>
>> did not work. So I suppose this will require changes to genbki.pl, but
>> honestly, my Perl-fu is non-existent :-(
>
> In the attached, I didn't need to mess with perl.
>
>> 2) Won't it be an issue that pg_statistic contains pseudo-types? That
>> is, this does not work, for example:
>>
>>     test=# create table t (a pg_statistic[]);
>>     ERROR:  column "stavalues1" has pseudo-type anyarray
>
> It works during initdb for the same reason that it's allowed for
> pg_statistic.

Oh, wow! I hadn't expected a patch implementing this; that's great. I owe
you a beer or a drink of your choice. Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company