Re: planner support functions: handle GROUP BY estimates ? - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: planner support functions: handle GROUP BY estimates ?
Msg-id:         20200114205349.m32jgrf77sbxsziy@development
In response to: Re: planner support functions: handle GROUP BY estimates ? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: planner support functions: handle GROUP BY estimates ?
List:           pgsql-hackers

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>Justin Pryzby <pryzby@telsasoft.com> writes:
>> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>>> Tom implemented "Planner support functions":
>>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>>> I wondered whether there was any consideration to extend that to allow
>>>> providing improved estimates of "group by". That currently requires
>>>> manually creating an expression index, if the function is IMMUTABLE
>>>> (which is not true for e.g. date_trunc of timestamptz).
>
>>> I didn't hear back so tried implementing this for date_trunc(). Currently, the
>>> ...
>>> If the input timestamps have (say) hourly granularity, rowcount will be
>>> *underestimated* by 3600x, which is worse than the behavior in master of
>>> overestimating by (for "day") 24x.
>
>While I don't have any objection in principle to extending the set of
>things planner support functions can do, it doesn't seem like the idea is
>giving you all that much traction for this problem.  There isn't that much
>knowledge that's specific to date_trunc in this, and instead you've got a
>bunch of generic problems (that would have to be solved again in every
>other function's planner support).
>
>Another issue is that it seems like this doesn't compose nicely ---
>if the GROUP BY expression is "f(g(x))", how do f's support function
>and g's support function interact?
>
>The direction that I've been wanting to go in for this kind of problem
>is to allow CREATE STATISTICS on an expression, i.e. if you were concerned
>about the estimation accuracy for GROUP BY or anything else, you could do
>something like
>
>CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
>
>This would have the effect of cueing ANALYZE to gather stats on the
>value of that expression, which the planner could then use, very much
>as if you'd created an index on the expression.  The advantages of
>doing this rather than making an index are
>
>(1) you don't have to pay the maintenance costs for an index,
>
>(2) we don't have to restrict it to immutable expressions.  (Volatile
>expressions would have to be disallowed, if only because of fear of
>side-effects; but I think we could allow stable expressions just fine.
>Worst case problem is that the stats are stale, but so what?)
>
>With a solution like this, we don't have to solve any of the difficult
>problems of how the pieces of the expression interact with each other
>or with the statistics of the underlying column(s).  We just use the
>stats if available, and the estimate will be as good as it'd be for
>a plain column reference.
>
>I'm not sure how much new infrastructure would have to be built
>for this.  We designed the CREATE STATISTICS syntax to support
>this (partly at my insistence IIRC) but I do not think any of the
>existing plumbing is ready for it.  I don't think it'd be very
>hard to plug this into ANALYZE or the planner, but there might be
>quite some work to be done on the catalog infrastructure, pg_dump,
>etc.
>
>cc'ing Tomas in case he has any thoughts about it.
>

Well, I certainly do have thoughts about this - it's pretty much exactly
what I proposed yesterday in this thread:

https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development

The third part of that patch series is exactly about supporting extended
statistics on expressions, along the lines you describe here.
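To make that concrete, here is a rough sketch of the two approaches side
by side, reusing Tom's my_table / mod_time names (mod_time is assumed to
be a plain timestamp column here, since with timestamptz the date_trunc
expression is only stable and the index variant is rejected - exactly the
limitation Justin mentioned). Only the expression-index form works on
released versions; the CREATE STATISTICS form is the proposed syntax:

  CREATE TABLE my_table (id int, mod_time timestamp);

  -- today's workaround: build (and maintain) an index just so that
  -- ANALYZE collects stats on the expression
  CREATE INDEX my_table_day_idx ON my_table (date_trunc('day', mod_time));
  ANALYZE my_table;

  -- proposed: stats on the expression, no index (not implemented yet)
  CREATE STATISTICS my_table_day_stats
      ON date_trunc('day', mod_time) FROM my_table;

  -- the estimate both approaches aim to improve
  EXPLAIN SELECT date_trunc('day', mod_time) FROM my_table GROUP BY 1;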
The current status of the WIP patch is that the grammar and ANALYZE
mostly work, but there is no support in the planner yet. It's obviously
still very hackish.

The main thing I'm not sure about is how to represent this in the
catalogs - whether to have two fields (like we do for indexes) or maybe
a single list of expressions.

I'm also wondering whether we could/should rely 100% on extended
statistics, because those are really meant to track correlations
between columns - which is why we currently require at least two
attributes in CREATE STATISTICS, and so on. So maybe what we want is to
collect "regular" per-column stats, just like we do for indexes, but
without the index maintenance overhead? The advantage would be that
we'd get exactly the same stats as for indexes, and we could use them
in the same places out of the box, while with extended stats we'd have
to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having
a relation (i.e. table / index / ...), but maybe we could invent a new
relation type for this purpose. Of course, it'd require some catalog
work to represent this ...

Ultimately I think we'd want both things - it's not one or the other.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
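PS: To see what "the same stats as for indexes" means in practice,
continuing the my_table sketch from above (hypothetical names, default
settings assumed), the per-expression stats collected via the index
workaround show up in pg_stats under the *index* relation:

  -- populate with hourly timestamps so there is something to estimate
  INSERT INTO my_table
    SELECT i, timestamp '2020-01-01' + i * interval '1 hour'
      FROM generate_series(1, 100000) i;
  ANALYZE my_table;

  -- the expression's stats hang off the index, not the table;
  -- n_distinct (roughly the number of distinct days) is what the
  -- GROUP BY estimate can use, and what we'd like to collect without
  -- having to maintain the index
  SELECT tablename, attname, n_distinct
    FROM pg_stats
   WHERE tablename = 'my_table_day_idx';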