Thread: Additional Statistics Hooks

Additional Statistics Hooks

From: Mat Arye
Hi All,

I have a question about statistics hooks. I am trying to teach the planner that grouping by something like date_trunc('1 day', time) will produce far fewer rows than the number of distinct time values. I want to do that in an extension. The problem is that I don't see a way to make the get_relation_stats_hook work well for that, since by the time it's called you only see the `time` var and not the full expression. None of the other hooks seem appropriate either. So 2 questions:

1) Would people be opposed to adding a code hook somewhere at the start of `examine_variable` (selfuncs.c) to allow extensions to supply statistics for complete expressions? (A rough sketch follows below.) I can submit a patch if this seems reasonable.

2) Do patches that add code hooks (and are probably under 10 lines) need to go through the entire commitfest process? I guess what I am really asking is if PG12 would be the first version such a patch could appear in, or is PG11 still a possibility? Just wondering what the policy on such stuff is.
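
For concreteness, here is a rough sketch of what I mean in (1). The hook
type and the call site are hypothetical (nothing like this exists in
selfuncs.c today); it just mirrors how get_relation_stats_hook is wired up:

    typedef bool (*examine_variable_hook_type) (PlannerInfo *root,
                                                Node *node,
                                                int varRelid,
                                                VariableStatData *vardata);

    examine_variable_hook_type examine_variable_hook = NULL;

    void
    examine_variable(PlannerInfo *root, Node *node, int varRelid,
                     VariableStatData *vardata)
    {
        /*
         * Give an extension first crack at the whole expression, e.g.
         * date_trunc('1 day', time), before it gets boiled down to a Var.
         * Returning true means the hook filled in *vardata.
         */
        if (examine_variable_hook &&
            (*examine_variable_hook) (root, node, varRelid, vardata))
            return;

        /* ... existing logic unchanged ... */
    }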

Thanks,
Mat
TimescaleDB

Re: Additional Statistics Hooks

From: Euler Taveira
2018-03-12 14:03 GMT-03:00 Mat Arye <mat@timescale.com>:
> I have a question about statistics hooks. I am trying to teach the planner
> that grouping by something like date_trunc('1 day', time) will produce far
> fewer rows than the number of distinct time values. I want to do that in an
> extension. The problem is that I don't see a way to make the
> get_relation_stats_hook work well for that, since by the time it's called
> you only see the `time` var and not the full expression. None of the other
> hooks seem appropriate either. So 2 questions:
>
Isn't this a case for extending the available hook?

> 1) Would people be opposed to adding a code hook somewhere at the start of
> `examine_variable` (selfuncs.c) to allow extensions to supply statistics
> for complete expressions? I can submit a patch if this seems reasonable.
>
If you explain the use case, maybe it can be considered.

> 2) Do patches that add code hooks (and are probably under 10 lines) need to
> go through the entire commitfest process? I guess what I am really asking
> is if PG12 would be the first version such a patch could appear in, or is
> PG11 still a possibility? Just wondering what the policy on such stuff is.
>
If it is a new feature and is not in the last CF, it won't be
considered for v11 (even small patches).


-- 
   Euler Taveira                          Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training


Re: Additional Statistics Hooks

From: Mat Arye


On Mon, Mar 12, 2018 at 2:52 PM, Euler Taveira <euler@timbira.com.br> wrote:
> 2018-03-12 14:03 GMT-03:00 Mat Arye <mat@timescale.com>:
>> I have a question about statistics hooks. I am trying to teach the planner
>> that grouping by something like date_trunc('1 day', time) will produce far
>> fewer rows than the number of distinct time values. I want to do that in
>> an extension. The problem is that I don't see a way to make the
>> get_relation_stats_hook work well for that, since by the time it's called
>> you only see the `time` var and not the full expression. None of the other
>> hooks seem appropriate either. So 2 questions:
>>
> Isn't this a case for extending the available hook?
>
>> 1) Would people be opposed to adding a code hook somewhere at the start of
>> `examine_variable` (selfuncs.c) to allow extensions to supply statistics
>> for complete expressions? I can submit a patch if this seems reasonable.
>>
> If you explain the use case, maybe it can be considered.

So the use-case is an analytical query like 

SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
FROM hyper
WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
GROUP BY MetricMinuteTs
ORDER BY MetricMinuteTs DESC;

Right now this query will choose a much-less-efficient GroupAggregate plan
instead of a HashAggregate, because the planner thinks the number of groups
produced here is 9,000,000, since that's the number of distinct time values.
But because date_trunc "buckets" the values, there will be about 24 groups
(one for each hour).

>> 2) Do patches that add code hooks (and are probably under 10 lines) need
>> to go through the entire commitfest process? I guess what I am really
>> asking is if PG12 would be the first version such a patch could appear in,
>> or is PG11 still a possibility? Just wondering what the policy on such
>> stuff is.
>>
> If it is a new feature and is not in the last CF, it won't be
> considered for v11 (even small patches).



Re: Additional Statistics Hooks

From: Tom Lane
Mat Arye <mat@timescale.com> writes:
> So the use-case is an analytical query like

> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
> FROM hyper
> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
> GROUP BY MetricMinuteTs
> ORDER BY MetricMinuteTs DESC;

> Right now this query will choose a much-less-efficient GroupAggregate plan
> instead of a HashAggregate, because the planner thinks the number of groups
> produced here is 9,000,000, since that's the number of distinct time values.
> But because date_trunc "buckets" the values, there will be about 24 groups
> (one for each hour).

While it would certainly be nice to have better behavior for that,
"add a hook so users who can write C can fix it by hand" doesn't seem
like a great solution.  On top of the sheer difficulty of writing a
hook function, you'd have the problem that no pre-written hook could
know about all available functions.  I think somehow we'd need a way
to add per-function knowledge, perhaps roughly like the protransform
feature.
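
Roughly, and with entirely hypothetical names, I'm imagining each such
function carrying its own estimator that the planner consults when it
sees the function applied to a Var:

    /*
     * Hypothetical per-function support, in the style of protransform:
     * given statistics for the function's input, return the estimated
     * number of distinct outputs.
     */
    typedef double (*ndistinct_support_fn) (PlannerInfo *root,
                                            FuncExpr *fexpr,
                                            double input_ndistinct);

date_trunc's estimator would know to look at its constant first argument,
and extension-provided functions would install their own.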

            regards, tom lane


Re: Additional Statistics Hooks

From: David Rowley
On 13 March 2018 at 11:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While it would certainly be nice to have better behavior for that,
> "add a hook so users who can write C can fix it by hand" doesn't seem
> like a great solution.  On top of the sheer difficulty of writing a
> hook function, you'd have the problem that no pre-written hook could
> know about all available functions.  I think somehow we'd need a way
> to add per-function knowledge, perhaps roughly like the protransform
> feature.

I always imagined that extended statistics could be used for this.
Right now the estimates are much better when you create an index on
the function, but there's no real reason to limit the stats that are
gathered to just plain columns + expression indexes.

I believe I'm not the only person to have considered this. Originally
extended statistics were named multivariate statistics. I think it was
Dean and I (maybe others too) that suggested to Tomas to give the
feature a more generic name so that it can be used for a more general
purpose later.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Additional Statistics Hooks

From: Ashutosh Bapat
On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mat Arye <mat@timescale.com> writes:
>> So the use-case is an analytical query like
>
>> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
>> FROM hyper
>> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
>> GROUP BY MetricMinuteTs
>> ORDER BY MetricMinuteTs DESC;
>
>> Right now this query will choose a much-less-efficient GroupAggregate plan
>> instead of a HashAggregate, because the planner thinks the number of groups
>> produced here is 9,000,000, since that's the number of distinct time
>> values. But because date_trunc "buckets" the values, there will be about
>> 24 groups (one for each hour).
>
> While it would certainly be nice to have better behavior for that,
> "add a hook so users who can write C can fix it by hand" doesn't seem
> like a great solution.  On top of the sheer difficulty of writing a
> hook function, you'd have the problem that no pre-written hook could
> know about all available functions.  I think somehow we'd need a way
> to add per-function knowledge, perhaps roughly like the protransform
> feature.

Like the cost associated with a function, we could associate a mapping
cardinality with a function. It tells how many distinct input values map to
one output value (by input value, I mean the input argument tuple). In Mat's
case the mapping cardinality would be 12. The number of distinct values the
function may output is then estimated as the estimated number of rows
divided by the function's mapping cardinality.
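
As a sketch (hypothetical helper, just to show the arithmetic):

    /* distinct outputs = estimated input rows / mapping cardinality */
    double
    estimate_output_ndistinct(double input_rows, double mapping_cardinality)
    {
        if (mapping_cardinality < 1.0)
            mapping_cardinality = 1.0;  /* defend against bad values */
        return input_rows / mapping_cardinality;
    }

So with 9,000,000 distinct input values and a mapping cardinality of 12, we
would estimate 750,000 distinct outputs.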

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Additional Statistics Hooks

From: Mat Arye


On Tue, Mar 13, 2018 at 6:56 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Mat Arye <mat@timescale.com> writes:
>>> So the use-case is an analytical query like
>>>
>>> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
>>> FROM hyper
>>> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
>>> GROUP BY MetricMinuteTs
>>> ORDER BY MetricMinuteTs DESC;
>>>
>>> Right now this query will choose a much-less-efficient GroupAggregate
>>> plan instead of a HashAggregate, because the planner thinks the number
>>> of groups produced here is 9,000,000, since that's the number of
>>> distinct time values. But because date_trunc "buckets" the values,
>>> there will be about 24 groups (one for each hour).
>>
>> While it would certainly be nice to have better behavior for that,
>> "add a hook so users who can write C can fix it by hand" doesn't seem
>> like a great solution.  On top of the sheer difficulty of writing a
>> hook function, you'd have the problem that no pre-written hook could
>> know about all available functions.  I think somehow we'd need a way
>> to add per-function knowledge, perhaps roughly like the protransform
>> feature.
>
> Like the cost associated with a function, we could associate a mapping
> cardinality with a function. It tells how many distinct input values map
> to one output value (by input value, I mean the input argument tuple). In
> Mat's case the mapping cardinality would be 12. The number of distinct
> values the function may output is then estimated as the estimated number
> of rows divided by the function's mapping cardinality.

I think this is complicated by the fact that the mapping cardinality is not a
constant per function, but depends on the constant given as the first
argument to the function and on the granularity of the underlying data (do
you have second granularity or microsecond granularity?). I actually think
the logic for the estimate here should be (max(time) - min(time)) / interval.
I think that to be general, you need to allow functions on statistics to
determine the estimate.
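
Something like this (a hypothetical helper; it assumes we can get min/max
for the column from its statistics and that the bucket width is a constant):

    /* number of groups over the grouped range, for a constant bucket width */
    double
    estimate_bucket_groups(TimestampTz min_time, TimestampTz max_time,
                           int64 bucket_width_usec)
    {
        /* TimestampTz is in microseconds, so this is (max - min) / interval */
        return ((double) (max_time - min_time)) / bucket_width_usec + 1.0;
    }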


Re: Additional Statistics Hooks

From: Mat Arye


On Tue, Mar 13, 2018 at 6:31 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 13 March 2018 at 11:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While it would certainly be nice to have better behavior for that,
>> "add a hook so users who can write C can fix it by hand" doesn't seem
>> like a great solution.  On top of the sheer difficulty of writing a
>> hook function, you'd have the problem that no pre-written hook could
>> know about all available functions.  I think somehow we'd need a way
>> to add per-function knowledge, perhaps roughly like the protransform
>> feature.

I think this isn't either-or. A general hook can be useful for extensions
that want to optimize particular data distributions/workloads using domain
knowledge about functions common in those workloads. That way, users working
with that data can use extensions to optimize workloads without writing C
themselves. I also think a protransform-like feature would add a lot of power
to the native planner, but it could take a while to get into core properly
and may not handle all kinds of data distributions/cases.

An example of a case a protransform-type system would not be able to optimize
is mathematical operator expressions, like bucketing integers by decile:
(integer / 10) * 10. This is somewhat analogous to date_trunc in the integer
space and would also change the number of resulting distinct rows.

> I always imagined that extended statistics could be used for this.
> Right now the estimates are much better when you create an index on
> the function, but there's no real reason to limit the stats that are
> gathered to just plain columns + expression indexes.
>
> I believe I'm not the only person to have considered this. Originally
> extended statistics were named multivariate statistics. I think it was
> Dean and I (maybe others too) that suggested to Tomas to give the
> feature a more generic name so that it can be used for a more general
> purpose later.

I also think the point about extended statistics is a good one, and it points
to the need for more experimentation/experience, which I think a C hook is
better suited for. Putting in a hook will allow extension writers like us to
experiment and figure out the kinds of transforms on statistics that are
useful, while having a small footprint on the core. I think designing a
protransform-like system would benefit from more experience with the kinds of
transformations that are useful. For example, can anything be done if the
interval passed to date_trunc is not constant, or is that case not even worth
bothering with? Maybe extended statistics are a better approach, etc.


Re: Additional Statistics Hooks

From: Tom Lane
Mat Arye <mat@timescale.com> writes:
> An example of a case a protransform-type system would not be able to
> optimize is mathematical operator expressions, like bucketing integers by
> decile: (integer / 10) * 10.

Uh, why not?  An estimation function that is specific to integer divide
shouldn't have much trouble figuring out that x/10 has one-tenth as many
distinct values as x does.  I'd certainly rather have that knowledge
associated directly with int4div, and the corresponding knowledge about
date_trunc associated with that function, and similar knowledge about
extension-provided operators provided by the extensions, than try to
maintain a hook function that embeds all such knowledge.
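
Concretely, something like this (names hypothetical, and ignoring how it
would be attached to the function):

    /*
     * Knowledge specific to int4div: x / k has roughly ndistinct(x) / |k|
     * distinct values, and never fewer than one.
     */
    static double
    int4div_ndistinct(double input_ndistinct, int32 divisor)
    {
        double      d;

        if (divisor == 0)
            return input_ndistinct; /* the division will fail at runtime anyway */
        d = input_ndistinct / fabs((double) divisor);
        return (d < 1.0) ? 1.0 : d;
    }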

> I also think the point about extended statistics is a good one, and it
> points to the need for more experimentation/experience, which I think
> a C hook is better suited for. Putting in a hook will allow extension
> writers like us to experiment and figure out the kinds of transforms on
> statistics that are useful, while having a small footprint on the core.

If you're experimenting you might as well just change the source code.
A hook is only useful if you're trying to ship something for production,
and I doubt that factorizing things this way is a credible production
solution.

            regards, tom lane


Re: Additional Statistics Hooks

From: Ashutosh Bapat
On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <mat@timescale.com> wrote:
>>
>> Like the cost associated with a function, we could associate a mapping
>> cardinality with a function. It tells how many distinct input values map
>> to one output value (by input value, I mean the input argument tuple). In
>> Mat's case the mapping cardinality would be 12. The number of distinct
>> values the function may output is then estimated as the estimated number
>> of rows divided by the function's mapping cardinality.
>
>
> I think this is complicated by the fact that the mapping cardinality is
> not a constant per function, but depends on the constant given as the
> first argument to the function and on the granularity of the underlying
> data (do you have second granularity or microsecond granularity?). I
> actually think the logic for the estimate here should be
> (max(time) - min(time)) / interval. I think that to be general, you need
> to allow functions on statistics to determine the estimate.
>

I think my solution was quite short-sighted. You are right. We need a
function that takes statistics about the input arguments and outputs
statistics about the result. The planner can then use those statistics
to arrive at various estimates.
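
In rough terms, the interface could be (hypothetical names, just the shape):

    /*
     * Per-function statistics transform: given the planner's statistics
     * for the argument expressions, produce statistics for the result.
     */
    typedef void (*function_stats_transform) (PlannerInfo *root,
                                              FuncExpr *fexpr,
                                              VariableStatData *argstats,
                                              VariableStatData *resultstats);

For date_trunc, such a transform could derive ndistinct for the result from
the argument's min/max and the constant interval, as Mat describes.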

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Additional Statistics Hooks

From: Tomas Vondra

On 03/15/2018 06:00 AM, Ashutosh Bapat wrote:
> On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <mat@timescale.com> wrote:
>>>
>>> Like the cost associated with a function, we could associate a mapping
>>> cardinality with a function. It tells how many distinct input values map
>>> to one output value (by input value, I mean the input argument tuple). In
>>> Mat's case the mapping cardinality would be 12. The number of distinct
>>> values the function may output is then estimated as the estimated number
>>> of rows divided by the function's mapping cardinality.
>>
>>
>> I think this is complicated by the fact that the mapping cardinality is
>> not a constant per function, but depends on the constant given as the
>> first argument to the function and on the granularity of the underlying
>> data (do you have second granularity or microsecond granularity?). I
>> actually think the logic for the estimate here should be
>> (max(time) - min(time)) / interval. I think that to be general, you need
>> to allow functions on statistics to determine the estimate.
>>
> 
> I think my solution was quite short-sighted. You are right. We need a
> function that takes statistics about the input arguments and outputs
> statistics about the result. The planner can then use those statistics
> to arrive at various estimates.
> 

I think the best solution is to extend CREATE STATISTICS so that it
handles things like

    CREATE STATISTICS s ON date_trunc('day', column) FROM table

As David mentioned elsewhere in this thread, this was considered before
CREATE STATISTICS was introduced in PG10, and it's why the feature is
called 'extended' and not 'multivariate'.

It would give us the same stats as we have for expression indexes, but
without the extra overhead.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Additional Statistics Hooks

From: Ashutosh Bapat
On Thu, Mar 15, 2018 at 7:59 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 03/15/2018 06:00 AM, Ashutosh Bapat wrote:
>> On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <mat@timescale.com> wrote:
>>>>
>>>> Like the cost associated with a function, we could associate a mapping
>>>> cardinality with a function. It tells how many distinct input values
>>>> map to one output value (by input value, I mean the input argument
>>>> tuple). In Mat's case the mapping cardinality would be 12. The number
>>>> of distinct values the function may output is then estimated as the
>>>> estimated number of rows divided by the function's mapping cardinality.
>>>
>>>
>>> I think this is complicated by the fact that the mapping cardinality is
>>> not a constant per function, but depends on the constant given as the
>>> first argument to the function and on the granularity of the underlying
>>> data (do you have second granularity or microsecond granularity?). I
>>> actually think the logic for the estimate here should be
>>> (max(time) - min(time)) / interval. I think that to be general, you need
>>> to allow functions on statistics to determine the estimate.
>>>
>>
>> I think my solution was quite short-sighted. You are right. We need a
>> function that takes statistics about the input arguments and outputs
>> statistics about the result. The planner can then use those statistics
>> to arrive at various estimates.
>>
>
> I think the best solution is to extend CREATE STATISTICS so that it
> handles things like
>
>     CREATE STATISTICS s ON date_trunc('day', column) FROM table
>

I think we need both kinds of solutions here. My proposal would work
for commonly used functions like int4div(). There is no point in
creating statistics for every expression where int4div() is used;
there will be many queries that use this function, and it's hard to
spot, since it's invoked via the '/' operator. There will also be many
tables where this function is used, and maintaining statistics for all
of those tables would eat a lot of space. But date_trunc() will not be
used that widely, so crafting statistics for that function per table
where it's used will work.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company