Re: Additional Statistics Hooks - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Additional Statistics Hooks
Date
Msg-id CAFjFpRcfmOYwEdp+XZ-bmgK9ZnNaY_PTmKZ8qvW01oHo7rCDJA@mail.gmail.com
Whole thread Raw
In response to Re: Additional Statistics Hooks  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Thu, Mar 15, 2018 at 7:59 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 03/15/2018 06:00 AM, Ashutosh Bapat wrote:
>> On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <mat@timescale.com> wrote:
>>>>
>>>> Like cost associated with a function, we may associate mapping
>>>> cardinality with a function. It tells how many distinct input values
>>>> map to 1 output value. By input value, I mean input argument tuple. In
>>>> Mat's case the mapping cardinality will be 12. The number of distinct
>>>> values that function may output is estimated as number of estimated
>>>> rows / mapping cardinality of that function.
>>>
>>>
>>> I think this is complicated by the fact that the mapping cardinality is not
>>> a constant per function
>>> but depends on the constant given as the first argument to the function and
>>> the granularity of the
>>> underlying data (do you have a second-granularity or microsecond
>>> granularity). I actually think the logic for the
>>> estimate here should be the (max(time)-min(time))/interval. I think to be
>>> general you need to allow functions on statistics to determine the estimate.
>>>
>>
>> I think my solution was quite short-sighted. You are right. We need a
>> function taking statistics about the input argument as input and
>> output the statistics about the output. The planner can then use this
>> statistics to arrive at various estimates.
>>
>
> I think the best solution is to extend the CREATE STATISTICS so that it
> handles things like
>
>     CREATE STATISTICS s ON date_trunc('day', column) FROM table
>

I think we need both kinds of solution here. My proposal would work
for commonly used functions like int4div(). There is no point in
creating statistics for all the expression where int4div() is used;
there will be many queries which use this function and it's hard to
spot since it's invoked for '/' operator. Also there will be many
tables with which this function will be used, and maintaining
statistics for all those tables will eat a lot of space. But
date_trunc() will not be used that widely and thus crafting statistics
for that function per table where it's used will work.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Re: Typo in xlog.c
Next
From: Simon Riggs
Date:
Subject: Re: [HACKERS] MERGE SQL Statement for PG11