Re: Additional Statistics Hooks - Mailing list pgsql-hackers

From Mat Arye
Subject Re: Additional Statistics Hooks
Date
Msg-id CADsUR0Bnh=PUHkNVHNGb-U8ckbck1An-pX8kwXkt=TZzVMOK6Q@mail.gmail.com
In response to Re: Additional Statistics Hooks  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: Additional Statistics Hooks
List pgsql-hackers


On Tue, Mar 13, 2018 at 6:56 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mat Arye <mat@timescale.com> writes:
>> So the use-case is an analytical query like
>
>> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
>> FROM hyper
>> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
>> GROUP BY MetricMinuteTs
>> ORDER BY MetricMinuteTs DESC;
>
>> Right now this query will choose a much-less-efficient GroupAggregate plan
>> instead of a HashAggregate. It will choose this because it thinks the
>> number of groups
>> produced here is 9,000,000 because that's the number of distinct time
>> values there are.
>> But, because date_trunc "buckets" the values there will be about 24 groups
>> (1 for each hour).
>
> While it would certainly be nice to have better behavior for that,
> "add a hook so users who can write C can fix it by hand" doesn't seem
> like a great solution.  On top of the sheer difficulty of writing a
> hook function, you'd have the problem that no pre-written hook could
> know about all available functions.  I think somehow we'd need a way
> to add per-function knowledge, perhaps roughly like the protransform
> feature.

Like cost associated with a function, we may associate mapping
cardinality with a function. It tells how many distinct input values
map to 1 output value. By input value, I mean input argument tuple. In
Mat's case the mapping cardinality will be 12. The number of distinct
values that function may output is estimated as number of estimated
rows / mapping cardinality of that function.

I think this is complicated by the fact that the mapping cardinality is
not a constant per function: it depends on the constant given as the
first argument to the function and on the granularity of the underlying
data (do you have second granularity or microsecond granularity?). I
actually think the estimate here should be
(max(time) - min(time)) / interval. I think that to be general you need
to allow functions on statistics to determine the estimate.
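
To make the arithmetic concrete, here is a minimal sketch of that estimate in Python. It is not planner code. The function name and parameters are made up for illustration, and it assumes min(time) and max(time) are already known (say, from column statistics clamped by the WHERE clause). Rounding up and capping at the row estimate are my own assumptions, not something settled in this thread.

```python
import math
from datetime import datetime, timedelta


def estimate_bucket_groups(min_time, max_time, bucket, row_estimate):
    """Hypothetical helper: estimate the groups produced by bucketing a time range.

    Implements (max(time) - min(time)) / interval, rounded up, then clamped
    so the result is at least 1 and never exceeds the estimated row count.
    """
    if bucket <= timedelta(0):
        raise ValueError("bucket width must be positive")
    span = max_time - min_time
    # timedelta / timedelta yields a float ratio of the two durations.
    groups = math.ceil(span / bucket)
    # A non-empty input has at least one group and at most one group per row.
    return max(1, min(groups, row_estimate))


# Bounds taken from the WHERE clause of the example query, hourly buckets.
print(estimate_bucket_groups(
    datetime(2001, 1, 4, 0, 0, 0),
    datetime(2001, 1, 5, 1, 0, 0),
    timedelta(hours=1),
    9_000_000,
))  # 25
```

Note that the result depends only on the time range and the bucket width, not on whether the underlying data has second or microsecond granularity, which is why a single per-function constant does not capture it.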

 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
