Re: Built-in binning functions - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: Built-in binning functions
Date
Msg-id 54037D9E.1080904@2ndquadrant.com
In response to Re: Built-in binning functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Built-in binning functions
List pgsql-hackers
On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> 1. I think the restriction to only numeric types is not necessary -
>> although we can live without it - but there are lots of non-numeric
>> categories: chars, dates, ...
>
> I wasn't terribly happy about that either.  I still think we should
> reduce this to a single polymorphic function, as in the attached.
>

I did try to write a generic function very similar to what you wrote, but 
discarded it for performance reasons. If we really want to support any 
data type, I am all for having a generic function, but I would still like 
to have one optimized for float8, since that seems to be the most common 
use case (at least it is the reason I wrote this in the first place).

Here are some numbers:
First float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM 
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms


Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM 
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is that Tom's is 
slowed down by the deconstruct_array call.
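For reference, both implementations binary-search the sorted thresholds; a minimal self-contained sketch of that search for float8 (hypothetical names, not the actual PostgreSQL source) might look like:

```c
#include <assert.h>

/* Hypothetical sketch of the binary search behind the array form of
 * width_bucket(): returns how many thresholds are <= operand, i.e. a
 * bucket number in 0..n.  Assumes thresholds[] is sorted ascending,
 * which is exactly the precondition discussed (and not checked) below. */
static int width_bucket_sketch(double operand, const double *thresholds, int n)
{
    int lo = 0, hi = n;                 /* answer lies in [lo, hi] */

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (operand >= thresholds[mid])
            lo = mid + 1;               /* operand is at or above this threshold */
        else
            hi = mid;
    }
    return lo;
}
```

With the benchmark's thresholds {10, 20, ..., 90}, an operand of 45 lands in bucket 4 and an operand of 95 in the top bucket 9.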


>> 2. I am still worried about the search method used - it is not
>> possible to check the validity of the input. Did you check how much
>> slower linear searching is? You said you have real data and a real use case.
>
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N).  That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()).  I think it's probably
> sufficient to have a clear warning in the docs.
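The validation being ruled out here would amount to a linear sortedness check on every call, something like this (hypothetical helper, just to show the cost):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical O(N) validity check: verifies the thresholds array is
 * sorted ascending.  Running this on every call would add N-1 comparisons
 * on top of the O(log N) binary search - the cost penalty the thread
 * decides to avoid in favour of a documentation warning. */
static bool thresholds_sorted(const double *thresholds, int n)
{
    for (int i = 1; i < n; i++)
        if (thresholds[i - 1] > thresholds[i])
            return false;
    return true;
}
```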
>

With a re-sort of the thresholds on each call, the performance is worse 
than a bunch of CASE WHEN expressions :(

>
> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds".  However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing.  Thoughts?

Upper bounds is probably a better name than right bounds, I agree with 
that. In any case each threshold is the upper bound of a bucket (that's 
why there is one extra bucket into which everything bigger than the max 
threshold goes).
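To illustrate the point with a hypothetical linear counter (not the patch's code): N thresholds produce N+1 buckets, each threshold is an inclusive lower bound of its bucket, and everything above the largest threshold goes into the extra top bucket:

```c
#include <assert.h>

/* Hypothetical illustration of the bucket semantics: the bucket number
 * is the count of thresholds <= x, so n thresholds yield n+1 buckets
 * (numbered 0..n).  A value equal to a threshold lands in the bucket
 * that the threshold opens, i.e. thresholds act as inclusive lower
 * bounds (equivalently, exclusive upper bounds of the bucket below). */
static int bucket_of(double x, const double *thresholds, int n)
{
    int b = 0;

    while (b < n && x >= thresholds[b])
        b++;
    return b;
}
```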


-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


