Re: Built-in binning functions - Mailing list pgsql-hackers
From:           Petr Jelinek
Subject:        Re: Built-in binning functions
Date:
Msg-id:         54037D9E.1080904@2ndquadrant.com
In response to: Re: Built-in binning functions (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Built-in binning functions
List:           pgsql-hackers
On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.
>

I did try to write a generic function very similar to what you wrote, but
discarded it for performance reasons. If we really want to support any
data type I am all for having the generic function, but I would still like
to have one optimized for float8, since that seems to be the most common
use-case (it is the reason I wrote this in the first place).

Here are some numbers:

First float8:

CREATE TABLE wbtest AS
    SELECT (random() * 100)::float8 a
    FROM generate_series(1,1000000) ORDER BY 1;

SELECT width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
       width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
       width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic:    ~400ms

Numeric:

CREATE TABLE wbtestn AS
    SELECT (random() * 100)::numeric a
    FROM generate_series(1,1000000) ORDER BY 1;

SELECT width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
       width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
       width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic:        ~780ms
Tom's generic:     ~900ms

The difference between my generic version and Tom's is that Tom's is
slowed down by the deconstruct_array call.

>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use
>> case?
>
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.
>

With a re-sort the performance is worse than a bunch of CASE WHEN
expressions :(

> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?

"Upper bounds" is probably a better name than "right bounds"; I agree with
that. In any case, each threshold is the upper bound of one bucket (that's
why there is one extra bucket, into which everything bigger than the max
threshold goes).
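To make the boundary semantics concrete, a minimal sketch (assuming the
inclusive-lower/exclusive-upper behavior described above, where a threshold
value itself lands in the higher bucket):

SELECT width_bucket(5,  ARRAY[10, 20, 30]);  -- 0: below the first threshold
SELECT width_bucket(10, ARRAY[10, 20, 30]);  -- 1: 10 <= a < 20
SELECT width_bucket(25, ARRAY[10, 20, 30]);  -- 2: 20 <= a < 30
SELECT width_bucket(99, ARRAY[10, 20, 30]);  -- 3: the extra top bucket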
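Regarding Pavel's point 1 above, this is what the polymorphic variant buys
us: with width_bucket(anyelement, anyarray), binning non-numeric types such
as dates works directly. A hypothetical sketch (the values are made up for
illustration):

-- quarterly buckets over a date value, no numeric cast needed
SELECT d,
       width_bucket(d, ARRAY['2014-04-01', '2014-07-01', '2014-10-01']::date[])
FROM (VALUES ('2014-02-14'::date), ('2014-08-01'::date)) AS t(d);
-- 2014-02-14 -> bucket 0 (Q1), 2014-08-01 -> bucket 2 (Q3)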
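And to illustrate the validity concern in Pavel's point 2: because the
binary search simply assumes the thresholds array is sorted, an unsorted
array silently yields a meaningless bucket number rather than an error
(sketch; the exact value returned is unspecified):

SELECT width_bucket(35, ARRAY[10, 30, 20, 40]);
-- some bucket number comes back, but it does not correspond to any
-- sensible binning, and nothing warns the user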
--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services