Thread: Built-in binning functions
Hello,

here is a patch implementing a varwidth_bucket function (naming is up for
discussion) which does binning with variable bucket width. The use-cases are
the same as for width_bucket (data analytics, mainly histograms); the
difference is that width_bucket uses buckets of equal width, while
varwidth_bucket accepts a sorted array of right-bound thresholds to define
the individual buckets.

Currently applications implement this with long CASE statements, which are
quite hard to read and maintain and are much slower than this
implementation, which uses binary search.

There are 3 actual functions: one generic and two faster versions for int8
and float8 input that take advantage of the fixed width of those types.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n° 318633.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
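[Editor's note] The CASE-vs-binary-search point above can be illustrated with a small standalone sketch. This is hypothetical C, not the patch's actual code; the name `var_width_bucket` and the use of plain `double` thresholds are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch (not the patch's code): return the bucket number for
 * `operand` given `nelems` sorted, ascending right-bound thresholds.
 * Result is 0..nelems: 0 for operand below the first threshold, nelems for
 * operand at or above the last one.
 */
static int
var_width_bucket(double operand, const double *thresholds, size_t nelems)
{
	size_t		left = 0;
	size_t		right = nelems;

	/* Binary search: O(log N) comparisons instead of a long CASE chain. */
	while (left < right)
	{
		size_t		mid = left + (right - left) / 2;

		if (operand < thresholds[mid])
			right = mid;		/* operand falls in a lower bucket */
		else
			left = mid + 1;		/* operand is at/above this threshold */
	}
	return (int) left;
}
```

With thresholds `{1, 3, 5, 10}`, an operand of 4.5 lands in bucket 2 after only two comparisons, where the equivalent CASE statement would test each boundary in turn.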
On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> here is a patch implementing varwidth_bucket (naming is up for discussion)
> function which does binning with variable bucket width. The use-cases are
> same as for width_bucket (=data analytics, mainly histograms), the
> difference is that width_bucket uses buckets of same width but the
> varwidth_bucket accepts an sorted array of right-bound thresholds to define
> the individual buckets.
>
> Currently applications implement this with long CASE statements which are
> quite hard to read/maintain and are much slower than this implementation
> which uses binary search.
>
> There are 3 actual functions, one generic and two faster versions for the
> int8 and float8 input that take advantage of the static width of those
> types.

I wonder if stuff like this shouldn't live in contrib rather than core,
but I guess it's probably not worth it for 3 functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17/06/14 16:15, Robert Haas wrote:
> On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> here is a patch implementing varwidth_bucket (naming is up for discussion)
>> function which does binning with variable bucket width. The use-cases are
>> same as for width_bucket (=data analytics, mainly histograms), the
>> difference is that width_bucket uses buckets of same width but the
>> varwidth_bucket accepts an sorted array of right-bound thresholds to define
>> the individual buckets.
>
> I wonder if stuff like this shouldn't live in contrib rather than
> core, but I guess it's probably not worth it for 3 functions.

I was wondering the same, and I also think that 3 simple functions is not
that much; plus it seems a natural extension of the existing width_bucket.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr@2ndquadrant.com> writes:
> here is a patch implementing varwidth_bucket (naming is up for
> discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

Also, the set of functions provided still needs more thought, because the
resolution of overloaded functions doesn't really work as nicely as all
that.  I illustrate:

regression=# create function myf(int8) returns int as 'select 0' language sql;
CREATE FUNCTION
regression=# create function myf(float8) returns int as 'select 1' language sql;
CREATE FUNCTION
regression=# create function myf(anyelement) returns int as 'select 2' language sql;
CREATE FUNCTION
regression=# select myf(1);
 myf
-----
   1
(1 row)

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable.  The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but
TBH, I'm not sure that the specific-type functions are worth the trouble.
Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements?  Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

Also, I'm not convinced by this business of throwing an error for a
NULL array element.  Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly?  In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets.  We could possibly do something similar
in these functions by initially comparing the two endpoint elements to
see which one is larger, and then reversing the sense of all the
comparisons if the first one is larger.  I'm not sure if that's worth the
trouble or not, but if the SQL committee thought descending bucket
numbering was worthwhile, maybe it's worthwhile here too.

			regards, tom lane
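[Editor's note] The descending-bucket idea above (compare the endpoints, then reverse the sense of the comparisons) could be sketched as follows. This is purely illustrative C; the name `width_bucket_bidir` and the `double` threshold array are assumptions, not anything from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of bidirectional bucketing: detect whether the
 * thresholds run ascending or descending by comparing the two endpoint
 * elements, then flip every comparison for the descending case.
 */
static int
width_bucket_bidir(double operand, const double *thresholds, size_t nelems)
{
	/* Endpoint comparison decides the direction, as suggested above. */
	int			descending = (nelems > 1 &&
							  thresholds[0] > thresholds[nelems - 1]);
	size_t		left = 0;
	size_t		right = nelems;

	while (left < right)
	{
		size_t		mid = left + (right - left) / 2;
		int			below = descending ? (operand > thresholds[mid])
									   : (operand < thresholds[mid]);

		if (below)
			right = mid;		/* operand belongs before this threshold */
		else
			left = mid + 1;
	}
	return (int) left;
}
```

For descending thresholds `{10, 5, 3, 1}` this numbers buckets from the top: values above 10 get bucket 0, values below 1 get bucket 4, mirroring the ascending behavior.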
On 08/07/14 02:14, Tom Lane wrote:
> Petr Jelinek <petr@2ndquadrant.com> writes:
>> here is a patch implementing varwidth_bucket (naming is up for
>> discussion) function which does binning with variable bucket width.
>
> I didn't see any discussion of the naming question in this thread.
> I'd like to propose that it should be just "width_bucket()"; we can
> easily determine which function is meant, considering that the
> SQL-spec variants don't take arrays and don't even have the same
> number of actual arguments.

I did mention in the submission that the names are up for discussion; I am
all for naming it just width_bucket.

> So given plain integer arguments, we'll invoke the float8 version not the
> int8 version, which may be undesirable. The same for int2 arguments.
> We could fix that by adding bespoke int4 and maybe int2 variants, but

Hmm, yeah, I don't love the idea of having 3 functions just to cover the
integer datatypes :(.

> TBH, I'm not sure that the specific-type functions are worth the trouble.
> Maybe we should just have one generic function, and take the trouble to
> optimize its array-subscripting calculations for either fixed-length or
> variable-length array elements? Have you got performance measurements
> demonstrating that multiple implementations really buy enough to justify
> the extra code?

The performance difference is about 20% (+/- a few depending on the array
size); I don't know if that's bad enough to warrant a type-specific
implementation. I personally don't know how to make the generic
implementation much faster than it is now, except maybe by turning it into
an aggregate which would cache the deconstructed version of the array, but
that changes the semantics quite a bit and is probably not all that
desirable.

> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I am not against returning NULL for a NULL array; I would still like to
error on an array that has values + NULL somewhere in the middle though.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 08/07/14 02:14, Tom Lane wrote:
>> Petr Jelinek <petr@2ndquadrant.com> writes:
>>> here is a patch implementing varwidth_bucket (naming is up for
>>> discussion) function which does binning with variable bucket width.
>>
>> I didn't see any discussion of the naming question in this thread.
>> I'd like to propose that it should be just "width_bucket()"; we can
>> easily determine which function is meant, considering that the
>> SQL-spec variants don't take arrays and don't even have the same
>> number of actual arguments.
>
> I did mention in submission that the names are up for discussion, I am
> all for naming it just width_bucket.

+1

I had this idea too - but I am not sure if it is a good idea. The distance
between the ANSI SQL width_bucket and our enhancement is larger than in
our implementation of "median", for example.

I can live with both names, but I prefer the current name.

>> So given plain integer arguments, we'll invoke the float8 version not the
>> int8 version, which may be undesirable. The same for int2 arguments.
>> We could fix that by adding bespoke int4 and maybe int2 variants, but
>> TBH, I'm not sure that the specific-type functions are worth the trouble.
>> Maybe we should just have one generic function, and take the trouble to
>> optimize its array-subscripting calculations for either fixed-length or
>> variable-length array elements? Have you got performance measurements
>> demonstrating that multiple implementations really buy enough to justify
>> the extra code?
>
> Hmm, yeah I don't love the idea of having 3 functions just to cover
> integer datatypes :(.
>
> The performance difference is about 20% (+/- a few depending on the
> array size), I don't know if that's bad enough to warrant a
> type-specific implementation. I personally don't know how to make the
> generic implementation much faster than it is now, except maybe by
> turning it into an aggregate which would cache the deconstructed version
> of the array, but that changes the semantics quite a bit and is probably
> not all that desirable.

I am not sure if our API is enough to do it - there is no simple support
for immutable parameters.

The performance is one point. A second point is a wrong result when the
input array is not well sorted. But checking that would mean a further
performance penalty, so we cannot do it.

>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>
> I am not against returning NULL for NULL array, I would still like to
> error on array that has values + NULL somewhere in the middle though.

+1

Pavel

> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 16/07/14 21:35, Pavel Stehule wrote:
>     The performance difference is about 20% (+/- a few depending on the
>     array size), I don't know if that's bad enough to warrant a
>     type-specific implementation. I personally don't know how to make
>     the generic implementation much faster than it is now, except maybe
>     by turning it into an aggregate which would cache the deconstructed
>     version of the array, but that changes the semantics quite a bit and
>     is probably not all that desirable.
>
> I am not sure if our API is enough to do it - there is no simple support
> for immutable parameters.

Just to clarify, the ~20% performance difference is against a separate
generic implementation for fixed-width types (most of the time seems to be
spent in the FunctionCallInvoke part; I even tried to use sortsupport
instead, but it does not seem to help much).

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/07/14 02:14, Tom Lane wrote:
> Also, the set of functions provided still needs more thought, because the
> resolution of overloaded functions doesn't really work as nicely as all
> that.

I am honestly considering just removing the special case for int8 for now;
the SQL-standard width_bucket only has support for float8 and numeric, so
maybe that's enough for this one also... What's your opinion on that?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> 2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
>> On 08/07/14 02:14, Tom Lane wrote:
>>> Petr Jelinek <petr@2ndquadrant.com> writes:
>>>> here is a patch implementing varwidth_bucket (naming is up for
>>>> discussion) function which does binning with variable bucket width.
>>>
>>> I didn't see any discussion of the naming question in this thread.
>>> I'd like to propose that it should be just "width_bucket()"; we can
>>> easily determine which function is meant, considering that the
>>> SQL-spec variants don't take arrays and don't even have the same
>>> number of actual arguments.
>>
>> I did mention in submission that the names are up for discussion, I am all
>> for naming it just width_bucket.
>
> I had this idea too - but I am not sure if it is good idea. A distance
> between ANSI SQL width_bucket and our enhancing is larger than in our
> implementation of "median" for example.
>
> I can live with both names, but current name I prefer.

Hmmm, not sure. Let's look around and think about what words people use.
Transforms of this type are referred to as discretization in the formal
literature and as binning in common usage/statistical software.

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries of
which could have been decided using equal-frequency or other mechanisms.
Using the word "width" in those contexts could be confusing. Given that
width_bucket() is already in use for the SQL Standard function, I agree it
would be confusing to have the same name for different parameter profiles.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly).

Whatever we do, it seems clear we need a section in the manual to describe
Statistical Functions, including width_bucket() and whatever we call this
function, and including the terms bin, binning, transform, discretize and
discretization to ensure those are easily searchable.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

I finally had some time to get back to this.

I attached version 3 of the patch, which "fixes" Tom's complaint about the
int8 version by removing it, as it does not seem necessary (the float8
version can handle integers just fine). This patch now basically has just
one optimized function for the float8 and one for the numeric datatypes,
just like width_bucket.

On 08/07/14 02:14, Tom Lane wrote:
> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I really think there should be a difference between a NULL array and a
NULL inside an array, and that NULL inside the array is wrong input. I
changed the check to array_contains_nulls() though.

On 08/07/14 02:14, Tom Lane wrote:
> Lastly, the spec defines behaviors for width_bucket that allow either
> ascending or descending buckets. We could possibly do something
> similar

I am not sure it's worth it here, as we require the input to be sorted
anyway. It might be worthwhile if we decided to do this as an aggregate
(since there the input would not be required to be presorted) instead of a
function, but I am not sure that's desirable either, as it would limit
usability to only the single main use-case (group by and count()).

On 20/07/14 11:01, Simon Riggs wrote:
> On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> On 08/07/14 02:14, Tom Lane wrote:
>>>> I didn't see any discussion of the naming question in this thread.
>>>> I'd like to propose that it should be just "width_bucket()"; we can
>>>> easily determine which function is meant, considering that the
>>>> SQL-spec variants don't take arrays and don't even have the same
>>>> number of actual arguments.
>>>
>>> I did mention in submission that the names are up for discussion, I am all
>>> for naming it just width_bucket.
>>
>> I had this idea too - but I am not sure if it is good idea. A distance
>> between ANSI SQL width_bucket and our enhancing is larger than in our
>> implementation of "median" for example.
>>
>> I can live with both names, but current name I prefer.
>
> So I suggest that we use the more generic function name bin(), with a
> second choice of discretize() (though that seems annoyingly easy to
> spell incorrectly)

I really don't think bin() is a good choice here; bucket has the same
meaning in this context, and bin is often used as shorthand for binary,
which would be confusing here.

Anyway, I have currently left the name as it is. I would not be against
using width_bucket() or even just bucket(); not sure about the
discretize() option.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi

I am looking at your last patch and I have a few questions and notes:

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket"?

The last patch is very simple; there are no new compilation or regression
test issues.

Pavel

2014-08-25 16:15 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> Hi,
>
> I finally had some time to get back to this.
>
> I attached version3 of the patch which "fixes" Tom's complaint about int8
> version by removing the int8 version as it does not seem necessary (the
> float8 can handle integers just fine).
>
> This patch now basically has just one optimized function for float8 and
> one for numeric datatypes, just like width_bucket.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>
> I really think there should be difference between NULL array and NULL
> inside array and that NULL inside the array is wrong input. I changed the
> check to array_contains_nulls() though.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Lastly, the spec defines behaviors for width_bucket that allow either
>> ascending or descending buckets. We could possibly do something
>> similar
>
> I am not sure it's worth it here as we require input to be sorted anyway.
> It might be worthwhile if we decided to do this as an aggregate (since
> there input would not be required to be presorted) instead of function
> but I am not sure if that's desirable either as that would limit
> usability to only the single main use-case (group by and count()).
>
> On 20/07/14 11:01, Simon Riggs wrote:
>> So I suggest that we use the more generic function name bin(), with a
>> second choice of discretize() (though that seems annoyingly easy to
>> spell incorrectly)
>
> I really don't think bin() is good choice here, bucket has same meaning
> in this context and bin is often used as shorthand for binary which would
> be confusing here.
>
> Anyway I currently left the name as it is, I would not be against using
> width_bucket() or even just bucket(), not sure about the discretize()
> option.
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Pavel Stehule <pavel.stehule@gmail.com> writes:
> 1. I am thinking so reduction to only numeric types is not necessary -
> although we can live without it - but there are lot of non numeric
> categories: chars, date, ...

I wasn't terribly happy about that either.  I still think we should reduce
this to a single polymorphic function, as in the attached.

> 2. Still I strongly afraid about used searching method - there is not
> possible to check a validity of input. Did you check how much linear
> searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N).  That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()).  I think it's probably
sufficient to have a clear warning in the docs.

> 3. I am thinking about name - I don't think so varwidth_bucket is correct
> -- in relation to name "width_bucket" ... what about "range_bucket" or
> "scope_bucket" ?

Neither of those seem like improvements from here.  I agree with the
objections to bin() as well.  bucket() might be all right but it still
seems a bit too generic.  width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best of
the suggestions so far.

It occurs to me that there would be an advantage to using some other name
besides width_bucket: we could then mark the function as variadic, which
might be a notational advantage in some usages.  (We can't do that if we
call it width_bucket because the four-parameter case would be ambiguous
with the existing functions.)  I'm not sure that this is important enough
to justify changing the name though, especially if we can't come up with
a good name.  Also, doing that would put a very large premium on picking
a non-generic name, else we'd risk creating ambiguities against
user-defined functions.
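[Editor's note] The O(N)-vs-O(log N) trade-off above can be made concrete with a small sketch. This is hypothetical C for illustration only; the helper name `thresholds_are_sorted` is made up and not part of any patch in this thread:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical validity check: verify the thresholds are in ascending
 * order.  This costs N-1 comparisons up front, which for large arrays and
 * expensive comparison functions would swamp the O(log N) binary search
 * itself -- hence the decision above to warn in the docs instead.
 */
static int
thresholds_are_sorted(const double *thresholds, size_t nelems)
{
	size_t		i;

	for (i = 1; i < nelems; i++)	/* linear scan: O(N) comparisons */
	{
		if (thresholds[i - 1] > thresholds[i])
			return 0;			/* out of order: results would be bogus */
	}
	return 1;
}
```

A binary search over an unsorted array still terminates, but its answer is meaningless, which is why a "must be sorted" documentation warning stands in for this check.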
Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds".  However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing.  Thoughts?

Proposed patch with 1 polymorphic function attached.

			regards, tom lane


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 722640b..93cf1e7 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 901,925 ****
        <indexterm>
         <primary>width_bucket</primary>
        </indexterm>
!       <literal><function>width_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
        </entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>

       <row>
!       <entry><literal><function>width_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>
      </tbody>
     </tgroup>
    </table>
--- 901,936 ----
        <indexterm>
         <primary>width_bucket</primary>
        </indexterm>
!       <literal><function>width_bucket(<parameter>operand</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
        </entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>

       <row>
!       <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>
+
+      <row>
+       <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>anyelement</type>, <parameter>thresholds</parameter> <type>anyarray</type>)</function></literal></entry>
+       <entry><type>int</type></entry>
+       <entry>return the bucket to which <parameter>operand</> would
+       be assigned given an array listing the upper bounds of the buckets;
+       the <parameter>thresholds</> array <emphasis>must be sorted</>,
+       smallest first, or unexpected results will be obtained</entry>
+       <entry><literal>width_bucket(now(), array['yesterday', 'today', 'tomorrow']::timestamptz[])</literal></entry>
+       <entry><literal>2</literal></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index f8e94ec..f3bba13 100644
*** a/src/backend/utils/adt/arrayfuncs.c
--- b/src/backend/utils/adt/arrayfuncs.c
*************** static ArrayType *array_replace_internal
*** 130,135 ****
--- 130,143 ----
  						Datum replace, bool replace_isnull,
  						bool remove, Oid collation,
  						FunctionCallInfo fcinfo);
+ static int	width_bucket_fixed(Datum operand,
+ 					ArrayType *thresholds,
+ 					Oid collation,
+ 					TypeCacheEntry *typentry);
+ static int	width_bucket_variable(Datum operand,
+ 					ArrayType *thresholds,
+ 					Oid collation,
+ 					TypeCacheEntry *typentry);

*************** array_replace(PG_FUNCTION_ARGS)
*** 5502,5504 ****
--- 5510,5681 ----
  					   fcinfo);
  	PG_RETURN_ARRAYTYPE_P(array);
  }
+
+ /*
+  * Implements width_bucket(anyelement, anyarray).
+  *
+  * 'thresholds' is an array containing upper bound values for each bucket;
+  * these must be sorted from smallest to largest, or bogus results will be
+  * produced.  If N thresholds are supplied, the output is from 0 to N:
+  * 0 is for inputs < first threshold, N is for inputs >= last threshold.
+  */
+ Datum
+ width_bucket_generic(PG_FUNCTION_ARGS)
+ {
+ 	Datum		operand = PG_GETARG_DATUM(0);
+ 	ArrayType  *thresholds = PG_GETARG_ARRAYTYPE_P(1);
+ 	Oid			collation = PG_GET_COLLATION();
+ 	Oid			element_type = ARR_ELEMTYPE(thresholds);
+ 	TypeCacheEntry *typentry;
+ 	int			result;
+
+ 	/* Check input */
+ 	if (ARR_NDIM(thresholds) > 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+ 				 errmsg("thresholds must be one-dimensional array")));
+
+ 	if (array_contains_nulls(thresholds))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ 				 errmsg("thresholds array must not contain NULLs")));
+
+ 	/* Cache information about the input type */
+ 	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+ 	if (typentry == NULL ||
+ 		typentry->type_id != element_type)
+ 	{
+ 		typentry = lookup_type_cache(element_type,
+ 									 TYPECACHE_CMP_PROC_FINFO);
+ 		if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_UNDEFINED_FUNCTION),
+ 					 errmsg("could not identify a comparison function for type %s",
+ 							format_type_be(element_type))));
+ 		fcinfo->flinfo->fn_extra = (void *) typentry;
+ 	}
+
+ 	/* We have two implementation paths for fixed- and variable-width types */
+ 	if (typentry->typlen > 0)
+ 		result = width_bucket_fixed(operand, thresholds, collation, typentry);
+ 	else
+ 		result = width_bucket_variable(operand, thresholds, collation, typentry);
+
+ 	/* Avoid leaking memory when handed toasted input. */
+ 	PG_FREE_IF_COPY(thresholds, 1);
+
+ 	PG_RETURN_INT32(result);
+ }
+
+ /*
+  * width_bucket for fixed-width data types.
+  */
+ static int
+ width_bucket_fixed(Datum operand,
+ 				   ArrayType *thresholds,
+ 				   Oid collation,
+ 				   TypeCacheEntry *typentry)
+ {
+ 	char	   *thresholds_data;
+ 	int			typlen;
+ 	FunctionCallInfoData locfcinfo;
+ 	int			left;
+ 	int			right;
+
+ 	/*
+ 	 * Since we know the array contains no NULLs, we can just index it
+ 	 * directly.
+ 	 */
+ 	thresholds_data = (char *) ARR_DATA_PTR(thresholds);
+ 	typlen = typentry->typlen;
+
+ 	InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+ 							 collation, NULL, NULL);
+
+ 	/* Find the bucket */
+ 	left = 0;
+ 	right = ArrayGetNItems(ARR_NDIM(thresholds), ARR_DIMS(thresholds));
+ 	while (left < right)
+ 	{
+ 		int			mid;
+ 		char	   *ptr;
+ 		Datum		elm;
+ 		int32		cmpresult;
+
+ 		mid = (left + right) / 2;
+
+ 		ptr = thresholds_data + mid * typlen;
+ 		elm = fetch_att(ptr, typentry->typbyval, typlen);
+
+ 		locfcinfo.arg[0] = operand;
+ 		locfcinfo.arg[1] = elm;
+ 		locfcinfo.argnull[0] = false;
+ 		locfcinfo.argnull[1] = false;
+ 		locfcinfo.isnull = false;
+
+ 		cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+ 		if (cmpresult < 0)
+ 			right = mid;
+ 		else
+ 			left = mid + 1;
+ 	}
+
+ 	return left;
+ }
+
+ /*
+  * width_bucket for variable-width data types.
+  */
+ static int
+ width_bucket_variable(Datum operand,
+ 					  ArrayType *thresholds,
+ 					  Oid collation,
+ 					  TypeCacheEntry *typentry)
+ {
+ 	Datum	   *dvalues;
+ 	int			nelems;
+ 	FunctionCallInfoData locfcinfo;
+ 	int			left;
+ 	int			right;
+
+ 	/*
+ 	 * The easiest and probably most efficient way to locate the array
+ 	 * elements is with deconstruct_array.
+ 	 */
+ 	deconstruct_array(thresholds,
+ 					  typentry->type_id, typentry->typlen,
+ 					  typentry->typbyval, typentry->typalign,
+ 					  &dvalues, NULL, &nelems);
+
+ 	InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+ 							 collation, NULL, NULL);
+
+ 	/* Find the bucket */
+ 	left = 0;
+ 	right = nelems;
+ 	while (left < right)
+ 	{
+ 		int			mid;
+ 		int32		cmpresult;
+
+ 		mid = (left + right) / 2;
+
+ 		locfcinfo.arg[0] = operand;
+ 		locfcinfo.arg[1] = dvalues[mid];
+ 		locfcinfo.argnull[0] = false;
+ 		locfcinfo.argnull[1] = false;
+ 		locfcinfo.isnull = false;
+
+ 		cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+ 		if (cmpresult < 0)
+ 			right = mid;
+ 		else
+ 			left = mid + 1;
+ 	}
+
+ 	pfree(dvalues);
+
+ 	return left;
+ }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..9b55acc 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2334 (  array_agg_fina
*** 885,890 ****
--- 885,892 ----
  DESCR("aggregate final function");
  DATA(insert OID = 2335 (  array_agg		   PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 2277 "2283" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
  DESCR("concatenate aggregate input into an array");
+ DATA(insert OID = 3218 (  width_bucket	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2283 2277" _null_ _null_ _null_ _null_ width_bucket_generic _null_ _null_ _null_ ));
+ DESCR("bucket number of operand given a sorted array of bucket upper bounds");
  DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ array_typanalyze _null_ _null_ _null_ ));
  DESCR("array typanalyze");
  DATA(insert OID = 3817 (  arraycontsel	   PGNSP PGUID 12 1 0 0 0 f f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null_ _null_ _null_ arraycontsel _null_ _null_ _null_ ));
diff --git a/src/include/utils/array.h b/src/include/utils/array.h
index 9bbfaae..2abaccd 100644
*** a/src/include/utils/array.h
--- b/src/include/utils/array.h
*************** extern Datum array_fill_with_lower_bound
*** 214,219 ****
--- 214,220 ----
  extern Datum array_unnest(PG_FUNCTION_ARGS);
  extern Datum array_remove(PG_FUNCTION_ARGS);
  extern Datum array_replace(PG_FUNCTION_ARGS);
+ extern Datum width_bucket_generic(PG_FUNCTION_ARGS);
  extern Datum array_ref(ArrayType *array, int nSubscripts, int *indx,
  		  int arraytyplen, int elmlen, bool elmbyval, char elmalign,
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 4286691..04f2400 100644
*** a/src/test/regress/expected/arrays.out
--- b/src/test/regress/expected/arrays.out
*************** select length(md5((f1[1]).c2)) from dest
*** 1706,1708 ****
--- 1706,1807 ----
  drop table dest;
  drop type textandtext;
+ -- Tests for generic form of width_bucket()
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+   width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+   width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+         op        | wb_1 | wb_2 | wb_3
+ ------------------+------+------+------
+              -5.2 |    0 |    0 |    1
+     -0.0000000001 |    0 |    0 |    2
+    0.000000000001 |    0 |    1 |    2
+                 1 |    1 |    1 |    2
+  1.99999999999999 |    1 |    1 |    2
+                 2 |    1 |    1 |    3
+  2.00000000000001 |    1 |    1 |    3
+                 3 |    2 |    1 |    3
+                 4 |    2 |    1 |    3
+               4.5 |    2 |    1 |    3
+                 5 |    3 |    1 |    3
+               5.5 |    3 |    2 |    3
+                 6 |    3 |    2 |    3
+                 7 |    3 |    2 |    3
+                 8 |    3 |    2 |    3
+                 9 |    3 |    2 |    3
+  9.99999999999999 |    3 |    3 |    3
+                10 |    4 |    3 |    3
+  10.0000000000001 |    4 |    3 |    3
+ (19 rows)
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+  op | wb_1
+ ----+------
+   0 |    0
+   1 |    1
+   2 |    1
+   3 |    2
+   4 |    2
+   5 |    3
+   6 |    3
+   7 |    3
+   8 |    3
+   9 |    3
+  10 |    4
+  11 |    4
+ (12 rows)
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+  width_bucket
+ --------------
+             2
+ (1 row)
+
+ SELECT width_bucket(5, ARRAY[3]);
+  width_bucket
+ --------------
+             1
+ (1 row)
+
+ SELECT width_bucket(5, '{}');
+  width_bucket
+ --------------
+             0
+ (1 row)
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ ERROR:  function width_bucket(text, integer[]) does not exist
+ LINE 1: SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+                ^
+ HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ ERROR:  thresholds array must not contain NULLs
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+ ERROR:  thresholds must be one-dimensional array
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index d9f7cbf..eef3eb3 100644
*** a/src/test/regress/sql/arrays.sql
--- b/src/test/regress/sql/arrays.sql
*************** drop table src;
*** 476,478 ****
--- 476,523 ----
  select length(md5((f1[1]).c2)) from dest;
  drop table dest;
  drop type textandtext;
+
+ -- Tests for generic form of width_bucket()
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+   width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+   width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+
+ SELECT width_bucket(5, ARRAY[3]);
+ SELECT width_bucket(5, '{}');
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
On 30 August 2014 18:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 3. I am thinking about name - I don't think so varwidth_bucket is correct
>> -- in relation to name "width_bucket" ... what about "range_bucket" or
>> "scope_bucket" ?
>
> Neither of those seem like improvements from here. I agree with the
> objections to bin() as well. bucket() might be all right but it still
> seems a bit too generic. width_bucket(), which at least shows there's
> a relationship to the standard functions, still seems like the best
> of the suggestions so far.

I'd been mulling that over and agree with the objections to bin(). I suggest discretize() as a much more informative name. The other names will be overlooked by anybody that doesn't already know what to look for.

Thanks for the polymorphic patch, that sounds good. Sorry, not reviewed more deeply (still travelling).
Simon Riggs <simon@2ndquadrant.com> writes: > Suggest discretize() as a much more informative name. The other names > will be overlooked by anybody that doesn't already know what to look > for. I did not like that idea to begin with, but it's growing more attractive. In particular, I think it would be unique enough that we could safely mark the polymorphic function as variadic (a move that would cause ambiguity issues for pretty nearly any user-defined function of the same name). OTOH, I'm not as convinced as you that this name would mean anything to "anybody that doesn't already know what to look for". It still seems like rather arcane terminology. regards, tom lane
On 01/09/14 06:00, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: >> Suggest discretize() as a much more informative name. The other names >> will be overlooked by anybody that doesn't already know what to look >> for. > I did not like that idea to begin with, but it's growing more attractive. > In particular, I think it would be unique enough that we could safely > mark the polymorphic function as variadic (a move that would cause > ambiguity issues for pretty nearly any user-defined function of the > same name). > > OTOH, I'm not as convinced as you that this name would mean anything > to "anybody that doesn't already know what to look for". It still > seems like rather arcane terminology. > > regards, tom lane > > If I was new to PostgreSQL, I'd probably read this as disc*(), a function to do something with discs. I have done finite maths and statistics at university, plus I have been vaguely following this thread - but, the meaning of discretize() is not immediately apparent to me (if I reread a previous email or 2 in this thread, then I'm sure it would then be obvious!). I might guess that it is to make something discrete, but why would I want to do that? So if I came across this function in a year or 2, I'd probably go right past it. I have been programming for over 40 years, and I still find choosing appropriate names sometimes very difficult, so I know it is not easy to come up with meaningful names - especially ones that mean something relevant to other people! Cheers, Gavin
On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.

I did try to write a generic function very similar to what you wrote, but discarded it for performance reasons. If we really want to support any data type I am all for having a generic function, but I would still like to have one optimized for float8, because that seems to be the most common use-case (it is the reason I wrote this in the first place).

Here are some numbers.

First float8:

CREATE TABLE wbtest AS
  SELECT (random() * 100)::float8 a
  FROM generate_series(1,1000000) ORDER BY 1;

SELECT
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms

Numeric:

CREATE TABLE wbtestn AS
  SELECT (random() * 100)::numeric a
  FROM generate_series(1,1000000) ORDER BY 1;

SELECT
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is that Tom's is slowed down by the deconstruct_array.

>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use case?
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.

With a re-sort the performance is worse than the bunch of CASE WHENs :(

> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?

Upper bounds is probably a better name than right bounds, I agree with that. In any case it's the upper bound for a bucket (that's why there is one more bucket, into which everything bigger than the max threshold goes).

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
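[Editor's note: the inclusive-lower/exclusive-upper semantics discussed above map exactly onto a right-biased binary search over the sorted thresholds array. A minimal Python sketch of the behavior follows — an illustration of the semantics only, not the patch's C implementation:]

```python
from bisect import bisect_right

def width_bucket(operand, thresholds):
    # Bucket i covers [thresholds[i-1], thresholds[i]): each threshold
    # is an inclusive lower bound / exclusive upper bound. Values below
    # the first threshold land in bucket 0; values at or above the last
    # threshold land in the extra bucket len(thresholds).
    return bisect_right(thresholds, operand)

# Mirrors the regression-test expectations for ARRAY[1, 3, 5, 10]:
assert width_bucket(0, [1, 3, 5, 10]) == 0
assert width_bucket(1, [1, 3, 5, 10]) == 1   # equal to a threshold
assert width_bucket(4, [1, 3, 5, 10]) == 2
assert width_bucket(10, [1, 3, 5, 10]) == 4
```

The right-biased search (`bisect_right`) is what makes an operand equal to a threshold fall into the higher bucket, which matches the expected output in the regression tests.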
Petr Jelinek <petr@2ndquadrant.com> writes:
> On 30/08/14 19:24, Tom Lane wrote:
>> I wasn't terribly happy about that either. I still think we should
>> reduce this to a single polymorphic function, as in the attached.

> I did try to write generic function very similar to what you wrote but
> discarded it because of the performance reasons. If we really want to
> support any data type I am all for having generic function but I still
> would like to have one optimized for float8 because that seems to be the
> most used use-case (at least that one is the reason why I even wrote
> this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that it's cheating: it will not produce the right answers for comparisons involving NaNs. I'm not sure how good it would look once you'd added some isnan() tests to make the behavior actually equivalent to float8_cmp_internal().

However, assuming it still seems worthwhile once that's accounted for, I don't have a strong objection to having an additional code path for float8. There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg

    if (element_type == FLOAT8OID)
        result = width_bucket_float8(...);
    else if (typentry->typlen > 0)
        result = width_bucket_fixed(...);
    else
        result = width_bucket_variable(...);

2. Define a SQL function width_bucket(float8, float8[]) alongside the polymorphic one.

While #2 might be marginally faster at runtime, the main difference between these is what the parser would choose to do with ambiguous cases. I experimented a bit and it seemed that the parser would prefer the float8 function in any situation where the inputs were of numeric types, probably because float8 is the preferred type in the numeric category.
I'm not real sure that would be a good thing: in particular it would cast integer and numeric inputs to float8, which risks precision loss, and there doesn't seem to be any way to get the parser to make the other choice if the inputs are of numeric-category types. As against that objection, it would make cross-type numeric cases easier to use, for example "width_bucket(1, array[2.4])" would be accepted. If we just offer the anyelement/anyarray version then the parser would make you cast "1" to numeric before it would consider the function to match. On balance the runtime-test approach looks like a better idea, in that it doesn't risk any unexpected semantic behaviors. > The difference between my generic and Tom's generic is because Tom's is > slowed down by the deconstruct_array. Meh. It looked to me like your version would have O(N^2) performance problems from computing array offsets repeatedly, depending on exactly which array element it ended up on. It would avoid repeat calculations only if it always moved right. regards, tom lane
On 31 August 2014 20:44, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > On 01/09/14 06:00, Tom Lane wrote: >> >> Simon Riggs <simon@2ndquadrant.com> writes: >>> >>> Suggest discretize() as a much more informative name. The other names >>> will be overlooked by anybody that doesn't already know what to look >>> for. >> >> I did not like that idea to begin with, but it's growing more attractive. >> In particular, I think it would be unique enough that we could safely >> mark the polymorphic function as variadic (a move that would cause >> ambiguity issues for pretty nearly any user-defined function of the >> same name). >> >> OTOH, I'm not as convinced as you that this name would mean anything >> to "anybody that doesn't already know what to look for". It still >> seems like rather arcane terminology. >> >> > If I was new to PostgreSQL, I'd probably read this as disc*(), a function to > do something with discs. > > I have done finite maths and statistics at university, plus I have been > vaguely following this thread - but, the meaning of discretize() is not > immediately apparent to me (if I reread a previous email or 2 in this > thread, then I'm sure it would then be obvious!). I might guess that it is > to make something discrete, but why would I want to do that? So if I came > across this function in a year or 2, I'd probably go right past it. > > I have been programming for over 40 years, and I still find choosing > appropriate names sometimes very difficult, so I know it is not easy to come > up with meaningful names - especially ones that mean something relevant to > other people! It's important that we *don't* come up with a new name. This function does something useful and the words people already use to describe that are binning and discretization. By this I mean names already used by very widely used software. Search them and see, or suggest more widely used alternatives, please.
On 31/08/14 22:33, Tom Lane wrote: > Petr Jelinek <petr@2ndquadrant.com> writes: >> On 30/08/14 19:24, Tom Lane wrote: >>> I wasn't terribly happy about that either. I still think we should >>> reduce this to a single polymorphic function, as in the attached. > >> I did try to write generic function very similar to what you wrote but >> discarded it because of the performance reasons. If we really want to >> support any data type I am all for having generic function but I still >> would like to have one optimized for float8 because that seems to be the >> most used use-case (at least that one is the reason why I even wrote >> this) for performance reasons. > > Well, part of the reason why your v3 float8 function looks so fast is that > it's cheating: it will not produce the right answers for comparisons > involving NaNs. I'm not sure how good it would look once you'd added > some isnan() tests to make the behavior actually equivalent to > float8_cmp_internal(). > True, it increases the time of the test to around 285ms, still almost 30% speed difference. > However, assuming it still seems worthwhile once that's accounted for, > I don't have a strong objection to having an additional code path for > float8. There are two ways we could do that: > > 1. Add a runtime test in the polymorphic function, eg > > if (element_type == FLOAT8OID) > result = width_bucket_float8(...); > else if (typentry->typlen > 0) > result = width_bucket_fixed(...); > else > result = width_bucket_variable(...); > Yeah I think I prefer this one too, will see how much performance it eats. > >> The difference between my generic and Tom's generic is because Tom's is >> slowed down by the deconstruct_array. > > Meh. It looked to me like your version would have O(N^2) performance > problems from computing array offsets repeatedly, depending on exactly > which array element it ended up on. It would avoid repeat calculations > only if it always moved right. 
> I actually think that worst case (when you go always left) for my version is O(N) since you only need to seek for the half of previous interval (it's doing binary search after all) and you do O(N) in the deconstruct_array. It would be very different if we could cache the array somehow (ie, if this was an aggregate) then it would obviously make a lot of sense to use deconstruct_array and in that case it would make even sense to resort probably, but sadly we can't cache afaik. Also, I made more tests with various array sizes (3-10000) and distributions and mine version was never slower. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> width_bucket() seems to refer to an equal-width binning process. The
> function being discussed here is a generic mechanism, the boundaries
> of which could have been decided using equal-frequency or other
> mechanisms. Using the word "width" in those contexts could be
> confusing.
>
> Given width_bucket() is already in use for SQL Standard function, I
> agree it would be confusing to have same name for different parameter
> profiles.

literal_bucket(float8, float8[])

Since "bucket" is the 'verb' here (in this specific case meaning "lookup the supplied value in the supplied bucket definition") and "width" is a modifier (the bucket specification describes an equal-width structure) I suggest "literal_bucket(val, array[])" such that the bucket is still the verb but now the modifier describes a structure that is literally provided.

David J.
Petr Jelinek <petr@2ndquadrant.com> writes: > On 31/08/14 22:33, Tom Lane wrote: >> Petr Jelinek <petr@2ndquadrant.com> writes: >>> The difference between my generic and Tom's generic is because Tom's is >>> slowed down by the deconstruct_array. >> Meh. It looked to me like your version would have O(N^2) performance >> problems from computing array offsets repeatedly, depending on exactly >> which array element it ended up on. It would avoid repeat calculations >> only if it always moved right. > I actually think that worst case (when you go always left) for my > version is O(N) since you only need to seek for the half of previous > interval (it's doing binary search after all) and you do O(N) in the > deconstruct_array. [ thinks about that for awhile... ] Hm, I think you are right. If there are N elements in the array then we will scan over N/2 of them to locate the midpoint element for the first comparison. After that, we will go either left or right, and in either case we will need to scan over N/4 elements to find the second-level midpoint. The third-level midpoint will be found after scanning N/8 elements, and so on. So the number of elements scanned over to compute their lengths is N/2 + N/4 + N/8 ..., or asymptotically N, the same as for the deconstruct_array approach. deconstruct_array might have some small advantage from tighter inner loops, but this is probably more than blown by the need for a palloc and pfree. So I was wrong and your way is better for navigating the array in the variable-width case, assuming that we're not concerned about spending some more code space to make this function a shade faster. BTW, was there a reason for not noticing the case of exact match in the search loop, and falling out early? As it stands the code will reliably choose the leftmost match if there are multiple equal items in the search array, but do we care about such cases? regards, tom lane
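[Editor's note: the geometric-series argument above can be checked numerically. Below is a rough Python model of the cost — my own construction, not code from the patch — in which locating a midpoint in a variable-width array costs one "scan" per element stepped over from the nearest offset already computed. The total stays around N (the N/2 + N/4 + N/8 + ... sum), plus at most a small logarithmic term from integer rounding:]

```python
def scan_cost(n, target):
    """Count elements stepped over to locate each binary-search midpoint,
    assuming the walk can resume from the nearest already-known offset."""
    visited = {0}            # element indexes whose byte offset is known
    left, right = 0, n
    scans = 0
    while left < right:
        mid = (left + right) // 2
        # walk forward from the closest known offset at or before mid
        start = max(p for p in visited if p <= mid)
        scans += mid - start
        visited.add(mid)
        if target < mid:     # stand-in for the comparison-function call
            right = mid
        else:
            left = mid + 1
    return scans

# Total walking is ~N for any search target, i.e. O(N) overall,
# the same asymptotic cost as one deconstruct_array pass:
for n in (10, 100, 1024):
    assert all(scan_cost(n, t) <= n + n.bit_length() for t in range(n))
```

Under this model both approaches pay O(N) element-walking plus O(log N) comparisons, so the choice comes down to constant factors (palloc/pfree and loop tightness), which is the conclusion reached above.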
David G Johnston <david.g.johnston@gmail.com> writes: > Since "bucket" is the 'verb' here (in this specific case meaning "lookup the > supplied value in the supplied bucket definition") and "width" is a modifier > (the bucket specification describes an equal-width structure) I suggest > "literal_bucket(val, array[])" such that the bucket is still the verb but > now the modifier describes a structure that is literally provided. It's a very considerable stretch to see "bucket" as a verb here :-). Maybe that's why the SQL committee's choice of function name seems so unnatural (to me anyway). I was wondering about bucket_index(), ie "get the index of the bucket this value falls into". Or get_bucket(), or get_bucket_index() if you like verbosity. regards, tom lane
On Sun, Aug 31, 2014 at 7:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David G Johnston <david.g.johnston@gmail.com> writes:
>> Since "bucket" is the 'verb' here (in this specific case meaning "lookup the
>> supplied value in the supplied bucket definition") and "width" is a modifier
>> (the bucket specification describes an equal-width structure) I suggest
>> "literal_bucket(val, array[])" such that the bucket is still the verb but
>> now the modifier describes a structure that is literally provided.
>
> It's a very considerable stretch to see "bucket" as a verb here :-).
> Maybe that's why the SQL committee's choice of function name seems
> so unnatural (to me anyway).
>
> I was wondering about bucket_index(), ie "get the index of the bucket
> this value falls into". Or get_bucket(), or get_bucket_index() if you
> like verbosity.
>
> regards, tom lane

I got stuck on the thought that a function name should ideally be/include a verb...

Even if you read it as a noun (and thus the verb is an implied "get") the naming logic still holds.

I pondered a "get_" version though the argument for avoiding conflicting user-code decreases its appeal.

The good part about SQL standard naming is that the typical programmer isn't likely to pick a conflicting name.

"bucket_index" is appealing by itself though user-code probable... as bad as I think "width_bucket" is for a name, the fact is that it currently exists and, even forced, consistency has merit.

David J.
On 01/09/14 01:42, Tom Lane wrote: > > BTW, was there a reason for not noticing the case of exact match in > the search loop, and falling out early? As it stands the code will > reliably choose the leftmost match if there are multiple equal items > in the search array, but do we care about such cases? > I am not sure if we care, probably not. Anyway I attached patch that I am happy with. I am not yet sure what to do with naming. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi
I did a review of last patch
1. There is no problem with patching
4. The name width_bucket is probably the one with wide agreement
5. There is a basic set of tests for variable-width and fixed-size types
Pavel
2014-09-01 21:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 01/09/14 01:42, Tom Lane wrote:
>> BTW, was there a reason for not noticing the case of exact match in
>> the search loop, and falling out early? As it stands the code will
>> reliably choose the leftmost match if there are multiple equal items
>> in the search array, but do we care about such cases?
>
> I am not sure if we care, probably not.
>
> Anyway I attached patch that I am happy with. I am not yet sure what to do with naming.
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On 04/09/14 21:16, Pavel Stehule wrote: > > I did a review of last patch Thanks > > I found only one issue - float8 path has no own test in regress tests. > When this issue will be fixed, I will mark this patch as ready for commit > Ah yeah I forgot about those, fixed in attached patch. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 04/09/14 21:16, Pavel Stehule wrote:
>> I did a review of last patch
>
> Thanks
>
>> I found only one issue - float8 path has no own test in regress tests.
>> When this issue will be fixed, I will mark this patch as ready for commit
>
> Ah yeah I forgot about those, fixed in attached patch.

ok

It is ready for commit

Thank you

Pavel

> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-07 20:45:38 +0200, Pavel Stehule wrote:
> 2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
>> Ah yeah I forgot about those, fixed in attached patch.
>
> ok
>
> It is ready for commit

Tom, are you planning to take this one?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Tom, are you planning to take this one? Yeah, I already looked at it once, so will take it. I think the main remaining issue is that we don't have consensus on the function name AFAICT. I'm okay with using width_bucket(), as is done in the latest patch, but there were objections ... regards, tom lane
On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Tom, are you planning to take this one?
>
> Yeah, I already looked at it once, so will take it.

Cool.

> I think the main remaining issue is that we don't have consensus on
> the function name AFAICT. I'm okay with using width_bucket(), as
> is done in the latest patch, but there were objections ...

Just reread that part of the thread and personally I disliked all the other suggested names more than width_bucket.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/09/14 21:09, Andres Freund wrote: > On 2014-09-07 15:05:35 -0400, Tom Lane wrote: >> I think the main remaining issue is that we don't have consensus on >> the function name AFAICT. I'm okay with using width_bucket(), as >> is done in the latest patch, but there were objections ... > > Just reread that part of the thread and personally I disliked all the > other suggested names more than width_bucket. > Same here, that's why I didn't change it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr@2ndquadrant.com> writes: > On 07/09/14 21:09, Andres Freund wrote: >> On 2014-09-07 15:05:35 -0400, Tom Lane wrote: >>> I think the main remaining issue is that we don't have consensus on >>> the function name AFAICT. I'm okay with using width_bucket(), as >>> is done in the latest patch, but there were objections ... >> Just reread that part of the thread and personally I disliked all the >> other suggested names more than width_bucket. > Same here, that's why I didn't change it. Not hearing any further discussion, I committed it with that name (and a bit of further cleanup). regards, tom lane