Thread: Built-in binning functions
Hello,

here is a patch implementing a varwidth_bucket function (naming is up for
discussion) which does binning with variable bucket width. The use-cases are
the same as for width_bucket (data analytics, mainly histograms); the
difference is that width_bucket uses buckets of equal width, while
varwidth_bucket accepts a sorted array of right-bound thresholds to define
the individual buckets.

Currently applications implement this with long CASE statements, which are
quite hard to read and maintain and are much slower than this
implementation, which uses binary search.

There are 3 actual functions: one generic and two faster versions for int8
and float8 input that take advantage of the fixed width of those types.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n° 318633.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
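[Editor's note] The CASE-vs-binary-search point above can be illustrated with a small standalone sketch. This is hypothetical C, not the patch's actual code; the name `var_width_bucket` and the use of plain `double` thresholds are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch (not the patch's code): return the bucket number for
 * `operand` given `nelems` sorted, ascending right-bound thresholds.
 * Result is 0..nelems: 0 for operand below the first threshold, nelems for
 * operand at or above the last one.
 */
static int
var_width_bucket(double operand, const double *thresholds, size_t nelems)
{
	size_t		left = 0;
	size_t		right = nelems;

	/* Binary search: O(log N) comparisons instead of a long CASE chain. */
	while (left < right)
	{
		size_t		mid = left + (right - left) / 2;

		if (operand < thresholds[mid])
			right = mid;		/* operand falls in a lower bucket */
		else
			left = mid + 1;		/* operand is at/above this threshold */
	}
	return (int) left;
}
```

With thresholds `{1, 3, 5, 10}`, an operand of 4.5 lands in bucket 2 after only two comparisons, where the equivalent CASE statement would test each boundary in turn.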
On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> here is a patch implementing varwidth_bucket (naming is up for discussion)
> function which does binning with variable bucket width. The use-cases are
> same as for width_bucket (=data analytics, mainly histograms), the
> difference is that width_bucket uses buckets of same width but the
> varwidth_bucket accepts an sorted array of right-bound thresholds to define
> the individual buckets.
>
> Currently applications implement this with long CASE statements which are
> quite hard to read/maintain and are much slower than this implementation
> which uses binary search.
>
> There are 3 actual functions, one generic and two faster versions for the
> int8 and float8 input that take advantage of the static width of those
> types.

I wonder if stuff like this shouldn't live in contrib rather than core,
but I guess it's probably not worth it for 3 functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17/06/14 16:15, Robert Haas wrote:
> On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> here is a patch implementing varwidth_bucket (naming is up for discussion)
>> function which does binning with variable bucket width. The use-cases are
>> same as for width_bucket (=data analytics, mainly histograms), the
>> difference is that width_bucket uses buckets of same width but the
>> varwidth_bucket accepts an sorted array of right-bound thresholds to define
>> the individual buckets.
>
> I wonder if stuff like this shouldn't live in contrib rather than
> core, but I guess it's probably not worth it for 3 functions.

I was wondering the same, and I also think that 3 simple functions is not
that much; plus it seems a natural extension of the existing width_bucket.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr@2ndquadrant.com> writes:
> here is a patch implementing varwidth_bucket (naming is up for
> discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

Also, the set of functions provided still needs more thought, because the
resolution of overloaded functions doesn't really work as nicely as all
that.  I illustrate:

regression=# create function myf(int8) returns int as 'select 0' language sql;
CREATE FUNCTION
regression=# create function myf(float8) returns int as 'select 1' language sql;
CREATE FUNCTION
regression=# create function myf(anyelement) returns int as 'select 2' language sql;
CREATE FUNCTION
regression=# select myf(1);
 myf
-----
   1
(1 row)

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable.  The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but
TBH, I'm not sure that the specific-type functions are worth the trouble.
Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements?  Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

Also, I'm not convinced by this business of throwing an error for a
NULL array element.  Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly?  In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets.  We could possibly do something similar
in these functions by initially comparing the two endpoint elements to
see which one is larger, and then reversing the sense of all the
comparisons if the first one is larger.  I'm not sure if that's worth the
trouble or not, but if the SQL committee thought descending bucket
numbering was worthwhile, maybe it's worthwhile here too.

			regards, tom lane
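[Editor's note] The descending-bucket idea above (compare the endpoints, then reverse the sense of the comparisons) could be sketched as follows. This is purely illustrative C; the name `width_bucket_bidir` and the `double` threshold array are assumptions, not anything from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of bidirectional bucketing: detect whether the
 * thresholds run ascending or descending by comparing the two endpoint
 * elements, then flip every comparison for the descending case.
 */
static int
width_bucket_bidir(double operand, const double *thresholds, size_t nelems)
{
	/* Endpoint comparison decides the direction, as suggested above. */
	int			descending = (nelems > 1 &&
							  thresholds[0] > thresholds[nelems - 1]);
	size_t		left = 0;
	size_t		right = nelems;

	while (left < right)
	{
		size_t		mid = left + (right - left) / 2;
		int			below = descending ? (operand > thresholds[mid])
									   : (operand < thresholds[mid]);

		if (below)
			right = mid;		/* operand belongs before this threshold */
		else
			left = mid + 1;
	}
	return (int) left;
}
```

For descending thresholds `{10, 5, 3, 1}` this numbers buckets from the top: values above 10 get bucket 0, values below 1 get bucket 4, mirroring the ascending behavior.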
On 08/07/14 02:14, Tom Lane wrote:
> Petr Jelinek <petr@2ndquadrant.com> writes:
>> here is a patch implementing varwidth_bucket (naming is up for
>> discussion) function which does binning with variable bucket width.
>
> I didn't see any discussion of the naming question in this thread.
> I'd like to propose that it should be just "width_bucket()"; we can
> easily determine which function is meant, considering that the
> SQL-spec variants don't take arrays and don't even have the same
> number of actual arguments.

I did mention in the submission that the names are up for discussion; I am
all for naming it just width_bucket.

> So given plain integer arguments, we'll invoke the float8 version not the
> int8 version, which may be undesirable. The same for int2 arguments.
> We could fix that by adding bespoke int4 and maybe int2 variants, but

Hmm, yeah, I don't love the idea of having 3 functions just to cover the
integer datatypes :(.

> TBH, I'm not sure that the specific-type functions are worth the trouble.
> Maybe we should just have one generic function, and take the trouble to
> optimize its array-subscripting calculations for either fixed-length or
> variable-length array elements? Have you got performance measurements
> demonstrating that multiple implementations really buy enough to justify
> the extra code?

The performance difference is about 20% (+/- a few depending on the array
size); I don't know if that's bad enough to warrant a type-specific
implementation. I personally don't know how to make the generic
implementation much faster than it is now, except maybe by turning it into
an aggregate which would cache the deconstructed version of the array, but
that changes the semantics quite a bit and is probably not all that
desirable.

> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I am not against returning NULL for a NULL array; I would still like to
error on an array that has values + NULL somewhere in the middle though.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 08/07/14 02:14, Tom Lane wrote:
>> Petr Jelinek <petr@2ndquadrant.com> writes:
>>> here is a patch implementing varwidth_bucket (naming is up for
>>> discussion) function which does binning with variable bucket width.
>>
>> I didn't see any discussion of the naming question in this thread.
>> I'd like to propose that it should be just "width_bucket()"; we can
>> easily determine which function is meant, considering that the
>> SQL-spec variants don't take arrays and don't even have the same
>> number of actual arguments.
>
> I did mention in submission that the names are up for discussion, I am
> all for naming it just width_bucket.

+1

I had this idea too - but I am not sure if it is a good idea. The distance
between the ANSI SQL width_bucket and our enhancement is larger than in
our implementation of "median", for example.

I can live with both names, but I prefer the current name.

>> So given plain integer arguments, we'll invoke the float8 version not the
>> int8 version, which may be undesirable. The same for int2 arguments.
>> We could fix that by adding bespoke int4 and maybe int2 variants, but
>> TBH, I'm not sure that the specific-type functions are worth the trouble.
>> Maybe we should just have one generic function, and take the trouble to
>> optimize its array-subscripting calculations for either fixed-length or
>> variable-length array elements? Have you got performance measurements
>> demonstrating that multiple implementations really buy enough to justify
>> the extra code?
>
> Hmm, yeah I don't love the idea of having 3 functions just to cover
> integer datatypes :(.
>
> The performance difference is about 20% (+/- a few depending on the
> array size), I don't know if that's bad enough to warrant a
> type-specific implementation. I personally don't know how to make the
> generic implementation much faster than it is now, except maybe by
> turning it into an aggregate which would cache the deconstructed version
> of the array, but that changes the semantics quite a bit and is probably
> not all that desirable.

I am not sure if our API is enough to do it - there is no simple support
for immutable parameters.

The performance is one point. A second point is a wrong result when the
input array is not well sorted. But checking that would mean a further
performance penalty, so we cannot do it.

>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>
> I am not against returning NULL for NULL array, I would still like to
> error on array that has values + NULL somewhere in the middle though.

+1

Pavel

> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 16/07/14 21:35, Pavel Stehule wrote:
>     The performance difference is about 20% (+/- a few depending on the
>     array size), I don't know if that's bad enough to warrant a
>     type-specific implementation. I personally don't know how to make
>     the generic implementation much faster than it is now, except maybe
>     by turning it into an aggregate which would cache the deconstructed
>     version of the array, but that changes the semantics quite a bit and
>     is probably not all that desirable.
>
> I am not sure if our API is enough to do it - there is no simple support
> for immutable parameters.

Just to clarify, the ~20% performance difference is against a separate
generic implementation for fixed-width types (most of the time seems to be
spent in the FunctionCallInvoke part; I even tried to use sortsupport
instead, but it does not seem to help much).

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/07/14 02:14, Tom Lane wrote:
> Also, the set of functions provided still needs more thought, because the
> resolution of overloaded functions doesn't really work as nicely as all
> that.

I am honestly considering just removing the special case for int8 for now;
the SQL-standard width_bucket only has support for float8 and numeric, so
maybe that's enough for this one also... What's your opinion on that?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> 2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
>> On 08/07/14 02:14, Tom Lane wrote:
>>> Petr Jelinek <petr@2ndquadrant.com> writes:
>>>> here is a patch implementing varwidth_bucket (naming is up for
>>>> discussion) function which does binning with variable bucket width.
>>>
>>> I didn't see any discussion of the naming question in this thread.
>>> I'd like to propose that it should be just "width_bucket()"; we can
>>> easily determine which function is meant, considering that the
>>> SQL-spec variants don't take arrays and don't even have the same
>>> number of actual arguments.
>>
>> I did mention in submission that the names are up for discussion, I am all
>> for naming it just width_bucket.
>
> I had this idea too - but I am not sure if it is good idea. A distance
> between ANSI SQL width_bucket and our enhancing is larger than in our
> implementation of "median" for example.
>
> I can live with both names, but current name I prefer.

Hmmm, not sure. Let's look around and think about what words people use.
Transforms of this type are referred to as discretization in the formal
literature and as binning in common usage/statistical software.

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries of
which could have been decided using equal-frequency or other mechanisms.
Using the word "width" in those contexts could be confusing. Given that
width_bucket() is already in use for the SQL Standard function, I agree it
would be confusing to have the same name for different parameter profiles.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly).

Whatever we do, it seems clear we need a section in the manual to describe
Statistical Functions, including width_bucket() and whatever we call this
function, and including the terms bin, binning, transform, discretize and
discretization to ensure those are easily searchable.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

I finally had some time to get back to this.

I attached version 3 of the patch, which "fixes" Tom's complaint about the
int8 version by removing it, as it does not seem necessary (the float8
version can handle integers just fine). This patch now basically has just
one optimized function for the float8 and one for the numeric datatypes,
just like width_bucket.

On 08/07/14 02:14, Tom Lane wrote:
> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I really think there should be a difference between a NULL array and a
NULL inside an array, and that NULL inside the array is wrong input. I
changed the check to array_contains_nulls() though.

On 08/07/14 02:14, Tom Lane wrote:
> Lastly, the spec defines behaviors for width_bucket that allow either
> ascending or descending buckets. We could possibly do something
> similar

I am not sure it's worth it here, as we require the input to be sorted
anyway. It might be worthwhile if we decided to do this as an aggregate
(since there the input would not be required to be presorted) instead of a
function, but I am not sure that's desirable either, as it would limit
usability to only the single main use-case (group by and count()).

On 20/07/14 11:01, Simon Riggs wrote:
> On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> On 08/07/14 02:14, Tom Lane wrote:
>>>> I didn't see any discussion of the naming question in this thread.
>>>> I'd like to propose that it should be just "width_bucket()"; we can
>>>> easily determine which function is meant, considering that the
>>>> SQL-spec variants don't take arrays and don't even have the same
>>>> number of actual arguments.
>>>
>>> I did mention in submission that the names are up for discussion, I am all
>>> for naming it just width_bucket.
>>
>> I had this idea too - but I am not sure if it is good idea. A distance
>> between ANSI SQL width_bucket and our enhancing is larger than in our
>> implementation of "median" for example.
>>
>> I can live with both names, but current name I prefer.
>
> So I suggest that we use the more generic function name bin(), with a
> second choice of discretize() (though that seems annoyingly easy to
> spell incorrectly)

I really don't think bin() is a good choice here; bucket has the same
meaning in this context, and bin is often used as shorthand for binary,
which would be confusing here.

Anyway, I have currently left the name as it is. I would not be against
using width_bucket() or even just bucket(); not sure about the
discretize() option.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi

I am looking at your last patch and I have a few questions and notes:

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket"?

The last patch is very simple; there are no new compilation or regression
test issues.

Pavel

2014-08-25 16:15 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> Hi,
>
> I finally had some time to get back to this.
>
> I attached version3 of the patch which "fixes" Tom's complaint about int8
> version by removing the int8 version as it does not seem necessary (the
> float8 can handle integers just fine).
>
> This patch now basically has just one optimized function for float8 and
> one for numeric datatypes, just like width_bucket.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>
> I really think there should be difference between NULL array and NULL
> inside array and that NULL inside the array is wrong input. I changed the
> check to array_contains_nulls() though.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Lastly, the spec defines behaviors for width_bucket that allow either
>> ascending or descending buckets. We could possibly do something
>> similar
>
> I am not sure it's worth it here as we require input to be sorted anyway.
> It might be worthwhile if we decided to do this as an aggregate (since
> there input would not be required to be presorted) instead of function
> but I am not sure if that's desirable either as that would limit
> usability to only the single main use-case (group by and count()).
>
> On 20/07/14 11:01, Simon Riggs wrote:
>> So I suggest that we use the more generic function name bin(), with a
>> second choice of discretize() (though that seems annoyingly easy to
>> spell incorrectly)
>
> I really don't think bin() is good choice here, bucket has same meaning
> in this context and bin is often used as shorthand for binary which would
> be confusing here.
>
> Anyway I currently left the name as it is, I would not be against using
> width_bucket() or even just bucket(), not sure about the discretize()
> option.
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Pavel Stehule <pavel.stehule@gmail.com> writes:
> 1. I am thinking so reduction to only numeric types is not necessary -
> although we can live without it - but there are lot of non numeric
> categories: chars, date, ...

I wasn't terribly happy about that either.  I still think we should reduce
this to a single polymorphic function, as in the attached.

> 2. Still I strongly afraid about used searching method - there is not
> possible to check a validity of input. Did you check how much linear
> searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N).  That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()).  I think it's probably
sufficient to have a clear warning in the docs.

> 3. I am thinking about name - I don't think so varwidth_bucket is correct
> -- in relation to name "width_bucket" ... what about "range_bucket" or
> "scope_bucket" ?

Neither of those seem like improvements from here.  I agree with the
objections to bin() as well.  bucket() might be all right but it still
seems a bit too generic.  width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best of
the suggestions so far.

It occurs to me that there would be an advantage to using some other name
besides width_bucket: we could then mark the function as variadic, which
might be a notational advantage in some usages.  (We can't do that if we
call it width_bucket because the four-parameter case would be ambiguous
with the existing functions.)  I'm not sure that this is important enough
to justify changing the name though, especially if we can't come up with
a good name.  Also, doing that would put a very large premium on picking
a non-generic name, else we'd risk creating ambiguities against
user-defined functions.
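[Editor's note] The O(N)-vs-O(log N) trade-off above can be made concrete with a small sketch. This is hypothetical C for illustration only; the helper name `thresholds_are_sorted` is made up and not part of any patch in this thread:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical validity check: verify the thresholds are in ascending
 * order.  This costs N-1 comparisons up front, which for large arrays and
 * expensive comparison functions would swamp the O(log N) binary search
 * itself -- hence the decision above to warn in the docs instead.
 */
static int
thresholds_are_sorted(const double *thresholds, size_t nelems)
{
	size_t		i;

	for (i = 1; i < nelems; i++)	/* linear scan: O(N) comparisons */
	{
		if (thresholds[i - 1] > thresholds[i])
			return 0;			/* out of order: results would be bogus */
	}
	return 1;
}
```

A binary search over an unsorted array still terminates, but its answer is meaningless, which is why a "must be sorted" documentation warning stands in for this check.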
Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds".  However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing.  Thoughts?

Proposed patch with 1 polymorphic function attached.

			regards, tom lane


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 722640b..93cf1e7 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 901,925 ****
        <indexterm>
         <primary>width_bucket</primary>
        </indexterm>
!       <literal><function>width_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
        </entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>

       <row>
!       <entry><literal><function>width_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>
      </tbody>
     </tgroup>
    </table>
--- 901,936 ----
        <indexterm>
         <primary>width_bucket</primary>
        </indexterm>
!       <literal><function>width_bucket(<parameter>operand</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
        </entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>

       <row>
!       <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
        <entry><type>int</type></entry>
        <entry>return the bucket to which <parameter>operand</> would
        be assigned in an equidepth histogram with <parameter>count</>
!       buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>
+
+      <row>
+       <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>anyelement</type>, <parameter>thresholds</parameter> <type>anyarray</type>)</function></literal></entry>
+       <entry><type>int</type></entry>
+       <entry>return the bucket to which <parameter>operand</> would
+       be assigned given an array listing the upper bounds of the buckets;
+       the <parameter>thresholds</> array <emphasis>must be sorted</>,
+       smallest first, or unexpected results will be obtained</entry>
+       <entry><literal>width_bucket(now(), array['yesterday', 'today', 'tomorrow']::timestamptz[])</literal></entry>
+       <entry><literal>2</literal></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index f8e94ec..f3bba13 100644
*** a/src/backend/utils/adt/arrayfuncs.c
--- b/src/backend/utils/adt/arrayfuncs.c
*************** static ArrayType *array_replace_internal
*** 130,135 ****
--- 130,143 ----
  						Datum replace, bool replace_isnull,
  						bool remove, Oid collation,
  						FunctionCallInfo fcinfo);
+ static int	width_bucket_fixed(Datum operand,
+ 					ArrayType *thresholds,
+ 					Oid collation,
+ 					TypeCacheEntry *typentry);
+ static int	width_bucket_variable(Datum operand,
+ 					ArrayType *thresholds,
+ 					Oid collation,
+ 					TypeCacheEntry *typentry);

*************** array_replace(PG_FUNCTION_ARGS)
*** 5502,5504 ****
--- 5510,5681 ----
  					   fcinfo);
  	PG_RETURN_ARRAYTYPE_P(array);
  }
+
+ /*
+  * Implements width_bucket(anyelement, anyarray).
+  *
+  * 'thresholds' is an array containing upper bound values for each bucket;
+  * these must be sorted from smallest to largest, or bogus results will be
+  * produced.  If N thresholds are supplied, the output is from 0 to N:
+  * 0 is for inputs < first threshold, N is for inputs >= last threshold.
+  */
+ Datum
+ width_bucket_generic(PG_FUNCTION_ARGS)
+ {
+ 	Datum		operand = PG_GETARG_DATUM(0);
+ 	ArrayType  *thresholds = PG_GETARG_ARRAYTYPE_P(1);
+ 	Oid			collation = PG_GET_COLLATION();
+ 	Oid			element_type = ARR_ELEMTYPE(thresholds);
+ 	TypeCacheEntry *typentry;
+ 	int			result;
+
+ 	/* Check input */
+ 	if (ARR_NDIM(thresholds) > 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+ 				 errmsg("thresholds must be one-dimensional array")));
+
+ 	if (array_contains_nulls(thresholds))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ 				 errmsg("thresholds array must not contain NULLs")));
+
+ 	/* Cache information about the input type */
+ 	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+ 	if (typentry == NULL ||
+ 		typentry->type_id != element_type)
+ 	{
+ 		typentry = lookup_type_cache(element_type,
+ 									 TYPECACHE_CMP_PROC_FINFO);
+ 		if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_UNDEFINED_FUNCTION),
+ 					 errmsg("could not identify a comparison function for type %s",
+ 							format_type_be(element_type))));
+ 		fcinfo->flinfo->fn_extra = (void *) typentry;
+ 	}
+
+ 	/* We have two implementation paths for fixed- and variable-width types */
+ 	if (typentry->typlen > 0)
+ 		result = width_bucket_fixed(operand, thresholds, collation, typentry);
+ 	else
+ 		result = width_bucket_variable(operand, thresholds, collation, typentry);
+
+ 	/* Avoid leaking memory when handed toasted input. */
+ 	PG_FREE_IF_COPY(thresholds, 1);
+
+ 	PG_RETURN_INT32(result);
+ }
+
+ /*
+  * width_bucket for fixed-width data types.
+  */
+ static int
+ width_bucket_fixed(Datum operand,
+ 				   ArrayType *thresholds,
+ 				   Oid collation,
+ 				   TypeCacheEntry *typentry)
+ {
+ 	char	   *thresholds_data;
+ 	int			typlen;
+ 	FunctionCallInfoData locfcinfo;
+ 	int			left;
+ 	int			right;
+
+ 	/*
+ 	 * Since we know the array contains no NULLs, we can just index it
+ 	 * directly.
+ 	 */
+ 	thresholds_data = (char *) ARR_DATA_PTR(thresholds);
+ 	typlen = typentry->typlen;
+
+ 	InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+ 							 collation, NULL, NULL);
+
+ 	/* Find the bucket */
+ 	left = 0;
+ 	right = ArrayGetNItems(ARR_NDIM(thresholds), ARR_DIMS(thresholds));
+ 	while (left < right)
+ 	{
+ 		int			mid;
+ 		char	   *ptr;
+ 		Datum		elm;
+ 		int32		cmpresult;
+
+ 		mid = (left + right) / 2;
+
+ 		ptr = thresholds_data + mid * typlen;
+ 		elm = fetch_att(ptr, typentry->typbyval, typlen);
+
+ 		locfcinfo.arg[0] = operand;
+ 		locfcinfo.arg[1] = elm;
+ 		locfcinfo.argnull[0] = false;
+ 		locfcinfo.argnull[1] = false;
+ 		locfcinfo.isnull = false;
+
+ 		cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+ 		if (cmpresult < 0)
+ 			right = mid;
+ 		else
+ 			left = mid + 1;
+ 	}
+
+ 	return left;
+ }
+
+ /*
+  * width_bucket for variable-width data types.
+  */
+ static int
+ width_bucket_variable(Datum operand,
+ 					  ArrayType *thresholds,
+ 					  Oid collation,
+ 					  TypeCacheEntry *typentry)
+ {
+ 	Datum	   *dvalues;
+ 	int			nelems;
+ 	FunctionCallInfoData locfcinfo;
+ 	int			left;
+ 	int			right;
+
+ 	/*
+ 	 * The easiest and probably most efficient way to locate the array
+ 	 * elements is with deconstruct_array.
+ 	 */
+ 	deconstruct_array(thresholds,
+ 					  typentry->type_id, typentry->typlen,
+ 					  typentry->typbyval, typentry->typalign,
+ 					  &dvalues, NULL, &nelems);
+
+ 	InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+ 							 collation, NULL, NULL);
+
+ 	/* Find the bucket */
+ 	left = 0;
+ 	right = nelems;
+ 	while (left < right)
+ 	{
+ 		int			mid;
+ 		int32		cmpresult;
+
+ 		mid = (left + right) / 2;
+
+ 		locfcinfo.arg[0] = operand;
+ 		locfcinfo.arg[1] = dvalues[mid];
+ 		locfcinfo.argnull[0] = false;
+ 		locfcinfo.argnull[1] = false;
+ 		locfcinfo.isnull = false;
+
+ 		cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+ 		if (cmpresult < 0)
+ 			right = mid;
+ 		else
+ 			left = mid + 1;
+ 	}
+
+ 	pfree(dvalues);
+
+ 	return left;
+ }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..9b55acc 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2334 (  array_agg_fina
*** 885,890 ****
--- 885,892 ----
  DESCR("aggregate final function");
  DATA(insert OID = 2335 (  array_agg		   PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 2277 "2283" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
  DESCR("concatenate aggregate input into an array");
+ DATA(insert OID = 3218 (  width_bucket	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2283 2277" _null_ _null_ _null_ _null_ width_bucket_generic _null_ _null_ _null_ ));
+ DESCR("bucket number of operand given a sorted array of bucket upper bounds");
  DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ array_typanalyze _null_ _null_ _null_ ));
  DESCR("array typanalyze");
  DATA(insert OID = 3817 (  arraycontsel	   PGNSP PGUID 12 1 0 0 0 f f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null_ _null_ _null_ arraycontsel _null_ _null_ _null_ ));
diff --git a/src/include/utils/array.h b/src/include/utils/array.h
index 9bbfaae..2abaccd 100644
*** a/src/include/utils/array.h
--- b/src/include/utils/array.h
*************** extern Datum array_fill_with_lower_bound
*** 214,219 ****
--- 214,220 ----
  extern Datum array_unnest(PG_FUNCTION_ARGS);
  extern Datum array_remove(PG_FUNCTION_ARGS);
  extern Datum array_replace(PG_FUNCTION_ARGS);
+ extern Datum width_bucket_generic(PG_FUNCTION_ARGS);
  extern Datum array_ref(ArrayType *array, int nSubscripts, int *indx,
  		  int arraytyplen, int elmlen, bool elmbyval, char elmalign,
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 4286691..04f2400 100644
*** a/src/test/regress/expected/arrays.out
--- b/src/test/regress/expected/arrays.out
*************** select length(md5((f1[1]).c2)) from dest
*** 1706,1708 ****
--- 1706,1807 ----
  drop table dest;
  drop type textandtext;
+ -- Tests for generic form of width_bucket()
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+   width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+   width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+         op        | wb_1 | wb_2 | wb_3
+ ------------------+------+------+------
+              -5.2 |    0 |    0 |    1
+     -0.0000000001 |    0 |    0 |    2
+    0.000000000001 |    0 |    1 |    2
+                 1 |    1 |    1 |    2
+  1.99999999999999 |    1 |    1 |    2
+                 2 |    1 |    1 |    3
+  2.00000000000001 |    1 |    1 |    3
+                 3 |    2 |    1 |    3
+                 4 |    2 |    1 |    3
+               4.5 |    2 |    1 |    3
+                 5 |    3 |    1 |    3
+               5.5 |    3 |    2 |    3
+                 6 |    3 |    2 |    3
+                 7 |    3 |    2 |    3
+                 8 |    3 |    2 |    3
+                 9 |    3 |    2 |    3
+  9.99999999999999 |    3 |    3 |    3
+                10 |    4 |    3 |    3
+  10.0000000000001 |    4 |    3 |    3
+ (19 rows)
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+  op | wb_1
+ ----+------
+   0 |    0
+   1 |    1
+   2 |    1
+   3 |    2
+   4 |    2
+   5 |    3
+   6 |    3
+   7 |    3
+   8 |    3
+   9 |    3
+  10 |    4
+  11 |    4
+ (12 rows)
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+  width_bucket
+ --------------
+             2
+ (1 row)
+
+ SELECT width_bucket(5, ARRAY[3]);
+  width_bucket
+ --------------
+             1
+ (1 row)
+
+ SELECT width_bucket(5, '{}');
+  width_bucket
+ --------------
+             0
+ (1 row)
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ ERROR:  function width_bucket(text, integer[]) does not exist
+ LINE 1: SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+                ^
+ HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ ERROR:  thresholds array must not contain NULLs
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+ ERROR:  thresholds must be one-dimensional array
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index d9f7cbf..eef3eb3 100644
*** a/src/test/regress/sql/arrays.sql
--- b/src/test/regress/sql/arrays.sql
*************** drop table src;
*** 476,478 ****
--- 476,523 ----
  select length(md5((f1[1]).c2)) from dest;
  drop table dest;
  drop type textandtext;
+
+ -- Tests for generic form of width_bucket()
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+   width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+   width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+
+ SELECT
+   op,
+   width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+
+ SELECT width_bucket(5, ARRAY[3]);
+ SELECT width_bucket(5, '{}');
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
On 30 August 2014 18:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 3. I am thinking about name - I don't think so varwidth_bucket is correct
>> -- in relation to name "width_bucket" ... what about "range_bucket" or
>> "scope_bucket" ?
>
> Neither of those seem like improvements from here. I agree with the
> objections to bin() as well. bucket() might be all right but it still
> seems a bit too generic. width_bucket(), which at least shows there's
> a relationship to the standard functions, still seems like the best
> of the suggestions so far.

I'd been mulling that over and agree with the objections to bin(). I suggest discretize() as a much more informative name. The other names will be overlooked by anybody that doesn't already know what to look for.

Thanks for the polymorphic patch, that sounds good. Sorry, not reviewed more deeply (still travelling).
Simon Riggs <simon@2ndquadrant.com> writes: > Suggest discretize() as a much more informative name. The other names > will be overlooked by anybody that doesn't already know what to look > for. I did not like that idea to begin with, but it's growing more attractive. In particular, I think it would be unique enough that we could safely mark the polymorphic function as variadic (a move that would cause ambiguity issues for pretty nearly any user-defined function of the same name). OTOH, I'm not as convinced as you that this name would mean anything to "anybody that doesn't already know what to look for". It still seems like rather arcane terminology. regards, tom lane
On 01/09/14 06:00, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: >> Suggest discretize() as a much more informative name. The other names >> will be overlooked by anybody that doesn't already know what to look >> for. > I did not like that idea to begin with, but it's growing more attractive. > In particular, I think it would be unique enough that we could safely > mark the polymorphic function as variadic (a move that would cause > ambiguity issues for pretty nearly any user-defined function of the > same name). > > OTOH, I'm not as convinced as you that this name would mean anything > to "anybody that doesn't already know what to look for". It still > seems like rather arcane terminology. > > regards, tom lane > > If I was new to PostgreSQL, I'd probably read this as disc*(), a function to do something with discs. I have done finite maths and statistics at university, plus I have been vaguely following this thread - but, the meaning of discretize() is not immediately apparent to me (if I reread a previous email or 2 in this thread, then I'm sure it would then be obvious!). I might guess that it is to make something discrete, but why would I want to do that? So if I came across this function in a year or 2, I'd probably go right past it. I have been programming for over 40 years, and I still find choosing appropriate names sometimes very difficult, so I know it is not easy to come up with meaningful names - especially ones that mean something relevant to other people! Cheers, Gavin
On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.

I did try to write a generic function very similar to what you wrote, but discarded it for performance reasons. If we really want to support any data type I am all for having a generic function, but I would still like to have one optimized for float8, because that seems to be the most common use-case (it is the reason I wrote this in the first place).

Here are some numbers.

First float8:

CREATE TABLE wbtest AS
  SELECT (random() * 100)::float8 a
  FROM generate_series(1,1000000) ORDER BY 1;

SELECT
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms

Numeric:

CREATE TABLE wbtestn AS
  SELECT (random() * 100)::numeric a
  FROM generate_series(1,1000000) ORDER BY 1;

SELECT
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
  width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is that Tom's is slowed down by the deconstruct_array.

>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use case?
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.

With a re-sort the performance is worse than the bunch of CASE WHENs :(

> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?

Upper bounds is probably a better name than right bounds, I agree with that. In any case it's the upper bound for a bucket (that's why there is one more bucket, into which everything bigger than the max threshold goes).

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
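[Editor's note: the inclusive-lower/exclusive-upper semantics discussed above map exactly onto a right-biased binary search over the sorted thresholds array. A minimal Python sketch of the behavior follows — an illustration of the semantics only, not the patch's C implementation:]

```python
from bisect import bisect_right

def width_bucket(operand, thresholds):
    # Bucket i covers [thresholds[i-1], thresholds[i]): each threshold
    # is an inclusive lower bound / exclusive upper bound. Values below
    # the first threshold land in bucket 0; values at or above the last
    # threshold land in the extra bucket len(thresholds).
    return bisect_right(thresholds, operand)

# Mirrors the regression-test expectations for ARRAY[1, 3, 5, 10]:
assert width_bucket(0, [1, 3, 5, 10]) == 0
assert width_bucket(1, [1, 3, 5, 10]) == 1   # equal to a threshold
assert width_bucket(4, [1, 3, 5, 10]) == 2
assert width_bucket(10, [1, 3, 5, 10]) == 4
```

The right-biased search (`bisect_right`) is what makes an operand equal to a threshold fall into the higher bucket, which matches the expected output in the regression tests.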
Petr Jelinek <petr@2ndquadrant.com> writes:
> On 30/08/14 19:24, Tom Lane wrote:
>> I wasn't terribly happy about that either. I still think we should
>> reduce this to a single polymorphic function, as in the attached.

> I did try to write generic function very similar to what you wrote but
> discarded it because of the performance reasons. If we really want to
> support any data type I am all for having generic function but I still
> would like to have one optimized for float8 because that seems to be the
> most used use-case (at least that one is the reason why I even wrote
> this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that it's cheating: it will not produce the right answers for comparisons involving NaNs. I'm not sure how good it would look once you'd added some isnan() tests to make the behavior actually equivalent to float8_cmp_internal().

However, assuming it still seems worthwhile once that's accounted for, I don't have a strong objection to having an additional code path for float8. There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg

    if (element_type == FLOAT8OID)
        result = width_bucket_float8(...);
    else if (typentry->typlen > 0)
        result = width_bucket_fixed(...);
    else
        result = width_bucket_variable(...);

2. Define a SQL function width_bucket(float8, float8[]) alongside the polymorphic one.

While #2 might be marginally faster at runtime, the main difference between these is what the parser would choose to do with ambiguous cases. I experimented a bit and it seemed that the parser would prefer the float8 function in any situation where the inputs were of numeric types, probably because float8 is the preferred type in the numeric category.
I'm not real sure that would be a good thing: in particular it would cast integer and numeric inputs to float8, which risks precision loss, and there doesn't seem to be any way to get the parser to make the other choice if the inputs are of numeric-category types. As against that objection, it would make cross-type numeric cases easier to use, for example "width_bucket(1, array[2.4])" would be accepted. If we just offer the anyelement/anyarray version then the parser would make you cast "1" to numeric before it would consider the function to match. On balance the runtime-test approach looks like a better idea, in that it doesn't risk any unexpected semantic behaviors. > The difference between my generic and Tom's generic is because Tom's is > slowed down by the deconstruct_array. Meh. It looked to me like your version would have O(N^2) performance problems from computing array offsets repeatedly, depending on exactly which array element it ended up on. It would avoid repeat calculations only if it always moved right. regards, tom lane
On 31 August 2014 20:44, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > On 01/09/14 06:00, Tom Lane wrote: >> >> Simon Riggs <simon@2ndquadrant.com> writes: >>> >>> Suggest discretize() as a much more informative name. The other names >>> will be overlooked by anybody that doesn't already know what to look >>> for. >> >> I did not like that idea to begin with, but it's growing more attractive. >> In particular, I think it would be unique enough that we could safely >> mark the polymorphic function as variadic (a move that would cause >> ambiguity issues for pretty nearly any user-defined function of the >> same name). >> >> OTOH, I'm not as convinced as you that this name would mean anything >> to "anybody that doesn't already know what to look for". It still >> seems like rather arcane terminology. >> >> > If I was new to PostgreSQL, I'd probably read this as disc*(), a function to > do something with discs. > > I have done finite maths and statistics at university, plus I have been > vaguely following this thread - but, the meaning of discretize() is not > immediately apparent to me (if I reread a previous email or 2 in this > thread, then I'm sure it would then be obvious!). I might guess that it is > to make something discrete, but why would I want to do that? So if I came > across this function in a year or 2, I'd probably go right past it. > > I have been programming for over 40 years, and I still find choosing > appropriate names sometimes very difficult, so I know it is not easy to come > up with meaningful names - especially ones that mean something relevant to > other people! It's important that we *don't* come up with a new name. This function does something useful and the words people already use to describe that are binning and discretization. By this I mean names already used by very widely used software. Search them and see, or suggest more widely used alternatives, please.
On 31/08/14 22:33, Tom Lane wrote: > Petr Jelinek <petr@2ndquadrant.com> writes: >> On 30/08/14 19:24, Tom Lane wrote: >>> I wasn't terribly happy about that either. I still think we should >>> reduce this to a single polymorphic function, as in the attached. > >> I did try to write generic function very similar to what you wrote but >> discarded it because of the performance reasons. If we really want to >> support any data type I am all for having generic function but I still >> would like to have one optimized for float8 because that seems to be the >> most used use-case (at least that one is the reason why I even wrote >> this) for performance reasons. > > Well, part of the reason why your v3 float8 function looks so fast is that > it's cheating: it will not produce the right answers for comparisons > involving NaNs. I'm not sure how good it would look once you'd added > some isnan() tests to make the behavior actually equivalent to > float8_cmp_internal(). > True, it increases the time of the test to around 285ms, still almost 30% speed difference. > However, assuming it still seems worthwhile once that's accounted for, > I don't have a strong objection to having an additional code path for > float8. There are two ways we could do that: > > 1. Add a runtime test in the polymorphic function, eg > > if (element_type == FLOAT8OID) > result = width_bucket_float8(...); > else if (typentry->typlen > 0) > result = width_bucket_fixed(...); > else > result = width_bucket_variable(...); > Yeah I think I prefer this one too, will see how much performance it eats. > >> The difference between my generic and Tom's generic is because Tom's is >> slowed down by the deconstruct_array. > > Meh. It looked to me like your version would have O(N^2) performance > problems from computing array offsets repeatedly, depending on exactly > which array element it ended up on. It would avoid repeat calculations > only if it always moved right. 
> I actually think that worst case (when you go always left) for my version is O(N) since you only need to seek for the half of previous interval (it's doing binary search after all) and you do O(N) in the deconstruct_array. It would be very different if we could cache the array somehow (ie, if this was an aggregate) then it would obviously make a lot of sense to use deconstruct_array and in that case it would make even sense to resort probably, but sadly we can't cache afaik. Also, I made more tests with various array sizes (3-10000) and distributions and mine version was never slower. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> width_bucket() seems to refer to an equal-width binning process. The
> function being discussed here is a generic mechanism, the boundaries
> of which could have been decided using equal-frequency or other
> mechanisms. Using the word "width" in those contexts could be
> confusing.
>
> Given width_bucket() is already in use for SQL Standard function, I
> agree it would be confusing to have same name for different parameter
> profiles.

literal_bucket(float8, float8[])

Since "bucket" is the 'verb' here (in this specific case meaning "lookup the supplied value in the supplied bucket definition") and "width" is a modifier (the bucket specification describes an equal-width structure) I suggest "literal_bucket(val, array[])" such that the bucket is still the verb but now the modifier describes a structure that is literally provided.

David J.
Petr Jelinek <petr@2ndquadrant.com> writes: > On 31/08/14 22:33, Tom Lane wrote: >> Petr Jelinek <petr@2ndquadrant.com> writes: >>> The difference between my generic and Tom's generic is because Tom's is >>> slowed down by the deconstruct_array. >> Meh. It looked to me like your version would have O(N^2) performance >> problems from computing array offsets repeatedly, depending on exactly >> which array element it ended up on. It would avoid repeat calculations >> only if it always moved right. > I actually think that worst case (when you go always left) for my > version is O(N) since you only need to seek for the half of previous > interval (it's doing binary search after all) and you do O(N) in the > deconstruct_array. [ thinks about that for awhile... ] Hm, I think you are right. If there are N elements in the array then we will scan over N/2 of them to locate the midpoint element for the first comparison. After that, we will go either left or right, and in either case we will need to scan over N/4 elements to find the second-level midpoint. The third-level midpoint will be found after scanning N/8 elements, and so on. So the number of elements scanned over to compute their lengths is N/2 + N/4 + N/8 ..., or asymptotically N, the same as for the deconstruct_array approach. deconstruct_array might have some small advantage from tighter inner loops, but this is probably more than blown by the need for a palloc and pfree. So I was wrong and your way is better for navigating the array in the variable-width case, assuming that we're not concerned about spending some more code space to make this function a shade faster. BTW, was there a reason for not noticing the case of exact match in the search loop, and falling out early? As it stands the code will reliably choose the leftmost match if there are multiple equal items in the search array, but do we care about such cases? regards, tom lane
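[Editor's note: the geometric-series argument above can be checked numerically. Below is a rough Python model of the cost — my own construction, not code from the patch — in which locating a midpoint in a variable-width array costs one "scan" per element stepped over from the nearest offset already computed. The total stays around N (the N/2 + N/4 + N/8 + ... sum), plus at most a small logarithmic term from integer rounding:]

```python
def scan_cost(n, target):
    """Count elements stepped over to locate each binary-search midpoint,
    assuming the walk can resume from the nearest already-known offset."""
    visited = {0}            # element indexes whose byte offset is known
    left, right = 0, n
    scans = 0
    while left < right:
        mid = (left + right) // 2
        # walk forward from the closest known offset at or before mid
        start = max(p for p in visited if p <= mid)
        scans += mid - start
        visited.add(mid)
        if target < mid:     # stand-in for the comparison-function call
            right = mid
        else:
            left = mid + 1
    return scans

# Total walking is ~N for any search target, i.e. O(N) overall,
# the same asymptotic cost as one deconstruct_array pass:
for n in (10, 100, 1024):
    assert all(scan_cost(n, t) <= n + n.bit_length() for t in range(n))
```

Under this model both approaches pay O(N) element-walking plus O(log N) comparisons, so the choice comes down to constant factors (palloc/pfree and loop tightness), which is the conclusion reached above.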
David G Johnston <david.g.johnston@gmail.com> writes: > Since "bucket" is the 'verb' here (in this specific case meaning "lookup the > supplied value in the supplied bucket definition") and "width" is a modifier > (the bucket specification describes an equal-width structure) I suggest > "literal_bucket(val, array[])" such that the bucket is still the verb but > now the modifier describes a structure that is literally provided. It's a very considerable stretch to see "bucket" as a verb here :-). Maybe that's why the SQL committee's choice of function name seems so unnatural (to me anyway). I was wondering about bucket_index(), ie "get the index of the bucket this value falls into". Or get_bucket(), or get_bucket_index() if you like verbosity. regards, tom lane
On Sun, Aug 31, 2014 at 7:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David G Johnston <david.g.johnston@gmail.com> writes:
>> Since "bucket" is the 'verb' here (in this specific case meaning "lookup the
>> supplied value in the supplied bucket definition") and "width" is a modifier
>> (the bucket specification describes an equal-width structure) I suggest
>> "literal_bucket(val, array[])" such that the bucket is still the verb but
>> now the modifier describes a structure that is literally provided.
>
> It's a very considerable stretch to see "bucket" as a verb here :-).
> Maybe that's why the SQL committee's choice of function name seems
> so unnatural (to me anyway).
>
> I was wondering about bucket_index(), ie "get the index of the bucket
> this value falls into". Or get_bucket(), or get_bucket_index() if you
> like verbosity.
>
> regards, tom lane

I got stuck on the thought that a function name should ideally be/include a verb...

Even if you read it as a noun (and thus the verb is an implied "get") the naming logic still holds.

I pondered a "get_" version though the argument for avoiding conflicting user-code decreases its appeal.

The good part about SQL standard naming is that the typical programmer isn't likely to pick a conflicting name.

"bucket_index" is appealing by itself though user-code probable... as bad as I think "width_bucket" is for a name, the fact is that it currently exists and, even forced, consistency has merit.

David J.
On 01/09/14 01:42, Tom Lane wrote: > > BTW, was there a reason for not noticing the case of exact match in > the search loop, and falling out early? As it stands the code will > reliably choose the leftmost match if there are multiple equal items > in the search array, but do we care about such cases? > I am not sure if we care, probably not. Anyway I attached patch that I am happy with. I am not yet sure what to do with naming. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi
I did a review of last patch
1. There is no problem with patching
4. The name width_bucket is probably the one with wide agreement
5. There is a basic set of tests for variable-width and fixed-size types
Pavel
2014-09-01 21:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 01/09/14 01:42, Tom Lane wrote:
>> BTW, was there a reason for not noticing the case of exact match in
>> the search loop, and falling out early? As it stands the code will
>> reliably choose the leftmost match if there are multiple equal items
>> in the search array, but do we care about such cases?
>
> I am not sure if we care, probably not.
>
> Anyway I attached patch that I am happy with. I am not yet sure what to do with naming.
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On 04/09/14 21:16, Pavel Stehule wrote: > > I did a review of last patch Thanks > > I found only one issue - float8 path has no own test in regress tests. > When this issue will be fixed, I will mark this patch as ready for commit > Ah yeah I forgot about those, fixed in attached patch. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
> On 04/09/14 21:16, Pavel Stehule wrote:
>> I did a review of last patch
>
> Thanks
>
>> I found only one issue - float8 path has no own test in regress tests.
>> When this issue will be fixed, I will mark this patch as ready for commit
>
> Ah yeah I forgot about those, fixed in attached patch.

ok

It is ready for commit

Thank you

Pavel

> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-07 20:45:38 +0200, Pavel Stehule wrote:
> 2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:
>> Ah yeah I forgot about those, fixed in attached patch.
>
> ok
>
> It is ready for commit

Tom, are you planning to take this one?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Tom, are you planning to take this one? Yeah, I already looked at it once, so will take it. I think the main remaining issue is that we don't have consensus on the function name AFAICT. I'm okay with using width_bucket(), as is done in the latest patch, but there were objections ... regards, tom lane
On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Tom, are you planning to take this one?
>
> Yeah, I already looked at it once, so will take it.

Cool.

> I think the main remaining issue is that we don't have consensus on
> the function name AFAICT. I'm okay with using width_bucket(), as
> is done in the latest patch, but there were objections ...

Just reread that part of the thread and personally I disliked all the other suggested names more than width_bucket.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/09/14 21:09, Andres Freund wrote: > On 2014-09-07 15:05:35 -0400, Tom Lane wrote: >> I think the main remaining issue is that we don't have consensus on >> the function name AFAICT. I'm okay with using width_bucket(), as >> is done in the latest patch, but there were objections ... > > Just reread that part of the thread and personally I disliked all the > other suggested names more than width_bucket. > Same here, that's why I didn't change it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr@2ndquadrant.com> writes: > On 07/09/14 21:09, Andres Freund wrote: >> On 2014-09-07 15:05:35 -0400, Tom Lane wrote: >>> I think the main remaining issue is that we don't have consensus on >>> the function name AFAICT. I'm okay with using width_bucket(), as >>> is done in the latest patch, but there were objections ... >> Just reread that part of the thread and personally I disliked all the >> other suggested names more than width_bucket. > Same here, that's why I didn't change it. Not hearing any further discussion, I committed it with that name (and a bit of further cleanup). regards, tom lane