Re: Built-in binning functions - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Built-in binning functions
Date
Msg-id 6053.1409419470@sss.pgh.pa.us
Whole thread Raw
In response to Re: Built-in binning functions  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: Built-in binning functions
Re: Built-in binning functions
List pgsql-hackers
Pavel Stehule <pavel.stehule@gmail.com> writes:
> 1. I am thinking so reduction to only numeric types is not necessary -
> although we can live without it - but there are lot of non numeric
> categories: chars, date, ...

I wasn't terribly happy about that either.  I still think we should
reduce this to a single polymorphic function, as in the attached.

> 2. Still I strongly afraid about used searching method - there is not
> possible to check a  validity of input. Did you check how much linear
> searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N).  That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()).  I think it's probably
sufficient to have a clear warning in the docs.

> 3. I am thinking about name - I don't think so varwidth_bucket is correct
> -- in relation to name "width_bucket" ... what about "range_bucket" or
> "scope_bucket" ?

Neither of those seem like improvements from here.  I agree with the
objections to bin() as well.  bucket() might be all right but it still
seems a bit too generic.  width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

It occurs to me that there would be an advantage to using some other
name besides width_bucket: we could then mark the function as variadic,
which might be a notational advantage in some usages.  (We can't do
that if we call it width_bucket because the four-parameter case would
be ambiguous with the existing functions.)  I'm not sure that this is
important enough to justify changing the name though, especially if
we can't come up with a good name.  Also, doing that would put a very
large premium on picking a non-generic name, else we'd risk creating
ambiguities against user-defined functions.

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds".  However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing.  Thoughts?

Proposed patch with 1 polymorphic function attached.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 722640b..93cf1e7 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 901,925 ****
          <indexterm>
           <primary>width_bucket</primary>
          </indexterm>
!         <literal><function>width_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>b1</parameter>
<type>numeric</type>,<parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter>
<type>int</type>)</function></literal>
         </entry>
         <entry><type>int</type></entry>
         <entry>return the bucket to which <parameter>operand</> would
         be assigned in an equidepth histogram with <parameter>count</>
!        buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
         <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
         <entry><literal>3</literal></entry>
        </row>

        <row>
!        <entry><literal><function>width_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>b1</parameter>
<type>dp</type>,<parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter>
<type>int</type>)</function></literal></entry>
         <entry><type>int</type></entry>
         <entry>return the bucket to which <parameter>operand</> would
         be assigned in an equidepth histogram with <parameter>count</>
!        buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
         <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
         <entry><literal>3</literal></entry>
        </row>
       </tbody>
      </tgroup>
     </table>
--- 901,936 ----
          <indexterm>
           <primary>width_bucket</primary>
          </indexterm>
!         <literal><function>width_bucket(<parameter>operand</parameter> <type>numeric</type>,
<parameter>b1</parameter><type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>,
<parameter>count</parameter><type>int</type>)</function></literal> 
         </entry>
         <entry><type>int</type></entry>
         <entry>return the bucket to which <parameter>operand</> would
         be assigned in an equidepth histogram with <parameter>count</>
!        buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
         <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
         <entry><literal>3</literal></entry>
        </row>

        <row>
!        <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>dp</type>,
<parameter>b1</parameter><type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter>
<type>int</type>)</function></literal></entry>
         <entry><type>int</type></entry>
         <entry>return the bucket to which <parameter>operand</> would
         be assigned in an equidepth histogram with <parameter>count</>
!        buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
         <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
         <entry><literal>3</literal></entry>
        </row>
+
+       <row>
+        <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>anyelement</type>,
<parameter>thresholds</parameter><type>anyarray</type>)</function></literal></entry> 
+        <entry><type>int</type></entry>
+        <entry>return the bucket to which <parameter>operand</> would
+        be assigned given an array listing the upper bounds of the buckets;
+        the <parameter>thresholds</> array <emphasis>must be sorted</>,
+        smallest first, or unexpected results will be obtained</entry>
+        <entry><literal>width_bucket(now(), array['yesterday', 'today', 'tomorrow']::timestamptz[])</literal></entry>
+        <entry><literal>2</literal></entry>
+       </row>
       </tbody>
      </tgroup>
     </table>
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index f8e94ec..f3bba13 100644
*** a/src/backend/utils/adt/arrayfuncs.c
--- b/src/backend/utils/adt/arrayfuncs.c
*************** static ArrayType *array_replace_internal
*** 130,135 ****
--- 130,143 ----
                         Datum replace, bool replace_isnull,
                         bool remove, Oid collation,
                         FunctionCallInfo fcinfo);
+ static int width_bucket_fixed(Datum operand,
+                    ArrayType *thresholds,
+                    Oid collation,
+                    TypeCacheEntry *typentry);
+ static int width_bucket_variable(Datum operand,
+                       ArrayType *thresholds,
+                       Oid collation,
+                       TypeCacheEntry *typentry);


  /*
*************** array_replace(PG_FUNCTION_ARGS)
*** 5502,5504 ****
--- 5510,5681 ----
                                     fcinfo);
      PG_RETURN_ARRAYTYPE_P(array);
  }
+
+ /*
+  * Implements width_bucket(anyelement, anyarray).
+  *
+  * 'thresholds' is an array containing upper bound values for each bucket;
+  * these must be sorted from smallest to largest, or bogus results will be
+  * produced.  If N thresholds are supplied, the output is from 0 to N:
+  * 0 is for inputs < first threshold, N is for inputs >= last threshold.
+  */
+ Datum
+ width_bucket_generic(PG_FUNCTION_ARGS)
+ {
+     Datum        operand = PG_GETARG_DATUM(0);
+     ArrayType  *thresholds = PG_GETARG_ARRAYTYPE_P(1);
+     Oid            collation = PG_GET_COLLATION();
+     Oid            element_type = ARR_ELEMTYPE(thresholds);
+     TypeCacheEntry *typentry;
+     int            result;
+
+     /* Check input */
+     if (ARR_NDIM(thresholds) > 1)
+         ereport(ERROR,
+                 (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+                  errmsg("thresholds must be one-dimensional array")));
+
+     if (array_contains_nulls(thresholds))
+         ereport(ERROR,
+                 (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+                  errmsg("thresholds array must not contain NULLs")));
+
+     /* Cache information about the input type */
+     typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+     if (typentry == NULL ||
+         typentry->type_id != element_type)
+     {
+         typentry = lookup_type_cache(element_type,
+                                      TYPECACHE_CMP_PROC_FINFO);
+         if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+             ereport(ERROR,
+                     (errcode(ERRCODE_UNDEFINED_FUNCTION),
+                errmsg("could not identify a comparison function for type %s",
+                       format_type_be(element_type))));
+         fcinfo->flinfo->fn_extra = (void *) typentry;
+     }
+
+     /* We have two implementation paths for fixed- and variable-width types */
+     if (typentry->typlen > 0)
+         result = width_bucket_fixed(operand, thresholds, collation, typentry);
+     else
+         result = width_bucket_variable(operand, thresholds, collation, typentry);
+
+     /* Avoid leaking memory when handed toasted input. */
+     PG_FREE_IF_COPY(thresholds, 1);
+
+     PG_RETURN_INT32(result);
+ }
+
+ /*
+  * width_bucket for fixed-width data types.
+  */
+ static int
+ width_bucket_fixed(Datum operand,
+                    ArrayType *thresholds,
+                    Oid collation,
+                    TypeCacheEntry *typentry)
+ {
+     char       *thresholds_data;
+     int            typlen;
+     FunctionCallInfoData locfcinfo;
+     int            left;
+     int            right;
+
+     /*
+      * Since we know the array contains no NULLs, we can just index it
+      * directly.
+      */
+     thresholds_data = (char *) ARR_DATA_PTR(thresholds);
+     typlen = typentry->typlen;
+
+     InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+                              collation, NULL, NULL);
+
+     /* Find the bucket */
+     left = 0;
+     right = ArrayGetNItems(ARR_NDIM(thresholds), ARR_DIMS(thresholds));
+     while (left < right)
+     {
+         int            mid;
+         char       *ptr;
+         Datum        elm;
+         int32        cmpresult;
+
+         mid = (left + right) / 2;
+
+         ptr = thresholds_data + mid * typlen;
+         elm = fetch_att(ptr, typentry->typbyval, typlen);
+
+         locfcinfo.arg[0] = operand;
+         locfcinfo.arg[1] = elm;
+         locfcinfo.argnull[0] = false;
+         locfcinfo.argnull[1] = false;
+         locfcinfo.isnull = false;
+
+         cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+         if (cmpresult < 0)
+             right = mid;
+         else
+             left = mid + 1;
+     }
+
+     return left;
+ }
+
+ /*
+  * width_bucket for variable-width data types.
+  */
+ static int
+ width_bucket_variable(Datum operand,
+                       ArrayType *thresholds,
+                       Oid collation,
+                       TypeCacheEntry *typentry)
+ {
+     Datum       *dvalues;
+     int            nelems;
+     FunctionCallInfoData locfcinfo;
+     int            left;
+     int            right;
+
+     /*
+      * The easiest and probably most efficient way to locate the array
+      * elements is with deconstruct_array.
+      */
+     deconstruct_array(thresholds,
+                       typentry->type_id, typentry->typlen,
+                       typentry->typbyval, typentry->typalign,
+                       &dvalues, NULL, &nelems);
+
+     InitFunctionCallInfoData(locfcinfo, &typentry->cmp_proc_finfo, 2,
+                              collation, NULL, NULL);
+
+     /* Find the bucket */
+     left = 0;
+     right = nelems;
+     while (left < right)
+     {
+         int            mid;
+         int32        cmpresult;
+
+         mid = (left + right) / 2;
+
+         locfcinfo.arg[0] = operand;
+         locfcinfo.arg[1] = dvalues[mid];
+         locfcinfo.argnull[0] = false;
+         locfcinfo.argnull[1] = false;
+         locfcinfo.isnull = false;
+
+         cmpresult = DatumGetInt32(FunctionCallInvoke(&locfcinfo));
+
+         if (cmpresult < 0)
+             right = mid;
+         else
+             left = mid + 1;
+     }
+
+     pfree(dvalues);
+
+     return left;
+ }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..9b55acc 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2334 (  array_agg_fina
*** 885,890 ****
--- 885,892 ----
  DESCR("aggregate final function");
  DATA(insert OID = 2335 (  array_agg           PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 2277 "2283" _null_ _null_
_null__null_ aggregate_dummy _null_ _null_ _null_ )); 
  DESCR("concatenate aggregate input into an array");
+ DATA(insert OID = 3218 ( width_bucket       PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2283 2277" _null_ _null_
_null__null_ width_bucket_generic _null_ _null_ _null_ )); 
+ DESCR("bucket number of operand given a sorted array of bucket upper bounds");
  DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "2281" _null_ _null_ _null_
_null_array_typanalyze _null_ _null_ _null_ )); 
  DESCR("array typanalyze");
  DATA(insert OID = 3817 (  arraycontsel       PGNSP PGUID 12 1 0 0 0 f f f f t f s 4 0 701 "2281 26 2281 23" _null_
_null__null_ _null_ arraycontsel _null_ _null_ _null_ )); 
diff --git a/src/include/utils/array.h b/src/include/utils/array.h
index 9bbfaae..2abaccd 100644
*** a/src/include/utils/array.h
--- b/src/include/utils/array.h
*************** extern Datum array_fill_with_lower_bound
*** 214,219 ****
--- 214,220 ----
  extern Datum array_unnest(PG_FUNCTION_ARGS);
  extern Datum array_remove(PG_FUNCTION_ARGS);
  extern Datum array_replace(PG_FUNCTION_ARGS);
+ extern Datum width_bucket_generic(PG_FUNCTION_ARGS);

  extern Datum array_ref(ArrayType *array, int nSubscripts, int *indx,
            int arraytyplen, int elmlen, bool elmbyval, char elmalign,
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 4286691..04f2400 100644
*** a/src/test/regress/expected/arrays.out
--- b/src/test/regress/expected/arrays.out
*************** select length(md5((f1[1]).c2)) from dest
*** 1706,1708 ****
--- 1706,1807 ----

  drop table dest;
  drop type textandtext;
+ -- Tests for generic form of width_bucket()
+ SELECT
+     op,
+     width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+     width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+     width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+         op        | wb_1 | wb_2 | wb_3
+ ------------------+------+------+------
+              -5.2 |    0 |    0 |    1
+     -0.0000000001 |    0 |    0 |    2
+    0.000000000001 |    0 |    1 |    2
+                 1 |    1 |    1 |    2
+  1.99999999999999 |    1 |    1 |    2
+                 2 |    1 |    1 |    3
+  2.00000000000001 |    1 |    1 |    3
+                 3 |    2 |    1 |    3
+                 4 |    2 |    1 |    3
+               4.5 |    2 |    1 |    3
+                 5 |    3 |    1 |    3
+               5.5 |    3 |    2 |    3
+                 6 |    3 |    2 |    3
+                 7 |    3 |    2 |    3
+                 8 |    3 |    2 |    3
+                 9 |    3 |    2 |    3
+  9.99999999999999 |    3 |    3 |    3
+                10 |    4 |    3 |    3
+  10.0000000000001 |    4 |    3 |    3
+ (19 rows)
+
+ SELECT
+     op,
+     width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+  op | wb_1
+ ----+------
+   0 |    0
+   1 |    1
+   2 |    1
+   3 |    2
+   4 |    2
+   5 |    3
+   6 |    3
+   7 |    3
+   8 |    3
+   9 |    3
+  10 |    4
+  11 |    4
+ (12 rows)
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+  width_bucket
+ --------------
+             2
+ (1 row)
+
+ SELECT width_bucket(5, ARRAY[3]);
+  width_bucket
+ --------------
+             1
+ (1 row)
+
+ SELECT width_bucket(5, '{}');
+  width_bucket
+ --------------
+             0
+ (1 row)
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ ERROR:  function width_bucket(text, integer[]) does not exist
+ LINE 1: SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+                ^
+ HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ ERROR:  thresholds array must not contain NULLs
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+ ERROR:  thresholds must be one-dimensional array
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index d9f7cbf..eef3eb3 100644
*** a/src/test/regress/sql/arrays.sql
--- b/src/test/regress/sql/arrays.sql
*************** drop table src;
*** 476,478 ****
--- 476,523 ----
  select length(md5((f1[1]).c2)) from dest;
  drop table dest;
  drop type textandtext;
+
+ -- Tests for generic form of width_bucket()
+
+ SELECT
+     op,
+     width_bucket(op, ARRAY[1, 3, 5, 10.0]) AS wb_1,
+     width_bucket(op, ARRAY[0, 5.5, 9.99]) AS wb_2,
+     width_bucket(op, ARRAY[-6, -5, 2.0]) AS wb_3
+ FROM (VALUES
+   (-5.2),
+   (-0.0000000001),
+   (0.000000000001),
+   (1),
+   (1.99999999999999),
+   (2),
+   (2.00000000000001),
+   (3),
+   (4),
+   (4.5),
+   (5),
+   (5.5),
+   (6),
+   (7),
+   (8),
+   (9),
+   (9.99999999999999),
+   (10),
+   (10.0000000000001)
+ ) v(op);
+
+ SELECT
+     op,
+     width_bucket(op, ARRAY[1, 3, 5, 10]) AS wb_1
+ FROM generate_series(0,11) as op;
+
+ SELECT width_bucket(now(),
+                     array['yesterday', 'today', 'tomorrow']::timestamptz[]);
+
+ SELECT width_bucket(5, ARRAY[3]);
+ SELECT width_bucket(5, '{}');
+
+ -- error cases
+ SELECT width_bucket('5'::text, ARRAY[3, 4]::integer[]);
+ SELECT width_bucket(5, ARRAY[3, 4, NULL]);
+ SELECT width_bucket(5, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: postgresql latency & bgwriter not doing its job
Next
From: Tom Lane
Date:
Subject: Re: postgresql latency & bgwriter not doing its job