Thread: Simplifying width_bucket_numeric()

Simplifying width_bucket_numeric()

From
Dean Rasheed
Date:
In the numeric width_bucket() code, we currently do the following:

    mul_var(&operand_var, count_var, &operand_var,
            operand_var.dscale + count_var->dscale);
    div_var(&operand_var, &bound2_var, result_var,
            select_div_scale(&operand_var, &bound2_var), true);

    if (cmp_var(result_var, count_var) >= 0)
        set_var_from_var(count_var, result_var);
    else
    {
        add_var(result_var, &const_one, result_var);
        floor_var(result_var, result_var);
    }

select_div_scale() returns a value between 16 and 1000, depending on
the dscales and weights of its inputs, so div_var() computes that many
digits after the decimal point and rounds the final digit. Assuming
the result doesn't exceed count, we then floor the result, throwing
away all those digits to get the integer part that we want.

Instead, this can be done more simply and efficiently, using division
with truncation as follows:

    mul_var(&operand_var, count_var, &operand_var,
            operand_var.dscale + count_var->dscale);
    div_var(&operand_var, &bound2_var, result_var, 0, false);
    add_var(result_var, &const_one, result_var);

This doesn't compute any digits after the decimal point, and instead
just returns the integer part of the quotient, which is much cheaper
to compute, and precisely what we want.

In fact, the current code is slightly inaccurate, because the rounding
step can incorrectly round up into the next internal bucket, for
example:

  width_bucket(6.666666666666666, 0, 10, 3) -> 2
  width_bucket(6.6666666666666666, 0, 10, 3) -> 3

though in practice that's extremely unlikely and doesn't really matter.

Patch attached. I didn't bother with any new test cases, since there
appears to be sufficient coverage already.

As a quick performance/correctness test, I ran the following:

SELECT setseed(0);
CREATE TEMP TABLE t AS
  SELECT random(-4.000000, 8.000000) op,
         random(-4.100000, -2.000000) b1,
         random(6.000000, 8.100000) b2,
         random(1, 15) c
  FROM generate_series(1, 10000000);

SELECT hash_array(array_agg(width_bucket(op, b1, b2, c))) FROM t;
-- Result not changed by patch

SELECT sum(width_bucket(op, b1, b2, c)) FROM t;
Time: 3658.962 ms (00:03.659)  -- HEAD
Time: 3089.946 ms (00:03.090)  -- with patch

Regards,
Dean

Attachment

Re: Simplifying width_bucket_numeric()

From
"Joel Jacobson"
Date:
On Sat, Jul 6, 2024, at 17:36, Dean Rasheed wrote:
> In the numeric width_bucket() code, we currently do the following:
..
> Instead, this can be done more simply and efficiently, using division
> with truncation as follows:
..
>
> Patch attached. I didn't bother with any new test cases, since there
> appears to be sufficient coverage already.
>
> As a quick performance/correctness test, I ran the following:
>
> SELECT setseed(0);
> CREATE TEMP TABLE t AS
>   SELECT random(-4.000000, 8.000000) op,
>          random(-4.100000, -2.000000) b1,
>          random(6.000000, 8.100000) b2,
>          random(1, 15) c
>   FROM generate_series(1, 10000000);
>
> SELECT hash_array(array_agg(width_bucket(op, b1, b2, c))) FROM t;
> -- Result not changed by patch

Same hash_array on all my three machines:

 hash_array
-------------
 -1179801276
(1 row)

> SELECT sum(width_bucket(op, b1, b2, c)) FROM t;
> Time: 3658.962 ms (00:03.659)  -- HEAD
> Time: 3089.946 ms (00:03.090)  -- with patch

Significant improvement on all my three machines:

/*
 * Apple M3 Max
 */

Time: 2255.154 ms (00:02.255) -- HEAD
Time: 1830.985 ms (00:01.831)
Time: 1826.190 ms (00:01.826)
Time: 1831.020 ms (00:01.831)
Time: 1832.934 ms (00:01.833)
Time: 1843.061 ms (00:01.843)

Time: 1957.062 ms (00:01.957) -- simplify-width_bucket_numeric.patch
Time: 1545.121 ms (00:01.545)
Time: 1541.621 ms (00:01.542)
Time: 1536.388 ms (00:01.536)
Time: 1538.721 ms (00:01.539)
Time: 1592.384 ms (00:01.592)

/*
 * Intel Core i9-14900K
 */

Time: 2541.959 ms (00:02.542) -- HEAD
Time: 2534.803 ms (00:02.535)
Time: 2532.343 ms (00:02.532)
Time: 2529.408 ms (00:02.529)
Time: 2528.600 ms (00:02.529)

Time: 2107.901 ms (00:02.108) -- simplify-width_bucket_numeric.patch
Time: 2095.413 ms (00:02.095)
Time: 2093.985 ms (00:02.094)
Time: 2093.910 ms (00:02.094)
Time: 2094.935 ms (00:02.095)

/*
 * AMD Ryzen 9 7950X3D
 */


Time: 2226.498 ms (00:02.226) -- HEAD
Time: 2238.083 ms (00:02.238)
Time: 2239.075 ms (00:02.239)
Time: 2238.488 ms (00:02.238)
Time: 2238.166 ms (00:02.238)

Time: 1853.382 ms (00:01.853) -- simplify-width_bucket_numeric.patch
Time: 1842.630 ms (00:01.843)
Time: 1828.309 ms (00:01.828)
Time: 1844.654 ms (00:01.845)
Time: 1828.520 ms (00:01.829)

Regards,
Joel



Re: Simplifying width_bucket_numeric()

From
Bryan Green
Date:

🔥

On Sun, Jul 7, 2024, 7:44 AM Joel Jacobson <joel@compiler.org> wrote:
On Sat, Jul 6, 2024, at 17:36, Dean Rasheed wrote:
> In the numeric width_bucket() code, we currently do the following:
..
> Instead, this can be done more simply and efficiently, using division
> with truncation as follows:
..
>
> Patch attached. I didn't bother with any new test cases, since there
> appears to be sufficient coverage already.
>
> As a quick performance/correctness test, I ran the following:
>
> SELECT setseed(0);
> CREATE TEMP TABLE t AS
>   SELECT random(-4.000000, 8.000000) op,
>          random(-4.100000, -2.000000) b1,
>          random(6.000000, 8.100000) b2,
>          random(1, 15) c
>   FROM generate_series(1, 10000000);
>
> SELECT hash_array(array_agg(width_bucket(op, b1, b2, c))) FROM t;
> -- Result not changed by patch

Same hash_array on all my three machines:

 hash_array
-------------
 -1179801276
(1 row)

> SELECT sum(width_bucket(op, b1, b2, c)) FROM t;
> Time: 3658.962 ms (00:03.659)  -- HEAD
> Time: 3089.946 ms (00:03.090)  -- with patch

Significant improvement on all my three machines:

/*
 * Apple M3 Max
 */

Time: 2255.154 ms (00:02.255) -- HEAD
Time: 1830.985 ms (00:01.831)
Time: 1826.190 ms (00:01.826)
Time: 1831.020 ms (00:01.831)
Time: 1832.934 ms (00:01.833)
Time: 1843.061 ms (00:01.843)

Time: 1957.062 ms (00:01.957) -- simplify-width_bucket_numeric.patch
Time: 1545.121 ms (00:01.545)
Time: 1541.621 ms (00:01.542)
Time: 1536.388 ms (00:01.536)
Time: 1538.721 ms (00:01.539)
Time: 1592.384 ms (00:01.592)

/*
 * Intel Core i9-14900K
 */

Time: 2541.959 ms (00:02.542) -- HEAD
Time: 2534.803 ms (00:02.535)
Time: 2532.343 ms (00:02.532)
Time: 2529.408 ms (00:02.529)
Time: 2528.600 ms (00:02.529)

Time: 2107.901 ms (00:02.108) -- simplify-width_bucket_numeric.patch
Time: 2095.413 ms (00:02.095)
Time: 2093.985 ms (00:02.094)
Time: 2093.910 ms (00:02.094)
Time: 2094.935 ms (00:02.095)

/*
 * AMD Ryzen 9 7950X3D
 */


Time: 2226.498 ms (00:02.226) -- HEAD
Time: 2238.083 ms (00:02.238)
Time: 2239.075 ms (00:02.239)
Time: 2238.488 ms (00:02.238)
Time: 2238.166 ms (00:02.238)

Time: 1853.382 ms (00:01.853) -- simplify-width_bucket_numeric.patch
Time: 1842.630 ms (00:01.843)
Time: 1828.309 ms (00:01.828)
Time: 1844.654 ms (00:01.845)
Time: 1828.520 ms (00:01.829)

Regards,
Joel


Re: Simplifying width_bucket_numeric()

From
Dean Rasheed
Date:
On Sun, 7 Jul 2024 at 13:43, Joel Jacobson <joel@compiler.org> wrote:
>
> > SELECT hash_array(array_agg(width_bucket(op, b1, b2, c))) FROM t;
> > -- Result not changed by patch
>
> Same hash_array on all my three machines:
>
> > SELECT sum(width_bucket(op, b1, b2, c)) FROM t;
> > Time: 3658.962 ms (00:03.659)  -- HEAD
> > Time: 3089.946 ms (00:03.090)  -- with patch
>
> Significant improvement on all my three machines:
>

Thanks for testing. I have committed this now.

(I also realised that the "reversed_bounds" code was unnecessary,
since the only effect was to negate both inputs to div_var(), so the
signs cancel out.)

Regards,
Dean