Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Date
Msg-id 42d93ebe-fbed-448f-b550-096be9e577db@app.fastmail.com
Whole thread Raw
In response to Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
List pgsql-hackers
On Thu, Jul 4, 2024, at 20:43, Dean Rasheed wrote:
> On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote:
>>
>> > On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
>> >> I wouldn't expect it to ever be off by more than 1
>> >
>> > OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
>> > are unexpected?
>>
>> Sorry, I meant off by 2 for the mul_var_small() patch, these cases that I found:
>>
>
> Yeah, so that was another bug in mul_var_small(). If rscale is made
> small enough, the result index for the digits computed before the main
> loop overlaps the ones after, so it would overwrite digits already
> computed.
>
> Of course, that's fairly easy to fix, but at this point I think the
> better solution is to only use mul_var_small() when an exact product
> is requested. We would have to do that for mul_var_int() anyway,
> because of its accuracy issues discussed earlier. I think this is a
> reasonable thing to do because only functions like ln_var() and
> exp_var() will ask mul_var() for a reduced-rscale result, and those
> functions are likely to be dominated by computations involving larger
> numbers, for which this patch wouldn't help anyway. Also those
> functions are probably less widely used.
>
> If we make that decision, a lot of the complexity in mul_var_small()
> goes away, including all the conditional array accesses, making it
> simpler and more efficient. v6 patch attached.

Nice, I think that looks like the better trade-off.

> I also updated the mul_var_int() patch so that it is also only invoked
> when an exact product is requested, and I noticed a couple of other
> minor optimisations that could be made.

Looks nice.

> Then I decided to try
> implementing mul_var_int64(). This gives a pretty decent speedup for
> 3-digit inputs, but unfortunately it is much slower for 4-digit inputs
> (for which most values will go through the 128-bit code path). I'm
> attaching that too, just for information, but it's clearly not going
> to be acceptable as-is.
>
> Running your benchmark queries, I got these results:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1;
> Time: 4520.874 ms (00:04.521)  -- HEAD
> Time: 3937.536 ms (00:03.938)  -- v5-mul_var_int.patch
> Time: 3919.495 ms (00:03.919)  -- v5-mul_var_small.patch
> Time: 3916.964 ms (00:03.917)  -- v6-mul_var_int64.patch
> Time: 3811.118 ms (00:03.811)  -- v6-mul_var_small.patch

My benchmarks also indicate v6-mul_var_small.patch (v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch)
is the winner on all my three CPUs, for var1ndigits=1:

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3046.668 ms (00:03.047)
Time: 3053.327 ms (00:03.053)
Time: 3045.517 ms (00:03.046)
Time: 3049.626 ms (00:03.050)
Time: 3075.101 ms (00:03.075)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 2781.536 ms (00:02.782)
Time: 2781.324 ms (00:02.781)
Time: 2781.301 ms (00:02.781)
Time: 2786.524 ms (00:02.787)
Time: 2784.494 ms (00:02.784)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2711.167 ms (00:02.711)
Time: 2723.647 ms (00:02.724)
Time: 2706.372 ms (00:02.706)
Time: 2708.883 ms (00:02.709)
Time: 2704.621 ms (00:02.705)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3496.714 ms (00:03.497)
Time: 3278.909 ms (00:03.279)
Time: 3278.631 ms (00:03.279)
Time: 3277.658 ms (00:03.278)
Time: 3276.121 ms (00:03.276)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3080.751 ms (00:03.081)
Time: 3078.118 ms (00:03.078)
Time: 3079.932 ms (00:03.080)
Time: 3080.668 ms (00:03.081)
Time: 3080.697 ms (00:03.081)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3043.635 ms (00:03.044)
Time: 3043.606 ms (00:03.044)
Time: 3041.391 ms (00:03.041)
Time: 3041.997 ms (00:03.042)
Time: 3045.464 ms (00:03.045)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3421.307 ms (00:03.421)
Time: 3400.935 ms (00:03.401)
Time: 3359.839 ms (00:03.360)
Time: 3374.246 ms (00:03.374)
Time: 3375.085 ms (00:03.375)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3002.214 ms (00:03.002)
Time: 3016.042 ms (00:03.016)
Time: 3010.541 ms (00:03.011)
Time: 3009.204 ms (00:03.009)
Time: 3002.088 ms (00:03.002)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2959.319 ms (00:02.959)
Time: 2957.694 ms (00:02.958)
Time: 2971.559 ms (00:02.972)
Time: 2958.788 ms (00:02.959)
Time: 2958.812 ms (00:02.959)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2;
> Time: 4762.528 ms (00:04.763)  -- HEAD
> Time: 4075.546 ms (00:04.076)  -- v5-mul_var_int.patch
> Time: 4055.180 ms (00:04.055)  -- v5-mul_var_small.patch
> Time: 4037.866 ms (00:04.038)  -- v6-mul_var_int64.patch
> Time: 4018.488 ms (00:04.018)  -- v6-mul_var_small.patch

I get mixed results for var1ndigits=2:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3369.724 ms (00:03.370)
Time: 3340.977 ms (00:03.341)
Time: 3331.407 ms (00:03.331)
Time: 3333.304 ms (00:03.333)
Time: 3332.136 ms (00:03.332)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 2768.108 ms (00:02.768)
Time: 2736.996 ms (00:02.737)
Time: 2730.217 ms (00:02.730)
Time: 2743.556 ms (00:02.744)
Time: 2746.725 ms (00:02.747)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2895.200 ms (00:02.895)
Time: 2894.823 ms (00:02.895)
Time: 2899.475 ms (00:02.899)
Time: 2895.385 ms (00:02.895)
Time: 2898.647 ms (00:02.899)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3385.836 ms (00:03.386)
Time: 3367.739 ms (00:03.368)
Time: 3363.321 ms (00:03.363)
Time: 3365.433 ms (00:03.365)
Time: 3365.301 ms (00:03.365)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3086.253 ms (00:03.086)
Time: 3085.272 ms (00:03.085)
Time: 3085.769 ms (00:03.086)
Time: 3089.128 ms (00:03.089)
Time: 3086.735 ms (00:03.087)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.775 ms (00:03.054)
Time: 3058.392 ms (00:03.058)
Time: 3068.113 ms (00:03.068)
Time: 3057.333 ms (00:03.057)
Time: 3057.218 ms (00:03.057)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3619.441 ms (00:03.619)
Time: 3609.553 ms (00:03.610)
Time: 3574.277 ms (00:03.574)
Time: 3578.031 ms (00:03.578)
Time: 3558.545 ms (00:03.559)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3061.858 ms (00:03.062)
Time: 3082.174 ms (00:03.082)
Time: 3081.093 ms (00:03.081)
Time: 3093.610 ms (00:03.094)
Time: 3064.507 ms (00:03.065)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3100.287 ms (00:03.100)
Time: 3100.264 ms (00:03.100)
Time: 3097.207 ms (00:03.097)
Time: 3101.477 ms (00:03.101)
Time: 3098.035 ms (00:03.098)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3;
> Time: 5387.514 ms (00:05.388)  -- HEAD
> Time: 5350.736 ms (00:05.351)  -- v5-mul_var_int.patch
> Time: 4648.449 ms (00:04.648)  -- v5-mul_var_small.patch
> Time: 4655.204 ms (00:04.655)  -- v6-mul_var_int64.patch
> Time: 4645.962 ms (00:04.646)  -- v6-mul_var_small.patch

I get mixed results for var1ndigits=3:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Same winners as for var1ndigits=2.

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3466.932 ms (00:03.467)
Time: 3447.001 ms (00:03.447)
Time: 3457.259 ms (00:03.457)
Time: 3445.938 ms (00:03.446)
Time: 3443.310 ms (00:03.443)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 2988.444 ms (00:02.988)
Time: 2976.036 ms (00:02.976)
Time: 2982.756 ms (00:02.983)
Time: 2986.436 ms (00:02.986)
Time: 2973.457 ms (00:02.973)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3026.666 ms (00:03.027)
Time: 3024.458 ms (00:03.024)
Time: 3022.976 ms (00:03.023)
Time: 3029.526 ms (00:03.030)
Time: 3021.543 ms (00:03.022)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4510.078 ms (00:04.510)
Time: 4222.501 ms (00:04.223)
Time: 4179.509 ms (00:04.180)
Time: 4179.307 ms (00:04.179)
Time: 4183.026 ms (00:04.183)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3811.866 ms (00:03.812)
Time: 3816.664 ms (00:03.817)
Time: 3811.695 ms (00:03.812)
Time: 3811.674 ms (00:03.812)
Time: 3812.265 ms (00:03.812)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4095.053 ms (00:04.095)
Time: 3896.002 ms (00:03.896)
Time: 3888.999 ms (00:03.889)
Time: 3889.346 ms (00:03.889)
Time: 3889.017 ms (00:03.889)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3946.255 ms (00:03.946)
Time: 3896.110 ms (00:03.896)
Time: 3877.470 ms (00:03.877)
Time: 3854.402 ms (00:03.854)
Time: 3901.218 ms (00:03.901)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3393.572 ms (00:03.394)
Time: 3395.401 ms (00:03.395)
Time: 3395.199 ms (00:03.395)
Time: 3418.555 ms (00:03.419)
Time: 3399.619 ms (00:03.400)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3308.791 ms (00:03.309)
Time: 3309.316 ms (00:03.309)
Time: 3318.238 ms (00:03.318)
Time: 3296.061 ms (00:03.296)
Time: 3320.282 ms (00:03.320)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617)  -- HEAD
> Time: 5505.913 ms (00:05.506)  -- v5-mul_var_int.patch
> Time: 5486.441 ms (00:05.486)  -- v5-mul_var_small.patch
> Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
> Time: 5598.909 ms (00:05.599)  -- v6-mul_var_small.patch
> So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case,
> but is terrible in the 4-digit case. None of the other patches touch
> the 4-digit case, but it might be interesting to try mul_var_small()
> with 4 digits.

Interesting you got so bad bench results for v6-mul_var_int64.patch
for var1ndigits=4, that patch is actually the winner on AMD Ryzen 9 7950X3D.

On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
On Apple M3 Max, HEAD is the winner.

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3618.530 ms (00:03.619)
Time: 3595.239 ms (00:03.595)
Time: 3600.412 ms (00:03.600)
Time: 3607.500 ms (00:03.607)
Time: 3604.122 ms (00:03.604)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 4498.993 ms (00:04.499)
Time: 4527.302 ms (00:04.527)
Time: 4493.613 ms (00:04.494)
Time: 4482.194 ms (00:04.482)
Time: 4493.660 ms (00:04.494)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3628.177 ms (00:03.628)
Time: 3613.975 ms (00:03.614)
Time: 3612.213 ms (00:03.612)
Time: 3614.026 ms (00:03.614)
Time: 3622.959 ms (00:03.623)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4130.810 ms (00:04.131)
Time: 3836.462 ms (00:03.836)
Time: 3810.604 ms (00:03.811)
Time: 3805.443 ms (00:03.805)
Time: 3803.055 ms (00:03.803)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3952.862 ms (00:03.953)
Time: 3792.272 ms (00:03.792)
Time: 3793.995 ms (00:03.794)
Time: 3790.531 ms (00:03.791)
Time: 3793.647 ms (00:03.794)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3762.828 ms (00:03.763)
Time: 3754.869 ms (00:03.755)
Time: 3756.041 ms (00:03.756)
Time: 3758.719 ms (00:03.759)
Time: 3754.894 ms (00:03.755)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4075.988 ms (00:04.076)
Time: 4067.702 ms (00:04.068)
Time: 4035.191 ms (00:04.035)
Time: 4022.411 ms (00:04.022)
Time: 4016.475 ms (00:04.016)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3830.021 ms (00:03.830)
Time: 3837.947 ms (00:03.838)
Time: 3834.894 ms (00:03.835)
Time: 3832.755 ms (00:03.833)
Time: 3834.512 ms (00:03.835)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4031.128 ms (00:04.031)
Time: 4001.590 ms (00:04.002)
Time: 4031.212 ms (00:04.031)
Time: 4035.941 ms (00:04.036)
Time: 4031.218 ms (00:04.031)

Regards,
Joel



pgsql-hackers by date:

Previous
From: "Daniel Verite"
Date:
Subject: Re: Built-in CTYPE provider
Next
From: vignesh C
Date:
Subject: Re: Logical Replication of sequences