Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. |
Date | |
Msg-id | 42d93ebe-fbed-448f-b550-096be9e577db@app.fastmail.com Whole thread Raw |
In response to | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
|
List | pgsql-hackers |
On Thu, Jul 4, 2024, at 20:43, Dean Rasheed wrote: > On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote: >> >> > On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote: >> >> I wouldn't expect it to ever be off by more than 1 >> > >> > OK, so then the cases I found where it was off by 2 for the mul_var_int() patch >> > are unexpected? >> >> Sorry, I meant off by 2 for the mul_var_small() patch, these cases that I found: >> > > Yeah, so that was another bug in mul_var_small(). If rscale is made > small enough, the result index for the digits computed before the main > loop overlaps the ones after, so it would overwrite digits already > computed. > > Of course, that's fairly easy to fix, but at this point I think the > better solution is to only use mul_var_small() when an exact product > is requested. We would have to do that for mul_var_int() anyway, > because of its accuracy issues discussed earlier. I think this is a > reasonable thing to do because only functions like ln_var() and > exp_var() will ask mul_var() for a reduced-rscale result, and those > functions are likely to be dominated by computations involving larger > numbers, for which this patch wouldn't help anyway. Also those > functions are probably less widely used. > > If we make that decision, a lot of the complexity in mul_var_small() > goes away, including all the conditional array accesses, making it > simpler and more efficient. v6 patch attached. Nice, I think that looks like the better trade-off. > I also updated the mul_var_int() patch so that it is also only invoked > when an exact product is requested, and I noticed a couple of other > minor optimisations that could be made. Looks nice. > Then I decided to try > implementing mul_var_int64(). This gives a pretty decent speedup for > 3-digit inputs, but unfortunately it is much slower for 4-digit inputs > (for which most values will go through the 128-bit code path). I'm > attaching that too, just for information, but it's clearly not going > to be acceptable as-is. > > Running your benchmark queries, I got these results: > > SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; > Time: 4520.874 ms (00:04.521) -- HEAD > Time: 3937.536 ms (00:03.938) -- v5-mul_var_int.patch > Time: 3919.495 ms (00:03.919) -- v5-mul_var_small.patch > Time: 3916.964 ms (00:03.917) -- v6-mul_var_int64.patch > Time: 3811.118 ms (00:03.811) -- v6-mul_var_small.patch My benchmarks also indicate v6-mul_var_small.patch (v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch) is the winner on all my three CPUs, for var1ndigits=1: -- Apple M3 Max SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD Time: 3046.668 ms (00:03.047) Time: 3053.327 ms (00:03.053) Time: 3045.517 ms (00:03.046) Time: 3049.626 ms (00:03.050) Time: 3075.101 ms (00:03.075) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch Time: 2781.536 ms (00:02.782) Time: 2781.324 ms (00:02.781) Time: 2781.301 ms (00:02.781) Time: 2786.524 ms (00:02.787) Time: 2784.494 ms (00:02.784) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 2711.167 ms (00:02.711) Time: 2723.647 ms (00:02.724) Time: 2706.372 ms (00:02.706) Time: 2708.883 ms (00:02.709) Time: 2704.621 ms (00:02.705) -- Intel Core i9-14900K SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD Time: 3496.714 ms (00:03.497) Time: 3278.909 ms (00:03.279) Time: 3278.631 ms (00:03.279) Time: 3277.658 ms (00:03.278) Time: 3276.121 ms (00:03.276) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch Time: 3080.751 ms (00:03.081) Time: 3078.118 ms (00:03.078) Time: 3079.932 ms (00:03.080) Time: 3080.668 ms (00:03.081) Time: 3080.697 ms (00:03.081) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3043.635 ms (00:03.044) Time: 3043.606 ms (00:03.044) Time: 3041.391 ms (00:03.041) Time: 3041.997 ms (00:03.042) Time: 3045.464 ms (00:03.045) -- AMD Ryzen 9 7950X3D SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD Time: 3421.307 ms (00:03.421) Time: 3400.935 ms (00:03.401) Time: 3359.839 ms (00:03.360) Time: 3374.246 ms (00:03.374) Time: 3375.085 ms (00:03.375) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch Time: 3002.214 ms (00:03.002) Time: 3016.042 ms (00:03.016) Time: 3010.541 ms (00:03.011) Time: 3009.204 ms (00:03.009) Time: 3002.088 ms (00:03.002) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 2959.319 ms (00:02.959) Time: 2957.694 ms (00:02.958) Time: 2971.559 ms (00:02.972) Time: 2958.788 ms (00:02.959) Time: 2958.812 ms (00:02.959) > SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; > Time: 4762.528 ms (00:04.763) -- HEAD > Time: 4075.546 ms (00:04.076) -- v5-mul_var_int.patch > Time: 4055.180 ms (00:04.055) -- v5-mul_var_small.patch > Time: 4037.866 ms (00:04.038) -- v6-mul_var_int64.patch > Time: 4018.488 ms (00:04.018) -- v6-mul_var_small.patch I get mixed results for var1ndigits=2: Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch. Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch -- Apple M3 Max SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD Time: 3369.724 ms (00:03.370) Time: 3340.977 ms (00:03.341) Time: 3331.407 ms (00:03.331) Time: 3333.304 ms (00:03.333) Time: 3332.136 ms (00:03.332) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch Time: 2768.108 ms (00:02.768) Time: 2736.996 ms (00:02.737) Time: 2730.217 ms (00:02.730) Time: 2743.556 ms (00:02.744) Time: 2746.725 ms (00:02.747) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 2895.200 ms (00:02.895) Time: 2894.823 ms (00:02.895) Time: 2899.475 ms (00:02.899) Time: 2895.385 ms (00:02.895) Time: 2898.647 ms (00:02.899) -- Intel Core i9-14900K SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD Time: 3385.836 ms (00:03.386) Time: 3367.739 ms (00:03.368) Time: 3363.321 ms (00:03.363) Time: 3365.433 ms (00:03.365) Time: 3365.301 ms (00:03.365) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch Time: 3086.253 ms (00:03.086) Time: 3085.272 ms (00:03.085) Time: 3085.769 ms (00:03.086) Time: 3089.128 ms (00:03.089) Time: 3086.735 ms (00:03.087) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3053.775 ms (00:03.054) Time: 3058.392 ms (00:03.058) Time: 3068.113 ms (00:03.068) Time: 3057.333 ms (00:03.057) Time: 3057.218 ms (00:03.057) -- AMD Ryzen 9 7950X3D SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD Time: 3619.441 ms (00:03.619) Time: 3609.553 ms (00:03.610) Time: 3574.277 ms (00:03.574) Time: 3578.031 ms (00:03.578) Time: 3558.545 ms (00:03.559) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch Time: 3061.858 ms (00:03.062) Time: 3082.174 ms (00:03.082) Time: 3081.093 ms (00:03.081) Time: 3093.610 ms (00:03.094) Time: 3064.507 ms (00:03.065) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3100.287 ms (00:03.100) Time: 3100.264 ms (00:03.100) Time: 3097.207 ms (00:03.097) Time: 3101.477 ms (00:03.101) Time: 3098.035 ms (00:03.098) > SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; > Time: 5387.514 ms (00:05.388) -- HEAD > Time: 5350.736 ms (00:05.351) -- v5-mul_var_int.patch > Time: 4648.449 ms (00:04.648) -- v5-mul_var_small.patch > Time: 4655.204 ms (00:04.655) -- v6-mul_var_int64.patch > Time: 4645.962 ms (00:04.646) -- v6-mul_var_small.patch I get mixed results for var1ndigits=3: Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch. Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Same winners as for var1ndigits=2. -- Apple M3 Max SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD Time: 3466.932 ms (00:03.467) Time: 3447.001 ms (00:03.447) Time: 3457.259 ms (00:03.457) Time: 3445.938 ms (00:03.446) Time: 3443.310 ms (00:03.443) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch Time: 2988.444 ms (00:02.988) Time: 2976.036 ms (00:02.976) Time: 2982.756 ms (00:02.983) Time: 2986.436 ms (00:02.986) Time: 2973.457 ms (00:02.973) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3026.666 ms (00:03.027) Time: 3024.458 ms (00:03.024) Time: 3022.976 ms (00:03.023) Time: 3029.526 ms (00:03.030) Time: 3021.543 ms (00:03.022) -- Intel Core i9-14900K SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD Time: 4510.078 ms (00:04.510) Time: 4222.501 ms (00:04.223) Time: 4179.509 ms (00:04.180) Time: 4179.307 ms (00:04.179) Time: 4183.026 ms (00:04.183) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch Time: 3811.866 ms (00:03.812) Time: 3816.664 ms (00:03.817) Time: 3811.695 ms (00:03.812) Time: 3811.674 ms (00:03.812) Time: 3812.265 ms (00:03.812) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 4095.053 ms (00:04.095) Time: 3896.002 ms (00:03.896) Time: 3888.999 ms (00:03.889) Time: 3889.346 ms (00:03.889) Time: 3889.017 ms (00:03.889) -- AMD Ryzen 9 7950X3D SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD Time: 3946.255 ms (00:03.946) Time: 3896.110 ms (00:03.896) Time: 3877.470 ms (00:03.877) Time: 3854.402 ms (00:03.854) Time: 3901.218 ms (00:03.901) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch Time: 3393.572 ms (00:03.394) Time: 3395.401 ms (00:03.395) Time: 3395.199 ms (00:03.395) Time: 3418.555 ms (00:03.419) Time: 3399.619 ms (00:03.400) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3308.791 ms (00:03.309) Time: 3309.316 ms (00:03.309) Time: 3318.238 ms (00:03.318) Time: 3296.061 ms (00:03.296) Time: 3320.282 ms (00:03.320) > SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; > Time: 5617.150 ms (00:05.617) -- HEAD > Time: 5505.913 ms (00:05.506) -- v5-mul_var_int.patch > Time: 5486.441 ms (00:05.486) -- v5-mul_var_small.patch > Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch > Time: 5598.909 ms (00:05.599) -- v6-mul_var_small.patch > So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case, > but is terrible in the 4-digit case. None of the other patches touch > the 4-digit case, but it might be interesting to try mul_var_small() > with 4 digits. Interesting you got so bad bench results for v6-mul_var_int64.patch for var1ndigits=4, that patch is actually the winner on AMD Ryzen 9 7950X3D. On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch. On Apple M3 Max, HEAD is the winner. -- Apple M3 Max SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD Time: 3618.530 ms (00:03.619) Time: 3595.239 ms (00:03.595) Time: 3600.412 ms (00:03.600) Time: 3607.500 ms (00:03.607) Time: 3604.122 ms (00:03.604) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch Time: 4498.993 ms (00:04.499) Time: 4527.302 ms (00:04.527) Time: 4493.613 ms (00:04.494) Time: 4482.194 ms (00:04.482) Time: 4493.660 ms (00:04.494) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3628.177 ms (00:03.628) Time: 3613.975 ms (00:03.614) Time: 3612.213 ms (00:03.612) Time: 3614.026 ms (00:03.614) Time: 3622.959 ms (00:03.623) -- Intel Core i9-14900K SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD Time: 4130.810 ms (00:04.131) Time: 3836.462 ms (00:03.836) Time: 3810.604 ms (00:03.811) Time: 3805.443 ms (00:03.805) Time: 3803.055 ms (00:03.803) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch Time: 3952.862 ms (00:03.953) Time: 3792.272 ms (00:03.792) Time: 3793.995 ms (00:03.794) Time: 3790.531 ms (00:03.791) Time: 3793.647 ms (00:03.794) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 3762.828 ms (00:03.763) Time: 3754.869 ms (00:03.755) Time: 3756.041 ms (00:03.756) Time: 3758.719 ms (00:03.759) Time: 3754.894 ms (00:03.755) -- AMD Ryzen 9 7950X3D SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD Time: 4075.988 ms (00:04.076) Time: 4067.702 ms (00:04.068) Time: 4035.191 ms (00:04.035) Time: 4022.411 ms (00:04.022) Time: 4016.475 ms (00:04.016) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch Time: 3830.021 ms (00:03.830) Time: 3837.947 ms (00:03.838) Time: 3834.894 ms (00:03.835) Time: 3832.755 ms (00:03.833) Time: 3834.512 ms (00:03.835) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch Time: 4031.128 ms (00:04.031) Time: 4001.590 ms (00:04.002) Time: 4031.212 ms (00:04.031) Time: 4035.941 ms (00:04.036) Time: 4031.218 ms (00:04.031) Regards, Joel
pgsql-hackers by date: