Re: Optimize mul_var() for var1ndigits >= 8 - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: Optimize mul_var() for var1ndigits >= 8 |
Date | |
Msg-id | CAEZATCViHMfZdnfxpVJzTf08FRkuyvBOvEG=RsV9TekUgOExGA@mail.gmail.com Whole thread Raw |
In response to | Re: Optimize mul_var() for var1ndigits >= 8 (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Optimize mul_var() for var1ndigits >= 8
|
List | pgsql-hackers |
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > Then I tried compiling with -m32, and unfortunately this made the > patch slower than HEAD for small inputs: > > -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled] > call rate=5.052332e+06 -- HEAD > call rate=3.883459e+06 -- v2 patch > > -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled] > call rate=4.7221405e+06 -- HEAD > call rate=3.7965685e+06 -- v2 patch > > -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled] > call rate=4.4638375e+06 -- HEAD > call rate=3.39948e+06 -- v2 patch > > and it doesn't reach parity until around ndigits=26, which is > disappointing. It does still get much faster for large inputs > I spent some more time hacking on this, to try to improve the situation for 32-bit builds. One of the biggest factors was the 64-bit division that is necessary during carry propagation, which is very slow -- every compiler/platform I looked at on godbolt.org turns this into a call to a slow function like __udivdi3(). However, since we are dividing by a constant (NBASE^2), it can be done using the same multiply-and-shift trick that compilers use in 64-bit builds, and that significantly improves performance. Unfortunately, in 32-bit builds, using 64-bit integers is still slower for small inputs (below 20-30 NBASE digits), so in the end I figured that it's best to retain the old 32-bit base-NBASE multiplication code for small inputs, and only use 64-bit base-NBASE^2 multiplication when the inputs are above a certain size threshold. Furthermore, since this threshold is quite low, it's possible to make an additional simplification: as long as the threshold is less than or equal to 42, it can be shown that there is no chance of 32-bit integer overflow, and so the "maxdig" tracking and renormalisation code is not needed. Getting rid of that makes the outer multiplication loop very simple, and makes quite a noticeable difference to the performance for inputs below the threshold. In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster for all inputs over 4 or 5 NBASE digits, so the threshold can be much lower. However, for numbers near that threshold, it's a close thing, so it makes sense to also extend mul_var_small() to cover 1-6 digit inputs, since that gives a much larger gain for numbers of that size. That's quite nice because it equates to inputs with up to 21-24 decimal digits, which I suspect are quite commonly used in practice. One risk in having different thresholds in 32-bit and 64-bit builds is that it opens up the possibility that the results from the reduced-rscale computations used by inexact functions like ln() and exp() might be be different (actually, this may already be a possibility, due to div_var_fast()'s use of floating point arithmetic, but that seems considerably less likely). In practice though, this should be extremely unlikely, due to the fact that any difference would have to propagate through MUL_GUARD_DIGITS NBASE digits (8 decimal digits), and it doesn't show up in any of the existing tests. IMO a very small chance of different results on different platforms is probably acceptable, but it wouldn't be acceptable to make the threshold a runtime configurable parameter that could lead to different results on the same platform. Overall, this has turned out to be more code than I would have liked, but I think it's worth it, because the performance gains are pretty substantial across the board. (Here, I'm comparing with REL_17_STABLE, so that the effect of mul_var_short() for ndigits <= 6 can be seen.) 32-bit build ============ Balanced inputs: ndigits1 | ndigits2 | PG17 rate | patch rate | % change ----------+----------+---------------+---------------+---------- 1 | 1 | 5.973068e+06 | 6.873059e+06 | +15.07% 2 | 2 | 5.646598e+06 | 6.6456665e+06 | +17.69% 3 | 3 | 5.8176995e+06 | 7.0903175e+06 | +21.87% 4 | 4 | 5.545298e+06 | 6.9701605e+06 | +25.69% 5 | 5 | 5.2109155e+06 | 6.6406185e+06 | +27.44% 6 | 6 | 4.9276335e+06 | 6.478698e+06 | +31.48% 7 | 7 | 4.6595095e+06 | 4.8514485e+06 | +4.12% 8 | 8 | 4.898536e+06 | 5.1599975e+06 | +5.34% 9 | 9 | 4.537171e+06 | 4.830987e+06 | +6.48% 10 | 10 | 4.308139e+06 | 4.6029265e+06 | +6.84% 11 | 11 | 4.092612e+06 | 4.37966e+06 | +7.01% 12 | 12 | 3.9345035e+06 | 4.213998e+06 | +7.10% 13 | 13 | 3.7600162e+06 | 4.0115955e+06 | +6.69% 14 | 14 | 3.5959855e+06 | 3.8216508e+06 | +6.28% 15 | 15 | 3.3576732e+06 | 3.6070588e+06 | +7.43% 16 | 16 | 3.657139e+06 | 3.9067975e+06 | +6.83% 17 | 17 | 3.3601955e+06 | 3.5651722e+06 | +6.10% 18 | 18 | 3.1844472e+06 | 3.4542238e+06 | +8.47% 19 | 19 | 3.0286392e+06 | 3.2380662e+06 | +6.91% 20 | 20 | 2.9012185e+06 | 3.0496352e+06 | +5.12% 21 | 21 | 2.755444e+06 | 2.9508798e+06 | +7.09% 22 | 22 | 2.6263908e+06 | 2.8168945e+06 | +7.25% 23 | 23 | 2.5470438e+06 | 2.7056318e+06 | +6.23% 24 | 24 | 2.764418e+06 | 2.9597732e+06 | +7.07% 25 | 25 | 2.509954e+06 | 2.7368215e+06 | +9.04% 26 | 26 | 2.3699395e+06 | 2.565404e+06 | +8.25% 27 | 27 | 2.286344e+06 | 2.4400948e+06 | +6.72% 28 | 28 | 2.199087e+06 | 2.361374e+06 | +7.38% 29 | 29 | 2.1208148e+06 | 2.26925e+06 | +7.00% 30 | 30 | 2.0469475e+06 | 2.2127455e+06 | +8.10% 31 | 31 | 1.9973804e+06 | 2.3771615e+06 | +19.01% 32 | 32 | 2.1288205e+06 | 2.3166555e+06 | +8.82% 33 | 33 | 1.9876898e+06 | 2.1759028e+06 | +9.47% 34 | 34 | 1.8906434e+06 | 2.169534e+06 | +14.75% 35 | 35 | 1.8069352e+06 | 1.990085e+06 | +10.14% 36 | 36 | 1.7412926e+06 | 1.9940908e+06 | +14.52% 37 | 37 | 1.6956435e+06 | 1.8492525e+06 | +9.06% 38 | 38 | 1.6253338e+06 | 1.8493976e+06 | +13.79% 39 | 39 | 1.5734566e+06 | 1.9296996e+06 | +22.64% 40 | 40 | 1.6692021e+06 | 1.902438e+06 | +13.97% 50 | 50 | 1.1116885e+06 | 1.5319529e+06 | +37.80% 100 | 100 | 399552.38 | 722142.44 | +80.74% 250 | 250 | 81092.02 | 195967.67 | +141.66% 500 | 500 | 21654.633 | 58329.473 | +169.36% 1000 | 1000 | 5525.9775 | 16420.611 | +197.15% 2500 | 2500 | 907.98206 | 2825.7324 | +211.21% 5000 | 5000 | 230.26935 | 731.26105 | +217.57% 10000 | 10000 | 57.793922 | 179.12334 | +209.93% 16383 | 16383 | 21.446728 | 64.39028 | +200.23% Unbalanced inputs: ndigits1 | ndigits2 | PG17 rate | patch rate | % change ----------+----------+-----------+------------+---------- 1 | 10000 | 42816.89 | 52843.01 | +23.42% 2 | 10000 | 41032.25 | 52111.957 | +27.00% 3 | 10000 | 39550.176 | 52262.477 | +32.14% 4 | 10000 | 38015.59 | 43962.535 | +15.64% 5 | 10000 | 36560.22 | 43736.305 | +19.63% 6 | 10000 | 35305.77 | 38204.2 | +8.21% 7 | 10000 | 33833.086 | 36533.387 | +7.98% 8 | 10000 | 32847.996 | 35774.715 | +8.91% 9 | 10000 | 31345.736 | 33831.926 | +7.93% 10 | 10000 | 30351.6 | 32715.969 | +7.79% 11 | 10000 | 29321.592 | 31478.398 | +7.36% 12 | 10000 | 28616.018 | 30861.885 | +7.85% 13 | 10000 | 28216.12 | 29510.95 | +4.59% 14 | 10000 | 27396.408 | 29077.73 | +6.14% 15 | 10000 | 26519.08 | 28235.08 | +6.47% 16 | 10000 | 25778.102 | 27538.908 | +6.83% 17 | 10000 | 26024.918 | 26677.498 | +2.51% 18 | 10000 | 25316.346 | 26729.395 | +5.58% 19 | 10000 | 24626.07 | 26076.979 | +5.89% 20 | 10000 | 23912.383 | 25709.967 | +7.52% 21 | 10000 | 23238.488 | 24761.57 | +6.55% 22 | 10000 | 22746.25 | 23925.934 | +5.19% 23 | 10000 | 22120.777 | 23442.34 | +5.97% 24 | 10000 | 21645.215 | 22771.193 | +5.20% 25 | 10000 | 21135.049 | 22185.893 | +4.97% 26 | 10000 | 20685.025 | 21831.74 | +5.54% 27 | 10000 | 20039.559 | 20854.793 | +4.07% 28 | 10000 | 19846.092 | 21072.758 | +6.18% 29 | 10000 | 19414.059 | 20289.414 | +4.51% 30 | 10000 | 18968.617 | 19774.797 | +4.25% 31 | 10000 | 18394.988 | 21307.074 | +15.83% 32 | 10000 | 18291.504 | 21349.691 | +16.72% 33 | 10000 | 17899.676 | 20885.299 | +16.68% 34 | 10000 | 17505.402 | 20620.105 | +17.79% 35 | 10000 | 17174.918 | 20383.594 | +18.68% 36 | 10000 | 16609.867 | 20342.18 | +22.47% 37 | 10000 | 16457.889 | 19953.91 | +21.24% 38 | 10000 | 15926.13 | 19783.203 | +24.22% 39 | 10000 | 15441.283 | 19288.654 | +24.92% 40 | 10000 | 15038.773 | 19415.52 | +29.10% 50 | 10000 | 11264.285 | 17608.648 | +56.32% 100 | 10000 | 6253.7637 | 11620.726 | +85.82% 250 | 10000 | 2696.207 | 5939.3857 | +120.29% 500 | 10000 | 1338.4141 | 3268.2004 | +144.18% 1000 | 10000 | 672.6717 | 1691.9614 | +151.53% 2500 | 10000 | 267.5996 | 708.20386 | +164.65% 5000 | 10000 | 131.50755 | 353.92822 | +169.13% numeric_big regression test: ok 224 - numeric_big 279 ms [PG17] ok 224 - numeric_big 182 ms [v3 patch] 64-bit build ============ Balanced inputs: ndigits1 | ndigits2 | PG17 rate | patch rate | % change ----------+----------+---------------+---------------+---------- 1 | 1 | 8.507485e+06 | 9.53221e+06 | +12.04% 2 | 2 | 8.0975715e+06 | 9.431853e+06 | +16.48% 3 | 3 | 6.461359e+06 | 7.3669945e+06 | +14.02% 4 | 4 | 6.1728355e+06 | 7.152418e+06 | +15.87% 5 | 5 | 6.500831e+06 | 7.6977115e+06 | +18.41% 6 | 6 | 6.1784075e+06 | 7.3765005e+06 | +19.39% 7 | 7 | 5.90117e+06 | 6.2799965e+06 | +6.42% 8 | 8 | 5.9217105e+06 | 6.147141e+06 | +3.81% 9 | 9 | 5.477262e+06 | 5.9889475e+06 | +9.34% 10 | 10 | 5.2147235e+06 | 5.858963e+06 | +12.35% 11 | 11 | 4.882895e+06 | 5.6766675e+06 | +16.26% 12 | 12 | 4.61105e+06 | 5.559544e+06 | +20.57% 13 | 13 | 4.382494e+06 | 5.3770165e+06 | +22.69% 14 | 14 | 4.134509e+06 | 5.256462e+06 | +27.14% 15 | 15 | 3.7595882e+06 | 5.0751355e+06 | +34.99% 16 | 16 | 4.3353435e+06 | 4.970363e+06 | +14.65% 17 | 17 | 3.9258755e+06 | 4.935394e+06 | +25.71% 18 | 18 | 3.7562495e+06 | 4.8723875e+06 | +29.71% 19 | 19 | 3.4890312e+06 | 4.723648e+06 | +35.39% 20 | 20 | 3.289758e+06 | 4.6569305e+06 | +41.56% 21 | 21 | 3.103908e+06 | 4.4747755e+06 | +44.17% 22 | 22 | 2.9545148e+06 | 4.4227305e+06 | +49.69% 23 | 23 | 2.7975982e+06 | 4.5065035e+06 | +61.08% 24 | 24 | 3.2456168e+06 | 4.4578115e+06 | +37.35% 25 | 25 | 2.9515055e+06 | 4.0208335e+06 | +36.23% 26 | 26 | 2.8158568e+06 | 4.0437498e+06 | +43.61% 27 | 27 | 2.6376518e+06 | 3.8959785e+06 | +47.71% 28 | 28 | 2.5094318e+06 | 3.8648428e+06 | +54.01% 29 | 29 | 2.3714905e+06 | 3.67182e+06 | +54.83% 30 | 30 | 2.2456015e+06 | 3.6337285e+06 | +61.82% 31 | 31 | 2.169437e+06 | 3.7209152e+06 | +71.52% 32 | 32 | 2.5022498e+06 | 3.6609378e+06 | +46.31% 33 | 33 | 2.27133e+06 | 3.435459e+06 | +51.25% 34 | 34 | 2.1836462e+06 | 3.4042262e+06 | +55.90% 35 | 35 | 2.0753196e+06 | 3.2125422e+06 | +54.80% 36 | 36 | 1.9650498e+06 | 3.185525e+06 | +62.11% 37 | 37 | 1.8668318e+06 | 3.0366508e+06 | +62.66% 38 | 38 | 1.7678832e+06 | 3.014941e+06 | +70.54% 39 | 39 | 1.6764314e+06 | 3.1080448e+06 | +85.40% 40 | 40 | 1.9170026e+06 | 3.0942025e+06 | +61.41% 50 | 50 | 1.2242934e+06 | 2.3769868e+06 | +94.15% 100 | 100 | 401733.62 | 1.0854601e+06 | +170.19% 250 | 250 | 81861.45 | 249837.78 | +205.20% 500 | 500 | 21613.402 | 71239.04 | +229.61% 1000 | 1000 | 5551.617 | 18349.414 | +230.52% 2500 | 2500 | 906.501 | 3107.6462 | +242.82% 5000 | 5000 | 231.65045 | 794.86444 | +243.13% 10000 | 10000 | 58.372395 | 188.37387 | +222.71% 16383 | 16383 | 21.629467 | 73.58552 | +240.21% Unbalanced inputs: ndigits1 | ndigits2 | PG17 rate | patch rate | % change ----------+----------+-----------+------------+---------- 1 | 10000 | 44137.258 | 62292.844 | +41.13% 2 | 10000 | 42009.895 | 62705.445 | +49.26% 3 | 10000 | 40569.617 | 58555.727 | +44.33% 4 | 10000 | 38914.03 | 49439.008 | +27.05% 5 | 10000 | 37361.39 | 47173.445 | +26.26% 6 | 10000 | 35807.61 | 42609.203 | +18.99% 7 | 10000 | 34386.684 | 49325.582 | +43.44% 8 | 10000 | 33380.312 | 49298.59 | +47.69% 9 | 10000 | 32228.17 | 46869.844 | +45.43% 10 | 10000 | 31022.46 | 47015.98 | +51.55% 11 | 10000 | 29782.623 | 45074 | +51.34% 12 | 10000 | 29540.896 | 44712.23 | +51.36% 13 | 10000 | 28521.643 | 42589.98 | +49.33% 14 | 10000 | 27772.59 | 43325.863 | +56.00% 15 | 10000 | 26871.25 | 41443 | +54.23% 16 | 10000 | 26179.322 | 41245.508 | +57.55% 17 | 10000 | 26367.402 | 39348.543 | +49.23% 18 | 10000 | 25769.176 | 40105.402 | +55.63% 19 | 10000 | 24869.504 | 38316.44 | +54.07% 20 | 10000 | 24395.436 | 37647.33 | +54.32% 21 | 10000 | 23532.748 | 36552.914 | +55.33% 22 | 10000 | 23151.842 | 36824.04 | +59.05% 23 | 10000 | 22661.33 | 35556.918 | +56.91% 24 | 10000 | 22113.502 | 34923.83 | +57.93% 25 | 10000 | 21481.773 | 33601.785 | +56.42% 26 | 10000 | 20943.576 | 34277.58 | +63.67% 27 | 10000 | 20437.605 | 32957.406 | +61.26% 28 | 10000 | 20049.12 | 32413.64 | +61.67% 29 | 10000 | 19674.787 | 31537.846 | +60.30% 30 | 10000 | 19092.572 | 32252.404 | +68.93% 31 | 10000 | 18761.932 | 30825.836 | +64.30% 32 | 10000 | 18480.184 | 30616.389 | +65.67% 33 | 10000 | 18130.89 | 29493.594 | +62.67% 34 | 10000 | 17750.996 | 30054.01 | +69.31% 35 | 10000 | 17406.83 | 29090.297 | +67.12% 36 | 10000 | 17138.23 | 29117.42 | +69.90% 37 | 10000 | 16666.799 | 28429.32 | +70.57% 38 | 10000 | 16144.025 | 29082.398 | +80.14% 39 | 10000 | 15548.838 | 28195.258 | +81.33% 40 | 10000 | 15305.571 | 27273.215 | +78.19% 50 | 10000 | 11099.766 | 25494.129 | +129.68% 100 | 10000 | 6310.7827 | 14895.447 | +136.03% 250 | 10000 | 2687.7397 | 7149.1016 | +165.99% 500 | 10000 | 1354.7455 | 3608.8845 | +166.39% 1000 | 10000 | 677.3838 | 1852.1256 | +173.42% 2500 | 10000 | 269.74582 | 748.5785 | +177.51% 5000 | 10000 | 132.6432 | 377.23288 | +184.40% numeric_big regression test: ok 224 - numeric_big 256 ms [PG17] ok 224 - numeric_big 161 ms [v3 patch] Regards, Dean
Attachment
pgsql-hackers by date: