Re: Optimize mul_var() for var1ndigits >= 8 - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: Optimize mul_var() for var1ndigits >= 8
Date
Msg-id CAEZATCViHMfZdnfxpVJzTf08FRkuyvBOvEG=RsV9TekUgOExGA@mail.gmail.com
Whole thread Raw
In response to Re: Optimize mul_var() for var1ndigits >= 8  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Optimize mul_var() for var1ndigits >= 8
List pgsql-hackers
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06  -- HEAD
> call rate=3.883459e+06  -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06  -- HEAD
> call rate=3.7965685e+06  -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06   -- HEAD
> call rate=3.39948e+06     -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs
>

I spent some more time hacking on this, to try to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation, which is very
slow -- every compiler/platform I looked at on godbolt.org turns this
into a call to a slow function like __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_small() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might be be different (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice though, this
should be extremely unlikely, due to the fact that any difference
would have to propagate through MUL_GUARD_DIGITS NBASE digits (8
decimal digits), and it doesn't show up in any of the existing tests.
IMO a very small chance of different results on different platforms is
probably acceptable, but it wouldn't be acceptable to make the
threshold a runtime configurable parameter that could lead to
different results on the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build
============

Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
        1 |        1 |  5.973068e+06 |  6.873059e+06 | +15.07%
        2 |        2 |  5.646598e+06 | 6.6456665e+06 | +17.69%
        3 |        3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
        4 |        4 |  5.545298e+06 | 6.9701605e+06 | +25.69%
        5 |        5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
        6 |        6 | 4.9276335e+06 |  6.478698e+06 | +31.48%
        7 |        7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
        8 |        8 |  4.898536e+06 | 5.1599975e+06 | +5.34%
        9 |        9 |  4.537171e+06 |  4.830987e+06 | +6.48%
       10 |       10 |  4.308139e+06 | 4.6029265e+06 | +6.84%
       11 |       11 |  4.092612e+06 |   4.37966e+06 | +7.01%
       12 |       12 | 3.9345035e+06 |  4.213998e+06 | +7.10%
       13 |       13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
       14 |       14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
       15 |       15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
       16 |       16 |  3.657139e+06 | 3.9067975e+06 | +6.83%
       17 |       17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
       18 |       18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
       19 |       19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
       20 |       20 | 2.9012185e+06 | 3.0496352e+06 | +5.12%
       21 |       21 |  2.755444e+06 | 2.9508798e+06 | +7.09%
       22 |       22 | 2.6263908e+06 | 2.8168945e+06 | +7.25%
       23 |       23 | 2.5470438e+06 | 2.7056318e+06 | +6.23%
       24 |       24 |  2.764418e+06 | 2.9597732e+06 | +7.07%
       25 |       25 |  2.509954e+06 | 2.7368215e+06 | +9.04%
       26 |       26 | 2.3699395e+06 |  2.565404e+06 | +8.25%
       27 |       27 |  2.286344e+06 | 2.4400948e+06 | +6.72%
       28 |       28 |  2.199087e+06 |  2.361374e+06 | +7.38%
       29 |       29 | 2.1208148e+06 |   2.26925e+06 | +7.00%
       30 |       30 | 2.0469475e+06 | 2.2127455e+06 | +8.10%
       31 |       31 | 1.9973804e+06 | 2.3771615e+06 | +19.01%
       32 |       32 | 2.1288205e+06 | 2.3166555e+06 | +8.82%
       33 |       33 | 1.9876898e+06 | 2.1759028e+06 | +9.47%
       34 |       34 | 1.8906434e+06 |  2.169534e+06 | +14.75%
       35 |       35 | 1.8069352e+06 |  1.990085e+06 | +10.14%
       36 |       36 | 1.7412926e+06 | 1.9940908e+06 | +14.52%
       37 |       37 | 1.6956435e+06 | 1.8492525e+06 | +9.06%
       38 |       38 | 1.6253338e+06 | 1.8493976e+06 | +13.79%
       39 |       39 | 1.5734566e+06 | 1.9296996e+06 | +22.64%
       40 |       40 | 1.6692021e+06 |  1.902438e+06 | +13.97%
       50 |       50 | 1.1116885e+06 | 1.5319529e+06 | +37.80%
      100 |      100 |     399552.38 |     722142.44 | +80.74%
      250 |      250 |      81092.02 |     195967.67 | +141.66%
      500 |      500 |     21654.633 |     58329.473 | +169.36%
     1000 |     1000 |     5525.9775 |     16420.611 | +197.15%
     2500 |     2500 |     907.98206 |     2825.7324 | +211.21%
     5000 |     5000 |     230.26935 |     731.26105 | +217.57%
    10000 |    10000 |     57.793922 |     179.12334 | +209.93%
    16383 |    16383 |     21.446728 |      64.39028 | +200.23%

Unbalanced inputs:

 ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
        1 |    10000 |  42816.89 |   52843.01 | +23.42%
        2 |    10000 |  41032.25 |  52111.957 | +27.00%
        3 |    10000 | 39550.176 |  52262.477 | +32.14%
        4 |    10000 |  38015.59 |  43962.535 | +15.64%
        5 |    10000 |  36560.22 |  43736.305 | +19.63%
        6 |    10000 |  35305.77 |    38204.2 | +8.21%
        7 |    10000 | 33833.086 |  36533.387 | +7.98%
        8 |    10000 | 32847.996 |  35774.715 | +8.91%
        9 |    10000 | 31345.736 |  33831.926 | +7.93%
       10 |    10000 |   30351.6 |  32715.969 | +7.79%
       11 |    10000 | 29321.592 |  31478.398 | +7.36%
       12 |    10000 | 28616.018 |  30861.885 | +7.85%
       13 |    10000 |  28216.12 |   29510.95 | +4.59%
       14 |    10000 | 27396.408 |   29077.73 | +6.14%
       15 |    10000 |  26519.08 |   28235.08 | +6.47%
       16 |    10000 | 25778.102 |  27538.908 | +6.83%
       17 |    10000 | 26024.918 |  26677.498 | +2.51%
       18 |    10000 | 25316.346 |  26729.395 | +5.58%
       19 |    10000 |  24626.07 |  26076.979 | +5.89%
       20 |    10000 | 23912.383 |  25709.967 | +7.52%
       21 |    10000 | 23238.488 |   24761.57 | +6.55%
       22 |    10000 |  22746.25 |  23925.934 | +5.19%
       23 |    10000 | 22120.777 |   23442.34 | +5.97%
       24 |    10000 | 21645.215 |  22771.193 | +5.20%
       25 |    10000 | 21135.049 |  22185.893 | +4.97%
       26 |    10000 | 20685.025 |   21831.74 | +5.54%
       27 |    10000 | 20039.559 |  20854.793 | +4.07%
       28 |    10000 | 19846.092 |  21072.758 | +6.18%
       29 |    10000 | 19414.059 |  20289.414 | +4.51%
       30 |    10000 | 18968.617 |  19774.797 | +4.25%
       31 |    10000 | 18394.988 |  21307.074 | +15.83%
       32 |    10000 | 18291.504 |  21349.691 | +16.72%
       33 |    10000 | 17899.676 |  20885.299 | +16.68%
       34 |    10000 | 17505.402 |  20620.105 | +17.79%
       35 |    10000 | 17174.918 |  20383.594 | +18.68%
       36 |    10000 | 16609.867 |   20342.18 | +22.47%
       37 |    10000 | 16457.889 |   19953.91 | +21.24%
       38 |    10000 |  15926.13 |  19783.203 | +24.22%
       39 |    10000 | 15441.283 |  19288.654 | +24.92%
       40 |    10000 | 15038.773 |   19415.52 | +29.10%
       50 |    10000 | 11264.285 |  17608.648 | +56.32%
      100 |    10000 | 6253.7637 |  11620.726 | +85.82%
      250 |    10000 |  2696.207 |  5939.3857 | +120.29%
      500 |    10000 | 1338.4141 |  3268.2004 | +144.18%
     1000 |    10000 |  672.6717 |  1691.9614 | +151.53%
     2500 |    10000 |  267.5996 |  708.20386 | +164.65%
     5000 |    10000 | 131.50755 |  353.92822 | +169.13%

numeric_big regression test:

ok 224       - numeric_big                               279 ms  [PG17]
ok 224       - numeric_big                               182 ms  [v3 patch]


64-bit build
============

Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
        1 |        1 |  8.507485e+06 |   9.53221e+06 | +12.04%
        2 |        2 | 8.0975715e+06 |  9.431853e+06 | +16.48%
        3 |        3 |  6.461359e+06 | 7.3669945e+06 | +14.02%
        4 |        4 | 6.1728355e+06 |  7.152418e+06 | +15.87%
        5 |        5 |  6.500831e+06 | 7.6977115e+06 | +18.41%
        6 |        6 | 6.1784075e+06 | 7.3765005e+06 | +19.39%
        7 |        7 |   5.90117e+06 | 6.2799965e+06 | +6.42%
        8 |        8 | 5.9217105e+06 |  6.147141e+06 | +3.81%
        9 |        9 |  5.477262e+06 | 5.9889475e+06 | +9.34%
       10 |       10 | 5.2147235e+06 |  5.858963e+06 | +12.35%
       11 |       11 |  4.882895e+06 | 5.6766675e+06 | +16.26%
       12 |       12 |   4.61105e+06 |  5.559544e+06 | +20.57%
       13 |       13 |  4.382494e+06 | 5.3770165e+06 | +22.69%
       14 |       14 |  4.134509e+06 |  5.256462e+06 | +27.14%
       15 |       15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
       16 |       16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
       17 |       17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
       18 |       18 | 3.7562495e+06 | 4.8723875e+06 | +29.71%
       19 |       19 | 3.4890312e+06 |  4.723648e+06 | +35.39%
       20 |       20 |  3.289758e+06 | 4.6569305e+06 | +41.56%
       21 |       21 |  3.103908e+06 | 4.4747755e+06 | +44.17%
       22 |       22 | 2.9545148e+06 | 4.4227305e+06 | +49.69%
       23 |       23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
       24 |       24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
       25 |       25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
       26 |       26 | 2.8158568e+06 | 4.0437498e+06 | +43.61%
       27 |       27 | 2.6376518e+06 | 3.8959785e+06 | +47.71%
       28 |       28 | 2.5094318e+06 | 3.8648428e+06 | +54.01%
       29 |       29 | 2.3714905e+06 |   3.67182e+06 | +54.83%
       30 |       30 | 2.2456015e+06 | 3.6337285e+06 | +61.82%
       31 |       31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
       32 |       32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
       33 |       33 |   2.27133e+06 |  3.435459e+06 | +51.25%
       34 |       34 | 2.1836462e+06 | 3.4042262e+06 | +55.90%
       35 |       35 | 2.0753196e+06 | 3.2125422e+06 | +54.80%
       36 |       36 | 1.9650498e+06 |  3.185525e+06 | +62.11%
       37 |       37 | 1.8668318e+06 | 3.0366508e+06 | +62.66%
       38 |       38 | 1.7678832e+06 |  3.014941e+06 | +70.54%
       39 |       39 | 1.6764314e+06 | 3.1080448e+06 | +85.40%
       40 |       40 | 1.9170026e+06 | 3.0942025e+06 | +61.41%
       50 |       50 | 1.2242934e+06 | 2.3769868e+06 | +94.15%
      100 |      100 |     401733.62 | 1.0854601e+06 | +170.19%
      250 |      250 |      81861.45 |     249837.78 | +205.20%
      500 |      500 |     21613.402 |      71239.04 | +229.61%
     1000 |     1000 |      5551.617 |     18349.414 | +230.52%
     2500 |     2500 |       906.501 |     3107.6462 | +242.82%
     5000 |     5000 |     231.65045 |     794.86444 | +243.13%
    10000 |    10000 |     58.372395 |     188.37387 | +222.71%
    16383 |    16383 |     21.629467 |      73.58552 | +240.21%

Unbalanced inputs:

 ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
        1 |    10000 | 44137.258 |  62292.844 | +41.13%
        2 |    10000 | 42009.895 |  62705.445 | +49.26%
        3 |    10000 | 40569.617 |  58555.727 | +44.33%
        4 |    10000 |  38914.03 |  49439.008 | +27.05%
        5 |    10000 |  37361.39 |  47173.445 | +26.26%
        6 |    10000 |  35807.61 |  42609.203 | +18.99%
        7 |    10000 | 34386.684 |  49325.582 | +43.44%
        8 |    10000 | 33380.312 |   49298.59 | +47.69%
        9 |    10000 |  32228.17 |  46869.844 | +45.43%
       10 |    10000 |  31022.46 |   47015.98 | +51.55%
       11 |    10000 | 29782.623 |      45074 | +51.34%
       12 |    10000 | 29540.896 |   44712.23 | +51.36%
       13 |    10000 | 28521.643 |   42589.98 | +49.33%
       14 |    10000 |  27772.59 |  43325.863 | +56.00%
       15 |    10000 |  26871.25 |      41443 | +54.23%
       16 |    10000 | 26179.322 |  41245.508 | +57.55%
       17 |    10000 | 26367.402 |  39348.543 | +49.23%
       18 |    10000 | 25769.176 |  40105.402 | +55.63%
       19 |    10000 | 24869.504 |   38316.44 | +54.07%
       20 |    10000 | 24395.436 |   37647.33 | +54.32%
       21 |    10000 | 23532.748 |  36552.914 | +55.33%
       22 |    10000 | 23151.842 |   36824.04 | +59.05%
       23 |    10000 |  22661.33 |  35556.918 | +56.91%
       24 |    10000 | 22113.502 |   34923.83 | +57.93%
       25 |    10000 | 21481.773 |  33601.785 | +56.42%
       26 |    10000 | 20943.576 |   34277.58 | +63.67%
       27 |    10000 | 20437.605 |  32957.406 | +61.26%
       28 |    10000 |  20049.12 |   32413.64 | +61.67%
       29 |    10000 | 19674.787 |  31537.846 | +60.30%
       30 |    10000 | 19092.572 |  32252.404 | +68.93%
       31 |    10000 | 18761.932 |  30825.836 | +64.30%
       32 |    10000 | 18480.184 |  30616.389 | +65.67%
       33 |    10000 |  18130.89 |  29493.594 | +62.67%
       34 |    10000 | 17750.996 |   30054.01 | +69.31%
       35 |    10000 |  17406.83 |  29090.297 | +67.12%
       36 |    10000 |  17138.23 |   29117.42 | +69.90%
       37 |    10000 | 16666.799 |   28429.32 | +70.57%
       38 |    10000 | 16144.025 |  29082.398 | +80.14%
       39 |    10000 | 15548.838 |  28195.258 | +81.33%
       40 |    10000 | 15305.571 |  27273.215 | +78.19%
       50 |    10000 | 11099.766 |  25494.129 | +129.68%
      100 |    10000 | 6310.7827 |  14895.447 | +136.03%
      250 |    10000 | 2687.7397 |  7149.1016 | +165.99%
      500 |    10000 | 1354.7455 |  3608.8845 | +166.39%
     1000 |    10000 |  677.3838 |  1852.1256 | +173.42%
     2500 |    10000 | 269.74582 |   748.5785 | +177.51%
     5000 |    10000 |  132.6432 |  377.23288 | +184.40%

numeric_big regression test:

ok 224       - numeric_big                               256 ms  [PG17]
ok 224       - numeric_big                               161 ms  [v3 patch]

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: "Andrey M. Borodin"
Date:
Subject: Re: UUID v7
Next
From: Andrew Dunstan
Date:
Subject: Re: why is pg_upgrade's regression run so slow?