Thread: Re: Optimising numeric division

Re: Optimising numeric division

From
"Joel Jacobson"
Date:
On Fri, Aug 23, 2024, at 21:21, Joel Jacobson wrote:
> Attachments:
> * perf_test-M3 Max.out
> * perf_test-Intel Core i9-14900K.out
> * perf_test-AMD Ryzen 9 7950X3D.out

Here are some additional benchmarks from pg-catbench:

AMD Ryzen 9 7950X3D:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where
summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
 var1ndigits | var2ndigits | a_avg  | b_avg  | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+----------+----------+--------
           1 |           1 | 43 ns  | 39 ns  | 11 ns         | -3.3 ns  |       -8 |      0
           1 |           2 | 47 ns  | 43 ns  | 11 ns         | -3.2 ns  |       -7 |      0
           1 |           3 | 55 ns  | 89 ns  | 18 ns         | 33 ns    |       60 |      2
           1 |           4 | 76 ns  | 93 ns  | 20 ns         | 16 ns    |       22 |      1
           1 |           8 | 190 ns | 98 ns  | 36 ns         | -94 ns   |      -49 |      3
           1 |          16 | 280 ns | 120 ns | 52 ns         | -160 ns  |      -57 |      3
           1 |          32 | 490 ns | 140 ns | 98 ns         | -350 ns  |      -72 |      4
           1 |          64 | 780 ns | 190 ns | 110 ns        | -590 ns  |      -75 |      5
           1 |         128 | 1.4 µs | 330 ns | 85 ns         | -1.1 µs  |      -77 |     13
           1 |         256 | 590 ns | 210 ns | 120 ns        | -380 ns  |      -65 |      3
           1 |         512 | 1.6 µs | 430 ns | 460 ns        | -1.2 µs  |      -74 |      3
           1 |        1024 | 2.8 µs | 820 ns | 1 µs          | -2 µs    |      -71 |      2
           1 |        2048 | 6.6 µs | 1.4 µs | 1.9 µs        | -5.2 µs  |      -78 |      3
           1 |        4096 | 11 µs  | 2.8 µs | 2.5 µs        | -8.5 µs  |      -76 |      3
           1 |        8192 | 25 µs  | 5.6 µs | 7.4 µs        | -20 µs   |      -78 |      3
           1 |       16384 | 53 µs  | 15 µs  | 15 µs         | -37 µs   |      -71 |      3
           2 |           2 | 49 ns  | 49 ns  | 12 ns         | 2e-10 s  |        0 |      0
           2 |           3 | 54 ns  | 93 ns  | 16 ns         | 39 ns    |       73 |      2
           2 |           4 | 86 ns  | 97 ns  | 19 ns         | 10 ns    |       12 |      1
           2 |           8 | 200 ns | 89 ns  | 36 ns         | -110 ns  |      -56 |      3
           2 |          16 | 340 ns | 120 ns | 63 ns         | -210 ns  |      -63 |      3
           2 |          32 | 460 ns | 130 ns | 84 ns         | -320 ns  |      -71 |      4
           2 |          64 | 770 ns | 210 ns | 68 ns         | -560 ns  |      -73 |      8
           2 |         128 | 1.5 µs | 330 ns | 290 ns        | -1.2 µs  |      -78 |      4
           2 |         256 | 1.2 µs | 380 ns | 160 ns        | -870 ns  |      -70 |      5
           2 |         512 | 1.9 µs | 520 ns | 510 ns        | -1.4 µs  |      -73 |      3
           2 |        1024 | 3.9 µs | 1 µs   | 1.2 µs        | -2.9 µs  |      -74 |      2
           2 |        2048 | 7.8 µs | 2.1 µs | 2.5 µs        | -5.7 µs  |      -74 |      2
           2 |        4096 | 17 µs  | 5.3 µs | 2 µs          | -12 µs   |      -69 |      6
           2 |        8192 | 30 µs  | 7.7 µs | 8.5 µs        | -22 µs   |      -74 |      3
           2 |       16384 | 35 µs  | 7.3 µs | 7.2 µs        | -27 µs   |      -79 |      4
           3 |           3 | 52 ns  | 88 ns  | 16 ns         | 36 ns    |       69 |      2
           3 |           4 | 78 ns  | 97 ns  | 24 ns         | 19 ns    |       24 |      1
           3 |           8 | 210 ns | 94 ns  | 38 ns         | -120 ns  |      -56 |      3
           3 |          16 | 300 ns | 130 ns | 53 ns         | -170 ns  |      -57 |      3
           3 |          32 | 510 ns | 140 ns | 95 ns         | -370 ns  |      -72 |      4
           3 |          64 | 800 ns | 230 ns | 100 ns        | -570 ns  |      -72 |      6
           3 |         128 | 1.4 µs | 290 ns | 210 ns        | -1.1 µs  |      -79 |      5
           3 |         256 | 900 ns | 270 ns | 310 ns        | -630 ns  |      -70 |      2
           3 |         512 | 1.4 µs | 470 ns | 550 ns        | -940 ns  |      -67 |      2
           3 |        1024 | 2.2 µs | 510 ns | 460 ns        | -1.7 µs  |      -77 |      4
           3 |        2048 | 5.5 µs | 1.6 µs | 2.1 µs        | -4 µs    |      -72 |      2
           3 |        4096 | 13 µs  | 3.8 µs | 3.7 µs        | -9.3 µs  |      -71 |      3
           3 |        8192 | 29 µs  | 7.6 µs | 8.3 µs        | -22 µs   |      -74 |      3
           3 |       16384 | 43 µs  | 11 µs  | 17 µs         | -32 µs   |      -74 |      2
           4 |           4 | 85 ns  | 92 ns  | 20 ns         | 6.6 ns   |        8 |      0
           4 |           8 | 200 ns | 89 ns  | 37 ns         | -110 ns  |      -56 |      3
           4 |          16 | 320 ns | 120 ns | 61 ns         | -200 ns  |      -61 |      3
           4 |          32 | 470 ns | 140 ns | 88 ns         | -320 ns  |      -69 |      4
           4 |          64 | 800 ns | 220 ns | 120 ns        | -580 ns  |      -72 |      5
           4 |         128 | 1.5 µs | 330 ns | 250 ns        | -1.2 µs  |      -78 |      5
           4 |         256 | 890 ns | 340 ns | 310 ns        | -550 ns  |      -61 |      2
           4 |         512 | 1.2 µs | 460 ns | 330 ns        | -730 ns  |      -62 |      2
           4 |        1024 | 2.5 µs | 790 ns | 860 ns        | -1.7 µs  |      -69 |      2
           4 |        2048 | 3.5 µs | 950 ns | 400 ns        | -2.6 µs  |      -73 |      7
           4 |        4096 | 13 µs  | 4 µs   | 3.8 µs        | -9 µs    |      -69 |      2
           4 |        8192 | 30 µs  | 7.7 µs | 8.2 µs        | -22 µs   |      -74 |      3
           4 |       16384 | 61 µs  | 20 µs  | 7.1 µs        | -42 µs   |      -68 |      6
           8 |           8 | 200 ns | 94 ns  | 38 ns         | -110 ns  |      -53 |      3
           8 |          16 | 330 ns | 110 ns | 63 ns         | -210 ns  |      -66 |      3
           8 |          32 | 480 ns | 140 ns | 86 ns         | -340 ns  |      -71 |      4
           8 |          64 | 800 ns | 210 ns | 49 ns         | -590 ns  |      -74 |     12
           8 |         128 | 1.6 µs | 320 ns | 190 ns        | -1.2 µs  |      -79 |      7
           8 |         256 | 1.6 µs | 460 ns | 290 ns        | -1.1 µs  |      -71 |      4
           8 |         512 | 1.9 µs | 570 ns | 550 ns        | -1.3 µs  |      -70 |      2
           8 |        1024 | 4 µs   | 1 µs   | 1.1 µs        | -3 µs    |      -75 |      3
           8 |        2048 | 6.6 µs | 1.4 µs | 2.4 µs        | -5.2 µs  |      -78 |      2
           8 |        4096 | 19 µs  | 5.2 µs | 870 ns        | -14 µs   |      -73 |     16
           8 |        8192 | 22 µs  | 5.8 µs | 8.2 µs        | -16 µs   |      -73 |      2
           8 |       16384 | 50 µs  | 11 µs  | 14 µs         | -39 µs   |      -78 |      3
          16 |          16 | 310 ns | 130 ns | 57 ns         | -180 ns  |      -59 |      3
          16 |          32 | 460 ns | 160 ns | 63 ns         | -310 ns  |      -66 |      5
          16 |          64 | 820 ns | 210 ns | 130 ns        | -610 ns  |      -74 |      5
          16 |         128 | 1.4 µs | 310 ns | 180 ns        | -1.1 µs  |      -78 |      6
          16 |         256 | 2.8 µs | 510 ns | 150 ns        | -2.3 µs  |      -82 |     15
          16 |         512 | 1.2 µs | 320 ns | 340 ns        | -840 ns  |      -72 |      2
          16 |        1024 | 4.6 µs | 1.2 µs | 980 ns        | -3.4 µs  |      -73 |      3
          16 |        2048 | 7.7 µs | 1.9 µs | 2.4 µs        | -5.8 µs  |      -75 |      2
          16 |        4096 | 15 µs  | 4.1 µs | 4.3 µs        | -11 µs   |      -72 |      3
          16 |        8192 | 14 µs  | 3.6 µs | 480 ns        | -10 µs   |      -74 |     21
          16 |       16384 | 28 µs  | 6.9 µs | 430 ns        | -21 µs   |      -75 |     48
          32 |          32 | 570 ns | 150 ns | 120 ns        | -420 ns  |      -74 |      4
          32 |          64 | 860 ns | 220 ns | 120 ns        | -640 ns  |      -74 |      5
          32 |         128 | 1.4 µs | 350 ns | 160 ns        | -1.1 µs  |      -76 |      7
          32 |         256 | 2.9 µs | 530 ns | 420 ns        | -2.4 µs  |      -82 |      6
          32 |         512 | 2.3 µs | 750 ns | 410 ns        | -1.6 µs  |      -67 |      4
          32 |        1024 | 3.7 µs | 1 µs   | 1 µs          | -2.7 µs  |      -73 |      3
          32 |        2048 | 5.4 µs | 1.6 µs | 2.2 µs        | -3.8 µs  |      -70 |      2
          32 |        4096 | 11 µs  | 1.8 µs | 1.7 µs        | -9.2 µs  |      -83 |      5
          32 |        8192 | 25 µs  | 6.1 µs | 7.5 µs        | -19 µs   |      -76 |      3
          32 |       16384 | 59 µs  | 15 µs  | 17 µs         | -44 µs   |      -74 |      3
          64 |          64 | 830 ns | 230 ns | 150 ns        | -610 ns  |      -73 |      4
          64 |         128 | 1.4 µs | 330 ns | 100 ns        | -1.1 µs  |      -77 |     11
          64 |         256 | 2.9 µs | 520 ns | 170 ns        | -2.3 µs  |      -82 |     14
          64 |         512 | 1.7 µs | 540 ns | 530 ns        | -1.2 µs  |      -69 |      2
          64 |        1024 | 2.7 µs | 770 ns | 580 ns        | -2 µs    |      -72 |      3
          64 |        2048 | 4.3 µs | 1 µs   | 950 ns        | -3.3 µs  |      -77 |      3
          64 |        4096 | 17 µs  | 3.8 µs | 2.5 µs        | -13 µs   |      -77 |      5
          64 |        8192 | 38 µs  | 9.9 µs | 1.3 µs        | -28 µs   |      -74 |     21
          64 |       16384 | 66 µs  | 15 µs  | 10 µs         | -51 µs   |      -77 |      5
         128 |         128 | 1.6 µs | 380 ns | 180 ns        | -1.2 µs  |      -76 |      7
         128 |         256 | 2.6 µs | 530 ns | 100 ns        | -2.1 µs  |      -80 |     20
         128 |         512 | 1.2 µs | 380 ns | 300 ns        | -840 ns  |      -69 |      3
         128 |        1024 | 5.3 µs | 1.3 µs | 1 µs          | -4 µs    |      -75 |      4
         128 |        2048 | 5.5 µs | 980 ns | 960 ns        | -4.5 µs  |      -82 |      5
         128 |        4096 | 13 µs  | 2.8 µs | 3.9 µs        | -10 µs   |      -78 |      3
         128 |        8192 | 18 µs  | 5.9 µs | 5.2 µs        | -12 µs   |      -68 |      2
         128 |       16384 | 59 µs  | 15 µs  | 9.3 µs        | -44 µs   |      -75 |      5
         256 |         256 | 3.3 µs | 590 ns | 540 ns        | -2.7 µs  |      -82 |      5
         256 |         512 | 1.2 µs | 410 ns | 340 ns        | -830 ns  |      -67 |      2
         256 |        1024 | 2.9 µs | 810 ns | 1.1 µs        | -2.1 µs  |      -72 |      2
         256 |        2048 | 9.1 µs | 2.4 µs | 1.7 µs        | -6.7 µs  |      -74 |      4
         256 |        4096 | 13 µs  | 3.8 µs | 3.7 µs        | -9.4 µs  |      -71 |      2
         256 |        8192 | 30 µs  | 8 µs   | 4.7 µs        | -22 µs   |      -73 |      5
         256 |       16384 | 43 µs  | 11 µs  | 17 µs         | -32 µs   |      -74 |      2
         512 |         512 | 6.6 µs | 1 µs   | 790 ns        | -5.6 µs  |      -84 |      7
         512 |        1024 | 4.4 µs | 980 ns | 1.4 µs        | -3.4 µs  |      -78 |      2
         512 |        2048 | 9.6 µs | 2 µs   | 2.2 µs        | -7.6 µs  |      -79 |      4
         512 |        4096 | 9.3 µs | 1.9 µs | 2.1 µs        | -7.4 µs  |      -79 |      3
         512 |        8192 | 27 µs  | 7.9 µs | 7.8 µs        | -19 µs   |      -70 |      2
         512 |       16384 | 60 µs  | 15 µs  | 17 µs         | -44 µs   |      -74 |      3
        1024 |        1024 | 12 µs  | 2 µs   | 960 ns        | -9.9 µs  |      -83 |     10
        1024 |        2048 | 4.7 µs | 1.5 µs | 1.2 µs        | -3.2 µs  |      -67 |      3
        1024 |        4096 | 11 µs  | 3.1 µs | 4.6 µs        | -8.2 µs  |      -73 |      2
        1024 |        8192 | 22 µs  | 5.8 µs | 8.8 µs        | -17 µs   |      -74 |      2
        1024 |       16384 | 35 µs  | 7.4 µs | 7.5 µs        | -28 µs   |      -79 |      4
        2048 |        2048 | 24 µs  | 3.8 µs | 1.5 µs        | -20 µs   |      -84 |     13
        2048 |        4096 | 17 µs  | 4.1 µs | 5 µs          | -13 µs   |      -75 |      3
        2048 |        8192 | 27 µs  | 7.9 µs | 8 µs          | -19 µs   |      -71 |      2
        2048 |       16384 | 45 µs  | 12 µs  | 18 µs         | -33 µs   |      -74 |      2
        4096 |        4096 | 51 µs  | 8.3 µs | 1.7 µs        | -42 µs   |      -84 |     24
        4096 |        8192 | 28 µs  | 7.5 µs | 8.6 µs        | -21 µs   |      -73 |      2
        4096 |       16384 | 28 µs  | 8.4 µs | 1.1 µs        | -19 µs   |      -70 |     17
        8192 |        8192 | 80 µs  | 14 µs  | 1.3 µs        | -66 µs   |      -82 |     49
        8192 |       16384 | 66 µs  | 16 µs  | 20 µs         | -50 µs   |      -76 |      3
       16384 |       16384 | 200 µs | 30 µs  | 2.4 µs        | -170 µs  |      -85 |     71
(136 rows)

Since microbenchmark results are not normally distributed,
the sigmas and pooled_stddev columns unfortunately don't say much at all;
they are just an attempt to give some indication of variance.
Any ideas for better indicators would be appreciated.

Here is the same report for Intel Core i9-14900K:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where
summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
 var1ndigits | var2ndigits | a_avg  | b_avg  | pooled_stddev |  abs_diff  | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+------------+----------+--------
           1 |           1 | 72 ns  | 72 ns  | 6.8 ns        | -1.9e-10 s |        0 |      0
           1 |           2 | 76 ns  | 77 ns  | 8.5 ns        | 1.1 ns     |        1 |      0
           1 |           3 | 87 ns  | 120 ns | 10 ns         | 38 ns      |       43 |      4
           1 |           4 | 98 ns  | 120 ns | 12 ns         | 27 ns      |       27 |      2
           1 |           8 | 340 ns | 130 ns | 19 ns         | -200 ns    |      -60 |     11
           1 |          16 | 500 ns | 160 ns | 34 ns         | -350 ns    |      -69 |     10
           1 |          32 | 850 ns | 200 ns | 81 ns         | -660 ns    |      -77 |      8
           1 |          64 | 1.6 µs | 300 ns | 130 ns        | -1.3 µs    |      -82 |     11
           1 |         128 | 3.2 µs | 520 ns | 180 ns        | -2.7 µs    |      -84 |     15
           1 |         256 | 2.2 µs | 570 ns | 360 ns        | -1.6 µs    |      -74 |      5
           1 |         512 | 4.4 µs | 1.1 µs | 1.2 µs        | -3.3 µs    |      -75 |      3
           1 |        1024 | 4.9 µs | 1 µs   | 1 µs          | -3.9 µs    |      -79 |      4
           1 |        2048 | 15 µs  | 4.2 µs | 4.1 µs        | -11 µs     |      -72 |      3
           1 |        4096 | 42 µs  | 11 µs  | 3.3 µs        | -31 µs     |      -75 |     10
           1 |        8192 | 86 µs  | 22 µs  | 4.5 µs        | -65 µs     |      -75 |     15
           1 |       16384 | 65 µs  | 17 µs  | 4 µs          | -47 µs     |      -73 |     12
           2 |           2 | 78 ns  | 83 ns  | 8.2 ns        | 5.1 ns     |        7 |      1
           2 |           3 | 89 ns  | 120 ns | 12 ns         | 29 ns      |       33 |      3
           2 |           4 | 97 ns  | 130 ns | 13 ns         | 29 ns      |       30 |      2
           2 |           8 | 310 ns | 130 ns | 30 ns         | -180 ns    |      -58 |      6
           2 |          16 | 520 ns | 160 ns | 34 ns         | -360 ns    |      -69 |     11
           2 |          32 | 980 ns | 210 ns | 14 ns         | -780 ns    |      -79 |     57
           2 |          64 | 1.6 µs | 300 ns | 150 ns        | -1.3 µs    |      -81 |      9
           2 |         128 | 3.1 µs | 530 ns | 270 ns        | -2.6 µs    |      -83 |     10
           2 |         256 | 2.8 µs | 710 ns | 150 ns        | -2.1 µs    |      -74 |     14
           2 |         512 | 4.2 µs | 1.1 µs | 720 ns        | -3.1 µs    |      -73 |      4
           2 |        1024 | 6.7 µs | 2.2 µs | 1.5 µs        | -4.5 µs    |      -67 |      3
           2 |        2048 | 7.7 µs | 2 µs   | 690 ns        | -5.7 µs    |      -74 |      8
           2 |        4096 | 20 µs  | 4 µs   | 4 µs          | -16 µs     |      -81 |      4
           2 |        8192 | 65 µs  | 13 µs  | 12 µs         | -52 µs     |      -80 |      4
           2 |       16384 | 100 µs | 26 µs  | 39 µs         | -78 µs     |      -75 |      2
           3 |           3 | 87 ns  | 130 ns | 10 ns         | 39 ns      |       45 |      4
           3 |           4 | 100 ns | 130 ns | 13 ns         | 27 ns      |       27 |      2
           3 |           8 | 330 ns | 120 ns | 21 ns         | -210 ns    |      -63 |     10
           3 |          16 | 470 ns | 160 ns | 38 ns         | -310 ns    |      -66 |      8
           3 |          32 | 910 ns | 190 ns | 72 ns         | -720 ns    |      -79 |     10
           3 |          64 | 1.7 µs | 300 ns | 130 ns        | -1.4 µs    |      -83 |     11
           3 |         128 | 3.2 µs | 510 ns | 250 ns        | -2.7 µs    |      -84 |     11
           3 |         256 | 1.9 µs | 570 ns | 530 ns        | -1.4 µs    |      -71 |      3
           3 |         512 | 3.3 µs | 770 ns | 1.2 µs        | -2.5 µs    |      -77 |      2
           3 |        1024 | 7.6 µs | 2 µs   | 2.2 µs        | -5.6 µs    |      -73 |      3
           3 |        2048 | 12 µs  | 3 µs   | 4.8 µs        | -9.4 µs    |      -76 |      2
           3 |        4096 | 30 µs  | 8.4 µs | 8.7 µs        | -21 µs     |      -72 |      2
           3 |        8192 | 68 µs  | 17 µs  | 19 µs         | -51 µs     |      -75 |      3
           3 |       16384 | 100 µs | 26 µs  | 39 µs         | -76 µs     |      -75 |      2
           4 |           4 | 100 ns | 130 ns | 13 ns         | 28 ns      |       28 |      2
           4 |           8 | 300 ns | 130 ns | 33 ns         | -170 ns    |      -56 |      5
           4 |          16 | 510 ns | 160 ns | 35 ns         | -350 ns    |      -69 |     10
           4 |          32 | 910 ns | 210 ns | 77 ns         | -700 ns    |      -77 |      9
           4 |          64 | 1.7 µs | 310 ns | 97 ns         | -1.4 µs    |      -82 |     14
           4 |         128 | 3.1 µs | 490 ns | 270 ns        | -2.6 µs    |      -84 |     10
           4 |         256 | 1.9 µs | 440 ns | 530 ns        | -1.4 µs    |      -77 |      3
           4 |         512 | 2.8 µs | 800 ns | 730 ns        | -2 µs      |      -71 |      3
           4 |        1024 | 9.4 µs | 2.1 µs | 1.6 µs        | -7.3 µs    |      -77 |      5
           4 |        2048 | 13 µs  | 3 µs   | 4.9 µs        | -9.7 µs    |      -76 |      2
           4 |        4096 | 34 µs  | 8.4 µs | 9.9 µs        | -26 µs     |      -75 |      3
           4 |        8192 | 61 µs  | 17 µs  | 17 µs         | -44 µs     |      -73 |      3
           4 |       16384 | 120 µs | 26 µs  | 34 µs         | -89 µs     |      -77 |      3
           8 |           8 | 310 ns | 120 ns | 33 ns         | -180 ns    |      -59 |      6
           8 |          16 | 540 ns | 170 ns | 30 ns         | -360 ns    |      -68 |     12
           8 |          32 | 930 ns | 200 ns | 75 ns         | -730 ns    |      -79 |     10
           8 |          64 | 1.7 µs | 310 ns | 120 ns        | -1.4 µs    |      -82 |     12
           8 |         128 | 3.3 µs | 510 ns | 270 ns        | -2.7 µs    |      -84 |     10
           8 |         256 | 4.7 µs | 880 ns | 330 ns        | -3.8 µs    |      -81 |     11
           8 |         512 | 3.7 µs | 800 ns | 1 µs          | -2.9 µs    |      -79 |      3
           8 |        1024 | 6.3 µs | 1.5 µs | 2.5 µs        | -4.8 µs    |      -76 |      2
           8 |        2048 | 14 µs  | 3 µs   | 4.2 µs        | -11 µs     |      -79 |      3
           8 |        4096 | 29 µs  | 6.2 µs | 8.6 µs        | -23 µs     |      -79 |      3
           8 |        8192 | 32 µs  | 8.3 µs | 2 µs          | -24 µs     |      -74 |     12
           8 |       16384 | 120 µs | 25 µs  | 34 µs         | -94 µs     |      -79 |      3
          16 |          16 | 490 ns | 160 ns | 55 ns         | -330 ns    |      -67 |      6
          16 |          32 | 1 µs   | 200 ns | 38 ns         | -810 ns    |      -80 |     21
          16 |          64 | 1.6 µs | 310 ns | 130 ns        | -1.3 µs    |      -81 |     10
          16 |         128 | 3.2 µs | 530 ns | 250 ns        | -2.7 µs    |      -83 |     11
          16 |         256 | 6.4 µs | 950 ns | 440 ns        | -5.5 µs    |      -85 |     13
          16 |         512 | 3.1 µs | 780 ns | 1.2 µs        | -2.3 µs    |      -74 |      2
          16 |        1024 | 5.2 µs | 1.5 µs | 1.4 µs        | -3.8 µs    |      -72 |      3
          16 |        2048 | 22 µs  | 5.2 µs | 1.1 µs        | -16 µs     |      -76 |     15
          16 |        4096 | 29 µs  | 8.4 µs | 8.1 µs        | -21 µs     |      -71 |      3
          16 |        8192 | 76 µs  | 17 µs  | 12 µs         | -59 µs     |      -78 |      5
          16 |       16384 | 120 µs | 26 µs  | 12 µs         | -90 µs     |      -77 |      8
          32 |          32 | 970 ns | 220 ns | 98 ns         | -740 ns    |      -77 |      8
          32 |          64 | 1.7 µs | 310 ns | 140 ns        | -1.3 µs    |      -81 |     10
          32 |         128 | 3.3 µs | 520 ns | 280 ns        | -2.8 µs    |      -84 |     10
          32 |         256 | 6.9 µs | 980 ns | 250 ns        | -5.9 µs    |      -86 |     23
          32 |         512 | 3.7 µs | 790 ns | 1.1 µs        | -2.9 µs    |      -79 |      3
          32 |        1024 | 6.2 µs | 1.6 µs | 2.5 µs        | -4.7 µs    |      -75 |      2
          32 |        2048 | 22 µs  | 5.3 µs | 840 ns        | -17 µs     |      -76 |     20
          32 |        4096 | 20 µs  | 4 µs   | 4.3 µs        | -16 µs     |      -80 |      4
          32 |        8192 | 58 µs  | 13 µs  | 17 µs         | -45 µs     |      -78 |      3
          32 |       16384 | 140 µs | 34 µs  | 19 µs         | -100 µs    |      -75 |      5
          64 |          64 | 2 µs   | 340 ns | 74 ns         | -1.6 µs    |      -83 |     22
          64 |         128 | 3.6 µs | 550 ns | 190 ns        | -3.1 µs    |      -85 |     17
          64 |         256 | 6.1 µs | 930 ns | 600 ns        | -5.2 µs    |      -85 |      9
          64 |         512 | 3.3 µs | 790 ns | 1.3 µs        | -2.5 µs    |      -76 |      2
          64 |        1024 | 9.5 µs | 2 µs   | 1.5 µs        | -7.4 µs    |      -79 |      5
          64 |        2048 | 13 µs  | 3.1 µs | 4.9 µs        | -9.4 µs    |      -75 |      2
          64 |        4096 | 39 µs  | 11 µs  | 4.3 µs        | -29 µs     |      -73 |      7
          64 |        8192 | 50 µs  | 12 µs  | 19 µs         | -38 µs     |      -75 |      2
          64 |       16384 | 64 µs  | 17 µs  | 4.5 µs        | -47 µs     |      -73 |     10
         128 |         128 | 3 µs   | 540 ns | 210 ns        | -2.5 µs    |      -82 |     12
         128 |         256 | 6.9 µs | 1 µs   | 390 ns        | -5.9 µs    |      -85 |     15
         128 |         512 | 2.7 µs | 810 ns | 730 ns        | -1.9 µs    |      -70 |      3
         128 |        1024 | 5.1 µs | 1.6 µs | 1.5 µs        | -3.5 µs    |      -68 |      2
         128 |        2048 | 20 µs  | 5.2 µs | 2.4 µs        | -15 µs     |      -74 |      6
         128 |        4096 | 26 µs  | 6.1 µs | 9.8 µs        | -19 µs     |      -76 |      2
         128 |        8192 | 60 µs  | 17 µs  | 17 µs         | -44 µs     |      -72 |      3
         128 |       16384 | 100 µs | 26 µs  | 39 µs         | -77 µs     |      -74 |      2
         256 |         256 | 5.9 µs | 1 µs   | 370 ns        | -4.9 µs    |      -83 |     13
         256 |         512 | 4.2 µs | 850 ns | 1.2 µs        | -3.4 µs    |      -80 |      3
         256 |        1024 | 12 µs  | 2.6 µs | 180 ns        | -9.3 µs    |      -78 |     51
         256 |        2048 | 20 µs  | 5 µs   | 2.4 µs        | -15 µs     |      -75 |      6
         256 |        4096 | 30 µs  | 6.3 µs | 8.8 µs        | -24 µs     |      -79 |      3
         256 |        8192 | 69 µs  | 17 µs  | 19 µs         | -52 µs     |      -75 |      3
         256 |       16384 | 150 µs | 35 µs  | 23 µs         | -120 µs    |      -78 |      5
         512 |         512 | 14 µs  | 2.1 µs | 1.1 µs        | -12 µs     |      -85 |     11
         512 |        1024 | 9.7 µs | 1.8 µs | 1.4 µs        | -7.9 µs    |      -82 |      6
         512 |        2048 | 10 µs  | 2.1 µs | 2.4 µs        | -8.3 µs    |      -79 |      3
         512 |        4096 | 20 µs  | 4.2 µs | 4.5 µs        | -16 µs     |      -79 |      4
         512 |        8192 | 88 µs  | 21 µs  | 4.6 µs        | -67 µs     |      -76 |     15
         512 |       16384 | 140 µs | 34 µs  | 38 µs         | -100 µs    |      -76 |      3
        1024 |        1024 | 25 µs  | 4 µs   | 2.7 µs        | -21 µs     |      -84 |      8
        1024 |        2048 | 16 µs  | 4.2 µs | 5.2 µs        | -12 µs     |      -74 |      2
        1024 |        4096 | 40 µs  | 8.1 µs | 6.7 µs        | -32 µs     |      -80 |      5
        1024 |        8192 | 80 µs  | 17 µs  | 12 µs         | -63 µs     |      -79 |      5
        1024 |       16384 | 120 µs | 35 µs  | 35 µs         | -89 µs     |      -72 |      3
        2048 |        2048 | 55 µs  | 8 µs   | 5.2 µs        | -46 µs     |      -85 |      9
        2048 |        4096 | 28 µs  | 6.2 µs | 6 µs          | -22 µs     |      -78 |      4
        2048 |        8192 | 53 µs  | 13 µs  | 21 µs         | -40 µs     |      -76 |      2
        2048 |       16384 | 100 µs | 26 µs  | 40 µs         | -76 µs     |      -74 |      2
        4096 |        4096 | 110 µs | 16 µs  | 8.8 µs        | -95 µs     |      -86 |     11
        4096 |        8192 | 43 µs  | 13 µs  | 12 µs         | -30 µs     |      -69 |      2
        4096 |       16384 | 150 µs | 36 µs  | 42 µs         | -110 µs    |      -76 |      3
        8192 |        8192 | 210 µs | 32 µs  | 22 µs         | -180 µs    |      -85 |      8
        8192 |       16384 | 160 µs | 34 µs  | 47 µs         | -120 µs    |      -79 |      3
       16384 |       16384 | 370 µs | 61 µs  | 23 µs         | -310 µs    |      -84 |     14
(136 rows)

Out of these, some appear to be slower, but I'm not sure whether they
actually are. It might just be noise, since the sigmas are quite low and,
as said above, the sigmas aren't very scientific since the data can't be
assumed to be normally distributed:

AMD Ryzen 9 7950X3D:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where
summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' and rel_diff > 0
order by 1,2;
 var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+-------+-------+---------------+----------+----------+--------
           1 |           3 | 55 ns | 89 ns | 18 ns         | 33 ns    |       60 |      2
           1 |           4 | 76 ns | 93 ns | 20 ns         | 16 ns    |       22 |      1
           2 |           3 | 54 ns | 93 ns | 16 ns         | 39 ns    |       73 |      2
           2 |           4 | 86 ns | 97 ns | 19 ns         | 10 ns    |       12 |      1
           3 |           3 | 52 ns | 88 ns | 16 ns         | 36 ns    |       69 |      2
           3 |           4 | 78 ns | 97 ns | 24 ns         | 19 ns    |       24 |      1
           4 |           4 | 85 ns | 92 ns | 20 ns         | 6.6 ns   |        8 |      0
(7 rows)


Intel Core i9-14900K:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where
summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' and rel_diff > 0
order by 1,2;
 var1ndigits | var2ndigits | a_avg  | b_avg  | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+----------+----------+--------
           1 |           2 | 76 ns  | 77 ns  | 8.5 ns        | 1.1 ns   |        1 |      0
           1 |           3 | 87 ns  | 120 ns | 10 ns         | 38 ns    |       43 |      4
           1 |           4 | 98 ns  | 120 ns | 12 ns         | 27 ns    |       27 |      2
           2 |           2 | 78 ns  | 83 ns  | 8.2 ns        | 5.1 ns   |        7 |      1
           2 |           3 | 89 ns  | 120 ns | 12 ns         | 29 ns    |       33 |      3
           2 |           4 | 97 ns  | 130 ns | 13 ns         | 29 ns    |       30 |      2
           3 |           3 | 87 ns  | 130 ns | 10 ns         | 39 ns    |       45 |      4
           3 |           4 | 100 ns | 130 ns | 13 ns         | 27 ns    |       27 |      2
           4 |           4 | 100 ns | 130 ns | 13 ns         | 28 ns    |       28 |      2
(9 rows)

Quite similar (var1ndigits,var2ndigits) pairs seem slower on both
CPUs, so maybe it actually is a slowdown.

These benchmark results were obtained by comparing
numeric_div() between HEAD (ff59d5d) and with
v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
applied.

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

This is how many microseconds 512 iterations took
when comparing HEAD vs v1-0001:

AMD Ryzen 9 7950X3D:
{21,31,31,21,21,21,32,34,21,21,22,35,36,35,39,21,35,21,21,30,21,21,21,20,31,36,22,22,37,33} -- HEAD (ff59d5d)
{49,60,32,56,48,55,48,48,32,32,48,48,32,63,48,49,49,56,48,49,33,47,32,47,33,55,55,56,33,32} -- v1-0001

Intel Core i9-14900K:
{45,36,46,45,49,45,46,49,46,46,46,45,49,49,45,33,46,46,36,49,45,49,46,45,46,49,45,49,46,45} -- HEAD (ff59d5d)
{70,63,70,70,70,51,63,45,69,69,63,70,70,69,70,62,69,70,63,70,69,69,63,69,64,70,69,63,64,51} -- v1-0001

n=30 (3 random vars * 10 measurements)

(The Intel is slower than the AMD because the Intel is running at a fixed CPU frequency.)
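
As a rough cross-check of the samples above, here is a minimal Python sketch. It assumes that pooled_stddev in the report is the textbook pooled standard deviation and that sigmas is the mean difference divided by it; neither formula is confirmed by catbench itself. It also prints two indicators that don't rely on normality: the median, and the fraction of (HEAD, patched) pairs in which the patched measurement is slower. The function names are made up for this illustration.

```python
import math
import statistics

ITERATIONS = 512  # each measurement times 512 calls, as stated above

# Microseconds per 512 iterations for (var1ndigits=3, var2ndigits=3),
# copied from the lists above as (HEAD, v1-0001).
samples = {
    "AMD Ryzen 9 7950X3D": (
        [21, 31, 31, 21, 21, 21, 32, 34, 21, 21, 22, 35, 36, 35, 39,
         21, 35, 21, 21, 30, 21, 21, 21, 20, 31, 36, 22, 22, 37, 33],
        [49, 60, 32, 56, 48, 55, 48, 48, 32, 32, 48, 48, 32, 63, 48,
         49, 49, 56, 48, 49, 33, 47, 32, 47, 33, 55, 55, 56, 33, 32],
    ),
    "Intel Core i9-14900K": (
        [45, 36, 46, 45, 49, 45, 46, 49, 46, 46, 46, 45, 49, 49, 45,
         33, 46, 46, 36, 49, 45, 49, 46, 45, 46, 49, 45, 49, 46, 45],
        [70, 63, 70, 70, 70, 51, 63, 45, 69, 69, 63, 70, 70, 69, 70,
         62, 69, 70, 63, 70, 69, 69, 63, 69, 64, 70, 69, 63, 64, 51],
    ),
}


def pooled_stddev(a, b):
    """Textbook pooled standard deviation (assumed, not taken from catbench)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    dof = len(a) + len(b) - 2
    return math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / dof)


def prob_slower(head, patched):
    """Fraction of (head, patched) pairs where patched is slower; ties count half."""
    wins = sum((p > h) + 0.5 * (p == h) for h in head for p in patched)
    return wins / (len(head) * len(patched))


def to_ns(us_per_batch):
    """Convert microseconds per 512-call batch to nanoseconds per call."""
    return us_per_batch * 1000 / ITERATIONS


for cpu, (head, patched) in samples.items():
    diff = statistics.mean(patched) - statistics.mean(head)
    sd = pooled_stddev(head, patched)
    print(cpu)
    print(f"  mean diff {to_ns(diff):.0f} ns, pooled stddev {to_ns(sd):.0f} ns, "
          f"sigmas {diff / sd:.1f}")
    print(f"  median {to_ns(statistics.median(head)):.0f} ns -> "
          f"{to_ns(statistics.median(patched)):.0f} ns, "
          f"P(patched slower) {prob_slower(head, patched):.2f}")
```

With these assumed formulas the samples give a mean difference of about 36 ns and a pooled stddev of about 16 ns on AMD, and about 39 ns and 10 ns on Intel, which is consistent with the (3, 3) rows in the reports. The medians move from 22 to 48 µs per 512 iterations on AMD and from 46 to 69 µs on Intel, and the patched measurement is the slower one in more than 90% of pairs on both CPUs.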

Regards,
Joel



Re: Optimising numeric division

From
"Joel Jacobson"
Date:
On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:
> Since statistical tools that rely on normal distributions can't be used,
> let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
> since that seems to be the biggest slowdown on both CPUs,
> and see if our level of surprise is affected.

Here is a more traditional benchmark,
which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

SELECT setseed(0);
CREATE TABLE t AS
SELECT
    random(111111111111::numeric,999999999999::numeric) AS var1,
    random(111111111111::numeric,999999999999::numeric) AS var2
FROM generate_series(1,1e7);
EXPLAIN ANALYZE SELECT SUM(var1/var2) FROM t;

/*
 * Intel Core i9-14900K
 */

-- HEAD (ff59d5d)
Execution Time: 575.141 ms
Execution Time: 572.179 ms
Execution Time: 571.394 ms

-- v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
Execution Time: 672.983 ms
Execution Time: 603.031 ms
Execution Time: 620.736 ms

/*
 * AMD Ryzen 9 7950X3D
 */

-- HEAD (ff59d5d)
Execution Time: 561.349 ms
Execution Time: 516.365 ms
Execution Time: 510.782 ms

-- v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
Execution Time: 659.049 ms
Execution Time: 607.035 ms
Execution Time: 600.026 ms
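
To put a number on "a bit slower", here is a small Python sketch that compares the fastest of the three runs on each side. Taking the minimum is just one reasonable choice, made here to reduce the influence of warm-up and outlier runs, and the helper name is made up for this illustration.

```python
# Execution times in ms, copied from the EXPLAIN ANALYZE runs above.
runs = {
    "Intel Core i9-14900K": {
        "head": [575.141, 572.179, 571.394],
        "patched": [672.983, 603.031, 620.736],
    },
    "AMD Ryzen 9 7950X3D": {
        "head": [561.349, 516.365, 510.782],
        "patched": [659.049, 607.035, 600.026],
    },
}


def slowdown_pct(head, patched):
    """Percentage increase of the fastest patched run over the fastest HEAD run."""
    return (min(patched) / min(head) - 1) * 100


for cpu, r in runs.items():
    print(f"{cpu}: {slowdown_pct(r['head'], r['patched']):+.1f}%")
```

This prints roughly +5.5% for the Intel and +17.5% for the AMD. Those figures cover the whole query, including the table scan and the SUM, so the slowdown of the division itself is larger than the percentages suggest.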

Regards,
Joel



Re: Optimising numeric division

From
"Joel Jacobson"
Date:
On Sat, Aug 24, 2024, at 01:35, Joel Jacobson wrote:
> On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:
>> Since statistical tools that rely on normal distributions can't be used,
>> let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
>> since that seems to be the biggest slowdown on both CPUs,
>> and see if our level of surprise is affected.
>
> Here is a more traditional benchmark,
> which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

I tested just adding back div_var_int64, and it seems to help.

-- Intel Core i9-14900K:

select summary, x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport
where function_name = 'numeric_div' and summary like 'Add back div_var_int64%' and sigmas > 1 order by x,y;
         summary          | var1ndigits | var2ndigits | a_avg  | b_avg  | pooled_stddev | abs_diff | rel_diff | sigmas
--------------------------+-------------+-------------+--------+--------+---------------+----------+----------+--------
 Add back div_var_int64.  |           1 |           3 | 120 ns | 85 ns  | 11 ns         | -40 ns   |      -32 |      4
 Add back div_var_int64.  |           1 |           4 | 120 ns | 97 ns  | 11 ns         | -28 ns   |      -23 |      3
 Add back div_var_int64.  |           2 |           3 | 120 ns | 89 ns  | 11 ns         | -29 ns   |      -25 |      3
 Add back div_var_int64.  |           2 |           4 | 130 ns | 94 ns  | 14 ns         | -32 ns   |      -25 |      2
 Add back div_var_int64.  |           3 |           3 | 130 ns | 85 ns  | 11 ns         | -41 ns   |      -32 |      4
 Add back div_var_int64.  |           3 |           4 | 130 ns | 99 ns  | 13 ns         | -29 ns   |      -23 |      2
 Add back div_var_int64.  |           4 |           4 | 130 ns | 100 ns | 12 ns         | -28 ns   |      -22 |      2
(7 rows)

Regards,
Joel



Re: Optimising numeric division

From
"Joel Jacobson"
Date:
On Sat, Aug 24, 2024, at 14:10, Dean Rasheed wrote:
> On Sat, 24 Aug 2024 at 08:26, Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Sat, Aug 24, 2024, at 01:35, Joel Jacobson wrote:
>> > On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:
>> >> Since statistical tools that rely on normal distributions can't be used,
>> >> let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
>> >> since that seems to be the biggest slowdown on both CPUs,
>> >> and see if our level of surprise is affected.
>> >
>> > Here is a more traditional benchmark,
>> > which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:
>>
>> I tested just adding back div_var_int64, and it seems to help.
>>
>
> Thanks for testing.
>
> There does appear to be quite a lot of variability between platforms
> over whether or not div_var_int64() is a win for 3 and 4 digit
> divisors. Since this patch is primarily about improving div_var()'s
> long division algorithm, it's probably best for it to not touch that,
> so I've put div_var_int64() back in for now. We could possibly
> investigate whether it can be improved separately.
>
> Looking at your other test results, they seem to confirm my previous
> observation that exact mode is faster than approximate mode for
> var2ndigits <= 12 or so, so I've added code to do that.
>
> I also expanded on the comments for the quotient-correction code a bit.

Nice. LGTM.
I've successfully tested the new patch again on both Intel and AMD.

I've marked it as Ready for Committer.

Regards,
Joel