Optimize numeric.c mul_var() using the Karatsuba algorithm - Mailing list pgsql-hackers

From Joel Jacobson
Subject Optimize numeric.c mul_var() using the Karatsuba algorithm
Date
Msg-id 7f95163f-2019-4416-a042-6e2141619e5d@app.fastmail.com
Whole thread Raw
Responses Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
List pgsql-hackers
Hi,

This patch introduces the Karatsuba algorithm to speed up multiplication
operations in numeric.c, where the operands have many digits.

It is implemented via a new conditional in mul_var() that determines whether
the sizes of the factors are sufficiently large to justify its use. This decision
is non-trivial due to its recursive nature, depending on the size and ratio of
the factors. Moreover, the optimal threshold varies across different
architectures.

This benefits all users of mul_var() in numeric.c, such as:
numeric_mul()
numeric_lcm()
numeric_fac()
int64_div_fast_to_numeric()
mod_var()
div_mod_var()
sqrt_var()
exp_var()
ln_var()
power_var()

The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible of this
decision. It is deliberately conservative to prevent performance regressions on
the tested architectures while maximizing potential gains and maintaining
simplicity.

Patches:

1. mul_var-karatsuba.patch
Modifies mul_var() to use the Karatsuba-functions
for multiplying larger numerical factors.

2. mul_var-karatsuba-benchmark.patch
Introduces numeric_mul_karatsuba() and mul_var_karatsuba()
alongside the existing numeric_mul() and mul_var() functions.
This enables benchmark comparisons between the original multiplication method
and the Karatsuba-optimized version.

Some benchmark numbers, tested on Intel Core i9-14900K:

Helper-function to generate numeric of given ndigits,
using the new random(min numeric, max numeric):

CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$
SELECT random(
    ('1000'||repeat('0000',ndigits-1))::numeric,
    (repeat('9999',ndigits))::numeric
)
$$ LANGUAGE sql;

Benchmark equal factor sizes, 16384 x 16384 ndigits:

SELECT random_ndigits(16384) * random_ndigits(16384) > 0;
Time: 33.990 ms
Time: 33.961 ms
Time: 34.183 ms

SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0;
Time: 17.621 ms
Time: 17.209 ms
Time: 16.444 ms

Benchmark equal factor sizes, 8192 x 8192 ndigits:

SELECT random_ndigits(8192) * random_ndigits(8192) > 0;
Time: 12.568 ms
Time: 12.563 ms
Time: 12.701 ms

SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0;
Time: 9.919 ms
Time: 9.929 ms
Time: 9.659 ms

To measure smaller factor sizes, \timing doesn't provide enough precision.
Below measurements are made using my pg-timeit extension:

Benchmark equal factor sizes, 1024 x 1024 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 100 µs
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 73 µs

Benchmark equal factor sizes, 512 x 512 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 27 µs
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 23 µs

Benchmark unequal factor sizes, 2048 x 16384 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
3.6 ms
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
2.7 ms

The KARATSUBA_CONDITION was determined through benchmarking on the following architectures:

- Intel Core i9-14900K (desktop)
- AMD Ryzen 9 7950X3D (desktop)
- Apple M1Max (laptop)
- AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU)
- AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids)

The images depicting the benchmark plots are rather large, so I've refrained
from including them as attachments. Instead, I've provided URLs to
the benchmarks for direct access:


https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png
This image shows the best possible performance ratio per architecture,
derived using Dynamic Programming. The black line segment shows the manually crafted
threshold function, which aims to avoid performance regressions, while capturing
the beneficial regions, as a relatively simple threshold function,
which has been implemented in both patches as the KARATSUBA_CONDITION macro.


https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png
This plot displays the actual performance ratio per architecture,
measured after applying the mul_var-karatsuba-benchmark.patch.

The performance_ratio scale in both plots uses a rainbow scale,
where blue is at 1.0 and means no change. The maximum at 4.0
means that the Karatsuba version was four times faster
than the traditional mul_var() at that architecture.

To make it easier to distinguish performance regressions from,
a magenta color scale that goes from pure magenta just below 1.0,
to dark at 0.0. I picked magenta for this purpose since it's
not part of the rainbow colors.

/Joel
Attachment

pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Things I don't like about \du's "Attributes" column
Next
From: Tom Lane
Date:
Subject: Bugs in ecpg's macro mechanism