Optimize numeric.c mul_var() using the Karatsuba algorithm - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Optimize numeric.c mul_var() using the Karatsuba algorithm |
Date | |
Msg-id | 7f95163f-2019-4416-a042-6e2141619e5d@app.fastmail.com Whole thread Raw |
Responses |
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
|
List | pgsql-hackers |
Hi, This patch introduces the Karatsuba algorithm to speed up multiplication operations in numeric.c, where the operands have many digits. It is implemented via a new conditional in mul_var() that determines whether the sizes of the factors are sufficiently large to justify its use. This decision is non-trivial due to its recursive nature, depending on the size and ratio of the factors. Moreover, the optimal threshold varies across different architectures. This benefits all users of mul_var() in numeric.c, such as: numeric_mul() numeric_lcm() numeric_fac() int64_div_fast_to_numeric() mod_var() div_mod_var() sqrt_var() exp_var() ln_var() power_var() The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible of this decision. It is deliberately conservative to prevent performance regressions on the tested architectures while maximizing potential gains and maintaining simplicity. Patches: 1. mul_var-karatsuba.patch Modifies mul_var() to use the Karatsuba-functions for multiplying larger numerical factors. 2. mul_var-karatsuba-benchmark.patch Introduces numeric_mul_karatsuba() and mul_var_karatsuba() alongside the existing numeric_mul() and mul_var() functions. This enables benchmark comparisons between the original multiplication method and the Karatsuba-optimized version. Some benchmark numbers, tested on Intel Core i9-14900K: Helper-function to generate numeric of given ndigits, using the new random(min numeric, max numeric): CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$ SELECT random( ('1000'||repeat('0000',ndigits-1))::numeric, (repeat('9999',ndigits))::numeric ) $$ LANGUAGE sql; Benchmark equal factor sizes, 16384 x 16384 ndigits: SELECT random_ndigits(16384) * random_ndigits(16384) > 0; Time: 33.990 ms Time: 33.961 ms Time: 34.183 ms SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0; Time: 17.621 ms Time: 17.209 ms Time: 16.444 ms Benchmark equal factor sizes, 8192 x 8192 ndigits: SELECT random_ndigits(8192) * random_ndigits(8192) > 0; Time: 12.568 ms Time: 12.563 ms Time: 12.701 ms SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0; Time: 9.919 ms Time: 9.929 ms Time: 9.659 ms To measure smaller factor sizes, \timing doesn't provide enough precision. Below measurements are made using my pg-timeit extension: Benchmark equal factor sizes, 1024 x 1024 ndigits: SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2); 100 µs SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2); 73 µs Benchmark equal factor sizes, 512 x 512 ndigits: SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2); 27 µs SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2); 23 µs Benchmark unequal factor sizes, 2048 x 16384 ndigits: SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2); 3.6 ms SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2); 2.7 ms The KARATSUBA_CONDITION was determined through benchmarking on the following architectures: - Intel Core i9-14900K (desktop) - AMD Ryzen 9 7950X3D (desktop) - Apple M1Max (laptop) - AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU) - AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids) The images depicting the benchmark plots are rather large, so I've refrained from including them as attachments. Instead, I've provided URLs to the benchmarks for direct access: https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png This image shows the best possible performance ratio per architecture, derived using Dynamic Programming. The black line segment shows the manually crafted threshold function, which aims to avoid performance regressions, while capturing the beneficial regions, as a relatively simple threshold function, which has been implemented in both patches as the KARATSUBA_CONDITION macro. https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png This plot displays the actual performance ratio per architecture, measured after applying the mul_var-karatsuba-benchmark.patch. The performance_ratio scale in both plots uses a rainbow scale, where blue is at 1.0 and means no change. The maximum at 4.0 means that the Karatsuba version was four times faster than the traditional mul_var() at that architecture. To make it easier to distinguish performance regressions from, a magenta color scale that goes from pure magenta just below 1.0, to dark at 0.0. I picked magenta for this purpose since it's not part of the rainbow colors. /Joel
Attachment
pgsql-hackers by date: