Thread: Optimize numeric.c mul_var() using the Karatsuba algorithm

Optimize numeric.c mul_var() using the Karatsuba algorithm

From
"Joel Jacobson"
Date:
Hi,

This patch introduces the Karatsuba algorithm to speed up multiplication
operations in numeric.c, where the operands have many digits.

It is implemented via a new conditional in mul_var() that determines whether
the sizes of the factors are sufficiently large to justify its use. This decision
is non-trivial due to its recursive nature, depending on the size and ratio of
the factors. Moreover, the optimal threshold varies across different
architectures.

This benefits all users of mul_var() in numeric.c, such as:
numeric_mul()
numeric_lcm()
numeric_fac()
int64_div_fast_to_numeric()
mod_var()
div_mod_var()
sqrt_var()
exp_var()
ln_var()
power_var()

The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible for this
decision. It is deliberately conservative to prevent performance regressions on
the tested architectures while maximizing potential gains and maintaining
simplicity.

Patches:

1. mul_var-karatsuba.patch
Modifies mul_var() to use the Karatsuba functions
for multiplying larger numeric factors.

2. mul_var-karatsuba-benchmark.patch
Introduces numeric_mul_karatsuba() and mul_var_karatsuba()
alongside the existing numeric_mul() and mul_var() functions.
This enables benchmark comparisons between the original multiplication method
and the Karatsuba-optimized version.

Some benchmark numbers, tested on Intel Core i9-14900K:

Helper function to generate a numeric with a given ndigits,
using the new random(min numeric, max numeric):

CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$
SELECT random(
    ('1000'||repeat('0000',ndigits-1))::numeric,
    (repeat('9999',ndigits))::numeric
)
$$ LANGUAGE sql;

Benchmark equal factor sizes, 16384 x 16384 ndigits:

SELECT random_ndigits(16384) * random_ndigits(16384) > 0;
Time: 33.990 ms
Time: 33.961 ms
Time: 34.183 ms

SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0;
Time: 17.621 ms
Time: 17.209 ms
Time: 16.444 ms

Benchmark equal factor sizes, 8192 x 8192 ndigits:

SELECT random_ndigits(8192) * random_ndigits(8192) > 0;
Time: 12.568 ms
Time: 12.563 ms
Time: 12.701 ms

SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0;
Time: 9.919 ms
Time: 9.929 ms
Time: 9.659 ms

For smaller factor sizes, \timing doesn't provide enough precision.
The measurements below were made using my pg-timeit extension:

Benchmark equal factor sizes, 1024 x 1024 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 100 µs
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 73 µs

Benchmark equal factor sizes, 512 x 512 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 27 µs
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 23 µs

Benchmark unequal factor sizes, 2048 x 16384 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
3.6 ms
SELECT
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
2.7 ms

The KARATSUBA_CONDITION was determined through benchmarking on the following architectures:

- Intel Core i9-14900K (desktop)
- AMD Ryzen 9 7950X3D (desktop)
- Apple M1 Max (laptop)
- AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU)
- AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids)

The images depicting the benchmark plots are rather large, so I've refrained
from including them as attachments. Instead, I've provided URLs to
the benchmarks for direct access:


https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png
This image shows the best possible performance ratio per architecture,
derived using dynamic programming. The black line segment shows the manually
crafted threshold function, implemented in both patches as the
KARATSUBA_CONDITION macro; it aims to avoid performance regressions while
capturing the beneficial regions with a relatively simple expression.


https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png
This plot displays the actual performance ratio per architecture,
measured after applying the mul_var-karatsuba-benchmark.patch.

The performance_ratio scale in both plots uses a rainbow scale,
where blue at 1.0 means no change. The maximum at 4.0
means that the Karatsuba version was four times faster
than the traditional mul_var() on that architecture.

To make performance regressions easier to distinguish, they use a separate
magenta color scale that goes from pure magenta just below 1.0
to dark at 0.0. I picked magenta for this purpose since it's
not part of the rainbow colors.

/Joel

Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Aaron Altman
Date:
Hi Joel, thanks for posting this.  Although I have only a cursory 
familiarity with fast multiplication algorithms, I'd like to try and 
give it a review.  To start with, can you help me understand the choice 
of this algorithm versus others like Toom?  If this looks correct on a 
closer view I'll propose it for inclusion. Along the way though I'd like 
to have it explicitly called out whether this is superior in general to 
other choices, better for more realistic use cases, simpler, clearer to 
license or something similar.  It would be nice for future discussions to 
have some context around whether it would make sense to have conditions 
to choose other algorithms as well, or if this one is generally the best 
for what Postgres users are usually doing.

Continuing with code review in any case.  Interested to hear more.

Regards,

Aaron Altman




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
"Joel Jacobson"
Date:
On Tue, Jun 11, 2024, at 19:16, Aaron Altman wrote:
> Hi Joel, thanks for posting this.  Although I have only a cursory
> familiarity with fast multiplication algorithms, I'd like to try and
> give it a review.  To start with, can you help me understand the choice
> of this algorithm versus others like Toom?  If this looks correct on a
> closer view I'll propose it for inclusion. Along the way though I'd like
> to have it explicitly called out whether this is superior in general to
> other choices, better for more realistic use cases, simpler, clearer to
> license or something similar.  It would be nice for future discussions to
> have some context around whether it would make sense to have conditions
> to choose other algorithms as well, or if this one is generally the best
> for what Postgres users are usually doing.
>
> Continuing with code review in any case.  Interested to hear more.

Hi Aaron, thanks for looking at this!

The choice of best algorithm depends on the factor sizes.

The larger the factor sizes, the more complicated an algorithm can be "afforded".

List of fast multiplication algorithms,
ordered by the factor sizes they are suitable for:

- Long multiplication, aka "schoolbook" multiplication.
- Karatsuba
- Toom-3
- Schönhage–Strassen algorithm (Fast Fourier Transform)

The Toom-3 algorithm can be modified to split the smaller and larger factors
into different numbers of parts. The notation used on Wikipedia is e.g. Toom-2.5,
which I think means splitting the larger factor into three parts and the smaller
into two parts, while GMP uses Toom32 to mean the same thing.
Personally, I think GMP's notation is easier to understand, since the number of
parts can be derived directly from the name.

I experimented with implementing Toom-3 as well, but there was only a marginal
win at very large factor sizes, and since numeric's max ndigits
(number of base digits) is capped at 32768, I didn't think it was worth it,
since it adds quite a lot of complexity.

The Karatsuba algorithm is the next step in the hierarchy of fast multiplication
algorithms, and all the other bigint libs I've looked at implement Karatsuba,
even if they also implement Toom-3, since Karatsuba is faster than Toom-3 for
factors that are sufficiently small, yet still large enough for Karatsuba to be
faster than schoolbook.
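
To illustrate the structure, here is a toy sketch of the textbook Karatsuba
recursion on plain 64-bit integers (just to show the split and the
three-instead-of-four multiplications; the patch of course operates on
NumericVar digit arrays instead):

    #include <stdint.h>

    static uint64_t
    karatsuba_toy(uint64_t x, uint64_t y)
    {
        const uint64_t B = 10000;   /* split base, analogous to NBASE */
        uint64_t x1, x0, y1, y0, z2, z1, z0;

        if (x < B || y < B)
            return x * y;           /* base case: direct multiplication */

        x1 = x / B;                 /* x = x1*B + x0 */
        x0 = x % B;
        y1 = y / B;                 /* y = y1*B + y0 */
        y0 = y % B;

        z2 = karatsuba_toy(x1, y1);
        z0 = karatsuba_toy(x0, y0);
        z1 = karatsuba_toy(x1 + x0, y1 + y0) - z2 - z0;  /* = x1*y0 + x0*y1 */

        /* x*y = z2*B^2 + z1*B + z0 (assumes the product fits in 64 bits) */
        return z2 * B * B + z1 * B + z0;
    }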

I was initially surprised by how large the threshold is at which Karatsuba
starts to be a win over schoolbook.

I think the explanation for why mul_var() stays fast up to quite remarkably high
factor sizes could be a combination of several things, such as:

- mul_var() is already heavily optimized, with clever tricks,
  such as deferred carry propagation.

- numeric uses NBASE=10000, while other bigint libs usually use a power of two.

In the Karatsuba implementation, I tried to keep the KARATSUBA_CONDITION()
quite simple, but it's still way more complex than what most bigint libs use,
which usually just check whether the smaller factor is below some threshold,
and if so, use schoolbook. For instance, this is what Rust's num-bigint does:

    if x.len() <= 32 {
        // Long multiplication
    } else if x.len() * 2 <= y.len() {
        // Half-Karatsuba, for factors with significant length disparity
    } else if x.len() <= 256 {
        // Karatsuba multiplication
    } else {
        // Toom-3 multiplication
    }

Source: https://github.com/rust-num/num-bigint/blob/master/src/biguint/multiplication.rs#L101

Side note: When working on Karatsuba in mul_var(), I looked at some other bigint
implementations, to try to understand their threshold functions.
I noticed that Rust's num-bigint didn't optimise for factors with significant
length disparity, so I contributed a patch based on my "Half-Karatsuba" idea,
which I got while working on mul_var(), and which has now been merged:
https://github.com/rust-num/num-bigint/commit/06b61c8138ad8a9959ac54d9773d0a9ebe25b346

In mul_var(), if we don't like the complexity of KARATSUBA_CONDITION(),
we could go for a more traditional threshold approach, i.e. just checking
the smaller factor size. However, I believe that would be at the expense
of missing out on some performance gains.

I've tried quite hard to find the best KARATSUBA_CONDITION(), but it turned out
to be a really hard problem: the differences between CPU architectures,
combined with wanting a simple expression, mean there is no obvious
perfect threshold function; there will always be a trade-off.

I eventually stopped trying to improve it and settled on the version in
the patch, leaving it up to the community to give feedback on how much
complexity in the threshold function is warranted. If we only want to check
the smaller factor size, like Rust does, then it's super simple: the threshold
can easily be found just by testing different values. It's when both factor
sizes are inputs to the threshold function that it becomes complicated.
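
For comparison, such a minimal variant would look roughly like the following
sketch (the constant is a placeholder to be determined by benchmarking, not a
value from the patch; recall that the caller ensures var1 is the shorter input):

    /* hypothetical single-threshold variant, not the patch's condition */
    #define KARATSUBA_SIMPLE_THRESHOLD  512  /* placeholder, tune by benchmarking */

    #define KARATSUBA_CONDITION_SIMPLE(var1ndigits, var2ndigits) \
        ((var1ndigits) >= KARATSUBA_SIMPLE_THRESHOLD)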

/Joel



Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Aaron Altman
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

This applies cleanly to master, builds and passes regression tests in my Windows/Cygwin environment.  

I also read through comments and confirmed for myself that the assumption about the caller ensuring var1 is shorter is
done already in unchanged code from mul_var.  Frees correspond to inits.  The "Karatsuba condition" reasoning for
deciding whether a number is big enough to use this algorithm appears to match what Joel has stated in this thread.
 

The arithmetic appears to match what's been described in the comments.  I have *not* confirmed that with any detailed
review of the Karatsuba algorithm from outside sources, other implementations like the Rust one referenced here, or
anything similar.  I'm hoping that the regression tests give sufficient coverage that if the arithmetic was incorrect
there would be obvious failures.  If additional coverage was needed, cases falling immediately on either side of the
limiting conditions used in the patch would probably be useful.  From the limited precedent I've exposed myself to, that
doesn't seem to be required here, but I'm open to contrary input from other reviewers.  In the meantime, I'm marking
this approved.
 

Thanks for the detailed background and comments, Joel!

The new status of this patch is: Ready for Committer

Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
"Joel Jacobson"
Date:
On Fri, Jun 14, 2024, at 03:07, Aaron Altman wrote:
> Thanks for the detailed background and comments, Joel!
>
> The new status of this patch is: Ready for Committer

Thanks for reviewing.

Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().

/Joel


Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Michael Paquier
Date:
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().

It's one of these areas where Dean Rasheed would be a good match for a
review, so adding him in CC.  He has been doing a lot of stuff in this
area over the years.

+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737

These numbers feel magic, and there are no explanations behind these
choices, so it is hard to know whether they are good or not, or if
there are potentially "better" choices.  I'd suggest explaining why
these variables are here as well as the basics of the method in this
area of the code, with the function doing the operation pointing at
that, so that all the explanations are in a single place.  Okay, these
are thresholds based on the number of digits to decide if the normal
or Karatsuba's method should be used, but grouping all the
explanations in a single place is simpler.

I may have missed something, but did you run some benchmarks where the
thresholds are at their limit, where we would fall back to the
calculation method of HEAD?  I guess that the answer to my question of
"Is HEAD performing better across these thresholds?" is clearly "no"
based on what I read at [1] and the threshold numbers chosen, but I am
still asking.

[1]: https://en.wikipedia.org/wiki/Karatsuba_algorithm
--
Michael


Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Dean Rasheed
Date:
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
>

Something to watch out for is that not all callers of mul_var() want
an exact result. Several internal callers request an approximate
result by passing it an rscale value less than the sum of the input
dscales. The schoolbook algorithm handles that by computing up to
rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
the new Karatsuba code always computes the full result and then
rounds. That impacts the performance of various functions, for
example:

select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;

Time: 1790.825 ms (00:01.791)  [HEAD]
Time: 2161.244 ms (00:02.161)  [with patch]

Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula
seems to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the
schoolbook algorithm in two steps, by splitting the longer input into
two pieces. But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets? As an example, given inputs with
var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
mul_var_karatsuba_half(), which will then recursively invoke mul_var()
twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
algorithm on each half, which stands no chance of being any faster. On
the other hand, if it divided var2 into 10 chunks of length 1000, it
would invoke the Karatsuba algorithm on each chunk, which would at
least stand a chance of being faster.

Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the
other conditions in KARATSUBA_CONDITION() will pass either, so it'll
just invoke the schoolbook algorithm on each piece, which is bound to
be slower once all the overheads are taken into account. For example,
given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
with var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.

Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
so these tests could be simplified, eliminating some of those magic
constants.

However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort. I think
a better approach would be to have a configurable threshold. Ideally,
this would be just one number, with all other numbers being derived
from it, possibly using some simple heuristic to reduce the effective
threshold for more balanced inputs, for which the Karatsuba algorithm
is more efficient.

Having a configurable threshold would allow people to tune for best
performance on their own platforms, and also it would make it easier
to write tests that hit the new code. As it stands, it's not obvious
how much of the new code is being hit by the existing tests.

Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:

 digits | rate (HEAD)   | rate (patch)  | change
--------+---------------+---------------+--------
     10 | 6.060014e+06  | 6.0189365e+06 | -0.7%
    100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
   1000 | 88640.37      | 90504.82      | +2.1%
   1500 | 39885.23      | 41041.504     | +2.9%
   1600 | 36355.24      | 33368.28      | -8.2%
   2000 | 23308.582     | 23105.932     | -0.9%
   3000 | 10765.185     | 11360.11      | +5.5%
   4000 | 6118.2554     | 6645.4116     | +8.6%
   5000 | 3928.4985     | 4639.914      | +18.1%
  10000 | 1003.80164    | 1431.9335     | +42.7%
  20000 | 255.46135     | 456.23462     | +78.6%
  30000 | 110.69313     | 226.53398     | +104.7%
  40000 | 62.29333      | 148.12916     | +137.8%
  50000 | 39.867493     | 95.16788      | +138.7%
  60000 | 27.7672       | 74.01282      | +166.5%

The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so
presumably the variations below that are just noise, but this does
seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and
I'd probably want it to be something like 500-700.

There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs. That's not currently the case when computing approximate
results with a reduced rscale. That's fixable, but it's not obvious
whether or not the Karatsuba algorithm can actually be made beneficial
when computing such approximate results.

There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.

Regards,
Dean



Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> There's another complication though (if the threshold is made
> configurable): the various numeric functions that use mul_var() are
> immutable, which means that the results from the Karatsuba algorithm
> must match those from the schoolbook algorithm exactly, for all
> inputs.

That seems like an impossible standard to meet.  What we'd probably
have to do is enable Karatsuba only when mul_var is being asked
for an exact (full-precision) result.  This'd complicate the check
condition further, possibly reaching the point where there is a
visible drag on performance in the non-Karatsuba case.

Another possible source of drag: if mul_var is now recursive,
does it need a stack depth check?  If we can prove that the
number of recursion levels is no more than O(log(N)) in the
number of input digits, it's probably safe to skip that ...
but I see no such proof here.  (In general I find this patch
seriously undercommented.)
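
For reference, the guard used elsewhere in the backend for recursive code
is a check_stack_depth() call (declared in miscadmin.h) at the top of each
recursive step, e.g.:

    /* at the top of the recursive function */
    check_stack_depth();    /* errors out cleanly before the stack overflows */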

> There's a wider question as to how many people use such big numeric
> values -- i.e., how many people are actually going to benefit from
> this? I don't have a good feel for that.

I have heard of people doing calculations on bignum integers in
Postgres, but they are very few and far between --- usually that
sort of task is better done in another language.  (No matter how
fast we make mul_var, the general overhead of SQL expressions in
general and type numeric in particular means it's probably not
the right tool for heavy-duty bignum arithmetic.)

There is definitely an argument to be made that this proposal is
not worth the development effort and ongoing maintenance effort
we'd have to sink into it.

            regards, tom lane



Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Dean Rasheed
Date:
On Sat, 29 Jun 2024 at 16:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > There's another complication though (if the threshold is made
> > configurable): the various numeric functions that use mul_var() are
> > immutable, which means that the results from the Karatsuba algorithm
> > must match those from the schoolbook algorithm exactly, for all
> > inputs.
>
> That seems like an impossible standard to meet.  What we'd probably
> have to do is enable Karatsuba only when mul_var is being asked
> for an exact (full-precision) result.

Yeah, using Karatsuba for approximate products is probably a bit too
ambitious. I think it'd be reasonably straightforward to have it
produce the same results for all rscale values. You'd just have to
decide on the required rscale for each sub-product, based on where
it's being added, truncating at various points, and being sure to only
round once at the very end. The problem is that it'd end up computing
a larger fraction of the full product than the schoolbook algorithm
would have done, so the threshold for using Karatsuba would have to be
higher (probably quite a lot higher) and figuring out how that varied
with the requested rscale would be hard.

So using Karatsuba only in the full-precision case seems like a
reasonable restriction. That'd still benefit some other functions like
sqrt() and therefore ln() and pow() to some extent. However,...

> There is definitely an argument to be made that this proposal is
> not worth the development effort and ongoing maintenance effort
> we'd have to sink into it.

I'm leaning more towards this opinion, especially since I think the
patch needs a lot more work to be committable.

Regards,
Dean



Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
"Joel Jacobson"
Date:
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
>> Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
>>
>
> Something to watch out for is that not all callers of mul_var() want
> an exact result. Several internal callers request an approximate
> result by passing it an rscale value less than the sum of the input
> dscales. The schoolbook algorithm handles that by computing up to
> rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
> the new Karatsuba code always computes the full result and then
> rounds. That impacts the performance of various functions, for
> example:
>
> select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;
>
> Time: 1790.825 ms (00:01.791)  [HEAD]
> Time: 2161.244 ms (00:02.161)  [with patch]

Oops. Thanks for spotting this and clarifying.

I read Tom's reply and note that we should only do Karatsuba when
an exact (full-precision) result is requested.

> Looking at mul_var_karatsuba_half(), I don't really like the approach
> it takes. The whole correctness proof using the Karatsuba formula
> seems to somewhat miss the point that this function isn't actually
> implementing the Karatsuba algorithm, it is implementing the
> schoolbook algorithm in two steps, by splitting the longer input into
> two pieces.

The surprising realization here is that there are actually (var1ndigits, var2ndigits)
combinations where *only* doing mul_var_karatsuba_half() recursively
all the way down to schoolbook *is* a performance win,
even though we don't do any mul_var_karatsuba_full().

mul_var_karatsuba_half() *is* actually implementing the exact
Karatsuba formula, it's just taking a shortcut by exploiting the pre-known fact
that splitting `var1` at `m2` would result in `high1` being zero.
This allows provably correct substitutions to be made,
which avoid the meaningless computations.
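
Spelling out the substitutions (with var2 = high2*B + low2 split at m2, and
high1 = 0 since var1 lies entirely below the split point):

    z2 = high1*high2                          = 0
    z0 = low1*low2                            = var1*low2
    z1 = (high1+low1)*(high2+low2) - z2 - z0  = var1*high2
    result = z2*B^2 + z1*B + z0               = var1*high2*B + var1*low2 = var1*var2

so the half variant reduces to two multiplications of var1 against the two
halves of var2, with the zero terms dropped.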

> But why just split it into two pieces? That will just lead
> to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> why not split the longer input into N roughly equal sized pieces, each
> around the same length as the shorter input, multiplying and adding
> them at the appropriate offsets?

The approach you're describing is implemented by e.g. CPython
and is called "lopsided" in their code base. It has somewhat different
performance characteristics compared to the recursive Half-Karatsuba
approach.

What I didn't like about lopsided is the degenerate case where the
last chunk is much shorter than var1. For example, if we pretend
we would be doing Karatsuba all the way down to ndigits 2,
and consider var1ndigits = 3 and var2ndigits = 10,
then lopsided would do:
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=1

whereas Half-Karatsuba would do
var1ndigits=3 var2ndigits=5
var1ndigits=3 var2ndigits=5

You can of course also find contrary examples where lopsided
is better than Half-Karatsuba; neither approach seems substantially better
than the other.

My measurements indicated that Half-Karatsuba was the marginal overall winner
on the architectures I tested, but the two approaches were very similar,
i.e. almost the same number of "wins" and "losses" across different
(var1ndigits, var2ndigits) combinations.

Note that even with lopsided, there will still be recursion, since Karatsuba
is a recursive algorithm, so to satisfy Tom Lane's request about
proving the recursion is limited, we will still need to prove the same
thing for lopsided+Karatsuba.

Here is some old code from my experiments, if we want to evaluate lopsided:

```
static void slice_var(const NumericVar *var, int start, int length,
                      NumericVar *slice);

static void mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
                             NumericVar *result);

/*
 * slice_var() -
 *
 * Extract a slice of a NumericVar starting at a specified position
 * and with a specified length.
 */
static void
slice_var(const NumericVar *var, int start, int length,
          NumericVar *slice)
{
    Assert(start >= 0);
    Assert(start + length <= var->ndigits);

    init_var(slice);

    slice->ndigits = length;
    slice->digits = var->digits + start;
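    /*
     * buf is left NULL: the slice borrows var's digit buffer rather than
     * owning a copy, so free_var() won't try to pfree digits it doesn't own.
     */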
    slice->buf = NULL;
    slice->weight = var->weight - var->ndigits + length;
    slice->sign = var->sign;
    slice->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
}

/*
 * mul_var_lopsided() -
 *
 * Lopsided Multiplication for unequal-length factors.
 *
 * This function handles the case where var1 has significantly fewer digits
 * than var2. In such a scenario, splitting var1 for a balanced multiplication
 * algorithm would be inefficient, as the high part would be zero.
 *
 * To overcome this inefficiency, the function divides factor2 into a series of
 * slices, each containing the same number of digits as var1, and multiplies
 * var1 with each slice one at a time. As a result, the recursive call to
 * mul_var() will have balanced inputs, which improves the performance of
 * divide-and-conquer algorithm, such as the Karatsuba.
 */
static void
mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
                 NumericVar *result)
{
    int            var1ndigits = var1->ndigits;
    int            var2ndigits = var2->ndigits;
    int            processed = 0;
    int            remaining = var2ndigits;
    int            length;
    NumericVar    slice;
    NumericVar    product;
    NumericVar    sum;

    Assert(var1ndigits <= var2ndigits);
    Assert(var1ndigits > MUL_SMALL);
    Assert(var1ndigits * 2 <= var2ndigits);

    init_var(&slice);
    init_var(&product);
    init_var(&sum);

    while (remaining > 0)
    {
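        /*
         * Take the next slice from the least-significant end of var2,
         * multiply it by var1, shift the partial product left by the
         * number of NBASE digits already processed, and accumulate.
         */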
        length = Min(remaining, var1ndigits);
        slice_var(var2, var2ndigits - processed - length, length, &slice);
        mul_var(var1, &slice, &product, var1->dscale + slice.dscale);
        product.weight += processed;
        add_var(&sum, &product, &sum);
        remaining -= length;
        processed += length;
    }

    set_var_from_var(&sum, result);

    free_var(&slice);
    free_var(&product);
    free_var(&sum);
}
```

> As an example, given inputs with
> var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
> mul_var_karatsuba_half(), which will then recursively invoke mul_var()
> twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
> satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
> algorithm on each half, which stands no chance of being any faster. On
> the other hand, if it divided var2 into 10 chunks of length 1000, it
> would invoke the Karatsuba algorithm on each chunk, which would at
> least stand a chance of being faster.

Interesting example!

Indeed only mul_var_karatsuba_half() will be called with the inputs:
var1ndigits=1000 var2ndigits=10000
var1ndigits=1000 var2ndigits=5000
It will never call mul_var_karatsuba_full().

Surprisingly, this still gives a 13% speed-up on an Intel Core i9-14900K.

This performance gain comes from the splitting of the larger factor.

Here is how I benchmarked using pg-timeit [1] and the
mul_var-karatsuba-benchmark.patch [2] from my original post:

To test the patch, you have to edit pg_proc.dat for
numeric_mul_karatsuba and give it a new unique oid.

```
SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b - 1 AS execution_time_difference
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_karatsuba',
   input_values := ARRAY[
      random_ndigits(1000)::TEXT,
      random_ndigits(10000)::TEXT
   ],
   min_time := 1000000,
   timeout := '10 s'
);
-[ RECORD 1 ]-------------+-------------------
execution_time_a          | 976 µs
execution_time_b          | 864 µs
execution_time_difference | 0.1294294200936600
```

The KARATSUBA_CONDITION tries to capture the interesting region of performance
gains, as observed from measurements [3], in an expression that is not too complex.

In the image [3], the purple to black area is performance regressions,
and the rainbow colors are performance gains.

The black colored line segment is the KARATSUBA_CONDITION,
which tries to capture the three performance gain regions,
as defined by:
KARATSUBA_LOW_RANGE_CONDITION
KARATSUBA_MIDDLE_RANGE_CONDITION
KARATSUBA_HIGH_RANGE_CONDITION

> Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
> make a lot of sense. For inputs with var1ndigits between 128 and 2000,
> and var2ndigits > 9000, this condition will pass and it will
> recursively break up the longer input into smaller and smaller pieces
> until eventually that condition no longer passes, but none of the
> other conditions in KARATSUBA_CONDITION() will pass either, so it'll
> just invoke the schoolbook algorithm on each piece, which is bound to
> be slower once all the overheads are taken into account. For example,
> given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
> will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
> with var1ndigits = 200 and var2ndigits = 15000, and then again with
> var1ndigits = 200 and var2ndigits = 7500, at which point
> KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
> implemented as it is, that is bound to happen, because each half will
> end up having var2ndigits between 4500 and 9000, which fails
> KARATSUBA_CONDITION() if var1ndigits < 2000. If
> mul_var_karatsuba_half() was replaced by something that recursed with
> more balanced chunks, then it might make more sense, though allowing
> values of var1ndigits down to 128 doesn't make sense, since the
> Karatsuba algorithm will never be invoked for inputs shorter than 384.

As explained above, mul_var_karatsuba_half() is not meaningless,
even for cases where we never reach mul_var_karatsuba_full().

Regarding the last comment on 128 and 384, I think you're reading the
conditions wrong; note that 128 applies to var1ndigits while 384 applies to var2ndigits:

+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
...
+    ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
...
+     (var1ndigits) > KARATSUBA_VAR1_MIN1)

> Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
> var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
> var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
> so these tests could be simplified, eliminating some of those magic
> constants.

Yes, I realized that myself too, but chose to keep the start and end
boundaries for each of the three ranges. Since it's just boolean logic
with constants, I think the compiler should be smart enough to optimize
away the redundancy, but it might be better to keep the redundant condition
as a comment instead of actual code; I have no strong opinion on what's best.

> However, I really don't like having these magic constants at all,
> because in practice the threshold above which the Karatsuba algorithm
> is a win can vary depending on a number of factors, such as whether
> it's running on 32-bit or 64-bit, whether or not SIMD instructions are
> available, the relative timings of CPU instructions, the compiler
> options used, and probably a bunch of other things. The last time I
> looked at the Java source code, for example, they had separate
> thresholds for 32-bit and 64-bit platforms, and even that's probably
> too crude. Some numeric libraries tune the thresholds for a large
> number of different platforms, but that takes a lot of effort. I think
> a better approach would be to have a configurable threshold. Ideally,
> this would be just one number, with all other numbers being derived
> from it, possibly using some simple heuristic to reduce the effective
> threshold for more balanced inputs, for which the Karatsuba algorithm
> is more efficient.
>
> Having a configurable threshold would allow people to tune for best
> performance on their own platforms, and also it would make it easier
> to write tests that hit the new code. As it stands, it's not obvious
> how much of the new code is being hit by the existing tests.
>
> Doing a quick test on my machine, using random equal-length inputs of
> various sizes, I got the following performance results:
>
>  digits | rate (HEAD)   | rate (patch)  | change
> --------+---------------+---------------+--------
>      10 | 6.060014e+06  | 6.0189365e+06 | -0.7%
>     100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
>    1000 | 88640.37      | 90504.82      | +2.1%
>    1500 | 39885.23      | 41041.504     | +2.9%
>    1600 | 36355.24      | 33368.28      | -8.2%
>    2000 | 23308.582     | 23105.932     | -0.9%
>    3000 | 10765.185     | 11360.11      | +5.5%
>    4000 | 6118.2554     | 6645.4116     | +8.6%
>    5000 | 3928.4985     | 4639.914      | +18.1%
>   10000 | 1003.80164    | 1431.9335     | +42.7%
>   20000 | 255.46135     | 456.23462     | +78.6%
>   30000 | 110.69313     | 226.53398     | +104.7%
>   40000 | 62.29333      | 148.12916     | +137.8%
>   50000 | 39.867493     | 95.16788      | +138.7%
>   60000 | 27.7672       | 74.01282      | +166.5%
>
> The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so
> presumably the variations below that are just noise, but this does
> seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and
> I'd probably want it to be something like 500-700.

I've tried hard to reduce the magic part of these constants,
by benchmarking on numerous architectures, and picking them
manually by making a balanced judgement about what complexity
could possibly be acceptable for the threshold function,
and what performance gains are important to try to capture.

I think this approach is actually less magical than the hard-coded
single-value constants I've seen in many other numeric libraries,
where it's not clear at all what the full two-dimensional performance
picture looks like.

I considered whether my initial post on this should propose a patch with a
simple threshold function that just checks if var1ndigits is larger
than some constant, like many other numeric libraries do.
However, I decided I should at least try to do something smarter,
since it seemed possible.

> There's another complication though (if the threshold is made
> configurable): the various numeric functions that use mul_var() are
> immutable, which means that the results from the Karatsuba algorithm
> must match those from the schoolbook algorithm exactly, for all
> inputs. That's not currently the case when computing approximate
> results with a reduced rscale. That's fixable, but it's not obvious
> whether or not the Karatsuba algorithm can actually be made beneficial
> when computing such approximate results.

I read Tom's reply on this part, and understand we can only do Karatsuba
if full rscale is desired.

> There's a wider question as to how many people use such big numeric
> values -- i.e., how many people are actually going to benefit from
> this? I don't have a good feel for that.

Personally, I started working on this because I wanted to use numeric
to implement the ECDSA verify algorithm in PL/pgSQL, to avoid a
dependency on a C extension that I discovered could segfault.

Unfortunately, Karatsuba didn't help much for this particular case,
since the ECDSA factors are not big enough.
I ended up implementing ECDSA verify as a pgrx extension instead.

However, I can imagine other crypto algorithms might require larger factors,
as well as scientific research use cases, such as astronomy and physics,
that could require storing numeric values of very high precision.

Toom-3 is probably overkill, since we "only" support up to 32768 base digits,
but I think we should at least consider optimizing for the range of numeric values
that are supported by numeric.

Regards,
Joel

[1] https://github.com/joelonsql/pg-timeit
[2] https://www.postgresql.org/message-id/attachment/159528/mul_var-karatsuba-benchmark.patch
[3]
https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png



Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
Alvaro Herrera
Date:
On 2024-Jun-30, Joel Jacobson wrote:

> However, I can imagine other crypto algorithms might require larger factors,
> as well as other scientific research use cases, such as astronomy and physics,
> that could desire storage of numeric values of very high precision.

FWIW I was talking to some people with astronomical databases, and for
reasons that seem pretty relatable they prefer to do all numerical
analysis outside of the database, and keep the database server only for
the data storage and retrieval parts.  It seems a really hard sell that
they would try to do that analysis in the database, because the
parallelizability aspect is completely different for the analysis part
than for the database part -- I mean, they can easily add more servers
to do photography analysis, but trying to add database servers (to cope
with additional load from doing the analysis there) is a much harder
sell.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")



Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From
"Joel Jacobson"
Date:
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> There's another complication though (if the threshold is made
>> configurable): the various numeric functions that use mul_var() are
>> immutable, which means that the results from the Karatsuba algorithm
>> must match those from the schoolbook algorithm exactly, for all
>> inputs.
>
> That seems like an impossible standard to meet.  What we'd probably
> have to do is enable Karatsuba only when mul_var is being asked
> for an exact (full-precision) result.  This'd complicate the check
> condition further, possibly reaching the point where there is a
> visible drag on performance in the non-Karatsuba case.

OK, noted that we should only do Karatsuba when mul_var is
being asked for an exact (full-precision) result.
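
Presumably that gate would be something along the lines of the following
sketch (not from the patch), treating "exact" as the requested rscale
covering the sum of the input dscales:

    /* sketch only: consider Karatsuba just for exact, full-precision products */
    if (rscale >= var1->dscale + var2->dscale &&
        KARATSUBA_CONDITION(var1ndigits, var2ndigits))
    {
        /* Karatsuba path */
    }
    else
    {
        /* existing schoolbook path, computing only rscale + MUL_GUARD_DIGITS */
    }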

> Another possible source of drag: if mul_var is now recursive,
> does it need a stack depth check?  If we can prove that the
> number of recursion levels is no more than O(log(N)) in the
> number of input digits, it's probably safe to skip that ...
> but I see no such proof here.

> (In general I find this patch seriously undercommented.)

Yes, the #defines around KARATSUBA_CONDITION need
to be documented, but I felt this region of the patch might evolve
and needed some discussion first about whether this level of complexity is
acceptable.

However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already; what do you think?

>> There's a wider question as to how many people use such big numeric
>> values -- i.e., how many people are actually going to benefit from
>> this? I don't have a good feel for that.
>
> I have heard of people doing calculations on bignum integers in
> Postgres, but they are very few and far between --- usually that
> sort of task is better done in another language.  (No matter how
> fast we make mul_var, the general overhead of SQL expressions in
> general and type numeric in particular means it's probably not
> the right tool for heavy-duty bignum arithmetic.)
>
> There is definitely an argument to be made that this proposal is
> not worth the development effort and ongoing maintenance effort
> we'd have to sink into it.

It's a good question. I'm not sure what I think; maybe the status quo is best,
but it feels like at least something could be done.

The patch basically consists of three parts:

- Threshold function KARATSUBA_CONDITION
This is a bit unorthodox, since it's more ambitious than what many popular
numeric libraries do. I've only found GMP to be even more complex.

- mul_var_karatsuba_half()
Since it's provably correct using simple school-grade substitutions,
I don't think this function is unorthodox, and it shouldn't need to change
unless we change the definition of NumericVar in the future.

- mul_var_karatsuba()
This follows the canonical pseudo-code at Wikipedia for the Karatsuba
algorithm precisely, so nothing unorthodox here either.

So, I think the KARATSUBA_CONDITION requires more development and maintenance
effort than the rest of the patch, since it's based on measurements
on numerous architectures, which will be different in the future.

The rest of the patch, split_var_at(), mul_var_karatsuba_full(),
mul_var_karatsuba_half(), should require much less effort to maintain,
since those functions should remain the same,
even when we need to support new architectures.

I'm eager to hear your thoughts after you've also read my reply to Dean from moments ago.

Regards,
Joel