Re: Optimize numeric.c mul_var() using the Karatsuba algorithm - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Date
Msg-id d87545fa-5b98-4474-a1ab-5af1ed78fb4d@app.fastmail.com
In response to Re: Optimize numeric.c mul_var() using the Karatsuba algorithm  (Aaron Altman <aaronaltman@posteo.net>)
Responses Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
List pgsql-hackers
On Tue, Jun 11, 2024, at 19:16, Aaron Altman wrote:
> Hi Joel, thanks for posting this.  Although I have only a cursory
> familiarity with fast multiplication algorithms, I'd like to try and
> give it a review.  To start with, can you help me understand the choice
> of this algorithm versus others like Toom?  If this looks correct on a
> closer view I'll propose it for inclusion. Along the way though I'd like
> to have it explicitly called out whether this is superior in general to
> other choices, better for more realistic use cases, simpler, clearer to
> license or something similar.  It would be nice for future discussions to
> have some context around whether it would make sense to have conditions
> to choose other algorithms as well, or if this one is generally the best
> for what Postgres users are usually doing.
>
> Continuing with code review in any case.  Interested to hear more.

Hi Aaron, thanks for looking at this!

The choice of the best algorithm depends on the factor sizes: the larger the
factors, the more complicated an algorithm can be "afforded".

List of fast multiplication algorithms,
ordered by factor sizes they are suitable for:

- Long multiplication, aka "schoolbook" multiplication.
- Karatsuba
- Toom-3
- Schönhage–Strassen algorithm (Fast Fourier Transform)

The Toom-3 algorithm can be modified to split the smaller and the larger factor
into different numbers of parts. The notation used on Wikipedia is e.g. Toom-2.5,
which I think means splitting the larger factor into three parts and the smaller
into two, while GMP uses Toom32 to mean the same thing.
Personally, I find GMP's notation easier to understand, since the number of
parts can be derived directly from the name.

I experimented with implementing Toom-3 as well, but there was only a marginal
win, and only at very large factor sizes. Since numeric's max ndigits
(number of base-digits) is capped at 32768, I didn't think it was worth
the considerable extra complexity.

The Karatsuba algorithm is the next step up in the hierarchy of fast
multiplication algorithms, and all other bigint libs I've looked at implement
Karatsuba, even if they also implement Toom-3, since Karatsuba beats Toom-3 for
factors that are sufficiently small, while still being large enough for
Karatsuba to beat schoolbook.
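To make the idea concrete, here is the Karatsuba identity illustrated on 32-bit
words rather than on NBASE digit arrays (a minimal sketch for illustration, not
the mul_var() patch): splitting x = x1*2^16 + x0 and y = y1*2^16 + y0, the
product x*y needs only three multiplications instead of four, because the middle
term can be derived as (x1 + x0)*(y1 + y0) - x1*y1 - x0*y0.

```c
#include <assert.h>
#include <stdint.h>

/* Karatsuba on 32-bit words: three multiplications instead of four.
 * On digit arrays the same identity is applied recursively; the extra
 * additions are why it only wins above a size threshold. */
static uint64_t
karatsuba32(uint32_t x, uint32_t y)
{
    uint32_t x1 = x >> 16, x0 = x & 0xFFFF;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFF;

    uint64_t hi  = (uint64_t) x1 * y1;                /* multiplication 1 */
    uint64_t lo  = (uint64_t) x0 * y0;                /* multiplication 2 */
    uint64_t mid = (uint64_t) (x1 + x0) * (y1 + y0)   /* multiplication 3 */
                   - hi - lo;

    return (hi << 32) + (mid << 16) + lo;
}
```

At word size the saved multiplication obviously isn't worth the bookkeeping; the
payoff only appears when each of the three sub-products is itself a large
multi-digit multiplication.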

I was initially surprised by how large the threshold is at which Karatsuba
starts to win over schoolbook.

I think the explanation for why mul_var() stays fast up to such remarkably
large factor sizes could be a combination of several things, such as:

- mul_var() is already heavily optimized, with clever tricks,
  such as deferred carry propagation.

- numeric uses NBASE=10000, while other bigint libs usually use a power of two.
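The deferred-carry trick can be sketched as follows (a simplified illustration,
not the actual numeric.c code, which is more sophisticated about when it
propagates): with NBASE = 10000, each digit product is below 10^8 < 2^27, so a
64-bit column accumulator can absorb a huge number of partial products before
overflow is possible, and carries can be propagated in one pass at the end
instead of inside the inner loop.

```c
#include <assert.h>
#include <stdint.h>

#define NBASE 10000

/* Schoolbook multiplication with fully deferred carry propagation.
 * Inputs are little-endian base-NBASE digit arrays; result must hold
 * alen + blen digits.  Partial products are summed into 64-bit column
 * accumulators; a single carry pass at the end normalizes the digits. */
static void
mul_schoolbook(const int16_t *a, int alen,
               const int16_t *b, int blen,
               int16_t *result)
{
    uint64_t acc[alen + blen];

    for (int i = 0; i < alen + blen; i++)
        acc[i] = 0;

    /* accumulate partial products, no carries yet */
    for (int i = 0; i < alen; i++)
        for (int j = 0; j < blen; j++)
            acc[i + j] += (uint64_t) a[i] * (uint64_t) b[j];

    /* single deferred carry-propagation pass */
    uint64_t carry = 0;
    for (int i = 0; i < alen + blen; i++)
    {
        uint64_t t = acc[i] + carry;
        result[i] = (int16_t) (t % NBASE);
        carry = t / NBASE;
    }
    assert(carry == 0);     /* product fits in alen + blen digits */
}
```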

In the Karatsuba implementation, I tried to keep the KARATSUBA_CONDITION()
quite simple, but it's still way more complex than what most bigint libs use:
they usually just check whether the smaller factor is below some threshold,
and if so, use schoolbook. For instance, this is what Rust's num-bigint does:

    if x.len() <= 32 {
        // Long multiplication
    } else if x.len() * 2 <= y.len() {
        // Half-Karatsuba, for factors with significant length disparity
    } else if x.len() <= 256 {
        // Karatsuba multiplication
    } else {
        // Toom-3 multiplication
    }

Source: https://github.com/rust-num/num-bigint/blob/master/src/biguint/multiplication.rs#L101

Side note: When working on Karatsuba in mul_var(), I looked at some other bigint
implementations, to try to understand their threshold functions.
I noticed that Rust's num-bigint didn't optimise for factors with a significant
length disparity, so I contributed a patch based on my "Half-Karatsuba" idea,
which I came up with while working on mul_var(); it has now been merged:
https://github.com/rust-num/num-bigint/commit/06b61c8138ad8a9959ac54d9773d0a9ebe25b346
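For the curious, the "Half-Karatsuba" idea can be sketched like this
(illustrative code and names only, not the num-bigint or numeric.c
implementation): when x is much shorter than y, split y into a low and a high
half and compute x*y = (x*y_hi)*NBASE^k + x*y_lo, so that each sub-product is
roughly balanced and the best algorithm can then be chosen per sub-product.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBASE 10000

/* Plain schoolbook multiply on little-endian base-NBASE digit arrays;
 * result must hold alen + blen digits. */
static void
mul_basic(const int16_t *a, int alen, const int16_t *b, int blen,
          int16_t *result)
{
    memset(result, 0, (alen + blen) * sizeof(int16_t));
    for (int i = 0; i < alen; i++)
    {
        uint32_t carry = 0;
        for (int j = 0; j < blen; j++)
        {
            uint32_t t = (uint32_t) result[i + j]
                         + (uint32_t) a[i] * (uint32_t) b[j] + carry;
            result[i + j] = (int16_t) (t % NBASE);
            carry = t / NBASE;
        }
        result[i + blen] = (int16_t) carry;
    }
}

/* "Half-Karatsuba" splitting for lopsided factors: split the longer
 * factor y at k digits, multiply each half by x, and combine with a
 * shifted add.  (A real implementation recurses and picks Karatsuba or
 * schoolbook per sub-product; here we just call schoolbook.) */
static void
mul_half_split(const int16_t *x, int xlen, const int16_t *y, int ylen,
               int16_t *result)      /* must hold xlen + ylen digits */
{
    int k = ylen / 2;
    int16_t hi_prod[xlen + (ylen - k)];

    /* low part: x * y_lo lands directly in result[0 .. xlen+k) */
    memset(result, 0, (xlen + ylen) * sizeof(int16_t));
    mul_basic(x, xlen, y, k, result);

    /* high part: x * y_hi, added into result shifted by k digits */
    mul_basic(x, xlen, y + k, ylen - k, hi_prod);
    uint32_t carry = 0;
    for (int i = 0; i < xlen + (ylen - k); i++)
    {
        uint32_t t = (uint32_t) result[k + i] + (uint32_t) hi_prod[i] + carry;
        result[k + i] = (int16_t) (t % NBASE);
        carry = t / NBASE;
    }
    assert(carry == 0);    /* product fits in xlen + ylen digits */
}
```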

In mul_var(), if we don't like the complexity of KARATSUBA_CONDITION(),
we could go for a more traditional threshold approach, i.e. just checking
the smaller factor's size. However, I believe that would come at the expense
of missing out on some performance gains.
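Such a traditional cutoff might look like this (the constant and the names are
purely illustrative, not values from the patch; a real cutoff has to be found
by benchmarking):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threshold policy: use Karatsuba only when the
 * smaller factor has at least this many base-NBASE digits.  384 is a
 * made-up value for illustration. */
#define KARATSUBA_LOW_THRESHOLD 384

static bool
use_karatsuba(int var1ndigits, int var2ndigits)
{
    int smaller = (var1ndigits < var2ndigits) ? var1ndigits : var2ndigits;

    return smaller >= KARATSUBA_LOW_THRESHOLD;
}
```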

I've tried quite hard to find the best KARATSUBA_CONDITION(), but it turned out
to be a really hard problem: the differences between CPU architectures,
combined with wanting a simple expression, mean there is no obvious perfect
threshold function; there will always be a trade-off.

I eventually stopped trying to improve it, settled on the version in the patch,
and thought I'd leave it to the community to give feedback on how much
complexity in the threshold function is warranted. If we just want to check the
smaller factor's size, like Rust does, it's super simple: the threshold can
easily be found just by testing different values. It's when both factor sizes
are inputs to the threshold function that things get complicated.

/Joel


