Re: Optimize numeric.c mul_var() using the Karatsuba algorithm - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: Optimize numeric.c mul_var() using the Karatsuba algorithm |
Date | |
Msg-id | CAEZATCW9hzVVfoH8-OZL4+-q6PD-ztEkJdxuzC6kBY3f+RhJEA@mail.gmail.com Whole thread Raw |
In response to | Re: Optimize numeric.c mul_var() using the Karatsuba algorithm ("Joel Jacobson" <joel@compiler.org>) |
List | pgsql-hackers |
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson <joel@compiler.org> wrote: > > On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote: > > > But why just split it into two pieces? That will just lead > > to a lot of unnecessary recursion for very unbalanced inputs. Instead, > > why not split the longer input into N roughly equal sized pieces, each > > around the same length as the shorter input, multiplying and adding > > them at the appropriate offsets? > > The approach you're describing is implemented by e.g. CPython > and is called "lopsided" in their code base. It has some different > performance characteristics, compared to the recursive Half-Karatsuba > approach. > > What I didn't like about lopsided is the degenerate case where the > last chunk is much shorter than the var1, for example, if we pretend > we would be doing Karatsuba all the way down to ndigits 2, > and think about the example var1ndigits = 3 and var2ndigits = 10, > then lopsided would do > var1ndigits=3 var2ndigits=3 > var1ndigits=3 var2ndigits=3 > var1ndigits=3 var2ndigits=3 > var1ndigits=3 var2ndigits=1 > > whereas Half-Karatsuba would do > var1ndigits=3 var2ndigits=5 > var1ndigits=3 var2ndigits=5 > > You can find contrary examples too of course where lopsided > is better than Half-Karatsuba, none of them seem substantially better > than the other. Actually, that wasn't quite what I was thinking of. The approach I've seen (I can't remember where) is to split var2 into roughly equal-sized chunks as follows: If var2ndigits >= 2 * var1ndigits, split it into a number chunks where nchunks = var2ndigits / var1ndigits using integer division with truncation. Then call mul_var_chunks() (say), passing it nchunks, to perform the multiplication using that many chunks. mul_var_chunks() would use nchunks to decides on the chunk size according to chunk_size = var2ndigits / nchunks again using integer division with truncation. That has a remainder in the range [0, nchunks), which is divided up between the chunks so that some of them end up being chunk_size + 1 digits. I.e., something like: chunk_remainder = var2ndigits - chunk_size * nchunks for (int start = 0, end = chunk_size; start < var2ndigits; start = end, end = start + chunk_size) { /* Distribute remainder over the first few chunks */ if (chunk_remainder > 0) { end++; chunk_remainder--; } /* Process chunk between "start" and "end" */ } What this does is adapt the shape of the chunks to the region where the Karatsuba algorithm is most efficient. For example, suppose var1ndigits = 1000. Then: For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and chunk_size will be at most (2999/2)+1 = 1500. For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and chunk_size will be at most (3999/3)+1 = 1334. And so on. The result is that all the chunks will fall into that region where var2ndigits / var1ndigits is between 1 and 1.5, for which KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and Karatsuba will operate efficiently all the way down to KARATSUBA_BASE_LIMIT. Regards, Dean
pgsql-hackers by date: