Thread: Some optimisations for numeric division

Some optimisations for numeric division

From
Dean Rasheed
Date:
Attached are 3 small patches that improve the performance of numeric
division. Patch 0002 seems to have the biggest impact, but I think
they're all worth including, since they're quite simple changes, with
noticeable performance gains.


Patch 0001 vectorises the inner loop of div_var_fast(). This loop is
almost identical to the inner loop of mul_var(), which was vectorised
by commit 8870917623. The thing preventing the div_var_fast() loop
from being vectorised is the assignment to div[qi + i], and simply
replacing that with div_qi[i] where div_qi = &div[qi], in the same
style as mul_var(), fixes that.

One difference between this and the mul_var() code is that it is also
necessary to add an explicit cast to get the compiler to generate
matching output, and give the best results. This is because in
mul_var() the compiler recognises that var1digit is actually 16-bit,
rather than 32-bit, and so it doesn't need to multiply the high part,
but in div_var_fast() it's not obvious to the compiler that qdigit
also fits in 16 bits, hence the cast.

The actual performance gain is typically quite modest, since
div_var_fast() is always only a part of some more complex numeric
computation, but it can be seen in cases like raising numbers to
negative integer powers, e.g.:

CREATE TEMP TABLE num_test(x numeric);
SELECT setseed(0);
INSERT INTO num_test
  SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, '')||x)::numeric
          FROM generate_series(1,100))
  FROM generate_series(1,100000) g(x);

SELECT sum(x^(-2)) FROM num_test;

Time: 112.949 ms  (HEAD)
Time: 98.537 ms   (with patch)


Patch 0002 simplifies the inner loop of div_var() (the guts of the
public-facing numeric division operator) by more closely combining the
multiplication and subtraction operations and folding the separate
"carry" and "borrow" variables into a single "borrow", as suggested by
the old code comment.

IMO, this makes the code easier to understand, as well as giving more
significant performance gains:

CREATE TEMP TABLE div_test(x numeric);
SELECT setseed(0);
INSERT INTO div_test
  SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, ''))::numeric + x
          FROM generate_series(1,50))
  FROM generate_series(1,1000) g(x);

SELECT sum(x/y) FROM div_test t1(x), div_test t2(y);

Time: 1477.340 ms  (HEAD)
Time: 826.748 ms   (with patch)


Patch 0003 replaces some uses of div_var_fast() with div_var().
Specifically, when the divisor has just one or two base-NBASE digits,
div_var() is faster. This is especially true for 1-digit divisors, due
to the "fast path" code in div_var() to handle that. It's also just
about true for 2-digit divisors, and it occurs to me that that case
could potentially be optimised further with similar fast path code in
div_var(), but I haven't tried that yet.

One-digit divisors are quite common. For example, in the Taylor series
expansions in exp_var() and ln_var(), since the number of Taylor
series terms never exceeds a few hundred in practice. Also,
exp_var()'s argument reduction step divides by 2^ndiv2, which is
roughly 100 times the input, rounded up to a power of two, and valid
inputs are less than 6000, so this will always be one or two digits.

Testing this with a bunch of random exp() and ln() computations I saw
a speed-up of 15-20%, and it reduced the run time of the numeric-big
regression test by around 10%, which seems worth having.

Regards,
Dean

Attachment

Re: Some optimisations for numeric division

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Attached are 3 small patches that improve the performance of numeric
> division. Patch 0002 seems to have the biggest impact, but I think
> they're all worth including, since they're quite simple changes, with
> noticeable performance gains.

I took a quick look through these (just eyeball, didn't try to verify
your performance statements).  I'm +1 on 0001 and 0002, but 0003 feels
a bit ad-hoc.  It certainly *looks* weird for the allegedly faster
function to be handing off to the allegedly slower one.  I also wonder
if we're leaving anything on the table by not exploiting
div_var_fast's weaker roundoff guarantees in this case.  Should we
think about a more thoroughgoing redesign of these functions' APIs?
Another idea is to only worry about the single-divisor-digit
optimization, and just copy div_var's (very small) inner loop for that
case into div_var_fast.

            regards, tom lane



Re: Some optimisations for numeric division

From
Dean Rasheed
Date:
On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I took a quick look through these (just eyeball, didn't try to verify
> your performance statements).

Thanks for looking!

>  I'm +1 on 0001 and 0002, but 0003 feels
> a bit ad-hoc.  It certainly *looks* weird for the allegedly faster
> function to be handing off to the allegedly slower one.  I also wonder
> if we're leaving anything on the table by not exploiting
> div_var_fast's weaker roundoff guarantees in this case.  Should we
> think about a more thoroughgoing redesign of these functions' APIs?

Hmm, I'm not sure what kind of thing you had in mind.

One thought that occurred to me was that it's a bit silly that
exp_var() and ln_var() have to use a NumericVar for what could just be
an int, if we had a div_var_int() function that could divide by an
int. Then both div_var() and div_var_fast() could hand off to it for
one and two digit divisors.

Regards,
Dean



Re: Some optimisations for numeric division

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm +1 on 0001 and 0002, but 0003 feels
>> a bit ad-hoc.  It certainly *looks* weird for the allegedly faster
>> function to be handing off to the allegedly slower one.  I also wonder
>> if we're leaving anything on the table by not exploiting
>> div_var_fast's weaker roundoff guarantees in this case.  Should we
>> think about a more thoroughgoing redesign of these functions' APIs?

> Hmm, I'm not sure what kind of thing you had in mind.

I'm not either, tbh.  Just seems like this needs more than some
hacking around the margins.

> One thought that occurred to me was that it's a bit silly that
> exp_var() and ln_var() have to use a NumericVar for what could just be
> an int, if we had a div_var_int() function that could divide by an
> int. Then both div_var() and div_var_fast() could hand off to it for
> one and two digit divisors.

Oooh, that seems like a good idea.

            regards, tom lane



Re: Some optimisations for numeric division

From
Dean Rasheed
Date:
On Wed, 23 Feb 2022 at 22:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>
> > One thought that occurred to me was that it's a bit silly that
> > exp_var() and ln_var() have to use a NumericVar for what could just be
> > an int, if we had a div_var_int() function that could divide by an
> > int. Then both div_var() and div_var_fast() could hand off to it for
> > one and two digit divisors.
>
> Oooh, that seems like a good idea.
>

OK, I've replaced the 0003 patch with one that does that instead. The
div_var_int() API is slightly different in that it also accepts a
divisor weight argument, but the alternative would have been for the
callers to have to adjust both the result weight and rscale, which
would have been uglier.

There's a large block of code in div_var() that needs re-indenting,
but I think it would be better to leave that to a later pgindent run.

The performance results are quite pleasing. It's slightly faster than
the old one-digit div_var() code because it manages to avoid some
digit array copying, and for two digit divisors it's much faster:

CREATE TEMP TABLE div_test(x numeric, y numeric);
SELECT setseed(0);
INSERT INTO div_test
  SELECT (SELECT (((x%9)+1)||string_agg((random()*9)::int::text, ''))::numeric
          FROM generate_series(1,50)),
         (SELECT ('1.'||string_agg((random()*9)::int::text,
'')||(x%10)||'e3')::numeric
          FROM generate_series(1,6))
  FROM generate_series(1,5000) g(x);
select * from div_test limit 10;

SELECT sum(t1.x/t2.y) FROM div_test t1, div_test t2;

Time: 11600.034 ms  (HEAD)
Time: 9890.788 ms   (with 0002)
Time: 6202.851 ms   (with 0003)

And obviously it'll be a larger relative gain for div_var_fast(),
since that was slower to begin with in such cases.

This makes me think that it might also be worthwhile to follow this
with a similar div_var_int64() function on platforms with 128-bit
integers, which could then be used to handle 3- and 4-digit divisors,
which are probably quite common in practice.

Attached is the updated patch series (0001 and 0002 unchanged).

Regards,
Dean

Attachment

Re: Some optimisations for numeric division

From
Dean Rasheed
Date:
On Fri, 25 Feb 2022 at 10:45, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Attached is the updated patch series (0001 and 0002 unchanged).
>

And another update following feedback from the cfbot.

Regards,
Dean

Attachment

Re: Some optimisations for numeric division

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> And another update following feedback from the cfbot.

This patchset LGTM.  On my machine there doesn't seem to be any
measurable performance change for the numeric regression test,
but numeric_big gets about 15% faster.

The only additional thought I have, based on your comments about
0001, is that maybe in mul_var we should declare var1digit as
being NumericDigit not int.  I tried that here and didn't see
any material change in the generated assembly code (using gcc
8.5.0), so you're right that modern gcc doesn't need any help
there; but I wonder if it could help on other compiler versions.

I'll mark this RFC.

            regards, tom lane



Re: Some optimisations for numeric division

From
Dean Rasheed
Date:
On Fri, 25 Feb 2022 at 18:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > And another update following feedback from the cfbot.
>
> This patchset LGTM.  On my machine there doesn't seem to be any
> measurable performance change for the numeric regression test,
> but numeric_big gets about 15% faster.
>

Yes, that matches my observations. Thanks for reviewing and testing.

> The only additional thought I have, based on your comments about
> 0001, is that maybe in mul_var we should declare var1digit as
> being NumericDigit not int.  I tried that here and didn't see
> any material change in the generated assembly code (using gcc
> 8.5.0), so you're right that modern gcc doesn't need any help
> there; but I wonder if it could help on other compiler versions.
>

Yes, that makes sense. Done that way.

Regards,
Dean