
From: Dean Rasheed
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Date:
Msg-id: CAEZATCXniGE2pkwsXoV4gKFga9VYO6=JUnA8xpuV+cGYC2RjZA@mail.gmail.com
In response to: Optimize mul_var() for var1ndigits >= 8 ("Joel Jacobson" <joel@compiler.org>)
List: pgsql-hackers
On Sun, 7 Jul 2024 at 20:46, Joel Jacobson <joel@compiler.org> wrote:
>
> This patch adds a mul_var_large() that is dispatched to from mul_var()
> for var1ndigits >= 8, regardless of rscale.
>
> -- var1ndigits == var2ndigits == 16384
> SELECT COUNT(*) FROM n_max WHERE product = var1 * var2;
> Time: 3191.145 ms (00:03.191) -- HEAD
> Time: 1836.404 ms (00:01.836) -- mul_var_large
>

That's pretty nice. For some reason though, this patch seems to
consistently make the numeric_big regression test a bit slower:

ok 224       - numeric_big                               280 ms  [HEAD]
ok 224       - numeric_big                               311 ms  [patch]

though I do get a lot of variability when I run that test.

I think this is related to this code:

+   res_ndigits = var1ndigits + var2ndigits + 1;
+   res_weight = var1->weight + var2->weight + 2 +
+                ((res_ndigits * 2) - (var1->ndigits + var2->ndigits + 1));
+   maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+       MUL_GUARD_DIGITS;
+   res_ndigits = Min(res_ndigits, maxdigits);

In mul_var_large(), var1ndigits, var2ndigits, and res_ndigits are
counts of NBASE^2 digits, whereas maxdigits is a count of NBASE
digits, so it's not legit to compare them like that. I think it's
pretty confusing to use the same variable names as are used elsewhere
for different things.
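
To spell out what I mean: since NBASE is 10000, one base-NBASE^2 digit
holds exactly two base-NBASE digits, so a limit counted in base-NBASE
digits has to be halved (rounding up) before it can be compared against
a base-NBASE^2 digit count. Something like this (illustrative only,
with made-up variable names, not the patch's actual code):

    /*
     * maxdigits is a limit in base-NBASE digits; res_ndigits is a
     * count of base-NBASE^2 digits, so convert before comparing.
     */
    int     maxdigitpairs = (maxdigits + 1) / 2;    /* round up */

    res_ndigits = Min(res_ndigits, maxdigitpairs);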

I also don't like basically duplicating the whole of mul_var() in a
second function.

The fact that this is close to the current speed for numbers with
around 8 digits is encouraging though. My first thought was that if it
could be made just a little faster, maybe it could replace mul_var()
rather than duplicating it.

I had a go at that in the attached v2 patch, which now gives a very
nice speedup when running numeric_big:

ok 224       - numeric_big                               159 ms  [v2 patch]

The v2 patch includes the following additional optimisations:

1). Use unsigned integers, rather than signed integers, as discussed
over in [1].

2). Attempt to fix the formulae incorporating maxdigits mentioned
above. This part really made my brain hurt, and I'm still not quite
sure that I've got it right. In particular, it needs double-checking
to ensure that it's not losing accuracy in the reduced-rscale case.

3). Instead of converting var1 to base NBASE^2 and storing it, just
compute each base-NBASE^2 digit at the point where it's needed, at the
start of the outer loop (sketched after this list).

4). Instead of converting all of var2 to base NBASE^2, just convert
the part that actually contributes to the final result. That can make
a big difference when asked for a reduced-rscale result.

5). Use 32-bit integers instead of 64-bit integers to hold the
converted digits of var2.

6). Merge the final carry-propagation pass with the code to convert
the result back to base NBASE (also sketched below).

7). Mark mul_var_short() as pg_noinline (see below). That seemed to make a
difference in some tests, where this patch seemed to cause the
compiler to generate slightly less efficient code for mul_var_short()
when it was inlined. In any case, it seems preferable to separate the
two, especially if mul_var_short() grows in the future.
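
To make 3)-5) more concrete, the main loop ends up shaped roughly like
this (a sketch with made-up names and the accumulation elided, not the
actual patch code). Each base-NBASE^2 digit of var1 is formed from a
pair of adjacent base-NBASE digits right where it's needed, and the
converted var2 digits fit in uint32, since a base-NBASE^2 digit is
always less than 10^8:

    uint32      var1pair;
    int         i1;

    for (i1 = 0; i1 < var1ndigits; i1 += 2)
    {
        /* form one base-NBASE^2 digit of var1 on the fly */
        var1pair = (uint32) var1digits[i1] * NBASE;
        if (i1 + 1 < var1ndigits)
            var1pair += var1digits[i1 + 1];

        /*
         * ... inner loop: multiply var1pair by each of the (uint32)
         * base-NBASE^2 digits of var2 that contribute to the
         * requested rscale, accumulating into an array of uint64
         * sums ...
         */
    }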
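For 6), the final pass propagates carries from the least significant
end and immediately splits each settled base-NBASE^2 digit into two
base-NBASE digits of the result (again, just a sketch, with any final
carry handling elided):

    uint64      carry = 0;
    int         i;

    for (i = res_ndigits - 1; i >= 0; i--)
    {
        uint64      newdig = dig[i] + carry;

        carry = newdig / (NBASE * NBASE);
        newdig = newdig % (NBASE * NBASE);

        /* one base-NBASE^2 digit becomes two base-NBASE digits */
        res_digits[2 * i + 1] = (NumericDigit) (newdig % NBASE);
        res_digits[2 * i] = (NumericDigit) (newdig / NBASE);
    }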
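And 7) is just a declaration change (assuming mul_var_short()'s
existing signature):

    static pg_noinline void
    mul_var_short(const NumericVar *var1, const NumericVar *var2,
                  NumericVar *result);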

Overall, those optimisations made quite a big difference for large inputs:

-- var1ndigits=16383, var2ndigits=16383
call rate=23.991785  -- HEAD
call rate=35.19989   -- v1 patch
call rate=75.50344   -- v2 patch

which is now a 3.1x speedup relative to HEAD.

For small inputs (above mul_var_short()'s 4-digit threshold), it's
pretty much performance-neutral:

-- var1ndigits=5, var2ndigits=5
call rate=6.045675e+06   -- HEAD
call rate=6.1231815e+06  -- v2 patch

which is pretty much in the region of random noise. It starts to
become a more definite win for anything larger (in either input):

-- var1ndigits=5, var2ndigits=10
call rate=5.437945e+06   -- HEAD
call rate=5.6906255e+06  -- v2 patch

-- var1ndigits=6, var2ndigits=6
call rate=5.748427e+06   -- HEAD
call rate=5.953112e+06   -- v2 patch

-- var1ndigits=7, var2ndigits=7
call rate=5.3638645e+06  -- HEAD
call rate=5.700681e+06   -- v2 patch

What's less clear is whether there are any platforms for which this
makes it significantly slower.

I tried testing with SIMD disabled, which ought not to affect the
small-input cases, but actually seemed to favour the patch slightly:

-- var1ndigits=5, var2ndigits=5 [SIMD disabled]
call rate=6.0269715e+06  -- HEAD
call rate=6.2982245e+06  -- v2 patch

For large inputs, disabling SIMD makes everything much slower, but it
makes the relative difference larger:

-- var1ndigits=16383, var2ndigits=16383 [SIMD disabled]
call rate=8.24175  -- HEAD
call rate=36.3987   -- v2 patch

which is now 4.3x faster with the patch.
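
(For the record, numeric.c contains no explicit SIMD code, so by "SIMD
disabled" I mean building with compiler auto-vectorisation turned off,
e.g. gcc's -fno-tree-vectorize and -fno-tree-slp-vectorize.)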

Then I tried compiling with -m32, and unfortunately this made the
patch slower than HEAD for small inputs:

-- var1ndigits=5, var2ndigits=5 [-m32, SIMD disabled]
call rate=5.052332e+06  -- HEAD
call rate=3.883459e+06  -- v2 patch

-- var1ndigits=6, var2ndigits=6 [-m32, SIMD disabled]
call rate=4.7221405e+06  -- HEAD
call rate=3.7965685e+06  -- v2 patch

-- var1ndigits=7, var2ndigits=7 [-m32, SIMD disabled]
call rate=4.4638375e+06   -- HEAD
call rate=3.39948e+06     -- v2 patch

and it doesn't reach parity until around ndigits=26, which is
disappointing. It does still get much faster for large inputs:

-- var1ndigits=16383, var2ndigits=16383 [-m32, SIMD disabled]
call rate=5.6747904   -- HEAD
call rate=20.104694   -- v2 patch

and it still makes numeric_big much faster:

[-m32, SIMD disabled]
ok 224       - numeric_big                               596 ms  [HEAD]
ok 224       - numeric_big                               294 ms  [v2 patch]

I'm not sure whether compiling with -m32 is a realistic simulation of
systems people are really using. It's tempting to just regard such
systems as legacy, and ignore them, given the other large gains, but
maybe this is not the only case that gets slower.

Another option would be to only use this optimisation on 64-bit
machines, though I think that would make the code pretty messy, and it
would be throwing away the gains for large inputs, which might well
outweigh the losses.
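
The mechanical part of that would be simple enough, e.g. something
like:

    #if SIZEOF_VOID_P >= 8
        /* base-NBASE^2 multiplication path */
    #else
        /* existing base-NBASE multiplication path */
    #endif

(with SIZEOF_VOID_P from pg_config.h); it's keeping two diverging
copies of the multiplication loop that would get messy.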

Thoughts anyone?

Regards,
Dean

[1] https://www.postgresql.org/message-id/19834f2c-4bf1-4e27-85ed-ca5df0f28e03@app.fastmail.com
