From: Dean Rasheed
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Msg-id: CAEZATCXniGE2pkwsXoV4gKFga9VYO6=JUnA8xpuV+cGYC2RjZA@mail.gmail.com
In response to: Optimize mul_var() for var1ndigits >= 8 ("Joel Jacobson" <joel@compiler.org>)
Responses: Re: Optimize mul_var() for var1ndigits >= 8
List: pgsql-hackers
On Sun, 7 Jul 2024 at 20:46, Joel Jacobson <joel@compiler.org> wrote:
>
> This patch adds a mul_var_large() that is dispatched to from mul_var()
> for var1ndigits >= 8, regardless of rscale.
>
> -- var1ndigits == var2ndigits == 16384
> SELECT COUNT(*) FROM n_max WHERE product = var1 * var2;
> Time: 3191.145 ms (00:03.191) -- HEAD
> Time: 1836.404 ms (00:01.836) -- mul_var_large

That's pretty nice. For some reason though, this patch seems to
consistently make the numeric_big regression test a bit slower:

ok 224 - numeric_big 280 ms [HEAD]
ok 224 - numeric_big 311 ms [patch]

though I do get a lot of variability when I run that test.

I think this is related to this code:

+	res_ndigits = var1ndigits + var2ndigits + 1;
+	res_weight = var1->weight + var2->weight + 2 +
+		((res_ndigits * 2) - (var1->ndigits + var2->ndigits + 1));
+	maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+		MUL_GUARD_DIGITS;
+	res_ndigits = Min(res_ndigits, maxdigits);

In mul_var_large(), var1ndigits, var2ndigits, and res_ndigits are
counts of NBASE^2 digits, whereas maxdigits is a count of NBASE
digits, so it's not legit to compare them like that. I think it's
pretty confusing to use the same variable names as are used elsewhere
for different things.

I also don't like basically duplicating the whole of mul_var() in a
second function. The fact that this is close to the current speed for
numbers with around 8 digits is encouraging though. My first thought
was that if it could be made just a little faster, maybe it could
replace mul_var() rather than duplicating it.

I had a go at that in the attached v2 patch, which now gives a very
nice speedup when running numeric_big:

ok 224 - numeric_big 159 ms [v2 patch]

The v2 patch includes the following additional optimisations:

1). Use unsigned integers, rather than signed integers, as discussed
over in [1].

2). Attempt to fix the formulae incorporating maxdigits mentioned above.
This part really made my brain hurt, and I'm still not quite sure
that I've got it right. In particular, it needs double-checking to
ensure that it's not losing accuracy in the reduced-rscale case.

3). Instead of converting var1 to base NBASE^2 and storing it, just
compute each base-NBASE^2 digit at the point where it's needed, at
the start of the outer loop.

4). Instead of converting all of var2 to base NBASE^2, just convert
the part that actually contributes to the final result. That can make
a big difference when asked for a reduced-rscale result.

5). Use 32-bit integers instead of 64-bit integers to hold the
converted digits of var2.

6). Merge the final carry-propagation pass with the code to convert
the result back to base NBASE.

7). Mark mul_var_short() as pg_noinline. That seemed to make a
difference in some tests, where this patch seemed to cause the
compiler to generate slightly less efficient code for mul_var_short()
when it was inlined. In any case, it seems preferable to separate the
two, especially if mul_var_short() grows in the future.

Overall, those optimisations made quite a big difference for large inputs:

-- var1ndigits1=16383, var2ndigits2=16383
call rate=23.991785 -- HEAD
call rate=35.19989  -- v1 patch
call rate=75.50344  -- v2 patch

which is now a 3.1x speedup relative to HEAD.

For small inputs (above mul_var_short()'s 4-digit threshold), it's
pretty much performance-neutral:

-- var1ndigits1=5, var2ndigits2=5
call rate=6.045675e+06  -- HEAD
call rate=6.1231815e+06 -- v2 patch

which is pretty much in the region of random noise.
It starts to become a more definite win for anything larger (in
either input):

-- var1ndigits1=5, var2ndigits2=10
call rate=5.437945e+06  -- HEAD
call rate=5.6906255e+06 -- v2 patch

-- var1ndigits1=6, var2ndigits2=6
call rate=5.748427e+06 -- HEAD
call rate=5.953112e+06 -- v2 patch

-- var1ndigits1=7, var2ndigits2=7
call rate=5.3638645e+06 -- HEAD
call rate=5.700681e+06  -- v2 patch

What's less clear is whether there are any platforms for which this
makes it significantly slower. I tried testing with SIMD disabled,
which ought not to affect the small-input cases, but that actually
seemed to favour the patch slightly:

-- var1ndigits1=5, var2ndigits2=5 [SIMD disabled]
call rate=6.0269715e+06 -- HEAD
call rate=6.2982245e+06 -- v2 patch

For large inputs, disabling SIMD makes everything much slower, but it
made the relative difference larger:

-- var1ndigits1=16383, var2ndigits2=16383 [SIMD disabled]
call rate=8.24175 -- HEAD
call rate=36.3987 -- v2 patch

which is now 4.3x faster with the patch.

Then I tried compiling with -m32, and unfortunately this made the
patch slower than HEAD for small inputs:

-- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
call rate=5.052332e+06 -- HEAD
call rate=3.883459e+06 -- v2 patch

-- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
call rate=4.7221405e+06 -- HEAD
call rate=3.7965685e+06 -- v2 patch

-- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
call rate=4.4638375e+06 -- HEAD
call rate=3.39948e+06   -- v2 patch

and it doesn't reach parity until around ndigits=26, which is
disappointing. It does still get much faster for large inputs:

-- var1ndigits1=16383, var2ndigits2=16383 [-m32, SIMD disabled]
call rate=5.6747904 -- HEAD
call rate=20.104694 -- v2 patch

and it still makes numeric_big much faster:

[-m32, SIMD disabled]
ok 224 - numeric_big 596 ms [HEAD]
ok 224 - numeric_big 294 ms [v2 patch]

I'm not sure whether compiling with -m32 is a realistic simulation of
systems people are really using.
It's tempting to just regard such systems as legacy and ignore them,
given the other large gains, but maybe this is not the only case that
gets slower. Another option would be to only use this optimisation on
64-bit machines, though I think that would make the code pretty
messy, and it would be throwing away the gains for large inputs,
which might well outweigh the losses.

Thoughts anyone?

Regards,
Dean

[1] https://www.postgresql.org/message-id/19834f2c-4bf1-4e27-85ed-ca5df0f28e03@app.fastmail.com
Attachment