Thread: Some optimisations for numeric division
Attached are 3 small patches that improve the performance of numeric division. Patch 0002 seems to have the biggest impact, but I think they're all worth including, since they're quite simple changes, with noticeable performance gains. Patch 0001 vectorises the inner loop of div_var_fast(). This loop is almost identical to the inner loop of mul_var(), which was vectorised by commit 8870917623. The thing preventing the div_var_fast() loop from being vectorised is the assignment to div[qi + i], and simply replacing that with div_qi[i] where div_qi = &div[qi], in the same style as mul_var(), fixes that. One difference between this and the mul_var() code is that it is also necessary to add an explicit cast to get the compiler to generate matching output, and give the best results. This is because in mul_var() the compiler recognises that var1digit is actually 16-bit, rather than 32-bit, and so it doesn't need to multiply the high part, but in div_var_fast() it's not obvious to the compiler that qdigit also fits in 16 bits, hence the cast. The actual performance gain is typically quite modest, since div_var_fast() is always only a part of some more complex numeric computation, but it can be seen in cases like raising numbers to negative integer powers, e.g.: CREATE TEMP TABLE num_test(x numeric); SELECT setseed(0); INSERT INTO num_test SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, '')||x)::numeric FROM generate_series(1,100)) FROM generate_series(1,100000) g(x); SELECT sum(x^(-2)) FROM num_test; Time: 112.949 ms (HEAD) Time: 98.537 ms (with patch) Patch 0002 simplifies the inner loop of div_var() (the guts of the public-facing numeric division operator) by more closely combining the multiplication and subtraction operations and folding the separate "carry" and "borrow" variables into a single "borrow", as suggested by the old code comment. IMO, this makes the code easier to understand, as well as giving more significant performance gains: CREATE TEMP TABLE div_test(x numeric); SELECT setseed(0); INSERT INTO div_test SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, ''))::numeric + x FROM generate_series(1,50)) FROM generate_series(1,1000) g(x); SELECT sum(x/y) FROM div_test t1(x), div_test t2(y); Time: 1477.340 ms (HEAD) Time: 826.748 ms (with patch) Patch 0003 replaces some uses of div_var_fast() with div_var(). Specifically, when the divisor has just one or two base-NBASE digits, div_var() is faster. This is especially true for 1-digit divisors, due to the "fast path" code in div_var() to handle that. It's also just about true for 2-digit divisors, and it occurs to me that that case could potentially be optimised further with similar fast path code in div_var(), but I haven't tried that yet. One-digit divisors are quite common. For example, in the Taylor series expansions in exp_var() and ln_var(), since the number of Taylor series terms never exceeds a few hundred in practice. Also, exp_var()'s argument reduction step divides by 2^ndiv2, which is roughly 100 times the input, rounded up to a power of two, and valid inputs are less than 6000, so this will always be one or two digits. Testing this with a bunch of random exp() and ln() computations I saw a speed-up of 15-20%, and it reduced the run time of the numeric-big regression test by around 10%, which seems worth having. Regards, Dean
Attachment
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > Attached are 3 small patches that improve the performance of numeric > division. Patch 0002 seems to have the biggest impact, but I think > they're all worth including, since they're quite simple changes, with > noticeable performance gains. I took a quick look through these (just eyeball, didn't try to verify your performance statements). I'm +1 on 0001 and 0002, but 0003 feels a bit ad-hoc. It certainly *looks* weird for the allegedly faster function to be handing off to the allegedly slower one. I also wonder if we're leaving anything on the table by not exploiting div_var_fast's weaker roundoff guarantees in this case. Should we think about a more thoroughgoing redesign of these functions' APIs? Another idea is to only worry about the single-divisor-digit optimization, and just copy div_var's (very small) inner loop for that case into div_var_fast. regards, tom lane
On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I took a quick look through these (just eyeball, didn't try to verify > your performance statements). Thanks for looking! > I'm +1 on 0001 and 0002, but 0003 feels > a bit ad-hoc. It certainly *looks* weird for the allegedly faster > function to be handing off to the allegedly slower one. I also wonder > if we're leaving anything on the table by not exploiting > div_var_fast's weaker roundoff guarantees in this case. Should we > think about a more thoroughgoing redesign of these functions' APIs? Hmm, I'm not sure what kind of thing you had in mind. One thought that occurred to me was that it's a bit silly that exp_var() and ln_var() have to use a NumericVar for what could just be an int, if we had a div_var_int() function that could divide by an int. Then both div_var() and div_var_fast() could hand off to it for one and two digit divisors. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm +1 on 0001 and 0002, but 0003 feels >> a bit ad-hoc. It certainly *looks* weird for the allegedly faster >> function to be handing off to the allegedly slower one. I also wonder >> if we're leaving anything on the table by not exploiting >> div_var_fast's weaker roundoff guarantees in this case. Should we >> think about a more thoroughgoing redesign of these functions' APIs? > Hmm, I'm not sure what kind of thing you had in mind. I'm not either, tbh. Just seems like this needs more than some hacking around the margins. > One thought that occurred to me was that it's a bit silly that > exp_var() and ln_var() have to use a NumericVar for what could just be > an int, if we had a div_var_int() function that could divide by an > int. Then both div_var() and div_var_fast() could hand off to it for > one and two digit divisors. Oooh, that seems like a good idea. regards, tom lane
On Wed, 23 Feb 2022 at 22:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > > One thought that occurred to me was that it's a bit silly that > > exp_var() and ln_var() have to use a NumericVar for what could just be > > an int, if we had a div_var_int() function that could divide by an > > int. Then both div_var() and div_var_fast() could hand off to it for > > one and two digit divisors. > > Oooh, that seems like a good idea. > OK, I've replaced the 0003 patch with one that does that instead. The div_var_int() API is slightly different in that it also accepts a divisor weight argument, but the alternative would have been for the callers to have to adjust both the result weight and rscale, which would have been uglier. There's a large block of code in div_var() that needs re-indenting, but I think it would be better to leave that to a later pgindent run. The performance results are quite pleasing. It's slightly faster than the old one-digit div_var() code because it manages to avoid some digit array copying, and for two digit divisors it's much faster: CREATE TEMP TABLE div_test(x numeric, y numeric); SELECT setseed(0); INSERT INTO div_test SELECT (SELECT (((x%9)+1)||string_agg((random()*9)::int::text, ''))::numeric FROM generate_series(1,50)), (SELECT ('1.'||string_agg((random()*9)::int::text, '')||(x%10)||'e3')::numeric FROM generate_series(1,6)) FROM generate_series(1,5000) g(x); select * from div_test limit 10; SELECT sum(t1.x/t2.y) FROM div_test t1, div_test t2; Time: 11600.034 ms (HEAD) Time: 9890.788 ms (with 0002) Time: 6202.851 ms (with 0003) And obviously it'll be a larger relative gain for div_var_fast(), since that was slower to begin with in such cases. This makes me think that it might also be worthwhile to follow this with a similar div_var_int64() function on platforms with 128-bit integers, which could then be used to handle 3- and 4-digit divisors, which are probably quite common in practice. Attached is the updated patch series (0001 and 0002 unchanged). Regards, Dean
Attachment
On Fri, 25 Feb 2022 at 10:45, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > Attached is the updated patch series (0001 and 0002 unchanged). > And another update following feedback from the cfbot. Regards, Dean
Attachment
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > And another update following feedback from the cfbot. This patchset LGTM. On my machine there doesn't seem to be any measurable performance change for the numeric regression test, but numeric_big gets about 15% faster. The only additional thought I have, based on your comments about 0001, is that maybe in mul_var we should declare var1digit as being NumericDigit not int. I tried that here and didn't see any material change in the generated assembly code (using gcc 8.5.0), so you're right that modern gcc doesn't need any help there; but I wonder if it could help on other compiler versions. I'll mark this RFC. regards, tom lane
On Fri, 25 Feb 2022 at 18:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > And another update following feedback from the cfbot. > > This patchset LGTM. On my machine there doesn't seem to be any > measurable performance change for the numeric regression test, > but numeric_big gets about 15% faster. > Yes, that matches my observations. Thanks for reviewing and testing. > The only additional thought I have, based on your comments about > 0001, is that maybe in mul_var we should declare var1digit as > being NumericDigit not int. I tried that here and didn't see > any material change in the generated assembly code (using gcc > 8.5.0), so you're right that modern gcc doesn't need any help > there; but I wonder if it could help on other compiler versions. > Yes, that makes sense. Done that way. Regards, Dean