Thread: Improve performance of pg_strtointNN functions
Over on [1], Dean and I were discussing why both gcc and clang don't seem to want to optimize the multiplication that we're doing in pg_strtoint16, pg_strtoint32 and pg_strtoint64 into something more efficient. It seems that this is down to the overflow checks. Removing seems to allow the compiler to better optimize the compiled code. See the use of LEA instead of IMUL in [2]. Instead of using the pg_mul_sNN_overflow functions, we can just accumulate the number in an unsigned version of the type and do an overflow check by checking if the accumulated value has gone beyond a 10th of the maximum *signed* value for the type. We then just need to do some final overflow checks after the accumulation is done. What those depend on if it's a negative or positive number. I ran a few microbenchmarks with the attached str2int.c file and I see about a 10-20% performance increase: $ ./str2int -100000000 100000000 n = -100000000, e = 100000000 pg_strtoint32 in 3.207926 seconds pg_strtoint32_v2 in 2.763062 seconds (16.100399% faster) v2 is the updated version of the function On Windows, the gains are generally a bit more. I think this must be due to the lack of overflow intrinsic functions. >str2int.exe -100000000 100000000 n = -100000000, e = 100000000 pg_strtoint32 in 9.416000 seconds pg_strtoint32_v2 in 8.099000 seconds (16.261267% faster) I was thinking that we should likely apply this before doing the hex literals, which is the main focus of [1]. The reason being is so that that patch can immediately have faster conversions by allowing the compiler to use bit shifting instead of other means of multiplying by a power-of-2 number. I'm hoping this removes a barrier for Peter from the small gripe I raised on that thread about the patch having slower than required hex, octal and binary string parsing. David [1] https://postgr.es/m/CAApHDvrL6_+wKgPqRHr7gH_6xy3hXM6a3QCsZ5ForurjDFfenA@mail.gmail.com [2] https://godbolt.org/z/7YoMT63q1
Attachment
On Thu, Dec 1, 2022 at 6:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I was thinking that we should likely apply this before doing the hex
> literals, which is the main focus of [1]. The reason being is so that
> that patch can immediately have faster conversions by allowing the
> compiler to use bit shifting instead of other means of multiplying by
> a power-of-2 number. I'm hoping this removes a barrier for Peter from
> the small gripe I raised on that thread about the patch having slower
> than required hex, octal and binary string parsing.
I don't see why the non-decimal literal patch needs to be "immediately" faster? If doing this first leads to less code churn, that's another consideration, but you haven't made that argument.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, 1 Dec 2022 at 18:27, John Naylor <john.naylor@enterprisedb.com> wrote: > I don't see why the non-decimal literal patch needs to be "immediately" faster? If doing this first leads to less codechurn, that's another consideration, but you haven't made that argument. My view is that Peter wants to keep the code he's adding for the hex, octal and binary parsing as similar to the existing code as possible. I very much understand Peter's point of view on that. Consistency is good. However, if we commit the hex literals patch first, people might ask "why don't we use bit-wise operators to make the power-of-2 bases faster?", which seems like a very legitimate question. I asked it, anyway... On the other hand, if Peter adds the bit-wise operators then the problem of code inconsistency remains. As an alternative to those 2 options, I'm proposing we commit this first then the above dilemma disappears completely. If this was going to cause huge conflicts with Peter's patch then I might think differently. I feel like it's a fairly trivial task to rebase. If the consensus is that we should fix this afterwards, then I'm happy to delay. David
On Thu, 1 Dec 2022 at 05:38, David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 1 Dec 2022 at 18:27, John Naylor <john.naylor@enterprisedb.com> wrote: > > I don't see why the non-decimal literal patch needs to be "immediately" faster? If doing this first leads to less codechurn, that's another consideration, but you haven't made that argument. > > My view is that Peter wants to keep the code he's adding for the hex, > octal and binary parsing as similar to the existing code as possible. > I very much understand Peter's point of view on that. Consistency is > good. However, if we commit the hex literals patch first, people might > ask "why don't we use bit-wise operators to make the power-of-2 bases > faster?", which seems like a very legitimate question. I asked it, > anyway... On the other hand, if Peter adds the bit-wise operators > then the problem of code inconsistency remains. > > As an alternative to those 2 options, I'm proposing we commit this > first then the above dilemma disappears completely. > > If this was going to cause huge conflicts with Peter's patch then I > might think differently. I feel like it's a fairly trivial task to > rebase. > > If the consensus is that we should fix this afterwards, then I'm happy to delay. > I feel like it should be done afterwards, so that any performance gains can be measured for all bases. Otherwise, we won't really know, or have any record of, how much faster this was for other bases, or be able to go back and test that. Regards, Dean
On 01.12.22 06:38, David Rowley wrote: > If this was going to cause huge conflicts with Peter's patch then I > might think differently. I feel like it's a fairly trivial task to > rebase. > > If the consensus is that we should fix this afterwards, then I'm happy to delay. If we are happy with this patch, then it's okay with me if you push this first. I'll probably need to do another pass over my patch anyway, so a bit more work isn't a problem.
On Fri, 2 Dec 2022 at 20:35, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > If we are happy with this patch, then it's okay with me if you push this > first. I'll probably need to do another pass over my patch anyway, so a > bit more work isn't a problem. Thanks. I'll start looking at the patch again now. If I don't find any problems then I'll push it. I just did some performance tests by COPYing in 40 million INTs ranging from -/+ 20 million. create table ab(a int, b int); insert into ab select x,x from generate_series(-20000000,20000000)x; copy ab to '/tmp/ab.csv'; -- master truncate ab; copy ab from '/tmp/ab.csv'; Time: 10219.386 ms (00:10.219) Time: 10252.572 ms (00:10.253) Time: 10202.940 ms (00:10.203) -- patched truncate ab; copy ab from '/tmp/ab.csv'; Time: 9522.020 ms (00:09.522) Time: 9441.294 ms (00:09.441) Time: 9432.834 ms (00:09.433) About ~8% faster David
On Sun, 4 Dec 2022 at 13:52, David Rowley <dgrowleyml@gmail.com> wrote: > Thanks. I'll start looking at the patch again now. If I don't find any > problems then I'll push it. Pushed with some small adjustments. David
On Sun, 4 Dec 2022 at 03:19, David Rowley <dgrowleyml@gmail.com> wrote: > > Pushed with some small adjustments. > Ah, I see that you changed the overflow test, and I realise that I forgot to answer your question about why I wrote that as 1 - INT_MIN / 10 over on the other thread. The reason is that we need to detect whether tmp * base will exceed -INT_MIN, not INT_MAX, since we're accumulating the absolute value of a signed integer. So the right test is tmp >= 1 - INT_MIN / base or equivalently tmp > -(INT_MIN / base) I used the first form, because it didn't require extra parentheses, but that doesn't really matter. The point is that, in general, that's not the same as tmp > INT_MAX / base though it happens to be the same for base = 10, because INT_MIN and INT_MAX aren't divisible by 10. It will break when base is a power of 2 though, so although it's not broken now, it's morally wrong, and it risks breaking when Peter commits his patch. Regards, Dean
On Sun, 4 Dec 2022 at 22:53, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > Ah, I see that you changed the overflow test, and I realise that I > forgot to answer your question about why I wrote that as 1 - INT_MIN / > 10 over on the other thread. > > The reason is that we need to detect whether tmp * base will exceed > -INT_MIN, not INT_MAX, since we're accumulating the absolute value of > a signed integer. I think I'd been too focused on the simplicity of that expression and also the base 10 part. I saw that everything worked in base 10 and failed to give enough forward thought to other bases. I now see that it was wrong-headed to code it the way I had it. Thanks for pointing this out. I've just pushed a fix. David