Thread: Auto-vectorization speeds up multiplication of large-precision numerics
There is this for loop in mul_var():

    /*
     * Add the appropriate multiple of var2 into the accumulator.
     *
     * As above, digits of var2 can be ignored if they don't contribute,
     * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
     */
    for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
         i2 >= 0; i2--)
        dig[i--] += var1digit * var2digits[i2];

With gcc -O3, the above for loop, if simplified, gets auto-vectorized [1], and this results in speedups for multiplication of PostgreSQL numeric types having large precisions. The speedups start becoming noticeable from around precision 50 onwards: with precision 50 the improvement I saw was 5%, with 60 it was 11%, with 120 50%, with 240 2.2x, and so on. On my arm64 machine, a similar benefit starts showing up from precision 20 onwards.

I used this query from regress/sql/numeric_big.sql:

    SELECT t1.val * t2.val FROM num_data t1, num_data t2

If I use the schema created by numeric_big.sql, the speedup was 2.5x to 2.7x across three machines. Also, the regress/sql/numeric_big test itself speeds up by 80%.

For the for loop to be auto-vectorized, I had to simplify it to something like this:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + 2];
    for (i = 0; i <= i2; i++)
        digptr[i] += var1digit * var2digits[i];

gcc can also vectorize a backward loop such as this:

    for (i = n - 1; i >= 0; i--)
        a += b[i];

gcc -fopt-info-all gives this info:

    numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors

But if the assignment is not as simple as above, it does not vectorize the backward traversal:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + i2 + 2];
    for (; i2 >= 0; i2--)
        digptr[i2] += var1digit * var2digits[i2];

    numeric.c:7380:3: missed: couldn't vectorize loop
    numeric.c:7381:15: missed: not vectorized: relevant stmt not supported: _39 = *_38;

Even with forward loop traversal, the loop below can't be vectorized, seemingly because it involves two index variables:

    count = Min(var2ndigits - 1, res_ndigits - i1 - 3) + 1;
    i = i1 + i2 - count + 3;
    for (i2 = 0; i2 < count; i++, i2++)
        dig[i] += var1digit * var2digits[i2];

    numeric.c:7394:3: missed: couldn't vectorize loop
    numeric.c:7395:11: missed: not vectorized: not suitable for gather load _37 = *_36;

So it's better to keep the loop simple:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + 2];
    for (i = 0; i <= i2; i++)
        digptr[i] += var1digit * var2digits[i];

    numeric.c:7387:3: optimized: loop vectorized using 16 byte vectors

Attached is the patch that uses the above loop. With the patch, in the mul_var() assembly code I could see multiply-accumulate instructions that operate on SIMD vectors (these are arm64 instructions):

    smlal  v1.4s, v2.4h, v3.4h
    smlal2 v0.4s, v2.8h, v3.8h

I extracted the "SELECT t1.val * t2.val FROM num_data t1, num_data t2" query from regress/sql/numeric_big.sql and ran it on the data that the test creates (it inserts values with precisions ranging from 500 to 700). Attached is create_schema.sql, which creates the regression test schema. With this query, below are the changes in the mul_var() figures with and without the patch. (All figures below are with an -O3 build.)

HEAD:

    +   64.06%  postgres  postgres           [.] mul_var
    +   13.00%  postgres  postgres           [.] get_str_from_var
    +    6.32%  postgres  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
    +    1.65%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
    +    1.10%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock
    +    0.96%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist
    +    0.73%  postgres  [kernel.kallsyms]  [k] page_counter_try_charge
    +    0.64%  postgres  postgres           [.] AllocSetAlloc

Patched:

    +   35.91%  postgres  postgres           [.] mul_var
    +   20.43%  postgres  postgres           [.] get_str_from_var
    +   13.01%  postgres  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
    +    2.31%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
    +    1.48%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock
    +    1.15%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist
    +    0.99%  postgres  postgres           [.] AllocSetAlloc
    +    0.58%  postgres  postgres           [.] base_yyparse

Times in milliseconds for SELECT t1.val * t2.val FROM num_data t1, num_data t2:

    Machine 1 (amd64)
    Head    : .668 .723 .658 .660
    Patched : .288 .280 .282 .282

    Machine 2 (arm64)
    Head    : .897 .879 .888 .897
    Patched : .329 .324 .321 .320

Average times in milliseconds for the numeric_big regression test:

    Machine 1 (amd64)
    Head    : 801
    Patched : 445

    Machine 2 (arm64)
    Head    : 1105
    Patched : 550

About the gcc -O3 option: I understand we have kept the default gcc CFLAGS at -O2 because, I believe, -O3 might expose bugs arising from assumptions in the code. But with this patch, we allow products built with the -O3 flag to get this benefit. The actual gcc option that enables auto-vectorization is -ftree-loop-vectorize, which is always enabled at -O3. What we can do in the future is have a separate file containing such optimized code that is proven to work with these optimization flags, and enable the required compiler flags only for such files when the build is done with -O2.

[1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Peter Eisentraut
Date:
On 2020-06-09 13:50, Amit Khandekar wrote:
> Also, the regress/sql/numeric_big test itself speeds up by 80%

That's nice. I can confirm the speedup:

-O3 without the patch:

    test numeric      ... ok      737 ms
    test numeric_big  ... ok     1014 ms

-O3 with the patch:

    test numeric      ... ok      680 ms
    test numeric_big  ... ok      580 ms

Also:

-O2 without the patch:

    test numeric      ... ok      693 ms
    test numeric_big  ... ok     1160 ms

-O2 with the patch:

    test numeric      ... ok      677 ms
    test numeric_big  ... ok      917 ms

So the patch helps either way. But it also seems that without the patch, -O3 might be a bit slower in some cases. This might need more testing.

> For the for loop to be auto-vectorized, I had to simplify it to
> something like this :

Well, how do we make sure we keep it that way? How do we prevent some random rearranging of the code or some random compiler change from breaking this again?

--
Peter Eisentraut    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-06-09 13:50, Amit Khandekar wrote:
> > Also, the regress/sql/numeric_big test itself speeds up by 80%
>
> That's nice. I can confirm the speedup:
>
> -O3 without the patch:
>
>     test numeric      ... ok      737 ms
>     test numeric_big  ... ok     1014 ms
>
> -O3 with the patch:
>
>     test numeric      ... ok      680 ms
>     test numeric_big  ... ok      580 ms
>
> Also:
>
> -O2 without the patch:
>
>     test numeric      ... ok      693 ms
>     test numeric_big  ... ok     1160 ms
>
> -O2 with the patch:
>
>     test numeric      ... ok      677 ms
>     test numeric_big  ... ok      917 ms
>
> So the patch helps either way.

Oh, I hadn't observed that the patch helps numeric_big.sql to speed up to some extent even with -O2. For me, it takes 805 ms on head and 732 ms with the patch. One possible reason I can think of is that the forward loop traversal lets hardware pre-fetching help; but this is just a wild guess.

-O3, HEAD:

    test numeric      ... ok      102 ms
    test numeric_big  ... ok      803 ms

-O3, patched:

    test numeric      ... ok      100 ms
    test numeric_big  ... ok      450 ms

-O2, HEAD:

    test numeric      ... ok      100 ms
    test numeric_big  ... ok      805 ms

-O2, patched:

    test numeric      ... ok      103 ms
    test numeric_big  ... ok      732 ms

> But it also seems that without the patch, -O3 might
> be a bit slower in some cases. This might need more testing.

For me, there is no observed change in the times with -O2 versus -O3 on head. Are you getting consistently slower numeric*.sql tests with -O3 on the current code? Not sure what the reason might be, but that would not be related to this patch. Or is it in the context of the patch that you are suggesting it might need more testing? There might be existing cases that run a bit slower with -O3, but that's strange, actually. Probably the optimization in those cases does not work out the way the compiler expected, and in fact works negatively. That's probably one of the reasons -O2 is the default choice.
> > For the for loop to be auto-vectorized, I had to simplify it to
> > something like this :
>
> Well, how do we make sure we keep it that way? How do we prevent some
> random rearranging of the code or some random compiler change to break
> this again?

I believe the compiler rearranges code segments w.r.t. one another only when they are independent of each other; I guess the compiler is able to identify that. In our case, it's the for loop: there are variables used inside it, and that would prevent the compiler from moving it. Even if the compiler found it safe to move the loop relative to the surrounding code, it would not split the loop contents themselves, so the for loop would remain intact, and so would the vectorization (although the compiler seems to keep an unrolled version of the loop in case it is called with smaller iteration counts). But yes, if someone in the future changes the for loop, that could break the auto-vectorization. So we should have appropriate comments (the patch has those). Let me know if you can see any reason why the compiler might stop vectorizing in the future even with no change to the for loop. It might be safer to take the for loop out into an inline function, just to play it safe?
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
FYI: this one is present in the July commitfest.
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Peter Eisentraut
Date:
On 2020-06-10 14:15, Amit Khandekar wrote:
>> Well, how do we make sure we keep it that way? How do we prevent some
>> random rearranging of the code or some random compiler change to break
>> this again?
> I believe the compiler rearranges the code segments w.r.t. one another
> when those are independent of each other. I guess the compiler is able
> to identify that. With our case, it's the for loop. There are some
> variables used inside it, and that would prevent it from moving the
> for loop. Even if the compiler finds it safe to move relative to
> surrounding code, it would not split the for loop contents themselves,
> so the for loop will remain intact, and so would the vectorization,
> although it seems to keep an unrolled version of the loop in case it
> is called with smaller iteration counts. But yes, if someone in the
> future tries to change the for loop, it would possibly break the
> auto-vectorization. So, we should have appropriate comments (patch has
> those). Let me know if you find any possible reasons due to which the
> compiler would in the future stop the vectorization even when there is
> no change in the for loop.

We normally don't compile with -O3, so very few users would get the benefit of this.

We have CFLAGS_VECTOR for the checksum code. I suppose if we are making the numeric code vectorizable as well, we should apply this there also.

I think we need a bit of a policy decision from the group here.

--
Peter Eisentraut    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> We normally don't compile with -O3, so very few users would get the
> benefit of this.

Yeah. I don't think changing that baseline globally would be a wise move.

> We have CFLAGS_VECTOR for the checksum code. I
> suppose if we are making the numeric code vectorizable as well, we
> should apply this there also.

> I think we need a bit of a policy decision from the group here.

I'd vote in favor of applying CFLAGS_VECTOR to specific source files that can benefit. We already have experience with that and we've not detected any destabilization potential. (I've not looked at this patch, so don't take this as a +1 for this specific patch.)

regards, tom lane
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
On Fri, 10 Jul 2020 at 19:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > We normally don't compile with -O3, so very few users would get the
> > benefit of this.
>
> Yeah. I don't think changing that baseline globally would be a wise move.
>
> > We have CFLAGS_VECTOR for the checksum code. I
> > suppose if we are making the numeric code vectorizable as well, we
> > should apply this there also.
>
> > I think we need a bit of a policy decision from the group here.
>
> I'd vote in favor of applying CFLAGS_VECTOR to specific source files
> that can benefit. We already have experience with that and we've not
> detected any destabilization potential.

I tried this in utils/adt/Makefile:

    +
    +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
    +

and it works.

CFLAGS_VECTOR also includes the -funroll-loops option, which, I believe, had shown improvements in the checksum.c runs ([1]). This option makes the object file a bit bigger: numeric.o's size increased by about 15K, from 116672 to 131360 bytes. I ran the multiplication test and didn't see any additional speedup with this option; it also does not seem to be related to vectorization. So I was thinking of splitting CFLAGS_VECTOR into CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS: checksum.c can use both flags, and numeric.c only CFLAGS_VECTOR.

I was also wondering whether it's worth extracting only the code that we think can be optimized into a separate file (say numeric_vectorize.c or adt_vectorize.c, which could have mul_var() to start with), using that file as a collection of all such code in the adt module, and adding such files to other modules as and when needed. For numeric.c, there can already be some scope for auto-vectorization in other similar regions of that file, so we may not require a separate numeric_vectorize.c and could just pass the CFLAGS_VECTOR flag for this file itself.
[1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which I
> believe, had showed improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, its size
> increased by 15K; from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

As with the existing CFLAGS_VECTOR, an environment variable of the same name can also be set when running configure; I did the same for CFLAGS_UNROLL_LOOPS. Now, developers who already set the CFLAGS_VECTOR environment variable when configuring might be doing so because their compilers don't have these compiler options and they use some equivalent options instead. numeric.c will now be compiled with CFLAGS_VECTOR, so for them it will be compiled with their equivalent of the vectorize and unroll-loops options, which I think is OK; it's just that the numeric.o size will increase.

> [1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e

--
Thanks,
-Amit Khandekar
Huawei Technologies
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> I did as above. Attached is the v2 patch.

I made some cosmetic changes to this and committed it. AFAICT, there's no measurable performance change in the "numeric" regression test, but I got a solid 45% speedup on "numeric_big", so it's clearly a win for wider arithmetic cases.

It seemed useful to actually rename CFLAGS_VECTOR if we're changing its meaning, so I made it CFLAGS_VECTORIZE.

regards, tom lane
I wrote:
> I made some cosmetic changes to this and committed it.

BTW, poking at this further, it seems that the patch only really works for gcc. clang accepts the -ftree-vectorize switch, but looking at the generated asm shows that it does nothing useful. Which is odd, because clang does do loop vectorization.

I tried adding -Rpass-analysis=loop-vectorize and got

    numeric.c:8341:3: remark: loop not vectorized: could not determine number of loop iterations [-Rpass-analysis=loop-vectorize]
            for (i2 = 0; i2 <= i; i2++)
            ^

which is interesting, but I don't know how to proceed further.

regards, tom lane
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
On Mon, 7 Sep 2020 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I made some cosmetic changes to this and committed it.

Thanks!

> BTW, poking at this further, it seems that the patch only really
> works for gcc. clang accepts the -ftree-vectorize switch, but
> looking at the generated asm shows that it does nothing useful.
> Which is odd, because clang does do loop vectorization.
>
> I tried adding -Rpass-analysis=loop-vectorize and got
>
> numeric.c:8341:3: remark: loop not vectorized: could not determine number of loop iterations [-Rpass-analysis=loop-vectorize]
>         for (i2 = 0; i2 <= i; i2++)

Hmm, yeah, that's unfortunate. My guess is that the compiler would do vectorization only if 'i' were a constant, which is not true in our case.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> On Mon, 7 Sep 2020 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, poking at this further, it seems that the patch only really
>> works for gcc. clang accepts the -ftree-vectorize switch, but
>> looking at the generated asm shows that it does nothing useful.
>> Which is odd, because clang does do loop vectorization.

> Hmm, yeah that's unfortunate. My guess is that the compiler would do
> vectorization only if 'i' is a constant, which is not true for our
> case.

No, they claim to handle variable trip counts, per
https://llvm.org/docs/Vectorizers.html#loops-with-unknown-trip-count

I experimented with a few different ideas such as adding restrict decoration to the pointers, and eventually found that what works is to write the loop termination condition as "i2 < limit" rather than "i2 <= limit". It took me a long time to think of trying that, because it seemed ridiculously stupid. But it works.

regards, tom lane
I wrote:
> I experimented with a few different ideas such as adding restrict
> decoration to the pointers, and eventually found that what works
> is to write the loop termination condition as "i2 < limit"
> rather than "i2 <= limit". It took me a long time to think of
> trying that, because it seemed ridiculously stupid. But it works.

I've done more testing and confirmed that both gcc and clang can vectorize the improved loop on aarch64 as well as x86_64. (clang's results can be confusing because -ftree-vectorize doesn't seem to have any effect: its vectorizer is on by default. But if you use -fno-vectorize it'll go back to the old, slower code.)

The only buildfarm effect I've noticed is that locust and prairiedog, which are using nearly the same ancient gcc version, complain

    c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is used.

which is expected (they say the same for checksum.c), but then there are a bunch of

    warning: dereferencing type-punned pointer will break strict-aliasing rules

which seems worrisome. (This sort of thing is the reason I'm hesitant to apply higher optimization levels across the board.) Both animals pass the regression tests anyway, but if any other compilers treat -ftree-vectorize as an excuse to apply stricter optimization assumptions, we could be in for trouble.

I looked closer and saw that all of those warnings are about init_var(), and this change makes them go away:

    -#define init_var(v)	MemSetAligned(v, 0, sizeof(NumericVar))
    +#define init_var(v)	memset(v, 0, sizeof(NumericVar))

I'm a little inclined to commit that as future-proofing. It's essentially reversing out a micro-optimization I made in d72f6c750. I doubt I had hard evidence that it made any noticeable difference; and even if it did back then, modern compilers probably prefer the memset approach.

regards, tom lane
Re: Auto-vectorization speeds up multiplication of large-precision numerics
From
Amit Khandekar
Date:
On Tue, 8 Sep 2020 at 02:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I experimented with a few different ideas such as adding restrict
> > decoration to the pointers, and eventually found that what works
> > is to write the loop termination condition as "i2 < limit"
> > rather than "i2 <= limit". It took me a long time to think of
> > trying that, because it seemed ridiculously stupid. But it works.

Ah, OK. I checked the "Auto-Vectorization in LLVM" link that you shared. All the examples use "< n" or "> n"; none of them use "<= n". Looks like a hidden restriction.

> I've done more testing and confirmed that both gcc and clang can
> vectorize the improved loop on aarch64 as well as x86_64. (clang's
> results can be confusing because -ftree-vectorize doesn't seem to
> have any effect: its vectorizer is on by default. But if you use
> -fno-vectorize it'll go back to the old, slower code.)
>
> The only buildfarm effect I've noticed is that locust and
> prairiedog, which are using nearly the same ancient gcc version,
> complain
>
> c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is used.
>
> which is expected (they say the same for checksum.c), but then
> there are a bunch of
>
> warning: dereferencing type-punned pointer will break strict-aliasing rules
>
> which seems worrisome. (This sort of thing is the reason I'm
> hesitant to apply higher optimization levels across the board.)
> Both animals pass the regression tests anyway, but if any other
> compilers treat -ftree-vectorize as an excuse to apply stricter
> optimization assumptions, we could be in for trouble.
>
> I looked closer and saw that all of those warnings are about
> init_var(), and this change makes them go away:
>
> -#define init_var(v)	MemSetAligned(v, 0, sizeof(NumericVar))
> +#define init_var(v)	memset(v, 0, sizeof(NumericVar))
>
> I'm a little inclined to commit that as future-proofing. It's
> essentially reversing out a micro-optimization I made in d72f6c750.
> I doubt I had hard evidence that it made any noticeable difference;
> and even if it did back then, modern compilers probably prefer the
> memset approach.

Thanks. I must admit it did not occur to me that I could very well have installed clang on my Linux machine and tried compiling this file, or tested with some older gcc versions. I think I was using gcc 8. Do you know which gcc compiler version gave these warnings?

--
Thanks,
-Amit Khandekar
Huawei Technologies
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> Thanks. I must admit it did not occur to me that I could have very
> well installed clang on my linux machine and tried compiling this
> file, or tested with some older gcc versions. I think I was using gcc
> 8. Do you know what was the gcc compiler version that gave these
> warnings ?

Per the buildfarm's configure logs, prairiedog is using

    configure: using compiler=powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5341)

IIRC, locust has a newer build number, but it's the same underlying gcc version.

regards, tom lane