Thread: Speeding up GIST index creation for tsvectors
Hi,

In hemdistsign() of tsgistidx.c, if we process the bits in 64-bit chunks rather than byte-by-byte, we get an overall speed-up in GiST index creation for tsvector types. With the default siglen (124), the speed-up is 12-20%. With siglen=700, it is 30-50%. So with longer signature lengths, we get a higher percentage speed-up.

The attached patch 0001 has the required changes. Rather than applying the XOR operator to char values, XOR is applied to 64-bit chunks, and since the chunks are 64-bit, popcount64() is used on each of them. I have checked that the two bitvector pointer arguments of hemdistsign() are not always 64-bit aligned, so the leading misaligned bits and the trailing remainder bits are processed char-by-char, leaving the middle 64-bit chunks for popcount64().

We might extend this to the hemdistsign() definitions at other places in the code, but for now we can start with GiST. I haven't tried the other places.

-------------

While working on this, I observed that on platforms other than x86_64, we still declare pg_popcount64() as a function pointer, even though we don't use the runtime selection of the right function using __get_cpuid() as is done on x86. The other patch, 0002, is a general optimization that avoids this function pointer for pg_popcount32/64() calls. The patch arranges for a direct function call so as to get rid of the function-pointer dereference each time pg_popcount32/64() is called. To do this, pg_popcount64 is defined to another function name (pg_popcount64_nonasm) rather than a function pointer whenever USE_POPCNT_ASM is not defined. And pg_popcount64_nonasm() is made a static inline function, so that whenever pg_popcount64() is called, __builtin_popcount() gets called directly. For platforms not supporting __builtin_popcount(), the slow version continues to be used, as is the current behaviour.
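As a rough standalone illustration of the 0001 idea (not the patch itself — the function name is made up, and this sketch sidesteps the alignment question entirely with memcpy rather than the leading-byte handling described above):

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the 0001 idea: XOR the two signatures in 64-bit chunks and
 * popcount each chunk, finishing any trailing remainder byte by byte.
 * Illustrative only; the real patch also handles leading misaligned
 * bytes char-by-char, whereas memcpy here avoids alignment assumptions.
 */
static int
xorcount_chunked(const char *a, const char *b, size_t len)
{
	int			count = 0;
	size_t		i = 0;

	/* middle: process 8 bytes at a time */
	for (; i + 8 <= len; i += 8)
	{
		uint64_t	wa,
					wb;

		memcpy(&wa, a + i, sizeof(wa));
		memcpy(&wb, b + i, sizeof(wb));
		count += __builtin_popcountll(wa ^ wb);
	}

	/* trailing remainder: byte by byte */
	for (; i < len; i++)
		count += __builtin_popcount((unsigned char) (a[i] ^ b[i]));

	return count;
}
```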
Tested this 0002 patch on ARM64, with patch 0001 already applied, and the GiST index creation for tsvectors *further* speeds up by 6% for the default siglen (=124), and by 12% with siglen=700.

-------------

Schema :

CREATE TABLE test_tsvector(t text, a tsvector);
-- Attached tsearch.data (a bigger version of
-- src/test/regress/data/tsearch.data)
\COPY test_tsvector FROM 'tsearch.data';

Test case that shows improvement :

CREATE INDEX wowidx6 ON test_tsvector USING gist (a);

Time taken by the above create-index command, in seconds, along with % speed-up w.r.t. HEAD :

A) siglen=124 (default)
      head    0001.patch    0001+0002.patch
x86   .827    .737 (12%)    .....
arm   1.098   .912 (20%)    .861 (28%)

B) siglen=700 (... USING gist (a tsvector_ops(siglen=700)))
      head    0001.patch    0001+0002.patch
x86   1.121   .847 (32%)    .....
arm   1.751   1.191 (47%)   1.062 (65%)

--
Thanks,
-Amit Khandekar
Huawei Technologies
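A standalone sketch of the 0002 arrangement described in the first message (the names pg_popcount64_nonasm and USE_POPCNT_ASM follow that description; the configure machinery and the x86 runtime-dispatch branch are omitted, so treat this as an outline, not the patch):

```c
#include <stdint.h>

/*
 * When no runtime CPUID selection is needed, let pg_popcount64 resolve
 * at compile time to a static inline function, so there is no
 * function-pointer dereference on every call.
 */
#ifndef USE_POPCNT_ASM

static inline int
pg_popcount64_nonasm(uint64_t word)
{
#if defined(__GNUC__) || defined(__clang__)
	return __builtin_popcountll(word);
#else
	/* slow fallback: clear the lowest set bit until none remain */
	int			n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
#endif
}

#define pg_popcount64(x) pg_popcount64_nonasm(x)

#endif							/* !USE_POPCNT_ASM */
```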
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like to connect the recently committed GiST ordered build discussion here [1] and the further improvement proposed in [2].
I've tested feature [1] and got a 2.5-3x speed improvement, which is much better, I believe. There is ongoing activity [2] to build support for different data types for GiST. Maybe you will find it interesting to join.
BTW, you may have heard about GIN and RUM [3] indexes, which suit text search much, much better (and faster) than GiST. The idea of processing data in bigger chunks is good. Still, optimizing the index structure, minimizing disk page accesses, etc. seems better in many cases.
Thank you for your proposal!
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Hi, Amit!
> It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]
>
> I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.

Yeah, I am completely new to the GiST stuff, but I had taken a quick look at the sortsupport feature for GiST, and found it very interesting. I believe it's an additional option for making GiST index builds much faster. But then I thought that my small patch would still be worthwhile, because for tsvector types the non-sort method for index build would continue to be used by users, and in general we can extend this small optimization to other GiST types as well.

> There is an ongoing activity [2] to build support for different data types for GiST. Maybe you will consider it interesting to join.
>
> BTW you may have heard about Gin and Rum [3] indexes which suit text search much, much better (and faster) than GiST. The idea to process data in bigger chunks is good. Still optimize index structure, minimizing disc pages access, etc. seems better in many cases.

Sure. Thanks for the pointers.

--
Thanks,
-Amit Khandekar
Huawei Technologies
> On 13 Dec 2020, at 17:46, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>> Hi, Amit!
>> It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]
>>
>> I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.
>
> Yeah, I am completely new to the GIST stuff, but I had taken a quick
> look at the sortsupport feature for GIST, and found it very
> interesting. I believe it's an additional option for making the gist
> index builds much faster.

+1

This will make all INSERTs and UPDATEs for tsvector's GiSTs faster, too.

Also, I really like the idea of taking advantage of hardware capabilities like __builtin_* etc. wherever possible.

Meanwhile, there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...

Thanks!

Best regards,
Andrey Borodin.
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> +1
> This will make all INSERTs and UPDATES for tsvector's GiSTs.

Oh, I didn't realize that this code is getting used in GiST index insertion and creation too. Will check there.

> Also I really like idea of taking advantage of hardware capabilities like __builtin_* etc wherever possible.

Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64: "cnt v0.8b, v0.8b"), hence there's all the more reason to use it. Over and above that, I had thought that if we could auto-vectorize the byte-by-byte XOR operation and the popcount() call using compiler optimizations, we would benefit further, but I didn't see any more improvement. I hoped for a benefit because that would have allowed us to process 128-bit or 256-bit chunks, since the vector registers are at least that long. Maybe gcc is not smart enough to translate __builtin_popcount() into a 128/256-bit vectorized instruction. But for the XOR operator, it does translate to 128-bit vectorized instructions (on arm64: "eor v2.16b, v2.16b, v18.16b").

> Meanwhile there are at least 4 incarnation of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...

Yes, I hope we get the benefit there also. Before that, I thought I should post the first use-case to get some early comments. Thanks for your encouraging comments :)
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > +1
> > This will make all INSERTs and UPDATES for tsvector's GiSTs.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal improvement. So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of the 0001 patch extends the hemdistsign() changes to the other use cases, like intarray, ltree and hstore. I see the same index-build improvement for all these types.

Since for the GiST index creation of some of these types the default value for siglen is small (8-20), I tested with small siglens. For siglens <= 20, particularly for values that are not multiples of 8 (e.g. 10, 13, etc.), I see a 1-7% reduction in the speed of index creation. It's probably because of the extra function call for pg_xorcount(), and also might be due to the extra logic in pg_xorcount(), which becomes prominent for shorter traversals. So for siglen less than 32, I kept the existing method using byte-by-byte traversal.

--
Thanks,
-Amit Khandekar
Huawei Technologies
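The dispatch described above might look roughly like this self-contained sketch (the 32-byte cutoff is taken from the description; the function name and the memcpy-based chunk loads are illustrative, not the patch's actual code):

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch: signatures shorter than 32 bytes take the plain byte-by-byte
 * path, where call and setup overhead would otherwise dominate; longer
 * ones go through 64-bit chunks first.
 */
static int
xorcount(const char *a, const char *b, size_t len)
{
	int			count = 0;
	size_t		i = 0;

	if (len >= 32)
	{
		/* 64-bit chunks; memcpy avoids alignment assumptions */
		for (; i + 8 <= len; i += 8)
		{
			uint64_t	wa,
						wb;

			memcpy(&wa, a + i, sizeof(wa));
			memcpy(&wb, b + i, sizeof(wb));
			count += __builtin_popcountll(wa ^ wb);
		}
	}

	/* short signatures entirely, plus any trailing remainder */
	for (; i < len; i++)
		count += __builtin_popcount((unsigned char) (a[i] ^ b[i]));

	return count;
}
```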
On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com> wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In that
> case, let's just keep the existing coding for those. In fact, the comments
> that got removed by your patch specifically say: /* Using the popcount
> functions here isn't likely to win */ ...which your testing confirmed. The
> new function should be used only for Gist and possibly Intarray, whose
> default is 124. That means we can't get rid of hemdistsign(), but that's
> fine.

The comment is there for all the types. Since I get better performance on all the types, I have kept the pg_xorcount() call for all these contrib modules. I understand that since for some types the default siglen is small, we won't get a benefit there, but I think we should call pg_xorcount() for the benefit of the non-default siglen case.

I have replaced hemdistsign() with pg_xorcount() everywhere; but with that, the call looks a bit clumsy because of having to cast the parameters to const unsigned char *. I noticed that the cast to "unsigned char *" is required only when we use the value as an index into the pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as "char *". So I changed the pg_xorcount() parameters from unsigned char * to char *.

> I think the way to go is a simplified version of your 0001 (not 0002), with
> only a single function, for gist and intarray only, and a style that better
> matches the surrounding code. If you look at my xor functions in the attached
> text file, you'll get an idea of what it should look like. Note that it got
> the above performance without ever trying to massage the pointer alignment.
> I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> maybe there's a simple fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match the surrounding code; specifically, the usage of *a++ rather than a[i].

Regarding the alignment changes: I have removed the code that handled the leading identically-unaligned bytes, for lack of evidence that the percentage of such cases is significant. As I noted earlier, for the tsearch data I used, identically-unaligned cases were only 6%. If I find scenarios where these cases are significant after all, and if we cannot do anything in the gist index code, then we might have to bring back the unaligned-byte handling. I didn't get a chance to dig deeper into the gist index implementation to see why the pointers are not always 8-byte aligned.

I have kept the 0002 patch as-is. Due to the significant *additional* speedup, over and above the 0001 improvement, I think the code rearrangement done there is worth it for non-x86 platforms.

-Amit Khandekar
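For reference, the byte-wise style being matched (the *a++ idiom, char * parameters, and a cast only at the table lookup) might look like this sketch; the lookup table is built at runtime here for self-containment, whereas PostgreSQL's pg_number_of_ones is a static const array:

```c
#include <stddef.h>

/* 256-entry popcount table in the style of pg_number_of_ones. */
static unsigned char number_of_ones[256];

static void
init_number_of_ones(void)
{
	for (int i = 0; i < 256; i++)
		number_of_ones[i] = (unsigned char) __builtin_popcount(i);
}

/*
 * Byte-wise XOR distance taking char * parameters, casting to unsigned
 * char only where the value indexes the lookup table, matching the
 * signature change described above.  Illustrative, not the patch.
 */
static int
xorcount_bytes(const char *a, const char *b, size_t len)
{
	int			count = 0;

	while (len--)
		count += number_of_ones[(unsigned char) (*a++ ^ *b++)];
	return count;
}
```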
On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Okay, so it's hard-coded in various functions in contrib modules. In that
> > case, let's just keep the existing coding for those. In fact, the comments
> > that got removed by your patch specifically say: /* Using the popcount
> > functions here isn't likely to win */ ...which your testing confirmed. The
> > new function should be used only for Gist and possibly Intarray, whose
> > default is 124. That means we can't get rid of hemdistsign(), but that's
> > fine.
>
> The comment is there for all types. Since I get the performance better
> on all the types, I have kept the pg_xorcount() call for all these
> contrib modules. I understand that since for some types default
> siglen is small, we won't get benefit. But I think we should call
> pg_xorcount() for the benefit of non-default siglen case.
The 1-7% degradation measured was from an earlier version, when pg_xorcount_long had a lot of finicky branching and computation. Is it still true in v3? We should answer that first. I'm interested in what happens if you now use pg_xorcount_long in the call sites, at least in the worst case 7% test.
> Have replaced hemdistsign() by pg_xorcount() everywhere; but with
> that, the call looks a bit clumsy because of having to type-cast the
> parameters to const unsigned char *. I noticed that the cast to
> "unsigned char *" is required only when we use the value in the
> pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
> "char *". So I changed the pg_xorcount() parameters from unsigned char
> * to char *.
That matches the style of that file, so +1.
> > I think the way to go is a simplified version of your 0001 (not 0002), with
> > only a single function, for gist and intarray only, and a style that better
> > matches the surrounding code. If you look at my xor functions in the attached
> > text file, you'll get an idea of what it should look like. Note that it got
> > the above performance without ever trying to massage the pointer alignment.
> > I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> > maybe there's a simple fix somewhere in the gist code.
>
> In the attached 0001-v3 patch, I have updated the code to match it
> with surrounding code; specifically the usage of *a++ rather than
> a[i].
>
> Regarding the alignment changes... I have removed the code that
> handled the leading identically unaligned bytes, for lack of evidence
> that percentage of such cases is significant. Like I noted earlier,
> for the tsearch data I used, identically unaligned cases were only 6%.
> If I find scenarios where these cases can be significant after all and
> if we cannot do anything in the gist index code, then we might have to
> bring back the unaligned byte handling. I didn't get a chance to dig
> deeper into the gist index implementation to see why they are not
> always 8-byte aligned.
I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.
> I have kept the 0002 patch as-is. Due to significant *additional*
> speedup, over and above the 0001 improvement, I think the code
> re-arrangement done is worth it for non-x86 platforms.
For the amount of code uglification involved, we should be getting full asm popcount support on Arm, not an attempt to kluge around the current implementation. I'd be happy to review such an effort for PG15, by the way.
Readability counts, and as it stands I don't feel comfortable asking a committer to read 0002.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, 20 Mar 2021 at 02:19, John Naylor <john.naylor@enterprisedb.com> wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type. It turns out it's the way the sign array elements are arranged that causes the pointers to be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	......
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	....
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some cases point to non-8-byte-aligned addresses, as you can see in the above code snippet.

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid of the misalignment. This change applied over the 0001-v3 patch gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit more space, but for not-so-small siglens this looks worth doing. I haven't yet checked types other than tsvector.
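The fix described above boils down to padding the per-element stride. A standalone sketch (MAXALIGN64 here is a stand-in for the PostgreSQL macro, and palloc is simplified to malloc):

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's MAXALIGN64: round up to a multiple of 8. */
#define MAXALIGN64(x) (((uint64_t) (x) + 7) & ~(uint64_t) 7)

/*
 * Sketch of the picksplit fix: lay out n sign arrays with an
 * 8-byte-aligned stride, so each element pointer stays 64-bit aligned
 * even when siglen itself is not a multiple of 8.
 */
static char **
alloc_sign_array(size_t siglen, size_t n)
{
	size_t		stride = MAXALIGN64(siglen);	/* was just siglen before */
	char	   *cache_sign = malloc(stride * n);
	char	  **sign = malloc(n * sizeof(char *));

	for (size_t j = 0; j < n; j++)
		sign[j] = &cache_sign[stride * j];
	return sign;
}
```

The space cost is at most 7 bytes per element, which is why it only looks worthwhile for not-so-small siglens.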
Will get back with your other review comments. I thought, meanwhile, I can post the above update first.
On Sun, Aug 1, 2021 at 11:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> > FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.
>
> I dug into this issue for tsvector type. Found out that it's the way
> in which the sign array elements are arranged that is causing the pointers to
> be misaligned:
[...]
> If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
> cases point to non-8-byte-aligned addresses, as you can see in the
> above code snippet.
>
> Replacing siglen by MAXALIGN64(siglen) in the above snippet gets rid
> of the misalignment. This change applied over the 0001-v3 patch gives
> additional ~15% benefit. MAXALIGN64(siglen) will cause a bit more
> space, but for not-so-small siglens, this looks worth doing. Haven't
> yet checked into types other than tsvector.
Sounds good.
> Will get back with your other review comments. I thought, meanwhile, I
> can post the above update first.
Thinking some more, my discomfort with inline functions that call a global function doesn't make logical sense, so feel free to do it that way if you like.
--
John Naylor
EDB: http://www.enterprisedb.com