Thread: Speeding up GIST index creation for tsvectors
Hi,

In hemdistsign() of tsgistidx.c, if we process the bits in 64-bit chunks rather than byte-by-byte, we get an overall speed-up in GiST index creation for tsvector types. With the default siglen (124), the speed-up is 12-20%. With siglen=700, it is 30-50%. So with longer signature lengths, we get a higher percentage speed-up.

The attached patch 0001 has the required changes. Rather than applying the XOR operator to char values, XOR is applied to 64-bit chunks, and since the chunks are 64-bit, popcount64() is used on each of them. I have checked that the two bitvector pointer arguments of hemdistsign() are not always 64-bit aligned, so the leading misaligned bits and the trailing remainder bits are processed char-by-char, leaving the middle 64-bit chunks for popcount64().

We might extend this to the hemdistsign() definitions at other places in the code, but for now we can start with GiST. I haven't tried the other places.

-------------

While working on this, I observed that on platforms other than x86_64, we still declare pg_popcount64() as a function pointer, even though we don't use the runtime selection of the right function using __get_cpuid() as is done on x86. The other patch, 0002, is a general optimization that avoids this function pointer for pg_popcount32/64() calls. The patch arranges for a direct function call so as to get rid of the function-pointer dereference each time pg_popcount32/64() is called. To do this, pg_popcount64 is defined to another function name (pg_popcount64_nonasm) rather than a function pointer whenever USE_POPCNT_ASM is not defined. And pg_popcount64_nonasm() is made a static inline function, so that whenever pg_popcount64() is called, __builtin_popcount() gets called directly. For platforms not supporting __builtin_popcount(), the slow version continues to be used, as is the current behaviour.
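As a rough standalone illustration of the 0001 idea (not the patch itself — the function name is made up, and this sketch sidesteps the alignment question entirely with memcpy rather than the leading-byte handling described above):

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the 0001 idea: XOR the two signatures in 64-bit chunks and
 * popcount each chunk, finishing any trailing remainder byte by byte.
 * Illustrative only; the real patch also handles leading misaligned
 * bytes char-by-char, whereas memcpy here avoids alignment assumptions.
 */
static int
xorcount_chunked(const char *a, const char *b, size_t len)
{
	int			count = 0;
	size_t		i = 0;

	/* middle: process 8 bytes at a time */
	for (; i + 8 <= len; i += 8)
	{
		uint64_t	wa,
					wb;

		memcpy(&wa, a + i, sizeof(wa));
		memcpy(&wb, b + i, sizeof(wb));
		count += __builtin_popcountll(wa ^ wb);
	}

	/* trailing remainder: byte by byte */
	for (; i < len; i++)
		count += __builtin_popcount((unsigned char) (a[i] ^ b[i]));

	return count;
}
```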
Tested this 0002 patch on ARM64, with patch 0001 already applied, and the GiST index creation for tsvectors *further* speeds up by 6% for the default siglen (=124), and by 12% with siglen=700.

-------------

Schema :

CREATE TABLE test_tsvector(t text, a tsvector);
-- Attached tsearch.data (a bigger version of
-- src/test/regress/data/tsearch.data)
\COPY test_tsvector FROM 'tsearch.data';

Test case that shows improvement :

CREATE INDEX wowidx6 ON test_tsvector USING gist (a);

Time taken by the above create-index command, in seconds, along with % speed-up w.r.t. HEAD :

A) siglen=124 (default)
      head    0001.patch    0001+0002.patch
x86   .827    .737 (12%)    .....
arm   1.098   .912 (20%)    .861 (28%)

B) siglen=700 (... USING gist (a tsvector_ops(siglen=700)))
      head    0001.patch    0001+0002.patch
x86   1.121   .847 (32%)    .....
arm   1.751   1.191 (47%)   1.062 (65%)

--
Thanks,
-Amit Khandekar
Huawei Technologies
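A standalone sketch of the 0002 arrangement described in the first message (the names pg_popcount64_nonasm and USE_POPCNT_ASM follow that description; the configure machinery and the x86 runtime-dispatch branch are omitted, so treat this as an outline, not the patch):

```c
#include <stdint.h>

/*
 * When no runtime CPUID selection is needed, let pg_popcount64 resolve
 * at compile time to a static inline function, so there is no
 * function-pointer dereference on every call.
 */
#ifndef USE_POPCNT_ASM

static inline int
pg_popcount64_nonasm(uint64_t word)
{
#if defined(__GNUC__) || defined(__clang__)
	return __builtin_popcountll(word);
#else
	/* slow fallback: clear the lowest set bit until none remain */
	int			n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
#endif
}

#define pg_popcount64(x) pg_popcount64_nonasm(x)

#endif							/* !USE_POPCNT_ASM */
```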
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like to connect the recently committed GiST ordered build discussion here [1] and the further improvement proposed in [2].
I've tested feature [1] and got a 2.5-3x speed improvement, which is much better, I believe. There is ongoing activity [2] to build support for different data types for GiST. Maybe you will find it interesting to join.
BTW, you may have heard about GIN and RUM [3] indexes, which suit text search much, much better (and faster) than GiST. The idea of processing data in bigger chunks is good. Still, optimizing the index structure, minimizing disk page accesses, etc. seems better in many cases.
Thank you for your proposal!
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Hi, Amit!
> It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]
>
> I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.

Yeah, I am completely new to the GiST stuff, but I had taken a quick look at the sortsupport feature for GiST, and found it very interesting. I believe it's an additional option for making GiST index builds much faster. But then I thought that my small patch would still be worthwhile, because for tsvector types the non-sort method for index build would continue to be used by users, and in general we can extend this small optimization to other GiST types as well.

> There is an ongoing activity [2] to build support for different data types for GiST. Maybe you will consider it interesting to join.
>
> BTW you may have heard about Gin and Rum [3] indexes which suit text search much, much better (and faster) than GiST. The idea to process data in bigger chunks is good. Still optimize index structure, minimizing disc pages access, etc. seems better in many cases.

Sure. Thanks for the pointers.

--
Thanks,
-Amit Khandekar
Huawei Technologies
> On 13 Dec 2020, at 17:46, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>> Hi, Amit!
>> It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]
>>
>> I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.
>
> Yeah, I am completely new to the GIST stuff, but I had taken a quick
> look at the sortsupport feature for GIST, and found it very
> interesting. I believe it's an additional option for making the gist
> index builds much faster.

+1

This will make all INSERTs and UPDATEs for tsvector's GiSTs faster, too.

Also, I really like the idea of taking advantage of hardware capabilities like __builtin_* etc. wherever possible.

Meanwhile, there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...

Thanks!

Best regards,
Andrey Borodin.
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> +1
> This will make all INSERTs and UPDATES for tsvector's GiSTs.

Oh, I didn't realize that this code is getting used in GiST index insertion and creation too. Will check there.

> Also I really like idea of taking advantage of hardware capabilities like __builtin_* etc wherever possible.

Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64: "cnt v0.8b, v0.8b"), hence there's all the more reason to use it. Over and above that, I had thought that if we could auto-vectorize the byte-by-byte XOR operation and the popcount() call using compiler optimizations, we would benefit further, but I didn't see any more improvement. I hoped for a benefit because that would have allowed us to process 128-bit or 256-bit chunks, since the vector registers are at least that long. Maybe gcc is not smart enough to translate __builtin_popcount() into a 128/256-bit vectorized instruction. But for the XOR operator, it does translate to 128-bit vectorized instructions (on arm64: "eor v2.16b, v2.16b, v18.16b").

> Meanwhile there are at least 4 incarnation of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...

Yes, I hope we get the benefit there also. Before that, I thought I should post the first use-case to get some early comments. Thanks for your encouraging comments :)
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > +1
> > This will make all INSERTs and UPDATES for tsvector's GiSTs.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal improvement. So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of the 0001 patch extends the hemdistsign() changes to the other use cases, like intarray, ltree and hstore. I see the same index-build improvement for all these types.

Since for the GiST index creation of some of these types the default value for siglen is small (8-20), I tested with small siglens. For siglens <= 20, particularly for values that are not multiples of 8 (e.g. 10, 13, etc.), I see a 1-7% reduction in the speed of index creation. It's probably because of the extra function call for pg_xorcount(), and also might be due to the extra logic in pg_xorcount(), which becomes prominent for shorter traversals. So for siglen less than 32, I kept the existing method using byte-by-byte traversal.

--
Thanks,
-Amit Khandekar
Huawei Technologies
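The dispatch described above might look roughly like this self-contained sketch (the 32-byte cutoff is taken from the description; the function name and the memcpy-based chunk loads are illustrative, not the patch's actual code):

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch: signatures shorter than 32 bytes take the plain byte-by-byte
 * path, where call and setup overhead would otherwise dominate; longer
 * ones go through 64-bit chunks first.
 */
static int
xorcount(const char *a, const char *b, size_t len)
{
	int			count = 0;
	size_t		i = 0;

	if (len >= 32)
	{
		/* 64-bit chunks; memcpy avoids alignment assumptions */
		for (; i + 8 <= len; i += 8)
		{
			uint64_t	wa,
						wb;

			memcpy(&wa, a + i, sizeof(wa));
			memcpy(&wb, b + i, sizeof(wb));
			count += __builtin_popcountll(wa ^ wb);
		}
	}

	/* short signatures entirely, plus any trailing remainder */
	for (; i < len; i++)
		count += __builtin_popcount((unsigned char) (a[i] ^ b[i]));

	return count;
}
```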
On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com> wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In that
> case, let's just keep the existing coding for those. In fact, the comments
> that got removed by your patch specifically say: /* Using the popcount
> functions here isn't likely to win */ ...which your testing confirmed. The
> new function should be used only for Gist and possibly Intarray, whose
> default is 124. That means we can't get rid of hemdistsign(), but that's
> fine.

The comment is there for all the types. Since I get better performance on all the types, I have kept the pg_xorcount() call for all these contrib modules. I understand that since for some types the default siglen is small, we won't get a benefit there, but I think we should call pg_xorcount() for the benefit of the non-default siglen case.

I have replaced hemdistsign() with pg_xorcount() everywhere; but with that, the call looks a bit clumsy because of having to cast the parameters to const unsigned char *. I noticed that the cast to "unsigned char *" is required only when we use the value as an index into the pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as "char *". So I changed the pg_xorcount() parameters from unsigned char * to char *.

> I think the way to go is a simplified version of your 0001 (not 0002), with
> only a single function, for gist and intarray only, and a style that better
> matches the surrounding code. If you look at my xor functions in the attached
> text file, you'll get an idea of what it should look like. Note that it got
> the above performance without ever trying to massage the pointer alignment.
> I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> maybe there's a simple fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match the surrounding code; specifically, the usage of *a++ rather than a[i].

Regarding the alignment changes: I have removed the code that handled the leading identically-unaligned bytes, for lack of evidence that the percentage of such cases is significant. As I noted earlier, for the tsearch data I used, identically-unaligned cases were only 6%. If I find scenarios where these cases are significant after all, and if we cannot do anything in the gist index code, then we might have to bring back the unaligned-byte handling. I didn't get a chance to dig deeper into the gist index implementation to see why the pointers are not always 8-byte aligned.

I have kept the 0002 patch as-is. Due to the significant *additional* speedup, over and above the 0001 improvement, I think the code rearrangement done there is worth it for non-x86 platforms.

-Amit Khandekar
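For reference, the byte-wise style being matched (the *a++ idiom, char * parameters, and a cast only at the table lookup) might look like this sketch; the lookup table is built at runtime here for self-containment, whereas PostgreSQL's pg_number_of_ones is a static const array:

```c
#include <stddef.h>

/* 256-entry popcount table in the style of pg_number_of_ones. */
static unsigned char number_of_ones[256];

static void
init_number_of_ones(void)
{
	for (int i = 0; i < 256; i++)
		number_of_ones[i] = (unsigned char) __builtin_popcount(i);
}

/*
 * Byte-wise XOR distance taking char * parameters, casting to unsigned
 * char only where the value indexes the lookup table, matching the
 * signature change described above.  Illustrative, not the patch.
 */
static int
xorcount_bytes(const char *a, const char *b, size_t len)
{
	int			count = 0;

	while (len--)
		count += number_of_ones[(unsigned char) (*a++ ^ *b++)];
	return count;
}
```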
On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Okay, so it's hard-coded in various functions in contrib modules. In that
> > case, let's just keep the existing coding for those. In fact, the comments
> > that got removed by your patch specifically say: /* Using the popcount
> > functions here isn't likely to win */ ...which your testing confirmed. The
> > new function should be used only for Gist and possibly Intarray, whose
> > default is 124. That means we can't get rid of hemdistsign(), but that's
> > fine.
>
> The comment is there for all types. Since I get the performance better
> on all the types, I have kept the pg_xorcount() call for all these
> contrib modules. I understand that since for some types default
> siglen is small, we won't get benefit. But I think we should call
> pg_xorcount() for the benefit of non-default siglen case.
The 1-7% degradation measured was from an earlier version, when pg_xorcount_long had a lot of finicky branching and computation. Is it still true in v3? We should answer that first. I'm interested in what happens if you now use pg_xorcount_long in the call sites, at least in the worst case 7% test.
> Have replaced hemdistsign() by pg_xorcount() everywhere; but with
> that, the call looks a bit clumsy because of having to type-cast the
> parameters to const unsigned char *. I noticed that the cast to
> "unsigned char *" is required only when we use the value in the
> pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
> "char *". So I changed the pg_xorcount() parameters from unsigned char
> * to char *.
That matches the style of that file, so +1.
> > I think the way to go is a simplified version of your 0001 (not 0002), with
> > only a single function, for gist and intarray only, and a style that better
> > matches the surrounding code. If you look at my xor functions in the attached
> > text file, you'll get an idea of what it should look like. Note that it got
> > the above performance without ever trying to massage the pointer alignment.
> > I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> > maybe there's a simple fix somewhere in the gist code.
>
> In the attached 0001-v3 patch, I have updated the code to match it
> with surrounding code; specifically the usage of *a++ rather than
> a[i].
>
> Regarding the alignment changes... I have removed the code that
> handled the leading identically unaligned bytes, for lack of evidence
> that percentage of such cases is significant. Like I noted earlier,
> for the tsearch data I used, identically unaligned cases were only 6%.
> If I find scenarios where these cases can be significant after all and
> if we cannot do anything in the gist index code, then we might have to
> bring back the unaligned byte handling. I didn't get a chance to dig
> deeper into the gist index implementation to see why they are not
> always 8-byte aligned.
I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.
> I have kept the 0002 patch as-is. Due to significant *additional*
> speedup, over and above the 0001 improvement, I think the code
> re-arrangement done is worth it for non-x86 platforms.
For the amount of code uglification involved, we should be getting full asm popcount support on Arm, not an attempt to kluge around the current implementation. I'd be happy to review such an effort for PG15, by the way.
Readability counts, and as it stands I don't feel comfortable asking a committer to read 0002.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, 20 Mar 2021 at 02:19, John Naylor <john.naylor@enterprisedb.com> wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type. It turns out it's the way the sign array elements are arranged that causes the pointers to be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	......
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	....
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some cases point to non-8-byte-aligned addresses, as you can see in the above code snippet.

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid of the misalignment. This change applied over the 0001-v3 patch gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit more space, but for not-so-small siglens this looks worth doing. I haven't yet checked types other than tsvector.
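The fix described above boils down to padding the per-element stride. A standalone sketch (MAXALIGN64 here is a stand-in for the PostgreSQL macro, and palloc is simplified to malloc):

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's MAXALIGN64: round up to a multiple of 8. */
#define MAXALIGN64(x) (((uint64_t) (x) + 7) & ~(uint64_t) 7)

/*
 * Sketch of the picksplit fix: lay out n sign arrays with an
 * 8-byte-aligned stride, so each element pointer stays 64-bit aligned
 * even when siglen itself is not a multiple of 8.
 */
static char **
alloc_sign_array(size_t siglen, size_t n)
{
	size_t		stride = MAXALIGN64(siglen);	/* was just siglen before */
	char	   *cache_sign = malloc(stride * n);
	char	  **sign = malloc(n * sizeof(char *));

	for (size_t j = 0; j < n; j++)
		sign[j] = &cache_sign[stride * j];
	return sign;
}
```

The space cost is at most 7 bytes per element, which is why it only looks worthwhile for not-so-small siglens.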
Will get back with your other review comments. I thought, meanwhile, I can post the above update first.
On Sun, Aug 1, 2021 at 11:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> > FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.
>
> I dug into this issue for tsvector type. Found out that it's the way
> in which the sign array elements are arranged that is causing the pointers to
> be misaligned:
[...]
> If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
> cases point to non-8-byte-aligned addresses, as you can see in the
> above code snippet.
>
> Replacing siglen by MAXALIGN64(siglen) in the above snippet gets rid
> of the misalignment. This change applied over the 0001-v3 patch gives
> additional ~15% benefit. MAXALIGN64(siglen) will cause a bit more
> space, but for not-so-small siglens, this looks worth doing. Haven't
> yet checked into types other than tsvector.
Sounds good.
> Will get back with your other review comments. I thought, meanwhile, I
> can post the above update first.
Thinking some more, my discomfort with inline functions that call a global function doesn't make logical sense, so feel free to do it that way if you like.
--
John Naylor
EDB: http://www.enterprisedb.com