Re: Speeding up GIST index creation for tsvectors - Mailing list pgsql-hackers

From Amit Khandekar
Subject Re: Speeding up GIST index creation for tsvectors
Msg-id CAJ3gD9ftbJ2Hjf2NJVO83J_8-soVGy2d=JgR91peUYDRfTFknQ@mail.gmail.com
In response to Re: Speeding up GIST index creation for tsvectors  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: Speeding up GIST index creation for tsvectors  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Sat, 20 Mar 2021 at 02:19, John Naylor <john.naylor@enterprisedb.com> wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type. It turns out that the way
the sign array elements are laid out is what causes the pointers to
be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
......
    cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
    cache_sign = palloc(siglen * (maxoff + 2));

    for (j = 0; j < maxoff + 2; j++)
        cache[j].sign = &cache_sign[siglen * j];
....
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change, applied over the 0001-v3 patch, gives
an additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
space, but for not-so-small siglens, this looks worth doing. I haven't
yet looked into types other than tsvector.

I will get back to you on your other review comments. Meanwhile, I
thought I would post the above update first.


