Re: Reduce build times of pg_trgm GIN indexes - Mailing list pgsql-hackers

From: David Geier
Subject: Re: Reduce build times of pg_trgm GIN indexes
Msg-id: 7c874742-93f5-4e7f-9a01-958488763210@gmail.com
In response to: Re: Reduce build times of pg_trgm GIN indexes (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers

On 09.01.2026 19:36, Heikki Linnakangas wrote:
>>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>>>> ginExtractEntries() deduplicates and sorts the entries returned from
>>>> the
>>>> extract value function. In case of pg_trgm, that is completely
>>>> redundant
>>>> because the trigrams are already deduplicated and sorted. The current
>>>> version of this patch is just to demonstrate the potential. We need to
>>>> think about what we want here. Ideally, we would require the extraction
>>>> function to provide the entries deduplicated and sorted. Alternatively,
>>>> we could indicate to ginExtractEntries() if the entries are already
>>>> deduplicated and sorted. If we don't want to alter the signature of the
>>>> extract value function, we could e.g. use the MSB of the nentries
>>>> argument.
>>>
>>> Yeah, this seems wrong as it is. You're assuming that if the extract
>>> function returns nullFlags == NULL, the array is already sorted and
>>> deduped.
>>
>> As said, that was just meant to demonstrate the possible gains. I've
>> changed the code now such that the extractValue function of the GIN
>> index can indicate via the third argument uniqueAndSorted whether the
>> returned keys are already unique and sorted.
>>
>> Unfortunately, it seems like this patch regresses performance. See
>> measurements below. I haven't had the time to investigate why that is.
>> It's pretty counterintuitive, given that this patch effectively only
>> removes code. Maybe you could re-test patch 0004 and share your runtimes?
> 
> Looking at 0001 and 0004 patches and ginExtractEntries(), the way
> ginExtractEntries() handles NULLs looks a little inefficient. It treats
> NULLs as proper entries, passing them through qsort() for deduplication
> and comparing them with cmpEntries(). But the end result is always that
> if the opclass's extractValueFn() function returned any NULLs, then
> there's exactly one GIN_CAT_NULL_KEY as the last entry of the result
> array. Surely we could be smarter about how we accomplish that. Your
> 0004 patch eliminates the deduplication overhead altogether, which is
> great of course, but the point remains for when we still need the
> deduplication.

Good observation. I like this idea. I've focused on pg_trgm, but making
this code faster is certainly useful beyond that.

Given that sorting already-sorted input apparently doesn't add
measurable overhead, according to my benchmark results, we can apply
your patch and leave mine out for the time being.

That is, by the way, also the reason why 0002 doesn't show much gain:
when the data is pre-sorted, cmpEntries() is not called as often.
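
Spelled out, for readers following along, the idea boils down to
something like this self-contained analogy (untested sketch; plain ints
stand in for GIN entries and a sentinel value for GIN_CAT_NULL_KEY):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/* Compact the non-null values to the front, sort and deduplicate only
 * those, then append at most one sentinel. Returns the new length. */
static size_t
sort_dedup_nulls_last(int *vals, const bool *isnull, size_t n,
                      int null_sentinel)
{
    bool    has_null = false;
    size_t  m = 0;
    size_t  j = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (isnull && isnull[i])
            has_null = true;
        else
            vals[m++] = vals[i];
    }

    qsort(vals, m, sizeof(int), cmp_int);

    /* remove adjacent duplicates from the sorted prefix */
    for (size_t i = 1; i < m; i++)
    {
        if (vals[i] != vals[j] && ++j != i)
            vals[j] = vals[i];
    }
    m = (m > 0) ? j + 1 : 0;

    /* a single sentinel represents all NULLs and always sorts last */
    if (has_null)
        vals[m++] = null_sentinel;

    return m;
}
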
> 
> Attached is an attempt at that. It avoids the null-checks in
> cmpEntries(), saving a few cycles. That's drowned in noise with your
> test cases, but with the attached test case with int arrays, I'm seeing
> a 1-2 % gain. That's not much, but I think it's still worth doing
> because it makes the code a little simpler too, IMHO. (I didn't test it
> together with the rest of your patches.)

Performance is death by a thousand cuts, and that's definitely the case
for the GIN index code. I'm all for putting in these small improvements
because they add up: what's a 2% gain now can become 10% once other
optimizations are in.

I took a look at your patch. Overall it looks good to me. Just a few comments:

1) You should be able to create the categories array without the need
for the subsequent for loop as follows:

StaticAssertStmt(GIN_CAT_NORM_KEY == 0, "Assuming GIN_CAT_NORM_KEY=0");
*categories = palloc0_array(GinNullCategory, nentries + (hasNull ? 1 : 0));
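
With palloc0_array() zeroing the allocation and the assertion
guaranteeing GIN_CAT_NORM_KEY == 0, every slot already holds
GIN_CAT_NORM_KEY. Assuming nentries counts only the non-NULL entries,
the only remaining explicit assignment would be the trailing NULL key,
along the lines of (untested):

if (hasNull)
    (*categories)[nentries] = GIN_CAT_NULL_KEY;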

2) Your test case seems suboptimal for showing the gains: the arrays
don't contain any NULL values and are also already sorted. Or am I
missing something?

3) Could you try your patch in conjunction with 0002 on data that is not
pre-sorted and then check if 0002 has more impact? That way cmpEntries()
should be called much more often.

4) Have you checked if there are regression tests that exercise this
code? If not, how about adding some?

>>> Perhaps we should introduce a new qunique_eq() function with a different
>>> callback signature:
>>>
>>> /* like qunique(), but the callback function returns true/false rather
>>> than int */
>>> static inline size_t
>>> qunique_eq(void *array, size_t elements, size_t width,
>>>          bool (*equal) (const void *, const void *))
>>>
>>
>> I would prefer to change qunique() instead. That would enforce using an
>> adequate comparison function from the get-go. There are only ~15 calls
>> to qunique(). So refactoring this should also be a fairly small patch. I
>> can do that if there's agreement for that approach.
> 
> Works for me.
> 
> At quick glance, most if not all of the qunique() calls call qsort()
> just before qunique(). I wonder if we should have a single "sort and
> deduplicate" function, instead. It could perhaps do some deduplication
> "on the go", or other optimizations.

If it's just for deduplication and the data doesn't have to end up
sorted, something based on a hash table should be even faster. How about
we start by changing the qunique() comparator signature and, as a second
step, take a closer look at how to provide a function that does both in
one go?

If you agree, I'll share a patch here next week.
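
To sketch the shape I have in mind (untested, modeled on the existing
qunique() in src/include/lib/qunique.h, just with a boolean equality
callback instead of a three-way comparator):

static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
           bool (*equal) (const void *, const void *))
{
    char   *bytes = (char *) array;
    size_t  i,
            j;

    if (elements <= 1)
        return elements;

    for (i = 1, j = 0; i < elements; ++i)
    {
        /* keep element i only if it differs from the last kept one */
        if (!equal(bytes + i * width, bytes + j * width) && ++j != i)
            memcpy(bytes + j * width, bytes + i * width, width);
    }

    return j + 1;
}

Callers would then only need a cheap equality check instead of a full
three-way ordering, which is where the savings should come from.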

>>> Did you measure how big is the impact from each individual patch?
>>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
>>> they make any difference on their own.
>>
>> Here is the impact of each patch. I again ran CREATE INDEX three times
>> and took the fastest run. The run of each patch includes all previous
>> patches as well. For example, the timings for patch 0003 were measured
>> with a binary that also had patches 0001 and 0002 applied. To get the
>> impact of each patch in isolation, the delta to the previous run was
>> taken.
>>
>> Code                                | movies | delta | lineitem | delta
>> ------------------------------------|--------|-------|----------|-------
>> master                              | 10,311 | 0     | 256,986  | 0
>> v1-0001-Inline-ginCompareAttEntries |  9,694 | 617   | 239,778  | 17,208
>> v1-0002-Optimized-comparison-func   |  9,510 | 184   | 238,094  |  1,684
>> v1-0003-Use-sort_template.h         |  8,661 | 849   | 231,190  |  6,904
>> v1-0004-Avoid-dedup-and-sort-in     |  9,305 | -644  | 232,472  | -1,282
>> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 | 228,387  |  4,085
>> v1-0006-Use-radix-sort              |  6,976 | 1,264 | 207,687  | 20,700
>> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 | 203,744  |  3,943
>> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 | 161,469  | 42,275
>>
>> Attached is v2 of the patch set with the aforementioned changes. I've
>> also fixed the whitespace errors in 0003, 0004 and 0008, as reported
>> by Kirill.
> 
> Thanks, I pushed patch 0001 now, that's a simple and clear win.

Great. Thanks.

What about the other patches? 0003 and 0007 are also pretty simple and
IMHO uncontroversial while giving decent savings.

For 0006, I would make the code also work on platforms where plain char
is unsigned. That's still missing.
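
To illustrate what I mean (a hypothetical helper for illustration, not
code from the patch): extracting the radix bytes through unsigned char
keeps the bucket index in [0, 255] regardless of the platform's char
signedness:

static inline unsigned char
radix_byte(const void *key, size_t byte_idx)
{
    /* cast via unsigned char so a "negative" byte of a signed char
     * buffer cannot produce a negative bucket index */
    return ((const unsigned char *) key)[byte_idx];
}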

Any thoughts about 0008 and my findings regarding the lower-case
conversion for ASCII? Adding a flag to pg_locale_struct that indicates
whether it's safe to use ASCII-style tolower should be straightforward,
and then the code should be correct.
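
For illustration, a minimal sketch of the fast path I have in mind
(hypothetical; the is_ascii_tolower_safe flag on pg_locale_struct is an
assumption for this sketch, not existing code):

static inline char
ascii_tolower_char(char c)
{
    /* locale-independent lowering, valid for plain ASCII letters only */
    return (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
}

/* usage sketch:
 *     if (locale->is_ascii_tolower_safe)
 *         dst[i] = ascii_tolower_char(src[i]);
 *     else
 *         fall back to the existing locale-aware conversion
 */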

--
David Geier


