Re: Reduce build times of pg_trgm GIN indexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Reduce build times of pg_trgm GIN indexes
Date
Msg-id 6d16b6bd-a1ff-4469-aefb-a1c8274e561a@iki.fi
In response to Re: Reduce build times of pg_trgm GIN indexes  (David Geier <geidav.pg@gmail.com>)
Responses Re: Reduce build times of pg_trgm GIN indexes
List pgsql-hackers
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>>> setup code, such as calling InitFunctionCallInfoData().
>>
>> You lose the check for NULL result with this. That's probably still
>> worth checking.
> 
> It seems like existing code where all args are not null still has that
> safety check. Added it for consistency.
> 
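For reference, a minimal sketch of the pattern being discussed here:
initialize the FunctionCallInfo once, reuse it for every comparison via
FunctionCallInvoke(), and keep the NULL-result check. The CmpCallState
struct and the cmp_setup()/cmp_call() names are invented for
illustration; this is not the actual patch.

#include "postgres.h"
#include "fmgr.h"

typedef struct CmpCallState
{
    FunctionCallInfo fcinfo;    /* initialized once, reused per call */
} CmpCallState;

static void
cmp_setup(CmpCallState *state, FmgrInfo *flinfo, Oid collation)
{
    state->fcinfo = palloc0(SizeForFunctionCallInfo(2));
    InitFunctionCallInfoData(*state->fcinfo, flinfo, 2, collation,
                             NULL, NULL);
    state->fcinfo->args[0].isnull = false;
    state->fcinfo->args[1].isnull = false;
}

static int32
cmp_call(CmpCallState *state, Datum a, Datum b)
{
    FunctionCallInfo fcinfo = state->fcinfo;
    Datum       result;

    fcinfo->args[0].value = a;
    fcinfo->args[1].value = b;
    fcinfo->isnull = false;

    result = FunctionCallInvoke(fcinfo);

    /* comparison functions should never return NULL, but keep the check */
    if (fcinfo->isnull)
        elog(ERROR, "function %u returned NULL", fcinfo->flinfo->fn_oid);

    return DatumGetInt32(result);
}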
>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>>> ginExtractEntries() deduplicates and sorts the entries returned from the
>>> extract value function. In case of pg_trgm, that is completely redundant
>>> because the trigrams are already deduplicated and sorted. The current
>>> version of this patch is just to demonstrate the potential. We need to
>>> think about what we want here. Ideally, we would require the extraction
>>> function to provide the entries deduplicated and sorted. Alternatively,
>>> we could indicate to ginExtractEntries() if the entries are already
>>> deduplicated and sorted. If we don't want to alter the signature of the
>>> extract value function, we could e.g. use the MSB of the nentries
>>> argument.
>>
>> Yeah, this seems wrong as it is. You're assuming that if the extract
>> function returns nullFlags == NULL, the array is already sorted and
>> deduped.
> 
> As said, that was just to demonstrate the possible gains. I've changed
> the code now such that the extractValue function of the GIN index can
> indicate, via the third argument uniqueAndSorted, whether the returned
> keys are already unique and sorted.
> 
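As a purely hypothetical illustration of the idea (the flag name and
shape are invented here, not taken from the patch): letting the
key-extraction step report that its output is already unique and sorted
allows the caller to skip the sort/dedup pass entirely.

#include <stdbool.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int va = *(const int *) a;
    int vb = *(const int *) b;

    return (va > vb) - (va < vb);
}

/* Returns the number of distinct keys left at the front of "keys". */
static size_t
prepare_keys(int *keys, size_t nkeys, bool uniqueAndSorted)
{
    size_t i, j;

    if (uniqueAndSorted || nkeys <= 1)
        return nkeys;           /* the extractor already did the work */

    qsort(keys, nkeys, sizeof(int), cmp_int);

    /* in-place dedup of adjacent equal keys, as qunique() does */
    for (i = 1, j = 0; i < nkeys; i++)
    {
        if (cmp_int(&keys[i], &keys[j]) != 0)
            keys[++j] = keys[i];
    }
    return j + 1;
}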
> Unfortunately, it seems like this patch regresses performance. See the
> measurements below. I haven't had the time to investigate why that is.
> It's pretty counterintuitive, given that this patch effectively only
> removes code. Maybe you could re-test patch 0004 and share your runtimes?

Looking at the 0001 and 0004 patches and ginExtractEntries(), the way
ginExtractEntries() handles NULLs looks a little inefficient. It treats
NULLs as proper entries, passing them through qsort() for deduplication
and comparing them with cmpEntries(). But the end result is always that
if the opclass's extractValueFn() returned any NULLs, then there's
exactly one GIN_CAT_NULL_KEY as the last entry of the result array.
Surely we could be smarter about how we accomplish that. Your 0004 patch
eliminates the deduplication overhead altogether, which is great of
course, but the point remains for when we still need the deduplication.

Attached is an attempt at that. It avoids the null-checks in 
cmpEntries(), saving a few cycles. That's drowned in noise with your 
test cases, but with the attached test case with int arrays, I'm seeing 
a 1-2 % gain. That's not much, but I think it's still worth doing 
because it makes the code a little simpler too, IMHO. (I didn't test it 
together with the rest of your patches.)
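The attached patch isn't reproduced here, but as a rough illustration of
the general idea (not the actual attachment): strip NULLs out before
sorting so the comparator never sees them, then append at most one
GIN_CAT_NULL_KEY entry at the end. This sketch assumes by-value int4
keys to keep the comparator trivial; the function and variable names are
invented.

#include "postgres.h"
#include "access/ginblock.h"    /* GinNullCategory, GIN_CAT_* */
#include "lib/qunique.h"

static int
cmp_int4_datum(const void *a, const void *b)
{
    int32 va = DatumGetInt32(*(const Datum *) a);
    int32 vb = DatumGetInt32(*(const Datum *) b);

    return (va > vb) - (va < vb);
}

/* Sort and dedup non-null keys; emit a single trailing GIN_CAT_NULL_KEY. */
static Datum *
dedup_with_nulls_last(Datum *entries, bool *nullFlags, int32 *nentries,
                      GinNullCategory **categories)
{
    int32 nkeys = 0;
    bool  haveNull = false;

    /* move non-null keys to the front, remembering whether we saw a NULL */
    for (int32 i = 0; i < *nentries; i++)
    {
        if (nullFlags && nullFlags[i])
            haveNull = true;
        else
            entries[nkeys++] = entries[i];
    }

    /* no null checks needed in the comparator anymore */
    qsort(entries, nkeys, sizeof(Datum), cmp_int4_datum);
    nkeys = (int32) qunique(entries, nkeys, sizeof(Datum), cmp_int4_datum);

    *categories = palloc(sizeof(GinNullCategory) * (nkeys + 1));
    for (int32 i = 0; i < nkeys; i++)
        (*categories)[i] = GIN_CAT_NORM_KEY;

    /* at most one GIN_CAT_NULL_KEY, always sorted last */
    if (haveNull)
    {
        entries[nkeys] = (Datum) 0;
        (*categories)[nkeys] = GIN_CAT_NULL_KEY;
        nkeys++;
    }

    *nentries = nkeys;
    return entries;
}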

>> Perhaps we should introduce a new qunique_eq() function with a different
>> callback signature:
>>
>> /* like qunique(), but the callback function returns true/false rather
>> than int */
>> static inline size_t
>> qunique_eq(void *array, size_t elements, size_t width,
>>          bool (*equal) (const void *, const void *))
>>
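(A sketch of what that could look like, modeled on the existing
qunique() in src/include/lib/qunique.h, with a bool-returning equality
callback instead of a three-way comparator:)

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* like qunique(), but the callback answers "equal?" instead of ordering */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
           bool (*equal) (const void *, const void *))
{
    char   *bytes = (char *) array;
    size_t  i, j;

    if (elements <= 1)
        return elements;

    /* keep the first element of each run of adjacent equal elements */
    for (i = 1, j = 0; i < elements; i++)
    {
        if (!equal(bytes + i * width, bytes + j * width) && ++j != i)
            memcpy(bytes + j * width, bytes + i * width, width);
    }

    return j + 1;
}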
> 
> I would prefer to change qunique() instead. That would enforce using an
> adequate comparison function from the get-go. There are only ~15 calls
> to qunique(), so refactoring this should also be a fairly small patch. I
> can do that if there's agreement on that approach.

Works for me.

At a quick glance, most if not all of the qunique() callers call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function instead. It could perhaps do some deduplication on
the fly, or apply other optimizations.

>> Did you measure how big is the impact from each individual patch?
>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
>> they make any difference on their own.
> 
> Here is the impact of each patch. I again ran CREATE INDEX three times
> and took the fastest run. Each patch's run includes all previous
> patches as well. For example, the timings for patch 0003 were measured
> with a binary that also had patches 0002 and 0001 applied. To get the
> impact of each patch in isolation, the delta to the previous run was taken.
> 
> Code                                | movies | delta | lineitem |  delta
> ------------------------------------|--------|-------|----------|-------
> master                              | 10,311 |     0 |  256,986 |      0
> v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
> v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
> v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
> v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
> v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
> 
> Attached is v2 of the patch set with the aforementioned changes. I've
> also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
> Kirill.

Thanks, I pushed patch 0001 now; that's a simple and clear win.

- Heikki
