Hi hackers,
Attached is a series of patches which gradually reduce the time it takes
to create GIN indexes. Most of the gains come from optimizing the
trigram extraction code in pg_trgm. A few small optimizations apply to
any GIN index operator class.
The changes are motivated by the time it takes to create GIN indexes on
large production tables, especially, on columns with long strings. Even
with multiple parallel maintenance workers I've seen this taking hours.
For testing purposes I've used two different data sets:
1. The l_comment column of the TPC-H SF 10 lineitem table. l_comment
contains relatively short strings with a minimum of 10, a maximum of 43
and an average of ~27 characters.
2. The plots from a collection of movies from Wikipedia. The plots are
much longer than l_comment, with a minimum of 15, a maximum of 36,773
and an average of ~2,165 characters. The CSV file can be downloaded here
[1].
Testing both cases is important because a big part of the trigram
extraction is spent on removing duplicates. The longer the string, the
more duplicates are usually encountered.
The script I used for testing is attached. I ran CREATE INDEX three
times and took the fastest run. I'm getting the following results on my
i9-13950HX dev laptop in release build:
Data set | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 3,409 | 10,311 | 3.02x
lineitem(l_comment) | 161,569 | 256,986 | 1.59x
The attached patches do the following:
- v1-0001-Inline-ginCompareAttEntries.patch: Inline
ginCompareAttEntries() which is very frequently called by the GIN code.
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().
- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
qsort(), to inline calls to the sort comparator. This is an interim step
that is further improved on by patch 0006.
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries argument.
- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
btint4cmp(), which is heavily called from the GIN code. This might as
well have benefit in other parts of the code base.
v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
with radix sort. For the purpose of demonstrating the possible gains,
I've only replaced the signed variant for now. I've also tried using
simplehash.h for deduplicating followed by a sort_template.h-based sort.
But that was slower.
v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
optimization can be done in plenty of other places in our code base.
Likely, in most of them the gains are insignificant.
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.
With above changes, the majority of the runtime is now spent on
inserting the trigrams into the GIN index via ginInsertBAEntry(). The
code in master uses a red-black for further deduplication and sorting.
Traversing the red-black tree and updating it is pretty slow. I haven't
looked through all the code yet, but it seems to me that we would be
better off replacing the red-black tree with a sort and/or a hash map.
But I'll leave this as future work for now.
[1]
https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv
--
David Geier