Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Parallel CREATE INDEX for GIN indexes
Date
Msg-id CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX=A@mail.gmail.com
In response to Re: Parallel CREATE INDEX for GIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Parallel CREATE INDEX for GIN indexes
List pgsql-hackers
On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > So I've switched this to use the regular data-type comparisons, with
> > SortSupport etc. There's a bit more cleanup remaining and testing
> > needed, but I'm not aware of any bugs.

I've hit assertion failures in my testing of the combined patches, in
AssertCheckItemPointers: it assumes it's never called when the buffer
is empty and uninitialized, but that's wrong: we don't initialize the
items array until the first tuple is added, so calling it on a
still-empty buffer makes the assertion fire. After updating the first
2 assertions of AssertCheckItemPointers to allow for that, I could get
it working.
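
To illustrate the kind of check meant here, below is a minimal standalone
sketch of a validity check that tolerates an empty, not-yet-allocated
buffer. It is not the code from the patch: the Tid type, tid_cmp() and
tids_are_valid() are hypothetical stand-ins for ItemPointerData and
AssertCheckItemPointers, and the function returns a bool instead of
asserting so that it can be exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct
{
	unsigned	block;
	unsigned	offset;
} Tid;

/* Three-way comparison of two TIDs, block first, then offset. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Returns true if the TID array is in a consistent state.
 *
 * An empty buffer is accepted even when items == NULL, because in this
 * sketch the array is only allocated once the first tuple is added.
 * A non-empty buffer must have an array, and its TIDs must be strictly
 * ascending.
 */
static bool
tids_are_valid(const Tid *items, int nitems)
{
	if (nitems < 0)
		return false;
	if (nitems == 0)
		return true;			/* empty and possibly uninitialized: fine */
	if (items == NULL)
		return false;

	for (int i = 1; i < nitems; i++)
	{
		if (tid_cmp(&items[i - 1], &items[i]) >= 0)
			return false;
	}
	return true;
}
```

The point of the sketch is only the early return for nitems == 0; the
remaining checks apply once at least one TID has been stored.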

> ---
> > +++ b/src/backend/utils/sort/tuplesortvariants.c
>
> I was thinking some more about merging tuples inside the tuplesort. I
> realized that this could be implemented by allowing buffering of tuple
> writes in writetup. This would require adding a flush operation at the
> end of mergeonerun to store the final unflushed tuple on the tape, but
> that shouldn't be too expensive. This buffering, when implemented
> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
> merge the TID lists of same-valued GIN tuples while they're getting
> stored and re-sorted, thus reducing the temporary space usage of the
> tuplesort by some amount with limited overhead for other
> non-deduplicating tuplesorts.
>
> I've not yet spent the time to get this to work though, but I'm fairly
> sure it'd use less temporary space than the current approach with the
> 2 tuplesorts, and could have lower overall CPU overhead as well
> because the number of sortable items gets reduced much earlier in the
> process.

I've now spent some time on this. Attached is the original patchset,
plus 2 incremental patches, the first of which implements the above
design (patch no. 8).
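
For readers who want the shape of the idea without opening the patch,
here is a rough standalone sketch of buffered writes with a final flush.
It is not the attached implementation: Entry, Tape, MergeBuffer,
buffered_write() and buffer_flush() are hypothetical names, keys are
plain ints, the "tape" is an in-memory array, and it assumes tuples
arrive sorted by key (as they would while writing out a sorted run).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TIDS 64				/* hypothetical cap on TIDs per entry */
#define TAPE_CAP 16				/* hypothetical capacity of the fake tape */

/* One index entry: a key plus the TIDs collected for it. */
typedef struct
{
	int			key;
	int			ntids;
	unsigned	tids[MAX_TIDS];
} Entry;

/* In-memory stand-in for a tuplesort tape. */
typedef struct
{
	Entry		out[TAPE_CAP];
	int			nout;
} Tape;

/* Holds the single entry that has been accepted but not yet written. */
typedef struct
{
	bool		has_pending;
	Entry		pending;
} MergeBuffer;

static void
tape_write(Tape *tape, const Entry *entry)
{
	assert(tape->nout < TAPE_CAP);
	tape->out[tape->nout++] = *entry;
}

/* Write out the pending entry, if any. Must be called at end of run. */
static void
buffer_flush(MergeBuffer *buf, Tape *tape)
{
	if (buf->has_pending)
	{
		tape_write(tape, &buf->pending);
		buf->has_pending = false;
	}
}

/*
 * Accept one (key, tid) tuple. Tuples with the same key as the pending
 * entry are merged into it; a different key (or a full TID list) first
 * flushes the pending entry to the tape.
 */
static void
buffered_write(MergeBuffer *buf, Tape *tape, int key, unsigned tid)
{
	if (buf->has_pending &&
		(buf->pending.key != key || buf->pending.ntids == MAX_TIDS))
		buffer_flush(buf, tape);

	if (!buf->has_pending)
	{
		buf->pending.key = key;
		buf->pending.ntids = 0;
		buf->has_pending = true;
	}
	buf->pending.tids[buf->pending.ntids++] = tid;
}
```

In this sketch, writing (1,10), (1,11), (2,20) and then flushing puts two
entries on the tape instead of three, which is where the reduction in
temporary space would come from.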

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, an
improvement of ~20%.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster in total. All of this was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
    FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)