Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Parallel CREATE INDEX for BRIN indexes
Date
Msg-id CAEze2WjugXkONAe9RZSgoVipHX5m6c0LnDZ=FVgWs25Fk-zX2A@mail.gmail.com
Whole thread Raw
In response to Re: Parallel CREATE INDEX for BRIN indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Parallel CREATE INDEX for BRIN indexes
List pgsql-hackers
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.
>
> 0001 is just v2, rebased to current master
>
> 0002 and 0003 address most of the issues, in particular it
>
>   - removes the unnecessary spool
>   - fixes bs_reltuples type to double
>   - a couple comments are reworded to be clearer
>   - changes loop/condition in brinbuildCallbackParallel
>   - removes asserts added for debugging
>   - fixes cast in comparetup_index_brin
>   - 0003 then simplifies comparetup_index_brin
>   - I haven't inlined the tuplesort_puttuple_common parameter
>     (didn't seem worth it)

OK, thanks

> 0004 Reworks how the work is divided between workers and combined by the
> leader. It undoes the tableam.c changes that attempted to divide the
> relation into chunks matching the BRIN ranges, and instead merges the
> results in the leader (using the BRIN "union" function).

That's OK.

> I haven't done any indentation fixes yet.
>
> I did fairly extensive testing, using pageinspect to compare indexes
> built with/without parallelism. More testing is needed, but it seems to
> work fine (with other opclasses and so on).

After code-only review, here are some comments:

> +++ b/src/backend/access/brin/brin.c
> [...]
> +/* Magic numbers for parallel state sharing */
> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
> [...]
> +    BrinSpool  *spool = state->bs_spool;
> [...]
> +    if (!state)
> +        return;

I think the assignment to spool should be moved to below this
condition, as _brin_begin_parallel calls this with state=NULL when it
can't launch parallel workers, which will cause issues here.

> +    state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

> +     * FIXME This probably needs some memory management fixes - we're reading
> +     * tuples from the tuplesort, we're allocating an emty tuple, and so on.
> +     * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continously creating then deleting the
same memory context.

> +++ b/src/backend/catalog/index.c
> [...]
> -        indexRelation->rd_rel->relam == BTREE_AM_OID)
> +        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
> +         indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new
IndexAmRoutine->amcanbuildparallel is more appropriate than this
hard-coded list - external indexes may want to utilize the parallel
index creation planner facility, too.


Some notes:
As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this. Even when this AM is mostly targetted towards the heap AM, other
AMs (such as those I've heard of that were developed internally at
EDB) use the same block-addressing that heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is tuplesort: one on form_and_spill_tuple, and one on
brinbuildCallbackParallel. Please update those comments.

That's it for code review. I'll do some performance comparisons and
testing soon, too.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Add recovery to pg_control and remove backup_label
Next
From: David Steele
Date:
Subject: Re: Add recovery to pg_control and remove backup_label