Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date:
Msg-id: 2e86059f-4004-eec1-2bc1-9e7fd119c9e2@enterprisedb.com
In response to: Re: Parallel CREATE INDEX for BRIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes
List: pgsql-hackers

On 7/11/23 23:11, Matthias van de Meent wrote:
> On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/5/23 16:33, Matthias van de Meent wrote:
>>> ...
>>>
>>>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>>>> do, and the little I knew I managed to forget since I wrote this patch.
>>>> Which similar features do you have in mind?
>>>
>>> I was referring to the feature that is "emitting a single sorted run
>>> of tuples at the leader backend based on data gathered in parallel
>>> worker backends". It manages the sort state, on-disk runs etc. so that
>>> you don't have to manage that yourself.
>>>
>>> Adding a new storage format for what is effectively a logical tape
>>> (logtape.{c,h}) and manually merging it seems like a lot of changes if
>>> that functionality is readily available, standardized and optimized in
>>> sortsupport; and adds an additional place to manually go through for
>>> disk-related changes like TDE.
>>>
>>
>> Here's a new version of the patch, with three main changes:
> 
> Thanks! I've done a review of the patch, and most of it looks good. Some
> places need cleanup and polish, others need more documentation, and
> there are some other issues, but so far it's looking OK.
> 
>> One thing I was wondering about is whether it might be better to allow
>> the workers to process overlapping ranges, and then let the leader
>> merge the summaries. That would mean we might not need the tableam.c
>> changes at all, but the leader would need to do more work (although the
>> BRIN indexes tend to be fairly small). The main reason that got me
>> thinking about this is that we have pretty much no tests for the union
>> procedures, because triggering that is really difficult. But for
>> parallel index builds that'd be much more common.
> 
> Hmm, that's a good point. I don't mind either way, but it would add
> overhead in the leader to do all of that merging - especially when you
> configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
> need to merge up to parallel_workers tuples. That could be a
> significant overhead.
> 
> ... thinks a bit.
> 
> Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
> unlikely to be a serious problem - the per-backend IO saved by keeping
> such large ranges on a single backend presumably has much more impact
> than merging n_parallel_tasks max-sized BRIN tuples. So, seems fine
> with me.
> 

As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
considers the chunk_factor (i.e. pages_per_range) *after* doing

    pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
                                      PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
would not need to merge anything.
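
Roughly, the ordering is something like this (only a sketch of the idea,
not the exact patch code - in particular the phs_chunk_factor handling
here is made up):

    pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
                                      PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

    /*
     * Consider the chunk_factor only after the clamp above - use a
     * multiple of chunk_factor, but at least one whole chunk_factor,
     * so a single page range never spans two chunks.
     */
    if (pbscan->phs_chunk_factor > 0)
        pbscanwork->phsw_chunk_size = pbscan->phs_chunk_factor *
            Max(pbscanwork->phsw_chunk_size / pbscan->phs_chunk_factor, 1);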

Now, that might have been a bad idea, and maybe the clamp to
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE should win after all. In that case the
leader *has* to do the union, even if only for this rare corner case.

But I don't think that's a major issue - it's pretty much certain that
summarizing the tuples is way more expensive than merging the summaries,
which is what matters for Amdahl's law ...
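
(In Amdahl's terms: with parallel fraction p and N workers, the speedup
is bounded by 1 / ((1 - p) + p / N), so as long as the serial merging in
the leader stays a tiny fraction of the total work, it barely moves the
needle.)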


> Review follows below.
> 
> Kind regards,
> 
> Matthias van de Meent
> Neon (https://neon.tech/)
> 
> -----------
> 
>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> 
>> +    BrinShared       *brinshared;
> 
> Needs some indentation fixes.
> 
>> +    int            bs_reltuples;
>> [...]
>> +    state->bs_reltuples += reltuples;
> 
> My IDE warns me that reltuples is a double. Looking deeper into the
> value, it contains the number of live tuples in the table, so this
> conversion may not result in a meaningful value for tables with >=2^31
> live tuples. Tables > 56GB could begin to get affected by this.
> 
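True, reltuples is a double, so the counter should be one too - just the
type change, I believe:

    double        bs_reltuples;
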
>> +    int            bs_worker_id;
> 
> This variable seems to be unused.
> 
>> +    BrinSpool  *bs_spool;
>> +    BrinSpool  *bs_spool_out;
> 
> Are both used? If so, could you add comments why we have two spools
> here, instead of only one?
> 

OK, I admit I'm not sure both are actually necessary. I was struggling
to get it working with just one spool: when the leader participates as a
worker, it needs to both summarize some of the chunks (putting the tuples
somewhere) and also consume the final output.

Maybe it's just a case of cargo cult programming - I was mostly copying
stuff from the btree index build, trying to make it work, and then with
two spools it started working.

>> +/*
>> + * A version of the callback, used by parallel index builds. The main difference
>> + * is that instead of writing the BRIN tuples into the index, we write them into
>> + * a shared tuplestore file, and leave the insertion up to the leader (which may
> 
> + ... shared tuplesort, and ...
> 
>> brinbuildCallbackParallel(...)
>> +    while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> 
> shouldn't this be an 'if' now?
> 

Hmmm, probably ... that way we'd skip the empty ranges.

>> +        while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
>> +            state->bs_currRangeStart += state->bs_pagesPerRange;
> 
> Is there a reason why you went with iterative addition instead of a
> single divide-and-multiply like the following?:
> 
> +        state->bs_currRangeStart += state->bs_pagesPerRange *
> +            ((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);
> 

Probably laziness ... You're right, the divide-and-multiply seems simpler.
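
For the archive, a trivial standalone check (not patch code, block
numbers made up) that the loop and the divide-and-multiply forms agree:

    #include <assert.h>

    static unsigned
    advance_loop(unsigned start, unsigned ppr, unsigned block)
    {
        while (block > start + ppr - 1)
            start += ppr;
        return start;
    }

    static unsigned
    advance_div(unsigned start, unsigned ppr, unsigned block)
    {
        if (block > start + ppr - 1)
            start += ppr * ((block - start) / ppr);
        return start;
    }

    int
    main(void)
    {
        for (unsigned start = 0; start <= 64; start += 8)
            for (unsigned block = start; block < 1024; block++)
                assert(advance_loop(start, 8, block) ==
                       advance_div(start, 8, block));
        return 0;
    }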

>> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
>> [...]
>> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
>> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
>> [...]
>> -    /* compare phs_syncscan initialization to similar logic in initscan */
>> +    bpscan->phs_chunk_factor = chunk_factor;
>> +    /* compare phs_syncscan initialization to similar logic in initscan
>> +     *
>> +     * Disable sync scans if the chunk factor is set (valid block number).
>> +     */
> 
> I think this needs some pgindent or other style work, both on comment
> style and line lengths
> 

Right.

>> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
>> [...]
>> +            Assert(false); (x3)
> 
> I think these can be cleaned up, right?
> 

Duh! Absolutely, this shouldn't have been in the patch at all. I only
added those to quickly identify places that got the tuplesort into an
unexpected state (much easier with a coredump and a backtrace).

>> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
>> [...]
>> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
>> + * the length of the tuple referenced by the SortTuple. That's what BrinSortTuple is meant
>> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
>> + * BrinTuple to the logtapes, though.
> 
> Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?
> 

Not sure I understand. We ultimately do, because we write

    (length + BrinTuple)

and BrinSortTuple is exactly that. But if we wrote the BrinSortTuple
itself, we'd end up writing the length twice, no?

Or maybe I just don't understand how that would make things simpler.
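
To make the on-tape format concrete, the write path looks roughly like
this (a simplified sketch, not the exact patch code - the field names
are assumptions, and the trailing length word needed for random access
is omitted):

    static void
    writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
                        SortTuple *stup)
    {
        BrinSortTuple *bstup = (BrinSortTuple *) stup->tuple;
        unsigned int   tuplen;

        /* per tuplesort convention, the length word counts itself too */
        tuplen = bstup->tuplen + sizeof(tuplen);

        /* write the length word, followed by just the BrinTuple */
        LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
        LogicalTapeWrite(tape, &bstup->tuple, bstup->tuplen);
    }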

>> +    tuplesort_puttuple_common(state, &stup,
>> +                              base->sortKeys &&
>> +                              base->sortKeys->abbrev_converter &&
>> +                              !stup.isnull1);
> 
> Can't this last argument just be inlined, based on knowledge that we
> don't use sortKeys in brin?
> 

What does "inlined" mean for an argument? But yeah, I guess it might be
just set to false. And we should probably have an assert that there are
no sortKeys.
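
Something like this, I suppose (assuming BRIN tuplesorts never set
sortKeys):

    Assert(base->sortKeys == NULL);

    tuplesort_puttuple_common(state, &stup, false);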

>> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
>> +                      Tuplesortstate *state)
>> +{
>> +    BrinTuple  *tuple1;
>> [...]
>> +    tuple1 = &((BrinSortTuple *) a)->tuple;
>> [...]
> 
> I'm fairly sure that this cast (and it's neighbour) is incorrect and
> should be the following instead:
> 
> +    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;
> 
> Additionally, I think the following would be a better approach here,
> as we wouldn't need to do pointer-chasing:
> 

Uh, right. This only works because 'tuple' happens to be the first field
in SortTuple.

> + static int
> + comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                      Tuplesortstate *state)
> + {
> +    Assert(TuplesortstateGetPublic(state)->haveDatum1);
> +
> +    if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
> +        return 1;
> +    if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
> +        return -1;
> +     /* silence compilers */
> +    return 0;
> + }
> 

Good idea! I forgot we're guaranteed to have datum1.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


