Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Msg-id: 2e86059f-4004-eec1-2bc1-9e7fd119c9e2@enterprisedb.com
In response to: Re: Parallel CREATE INDEX for BRIN indexes (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
On 7/11/23 23:11, Matthias van de Meent wrote:
> On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/5/23 16:33, Matthias van de Meent wrote:
>>> ...
>>>
>>>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>>>> do, and the little I knew I managed to forget since I wrote this patch.
>>>> Which similar features do you have in mind?
>>>
>>> I was referring to the feature that is "emitting a single sorted run
>>> of tuples at the leader backend based on data gathered in parallel
>>> worker backends". It manages the sort state, on-disk runs etc. so that
>>> you don't have to manage that yourself.
>>>
>>> Adding a new storage format for what is effectively a logical tape
>>> (logtape.{c,h}) and manually merging it seems like a lot of changes if
>>> that functionality is readily available, standardized and optimized in
>>> sortsupport; and adds an additional place to manually go through for
>>> disk-related changes like TDE.
>>>
>>
>> Here's a new version of the patch, with three main changes:
>
> Thanks! I've done a review on the patch, and most of it looks good. Some
> places need cleanup and polish, some others need more documentation, and
> there are some other issues, but so far it's looking OK.
>
>> One thing I was wondering about is whether it might be better to allow
>> the workers to process overlapping ranges, and then let the leader
>> merge the summaries. That would mean we might not need the tableam.c
>> changes at all, but the leader would need to do more work (although
>> BRIN indexes tend to be fairly small). The main reason that got me
>> thinking about this is that we have pretty much no tests for the union
>> procedures, because triggering that is really difficult. But for
>> parallel index builds that'd be much more common.
>
> Hmm, that's a good point.
> I don't mind either way, but it would add
> overhead in the leader to do all of that merging - especially when you
> configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
> need to merge up to parallel_workers tuples. That could be a
> significant overhead.
>
> ... thinks a bit.
>
> Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
> unlikely to be a serious problem - the per-backend IO saved of such
> large ranges on a single backend has presumably much more impact than
> the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
> with me.
>

As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually considers
the chunk_factor (i.e. pages_per_range) *after* doing

    pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
                                      PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it would
not need to merge anything. Now, that might have been a bad idea, and
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE should perhaps be considered, in which
case this *has* to do the union, even if only for the rare corner case.
But I don't think that's a major issue - it's pretty much certain that
summarizing the tuples is way more expensive than merging the summaries.
Which is what matters for Amdahl's law ...

> Review follows below.
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech/)
>
> -----------
>
>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
>
>> +    BrinShared *brinshared;
>
> Needs some indentation fixes.
>
>> +    int         bs_reltuples;
>> [...]
>> +    state->bs_reltuples += reltuples;
>
> My IDE warns me that reltuples is a double. Looking deeper into the
> value, it contains the number of live tuples in the table, so this
> conversion may not result in a meaningful value for tables with >=2^31
> live tuples. Tables > 56GB could begin to get affected by this.
>
>> +    int         bs_worker_id;
>
> This variable seems to be unused.
>
>> +    BrinSpool  *bs_spool;
>> +    BrinSpool  *bs_spool_out;
>
> Are both used? If so, could you add comments why we have two spools
> here, instead of only one?
>

OK, I admit I'm not sure both are actually necessary. I was struggling to
get it working with just one spool, because when the leader participates
as a worker, it needs to both summarize some of the chunks (and put the
tuples somewhere), and it also needs to consume the final output.

Maybe it's just a case of cargo cult programming - I was mostly copying
stuff from the btree index build, trying to make it work, and then with
two spools it started working.

>> +/*
>> + * A version of the callback, used by parallel index builds. The main difference
>> + * is that instead of writing the BRIN tuples into the index, we write them into
>> + * a shared tuplestore file, and leave the insertion up to the leader (which may
>
> + ... shared tuplesort, and ...
>
>> brinbuildCallbackParallel(...)
>> +    while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
>
> Shouldn't this be an 'if' now?
>

Hmmm, probably ... that way we'd skip the empty ranges.

>> +    while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
>> +        state->bs_currRangeStart += state->bs_pagesPerRange;
>
> Is there a reason why you went with iterative addition instead of a
> single divide-and-multiply like the following?
>
> +    state->bs_currRangeStart += state->bs_pagesPerRange *
>          ((state->bs_currRangeStart - thisblock) / state->bs_pagesPerRange);
>

Probably laziness ... You're right the divide-and-multiply seems simpler.

>> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
>> [...]
>> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
>> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
>> [...]
>> -    /* compare phs_syncscan initialization to similar logic in initscan */
>> +    bpscan->phs_chunk_factor = chunk_factor;
>> +    /* compare phs_syncscan initialization to similar logic in initscan
>> +     *
>> +     * Disable sync scans if the chunk factor is set (valid block number).
>> +     */
>
> I think this needs some pgindent or other style work, both on comment
> style and line lengths.
>

Right.

>> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
>> [...]
>> +    Assert(false); (x3)
>
> I think these can be cleaned up, right?
>

Duh! Absolutely, this shouldn't have been in the patch at all. I only
added those to quickly identify places that got the tuplesort into an
unexpected state (much easier with a coredump and a backtrace).

>> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
>> [...]
>> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
>> + * the length for tuples referenced by SortTuple. That's what BrinSortTuple is meant
>> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
>> + * BrinTuple to the logtapes, though.
>
> Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?
>

Not sure I understand. We ultimately do, because we write (length +
BrinTuple), and BrinSortTuple is exactly that. But if we wrote the
BrinSortTuple as a whole, we would end up writing its length field too,
no? Or maybe I just don't understand how that would make things simpler.

>> +    tuplesort_puttuple_common(state, &stup,
>> +                              base->sortKeys &&
>> +                              base->sortKeys->abbrev_converter &&
>> +                              !stup.isnull1);
>
> Can't this last argument just be inlined, based on knowledge that we
> don't use sortKeys in brin?
>

What does "inlined" mean for an argument? But yeah, I guess it might just
be set to false. And we should probably have an assert that there are no
sortKeys.
>> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
>> +                      Tuplesortstate *state)
>> +{
>> +    BrinTuple  *tuple1;
>> [...]
>> +    tuple1 = &((BrinSortTuple *) a)->tuple;
>> [...]
>
> I'm fairly sure that this cast (and its neighbour) is incorrect and
> should be the following instead:
>
> +    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;
>
> Additionally, I think the following would be a better approach here,
> as we wouldn't need to do pointer-chasing:
>

Uh, right. This only works because 'tuple' happens to be the first field
in SortTuple.

> + static int
> + comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                       Tuplesortstate *state)
> + {
> +     Assert(TuplesortstateGetPublic(state)->haveDatum1);
> +
> +     if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
> +         return 1;
> +     if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
> +         return -1;
> +     /* silence compilers */
> +     return 0;
> + }
>

Good idea! I forgot we're guaranteed to have datum1.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company