Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: Parallel CREATE INDEX for GIN indexes |
Date | |
Msg-id | CAEze2WjK=gReoEm2Kj-AvrH7x=+9AJX_Mb9bOR6g4n-c1JS_0w@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
List | pgsql-hackers |
On Sat, 4 Jan 2025 at 17:58, Tomas Vondra <tomas@vondra.me> wrote: > > On 11/24/24 19:04, Kirill Reshke wrote: > > On Tue, 8 Oct 2024 at 17:06, Tomas Vondra <tomas@vondra.me> wrote: > >> > >> On 10/8/24 04:03, Michael Paquier wrote: > >>> > >>> _gin_parallel_build_main() is introduced in 0001. Please make sure to > >>> pass down a query ID. > >> > >> Thanks for the ping. Here's an updated patch doing that, and also fixing > >> a couple whitespace issues. No other changes, but I plan to get back to > >> this patch soon - before the next CF. > >> > >> > >> regards > >> > >> -- > >> Tomas Vondra > > > > Hi! I was looking through this series of patches because thread of > > GIN&GIST amcheck patch references it. > > > > I have spotted this in gininsert.c: > > 1) > >> /* > >> * Store shared tuplesort-private state, for which we reserved space. > >> * Then, initialize opaque state using tuplesort routine. > >> */ > >> sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort); > >> tuplesort_initialize_shared(sharedsort, scantuplesortstates,> > > pcxt->seg); > >> /* > >> * Store shared tuplesort-private state, for which we reserved space. > >> * Then, initialize opaque state using tuplesort routine. > >> */ > > > > Is it necessary to duplicate the entire comment? > > > > Yes, that's a copy-paste mistake. Removed the second comment. > > > And, while we are here, isn't it " initialize the opaque state "? > > > > Not sure, this is copy pasted as-is from nbtree code. > > > 2) typo : > > * the TID array, and returning false if it's too large (more thant work_mem, > > > > Fixed. > > > 3) in _gin_build_tuple: > > > > .... > > else if (typlen == -2) > > keylen = strlen(DatumGetPointer(key)) + 1; > > else > > elog(ERROR, "invalid typlen"); > > > > > > Maybe `elog(ERROR, "invalid typLen: %d", typLen); ` as in `datumGetSize`? > > > > Makes sense, I reworded it a little bit. But it's however supposed to be > a can't happen condition. > > > 4) in _gin_compare_tuples: > > > >> if ((a->category == GIN_CAT_NORM_KEY) && > > (b->category == GIN_CAT_NORM_KEY)) > > > > maybe just a->category == GIN_CAT_NORM_KEY? a->category is already > > equal to b->category because of previous if statements. > > > > True. I've simplified the condition. > > > 5) In _gin_partition_sorted_data: > > > >> char fname[128]; > >> sprintf(fname, "worker-%d", i); > > > > Other places use MAXPGPATH in similar cases. > > > > OK, fixed the two places that format worker-%d. > > > Also, code `sprintf(fname, "worker-%d",...);` duplicates. This might > > be error-prone. Should we have a macro/inline function for this? > > > > Maybe. I think using a constant might be a good idea, but anything more > complicated is not worth it. There's only two places using it, not very > far apart. > > > I will take another look later, maybe reporting real problems, not nit-picks. > > > > Thanks. Attached is a rebased patch series fixing those issues, and one > issue I found in an AssertCheckGinBuffer, which was calling the other > assert (AssertCheckItemPointers) even for empty buffers. I think this > part might need some more work, so that it's clear what the various > asserts assume (or rather to allow just calling AssertCheckGinBuffer > everywhere, with some flags). Thanks for the rebase. > 0001 In gininsert.c, I think I'd prefer GinBuildShared over GinShared. While current GIN infrastructure doesn't do parallel index scans (and I can't think of an easy way to parallelize it) I think this it's better to make clear that this isn't related to index scan. > + * mutex protects all fields before heapdesc. This comment is still inaccurate. > + /* FIXME likely duplicate with indtuples */ I think this doesn't have to be duplicate, as we can distinguish between number of heap tuples and the number of GIN (key, TID) pairs loaded. This distinction doesn't really exist anywhere else, though, so to expose this to users we may need changes in pg_stat_progress_create_index. While I haven't checked if that distinction is being made in the code, I think it would be a useful distinction to have. > GinBufferInit This seems to depend on the btree operator classes to get sortsupport functions, bypassing the GIN compare support function (support function 1) and adding a dependency on the btree opclasses for indexable types. This can cause "bad" ordering, or failure to build the index when the parallel path is chosen and no default btree opclass is defined for the type. I think it'd be better if we allowed users to specify which sortsupport function to use, or at least use the correct compare function when it's defined on the attribute's operator class. > include/access/gin_tuple.h > + OffsetNumber attrnum; /* attnum of index key */ I think this would best be AttrNumber-typed? Looks like I didn't notice or fix that in 0009. > My plan is to eventually commit the first couple patches, possibly up > 0007 or even 0009. Sounds good. I'll see if I have some time to do some cleanup on my patches (0008 and 0009), as they need some better polish on the comments and commit messages. Kind regards, Matthias van de Meent Neon (https://neon.tech)
pgsql-hackers by date: