Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers
From: Matthias van de Meent
Subject: Re: Parallel CREATE INDEX for GIN indexes
Msg-id: CAEze2WjWdkR8s4L3dQsjPkoB4mkW-X8fB7oOZr58QfiwQbpRTQ@mail.gmail.com
In response to: Re: Parallel CREATE INDEX for GIN indexes (Andy Fan <zhihuifan1213@163.com>)
List: pgsql-hackers
On Tue, 27 Aug 2024 at 12:15, Andy Fan <zhihuifan1213@163.com> wrote:
>
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> > And let's talk about the improvement by Matthias, namely:
> >
> > * 0008 Use a single GIN tuplesort
> > * 0009 Reduce the size of GinTuple by 12 bytes
> >
> > I haven't really seen any impact on duration - it seems more or less
> > within noise. Maybe it would be different on machines with less RAM,
> > but on my two systems it didn't really make a difference.
> >
> > It did significantly reduce the amount of temporary data written, by
> > ~40% or so. This is pretty nicely visible on the "trgm" case, which
> > generates the most temp files of the four indexes. An example from
> > the i5/32MB section looks like this:
> >
> >   label     0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
> >   ---------------------------------------------------------------------------
> >   trgm / 3     0  2635  3690  3715  1177  1177  1179  1179   696   682  1016
>
> After seeing the above data, I wanted to know where the time is spent
> and why the ~40% IO reduction doesn't make a measurable duration
> improvement, so I did the following test. [...]
>
> ==== Here is my definition for stages 1/2/3.
> stage 1:
>         reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
>                                            ginBuildCallbackParallel, state, scan);
>
>         /* write remaining accumulated entries */
>         ginFlushBuildState(state, index);
>
> stage 2:
>         _gin_process_worker_data(state, state->bs_worker_sort)
>
> stage 3:
>         tuplesort_performsort(state->bs_sortstate);

Note that tuplesort does most of its sort and IO work while receiving
tuples, which in this case happens during table_index_build_scan.
tuplesort_performsort usually only needs to flush the last elements of
a sort that it still has in memory, which is thus generally a cheap
operation bound by maintenance work memory, and definitely not
representative of the total cost of sorting the data. In certain rare
cases it may take longer, as it may have to merge sorted runs, but
those cases are quite rare in my experience.
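To make that concrete, here is a rough sketch of a tuplesort's
lifecycle (illustrative only: argument lists are elided, and
begin_datum is just one of several begin_* variants; see tuplesort.h
for the real signatures):

    Tuplesortstate *ts = tuplesort_begin_datum(...);

    /*
     * Scan phase: the build callback feeds tuples in one at a time.
     * Every time the in-memory buffer exceeds the sort's memory
     * budget, tuplesort sorts it and dumps that run to tape, so the
     * bulk of the sort's CPU and IO is spent while
     * table_index_build_scan is still running.
     */
    tuplesort_putdatum(ts, ...);        /* called once per tuple */

    /*
     * Usually just sorts/flushes whatever is still in memory (bounded
     * by the memory budget); only when there are too many on-tape
     * runs does it also have to merge here.
     */
    tuplesort_performsort(ts);

    /* Read back in sorted order, then release resources. */
    while (tuplesort_getdatum(ts, ...))
        ;
    tuplesort_end(ts);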
> But 0008 still does a lot of good stuff:
> 1. Reduce the IO usage; this would be even more useful on heavily
>    IO-bound workloads.
> 2. Simplify the building logic by removing one stage.
> 3. Add the 'buffer-writetup' to tuplesort.c. I don't have another use
>    case for now, but it looks like a reasonable design.

I'd imagine nbtree would like to use this too, for applying some
deduplication in the sort stage. The IO benefits are quite likely to
be worth it; a minimum space saving of 25% on duplicated key values in
tuple sorts sounds really great. And it doesn't even have to merge all
duplicates: even if you only merge 10 tuples at a time, the space
saving on those duplicates would be at least 47% on 64-bit systems.

> I think the current blocker is whether it is safe to hack
> tuplesort.c. With my current knowledge it looks good to me, but it
> would be better to open a dedicated thread to discuss this
> specifically; the review would not take long if someone who is
> experienced in this area took a look.

I could adapt the patch for nbtree use, to see if anyone's willing to
review that?

> > Now, what's the 0010 patch about?
> >
> > For some indexes (e.g. trgm), the parallel builds help a lot,
> > because they produce a lot of temporary data and the parallel sort
> > is a substantial part of the work. But for other indexes
> > (especially the "smaller" indexes on jsonb headers), it's not that
> > great. For example for "jsonb", having 3 workers shaves off only
> > ~25% of the time, not 75%.
> >
> > Clearly, this happens because a lot of time is spent outside the
> > sort, actually inserting data into the index.
>
> You can always focus on the most important part, which inspires me a
> lot; even in my simple testing, the "inserting data into index" stage
> takes 40% of the time.

nbtree does sorted insertions into the tree, constructing leaf pages
one at a time and adding separator keys in the page above when a leaf
page is filled, thus removing the need to descend the btree. I imagine
we can save some performance by mirroring that in GIN too, with the
additional bonus that we'd be free to start logging completed pages
before we're done with the full index, reducing max WAL throughput in
GIN index creation.

> > I think they might be complementary. I don't think it's reasonable
> > to expect GIN's BuildAccumulator to buffer all the index tuples at
> > the same time (as I mentioned upthread: we are or should be limited
> > by work memory), but the BuildAccumulator will do a much better job
> > at combining tuples than the in-memory sort + merge-write done by
> > Tuplesort (because BA will use (much?) less memory for the same
> > number of stored values).
>
> Thank you Matthias for valuing my point! And thanks for highlighting
> the benefit that BuildAccumulator can do a better job of sorting in
> memory (I think it is mainly because BuildAccumulator can do run-time
> merging as it accepts more tuples). But I'm still not willing to go
> further in this direction. Reasons are: a) It probably can't make a
> big difference in the final result. b) The best implementation of
> this idea would be allowing the user of tuplesort.c to insert
> pre-sorted tuples onto tape directly, rather than inserting them into
> tuplesort's memory and dumping them onto tape without a sort.

You'd still need to keep track of sorted runs on those tapes, which is
what tuplesort.c does for us.

> However I can't define a clean API for the former case.

I imagine a pair of tuplesort_beginsortedrun();
tuplesort_endsortedrun() functions to help with this, but I'm not 100%
sure if we'd want to expose Tuplesort to non-PG sorting algorithms, as
it would be one easy way to create incorrect results if the sort used
in tuplesort isn't exactly equivalent to the sort used by the provider
of the tuples.
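In rough shape, something like this (purely hypothetical -- these two
functions exist in no patch today, and ts stands in for an already
created Tuplesortstate):

    /*
     * Hypothetical API: between the begin/end calls the caller
     * promises that tuples arrive in the sort's own order, so
     * tuplesort could append them directly to a new tape run and skip
     * its in-memory sort step entirely.
     */
    extern void tuplesort_beginsortedrun(Tuplesortstate *state);
    extern void tuplesort_endsortedrun(Tuplesortstate *state);

    tuplesort_beginsortedrun(ts);
    while (...)                       /* e.g. draining BuildAccumulator,
                                       * which keeps its entries sorted */
        tuplesort_putdatum(ts, ...);  /* existing put path, args elided */
    tuplesort_endsortedrun(ts);

If the producer's order differs even slightly from tuplesort's, the
result would be silently wrong, which is exactly the risk mentioned
above.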
> c) create-index is maintenance work; improving it by 30% would be
> good, but if we just improve it by <3%, it doesn't look very charming
> in practice.

I think improving 3% on reindex operations can be well worth the
effort. Also, do note that the current patch does (still) not
correctly handle [maintenance_]work_mem: every backend's
BuildAccumulator uses up to work_mem of memory here, while the
launched tuplesorts use an additional maintenance_work_mem of memory,
for a total of (workers + 1) * work_mem + m_w_m of memory usage (e.g.
with 3 workers, work_mem = 32MB and maintenance_work_mem = 256MB,
that's 4 * 32MB + 256MB = 384MB). The available memory should instead
be allocated between tuplesort and BuildAccumulator, but can probably
mostly be allocated to just BuildAccumulator if we can dump the data
into the tuplesort directly, as that would reduce the overall number
of operations and memory allocations for the tuplesort. I think that
once we correctly account for memory allocations (and have an improved
write path) we'll be able to see a meaningfully larger performance
improvement.

> So my opinion is: if we can reach agreement on 0008, then we can do a
> final review/test of the existing code (including 0009), and leave
> further improvements to a dedicated patch.

As mentioned above, I think I could update the patch for an nbtree
implementation that also has immediate benefits, if so desired?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)