Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers

From:           Andy Fan
Subject:        Re: Parallel CREATE INDEX for GIN indexes
Date:
Msg-id:         87a5gyqnl5.fsf@163.com
In response to: Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses:      Re: Parallel CREATE INDEX for GIN indexes
                Re: Parallel CREATE INDEX for GIN indexes
List:           pgsql-hackers
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> Hi,
>
> I got to do the detailed benchmarking on the latest version of the patch
> series, so here's the results. My goal was to better understand the
> impact of each patch individually - especially the two parts introduced
> by Matthias, but not only - so I ran the test on a build with each of
> the 0001-0009 patches.
>
> This is the same test I did at the very beginning, but the basic details
> are that I have a 22GB table with archives of our mailing lists (1.6M
> messages, roughly), and I build a couple different GIN indexes on
> that:
..

Very impressive testing!

> And let's talk about the improvement by Matthias, namely:
>
> * 0008 Use a single GIN tuplesort
> * 0009 Reduce the size of GinTuple by 12 bytes
>
> I haven't really seen any impact on duration - it seems more or less
> within noise. Maybe it would be different on machines with less RAM, but
> on my two systems it didn't really make a difference.
>
> It did significantly reduce the amount of temporary data written, by
> ~40% or so. This is pretty nicely visible on the "trgm" case, which
> generates the most temp files of the four indexes. An example from the
> i5/32MB section looks like this:
>
>   label      0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
>   ----------------------------------------------------------------------------
>   trgm / 3      0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

After seeing the above data, I wanted to know where the time is spent
and why the ~40% IO reduction doesn't yield a measurable duration
improvement, so I ran the following test:

  create table gin_t (a int[]);
  insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
  select pg_prewarm('gin_t');

  postgres=# create index on gin_t using gin(a);
  INFO:  pid: 145078, stage 1 took 44476 ms
  INFO:  pid: 145177, stage 1 took 44474 ms
  INFO:  pid: 145078, stage 2 took 2662 ms
  INFO:  pid: 145177, stage 2 took 2611 ms
  INFO:  pid: 145177, stage 3 took 240 ms
  INFO:  pid: 145078, stage 3 took 239 ms

  CREATE INDEX
  Time: 79472.135 ms (01:19.472)

So stage 1 takes 56% of the execution time, stage 2 + stage 3 together
take ~3%, and the leader's work takes the remaining ~41%. I think that's
why we didn't see much performance improvement from 0008: it only
improves stage 2 and stage 3.

====
Here is my definition of stages 1/2/3:

stage 1:
  reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
                                     ginBuildCallbackParallel, state, scan);

  /* write remaining accumulated entries */
  ginFlushBuildState(state, index);

stage 2:
  _gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
  tuplesort_performsort(state->bs_sortstate);
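For completeness: the "stage N took ... ms" INFO lines above come from
throwaway instrumentation I added locally; it is not part of the patch
set. A minimal sketch of the stage-1 part, assuming it wraps the stage-1
code quoted above - GetCurrentTimestamp(), MyProcPid and
TimestampDifferenceMilliseconds() are existing backend facilities, the
message wording and placement are my own:

  /* needs utils/timestamp.h and miscadmin.h */
  TimestampTz stage_start = GetCurrentTimestamp();

  /* stage 1: scan the heap, then flush whatever is still accumulated */
  reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
                                     ginBuildCallbackParallel, state, scan);
  ginFlushBuildState(state, index);

  ereport(INFO,
          (errmsg("pid: %d, stage 1 took %ld ms",
                  MyProcPid,
                  TimestampDifferenceMilliseconds(stage_start,
                                                  GetCurrentTimestamp()))));

The other two stages are timed the same way, around the calls quoted
above.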
But 0008 still does a lot of good stuff:

1. It reduces the IO usage; this would be even more useful on heavily
   IO-bound workloads.
2. It simplifies the build logic by removing one stage.
3. It adds the 'buffer-writetup' support to tuplesort.c. I don't have
   another use case for it right now, but it looks like a reasonable
   design.

I think the current blocker is whether it is safe to hack tuplesort.c
like this. With my current knowledge it looks good to me, but it would
be better to open a dedicated thread to discuss this specifically; the
review should not take long if someone experienced in this area takes a
look.

> Now, what's the 0010 patch about?
>
> For some indexes (e.g. trgm), the parallel builds help a lot, because
> they produce a lot of temporary data and the parallel sort is a
> substantial part of the work. But for other indexes (especially the
> "smaller" indexes on jsonb headers), it's not that great. For example
> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>
> Clearly, this happens because a lot of time is spent outside the sort,
> actually inserting data into the index.

You always focus on the most important part, which inspires me a lot.
Even in my simple test above, the "inserting data into index" stage
takes ~40% of the time.

> So I was wondering if we might
> parallelize that too, and how much time would it save - 0010 is an
> experimental patch doing that. It splits the processing into 3 phases:
>
> 1. workers feeding data into tuplesort
> 2. leader finishes sort and "repartitions" the data
> 3. workers inserting their partition into index
>
> The patch is far from perfect (more a PoC)
..
>
> This does help a little bit, reducing the duration by ~15-25%. I wonder
> if this might be improved by partitioning the data differently - not by
> shuffling everything from the tuplesort into fileset (it increases the
> amount of temporary data in the charts). And also by distributing the
> data differently - right now it's a bit of a round robin, because it
> wasn't clear we know how many entries are there.

Due to the complexity of the existing code, I would like to focus on the
existing patches first, so I vote for doing this optimization in a
dedicated patch.

>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'.
>>> Even though we have some CTID merge activity in '....(1)', the tuples
>>> are still ordered, so the sorts (in both 'tuplesort_putgintuple' and
>>> 'tuplesort_performsort') are not necessary,
>>> ..
>>> If Matthias's proposal is adopted, my optimization will not be useful
>>> anymore and Matthias's proposal looks like a more natural and
>>> efficient way.
>
> I think they might be complementary. I don't think it's reasonable to
> expect GIN's BuildAccumulator to buffer all the index tuples at the
> same time (as I mentioned upthread: we are or should be limited by
> work memory), but the BuildAccumulator will do a much better job at
> combining tuples than the in-memory sort + merge-write done by
> Tuplesort (because BA will use (much?) less memory for the same number
> of stored values).

Thank you Matthias for valuing my point! And thanks for highlighting the
benefit that BuildAccumulator does a better job at the in-memory sort (I
think this is mainly because BuildAccumulator can merge at run time as
it accepts more tuples). But I am still not willing to go further in
this direction, for these reasons:

a) It probably can't make a big difference to the final result.
b) The best implementation of this idea would be to allow the user of
   tuplesort.c to insert pre-sorted tuples into the tape directly,
   rather than inserting them into tuplesort's memory and then dumping
   them to the tape without a sort. However, I can't define a clean API
   for the former case (see the sketch at the end of this mail).
c) CREATE INDEX is maintenance work; improving it by 30% would be good,
   but if we only improve it by <3%, that is not very attractive in
   practice.

So my opinion is: if we can reach agreement on 0008, we do the final
review/testing of the existing code (including 0009), and leave further
improvements to dedicated patches. What do you think?
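To make (b) more concrete, below is roughly the API shape I had in mind
but could not make clean. This is purely hypothetical - nothing like it
exists in tuplesort.c today, and the function name and contract are made
up for illustration:

  /*
   * Hypothetical: append a tuple that the caller guarantees is already
   * ordered (w.r.t. the current sort key) directly to the current run,
   * bypassing the in-memory sort.  What is unclear to me is how to
   * handle a violated ordering guarantee, and how this should interact
   * with memory accounting and run boundaries inside tuplesort.c.
   */
  extern void tuplesort_puttuple_presorted(Tuplesortstate *state,
                                           void *tuple);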
[1] https://www.postgresql.org/message-id/87le0iqrsu.fsf%40163.com

--
Best Regards
Andy Fan