Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for GIN indexes
Msg-id: c2753e01-9b06-43f0-a9ae-43638ce4bedf@vondra.me
In response to: Re: Parallel CREATE INDEX for GIN indexes (Andy Fan <zhihuifan1213@163.com>)
List: pgsql-hackers
On 8/27/24 12:14, Andy Fan wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>
>> Hi,
>>
>> I got to do the detailed benchmarking on the latest version of the patch
>> series, so here are the results. My goal was to better understand the
>> impact of each patch individually - especially the two parts introduced
>> by Matthias, but not only - so I ran the test on a build with each of
>> the 0001-0009 patches.
>>
>> This is the same test I did at the very beginning, but the basic details
>> are that I have a 22GB table with archives of our mailing lists (1.6M
>> messages, roughly), and I build a couple of different GIN indexes on
>> that:
> ..
>
> Very impressive testing!
>
>> And let's talk about the improvements by Matthias, namely:
>>
>> * 0008 Use a single GIN tuplesort
>> * 0009 Reduce the size of GinTuple by 12 bytes
>>
>> I haven't really seen any impact on duration - it seems more or less
>> within noise. Maybe it would be different on machines with less RAM, but
>> on my two systems it didn't really make a difference.
>>
>> It did significantly reduce the amount of temporary data written, by
>> ~40% or so. This is pretty nicely visible on the "trgm" case, which
>> generates the most temp files of the four indexes. An example from the
>> i5/32MB section looks like this:
>>
>> label        0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
>> -----------------------------------------------------------------------------
>> trgm / 3        0  2635  3690  3715  1177  1177  1179  1179   696   682  1016
>
> After seeing the above data, I want to know where the time is spent and
> why the ~40% reduction in I/O doesn't make a measurable duration
> improvement, so I did the following test.
>
> create table gin_t (a int[]);
> insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
> select pg_prewarm('gin_t');
>
> postgres=# create index on gin_t using gin(a);
> INFO:  pid: 145078, stage 1 took 44476 ms
> INFO:  pid: 145177, stage 1 took 44474 ms
> INFO:  pid: 145078, stage 2 took 2662 ms
> INFO:  pid: 145177, stage 2 took 2611 ms
> INFO:  pid: 145177, stage 3 took 240 ms
> INFO:  pid: 145078, stage 3 took 239 ms
>
> CREATE INDEX
> Time: 79472.135 ms (01:19.472)
>
> Then we can see stage 1 takes 56% of the execution time, stage 2 + stage 3
> take 3%, and the leader's work takes the remaining 41%. I think that's why
> we didn't see much performance improvement from 0008, since it improves
> stage 2 and stage 3.
>

Yes, that makes sense. It's such a small fraction of the computation that
it can't translate into a meaningful speedup.

> ==== Here are my definitions of stages 1/2/3.
>
> stage 1:
>   reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
>                                      ginBuildCallbackParallel, state, scan);
>
>   /* write remaining accumulated entries */
>   ginFlushBuildState(state, index);
>
> stage 2:
>   _gin_process_worker_data(state, state->bs_worker_sort)
>
> stage 3:
>   tuplesort_performsort(state->bs_sortstate);
>
> But 0008 still does a lot of good stuff:
> 1. Reduces the I/O usage; this would be more useful on some heavily
>    I/O-bound workloads.
> 2. Simplifies the build logic by removing one stage.
> 3. Adds the 'buffer-writetup' to tuplesort.c. I don't have another use
>    case for now, but it looks like a reasonable design.
>
> I think the current blocker is whether it is safe to hack tuplesort.c.
> With my current knowledge it looks good to me, but it would be better to
> open a dedicated thread to discuss this specifically; the review would
> not take long if someone experienced in this area took a look.
>

I agree. I expressed the same impression earlier in this thread, IIRC.

>> Now, what's the 0010 patch about?
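(As a quick sanity check of the percentages quoted above, a few lines of
Python plugging in the worker timings and the total CREATE INDEX time;
the "leader" share is simply whatever remains:)

```python
# Check the stage shares quoted above, using the timings from the
# INFO lines (worker pid 145078) and the total CREATE INDEX duration.
total_ms = 79472.135          # "Time: 79472.135 ms"
stage1_ms = 44476             # table_index_build_scan + ginFlushBuildState
stage2_ms = 2662              # _gin_process_worker_data
stage3_ms = 239               # tuplesort_performsort

stage1_pct = stage1_ms / total_ms * 100
stage23_pct = (stage2_ms + stage3_ms) / total_ms * 100
leader_pct = 100 - stage1_pct - stage23_pct

print(f"stage 1:       {stage1_pct:.0f}%")   # ~56%
print(f"stage 2+3:     {stage23_pct:.0f}%")  # ~4%
print(f"leader (rest): {leader_pct:.0f}%")   # ~40%
```

So even if 0008 made stages 2 and 3 free, the total duration could only
improve by a few percent, which matches the observation that the ~40% I/O
reduction is invisible in the wall-clock numbers.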
>>
>> For some indexes (e.g. trgm), the parallel builds help a lot, because
>> they produce a lot of temporary data and the parallel sort is a
>> substantial part of the work. But for other indexes (especially the
>> "smaller" indexes on jsonb headers), it's not that great. For example
>> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>>
>> Clearly, this happens because a lot of time is spent outside the sort,
>> actually inserting data into the index.
>
> You can always focus on the most important part, which inspires me a lot.
> Even in my simple testing, the "inserting data into the index" stage
> takes 40% of the time.
>
>> So I was wondering if we might parallelize that too, and how much time
>> would it save - 0010 is an experimental patch doing that. It splits the
>> processing into 3 phases:
>>
>> 1. workers feeding data into tuplesort
>> 2. leader finishes sort and "repartitions" the data
>> 3. workers inserting their partition into index
>>
>> The patch is far from perfect (more a PoC) ..
>>
>> This does help a little bit, reducing the duration by ~15-25%. I wonder
>> if this might be improved by partitioning the data differently - not by
>> shuffling everything from the tuplesort into a fileset (it increases the
>> amount of temporary data in the charts). And also by distributing the
>> data differently - right now it's a bit of a round robin, because it
>> wasn't clear how many entries there are.
>
> Due to the complexity of the existing code, I would like to focus on the
> existing patch first. So I vote for this optimization as a dedicated
> patch.
>

I agree. Even if we decide to do these parallel inserts, it relies on
doing the parallel sort first. So it makes sense to leave that for later,
as an additional improvement.

>>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'.
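(As a side note on the "distributing the data differently" point in the
0010 discussion above: the difference between a round-robin distribution
and a range-based repartitioning of the sorted stream can be sketched in
a few lines. This is a hypothetical toy model, not the actual 0010 code;
both function names are made up for illustration.)

```python
# Toy model (not the 0010 patch): two ways the leader could hand the
# sorted (key, tid) stream back to the workers for the insert phase.

def partition_round_robin(sorted_tuples, nworkers):
    """Deal tuples out one by one; the same key can end up in several
    partitions, so multiple workers may touch the same index entry."""
    parts = [[] for _ in range(nworkers)]
    for i, tup in enumerate(sorted_tuples):
        parts[i % nworkers].append(tup)
    return parts

def partition_by_key_range(sorted_tuples, nworkers):
    """Cut the sorted stream into contiguous chunks, extending each
    chunk so that a run of equal keys never straddles two partitions."""
    parts = [[] for _ in range(nworkers)]
    n = len(sorted_tuples)
    start = 0
    for w in range(nworkers):
        end = min(n, (w + 1) * n // nworkers)
        while 0 < end < n and sorted_tuples[end][0] == sorted_tuples[end - 1][0]:
            end += 1
        parts[w] = sorted_tuples[start:end]
        start = end
    return parts

tuples = sorted([("abc", 1), ("abc", 2), ("abc", 3),
                 ("xyz", 4), ("xyz", 5), ("zzz", 6)])
rr = partition_round_robin(tuples, 2)
kr = partition_by_key_range(tuples, 2)
# round robin: key "abc" shows up in both partitions
# key range:   each key lives in exactly one partition
```

With range partitioning each worker owns a disjoint set of keys, so the
workers never insert into the same entry - which is presumably what a
smarter distribution would aim for.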
>>>> Even though we have some CTID merge activity in '....(1)', the tuples
>>>> are still ordered, so the sorts (in both tuplesort_putgintuple and
>>>> tuplesort_performsort) are not necessary.
>>>> ..
>>>> If Matthias's proposal is adopted, my optimization will not be useful
>>>> anymore, and Matthias's proposal looks like a more natural and
>>>> efficient way.
>>
>> I think they might be complementary. I don't think it's reasonable to
>> expect GIN's BuildAccumulator to buffer all the index tuples at the
>> same time (as I mentioned upthread: we are, or should be, limited by
>> work memory), but the BuildAccumulator will do a much better job at
>> combining tuples than the in-memory sort + merge-write done by
>> Tuplesort (because BA will use (much?) less memory for the same number
>> of stored values).
>
> Thank you Matthias for valuing my point! And thanks for highlighting the
> benefit that BuildAccumulator can do a better job of sorting in memory
> (I think it is mainly because BuildAccumulator can do a run-time merge
> when accepting more tuples). But I'm still not willing to go further in
> this direction. Reasons are: a) It probably can't make a big difference
> in the final result. b) The best implementation of this idea would be to
> allow the user of tuplesort.c to insert pre-sorted tuples into the tape
> directly, rather than inserting them into tuplesort's memory and dumping
> them to the tape without a sort. However, I can't define a clean API for
> the former case. c) create-index is maintenance work; improving it by
> 30% would be good, but if we just improve it by <3%, that is not very
> attractive in practice.
>
> So my opinion is that if we can reach agreement on 0008, we can then do
> a final review/test of the existing code (including 0009), and leave
> further improvements to a dedicated patch.
>
> What do you think?
>

Yeah. I think we have agreement on 0001-0007.
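(As a rough illustration of the run-time merge point above, here is a toy
model - plain Python dictionaries, not GIN's actual rbtree-based
BuildAccumulator - showing why combining TIDs per key before spilling
shrinks the amount of data written:)

```python
# Toy model of run-time merging (not GIN's actual BuildAccumulator):
# combining the TIDs for each key before writing means one record per
# distinct key instead of one record per extracted (key, tid) pair.

def spill_unmerged(pairs):
    """One (key, [tid]) record per input pair, as written individually."""
    return [(key, [tid]) for key, tid in pairs]

def spill_merged(pairs):
    """Accumulate TIDs per key as pairs arrive, then spill one record
    per distinct key with its sorted TID list."""
    acc = {}
    for key, tid in pairs:
        acc.setdefault(key, []).append(tid)
    return [(key, sorted(tids)) for key, tids in sorted(acc.items())]

# 3 distinct keys extracted from 1000 rows
pairs = [(f"key{i % 3}", i) for i in range(1000)]
print(len(spill_unmerged(pairs)))  # 1000 records
print(len(spill_merged(pairs)))    # 3 records, each with a long TID list
```

The fewer, fatter records also avoid repeating the key for every TID,
which is where a reduction in temporary data like the ~40% reported for
0008 comes from.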
I'm a bit torn about 0008. I hadn't expected changing tuplesort like this
when I started working on the patch, but I can't deny it's a massive
speedup for some cases (where the patch doesn't help otherwise). But then
in other cases it doesn't help at all, and 0010 does. I wonder if there's
a good way to "flip" between those two approaches, based on some
heuristics.

regards

--
Tomas Vondra