Re: Parallel CREATE INDEX for GIN indexes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Parallel CREATE INDEX for GIN indexes
Date
Msg-id c2753e01-9b06-43f0-a9ae-43638ce4bedf@vondra.me
In response to Re: Parallel CREATE INDEX for GIN indexes  (Andy Fan <zhihuifan1213@163.com>)
Responses Re: Parallel CREATE INDEX for GIN indexes
List pgsql-hackers

On 8/27/24 12:14, Andy Fan wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> 
>> Hi,
>>
>> I got to do the detailed benchmarking on the latest version of the patch
>> series, so here's the results. My goal was to better understand the
>> impact of each patch individually - especially the two parts introduced
>> by Matthias, but not only - so I ran the test on a build with each of
>> the 0001-0009 patches.
>>
>> This is the same test I did at the very beginning, but the basic details
>> are that I have a 22GB table with archives of our mailing lists (1.6M
>> messages, roughly), and I build a couple different GIN indexes on
>> that:
> ..
>>
> 
> Very impressive testing!
> 
>> And let's talk about the improvement by Matthias, namely:
>>
>> * 0008 Use a single GIN tuplesort
>> * 0009 Reduce the size of GinTuple by 12 bytes
>>
>> I haven't really seen any impact on duration - it seems more or less
>> within noise. Maybe it would be different on machines with less RAM, but
>> on my two systems it didn't really make a difference.
>>
>> It did significantly reduce the amount of temporary data written, by
>> ~40% or so. This is pretty nicely visible on the "trgm" case, which
>> generates the most temp files of the four indexes. An example from the
>> i5/32MB section looks like this:
>>
>> label    0000 0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
>> ------------------------------------------------------------------------
>> trgm / 3    0 2635  3690  3715  1177  1177  1179  1179   696   682  1016
> 
> After seeing the above data, I wanted to know where the time is spent
> and why the ~40% IO reduction doesn't make a measurable difference in
> duration. Then I did the following test.
> 
> create table gin_t (a int[]);
> insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
> select pg_prewarm('gin_t');
> 
> postgres=# create index on gin_t using gin(a);
> INFO:  pid: 145078,  stage 1 took 44476 ms
> INFO:  pid: 145177,  stage 1 took 44474 ms
> INFO:  pid: 145078,  stage 2 took 2662 ms
> INFO:  pid: 145177,  stage 2 took 2611 ms
> INFO:  pid: 145177,  stage 3 took 240 ms
> INFO:  pid: 145078,  stage 3 took 239 ms
> 
> CREATE INDEX
> Time: 79472.135 ms (01:19.472)
> 
> Then we can see that stage 1 takes 56% of the execution time, stages 2
> and 3 take 3%, and the leader's work takes the remaining 41%. I think
> that's why we didn't see much performance improvement from 0008, since
> it improves only stages 2 and 3.
> 

Yes, that makes sense. It's such a small fraction of the computation that
it can't translate into a meaningful speedup.

> ==== Here is my definition for stage 1/2/3. 
> stage 1:
>          reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
>                                            ginBuildCallbackParallel, state, scan);
> 
>         /* write remaining accumulated entries */
>         ginFlushBuildState(state, index);
>     
> stage 2:
>         _gin_process_worker_data(state, state->bs_worker_sort)
> 
> stage 3:
>         tuplesort_performsort(state->bs_sortstate);
> 
> 
> But 0008 still does a lot of good stuff:
> 1. Reduces the IO usage; this would be more useful under heavily
>    IO-bound workloads.
> 2. Simplifies the build logic by removing one stage.
> 3. Adds the 'buffer-writetup' path to tuplesort.c; I don't have another
>    use case for it right now, but it looks like a reasonable design.
> 
> I think the current blocker is whether it is safe to hack tuplesort.c
> like this. With my current knowledge it looks good to me, but it would
> be better to open a dedicated thread to discuss this specifically; the
> review would not take long if someone experienced in this area took a
> look.
> 

I agree. I expressed the same impression earlier in this thread, IIRC.

>> Now, what's the 0010 patch about?
>>
>> For some indexes (e.g. trgm), the parallel builds help a lot, because
>> they produce a lot of temporary data and the parallel sort is a
>> substantial part of the work. But for other indexes (especially the
>> "smaller" indexes on jsonb headers), it's not that great. For example
>> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>>
>> Clearly, this happens because a lot of time is spent outside the sort,
>> actually inserting data into the index.
> 
> You always focus on the most important part, which inspires me a lot;
> even in my simple test, the "inserting data into index" stage takes
> 40% of the time.
>> So I was wondering if we might
>> parallelize that too, and how much time would it save - 0010 is an
>> experimental patch doing that. It splits the processing into 3 phases:
>>
>> 1. workers feeding data into tuplesort
>> 2. leader finishes sort and "repartitions" the data
>> 3. workers inserting their partition into index
>>
>> The patch is far from perfect (more a PoC) ..
>>
>> This does help a little bit, reducing the duration by ~15-25%. I wonder
>> if this might be improved by partitioning the data differently - not by
>> shuffling everything from the tuplesort into a fileset (which increases
>> the amount of temporary data in the charts). And also by distributing
>> the data differently - right now it's a bit of a round robin, because
>> it wasn't clear whether we'd know how many entries there are.
> 
> Due to the complexity of the existing code, I would like to focus on
> the existing patches first, so I vote for doing this optimization as a
> dedicated patch.
> 

I agree. Even if we decide to do these parallel inserts, it relies on
doing the parallel sort first. So it makes sense to leave that for
later, as an additional improvement.

>>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
>>>> if we have some CTID merge activity in '....(1)', the tuples are still
>>>> ordered, so the sorts (in both tuplesort_putgintuple and
>>>> tuplesort_performsort) are not necessary.
>>>> ..
>>>> If Matthias's proposal is adopted, my optimization will not be useful
>>>> anymore, and Matthias's proposal looks like a more natural and
>>>> efficient way.
>>
>> I think they might be complementary. I don't think it's reasonable to
>> expect GIN's BuildAccumulator to buffer all the index tuples at the
>> same time (as I mentioned upthread: we are or should be limited by
>> work memory), but the BuildAccumulator will do a much better job at
>> combining tuples than the in-memory sort + merge-write done by
>> Tuplesort (because BA will use (much?) less memory for the same number
>> of stored values).
> 
> Thank you Matthias for valuing my point! And thanks for highlighting
> the benefit that BuildAccumulator can do a better job of sorting in
> memory (I think that is mainly because BuildAccumulator can do a
> run-time merge as it accepts more tuples). But I am still not willing
> to go further in this direction. Reasons are: a) It probably can't make
> a big difference in the final result. b) The best implementation of
> this idea would allow the user of tuplesort.c to insert pre-sorted
> tuples into a tape directly, rather than inserting them into
> tuplesort's memory and dumping them to tape without a sort; however, I
> can't define a clean API for that. c) CREATE INDEX is maintenance work;
> improving it by 30% would be good, but if we only improve it by <3% it
> doesn't look very attractive in practice.
> 
> So my opinion is: if we can reach agreement on 0008, then we can do the
> final review/testing of the existing code (including 0009), and leave
> further improvements to a dedicated patch.
> 
> What do you think?
> 

Yeah. I think we have agreement on 0001-0007. I'm a bit torn about 0008;
I hadn't expected to change tuplesort like this when I started working
on the patch, but I can't deny it's a massive speedup for some cases
(where the patch doesn't help otherwise). But then in other cases it
doesn't help at all, and 0010 does. I wonder if maybe there's a good way
to "flip" between those two approaches, using some heuristics.


regards

-- 
Tomas Vondra


