Re: Parallel CREATE INDEX for GIN indexes

From: Andy Fan
Subject: Re: Parallel CREATE INDEX for GIN indexes
Msg-id: 87a5gyqnl5.fsf@163.com
In response to: Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> Hi,
>
> I got to do the detailed benchmarking on the latest version of the patch
> series, so here's the results. My goal was to better understand the
> impact of each patch individually - especially the two parts introduced
> by Matthias, but not only - so I ran the test on a build with each of
> the 0001-0009 patches.
>
> This is the same test I did at the very beginning, but the basic details
> are that I have a 22GB table with archives of our mailing lists (1.6M
> messages, roughly), and I build a couple different GIN indexes on
> that:
..
>

Very impressive testing!

> And let's talk about the improvement by Matthias, namely:
>
> * 0008 Use a single GIN tuplesort
> * 0009 Reduce the size of GinTuple by 12 bytes
>
> I haven't really seen any impact on duration - it seems more or less
> within noise. Maybe it would be different on machines with less RAM, but
> on my two systems it didn't really make a difference.
>
> It did significantly reduce the amount of temporary data written, by
> ~40% or so. This is pretty nicely visible on the "trgm" case, which
> generates the most temp files of the four indexes. An example from the
> i5/32MB section looks like this:
>
> label       0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
> -----------------------------------------------------------------------------
> trgm / 3       0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

After seeing the above data, I wanted to know where the time is spent
and why the ~40% reduction in temporary IO doesn't translate into a
measurable duration improvement, so I ran the following test.

create table gin_t (a int[]);
insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
select pg_prewarm('gin_t');

postgres=# create index on gin_t using gin(a);
INFO:  pid: 145078,  stage 1 took 44476 ms
INFO:  pid: 145177,  stage 1 took 44474 ms
INFO:  pid: 145078,  stage 2 took 2662 ms
INFO:  pid: 145177,  stage 2 took 2611 ms
INFO:  pid: 145177,  stage 3 took 240 ms
INFO:  pid: 145078,  stage 3 took 239 ms

CREATE INDEX
Time: 79472.135 ms (01:19.472)

From this we can see that stage 1 takes 56% of the execution time
(~44.5s of the 79.5s total), stages 2 and 3 together take ~3%, and the
leader's work takes the remaining ~41%. I think that's why we didn't
see much of a duration improvement from 0008: it improves only stages 2
and 3.

==== Here are my definitions of stages 1/2/3:
stage 1:
         reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
                                           ginBuildCallbackParallel, state, scan);

        /* write remaining accumulated entries */
        ginFlushBuildState(state, index);
    
stage 2:
        _gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
        tuplesort_performsort(state->bs_sortstate);
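
For reference, the stage timings above come from debug instrumentation
I added locally; it is not part of the patch series, and the exact
placement in the worker build path is approximate. A minimal sketch
using the instr_time.h macros looks like:

        instr_time  start, diff;

        INSTR_TIME_SET_CURRENT(start);
        reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
                                           ginBuildCallbackParallel, state, scan);
        ginFlushBuildState(state, index);

        INSTR_TIME_SET_CURRENT(diff);
        INSTR_TIME_SUBTRACT(diff, start);
        elog(INFO, "pid: %d,  stage 1 took %.0f ms",
             MyProcPid, INSTR_TIME_GET_MILLISEC(diff));

The same pattern wraps stage 2 and stage 3.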


But 0008 still brings several good things:
1. It reduces IO usage, which should matter more on heavily IO-bound
   workloads.
2. It simplifies the build logic by removing one stage.
3. It adds the 'buffer-writetup' capability to tuplesort.c; I don't
   have another use case for it right now, but it looks like a
   reasonable design.

I think the current blocker is whether it is safe to modify
tuplesort.c like this. With my current knowledge it looks good to me,
but it would be better to open a dedicated thread to discuss it
specifically; the review should not take long if someone experienced in
this area takes a look.
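
To make the buffer-writetup idea concrete, here is how I read it, as a
self-contained toy in C rather than the patch's actual code (all the
names below are mine): writetup keeps the previous tuple buffered and
merges consecutive tuples that share a key, writing to the tape only
when the key changes.

        #include <stdio.h>
        #include <string.h>

        /* toy stand-ins for GinTuple and the tape */
        typedef struct { const char *key; int ntids; } Tup;

        static Tup buf;
        static int have_buf = 0;

        static void flush_to_tape(void)
        {
            if (have_buf)
                printf("write key=%s ntids=%d\n", buf.key, buf.ntids);
            have_buf = 0;
        }

        static void writetup_buffered(Tup t)
        {
            if (have_buf && strcmp(buf.key, t.key) == 0)
            {
                buf.ntids += t.ntids;   /* same key: merge posting lists */
                return;
            }
            flush_to_tape();            /* key changed: emit buffered tuple */
            buf = t;
            have_buf = 1;
        }

        int main(void)
        {
            Tup in[] = {{"abc", 3}, {"abc", 5}, {"abd", 2}};
            for (int i = 0; i < 3; i++)
                writetup_buffered(in[i]);
            flush_to_tape();            /* writes abc/8, then abd/2 */
            return 0;
        }

Merging equal-key runs before they reach the tape is what cuts the
temporary data volume in the trgm numbers above.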

> Now, what's the 0010 patch about?
>
> For some indexes (e.g. trgm), the parallel builds help a lot, because
> they produce a lot of temporary data and the parallel sort is a
> substantial part of the work. But for other indexes (especially the
> "smaller" indexes on jsonb headers), it's not that great. For example
> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>
> Clearly, this happens because a lot of time is spent outside the sort,
> actually inserting data into the index.

You always focus on the most important part, which inspires me a lot.
Even in my simple testing, the "inserting data into the index" stage
takes ~40% of the time.

> So I was wondering if we might
> parallelize that too, and how much time would it save - 0010 is an
> experimental patch doing that. It splits the processing into 3 phases:
>
> 1. workers feeding data into tuplesort
> 2. leader finishes sort and "repartitions" the data
> 3. workers inserting their partition into index
>
> The patch is far from perfect (more a PoC) ..
>
> This does help a little bit, reducing the duration by ~15-25%. I wonder
> if this might be improved by partitioning the data differently - not by
> shuffling everything from the tuplesort into fileset (it increases the
> amount of temporary data in the charts). And also by distributing the
> data differently - right now it's a bit of a round robin, because it
> wasn't clear we know how many entries are there.

Due to the complexity of the existing code, I would like to focus on
the existing patches first, so I vote for doing this optimization as a
dedicated patch.

>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'.  Even
>>> though we have some CTID merge activity in '....(1)', the tuples are still
>>> ordered, so the sorts (in both tuplesort_putgintuple and
>>> tuplesort_performsort) are not necessary,
>>> ..
>>> If Matthias's proposal is adopted, my optimization will not be useful
>>> anymore and Matthias's proposal looks like a more natural and efficient
>>> way.
>
> I think they might be complementary. I don't think it's reasonable to
> expect GIN's BuildAccumulator to buffer all the index tuples at the
> same time (as I mentioned upthread: we are or should be limited by
> work memory), but the BuildAccumulator will do a much better job at
> combining tuples than the in-memory sort + merge-write done by
> Tuplesort (because BA will use (much?) less memory for the same number
> of stored values).

Thank you Matthias for valuing my point, and thanks for highlighting
the benefit that BuildAccumulator can do a better job of sorting in
memory (I think mainly because BuildAccumulator can merge at run time
as it accepts more tuples). But I'm still not willing to go further in
this direction. The reasons are: a) it probably can't make a big
difference to the final result; b) the best implementation of this idea
would let the user of tuplesort.c insert pre-sorted tuples directly
onto the tape, rather than inserting them into tuplesort's memory and
dumping them to the tape without a sort, but I can't define a clean API
for that; c) CREATE INDEX is maintenance work, so improving it by 30%
would be good, but improving it by <3% is not very attractive in
practice.
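
Just to make (b) concrete, the API I failed to define cleanly would
look something like this; both names are invented purely for
illustration:

        /* start a run whose tuples the caller promises are already sorted */
        extern void tuplesort_begin_presorted_run(Tuplesortstate *state);

        /* append one pre-sorted tuple directly to the current tape,
         * bypassing the in-memory sort entirely */
        extern void tuplesort_put_presorted(Tuplesortstate *state, void *tuple);

The difficulty, as far as I can see, is that tuplesort.c's status
machine assumes tuples arrive unsorted, so a clean entry point like
this cuts across its existing states.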

So my opinion is: if we can reach agreement on 0008, we can do the
final review/testing of the existing code (including 0009) and leave
further improvements to a dedicated patch.

What do you think?

[1] https://www.postgresql.org/message-id/87le0iqrsu.fsf%40163.com

-- 
Best Regards
Andy Fan



