Re: Gather performance analysis - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Gather performance analysis
Date
Msg-id CAFiTN-u2McrrYTpR_X4cqU04X=sX-g6pn4VH0=oR-0aoqtpMEg@mail.gmail.com
Whole thread Raw
In response to Re: Gather performance analysis  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Gather performance analysis  (Zhihong Yu <zyu@yugabyte.com>)
Re: Gather performance analysis  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
>  Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>  Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.


> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes
 
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better.  The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not.  But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement.  The gain is directly proportional to the load on the tuple queue, as expected.

Test1: (Worker returns all tuples but only few tuples returns to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0):  53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared force parallel, 45% gain compared to default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10 % gain compared to the default parallel plan.

I have done cleanup in the patch and I will add this to the September commitfest.

I am planning to do further testing for identifying the optimal batch size in different workloads.  WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size.  We might change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: "Bossart, Nathan"
Date:
Subject: Re: Estimating HugePages Requirements?
Next
From: Trafalgar Ricardo Lu
Date:
Subject: Summary of GSoC 2021