Re: Gather performance analysis - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: Gather performance analysis
Date
Msg-id CALNJ-vRdkFLBPJYf-x6vW7RPrae2oviR+tGJVenTAUQk3R+jDg@mail.gmail.com
In response to Re: Gather performance analysis  (Zhihong Yu <zyu@yugabyte.com>)
Responses Re: Gather performance analysis  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers


On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu@yugabyte.com> wrote:


On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
>  Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>  Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.
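Spelling that out with the numbers above: 31627 - 23649 ≈ 8 seconds gained by parallelism even after paying the remaining queue overhead, and 37498 - 23649 ≈ 14 seconds lost to the communication overhead that the patch removes.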


> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes
 
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments [Test1 below] I have noticed that even on head the forced parallel plan is significantly faster than the non-parallel plan, and with the patch it is even better.  The point is that there may already be cases where forced parallel plans are faster, but that alone doesn't tell us whether we can reduce parallel_tuple_cost.  With the patch, however, the parallel tuple queue is definitely faster than what we have now, so I agree we should consider reducing parallel_tuple_cost once this patch is in.

Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement.  The gain is directly proportional to the load on the tuple queue, as expected.

Test1: (workers return all tuples, but only a few tuples are returned to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0):  53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared to the forced parallel plan, 45% gain compared to the default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10% gain compared to the default parallel plan.

I have done some cleanup in the patch and will add it to the September commitfest.

I am planning to do further testing to identify the optimal batch size under different workloads.  With the above workloads I am seeing similar results with batch sizes from 4k to 16k (1/4 of the ring size), so in the attached patch I have kept it at 1/4 of the ring size.  We might change that based on more analysis and testing.
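To make the batching idea concrete, here is a minimal standalone sketch (hypothetical names, not the actual shm_mq.c code): the sender adds each write to a local pending counter and only publishes it to the shared write pointer, and wakes the receiver, once more than a quarter of the ring has accumulated or a flush is forced.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct FakeQueue
{
    size_t      ring_size;      /* total size of the ring buffer */
    size_t      bytes_written;  /* "shared" write pointer the receiver polls */
} FakeQueue;

typedef struct FakeQueueHandle
{
    FakeQueue  *queue;
    size_t      send_pending;   /* bytes written locally but not yet published */
} FakeQueueHandle;

/* Stand-in for setting the receiver's latch. */
static void
notify_receiver(FakeQueue *q)
{
    printf("receiver woken, bytes_written = %zu\n", q->bytes_written);
}

/*
 * Record nbytes of newly written data.  Publish the data (and wake the
 * receiver) only when more than 1/4 of the ring has accumulated, or when
 * the caller forces a flush, e.g. at the end of a scan.
 */
static void
queue_advance_write(FakeQueueHandle *h, size_t nbytes, bool force_flush)
{
    h->send_pending += nbytes;

    if (force_flush || h->send_pending > h->queue->ring_size / 4)
    {
        h->queue->bytes_written += h->send_pending;
        h->send_pending = 0;
        notify_receiver(h->queue);
    }
}

int
main(void)
{
    FakeQueue   q = {65536, 0};
    FakeQueueHandle h = {&q, 0};

    /* 100 small "tuples" of 200 bytes: the receiver is woken only twice. */
    for (int i = 0; i < 100; i++)
        queue_advance_write(&h, 200, false);

    /* Force out whatever is still pending at the end. */
    queue_advance_write(&h, 0, true);
    return 0;
}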

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,
Some minor comments.
For shm_mq.c, the existing comment says:

 * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete 

+   Size        mqh_send_pending;
    bool        mqh_length_word_complete;
    bool        mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of the fields matches the order in which they are explained.
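For illustration, the declaration order would then look roughly like this (only the fields mentioned above are shown; a sketch of the suggestion, not the actual struct):

    Size        mqh_partial_bytes;
    Size        mqh_expected_bytes;
    bool        mqh_length_word_complete;
    Size        mqh_send_pending;
    bool        mqh_counterparty_attached;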

+   if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))

so that when force_flush is true, the other condition is not evaluated.

Cheers
 
There was a typo in the suggested code above. It should be:

+   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Cheers
