Thread: Gather performance analysis

Gather performance analysis

From

Dilip Kumar

Date:

06 August 2021, 08:30:48

Hi,

I have been working on analyzing the performance of sending the tuple
from workers to the Gather using the tuple queue.  In the past there
were many off-list discussions around this area, basically, the main
point is that when the "shm_mq" was implemented that time maybe this
was one of the best ways to implement this. But now, we have other
choices like DSA for allocating shared memory on-demand, shared
temporary files for non-blocking tuple queue.

So my motivation for looking into this area is that now, we have
another flexible alternative so can we use them to make gather faster
and if so then
1. Can we actually reduce the tuple transfer cost and enable
parallelism in more cases by reducing parallel_tuple_cost.
2. Can we use the tuple queue in more places, e.g., to implement the
redistribute operator where we need to transfer data between the
workers.

IMHO for #1, it will be good enough if we can make the tuple transfer
faster, but for #2, we will have to make a) tuple transfer faster
because then we will have to transfer the tuples between the workers
as well b) Infinite non-blocking tuple queue(maybe using shared temp
file) so that there is no deadlock while workers are redistributing
tuples to each other.

So I have done some quick performance tests and analysis using perf,
and some experiments with small prototypes for targeting a different
set of problems.

--Setup
SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
   - 43.57% shm_mq_receive
      - 78.94% shm_mq_receive_bytes
         - 91.27% pg_atomic_read_u64
            - pg_atomic_read_u64_impl
               - apic_timer_interrupt
                 smp_apic_timer_interrupt

Perf analysis: Worker Node
      - 99.14% shm_mq_sendv
         - 74.10% shm_mq_send_bytes
            + 42.35% shm_mq_inc_bytes_written
            - 32.56% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - 86.27% apic_timer_interrupt
            + 17.93% WaitLatch

From the perf results and also from the code analysis I can think of
two main problems here
1. Schyncronization between the worker and gather node, just to
identify the bytes written and read they need to do at least 2-3
atomic operations for each tuple and I think that is having huge
penalty due to a) frequent cache line invalidation b) a lot of atomic
operations.

2. If the tuple queue is full then the worker might need to wait for
the gather to consume the tuple.

Experiment #1:
As part of this experiment, I have modified the sender to keep the
local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh
handle so that we don't need to frequently read/write cache sensitive
shared memory variables.  So now we only read/write from the shared
memory in the below conditions

1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read and also inform the reader about
the new writes.
2) After every 4k bytes written, update the shared memory variable and
inform the reader.
3) on detach for sending any remaining data.

Machine information:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
 Execution Time: 31627.492 ms

2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
 Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

Observation:
- As expected the results show that forcing the parallelism (by
reducing the parallel_tuple_cost), drastically impacts the
performance.
- But in the same scenario, with the patch, we can see a huge gain of ~40%
- Even if we compare it with the non-parallel plan we have gain ~25%.
- With this, I think we can conclude that there is a huge potential
for improvement if we communicate the tuple in batches, 1) one simple
approach is what I used in my experiment, I think we can do some
optimization in the reader as well, that instead of reading
bytes_written every time from shared memory remember the previous
value and once we have exhausted that then only read back the updated
value from the shared memory.  2) Instead of copying the whole tuple
in the tuple queue we can copy store the dsa_pointers of the tuple
batch, I think Thomas Munro also suggested a similar approach to
Robert, got to know this in offlist discussion with Robert.

Experiment #2: See the behavior by increasing the parallel tuple queue
size on head
(for this I created a small patch to make parallel_tuple_queue size
configurable)

-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS (tup_queue size= 1MB)  : 36186.883 ms
4 WORKERS (tup_queue size= 4MB)  : 36252.740 ms

8 WORKERS (tup_queue size= 64kB) : 42296.731 ms
8 WORKERS (tup_queue size= 1MB)  : 37403.872 ms
8 WORKERS (tup_queue size= 4MB)  : 39184.319 ms

16 WORKERS (tup_queue size= 64kB) : 42726.139 ms
16 WORKERS (tup_queue size= 1MB)  : 36219.975 ms
16 WORKERS (tup_queue size= 4MB)  : 39117.109 ms

Observation:
- There are some gains by increasing the tuple queue size but that is
limited up to 1MB, even tried with more data but the gain is not
linear and performance starts to drop after 4MB.
- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)

Conclusion:
With the above experiments,
1) I see a huge potential in the first idea so maybe we can do more
experiments based on the prototype implemented in the first idea and
we can expand the same for the reader and we can also try out the idea
of the dsa_pointers.

2) with the second idea of tuple queue size, I see some benefit but
that is not scaling so maybe, for now, there is no much point in
pursuing in this direction, but I think in the future if we want to
implement the redistribute operator then it is must for providing an
infinite tuple queue (maybe using temp file) to avoid deadlock.

Note: POC patches are not attached, I will send them after some more
experiments and cleanup, maybe I will try to optimize the reader part
as well before sending them.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

07 August 2021, 08:17:50

On Fri, Aug 6, 2021 at 2:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Experiment #1:
> As part of this experiment, I have modified the sender to keep the
> local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh
> handle so that we don't need to frequently read/write cache sensitive
> shared memory variables.  So now we only read/write from the shared
> memory in the below conditions
>
> 1) If the number of available bytes is not enough to send the tuple,
> read the updated value of bytes read and also inform the reader about
> the new writes.
> 2) After every 4k bytes written, update the shared memory variable and
> inform the reader.
> 3) on detach for sending any remaining data.
...
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
>  Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>  Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

Here is the POC patch for the same, apart from this extreme case I am
able to see improvement with this patch for normal parallel queries as
well.

Next, I will perform some more tests with different sets of queries to
see the improvements and post the results.  I will also try to
optimize the reader on the similar line.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

POC-0001-Optimize-shm_mq_send_bytes.patch

Re: Gather performance analysis

From

Robert Haas

Date:

24 August 2021, 15:18:04

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
>  Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>  Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

28 August 2021, 07:10:39

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.

Test1: (Worker returns all tuples but only few tuples returns to the client)

----------------------------------------------------

INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared force parallel, 45% gain compared to default plan.

Test2: (Parallel case with default parallel_tuple_cost)

----------------------------------------------

INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10 % gain compared to the default parallel plan.

I have done cleanup in the patch and I will add this to the September commitfest.

I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch

Re: Gather performance analysis

From

Zhihong Yu

Date:

28 August 2021, 11:29:08

On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.

Test1: (Worker returns all tuples but only few tuples returns to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared force parallel, 45% gain compared to default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10 % gain compared to the default parallel plan.

I have done cleanup in the patch and I will add this to the September commitfest.

I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,

Some minor comments.

For shm_mq.c, existing comment says:

* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matches the order of explanation for the fields.

+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))

so that when force_flush is true, the other condition is not evaluated.

Cheers

Re: Gather performance analysis

From

Zhihong Yu

Date:

28 August 2021, 11:33:59

On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu@yugabyte.com> wrote:

On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.

Test1: (Worker returns all tuples but only few tuples returns to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared force parallel, 45% gain compared to default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10 % gain compared to the default parallel plan.

I have done cleanup in the patch and I will add this to the September commitfest.

I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
Some minor comments.
For shm_mq.c, existing comment says:

* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matches the order of explanation for the fields.

+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))

so that when force_flush is true, the other condition is not evaluated.

Cheers

There was a typo in suggested code above. It should be:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Cheers

Re: Gather performance analysis

From

Tomas Vondra

Date:

07 September 2021, 15:11:47

Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Andres Freund

Date:

07 September 2021, 21:38:44

Hi,

On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
> --Setup
> SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
> CREATE TABLE t (a int, b varchar);
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
> ANALYZE t;
> Test query: EXPLAIN ANALYZE SELECT * FROM t;
> 
> Perf analysis: Gather Node
>    - 43.57% shm_mq_receive
>       - 78.94% shm_mq_receive_bytes
>          - 91.27% pg_atomic_read_u64
>             - pg_atomic_read_u64_impl
>                - apic_timer_interrupt
>                  smp_apic_timer_interrupt
> 
> Perf analysis: Worker Node
>       - 99.14% shm_mq_sendv
>          - 74.10% shm_mq_send_bytes
>             + 42.35% shm_mq_inc_bytes_written
>             - 32.56% pg_atomic_read_u64
>                - pg_atomic_read_u64_impl
>                   - 86.27% apic_timer_interrupt
>             + 17.93% WaitLatch
> 
> From the perf results and also from the code analysis I can think of
> two main problems here

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)                  │
│   Workers Planned: 2                                                                                         │
│   Workers Launched: 2                                                                                        │
│   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms                                                                                      │
│ Execution Time: 24954.012 ms                                                                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)


Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).

Greetings,

Andres Freund

Re: Gather performance analysis

From

Dilip Kumar

Date:

08 September 2021, 06:05:55

On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Thanks for showing interest.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.

I have attached the output of the lscpu. I think batching the data before updating in the shared memory will win because we are avoiding the frequent cache misses and IMHO the benefit will be more in the machine with more CPU sockets.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

Actually, my intention behind the tuple queue size was to just see the behavior. Do we really have the problem of workers stalling on queue while sending the tuple, the perf report showed some load on WaitLatch on the worker side so I did this experiment. I saw some benefits but it was not really huge. I am not sure whether we want to just increase the tuple queue size or make it tunable, but if we want to support redistribute operators in future sometime then maybe we should make it dynamically growing at runtime, maybe using dsa or dsa + shared files.

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.

Yeah, I was thinking about such cases, basically, this design can increase the startup cost of the Gather node, I will also try to derive such cases and test them.

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?

Actually, these both patches are the same, "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the cleaner version of the first patch. For configurable tuple queue size I did not send a patch, because that is I just used for the testing purpose and never intended to to propose anything. My most of the latest performance data I sent with only "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with default tuple queue size.

But I am attaching both the patches in case you want to play around.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Gather performance analysis

From

Dilip Kumar

Date:

08 September 2021, 06:15:16

On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

Yeah, for profiling generally I build without optimizations so that I can see all the functions in the stack, so yeah profile results are without optimizations build but the performance results are with optimizations build.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

Is this with or without patch, I mean can we see a comparison that patch improved anything in your environment?

Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.

Yeah that could be done, I can test with this change as well that how much we gain with this.

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).

Good suggestion, I will do some experiments around this.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Andres Freund

Date:

08 September 2021, 06:33:47

Hi,

On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
> On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
> 
> 
> > Looking at this profile made me wonder if this was a build without
> > optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
> > should
> > be inlined. And while perf can reconstruct inlined functions when using
> > --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for
> > me.
> >
> 
> Yeah, for profiling generally I build without optimizations so that I can
> see all the functions in the stack, so yeah profile results are without
> optimizations build but the performance results are with optimizations
> build.

I'm afraid that makes the profiles just about meaningless :(.


> Is this with or without patch, I mean can we see a comparison that patch
> improved anything in your environment?

It was without any patches. I'll try the patch in a bit.

Greetings,

Andres Freund

Re: Gather performance analysis

From

Dilip Kumar

Date:

08 September 2021, 07:40:44

On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
> On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
>
>
> > Looking at this profile made me wonder if this was a build without
> > optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
> > should
> > be inlined. And while perf can reconstruct inlined functions when using
> > --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for
> > me.
> >
>
> Yeah, for profiling generally I build without optimizations so that I can
> see all the functions in the stack, so yeah profile results are without
> optimizations build but the performance results are with optimizations
> build.

I'm afraid that makes the profiles just about meaningless :(.

Maybe it can be misleading sometimes, but I feel sometimes it is more informative compared to the optimized build where it makes some function inline, and then it becomes really hard to distinguish which function really has the problem. But your point is taken and I will run with an optimized build.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Tomas Vondra

Date:

08 September 2021, 09:58:22

On 9/8/21 9:40 AM, Dilip Kumar wrote:
> On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> wrote:
> 
>     Hi,
> 
>     On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
>     > On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de
>     <mailto:andres@anarazel.de>> wrote:
>     >
>     >
>     > > Looking at this profile made me wonder if this was a build without
>     > > optimizations. The
>     pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
>     > > should
>     > > be inlined. And while perf can reconstruct inlined functions
>     when using
>     > > --call-graph=dwarf, they show up like "pg_atomic_read_u64
>     (inlined)" for
>     > > me.
>     > >
>     >
>     > Yeah, for profiling generally I build without optimizations so
>     that I can
>     > see all the functions in the stack, so yeah profile results are
>     without
>     > optimizations build but the performance results are with optimizations
>     > build.
> 
>     I'm afraid that makes the profiles just about meaningless :(.
> 
> 
> Maybe it can be misleading sometimes, but I feel sometimes it is more
> informative compared to the optimized build where it makes some function
> inline, and then it becomes really hard to distinguish which function
> really has the problem.  But your point is taken and I will run with an
> optimized build.
> 

IMHO Andres is right optimization may make profiles mostly useless in
most cases - it may skew timings for different parts differently, so
something that'd be optimized out may take much more time.

It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.

As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds -O2.

Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures duration of a
simple query, similar to what you did. I haven't varied the queue size,
that might be interesting too.

The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Gather performance analysis

From

Dilip Kumar

Date:

08 September 2021, 11:11:54

On Wed, Sep 8, 2021 at 3:28 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 9/8/21 9:40 AM, Dilip Kumar wrote:

> Maybe it can be misleading sometimes, but I feel sometimes it is more
> informative compared to the optimized build where it makes some function
> inline, and then it becomes really hard to distinguish which function
> really has the problem. But your point is taken and I will run with an
> optimized build.
>

IMHO Andres is right optimization may make profiles mostly useless in
most cases - it may skew timings for different parts differently, so
something that'd be optimized out may take much more time.

It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.

Yeah, I completely agree that those binaries should not be used for benchmarking and patch comparison and I never used it for that purpose. I was also making the same point that with debug binaries sometimes we get some valuable insight during profiling.

As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds -O2.

Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures duration of a
simple query, similar to what you did. I haven't varied the queue size,
that might be interesting too.

The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.

Thanks for the benchmarking.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

15 September 2021, 07:26:25

On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on various suggestions, I have some more experiments with the patch.

1) I have measured the cache misses count and I see a ~20% reduction
in cache misses with the patch (updating shared memory counter only
after we written certain amount of data).
command: perf stat -e
cycles,instructions,cache-references,cache-misses -p <receiver-pid>
Head:
    13,918,480,258      cycles
    21,082,968,730      instructions              #    1.51  insn per
cycle
        13,206,426      cache-references
        12,432,402      cache-misses              #   94.139 % of all
cache refs

Patch:
    14,119,691,844      cycles
    29,497,239,984      instructions              #    2.09  insn per
cycle
         4,245,819      cache-references
         3,085,047      cache-misses              #   72.661 % of all cache refs

I have taken multiple samples with different execution times, and I
can see the cache-misses with the patch is 72-74% whereas without the
patch it is 92-94%.  So as expected these results clearly showing we
are saving a lot by avoiding cache misses.

2) As pointed by Tomas, I have tried different test cases, where this
patch can regress the performance

CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set enable_gathermerge=off;
Query: select * from t1 where a < 100000 order by a;

Plan:
Sort  (cost=1714422.10..1714645.24 rows=89258 width=15)
   ->  Gather  (cost=1000.00..1707082.55 rows=89258 width=15)
         ->  Parallel Seq Scan on t1  (cost=0.00..1706082.55
rows=22314 width=15)
               Filter: (a < 100000)

So the idea is, that without a patch we should immediately get the
tuple to the sort node whereas with a patch there would be some delay
before we send the tuple to the gather node as we are batching.   With
this also, I did not notice any consistent regression with the patch,
however, with explain analyze I have noticed 2-3 % drop with the
patch.

3. I tried some other optimizations, pointed by Andres,
a) Separating read-only and read-write data in shm_mq and also moving
some fields out of shm_mq

struct shm_mq (after change)
{
/* mostly read-only field*/

PGPROC    *mq_receiver;
PGPROC    *mq_sender;
bool mq_detached;
slock_t mq_mutex;

/* read-write fields*/
pg_atomic_uint64 mq_bytes_read;
pg_atomic_uint64 mq_bytes_written;
char mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle.

I did not see any extra improvement with this idea.

4. Another thought about changing the "mq_ring_size" to a mask
- I think this could improve something, but currently, "mq_ring_size"
is not the 2's power value so we can not convert this to a mask
directly.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

16 September 2021, 07:25:06

On Sat, Aug 28, 2021 at 5:04 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>

>>
>>  * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
>>
>> +   Size        mqh_send_pending;
>>     bool        mqh_length_word_complete;
>>     bool        mqh_counterparty_attached;
>>
>> I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields
matchesthe order of explanation for the fields.
 

Moved it after mqh_consume_pending and moved comment as well in the
correct order.
>
> There  was a typo in suggested code above. It should be:
>
> +   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

v2-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch

Re: Gather performance analysis

From

Tomas Vondra

Date:

18 September 2021, 13:18:43


On 9/8/21 8:05 AM, Dilip Kumar wrote:
> On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
> 
>     Hi,
> 
>     The numbers presented in this thread seem very promising - clearly
>     there's significant potential for improvements. I'll run similar
>     benchmarks too, to get a better understanding of this.
> 
> 
> Thanks for showing interest.
> 
> 
>     Can you share some basic details about the hardware you used?
>     Particularly the CPU model - I guess this might explain some of the
>     results, e.g. if CPU caches are ~1MB, that'd explain why setting
>     tup_queue_size to 1MB improves things, but 4MB is a bit slower.
>     Similarly, number of cores might explain why 4 workers perform better
>     than 8 or 16 workers.
> 
> 
> I have attached the output of the lscpu.  I think batching the data 
> before updating in the shared memory will win because we are avoiding 
> the frequent cache misses and IMHO the benefit will be more in the 
> machine with more CPU sockets.
> 
>     Now, this is mostly expected, but the consequence is that maybe things
>     like queue size should be tunable/dynamic, not hard-coded?
> 
> 
> Actually, my intention behind the tuple queue size was to just see the 
> behavior. Do we really have the problem of workers stalling on queue 
> while sending the tuple, the perf report showed some load on WaitLatch 
> on the worker side so I did this experiment.  I saw some benefits but it 
> was not really huge.  I am not sure whether we want to just increase the 
> tuple queue size or make it tunable,  but if we want to support 
> redistribute operators in future sometime then maybe we should make it 
> dynamically growing at runtime, maybe using dsa or dsa + shared files.
> 

Thanks. I ran a couple more benchmarks, with different queue sizes 
(16kB, 64kB, 256kB and 1MB) and according to the results the queue size 
really makes almost no difference. It might make a difference for some 
queries, but I wouldn't bother tuning this until we have a plausible 
example - increasing the queue size is not free either.

So it was worth checking, but I'd just leave it as 64kB for now. We may 
revisit this later in a separate patch/thread.

>     As for the patches, I think the proposed changes are sensible, but I
>     wonder what queries might get slower. For example with the batching
>     (updating the counter only once every 4kB, that pretty much transfers
>     data in larger chunks with higher latency. So what if the query needs
>     only a small chunk, like a LIMIT query? Similarly, this might mean the
>     upper parts of the plan have to wait for the data for longer, and thus
>     can't start some async operation (like send them to a FDW, or something
>     like that). I do admit those are theoretical queries, I haven't tried
>     creating such query.
> 
> 
> Yeah, I was thinking about such cases, basically, this design can 
> increase the startup cost of the Gather node, I will also try to derive 
> such cases and test them.
> 
> 
>     FWIW I've tried applying both patches at the same time, but there's a
>     conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
>     the correct solution. Can you share a "combined" patch?
> 
> 
> Actually, these both patches are the same, 
> "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the 
> cleaner version of the first patch.  For configurable tuple queue size I 
> did not send a patch, because that is I just used for the testing 
> purpose and never intended to to propose anything.  My most of the 
> latest performance data I sent with only 
> "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with 
> default tuple queue size.
> 
> But I am attaching both the patches in case you want to play around.
> 

Ah, silly me. I should have noticed the second patch is just a refined 
version of the first one.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Robert Haas

Date:

23 September 2021, 19:31:56

On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> But I am attaching both the patches in case you want to play around.

I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.

Objections?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Tomas Vondra

Date:

23 September 2021, 20:00:02

On 9/23/21 9:31 PM, Robert Haas wrote:
> On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> But I am attaching both the patches in case you want to play around.
> 
> I don't really see any reason not to commit 0001. Perhaps some very
> minor grammatical nitpicking is in order here, but apart from that I
> can't really see anything to criticize with this approach. It seems
> safe enough, it's not invasive in any way that matters, and we have
> benchmark results showing that it works well. If someone comes up with
> something even better, no harm done, we can always change it again.
> 
> Objections?

Yeah, it seems like a fairly clear win, according to the benchmarks.

I did find some suspicious behavior on the bigger box I have available 
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems 
pretty weird because the worst affected case is with no parallel workers 
(so the queue changes should affect it). Not sure how to explain it, but 
the behavior seems consistent.

Anyway, the other machine with a single CPU seems perfectly clean.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

queue-results.ods

Re: Gather performance analysis

From

Robert Haas

Date:

23 September 2021, 20:31:09

On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I did find some suspicious behavior on the bigger box I have available
> (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
> pretty weird because the worst affected case is with no parallel workers
> (so the queue changes should affect it). Not sure how to explain it, but
> the behavior seems consistent.

That is pretty odd. I'm inclined to mostly discount the runs with
10000 tuples because sending such a tiny number of tuples doesn't
really take any significant amount of time, and it seems possible that
variations in the runtime of other code due to code movement effects
could end up mattering more than the changes to the performance of
shm_mq. However, the results with a million tuples seem like they're
probably delivering statistically significant results ... and I guess
maybe what's happening is that the patch hurts when the tuples are too
big relative to the queue size.

I guess your columns are an md5 value each, which is 32 bytes +
overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes
the value to shared memory when more than a quarter of the queue has
been filled, that probably means we flush every 4-5 tuples. I wonder
if that means we need a smaller threshold, like 1/8 of the queue size?
Or maybe the behavior should be adaptive somehow, depending on whether
the receiver ends up waiting for data? Or ... perhaps only small
tuples are worth batching, so that the threshold for posting to shared
memory should be a constant rather than a fraction of the queue size?
I guess we need to know why we see the time spike up in those cases,
if we want to improve them.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Tomas Vondra

Date:

23 September 2021, 21:36:34


On 9/23/21 10:31 PM, Robert Haas wrote:
> On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I did find some suspicious behavior on the bigger box I have available
>> (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
>> pretty weird because the worst affected case is with no parallel workers
>> (so the queue changes should affect it). Not sure how to explain it, but
>> the behavior seems consistent.
> 
> That is pretty odd. I'm inclined to mostly discount the runs with
> 10000 tuples because sending such a tiny number of tuples doesn't
> really take any significant amount of time, and it seems possible that
> variations in the runtime of other code due to code movement effects
> could end up mattering more than the changes to the performance of
> shm_mq. However, the results with a million tuples seem like they're
> probably delivering statistically significant results ... and I guess
> maybe what's happening is that the patch hurts when the tuples are too
> big relative to the queue size.
> 

Agreed on 10k rows being too small, we can ignore that. And yes, binary 
layout might make a difference, of course. My rule of thumb is 5% (in 
both directions) is about the difference that might make, and most 
results are within that range.

> I guess your columns are an md5 value each, which is 32 bytes +
> overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes
> the value to shared memory when more than a quarter of the queue has
> been filled, that probably means we flush every 4-5 tuples. I wonder
> if that means we need a smaller threshold, like 1/8 of the queue size?
> Or maybe the behavior should be adaptive somehow, depending on whether
> the receiver ends up waiting for data? Or ... perhaps only small
> tuples are worth batching, so that the threshold for posting to shared
> memory should be a constant rather than a fraction of the queue size?
> I guess we need to know why we see the time spike up in those cases,
> if we want to improve them.
> 

Not sure about this, because

(a) That should affect both CPUs, I think, but i5-2500k does not have 
any such issue.

(b) One thing I haven't mentioned is I tried with larger queue sizes too 
(that's the 16kB, 64kB, 256kB and 1MB in columns). Although it's true 
larger queue improve the situation a bit.

(c) This can't explain the slowdown for cases without any Gather nodes 
(and it's ~17%, so unlikely due to binary layout).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Robert Haas

Date:

23 September 2021, 23:43:00

On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> (c) This can't explain the slowdown for cases without any Gather nodes
> (and it's ~17%, so unlikely due to binary layout).

Yeah, but none of the modified code would even execute in those cases,
so it's either binary layout, something wrong in your test
environment, or gremlins.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

24 September 2021, 05:19:01

On Fri, Sep 24, 2021 at 2:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > I did find some suspicious behavior on the bigger box I have available
> > (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
> > pretty weird because the worst affected case is with no parallel workers
> > (so the queue changes should affect it). Not sure how to explain it, but
> > the behavior seems consistent.
>
> That is pretty odd. I'm inclined to mostly discount the runs with
> 10000 tuples because sending such a tiny number of tuples doesn't
> really take any significant amount of time, and it seems possible that
> variations in the runtime of other code due to code movement effects
> could end up mattering more than the changes to the performance of
> shm_mq. However, the results with a million tuples seem like they're
> probably delivering statistically significant results ... and I guess
> maybe what's happening is that the patch hurts when the tuples are too
> big relative to the queue size.

I am looking at the "query-results.ods" file shared by Tomas, with a
million tuple I do not really see where the patch hurts? because I am
seeing in most of the cases the time taken by the patch is 60-80%
compared to the head.  And the worst case with a million tuple is
100.32% are are we pointing to that 0.32% or there is something else
that I am missing here.

>
> I guess your columns are an md5 value each, which is 32 bytes +
> overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes
> the value to shared memory when more than a quarter of the queue has
> been filled, that probably means we flush every 4-5 tuples. I wonder
> if that means we need a smaller threshold, like 1/8 of the queue size?
> Or maybe the behavior should be adaptive somehow, depending on whether
> the receiver ends up waiting for data? Or ... perhaps only small
> tuples are worth batching, so that the threshold for posting to shared
> memory should be a constant rather than a fraction of the queue size?
> I guess we need to know why we see the time spike up in those cases,
> if we want to improve them.

I will test with the larger tuple sizes and will see the behavior with
different thresholds.  With 250 bytes tuple size, I have tested with
different thresholds and it appeared that 1/4 of the queue size works
best.  But I will do more detailed testing and share the results.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

24 September 2021, 07:50:35

On Fri, Sep 24, 2021 at 1:30 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/23/21 9:31 PM, Robert Haas wrote:
> > On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> But I am attaching both the patches in case you want to play around.
> >
> > I don't really see any reason not to commit 0001. Perhaps some very
> > minor grammatical nitpicking is in order here, but apart from that I
> > can't really see anything to criticize with this approach. It seems
> > safe enough, it's not invasive in any way that matters, and we have
> > benchmark results showing that it works well. If someone comes up with
> > something even better, no harm done, we can always change it again.
> >
> > Objections?
>
> Yeah, it seems like a fairly clear win, according to the benchmarks.
>
> I did find some suspicious behavior on the bigger box I have available
> (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
> pretty weird because the worst affected case is with no parallel workers
> (so the queue changes should affect it). Not sure how to explain it, but
> the behavior seems consistent.
>
> Anyway, the other machine with a single CPU seems perfectly clean.

Tomas, can you share your test script, I would like to repeat the same
test in my environment and with different batching sizes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Robert Haas

Date:

24 September 2021, 17:08:28

On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Tomas, can you share your test script, I would like to repeat the same
> test in my environment and with different batching sizes.

I think it's probably the run.sh file attached to
http://postgr.es/m/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Robert Haas

Date:

24 September 2021, 17:10:55

On Fri, Sep 24, 2021 at 1:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am looking at the "query-results.ods" file shared by Tomas, with a
> million tuple I do not really see where the patch hurts? because I am
> seeing in most of the cases the time taken by the patch is 60-80%
> compared to the head.  And the worst case with a million tuple is
> 100.32% are are we pointing to that 0.32% or there is something else
> that I am missing here.

The spreadsheet has two tabs. Flip to "xeon e5-2620v3" and scroll all
the way down.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Tomas Vondra

Date:

24 September 2021, 20:44:32

On 9/24/21 1:43 AM, Robert Haas wrote:
> On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> (c) This can't explain the slowdown for cases without any Gather nodes
>> (and it's ~17%, so unlikely due to binary layout).
> 
> Yeah, but none of the modified code would even execute in those cases,
> so it's either binary layout, something wrong in your test
> environment, or gremlins.
> 

I'm not going to the office all that often these days, but I think I'm 
sure I'd notice a bunch of gremlins running around ... so it's probably 
the binary layout thing. But still, strange.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Tomas Vondra

Date:

24 September 2021, 20:48:37

On 9/24/21 7:08 PM, Robert Haas wrote:
> On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Tomas, can you share your test script, I would like to repeat the same
>> test in my environment and with different batching sizes.
> 
> I think it's probably the run.sh file attached to
> http://postgr.es/m/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com
> 

Yep. I've used a slightly modified version, which also varies the queue 
size, but we can ignore that and the rest of the script is the same.

As for the config, I've used only slightly tuned postgresql.conf, with 
shared_buffers set to 1GB, and increased worker limits.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Dilip Kumar

Date:

26 September 2021, 05:51:52

On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/24/21 7:08 PM, Robert Haas wrote:
> > On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> Tomas, can you share your test script, I would like to repeat the same
> >> test in my environment and with different batching sizes.

For now I have tested for 1M and 10M rows, shared buffers=16GM, for
now tested with default batching 1/4th of the queue size and I can see
the performance gain is huge. Time taken with the patch is in the
range of 37-90% compared to the master.  Please refer to the attached
file for more detailed results.  I could not see any regression that
Tomas saw, still I am planning to repeat it with different batch
sizes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Gather performance analysis

From

Dilip Kumar

Date:

27 September 2021, 05:22:29

On Sun, Sep 26, 2021 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 9/24/21 7:08 PM, Robert Haas wrote:
> > > On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >> Tomas, can you share your test script, I would like to repeat the same
> > >> test in my environment and with different batching sizes.
>
> For now I have tested for 1M and 10M rows, shared buffers=16GM, for
> now tested with default batching 1/4th of the queue size and I can see
> the performance gain is huge. Time taken with the patch is in the
> range of 37-90% compared to the master.  Please refer to the attached
> file for more detailed results.  I could not see any regression that
> Tomas saw, still I am planning to repeat it with different batch
> sizes.

I have done testing with different batch sizes, 16k (which is the same
as 1/4 of the queue size with 64k queue size) , 8k, 4k, 2k.

In the attached sheet I have done a comparison of
1. head vs patch (1/4 queue size) = execution time reduced to 37% to
90% this is the same as the old sheet.
2. patch (1/4 queue size) vs patch(8k batch) =  both are same, but 8k
batch size is slow in some cases.
3. patch (1/4 queue size) vs patch(4k batch) = both are same, but 4k
batch size is slow in some cases (even slower than 8k batch size).
4. patch (1/4 queue size) vs patch(2k batch) = 2k batch size is
significantly slow.

With these results, 1/4 of the queue size seems to be the winner and I
think we might go for that value, however someone might think that 4k
batch size is optimal because it is just marginally slow and with that
we will have to worry less about increasing the latency in some worse
case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

results.ods

Re: Gather performance analysis

From

Robert Haas

Date:

27 September 2021, 17:22:31

On Mon, Sep 27, 2021 at 1:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have done testing with different batch sizes, 16k (which is the same
> as 1/4 of the queue size with 64k queue size) , 8k, 4k, 2k.
>
> In the attached sheet I have done a comparison of
> 1. head vs patch (1/4 queue size) = execution time reduced to 37% to
> 90% this is the same as the old sheet.
> 2. patch (1/4 queue size) vs patch(8k batch) =  both are same, but 8k
> batch size is slow in some cases.
> 3. patch (1/4 queue size) vs patch(4k batch) = both are same, but 4k
> batch size is slow in some cases (even slower than 8k batch size).
> 4. patch (1/4 queue size) vs patch(2k batch) = 2k batch size is
> significantly slow.

Generally these results seem to show that a larger batch size is
better than a smaller one, but we know that's not true everywhere and
under all circumstances, because some of Tomas's numbers are worse
than the unpatched cases. And I think we don't really understand the
reason for those results. Now it could be that there's no really good
reason for those results, and it's just something weird or not very
generally interesting.

On the other hand, while it's easy to see that batching can be a win
if it avoids contention, it also seems easy to imagine that it can be
a loss. By postponing the update to shared memory, we are essentially
gambling. If nobody would have read the updated value anyway, we win,
because we avoided doing work that wasn't really needed by
consolidating multiple updates of shared memory down to one. However,
imagine the scenario where someone reads a value that is small enough
that they have to block, because they think there's no more data
available. If there really is more data available and we just didn't
update shared memory, then we lose.

Here, the wins are going to be much smaller than the losses. Cache
line contention isn't cheap, but it's a lot cheaper than actually
having a process go to sleep and having to wake it up again. So
essentially the patch is betting that the winning scenario is much
more common than the losing scenario - the occasional big losses when
the reader sleeps unnecessarily will be more than counterbalanced by
the small wins every time we skip an update to shared memory without
causing that to happen.

And most of the time, that's probably a good bet. But, if you do
somehow hit the losing case repeatedly, then you could see a
significant regression. And that might explain Tomas's results.
Perhaps for some reason they just happen to hit that case over and
over again. If that's true, it would be useful to know why it happens
in that case and not others, because then maybe we could avoid the
problem somehow. However, I'm not sure how to figure that out, and I'm
not even entirely sure it's important to figure it out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Dilip Kumar

Date:

28 September 2021, 06:48:46

On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

>
> And most of the time, that's probably a good bet. But, if you do
> somehow hit the losing case repeatedly, then you could see a
> significant regression. And that might explain Tomas's results.
> Perhaps for some reason they just happen to hit that case over and
> over again. If that's true, it would be useful to know why it happens
> in that case and not others, because then maybe we could avoid the
> problem somehow. However, I'm not sure how to figure that out, and I'm
> not even entirely sure it's important to figure it out.
>

Yeah, if it is losing in some cases then it is definitely good to know
the reason, I was just looking into the performance numbers shared by
Tomas in query-results, I can see the worst case is
with 10000000 rows, 10 columns and 4 threads and queue size 64k.
Basically, if we see the execution time with head is ~804ms whereas
with patch it is ~1277 ms.  But then I just tried to notice the
pattern with different queue size so number are like below,

                          16k             64k       256k        1024k
Head               1232.779    804.24    1134.723    901.257
Patch              1371.493    1277.705    862.598    783.481

So what I have noticed is that in most of the cases on head as well as
with the patch, increasing the queue size make it faster, but with
head suddenly for this particular combination of rows, column and
thread the execution time is very low for 64k queue size and then
again the execution time increased with 256k queue size and then
follow the pattern.  So this particular dip in the execution time on
the head looks a bit suspicious to me.  I mean how could we justify
this sudden big dip in execution time w.r.t the other pattern.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Amit Kapila

Date:

28 September 2021, 10:53:21

On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> >
> > And most of the time, that's probably a good bet. But, if you do
> > somehow hit the losing case repeatedly, then you could see a
> > significant regression. And that might explain Tomas's results.
> > Perhaps for some reason they just happen to hit that case over and
> > over again. If that's true, it would be useful to know why it happens
> > in that case and not others, because then maybe we could avoid the
> > problem somehow. However, I'm not sure how to figure that out, and I'm
> > not even entirely sure it's important to figure it out.
> >
>
> Yeah, if it is losing in some cases then it is definitely good to know
> the reason, I was just looking into the performance numbers shared by
> Tomas in query-results, I can see the worst case is
> with 10000000 rows, 10 columns and 4 threads and queue size 64k.
> Basically, if we see the execution time with head is ~804ms whereas
> with patch it is ~1277 ms.  But then I just tried to notice the
> pattern with different queue size so number are like below,
>
>                           16k             64k       256k        1024k
> Head               1232.779    804.24    1134.723    901.257
> Patch              1371.493    1277.705    862.598    783.481
>
> So what I have noticed is that in most of the cases on head as well as
> with the patch, increasing the queue size make it faster, but with
> head suddenly for this particular combination of rows, column and
> thread the execution time is very low for 64k queue size and then
> again the execution time increased with 256k queue size and then
> follow the pattern.  So this particular dip in the execution time on
> the head looks a bit suspicious to me.
>

I concur with your observation. Isn't it possible to repeat the same
test in the same environment to verify these results?

-- 
With Regards,
Amit Kapila.

Re: Gather performance analysis

From

Tomas Vondra

Date:

28 September 2021, 11:51:35

On 9/28/21 12:53 PM, Amit Kapila wrote:
> On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>>
>>> And most of the time, that's probably a good bet. But, if you do
>>> somehow hit the losing case repeatedly, then you could see a
>>> significant regression. And that might explain Tomas's results.
>>> Perhaps for some reason they just happen to hit that case over and
>>> over again. If that's true, it would be useful to know why it happens
>>> in that case and not others, because then maybe we could avoid the
>>> problem somehow. However, I'm not sure how to figure that out, and I'm
>>> not even entirely sure it's important to figure it out.
>>>
>>
>> Yeah, if it is losing in some cases then it is definitely good to know
>> the reason, I was just looking into the performance numbers shared by
>> Tomas in query-results, I can see the worst case is
>> with 10000000 rows, 10 columns and 4 threads and queue size 64k.
>> Basically, if we see the execution time with head is ~804ms whereas
>> with patch it is ~1277 ms.  But then I just tried to notice the
>> pattern with different queue size so number are like below,
>>
>>                            16k             64k       256k        1024k
>> Head               1232.779    804.24    1134.723    901.257
>> Patch              1371.493    1277.705    862.598    783.481
>>
>> So what I have noticed is that in most of the cases on head as well as
>> with the patch, increasing the queue size make it faster, but with
>> head suddenly for this particular combination of rows, column and
>> thread the execution time is very low for 64k queue size and then
>> again the execution time increased with 256k queue size and then
>> follow the pattern.  So this particular dip in the execution time on
>> the head looks a bit suspicious to me.
>>
> 
> I concur with your observation. Isn't it possible to repeat the same
> test in the same environment to verify these results?
> 

I can repeat any tests we need on that machine, of course.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Gather performance analysis

From

Dilip Kumar

Date:

28 September 2021, 12:00:42

On Tue, Sep 28, 2021 at 5:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> >> Yeah, if it is losing in some cases then it is definitely good to know
> >> the reason, I was just looking into the performance numbers shared by
> >> Tomas in query-results, I can see the worst case is
> >> with 10000000 rows, 10 columns and 4 threads and queue size 64k.
> >> Basically, if we see the execution time with head is ~804ms whereas
> >> with patch it is ~1277 ms.  But then I just tried to notice the
> >> pattern with different queue size so number are like below,
> >>
> >>                            16k             64k       256k        1024k
> >> Head               1232.779    804.24    1134.723    901.257
> >> Patch              1371.493    1277.705    862.598    783.481
> >>
> >> So what I have noticed is that in most of the cases on head as well as
> >> with the patch, increasing the queue size make it faster, but with
> >> head suddenly for this particular combination of rows, column and
> >> thread the execution time is very low for 64k queue size and then
> >> again the execution time increased with 256k queue size and then
> >> follow the pattern.  So this particular dip in the execution time on
> >> the head looks a bit suspicious to me.
> >>
> >
> > I concur with your observation. Isn't it possible to repeat the same
> > test in the same environment to verify these results?
> >
>
> I can repeat any tests we need on that machine, of course.
>

I think that would be great, can we just test this specific target
where we are seeing a huge dip with the patch, e.g.
with 10000000 rows, 10 columns and 4 threads, and queue size 64k.  In
my performance machine, I tried to run this test multiple times but on
the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms,
so I am not able to reproduce this.  So it would be good if you can
run only this specific test and repeat it a couple of times on your
performance machine.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Robert Haas

Date:

29 September 2021, 16:39:50

On Tue, Sep 28, 2021 at 2:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>                           16k             64k       256k        1024k
> Head               1232.779    804.24    1134.723    901.257
> Patch              1371.493    1277.705    862.598    783.481
>
> So what I have noticed is that in most of the cases on head as well as
> with the patch, increasing the queue size make it faster, but with
> head suddenly for this particular combination of rows, column and
> thread the execution time is very low for 64k queue size and then
> again the execution time increased with 256k queue size and then
> follow the pattern.  So this particular dip in the execution time on
> the head looks a bit suspicious to me.  I mean how could we justify
> this sudden big dip in execution time w.r.t the other pattern.

Oh, interesting. So there's not really a performance regression here
so much as that one particular case ran exceptionally fast on the
unpatched code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Tomas Vondra

Date:

12 October 2021, 13:10:51

On 9/28/21 14:00, Dilip Kumar wrote:
>
> I think that would be great, can we just test this specific target
> where we are seeing a huge dip with the patch, e.g.
> with 10000000 rows, 10 columns and 4 threads, and queue size 64k.  In
> my performance machine, I tried to run this test multiple times but on
> the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms,
> so I am not able to reproduce this.  So it would be good if you can
> run only this specific test and repeat it a couple of times on your
> performance machine.
> 

I ran the benchmark again, with 10 runs instead of 5, the results and 
scripts are attached. It seems the worst case got much better and is now 
in line with the rest of the results, so it probably was a coincidence.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Gather performance analysis

From

Dilip Kumar

Date:

12 October 2021, 14:13:44

On Tue, Oct 12, 2021 at 6:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/28/21 14:00, Dilip Kumar wrote:
> >
> > I think that would be great, can we just test this specific target
> > where we are seeing a huge dip with the patch, e.g.
> > with 10000000 rows, 10 columns and 4 threads, and queue size 64k.  In
> > my performance machine, I tried to run this test multiple times but on
> > the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms,
> > so I am not able to reproduce this.  So it would be good if you can
> > run only this specific test and repeat it a couple of times on your
> > performance machine.
> >
>
> I ran the benchmark again, with 10 runs instead of 5, the results and
> scripts are attached. It seems the worst case got much better and is now
> in line with the rest of the results, so it probably was a coincidence.

Thanks, yeah now it looks in line with other results.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Robert Haas

Date:

14 October 2021, 20:18:29

On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks, yeah now it looks in line with other results.

Since it seems there are no remaining concerns here, and we have
benchmarking results showing that the patch helps, I have committed
the patch.

I wonder whether the new code in shm_mq_send_bytes() should guard
against calling shm_mq_inc_bytes_written() with a second argument of
0, or alternatively whether shm_mq_inc_bytes_written() should have an
internal defense against that. It might save some writes to shared
memory, but it would also add a branch, which isn't free, either.

I also think that, as a followup action item, we need to reassess
parallel_tuple_cost.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Gather performance analysis

From

Amit Kapila

Date:

26 October 2021, 09:01:25

On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Thanks, yeah now it looks in line with other results.
>
> Since it seems there are no remaining concerns here, and we have
> benchmarking results showing that the patch helps, I have committed
> the patch.
>

Can we mark the corresponding CF entry [1] as committed?

[1] - https://commitfest.postgresql.org/35/3304/

-- 
With Regards,
Amit Kapila.

Re: Gather performance analysis

From

Dilip Kumar

Date:

26 October 2021, 09:03:14

On Tue, Oct 26, 2021 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > Thanks, yeah now it looks in line with other results.
> >
> > Since it seems there are no remaining concerns here, and we have
> > benchmarking results showing that the patch helps, I have committed
> > the patch.
> >
>
> Can we mark the corresponding CF entry [1] as committed?
>
> [1] - https://commitfest.postgresql.org/35/3304/

Done!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com