Thread: Gather performance analysis
Hi, I have been working on analyzing the performance of sending the tuple from workers to the Gather using the tuple queue. In the past there were many off-list discussions around this area, basically, the main point is that when the "shm_mq" was implemented that time maybe this was one of the best ways to implement this. But now, we have other choices like DSA for allocating shared memory on-demand, shared temporary files for non-blocking tuple queue. So my motivation for looking into this area is that now, we have another flexible alternative so can we use them to make gather faster and if so then 1. Can we actually reduce the tuple transfer cost and enable parallelism in more cases by reducing parallel_tuple_cost. 2. Can we use the tuple queue in more places, e.g., to implement the redistribute operator where we need to transfer data between the workers. IMHO for #1, it will be good enough if we can make the tuple transfer faster, but for #2, we will have to make a) tuple transfer faster because then we will have to transfer the tuples between the workers as well b) Infinite non-blocking tuple queue(maybe using shared temp file) so that there is no deadlock while workers are redistributing tuples to each other. So I have done some quick performance tests and analysis using perf, and some experiments with small prototypes for targeting a different set of problems. --Setup SET parallel_tuple_cost TO 0 -- to test parallelism in the extreme case CREATE TABLE t (a int, b varchar); INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i; ANALYZE t; Test query: EXPLAIN ANALYZE SELECT * FROM t; Perf analysis: Gather Node - 43.57% shm_mq_receive - 78.94% shm_mq_receive_bytes - 91.27% pg_atomic_read_u64 - pg_atomic_read_u64_impl - apic_timer_interrupt smp_apic_timer_interrupt Perf analysis: Worker Node - 99.14% shm_mq_sendv - 74.10% shm_mq_send_bytes + 42.35% shm_mq_inc_bytes_written - 32.56% pg_atomic_read_u64 - pg_atomic_read_u64_impl - 86.27% apic_timer_interrupt + 17.93% WaitLatch From the perf results and also from the code analysis I can think of two main problems here 1. Schyncronization between the worker and gather node, just to identify the bytes written and read they need to do at least 2-3 atomic operations for each tuple and I think that is having huge penalty due to a) frequent cache line invalidation b) a lot of atomic operations. 2. If the tuple queue is full then the worker might need to wait for the gather to consume the tuple. Experiment #1: As part of this experiment, I have modified the sender to keep the local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh handle so that we don't need to frequently read/write cache sensitive shared memory variables. So now we only read/write from the shared memory in the below conditions 1) If the number of available bytes is not enough to send the tuple, read the updated value of bytes read and also inform the reader about the new writes. 2) After every 4k bytes written, update the shared memory variable and inform the reader. 3) on detach for sending any remaining data. Machine information: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit CPU(s): 56 On-line CPU(s) list: 0-55 Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 2 NUMA node(s): 2 Results: (query EXPLAIN ANALYZE SELECT * FROM t;) 1) Non-parallel (default) Execution Time: 31627.492 ms 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0) Execution Time: 37498.672 ms 3) Same as above (2) but with the patch. Execution Time: 23649.287 ms Observation: - As expected the results show that forcing the parallelism (by reducing the parallel_tuple_cost), drastically impacts the performance. - But in the same scenario, with the patch, we can see a huge gain of ~40% - Even if we compare it with the non-parallel plan we have gain ~25%. - With this, I think we can conclude that there is a huge potential for improvement if we communicate the tuple in batches, 1) one simple approach is what I used in my experiment, I think we can do some optimization in the reader as well, that instead of reading bytes_written every time from shared memory remember the previous value and once we have exhausted that then only read back the updated value from the shared memory. 2) Instead of copying the whole tuple in the tuple queue we can copy store the dsa_pointers of the tuple batch, I think Thomas Munro also suggested a similar approach to Robert, got to know this in offlist discussion with Robert. Experiment #2: See the behavior by increasing the parallel tuple queue size on head (for this I created a small patch to make parallel_tuple_queue size configurable) -- Results 4 WORKERS (tup_queue size= 64kB) : 38337.046 ms 4 WORKERS (tup_queue size= 1MB) : 36186.883 ms 4 WORKERS (tup_queue size= 4MB) : 36252.740 ms 8 WORKERS (tup_queue size= 64kB) : 42296.731 ms 8 WORKERS (tup_queue size= 1MB) : 37403.872 ms 8 WORKERS (tup_queue size= 4MB) : 39184.319 ms 16 WORKERS (tup_queue size= 64kB) : 42726.139 ms 16 WORKERS (tup_queue size= 1MB) : 36219.975 ms 16 WORKERS (tup_queue size= 4MB) : 39117.109 ms Observation: - There are some gains by increasing the tuple queue size but that is limited up to 1MB, even tried with more data but the gain is not linear and performance starts to drop after 4MB. - If I apply both Experiment#1 and Experiment#2 patches together then, we can further reduce the execution time to 20963.539 ms (with 4 workers and 4MB tuple queue size) Conclusion: With the above experiments, 1) I see a huge potential in the first idea so maybe we can do more experiments based on the prototype implemented in the first idea and we can expand the same for the reader and we can also try out the idea of the dsa_pointers. 2) with the second idea of tuple queue size, I see some benefit but that is not scaling so maybe, for now, there is no much point in pursuing in this direction, but I think in the future if we want to implement the redistribute operator then it is must for providing an infinite tuple queue (maybe using temp file) to avoid deadlock. Note: POC patches are not attached, I will send them after some more experiments and cleanup, maybe I will try to optimize the reader part as well before sending them. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 6, 2021 at 2:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Experiment #1: > As part of this experiment, I have modified the sender to keep the > local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh > handle so that we don't need to frequently read/write cache sensitive > shared memory variables. So now we only read/write from the shared > memory in the below conditions > > 1) If the number of available bytes is not enough to send the tuple, > read the updated value of bytes read and also inform the reader about > the new writes. > 2) After every 4k bytes written, update the shared memory variable and > inform the reader. > 3) on detach for sending any remaining data. ... > Results: (query EXPLAIN ANALYZE SELECT * FROM t;) > 1) Non-parallel (default) > Execution Time: 31627.492 ms > > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0) > Execution Time: 37498.672 ms > > 3) Same as above (2) but with the patch. > Execution Time: 23649.287 ms Here is the POC patch for the same, apart from this extreme case I am able to see improvement with this patch for normal parallel queries as well. Next, I will perform some more tests with different sets of queries to see the improvements and post the results. I will also try to optimize the reader on the similar line. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Results: (query EXPLAIN ANALYZE SELECT * FROM t;) > 1) Non-parallel (default) > Execution Time: 31627.492 ms > > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0) > Execution Time: 37498.672 ms > > 3) Same as above (2) but with the patch. > Execution Time: 23649.287 ms This strikes me as an amazingly good result. I guess before seeing these results, I would have said that you can't reasonably expect parallel query to win on a query like this because there isn't enough for the workers to do. It's not like they are spending time evaluating filter conditions or anything like that - they're just fetching tuples off of disk pages and sticking them into a queue. And it's unclear to me why it should be better to have a bunch of processes doing that instead of just one. I would have thought, looking at just (1) and (2), that parallelism gained nothing and communication overhead lost 6 seconds. But what this suggests is that parallelism gained at least 8 seconds, and communication overhead lost at least 14 seconds. In fact... > - If I apply both Experiment#1 and Experiment#2 patches together then, > we can further reduce the execution time to 20963.539 ms (with 4 > workers and 4MB tuple queue size) ...this suggests that parallelism actually gained at least 10-11 seconds, and the communication overhead lost at least 15-16 seconds. If that's accurate, it's pretty crazy. We might need to drastically reduce the value of parallel_tuple_cost if these results hold up and this patch gets committed. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...
Right, good observation.
> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
Yes
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.
In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.
Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.
Test1: (Worker returns all tuples but only few tuples returns to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
set max_parallel_workers_per_gather=4;
Target Query: SELECT random() FROM t GROUP BY a;
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared force parallel, 45% gain compared to default plan.
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared force parallel, 45% gain compared to default plan.
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10 % gain compared to the default parallel plan.
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10 % gain compared to the default parallel plan.
I have done cleanup in the patch and I will add this to the September commitfest.
I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.
Attachment
On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:--On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...Right, good observation.
> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.YesIf that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.Test1: (Worker returns all tuples but only few tuples returns to the client)----------------------------------------------------INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;Target Query: SELECT random() FROM t GROUP BY a;
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared force parallel, 45% gain compared to default plan.Test2: (Parallel case with default parallel_tuple_cost)----------------------------------------------INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10 % gain compared to the default parallel plan.I have done cleanup in the patch and I will add this to the September commitfest.I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.
Hi,
Some minor comments.
For shm_mq.c, existing comment says:
* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matches the order of explanation for the fields.
+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
The above can be written as:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))
so that when force_flush is true, the other condition is not evaluated.
Cheers
On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:--On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> 1) Non-parallel (default)
> Execution Time: 31627.492 ms
>
> 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> Execution Time: 37498.672 ms
>
> 3) Same as above (2) but with the patch.
> Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...Right, good observation.
> - If I apply both Experiment#1 and Experiment#2 patches together then,
> we can further reduce the execution time to 20963.539 ms (with 4
> workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.YesIf that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.In one of my experiments[Test1] I have noticed that even on the head the force parallel plan is significantly faster compared to the non-parallel plan, but with patch it is even better. The point is now also there might be some cases where the force parallel plans are faster but we are not sure whether we can reduce the parallel_tuple_cost or not. But with the patch it is definitely sure that the parallel tuple queue is faster compared to what we have now, So I agree we should consider reducing the parallel_tuple_cost after this patch.Additionally, I've done some more experiments with artificial workloads, as well as workloads where the parallel plan is selected by default, and in all cases I've seen a significant improvement. The gain is directly proportional to the load on the tuple queue, as expected.Test1: (Worker returns all tuples but only few tuples returns to the client)----------------------------------------------------INSERT INTO t SELECT i%10, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;Target Query: SELECT random() FROM t GROUP BY a;
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared force parallel, 45% gain compared to default plan.Test2: (Parallel case with default parallel_tuple_cost)----------------------------------------------INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10 % gain compared to the default parallel plan.I have done cleanup in the patch and I will add this to the September commitfest.I am planning to do further testing for identifying the optimal batch size in different workloads. WIth above workload I am seeing similar results with batch size 4k to 16k (1/4 of the ring size) so in the attached patch I have kept as 1/4 of the ring size. We might change that based on more analysis and testing.Hi,Some minor comments.For shm_mq.c, existing comment says:* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matches the order of explanation for the fields.+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)The above can be written as:+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))so that when force_flush is true, the other condition is not evaluated.Cheers
There was a typo in suggested code above. It should be:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
Cheers
Hi, The numbers presented in this thread seem very promising - clearly there's significant potential for improvements. I'll run similar benchmarks too, to get a better understanding of this. Can you share some basic details about the hardware you used? Particularly the CPU model - I guess this might explain some of the results, e.g. if CPU caches are ~1MB, that'd explain why setting tup_queue_size to 1MB improves things, but 4MB is a bit slower. Similarly, number of cores might explain why 4 workers perform better than 8 or 16 workers. Now, this is mostly expected, but the consequence is that maybe things like queue size should be tunable/dynamic, not hard-coded? As for the patches, I think the proposed changes are sensible, but I wonder what queries might get slower. For example with the batching (updating the counter only once every 4kB, that pretty much transfers data in larger chunks with higher latency. So what if the query needs only a small chunk, like a LIMIT query? Similarly, this might mean the upper parts of the plan have to wait for the data for longer, and thus can't start some async operation (like send them to a FDW, or something like that). I do admit those are theoretical queries, I haven't tried creating such query. FWIW I've tried applying both patches at the same time, but there's a conflict in shm_mq_sendv - not a complex one, but I'm not sure what's the correct solution. Can you share a "combined" patch? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote: > --Setup > SET parallel_tuple_cost TO 0 -- to test parallelism in the extreme case > CREATE TABLE t (a int, b varchar); > INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i; > ANALYZE t; > Test query: EXPLAIN ANALYZE SELECT * FROM t; > > Perf analysis: Gather Node > - 43.57% shm_mq_receive > - 78.94% shm_mq_receive_bytes > - 91.27% pg_atomic_read_u64 > - pg_atomic_read_u64_impl > - apic_timer_interrupt > smp_apic_timer_interrupt > > Perf analysis: Worker Node > - 99.14% shm_mq_sendv > - 74.10% shm_mq_send_bytes > + 42.35% shm_mq_inc_bytes_written > - 32.56% pg_atomic_read_u64 > - pg_atomic_read_u64_impl > - 86.27% apic_timer_interrupt > + 17.93% WaitLatch > > From the perf results and also from the code analysis I can think of > two main problems here Looking at this profile made me wonder if this was a build without optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should be inlined. And while perf can reconstruct inlined functions when using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me. FWIW, I see times like this postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t; ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │ │ Workers Planned: 2 │ │ Workers Launched: 2 │ │ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │ │ Planning Time: 0.043 ms │ │ Execution Time: 24954.012 ms │ └──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (6 rows) Looking at a profile I see the biggest bottleneck in the leader (which is the bottleneck as soon as the worker count is increased) to be reading the length word of the message. I do see shm_mq_receive_bytes() in the profile, but the costly part there is the "read % (uint64) ringsize" - divisions are slow. We could just compute a mask instead of the size. We also should probably split the read-mostly data in shm_mq (ring_size, detached, ring_offset, receiver, sender) into a separate cacheline from the read/write data. Or perhaps copy more info into the handle, particularly the ringsize (or mask). Greetings, Andres Freund
On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.
Thanks for showing interest.
Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.
Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?
Actually, my intention behind the tuple queue size was to just see the behavior. Do we really have the problem of workers stalling on queue while sending the tuple, the perf report showed some load on WaitLatch on the worker side so I did this experiment. I saw some benefits but it was not really huge. I am not sure whether we want to just increase the tuple queue size or make it tunable, but if we want to support redistribute operators in future sometime then maybe we should make it dynamically growing at runtime, maybe using dsa or dsa + shared files.
As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.
Yeah, I was thinking about such cases, basically, this design can increase the startup cost of the Gather node, I will also try to derive such cases and test them.
FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?
Actually, these both patches are the same, "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the cleaner version of the first patch. For configurable tuple queue size I did not send a patch, because that is I just used for the testing purpose and never intended to to propose anything. My most of the latest performance data I sent with only "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with default tuple queue size.
But I am attaching both the patches in case you want to play around.
Attachment
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.
Yeah, for profiling generally I build without optimizations so that I can see all the functions in the stack, so yeah profile results are without optimizations build but the performance results are with optimizations build.
FWIW, I see times like this
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)
Is this with or without patch, I mean can we see a comparison that patch improved anything in your environment?
Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.
Yeah that could be done, I can test with this change as well that how much we gain with this.
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).
Good suggestion, I will do some experiments around this.
--
Hi, On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote: > On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote: > > > > Looking at this profile made me wonder if this was a build without > > optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls > > should > > be inlined. And while perf can reconstruct inlined functions when using > > --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for > > me. > > > > Yeah, for profiling generally I build without optimizations so that I can > see all the functions in the stack, so yeah profile results are without > optimizations build but the performance results are with optimizations > build. I'm afraid that makes the profiles just about meaningless :(. > Is this with or without patch, I mean can we see a comparison that patch > improved anything in your environment? It was without any patches. I'll try the patch in a bit. Greetings, Andres Freund
On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
> On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
>
>
> > Looking at this profile made me wonder if this was a build without
> > optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
> > should
> > be inlined. And while perf can reconstruct inlined functions when using
> > --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for
> > me.
> >
>
> Yeah, for profiling generally I build without optimizations so that I can
> see all the functions in the stack, so yeah profile results are without
> optimizations build but the performance results are with optimizations
> build.
I'm afraid that makes the profiles just about meaningless :(.
Maybe it can be misleading sometimes, but I feel sometimes it is more informative compared to the optimized build where it makes some function inline, and then it becomes really hard to distinguish which function really has the problem. But your point is taken and I will run with an optimized build.
On 9/8/21 9:40 AM, Dilip Kumar wrote: > On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de > <mailto:andres@anarazel.de>> wrote: > > Hi, > > On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote: > > On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de > <mailto:andres@anarazel.de>> wrote: > > > > > > > Looking at this profile made me wonder if this was a build without > > > optimizations. The > pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls > > > should > > > be inlined. And while perf can reconstruct inlined functions > when using > > > --call-graph=dwarf, they show up like "pg_atomic_read_u64 > (inlined)" for > > > me. > > > > > > > Yeah, for profiling generally I build without optimizations so > that I can > > see all the functions in the stack, so yeah profile results are > without > > optimizations build but the performance results are with optimizations > > build. > > I'm afraid that makes the profiles just about meaningless :(. > > > Maybe it can be misleading sometimes, but I feel sometimes it is more > informative compared to the optimized build where it makes some function > inline, and then it becomes really hard to distinguish which function > really has the problem. But your point is taken and I will run with an > optimized build. > IMHO Andres is right optimization may make profiles mostly useless in most cases - it may skew timings for different parts differently, so something that'd be optimized out may take much more time. It may provide valuable insights, but we definitely should not use such binaries for benchmarking and comparisons of the patches. As mentioned, I did some benchmarks, and I do see some nice improvements even with properly optimized builds -O2. Attached is a simple script that varies a bunch of parameters (number of workers, number of rows/columns, ...) and then measures duration of a simple query, similar to what you did. I haven't varied the queue size, that might be interesting too. The PDF shows a comparison of master and the two patches. For 10k rows there's not much difference, but for 1M and 10M rows there are some nice improvements in the 20-30% range. Of course, it's just a single query in a simple benchmark. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Wed, Sep 8, 2021 at 3:28 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 9/8/21 9:40 AM, Dilip Kumar wrote:
> Maybe it can be misleading sometimes, but I feel sometimes it is more
> informative compared to the optimized build where it makes some function
> inline, and then it becomes really hard to distinguish which function
> really has the problem. But your point is taken and I will run with an
> optimized build.
>
IMHO Andres is right optimization may make profiles mostly useless in
most cases - it may skew timings for different parts differently, so
something that'd be optimized out may take much more time.
It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.
Yeah, I completely agree that those binaries should not be used for benchmarking and patch comparison and I never used it for that purpose. I was also making the same point that with debug binaries sometimes we get some valuable insight during profiling.
As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds -O2.
Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures duration of a
simple query, similar to what you did. I haven't varied the queue size,
that might be interesting too.
The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.
Thanks for the benchmarking.
On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: Based on various suggestions, I have some more experiments with the patch. 1) I have measured the cache misses count and I see a ~20% reduction in cache misses with the patch (updating shared memory counter only after we written certain amount of data). command: perf stat -e cycles,instructions,cache-references,cache-misses -p <receiver-pid> Head: 13,918,480,258 cycles 21,082,968,730 instructions # 1.51 insn per cycle 13,206,426 cache-references 12,432,402 cache-misses # 94.139 % of all cache refs Patch: 14,119,691,844 cycles 29,497,239,984 instructions # 2.09 insn per cycle 4,245,819 cache-references 3,085,047 cache-misses # 72.661 % of all cache refs I have taken multiple samples with different execution times, and I can see the cache-misses with the patch is 72-74% whereas without the patch it is 92-94%. So as expected these results clearly showing we are saving a lot by avoiding cache misses. 2) As pointed by Tomas, I have tried different test cases, where this patch can regress the performance CREATE TABLE t (a int, b varchar); INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i; set enable_gathermerge=off; Query: select * from t1 where a < 100000 order by a; Plan: Sort (cost=1714422.10..1714645.24 rows=89258 width=15) -> Gather (cost=1000.00..1707082.55 rows=89258 width=15) -> Parallel Seq Scan on t1 (cost=0.00..1706082.55 rows=22314 width=15) Filter: (a < 100000) So the idea is, that without a patch we should immediately get the tuple to the sort node whereas with a patch there would be some delay before we send the tuple to the gather node as we are batching. With this also, I did not notice any consistent regression with the patch, however, with explain analyze I have noticed 2-3 % drop with the patch. 3. I tried some other optimizations, pointed by Andres, a) Separating read-only and read-write data in shm_mq and also moving some fields out of shm_mq struct shm_mq (after change) { /* mostly read-only field*/ PGPROC *mq_receiver; PGPROC *mq_sender; bool mq_detached; slock_t mq_mutex; /* read-write fields*/ pg_atomic_uint64 mq_bytes_read; pg_atomic_uint64 mq_bytes_written; char mq_ring[FLEXIBLE_ARRAY_MEMBER]; }; Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle. I did not see any extra improvement with this idea. 4. Another thought about changing the "mq_ring_size" to a mask - I think this could improve something, but currently, "mq_ring_size" is not the 2's power value so we can not convert this to a mask directly. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 28, 2021 at 5:04 PM Zhihong Yu <zyu@yugabyte.com> wrote: > >> >> * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete >> >> + Size mqh_send_pending; >> bool mqh_length_word_complete; >> bool mqh_counterparty_attached; >> >> I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matchesthe order of explanation for the fields. Moved it after mqh_consume_pending and moved comment as well in the correct order. > > There was a typo in suggested code above. It should be: > > + if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2)) Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On 9/8/21 8:05 AM, Dilip Kumar wrote: > On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra > <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> > wrote: > > Hi, > > The numbers presented in this thread seem very promising - clearly > there's significant potential for improvements. I'll run similar > benchmarks too, to get a better understanding of this. > > > Thanks for showing interest. > > > Can you share some basic details about the hardware you used? > Particularly the CPU model - I guess this might explain some of the > results, e.g. if CPU caches are ~1MB, that'd explain why setting > tup_queue_size to 1MB improves things, but 4MB is a bit slower. > Similarly, number of cores might explain why 4 workers perform better > than 8 or 16 workers. > > > I have attached the output of the lscpu. I think batching the data > before updating in the shared memory will win because we are avoiding > the frequent cache misses and IMHO the benefit will be more in the > machine with more CPU sockets. > > Now, this is mostly expected, but the consequence is that maybe things > like queue size should be tunable/dynamic, not hard-coded? > > > Actually, my intention behind the tuple queue size was to just see the > behavior. Do we really have the problem of workers stalling on queue > while sending the tuple, the perf report showed some load on WaitLatch > on the worker side so I did this experiment. I saw some benefits but it > was not really huge. I am not sure whether we want to just increase the > tuple queue size or make it tunable, but if we want to support > redistribute operators in future sometime then maybe we should make it > dynamically growing at runtime, maybe using dsa or dsa + shared files. > Thanks. I ran a couple more benchmarks, with different queue sizes (16kB, 64kB, 256kB and 1MB) and according to the results the queue size really makes almost no difference. It might make a difference for some queries, but I wouldn't bother tuning this until we have a plausible example - increasing the queue size is not free either. So it was worth checking, but I'd just leave it as 64kB for now. We may revisit this later in a separate patch/thread. > As for the patches, I think the proposed changes are sensible, but I > wonder what queries might get slower. For example with the batching > (updating the counter only once every 4kB, that pretty much transfers > data in larger chunks with higher latency. So what if the query needs > only a small chunk, like a LIMIT query? Similarly, this might mean the > upper parts of the plan have to wait for the data for longer, and thus > can't start some async operation (like send them to a FDW, or something > like that). I do admit those are theoretical queries, I haven't tried > creating such query. > > > Yeah, I was thinking about such cases, basically, this design can > increase the startup cost of the Gather node, I will also try to derive > such cases and test them. > > > FWIW I've tried applying both patches at the same time, but there's a > conflict in shm_mq_sendv - not a complex one, but I'm not sure what's > the correct solution. Can you share a "combined" patch? > > > Actually, these both patches are the same, > "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the > cleaner version of the first patch. For configurable tuple queue size I > did not send a patch, because that is I just used for the testing > purpose and never intended to to propose anything. My most of the > latest performance data I sent with only > "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with > default tuple queue size. > > But I am attaching both the patches in case you want to play around. > Ah, silly me. I should have noticed the second patch is just a refined version of the first one. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > But I am attaching both the patches in case you want to play around. I don't really see any reason not to commit 0001. Perhaps some very minor grammatical nitpicking is in order here, but apart from that I can't really see anything to criticize with this approach. It seems safe enough, it's not invasive in any way that matters, and we have benchmark results showing that it works well. If someone comes up with something even better, no harm done, we can always change it again. Objections? -- Robert Haas EDB: http://www.enterprisedb.com
On 9/23/21 9:31 PM, Robert Haas wrote: > On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> But I am attaching both the patches in case you want to play around. > > I don't really see any reason not to commit 0001. Perhaps some very > minor grammatical nitpicking is in order here, but apart from that I > can't really see anything to criticize with this approach. It seems > safe enough, it's not invasive in any way that matters, and we have > benchmark results showing that it works well. If someone comes up with > something even better, no harm done, we can always change it again. > > Objections? Yeah, it seems like a fairly clear win, according to the benchmarks. I did find some suspicious behavior on the bigger box I have available (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems pretty weird because the worst affected case is with no parallel workers (so the queue changes should affect it). Not sure how to explain it, but the behavior seems consistent. Anyway, the other machine with a single CPU seems perfectly clean. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I did find some suspicious behavior on the bigger box I have available > (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems > pretty weird because the worst affected case is with no parallel workers > (so the queue changes should affect it). Not sure how to explain it, but > the behavior seems consistent. That is pretty odd. I'm inclined to mostly discount the runs with 10000 tuples because sending such a tiny number of tuples doesn't really take any significant amount of time, and it seems possible that variations in the runtime of other code due to code movement effects could end up mattering more than the changes to the performance of shm_mq. However, the results with a million tuples seem like they're probably delivering statistically significant results ... and I guess maybe what's happening is that the patch hurts when the tuples are too big relative to the queue size. I guess your columns are an md5 value each, which is 32 bytes + overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes the value to shared memory when more than a quarter of the queue has been filled, that probably means we flush every 4-5 tuples. I wonder if that means we need a smaller threshold, like 1/8 of the queue size? Or maybe the behavior should be adaptive somehow, depending on whether the receiver ends up waiting for data? Or ... perhaps only small tuples are worth batching, so that the threshold for posting to shared memory should be a constant rather than a fraction of the queue size? I guess we need to know why we see the time spike up in those cases, if we want to improve them. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/23/21 10:31 PM, Robert Haas wrote: > On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> I did find some suspicious behavior on the bigger box I have available >> (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems >> pretty weird because the worst affected case is with no parallel workers >> (so the queue changes should affect it). Not sure how to explain it, but >> the behavior seems consistent. > > That is pretty odd. I'm inclined to mostly discount the runs with > 10000 tuples because sending such a tiny number of tuples doesn't > really take any significant amount of time, and it seems possible that > variations in the runtime of other code due to code movement effects > could end up mattering more than the changes to the performance of > shm_mq. However, the results with a million tuples seem like they're > probably delivering statistically significant results ... and I guess > maybe what's happening is that the patch hurts when the tuples are too > big relative to the queue size. > Agreed on 10k rows being too small, we can ignore that. And yes, binary layout might make a difference, of course. My rule of thumb is 5% (in both directions) is about the difference that might make, and most results are within that range. > I guess your columns are an md5 value each, which is 32 bytes + > overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes > the value to shared memory when more than a quarter of the queue has > been filled, that probably means we flush every 4-5 tuples. I wonder > if that means we need a smaller threshold, like 1/8 of the queue size? > Or maybe the behavior should be adaptive somehow, depending on whether > the receiver ends up waiting for data? Or ... perhaps only small > tuples are worth batching, so that the threshold for posting to shared > memory should be a constant rather than a fraction of the queue size? > I guess we need to know why we see the time spike up in those cases, > if we want to improve them. > Not sure about this, because (a) That should affect both CPUs, I think, but i5-2500k does not have any such issue. (b) One thing I haven't mentioned is I tried with larger queue sizes too (that's the 16kB, 64kB, 256kB and 1MB in columns). Although it's true larger queue improve the situation a bit. (c) This can't explain the slowdown for cases without any Gather nodes (and it's ~17%, so unlikely due to binary layout). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > (c) This can't explain the slowdown for cases without any Gather nodes > (and it's ~17%, so unlikely due to binary layout). Yeah, but none of the modified code would even execute in those cases, so it's either binary layout, something wrong in your test environment, or gremlins. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Sep 24, 2021 at 2:01 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > I did find some suspicious behavior on the bigger box I have available > > (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems > > pretty weird because the worst affected case is with no parallel workers > > (so the queue changes should affect it). Not sure how to explain it, but > > the behavior seems consistent. > > That is pretty odd. I'm inclined to mostly discount the runs with > 10000 tuples because sending such a tiny number of tuples doesn't > really take any significant amount of time, and it seems possible that > variations in the runtime of other code due to code movement effects > could end up mattering more than the changes to the performance of > shm_mq. However, the results with a million tuples seem like they're > probably delivering statistically significant results ... and I guess > maybe what's happening is that the patch hurts when the tuples are too > big relative to the queue size. I am looking at the "query-results.ods" file shared by Tomas, with a million tuple I do not really see where the patch hurts? because I am seeing in most of the cases the time taken by the patch is 60-80% compared to the head. And the worst case with a million tuple is 100.32% are are we pointing to that 0.32% or there is something else that I am missing here. > > I guess your columns are an md5 value each, which is 32 bytes + > overhead, so a 20-columns tuple is ~1kB. Since Dilip's patch flushes > the value to shared memory when more than a quarter of the queue has > been filled, that probably means we flush every 4-5 tuples. I wonder > if that means we need a smaller threshold, like 1/8 of the queue size? > Or maybe the behavior should be adaptive somehow, depending on whether > the receiver ends up waiting for data? Or ... perhaps only small > tuples are worth batching, so that the threshold for posting to shared > memory should be a constant rather than a fraction of the queue size? > I guess we need to know why we see the time spike up in those cases, > if we want to improve them. I will test with the larger tuple sizes and will see the behavior with different thresholds. With 250 bytes tuple size, I have tested with different thresholds and it appeared that 1/4 of the queue size works best. But I will do more detailed testing and share the results. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 24, 2021 at 1:30 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 9/23/21 9:31 PM, Robert Haas wrote: > > On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> But I am attaching both the patches in case you want to play around. > > > > I don't really see any reason not to commit 0001. Perhaps some very > > minor grammatical nitpicking is in order here, but apart from that I > > can't really see anything to criticize with this approach. It seems > > safe enough, it's not invasive in any way that matters, and we have > > benchmark results showing that it works well. If someone comes up with > > something even better, no harm done, we can always change it again. > > > > Objections? > > Yeah, it seems like a fairly clear win, according to the benchmarks. > > I did find some suspicious behavior on the bigger box I have available > (with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems > pretty weird because the worst affected case is with no parallel workers > (so the queue changes should affect it). Not sure how to explain it, but > the behavior seems consistent. > > Anyway, the other machine with a single CPU seems perfectly clean. Tomas, can you share your test script, I would like to repeat the same test in my environment and with different batching sizes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Tomas, can you share your test script, I would like to repeat the same > test in my environment and with different batching sizes. I think it's probably the run.sh file attached to http://postgr.es/m/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Sep 24, 2021 at 1:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I am looking at the "query-results.ods" file shared by Tomas, with a > million tuple I do not really see where the patch hurts? because I am > seeing in most of the cases the time taken by the patch is 60-80% > compared to the head. And the worst case with a million tuple is > 100.32% are are we pointing to that 0.32% or there is something else > that I am missing here. The spreadsheet has two tabs. Flip to "xeon e5-2620v3" and scroll all the way down. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/24/21 1:43 AM, Robert Haas wrote: > On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> (c) This can't explain the slowdown for cases without any Gather nodes >> (and it's ~17%, so unlikely due to binary layout). > > Yeah, but none of the modified code would even execute in those cases, > so it's either binary layout, something wrong in your test > environment, or gremlins. > I'm not going to the office all that often these days, but I think I'm sure I'd notice a bunch of gremlins running around ... so it's probably the binary layout thing. But still, strange. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/24/21 7:08 PM, Robert Haas wrote: > On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Tomas, can you share your test script, I would like to repeat the same >> test in my environment and with different batching sizes. > > I think it's probably the run.sh file attached to > http://postgr.es/m/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com > Yep. I've used a slightly modified version, which also varies the queue size, but we can ignore that and the rest of the script is the same. As for the config, I've used only slightly tuned postgresql.conf, with shared_buffers set to 1GB, and increased worker limits. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 9/24/21 7:08 PM, Robert Haas wrote: > > On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> Tomas, can you share your test script, I would like to repeat the same > >> test in my environment and with different batching sizes. For now I have tested for 1M and 10M rows, shared buffers=16GM, for now tested with default batching 1/4th of the queue size and I can see the performance gain is huge. Time taken with the patch is in the range of 37-90% compared to the master. Please refer to the attached file for more detailed results. I could not see any regression that Tomas saw, still I am planning to repeat it with different batch sizes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Sep 26, 2021 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 9/24/21 7:08 PM, Robert Haas wrote: > > > On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > >> Tomas, can you share your test script, I would like to repeat the same > > >> test in my environment and with different batching sizes. > > For now I have tested for 1M and 10M rows, shared buffers=16GM, for > now tested with default batching 1/4th of the queue size and I can see > the performance gain is huge. Time taken with the patch is in the > range of 37-90% compared to the master. Please refer to the attached > file for more detailed results. I could not see any regression that > Tomas saw, still I am planning to repeat it with different batch > sizes. I have done testing with different batch sizes, 16k (which is the same as 1/4 of the queue size with 64k queue size) , 8k, 4k, 2k. In the attached sheet I have done a comparison of 1. head vs patch (1/4 queue size) = execution time reduced to 37% to 90% this is the same as the old sheet. 2. patch (1/4 queue size) vs patch(8k batch) = both are same, but 8k batch size is slow in some cases. 3. patch (1/4 queue size) vs patch(4k batch) = both are same, but 4k batch size is slow in some cases (even slower than 8k batch size). 4. patch (1/4 queue size) vs patch(2k batch) = 2k batch size is significantly slow. With these results, 1/4 of the queue size seems to be the winner and I think we might go for that value, however someone might think that 4k batch size is optimal because it is just marginally slow and with that we will have to worry less about increasing the latency in some worse case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Sep 27, 2021 at 1:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have done testing with different batch sizes, 16k (which is the same > as 1/4 of the queue size with 64k queue size) , 8k, 4k, 2k. > > In the attached sheet I have done a comparison of > 1. head vs patch (1/4 queue size) = execution time reduced to 37% to > 90% this is the same as the old sheet. > 2. patch (1/4 queue size) vs patch(8k batch) = both are same, but 8k > batch size is slow in some cases. > 3. patch (1/4 queue size) vs patch(4k batch) = both are same, but 4k > batch size is slow in some cases (even slower than 8k batch size). > 4. patch (1/4 queue size) vs patch(2k batch) = 2k batch size is > significantly slow. Generally these results seem to show that a larger batch size is better than a smaller one, but we know that's not true everywhere and under all circumstances, because some of Tomas's numbers are worse than the unpatched cases. And I think we don't really understand the reason for those results. Now it could be that there's no really good reason for those results, and it's just something weird or not very generally interesting. On the other hand, while it's easy to see that batching can be a win if it avoids contention, it also seems easy to imagine that it can be a loss. By postponing the update to shared memory, we are essentially gambling. If nobody would have read the updated value anyway, we win, because we avoided doing work that wasn't really needed by consolidating multiple updates of shared memory down to one. However, imagine the scenario where someone reads a value that is small enough that they have to block, because they think there's no more data available. If there really is more data available and we just didn't update shared memory, then we lose. Here, the wins are going to be much smaller than the losses. Cache line contention isn't cheap, but it's a lot cheaper than actually having a process go to sleep and having to wake it up again. So essentially the patch is betting that the winning scenario is much more common than the losing scenario - the occasional big losses when the reader sleeps unnecessarily will be more than counterbalanced by the small wins every time we skip an update to shared memory without causing that to happen. And most of the time, that's probably a good bet. But, if you do somehow hit the losing case repeatedly, then you could see a significant regression. And that might explain Tomas's results. Perhaps for some reason they just happen to hit that case over and over again. If that's true, it would be useful to know why it happens in that case and not others, because then maybe we could avoid the problem somehow. However, I'm not sure how to figure that out, and I'm not even entirely sure it's important to figure it out. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > And most of the time, that's probably a good bet. But, if you do > somehow hit the losing case repeatedly, then you could see a > significant regression. And that might explain Tomas's results. > Perhaps for some reason they just happen to hit that case over and > over again. If that's true, it would be useful to know why it happens > in that case and not others, because then maybe we could avoid the > problem somehow. However, I'm not sure how to figure that out, and I'm > not even entirely sure it's important to figure it out. > Yeah, if it is losing in some cases then it is definitely good to know the reason, I was just looking into the performance numbers shared by Tomas in query-results, I can see the worst case is with 10000000 rows, 10 columns and 4 threads and queue size 64k. Basically, if we see the execution time with head is ~804ms whereas with patch it is ~1277 ms. But then I just tried to notice the pattern with different queue size so number are like below, 16k 64k 256k 1024k Head 1232.779 804.24 1134.723 901.257 Patch 1371.493 1277.705 862.598 783.481 So what I have noticed is that in most of the cases on head as well as with the patch, increasing the queue size make it faster, but with head suddenly for this particular combination of rows, column and thread the execution time is very low for 64k queue size and then again the execution time increased with 256k queue size and then follow the pattern. So this particular dip in the execution time on the head looks a bit suspicious to me. I mean how could we justify this sudden big dip in execution time w.r.t the other pattern. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > And most of the time, that's probably a good bet. But, if you do > > somehow hit the losing case repeatedly, then you could see a > > significant regression. And that might explain Tomas's results. > > Perhaps for some reason they just happen to hit that case over and > > over again. If that's true, it would be useful to know why it happens > > in that case and not others, because then maybe we could avoid the > > problem somehow. However, I'm not sure how to figure that out, and I'm > > not even entirely sure it's important to figure it out. > > > > Yeah, if it is losing in some cases then it is definitely good to know > the reason, I was just looking into the performance numbers shared by > Tomas in query-results, I can see the worst case is > with 10000000 rows, 10 columns and 4 threads and queue size 64k. > Basically, if we see the execution time with head is ~804ms whereas > with patch it is ~1277 ms. But then I just tried to notice the > pattern with different queue size so number are like below, > > 16k 64k 256k 1024k > Head 1232.779 804.24 1134.723 901.257 > Patch 1371.493 1277.705 862.598 783.481 > > So what I have noticed is that in most of the cases on head as well as > with the patch, increasing the queue size make it faster, but with > head suddenly for this particular combination of rows, column and > thread the execution time is very low for 64k queue size and then > again the execution time increased with 256k queue size and then > follow the pattern. So this particular dip in the execution time on > the head looks a bit suspicious to me. > I concur with your observation. Isn't it possible to repeat the same test in the same environment to verify these results? -- With Regards, Amit Kapila.
On 9/28/21 12:53 PM, Amit Kapila wrote: > On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote: >> >>> >>> And most of the time, that's probably a good bet. But, if you do >>> somehow hit the losing case repeatedly, then you could see a >>> significant regression. And that might explain Tomas's results. >>> Perhaps for some reason they just happen to hit that case over and >>> over again. If that's true, it would be useful to know why it happens >>> in that case and not others, because then maybe we could avoid the >>> problem somehow. However, I'm not sure how to figure that out, and I'm >>> not even entirely sure it's important to figure it out. >>> >> >> Yeah, if it is losing in some cases then it is definitely good to know >> the reason, I was just looking into the performance numbers shared by >> Tomas in query-results, I can see the worst case is >> with 10000000 rows, 10 columns and 4 threads and queue size 64k. >> Basically, if we see the execution time with head is ~804ms whereas >> with patch it is ~1277 ms. But then I just tried to notice the >> pattern with different queue size so number are like below, >> >> 16k 64k 256k 1024k >> Head 1232.779 804.24 1134.723 901.257 >> Patch 1371.493 1277.705 862.598 783.481 >> >> So what I have noticed is that in most of the cases on head as well as >> with the patch, increasing the queue size make it faster, but with >> head suddenly for this particular combination of rows, column and >> thread the execution time is very low for 64k queue size and then >> again the execution time increased with 256k queue size and then >> follow the pattern. So this particular dip in the execution time on >> the head looks a bit suspicious to me. >> > > I concur with your observation. Isn't it possible to repeat the same > test in the same environment to verify these results? > I can repeat any tests we need on that machine, of course. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 28, 2021 at 5:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > >> Yeah, if it is losing in some cases then it is definitely good to know > >> the reason, I was just looking into the performance numbers shared by > >> Tomas in query-results, I can see the worst case is > >> with 10000000 rows, 10 columns and 4 threads and queue size 64k. > >> Basically, if we see the execution time with head is ~804ms whereas > >> with patch it is ~1277 ms. But then I just tried to notice the > >> pattern with different queue size so number are like below, > >> > >> 16k 64k 256k 1024k > >> Head 1232.779 804.24 1134.723 901.257 > >> Patch 1371.493 1277.705 862.598 783.481 > >> > >> So what I have noticed is that in most of the cases on head as well as > >> with the patch, increasing the queue size make it faster, but with > >> head suddenly for this particular combination of rows, column and > >> thread the execution time is very low for 64k queue size and then > >> again the execution time increased with 256k queue size and then > >> follow the pattern. So this particular dip in the execution time on > >> the head looks a bit suspicious to me. > >> > > > > I concur with your observation. Isn't it possible to repeat the same > > test in the same environment to verify these results? > > > > I can repeat any tests we need on that machine, of course. > I think that would be great, can we just test this specific target where we are seeing a huge dip with the patch, e.g. with 10000000 rows, 10 columns and 4 threads, and queue size 64k. In my performance machine, I tried to run this test multiple times but on the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms, so I am not able to reproduce this. So it would be good if you can run only this specific test and repeat it a couple of times on your performance machine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 28, 2021 at 2:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > 16k 64k 256k 1024k > Head 1232.779 804.24 1134.723 901.257 > Patch 1371.493 1277.705 862.598 783.481 > > So what I have noticed is that in most of the cases on head as well as > with the patch, increasing the queue size make it faster, but with > head suddenly for this particular combination of rows, column and > thread the execution time is very low for 64k queue size and then > again the execution time increased with 256k queue size and then > follow the pattern. So this particular dip in the execution time on > the head looks a bit suspicious to me. I mean how could we justify > this sudden big dip in execution time w.r.t the other pattern. Oh, interesting. So there's not really a performance regression here so much as that one particular case ran exceptionally fast on the unpatched code. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/28/21 14:00, Dilip Kumar wrote: > > I think that would be great, can we just test this specific target > where we are seeing a huge dip with the patch, e.g. > with 10000000 rows, 10 columns and 4 threads, and queue size 64k. In > my performance machine, I tried to run this test multiple times but on > the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms, > so I am not able to reproduce this. So it would be good if you can > run only this specific test and repeat it a couple of times on your > performance machine. > I ran the benchmark again, with 10 runs instead of 5, the results and scripts are attached. It seems the worst case got much better and is now in line with the rest of the results, so it probably was a coincidence. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Oct 12, 2021 at 6:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 9/28/21 14:00, Dilip Kumar wrote: > > > > I think that would be great, can we just test this specific target > > where we are seeing a huge dip with the patch, e.g. > > with 10000000 rows, 10 columns and 4 threads, and queue size 64k. In > > my performance machine, I tried to run this test multiple times but on > > the head, it is taking ~2000 ms whereas with the patch it is ~1500 ms, > > so I am not able to reproduce this. So it would be good if you can > > run only this specific test and repeat it a couple of times on your > > performance machine. > > > > I ran the benchmark again, with 10 runs instead of 5, the results and > scripts are attached. It seems the worst case got much better and is now > in line with the rest of the results, so it probably was a coincidence. Thanks, yeah now it looks in line with other results. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Thanks, yeah now it looks in line with other results. Since it seems there are no remaining concerns here, and we have benchmarking results showing that the patch helps, I have committed the patch. I wonder whether the new code in shm_mq_send_bytes() should guard against calling shm_mq_inc_bytes_written() with a second argument of 0, or alternatively whether shm_mq_inc_bytes_written() should have an internal defense against that. It might save some writes to shared memory, but it would also add a branch, which isn't free, either. I also think that, as a followup action item, we need to reassess parallel_tuple_cost. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Thanks, yeah now it looks in line with other results. > > Since it seems there are no remaining concerns here, and we have > benchmarking results showing that the patch helps, I have committed > the patch. > Can we mark the corresponding CF entry [1] as committed? [1] - https://commitfest.postgresql.org/35/3304/ -- With Regards, Amit Kapila.
On Tue, Oct 26, 2021 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Thanks, yeah now it looks in line with other results. > > > > Since it seems there are no remaining concerns here, and we have > > benchmarking results showing that the patch helps, I have committed > > the patch. > > > > Can we mark the corresponding CF entry [1] as committed? > > [1] - https://commitfest.postgresql.org/35/3304/ Done! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com