Gather performance analysis - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Gather performance analysis
Msg-id: CAFiTN-tVXqn_OG7tHNeSkBbN+iiCZTiQ83uakax43y1sQb2OBA@mail.gmail.com
List: pgsql-hackers
Hi,

I have been analyzing the performance of sending tuples from the workers to the Gather node through the tuple queue. In the past there have been many off-list discussions around this area; basically, the main point is that when "shm_mq" was implemented, it was probably one of the best ways to do this at the time. But now we have other choices, like DSA for allocating shared memory on demand and shared temporary files for a non-blocking tuple queue. So my motivation for looking into this area is that we now have more flexible alternatives, so can we use them to make Gather faster, and if so:

1. Can we actually reduce the tuple transfer cost and enable parallelism in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g. to implement a redistribute operator where we need to transfer data between the workers?

IMHO, for #1 it will be good enough if we can make the tuple transfer faster, but for #2 we will need a) faster tuple transfer, because then we will have to transfer tuples between the workers as well, and b) an infinite, non-blocking tuple queue (maybe using a shared temp file) so that there is no deadlock while workers are redistributing tuples to each other.

So I have done some quick performance tests and analysis using perf, plus some experiments with small prototypes targeting different sets of problems.

-- Setup
SET parallel_tuple_cost TO 0;  -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) FROM generate_series(1,200000000) AS i;
ANALYZE t;

Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
  - 43.57% shm_mq_receive
     - 78.94% shm_mq_receive_bytes
        - 91.27% pg_atomic_read_u64
             pg_atomic_read_u64_impl
           - apic_timer_interrupt
                smp_apic_timer_interrupt

Perf analysis: Worker Node
  - 99.14% shm_mq_sendv
     - 74.10% shm_mq_send_bytes
        + 42.35% shm_mq_inc_bytes_written
        - 32.56% pg_atomic_read_u64
           - pg_atomic_read_u64_impl
              - 86.27% apic_timer_interrupt
        + 17.93% WaitLatch

From the perf results and also from the code analysis, I can see two main problems here:

1. Synchronization between the worker and the Gather node: just to track the bytes written and read, they need at least 2-3 atomic operations per tuple, and I think that carries a huge penalty due to a) frequent cache line invalidation and b) the sheer number of atomic operations.
2. If the tuple queue is full, the worker might need to wait for the Gather node to consume tuples.

Experiment #1:
For this experiment I modified the sender to keep a local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh handle, so that we don't need to read/write these cache-sensitive shared memory variables so frequently. Now we only read/write shared memory under the conditions below (a rough sketch of this scheme is in the PS at the end of this mail):
1) If the number of available bytes is not enough to send the tuple, read the updated value of bytes read and also inform the reader about the new writes.
2) After every 4kB written, update the shared memory variable and inform the reader.
3) On detach, send any remaining data.

Machine information:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2

Results: (query: EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
   Execution Time: 31627.492 ms
2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
   Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
   Execution Time: 23649.287 ms

Observations:
- As expected, the results show that forcing parallelism (by reducing parallel_tuple_cost) drastically hurts performance.
- But in the same scenario, with the patch, we see a huge gain of ~40%.
- Even compared with the non-parallel plan we gain ~25%.
- With this, I think we can conclude that there is huge potential for improvement if we communicate the tuples in batches:
  1) One simple approach is what I used in my experiment. I think we can do a similar optimization in the reader as well: instead of reading bytes_written from shared memory every time, remember the previous value and read the updated value back from shared memory only once that has been exhausted (a matching reader-side sketch is in the PS as well).
  2) Instead of copying the whole tuple into the tuple queue, we can store the dsa_pointers of the tuple batch. I think Thomas Munro also suggested a similar approach to Robert; I got to know this in an off-list discussion with Robert.

Experiment #2:
See the behavior of increasing the parallel tuple queue size on head (for this I created a small patch to make the parallel tuple queue size configurable).

Results:
 4 WORKERS (tup_queue size = 64kB) : 38337.046 ms
 4 WORKERS (tup_queue size = 1MB)  : 36186.883 ms
 4 WORKERS (tup_queue size = 4MB)  : 36252.740 ms
 8 WORKERS (tup_queue size = 64kB) : 42296.731 ms
 8 WORKERS (tup_queue size = 1MB)  : 37403.872 ms
 8 WORKERS (tup_queue size = 4MB)  : 39184.319 ms
16 WORKERS (tup_queue size = 64kB) : 42726.139 ms
16 WORKERS (tup_queue size = 1MB)  : 36219.975 ms
16 WORKERS (tup_queue size = 4MB)  : 39117.109 ms

Observations:
- There are some gains from increasing the tuple queue size, but they are limited to about 1MB; I also tried with more data, but the gain is not linear and performance starts to drop after 4MB.
- If I apply both the Experiment #1 and Experiment #2 patches together, the execution time drops further, to 20963.539 ms (with 4 workers and a 4MB tuple queue).

Conclusion:
From the above experiments:
1) I see huge potential in the first idea, so maybe we can do more experiments based on the prototype implemented there, expand the same approach to the reader, and also try out the idea of dsa_pointers.
2) With the second idea (increasing the tuple queue size) I see some benefit, but it does not scale, so for now there is not much point in pursuing this direction. However, I think that in the future, if we want to implement a redistribute operator, an infinite tuple queue (maybe using a temp file) will be a must in order to avoid deadlock.

Note: The POC patches are not attached; I will send them after some more experiments and cleanup. Maybe I will try to optimize the reader part as well before sending them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
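
PS: To make the Experiment #1 scheme a bit more concrete, here is a small standalone toy model of the writer side, built on C11 atomics and a plain ring buffer rather than the real shm_mq machinery. All names here (ringbuf, writer_handle, FLUSH_THRESHOLD, ...) are made up for illustration; this is not the actual patch, just the shape of the "work on cached local counters, publish every 4kB / when out of space / on flush" idea.

/*
 * Toy model of the Experiment #1 idea: the writer works on cached local
 * counters and touches the shared atomics only when it runs out of space,
 * after every 4kB of writes, or on an explicit flush.  Not the real shm_mq.
 *
 * Build: cc -std=c11 -O2 toy_writer.c -o toy_writer
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_SIZE       65536   /* like the 64kB default tuple queue */
#define FLUSH_THRESHOLD 4096    /* publish writes every 4kB, as in the patch */

typedef struct ringbuf
{
    _Atomic uint64_t bytes_read;        /* published by the reader */
    _Atomic uint64_t bytes_written;     /* published by the writer */
    char        data[RING_SIZE];
} ringbuf;

typedef struct writer_handle
{
    ringbuf    *rb;
    uint64_t    cached_bytes_read;          /* last rb->bytes_read we saw */
    uint64_t    local_bytes_written;        /* includes unpublished writes */
    uint64_t    published_bytes_written;    /* last value we stored */
} writer_handle;

/* Publish our write position so the reader can see the new data. */
static void
writer_flush(writer_handle *wh)
{
    atomic_store_explicit(&wh->rb->bytes_written, wh->local_bytes_written,
                          memory_order_release);
    wh->published_bytes_written = wh->local_bytes_written;
}

/* Append len bytes (len <= RING_SIZE); false means the ring is full. */
static bool
writer_send(writer_handle *wh, const char *src, uint64_t len)
{
    uint64_t    used = wh->local_bytes_written - wh->cached_bytes_read;

    if (RING_SIZE - used < len)
    {
        /* Maybe the reader has caught up; refresh the cached read pointer. */
        wh->cached_bytes_read =
            atomic_load_explicit(&wh->rb->bytes_read, memory_order_acquire);
        used = wh->local_bytes_written - wh->cached_bytes_read;
        if (RING_SIZE - used < len)
        {
            writer_flush(wh);   /* let the reader see what we have ... */
            return false;       /* ... and the caller would wait on a latch */
        }
    }

    /* Copy into the ring, handling wrap-around. */
    uint64_t    off = wh->local_bytes_written % RING_SIZE;
    uint64_t    chunk = (RING_SIZE - off < len) ? RING_SIZE - off : len;

    memcpy(wh->rb->data + off, src, chunk);
    memcpy(wh->rb->data, src + chunk, len - chunk);
    wh->local_bytes_written += len;

    /* Publish only once enough unpublished data has built up. */
    if (wh->local_bytes_written - wh->published_bytes_written >= FLUSH_THRESHOLD)
        writer_flush(wh);

    return true;
}

int
main(void)
{
    static ringbuf rb;                  /* would live in shared memory */
    writer_handle wh = {.rb = &rb};
    const char  msg[] = "0123456789abcdef";

    while (writer_send(&wh, msg, sizeof(msg)))
        ;                               /* fill the queue; nobody reads here */
    writer_flush(&wh);                  /* the real code also flushes on detach */

    printf("queued %llu bytes with only a handful of atomic stores\n",
           (unsigned long long) atomic_load(&rb.bytes_written));
    return 0;
}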
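
And a matching reader-side sketch for the "remember the previous bytes_written" idea from the observations above. It reuses the ringbuf layout and constants from the writer sketch; again, the names are illustrative and this is not the real shm_mq code.

/* Reader-side counterpart; reuses ringbuf, RING_SIZE and FLUSH_THRESHOLD. */
typedef struct reader_handle
{
    ringbuf    *rb;
    uint64_t    cached_bytes_written;   /* last rb->bytes_written we saw */
    uint64_t    local_bytes_read;       /* includes unpublished reads */
    uint64_t    published_bytes_read;   /* last value we stored */
} reader_handle;

/* Tell the writer how far we have read, freeing space in the ring. */
static void
reader_flush(reader_handle *rh)
{
    atomic_store_explicit(&rh->rb->bytes_read, rh->local_bytes_read,
                          memory_order_release);
    rh->published_bytes_read = rh->local_bytes_read;
}

/*
 * Copy up to len available bytes into dst, returning the number copied.
 * bytes_written is re-read from shared memory only once the previously seen
 * value is exhausted, and bytes_read is published back only every
 * FLUSH_THRESHOLD bytes, mirroring the writer side.
 */
static uint64_t
reader_recv(reader_handle *rh, char *dst, uint64_t len)
{
    uint64_t    avail = rh->cached_bytes_written - rh->local_bytes_read;

    if (avail == 0)
    {
        /* Exhausted what we knew about; check shared memory again. */
        rh->cached_bytes_written =
            atomic_load_explicit(&rh->rb->bytes_written, memory_order_acquire);
        avail = rh->cached_bytes_written - rh->local_bytes_read;
        if (avail == 0)
            return 0;           /* caller would wait on a latch here */
    }

    if (len > avail)
        len = avail;

    /* Copy out of the ring, handling wrap-around. */
    uint64_t    off = rh->local_bytes_read % RING_SIZE;
    uint64_t    chunk = (RING_SIZE - off < len) ? RING_SIZE - off : len;

    memcpy(dst, rh->rb->data + off, chunk);
    memcpy(dst + chunk, rh->rb->data, len - chunk);
    rh->local_bytes_read += len;

    if (rh->local_bytes_read - rh->published_bytes_read >= FLUSH_THRESHOLD)
        reader_flush(rh);

    return len;
}

In a real implementation both sides would of course also have to publish their counters before sleeping on the latch; otherwise each could end up waiting for an update the other has not flushed yet.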