Gather performance analysis - Mailing list pgsql-hackers

From: Dilip Kumar
Subject: Gather performance analysis
Msg-id: CAFiTN-tVXqn_OG7tHNeSkBbN+iiCZTiQ83uakax43y1sQb2OBA@mail.gmail.com
List: pgsql-hackers
Hi,

I have been analyzing the performance of sending tuples from the
workers to the Gather node through the tuple queue.  In the past there
have been many off-list discussions around this area; basically, the
main point is that when "shm_mq" was implemented, it was perhaps one
of the best ways to do this at the time.  But now we have other
choices, such as DSA for allocating shared memory on demand and shared
temporary files for a non-blocking tuple queue.

So my motivation for looking into this area is that we now have more
flexible alternatives, and the question is whether we can use them to
make Gather faster, and if so:
1. Can we actually reduce the tuple transfer cost and enable
parallelism in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g., to implement a
redistribute operator where we need to transfer data between the
workers?

IMHO, for #1 it will be good enough if we can make the tuple transfer
faster, but for #2 we will need a) faster tuple transfer, because then
we will have to transfer the tuples between the workers as well, and
b) an infinite, non-blocking tuple queue (maybe using a shared temp
file) so that there is no deadlock while the workers are
redistributing tuples to each other.

So I have done some quick performance tests and analysis using perf,
along with some experiments with small prototypes targeting different
sets of problems.

-- Setup
SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
   - 43.57% shm_mq_receive
      - 78.94% shm_mq_receive_bytes
         - 91.27% pg_atomic_read_u64
            - pg_atomic_read_u64_impl
               - apic_timer_interrupt
                 smp_apic_timer_interrupt

Perf analysis: Worker Node
      - 99.14% shm_mq_sendv
         - 74.10% shm_mq_send_bytes
            + 42.35% shm_mq_inc_bytes_written
            - 32.56% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - 86.27% apic_timer_interrupt
            + 17.93% WaitLatch

From the perf results and also from the code analysis, I can see two
main problems here:

1. Synchronization between the worker and the gather node: just to
track the bytes written and read, they need to perform at least 2-3
atomic operations on shared memory for each tuple, and I think that
has a huge penalty due to a) frequent cache line invalidation and b)
the sheer number of atomic operations (see the simplified sketch after
this list).

2. If the tuple queue is full, then the worker might need to wait for
the gather node to consume tuples.
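As a reference for #1, here is a heavily simplified sketch of what the
send path does on head for every chunk of a message.  This is not the
real code; shm_mq_send_bytes() in shm_mq.c also handles ring
wraparound, detach, interrupts, and partial writes.  The point is only
to show how many shared-memory atomics are touched per tuple:

/*
 * Simplified illustration of the per-message bookkeeping on head
 * (not the real code; see shm_mq_send_bytes() in shm_mq.c).
 */
static shm_mq_result
sketch_send_bytes(shm_mq *mq, Size nbytes, const void *data)
{
    uint64      written = pg_atomic_read_u64(&mq->mq_bytes_written);

    for (;;)
    {
        uint64      rb = pg_atomic_read_u64(&mq->mq_bytes_read); /* atomic read */
        uint64      used = written - rb;

        if (mq->mq_ring_size - used < nbytes)
        {
            /* Queue full: sleep until the reader catches up. */
            WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                      WAIT_EVENT_MQ_SEND);
            ResetLatch(MyLatch);
            continue;
        }

        memcpy(&mq->mq_ring[written % mq->mq_ring_size], data, nbytes);
        shm_mq_inc_bytes_written(mq, nbytes);   /* atomic read + write */
        SetLatch(&mq->mq_receiver->procLatch);  /* wake the reader */
        return SHM_MQ_SUCCESS;
    }
}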

Experiment #1:
As part of this experiment, I have modified the sender to keep local
copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh
handle so that we don't need to frequently read/write these contended,
cache-line-sensitive shared memory variables.  So now we only
read/write the shared memory variables under the conditions below (a
rough sketch of the sender side follows the list):

1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read and also inform the reader about
the new writes.
2) After every 4kB written, update the shared memory variable and
inform the reader.
3) On detach, to flush any remaining data.
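To make the idea concrete, here is a minimal sketch of the sender
side, assuming the change lives in shm_mq.c; the handle field
mqh_bytes_written_pending, the helper names, and the threshold macro
are invented for illustration and this is not the actual patch:

#define MQH_FLUSH_THRESHOLD 4096    /* assumption: publish every 4kB */

/*
 * Publish the locally accumulated write count to shared memory and
 * wake the receiver.  Called when the threshold is crossed, when the
 * queue looks full, and at detach time.
 */
static void
mqh_flush_bytes_written(shm_mq_handle *mqh)
{
    shm_mq     *mq = mqh->mqh_queue;

    if (mqh->mqh_bytes_written_pending == 0)
        return;

    /* One atomic update and one latch wakeup for the whole batch. */
    shm_mq_inc_bytes_written(mq, mqh->mqh_bytes_written_pending);
    SetLatch(&mq->mq_receiver->procLatch);
    mqh->mqh_bytes_written_pending = 0;
}

/* Called instead of shm_mq_inc_bytes_written() after each copy. */
static void
mqh_note_bytes_written(shm_mq_handle *mqh, Size n)
{
    mqh->mqh_bytes_written_pending += n;

    if (mqh->mqh_bytes_written_pending >= MQH_FLUSH_THRESHOLD)
        mqh_flush_bytes_written(mqh);
}

With this, the common path touches only sender-local memory, and the
shared atomics (and the receiver's latch) are hit roughly once per 4kB
instead of once per tuple.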

Machine information:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
 Execution Time: 31627.492 ms

2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
 Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
 Execution Time: 23649.287 ms

Observation:
- As expected, the results show that forcing parallelism (by reducing
parallel_tuple_cost) drastically hurts performance.
- But in the same scenario, with the patch, we see a huge gain of ~40%.
- Even if we compare it with the non-parallel plan, we gain ~25%.
- With this, I think we can conclude that there is huge potential for
improvement if we communicate the tuples in batches.  1) One simple
approach is what I used in my experiment; I think we can do a similar
optimization in the reader as well: instead of reading bytes_written
from shared memory every time, remember the previous value and only
read the updated value from shared memory once we have exhausted it
(a rough sketch follows this list).  2) Instead of copying the whole
tuple into the tuple queue, we can store the dsa_pointers of a batch
of tuples; I think Thomas Munro also suggested a similar approach to
Robert, which I got to know in an off-list discussion with Robert.
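For the reader-side idea in 1), the caching could look roughly like
this (mqh_last_seen_written and mqh_read_so_far are invented handle
fields, shown only to illustrate refreshing the shared counter after
the cached bytes are used up):

static uint64
mqh_bytes_available(shm_mq_handle *mqh)
{
    shm_mq     *mq = mqh->mqh_queue;

    /* Re-read the shared atomic only once the cached value is exhausted. */
    if (mqh->mqh_last_seen_written == mqh->mqh_read_so_far)
        mqh->mqh_last_seen_written =
            pg_atomic_read_u64(&mq->mq_bytes_written);

    return mqh->mqh_last_seen_written - mqh->mqh_read_so_far;
}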

Experiment #2: Measure the effect of increasing the parallel tuple
queue size on head (for this I created a small patch to make the
parallel tuple queue size configurable).
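The experiment patch itself is not attached; a minimal sketch of that
kind of change, with illustrative names, would be a new GUC replacing
the PARALLEL_TUPLE_QUEUE_SIZE constant in execParallel.c:

/* guc.c: a possible entry in ConfigureNamesInt[] (illustrative only) */
int         parallel_tuple_queue_size = 65536;

{
    {"parallel_tuple_queue_size", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the size of each worker-to-leader tuple queue."),
        NULL,
        GUC_UNIT_BYTE
    },
    &parallel_tuple_queue_size,
    65536, 65536, INT_MAX,
    NULL, NULL, NULL
},

/* execParallel.c: use the GUC where PARALLEL_TUPLE_QUEUE_SIZE is used today */
shm_toc_estimate_chunk(&pcxt->estimator,
                       mul_size(parallel_tuple_queue_size, pcxt->nworkers));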

-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS (tup_queue size= 1MB)  : 36186.883 ms
4 WORKERS (tup_queue size= 4MB)  : 36252.740 ms

8 WORKERS (tup_queue size= 64kB) : 42296.731 ms
8 WORKERS (tup_queue size= 1MB)  : 37403.872 ms
8 WORKERS (tup_queue size= 4MB)  : 39184.319 ms

16 WORKERS (tup_queue size= 64kB) : 42726.139 ms
16 WORKERS (tup_queue size= 1MB)  : 36219.975 ms
16 WORKERS (tup_queue size= 4MB)  : 39117.109 ms

Observation:
- There are some gains from increasing the tuple queue size, but they
are limited to about 1MB; I also tried with more data, but the gain is
not linear and performance starts to drop beyond 4MB.
- If I apply both the Experiment #1 and Experiment #2 patches together,
we can further reduce the execution time to 20963.539 ms (with 4
workers and a 4MB tuple queue).

Conclusion:
With the above experiments:
1) I see huge potential in the first idea, so maybe we can do more
experiments based on the prototype implemented for it, extend the same
approach to the reader side, and also try out the idea of the
dsa_pointers.

2) With the second idea of increasing the tuple queue size, I see some
benefit, but it does not scale, so for now there is probably not much
point in pursuing this direction.  However, I think that in the
future, if we want to implement the redistribute operator, an infinite
tuple queue (maybe using a temp file) will be a must to avoid
deadlocks.

Note: POC patches are not attached; I will send them after some more
experiments and cleanup.  Maybe I will also try to optimize the reader
part before sending them.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


