Re: Gather performance analysis - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Gather performance analysis
Msg-id: 20210907213844.wyln22fasij54ltz@alap3.anarazel.de
In response to: Gather performance analysis (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Gather performance analysis
List: pgsql-hackers
Hi,

On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
> --Setup
> SET parallel_tuple_cost TO 0  -- to test parallelism in the extreme case
> CREATE TABLE t (a int, b varchar);
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
> ANALYZE t;
> Test query: EXPLAIN ANALYZE SELECT * FROM t;
>
> Perf analysis: Gather Node
> - 43.57% shm_mq_receive
>    - 78.94% shm_mq_receive_bytes
>       - 91.27% pg_atomic_read_u64
>          - pg_atomic_read_u64_impl
>             - apic_timer_interrupt
>                  smp_apic_timer_interrupt
>
> Perf analysis: Worker Node
> - 99.14% shm_mq_sendv
>    - 74.10% shm_mq_send_bytes
>       + 42.35% shm_mq_inc_bytes_written
>       - 32.56% pg_atomic_read_u64
>          - pg_atomic_read_u64_impl
>             - 86.27% apic_timer_interrupt
>    + 17.93% WaitLatch
>
> From the perf results and also from the code analysis I can think of
> two main problems here

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up as "pg_atomic_read_u64 (inlined)"
for me.

FWIW, I see times like this:

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                   │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)                   │
│   Workers Planned: 2                                                                                          │
│   Workers Launched: 2                                                                                         │
│   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3)  │
│ Planning Time: 0.043 ms                                                                                       │
│ Execution Time: 24954.012 ms                                                                                  │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

Looking at a profile, I see the biggest bottleneck in the leader (which
becomes the overall bottleneck as soon as the worker count is increased)
to be reading the length word of the message.

I do see shm_mq_receive_bytes() in the profile, but the costly part
there is the "read % (uint64) ringsize" - divisions are slow. We could
just compute a mask instead of the size.

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from
the read/write data. Or perhaps copy more info into the handle,
particularly the ring size (or mask).

Greetings,

Andres Freund
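
To illustrate the mask-instead-of-modulo idea, here is a minimal C
sketch. It assumes the ring size is rounded up to a power of two at
queue creation, so that "pos % size" can become "pos & (size - 1)"; the
names are illustrative, not the actual shm_mq fields:

#include <assert.h>
#include <stdint.h>

/* Illustrative ring-buffer state; not PostgreSQL's shm_mq struct. */
typedef struct ring_state
{
    uint64_t    ring_size;      /* must be a power of two */
    uint64_t    ring_mask;      /* ring_size - 1, computed once at setup */
} ring_state;

static void
ring_init(ring_state *rs, uint64_t size)
{
    /* a power of two has exactly one bit set */
    assert(size > 0 && (size & (size - 1)) == 0);
    rs->ring_size = size;
    rs->ring_mask = size - 1;
}

/* Before: an integer division on every offset computation. */
static uint64_t
ring_offset_div(const ring_state *rs, uint64_t pos)
{
    return pos % rs->ring_size;
}

/* After: a single AND, no division. */
static uint64_t
ring_offset_mask(const ring_state *rs, uint64_t pos)
{
    return pos & rs->ring_mask;
}

The two variants agree for every pos as long as ring_size is a power of
two; a queue size that isn't one already would have to be rounded up
when the queue is created.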
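
Similarly, a sketch of the cacheline-split and copy-into-the-handle
ideas, again with illustrative names and an assumed 64-byte cacheline.
The point is that writes to the hot counters should not invalidate the
cacheline holding fields that are only read after setup:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE_SIZE 64       /* assumed; real code would detect this */

typedef struct toy_mq
{
    /* Read-mostly after setup: fine to share one cacheline. */
    uint64_t    mq_ring_size;
    uint64_t    mq_ring_mask;
    bool        mq_detached;

    /*
     * Frequently written counters, pushed onto their own cacheline so
     * updates here don't force other backends to re-fetch the
     * read-mostly fields above.
     */
    _Alignas(CACHELINE_SIZE) _Atomic uint64_t mq_bytes_written;
    _Atomic uint64_t mq_bytes_read;
} toy_mq;

/*
 * Per-backend handle caching immutable values (the "copy more info
 * into the handle" idea): the hot path reads the mask from local
 * memory and never touches the shared struct for it.
 */
typedef struct toy_mq_handle
{
    toy_mq     *mqh_queue;
    uint64_t    mqh_ring_mask;  /* local copy of mq_ring_mask */
} toy_mq_handle;

static toy_mq_handle
toy_mq_attach(toy_mq *mq)
{
    toy_mq_handle h = {mq, mq->mq_ring_mask};
    return h;
}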