Re: Gather performance analysis - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Gather performance analysis
Date
Msg-id 20210907213844.wyln22fasij54ltz@alap3.anarazel.de
Whole thread Raw
In response to Gather performance analysis  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Gather performance analysis
List pgsql-hackers
Hi,

On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
> --Setup
> SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
> CREATE TABLE t (a int, b varchar);
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
> ANALYZE t;
> Test query: EXPLAIN ANALYZE SELECT * FROM t;
> 
> Perf analysis: Gather Node
>    - 43.57% shm_mq_receive
>       - 78.94% shm_mq_receive_bytes
>          - 91.27% pg_atomic_read_u64
>             - pg_atomic_read_u64_impl
>                - apic_timer_interrupt
>                  smp_apic_timer_interrupt
> 
> Perf analysis: Worker Node
>       - 99.14% shm_mq_sendv
>          - 74.10% shm_mq_send_bytes
>             + 42.35% shm_mq_inc_bytes_written
>             - 32.56% pg_atomic_read_u64
>                - pg_atomic_read_u64_impl
>                   - 86.27% apic_timer_interrupt
>             + 17.93% WaitLatch
> 
> From the perf results and also from the code analysis I can think of
> two main problems here

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)                  │
│   Workers Planned: 2                                                                                         │
│   Workers Launched: 2                                                                                        │
│   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms                                                                                      │
│ Execution Time: 24954.012 ms                                                                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)


Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Next
From: Victor Spirin
Date:
Subject: Re: Atomic rename feature for Windows.