Re: Gather performance analysis - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: Gather performance analysis
Date:
Msg-id: CAFiTN-uNByjDK3+_q129NpZBALJngL8p1Kj=JAdqp585DvRQQA@mail.gmail.com
In response to: Re: Gather performance analysis (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on the various suggestions, I have run some more experiments with the patch.

1) I measured the cache-miss counts, and I see a ~20% reduction in cache misses with the patch (which updates the shared-memory counter only after a certain amount of data has been written). A minimal standalone sketch of this batching scheme is at the end of this mail.

Command:

```
perf stat -e cycles,instructions,cache-references,cache-misses -p <receiver-pid>
```

Head:

```
13,918,480,258  cycles
21,082,968,730  instructions      # 1.51 insn per cycle
    13,206,426  cache-references
    12,432,402  cache-misses      # 94.139 % of all cache refs
```

Patch:

```
14,119,691,844  cycles
29,497,239,984  instructions      # 2.09 insn per cycle
     4,245,819  cache-references
     3,085,047  cache-misses      # 72.661 % of all cache refs
```

I took multiple samples with different execution times, and the cache misses with the patch are 72-74%, whereas without the patch they are 92-94%. So, as expected, these results clearly show that we save a lot by avoiding cache misses.

2) As pointed out by Tomas, I tried test cases where this patch could regress performance:

```sql
CREATE TABLE t1 (a int, b varchar);
INSERT INTO t1 SELECT i, repeat('a', 200) FROM generate_series(1, 200000000) AS i;
SET enable_gathermerge = off;
```

Query:

```sql
SELECT * FROM t1 WHERE a < 100000 ORDER BY a;
```

Plan:

```
Sort  (cost=1714422.10..1714645.24 rows=89258 width=15)
  ->  Gather  (cost=1000.00..1707082.55 rows=89258 width=15)
        ->  Parallel Seq Scan on t1  (cost=0.00..1706082.55 rows=22314 width=15)
              Filter: (a < 100000)
```

The idea is that without the patch the Sort node gets each tuple immediately, whereas with the patch there is some delay before we send tuples to the Gather node because we are batching. Even with this, I did not notice any consistent regression with the patch; however, with EXPLAIN ANALYZE I noticed a 2-3% drop with the patch.

3) I tried some other optimizations pointed out by Andres: a) separating the read-only and read-write data in shm_mq, and moving some fields out of shm_mq:

```c
struct shm_mq   /* after change */
{
    /* mostly read-only fields */
    PGPROC     *mq_receiver;
    PGPROC     *mq_sender;
    bool        mq_detached;
    slock_t     mq_mutex;

    /* read-write fields */
    pg_atomic_uint64 mq_bytes_read;
    pg_atomic_uint64 mq_bytes_written;

    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};
```

Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle. I did not see any extra improvement with this idea. (A sketch of a cache-line-padded variant of this layout is at the end of this mail.)

4) Another thought was about changing "mq_ring_size" to a mask. I think this could improve something, but currently "mq_ring_size" is not a power of two, so we cannot convert it to a mask directly. (A sketch of the mask computation is also at the end of this mail.)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
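To make the batching in (1) concrete, here is a minimal standalone sketch of the idea. This is not the actual shm_mq code: it uses C11 atomics instead of pg_atomic_uint64, and ring_t, RING_SIZE, FLUSH_THRESHOLD, and the quarter-ring threshold are all illustrative assumptions.

```c
/*
 * Standalone sketch of batched counter publication: the sender tracks
 * bytes written in private memory and updates the shared atomic only
 * once a threshold of unpublished data has accumulated.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE       65536
#define FLUSH_THRESHOLD (RING_SIZE / 4)  /* publish after 1/4 of the ring */

typedef struct ring_t
{
    _Atomic uint64_t bytes_read;      /* advanced by the receiver */
    _Atomic uint64_t bytes_written;   /* advanced (lazily) by the sender */
    char             data[RING_SIZE];
} ring_t;

/* Sender-private state: what we have written but not yet published. */
typedef struct sender_t
{
    ring_t   *ring;
    uint64_t  local_written;  /* private running count of bytes written */
    uint64_t  unflushed;      /* bytes written since the last publish */
} sender_t;

static void
flush_written(sender_t *s)
{
    /* One shared-memory store makes all pending bytes visible, instead of
     * one store (and one cache-line invalidation) per enqueued chunk. */
    atomic_store_explicit(&s->ring->bytes_written, s->local_written,
                          memory_order_release);
    s->unflushed = 0;
}

/* Enqueue up to len bytes; returns the number actually written. */
static size_t
ring_write(sender_t *s, const char *buf, size_t len)
{
    uint64_t read_pos = atomic_load_explicit(&s->ring->bytes_read,
                                             memory_order_acquire);
    uint64_t used = s->local_written - read_pos;
    size_t   avail = (size_t) (RING_SIZE - used);

    if (len > avail)
        len = avail;
    if (len == 0)
    {
        /* Ring full: publish what we have so the receiver can drain it.
         * The real patch likewise forces a flush before the sender blocks
         * waiting for space, so the receiver never waits on unpublished
         * bytes. */
        flush_written(s);
        return 0;
    }

    for (size_t i = 0; i < len; i++)
        s->ring->data[(s->local_written + i) % RING_SIZE] = buf[i];

    s->local_written += len;
    s->unflushed += len;

    /* The key trick: touch the shared counter only once enough data has
     * accumulated, cutting cache-line ping-pong with the receiver. */
    if (s->unflushed >= FLUSH_THRESHOLD)
        flush_written(s);

    return len;
}
```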
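On 3a), a related variant (shown only as a sketch, not something measured here) is padding the frequently-updated counters onto their own cache lines, so the sender's stores to mq_bytes_written cannot invalidate the line holding mq_bytes_read or the read-mostly fields. This is standalone C11 with alignas; CACHE_LINE_SIZE is an assumption (PostgreSQL's own constant is PG_CACHE_LINE_SIZE in pg_config_manual.h), and this exact layout is not what the patch does.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption; PG_CACHE_LINE_SIZE is 128 */

typedef struct padded_mq
{
    /* read-mostly fields: set up once, then only read by both sides */
    void  *mq_receiver;
    void  *mq_sender;
    bool   mq_detached;

    /* Each hot counter gets its own cache line, so stores to one never
     * invalidate the line holding the other (or the fields above). */
    alignas(CACHE_LINE_SIZE) _Atomic uint64_t mq_bytes_read;
    alignas(CACHE_LINE_SIZE) _Atomic uint64_t mq_bytes_written;

    char   mq_ring[];         /* flexible array member, as in shm_mq */
} padded_mq;
```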
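And on (4), the point of the mask: with a power-of-two ring size, the wraparound in the offset computation becomes a single AND instead of an integer division. A sketch, assuming we were willing to round the ring size up (pg_bitutils.h already provides pg_nextpower2_64() for that; the helpers below are standalone stand-ins):

```c
#include <stdint.h>

/* Round v up to the next power of two (assumes 0 < v <= 2^63). */
static inline uint64_t
round_up_pow2(uint64_t v)
{
    v--;
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16; v |= v >> 32;
    return v + 1;
}

/*
 * With ring_size a power of two, mask = ring_size - 1, and the usual
 * "pos % ring_size" in the read/write paths becomes "pos & mask".
 */
static inline uint64_t
ring_offset(uint64_t pos, uint64_t mask)
{
    return pos & mask;        /* equivalent to pos % (mask + 1) */
}
```

The obvious cost is the shared memory wasted by rounding the requested queue size up, so it may not be a free win.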