Re: Gather performance analysis - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Gather performance analysis
Date
Msg-id CAFiTN-uNByjDK3+_q129NpZBALJngL8p1Kj=JAdqp585DvRQQA@mail.gmail.com
In response to Re: Gather performance analysis  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on various suggestions, I have run some more experiments with the patch.

1) I measured the cache-miss counts and see a roughly 20-point reduction
in the cache-miss rate with the patch (which updates the shared memory
counter only after a certain amount of data has been written).
command: perf stat -e
cycles,instructions,cache-references,cache-misses -p <receiver-pid>
Head:
    13,918,480,258      cycles
    21,082,968,730      instructions              #    1.51  insn per cycle
        13,206,426      cache-references
        12,432,402      cache-misses              #   94.139 % of all cache refs

Patch:
    14,119,691,844      cycles
    29,497,239,984      instructions              #    2.09  insn per cycle
         4,245,819      cache-references
         3,085,047      cache-misses              #   72.661 % of all cache refs

I took multiple samples with different execution times, and the
cache-miss rate with the patch stays at 72-74%, whereas without the
patch it is 92-94%.  So, as expected, these results clearly show that
we are saving a lot by avoiding cache misses.
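To illustrate the batching idea measured above, here is a minimal sketch
(not the actual patch code; `demo_mq`, `local_bytes_read`, and
`MQ_UPDATE_THRESHOLD` are illustrative names, and the real patch derives
the threshold from the ring size): the receiver accumulates consumed
bytes in a private counter and touches the shared atomic only once per
batch, so the sender's cache line is invalidated far less often.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative threshold; the real patch picks a fraction of the ring size. */
#define MQ_UPDATE_THRESHOLD 4096

typedef struct
{
	_Atomic size_t mq_bytes_read;	/* shared; read by the sender */
	size_t		local_bytes_read;	/* receiver-private accumulator */
} demo_mq;

/*
 * Receiver consumed 'n' bytes: accumulate locally, and publish to the
 * shared counter only once the batch threshold is reached.
 */
static void
demo_consume(demo_mq *mq, size_t n)
{
	mq->local_bytes_read += n;
	if (mq->local_bytes_read >= MQ_UPDATE_THRESHOLD)
	{
		atomic_fetch_add(&mq->mq_bytes_read, mq->local_bytes_read);
		mq->local_bytes_read = 0;
	}
}
```

The trade-off is exactly the one probed in test case 2 below: the sender
sees free ring space (and the receiver's progress) with some delay.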

2) As pointed out by Tomas, I tried test cases where this patch could
regress performance:

CREATE TABLE t1 (a int, b varchar);
INSERT INTO t1 SELECT i, repeat('a', 200) FROM generate_series(1,200000000) AS i;
SET enable_gathermerge = off;
Query: SELECT * FROM t1 WHERE a < 100000 ORDER BY a;

Plan:
Sort  (cost=1714422.10..1714645.24 rows=89258 width=15)
   ->  Gather  (cost=1000.00..1707082.55 rows=89258 width=15)
         ->  Parallel Seq Scan on t1  (cost=0.00..1706082.55 rows=22314 width=15)
               Filter: (a < 100000)

The idea is that without the patch the sort node gets each tuple
immediately, whereas with the patch there is some delay before we send
a tuple to the gather node because of the batching.  Even with this
case I did not notice any consistent regression with the patch;
however, with EXPLAIN ANALYZE I noticed a 2-3% drop with the patch.

3) I tried some other optimizations suggested by Andres:
a) Separating read-only and read-write data in shm_mq, and also moving
some fields out of shm_mq.

struct shm_mq (after change)
{
	/* mostly read-only fields */
	PGPROC	   *mq_receiver;
	PGPROC	   *mq_sender;
	bool		mq_detached;
	slock_t		mq_mutex;

	/* read-write fields */
	pg_atomic_uint64 mq_bytes_read;
	pg_atomic_uint64 mq_bytes_written;
	char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle.

I did not see any extra improvement with this idea.

4) Another thought: changing "mq_ring_size" to a mask.
- I think this could improve something, but currently "mq_ring_size"
is not a power-of-2 value, so we cannot convert it to a mask
directly.
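For reference, the mask trick replaces the modulo in the ring-offset
computation with a bitwise AND, which is only valid when the ring size
is a power of two (a sketch with assumed function names, not patch
code):

```c
#include <assert.h>
#include <stdint.h>

/* Ring-buffer offset via modulo: works for any ring size. */
static inline uint64_t
ring_offset_mod(uint64_t pos, uint64_t ring_size)
{
	return pos % ring_size;
}

/*
 * Equivalent offset via bitmask: valid only when ring_size is a power
 * of two, since then (ring_size - 1) has all low bits set.
 */
static inline uint64_t
ring_offset_mask(uint64_t pos, uint64_t ring_size)
{
	assert((ring_size & (ring_size - 1)) == 0);	/* must be 2^k */
	return pos & (ring_size - 1);
}
```

So taking advantage of this would require rounding the ring size up (or
down) to a power of two when the queue is created.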


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


