[HACKERS] Fix performance of generic atomics - Mailing list pgsql-hackers

From Sokolov Yura
Subject [HACKERS] Fix performance of generic atomics
Msg-id 7f65886daca545067f82bf2b463b218d@postgrespro.ru
List pgsql-hackers
Good day, everyone.

I've been playing with pgbench on a huge machine
(72 cores, 56 of them for postgresql, enough memory to fit the
database both in shared_buffers and in the file cache).
Settings: pgbench scale 500, unlogged tables, fsync=off,
synchronous_commit=off, wal_writer_flush_after=0.

With 200 clients, performance is around 76000tps, and the main
bottleneck in this dumb test is LWLockWaitListLock.

I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
(i.e. using __sync_fetch_and_or), and performance became 83000tps.
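
For illustration, a gcc-specific fetch-or could look roughly like the sketch below. This is not the attached patch: demo_atomic_uint32 and demo_fetch_or_u32 are hypothetical stand-ins for pg_atomic_uint32 and pg_atomic_fetch_or_u32_impl, so the snippet compiles on its own with gcc or clang. Only the __sync_fetch_and_or builtin is the real thing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_atomic_uint32, for illustration only. */
typedef struct demo_atomic_uint32
{
    volatile uint32_t value;
} demo_atomic_uint32;

/*
 * Atomically OR or_ into the value and return the value that was there
 * before the OR, using the GCC/Clang __sync_fetch_and_or builtin.
 */
static inline uint32_t
demo_fetch_or_u32(demo_atomic_uint32 *ptr, uint32_t or_)
{
    return __sync_fetch_and_or(&ptr->value, or_);
}

/* Small self-check: the old value is returned, the new bits are set. */
static inline void
demo_fetch_or_selfcheck(void)
{
    demo_atomic_uint32 flags = {0x1};

    assert(demo_fetch_or_u32(&flags, 0x4) == 0x1);
    assert(flags.value == 0x5);
}
```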

That was a bit strange at first glance, because __sync_fetch_and_or
compiles to almost the same CAS loop.

Looking closely, I noticed that the intrinsic doesn't do the read
in the loop body, but at loop initialization. This is correct
behavior, because on failure the `lock cmpxchg` instruction leaves
the current value in the EAX register.
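
The same property is visible from C. A small sketch, assuming gcc's __atomic_compare_exchange_n builtin as a stand-in for the PostgreSQL compare-exchange primitive (demo_compare_exchange_u32 is a hypothetical name): when the exchange fails, the builtin writes the value it actually found into *expected, so the caller already holds a fresh copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical wrapper around the GCC/Clang builtin.  Returns true if
 * *ptr was equal to *expected and has been replaced by newval.  On
 * failure, *expected is overwritten with the value found in *ptr.
 */
static inline bool
demo_compare_exchange_u32(volatile uint32_t *ptr, uint32_t *expected,
                          uint32_t newval)
{
    return __atomic_compare_exchange_n(ptr, expected, newval,
                                       false, /* strong variant */
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Self-check: a failed exchange refreshes the caller's expected value. */
static inline void
demo_compare_exchange_selfcheck(void)
{
    volatile uint32_t shared = 7;
    uint32_t    expected = 3;   /* deliberately stale */

    assert(!demo_compare_exchange_u32(&shared, &expected, 9));
    assert(expected == 7);      /* refreshed by the failed exchange */
    assert(demo_compare_exchange_u32(&shared, &expected, 9));
    assert(shared == 9);
}
```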

This is expected behavior, and pg_atomic_compare_exchange_*_impl
does the same in all implementations: on failure it writes the
current value back into *expected. So there is no need to re-read
the value in the loop body:

Example diff for pg_atomic_exchange_u32_impl:

 static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
 {
     uint32 old;
+    old = pg_atomic_read_u32_impl(ptr);
     while (true)
     {
-        old = pg_atomic_read_u32_impl(ptr);
         if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
             break;
     }
     return old;
 }
 

After applying this change to all generic atomic functions
(including pg_atomic_fetch_or_u32_impl), performance became
equal to that of the __sync_fetch_and_or intrinsic.

The attached patch contains this change for all generic atomic
functions, and also adds __sync_fetch_and_(or|and) for gcc, because
I believe GCC optimizes code around intrinsics better than
around inline assembler.
(Final performance is around 86000tps, but the difference between
83000tps and 86000tps is not so clear-cut on a NUMA system.)

With regards,
-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


