Good day, everyone.
I've been playing with pgbench on a huge machine
(72 cores, 56 of them for postgresql, enough memory to fit the
database both in shared_buffers and in the file cache;
pgbench scale 500, unlogged tables, fsync=off,
synchronous_commit=off, wal_writer_flush_after=0).
With 200 clients, performance is around 76000tps, and the main
bottleneck in this dumb test is LWLockWaitListLock.
I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
(i.e. using __sync_fetch_and_or), and performance became 83000tps.
That was a bit strange at first glance, because __sync_fetch_and_or
compiles to almost the same CAS loop.
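A gcc-specific fetch-or of this kind could look roughly like the
sketch below. It is only an illustration, not the code from the
patch: the sketch_ names and the struct are hypothetical stand-ins
for pg_atomic_uint32 and pg_atomic_fetch_or_u32_impl. The only real
API used is the gcc builtin __sync_fetch_and_or.

```c
#include <stdint.h>

/* Hypothetical stand-in for pg_atomic_uint32; not the real PostgreSQL type. */
typedef struct sketch_atomic_uint32
{
	volatile uint32_t value;
} sketch_atomic_uint32;

/*
 * Atomically OR or_ into ptr->value and return the value that was there
 * before. The compiler is free to emit whatever loop it likes for this.
 */
static inline uint32_t
sketch_fetch_or_u32(sketch_atomic_uint32 *ptr, uint32_t or_)
{
	return __sync_fetch_and_or(&ptr->value, or_);
}
```

The function returns the previous value, so a caller can test which
flag bits were already set before its own OR took effect.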
Looking closely, I noticed that the intrinsic doesn't read the
value in the loop body, only at loop initialization. That is
correct, because when the comparison fails, the `lock cmpxchg`
instruction leaves the value it found in memory in the EAX register.
This is expected behavior, and pg_atomic_compare_exchange_*_impl does
the same in all implementations: on failure it writes the current
value into *expected. So there is no need to re-read the value in
the loop body:
Example diff for pg_atomic_exchange_u32_impl:

 static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
 {
 	uint32 old;
+	old = pg_atomic_read_u32_impl(ptr);
 	while (true)
 	{
-		old = pg_atomic_read_u32_impl(ptr);
 		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
 			break;
 	}
 	return old;
 }

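The same pattern can be tried outside the PostgreSQL tree with the
standalone sketch below. It assumes gcc (or a compatible compiler)
and uses the __sync_val_compare_and_swap builtin. Both sketch_
functions are hypothetical names: the first one mimics the contract
of pg_atomic_compare_exchange_u32_impl (on failure, *expected is
overwritten with the current value), and the second one is a
fetch-or loop that reads the value only once, before the loop.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper with the same contract as
 * pg_atomic_compare_exchange_u32_impl: returns true if *ptr was equal to
 * *expected and has been replaced by newval. In either case *expected
 * ends up holding the value that was found in *ptr.
 */
static inline bool
sketch_compare_exchange_u32(volatile uint32_t *ptr, uint32_t *expected,
							uint32_t newval)
{
	uint32_t	current = __sync_val_compare_and_swap(ptr, *expected, newval);
	bool		ok = (current == *expected);

	*expected = current;
	return ok;
}

/* Fetch-or as a CAS loop with a single read before the loop. */
static inline uint32_t
sketch_fetch_or_u32_loop(volatile uint32_t *ptr, uint32_t or_)
{
	uint32_t	old = *ptr;		/* the only explicit read */

	/* A failed CAS has already refreshed old, so just retry. */
	while (!sketch_compare_exchange_u32(ptr, &old, old | or_))
		;
	return old;
}
```

Under contention every retry saves one load of a cache line that
other cores are fighting over, which is presumably where the gain
comes from.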
After applying this change to all generic atomic functions
(including pg_atomic_fetch_or_u32_impl), performance became
equal to that of the __sync_fetch_and_or intrinsic.
The attached patch contains this change for all generic atomic
functions, and also adds __sync_fetch_and_(or|and) for gcc, because
I believe GCC optimizes code around an intrinsic better than
around inline assembler.
(Final performance is around 86000tps, but the difference between
83000tps and 86000tps is not so obvious on a NUMA system.)
With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company