Re: Avoid stack frame setup in performance critical routines using tail calls - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Avoid stack frame setup in performance critical routines using tail calls
Msg-id 20210720061657.bcueir3krgmkt6m5@alap3.anarazel.de
In response to Re: Avoid stack frame setup in performance critical routines using tail calls  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Avoid stack frame setup in performance critical routines using tail calls  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!


> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes
> 
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
> 
> master = dd498998a
> 
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
> 
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not at all be surprising to me if AMD in recent microarchitectures did
a better job at removing stack management overhead (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel
has. This was on a Cascade Lake CPU (Xeon 5215), which, despite being released
in 2019, effectively is a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles. Whereas Zen2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work with your compiler, whether due to vendor (I know that gcc is
better at it than clang), version, or compiler flags (e.g.
-fno-omit-frame-pointer could make it harder, -fno-optimize-sibling-calls
would disable it).
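
For readers unfamiliar with the technique, here is a minimal sketch of the
general pattern. It is not the code from the patch: Arena, arena_alloc and
arena_alloc_slow are hypothetical names, and the allocator is a toy. The idea is
that the fast path needs no stack frame of its own, and the slow path is
reached through a call in tail position, which the compiler can turn into a
plain jump when sibling-call optimization is enabled.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy bump allocator; NOT PostgreSQL's AllocSet. */
typedef struct Arena
{
    char   *freeptr;            /* next free byte in the current block */
    size_t  avail;              /* bytes left in the current block */
} Arena;

/*
 * Slow path, kept out of line so that its register spills and stack frame
 * do not end up in the fast path. The noinline attribute is a GCC/clang
 * extension. Alignment is ignored and the previous block is simply
 * abandoned, both for brevity.
 */
static __attribute__((noinline)) void *
arena_alloc_slow(Arena *arena, size_t size)
{
    size_t  blksize = size > 8192 ? size : 8192;
    char   *block = malloc(blksize);

    if (block == NULL)
        return NULL;

    arena->freeptr = block + size;
    arena->avail = blksize - size;
    return block;
}

void *
arena_alloc(Arena *arena, size_t size)
{
    /* Fast path: a compare, an add and a subtract, no calls. */
    if (arena->avail >= size)
    {
        void   *ret = arena->freeptr;

        arena->freeptr += size;
        arena->avail -= size;
        return ret;
    }

    /*
     * Call in tail position with the same argument list: with sibling-call
     * optimization the compiler can emit a jmp here instead of a call/ret
     * pair, so arena_alloc itself needs no frame.
     */
    return arena_alloc_slow(arena, size);
}
```

Whether the jump is actually emitted depends on compiler and flags. Comparing
the output of `gcc -O2 -S` with and without -fno-optimize-sibling-calls should
typically show a jmp to arena_alloc_slow in one case and a call in the other.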

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks lie sufficiently elsewhere that memory management
efficiency improvements just don't show a meaningful gain.


Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are. E.g. using something like
perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a zen2 machine are.

Greetings,

Andres Freund


