Thread: Some performance degradation in REL_16 vs REL_15

Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
Greetings!

Found that simple test pgbench -c20 -T20 -j8 gives approximately
for REL_15_STABLE at 5143f76:  336+-1 TPS
and
for REL_16_STABLE at 4ac7635f: 324+-1 TPS

The performance drop is approximately 3.5%, while the corrected standard deviation is only 0.3%.
See the raw_data.txt attached.

Do you think there is any cause for concern here?

And is it worth spending time bisecting for the commit where this degradation may have occurred?

I would be glad for any comments and concerns.

With the best regards,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Some performance degradation in REL_16 vs REL_15

From
David Rowley
Date:
On Thu, 12 Oct 2023 at 21:01, Anton A. Melnikov
<a.melnikov@postgrespro.ru> wrote:
>
> Greetings!
>
> Found that simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
>
> And is it worth spending time bisecting for the commit where this degradation may have occurred?

It would be interesting to know what's to blame here and if you can
attribute it to a certain commit.
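
For illustration, such a bisection could be driven automatically with "git bisect run"
and a small script; the paths, the database name "bench", and the 330 TPS threshold
below are illustrative assumptions, not anything verified in this thread:

    #!/bin/bash
    # bisect_tps.sh - exit 0 if this commit still reaches the reference TPS,
    # 1 if it does not, 125 to skip commits that fail to build.
    # Usage: git bisect start <bad-commit> <good-commit>
    #        git bisect run ./bisect_tps.sh
    set -e
    ./configure -q && make -s -j"$(nproc)" && make -s install || exit 125
    pg_ctl -D "$PGDATA" -w restart >/dev/null
    pgbench -i bench >/dev/null 2>&1            # same default scale 1 as in the report above
    psql -d bench -c "checkpoint" >/dev/null
    TPS=$(pgbench -c20 -T20 -j8 bench | awk '/^tps/ { print int($3) }')
    pg_ctl -D "$PGDATA" -w stop >/dev/null
    echo "TPS=$TPS"
    test "$TPS" -ge 330                         # good if at or above the reference TPS

Given that the reported difference is only a few percent, each bisect step would
likely need several repeated runs (or a much longer -T) before such a threshold
test could be trusted.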

David



Re: Some performance degradation in REL_16 vs REL_15

From
Michael Paquier
Date:
On Thu, Oct 12, 2023 at 09:20:36PM +1300, David Rowley wrote:
> It would be interesting to know what's to blame here and if you can
> attribute it to a certain commit.

+1.
--
Michael

Attachment

Re: Some performance degradation in REL_16 vs REL_15

From
Andres Freund
Date:
Hi,

On 2023-10-12 11:00:22 +0300, Anton A. Melnikov wrote:
> Found that simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
> 
> The performance drop is approximately 3.5%, while the corrected standard deviation is only 0.3%.
> See the raw_data.txt attached.

Could you provide a bit more details about how you ran the benchmark?  The
reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
rust and using the default settings, I get considerably higher results.

Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
of row level contention. What scale did you use?

Greetings,

Andres Freund



Re: Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
On 13.10.2023 05:05, Andres Freund wrote:
> Could you provide a bit more details about how you ran the benchmark?  The
> reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
> rust and using the default settings, I get considerably higher results.
> 
> Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
> of row level contention. What scale did you use?


I use the default scale of 1 and run the command sequence:
$pgbench -i bench
$sleep 1
$pgbench -c20 -T10 -j8
in a loop to get similar initial conditions for every "pgbench -c20 -T10 -j8" run.

Thanks for your interest!

With the best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Some performance degradation in REL_16 vs REL_15

From
邱宇航
Date:
I wrote a script and tested branches REL_[10-16]_STABLE, and do see a performance drop in REL_13_STABLE, which is about 1~2%.

scale  round       10       11       12       13       14       15       16
    1      1   7922.2   8018.3   8102.8   7838.3   7829.2   7870.0   7846.1
           2   7922.4   7923.5   8090.3   7887.7   7912.4   7815.2   7865.6
           3   7937.6   7964.9   8012.8   7918.5   7879.4   7786.4   7981.1
           4   8000.4   7959.5   8141.1   7886.3   7840.9   7863.5   8022.4
           5   7921.8   7945.5   8005.2   7993.7   7957.0   7803.8   7899.8
           6   7893.8   7895.1   8017.2   7879.8   7880.9   7911.4   7909.2
           7   7879.3   7853.5   8071.7   7956.2   7876.7   7863.3   7986.3
           8   7980.5   7964.1   8119.2   8015.2   7877.6   7784.9   7923.6
           9   8083.9   7946.4   7960.3   7913.9   7924.6   7867.7   7928.6
          10   7971.2   7991.8   7999.5   7812.4   7824.3   7831.0   7953.4
         AVG   7951.3   7946.3   8052.0   7910.2   7880.3   7839.7   7931.6
         MED   7930.0   7952.9   8044.5   7900.8   7878.5   7847.1   7926.1
   10      1  41221.5  41394.8  40926.8  40566.6  41661.3  40511.9  40961.8
           2  40974.0  40697.9  40842.4  40269.2  41127.7  40795.5  40814.9
           3  41453.5  41426.4  41066.2  40890.9  41018.6  40897.3  40891.7
           4  41691.9  40294.9  41189.8  40873.8  41539.7  40943.2  40643.8
           5  40843.4  40855.5  41243.8  40351.3  40863.2  40839.6  40795.5
           6  40969.3  40897.9  41380.8  40734.7  41269.3  41301.0  41061.0
           7  40981.1  41119.5  41158.0  40834.6  40967.1  40790.6  41061.6
           8  41006.4  41205.9  40740.3  40978.7  40742.4  40951.6  41242.1
           9  41089.9  41129.7  40648.3  40622.1  40782.0  40460.5  40877.9
          10  41280.3  41462.7  41316.4  40728.0  40983.9  40747.0  40964.6
         AVG  41151.1  41048.5  41051.3  40685.0  41095.5  40823.8  40931.5
         MED  41048.2  41124.6  41112.1  40731.3  41001.3  40817.6  40926.7
  100      1  43429.0  43190.2  44099.3  43941.5  43883.3  44215.0  44604.9
           2  43281.7  43795.2  44963.6  44331.5  43559.7  43571.5  43403.9
           3  43749.0  43614.1  44616.7  43759.5  43617.8  43530.3  43362.4
           4  43362.0  43197.3  44296.7  43692.4  42020.5  43607.3  43081.8
           5  43373.4  43288.0  44240.9  43795.0  43630.6  43576.7  43512.0
           6  43637.0  43385.2  45130.1  43792.5  43635.4  43905.2  43371.2
           7  43621.2  43474.2  43735.0  43592.2  43889.7  43947.7  43369.8
           8  43351.0  43937.5  44285.6  43877.2  43771.1  43879.1  43680.4
           9  43481.3  43700.5  44119.9  43786.9  43440.8  44083.1  43563.2
          10  43238.7  43559.5  44310.8  43406.0  44306.6  43376.3  43242.7
         AVG  43452.4  43514.2  44379.9  43797.5  43575.6  43769.2  43519.2
         MED  43401.2  43516.8  44291.2  43789.7  43633.0  43743.2  43387.5

The script looks like:

    initdb data >/dev/null 2>&1                          #initdb on every round
    pg_ctl -D data -l logfile start >/dev/null 2>&1      #start without changing any setting
    pgbench -i -s $scale postgres >/dev/null 2>&1
    sleep 1
    pgbench -c20 -T10 -j8


And here is the pg_config output:

...
CONFIGURE =  '--enable-debug' '--prefix=/home/postgres/base' '--enable-depend' 'PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig::/usr/lib/pkgconfig'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/home/postgres/base/lib',--enable-new-dtags
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm 
VERSION = PostgreSQL 16.0


--
Yuhang Qiu

Re: Some performance degradation in REL_16 vs REL_15

From
Andres Freund
Date:
Hi,

On 2023-10-16 11:04:25 +0300, Anton A. Melnikov wrote:
> On 13.10.2023 05:05, Andres Freund wrote:
> > Could you provide a bit more details about how you ran the benchmark?  The
> > reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
> > rust and using the default settings, I get considerably higher results.
> > 
> > Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
> > of row level contention. What scale did you use?
> 
> 
> I use default scale of 1.

That means you're largely going to be bottlenecked due to row level
contention. For read/write pgbench you normally want to use a scale that's
bigger than the client count, best by at least 2x.

Have you built postgres with assertions enabled or such?

What is the server configuration for both versions?


> And run the command sequence:
> $pgbench -i bench
> $sleep 1
> $pgbench -c20 -T10 -j8

I assume you also specify the database name here, given you specified it for
pgbench -i?

As you're not doing a new initdb here, the state of the cluster will
substantially depend on what has run before. This can matter substantially
because a cluster with prior substantial write activity will already have
initialized WAL files and can reuse them cheaply, whereas one without that
activity needs to initialize new files.  Although that matters a bit less with
scale 1, because there's just not a whole lot of writes.

At the very least you should trigger a checkpoint before or after pgbench
-i. The performance between having a checkpoint during the pgbench run or not
is substantially different, and if you're not triggering one explicitly, it'll
be up to random chance whether it happens during the run or not. It's less
important if you run pgbench for an extended time, but if you do it just for
10s...

E.g. on my workstation, if there's no checkpoint, I get around 633 TPS across
repeated runs, but if there's a checkpoint between pgbench -i and the pgbench
run, it's around 615 TPS.
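
Concretely, a run sequence applying both points (a scale well above the client
count and an explicit checkpoint between the data load and the measured run)
could look like this; the database name "bench" and the exact numbers are just
an example:

    pgbench -i -s 40 bench
    psql -d bench -c "checkpoint"
    pgbench -c 20 -j 8 -T 300 bench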

Greetings,

Andres Freund



Re: Some performance degradation in REL_16 vs REL_15

From
Tom Lane
Date:
邱宇航 <iamqyh@gmail.com> writes:
> I wrote a script and tested branches REL_[10-16]_STABLE, and do see a performance drop in REL_13_STABLE, which is about 1~2%.

I'm really skeptical that we should pay much attention to these numbers.
You've made several of the mistakes that we typically tell people not to
make when using pgbench:

* scale <= number of sessions means you're measuring a lot of
row-update contention

* once you crank up the scale enough to avoid that problem, running
with the default shared_buffers seems like a pretty poor choice

* 10-second runtime is probably an order of magnitude too small
to get useful, reliable numbers
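
Putting those three points together, a less noisy setup would look more like
the following; the shared_buffers value and run length are arbitrary
illustrative choices:

    initdb data
    echo "shared_buffers = '4GB'" >> data/postgresql.conf   # not the 128MB default
    pg_ctl -D data -l logfile start
    pgbench -i -s 100 postgres          # scale well above the 20 sessions
    pgbench -c 20 -j 8 -T 600 postgres  # minutes rather than seconds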

On top of all that, discrepancies on the order of a percent or two
commonly arise from hard-to-control-for effects like the cache
alignment of hot spots in different parts of the code.  That means
that you can see changes of that size from nothing more than
day-to-day changes in completely unrelated parts of the code.

I'd get excited about say a 10% performance drop, because that's
probably more than noise; but I'm not convinced that any of the
differences you show here are more than noise.

            regards, tom lane



Re: Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
Hi, Andres!

Thanks for your patience and advice.

On 18.10.2023 07:10, Andres Freund wrote:
>> On 13.10.2023 05:05, Andres Freund wrote:
>>> Could you provide a bit more details about how you ran the benchmark?  The
>>> reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
>>> rust and using the default settings, I get considerably higher results.
>>>
>>> Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
>>> of row level contention. What scale did you use?

>> I use default scale of 1.
> 
> That means you're largely going to be bottlenecked due to row level
> contention. For read/write pgbench you normally want to use a scale that's
> bigger than the client count, best by at least 2x.

I performed differential measurements with -s21 to obtain scale > number of clients,
in accordance with:
> On 18.10.2023 07:14, Tom Lane wrote:
>> * scale <= number of sessions means you're measuring a lot of
>> row-update contention
>> 
And plan to do the same for -s40.

> Have you built postgres with assertions enabled or such?
I ran the debug configuration with --enable-cassert and -O0,
and plan to do the same without asserts and with -O2 soon.

> What is the server configuration for both versions?
In all measurements I used default postgresql.conf files. Please see the detailed reproduction below.

> 
>> And run the command sequence:
>> $pgbench -i bench
>> $sleep 1
>> $pgbench -c20 -T10 -j8
> 
> I assume you also specify the database name here, given you specified it for
> pgbench -i?

Sorry, there were two mistakes in the command sequence in my previous letter
compared to the real one: I wrote it down from memory rather than copying it from the real script.
In fact I used pgbench -c20 -T20 -j8 bench. -T10 is really unstable and was used only for
preliminary measurements to make them faster. Please see my first letter in this thread.
   
Here is a detailed description of the latest measurements performed on my PC.
With a scale of -s21, the difference between REL_16 and REL_15 is rather small, but it
accumulates from version to version and reaches a maximum between
REL_10_STABLE and REL_16_STABLE.
The measurement procedure was as follows:
- rebuild from sources and reinstall server with
./configure --enable-debug --enable-cassert --with-perl \
    --with-icu --enable-depend --enable-tap-tests
- init the new empty db with initdb -k -D $INSTDIR/data
See postgresql10.conf and postgresql16.conf attached.
- run a series of 100 measurements with a script like this for REL_16:

date && rm -f result.txt
for ((i=0; i<100; i++)); do
    pgbench -U postgres -i -s21 bench > /dev/null 2>&1
    psql -U postgres -d bench -c "checkpoint"
    RES=$(pgbench -U postgres -c20 -T20 -j8 bench 2>&1 | awk '/tps[[:space:]]/ { print $3 }')
    echo $RES >> result.txt
    echo Measurement N$i done. TPS=$RES
done
cat result.txt

and the same script with awk '/excluding/ { print $3 }' for REL_10, respectively
(the tps line is labelled differently in the older pgbench output).

For REL_16_STABLE at 7cc2f59dd the average TPS was 2020+-70;
for REL_10_STABLE at c18c12c98 it was 2260+-70.

The percentage difference was approximately 11%.
Please see the 16vs10.png picture with the graphical representation of the data obtained.
The raw data are in raw_data_s21.txt.

In a few days I hope to perform the additional measurements mentioned above in this letter.
It would be interesting to establish the reason for this difference, and I would be very grateful
if you could advise me what other settings can be tweaked.

With the best wishes!

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Some performance degradation in REL_16 vs REL_15

From
Andres Freund
Date:
Hi,

On 2023-10-30 15:28:53 +0300, Anton A. Melnikov wrote:
> For REL_16_STABLE at 7cc2f59dd the average TPS was: 2020+-70,
> for REL_10_STABLE at c18c12c98 - 2260+-70
> 
> The percentage difference was approximately 11%.
> Please see the 16vs10.png picture with the graphical representation of the data obtained.
> Also there are the raw data in the raw_data_s21.txt.
> 
> In a few days I hope to perform the additional measurements mentioned above in this letter.
> It would be interesting to establish the reason for this difference, and I would be very grateful
> if you could advise me what other settings can be tweaked.

There's really no point in comparing performance with assertions enabled
(leaving aside assertions that cause extreme performance difference, making
development harder). We very well might have added assertions making things
more expensive, without affecting performance in optimized/non-assert builds.

Greetings,

Andres Freund



Re: Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
On 30.10.2023 22:51, Andres Freund wrote:

> There's really no point in comparing performance with assertions enabled
> (leaving aside assertions that cause extreme performance difference, making
> development harder). We very well might have added assertions making things
> more expensive, without affecting performance in optimized/non-assert builds.
> 

Thanks for the advice! I repeated the measurements on my PC without asserts and with CFLAGS="-O2".
I also reduced the number of clients to -c6 to leave a reserve of two cores
of my 8-core CPU, and used -j6 accordingly.

The results were similar: on my PC REL_10_STABLE (c18c12c9) was faster than REL_16_STABLE (07494a0d),
but the effect became weaker:
REL_10_STABLE gives ~965+-15 TPS (+-2%) while REL_16_STABLE gives ~920+-30 TPS (+-3%) in the test: pgbench -s8 -c6 -T20 -j6.
So 10 is faster than 16 by ~5% (see raw-my-pc.txt attached for the raw data).

Then, thanks to my colleagues, I carried out similar measurements on a more powerful 24-core standalone server.
REL_10_STABLE gives 8260+-100 TPS (+-1%) while REL_16_STABLE gives 8580+-90 TPS (+-1%) in the same test: pgbench -s8 -c6 -T20 -j6.

The test gave the opposite result!
On that server, 16 is faster than 10 by ~4%.

When I scaled the test on the server to keep the same reserve of two cores, the results became:
REL_10_STABLE gives ~16000+-300 TPS (+-2%) while REL_16_STABLE gives ~18500+-200 TPS (+-1%) in the scaled test: pgbench -s24 -c22 -T20 -j22.

Here the difference is more noticeable: 16 is faster than 10 by ~15% (raw-server.txt).

The configure options and test script on my PC and the server were the same:
export CFLAGS="-O2"
./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests
#reinstall
#reinitdb
#create database bench
for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench > /dev/null 2>&1;
psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 -j6 bench 2>&1 | awk '/tps[[:space:]]/ { print $3 }'); echo $RES >> result.txt; done

Configurations:
my pc:  8-core AMD Ryzen 7 4700U @ 1.4GHz, 64GB RAM, NVMe M.2 SSD drive.
Linux 5.15.0-88-generic #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
server: 2x 12-hyperthreaded cores Intel(R) Xeon(R) CPU X5675 @ 3.07GHz, 24GB RAM, RAID from SSD drives.
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

I can't understand why I get opposite results on my PC and on the server. It is clear that the absolute
TPS values will be different for various configurations; this is normal. But why do the differences go in opposite directions?
Perhaps some kind of reference configuration is needed to accurately
measure the difference in performance. Probably something is wrong with my PC, but so far
I cannot figure out what.

I would be very grateful for any advice or comments to clarify this problem.

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: Some performance degradation in REL_16 vs REL_15

From
Tom Lane
Date:
"Anton A. Melnikov" <a.melnikov@postgrespro.ru> writes:
> I can't understand why I get opposite results on my PC and on the server. It is clear that the absolute
> TPS values will be different for various configurations; this is normal. But why do the differences go in opposite directions?
> Perhaps some kind of reference configuration is needed to accurately
> measure the difference in performance. Probably something is wrong with my PC, but so far
> I cannot figure out what.

> I would be very grateful for any advice or comments to clarify this problem.

Benchmarking is hard :-(.  IME it's absolutely typical to see
variations of a couple of percent even when "nothing has changed",
for example after modifying some code that's nowhere near any
hot code path for the test case.  I usually attribute this to
cache effects, such as a couple of bits of hot code now sharing or
not sharing a cache line.  If you use two different compiler versions
then that situation is likely to occur all over the place even with
exactly the same source code.  NUMA creates huge reproducibility
problems too on multisocket machines (which your server is IIUC).
When I had a multisocket workstation I'd usually bind all the server
processes to one socket if I wanted more-or-less-repeatable numbers.

I wouldn't put a lot of faith in the idea that measured pgbench
differences of up to several percent are meaningful at all,
especially when comparing across different hardware and different
OS+compiler versions.  There are too many variables that have
little to do with the theoretical performance of the source code.

            regards, tom lane



Re: Some performance degradation in REL_16 vs REL_15

From
Andres Freund
Date:
Hi,

On 2023-11-15 11:33:44 +0300, Anton A. Melnikov wrote:
> The configure options and test scripts on my pc and server were the same:
> export CFLAGS="-O2"
> ./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests
> #reinstall
> #reinitdb
> #create database bench
> for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench> /dev/null 2>&1;
> psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 -j6 bench;

Even with scale 8 you're likely significantly impacted by contention. And
obviously WAL write latency. See below for why that matters.



> I can't understand why I get opposite results on my PC and on the server. It is clear that the absolute
> TPS values will be different for various configurations; this is normal. But why do the differences go in opposite directions?
> Perhaps some kind of reference configuration is needed to accurately
> measure the difference in performance. Probably something is wrong with my PC, but so far
> I cannot figure out what.

One very common reason for symptoms like this is power-saving measures by the
CPU. In workloads where the CPU is not meaningfully utilized, the CPU will go
into a power-saving mode - which can cause workloads that are latency sensitive
to be badly affected, both because initially the CPU will just work at a
lower frequency and because it takes time to shift to a higher frequency.


Here's an example:
I bound the server and psql to the same CPU core (nothing else is allowed to
use that core) and ran the following:

\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;
Time: 0.181 ms
Time: 0.085 ms
Time: 0.071 ms
Time: 100.474 ms
Time: 0.153 ms
Time: 0.077 ms
Time: 0.069 ms

You can see how the first query timing was slower, the next two were faster,
and then after the pg_sleep() it's slow again.


# tell the CPU to optimize for performance not power
cpupower frequency-set --governor performance

# disable going to lower power states
cpupower idle-set -D0

# disable turbo mode for consistent performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Now the timings are:
Time: 0.038 ms
Time: 0.028 ms
Time: 0.025 ms
Time: 1000.262 ms (00:01.000)
Time: 0.027 ms
Time: 0.024 ms
Time: 0.023 ms

Look, fast and reasonably consistent timings.

Switching back:
Time: 0.155 ms
Time: 0.123 ms
Time: 0.074 ms
Time: 1001.235 ms (00:01.001)
Time: 0.120 ms
Time: 0.077 ms
Time: 0.068 ms


The perverse thing is that this often means that *reducing* the number of
instructions executed yields *worse* behaviour under non-sustained
load, because from the CPU's point of view there is less need to increase clock
speed.


To show how much of a difference that can make, I ran pgbench with a single
client on one core, and the server on another (so the CPU is idle inbetween):
numactl --physcpubind 11 pgbench -n -M prepared -P1 -S -c 1 -T10

With power optimized configuration:
latency average = 0.035 ms
latency stddev = 0.002 ms
initial connection time = 5.255 ms
tps = 28434.334672 (without initial connection time)

With performance optimized configuration:
latency average = 0.025 ms
latency stddev = 0.001 ms
initial connection time = 3.544 ms
tps = 40079.995935 (without initial connection time)

That's a whopping 1.4x in throughput!


Now, the same thing, except that I used a custom workload where pgbench
transactions are executed in a pipelined fashion, 100 read-only transactions
in one script execution:

With power optimized configuration:
latency average = 1.055 ms
latency stddev = 0.125 ms
initial connection time = 6.915 ms
tps = 947.985286 (without initial connection time)

(this means we actually executed 94798.5286 readonly pgbench transactions/s)

With performance optimized configuration:
latency average = 1.376 ms
latency stddev = 0.083 ms
initial connection time = 3.759 ms
tps = 726.849018 (without initial connection time)

Suddenly the super-duper performance optimized settings are worse (but note
that stddev is down)! I suspect the problem is that now because we disabled
idle states, the cpu ends up clocking *lower*, due to power usage.

If I just change the relevant *cores* to the performance optimized
configuration:

cpupower -c 10,11 idle-set -D0; cpupower -c 10,11 frequency-set --governor performance

latency average = 0.940 ms
latency stddev = 0.061 ms
initial connection time = 3.311 ms
tps = 1063.719116 (without initial connection time)

It wins again.


Now, realistically you'd never use -D0 (i.e. disabling all downclocking, not
just lower states) - the power differential is quite big and as shown here it
can hurt performance as well.

On an idle system, looking at the cpu power usage with:
powerstat -D -R 5 1000

  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts   pkg-0   dram  pkg-1
09:45:03   0.6   0.0   0.2  99.2   0.0    1   2861   2823    0    0    0  46.84   24.82   3.68  18.33
09:45:08   1.0   0.0   0.1  99.0   0.0    2   2565   1602    0    0    0  54.78   28.92   4.74  21.12
09:45:13   0.8   0.0   0.3  98.9   0.0    1   3769   2322    0    0    0  55.65   29.43   4.72  21.50
09:45:18   0.8   0.0   0.1  99.1   0.0    3   2513   1479    0    0    0  51.95   27.47   4.23  20.24
09:45:23   0.6   0.0   0.1  99.3   0.0    1   2282   1448    0    0    0  49.44   26.12   3.91  19.41
09:45:28   0.8   0.0   0.1  99.1   0.0    2   2422   1465    0    0    0  51.79   27.33   4.27  20.19
09:45:33   0.9   0.0   0.1  99.0   0.0    2   2358   1566    0    0    0  55.05   29.03   4.73  21.29
09:45:38   0.6   0.0   0.1  99.4   0.0    1   2976   4207    0    0    0  54.38   29.08   4.02  21.28
09:45:43   1.2   0.0   0.2  98.6   0.0    2   3988  37670    0    0    0 125.51   64.15   7.97  53.39
09:45:48   0.6   0.0   0.2  99.2   0.0    2   3263  40000    0    0    0 126.31   63.84   7.97  54.50
09:45:53   0.4   0.0   0.0  99.6   0.0    1   2333  39716    0    0    0 125.94   63.64   7.90  54.40
09:45:58   0.3   0.0   0.0  99.6   0.0    1   2783  39795    0    0    0 125.93   63.58   7.90  54.44
09:46:03   0.6   0.0   0.2  99.2   0.0    1   3081  24910    0    0    0  93.55   47.69   6.10  39.75
09:46:08   0.5   0.0   0.1  99.4   0.0    1   2623   1356    0    0    0  42.65   22.59   3.18  16.89

(interesting columns are Watts, pkg-0, dram, pkg-1)

Initially this was with cpupower idle-set -E, then I ran cpupower idle-set
-D0, and then cpupower idle-set -E - you can easily see where.


If I use a more sensible -D5 system-wide and run the pgbench from earlier, I
get the performance *and* sensible power usage:

latency average = 0.946 ms
latency stddev = 0.063 ms
initial connection time = 3.482 ms
tps = 1057.413946 (without initial connection time)

09:50:36   0.4   0.0   0.1  99.5   0.0    1   2593   1206    0    0    0  42.63   22.47   3.18  16.97


To get back to your benchmark: Because you're measuring a highly contended
system, where most of the time postgres will just wait for
  a) row-level locks to be released
  b) WAL writes to finish (you're clearly not using a very fast disk)
the CPU would have plenty of time to clock down.


Benchmarking sucks. Modern hardware realities suck.

Greetings,

Andres Freund



Re: Some performance degradation in REL_16 vs REL_15

From
Andres Freund
Date:
Hi,

On 2023-11-15 10:09:06 -0500, Tom Lane wrote:
> "Anton A. Melnikov" <a.melnikov@postgrespro.ru> writes:
> > I can't understand why I get opposite results on my PC and on the server. It is clear that the absolute
> > TPS values will be different for various configurations; this is normal. But why do the differences go in opposite directions?
> > Perhaps some kind of reference configuration is needed to accurately
> > measure the difference in performance. Probably something is wrong with my PC, but so far
> > I cannot figure out what.
>
> > I would be very grateful for any advice or comments to clarify this problem.
>
> Benchmarking is hard :-(.

Indeed.


> IME it's absolutely typical to see variations of a couple of percent even
> when "nothing has changed", for example after modifying some code that's
> nowhere near any hot code path for the test case.  I usually attribute this
> to cache effects, such as a couple of bits of hot code now sharing or not
> sharing a cache line.

FWIW, I think we're overusing that explanation in our community. Of course you
can encounter things like this, but the replacement policies of cpu caches
have gotten a lot better and the caches have gotten bigger too.

IME this kind of thing is typically dwarfed by much bigger variations from
things like

- cpu scheduling - whether the relevant pgbench thread is colocated on the
  same core as the relevant backend can make a huge difference,
  particularly when CPU power saving modes are not disabled.  Just looking at
  tps from a fully cached readonly pgbench, with a single client:

  Power savings enabled, same core:
  37493

  Power savings enabled, different core:
  28539

  Power savings disabled, same core:
  38167

  Power savings disabled, different core:
  37365


- can transparent huge pages be used for the executable mapping, or not

  On newer kernels linux (and some filesystems) can use huge pages for the
  executable. To what degree that succeeds is a large factor in performance.

  Single threaded read-only pgbench

  postgres mapped without huge pages:
  37155 TPS

  with 2MB of postgres as huge pages:
  37695 TPS

  with 6MB of postgres as huge pages:
  42733 TPS

  The really annoying thing about this is that it's entirely unpredictable whether
  huge pages are used or not. Building the same way, sometimes 0, sometimes 2MB,
  sometimes 6MB are mapped huge. Even though the on-disk contents are
  precisely the same.  And it can even change without rebuilding, if the
  binary is evicted from the page cache.

  This alone makes benchmarking extremely annoying. It basically can't be
  controlled and has huge effects (a rough way to check what a given run got
  is sketched after this list).


- How long ago the server was started

  If e.g. once you run your benchmark on the first connection to a database,
  and after a restart not (e.g. autovacuum starts up beforehand), you can get
  a fairly different memory layout and cache situation, due to [not] using the
  relcache init file. If not, you'll have a catcache that's populated,
  otherwise not.

  Another mean one is whether you start your benchmark within a relatively
  short time of the server starting. Readonly pgbench with a single client,
  started immediately after the server:

  progress: 12.0 s, 37784.4 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 13.0 s, 37779.6 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 14.0 s, 37668.2 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 15.0 s, 32133.0 tps, lat 0.031 ms stddev 0.113, 0 failed
  progress: 16.0 s, 37564.9 tps, lat 0.027 ms stddev 0.012, 0 failed
  progress: 17.0 s, 37731.7 tps, lat 0.026 ms stddev 0.001, 0 failed

  There's a dip at 15s, odd - turns out that's due to bgwriter writing a WAL
  record, which triggers walwriter to write it out and then initialize the
  whole WAL buffers with 0s - happens once.  In this case I've exaggerated the
  effect a bit by using a 1GB wal_buffers, but it's visible otherwise too.
  Whether your benchmark period includes that dip or not adds a fair bit of
  noise.

  You can even see the effects of autovacuum workers launching - even if
  there's nothing to do!  Not as a huge dip, but enough to add some "run to
  run" variation.


- How much other dirty data is there in the kernel pagecache. If you e.g. just
  built a new binary, even with just minor changes, the kernel will need to
  flush those pages eventually, which may contend for IO and increases page
  faults.

  Rebuilding an optimized build generates something like 1GB of dirty
  data. Particularly with ccache, that'll typically not yet be flushed by the
  time you run a benchmark. That's not nothing, even with a decent NVMe SSD.

- many more, unfortunately
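
A rough way to check what a given run got with respect to huge pages for the
executable (a sketch only; the FilePmdMapped field shows up in smaps on
reasonably recent kernels):

    # sum the file-backed huge page mappings of the postmaster, in kB
    grep FilePmdMapped /proc/$(pgrep -o -x postgres)/smaps | awk '{ s += $2 } END { print s, "kB" }'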

Greetings,

Andres Freund



Re: Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
Hi!

Thank you very much for pointing out these important points:

"NUMA" effects:
On 15.11.2023 18:09, Tom Lane wrote:
> NUMA creates huge reproducibility
> problems too on multisocket machines (which your server is IIUC).
> When I had a multisocket workstation I'd usually bind all the server
> processes to one socket if I wanted more-or-less-repeatable numbers.
>

"frequency" effects:
On 15.11.2023 21:04, Andres Freund wrote:
>
> One very common reason for symptoms like this is power-saving measures by the
> CPU. In workloads where the CPU is not meaningfully utilized, the CPU will go
> into a power-saving mode - which can cause workloads that are latency sensitive
> to be badly affected, both because initially the CPU will just work at a
> lower frequency and because it takes time to shift to a higher frequency.

> To get back to your benchmark: Because you're measuring a highly contended
> system, where most of the time postgres will just wait for
>    a) row-level locks to be released
>    b) WAL writes to finish (you're clearly not using a very fast disk)
> the CPU would have plenty of time to clock down.

And "huge page" effect:
On 15.11.2023 23:21, Andres Freund wrote:
>    The really annoying thing about this is that it's entirely unpredictable whether
>    huge pages are used or not. Building the same way, sometimes 0, sometimes 2MB,
>    sometimes 6MB are mapped huge. Even though the on-disk contents are
>    precisely the same.  And it can even change without rebuilding, if the
>    binary is evicted from the page cache.

As for "NUMA" and "frequency" effects the following steps were made.

As directed by Andres i really found that CPU cores sometimes were falling down in low-frequency
states during the test. See server-C1-C1E-falls.mkv attached.
Additionally, with my colleague Andrew Bille [<andrewbille(at)gmail(dot)com>]
we found an interesting thing.
We disabled software performance management in the server BIOS
and completely prohibited C-states here. But still,
the processor cores sometimes fell to C1 and C1E with a frequency decrease.
So we set back the software management in the BIOS and an additional command
echo <scaling_max_freq> | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
solved this problem. The server cores began to operate at a fixed
frequency as shown in: server-stable-freq.mkv
And a similar result was obtained on my pc: mypc-stable-freq.mkv

Then I brought the hardware configurations of the server and my PC as close to each other as possible.
There are eight real cores without hyperthreading on my PC, but only six real cores in each CPU
on the server, so it is impossible to get 8 real cores on a single CPU there.
To align both configurations I disabled two of the 8 cores on my PC with numactl. On the server the
second CPU was also disabled through numactl, while hyperthreading was disabled in the BIOS.
Finally we got a similar configuration of a single six-core processor both on the server and on my PC.

The full set of configuration commands for the server was as follows:
numactl --cpunodebind=0 --membind=0 --physcpubind=1,3,5,7,9,11 bash
sudo cpupower frequency-set -g performance
sudo cpupower idle-set -D0
echo 3059000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
(Turbo Boost was disabled in BIOS.)
And for my PC:
numactl --cpunodebind=0 --membind=0 --physcpubind=0,1,2,3,4,5 bash
sudo cpupower frequency-set -g performance
sudo cpupower idle-set -D0
echo 2000000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

The numactl and cpupower output after configuration is in
server-hw.txt and mypc-hw.txt respectively.
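
For completeness, the frequency pinning can be double-checked while pgbench is
running with something like (an illustrative check, not part of the procedure above):

    watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"
    # or, for per-core C-state residency sampled every 5 seconds:
    sudo cpupower monitor -i 5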

To eliminate the "huge page" effect I used huge_pages = off
in all further differential measurements.

As for the other effects that Andres mentioned (dips due to bgwriter and autovacuum,
kernel pagecache effects and other possible noise),
I minimized them statistically using the following measurement sequence.
One series of such a measurement consists of:
1) Remove install and data dirs;
2) Fully reconfigure, rebuild and reinstall server;
3) Reinit new db;
4) Launch pgbench for 20 seconds in a loop with intermediate checkpoints, as you advised.
A few tens of measurements in total.
5) Sort the results and remove points with considerable (more than 1%) deviation from the tails
(a small sketch of this trimming step follows this list).
The algorithm was as follows:
   if the first or last point deviates from the average by more than 1%, remove it;
   recalculate the average value;
   repeat from the beginning.
Removed points are marked with "x" in the raw data files raw-server.txt and raw-myPC.txt.
6) Get the resulting series of TPS data and the average (expected) value;
7) Calculate the standard deviation (error).
Each series takes approximately 20 minutes.
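
The trimming in step 5 can be expressed, for example, as a small sort/awk pipeline
over result.txt (a sketch of the described algorithm, assuming one TPS value per line):

    sort -n result.txt | awk '
    { v[NR] = $1 }
    END {
        lo = 1; hi = NR
        while (1) {
            s = 0
            for (i = lo; i <= hi; i++) s += v[i]
            avg = s / (hi - lo + 1)
            if ((avg - v[lo]) / avg > 0.01) {         # drop a low outlier
                lo++
            } else if ((v[hi] - avg) / avg > 0.01) {  # drop a high outlier
                hi--
            } else {
                break
            }
        }
        n = hi - lo + 1
        m = 0;  for (i = lo; i <= hi; i++) m += v[i];  m /= n
        ss = 0; for (i = lo; i <= hi; i++) ss += (v[i] - m) ^ 2
        printf "kept %d of %d points, mean %.1f TPS, stddev %.1f\n", n, NR, m, (n > 1 ? sqrt(ss / (n - 1)) : 0)
    }'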

The general order of measurements was as follows:
1) Compare REL_10_STABLE vs REL_16_STABLE on my PC and take difference1 in TPS;
2) Compare REL_10_STABLE vs REL_16_STABLE at the same commits on the server and take difference2;
3) Compare difference1 and difference2 with each other.

In my first attempt I got exactly opposite results: on my PC REL_10 was noticeably
faster than REL_16, while on the server the latter won.

After repeating the measurements using the updated configurations and methods described above
I got the following results:

On the server:
REL_10_STABLE gives ~6130 TPS (+-0.3%) while REL_16_STABLE gives ~6350 TPS (+-0.3%) in the test: pgbench -s8 -c4 -T20 -j4.
In comparison with the previous attempt the standard deviation became three times smaller, but the difference
in speed remained almost the same:
16 is faster than 10 by ~3.6%. See 10vs16-server.png
(previously it was ~4% in the pgbench -s8 -c6 -T20 -j6 test).

On my PC:
REL_10_STABLE gives ~783 TPS (+-0.4%) while REL_16_STABLE gives ~784 TPS (+-0.4%) in the same test: pgbench -s8 -c4 -T20 -j4.
Here the difference from the previous attempt is significant: 10 and 16 no longer differ in speed, while
the standard deviation became five times smaller. See 10vs16-mypc.png.

The results obtained on the server seem more reliable, as they are consistent with the previous ones,
but I would like to figure out why I don't see a difference in TPS on my PC.
I will be glad for any comments and concerns.

And I want to thank you again for the previous advice; owing to it, the measurements have become more stable.

Merry Christmas and the best regards!

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Some performance degradation in REL_16 vs REL_15

From
"Anton A. Melnikov"
Date:
On 26.12.2023 07:38, Anton A. Melnikov wrote:
> The results obtained on the server seem more reliable, as they are consistent with the previous ones,
> but I would like to figure out why I don't see a difference in TPS on my PC.

It seems I have found the solution.
Since the possible reason could be an I/O bottleneck, as first pointed out by Thomas in [1] for similar measurements,
I took the same approach on my PC as in this thread [2], using a 12GB ramdisk out of my 64GB of total RAM.
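
A minimal sketch of such a ramdisk setup (the mount point and ownership are
illustrative assumptions):

    sudo mkdir -p /mnt/pgram
    sudo mount -t tmpfs -o size=12G tmpfs /mnt/pgram
    sudo chown postgres: /mnt/pgram
    initdb -k -D /mnt/pgram/data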

REL_10_STABLE at c18c12c983 gives ~6900+-100 TPS (+-1.4%), while
REL_16_STABLE at 07494a0df gives ~7200+-90 TPS (+-1.3%). See the raw data in raw-myPC.txt
and a graphical representation in 16vs10-mypc.png.

REL_16_STABLE is faster than REL_10_STABLE by ~4%,
so the results on my PC became consistent with the ones obtained on the server.

I also performed comparative measurements on the server with a ramdrive for all current versions as of January 27th.
Please see the dependence of the average TPS on the version number in the 10-to-master.png graph.
The raw data are in raw-server.txt.

Brief results are as follows:
1) The differences between 12, 13, 14, 15 and master are subtle and within ~1%.
2) 11 is slower than 12..master by ~2%.
3) 10 is slower than 12..master by ~4%.
The standard deviation in all measurements except 10 and 11 was ~0.3%; for 11 it was ~0.4%, for 10 ~0.5%.

Thus, the problem in the subject of this thread is actually not relevant: there is no performance
degradation in REL_16 vs REL_15.

But I'm worried about one question: if we do not take into account possible fluctuations of a few percent
from commit to commit, we can gradually accumulate a considerable loss in performance without noticing it.
And then it will be quite difficult to establish its origin.
Perhaps I'm missing something here. If there are any thoughts on this question,
I would be glad to take them into account.


With the best wishes!

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

[1] https://www.postgresql.org/message-id/98646b96-6dcf-8d8a-3daf-837f25f8b1e3%40enterprisedb.com
[2] https://www.postgresql.org/message-id/a74b5a91-d7c1-4138-86df-371c5e2b2be3%40postgrespro.ru


Attachment