
From Konstantin Knizhnik
Subject Re: Improving connection scalability: GetSnapshotData()
Date
Msg-id 4f245382-2f04-3b2e-ae94-d075d2eb7868@postgrespro.ru
In response to Re: Improving connection scalability: GetSnapshotData()  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers


On 03.09.2020 11:18, Michael Paquier wrote:
On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
So we get some buildfarm results while thinking about this.
Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?
--
Michael

Andres, 
First of all, many thanks for this work.
Improving Postgres connection scalability is very important.

The reported results look very impressive.
But I tried to reproduce them and did not observe similar behavior.
So I am wondering what the difference can be and what I am doing wrong.

I have tried two different systems.
The first one is an IBM Power2 server with 384 cores and 8TB of RAM.
I ran the same read-only pgbench test as you. I do not think that the size of the database matters, so I used scale 100 - 
it seems to be enough to avoid frequent buffer conflicts.
Then I ran the same scripts as you:

 for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -S -n postgres; done
 for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -S -n postgres; done
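(For reference, the database itself can be prepared along these lines before running the loops above; the -i initialization at scale factor 100 matches what was described earlier, and the target database name "postgres" is simply the one used in the commands:)

 pgbench -i -s 100 postgres     # initialize pgbench tables at scale factor 100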


I have compared current master with a version of Postgres prior to your scalability commits: a9a4a7ad56
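For clarity, the two builds under comparison could be produced roughly like this (the install prefixes and configure options here are assumptions, not necessarily what was actually used):

 git checkout a9a4a7ad56 && ./configure --prefix=$HOME/pg-old && make -j && make install
 git checkout master     && ./configure --prefix=$HOME/pg-new && make -j && make install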

For all numbers of connections the older version shows slightly better results; for example, for 500 clients: 475k TPS vs. 450k TPS for current master.

This is a rather exotic server and I do not currently have access to it.
So I have repeated the experiments on an Intel server.
It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256GB of RAM.

The same database, the same script, results are the following:

Clients |  old/incl |  old/excl |  new/incl |  new/excl
   1000 |   1105750 |   1163292 |   1206105 |   1212701
   2000 |   1050933 |   1124688 |   1149706 |   1164942
   3000 |   1063667 |   1195158 |   1118087 |   1144216
   4000 |   1040065 |   1290432 |   1107348 |   1163906
   5000 |    943813 |   1258643 |   1103790 |   1160251
I show results including/excluding connection establishment separately ("incl"/"excl" above),
because in the new version there is almost no difference between them,
but for the old version the gap between them is noticeable.
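(Both figures per build are taken from pgbench's end-of-run report, which prints two tps lines roughly like the following; the numbers here are simply the first row of the table above, with the decimals omitted:)

 tps = 1206105... (including connections establishing)
 tps = 1212701... (excluding connections establishing)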

Configuration file has the following differences with default postgres config:

max_connections = 10000			# (change requires restart)
shared_buffers = 8GB			# min 128kB
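(As a quick sanity check, one can confirm that the running server actually picked up these settings; this is just a generic sketch, not part of the benchmark itself:)

 psql postgres -c "SHOW max_connections;" -c "SHOW shared_buffers;"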


These results contradict yours and make me ask the following questions:

1. Why is performance in your case almost two times higher (2 million TPS vs. 1 million)?
The hardware in my case seems to be at least no worse than yours...
Maybe there are some other improvements in the version you tested which are not yet committed to master?

2. You wrote: "This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)"

According to the Intel specification, the Intel® Xeon® Platinum 8168 processor has 24 cores:
https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html

And on your graph we can see an almost linear increase of speed up to 40 connections. 

But the most suspicious word for me is "virtualized". What is the actual hardware and how is it virtualized?
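(On Linux, the CPU topology visible to the OS can be inspected with something like the following; whether that reflects the physical host or only the guest configuration is exactly the open question here:)

 lscpu | grep -E 'Model name|Socket|Core|Thread'    # sockets, cores per socket, threads per core
 nproc                                              # logical CPUs visible to the system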

Do you have any idea why in my case the master version (with your commits) behaves almost the same as the non-patched version?
Below is yet another table showing scalability from 10 to 100 connections, combining your results (first two columns) and my results (last two columns):


Clients | old master | pgxact-split-cache | current master | revision 9a4a7ad56
     10 |     367883 |             375682 |         358984 |             347067
     20 |     748000 |             810964 |         668631 |             630304
     30 |     999231 |            1288276 |         920255 |             848244
     40 |     991672 |            1573310 |        1100745 |             970717
     50 |    1017561 |            1715762 |        1193928 |            1008755
     60 |     993943 |            1789698 |        1255629 |             917788
     70 |     971379 |            1819477 |        1277634 |             873022
     80 |     966276 |            1842248 |        1266523 |             830197
     90 |     901175 |            1847823 |        1255260 |             736550
    100 |     803175 |            1865795 |        1241143 |             736756

Maybe it is because of the more complex architecture of my server?
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
