
From Andres Freund
Subject Re: Improving connection scalability: GetSnapshotData()
Date
Msg-id 20200904185304.bs27ufejpujp5azx@alap3.anarazel.de
In response to Re: Improving connection scalability: GetSnapshotData()  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Hi,

On 2020-09-04 18:24:12 +0300, Konstantin Knizhnik wrote:
> Reported results look very impressive.
> But I tried to reproduce them and didn't observe similar behavior.
> So I am wondering what the difference might be and what I am doing wrong.

That is odd - I have reproduced it on quite a few systems by now.


> Configuration file has the following differences with default postgres config:
> 
> max_connections = 10000            # (change requires restart)
> shared_buffers = 8GB            # min 128kB

I also used huge_pages=on and configured huge pages at the OS level. Otherwise
TLB misses will be a significant factor.
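
For reference, that amounts to roughly the following (the nr_hugepages
value is just a ballpark for 8GB of shared_buffers with 2MB huge pages,
plus some slack - adjust to your setup):

  # postgresql.conf
  huge_pages = on
  # OS side (Linux), before starting the server:
  sysctl -w vm.nr_hugepages=4500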

Does it change if you initialize the test database using
PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100
or run a manual VACUUM FREEZE; after initialization?
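
E.g. the manual variant, as a sketch (assuming the standard 'postgres'
database and a plain pgbench -i -s 100 beforehand):

  psql -c 'VACUUM FREEZE;' postgres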


> I have tried two different systems.
> The first one is an IBM Power2 server with 384 cores and 8TB of RAM.
> I ran the same read-only pgbench test as you. I do not think the size of the database matters, so I used scale 100 -
> it seems to be enough to avoid frequent buffer conflicts.
> Then I ran the same scripts as you:
>
>  for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
>  for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
>
>
> I have compared current master with the version of Postgres prior to your commits with scalability improvements: a9a4a7ad56

Hm, it'd probably be good to compare commits closer to the changes, to
avoid other changes showing up.

Hm - did you verify whether all the connections were actually established?
Particularly without the patch applied? With an unmodified pgbench, I
sometimes saw better numbers, but only because only half the connections
could actually be established, due to ProcArrayLock contention.
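
A quick way to sanity check that (just a sketch - the application_name
filter assumes pgbench's default name, adjust as needed) is to count
backends from another session mid-run and compare against -c:

  psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name = 'pgbench';" postgres

pgbench also reports connection failures on stderr, which is another hint.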

See https://www.postgresql.org/message-id/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de

There also is the issue that pgbench's numbers including/excluding
connection establishment are just about meaningless right now:
https://www.postgresql.org/message-id/20200227202636.qaf7o6qcajsudoor%40alap3.anarazel.de
(reminds me, need to get that fixed)


One more thing worth investigating is whether your results change
significantly when you start the server using
numactl --interleave=all <start_server_cmdline>.
Especially on larger systems the results can otherwise vary a lot from
run to run, because the placement of shared buffers matters a lot.
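
Concretely, something along these lines (assuming a pg_ctl based start;
substitute however you normally start the server):

  numactl --interleave=all pg_ctl -D "$PGDATA" -l logfile start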


> So I have repeated experiments at Intel server.
> It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256GB of RAM.
>
> The same database, the same script, results are the following:
>
> Clients     old/inc     old/exl     new/inc     new/exl
> 1000     1105750     1163292     1206105     1212701
> 2000     1050933     1124688     1149706     1164942
> 3000     1063667     1195158     1118087     1144216
> 4000     1040065     1290432     1107348     1163906
> 5000     943813     1258643     1103790     1160251

> I have separately shown results including/excluding connection establishment,
> because in the new version there is almost no difference between them,
> but for the old version the gap between them is noticeable.
>
> Configuration file has the following differences with default postgres config:
>
> max_connections = 10000            # (change requires restart)
> shared_buffers = 8GB            # min 128kB

>
> These results contradict yours and make me ask the following questions:

> 1. Why is performance in your case almost two times higher (2 million TPS vs 1)?
> The hardware in my case seems to be at least no worse than yours...
> Maybe there are some other improvements in the version you have tested which are not yet committed to master?

No, no uncommitted changes, except for the pgbench stuff mentioned
above. However, I found that the kernel version matters a fair bit; it's
pretty easy to run into kernel scalability issues in a workload that is
this heavily scheduler dependent.

Did you connect via TCP or a unix socket? Was pgbench running on the same
machine? It was locally via unix socket for me (but it's also observable
across two machines, just with lower overall throughput).

Did you run a profile to see where the bottleneck is?
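
Something quick like the following would already tell us a lot (the perf
invocation is just one option, any system-wide profiler works):

  perf top -g
  # or, recording a system-wide profile during the run:
  perf record -a -g sleep 20 && perf report --no-children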


There's a separate benchmark that I found to be quite revealing, and that's
far less dependent on scheduler behaviour. Run two pgbench instances (a
rough invocation sketch follows the list):

1) With a very simple script '\sleep 1s' or such, and many connections
   (e.g. 100,1000,5000). That's to simulate connections that are
   currently idle.
2) With a normal pgbench read only script, and low client counts.
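
Roughly like this (a sketch - client counts, durations and the database
name are just placeholders):

  # 1) many idle connections, each just sleeping in a loop
  echo '\sleep 1s' > idle.sql
  pgbench -n -f idle.sql -c 5000 -j 100 -T 300 postgres &

  # 2) the actual measurement, with a low number of active clients
  pgbench -n -M prepared -S -c 48 -j 48 -T 15 postgres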

Before the changes 2) shows a very sharp decline in performance as the
connection count in 1) increases. Afterwards it stays pretty much flat.

I think this benchmark actually is much more real-world oriented - due
to latency and client-side overheads it's very normal to have a large
fraction of connections idle in read-mostly OLTP workloads.

Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
connection scalability changes.

I used fairly short pgbench runs (15s), and the numbers are the best of
three runs. I also had emacs and mutt open - some noise to be
expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
|                0 |                  1 |   33599 |    33406 |
|              100 |                  1 |   31088 |    33279 |
|             1000 |                  1 |   29377 |    33434 |
|             2500 |                  1 |   27050 |    33149 |
|             5000 |                  1 |   21895 |    33903 |
|            10000 |                  1 |   16034 |    33140 |
|                0 |                 48 | 1042005 |  1125104 |
|              100 |                 48 |  986731 |  1103584 |
|             1000 |                 48 |  854230 |  1119043 |
|             2500 |                 48 |  716624 |  1119353 |
|             5000 |                 48 |  553657 |  1119476 |
|            10000 |                 48 |  369845 |  1115740 |


And a second version of this, where the idle connections are just less
busy, using the following script:
\sleep 100ms
SELECT 1;
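
That script is driven the same way as the '\sleep 1s' one, just pointing
-f at the saved file, e.g. (the file name is arbitrary):

  pgbench -n -f mostly_idle.sql -c 5000 -j 100 -T 300 postgres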

| Mostly Idle Connections | Active Connections | TPS pre |       TPS post |
|------------------------:|-------------------:|--------:|---------------:|
|                       0 |                  1 |   33837 |   34095.891429 |
|                     100 |                  1 |   30622 |   31166.767491 |
|                    1000 |                  1 |   25523 |   28829.313249 |
|                    2500 |                  1 |   19260 |   24978.878822 |
|                    5000 |                  1 |   11171 |   24208.146408 |
|                   10000 |                  1 |    6702 |   29577.517084 |
|                       0 |                 48 | 1022721 | 1133153.772338 |
|                     100 |                 48 |  980705 | 1034235.255883 |
|                    1000 |                 48 |  824668 | 1115965.638395 |
|                    2500 |                 48 |  698510 | 1073280.930789 |
|                    5000 |                 48 |  478535 | 1041931.158287 |
|                   10000 |                 48 |  276042 |  953567.038634 |

It's probably worth calling out that in this second test run the
run-to-run variability is huge, presumably because it's very scheduler
dependent how much CPU time the "active" backends and the "active"
pgbench get at higher "mostly idle" connection counts.


> 2. You wrote: This is on a machine with 2
> Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)
>
> According to the Intel specification, the Intel® Xeon® Platinum 8168 Processor has 24 cores:
> https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html
>
> And in your graph we can see an almost linear increase in speed up to 40 connections.
>
> But the most suspicious word for me is "virtualized". What is the actual hardware and how is it virtualized?

That was on an Azure F72s v2. I think that's Hyper-V virtualized, with all
the "lost" cores dedicated to the hypervisor. But I did reproduce the
speedups on my unvirtualized workstation (2x Xeon Gold 5215 CPUs) -
the ceiling is lower, obviously.


> Maybe it is because of the more complex architecture of my server?

Think we'll need profiles to know...

Greetings,

Andres Freund


