Thread: Improving connection scalability (src/backend/storage/ipc/procarray.c)

Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
Hi hackers,

Inspired by Andres's work, I put some effort into trying to improve GetSnapshotData().
I think that I got something.

As is well known, GetSnapshotData() is a critical function for Postgres performance, as other benchmarks have already demonstrated.
So any gain, no matter how small, makes a difference.

So far, no regression has been observed.

Windows 10 64 bits (msvc 2019 64 bits)
pgbench -M prepared -c $conns -S -n -U postgres

conns         tps head                 tps patched
1          2918.004085              3027.550711
10      12262.415696            12876.641772
50      13656.724571            14877.410140
80      14338.202348            15244.192915

pgbench on Windows can't run with conns > 80.

Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres

conns         tps head                 tps patched
1          2918.004085              3190.810466
10      12262.415696            17199.862401
50      13656.724571            18278.194114
80      14338.202348            17955.336101
90      16597.510373            18269.660184
100    17706.775793            18349.650150
200    16877.067441            17881.250615
300    16942.260775            17181.441752
400    16794.514911            17124.533892
500    16598.502151            17181.244953
600    16717.935001            16961.130742
700    16651.204834            16959.172005
800    16467.546583            16834.591719
900    16588.241149            16693.902459
1000  16564.985265            16936.952195

I was surprised that in my tests, with more than 100 connections,
the tps performance drops, even on head.
I don't have access to a powerful machine to really test with a high workload.
But with fewer than 100 connections, it seems to me that there are clear gains.

patch attached.

regards,
Ranier Vilela


Attachment

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Robert Haas
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think that I got something.

You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
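
To illustrate the kind of change in question -- a hypothetical before/after
sketch modeled on XidInMVCCSnapshot(), not the actual patch:

    /* before: counter declared at function scope */
    uint32      i;
    ...
    for (i = 0; i < snapshot->xcnt; i++)
        if (TransactionIdEquals(xid, snapshot->xip[i]))
            return true;

    /* after: counter declared in the loop that uses it */
    for (uint32 i = 0; i < snapshot->xcnt; i++)
        if (TransactionIdEquals(xid, snapshot->xip[i]))
            return true;

At -O2, an optimizing compiler should produce identical machine code for both.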

An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument. I'd guess
that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock. That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:

 * This very likely can be evolved to not need ProcArrayLock held (at very
 * least in the case we already hold a snapshot), but that's for another day.

However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
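
Concretely, the difference presumably has roughly this shape (a sketch of
the idea only, not the actual patch):

    /* head (simplified): reuse is attempted while holding the lock */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    if (GetSnapshotDataReuse(snapshot))
    {
        LWLockRelease(ProcArrayLock);
        return snapshot;
    }

    /* patch (apparently): reuse is attempted before taking the lock */
    if (snapshot->xip != NULL &&
        GetSnapshotDataReuse(snapshot))
        return snapshot;
    LWLockAcquire(ProcArrayLock, LW_SHARED);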

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Tue, May 24, 2022 at 13:06, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think that I got something.

You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
I believe that even these small changes are helpful and worthwhile.
They improve code readability and help the compiler generate better code,
especially older compilers.
 

An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument.
Ok, admittedly I haven't yet made a convincing case.

I'd guess that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock.
It certainly helps, but I trust that's not the only reason; in all the tests I did, there was an improvement in performance, even before using this feature.
If you look closely at GetSnapshotData() you will see that GetSnapshotDataReuse() is called for all snapshots, even new ones, which is unnecessary.
Another example: NormalTransactionIdPrecedes() is more expensive than testing statusFlags.

That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:

 * This very likely can be evolved to not need ProcArrayLock held (at very
 * least in the case we already hold a snapshot), but that's for another day.

However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
I think it's worth trying and testing to see if everything goes well,
and then updating the comments as needed in the final patch.

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Andres Freund
Hi,

On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> Linux Ubuntu 64 bits (gcc 9.4)
> ./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres
> 
> conns         tps head                 tps patched
> 1          2918.004085              3190.810466
> 10      12262.415696            17199.862401
> 50      13656.724571            18278.194114
> 80      14338.202348            17955.336101
> 90      16597.510373            18269.660184
> 100    17706.775793            18349.650150
> 200    16877.067441            17881.250615
> 300    16942.260775            17181.441752
> 400    16794.514911            17124.533892
> 500    16598.502151            17181.244953
> 600    16717.935001            16961.130742
> 700    16651.204834            16959.172005
> 800    16467.546583            16834.591719
> 900    16588.241149            16693.902459
> 1000  16564.985265            16936.952195

17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).

What pgbench scale is this and are you using an optimized build?

Greetings,

Andres Freund



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Andres Freund
Hi,

On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
> It certainly helps, but I trust that's not the only reason, in all the
> tests I did, there was an improvement in performance, even before using
> this feature.
> If you look closely at GetSnapshotData() you will see that
> GetSnapshotDataReuse is called for all snapshots, even the new ones, which
> is unnecessary.

That only happens a handful of times as snapshots are persistently
allocated. Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.


It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...


> Another example NormalTransactionIdPrecedes is more expensive than testing
> statusFlags.

That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.

Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than that due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
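
For context, the filter order in GetSnapshotData()'s inner loop is roughly
this (simplified sketch of the 15beta1 code; declarations and setup elided):

    for (int pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
    {
        /* fetch the xid just once */
        TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
        uint8       statusFlags;

        if (likely(xid == InvalidTransactionId))
            continue;           /* no xid assigned */
        if (pgxactoff == mypgxactoff)
            continue;           /* our own backend */
        if (!NormalTransactionIdPrecedes(xid, xmax))
            continue;           /* xid >= xmax filters out most procs */

        /* only now touch the hotly contended statusFlags cacheline */
        statusFlags = allStatusFlags[pgxactoff];
        if (statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
            continue;

        /* ... otherwise the xid (and any subxids) go into the snapshot ... */
    }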

Greetings,

Andres Freund



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Wed, May 25, 2022 at 00:46, Andres Freund <andres@anarazel.de> wrote:

Hi Andres, thank you for taking a look.

On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> Linux Ubuntu 64 bits (gcc 9.4)
> ./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres
>
> conns         tps head                 tps patched
> 1          2918.004085              3190.810466
> 10      12262.415696            17199.862401
> 50      13656.724571            18278.194114
> 80      14338.202348            17955.336101
> 90      16597.510373            18269.660184
> 100    17706.775793            18349.650150
> 200    16877.067441            17881.250615
> 300    16942.260775            17181.441752
> 400    16794.514911            17124.533892
> 500    16598.502151            17181.244953
> 600    16717.935001            16961.130742
> 700    16651.204834            16959.172005
> 800    16467.546583            16834.591719
> 900    16588.241149            16693.902459
> 1000  16564.985265            16936.952195

17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hints.
Yes, the 17-18k tps is disappointing.


What pgbench scale is this and are you using an optimized build?
Yes, this is an optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2'
from config.log

pgbench was initialized with:
pgbench -i -p 5432 -d postgres

pgbench -M prepared -c 100 -j 100 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100

The shared_buffers is default:
shared_buffers = 128MB

Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB

Can you share the pgbench configuration and shared_buffers for
this benchmark?

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Wed, May 25, 2022 at 00:56, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
> It certainly helps, but I trust that's not the only reason, in all the
> tests I did, there was an improvement in performance, even before using
> this feature.
> If you look closely at GetSnapshotData() you will see that
> GetSnapshotDataReuse is called for all snapshots, even the new ones, which
> is unnecessary.

That only happens a handful of times as snapshots are persistently
allocated.
Yes, but with the patch this no longer happens for new snapshots.

Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.
IMHO, with GetSnapshotData(), any gain makes a difference.
 

It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
I believe it's worth the effort to make sure everything goes well and use this feature.


> Another example NormalTransactionIdPrecedes is more expensive than testing
> statusFlags.

That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.

Also, the likelihood of filtering out a proc due to
> NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than that due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
I believe that keeping the instructions in the cache together works better than having the status flags test in the middle.
But I will test this to be sure.

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Tomas Vondra

On 5/25/22 11:07, Ranier Vilela wrote:
> On Wed, May 25, 2022 at 00:46, Andres Freund <andres@anarazel.de> wrote:
> 
> Hi Andres, thank you for taking a look.
> 
> 
>     On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
>     > Linux Ubuntu 64 bits (gcc 9.4)
>     > ./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres
>     >
>     > conns         tps head                 tps patched
>     > 1          2918.004085              3190.810466
>     > 10      12262.415696            17199.862401
>     > 50      13656.724571            18278.194114
>     > 80      14338.202348            17955.336101
>     > 90      16597.510373            18269.660184
>     > 100    17706.775793            18349.650150
>     > 200    16877.067441            17881.250615
>     > 300    16942.260775            17181.441752
>     > 400    16794.514911            17124.533892
>     > 500    16598.502151            17181.244953
>     > 600    16717.935001            16961.130742
>     > 700    16651.204834            16959.172005
>     > 800    16467.546583            16834.591719
>     > 900    16588.241149            16693.902459
>     > 1000  16564.985265            16936.952195
> 
>     17-18k tps is pretty low for pgbench -S. For a shared_buffers
>     resident run, I
>     can get 40k in a single connection in an optimized build. If you're
>     testing a
>     workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
>     testing an assert build isn't a meaningful exercise either, unless
>     you have
>     way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
> 
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
> 
> 
>     What pgbench scale is this and are you using an optimized build?
> 
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2'
> from config.log
> 

That can still be an assert-enabled build. We need to see the configure flags.

> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
> 
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres

You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.

> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> 
> The shared_buffers is default:
> shared_buffers = 128MB
> 
> Intel® Core™ i5-8250U CPU Quad Core
> RAM 8GB
> SSD 256 GB
> 

Well, quick results on my laptop (i7-9750H, so not that different from
what you have):

1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645

So something is likely wrong in your setup.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Wed, May 25, 2022 at 07:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 5/25/22 11:07, Ranier Vilela wrote:
> On Wed, May 25, 2022 at 00:46, Andres Freund <andres@anarazel.de> wrote:
>
> Hi Andres, thank you for taking a look.
>
>
>     On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
>     > Linux Ubuntu 64 bits (gcc 9.4)
>     > ./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres
>     >
>     > conns         tps head                 tps patched
>     > 1          2918.004085              3190.810466
>     > 10      12262.415696            17199.862401
>     > 50      13656.724571            18278.194114
>     > 80      14338.202348            17955.336101
>     > 90      16597.510373            18269.660184
>     > 100    17706.775793            18349.650150
>     > 200    16877.067441            17881.250615
>     > 300    16942.260775            17181.441752
>     > 400    16794.514911            17124.533892
>     > 500    16598.502151            17181.244953
>     > 600    16717.935001            16961.130742
>     > 700    16651.204834            16959.172005
>     > 800    16467.546583            16834.591719
>     > 900    16588.241149            16693.902459
>     > 1000  16564.985265            16936.952195
>
>     17-18k tps is pretty low for pgbench -S. For a shared_buffers
>     resident run, I
>     can get 40k in a single connection in an optimized build. If you're
>     testing a
>     workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
>     testing an assert build isn't a meaningful exercise either, unless
>     you have
>     way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
>
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
>
>
>     What pgbench scale is this and are you using an optimized build?
>
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2'
> from config.log
>

That can still be an assert-enabled build. We need to see the configure flags.
./configure
Attached the config.log (compressed)
 

> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
>
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres

You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
 

> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
>
> The shared_buffers is default:
> shared_buffers = 128MB
>
> Intel® Core™ i5-8250U CPU Quad Core
> RAM 8GB
> SSD 256 GB
>

Well, quick results on my laptop (i7-9750H, so not that different from
what you have):

1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645

So something is likely wrong in your setup.
select version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
 
gcc -v output:
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

regards,
Ranier Vilela
Attachment

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Wed, May 25, 2022 at 08:26, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, May 25, 2022 at 07:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 5/25/22 11:07, Ranier Vilela wrote:
> On Wed, May 25, 2022 at 00:46, Andres Freund <andres@anarazel.de> wrote:
>
> Hi Andres, thank you for taking a look.
>
>
>     On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
>     > Linux Ubuntu 64 bits (gcc 9.4)
>     > ./pgbench -M prepared -c  $conns -j $conns -S -n -U postgres
>     >
>     > conns         tps head                 tps patched
>     > 1          2918.004085              3190.810466
>     > 10      12262.415696            17199.862401
>     > 50      13656.724571            18278.194114
>     > 80      14338.202348            17955.336101
>     > 90      16597.510373            18269.660184
>     > 100    17706.775793            18349.650150
>     > 200    16877.067441            17881.250615
>     > 300    16942.260775            17181.441752
>     > 400    16794.514911            17124.533892
>     > 500    16598.502151            17181.244953
>     > 600    16717.935001            16961.130742
>     > 700    16651.204834            16959.172005
>     > 800    16467.546583            16834.591719
>     > 900    16588.241149            16693.902459
>     > 1000  16564.985265            16936.952195
>
>     17-18k tps is pretty low for pgbench -S. For a shared_buffers
>     resident run, I
>     can get 40k in a single connection in an optimized build. If you're
>     testing a
>     workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
>     testing an assert build isn't a meaningful exercise either, unless
>     you have
>     way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
>
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
>
>
>     What pgbench scale is this and are you using an optimized build?
>
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2'
> from config.log
>

That can still be an assert-enabled build. We need to see the configure flags.
./configure
Attached the config.log (compressed)
 

> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
>
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres

You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.

Here are the results with -T 60:
Linux Ubuntu 64 bits
shared_buffers = 128MB

./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres

pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s

conns            tps head              tps patched

1         17126.326108            17792.414234
10       82068.123383            82468.334836
50       73808.731404            74678.839428
80       73290.191713            73116.553986
90       67558.483043            68384.906949
100     65960.982801            66997.793777
200     62216.011998            62870.243385
300     62924.225658            62796.157548
400     62278.099704            63129.555135
500     63257.930870            62188.825044
600     61479.890611            61517.913967
700     61139.354053            61327.898847
800     60833.663791            61517.913967
900     61305.129642            61248.336593
1000   60990.918719            61041.670996


Linux Ubuntu 64 bits
shared_buffers = 2048MB

./pgbench -M prepared -c $conns -j $conns -S -n -U postgres

pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10

conns          tps head                tps patched

1           2918.004085              3211.303789
10       12262.415696            15540.015540
50       13656.724571            16701.182444
80       14338.202348            16628.559551
90       16597.510373            16835.016835
100     17706.775793            16607.433487
200     16877.067441            16426.969799
300     16942.260775            16319.780662
400     16794.514911            16155.023607
500     16598.502151            16051.106724
600     16717.935001            16007.171213
700     16651.204834            16004.353184
800     16467.546583            16834.591719
900     16588.241149            16693.902459
1000   16564.985265            16936.952195


Linux Ubuntu 64 bits
shared_buffers = 2048MB

./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres

pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s

conns           tps head               tps patched

1         17174.265804            17792.414234
10       82365.634750            82468.334836
50       74593.714180            74678.839428
80       69219.756038            73116.553986
90       67419.574189            68384.906949
100     66613.771701            66997.793777
200     61739.784830            62870.243385
300     62109.691298            62796.157548
400     61630.822446            63129.555135
500     61711.019964            62755.190389
600     60620.010181            61517.913967
700     60303.317736            61688.044232
800     60451.113573            61076.666572
900     60017.327157            61256.290037
1000   60088.823434            60986.799312

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Tomas Vondra

On 5/27/22 02:11, Ranier Vilela wrote:
>
> ...
> 
> Here the results with -T 60:

Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?

> Linux Ubuntu 64 bits
> shared_buffers = 128MB
> 
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> 
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
> 
> conns            tps head              tps patched
> 
> 1         17126.326108            17792.414234
> 10       82068.123383            82468.334836
> 50       73808.731404            74678.839428
> 80       73290.191713            73116.553986
> 90       67558.483043            68384.906949
> 100     65960.982801            66997.793777
> 200     62216.011998            62870.243385
> 300     62924.225658            62796.157548
> 400     62278.099704            63129.555135
> 500     63257.930870            62188.825044
> 600     61479.890611            61517.913967
> 700     61139.354053            61327.898847
> 800     60833.663791            61517.913967
> 900     61305.129642            61248.336593
> 1000   60990.918719            61041.670996
> 

These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?

BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.

> 
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
> 
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
> 
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> number of transactions per client: 10
> 
> conns          tps head                tps patched
> 
> 1           2918.004085              3211.303789
> 10       12262.415696            15540.015540
> 50       13656.724571            16701.182444
> 80       14338.202348            16628.559551
> 90       16597.510373            16835.016835
> 100     17706.775793            16607.433487
> 200     16877.067441            16426.969799
> 300     16942.260775            16319.780662
> 400     16794.514911            16155.023607
> 500     16598.502151            16051.106724
> 600     16717.935001            16007.171213
> 700     16651.204834            16004.353184
> 800     16467.546583            16834.591719
> 900     16588.241149            16693.902459
> 1000   16564.985265            16936.952195
> 

I think we've agreed these results are useless.

> 
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
> 
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> 
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
> 
> conns           tps head               tps patched
> 
> 1         17174.265804            17792.414234
> 10       82365.634750            82468.334836
> 50       74593.714180            74678.839428
> 80       69219.756038            73116.553986
> 90       67419.574189            68384.906949
> 100     66613.771701            66997.793777
> 200     61739.784830            62870.243385
> 300     62109.691298            62796.157548
> 400     61630.822446            63129.555135
> 500     61711.019964            62755.190389
> 600     60620.010181            61517.913967
> 700     60303.317736            61688.044232
> 800     60451.113573            61076.666572
> 900     60017.327157            61256.290037
> 1000   60088.823434            60986.799312
> 

I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.

Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Thu, May 26, 2022 at 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
>
> ...
>
> Here the results with -T 60:

Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
GetSnapshotData() isn't a bottleneck?
 

> Linux Ubuntu 64 bits
> shared_buffers = 128MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns            tps head              tps patched
>
> 1         17126.326108            17792.414234
> 10       82068.123383            82468.334836
> 50       73808.731404            74678.839428
> 80       73290.191713            73116.553986
> 90       67558.483043            68384.906949
> 100     65960.982801            66997.793777
> 200     62216.011998            62870.243385
> 300     62924.225658            62796.157548
> 400     62278.099704            63129.555135
> 500     63257.930870            62188.825044
> 600     61479.890611            61517.913967
> 700     61139.354053            61327.898847
> 800     60833.663791            61517.913967
> 900     61305.129642            61248.336593
> 1000   60990.918719            61041.670996
>

These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations.
However, I think they should be considered micro-optimizations in inner loops,
because everything in procarray.c is a hot path.
The first objective, I believe, was achieved: no performance regression.
I agree the gains are small, based on the tests done.
But IMHO this is a good approach; small gains turn into big gains in the end, when applied across all the code.

Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so
if (snapshot == NULL) will fail most of the time.

With the patch:
if (snapshot->xip != NULL)
{
    if (GetSnapshotDataReuse(snapshot))
        return snapshot;
}

Most of the time the test is true, and GetSnapshotDataReuse() is not called
for new snapshots.

count, subcount and suboverflowed will not be initialized for all snapshots.

2. If the snapshot is taken during recovery,
the pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily
(only when not suboverflowed),
skipping all InvalidTransactionId, mypgxactoff, backends doing logical decoding,
and XIDs >= xmax.

3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.

Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
2. allStatusFlags[index] is not touched for all numProcs.

All changes were made with the aim of avoiding or postponing unnecessary work.
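
To make point 2 concrete, the idea is to defer the per-proc flags load until
after the cheaper filter (a sketch under assumptions about the loop shape,
not the actual patch; most of the loop body is elided):

    for (int index = 0; index < arrayP->numProcs; index++)
    {
        int8        statusFlags;

        /* head loads the flags right here, up front, for every proc:
         *     statusFlags = ProcGlobal->statusFlags[index];
         */

        /* ... cheap per-proc tests that may skip this proc entirely ... */
        if (!TransactionIdIsValid(xmin))
            continue;

        /* patched: read the flags only once we know they are needed */
        statusFlags = ProcGlobal->statusFlags[index];

        /* ... horizon computations using statusFlags ... */
    }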


BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.

>
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> number of transactions per client: 10
>
> conns          tps head                tps patched
>
> 1           2918.004085              3211.303789
> 10       12262.415696            15540.015540
> 50       13656.724571            16701.182444
> 80       14338.202348            16628.559551
> 90       16597.510373            16835.016835
> 100     17706.775793            16607.433487
> 200     16877.067441            16426.969799
> 300     16942.260775            16319.780662
> 400     16794.514911            16155.023607
> 500     16598.502151            16051.106724
> 600     16717.935001            16007.171213
> 700     16651.204834            16004.353184
> 800     16467.546583            16834.591719
> 900     16588.241149            16693.902459
> 1000   16564.985265            16936.952195
>

I think we've agreed these results are useless.

>
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns           tps head               tps patched
>
> 1         17174.265804            17792.414234
> 10       82365.634750            82468.334836
> 50       74593.714180            74678.839428
> 80       69219.756038            73116.553986
> 90       67419.574189            68384.906949
> 100     66613.771701            66997.793777
> 200     61739.784830            62870.243385
> 300     62109.691298            62796.157548
> 400     61630.822446            63129.555135
> 500     61711.019964            62755.190389
> 600     60620.010181            61517.913967
> 700     60303.317736            61688.044232
> 800     60451.113573            61076.666572
> 900     60017.327157            61256.290037
> 1000   60088.823434            60986.799312
>

I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
I thought about doing this benchmark in the most common usage configuration (shared_buffers = 25% of RAM).


Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
Probably a copy-and-paste mistake.
I redid this test, for patched:

Linux Ubuntu 64 bits
shared_buffers = 2048MB

./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres

pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s

conns           tps head                tps patched

1         17174.265804            17524.482668
10       82365.634750            81840.537713
50       74593.714180            74806.729434
80       69219.756038            73116.553986
90       67419.574189            69130.749209
100     66613.771701            67478.234595
200     61739.784830            63094.202413
300     62109.691298            62984.501251
400     61630.822446            63243.232816
500     61711.019964            62827.977636
600     60620.010181            62838.051693
700     60303.317736            61594.629618
800     60451.113573            61208.629058
900     60017.327157            61171.001256
1000   60088.823434            60558.067810
 
regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Andres Freund
Hi,

On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> On 5/27/22 02:11, Ranier Vilela wrote:
> > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> > 
> > pgbench (15beta1)
> > transaction type: <builtin: select only>
> > scaling factor: 1
> > query mode: prepared
> > number of clients: 100
> > number of threads: 100
> > maximum number of tries: 1
> > duration: 60 s
> > 
> > conns            tps head              tps patched
> > 
> > 1         17126.326108            17792.414234
> > 10       82068.123383            82468.334836
> > 50       73808.731404            74678.839428
> > 80       73290.191713            73116.553986
> > 90       67558.483043            68384.906949
> > 100     65960.982801            66997.793777
> > 200     62216.011998            62870.243385
> > 300     62924.225658            62796.157548
> > 400     62278.099704            63129.555135
> > 500     63257.930870            62188.825044
> > 600     61479.890611            61517.913967
> > 700     61139.354053            61327.898847
> > 800     60833.663791            61517.913967
> > 900     61305.129642            61248.336593
> > 1000   60990.918719            61041.670996
> > 
> 
> These results look much saner, but IMHO it also does not show any clear
> benefit of the patch. Or are you still claiming there is a benefit?

They don't look all that sane to me - isn't that way lower than one would
expect? Restricting both client and server to the same four cores, on a
thermally challenged older laptop I have around I get 150k tps at both 10
and 100 clients.

Either way, I'd not expect to see any GetSnapshotData() scalability effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough
concurrency.

The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.

To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.

Greetings,

Andres Freund



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Andres Freund
Hi,

On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> On Thu, May 26, 2022 at 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > >
> > > ...
> > >
> > > Here the results with -T 60:
> >
> > Might be a good idea to share your analysis / interpretation of the
> > results, not just the raw data. After all, the change is being proposed
> > by you, so do you think this shows the change is beneficial?
> >
> I think so, but the expectation has diminished.
> I expected that the more connections, the better the performance.
> And for both patch and head, this doesn't happen in tests.
> Performance degrades with a greater number of connections.

Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.



> GetSnapshotData() isn't a bottleneck?

I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly.  And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.


> > These results look much saner, but IMHO it also does not show any clear
> > benefit of the patch. Or are you still claiming there is a benefit?
> >
> We agree that they are micro-optimizations.  However, I think they should be
> considered micro-optimizations in inner loops, because all in procarray.c is
> a hotpath.

As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.


> The first objective, I believe, was achieved, with no performance
> regression.
> I agree, the gains are small, by the tests done.

There are no gains.


> But, IMHO, this is a good way, small gains turn into big gains in the end,
> when applied to all code.
>
> Consider GetSnapShotData()
> 1. Most of the time the snapshot is not null, so:
> if (snapshot == NULL) will fail most of the time.
>
> With the patch:
> if (snapshot->xip != NULL)
> {
>     if (GetSnapshotDataReuse(snapshot))
>      return snapshot;
> }
>
> Most of the time the test is true and GetSnapshotDataReuse is not called
> for new
> snapshots.
> count, subcount and  suboverflowed, will not be initialized, for all
> snapshots.

But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.


> 2. If the snapshot is taken during recovery
> The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.

That code isn't reached when in recovery?


> 3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
> There's an agreement that this would be fine, for now.

There's no such agreement at all. It's not correct.


> Consider ComputeXidHorizons()
> 1. ProcGlobal->statusFlags is touched before the lock.

Hard to believe that'd have a measurable effect.


> 2. allStatusFlags[index] is not touched for all numProcs.

I'd be surprised if the compiler couldn't defer that load on its own.

Greetings,

Andres Freund



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela


On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> On 5/27/22 02:11, Ranier Vilela wrote:
> > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> >
> > pgbench (15beta1)
> > transaction type: <builtin: select only>
> > scaling factor: 1
> > query mode: prepared
> > number of clients: 100
> > number of threads: 100
> > maximum number of tries: 1
> > duration: 60 s
> >
> > conns            tps head              tps patched
> >
> > 1         17126.326108            17792.414234
> > 10       82068.123383            82468.334836
> > 50       73808.731404            74678.839428
> > 80       73290.191713            73116.553986
> > 90       67558.483043            68384.906949
> > 100     65960.982801            66997.793777
> > 200     62216.011998            62870.243385
> > 300     62924.225658            62796.157548
> > 400     62278.099704            63129.555135
> > 500     63257.930870            62188.825044
> > 600     61479.890611            61517.913967
> > 700     61139.354053            61327.898847
> > 800     60833.663791            61517.913967
> > 900     61305.129642            61248.336593
> > 1000   60990.918719            61041.670996
> >
>
> These results look much saner, but IMHO it also does not show any clear
> benefit of the patch. Or are you still claiming there is a benefit?

They don't look all that sane to me - isn't that way lower than one would
expect?
Yes, quite disappointing.

Restricting both client and server to the same four cores, on a
thermally challenged older laptop I have around I get 150k tps at both 10
and 100 clients.
Can you share the benchmark details? Hardware, postgres and pgbench settings, please?

Either way, I'd not expect to see any GetSnapshotData() scalability effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough
concurrency.
Does this mean that our customers will not see any connection scalability with PG15 on modest hardware?


The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.

To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Ranier Vilela
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> On Thu, May 26, 2022 at 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > >
> > > ...
> > >
> > > Here the results with -T 60:
> >
> > Might be a good idea to share your analysis / interpretation of the
> > results, not just the raw data. After all, the change is being proposed
> > by you, so do you think this shows the change is beneficial?
> >
> I think so, but the expectation has diminished.
> I expected that the more connections, the better the performance.
> And for both patch and head, this doesn't happen in tests.
> Performance degrades with a greater number of connections.

Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
conns       tps head
10      82365.634750
50      74593.714180
80      69219.756038
90      67419.574189
100     66613.771701
Yes, it is quite disappointing that tps with 100 connections loses to 10 connections.
 



> GetSnapshotData() isn't a bottleneck?

I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly.  And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
Thanks for sharing this.
 


> > These results look much saner, but IMHO it also does not show any clear
> > benefit of the patch. Or are you still claiming there is a benefit?
> >
> We agree that they are micro-optimizations.  However, I think they should be
> considered micro-optimizations in inner loops, because all in procarray.c is
> a hotpath.

As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.


> The first objective, I believe, was achieved, with no performance
> regression.
> I agree, the gains are small, by the tests done.

There are no gains.
IMHO, I must disagree.
 


> But, IMHO, this is a good way, small gains turn into big gains in the end,
> when applied to all code.
>
> Consider GetSnapShotData()
> 1. Most of the time the snapshot is not null, so:
> if (snapshot == NULL) will fail most of the time.
>
> With the patch:
> if (snapshot->xip != NULL)
> {
>     if (GetSnapshotDataReuse(snapshot))
>      return snapshot;
> }
>
> Most of the time the test is true and GetSnapshotDataReuse is not called
> for new
> snapshots.
> count, subcount and  suboverflowed, will not be initialized, for all
> snapshots.

But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapshotData() is the bottleneck, all is relevant.
 


> 2. If the snapshot is taken during recovery
> The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.

That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is *only* reached when in recovery.



> 3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
> There's an agreement that this would be fine, for now.

There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.



> Consider ComputeXidHorizons()
> 1. ProcGlobal->statusFlags is touched before the lock.

Hard to believe that'd have a measurable effect.
IMHO, anything you take out of the lock is a benefit.
 


> 2. allStatusFlags[index] is not touched for all numProcs.

I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Tomas Vondra

On 5/28/22 02:15, Ranier Vilela wrote:
> 
> 
> On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de> wrote:
> 
>     Hi,
> 
>     On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
>     > On 5/27/22 02:11, Ranier Vilela wrote:
>     > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>     > >
>     > > pgbench (15beta1)
>     > > transaction type: <builtin: select only>
>     > > scaling factor: 1
>     > > query mode: prepared
>     > > number of clients: 100
>     > > number of threads: 100
>     > > maximum number of tries: 1
>     > > duration: 60 s
>     > >
>     > > conns            tps head              tps patched
>     > >
>     > > 1         17126.326108            17792.414234
>     > > 10       82068.123383            82468.334836
>     > > 50       73808.731404            74678.839428
>     > > 80       73290.191713            73116.553986
>     > > 90       67558.483043            68384.906949
>     > > 100     65960.982801            66997.793777
>     > > 200     62216.011998            62870.243385
>     > > 300     62924.225658            62796.157548
>     > > 400     62278.099704            63129.555135
>     > > 500     63257.930870            62188.825044
>     > > 600     61479.890611            61517.913967
>     > > 700     61139.354053            61327.898847
>     > > 800     60833.663791            61517.913967
>     > > 900     61305.129642            61248.336593
>     > > 1000   60990.918719            61041.670996
>     > >
>     >
>     > These results look much saner, but IMHO it also does not show any
>     clear
>     > benefit of the patch. Or are you still claiming there is a benefit?
> 
>     They don't look all that sane to me - isn't that way lower than one
>     would
>     expect?
> 
> Yes, quite disappointing.
> 
>     Restricting both client and server to the same four cores, a
>     thermically challenged older laptop I have around I get 150k tps at
>     both 10
>     and 100 clients.
> 
> And you can share the benchmark details? Hardware, postgres and pgbench,
> please?
> 
> 
>     Either way, I'd not expect to see any GetSnapshotData() scalability
>     effects to
>     show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
>     not enough
>     concurrency.
> 
> This means that our customers will not see any connections scalability
> with PG15, using the simplest hardware?
> 

No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.

I personally doubt it even makes sense to worry too much about scaling to
this many connections on such a tiny system.

> 
>     The correct pieces of these changes seem very unlikely to affect
>     GetSnapshotData() performance meaningfully.
> 
>     To improve something like GetSnapshotData() you first have to come
>     up with a
>     workload that shows it being a meaningful part of a profile. Unless
>     it is,
>     performance differences are going to just be due to various forms of
>     noise.
> 
> Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.
> 

But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From: Tomas Vondra

On 5/28/22 02:36, Ranier Vilela wrote:
> On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de> wrote:
> 
>     Hi,
> 
>     On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
>     > On Thu, May 26, 2022 at 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>     >
>     > > On 5/27/22 02:11, Ranier Vilela wrote:
>     > > >
>     > > > ...
>     > > >
>     > > > Here the results with -T 60:
>     > >
>     > > Might be a good idea to share your analysis / interpretation of the
>     > > results, not just the raw data. After all, the change is being
>     proposed
>     > > by you, so do you think this shows the change is beneficial?
>     > >
>     > I think so, but the expectation has diminished.
>     > I expected that the more connections, the better the performance.
>     > And for both patch and head, this doesn't happen in tests.
>     > Performance degrades with a greater number of connections.
> 
>     Your system has four CPUs. Once they're all busy, adding more
>     connections
>     won't improve performance. It'll just add more and more context
>     switching,
>     cache misses, and make the OS scheduler do more work.
> 
> conns       tps head
> 10      82365.634750
> 50      74593.714180
> 80      69219.756038
> 90      67419.574189
> 100     66613.771701
> Yes it is quite disappointing that with 100 connections, tps loses to 10
> connections.
>  

IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.

This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use a machine
with more cores.

> 
>     > GetSnapshotData() isn't a bottleneck?
> 
>     I'd be surprised if it showed up in a profile on your machine with that
>     workload in any sort of meaningful way. The snapshot reuse logic
>     will always
>     work - because there are no writes - and thus the only work that
>     needs to be
>     done is to acquire the ProcArrayLock briefly.  And because there is
>     only a
>     small number of cores, contention on the cacheline for that isn't a
>     problem.
> 
> Thanks for sharing this.
>  
> 
> 
> 
>     > > These results look much saner, but IMHO it also does not show
>     any clear
>     > > benefit of the patch. Or are you still claiming there is a benefit?
>     > >
>     > We agree that they are micro-optimizations.  However, I think they
>     should be
>     > considered micro-optimizations in inner loops, because everything in
>     procarray.c is
>     > a hot path.
> 
>     As explained earlier, I don't agree that they optimize anything - you're
>     making some of the scalability behaviour *worse*, if it's changed at
>     all.
> 
> 
>     > The first objective, I believe, was achieved, with no performance
>     > regression.
>     > I agree, the gains are small, by the tests done.
> 
>     There are no gains.
> 
> IMHO, I must disagree.
>  

You don't have to, really. What you should do is show results
demonstrating the claimed gains, and so far you have not done that.

I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is very noisy).

As mentioned, GetSnapshotData() is not even in the perf profile, so why would
the patch even make a difference?

You've also claimed it helps generate better code on older compilers,
but you've never supported that with any evidence.


Maybe there is an improvement - show us. Do a benchmark with more runs,
to average out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].

[1] https://www.youtube.com/watch?v=r-TLSBdHe1A

> 
> 
> 
>     > But, IMHO, this is a good way, small gains turn into big gains in
>     the end,
>     > when applied to all code.
>     >
>     > Consider GetSnapShotData()
>     > 1. Most of the time the snapshot is not null, so:
>     > if (snapshot == NULL) will fail most of the time.
>     >
>     > With the patch:
>     > if (snapshot->xip != NULL)
>     > {
>     >     if (GetSnapshotDataReuse(snapshot))
>     >         return snapshot;
>     > }
>     >
>     > Most of the time the test is true and GetSnapshotDataReuse is not
>     called
>     > for new
>     > snapshots.
>     > count, subcount and suboverflowed will not be initialized for all
>     > snapshots.
> 
>     But that's irrelevant. There's only a few "new" snapshots in the
>     life of a
>     connection. You're optimizing something irrelevant.
> 
> IMHO, when GetSnapshotData() is the bottleneck, everything is relevant.
>  

Maybe. Show us the difference.

> 
> 
> 
>     > 2. If a snapshot is taken during recovery
>     > The pgprocnos and ProcGlobal->subxidStates are not touched
>     unnecessarily.
> 
>     That code isn't reached when in recovery?
> 
> Currently it is reached *even* when not in recovery.
> With the patch, it is reached *only* when in recovery.
> 
> 
> 
>     > 3. Calling GetSnapshotDataReuse() without first acquiring
>     ProcArrayLock.
>     > There's an agreement that this would be fine, for now.
> 
>     There's no such agreement at all. It's not correct.
> 
> Ok, but there is a chance it will work correctly.
> 

Either it's correct or not. Chance of being correct does not count.

> 
> 
>     > Consider ComputeXidHorizons()
>     > 1. ProcGlobal->statusFlags is touched before the lock.
> 
>     Hard to believe that'd have a measurable effect.
> 
> IMHO, anything you take out of the lock is a benefit.
>  

Maybe. Show us the difference.

> 
> 
> 
>     > 2. allStatusFlags[index] is not touched for all numProcs.
> 
>     I'd be surprised if the compiler couldn't defer that load on its own.
> 
> Better be sure of that, no?
> 

We rely on compilers doing this in about a million other places.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Ranier Vilela
Date:
On Sat, May 28, 2022 at 09:00, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:15, Ranier Vilela wrote:
>
>
> On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de> wrote:
>
>     Hi,
>
>     On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
>     > On 5/27/22 02:11, Ranier Vilela wrote:
>     > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>     > >
>     > > pgbench (15beta1)
>     > > transaction type: <builtin: select only>
>     > > scaling factor: 1
>     > > query mode: prepared
>     > > number of clients: 100
>     > > number of threads: 100
>     > > maximum number of tries: 1
>     > > duration: 60 s
>     > >
>     > > conns            tps head              tps patched
>     > >
>     > > 1         17126.326108            17792.414234
>     > > 10       82068.123383            82468.334836
>     > > 50       73808.731404            74678.839428
>     > > 80       73290.191713            73116.553986
>     > > 90       67558.483043            68384.906949
>     > > 100     65960.982801            66997.793777
>     > > 200     62216.011998            62870.243385
>     > > 300     62924.225658            62796.157548
>     > > 400     62278.099704            63129.555135
>     > > 500     63257.930870            62188.825044
>     > > 600     61479.890611            61517.913967
>     > > 700     61139.354053            61327.898847
>     > > 800     60833.663791            61517.913967
>     > > 900     61305.129642            61248.336593
>     > > 1000   60990.918719            61041.670996
>     > >
>     >
>     > These results look much saner, but IMHO it also does not show any
>     clear
>     > benefit of the patch. Or are you still claiming there is a benefit?
>
>     They don't look all that sane to me - isn't that way lower than one
>     would
>     expect?
>
> Yes, quite disappointing.
>
>     Restricting both client and server to the same four cores, a
>     thermically challenged older laptop I have around I get 150k tps at
>     both 10
>     and 100 clients.
>
> Can you share the benchmark details? Hardware, postgres and pgbench,
> please?
>
>
>     Either way, I'd not expect to see any GetSnapshotData() scalability
>     effects to
>     show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
>     not enough
>     concurrency.
>
> This means that our customers will not see any connection scalability
> gains with PG15, using the simplest hardware?
>

No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.

I personally doubt it even makes sense to worry too much about scaling to
this many connections on such a tiny system.

>
>     The correct pieces of these changes seem very unlikely to affect
>     GetSnapshotData() performance meaningfully.
>
>     To improve something like GetSnapshotData() you first have to come
>     up with a
>     workload that shows it being a meaningful part of a profile. Unless
>     it is,
>     performance differences are going to just be due to various forms of
>     noise.
>
> Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.
>

But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?

You see, I've seen in several places that GetSnapshotData() is the bottleneck in scaling connections.
One of them, if I remember correctly, was at IBM in Russia.
Other such statements occur in [1][2][3].
Just because I don't have enough hardware to make GetSnapshotData() a bottleneck doesn't mean optimizing it won't make a difference.
And even on my modest hardware, we've seen gains, small but consistent.
So IMHO everyone will benefit, including the small servers.

regards,
Ranier Vilela




regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Daniel Gustafsson
Date:
> On 28 May 2022, at 16:12, Ranier Vilela <ranier.vf@gmail.com> wrote:

> Just because I don't have enough hardware to make GetSnapshotData() a bottleneck doesn't mean optimizing it
> won't make a difference.

Quoting Andres from upthread:

    "To improve something like GetSnapshotData() you first have to come up with
    a workload that shows it being a meaningful part of a profile.  Unless it
    is, performance differences are going to just be due to various forms of
    noise."

If you think this is a worthwhile improvement, you need to figure out a way to
reliably test it in order to prove it.

--
Daniel Gustafsson        https://vmware.com/




On 5/28/22 16:12, Ranier Vilela wrote:
> On Sat, May 28, 2022 at 09:00, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>     On 5/28/22 02:15, Ranier Vilela wrote:
>     >
>     >
>     > On Fri, May 27, 2022 at 18:08, Andres Freund
>     > <andres@anarazel.de> wrote:
>     >
>     >     Hi,
>     >
>     >     On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
>     >     > On 5/27/22 02:11, Ranier Vilela wrote:
>     >     > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U
>     postgres
>     >     > >
>     >     > > pgbench (15beta1)
>     >     > > transaction type: <builtin: select only>
>     >     > > scaling factor: 1
>     >     > > query mode: prepared
>     >     > > number of clients: 100
>     >     > > number of threads: 100
>     >     > > maximum number of tries: 1
>     >     > > duration: 60 s
>     >     > >
>     >     > > conns            tps head              tps patched
>     >     > >
>     >     > > 1         17126.326108            17792.414234
>     >     > > 10       82068.123383            82468.334836
>     >     > > 50       73808.731404            74678.839428
>     >     > > 80       73290.191713            73116.553986
>     >     > > 90       67558.483043            68384.906949
>     >     > > 100     65960.982801            66997.793777
>     >     > > 200     62216.011998            62870.243385
>     >     > > 300     62924.225658            62796.157548
>     >     > > 400     62278.099704            63129.555135
>     >     > > 500     63257.930870            62188.825044
>     >     > > 600     61479.890611            61517.913967
>     >     > > 700     61139.354053            61327.898847
>     >     > > 800     60833.663791            61517.913967
>     >     > > 900     61305.129642            61248.336593
>     >     > > 1000   60990.918719            61041.670996
>     >     > >
>     >     >
>     >     > These results look much saner, but IMHO it also does not
>     show any
>     >     clear
>     >     > benefit of the patch. Or are you still claiming there is a
>     benefit?
>     >
>     >     They don't look all that sane to me - isn't that way lower
>     than one
>     >     would
>     >     expect?
>     >
>     > Yes, quite disappointing.
>     >
>     >     Restricting both client and server to the same four cores, a
>     >     thermically challenged older laptop I have around I get 150k
>     tps at
>     >     both 10
>     >     and 100 clients.
>     >
>     > Can you share the benchmark details? Hardware, postgres and
>     pgbench,
>     > please?
>     >
>     >
>     >     Either way, I'd not expect to see any GetSnapshotData()
>     scalability
>     >     effects to
>     >     show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
>     >     not enough
>     >     concurrency.
>     >
>     > This means that our customers will not see any connection scalability
>     > gains with PG15, using the simplest hardware?
>     >
> 
>     No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
>     a bottleneck, because you'll hit various other bottlenecks way earlier.
> 
>     I personally doubt it even makes sense to worry too much about scaling to
>     this many connections on such a tiny system.
> 
> 
>     >
>     >     The correct pieces of these changes seem very unlikely to affect
>     >     GetSnapshotData() performance meaningfully.
>     >
>     >     To improve something like GetSnapshotData() you first have to come
>     >     up with a
>     >     workload that shows it being a meaningful part of a profile.
>     Unless
>     >     it is,
>     >     performance differences are going to just be due to various
>     forms of
>     >     noise.
>     >
>     > Actually, in the profiles I got with perf, GetSnapshotData() didn't
>     show up.
>     >
> 
>     But that's exactly the point Andres is trying to make - if you don't see
>     GetSnapshotData() in the perf profile, why do you think optimizing it
>     will have any meaningful impact on throughput?
> 
> You see, I've seen in several places that GetSnapshotData() is the
> bottleneck in scaling connections.
> One of them, if I remember correctly, was at IBM in Russia.
> Other such statements occur in [1][2][3].

No one is claiming GetSnapshotData() can't be a bottleneck on systems
with many cores. That's certainly possible, which is why e.g. Andres
spent a lot of time optimizing for that case.

But that's what we're arguing about. You're trying to convince us that
your patch will improve things, and you're supporting that by numbers
from a machine that is unlikely to be hitting this bottleneck.

> Just because I don't have enough hardware to make GetSnapshotData() a
> bottleneck doesn't mean optimizing it won't make a difference.

Well, the question is if it actually optimizes things. Maybe it does,
which would be great, but people in this thread (including me) seem to
be fairly skeptical about that claim, because the results are frankly
entirely unconvincing.

I doubt we'll just accept changes in such sensitive places without
results from a relevant machine. Maybe if there was a clear agreement
it's a win, but that's not the case here.


> And even on my modest hardware, we've seen gains, small but consistent.
> So IMHO everyone will benefit, including the small servers.
> 

No, we haven't seen any convincing gains. I've tried to explain multiple
times that the results you've shared are not showing any clear
improvement, due to only having one run for each client count (which
means there's a lot of noise), impact of binary layout in different
builds, etc. You've ignored all of that, so instead of repeating myself,
I did a simple benchmark on my two machines:

1) i5-2500k / 4 cores and 8GB RAM (so similar to what you have)

2) 2x e5-2620v3 / 16/32 cores, 64GB RAM (so somewhat bigger)

and I tested 1, 2, 5, 10, 50, 100, ..., 1000 clients using the same
benchmark as you (pgbench -S -M prepared ...). I did 10 runs for each
client count, to calculate the median, which evens out the noise. And
for fun I tried this with gcc 9.3, 10.3 and 11.2. The script and
results from both machines are attached.
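
For illustration, a driver along these lines would implement that
methodology - this is a hypothetical Python sketch, not the actual attached
script; the client counts, run count, duration and the tps parsing are
assumptions:

  # Hypothetical sketch of an "N runs per client count" pgbench driver.
  # Assumes pgbench is on PATH and the database was initialized beforehand
  # (e.g. with "pgbench -i"); the tps line is parsed from pgbench's output.
  import re
  import statistics
  import subprocess

  CLIENTS = [1, 2, 5, 10, 50, 100, 200, 500, 1000]
  RUNS = 10        # several runs per client count, so a median can be taken
  DURATION = 60    # seconds per run, long enough not to benchmark fork()

  def one_run(clients: int) -> float:
      out = subprocess.run(
          ["pgbench", "-n", "-S", "-M", "prepared",
           "-c", str(clients), "-j", str(clients),
           "-T", str(DURATION), "postgres"],
          capture_output=True, text=True, check=True).stdout
      return float(re.search(r"tps = ([0-9.]+)", out).group(1))

  for c in CLIENTS:
      tps = [one_run(c) for _ in range(RUNS)]
      print(c, statistics.median(tps), statistics.stdev(tps))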

The results from xeon and gcc 11.2 look like this:

    clients    master    patched     diff
   ---------------------------------------
          1     46460     44936       97%
          2     87486     84746       97%
          5    199102    192169       97%
         10    344458    339403       99%
         20    515257    512513       99%
         30    528675    525467       99%
         40    592761    594384      100%
         50    694635    706193      102%
        100    643950    655238      102%
        200    690133    696815      101%
        300    670403    677818      101%
        400    678573    681387      100%
        500    665349    678722      102%
        600    666028    670915      101%
        700    662316    662511      100%
        800    647922    654745      101%
        900    650274    654698      101%
       1000    644482    649332      101%

Please explain to me how this shows a consistent, measurable improvement.

The standard deviation is roughly 1.5% on average, and the difference is
well within that range. Even if there was a tiny improvement for the
high client counts, no one sane will run with that many clients, because
the throughput peaks at ~50 clients. So even if you gain 1% with 500
clients, it's still less than with 50 clients. If anything, this shows
a regression for lower client counts.

FWIW this entirely ignores the question of whether this benchmark even
hits the bottleneck this patch aims to improve. Also, there's the
question of correctness, and I'd bet Andres is right that getting a
snapshot without holding ProcArrayLock is busted.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Ranier Vilela
Date:
On Sat, May 28, 2022 at 09:35, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:36, Ranier Vilela wrote:
> On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de> wrote:
>
>     Hi,
>
>     On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
>     > On Thu, May 26, 2022 at 22:30, Tomas Vondra
>     > <tomas.vondra@enterprisedb.com> wrote:
>     >
>     > > On 5/27/22 02:11, Ranier Vilela wrote:
>     > > >
>     > > > ...
>     > > >
>     > > > Here the results with -T 60:
>     > >
>     > > Might be a good idea to share your analysis / interpretation of the
>     > > results, not just the raw data. After all, the change is being
>     proposed
>     > > by you, so do you think this shows the change is beneficial?
>     > >
>     > I think so, but the expectation has diminished.
>     > I expected that the more connections, the better the performance.
>     > And for both patch and head, this doesn't happen in tests.
>     > Performance degrades with a greater number of connections.
>
>     Your system has four CPUs. Once they're all busy, adding more
>     connections
>     won't improve performance. It'll just add more and more context
>     switching,
>     cache misses, and make the OS scheduler do more work.
>
> conns       tps head
> 10      82365.634750
> 50      74593.714180
> 80      69219.756038
> 90      67419.574189
> 100     66613.771701
> Yes it is quite disappointing that with 100 connections, tps loses to 10
> connections.
>  

IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.

This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use a machine
with more cores.

>
>     > GetSnapshotData() isn't a bottleneck?
>
>     I'd be surprised if it showed up in a profile on your machine with that
>     workload in any sort of meaningful way. The snapshot reuse logic
>     will always
>     work - because there are no writes - and thus the only work that
>     needs to be
>     done is to acquire the ProcArrayLock briefly.  And because there is
>     only a
>     small number of cores, contention on the cacheline for that isn't a
>     problem.
>
> Thanks for sharing this.
>  
>
>
>
>     > > These results look much saner, but IMHO it also does not show
>     any clear
>     > > benefit of the patch. Or are you still claiming there is a benefit?
>     > >
>     > We agree that they are micro-optimizations.  However, I think they
>     should be
>     > considered micro-optimizations in inner loops, because everything in
>     procarray.c is
>     > a hot path.
>
>     As explained earlier, I don't agree that they optimize anything - you're
>     making some of the scalability behaviour *worse*, if it's changed at
>     all.
>
>
>     > The first objective, I believe, was achieved, with no performance
>     > regression.
>     > I agree, the gains are small, by the tests done.
>
>     There are no gains.
>
> IMHO, I must disagree.
>  

You don't have to, really. What you should do is show results
demonstrating the claimed gains, and so far you have not done that.

I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is very noisy).

As mentioned, GetSnapshotData() is not even in the perf profile, so why would
the patch even make a difference?

You've also claimed it helps generate better code on older compilers,
but you've never supported that with any evidence.


Maybe there is an improvement - show us. Do a benchmark with more runs,
to average out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].

I redid the benchmark with a better machine:
Intel i7-10510U
RAM 8GB
SSD 512GB
Linux Ubuntu 64 bits

All files are attached, including the raw data of the results.
I did the calculations as requested.
But a quick average over the 10 benchmark runs resulted in about 10,000 tps more.
Not bad, for a simple patch, made entirely of micro-optimizations.

Results attached.
 
regards,
Ranier Vilela
Attachment

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Andres Freund
Date:
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> that getting a snapshot without holding ProcArrayLock is busted.

Unless there's some actual analysis of this by Ranier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.



On 5/29/22 19:26, Ranier Vilela wrote:
> ...
> I redid the benchmark with a better machine:
> Intel i7-10510U
> RAM 8GB
> SSD 512GB
> Linux Ubuntu 64 bits
> 
> All files are attached, including the raw data of the results.
> I did the calculations as requested.
> But a quick average over the 10 benchmark runs resulted in about 10,000 tps more.
> Not bad, for a simple patch, made entirely of micro-optimizations.
> 

I am a bit puzzled by the calculations.

It seems you somehow sum the differences for each run, and then average
that over all the runs. So, something like

  SELECT avg(delta_tps) FROM (
     SELECT run, SUM(patched_tps - master_tps) AS delta_tps
       FROM results GROUP BY run
  ) foo;

That's certainly an "unorthodox" way to evaluate the results, because it
mixes results for different client counts. That's certainly not what I
suggested, and it's a pretty useless view of the data, as it obfuscates
how throughput depends on the client count.

And no, the resulting 10k does not mean you've "gained" 10k tps anywhere
- none of the "diff" values is anywhere close to that value. If you
tested more client counts, you'd probably get a bigger difference.
Compared to the "sum(tps)" for each run, it's like a 0.8% difference. But
even that is entirely useless, due to mixing different client counts.

I'm sorry, but this is so silly it's hard to even explain why ...


What I meant is calculating the median for each client count, so for example
for the master branch you get these 10 values for 1 client:

  38820  39245  39773  39597  39301  39442  39379  39622  38909  38454

and if you calculate the median, you'll get 39340 (and stdev 411). The same
goes for the other client counts, etc. If you do that, you'll get this:

  clients    master   patched     diff
  ------------------------------------
        1     39340     40173    2.12%
       10    132462    134274    1.37%
       50    115669    116575    0.78%
      100     97931     98816    0.90%
      200     88912     89660    0.84%
      300     87879     88636    0.86%
      400     87721     88219    0.57%
      500     87267     88078    0.93%
      600     87317     87781    0.53%
      700     86907     87603    0.80%
      800     86852     87364    0.59%
      900     86578     87173    0.69%
     1000     86481     86969    0.56%
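
For illustration, the same aggregation in a few lines of Python (a
hypothetical sketch, not the actual calc sheet - the layout of "results"
is an assumption; note it reproduces the 39340 median and ~411 stdev
quoted above for the 1-client master sample):

  # Hypothetical sketch: per-client-count median and stdev of tps runs.
  import statistics

  results = {
      ("master", 1): [38820, 39245, 39773, 39597, 39301,
                      39442, 39379, 39622, 38909, 38454],
      # ... one entry per (branch, client count) ...
  }

  for (branch, clients), tps in sorted(results.items()):
      print(f"{branch:8} {clients:5d} "
            f"median={statistics.median(tps):.0f} "
            f"stdev={statistics.stdev(tps):.0f}")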

How exactly this improves scalability is completely unclear to me.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Ranier Vilela
Date:
On Sun, May 29, 2022 at 15:21, Andres Freund <andres@anarazel.de> wrote:
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> that getting a snapshot without holding ProcArrayLock is busted.

Unless there's some actual analysis of this by Ranier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.

Sorry, it was not my intention to ignore this important point.
Of course, any performance gain is good, but robustness comes first.

As soon as I have some time.

regards,
Ranier Vilela

Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Ranier Vilela
Date:
On Sun, May 29, 2022 at 17:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Sun, May 29, 2022 at 15:21, Andres Freund <andres@anarazel.de> wrote:
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> that getting a snapshot without holding ProcArrayLock is busted.

Unless there's some actual analysis of this by Ranier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.

Sorry, it was not my intention to ignore this important point.
Of course, any performance gain is good, but robustness comes first.

As soon as I have some time.

I redid the benchmarks, this time getting the snapshot while holding ProcArrayLock.

Average Results




Connections    tps head            tps patch           diff                   %
1              39196,3088985       39858,0207936        661,711895100008      101,69%
2              65050,8643819       65245,9852367        195,1208548           100,30%
5              91486,0298359       91862,9026528        376,872816899995      100,41%
10             131318,0774955      131547,1404573       229,062961799995      100,17%
50             116531,2334144      116687,0325522       155,799137800001      100,13%
100            98969,4650449       98808,6778717       -160,787173199991       99,84%
200            89514,5238649       89463,6196075        -50,904257400005       99,94%
300            88426,3612183       88457,2695151         30,9082968000002     100,03%
400            88078,1686912       88338,2859163        260,117225099995      100,30%
500            87791,1620039       88074,3418504        283,179846500003      100,32%
600            87552,3343394       87930,8645184        378,530178999994      100,43%
1000           86538,3772895       86771,1946099        232,817320400005      100,27%
avg            89204,4088731917    89420,444631825     1981,0816042           100,24%

For clients with 1 connection, the results are good.
But for clients with 100 and 200 connections, the results are not good.
I can't say why these two tests were so bad.
Because of the 100 and 200 results, I'm not sure whether this should go ahead,
or whether it's worth the effort.

Attached are the results files and the calc plan.

regards,
Ranier Vilela
Attachment
On 5/31/22 16:36, Ranier Vilela wrote:
> On Sun, May 29, 2022 at 17:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
> 
>     On Sun, May 29, 2022 at 15:21, Andres Freund <andres@anarazel.de> wrote:
> 
>         On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
>         > Also, there's the question of correctness, and I'd bet Andres
>         > is right that getting a snapshot without holding ProcArrayLock
>         > is busted.
> 
>         Unless there's some actual analysis of this by Ranier, I'm just
>         going to
>         ignore this thread going forward. It's pointless to invest time when
>         everything we say is just ignored.
> 
>     Sorry, it was not my intention to ignore this important point.
>     Of course, any performance gain is good, but robustness comes first.
> 
>     As soon as I have some time.
> 
> I redid the benchmarks, this time getting the snapshot while holding ProcArrayLock.
> 
> Average Results
> 
> Connections    tps head            tps patch           diff                   %
> 1              39196,3088985       39858,0207936        661,711895100008      101,69%
> 2              65050,8643819       65245,9852367        195,1208548           100,30%
> 5              91486,0298359       91862,9026528        376,872816899995      100,41%
> 10             131318,0774955      131547,1404573       229,062961799995      100,17%
> 50             116531,2334144      116687,0325522       155,799137800001      100,13%
> 100            98969,4650449       98808,6778717       -160,787173199991       99,84%
> 200            89514,5238649       89463,6196075        -50,904257400005       99,94%
> 300            88426,3612183       88457,2695151         30,9082968000002     100,03%
> 400            88078,1686912       88338,2859163        260,117225099995      100,30%
> 500            87791,1620039       88074,3418504        283,179846500003      100,32%
> 600            87552,3343394       87930,8645184        378,530178999994      100,43%
> 1000           86538,3772895       86771,1946099        232,817320400005      100,27%
> avg            89204,4088731917    89420,444631825     1981,0816042           100,24%
> 
>  
> For clients with 1 connection, the results are good.

Isn't that a bit strange, considering the aim of this patch was
scalability, which should improve higher client counts in the first place?

> But for clients with 100 and 200 connections, the results are not good.
> I can't say why these two tests were so bad.
> Because of the 100 and 200 results, I'm not sure whether this should go
> ahead, or whether it's worth the effort.
> 

I'd argue this is either just noise, and there's no actual difference.
This could be verified by some sort of statistical testing (e.g. the
well known t-test).
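
For illustration, such a test on two sets of per-run tps values could look
like this - a hypothetical Python sketch of Welch's t-test (which does not
assume equal variances); the sample numbers are made up:

  # Hypothetical sketch of the suggested t-test on per-run tps values.
  import math
  import statistics

  master  = [65960, 66213, 65801, 66402, 66110,
             65987, 66350, 66045, 65899, 66178]
  patched = [66997, 66512, 66803, 67105, 66650,
             66920, 67240, 66431, 66788, 67011]

  n1, n2 = len(master), len(patched)
  m1, m2 = statistics.mean(master), statistics.mean(patched)
  v1, v2 = statistics.variance(master), statistics.variance(patched)

  # Welch's t statistic and Welch-Satterthwaite degrees of freedom.
  t = (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)
  df = (v1 / n1 + v2 / n2) ** 2 / (
      (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
  print(f"t = {t:.2f}, df = {df:.1f}")

  # |t| above roughly 2.1 (df ~ 18) indicates a difference at the 5% level;
  # scipy.stats.ttest_ind(master, patched, equal_var=False) gives the exact
  # p-value.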

Another option is that this is simply due to differences in binary
layout - this can result in small differences (easily 1-2%) that are
completely unrelated to what the patch does. This is exactly what the
"stabilizer" talk I mentioned a couple days ago was about.

FWIW, when a patch improves scalability, the improvement usually
increases with the number of clients. So you'd see no/small improvement
for 10 clients, 100 clients would be improved more, 200 more, etc. We
see nothing like that here. So either the patch does not really improve
anything, or perhaps the benchmark doesn't even hit the bottleneck the
patch is meant to improve (which was already suggested in this thread
repeatedly).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Jacob Champion
Date:
On 5/31/22 11:44, Tomas Vondra wrote:
> I'd argue this is either just noise, and there's no actual difference.
> This could be verified by some sort of statistical testing (e.g. the
> well known t-test).

Given the conversation so far, I'll go ahead and mark this Returned with
Feedback. Specifically, this patch would need hard statistical proof
that it's having a positive effect.

--Jacob



Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)

From
Ranier Vilela
Date:
On Mon, Aug 1, 2022 at 18:51, Jacob Champion <jchampion@timescale.com> wrote:
On 5/31/22 11:44, Tomas Vondra wrote:
> I'd argue this is either just noise, and there's no actual difference.
> This could be verified by some sort of statistical testing (e.g. the
> well known t-test).

Given the conversation so far, I'll go ahead and mark this Returned with
Feedback. Specifically, this patch would need hard statistical proof
that it's having a positive effect.

Given the contrary opinions, the little proven benefit, and the lack of
enthusiasm for changing procarray.c, I believe it is best to reject it.

regards,
Ranier Vilela