Thread: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Hi hackers,
Inspired by Andres's work, I put some effort into trying to improve GetSnapshotData().
I think that I got something.
As is well known, GetSnapshotData() is a critical function for Postgres performance, as other benchmarks have already demonstrated.
So any gain, no matter how small, makes a difference.
So far, no regression has been observed.
Windows 10 64 bits (msvc 2019 64 bits)
pgbench -M prepared -c $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3027.550711
10 12262.415696 12876.641772
50 13656.724571 14877.410140
80 14338.202348 15244.192915
pgbench can't run with conns > 80.
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I was surprised that in my tests, with more than 100 connections,
tps drops, even on head.
I don't have access to a powerful machine to really test with a high workload.
But with fewer than 100 connections, it seems to me that there are clear gains.
Patch attached.
regards,
Ranier Vilela
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument.
I'd guess that most or maybe all of the performance gain that you've
observed here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock. That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
 * This very likely can be evolved to not need ProcArrayLock held (at very
 * least in the case we already hold a snapshot), but that's for another day.
However, those comment seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
--
Robert Haas
EDB: http://www.enterprisedb.com
Em ter., 24 de mai. de 2022 às 13:06, Robert Haas <robertmhaas@gmail.com> escreveu:
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
I believe that even these small changes are helpful and favorable.
They improve code readability and help the compiler generate better code,
especially on older compilers.
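For illustration, here is a minimal, self-contained example of the kind of change in question (hypothetical functions, not the actual XidInMVCCSnapshot() code); with a modern optimizing compiler the two forms typically produce identical machine code, which is exactly the point being debated:

#include <stdint.h>

/* Loop counter declared at function scope. */
int
sum_outer_scope(const uint32_t *a, uint32_t n)
{
	uint32_t	i;
	int			sum = 0;

	for (i = 0; i < n; i++)
		sum += a[i];
	return sum;
}

/* Loop counter declared in the narrowest (inner) scope. */
int
sum_inner_scope(const uint32_t *a, uint32_t n)
{
	int			sum = 0;

	for (uint32_t i = 0; i < n; i++)
		sum += a[i];
	return sum;
}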
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument.
OK, certainly my argument for the change has not been convincing so far.
I'd guess that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock.
It certainly helps, but I trust that's not the only reason, in all the tests I did, there was an improvement in performance, even before using this feature.
If you look closely at GetSnapShotData() you will see that GetSnapshotDataReuse is called for all snapshots, even the new ones, which is unnecessary.
Another example NormalTransactionIdPrecedes is more expensive than testing statusFlags.
That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
* This very likely can be evolved to not need ProcArrayLock held (at very
* least in the case we already hold a snapshot), but that's for another day.
However, those comment seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
I think it's worth trying and testing to see if everything goes well,
and then applying whatever comment updates are needed in the final patch.
regards,
Ranier Vilela
Hi,
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> Linux Ubuntu 64 bits (gcc 9.4)
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
>
> conns tps head tps patched
> 1 2918.004085 3190.810466
> 10 12262.415696 17199.862401
> 50 13656.724571 18278.194114
> 80 14338.202348 17955.336101
> 90 16597.510373 18269.660184
> 100 17706.775793 18349.650150
> 200 16877.067441 17881.250615
> 300 16942.260775 17181.441752
> 400 16794.514911 17124.533892
> 500 16598.502151 17181.244953
> 600 16717.935001 16961.130742
> 700 16651.204834 16959.172005
> 800 16467.546583 16834.591719
> 900 16588.241149 16693.902459
> 1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
What pgbench scale is this and are you using an optimized build?
Greetings,
Andres Freund
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
> It certainly helps, but I trust that's not the only reason, in all the
> tests I did, there was an improvement in performance, even before using
> this feature.
> If you look closely at GetSnapShotData() you will see that
> GetSnapshotDataReuse is called for all snapshots, even the new ones, which
> is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated. Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
> Another example NormalTransactionIdPrecedes is more expensive than testing
> statusFlags.
That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than the due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
Greetings,
Andres Freund
Em qua., 25 de mai. de 2022 às 00:46, Andres Freund <andres@anarazel.de> escreveu:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> Linux Ubuntu 64 bits (gcc 9.4)
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
>
> conns tps head tps patched
> 1 2918.004085 3190.810466
> 10 12262.415696 17199.862401
> 50 13656.724571 18278.194114
> 80 14338.202348 17955.336101
> 90 16597.510373 18269.660184
> 100 17706.775793 18349.650150
> 200 16877.067441 17881.250615
> 300 16942.260775 17181.441752
> 400 16794.514911 17124.533892
> 500 16598.502151 17181.244953
> 600 16717.935001 16961.130742
> 700 16651.204834 16959.172005
> 800 16467.546583 16834.591719
> 900 16588.241149 16693.902459
> 1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hits.
Yes, their 17-18k tps are disappointing.
What pgbench scale is this and are you using an optimized build?
Yes this optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2'
from config.log
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Can you share the pgbench configuration and shared_buffers for
this benchmark?
regards,
Ranier Vilela
Em qua., 25 de mai. de 2022 às 00:56, Andres Freund <andres@anarazel.de> escreveu:
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
> It certainly helps, but I trust that's not the only reason, in all the
> tests I did, there was an improvement in performance, even before using
> this feature.
> If you look closely at GetSnapShotData() you will see that
> GetSnapshotDataReuse is called for all snapshots, even the new ones, which
> is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated.
Yes, but now this does not happen with new snapshots.
Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.
IMHO, with GetSnapshotData() any gain makes a difference.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
I believe it's worth the effort to make sure everything goes well and use this feature.
> Another example NormalTransactionIdPrecedes is more expensive than testing
> statusFlags.
That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than the due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
I believe that keeping the instructions in the cache together works better than having the status flags test in the middle.
But I will test this to be sure.
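For reference, a simplified, self-contained sketch of the two orderings under discussion (ProcEntrySketch and SKETCH_SKIP_FLAGS are made-up illustration names, not the real procarray.c structures); both loops produce the same result, the question is only which test should run first on the hot path:

#include <stdint.h>

typedef struct ProcEntrySketch
{
	uint32_t	xid;			/* stand-in for the backend's xid */
	uint8_t		statusFlags;	/* stand-in for PROC_IN_VACUUM etc. */
} ProcEntrySketch;

#define SKETCH_SKIP_FLAGS 0x03	/* made-up "vacuum/logical decoding" bits */

/* head-style ordering: xid-vs-xmax filter first, flags only for survivors */
int
count_xid_first(const ProcEntrySketch *ents, int n, uint32_t xmax)
{
	int			count = 0;

	for (int i = 0; i < n; i++)
	{
		if (ents[i].xid >= xmax)	/* Andres: this filters the vast majority */
			continue;
		if (ents[i].statusFlags & SKETCH_SKIP_FLAGS)
			continue;
		count++;
	}
	return count;
}

/* patch-style ordering: cheap flags test first, xid comparison second */
int
count_flags_first(const ProcEntrySketch *ents, int n, uint32_t xmax)
{
	int			count = 0;

	for (int i = 0; i < n; i++)
	{
		/* fewer instructions, but in the real code this touches a contended cache line */
		if (ents[i].statusFlags & SKETCH_SKIP_FLAGS)
			continue;
		if (ents[i].xid >= xmax)
			continue;
		count++;
	}
	return count;
}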
regards,
Ranier Vilela
On 5/25/22 11:07, Ranier Vilela wrote:
> Em qua., 25 de mai. de 2022 às 00:46, Andres Freund <andres@anarazel.de <mailto:andres@anarazel.de>> escreveu:
>
> Hi Andres, thank you for taking a look.
>
> On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> > Linux Ubuntu 64 bits
> > ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
> >
> > conns tps head tps patched
> > 1 2918.004085 3190.810466
> > 10 12262.415696 17199.862401
> > 50 13656.724571 18278.194114
> > 80 14338.202348 17955.336101
> > 90 16597.510373 18269.660184
> > 100 17706.775793 18349.650150
> > 200 16877.067441 17881.250615
> > 300 16942.260775 17181.441752
> > 400 16794.514911 17124.533892
> > 500 16598.502151 17181.244953
> > 600 16717.935001 16961.130742
> > 700 16651.204834 16959.172005
> > 800 16467.546583 16834.591719
> > 900 16588.241149 16693.902459
> > 1000 16564.985265 16936.952195
>
> 17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I can get 40k in a single connection in an optimized build. If you're testing a workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And testing an assert build isn't a meaningful exercise either, unless you have way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
>
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
>
> What pgbench scale is this and are you using an optimized build?
>
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2'
> from config.log
That can still be assert-enabled build. We need to see configure flags.
> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
>
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
>
> The shared_buffers is default:
> shared_buffers = 128MB
>
> Intel® Core™ i5-8250U CPU Quad Core
> RAM 8GB
> SSD 256 GB
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Em qua., 25 de mai. de 2022 às 07:13, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 5/25/22 11:07, Ranier Vilela wrote:
> Em qua., 25 de mai. de 2022 às 00:46, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> escreveu:
>
> Hi Andres, thank you for taking a look.
>
>
> On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> > Linux Ubuntu 64 bits (gcc 9.4)
> > ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
> >
> > conns tps head tps patched
> > 1 2918.004085 3190.810466
> > 10 12262.415696 17199.862401
> > 50 13656.724571 18278.194114
> > 80 14338.202348 17955.336101
> > 90 16597.510373 18269.660184
> > 100 17706.775793 18349.650150
> > 200 16877.067441 17881.250615
> > 300 16942.260775 17181.441752
> > 400 16794.514911 17124.533892
> > 500 16598.502151 17181.244953
> > 600 16717.935001 16961.130742
> > 700 16651.204834 16959.172005
> > 800 16467.546583 16834.591719
> > 900 16588.241149 16693.902459
> > 1000 16564.985265 16936.952195
>
> 17-18k tps is pretty low for pgbench -S. For a shared_buffers
> resident run, I
> can get 40k in a single connection in an optimized build. If you're
> testing a
> workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
> testing an assert build isn't a meaningful exercise either, unless
> you have
> way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
>
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
>
>
> What pgbench scale is this and are you using an optimized build?
>
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2'
> from config.log
>
That can still be assert-enabled build. We need to see configure flags.
./configure
Attached the config.log (compressed)
> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
>
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
>
> The shared_buffers is default:
> shared_buffers = 128MB
>
> Intel® Core™ i5-8250U CPU Quad Core
> RAM 8GB
> SSD 256 GB
>
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
regards,
Ranier Vilela
Em qua., 25 de mai. de 2022 às 08:26, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em qua., 25 de mai. de 2022 às 07:13, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 5/25/22 11:07, Ranier Vilela wrote:
> Em qua., 25 de mai. de 2022 às 00:46, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> escreveu:
>
> Hi Andres, thank you for taking a look.
>
>
> On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
> > Linux Ubuntu 64 bits (gcc 9.4)
> > ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
> >
> > conns tps head tps patched
> > 1 2918.004085 3190.810466
> > 10 12262.415696 17199.862401
> > 50 13656.724571 18278.194114
> > 80 14338.202348 17955.336101
> > 90 16597.510373 18269.660184
> > 100 17706.775793 18349.650150
> > 200 16877.067441 17881.250615
> > 300 16942.260775 17181.441752
> > 400 16794.514911 17124.533892
> > 500 16598.502151 17181.244953
> > 600 16717.935001 16961.130742
> > 700 16651.204834 16959.172005
> > 800 16467.546583 16834.591719
> > 900 16588.241149 16693.902459
> > 1000 16564.985265 16936.952195
>
> 17-18k tps is pretty low for pgbench -S. For a shared_buffers
> resident run, I
> can get 40k in a single connection in an optimized build. If you're
> testing a
> workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
> testing an assert build isn't a meaningful exercise either, unless
> you have
> way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
>
> Thanks for sharing these hits.
> Yes, their 17-18k tps are disappointing.
>
>
> What pgbench scale is this and are you using an optimized build?
>
> Yes this optimized build.
> CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2'
> from config.log
>
That can still be assert-enabled build. We need to see configure flags.
./configure
Attached the config.log (compressed)
> pgbench was initialized with:
> pgbench -i -p 5432 -d postgres
>
> pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312
regards,
Ranier Vilela
On 5/27/22 02:11, Ranier Vilela wrote:
>
> ...
>
> Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the results, not just the raw data. After all, the change is being proposed by you, so do you think this shows the change is beneficial?
> Linux Ubuntu 64 bits
> shared_buffers = 128MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns tps head tps patched
> 1 17126.326108 17792.414234
> 10 82068.123383 82468.334836
> 50 73808.731404 74678.839428
> 80 73290.191713 73116.553986
> 90 67558.483043 68384.906949
> 100 65960.982801 66997.793777
> 200 62216.011998 62870.243385
> 300 62924.225658 62796.157548
> 400 62278.099704 63129.555135
> 500 63257.930870 62188.825044
> 600 61479.890611 61517.913967
> 700 61139.354053 61327.898847
> 800 60833.663791 61517.913967
> 900 61305.129642 61248.336593
> 1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear benefit of the patch. Or are you still claiming there is a benefit?
BTW it's generally a good idea to do multiple runs and then use the average and/or median. Results from a single may be quite noisy.
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> number of transactions per client: 10
>
> conns tps head tps patched
> 1 2918.004085 3211.303789
> 10 12262.415696 15540.015540
> 50 13656.724571 16701.182444
> 80 14338.202348 16628.559551
> 90 16597.510373 16835.016835
> 100 17706.775793 16607.433487
> 200 16877.067441 16426.969799
> 300 16942.260775 16319.780662
> 400 16794.514911 16155.023607
> 500 16598.502151 16051.106724
> 600 16717.935001 16007.171213
> 700 16651.204834 16004.353184
> 800 16467.546583 16834.591719
> 900 16588.241149 16693.902459
> 1000 16564.985265 16936.952195
I think we've agreed these results are useless.
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns tps head tps patched
> 1 17174.265804 17792.414234
> 10 82365.634750 82468.334836
> 50 74593.714180 74678.839428
> 80 69219.756038 73116.553986
> 90 67419.574189 68384.906949
> 100 66613.771701 66997.793777
> 200 61739.784830 62870.243385
> 300 62109.691298 62796.157548
> 400 61630.822446 63129.555135
> 500 61711.019964 62755.190389
> 600 60620.010181 61517.913967
> 700 60303.317736 61688.044232
> 800 60451.113573 61076.666572
> 900 60017.327157 61256.290037
> 1000 60088.823434 60986.799312
I have no idea why shared buffers 2GB would be interesting. The proposed change was related to procarray, not shared buffers. And scale 1 is ~15MB of data, so it fits into 128MB just fine.
Also, the first ~10 results for "patched" case match results for 128MB shared buffers. That seems very unlikely to happen by chance, so this seems rather suspicious.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 5/27/22 02:11, Ranier Vilela wrote:
>
> ...
>
> Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
GetSnapshotData() isn't a bottleneck?
> Linux Ubuntu 64 bits
> shared_buffers = 128MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns tps head tps patched
>
> 1 17126.326108 17792.414234
> 10 82068.123383 82468.334836
> 50 73808.731404 74678.839428
> 80 73290.191713 73116.553986
> 90 67558.483043 68384.906949
> 100 65960.982801 66997.793777
> 200 62216.011998 62870.243385
> 300 62924.225658 62796.157548
> 400 62278.099704 63129.555135
> 500 63257.930870 62188.825044
> 600 61479.890611 61517.913967
> 700 61139.354053 61327.898847
> 800 60833.663791 61517.913967
> 900 61305.129642 61248.336593
> 1000 60990.918719 61041.670996
>
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations.
However, I think they should be considered micro-optimizations in inner loops,
because all in procarray.c is a hotpath.
The first objective, I believe, was achieved, with no performance regression.
I agree, the gains are small, by the tests done.
But, IMHO, this is a good way, small gains turn into big gains in the end, when applied to all code.
Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true, and GetSnapshotDataReuse() is no longer called for new
snapshots (see the sketch after point 3 below).
count, subcount and suboverflowed will not be initialized for all snapshots.
2. If the snapshot is taken during recovery:
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
Only if it is not suboverflowed.
Skipping all InvalidTransactionId entries, mypgxactoff, backends doing logical decoding,
and XIDs >= xmax.
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
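A rough sketch of the entry-path difference described in points 1 and 3 (schematic only; it assumes the procarray.c context, where Snapshot, ProcArrayLock, LWLockAcquire/LWLockRelease and GetSnapshotDataReuse() are the existing PostgreSQL symbols, and whether calling GetSnapshotDataReuse() without the lock is actually safe is exactly the open question):

/* head: reuse is attempted for every caller, under ProcArrayLock */
static Snapshot
get_snapshot_head_style(Snapshot snapshot)
{
	LWLockAcquire(ProcArrayLock, LW_SHARED);
	if (GetSnapshotDataReuse(snapshot))
	{
		LWLockRelease(ProcArrayLock);
		return snapshot;
	}
	/* ... scan the proc array and build the snapshot ... */
	LWLockRelease(ProcArrayLock);
	return snapshot;
}

/* patch: reuse attempted lock-free, and only for already-populated snapshots */
static Snapshot
get_snapshot_patched_style(Snapshot snapshot)
{
	if (snapshot->xip != NULL &&
		GetSnapshotDataReuse(snapshot))
		return snapshot;

	LWLockAcquire(ProcArrayLock, LW_SHARED);
	/* ... scan the proc array and build the snapshot ... */
	LWLockRelease(ProcArrayLock);
	return snapshot;
}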
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
2. allStatusFlags[index] is not touched for all numProcs.
All changes were made with the aim of avoiding or postponing unnecessary work; a simplified sketch of the ComputeXidHorizons() points follows.
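To illustrate points 1 and 2 above, here is a simplified, self-contained sketch (ProcGlobalSketch and its fields are hypothetical illustration names, not the real ComputeXidHorizons() code): the flags array pointer is loaded once up front, and each backend's flags byte is read only when the cheaper xid test does not already reject the entry. Whether this actually beats what the compiler generates on its own is what a profile would have to show.

#include <stdint.h>

typedef struct ProcGlobalSketch
{
	uint32_t   *xids;			/* stand-in for the dense xid array */
	uint8_t	   *statusFlags;	/* stand-in for ProcGlobal->statusFlags */
	int			numProcs;
} ProcGlobalSketch;

#define SKETCH_SKIP_FLAGS 0x03	/* made-up "vacuum/logical decoding" bits */

/* Returns the smallest xid below 'horizon' among entries not filtered out. */
uint32_t
min_xid_sketch(const ProcGlobalSketch *pg, uint32_t horizon)
{
	/* point 1: the pointer is read once up front (in the real code, before taking the lock) */
	const uint8_t *allStatusFlags = pg->statusFlags;
	uint32_t	result = horizon;

	for (int i = 0; i < pg->numProcs; i++)
	{
		uint32_t	xid = pg->xids[i];

		if (xid >= horizon)
			continue;			/* cheap filter first ... */

		/* point 2: the flags byte is only loaded for surviving entries */
		if (allStatusFlags[i] & SKETCH_SKIP_FLAGS)
			continue;

		if (xid < result)
			result = xid;
	}
	return result;
}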
BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single may be quite noisy.
>
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> number of transactions per client: 10
>
> conns tps head tps patched
>
> 1 2918.004085 3211.303789
> 10 12262.415696 15540.015540
> 50 13656.724571 16701.182444
> 80 14338.202348 16628.559551
> 90 16597.510373 16835.016835
> 100 17706.775793 16607.433487
> 200 16877.067441 16426.969799
> 300 16942.260775 16319.780662
> 400 16794.514911 16155.023607
> 500 16598.502151 16051.106724
> 600 16717.935001 16007.171213
> 700 16651.204834 16004.353184
> 800 16467.546583 16834.591719
> 900 16588.241149 16693.902459
> 1000 16564.985265 16936.952195
>
I think we've agreed these results are useless.
>
> Linux Ubuntu 64 bits
> shared_buffers = 2048MB
>
> ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>
> pgbench (15beta1)
> transaction type: <builtin: select only>
> scaling factor: 1
> query mode: prepared
> number of clients: 100
> number of threads: 100
> maximum number of tries: 1
> duration: 60 s
>
> conns tps head tps patched
>
> 1 17174.265804 17792.414234
> 10 82365.634750 82468.334836
> 50 74593.714180 74678.839428
> 80 69219.756038 73116.553986
> 90 67419.574189 68384.906949
> 100 66613.771701 66997.793777
> 200 61739.784830 62870.243385
> 300 62109.691298 62796.157548
> 400 61630.822446 63129.555135
> 500 61711.019964 62755.190389
> 600 60620.010181 61517.913967
> 700 60303.317736 61688.044232
> 800 60451.113573 61076.666572
> 900 60017.327157 61256.290037
> 1000 60088.823434 60986.799312
>
I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
I thought about running this benchmark in the most common usage situation (shared_buffers at 25% of RAM).
Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
Probably a copy-and-paste mistake.
I redid this test for the patched build:
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17524.482668
10 82365.634750 81840.537713
50 74593.714180 74806.729434
80 69219.756038 73116.553986
90 67419.574189 69130.749209
100 66613.771701 67478.234595
200 61739.784830 63094.202413
300 62109.691298 62984.501251
400 61630.822446 63243.232816
500 61711.019964 62827.977636
600 60620.010181 62838.051693
700 60303.317736 61594.629618
800 60451.113573 61208.629058
900 60017.327157 61171.001256
1000 60088.823434 60558.067810
regards,
Ranier Vilela
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> On 5/27/22 02:11, Ranier Vilela wrote:
> > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> >
> > pgbench (15beta1)
> > transaction type: <builtin: select only>
> > scaling factor: 1
> > query mode: prepared
> > number of clients: 100
> > number of threads: 100
> > maximum number of tries: 1
> > duration: 60 s
> >
> > conns tps head tps patched
> > 1 17126.326108 17792.414234
> > 10 82068.123383 82468.334836
> > 50 73808.731404 74678.839428
> > 80 73290.191713 73116.553986
> > 90 67558.483043 68384.906949
> > 100 65960.982801 66997.793777
> > 200 62216.011998 62870.243385
> > 300 62924.225658 62796.157548
> > 400 62278.099704 63129.555135
> > 500 63257.930870 62188.825044
> > 600 61479.890611 61517.913967
> > 700 61139.354053 61327.898847
> > 800 60833.663791 61517.913967
> > 900 61305.129642 61248.336593
> > 1000 60990.918719 61041.670996
> >
> These results look much saner, but IMHO it also does not show any clear benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would expect? Restricting both client and server to the same four cores, a thermically challenged older laptop I have around I get 150k tps at both 10 and 100 clients.
Either way, I'd not expect to see any GetSnapshotData() scalability effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough concurrency. The correct pieces of these changes seem very unlikely to affect GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a workload that shows it being a meaningful part of a profile. Unless it is, performance differences are going to just be due to various forms of noise.
Greetings,
Andres Freund
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
>
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > >
> > > ...
> > >
> > > Here the results with -T 60:
> >
> > Might be a good idea to share your analysis / interpretation of the results, not just the raw data. After all, the change is being proposed by you, so do you think this shows the change is beneficial?
> >
> I think so, but the expectation has diminished.
> I expected that the more connections, the better the performance.
> And for both patch and head, this doesn't happen in tests.
> Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections won't improve performance. It'll just add more and more context switching, cache misses, and make the OS scheduler do more work.
> GetSnapShowData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that workload in any sort of meaningful way. The snapshot reuse logic will always work - because there are no writes - and thus the only work that needs to be done is to acquire the ProcArrayLock briefly. And because there is only a small number of cores, contention on the cacheline for that isn't a problem.
> > These results look much saner, but IMHO it also does not show any clear benefit of the patch. Or are you still claiming there is a benefit?
> >
> We agree that they are micro-optimizations. However, I think they should be considered micro-optimizations in inner loops, because all in procarray.c is a hotpath.
As explained earlier, I don't agree that they optimize anything - you're making some of the scalability behaviour *worse*, if it's changed at all.
> The first objective, I believe, was achieved, with no performance regression.
> I agree, the gains are small, by the tests done.
There are no gains.
> But, IMHO, this is a good way, small gains turn into big gains in the end, when applied to all code.
>
> Consider GetSnapShotData()
> 1. Most of the time the snapshot is not null, so:
> if (snaphost == NULL), will fail most of the time.
>
> With the patch:
> if (snapshot->xip != NULL)
> {
> if (GetSnapshotDataReuse(snapshot))
> return snapshot;
> }
>
> Most of the time the test is true and GetSnapshotDataReuse is not called for new snapshots.
> count, subcount and suboverflowed, will not be initialized, for all snapshots.
But that's irrelevant. There's only a few "new" snapshots in the life of a connection. You're optimizing something irrelevant.
> 2. If snapshot is taken during recoverys
> The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
> 3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
> There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
> Consider ComputeXidHorizons()
> 1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
> 2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Greetings,
Andres Freund
Em sex., 27 de mai. de 2022 às 18:08, Andres Freund <andres@anarazel.de> escreveu:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> On 5/27/22 02:11, Ranier Vilela wrote:
> > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> >
> > pgbench (15beta1)
> > transaction type: <builtin: select only>
> > scaling factor: 1
> > query mode: prepared
> > number of clients: 100
> > number of threads: 100
> > maximum number of tries: 1
> > duration: 60 s
> >
> > conns tps head tps patched
> >
> > 1 17126.326108 17792.414234
> > 10 82068.123383 82468.334836
> > 50 73808.731404 74678.839428
> > 80 73290.191713 73116.553986
> > 90 67558.483043 68384.906949
> > 100 65960.982801 66997.793777
> > 200 62216.011998 62870.243385
> > 300 62924.225658 62796.157548
> > 400 62278.099704 63129.555135
> > 500 63257.930870 62188.825044
> > 600 61479.890611 61517.913967
> > 700 61139.354053 61327.898847
> > 800 60833.663791 61517.913967
> > 900 61305.129642 61248.336593
> > 1000 60990.918719 61041.670996
> >
>
> These results look much saner, but IMHO it also does not show any clear
> benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would
expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at both 10
and 100 clients.
Can you share the benchmark details? Hardware, postgres and pgbench settings, please?
Either way, I'd not expect to see any GetSnapshotData() scalability effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough
concurrency.
Does this mean that our customers will not see any connection scalability gains with PG15 on modest hardware?
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.
regards,
Ranier Vilela
Em sex., 27 de mai. de 2022 às 18:22, Andres Freund <andres@anarazel.de> escreveu:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra <
> tomas.vondra@enterprisedb.com> escreveu:
>
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > >
> > > ...
> > >
> > > Here the results with -T 60:
> >
> > Might be a good idea to share your analysis / interpretation of the
> > results, not just the raw data. After all, the change is being proposed
> > by you, so do you think this shows the change is beneficial?
> >
> I think so, but the expectation has diminished.
> I expected that the more connections, the better the performance.
> And for both patch and head, this doesn't happen in tests.
> Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes it is quite disappointing that with 100 connections, tps loses to 10 connections.
> GetSnapShowData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
Thanks for sharing this.
> > These results look much saner, but IMHO it also does not show any clear
> > benefit of the patch. Or are you still claiming there is a benefit?
> >
> We agree that they are micro-optimizations. However, I think they should be
> considered micro-optimizations in inner loops, because all in procarray.c is
> a hotpath.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
> The first objective, I believe, was achieved, with no performance
> regression.
> I agree, the gains are small, by the tests done.
There are no gains.
IMHO, I must disagree.
> But, IMHO, this is a good way, small gains turn into big gains in the end,
> when applied to all code.
>
> Consider GetSnapShotData()
> 1. Most of the time the snapshot is not null, so:
> if (snaphost == NULL), will fail most of the time.
>
> With the patch:
> if (snapshot->xip != NULL)
> {
> if (GetSnapshotDataReuse(snapshot))
> return snapshot;
> }
>
> Most of the time the test is true and GetSnapshotDataReuse is not called
> for new
> snapshots.
> count, subcount and suboverflowed, will not be initialized, for all
> snapshots.
But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapshotData() is the bottleneck, everything is relevant.
> 2. If snapshot is taken during recoverys
> The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is reached *only* when in recovery.
> 3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
> There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.
> Consider ComputeXidHorizons()
> 1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
IMHO, anything you take out of the lock is a benefit.
> 2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?
regards,
Ranier Vilela
On 5/28/22 02:15, Ranier Vilela wrote:
>
> Em sex., 27 de mai. de 2022 às 18:08, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> escreveu:
>
> Hi,
>
> On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> > >
> > > pgbench (15beta1)
> > > transaction type: <builtin: select only>
> > > scaling factor: 1
> > > query mode: prepared
> > > number of clients: 100
> > > number of threads: 100
> > > maximum number of tries: 1
> > > duration: 60 s
> > >
> > > conns tps head tps patched
> > > 1 17126.326108 17792.414234
> > > 10 82068.123383 82468.334836
> > > 50 73808.731404 74678.839428
> > > 80 73290.191713 73116.553986
> > > 90 67558.483043 68384.906949
> > > 100 65960.982801 66997.793777
> > > 200 62216.011998 62870.243385
> > > 300 62924.225658 62796.157548
> > > 400 62278.099704 63129.555135
> > > 500 63257.930870 62188.825044
> > > 600 61479.890611 61517.913967
> > > 700 61139.354053 61327.898847
> > > 800 60833.663791 61517.913967
> > > 900 61305.129642 61248.336593
> > > 1000 60990.918719 61041.670996
> >
> > These results look much saner, but IMHO it also does not show any clear benefit of the patch. Or are you still claiming there is a benefit?
>
> They don't look all that sane to me - isn't that way lower than one would expect?
>
> Yes, quite disappointing.
>
> Restricting both client and server to the same four cores, a thermically challenged older laptop I have around I get 150k tps at both 10 and 100 clients.
>
> And you can share the benchmark details? Hardware, postgres and pgbench, please?
>
> Either way, I'd not expect to see any GetSnapshotData() scalability effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough concurrency.
>
> This means that our customers will not see any connections scalability with PG15, using the simplest hardware?
>
No. It means that on 4-core machine GetSnapshotData() is unlikely to be a
bottleneck, because you'll hit various other bottlenecks way earlier. I
personally doubt it even makes sense to worry about scaling to this many
connections on such tiny system too much.
> The correct pieces of these changes seem very unlikely to affect GetSnapshotData() performance meaningfully.
>
> To improve something like GetSnapshotData() you first have to come up with a workload that shows it being a meaningful part of a profile. Unless it is, performance differences are going to just be due to various forms of noise.
>
> Actually in the profiles I got with perf, GetSnapShotData() didn't show up.
>
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/28/22 02:36, Ranier Vilela wrote:
> Em sex., 27 de mai. de 2022 às 18:22, Andres Freund <andres@anarazel.de> escreveu:
>
>     Hi,
>
>     On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
>     > Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra
>     > <tomas.vondra@enterprisedb.com> escreveu:
>     >
>     > > On 5/27/22 02:11, Ranier Vilela wrote:
>     > > >
>     > > > ...
>     > > >
>     > > > Here the results with -T 60:
>     > >
>     > > Might be a good idea to share your analysis / interpretation of the
>     > > results, not just the raw data. After all, the change is being proposed
>     > > by you, so do you think this shows the change is beneficial?
>     > >
>     > I think so, but the expectation has diminished.
>     > I expected that the more connections, the better the performance.
>     > And for both patch and head, this doesn't happen in tests.
>     > Performance degrades with a greater number of connections.
>
>     Your system has four CPUs. Once they're all busy, adding more connections
>     won't improve performance. It'll just add more and more context switching,
>     cache misses, and make the OS scheduler do more work.
>
> conns    tps head
> 10       82365.634750
> 50       74593.714180
> 80       69219.756038
> 90       67419.574189
> 100      66613.771701
>
> Yes it is quite disappointing that with 100 connections, tps loses to 10
> connections.
>

IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.

This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use a machine
with more cores.

>     > GetSnapShowData() isn't a bottleneck?
>
>     I'd be surprised if it showed up in a profile on your machine with that
>     workload in any sort of meaningful way. The snapshot reuse logic will
>     always work - because there are no writes - and thus the only work that
>     needs to be done is to acquire the ProcArrayLock briefly. And because
>     there is only a small number of cores, contention on the cacheline for
>     that isn't a problem.
>
> Thanks for sharing this.
>
>     > > These results look much saner, but IMHO it also does not show any clear
>     > > benefit of the patch. Or are you still claiming there is a benefit?
>     > >
>     > We agree that they are micro-optimizations. However, I think they should
>     > be considered micro-optimizations in inner loops, because all in
>     > procarray.c is a hotpath.
>
>     As explained earlier, I don't agree that they optimize anything - you're
>     making some of the scalability behaviour *worse*, if it's changed at all.
>
>     > The first objective, I believe, was achieved, with no performance
>     > regression.
>     > I agree, the gains are small, by the tests done.
>
>     There are no gains.
>
> IMHO, I must disagree.
>

You don't have to, really. What you should do is showing results
demonstrating the claimed gains, and so far you have not done that.

I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is very noisy).

As mentioned, GetSnapshotData() is not even in the perf profile, so why
would the patch even make a difference?

You've also claimed it helps generating better code on older compilers,
but you've never supported that with any evidence.

Maybe there is an improvement - show us. Do a benchmark with more runs,
to average-out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].

[1] https://www.youtube.com/watch?v=r-TLSBdHe1A

>     > But, IMHO, this is a good way, small gains turn into big gains in the
>     > end, when applied to all code.
>     >
>     > Consider GetSnapShotData()
>     > 1. Most of the time the snapshot is not null, so:
>     > if (snapshot == NULL), will fail most of the time.
>     >
>     > With the patch:
>     > if (snapshot->xip != NULL)
>     > {
>     >     if (GetSnapshotDataReuse(snapshot))
>     >         return snapshot;
>     > }
>     >
>     > Most of the time the test is true and GetSnapshotDataReuse is not
>     > called for new snapshots.
>     > count, subcount and suboverflowed, will not be initialized, for all
>     > snapshots.
>
>     But that's irrelevant. There's only a few "new" snapshots in the life of
>     a connection. You're optimizing something irrelevant.
>
> IMHO, when GetSnapShotData() is the bottleneck, all is relevant.
>

Maybe. Show us the difference.

>     > 2. If snapshot is taken during recovery
>     > The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
>
>     That code isn't reached when in recovery?
>
> Currently it is reached *even* when not in recovery.
> With the patch, it is reached *only* when in recovery.
>
>     > 3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
>     > There's an agreement that this would be fine, for now.
>
>     There's no such agreement at all. It's not correct.
>
> Ok, but there is a chance it will work correctly.
>

Either it's correct or not. Chance of being correct does not count.

>     > Consider ComputeXidHorizons()
>     > 1. ProcGlobal->statusFlags is touched before the lock.
>
>     Hard to believe that'd have a measurable effect.
>
> IMHO, anything you take out of the lock is a benefit.
>

Maybe. Show us the difference.

>     > 2. allStatusFlags[index] is not touched for all numProcs.
>
>     I'd be surprised if the compiler couldn't defer that load on its own.
>
> Better be sure of that, no?
>

We rely on compilers doing this in about a million other places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
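For readers following the locking argument, this is roughly how the reuse fast
path is ordered on master - a simplified sketch only, not the verbatim
procarray.c code, with allocation, error handling and the actual
snapshot-building work elided:

/*
 * Simplified sketch of GetSnapshotData() on master: the reuse check runs
 * only after ProcArrayLock has been taken in shared mode.
 */
Snapshot
GetSnapshotData(Snapshot snapshot)
{
	/* ... allocate snapshot->xip / snapshot->subxip on first use ... */

	LWLockAcquire(ProcArrayLock, LW_SHARED);

	/* Reuse the previous snapshot if nothing relevant changed meanwhile. */
	if (GetSnapshotDataReuse(snapshot))
	{
		LWLockRelease(ProcArrayLock);
		return snapshot;
	}

	/* ... scan the proc array and build a fresh snapshot, still locked ... */

	LWLockRelease(ProcArrayLock);

	/* ... bookkeeping done after releasing the lock ... */

	return snapshot;
}

The proposed patch effectively performs the GetSnapshotDataReuse() check
before the LWLockAcquire() call, which is exactly the ordering whose
correctness is questioned above.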
Em sáb., 28 de mai. de 2022 às 09:00, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 5/28/22 02:15, Ranier Vilela wrote:
>
>
> Em sex., 27 de mai. de 2022 às 18:08, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> escreveu:
>
> Hi,
>
> On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> > On 5/27/22 02:11, Ranier Vilela wrote:
> > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> > >
> > > pgbench (15beta1)
> > > transaction type: <builtin: select only>
> > > scaling factor: 1
> > > query mode: prepared
> > > number of clients: 100
> > > number of threads: 100
> > > maximum number of tries: 1
> > > duration: 60 s
> > >
> > > conns tps head tps patched
> > >
> > > 1 17126.326108 17792.414234
> > > 10 82068.123383 82468.334836
> > > 50 73808.731404 74678.839428
> > > 80 73290.191713 73116.553986
> > > 90 67558.483043 68384.906949
> > > 100 65960.982801 66997.793777
> > > 200 62216.011998 62870.243385
> > > 300 62924.225658 62796.157548
> > > 400 62278.099704 63129.555135
> > > 500 63257.930870 62188.825044
> > > 600 61479.890611 61517.913967
> > > 700 61139.354053 61327.898847
> > > 800 60833.663791 61517.913967
> > > 900 61305.129642 61248.336593
> > > 1000 60990.918719 61041.670996
> > >
> >
> > These results look much saner, but IMHO it also does not show any
> clear
> > benefit of the patch. Or are you still claiming there is a benefit?
>
> They don't look all that sane to me - isn't that way lower than one
> would
> expect?
>
> Yes, quite disappointing.
>
> Restricting both client and server to the same four cores, a
> thermically challenged older laptop I have around I get 150k tps at
> both 10
> and 100 clients.
>
> And you can share the benchmark details? Hardware, postgres and pgbench,
> please?
>
>
> Either way, I'd not expect to see any GetSnapshotData() scalability
> effects to
> show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
> not enough
> concurrency.
>
> This means that our customers will not see any connections scalability
> with PG15, using the simplest hardware?
>
No. It means that on 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such tiny system too much.
>
> The correct pieces of these changes seem very unlikely to affect
> GetSnapshotData() performance meaningfully.
>
> To improve something like GetSnapshotData() you first have to come
> up with a
> workload that shows it being a meaningful part of a profile. Unless
> it is,
> performance differences are going to just be due to various forms of
> noise.
>
> Actually in the profiles I got with perf, GetSnapShotData() didn't show up.
>
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
You see, I've seen in several places that GetSnapShotData() is the bottleneck in scaling connections.
One of them, if I remember correctly, was at an IBM in Russia.
Another statement occurs in [1][2][3]
Just because I don't have enough hardware to force GetSnapShotData() doesn't mean optimizing it won't make a difference.
And even on my modest hardware, we've seen gains, small but consistent.
So IMHO everyone will benefit, including the small servers.
regards,
Ranier Vilela
> On 28 May 2022, at 16:12, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Just because I don't have enough hardware to force GetSnapShotData()
> doesn't mean optimizing it won't make a difference.

Quoting Andres from upthread: "To improve something like GetSnapshotData()
you first have to come up with a workload that shows it being a meaningful
part of a profile. Unless it is, performance differences are going to just
be due to various forms of noise."

If you think this is a worthwhile improvement, you need to figure out a way
to reliably test it in order to prove it.

--
Daniel Gustafsson       https://vmware.com/
On 5/28/22 16:12, Ranier Vilela wrote:
> Em sáb., 28 de mai. de 2022 às 09:00, Tomas Vondra
> <tomas.vondra@enterprisedb.com> escreveu:
>
> On 5/28/22 02:15, Ranier Vilela wrote:
> >
> > Em sex., 27 de mai. de 2022 às 18:08, Andres Freund
> > <andres@anarazel.de> escreveu:
> >
> > Hi,
> >
> > On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> > > On 5/27/22 02:11, Ranier Vilela wrote:
> > > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> > > >
> > > > pgbench (15beta1)
> > > > transaction type: <builtin: select only>
> > > > scaling factor: 1
> > > > query mode: prepared
> > > > number of clients: 100
> > > > number of threads: 100
> > > > maximum number of tries: 1
> > > > duration: 60 s
> > > >
> > > > conns    tps head        tps patched
> > > >
> > > > 1        17126.326108    17792.414234
> > > > 10       82068.123383    82468.334836
> > > > 50       73808.731404    74678.839428
> > > > 80       73290.191713    73116.553986
> > > > 90       67558.483043    68384.906949
> > > > 100      65960.982801    66997.793777
> > > > 200      62216.011998    62870.243385
> > > > 300      62924.225658    62796.157548
> > > > 400      62278.099704    63129.555135
> > > > 500      63257.930870    62188.825044
> > > > 600      61479.890611    61517.913967
> > > > 700      61139.354053    61327.898847
> > > > 800      60833.663791    61517.913967
> > > > 900      61305.129642    61248.336593
> > > > 1000     60990.918719    61041.670996
> > > >
> > >
> > > These results look much saner, but IMHO it also does not show any clear
> > > benefit of the patch. Or are you still claiming there is a benefit?
> >
> > They don't look all that sane to me - isn't that way lower than one
> > would expect?
> >
> > Yes, quite disappointing.
> >
> > Restricting both client and server to the same four cores, a
> > thermically challenged older laptop I have around I get 150k tps at
> > both 10 and 100 clients.
> >
> > And you can share the benchmark details? Hardware, postgres and pgbench,
> > please?
> >
> > Either way, I'd not expect to see any GetSnapshotData() scalability
> > effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's
> > just not enough concurrency.
> >
> > This means that our customers will not see any connections scalability
> > with PG15, using the simplest hardware?
> >
> No. It means that on 4-core machine GetSnapshotData() is unlikely to be
> a bottleneck, because you'll hit various other bottlenecks way earlier.
>
> I personally doubt it even makes sense to worry about scaling to this
> many connections on such tiny system too much.
>
> > The correct pieces of these changes seem very unlikely to affect
> > GetSnapshotData() performance meaningfully.
> >
> > To improve something like GetSnapshotData() you first have to come
> > up with a workload that shows it being a meaningful part of a profile.
> > Unless it is, performance differences are going to just be due to
> > various forms of noise.
> >
> > Actually in the profiles I got with perf, GetSnapShotData() didn't show up.
> >
> But that's exactly the point Andres is trying to make - if you don't see
> GetSnapshotData() in the perf profile, why do you think optimizing it
> will have any meaningful impact on throughput?
>
> You see, I've seen in several places that GetSnapShotData() is the
> bottleneck in scaling connections.
> One of them, if I remember correctly, was at an IBM in Russia.
> Another statement occurs in [1][2][3]

No one is claiming GetSnapshotData() can't be a bottleneck on systems
with many cores. That's certainly possible, which is why e.g. Andres
spent a lot of time optimizing for that case. But that's what we're
arguing about. You're trying to convince us that your patch will improve
things, and you're supporting that by numbers from a machine that is
unlikely to be hitting this bottleneck.

> Just because I don't have enough hardware to force GetSnapShotData()
> doesn't mean optimizing it won't make a difference.

Well, the question is if it actually optimizes things. Maybe it does,
which would be great, but people in this thread (including me) seem to
be fairly skeptical about that claim, because the results are frankly
entirely unconvincing.

I doubt we'll just accept changes in such sensitive places without
results from a relevant machine. Maybe if there was a clear agreement
it's a win, but that's not the case here.

> And even on my modest hardware, we've seen gains, small but consistent.
> So IMHO everyone will benefit, including the small servers.
>

No, we haven't seen any convincing gains. I've tried to explain multiple
times that the results you've shared are not showing any clear
improvement, due to only having one run for each client count (which
means there's a lot of noise), impact of binary layout in different
builds, etc. You've ignored all of that, so instead of repeating myself,
I did a simple benchmark on my two machines:

1) i5-2500k / 4 cores and 8GB RAM (so similar to what you have)

2) 2x e5-2620v3 / 16/32 cores, 64GB RAM (so somewhat bigger)

and I tested 1, 2, 5, 10, 50, 100, ...., 1000 clients using the same
benchmark as you (pgbench -S -M prepared ... ). I did 10 runs for each
client count, to calculate the median, which evens out the noise. And
for fun I tried this with gcc 9.3, 10.3 and 11.2.

The script and results from both machines are attached. The results from
the xeon and gcc 11.2 look like this:

  clients    master   patched    diff
  ---------------------------------------
        1     46460     44936     97%
        2     87486     84746     97%
        5    199102    192169     97%
       10    344458    339403     99%
       20    515257    512513     99%
       30    528675    525467     99%
       40    592761    594384    100%
       50    694635    706193    102%
      100    643950    655238    102%
      200    690133    696815    101%
      300    670403    677818    101%
      400    678573    681387    100%
      500    665349    678722    102%
      600    666028    670915    101%
      700    662316    662511    100%
      800    647922    654745    101%
      900    650274    654698    101%
     1000    644482    649332    101%

Please, explain to me how this shows consistent measurable improvement?
The standard deviation is roughly 1.5% on average, and the difference is
well within that range.

Even if there was a tiny improvement for the high client counts, no one
sane will run with that many clients, because the throughput peaks at
~50 clients. So even if you gain 1% with 500 clients, it's still less
than with 50 clients. If anything, this shows regression for lower
client counts.

FWIW this entirely ignores the question whether this benchmark even hits
the bottleneck this patch aims to improve.

Also, there's the question of correctness, and I'd bet Andres is right
getting snapshot without holding ProcArrayLock is busted.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Em sáb., 28 de mai. de 2022 às 09:35, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 5/28/22 02:36, Ranier Vilela wrote:
> Em sex., 27 de mai. de 2022 às 18:22, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> escreveu:
>
> Hi,
>
> On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> > Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra <
> > tomas.vondra@enterprisedb.com
> <mailto:tomas.vondra@enterprisedb.com>> escreveu:
> >
> > > On 5/27/22 02:11, Ranier Vilela wrote:
> > > >
> > > > ...
> > > >
> > > > Here the results with -T 60:
> > >
> > > Might be a good idea to share your analysis / interpretation of the
> > > results, not just the raw data. After all, the change is being
> proposed
> > > by you, so do you think this shows the change is beneficial?
> > >
> > I think so, but the expectation has diminished.
> > I expected that the more connections, the better the performance.
> > And for both patch and head, this doesn't happen in tests.
> > Performance degrades with a greater number of connections.
>
> Your system has four CPUs. Once they're all busy, adding more
> connections
> won't improve performance. It'll just add more and more context
> switching,
> cache misses, and make the OS scheduler do more work.
>
> conns tps head
> 10 82365.634750
> 50 74593.714180
> 80 69219.756038
> 90 67419.574189
> 100 66613.771701
> Yes it is quite disappointing that with 100 connections, tps loses to 10
> connections.
>
IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.
This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use machine
with more cores.
>
> > GetSnapShowData() isn't a bottleneck?
>
> I'd be surprised if it showed up in a profile on your machine with that
> workload in any sort of meaningful way. The snapshot reuse logic
> will always
> work - because there are no writes - and thus the only work that
> needs to be
> done is to acquire the ProcArrayLock briefly. And because there is
> only a
> small number of cores, contention on the cacheline for that isn't a
> problem.
>
> Thanks for sharing this.
>
>
>
>
> > > These results look much saner, but IMHO it also does not show
> any clear
> > > benefit of the patch. Or are you still claiming there is a benefit?
> > >
> > We agree that they are micro-optimizations. However, I think they
> should be
> > considered micro-optimizations in inner loops, because all in
> procarray.c is
> > a hotpath.
>
> As explained earlier, I don't agree that they optimize anything - you're
> making some of the scalability behaviour *worse*, if it's changed at
> all.
>
>
> > The first objective, I believe, was achieved, with no performance
> > regression.
> > I agree, the gains are small, by the tests done.
>
> There are no gains.
>
> IMHO, I must disagree.
>
You don't have to, really. What you should do is showing results
demonstrating the claimed gains, and so far you have not done that.
I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is so noisy).
As mentioned GetSnapshotData() is not even in perf profile, so why would
the patch even make a difference?
You've also claimed it helps generating better code on older compilers,
but you've never supported that with any evidence.
Maybe there is an improvement - show us. Do a benchmark with more runs,
to average-out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].
I redid the benchmark with a better machine:
Intel i7-10510U
RAM 8GB
SSD 512GB
Linux Ubuntu 64 bits
All files are attached, including the raw data of the results.
I did the calculations as requested.
But a quick average of the 10 benchmark runs resulted in about 10,000 tps more.
Not bad, for a simple patch, made entirely of micro-optimizations.
Results attached.
regards,
Ranier Vilela
Attachment
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> getting snapshot without holding ProcArrayLock is busted.

Unless there's some actual analysis of this by Rainier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.
On 5/29/22 19:26, Ranier Vilela wrote:
> ...
> I redid the benchmark with a better machine:
> Intel i7-10510U
> RAM 8GB
> SSD 512GB
> Linux Ubuntu 64 bits
>
> All files are attached, including the raw data of the results.
> I did the calculations as requested.
> But a quick average of the 10 benchmark runs resulted in about 10,000 tps more.
> Not bad, for a simple patch, made entirely of micro-optimizations.
>

I am a bit puzzled by the calculations. It seems you somehow sum the
differences for each run, and then average that over all the runs. So,
something like

  SELECT avg(delta_tps) FROM (
    SELECT run, SUM(patched_tps - master_tps) AS delta_tps
      FROM results GROUP BY run
  ) foo;

That's certainly an "unorthodox" way to evaluate the results, because it
mixes results for different client counts. That's certainly not what I
suggested, and it's a pretty useless view on the data, as it obfuscates
how throughput depends on the client count.

And no, the resulting 10k does not mean you've "gained" 10k tps anywhere
- none of the "diff" values is anywhere close to that value. If you
tested more client counts, you'd probably get a bigger difference.
Compared to the "sum(tps)" for each run, it's like a 0.8% difference.
But even that is entirely useless, due to mixing different client
counts. I'm sorry, but this is so silly it's hard to even explain why ...

What I meant is calculating the median for each client count, so for
example for the master branch you get 10 values for 1 client

  38820 39245 39773 39597 39301 39442 39379 39622 38909 38454

and if you calculate the median, you'll get 39340 (and stdev 411). And
the same for the other client counts, etc. If you do that, you'll get this:

  clients    master   patched    diff
  ------------------------------------
        1     39340     40173    2.12%
       10    132462    134274    1.37%
       50    115669    116575    0.78%
      100     97931     98816    0.90%
      200     88912     89660    0.84%
      300     87879     88636    0.86%
      400     87721     88219    0.57%
      500     87267     88078    0.93%
      600     87317     87781    0.53%
      700     86907     87603    0.80%
      800     86852     87364    0.59%
      900     86578     87173    0.69%
     1000     86481     86969    0.56%

How exactly this improves scalability is completely unclear to me.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
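As a concrete illustration of the suggested methodology, the throwaway C
helper below computes the median and the sample standard deviation for the
ten 1-client master samples quoted above (the numbers come from the message;
the program itself is only an example and not one of the attachments). Built
with something like "gcc stats.c -lm", it prints median = 39340 and
stdev = 411, matching the values mentioned above.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for doubles */
static int
cmp_double(const void *a, const void *b)
{
	double		da = *(const double *) a;
	double		db = *(const double *) b;

	return (da > db) - (da < db);
}

/* median of n samples (sorts the array in place) */
static double
median(double *v, int n)
{
	qsort(v, n, sizeof(double), cmp_double);
	return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* sample standard deviation (n - 1 in the denominator) */
static double
stdev(const double *v, int n)
{
	double		mean = 0.0;
	double		var = 0.0;

	for (int i = 0; i < n; i++)
		mean += v[i];
	mean /= n;
	for (int i = 0; i < n; i++)
		var += (v[i] - mean) * (v[i] - mean);
	return sqrt(var / (n - 1));
}

int
main(void)
{
	/* the ten 1-client tps samples for master quoted above */
	double		tps[] = {38820, 39245, 39773, 39597, 39301,
						 39442, 39379, 39622, 38909, 38454};
	int			n = sizeof(tps) / sizeof(tps[0]);

	printf("median = %.0f, stdev = %.0f\n", median(tps, n), stdev(tps, n));
	return 0;
}

The per-client-count medians in the table above can be obtained by repeating
the same calculation for every client count and for both builds.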
Em dom., 29 de mai. de 2022 às 15:21, Andres Freund <andres@anarazel.de> escreveu:
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> getting snapshot without holding ProcArrayLock is busted.
Unless there's some actual analysis of this by Rainier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.
Sorry, just not my intention to ignore this important point.
Of course, any performance gain is good, but robustness comes first.
As soon as I have some time.
regards,
Ranier Vilela
Em dom., 29 de mai. de 2022 às 17:10, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em dom., 29 de mai. de 2022 às 15:21, Andres Freund <andres@anarazel.de> escreveu:
On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> Also, there's the question of correctness, and I'd bet Andres is right
> getting snapshot without holding ProcArrayLock is busted.
Unless there's some actual analysis of this by Rainier, I'm just going to
ignore this thread going forward. It's pointless to invest time when
everything we say is just ignored.
Sorry, just not my intention to ignore this important point.
Of course, any performance gain is good, but robustness comes first.
As soon as I have some time.
I redid the benchmarks, this time getting the snapshot while holding ProcArrayLock.
Average Results

conns   tps head           tps patched        diff                 %
1       39196.3088985      39858.0207936      661.711895100008     101.69%
2       65050.8643819      65245.9852367      195.1208548          100.30%
5       91486.0298359      91862.9026528      376.872816899995     100.41%
10      131318.0774955     131547.1404573     229.062961799995     100.17%
50      116531.2334144     116687.0325522     155.799137800001     100.13%
100     98969.4650449      98808.6778717      -160.787173199991    99.84%
200     89514.5238649      89463.6196075      -50.904257400005     99.94%
300     88426.3612183      88457.2695151      30.9082968000002     100.03%
400     88078.1686912      88338.2859163      260.117225099995     100.30%
500     87791.1620039      88074.3418504      283.179846500003     100.32%
600     87552.3343394      87930.8645184      378.530178999994     100.43%
1000    86538.3772895      86771.1946099      232.817320400005     100.27%
avg     89204.4088731917   89420.444631825    1981.0816042         100.24%
For clients with 1 connection, the results are good.
But for clients with 100 and 200 connections, the results are not good.
I can't say why these two tests were so bad.
Because of the 100 and 200 results, I'm not sure if this should go ahead, or if it's worth the effort.
Attached the results files and calc plan.
regards,
Ranier Vilela
Attachment
On 5/31/22 16:36, Ranier Vilela wrote:
> Em dom., 29 de mai. de 2022 às 17:10, Ranier Vilela
> <ranier.vf@gmail.com> escreveu:
>
> Em dom., 29 de mai. de 2022 às 15:21, Andres Freund
> <andres@anarazel.de> escreveu:
>
> On 2022-05-29 18:00:14 +0200, Tomas Vondra wrote:
> > Also, there's the question of correctness, and I'd bet Andres is right
> > getting snapshot without holding ProcArrayLock is busted.
>
> Unless there's some actual analysis of this by Rainier, I'm just going to
> ignore this thread going forward. It's pointless to invest time when
> everything we say is just ignored.
>
> Sorry, just not my intention to ignore this important point.
> Of course, any performance gain is good, but robustness comes first.
>
> As soon as I have some time.
>
> I redid the benchmarks, this time getting the snapshot while holding
> ProcArrayLock.
>
> Average Results
>
> conns   tps head           tps patched        diff                 %
> 1       39196.3088985      39858.0207936      661.711895100008     101.69%
> 2       65050.8643819      65245.9852367      195.1208548          100.30%
> 5       91486.0298359      91862.9026528      376.872816899995     100.41%
> 10      131318.0774955     131547.1404573     229.062961799995     100.17%
> 50      116531.2334144     116687.0325522     155.799137800001     100.13%
> 100     98969.4650449      98808.6778717      -160.787173199991    99.84%
> 200     89514.5238649      89463.6196075      -50.904257400005     99.94%
> 300     88426.3612183      88457.2695151      30.9082968000002     100.03%
> 400     88078.1686912      88338.2859163      260.117225099995     100.30%
> 500     87791.1620039      88074.3418504      283.179846500003     100.32%
> 600     87552.3343394      87930.8645184      378.530178999994     100.43%
> 1000    86538.3772895      86771.1946099      232.817320400005     100.27%
> avg     89204.4088731917   89420.444631825    1981.0816042         100.24%
>
> For clients with 1 connection, the results are good.

Isn't that a bit strange, considering the aim of this patch was
scalability, which should improve higher client counts in the first place?

> But for clients with 100 and 200 connections, the results are not good.
> I can't say why these two tests were so bad.
> Because of the 100 and 200 results, I'm not sure if this should go ahead,
> or if it's worth the effort.
>

I'd argue this is either just noise, and there's no actual difference.
This could be verified by some sort of statistical testing (e.g. the
well known t-test).

Another option is that this is simply due to differences in binary
layout - this can result in small differences (easily 1-2%) that are
completely unrelated to what the patch does. This is exactly what the
"stabilizer" talk I mentioned a couple days ago was about.

FWIW, when a patch improves scalability, the improvement usually
increases with the number of clients. So you'd see no/small improvement
for 10 clients, 100 clients would be improved more, 200 more, etc. We
see nothing like that here.

So either the patch does not really improve anything, or perhaps the
benchmark doesn't even hit the bottleneck the patch is meant to improve
(which was already suggested in this thread repeatedly).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
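For reference, the kind of statistical test being suggested can be as simple
as Welch's two-sample t statistic over the per-run tps numbers for a single
client count. The sketch below is only an illustration: the master[] values
are the ten 1-client samples quoted earlier, while the patched[] values are
hypothetical placeholders to be replaced with real runs. As a rough rule of
thumb, with ten runs per side (about 18 degrees of freedom) an absolute t
value below roughly 2.1 is consistent with "no difference" at the usual 5%
level.

#include <math.h>
#include <stdio.h>

/*
 * Welch's t statistic for two independent samples, e.g. per-run tps for
 * master vs. patched at a single client count.  Positive t means the
 * second sample has the higher mean.
 */
static double
welch_t(const double *a, int na, const double *b, int nb)
{
	double		mean_a = 0.0, mean_b = 0.0, var_a = 0.0, var_b = 0.0;

	for (int i = 0; i < na; i++)
		mean_a += a[i];
	for (int i = 0; i < nb; i++)
		mean_b += b[i];
	mean_a /= na;
	mean_b /= nb;

	for (int i = 0; i < na; i++)
		var_a += (a[i] - mean_a) * (a[i] - mean_a);
	for (int i = 0; i < nb; i++)
		var_b += (b[i] - mean_b) * (b[i] - mean_b);
	var_a /= (na - 1);			/* sample variances */
	var_b /= (nb - 1);

	return (mean_b - mean_a) / sqrt(var_a / na + var_b / nb);
}

int
main(void)
{
	/* ten 1-client master samples quoted earlier in the thread */
	double		master[] = {38820, 39245, 39773, 39597, 39301,
							39442, 39379, 39622, 38909, 38454};

	/* hypothetical placeholder values - substitute real patched runs */
	double		patched[] = {39120, 39510, 39890, 39700, 39420,
							39580, 39490, 39760, 39050, 38610};

	printf("t = %.2f\n", welch_t(master, 10, patched, 10));
	return 0;
}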
On 5/31/22 11:44, Tomas Vondra wrote:
> I'd argue this is either just noise, and there's no actual difference.
> This could be verified by some sort of statistical testing (e.g. the
> well known t-test).

Given the conversation so far, I'll go ahead and mark this Returned with
Feedback. Specifically, this patch would need hard statistical proof that
it's having a positive effect.

--Jacob
Em seg., 1 de ago. de 2022 às 18:51, Jacob Champion <jchampion@timescale.com> escreveu:
On 5/31/22 11:44, Tomas Vondra wrote:
> I'd argue this is either just noise, and there's no actual difference.
> This could be verified by some sort of statistical testing (e.g. the
> well known t-test).
Given the conversation so far, I'll go ahead and mark this Returned with
Feedback. Specifically, this patch would need hard statistical proof
that it's having a positive effect.
Given the contrary opinions, the little proven benefit, and the lack of enthusiasm for changing procarray.c,
I believe it is best to reject it.
regards,
Ranier Vilela