
From: Tomas Vondra
Subject: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Date:
Msg-id: 6ec612ba-1c97-6845-5ee2-964b80996898@enterprisedb.com
In response to: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)  (Ranier Vilela <ranier.vf@gmail.com>)
Responses: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
List: pgsql-hackers
On 5/28/22 16:12, Ranier Vilela wrote:
> On Sat, May 28, 2022 at 09:00, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
> 
>     On 5/28/22 02:15, Ranier Vilela wrote:
>     >
>     >
>     > On Fri, May 27, 2022 at 18:08, Andres Freund
>     > <andres@anarazel.de> wrote:
>     >
>     >     Hi,
>     >
>     >     On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
>     >     > On 5/27/22 02:11, Ranier Vilela wrote:
>     >     > > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
>     >     > >
>     >     > > pgbench (15beta1)
>     >     > > transaction type: <builtin: select only>
>     >     > > scaling factor: 1
>     >     > > query mode: prepared
>     >     > > number of clients: 100
>     >     > > number of threads: 100
>     >     > > maximum number of tries: 1
>     >     > > duration: 60 s
>     >     > >
>     >     > > conns            tps head              tps patched
>     >     > >
>     >     > > 1         17126.326108            17792.414234
>     >     > > 10       82068.123383            82468.334836
>     >     > > 50       73808.731404            74678.839428
>     >     > > 80       73290.191713            73116.553986
>     >     > > 90       67558.483043            68384.906949
>     >     > > 100     65960.982801            66997.793777
>     >     > > 200     62216.011998            62870.243385
>     >     > > 300     62924.225658            62796.157548
>     >     > > 400     62278.099704            63129.555135
>     >     > > 500     63257.930870            62188.825044
>     >     > > 600     61479.890611            61517.913967
>     >     > > 700     61139.354053            61327.898847
>     >     > > 800     60833.663791            61517.913967
>     >     > > 900     61305.129642            61248.336593
>     >     > > 1000   60990.918719            61041.670996
>     >     > >
>     >     >
>     >     > These results look much saner, but IMHO it also does not show
>     >     > any clear benefit of the patch. Or are you still claiming there
>     >     > is a benefit?
>     >
>     >     They don't look all that sane to me - isn't that way lower than
>     >     one would expect?
>     >
>     > Yes, quite disappointing.
>     >
>     >     Restricting both client and server to the same four cores, on a
>     >     thermically challenged older laptop I have around, I get 150k tps
>     >     at both 10 and 100 clients.
>     >
>     > And can you share the benchmark details? Hardware, postgres and
>     > pgbench, please?
>     >
>     >
>     >     Either way, I'd not expect to see any GetSnapshotData() scalability
>     >     effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" -
>     >     there's just not enough concurrency.
>     >
>     > This means that our customers will not see any connections scalability
>     > with PG15, using the simplest hardware?
>     >
> 
>     No. It means that on a 4-core machine GetSnapshotData() is unlikely to
>     be a bottleneck, because you'll hit various other bottlenecks way
>     earlier.
> 
>     I personally doubt it even makes sense to worry too much about scaling
>     to this many connections on such a tiny system.
> 
> 
>     >
>     >     The correct pieces of these changes seem very unlikely to affect
>     >     GetSnapshotData() performance meaningfully.
>     >
>     >     To improve something like GetSnapshotData() you first have to come
>     >     up with a workload that shows it being a meaningful part of a
>     >     profile. Unless it is, performance differences are going to just
>     >     be due to various forms of noise.
>     >
>     > Actually in the profiles I got with perf, GetSnapshotData() didn't
>     > show up.
>     >
> 
>     But that's exactly the point Andres is trying to make - if you don't see
>     GetSnapshotData() in the perf profile, why do you think optimizing it
>     will have any meaningful impact on throughput?
> 
> You see, I've seen in several places that GetSnapshotData() is the
> bottleneck in scaling connections.
> One of them, if I remember correctly, was at an IBM in Russia.
> Other such statements occur in [1][2][3].

No one is claiming GetSnapshotData() can't be a bottleneck on systems
with many cores. That's certainly possible, which is why e.g. Andres
spent a lot of time optimizing for that case.

But that's not what we're arguing about. You're trying to convince us that
your patch will improve things, and you're supporting that by numbers
from a machine that is unlikely to be hitting this bottleneck.

> Just because I don't have enough hardware to force GetSnapshotData()
> doesn't mean optimizing it won't make a difference.

Well, the question is whether it actually optimizes things. Maybe it does,
which would be great, but people in this thread (including me) seem to
be fairly skeptical about that claim, because the results are frankly
entirely unconvincing.

I doubt we'll just accept changes in such sensitive places without
results from a relevant machine. Maybe if there was a clear agreement
it's a win, but that's not the case here.


> And even on my modest hardware, we've seen gains, small but consistent.
> So IMHO everyone will benefit, including the small servers.
> 

No, we haven't seen any convincing gains. I've tried to explain multiple
times that the results you've shared are not showing any clear
improvement, due to only having one run for each client count (which
means there's a lot of noise), impact of binary layout in different
builds, etc. You've ignored all of that, so instead of repeating myself,
I did a simple benchmark on my two machines:

1) i5-2500k / 4 cores and 8GB RAM (so similar to what you have)

2) 2x e5-2620v3 / 16/32 cores, 64GB RAM (so somewhat bigger)

and I tested 1, 2, 5, 10, 50, 100, ..., 1000 clients using the same
benchmark as you (pgbench -S -M prepared ...). I did 10 runs for each
client count, to calculate the median, which evens out the noise. And for
fun I tried this with gcc 9.3, 10.3 and 11.2. The script and results from
both machines are attached.
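
To give an idea of the methodology, the per-client-count loop looks roughly
like this (just a simplified sketch, not the attached script; the database
name, run count and duration are placeholders):

    #!/bin/sh
    # one pgbench -S -M prepared run per (client count, run) pair,
    # printing "clients run tps" so medians can be computed afterwards
    RUNS=10        # placeholder values
    DURATION=15
    DB=test

    for c in 1 2 5 10 20 30 40 50 100 200 500 1000; do
        for r in $(seq 1 $RUNS); do
            tps=$(pgbench -n -S -M prepared -c $c -j $c -T $DURATION $DB \
                  | awk '/^tps/ {print $3}')
            echo "$c $r $tps"
        done
    done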

The results from the xeon machine with gcc 11.2 look like this:

    clients    master    patched     diff
   ---------------------------------------
          1     46460     44936       97%
          2     87486     84746       97%
          5    199102    192169       97%
         10    344458    339403       99%
         20    515257    512513       99%
         30    528675    525467       99%
         40    592761    594384      100%
         50    694635    706193      102%
        100    643950    655238      102%
        200    690133    696815      101%
        300    670403    677818      101%
        400    678573    681387      100%
        500    665349    678722      102%
        600    666028    670915      101%
        700    662316    662511      100%
        800    647922    654745      101%
        900    650274    654698      101%
       1000    644482    649332      101%

Please explain to me how this shows a consistent, measurable improvement.

The standard deviation is roughly 1.5% on average, and the difference is
well within that range. Even if there were a tiny improvement for the
high client counts, no one sane will run with that many clients, because
the throughput peaks at ~50 clients. So even if you gain 1% with 500
clients, it's still less than with 50 clients. If anything, this shows a
regression for the lower client counts.
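
Just to illustrate where the noise estimate comes from: the median and the
relative spread for one client count can be obtained with something like this
(an illustration only, assuming one tps value per line in tps.txt):

    sort -n tps.txt | awk '
      { v[NR] = $1; sum += $1; sumsq += $1 * $1 }
      END {
        median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
        mean = sum / NR
        stddev = sqrt(sumsq / NR - mean * mean)
        printf "median = %.0f  stddev = %.1f%% of mean\n", median, 100 * stddev / mean
      }'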

FWIW this entirely ignores the question of whether this benchmark even hits
the bottleneck this patch aims to improve. Also, there's the question of
correctness, and I'd bet Andres is right that getting a snapshot without
holding ProcArrayLock is busted.
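
Checking the profile part is not hard, by the way - something along these
lines (assuming perf and debug symbols are available) shows whether
GetSnapshotData() is anywhere near the top while the benchmark is running:

    # sample all CPUs for 30 seconds while pgbench runs, then look for the symbol
    perf record -F 99 -a -g -- sleep 30
    perf report --stdio | grep -i getsnapshotdata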


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
