
From Andres Freund
Subject Re: Improving connection scalability: GetSnapshotData()
Date
Msg-id 20200406133959.viql5fqecog6mppj@alap3.anarazel.de
In response to Re: Improving connection scalability: GetSnapshotData()  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: Improving connection scalability: GetSnapshotData()  (Andres Freund <andres@anarazel.de>)
Re: Improving connection scalability: GetSnapshotData()  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

These benchmarks were run on my workstation; the larger VM I used in the
last round wasn't available this time.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB RAM.
Data directory is on a Samsung SSD 970 PRO 1TB.

A bunch of terminals, emacs and mutt are open while the benchmarks are
running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks


Independent of the effects of this patch (i.e. including master) I had a
fairly hard time getting reproducible numbers for the *low* client-count
cases. I found the numbers to be more reproducible if I pinned
server/pgbench onto the same core :(.  I chose to do that for the -c1
cases, to benchmark the optimal behaviour, as that seemed to have the
biggest potential for regressions.
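
E.g. something like this (core number, $PGDATA and the script name are
only for illustration):

# pin the postmaster - backends forked afterwards inherit the affinity
taskset -cp 9 $(head -n1 "$PGDATA"/postmaster.pid)
# run the single-client pgbench pinned to the same core
taskset -c 9 pgbench -n -M prepared -c 1 -j 1 -T 180 -f test.sql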

All numbers are best of three runs; each test starts in a freshly
created cluster.


On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:
> The following pgbench scripts come first to my mind:
> 1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)
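
Spelled out, a 16-client run of this looks roughly as follows (the
script file name and -j are illustrative, -c as per the table):

-- txid.sql, i.e. Alexander's point 1)
SELECT txid_current();

pgbench -n -M prepared -c 16 -j 16 -T 180 -f txid.sql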

clients     tps master      tps pgxact
1           46118           46027
16          377357          440233
40          373304          410142
198         103912          105579

btw, there's some pretty horrible cacheline bouncing in txid_current(),
because each backend first calls ReadNextFullTransactionId() (acquires
XidGenLock in shared mode, reads ShmemVariableCache->nextFullXid) and
then separately GetNewTransactionId() (acquires XidGenLock exclusively,
reads & writes nextFullXid).

With fsync=off (and also with synchronous_commit=off) the numbers at
lower client counts are severely depressed and variable, due to
walwriter going completely nuts (using more CPU than the backend doing
the queries). Because WAL writes are so fast on my storage, individual
XLogBackgroundFlush() calls are very quick. This leads to a *lot* of
kill()s from the backend, from within XLogSetAsyncXactLSN().  There's
got to be a bug here, but that's unrelated to this patch.

> 2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360

fsync on:
clients     tps master      tps pgxact
1           653             658
16          5687            5668
40          14212           14229
198         60483           62420

fsync off:
clients     tps master      tps pgxact
1           59356           59891
16          290626          299991
40          348210          355669
198         289182          291529
1024        47586           52135

-M simple
fsync off:
clients     tps master      tps pgxact
40          289077          326699
198         286011          299928




> 3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           474             479
16          4356            4476
40          8591            9309
198         20045           20261
1024        17986           18545

autovacuum=off, fsync off:
clients     tps master      tps pgxact
1           7828            7719
16          49069           50482
40          68241           73081
198         73464           77801
1024        25621           28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.



> 4) pgbench read-write script with increased amount of SELECTs.  Repeat
> select from pgbench_accounts say 10 times with different aids.

I interspersed all server-side statements in the script with two SELECTs
of other pgbench_accounts rows each.
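
I.e. something along these lines (a sketch based on the builtin
tpcb-like script, with a pair of extra aids rolled per statement):

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
\set aid3 random(1, 100000 * :scale)
\set aid4 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid4;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\set aid5 random(1, 100000 * :scale)
\set aid6 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid5;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid6;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
\set aid7 random(1, 100000 * :scale)
\set aid8 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid7;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid8;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
\set aid9 random(1, 100000 * :scale)
\set aid10 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid9;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid10;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;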

-s 100 -M prepared -T 700
autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           365             367
198         20065           21391

-s 1000 -M prepared -T 700
autovacuum=off, fsync on:
clients     tps master      tps pgxact
16          2757            2880
40          4734            4996
198         16950           19998
1024        22423           24935


> 5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients     tps master      tps pgxact
16          37289           38656
40          81284           81260
198         189002          189357
1024        143986          164762


> > That definitely needs to be measured, due to the locking changes around procarrayadd/remove.
> >
> > I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more
> > expensive but procarrayadd/remove.
>
> I agree that ProcArrayAdd()/Remove() should be first subject of
> investigation, but other cases should be checked as well IMHO.

I'm not sure I really see the point. If a simple prepared transaction
doesn't show a negative difference, a more complex one won't either,
since the ProcArrayAdd()/Remove()-related bottlenecks play a smaller and
smaller role.


> Regarding 2pc, the following scenarios come to my mind:
> 1) pgbench read-write modified so that every transaction is prepared
> first, then commit prepared.

The numbers here are with -M simple, because I wanted to use
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';
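
I.e. ~/tmp/pgbench-write-2pc.sql is roughly the builtin tpcb-like
transaction with the final COMMIT replaced by those two statements; a
sketch:

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';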

-s 100 -M prepared -T 700 -f ~/tmp/pgbench-write-2pc.sql
autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           251             249
16          2134            2174
40          3984            4089
198         6677            7522
1024        3641            3617


> 2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -btpcb-like@9

clients     tps master      tps pgxact
198         18625           18906

> 3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -bselect-only@9

clients     tps master      tps pgxact
198         84817           84350


I also benchmarked connection overhead, using pgbench with -C executing
SELECT 1 (a sketch of the invocation is below the table).

-T 10
clients     tps master      tps pgxact
1           572             587
16          2109            2140
40          2127            2136
198         2097            2129
1024        2101            2118
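
For reference, each of those runs boils down to roughly the following
(script name and -j illustrative, -c per the respective row):

-- select1.sql
SELECT 1;

pgbench -n -C -c 40 -j 40 -T 10 -f select1.sql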



These numbers seem pretty decent to me. The regressions seem mostly
within noise. The one possible exception to that is plain pgbench
read/write with fsync=off and only a single session. I'll run more
benchmarks around that tomorrow (but now it's 6am :().

Greetings,

Andres Freund


