From: Marc Cousin
Subject: Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
Msg-id: ff8d58cc-8f5a-b0f2-9705-98d7ff923fad@gmail.com
In response to: Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders (Pierre Ducroquet <p.psql@pinaraf.info>)
List: pgsql-hackers

Hi,
I spent a little bit of time trying to explain clearly the problem we are facing, and to provide a reproducible benchmark.

So here it is.

What triggered our investigation is that we have a PostgreSQL cluster containing about 15 databases, most of them being used as sources for logical replication. This means we have about as many WAL senders active on the cluster at the same time.

What we saw were very high spikes of CPU activity, with a very high level of SYS (up to 50% of the whole system in SYS, with 100% load on all CPUs) on the server, a dual Xeon Silver 4110 (16 cores, 32 threads). That seemed insane, as our usual load isn't that high on average (around 20% total CPU, with ~1% SYS) and is mostly concentrated on 2 of those 15 databases. The spikes mostly occurred when creating an index, or when doing batch updates, COPYs, etc. And all WAL senders consumed about the same amount of CPU, even those connected to databases where nothing was happening. While testing, we also noticed things were worse with PG 12 than with PG 10; that was the reason for the first patch Pierre posted, which was mostly a way to get PostmasterIsAlive out of the way on PG 10, so that both versions behaved the same and we could confirm what was different between the two.

That was the first benchmark we did, the one Pierre posted a few days ago. With the second patch (reducing calls to GetFlushRecPtr), on PostgreSQL 12, with statements affecting lots of records at once, we managed to reduce the WAL senders' CPU consumption by a factor of 15 (if the patch is correct, of course). SYS was back down to sensible (near 0) values. WAL senders for databases that had no decoding to do no longer consumed much; only the one connected to the database doing the work used a lot of CPU, but that's expected. This solves the problem we are facing. Without it, we won't be able to upgrade to PG 12, as the impact of GetFlushRecPtr there is even worse than on PG 10.
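
To make the idea concrete, here is a minimal sketch of what the second patch does as I understand it, with illustrative names (this is not the actual diff): instead of fetching the current flush position for every page the logical decoder reads, the walsender remembers the last value it obtained and only goes back to shared memory when the decoder asks for WAL beyond it.

/*
 * Minimal sketch, illustrative names -- not the actual patch.
 * GetFlushRecPtr() reads shared memory under a spinlock, and with ~15
 * walsenders all decoding the same burst of WAL, calling it for every
 * page is what shows up as SYS time.
 */
static XLogRecPtr cached_flush_ptr = InvalidXLogRecPtr;

static XLogRecPtr
get_flush_ptr_cached(XLogRecPtr wanted_lsn)
{
    /* Only hit shared memory when the cached position is not enough. */
    if (wanted_lsn > cached_flush_ptr)
        cached_flush_ptr = GetFlushRecPtr();

    return cached_flush_ptr;
}

The actual patch does something along these lines in the walsender's page-read path; the important part is that the shared-memory read becomes rare instead of per-page.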


I've now tried to measure the impact of the patch with activity spread more evenly across several databases, where the contention on GetFlushRecPtr is less severe, to see if there are wins in all cases. Simple scripts to test this are provided as an attachment.

Just set the "max" environment variable to the number of databases/WAL senders you want, then run create_logical.sh (creates the databases and the slots), connect_logical.sh (connects pg_recvlogical processes to these slots), and run_stress.sh (runs a pgbench against each database, each doing 100000 transactions, all in parallel). drop_logical.sh does the cleanup. Sorry, they are very basic...

Here are the results for unpatched and patched PG 12. You get the output of all the pgbench runs, then the CPU consumption of all WAL senders at the end of a run.

As you can see, pgbench runs are ~16% faster (about 511 tps unpatched vs. 592 tps patched). That's because with the unpatched build we are CPU-starved (the server is at 100% CPU), which we aren't with the patched one. The WAL senders needed about half the CPU in the patched case, as shown by pidstat.

UNPATCHED:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.943 ms
tps = 514.796454 (including connections establishing)
tps = 514.805610 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.949 ms
tps = 513.130790 (including connections establishing)
tps = 513.168135 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.950 ms
tps = 512.946425 (including connections establishing)
tps = 512.961746 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.951 ms
tps = 512.643065 (including connections establishing)
tps = 512.678765 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.159794 (including connections establishing)
tps = 512.178075 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.024962 (including connections establishing)
tps = 512.034590 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.016413 (including connections establishing)
tps = 512.034438 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.954 ms
tps = 511.728080 (including connections establishing)
tps = 511.760138 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.954 ms
tps = 511.655046 (including connections establishing)
tps = 511.678533 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.604767 (including connections establishing)
tps = 511.617562 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.525593 (including connections establishing)
tps = 511.558150 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.496498 (including connections establishing)
tps = 511.505871 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.956 ms
tps = 511.334434 (including connections establishing)
tps = 511.363141 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.956 ms
tps = 511.256908 (including connections establishing)
tps = 511.284577 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.957 ms
tps = 511.021219 (including connections establishing)
tps = 511.041905 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.957 ms
tps = 510.977176 (including connections establishing)
tps = 511.004730 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.958 ms
tps = 510.818341 (including connections establishing)
tps = 510.870735 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.958 ms
tps = 510.719611 (including connections establishing)
tps = 510.741600 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.959 ms
tps = 510.460165 (including connections establishing)
tps = 510.504649 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.961 ms
tps = 509.962481 (including connections establishing)
tps = 509.978043 (excluding connections establishing)

~$ pidstat -C postgres  -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy)    01/06/2020      _x86_64_        (32 CPU)

02:39:30 PM   UID       PID    usr-ms system-ms  guest-ms  Command
02:39:30 PM   111     78644   2232190    284870         0  /tmp/pg-12-unpatched/bin/postgres
02:39:30 PM   111     78646       400       340         0  postgres: checkpointer
02:39:30 PM   111     78647       530      1330         0  postgres: background writer
02:39:30 PM   111     78648       300       150         0  postgres: walwriter
02:39:30 PM   111     78649        30        40         0  postgres: autovacuum launcher
02:39:30 PM   111     78650       130       440         0  postgres: stats collector
02:39:30 PM   111     78790     86340     20560         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78792     89170     23270         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78794     86990     20740         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78795     91900     22540         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78797     92000     23190         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78799     94060     22520         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78801     95300     21500         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78803     93120     21360         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78805     95420     21920         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78807     94400     21350         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78809     88850     20390         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78811     90030     20690         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78812     94310     22660         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78813     94080     22470         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78814     95370     21520         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78815     94780     21470         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78816     92440     21080         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78817     94360     20700         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78818     92230     20760         0  postgres: walsender postgres [local] idle
02:39:30 PM   111     78819     90280     20780         0  postgres: walsender postgres [local] idle

PATCHED:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.680 ms
tps = 595.090858 (including connections establishing)
tps = 595.131449 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.683 ms
tps = 594.156492 (including connections establishing)
tps = 594.198624 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.684 ms
tps = 593.927387 (including connections establishing)
tps = 593.946829 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.209506 (including connections establishing)
tps = 593.257556 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.144977 (including connections establishing)
tps = 593.162018 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.084403 (including connections establishing)
tps = 593.122178 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.134657 (including connections establishing)
tps = 593.199432 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.908760 (including connections establishing)
tps = 592.923386 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.802027 (including connections establishing)
tps = 592.814300 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.678874 (including connections establishing)
tps = 592.769759 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.642501 (including connections establishing)
tps = 592.723261 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.688 ms
tps = 592.249597 (including connections establishing)
tps = 592.262962 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.867795 (including connections establishing)
tps = 591.958672 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.835950 (including connections establishing)
tps = 591.908940 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.799816 (including connections establishing)
tps = 591.824497 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.738978 (including connections establishing)
tps = 591.780258 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.691 ms
tps = 591.530490 (including connections establishing)
tps = 591.570876 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.691 ms
tps = 591.452142 (including connections establishing)
tps = 591.498424 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.693 ms
tps = 590.674305 (including connections establishing)
tps = 590.708951 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.693 ms
tps = 590.517240 (including connections establishing)
tps = 590.563531 (excluding connections establishing)


~$ pidstat -C postgres  -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy)    01/06/2020      _x86_64_        (32 CPU)

02:29:02 PM   UID       PID    usr-ms system-ms  guest-ms  Command
02:29:02 PM   111     75810   2185430    294190         0  /tmp/pg-12-patched/bin/postgres
02:29:02 PM   111     75812       410       320         0  postgres: checkpointer
02:29:02 PM   111     75813       410      1230         0  postgres: background writer
02:29:02 PM   111     75814       270        80         0  postgres: walwriter
02:29:02 PM   111     75815        30        20         0  postgres: autovacuum launcher
02:29:02 PM   111     75816       130       360         0  postgres: stats collector
02:29:02 PM   111     75961     35890     27240         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75963     37390     27950         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75965     38360     28110         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75966     38350     28160         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75968     38370     28160         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75970     37820     28110         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75972     38250     27330         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75974     36870     27640         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75976     36890     26850         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75979     36920     26330         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75980     37090     27240         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75981     38040     28210         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75982     36530     27460         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75983     37560     27330         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75984     36660     27170         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75985     36370     27020         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75986     36960     27000         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75987     36460     26820         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75988     36290     27140         0  postgres: walsender postgres [local] idle
02:29:02 PM   111     75989     36320     26750         0  postgres: walsender postgres [local] idle


Regards,

Marc 