Thread: [PATCH] fix a performance issue with multiple logical-decoding walsenders
[PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Pierre Ducroquet
Date:
Hello,

Our current setup uses logical replication to build a BI replication server alongside our primary clusters (running PG 10.10 so far). This implies having one logical replication slot per database.

We noticed our load had grown quite a bit on the primary since the introduction of this replication, with spikes in system time whenever a lot of WAL was being written (for instance when creating GIN indexes). After some analysis, we identified two hot spots behind this issue. Fixing them gave us a tenfold performance improvement in decoding speed.

The first hot spot is PostmasterIsAlive. The implementation reads on a pipe to know whether the postmaster is still alive, and this is very expensive kernel-wise. When we switched the implementation to a much more primitive (and probably wrong) one:

    bool
    PostmasterIsAliveInternal()
    {
        return getppid() == PostmasterPid;
    }

we stopped seeing spikes in system time. But after doing that, the CPU time used by our walsenders increased, and we reached a second hot spot, this time in XLogSendLogical, where each walsender was using 100% of user CPU for minutes. After checking with perf, it appears all the decoders are fighting on GetFlushRecPtr.

On PostgreSQL 12, the call to PostmasterIsAlive is no longer present in WalSndLoop (thanks to commit cfdf4dc4f), so only the second hot spot is still present, with the same effects.

Attached to this email are two patches.

The first one, specific to PG 10 for our use case, simplifies the PostmasterIsAlive function as described above. I don't know if this implementation is valid, but it was needed to uncover the performance issue in XLogSendLogical. Would it be possible to remove the systematic call to PostmasterIsAlive in WalSndLoop? We are not certain of the behaviour.

The second one was tested on PG 10 and PG 12 (with a 48-line offset). On PG 12 it has the same effect it has on PG 10 with the isAlive patch applied. Instead of calling GetFlushRecPtr every time, we call it only once we notice we have reached the value returned by the previous call. This way, when the senders are busy decoding, we are no longer fighting over a spinlock just to read the FlushRecPtr.

Here are some benchmark results. On PG 10, to decode our replication stream, we went from 3m 43s to over 5 minutes after removing the first hot spot, and then down to 22 seconds. On PG 12, we had to change the benchmark (GIN index creation being more optimized there), so we cannot compare directly with our previous numbers; we went from 15m 11s down to 59 seconds.

If needed, we can provide scripts to reproduce this situation. It is quite simple: attach ~20 walsenders doing logical replication to database A, then generate a lot of data in database B. The walsenders will be woken up by the activity on database B but will have nothing to send, so they keep hitting the same locks.

Regards
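To make the second change concrete, here is a rough sketch of the idea inside XLogSendLogical in walsender.c. It illustrates the approach described above rather than the exact patch text; the names (flushPtr, logical_decoding_ctx, WalSndCaughtUp, GetFlushRecPtr) are the ones used by the real function, but the surrounding error handling is omitted:

    /*
     * Cache the flush position across calls instead of asking for it on
     * every record: GetFlushRecPtr() takes a spinlock that all walsenders
     * fight over, so only refresh the cached value once decoding has
     * caught up with it.
     */
    static XLogRecPtr flushPtr = InvalidXLogRecPtr;

    /* ... decode one record via LogicalDecodingProcessRecord() ... */

    /*
     * Only read the shared flush pointer again once we have decoded
     * everything up to the previously fetched value.
     */
    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        flushPtr = GetFlushRecPtr();

    /* If we are still past the fresh flush position, we have caught up. */
    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        WalSndCaughtUp = true;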
Attachment
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Julien Rouhaud
Date:
Hello Pierre,

On Thu, Dec 26, 2019 at 5:43 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>
> The second one was tested on PG 10 and PG 12 (with 48 lines offset). It has on
> PG12 the same effect it has on a PG10+isAlive patch. Instead of calling each
> time GetFlushRecPtr, we call it only if we notice we have reached the value of
> the previous call. This way, when the senders are busy decoding, we are no
> longer fighting for a spinlock to read the FlushRecPtr.

The patch is quite straightforward and looks good to me.

-       XLogRecPtr      flushPtr;
+       static XLogRecPtr flushPtr = 0;

You should use InvalidXLogRecPtr instead though, and maybe adding some
comments to explain why the static variable is a life changer here.

> Here are some benchmark results.
> On PG 10, to decode our replication stream, we went from 3m 43s to over 5
> minutes after removing the first hot spot, and then down to 22 seconds.
> On PG 12, we had to change the benchmark (due to GIN indexes creation being
> more optimized) so we can not compare directly with our previous bench. We
> went from 15m 11s down to 59 seconds.
> If needed, we can provide scripts to reproduce this situation. It is quite
> simple: add ~20 walsenders doing logical replication in database A, and then
> generate a lot of data in database B. The walsenders will be woken up by the
> activity on database B, but not sending it thus keeping hitting the same
> locks.

Quite impressive speedup!
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Pierre Ducroquet
Date:
On Thursday, December 26, 2019 8:18:46 PM CET Julien Rouhaud wrote:
> Hello Pierre,
>
> On Thu, Dec 26, 2019 at 5:43 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
> > The second one was tested on PG 10 and PG 12 (with 48 lines offset). It
> > has on PG12 the same effect it has on a PG10+isAlive patch. Instead of
> > calling each time GetFlushRecPtr, we call it only if we notice we have
> > reached the value of the previous call. This way, when the senders are
> > busy decoding, we are no longer fighting for a spinlock to read the
> > FlushRecPtr.
>
> The patch is quite straightforward and looks good to me.
>
> -       XLogRecPtr      flushPtr;
> +       static XLogRecPtr flushPtr = 0;
>
> You should use InvalidXLogRecPtr instead though, and maybe adding some
> comments to explain why the static variable is a life changer here.
>
> > Here are some benchmark results.
> > On PG 10, to decode our replication stream, we went from 3m 43s to over 5
> > minutes after removing the first hot spot, and then down to 22 seconds.
> > On PG 12, we had to change the benchmark (due to GIN indexes creation
> > being more optimized) so we can not compare directly with our previous
> > bench. We went from 15m 11s down to 59 seconds.
> > If needed, we can provide scripts to reproduce this situation. It is quite
> > simple: add ~20 walsenders doing logical replication in database A, and
> > then generate a lot of data in database B. The walsenders will be woken
> > up by the activity on database B, but not sending it thus keeping hitting
> > the same locks.
>
> Quite impressive speedup!

Hi

Thank you for your comments.

Attached to this email is a patch with better comments regarding the
XLogSendLogical change.

We've spent quite some time yesterday benching it again, this time with
changes that must be fully processed by the decoder. The speed-up is obviously
much smaller, we are only ~5% faster than without the patch.

Regards
Attachment
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Julien Rouhaud
Date:
On Sat, Dec 28, 2019 at 1:56 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>
> Thank you for your comments.
> Attached to this email is a patch with better comments regarding the
> XLogSendLogical change.

Arguably the first test to compare to InvalidXLogRecPtr is unneeded, as any
value of EndRecPtr is greater or equal than that value. It will only save at
best 1 GetFlushRecPtr() per walsender process lifetime, so I'm not sure it's
worth arguing about it. Other than that I still think that it's a
straightforward optimization that brings nice speedup, and I don't see any
problem with this patch. I think that given the time of the year you should
create a commitfest entry for this patch to make sure it won't be forgotten
(and obviously I'll mark it as RFC, unless someone objects by then).

> We've spent quite some time yesterday benching it again, this time with
> changes that must be fully processed by the decoder. The speed-up is obviously
> much smaller, we are only ~5% faster than without the patch.

I'm assuming that it's benchmarking done with multiple logical slots?
Anyway, a 5% speedup in the case that this patch is not aimed to optimize is
quite nice!
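For readers following along, the check under discussion is presumably shaped roughly as below (an illustrative rendering, not the actual attachment). Since InvalidXLogRecPtr is defined as 0 and XLogRecPtr is unsigned, EndRecPtr >= flushPtr is already true the first time through, so the explicit InvalidXLogRecPtr comparison is implied by the second test, which is the point being made above:

    /* Two-branch form (hypothetical rendering of the patch under review): */
    if (flushPtr == InvalidXLogRecPtr ||
        logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        flushPtr = GetFlushRecPtr();

    /* Behaves the same without the InvalidXLogRecPtr test: */
    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        flushPtr = GetFlushRecPtr();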
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Pierre Ducroquet
Date:
On Sunday, December 29, 2019 1:32:31 PM CET Julien Rouhaud wrote:
> On Sat, Dec 28, 2019 at 1:56 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
> > Thank you for your comments.
> > Attached to this email is a patch with better comments regarding the
> > XLogSendLogical change.
>
> Arguably the first test to compare to InvalidXLogRecPtr is unneeded,
> as any value of EndRecPtr is greater or equal than that value. It
> will only save at best 1 GetFlushRecPtr() per walsender process
> lifetime, so I'm not sure it's worth arguing about it. Other than
> that I still think that it's a straightforward optimization that
> brings nice speedup, and I don't see any problem with this patch. I
> think that given the time of the year you should create a commitfest
> entry for this patch to make sure it won't be forgotten (and obviously
> I'll mark it as RFC, unless someone objects by then).
>
> > We've spent quite some time yesterday benching it again, this time with
> > changes that must be fully processed by the decoder. The speed-up is
> > obviously much smaller, we are only ~5% faster than without the patch.
>
> I'm assuming that it's benchmarking done with multiple logical slots?
> Anyway, a 5% speedup in the case that this patch is not aimed to
> optimize is quite nice!

I've created a commitfest entry for this patch:
https://commitfest.postgresql.org/26/2403/

I would like to know if it would be acceptable to backport this to PostgreSQL
12. I still have to write a clean benchmark for that (our previous benchmarks
are either PG 10 or PG 12 specific), but the change from Thomas Munro that
removed the calls to PostmasterIsAlive is very likely to have the same
side-effect we observed on PG 10 when patching IsAlive, i.e. moving the
pressure from the pipe reads to the PostgreSQL locks between processes, and
that made the whole process slower: the benchmark showed a serious regression,
going from 3m45s to 5m15s to decode the test transactions.

Regards

Pierre
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Tom Lane
Date:
Pierre Ducroquet <p.psql@pinaraf.info> writes:
> Attached to this email is a patch with better comments regarding the
> XLogSendLogical change.

Hi,

This patch entirely fails to apply for me (and for the cfbot too).
It looks like (a) it's missing a final newline and (b) all the tabs
have been mangled into spaces, and not correctly mangled either.
I could probably reconstruct a workable patch if I had to, but
it seems likely that it'd be easier for you to resend it with a
little more care about attaching an unmodified attachment.

As for the question of back-patching, it seems to me that it'd
likely be reasonable to put this into v12, but probably not
further back. There will be no interest in back-patching
commit cfdf4dc4f, and it seems like the argument for this
patch is relatively weak without that.

            regards, tom lane
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Pierre Ducroquet
Date:
On Monday, January 6, 2020 6:57:33 PM CET Tom Lane wrote:
> Pierre Ducroquet <p.psql@pinaraf.info> writes:
> > Attached to this email is a patch with better comments regarding the
> > XLogSendLogical change.
>
> Hi,
> This patch entirely fails to apply for me (and for the cfbot too).
> It looks like (a) it's missing a final newline and (b) all the tabs
> have been mangled into spaces, and not correctly mangled either.
> I could probably reconstruct a workable patch if I had to, but
> it seems likely that it'd be easier for you to resend it with a
> little more care about attaching an unmodified attachment.
>
> As for the question of back-patching, it seems to me that it'd
> likely be reasonable to put this into v12, but probably not
> further back. There will be no interest in back-patching
> commit cfdf4dc4f, and it seems like the argument for this
> patch is relatively weak without that.
>
> regards, tom lane

Hi

My deepest apologies for the patch being broken, I messed up when transferring
it between my computers after altering the comments. The verbatim one attached
to this email applies with no issue on current HEAD.

The patch regarding PostmasterIsAlive is completely pointless since v12, where
the function was rewritten, and was included only to help reproduce the issue
on older versions. Back-patching the walsender patch further than v12 would
imply back-patching all the machinery introduced for PostmasterIsAlive
(9f09529952) or another intrusive change there, a too big risk indeed.

Regards

Pierre
Attachment
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Julien Rouhaud
Date:
On Mon, Jan 6, 2020 at 7:57 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>
> On Monday, January 6, 2020 6:57:33 PM CET Tom Lane wrote:
> > Pierre Ducroquet <p.psql@pinaraf.info> writes:
> > > Attached to this email is a patch with better comments regarding the
> > > XLogSendLogical change.
> >
> > Hi,
> > This patch entirely fails to apply for me (and for the cfbot too).
> > It looks like (a) it's missing a final newline and (b) all the tabs
> > have been mangled into spaces, and not correctly mangled either.
> > I could probably reconstruct a workable patch if I had to, but
> > it seems likely that it'd be easier for you to resend it with a
> > little more care about attaching an unmodified attachment.
> >
> > As for the question of back-patching, it seems to me that it'd
> > likely be reasonable to put this into v12, but probably not
> > further back. There will be no interest in back-patching
> > commit cfdf4dc4f, and it seems like the argument for this
> > patch is relatively weak without that.
> >
> > regards, tom lane
>
> Hi
>
> My deepest apologies for the patch being broken, I messed up when transferring
> it between my computers after altering the comments. The verbatim one attached
> to this email applies with no issue on current HEAD.
> The patch regarding PostmasterIsAlive is completely pointless since v12 where
> the function was rewritten, and was included only to help reproduce the issue
> on older versions. Back-patching the walsender patch further than v12 would
> imply back-patching all the machinery introduced for PostmasterIsAlive
> (9f09529952) or another intrusive change there, a too big risk indeed.

+1, backpatch to v12 looks sensible.
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Marc
Date:
Hi,
I spent a little bit of time trying to explain the problem we are facing
clearly, and to provide a reproducible benchmark. So here it is.

What triggered our investigation is that we have a PostgreSQL cluster
containing about 15 databases, most of them being used as sources for logical
replication. This means we have about as many WAL senders active on the
cluster at the same time.

What we saw was that we had very high spikes of CPU activity, with a very high
level of SYS (up to 50% of the whole system was SYS, with 100% load on all
CPUs) on the server, which is a dual Xeon Silver 4110, so 16 cores, 32
threads. That seemed insane as our usual load isn't that high on average
(around 20% total CPU use, with ~1% of SYS), and is mostly on 2 of those 15
databases. This mostly occurred when creating an index, or doing batch
updates, COPYs, etc. And all WAL senders consumed about the same amount of
CPU, even those connected to databases where nothing happened.

While testing, we also noticed things were worse with PG 12 than with PG 10.
That was the reason for the first patch Pierre posted, which was mostly a way
to get PostmasterIsAlive out of the way on PG 10, so that both versions
behaved the same and we could confirm what was different between them. That
was the first benchmark we did, the one Pierre posted a few days ago.

With the second patch (reducing calls to GetFlushRecPtr), on PostgreSQL 12,
with statements affecting lots of records at once, we managed to reduce the
WAL senders' consumption by a factor of 15 (if the patch is correct of
course). SYS was down to more sensible (near 0) values. WAL senders for
databases which had no decoding to do no longer consumed much; only the one
connected to the database doing the work used a lot of CPU, but that's
expected. This solves the problem we are facing. Without this, we won't be
able to upgrade to PG 12, as the impact of GetFlushRecPtr is even worse there
than with PG 10.

I've now tried to measure the impact of the patch on activity spread more
evenly across several databases, where the contention on GetFlushRecPtr is
less severe, to see if there are wins in all cases. Simple scripts to test
this are provided as an attachment. Just set the "max" environment variable to
the number of databases/WAL senders you want, then run create_logical.sh
(creates the databases and the slots), connect_logical.sh (connects
pg_recvlogical processes to these slots), and run_stress.sh (runs a pgbench on
each database, each doing 100000 transactions, all in parallel).
drop_logical.sh does the cleanup. Sorry, they are very basic...

Here are the results on patched/unpatched PG 12. You have the messages from
all pgbench runs, then the consumption of all WAL senders at the end of a run.
As you can see, the pgbench runs are ~20% faster. That's because with
unpatched we are CPU starved (the server is at 100% CPU), which we aren't with
patched. The WAL senders needed about half the CPU in the patched case, as
shown by pidstat.
UNPATCHED:

All 20 pgbench runs report the same configuration: transaction type <builtin:
TPC-B (sort of)>, scaling factor 1, simple query mode, 1 client, 1 thread,
100000 transactions per client, 100000/100000 transactions processed. Per-run
latency and throughput:

  latency average (ms)   tps (incl. connections)   tps (excl. connections)
  1.943                  514.796454                514.805610
  1.949                  513.130790                513.168135
  1.950                  512.946425                512.961746
  1.951                  512.643065                512.678765
  1.953                  512.159794                512.178075
  1.953                  512.024962                512.034590
  1.953                  512.016413                512.034438
  1.954                  511.728080                511.760138
  1.954                  511.655046                511.678533
  1.955                  511.604767                511.617562
  1.955                  511.525593                511.558150
  1.955                  511.496498                511.505871
  1.956                  511.334434                511.363141
  1.956                  511.256908                511.284577
  1.957                  511.021219                511.041905
  1.957                  510.977176                511.004730
  1.958                  510.818341                510.870735
  1.958                  510.719611                510.741600
  1.959                  510.460165                510.504649
  1.961                  509.962481                509.978043

~$ pidstat -C postgres -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy)   01/06/2020   _x86_64_   (32 CPU)

02:39:30 PM  UID    PID   usr-ms  system-ms  guest-ms  Command
02:39:30 PM  111  78644  2232190     284870         0  /tmp/pg-12-unpatched/bin/postgres
02:39:30 PM  111  78646      400        340         0  postgres: checkpointer
02:39:30 PM  111  78647      530       1330         0  postgres: background writer
02:39:30 PM  111  78648      300        150         0  postgres: walwriter
02:39:30 PM  111  78649       30         40         0  postgres: autovacuum launcher
02:39:30 PM  111  78650      130        440         0  postgres: stats collector
02:39:30 PM  111  78790    86340      20560         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78792    89170      23270         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78794    86990      20740         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78795    91900      22540         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78797    92000      23190         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78799    94060      22520         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78801    95300      21500         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78803    93120      21360         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78805    95420      21920         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78807    94400      21350         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78809    88850      20390         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78811    90030      20690         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78812    94310      22660         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78813    94080      22470         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78814    95370      21520         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78815    94780      21470         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78816    92440      21080         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78817    94360      20700         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78818    92230      20760         0  postgres: walsender postgres [local] idle
02:39:30 PM  111  78819    90280      20780         0  postgres: walsender postgres [local] idle

PATCHED:

Again, all 20 pgbench runs report the same configuration as above. Per-run
latency and throughput:

  latency average (ms)   tps (incl. connections)   tps (excl. connections)
  1.680                  595.090858                595.131449
  1.683                  594.156492                594.198624
  1.684                  593.927387                593.946829
  1.686                  593.209506                593.257556
  1.686                  593.144977                593.162018
  1.686                  593.084403                593.122178
  1.686                  593.134657                593.199432
  1.687                  592.908760                592.923386
  1.687                  592.802027                592.814300
  1.687                  592.678874                592.769759
  1.687                  592.642501                592.723261
  1.688                  592.249597                592.262962
  1.690                  591.867795                591.958672
  1.690                  591.835950                591.908940
  1.690                  591.799816                591.824497
  1.690                  591.738978                591.780258
  1.691                  591.530490                591.570876
  1.691                  591.452142                591.498424
  1.693                  590.674305                590.708951
  1.693                  590.517240                590.563531

~$ pidstat -C postgres -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy)   01/06/2020   _x86_64_   (32 CPU)

02:29:02 PM  UID    PID   usr-ms  system-ms  guest-ms  Command
02:29:02 PM  111  75810  2185430     294190         0  /tmp/pg-12-patched/bin/postgres
02:29:02 PM  111  75812      410        320         0  postgres: checkpointer
02:29:02 PM  111  75813      410       1230         0  postgres: background writer
02:29:02 PM  111  75814      270         80         0  postgres: walwriter
02:29:02 PM  111  75815       30         20         0  postgres: autovacuum launcher
02:29:02 PM  111  75816      130        360         0  postgres: stats collector
02:29:02 PM  111  75961    35890      27240         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75963    37390      27950         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75965    38360      28110         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75966    38350      28160         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75968    38370      28160         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75970    37820      28110         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75972    38250      27330         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75974    36870      27640         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75976    36890      26850         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75979    36920      26330         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75980    37090      27240         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75981    38040      28210         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75982    36530      27460         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75983    37560      27330         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75984    36660      27170         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75985    36370      27020         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75986    36960      27000         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75987    36460      26820         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75988    36290      27140         0  postgres: walsender postgres [local] idle
02:29:02 PM  111  75989    36320      26750         0  postgres: walsender postgres [local] idle

Regards,
Marc
Attachment
Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Mon, Jan 6, 2020 at 7:57 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>> My deepest apologies for the patch being broken, I messed up when transferring
>> it between my computers after altering the comments. The verbatim one attached
>> to this email applies with no issue on current HEAD.
>> The patch regarding PostmasterIsAlive is completely pointless since v12 where
>> the function was rewritten, and was included only to help reproduce the issue
>> on older versions. Back-patching the walsender patch further than v12 would
>> imply back-patching all the machinery introduced for PostmasterIsAlive
>> (9f09529952) or another intrusive change there, a too big risk indeed.

> +1, backpatch to v12 looks sensible.

Pushed with some minor cosmetic fiddling.

            regards, tom lane