Re: Logical replication keepalive flood - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Logical replication keepalive flood
Date
Msg-id CAHut+PtyMBzweYUpb_QazVL6Uze2Yc5M5Ti2Xwee_eWM3Jrbog@mail.gmail.com
Whole thread Raw
In response to Re: Logical replication keepalive flood  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Logical replication keepalive flood
Re: Logical replication keepalive flood
Re: Logical replication keepalive flood
List pgsql-hackers
Hi.

By using Kyotaro's "counting" patch I was able to reproduce very
similar results to what he had earlier posted [1].

AFAIK I have the same test scenario that he was using.

Test setup:
- using async pub/sub
- subscription is for the pgbench_history table
- pgbench is run for 10 seconds
- config for all the wal_sender/receiver timeout GUCs are just default values


Results (HEAD + Kyotaro counting patch)
=======================================

[postgres@CentOS7-x64 ~]$ 2021-08-10 16:36:23.733 AEST [32436] LOG:
Total records: 18183
2021-08-10 16:36:23.733 AEST [32436] LOG:      8:     2 /     0 /
1:  440616 /  580320
2021-08-10 16:36:23.733 AEST [32436] LOG:     16:     4 /     8 /
4:    4524 /  288688
2021-08-10 16:36:23.733 AEST [32436] LOG:     24:  2916 /  5151 /
2756:   31227 /  323190
2021-08-10 16:36:23.733 AEST [32436] LOG:    187:     2 /     0 /
51:     157 /   10629
2021-08-10 16:36:23.733 AEST [32436] LOG:   2960:     1 /     0 /
1: 49656944 / 49656944
2021-08-10 16:36:23.733 AEST [32436] LOG:   7537:     2 /     0 /
36:     231 /    7028
2021-08-10 16:36:23.733 AEST [32436] LOG:   7577:     1 /     2 /
78:     106 /     106
2021-08-10 16:36:23.733 AEST [32436] LOG:   8280:     1 /     2 /
3:      88 /      88

//////

That base data is showing there are similar numbers of keepalives sent
as there are calls made to WalSndWaitForWal. IIUC it means that mostly
the loop is sending the special keepalives on the *first* iteration,
but by the time of the *second* iteration the ProcessRepliesIfAny()
will have some status already received, and so mostly sending another
keepalive will be deemed unnecessary.

Based on this, our idea was to simply skip sending the
WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA
the patch 0002 which does this skip.

With this skip patch (v1-0002) applied the same pgbench tests were run
again. The results look like below.

Results (HEAD + Kyotaro patch + Skip-first keepalive patch)
===========================================================

RUN #1
------
[postgres@CentOS7-x64 ~]$ 2021-08-11 16:32:59.827 AEST [20339] LOG:
Total records: 19367
2021-08-11 16:32:59.827 AEST [20339] LOG:     24:    10 /  9232 /
3098:      19 /     440
2021-08-11 16:32:59.827 AEST [20339] LOG:    102:     1 /     1 /
32:     257 /   16828
2021-08-11 16:32:59.827 AEST [20339] LOG:    187:     1 /     1 /
52:     155 /    9541

RUN #2
------
[postgres@CentOS7-x64 ~]$ 2021-08-11 16:36:03.983 AEST [25513] LOG:
Total records: 17815
2021-08-11 16:36:03.983 AEST [25513] LOG:     24:    73 /  8683 /
2958:    1647 /    3290
2021-08-11 16:36:03.983 AEST [25513] LOG:   8280:     1 /     1 /
3:      88 /      88

RUN #3
------
[postgres@CentOS7-x64 ~]$ 2021-08-11 16:39:27.655 AEST [31061] LOG:
Total records: 19906
2021-08-11 16:39:27.655 AEST [31061] LOG:     24:    18 /  8546 /
2890:      61 /    1530
2021-08-11 16:39:27.655 AEST [31061] LOG:     83:     1 /     3 /
1:    8664 /    8664

~~

This data shows the special keepalives are now greatly reduced from
1000s to just 10s.

Thoughts?

------
[1] https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: call popcount32/64 directly on non-x86 platforms
Next
From: Masahiko Sawada
Date:
Subject: Re: Small documentation improvement for ALTER SUBSCRIPTION