Re: Logical replication keepalive flood - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Logical replication keepalive flood
Date
Msg-id 20210611.103747.1303885450661824696.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Logical replication keepalive flood  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Logical replication keepalive flood
List pgsql-hackers
At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> Good analysis. I think this analysis has shown that walsender is
> sending messages at top speed as soon as they are generated. So, I am
> wondering why there is any need to wait/sleep in such a workload. One
> possibility that occurred to me RecentFlushPtr is not updated and or
> we are not checking it aggressively. To investigate on that lines, can
> you check the behavior with the attached patch? This is just a quick
> hack patch to test whether we need to really wait for WAL a bit
> aggressively.

Yeah, anyway the comment for the caller site of WalSndKeepalive tells
that exiting out of the function *after* there is somewhat wrong.

> * possibly are waiting for a later location. So, before sleeping, we
> * send a ping containing the flush location. If the receiver is

But I nothing changed by moving the keepalive check to after the exit
check. (loc <= RecentFlushPtr).

And the patch also doesn't change the situation so much. The average
number of loops is reduced from 3 to 2 per call but the ratio between
total records and keepalives doesn't change.

previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151

Total records: 13208
reqsz: #sent/ #!sent/ #call: wr lag  / fl lag
   8:     4 /     4 /     4:    6448 /  268148
  16:     1 /     1 /     1:    8688 /  387320
  24:  1988 /  1987 /  1999:    6357 /  226163
 195:     1 /     0 /    20:     408 /    1647
7477:     2 /     0 /   244:      68 /     847
8225:     1 /     1 /     1:    7208 /    7208

So I checked how many bytes RecentFlushPtr is behind requested loc if
it is not advanced enough.

Total records: 15128
reqsz:  #sent/ #!sent/ #call: wr lag  / fl lag  / RecentFlushPtr lag
    8:     2 /     2 /     2:     520 /   60640 /       8
   16:     1 /     1 /     1:    8664 /   89336 /      16
   24:  2290 /  2274 /  2302:    5677 /  230583 /      23
  187:     1 /     0 /    40:       1 /    6118 /       1
 7577:     1 /     0 /    69:     120 /    3733 /      65
 8177:     1 /     1 /     1:    8288 /    8288 /    2673

So it's not a matter of RecentFlushPtr check. (Almost) Always when
WalSndWakeupRequest feels to need to send a keepalive, the function is
called before the record begins to be written.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: locking [user] catalog tables vs 2pc vs logical rep
Next
From: Noah Misch
Date:
Subject: Re: BF assertion failure on mandrill in walsender, v13