Thread: Logical replication keepalive flood

Logical replication keepalive flood

From
Abbas Butt
Date:
Hi,
I have observed the following behavior with PostgreSQL 13.3.

The WAL sender process sends approximately 500 keepalive messages per second to pg_recvlogical.
These keepalive messages are totally unnecessary.
Keepalives should be sent only if there is no network traffic and a certain time (half of wal_sender_timeout) passes.
These keepalive messages not only choke the network but also impact the performance of the receiver,
because the receiver has to process the received message and then decide whether to reply to it or not.
The receiver remains busy doing this activity 500 times a second.
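
For reference, walsender.c already has a timeout-driven path for this, WalSndKeepaliveIfNecessary(),
which only sends a keepalive once half of wal_sender_timeout has elapsed without a reply. Roughly
(paraphrased and slightly abridged from my reading of the 13.3 sources, so treat it as a sketch):

static void
WalSndKeepaliveIfNecessary(void)
{
    TimestampTz ping_time;

    /* Keepalives are disabled if the timeout machinery is not in force */
    if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
        return;

    if (waiting_for_ping_response)
        return;

    /*
     * If half of wal_sender_timeout has lapsed without receiving any reply
     * from the receiver, send a keepalive requesting an immediate reply.
     */
    ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                            wal_sender_timeout / 2);
    if (last_processing >= ping_time)
    {
        WalSndKeepalive(true);

        /* Try to flush pending output to the client */
        if (pq_flush_if_writable() != 0)
            WalSndShutdown();
    }
}

The keepalives described above bypass this timeout check entirely.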

Investigation revealed that the following code fragment in function WalSndWaitForWal in file walsender.c is responsible for sending these frequent keepalives:

if (MyWalSnd->flush < sentPtr &&
    MyWalSnd->write < sentPtr &&
    !waiting_for_ping_response)
        WalSndKeepalive(false);

waiting_for_ping_response is normally false, and flush and write will always be less than sentPtr (the receiver's reported LSNs cannot be ahead of what the server has sent).

Here are the steps to reproduce:
1. Start the database server.
2. Setup pgbench tables.
  ./pgbench -i -s 50 -h 192.168.5.140 -p 7654 -U abbas postgres
3. Create a logical replication slot.
   SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');
4. Start pg_recvlogical.
  ./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas --start -f -
5. Run pgbench
  ./pgbench -U abbas -h 192.168.5.140 -p 7654  -c 2 -j 2 -T 1200 -n postgres
6. Observe network traffic to find the keepalive flood.

Alternatively, modify the above code fragment as follows to see approximately 500 keepalive log messages a second:

if (MyWalSnd->flush < sentPtr &&
    MyWalSnd->write < sentPtr &&
    !waiting_for_ping_response)
{
    elog(LOG, "[Keepalive]  wrt ptr %X/%X  snt ptr %X/%X ",
               (uint32) (MyWalSnd->write >> 32),
               (uint32) MyWalSnd->write,
               (uint32) (sentPtr >> 32),
               (uint32) sentPtr);
    WalSndKeepalive(false);
}

Opinions?

--
Abbas
Senior Architect

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in 
> Hi,
> I have observed the following behavior with PostgreSQL 13.3.
> 
> The WAL sender process sends approximately 500 keepalive messages per
> second to pg_recvlogical.
> These keepalive messages are totally un-necessary.
> Keepalives should be sent only if there is no network traffic and a certain
> time (half of wal_sender_timeout) passes.
> These keepalive messages not only choke the network but also impact the
> performance of the receiver,
> because the receiver has to process the received message and then decide
> whether to reply to it or not.
> The receiver remains busy doing this activity 500 times a second.

I can reproduce the problem.

> On investigation it is revealed that the following code fragment in
> function WalSndWaitForWal in file walsender.c is responsible for sending
> these frequent keepalives:
> 
> if (MyWalSnd->flush < sentPtr &&
>     MyWalSnd->write < sentPtr &&
>     !waiting_for_ping_response)
>         WalSndKeepalive(false);

The immediate cause is that pg_recvlogical doesn't send a reply before
sleeping. Currently it sends replies at 10-second intervals.
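
(For context, the receiver-side timing lives in StreamLogicalLog() in pg_recvlogical.c. Roughly,
and quoting from memory rather than verbatim, feedback only goes out once the -s interval has elapsed:

        if (standby_message_timeout > 0 &&
            feTimestampDifferenceExceeds(last_status, now,
                                         standby_message_timeout))
        {
            /* Time to send feedback! */
            if (!sendFeedback(conn, now, true, false))
                goto error;

            last_status = now;
        }

so with the default -s 10 nothing may be acknowledged for up to 10 seconds.)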

So the attached first patch stops the flood.

That said, I don't think it is intended that the logical walsender
sends keepalive packets with such a high frequency.  It happens
because the walsender actually doesn't wait at all: it waits on
WL_SOCKET_WRITEABLE, and the keepalive packet inserted just before
is always pending.

So, as in the attached second patch, we should try to flush out the
keepalive packets if possible before checking pq_is_send_pending().

Either one alone can "fix" the issue, but I think each of them is
reasonable by itself.

Any thoughts, suggestions and/or opinions?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec160e8..4497ff1071 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -362,6 +362,10 @@ StreamLogicalLog(void)
                 goto error;
             }
 
+            /* send reply for all writes so far */
+            if (!flushAndSendFeedback(conn, &now))
+                goto error;
+            
             FD_ZERO(&input_mask);
             FD_SET(PQsocket(conn), &input_mask);

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..fcea56d1c1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1469,6 +1469,9 @@ WalSndWaitForWal(XLogRecPtr loc)
         /* Send keepalive if the time has come */
         WalSndKeepaliveIfNecessary();
 
+        /* We may have queued a keep alive packet. flush it before sleeping. */
+        pq_flush_if_writable();
+        
         /*
          * Sleep until something happens or we time out.  Also wait for the
          * socket becoming writable, if there's still pending output.

Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> >
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
>
> I can reproduce the problem.
>
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> >
> > if (MyWalSnd->flush < sentPtr &&
> >     MyWalSnd->write < sentPtr &&
> >     !waiting_for_ping_response)
> >         WalSndKeepalive(false);
>
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
>

Yeah, but one can use the -s option to send it at shorter intervals.

> So the attached first patch stops the flood.
>

I am not sure sending feedback every time before sleep is a good idea,
this might lead to unnecessarily sending more messages. Can we try by
using one-second interval with -s option to see how it behaves? As a
matter of comparison the similar logic in workers.c uses
wal_receiver_timeout to send such an update message rather than
sending it every time before sleep.

> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency.  It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
>
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
>

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();

+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();

We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Abbas Butt
Date:


On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> >
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
>
> I can reproduce the problem.
>
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> >
> > if (MyWalSnd->flush < sentPtr &&
> >     MyWalSnd->write < sentPtr &&
> >     !waiting_for_ping_response)
> >         WalSndKeepalive(false);
>
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
>

Yeah, but one can use -s option to send it at lesser intervals.

That option only affects pg_recvlogical; it will not stop the server from sending keepalives too frequently.
By default the status interval is 10 secs, yet we are still getting 500 keepalives a second from the server.
 

> So the attached first patch stops the flood.
>

I am not sure sending feedback every time before sleep is a good idea,
this might lead to unnecessarily sending more messages. Can we try by
using one-second interval with -s option to see how it behaves? As a
matter of comparison the similar logic in workers.c uses
wal_receiver_timeout to send such an update message rather than
sending it every time before sleep.

> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency.  It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
>
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
>

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();

+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();

We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?

--
With Regards,
Amit Kapila.


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in 
> On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > The immediate cause is pg_recvlogical doesn't send a reply before
> > > sleeping. Currently it sends replies every 10 seconds intervals.
> > >
> >
> > Yeah, but one can use -s option to send it at lesser intervals.
> >
> 
> That option can impact pg_recvlogical, it will not impact the server
> sending keepalives too frequently.
> By default the status interval is 10 secs, still we are getting 500
> keepalives a second from the server.
>
> > > So the attached first patch stops the flood.
> > >
> >
> > I am not sure sending feedback every time before sleep is a good idea,
> > this might lead to unnecessarily sending more messages. Can we try by
> > using one-second interval with -s option to see how it behaves? As a
> > matter of comparison the similar logic in workers.c uses
> > wal_receiver_timeout to send such an update message rather than
> > sending it every time before sleep.

The logical walreceiver sends feedback when walrcv_receive() doesn't
receive a byte.  If it's not good for pg_recvlogical to do the same
thing, do we need to improve the logical walsender's behavior as well?

> > > That said, I don't think it is not intended that logical walsender
> > > sends keep-alive packets with such a high frequency.  It happens
> > > because walsender actually doesn't wait at all because it waits on
> > > WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> > > is always pending.
> > >
> > > So as the attached second, we should try to flush out the keep-alive
> > > packets if possible before checking pg_is_send_pending().
> > >
> >
> > /* Send keepalive if the time has come */
> >   WalSndKeepaliveIfNecessary();
> >
> > + /* We may have queued a keep alive packet. flush it before sleeping. */
> > + pq_flush_if_writable();
> >
> > We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
> > after sending the keep-alive message, so not sure how this helps?

No. WalSndKeepaliveIfNecessary calls it only when the walsender has not
received a reply message for a long time. So the keepalive sent by the
direct call to WalSndKeepalive() from WalSndWaitForWal is not flushed
out in most cases, which causes the flood.

I rechecked all callers of WalSndKeepalive().

WalSndKeepalive()
+- *WalSndWaitForWal
+- ProcessStandbyReplyMessage
|+- ProcessStandbyMessage
| +- ProcessRepliesIfAny
|  +- $WalSndWriteData
|  +- *WalSndWaitForWal
|  +- WalSndLoop
|    (calls pq_flush_if_writable() after sending the packet, but the
|     keepalive packet prevents following stream data from being sent
|     since the pending keepalive packet causes pq_is_send_pending()
|     to return (falsely) true.)
+- WalSndDone
 +- *WalSndLoop
+- WalSndKeepaliveIfNecessary
   (always calls pq_flush_if_writable(), but only after calling WalSndKeepalive())

The callers prefixed by '*' above mistakenly conclude that some of the
data they sent is still pending even when the only pending bytes are the
keepalive packet. Of course the keepalive packets should be sent
*before* the sleep, but the unsent keepalive packet prevents the callers
from sleeping, so they immediately send another keepalive and repeat
that until the condition changes. (The caller prefixed by '$' also
enters a sleep before flushing, but doesn't repeat sending keepalives.)

The callers are forgetting that a keepalive packet may be queued but not
flushed after calling WalSndKeepalive.  So a more sensible fix would be
to call pq_flush_if_writable() in WalSndKeepalive?
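
Something like the following, which is just a sketch on top of the existing
WalSndKeepalive() body (untested, and the error-handling choice is debatable):

static void
WalSndKeepalive(bool requestReply)
{
    elog(DEBUG2, "sending replication keepalive");

    /* construct the message... */
    resetStringInfo(&output_message);
    pq_sendbyte(&output_message, 'k');
    pq_sendint64(&output_message, sentPtr);
    pq_sendint64(&output_message, GetCurrentTimestamp());
    pq_sendbyte(&output_message, requestReply ? 1 : 0);

    /* ... and send it wrapped in CopyData */
    pq_putmessage_noblock('d', output_message.data, output_message.len);

    /* new: try to flush it out right away so it does not linger as pending output */
    if (pq_flush_if_writable() != 0)
        WalSndShutdown();
}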

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in 
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
> 
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte.  If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?

For clarity: only the change on the walsender side can stop the
flood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

From
Abbas Butt
Date:
Hi Kyotaro,
I have tried to test your patches. Unfortunately, even after applying the patches
the WAL sender is still sending keepalive messages too frequently.
In my opinion the fix is to make sure that wal_sender_timeout/2 has passed before sending
the keepalive message in the code fragment I had shared earlier.
In other words we should replace the call to
WalSndKeepalive(false);
with
WalSndKeepaliveIfNecessary(false);

Do you agree with the suggested fix?

On Tue, Jun 8, 2021 at 10:09 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
>
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte.  If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?

For the clarity, only the change in the walsender side can stop the
flood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
Hi.

On 2021/06/08 21:21, Abbas Butt wrote:
> Hi Kyotaro,
> I have tried to test your patches. Unfortunately even after applying the
> patches
> the WAL Sender is still sending too frequent keepalive messages.

Sorry for the bogus patch.  I must have seen something impossible.

The keepalive packet is immediately flushed explicitly, so Amit is
right that no additional pq_flush_if_writable() is needed.

> In my opinion the fix is to make sure that wal_sender_timeout/2 has passed
> before sending
> the keepalive message in the code fragment I had shared earlier.
> In  other words we should replace the call to
> WalSndKeepalive(false);
> with
> WalSndKeepaliveIfNecessary(false);
>
> Do you agree with the suggested fix?

I'm afraid not. The same thing is done unconditionally just after that.

The issue - if it actually is one - is that we send a keepalive packet
before a quite short sleep.

We really want to send it if the sleep gets long, but we cannot predict
that before entering the sleep.

Let me think a little more on this..

regards.





Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> The issue - if actually it is - we send a keep-alive packet before a
> quite short sleep.
> 
> We really want to send it if the sleep gets long but we cannot predict
> that before entering a sleep.
> 
> Let me think a little more on this..

After some investigation, I found out that the keepalives are sent
almost always after XLogSendLogical requests the *next* record. In
most of the cases the record is not yet inserted at the request time
but is inserted very soon after (within single-digit milliseconds). It
doesn't seem to be expected that this happens with such a high
frequency when XLogSendLogical is keeping up with the bleeding edge of
WAL records.

It is completely unpredictable when the next record comes, so we
cannot decide whether to send a keepalive or not at the current
timing.

Since we want to send a keepalive when we have had nothing to send for
a while, it is a rather different thing to keep sending keepalives at
some interval while the loop is busy.

As a possible solution, the attached patch splits the sleep into two
pieces. If the first sleep reaches its timeout, we send a keepalive and
then sleep for the remaining time. The first timeout is quite
arbitrary, but a keepalive rate of at most 4Hz doesn't look so bad to me.

Is it acceptable?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..49b3c0d4e2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -105,6 +105,9 @@
  */
 #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
+/* Minimum idle time for sending an idle-time keepalive in milliseconds */
+#define KEEPALIVE_TIMEOUT 250
+
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
 
@@ -244,7 +247,7 @@ static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
-static void WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
+static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid);
@@ -1428,19 +1431,6 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (got_STOPPING)
             break;
 
-        /*
-         * We only send regular messages to the client for full decoded
-         * transactions, but a synchronous replication and walsender shutdown
-         * possibly are waiting for a later location. So, before sleeping, we
-         * send a ping containing the flush location. If the receiver is
-         * otherwise idle, this keepalive will trigger a reply. Processing the
-         * reply will update these MyWalSnd locations.
-         */
-        if (MyWalSnd->flush < sentPtr &&
-            MyWalSnd->write < sentPtr &&
-            !waiting_for_ping_response)
-            WalSndKeepalive(false);
-
         /* check whether we're done */
         if (loc <= RecentFlushPtr)
             break;
@@ -1483,6 +1473,39 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (pq_is_send_pending())
             wakeEvents |= WL_SOCKET_WRITEABLE;
 
+        /*
+         * We only send regular messages to the client for full decoded
+         * transactions, but a synchronous replication and walsender shutdown
+         * possibly are waiting for a later location. So, before sleeping, we
+         * send a ping containing the flush location. If the receiver is
+         * otherwise idle, this keepalive will trigger a reply. Processing the
+         * reply will update these MyWalSnd locations. If the sleep is shorter
+         * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
+         * prevent it from getting too-frequent.
+         */
+        if (MyWalSnd->flush < sentPtr &&
+            MyWalSnd->write < sentPtr &&
+            !waiting_for_ping_response)
+        {
+            if (sleeptime > KEEPALIVE_TIMEOUT)
+            {
+                int r;
+
+                r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
+                               WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+
+                if (r != 0)
+                    continue;
+
+                sleeptime -= KEEPALIVE_TIMEOUT;
+            }
+
+            WalSndKeepalive(false);
+
+            if (pq_flush_if_writable() != 0)
+                WalSndShutdown();
+        }
+        
         WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL);
     }
 
@@ -3136,15 +3159,18 @@ WalSndWakeup(void)
  * composed of optional WL_SOCKET_WRITEABLE and WL_SOCKET_READABLE flags.  Exit
  * on postmaster death.
  */
-static void
+static int
 WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
 {
     WaitEvent    event;
+    int            ret;
 
     ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
-    if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
-        (event.events & WL_POSTMASTER_DEATH))
+    ret = WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event);
+    if (ret == 1 && (event.events & WL_POSTMASTER_DEATH))
         proc_exit(1);
+
+    return ret;
 }
 
 /*

Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> >
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> >
> > Let me think a little more on this..
>
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
>

Are these keepalive messages sent at the same frequency for
subscribers too? Basically, I wanted to check: if we have logical
replication set up between 2 nodes, then do we send this flood of
keepalive messages as well? If not, then why is it different in the
case of pg_recvlogical? Is it possible that the write/flush location is
not updated at the pace at which we expect? Please see commit 41d5f8ad73
which seems to be talking about a similar problem.

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Abbas Butt
Date:
Hi,

On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> >
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> >
> > Let me think a little more on this..
>
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
>

Are these keepalive messages sent at the same frequency for
subscribers too?

Yes, I have tested it with one publisher and one subscriber.
The moment I start a pgbench session I can see keepalive messages being sent and replied to by the subscriber at the same frequency.
 
Basically, I wanted to check: if we have logical
replication set up between 2 nodes, then do we send this flood of
keepalive messages as well?

Yes we do.
 
If not, then why is it different in the case of
pg_recvlogical?

Nothing; the WAL sender behaviour is the same in both cases.
 
Is it possible that the write/flush location is not
updated at the pace at which we expect?

Well, it is async replication. The receiver can choose to update LSNs at its own will, say at 10-minute intervals.
That should only impact the size of the WAL retained by the server.

Please see commit 41d5f8ad73
which seems to be talking about a similar problem.

That commit does not address this problem.
 

--
With Regards,
Amit Kapila.


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in 
> 
> On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Does these keepalive messages are sent at the same frequency even for
> > subscribers?
> 
> Yes, I have tested it with one publisher and one subscriber.
> The moment I start pgbench session I can see keepalive messages sent and
> replied by the subscriber with same frequency.
> 
> > Basically, I wanted to check if we have logical
> > replication set up between 2 nodes then do we send these keep-alive
> > messages flood?
> 
> Yes we do.
> 
> > If not, then why is it different in the case of
> > pg_recvlogical?
> 
> Nothing, the WAL sender behaviour is same in both cases.
> 
> 
> > Is it possible that the write/flush location is not
> > updated at the pace at which we expect?

Yes. MyWalSnd->flush/write are updated fairly frequently, but
MyWalSnd->write is still behind sentPtr by anywhere from thousands of
bytes up to a bit less than 1 block (1 block = 8192 bytes). (Flush lags
are larger than write lags, of course.)

I counted how many times keepalives are sent for each request length
to logical_read_xlog_page() during a 10-second pgbench run replicating
pgbench_history, using the attached change.

size:  sent /notsent/ calls: write lag/ flush lag
   8:     3 /     6 /     3:    5960 /  348962
  16:     1 /     2 /     1:     520 /  201096
  24:  2425 /  4852 /  2461:    5259 /  293569
  98:     2 /     0 /    54:       5 /    1050
 187:     2 /     0 /    94:       0 /    1060
4432:     1 /     0 /     1: 410473592 / 410473592
7617:     2 /     0 /    27:     317 /   17133
8280:     1 /     2 /     4:     390 /     390

Where,

size    is requested data length to logical_read_xlog_page()

sent    is the number of keepalives sent in the loop in WalSndWaitForWal

notsent is the number of runs of the loop in WalSndWaitForWal without
        sending a keepalive

calls   is the number of calls to WalSndWaitForWal

write lag is the number of bytes MyWalSnd->write is behind sentPtr at the
        first run of the loop, per call to logical_read_xlog_page.

flush lag is the same as the above, but for MyWalSnd->flush.

Maybe the line with size=4432 is the initial fetch of WAL.

So these numbers show that WalSndWaitForWal is called almost only when
starting to fetch a record, and in that case the function runs the
loop three times and sends one keepalive on average.

> Well, it is async replication. The receiver can choose to update LSNs at
> its own will, say after 10 mins interval.
> It should only impact the size of WAL retained by the server.
> 
> Please see commit 41d5f8ad73
> > which seems to be talking about a similar problem.
> >
> 
> That commit does not address this problem.

Yeah, at least for me, WalSndWaitForWal sends about one keepalive per call.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..ee78116e79 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -571,6 +571,7 @@ err:
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
  */
+int hogestate = -1;
 static int
 ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 {
@@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
     {
         XLogRecPtr    targetSegmentPtr = pageptr - targetPageOff;
 
+        hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr;
         readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
                                            state->currRecPtr,
                                            state->readBuf);
@@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
      * First, read the requested data length, but at least a short page header
      * so that we can validate it.
      */
+    hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr;
     readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
                                        state->currRecPtr,
                                        state->readBuf);
@@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
     /* still not enough */
     if (readLen < XLogPageHeaderSize(hdr))
     {
+        hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr;
         readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
                                            state->currRecPtr,
                                            state->readBuf);
@@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
             goto err;
     }
 
+    hogestate = -1;
     /*
      * Now that we know we have the full header, validate it.
      */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..0de10c4a31 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1363,17 +1363,45 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
  * if we detect a shutdown request (either from postmaster or client)
  * we will return early, so caller must always check.
  */
+unsigned long counts[32768][3] = {0};
+unsigned long lagw[32768] = {0};
+unsigned long lagf[32768] = {0};
+
+void
+PrintCounts(void)
+{
+    int i = 0;
+    for (i = 0 ; i < 32768 ; i++)
+    {
+        if (counts[i][0] + counts[i][1] + counts[i][2] > 0)
+        {
+            unsigned long wl = 0, fl = 0;
+            if (counts[i][1] > 0)
+            {
+                wl = lagw[i] / counts[i][0];
+                fl = lagf[i] / counts[i][0];
+            
+                ereport(LOG, (errmsg ("[%5d]: %5lu / %5lu / %5lu: %5lu  %5lu",
+                                      i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true)));
+            }
+        }
+    }
+}
+
 static XLogRecPtr
 WalSndWaitForWal(XLogRecPtr loc)
 {
     int            wakeEvents;
     static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+    extern int hogestate;
+    bool        lagtaken = false;
 
     /*
      * Fast path to avoid acquiring the spinlock in case we already know we
      * have enough WAL available. This is particularly interesting if we're
      * far behind.
      */
+    counts[hogestate][0]++;
     if (RecentFlushPtr != InvalidXLogRecPtr &&
         loc <= RecentFlushPtr)
         return RecentFlushPtr;
@@ -1439,7 +1467,39 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (MyWalSnd->flush < sentPtr &&
             MyWalSnd->write < sentPtr &&
             !waiting_for_ping_response)
+        {
+            if (hogestate >= 0)
+            {
+                counts[hogestate][1]++;
+                if (!lagtaken)
+                {
+                    lagf[hogestate] += sentPtr - MyWalSnd->flush;
+                    lagw[hogestate] += sentPtr - MyWalSnd->write;
+                    lagtaken = true;
+                }
+            }
+//            ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//                                  ka, na, ka + na,
+//                                  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//                                  LSN_FORMAT_ARGS(MyWalSnd->write),
+//                                  LSN_FORMAT_ARGS(sentPtr),
+//                                  waiting_for_ping_response,
+//                                  sentPtr - MyWalSnd->write)));
             WalSndKeepalive(false);
+        }
+        else
+        {
+            if (hogestate >= 0)
+                counts[hogestate][2]++;
+
+//            ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//                                  ka, na, ka + na,
+//                                  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//                                  LSN_FORMAT_ARGS(MyWalSnd->write),
+//                                  LSN_FORMAT_ARGS(sentPtr),
+//                                  waiting_for_ping_response,
+//                                  sentPtr - MyWalSnd->write)));
+        }
 
         /* check whether we're done */
         if (loc <= RecentFlushPtr)

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in 
> > 
> > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Is it possible that the write/flush location is not
> > > updated at the pace at which we expect?
> 
> Yes. MyWalSnd->flush/write are updated far frequently but still
> MyWalSnd->write is behind sentPtr by from thousands of bytes up to
> less than 1 block (1block = 8192 bytes). (Flush lags are larger than
> write lags, of course.)

For more clarity, I changed the previous patch a bit and retook numbers.

Total records: 19476
  8:     2 /     4 /     2:    4648 /  302472
 16:     5 /    10 /     5:    5427 /  139872
 24:  3006 /  6015 /  3028:    4739 /  267215
187:     2 /     0 /    50:       1 /     398

During a 10-second run of pgbench, the walsender reads 19476 records
and calls logical_read_xlog_page() 3028 times; the mean write lag is
4739 bytes and the mean flush lag is 267215 bytes (really?). As a
result most record fetches cause a keepalive. (The WAL contains
many FPIs.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..ee78116e79 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -571,6 +571,7 @@ err:
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
  */
+int hogestate = -1;
 static int
 ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 {
@@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
     {
         XLogRecPtr    targetSegmentPtr = pageptr - targetPageOff;
 
+        hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr;
         readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
                                            state->currRecPtr,
                                            state->readBuf);
@@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
      * First, read the requested data length, but at least a short page header
      * so that we can validate it.
      */
+    hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr;
     readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
                                        state->currRecPtr,
                                        state->readBuf);
@@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
     /* still not enough */
     if (readLen < XLogPageHeaderSize(hdr))
     {
+        hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr;
         readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
                                            state->currRecPtr,
                                            state->readBuf);
@@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
             goto err;
     }
 
+    hogestate = -1;
     /*
      * Now that we know we have the full header, validate it.
      */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..62f5f09fee 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1363,17 +1363,49 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
  * if we detect a shutdown request (either from postmaster or client)
  * we will return early, so caller must always check.
  */
+unsigned long counts[32768][3] = {0};
+unsigned long lagw[32768] = {0};
+unsigned long lagf[32768] = {0};
+unsigned long nrec = 0;
+void
+PrintCounts(void)
+{
+    int i = 0;
+    ereport(LOG, (errmsg ("Total records: %lu", nrec), errhidestmt(true)));
+    nrec = 0;
+
+    for (i = 0 ; i < 32768 ; i++)
+    {
+        if (counts[i][0] + counts[i][1] + counts[i][2] > 0)
+        {
+            unsigned long wl = 0, fl = 0;
+            if (counts[i][1] > 0)
+            {
+                wl = lagw[i] / counts[i][0];
+                fl = lagf[i] / counts[i][0];
+            
+                ereport(LOG, (errmsg ("%5d: %5lu / %5lu / %5lu: %7lu / %7lu",
+                                      i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true)));
+            }
+            counts[i][0] = counts[i][1] = counts[i][2] = lagw[i] = lagf[i] = 0;
+        }
+    }
+}
+
 static XLogRecPtr
 WalSndWaitForWal(XLogRecPtr loc)
 {
     int            wakeEvents;
     static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+    extern int hogestate;
+    bool        lagtaken = false;
 
     /*
      * Fast path to avoid acquiring the spinlock in case we already know we
      * have enough WAL available. This is particularly interesting if we're
      * far behind.
      */
+    counts[hogestate][0]++;
     if (RecentFlushPtr != InvalidXLogRecPtr &&
         loc <= RecentFlushPtr)
         return RecentFlushPtr;
@@ -1439,7 +1471,39 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (MyWalSnd->flush < sentPtr &&
             MyWalSnd->write < sentPtr &&
             !waiting_for_ping_response)
+        {
+            if (hogestate >= 0)
+            {
+                counts[hogestate][1]++;
+                if (!lagtaken)
+                {
+                    lagf[hogestate] += sentPtr - MyWalSnd->flush;
+                    lagw[hogestate] += sentPtr - MyWalSnd->write;
+                    lagtaken = true;
+                }
+            }
+//            ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//                                  ka, na, ka + na,
+//                                  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//                                  LSN_FORMAT_ARGS(MyWalSnd->write),
+//                                  LSN_FORMAT_ARGS(sentPtr),
+//                                  waiting_for_ping_response,
+//                                  sentPtr - MyWalSnd->write)));
             WalSndKeepalive(false);
+        }
+        else
+        {
+            if (hogestate >= 0)
+                counts[hogestate][2]++;
+
+//            ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//                                  ka, na, ka + na,
+//                                  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//                                  LSN_FORMAT_ARGS(MyWalSnd->write),
+//                                  LSN_FORMAT_ARGS(sentPtr),
+//                                  waiting_for_ping_response,
+//                                  sentPtr - MyWalSnd->write)));
+        }
 
         /* check whether we're done */
         if (loc <= RecentFlushPtr)
@@ -2843,6 +2907,7 @@ XLogSendLogical(void)
 {
     XLogRecord *record;
     char       *errm;
+    extern unsigned long nrec;
 
     /*
      * We'll use the current flush point to determine whether we've caught up.
@@ -2860,6 +2925,7 @@ XLogSendLogical(void)
      */
     WalSndCaughtUp = false;
 
+    nrec++;
     record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
 
     /* xlog record was invalid */

Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Jun 10, 2021 at 11:42 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > >
> > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > Is it possible that the write/flush location is not
> > > > updated at the pace at which we expect?
> >
> > Yes. MyWalSnd->flush/write are updated far frequently but still
> > MyWalSnd->write is behind sentPtr by from thousands of bytes up to
> > less than 1 block (1block = 8192 bytes). (Flush lags are larger than
> > write lags, of course.)
>
> For more clarity, I changed the previous patch a bit and retook numbers.
>
> Total records: 19476
>   8:     2 /     4 /     2:    4648 /  302472
>  16:     5 /    10 /     5:    5427 /  139872
>  24:  3006 /  6015 /  3028:    4739 /  267215
> 187:     2 /     0 /    50:       1 /     398
>
> While a 10 seconds run of pgbench, it walsender reads 19476 records
> and calls logical_read_xlog_page() 3028 times, and the mean of write
> lag is 4739 bytes and flush lag is 267215 bytes (really?), as the
> result most of the record fetch causes a keep alive. (The WAL contains
> many FPIs).
>

Good analysis. I think this analysis has shown that the walsender is
sending messages at top speed as soon as they are generated. So, I am
wondering why there is any need to wait/sleep in such a workload. One
possibility that occurred to me is that RecentFlushPtr is not updated
and/or we are not checking it aggressively enough. To investigate along
those lines, can you check the behavior with the attached patch? This
is just a quick hack patch to test whether we really need to wait for
WAL here, by checking for it a bit more aggressively.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> Good analysis. I think this analysis has shown that walsender is
> sending messages at top speed as soon as they are generated. So, I am
> wondering why there is any need to wait/sleep in such a workload. One
> possibility that occurred to me RecentFlushPtr is not updated and or
> we are not checking it aggressively. To investigate on that lines, can
> you check the behavior with the attached patch? This is just a quick
> hack patch to test whether we need to really wait for WAL a bit
> aggressively.

Yeah, in any case the comment at the call site of WalSndKeepalive
suggests that exiting the function *after* that point is somewhat wrong.

> * possibly are waiting for a later location. So, before sleeping, we
> * send a ping containing the flush location. If the receiver is

But nothing changed when I moved the keepalive check to after the exit
check (loc <= RecentFlushPtr).

And the patch also doesn't change the situation so much. The average
number of loops is reduced from 3 to 2 per call but the ratio between
total records and keepalives doesn't change.

previous:  A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151

Total records: 13208
reqsz: #sent/ #!sent/ #call: wr lag  / fl lag
   8:     4 /     4 /     4:    6448 /  268148
  16:     1 /     1 /     1:    8688 /  387320
  24:  1988 /  1987 /  1999:    6357 /  226163
 195:     1 /     0 /    20:     408 /    1647
7477:     2 /     0 /   244:      68 /     847
8225:     1 /     1 /     1:    7208 /    7208

So I checked how many bytes RecentFlushPtr is behind requested loc if
it is not advanced enough.

Total records: 15128
reqsz:  #sent/ #!sent/ #call: wr lag  / fl lag  / RecentFlushPtr lag
    8:     2 /     2 /     2:     520 /   60640 /       8
   16:     1 /     1 /     1:    8664 /   89336 /      16
   24:  2290 /  2274 /  2302:    5677 /  230583 /      23
  187:     1 /     0 /    40:       1 /    6118 /       1
 7577:     1 /     0 /    69:     120 /    3733 /      65
 8177:     1 /     1 /     1:    8288 /    8288 /    2673

So it's not a matter of the RecentFlushPtr check. (Almost) always when
WalSndWakeupRequest feels the need to send a keepalive, the function is
called before the record begins to be written.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Fri, Jun 11, 2021 at 7:07 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > Good analysis. I think this analysis has shown that walsender is
> > sending messages at top speed as soon as they are generated. So, I am
> > wondering why there is any need to wait/sleep in such a workload. One
> > possibility that occurred to me RecentFlushPtr is not updated and or
> > we are not checking it aggressively. To investigate on that lines, can
> > you check the behavior with the attached patch? This is just a quick
> > hack patch to test whether we need to really wait for WAL a bit
> > aggressively.
>
> Yeah, anyway the comment for the caller site of WalSndKeepalive tells
> that exiting out of the function *after* there is somewhat wrong.
>
> > * possibly are waiting for a later location. So, before sleeping, we
> > * send a ping containing the flush location. If the receiver is
>
> But I nothing changed by moving the keepalive check to after the exit
> check. (loc <= RecentFlushPtr).
>
> And the patch also doesn't change the situation so much. The average
> number of loops is reduced from 3 to 2 per call but the ratio between
> total records and keepalives doesn't change.
>
> previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
> this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151
>
> Total records: 13208
> reqsz: #sent/ #!sent/ #call: wr lag  / fl lag
>    8:     4 /     4 /     4:    6448 /  268148
>   16:     1 /     1 /     1:    8688 /  387320
>   24:  1988 /  1987 /  1999:    6357 /  226163
>  195:     1 /     0 /    20:     408 /    1647
> 7477:     2 /     0 /   244:      68 /     847
> 8225:     1 /     1 /     1:    7208 /    7208
>
> So I checked how many bytes RecentFlushPtr is behind requested loc if
> it is not advanced enough.
>
> Total records: 15128
> reqsz:  #sent/ #!sent/ #call: wr lag  / fl lag  / RecentFlushPtr lag
>     8:     2 /     2 /     2:     520 /   60640 /       8
>    16:     1 /     1 /     1:    8664 /   89336 /      16
>    24:  2290 /  2274 /  2302:    5677 /  230583 /      23
>   187:     1 /     0 /    40:       1 /    6118 /       1
>  7577:     1 /     0 /    69:     120 /    3733 /      65
>  8177:     1 /     1 /     1:    8288 /    8288 /    2673
>

Does this data indicate that when the request size is 187 or 7577,
even though we have called WalSndWaitForWal() 40 and 69 times
respectively, a keepalive is sent just once? Why should such behavior
depend on the request size?

> So it's not a matter of RecentFlushPtr check. (Almost) Always when
> WalSndWakeupRequest feels to need to send a keepalive, the function is
> called before the record begins to be written.
>

I think we always wake up walsender after we have flushed the WAL via
WalSndWakeupProcessRequests(). I think here the reason why we are
seeing keepalives is that we always send it before sleeping. So, it
seems each time we try to read a new page, we call WalSndWaitForWal
which sends at least one keepalive message. I am not sure what is an
appropriate way to reduce the frequency of these keepalive messages.
Andres might have some ideas?

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Peter Smith
Date:
Hi.

By using Kyotaro's "counting" patch I was able to reproduce very
similar results to what he had earlier posted [1].

AFAIK I have the same test scenario that he was using.

Test setup:
- using async pub/sub
- subscription is for the pgbench_history table
- pgbench is run for 10 seconds
- config for all the wal_sender/receiver timeout GUCs are just default values


Results (HEAD + Kyotaro counting patch)
=======================================

[postgres@CentOS7-x64 ~]$
2021-08-10 16:36:23.733 AEST [32436] LOG:  Total records: 18183
2021-08-10 16:36:23.733 AEST [32436] LOG:      8:     2 /     0 /     1:  440616 /  580320
2021-08-10 16:36:23.733 AEST [32436] LOG:     16:     4 /     8 /     4:    4524 /  288688
2021-08-10 16:36:23.733 AEST [32436] LOG:     24:  2916 /  5151 /  2756:   31227 /  323190
2021-08-10 16:36:23.733 AEST [32436] LOG:    187:     2 /     0 /    51:     157 /   10629
2021-08-10 16:36:23.733 AEST [32436] LOG:   2960:     1 /     0 /     1: 49656944 / 49656944
2021-08-10 16:36:23.733 AEST [32436] LOG:   7537:     2 /     0 /    36:     231 /    7028
2021-08-10 16:36:23.733 AEST [32436] LOG:   7577:     1 /     2 /    78:     106 /     106
2021-08-10 16:36:23.733 AEST [32436] LOG:   8280:     1 /     2 /     3:      88 /      88

//////

That base data shows that the number of keepalives sent is similar to
the number of calls made to WalSndWaitForWal. IIUC it means that mostly
the loop is sending the special keepalives on the *first* iteration,
but by the time of the *second* iteration ProcessRepliesIfAny()
will already have received some status, and so sending another
keepalive will mostly be deemed unnecessary.

Based on this, our idea was to simply skip sending the
WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA
the patch 0002 which does this skip.
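
In rough outline the change looks like this (a simplified sketch only; the flag
name is illustrative and the attached patch is the real thing):

    bool        first_iteration = true;     /* reset for each call to WalSndWaitForWal */

    for (;;)
    {
        ...

        /* skip the pre-sleep keepalive on the first pass through the loop */
        if (!first_iteration &&
            MyWalSnd->flush < sentPtr &&
            MyWalSnd->write < sentPtr &&
            !waiting_for_ping_response)
            WalSndKeepalive(false);

        first_iteration = false;

        ...
    }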

With this skip patch (v1-0002) applied the same pgbench tests were run
again. The results look like below.

Results (HEAD + Kyotaro patch + Skip-first keepalive patch)
===========================================================

RUN #1
------
[postgres@CentOS7-x64 ~]$
2021-08-11 16:32:59.827 AEST [20339] LOG:  Total records: 19367
2021-08-11 16:32:59.827 AEST [20339] LOG:     24:    10 /  9232 /  3098:      19 /     440
2021-08-11 16:32:59.827 AEST [20339] LOG:    102:     1 /     1 /    32:     257 /   16828
2021-08-11 16:32:59.827 AEST [20339] LOG:    187:     1 /     1 /    52:     155 /    9541

RUN #2
------
[postgres@CentOS7-x64 ~]$
2021-08-11 16:36:03.983 AEST [25513] LOG:  Total records: 17815
2021-08-11 16:36:03.983 AEST [25513] LOG:     24:    73 /  8683 /  2958:    1647 /    3290
2021-08-11 16:36:03.983 AEST [25513] LOG:   8280:     1 /     1 /     3:      88 /      88

RUN #3
------
[postgres@CentOS7-x64 ~]$
2021-08-11 16:39:27.655 AEST [31061] LOG:  Total records: 19906
2021-08-11 16:39:27.655 AEST [31061] LOG:     24:    18 /  8546 /  2890:      61 /    1530
2021-08-11 16:39:27.655 AEST [31061] LOG:     83:     1 /     3 /     1:    8664 /    8664

~~

This data shows the special keepalives are now greatly reduced from
1000s to just 10s.

Thoughts?

------
[1] https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Logical replication keepalive flood

From
Peter Smith
Date:
FYI - Here are some more counter results with/without the skip patch
[1] applied.

This is the same test setup as before except now using *synchronous* pub/sub.

//////////

Test setup:
- using synchronous pub/sub
- subscription is for the pgbench_history table
- pgbench is run for 10 seconds
- config for all the wal_sender/receiver timeout GUCs are just default values

WITHOUT the skip-first patch applied
=====================================

RUN #1
------
LOG:  Total records: 310
LOG:     24:    49 /   131 /    49:    8403 /    9270
LOG:    944:     1 /     0 /     1: 159693904 / 159693904
LOG:   8280:     1 /     2 /     2:     480 /     480

RUN #2
------
LOG:  Total records: 275
LOG:     24:    45 /   129 /    46:    8580 /    8766
LOG:   5392:     1 /     0 /     1: 160107248 / 160107248

RUN #3
------
LOG:  Total records: 330
LOG:     24:    50 /   144 /    51:    8705 /    8705
LOG:   3704:     1 /     0 /     1: 160510344 / 160510344
LOG:   8280:     1 /     2 /     2:     468 /     468

WITH the skip-first patch applied
=================================

RUN #1
------
LOG:  Total records: 247
LOG:     24:     5 /   172 /    44: 3601700 / 3601700
LOG:   8280:     1 /     1 /     1:    1192 /    1192

RUN #2
------
LOG:  Total records: 338
LOG:     24:     8 /   199 /    55:    1335 /    1335
LOG:   7597:     1 /     1 /     1:   11712 /   11712
LOG:   8280:     1 /     1 /     2:     480 /     480

RUN #3
------
LOG:  Total records: 292
LOG:     24:     4 /   184 /    49:     719 /     719

//////////

As before there is a big % reduction of keepalives after the patch,
except here there was never really much of a "flood" in the first
place.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPtyMBzweYUpb_QazVL6Uze2Yc5M5Ti2Xwee_eWM3Jrbog%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Aug 12, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> This data shows the special keepalives are now greatly reduced from
> 1000s to just 10s.
>
> Thoughts?
>

I could easily see the flood of keepalives with the test setup
described by the original problem reporter (Abbas Butt).
I found that the
"v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch" patch
reduced the keepalives by about 50% in this case.
I also tried the pub/sub setup with the publication on the
pgbench_history table.
With this pub/sub setup, I found that the patch dramatically reduced
the keepalives sent, similar to that reported by Peter.
Results (using Kyotaro's keepalive counting patch) are below:

PUB/SUB, publishing the pgbench_history table

(1) without patch, 10s pgbench run:

2021-09-08 15:21:56.643 AEST [63720] LOG:  Total records: 47019
2021-09-08 15:21:56.643 AEST [63720] LOG:      8:     8 /    16 /     8:    8571 /  882048
2021-09-08 15:21:56.643 AEST [63720] LOG:     16:     5 /    10 /     5:    3649 /  764892
2021-09-08 15:21:56.643 AEST [63720] LOG:     24:  6271 / 12561 /  6331:  113989 /  886115
2021-09-08 15:21:56.643 AEST [63720] LOG:    195:     2 /     0 /   112:      72 /   10945
2021-09-08 15:21:56.643 AEST [63720] LOG:   6856:     1 /     0 /     1: 666232176 / 666232176
2021-09-08 15:21:56.643 AEST [63720] LOG:   7477:     2 /     0 /   298:      27 /    3303
2021-09-08 15:21:56.643 AEST [63720] LOG:   8159:    19 /    32 /  6073:      15 /    1869

(2) with patch, 10s pgbench run

2021-09-08 15:39:14.008 AEST [71431] LOG:  Total records: 45858
2021-09-08 15:39:14.008 AEST [71431] LOG:     24:    61 / 18278 /  6168:  108034 /  115228
2021-09-08 15:39:14.008 AEST [71431] LOG:     84:     1 /     1 /     7:    2256 /  295230
2021-09-08 15:39:14.008 AEST [71431] LOG:    110:     1 /     1 /     3:   10629 /  708293
2021-09-08 15:39:14.008 AEST [71431] LOG:   7477:    18 /    18 /  4577:      53 /    7850


Where columns are (a rough sketch of this bookkeeping follows the list):

size: sent / notsent / calls: write lag / flush lag

- size is requested data length to logical_read_xlog_page()
- sent is the number of keepalives sent in the loop in WalSndWaitForWal
- notsent is the number of runs of the loop in WalSndWaitForWal
without sending a keepalive
- calls is the number of calls to WalSndWaitForWal
- write lag is the number of bytes MyWalSnd->write is behind sentPtr at
the first run of the loop, per call to logical_read_xlog_page.
- flush lag is the same as the above, but for MyWalSnd->flush.
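
(For readers trying to map those columns onto code, below is a rough,
purely hypothetical sketch of how such per-request-size counters could be
kept. It is not Horiguchi-san's actual counting patch; the struct and
field names are made up for illustration only.)

    #include <stdint.h>

    /*
     * Hypothetical bookkeeping behind the "size: sent / notsent / calls:
     * write lag / flush lag" lines above: one entry per requested data
     * length passed to logical_read_xlog_page().  Illustration only.
     */
    typedef struct KeepaliveStat
    {
        int      size;       /* requested data length (the line's key) */
        long     sent;       /* keepalives sent in WalSndWaitForWal's loop */
        long     notsent;    /* loop runs that sent no keepalive */
        long     calls;      /* calls to WalSndWaitForWal for this size */
        uint64_t write_lag;  /* bytes MyWalSnd->write trailed sentPtr at loop entry */
        uint64_t flush_lag;  /* same, but for MyWalSnd->flush */
    } KeepaliveStat;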


However, the problem I found is that, with the patch applied, there is
a test failure when running “make check-world”:

 t/006_logical_decoding.pl ............ 4/14
#   Failed test 'pg_recvlogical acknowledged changes'
#   at t/006_logical_decoding.pl line 117.
#          got: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
#     expected: ''
# Looks like you failed 1 test of 14.
t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat
256, 0x100) Failed 1/14 subtests


To investigate this, I added some additional logging to
pg_recvlogical.c and PostgresNode.pm and re-ran
006_logical_decoding.pl without and with the patch (logs attached).

When the patch is NOT applied, and when pg_recvlogical is invoked by
the test for a 2nd time with the same "--endpos" LSN, it gets a
keepalive, detects walEnd>=endpos, and thus returns an empty record.
The test is expecting an empty record, so all is OK.
When the patch is applied, and when pg_recvlogical is invoked by the
test for a 2nd time with the same "--endpos" LSN, it gets a WAL record
with THE SAME LSN (== endpos) as previously obtained by the last WAL
record when it was invoked the 1st time, but the record data is
actually the first row of some records written after endpos, that it
wasn't meant to read.
This doesn't seem right to me - how can pg_recvlogical receive two
different WAL records with the same LSN?
With the patch applied, I was expecting pg_reclogical to get WAL
records with LSN>endpos, but this doesn't happen.
I'm thinking that the patch must have broken walsender in some way,
possibly by missing out on calls to ProcessStandbyReplyMessage()
because the sending of some keepalives is avoided (see below), so
that the MyWalSnd flush and write locations are not kept up-to-date.
The code comments below seem to hint at this. I don't really like
the way keepalives are used for this, but this seems to be the
existing functionality. Maybe someone else can confirm that this could
indeed break walsender?

walsender.c
WalSndWaitForWal()

        /*
         * We only send regular messages to the client for full decoded
         * transactions, but a synchronous replication and walsender shutdown
         * possibly are waiting for a later location. So, before sleeping, we
         * send a ping containing the flush location. If the receiver is
         * otherwise idle, this keepalive will trigger a reply. Processing the
         * reply will update these MyWalSnd locations.
         */
        if (!loop_first_time &&  /* avoid keepalive on first iteration */   <--- added by the patch
            MyWalSnd->flush < sentPtr &&
            MyWalSnd->write < sentPtr &&
            !waiting_for_ping_response)
        {
            WalSndKeepalive(false);


Regards,
Greg Nancarrow
Fujitsu Australia


RE: Logical replication keepalive flood

From
"houzj.fnst@fujitsu.com"
Date:
From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> However, the problem I found is that, with the patch applied, there is
> a test failure when running “make check-world”:
> 
>  t/006_logical_decoding.pl ............ 4/14
> #   Failed test 'pg_recvlogical acknowledged changes'
> #   at t/006_logical_decoding.pl line 117.
> #          got: 'BEGIN
> # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> #     expected: ''
> # Looks like you failed 1 test of 14.
> t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat
> 256, 0x100) Failed 1/14 subtests
> 
> 

After applying the patch,
I saw the same problem and can reproduce it by the following steps:

1) execute the SQLs.
-----------SQL-----------
CREATE TABLE decoding_test(x integer, y text);
SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;

-- use the lsn here to execute pg_recvlogical later
SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1;
INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s;
----------------------

2) Then, if I execute the following command twice:
# pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
BEGIN 708
table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
COMMIT 708

# pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
BEGIN 709

It still generated ' BEGIN 709' in the second time execution.
But it will output nothing in the second time execution without the patch.

Best regards,
Hou zj




Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Sep 16, 2021 at 6:29 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > However, the problem I found is that, with the patch applied, there is
> > a test failure when running “make check-world”:
> >
> >  t/006_logical_decoding.pl ............ 4/14
> > #   Failed test 'pg_recvlogical acknowledged changes'
> > #   at t/006_logical_decoding.pl line 117.
> > #          got: 'BEGIN
> > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> > #     expected: ''
> > # Looks like you failed 1 test of 14.
> > t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat
> > 256, 0x100) Failed 1/14 subtests
> >
> >
>
> After applying the patch,
> I saw the same problem and can reproduce it by the following steps:
>
> 1) execute the SQLs.
> -----------SQL-----------
> CREATE TABLE decoding_test(x integer, y text);
> SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
> INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
>
> -- use the lsn here to execute pg_recvlogical later
> SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1;
> INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s;
> ----------------------
>
> 2) Then, if I execute the following command twice:
> # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> BEGIN 708
> table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
> table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
> table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
> table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
> COMMIT 708
>
> # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> BEGIN 709
>
> It still generated ' BEGIN 709' in the second time execution.
> But it will output nothing in the second time execution without the patch.
>

I think here the reason is that the first_lsn of a transaction is
always equal to end_lsn of the previous transaction (See comments
above first_lsn and end_lsn fields of ReorderBufferTXN). I have not
debugged but I think in StreamLogicalLog() the cur_record_lsn after
receiving 'w' message, in this case, will be equal to endpos whereas
we expect to be greater than endpos to exit. Before the patch, it will
always get the 'k' message where we expect the received lsn to be
equal to endpos to conclude that we can exit. Do let me know if your
analysis differs?
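
(To make the asymmetry described above easier to follow, here is a small,
self-contained paraphrase of the endpos checks as I read them in
pg_recvlogical's StreamLogicalLog(). It is an illustration, not the
verbatim source; the helper names and the simplified XLogRecPtr type are
mine.)

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;               /* simplified stand-in */
    #define InvalidXLogRecPtr ((XLogRecPtr) 0)

    /*
     * 'w' (WAL data) message: a record whose LSN equals endpos is still
     * written out, and streaming stops only afterwards.
     */
    static bool
    stop_for_w_message(XLogRecPtr cur_record_lsn, XLogRecPtr endpos,
                       bool *output_record)
    {
        if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
        {
            *output_record = false;    /* past endpos: stop without output */
            return true;
        }
        *output_record = true;         /* record is output ... */
        /* ... and we stop if its LSN was exactly endpos */
        return (endpos != InvalidXLogRecPtr && cur_record_lsn == endpos);
    }

    /*
     * 'k' (keepalive) message: reaching endpos is enough to stop, and
     * nothing is output.  This is why the test sees '' when a keepalive
     * arrives first, but sees "BEGIN ..." when the first message is a 'w'
     * record whose LSN equals endpos (== end_lsn of the previous COMMIT).
     */
    static bool
    stop_for_k_message(XLogRecPtr walEnd, XLogRecPtr endpos)
    {
        return (endpos != InvalidXLogRecPtr && walEnd >= endpos);
    }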

--
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Aug 12, 2021 at 8:03 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> That base data is showing there are similar numbers of keepalives sent
> as there are calls made to WalSndWaitForWal. IIUC it means that mostly
> the loop is sending the special keepalives on the *first* iteration,
> but by the time of the *second* iteration the ProcessRepliesIfAny()
> will have some status already received, and so mostly sending another
> keepalive will be deemed unnecessary.
>
> Based on this, our idea was to simply skip sending the
> WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA
> the patch 0002 which does this skip.
>

I think we should also keep in mind that there are cases where it
seems we are not able to send keep-alive at the appropriate frequency.
See the discussion [1]. This is to ensure that we shouldn't
unintentionally hamper some other workloads by fixing the workload
being discussed here.

[1] - https://www.postgresql.org/message-id/20210913.103107.813489310351696839.horikyota.ntt%40gmail.com

-- 
With Regards,
Amit Kapila.



RE: Logical replication keepalive flood

From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, September 16, 2021 8:36 PM Amit Kapila <amit.kapila16@gmail.com>:
> On Thu, Sep 16, 2021 at 6:29 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> >
> > After applying the patch,
> > I saw the same problem and can reproduce it by the following steps:
> >
> > 1) execute the SQLs.
> > -----------SQL-----------
> > CREATE TABLE decoding_test(x integer, y text);
> > SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
> > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4)
> s;
> >
> > -- use the lsn here to execute pg_recvlogical later
> > SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL)
> ORDER BY lsn DESC LIMIT 1;
> > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM
> generate_series(5,50) s;
> > ----------------------
> >
> > 2) Then, if I execute the following command twice:
> > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> > BEGIN 708
> > table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
> > table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
> > table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
> > table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
> > COMMIT 708
> >
> > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> > BEGIN 709
> >
> > It still generated ' BEGIN 709' in the second time execution.
> > But it will output nothing in the second time execution without the patch.
> >
> 
> I think here the reason is that the first_lsn of a transaction is
> always equal to end_lsn of the previous transaction (See comments
> above first_lsn and end_lsn fields of ReorderBufferTXN). I have not
> debugged but I think in StreamLogicalLog() the cur_record_lsn after
> receiving 'w' message, in this case, will be equal to endpos whereas
> we expect to be greater than endpos to exit. Before the patch, it will
> always get the 'k' message where we expect the received lsn to be
> equal to endpos to conclude that we can exit. Do let me know if your
> analysis differs?

After debugging it, I agree with your analysis.

WITH the patch:
in function StreamLogicalLog(), I can see the cur_record_lsn is equal
to endpos which result in unexpected record.

WITHOUT the patch:
In function StreamLogicalLog(), it first received a 'k' message which will break the
loop by the following code.

            if (endposReached)
            {
                prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
                time_to_abort = true;
                break;
            }

Best regards,
Hou zj

Re: Logical replication keepalive flood

From
Peter Smith
Date:
On Thu, Sep 16, 2021 at 10:59 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > However, the problem I found is that, with the patch applied, there is
> > a test failure when running “make check-world”:
> >
> >  t/006_logical_decoding.pl ............ 4/14
> > #   Failed test 'pg_recvlogical acknowledged changes'
> > #   at t/006_logical_decoding.pl line 117.
> > #          got: 'BEGIN
> > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> > #     expected: ''
> > # Looks like you failed 1 test of 14.
> > t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat
> > 256, 0x100) Failed 1/14 subtests
> >
> >
>
> After applying the patch,
> I saw the same problem and can reproduce it by the following steps:
>
> 1) execute the SQLs.
> -----------SQL-----------
> CREATE TABLE decoding_test(x integer, y text);
> SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
> INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
>
> -- use the lsn here to execute pg_recvlogical later
> SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1;
> INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s;
> ----------------------
>
> 2) Then, if I execute the following command twice:
> # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> BEGIN 708
> table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
> table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
> table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
> table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
> COMMIT 708
>
> # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f -
> BEGIN 709
>
> It still generated ' BEGIN 709' in the second time execution.
> But it will output nothing in the second time execution without the patch.
>

Hello Hou-san, thanks for including the steps. Unfortunately, no
matter what I tried, I could never get the patch to display the
problem "BEGIN 709" for the 2nd time execution of pg_recvlogical.

After discussion offline (thanks Greg!) it was found that the
pg_recvlogical step 2 posted above is not quite identical to what the
TAP 006 test is doing.

Specifically, the TAP test also includes some other options (-o
include-xids=0 -o skip-empty-xacts=1) which are not in your step.

If I include these options then I can reproduce the problem.
-----------------------------------------
[postgres@CentOS7-x64 ~]$ pg_recvlogical -E '0/150B5E0'  -d postgres
-S 'test_slot' --start --no-loop -o include-xids=0 -o
skip-empty-xacts=1 -f -
BEGIN
table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
-----------------------------------------

I don't know why these options should make any difference but they do.
Perhaps they cause a fluke of millisecond timing differences in our
different VM environments.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Fri, Sep 17, 2021 at 12:42 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Sep 16, 2021 at 10:59 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
>
> Hello Hou-san, thanks for including the steps. Unfortunately, no
> matter what I tried, I could never get the patch to display the
> problem "BEGIN 709" for the 2nd time execution of pg_recvlogical.
>
> After discussion offline (thanks Greg!) it was found that the
> pg_recvlogical step 2 posted above is not quite identical to what the
> TAP 006 test is doing.
>
> Specifically, the TAP test also includes some other options (-o
> include-xids=0 -o skip-empty-xacts=1) which are not in your step.
>
> If I include these options then I can reproduce the problem.
> -----------------------------------------
> [postgres@CentOS7-x64 ~]$ pg_recvlogical -E '0/150B5E0'  -d postgres
> -S 'test_slot' --start --no-loop -o include-xids=0 -o
> skip-empty-xacts=1 -f -
> BEGIN
> table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
> -----------------------------------------
>
> I don't know why these options should make any difference but they do.
>

I think there is a possibility that skip-empty-xacts = 1 is making a
difference. Basically, if there is some empty transaction, say via
autovacuum, it would skip it and possibly send a keep-alive message
before sending transaction id 709. Then you won't see the problem with
Hou-San's test. Can you try by keeping autovacuum = off and by not
using the skip-empty-xacts option?

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Sep 16, 2021 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think here the reason is that the first_lsn of a transaction is
> always equal to end_lsn of the previous transaction (See comments
> above first_lsn and end_lsn fields of ReorderBufferTXN).

That may be the case, but those comments certainly don't make this clear.

>I have not
> debugged but I think in StreamLogicalLog() the cur_record_lsn after
> receiving 'w' message, in this case, will be equal to endpos whereas
> we expect to be greater than endpos to exit. Before the patch, it will
> always get the 'k' message where we expect the received lsn to be
> equal to endpos to conclude that we can exit. Do let me know if your
> analysis differs?
>

Yes, pg_recvlogical seems to be relying on receiving a keepalive for
its "--endpos" logic to work (and the 006 test is relying on '' record
output from pg_recvlogical in this case).
But is it correct to be relying on a keepalive for this?
As I already pointed out, there's also code which seems to be relying
on replies from sending keepalives, to update flush and write
locations related to LSN.
The original problem reporter measured 500 keepalives per second being
sent by walsender (which I also reproduced, for pg_recvlogical and
pub/sub cases).
None of these cases appear to be traditional uses of "keepalive" type
messages to me.
Am I missing something? Documentation?


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Fri, Sep 17, 2021 at 3:03 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Sep 16, 2021 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think here the reason is that the first_lsn of a transaction is
> > always equal to end_lsn of the previous transaction (See comments
> > above first_lsn and end_lsn fields of ReorderBufferTXN).
>
> That may be the case, but those comments certainly don't make this clear.
>
> >I have not
> > debugged but I think in StreamLogicalLog() the cur_record_lsn after
> > receiving 'w' message, in this case, will be equal to endpos whereas
> > we expect to be greater than endpos to exit. Before the patch, it will
> > always get the 'k' message where we expect the received lsn to be
> > equal to endpos to conclude that we can exit. Do let me know if your
> > analysis differs?
> >
>
> Yes, pg_recvlogical seems to be relying on receiving a keepalive for
> its "--endpos" logic to work (and the 006 test is relying on '' record
> output from pg_recvlogical in this case).
> But is it correct to be relying on a keepalive for this?
>

I don't think this experiment/test indicates that pg_recvlogical's
"--endpos" relies on keepalive. It would just print the records till
--endpos and then exit. In the test under discussion, as per
confirmation by Hou-San, the BEGIN record received has the same LSN as
--endpos, so it would just output that and exit which is what is
mentioned in pg_recvlogical docs as well (If there's a record with LSN
exactly equal to lsn, the record will be output).

I think here the test case could be the culprit. In the original commit
eb2a6131be [1], where this second use of pg_recvlogical was added,
there were no additional inserts ("# Insert some rows after $endpos,
which we won't read."); those were later added by a different commit,
8222a9d9a1 [2]. I am not sure if the test added by commit [2] was a
good idea. It seems to be working due to the way keepalives are being
sent, but otherwise it can fail as per the current design of
pg_recvlogical.

[1]:
commit eb2a6131beccaad2b39629191508062b70d3a1c6
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   Tue Mar 21 14:04:49 2017 +0000

    Add a pg_recvlogical wrapper to PostgresNode

[2]:
commit 8222a9d9a12356349114ec275b01a1a58da2b941
Author: Noah Misch <noah@leadboat.com>
Date:   Wed May 13 20:42:09 2020 -0700

    In successful pg_recvlogical, end PGRES_COPY_OUT cleanly.

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Sun, Sep 19, 2021 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Yes, pg_recvlogical seems to be relying on receiving a keepalive for
> > its "--endpos" logic to work (and the 006 test is relying on '' record
> > output from pg_recvlogical in this case).
> > But is it correct to be relying on a keepalive for this?
> >
>
> I don't think this experiment/test indicates that pg_recvlogical's
> "--endpos" relies on keepalive. It would just print the records till
> --endpos and then exit. In the test under discussion, as per
> confirmation by Hou-San, the BEGIN record received has the same LSN as
> --endpos, so it would just output that and exit which is what is
> mentioned in pg_recvlogical docs as well (If there's a record with LSN
> exactly equal to lsn, the record will be output).
>

It seems to be relying on keepalive to get ONE specific record per
--endpos value, because once we apply the
"v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch" patch,
then when pg_recvlogical is invoked for a second time with the same
--endos, it outputs the next record (BEGIN) too. So now for the same
--endpos value, we've had two different records output by
pg_recvlogical.
I have not seen this described in the documentation, so I think it
will need to be updated, should keepalives be reduced as per the
patch. The current documentation seems to be implying that one
particular record will be returned for a given --endpos  (at least,
there is no mention of the possibility of different records being
output for the one --endpos, or that the first_lsn of a transaction is
always equal to end_lsn of the previous transaction).

--endpos=lsn

   In --start mode, automatically stop replication and exit with
normal exit status 0 when receiving reaches the specified LSN. If
specified when not in --start mode, an error is raised.

   If there's a record with LSN exactly equal to lsn, the record will be output.

   The --endpos option is not aware of transaction boundaries and may
truncate output partway through a transaction. Any partially output
transaction will not be consumed and will be replayed again when the
slot is next read from. Individual messages are never truncated.


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Sep 30, 2021 at 8:49 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
>
> I noticed another patch that Horiguchi-San posted earlier[1].
>
> The approach in that patch is to splits the sleep into two
> pieces. If the first sleep reaches the timeout then send a keepalive
> then sleep for the remaining time.
>
> I tested that patch and can see the keepalive message is reduced and
> the patch won't cause the current regression test fail.
>
> Since I didn't find some comments about that patch,
> I wonder did we find some problems about that patch ?
>

I am not able to understand some parts of that patch.

+ If the sleep is shorter
+ * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
+ * prevent it from getting too-frequent.
+ */
+ if (MyWalSnd->flush < sentPtr &&
+ MyWalSnd->write < sentPtr &&
+ !waiting_for_ping_response)
+ {
+ if (sleeptime > KEEPALIVE_TIMEOUT)
+ {
+ int r;
+
+ r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+
+ if (r != 0)
+ continue;
+
+ sleeptime -= KEEPALIVE_TIMEOUT;
+ }
+
+ WalSndKeepalive(false);

It claims to skip sending keepalive if the sleep time is shorter than
KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to
suggest it sends in both cases. What am I missing?

Also, more to the point this special keep_alive seems to be sent for
synchronous replication and walsender shutdown as they can expect
updated locations. You haven't given any reason (theory) why those two
won't be impacted due to this change? I am aware that for synchronous
replication, we wake waiters while ProcessStandbyReplyMessage but I am
not sure how it helps with wal sender shutdown? I think we need to
know the reasons for this message and then need to see if the change
has any impact on the same.

--
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I am not able to understand some parts of that patch.
>
> + If the sleep is shorter
> + * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
> + * prevent it from getting too-frequent.
> + */
> + if (MyWalSnd->flush < sentPtr &&
> + MyWalSnd->write < sentPtr &&
> + !waiting_for_ping_response)
> + {
> + if (sleeptime > KEEPALIVE_TIMEOUT)
> + {
> + int r;
> +
> + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
> +    WAIT_EVENT_WAL_SENDER_WAIT_WAL);
> +
> + if (r != 0)
> + continue;
> +
> + sleeptime -= KEEPALIVE_TIMEOUT;
> + }
> +
> + WalSndKeepalive(false);
>
> It claims to skip sending keepalive if the sleep time is shorter than
> KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to
> suggest it sends in both cases. What am I missing?
>

The comment does seem to be wrong.
The way I see it, if the calculated sleeptime is greater than
KEEPALIVE_TIMEOUT, then it first sleeps for up to KEEPALIVE_TIMEOUT
milliseconds and skips sending a keepalive if something happens (i.e.
the socket becomes readable/writeable) during that time (WalSndWait
will return non-zero in that case). Otherwise, it sends a keepalive
and sleeps for (sleeptime - KEEPALIVE_TIMEOUT), then loops around
again ...

> Also, more to the point this special keep_alive seems to be sent for
> synchronous replication and walsender shutdown as they can expect
> updated locations. You haven't given any reason (theory) why those two
> won't be impacted due to this change? I am aware that for synchronous
> replication, we wake waiters while ProcessStandbyReplyMessage but I am
> not sure how it helps with wal sender shutdown? I think we need to
> know the reasons for this message and then need to see if the change
> has any impact on the same.
>

Yes, I'm not sure about the possible impacts, still looking at it.

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Sep 30, 2021 at 1:19 PM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
>
> I tested that patch and can see the keepalive message is reduced and
> the patch won't cause the current regression test fail.
>

Actually, with the patch applied, I find that "make check-world" fails
(006_logical_decoding, test 7).

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 30 Sep 2021 16:21:25 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in 
> On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It claims to skip sending keepalive if the sleep time is shorter than
> > KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to
> > suggest it sends in both cases. What am I missing?
> >
> 
> The comment does seem to be wrong.

Mmm. Indeed the comment does say something like that...  Looking at the
patch name together with it, I might have confused something. However, the
patch looks like it works for the purpose.

> The way I see it, if the calculated sleeptime is greater than
> KEEPALIVE_TIMEOUT, then it first sleeps for up to KEEPALIVE_TIMEOUT
> milliseconds and skips sending a keepalive if something happens (i.e.
> the socket becomes readable/writeable) during that time (WalSndWait
> will return non-zero in that case). Otherwise, it sends a keepalive
> and sleeps for (sleeptime - KEEPALIVE_TIMEOUT), then loops around
> again ...

The main point of the patch is moving the timing of sending the
before-sleep keepalive.  It seems to me that currently
WalSndWaitForWal may send the "before-sleep" keepalive on every run of the
loop under a certain circumstance. I suspect this happens in this case.

After the patch applied, that keepalive is sent only when the loop is
actually going to sleep some time.  In case the next WAL doesn't come
for KEEPALIVE_TIMEOUT milliseconds, it sends a keepalive. There's a
dubious behavior when sleeptime <= KEEPALIVE_TIMEOUT that it sends a
keepalive immediately.  It was (as far as I recall) intentional in
order to make the code simpler.  However, on second thought, we will
have the next chance to send keepalive in that case, and intermittent
frequent keepalives can happen with that behavior.  So I came to think
that we can omit keepalives at all that case.

(I myself haven't see the keepalive flood..)

> > Also, more to the point this special keep_alive seems to be sent for
> > synchronous replication and walsender shutdown as they can expect
> > updated locations. You haven't given any reason (theory) why those two
> > won't be impacted due to this change? I am aware that for synchronous
> > replication, we wake waiters while ProcessStandbyReplyMessage but I am
> > not sure how it helps with wal sender shutdown? I think we need to
> > know the reasons for this message and then need to see if the change
> > has any impact on the same.
> >
> 
> Yes, I'm not sure about the possible impacts, still looking at it.

If the comment describes the objective correctly, the only possible
impact would be that there may be a case where server responds a bit
slowly for a shutdown request.  But I'm not sure it is definitely
true.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3ca2a11389..0b04b978e9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -105,6 +105,9 @@
  */
 #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
+/* Minimum idle time for sending an idle-time keepalive in milliseconds */
+#define KEEPALIVE_TIMEOUT 250
+
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
 
@@ -244,7 +247,7 @@ static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
-static void WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
+static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid);
@@ -1436,19 +1439,6 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (got_STOPPING)
             break;
 
-        /*
-         * We only send regular messages to the client for full decoded
-         * transactions, but a synchronous replication and walsender shutdown
-         * possibly are waiting for a later location. So, before sleeping, we
-         * send a ping containing the flush location. If the receiver is
-         * otherwise idle, this keepalive will trigger a reply. Processing the
-         * reply will update these MyWalSnd locations.
-         */
-        if (MyWalSnd->flush < sentPtr &&
-            MyWalSnd->write < sentPtr &&
-            !waiting_for_ping_response)
-            WalSndKeepalive(false);
-
         /* check whether we're done */
         if (loc <= RecentFlushPtr)
             break;
@@ -1491,6 +1481,41 @@ WalSndWaitForWal(XLogRecPtr loc)
         if (pq_is_send_pending())
             wakeEvents |= WL_SOCKET_WRITEABLE;
 
+        /*
+         * We only send regular messages to the client for full decoded
+         * transactions, but a synchronous replication and walsender shutdown
+         * possibly are waiting for a later location. So, before sleeping, we
+         * send a ping containing the flush location. If the receiver is
+         * otherwise idle, this keepalive will trigger a reply. Processing the
+         * reply will update these MyWalSnd locations. To prevent too-frequent
+         * keepalives, wait KEEPALIVE_TIMEOUT milliseconds before sending a
+         * keepalive. If sleeptime is less than KEEPALIVE_TIMEOUT, instead
+         * WalSndKeepaliveIfNecessary will work if needed.
+         */
+        if (MyWalSnd->flush < sentPtr &&
+            MyWalSnd->write < sentPtr &&
+            !waiting_for_ping_response &&
+            sleeptime > KEEPALIVE_TIMEOUT)
+        {
+            int r;
+            
+            r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
+                           WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+
+            /*
+             * If some event happens before the timeout expires, go ahead
+             * without sending keepalive.
+             */
+            if (r != 0)
+                continue;
+
+            sleeptime -= KEEPALIVE_TIMEOUT;
+
+            WalSndKeepalive(false);
+            if (pq_flush_if_writable() != 0)
+                WalSndShutdown();
+        }
+        
         WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL);
     }
 
@@ -3144,15 +3169,18 @@ WalSndWakeup(void)
  * composed of optional WL_SOCKET_WRITEABLE and WL_SOCKET_READABLE flags.  Exit
  * on postmaster death.
  */
-static void
+static int
 WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
 {
     WaitEvent    event;
+    int            ret;
 
     ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
-    if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
-        (event.events & WL_POSTMASTER_DEATH))
+    ret = WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event);
+    if (ret == 1 && (event.events & WL_POSTMASTER_DEATH))
         proc_exit(1);
+
+    return ret;
 }
 
 /*

Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in 
> Actually, with the patch applied, I find that "make check-world" fails
> (006_logical_decoding, test 7).

Mmm..

t/006_logical_decoding.pl .. 4/14 
#   Failed test 'pg_recvlogical acknowledged changes'
#   at t/006_logical_decoding.pl line 117.
#          got: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
#     expected: ''

I'm not sure what the test is checking for now, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Sep 30, 2021 at 6:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in
> > Actually, with the patch applied, I find that "make check-world" fails
> > (006_logical_decoding, test 7).
>
> Mmm..
>
> t/006_logical_decoding.pl .. 4/14
> #   Failed test 'pg_recvlogical acknowledged changes'
> #   at t/006_logical_decoding.pl line 117.
> #          got: 'BEGIN
> # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> #     expected: ''
>
> I'm not sure what the test is checking for now, though.
>

I think it's trying to check that pg_recvlogical doesn't read "past" a
specified "--endpos" LSN. The test is invoking pg_recvlogical with the
same --endpos LSN value multiple times.
After first getting the LSN (to use for the --endpos value) after 4
rows are inserted, some additional rows are inserted which the test
expects pg_recvlogical won't read because it should't read past
--endpos.
Problem is, the test seems to be relying on a keepalive between the
WAL record of the first transaction and the WAL record of the next
transaction.
As Amit previously explained on this thread "I think here the reason
is that the first_lsn of a transaction is always equal to end_lsn of
the previous transaction (See comments
above first_lsn and end_lsn fields of ReorderBufferTXN)."
When the patch is applied, pg_recvlogical doesn't read a keepalive
when it is invoked with the same --endpos for a second time here, and
it ends up reading the first WAL record for the next transaction
(those additional rows that the test expects it won't read).


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 30 Sep 2021 17:08:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in 
> > Actually, with the patch applied, I find that "make check-world" fails
> > (006_logical_decoding, test 7).
> 
> Mmm..
> 
> t/006_logical_decoding.pl .. 4/14 
> #   Failed test 'pg_recvlogical acknowledged changes'
> #   at t/006_logical_decoding.pl line 117.
> #          got: 'BEGIN
> # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> #     expected: ''
> 
> I'm not sure what the test is checking for now, though.

It's checking that endpos works correctly? The logical decoded WALs
looks like this.

0/1528F10|table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
0/15290F8|table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
0/1529138|table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
0/1529178|table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
0/15291E8|COMMIT 709
0/15291E8|BEGIN 710
0/15291E8|table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
0/1529228|table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'

The COMMIT and BEGIN shares the same LSN, which I don't understand how come.

The previous read by pg_recvlogical proceeded up to the COMMIT record, and
the following command, run after that, behaves differently.

pg_recvlogical -S test_slot --dbname postgres --endpos '0/15291E8' -f - --no-loop --start

Before the patch it ends before reading a record, and after the patch
it reads into the "table ..." line.  pg_recvlogical seems to be using the
endpos as the beginning of the last record. In that sense the three
lines (COMMIT 709/BEGIN 710/table ...'5') all fall at the end of the
data.

The difference seems to come from the timing at which the keepalive
comes. pg_recvlogical checks the endpos only when a keepalive comes.  In
other words, it needs a keepalive for every data line so that it stops
exactly at the specified endpos.

1. Is it the correct behavior that the three data lines share the same
  LSN?  I think BEGIN and the next line should do, but COMMIT and next
  BEGIN shouldn't.

2. Is it the designed behavior that pg_recvlogical does check endpos
  only when a keepalive comes? If it is the correct behavior, we
  shouldn't avoid the keepalive flood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Sep 30, 2021 at 1:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 30 Sep 2021 16:21:25 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in
> > On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
>
> > > Also, more to the point this special keep_alive seems to be sent for
> > > synchronous replication and walsender shutdown as they can expect
> > > updated locations. You haven't given any reason (theory) why those two
> > > won't be impacted due to this change? I am aware that for synchronous
> > > replication, we wake waiters while ProcessStandbyReplyMessage but I am
> > > not sure how it helps with wal sender shutdown? I think we need to
> > > know the reasons for this message and then need to see if the change
> > > has any impact on the same.
> > >
> >
> > Yes, I'm not sure about the possible impacts, still looking at it.
>
> If the comment describes the objective correctly, the only possible
> impact would be that there may be a case where server responds a bit
> slowly for a shutdown request.  But I'm not sure it is definitely
> true.
>

So, we should try to find how wal sender shutdown is dependent on
sending keep alive and second thing is what about sync rep case? I
think in the worst case that also might delay. Why do you think that
would be acceptable?

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Amit Kapila
Date:
On Thu, Sep 30, 2021 at 3:41 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 30 Sep 2021 17:08:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in
> > > Actually, with the patch applied, I find that "make check-world" fails
> > > (006_logical_decoding, test 7).
> >
> > Mmm..
> >
> > t/006_logical_decoding.pl .. 4/14
> > #   Failed test 'pg_recvlogical acknowledged changes'
> > #   at t/006_logical_decoding.pl line 117.
> > #          got: 'BEGIN
> > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
> > #     expected: ''
> >
> > I'm not sure what the test is checking for now, though.
>
> It's checking that endpos works correctly? The logical decoded WALs
> looks like this.
>
> 0/1528F10|table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
> 0/15290F8|table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
> 0/1529138|table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
> 0/1529178|table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
> 0/15291E8|COMMIT 709
> 0/15291E8|BEGIN 710
> 0/15291E8|table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
> 0/1529228|table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'
>
> The COMMIT and BEGIN shares the same LSN, which I don't understand how come.
>

This is because end_lsn always points to the end of the commit record + 1,
which makes it equal to the start of the next record, and we use end_lsn
here for the commit. See the comments in the code below.
/*
* LSN pointing to the end of the commit record + 1.
*/
XLogRecPtr end_lsn;

> The previous read by pg_recvlogical proceeded up to the COMMIT record, and
> the following command, run after that, behaves differently.
>
> pg_recvlogical -S test_slot --dbname postgres --endpos '0/15291E8' -f - --no-loop --start
>
> Before the patch it ends before reading a record, and after the patch
> it reads into the "table ..." line.  pg_recvlogical seems to be using the
> endpos as the beginning of the last record. In that sense the three
> lines (COMMIT 709/BEGIN 710/table ...'5') all fall at the end of the
> data.
>
> The difference seems to come from the timing at which the keepalive
> comes. pg_recvlogical checks the endpos only when a keepalive comes.  In
> other words, it needs a keepalive for every data line so that it stops
> exactly at the specified endpos.
>
> 1. Is it the correct behavior that the three data lines share the same
>   LSN?  I think BEGIN and the next line should do, but COMMIT and next
>   BEGIN shouldn't.
>
> 2. Is it the designed behavior that pg_recvlogical does check endpos
>   only when a keepalive comes? If it is the correct behavior, we
>   shouldn't avoid the keepalive flood.
>

If anything, I think this is a testcase issue as explained by me in email [1]

[1] - https://www.postgresql.org/message-id/CAA4eK1Ja2XmK59Czv1V%2BtfOgU4mcFfDwTtTgO02Wd%3DrO02JbiQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

From
Greg Nancarrow
Date:
On Thu, Sep 30, 2021 at 5:56 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> After the patch applied, that keepalive is sent only when the loop is
> actually going to sleep some time.  In case the next WAL doesn't come
> for KEEPALIVE_TIMEOUT milliseconds, it sends a keepalive. There's a
> dubious behavior when sleeptime <= KEEPALIVE_TIMEOUT that it sends a
> keepalive immediately.  It was (as far as I recall) intentional in
> order to make the code simpler.  However, on second thought, we will
> have the next chance to send keepalive in that case, and intermittent
> frequent keepalives can happen with that behavior.  So I came to think
> that we can omit keepalives at all that case.
>
> (I myself haven't see the keepalive flood..)
>

I tried your updated patch
(avoid_keepalive_flood_at_bleeding_edge_of_wal.patch, rebased) and
also manually applied your previous keepalive-counting code
(count_keepalives2.diff.txt), adapted to the code updates.
I tested both the problem originally reported (which used
pg_recvlogical) and similarly using pub/sub of the pgbench_history
table, and in both cases I found that your patch very significantly
reduced the keepalives, so the keepalive flood is no longer seen.
I am still a little unsure about the impact on pg_recvlogical --endpos
functionality, which is detected by the regression test failure. I did
try to update pg_recvlogical, to not rely on a keepalive for --endpos,
but so far haven't been successful in doing that. If the test is
altered/removed then I think that the documentation for pg_recvlogical
--endpos will need updating in some way.

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Logical replication keepalive flood

From
Kyotaro Horiguchi
Date:
At Thu, 30 Sep 2021 17:07:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Thu, Sep 30, 2021 at 1:26 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > If the comment describes the objective correctly, the only possible
> > impact would be that there may be a case where server responds a bit
> > slowly for a shutdown request.  But I'm not sure it is definitely
> > true.
> >
> 
> So, we should try to find how wal sender shutdown is dependent on
> sending keep alive and second thing is what about sync rep case? I
> think in the worst case that also might delay. Why do you think that
> would be acceptable?

Mmm. AFAICS, including the history of the code, the problem to fix
looks to be that the logical wal receiver doesn't send a flush
response spontaneously. As long as the receiver doesn't do that, and unless
we allow some delay in the response, the sender inevitably needs to ping
frequently until the wanted response returns.

It seems to me that it is better to make the receiver spontaneously send a
response when its flush LSN moves, rather than tweaking the keepalive
sending mechanism.  But letting XLogFlush trigger lsn_mapping
processing does not seem simple..
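
(To make that idea concrete, here is a purely hypothetical sketch, not a
patch and not Horiguchi-san's code, of a receiver-side hook that reports
the flush position as soon as it advances. send_standby_reply() is a
made-up stand-in for whatever reply path the receiver already has, e.g.
send_feedback() in the logical replication apply worker.)

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;               /* simplified stand-in */
    #define InvalidXLogRecPtr ((XLogRecPtr) 0)

    /* Made-up stand-in for the receiver's existing reply path. */
    extern void send_standby_reply(XLogRecPtr write_lsn, XLogRecPtr flush_lsn,
                                   bool request_reply);

    static XLogRecPtr last_reported_flush = InvalidXLogRecPtr;

    /*
     * Illustration only: call this whenever the receiver's flush position
     * advances (for example, after the commit record of an applied
     * transaction has been flushed), so the walsender learns the new flush
     * LSN without having to ping the receiver with keepalives.
     */
    static void
    report_flush_if_advanced(XLogRecPtr write_lsn, XLogRecPtr flush_lsn)
    {
        if (flush_lsn > last_reported_flush)
        {
            send_standby_reply(write_lsn, flush_lsn, false);
            last_reported_flush = flush_lsn;
        }
    }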

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center