Thread: Logical replication keepalive flood
Hi,
I have observed the following behavior with PostgreSQL 13.3.
The WAL sender process sends approximately 500 keepalive messages per second to pg_recvlogical.
These keepalive messages are totally unnecessary.
Keepalives should be sent only when there has been no network traffic for a certain time (half of wal_sender_timeout).
These keepalive messages not only choke the network but also hurt the performance of the receiver,
because the receiver has to process each received message and then decide whether or not to reply to it.
The receiver stays busy doing this 500 times a second.
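For reference, walsender.c already has a timer-driven path, WalSndKeepaliveIfNecessary(), that behaves this way. Roughly paraphrased (a sketch of its logic, not a verbatim copy of the source):

    static void
    WalSndKeepaliveIfNecessary(void)
    {
        TimestampTz ping_time;

        /* Nothing to do if timeouts are disabled or no reply has arrived yet */
        if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
            return;

        /* Don't pile up pings; one outstanding request is enough */
        if (waiting_for_ping_response)
            return;

        /*
         * Send a keepalive only once half of wal_sender_timeout has elapsed
         * since the last reply from the receiver.
         */
        ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                                wal_sender_timeout / 2);
        if (GetCurrentTimestamp() >= ping_time)
        {
            WalSndKeepalive(true);

            /* Try to flush the keepalive out to the client right away */
            if (pq_flush_if_writable() != 0)
                WalSndShutdown();
        }
    }

The call discussed below bypasses this timing check entirely.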
Investigation revealed that the following code fragment in the function WalSndWaitForWal() in walsender.c is responsible for sending these frequent keepalives:
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
        WalSndKeepalive(false);
waiting_for_ping_response is normally false, and flush and write will always be less than sentPtr, because the receiver's reported LSNs cannot get ahead of the server's send position; they only catch up when a standby status reply is processed. So the condition holds on almost every pass through the loop.
Here are the steps to reproduce:
1. Start the database server.
2. Set up the pgbench tables.
./pgbench -i -s 50 -h 192.168.5.140 -p 7654 -U abbas postgres
3. Create a logical replication slot.
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');
4. Start pg_recvlogical.
./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas --start -f -
5. Run pgbench.
./pgbench -U abbas -h 192.168.5.140 -p 7654 -c 2 -j 2 -T 1200 -n postgres
6. Observe the network traffic to see the keepalive flood.
Alternatively, modify the above code fragment as follows to log each keepalive; this shows approximately 500 such log messages per second:
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
    {
        elog(LOG, "[Keepalive] wrt ptr %X/%X snt ptr %X/%X ",
             (uint32) (MyWalSnd->write >> 32),
             (uint32) MyWalSnd->write,
             (uint32) (sentPtr >> 32),
             (uint32) sentPtr);
        WalSndKeepalive(false);
    }
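With hypothetical LSN values, this produces log lines of the form:

    LOG:  [Keepalive] wrt ptr 0/9A3F2D10 snt ptr 0/9A3F2DF8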
Opinions?
--
Abbas
Senior Architect
Ph: 92.334.5100153
Skype ID: gabbasb
Follow us on Twitter
@EnterpriseDBAt Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > Hi, > I have observed the following behavior with PostgreSQL 13.3. > > The WAL sender process sends approximately 500 keepalive messages per > second to pg_recvlogical. > These keepalive messages are totally un-necessary. > Keepalives should be sent only if there is no network traffic and a certain > time (half of wal_sender_timeout) passes. > These keepalive messages not only choke the network but also impact the > performance of the receiver, > because the receiver has to process the received message and then decide > whether to reply to it or not. > The receiver remains busy doing this activity 500 times a second. I can reproduce the problem. > On investigation it is revealed that the following code fragment in > function WalSndWaitForWal in file walsender.c is responsible for sending > these frequent keepalives: > > if (MyWalSnd->flush < sentPtr && > MyWalSnd->write < sentPtr && > !waiting_for_ping_response) > WalSndKeepalive(false); The immediate cause is pg_recvlogical doesn't send a reply before sleeping. Currently it sends replies every 10 seconds intervals. So the attached first patch stops the flood. That said, I don't think it is not intended that logical walsender sends keep-alive packets with such a high frequency. It happens because walsender actually doesn't wait at all because it waits on WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before is always pending. So as the attached second, we should try to flush out the keep-alive packets if possible before checking pg_is_send_pending(). Any one can "fix" the issue but I think each of them is reasonable by itself. Any thoughts, suggestions and/or opinions? regareds. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c index 5efec160e8..4497ff1071 100644 --- a/src/bin/pg_basebackup/pg_recvlogical.c +++ b/src/bin/pg_basebackup/pg_recvlogical.c @@ -362,6 +362,10 @@ StreamLogicalLog(void) goto error; } + /* sned reply for all writes so far */ + if (!flushAndSendFeedback(conn, &now)) + goto error; + FD_ZERO(&input_mask); FD_SET(PQsocket(conn), &input_mask); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 109c723f4e..fcea56d1c1 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1469,6 +1469,9 @@ WalSndWaitForWal(XLogRecPtr loc) /* Send keepalive if the time has come */ WalSndKeepaliveIfNecessary(); + /* We may have queued a keep alive packet. flush it before sleeping. */ + pq_flush_if_writable(); + /* * Sleep until something happens or we time out. Also wait for the * socket becoming writable, if there's still pending output.
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > > Hi, > > I have observed the following behavior with PostgreSQL 13.3. > > > > The WAL sender process sends approximately 500 keepalive messages per > > second to pg_recvlogical. > > These keepalive messages are totally un-necessary. > > Keepalives should be sent only if there is no network traffic and a certain > > time (half of wal_sender_timeout) passes. > > These keepalive messages not only choke the network but also impact the > > performance of the receiver, > > because the receiver has to process the received message and then decide > > whether to reply to it or not. > > The receiver remains busy doing this activity 500 times a second. > > I can reproduce the problem. > > > On investigation it is revealed that the following code fragment in > > function WalSndWaitForWal in file walsender.c is responsible for sending > > these frequent keepalives: > > > > if (MyWalSnd->flush < sentPtr && > > MyWalSnd->write < sentPtr && > > !waiting_for_ping_response) > > WalSndKeepalive(false); > > The immediate cause is pg_recvlogical doesn't send a reply before > sleeping. Currently it sends replies every 10 seconds intervals. > Yeah, but one can use -s option to send it at lesser intervals. > So the attached first patch stops the flood. > I am not sure sending feedback every time before sleep is a good idea, this might lead to unnecessarily sending more messages. Can we try by using one-second interval with -s option to see how it behaves? As a matter of comparison the similar logic in workers.c uses wal_receiver_timeout to send such an update message rather than sending it every time before sleep. > That said, I don't think it is not intended that logical walsender > sends keep-alive packets with such a high frequency. It happens > because walsender actually doesn't wait at all because it waits on > WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before > is always pending. > > So as the attached second, we should try to flush out the keep-alive > packets if possible before checking pg_is_send_pending(). > /* Send keepalive if the time has come */ WalSndKeepaliveIfNecessary(); + /* We may have queued a keep alive packet. flush it before sleeping. */ + pq_flush_if_writable(); We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary after sending the keep-alive message, so not sure how this helps? -- With Regards, Amit Kapila.
On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> >
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
>
> I can reproduce the problem.
>
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> >
> > if (MyWalSnd->flush < sentPtr &&
> > MyWalSnd->write < sentPtr &&
> > !waiting_for_ping_response)
> > WalSndKeepalive(false);
>
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
>
Yeah, but one can use -s option to send it at lesser intervals.
That option affects only pg_recvlogical; it will not stop the server from sending keepalives too frequently.
By default the status interval is 10 seconds, yet we are still getting 500 keepalives a second from the server.
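(For reference, that receiver-side interval is pg_recvlogical's -s/--status-interval option; e.g. the step-4 command above could be rerun as

    ./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas -s 1 --start -f -

to send a standby status update every second. The point being made here is that this only controls how often the receiver reports its write/flush positions; it does not by itself throttle the server's keepalives.)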
> So the attached first patch stops the flood.
>
I am not sure sending feedback every time before sleep is a good idea,
this might lead to unnecessarily sending more messages. Can we try by
using one-second interval with -s option to see how it behaves? As a
matter of comparison the similar logic in workers.c uses
wal_receiver_timeout to send such an update message rather than
sending it every time before sleep.
> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency. It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
>
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
>
/* Send keepalive if the time has come */
WalSndKeepaliveIfNecessary();
+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();
We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?
--
With Regards,
Amit Kapila.
--
Abbas
Senior Architect
Ph: 92.334.5100153
Skype ID: gabbasb
Follow us on Twitter
@EnterpriseDBAt Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > The immediate cause is pg_recvlogical doesn't send a reply before > > > sleeping. Currently it sends replies every 10 seconds intervals. > > > > > > > Yeah, but one can use -s option to send it at lesser intervals. > > > > That option can impact pg_recvlogical, it will not impact the server > sending keepalives too frequently. > By default the status interval is 10 secs, still we are getting 500 > keepalives a second from the server. > > > > So the attached first patch stops the flood. > > > > > > > I am not sure sending feedback every time before sleep is a good idea, > > this might lead to unnecessarily sending more messages. Can we try by > > using one-second interval with -s option to see how it behaves? As a > > matter of comparison the similar logic in workers.c uses > > wal_receiver_timeout to send such an update message rather than > > sending it every time before sleep. Logical walreceiver sends a feedback when walrcv_eceive() doesn't receive a byte. If its' not good that pg_recvlogical does the same thing, do we need to improve logical walsender's behavior as well? > > > That said, I don't think it is not intended that logical walsender > > > sends keep-alive packets with such a high frequency. It happens > > > because walsender actually doesn't wait at all because it waits on > > > WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before > > > is always pending. > > > > > > So as the attached second, we should try to flush out the keep-alive > > > packets if possible before checking pg_is_send_pending(). > > > > > > > /* Send keepalive if the time has come */ > > WalSndKeepaliveIfNecessary(); > > > > + /* We may have queued a keep alive packet. flush it before sleeping. */ > > + pq_flush_if_writable(); > > > > We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary > > after sending the keep-alive message, so not sure how this helps? No. WalSndKeepaliveIfNecessary calls it only when walreceiver does not receive a reply message for a long time. So the keepalive sent by the direct call to WalSndKeepalive() from WalSndWaitForWal is not flushed out in most cases, which causes the flood. I rechecked all callers of WalSndKeepalive(). WalSndKeepalive() +- *WalSndWaltForWal +- ProcessStandbyReplyMessage |+- ProcessStandbyMessage | +- ProcessRepliesIfAny | +- $WalSndWriteData | +- *WalSndWaitForWal | +- WalSndLoop | (calls pq_flush_if_writable() after sending the packet, but the | keepalive packet prevents following stream data from being sent | since the pending keepalive-packet causes pq_is_sned_pending() | return (falsely) true.) +- WalSndDone +- *WalSndLoop +- WalSndKeepaliveIfNecessary (calls pq_flush_if_writable always only after calling WalSndKeepalive()) The callers prefixed by '*' above misunderstand that some of the data sent by them are still pending even when the only pending bytes is the keepalive packet. Of course the keepalive pakcets should be sent *before* sleep and the unsent keepalive packet prevents the callers from sleeping then they immediately retry sending another keepalive pakcet and repeat it until the condition changes. (The callers prevised by "$" also enters a sleep before flushing but doesn't repeat sending keepalives.) The caller is forgetting that a keepalive pakcet may be queued but not flushed after calling WalSndKeepalive. 
So a more sensible fix would be calling pq_flush_if_writable() in WalSndKeepalive?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
>
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte. If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?

For the clarity, only the change in the walsender side can stop the
flood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
I have tried to test your patches. Unfortunately, even after applying the patches,
the WAL sender is still sending keepalive messages too frequently.
In my opinion the fix is to make sure that wal_sender_timeout/2 has passed before sending
the keepalive message in the code fragment I had shared earlier.
In other words we should replace the call to
WalSndKeepalive(false);
with
WalSndKeepaliveIfNecessary(false);
Do you agree with the suggested fix?
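Spelled out against the fragment above, the suggested change would look roughly like this (a sketch of the proposal, not a tested patch; note that WalSndKeepaliveIfNecessary() in walsender.c takes no arguments, so the new call would be written without the false):

         if (MyWalSnd->flush < sentPtr &&
             MyWalSnd->write < sentPtr &&
             !waiting_for_ping_response)
-            WalSndKeepalive(false);
+            WalSndKeepaliveIfNecessary();

so that the half-of-wal_sender_timeout check would apply before any keepalive is queued from this code path.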
On Tue, Jun 8, 2021 at 10:09 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
>
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte. If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?
For the clarity, only the change in the walsender side can stop the
flood.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Abbas
Senior Architect
Ph: 92.334.5100153
Skype ID: gabbasb
Follow us on Twitter
@EnterpriseDBHi. On 2021/06/08 21:21, Abbas Butt wrote: > Hi Kyotaro, > I have tried to test your patches. Unfortunately even after applying the > patches > the WAL Sender is still sending too frequent keepalive messages. Sorry for the bogus patch. I must have seen something impossible. The keep-alive packet is immediately flushed explicitly, so Amit is right that no additional pq_flush_if_writable() is not needed. > In my opinion the fix is to make sure that wal_sender_timeout/2 has passed > before sending > the keepalive message in the code fragment I had shared earlier. > In other words we should replace the call to > WalSndKeepalive(false); > with > WalSndKeepaliveIfNecessary(false); > > Do you agree with the suggested fix? I'm afraid not. The same is done just after unconditionally. The issue - if actually it is - we send a keep-alive packet before a quite short sleep. We really want to send it if the sleep gets long but we cannot predict that before entering a sleep. Let me think a little more on this.. regards.
At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > The issue - if actually it is - we send a keep-alive packet before a > quite short sleep. > > We really want to send it if the sleep gets long but we cannot predict > that before entering a sleep. > > Let me think a little more on this.. After some investigation, I find out that the keepalives are sent almost always after XLogSendLogical requests for the *next* record. In most of the cases the record is not yet inserted at the request time but insertd very soon (in 1-digit milliseconds). It doesn't seem to be expected that that happens with such a high frequency when XLogSendLogical is keeping up-to-date with the bleeding edge of WAL records. It is completely unpredictable when the next record comes, so we cannot decide whether to send a keepalive or not at the current timing. Since we want to send a keepalive when we have nothing to send for a while, it is a bit different to keep sending keepalives at some intervals while the loop is busy. As a possible solution, the attached patch splits the sleep into two pieces. If the first sleep reaches the timeout then send a keepalive then sleep for the remaining time. The first timeout is quite arbitrary but keepalive of 4Hz at maximum doesn't look so bad to me. Is it acceptable? regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 109c723f4e..49b3c0d4e2 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -105,6 +105,9 @@ */ #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16) +/* Minimum idle time for sending an idle-time keepalive in milliseconds */ +#define KEEPALIVE_TIMEOUT 250 + /* Array of WalSnds in shared memory */ WalSndCtlData *WalSndCtl = NULL; @@ -244,7 +247,7 @@ static void WalSndKeepalive(bool requestReply); static void WalSndKeepaliveIfNecessary(void); static void WalSndCheckTimeOut(void); static long WalSndComputeSleeptime(TimestampTz now); -static void WalSndWait(uint32 socket_events, long timeout, uint32 wait_event); +static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event); static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write); static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write); static void WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid); @@ -1428,19 +1431,6 @@ WalSndWaitForWal(XLogRecPtr loc) if (got_STOPPING) break; - /* - * We only send regular messages to the client for full decoded - * transactions, but a synchronous replication and walsender shutdown - * possibly are waiting for a later location. So, before sleeping, we - * send a ping containing the flush location. If the receiver is - * otherwise idle, this keepalive will trigger a reply. Processing the - * reply will update these MyWalSnd locations. - */ - if (MyWalSnd->flush < sentPtr && - MyWalSnd->write < sentPtr && - !waiting_for_ping_response) - WalSndKeepalive(false); - /* check whether we're done */ if (loc <= RecentFlushPtr) break; @@ -1483,6 +1473,39 @@ WalSndWaitForWal(XLogRecPtr loc) if (pq_is_send_pending()) wakeEvents |= WL_SOCKET_WRITEABLE; + /* + * We only send regular messages to the client for full decoded + * transactions, but a synchronous replication and walsender shutdown + * possibly are waiting for a later location. 
So, before sleeping, we + * send a ping containing the flush location. If the receiver is + * otherwise idle, this keepalive will trigger a reply. Processing the + * reply will update these MyWalSnd locations. If the sleep is shorter + * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to + * prevent it from getting too-frequent. + */ + if (MyWalSnd->flush < sentPtr && + MyWalSnd->write < sentPtr && + !waiting_for_ping_response) + { + if (sleeptime > KEEPALIVE_TIMEOUT) + { + int r; + + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT, + WAIT_EVENT_WAL_SENDER_WAIT_WAL); + + if (r != 0) + continue; + + sleeptime -= KEEPALIVE_TIMEOUT; + } + + WalSndKeepalive(false); + + if (pq_flush_if_writable() != 0) + WalSndShutdown(); + } + WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL); } @@ -3136,15 +3159,18 @@ WalSndWakeup(void) * composed of optional WL_SOCKET_WRITEABLE and WL_SOCKET_READABLE flags. Exit * on postmaster death. */ -static void +static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event) { WaitEvent event; + int ret; ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL); - if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 && - (event.events & WL_POSTMASTER_DEATH)) + ret = WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event); + if (ret == 1 && (event.events & WL_POSTMASTER_DEATH)) proc_exit(1); + + return ret; } /*
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > The issue - if actually it is - we send a keep-alive packet before a > > quite short sleep. > > > > We really want to send it if the sleep gets long but we cannot predict > > that before entering a sleep. > > > > Let me think a little more on this.. > > After some investigation, I find out that the keepalives are sent > almost always after XLogSendLogical requests for the *next* record. > Does these keepalive messages are sent at the same frequency even for subscribers? Basically, I wanted to check if we have logical replication set up between 2 nodes then do we send these keep-alive messages flood? If not, then why is it different in the case of pg_recvlogical? Is it possible that the write/flush location is not updated at the pace at which we expect? Please see commit 41d5f8ad73 which seems to be talking about a similar problem. -- With Regards, Amit Kapila.
Hi,
On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> >
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> >
> > Let me think a little more on this..
>
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
>
Does these keepalive messages are sent at the same frequency even for
subscribers?
Yes, I have tested it with one publisher and one subscriber.
The moment I start a pgbench session I can see keepalive messages sent, and replied to by the subscriber, at the same frequency.
Basically, I wanted to check if we have logical
replication set up between 2 nodes then do we send these keep-alive
messages flood?
Yes we do.
If not, then why is it different in the case of
pg_recvlogical?
Nothing; the WAL sender behaviour is the same in both cases.
Is it possible that the write/flush location is not
updated at the pace at which we expect?
Well, it is async replication. The receiver can choose to update LSNs at its own will, say at a 10-minute interval.
That should only impact the amount of WAL retained by the server.
Please see commit 41d5f8ad73
which seems to be talking about a similar problem.
That commit does not address this problem.
--
With Regards,
Amit Kapila.
--
Abbas
Senior Architect
Ph: 92.334.5100153
Skype ID: gabbasb
Follow us on Twitter
@EnterpriseDBAt Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Does these keepalive messages are sent at the same frequency even for > > subscribers? > > Yes, I have tested it with one publisher and one subscriber. > The moment I start pgbench session I can see keepalive messages sent and > replied by the subscriber with same frequency. > > > Basically, I wanted to check if we have logical > > replication set up between 2 nodes then do we send these keep-alive > > messages flood? > > Yes we do. > > > If not, then why is it different in the case of > > pg_recvlogical? > > Nothing, the WAL sender behaviour is same in both cases. > > > > Is it possible that the write/flush location is not > > updated at the pace at which we expect? Yes. MyWalSnd->flush/write are updated far frequently but still MyWalSnd->write is behind sentPtr by from thousands of bytes up to less than 1 block (1block = 8192 bytes). (Flush lags are larger than write lags, of course.) I counted how many times keepalives are sent for each request length to logical_read_xlog_page() for 10 seconds pgbench run and replicating pgbench_history, using the attached change. size: sent /notsent/ calls: write lag/ flush lag 8: 3 / 6 / 3: 5960 / 348962 16: 1 / 2 / 1: 520 / 201096 24: 2425 / 4852 / 2461: 5259 / 293569 98: 2 / 0 / 54: 5 / 1050 187: 2 / 0 / 94: 0 / 1060 4432: 1 / 0 / 1: 410473592 / 410473592 7617: 2 / 0 / 27: 317 / 17133 8280: 1 / 2 / 4: 390 / 390 Where, size is requested data length to logical_read_xlog_page() sent is the number of keepalives sent in the loop in WalSndWaitForWal notsent is the number of runs of the loop in WalSndWaitForWal without sending a keepalive calls is the number of calls to WalSndWaitForWal write lag is the bytes MyWalSnd->write is behind from sentPtr at the first run of the loop per call to logical_read_xlog_page. flush lag is the the same to the above for MyWalSnd->flush. Maybe the line of size=4432 is the first time fetch of WAL. So this numbers show that WalSndWaitForWal is called almost only at starting to fetching a record, and in that case the function runs the loop three times and sends one keepalive by average. > Well, it is async replication. The receiver can choose to update LSNs at > its own will, say after 10 mins interval. > It should only impact the size of WAL retained by the server. > > Please see commit 41d5f8ad73 > > which seems to be talking about a similar problem. > > > > That commit does not address this problem. Yeah, at least for me, WalSndWaitForWal send a keepalive per one call. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 42738eb940..ee78116e79 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -571,6 +571,7 @@ err: * We fetch the page from a reader-local cache if we know we have the required * data and if there hasn't been any error since caching the data. 
*/ +int hogestate = -1; static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) { @@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) { XLogRecPtr targetSegmentPtr = pageptr - targetPageOff; + hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr; readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ, state->currRecPtr, state->readBuf); @@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) * First, read the requested data length, but at least a short page header * so that we can validate it. */ + hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr; readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD), state->currRecPtr, state->readBuf); @@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) /* still not enough */ if (readLen < XLogPageHeaderSize(hdr)) { + hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr; readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr), state->currRecPtr, state->readBuf); @@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) goto err; } + hogestate = -1; /* * Now that we know we have the full header, validate it. */ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 109c723f4e..0de10c4a31 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1363,17 +1363,45 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId * if we detect a shutdown request (either from postmaster or client) * we will return early, so caller must always check. */ +unsigned long counts[32768][3] = {0}; +unsigned long lagw[32768] = {0}; +unsigned long lagf[32768] = {0}; + +void +PrintCounts(void) +{ + int i = 0; + for (i = 0 ; i < 32768 ; i++) + { + if (counts[i][0] + counts[i][1] + counts[i][2] > 0) + { + unsigned long wl = 0, fl = 0; + if (counts[i][1] > 0) + { + wl = lagw[i] / counts[i][0]; + fl = lagf[i] / counts[i][0]; + + ereport(LOG, (errmsg ("[%5d]: %5lu / %5lu / %5lu: %5lu %5lu", + i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true))); + } + } + } +} + static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc) { int wakeEvents; static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr; + extern int hogestate; + bool lagtaken = false; /* * Fast path to avoid acquiring the spinlock in case we already know we * have enough WAL available. This is particularly interesting if we're * far behind. 
*/ + counts[hogestate][0]++; if (RecentFlushPtr != InvalidXLogRecPtr && loc <= RecentFlushPtr) return RecentFlushPtr; @@ -1439,7 +1467,39 @@ WalSndWaitForWal(XLogRecPtr loc) if (MyWalSnd->flush < sentPtr && MyWalSnd->write < sentPtr && !waiting_for_ping_response) + { + if (hogestate >= 0) + { + counts[hogestate][1]++; + if (!lagtaken) + { + lagf[hogestate] += sentPtr - MyWalSnd->flush; + lagw[hogestate] += sentPtr - MyWalSnd->write; + lagtaken = true; + } + } +// ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld", +// ka, na, ka + na, +// LSN_FORMAT_ARGS(MyWalSnd->flush), +// LSN_FORMAT_ARGS(MyWalSnd->write), +// LSN_FORMAT_ARGS(sentPtr), +// waiting_for_ping_response, +// sentPtr - MyWalSnd->write))); WalSndKeepalive(false); + } + else + { + if (hogestate >= 0) + counts[hogestate][2]++; + +// ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld", +// ka, na, ka + na, +// LSN_FORMAT_ARGS(MyWalSnd->flush), +// LSN_FORMAT_ARGS(MyWalSnd->write), +// LSN_FORMAT_ARGS(sentPtr), +// waiting_for_ping_response, +// sentPtr - MyWalSnd->write))); + } /* check whether we're done */ if (loc <= RecentFlushPtr)
At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > > > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Is it possible that the write/flush location is not > > > updated at the pace at which we expect? > > Yes. MyWalSnd->flush/write are updated far frequently but still > MyWalSnd->write is behind sentPtr by from thousands of bytes up to > less than 1 block (1block = 8192 bytes). (Flush lags are larger than > write lags, of course.) For more clarity, I changed the previous patch a bit and retook numbers. Total records: 19476 8: 2 / 4 / 2: 4648 / 302472 16: 5 / 10 / 5: 5427 / 139872 24: 3006 / 6015 / 3028: 4739 / 267215 187: 2 / 0 / 50: 1 / 398 While a 10 seconds run of pgbench, it walsender reads 19476 records and calls logical_read_xlog_page() 3028 times, and the mean of write lag is 4739 bytes and flush lag is 267215 bytes (really?), as the result most of the record fetch causes a keep alive. (The WAL contains many FPIs). regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 42738eb940..ee78116e79 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -571,6 +571,7 @@ err: * We fetch the page from a reader-local cache if we know we have the required * data and if there hasn't been any error since caching the data. */ +int hogestate = -1; static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) { @@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) { XLogRecPtr targetSegmentPtr = pageptr - targetPageOff; + hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr; readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ, state->currRecPtr, state->readBuf); @@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) * First, read the requested data length, but at least a short page header * so that we can validate it. */ + hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr; readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD), state->currRecPtr, state->readBuf); @@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) /* still not enough */ if (readLen < XLogPageHeaderSize(hdr)) { + hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr; readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr), state->currRecPtr, state->readBuf); @@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) goto err; } + hogestate = -1; /* * Now that we know we have the full header, validate it. */ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 109c723f4e..62f5f09fee 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1363,17 +1363,49 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId * if we detect a shutdown request (either from postmaster or client) * we will return early, so caller must always check. 
*/ +unsigned long counts[32768][3] = {0}; +unsigned long lagw[32768] = {0}; +unsigned long lagf[32768] = {0}; +unsigned long nrec = 0; +void +PrintCounts(void) +{ + int i = 0; + ereport(LOG, (errmsg ("Total records: %lu", nrec), errhidestmt(true))); + nrec = 0; + + for (i = 0 ; i < 32768 ; i++) + { + if (counts[i][0] + counts[i][1] + counts[i][2] > 0) + { + unsigned long wl = 0, fl = 0; + if (counts[i][1] > 0) + { + wl = lagw[i] / counts[i][0]; + fl = lagf[i] / counts[i][0]; + + ereport(LOG, (errmsg ("%5d: %5lu / %5lu / %5lu: %7lu / %7lu", + i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true))); + } + counts[i][0] = counts[i][1] = counts[i][2] = lagw[i] = lagf[i] = 0; + } + } +} + static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc) { int wakeEvents; static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr; + extern int hogestate; + bool lagtaken = false; /* * Fast path to avoid acquiring the spinlock in case we already know we * have enough WAL available. This is particularly interesting if we're * far behind. */ + counts[hogestate][0]++; if (RecentFlushPtr != InvalidXLogRecPtr && loc <= RecentFlushPtr) return RecentFlushPtr; @@ -1439,7 +1471,39 @@ WalSndWaitForWal(XLogRecPtr loc) if (MyWalSnd->flush < sentPtr && MyWalSnd->write < sentPtr && !waiting_for_ping_response) + { + if (hogestate >= 0) + { + counts[hogestate][1]++; + if (!lagtaken) + { + lagf[hogestate] += sentPtr - MyWalSnd->flush; + lagw[hogestate] += sentPtr - MyWalSnd->write; + lagtaken = true; + } + } +// ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld", +// ka, na, ka + na, +// LSN_FORMAT_ARGS(MyWalSnd->flush), +// LSN_FORMAT_ARGS(MyWalSnd->write), +// LSN_FORMAT_ARGS(sentPtr), +// waiting_for_ping_response, +// sentPtr - MyWalSnd->write))); WalSndKeepalive(false); + } + else + { + if (hogestate >= 0) + counts[hogestate][2]++; + +// ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld", +// ka, na, ka + na, +// LSN_FORMAT_ARGS(MyWalSnd->flush), +// LSN_FORMAT_ARGS(MyWalSnd->write), +// LSN_FORMAT_ARGS(sentPtr), +// waiting_for_ping_response, +// sentPtr - MyWalSnd->write))); + } /* check whether we're done */ if (loc <= RecentFlushPtr) @@ -2843,6 +2907,7 @@ XLogSendLogical(void) { XLogRecord *record; char *errm; + extern unsigned long nrec; /* * We'll use the current flush point to determine whether we've caught up. @@ -2860,6 +2925,7 @@ XLogSendLogical(void) */ WalSndCaughtUp = false; + nrec++; record = XLogReadRecord(logical_decoding_ctx->reader, &errm); /* xlog record was invalid */
On Thu, Jun 10, 2021 at 11:42 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <abbas.butt@enterprisedb.com> wrote in > > > > > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Is it possible that the write/flush location is not > > > > updated at the pace at which we expect? > > > > Yes. MyWalSnd->flush/write are updated far frequently but still > > MyWalSnd->write is behind sentPtr by from thousands of bytes up to > > less than 1 block (1block = 8192 bytes). (Flush lags are larger than > > write lags, of course.) > > For more clarity, I changed the previous patch a bit and retook numbers. > > Total records: 19476 > 8: 2 / 4 / 2: 4648 / 302472 > 16: 5 / 10 / 5: 5427 / 139872 > 24: 3006 / 6015 / 3028: 4739 / 267215 > 187: 2 / 0 / 50: 1 / 398 > > While a 10 seconds run of pgbench, it walsender reads 19476 records > and calls logical_read_xlog_page() 3028 times, and the mean of write > lag is 4739 bytes and flush lag is 267215 bytes (really?), as the > result most of the record fetch causes a keep alive. (The WAL contains > many FPIs). > Good analysis. I think this analysis has shown that walsender is sending messages at top speed as soon as they are generated. So, I am wondering why there is any need to wait/sleep in such a workload. One possibility that occurred to me RecentFlushPtr is not updated and or we are not checking it aggressively. To investigate on that lines, can you check the behavior with the attached patch? This is just a quick hack patch to test whether we need to really wait for WAL a bit aggressively. -- With Regards, Amit Kapila.
Attachment
At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > Good analysis. I think this analysis has shown that walsender is > sending messages at top speed as soon as they are generated. So, I am > wondering why there is any need to wait/sleep in such a workload. One > possibility that occurred to me RecentFlushPtr is not updated and or > we are not checking it aggressively. To investigate on that lines, can > you check the behavior with the attached patch? This is just a quick > hack patch to test whether we need to really wait for WAL a bit > aggressively. Yeah, anyway the comment for the caller site of WalSndKeepalive tells that exiting out of the function *after* there is somewhat wrong. > * possibly are waiting for a later location. So, before sleeping, we > * send a ping containing the flush location. If the receiver is But I nothing changed by moving the keepalive check to after the exit check. (loc <= RecentFlushPtr). And the patch also doesn't change the situation so much. The average number of loops is reduced from 3 to 2 per call but the ratio between total records and keepalives doesn't change. previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154 this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151 Total records: 13208 reqsz: #sent/ #!sent/ #call: wr lag / fl lag 8: 4 / 4 / 4: 6448 / 268148 16: 1 / 1 / 1: 8688 / 387320 24: 1988 / 1987 / 1999: 6357 / 226163 195: 1 / 0 / 20: 408 / 1647 7477: 2 / 0 / 244: 68 / 847 8225: 1 / 1 / 1: 7208 / 7208 So I checked how many bytes RecentFlushPtr is behind requested loc if it is not advanced enough. Total records: 15128 reqsz: #sent/ #!sent/ #call: wr lag / fl lag / RecentFlushPtr lag 8: 2 / 2 / 2: 520 / 60640 / 8 16: 1 / 1 / 1: 8664 / 89336 / 16 24: 2290 / 2274 / 2302: 5677 / 230583 / 23 187: 1 / 0 / 40: 1 / 6118 / 1 7577: 1 / 0 / 69: 120 / 3733 / 65 8177: 1 / 1 / 1: 8288 / 8288 / 2673 So it's not a matter of RecentFlushPtr check. (Almost) Always when WalSndWakeupRequest feels to need to send a keepalive, the function is called before the record begins to be written. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jun 11, 2021 at 7:07 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > Good analysis. I think this analysis has shown that walsender is > > sending messages at top speed as soon as they are generated. So, I am > > wondering why there is any need to wait/sleep in such a workload. One > > possibility that occurred to me RecentFlushPtr is not updated and or > > we are not checking it aggressively. To investigate on that lines, can > > you check the behavior with the attached patch? This is just a quick > > hack patch to test whether we need to really wait for WAL a bit > > aggressively. > > Yeah, anyway the comment for the caller site of WalSndKeepalive tells > that exiting out of the function *after* there is somewhat wrong. > > > * possibly are waiting for a later location. So, before sleeping, we > > * send a ping containing the flush location. If the receiver is > > But I nothing changed by moving the keepalive check to after the exit > check. (loc <= RecentFlushPtr). > > And the patch also doesn't change the situation so much. The average > number of loops is reduced from 3 to 2 per call but the ratio between > total records and keepalives doesn't change. > > previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154 > this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151 > > Total records: 13208 > reqsz: #sent/ #!sent/ #call: wr lag / fl lag > 8: 4 / 4 / 4: 6448 / 268148 > 16: 1 / 1 / 1: 8688 / 387320 > 24: 1988 / 1987 / 1999: 6357 / 226163 > 195: 1 / 0 / 20: 408 / 1647 > 7477: 2 / 0 / 244: 68 / 847 > 8225: 1 / 1 / 1: 7208 / 7208 > > So I checked how many bytes RecentFlushPtr is behind requested loc if > it is not advanced enough. > > Total records: 15128 > reqsz: #sent/ #!sent/ #call: wr lag / fl lag / RecentFlushPtr lag > 8: 2 / 2 / 2: 520 / 60640 / 8 > 16: 1 / 1 / 1: 8664 / 89336 / 16 > 24: 2290 / 2274 / 2302: 5677 / 230583 / 23 > 187: 1 / 0 / 40: 1 / 6118 / 1 > 7577: 1 / 0 / 69: 120 / 3733 / 65 > 8177: 1 / 1 / 1: 8288 / 8288 / 2673 > Does this data indicate that when the request_size is 187 or 7577, even though we have called WalSndWaitForWal() 40 and 69 times respectively but keepalive is sent just once? Why such a behavior should depend upon request size? > So it's not a matter of RecentFlushPtr check. (Almost) Always when > WalSndWakeupRequest feels to need to send a keepalive, the function is > called before the record begins to be written. > I think we always wake up walsender after we have flushed the WAL via WalSndWakeupProcessRequests(). I think here the reason why we are seeing keepalives is that we always send it before sleeping. So, it seems each time we try to read a new page, we call WalSndWaitForWal which sends at least one keepalive message. I am not sure what is an appropriate way to reduce the frequency of these keepalive messages. Andres might have some ideas? -- With Regards, Amit Kapila.
Hi. By using Kyotaro's "counting" patch I was able to reproduce very similar results to what he had earlier posted [1]. AFAIK I have the same test scenario that he was using. Test setup: - using async pub/sub - subscription is for the pgbench_history table - pgbench is run for 10 seconds - config for all the wal_sender/receiver timeout GUCs are just default values Results (HEAD + Kyotaro counting patch) ======================================= [postgres@CentOS7-x64 ~]$ 2021-08-10 16:36:23.733 AEST [32436] LOG: Total records: 18183 2021-08-10 16:36:23.733 AEST [32436] LOG: 8: 2 / 0 / 1: 440616 / 580320 2021-08-10 16:36:23.733 AEST [32436] LOG: 16: 4 / 8 / 4: 4524 / 288688 2021-08-10 16:36:23.733 AEST [32436] LOG: 24: 2916 / 5151 / 2756: 31227 / 323190 2021-08-10 16:36:23.733 AEST [32436] LOG: 187: 2 / 0 / 51: 157 / 10629 2021-08-10 16:36:23.733 AEST [32436] LOG: 2960: 1 / 0 / 1: 49656944 / 49656944 2021-08-10 16:36:23.733 AEST [32436] LOG: 7537: 2 / 0 / 36: 231 / 7028 2021-08-10 16:36:23.733 AEST [32436] LOG: 7577: 1 / 2 / 78: 106 / 106 2021-08-10 16:36:23.733 AEST [32436] LOG: 8280: 1 / 2 / 3: 88 / 88 ////// That base data is showing there are similar numbers of keepalives sent as there are calls made to WalSndWaitForWal. IIUC it means that mostly the loop is sending the special keepalives on the *first* iteration, but by the time of the *second* iteration the ProcessRepliesIfAny() will have some status already received, and so mostly sending another keepalive will be deemed unnecessary. Based on this, our idea was to simply skip sending the WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA the patch 0002 which does this skip. With this skip patch (v1-0002) applied the same pgbench tests were run again. The results look like below. Results (HEAD + Kyotaro patch + Skip-first keepalive patch) =========================================================== RUN #1 ------ [postgres@CentOS7-x64 ~]$ 2021-08-11 16:32:59.827 AEST [20339] LOG: Total records: 19367 2021-08-11 16:32:59.827 AEST [20339] LOG: 24: 10 / 9232 / 3098: 19 / 440 2021-08-11 16:32:59.827 AEST [20339] LOG: 102: 1 / 1 / 32: 257 / 16828 2021-08-11 16:32:59.827 AEST [20339] LOG: 187: 1 / 1 / 52: 155 / 9541 RUN #2 ------ [postgres@CentOS7-x64 ~]$ 2021-08-11 16:36:03.983 AEST [25513] LOG: Total records: 17815 2021-08-11 16:36:03.983 AEST [25513] LOG: 24: 73 / 8683 / 2958: 1647 / 3290 2021-08-11 16:36:03.983 AEST [25513] LOG: 8280: 1 / 1 / 3: 88 / 88 RUN #3 ------ [postgres@CentOS7-x64 ~]$ 2021-08-11 16:39:27.655 AEST [31061] LOG: Total records: 19906 2021-08-11 16:39:27.655 AEST [31061] LOG: 24: 18 / 8546 / 2890: 61 / 1530 2021-08-11 16:39:27.655 AEST [31061] LOG: 83: 1 / 3 / 1: 8664 / 8664 ~~ This data shows the special keepalives are now greatly reduced from 1000s to just 10s. Thoughts? ------ [1] https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
FYI - Here are some more counter results with/without the skip patch [1] applied. This is the same test setup as before except now using *synchronous* pub/sub. ////////// Test setup: - using synchronous pub/sub - subscription is for the pgbench_history table - pgbench is run for 10 seconds - config for all the wal_sender/receiver timeout GUCs are just default values WITHOUT the skip-first patch applied ===================================== RUN #1 ------ LOG: Total records: 310 LOG: 24: 49 / 131 / 49: 8403 / 9270 LOG: 944: 1 / 0 / 1: 159693904 / 159693904 LOG: 8280: 1 / 2 / 2: 480 / 480 RUN #2 ------ LOG: Total records: 275 LOG: 24: 45 / 129 / 46: 8580 / 8766 LOG: 5392: 1 / 0 / 1: 160107248 / 160107248 RUN #3 ------ LOG: Total records: 330 LOG: 24: 50 / 144 / 51: 8705 / 8705 LOG: 3704: 1 / 0 / 1: 160510344 / 160510344 LOG: 8280: 1 / 2 / 2: 468 / 468 WITH the skip-first patch applied ================================= RUN #1 ------ LOG: Total records: 247 LOG: 24: 5 / 172 / 44: 3601700 / 3601700 LOG: 8280: 1 / 1 / 1: 1192 / 1192 RUN #2 ------ LOG: Total records: 338 LOG: 24: 8 / 199 / 55: 1335 / 1335 LOG: 7597: 1 / 1 / 1: 11712 / 11712 LOG: 8280: 1 / 1 / 2: 480 / 480 RUN #3 ------ LOG: Total records: 292 LOG: 24: 4 / 184 / 49: 719 / 719 ////////// As before there is a big % reduction of keepalives after the patch, except here there was never really much of a "flood" in the first place. ------ [1] https://www.postgresql.org/message-id/CAHut%2BPtyMBzweYUpb_QazVL6Uze2Yc5M5Ti2Xwee_eWM3Jrbog%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Aug 12, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > This data shows the special keepalives are now greatly reduced from > 1000s to just 10s. > > Thoughts? > I could easily see the flood of keepalives with the test setup described by the original problem reporter (Abbas Butt). I found that the "v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch" patch reduced the keepalives by about 50% in this case. I also tried the pub/sub setup with the publication on the pgbench_history table. With this pub/sub setup, I found that the patch dramatically reduced the keepalives sent, similar to that reported by Peter. Results (using Kyotoro’s keepalive counting patch) are below: PUB/SUB, publishing the pgbench_history table (1) without patch, 10s pgbench run: 2021-09-08 15:21:56.643 AEST [63720] LOG: Total records: 47019 2021-09-08 15:21:56.643 AEST [63720] LOG: 8: 8 / 16 / 8: 8571 / 882048 2021-09-08 15:21:56.643 AEST [63720] LOG: 16: 5 / 10 / 5: 3649 / 764892 2021-09-08 15:21:56.643 AEST [63720] LOG: 24: 6271 / 12561 / 6331: 113989 / 886115 2021-09-08 15:21:56.643 AEST [63720] LOG: 195: 2 / 0 / 112: 72 / 10945 2021-09-08 15:21:56.643 AEST [63720] LOG: 6856: 1 / 0 / 1: 666232176 / 666232176 2021-09-08 15:21:56.643 AEST [63720] LOG: 7477: 2 / 0 / 298: 27 / 3303 2021-09-08 15:21:56.643 AEST [63720] LOG: 8159: 19 / 32 / 6073: 15 / 1869 (2) with patch, 10s pgbench run 2021-09-08 15:39:14.008 AEST [71431] LOG: Total records: 45858 2021-09-08 15:39:14.008 AEST [71431] LOG: 24: 61 / 18278 / 6168: 108034 / 115228 2021-09-08 15:39:14.008 AEST [71431] LOG: 84: 1 / 1 / 7: 2256 / 295230 2021-09-08 15:39:14.008 AEST [71431] LOG: 110: 1 / 1 / 3: 10629 / 708293 2021-09-08 15:39:14.008 AEST [71431] LOG: 7477: 18 / 18 / 4577: 53 / 7850 Where columns are: size: sent /notsent/ calls: write lag/ flush lag - size is requested data length to logical_read_xlog_page() - sent is the number of keepalives sent in the loop in WalSndWaitForWal - notsent is the number of runs of the loop in WalSndWaitForWal without sending a keepalive - calls is the number of calls to WalSndWaitForWal - write lag is the bytes MyWalSnd->write is behind from sentPtr at the first run of the loop per call to logical_read_xlog_page. - flush lag is the same as the above, but for MyWalSnd->flush. However, the problem I found is that, with the patch applied, there is a test failure when running “make check-world”: t/006_logical_decoding.pl ............ 4/14 # Failed test 'pg_recvlogical acknowledged changes' # at t/006_logical_decoding.pl line 117. # got: 'BEGIN # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' # expected: '' # Looks like you failed 1 test of 14. t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat 256, 0x100) Failed 1/14 subtests To investigate this, I added some additional logging to pg_recvlogical.c and PostgresNode.pm and re-ran 006_logical_decoding.pl without and with the patch (logs attached). When the patch is NOT applied, and when pg_recvlogical is invoked by the test for a 2nd time with the same "--endpos" LSN, it gets a keepalive, detects walEnd>=endpos, and thus returns an empty record. The test is expecting an empty record, so all is OK. 
When the patch is applied, and when pg_recvlogical is invoked by the test for a 2nd time with the same "--endpos" LSN, it gets a WAL record with THE SAME LSN (== endpos) as previously obtained by the last WAL record when it was invoked the 1st time, but the record data is actually the first row of some records written after endpos, that it wasn't meant to read. This doesn't seem right to me - how can pg_recvlogical receive two different WAL records with the same LSN? With the patch applied, I was expecting pg_reclogical to get WAL records with LSN>endpos, but this doesn't happen. I'm thinking that the patch must have broken walsender in some way, possibly by missing out on calls to ProcessStandbyReplyMessage() because the sending of some keepalives are avoided (see below), so that the MyWalSnd flush and write location are not kept up-to-date. The code comments below seem to hint about this. I don't really like the way keepalives are used for this, but this seems to be the existing functionality. Maybe someone else can confirm that this could indeed break walsender? walsender.c WalSndWaitForWal() /* * We only send regular messages to the client for full decoded * transactions, but a synchronous replication and walsender shutdown * possibly are waiting for a later location. So, before sleeping, we * send a ping containing the flush location. If the receiver is * otherwise idle, this keepalive will trigger a reply. Processing the * reply will update these MyWalSnd locations. */ if (!loop_first_time && /* avoid keepalive on first iteration */ <--- added by the patch MyWalSnd->flush < sentPtr && MyWalSnd->write < sentPtr && !waiting_for_ping_response) { WalSndKeepalive(false); Regards, Greg Nancarrow Fujitsu Australia
Attachment
From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > However, the problem I found is that, with the patch applied, there is > a test failure when running “make check-world”: > > t/006_logical_decoding.pl ............ 4/14 > # Failed test 'pg_recvlogical acknowledged changes' > # at t/006_logical_decoding.pl line 117. > # got: 'BEGIN > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > # expected: '' > # Looks like you failed 1 test of 14. > t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat > 256, 0x100) Failed 1/14 subtests > > After applying the patch, I saw the same problem and can reproduce it by the following steps: 1) execute the SQLs. -----------SQL----------- CREATE TABLE decoding_test(x integer, y text); SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding'); INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s; -- use the lsn here to execute pg_recvlogical later SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1; INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s; ---------------------- 2) Then, if I execute the following command twice: # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - BEGIN 708 table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' COMMIT 708 # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - BEGIN 709 It still generated ' BEGIN 709' in the second time execution. But it will output nothing in the second time execution without the patch. Best regards, Hou zj
On Thu, Sep 16, 2021 at 6:29 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > However, the problem I found is that, with the patch applied, there is > > a test failure when running “make check-world”: > > > > t/006_logical_decoding.pl ............ 4/14 > > # Failed test 'pg_recvlogical acknowledged changes' > > # at t/006_logical_decoding.pl line 117. > > # got: 'BEGIN > > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > > # expected: '' > > # Looks like you failed 1 test of 14. > > t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat > > 256, 0x100) Failed 1/14 subtests > > > > > > After applying the patch, > I saw the same problem and can reproduce it by the following steps: > > 1) execute the SQLs. > -----------SQL----------- > CREATE TABLE decoding_test(x integer, y text); > SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding'); > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s; > > -- use the lsn here to execute pg_recvlogical later > SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1; > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s; > ---------------------- > > 2) Then, if I execute the following command twice: > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > BEGIN 708 > table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' > table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' > table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' > table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' > COMMIT 708 > > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > BEGIN 709 > > It still generated ' BEGIN 709' in the second time execution. > But it will output nothing in the second time execution without the patch. > I think here the reason is that the first_lsn of a transaction is always equal to end_lsn of the previous transaction (See comments above first_lsn and end_lsn fields of ReorderBufferTXN). I have not debugged but I think in StreamLogicalLog() the cur_record_lsn after receiving 'w' message, in this case, will be equal to endpos whereas we expect to be greater than endpos to exit. Before the patch, it will always get the 'k' message where we expect the received lsn to be equal to endpos to conclude that we can exit. Do let me know if your analysis differs? -- With Regards, Amit Kapila.
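For reference, the asymmetry described above comes from the way StreamLogicalLog() in pg_recvlogical.c checks endpos for the two message types; roughly paraphrased (a sketch, not a verbatim copy of the source):

    /* 'k' (keepalive) message: done once the server reports it has sent everything up to endpos */
    if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
        endposReached = true;

    /* 'w' (WAL data) message: stop only when a record strictly past endpos arrives */
    if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
    {
        /* endpos reached; do not write this record */
        time_to_abort = true;
        break;
    }

With the skip-first-iteration patch the second pg_recvlogical invocation no longer gets that early keepalive, so it instead sees a data record whose LSN is merely equal to endpos and does not stop there.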
On Thu, Aug 12, 2021 at 8:03 AM Peter Smith <smithpb2250@gmail.com> wrote: > > That base data is showing there are similar numbers of keepalives sent > as there are calls made to WalSndWaitForWal. IIUC it means that mostly > the loop is sending the special keepalives on the *first* iteration, > but by the time of the *second* iteration the ProcessRepliesIfAny() > will have some status already received, and so mostly sending another > keepalive will be deemed unnecessary. > > Based on this, our idea was to simply skip sending the > WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA > the patch 0002 which does this skip. > I think we should also keep in mind that there are cases where it seems we are not able to send keep-alive at the appropriate frequency. See the discussion [1]. This is to ensure that we shouldn't unintentionally hamper some other workloads by fixing the workload being discussed here. [1] - https://www.postgresql.org/message-id/20210913.103107.813489310351696839.horikyota.ntt%40gmail.com -- With Regards, Amit Kapila.
On Thursday, September 16, 2021 8:36 PM Amit Kapila <amit.kapila16@gmail.com>: > On Thu, Sep 16, 2021 at 6:29 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > > After applying the patch, > > I saw the same problem and can reproduce it by the following steps: > > > > 1) execute the SQLs. > > -----------SQL----------- > > CREATE TABLE decoding_test(x integer, y text); > > SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding'); > > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) > s; > > > > -- use the lsn here to execute pg_recvlogical later > > SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) > ORDER BY lsn DESC LIMIT 1; > > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM > generate_series(5,50) s; > > ---------------------- > > > > 2) Then, if I execute the following command twice: > > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > > BEGIN 708 > > table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' > > table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' > > table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' > > table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' > > COMMIT 708 > > > > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > > BEGIN 709 > > > > It still generated ' BEGIN 709' in the second time execution. > > But it will output nothing in the second time execution without the patch. > > > > I think here the reason is that the first_lsn of a transaction is > always equal to end_lsn of the previous transaction (See comments > above first_lsn and end_lsn fields of ReorderBufferTXN). I have not > debugged but I think in StreamLogicalLog() the cur_record_lsn after > receiving 'w' message, in this case, will be equal to endpos whereas > we expect to be greater than endpos to exit. Before the patch, it will > always get the 'k' message where we expect the received lsn to be > equal to endpos to conclude that we can exit. Do let me know if your > analysis differs? After debugging it, I agree with your analysis. WITH the patch: in function StreamLogicalLog(), I can see that cur_record_lsn is equal to endpos, which results in the unexpected record being output. WITHOUT the patch: in function StreamLogicalLog(), it first receives a 'k' message, which breaks the loop via the following code. if (endposReached) { prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr); time_to_abort = true; break; } Best regards, Hou zj
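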
On Thu, Sep 16, 2021 at 10:59 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > From Tuesday, September 14, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > However, the problem I found is that, with the patch applied, there is > > a test failure when running “make check-world”: > > > > t/006_logical_decoding.pl ............ 4/14 > > # Failed test 'pg_recvlogical acknowledged changes' > > # at t/006_logical_decoding.pl line 117. > > # got: 'BEGIN > > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > > # expected: '' > > # Looks like you failed 1 test of 14. > > t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat > > 256, 0x100) Failed 1/14 subtests > > > > > > After applying the patch, > I saw the same problem and can reproduce it by the following steps: > > 1) execute the SQLs. > -----------SQL----------- > CREATE TABLE decoding_test(x integer, y text); > SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding'); > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s; > > -- use the lsn here to execute pg_recvlogical later > SELECT lsn FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY lsn DESC LIMIT 1; > INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s; > ---------------------- > > 2) Then, if I execute the following command twice: > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > BEGIN 708 > table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' > table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' > table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' > table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' > COMMIT 708 > > # pg_recvlogical -E lsn -d postgres -S 'test_slot' --start --no-loop -f - > BEGIN 709 > > It still generated ' BEGIN 709' in the second time execution. > But it will output nothing in the second time execution without the patch. > Hello Hous-san, thanks for including the steps. Unfortunately, no matter what I tried, I could never get the patch to display the problem "BEGIN 709" for the 2nd time execution of pg_recvlogical After discussion offline (thanks Greg!) it was found that the pg_recvlogical step 2 posted above is not quite identical to what the TAP 006 test is doing. Specifically, the TAP test also includes some other options (-o include-xids=0 -o skip-empty-xacts=1) which are not in your step. If I include these options then I can reproduce the problem. ----------------------------------------- [postgres@CentOS7-x64 ~]$ pg_recvlogical -E '0/150B5E0' -d postgres -S 'test_slot' --start --no-loop -o include-xids=0 -o skip-empty-xacts=1 -f - BEGIN table public.decoding_test: INSERT: x[integer]:5 y[text]:'5' ----------------------------------------- I don't know why these options should make any difference but they do. Perhaps they cause a fluke of millisecond timing differences in our different VM environments. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Sep 17, 2021 at 12:42 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Sep 16, 2021 at 10:59 AM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > Hello Hous-san, thanks for including the steps. Unfortunately, no > matter what I tried, I could never get the patch to display the > problem "BEGIN 709" for the 2nd time execution of pg_recvlogical > > After discussion offline (thanks Greg!) it was found that the > pg_recvlogical step 2 posted above is not quite identical to what the > TAP 006 test is doing. > > Specifically, the TAP test also includes some other options (-o > include-xids=0 -o skip-empty-xacts=1) which are not in your step. > > If I include these options then I can reproduce the problem. > ----------------------------------------- > [postgres@CentOS7-x64 ~]$ pg_recvlogical -E '0/150B5E0' -d postgres > -S 'test_slot' --start --no-loop -o include-xids=0 -o > skip-empty-xacts=1 -f - > BEGIN > table public.decoding_test: INSERT: x[integer]:5 y[text]:'5' > ----------------------------------------- > > I don't know why these options should make any difference but they do. > I think there is a possibility that skip-empty-xacts = 1 is making difference. Basically, if there is some empty transaction say via autovacuum, it would skip it and possibly send keep-alive message before sending transaction id 709. Then you won't see the problem with Hou-San's test. Can you try by keeping autovacuum = off and by not using skip-empty-xact option? -- With Regards, Amit Kapila.
On Thu, Sep 16, 2021 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think here the reason is that the first_lsn of a transaction is > always equal to end_lsn of the previous transaction (See comments > above first_lsn and end_lsn fields of ReorderBufferTXN). That may be the case, but those comments certainly don't make this clear. >I have not > debugged but I think in StreamLogicalLog() the cur_record_lsn after > receiving 'w' message, in this case, will be equal to endpos whereas > we expect to be greater than endpos to exit. Before the patch, it will > always get the 'k' message where we expect the received lsn to be > equal to endpos to conclude that we can exit. Do let me know if your > analysis differs? > Yes, pg_recvlogical seems to be relying on receiving a keepalive for its "--endpos" logic to work (and the 006 test is relying on '' record output from pg_recvlogical in this case). But is it correct to be relying on a keepalive for this? As I already pointed out, there's also code which seems to be relying on replies from sending keepalives, to update flush and write locations related to LSN. The original problem reporter measured 500 keepalives per second being sent by walsender (which I also reproduced, for pg_recvlogical and pub/sub cases). None of these cases appear to be traditional uses of "keepalive" type messages to me. Am I missing something? Documentation? Regards, Greg Nancarrow Fujitsu Australia
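For context, the two ReorderBufferTXN fields being referred to look roughly like this (paraphrased from reorderbuffer.h rather than quoted verbatim; the surrounding struct is heavily abbreviated):

    typedef uint64_t XLogRecPtr;    /* as in the real headers, a 64-bit WAL position */

    typedef struct ReorderBufferTXN
    {
        XLogRecPtr  first_lsn;      /* position of the first WAL record carrying
                                     * information about this transaction */
        XLogRecPtr  end_lsn;        /* end of the commit record + 1, i.e. the same
                                     * position at which the next transaction's
                                     * first record begins -- which is why a COMMIT
                                     * and the following BEGIN can be reported at
                                     * an identical LSN */
        /* ... many other fields elided ... */
    } ReorderBufferTXN;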
On Fri, Sep 17, 2021 at 3:03 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > On Thu, Sep 16, 2021 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think here the reason is that the first_lsn of a transaction is > > always equal to end_lsn of the previous transaction (See comments > > above first_lsn and end_lsn fields of ReorderBufferTXN). > > That may be the case, but those comments certainly don't make this clear. > > >I have not > > debugged but I think in StreamLogicalLog() the cur_record_lsn after > > receiving 'w' message, in this case, will be equal to endpos whereas > > we expect to be greater than endpos to exit. Before the patch, it will > > always get the 'k' message where we expect the received lsn to be > > equal to endpos to conclude that we can exit. Do let me know if your > > analysis differs? > > > > Yes, pg_recvlogical seems to be relying on receiving a keepalive for > its "--endpos" logic to work (and the 006 test is relying on '' record > output from pg_recvlogical in this case). > But is it correct to be relying on a keepalive for this? > I don't think this experiment/test indicates that pg_recvlogical's "--endpos" relies on keepalive. It would just print the records till --endpos and then exit. In the test under discussion, as per confirmation by Hou-San, the BEGIN record received has the same LSN as --endpos, so it would just output that and exit which is what is mentioned in pg_recvlogical docs as well (If there's a record with LSN exactly equal to lsn, the record will be output). I think here the test case could be a culprit. In the original commit eb2a6131be [1], where this test of the second time using pg_recvlogical was added there were no additional Inserts (# Insert some rows after $endpos, which we won't read.) which were later added by a different commit 8222a9d9a1 [2]. I am not sure if the test added by commit [2] was a good idea. It seems to be working due to the way keepalives are being sent but otherwise, it can fail as per the current design of pg_recvlogical. [1]: commit eb2a6131beccaad2b39629191508062b70d3a1c6 Author: Simon Riggs <simon@2ndQuadrant.com> Date: Tue Mar 21 14:04:49 2017 +0000 Add a pg_recvlogical wrapper to PostgresNode [2]: commit 8222a9d9a12356349114ec275b01a1a58da2b941 Author: Noah Misch <noah@leadboat.com> Date: Wed May 13 20:42:09 2020 -0700 In successful pg_recvlogical, end PGRES_COPY_OUT cleanly. -- With Regards, Amit Kapila.
On Sun, Sep 19, 2021 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Yes, pg_recvlogical seems to be relying on receiving a keepalive for > > its "--endpos" logic to work (and the 006 test is relying on '' record > > output from pg_recvlogical in this case). > > But is it correct to be relying on a keepalive for this? > > > > I don't think this experiment/test indicates that pg_recvlogical's > "--endpos" relies on keepalive. It would just print the records till > --endpos and then exit. In the test under discussion, as per > confirmation by Hou-San, the BEGIN record received has the same LSN as > --endpos, so it would just output that and exit which is what is > mentioned in pg_recvlogical docs as well (If there's a record with LSN > exactly equal to lsn, the record will be output). > It seems to be relying on keepalive to get ONE specific record per --endpos value, because once we apply the "v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch" patch, then when pg_recvlogical is invoked for a second time with the same --endpos, it outputs the next record (BEGIN) too. So now for the same --endpos value, we've had two different records output by pg_recvlogical. I have not seen this described in the documentation, so I think it will need to be updated, should keepalives be reduced as per the patch. The current documentation seems to be implying that one particular record will be returned for a given --endpos (at least, there is no mention of the possibility of different records being output for the one --endpos, or that the first_lsn of a transaction is always equal to end_lsn of the previous transaction). --endpos=lsn In --start mode, automatically stop replication and exit with normal exit status 0 when receiving reaches the specified LSN. If specified when not in --start mode, an error is raised. If there's a record with LSN exactly equal to lsn, the record will be output. The --endpos option is not aware of transaction boundaries and may truncate output partway through a transaction. Any partially output transaction will not be consumed and will be replayed again when the slot is next read from. Individual messages are never truncated. Regards, Greg Nancarrow Fujitsu Australia
On Thu, Sep 30, 2021 at 8:49 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote: > > I noticed another patch that Horiguchi-San posted earlier[1]. > > The approach in that patch is to splits the sleep into two > pieces. If the first sleep reaches the timeout then send a keepalive > then sleep for the remaining time. > > I tested that patch and can see the keepalive message is reduced and > the patch won't cause the current regression test fail. > > Since I didn't find some comments about that patch, > I wonder did we find some problems about that patch ? > I am not able to understand some parts of that patch. + If the sleep is shorter + * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to + * prevent it from getting too-frequent. + */ + if (MyWalSnd->flush < sentPtr && + MyWalSnd->write < sentPtr && + !waiting_for_ping_response) + { + if (sleeptime > KEEPALIVE_TIMEOUT) + { + int r; + + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT, + WAIT_EVENT_WAL_SENDER_WAIT_WAL); + + if (r != 0) + continue; + + sleeptime -= KEEPALIVE_TIMEOUT; + } + + WalSndKeepalive(false); It claims to skip sending keepalive if the sleep time is shorter than KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to suggest it sends in both cases. What am I missing? Also, more to the point this special keep_alive seems to be sent for synchronous replication and walsender shutdown as they can expect updated locations. You haven't given any reason (theory) why those two won't be impacted due to this change? I am aware that for synchronous replication, we wake waiters while ProcessStandbyReplyMessage but I am not sure how it helps with wal sender shutdown? I think we need to know the reasons for this message and then need to see if the change has any impact on the same. -- With Regards, Amit Kapila.
On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I am not able to understand some parts of that patch. > > + If the sleep is shorter > + * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to > + * prevent it from getting too-frequent. > + */ > + if (MyWalSnd->flush < sentPtr && > + MyWalSnd->write < sentPtr && > + !waiting_for_ping_response) > + { > + if (sleeptime > KEEPALIVE_TIMEOUT) > + { > + int r; > + > + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT, > + WAIT_EVENT_WAL_SENDER_WAIT_WAL); > + > + if (r != 0) > + continue; > + > + sleeptime -= KEEPALIVE_TIMEOUT; > + } > + > + WalSndKeepalive(false); > > It claims to skip sending keepalive if the sleep time is shorter than > KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to > suggest it sends in both cases. What am I missing? > The comment does seem to be wrong. The way I see it, if the calculated sleeptime is greater than KEEPALIVE_TIMEOUT, then it first sleeps for up to KEEPALIVE_TIMEOUT milliseconds and skips sending a keepalive if something happens (i.e. the socket becomes readable/writeable) during that time (WalSendWait will return non-zero in that case). Otherwise, it sends a keepalive and sleeps for (sleeptime - KEEPALIVE_TIMEOUT), then loops around again ... > Also, more to the point this special keep_alive seems to be sent for > synchronous replication and walsender shutdown as they can expect > updated locations. You haven't given any reason (theory) why those two > won't be impacted due to this change? I am aware that for synchronous > replication, we wake waiters while ProcessStandbyReplyMessage but I am > not sure how it helps with wal sender shutdown? I think we need to > know the reasons for this message and then need to see if the change > has any impact on the same. > Yes, I'm not sure about the possible impacts, still looking at it. Regards, Greg Nancarrow Fujitsu Australia
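To make the reading above concrete, here is a toy model of the keepalive decision in the patch version Amit quoted earlier; it is a simplification for illustration, not the patch itself, and the boolean parameters stand in for the flush/write/waiting_for_ping_response test and for WalSndWait() returning non-zero.

    #include <stdbool.h>

    #define KEEPALIVE_TIMEOUT 250   /* ms, the threshold introduced by the patch */

    /*
     * Toy model: returns true when an iteration of WalSndWaitForWal()'s loop
     * would send the before-sleep keepalive under the quoted patch version.
     * (The later revision posted further down skips the keepalive entirely
     * when sleeptime is short.)
     */
    static bool
    iteration_sends_keepalive(bool need_keepalive, long sleeptime,
                              bool socket_event_within_timeout)
    {
        if (!need_keepalive)
            return false;

        if (sleeptime > KEEPALIVE_TIMEOUT && socket_event_within_timeout)
            return false;           /* something arrived during the short wait:
                                     * skip the keepalive and loop again */

        return true;                /* either the short wait timed out, or the
                                     * sleep was too short to pre-wait at all,
                                     * in which case the keepalive is sent
                                     * without any pre-wait */
    }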
On Thu, Sep 30, 2021 at 1:19 PM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote: > > I tested that patch and can see the keepalive message is reduced and > the patch won't cause the current regression test fail. > Actually, with the patch applied, I find that "make check-world" fails (006_logical_decoding, test 7). Regards, Greg Nancarrow Fujitsu Australia
At Thu, 30 Sep 2021 16:21:25 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It claims to skip sending keepalive if the sleep time is shorter than > > KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to > > suggest it sends in both cases. What am I missing? > > > > The comment does seem to be wrong. Mmm. Indeed the comment does say something like that... Looking at the patch name as well, I might have confused something. However, the patch appears to work for the purpose. > The way I see it, if the calculated sleeptime is greater than > KEEPALIVE_TIMEOUT, then it first sleeps for up to KEEPALIVE_TIMEOUT > milliseconds and skips sending a keepalive if something happens (i.e. > the socket becomes readable/writeable) during that time (WalSendWait > will return non-zero in that case). Otherwise, it sends a keepalive > and sleeps for (sleeptime - KEEPALIVE_TIMEOUT), then loops around > again ... The main point of the patch is to move the timing of sending the before-sleep keepalive. It seems to me that currently WalSndWaitForWal may send the "before-sleep" keepalive on every run of the loop under a certain circumstance. I suspect this happens in this case. With the patch applied, that keepalive is sent only when the loop is actually going to sleep for some time. In case the next WAL doesn't come for KEEPALIVE_TIMEOUT milliseconds, it sends a keepalive. There is a dubious behavior when sleeptime <= KEEPALIVE_TIMEOUT: it sends a keepalive immediately. It was (as far as I recall) intentional in order to make the code simpler. However, on second thought, we will have the next chance to send a keepalive in that case, and intermittent frequent keepalives can happen with that behavior. So I have come to think that we can omit the keepalive altogether in that case. (I myself haven't seen the keepalive flood.) > > Also, more to the point this special keep_alive seems to be sent for > > synchronous replication and walsender shutdown as they can expect > > updated locations. You haven't given any reason (theory) why those two > > won't be impacted due to this change? I am aware that for synchronous > > replication, we wake waiters while ProcessStandbyReplyMessage but I am > > not sure how it helps with wal sender shutdown? I think we need to > > know the reasons for this message and then need to see if the change > > has any impact on the same. > > > > Yes, I'm not sure about the possible impacts, still looking at it. If the comment describes the objective correctly, the only possible impact would be that there may be a case where the server responds a bit slowly to a shutdown request. But I'm not sure it is definitely true. regards.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 3ca2a11389..0b04b978e9 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -105,6 +105,9 @@ */ #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16) +/* Minimum idle time for sending an idle-time keepalive in milliseconds */ +#define KEEPALIVE_TIMEOUT 250 + /* Array of WalSnds in shared memory */ WalSndCtlData *WalSndCtl = NULL; @@ -244,7 +247,7 @@ static void WalSndKeepalive(bool requestReply); static void WalSndKeepaliveIfNecessary(void); static void WalSndCheckTimeOut(void); static long WalSndComputeSleeptime(TimestampTz now); -static void WalSndWait(uint32 socket_events, long timeout, uint32 wait_event); +static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event); static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write); static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write); static void WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid); @@ -1436,19 +1439,6 @@ WalSndWaitForWal(XLogRecPtr loc) if (got_STOPPING) break; - /* - * We only send regular messages to the client for full decoded - * transactions, but a synchronous replication and walsender shutdown - * possibly are waiting for a later location. So, before sleeping, we - * send a ping containing the flush location. If the receiver is - * otherwise idle, this keepalive will trigger a reply. Processing the - * reply will update these MyWalSnd locations. - */ - if (MyWalSnd->flush < sentPtr && - MyWalSnd->write < sentPtr && - !waiting_for_ping_response) - WalSndKeepalive(false); - /* check whether we're done */ if (loc <= RecentFlushPtr) break; @@ -1491,6 +1481,41 @@ WalSndWaitForWal(XLogRecPtr loc) if (pq_is_send_pending()) wakeEvents |= WL_SOCKET_WRITEABLE; + /* + * We only send regular messages to the client for full decoded + * transactions, but a synchronous replication and walsender shutdown + * possibly are waiting for a later location. So, before sleeping, we + * send a ping containing the flush location. If the receiver is + * otherwise idle, this keepalive will trigger a reply. Processing the + * reply will update these MyWalSnd locations. To prevent too-frequent + * keepalives, wait KEEPALIVE_TIMEOUT milliseconds before sending a + * keepalive. If sleeptime is less than KEEPALIVE_TIMEOUT, instead + * WalSndKeepaliveIfNecessary will work if needed. + */ + if (MyWalSnd->flush < sentPtr && + MyWalSnd->write < sentPtr && + !waiting_for_ping_response && + sleeptime > KEEPALIVE_TIMEOUT) + { + int r; + + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT, + WAIT_EVENT_WAL_SENDER_WAIT_WAL); + + /* + * If some event happens before the timeout expires, go ahead + * without sending keepalive. + */ + if (r != 0) + continue; + + sleeptime -= KEEPALIVE_TIMEOUT; + + WalSndKeepalive(false); + if (pq_flush_if_writable() != 0) + WalSndShutdown(); + } + WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL); } @@ -3144,15 +3169,18 @@ WalSndWakeup(void) * composed of optional WL_SOCKET_WRITEABLE and WL_SOCKET_READABLE flags. Exit * on postmaster death. 
*/ -static void +static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event) { WaitEvent event; + int ret; ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL); - if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 && - (event.events & WL_POSTMASTER_DEATH)) + ret = WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event); + if (ret == 1 && (event.events & WL_POSTMASTER_DEATH)) proc_exit(1); + + return ret; } /*
At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > Actually, with the patch applied, I find that "make check-world" fails > (006_logical_decoding, test 7). Mmm.. t/006_logical_decoding.pl .. 4/14 # Failed test 'pg_recvlogical acknowledged changes' # at t/006_logical_decoding.pl line 117. # got: 'BEGIN # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' # expected: '' I'm not sure what the test is checking for now, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Sep 30, 2021 at 6:08 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > > Actually, with the patch applied, I find that "make check-world" fails > > (006_logical_decoding, test 7). > > Mmm.. > > t/006_logical_decoding.pl .. 4/14 > # Failed test 'pg_recvlogical acknowledged changes' > # at t/006_logical_decoding.pl line 117. > # got: 'BEGIN > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > # expected: '' > > I'm not sure what the test is checking for now, though. > I think it's trying to check that pg_recvlogical doesn't read "past" a specified "--endpos" LSN. The test is invoking pg_recvlogical with the same --endpos LSN value multiple times. After first getting the LSN (to use for the --endpos value) after 4 rows are inserted, some additional rows are inserted which the test expects pg_recvlogical won't read because it shouldn't read past --endpos. The problem is that the test seems to be relying on a keepalive between the WAL record of the first transaction and the WAL record of the next transaction. As Amit previously explained on this thread "I think here the reason is that the first_lsn of a transaction is always equal to end_lsn of the previous transaction (See comments above first_lsn and end_lsn fields of ReorderBufferTXN)." When the patch is applied, pg_recvlogical doesn't read a keepalive when it is invoked with the same --endpos for a second time here, and it ends up reading the first WAL record for the next transaction (those additional rows that the test expects it won't read). Regards, Greg Nancarrow Fujitsu Australia
At Thu, 30 Sep 2021 17:08:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > > Actually, with the patch applied, I find that "make check-world" fails > > (006_logical_decoding, test 7). > > Mmm.. > > t/006_logical_decoding.pl .. 4/14 > # Failed test 'pg_recvlogical acknowledged changes' > # at t/006_logical_decoding.pl line 117. > # got: 'BEGIN > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > # expected: '' > > I'm not sure what the test is checking for now, though. It's checking that endpos works correctly? The logically decoded WAL looks like this. 0/1528F10|table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' 0/15290F8|table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' 0/1529138|table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' 0/1529178|table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' 0/15291E8|COMMIT 709 0/15291E8|BEGIN 710 0/15291E8|table public.decoding_test: INSERT: x[integer]:5 y[text]:'5' 0/1529228|table public.decoding_test: INSERT: x[integer]:6 y[text]:'6' The COMMIT and BEGIN share the same LSN, which I don't understand. The previous read by pg_recvlogical proceeded up to the COMMIT record, and the following command, run after that, behaves differently. pg_recvlogical -S test_slot --dbname postgres --endpos '0/15291E8' -f - --no-loop --start Before the patch it ends before reading a record, and after the patch it reads into the "table ..." line. pg_recvlogical seems to use the endpos as the beginning of the last record. In that sense the three lines (COMMIT 709/BEGIN 710/table ...'5') fall at the end of the data. The difference seems to come from the timing at which a keepalive arrives. pg_recvlogical checks the endpos only when a keepalive comes. In other words, it needs a keepalive for every data line so that it stops exactly at the specified endpos. 1. Is it the correct behavior that the three data lines share the same LSN? I think BEGIN and the next line should, but COMMIT and the next BEGIN shouldn't. 2. Is it the designed behavior that pg_recvlogical checks endpos only when a keepalive comes? If it is the correct behavior, we shouldn't avoid the keepalive flood. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Sep 30, 2021 at 1:26 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 30 Sep 2021 16:21:25 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > > On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Also, more to the point this special keep_alive seems to be sent for > > > synchronous replication and walsender shutdown as they can expect > > > updated locations. You haven't given any reason (theory) why those two > > > won't be impacted due to this change? I am aware that for synchronous > > > replication, we wake waiters while ProcessStandbyReplyMessage but I am > > > not sure how it helps with wal sender shutdown? I think we need to > > > know the reasons for this message and then need to see if the change > > > has any impact on the same. > > > > > > > Yes, I'm not sure about the possible impacts, still looking at it. > > If the comment describes the objective correctly, the only possible > impact would be that there may be a case where the server responds a bit > slowly to a shutdown request. But I'm not sure it is definitely > true. > So, we should try to find out how walsender shutdown depends on sending keepalives; and second, what about the sync rep case? I think in the worst case that might also be delayed. Why do you think that would be acceptable? -- With Regards, Amit Kapila.
On Thu, Sep 30, 2021 at 3:41 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 30 Sep 2021 17:08:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Thu, 30 Sep 2021 17:21:03 +1000, Greg Nancarrow <gregn4422@gmail.com> wrote in > > > Actually, with the patch applied, I find that "make check-world" fails > > > (006_logical_decoding, test 7). > > > > Mmm.. > > > > t/006_logical_decoding.pl .. 4/14 > > # Failed test 'pg_recvlogical acknowledged changes' > > # at t/006_logical_decoding.pl line 117. > > # got: 'BEGIN > > # table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'' > > # expected: '' > > > > I'm not sure what the test is checking for now, though. > > It's checking that endpos works correctly? The logically decoded WAL > looks like this. > > 0/1528F10|table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' > 0/15290F8|table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' > 0/1529138|table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' > 0/1529178|table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' > 0/15291E8|COMMIT 709 > 0/15291E8|BEGIN 710 > 0/15291E8|table public.decoding_test: INSERT: x[integer]:5 y[text]:'5' > 0/1529228|table public.decoding_test: INSERT: x[integer]:6 y[text]:'6' > > The COMMIT and BEGIN share the same LSN, which I don't understand. > This is because end_lsn is always the commit record + 1, which makes it equal to the start of the next record, and we use end_lsn here for the commit. See the comment below from the code. /* * LSN pointing to the end of the commit record + 1. */ XLogRecPtr end_lsn; > The previous read by pg_recvlogical proceeded up to the COMMIT record, and > the following command, run after that, behaves differently. > > pg_recvlogical -S test_slot --dbname postgres --endpos '0/15291E8' -f - --no-loop --start > > Before the patch it ends before reading a record, and after the patch > it reads into the "table ..." line. pg_recvlogical seems to use the > endpos as the beginning of the last record. In that sense the three > lines (COMMIT 709/BEGIN 710/table ...'5') fall at the end of the > data. > > The difference seems to come from the timing at which a keepalive > arrives. pg_recvlogical checks the endpos only when a keepalive comes. In > other words, it needs a keepalive for every data line so that it stops > exactly at the specified endpos. > > 1. Is it the correct behavior that the three data lines share the same > LSN? I think BEGIN and the next line should, but COMMIT and the next > BEGIN shouldn't. > > 2. Is it the designed behavior that pg_recvlogical checks endpos > only when a keepalive comes? If it is the correct behavior, we > shouldn't avoid the keepalive flood. > If anything, I think this is a testcase issue, as I explained in email [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1Ja2XmK59Czv1V%2BtfOgU4mcFfDwTtTgO02Wd%3DrO02JbiQ%40mail.gmail.com -- With Regards, Amit Kapila.
On Thu, Sep 30, 2021 at 5:56 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > After the patch applied, that keepalive is sent only when the loop is > actually going to sleep some time. In case the next WAL doesn't come > for KEEPALIVE_TIMEOUT milliseconds, it sends a keepalive. There's a > dubious behavior when sleeptime <= KEEPALIVE_TIMEOUT that it sends a > keepalive immediately. It was (as far as I recall) intentional in > order to make the code simpler. However, on second thought, we will > have the next chance to send keepalive in that case, and intermittent > frequent keepalives can happen with that behavior. So I came to think > that we can omit keepalives at all that case. > > (I myself haven't see the keepalive flood..) > I tried your updated patch (avoid_keepalive_flood_at_bleeding_edge_of_wal.patch, rebased) and also manually applied your previous keepalive-counting code (count_keepalives2.diff.txt), adapted to the code updates. I tested both the problem originally reported (which used pg_recvlogical) and similarly using pub/sub of the pgbench_history table, and in both cases I found that your patch very significantly reduced the keepalives, so the keepalive flood is no longer seen. I am still a little unsure about the impact on pg_recvlogical --endpos functionality, which is detected by the regression test failure. I did try to update pg_recvlogical, to not rely on a keepalive for --endpos, but so far haven't been successful in doing that. If the test is altered/removed then I think that the documentation for pg_recvlogical --endpos will need updating in some way. Regards, Greg Nancarrow Fujitsu Australia
At Thu, 30 Sep 2021 17:07:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Thu, Sep 30, 2021 at 1:26 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > If the comment describes the objective correctly, the only possible > > impact would be that there may be a case where the server responds a bit > > slowly to a shutdown request. But I'm not sure it is definitely > > true. > > > > So, we should try to find out how walsender shutdown depends on > sending keepalives; and second, what about the sync rep case? I > think in the worst case that might also be delayed. Why do you think that > would be acceptable? Mmm. AFAICS, including the history of the code, the problem to fix seems to be that the logical wal receiver doesn't send a flush response spontaneously. As long as the receiver doesn't do that, and unless we allow some delay in the response, the sender inevitably needs to ping frequently until the wanted response returns. It seems to me that it is better to make the receiver send a response spontaneously when its flush LSN moves, rather than tweaking the keepalive sending mechanism. But letting XLogFlush trigger lsn_mapping processing does not seem simple. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
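To illustrate the direction being suggested (a sketch of the idea only, not a proposed patch): assuming helpers shaped like the get_flush_position() and send_feedback() routines in src/backend/replication/logical/worker.c, the apply worker could report its progress as soon as the flushed position advances, along the lines below. The hard part noted above -- arranging for this to run when XLogFlush() advances the positions tracked in lsn_mapping -- is simply assumed here.

    static XLogRecPtr last_reported_flushpos = InvalidXLogRecPtr;

    /*
     * Illustrative sketch only: reply spontaneously whenever the flushed
     * position has moved, instead of waiting for the sender to ping us.
     * get_flush_position() and send_feedback() are assumed to behave like
     * the worker.c helpers of the same name.
     */
    static void
    maybe_report_flush_position(XLogRecPtr last_received)
    {
        XLogRecPtr  writepos;
        XLogRecPtr  flushpos;
        bool        have_pending_txes;

        get_flush_position(&writepos, &flushpos, &have_pending_txes);

        if (flushpos != InvalidXLogRecPtr && flushpos > last_reported_flushpos)
        {
            /* force a reply now; don't ask the sender to reply back */
            send_feedback(last_received, true, false);
            last_reported_flushpos = flushpos;
        }
    }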