Re: Logical replication keepalive flood - Mailing list pgsql-hackers

From Greg Nancarrow
Subject Re: Logical replication keepalive flood
Msg-id CAJcOf-ct+7K53kPsnYery=8W6sZx7Q14H8UjqAgpwkxCfvR5mQ@mail.gmail.com
In response to Re: Logical replication keepalive flood  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, Sep 16, 2021 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think here the reason is that the first_lsn of a transaction is
> always equal to end_lsn of the previous transaction (See comments
> above first_lsn and end_lsn fields of ReorderBufferTXN).

That may be the case, but those comments certainly don't make this clear.

> I have not
> debugged but I think in StreamLogicalLog() the cur_record_lsn after
> receiving 'w' message, in this case, will be equal to endpos whereas
> we expect to be greater than endpos to exit. Before the patch, it will
> always get the 'k' message where we expect the received lsn to be
> equal to endpos to conclude that we can exit. Do let me know if your
> analysis differs?
>

Yes, pg_recvlogical seems to rely on receiving a keepalive for
its "--endpos" logic to work (and the 006 test relies on empty ('')
record output from pg_recvlogical in this case).
But is it correct to be relying on a keepalive for this?
As I already pointed out, there is also code that seems to rely
on the replies to keepalives in order to update the flush and write
LSN locations.
The original problem reporter measured 500 keepalives per second being
sent by walsender (which I also reproduced, for pg_recvlogical and
pub/sub cases).
None of these cases appear to be traditional uses of "keepalive" type
messages to me.
Am I missing something? Documentation?
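
To make the "--endpos" behaviour described above concrete, here is a
rough model of the exit checks as Amit describes them. It is only a
sketch, not the actual pg_recvlogical code: the function name
should_exit() and the use of plain integers for LSNs are made up for
illustration, and treating "reached endpos" as ">=" for the 'k' case is
an assumption on my part.

```python
from typing import Optional


def should_exit(msg_type: str, lsn: int, endpos: Optional[int]) -> bool:
    """Hypothetical model of the endpos checks described in this thread.

    msg_type: 'w' for a data message, 'k' for a keepalive.
    lsn:      LSN carried by the message (plain integer here).
    endpos:   requested end position, or None if --endpos was not given.
    """
    if endpos is None:
        # No end position requested: keep streaming.
        return False
    if msg_type == "w":
        # Data message: exit only once the record LSN is past endpos.
        return lsn > endpos
    if msg_type == "k":
        # Keepalive: exit once the reported LSN has reached endpos
        # (">=" is an assumption; the thread only mentions equality).
        return lsn >= endpos
    raise ValueError(f"unexpected message type: {msg_type!r}")


# A 'w' message whose LSN equals endpos does not end the stream,
# while a 'k' message carrying the same LSN does.
print(should_exit("w", 100, 100))
print(should_exit("k", 100, 100))
```

Under this model, if no keepalive arrives after the last 'w' message,
nothing triggers the exit when the record LSN is exactly equal to
endpos, which is the dependency I am questioning.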


Regards,
Greg Nancarrow
Fujitsu Australia


