Thread: Unnecessary confirm work on logical replication

Unnecessary confirm work on logical replication

From
Emre Hasegeli
Date:
I was reading the logical replication code and found a little
unnecessary work we are doing.

The confirmed_flushed_lsn cannot reasonably be ahead of the
current_lsn, so there is no point of calling
LogicalConfirmReceivedLocation() every time we update the candidate
xmin or restart_lsn.

Patch is attached.

Attachment

Re: Unnecessary confirm work on logical replication

From
Ashutosh Bapat
Date:
On Fri, Apr 7, 2023 at 11:06 PM Emre Hasegeli <emre@hasegeli.com> wrote:
>
> I was reading the logical replication code and found a little
> unnecessary work we are doing.
>
> The confirmed_flushed_lsn cannot reasonably be ahead of the
> current_lsn, so there is no point of calling
> LogicalConfirmReceivedLocation() every time we update the candidate
> xmin or restart_lsn.

In fact, the WAL sender always starts reading WAL from restart_lsn,
which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL
sender may read XLOG_RUNNING_XACTS WAL record with lsn <=
confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may
update its restart_lsn and catalog_xmin with current_lsn = lsn fo
XLOG_RUNNING_XACTS record. In this situation current_lsn <=
confirmed_flush_lsn.

Does that make sense?

--
Best Wishes,
Ashutosh Bapat



Re: Unnecessary confirm work on logical replication

From
Emre Hasegeli
Date:
> In fact, the WAL sender always starts reading WAL from restart_lsn,
> which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL
> sender may read XLOG_RUNNING_XACTS WAL record with lsn <=
> confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may
> update its restart_lsn and catalog_xmin with current_lsn = lsn fo
> XLOG_RUNNING_XACTS record. In this situation current_lsn <=
> confirmed_flush_lsn.

This can only happen when the WAL sender is restarted.  However in
this case, the restart_lsn and catalog_xmin should have already been
persisted by the previous run of the WAL sender.

I still doubt these calls are necessary.  I think there is a
complicated chicken and egg problem here.  Here is my logic:

1) LogicalConfirmReceivedLocation() is called explicitly when
confirmed_flush is sent by the replication client.

2) LogicalConfirmReceivedLocation() is the only place that updates
confirmed_flush.

3) The replication client can only send a confirmed_flush for a
current_lsn it has already received.

4) These two functions have already run for any current_lsn the
replication client has received.

5) These two functions call LogicalConfirmReceivedLocation() only if
current_lsn <= confirmed_flush.

Thank you for your patience.



Re: Unnecessary confirm work on logical replication

From
Tomas Vondra
Date:
On 4/11/23 14:58, Emre Hasegeli wrote:
>> In fact, the WAL sender always starts reading WAL from restart_lsn,
>> which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL
>> sender may read XLOG_RUNNING_XACTS WAL record with lsn <=
>> confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may
>> update its restart_lsn and catalog_xmin with current_lsn = lsn fo
>> XLOG_RUNNING_XACTS record. In this situation current_lsn <=
>> confirmed_flush_lsn.
> 
> This can only happen when the WAL sender is restarted.  However in
> this case, the restart_lsn and catalog_xmin should have already been
> persisted by the previous run of the WAL sender.
> 
> I still doubt these calls are necessary.  I think there is a
> complicated chicken and egg problem here.  Here is my logic:
> 
> 1) LogicalConfirmReceivedLocation() is called explicitly when
> confirmed_flush is sent by the replication client.
> 
> 2) LogicalConfirmReceivedLocation() is the only place that updates
> confirmed_flush.
> 
> 3) The replication client can only send a confirmed_flush for a
> current_lsn it has already received.
> 
> 4) These two functions have already run for any current_lsn the
> replication client has received.
> 
> 5) These two functions call LogicalConfirmReceivedLocation() only if
> current_lsn <= confirmed_flush.
> 
> Thank you for your patience.
> 

Hi Emre,

I was going through my TODO list of messages, and I stumbled on this
thread from a couple months back. Do you still think this is something
we should do?

I see there was some discussion about whether this update is needed, or
whether current_lsn can ever be <= confirmed_flush_lsn. I think you may
be right this can't happen, but I wonder if we could verify that by an
assert in a convenient place (instead of just removing the update).

Also, did you try to quantify how expensive this is? The update seems
very cheap, but I guess you just noticed by eye-balling the code, which
is fine ofc. Even if it's cheap/not noticeable, it still may be worth
removing so as not to confuse people improving the code in the future.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company