Thread: Unnecessary confirm work on logical replication
I was reading the logical replication code and found a little unnecessary work we are doing. The confirmed_flushed_lsn cannot reasonably be ahead of the current_lsn, so there is no point of calling LogicalConfirmReceivedLocation() every time we update the candidate xmin or restart_lsn. Patch is attached.
Attachment
On Fri, Apr 7, 2023 at 11:06 PM Emre Hasegeli <emre@hasegeli.com> wrote: > > I was reading the logical replication code and found a little > unnecessary work we are doing. > > The confirmed_flushed_lsn cannot reasonably be ahead of the > current_lsn, so there is no point of calling > LogicalConfirmReceivedLocation() every time we update the candidate > xmin or restart_lsn. In fact, the WAL sender always starts reading WAL from restart_lsn, which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL sender may read XLOG_RUNNING_XACTS WAL record with lsn <= confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may update its restart_lsn and catalog_xmin with current_lsn = lsn fo XLOG_RUNNING_XACTS record. In this situation current_lsn <= confirmed_flush_lsn. Does that make sense? -- Best Wishes, Ashutosh Bapat
> In fact, the WAL sender always starts reading WAL from restart_lsn, > which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL > sender may read XLOG_RUNNING_XACTS WAL record with lsn <= > confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may > update its restart_lsn and catalog_xmin with current_lsn = lsn fo > XLOG_RUNNING_XACTS record. In this situation current_lsn <= > confirmed_flush_lsn. This can only happen when the WAL sender is restarted. However in this case, the restart_lsn and catalog_xmin should have already been persisted by the previous run of the WAL sender. I still doubt these calls are necessary. I think there is a complicated chicken and egg problem here. Here is my logic: 1) LogicalConfirmReceivedLocation() is called explicitly when confirmed_flush is sent by the replication client. 2) LogicalConfirmReceivedLocation() is the only place that updates confirmed_flush. 3) The replication client can only send a confirmed_flush for a current_lsn it has already received. 4) These two functions have already run for any current_lsn the replication client has received. 5) These two functions call LogicalConfirmReceivedLocation() only if current_lsn <= confirmed_flush. Thank you for your patience.
On 4/11/23 14:58, Emre Hasegeli wrote: >> In fact, the WAL sender always starts reading WAL from restart_lsn, >> which in turn is always <= confirmed_flush_lsn. While reading WAL, WAL >> sender may read XLOG_RUNNING_XACTS WAL record with lsn <= >> confirmed_flush_lsn. While processing XLOG_RUNNING_XACTS record it may >> update its restart_lsn and catalog_xmin with current_lsn = lsn fo >> XLOG_RUNNING_XACTS record. In this situation current_lsn <= >> confirmed_flush_lsn. > > This can only happen when the WAL sender is restarted. However in > this case, the restart_lsn and catalog_xmin should have already been > persisted by the previous run of the WAL sender. > > I still doubt these calls are necessary. I think there is a > complicated chicken and egg problem here. Here is my logic: > > 1) LogicalConfirmReceivedLocation() is called explicitly when > confirmed_flush is sent by the replication client. > > 2) LogicalConfirmReceivedLocation() is the only place that updates > confirmed_flush. > > 3) The replication client can only send a confirmed_flush for a > current_lsn it has already received. > > 4) These two functions have already run for any current_lsn the > replication client has received. > > 5) These two functions call LogicalConfirmReceivedLocation() only if > current_lsn <= confirmed_flush. > > Thank you for your patience. > Hi Emre, I was going through my TODO list of messages, and I stumbled on this thread from a couple months back. Do you still think this is something we should do? I see there was some discussion about whether this update is needed, or whether current_lsn can ever be <= confirmed_flush_lsn. I think you may be right this can't happen, but I wonder if we could verify that by an assert in a convenient place (instead of just removing the update). Also, did you try to quantify how expensive this is? The update seems very cheap, but I guess you just noticed by eye-balling the code, which is fine ofc. Even if it's cheap/not noticeable, it still may be worth removing so as not to confuse people improving the code in the future. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company