Re: Possible bug in logical replication. - Mailing list pgsql-hackers

From Arseny Sher
Subject Re: Possible bug in logical replication.
Date
Msg-id 87woutq467.fsf@ars-thinkpad
Whole thread Raw
In response to Re: Possible bug in logical replication.  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Possible bug in logical replication.
List pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:

> On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
>> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
>> It seems to me that we still want to have the slot forwarding finish in
>> this case even if this is interrupted.  Petr, isn't that the intention
>> here?
>
> I have been chewing a bit more on the proposed patch, finishing with the
> attached to close the loop.  Thoughts?

Sorry for being pedantic, but it seems to me worthwhile to mention *why*
we need decoding machinery at all -- like I wrote:

+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).

Also,

>   * The slot's restart_lsn is used as start point for reading records,

This is clearly seen from the code, I propose to remove this.

>   * while confirmed_lsn is used as base point for the decoding context.

And as I wrote, this doesn't matter as changes are not produced.

>   * The LSN position to move to is checked by doing a per-record scan and
>   * logical decoding which makes sure that confirmed_lsn is updated to a
>   * LSN which allows the future slot consumer to get consistent logical
> - * changes.
> + * changes.  As decoding is done with fast_forward mode, no changes are
> + * actually generated.

confirmed_lsn is always updated to `moveto` unless we run out of WAL
earlier (and unless we try to move slot backwards, which is obviously
forbidden) -- consistent changes are practically irrelevant
here. Moreover, we can directly set confirmed_lsn and still have
consistent changes further as restart_lsn and xmin of the slot are not
touched. What we actually do here is trying to advance *restart_lsn and
xmin* as far as we can but up to the point which ensures that decoding
can assemble a consistent snapshot allowing to fully decode all COMMITs
since updated `confirmed_flush_lsn`. All this happens in
SnapBuildProcessRunningXacts.

> It seems to me that we still want to have the slot forwarding finish in
> this case even if this is interrupted.  Petr, isn't that the intention
> here?

Probably, though I am not sure what is the point of this. Ok, I keep
this check in the updated (with your comments) patch and CC'ing Petr.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61588d626f..76bafca41c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -335,11 +335,14 @@ CreateInitDecodingContext(char *plugin,
  *        The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
  *        from the slot's confirmed_flush; otherwise, start from the specified
  *        location (but move it forwards to confirmed_flush if it's older than
- *        that, see below).
+ *        that, see below). Doesn't matter in fast_forward mode.
  *
  * output_plugin_options
  *        contains options passed to the output plugin.
  *
+ * fast_forward
+ *        bypasses the generation of logical changes.
+ *
  * read_page, prepare_write, do_write, update_progress
  *        callbacks that have to be filled to perform the use-case dependent,
  *        actual work.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..0a4985ef8c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -341,12 +341,10 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 
 /*
  * Helper function for advancing logical replication slot forward.
- * The slot's restart_lsn is used as start point for reading records,
- * while confirmed_lsn is used as base point for the decoding context.
- * The LSN position to move to is checked by doing a per-record scan and
- * logical decoding which makes sure that confirmed_lsn is updated to a
- * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).
+ * We do it in special fast_forward mode without actual replay.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -358,7 +356,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 
     PG_TRY();
     {
-        /* restart at slot's confirmed_flush */
         ctx = CreateDecodingContext(InvalidXLogRecPtr,
                                     NIL,
                                     true,
@@ -388,10 +385,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
              */
             startlsn = InvalidXLogRecPtr;
 
-            /*
-             * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-             * store the description into our tuplestore.
-             */
+            /* Changes are not actually produced in fast_forward mode. */
             if (record != NULL)
                 LogicalDecodingProcessRecord(ctx, ctx->reader);


pgsql-hackers by date:

Previous
From: Rajkumar Raghuwanshi
Date:
Subject: ERROR: ORDER/GROUP BY expression not found in targetlist
Next
From: Sergei Kornilov
Date:
Subject: Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query