Thread: logical replication of truncate command with trigger causes Assert

logical replication of truncate command with trigger causes Assert

From: Mark Dilger
Hackers,

On master, when a statement level trigger is fired for a replicated truncate command, the following stack trace is
generated:

TRAP: FailedAssertion("portal != NULL", File: "pquery.c", Line: 1760, PID: 93854)
0   postgres                            0x0000000108e269f2 ExceptionalCondition + 130
1   postgres                            0x0000000108bef2f4 EnsurePortalSnapshotExists + 100
2   postgres                            0x0000000108a93231 _SPI_execute_plan + 529
3   postgres                            0x0000000108a93c0f SPI_execute_plan_with_paramlist + 127
4   plpgsql.so                          0x00000001098bf9e5 exec_stmt_execsql + 277
5   plpgsql.so                          0x00000001098bbaf6 exec_stmts + 294
6   plpgsql.so                          0x00000001098bb367 exec_stmt_block + 1127
7   plpgsql.so                          0x00000001098ba57a plpgsql_exec_trigger + 442
8   plpgsql.so                          0x00000001098cb5b1 plpgsql_call_handler + 305
9   postgres                            0x0000000108a3137c ExecCallTriggerFunc + 348
10  postgres                            0x0000000108a3447d afterTriggerInvokeEvents + 1517
11  postgres                            0x0000000108a33bb0 AfterTriggerEndQuery + 128
12  postgres                            0x0000000108a1a9e2 ExecuteTruncateGuts + 2210
13  postgres                            0x0000000108b83369 apply_dispatch + 3913
14  postgres                            0x0000000108b82185 LogicalRepApplyLoop + 485
15  postgres                            0x0000000108b81f87 ApplyWorkerMain + 1047
16  postgres                            0x0000000108b474a2 StartBackgroundWorker + 386
17  postgres                            0x0000000108b55cf6 maybe_start_bgworkers + 1254
18  postgres                            0x0000000108b54510 sigusr1_handler + 464
19  libsystem_platform.dylib            0x00007fff69f3d5fd _sigtramp + 29
20  ???                                 0x0000000000000000 0x0 + 0
21  postgres                            0x0000000108b537ae PostmasterMain + 3726
22  postgres                            0x0000000108aaa140 help + 0
23  libdyld.dylib                       0x00007fff69d44cc9 start + 1
24  ???                                 0x0000000000000004 0x0 + 4

I believe the issue was introduced in commit 84f5c2908da, which added EnsurePortalSnapshotExists.  That's not going to
work in the case of logical replication, because there is neither an ActivePortal nor a snapshot.

Attached patch v1-0001 reliably reproduces the problem, though you have to Ctrl-C out of it, because the logical
replication gets stuck in a loop after the Assert is triggered.  You can see the stack trace by opening
tmp_check/log/021_truncate_subscriber.log


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Mark Dilger <mark.dilger@enterprisedb.com> writes:
> On master, when a statement level trigger is fired for a replicated truncate command, the following stack trace is
> generated:

Hmm.

> I believe the issue was introduced in commit 84f5c2908da, which added EnsurePortalSnapshotExists.  That's not going
> to work in the case of logical replication, because there is neither an ActivePortal nor a snapshot.

The right way to say that is "commit 84f5c2908da exposed the pre-existing
unsafe behavior of this code".  It's not okay to run arbitrary user code
without holding a snapshot to protect TOAST dereference operations.

I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
grow some more snapshot management logic, but I've not looked at that
code much, so I don't have an opinion on just where to add it.

There's a reasonable case to be made that running user code outside
a Portal is a just-plain-bad idea, so we should fix the logical
apply worker to make it create a suitable Portal.  I speculated about
that in the commit message for 84f5c2908da, but I don't feel like
taking point on such work.

            regards, tom lane



Re: logical replication of truncate command with trigger causes Assert

From: Mark Dilger

> On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The right way to say that is "commit 84f5c2908da exposed the pre-existing
> unsafe behavior of this code".  It's not okay to run arbitrary user code
> without holding a snapshot to protect TOAST dereference operations.

Sure, I didn't mean that you'd broken things, so much as that we now have an Assert where, at least for simple commands,
things were working back in April.  Those things may not have been working correctly -- I'll have to do some more test
development to see if I can get the pre-84f5c2908da code to misbehave -- but this may appear to be a regression in
version 14 if we don't do something.

Calling ExecuteTruncateGuts from within the logical replication worker was introduced in commit 039eb6e92f2, "Logical
replication support for TRUNCATE", back in April 2018.  So whatever we do will likely need to be back-patched.

> I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
> grow some more snapshot management logic, but I've not looked at that
> code much, so I don't have an opinion on just where to add it.

I was looking at those for other reasons prior to hitting this bug.  My purpose was to figure out how to get the code
to respect privileges.  Perhaps the solution to these two issues is related.  I don't know yet.

As it stands, a subscription can only be created by a superuser, and the replication happens under that user's
current_user and session_user.  I naively thought that adding a built-in role pg_logical_replication which could create
subscriptions would be of some use.  I implemented that, but now if I create a user named "replication_manager" with
membership in pg_logical_replication but not superuser, it turns out that even though the apply worker runs as
replication_manager, the insert/update/delete commands work without checking privileges.  (They can insert/update/delete
tables and execute functions owned by a database superuser for which "replication_manager" has no privileges.)  So I
need to go a bit further to get ACL checks called from this code path.

> There's a reasonable case to be made that running user code outside
> a Portal is a just-plain-bad idea, so we should fix the logical
> apply worker to make it create a suitable Portal.  I speculated about
> that in the commit message for 84f5c2908da, but I don't feel like
> taking point on such work.

I'll dig into this a bit more.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Mark Dilger <mark.dilger@enterprisedb.com> writes:
> On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
>> grow some more snapshot management logic, but I've not looked at that
>> code much, so I don't have an opinion on just where to add it.

> I was looking at those for other reasons prior to hitting this bug.

After looking at it a bit, I see a couple of options:

1. Just wrap the call of ExecuteTruncateGuts with
PushActiveSnapshot(GetTransactionSnapshot()) and PopActiveSnapshot().
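
For illustration, a minimal sketch of option 1, assuming the existing
ExecuteTruncateGuts call in apply_handle_truncate keeps its current
argument list:

    /* Sketch only: hold a snapshot just around the TRUNCATE so that any
     * statement-level triggers fired inside ExecuteTruncateGuts can
     * safely detoast the values they touch. */
    PushActiveSnapshot(GetTransactionSnapshot());
    ExecuteTruncateGuts(rels, relids, relids_logged,
                        DROP_RESTRICT, restart_seqs);
    PopActiveSnapshot();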

2. Decide that we ought to ensure that a snapshot exists throughout
most of this code.  It's not entirely obvious to me that there is no
code path reachable from, say, apply_handle_truncate's collection of
relation OIDs that needs a snapshot.  If we went for that, I'd think
the right solution is to do PushActiveSnapshot right after each
ensure_transaction call, and then PopActiveSnapshot on the way out of
the respective subroutine.  We could then drop the snapshot management
calls that are currently associated with the executor state.

> My purpose was to figure out how to get the code to respect
> privileges.  Perhaps the solution to these two issues is related.
> I don't know yet.

Doesn't seem tremendously related.  But yeah, there is Stuff That
Is Missing in these code paths.

            regards, tom lane



Re: logical replication of truncate command with trigger causes Assert

From: Amit Kapila
On Wed, Jun 9, 2021 at 5:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
> > On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
> >> grow some more snapshot management logic, but I've not looked at that
> >> code much, so I don't have an opinion on just where to add it.
>
> > I was looking at those for other reasons prior to hitting this bug.
>
> After looking at it a bit, I see a couple of options:
>
> 1. Just wrap the call of ExecuteTruncateGuts with
> PushActiveSnapshot(GetTransactionSnapshot()) and PopActiveSnapshot().
>
> 2. Decide that we ought to ensure that a snapshot exists throughout
> most of this code.  It's not entirely obvious to me that there is no
> code path reachable from, say, apply_handle_truncate's collection of
> relation OIDs that needs a snapshot.  If we went for that, I'd think
> the right solution is to do PushActiveSnapshot right after each
> ensure_transaction call, and then PopActiveSnapshot on the way out of
> the respective subroutine.  We could then drop the snapshot management
> calls that are currently associated with the executor state.
>

+1 for the second option.  Apart from what you said, it will also take
some load off future developers, who won't have to decide which parts of
the changes must run after acquiring a snapshot.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, Jun 9, 2021 at 5:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Decide that we ought to ensure that a snapshot exists throughout
>> most of this code.  It's not entirely obvious to me that there is no
>> code path reachable from, say, apply_handle_truncate's collection of
>> relation OIDs that needs a snapshot.  If we went for that, I'd think
>> the right solution is to do PushActiveSnapshot right after each
>> ensure_transaction call, and then PopActiveSnapshot on the way out of
>> the respective subroutine.  We could then drop the snapshot management
>> calls that are currently associated with the executor state.

> +1 for the second option.  Apart from what you said, it will also take
> some load off future developers, who won't have to decide which parts of
> the changes must run after acquiring a snapshot.

Here's a draft patch for that.  I decided the most sensible way to
organize this is to pair the existing ensure_transaction() subroutine
with a cleanup subroutine.  Rather unimaginatively, perhaps, I renamed
it to begin_transaction_step and named the cleanup end_transaction_step.
(Better ideas welcome.)

As written, this'll result in creating and deleting a snapshot for some
stream-control messages that maybe don't need one; but the point here is
not to have to think too hard about whether they do, so that's OK with
me.  There are more CommandCounterIncrement calls than before, too,
but (a) those are cheap if there's nothing to do and (b) it's not real
clear to me that the extra calls are not necessary.

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order?  That seems many degrees short of acceptable.

            regards, tom lane

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447ea97..9e9b47ce4f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -282,30 +282,41 @@ should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 }

 /*
- * Make sure that we started local transaction.
+ * Begin one step (one INSERT, UPDATE, etc) of a replication transaction.
  *
- * Also switches to ApplyMessageContext as necessary.
+ * Start a transaction, if this is the first step (else we keep using the
+ * existing transaction).
+ * Also provide a global snapshot and ensure we run in ApplyMessageContext.
  */
-static bool
-ensure_transaction(void)
+static void
+begin_transaction_step(void)
 {
-    if (IsTransactionState())
-    {
-        SetCurrentStatementStartTimestamp();
-
-        if (CurrentMemoryContext != ApplyMessageContext)
-            MemoryContextSwitchTo(ApplyMessageContext);
+    SetCurrentStatementStartTimestamp();

-        return false;
+    if (!IsTransactionState())
+    {
+        StartTransactionCommand();
+        maybe_reread_subscription();
     }

-    SetCurrentStatementStartTimestamp();
-    StartTransactionCommand();
-
-    maybe_reread_subscription();
+    PushActiveSnapshot(GetTransactionSnapshot());

     MemoryContextSwitchTo(ApplyMessageContext);
-    return true;
+}
+
+/*
+ * Finish up one step of a replication transaction.
+ * Callers of begin_transaction_step() must also call this.
+ *
+ * We don't close out the transaction here, but we should increment
+ * the command counter to make the effects of this step visible.
+ */
+static void
+end_transaction_step(void)
+{
+    PopActiveSnapshot();
+
+    CommandCounterIncrement();
 }

 /*
@@ -359,13 +370,6 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
     RangeTblEntry *rte;
     ResultRelInfo *resultRelInfo;

-    /*
-     * Input functions may need an active snapshot, as may AFTER triggers
-     * invoked during finish_edata.  For safety, ensure an active snapshot
-     * exists throughout all our usage of the executor.
-     */
-    PushActiveSnapshot(GetTransactionSnapshot());
-
     edata = (ApplyExecutionData *) palloc0(sizeof(ApplyExecutionData));
     edata->targetRel = rel;

@@ -433,8 +437,6 @@ finish_edata(ApplyExecutionData *edata)
     ExecResetTupleTable(estate->es_tupleTable, false);
     FreeExecutorState(estate);
     pfree(edata);
-
-    PopActiveSnapshot();
 }

 /*
@@ -831,7 +833,7 @@ apply_handle_stream_start(StringInfo s)
      * transaction for handling the buffile, used for serializing the
      * streaming data and subxact info.
      */
-    ensure_transaction();
+    begin_transaction_step();

     /* notify handle methods we're processing a remote transaction */
     in_streamed_transaction = true;
@@ -861,6 +863,8 @@ apply_handle_stream_start(StringInfo s)
         subxact_info_read(MyLogicalRepWorker->subid, stream_xid);

     pgstat_report_activity(STATE_RUNNING, NULL);
+
+    end_transaction_step();
 }

 /*
@@ -937,7 +941,7 @@ apply_handle_stream_abort(StringInfo s)
         StreamXidHash *ent;

         subidx = -1;
-        ensure_transaction();
+        begin_transaction_step();
         subxact_info_read(MyLogicalRepWorker->subid, xid);

         for (i = subxact_data.nsubxacts; i > 0; i--)
@@ -958,7 +962,7 @@ apply_handle_stream_abort(StringInfo s)
         {
             /* Cleanup the subxact info */
             cleanup_subxact_info();
-
+            end_transaction_step();
             CommitTransactionCommand();
             return;
         }
@@ -986,6 +990,7 @@ apply_handle_stream_abort(StringInfo s)
         /* write the updated subxact list */
         subxact_info_write(MyLogicalRepWorker->subid, xid);

+        end_transaction_step();
         CommitTransactionCommand();
     }
 }
@@ -1013,7 +1018,8 @@ apply_handle_stream_commit(StringInfo s)

     elog(DEBUG1, "received commit for streamed transaction %u", xid);

-    ensure_transaction();
+    /* Make sure we have an open transaction */
+    begin_transaction_step();

     /*
      * Allocate file handle and memory required to process all the messages in
@@ -1046,6 +1052,8 @@ apply_handle_stream_commit(StringInfo s)
     in_remote_transaction = true;
     pgstat_report_activity(STATE_RUNNING, NULL);

+    end_transaction_step();
+
     /*
      * Read the entries one by one and pass them through the same logic as in
      * apply_dispatch.
@@ -1227,7 +1235,7 @@ apply_handle_insert(StringInfo s)
     if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
         return;

-    ensure_transaction();
+    begin_transaction_step();

     relid = logicalrep_read_insert(s, &newtup);
     rel = logicalrep_rel_open(relid, RowExclusiveLock);
@@ -1238,6 +1246,7 @@ apply_handle_insert(StringInfo s)
          * transaction so it's safe to unlock it.
          */
         logicalrep_rel_close(rel, RowExclusiveLock);
+        end_transaction_step();
         return;
     }

@@ -1266,7 +1275,7 @@ apply_handle_insert(StringInfo s)

     logicalrep_rel_close(rel, NoLock);

-    CommandCounterIncrement();
+    end_transaction_step();
 }

 /*
@@ -1346,7 +1355,7 @@ apply_handle_update(StringInfo s)
     if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
         return;

-    ensure_transaction();
+    begin_transaction_step();

     relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
                                    &newtup);
@@ -1358,6 +1367,7 @@ apply_handle_update(StringInfo s)
          * transaction so it's safe to unlock it.
          */
         logicalrep_rel_close(rel, RowExclusiveLock);
+        end_transaction_step();
         return;
     }

@@ -1416,7 +1426,7 @@ apply_handle_update(StringInfo s)

     logicalrep_rel_close(rel, NoLock);

-    CommandCounterIncrement();
+    end_transaction_step();
 }

 /*
@@ -1501,7 +1511,7 @@ apply_handle_delete(StringInfo s)
     if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
         return;

-    ensure_transaction();
+    begin_transaction_step();

     relid = logicalrep_read_delete(s, &oldtup);
     rel = logicalrep_rel_open(relid, RowExclusiveLock);
@@ -1512,6 +1522,7 @@ apply_handle_delete(StringInfo s)
          * transaction so it's safe to unlock it.
          */
         logicalrep_rel_close(rel, RowExclusiveLock);
+        end_transaction_step();
         return;
     }

@@ -1542,7 +1553,7 @@ apply_handle_delete(StringInfo s)

     logicalrep_rel_close(rel, NoLock);

-    CommandCounterIncrement();
+    end_transaction_step();
 }

 /*
@@ -1867,7 +1878,7 @@ apply_handle_truncate(StringInfo s)
     if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
         return;

-    ensure_transaction();
+    begin_transaction_step();

     remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);

@@ -1958,7 +1969,7 @@ apply_handle_truncate(StringInfo s)
         table_close(rel, NoLock);
     }

-    CommandCounterIncrement();
+    end_transaction_step();
 }



Re: logical replication of truncate command with trigger causes Assert

From: Mark Dilger

> On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Here's a draft patch for that.  I decided the most sensible way to
> organize this is to pair the existing ensure_transaction() subroutine
> with a cleanup subroutine.  Rather unimaginatively, perhaps, I renamed
> it to begin_transaction_step and named the cleanup end_transaction_step.
> (Better ideas welcome.)

Thanks!  The regression test I posted earlier passes with this patch applied.

> Somewhat unrelated, but ... am I reading the code correctly that
> apply_handle_stream_start and related routines are using Asserts
> to check that the remote sent stream-control messages in the correct
> order?  That seems many degrees short of acceptable.

Even if you weren't reading that correctly, this bit:

    xid = pq_getmsgint(s, 4);

    Assert(TransactionIdIsValid(xid));

simply asserts that the sending server didn't send an invalid subtransaction id.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Somewhat unrelated, but ... am I reading the code correctly that
>> apply_handle_stream_start and related routines are using Asserts
>> to check that the remote sent stream-control messages in the correct
>> order?  That seems many degrees short of acceptable.

> Even if you weren't reading that correctly, this bit:

>     xid = pq_getmsgint(s, 4);

>     Assert(TransactionIdIsValid(xid));

> simply asserts that the sending server didn't send an invalid subtransaction id.

Ugh, yeah.  We should never be using Asserts to validate incoming
messages -- a test-and-elog is more appropriate.
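
For instance, the bit Mark quoted could become something like this (a
sketch only; the exact message wording is a placeholder):

    xid = pq_getmsgint(s, 4);

    /* Reject a bogus XID from the remote instead of Assert'ing on it */
    if (!TransactionIdIsValid(xid))
        elog(ERROR, "invalid transaction ID in streamed replication transaction");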

            regards, tom lane



Re: logical replication of truncate command with trigger causes Assert

From: Amit Kapila
On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Here's a draft patch for that.  I decided the most sensible way to
> > organize this is to pair the existing ensure_transaction() subroutine
> > with a cleanup subroutine.  Rather unimaginatively, perhaps, I renamed
> > it to begin_transaction_step and named the cleanup end_transaction_step.
> > (Better ideas welcome.)
>
> Thanks!  The regression test I posted earlier passes with this patch applied.
>

I have also read the patch and it looks good to me.

> > Somewhat unrelated, but ... am I reading the code correctly that
> > apply_handle_stream_start and related routines are using Asserts
> > to check that the remote sent stream-control messages in the correct
> > order?
> >

Yes, I think you are talking about Assert(!in_streamed_transaction).
There is no particular reason that such Asserts are required, so we
can change them to test-and-elog as you suggested later in your email.

> >  That seems many degrees short of acceptable.
>
> Even if you weren't reading that correctly, this bit:
>
>     xid = pq_getmsgint(s, 4);
>
>     Assert(TransactionIdIsValid(xid));
>
> simply asserts that the sending server didn't send an invalid subtransaction id.
>

This also needs to be changed to test-and-elog.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>> On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Somewhat unrelated, but ... am I reading the code correctly that
>>> apply_handle_stream_start and related routines are using Asserts
>>> to check that the remote sent stream-control messages in the correct
>>> order?

> This also needs to be changed to test-and-elog.

Here's a proposed patch for this.  It looks like pretty much all of the
bogosity is new with the streaming business.  You might quibble with
which things I thought deserved elog versus ereport.  Another thing
I'm wondering is how many of these messages really need to be
translated.  We could use errmsg_internal and avoid burdening the
translators, perhaps.

            regards, tom lane

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 98c26002e8..26738d3589 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -177,7 +177,7 @@ bool        in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;

 /* fields valid only when processing streamed transaction */
-bool        in_streamed_transaction = false;
+static bool in_streamed_transaction = false;

 static TransactionId stream_xid = InvalidTransactionId;

@@ -345,7 +345,8 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
      */
     xid = pq_getmsgint(s, 4);

-    Assert(TransactionIdIsValid(xid));
+    if (!TransactionIdIsValid(xid))
+        elog(ERROR, "invalid transaction ID in streamed replication transaction");

     /* Add the new subxact to the array (unless already there). */
     subxact_info_add(xid);
@@ -785,7 +786,10 @@ apply_handle_commit(StringInfo s)

     logicalrep_read_commit(s, &commit_data);

-    Assert(commit_data.commit_lsn == remote_final_lsn);
+    if (commit_data.commit_lsn != remote_final_lsn)
+        elog(ERROR, "incorrect commit LSN %X/%X in commit message (expected %X/%X)",
+             LSN_FORMAT_ARGS(commit_data.commit_lsn),
+             LSN_FORMAT_ARGS(remote_final_lsn));

     apply_handle_commit_internal(s, &commit_data);

@@ -824,7 +828,10 @@ apply_handle_stream_start(StringInfo s)
     bool        first_segment;
     HASHCTL        hash_ctl;

-    Assert(!in_streamed_transaction);
+    if (in_streamed_transaction)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg("duplicate STREAM START message")));

     /*
      * Start a transaction on stream start, this transaction will be committed
@@ -873,7 +880,10 @@ apply_handle_stream_start(StringInfo s)
 static void
 apply_handle_stream_stop(StringInfo s)
 {
-    Assert(in_streamed_transaction);
+    if (!in_streamed_transaction)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg("STREAM STOP message without STREAM START")));

     /*
      * Close the file with serialized changes, and serialize information about
@@ -905,7 +915,10 @@ apply_handle_stream_abort(StringInfo s)
     TransactionId xid;
     TransactionId subxid;

-    Assert(!in_streamed_transaction);
+    if (in_streamed_transaction)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg("STREAM ABORT message without STREAM STOP")));

     logicalrep_read_stream_abort(s, &xid, &subxid);

@@ -967,13 +980,12 @@ apply_handle_stream_abort(StringInfo s)
             return;
         }

-        Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
-
         ent = (StreamXidHash *) hash_search(xidhash,
                                             (void *) &xid,
                                             HASH_FIND,
                                             &found);
-        Assert(found);
+        if (!found)
+            elog(ERROR, "transaction %u not found in hash table", xid);

         /* open the changes file */
         changes_filename(path, MyLogicalRepWorker->subid, xid);
@@ -1012,7 +1024,10 @@ apply_handle_stream_commit(StringInfo s)
     MemoryContext oldcxt;
     BufFile    *fd;

-    Assert(!in_streamed_transaction);
+    if (in_streamed_transaction)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg("STREAM COMMIT message without STREAM STOP")));

     xid = logicalrep_read_stream_commit(s, &commit_data);

@@ -1031,11 +1046,14 @@ apply_handle_stream_commit(StringInfo s)
     /* open the spool file for the committed transaction */
     changes_filename(path, MyLogicalRepWorker->subid, xid);
     elog(DEBUG1, "replaying changes from file \"%s\"", path);
+
     ent = (StreamXidHash *) hash_search(xidhash,
                                         (void *) &xid,
                                         HASH_FIND,
                                         &found);
-    Assert(found);
+    if (!found)
+        elog(ERROR, "transaction %u not found in hash table", xid);
+
     fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);

     buffer = palloc(BLCKSZ);
@@ -1080,7 +1098,9 @@ apply_handle_stream_commit(StringInfo s)
                      errmsg("could not read from streaming transaction's changes file \"%s\": %m",
                             path)));

-        Assert(len > 0);
+        if (len <= 0)
+            elog(ERROR, "incorrect length %d in streaming transaction's changes file \"%s\"",
+                 len, path);

         /* make sure we have sufficiently large buffer */
         buffer = repalloc(buffer, len);
@@ -1108,7 +1128,7 @@ apply_handle_stream_commit(StringInfo s)
         nchanges++;

         if (nchanges % 1000 == 0)
-            elog(DEBUG1, "replayed %d changes from file '%s'",
+            elog(DEBUG1, "replayed %d changes from file \"%s\"",
                  nchanges, path);
     }


Re: logical replication of truncate command with trigger causes Assert

From: Amit Kapila
On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> >> On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Somewhat unrelated, but ... am I reading the code correctly that
> >>> apply_handle_stream_start and related routines are using Asserts
> >>> to check that the remote sent stream-control messages in the correct
> >>> order?
>
> > This also needs to be changed to test-and-elog.
>
> Here's a proposed patch for this.  It looks like pretty much all of the
> bogosity is new with the streaming business.
>

Except for the change in apply_handle_commit, which seems to date back to
when it was introduced in commit 7c4f5240.

>  You might quibble with
> which things I thought deserved elog versus ereport.

I wonder why you used elog in handle_streamed_transaction and
apply_handle_commit?  It seems all the other places use ereport for
anything wrong received in a protocol message.

>  Another thing
> I'm wondering is how many of these messages really need to be
> translated.  We could use errmsg_internal and avoid burdening the
> translators, perhaps.
>

Not sure, but I see that none of the existing similar ereport calls use errmsg_internal.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another thing
>> I'm wondering is how many of these messages really need to be
>> translated.  We could use errmsg_internal and avoid burdening the
>> translators, perhaps.

> Not sure, but I see that none of the existing similar ereport calls use errmsg_internal.

I was thinking maybe we could mark all these replication protocol
violation errors non-translatable.  While we don't want to crash on a
protocol violation, it shouldn't really be a user-facing case either.
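
For illustration, one of these marked non-translatable might look like
this (a sketch only, reusing a message from the draft patch upthread; the
committed wording could differ):

    if (in_streamed_transaction)
        ereport(ERROR,
                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                 /* errmsg_internal keeps the string out of the translation catalogs */
                 errmsg_internal("duplicate STREAM START message")));
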
Thoughts?

            regards, tom lane



Re: logical replication of truncate command with trigger causes Assert

From: Amit Kapila
On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Another thing
> >> I'm wondering is how many of these messages really need to be
> >> translated.  We could use errmsg_internal and avoid burdening the
> >> translators, perhaps.
>
> > Not sure but I see all existing similar ereport calls don't use errmsg_internal.
>
> I was thinking maybe we could mark all these replication protocol
> violation errors non-translatable.  While we don't want to crash on a
> protocol violation, it shouldn't really be a user-facing case either.
>

I don't see any problem with that as these are not directly related to
any user operation. So, +1 for making these non-translatable.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I was thinking maybe we could mark all these replication protocol
>> violation errors non-translatable.  While we don't want to crash on a
>> protocol violation, it shouldn't really be a user-facing case either.

> I don't see any problem with that as these are not directly related to
> any user operation. So, +1 for making these non-translatable.

Done that way.  On re-reading the code, there were a bunch more
Asserts that could be triggered by bad input data, so the committed
patch has rather more corrections than I posted before.

            regards, tom lane



Re: logical replication of truncate command with trigger causes Assert

From: Alvaro Herrera
On 2021-Jun-12, Tom Lane wrote:

> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I was thinking maybe we could mark all these replication protocol
> >> violation errors non-translatable.  While we don't want to crash on a
> >> protocol violation, it shouldn't really be a user-facing case either.
> 
> > I don't see any problem with that as these are not directly related to
> > any user operation. So, +1 for making these non-translatable.
> 
> Done that way.

Good call, thanks.  Not only is it not very useful to translate such
messages, but it's also quite a burden, because some of them are
difficult to translate.

-- 
Álvaro Herrera       Valdivia, Chile
"Tiene valor aquel que admite que es un cobarde" (Fernandel)