Re: ERROR: subtransaction logged without previous top-level txn record - Mailing list pgsql-bugs

From Arseny Sher
Subject Re: ERROR: subtransaction logged without previous top-level txn record
Date
Msg-id 8736bjoiax.fsf@ars-thinkpad
Whole thread Raw
In response to Re: ERROR: subtransaction logged without previous top-level txn record  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: ERROR: subtransaction logged without previous top-level txn record  (Amit Kapila <amit.kapila16@gmail.com>)
Re: ERROR: subtransaction logged without previous top-level txn record  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
Amit Kapila <amit.kapila16@gmail.com> writes:

>> 1) Decoding from existing slot (*not* initial snapshot construction)
>> starts up, immediately picks up snapshot at restart_lsn (getting into
>> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
>> hadn't seen in full (no toplevel records) transaction which it is not
>> even going to stream -- but still dies with "subtransation logged
>> without...". That's my example above, and that's what people are
>> complaining about. Here, usage of serialized snapshot and jump to
>> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
>> able to stream data since confirmed_flush_lsn, we must pick it up as we
>> might not be able to assemble it from scratch in time. I mean, what is
>> wrong here is not serialized snapshot usage but the check.
>>
>
> I was thinking if we have some way to skip processing of
> xl_xact_assignment for such cases, then it might be better.  Say,
> along with restart_lsn, if have some way to find corresponding nextXid
> (below which we don't need to process records).

I don't believe you can that without persisting additional
data. Basically, what we need is list of transactions who are running at
the point of snapshot serialization *and* already wrote something before
it -- those we hadn't seen in full and can't decode. We have no such
data currently. The closest thing we have is xl_running_xacts->nextXid,
but

  1) issued xid doesn't necessarily means xact actually wrote something,
     so we can't just skip xl_xact_assignment for xid < nextXid, it might
     still be decoded
  2) snapshot might be serialized not at xl_running_xacts anyway

Surely this thing doesn't deserve changing persisted data format.


Somehow I hadn't realized this earlier, so my comments/commit messages
in patches above were not accurate here; I've edited them. Also in the
first patch serialized snapshots are not no longer used for new slot
creation at all, as Andres suggested above. This is not principal, as I
said, but arguably makes things simpler a bit.

I've also found a couple of issues with slot copying feature, will post
in separate thread on them.


From 8952bc955585d12a06cdd6d339bcbabafc772768 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Don't use serialized snapshots during logical slot
 creation.

snapbuild.c enters SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot, but also reorderbuffer filled with all currently running
xacts. This is fine for decoding sessions with existing slots though as they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
 - New slot creation reserves point in WAL since which it would read it
   (GetXLogInsertRecPtr);
 - It logs xl_running_xacts to start assembling a snapshot;
 - Running decoding session with another slot quickly reads this
   xl_running_xacts and serializes its snapshot;
 - New slot reads xl_running_xacts and picks this snapshot up, saying that it
   is ready to stream henceforth, though its reorderbuffer is empty.

Fix by forbidding to use serialized snapshots during slot creation. Actually we
could use snapshot if it was picked at xl_running_xacts where we could perform
usual FULL -> CONSISTENT transition conservatively via nextXid, but that seems
like a complication for a too small benefit.
---
 src/backend/replication/logical/logical.c   |  2 +-
 src/backend/replication/logical/snapbuild.c | 85 +++++++++++++++++++++++------
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5294..67ed52e56f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
     ReplicationSlotMarkDirty();
     ReplicationSlotSave();
 
-    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
                                  need_full_snapshot, false,
                                  read_page, prepare_write, do_write,
                                  update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f954..6f503a93aa 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,7 +165,7 @@ struct SnapBuild
 
     /*
      * Don't replay commits from an LSN < this LSN. This can be set externally
-     * but it will also be advanced (never retreat) from within snapbuild.c.
+     * or established by snapbuild.c once consistent snapshot is assembled.
      */
     XLogRecPtr    start_decoding_at;
 
@@ -309,7 +309,16 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
  * Allocate a new snapshot builder.
  *
  * xmin_horizon is the xid >= which we can be sure no catalog rows have been
- * removed, start_lsn is the LSN >= we want to replay commits.
+ * removed.
+ * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits.
+ *  InvalidXLogRecPtr is used during slot creation; snapbuild will assemble
+ *  consistent snapshot and (if continue decoding -- no core code does that
+ *  currently) stream all further commits.
+ *  If normal LSN is passed, caller *must* be sure that WAL reading starts
+ *  early enough to get the snapshot and pick up the first change of
+ *  earliest xact to stream. Slot creation and advancement machinery
+ *  guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always
+ *  satisfies this.
  */
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
@@ -408,7 +417,8 @@ SnapBuildCurrentState(SnapBuild *builder)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
-    return ptr < builder->start_decoding_at;
+    return XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+        ptr < builder->start_decoding_at;
 }
 
 /*
@@ -945,16 +955,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
          TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
         return;
     }
 
     if (builder->state < SNAPBUILD_CONSISTENT)
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
 
         /*
          * If building an exportable snapshot, force xid to be tracked, even
@@ -966,6 +976,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
         }
     }
 
+    if (!XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+        builder->start_decoding_at <= lsn)
+    {
+        /*
+         * We are going to stream this xact, so must already have fine
+         * snapshot.
+         */
+        Assert(builder->state == SNAPBUILD_CONSISTENT);
+    }
+
     for (nxact = 0; nxact < nsubxacts; nxact++)
     {
         TransactionId subxid = subxacts[nxact];
@@ -1250,10 +1270,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
      */
     if (running->oldestRunningXid == running->nextXid)
     {
-        if (builder->start_decoding_at == InvalidXLogRecPtr ||
-            builder->start_decoding_at <= lsn)
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
             /* can decode everything after this */
             builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         /* As no transactions were running xmin/xmax can be trivially set. */
         builder->xmin = running->nextXid;    /* < are finished */
@@ -1277,7 +1297,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     else if (!builder->building_full_snapshot &&
              SnapBuildRestore(builder, lsn))
     {
-        /* there won't be any state to cleanup */
+        /* there won't be much state to cleanup */
         return false;
     }
 
@@ -1358,6 +1378,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     {
         builder->state = SNAPBUILD_CONSISTENT;
         SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+            /* can decode everything after this */
+            builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         ereport(LOG,
                 (errmsg("logical decoding found consistent point at %X/%X",
@@ -1499,8 +1523,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
            builder->last_serialized_snapshot <= lsn);
 
     /*
-     * no point in serializing if we cannot continue to work immediately after
-     * restoring the snapshot
+     * No point in serializing if the snapshot is not complete (< FULL).
+     * FULL snapshot per se is just as good as CONSISTENT; difference between
+     * these states is not a snapshot property, but whether we have filled
+     * reorderbuffer with all currently running xacts. However, there isn't
+     * much sense in serializing it as serialized snaps are used only for
+     * decoding from existing slots, which had already reached CONSISTENT.
      */
     if (builder->state < SNAPBUILD_CONSISTENT)
         return;
@@ -1688,7 +1716,8 @@ out:
 
 /*
  * Restore a snapshot into 'builder' if previously one has been stored at the
- * location indicated by 'lsn'. Returns true if successful, false otherwise.
+ * location indicated by 'lsn'.
+ * Returns true if successful, false otherwise.
  */
 static bool
 SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
@@ -1704,6 +1733,16 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     if (builder->state == SNAPBUILD_CONSISTENT)
         return false;
 
+    /*
+     * Don't use serialized snapshot if we are not sure where all
+     * currently running xacts will finish (new slot creation).
+     * (Actually, if we came here through xl_running_xacts, we could perform
+     * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
+     * but added lines of code would hardly worth the benefit.)
+     */
+    if (builder->start_decoding_at == InvalidXLogRecPtr)
+        return false;
+
     sprintf(path, "pg_logical/snapshots/%X-%X.snap",
             (uint32) (lsn >> 32), (uint32) lsn);
 
@@ -1888,7 +1927,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     /* ok, we think the snapshot is sensible, copy over everything important */
     builder->xmin = ondisk.builder.xmin;
     builder->xmax = ondisk.builder.xmax;
-    builder->state = ondisk.builder.state;
 
     builder->committed.xcnt = ondisk.builder.committed.xcnt;
     /* We only allocated/stored xcnt, not xcnt_space xids ! */
@@ -1901,6 +1939,23 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     }
     ondisk.builder.committed.xip = NULL;
 
+    /*
+     * We probably don't have in reorderbuffer parts of currently running
+     * xacts right now. However, snapbuild can't determine *earliest* spot
+     * where decoding is safe (we'd see all xacts in full) on its own, because
+     * this requires to know which xids already wrote something, and such
+     * writes are beyond our reading of WAL. xl_running_xacts->nextXid
+     * doesn't help us here, because 1) issued xid doesn't necessarily means
+     * xact wrote something, so there would be a risk of arriving at
+     * CONSISTENT a bit later than possible which is ok for new slot creation,
+     * but unacceptable otherwise; 2) snapshot might be serialized at some
+     * other point.
+     *
+     * Slot must have provided as with safe start_decoding_at though, so just
+     * skip FULL -> CONSISTENT transition and set CONSISTENT directly.
+     */
+    builder->state = SNAPBUILD_CONSISTENT;
+
     /* our snapshot is not interesting anymore, build a new one */
     if (builder->snapshot != NULL)
     {
@@ -1911,8 +1966,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
     ReorderBufferSetRestartPoint(builder->reorder, lsn);
 
-    Assert(builder->state == SNAPBUILD_CONSISTENT);
-
     ereport(LOG,
             (errmsg("logical decoding found consistent point at %X/%X",
                     (uint32) (lsn >> 32), (uint32) lsn),
-- 
2.11.0

From 1449c8aacd2feadfa104a1ea4b73030e37cbba14 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know whether it wrote something before deserialization point --
if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have
such info, so there is no easy way to relax the check.
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-    if (new_top && !new_sub)
-        elog(ERROR, "subtransaction logged without previous top-level txn record");
-
     if (!new_sub)
     {
         if (subtxn->is_known_as_subxact)
-- 
2.11.0



--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-bugs by date:

Previous
From: Sergei Kornilov
Date:
Subject: Re: BUG #16253: Documentation bug https://www.postgresql.org/docs/12/auth-methods.html
Next
From: Tom Lane
Date:
Subject: Re: Another FK violation when referencing a multi-level partitioned table